Big Data

From Federal Burro of Information

Overview

What is data science? What does the day-to-day work look like?

1. Be given a problem.
2. Examine the data, decide on what to collect. Also see the Exploratory Data Analysis (EDA) section.
3. Clean the data ( much of the time is spent here ).
 The Importance of Cleaning the Text
 https://www.kaggle.com/currie32/the-importance-of-cleaning-text/notebook
 Faster
 https://www.kaggle.com/tour1st/the-importance-of-cleaning-text-faster
4. Analyze the data ( also see https://en.wikipedia.org/wiki/Data_analysis ).
5. Represent the data / visualization.
6. Start again.



  1. node management
  2. key value stores
  3. storage management
  4. job management

Key aspects:

  • Integration
  • Analysis
  • Visualization
  • Work Load Optimization
  • Security
  • Governance


Key Values Stores

list:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores

http://www.project-voldemort.com/voldemort/

https://en.wikipedia.org/wiki/Redis


Storage

Oracle Cluster File System (OCFS)

Old?

GFS

Hadoop

  • key-value lookups with HBase (NoSQL)
  • SQL queries with Hive

Examples

Log data

Hadoop Analysis of Apache Logs Using Flume-NG, Hive and Pig
http://cuddletech.com/blog/?p=795

http://www.elasticsearch.org/ - also Elasticsearch

JPL GOES sea surface temperature data

"Geostationary Operational Environmental Satellites (GOES) 6km Near Real-Time Sea Surface Temperature (SST) Documentation"

ftp://podaac-ftp.jpl.nasa.gov/allData/goes/L3/goes_6km_nrt/docs/goes_sst_doc.html

http://podaac-w10n.jpl.nasa.gov/w10n/allData/goes/L3/goes_6km_nrt/americas/2016/

what is the format of this data?

Learning Progress and Recognition

https://courses.cognitiveclass.ai/certificates/493c0df647484b2082c76328e46feaa5

https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+ML0101EN+2016_T3/courseware/407a9f86565c44189740699636b4fb85/d82ba5edac4f40efa334fff96b944b34/

deep learning https://campus.datacamp.com/courses/deep-learning-in-python/basics-of-deep-learning-and-neural-networks?ex=1

Exploratory Data Analysis (EDA)

what you do when you first get a data set.

what is it: looking into the data, understanding it, getting comfortable with it.

will help you generate features and build accurate models

make hypotheses and have insights

exploring data results in intuition about that data.

careful EDA results in insight into the data

visualization is key

visualization -> idea , idea -> visualization

find "magic" features, like promos used versus promos sent: the "diff" was 80% accurate.

the point behind EDA is that you _DO NOT_ start by making a model. You start by trying to understand the data, to get an intuitive feel for it, and possibly to generate some insight.

some steps:

domain knowledge.
google your topic, learn a bit about it.
populate a data dictionary, for example
check if the data is intuitive
for example, is the age accurate?
are errors random, or are they due to a systematic bug? the data is possibly still useful if not intuitive
how was the data generated?
were the train and test sets created with an algorithm?
create a synthetic feature: "is correct"? possibly a signal
range of time in the train set much greater ( 10X ) than in the test set.
decode feature types, and guess what each is used for, especially in the case of anonymized data: DOB? event counts? timestamp?
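
Two of the checks above (is the age plausible, and a synthetic "is correct" feature) can be sketched with pandas. This is only a sketch: the column names (age, birth_year, record_year), the values, and the 0..120 range are all invented for illustration.

```python
import pandas as pd

# Hypothetical data: column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [34, 27, 336, 45, -1],
    "birth_year": [1982, 1989, 1980, 1971, 1990],
    "record_year": [2016, 2016, 2016, 2016, 2016],
})

# Check 1: is the age plausible at all? (0..120 is an assumed range)
df["age_plausible"] = df["age"].between(0, 120)

# Check 2: does the age agree with the other columns, within one year?
implied_age = df["record_year"] - df["birth_year"]
df["is_correct"] = (df["age"] - implied_age).abs() <= 1

print(df[["age", "age_plausible", "is_correct"]])
```

The is_correct column can be kept as a feature in its own right; whether it carries any signal depends on the data set.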

techniques:

counting repeated values:

train_set.featureX.mean()
train_set.featureX.std()
train_set.featureX.value_counts().head(15)
featurex_unique = train_set.featureX.unique()
featurex_unique_sorted = np.sort(featurex_unique)
example: https://hub.coursera-notebooks.org/user/hnlrfyqblxjsscsosbsvlx/notebooks/readonly/reading_materials/EDA_video2.ipynb
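
A small runnable version of the calls above, on an invented train_set. The values are made up, and featureX stands in for any numeric column.

```python
import numpy as np
import pandas as pd

# Hypothetical training set; featureX stands in for any numeric column.
train_set = pd.DataFrame({"featureX": [0.0, 1.5, 1.5, 1.5, 3.0, 7.25, 7.25]})

print(train_set.featureX.mean())
print(train_set.featureX.std())

# Most frequent values first. A value that repeats far more often than the
# rest can be a default or a filled-in missing value, so it is worth a look.
counts = train_set.featureX.value_counts()
print(counts.head(15))

# Sorted unique values; regular gaps between them can hint at rounding or scaling.
featurex_unique = train_set.featureX.unique()
featurex_unique_sorted = np.sort(featurex_unique)
print(np.diff(featurex_unique_sorted))
```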

Useful functions in Python

df is a pandas DataFrame
df.dtypes - the type pandas guessed for each column
 three types come out:
 object ( fraught with danger: a column can come out as 'object' when it is mostly numbers but has some "odd" values, like text )
 float
 int ( can be binary 1|0, event counts, or a category encoded with a label encoder )
df.info()
x.value_counts()
x.isnull()
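
A sketch of the 'object' danger described above, on an invented frame: one text value in an otherwise numeric column is enough to make the whole column 'object'. The column names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame: "amount" is mostly numbers but has one odd text value.
df = pd.DataFrame({
    "amount": [10, 12, "n/a", 7],
    "clicks": [3, 0, 5, 1],
    "price": [1.5, 2.0, None, 4.25],
})

print(df.dtypes)
df.info()

# errors="coerce" turns values that cannot be parsed as numbers into NaN,
# which makes the odd values easy to locate.
amount_numeric = pd.to_numeric(df["amount"], errors="coerce")
odd_values = df.loc[amount_numeric.isnull(), "amount"]
print(odd_values.tolist())

# Missing values per column
print(df.isnull().sum())
```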

Visualization in EDA

plt.hist(x)
plt.plot(x, '.')
x.var()
x.value_counts()
plt.scatter(f1,f2)
how to add colour?
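
One way to add colour is to pass an array to the c argument of scatter, so each point is coloured by its value there (a class label, for example). A minimal sketch with invented data; f1, f2 and the target y are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
f1 = rng.normal(size=200)
f2 = f1 + rng.normal(scale=0.5, size=200)
y = (f1 + f2 > 0).astype(int)  # hypothetical binary target

fig, ax = plt.subplots()
# c=y colours each point by its label; cmap picks the colour scale
points = ax.scatter(f1, f2, c=y, cmap="coolwarm", s=10)
ax.set_xlabel("f1")
ax.set_ylabel("f2")
fig.colorbar(points, ax=ax, label="y")
# fig.savefig("scatter_f1_f2.png") would write the plot to disk
```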

when the feature set is small, pd.plotting.scatter_matrix(df) plots every pair of features at once

df.corr() and plt.matshow(...)
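
A sketch of the df.corr() plus plt.matshow(...) combination on invented data, where column b is built from column a so that the pair stands out in the matrix. The column names and data are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=300),  # strongly tied to a
    "c": rng.normal(size=300),                      # independent noise
})

corr = df.corr()  # pairwise Pearson correlation between columns

fig, ax = plt.subplots()
im = ax.matshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
```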

consider plotting the mean values for all features on one graph. It's possible that the graph is pretty random, but if we sort the features by mean we might see something interesting. like this:

df.mean().sort_values().plot(style='.')

Cleaning

if a feature is constant in the training set, remove it.

Also drop duplicated features, where two columns are completely identical

traintest.T.drop_duplicates()
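
Both cleaning steps, sketched on an invented frame. The column names, the values, and the train/test split point n_train are assumptions. duplicated() on the transpose is used here instead of drop_duplicates() so that the surviving columns keep their original dtypes.

```python
import pandas as pd

# Hypothetical combined train+test frame; the first n_train rows are the train set.
traintest = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [7, 7, 7, 7],           # constant
    "f3": [1, 2, 3, 4],           # exact copy of f1
    "f4": [0.5, 0.1, 0.9, 0.3],
})
n_train = 3

# Columns with a single distinct value in the training rows
constant_cols = [
    c for c in traintest.columns
    if traintest[c].iloc[:n_train].nunique(dropna=False) == 1
]
traintest = traintest.drop(columns=constant_cols)

# Transpose so each column becomes a row, then flag rows that repeat an earlier one
is_duplicate = traintest.T.duplicated().to_numpy()
deduped = traintest.loc[:, ~is_duplicate]

print(constant_cols)
print(list(deduped.columns))
```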

There may be columns of categorical data that turn out to be duplicates once we relabel the levels ( and they should be removed )

   f4   f5
1  A    C
2  B    A
3  A    C
4  C    B

If we translate in f4 A->C, B->A, and C->B, then f4 will be a duplicate of f5

how do we find those cols:

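1st, label encode each column top to bottom, numbering the values in order of first appearance.

A->1, B->2 etc ... ( factorize() itself numbers from 0, which makes no difference here )
for f in categorical_features:
 traintest[f] = traintest[f].factorize()[0]
traintest.T.drop_duplicates()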
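
The same idea run on the f4/f5 table from above. This is a sketch: the encoding is done on a copy (named encoded here, an assumption) so the original labels are kept.

```python
import pandas as pd

# The f4 / f5 example table
traintest = pd.DataFrame({
    "f4": ["A", "B", "A", "C"],
    "f5": ["C", "A", "C", "B"],
})
categorical_features = ["f4", "f5"]

encoded = traintest.copy()
for f in categorical_features:
    # factorize() returns (codes, uniques); codes follow order of first
    # appearance, so columns that differ only by relabelling get equal codes.
    encoded[f] = encoded[f].factorize()[0]

print(encoded)

# Transpose so columns become rows, then flag rows that repeat an earlier one
is_duplicate = encoded.T.duplicated().to_numpy()
duplicate_cols = list(encoded.columns[is_duplicate])
print(duplicate_cols)
```

Both columns encode to the same codes, so f5 is flagged as a duplicate of f4 and can be dropped.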

Open data - Sources

http://konect.uni-koblenz.de/

https://www.kaggle.com/datasets

Reference

  • "Big data dudes"

Also See

Text Classification with TensorFlow Estimators
https://opendatascience.com/text-classification-with-tensorflow-estimators/