Big Data
Latest revision as of 01:01, 15 December 2021
Overview
What is data science? What does the day-to-day look like?
1. Be given a problem.
2. Examine the data and decide what to collect. Also see #Exploratory Data Analysis (EDA).
3. Clean the data (much of the time is spent here).
   The Importance of Cleaning the Text: https://www.kaggle.com/currie32/the-importance-of-cleaning-text/notebook
   Faster: https://www.kaggle.com/tour1st/the-importance-of-cleaning-text-faster
4. Analyze the data (also see https://en.wikipedia.org/wiki/Data_analysis).
5. Represent the data / visualization.
6. Start again.
Google "data science eda cycle" and look at the image results; this is a great start to understanding the cycle.
- node management
- key value stores
- storage management
- job management
Key aspects:
- Integration
- Analysis
- Visualization
- Work Load Optimization
- Security
- Governance
Key-Value Stores
list:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores
http://www.project-voldemort.com/voldemort/
https://en.wikipedia.org/wiki/Redis
Storage
Oracle Cluster File System (OCFS)
Old?
- https://oss.oracle.com/projects/ocfs2/
- https://oss.oracle.com/projects/ocfs/dist/documentation/RHAS_best_practices.html
GFS
Hadoop
- key-value access with HBase (NoSQL)
- SQL with Hive
Examples
Log data
Hadoop Analysis of Apache Logs Using Flume-NG, Hive and Pig
http://cuddletech.com/blog/?p=795
http://www.elasticsearch.org/ - also Elastic Search
JPL GOES sea surface temperature data
"Geostationary Operational Environmental Satellites (GOES) 6km Near Real-Time Sea Surface Temperature (SST) Documentation"
ftp://podaac-ftp.jpl.nasa.gov/allData/goes/L3/goes_6km_nrt/docs/goes_sst_doc.html
http://podaac-w10n.jpl.nasa.gov/w10n/allData/goes/L3/goes_6km_nrt/americas/2016/
what is the format of this data?
Learning Progress and Recognition
https://courses.cognitiveclass.ai/certificates/493c0df647484b2082c76328e46feaa5
deep learning https://campus.datacamp.com/courses/deep-learning-in-python/basics-of-deep-learning-and-neural-networks?ex=1
todo:
- Applying the Lambda Architecture with Spark, Kafka, and Cassandra
- https://www.pluralsight.com/courses/spark-kafka-cassandra-applying-lambda-architecture
- Structured Streaming in Apache Spark 2
- https://www.pluralsight.com/courses/apache-spark-2-structured-streaming
Exploratory Data Analysis (EDA)
What you do when you first get a data set.
What it is: looking into the data, understanding it, and getting comfortable with it.
It will help you generate features and build accurate models.
Make hypotheses and gain insights.
Exploring data builds intuition about that data.
Careful EDA results in insight into the data.
Visualization is key.
visualization -> idea, idea -> visualization
Find magic features, like promos used versus promos sent: the "diff" feature was 80% accurate.
The point behind EDA is that you _DO NOT_ start by making a model. You start by trying to understand the data, get an intuitive feel for it, and possibly generate some insight.
Some steps:
- Domain knowledge.
  - Google your topic and learn a bit about it.
  - Populate a data dictionary, for example.
- Check if the data is intuitive.
  - For example, is the age accurate?
  - Are errors random, or are they due to a systematic bug? Data that is not intuitive is possibly still useful.
- How was the data generated?
  - Were the train and test sets created with an algorithm?
  - Create a synthetic feature: "is correct"? Possibly a signal.
  - Check whether the time range of the train set is much greater (10x) than that of the test set.
  - Decode feature types and guess what each one is used for, especially in the case of anonymized data: DOB? event counts? timestamp?
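The "is the age accurate?" check and the "is correct" synthetic feature can be sketched together. This is a minimal illustration, not the course's own code: the `train` frame, the `age` column, the `age_is_correct` name, and the 0 to 120 range are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data: 336 is almost certainly a typo, -1 looks like a placeholder.
train = pd.DataFrame({"age": [25, 41, 336, 19, -1, 58]})

# Assumed plausible range for a human age; adjust to the domain.
AGE_MIN, AGE_MAX = 0, 120

# Synthetic feature: True where the recorded age passes the sanity check.
# The flag itself may carry signal, so keep it instead of silently fixing the value.
train["age_is_correct"] = train["age"].between(AGE_MIN, AGE_MAX)

print(train)
print(train["age_is_correct"].value_counts())
```

Whether the flagged rows are random noise or a systematic bug still has to be judged by looking at them.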
Techniques:
Counting repeated values:
  train_set.featureX.mean()
  train_set.featureX.std()
  train_set.featureX.value_counts().head(15)
  featurex_unique = train_set.featureX.unique()
  featurex_unique_sorted = np.sort(featurex_unique)
Example: https://hub.coursera-notebooks.org/user/hnlrfyqblxjsscsosbsvlx/notebooks/readonly/reading_materials/EDA_video2.ipynb
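A small runnable version of the counting technique, assuming a toy `train_set` with a single `featureX` column (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical feature with a few repeated values.
train_set = pd.DataFrame({"featureX": [3.0, 1.0, 3.0, 2.0, 3.0, 1.0, 7.0]})

# Basic summary statistics (mean and std take no positional count argument).
print(train_set.featureX.mean())
print(train_set.featureX.std())

# Most frequent values first; head(15) limits the listing to the top 15.
counts = train_set.featureX.value_counts()
print(counts.head(15))

# Sorted unique values make regular spacing or odd gaps easier to spot.
featurex_unique = train_set.featureX.unique()
featurex_unique_sorted = np.sort(featurex_unique)
print(featurex_unique_sorted)
```

A value that repeats far more often than the rest is often a filled-in default or an encoded missing value, which is worth checking before modelling.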
Useful functions in Python
df is a pandas DataFrame.
df.dtypes - shows the type pandas guessed for each column. Three types come out:
- object (fraught with danger: a column can be detected as 'object' when it is mostly numbers but has some "odd" values, like text)
- float
- int (can be binary 1|0, event counts, or a category encoded with a label encoder)
df.info()
x.value_counts()
x.isnull()
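The danger with 'object' columns can be shown with a sketch. The frame and its column names are hypothetical; `pd.to_numeric(..., errors="coerce")` is one way (not necessarily the course's) to locate the odd values that stopped pandas from inferring a numeric type.

```python
import pandas as pd

# Hypothetical frame: "amount" is mostly numeric but contains one text value.
df = pd.DataFrame({
    "count": [1, 0, 3, 2],
    "price": [9.5, 3.0, 7.25, 1.0],
    "amount": ["10", "12", "unknown", "7"],
})

# "count" is an int type, "price" a float type; "amount" is typically
# reported as object (newer pandas versions may report a string dtype).
print(df.dtypes)
df.info()

# Coerce to numbers; anything that cannot be parsed becomes NaN.
amount_numeric = pd.to_numeric(df["amount"], errors="coerce")

# The rows that turned into NaN are the "odd" values worth inspecting.
bad_rows = df.loc[amount_numeric.isnull(), "amount"]
print(bad_rows)
```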
Visualization in EDA
  plt.hist(x)
  plt.plot(x, '.')
  x.var()
  x.value_counts()
  plt.scatter(f1, f2)
How to add colour? Pass the c argument, e.g. plt.scatter(f1, f2, c=y), where y holds the class labels.
When the feature set is small, pd.plotting.scatter_matrix(df) plots all pairs at once.
df.corr() and plt.matshow(...)
Consider plotting the mean values of all features on one graph. The graph may look pretty random, but if we sort the features by mean we might see something interesting, like this:
  df.mean().sort_values().plot(style='.')
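A sketch of two of these plots on synthetic data: a scatter coloured by a label, and the sorted feature means. The random features, the `target` label, and the colour map are assumptions for illustration only; the Agg backend is selected so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: four random features and a made-up binary label.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
target = (df["f1"] + df["f2"] > 0).astype(int)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Scatter of two features; c= colours each point by its label.
points = axes[0].scatter(df["f1"], df["f2"], c=target, cmap="coolwarm", s=10)
axes[0].set_xlabel("f1")
axes[0].set_ylabel("f2")

# Feature means sorted ascending, one dot per feature.
sorted_means = df.mean().sort_values()
sorted_means.plot(style=".", ax=axes[1])
axes[1].set_ylabel("mean")

fig.tight_layout()
```

In a notebook, finish with `plt.show()`; in a script, `fig.savefig(...)` writes the figure to a file of your choosing.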
Cleaning
trainer: Dmitry Ulyanov
If a feature is constant in the training set, remove it.
Also drop duplicates, where two features are completely identical:
  traintest.T.drop_duplicates()
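Both removals can be sketched on a toy frame. The column names and values are invented, and for brevity the constant check runs on the combined `traintest` frame rather than on the training part alone.

```python
import pandas as pd

# Hypothetical combined train+test frame.
traintest = pd.DataFrame({
    "f1": [1, 2, 3, 4],
    "f2": [7, 7, 7, 7],   # constant: carries no information
    "f3": [1, 2, 3, 4],   # exact duplicate of f1
    "f4": [0, 1, 0, 1],
})

# A column with a single distinct value (counting NaN as a value) is constant.
constant_cols = [c for c in traintest.columns
                 if traintest[c].nunique(dropna=False) == 1]
traintest = traintest.drop(columns=constant_cols)

# Transpose so features become rows, drop duplicate rows, transpose back.
traintest = traintest.T.drop_duplicates().T

print(constant_cols)
print(traintest.columns.tolist())
```

Transposing a very wide or mixed-type frame can be slow and may change dtypes, so on real data it can be worth restricting this to columns of one type at a time.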
There may be columns of categorical data that turn out to be duplicates once we rename the levels (and should be removed):
     f4  f5
  1  A   C
  2  B   A
  3  A   C
  4  C   B
If we translate f4 with A->C, B->A, and C->B, then f4 will be a duplicate of f5.
How do we find those columns?
First, label encode each column top to bottom, in order of first appearance:
A->1, B->2, etc. (factorize does the same but numbers the levels from 0)
  for f in categorical_features:
      traintest[f] = traintest[f].factorize()[0]
  traintest.T.drop_duplicates()
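Running the idea on the f4/f5 table above shows why it works: encoding by order of first appearance gives both columns the same codes. The `encoded` copy and the `unique_features` name are assumptions added for the sketch.

```python
import pandas as pd

# The f4/f5 example: same structure, different level names.
traintest = pd.DataFrame({
    "f4": ["A", "B", "A", "C"],
    "f5": ["C", "A", "C", "B"],
})
categorical_features = ["f4", "f5"]

# Work on a copy so the original labels stay available.
encoded = traintest.copy()
for f in categorical_features:
    # factorize() returns (codes, uniques); the codes follow first appearance.
    encoded[f] = encoded[f].factorize()[0]

print(encoded)

# After encoding, duplicated features are identical rows of the transpose.
unique_features = encoded.T.drop_duplicates().index.tolist()
print(unique_features)
```

Only the first of each duplicated group survives, so here f5 is the column to drop.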
Are there many duplicate rows with different targets? Those rows might need to be removed.
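One possible way to find such rows is to group by the feature columns and count distinct targets per group. The frame, the `feature_cols` list, and the decision to drop every conflicting row are assumptions for this sketch; other policies (keeping the majority label, for instance) are equally defensible.

```python
import pandas as pd

# Hypothetical training data: rows 0 and 1 share features but disagree on target.
train = pd.DataFrame({
    "f1": [1, 1, 2, 2, 3],
    "f2": ["a", "a", "b", "b", "c"],
    "target": [0, 1, 1, 1, 0],
})
feature_cols = ["f1", "f2"]

# For every row, the number of distinct targets among rows with the same features.
n_targets = train.groupby(feature_cols)["target"].transform("nunique")

# Rows whose feature combination maps to more than one target are contradictory.
conflicting = train[n_targets > 1]
cleaned = train[n_targets == 1]

print(conflicting)
print(cleaned)
```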
Is the data shuffled? Sort the rows by ID and plot the target: is it evenly distributed?
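A numeric version of the shuffle check, assuming an `id` column and a binary `target` (both hypothetical). The toy data is deliberately not shuffled, so the rolling mean drifts far from the overall mean; the window of 100 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical unshuffled data: all positive targets sit at the high IDs.
train = pd.DataFrame({"id": np.arange(1000)})
train["target"] = (train["id"] >= 700).astype(int)

# Sort by ID and smooth the target with a rolling mean.
ordered = train.sort_values("id")
rolling_mean = ordered["target"].rolling(window=100).mean()
overall = ordered["target"].mean()

# In shuffled data the rolling mean hovers around the overall mean;
# a range this wide indicates the row order carries information.
print(overall, rolling_mean.min(), rolling_mean.max())
```

Calling `rolling_mean.plot()` gives the picture the note describes: a flat noisy line for shuffled data, a visible trend or step otherwise.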
Y U Cluster?
- Command-line Tools can be 235x Faster than your Hadoop Cluster
- https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
- Spark, Turn Off the Light on the CLI
- https://mindfulmachines.io/blog/2016/1/1/6ywzqtxq79ajny7xkf8jkd13k4t7bu
- Streaming anomaly detection for Big Data/Internet of Things - Adam Drake - FOSSASIA Summit 2016
- https://www.youtube.com/watch?v=uzsZs30w82M
- Your command-line tools may be 235x faster but they don’t have the same features
- http://blog.michael-cetrulo.com/post/108605064468/your-command-line-tools-may-be-235x-faster-but#_=_
Open data - Sources
https://www.kaggle.com/datasets
Reference
- "Big data dudes"
Also See
- Text Classification with TensorFlow Estimators
- https://opendatascience.com/text-classification-with-tensorflow-estimators/
- New York R Meetup presentations
- https://nyhackr.org/presentations.html
- Adam Drake's Publications at ACM
- https://dl.acm.org/author_page.cfm?id=81309486264
- Apache Spark Tutorial: ML with PySpark ( 2017 )
- https://www.datacamp.com/community/tutorials/apache-spark-tutorial-machine-learning
- IPython Or Jupyter? ( 2017 )
- https://www.datacamp.com/community/blog/ipython-jupyter