Big Data
Overview
What is data science? What does the day-to-day look like?
1. Be given a problem.
2. Examine the data and decide what to collect. Also see /Exploratory Data Analysis (EDA).
3. Clean the data (much of the time is spent here).
4. Analyze the data (also see https://en.wikipedia.org/wiki/Data_analysis).
5. Represent the data / visualization.
6. Start again.
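Since the cleaning step is where much of the time goes, here is a minimal sketch of what it can look like, using only the Python standard library. The sample data, the column names (`id`, `city`, `temp_c`) and the list of "missing" markers are all made up for illustration; a real data set will need its own rules.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, mixed-case labels,
# several spellings of "missing", and a non-numeric value in a numeric column.
RAW = """id,city,temp_c
1, Boston ,21.5
2,boston,N/A
3,CHICAGO,
4,Chicago,abc
5,Denver,18
"""

# Assumed set of markers that mean "no value" (compared in lower case).
MISSING = {"", "na", "n/a", "null", "none"}


def clean_text(value):
    """Trim whitespace and normalize case; return None for missing markers."""
    value = value.strip()
    if value.lower() in MISSING:
        return None
    return value.title()


def clean_float(value):
    """Parse a float; return None for missing or unparseable values."""
    value = value.strip()
    if value.lower() in MISSING:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def clean_rows(text):
    """Yield one cleaned dict per input row."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "id": int(row["id"]),
            "city": clean_text(row["city"]),
            "temp_c": clean_float(row["temp_c"]),
        }


rows = list(clean_rows(RAW))
usable = [r for r in rows if r["temp_c"] is not None]
print(f"{len(usable)} of {len(rows)} rows have a usable temperature")
```

Turning bad values into `None` instead of dropping the row keeps the decision about what to do with gaps for the analysis step.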
- node management
- key-value stores
- storage management
- job management
Key aspects:
- Integration
- Analysis
- Visualization
- Workload Optimization
- Security
- Governance
Key-Value Stores
list:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores
http://www.project-voldemort.com/voldemort/
https://en.wikipedia.org/wiki/Redis
Storage
Oracle Cluster File System (OCFS)
Old?
- https://oss.oracle.com/projects/ocfs2/
- https://oss.oracle.com/projects/ocfs/dist/documentation/RHAS_best_practices.html
GFS
Hadoop
- key-value access with HBase (NoSQL)
- SQL with Hive
Examples
Log data
Hadoop Analysis of Apache Logs Using Flume-NG, Hive and Pig
http://cuddletech.com/blog/?p=795
http://www.elasticsearch.org/ - also Elasticsearch
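For a feel of what a log analysis job computes, here is a small local sketch in plain Python. It is not the Flume-NG / Hive / Pig pipeline from the linked article, just the same kind of aggregation (hits per status code and per host) done in memory. It assumes the logs are in Apache Common Log Format; the sample lines are invented.

```python
import re
from collections import Counter

# Apache Common Log Format: host ident user [time] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# Made-up sample lines for illustration only.
SAMPLE_LOG = """\
192.0.2.1 - - [10/Oct/2016:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
192.0.2.2 - - [10/Oct/2016:13:55:40 -0700] "GET /missing HTTP/1.1" 404 209
192.0.2.1 - - [10/Oct/2016:13:56:01 -0700] "POST /login HTTP/1.1" 302 -
this line is not a valid log entry
"""


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Apache writes "-" when no bytes were sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Skip lines that do not parse instead of failing the whole run.
entries = [e for e in map(parse_line, SAMPLE_LOG.splitlines()) if e is not None]

status_counts = Counter(e["status"] for e in entries)
hits_per_host = Counter(e["host"] for e in entries)

print("status codes:", dict(status_counts))
print("hits per host:", dict(hits_per_host))
```

On a real cluster the same group-and-count would be expressed as a Hive query or Pig script over files collected by Flume; the regex step corresponds to defining how a raw line maps to columns.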
JPL GOES sea surface temperature data
"Geostationary Operational Environmental Satellites (GOES) 6km Near Real-Time Sea Surface Temperature (SST) Documentation"
ftp://podaac-ftp.jpl.nasa.gov/allData/goes/L3/goes_6km_nrt/docs/goes_sst_doc.html
http://podaac-w10n.jpl.nasa.gov/w10n/allData/goes/L3/goes_6km_nrt/americas/2016/
What is the format of this data?
Learning Progress and Recognition
https://courses.cognitiveclass.ai/certificates/493c0df647484b2082c76328e46feaa5
Deep learning: https://campus.datacamp.com/courses/deep-learning-in-python/basics-of-deep-learning-and-neural-networks?ex=1
Exploratory Data Analysis (EDA)
What you do when you first get a data set.
What it is: looking into the data, understanding it, getting comfortable with it.
It will help you generate features and build accurate models.
Make hypotheses and gain insights.
Exploring data results in intuition about that data.
Careful EDA results in insight into the data.
Visualization is key.
visualization -> idea, idea -> visualization
Find "magic" features, like promos used versus sent: the "diff" was 80% accurate.
The point behind EDA is that you _DO NOT_ start by making a model. You start by trying to understand the data, get an intuitive feel for it, and possibly generate some insight.
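A first pass over a new data set might look like the sketch below: size, missing values, basic distributions, then one candidate feature checked against the target, all before any model. Everything in the sample is hypothetical: the column names, the `churned` target, and the numbers are invented to mirror the "promos used versus sent" idea, and the output says nothing about the 80% figure noted above.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical data set: promos sent to and used by each customer, plus a target.
DATA = """customer,promos_sent,promos_used,churned
a,10,9,0
b,8,1,1
c,5,5,0
d,12,2,1
e,6,,0
"""

rows = list(csv.DictReader(io.StringIO(DATA)))

# 1. How big is it, and what is missing?
print("rows:", len(rows), "columns:", list(rows[0]))
missing = Counter(col for r in rows for col, v in r.items() if v == "")
print("missing values per column:", dict(missing))

# 2. How is each numeric column distributed?
for col in ("promos_sent", "promos_used"):
    values = [int(r[col]) for r in rows if r[col] != ""]
    print(col, "min", min(values),
          "median", statistics.median(values), "max", max(values))


# 3. Try a derived feature (sent minus used) and compare it across the target.
def promo_diff(row):
    """Return promos_sent - promos_used, or None if either value is missing."""
    if row["promos_sent"] == "" or row["promos_used"] == "":
        return None
    return int(row["promos_sent"]) - int(row["promos_used"])


by_target = {}
for r in rows:
    diff = promo_diff(r)
    if diff is not None:
        by_target.setdefault(r["churned"], []).append(diff)

for target, diffs in sorted(by_target.items()):
    print("churned =", target, "mean diff:", statistics.mean(diffs))
```

If the derived feature separates the target groups this clearly on real data, that is a hypothesis worth plotting and testing, not yet a conclusion.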
tools
Open data - Sources
https://www.kaggle.com/datasets
Reference
- "Big data dudes"
Also See
- Text Classification with TensorFlow Estimators
- https://opendatascience.com/text-classification-with-tensorflow-estimators/