Big Data
Overview
What is data science? What is the day to day?
1. Be given a problem.
2. Examine the data and decide what to collect. Also see /Exploratory Data Analysis (EDA).
3. Clean the data (much of the time is spent here).
4. Analyze the data (also see https://en.wikipedia.org/wiki/Data_analysis).
5. Represent the data / visualization.
6. Start again.
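Since cleaning is where much of the time goes, here is a minimal sketch of what that step can look like. The field names (`city`, `value`) and the sample rows are made up for illustration, and real cleaning rules depend entirely on the data set.

```python
def clean_rows(rows):
    """Strip whitespace, drop incomplete rows, and parse 'value' as a float."""
    cleaned = []
    for row in rows:
        # Strip stray whitespace from every field (None is treated as empty).
        row = {key: (val or "").strip() for key, val in row.items()}
        # Drop rows that have any missing field.
        if any(val == "" for val in row.values()):
            continue
        # Drop rows whose 'value' field is not a number.
        try:
            row["value"] = float(row["value"])
        except ValueError:
            continue
        cleaned.append(row)
    return cleaned


# Hypothetical raw input: one good row, one missing value,
# one non-numeric value, one row with extra whitespace.
raw = [
    {"city": " Oslo ", "value": "3.5"},
    {"city": "Lima", "value": ""},
    {"city": "Pune", "value": "n/a"},
    {"city": "Kyiv", "value": " 7 "},
]
cleaned = clean_rows(raw)
print(cleaned)
```

Dropping bad rows is only one possible policy; filling in defaults or flagging rows for review may be a better fit for some problems.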
- node management
- key value stores
- storage management
- job management
Key aspects:
- Integration
- Analysis
- Visualization
- Workload Optimization
- Security
- Governance
Key-Value Stores
list:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores
http://www.project-voldemort.com/voldemort/
https://en.wikipedia.org/wiki/Redis
Storage
Oracle Cluster File System (OCFS)
Old?
- https://oss.oracle.com/projects/ocfs2/
- https://oss.oracle.com/projects/ocfs/dist/documentation/RHAS_best_practices.html
GFS
Hadoop
- key-value access with HBase (NoSQL)
- SQL queries with Hive
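To make the difference between the two access styles concrete, here is a rough sketch using stand-ins only: a plain Python dict plays the role of a key-value store and the standard-library `sqlite3` module plays the role of a SQL engine. Neither is HBase or Hive, and the table and key names are invented for the example.

```python
import sqlite3

# Key-value style: fetch one record directly by its key.
# (A dict is only a stand-in for a distributed key-value store.)
kv_store = {
    "user:1": {"name": "ada", "visits": 3},
    "user:2": {"name": "bob", "visits": 5},
}
record = kv_store["user:2"]

# SQL style: describe the result you want and let the engine compute it.
# (An in-memory SQLite database is only a stand-in for a SQL-on-Hadoop layer.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, visits INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "ada", 3), (2, "bob", 5)],
)
total_visits = conn.execute("SELECT SUM(visits) FROM users").fetchone()[0]
conn.close()

print(record, total_visits)
```

The key-value lookup is a single read by key, while the SQL query aggregates over all rows.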
Examples
Log data
Hadoop Analysis of Apache Logs Using Flume-NG, Hive and Pig
http://cuddletech.com/blog/?p=795
http://www.elasticsearch.org/ - also see Elasticsearch
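Before reaching for a cluster, the core of log analysis can be tried on a handful of lines. This is a small sketch, assuming the logs are in the Apache Common Log Format; the sample lines are invented, and the regular expression is a simplification that ignores extra fields such as referrer and user agent.

```python
import re
from collections import Counter

# Simplified pattern for the Apache Common Log Format:
# host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def count_statuses(lines):
    """Count HTTP status codes, skipping lines that do not match the pattern."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            counts[match.group("status")] += 1
    return counts


# Invented sample lines, including one malformed line.
sample = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /missing HTTP/1.0" 404 209',
    '10.0.0.7 - - [10/Oct/2000:13:56:01 -0700] "GET /index.html HTTP/1.0" 200 2326',
    'not a log line',
]
status_counts = count_statuses(sample)
print(status_counts)
```

The same parse-then-count shape is roughly what a Hive or Pig job would express at scale, with the parsing done at load time and the counting done as a group-by.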
JPL GOES sea surface temperature data
"Geostationary Operational Environmental Satellites (GOES) 6km Near Real-Time Sea Surface Temperature (SST) Documentation"
ftp://podaac-ftp.jpl.nasa.gov/allData/goes/L3/goes_6km_nrt/docs/goes_sst_doc.html
http://podaac-w10n.jpl.nasa.gov/w10n/allData/goes/L3/goes_6km_nrt/americas/2016/
What is the format of this data?
Learning Progress and Recognition
https://courses.cognitiveclass.ai/certificates/493c0df647484b2082c76328e46feaa5
Deep learning: https://campus.datacamp.com/courses/deep-learning-in-python/basics-of-deep-learning-and-neural-networks?ex=1
Exploratory Data Analysis (EDA)
What you do when you first get a data set.
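A first look usually means asking how many rows there are, which columns have gaps, and what range the numeric columns cover. The sketch below does that with the standard library only; the `summarize` helper and the sample CSV (with `station` and `temp_c` columns) are hypothetical, and a library such as pandas would normally do this in fewer lines.

```python
import csv
import io
import statistics


def summarize(csv_text):
    """Return the row count and, per column, missing count and numeric range."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = {"row_count": len(rows), "columns": {}}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        # Count empty or absent cells.
        missing = sum(1 for v in values if v is None or v.strip() == "")
        # Collect the cells that parse as numbers.
        numeric = []
        for v in values:
            try:
                numeric.append(float(v))
            except (TypeError, ValueError):
                pass
        info = {"missing": missing}
        if numeric:
            info.update(
                min=min(numeric),
                max=max(numeric),
                mean=statistics.mean(numeric),
            )
        summary["columns"][name] = info
    return summary


# Invented sample data with one missing temperature.
sample_csv = "station,temp_c\nA,10.0\nB,\nC,14.0\n"
report = summarize(sample_csv)
print(report)
```

Columns with no numeric cells only report a missing count, which is itself a quick hint about which columns are categorical.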
tools
Open data - Sources
https://www.kaggle.com/datasets
Reference
- "Big data dudes"
Also See
- Text Classification with TensorFlow Estimators
- https://opendatascience.com/text-classification-with-tensorflow-estimators/