Machine Learning/Coursera's Competitive Data Science Course
From Federal Burro of Information
== Exam notes ==
=== I ===
Suppose that you have a credit scoring task, where you have to create an ML model that approximates expert evaluation of an individual's creditworthiness. Which of the following can potentially be a data leakage? Select all that apply.
; 1. First half of the data points in the train set has a score of 0, while the second half has scores > 0. - '''This should be selected'''
: Is a leak
; 2. Among the features you have a company_id, an identifier of a company where this person works. It turns out that this feature is very important and adding it to the model significantly improves your score. - '''Un-selected is correct'''
: Not a source of leak
; 3. An ID of a data point (row) in the train set correlates with target variable. - '''This should be selected'''
: Is a leak
: Explanation: The data was not shuffled; this information cannot be used in a real-world scenario.
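The third case above can be checked directly: if the row ID correlates with the target, the data was not shuffled and row order is leaking label information. A minimal sketch, using a made-up unshuffled train set:

```python
import numpy as np
import pandas as pd

# Hypothetical unshuffled train set: the first half of rows has
# target 0, the second half has target 1 (as in cases 1 and 3 above).
train_df = pd.DataFrame({
    'row_id': np.arange(100),
    'feature': np.random.RandomState(0).randn(100),
    'y': [0] * 50 + [1] * 50,
})

# Quick leak check: correlate the row ID with the target. A strong
# correlation means row order carries target information that no
# real-world model would have access to.
corr = train_df['row_id'].corr(train_df['y'])
print(round(corr, 3))  # ≈ 0.866
```

A correlation this far from zero for a meaningless index column is a red flag worth investigating before modelling.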
=== II ===
What is the most foolproof way to set up a time series competition?
; Split train, public and private parts of data by time. Remove all features except IDs (e.g. timestamp) from test set so that participants will generate all the features based on past and join them themselves.
: Correct answer
; Make a time based split for train/test and a random split for public/private.
: This should not be selected - wrong
: Explanation: Vulnerable to leaderboard probing.
; Split train, public and private parts of data by time. Remove time variable from test set, keep the features.
: Also wrong - presumably because the remaining features can still encode time information, so removing only the time variable does not close the leak.
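The time-based split from the correct answer can be sketched with pandas; the column names here are hypothetical:

```python
import pandas as pd

# Hypothetical daily log; column names are made up for illustration.
df = pd.DataFrame({
    'timestamp': pd.date_range('2018-01-01', periods=10, freq='D'),
    'value': range(10),
})

# Time-based split: everything strictly before the cutoff is train,
# everything at or after it is test. No random shuffling anywhere.
cutoff = pd.Timestamp('2018-01-08')
train_df = df[df['timestamp'] < cutoff]
test_df = df[df['timestamp'] >= cutoff]

print(len(train_df), len(test_df))  # 7 3
```

The public/private leaderboard split would be drawn the same way, by a later cutoff inside the test period, rather than at random.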
== Exploratory Data Analysis (EDA) ==
Notebook: https://hub.coursera-notebooks.org/user/hnlrfyqblxjsscsosbsvlx/notebooks/readonly/reading_materials/EDA_video2.ipynb
Also see [[Big_Data#Exploratory_Data_Analysis_.28EDA.29]]
=== Handling / analysing anonymized data ===
For example, hashed-out data.
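Even when a column has been hashed, value frequencies and equality between rows survive, so the column can still be explored and label-encoded for tree models. A small sketch with made-up hash strings:

```python
import pandas as pd

# Hypothetical anonymized column: categories replaced by opaque hash
# strings, so only equality between values is meaningful.
hashed = pd.Series(['a3f9', '07bc', 'a3f9', '91de', '07bc', 'a3f9'])

# Value frequencies still reveal structure (e.g. a dominant category),
# and factorize() maps the hashes to integer labels for tree models.
codes, uniques = pd.factorize(hashed)
print(list(codes))  # [0, 1, 0, 2, 1, 0]
```

The hashes themselves stay meaningless, but their distribution and co-occurrence with the target are often enough for a usable feature.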
=== Build a quick baseline ===
Ref: https://hub.coursera-notebooks.org/user/hnlrfyqblxjsscsosbsvlx/notebooks/readonly/reading_materials/EDA_video3_screencast.ipynb
<pre>
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Create a copy to work with
X = train.copy()

# Save and drop labels
y = train.y
X = X.drop('y', axis=1)

# Fill NaNs
X = X.fillna(-999)

# Label-encode the categorical (object) columns
for c in train.columns[train.dtypes == 'object']:
    X[c] = X[c].factorize()[0]

rf = RandomForestClassifier()
rf.fit(X, y)

plt.plot(rf.feature_importances_)
plt.xticks(np.arange(X.shape[1]), X.columns.tolist(), rotation=90);
</pre>
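The bare importance plot above is hard to read because the values are unsorted and unlabeled. A sketch of pairing them with column names and sorting, on synthetic stand-in data (the real notebook uses the <code>train</code> DataFrame instead):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the train DataFrame: the target depends
# only on f1, so f1 should dominate the importances.
rng = np.random.RandomState(0)
X = pd.DataFrame({'f1': rng.randn(200), 'f2': rng.randn(200)})
y = (X['f1'] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X, y)

# Pair importances with column names and sort; this makes the
# plot of feature_importances_ much easier to interpret.
imp = pd.Series(rf.feature_importances_, index=X.columns)
imp = imp.sort_values(ascending=False)
print(imp.index[0])  # f1
```

The sorted Series can be fed straight into <code>imp.plot.bar()</code> for a labeled version of the plot.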
Latest revision as of 17:40, 14 May 2018