Machine Learning/Coursera's Competitive Data Science Course
From Federal Burro of Information
Exam notes
Suppose that you have a credit scoring task, where you have to create an ML model that approximates expert evaluation of an individual's creditworthiness. Which of the following could potentially be data leakage? Select all that apply.
- 1. The first half of the data points in the train set have a score of 0, while the second half have scores > 0. (This should be selected.)
- Is a leak.
- 2. Among the features you have company_id, an identifier of the company where this person works. It turns out that this feature is very important, and adding it to the model significantly improves your score. (Leaving this unselected is correct.)
- Not a source of leakage.
- 3. The ID of a data point (row) in the train set correlates with the target variable.
- Explanation: The data was not shuffled; this information cannot be used in a real-world scenario.
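A quick way to check for the row-order leak described in options 1 and 3 is to measure how strongly the row ID correlates with the target before and after shuffling. The following is only a minimal sketch on made-up data: the `pearson` helper, the toy `scores` list, and the 1000-row size are all assumptions for illustration, not part of the course material.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: an unshuffled train set whose first half has a
# score of 0 and whose second half has scores > 0 (as in option 1).
rng = random.Random(0)
n_rows = 1000
half = n_rows // 2
scores = [0.0] * half + [rng.uniform(0.1, 1.0) for _ in range(half)]
row_ids = list(range(n_rows))

# In the original order, the row ID alone carries information about the target.
corr_unshuffled = pearson(row_ids, scores)

# Shuffling the rows (and reassigning IDs) removes the order information.
shuffled_scores = scores[:]
rng.shuffle(shuffled_scores)
corr_shuffled = pearson(row_ids, shuffled_scores)

print(f"row id vs. score, original order:  {corr_unshuffled:.3f}")
print(f"row id vs. score, after shuffling: {corr_shuffled:.3f}")
```

A strong correlation in the original order that vanishes after shuffling suggests the signal comes from how the data was assembled rather than from anything available at prediction time.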