Depth Data Science Interview

  Here are 3 practical ML tips you don't read about in textbooks.


I learned these while building ML solutions at PayPal and Google.

Want to level up your ML skills? Read on 👇

(1) Variable Importance on Collinear Features

❗ Don't trust variable importance from a random forest blindly. A feature's importance accumulates each time the model splits on that feature. When two features are collinear, either one can serve for a given split, so the importance is shared between them and each looks less important than it really is.

⭐ The better approach is to remove collinearity first with variable selection, using Pearson/Spearman correlation, VIF, or Lasso regression. Then fit the random forest (or any other tree-based model) on the reduced feature set and interpret the variable importance from that final model.
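
To make the dilution concrete, here is a minimal sketch on synthetic data, assuming scikit-learn and NumPy are available. The feature names (x1, x2, x1_copy), the helper fit_importances, and the data-generating process are all made up for illustration. It fits one forest without and one with a near-duplicate of x1, then uses a plain Pearson correlation to flag the redundant pair.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 1000

# Two independent features with an equal true effect on the target.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3 * x1 + 3 * x2 + rng.normal(scale=0.1, size=n)

# A near-duplicate of x1 (hypothetical collinear feature).
x1_copy = x1 + rng.normal(scale=0.01, size=n)


def fit_importances(columns, names):
    """Fit a random forest and return {feature name: impurity importance}."""
    X = np.column_stack(columns)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    return dict(zip(names, model.feature_importances_))


imp_base = fit_importances([x1, x2], ["x1", "x2"])
imp_with_dup = fit_importances([x1, x2, x1_copy], ["x1", "x2", "x1_copy"])

# Pearson correlation flags the redundant pair before modelling.
corr_x1_copy = np.corrcoef(x1, x1_copy)[0, 1]

print("without duplicate:", imp_base)
print("with duplicate:   ", imp_with_dup)
print("corr(x1, x1_copy):", round(corr_x1_copy, 4))
```

In a run like this, x1 and x2 should look about equally important without the duplicate, while with it the share that belonged to x1 is divided between x1 and x1_copy. Dropping one feature of any highly correlated pair before fitting avoids that.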

(2) Random Forest (RF) on Continuous Target Variable

❗ If you are using RF or other tree-based models (e.g. XGBoost) for regression, be aware that they cannot extrapolate. Predictions are built from the target values the model saw in training, so they are effectively capped to the training y range.

For instance, suppose that the train_y range is (100, 1000), but the test_y range is (300, 1500). The model will never predict a value beyond 1,000!

⭐ If you suspect the y range is unbounded, consider a model that can extrapolate, such as a linear model (OLS, Lasso) or a dense neural network.
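
A small sketch of the effect, assuming scikit-learn and NumPy. The data is synthetic and chosen so the training target falls roughly in the (100, 1000) range from the example above, while the test inputs lie beyond anything seen in training. The variable names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Training data: x in [0, 10], so y lands roughly in the range 100..1000.
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 100 + 90 * X_train[:, 0] + rng.normal(scale=5, size=500)

# Test inputs beyond the training range, where the true y keeps growing.
X_test = np.linspace(10, 15, 50).reshape(-1, 1)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
ols = LinearRegression().fit(X_train, y_train)

rf_pred = rf.predict(X_test)
ols_pred = ols.predict(X_test)

print("max y seen in training:", round(y_train.max(), 1))
print("max RF prediction:     ", round(rf_pred.max(), 1))
print("max OLS prediction:    ", round(ols_pred.max(), 1))
```

The forest's predictions are averages of training targets, so they stay at or below the largest y seen in training, while the linear model continues the trend past it.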

(3) Use Simpson's Paradox to improve your model

❗ If your model is underperforming the benchmark, don't just add more signals or widen the hyperparameter search. Do EDA on the residuals of the model. An aggregate metric can hide very different behavior across segments. For instance, the global accuracy of your model might be 85%, but when you segment it by cohort (e.g. gender, age, product category), it might perform noticeably better on some cohorts and worse on others.

⭐ For the segments where the model underperforms, conduct EDA to see if there are additional signals you can add to improve it.
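
One way to run this check is a per-cohort breakdown of the evaluation set. The sketch below assumes pandas and uses a made-up evaluation frame: the column names (cohort, y_true, y_pred) and the numbers are hypothetical, picked so that the overall accuracy works out to 85%.

```python
import pandas as pd

# Hypothetical evaluation set: 20 rows per cohort, all true labels equal to 1.
# Cohort A has 19 correct predictions, cohort B has 15.
eval_df = pd.DataFrame({
    "cohort": ["A"] * 20 + ["B"] * 20,
    "y_true": [1] * 40,
    "y_pred": [1] * 19 + [0] * 1 + [1] * 15 + [0] * 5,
})

eval_df["correct"] = eval_df["y_true"] == eval_df["y_pred"]

# The single global number.
overall_accuracy = eval_df["correct"].mean()

# The same metric per cohort, worst segment first.
by_cohort = (
    eval_df.groupby("cohort")["correct"]
    .agg(accuracy="mean", n="count")
    .sort_values("accuracy")
)

print("overall accuracy:", overall_accuracy)
print(by_cohort)
```

The cohort at the top of by_cohort is where to start looking for missing signals. Keeping the row count n next to each accuracy helps avoid over-reading a small segment.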

This is the depth of ML knowledge that you should consider in practice and for data science interviews.
