Depth Data Science Interview
Here are 3 practical ML tips you don't read about in textbooks.
I learned these while building ML solutions at PayPal and Google.
Want to level up your ML skills? Read on 👇
(1) Variable Importance on Collinear Features
❗Don't blindly trust variable importance from a random forest. A feature's importance increases every time the model splits on that feature, so when two features are collinear, the splits (and the importance) are divided between them, making both look weaker than they really are.
⭐ The better approach is to remove collinearity first via variable selection, using Pearson/Spearman correlation, VIF, or Lasso regression. Then fit the random forest (or any other tree-based model) on the reduced feature set and interpret its variable importances.
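The dilution effect is easy to reproduce. Here is a minimal sketch on synthetic data (all names and values are illustrative): duplicating the informative feature roughly halves its reported importance.

```python
# Sketch: how collinearity dilutes random-forest feature importance.
# All data here is synthetic; feature names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)        # the true driver of y
noise_feat = rng.normal(size=n)    # irrelevant feature
y = 3 * signal + rng.normal(scale=0.5, size=n)

# Case A: one copy of the signal
X_a = np.column_stack([signal, noise_feat])
imp_a = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_a, y).feature_importances_

# Case B: add a nearly collinear copy of the signal
signal_dup = signal + rng.normal(scale=0.01, size=n)
X_b = np.column_stack([signal, signal_dup, noise_feat])
imp_b = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_b, y).feature_importances_

print("single copy:", imp_a)   # signal takes almost all the importance
print("two copies: ", imp_b)   # importance is split between the duplicates
```

In case B the two duplicates share the importance that a single copy would have received, so each one individually looks only about half as important.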
(2) Random Forest (RF) on a Continuous Target Variable
❗If you are using RF or other tree-based models (e.g. XGBoost), be aware that predictions are effectively clipped to the y range seen in training: a tree predicts by averaging the training targets in its leaves, so it can never output a value outside that range.
For instance, suppose that the train_y range is (100, 1000), but the test_y range is (300, 1500). The model will never predict a value beyond 1,000!
⭐ If you expect the target to extend beyond the training range, consider a model that can extrapolate, such as OLS, Lasso, or a neural network.
(3) Use Simpson's Paradox to Improve Your Model
❗If your model is underperforming the benchmark, don't just add more signals or expand the hyperparameter search. Do EDA on the model's residuals. For instance, your model's global accuracy might be 0.85, but when you segment it by cohort (e.g. gender, age, product category), the model may perform much better or worse in particular cohorts.
⭐ For the segments where the model underperforms, conduct EDA to see whether there are additional signals you can add to improve it.
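A minimal sketch of this segment-level check, with a synthetic dataset and made-up cohort labels: the global accuracy looks healthy while one cohort is barely better than a coin flip.

```python
# Sketch of segment-level error analysis: a healthy global accuracy
# can hide cohorts where the model underperforms. Data is synthetic
# and the cohort labels are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
cohort = rng.choice(["A", "B"], size=n, p=[0.9, 0.1])
# Assumed behavior: the model is right 90% of the time on cohort A,
# but only 50% of the time on cohort B.
correct = np.where(cohort == "A",
                   rng.random(n) < 0.90,
                   rng.random(n) < 0.50)

df = pd.DataFrame({"cohort": cohort, "correct": correct})
print("global accuracy:", df["correct"].mean())  # dominated by cohort A
print(df.groupby("cohort")["correct"].mean())    # reveals cohort B near 0.5
```

In practice you would group real residuals or per-row correctness by candidate cohort columns; any cohort that stands out is where to hunt for new signals.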
This is the depth of ML knowledge you should aim for in practice and in data science interviews.