Machine Learning Best Practices

Updated: Apr 2



(Please see the attached document for ML and AI trends in Banking and Credit)


Background


The math and the core tools of Machine Learning have not changed significantly over time.

  • René Descartes lived around 400 years ago and laid the groundwork, through analytic geometry and linear equations, for matrix algebra and common regression techniques. Linear regression is a bedrock modeling approach used in bank demand and credit risk algorithms.

  • Thomas Bayes lived over 250 years ago, and conditional probability based on Bayes' theorem is at the core of “next product” algorithms.

  • Neural networks are the “babies”: in their modern, backpropagation-trained form they are only about 30 years old.
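
To make the Bayes point concrete, here is a minimal sketch of the conditional-probability arithmetic behind a “next product” estimate. The product names and customer counts are made up for illustration; a real model would estimate them from account data and use many more products.

```python
# Hypothetical product-holding counts; the numbers are illustrative only.
total_customers = 1000
has_checking = 600   # customers holding a checking account
has_card = 200       # customers holding a credit card
has_both = 150       # customers holding both products

# Probabilities estimated from the counts.
p_card = has_card / total_customers            # P(card)
p_checking = has_checking / total_customers    # P(checking)
p_checking_given_card = has_both / has_card    # P(checking | card)

# Bayes' rule: P(card | checking) = P(checking | card) * P(card) / P(checking)
p_card_given_checking = p_checking_given_card * p_card / p_checking

print(f"P(card | checking) = {p_card_given_checking:.2f}")
```

With these assumed counts, a checking customer has a one-in-four chance of also holding a card, which is the kind of score a “next product” ranking would compare across products.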

What has changed is:

⁃ Ability to handle larger data sets

⁃ Speed of processing

⁃ Automation of quantitative processes that used to be handled manually by analysts

⁃ Use of Python to speed up and simplify the implementation of new models

AI / ML maturity model:

Manual learning => Supervised Learning => Unsupervised Learning

  • Manual – People (data scientists) perform the learning tasks themselves and manually teach (augment) the algorithms to be implemented. Data scientists primarily focus on functional shape and data relationships. They also apply various algorithm techniques.

  • Supervised – Algorithms apply what has been learned from labeled past data to predict the outcome of new data. Common in credit risk modeling, where newer supervised algorithms can give more accurate results than traditional methodologies such as logistic or linear regression.

  • Unsupervised – Algorithms draw inferences from, or look for patterns in, unlabeled datasets. These are more advanced forms of AI, and their use in credit risk is limited.
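
The difference between the supervised and unsupervised stages can be sketched in a few lines of plain Python. This is a toy illustration, not a production approach: the single feature, its values, and the labels are all invented, and the two small algorithms (nearest centroid and two-means clustering) were chosen only because they are short.

```python
from statistics import mean

# Hypothetical 1-D feature (e.g. a debt-to-income ratio); values are made up.
ratios = [0.10, 0.15, 0.20, 0.60, 0.70, 0.80]
labels = [0, 0, 0, 1, 1, 1]  # 1 = defaulted; only the supervised model sees these


def fit_centroids(xs, ys):
    """Supervised: learn one centroid per known label."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def predict(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


def two_means(xs, iterations=10):
    """Unsupervised: split xs into two clusters without using any labels.

    Assumes xs contains at least two distinct values.
    """
    lo, hi = min(xs), max(xs)  # initial cluster centres
    for _ in range(iterations):
        low_group = [x for x in xs if abs(x - lo) <= abs(x - hi)]
        high_group = [x for x in xs if abs(x - lo) > abs(x - hi)]
        lo, hi = mean(low_group), mean(high_group)
    return lo, hi


centroids = fit_centroids(ratios, labels)   # uses the labels
new_label = predict(centroids, 0.65)        # predicted class for a new customer
cluster_centres = two_means(ratios)         # never looks at the labels
```

The supervised model can name the outcome it predicts because it was given labels; the clustering step only reports that two groups exist and leaves their interpretation to the analyst.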

Best Practices


Best Practices in the Data Preparation Stage 

  • Completely understand the project goal (note: This includes a deep dive/documentation of business and technical assumptions)

  • Collect all fields that are relevant

  • Maintain consistency of field values 

  • Deal with missing data 
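
As a rough illustration of the last two points, the sketch below normalizes inconsistent field values and fills a missing value with the median. The field names, the alias mapping, and the choice of median imputation are assumptions made for the example; the right treatment depends on the project goal documented in the first step.

```python
from statistics import median

# Hypothetical raw records; field names and values are illustrative.
records = [
    {"state": "NY", "income": 52000},
    {"state": "ny ", "income": None},
    {"state": "New York", "income": 61000},
    {"state": "CA", "income": 48000},
]

# Assumed mapping from known spelling variants to one canonical value.
STATE_ALIASES = {"new york": "NY", "ny": "NY", "ca": "CA"}


def clean(rows):
    """Return copies of rows with consistent state codes and imputed income."""
    known = [r["income"] for r in rows if r["income"] is not None]
    fill = median(known)  # the median is less sensitive to outliers than the mean
    cleaned = []
    for r in rows:
        key = r["state"].strip().lower()
        cleaned.append({
            "state": STATE_ALIASES.get(key, r["state"].strip().upper()),
            "income": r["income"] if r["income"] is not None else fill,
            "income_was_missing": r["income"] is None,  # keep a flag for later analysis
        })
    return cleaned


cleaned_records = clean(records)
```

Keeping the `income_was_missing` flag means the imputation stays visible to whoever builds features later.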

Best Practices in the Training Set Generation Stage 

  • Identify categorical features that are stored as numerical values 

  • Decide on whether or not to encode categorical features 

  • Decide on whether or not to reduce dimensionality and, if so, how. Doing so is likely to improve performance, because prediction models learn from data with fewer redundant or correlated features (note: Very important. PCA and other variable transformation approaches can help)

  • Decide on whether or not to scale features 

  • Perform feature engineering with domain expertise 

  • Perform feature engineering without domain expertise 

  • Document how each feature is generated (note: This is critical! Relates to learning as an asset.)
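
Several of the steps above (spotting a numeric field that is really a category, encoding it, scaling a numeric feature, and documenting the result) can be sketched together. The column names and values are hypothetical, and one-hot encoding and standardization are only one reasonable choice among several; in practice a library such as scikit-learn would normally handle this.

```python
from statistics import mean, pstdev

# Hypothetical training rows; "branch_code" is numeric but really a category.
rows = [
    {"branch_code": 101, "balance": 1000.0},
    {"branch_code": 205, "balance": 3000.0},
    {"branch_code": 101, "balance": 5000.0},
]


def one_hot(values):
    """Encode each value as a 0/1 vector over the sorted distinct categories."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


def standardize(values):
    """Rescale to zero mean and unit population standard deviation.

    Assumes the values are not all identical (otherwise sigma is zero).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


categories, branch_encoded = one_hot([r["branch_code"] for r in rows])
balance_scaled = standardize([r["balance"] for r in rows])

# Record how each feature was produced, alongside the features themselves.
feature_docs = {
    "branch_<code>": "one-hot encoding of branch_code",
    "balance_scaled": "balance standardized with training-set mean and std",
}
```

Storing `feature_docs` next to the training set is one lightweight way to keep the feature definitions available to the next analyst.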

Best Practices in the Model Training, Evaluation, and Selection Stage 

  • Choose the right algorithm(s) to start with:

      • Naive Bayes

      • Logistic regression

      • SVM 

      • Random forest (or decision tree) 

      • Neural networks

  • Reduce overfitting

  • Diagnose overfitting and underfitting
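
One common way to diagnose overfitting and underfitting is to compare a model's score on the training data with its score on held-out validation data. The helper below is a minimal sketch of that comparison; the function name and both thresholds are assumptions for illustration and would need tuning for a real project.

```python
def diagnose(train_score, validation_score, good_enough=0.80, max_gap=0.10):
    """Classify a fit from training and validation accuracy.

    The thresholds are assumed values, not recommendations.
    """
    if train_score < good_enough:
        return "underfitting"   # the model cannot even fit the data it has seen
    if train_score - validation_score > max_gap:
        return "overfitting"    # fits the training data but generalizes poorly
    return "reasonable fit"


print(diagnose(0.99, 0.70))
print(diagnose(0.62, 0.60))
print(diagnose(0.88, 0.85))
```

A large gap between the two scores points toward remedies such as more data, simpler models, or regularization, while a low training score suggests the model or its features are too weak.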

Best Practices in the Deployment and Monitoring Stage 

  • Save, load, and reuse models 

  • Update models regularly 
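
Both deployment practices can be sketched with the standard library alone. The example assumes a simple linear model whose parameters fit in a JSON file; the coefficient names, the training date, and the 90-day retraining policy are all hypothetical. Libraries such as scikit-learn models are more commonly persisted with `pickle` or `joblib`.

```python
import json
import os
import tempfile
from datetime import date

# Hypothetical fitted linear model: coefficients and training date are made up.
model = {
    "intercept": 0.5,
    "coefficients": {"balance_scaled": 1.2, "utilization": -0.8},
    "trained_on": "2019-01-01",
}


def save_model(m, path):
    """Write the model parameters and metadata to a JSON file."""
    with open(path, "w") as f:
        json.dump(m, f)


def load_model(path):
    """Read a model previously written by save_model."""
    with open(path) as f:
        return json.load(f)


def predict(m, features):
    """Linear score: intercept plus the weighted sum of the named features."""
    return m["intercept"] + sum(
        weight * features[name] for name, weight in m["coefficients"].items()
    )


def is_stale(m, today, max_age_days=90):
    """Flag a model for retraining once it is older than max_age_days (assumed policy)."""
    return (today - date.fromisoformat(m["trained_on"])).days > max_age_days


path = os.path.join(tempfile.mkdtemp(), "model.json")
save_model(model, path)
reloaded = load_model(path)
score = predict(reloaded, {"balance_scaled": 1.0, "utilization": 0.5})
```

Storing the training date with the model makes the “update regularly” rule something a monitoring job can check automatically.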


Source: Steve Blair’s book, Python Machine Learning for Beginners 

Attachment: ML 3-18-19 jh.pptx (PPTX, 91KB)
