I had an enlightening conversation with Andreas Botsikas, author of the Azure Data Scientist Associate Certification Guide, a few days ago. I summarise below the key ideas he shared with us on how to run Machine Learning projects on Azure.
Architecture
Good architecture is key, and it must be tailored to your scenario.
Programming Language
- Python
- All key players contribute to Python, so it has a much richer ecosystem (more libraries) than other languages.
- Most organisations use Python for Machine Learning.
- Debating the choice of programming language matters little.
- In the end, it is just code that you execute on a computer.
- It is easy to switch languages. Tools such as linters (flake8 for Python), style formatters (black for Python), and GitHub Copilot help.
Machine Learning
- We need many roles and skills to run a Machine Learning project, such as Software Engineer, Data Scientist, Data Engineer, BI Engineer, and Architect.
- Knowing statistics and programming helps you better understand how ML algorithms work and choose the best one for your use case.
- All the prominent vendors are democratising AI and delivering services that simplify it.
Challenges to Machine Learning
- The conditions under which the data is gathered, e.g., enough light, pictures in motion.
- Coding the best possible algorithm.
Key considerations to running a project
- Identify the characteristics of your business that you need to forecast or classify, and define the metrics. E.g., do you know which customers have churned? It’s not easy to define what a churned customer is unless you have a multi-subscription system like Netflix.
- Business stakeholders must explain to the Data Scientists what they need. Then the Data Scientists collaborate with the business to solve the actual problem and choose the appropriate ML algorithm or tool.
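To make the churn example concrete, here is a minimal sketch of one possible definition, written in Python. The inactivity-window rule, the 90-day threshold, the function name `label_churned`, and the sample customers are all assumptions for illustration; the real definition has to come from the business stakeholders.

```python
from datetime import date, timedelta


def label_churned(last_activity, as_of, inactivity_days=90):
    """Label each customer as churned (True) or active (False).

    Assumed rule (hypothetical): a customer has churned when their last
    recorded activity is older than `inactivity_days` before `as_of`.
    """
    cutoff = as_of - timedelta(days=inactivity_days)
    return {cid: last_seen < cutoff for cid, last_seen in last_activity.items()}


# Made-up sample data: customer id -> date of last activity.
last_activity = {
    "alice": date(2024, 1, 10),
    "bob": date(2024, 5, 28),
}

labels = label_churned(last_activity, as_of=date(2024, 6, 1), inactivity_days=90)

# The metric agreed with the business could then be the share of churned customers.
churn_rate = sum(labels.values()) / len(labels)
```

Changing `inactivity_days` changes who counts as churned, which is why the definition and the metric need to be agreed before any model is chosen.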
Best practices to re-train the ML models
- Balance the computing cost of retraining against the performance improvement you gain from it.
- How often to retrain depends on the type of ML model and on when data drift happens.
- If we can’t compare the model’s outputs with the actual results we monitor, we can’t know when data drift happens.
- Periodic retraining helps in several cases:
- When we can’t know when the data drift happens, or
- When we use tools such as AutoML, where new algorithms are added periodically.
- New algorithms are likely added to the AutoML package periodically, so a fresh combination of algorithms may improve performance. We should retrain once a month.
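The retraining points above can be sketched as a simple decision rule. This is only an illustration under stated assumptions: the function names, the 0.05 accuracy-drop threshold, and the 30-day maximum model age are hypothetical, and accuracy stands in for whatever metric you actually monitor. When actual outcomes are available, the model's outputs are compared with them; when they are not, only the periodic schedule applies.

```python
from datetime import date


def accuracy(predictions, actuals):
    """Share of predictions that match the observed outcomes."""
    matches = sum(p == a for p, a in zip(predictions, actuals))
    return matches / len(actuals)


def should_retrain(baseline_acc, recent_acc, last_trained, today,
                   max_drop=0.05, max_age_days=30):
    """Return (decision, reason) for a retraining check.

    recent_acc is None when actual outcomes cannot be observed, in which
    case drift cannot be detected and only the schedule is used.
    max_drop is where the cost/benefit trade-off is encoded: a larger
    value means retraining (and paying for compute) less often.
    """
    if recent_acc is not None and baseline_acc - recent_acc > max_drop:
        return True, "accuracy drop"
    if (today - last_trained).days >= max_age_days:
        return True, "scheduled"
    return False, "no action"


# Made-up monitoring data: model outputs vs. outcomes observed later.
recent = accuracy([1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
                  [1, 0, 0, 1, 1, 1, 0, 1, 1, 0])

drift_case = should_retrain(0.9, recent, date(2024, 5, 20), date(2024, 6, 1))
blind_case = should_retrain(0.9, None, date(2024, 5, 1), date(2024, 6, 1))
quiet_case = should_retrain(0.9, 0.88, date(2024, 5, 20), date(2024, 6, 1))
```

In practice a check like this would run on a schedule, with the thresholds tuned to the cost of a training run and the value of the expected improvement.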