Andreas Botsikas – Microsoft engineer, author of the Azure Data Scientist Guide

“Customers want to adopt ML for use cases such as customer churn prediction. However, it’s not easy to define what a churned customer is unless you have a subscription business model like Netflix, where you can quickly identify the customers who stopped paying. E.g., what does churn mean for a supermarket?”

Andreas Botsikas has worked for Microsoft for the last ten years. He has a PhD in using AI for industrial resources and usage optimisation. He identifies his role as a “poly-míchanos” (πολυ-μήχανος), an engineer who finds solutions for the different challenges that may appear in a project.

Andreas says that “architecture is like a good suit. It’s OK to buy a suit in a store, but it’s better to customise it according to your requirements”. He is happy to share some of his designs on his GitHub account and personal blog, such as his Walk Dog example of how to stream data in Azure or how to do genomics analysis on Azure.

Why did you write a book from zero to hero on AzureML?

Last year Michael [Hlobil, co-author of the book] told me that he got a proposal to write the Azure Data Scientist Associate Certification Guide. He asked me to participate.

Michael knows that I worked with the [Microsoft] Product group on AzureML five years ago, even before it was a product. I helped them to shape the final preview of the product. Among other things, I fixed inconsistencies in the CLI [Command-line interface]. Back then, I understood that AI needed software engineers to solve problems.

Your guide is meant to help readers pass the Azure Data Scientist Associate Certification. It is also a practical book that explains how to develop a Machine Learning (ML) project in Azure.

The book came from my experience. I have helped many customers with ML projects, such as predictive maintenance, forecasting, model operationalisation, etc. This experience gave me the chance to experiment and find what works.

All the examples in your book are in Python. Why?

There are three main reasons:

Python has a much richer ecosystem. It has far more libraries. Nowadays, all the big players contribute algorithms in Python.
Most organisations use Python for ML. In my case, I had roughly 120 customers this year and only three used R.
It’s on the certification exam.

As the platform evolves, it becomes less of a debate about the programming language you use. It’s just some code that you execute on a computer to train an algorithm.

Additionally, it’s straightforward to switch languages, especially if you use linters such as flake8 for Python, style formatters such as black for Python, or GitHub Copilot.

“It’s critical to balance the computing cost to train models and the benefit of the performance improvement you get after retraining”

Do you prefer to write ML from scratch or use tools that prepare the algorithms instead?

Machine Learning is machines trying to detect patterns that the human brain can’t quickly identify. Knowing Statistics and programming helps you better understand how the ML algorithms work and choose the best one for your use case. But it’s not a requirement. All the prominent vendors are democratising AI and delivering services that simplify it. With these tools, there will be more citizen data scientists over time, and they will leverage AI.

What challenges do you face in making a real-world project work?

Suppose you want to do some computer vision on a product line. There are goods on a belt, and you must identify flaws in the packaging as they pass along.

Data Scientists will think about how to build the convolutional neural networks. But before they can start to program, reality kicks in. If they put the camera on the belt, it will vibrate, and it won’t take a stable picture. If they illuminate with a powerful lamp to take the picture, it may emit so much heat that it will spoil the product.

You must beat physics to make the ML project work in the real world.

“New algorithms are likely added to the AutoML package periodically. If other algorithms are available, a fresh combination may appear that improves performance. It’s an excellent practice to retrain these models once a month in case there are new combinations”

How often do you recommend your customers retrain their algorithms?

It depends on several factors, such as the model type and when your monitoring system tells you that data drift (or model drift) is happening.

As a rule of thumb, you must know what data is coming into the [ML] model, monitor the software, and monitor the model’s output. Then you should compare the metrics to know if it’s performing well.

Regarding frequency, you need to retrain forecasting models more often than any other type. My customers usually retrain the traditional [Python] sktime algorithms once every fortnight or even once a week.

As for the classification problems, it’s more challenging to see a shift in the environment that forces you to retrain.

Regarding the neural networks, my customers retrain them roughly once a month.

Anyway, it’s critical to balance the computing cost of training models and the benefit of the performance improvement you get after retraining.

Usually, you don’t need to retrain a model very often, unless, of course, something changes in the environment that causes data drift.

On a separate note, sometimes you don’t have the Ground Truth, i.e., you can’t compare the model’s outputs with the actual outcomes. Take a model that predicts loan defaults: it will probably take several years for a borrower to fail to pay back the debt, so you can’t be sure whether the model is drifting right now. In these cases, you must check whether the inputs to the model remain the same. The hypothesis is that if the inputs stay within the ranges on which you trained the model, its predictions will probably be correct. Unless, of course, external factors such as war or oil prices intervene. Then you will have data drift and have to retrain the model.
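
A minimal sketch of that input-range check, assuming numeric features held in plain Python dictionaries. The function names and the loan features are hypothetical illustrations, not code from the book or from Azure ML, and a production drift monitor would normally compare full distributions rather than only minimum and maximum values.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]
Ranges = Dict[str, Tuple[float, float]]


def training_ranges(rows: List[Row]) -> Ranges:
    """Record the minimum and maximum seen for each feature at training time."""
    ranges: Ranges = {}
    for row in rows:
        for name, value in row.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


def out_of_range_share(rows: List[Row], ranges: Ranges) -> float:
    """Fraction of incoming rows with at least one feature outside its training range."""
    if not rows:
        return 0.0
    flagged = 0
    for row in rows:
        for name, value in row.items():
            if name not in ranges:
                continue  # feature unseen in training: ignored in this sketch
            low, high = ranges[name]
            if value < low or value > high:
                flagged += 1
                break  # one out-of-range feature is enough to flag the row
    return flagged / len(rows)


# Hypothetical loan-application features used only for illustration.
train = [{"income": 20.0, "loan": 5.0}, {"income": 80.0, "loan": 30.0}]
incoming = [{"income": 50.0, "loan": 10.0}, {"income": 120.0, "loan": 10.0}]

ranges = training_ranges(train)
share = out_of_range_share(incoming, ranges)
print(f"{share:.0%} of incoming rows fall outside the training ranges")
```

A rising share would be a signal to investigate and possibly retrain, even though the Ground Truth is not yet available.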

So periodic retraining helps, especially with tools like [Azure] AutoML or if you don’t have the Ground Truth.

In the case of AutoML, new algorithms are likely added to the package periodically. If other algorithms are available, a fresh combination may appear that improves performance. It’s an excellent practice to retrain these models once a month, if we have the budget for it, in case there are new combinations.

Some companies need to implement ML, such as customer churn models, to stay relevant in their business. However, they may not have experience. How do you proceed in these cases?

First of all, do you know which customers have churned? It’s not easy to define what a churned customer is unless you have a subscription business model like Netflix, where you can quickly identify the customers who stopped paying. E.g., what does churn mean for a supermarket?

A supermarket has random transactions. You have to correlate those transactions and identify the patterns. Then, you must determine whether a specific customer stopped buying something because they weren’t due to buy in this period or because they churned.
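
One way to make that distinction concrete is to compare a customer's current silence with their own usual gap between purchases. The sketch below is an assumption for illustration, not a definition given in the interview: the `possibly_churned` name, the median-gap rule, and the factor of 3 are all hypothetical choices that a real project would agree with the business stakeholders.

```python
from datetime import date
from statistics import median
from typing import List


def possibly_churned(purchase_dates: List[date], today: date, factor: float = 3.0) -> bool:
    """Flag a customer whose silence is much longer than their usual purchase gap.

    Hypothetical rule: silence > factor * median gap between past purchases.
    """
    dates = sorted(purchase_dates)
    if len(dates) < 3:
        return False  # too little history to estimate a purchase rhythm
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    typical_gap = max(median(gaps), 1)  # avoid a zero threshold for same-day purchases
    silence = (today - dates[-1]).days
    return silence > factor * typical_gap


# A hypothetical weekly shopper: four purchases, seven days apart.
weekly_shopper = [date(2024, 1, 1), date(2024, 1, 8), date(2024, 1, 15), date(2024, 1, 22)]

print(possibly_churned(weekly_shopper, today=date(2024, 1, 30)))  # 8 days of silence
print(possibly_churned(weekly_shopper, today=date(2024, 2, 26)))  # 35 days of silence
```

With this rule, eight days of silence from a weekly shopper is treated as normal, while 35 days exceeds three times the usual gap and is flagged for review.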

The ideal scenario is when business stakeholders explain to the Data Scientists what they need. Then they collaborate to solve the problem and choose the appropriate ML algorithm or tool.

Celia Muriel