“The COVID pandemic has spoiled many forecasting models. We can’t use historical data to predict a scenario that has never happened before. Many companies have not included data from 2021 to train their models because it’s biased.”
Héctor Parra is a seasoned IT specialist with experience in architecting, integrating, optimising and automating solutions at a massive scale. He has worked as a Customer Solutions Engineer at Google for the last few years. His role sits between Google’s marketing services (such as Google Business, Google Ads, or Google Analytics) and their customers. He helps companies leverage data to better market their products and services.
What is a day in the life of a Customer Solutions Engineer like?
I explain to my customers that their data, what we call First-Party Data, is key to running their businesses.
Then I find out what kind of data a customer has and identify the different data sources. I often see that they have silos, i.e., data sources that are isolated from each other. I tell my customers that they can get plenty of helpful information if they combine all those data sources.
So when companies eliminate the silos and join the data, they can better predict how the characteristics of users map to their offering. E.g., it allows them to find new customers or serve existing ones better. In short, data is gold for Marketing.
On a separate note, users are very protective of their privacy. Thus, Marketing must be a powerful tool for advertisers, but it must also be respectful of users.
When you say “users”, do you mean the users of the Google public portfolio, such as Google Search, Google Maps, etc.?
Exactly. But also, if a customer delivers pizzas, users are the people buying the pizzas.
Some of our current efforts [at Google] are aimed at bringing advertisers closer to their customers.
Companies generate and store data in many different services, both structured (e.g., products and services in a database) and unstructured (e.g., social media data in a data lake). What is the best scenario for companies to combine their information and eliminate the silos?
I propose customers find a storage solution to keep all the data, either a Data Warehouse or a Data Lake. They choose the solution that suits them best. Their decision depends on several factors, including their specific requirements and how they will consume the data (queries, analyses, Machine Learning models, etc.).
At a very high level, I propose my customers run projects as follows:
- Correct the data at the sources,
- Load all the data into one single place, e.g., a Data Warehouse,
- Normalise the data to ensure we can combine it in a data stack, and
- Join the information, analyse it, and prepare it for final consumption in the way the customer needs it.
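The four steps above can be sketched with a toy example. The frames and column names here are hypothetical stand-ins for two real silos, say a CRM export and a web-analytics export:

```python
import pandas as pd

# Hypothetical silos: a CRM export and a web-analytics export (toy data).
crm = pd.DataFrame({
    "Customer_ID": [1, 2, 3],
    "Email": ["a@x.com", "b@x.com", "c@x.com"],
    "Lifetime_Value": [120.0, 80.0, 200.0],
})
web = pd.DataFrame({
    "CustomerID": [1, 2, 4],
    "sessions": [10, 3, 7],
})

# Steps 1-2: correct the data at the source and load it into one place
# (here, both frames simply live in memory).
# Step 3: normalise the schemas so the silos can be combined.
crm.columns = [c.lower() for c in crm.columns]
web = web.rename(columns={"CustomerID": "customer_id"})

# Step 4: join the information into one view for analysis.
combined = crm.merge(web, on="customer_id", how="left")
print(combined)
```

The left join keeps every CRM customer even when the web silo has no matching row, which is typical once real silos are combined.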
Additionally, hosting the data in the cloud is usually better because it’s convenient for companies and their partners to upload the data.
The cloud providers offer many connectors to many data sources (certainly Google Cloud does). These connectors quickly pull data from the sources into the target storage, where you keep all the data together.
Do you develop the Machine Learning (ML) models from scratch (e.g., in Python)? Or do you prefer to use a solution that provides half-programmed or pre-trained models, such as BigQuery ML?
It depends on the use case. If the model is simple, I don’t reinvent the wheel. I use BigQuery ML.
Even if I must build more complex models, I first try a pre-trained model. E.g., there are pre-trained models to detect objects in images or perform OCR on invoices. Often the output of a pre-trained model is not that different from a customised one, and I save plenty of time.
When the previous options don’t work, I try AutoML because it helps with hyperparameter tuning and some other aspects of the ML project.
Anyway, I rarely code. When I do, it’s usually a Cloud Function to quickly expose an ML model through an API in near real time. E.g., if a finance company wants to run a risk assessment for a customer, they can ask the customer to fill in some forms. In the meantime, they run an ML model. When the customer finishes, they already know the level of risk involved in granting him or her credit, and they offer a product accordingly. This use case requires a significant amount of work to build the models, make them available online, and get a result in a fraction of a second.
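A minimal sketch of the scoring side of such a use case, assuming a hypothetical pre-trained logistic model with made-up coefficients. In a real deployment the model is trained offline and only the scoring function runs online, e.g. behind an HTTP endpoint such as a Cloud Function:

```python
import math

# Hypothetical coefficients for a pre-trained logistic risk model
# (illustrative values, not a real credit model).
COEFFS = {"income": -0.00002, "open_loans": 0.4, "late_payments": 0.9}
INTERCEPT = -1.0

def risk_score(form: dict) -> float:
    """Probability of default for one applicant's form data."""
    z = INTERCEPT + sum(COEFFS[k] * form.get(k, 0) for k in COEFFS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic link

# While the applicant is still filling in the form, the score
# can already be computed from the fields submitted so far.
applicant = {"income": 45000, "open_loans": 1, "late_payments": 0}
score = risk_score(applicant)
print(f"risk of default: {score:.2f}")
```

Keeping the scoring function this small is what makes a sub-second response feasible: all the heavy work happened at training time.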
How do you approach projects at those companies that don’t have Data Scientists but require an ML solution?
I first have a conversation with my customer to assess:
- Where they are from a technological perspective,
- Their objectives,
- What their skills are, and
- What skills they lack.
Then I usually suggest working with a partner. The partner brings those skills to develop and maintain a project. In the meantime, my customers can gather a team to take over the project when they are ready.
However, many companies don’t want to have a Data Science team because they think they would need it for just one project. Over time, when they start developing more projects, they realise that Data Science is a long-term investment and that they need a team.
How much data is good enough to develop a Machine Learning project?
Forecasting or predicting sounds very fancy, and that is what Machine Learning does. However, it is nothing but an extension of studying the past. When you look into the historical data, you find a trend. Then you project that trend into the future. In conclusion, you need to gather enough historical data to detect trends.
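That idea, fitting a trend on the past and projecting it forward, can be illustrated with synthetic monthly sales data:

```python
import numpy as np

# Toy monthly sales history (24 months): the "study the past" step.
rng = np.random.default_rng(0)
months = np.arange(24)
sales = 100 + 2.5 * months + rng.normal(0, 3, 24)

# Find the trend in the historical data...
slope, intercept = np.polyfit(months, sales, 1)

# ...then project that trend over the next 6 months.
future = np.arange(24, 30)
forecast = intercept + slope * future
print(forecast.round(1))
```

With only a few noisy months of history the fitted slope would drift far from the true trend, which is why enough historical data matters.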
The current situation, specifically the COVID pandemic, has spoiled many forecasting models. We can’t use historical data to predict a scenario that has never happened before. Many of my customers have not included data from 2021 to train their models because they say it’s biased. Instead, they are using data from 2015 or 2016 to make their predictions.
We must make decisions about how to integrate that bias into our models. So we must balance the weight of the pandemic’s impact against normality and find a way to merge both. Once we combine them, we will better understand the company’s situation, always bearing in mind that forecasting is imperfect.
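One way to merge both regimes, sketched here with synthetic data, is to down-weight rather than discard the distorted period when fitting the model:

```python
import numpy as np

# Toy monthly demand: five "normal" years plus a distorted pandemic year.
months = np.arange(72)
demand = 50 + 0.5 * months        # underlying normal trend
demand[60:] += 30.0               # hypothetical pandemic distortion

# Down-weight (rather than discard) the distorted period, so the fit
# blends pandemic behaviour with normality instead of ignoring either.
weights = np.ones(72)
weights[60:] = 0.2

slope_w, _ = np.polyfit(months, demand, 1, w=weights)
slope_u, _ = np.polyfit(months, demand, 1)
print(f"unweighted slope: {slope_u:.2f}, weighted slope: {slope_w:.2f}")
```

The unweighted fit lets the pandemic spike inflate the trend; the weighted fit stays closer to the underlying slope while still seeing the recent data.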
Forecasting has become particularly tricky, even though it’s a bit easier when more extended historical information is available or the variables don’t depend much on the pandemic.
The war in Ukraine, as well as any other source of instability, also affects the predictions.
Are the current models that companies build and deploy today not as accurate as they used to be?
Yes, it happens in many cases. But the pandemic doesn’t harm everything. Some companies have seen their sales boosted. That is the case for online sales, which have increased significantly since so many people work from home.
In any case, some trends were very specific to the pandemic.
Now everything has started to go back to normal, but some of these trends have not changed. We are having a fun time figuring out which trends are changing and which ones are here to stay.
More than ever, we need to choose our battles wisely: what we want to predict, what for, and how to reduce bias.
What would you say is a good way to evaluate which battle to fight?
It depends on the type of company we are considering, its business objectives, its situation compared to when the pandemic began, and what has changed.
In addition, we must analyse many factors, such as what changed for the better or the risks we didn’t have before. We also need to know what is worse and what we can do to make it better. E.g., is your company able to deliver products at home in a reasonable time?
Companies will also need to do some marketing experiments, such as opening one store and evaluating its results now that people are coming back to public places.
Moreover, companies must identify the opportunities and the risks, and set priorities. Data analysis is vital in this area.
We can predict, but they need to be short-term predictions. E.g., if a company re-opens stores, it can monitor the reaction of its customers. We can forecast two months ahead. If the results are promising and close to reality, we can predict two more months after these two. Uncertainty impacts the ML models.
Anyway, if we predict in the shorter term, we don’t need such long historical information. It is also easier to account for the effects of the pandemic.
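A rough sketch of this rolling short-term approach, on synthetic data: forecast two months from recent history only, check the error against reality, then extend the history and forecast again.

```python
import numpy as np

# Synthetic "reality": 36 months of sales with trend plus noise.
rng = np.random.default_rng(1)
actuals = 200 + 1.5 * np.arange(36) + rng.normal(0, 4, 36)

history = list(actuals[:24])   # what has been observed so far
horizon = 2                    # predict only two months ahead
errors = []

for step in range(24, 36, horizon):
    x = np.arange(len(history))
    # Fit only the last 12 months: short-term forecasts need less history.
    slope, intercept = np.polyfit(x[-12:], history[-12:], 1)
    preds = intercept + slope * np.arange(step, step + horizon)
    error = float(np.abs(preds - actuals[step:step + horizon]).mean())
    errors.append(error)
    # Reality catches up; extend the history and forecast again.
    history.extend(actuals[step:step + horizon])
    print(f"months {step}-{step + horizon - 1}: mean abs error {error:.1f}")
```

If any two-month error blows up, that is the signal to stop trusting the model rather than to extend the forecast.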