Understanding ML In Production: A Crash Course

Motivation, intuition and the process behind this series of articles

Hi there. I’m Theodoros, a computer engineering student here in Greece, and I love deep learning.

Welcome to Understanding Machine Learning in Production. In this article we are going to go over the main objective of this series and a rough outline of what it will cover.

I’m creating these articles because of a tension I keep running into. The TensorFlow ecosystem, high-level APIs like Keras, and all the free (and non-free) tools and services that big companies provide online, like the famous Google Colab, have lowered the entry barriers to machine learning. At the same time, the whole ecosystem has grown so big that it is hard to get a grasp of it.

This series is based entirely on the end-to-end machine learning methodology that TensorFlow Extended supports.

The guides on the TensorFlow Extended website do a good job of showcasing what each component does, but I feel that in many places they do not explain clearly enough why a particular component exists or is coded a certain way, and sometimes things are presented out of order. This effect is amplified for people without experience of machine learning in a scalable production environment.

I kept coming across articles and guides that did not help me truly understand every part of the machine learning lifecycle in both a theoretical and a practical way, and this pissed me off. I spent a significant amount of time studying these topics on my own so that I could present them to you in a structured, easy-to-understand, pipelined way. Every single step taken in this series is explained in depth, along with the intuition behind it.

In short, this is a crash course on distributed, production-level machine learning systems on the cloud. It’s an essential skill for any machine learning engineer or researcher.

To start, we are going to focus on the problems that TensorFlow Extended, or TFX for short, is trying to solve and the methodology it uses to solve them. We are going to dive deeper into the hidden technical debt of machine learning systems and develop a solid understanding of why things need to be a certain way for machine learning in production to actually work well and not backfire.

We’re also going to investigate multiple approaches to the same problem whenever possible, all the way from raw data input to model deployment and serving, so that, given the advantages and disadvantages of each, you can choose the approach that fits your needs.

In each article, we are going to tackle a small problem, discuss how it can be integrated with the rest of the system, and develop an understanding of the proposed solutions and their possible limitations.

Finally, we will develop a full system that runs on the cloud in the form of Apache Beam pipelines, leverages TensorFlow Extended, and deals with all the problems that will have been presented and solved up to that point.

By the end of this series you should have a firm understanding of every step required to deploy a machine learning system in production.

Articles Index

If you are completely new to this, please read the following list in order, from start to finish.

*Rows without links are currently being worked on.

Thank you for reading all the way to the end!