Structuring ML Pipeline Projects

An organised codebase enables you to implement changes faster and make fewer mistakes, ultimately leading to higher code and model quality. Read on to learn how to structure your ML projects with TensorFlow Extended (TFX), the easy and straightforward way.

Theodoros Ntakouris
Towards Data Science

--

Project Structure: Requirements

  • Enable experimentation with multiple pipelines
  • Support both a local execution mode and a deployment execution mode. This results in two separate running configurations: the first used for local development and end-to-end testing, the second used for running in the cloud.
  • Reuse code across pipeline variants if it makes sense to do so
  • Provide an easy-to-use CLI for executing pipelines with different configurations and data

A correct implementation also ensures that tests are easy to incorporate in your workflow.

Project Structure: Design Decisions

  • Use Python.
  • Use TensorFlow Extended (TFX) as the pipeline framework.

In this article we will demonstrate how to run a TFX pipeline both locally and on a Kubeflow Pipelines installation with minimum hassle.

Side Effects Caused By Design Decisions

  • By using TFX, we are going to use TensorFlow. Keep in mind that TensorFlow supports more types of models than just neural networks, like boosted trees.
  • Apache Beam can execute locally, anywhere Kubernetes runs, and on all major public cloud providers. Examples include, but are not limited to, GCP Dataflow and Azure Databricks.
  • Due to Apache Beam, we need to make sure that the project code is easily packageable by Python's sdist for maximum portability. This is reflected in the top-level module structure of the project. (If you use external libraries, be sure to include them by providing an argument to Apache Beam. Read more about this on Apache Beam: Managing Python Pipeline Dependencies).
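For example, a minimal sketch, assuming the project ships a setup.py and a requirements.txt at its root: these are the flags described in Managing Python Pipeline Dependencies, and TFX forwards them to every Beam-powered component when you pass them as beam_pipeline_args to the pipeline.

    # Passed to tfx.orchestration.pipeline.Pipeline(..., beam_pipeline_args=...)
    beam_pipeline_args = [
        '--setup_file=./setup.py',                 # package the project itself as an sdist
        '--requirements_file=./requirements.txt',  # external libraries the pipeline code imports
    ]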

[Optional] Before continuing, take a moment to read about the provided TFX CLI. Currently, it is embarrassingly slow to operate, and the directory structure is much more verbose than it needs to be. It also does not include any notes on reproducibility and code reuse.

Directory Structure and Intuition Behind It

  • $project-name is the root directory of your project
  • $project-name/ml includes machine learning related stuff.
  • $project-name/ml/pipelines includes the actual ML pipeline code
  • Typically, you may find yourself with multiple ML pipelines to manage, such as $project-name/ml/pipelines/predict-sales and $project-name/ml/pipelines/classify-fraud or similar.
  • Here is a simple tree view:
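A layout along these lines (a sketch assembled from the files discussed in the rest of the article):

    $project-name
    └── ml
        └── pipelines
            ├── data
            ├── util
            │   ├── input_fn_utils.py
            │   └── model_utils.py
            ├── cli.py
            ├── pipeline.py
            ├── local_beam_dag_runner.py
            ├── kfp_runner.py
            └── $pipeline-name
                ├── constants.py
                ├── model.py
                └── training.py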

$project-name/ml/pipelines includes the following:

  • data → a small amount of representative training data to run locally, for testing and CI. This applies if your system does not have a dedicated component to pull data from somewhere; if it does, make sure to include a sampling query with a small, limited number of items.
  • util → code that is reused and shared across the $pipeline-name directories. It is not necessary to include input_fn_utils.py and model_utils.py. Use whatever makes sense here. Here are some examples:

In my own projects, it made sense to abstract some parts into the utility module, like:

  • Building named input and output layers for the Keras models.
  • Building the serving signature metagraph using the TensorFlow Transform output.
  • Preprocessing features into groups by using keys.
  • Other common repetitive tasks, like building input pipelines with the TensorFlow Dataset API.
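As a rough illustration (not the author's actual code; the feature keys and label name are placeholders), two such utilities might look like this:

    import tensorflow as tf
    import tensorflow_transform as tft


    def build_named_inputs(feature_keys):
        # One named Keras input layer per feature key.
        return {
            key: tf.keras.layers.Input(shape=(1,), name=key, dtype=tf.float32)
            for key in feature_keys
        }


    def make_serving_signature(model, tf_transform_output: tft.TFTransformOutput,
                               label_key='label'):
        # Serving signature that parses serialized tf.Examples and applies the
        # TensorFlow Transform graph before calling the trained model.
        model.tft_layer = tf_transform_output.transform_features_layer()

        @tf.function
        def serve_tf_examples_fn(serialized_tf_examples):
            feature_spec = tf_transform_output.raw_feature_spec()
            feature_spec.pop(label_key)  # the label is not an input at serving time
            parsed = tf.io.parse_example(serialized_tf_examples, feature_spec)
            transformed = model.tft_layer(parsed)
            return model(transformed)

        return serve_tf_examples_fn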

  • cli.py → entry point and command line interface for the pipelines. Here are some common things to consider when using TFX.

By using abseil you can declare and access flags globally. Each module defines the flags that are specific to it, so flag declarations end up distributed across the codebase. This means that common flags, like --data_dir=..., --hparam_tuning, --pipeline_root, --ml_metadata_url, --use_cache and --train_epochs, are ones you can define in the actual cli.py file. Other, more specific ones for each pipeline can be defined in submodules.

This file acts as an entry point for the system. It uses the contents of pipeline.py to set up the components of the pipeline, as well as to provide the user-provided module files (in the tree example these are constants.py, model.py and training.py), based on some flag like --pipeline_name=$pipeline-name or some other configuration.

Finally, with the assembled pipeline, it calls the appropriate _runner.py file, selected by a --runner= flag.
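Putting these together, a minimal cli.py might look roughly like the following. The flag names mirror the ones above, while the module paths and helpers (pipeline.create_pipeline, local_beam_dag_runner.run, kfp_runner.run) are assumptions about this project layout, not TFX APIs:

    # cli.py -- hypothetical sketch of the pipelines' entry point
    from absl import app, flags

    import pipeline                                  # assumed sibling module, see below
    import local_beam_dag_runner, kfp_runner        # assumed runner modules

    FLAGS = flags.FLAGS
    flags.DEFINE_string('pipeline_name', None, 'Which $pipeline-name to run.')
    flags.DEFINE_string('runner', 'local', 'One of: local, kfp.')
    flags.DEFINE_string('data_dir', 'ml/pipelines/data', 'Root of the input data.')
    flags.DEFINE_string('pipeline_root', '/tmp/pipeline_root', 'Artifact output root.')
    flags.DEFINE_bool('use_cache', True, 'Reuse cached component executions.')


    def main(argv):
        del argv  # unused
        # Assemble the parameterised pipeline for the requested $pipeline-name.
        tfx_pipeline = pipeline.create_pipeline(
            pipeline_name=FLAGS.pipeline_name,
            pipeline_root=FLAGS.pipeline_root,
            data_path=FLAGS.data_dir,
            # The user-provided module file lives under the selected $pipeline-name.
            module_file=f'ml/pipelines/{FLAGS.pipeline_name}/training.py',
        )
        # Dispatch to the configuration selected by --runner.
        if FLAGS.runner == 'local':
            local_beam_dag_runner.run(tfx_pipeline)
        else:
            kfp_runner.run(tfx_pipeline)


    if __name__ == '__main__':
        app.run(main)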

  • pipeline.py → parameterised pipeline component declaration and wiring. This is usually just a function that declares a bunch of TFX components and returns a tfx.orchestration.Pipeline object.
  • local_beam_dag_runner.py → configuration to run locally with the portable Beam runner. This can typically be almost configuration-free, just by using the BeamDagRunner.
  • kfp_runner.py → configuration to run on Kubeflow Pipelines. This typically includes different data paths and pipeline output prefixes, and auto-binds an ml-metadata instance.

Note: you can have more runners, like something that runs on GCP and just provisions extra resources, like TPU instances, parallel AI Platform hyperparameter search, etc.
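To make this concrete, here is a rough sketch of pipeline.py and local_beam_dag_runner.py that matches the hypothetical cli.py above; the component list is trimmed to two components, and exact constructor arguments vary between TFX versions:

    # pipeline.py -- declare parameterised components, wire them, return a Pipeline
    from tfx.components import CsvExampleGen, Trainer
    from tfx.orchestration import pipeline as pipeline_lib
    from tfx.proto import trainer_pb2


    def create_pipeline(pipeline_name, pipeline_root, data_path, module_file):
        example_gen = CsvExampleGen(input_base=data_path)
        trainer = Trainer(
            module_file=module_file,  # $pipeline-name/training.py, which defines run_fn
            examples=example_gen.outputs['examples'],
            train_args=trainer_pb2.TrainArgs(num_steps=1000),
            eval_args=trainer_pb2.EvalArgs(num_steps=100),
        )
        return pipeline_lib.Pipeline(
            pipeline_name=pipeline_name,
            pipeline_root=pipeline_root,
            components=[example_gen, trainer],
            enable_cache=True,
        )


    # local_beam_dag_runner.py -- run the assembled pipeline on the portable Beam runner
    from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner


    def run(tfx_pipeline):
        BeamDagRunner().run(tfx_pipeline)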

$pipeline-name

This is the user-provided code that makes different models, schedules different experiments, etc.

Due to the util submodule, code under each pipeline should be much leaner. There is no need to split it into more than 3 files, although nothing prohibits you from spreading your code across more files.

From experimentation, I converged on a constants, model and training split.

  • constants.py → declarations. Sensible default values for training parameters, hyperparameter keys and declarations, feature keys, feature groups, evaluation configurations and metrics to track. A small example is sketched after this list.
  • model.py → Model definition. Typically contains a build_keras_model function and uses imports from util and $pipeline-name.constants. A rough example is sketched after this list.
  • Lastly, training.py includes all the fuss required to train the model. This is typically: preprocessing definition, hyperparameter search, setting up training data, data- or model-parallel strategies, TensorBoard logs, and saving the module for production.
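As an illustrative sketch only, with placeholder feature names and defaults rather than the code from the author's project, constants.py and model.py could look roughly like this:

    # constants.py -- shared declarations with sensible defaults
    LABEL_KEY = 'label'
    NUMERIC_FEATURE_KEYS = ['feature_a', 'feature_b']  # placeholder feature group
    HIDDEN_UNITS = [64, 32]
    TRAIN_EPOCHS = 10
    EVAL_ACCURACY_THRESHOLD = 0.8  # consumed by the evaluation configuration


    # model.py -- model definition built from the declarations above
    import tensorflow as tf

    from . import constants  # works when the project is packaged as an sdist


    def build_keras_model():
        # Named input layers, the kind of helper that could also live in util.
        inputs = {
            key: tf.keras.layers.Input(shape=(1,), name=key)
            for key in constants.NUMERIC_FEATURE_KEYS
        }
        x = tf.keras.layers.concatenate(list(inputs.values()))
        for units in constants.HIDDEN_UNITS:
            x = tf.keras.layers.Dense(units, activation='relu')(x)
        output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
        model = tf.keras.Model(inputs=inputs, outputs=output)
        model.compile(optimizer='adam', loss='binary_crossentropy',
                      metrics=['accuracy'])
        return model

training.py would then expose the run_fn that the TFX Trainer calls: it builds the train and eval datasets (for example via the util input-pipeline helpers), fits this model, and saves it together with a serving signature like the one sketched earlier.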

That’s it. Thank you for reading to the end!

I hope that you enjoyed reading this article as much as I enjoyed writing it.
