Deep Learning End to End Pipelines made easy with Fluent Tensorflow Extended

A quick API overview and a self-contained example of fluent-tfx

Theodoros Ntakouris
Towards Data Science

--

If this production e2e ML pipelines thing seems new to you, please read the TFX guide first.

On the other hand, if you’ve used TFX before, or are planning to deploy a machine learning model, you’re in the right place.

Image by Michal Jarmoluk from Pixabay

But Tensorflow Extended is already fully capable of constructing e2e pipelines by itself, so why bother with another API?

  • Verbose and lengthy code definitions. A pipeline component definition can be as long as the actual preprocessing or training code it wraps.
  • Lack of sensible defaults. You have to manually specify inputs and outputs for everything. This allows maximum flexibility on one hand, but in 99% of cases most of the IO can be wired automatically. For example, your preprocessing component is probably going to read your input component’s examples and pass its outputs to training.
  • Too much boilerplate code. Scaffolding via the TFX CLI produces 15–20 files in 4–5 directories.

The benefits of an easier to use, API layer

  • Fluent and compact pipeline definition and runtime configuration. No more scrolling through endless, 300+ line functions that construct pipelines.
  • No scaffolding; easy to set up with a few lines of code.
  • Extra utilities to speed up common tasks, such as data input and TFX component construction and wiring.
  • Sensible defaults and 99%-suitable component IO wiring.

Disclaimer: I am the author of fluent-tfx

API Overview through an example

This is essentially fluent-tfx/examples/usage_guide/simple_e2e.py, but please, read on.

Scroll to Pipeline Building if you already know the basics of TFX.

Data In

The file data/data.csv contains 4 columns: a, b, c, lbl. a and b are floats sampled randomly, c is a binary feature (0 or 1) and lbl is the binary label (0 or 1). It’s a toy problem, just for demonstration purposes.
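For reference, a toy dataset with this shape can be produced with a few lines of plain Python. The generation code is not part of the package; the labeling rule below is just an arbitrary assumption so the label is learnable from the features:

```python
import csv
import os
import random

random.seed(42)
os.makedirs("data", exist_ok=True)

# Write a toy dataset matching the schema described above:
# a, b are random floats, c is a binary feature, lbl is the binary label.
with open("data/data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b", "c", "lbl"])
    for _ in range(1000):
        a = random.uniform(-1.0, 1.0)
        b = random.uniform(-1.0, 1.0)
        c = random.randint(0, 1)
        # Arbitrary rule, chosen only so that lbl depends on a, b and c.
        lbl = int(a + b > 0) if c == 1 else int(a - b > 0)
        writer.writerow([a, b, c, lbl])
```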

Model Code

The engineer, or ‘user’, provides functions for preprocessing, model building, hyperparameter search and saving the model with proper signatures. We’re going to show how you can easily define these functions in another file (say, model_code.py).

We’re going to use all the goodies of TFX to do the bulk of the work.

Tensorflow Transform for preprocessing:
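A minimal preprocessing_fn for the toy dataset could look like the sketch below. The transformed feature names and the choice of transforms are my assumptions, not taken from the original example; note that this function only executes inside a Transform (Apache Beam) context, not standalone:

```python
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
    """Transform entry point: maps raw features to transformed ones."""
    outputs = {}
    # Scale the dense float features to zero mean and unit variance.
    outputs["a_xf"] = tft.scale_to_z_score(inputs["a"])
    outputs["b_xf"] = tft.scale_to_z_score(inputs["b"])
    # The binary feature and label just need a consistent dtype.
    outputs["c_xf"] = tf.cast(inputs["c"], tf.float32)
    outputs["lbl"] = tf.cast(inputs["lbl"], tf.int64)
    return outputs
```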

Tensorflow Datasets for feeding input to the model in an effective way:
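A typical tf.data input pipeline over the (gzipped) TFRecords that TFX components emit might look like this. The function name and the feature names are illustrative, not the package's API:

```python
import tensorflow as tf


def make_dataset(file_pattern, feature_spec, label_key, batch_size=32):
    """Build a tf.data pipeline over gzipped TFRecord files of tf.Examples."""
    files = tf.data.Dataset.list_files(file_pattern)
    dataset = tf.data.TFRecordDataset(files, compression_type="GZIP")
    dataset = dataset.batch(batch_size)
    # Parse a whole batch of serialized tf.Examples at once.
    dataset = dataset.map(
        lambda records: tf.io.parse_example(records, feature_spec),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    # Split the label out so model.fit can consume (features, label) tuples.
    dataset = dataset.map(lambda parsed: (
        {k: v for k, v in parsed.items() if k != label_key},
        parsed[label_key],
    ))
    return dataset.prefetch(tf.data.AUTOTUNE)
```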

KerasTuner and TFX Tuner — Trainer for hyperparameter search and model building:

Pipeline Building

The only non-trivial part of constructing the pipeline is providing some evaluation configuration for Tensorflow Model Analysis; the rest is just fluent-tfx:
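A sketch of the fluent chain is below. The method names are paraphrased from my recollection of the project's simple_e2e.py and may drift from the current API, so treat this as illustrative and check the example file for the exact calls:

```python
import fluent_tfx as ftfx

# NOTE: all method names here are assumptions paraphrased from the
# project's examples/usage_guide/simple_e2e.py, not a verified API listing.
pipeline = ftfx.PipelineDef(name="simple_e2e") \
    .with_sqlite_ml_metadata() \
    .from_csv(uri="./data") \
    .generate_statistics() \
    .infer_schema() \
    .preprocess(preprocessing_fn) \
    .train(run_fn) \
    .evaluate_model(eval_config=eval_config) \
    .push_to(relative_push_uri="serving") \
    .build()  # returns a vanilla tfx Pipeline object
```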

Runners

There is no extra effort required to run the pipeline on different runners that TFX supports (with Apache Beam), nor extra dependencies required: PipelineDef produces a vanilla TFX pipeline.
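Since the result is a vanilla TFX pipeline, running it locally with the stock Beam runner is one line (the `pipeline` variable is assumed to hold the built pipeline object):

```python
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

# Any stock TFX runner can execute the built pipeline unchanged.
BeamDagRunner().run(pipeline)
```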

However, if you are using ftfx utilities inside your pipeline functions, be sure to include this package in the requirements.txt Beam argument.

Appendix: Degrees of Freedom and Limitations

Custom components are supported to a large extent, but there will still be some edge cases that only work with the verbose, plain old TFX API.

Assumptions are related to component DAG wiring, paths and naming.

Paths

  • PipelineDef needs a pipeline_name and an optional bucket path.
  • Binary/Temporary/Staging artifacts are stored under {bucket}/{name}/staging
  • Default ml metadata sqlite path is set to {bucket}/{name}/metadata.db unless specified otherwise
  • bucket defaults to ./bucket
  • Pusher’s relative_push_uri will publish the model to {bucket}/{name}/{relative_push_uri}

Component IO and Names

  • An input, or an example_gen component, provides .tfrecords (probably in gzipped format) to the next components
  • Fluent TFX follows the TFX naming of default components for everything. When providing custom components, make sure that inputs and outputs are on par with TFX.
  • For example, your custom example_gen component should have a .outputs['examples'] attribute
  • When using extra components from input_builders, make sure that the names you provide do not override defaults, such as the standard TFX component names in snake_case, {name}_examples_provider and user_{x}_importer.

Component Wiring Defaults

  • If a user-provided schema URI is given, it will be used for data validation, transform, etc. The schema generation component will still produce artifacts if declared
  • If the user did not provide a model evaluation step, it will not be wired to the pusher
  • The default training input source is the transform outputs. The user can ask for the raw tf records instead
  • If hyperparameters are specified and a tuner is declared, the tuner will still test configurations and produce hyperparameter artifacts, but the provided hyperparameters will be used

Thanks for reading all the way to the end! If this production ML pipelines thing seems new to you, please read the TFX guide.
