Deep Learning End to End Pipelines made easy with Fluent Tensorflow Extended

A quick API overview and a self-contained example of fluent-tfx

Theodoros Ntakouris
Towards Data Science

--

If this production e2e ML pipelines thing seems new to you, please read the TFX guide first.

On the other hand, if you’ve used TFX before, or are planning to deploy a machine learning model, you’re in the right place.

Image by Michal Jarmoluk from Pixabay

But Tensorflow Extended is already fully capable of constructing e2e pipelines by itself, so why bother with another API?

  • Verbose and lengthy code definitions. A pipeline component definition can be as long as the actual preprocessing or training code it wraps.
  • Lack of sensible defaults. You have to manually specify inputs and outputs for everything. This allows maximum flexibility on one hand, but in 99% of cases most of the IO can be wired automatically. For example, your preprocessing component is probably going to read your input component’s examples and pass its outputs to training.
  • Too much boilerplate code. Scaffolding via the TFX CLI produces 15–20 files in 4–5 directories.

The benefits of an easier to use, API layer

  • Fluent and compact pipeline definition and runtime configuration. No more scrolling through endless, 300+ line functions that construct pipelines.
  • No scaffolding; easy to set up with a few lines of code.
  • Extra utilities to speed up common tasks, such as data input and TFX component construction and wiring.
  • Sensible defaults and 99%-suitable component IO wiring.

Disclaimer: I am the author of fluent-tfx

API Overview through an example

This is essentially fluent-tfx/examples/usage_guide/simple_e2e.py, but please, read on.

Scroll to Pipeline Building if you already know the basics of TFX.

Data In

The file data/data.csv contains 4 columns: a, b, c, lbl. a and b are floats sampled randomly, c is a binary feature (0 or 1) and lbl is the binary label (0 or 1). It’s a toy problem, just for demonstration purposes.
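For reference, a toy dataset with this shape can be produced with a few lines of plain Python. The generation code is not part of the package; the labeling rule below is just an arbitrary assumption so the label is learnable from the features:

```python
import csv
import os
import random

random.seed(42)
os.makedirs("data", exist_ok=True)

# Write a toy dataset matching the schema described above:
# a, b are random floats, c is a binary feature, lbl is the binary label.
with open("data/data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b", "c", "lbl"])
    for _ in range(1000):
        a = random.uniform(-1.0, 1.0)
        b = random.uniform(-1.0, 1.0)
        c = random.randint(0, 1)
        # Arbitrary rule, chosen only so that lbl depends on a, b and c.
        lbl = int(a + b > 0) if c == 1 else int(a - b > 0)
        writer.writerow([a, b, c, lbl])
```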

Model Code

The engineer, or ‘user’, provides functions for preprocessing, model building, hyperparameter search and saving the model with proper signatures. We’re going to show how you can easily define these functions in another file (say, model_code.py).

We’re going to use all the goodies of TFX to do the bulk of the work.

Tensorflow Transform for preprocessing:
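A minimal preprocessing_fn for the toy dataset could look like the sketch below. The transformed feature names and the choice of transforms are my assumptions, not taken from the original example; note that this function only executes inside a Transform (Apache Beam) context, not standalone:

```python
import tensorflow as tf
import tensorflow_transform as tft


def preprocessing_fn(inputs):
    """Transform entry point: maps raw features to transformed ones."""
    outputs = {}
    # Scale the dense float features to zero mean and unit variance.
    outputs["a_xf"] = tft.scale_to_z_score(inputs["a"])
    outputs["b_xf"] = tft.scale_to_z_score(inputs["b"])
    # The binary feature and label just need a consistent dtype.
    outputs["c_xf"] = tf.cast(inputs["c"], tf.float32)
    outputs["lbl"] = tf.cast(inputs["lbl"], tf.int64)
    return outputs
```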

Tensorflow Datasets for feeding input to the model in an effective way:
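A typical tf.data input pipeline over the (gzipped) TFRecords that TFX components emit might look like this. The function name and the feature names are illustrative, not the package's API:

```python
import tensorflow as tf


def make_dataset(file_pattern, feature_spec, label_key, batch_size=32):
    """Build a tf.data pipeline over gzipped TFRecord files of tf.Examples."""
    files = tf.data.Dataset.list_files(file_pattern)
    dataset = tf.data.TFRecordDataset(files, compression_type="GZIP")
    dataset = dataset.batch(batch_size)
    # Parse a whole batch of serialized tf.Examples at once.
    dataset = dataset.map(
        lambda records: tf.io.parse_example(records, feature_spec),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    # Split the label out so model.fit can consume (features, label) tuples.
    dataset = dataset.map(lambda parsed: (
        {k: v for k, v in parsed.items() if k != label_key},
        parsed[label_key],
    ))
    return dataset.prefetch(tf.data.AUTOTUNE)
```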

KerasTuner and TFX Tuner — Trainer for hyperparameter search and model building:

Pipeline Building

The only non-trivial part of constructing the pipeline is providing some evaluation configuration for Tensorflow Model Analysis; the rest is just fluent-tfx:
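A sketch of the fluent chain is below. The method names are paraphrased from my recollection of the project's simple_e2e.py and may drift from the current API, so treat this as illustrative and check the example file for the exact calls:

```python
import fluent_tfx as ftfx

# NOTE: all method names here are assumptions paraphrased from the
# project's examples/usage_guide/simple_e2e.py, not a verified API listing.
pipeline = ftfx.PipelineDef(name="simple_e2e") \
    .with_sqlite_ml_metadata() \
    .from_csv(uri="./data") \
    .generate_statistics() \
    .infer_schema() \
    .preprocess(preprocessing_fn) \
    .train(run_fn) \
    .evaluate_model(eval_config=eval_config) \
    .push_to(relative_push_uri="serving") \
    .build()  # returns a vanilla tfx Pipeline object
```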

Runners

There is no extra effort required to run the pipeline on different runners that TFX supports (with Apache Beam), nor extra dependencies required: PipelineDef produces a vanilla TFX pipeline.
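Since the result is a vanilla TFX pipeline, running it locally with the stock Beam runner is one line (the `pipeline` variable is assumed to hold the built pipeline object):

```python
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

# Any stock TFX runner can execute the built pipeline unchanged.
BeamDagRunner().run(pipeline)
```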

However, if you are using ftfx utilities inside your pipeline functions, be sure to include this package in the requirements.txt Beam argument.

Appendix: Degrees of Freedom and Limitations

Custom components are supported to a large extent, but there will still be some edge cases that only work with the verbose, plain old TFX API.

Assumptions are related to component DAG wiring, paths and naming.

Paths

  • PipelineDef needs a pipeline_name and an optional bucket path.
  • Binary/Temporary/Staging artifacts are stored under {bucket}/{name}/staging
  • Default ml metadata sqlite path is set to {bucket}/{name}/metadata.db unless specified otherwise
  • bucket defaults to ./bucket
  • Pusher’s relative_push_uri will publish the model to {bucket}/{name}/{relative_push_uri}

Component IO and Names

  • An input, or an example_gen component, provides .tfrecords (probably in gzipped format) to the next components
  • Fluent TFX follows the TFX naming of default components for everything. When providing custom components, make sure that inputs and outputs are on par with TFX.
  • For example, your custom example_gen component should have a .outputs['examples'] attribute
  • When using extra components from input_builders, make sure that the names you provide do not override defaults, such as the standard TFX component names in snake_case, {name}_examples_provider and user_{x}_importer.

Component Wiring Defaults

  • If a user-provided schema URI is given, it will be used for data validation, transform, etc. The schema generation component will still produce artifacts if declared
  • If the user did not provide a model evaluation step, it will not be wired to the pusher
  • The default training input source is the transform outputs. The user can ask for the raw tf records instead
  • If hyperparameters are specified and a tuner is declared, the tuner will still test configurations and produce hyperparameter artifacts, but the provided hyperparameters will be used

Thanks for reading all the way to the end! If this production ML pipelines thing seems new to you, please read the TFX guide.
