What I Learned From Attending Tecton's apply() Conference

Last week, I attended apply(), Tecton’s first-ever conference that brought together industry thought leaders and practitioners from over 30 organizations to share and discuss ML data engineering’s current and future state. The complexity of ML data engineering is the most significant barrier between most data teams and transforming their applications and user experiences with operational ML.

In this long-form blog recap, I will dissect content from 23 sessions and lightning talks that I found most useful from attending apply(). These talks cover everything from the rise of feature stores and the evolution of MLOps, to novel techniques and scalable platform design. Let’s dive in!

I — Feature Stores

1 — Productionizing ML with Modern Feature Stores

Let’s start with a keynote talk from Mike Del Balso and Willem Pienaar, CEO and Tech Lead at Tecton.

Challenges in Operationalizing ML

Why is it hard to productionize ML in the modern data stack?

The modern data stack is built around cloud-native data platforms — either your own data warehouses or next-generation data lakes. These modern data platforms are revolutionizing the data analyst role — enabling analysts to centralize business data, reliably clean and aggregate it, and refine it into higher-value versions that can be used for analytics. In other words, analytics becomes self-serve, near zero-maintenance, and scalable.

What about ML? ML has two environments — the training environment where we build our models and the serving environment where we deploy them. Both need access to feature data, and that data must be consistent across the two environments.

  • For model training, ideally, we can use the same data from the data warehouse that the analytics team has refined. However, the data used in production is not always available in the data warehouse (perhaps for security reasons). Furthermore, the data that sits in your warehouse has often already been transformed in a way that may not be representative of what your model will see in production. This introduces data consistency issues.

  • For model serving, your system often can’t handle real-time serving or support streaming data. Furthermore, the production team is sometimes unwilling to take dependencies on an analytics warehouse with minimal governance.

Data consistency, management, and access are the hardest problems in MLOps. ML teams need tools and workflows to iterate quickly on features and have them available in production for training and inference. A common workaround is to rebuild the offline data pipelines online: ML teams take the transformations from the data warehouse and rebuild them in a streaming or operational ETL environment. This process is slow, painful, and error-prone. It requires a lot of engineering time, and the data scientists do not even own their models in production.

How can we make productionizing ML as fast and easy as building a dashboard on a warehouse?

Feature Stores

Feature stores are built precisely for this reason. They are the hub for a data flow and an ML application. They ensure that:

  1. Transformations are consistently applied across environments.

  2. Datasets are organized for ML use cases.

  3. Features are made accessible for offline training and online inference.

  4. Data is monitored and validated.

  5. The workflow to production is simple and fast.

There are five main capabilities in a modern feature store:

  1. Serve: delivering feature data to ML models. On the training side, we need to train on historical examples, so it’s crucial to represent what a feature value looked like at a specific time in the past. On the serving side, your model needs fresh feature data served in real time and at scale. (A short sketch of this point-in-time logic follows the list.)

  2. Store: containing an online and an offline storage layer. The online layer has the freshest value of each feature and powers the real-time serving of your model. The offline layer contains all the historical values of each feature so that you can go back in time for assembling training datasets that represent historical examples.

  3. Transform: converting raw data into feature data. This happens by orchestrating transformation jobs on your existing infrastructure. The feature store handles two things: (1) running the feature pre-computation and (2) handling the smart feature backfilling.

  4. Monitor: ensuring data quality. The feature store organizes all the important data for your model operations, monitoring, and debugging. A feature store is also pluggable with external monitoring and observability systems. Lastly, a feature store can be extended to measure operational metrics (such as serving latency).

  5. Discover: a feature store has a registry — a single source of truth of features in an organization. A registry contains all the metadata definitions to discover new features and share/reuse existing features for users. Thus, the feature store can become like a data catalog of production-ready signals, which removes the cold-start problem for testing new data science ideas.
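
To make the point-in-time correctness idea behind the Serve and Store capabilities concrete, here is a minimal sketch (mine, not from the talk) using pandas: for each training label we want the latest feature value known before the label's timestamp, which `merge_asof` expresses directly. Table and column names are illustrative.

```python
import pandas as pd

# Historical feature values from the offline store (illustrative data).
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2021-04-01", "2021-04-10", "2021-04-05"]),
    "orders_last_30d": [3, 5, 1],
})

# Training labels with the time each example was observed.
labels = pd.DataFrame({
    "user_id": [1, 2],
    "label_time": pd.to_datetime(["2021-04-12", "2021-04-07"]),
    "churned": [0, 1],
})

# Point-in-time join: take the latest feature value at or before label_time,
# so no information from the future leaks into training.
training_df = pd.merge_asof(
    labels.sort_values("label_time"),
    features.sort_values("event_time"),
    left_on="label_time",
    right_on="event_time",
    by="user_id",
)
print(training_df)
```

This is exactly the join a feature store's offline layer automates for you at scale, instead of you hand-rolling it per model.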

Here are the major benefits of using feature stores:

  1. Modern feature stores are lightweight. They have a small footprint by reusing the existing data infrastructure inside your cloud environment.

  2. Modern feature stores are incrementally adoptable. You don’t need to rewrite your existing data pipeline to begin using feature stores. They are built to work alongside your infrastructure. As your pipeline grows, you can take advantage of the feature retrieval interface across your models that feature stores provide.

Willem Pienaar — Productionizing ML with Modern Feature Stores <Tecton apply() 2021>

Feast 0.10 is the fastest way to serve features in production. It provides built-in training dataset generation with point-in-time correctness to avoid feature leakage, online serving of your data at low latency, and support for Google Cloud Platform (BigQuery as the offline store and Firestore as the online store). Here are the key features of this release:

  1. Zero configuration: You can deploy a feature store without any custom configuration in seconds.

  2. Local mode: You can test your end-to-end development workflow locally from either an IDE or a notebook.

  3. No infrastructure: You do not need to deal with Kubernetes, Spark, or serving APIs.

  4. Extensible: You can extend Feast to deploy into your own stack.
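
For a feel of what that looks like in code, here is a rough sketch of the Feast 0.10-era workflow (the feature view and entity names are illustrative, modeled on Feast's driver-stats example, and some argument names changed across Feast versions, so check the docs for your release):

```python
import pandas as pd
from feast import FeatureStore

# Entities and timestamps we want point-in-time correct features for.
entity_df = pd.DataFrame({
    "driver_id": [1001],
    "event_timestamp": [pd.Timestamp("2021-04-12", tz="UTC")],
})

# Point the SDK at a local feature repo created with `feast init` and
# registered with `feast apply` -- no extra infrastructure needed in local mode.
store = FeatureStore(repo_path=".")

# Offline: build a leakage-free training dataset (BigQuery or local files).
# Note: Feast 0.10 called this argument `feature_refs`; later versions use `features`.
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:avg_daily_trips"],
).to_df()

# Online: fetch the freshest values (e.g., from Firestore) at serving time.
online_features = store.get_online_features(
    features=["driver_hourly_stats:avg_daily_trips"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```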

To quickly recap:

  • Tecton is an enterprise feature store with advanced functionality, like transformations, access control, a UI, hosting, and production SLAs.

  • Feast is an entirely open-source production-grade feature store that you can get started with today.

Both tools are working towards a common standard for feature stores. The release of Feast 0.10 is the first step towards that vision. Over time, they will converge their APIs and unify how features are defined.

2 — Redis as an Online Feature Store

Redis is an open-source in-memory database that supports a variety of high-performance operational, analytics, and hybrid use cases. More broadly, Redis can also be used to power ML models or to function as an AI serving and monitoring platform. Taimur Rashid (Redis) discussed how that works in practice.

Taimur Rashid — Redis as an Online Feature Store <Tecton apply() 2021>

As observed in the figure above:

  • On the left side are data orchestration tools, including open-source tools, MLOps and feature store providers, and the big cloud platforms. All of them can interact effectively with a feature store.

  • On the right side, a feature store can be broken down into data storage, data serving, and data monitoring. Redis has been used as an online feature store (for instance, at DoorDash) to interact with the feature registry.

  • Additionally, the Redis Labs team has worked to bring inference closer to the data. The RedisAI module adds native data types (tensors, models, etc.) and combines them with backend runtimes (TensorFlow, TensorFlow Lite, Torch, ONNX, etc.). This combination lets developers maximize computation throughput while still adhering to the principle of data locality.

Some Redis users even extend Redis into a model store (model binaries and metadata). Coupled with monitoring capabilities, they can look at concept drift and model drift, where Redis serves as the evaluation store.
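
As a minimal illustration of the online-feature-store pattern (my sketch, not DoorDash's actual implementation; the key layout and feature names are made up), feature values for an entity can be kept in a Redis hash and fetched in one round trip at inference time:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Write the freshest feature values for a consumer, keyed by entity id.
r.hset("features:consumer:1001", mapping={
    "orders_last_7d": 3,
    "avg_basket_value": 27.4,
})

# At inference time, fetch all features for the entity with a single HGETALL.
raw = r.hgetall("features:consumer:1001")
features = {k.decode(): float(v) for k, v in raw.items()}
print(features)  # {'orders_last_7d': 3.0, 'avg_basket_value': 27.4}
```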

3 — Feature Stores at Tide

Tide is a leading business financial platform on a mission to save businesses time and money so they can get back to doing what they love. The company has been using ML for use cases such as handwritten digit recognition (on receipts), credit-worthiness classification, and payment matching. However, there are a few problems that make it hard for Tide to use ML:

  1. Data was not accessible in real-time: While they did have a data warehouse, data ingestion and transformation took hours, making it unsuitable for many potential ML applications.

  2. Business metrics were a function of both rules and ML systems: Many applications they wanted to build were in the Risk and Compliance space, requiring explicit rules to be deployed and served.

  3. Deployments often caused unexpected business outcomes: Frequently, deployments led to critical changes in business metrics, causing several rollbacks.

Hendrik Brackmann shared how Tide’s ML team tackled these problems by building a feature store, building out a feature layer to serve rules, and implementing shadowing functionality. As seen below, the offline and online architectures were completely separate.

Hendrik Brackmann — Feature Stores at Tide <Tecton apply() 2021>

However, there are three significant problems with this initial architectural design:

  1. Features were only computed based on recent stream data: While they did have a data warehouse, data ingestion and transformation took hours, making it unsuitable for many potential ML applications.

  2. Training data pipelines and production pipelines were separate, causing bugs: Many applications they wanted to build were in the Risk and Compliance space, requiring explicit rules to be deployed next to the model.

  3. Handover between engineers and data scientists was slow: Engineers built real-time pipelines, and data scientists built training pipelines. Handover caused significant delays for them.

Tide decided to partner with Tecton to address these problems:

  • In the online architecture, all the data sources flow into Tecton, which is also connected to a data warehouse. That data then travels from Tecton to various prediction services, which provide predictions for downstream usage.

  • In the offline architecture, the training data can be created using the same pipeline as in online serving. As seen below, Tecton serves real-time training data to the notebook environments.

Hendrik Brackmann — Feature Stores at Tide <Tecton apply() 2021>

Looking forward, the Tide ML team will consider:

  1. Platformization of new model deployments: At the moment, they invest significant time to do a first model deployment on generic parts of the infrastructure.

  2. Standardization and expansion of their ML monitoring capabilities: At the moment, they ingest their prediction data into their data warehouse and build reports on top of it. They aim to automate this part and apply more distributional tests.

  3. Reusability of external data sources for operational ML: At the moment, their external data sources are closely tied to specific use cases through their freshness requirements — something Tide aims to solve going forward.

4 — Feature Stores at Spotify: Building and Scaling a Centralized Platform

Over 345M Spotify users rely on Spotify’s great recommendations and personalized features in 170 different markets worldwide (with 85 of these markets launching in the first part of 2021). Spotify built these great recommendations, unsurprisingly, with data and ML. But with the massive inflows of data and complexity of production use cases (60M tracks, 1.9M podcasts, 4B playlists), defining a unified approach to ML is challenging.

Aman Khan gave an overview of the challenges his team faced building a central ML platform at a highly autonomous organization, and their approach of adoption by incentive. In 2020, 50 ML teams trained 30k models and served an average of 300k prediction requests per second on the ML platform, increasing Spotify’s overall ML productivity by 700%.

Features at Spotify are data representations in the form of key-value pairs (e.g., how much a user likes a certain artist).

  • From a technical perspective, they perform online inference for various use cases that rely on dynamic, near real-time features. Thus, when ML engineers build features, there is an inherent complexity in bridging offline and online features.

  • From an organizational perspective, because of Spotify’s highly autonomous engineering culture, engineers often move fast and do not realize that some features are common across use cases. They end up recreating and serving the same feature, which is costly in engineering time and data duplication.

One year ago, the ML platform team had effectively built a collection of data libraries to help with data utility tasks. Beyond that, there was no central strategy on features and no opinionated workflow, which caused engineering pain.

Now, they have built Jukebox, a collection of Python and JVM components that help manage data throughout the ML user journey. Jukebox helps with collecting, loading, and reading features during model training and serving.

  • The starting point for Jukebox is the feature registry, a single source of truth for features. Users can explore the registry via the feature gallery’s UI components, searching for and discovering features they might reuse in their models.

  • Jukebox has two components during feature preparation: Converter (to convert Avro and BigQuery datasets into TFRecords) and Collector (to select and join features from upstream data end-points).

  • Jukebox has two components during feature serving: Loader (to load TFRecords or Protobuf into BigTable or BigQuery) and Online Reader (to fetch registered features by name for online serving).

Aman Khan — Feature Stores at Spotify: Building and Scaling a Centralized Platform <Tecton apply() 2021>
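
Jukebox itself is internal to Spotify, but the “fetch registered features by name” pattern the Online Reader implements looks roughly like the following sketch (the `OnlineReader` class, store layout, and feature names here are hypothetical, not Spotify's actual code):

```python
from typing import Dict, List


class OnlineReader:
    """Hypothetical online reader backed by a key-value store (e.g., Bigtable)
    that serves features registered in a central feature registry."""

    def __init__(self, kv_store: Dict[str, Dict[str, float]]):
        self.kv_store = kv_store  # entity_id -> {feature_name: value}

    def fetch(self, entity_id: str, feature_names: List[str]) -> Dict[str, float]:
        row = self.kv_store.get(entity_id, {})
        # Only return the features that were requested by their registered names.
        return {name: row[name] for name in feature_names if name in row}


# Usage: the model asks for features by name at serving time.
reader = OnlineReader({"user:42": {"artist_affinity.metal": 0.87, "podcast_minutes_7d": 140.0}})
print(reader.fetch("user:42", ["artist_affinity.metal"]))
```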

Going forward, Aman’s team will focus on expanding the Jukebox API:

  • The functionality will be exposed as a service API: this includes preparing training datasets and creating feature sets for online/offline inference.

  • The service will automatically convert between formats, register Bigtable + GCS locations in the feature registry, join datasets, and compute feature statistics.

  • Inherent benefits include centralized management of features and enhanced user experience.

Ultimately, Jukebox is part of the ML platform team’s larger strategy to develop a feature marketplace — saving Spotifiers time, money, and toil through feature reuse and enhanced feature management. This marketplace will be built on top of Spotify’s API strategy and infrastructure, platformize the interface between producers and consumers, and enhance trust with information (statistics, lineage, ownership).

Aman ended the talk with three tactical lessons from his Jukebox journey:

  • Listen to your customers: Use cases can appear highly specific. Try to identify overlapping parts that are painful and do not have a good solution, then prioritize them first. If you pick the proper use cases, you will likely have backend engineers who want to collaborate on those solutions.

  • Feature stores are an evolving space: The buy-vs-build tradeoff becomes less clear as your ML platform matures. Early-stage companies should consider how far they can go with off-the-shelf open-source options.

  • Think about scale: Having a fragmented strategy in different parts of the company leads to challenges in adopting data management. Find opportunities for long-term partnerships early on.


II — The Evolution of MLOps

5 — The Next Generation of Data Analytics Systems

Wes McKinney (Ursa Labs) dug deep into the next generation of data analytics systems and presented his current focus on Apache Arrow.

Modern Computing Silos

There are three different kinds of modern “computing silos”:

  1. Data: How is data stored? How is data accessed? How does data flow around the system?

  2. Compute Engines: systems that access data, perform queries, engineer features, and build analytics/ML products.

  3. Languages: languages let us interface with the data and compute engines.

Here are common symptoms of data silos:

  • Many workloads are dominated by serialization overhead.

  • Data access and data delivery performance are inconsistent.

  • Data is “held hostage” by a vertically integrated system or stored in an open format under “pseudo-lock-in.”

  • Many systems operate on stale data awaiting ETL jobs.

Here are common symptoms of compute engine silos:

  • Compute performance is highly variable and inconsistent.

  • The hardware is frequently under-utilized.

  • Hardware innovations are slow to be utilized.

  • “Engine lock-in” may add burdensome constraints in the “data” and “language” layers.

Here are common symptoms of language silos:

  • There is a lack of code sharing and collaboration across languages and tools.

  • Using a preferred language (e.g., Python or R) can cause massive performance penalties.

A Computing Utopia

Wes McKinney — The Next Generation of Data Analytics Systems <Tecton apply() 2021>

In an idealized computing utopia:

  1. Programming languages are treated as first-class citizens: we can choose the language that best suits the way we want to work, so we don’t face the penalties around data access or feature support.

  2. Compute engines become more portable and usable: we can utilize hardware accelerations (GPU innovation, specialized computing chips) and benefit everyone.

  3. Data is accessed efficiently and productively everywhere you need it.

How can we manifest such a utopia in the real world?

To build better compute engines, we want:

  • Faster benefits from hardware improvements.

  • Access to better hardware heterogeneity (e.g., hybrid CPU/GPU execution).

  • Collaboration on core algorithms.

  • High efficiency at the small and large scale.

To have better language integration:

  • Extending systems with user-defined code should incur zero overhead due to serialization.

  • Compute engines should avoid requiring anything more complex than C FFI to use.

To get better data, we want:

  • Open standards for storage and large-scale metadata.

  • Standardized high-bandwidth protocols for data on the wire.

  • Reduced coupling to any particular compute engine or platform.

Apache Arrow

Wes McKinney — The Next Generation of Data Analytics Systems <Tecton apply() 2021>

Apache Arrow is an open-source community project launched in 2016 that sits at the intersection of database systems, big data, and data science tools. Its purpose is to enable language-independent open standards and libraries to accelerate and simplify in-memory computing. As a user, you can (1) access and move tabular data, (2) build portable and optimized compute engines that use Arrow internally, and (3) integrate with languages in your tech stack with almost no overhead.
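
For a feel of the library side, here is a small pyarrow sketch (the data is illustrative): build an Arrow table, persist it in an open storage format, and hand the same columnar data to pandas without writing any custom serialization code.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a language-independent, columnar in-memory table.
table = pa.table({
    "user_id": pa.array([1, 2, 3], type=pa.int64()),
    "score": pa.array([0.9, 0.4, 0.7], type=pa.float64()),
})

# Persist with an open storage format...
pq.write_table(table, "scores.parquet")

# ...and hand the same representation to pandas (or R, Spark, etc.)
# without bespoke converters.
df = pq.read_table("scores.parquet").to_pandas()
print(df.head())
```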

Since its launch, there have been 20 major releases, over 600 unique contributors, over 100M package installs in 2020 alone, and 11 programming languages represented. Furthermore, Arrow has increasingly been adopted on both the data front (Spark, BigQuery, Parquet, Azure, R, pandas, Athena, Snowflake, DuckDB, TF Extended) and the computing front (Dremio, NVIDIA RAPIDS, blazingSQL, DataFusion, NoisePage, vaex.io, Cylon). The eventual goal of the Apache Arrow project is to become the world’s de-facto data fabric for tabular data.

6 — Third Generation Production ML Architectures

This talk from Waleed Kadous (Anyscale) looks at the evolution of production ML architectures: ML systems that are deployed in production environments, typically operate at considerable scale (trained on large volumes of data and making many inferences), and require distributed training and distributed inference.

GPU Programming Architectures

The talk started with the evolution of GPU programming architectures.

The first generation is a fixed function pipeline (OpenGL 1.0, Direct3D). This pipeline goes from the input to the display via many steps, including transform & lighting, view & clipping, polygon filling, and pixel operations.

This first generation enabled amazing hardware-accelerated graphics and massively expanded possibilities, leading to the 3D gaming revolution as creators gained flexibility from their input choices (textures, meshes, etc.). However, fixed-function pipelines could not produce complex effects and lacked the flexibility to create more advanced features.

Waleed Kadous — Third Generation Production ML Architectures <Tecton apply() 2021>

The second generation encodes programmability into the pipeline (Direct3D 10, OpenGL 2.0). The pipeline now has some programmability for different shaders and even its own C-like shading languages (GLSL and HLSL). However, it was still hard to conform to the existing pipeline, especially the interchange formats for each stage (vertices, textures, pixels, etc.).

The third generation provides complete programmability (Cg, OpenCL) without a fixed pipeline at all. This generation of GPU architectures can do everything the first and second generations can. The interface is a programming language, which led to an explosion of applications, including the libraries on which deep learning is built (cuDNN, Caffe, Torch, PyTorch!).

Another critical change is that the focus shifts to libraries. Originally, games were written with Direct3D or OpenGL. But this shift enabled people to use an engine like Unity or Unreal Engine 4 instead, which opened up the power of GPUs to a vast number of users.

Production ML Architectures

There are surprising similarities between GPU programming and production ML architectures. Let’s look at Uber’s ML architecture, for instance:

The first generation is a fixed pipeline called Michelangelo, as described in a 2017 blog post. This pipeline includes data collection, model training, model evaluation, and model deployment phases.

The second generation of Michelangelo (as described in a 2019 blog post) enforces a workflow and operator framework on top of common orchestration engines for the flexibility of composing custom, servable pipeline models.

Waleed Kadous — Third Generation Production ML Architectures <Tecton apply() 2021>

By analogy, what would the third generation look like?

  • It can do everything that the first and second generations can.

  • The interface is a programming language.

  • Could it lead to an explosion of applications?

  • The key change is the focus on libraries, where the compute engine is just a detail, which opens up the power of ML to a vast number of users.

Ray

Waleed made the case that Ray belongs to this class of third-generation production ML architectures.

  1. It is a simple and flexible framework for distributed computation: a simple annotation makes functions and classes distributable, and new distributed function calls and class instances can be created flexibly without any batching required.

  2. It is a cloud-provider independent compute launcher/autoscaler.

  3. It is also an ecosystem of distributed computation libraries built on top of the initial API.

Waleed Kadous — Third Generation Production ML Architectures <Tecton apply() 2021>

The third generation will tackle the problem of production ML architectures by making them programmable. Furthermore, it moves the focus to libraries instead of worrying about distributed computation and underlying clusters.

Ray users have already applied it to a variety of use cases: from simpler ways to build first- and second-generation pipelines and parallelizing high-performance ML systems, to building applications that make ML accessible to non-specialists and taking on ML projects that don’t fit the standard ML pipeline.
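
To ground the “the interface is a programming language” claim, here is a minimal Ray sketch: a plain Python function becomes a distributed task with one decorator, and the cluster details stay out of the application code (the workload itself is a stand-in).

```python
import ray

ray.init()  # starts a local cluster; the same code can target a remote one


@ray.remote
def score(batch):
    # Placeholder for real feature processing or model inference.
    return [x * 2 for x in batch]


# Fan the work out across the cluster and gather the results.
futures = [score.remote(list(range(i, i + 3))) for i in range(0, 9, 3)]
print(ray.get(futures))
```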

To learn more about Ray, check out the Ray summit coming up on June 22–24.

7 — Real-Time Machine Learning

Chip Huyen (Stanford) gave a talk covering the state of real-time machine learning in production, based on her blog post last year.

She first defined the two levels of real-time machine learning:

  • Level 1 is online predictions, where the system can make predictions in real-time (defined in milliseconds to seconds).

  • Level 2 is online learning, where the system can incorporate new data and update the model in real-time (defined in minutes).

Online Predictions

Latency matters a lot in online predictions. A 2009 study from Google shows that increasing latency from 100 ms to 400 ms reduces the number of searches by 0.2% to 0.6%. Another 2019 study from Booking.com shows that a 30% increase in latency costs about a 0.5% decrease in conversion rate. The crux is that no matter how great your models are, users will click on something else if predictions take just milliseconds too long.

In the last decade, the ML community has gone down the rabbit hole of building bigger and better models. However, they are also slower as inference latency increases with model size. One obvious way to cope with longer inference time is to serve batch predictions. In particular, we (1) generate the predictions in batches offline, (2) then store them somewhere (SQL tables, for instance), and (3) finally pull out pre-computed predictions given users’ requests.

However, there are two main problems with batch predictions: (1) The system needs to know exactly how many predictions to generate, and (2) The system cannot adapt to changing interests.

Online predictions can address these problems because (1) the input space can be infinite and (2) dynamic features can be used as inputs. In practice, online prediction requires two components: fast inference (models that can make predictions in the order of milliseconds) and a real-time pipeline (one that can process data and serve models in real time).

There are three main approaches to enable fast inference:

  1. You can make models faster by optimizing inference for different hardware devices (TensorRT, Apache TVM).

  2. You can make models smaller via model compression techniques such as quantization, knowledge distillation, pruning, and low-rank factorization (check out this Roblox article on how they scaled a BERT model to serve 1+ billion daily requests on CPUs); a minimal quantization sketch follows this list.

  3. You can also make the hardware more powerful, both for training/inference steps and cloud/on-device devices.
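
As one concrete instance of approach 2, PyTorch ships dynamic quantization that shrinks a trained model's linear layers to int8 with a single call. This is a rough sketch with a stand-in model, not the Roblox BERT setup:

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be a trained transformer or MLP.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2))

# Dynamically quantize the Linear layers' weights to int8 for faster CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

example = torch.randn(1, 128)
print(quantized(example).shape)  # same interface, smaller and faster on CPU
```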

A real-time pipeline requires quick access to real-time features. The most practical approach is to store them in streaming storage (such as Apache Kafka or Amazon Kinesis) and process them as they arrive. Additionally, we also need to process the static data (in formats like CSV or Parquet) in batches.

Chip Huyen — Machine Learning Is Going Real Time <Tecton apply() 2021>

A model that serves online predictions would need two separate pipelines for streaming data and static data. This is a common source of errors in production when two different teams maintain these two pipelines.

  • Traditional software systems rely on REST APIs, which are request-driven. Different micro-services within the systems communicate with each other via requests. Because every service does its own thing, it’s difficult to map data transformations through the entire system. Furthermore, debugging it would be a nightmare if the system goes down.

  • An alternative approach to the above is the event-driven pub-sub way, where all services publish and subscribe to a single stream to collect the necessary information. Because all of the data flows through this stream, we can easily monitor data transformations.
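
To make the streaming side concrete, here is a small sketch using the kafka-python client (the topic name, event fields, and windowing logic are illustrative, and kafka-python is just one of several clients): consume click events from the stream and maintain a fresh per-user feature that an online model could read.

```python
import json
from collections import defaultdict, deque
from time import time

from kafka import KafkaConsumer

# Subscribe to the click event stream (topic and broker address are illustrative).
consumer = KafkaConsumer(
    "click_events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Simple real-time feature: clicks per user over the last 10 minutes.
clicks = defaultdict(deque)
WINDOW_SECONDS = 600

for event in consumer:
    user_id = event.value["user_id"]
    now = time()
    clicks[user_id].append(now)
    while clicks[user_id] and now - clicks[user_id][0] > WINDOW_SECONDS:
        clicks[user_id].popleft()
    # This count could now be written to the online store for the model to read.
    print(user_id, "clicks_last_10m =", len(clicks[user_id]))
```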

There are several barriers to stream processing:

  1. The first one is that companies don’t see the benefits of streaming (maybe the system is not scalable, maybe batch predictions work fine, or maybe online predictions are unpredictable).

  2. The second one is the high initial investment in infrastructure. Switching from batch to online streaming is a monumental task.

  3. The third one is a mental shift, especially for academically-trained engineers used to the batch mode.

  4. The final one is that current tools for online predictions are built on Java, which presents another hurdle to learn for Python folks.

Online Learning

There is a slight distinction between online learning and online training. Online training means learning from each incoming data point, which can suffer from catastrophic forgetting and can get very expensive. On the other hand, online learning means learning in micro-batches and evaluating the predictions after a certain period of time (whether offline or online). This is often designed in tandem with offline learning.

The most prominent use case for online learning right now is recommendation systems due to user feedback’s natural labels. However, not all recommendation systems need online learning, especially for slow-to-change preferences such as static objects. For quick-to-change preferences such as media artifacts, online learning would indeed be helpful. There are also other use cases for online learning, such as dealing with rare events, tackling the cold-start problem, or making predictions on edge devices.

However, there are a few barriers to online learning: (1) there is no epoch because each data point can be seen only once, (2) there is no convergence due to shifting data distribution, and (3) there is no static test set as data arrives continuously.

Lastly, it seems like China is doing a better job at building online learning systems than the US (check out this extensive CFDI 2019 report on the AI race). Chip offered well-reasoned guesses on why that might be the case:

  • Chinese companies have a more mature adoption of this approach (thanks to successful examples of ByteDance and WeChat).

  • Chinese companies are younger, thus having less legacy infrastructure to adopt new ways of doing things.

  • China has a bigger national effort to win the AI race (read Kai-Fu Lee’s “AI Superpowers” if you are interested in this thread).

8 — The Only Truly Difficult Problem In ML

MLOps Problems

There are so many problems in MLOps to choose from. MLOps is generally concerned with building reliable, high-quality ML systems, which are complex, interdependent, multi-stage pipelines. Fortunately, we have many techniques in computing (and, in particular, Site Reliability Engineering) for addressing these problems.

Let’s attempt to express ML production problems in terms of Service-Level Objectives:

  • Training Pipelines: We want our models to finish training on time, model configurations to be updatable, durable, and functional, and trained models to be usable (correctly formatted and of acceptable quality).

  • Data Ingestion/Processing: We want the data to be readable, correctly formatted, correctly interpreted, ingested sufficiently quickly, and only accessed by users and programs permitted to do so.

  • Storage System: Ideally, features, model snapshots, configs, and metadata all exist in a storage system. Furthermore, features can be read quickly and reliably by the training system, configs can be updated and validated, and data is durable and correct.

  • Serving System: We want queries to models to be fast and error-free. Models can be updated reasonably quickly and reliably. Query logs are created and added to the logging system.

Many MLOps products and services claim to address these production problems by automating the ML lifecycle, efficiently sharing compute resources, optimizing data storage, improving/standardizing data ingestion, providing deployment tools (to deploy to the cloud, on-prem, etc.), model registry (with model versioning and experiment tracking), and feature stores.

The truth is that most ML production problems are software operations problems.

Many techniques exist (either from SRE or software operations in general) to address these. In his talk, Todd Underwood (Google) walked through the SRE bag of tricks to see how far known approaches can go toward solving all of MLOps.

A quick refresher on the core responsibilities of an SRE:

  • Designing redundant infrastructure, treating configuration as code, ensuring distributed consensus, etc.

  • Paying attention to security controls, nonces, API-guarded access.

  • Putting an extreme focus on monitoring and observability.

  • Treating infrastructure as a fleet.

SRE Solutions To MLOps Problems

Let’s attempt to map some of the SRE solutions to the so-called MLOps problems brought up above:

  • Training Pipeline Solutions: We monitor model completion/progress at various stages, validate configs, and test models before serving.

  • Data Ingestion/Processing Solutions: We monitor data ingestion success and failure rates, run asserts on ingested data to validate format and references, validate read ACLs, run an access prober to check success/failure and log production access, and monitor access logs (a minimal assert-and-counter sketch follows this list).

  • Storage Solutions: We run a test prober periodically to insert a new model, config, and metadata, and verify them. We can also monitor read IO and failure rates.

  • Serving Solutions: We monitor success/failure rates, run update prober to validate that models quickly distribute to servers, and validate query logs’ existence and growth rate.
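
As a minimal, generic illustration of the ingestion checks above (my sketch, not Google tooling): assert-style validation plus success/failure counters that a monitoring system could scrape.

```python
from collections import Counter

REQUIRED_FIELDS = {"user_id", "event_time", "amount"}
metrics = Counter()


def validate_record(record: dict) -> bool:
    """Assert-style checks on an ingested record; failures are counted, not fatal."""
    try:
        assert REQUIRED_FIELDS.issubset(record.keys()), "missing fields"
        assert record["amount"] >= 0, "negative amount"
        metrics["ingest_success"] += 1
        return True
    except AssertionError:
        metrics["ingest_failure"] += 1
        return False


# A prober would periodically push known-good and known-bad records through
# the same path and alert if the observed failure rate drifts.
for rec in [{"user_id": 1, "event_time": "2021-04-12", "amount": 9.5}, {"user_id": 2}]:
    validate_record(rec)
print(metrics)  # e.g. Counter({'ingest_success': 1, 'ingest_failure': 1})
```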

Operations and Context-Free Decisions

Todd observes that each of these MLOps problems is solvable with (basically) context-free decisions. Those are the tractable problems in operations. We absolutely should address them first (because we know how), but there is a harder class of problems hidden in the data.

Subtle changes in the amount of data from various sources (or with specific semantic properties) cause meaningful changes in the model. We can’t make context-free decisions and need to have the whole picture, which is expensive and hard.

Both modeling failures and infrastructure failures can impact model quality. Some infrastructure failures will only show up as model quality failures: training-serving skew (feature definitions/semantics differ between training and serving), other feature definition changes, missing or corrupt or changed data dependency, and any bias (along any significant axis) in missing data.

Model quality monitoring is the only (real) end-to-end integration test of an ML system.

Best Practices

Todd concluded his talk with best practices for model stability and monitoring (most from Google’s “ML Test Score” paper):

  • Notify on dependency changes (easier said than done!).

  • Look at useful data invariants across training and serving inputs.

  • Explore the parity of computation between training and serving.

  • Keep track of model age, don’t let them get too stale (implies strong versioning).

  • The model is numerically stable and does not suddenly use way more resources for the same results (training speed, serving latency, throughput, RAM usage, etc.).

  • There is no production regression in prediction quality.


III — Scaling ML at Large Organizations

9 — Scaling Online ML Predictions to Meet DoorDash Logistics Engine and Marketplace Growth

Hien Luu and Arbaz Khan presented a genuinely entertaining and comprehensive talk about the evolution of DoorDash’s ML platform as a result of the company’s growth during the pandemic.

DoorDash Marketplace

DoorDash’s mission is to grow and empower local economies. They accomplish this through a set of products and services: (1) Delivery and Pickup where customers can order food on demand, (2) Convenience and Grocery where customers can order non-food items, and (3) DashPass Subscription that is similar to Amazon Prime. DoorDash also provides a logistics platform that powers delivery for notable merchants such as Chipotle, Walgreens, and Target.

Hien Luu and Arbaz Khan — Scaling Online ML Predictions to Meet DoorDash Logistics Engine and Marketplace Growth <Tecton apply() 2021>

The DoorDash platform is a three-sided marketplace with a flywheel effect. More consumers and more orders create more earning opportunities for dashers, which increases delivery efficiency and speed for the marketplace. More consumers and growing revenue for merchants lead to more selection for consumers to choose from. This flywheel drives growth for merchants, generates earnings for dashers, and brings convenience to consumers.

So what are the incentives for each of these groups?

  • Merchants want reach and revenue.

  • Consumers want convenience and selection.

  • Dashers want flexibility and earnings.

ML Platform at DoorDash

Considering such incentives, DoorDash builds ML use cases tailored to each of these three groups: Search & Recommendation for consumers; Selection and Ads & Promos for merchants; and Acquisition & Mobilization and Positioning for dashers. Additionally, ML powers the logistics platform for Dispatch, Prep & Travel Time Estimates, and Supply/Demand Balancing.

DoorDash’s internal ML platform is the central hub that powers these use cases.

  • The platform is based on these five pillars: feature engineering, model training, model management, model prediction, and insights.

  • The core principle that drives the platform journey is to think big but start small.

  • In addition, the platform strategy includes velocity (how to accelerate ML development), ML-as-a-Service (how to make the models reusable), and ML observability (how to catch and prevent features that might lead to problems).

  • One of DoorDash’s cultural values is being customer-obsessed. The platform architecture, strategy, and success metrics are all designed with that in mind.

Scaling ML Online Predictions

How did the DoorDash team prepare for reliable business growth and meet all the scalability expectations?

Hien Luu and Arbaz Khan — Scaling Online ML Predictions to Meet DoorDash Logistics Engine and Marketplace Growth <Tecton apply() 2021>

The basic design of any ML system includes four components: a model store, a feature store, a prediction store, and metrics. DoorDash needed to scale their system because they wanted more models deployed, expected existing models to get hit more often, and had to keep up with both organic and incidental product growth. In particular, they anticipated an increasing number of computations, higher feature lookup volume and cardinality, larger data volumes per prediction, and more models.

Arbaz then walked through an entertaining journey of how DoorDash redesigned its system to make scalable online predictions in the past year:

  • The initial platform was built in March 2020 to serve 2 models and 1k predictions per second. By June 2020, it ramped up to 16 models and 15k predictions per second. Around this time, they started to see the effect of cannibalization, where some models consume more resources than others.

  • Round 1 — Divide and Defend: Given a large number of requests, they created a sharding prediction framework to separate different prediction microservices and their respective feature stores. This enabled them to ramp up to 20 models and 130k predictions per second. However, by September 2020, one of their shards hit the AWS limits, and they could not scale horizontally any further.

  • Round 2 — Beefing Up: They decided to run some exercises on the fat shard to make it leaner and more powerful. They accomplished that with microservice optimizations (load testing, latency profiling, parameter tuning, and runtime optimizations) and feature store optimizations (benchmarking and schema redesign). By October 2020, their system went from serving 1M predictions per second with 28 models to 2M predictions per second with 44 models.

  • Round 3 — Infrastructure Breaks: By December 2020, DoorDash infrastructure hit its limits: their Splunk quota was exceeded, their Wavefront metrics limit was breached, they got blocked by Segment for sending “too many” events, and high service-discovery CPU usage threatened a total outage. They tackled these issues one by one: only log essential input samples (for storage), move from statsd to Prometheus (for observability), use in-house Kafka streaming instead of Segment (for serving), and reduce the number of discoverable pods by scaling them vertically (for distributed training). By January 2021, the DoorDash platform had served 6.8M predictions per second with 38 models.
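
The “Divide and Defend” sharding in Round 1 can be pictured with a toy router (entirely hypothetical, not DoorDash's code): each use case is pinned to its own prediction service and feature store so that heavy models cannot cannibalize the rest.

```python
# Hypothetical routing table: use case -> dedicated prediction shard.
SHARDS = {
    "eta": "http://predictor-eta.internal/predict",
    "search_ranking": "http://predictor-search.internal/predict",
    "fraud": "http://predictor-fraud.internal/predict",
}


def route(use_case: str) -> str:
    """Return the prediction endpoint for a use case; unknown use cases fail fast
    instead of landing on (and overloading) a shared default shard."""
    try:
        return SHARDS[use_case]
    except KeyError:
        raise ValueError(f"no shard registered for use case: {use_case}")


print(route("eta"))
```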

Here are three tactics that stand out from DoorDash’s scaling manual:

  1. Isolate use cases whenever possible.

  2. Reserve time to work on scaling up rather than scaling out.

  3. Pen down infrastructure dependencies and implications.

For future work, DoorDash will be looking at:

  • Caching solutions for feature values (instead of prediction values) to enable more microservice optimizations.

  • Generalized model serving for use cases in NLP and Image Recognition.

  • Unified prediction client that allows easy prediction requests.

10 — Supercharging Data Scientists’ Productivity at Netflix

Ravi Kiran Chirravuri and Jan Forjanczyk from Netflix dove into a specific data science use case at the company and the Metaflow solution built for data scientists’ pain points.

Content Demand Modeling

Data science use cases are prevalent at Netflix. This talk looks at the work of the Content Demand Modeling team, which uses sophisticated graph embedding techniques to represent Netflix titles. They answer questions such as “which titles are similar to which other titles and in what way?” and “how large is the potential audience of a show in a given region?”

Three things need to be satisfied for this team during a product launch in 2020:

  • They need to iterate quickly in a 2-week time frame with a long backlog of feature ideas to try.

  • They have varying levels of context, where they work with domain-specific ETL and table schemas. Some team members are closer to ML engineering, while others are closer to ML.

  • They want a fast path to production: they needed to push scores to the existing ecosystem of internal apps. What’s more, the integration tests for such pushes are costly to repeat, so they did not want to write tests multiple times.

These constraints can be visualized alternatively from the vantage point of an engineer:

  • Iterate quickly on model development and feature engineering: It’d be incredible to empower data scientists with the highest degree of freedom at this layer of the stack since there are off-the-shelf libraries. Data scientists might have different preferences depending on the task at hand.

  • For collaboration with varying levels of context, the need for experimentation makes versioning a key concern to solve. Model operations is another pertinent issue: How to keep code running reliably in production? How to monitor model performance? How to deploy a new code version to run in parallel with previous versions?

  • A fast path to production sounds simple, but taking an ML idea to production is nuanced and involves a lot of infrastructure. The data scientist has to access data from a data warehouse (a database, a data lake, etc.) and write the data required for training to a common place. They then need compute resources to load the data and train the models. A job scheduler might be necessary to orchestrate multiple units of work and run the workload consistently. Lastly, there are architecture questions to answer: How should the code be structured for execution? How do you visualize the workflow and break it down into the right atomic units of work that can run in parallel? How do you structure the Python module and package the code to run in production?

Ravi Chirravuri and Jan Forjanczyk — Supercharging Data Scientists’ Productivity at Netflix <Tecton apply() 2021>

It is noteworthy that data scientists prefer to stay at the top of the stack as much as possible. They do not want to fiddle around with software architecture or infrastructure integrations. Conversely, ML infrastructure adds more value at the bottom of the stack.

Metaflow

Metaflow is Netflix’s human-centric framework for building and managing real-life data science projects. It has been battle-tested at Netflix for over three years for an array of ML use cases related to catalog performance and content production. It beautifully packages all offerings into a simple open-source library with easy cloud integrations.

  • Architecture: Metaflow has a DAG (directed acyclic graph) architecture that breaks the ML pipeline into a workflow of steps.

  • Storage: Metaflow lets you handle artifacts as easily as writing to instance variables within each step. Everything is versioned and stored in a data store at the end of task completion. To keep the storage footprint low, the data is efficiently compressed and stored in a content-addressed fashion to avoid duplicate copies across multiple tasks.

  • Compute: Being cloud-first, Metaflow helps users achieve vertical scalability by running remotely on AWS batch with the relevant resource requirements. Specifying shards where you want to horizontally farm-off compute to multiple containers on AWS is also relatively easy.

  • Dependencies: Metaflow provides isolated, versioned, and reproducible environments for users to manage their dependencies.

  • Scheduler: Metaflow provides a reference implementation by integrating with AWS Step Functions.

  • Versioning: Metaflow supports the first-level concept of versioning, where every task execution in every run for every flow of any user is versioned. Metaflow also provides a Python client that enables collaboration between people with varying levels of context.

  • Client Access: With the same Python client, users can inspect the data within each task for monitoring or reporting purposes.

  • Resume: Since everything is versioned, stored in S3, and tagged by a single metadata service, Metaflow makes it easy to switch between local scheduling on a laptop and a production environment using Resume. This feature enables users to iterate quickly.

Ravi Chirravuri and Jan Forjanczyk — Supercharging Data Scientists’ Productivity at Netflix <Tecton apply() 2021>
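
Here is a minimal Metaflow flow to make the DAG-of-steps model concrete (a toy example, not Netflix's content demand code): artifacts assigned to `self` are versioned and persisted between steps automatically.

```python
from metaflow import FlowSpec, step


class ToyTrainingFlow(FlowSpec):

    @step
    def start(self):
        # Artifacts written to self are versioned and stored by Metaflow.
        self.data = list(range(10))
        self.next(self.train)

    @step
    def train(self):
        # Stand-in for real feature engineering / model training.
        self.model = sum(self.data) / len(self.data)
        self.next(self.end)

    @step
    def end(self):
        print("trained model:", self.model)


if __name__ == "__main__":
    # Run locally with: python toy_training_flow.py run
    ToyTrainingFlow()
```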

The content demand modeling currently uses Metaflow for:

  • Live scoring through API: Some models need user inputs, so they can expose their models through APIs for applications that use them.

  • Batch scoring models: They run about 60–70 production jobs that produce records for their data warehouse.

  • Collaborative prototyping: Project-owner mapping is many-to-many, so collaboration is frequent and necessary.

The results are overwhelmingly positive!

  1. Accuracy wins: In a 2-week sprint, they increased their model’s accuracy by about 25%.

  2. “Pick up and go”: No new domain-specific language means that their production code looks a lot like their experiments (and easier to onboard!).

  3. Low operations overhead: Versioning and dependency management gives them reliable deployment and runs.

11 — Towards a Unified Real-Time ML Data Pipeline

ML at Etsy

Etsy is the global marketplace for unique and creative goods, with a mission to keep commerce human. It’s home to a universe of special, extraordinary items, from unique handcrafted pieces to vintage treasures. By the end of 2020, the marketplace had about 81.9 million active buyers, 4.4 million active sellers, and $10.3 billion in gross merchandise sales for the year.

Given Etsy’s task to connect millions of passionate and creative buyers and sellers worldwide, where does ML help with this?

There are a variety of ML challenges unique to Etsy:

  • Etsy has a diverse and ever-changing inventory of 85 million items.

  • Etsy is driven by personalization at its core. Items can differ in just style alone. Their goal is to show users items that fit their preferences best.

  • ML models need to adapt to user preferences, which often change in real-time.

  • Personalization and real-time signals are key to surfacing relevant content.

To serve the most relevant listings to Etsy’s users, their ML models have to be heavily personalized and adapt to real-time feedback and trends. Aakash Sabharwal and Sheila Hu detailed how Etsy’s ML Platform team uses real-time feature logging to capture in-session/trending activities, builds a typed, unified feature store for sharing features across models from different domains, and serves feature data at scale with the eventual goal of powering reactive systems.

Designing a Real-Time ML Pipeline with a Feature Store

The feature store is a central part of Etsy’s ML pipeline. It is a centralized place for various types of features, providing access to both batch and real-time features across all product initiatives, thus avoiding many duplicated efforts. Batch features are generated and updated daily, usually resource-intensive, and can be calculated and aggregated over a long period (e.g., 30 days). On the other hand, real-time features are constantly updated with low latency and can be aggregated over short-term periods or ever-changing characteristic information.

Aakash Sabharwal and Sheila Hu — Towards a Unified Real-Time ML Data Pipeline <Tecton apply() 2021>

This feature store is built on a columnar NoSQL database, chosen for its dynamic schema and horizontal scaling. Each feature source is mapped to one and only one column family (referred to as a “feature family”), and each feature is mapped to a single column. There are multiple benefits to this structure:

  • A major benefit is asynchronous feature updates. Since features are grouped as families, each feature family is updated asynchronously. This process optimizes for feature freshness and facilitates easier on-call processes through feature ownership.

  • The other benefit is the typed schema used for features. The feature store enforces an explicitly typed schema, which is used to automate processes and help validate feature data for quality control. As a result, data is validated across applications so that they can send and read data between one another smoothly.

Adding new features has never been easier with the columnar structure and typed schema. Testing a new feature family is smooth in a controlled environment: a new schema registers the family in a database in a controlled sandbox for testing, and new features are validated against the schema before being uploaded.

Another important property triggered by schema updates is the feature info hub, essentially a feature metadata template. Etsy promotes better documentation of their features by automatically requesting feature owners to fill out schema-generated metadata files. These files are exposed using a web page for increased visibility.

As a result, an easily evolvable and validatable schema helps Etsy connect multiple applications and processes with ease. This goes from database control (registering new columns in the feature database) and data validation (registering new schemas in the schema registry and validating features before uploading) to feature access (constructing feature-fetching requests and generating (de)serialization functions) and documentation (generating feature metadata file templates).
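
A stripped-down sketch of the typed-schema idea (the schema format and feature names here are invented for illustration, not Etsy's actual format): declare each feature family's fields and types once, then validate uploads against it before they reach the store.

```python
# Hypothetical typed schema: feature family -> {feature name: expected type}.
SCHEMA = {
    "listing_features": {"price_usd": float, "category_id": int},
    "user_features": {"clicks_7d": int},
}


def validate(family: str, features: dict) -> None:
    """Reject unknown families, unknown features, and badly typed values
    before they are written to the feature store."""
    expected = SCHEMA[family]  # KeyError -> unregistered feature family
    for name, value in features.items():
        if name not in expected:
            raise ValueError(f"{name} is not registered under {family}")
        if not isinstance(value, expected[name]):
            raise TypeError(f"{family}.{name} expects {expected[name].__name__}")


validate("listing_features", {"price_usd": 12.5, "category_id": 42})  # passes
```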

Training and Serving with Feature Logging

Training data at Etsy is formed from two main parts: (1) attribution labels (clicks or purchases) and (2) contextual & candidate features (user and listing attributes). Training data generation is critical to an ML model’s performance: it has to capture the accurate state of past instances and efficiently meet the ever-evolving needs of model trainers.

Traditionally, training data generation can only be completed offline with batch jobs, either by (1) joining multiple feature sources separately to attribution labels or (2) joining with daily snapshotted feature stores. Both methods pose major challenges during model training:

  • They create a training-serving skew due to the inconsistent features between training and serving environments.

  • It is not easy to access real-time features.

  • There is a high cost for offline multi-joins (method 1), or adding new features is difficult (method 2).

The Etsy ML Platform team introduced real-time feature logging as a solution to these challenges. The goal is to log the features used at model inference time for training. By logging features in real time, they turned offline batch feature joins into online key-value fetching. Moving it online also makes it possible to access real-time features.

Aakash Sabharwal and Sheila Hu — Towards a Unified Real-Time ML Data Pipeline <Tecton apply() 2021>

When a request comes in, the reactive system talks with the feature service and passes the features to the model layer. Simultaneously, the reactive system triggers a request to the feature logger, so the logger can log features through Kafka and save them in storage.

Feature logging happens asynchronously after inference, so it captures exactly the features available at the time of inference with minimal discrepancy, which minimizes training-serving skew. Because the feature logger lives inside the reactive system (which talks to all essential services), it can also log information from those services. It naturally groups all features required for training and can log other data for downstream applications and analysis for explainable ML.

Since both feature logger and inferencer need to talk to the feature store at request time, the Etsy platform team also designs a smart feature selector to determine the exact list of features they want from the database. Selectors are automatically generated based on model config or feature lists, usually wrapped as a database filter or request metadata depending on the feature service.

Aakash Sabharwal and Sheila Hu — Towards a Unified Real-Time ML Data Pipeline <Tecton apply() 2021>

Schema also plays an important role in the inference and logging process. Not only is the schema used for validation at logging time; the same schema-generated functions are also used to fetch and de-serialize features for inference and training. A versioned schema gets imported into the feature logger, and serdes are shared between inference and logging (for training). Different products share the same schema for feature logging, which enables sharing of training data.

Real-Time and Personalized Applications at Etsy

The first application is In-Session Personalization with Feature Chaining. On Etsy, user preferences change in real-time. Recent interactions are a good signal of users’ preferences in real-time, while listing content is relatively static. Thus, there is a need to incorporate users’ recent interactions with entities (listings or queries) and “compose” them with static features to capture users’ current shopping mission better.

Aakash Sabharwal and Sheila Hu — Towards a Unified Real-Time ML Data Pipeline <Tecton apply() 2021>

An example architecture is shown above. A user clicks around the site, and a streaming service logs those clicks and chains them with listing features in the feature store. More concretely, the last five items clicked are chained with the item attributes/embeddings, which serve as inputs to Etsy’s personalized ranking models.
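
As a rough sketch of that chaining (the data, ids, and function are hypothetical, not Etsy's implementation): recent clicked listing ids from the stream are joined with each listing's static embedding from the feature store, and the result (here a simple average) becomes a real-time input to the ranker.

```python
from typing import List

import numpy as np

# Static listing embeddings served by the feature store (illustrative values).
listing_embeddings = {
    "listing_1": np.array([0.1, 0.9]),
    "listing_2": np.array([0.4, 0.2]),
    "listing_3": np.array([0.8, 0.5]),
}


def session_embedding(recent_clicks: List[str]) -> np.ndarray:
    """Chain recent click events with static listing features: average the
    embeddings of the last few clicked listings into one session-level feature."""
    vectors = [listing_embeddings[lid] for lid in recent_clicks if lid in listing_embeddings]
    return np.mean(vectors, axis=0)


# In production the last five clicks would come from the streaming service.
print(session_embedding(["listing_2", "listing_3"]))
```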

The second application lies in feature drift monitoring and validation. By logging features in real-time, Etsy can monitor for any errors or general feature data drift. A simple Kafka application has helped them catch bugs in event logging, cutting down the time to react and fix logging pipelines quickly. As data scientists can better understand changing data distributions affecting their models, they can craft a road towards explainable AI.

The third category includes bandit applications. Traditionally, multi-armed bandit techniques are used to dynamically allocate site traffic to different variants. Example bandit applications for search and recommendation problems include candidate selection and ranker selection. Contextual bandits are even more powerful for optimizing personalized experiences, as they incorporate “real-time context” provided by real-time feature pipelines.

Besides the three above, there are a variety of other applications at Etsy leveraging the real-time ML pipeline:

  • Sharing of Features: By having a centralized feature store, they can share features across different products and make more accurate models.

  • Training Models Across Products: A unified pipeline along with a standardized schema means they can train a model using logged data from multiple different sources (i.e., Search and Recommendations traffic).

  • Online Learning: By logging features in real-time, they can generate their training data in real-time and update their models more frequently.

If interested in following Etsy’s ML journey, be sure to read their blog “Code as Craft.”

12 — Real-time Personalization of QuickBooks using Clickstream Data

Intuit serves consumers, small businesses, and self-employed folks with products such as TurboTax, QuickBooks, and Mint. ML is used across Intuit products, ranging from search relevance and proactive help (in TurboTax) to cash flow forecasting (in QuickBooks) and transaction categorization (in Mint). Intuit's ML platform vision is to "democratize AI/ML development for anyone at Intuit by simplifying all parts of the model development lifecycle through automation and easy-to-use experiences."

Intuit ML Platform

Ian Sebanja and Simarpal Khaira — Real-time Personalization of QuickBooks using Clickstream Data <Tecton apply() 2021>

Here is a high-level overview of the Intuit ML platform:

  • For feature engineering, it uses Apache Beam, Spark, and Flink that run on Kubernetes infrastructure.

  • For feature stores, it uses Amazon’s DynamoDB and S3.

  • For model training and inference, it uses Amazon SageMaker.

  • For model monitoring and anomaly detection, it uses in-house capabilities.

Ian Sebanja and Simarpal Khaira dove deep into the real-time personalization pipeline of Intuit's ML platform, which includes two key components: (1) a featurization pipeline that computes features from streaming data and stores them online for inference; and (2) a model inference pipeline covering deployment and hosting, feature fetching and orchestration from the feature store, and model monitoring.

Ian Sebanja and Simarpal Khaira — Real-time Personalization of QuickBooks using Clickstream Data <Tecton apply() 2021>

  • At a high level, as users interact with Intuit products, a series of events is sent to the event stream (Kafka) and persisted to a data lake (S3). These events are used for real-time featurization.

  • For predictions, features are fetched from the feature store and combined with the request payload to deliver the customer experience within the products.

  • Model monitoring is vital to close the feedback loop and improve the model performance and the overall customer experience.

Featurization Pipeline

Ian Sebanja and Simarpal Khaira — Real-time Personalization of QuickBooks using Clickstream Data <Tecton apply() 2021>

The clickstream featurization pipeline is built with PySpark running on Kubernetes-based infrastructure; a minimal sketch of this streaming flow follows the list below.

  • A Kafka-based real-time messaging system collects streaming events from the users in a highly distributed fashion.

  • The feature processor is written in PySpark. Developers are provided with programming interfaces to easily operate on DataFrame objects, apply data transformations, and produce features.

  • Once produced, the features are sent to the Kafka-based ingestion topic. This topic acts as a shared message queue for both the stream writer that writes to the low-latency online store for real-time inference (DynamoDB) and the batch writer that writes to the offline store for exploration, training, and batch inference (S3).

  • The final data serving layer is a GraphQL-based web service that fetches features and combines features across feature sets to create training sets for the model training process.
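Here is a minimal sketch of what such a streaming featurization job could look like in PySpark Structured Streaming. The broker address, topic name, event schema, and sink path are assumptions, and the real pipeline would write to DynamoDB (online) and S3 (offline) rather than local Parquet.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("clickstream-featurizer").getOrCreate()

# Assumed shape of a clickstream event.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("page", StringType()),
    StructField("ts", TimestampType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker address
    .option("subscribe", "clickstream-events")          # assumed topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Example feature: number of pages viewed per user in a 10-minute window.
features = (
    events.withWatermark("ts", "10 minutes")
    .groupBy(F.window("ts", "10 minutes"), "user_id")
    .agg(F.count("page").alias("pages_viewed_10m"))
)

def write_batch(df, batch_id):
    # In the real pipeline this would fan out to an online store (DynamoDB)
    # and an offline store (S3); a local Parquet path keeps the sketch simple.
    df.write.mode("append").parquet("/tmp/offline_features")

query = features.writeStream.outputMode("update").foreachBatch(write_batch).start()
```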

Inference Pipeline

This pipeline includes three major components: model deployment and hosting, features fetching, and model monitoring.

  • Deploying and hosting the models: Model deployment is done through a self-serve user interface, with model settings available in a central dashboard. The container orchestration layer handles everything that runs on containers, ensuring that they meet SLAs and security requirements.

  • Fetching features and orchestration: The data scientist authors a contract specifying which features to fetch and which models will consume them as inputs. The contract is abstracted away from the data layer, so there is only a simple interface between data scientists and client teams. Real-time feature fetching is based on a directed acyclic graph (DAG), where each node represents a different feature-fetch strategy (model and data dependency, parallel execution, data transformation / last-mile transformation, asynchronous execution, etc.), as sketched after this list.

  • Monitoring: There are operational monitoring (out-of-the-box metrics and alerting) and model efficacy prediction monitoring (statistical, custom alerting, and model-specific).
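To illustrate the DAG-based fetching idea, here is a small, hypothetical sketch in which each node is a fetch or transform step that runs once its dependencies have completed. Intuit's actual orchestration (including parallel and asynchronous execution) is certainly more sophisticated.

```python
from typing import Callable, Dict, List

class FetchNode:
    """One step in the feature-fetch DAG: a fetch or a transformation."""
    def __init__(self, name: str, fn: Callable[[dict], dict], deps: List[str] = ()):
        self.name, self.fn, self.deps = name, fn, list(deps)

def run_dag(nodes: Dict[str, FetchNode], request: dict) -> dict:
    results, done = dict(request), set()
    while len(done) < len(nodes):
        progressed = False
        for node in nodes.values():
            if node.name in done or any(d not in done for d in node.deps):
                continue
            results.update(node.fn(results))   # independent nodes could run in parallel / async
            done.add(node.name)
            progressed = True
        if not progressed:
            raise ValueError("cycle or missing dependency in feature-fetch DAG")
    return results

nodes = {
    "profile": FetchNode("profile", lambda ctx: {"segment": "smb"}),
    "clicks": FetchNode("clicks", lambda ctx: {"pages_viewed_10m": 7}),
    # "Last mile" transformation that depends on both fetches above.
    "derived": FetchNode(
        "derived",
        lambda ctx: {"active_smb": ctx["segment"] == "smb" and ctx["pages_viewed_10m"] > 5},
        deps=["profile", "clicks"],
    ),
}
print(run_dag(nodes, {"user_id": "u1"}))
```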

13 — Evolution and Unification of Pinterest ML Platform

Pinterest is the visual discovery engine where people save and organize their Pins into collections called boards. With close to 459M monthly active users and 300B Pins saved, Pinterest has a diverse set of ML applications, including large-scale online ranking (home feed, search, related Pins, ads, visual search, etc.) and small-to-medium-scale use cases (image analysis, content signals, trust and safety, PA/DS, etc.). Large-scale online ranking requires real-time inference over tens of millions of items per second, while small-to-medium-scale use cases have less demanding inference needs (typically served in batch).

David Liu — Evolution and Unification of Pinterest ML Platform <Tecton apply() 2021>

For a while, each ML team at Pinterest independently evolved its own solutions for creating features and serving models. This led to a lot of custom infrastructure per use case, resulting in high maintenance costs and incomplete tooling. David Liu (Head of ML and Signal Platform at Pinterest) shared his team's layered approach to unifying these use cases onto shared infrastructure, in which each layer provides the technical foundation and a standardized interface for the next.

  1. The unified feature representation layer defines a standard container for storing features that is general enough for any kind of data. It separates the data type (storage format) from the feature type (interpretation). This layer dramatically simplifies the model development process because there is a standardized conversion from any format to model inputs (like PyTorch or TensorFlow tensors); a minimal sketch of such a container follows this list.

  2. The shared feature store layer extracts the entity type, entity key, and feature ID to retrieve the right features. This layer is fundamental for many downstream tasks such as backfilling training data, performing batch/online inference, and cataloging the features. Pinterest’s feature store is built on their in-house signal platform called “Galaxy,” which provides a standardized way to share signals (key-value data about pins, boards, users, ads, and other entities).

  3. Next is a standardized inference and deployment layer built around the open-source MLflow to version-control models, track training parameters and evaluation metrics, and ensure reproducibility. On top of MLflow, the Pinterest platform team builds standardized inference solutions that enable fast, code-free deployment with UI-based deployment and rollback capabilities. Interestingly, they designed three separate ways to deploy online inference models: (1) calling a standalone model server if the user wants to generate and manage features manually, (2) embedding the models in the user's service if features are already available, and (3) calling the Scorpion service, which performs managed feature fetching/caching and supports intensive ranking.

  4. The last layer is model insights and analysis, which has capabilities such as real-time feature distribution and coverage, feature importance analysis (both at the local and global level), and model rollout monitoring.
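As a rough sketch of what a unified feature representation might look like (illustrative, not Pinterest's actual container), the key idea is keeping the storage format and the interpretation as separate fields, with one standardized conversion to model inputs.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Feature:
    name: str
    data_type: str      # storage format: "float_list", "int", "sparse_ids", ...
    feature_type: str   # interpretation: "embedding", "count", "categorical", ...
    value: Any

def to_model_input(features: List[Feature], vocab_sizes: dict) -> List[float]:
    """One standardized conversion path from any stored feature to a dense vector."""
    dense: List[float] = []
    for f in features:
        if f.data_type == "float_list":
            dense.extend(float(x) for x in f.value)
        elif f.data_type == "int":
            dense.append(float(f.value))
        elif f.data_type == "sparse_ids":
            one_hot = [0.0] * vocab_sizes[f.name]
            for idx in f.value:
                one_hot[idx] = 1.0
            dense.extend(one_hot)
        else:
            raise ValueError(f"unknown data type {f.data_type!r}")
    return dense

features = [
    Feature("pin_embedding", "float_list", "embedding", [0.1, 0.2, 0.3]),
    Feature("saves_7d", "int", "count", 12),
    Feature("interests", "sparse_ids", "categorical", [1, 3]),
]
print(to_model_input(features, vocab_sizes={"interests": 5}))
```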

During this unification journey, Pinterest’s ML platform team has faced two core challenges:

  • Product goals vs. platform goals: Product teams are often under a lot of pressure to achieve their own business goals, while platform work requires slowing down in the short term in order to go faster in the long run.

  • Teams might request hyper-specific fixes (band-aids) for their current systems: These requests can lead to the platform team designing “local optima” solutions.

David ends the talk with three approaches to address the two challenges above:

  • Figuring out a technical path to establish a foundation for platform standardization and code migration.

  • Identifying bottom-up incentives that save teams from wasted engineering effort.

  • Mixing in top-down alignment so that teams under pressure have room to pursue platform unification.

14 — Scaling an ML Social Feed with Feature Pipelines

Ettie Eyre and Nadine Sarraf (Cookpad) discussed why a feature store is essential for serving ML at scale. In 2018, they launched an experiment to add ML to the ranking algorithms on the social feed of the Cookpad application. The feed service at the time was a Ruby-on-Rails application backed by Redis and PostgreSQL. The results of this experiment were promising (over a 30% increase in interactions on the feed). However, the architecture they built for the experiment did not allow them to scale beyond a limited number of users (how do you roll out to 100M users?).

Ettie Eyre and Nadine Sarraf — Scaling an ML Social Feed with Feature Pipelines <Tecton apply() 2021>

Two significant bottlenecks existed in the 2019 architecture:

  • Ranked recipes were stored in memory: As a result, there were long startup times, they had to replicate data for every service replica, and they calculated and stored ranks even for silent users.

  • Model training and feature calculation were done manually: Their ML researchers spent time on repetitive tasks. Furthermore, it was hard to share feature data between team members.

Therefore, in their next iteration, they focused on redesigning the architecture to scale to their global user base, keeping in mind all the learnings from the first experiment and incorporating a feature store into the solution.

Ettie Eyre and Nadine Sarraf — Scaling an ML Social Feed with Feature Pipelines <Tecton apply() 2021>

As observed in the new architecture above:

  • Since Cookpad's social feed requires real-time processing of millions of requests, they use DynamoDB and DAX to store and serve features (a minimal read/write sketch follows this list).

  • To populate features, they started with cron jobs run through Argo Workflows to ingest data into their feature stores. Unfortunately, these jobs consume historical data from Redshift. To use fresh content as soon as it is added to the Cookpad platform, they have gradually been updating their Kafka-based streaming system to produce and store features in near real-time.

  • These cron jobs also talk with feature extraction services (that provide user and recipe embeddings).

  • To address disaster recovery for their feature stores, they use point-in-time recovery that automatically backs up the data with per-second granularity.
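For concreteness, here is a minimal sketch of writing and reading features in DynamoDB with boto3. The table name, key, and attributes are made up, and in Cookpad's setup reads would go through DAX, which sits in front of DynamoDB as a drop-in cache; plain boto3 is shown here for brevity.

```python
from decimal import Decimal
import boto3

dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table("recipe_features")  # assumed table with partition key "recipe_id"

def write_features(recipe_id: str, embedding: list, popularity: float) -> None:
    table.put_item(Item={
        "recipe_id": recipe_id,
        "embedding": [Decimal(str(x)) for x in embedding],  # DynamoDB numbers must be Decimal
        "popularity": Decimal(str(popularity)),
    })

def read_features(recipe_id: str) -> dict:
    return table.get_item(Key={"recipe_id": recipe_id}).get("Item", {})

write_features("r-123", [0.12, 0.34], popularity=0.87)
print(read_features("r-123"))
```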

This new architecture enables the Cookpad team to roll out to 20M users reliably, retrieve pre-calculated features, increase data-retrieval performance through DAX, ensure stable service even as web traffic grows, and facilitate collaboration by sharing and reusing features.


IV — Novel Techniques

15 — Data Observability: The Next Frontier of Data Engineering

As companies become increasingly data-driven, the technologies underlying these rich insights have grown more nuanced and complex. While our ability to collect, store, aggregate, and visualize this data has largely kept up with the needs of modern data and ML teams, the mechanics behind data quality and integrity have lagged. To keep pace with data's clock speed of innovation, data engineers need to invest not only in the latest modeling and analytics tools but also in ML-based technologies that can increase data accuracy and prevent broken pipelines. The solution is data observability, the next frontier of data engineering. Barr Moses (Monte Carlo) discussed why data observability matters to building a better data quality strategy and the tactics best-in-class organizations use to address it, including organizational structure, culture, and technology.

The concept of observability comes from DevOps, where it covers software metrics, traces, and logs. Translated to data, it means an organization's ability to fully understand the health of the data in its system and eliminate data downtime. The five pillars of data observability are defined as follows (a minimal check on two of them is sketched after the list):

  1. Freshness: whether the data is up-to-date.

  2. Distribution: whether the data value at the field level is accurate.

  3. Volume: whether the amount of data expected is in line with the historical rate.

  4. Schema: the structure of the data.

  5. Lineage: tracing data problems to their root cause.
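As a minimal illustration of two of these pillars, freshness and volume, here is a sketch of simple checks over a table's load metadata; the thresholds, column names, and the fixed "now" are assumptions for reproducibility.

```python
from datetime import datetime, timedelta
import pandas as pd

# Toy load metadata for one table: when each load landed and how many rows it had.
loads = pd.DataFrame({
    "loaded_at": pd.to_datetime(["2021-04-20 02:00", "2021-04-21 02:05", "2021-04-22 02:01"]),
    "row_count": [10_250, 10_410, 2_300],
})

NOW = datetime(2021, 4, 23)  # fixed "now" so the example is reproducible

def check_freshness(loads: pd.DataFrame, max_age_hours: int = 26) -> bool:
    """Freshness: the latest load should not be older than the expected cadence."""
    age = NOW - loads["loaded_at"].max()
    return age <= timedelta(hours=max_age_hours)

def check_volume(loads: pd.DataFrame, tolerance: float = 0.5) -> bool:
    """Volume: the latest row count should be close to the historical average."""
    latest = loads["row_count"].iloc[-1]
    history = loads["row_count"].iloc[:-1].mean()
    return abs(latest - history) / history <= tolerance

print("fresh:", check_freshness(loads))   # True: the last load is about 22h old
print("volume ok:", check_volume(loads))  # False: ~2.3k rows vs. ~10k historical average
```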

Barr Moses — Data Observability: The Next Frontier of Data Engineering <Tecton apply() 2021>

By putting data observability into practice, Monte Carlo's customers like Yotpo and Blinkist have increased cost savings, facilitated better collaboration between data engineers and analysts, achieved faster data-incident resolution, and saved a tremendous number of data engineering hours per week.

Designing an end-to-end approach to data observability requires you to:

  • Set baseline expectations about your data (what does “good data” look like?).

  • Monitor for anomalies across the five pillars of data observability.

  • Collect (and apply) rich metadata about your most critical data assets.

  • Map lineage between upstream and downstream dependencies to determine business applications of bad data.

16 — Centralized Model Performance Management via XAI

There are four main challenges that AI teams are facing when deploying models:

  1. Model transparency: Organizations are being held accountable to disclose their complex models, moderation policies, and data flows to regulators.

  2. Model drift: Since models are stochastic entities, drift in model performance can lead to tremendous losses during unforeseen circumstances.

  3. Model bias: Bias can creep into models at different levels, resulting in customer complaints and expensive lawsuits.

  4. Model compliance: This is something that finance and healthcare companies are dealing with daily.

The fundamental principle behind all of these challenges is that ML models are error-prone:

  • Model performance decays over time because of changes in user behavior and system errors.

  • Data drift occurs because production data is likely to differ from training data. That distribution shift clearly impacts the predictive nature of the models.

  • Data errors are common in ML pipelines. It is difficult to catch these errors before they have already negatively impacted key business metrics.

  • Models can add or amplify bias, which exposes organizations to regulatory and brand risk.

  • Complex models are increasingly a black box, which makes them difficult to debug.

Despite these challenges, there is currently no feedback loop for monitoring and controlling AI. Krishna Gade (Fiddler) described his company as a Datadog or Tableau for ML engineers: a visual analytics tool that captures model performance data, provides visibility across models, displays granular explanations of data and models, diagnoses model performance and bias, and sets up critical alerts and insights.

Krishna Gade — Centralized Model Performance Management via Explainable AI <Tecton apply() 2021>

Fiddler’s Model Performance Management (MPM) module is pluggable across the ML lifecycle to illuminate the black-box nature of ML models and close the feedback loop between offline and online evaluation.

  1. During training, teams can log training/test datasets and check for bias and feature quality.

  2. During validation, teams can ingest the model, explain the performance, discover slices of low model performance, and create model dashboards/reports.

  3. During deployment, teams can record model traffic and metadata and compare challenger and champion models on performance and bias.

  4. During monitoring, teams can observe performance, drift, outliers, bias, and errors by setting alerts on conditions and slices.

  5. During analysis, teams can pinpoint the root cause of performance issues, slice and dice prediction log data, and generate adversarial examples.

Overall, Fiddler’s MPM reduces operational risks with AI/ML by addressing performance degradation, inadvertent bias, data quality and other undetected issues, alternative performance indicators, and, most importantly, black-box models.

17 — Notebooks for Productionalizing Data & ML Projects

Michelle Ufford and Matthew Seale (Noteable) explored the development lifecycle for data engineering & ML projects before delving into some of the friction points most common when productionalizing those projects.

The traditional data lifecycle follows six steps: discovery (understand what data is available and what it means), preparation (integration, transformation, de-normalization), model planning (iterate on the model design based on the input data), model building (the core ML development), communicating results (analyses, reports, memos), and operationalization (getting the model ready for production). Once the model is deployed to production, we need to ensure it stays good by performing quality checks, writing canary tests, and versioning the data.

But what happens when there is a problem in production? Now we are dealing with multiple artifacts with multiple people, which is a tedious and painful process. It can take hours to figure out what went wrong, let alone how to fix it.

Michelle Ufford and Matthew Seale — Best Practices for Productionalizing Data & ML Projects <Tecton apply() 2021>

The modern data lifecycle is not much different from the traditional one, except that the friction points between these steps have become a bottleneck to achieving more with our data science projects. A solution is to have a unifying tool that serves as an interface to ensure that activities in production are similar to activities in development.

If we have a problem in production, a unifying tool lets us take the production assets and feed them the original input that caused the problem. In development, we can then reproduce the problem right away and get the right people focused on fixing it. For instance, Netflix uses Jupyter notebooks as the unifying tool to reproduce/clean bad data, update the model interface to accept new inputs, and apply fixes in production.

Michelle Ufford and Matthew Seale — Best Practices for Productionalizing Data & ML Projects <Tecton apply() 2021>

Besides notebooks, Metaflow and MLflow are other viable options to serve as this unifying layer. Because our development environment is closer to our production environment, we can focus on standard software development best practices that will improve both the development and the production life cycle (unit tests, code cleanup, documentation, parameterization, integration tests, version control).

Here are a few solid tools for productionization with notebooks, as recommended by Matthew: testbook for unit tests; git-review and ReviewNB for code cleanup; GitHub + nbviewer, nbsphinx, and commuter for documentation; papermill for parameterization and integration tests; and git for version control.
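As a tiny example of the parameterization piece, papermill lets you execute the same notebook with different parameters in development and production, which keeps both environments on the same artifact. The paths and parameter names below are illustrative.

```python
import papermill as pm

pm.execute_notebook(
    "train_model.ipynb",                    # notebook with a cell tagged "parameters"
    "runs/train_model_2021_04_22.ipynb",    # executed copy, kept as an audit trail
    parameters={"training_date": "2021-04-22", "env": "production"},
)
```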

18 — Programmatic Supervision for Software 2.0

One major bottleneck in the development and deployment of AI applications is the need for the massive labeled training datasets that drive modern ML approaches today (in contrast to models and hardware that are increasingly commoditized and accessible). This is no surprise: modern deep learning approaches are both more powerful and more push-button than ever before, but there's no such thing as a free lunch: they are also commensurately more data-hungry. Access to adequate training data increasingly separates companies that successfully adopt machine learning from those that do not.

These training datasets are traditionally labeled by hand at great time and monetary expense, and often cannot practically be hand-labeled at all due to privacy, expertise, and/or rate-of-change requirements in real-world settings like healthcare. Let's double-click on these three challenges:

  1. Data privacy, who has access to what data and how different aspects of data should be handled, is a big one. Shipping data out of the organization to get labels is often a non-starter for companies with strict data privacy requirements. For these cases, having access to internal data annotators at scale is frequently a real hurdle.

  2. Subject matter expertise is another significant factor. When developing AI applications for complex processes, such as understanding legal documents, diagnosing medical conditions, or analyzing network data, data scientists and machine learning engineers need data inputs from subject matter experts (SMEs) such as lawyers, doctors, or network analysts. Having these SMEs hand-label data is an inefficient use of their time and knowledge. Instead, wouldn't it be much more cost-effective if there were a way to generalize their expertise for large-scale data annotation?

  3. Constant change in input data and output modeling goals is a reality of nearly any real-world AI application, caused by everything from changes in upstream data preprocessing to shifting downstream business goals. This rate of change poses a significant challenge for ML and data science teams. Using hand-labeled training data means frequently re-labeling from scratch, or not approaching high rate-of-change problems with machine learning at all.

Alex Ratner — Programmatic Supervision for Software 2.0 <Tecton apply() 2021>

Alex Ratner (Snorkel AI) discussed a radically new approach to AI development called programmatic labeling, which was first developed at the Stanford AI lab and now at Snorkel AI. Rather than labeling data by hand, programmatic labeling lets users with domain knowledge and a clear understanding of the problem (SMEs) label thousands of data points in minutes using powerful labeling functions.

Based on this research and motivated both by what it was able to accomplish and what it needed to get to the next level of practicality and accessibility, Alex and collaborators have been building Snorkel Flow, an end-to-end ML platform around the programmatic labeling paradigm. By solving the training data problem in this new way, Snorkel Flow unlocks the rest of the ML model development workflow, making it iterative, practical, and more akin to software development than the current manual labor-heavy mode of ML to date.

Alex Ratner — Programmatic Supervision for Software 2.0 <Tecton apply() 2021>

Here’s a primer on how Snorkel Flow works:

  1. Snorkel Flow users first directly express knowledge as labeling functions, written in Python or through a no-code UI. These labeling functions let you label massive amounts of unlabeled data and use it to train complex models in minutes or hours. The challenge is that these programmatic labels are far noisier than hand labels (a labeling-function sketch follows this list).

  2. Snorkel Flow addresses this challenge by automatically denoising and integrating this weak supervision in a provably consistent way. Academically speaking, it uses theoretically grounded techniques to clean, re-weight, and combine the labeling function signals.

  3. Afterward, Snorkel Flow manages, versions, monitors, and serves this programmatic training data.

  4. Programmatic labeling can be used to train any ML model, which can generalize beyond the labeling functions.
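Here is a minimal sketch of the labeling-function workflow using the open-source snorkel package; the spam example and the labeling functions are illustrative, and Snorkel Flow layers a full platform on top of this core idea.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Heuristic: messages with links are likely spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_polite(x):
    # Heuristic: polite, conversational messages are likely ham.
    return HAM if "thanks" in x.text.lower() else ABSTAIN

df_unlabeled = pd.DataFrame({"text": [
    "win money now http://spam.example",
    "thanks, see you tomorrow",
    "click http://offer.example today",
]})

applier = PandasLFApplier([lf_contains_link, lf_polite])
L_train = applier.apply(df_unlabeled)          # label matrix: examples x labeling functions

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100)         # denoise and combine the weak labels
probabilistic_labels = label_model.predict_proba(L_train)
print(probabilistic_labels)                    # training labels for any downstream model
```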


V — Impactful Internal Projects

19 — Hamilton: a Micro Framework for Creating Dataframes

Stitch Fix has 130+ "Full Stack Data Scientists" who, in addition to doing data science work, are also expected to engineer and own data pipelines for their production models. One data science team, the Forecasting, Estimation, and Demand (FED) team, was in a bind. Its data scientists are responsible for forecasts that help the business make operational decisions, and their data generation process was causing iteration and operational frustrations in delivering time-series forecasts. More specifically, featurized data frames have columns that are functions of other columns, which gets messy. Scaling featurization code results in inline data frame manipulations, heterogeneity in function definitions and behaviors, and inappropriate code ordering.

Stefan Krawczyk — Hamilton: a Micro Framework for Creating Dataframes <Tecton apply() 2021>

Stefan Krawczyk presented Hamilton, a novel Python micro-framework, that solved their pain points by changing their working paradigm.

  • Specifically, Hamilton enables a simpler paradigm for a Data Science team to create, maintain, and execute code for generating wide data frames, especially when there are lots of intercolumn dependencies.

  • Hamilton does this by building a DAG of dependencies directly from Python functions defined in a unique manner, making unit testing and documentation easy.

  • Users write plain functions where the function name is the output column and the function parameters are the input columns. Python type hints then let Hamilton check the DAG before executing any code (see the sketch after this list).
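To show the paradigm rather than the framework itself (Hamilton was an internal Stitch Fix tool at the time of the talk), here is a sketch where the function name is the output column, the parameters are the inputs, and a tiny stand-in driver resolves the dependencies from the signatures alone.

```python
import inspect
import pandas as pd

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend divided by signups."""
    return spend / signups

def spend_mean(spend: pd.Series) -> float:
    """Average spend."""
    return spend.mean()

def spend_zero_mean(spend: pd.Series, spend_mean: float) -> pd.Series:
    """Spend with the mean removed."""
    return spend - spend_mean

def compute(requested: str, available: dict, functions: dict):
    """Tiny stand-in for a driver: resolve column dependencies recursively."""
    if requested in available:
        return available[requested]
    fn = functions[requested]
    kwargs = {p: compute(p, available, functions) for p in inspect.signature(fn).parameters}
    available[requested] = fn(**kwargs)
    return available[requested]

inputs = {"spend": pd.Series([10.0, 20.0, 30.0]), "signups": pd.Series([1, 2, 3])}
funcs = {f.__name__: f for f in [spend_per_signup, spend_mean, spend_zero_mean]}
print(compute("spend_zero_mean", dict(inputs), funcs))
```

Because each function is small and pure, unit testing and documentation come almost for free, which is exactly the benefit described above.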

Stefan Krawczyk — Hamilton: a Micro Framework for Creating Dataframes <Tecton apply() 2021>

Hamilton has been in production for 1.5 years and has been a huge success. Unit testing is standardized across data science projects. Documentation is easy and natural. Visualization is effortless thanks to the DAG, making onboarding simpler. Debugging is also simpler because issues can be isolated quickly. Most importantly, data scientists can focus on what matters, not the "glue" code that holds things together.

In the future, Stitch Fix will consider distributed processing, open-source components, general featurization, and other extensions for Hamilton.

20 — A Point in Time: Mutable Data in Online Inference

Most business applications mutate relational data. Online inference is often made on this mutable data, so training data should reflect the state at each prediction's "point in time" for every object. There are several data architecture / domain modeling patterns that solve this issue, but they only work from their implementation date onwards. Orr Shilon (Lemonade) suggested making the "point in time" a first-class citizen in the ML platform while still striving to maximize the use of older and messier data.

“Point-in-time” is a training-data notion: at inference time we capture the most recent values, and at training time we recreate data identical to what was available at inference time. The business point-in-time can be defined as entity values at the time of a business process (a minimal join sketch follows the snapshot comparison below).

  • Each point in time is dependent on a single entity column.

  • The timestamp for each entity is different within the point in time.

  • And there are usually several of these points-in-time per business.

Orr then compares this concept to a snapshot to understand the difference.

  • In the business point-in-time, the timestamp comes from an entity’s column (e.g., when a transaction was created). This would be different for each entity row. For a snapshot, the timestamp is the same for the whole table.

  • There are only several business points-in-time, so we will only need to materialize each entity’s data once per point in time. In contrast, snapshots will be created periodically, so each entity will be materialized many times.
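A minimal way to see the point-in-time idea in code is a backward as-of join: for each business event, take the most recent feature value known at that time and never a later one. The column names below are illustrative, not Lemonade's.

```python
import pandas as pd

# Business events (e.g., policy purchases); each row has its own timestamp.
events = pd.DataFrame({
    "user_id": ["u1", "u2", "u1"],
    "event_time": pd.to_datetime(["2021-03-01", "2021-03-05", "2021-03-10"]),
}).sort_values("event_time")

# History of a mutable feature as it changed over time.
feature_history = pd.DataFrame({
    "user_id": ["u1", "u2", "u1"],
    "feature_time": pd.to_datetime(["2021-02-20", "2021-03-01", "2021-03-05"]),
    "claims_count": [0, 2, 1],
}).sort_values("feature_time")

training_rows = pd.merge_asof(
    events, feature_history,
    left_on="event_time", right_on="feature_time",
    by="user_id", direction="backward",   # only values known at or before the event
)
print(training_rows)
```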

The data scientists at Lemonade do not want the freedom to “time travel” to any arbitrary point in time, only to several specific ones. Thus, the ML platform team decided to create abstractions for those specific cases, which not only leave less room for mistakes but also enable cool applications:

Orr Shilon — A Point in Time: Mutable Data in Online Inference <Tecton apply() 2021>

The first application is to monitor real-time feature service-level indicators, where a common issue is race conditions that result in missing/outdated features. To detect race conditions, Lemonade’s ML platform team has taken these steps:

  • As the ML platform supports a closed set of points-in-time, the feature authors configure point-in-time availability per feature.

  • Then, the business logic sets each point-in-time by requesting and logging features.

  • The data warehouse compares real inference features to training features from the feature store.

  • Consequently, the platform engineers can make an educated decision on whether to synchronously use a feature or not.

The second application is in communication internally with the ML platform team and externally with stakeholders.

  • For the ML platform team, the point-in-time is visible throughout the ML lifecycle, allowing consistent data quality guarantees.

  • For the stakeholders, the point-in-time is part of the model contract, allowing the consumers to know where in the business logic to use a model.

Orr Shilon — A Point in Time: Mutable Data in Online Inference <Tecton apply() 2021>

The third application is to create legacy data backfill guarantees by manually testing and backfilling per point-in-time.

  • If the data scientists continue creating snapshots, they can now manually create point-in-time data from those snapshots.

  • They can also materialize point-in-time training features from a feature store by comparing new rows between legacy and current data systems. Then, they can guarantee (or not) that the legacy data can be used to train models for a specific point-in-time.

So, in conclusion, consider making business point-in-time part of your ML platform if you make online predictions.

21 — Reusability in Machine Learning with Kedro

Creating and leveraging reusable ML code has many similarities with traditional software engineering but is also different in many respects. Nayur Khan (QuantumBlack) explored modern techniques and tooling which empower reusability in data and analytics solutions.

In ML development, there are often many teams sitting in silos, writing a lot of monolithic code that does not interact with other teams' code. As a result, several challenges arise:

  • Lack of reuse: There is a lack of collaboration or sharing of components/libraries.

  • Duplication: Different teams duplicate the same effort.

  • Inconsistency: Different approaches are created to solve the same problem, or use inconsistent data or algorithms.

  • Maintenance: It is difficult to maintain ML models (or hand-off to a support team).

  • Tech Debt: Large codebase and tech debt are always painful to deal with.

A typical ML codebase connects to data, cleans/transforms data, engineers features, builds models, and creates visualizations. Could we somehow package this up for reuse?

Kedro is an open-source Python framework created for data scientists and data engineers. It sets a foundation for creating repeatable data science code, provides an easy transition from development to production, and applies software engineering concepts to data science code. It is the scaffolding that helps us develop a data/ML pipeline that can be deployed.

Kedro is designed to address the shortcomings of notebooks, one-off scripts, and glue code, and to focus on creating maintainable data science code. It applies software engineering concepts (i.e., modularity, separation of concerns, testability) that encourage the creation of reusable analytics code. Lastly, Kedro enhances collaboration, both between members within a team and between different teams.

Nayur Khan — Reusability in Machine Learning with Kedro <Tecton apply() 2021>

Kedro has a nice feature called reusable pipelines, which act as lego blocks used by different teams; this addresses all of the challenges mentioned earlier. To design well-maintained Kedro pipelines, the QuantumBlack team adheres to key concepts from the software engineering world such as DRY (Don't Repeat Yourself), SOLID (Single responsibility, Open-closed, Liskov substitution, Interface segregation, Dependency inversion), and YAGNI (You Aren't Gonna Need It).
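Here is a minimal sketch of a reusable Kedro pipeline using the open-source kedro package; the dataset names ("raw_orders", "customer_features") are hypothetical and would normally live in the project's Data Catalog.

```python
import pandas as pd
from kedro.pipeline import Pipeline, node

def clean_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing customer ids."""
    return raw_orders.dropna(subset=["customer_id"])

def engineer_features(clean_orders: pd.DataFrame) -> pd.DataFrame:
    """One example feature: order count per customer."""
    return clean_orders.groupby("customer_id").size().rename("order_count").reset_index()

def create_pipeline() -> Pipeline:
    # Each node is a pure function wired together by named inputs/outputs,
    # so the same pipeline can be dropped into another project like a lego block.
    return Pipeline([
        node(clean_orders, inputs="raw_orders", outputs="clean_orders"),
        node(engineer_features, inputs="clean_orders", outputs="customer_features"),
    ])
```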

Given these pipelines, Nayur then shares three learnings on making them more discoverable: using a wiki/knowledge repo (Airbnb has an open-source version), sharing source (pipelines) via Git, and sharing components via asset repo.

In conclusion, this approach to reusability lets you go from insights (data science code that no one will use after your project is complete) to ML products (data science code that needs to be re-run and maintained).

22 — Exploiting the Data Code-Duality with Dali

Most large software projects in existence today are the result of the collaborative efforts of hundreds or even thousands of developers. These projects consist of millions of lines of code and leverage a plethora of reusable libraries and services provided by third parties. Projects of this scale would not be possible without the tools and processes that now define the practice of modern software development: language support for decoupling the interface from the implementation, version control, semantic versioning of artifacts, dependency management, issue tracking, peer review of code, integration testing, and the ability to tie all of these things together with comprehensive code search and dependency tracking mechanisms.

Carl Steinbach (LinkedIn) has observed similar forces at play in the world of big data. At LinkedIn, the number of people who produce and consume data, the number of datasets they need to manage, and the rate at which these datasets change are growing exponentially. This has resulted in a host of problems: rampant duplication of business logic and data, increasingly fragile and hard-to-maintain data pipelines, and schemas littered with deprecated fields due to the prohibitive costs of making backward-incompatible changes. To cope with these challenges, the team built Dali (Data Access at LinkedIn), a unified data abstraction layer for offline (Hadoop, Spark, Presto, etc.) and nearline (Kafka, Samza) systems that enables data engineers to benefit from the same processes and infrastructure that LinkedIn’s software engineers already use.

At its core, Dali provides the ability to define and evolve a dataset. A logical abstraction (e.g., a table with a well-defined schema) is used to define and access physical data in a principled manner through a catalog. Dali's key distinction from existing solutions is that a dataset need not be restricted to a physical dataset. One can extend the concept to include virtual datasets and treat physical and virtual datasets interchangeably. A virtual dataset is nothing but a view that allows transformations on physical datasets to be expressed and executed; in other words, a virtual dataset is the ability to express datasets as code. Depending on a dataset's usage, it is very cost-effective for the infrastructure to move between physical and virtual datasets, improving infrastructure utilization. Heavily used datasets can be materialized, and less heavily used datasets can stay virtual until they cross a usage threshold.
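As a rough illustration of the physical/virtual duality (in PySpark rather than Dali itself), a view expresses a dataset as code, and the same logic can later be materialized once the dataset becomes heavily used; all names and paths below are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("virtual-dataset-demo").getOrCreate()
events = spark.createDataFrame(
    [("u1", "click"), ("u1", "view"), ("u2", "click")], ["member_id", "action"]
)
events.createOrReplaceTempView("page_events")

# Virtual dataset: consumers query the view; no extra copy of the data exists.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW member_click_counts AS
    SELECT member_id, COUNT(*) AS clicks
    FROM page_events
    WHERE action = 'click'
    GROUP BY member_id
""")

# Materialization: once the view is heavily used, persist the same logic as a table.
spark.table("member_click_counts").write.mode("overwrite").parquet("/tmp/member_click_counts")
```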

Carl Steinbach — Exploiting the Data Code-Duality with Dali <Tecton apply() 2021>

Views bring about a host of desirable properties enabling users to:

  • Flatten highly nested data without storing two copies.

  • Demultiplex generic tables into specific tables, as well as the reverse direction.

  • Manage backward-incompatible changes without materializing two copies.

  • Define union views that combine the last month's worth of a dataset stored in ORC with the remainder encoded in Avro.

  • Replace push upgrades with incremental pulls.

  • Share workload optimization and lazy pre-materialization spanning the view graph.

In conclusion, here are concrete benefits of using Dali for the LinkedIn engineering team:

  • Seamless movement between physical and virtual datasets: Views evolve as contracts between upstream and downstream sources, so one can materialize some views because they are frequently accessed. Dali provides a mechanism to access data consistently regardless of whether it is available as a physical table or logical view.

  • Code reuse: Boilerplate logic that would have existed in slightly different forms in thousands of scripts can be consolidated into a single Dali view reusable across the data ecosystem.

  • Dependency management: There is a complete dependency graph linking Applications, Datasets, and Schemas. Impact analysis and lineage are visible.

  • Format and storage agnostic: Dali has been able to hide the fact that LinkedIn has migrated from Avro to mixed Avro + ORC data format for over 50 tracking topics in the offline world.

  • Decouples data producers from data consumers: Dali Views are versioned. This is critical because it decouples the pace of evolution between producers and consumers, replacing what would be an all-or-nothing and expensive atomic push upgrade with an incremental pull upgrade initiated by the consumer. Consumers are free to inspect what changes have occurred to the data and choose to upgrade at a pace they are comfortable with.

23 — Building a Best-in-Class Customer Experience Platform

New technologies have been advancing rapidly across the areas of frictionless data ingestion, customer data management, identity resolution, feature stores, MLOps, and customer interaction orchestration. Over the same period, many large enterprises have found themselves in the uncomfortable position of watching from the sidelines: these advances happen faster than they can evaluate the opportunities, build and sell the business cases, and select and integrate the desired new components. Enterprise applications used by different teams often suffer from duplication, non-standard tooling, and undocumented dependencies. As a result, these enterprises have limited access to the latest technology, low automation, substantial workflow inefficiencies, and significant performance gaps. Paul Phillips and Julian Berman (Deloitte Digital) talked about the journey of building a customer experience platform for enterprise marketing.

Paul Phillips and Julian Berman — Building a Best-in-Class Customer Experience Platform <Tecton apply() 2021>

Hux is Deloitte’s customer experience (CX) platform designed for the enterprise challenges:

  • It connects (with low friction) to data inputs that characterize the relationship between consumers and brands. The data is then merged inside a customer data management system (Snowflake in this case).

  • A big theme in this customer data management structure has been ID resolution. This property is pivotal in no small part due to ML requirements, which call for decisions at the individual consumer level in real-time.

  • In the BI, Insight & Decisioning layer, the data is structured natively to support ML applications (in addition to BI and analytics use cases) with feature management and decision pipeline blocks.

  • The last two layers (marketing and CX orchestration + customer interactions) orchestrate and bring the intelligence into the customer’s hands.

Furthermore, Hux is built with interoperability in mind by (1) enforcing harmony between technology, techniques, and talent; (2) being seamless across the end-to-end pipeline from data to engagement; (3) using AI that is integrated, powerful, and designed for marketers; (4) empowering analytics with insights and measurement built-in; and (5) letting you use your in-house cloud solutions to build your capabilities.

To bring this CX platform to the enterprise, it is crucial to identify the workflow challenges:

  • The utility of any ML system is a function of the difference between the value created and the cost of realizing it. By taking such cost out of the decisioning activity, we can optimize the utility even further.

  • Using a feature store can help reduce human resource costs arising from workflow-integrated feature management.

There are also technical integration challenges:

  • “Platform-wide” ML vendors often sell by having an organization build their entire workflow on top. In contrast, component-based vendors are deep in one area of the life cycle but may not fit well with each other without investment.

  • Operating diverse infrastructure is still very challenging. And your system performance is only as good as the weakest link in the architecture.

  • The market is still quite weak on domain or industry-specific solutions.

The talk ends with a five-step recipe for ensuring the success of any AI initiative:

  1. Fully automated: Automated pipelines maximize performance and uptime.

  2. Lower overhead: Best-in-class technology reduces operational overhead.

  3. Shorter time to value: Clients get started with ML immediately using pre-configured architecture rather than building and maintaining it themselves.

  4. Lower total cost of ownership: Automated pipelines reduce cost per action, increase business agility, lower downtime, and improve ROI.

  5. Reduced risk: Preconfigured architecture and elimination of manual work reduce risk in complex, highly governed enterprise environments.


A huge thanks to Tecton for organizing the conference, alongside these startup partners (Algorithmia, Arize, cnvrg, Fiddler, Monte Carlo, Noteable, Provectus, Redis, Snorkel, Superb AI, Superwise AI) and PR sponsors (Data Talks Club, TFIR, AICamp, insideBigData, The Cloudcast, The New Stack, The Sequence, MLOps Community).

If you are a practitioner/investor/operator excited about best practices for development patterns/tooling/emerging architectures to successfully build and manage production ML applications, please reach out to trade notes and tell me more! DM is open 🐦