What I Learned From Attending Tecton apply(meetup) 2021
Last month, I attended apply(), the follow-up virtual event in Tecton’s ML data engineering conference series. I’ve previously written a recap of their inaugural event, a whirlwind tour of wide-ranging topics such as feature stores, ML platforms, and research on data engineering. In this shorter post, I would like to share content from the main talks and lightning talks presented at the community meetup. Topics include ML systems research, ML observability, streaming architecture, and more.
New Abstractions Enabling Operational ML
But first, let’s take a look at the kickoff talk from Tecton’s Co-Founder and CEO, Mike Del Balso, on the new abstractions enabling Operational ML.
It’s been an exhilarating year for operational ML, as we have seen a mass shift towards a data-centric perspective for building ML systems. As this space matures, we see new tools and abstractions that make building ML pipelines and collaborating on ML projects easier. At the same time, more sophisticated ML projects bring more complex and advanced requirements for ML tooling. As a result, there’s a lot of opportunity for new abstractions and tools to make ML even faster, more accessible, and more powerful.
A lot of data engineering work needs to happen in operational ML, as different architectures are required for different types of features. Each of these data engineering pipelines has ML-specific requirements because data for ML has unique and unmet needs. These include training/serving skew, point-in-time correctness, putting notebooks into production, real-time transformations, melding batch and real-time data, complicated backfills, data feedback loops, and latency-sensitive serving.
Because these pipelines touch so many different pieces of infrastructure and data science code, it’s hard for any single person to manage them, let alone build them. One trend happening across ML engineering is that new abstractions are emerging to help people manage these areas of complexity in their ML applications. These abstractions are something that Tecton/Feast is very bullish on.
They have built a unified feature management layer that allows teams to (1) define and register their ML features using a simple Python framework, and (2) have those simple definitions instantiated as productionized feature pipelines. These new abstractions provide standardized APIs and tools to manage and govern the features, to productionize and deploy them, to access their data in a standardized way, to create derivative datasets for model training or model serving, to monitor them over time, etc. All of these abstractions operate on top of your existing cloud data infrastructure.
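To make this concrete, below is a minimal sketch of what such a Python feature definition can look like, loosely modeled on Feast’s 0.x SDK (exact class names and signatures vary by version); the entity, source, and feature names are hypothetical:

```python
from datetime import timedelta
from feast import Entity, Feature, FeatureView, FileSource, ValueType

# Hypothetical entity and batch source, purely for illustration.
driver = Entity(name="driver_id", value_type=ValueType.INT64)

driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    event_timestamp_column="event_timestamp",
)

# Registering this view is what turns a simple definition into a managed feature pipeline.
driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=["driver_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="trips_today", dtype=ValueType.INT64),
        Feature(name="avg_rating", dtype=ValueType.FLOAT),
    ],
    batch_source=driver_stats_source,
)
```

Once registered (e.g., with feast apply), the platform takes care of materializing the feature to the offline and online stores and serving it behind standardized APIs.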
A feature platform like Tecton is a great example of a new abstraction providing an ML-specific interface that hides a ton of complexity from engineers and data scientists. Consequently, teams get to production faster, collaborate better, and even obtain best practices in architecture design for free.
This new generation of MLOps tools is bringing standardization that allows for greater decoupling in the ML stack and a cleaner ML architecture.
Main Talks
1 - Building Malleable Machine Learning Systems through Measurement, Monitoring & Maintenance
The ML software hierarchy can be defined with four blocks as seen below:
Program and Run: This is the wheelhouse of traditional ML. You build models, program your data, and iterate on both so your models reach high performance on the tasks of interest.
Measure and Monitor: Once the model is deployed, you want to monitor distribution shift and perform fine-grained model evaluation/interpretability.
Improve and Repair: Once you can measure and monitor, it becomes easier to understand where you can intervene to fix the gaps, repair issues, iterate on the datasets, etc.
Interact: The last block figures out the interaction between ML practitioners and ML systems.
Karan Goel (CS Ph.D. Student at Stanford University) argued that we are doing well on the bottom block, but there remain challenges with the upper-layer blocks. In his presentation, he proposed that if ML systems were more malleable and could be maintained like software, we might build better systems. More specifically, he described the need for finer-grained performance measurement and monitoring, the opportunities this opens up for maintaining ML systems, and the tools he has been building (with great collaborators) in the Robustness Gym and Meerkat projects to close this gap.
Loosely speaking, Karan defines malleable ML systems as those with the ability to modify the system over time to perform bug fixes, hit performance targets, and meet ethical standards.
We can define the system capabilities that nail down what we look for in this framework. The desired state is that stakeholders can interactively:
(Measure) Understand fine-grained system capabilities and performance.
(Monitor) Track distribution shift and its effects over time.
(Maintain) Perform atomic updates to data and models.
Most of the knowledge needed for these capabilities is tacit and hidden inside (a few) organizations, which have built tools that cover only a small subset of what’s possible.
Measurement
You can’t fix what you don’t measure. We need finer-grained model metrics across critical slices of data. Robustness Gym is a Python toolkit for building evaluations and generating visualizations. There are a lot of evaluation strategies to measure what’s happening inside an ML system: via critical data slices, via bias and fairness concerns, via sensitivity to perturbations, via invariance to transformations, etc. Robustness Gym brings these strategies under one umbrella and provides general abstractions.
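As a library-agnostic sketch (not Robustness Gym’s actual API), slice-based evaluation boils down to computing the same metric on user-defined subpopulations of the evaluation set:

```python
import numpy as np

# Toy evaluation data: raw inputs, gold labels, and model predictions.
texts = ["not a good movie", "great acting", "she did not like it", "he loved it"]
labels = np.array([0, 1, 0, 1])
preds = np.array([1, 1, 0, 1])

# User-defined slicing functions mark critical subpopulations of the data.
slices = {
    "negation": np.array(["not" in t for t in texts]),
    "male_pronoun": np.array([" he " in f" {t} " for t in texts]),
}

# Aggregate accuracy can hide failures that per-slice accuracy reveals.
print("overall:", (preds == labels).mean())
for name, mask in slices.items():
    if mask.any():
        print(name, (preds[mask] == labels[mask]).mean())
```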
Below is an example robustness report for a BERT model used for the task of Natural Language Inference:
Different evaluation types (subpopulation, attack, transformation, evaluation set) are on the rightmost side.
Specific evaluations (used to generate this report) are on the leftmost side.
Different metrics (accuracy, F1, class distribution, etc.) are at the top.
Karan and his team are currently making this measurement process even more accessible and unified for many modalities. To get started with Robustness Gym, simply pip install robustnessgym.
Monitoring
Now that we know how to measure things, can we figure out whether model performance degrades over time? To do so, we need to estimate metrics on unlabeled data. If an unlabeled sample from the target distribution is available, along with a labeled sample from a possibly different source distribution, standard approaches such as importance weighting can be applied to estimate performance on the target. However, importance weighting struggles when the source and target distributions have non-overlapping support or are high-dimensional. Mandoline is a user-guided framework to evaluate models using slices.
A slice is a user-defined grouping of data. Three slices are shown above: negation, male pronoun, and strong sentiment. Given labeled source and unlabeled target data, users write noisy, correlated slicing functions (indicated by color) to create a slice representation for the datasets. Mandoline uses these slices to perform density ratio estimation. Then, it uses these ratios to output a performance estimate for the target dataset by re-weighting the source data.
In empirical evaluations, Mandoline estimated target performance up to 5x more accurately than standard importance weighting baselines.
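To give a flavor of the approach, here is a toy re-weighting illustration (not Mandoline’s actual estimator, which performs proper density ratio estimation over the slice representation); the slice memberships and correctness values below are made up:

```python
import numpy as np

# Binary slice memberships (columns: negation, male pronoun, strong sentiment).
source_slices = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
target_slices = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1]])
source_correct = np.array([1, 0, 1, 1])  # per-example correctness on the labeled source

# Crude per-slice density ratios: how much more often each slice fires in target vs. source.
ratios = target_slices.mean(axis=0) / np.clip(source_slices.mean(axis=0), 1e-8, None)

# Weight each source example by the mean ratio of the slices it belongs to.
memberships = np.maximum(source_slices.sum(axis=1), 1)
weights = (source_slices @ ratios) / memberships

# Re-weighted estimate of accuracy on the unlabeled target distribution.
print("estimated target accuracy:", np.average(source_correct, weights=weights))
```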
Maintenance
Finally, how can we measure and fix undesirable behavior? In their NAACL 2021 paper called “Goodwill Hunting,” Karan and colleagues tackled this question in the context of Named Entity Linking (NEL) systems, which map “strings” to “things” in a knowledge base like Wikipedia. Correct NEL is required by downstream systems that need knowledge of entities, such as information extraction and question answering. Their goals are two-fold: (1) analyzing commercial and academic NEL systems “off-the-shelf” using Robustness Gym; and (2) repurposing the best off-the-shelf system (called Bootleg) to guide behavior for sports Question Answering systems.
Specifically for the second goal, they showed how to correct undesired behavior using data engineering solutions - model-agnostic methods that modify or create training data. Drawing on simple strategies from prior work in weak labeling, which uses user-defined functions to weakly label data, they relabeled standard Wikipedia training data to patch errors and fine-tuned the model on the newly relabeled dataset. With this strategy, they achieved a 25% absolute accuracy improvement on sports-related errors.
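Here is a hypothetical sketch of what such a relabeling function could look like (this is not the paper’s code; the cue words, field names, and alias map are invented for illustration):

```python
# Hypothetical weak-labeling-style relabeling in the spirit of the fix described above.
SPORTS_CUES = {"game", "season", "coach", "score", "playoffs"}

def relabel_sports_mentions(example, alias_to_team):
    """Point ambiguous mentions at the team entity when the context looks sports-related."""
    context = set(example["text"].lower().split())
    for mention in example["mentions"]:
        alias = mention["text"].lower()
        if alias in alias_to_team and context & SPORTS_CUES:
            mention["entity"] = alias_to_team[alias]  # overwrite the noisy label
    return example

example = {
    "text": "Lincoln pulled ahead late in the game to win the season opener",
    "mentions": [{"text": "Lincoln", "entity": "Abraham Lincoln"}],
}
print(relabel_sports_mentions(example, {"lincoln": "Lincoln High School"}))
```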
Wishlist: Methods and Tools for Malleability
There’s a lot to do in the end-to-end regime of going from measurement to monitoring.
Data-centric approaches entail data augmentation, data collection, active sampling, weak labeling, and data pre-processing. Model-centric approaches entail new training paradigms, new training algorithms, new architectures, etc. In the industry, data-centric approaches tend to be more performant.
Another question is how to fix models atomically. The goal is to patch models and produce a final version that covers all the errors we care about.
From the research side, it’s essential to understand the trade-offs between all these methods.
Karan is putting a lot of effort into Meerkat, which builds DataPanels for interactive ML. Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation, and model training supported by efficient and robust IO under the hood. DataPanel is a simple columnar data abstraction, which can house columns of arbitrary type – from integers and strings to complex, high-dimensional objects like videos, images, medical volumes, and graphs.
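A minimal sketch of what constructing a DataPanel can look like, assuming Meerkat’s early API (class and method names may differ across versions; the data is made up):

```python
import meerkat as mk

# A DataPanel mixes simple scalar columns with complex, lazily loaded ones.
dp = mk.DataPanel({
    "id": [0, 1, 2],
    "text": ["a dog runs", "a cat sleeps", "birds fly"],
    "label": [1, 0, 1],
})

# Columns holding high-dimensional objects (e.g., images) can sit alongside
# scalar columns; the file paths here are hypothetical.
# dp["image"] = mk.ImageColumn.from_filepaths(["img0.jpg", "img1.jpg", "img2.jpg"])

print(dp["text"][:2])  # slice columns and rows much like a dataframe
```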
Karan ended the talk by suggesting people check out this list of resources and progress made in data-centric AI, with exciting directions past, present, and future (made possible by Stanford’s Hazy Research Lab).
2 - How Robinhood Built a Feature Store Using Feast
Rand Xie (a Senior Machine Learning Engineer at Robinhood) shared lessons from building Robinhood’s internal feature store using Feast. The goal of this feature store is to (1) provide data scientist-friendly APIs to manage features, (2) reduce communication costs between data scientists and engineers, and (3) guarantee correctness of features while enabling monitoring/alerting when features drift.
Before Robinhood built an internal feature store, data scientists were in charge of the feature extraction and model training pipelines. The model would then be plugged into the serving infrastructure (managed by the Core ML team), and the product team would send requests to the model server. Without a feature store, the product team had to write loading jobs to load features from Hive into Redis and then merge the features with client logic code. With a feature store, there is shared infrastructure, so the product team does not need this additional step.
Feast was the feature store option most compatible with Robinhood’s internal stack, and it is modular enough that its components are swappable.
More specifically, the Robinhood team took out Feast Core and Feast Online Serving, rewrote all the batch ingestion parts (Feast Job) in PySpark, and extended Feast SDK to ingest custom logic specific to them.
The diagram below displays the new architecture for Robinhood’s ML system design (called “Beast”):
The data scientist writes a feature definition to specify the feature’s name, type, and validation rules. Those feature definitions are then pushed into a metadata store (essentially Feast Core).
The data scientist is responsible for the feature transformation pipeline: writing Spark jobs, generating Hive tables, and using the Airflow orchestrator to automate the feature validation process. The ingestion job writes the features into Redis, where they become available for online serving.
The data scientist connects to the model server as before and can create an additional configuration file to specify which features to fetch from Beast and which features to pass through from client requests.
The Web UI (built by front-end engineers) enables easier feature registration and feature discovery.
Looking forward, Rand mentioned that these are the initiatives that Robinhood’s Core ML team will be working on:
Design historical storage for generating training data, so that features can be locked at model inference time and reused in the future.
Develop a feature keyword search endpoint that matches keywords against feature descriptions.
Discuss streaming feature support and explore other storage options for online serving.
3 - ML Observability: Critical Piece of the ML Stack
As more and more ML models are deployed into production, we must have better observability tools to monitor, troubleshoot, and explain their decisions. Aparna Dhinakaran (the Co-Founder and Chief Product Officer of Arize AI) discussed the current state of ML production monitoring and the challenges it must address, including model and data drift, performance degradation, model interpretability, data quality issues, model readiness, and fairness and bias issues.
From a bird’s-eye view, Aparna categorized the ML infrastructure stack into three segments: the feature store to handle offline and online feature transformations, the model store to serve as the central model registry and track experiments, and the evaluation store to monitor and improve model performance.
ML observability falls under the evaluation store. An ideal evaluation store should be able to:
Monitor drift, data quality issues, or anomalous performance degradations using baselines.
Analyze performance metrics in aggregate (or slice) for any model in any environment (production, validation, or training).
Perform root cause analysis to connect changes in performance to why they occurred.
Enable a feedback loop to actively improve model performance.
Aparna then presented the four pillars of ML observability: drift (data distribution changes over the lifetime of a model), performance analysis (surfacing the worst-performing slices), data quality (ensuring high-quality inputs and outputs), and explainability (attributing why a certain outcome was produced).
Drift
Drift is a change in distribution over time. It can be tracked for model inputs, outputs, and actuals. It’s important to look for drift because models are not static: they depend on the data they were trained on. Especially in hyper-growth businesses where data is constantly evolving, accounting for drift is important to make sure your models stay relevant.
PSI (Population Stability Index) is a solid metric to calculate drift for both numeric and categorical features where distributions are fairly stable. Some modifications are needed to handle proper binning for numeric features.
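As a quick sketch, PSI compares the binned distribution of a feature in production against a reference (e.g., training-time) distribution. Below is a minimal NumPy version, assuming fixed bins derived from the reference sample; the data is synthetic and the 0.2 threshold is only a common rule of thumb:

```python
import numpy as np

def psi(reference, production, n_bins=10):
    """Population Stability Index for a numeric feature."""
    # Bin edges come from the reference sample so both distributions share the same bins.
    edges = np.histogram_bin_edges(reference, bins=n_bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Clip to avoid log(0) in empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

baseline = np.random.normal(0.0, 1.0, 10_000)  # e.g., training-time feature values
serving = np.random.normal(0.3, 1.0, 10_000)   # e.g., recent production traffic
print(f"PSI = {psi(baseline, serving):.3f}")   # PSI > 0.2 is often treated as notable drift
```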
In general, drift detection is useful for catching trends that data quality monitors miss and for slow-bleed failures, but it is not well suited for outlier detection, data consistency checks, or individual-prediction-level analysis. It’s also important to note that a drifting feature does not necessarily mean model performance has been impacted.
Performance
There are two major cases where performance analysis is needed:
Models with Fast Ground-Truths: This is the happy case, which does not happen in many real-world scenarios. You can map the ground truths back to the predictions, compute the right metrics for your model, and surface the right cohorts for analyzing predictions.
Models without Ground-Truths: These models are more common in reality, where either (1) there is a delay in getting ground truth, (2) the ground truth is biased, or (3) there is little or no ground truth.
In both cases, the problem is that aggregate statistics often mask many issues. How can we surface where a model is underperforming? One solution is to look at the performance of the model across various slices or cohorts of predictions. The challenges are (1) knowing which slices to look at to find the issue and (2) figuring out how to improve the model.
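As a simple illustration of cohort-level analysis (the predictions log and column names below are hypothetical), grouping by a candidate slice and comparing against the aggregate metric is often enough to surface a problem:

```python
import pandas as pd

# Hypothetical predictions log: one row per scored request.
log = pd.DataFrame({
    "segment": ["new_user", "new_user", "returning", "returning", "returning"],
    "label":   [1, 0, 1, 1, 0],
    "pred":    [0, 0, 1, 1, 0],
})

log["correct"] = (log["label"] == log["pred"]).astype(int)
print("overall accuracy:", log["correct"].mean())
# Per-cohort accuracy surfaces slices that the aggregate number hides.
print(log.groupby("segment")["correct"].agg(["mean", "count"]))
```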
Data Quality
Typically, ML teams handle bad production data by (1) dropping bad prediction requests (which is not viable for business-critical decisions), (2) imputing or setting default values (which can create drift in your predictions and introduce data leakage if done improperly), or (3) doing nothing (which is not always possible, because your service might throw an error or bad data might end up driving business decisions).
The best solution is to write data quality checks for missing data, out-of-range violations, the cardinality of discrete variables, type mismatches, unexpected traffic, and more. The main challenges are finding the right baseline (useful without being too noisy) and setting up the right metrics for each data type.
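A bare-bones sketch of a few such checks on a Pandas batch (the column names and thresholds are hypothetical; in practice these checks would live in a monitoring system with tuned baselines):

```python
import pandas as pd

def check_batch(df):
    """Return a list of data quality issues found in a batch of prediction inputs."""
    issues = []
    # Missing data
    if df["amount"].isna().mean() > 0.01:
        issues.append("amount: more than 1% missing values")
    # Out-of-range violations
    if ((df["amount"] < 0) | (df["amount"] > 10_000)).any():
        issues.append("amount: values outside the expected range")
    # Cardinality of a discrete variable
    if df["country"].nunique() > 250:
        issues.append("country: unexpected cardinality")
    # Type mismatch
    if not pd.api.types.is_numeric_dtype(df["amount"]):
        issues.append("amount: expected a numeric dtype")
    return issues

batch = pd.DataFrame({"amount": [10.5, -3.0, None], "country": ["US", "CA", "US"]})
print(check_batch(batch))
```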
Lightning Talks
4 - Streaming Architecture with Kafka, Materialize, dbt, and Tecton
Emily Hawkins (Data Infrastructure Lead at Drizly) presented a talk about building out Drizly’s data science stack and streaming infrastructure to match their success with the modern data stack on the BI side. She works closely with the data engineers to develop Drizly’s data platform, which provides data access to anyone in the organization working on data pipelines, business intelligence, real-time analytics, data science, and machine learning.
Drizly’s batch infrastructure leans heavily into the modern data stack (Snowflake for warehousing, dbt for transformations, Fivetran for ingestion, Census for reverse ETL, Looker for BI). As more use cases require data in real time, they started to develop a real-time data infrastructure that looks like the following:
Confluent Cloud manages schema registries and Kafka topics.
Events from inbound services come into the topics, which are then read into Source objects in Materialize.
dbt manages all the materialized views (SQL queries that perform data transformations).
The data can then be sent out of Materialize using a Sink object, which writes back to another topic and can be read by any sort of outbound service.
The combination of Materialize and dbt was powerful for them:
With Materialize, they can write real-time SQL almost exactly the same as batch SQL. They can centralize streaming logic in the same way they have for batch logic in Snowflake. Applications can either connect directly to Materialize, or they can send data out using sinks.
With dbt, they have continuity between batch and streaming processes. Their analysts and data scientists who already use dbt will be able to ramp up on their real-time infrastructure quickly. Furthermore, deploying a new real-time model is as easy as writing a new dbt model.
Looking forward, they plan to bring Tecton into their stack as a feature store, because they want increased personalization in real-time alerting and, eventually, real-time dynamic experiences for their customers. With this new stack, they can not only bring data in from Snowflake but also utilize real-time features with the Confluent, Materialize, and dbt setup discussed above.
5 - Performing Exploratory Data Analysis with Rule-Based Profiling
James Campbell (the CTO at Superconductive) discussed exploratory data analysis with rule-based profiling using Great Expectations (GE), the popular open-source library that brings automated testing to data. It provides a declarative grammar that starts with schema-oriented, high-level characteristics and progresses to more semantically rich expectations about the data.
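For a flavor of that grammar, here is a minimal sketch using GE’s Pandas interface (the dataset and column names are hypothetical; the rule-based profiler configuration itself lives in YAML and is not shown):

```python
import great_expectations as ge
import pandas as pd

# Wrap a plain DataFrame so expectations can be declared interactively.
orders = ge.from_pandas(pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [10.5, 22.0, 7.25],
    "status": ["paid", "paid", "refunded"],
}))

# Schema-oriented, high-level expectations...
orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_of_type("amount", "float64")
# ...progressing to more semantically rich ones.
orders.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)
orders.expect_column_values_to_be_in_set("status", ["paid", "refunded", "canceled"])

result = orders.validate()
print(result.success)  # True if every expectation passed
```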
The beauty of rule-based profilers comes down to two things:
Expectations compile directly to human-readable documentation.
Automated Profilers embody checklists of questions, enabling new modes of collaboration.
A question-centric view of exploratory data analysis entails normative questions such as:
What does my data look like? Does my data look like it should?
How many values are NULL? Is it distributed correctly?
What are the quantiles? Is it distributed correctly?
What is the distribution of each column?
What columns are correlated?
Do the values match standard regexes? Do the values look like they should?
GE’s rule-based profilers sketch out a general thesis of how to go about performing EDA:
It is configuration-driven. Changes are made in a .yaml file, but all components can be easily extended in code.
It is customizable: it is purpose-fit for your workflow. You can start small, have it examine only a few columns, and then add to it as your needs grow.
It is reusable: You can reuse a profiler configuration to create a standard set of expectations on multiple datasets.
6 - How Shopify Contributed to Scaling Feast
Matt Delacour (a Senior Data Engineer on the ML Platform team at Shopify) discussed how Shopify manages large volumes of ML data (billions of rows) using Feast. Shopify decided to adopt Feast to build its ML feature store in early 2021 for three reasons:
Feast is plug-and-play, so they could create their own custom provider and online store while Feast handles the business logic like point-in-time correctness, historical retrieval of features, etc.
Feast is open source, the type of project they want to collaborate on.
There is a great fit between them and the Feast team.
Matt gave an example Shopify use case with 2 entities, 4 feature views, 15 features, and billions of rows to be processed. He then presented four scaling problems and solutions they came up with:
The first problem was around how to generate a unique identifier. Feast creates a unique identifier for the rows of the entity_dataframe using ROW_NUMBER() OVER(). Without a PARTITION BY clause, this window function sends all the data to a single worker. Their solution was to compute their own UUID using a CONCAT function.
The second problem was around outputting results into a Pandas DataFrame. Pandas is an in-memory data analysis library, so they would run into memory errors. Their solution was to create a method that saves the results of any historical retrieval directly in BigQuery.
The third problem was around performing historical retrieval on all the data. More specifically, they performed historical retrieval of each FeatureView and joined the data back to the entity_dataframe at the end, which is wasteful if you only need a sample of the data. Their solution was to join each FeatureView to the entity_dataframe as early as possible.
The fourth problem was around understanding how the OfflineStorage interprets the query key. While BigQuery optimizes the query for them, it’s important to understand where the heavy operations (e.g., Join, Group By, window functions) happen. Their solution was to build their own API on top of Feast to unblock their users and to give feedback to the Feast community (see this pull request).
If you want to partner on this project, feel free to reach out to Matt and collaborate!
While the operational ML space is in its infancy, it is clear that industry best practices are being formed. I look forward to events like this that foster active conversations about ideas and trends for ML data engineering as the broader ML community grows. If you are curious and excited about the future of the modern ML stack, please reach out to trade notes and tell me more at khanhle.1013@gmail.com! 🎆