What I Learned From Convergence 2022

Last week, I attended Comet ML’s Convergence virtual event. The event featured presentations from data science and machine learning experts, who shared their best practices and insights on developing and implementing enterprise ML strategies. The talks covered emerging tools, approaches, and workflows that can help you effectively manage an ML project from start to finish.

In this blog recap, I will dissect content from the event’s technical talks, covering a wide range of topics from testing models in production and data quality assessment to operational ML and minimum viable models.

Source: https://www.ml-convergence.com

1 - ML Highlights from 2021 and Lessons for 2022

2021 was a year full of advances in machine learning, natural language processing, and computer vision. Inspired by Sebastian Ruder's blog post, ML and NLP Research Highlights of 2021, Oren Etzioni (CEO of AI2) summarized the ten highlights and suggested lessons for 2022 and beyond.

Number 1 is the proliferation of massive pre-trained ML models: These models are self-supervised and do not require labeled data (for example, they learn by predicting hidden or next words during training, as in BERT- and GPT-style models). They are powerful in zero-shot and few-shot learning tasks, which only need a few data samples. They work well in multi-modal settings such as language, speech, and vision tasks. They are successful generative models and were recently termed “foundation models.”

Source: ML Highlights from 2021 and Lessons for 2022 (by Oren Etzioni)

Number 2 is the dominance of the Transformer architecture: Invented in 2017, it is an encoder-decoder architecture that replaces RNNs and LSTMs with “multi-head attention.” It considers the full (weighted) context simultaneously, is well-suited for GPUs, and enables massive self-supervised learning. The Transformer’s success in language has been replicated in other domains.

Number 3 is the adoption of massive multi-task learning: This technique utilizes labeled data from many tasks to improve performance and create task-agnostic models. These models can be trained on hundreds to thousands of tasks and learn to perform new tasks from instructions (also known as meta-learning).

Source: ML Highlights from 2021 and Lessons for 2022 (by Oren Etzioni)

Number 4 is the rise of prompting to guide generative models: Back in the late 1980s, most of the work focused on feature engineering, continually tuning features for ML models. In the 2010s, the work shifted to architecture engineering for deep neural networks, tuning the architecture design. In the 2020s, the work will focus on prompt engineering, in which the prompt is the input given to the generative model, because “a prompt is worth 1,000 labeled examples.”
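
To make this concrete, here is a minimal sketch of prompt-style, zero-shot inference with the Hugging Face transformers pipeline; the example text and candidate labels are made up for illustration.

```python
from transformers import pipeline

# Zero-shot classification: no labeled training data, just a prompt-like task
# description in the form of candidate labels (uses the pipeline's default model).
classifier = pipeline("zero-shot-classification")

result = classifier(
    "The shipment arrived two days late and the produce was spoiled.",
    candidate_labels=["logistics issue", "product quality", "billing question"],
)
print(result["labels"][0])   # highest-scoring label
```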

Source: ML Highlights from 2021 and Lessons for 2022 (by Oren Etzioni)

Number 5 is the invention of efficient methods: These massive, pre-trained models are expensive to train and inefficient to run. In real-world applications, people have optimized these models via efficient attention mechanisms (Longformer, Fastformer), created smaller models (distillation, Macaw), built specialized ML hardware (TPUs), and used compilers that map models directly to hardware (Apache TVM). A model that required tens of thousands of lines of CUDA is now a one-page example in PyTorch!

Number 6 is the sad fact that bias is pervasive in training text: Models “learn to hate” when the training data is downloaded from the Internet. Scrubbing the data is surprisingly hard because context in text data can often be misconstrued.

Source: ML Highlights from 2021 and Lessons for 2022 (by Oren Etzioni)

Number 7 is the bias introduced by temporal adaptation: Time itself can be a source of bias, and we need to get beyond the static nature of corpora.

Source: ML Highlights from 2021 and Lessons for 2022 (by Oren Etzioni)

Number 8 is image generation from text and pre-training: In the past, we could create captions from images. Now, we can create images from captions. Very impressive stuff!

Number 9 is the exciting work on program synthesis: In applications like GitHub Copilot or AlphaCode, programs are synthesized from a text prompt based on code the model has seen in the past.

Source: ML Highlights from 2021 and Lessons for 2022 (by Oren Etzioni)

Number 10 is the application of ML for science: DeepMind’s AlphaFold 2.0 achieves state-of-the-art results on the difficult task of 3D protein folding. In climate science, deep models do a better job than humans at precipitation forecasting and nowcasting. AI2 develops Semantic Scholar, a search engine for scientists that recommends scientific papers and automatically summarizes their content.

Oren concluded the talk with some lessons for 2022: Foundation models are a “tide that lifts all boats.” Data is dirty (and hard to scrub). Prompt engineering is an emerging discipline. ML hardware and compilers are making huge strides. There have been many innovative ML applications, including science, programming, images, and even video. Now is a fantastic time to work in ML!

2 - Testing ML Models for Production

Machine learning models are an integral part of our lives and are becoming indispensable to decision-making in many businesses. When an ML algorithm makes a mistake, it can not only erode user trust but also cause business losses and, in some sectors such as healthcare, loss of life. How do you know that the model you’ve been developing is reliable enough to be deployed in the real world? Shivika Bisen (Lead Data Scientist at PAXAFE) gave a talk that took a closer look at best practices for testing ML models for production.

For context, PAXAFE is a 3-year-old startup that predicts adverse supply chain events to de-risk B2B shipments and enable intelligent cargo insurance. They position themselves as the Operating System for the cold-chain and perishables industry, and their proprietary ML models properly label and diagnose excursions.

ML testing for production is essential because there is a major difference between software engineering testing and ML testing. In software, given the data and the logic, you can test the software system for the desired output. In ML, given the data and the desired output, you can test the ML system for a learned logic. But how can you ensure that this learned logic is reliable for decision-making? In other words, is a good model evaluation and accuracy enough for the real world? You need to care about various types of metrics such as business KPIs, response speed, data load, etc.

Source: Testing ML Models for Production (by Shivika Bisen)

Currently, model development by data scientists in most companies does not always follow the best production practices, and simply handing trained models off to the engineering team for shipping isn’t enough. Shivika outlined the different steps of the ML testing flow, as shown below:

  1. Model evaluation is a critical step you want to perform in the offline environment to check model performance on offline datasets.

  2. After evaluating the model, you want to write pre-training / unit tests to check important model metrics before pushing it into production.

  3. Once your unit tests are done, you want to write post-training tests to check the model performance while in production, such as latency and load tests.

  4. Next comes stage and shadow tests, which measure how your model will perform in a simulated environment.

  5. Finally, API tests help you figure out how your model performs while using your software’s API.

Source: Testing ML Models for Production (by Shivika Bisen)

For model evaluation, here are common important metrics and tools (a quick scikit-learn sketch follows this list):

  1. For the classification task, you can use Accuracy, Precision, Recall, F1 Score, and AUC Score.

  2. For the regression task, you can use RMSE, MAE, and Adjusted R².

  3. For model explanation purposes (when your neural network is a “black box”), you can use explainability techniques like LIME and SHAP.
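
For a concrete flavor of these metrics, here is a minimal scikit-learn sketch; the toy labels and predictions below are placeholders for your own held-out set, and the number of features used for Adjusted R² is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, mean_squared_error, mean_absolute_error, r2_score,
)

# --- Classification (toy data) ---
y_true = np.array([0, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4])   # predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("auc      :", roc_auc_score(y_true, y_prob))

# --- Regression (toy data) ---
y_true_r = np.array([3.1, 2.4, 5.0, 4.2])
y_pred_r = np.array([2.9, 2.7, 4.6, 4.4])
n, p = len(y_true_r), 2                              # samples, features used
r2 = r2_score(y_true_r, y_pred_r)

print("rmse     :", np.sqrt(mean_squared_error(y_true_r, y_pred_r)))
print("mae      :", mean_absolute_error(y_true_r, y_pred_r))
print("adj. r2  :", 1 - (1 - r2) * (n - 1) / (n - p - 1))
```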

For pre-training and unit testing, here are a few things to pay attention to (see the pytest sketch after this list):

  1. Feature engineering: If your model is a neural network, check for out-of-sample data. If you deal with time-series data, test the feature scaler.

  2. Input data: Check the data format and type, N/A values, and training/testing set shape.

  3. Model reproducibility: Make sure that your model returns the same output given the same input. Look at model configuration (such as library versions) as well.
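
Below is a minimal pytest sketch of such pre-training checks; the dataframe, column names, and the logistic-regression stand-in are hypothetical placeholders for your own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def load_training_frame() -> pd.DataFrame:
    # Stand-in for your real data-loading code.
    return pd.DataFrame({"temp_c": [4.0, 5.5, 3.2], "delay_hours": [1, 0, 2]})


def test_input_data_format_and_nans():
    df = load_training_frame()
    assert list(df.columns) == ["temp_c", "delay_hours"]   # expected format
    assert df["temp_c"].dtype == np.float64                # expected type
    assert not df.isna().any().any()                       # no N/A values


def test_train_test_shapes():
    df = load_training_frame()
    train, test = df.iloc[:2], df.iloc[2:]
    assert len(train) + len(test) == len(df)
    assert train.shape[1] == test.shape[1]


def test_model_reproducibility():
    # The same input and a fixed seed should give the same output.
    X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
    preds = [
        LogisticRegression(random_state=0).fit(X, y).predict([[1.5]])[0]
        for _ in range(2)
    ]
    assert preds[0] == preds[1]
```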

For post-training testing, it’s important to distinguish between the two common model deployment strategies: online deployment and batch deployment.

  1. For online deployment, your model learns on the fly - given streaming training and test data, your model makes predictions in real-time. Latency testing is important here. For example, you want to test the model’s response speed using static hyper-parameter values, not grid-search or random-search cross-validation.

  2. For batch deployment, given lots of training data, your model is trained and saved locally, then makes batch predictions. Load testing is helpful here: you can test parallel database access (SQLAlchemy), ship synthetic data in parallel (AWS containers help with CPU/memory monitoring), or use automated tools like Locust, as in the sketch after this list.
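
Here is a minimal Locust load-test sketch for a prediction endpoint; the /predict route, the payload fields, and the host are assumptions about your own service.

```python
# locustfile.py — run with: locust -f locustfile.py --host http://localhost:8000
from locust import HttpUser, task, between


class PredictionUser(HttpUser):
    wait_time = between(0.5, 2.0)   # simulated think time between requests

    @task
    def score_shipment(self):
        # Hypothetical prediction endpoint and payload.
        self.client.post("/predict", json={"temp_c": 4.2, "delay_hours": 1})
```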

A/B testing is a handy strategy for handling data and concept drift, where your model performance decays over time. Basically, given the incoming live data, you create two models (splitting the data based on pre-defined criteria): model A is the control variant, and model B is the challenger variant. You then choose the new model if it returns better performance.
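
A minimal sketch of such a split is shown below; the hash-based routing rule, the 20% challenger share, and the model objects are illustrative choices, not a prescribed setup.

```python
import hashlib


def route(request_id: str, challenger_share: float = 0.2) -> str:
    """Deterministically assign a request to model A (control) or B (challenger)."""
    bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
    return "B" if bucket < challenger_share * 100 else "A"


def predict(request_id: str, features, model_a, model_b):
    # Log which variant served the request so you can compare metrics later.
    model = model_b if route(request_id) == "B" else model_a
    return model.predict([features])[0]

# After enough traffic, compare the logged metrics of A vs. B and promote
# the challenger only if it performs better.
```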

Source: Testing ML Models for Production (by Shivika Bisen)

Stage and shadow testing are useful for simulating a real environment with simultaneous input data. This helps your model be more robust when dealing with challenging edge cases.

Finally, API testing is the last category of tests you want to look at. Here, you want to assess error codes to deal with issues that could not be predicted beforehand, status codes to deal with logical and syntax issues, and security by checking authorization.
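
Here is a minimal sketch of such API tests using requests and pytest; the base URL, payload, and auth scheme are placeholders for your own prediction service.

```python
import requests

BASE_URL = "http://localhost:8000"   # hypothetical prediction service


def test_valid_request_returns_200():
    resp = requests.post(
        f"{BASE_URL}/predict",
        json={"temp_c": 4.2, "delay_hours": 1},
        headers={"Authorization": "Bearer <token>"},
    )
    assert resp.status_code == 200


def test_malformed_payload_returns_client_error():
    resp = requests.post(f"{BASE_URL}/predict", json={"bad_field": "oops"})
    assert 400 <= resp.status_code < 500   # logical / syntax issues


def test_missing_auth_is_rejected():
    resp = requests.post(f"{BASE_URL}/predict", json={"temp_c": 4.2, "delay_hours": 1})
    assert resp.status_code in (401, 403)  # security / authorization check
```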

3 - Data Quality Assessment Using TensorFlow Data Validation

Research shows that the majority of ML projects do not make it to production because they do not improve business processes, enhance customer experience, or add tangible ROI for the organization. Inspired by data-centric AI, Vidhi Chugh (Staff Data Scientist at Walmart Global Tech India) discussed typical production woes, how maintaining good-quality data plays a crucial role in developing a successful machine learning model, and the various sources and types of deviations and errors that can degrade data quality.

In production, your ML models might be more complex than planned, expensive to run, and a source of friction in collaboration and ownership. The probabilistic nature of ML also means that you might not get the exact same output every time. However, data remains the most crucial factor in ML development, and issues with data availability and data quality were the focus of Vidhi’s talk.

Organizations need to adopt a data culture by integrating data from multiple sources and setting up proper data generation and collection processes. Here you want to look for any data duplication errors, then perform data transformations by merging and linking data together.

Maintaining quality data is key because an ML model is only as good as the data it is fed. Adopting a data-centric approach enables you to curate robust, good-quality data. Vidhi recommends automated and iterative checks to ensure data integrity, since manual checks are time-consuming and error-prone.

You want to pay attention to new incoming data by monitoring it automatically and looking for any anomalies or changes over time. You want to examine missing data points (systematic or random), features with high predictive power being removed, or replacements with blank, zero, or arbitrary values. You also want to perform temporal analysis to see whether attributes change per entity, which is only possible if timestamps are stored.

Measuring drift is important to ensure that your model is reliable and up-to-date. You can look at the distribution of the model output over time, whether user behavior or the environment has changed, and whether the data still has the same characteristics as the data the model was trained on. The KS statistic is a handy metric for measuring drift. Furthermore, you might want to consider when to retire old data or add new data (all at once, first-in-first-out, or after N days).
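
As a rough illustration, the two-sample KS test from SciPy can flag a shift between a feature's training distribution and its live distribution; the synthetic data and the significance threshold below are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training-time distribution
live = rng.normal(loc=0.3, scale=1.0, size=5_000)        # shifted incoming data

stat, p_value = ks_2samp(reference, live)
if p_value < 0.01:                 # the threshold is a judgment call
    print(f"Drift detected (KS statistic = {stat:.3f}, p = {p_value:.4f})")
```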

Source: Data Quality Assessment Using TensorFlow Data Validation (by Vidhi Chugh)

Vidhi concluded her talk with brief coverage of TensorFlow Data Validation. This open-source library has powerful capabilities: descriptive analysis with data distributions and visualizations, schema inference that shows features and their expected values, anomaly detection that flags out-of-domain values for quick action, statistical computation on large datasets (based on the Apache Beam framework), and data drift detection between train/test data or between different days of training data.
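
Here is a minimal sketch of that workflow; the dataframes and column names are made up, but statistics generation, schema inference, and anomaly validation are TensorFlow Data Validation's standard entry points.

```python
import pandas as pd
import tensorflow_data_validation as tfdv

train_df = pd.DataFrame({"temp_c": [4.0, 5.5, 3.2], "carrier": ["A", "B", "A"]})
new_df = pd.DataFrame({"temp_c": [4.1, 19.0], "carrier": ["A", "C"]})   # a new day of data

train_stats = tfdv.generate_statistics_from_dataframe(train_df)  # descriptive statistics
schema = tfdv.infer_schema(train_stats)                          # inferred schema

new_stats = tfdv.generate_statistics_from_dataframe(new_df)
anomalies = tfdv.validate_statistics(new_stats, schema)          # out-of-domain checks
tfdv.display_anomalies(anomalies)                                # e.g., carrier "C" is unseen
```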

4 - How Improving ML Datasets Is The Best Way To Improve Model Performance

Many teams will immediately turn to fancy models or hyperparameter tuning to improve an ML model to eke out small performance gains. However, the majority of model improvement can come from holding the model code fixed and properly curating the data it's trained on! Peter Gao (CEO of Aquarium Learning) discussed why data curation is a vital part of the model iteration, some common data and model problems, and how to build workflows plus team structures to efficiently identify and fix these problems to improve your model performance.

Many people are used to “Old School” ML - tasks such as forecasting, relevance, recommendations, fraud. You work primarily with structured data (string, numeric, timestamps) where labels are generated “for free” and use algorithms like logistic regression, random forests, SVMs. This type of ML emphasizes building data pipelines and infrastructure, engineering features, and experimenting with models.

Given the rise of deep learning, you work much more with image classification or speech recognition tasks. As you deal with unstructured data (audio, imagery, point clouds), you need to pay labeling providers to label data and use neural networks. Deep learning emphasizes building data pipelines and infrastructure, fine-tuning pre-trained models, and improving the quality and variety of datasets.

In order to improve your ML system, you can improve your model code and improve your training dataset. The key is to do both of these tasks faster and more frequently in the production environment.

Here are the steps Peter outlined to improve your data:

  1. Find problems in the data and model performance.

  2. Figure out why the problems are happening.

  3. Modify your dataset to fix the problems.

  4. Make sure the problems are fixed as you retrain your model on the new dataset.

  5. Repeat.

There can be various types of data problems you might encounter, such as invalid data, labeling errors and ambiguities, difficult edge cases, and out-of-sample data. Additionally, long-tailed distributions are widespread in machine learning, reflecting the state of the real world and typical data collection practices. The charts below show the frequency of model classes in several popular AI research datasets.

Current ML techniques are not well equipped to handle them. Supervised learning models tend to perform well on common inputs (i.e., the head of the distribution) but struggle where examples are sparse (the tail). Since the tail often makes up the majority of all inputs, ML developers end up in a loop – seemingly infinite, at times – collecting new data and retraining to account for edge cases. And ignoring the tail can be equally painful, resulting in missed customer opportunities, poor economics, and frustrated users.

Of the many data points in your datasets, only a few are actually problematic, and it is labor-intensive to dig through the haystack looking for the needle. Peter recommends letting the model tell you where to look: high-loss disagreements tend to be labeling errors, and neural network embeddings help surface trends in errors.
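
Here is a minimal sketch of the "let the model tell you where to look" idea: compute a per-example loss and review the highest-loss examples first, since confident-but-wrong predictions often point to label errors. The toy labels and probabilities are illustrative.

```python
import numpy as np


def per_example_log_loss(y_true: np.ndarray, y_prob: np.ndarray) -> np.ndarray:
    eps = 1e-12
    p = np.clip(y_prob, eps, 1 - eps)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


y_true = np.array([1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.1, 0.05, 0.2, 0.8])   # the model is confidently wrong on index 2

losses = per_example_log_loss(y_true, y_prob)
suspects = np.argsort(losses)[::-1][:10]        # highest-loss examples first
print("Review these indices for possible label errors:", suspects)
```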

Peter’s startup, Aquarium, makes it easier to build and improve production ML systems by improving the datasets they are trained on: it helps you find problems in your data and model performance (such as labeling errors and difficult edge cases), understand why they happen, and fix them as you retrain.

5 - How Feature Stores Enable Operational ML

Getting ML applications into production is hard. When those applications are core to the business and need to run in real-time, the challenge becomes even harder. Feature Stores are designed to solve the data engineering challenges of production ML applications. Kevin Stumpf (CTO of Tecton) discussed how feature stores help tackle four key problems:

  1. Real-time and streaming data are difficult to incorporate into ML models.

  2. ML teams are stuck building complex data pipelines.

  3. Feature engineering is duplicated across the organization.

  4. Data issues break models in production.

Kevin used to be a Tech Lead at Uber, where he contributed to the development of Michelangelo, Uber’s in-house ML platform that powers various ML use cases, including customer support, demand forecasting, fraud detection, ride and order ETAs, recommendations, image extraction, dynamic pricing, accident detection, self-driving cars, and safety. Thanks to Michelangelo, data scientists and ML engineers at Uber have been able to put thousands of models into production, enforce reuse, governance, and trust, and bring ideas to production in days.

From that experience, he distinguished between Analytic ML and Operational ML:

  • Analytic ML systems power human decision-making (dashboards and reports) by letting users explore / experiment with the data and gather offline insights. Data come mainly from your data warehouse and data lake.

  • Operational ML systems power more wide-ranging use cases such as financial services (fraud detection, stock trading), retail (product recommendation, real-time pricing), or insurance (personalized insurance, claims processing). These use cases provide automated decisions and require mission-critical SLAs. Besides getting data from your warehouse and lake, you also need to deal with streaming data.

Operational ML use cases have unique requirements:

  1. ML features are calculated from raw data of very different data sources (batch, streaming, or real-time).

  2. Model training uses historical features and requires point-in-time correctness and consistency with model serving.

  3. Model serving uses the latest features at a high scale, high freshness, and low latency.

Without better tools, companies build band-aid solutions that involve multiple stakeholders: the data scientists cover the model training phase, while the data engineers cover the model serving phase. The lack of collaboration between them can cause major issues when bringing models into production.

Source: How Feature Stores Enable Operational ML (by Kevin Stumpf)

First and foremost, extracting and serving features is hard because data sources have vastly different characteristics, as seen above. What happens if you calculate features from the raw data source? You’ll likely have to decouple feature calculation (you write ETL jobs to precompute features from the data warehouse) from feature consumption (your production store serves pre-computed features to your model). However, you need to find the optimal freshness and cost-efficiency tradeoffs. Furthermore, it’ll get really hard when you combine batch, streaming, and real-time data pipelines.

Source: How Feature Stores Enable Operational ML (by Kevin Stumpf)

Training and serving skew is another common trap that most ML teams have to deal with. Logic discrepancies are easily introduced when you have separate implementations, so it’s always better to have one training and serving implementation. Skew is also introduced by timing discrepancies and data leakage (you contaminate your training data with features from your production data).

A feature store solves the issues mentioned above by being the interface between data and models. It connects to raw data sources, transforms them into ML features, and serves historical features to model training and real-time features to the model in production. It is an abstraction layer on top of your data stack, providing access to data compute (Spark, warehouse SQL, Python) and data storage (offline stores such as S3 or Snowflake and online stores such as Redis or DynamoDB).

Source: How Feature Stores Enable Operational ML (by Kevin Stumpf)

Secondly, ML teams are stuck building complex data pipelines. For example, a data engineer might build custom feature pipelines for ad-hoc requests from a data scientist. A feature store enables data scientists and data engineers to self-serve deploy features to production. You implement a feature once and add it to the feature store. This feature will be immediately available for production purposes.

Thirdly, a lack of feature standardization and lots of feature duplication can lead to a high barrier to entry. This will waste your time and money as well. A feature store centrally manages features as data assets so that you can easily (and securely) discover features from other teams that might be relevant to your use cases.

Finally, data issues often break models in production: broken upstream data sources, actionable feature drifts, opaque feature sub-population outages, or unclear data quality ownership. A feature store monitors your data flows so you can detect data drifts and improve data quality.

Source: How Feature Stores Enable Operational ML (by Kevin Stumpf)

A feature store has five core components: monitoring, transformation, storage, serving, and registry. An enterprise product like Tecton lets you use a feature store in your ML lifecycle quite easily by first defining features, then generating training data to train models, and finally fetching features for real-time prediction.

If you want to build your own feature store, read this blog post. If you want to manage your own feature store, check out the open-source project Feast. If you want to buy a feature store, look at Tecton!
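
To give a flavor of the training vs. serving interfaces described above, here is a minimal sketch with the open-source Feast project; the repository path, the feature view name (shipment_stats), and the entity are assumptions for illustration.

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")   # points at a hypothetical Feast feature repo

# Training: point-in-time correct historical features joined onto labeled events.
entity_df = pd.DataFrame({
    "shipment_id": [101, 102],
    "event_timestamp": pd.to_datetime(["2022-03-01", "2022-03-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["shipment_stats:avg_temp_c", "shipment_stats:transit_hours"],
).to_df()

# Serving: the latest feature values, fetched at low latency from the online store.
online_features = store.get_online_features(
    features=["shipment_stats:avg_temp_c", "shipment_stats:transit_hours"],
    entity_rows=[{"shipment_id": 101}],
).to_dict()
```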

6 - Using ML To Solve The Right Problems

As the ML hype fades, the industry is entering a more mature phase where ML is no longer perceived as a magic wand but as a risky, yet powerful, tool for solving a new set of problems, one that requires heavy investments in people and infrastructure. Eduardo Bonet looked at steps we can take to decrease the risk of ML solutions dying in the prototype phase: what types of problems are the best fit, ideas on how to handle stakeholder expectations, how to translate business metrics into model metrics, and how to be more confident that we are solving the right problems.

Some common problems in ML development include failing to understand which problems ML can tackle, failing to communicate results to stakeholders, and discovering too late that users don’t want the product.

To identify what problems might be the best fit for ML, Eduardo suggested the Informed Guesser strategy: A Guesser doesn’t understand the mechanisms; it finds patterns and guesses based on them. The better you inform the Guesser, the better the guesses will be. You can’t expect a Guesser not to make mistakes; the question is how often.

ML is essentially an Informed Guesser: If you have data available for inferring patterns and make fewer mistakes, then you have a good use case for ML. Example use cases are search, categorization, optimization, grouping, etc.

To set realistic expectations for stakeholders, Eduardo suggested The Minimum Viable Model strategy: MVM is the model with the smallest performance necessary to achieve success. Here are five steps to design the MVM:

  1. What is the vision? What will change in the world once the problem is solved?

  2. What things can we measure that will tell us we are moving in the right direction?

  3. How much do we need to move that thing for this project to be considered a success?

  4. How can we anchor model performance metrics on business metrics?

  5. What is the minimum performance the model needs to achieve?

MVM is not about any specific technology. It’s about desired results. The final number is often not as relevant as the conversations that led to that number. One key challenge here is that your business metrics often don’t map well to your model performance metrics.

To validate that ML is solving the right problem, Eduardo suggested starting without ML. Deploying ML is expensive and takes time, so try the simple stuff first (popularity results, heuristics, human in the loop, etc.)! Starting without ML helps you reduce time to market, validate metrics and use cases, guide data collection, avoid dealing with MLOps early on, and reuse the heuristics as features later on.
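
As a tiny illustration of starting without ML, here is a popularity-based recommender baseline; the interaction log and item names are made up.

```python
from collections import Counter

interactions = [
    ("user1", "item_a"), ("user2", "item_a"), ("user2", "item_b"),
    ("user3", "item_a"), ("user3", "item_c"),
]

popularity = Counter(item for _, item in interactions)


def recommend(user_id: str, k: int = 2):
    """Ignore the user entirely and return the k most popular items."""
    return [item for item, _ in popularity.most_common(k)]


print(recommend("new_user"))   # a surprisingly strong baseline to beat
```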

In brief:

  1. Informed guesser is about framing ML and what problems it can tackle.

  2. Minimum viable model is about setting realistic expectations.

  3. Starting without ML allows for faster iteration and validating metrics and use cases.

I look forward to events like this that foster active conversations about best practices and insights on developing and implementing enterprise ML strategies as the broader ML community grows. If you are curious and excited about the future of the modern ML stack, please reach out to trade notes and tell me more at james.le@superb-ai.com! 🎆