What I Learned From Attending MLOps World 2021

The field of MLOps has rapidly gained momentum among Data Scientists, ML Engineers, ML Researchers, Data Engineers, ML Product Managers, and any other roles that involve the process of designing, building, and managing ML-powered software. I have been actively involved in the MLOps community to keep up to date with rapid innovations happening within this domain.

Two months ago, I attended the second edition of MLOps: Production and Engineering World, a multi-day virtual conference organized by the Toronto Machine Learning Society that explores the best practices, methodologies, and principles of effective MLOps. In this post, I would like to share content from the talks that I found most useful during this conference, broken down into Operational and Technical talks.

I previously attended the inaugural MLOps event last year and wrote an in-depth recap of it here.

Operational Talks

1 — Developing a Data-Centric NLP Pipeline

The number of components and level of sophistication in end-to-end ML pipelines can vary from problem to problem. Still, there’s one common element that is the key to making the whole system great and useful: your training data. The more time you spend developing the training dataset in your ML pipeline, the better your results will be.

Diego Castaneda and Jennifer Bader of Shopify presented the use case of a text classification pipeline developed from scratch to integrate with one of their products. The talk showed how they designed an appropriate classification taxonomy and a consistent annotated training dataset, plus how the end-to-end pipeline was pieced together to deploy a BERT-based model in a low latency real-time text classification system.

Shopify Inbox (https://www.shopify.com/inbox)

For those who are not familiar with Shopify, it is a leading global commerce company that powers over 1.7 million businesses in 175+ countries. The company's mission is to make commerce better for everyone. Diego and Jennifer were on the Messaging team and worked on a product called Ping — a free messaging app used to drive sales and build customer relationships. More specifically, they wanted to use ML to power a system that can classify incoming customer messages in real-time.

The Initial Approach

Initially, their approach entailed the following steps: (1) Use a robust, state-of-the-art pre-trained NLP model, (2) Put together a training dataset, (3) Train a highly accurate model, (4) Deploy the trained model for real-time, low-latency usage, and (5) Monitor model performance in production.

A fairly standard, traditional way to develop such a system from scratch is to create a semi-supervised dataset by exploring the conversation domain (with topic clustering techniques) and identifying important categories/sub-categories. For this to work, they invested mostly in assembling a good initial dataset and then training/optimizing the best ML model possible. A big problem that the Shopify team found with the above process is that the dataset was highly unbalanced, with broad topics dominating the clusters. As a result, a model trained on just this data was not accurate or confident enough, blocking their product development lifecycle.
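
As a rough illustration of what this kind of exploratory topic clustering can look like (a generic sketch, not Shopify's actual pipeline), here is a minimal example using TF-IDF vectors and k-means; the messages and cluster count are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder customer messages; a real run would use many thousands of them.
messages = [
    "where is my order?",
    "do you ship to canada?",
    "i want to return this item",
]

# Vectorize the messages and group them into candidate topics.
vectors = TfidfVectorizer(stop_words="english").fit_transform(messages)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(vectors)  # far more clusters in practice

for message, label in zip(messages, labels):
    print(label, message)
```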

The Data-Centric Approach

The Shopify Messaging team decided to switch to a data-centric approach that focuses on training dataset development. In this case, they invested time and resources in developing the training dataset by collaborating with the content design team and continuously optimizing the taxonomy. Additionally, they came up with a procedure to annotate the data with high accuracy.

Here is the three-step process that they used to build the taxonomy:

  1. The first step was to sketch out the desired taxonomy: They identified the topics they wanted to see in the new taxonomy by manually skimming customer messages. They used clustering visualizations to look for trends. As the team looked for more messages, they added more topics.

  2. The second step was to create a structure around the topics by grouping them by similarity. They recognized that separating the topics into categories highlighted the distinction between messages indicating that (1) customers were in a pre-purchase phase and (2) customers were in a post-purchase phase. The structure enforced hierarchical levels, which ensured greater labeling accuracy.

  3. The third step was to consider the annotator UX, thinking about the user experience of actually labeling the messages. Topic names, topic order, category/topic descriptions, and disambiguations are important criteria that they paid attention to.

Here are the key bullet points outlining their annotation process:

  • They involved the Shopify customer support team as domain-aware annotators (who have deep knowledge of both the product and the merchants).

  • They set up iterative loops with training sessions to align the annotator team, aiming for high inter-annotator agreement.

  • They collected inter-annotator agreement metrics such as simple agreement percentage, Cohen's kappa, and Fleiss' kappa (a minimal example of computing these follows this list).

  • They refined the taxonomy as more examples surfaced.
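
To make these metrics concrete, here is a minimal, hypothetical example of computing the simple agreement percentage and Cohen's kappa for two annotators with scikit-learn (Fleiss' kappa, which covers more than two annotators, is available in statsmodels):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels assigned by two annotators to the same batch of messages.
annotator_a = ["shipping", "returns", "pre_purchase", "shipping", "other"]
annotator_b = ["shipping", "returns", "pre_purchase", "returns", "other"]

simple_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"agreement={simple_agreement:.2f}, Cohen's kappa={kappa:.2f}")
```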

Let’s look at the outcomes of the traditional approach and the data-centric approach, as recounted by Diego:

  • With the traditional approach, given 40k training messages and a 20-topic taxonomy, they achieved ~70% model accuracy and ~35% high-confidence coverage.

  • With the data-centric approach, given 20k training messages and a 45-topic taxonomy, they achieved ~90% model accuracy and ~80% high-confidence coverage.

Developing a Data-Centric NLP Machine Learning Pipeline (by Diego Castaneda - Data Scientist & Jennifer Bader - Content Strategist, Shopify)

The diagram above displays Shopify's NLP production pipeline:

  • The training data preparation step is done in the red box on the left side.

  • The training data is then connected to a Spark-based data warehouse.

  • The model is trained using Shopify's ML platform module, which is a combination of in-house tools, 3rd-party components, and open-source libraries.

  • The trained model is based on TensorFlow with a Transformer module from HuggingFace (DistilBERT).

  • The DistilBERT model is hosted on Google AI Platform to make online inferences.

  • The only client of that model is a custom API built with FastAPI and deployed on a Kubernetes cluster.
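
For a flavor of how that last piece might look, here is a minimal FastAPI sketch that forwards a message to a hosted classification model; the endpoint URL and payload format are assumptions for illustration, not Shopify's actual API:

```python
import requests
from fastapi import FastAPI
from pydantic import BaseModel

# Hypothetical prediction endpoint for the hosted DistilBERT classifier.
PREDICTION_URL = "https://example-prediction-endpoint/v1/models/message-classifier:predict"

app = FastAPI()


class Message(BaseModel):
    text: str


@app.post("/classify")
def classify(message: Message):
    # Forward the message to the hosted model and relay its top prediction.
    response = requests.post(
        PREDICTION_URL,
        json={"instances": [{"text": message.text}]},
        timeout=1.0,  # keep the real-time latency budget tight
    )
    response.raise_for_status()
    return response.json()["predictions"][0]
```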

Jennifer concluded the talk with these high-level learnings:

  • Training dataset development can take most of the project's timeline.

  • Domain expert annotators are invaluable.

  • Content Design + Data Science + ML teams are a winning combination in NLP.

Developing a data-centric NLP pipeline is a win!

2 - Building Scalable ML Services for Rapid Development in The Health and Wellness Marketplace

Mindbody is the premier software provider and marketplace for the health and wellness industry. The company’s mission is to connect the world to wellness. They accomplish that via two types of product offerings: the Mindbody business and the Mindbody app. The Mindbody business provides software for studio owners that offer scheduling management, marketing services, and other activities that allow them to grow and expand their businesses. The Mindbody app is the one-stop-shop for customers to find and book all of their fitness, wellness, and beauty needs in Mindbody’s global marketplace.

The AI/ML team at Mindbody was established in 2019 to build new tools and services powered by AI/ML to complement their marketplace and business software.

  • For Mindbody’s B2B offerings, these AI/ML features include automated marketing campaigns, lead scoring, churn prediction, automated business insights, and onboarding concierge.

  • For Mindbody’s consumer marketplace, these AI/ML features include recommendation engines, personalized search/campaigns, trending and trust signals, and AI wellness coach.

Recommendation Systems at Mindbody

Genna Gliner and Brandon Davis presented a sample solution for training and deploying recommendation systems for Mindbody’s consumer marketplace into their production environment.

  • There is an enormous number of consumers in their consumer marketplace, each with many dimensions of preferences (fitness, modality, location, willingness to spend, instructor). There is also an enormous number of providers, each with its own unique value proposition.

  • As a booking utility, the Mindbody consumer app allows the consumers to enter what they want, book it, and leave. The AI/ML team wanted to shift this booking utility into a marketplace, where Mindbody consumers go to find their next favorite fitness class through rich, meaningful discovery. They accomplished this by building recommendation systems that enable such discovery by pushing relevant recommendations and offers to the consumers.

Building Reusable and Scalable ML Services to Enable Rapid Development in our Health and Wellness Marketplace (by Genna Gliner - Machine Learning Engineer & Brandon Davis - Machine Learning Engineer, Mindbody)

To date, the AI/ML team has launched two different recommendation systems:

  • The Dynamic Pricing Recommender offers last-minute classes at a discounted rate. It is designed to bring the right class offerings to the right consumers to increase conversion. The system consists of an ensemble of collaborative filters + a content-based recommender to personalize the recommendations. The end-to-end development cycle for this service was about 9 months.

  • The Virtual Class Recommender is tailored for the consumer marketplace. It is designed to surface a greater variety of inventory to drive discovery. The system ensembles a couple of different collaborative filters with consumer data in different ways to provide diverse sets of fitness class options. Thanks to the learnings from the development of the Dynamic Pricing system, the AI/ML team was able to shorten the development cycle of this service to about 5 months.

The next service that the AI/ML team wanted to build was a General Marketplace Recommender that provides personalized recommendations to consumers for all of the health, beauty, and wellness offerings in the marketplace. Their goal is to reuse previous components from the previous two recommenders to dramatically reduce the lifecycle development and rapidly put this new recommender into service.

To make ML projects successful, the AI/ML team needs to invest in four aspects: the Data, the ML Code, the UI/UX, and the Infrastructure. In Mindbody's case, they realized that the two areas of Data and Infrastructure are where they needed to put more focus. How could they improve their Data and Infrastructure so that this could have a multiplying effect on the success of their current and future projects?

Building Reusable and Scalable ML Services to Enable Rapid Development in our Health and Wellness Marketplace (by Genna Gliner - Machine Learning Engineer & Brandon Davis - Machine Learning Engineer, Mindbody)

The diagram above depicts a rough layout of what a Mindbody recommender system looks like. This approach works fine when there are only two systems to be deployed into production. However, beyond two systems, this method of development runs into some obvious pain points:

  1. Duplicated datasets and feature engineering efforts are a waste of time and energy.

  2. Redundant models have no sharing mechanism.

  3. Recurring jobs are non-standardized and difficult to manage, with low visibility.

Scaling Recommendations Architecture

The AI/ML team at Mindbody started to address these pain points one by one using the abundant open-source offerings in the market.

For the first pain point, they adopted a feature store - an ecosystem of databases, queries, and code to provide ML pipelines with unified access to features for ML. A typical feature store consists of two databases: an offline one for model training and an online one for model serving. Using a feature store enables standardized access to features and data processing directly from the API code. This makes it simple to re-use features and move code from notebooks to repositories. In particular, they use dbt (an open-source tool for documenting, storing, and maintaining data transformations) and great_expectations (another open-source tool for testing data and enforcing data quality). The new recommendation architecture with a feature store looks like below.

Building Reusable and Scalable ML Services to Enable Rapid Development in our Health and Wellness Marketplace (by Genna Gliner - Machine Learning Engineer & Brandon Davis - Machine Learning Engineer, Mindbody)
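
For a flavor of the kind of data-quality check great_expectations enables (a generic sketch using its classic Pandas API, not Mindbody's actual expectation suite):

```python
import pandas as pd
import great_expectations as ge

# Hypothetical feature table pulled from the offline side of the feature store.
features = pd.DataFrame({
    "consumer_id": [1, 2, 3],
    "classes_booked_30d": [4, 0, 12],
})

ge_df = ge.from_pandas(features)

# Each call returns a result that reports whether the expectation passed.
print(ge_df.expect_column_values_to_not_be_null("consumer_id"))
print(ge_df.expect_column_values_to_be_between("classes_booked_30d", min_value=0, max_value=1000))
```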

For the second pain point, they leveraged the MLflow model registry - a centralized model store to collaboratively manage the full model lifecycle of an MLflow model. It provides model lineage, model versioning, stage transitions, and annotations. Using a model registry enables the sharing of models across services while maintaining traceability and performance. The new recommendation architecture with a model registry looks like below.

Building Reusable and Scalable ML Services to Enable Rapid Development in our Health and Wellness Marketplace (by Genna Gliner - Machine Learning Engineer & Brandon Davis - Machine Learning Engineer, Mindbody)
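
In code, registering and promoting a model with the MLflow model registry looks roughly like this (the model name and run ID are hypothetical, not Mindbody's setup):

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged by a training run under a shared, versioned name.
model_uri = "runs:/<run_id>/model"  # placeholder run ID
result = mlflow.register_model(model_uri, "virtual-class-recommender")

# Promote that version to Production while keeping lineage and annotations.
client = MlflowClient()
client.transition_model_version_stage(
    name="virtual-class-recommender",
    version=result.version,
    stage="Production",
)
```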

For the third pain point, they used Apache Airflow as a centralized orchestration tool to programmatically author, schedule, and monitor workflows written as pipelines represented as DAGs. The new recommendation architecture with an orchestration tool looks like below.

Building Reusable and Scalable ML Services to Enable Rapid Development in our Health and Wellness Marketplace (by Genna Gliner - Machine Learning Engineer & Brandon Davis - Machine Learning Engineer, Mindbody)
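
A minimal Airflow DAG for this kind of recurring training workflow might look like the following (the task names and schedule are hypothetical, not Mindbody's actual pipelines):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_features():
    pass  # e.g., materialize features into the feature store


def train_and_register_model():
    pass  # e.g., train the recommender and register it in MLflow


with DAG(
    dag_id="recommender_training",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    features = PythonOperator(task_id="build_features", python_callable=build_features)
    train = PythonOperator(task_id="train_and_register_model", python_callable=train_and_register_model)
    features >> train  # train only after the features are refreshed
```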

The biggest highlight here is their mindset shift from project-focused to platform-focused: Given any new project, how can they solve the problem in a way that fits into the larger ML platform?

By taking the time to invest in their MLOps framework using free open-source tools, the AI/ML team at Mindbody was able to achieve: (1) better maintainability of their ML services, (2) centralized, consistent, and improved data quality, and (3) reusable and scalable shared components. The key result of this platform rationalization/consolidation is that they were able to dramatically reduce the lifecycle development of launching a new recommender service by over 85% (from a 9-month release cycle to a 1-month release cycle).

3 - Shopping Recommendations at Pinterest

Millions of people across the world come to Pinterest to find new ideas every day. Shopping is at the core of Pinterest’s mission to help people create a life they love. In her talk, Sai Xiao introduced how the Pinterest shopping team has built its related-products recommendation systems — including engagement-based and embedding-based candidate generation, as well as the indexing and serving methods that support multiple types of recommenders and their deep neural network ranking models.

At a high level, there are three categories of Pinterest products: the Home Feed, which gives people ideas based on their interests; Search, which gives people ideas based on keyword searches; and Related Pins, which give people ideas similar to a pin they are looking at. More than 478M people worldwide come to Pinterest each month, saving 300B+ pins and creating 6B+ boards. From these collected boards and pins, the Pinterest team found a lot of shopping intent: 83% of Pinners have made a purchase based on content from a brand they saw on Pinterest.

Shopping Recommendations

Shopping is growing very fast at Pinterest. The number of Pinners engaging with shopping surfaces grew 215% in 2020, and product searches grew 20x in the same year. Furthermore, Shopping is now a global feature, available in countries like the US, the UK, Germany, France, Canada, and Australia. Powering the shopping capability are a handful of recommenders, including shop similar, more from this brand, items under $X, people also view, people also bought, etc.

Here are a few challenges with building recommender systems at Pinterest that Sai outlined:

  • Massive scale: Given hundreds of billions of product pins in the catalog, they need to serve recommendations to hundreds of millions of Pinners. Thus, there are significant scalability challenges.

  • Diverse feature data types (user profile, text, image, video, graph): The recommender must adapt to these diverse data types, which get richer over time.

  • Balancing objectives (engagement, relevance, diversity): While optimizing the recommender for engagement, they also need to balance out other factors such as relevance, diversity, and inclusiveness.

  • The need to develop and launch new recommenders quickly: No single recommender fits every shopper intent, so they need to build multiple recommenders to accommodate different intents.

  • Whole page optimization: Given multiple recommenders, they want to optimize the whole page to decide which recommendations to show and where to display them.

Shopping Recommendations at Pinterest (by Sai Xiao - Machine Learning Engineer, Pinterest)

The diagram above depicts the back-end system for Pinterest recommenders:

  • They have built a huge Product Catalog, which includes both merchant-ingested product pins and user-provided product pins to power all the shopping use cases on Pinterest. This catalog is the backbone of all the shopping features.

  • The system will consume the Product Catalog to retrieve and provide different types of recommenders (Related Products, Query-Based Recs, Collaborative Filters).

  • The Shopping Page Optimization mechanism decides which recommender to show on top. This significantly improves the user's shopping experience and reduces the cost of showing the user too many recommenders.

  • This whole stack of techniques supports multiple shopping services, such as Product Details Page, ShopTab on Search, and ShopTab on Board.

Shopping Recommendations at Pinterest (by Sai Xiao - Machine Learning Engineer, Pinterest)

Sai then introduced the technical details of how Pinterest built similar-product recommendations. This system takes the query product and the user as inputs. The outputs are a set of products that are similar to the query product and that match the user’s preferences. The whole system can be broken into two parts: a candidate generator and a ranker.

  • The candidate generator takes in the query product and the user. It then retrieves a sufficient number of the most similar candidates from the catalog. The goal is to narrow the hundreds of millions of items in the catalog down to approximately a thousand candidates.

  • The ranker applies a ranking algorithm to rank the candidates and select the top-K candidates. They also apply relevant scaffolds (such as gender match, broad category match, trust and safety checks) to trim out irrelevant pins before serving those pins to the user.

Candidate Generator

They built the candidate generator from several sources, including Memboost (historical engagement data), the pin-board graph, and embedding similarity (an ML model trained on historical engagement data). The last type performed best. There are two main phases to generating embedding-based candidates (as seen in the diagram below):

  • During the offline workflow, they used a deep learning model to generate product embeddings for the whole catalog and saved them into an embedding store.

  • During the online serving phase, they looked up the embedding for the incoming query and did a nearest-neighbor search in the embedding store to find the top-k candidates.

Shopping Recommendations at Pinterest (by Sai Xiao - Machine Learning Engineer, Pinterest)

Given the workflow described above, a challenge is how to build unique embeddings for billions of pins and millions of products in a scalable way.

Pinterest has lots of graph data, so they leveraged their giant Pin <> Board graph to address this challenge. This graph is basically a collection of visual bookmarks of online content (pins) organized into boards by users.

  • More specifically, they modeled the Pinterest environment as a bipartite graph, consisting of nodes in two disjoint sets (Pins and Boards). The benefit of using this graph is that they can borrow information from the nearby nodes in the Pin <> Board graph. Thus, the resulting embedding of this node will be more accurate and more robust.

  • They developed a Graph Convolutional Network algorithm called PinSage, utilizing both context features (image, text, and other product attributes) and graph features. PinSage was trained on millions of engagement records (used as labels) with a standard triplet loss.

  • To scale the training and inference of embeddings to billions of pins and 100+ million products, they performed a random walk-based neighborhood sampling and represented nodes via context features for inductive inference (Rex Ying et al., KDD 2018).

  • To search for relevant embeddings, they leveraged Pinterest’s backend search engine as the distributed serving platform for approximate nearest neighbors. They retrieved candidates using a state-of-the-art method called Hierarchical Navigable Small World Graph. Finally, they performed batch indexing via Hadoop and near real-time indexing for fresh pins.
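
As an illustration of this style of approximate nearest-neighbor retrieval, here is a small sketch using the open-source hnswlib package (an HNSW implementation); it is a generic example, not Pinterest's in-house serving system:

```python
import numpy as np
import hnswlib

dim = 256
num_items = 100_000  # stand-in for the product corpus
embeddings = np.random.rand(num_items, dim).astype(np.float32)  # placeholder product embeddings
ids = np.arange(num_items)

# Build the HNSW index offline.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_items, ef_construction=200, M=16)
index.add_items(embeddings, ids)
index.set_ef(200)  # query-time accuracy/speed trade-off

# Online: look up the query pin's embedding and fetch its nearest neighbors.
query_embedding = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query_embedding, k=100)
```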

How did Pinterest index and retrieve candidates efficiently?

Shopping Recommendations at Pinterest (by Sai Xiao - Machine Learning Engineer, Pinterest)

As seen in the indexing framework above, Attribute represents the property of each search document. Each document in the product corpus has a GraphSage embedding associated with it (alongside other metadata properties). At a high level, this framework (1) manages the conversion of a corpus into search documents and (2) merges the documents’ attributes for candidate filtering. Two big benefits here are: (1) they could include multiple candidate sources in one index, and (2) attributes are shareable across candidates. This allowed them to define attributes once for all the candidate sources, simplifying query composition at serving time and increasing engineering velocity.

Besides embedding-based candidates, Pinterest also has engagement-based candidates. These include query and candidate pairs aggregated from the historical user engagement in the last year, in the form of (query pin -> a list of candidate pins). For each query pin, they have a list of historically engaged pins as candidates. To incorporate these engagement-based candidates into the indexing framework described above, they converted the original list of candidate pins to a list of query pins. This latter list becomes an attribute for the search document to merge.

This flexible indexing framework enabled Pinterest to onboard new recommenders quickly. For example, by configuring different attributes in the query constructor, they can build different types of recommenders (given that their product corpus already contains all the required metadata, such as domain, price, category, popularity rank, etc.). If they specify a brand restriction in the query, the retrieved candidates will only come from that brand. Therefore, they can build a recommender such as “More From This Brand.”

Shopping Recommendations at Pinterest (by Sai Xiao - Machine Learning Engineer, Pinterest)

Ranker

The ranker is an ML model that predicts the probability of engagement given the query pin and the user. Engagement is a vague concept because there are different types of engagement such as save, close-ups, clicks, long clicks, conversion, etc. How can they build a personalized ranker to handle such vagueness?

They built a multi-headed deep learning model for this task:

  • Multiple features were used, including GraphSage embedding, user embedding, token embedding, text match, etc.

  • They used AutoML to construct the model by grouping all these features and crossing them in the fully connected layers. Pinterest’s AutoML is a self-contained deep learning framework that powers feature injection, feature transformation, model training, and model serving.

  • There are four output heads, each corresponding to one engagement type (save, closeup, clicks, long clicks). Note that the four outputs borrow information from each other through the shared previous layers, which in turn reduces overfitting.

  • Then, they applied calibration steps because they did negative downsampling in the training data. After these calibration steps, the scores of the four heads can be interpreted as the probabilities of each engagement type.

  • In the final step, they combined the predicted scores using a utility function to generate a single engagement score to rank the candidates. The motivation here is to decouple the business logic from the engagement prediction, so they could adjust the weight in the utility function to accommodate business requirements (without retraining the model).
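
Here is a toy sketch of such a multi-head ranker in TensorFlow/Keras, with a shared trunk, four sigmoid heads, and the utility function kept outside the model; the feature size and weights are made up for illustration and this is not Pinterest's AutoML framework:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Concatenated query/user/candidate features (hypothetical size).
features = tf.keras.Input(shape=(512,), name="features")
x = layers.Dense(256, activation="relu")(features)
x = layers.Dense(128, activation="relu")(x)  # shared trunk lets the heads borrow information

head_names = ["save", "closeup", "click", "long_click"]
outputs = {name: layers.Dense(1, activation="sigmoid", name=name)(x) for name in head_names}

model = tf.keras.Model(inputs=features, outputs=outputs)
model.compile(optimizer="adam", loss={name: "binary_crossentropy" for name in head_names})

# Business logic stays outside the model: combine the (calibrated) head scores
# with adjustable weights into a single utility used for ranking.
weights = {"save": 1.0, "closeup": 0.2, "click": 0.5, "long_click": 0.8}  # hypothetical weights


def utility(scores: dict) -> float:
    return sum(weights[name] * scores[name] for name in head_names)
```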

Sai concluded the talk by emphasizing that Pinterest’s state-of-the-art recommendation technology, with embedding-based candidates and a multi-head neural network ranker, has significantly improved their engagement metrics. Ultimately, Pinterest wants to build a unique shopping experience that helps people find inspiration, ideas, and products to purchase.

4 — Diagnosing Failure Modes in Your ML Organization

Over the past decade, Yelp has scaled its ML reliance from fringe usage by a handful of enthusiastic developers to a core competency leveraged by many teams of dedicated experts to deliver tens of millions of dollars in incremental revenue. They have experimented with several organizational structures and ML processes and observed several categories of pitfalls in that time. Jason Sleight and Daniel Yao discussed how to diagnose several of these pitfalls — especially as related to ML project velocity, ML practitioners’ happiness & retention, and ML adoption to broader parts of business objectives.

For context, Yelp’s mission is to connect people with great local businesses. Its platform has close to 31M unique app devices (monthly average for 2020) and 224M cumulative reviews (as of December 31, 2020). Yelp has many ML use cases, such as Home Screen Recommendations, Ads, Search Ranking, Wait Time Estimates, and Popular Dish Identification.

Yelp’s ML organization has hundreds of ML models in production, with a wide range of sophistication (from manual heuristics up to deep networks), operational needs (from offline analytics up to real-time, high-QPS requirements), and business objectives (content models vs. consumer models vs. strategic models). Yelp started hiring dedicated ML developers in 2015 and has approximately 75 regular ML practitioners as of 2021. They have experimented with a few organizational structures over the years to answer one question: “Is your ML organization ineffective?”

Jason reframed the question by walking through three pitfalls that make an ML organization ineffective.

Pitfall 1: Low ML project velocity

The first pitfall is that your ML projects take a long time. In practice, you should be able to:

  1. Do an entire ML project for a brand new domain in 1 month: Avoid creating the perfect model. After a few weeks, you probably have something that “works,” so deploy that and start getting value. You can always build a second iteration if there is still room to improve.

  2. Add a new feature to an existing model in 1 week: This is mostly determined by data availability, quality, and discoverability. Ideally, you have a shared platform for ML features that can be reused across models (so that one can build a new feature and try it in many models to amortize the cost). You should separate your feature engineering code from your model training code, then parameterize your model training code so it is agnostic to the set of input features.

  3. Retrain an existing model on fresh data in 1 day: Your model training code should be checked into version control and parameterized for different data ranges.

  4. Deploy a new model version to production in 1 hour: Your model platform should hide the complexity between training and serving. Each model needs its own unique ID (so deploying a new model means deploying a new ID). At Yelp, they use a workflow combining Spark, MLflow, and MLeap.

  5. Know your model’s current performance in 1 minute: Yelp relies on MLflow’s UI to instantly know the model’s versions, parameters, and relevant metrics.
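
As a small illustration of the kind of MLflow tracking that makes the last checkpoint possible (a generic sketch with placeholder names and values, not Yelp's code):

```python
import mlflow

mlflow.set_experiment("review-suggestion-model")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("model_type", "xgboost")
    mlflow.log_param("train_date_range", "2021-01-01/2021-06-30")
    mlflow.log_metric("auc", 0.87)       # placeholder values; the MLflow UI then shows
    mlflow.log_metric("log_loss", 0.31)  # versions, parameters, and metrics at a glance
```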

Pitfall 2: Difficulty scaling impact to other business areas

The second pitfall is that you found it difficult to scale your ML projects to impact other business areas. You will never have enough experts to apply ML to all of your problem opportunities, even if you have amazing ML velocity. Fundamentally, this is a skills mismatch issue — where you can apply ML to a huge number of problems, but ML expertise is hard to acquire.

Diagnosing Failure Modes in your ML Organization (by Jason Sleight - Group Technical Lead: ML Platform & Daniel Yao - Director of Applied Machine Learning, Yelp)

Jason presented a case study in which Yelp moved through different checkpoints on the spectrum above (at one end, experts lead multiple projects on different teams; at the other end, teams use only simple techniques). In 2017, Yelp decided to build an ML model, called Preamble, to suggest businesses that Yelp consumers could review. They already had a heuristic system based on simple features (having uploaded a photo, having called the business, etc.), which accounted for roughly 10% of total reviews at the time. The team owning this system had no ML experience.

  • In the first try, the Preamble team built an ML model with some expert consulting but very minimal ML expert involvement. They developed a simple linear model to replace a heuristic weighted sum. The result was that this model was not shipped due to poor counterfactual performance. The key takeaways were that (1) the lack of expert involvement introduced a lot of confusion when the model didn’t perform well on live traffic, and (2) the Preamble team wasn’t equipped to understand what went wrong.

  • In the second try, an ML expert from outside the Preamble team built the ML model (still a linear one), re-engineering some feature representations and owning the entire process from start to finish. The result was that this model was shipped with a 5% improvement to Yelp’s business KPIs. The key takeaways were that (1) direct expert involvement led to recognizing the need for feature representation changes; and (2) the Preamble team was not well equipped to own the model long term.

  • In the third try, an ML expert from another team led the modeling project while collaborating with 2–3 engineers from the Preamble team for several months. The whole group engineered more features and switched to an XGBoost model. The result was that they shipped a sequence of models for another 10% incremental performance improvement. The key takeaways were that (1) incremental rollouts kept the impact high; and (2) having an ML expert participate in a sustained project allowed the team to understand best practices.

  • Today, the Preamble team is comfortable owning ML initiatives without external assistance. They continue to iterate on this model themselves and have applied ML to other problems in their team’s space.

Pitfall 3: Difficulty in building up a staff of experienced ML professionals

Diagnosing Failure Modes in your ML Organization (by Jason Sleight - Group Technical Lead: ML Platform & Daniel Yao - Director of Applied Machine Learning, Yelp)

The third pitfall is running into challenges when building up a staff of experienced ML professionals. Daniel went over Yelp’s journey from a structure of decentralized product teams to a centralized/hybrid Applied ML team:

  • In a decentralized format, there is a lack of mentorship and community for ML engineers. In a centralized format, senior ML engineers are more accessible, and it’s significantly easier to build out a common pool of knowledge (processes and best practices).

  • In a decentralized format, Yelp only hired a generalist role called data mining engineers, which led to disparate hiring practices. In a centralized format, Yelp now has specified roles (applied scientists, ML engineers, data backend engineers, ML platform engineers) and an established group effort to improve recruiting.

  • In a decentralized format, Yelp’s product leaders did not necessarily champion the value of ML, which led to downstream implications of product design and project staffing. In a centralized format, there are naturally stronger partnerships between business and product teams.

In brief, issues in a decentralized format culminated in low engineer retention and frequent backfilling. Thanks to the organizational restructure to a centralized format, Yelp now has a larger, more-tenured Applied ML team with a higher bus factor.

5 — Systematic Approaches and Creativity: Building DoorDash’s ML Platform During the Pandemic

As DoorDash’s business grows, it is essential to establish a centralized ML platform to accelerate the ML development process and power the numerous ML use cases. Hien Luu (Senior Engineering Manager) and Dawn Lu (Senior Data Scientist) detailed the DoorDash ML platform journey during the pandemic, which includes (1) the way they established a close collaboration and relationship with the Data Science community, (2) how they intentionally set the guardrails in the early days to enable them to make progress, (3) the principled approach of building out the ML platform while meeting the needs of the Data Science community, and (4) finally the technology stack and architecture that powers billions of predictions per day and supports a diverse set of ML use cases.

DoorDash Marketplace

DoorDash’s mission is to grow and empower local economies. They accomplish this through a set of products and services: (1) Delivery and Pickup where customers can order food on demand, (2) Convenience and Grocery where customers can order non-food items, and (3) DashPass Subscription that is similar to Amazon Prime. DoorDash also provides a logistics platform that powers delivery for notable merchants like Chipotle, Walgreens, and Target.

Systematic Approaches and Creativity: Building DoorDash's ML Platform During the Pandemic (by Hien Luu - Sr. Engineering Manager & Dawn Lu - Data Scientist, DoorDash)

The DoorDash platform is a three-sided marketplace with a flywheel effect. As there are more consumers and more orders, there are more earning opportunities for dashers, which increases delivery efficiency and speed for the marketplace. As there are more consumers and increasing revenue for merchants, merchants bring more selection for consumers to choose from. This flywheel drives growth for merchants, generates earnings for dashers, and brings convenience to consumers.

ML at DoorDash

Dawn next explained how ML is incorporated into the four steps of the typical food order lifecycle for a DoorDash restaurant delivery:

  1. Creating order: When a customer lands on DoorDash’s homepage, DoorDash uses recommendations to surface the most compelling options and provide a personalized food ordering experience, considering factors like past order history, preferences, and current intent. If the customer doesn’t see an option that he/she finds compelling and decides to use the search feature, DoorDash uses search and ranking models to show the most relevant results and rank them by most appealing based on the speed and quality of those restaurants. Promotions is a service that DoorDash offers for its merchants to help them grow their sales by providing targeted promotions for certain items at certain times of the day.

  2. Order checkout: Once all the food items have been added to the cart, the customer reaches the checkout page. Estimated times of arrival (ETAs) are important for setting consumer expectations for when their food will arrive. ETAs are predicted based on several factors, such as the size of the order, historical store operations, and the number of available dashers on the road. At the payment stage, DoorDash uses a series of fraud detection models to identify fraudulent transactions (account takeovers, stolen credit cards, etc.) and trigger certain actions, such as asking the customer for a second layer of authentication.

  3. Dispatching order: The logistics engine takes over during this step. The engine sends the order to merchants so they can start preparing the food. The engine also assigns the order to dashers so they can pick up the food and deliver it to the customer. The primary job of DoorDash’s logistics engine is to dispatch the right dashers at the right time for the right delivery in order to maximize marketplace efficiency. For that purpose, DoorDash relies on several ML-based predictions, such as estimated food preparation times and travel times. Another challenge in this step is that food orders emerge on the fly. Quality plays a huge role in the delivery, so DoorDash needs to strike the right balance between efficiency, speed, and quality for the customers.

  4. Delivering order: In the last step, the dashers take over. First, DoorDash needs to figure out how to invest its marketing dollars and acquisition channels in order to bring enough dashers onto the platform. Once the dashers are onboarded, DoorDash needs to figure out how to mobilize them at the right time (due to spikes during holidays, local events, etc.). DoorDash accomplishes that by predicting demand in advance and identifying the right incentives to mobilize the right number of dashers. For the delivery assignment, DoorDash matches the optimal dashers to the deliveries to ensure that dashers get more done in less time and consumers receive their orders quickly. Finally, to ensure that the food is successfully delivered to the customer’s doorstep, DoorDash asks the dashers to take photos of the food being delivered and uses a series of deep learning models to identify whether the drop-off photos have a door in them (a strong indicator that the food was actually delivered).

ML Platform Journey

Systematic Approaches and Creativity: Building DoorDash's ML Platform During the Pandemic (by Hien Luu - Sr. Engineering Manager & Dawn Lu - Data Scientist, DoorDash)

DoorDash’s centralized ML platform team serves the needs of different ML teams. The platform is built upon these four pillars: feature engineering, model training and management, model prediction, and ML insights. Hien dug deeper into the two pillars of feature engineering and model prediction.

  • For feature engineering, the platform handles historical features and real-time features. The historical features are generated from the data warehouse and the data lake. There is a feature service that uploads billions of features onto a feature store. The real-time features are handled via an Apache Flink framework that serves them to the feature store.

  • For model prediction, the platform has a centralized prediction service based on Kotlin and C++. Deploying models is fairly trivial via runtime configurations. The prediction results are monitored, controlled for quality, and logged back into the data lake.

Scaling Challenges

Initially, the platform team built a centralized prediction service with a strong foundation that works and scales out as needed. The service could support roughly 15k peak predictions per second. They took on a critical, visible use case — Search and Ranking, which has very high QPS and low latency requirements. While scaling out the service to support this use case, the platform team realized that the cost had become unreasonable. As a result, they decided to move the sub-services with high traffic into their own clusters and scale those clusters accordingly. With that simple sharding approach, their prediction service could support up to 30k peak predictions per second.

Systematic Approaches and Creativity: Building DoorDash's ML Platform During the Pandemic (by Hien Luu - Sr. Engineering Manager & Dawn Lu - Data Scientist, DoorDash)

Next, they took on a more complex use case — Recommendations on the Homepage, which has even higher QPS and tighter latency requirements than Search and Ranking. Scaling out was no longer an option due to cost and other constraints. So how could they increase QPS and decrease latency for this service while maintaining a reasonable cost? By digging deep into every component of the service and taking a systematic approach, they ended up optimizing both the prediction service and the feature store:

  • For the prediction service, they applied smart load-balancing to avoid hot spots, used the Zstandard algorithm to compress the payload, and removed log statements that could safely be dropped.

  • For the feature store, they redesigned the data structure used to store the features. Instead of storing features as a flat list of key-value pairs, they used the hash map data type in Redis, grouping related key-value pairs into a single hash to reduce the number of top-level key-value lookups. Additionally, they compressed complex features using a similar algorithm to the one used for payload compression. As a result, they improved CPU efficiency and reduced memory footprint and cost.
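
A rough sketch of this feature-store layout, using redis-py hashes and Zstandard compression (the keys, fields, and values are hypothetical, not DoorDash's schema):

```python
import json
import redis
import zstandard as zstd

r = redis.Redis(host="localhost", port=6379)
compressor = zstd.ZstdCompressor()
decompressor = zstd.ZstdDecompressor()

# One hash per entity instead of one flat key per feature, so related
# features can be written and fetched together.
consumer_key = "consumer:12345"  # hypothetical key scheme
r.hset(consumer_key, mapping={
    "avg_order_value": 23.4,
    "orders_last_30d": 7,
    # Large, complex features get serialized and compressed before storage.
    "embedding": compressor.compress(json.dumps([0.12, -0.8, 0.33]).encode()),
})

features = r.hgetall(consumer_key)  # a single round trip for the whole group
embedding = json.loads(decompressor.decompress(features[b"embedding"]))
```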

Thanks to the aforementioned optimizations, DoorDash’s upgraded prediction service could support up to 6M peak predictions per second and successfully onboard the recommendations use case.

Lessons Learned and Future Work

Here are the lessons that DoorDash’s ML platform team shared throughout this journey:

  1. Scaling out is a good option to start with, but it doesn’t always work when the traffic volume keeps increasing along with the business needs.

  2. Customer obsession is crucial. In addition to collaborating closely with the data science community, they have striven to understand the business problems they were trying to solve to prioritize projects.

  3. The platform was built incrementally, allowing the team to receive feedback and make adjustments along the way.

For future work, DoorDash will be looking at:

  • Caching solutions for feature values (instead of prediction values) to enable more micro-service optimizations.

  • Generalized model serving for use cases in NLP and Image Recognition.

  • Unified prediction client that allows easy prediction requests.

6 — Multi-Armed Bandits in Production at Stitch Fix

Multi-armed bandits have become a popular method for online experimentation, which can often out-perform traditional A/B tests. Brian Amadio (Data Platform Engineer at Stitch Fix) explained the challenges to scaling multi-armed bandits and the solutions he proposed for the Stitch Fix experimentation platform.

Experimentation at Stitch Fix

Data science is behind everything that Stitch Fix does. The Algorithms organization has 145 data scientists and platform engineers spread across 3 main verticals (merchants and operations, client, styling and customer experience) under a central Algorithms platform.

Multi-Armed Bandits in Production at Stitch Fix (by Brian Amadio - Data Platform Engineer, Stitch Fix)

Stitch Fix's high-level experimentation platform architecture is shown above:

  • An experiment owner can be anyone (data scientist, engineer, PM) who wants to run an experiment.

  • In the user interface, they configure an experiment using the experiment configuration service, specifying a parameter that they want to randomize.

  • The randomization service has an endpoint, where client applications can request the parameter values. These requests (which can be used to measure experiment outcomes) eventually end up in a data warehouse.

  • Stitch Fix has batch ETLs running on top of the warehouse to store outcome metrics for experiments. Another analytics service computes metrics like p-values and displays them in the UI. 

Multi-Armed Bandits

An example use case of multi-armed bandits at Stitch Fix is choosing which landing page to show when someone visits StitchFix.com. A multi-armed bandit (MAB) is essentially a smarter A/B test. While an A/B test returns a one-time, manual decision after all data is collected, a multi-armed bandit returns continuous, automated decisions while collecting data. Using MABs, you can earn while you learn, ditch bad variants sooner, and swap in new variants on the fly.

Multi-Armed Bandits in Production at Stitch Fix (by Brian Amadio - Data Platform Engineer, Stitch Fix)

A MAB agent consists of a reward model and an action selection strategy. The reward model tracks the current best estimate of the reward (a function of the outcome for showing a page to the visitor). The reward is based on some context (features that you know about the visitor). Based on those reward estimates, you have a strategy saying: given what you know so far, which potential action should you take next? After observing the actual outcomes, you feed new information back into the model to update the reward estimates.
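
To make the agent structure concrete, here is a minimal Bernoulli-reward bandit that uses Beta posteriors as the reward model and Thompson sampling as the action-selection strategy (a textbook sketch, not Stitch Fix's implementation):

```python
import numpy as np


class BetaThompsonAgent:
    """Reward model: a Beta posterior per arm. Strategy: Thompson sampling."""

    def __init__(self, n_arms: int):
        self.successes = np.ones(n_arms)  # Beta(1, 1) priors
        self.failures = np.ones(n_arms)

    def select_arm(self, rng: np.random.Generator) -> int:
        # Sample a plausible reward for each arm and act greedily on the samples.
        samples = rng.beta(self.successes, self.failures)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: int) -> None:
        # Feed the observed outcome back into the reward model.
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward


rng = np.random.default_rng(42)
agent = BetaThompsonAgent(n_arms=3)  # e.g., three landing-page variants
arm = agent.select_arm(rng)
agent.update(arm, reward=1)  # the visitor converted
```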

Scaling Bandits

There are three big challenges in putting MABs into production and making them scalable:

  1. Randomness: most common bandit strategies are non-deterministic. They are stateful and hard to test, which causes a lack of repeatability.

  2. Wide variety of models: use cases can have very different requirements, depending on context features, model training time, rate of outcome data, and reward model output.

  3. Continuous updates: theoretical bandit solutions update the reward model after every action. This requires a synchronous cycle, which is a performance nightmare.

Here are ways in which Stitch Fix handles these scalability challenges:

  1. For randomness, they already knew how to tame randomness in A/B tests with a deterministic sampler: in an A/B test, the selection probabilities are fixed in advance, whereas most MAB strategies are inherently non-deterministic. They figured that any bandit strategy could recover such determinism if the selection probabilities can be computed from the reward estimates. Ultimately, they developed a novel approach for deterministic Thompson sampling and reused their A/B testing methods (a rough sketch of this idea follows the list).

  2. For model complexity, they designed the right separation of concerns using standardized interfaces and an upgraded platform for model versioning, service generation, and deployment.

  3. For continuous updates, they relied on micro-services, which allow for continuous independent updating of reward models. The big tradeoff is that reward micro-services can only make batched updates (not streaming updates).
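
Here is a rough Python sketch of the deterministic-sampling idea from the first point: estimate each arm's Thompson selection probability from the reward posteriors, then map each visitor to an arm with a stable hash. This is only an illustration of the concept, not the Mab library itself:

```python
import hashlib
import numpy as np


def selection_probabilities(successes, failures, n_samples=10_000, seed=0):
    """Estimate, per arm, the probability that Thompson sampling would pick it,
    via Monte Carlo over the Beta posteriors."""
    rng = np.random.default_rng(seed)
    draws = rng.beta(successes, failures, size=(n_samples, len(successes)))
    wins = np.bincount(draws.argmax(axis=1), minlength=len(successes))
    return wins / n_samples


def deterministic_choice(visitor_id: str, probabilities) -> int:
    """Map the visitor to a stable point in [0, 1) and walk the cumulative
    probabilities, so the same visitor gets the same arm for a given set of
    reward estimates."""
    digest = hashlib.sha256(visitor_id.encode()).hexdigest()
    u = int(digest[:8], 16) / 0x100000000
    return int(np.searchsorted(np.cumsum(probabilities), u))


probs = selection_probabilities(np.array([12.0, 30.0, 5.0]), np.array([88.0, 70.0, 95.0]))
arm = deterministic_choice("visitor-42", probs)
```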

Multi-Armed Bandits in Production at Stitch Fix (by Brian Amadio - Data Platform Engineer, Stitch Fix)

The end solution is Mab, an open-source library for deterministic multi-armed bandit selection strategies. It provides efficient pseudo-random implementations of epsilon-greedy and Thompson sampling strategies. Arm-selection strategies are decoupled from reward models, allowing Mab to be used with any reward model whose output can be described as a posterior distribution or point estimate for each arm.

7 - Challenges for ML Operations in a Fast Growing Company

Udemy is a leading global marketplace for teaching and learning, connecting millions of students to the skills they need to succeed. As of December 2020, Udemy has more than 40M learners, 56K instructors, 155K courses, 480M course enrollments, 115M minutes of videos, 65+ languages, and 7,000+ enterprise customers. Such multi-faceted growth created different scaling challenges for Udemy:

  • Data growth requires an architectural evolution. They need to develop a more scalable ML platform to handle such a massive amount of data.

  • Organizational growth requires tooling and process improvements to efficiently serve the needs of different parts of the organization despite their varying requirements. Furthermore, it also leads to an increased focus on developer and data science ergonomics.

  • The increase in product complexity requires a scalable platform to train and execute different types of ML models in real-time or batch. It also requires improvements in tooling and processes to improve efficiency for product and engineering teams.

Gulsen Kutluoglu (Director of Engineering) and Sam Cohan (Principal ML Engineer) presented the ML platform team's best practices and outstanding problems in Udemy’s current state.

Architectural Evolution

Given the dramatic increase in data science initiatives within a short period of time, they needed a more scalable platform than before. When there were only 1M users, they could train simple models using local environments and execute simple models in real-time for every request by directly reading features from MySQL.

However, what happened when there were 40M users and 480M course enrollments?

  • They had to move most of their model execution and feature computation pipelines from real-time to batch (else, there would be many latency problems).

  • They still had some models that needed to execute in real-time, such as search and personalized ranking.

Challenges for ML Operations in a Fast Growing Company (by Gulsen Kutluoglu - Director of Engineering & Sam Cohan - Principal ML Engineer, Udemy)

At this point, they realized that it’s time for them to build generic components and increase reuse, given that multiple teams were working on ML products. A dedicated ML platform team was created to take over this responsibility.

Tooling and Process Improvements

There are quite a few potential pitfalls associated with any explosive growth on the organizational side. Given the lack of established practices on any new projects, new people might reinvent the wheel or have varying product delivery paces and qualities. There is also a steep learning curve: If someone wants to take over a project, it will be hard to understand what’s going on without a unified approach. That, in turn, causes siloed work and limited sharing of institutional knowledge.

In most organizations, the development environment and the production environment are two completely separate worlds! Typically, data scientists write ad-hoc code and manually track their research in notebooks in the development environment. Their main goal is to take a business problem and apply data techniques to solve it. As a result, they are less concerned with what’s going to happen in production. It takes a lot of effort to convert that ad-hoc code into the production-quality code required in the production environment, given concerns about model reliability, testability, and maintainability.

To close this gap, Udemy’s ML platform team came up with this list of requirements: (1) a well-defined code structure, (2) reusable components, (3) a process for automated formatting, linting, and testing, (4) a process for code versioning and review, and (5) ephemeral execution environments (essentially mirroring the development and production environments).

More specifically, here are the lessons they have learned to improve the tooling and processes:

  1. Standardize the development environment: Initially, they did not endorse any specific development environment and used a mix of laptop environments and shared cloud notebooks instead. Now, they use infrastructure-as-code (via Terraform) and automation scripts to set up a personal cloud Jupyter instance for each team member. Furthermore, they demonstrated workflows that treat Jupyter as an IDE with the help of its magic functions (%%writefile); see the sketch after this list.

  2. Standardize the design patterns and code structure: Initially, they had operators and rigid structures for ETL tasks but left the application code open to developer preferences. Now, they have created example projects that make use of reusable components and demonstrate how to approach common use cases (e.g., conversion rate modeling, clustering, etc.). Additionally, they created various training materials on coding style, best practices, and high-level design patterns.

  3. Make use of ephemeral execution environments: Initially, they had a shared execution environment with shared dependencies. Now, they use ephemeral execution environments to decouple dependencies and avoid resource contention. The code and environment are mirrored between development and production.

  4. Have an explicit code review process and stick to it: Initially, they had an implicit code review process without clear guidelines on expectations. Now, they have codified explicit documents and training materials on the code review process.

  5. Testing is not optional: Initially, they had no requirements for writing tests. Now, they have mandatory test requirements with visibility into code coverage.

  6. Post-release monitoring is not optional: Initially, they relied on alert emails that most people filtered away and ignored, and the responsibility for making sure things worked as expected was only implicitly shared. Now, they have cleaned up the spam notifications and hooked up a support channel on Slack for better visibility, and there is an explicit on-call schedule and a process for documenting issues so the team can learn from them.
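
As a tiny illustration of the "Jupyter as an IDE" workflow combined with mandatory tests (the module and test below are hypothetical, not Udemy's code):

```python
# conversion_features.py -- persisted from a notebook cell that starts with:
#   %%writefile conversion_features.py
def bucket_price(price: float) -> str:
    """Toy feature transform used in a conversion-rate model."""
    return "low" if price < 20 else "high"


# test_conversion_features.py -- a plain pytest test that keeps the module honest
# and counts toward the team's code-coverage requirement.
def test_bucket_price():
    assert bucket_price(9.99) == "low"
    assert bucket_price(49.99) == "high"
```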

Organizational Changes

Data science and engineering collaboration is the key to building complex ML systems. Udemy had separate engineering and data science teams in the past, which led to constant back-and-forth alignment between them. They have since built more cross-functional teams of software engineers, ML engineers, and data scientists, which led to higher efficiency, improved product quality, and decreased development time.

Moreover, agile makes the ML development lifecycle more effective. Using a lightweight version of the Scrum methodology helped them handle unknown unknowns in the ML development lifecycle, stay focused on delivering MVPs instead of getting lost in uncertainty, and, through concepts like iteration, incrementality, and time-boxing, run better processes in general.

8 — MLOps vs. ModelOps: What’s The Difference and Why You Should Care

MLOps and ModelOps are different. Jim Olsen covered how ModelOps not only encompasses the basic model monitoring and tuning capabilities of MLOps, but also includes a continuous feedback loop and a 360-degree view of your models throughout the enterprise, providing reproducibility, compliance, and auditability of all your business-critical models.

MLOps vs. ModelOps

MLOps is an overall method for developing, delivering, and operationalizing ML models. It utilizes techniques from the DevOps world, combined with the unique requirements of ML models, to provide a continuous loop of model development, deployment, and monitoring of performance.

An MLOps architecture is targeted at ML models specifically. It provides monitoring of the model performance and the nature of the data, as well as basic information about the model and what kinds of data went into it. It is heavily biased toward the development and deployment of the models.

On the other hand, ModelOps (based on Gartner’s definition) is a set of capabilities that primarily focuses on the governance and the full life cycle management of all AI and decision models. This includes models based on ML, knowledge graphs, rules, optimization, natural language techniques, and agents. In contrast to MLOps (which focuses only on the operationalization of ML models) and AIOps (which is AI for IT operations), ModelOps focuses on operationalizing all AI and decision models.

MLOps vs. ModelOps – What’s the Difference and Why You Should Care (by Jim Olsen - CTO, ModelOp)

An example ModelOps Architecture is a centralized system for the orchestration of all processes involved in monitoring a model from development to production. It works with the large variety of development environments used across the enterprise. It leverages existing IT infrastructure to deliver services. It supports a variety of runtime environments across the enterprise. It integrates with existing investments in reporting tools to report on business value and compliance to the management level. Given such capabilities, it becomes the conductor of the orchestra across the entire business bringing all aspects of the company together with a single pane of glass into a model. As a result, it assures regulatory compliance with auditable snapshots and proof of compliance.

An Ideal ModelOps Solution

At a high level, a robust ModelOps solution is capable of answering these questions:

  • How many models are in production?

  • Where are the models running? How long have they been in business?

  • Have the models been validated and approved? Who approved them? What tests were run?

  • What decisions are the models making (inference and scoring)?

  • Are the model results reliable and accurate?

  • Are compliance and regulatory requirements being satisfied? Can that be proven?

  • Are models performing within controls and thresholds?

  • What is the ROI for the model?

MLOps vs. ModelOps – What’s the Difference and Why You Should Care (by Jim Olsen - CTO, ModelOp)

Let’s walk through two examples of how a ModelOps solution works in practice.

Assuming that you want to validate the model quality for reproducibility, security, and quality assurance purposes, a ModelOps solution can:

  • Assure that all information necessary to monitor the model is provided, that the model metrics and performance information is provided, and that the model has appropriate documentation.

  • Leverage existing CI/CD enterprise-level checks to ensure compliance, launch security scans to check for vulnerabilities, and handle other checks like library version compliance, etc.

  • Smoke test the model by launching a simple test job to ensure it runs, and by validating the provided datasets against the model to ensure the model and the data match up (a minimal sketch of such a check follows this list).
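To make the smoke-test idea concrete, here is a minimal, hypothetical sketch of such a check in Python. The model path, sample data file, and pickle serialization format are assumptions for illustration, not part of any specific ModelOps product.

```python
import pickle

import pandas as pd


def smoke_test_model(model_path: str, sample_data_path: str) -> None:
    """Quick end-to-end check: the model loads, scores, and returns sane outputs."""
    # Load the serialized model artifact (format is an assumption for illustration).
    with open(model_path, "rb") as f:
        model = pickle.load(f)

    # Validate that the provided sample data matches the model's expected schema.
    sample = pd.read_csv(sample_data_path)
    expected_columns = getattr(model, "feature_names_in_", sample.columns)
    missing = set(expected_columns) - set(sample.columns)
    assert not missing, f"Sample data is missing expected features: {missing}"

    # Score a small batch and check the outputs are well-formed.
    preds = model.predict(sample[list(expected_columns)].head(100))
    assert len(preds) > 0, "Model returned no predictions"


if __name__ == "__main__":
    smoke_test_model("model.pkl", "sample_batch.csv")
```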

Assuming that you want to perform actionable monitoring, a ModelOps solution can:

  • Provide real-time monitoring by detecting problems immediately (statistical, technical, business, risk) per model specification of acceptable performance, orchestrating remediation (retesting, revalidation, model refresh), and using pre-built models to define and standardize monitoring and remediation actions.

  • Develop proactive actions such as automated alerts and notifications, change requests and approvals, and scheduled monitoring for each model.

  • Improve model reliability by eliminating model degradation and increasing model uptime with automated actions.

MLOps vs. ModelOps – What’s the Difference and Why You Should Care (by Jim Olsen - CTO, ModelOp)

Most importantly, ModelOps solutions can make AI/ML model governance more of a reality for your organization. Given their capabilities, you can get reports on business-level metrics (by analyzing the usage of your models and business risk), understand future workload, and share reports across the enterprise (via standardized reporting tools that are customizable to the business).


Technical Talks

9 — Model Monitoring: What, Why, and How

Ensuring that production ML models perform with high efficacy is crucial for any organization whose core product or business depends on ML models (think Slack search, Twitter feed ranking, or Tesla Autopilot). However, despite the significant harms of defective models, tools to detect and remedy model performance issues for production ML models are missing. In fact, according to the McKinsey report on model risk, defective models have led to revenue losses of hundreds of millions of dollars in the financial sector alone.

Manasi Vartak from Verta discussed why model monitoring matters, what we mean by model monitoring, and considerations when setting up a model monitoring system. For context, Manasi did her Ph.D. thesis at MIT CSAIL on model management and debugging, created ModelDB (an open-source system for ML model management and versioning), worked on ML at Twitter, Google, and Facebook, and recently founded Verta (an end-to-end MLOps platform for model delivery, operations, and monitoring) that serves models for top organizations.

Why Monitoring

Models fail all the time, often unexpectedly and silently. Here are well-known causes of failure:

  • Train and test data are frequently different (“distribution shifts”).

  • Underlying data-generating processes change.

  • Bugs and errors are prevalent in models, in ETL pipelines, and in serving code.

  • Models are increasingly consumed by non-experts and in unexpected ways.

  • Models are fragile to adversarial attacks.

As more and more critical applications depend on models, you should start monitoring your models! The basic definition of model monitoring is "to observe and check the quality of results produced by a model over time." Let's unpack this statement into its two sub-components:

1 — Quality of Model Results

Model Monitoring: What, Why, and How (by Manasi Vartak - CEO, Verta Inc)

The quality of models depends heavily on ground-truth data. However, ground truth is often unavailable or delayed, so teams fall back on system signals (e.g., did someone click on your Tweet?) and proxies (e.g., the number of user-filed complaints). The other challenge comes from the hand-labeling process, which, by default, makes the data patchy and delayed.

Because ground truth is so hard to obtain, we need to use proxies to compute model quality.

  • Is the input of the model changing?

  • Is the feature data changing?

  • Is the output data changing?

2 — Over Time

Model Monitoring: What, Why, and How (by Manasi Vartak - CEO, Verta Inc)

Models are trained on a snapshot of data. As time goes by, the data changes, leading to a degradation in model accuracy (because the model has not seen the new data before).

What and How To Monitor

In order to monitor ML models, there are 3 questions to answer:

  1. What to measure and over what time window?

  2. How to detect changes in the measurement?

  3. How to know when a change merits attention?

Depending on the data types, the attributes that we want to measure end up being very different (as seen below):

Model Monitoring: What, Why, and How (by Manasi Vartak - CEO, Verta Inc)

The frequency of measurement is key and problem-dependent. For example, given a time-varying problem and the choice of metrics, sometimes we want to measure daily, weekly, or over a rolling window. Natural variability may exist, so be sure to account for any weekly or monthly changes.

Rabanser et al., 2018 gives a solid checklist of options to detect changes (a minimal sketch of the first two follows the list):

  • K-S test, Max Mean Discrepancy (continuous)

  • Chi-Squared test (categorical)

  • Univariate vs. Multivariate (multiple hypothesis testing)

  • Dimensionality Reduction (PCA, SVD)

  • Outlier detection algorithms (and their variants)
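As a minimal sketch of the first two items in this checklist (assuming scipy, and two pandas Series holding reference and production values for one feature), the univariate checks might look like this; the significance threshold is a placeholder.

```python
import numpy as np
import pandas as pd
from scipy import stats


def detect_numeric_drift(reference: pd.Series, production: pd.Series, alpha: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test for a continuous feature."""
    statistic, p_value = stats.ks_2samp(reference, production)
    return p_value < alpha  # True means "flag this feature for review"


def detect_categorical_drift(reference: pd.Series, production: pd.Series, alpha: float = 0.01) -> bool:
    """Chi-squared test comparing category frequencies between two windows."""
    categories = sorted(set(reference.unique()) | set(production.unique()))
    counts = np.array([
        [(reference == c).sum() for c in categories],
        [(production == c).sum() for c in categories],
    ])
    _, p_value, _, _ = stats.chi2_contingency(counts)
    return p_value < alpha


# Example usage with placeholder column names:
# drifted = detect_numeric_drift(train_df["price"], live_df["price"])
```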

Knowing how to measure and detect changes, we need to figure out when a change merits attention (what does a cosine distance of 0.2 mean? what p-value makes sense? etc.). Alert fatigue is real!

An Ideal Model Monitoring Solution

When it comes down to the practical implementation of a model monitoring system, we want to think about:

  • Scale. It must scale to large datasets, a large number of statistics, and live/batch inference.

  • Pipeline jungles. Convoluted model lineage and data pipelines make root cause analysis extremely hard.

  • Accessibility. For non-experts to consume models, monitoring must be easy to plug in and interpret.

  • Customization. Quality metrics are unique to each model type and domain.

Tooling options can be broken down into two camps:

  1. Build your own with sklearn/numpy/scipy, Prometheus/Grafana, and ad-hoc pipelines. This approach is good for training a few models and computing batch predictions, but it requires high effort for relatively little capability (a minimal sketch of this route follows the list).

  2. Use vendor solutions like Verta, AWS, DataRobot, and others. This approach is good for training large models and computing both batch and live predictions.
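For the "build your own" route mentioned in the first option, a minimal sketch might push a few batch statistics to Prometheus with the prometheus_client library and plot them in Grafana; the metric names, port, and once-a-minute loop are illustrative assumptions.

```python
import time

import numpy as np
from prometheus_client import Gauge, start_http_server

# Gauges that a Grafana dashboard (or alert rule) can scrape and plot over time.
prediction_mean = Gauge("model_prediction_mean", "Mean of the latest batch of predictions")
prediction_p95 = Gauge("model_prediction_p95", "95th percentile of the latest batch of predictions")
missing_ratio = Gauge("model_input_missing_ratio", "Fraction of missing values in the latest input batch")


def publish_batch_stats(predictions: np.ndarray, inputs: np.ndarray) -> None:
    prediction_mean.set(float(np.mean(predictions)))
    prediction_p95.set(float(np.percentile(predictions, 95)))
    missing_ratio.set(float(np.isnan(inputs).mean()))


if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        # In a real pipeline these would come from the latest scoring batch.
        publish_batch_stats(np.random.rand(1000), np.random.rand(1000, 20))
        time.sleep(60)
```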

Manasi concluded the talk by confirming that a good model monitoring solution should:

  1. Tell you when models are failing.

  2. Be customizable (bring your own statistic).

  3. Be friendly to both data science and DevOps workflows.

  4. Enable identifying the root cause.

  5. Support closing the loop.

Given the increasing popularity of using ML models to drive key UX and business decisions, model monitoring (at its core) ensures model results are consistent with high quality. When done right, monitoring can identify failing models before social media does, safely democratize AI, and quickly react to market changes.

10 — Machine Learning on Dynamic Graphs

Graph neural networks (GNNs) research has surged to become one of the hottest topics in ML in recent years. GNNs have seen a series of recent successes in problems from biology, chemistry, social science, physics, and many others. So far, GNN models have been primarily developed for static graphs that do not change over time. However, many interesting real-world graphs are dynamic and evolving in time, with prominent examples including social networks, financial transactions, and recommender systems. In many cases, such systems' dynamic behavior conveys important insights, otherwise lost if one considers only a static graph. Emanuele Rossi (ML Researcher at Twitter) discussed Temporal Graph Networks, a recent and general approach for ML over dynamic graphs.

Graph Neural Networks

Many of the ML ideas that work on images can also work on graphs. Two common tasks on images are image classification (classifying an image into a class) and semantic segmentation (classifying each pixel into some sort of class).

Machine Learning on Dynamic Graphs (by Emanuele Rossi - Machine Learning Researcher, Twitter)

In the graphs context, we have graph classification (classifying graphs to be drug-like or not drug-like), node classification (classifying unknown nodes within a graph), and link prediction (predicting missing edges).

Machine Learning on Dynamic Graphs (by Emanuele Rossi - Machine Learning Researcher, Twitter)

Given the tasks we need to accomplish on graphs, we need to think about how to achieve them. In images, every pixel has a constant number of neighbors and a fixed ordering of neighbors, which is the paradigm where Convolutional Neural Networks have been most successful. Graphs, on the other hand, have a varying number of neighbors and no ordering of neighbors, so we need Graph Neural Networks designed for these properties. In a basic GNN formulation, we compute the updated representation of a node by first computing and sending messages from all its neighbors and then aggregating these messages to produce a new representation for that node.
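To make this message-passing formulation concrete, here is a minimal numpy sketch of a single GNN layer under simple assumptions (mean aggregation over neighbors, one weight matrix, a dense adjacency matrix, and a ReLU update); it illustrates the general idea rather than any specific architecture from the talk.

```python
import numpy as np


def gnn_layer(node_features: np.ndarray, adjacency: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """One round of message passing: aggregate neighbor messages, then update each node.

    node_features: (num_nodes, in_dim); adjacency: (num_nodes, num_nodes) with 1 for an edge;
    weights: (2 * in_dim, out_dim) applied to [own features, aggregated neighbor messages].
    """
    # Each neighbor "sends" its current feature vector; aggregate with a mean.
    degree = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    neighbor_messages = adjacency @ node_features / degree

    # Update: combine a node's own features with the aggregated messages.
    combined = np.concatenate([node_features, neighbor_messages], axis=1)
    return np.maximum(combined @ weights, 0.0)  # ReLU non-linearity


# Tiny example: 4 nodes, 3-dimensional features, 8-dimensional output.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]], dtype=float)
w = rng.normal(size=(6, 8))
print(gnn_layer(x, adj, w).shape)  # (4, 8)
```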

Machine Learning on Dynamic Graphs (by Emanuele Rossi - Machine Learning Researcher, Twitter)

A major problem of applying graph neural networks in practice is that many graphs are dynamic. This frequently happens in real-life social networks and interaction networks. In response to that, these natural research questions arise:

  • How do we make use of the timing information to generate a better representation of nodes?

  • Can we predict when and how the graph will change in the future? (i.e., when will a user interact with another user? which users will interact with a given tweet in the next hour?)

Graph Types

Emanuele then explained the variety of graph types, ordered by increasing generality:

  • A static graph has no notion of time. It only has nodes and edges.

  • A spatiotemporal graph has fixed topology, but the features change over time and are (usually) observed at regular intervals. Examples include traffic forecasting and COVID-19 forecasting.

  • In a discrete-time dynamic graph, both topology and features change over time. However, the graph is still observed at regular intervals, and we have no information about what happens in between. This includes any system observed at regular intervals (e.g., every hour).

  • The most general formulation is a continuous-time dynamic graph. Each change ("event") in the graph is observed individually with its timestamp. This graph representation exists in most social, interaction, and financial transaction networks.

The table below shows examples of event types that can be represented by nodes and edges in a continuous-time dynamic graph on Twitter:

Machine Learning on Dynamic Graphs (by Emanuele Rossi - Machine Learning Researcher, Twitter)

So why do we need a new class of models for learning on dynamic graphs? The answer is that this type of model needs to (1) handle different types of events, (2) use the time information of the events, and (3) efficiently and incrementally incorporate new events at test time. Furthermore, dynamic graphs also require a new, different task — to predict when something will happen.

If we were to use a static GNN, this would mean:

  1. Loss of information: the model would use the last snapshot of the graph but not be able to take into account how the graph evolved.

  2. Inefficiency: computation is repeated each time we want to compute a node embedding.

  3. No way to make a time prediction.

Emanuele argued that an ideal Temporal Graph Model should take the following form:

  • The input is a graph up to time t (basically an ordered sequence of events).

  • Next is an encoder module that generates temporal node embeddings.

  • Then we have a task-specific decoder that takes node embeddings and makes predictions for the tasks. For instance, if the task is node classification, the decoder takes one node embedding and classifies that node at time t.

Machine Learning on Dynamic Graphs (by Emanuele Rossi - Machine Learning Researcher, Twitter)

Temporal Graph Networks

Temporal Graph Networks (TGN) is a general encoder architecture that Emanuele developed at Twitter. It combines sequential processing of events with a GNN by (1) handling general event types (each event generates a message which is then used to update nodes’ representations) and (2) using a GNN directly on the graph of interactions, combining the computed hidden states with node features. This model can be applied to various learning problems on dynamic graphs represented as a stream of events.

At a high level, TGN stores a vector (memory) for each node, which is a compressed representation of all past interactions of a node. This memory is updated at each new interaction where a node is involved (using an RNN). A GNN is used to aggregate nodes’ memories over the graph and to generate embeddings.
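The following is a highly simplified PyTorch-style sketch of that idea (per-node memory updated by a GRU cell on each interaction, then aggregated with neighbors to produce an embedding). It is an illustration of the concept only, not the reference TGN implementation; the dimensions, mean aggregation, and detached memory updates are assumptions.

```python
import torch
import torch.nn as nn


class TinyTemporalMemory(nn.Module):
    """Conceptual sketch: per-node memory + RNN update + simple neighbor aggregation."""

    def __init__(self, num_nodes: int, memory_dim: int, message_dim: int):
        super().__init__()
        # Memory is state, not a learned parameter: one vector per node.
        self.register_buffer("memory", torch.zeros(num_nodes, memory_dim))
        self.updater = nn.GRUCell(message_dim, memory_dim)
        self.embed = nn.Linear(2 * memory_dim, memory_dim)

    def update_on_interaction(self, src: int, dst: int, message: torch.Tensor) -> None:
        # Both endpoints of an interaction update their memory from the event message.
        for node in (src, dst):
            new_mem = self.updater(message.unsqueeze(0), self.memory[node].unsqueeze(0))
            self.memory[node] = new_mem.squeeze(0).detach()

    def node_embedding(self, node: int, neighbors) -> torch.Tensor:
        # Aggregate neighbor memories (mean) and combine with the node's own memory.
        if neighbors:
            neighbor_mem = self.memory[neighbors].mean(dim=0)
        else:
            neighbor_mem = torch.zeros_like(self.memory[node])
        return torch.relu(self.embed(torch.cat([self.memory[node], neighbor_mem])))
```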

Machine Learning on Dynamic Graphs (by Emanuele Rossi - Machine Learning Researcher, Twitter)

In terms of scalability, because the memory is not a parameter, we can think of it as an additional feature vector for each node which we change over time. Only memory for nodes involved in a batch is in GPU memory at any time. Thus, TGN is as scalable as GraphSage (a well-known static GNN) and can scale to large graphs.

In extensive experimental validation on various dynamic graphs, TGN significantly outperformed competing methods on future edge prediction and dynamic node classification tasks both in terms of accuracy and speed. One such dynamic graph is Wikipedia, where users and pages are nodes, and an interaction represents a user editing a page. An encoding of the edit text is used as an interaction feature. The task, in this case, is to predict which page a user will edit at a given time. They compared different variants of TGN with baseline methods:

Machine Learning on Dynamic Graphs (by Emanuele Rossi - Machine Learning Researcher, Twitter)

This ablation study sheds light on the importance of different TGN modules and allows a few general conclusions:

  • TGN is faster and more accurate than other approaches.

  • Memory leads to a vast improvement in performance.

  • The embedding module is also extremely important, and graph attention performs best.

  • With the memory in place, a single graph attention layer is enough.

For future work, Emanuele and colleagues are working on:

  • Benchmark datasets for dynamic graphs.

  • Method extensions such as a global (graph-wise) memory and continuous models (e.g., neural ODEs) to model the memory evolution.

  • Scalability: propose methods that scale better (possibly combining with literature on graph sampling, but not trivial).

  • Applications: social networks (e.g., recommender systems, virality prediction), biology (e.g., molecular pathways, cancer evolution), finance (e.g., fraud detection), and more?

11 — Catch Me If You Can: Keeping Up With ML Models In Production

Advances in ML and big data are disrupting every industry. However, even when companies deploy to production, they face significant challenges in staying there, with performance degrading significantly from offline benchmarks over time, a phenomenon known as performance drift. Models deployed over an extended period of time often experience performance drift due to changing data distribution.

Shreya Shankar (UC Berkeley) discussed approaches to mitigate the effects of performance drift, illustrating her methods on a sample prediction task. Leveraging her experience at a startup deploying and monitoring production-grade ML pipelines for predictive maintenance, she also addressed several aspects of ML often overlooked in academia, such as incorporating non-technical collaborators and integrating ML in an agile framework.

While most academic ML efforts are focused on training techniques, industry ML emphasizes the Agile working style. In industry, we want to train relatively few models but make lots of inferences with them. This motivates a key question: What happens beyond the validation or test set?

The depressing truth about ML in real life is that many data science projects don’t make it to production. Data in the “real world” is not necessarily clean and balanced and is always changing. Furthermore, showing high performance on a fixed train and validation set is different from getting consistent high performance when that model is deployed in the real world.

Data Lag and Distribution Shift

Shreya next brought up the two major challenges that happen post-deployment of models: data lag and distribution shift.

The two main types of data lag include feature lag (the system only learns about raw data after it has been produced) and label lag (the system only learns of the label after the outcome, such as a completed ride in her example, has occurred). Because the evaluation metric will inherently be lagging, we might not be able to train on the most recent data.

Distribution shift is a more practical problem in the industry. Data “drifts” over time and models will need to be retrained to reflect such drift. The open question is, how often do you retrain a model? It is impractical to retrain models all the time because:

  1. Retraining adds complexity to the overall system (more artifacts to version and keep track of).

  2. Retraining can be expensive in terms of compute (especially for deep learning models).

  3. Retraining can take time and energy from people.

Another underlying question with distribution shift is: how do you know when the data has "drifted"? A common thing that people do is monitor models in production. What to monitor is another question, and the answer varies by task and by team.

  • In the words of the agile philosophy, we want to monitor proactively as much as possible. However, this can be really hard when no labels are coming in real-time, or we are getting some form of label lag.

  • For first-pass monitoring, it is typical to monitor the model output/prediction distributions and averages & percentiles of values.

  • For more advanced monitoring, we can monitor the model input (feature) distributions and the ETL intermediate output distributions. This can get tedious if you have many features or steps in the ETL pipeline.

  • Even when we have monitored all the metrics mentioned above, it is still unclear how to quantify, detect, and act on the changes.

Traditional data monitoring approaches are not necessarily robust. A simple approach is to track the mean and variance of a metric, which can work in some cases but fails in many situations (distributions with multiple modes, skew/asymmetry, or heavy tails/high kurtosis). A more sophisticated approach is to run statistical tests on all features (such as the KS test or Jensen-Shannon divergence). In practice, these tests can have a high "false positive rate" (especially on large datasets), flagging distributions as "significantly different" more often than is useful. In the era of "big data," p-values are not useful to look at.
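The large-sample problem is easy to reproduce. The sketch below (assuming scipy) compares two nearly identical normal distributions and shows that, with a million samples, the KS test still returns a p-value far below any conventional threshold; the distributions and sample size are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two distributions that differ by a practically negligible shift in the mean.
reference = rng.normal(loc=0.00, scale=1.0, size=1_000_000)
production = rng.normal(loc=0.01, scale=1.0, size=1_000_000)

statistic, p_value = stats.ks_2samp(reference, production)
print(f"KS statistic: {statistic:.4f}, p-value: {p_value:.2e}")
# With a million samples, even this tiny shift is flagged as highly "significant",
# which is exactly the alert-fatigue problem on big data.
```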

Shreya argued that at this point, we do not know when the data has “drifted.” An unsatisfying solution is to retrain the model to be as fresh as possible. However, different types of bugs will surface when we are continually training models:

  1. The data gets changed or corrupted.

  2. Data dependency and infrastructure issues will arise.

  3. Logical errors in the ETL code, in retraining a model, and in promoting a retrained model to the inference step will pop up.

  4. There will be changes in consumer behavior.

There are many more, but why are these production ML bugs hard to find and address (compared to bugs in software engineering pipelines)?

  • ML code fails silently, with only a few runtime or compiler errors.

  • There is little or no unit/integration testing in feature engineering and ML code.

  • Very few (if any) people have end-to-end visibility on an ML pipeline.

  • We would need much better-developed monitoring and tracing tools than current offerings.

  • Many companies have pipelines with multiple models chained together, resulting in complex dependencies and artifacts to keep track of (especially during model retraining).

When we regularly retrain (even if we have monitoring solutions), we can use something like a model registry to track which model is best, or manage the versioning and registry ourselves. At inference time, we pull the latest or best model from the registry, do a forward pass through the pipeline, and produce a prediction. In many cases, when these bugs do occur, we need some form of tracing.
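As one concrete option (not necessarily the setup Shreya described), MLflow's model registry supports this pull-the-latest-promoted-model pattern; the model name, stage, and feature columns below are placeholders.

```python
import mlflow.pyfunc
import pandas as pd

# Placeholder model name and stage; in practice this points at your tracking server's registry.
MODEL_URI = "models:/churn_classifier/Production"


def predict_latest(batch: pd.DataFrame) -> pd.DataFrame:
    # Pull whatever model version is currently promoted to the "Production" stage.
    model = mlflow.pyfunc.load_model(MODEL_URI)
    scored = batch.copy()
    scored["prediction"] = model.predict(batch)
    return scored
```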

Catch Me If You Can: Keeping Up With ML Models in Production (by Shreya Shankar - ML Engineer, PhD Student)

mltrace

Shreya’s first Ph.D. project at UC Berkeley is a tracing tool for complex data pipelines called mltrace. Actively under development, mltrace provides coarse-grained lineage and tracing capabilities and is designed specifically for complex data/ML pipelines and agile multi-disciplinary teams. The first release contains a Python API to log information about any runs and a UI to query those logs and view traces for outputs.

Here are the key design principles of mltrace:

  • Utter simplicity: There is a logging mechanism for users to know everything going on.

  • There is no need for users to set component dependencies in the pipeline manually: The tool detects dependencies based on the I/O of a component’s run.

  • The API, in theory, is designed for both engineers and data scientists.

  • The UI is designed to help triage issues even if they didn’t build the ETL or model themselves.

Here are the key concepts of mltrace, which are inspired by Dagster and TFX:

  • Pipelines are made of components (i.e., cleaning, featuregen, split, train).

  • In ML pipelines, different artifacts are produced when the same component is run more than once.

  • The two abstractions are Component and ComponentRun objects (instances of that Component being run with the versioned dependencies).

  • There are two ways to do this logging: (1) a decorator interface similar to Dagster “solids” and (2) an alternative Pythonic interface similar to MLflow Tracking (a hypothetical sketch of the decorator style follows this list).
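To give a flavor of the decorator-style interface, here is a hypothetical sketch of component-level logging; it does not reproduce the actual mltrace API, whose functions and arguments may differ, and the log directory and record fields are invented for illustration.

```python
import functools
import json
import time
from pathlib import Path

LOG_DIR = Path("component_logs")  # hypothetical location, not part of mltrace


def log_component(name: str):
    """Hypothetical decorator: record a component run with its inputs and outputs."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            LOG_DIR.mkdir(exist_ok=True)
            record = {
                "component": name,
                "started_at": start,
                "duration_s": time.time() - start,
                "inputs": [repr(a)[:200] for a in args],
                "output": repr(result)[:200],
            }
            with open(LOG_DIR / f"{name}-{int(start)}.json", "w") as f:
                json.dump(record, f)
            return result

        return wrapper

    return decorator


@log_component("cleaning")
def clean(raw_rows: list) -> list:
    return [r for r in raw_rows if r is not None]
```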

Catch Me If You Can: Keeping Up With ML Models in Production (by Shreya Shankar - ML Engineer, PhD Student)

There are several things on the mltrace intermediate roadmap: gRPC integrations for logging outside of Python, Prometheus integrations to easily monitor outputs, DVC integration to version data, MLflow integrations, and lineage at the record/row level. Shreya is most excited about the ability to perform causal analysis for ML bugs: if you flag several outputs as mispredicted, which component runs were common in producing these outputs? Which component is most likely to be the biggest culprit in an issue? Email Shreya at shreyashankar@berkeley.edu if you’re interested in contributing!

12 — The Critical Missing Component in the Production ML Stack

The day an ML application is deployed to production and begins facing the real world invariably begins in triumph and ends in frustration for the model builder. The joy of seeing accurate predictions is quickly overshadowed by a myriad of operational challenges that arise, from debugging to troubleshooting to monitoring. In DevOps, analogous software operations have long been refined into an art form. Sophisticated tools enable engineers to quickly identify and resolve issues, continuously improving software stability and robustness. By contrast, in the ML world, operations are still done with Jupyter notebooks and shell scripts. One of the cornerstones of the DevOps toolchain is logging. Traces and metrics are built on top of logs, thus enabling monitoring and feedback loops. What would a good logging tool look like in an ML system?

Alessya Visnjic presented a powerful new tool — a logging library — that enables data logging for AI applications. She discussed how this solution enables testing, monitoring, and debugging of both an AI application and its upstream data pipeline(s). More specifically, she offered a deep dive into some of the key properties of the logging library that enable it to handle TBs of data, run with a constraint memory footprint, and produce statistically accurate log profiles of structured and unstructured data.

Data Problems

Based on her conversations with hundreds of ML practitioners in the past two years, Alessya observed that most issues in production ML stem from data problems. ML engineers can build pipelines that process terabytes of data in minutes, thanks to vendor solutions and big data platforms. This is both a blessing and a curse. If we look at the four core steps in the ML stack (processing raw data, generating features, training models, and serving models), each of them involves transforming and moving massive volumes of data. When you operate such a stack in production, each step can introduce a data bug or be completely derailed by a data bug.

The Critical Missing Component in the Production ML Stack (by Alessya Visnjic - CEO and Co-Founder, WhyLabs)

If you operate a model in production, your most common daily task becomes answering questions about the model behavior and health:

  • How is the model performing?

  • Should I be worried about data drift?

  • What did my training data look like?

  • What was the input data to the model yesterday?

  • How was yesterday's data different from today?

  • How was yesterday's data different from last week's data?

In general, how do we test, monitor, debug, and document data? This is not in any way a novel problem. Every engineer who operates ML applications in production has their own improvised way of testing, monitoring, and debugging data - which is likely to be ad-hoc and tedious. Instead, let's imagine that there are ways to (1) continuously monitor the quality of data flowing through each stage of the ML stack and (2) alert us about data issues before they impact our model.

Data Logging

The Critical Missing Component in the Production ML Stack (by Alessya Visnjic - CEO and Co-Founder, WhyLabs)

Let’s say we want to design such a data logging concept. Alessya believes that a good data log should capture:

  1. Metadata to understand where the different subsets of the data come from and how fresh they are.

  2. Counts to ensure that the data volume is healthy and missing values are detected.

  3. Summary statistics to identify outliers and data quality bugs.

  4. Distributions to understand how the shape of data changes over time and help catch distribution drifts.

  5. Stratified samples (of the raw data) to help with debugging and post-hoc exploratory analysis.

Then she outlined the key properties of a good and useful data logging solution:

  • Lightweight — runs in parallel with the main data workloads.

  • Portable — is easy to plug in anywhere within one’s ML pipeline.

  • Mergeable — allows merging for multiple instances or aggregating data logs over time.

  • Configurable — easy to configure to one’s preferences.

whylogs

Keeping these properties in mind, Alessya and her team at WhyLabs set out to build a data logging solution called whylogs, an open-source, purpose-built ML logging library. It implements a standard format for representing a snapshot of data (structured and unstructured). This allows the user to decouple the process of producing data logs from the process of acting upon them. In other words, whylogs provides a foundation for profiling, testing, and monitoring data quality. More specifically, whylogs:

  • Logs rich statistics for each feature such as counts, summary statistics, cardinalities, histograms, etc.

  • Tracks data statistics across batches by visualizing distributions over time. This helps understand distribution drifts and data bugs.

  • Runs data logging without overhead: Using streaming algorithms to capture data statistics, whylogs ensures a constant memory footprint, scales with the number of features in the data frame, and outputs lightweight log files (JSON, protobuf, etc.).

  • Captures accurate data distributions: whylogs profiles 100% of the data to capture distributions accurately. On the other hand, capturing distributions from sampled data is significantly less accurate.

The Critical Missing Component in the Production ML Stack (by Alessya Visnjic - CEO and Co-Founder, WhyLabs)

To sum up, more and more data science teams are using whylogs to enable operational activities for ML applications. whylogs has integrations to make it easy to plug into any Spark pipeline, any Python or Java environments, and any of the popular ML frameworks. By systematically capturing log files at every data transformation step within an ML pipeline, you create a record of data quality and data distributions along the entire ML development process. Given that record, you can then write data unit tests, build a data quality monitoring system, or develop data debugging dashboards. With just a few lines of code, you add powerful transparency/auditability to the ML application and enable a wide range of MLOps activities.
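As a minimal sketch of what the profiling step looks like in code, the example below logs a toy pandas DataFrame with whylogs; the API has evolved across versions, so treat these calls (based on the v1 Python API) as indicative rather than definitive.

```python
import pandas as pd
import whylogs as why

# A toy batch standing in for one day's worth of model inputs.
batch = pd.DataFrame({
    "amount": [12.5, 3.0, 99.9, 42.0],
    "country": ["US", "CA", "US", "DE"],
})

# Profile the batch: counts, missing values, summary statistics, and distribution sketches.
profile = why.log(batch).profile()

# Inspect the profile as a DataFrame, or write it out to compare against other batches.
summary = profile.view().to_pandas()
print(summary.head())
```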

Check out bit.ly/whylogs to learn more about this project!

13 — Iterative Development Workflows for Building AI Applications

Many AI practitioners share common assumptions that the training data used in their AI applications is static, publicly available, and crowd-labeled. In reality, the training data, observed in real-world enterprises, is extremely dynamic/drifting, private/sensitive, and labeled by experts. As a result, modern AI application development is changing — rather than focusing solely on models trained over static datasets, practitioners are thinking more holistically about their pipelines, with a renewed emphasis on the training data.

Building AI applications today requires armies of human labelers. This becomes a non-starter for private, high-expertise, and rapidly-changing real-world settings. In a traditional development loop, we spend money to get the training data labeled once and then iterate on models (experimenting with architectures, tuning hyper-parameters, etc.) to drive up performance.

What if we could iterate over this entire process?

Iterative Development Workflows for Building AI Applications (by Vincent Sunn Chen - Founding Engineer, Leading ML Engineering & Priyal Aggarwal - Machine Learning Engineer, Snorkel AI)

Programmatic Labeling

Vincent Chen and Priyal Aggarwal from Snorkel AI presented the idea of Programmatic Labeling, in which training data is labeled, built, and managed programmatically. More specifically, this iterative development is powered by weak supervision, an approach to rapidly encode domain expertise to "program" training datasets (using a number of different tools such as pattern matchers, organizational resources, heuristics, third-party models, crowd labels, etc.).

The idea behind weak supervision is to leverage more efficient but "noisier" supervision sources. In this weak supervision approach, we use an abstraction called the labeling function, which can encode and wrap different forms of supervision programmatically. Given the assumption that the supervision sources aren't perfect, we need a way to denoise and aggregate the labeling function outputs to build training datasets. The Snorkel team has published a number of papers on the theory and interfaces for combining the labeling function outputs in a provably consistent way (Ratner et al., NeurIPS'16; Ratner et al., VLDB'18).

Overall, weak supervision is a key foundational technology that enables programmatic labeling interfaces, making it possible to iteratively build both the models and the training data.
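Here is a minimal sketch of the labeling-function idea using the open-source snorkel library (which underpins, but is distinct from, the commercial Snorkel Flow platform); the label values, heuristics, and toy DataFrame are assumptions for illustration.

```python
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1


@labeling_function()
def lf_contains_free(x):
    # Pattern-matcher style heuristic: "free" often shows up in spam.
    return SPAM if "free" in x.text.lower() else ABSTAIN


@labeling_function()
def lf_known_sender(x):
    # Organizational-resource style heuristic: trusted senders are likely ham.
    return HAM if x.sender.endswith("@mycompany.com") else ABSTAIN


df_train = pd.DataFrame({
    "text": ["Win a FREE cruise now", "Quarterly report attached", "free gift cards!!!"],
    "sender": ["promo@spamdomain.io", "alice@mycompany.com", "deals@offers.net"],
})

# Apply all labeling functions, then denoise/aggregate their votes into training labels.
L_train = PandasLFApplier(lfs=[lf_contains_free, lf_known_sender]).apply(df_train)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200)
probabilistic_labels = label_model.predict_proba(L_train)
```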

Benefits of Iterative AI Development Workflows

Iterative Development Workflows for Building AI Applications (by Vincent Sunn Chen - Founding Engineer, Leading ML Engineering & Priyal Aggarwal - Machine Learning Engineer, Snorkel AI)

An iterative development workflow also helps users prioritize development efforts. Priyal presented a case study on building a spam detection application, in which the goal is to classify Spam vs. Ham emails. One iteration of building this application is depicted in the diagram above, which consists of an email dataset, some labeling functions, the constructed training data, the ML model, and the predictions.

  • After the 1st iteration, they achieved 87.3% accuracy using a simple logistic regression + CountVectorizer model (a sketch of this kind of baseline follows this list). They relied on a confusion matrix to diagnose the model’s handling of false negatives, and then wrote another labeling function to improve supervision in that bucket.

  • After the 2nd iteration, they gained a 2.4% increase in accuracy (89.7%) using the same logistic regression model. To further improve accuracy, they relied on a label distribution graph to fix distribution differences between the training set and model predictions. Then, they wrote more labeling functions for the Spam Class to improve coverage and oversampled the “SPAM” class in their model training.

  • After the 3rd iteration, they gained another 2.7% increase in accuracy (92.4%), still using the same logistic regression model as before.
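As referenced above, the modeling side of each iteration stayed deliberately simple. A sketch of the kind of logistic regression + CountVectorizer baseline described in the case study (with placeholder data instead of the real email corpus) might look like this.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline

# Placeholder training data produced by the labeling step (1 = spam, 0 = ham).
texts = ["Win a FREE cruise now", "Quarterly report attached", "free gift cards!!!", "Lunch tomorrow?"]
labels = [1, 0, 1, 0]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# The confusion matrix is what guided the next round of labeling functions in the case study.
print(confusion_matrix(labels, model.predict(texts)))
```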

Iterative Development Workflows for Building AI Applications (by Vincent Sunn Chen - Founding Engineer, Leading ML Engineering & Priyal Aggarwal - Machine Learning Engineer, Snorkel AI)

Lastly, an iterative workflow helps practitioners collaborate with experts more effectively.

  • In the beginning, a data scientist typically formulates the task by preparing the training dataset, setting up a basic modeling framework, and generating some predictions. He/she must ensure that this end-to-end pipeline is formulated sensibly.

  • Then, it will be important for an expert to be actively involved in developing the labeling functions as he/she has a lot of intuition about how to label the data. Instead of having him/her label one by one to form ground-truths, we should give him/her the tools and interfaces to build labeling functions directly. This will improve his/her efficiency significantly.

  • Once the expert has labeled some data and generated a training dataset, the data scientist returns and continues to work on the modeling side.

  • In the end, the expert must be involved in the model analysis to address specific error modes.

The Snorkel team has been actively building the Snorkel Flow platform that enables iterative development workflows for building AI applications. It is a first-of-its-kind data-centric platform powered by programmatic labeling.

Iterative Development Workflows for Building AI Applications (by Vincent Sunn Chen - Founding Engineer, Leading ML Engineering & Priyal Aggarwal - Machine Learning Engineer, Snorkel AI)

14 — Security Audits for Machine Learning Attacks

There are various reasons for ML models to be attacked. A majority of the time, hackers, malicious insiders, and their criminal associates seek to: (1) gain beneficial outcomes from a predictive or pattern recognition model or induce negative outcomes for others, (2) infiltrate corporate entities, or (3) obtain access to intellectual property (models, data, etc.).

Several known attacks against ML models can lead to altered, harmful model outcomes or exposure of sensitive training data. Unfortunately, traditional model assessment measures don't tell us much about whether a model is secure. In addition to other debugging steps, it may be prudent to add some or all of the known ML attacks into any white-hat hacking exercises or red-team audits your organization is already conducting. H2O.ai's Navdeep Gill (Lead Data Scientist/Team Lead, Responsible AI) and Michelle Tanco (Customer Data Scientist) went over common ML security attacks and the remediation steps an organization can take to deter them.

Categories of Attacks

The first category is data poisoning attacks, in which either (1) hackers obtain unauthorized access to data and alter it before model training, evaluation, or retraining or (2) people in your organization (malicious or extorted data science or IT insiders) do the same. More specifically, the attacker alters data before model training to ensure favorable outcomes.

Here are the common defense mechanisms against data poisoning attacks:

  • Disparate impact analysis that looks for discrimination in your model’s prediction.

  • Fair or private models such as learning fair representations (LFR) and private aggregation of teacher ensembles (PATE).

  • Reject on negative impact (RONI) analysis, which removes rows of data from the training dataset if they decrease prediction accuracy.

  • Residual analysis that specifically looks for large positive deviance residuals and anomalous behavior in negative deviance residuals.

  • Self-reflection, where you score your models on your employees, consultants, and contractors and look for anomalously beneficial predictions.

The second category is backdoors and watermarks, in which either (1) hackers infiltrate your production scoring system or (2) people in your organization (malicious or extorted data science or IT insiders) change your production scoring code before or during deployment by adding a backdoor that can be exploited using watermarked data (a unique set of information that causes a desired response from the hacked scoring system). More specifically, the attacker adds a backdoor into your model’s scoring mechanism and then exploits that backdoor with watermarked data to attain favorable outcomes.

Here are the common defense mechanisms against backdoors and watermarks:

  • Anomaly detection, as you screen your production scoring queue with an anomaly detection algorithm (i.e., an autoencoder) that you understand and trust.

  • Data integrity constraints that don't allow impossible or unrealistic combinations of data into your production scoring queue (i.e., sanity check combinations of data).

  • Disparate impact analysis that looks for discrimination in your model’s prediction.

  • Version control by keeping track of your production model scoring code like any other enterprise software through a version control tool (e.g., Git).

The third category is surrogate model inversion attacks. Due to a lack of security or a distributed attack on your model API, hackers can simulate data, submit it, receive predictions, and train a surrogate model between their simulated data and your model predictions. This surrogate can (1) expose your proprietary business logic (which can be known as “model stealing”), (2) reveal sensitive information based on your training data, (3) be the first stage of a membership inference attack, and (4) be a test-bed for adversarial example attacks.

Here are the common defense mechanisms against surrogate model inversion attacks:

  • Authentication by authenticating consumers of your model’s API or other relevant endpoints. This is probably one of the most effective defenses for this type of attack as it can stop hackers before they can even start.

  • Defensive watermarks by adding subtle information to your model’s predictions to aid in the forensic analysis if your model is hacked or stolen. This is similar to a physical watermark you would see in the real world.

  • Throttling by slowing down your prediction response times, especially after anomalous behavior is detected. This will give you and your team time to evaluate any potential wrongdoing and take the necessary steps toward remediation.

  • White-hat surrogate models by training your own surrogate models as a white-hat hacking exercise to see what an attacker could learn about your public models. This will allow you to build protections against this type of attack (a minimal sketch follows this list).
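As a minimal sketch of the white-hat surrogate exercise, the code below simulates inputs, queries a stand-in for your scoring endpoint, and fits an interpretable tree to see how much logic an outsider could reconstruct; the query function and feature space are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text


def query_production_model(batch: np.ndarray) -> np.ndarray:
    """Placeholder for a call to your real scoring API."""
    return (batch[:, 0] + 0.5 * batch[:, 1] > 1.0).astype(int)


# 1) Simulate plausible inputs, 2) collect the model's predictions,
# 3) fit an interpretable surrogate to see how much business logic leaks.
rng = np.random.default_rng(0)
simulated_inputs = rng.uniform(0, 2, size=(5000, 3))
predictions = query_production_model(simulated_inputs)

surrogate = DecisionTreeClassifier(max_depth=3).fit(simulated_inputs, predictions)
print(export_text(surrogate, feature_names=["feature_0", "feature_1", "feature_2"]))
```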

The fourth category is membership inference attacks. Due to a lack of security or a distributed attack on your model API or other model endpoints, this two-stage attack begins with a surrogate model inversion attack. A second-level surrogate is then trained to discriminate between rows of data in, and not in, the first-level surrogate’s training data. The second-level surrogate can dependably reveal whether a row of data was in, or not in, your original training data.

Simply knowing if a person was in or not in a training dataset can violate individual or group privacy. However, when executed to the fullest extent, a membership inference attack can allow a bad actor to rebuild your sensitive training data! The best defense mechanism against this is to monitor your production scoring queue for data that closely resembles any individual used to train your model. Real-time scoring of rows that are extremely similar or identical to data used in training, validation, or testing should be recorded and investigated.

The fifth category is adversarial example attacks. Due to a lack of security or a distributed attack on your model API or other model endpoint, hackers simulate data, submit it, receive predictions, and learn by systematic trial-and-error. Your proprietary business logic can easily be used to game your model to dependably receive the desired outcome. Adversarial example attacks can also be enhanced, tested, and hardened using models trained from surrogate model inversion attacks.

Common defense mechanisms against adversarial example attacks include the ones discussed above, such as anomaly detection, authentication, fair or private models, throttling, and white-hat surrogate models. There are also other mechanisms such as:

  • Benchmark models by always comparing complex model predictions to trusted linear model predictions. If the two models' predictions diverge beyond some acceptable threshold, you should review the prediction before issuing it.

  • Model monitoring by watching your model in real-time for strange prediction behavior.

  • White-hat sensitivity analysis by trying to trick your own model into seeing its outcome on many different combinations of input data values.

The sixth category is impersonation attacks. Bad actors either (1) learn, via inversion or adversarial example attacks, which attributes your model favors and then impersonate them, or (2) learn, via disparate impact analysis, that your model is discriminatory and impersonate your model’s privileged class to receive a favorable outcome. You can defend against them with authentication, disparate impact analysis, and model monitoring (watching for too many similar predictions and similar input rows in real-time).

General Concerns and Solutions

From a bird's-eye view, here are the general concerns with ML attacks:

  • Black-box models: Over time, a motivated, malicious actor could learn more about your own black-box model than you know and use this knowledge imbalance to attack your model.

  • Black-hat explainable AI: While explainable AI can enable human learning from ML, regulatory compliance, and appeal of automated decisions, it can also make ML hacks easier and more damaging.

  • Distributed-denial-of-service (DDOS) attacks: Like any other public-facing service, your model could be attacked with a DDOS attack.

  • Distributed systems and models: Data and code spread over many machines provide a more complex attack surface for a malicious actor.

  • Package dependencies: Any package your modeling pipeline is dependent on could potentially be hacked to conceal an attack payload.

And here are the general solutions to deal with these concerns:

  • Authenticated access and prediction throttling: These activities should be used for prediction APIs and other model endpoints.

  • Benchmark models: You should always compare complex model predictions to less complex model predictions. For traditional, low signal-to-noise data mining problems, predictions should not be too different. If they are, investigate them.

  • Encrypted, differentially private, or federated training data: Properly implemented, these technologies can thwart many types of attacks. Improperly implemented, they simply create a broader attack surface or hinder forensic efforts.

  • Interpretable, fair, or private models: In addition to models like LFR and PATE, you should also check out monotonic GBMs, Rulefit, AIF360, and the Rudin group at Duke.

  • Model documentation, management, and monitoring: Best practices include (1) taking an inventory of your predictive models, (2) documenting production models well enough that a new employee can diagnose whether their current behavior is notably different from their intended behavior, (3) knowing who trained what model, on what data, and when, and (4) monitoring and investigating the inputs and predictions of deployed models on live data.

  • Model debugging and testing, and white-hat hacking: It’s important to test your models for accuracy, fairness, and privacy before deploying them. Furthermore, you can train white-hat surrogate models and apply explainable AI techniques to them to see what hackers can see.

  • System monitoring and profiling: It’s important to use a meta anomaly detection system on your entire production modeling system’s operating statistics, then closely monitor for anomalies.

15 — How Not to Let Your Data and Model Drift Away Silently

Many reasons contribute to the failure of ML projects in production, but the lack of model retraining and testing is the most common one. ML deployment is not the end — it is where ML models start to materialize impact and value to the business and people. We need monitoring and testing to ensure ML models behave as expected. In her talk, Chengyin Eng (Data Science Consultant at Databricks) shed light on two questions:

  1. What are the statistical tests to use when monitoring models in production?

  2. What tools can I use to coordinate the monitoring of data and models?

For any robust ML system, we should set up model monitoring and use that information to inform the feedback loop in our ML lifecycle development. The quality of our model depends on the quality of our data. It is expected that data distributions and feature types can change over time due to multiple factors (such as upstream errors, market change, and human behavior change), leading to potential model performance degradation. Models will degrade over time, yet the challenge is to catch this when it happens.

Drift Types

Model drift can occur when there is some form of change to feature data or target dependencies. We can broadly classify these changes into the following two categories: concept drift and data drift.

  • When statistical properties of the target variable change, the very concept of what you are trying to predict changes as well. For example, the definition of what is considered a fraudulent transaction could change over time as new ways are developed to conduct such illegal transactions. This type of change will result in concept drift.

  • The features used to train a model are selected from the input data. When statistical properties of this input data change, it will have a downstream impact on the model’s quality. For example, data changes due to seasonality, personal preference changes, trends, etc., will lead to incoming data drift. Besides the input features, deviations in the distributions of label and model predictions can also contribute to data drift.

How Not to Let Your Data and Model Drift Away Silently (by Chengyin Eng - Data Science Consultant, Databricks)

Knowing these types of drift, what actions can we take?

  • With feature drift, we should either retrain models using new data or investigate how that feature is generated.

  • With label drift, we should (similarly) retrain models using new data or investigate how that label is generated.

  • With prediction drift, we should investigate whether there's something wrong with the model training process and assess the business impact.

  • With concept drift, we can either retrain models using new data or consider alternative solutions (maybe additional feature engineering).

Monitoring Tests

There are four main aspects for us to monitor: (1) basic summary statistics of features and target, (2) distributions of features and target, (3) model performance metrics, and (4) business metrics. Chengyin then broke down the two types of monitoring tests: on data and on model.

  1. For monitoring tests on data, we need to look at both numeric and categorical features. For numeric features, we want to compute summary statistics (such as median/mean, minimum, maximum, percentage of missing values, etc.) and use statistical tests (such as the two-sample KS test with Bonferroni correction, the Mann-Whitney test for the mean, and the Levene test for the variance). For categorical features, we want to compute summary statistics (such as mode, number of unique levels, percentage of missing values, etc.) and use statistical tests like the one-way chi-squared test (a minimal sketch of the Bonferroni-corrected feature tests follows this list).

  2. For monitoring tests on models, we want to look at the relationship between target and features (Pearson coefficient for numeric target and contingency tables for categorical target), examine model performance (MSE, error distribution plots, etc. for regression models, ROC, confusion matrix, F1-score, etc. for classification models, and performance on specific data slices), and assess the time taken to train models. 
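To show how the Bonferroni correction fits in when many features are tested at once, here is a minimal sketch (assuming scipy and two pandas DataFrames holding the reference and current batches); the base significance level and the KS-only focus are simplifying assumptions.

```python
import pandas as pd
from scipy import stats


def drifted_numeric_features(reference: pd.DataFrame, current: pd.DataFrame, alpha: float = 0.05) -> list:
    """Run a two-sample KS test per numeric column with a Bonferroni-corrected threshold."""
    numeric_cols = reference.select_dtypes("number").columns
    corrected_alpha = alpha / max(len(numeric_cols), 1)  # Bonferroni correction

    flagged = []
    for col in numeric_cols:
        _, p_value = stats.ks_2samp(reference[col].dropna(), current[col].dropna())
        if p_value < corrected_alpha:
            flagged.append(col)
    return flagged
```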

How Not to Let Your Data and Model Drift Away Silently (by Chengyin Eng - Data Science Consultant, Databricks)

A standard workflow for model monitoring is shown above:

  • Starting at month 0, we train our model on existing data and put it into production. There is no historical data to compare against.

  • In month 1, new data arrives. At this stage, we can check null values, compute summary statistics, and use tests like KS, Levene, and chi-squared. If these tests pass, then we can proceed to train a new model. Otherwise, we need to address the drifts or concerns before training a new model. Tools like MLflow and Delta Lake (shown in the demo below) can help coordinate these checks.

  • After training a new model, we need to ensure that it performs better than the old model. If it does, we replace the old model; otherwise, we keep the old one.

Chengyin concluded the talk with a demo showing how to use MLflow and Delta Lake (two open-source projects originally developed at Databricks) to construct monitoring tests on models and data. The demo can be accessed in this repository.

That’s the end of this long recap. MLOps is a topic that I will continue to focus on heavily in the upcoming months. If you have experience building or using platforms and tools to support MLOps for the modern ML stack, please reach out to trade notes and tell me more at khanhle.1013@gmail.com! 🎆