What I Learned From Attending TWIMLcon 2021
In January, I attended TWIMLcon, a leading virtual conference on MLOps and enterprise ML. It focuses on how enterprises can overcome the barriers to building ML models and getting them into production. There was a wide range of both technical and case-study sessions curated for ML/AI practitioners. In this long-form blog recap, I dissect the content from the talks I found most useful.
The post consists of 14 talks that are divided into 3 sections: (1) Case Study, (2) Technology, and (3) Perspectives.
1 — Case Study
1.1 — How Spotify Does ML At Scale
Over 320 million Spotify users in 92 markets worldwide rely on Spotify's great recommendations and personalized features. Those users have created over 4 billion playlists from a catalog of over 60 million tracks and nearly 2 million podcasts. With the massive inflows of data and the complexity of the different pipelines and teams using the data, it's easy to fall into the trap of tech debt and low productivity. The ML Platform at Spotify was built to address that problem and make all of Spotify's ML practitioners productive and happy. Aman Khan and Josh Baer gave a concrete talk describing the history of the ML Platform at Spotify.
Currently, Spotify has over 50 machine learning teams using the ML platform. In 2020, those teams trained over 30,000 models using the platform team's tools. These models ramped up to 300 thousand prediction requests per second using the internal serving framework. As measured by the number of model iterations per week, machine learning productivity increased by 700% in 2020 alone due to the platform enhancements. Here are a few applications of machine learning at Spotify: app personalization, recommendations, user onboarding, messaging quality, and expansion to new markets.
Spotify has used machine learning since the company's origin. Early efforts were custom-built code that ran on a Hadoop cluster. For the first five to ten years, Spotify's machine learning capabilities were all about growing the data capabilities: processing data at scale and handling unique datasets relevant to Spotify's growing machine learning ambition. As more frameworks came into play and ML grew in popularity, they saw more combinations of tools being used in projects. This led to higher tech debt for ML systems and more cognitive load when building new systems.
In 2017, Spotify created an internal infrastructure/platform team. Its core mission is to increase the speed of iteration for ML development by providing tooling that supports common development patterns. Here are some of the goals: (1) to build a platform for active ML use cases, (2) to reduce the cost to maintain ML applications, (3) to democratize ML, and (4) to support state-of-the-art ML.
The ML workflow at Spotify can be broken down into three iterative steps, from data exploration to model serving at a high level. The ML Platform then built 5 product components that serve various stages of the ML lifecycle:
Jukebox (Internal Feature Store): An ecosystem of libraries that allow users to easily reuse components in their own workflows for feature management during model training and serving.
Kubeflow Pipelines (SKF): A managed platform to build, iterate, and deploy ML workflow via Kubeflow pipelines.
Salem: A Spotify-wide platform for serving models online. It accomplishes this in three simple steps: (1) When a user gives Salem a model, it returns a back-end service. (2) When a user sends features to Salem, it uses those features to make predictions. (3) When a user updates his/her model, Salem handles updating the service.
ML Home: A centralized user interface that enables teams to collaborate on, discover, share, and manage ML projects.
Klio: A collection of ML libraries for efficiently processing and cataloging audio and podcasts.
The piece that ties these components together is the ML Engagement Team that focuses on ML education and outreach initiatives within Spotify.
Aman then dived deeper into Jukebox, Spotify’s internal feature store. The goal of creating Jukebox is to reduce users' time to discover, share, and implement features for model training and serving. The platform team does this by simplifying feature management from experimentation all the way to online serving.
Let’s look at an example of building an Artist Preference Model that predicts how likely a user is to be a fan of a given artist. If done in batch, this would require an impractically large number of predictions, roughly (300M users x 10M artists) pairs. If instead done online at request time, it would be more scalable for other Spotify teams to build personalization into their applications (using this model's outputs as inputs to their own models).
The diagram above displays a typical Jukebox Workflow:
The starting point is a feature registry, a single source of truth for features. Users first create and register features, combine features with training models offline, and later combine features again to make predictions online. Users can explore features using a feature gallery UI on Spotify’s ML Home component, where they can search, discover, and understand features that might get reused for their models.
The collector component selects and joins features from upstream data endpoints with TF records as the outputs. When joined with labels, these features can be used to create a training set.
The loader then ingests those TF records into BigTable using a specified schema. Each team is responsible for its own BigTable customer table.
The reader is a Java library that makes features available at scale for low-latency reads from an online service. It reads the BigTable data loaded from the previous stage's TF records and makes it available as input to the model. The reader looks at the latest partition for each table and provides optional in-memory caching for the features.
At the end of model training, the feature transformation is saved as parts of the model graph, which dramatically simplifies model management and guarantees the same feature transformation between training and serving time (no data skew).
Through Spotify’s internal metadata service, the team associates the features with their corresponding trained models at serving time to make predictions such as how much a user likes a given artist.
The logs produced at serving time can then be stored and registered for further model training.
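To make the earlier point about saving feature transformations as part of the model graph concrete, here is a minimal sketch (my own illustration, not Spotify's internal Jukebox code; it assumes TensorFlow 2.6+). A Keras preprocessing layer is adapted on training data and becomes part of the model itself, so serving applies exactly the same transformation and there is no skew for that feature.

```python
import numpy as np
import tensorflow as tf

# Toy training data: a single numeric feature (e.g., listening hours per week).
train_features = np.array([[1.0], [4.0], [10.0], [25.0]], dtype="float32")
train_labels = np.array([0, 0, 1, 1], dtype="float32")

# The normalization layer learns its mean/variance from the training data
# and lives inside the model graph rather than in a separate preprocessing script.
normalizer = tf.keras.layers.Normalization()
normalizer.adapt(train_features)

model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(train_features, train_labels, epochs=5, verbose=0)

# Whatever raw value arrives at serving time goes through the exact same
# transformation, because the transformation ships inside the exported model.
print(model.predict(np.array([[7.0]], dtype="float32")))
```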
Towards the end of the talk, Josh shared the three lessons from his experience leading the platform team:
Build infrastructure together: Working alongside ML teams allowed them to build a better product.
Be opinionated: An opinionated path enabled the team to build a more cohesive experience.
Make difficult tradeoffs: Optimizing for production freed up ML engineers to do more ML development.
The biggest tradeoff was to build a platform solely for the ML engineering use cases. Such focus allowed the platform team to reduce the maintenance burden for ML teams and free them up for building new ML projects. Of course, this means less of a focus on non-ML engineering work (such as the earlier prototyping and exploration phases). In 2021, the platform team's biggest goal is to expand the platform's breadth of offerings and support all the different parts of the production lifecycle. Besides that, here are the other main themes that they will be working on:
MLOps — The Road to Continuous Learning: design better processes and guardrails to continuously learn and deploy models.
ML Experimentation: build a better process for experimenting with ML at scale.
Quick Model Prototyping: increase focus on the prototyping phase, pulling in users (especially data scientists) that do ad-hoc ML or build via a highly iterative process.
Feature Marketplace: expose knowledge being funneled into ML models around the company and allow squads to incorporate them into their own experimentation.
Fairness and Bias: enable equitable ML by default, allowing engineers to understand and control the bias built into their models.
Note: If you are interested in learning more about Spotify’s ML Platform, check out this blog post and this slide deck.
1.2 — MLOps For High-Stake Environments
Toyota Research Institute (TRI), Toyota's research arm, was established in January 2016 with a billion-dollar budget. It has grown to more than 300 employees across three facilities in Cambridge, Ann Arbor, and Silicon Valley. The main focus areas have been automated driving, robotics, advanced material design and discovery, and machine-assisted cognition. Sudeep Pillai shared an overview of the MLOps environment developed at TRI and discussed some of the key ways MLOps techniques must be adapted to meet the needs of high-stakes environments like robotics and autonomous vehicles.
TRI has a unique approach to Automated Driving called “One System, Two Modes”:
Mode 1 is called Chauffeur — where fully autonomous driving systems are engaged at all times. This mode is currently staged for commercial release, likely beginning with shared mobility fleets.
Mode 2 is called Guardian — where the driver is always engaged, while the vehicle monitors and intervenes to help prevent collisions. This mode builds on the same hardware and software development as the fully autonomous Chauffeur.
Machine learning is eating automated driving (AD) software across the industry: AD software was traditionally hand-engineered. It was modular and rule-based to make things composable and inspectable. However, it was fairly rigid and, as a result, generalized poorly. As we move towards the software 2.0 world, the technologies for the core components of an AD stack (Perception, Prediction, Planning) have started to adopt ML. The idea is to engineer AD software that is data-driven and generalizable to diverse situations.
The challenging task is to build a unified MLOps infrastructure to support all ML models in safety-critical AD applications (2D/3D object detection, traffic light detection, semantic segmentation, bird's-eye-view scene flow, monocular depth estimation).
TRI’s real goal is to build a unified platform for accelerated fleet learning. This was inspired by the “Toyota Production System” for vehicles. Given an assembly line that Toyota has, the goal is to optimize parts of this line to produce more vehicles at the end of any given day. In the context of MLOps, this platform needs to be domain-specific, model-agnostic, and applicable across heterogeneous fleets. The question becomes:
How to marry MLOps best practices with safety guarantees to deploy models in an AD setting?
While adapting MLOps AD applications, the TRI team came across a few key challenges (related to the three phases of design, development, and operations): (1) more complex model dependencies and cascades during the design phase, (2) more nuanced model acceptance criteria during the development phase, and (3) more operational burden for safety-critical applications during the operations phase.
For example, let’s say the goal is to improve cars' performance at intersections. A design phase might look like the diagram above:
We have access to the log replay — which shows an internal representation of what the car sees in the world.
We then curate specific scenarios that can happen during an intersection and learn from those scenarios.
We want to make sure the data is labeled correctly (via some quality assurance score).
We next track model experiments and examine various model-level metrics (mAP, mIoU, etc.).
Once we get to deployment, we need the model to pass a test suite before putting it into production.
Finally, we want to measure a system-level metric known as kilometers per disengagement. We need to think carefully about the tradeoff between this metric and the model-level metrics optimized during the training steps.
A modular architecture also poses a complex-dependencies challenge, where models are trained on predictions from other models. For example, if we want to re-train a model further upstream in the workflow, we need to consider all the downstream models affected by the change (see the sketch below).
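To make that concrete, here is a small illustrative sketch (my own, not TRI's tooling) that walks a hypothetical model dependency graph and lists every downstream model that would need re-validation or re-training after an upstream change:

```python
from collections import deque

# Hypothetical dependency graph: edges point from a model to the models
# that consume its predictions as inputs.
DOWNSTREAM = {
    "2d_object_detection": ["tracking", "prediction"],
    "semantic_segmentation": ["prediction"],
    "tracking": ["prediction"],
    "prediction": ["planning"],
    "planning": [],
}

def affected_models(retrained: str) -> list:
    """Return every model downstream of `retrained`, in breadth-first order."""
    seen, order = set(), []
    queue = deque(DOWNSTREAM.get(retrained, []))
    while queue:
        model = queue.popleft()
        if model in seen:
            continue
        seen.add(model)
        order.append(model)
        queue.extend(DOWNSTREAM.get(model, []))
    return order

# Retraining the 2D detector means tracking, prediction, and planning
# all need to be re-evaluated against their acceptance criteria.
print(affected_models("2d_object_detection"))  # ['tracking', 'prediction', 'planning']
```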
To build MLOps for safety-critical applications, TRI designs a diverse suite of benchmarks and tests for evaluation during real-world and simulated scenarios (as seen below).
This unified and scalable infrastructure encourages fast iteration and scenario-driven, safety-first deployment. A big problem in automated driving is the heavy tail: it's challenging to find the edge cases, and the vehicle needs to drive many miles to encounter them. Iterating fast is the only way to accelerate that process. TRI's near-term goal is to scale this foundational capability for learning from the Toyota fleet to all forms of Toyota vehicles.
Note: Check out different TRI research lines if you are an autonomous vehicle geek.
1.3 — How To Prevent The Damage Caused By Wayward Models
Model quality is a common production problem: subtle failures are often caused by infrastructure and frequently go undetected.
Models that were fine suddenly become not fine, and the change goes undetected. Sometimes these are modeling failures, but they are often caused by infrastructure.
Detection is tricky because regressions can be subtle. In particular, a bad model may affect only some subsets of the modeling space or have a modest effect across the whole model.
Additionally, model quality is not just an operational but also a trust problem.
Models have quality problems for non-modeling reasons: configurations are not regularly updated, newly released learners and trainers behave differently, parallelism becomes an issue, and data changes (quantization or definition changes, missing data or joins against the wrong data, differential delay, etc.).
Many non-technical (and some technical) leaders don’t trust their organizations to use ML. Thus, the benefits of ML are held back by the lack of trust. At the moment, the trust is inversely proportional to the damage a bad model could cause.
Using a hand analysis of approximately 100 incidents tracked over 10 years, the Google Engineering Infrastructure team has looked carefully at cases where these models reached or almost reached the serving system. In his talk, Todd Underwood identified common causes and manifestations of these failures and provided some ideas for measuring the potential damage caused by them. Most importantly, he proposed a set of simple (and some more sophisticated) techniques for detecting the problems before they cause damage.
The example Google system in his talk is part of a complex pipeline where thousands of models are trained on hundreds of features. Data pre-processing, training, and serving are asynchronously connected. Newly arriving (logged) data is joined to slower-changing data. Different models are trained on different collections of features. New models are trained continuously (every hour). An "experiment" system is plumbed through from the data warehouse to model serving.
In particular, it is one of the largest and oldest ML systems at Google:
A ranking and selection system that is old (15+ years with multiple redesigns) and well-documented (10 years of full postmortems).
There are thousands of models training concurrently with billions of parameters in size.
The engineers train the system periodically to update the model with newly arrived data.
The serving system is global, and new models are continuously synced to serving.
Todd brought up an interesting point: Google treats outages as information. Failures are a gift, as they are a natural experiment of what, at least once, broke and usually why. Understanding failures can help them avoid, mitigate, or resolve those failures. In fact, Google maintains a searchable database of failures and their write-ups called "Requiem."
Google's initial hypothesis is that many ML failures probably have nothing to do with ML. Their impression is that many undetected ML failures aren't modeling failures. Instead, boring and commonplace failures are more difficult to notice and more difficult to take seriously. Modelers are looking at modeling failures, while infrastructure engineers are not paying attention to model quality. This gap might also be a root cause. Evidence could confirm this and make it more actionable, or contradict it and point towards more ML-centric quality checks.
Then, the Google Infrastructure team looked at the outages and developed a taxonomy of model failures. It is worth noting that this taxonomy is observational, not quantitative. Failures are categorized based on the experience and intuition of the authors. Furthermore, this taxonomy is deeply connected to Google’s underlying systems architecture. But in general, understanding even one failure can help prevent other quality outages.
The most general class of failures occurs during the handover between elements in a system. A classic example is training/serving skew, where feature definitions at training time differ from those at serving time. A change in feature definitions is another common issue, where the features no longer mean what they previously meant. Finally, validation data might no longer be representative (due to bugs in data generation, a shift in real-world data, etc.).
The second class of failures occurs due to static models. Because the world is non-stationary, the problem will change, the traffic mix will change, and the infrastructure will also change.
Finally, failures can happen due to continuously updated models. These periodically updated models have all of the failures above (though to a smaller degree, as model updates may protect against some) and more. For instance, automated safety checks might not be strict enough, models might be non-reproducible, infrastructure reliability challenges arise, etc.
To prevent these failures in the first place, a good practice is to automatically disallow configuration changes that exclude too much data from training.
To protect the system (generally speaking), we can monitor data flowing through various stages in the training pipeline and ensure that the data is within acceptable thresholds and does not change too suddenly. The distribution of features produced by the feature generation sub-system and the distribution of features ingested by a specific model should be covered. Periodically, we want to update thresholds to keep up with the changing world automatically.
Finally, we should always automatically pause the pipeline and raise alerts on data quality degradation.
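As a concrete illustration of that kind of guardrail, here is a minimal sketch of my own (not Google's internal tooling) that compares an incoming feature batch against reference statistics and "pauses the pipeline" by raising an alert when the drift or the missing-value rate exceeds a threshold:

```python
import numpy as np

# Reference statistics computed from a trusted historical window.
REFERENCE = {"mean": 0.42, "std": 0.11}
MAX_Z_SHIFT = 4.0            # tolerated drift of the batch mean, in reference std devs
MAX_MISSING_FRACTION = 0.05  # tolerated fraction of missing values

class DataQualityError(RuntimeError):
    """Raised to pause the pipeline and alert a human."""

def check_feature_batch(values: np.ndarray) -> None:
    missing = np.isnan(values).mean()
    if missing > MAX_MISSING_FRACTION:
        raise DataQualityError(f"{missing:.1%} missing values exceeds threshold")

    batch_mean = np.nanmean(values)
    z_shift = abs(batch_mean - REFERENCE["mean"]) / REFERENCE["std"]
    if z_shift > MAX_Z_SHIFT:
        raise DataQualityError(f"feature mean drifted by {z_shift:.1f} reference std devs")

# A batch whose mean has silently shifted trips the alert.
try:
    check_feature_batch(np.random.normal(loc=0.9, scale=0.1, size=10_000))
except DataQualityError as err:
    print(f"ALERT, pipeline paused: {err}")
```

In a real system the reference statistics and thresholds would themselves be updated periodically, as the talk suggests, rather than hard-coded.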
Todd concluded his talk with these takeaways:
Spend time understanding your failures.
Start monitoring your model quality.
Start writing and tracking post-mortems.
Learn everything you can from an outage.
Once you find the sources of your own quality outages, reflect on how you might detect them faster and prevent some of them.
1.4 — The Build and Buy Golden Ratio
The story of build vs. buy is older than the software industry itself. It is a question of whether you want something now or later, simple or complex, etc. Most importantly, you need to decide what kind of “technical debt” you are getting into. With infrastructure, sometimes you have to buy.
If you want to integrate a deep learning component inside your product, then your options are more limited — due to variance in use cases (no one-size-fits-all), time constraints, the significant research needed to integrate with your own use cases, few established "best practices," and nascent infrastructure.
This gets even worse if deep learning is your core product. Off-the-shelf tools for scaling are not relevant yet, and infrastructure needs to be tailored to your use case. The emphasis has to be on a smooth transition from research to production.
In their session, Ariel Biller from ClearML and Dotan Asselmann from Theator presented a specific case study of build-and-buy. For context, Theator is a surgical intelligence platform that provides personalized analytics on lengthy surgical operation videos. The deployed client-side inference pipeline consists of multiple deep learning models, so no off-the-shelf solution truly fits the need for full Continuous Integration / Deployment / Training. Building such critical infrastructure within a startup's constraints would be impossible if not for existing MLOps solutions. In the current landscape, ClearML offered Theator the building blocks to realize the needed designs with relatively little integration effort.
Theator relies on ClearML for data versioning, querying, processing, and storage for their data stack. The only part that they built in-house is the GPU loading component.
Theator relies on ClearML for experiment management, training orchestration, and model training pipeline for their research stack. Of course, they wrote the research code themselves. They also developed an in-house interface called pipeline master controller that handles various tasks from configuring the model environment to spinning up cloud machines.
Theator builds almost everything from the ground up for their continuous training stack — a production training pipeline, a video inference pipeline, an auto-tagging capability, a model re-training trigger alert (for continuous training), and a suite of regression tests (for continuous deployment). The only part that they outsourced to ClearML is the data intake step (for continuous integration).
The session ended with a couple of lessons:
The build-and-buy golden ratio is use-case specific.
Avoid reinventing the wheel at all costs.
“Build” on top of comprehensive, battle-tested “Buy” solutions.
Automate your data stack.
The flexibility of your infrastructure is a major time and cost saver. You can move faster from research to production by enabling a shared infrastructure.
Note: Ariel argued that the MLOps stack should be bottom-up designed — meaning easy integration with research code, logging mechanisms, one-click orchestration, workflow versioning, etc. ClearML was built specifically for this paradigm, so check them out!
1.5 — Building ML Platform Capabilities at Global Scale
Prosus is a global consumer Internet group and one of the largest tech investors — serving 1.5B+ people in 80+ countries. It invests in four segments: classifieds (like OLX Group), payments and fintech (like PayU), food (like iFood), and education (like Brainly). Every interaction on these platforms invokes a swarm of ML models:
Classifieds: Is the content harmful? How fair is the asking price?
Payments and fintech: Does the person qualify for a loan? How should this payment be routed?
Food: Which dish to recommend? What's the expected time of arrival?
Education: Which course to recommend? What is an answer to a specific question?
Thus, there is a dire need for tools and technologies established by these organizations to support and automate various ML workflow aspects. Paul van der Boor explored the evolution of ML Platform capabilities at several Prosus companies to enable applying machine learning at scale, including iFood’s ML Platform leveraging existing Amazon SageMaker capabilities, OLX’s development of data infrastructure for ML and model serving infrastructure based on KFServing, and Swiggy’s home-built Data Science Platform making a billion predictions a day.
iFood
iFood is the biggest food delivery service in Brazil. There are millions of real-time, synchronized decisions made on the platform every day: What does the user like to eat? Does the restaurant have capacity? Which restaurant should we recommend? Which rider should deliver the order? Should we offer incentives? How long will it take to deliver? Should we offer to pay later? Can we group orders?
The ML platform that the iFood team built to handle these decisions is called Bruce. Bruce helps train models easily on AWS SageMaker, creates SageMaker endpoints to serve trained models, and integrates easily with a CI/CD environment. Bruce's design enables a central point of collaboration where data scientists do not need to interface with the infrastructure directly.
Currently, the platform team at iFood is working on expanding Bruce’s capabilities with components such as:
Feature store: using Databricks and Spark for near real-time features and Redis for storing and serving (see the sketch after this list); calculating hundreds of features millions of times; building a unified dataflow for both batch and real-time purposes.
CI/CD: using Jenkins and Airflow with an internal repo crawler to update all models daily.
Model management: using model metadata to generate a dashboard.
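To make the "Redis for storing and serving" piece concrete, here is a minimal sketch (my own illustration, not iFood's Bruce code) of writing near real-time feature values to Redis and reading them back at prediction time. It assumes a local Redis instance and redis-py 3.5 or newer; the key layout and feature names are made up.

```python
import json
import redis

# Assumes a Redis server reachable on localhost:6379.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def write_features(restaurant_id: str, features: dict) -> None:
    """Called by the (near real-time) feature pipeline."""
    key = f"features:restaurant:{restaurant_id}"
    r.hset(key, mapping={k: json.dumps(v) for k, v in features.items()})

def read_features(restaurant_id: str) -> dict:
    """Called at serving time, just before invoking the model endpoint."""
    raw = r.hgetall(f"features:restaurant:{restaurant_id}")
    return {k: json.loads(v) for k, v in raw.items()}

write_features("42", {"orders_last_hour": 37, "avg_prep_minutes": 18.5})
print(read_features("42"))  # {'orders_last_hour': 37, 'avg_prep_minutes': 18.5}
```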
OLX Group
OLX Group is a global online marketplace headquartered in Amsterdam for buying and selling goods and services such as electronics, fashion items, and furniture. Its ML platform is very mature.
During training:
The data is collected systematically, then dumped into a data lake — S3 storage that complies with data protection laws. The data is then made searchable in a data catalog, a self-service tool for data management, consumption, and discovery.
The reservoir is an isolated data storage that gives access to specific use cases.
The data is made available for real-time (with Kinesis) and batch (with Kafka) consumption.
Operational Data Hub is an Airflow-based tool that consists of Schedulers, an extendable set of Operators, and universally accessible Storage.
During serving:
Laquesis is a self-service tool for performing experiments, feature flags, and surveys.
Data API is a generic, standardized, and scalable API that serves batch predictions. KFServing is used for real-time predictions.
Both prediction modes provide standardized application integration.
Swiggy
Bangalore-based Swiggy offers an on-demand food delivery platform that brings food directly from neighborhood restaurants to users' doors. Facing the same challenges as iFood, the Swiggy team built a home-grown ML deployment and orchestration platform that lets data scientists easily integrate, deploy, and experiment with multiple models and abstracts away integrations with feature teams into simple, one-time API contracts. The platform has been battle-tested at scale (3 years in production), is part of the larger platform strategy at Swiggy, is easily extensible, and focuses on the "last mile" of the ML workflow.
The diagram above displays the platform workflow. It is designed for scale:
Supporting hundreds of real-time and batch models.
Supporting billions of predictions a day.
Calculating thousands of features millions of times per entity.
Serving some of the highest throughput and lowest latency use cases (ads serving, real-time recommendations, fraud detection, ETA predictions, …).
Owning a create-once-use-anywhere feature store.
Paul concluded the talk with a few general considerations for building ML platforms at scale:
Architecture designed for change: reduce the cost of change by depending on a tool's interface rather than its implementation.
Use multiple things in parallel: different tools and technologies can serve different purposes.
Don’t reinvent the wheel: think carefully about your use cases and choose the right tools.
Ensure scalability of tools: ensure that your tools remain cost-effective and meet the SLA requirements at scale.
Take the MLOps perspective: the tools continue to mature, so look out for this landscape!
1.6 — Cost Transparency and Model ROI
As companies adopt AI/ML, they run into operational challenges with cost and ROI questions: How do you capture the costs of feature computation, model training, and model predictions? How do you forecast the costs? If a model needs an expensive GPU, what’s the ROI on a model or a set of features? Srivathsan Canchi and Ian Sebanja from Intuit gave a brief overview of Intuit’s ML platform, with a specific focus on operational cost transparency across feature engineering, model training and hosting.
At Intuit, the ML platform is responsible for the different parts of the model development lifecycle. It serves 400+ models and 8,000+ features, alongside 25+ training runs and 15 billion feature updates per day. Because the majority of ML workloads are in the cloud, cloud infrastructure is a substantial (and hidden) cost for Intuit. While their ML use cases have increased 15 times year-over-year, their costs have not increased at the same pace. Ultimately, the cost is a tradeoff between model throughput/latency requirements and financial resources.
Srivathsan described the technical way to address this tradeoff while minimizing the data scientists' overhead. From day one, every data scientist is granted a unique ID per model, and all resources are tracked against that ID. They can specify the projects they work on, and resources are isolated for them. Over time, as their work patterns emerge, the platform provides smart defaults for the most common use cases to automate the workflow as much as possible.
Ian also discussed the non-technical levers the Intuit platform team has utilized: providing information at the right points to encourage transparency and visibility, educating data scientists on cost optimization opportunities, collaborating with data scientists on solutions to evaluate model performance, and determining the ROI of costs saved.
Note: Overall, these practices to enable cost transparency help Intuit articulate the impact of ML projects and encourage data scientists to make efficient use of internal resources.
1.7 — ML Product Experiments at Scale
Adapting a digital product A/B testing system to support complex ML-powered use cases requires advanced techniques, highly cross-functional product, engineering, and ML teamwork, and a unique design approach. Justin Norman explored lessons learned and best practices for building robust experimentation workflows into production machine learning deployments at Yelp.
Any sophisticated experiment management tool must enable the ML engineers to:
Create a snapshot of model code, dependencies, and config necessary to train the model.
Keep track of the current and past versions of data, associated images, and other artifacts.
Build and execute each training run in an isolated environment.
Track specified model metrics, performance, and model artifacts.
Inspect, compare, and evaluate prior models.
There are many proprietary tools in the market, such as Neptune, Comet, Weights and Biases, SageMaker, etc. There are also many open-source frameworks like Sacred, MLflow, Polyaxon, guild.ai, etc. Yelp invested in the open-source libraries that align best with their needs and constructed thin wrappers around them to allow easier integration with their legacy code. On the model experimentation side, they opted for MLflow, which significantly automates model tracking and monitoring. On the model serving side, they used MLeap for model serialization and deployment.
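To give a flavor of what such a tracking layer records, here is a generic MLflow sketch (not Yelp's internal wrapper) that logs the parameters, metrics, and model artifact for a single training run so it can later be inspected and compared:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="baseline-logreg"):
    params = {"C": 0.5, "max_iter": 2000}
    mlflow.log_params(params)

    model = LogisticRegression(**params).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    # The metric and the serialized model are attached to this run,
    # so prior models can be inspected, compared, and reproduced later.
    mlflow.log_metric("test_auc", auc)
    mlflow.sklearn.log_model(model, artifact_path="model")
```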
Justin then talked about the difference between testing and experimentation. The key difference is that experimentation is used to make a discovery or test a hypothesis, while testing validates a feature before it is taken into widespread use. The impact of the experiment results determines the type of experiment required. At Yelp, there are two types:
Multivariate Experiment (A/B) (most common): “I expect the feature to have an impact on our users — I’ll ship in the event of success and won’t ship in the event of failure.”
Roll-out (rarer): “I do not expect the feature to have an impact on our users, but I’ll ship either way as long as I don’t break anything major.”
An experimentation platform answers the key question: Is the best-trained model indeed the best model, or does a different model perform better on new real-world data? More specifically, the ML engineers need a framework to:
Identify the best performers among a competing set of models.
Evaluate models that can maximize business KPIs.
Track specified model metrics, performance, and artifacts.
Inspect and compare deployed models.
Every experiment needs a hypothesis: If we [build this], then [this metric] will move because of [this behavior change]. At Yelp, the product managers choose the decision metrics, and the data scientists consult and sign off on those metrics. The data scientists are also responsible for kicking off the conversation with data engineers to articulate the data needed for metric computation and understand which events will be logged. Justin brought up the concept of minimum detectable effect, the relative difference at which you actually start to care — is your experiment risky enough that you need to detect a certain percentage loss or gain in your primary metric to make a decision?
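To ground the minimum detectable effect idea with numbers, here is a small sketch of my own (using statsmodels, not Yelp's tooling) that estimates how many users per variant an A/B test needs to detect a 1-point absolute lift on a 20% baseline conversion rate:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20  # current conversion rate
mde = 0.01       # minimum detectable effect: a 1-point absolute lift is what we care about

effect_size = proportion_effectsize(baseline + mde, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, ratio=1.0
)
print(round(n_per_variant))  # roughly 25,000 users per variant
```

Halving the MDE roughly quadruples the required sample size, which is why choosing it carefully matters.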
Today, nearly all data experimentation at Yelp — from products to AI and machine learning — occurs on the custom-built Bunsen platform, with over 700 experiments in total being run at any one time. Bunsen supports the deployment of experiments to large but segmented parts of Yelp’s customer population, and it enables the company’s data scientists to roll back these experiments if need be. Bunsen’s goal is to dramatically improve and unify experiments and metrics infrastructure by dynamically allocating cohorts for experiments.
Bunsen consists of a frontend cheekily dubbed Beaker, which product managers, data scientists, and engineers use to interact with the toolset. A “scorecard” tool facilitates the analysis of experimental run results, while the Bunsen Experiment Analysis Tool — BEAT — packages up all of the underlying statistical models. There’s also a logging system used to track user behavior and serve as a source of features for AI/ML models.
Bunsen is a distributed platform meant to be utilized by various roles. Product managers, engineers outside the machine learning and AI space, machine learning practitioners, data scientists, and analysts at Yelp either consume information that comes from Bunsen or work directly with Beaker to gather it.
Justin ended the talk with a small discussion of multi-armed bandits, an algorithmic technique used to dynamically allocate more traffic to variants that are performing well while allocating less traffic to under-performing variations. Essentially, you get more value from the experiment faster. There are different multi-armed bandit algorithms, including epsilon-greedy, upper confidence bound, and Thompson sampling. At the moment, Yelp's platform team is experimenting with contextual bandits, which use context from incoming user data to make better decisions on which model to use for inference in near real-time.
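For readers unfamiliar with the technique, here is a tiny epsilon-greedy sketch (illustrative only, not Bunsen's implementation) that gradually shifts traffic toward the better-performing variant as feedback accumulates:

```python
import random

class EpsilonGreedyBandit:
    """Allocate traffic across variants, exploring with probability epsilon."""

    def __init__(self, variants, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {v: 0 for v in variants}
        self.successes = {v: 0 for v in variants}

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))  # explore
        rates = {v: self.successes[v] / max(self.counts[v], 1) for v in self.counts}
        return max(rates, key=rates.get)             # exploit the current best

    def update(self, variant: str, reward: int) -> None:
        self.counts[variant] += 1
        self.successes[variant] += reward

# Simulated experiment: variant "B" truly converts better than "A".
true_rates = {"A": 0.05, "B": 0.08}
bandit = EpsilonGreedyBandit(["A", "B"])
for _ in range(50_000):
    v = bandit.choose()
    bandit.update(v, int(random.random() < true_rates[v]))
print(bandit.counts)  # most of the traffic ends up on "B"
```

A contextual bandit extends this by conditioning the choice on features of the incoming request instead of keeping a single global estimate per variant.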
2 — Technology
2.1 — Feature Stores: Solving The ML Data Problem
Operational ML is redefining software today.
Let’s take a look at what differentiates Analytic ML from Operational ML:
Analytic ML is developed in analytic environments, typically focused on batch analytics and plugged into a data lake or data warehouse. The outputs are one-off dashboards/forecasts that humans consume. There has been an increasing demand to use ML to power end-user experiences and automate business decisions.
Operational ML brings that analytic stack to real-time to support operational systems. In the operational environment, we are not just limited to batch data. Instead, we have various new data sources (real-time data streams or user-generated sessions). Use cases range from financial services (fraud detection, trading systems) to retail (product recommendation, real-time pricing) to insurance (personalized insurance, claims processing).
Operational ML, however, is still very hard to do. Here are the signs that a team is struggling:
Frequent duplication of efforts by team members.
Multiple stakeholders are involved in any production push.
Models take months to get to production.
Lack of monitoring.
Production models lag behind development versions.
Important sources of data are not used to make predictions.
Results are not reproducible in training and/or production.
Building operational ML applications is very complex. Data and features are at the core of that complexity.
We have been building software applications for years and have refined our development process with the DevOps pipeline, allowing us to build and deploy application code iteratively. Many tools have emerged to support that process. We have also seen a similar set of tools emerge to manage ML models. All the MLOps platforms are great at model training, model experimentation, and model serving. However, tooling for managing features is almost non-existent. We have tools for exploratory data analysis and feature engineering but nothing to manage the end-to-end lifecycle of features.
Mike Del Balso gave an excellent talk on how the feature store solves this problem. It is basically the interface between data infrastructure and the model infrastructure. It allows data scientists and data engineers to build the catalog of productionized data pipelines. The feature store can connect to existing data that lives on data warehouses or other databases. It can also connect to raw data in those systems and derive features from such data. Data scientists interact with a feature store by productionizing the features and making them reusable for the rest of the organization.
More specifically:
The feature store connects to your raw data and (optionally) transforms the data into feature values.
It stores the features in both online and offline layers, which map to both the operational and analytic environments.
It serves the features for real-time predictions or dataset generation.
It also monitors the features and maintains a central catalog (registry) that can be used for compliance and discovery purposes.
Generally speaking, a feature store helps you:
Build more accurate models by leveraging all of your data (batch, streaming, and real-time) to make predictions and ensuring data consistency between training and serving.
Get models to production in hours instead of weeks by allowing data scientists to build production-ready features and serve them online, and eliminating the need to reimplement data pipelines in production-hardened code.
Bring DevOps-like best practices to feature development by managing features as code in Git-backed repo, sharing/reusing features, monitoring data quality/tracking lineage, and controlling data access/ensuring compliance.
Here are common problems and solutions that concern a feature store:
A common problem that many teams have is that not all of their data is available to use for predictions. Only a subset of the data sources is useful for the live model in practice. A Feature Store connects the models to the relevant data sources, enabling the data scientists to easily build ML systems.
Another common problem is that ML teams are stuck building complex data pipelines. In other words, data scientists put the burden on data engineers, who often do not have the same incentives. Such back-and-forth conversations can slow down projects by months or years just to get the features deployed. A Feature Store enables the data scientists to self-serve deploy features to production. This increases the autonomy and ownership of the data scientists.
The final common problem (an anti-pattern) is that teams often re-build features repeatedly (lots of feature duplication). Features are some of the most highly curated and refined data in a business, yet they are also some of the most poorly managed assets. A Feature Store centrally manages features as software assets. It contains a standard definition of all the features, manages them in a central Git repository, and brings a DevOps-like sense of iteration.
Note: To get started with feature stores, you should check out Feast and Tecton. Feast is a self-managed open-source software that supports batch and streaming data sources and ingests feature data from external pipelines. Tecton is a fully-managed cloud service that supports batch, streaming, and real-time data sources and automates feature transformations.
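For a feel of the developer experience, the snippet below sketches what fetching features for training and for online inference looks like with Feast. It assumes a Feast 0.10+ style repository with a `driver_hourly_stats` feature view already registered (Feast's quickstart example), so treat the exact names and API surface as illustrative rather than authoritative.

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at an existing feature repository

# Offline: build a point-in-time-correct training set by joining feature values
# onto labeled entity rows as of each row's event timestamp.
entity_df = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": [datetime(2021, 1, 10), datetime(2021, 1, 11)],
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["driver_hourly_stats:conv_rate", "driver_hourly_stats:avg_daily_trips"],
).to_df()

# Online: fetch the same features with low latency at prediction time.
online_features = store.get_online_features(
    features=["driver_hourly_stats:conv_rate", "driver_hourly_stats:avg_daily_trips"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```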
2.2 — Data Prep Ops: The Missing Piece of the ML Lifecycle Puzzle
To make ML work for organizations, there are three components that we need to nail down: the technology, the operational concerns, and the organizational concerns. Today, we are right where we need to be on technology. However, the operational and organizational pieces are not yet figured out.
The typical ML lifecycle consists of three pieces: (1) Data Preparation (data collection, data storage, data augmentation, data labeling, data validation, feature selection), (2) Model Development (hyper-parameter tuning, model selection, model training, model testing, model validation), and (3) Model Deployment (model inference, model monitoring, model maintenance). In her talk, Jennifer Prendki dug into how the deficit of attention from experts to one of the most critical areas of the ML lifecycle — that of data preparation — is the likely cause for a still highly dysfunctional ML lifecycle. According to her, while general wisdom acknowledges that high-quality training data is necessary to build better models, the lack of a formal definition of a good dataset — or rather, the right dataset for the task — is the main bottleneck impeding the universal adoption of AI.
In ML, data preparation means building a high-quality training dataset. This entails asking questions such as: What data? Where to find that data? How much data? How to validate that data? What defines quality? Where to store that data? How to organize that data?
Data labeling is the tip of the iceberg of data preparation. There are so many different pieces, including label validation, data storage, data augmentation, data selection, third-party data, synthetic data, data privacy, data aggregation, data fusion, data explainability, data scraping, feature engineering, feature selection, and ETL.
Data Acquisition
When we talk about data acquisition, we tend to think about data collection. There are both physical and operational considerations with this — ranging from where to collect the data to how much to collect initially. If we want to get data collection right, we need a feedback loop aligned with the business context.
Synthetic data generation is a second approach, with pop-culture use cases such as DeepFakes. Many other use cases range from facial recognition and life sciences to eCommerce and autonomous driving. However, generating synthetic data requires a lot of training data and compute. Furthermore, synthetic data is not realistic for unique use cases, and there is a lack of fundamental research on its impact on model training.
A third approach is known as data 'scavenging' — where we scrape the web and/or use open-source datasets (Google Dataset Search, Kaggle, academic benchmarks).
Finally, we can acquire data by purchasing it (legally collected and raw/organized training data).
Data Enhancement
After acquiring the data, we would like to enhance it. The most critical part of data enhancement is, of course, data labeling. This is an industry of its own because we have to answer so many questions:
Who should label it? Humans or the machines? Human-in-the-loop approach? Proficient annotators or crowd-sourced workers? Which vendor? Which annotator?
How should it be labeled? Which tools (open-source, in-house, rented)? How should the instructions be written?
What should be labeled?
Indeed, getting the data labeling step right is extremely complicated because it is error-prone, slow, expensive, and often impractical. Efficient labeling operations would require a vetted process, qualified personnel, high-performance tools, its own lifecycle, a versioning system, and a validation process.
Even after getting the labels, we are not done yet. We need to validate the labels. This can be accomplished manually by annotators, an internal team, or a third party. Here are potential challenges:
How to vet the annotators in advance?
How to separate honest mistakes from fraud?
How to do “cross-validation” and statistical analysis for explicit and implicit quality assurance?
How to find labeling errors during inference time?
Data augmentation is a scientific process where we can manipulate the data via flipping, rotation, translation, color changes, etc. However, scaling data augmentation to bigger datasets, negating memorization, and handling biases/corner cases are fundamental issues.
Data Transformation
The next phase of the Data Preparation lifecycle is data transformation. This phase includes three steps:
Data formatting: This is a small slice of the whole data engineering task that interfaces with data warehouses, data lakes, and data pipelines.
Feature engineering: This includes concepts such as feature stores, management of correlations, management of missing records, and going from feature selection to embeddings.
Data fusion: This last step means fusing data from different modalities, sensors, and timelines.
Data Triaging
Data triaging starts from the observation that, because the dataset is so large, we cannot be picky about the types of data that we are going to use. Thus, we need to catalog and structure the data methodically:
We can sort the data by use cases.
We can make the data searchable by pre-defined tags.
We can cluster the data via specific embeddings and metadata.
We might even want to sell data to 3rd-party or tackle out-of-distribution example prediction.
Finally, there is the concept of data selection. In any dataset, there will be high-value data (useful), redundant/irrelevant data (useless), and mislabeled/correlated/low-quality data (harmful).
Jennifer argued that
A data prep ops market is under-served, under-estimated, and misunderstood by the broader ML community.
Here are two advanced workflows that have shown promising potential for data prep ops:
Active learning is a semi-supervised learning technique based on an incremental learning approach, where we alternate between phases of training and inference (in-training validation). This technique is used to reduce labeling costs through a strategic selection of data. This approach requires the merging of the data-preparation realm with the model-training realm.
Human-In-The-Loop is another step where we rely on a machine learning model combined with a human oracle to perform data labeling. In a simple workflow, we use ML to route the easy cases to the model and the hard cases to the human. In a complex workflow, we use ML to validate or adversarially question the labels' trustworthiness. Again, this approach requires us to think of the data-preparation realm as a fully embedded process that cannot be separated from the model-training realm.
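Here is a minimal sketch of the active learning loop described above, using uncertainty sampling with scikit-learn (my own illustration, not Alectio's product): each round, the current model scores the unlabeled pool, and only the examples it is least sure about get sent for labeling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[np.random.choice(len(X), 50, replace=False)] = True  # small seed set

model = LogisticRegression(max_iter=1000)
for round_ in range(5):
    model.fit(X[labeled], y[labeled])

    # Score the unlabeled pool and pick the most uncertain examples
    # (predicted probability closest to 0.5) to send to annotators.
    pool = np.where(~labeled)[0]
    uncertainty = np.abs(model.predict_proba(X[pool])[:, 1] - 0.5)
    to_label = pool[np.argsort(uncertainty)[:100]]
    labeled[to_label] = True  # in real life, this is where the human oracle comes in

    print(f"round {round_}: {labeled.sum()} labels, accuracy {model.score(X, y):.3f}")
```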
In the future, data preparation needs to be elevated to a first-class citizen of the MLOps trilogy. Furthermore, data preparation is so complicated that it requires its own set of operations. Broadly speaking, the ML community needs a paradigm shift where (1) we do not consider data as a static object and (2) we understand that more data does not always mean better performance.
Note: This is probably the most useful talk at the conference for me personally. I look forward to seeing how Jennifer’s company, Alectio, elevates the data prep ops market.
2.3 — Unified MLOps: How Database Deployment and A Feature Store Can Revolutionize ML
Diving further into the firehose of feature stores, Monte Zweben presented a whole new approach to MLOps that allows you to successfully scale your models without increasing latency by merging the database with machine learning. He started the talk by defining a feature store as a centralized repository of continuously updated raw and transformed data for machine learning. The whole idea is to get data scientists to leverage each other's work and take the mundane work out of their everyday feature engineering tasks.
In particular, the feature store takes data coming from the enterprise at different cadences (whether real-time events from websites and mobile apps or data sources from databases and data warehouses) and puts that raw data through transformations. These transformations may be batch or event-driven. Then the data lands into the feature store. Data scientists can search that feature store, form training sets from historical feature values, serve features in real-time to models, and look at the history of features (for governance and lineage purposes).
There are many requirements for a feature store. Monte listed a few below:
Scale up to 1 billion records and 20,000 features.
Feature vector retrieval by primary key for inference in the order of milliseconds.
Point-in-time consistency on training data.
Event-driven and batch feature updates.
Track feature lineage.
Discoverability and reuse with feature metadata.
Backfill of new features.
He then discussed the unique design of his company's open-source platform, Splice Machine, which allows for the deployment of machine learning models as intelligent tables inside its unique hybrid (HTAP) database. Splice Machine's simplicity comes from using one engine to store both online and offline features. This underlying data engine is a relational database that performs transactional and analytical workloads in an ACID-compliant way.
What are the benefits of having a single store (vs. having separate online and offline stores)?
It is easier to provision and operate.
It takes less infrastructure cost.
It is easier to backup or replicate.
There is no latency to synchronize.
It can take advantage of features like Triggers, which enable event-driven pipelines.
The underlying SQL platform can introspect and interrogate any SQL statements that come to it. It does so thanks to a cost-based optimizer using statistics. Those statements are then executed in an Apache HBase key-value store. On the other hand, if you perform many table scans, joins, or aggregations, the cost-based optimizer will send an instruction for that query to Apache Spark for execution.
A huge requirement of any feature store is solving the point-in-time consistency problem. This means keeping track of feature values as they change over time and keeping them consistent together to form training examples. How does Splice Machine accomplish that?
Splice Machine's single engine keeps two important tables per feature set. The first is a feature set table that holds the live feature values. The second is a history table with a time-indexed history of feature values.
As the feature engineering pipeline executes, the features are updated in the live feature set table. Each update fires the relational database's trigger capability, which moves the old values of those same features into the history table — indexed with a time interval.
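Point-in-time consistency is easier to grasp with a toy example. Splice Machine implements it with triggers and a history table inside the database, but the same idea can be sketched in pandas (my own illustration): for each training label, pick the most recent feature value that existed at that label's timestamp, never a future one.

```python
import pandas as pd

# Time-indexed history of one feature (what the history table stores).
feature_history = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2021-01-01", "2021-01-05", "2021-01-09", "2021-01-03"]),
    "orders_last_30d": [2, 3, 5, 1],
}).sort_values("ts")

# Labeled events we want to turn into training examples.
labels = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "ts": pd.to_datetime(["2021-01-04", "2021-01-10", "2021-01-04"]),
    "churned": [0, 1, 0],
}).sort_values("ts")

# merge_asof picks, for each label row, the latest feature value at or before
# the label's timestamp, which is exactly the point-in-time guarantee.
training_set = pd.merge_asof(
    labels, feature_history, on="ts", by="customer_id", direction="backward"
)
print(training_set)
# The 2021-01-04 label for customer 1 gets orders_last_30d=2, not the later 3 or 5.
```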
There are even more benefits when you couple a feature store with deployment. Many current ML systems use the traditional deployment method with an endpoint (potentially containerized). Splice Machine also offers database deployment in a single click — taking advantage of the transactional database and its features to make model serving both transparent and efficient. Every new record automatically triggers a prediction, which is made and populated at sub-millisecond speed. There is no extra endpoint programming. Furthermore, this method enables easy model governance and traceability via SQL statements.
Transparency is probably the biggest benefit of all. The database deployment method records every prediction, which model made it, and where the prediction was used. Then we can go back to other components of the system and look at what features were used. Because those features in the feature store have a history, we can travel back to the raw data from which the feature values were created.
Overall, the aspiration behind having this single feature store is to scale data science faster with less headcount.
Note: Question for the readers — How do you compare Splice Machine and Tecton?
2.4 — How An Optimized ML Lifecycle Can Drive Better ROI
When many businesses start their journey into ML and AI, it's common to place a lot of energy and focus on the coding and the data science algorithms themselves. The reality is that the actual data science work and the machine learning models are only part of the enterprise machine learning puzzle. Success with ML adoption is intertwined across teams and stages, and collaboration is critical. Priyank Patel gave a presentation on how to effectively tackle the whole puzzle using the Cloudera Data Platform.
The Cloudera Data Platform is designed under the core principle that production ML requires an integrated data lifecycle — including data collection, data curation, data reporting, model serving, and model prediction. Any user of this platform can:
Control cloud costs with auto scale, suspend, and resume features.
Optimize workloads based on analytics and ML use cases.
View data lineage across any cloud and transient clusters.
Use a single pane of glass across hybrid and multi-clouds.
Scale to petabytes of data and 1,000s of diverse users.
This full lifecycle enables collaboration between teams of data engineers, data scientists, and business users. More specifically, the platform team at Cloudera tailors unique offerings for each of these three personas:
For Data Scientists: Cloudera Data Science Workbench offers a collaborative development environment, flexible resource allocations, notebook interface, and personalized repetitive use cases with Applied Research documents created by Cloudera’s Fast Forward Labs.
For Data Engineers: Cloudera Data Engineering offers an integrated, purpose-built experience for data engineers with these capabilities:
Containerized, managed Spark service with multiple version deployments and autoscaling compute, governed and secured with Cloudera SDX.
Apache Airflow orchestration with open preferred tooling to easily orchestrate complex data pipelines and manage/schedule dependencies.
Visual troubleshooting that enables real-time visual performance profiling and complete observability/alerting capabilities to resolve issues quickly.
Simplified job management and APIs in diverse languages to automate the pipelines for any service.
For Business Users: Cloudera Data Visualization enables the creation, publication, and sharing of interactive data visualizations to accelerate production ML workflows from raw data to business impact. With this product:
The data science teams can communicate and collaborate effectively across the data lifecycle while experimenting with, training, and deploying models across the business.
The business teams can leverage explainable AI, acquire broader data literacy, and make informed business decisions.
3 — Perspectives
3.1 — Machine Learning Is Going Real-Time
Chip Huyen gave a talk that covers the state of real-time machine learning in production, based on her recent blog post.
Chip defined these two levels of real-time machine learning:
Level 1 is online predictions, where the system can make predictions in real-time (defined in milliseconds to seconds).
Level 2 is online learning, where the system can incorporate new data and update the model in real-time (defined in minutes).
Online Predictions
Latency matters a lot in online predictions. A 2009 study from Google shows that increasing latency from 100 to 400 ms reduces searches by 0.2% to 0.6%. Another 2019 study from Booking.com shows that a 30% increase in latency costs about a 0.5% decrease in conversion rate. The crux is that no matter how great your models are, users will click on something else if they take just milliseconds too long.
In the last decade, the ML community has gone down the rabbit hole of building bigger and better models. However, they are also slower, as inference latency increases with model size. One obvious way to cope with longer inference time is to serve batch predictions. In particular, we (1) generate the predictions in batches offline, (2) then store them somewhere (SQL tables, for instance), and (3) finally pull out pre-computed predictions given users' requests.
However, there are two main problems with batch predictions:
The system needs to know exactly how many predictions to generate.
The system cannot adapt to changing interests.
Online predictions can address these problems because (1) the input space is infinite and (2) dynamic features are used as inputs. In practice, online predictions require two components: fast inference (models that can make predictions in the order of milliseconds) and a real-time pipeline (one that can process data and serve models in real-time).
There are three main approaches to enable fast inference. You can make models faster by optimizing inference for different hardware devices (TensorRT). You can make models smaller via model compression techniques such as quantization, knowledge distillation, pruning, and low-rank factorization (check out this article on how Roblox scaled a BERT model to serve 1+ billion daily requests on CPUs). You can also make the hardware more powerful, both for training/inference and cloud/on-device.
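As one concrete example of the compression route, post-training dynamic quantization in PyTorch converts the linear layers of a model to int8 in a single call, typically shrinking the model and speeding up CPU inference (a generic sketch; actual gains depend on the model and hardware):

```python
import torch
import torch.nn as nn

# A stand-in for a larger classifier; quantization is applied the same way.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
model.eval()

# Replace Linear layers with int8 dynamically quantized equivalents.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = quantized(torch.randn(1, 768))
print(out.shape)  # torch.Size([1, 2])
```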
A real-time pipeline requires quick access to real-time features. The most practical approach is to store incoming events in stream storage (such as Apache Kafka or Amazon Kinesis) and process them as they arrive.
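As a sketch of what "processing events as they arrive" can look like, here is a minimal consumer loop using kafka-python; the topic name, payload schema, and windowed feature are all hypothetical:

```python
# A minimal sketch of maintaining real-time features from a stream.
# Assumes a local Kafka broker and a hypothetical "user-clicks" topic with JSON payloads.
import json
from collections import defaultdict, deque
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "user-clicks",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# A cheap stand-in for a windowed feature: each user's last 30 clicked items.
recent_items = defaultdict(lambda: deque(maxlen=30))

for event in consumer:
    click = event.value  # e.g., {"user_id": ..., "item_id": ..., "ts": ...}
    recent_items[click["user_id"]].append(click["item_id"])
    features = {"recent_items": list(recent_items[click["user_id"]])}
    # hand `features` to the online model or push them to a feature store here
```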
A model that serves online predictions typically needs two separate pipelines: one for streaming data and one for static data. Having two different teams maintain these pipelines is a common source of errors in production.
Traditional software systems rely on REST APIs, which are request-driven: different microservices within the system communicate with each other via requests. Because every service does its own thing, it’s difficult to map data transformations through the entire system, and debugging becomes a nightmare when the system goes down.
An alternative approach to the above is the event-driven pub-sub way, where all services publish and subscribe to a single stream to collect the necessary information. Because all of the data flows through this stream, we can easily monitor data transformations.
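A tiny sketch of the pub-sub side, again with hypothetical topic and field names: instead of one service calling another, each service publishes its events to a shared stream that any other service, including monitoring, can subscribe to.

```python
# A sketch of the event-driven pattern: services publish to one shared stream
# rather than calling each other directly. Topic and field names are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_prediction(user_id, prediction, model_version):
    # Ranking, logging, and monitoring services all subscribe to "predictions";
    # none of them needs to know about this service directly.
    producer.send("predictions", {
        "user_id": user_id,
        "prediction": prediction,
        "model_version": model_version,
    })
    producer.flush()
```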
There are several barriers to stream processing:
The first one is that companies don’t see the benefits of streaming (maybe the system is not scalable, maybe batch predictions work fine, or maybe online predictions are unpredictable).
The second one is the high initial investment in infrastructure. Switching from batch to online streaming is a monumental task.
The third one is a mental shift, especially for academically-trained engineers used to the batch mode.
The final one is that current streaming tools are largely built on Java, which presents another hurdle for Python-centric teams.
Online Learning
There is a subtle distinction between online learning and online training. Online training means learning from each incoming data point, which can suffer from catastrophic forgetting and can get very expensive. Online learning, on the other hand, means learning in micro-batches and evaluating the predictions after a certain period of time (whether offline or online). It is often designed in tandem with offline learning.
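Here is a minimal sketch of micro-batch online learning with scikit-learn's `partial_fit`, using a test-then-train loop (evaluate on each incoming batch before updating on it); the synthetic stream is purely illustrative:

```python
# A sketch of online learning in micro-batches: evaluate on each new batch,
# then update the model on it. The synthetic data stream is illustrative only.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

model = SGDClassifier()
classes = np.array([0, 1])  # must be declared up front for partial_fit

def micro_batches(n_batches=100, batch_size=64):
    """Stand-in for a real data stream."""
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, 20))
        y = (X[:, 0] + rng.normal(scale=0.5, size=batch_size) > 0).astype(int)
        yield X, y

seen_first_batch = False
for X, y in micro_batches():
    if seen_first_batch:
        print("rolling accuracy:", accuracy_score(y, model.predict(X)))
    model.partial_fit(X, y, classes=classes)  # incremental update, no full retrain
    seen_first_batch = True
```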
The biggest use case for online learning right now is recommendation systems, because user feedback provides natural labels. However, not all recommendation systems need online learning, especially those serving slow-changing preferences over relatively static items. For fast-changing preferences, such as fresh media content, online learning would indeed be helpful.
There are also other use cases for online learning, such as dealing with rare events, tackling the cold-start problem, or making predictions on edge devices.
Finally, here are the barriers to online learning:
There is no epoch because each data point can be seen only once.
There is no convergence due to shifting data distribution.
There is no static test set as data arrives continuously.
Note: Be sure to read Chip’s blog post for the complete overview of real-time machine learning!
Food for thought: how should we design an MLOps tool for real-time purposes?
3.2 — Top 10 Trends in Enterprise Machine Learning for 2021
In the past 12 months, there have been myriad developments in the machine learning field. Not only have we seen shifts in tooling, security, and governance needs for organizations, but we’ve also witnessed massive changes in the field due to the economic impacts of COVID-19. Every year, Algorithmia surveys business leaders and practitioners across the field for an annual report about the state of machine learning in the enterprise. Diego Oppenheimer, the founder and CEO of Algorithmia, shared the top 10 trends driving the industry in 2021 and his tips for organizations that want to succeed with AI/ML in the coming year.
Overall, here are the 4 themes presented:
Priority shifts: Organizations respond to economic uncertainty by dramatically increasing ML investment.
Technical debt is piling up: Despite the increased investment, organizations are spending even more time and resources on model deployment.
Challenges remain: Enterprises continue to face basic challenges across the ML lifecycle.
MLOps preferences: Orgs report improved outcomes with third-party MLOps solutions.
Here are the 3 trends regarding priority shifts:
ML priority and budgets are increasing: The average number of data scientists employed has increased 76% year-on-year.
Customer experience and process automation represent the top ML use cases: For nearly all use cases, 50% or more organizations are increasing their ML usage.
There is an increasing gap between the ML “haves” and “have-nots”: The world’s largest enterprises dominate the high-end of model scale.
Here are the 4 trends regarding the challenges:
Governance is the top challenge by far: 56% of organizations struggle with governance, security, and auditability issues. 67% of all organizations must comply with multiple regulations.
Integration and compatibility issues remain: 49% of organizations struggle with data integration and compatibility of ML technologies, programming languages, and frameworks.
There is a need for organizational alignment: Successful ML initiatives involve decision-makers from across the organization (especially the IT infrastructure and operations leader).
Alignment issues limit ML maturity: Organizational alignment is the biggest gap in achieving ML maturity.
Here are the 2 trends regarding technical debt:
Deployment time is increasing: The time required to deploy a model increases year-on-year.
Data scientists spend too much time on deployment: 38% of organizations spend more than 50% of their data scientists’ time on deployment. Those with more models spend more of their data scientists’ time on deployment, not less.
The last trend in MLOps preferences is improving outcomes with MLOps solutions.
Buying a 3rd-party solution costs 19–21% less than building your own.
Organizations that buy a 3rd-party solution spend less of their data scientists’ time on model deployment.
The time required to deploy a model is 31% lower for organizations that buy a 3rd-party solution.
Diego concluded the talk with these notes:
2021 is the year of ML: The coming year will be crucial for ML initiatives. There’s increased urgency, so don’t get left behind.
More accessible than ever: Despite the increasing complexity of the space, it’s never been easier to start investing in ML and scale it more effectively.
You need MLOps: Organizations that invest in operational efficiency will reap the greatest benefits in 2021. The time to act is now!
Note: Read the whole report at this link!
3.3 — Build An End-To-End Machine Learning Workflow
The landscape of ML tooling has become richer and richer over the last few years. New tools come out every few weeks, each solving that “one nagging problem” in the ML workflow. The result is a jungle of opinionated tooling that can easily overwhelm machine learning engineers and leaders. The common challenges organizations face when scaling their ML operations are limited data access, limited ML infrastructure, disconnected ML workflows, tech mismatch, and limited visibility. The truth is that ML engineers often spend 50–70% of their time stitching phases together and maintaining technical debt, which includes glue code, pipeline jungles, and dead experimental code paths.
Unless you work for a top tech giant, the chances are that you have spent months building your current ML setup as follows:
Storage: S3 or Google Cloud Storage.
Compute: EC2 or Google Compute Engine.
Development Environment: a notebook or another text editor.
Serving: a homegrown service running on EC2 or Google Compute Engine.
Model Storage: a list of models on S3 following some file-naming convention.
And a lot of custom scripting and manual work to stitch these pieces together (a typical glue snippet is sketched below).
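To give a flavor of that glue, here is a hedged sketch of a homegrown "model registry": a helper that pushes a trained model to S3 under an ad-hoc naming convention. The bucket name and key layout are hypothetical.

```python
# A sketch of typical homegrown glue code: pushing a model artifact to S3
# under an ad-hoc naming convention. Bucket and key layout are hypothetical.
import datetime
import boto3

s3 = boto3.client("s3")

def upload_model(local_path, model_name, version=None):
    version = version or datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    key = f"models/{model_name}/{version}/model.pkl"  # the "naming convention"
    s3.upload_file(local_path, "my-ml-artifacts", key)  # hypothetical bucket name
    return key

# upload_model("model.pkl", "churn-classifier")
```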
Mohamed Elgendy shared a template of an end-to-end ML workflow stack that you need to consider when building your own ML workflow. He first identified the important characteristics of an ML workflow to look for:
A streamlined workflow: stitching different phases and team handoffs will significantly accelerate your ML development lifecycle.
Experiment reproducibility: versioning and tracking all the system artifacts (datasets, pre-processing scripts, configs, and dependencies); see the sketch after this list.
Scalability: automating model builds, training, and deployment.
Testing and monitoring: ML systems fail silently. A robust testing and monitoring infrastructure is necessary to safeguard reliable deployment.
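As mentioned above, here is one way to get that reproducibility, sketched with MLflow tracking. MLflow is an illustrative choice, not something the talk prescribed, and the parameter values and file names are placeholders.

```python
# A sketch of artifact versioning for reproducibility, using MLflow tracking.
# The parameter values and file names are placeholders.
import mlflow

with mlflow.start_run(run_name="baseline"):
    # Version the configuration alongside the run
    mlflow.log_params({"learning_rate": 0.01, "n_estimators": 200, "dataset": "v3"})
    # Version the pre-processing script and pinned dependencies as artifacts
    mlflow.log_artifact("preprocess.py")
    mlflow.log_artifact("requirements.txt")
    # ... train the model here ...
    mlflow.log_metric("val_auc", 0.91)
```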
He then outlined an ML stack template, calling out which components in his diagram should be prioritized first:
In the data platform environment, you want the data to be accessible in one place. All the data sources should be loaded into a centralized data storage. Additional components are curated storage (for data cleaning and aggregation), feature store (for feature sharing and reuse), and model registry (for model versioning). It would help if you also considered a data discovery tool to search for the right data from this platform environment.
In the ML development studio, you definitely want an experimentation environment (such as a notebook). If you use deep learning algorithms, consider experiment tracking and GPU optimization/distributed training tools. After finishing your experiments, consider automated hyperparameter tuning and neural architecture search options, as well as reproducible pipelines, to accelerate your experimentation process. A model catalog follows directly from the model registry.
In the production environment, your model serving should have both online and batch options. Consider model monitoring if you are serious about keeping track of model performance over time.
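For the online half of serving, here is a minimal sketch of wrapping a trained model in an HTTP service with FastAPI (an illustrative choice; the batch half would typically run as a scheduled scoring job instead). The model file and endpoint are placeholders.

```python
# A minimal sketch of online serving: a small HTTP endpoint around a trained model.
# FastAPI and joblib are illustrative choices; "model.pkl" is a placeholder artifact.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.pkl")  # assumes a scikit-learn-style model on disk

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features):
    return {"prediction": float(model.predict([features.values])[0])}

# Run with: uvicorn serve:app --port 8080  (assuming this file is named serve.py)
```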
Note: Join the Kolena community if you are interested in this type of content.
That’s the end of this long recap. Follow the TWIML podcast for their events in 2021! Let me know if any particular talk stands out to you. My future articles will continue covering lessons learned from conferences and summits in 2021 🎆