What I Learned From The Future of Data-Centric AI 2021

Source: https://future.snorkel.ai/

Last September, I attended Snorkel AI’s The Future of Data-Centric AI. This summit connects experts on data-centric AI from academia, research, and industry to explore the shift from a model-centric practice to a data-centric approach to building AI. There were talks discussing the challenges, solutions, and ideas to make AI practical, both now and in the future.

In this blog recap, I will dissect content from the conference’s session talks, covering a wide range of topics from weak supervision and fine-grained error analysis to MLOps design principles and data-centric AI case studies.

1 - The Principles of Data-Centric AI

So what is “Data-Centric AI?”

  • In the model-centric AI development world, the model is the focus of iterative development, while the data is a static asset collected “before AI development.”

  • In the data-centric AI development world, models are increasingly standardized, push-button, and data-hungry. As a result, the training data becomes the key differentiator and the focus of your iterative development.

  • Successful AI development requires iteration on both the model and the data. This relative shift to data being the core bottleneck fundamentally changes how we develop and deploy AI.

Alex Ratner outlined the three principles of Data-Centric AI:

  1. AI development today centers around data.

  2. Data-centric AI needs to be programmatic.

  3. Data-centric AI needs to include subject matter experts in the loop.

1 - AI development today centers around data

Again, in model-centric ML development, you spend most of the time iterating on tasks such as feature engineering (selecting specific features of the data that the model can learn from), model architecture design (designing the structure and parameters of the model), and training algorithm design (choosing the right training paradigm for your model). These tasks are still the subjects of vast amounts of research. However, a few key trends have emerged:

  1. The major shift to more powerful, automated, but also data-hungry models: Today’s deep learning models are more powerful and push-button, but less directly modifiable and far more data-hungry.

  2. These model architectures are increasingly convergent and commoditized: Models today are far more accessible but far less practically modifiable.

  3. And they are increasingly data-hungry: The training data (volume, quality, management, distribution, etc.) is increasingly the arbiter of success.

As a result of these trends, ML development today has become data-centric. You spend most of your time on the operations related to the training data (collection, labeling, augmentation, slicing, management, etc.). Thus, data is the crucial bottleneck and interface to developing AI applications today.

2 - Data-centric AI needs to be programmatic

Source: The Principles of Data-Centric AI (Alex Ratner)

AI today is blocked by the need to label, curate, and manage training data. In some settings, these tasks are tackled by employing armies of human annotators. But for most other real-world use cases, it often takes entire quarters or years to label and manage the data to be ready for ML development. Real-world use cases require subject matter expertise, often have private data, and consist of rapidly changing objectives. Therefore, manual labeling and curation is often a non-starter for these use cases, even in large organizations.

Furthermore, many ethical and governance challenges are exacerbated by AI approaches relying on manual labeling:

  1. How to inspect or correct biases?

  2. How to govern or audit thousands or millions of hand-labeled data points?

  3. How to trace the lineage of model errors originating from incorrect labeling?

Solving these critical challenges with large, manually-labeled training datasets is a practical nightmare for organizations today. An alternative approach is Programmatic Labeling, which enables users to programmatically label, build, and manage the training data. This approach underlies Snorkel, a 5+ year research project at the Stanford AI Lab resulting in 50+ publications and many production deployments.
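To make this concrete, here is a minimal sketch of programmatic labeling using the open-source Snorkel library (the public package that grew out of the research project, not Snorkel Flow itself). The complaint-detection task, keyword heuristics, and toy data are hypothetical:

```python
# Hedged sketch: programmatic labeling with the open-source Snorkel library.
# The complaint-detection task, keywords, and toy data are illustrative only.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_COMPLAINT, COMPLAINT = -1, 0, 1

@labeling_function()
def lf_mentions_refund(x):
    # Heuristic: refund requests are usually complaints.
    return COMPLAINT if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_says_thanks(x):
    # Heuristic: thank-you notes are rarely complaints.
    return NOT_COMPLAINT if "thank" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "I want a refund immediately",
    "Thanks for the quick help!",
    "Where is my order?",
]})

lfs = [lf_mentions_refund, lf_says_thanks]
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)   # (n_examples, n_lfs) votes, -1 = abstain
print(LFAnalysis(L=L_train, lfs=lfs).lf_summary())      # coverage / overlap / conflict stats

# Denoise and combine the votes into probabilistic labels without ground truth.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L_train)        # soft training labels for any model
```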

Source: The Principles of Data-Centric AI (Alex Ratner)

At a high level, Snorkel Flow is an instantiation of data-centric AI development, which is all about rapid iteration centered on modifying and labeling the data. There are four basic steps that Snorkel Flow supports:

  1. Label and Build: You label, augment, structure, and build training data programmatically. You can do this via Snorkel Studio (a no-code UI for subject matter experts with hosted notebooks) or the Python SDK. Snorkel Flow essentially serves as an abstraction “Supervision Middleware” layer for taking in all sorts of signal types (patterns, Boolean searches, heuristics, database lookups, legacy systems, 3rd-party models, crowd labels) and creating labeling functions out of them.

  2. Integrate and Manage: You model programmatic inputs as “weak supervision” with theoretical guarantees. This is important because your labeling functions are likely to be noisy. Snorkel Flow lets you clean up these functions by automatically figuring out how to re-weight and combine them without separate ground truth. This capability has been built upon many years of theoretical work in this area.

  3. Train and Deploy: You train push-button, state-of-the-art models from Snorkel Flow’s built-in model zoo (or custom models via Python SDK). These data-hungry models can take some amount of the labeled data and generalize beyond the programmatic labels. Two key benefits are the ability to (1) bridge direct expert specification with ML generalization and (2) scale with unlabeled data.

  4. Analyze and Monitor: You close the loop by rapidly identifying and addressing key error modes in the data and models to adapt and improve. You can monitor performance drifts in labeling functions or the model. As a result, your model can rapidly adapt to changes in the data or business objectives without the need for data labeling from scratch.

Source: The Principles of Data-Centric AI (Alex Ratner)

Data-Centric AI is much more than just labeling:

  1. It uses labeling functions to provide labels to unlabeled examples using domain heuristics.

  2. It uses transformation functions to augment data with per-example transformations.

  3. It uses slicing functions to partition the data and specify where the model should add more capacity.
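The open-source Snorkel library exposes the latter two function types as well. A hedged sketch follows; the specific transformation, slice definition, and toy data are illustrative only:

```python
# Hedged sketch: transformation and slicing functions with the open-source Snorkel
# library. The transformation, slice definition, and toy data are illustrative.
import pandas as pd
from snorkel.augmentation import transformation_function, RandomPolicy, PandasTFApplier
from snorkel.slicing import slicing_function, PandasSFApplier

@transformation_function()
def tf_soften_exclamations(x):
    # Per-example augmentation: replace exclamation marks with periods.
    x.text = x.text.replace("!", ".")
    return x

@slicing_function()
def sf_short_text(x):
    # Slice: very short documents, where the model may need extra capacity or scrutiny.
    return len(x.text.split()) < 5

df = pd.DataFrame({"text": [
    "Great service!",
    "The delivery was late and the item arrived damaged.",
]})

policy = RandomPolicy(1, sequence_length=1, n_per_original=1, keep_original=True)
df_augmented = PandasTFApplier([tf_soften_exclamations], policy).apply(df)

slice_membership = PandasSFApplier([sf_short_text]).apply(df)  # per-slice 0/1 membership
print(len(df_augmented), slice_membership)
```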

3 - Data-centric AI needs to be collaborative with the SMEs

In the old model-centric development way (“Throw it over the wall”), the decoupling of the subject matter experts and the data scientists is impractical and dangerous. Data-centric AI gives SMEs and data scientists a common ground to collaborate on. Snorkel Flow enables a synchronous workflow by directly injecting what the SMEs know into the model and leveraging already-codified expert knowledge as programmatic supervision.

2 - Learning with Imperfect Labels and Visual Data

If you think about real-world data, there are many challenges: domain gap, data bias, and data noise (which can be due to long-tail distributions, occlusions, clutter, ambiguity, or multi-sensor inputs). As a result, real-world labels are imperfect due to inexact (indirect), incomplete (limited), or inaccurate (noisy) supervision. How can we deal with versatile, multimodal, sequential, sparse, or interactive labels?

NVIDIA Research thinks about three aspects to answer the previous question:

  1. Incomplete supervision - the ability to improve generalization by continuously learning new data.

  2. Inexact supervision - the ability to leverage diverse forms of weak supervision or self-improving on unlabeled data.

  3. Inaccurate supervision - the ability to overcome the uncertainty due to the lack of data points under partial supervision.

Source: Learning with Imperfect Labels and Visual Data (Anima Anandkumar)

Indeed, many techniques have been designed to get around this problem of imperfect labels and the lack of enough labeled data: self-supervision, architecture design, regularization, inductive biases, domain priors, and synthetic data. Anima Anandkumar brought up several relevant research works that she and her colleagues have pursued in these areas.

Source: https://arxiv.org/abs/2105.06464

DiscoBox is a unified framework for joint weakly supervised instance segmentation and semantic correspondence using bounding box supervision. It is a self-ensembling framework, where a teacher is designed to promote structured inductive bias and establish correspondence across objects. This framework enables joint exploitation of both intra- and cross-image self-supervisions, leading to significantly improved task performance. They achieved state-of-the-art performance in both instance segmentation and semantic correspondence benchmarks. Such capabilities can scale up and benefit many downstream 2D and 3D vision tasks.

Source: https://arxiv.org/abs/2010.05784

In the context of adapting synthetic data to the real world, reliable uncertainty estimation under domain shift has been an important research direction. Several measures have been proposed, such as temperature scaling, angular distance, and Bayesian deep learning. Distributionally robust learning is a novel framework for calibrated uncertainties under domain shift. This framework is end-to-end differentiable for training at scale, thanks to the introduction of an additional binary domain classifier network that learns to predict density ratios between the source and target domains. The estimated density ratio reflects the relative distance of an instance from both domains. This ratio is correlated with human selection frequency, which can be regarded as a proxy for human uncertainty perception. This framework achieved SOTA results in self-training domain adaptation.
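The density-ratio trick at the heart of this approach is simple to sketch (an illustrative toy example, not the paper's code): train a binary domain classifier on a balanced mix of source and target samples and convert its probabilities into ratios.

```python
# Hedged sketch of density-ratio estimation with a binary domain classifier:
# if c(x) ~= P(target | x) on a balanced source/target mix, then
# p_target(x) / p_source(x) ~= c(x) / (1 - c(x)). Toy Gaussian data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_source = rng.normal(loc=0.0, size=(1000, 2))   # e.g., synthetic-domain features
X_target = rng.normal(loc=1.0, size=(1000, 2))   # e.g., real-domain features

X = np.vstack([X_source, X_target])
d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])  # 0 = source, 1 = target

domain_clf = LogisticRegression().fit(X, d)
p_target = domain_clf.predict_proba(X_source)[:, 1]
density_ratio = p_target / (1.0 - p_target)   # higher => the source sample looks more target-like
print(density_ratio[:5])
```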

Source: https://arxiv.org/abs/2104.02290

Training on synthetic data can be beneficial for label or data-scarce scenarios. However, synthetically trained models often suffer from poor generalization in real domains due to domain gaps. Contrastive Syn-to-Real Generalization is a novel framework that simultaneously regularizes the synthetically trained representation while promoting the diversity of the learned representation to improve generalization. Benchmark results on various synthetic training tasks show that CSG considerably improves the generalization performance without seeing the target data.

Source: https://arxiv.org/abs/2105.15203

Finally, SegFormer is a simple yet powerful semantic segmentation framework that unifies Transformers with lightweight multi-layer perceptron (MLP) decoders. SegFormer has two appealing features:

  1. SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes, which leads to decreased performance when the testing resolution differs from training.

  2. SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, combining both local and global attention to render powerful representations.

SegFormer achieves new state-of-the-art results on standard datasets and shows strong zero-shot robustness.

Anima concluded the talk with these takeaways:

  • Mask annotations can probably be removed entirely in the future, even for segmentation problems.

  • Auto-labeling is promising for many dense prediction tasks.

  • The emergence of Transformers will probably add even more to the above directions.

  • Synthetic data can be seamlessly adapted to real-world tasks.

3 - Uncovering the Unknowns of Deep Neural Networks: Challenges and Opportunities

Deep neural networks do not necessarily know what they don’t know in an open-world environment. As a result, how can we build unknown-aware deep learning models for reliable decision-making in the open world?

An important step towards this goal is to detect out-of-distribution data. This is indeed a tricky problem:

  1. There is a lack of supervision from unknowns during training. Typically, the model is trained only on the in-distribution data, using empirical risk minimization. As you move to higher-dimensional space, the space of unknowns expands - making it impossible to anticipate out-of-distribution data in advance.

  2. High-capacity neural networks exacerbate over-confident predictions. Loose decision boundaries can undesirably include out-of-distribution data. Furthermore, density estimation using deep generative models can be challenging to optimize.

Hendrycks & Gimpel (2017) proposed a baseline method to detect out-of-distribution (OOD) examples without further re-training the network. The method is based on the observation that a well-trained neural network assigns higher softmax scores to in-distribution examples than to out-of-distribution examples. Sharon Li presented three major works that extend this baseline (a small sketch of the softmax and energy scores follows the list):

  1. ODIN is an out-of-distribution image detector that does not require re-training the neural network and is easily implementable on any modern neural architecture. It is built on temperature scaling and input perturbation (to separate the softmax scores between in- and out-of-distribution images). When tested on SOTA architectures like DenseNet and WideResNet under a diverse set of in- and out-of-distribution dataset pairs, ODIN significantly improved detection performance and outperformed the baseline method by a large margin.

  2. Energy-based OOD Detection is a unified framework for out-of-distribution detection that uses an energy score. Energy scores distinguish in- and out-of-distribution samples better than the traditional approach using softmax scores. Unlike softmax confidence scores, energy scores are theoretically aligned with the probability density of the inputs and are less susceptible to the overconfidence issue. Within this framework, energy can be flexibly used as a scoring function for any pre-trained neural classifier and a trainable cost function to shape the energy surface explicitly for OOD detection.

  3. Energy-Based OOD Detection for Multi-Label Classification proposes SumEnergy, a simple and effective method to estimate the OOD indicator scores by aggregating energy scores from multiple labels. It can be mathematically interpreted from a joint likelihood perspective. The results show consistent improvement over previous methods based on the maximum-valued scores, which fail to capture joint information from multiple labels.
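For intuition, here is a small sketch of the two simplest scores above, computed directly from a classifier's logits; the temperature and the toy logits are illustrative, and in practice the threshold would be calibrated on validation data:

```python
# Hedged sketch: max-softmax baseline (Hendrycks & Gimpel) vs. the energy score
# -E(x) = T * logsumexp(logits / T). Temperature and toy logits are illustrative.
import torch
import torch.nn.functional as F

def max_softmax_score(logits: torch.Tensor) -> torch.Tensor:
    # Higher score => more likely in-distribution.
    return F.softmax(logits, dim=-1).max(dim=-1).values

def negative_energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    # Higher (negative energy) => more likely in-distribution.
    return T * torch.logsumexp(logits / T, dim=-1)

logits = torch.tensor([[6.0, 1.0, 0.5],    # peaked logits: plausibly in-distribution
                       [1.1, 1.0, 0.9]])   # flat logits: plausibly out-of-distribution
print(max_softmax_score(logits))
print(negative_energy_score(logits))
# An input is flagged as OOD when its score falls below a validation-calibrated threshold.
```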

Source: Uncovering the Unknowns of Deep Neural Networks: Challenges and Opportunities (Sharon Li)

Sharon then provided some parting thoughts on the state of out-of-distribution detection research:

  1. Post-hoc methods aren’t sufficient to fundamentally mitigate the problem. Existing learning algorithms are primarily driven by optimizing accuracy on the in-distribution (ID) data, resulting in a decision boundary ill-suited for OOD detection. Therefore, we need training-time regularization that explicitly accounts for uncertainty beyond the ID data. One way to achieve this is Energy Regularized Learning - which has dual objectives during training time (one for ID classification using a standard cross-entropy loss and one for OOD detection using a squared hinge loss).

  2. There are great opportunities for learning algorithm design. For example, we need algorithms for learning a more compact decision boundary between ID and OOD data.

  3. There are also great opportunities for a more realistic data model. The current evaluation is too simplified to capture real-world OOD data. Previous work has relied on small-scale, low-resolution datasets (10-100 classes). As it turns out, OOD detection performance decreases as the number of labels increases. Sharon’s current work, MOS: Towards Scaling OOD Detection to Large Semantic Space, is a group-based OOD detection framework effective for large-scale image classification. Their key idea is to decompose the sizeable semantic space into smaller groups with similar concepts, simplifying the decision boundary and reducing the uncertainty space between in- vs. out-of-distribution data.

Source: Uncovering the Unknowns of Deep Neural Networks: Challenges and Opportunities (Sharon Li)

Another interesting research question is how to formalize the notion of OOD data. It’s important to explicitly model the data shifts (between training and testing environments) by considering the invariant and the environmental features.

4 - Our Journey to Data-Centric AI

In modern AI, data encodes our knowledge of the problem - often the primary encoding of domain knowledge. The winding road to get to this point involves several milestones, as told by Chris Re:

  • In 2016, we were in the age of new deep models. Chris and his Stanford students began Snorkel as a project in the Stanford AI Lab to build a data-centric AI framework that could be applied to a broader set of problems than were practical for model-centric approaches with static datasets.

  • In 2017, Apple acquired Chris’s startup Lattice (based on his DeepDive work, which powered AI services spanning search, knowledge, platform, and device applications). During this time, Chris observed that most AI teams had developed “new-model-itis.” People focused a ton of energy on constantly writing new models, but they often failed to fully encode their problems into their AI systems because of their relative neglect of the data. Yet, a few teams at that time were starting to shift more of their focus toward a data-centric approach, and they began to generate a lot of success in doing so. In contrast to the model-centric approach, those teams who examined the encoding of the problem more deeply—who scrutinized, measured, and audited data as their primary focus—were building better, more efficient AI applications.

  • In 2018, Snorkel derivatives were already in production at companies like Google, Apple, Intel, and many more. These companies took some of the data-centric ideas from our research and implemented them for Gmail, YouTube, and Google Ads, to name a few. And this glimpse of the exciting possibilities inherent in a data-centric approach drove us to look even deeper into the ways it could be implemented at a larger scale and for broader applications.

  • From 2018 to 2020, he witnessed several new data-centric AI startups get well off the ground. SambaNova, for example, sees the future of AI in the data flow, which changes how you build systems in fundamental ways. Inductiv, recently acquired by Apple, uses AI to clean data and prepare it before deploying it for applications. Snorkel AI has created a platform that can manage the data and the knowledge to build AI applications.

As a result of these observations, Chris believed that data-centric AI is an approach to building the foundations of long-lived AI systems.

The Technical Perspective

An AI application requires three fundamental, interconnected parts: a model, a dataset, and hardware.

  • Models have made tremendous progress in recent years, and model-centric AI is great for many discrete applications. Models are now often packaged, downloadable commodities, ready for use by anyone.

  • Hardware’s progress has been on a similar trajectory, driven by cloud infrastructure and specialized accelerators.

  • But training data has not reached this level of practical utility. And that’s because it cannot be a broadly useful commodity. Training data is specific to your project within your organization. It encodes your specific problem, so it isn’t traditionally relevant outside your context. Because of that specificity, model-centric AI has not been very dynamic or expandable so far.

It’s hard to make training data a downloadable commodity because, unlike in the artificial environment of a classroom or research lab where data comes in a static, ready-to-use set, in the real world it emerges from an extremely messy and noisy process and is bespoke to your specific application. Model differences have traditionally been overrated, while data differences are usually underrated when it comes to building better AI.

The key question becomes: What are the foundational techniques, both mathematical and abstractions, that allow us to get better, more useful training data and get it more quickly?

Training signal and data augmentation are key to pushing SoTA

Chris mentioned a real-world example of the data-centric approach coming out of a 2019 collaboration Snorkel did with Stanford Medicine on using AI to classify chest x-rays. They spent a year on the project creating large datasets of clinical labels, evaluating the effect of label quality, and working on publishing their results in a peer-reviewed clinical journal. What they learned using already-available models was that no matter which model they ran the data through, it only resulted in two- or three-point differences in accuracy. What mattered for their results to a much greater degree than the choice of model was the quality and quantity of the data labeling. Dramatically improving training signal and data augmentation is one key to pushing the state of AI forward.

Looking across benchmark data, using the right set of data augmentations is a relatively unexplored avenue for getting greatly improved accuracy out of almost any model you might choose:

  • Google’s AutoAugment, the first learned-augmentation paper, uses learned data-augmentation policies.

  • Sharon Y. Li helped train a model at Facebook in which she used weakly-supervised data to help build the most accurate, state-of-the-art model on the ImageNet benchmark.

But there’s a well-understood yet still tremendous challenge to overcome here too, which is that traditionally, manual data-labeling is expensive, tedious, and static.

Training data is the new bottleneck

Training and labeling data by hand is slow and very costly. Data’s quality and quantity are often based on how many humans—usually subject matter experts—you can throw at the process and how long they can work. Static data sets mean you might start with manual labels that turn out to be impractical for the model you are trying to build. To use a straightforward example, maybe you began by using “positive” and “negative” labels, but it later becomes apparent you need a “neutral” as well. Re-labeling all that data means you have to throw out all your previous work and start over.

Snorkel’s approach to this problem is programmatic labeling. The key idea is that if you can write code that handles data labeling for you, building an effective model will be much faster and cheaper because you can move at the speed of a machine. That speed also allows you to treat the data with software and engineering tools that allow for dynamic data sets that can be reconfigured on the fly to be redeployed much more quickly. But there’s a tradeoff with all this: programmatic labels are generally very noisy.

Source: Our Journey to Data-Centric AI (Chris Re)

Using weak supervision for data labeling is not a new idea. Pattern matching, distant supervision, augmentation, topic models, third-party models, and crowdsourcing are well-established ways of getting large amounts of lower-quality feedback data. But even this weak-supervision data was still being applied in isolated ways, and that ad-hoc application limited the progress that could be made with this labeling method. The Snorkel project’s original goal was to replace this ad hoc weak-supervision data with a formal, unified, theoretically grounded approach for programmatic data labeling. Snorkel as a company has expanded this perspective in an array of different directions that now encompass the whole workflow for AI, but the formalization of programmatic data labeling is where things started.

The Classical Snorkel Pipeline

Snorkel’s “classical” flow goes like this:

  1. Users write labeling functions to generate noisy labels.

  2. Snorkel models and combines those noisy labels into probabilities.

  3. Snorkel uses probability theory to optimally combine all of the information about the sources and the functions to generate probabilistic training data that can then be fed into any deep learning model we want.
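As a rough illustration of step 3, probabilistic labels can be fed to an ordinary classifier with a soft cross-entropy loss; the model and stand-in data below are placeholders, not Snorkel's implementation:

```python
# Hedged sketch: training an ordinary classifier on probabilistic (soft) labels,
# e.g. label-model outputs, via soft cross-entropy. Features and model are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

n, d, k = 256, 32, 2
X = torch.randn(n, d)                                    # stand-in features
soft_labels = torch.softmax(torch.randn(n, k), dim=-1)   # stand-in probabilistic labels

model = nn.Linear(d, k)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    log_probs = F.log_softmax(model(X), dim=-1)
    loss = -(soft_labels * log_probs).sum(dim=-1).mean()  # cross-entropy vs. soft targets
    loss.backward()
    opt.step()
```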

Source: Our Journey to Data-Centric AI (Chris Re)

Snorkel’s work shows that we can modify virtually any state-of-the-art model to accept this probabilistic data and improve its performance. The key idea is that at no point in this flow do we require any hand-labeling of the data. Developers do not hand off data to be labeled and returned at some future date. Rather, the developer is integral to the labeling process. So, the question becomes: if we eliminate the hand-labeled-data bottleneck, how far and how fast can we go?

  • At this point, Snorkel has published literally dozens of papers showing that we can learn the structure of, and do estimation for, what are called latent graphical models and get new results. In some cases, our results improve on the standard supervised data that the industry has been using for decades.

  • Since as early as 2018, Snorkel’s approach to data-centric AI has been applied in many places across the industry. And not just in corporate research papers, but in changes to real production systems that you have probably already used from places like Gmail, Apple, and YouTube. While these ideas are still being refined, what we see right now is that many industrial systems are using enormous amounts of weak supervision and doing so not as an afterthought but in ways that fundamentally change how people build their systems and change the iteration cycle. It is exhilarating because it speaks to how valuable and useful Snorkel’s ideas can be for the industry’s progress, given further work and development.

Final Comments

Chris concluded the talk with these quick comments:

  • The ideas pioneered in Snorkel (the research project) are already around you. That speaks to the utility of this viewpoint, not how great the execution was necessarily.

  • The core of Snorkel AI (as a company) is the incorporation of this data-centric perspective into the entire workflow for AI. We use it to monitor what we are doing over time, understand the quality of workflows, get SMEs on board, and take an organization’s existing domain expertise and bring it to bear on the problem at hand.

  • Data-centric AI allows many users to encode domain knowledge.

To learn more, Chris recommends reviewing the Data-Centric AI GitHub repository - a community initiative from Stanford to think about this movement’s foundational theoretical, algorithmic, and practical advantages. You can also read the article “What Data-Centric AI is Not” on the Hazy Research blog - which argues that AI is closer to data management than software engineering.

5 - Machine Programming and the Future of Data-Driven Software Development

Machine Programming (MP) is the automation of software and hardware development. Machine Programming Research (MPR) is a new pioneering research initiative at Intel Labs, broken down into two core tenets (as brought up by Justin Gottschlich):

  1. Time: reducing the development time of all aspects of software development (measured as 1000x+ improvement over human work performed today).

  2. Quality: building better software than the best human programmers can (measured as superhuman correctness, performance, security, etc.).

The three pillars of MP are (i) intention, (ii) invention, and (iii) adaptation. Intention is the ability of the machine to understand the programmer’s goals through more natural forms of interaction, which is critical to reducing the complexity of writing software. Invention is the ability of the machine to discover how to accomplish such goals, whether by devising new algorithms or new abstractions from which such algorithms can be built. Adaptation is the ability to autonomously evolve software, whether to execute it efficiently on new or existing platforms or to fix errors and address vulnerabilities.

Source: Machine Programming and the Future of Data-Driven Software Development (Justin Gottschlich)

As the figure above suggests, data is the principal driver for all three MP systems. The data required by them comes in various forms but is ever-present. This dependency on data makes it essential to consider the open problems and emerging uses around data when reasoning about MP and the systems that implement it.

When we separate intention from invention and adaptation, we prevent the programmer from over-specifying details that the machine might mistake for part of the program’s semantics. This separation will give rise to Intentional Programming Languages, which might be beneficial in many ways:

  1. Improving productivity by requiring users to only supply core ideas.

  2. Freeing up the machine to explore a wider range of possible solutions more thoroughly.

  3. Enabling automatic software adaptation and evolution.

Source: Machine Programming and the Future of Data-Driven Software Development (Justin Gottschlich)

It’s very important to understand that MP is not a rebranding of ML. It’s also not ML for code. Both the stochastic and the deterministic sides make up the bifurcated space of MP. While stochastic MP systems use ML techniques (neural networks, genetic algorithms, reinforcement learning, etc.), deterministic MP systems use formal methods (formal verifiers, spatial and temporal logics, formal program synthesizers, etc.). Historically in the space of program synthesis, purely stochastic systems can’t necessarily guarantee correctness the way deterministic ones can.

Source: Machine Programming and the Future of Data-Driven Software Development (Justin Gottschlich)

There are numerous MP-related efforts at Intel. Justin’s team focuses on debugging, profiling, and productivity. A notable project that they worked on last year is ControlFlag. This self-supervised MP system aims to improve debugging by detecting idiosyncratic pattern violations in software control structures. Violations of programming patterns can be thought of as syntactically-valid code snippets that deviate from typical usage of the underlying code constructs.

The underlying motivation behind this work is that developers continue to spend a disproportionate amount of time fixing bugs rather than coding. Debugging is expected to take an even bigger toll on developers and the industry. As we progress into an era of heterogeneous architectures, the software required to manage these systems becomes increasingly complex, creating a higher likelihood of bugs. In addition, it is becoming difficult to find software programmers who have the expertise to correctly, efficiently, and securely program across diverse hardware, which introduces another opportunity for new and harder-to-spot errors in code.

Source: Machine Programming and the Future of Data-Driven Software Development (Justin Gottschlich)

The fundamental approach ControlFlag takes is to recast typographical errors as anomalies. A self-supervised system trained on a large enough corpus of semi-trusted code will automatically learn which idiosyncratic patterns are acceptable (and which are not). When fully realized, ControlFlag could help automate the tedious parts of software development, such as testing, monitoring, and debugging. This would not only enable developers to do their jobs more efficiently and free up more time for creativity, but it would also address one of the biggest price tags in software development today.
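As a rough, much-simplified illustration of the underlying idea (not Intel's implementation, and on Python rather than C code), one can mine how control-structure conditions typically look in a trusted corpus and flag patterns that rarely occur:

```python
# Hedged, much-simplified sketch of the ControlFlag idea: mine common
# control-structure patterns from a trusted corpus, then flag rare ones.
import ast
from collections import Counter

def condition_patterns(source: str):
    """Yield a coarse pattern string for every `if` condition in the source."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.If):
            yield type(node.test).__name__   # e.g. 'Compare', 'Name', 'Call'

trusted_corpus = [
    "if x == 7:\n    pass",
    "if is_ready():\n    pass",
    "if count > 0:\n    pass",
]
counts = Counter(p for src in trusted_corpus for p in condition_patterns(src))

def flag_unusual(source: str, min_count: int = 1):
    # Report condition patterns that (almost) never occur in the trusted corpus.
    return [p for p in condition_patterns(source) if counts[p] <= min_count]

print(flag_unusual("if some_flag:\n    pass"))   # 'Name' is rare in this tiny corpus => flagged
```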

6 - The Data-Centric AI Approach

For many years, the conventional approach to AI was that AI systems need code plus data. Most people would download datasets and work on the code. Thanks to this development paradigm, the code is basically a solved problem for many modern AI applications. Therefore, it’s now more useful to find tools, processes, and principles to systematically engineer the data to make the AI systems work.

With the evolution of any new technology, we typically go through three steps. First, a handful of experts apply it intuitively. Then, principles become widespread, and many apply them. Eventually, tools emerge to make the application of this new set of principles and ideas more systematic. Andrew Ng believes that we are currently in the second phase of the data-centric AI movement and shared his top five tips for data-centric AI development:

  1. Make the labels y consistent: In an ideal world, there is some deterministic (non-random) function mapping from inputs x to outputs y, and the labels are consistent with this function.

  2. Use multiple labelers to spot inconsistencies: Some of the most common examples of inconsistencies in computer vision are label name, bounding box size, and the number of bounding boxes. Having multiple labelers would help increase labeling consistency.

  3. Repeatedly clarify labeling instructions by tracking ambiguous examples: You should repeatedly find examples where the label is ambiguous or inconsistent, decide how they should be labeled, and document that decision in your labeling instructions. Labeling instructions should be illustrated with examples of the concept, examples of borderline cases and near-misses, and any other confusing examples.

  4. Toss out noisy examples. More data is not always better: Many ML teams are used to being given a dataset and working on it religiously. Tossing out bad examples from that dataset might actually be useful.

  5. Use error analysis to focus on a subset of data to improve: Trying to improve data in a generic way is too vague. Instead, repeatedly use error analysis to decide what part of the learning algorithm’s performance needs improvement.
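A minimal sketch of the last tip, with placeholder data: group evaluation examples into slices and compare per-slice accuracy to decide where data fixes will pay off most.

```python
# Hedged sketch of slice-based error analysis: compare per-slice accuracy to
# decide which subset of data to improve first. Data and slices are placeholders.
import pandas as pd

eval_df = pd.DataFrame({
    "text":  ["ok", "refund now!!", "late again", "thanks", "REFUND", "great"],
    "label": [0, 1, 1, 0, 1, 0],
    "pred":  [0, 0, 1, 0, 0, 0],
})

# Define slices with simple predicates; in practice these encode domain knowledge.
slices = {
    "mentions_refund": eval_df.text.str.lower().str.contains("refund"),
    "short_text":      eval_df.text.str.split().str.len() <= 1,
}

for name, mask in slices.items():
    acc = (eval_df.pred[mask] == eval_df.label[mask]).mean()
    print(f"{name}: n={mask.sum()}, accuracy={acc:.2f}")
# Low-accuracy slices (here, 'mentions_refund') are where labeling/data fixes pay off most.
```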

Developing an AI system’s iterative workflow should involve repeatedly engineering the data. Getting the data right is not a “preprocessing” step that you do once. It’s part of the iterative process of model development, model deployment, model monitoring, and model maintenance. The recent NeurIPS Data-Centric AI Workshop has many papers that help cultivate the DCAI community into a vibrant interdisciplinary field that tackles practical data problems.

7 - MLOps: Towards DevOps for Data-Centric AI

Over the years, we have seen the continued democratization of ML. ML models are being applied to more applications and developed by more people with less CS/ML background but more domain knowledge. On the other hand, we have also seen the increased complexity of building and deploying ML applications:

  1. Hardware and accelerators are becoming prevalent and diverse.

  2. Data ecosystems are becoming more diverse and heterogeneous.

  3. ML techniques are rapidly advancing and becoming more complex.

  4. Requirements beyond accuracy (fairness, robustness, etc.) are increasingly imposed by regulation.

Today, it has never been easier to get an ML model working. But it is harder to answer whether that model is good or how to improve it.

In his talk, Ce Zhang brought up the question: “Can we provide users with systematic step-by-step guidelines for ML DevOps?” In other words, how can we construct ML systems and platforms to help users enforce them even if they are not experts?

Source: MLOps: Towards DevOps for Data-Centric AI (Ce Zhang)

What is the end-to-end process that gets users through the development and operational journey of ML applications? Given the input data, we first need to parse and augment them. Then, we conduct a feasibility study before running expensive ML tasks to understand whether the data quality is good enough. Then, we perform data cleaning, debugging, and acquisition to improve the data artifacts. Next, an AutoML system gives us a stream of ML models. Finally, we can perform continuous integration and continuous quality optimization, given the production data stream. 

There are two different cycles in the large picture above: the data cycle is the continuous improvement of data quality, while the model cycle is the continuous testing and adoption of the model driven by production data. Both cycles are data-centric in different ways. Ce walked through several research projects conducted at his lab to answer a couple of data-centric questions below:

Q1 - What’s the best accuracy that any ML model can achieve on my data?

Data problems are often entangled with model problems, and we need to provide signals for users to tease them apart. His approach to this problem is to provide the end-user with an automatic feasibility study: given a dataset and a target accuracy, the user receives a best-effort "belief" on whether the target is achievable before firing up expensive ML processes, much as real-world ML consultants do with their customers. Of course, such a "belief" will never be perfect, but providing this signal helps end-users better calibrate their expectations.

From the technical perspective, they estimate the Bayes error, a fundamental ML concept. In ease.ml/snoopy, they designed a simple yet effective Bayes error estimator enabled by the recent advancement of representation learning and the increasing availability of pre-trained feature embeddings. By consulting a range of estimators of the Bayes error and aggregating them in a theoretically justified way, ease.ml/snoopy suggests whether a predefined target accuracy is achievable.
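In a similar spirit (though not ease.ml/snoopy itself), a feasibility check can be sketched with the classical 1-NN bound on the Bayes error, computed over pretrained embeddings; the embeddings and target accuracy below are placeholders:

```python
# Hedged sketch (not ease.ml/snoopy) of a feasibility check: estimate the Bayes
# error from a 1-NN error on pretrained embeddings via the Cover-Hart bound
# for binary tasks. Embeddings, labels, and target accuracy are placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
emb = rng.normal(size=(2000, 64))                # stand-in: pretrained feature embeddings
y = (emb[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(emb, y, test_size=0.5, random_state=0)
nn_err = 1.0 - KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr).score(X_te, y_te)

# Cover & Hart (binary): R_1NN <= 2 * R* * (1 - R*)  =>  R* >= (1 - sqrt(1 - 2*R_1NN)) / 2
bayes_lower = (1.0 - np.sqrt(max(0.0, 1.0 - 2.0 * nn_err))) / 2.0

target_accuracy = 0.99
feasible = (1.0 - target_accuracy) >= bayes_lower
print(f"1-NN error={nn_err:.3f}, Bayes-error lower bound~{bayes_lower:.3f}, "
      f"target {target_accuracy:.0%} looks {'feasible' if feasible else 'unlikely'}")
```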

The project is under active development - tackling questions such as dealing with the case when training distribution is different from testing distribution and dealing with alternative metrics beyond accuracy (F1, AUC, fairness, robustness, etc.).

Q2 - What’s the most important data problem that I should fix in my training data?

Source: MLOps: Towards DevOps for Data-Centric AI (Ce Zhang)

There are two sub-questions to address here:

  1. Which data problem should you fix first to improve your accuracy/fairness/robustness?

  2. Which training example is to blame for your accuracy/fairness/robustness?

There has been much work using gradient-based methods to answer these two questions, but applying them to the real world has been challenging. Real-world ML applications have more feature extraction code than ML code. If you want to reason about a bad example, you have to look at both the feature extraction code and the ML code. We currently lack the fundamental understanding needed to connect decades of data management research with ML research for a joint analysis.

In ease.ml/DataScope, they focus on a specific case of data acquisition — given a pool of data examples, how do you choose which ones to include (and label, if they are unlabeled) to maximize the accuracy of ML models? In principle, they hope to pick out those data examples that are more “valuable” to the downstream ML models. They decide whether a data example is valuable by using the Shapley value, a well-established concept in game theory, and treating the accuracy of ML models as the utility. They use k-nearest neighbor classifiers as a proxy to calculate the exact Shapley value in nearly linear time and guide the user on which new data samples to acquire.
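For intuition, here is a hedged sketch of Shapley-based data valuation using generic Monte-Carlo permutation sampling with a KNN utility; this is not DataScope's nearly-linear-time KNN-specific algorithm, and the data are toy:

```python
# Hedged sketch: Shapley data valuation via Monte-Carlo permutation sampling,
# with a KNN classifier's validation accuracy as the utility. Toy data only.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(40, 5)); y_tr = (X_tr[:, 0] > 0).astype(int)
X_val = rng.normal(size=(200, 5)); y_val = (X_val[:, 0] > 0).astype(int)
y_tr[:3] = 1 - y_tr[:3]                          # inject a few wrong labels

def utility(idx):
    # Validation accuracy of a 3-NN classifier trained on the subset `idx`.
    if len(idx) < 3:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr[idx], y_tr[idx])
    return clf.score(X_val, y_val)

n = len(X_tr)
shapley = np.zeros(n)
n_permutations = 50
for _ in range(n_permutations):
    perm = rng.permutation(n)
    prev_u, prefix = 0.0, []
    for i in perm:
        prefix.append(i)
        u = utility(prefix)
        shapley[i] += (u - prev_u) / n_permutations   # marginal contribution of example i
        prev_u = u

print("lowest-value training examples:", np.argsort(shapley)[:5])  # mislabeled points tend to rank low
```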

Source: MLOps: Towards DevOps for Data-Centric AI (Ce Zhang)

They learn that for a dominating number of realistic feature extraction pipelines, they can precisely compute the entropy, expected prediction, and Shapley value for their proxy pipelines. Furthermore, these proxy pipelines work well in many scenarios. In particular, data examples with incorrect labels should have a small (often negative) Shapley value. However, there are cases in which these proxy pipelines do not work well, such as when over-represented examples exist.

Here are some outstanding questions that Ce and his team plan to address next:

  1. What is the right way to talk about real-world data pipelines?

  2. What is the right way to approximate fundamental quantities like Entropy, Expectation, Shapley Value, and Expected Marginal Improvement?

  3. How can we go beyond sensitivity-style metrics and measure group effects?

8 - Algorithms That Leverage Data from Other Tasks

In a typical paradigm, we have a dataset collected for a particular task. Then we train a model using that dataset. Ultimately, we evaluate that model on a separate test set. However, many challenges come up in this paradigm in the real world. Chelsea Finn highlighted two major ones:

  1. The dataset can be small or narrow, which isn’t sufficient for learning a model from scratch.

  2. Distribution shift might exist at evaluation time. As a result, the dataset (on which the model was trained) isn’t sufficient for the evaluation scenarios we care about.

What if we have prior datasets from other tasks or domains? Can we leverage these datasets in a way that allows us to improve the model that we are training for our target task?

  1. Can we train jointly on prior datasets? 

  2. Can we selectively train on prior datasets?

  3. Can we learn priors from the prior datasets?

Training Jointly

For the first question, Chelsea provided a robotics use case described in Chen et al., 2021. The task here is to allow a robot to detect whether or not it has successfully completed different tasks - we want to learn a classifier for robot success detection. Unfortunately, collecting data on real robots is expensive, so the available data consists of demonstrations of a robot performing a few tasks in one environment. Ultimately, we want a classifier that generalizes to many tasks and many environments.

Source: https://arxiv.org/abs/2103.16817

Instead of just using one dataset, we can leverage diverse human datasets (with videos of humans performing different tasks). During training (top), the agent learns a reward function from a small set of robot videos in one environment and a large set of in-the-wild human videos spanning many tasks and environments. At test time (bottom), the learned reward function is conditioned upon a task specification (a human video of the desired task). It produces a reward function which the robot can use to plan actions or learn a policy. By virtue of training on diverse human data, this reward function generalizes to unseen environments and tasks.

The proposed approach, Domain-Agnostic Video Discriminator (DVD), is a classifier that can predict whether two videos are completing the same task or not. By leveraging the activity labels that come with many human video datasets, along with a modest amount of robot demos, this model can capture the functional similarity between videos from drastically different visual domains. It is simple and can be readily scaled to large and diverse datasets, including heterogeneous datasets with both people and robots without any dependence on a close one-to-one mapping between the robot and human data. Once trained, DVD conditions on a human video as a demonstration, and the robot’s behavior as the other video, and outputs a score which is an effective measure of task success or reward.

Source: https://arxiv.org/abs/2103.16817

Their experiments showed that DVD could more effectively generalize to new environments and new tasks by leveraging human videos. DVD also enables the robot to generalize from a single human demonstration more effectively than prior work. DVD also can infer rewards from a human video on a real robot. The takeaway here is that joint training on diverse priors can substantially improve generalization.

Selectively Train

Now, what if we have a lot of prior data from many different tasks? Training on all of the data together may not be the best solution. For example, some tasks and some prior datasets may complement the target task, whereas, in other circumstances, the dataset may not be complementary and worsen the performance. The affinity of two different tasks depends on the size of the dataset, the model’s current knowledge, and other nuanced aspects of the optimization procedure (optimizer, learning rate, hyper-parameters, etc.).

The bad news is that there is no closed-form solution for measuring task affinity from task data. The good news is that we can approximate task affinities from a single training run. Chelsea provided a computer vision use case in Fifty et al., 2021 - which suggests a 4-step approach:

  1. Train all tasks together in a multi-task learning model.

  2. Compute inter-task affinity scores during training.

  3. Select multi-task networks that maximize the inter-task affinity score onto each serving-time task.

  4. Train the resulting networks and deploy to inference.

Source: https://arxiv.org/abs/2109.04617

Empirical findings indicate that this approach outperforms multi-task training augmentations and performs competitively with SOTA task grouping methods - while improving computational efficiency by over an order of magnitude. Further, their findings are supported by an extensive analysis that suggests inter-task affinity scores can find close to optimal auxiliary tasks, and in fact, implicitly measure generalization capability among tasks. The takeaway here is that we can automatically select task groupings from a single training run.
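A hedged sketch of the inter-task affinity idea (toy model and data, not the paper's setup): take a gradient step on the shared parameters with respect to task i's loss and measure how task j's loss changes.

```python
# Hedged sketch: inter-task affinity as 1 - L_j(after a shared-parameter step on
# task i) / L_j(before). Model, data, and step size are illustrative only.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
shared = nn.Linear(8, 16)
heads = {"task_a": nn.Linear(16, 2), "task_b": nn.Linear(16, 2)}
X = torch.randn(64, 8)
Y = {"task_a": torch.randint(0, 2, (64,)), "task_b": torch.randint(0, 2, (64,))}

def task_loss(shared_module, task):
    # Loss of `task` computed through the given shared trunk and the task's head.
    return F.cross_entropy(heads[task](torch.relu(shared_module(X))), Y[task])

def affinity(i, j, lr=0.1):
    # Positive value: a lookahead step on task i reduces task j's loss (i helps j).
    before = task_loss(shared, j).item()
    grads = torch.autograd.grad(task_loss(shared, i), list(shared.parameters()))
    stepped = copy.deepcopy(shared)
    with torch.no_grad():
        for p, g in zip(stepped.parameters(), grads):
            p -= lr * g
    after = task_loss(stepped, j).item()
    return 1.0 - after / before

print(affinity("task_a", "task_b"), affinity("task_a", "task_a"))
```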

Learn Priors

What if we are in a scenario in which joint training does not make sense? We would like to learn a prior from the datasets we can use when transferring to our target tasks. Chelsea provided an education use case.

In early 2021, Stanford offered Code-in-Place 2021 - a free Introduction to Computer Science course for more than 12,000 students from 150 different countries. A diagnostic exam was offered to help students gauge how well they understood the material. The problem is that the submissions to this diagnostic exam are open-ended Python code snippets. An estimated 8+ months of human labor would be needed to give all of these students feedback.

We want to train a model to infer student misconceptions (y) from the student solution (x) to a question in the feedback prediction task. This is essentially a multi-class binary classification problem. This is a hard problem for ML due to limited annotation (grading student work takes expertise and is very time-consuming), long-tailed distribution (students solve problems in many different ways), and the changing curriculums (instructors constantly edit assignments and exams, so student solutions and instructor feedback look different year to year).

The target task is to give feedback for the Code-in-Place course on new problems with a small amount of labeled data. They want to be able to use prior data for this task. The prior experience is ten years of feedback from Stanford midterms and finals. More specifically, the dataset has four final exams and four midterm exams from Stanford’s CS 106, including a total of 63 questions and close to 25,000 student solutions. Every student solution has feedback via a rubric.

The proposed architecture, ProtoTransformer, is a meta-learning framework for few-shot classification of sequence data, including programming code. Given a programming question, the ProtoTransformer Network is trained to predict feedback for student code using only a small set of annotated examples. Feedback categories are specified according to a rubric (e.g., “Incorrect syntax” or “Missing variable.”) Question and rubric descriptions are embedded with pre-trained SBERT. Student code is then encoded through stacked transformer layers conditioned on question and rubric embeddings. A “prototype” is the average code embedding for each class label. New examples are embedded and compared to each prototype.
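The prototype step itself is easy to sketch (the encoder, classes, and data below are stand-ins): each class prototype is the mean embedding of its labeled examples, and new examples are assigned to the nearest prototype.

```python
# Hedged sketch of prototypical-network classification: class prototypes are mean
# embeddings of labeled examples; queries go to the nearest prototype. Stand-in data.
import torch

torch.manual_seed(0)
encoder = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))

support_x = torch.randn(12, 32)                                  # few annotated examples ("support set")
support_y = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])   # e.g., rubric feedback classes
query_x = torch.randn(5, 32)                                     # new, unlabeled student solutions

with torch.no_grad():
    z_support, z_query = encoder(support_x), encoder(query_x)
    prototypes = torch.stack([z_support[support_y == c].mean(dim=0) for c in range(3)])
    distances = torch.cdist(z_query, prototypes)                 # (n_query, n_classes)
    predictions = distances.argmin(dim=1)                        # nearest prototype wins
print(predictions)
```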

Source: https://drive.google.com/file/d/1BPzSmk01mtLG8bVQxOzBUdqGqqu7Vk3R/view

Unfortunately, combining Transformers with prototypical networks “out-of-the-box” doesn’t work very well. This is because there is a limited amount of prior education data. They utilized several data-centric tricks to get past the small data size:

  1. Task Augmentation: They applied the “data augmentation” idea to coding tasks.

  2. Side Information: They added side information about each task (rubric option name and question text) into the embedding function. Then, they prepended that side information as the first token in the stacked transformer.

  3. Code Pretraining: They utilized large unlabeled datasets of code to help the model learn a good prior for code.

This system has been deployed at Code-in-Place 2021 to provide feedback to an end-of-course diagnostic assessment. The students’ reception to the feedback was overwhelmingly positive. Chelsea has contributed a blog post that discusses this project in more detail on the Stanford AI Lab blog. The takeaway here is that we can optimize for transferable representation spaces for few-shot transfer.

9 - Augmenting the Clinical Trial Design Process Utilizing Unstructured Protocol Data

In 2020, $198B was spent on global pharma R&D, with an average of $2.6B for each drug (significant spending to bring a product across the entire lifecycle from research to patient utilization). There are 18,852 drugs or treatments in the R&D pipeline, and this number has increased 300%+ since the start of the century.

A clinical trial protocol is a legal document describing the study plan for a clinical trial (e.g., objective, methodology, population, organization). A protocol document can vary in length (5-200+ pages) across time, therapeutic area, indication, phase, and geography. The protocol design can significantly impact key trial performance metrics such as operational complexity and patient/site friendliness/burden.

Clinical Trial Protocol Design Process

Michael DAndrea discussed the broad impact of using unstructured data for clinical trials:

  • There’s a lot of potentially useful information that is unstructured and blocked from usage in a clinical trial protocol. By unlocking this information for analysis and data-informed decision making, study design teams can increase recruitment of diverse patient populations, reduce trial times and costs, and reduce patient dropout from trials. Improving these key trial performance metrics can reduce the overall cost of developing drugs and increase the number of drugs that can be handled in the pipeline, resulting in, hopefully, more cures and treatments for society.

  • Using data-informed study design tools, we can better evaluate treatments on diverse representative populations and gauge the real-world efficacy (albeit with significant limitations).

Extracting this valuable unstructured data from clinical trial protocols comes with its fair share of challenges:

  1. There is a tremendous amount of variability and diversity across trials, ranging across countries, phases, indications, therapeutic areas, and time.

  2. There is a diversity of documents, spanning form types, author writing styles and dialects, and changes in treatment trends.

  3. There is also the issue of data standardization. This is a complex space with a wide variety of stakeholders with different incentives.

Michael’s team focused on two key unstructured data sections of the protocol: the sections on Inclusion/Exclusion (I/E) criteria and the Schedule of Assessments (SoA). The I/E criteria are a filtering checklist that describes conditions that make a patient eligible or ineligible to participate in a clinical trial. The SoA is an outline, usually in tabular format, that contains all the activities to be performed during a study. It describes the procedures that need to be performed to meet the clinical endpoints set, specifying their frequency and distribution within the study visits.

The Shift to a Data-Centric AI Approach

The CRISP-DM framework is a common way of structuring the lifecycle of a data science project. Michael’s team attempted a gamut of approaches largely built upon more sophisticated modeling techniques on large medical and corporate datasets. However, even the combination of NLP libraries, unsupervised learning techniques, and medical BERT-derived models was limited in its ability to deliver clinical business impact at scale. The challenge (as many experienced practitioners might know) is that many data scientists are excited about advanced modeling approaches and the latest research while keeping the data fixed and focusing less on approaches to augment the data.

The figure below demonstrates the differences between model-centric and data-centric approaches. With an aggregation of modeling methods, the model-centric approach could only derive count-based features with little clinical value. In contrast, the data-centric approach leveraged programmatic labeling with Snorkel Flow to yield structured clinically relevant features with strong clinical value.

Source: Augmenting the Clinical Trial Design Process Utilizing Unstructured Protocol Data (Michael DAndrea)

Data-Centric I/E Criteria

The design of a trial can benefit from the analysis of large numbers of other similar trials. The current method of doing this is manual and done on a small scale with certain biases reinforced due to personal familiarity with certain types of research indications. Manually sifting through hundreds of thousands of trials’ criteria and looking for patterns on large-scale relevant subsets for clinically relevant characteristics is especially challenging with the wide variety of medical terms, synonyms, and conditional values that exist. To find broad patterns that could be used for a variety of study design teams, they needed a way to extract structured, therapeutic area agnostic criteria from trials with high accuracy and in a scalable manner.

They initially extracted 21 CMS chronic condition entities out of the eligibility criteria of 340k+ clinical trials. They used Snorkel Flow to get the dataset and build the corresponding labeling models and pipelines for deployment. This data was the foundation for demographic tradeoff analysis for chronic conditions.

The ML task here is to extract chronic diseases as being part of I/E criteria for a given protocol (the input is an I/E criterion, while the output is the chronic disease). However, there are various challenges in dealing with I/E Data: false positives exist, multiple chronic conditions are within the same extracted span, and criteria in protocols don’t always make grammatical sense.

Source: Augmenting the Clinical Trial Design Process Utilizing Unstructured Protocol Data (Michael DAndrea)

Shown above is the data-centric pipeline they built to accomplish this ML task. The Snorkel Flow platform made this pipeline development almost a drag-and-drop experience. The pipeline runs over 340k+ protocols and gives them output for all 21 CMS chronic conditions. The results were very strong and generalizable on both the validation and test sets. They also used the results for demographic tradeoff analysis and study design tooling.

Data-Centric SoA

Analysis of the SoA procedures may be crucial in the trial design optimization. The first step in the analysis is to extract a procedure list from protocol documents. Then, the second step is to analyze and categorize the procedures into clinically relevant groups.

They first identified the SoA tables in a clinical trial protocol document and extracted procedure names. Then, they built a classification model to classify SoA procedures into eight classes using the Tufts Center definitions for estimating the participant burden of clinical trials. Like the I/E task, they also used Snorkel Flow to output a list of procedures with the Tufts categorization and supporting classification models. However, there exist various data-related challenges: the SoA procedure data is unstructured and not harmonized, acronyms and abbreviations are common, assignment of procedures to classes may not be trivial, and text that is similar from a linguistic point of view may correspond to different procedure types.

Source: Augmenting the Clinical Trial Design Process Utilizing Unstructured Protocol Data (Michael DAndrea)

Shown above is the final two-part data-centric processing pipeline that they built. The key point is the ability to build complex multi-stage solutions, allowing flexibility, iteration, and innovation. For the document text extraction task, they created the end-to-end pipeline for procedure extraction in 6 weeks and did not need to label all the training samples manually (only a small subset). For the procedure text classification task, they achieved strong test and validation results with fewer than 100 labeling functions. Finally, this output is helpful for estimating clinical trial burden and analyzing study design, and it contributes to the harmonization of data within the organization.

10 - Building AI Applications Collaboratively: A Case Study

AI applications are built in teams of subject matter experts (with a deep understanding of the domain), data scientists, and ML engineers (who work together to define the what and how of the application). Ideally, we want to collect as many signals as possible from every member of the team. Furthermore, AI applications need to adapt to changing requirements, which can only be done by collaboration on the same platform.

Improving the data improves the model. Because the domain experts know the data best, we must empower them with multiple ways of sharing their expertise. By doing so, we can accelerate the development of AI applications.

However, in a data-centric world, we are often blocked from labeling our data. Many teams can’t crowdsource their data: it may require a high level of expertise or come with strict privacy requirements. Data scientists find it hard to get enough time with subject matter experts. Frequently, they receive the labels but not the context and nuance behind them. The data scientists also want to know how much confidence (high or low) the annotators have in labeling the data.

Roshni Malani believes that today’s tooling is woefully inadequate. We need to unblock the data by maximizing knowledge transfer with higher-value labels that provide more context or information about the data. As a result, we may also get more accurate labels. We also need to produce labels at a higher velocity by creating lower-friction interfaces. As we shift our focus from iterating on the model to iterating on the data, we need to bring the power of modern collaboration tooling (real-time collaboration features with version control) to training datasets.

Source: Building AI Applications Collaboratively: A Case Study (Roshni Malani)

Traditionally, data scientists and domain experts operate in silos with no common means or tooling for collaboration. Labeling data has been a manual and tedious process, often costly and done only once. Thus, the data scientists are constrained to iterating on the model, where only marginal improvements have been observed.

In a data-centric world, we maximize collaboration by creating another iteration loop focused primarily on the data. We can do this by creating multiple ways to capture expertise on the same platform. Snorkel Flow introduces a new way to capture expertise called programmatic labeling, based on weak supervision techniques pioneered by Snorkel AI’s founders during their time at the Stanford AI Lab. The core collaborative loop operates as follows:

  1. After labeling some data and training a model, you analyze the performance gaps. You identify gaps that allow you to create more targeted and relevant labeling assignments (especially for corrective or iterative work). This is an active learning approach with a human in the loop.

  2. Next, you iterate on these gaps collaboratively by gathering all information you can to label even more data - with which you can again train a model and analyze it.

  3. You repeat the same iterative loop after deploying your model and monitoring a slice of production data.

As a result of this collaborative loop focused on the data, you will observe an order of magnitude improvement in your AI application.

Source: Building AI Applications Collaboratively: A Case Study (Roshni Malani)

Within Snorkel Flow, there are multiple ways to capture domain expertise. The platform captures expertise at whatever level is available (including comments, tags, labels, patterns, and code). This diversity of input allows subject matter experts to share rich information about the data with the data scientists, thereby multiplying their value. The platform’s integrations reduce friction and allow faster iteration and higher velocity (full context, easily shared links, real-time progress, version control).

While discussing a hypothetical case study of a loan classification application, Roshni outlined the three scenarios for deeper collaboration on data:

  1. Label understanding changes: The annotation instructions may not be understood completely. With Snorkel Flow, you can filter specific data points and create a targeted annotation batch. These are unknown unknowns that you can iterate on. After this step, you can either refine existing labeling functions or create new ones and validate the model after training it with the new hand labels.

  2. Label schema changes: This includes adding a label, removing a label, splitting a label into multiple labels, or merging labels into a single one. With Snorkel Flow, you can request additional rich metadata from annotators that informs your labeling functions.

  3. Data drift: If label distribution changes over time, you want to monitor slices of production data. With Snorkel Flow, the subject matter experts can identify patterns in the data and encode them using low-code interfaces. Thus, the quality and speed of data labeling increase over time as you iterate on your AI application collaboratively.

In brief, AI application development happens in dynamic environments, requires iterative improvements (on both data and model), and is powered by teams of domain experts and data scientists collaborating. The agility to adapt ML models comes when subject matter experts can encapsulate their expertise in multiple ways in an integrated, seamless manner (that is, collaboration on the same platform maximizes the value and velocity of your data-driven AI applications).

That’s the end of this long recap. I hope you have learned a thing or two on best practices from real-world implementations of the data-centric approach. If you have experience building and deploying data-centric AI applications, please reach out to trade notes and tell me more at khanhle.1013@gmail.com! 🎆