What I Learned From Attending Scale Transform 2021
A few weeks ago, I attended Transform, Scale AI’s first-ever conference that brought together an all-star line-up of the leading AI researchers and practitioners. The conference featured 19 sessions discussing the latest research breakthroughs and real-world impact across industries.
In this long-form blog recap, I will dissect content from the session talks that I found most useful. These talks cover everything from the future of ML frameworks and the importance of a data-centric mindset to AI applications at companies like Facebook and DoorDash. To be honest, the quality of the conference was so high that it was hard to choose which talks to recap.
1 — From Big Data to Good Data
Andrew Ng (Founder of DeepLearning.AI, General Partner at AI Fund, and Adjunct Professor at Stanford University) shared his thoughts on how practitioners' mindset should shift from building big data to building good data. Such a mindset shift will unlock more efficient ways to build AI systems for many more industries, especially those outside the consumer internet.
If you look at what AI has done so far, it has transformed the software internet industry, especially consumer internet software (web search, online advertising, language translation, social media, etc.). These businesses have become much more effective and valuable because of AI. But in industries outside software, from manufacturing to agriculture to healthcare to logistics, AI is still at the earliest stage of development. Andrew believes that, in the future, AI applications outside the consumer internet will be even bigger than the applications we have on the consumer internet today.
Why aren’t we there yet, and why is building AI systems so hard today for so many industries?
Model training has received most of the attention in academic research and the popular press in recent years. However, in a commercial context, there is a lot more to building an AI system than training the model.
The diagram above depicts the entire life cycle of a machine learning project:
We start with scoping the project: defining what's worth working on and choosing which projects to pursue. This still turns out to be one of the hardest skills in today's AI world.
The next stage is collecting the data. This includes defining what data you want, making sure the inputs X are representative and the labels Y are unambiguous.
Then comes the training of the model, which entails carrying out error analysis and iteratively improving the model (by updating the model or, in many cases, getting more data until your model performs well enough).
The last stage is deploying in production. Some teams view deploying in production as the endpoint of an AI project. But often, deploying models in production only means you’re about halfway there. Only after you’ve deployed in production do you then get the live data to flow back to your system to continue learning, improving, and refining the AI system.
The life cycle of a machine learning project is highly iterative: while training the model, you often need to go back and collect more data. Even after you deploy the model, you may want to retrain it, update it, or go back and update your dataset.
Many teams will collect some data and then train the model. When you get through that loop for the first time, you have what we sometimes call a proof of concept (PoC): a demo, maybe a prototype, that does well on a test set stored somewhere. But after you've deployed in production, there is still an iterative process of going back to the earlier stages to keep improving the system.
Two major issues make building AI systems hard:
Building the initial PoC is hard when there is insufficient high-quality data.
Even after a successful PoC, there is often still a PoC to Production gap.
Why is there not enough high-quality data?
Small Data Problems: Many problems outside the consumer internet do not have huge datasets. Therefore, being able to learn from small datasets is critical for AI to break into other industries. Emphasizing good data hygiene practices can greatly improve the quality of the models you can fit. For instance, can we build a practical computer vision system with 100 images, or sometimes even fewer?
AI Systems = Code + Data: The AI community has put a lot of work into inventing new practices and algorithms and ways of improving the code. However, the amount of work on improving the data should grow as well. If 80% of data science is data cleaning, then data cleaning must be a core part of the data scientist / ML engineer's work.
Data Iteration Instead of Model Iteration: Today, many research teams hold the data fixed and iterate on the model. For many problems, you should instead hold the model fixed and iterate on the data. That requires a repeatable and systematic process, not simply engineers hacking around in a Jupyter notebook (see the sketch after this list).
Long-Tail Problems: Even if you have a large dataset, if your ML system is addressing a problem with long tail properties (web searches with lots of rare queries, self-driving cars that have to handle corner cases, product recommendations across a large catalog), then you still face a small data problem for items in the tail.
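To make the data-centric loop concrete, here is a minimal sketch of what data iteration can look like in practice. The helper function, the candidate dataset names, and the choice of a fixed random forest are all illustrative and not part of Andrew's talk:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def data_iteration(candidate_train_sets, x_val, y_val):
    """Hold the model (and its hyperparameters) fixed and iterate on the data.

    Each candidate training set represents one round of error analysis
    followed by label cleaning or targeted data collection.
    """
    results = []
    for name, (x_train, y_train) in candidate_train_sets.items():
        model = RandomForestClassifier(n_estimators=100, random_state=0)  # fixed model
        model.fit(x_train, y_train)
        results.append((name, accuracy_score(y_val, model.predict(x_val))))
    # The "winner" is a dataset version, not a new architecture.
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Here, `candidate_train_sets` would map hypothetical names like "v1_raw" or "v2_relabeled" to (features, labels) pairs produced by successive rounds of data work.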
Why is it hard to bridge the PoC to Production gap?
ML Code Is A Small Part of the Puzzle: Seasoned practitioners are aware of the Sculley et al., 2015 paper which points out that building a working ML system often requires more than doing well on your hold-out test set.
Generalization and Robustness: A PoC result that works according to a published paper often does not work in a production setting (due to concept drift and data drift). Deployment is a process, not an event, that only marks the beginning of flowing data back to enable continuous learning.
To bridge this gap, MLOps is a discipline to help us make building and deploying machine learning systems more repeatable and systematic.
In the traditional software engineering world, software engineers write code and then hand it off to the DevOps team, who are responsible for ensuring quality, infrastructure, and deployment.
With AI, we now have not just code but code plus data. MLOps (as a discipline) helps make the creation and deployment of AI systems more systematic.
One of the challenges with AI software is that it is not a linear process. In traditional software, you could scope it, write the code, hand it over to DevOps, and then deploy it in production. With AI software, that doesn't work: you scope it, collect data, train the model, deploy in production, and then keep iterating back to the earlier stages of the process.
Therefore, MLOps needs to support multiple stages in an AI project's life cycle: managing data governance and showing that the systems are reasonably fair, auditing performance, ensuring system scalability, etc. If there is a single most important thing for the MLOps team to do, it is to ensure consistently high-quality data throughout all stages of the machine learning project life cycle.
2 — The Next 5 Years of Keras and TensorFlow
Francois Chollet (author of Keras and Software Engineer at Google) believes that deep learning has only realized a small fraction of its potential so far. The full actualization of deep learning will be a multi-decade transformation. Over time, deep learning will make its way into every problem where it can help; hence, it will become as commonplace as web development is today. As Keras turns six years old, Francois revealed how it will evolve over the next few years.
Keras and TensorFlow have had a symbiotic relationship for many years. Keras serves as a high-level API for TensorFlow, and TensorFlow is an infrastructure layer for differentiable programming. While Keras contains deep learning-specific concepts like layers, models, optimizers, losses, metrics, etc., TensorFlow is an infrastructure layer for manipulating tensors, variables, automatic differentiation, distribution, etc.
If you want to build something correctly, a critical factor is the speed at which you can iterate on your ideas. In deep learning, iteration cycles depend on three factors.
Starting from an idea, you use a deep learning framework to implement an experiment to test the idea.
Then you run your experiment on the computing infrastructure (like GPUs and TPUs) to get the results.
Finally, you analyze and visualize your results, which feed into the next idea.
Developing great ideas depends entirely on your ability to go through this loop of progress as fast as possible. All three components are critical: you want good visualization tools, you want fast GPUs or big data centers, and you want a software framework that enables you to move from idea to experiment as fast as possible.
People who use Keras fall into 3 categories:
Basic users need as little code as possible, batteries included, and best practices baked-in by default.
Engineers want flexible model architecture and training logic, scale and speed, and deployment readiness.
Researchers want a toolbox with low-level control of every detail, alongside simple and consistent programming mental models.
The key design principle that Keras uses is progressive disclosure of complexity. Keras doesn’t force you to follow a single true way of building and training models. Instead, it enables a wide range of different workflows from the very high level to the very low level, corresponding to different user profiles. The spectrum of Keras workflows is structured around two axes: model building and model training.
The model building phase goes from simplicity to arbitrary flexibility.
The simplest model to build is a sequential model, which only allows a stack of built-in layers. This is fast and easy to build.
If you want to reuse existing components or experiment with non-linear architectures, you can use the functional API.
If you want to extend and customize your models further with new components, you can use custom layers, custom metrics, custom losses, etc.
At the end of this spectrum, you can use full model subclassing, where you write your own code with no restrictions.
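Here is a minimal sketch of that spectrum using standard Keras APIs (the layer sizes and the toy `Scale` layer are purely illustrative):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# 1) Sequential: a plain stack of built-in layers.
sequential_model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(10),
])

# 2) Functional API: reuse components and build non-linear topologies.
inputs = keras.Input(shape=(32,))
x = layers.Dense(64, activation="relu")(inputs)
outputs = layers.Dense(10)(x)
functional_model = keras.Model(inputs, outputs)

# 3) Custom components: extend Keras with your own layers, losses, metrics.
class Scale(layers.Layer):
    def build(self, input_shape):
        self.alpha = self.add_weight(shape=(), initializer="ones")

    def call(self, inputs):
        return self.alpha * inputs

# 4) Full subclassing: write the forward pass yourself, no restrictions.
class MyModel(keras.Model):
    def __init__(self):
        super().__init__()
        self.scale = Scale()
        self.dense = layers.Dense(10)

    def call(self, inputs):
        return self.dense(self.scale(inputs))
```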
The model training phase also goes from simplicity to arbitrary flexibility.
The simplest way to train a model is to call model.fit(), which covers standard supervised learning.
Next, you can customize the built-in training loop with callbacks, with support for GPU or TPU distribution.
Furthermore, you can override specific model class methods to experiment with generative models or unsupervised learning.
Finally, you can write your own low-level training loop entirely from scratch.
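A corresponding sketch of the training spectrum, written against the TF 2.x-era Keras API (the synthetic data and layer sizes are illustrative):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

x = np.random.rand(64, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")

# 1) Simplest: the built-in supervised training loop.
model = keras.Sequential([keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=1, verbose=0)

# 2) Override train_step() to customize what happens inside fit(),
#    while keeping callbacks, metrics, and distribution support.
class CustomStepModel(keras.Model):
    def train_step(self, data):
        xb, yb = data
        with tf.GradientTape() as tape:
            pred = self(xb, training=True)
            loss = self.compiled_loss(yb, pred)
        grads = tape.gradient(loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
        self.compiled_metrics.update_state(yb, pred)
        return {m.name: m.result() for m in self.metrics}

# 3) Fully custom, low-level loop written from scratch with GradientTape.
def custom_loop(model, optimizer, loss_fn, dataset):
    for xb, yb in dataset:
        with tf.GradientTape() as tape:
            loss = loss_fn(yb, model(xb, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
```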
So where’s the world of deep learning headed? What are the big trends? And what can Keras do to facilitate these trends and create the most value for the industry and the research community?
Francois identifies the four trends:
Trend 1 — An ever-growing ecosystem of reusable parts
There’s a big contrast between the traditional software development world, where reusing packages is default and most engineering actually consists of assembling existing functionality, and the deep learning world, where you have to build your own models and train them pretty much from scratch every time. The Keras team is going to:
Provide reusable domain-specific functionality with KerasCV and KerasNLP.
Offer larger banks of reusable pre-trained models for expanded Keras applications.
Build feature banks where features are trained once and reused forever.
Trend 2 — Increasing automation
In the next few years, ML practitioners will move beyond handcrafting architectures, manually tuning learning rates, and constantly fiddling with trivial configuration details. The Keras team will build out higher-level workflows with an API that connects various components:
The user's input will be a dataset, a metric to optimize, and some way of specifying expert knowledge about the problem domain.
The white-box algorithm will then take over and develop a solution via a combination of searching pre-trained feature banks, architecture search informed by banks of modern architecture patterns, and finally hyperparameter tuning.
The output of this process will not just be a trained model but something closer to an API that you can easily deploy for your use cases.
The user will be able to visualize what the algorithm is trying, and the algorithm will provide feedback to the user as well. For instance, when the data is insufficient or the objective is badly specified, the search process will tell you what you need to work on to make the problem more solvable and improve your results.
As the Keras team focuses on building solid foundations to enable the system design above, Keras Tuner and AutoKeras are important building blocks.
Keras Tuner features tunable models (such as HyperResNet and HyperXception), define-by-run dynamic search spaces, built-in search strategies, and support for large-scale distributed search.
AutoKeras, an AutoML library, is the layer above Keras Tuner. AutoKeras automates the model development process: it analyzes your dataset, determines the best model architecture templates for your problem, and runs architecture search and hyperparameter tuning to find the best model.
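As a rough illustration of the define-by-run search space idea, here is a small Keras Tuner sketch (the search space and trial budget are made up, and the package import name varies between older and newer releases):

```python
import keras_tuner as kt  # older releases use `import kerastuner as kt`
from tensorflow import keras

def build_model(hp):
    # The search space is defined as the model is built ("define-by-run").
    model = keras.Sequential([
        keras.layers.Dense(hp.Int("units", min_value=32, max_value=256, step=32),
                           activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)
# tuner.search(x_train, y_train, validation_data=(x_val, y_val), epochs=5)
```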
Trend 3 — Larger-scale workflows in the cloud
In the future, it should be as easy to train your model on hundreds of GPUs as it is to train a model in a Colab notebook or on your laptop today. You'll be able to access remote large-scale distributed resources with no more friction than accessing your local CPU.
With TensorFlow Cloud, you can add one line of code to your script: tfc.run(**config). This line will collect your script and its dependencies, as well as any local data files that you've specified in the run configuration. It will inject a distribution strategy configuration into your Keras model, so you don't have to worry about distribution. It will spin up machines corresponding to the configuration of your choice, start training, and stream the logs and saved artifacts, such as saved models, to a location of your choice. Imagine doing the distributed training you need without worrying about things like cluster configuration or network communication.
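In code, the hand-off Francois describes looks roughly like the following. This is a hedged sketch: train.py is a placeholder, and the exact keyword arguments and machine-config helpers depend on the tensorflow-cloud release you use:

```python
import tensorflow_cloud as tfc

# Packages the entry-point script and its dependencies, injects a
# distribution strategy, provisions the machines, and streams back
# logs and saved models.
tfc.run(
    entry_point="train.py",                               # hypothetical training script
    requirements_txt="requirements.txt",
    chief_config=tfc.COMMON_MACHINE_CONFIGS["V100_4X"],   # e.g. 4 x V100 GPUs
)
```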
Trend 4 — Real-world deployment
For practical deployment, it must be possible to run Keras models on mobile devices, web browsers, embedded systems, microcontrollers, and so on.
The Keras team recently introduced a new set of layers that perform preprocessing and are embedded into your model when you export it. Keras models should be raw data in, predictions out: they can ingest strings and raw images, which makes them fully portable.
They are also investing in TensorFlow Lite for mobile devices and TensorFlow.js for deep learning in the browser. These libraries make it possible to run deep learning models on mobile devices and in browsers at production-level performance.
Finally, they plan to provide a model optimization toolkit to dramatically reduce model size and memory consumption during inference. Techniques like post-training quantization (where you convert model weights from floating-point to integer formats) and pruning-aware training (where weights are pruned during training while accuracy is preserved) will be included.
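For post-training quantization specifically, the existing TensorFlow Lite converter already gives a flavor of what this looks like (the tiny model below is only a stand-in for a real trained model):

```python
import tensorflow as tf

# Stand-in for a real trained Keras model.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])

# Post-training quantization: convert weights to lower-precision formats
# to shrink the model for on-device inference.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```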
3 — A Framework to Assess Your AI/ML Maturity
Chu-Cheng Hsieh (Chief Data Officer at Etsy) shared a framework for assessing a company's AI/ML maturity. Using this framework, as a data scientist, you can answer the question: "What types of challenges will I need to solve at this company?" As an executive, you can answer the question: "How can I make decisions about AI investment? Should I acquire this company to boost AI/ML?"
This 5-level framework includes the five stages that correlate to five different roles: conductor, practitioner, craftsman, adventurist, and pioneer.
Level 1 is the conductor stage.
Here, the products are powered by AI/ML services offered by other vendors. Think of a conductor leading a big orchestra: each vendor service is one of the individual players, and machine learning is one specialist on your team.
If you join a company at this stage as a data scientist, you are really a hybrid of engineer and scientist. Much of your time goes to working with engineers, or sometimes being the engineer, to integrate ML solutions into the technology.
If you are an executive looking to acquire a level-1 company, what you will actually get is a lot of engineering talent with a machine learning background, which can sometimes be a perfect choice if you want to integrate these vendors into your ecosystem quickly. They can tell you about different vendors (their pros and cons) and provide a quick boost to your talent pool.
Level 2 is the practitioner stage.
Here, the products are powered by prevalent AI/ML solutions from textbooks and online/school training. Although these solutions are common, they need to be creatively applied to your unique problem.
If you join a company at this stage as a data scientist, you can expect to use many machine learning libraries to solve business problems. You will be responsible for identifying which data to use to build a model, working with the company to obtain labeled data, deciding which labels are relevant and why, and eventually building machine learning models on top of that.
If you are an executive looking to acquire a level-2 company, you will often see that such a company comprises ML generalists who can work with your engineers and PMs to understand/translate problems into solutions and then build up a pipeline to generalize these solutions in production.
Level 3 is the craftsman stage.
Here, the products are powered by customized AI/ML solutions from the latest papers and conferences. Unlike level-2 folks, who spend a lot of time understanding the problem, level-3 folks browse SOTA papers and implement them for specific use cases.
If you join a company at this stage as a data scientist, you should expect to read many papers and build a lot of customized solutions that are unique to the company. Often they are based on the latest technological breakthroughs, like many of today's deep learning solutions.
If you are an executive looking to acquire a level-3 company, you are often acquiring their technology. Do they really have something unique? Can they bring value that you cannot easily reproduce with generic solutions?
Level 4 is the adventurist stage.
At this stage, your company is a first mover, with experiences built on AI/ML. Solutions are "unique" to your company and are your secret sauce (trade secrets or patents). This stage is more patent-first, paper-next, and product-last.
If you join a company at this stage as a data scientist, you will produce solutions that are unique and innovative. Expect the solutions you build to be company confidential, things you cannot share with family and friends, because they become trade secrets at this stage.
If you are an executive looking to acquire a level-4 company, you are acquiring intellectual property. You acquire it to protect your own company and to prevent competitors from entering the space.
Level 5 is the pioneer stage.
At this stage, you are pushing the boundary of innovation in some domains and are recognized as a leader in (some) AI/ML areas where you strategically choose to push the frontier.
If you join a company at this stage, you will be doing mostly applied research, often in a research lab. You are expected to publish papers and patents, often constrained by the field/domain your company competes in.
It's very hard to acquire a level-5 company. They are often a group or a department inside a big company. If you can acquire one, what you are really acquiring is their brand.
How can you tell which level a company is currently at?
Chu-Cheng suggests asking the interviewer 3 questions:
How do you interview your scientists? If the interview questions focus on mathematical, research, and problem-solving skills, the company is likely level three or above. If they focus more on engineering skills, it is likely at level one to three.
How do you measure your success on AI/ML? If they say they want to come up with innovations and then patent those innovations, you can probably guess they are level four or above. On the other hand, if they say success is based on the product you build, they are likely at level three.
What are your KPIs? If the KPIs are the number of papers published in a year, you can expect the company to be at level four or five. On the other hand, if success is mostly measured with business metrics like revenue or volume, it's likely level two or three. No KPIs at all? They may be level one: they need someone to figure out how machine learning can be used in their product.
4 — AI at Facebook Scale
Srinivas Narayanan (Head of Applied Research at Facebook AI) has facilitated the R&D in a wide range of areas such as computer vision, natural language, speech, and personalization to push state-of-the-art AI to advance Facebook products. AI has been used in various use cases such as Newsfeeds and Ads, Neural Machine Translations, Accessibility, Social Recommendations, Newsfeed Integrity, Blood Donations, Bots and Assistants, Generated Content, AR Effects, VR Hardware, etc. He discusses some of the challenges in deploying AI at scale and Facebook’s approach to solving them.
The Data Challenge: Scaling Training Data
One of the biggest challenges in building AI systems is getting the right and sufficient training data. Getting labeled data for supervised learning can be difficult, expensive, and in some cases, impossible. Facebook approaches this by focusing on techniques beyond supervised learning.
For Instagram, Facebook trained an object classification system using 3.5 billion publicly shared images and the hashtags they were shared with as weak supervision. Such a weakly supervised learning approach helped Facebook create the world's best image recognition system. This technique enabled them to leverage a much larger volume of training data than would have been possible otherwise, setting a new state-of-the-art result. They have released an open-source pre-trained version of this model, available on PyTorch Hub.
For Translations, Facebook is providing more than six billion translations a day in over 4,500 language pairs. For some pairs, like French to English, there are large bodies of training data available, but people on Facebook speak over 100 different languages. For most of these, the pool of available translation training data is either non-existent or so small that it cannot be used with the existing systems. To solve this challenge, Facebook’s researchers developed a way to train a machine translation model without access to translation resources at training time, also known as unsupervised translation. This system is now used in production for translations in many low resource languages like Nepali, Sinhala, Burmese, etc.
Large-scale language models, like GPT, have become really popular. Facebook has extended that across languages to build cross-lingual language models trained on pairs of parallel sentences (sentences with the same meaning in different languages). To predict a masked English word, the model can look at both the English sentence and its French translation, learning to align the English and French representations. This is exciting because it shows promise for how we can scale language understanding tasks across languages without the need for many explicit labels in every language.
Facebook also extends this idea of self-supervision to speech. This model is called wav2vec, which is trained to predict the correct speech unit for masked parts of the audio. With just one hour of labeled training data, this model outperforms the previous state-of-the-art on the 100-hour subset of LibriSpeech, using 100x less labeled data. Similar to the previous example in NLP, Facebook has also developed a cross-lingual approach that can learn speech units common to several languages. These techniques have been used to improve the quality of video transcriptions in Facebook's products.
For videos, Facebook goes beyond individual modalities like images, text, and speech, since video is an interesting content form that brings all of these together. In this model, they use an approach called generalized data transforms. The model is designed to learn audio and visual encoders of the video, such that the representations of audio and visual content taken from the same video, at the same time, are similar to each other, while the representations from different times or different videos altogether are different. Once you learn to align the audio-visual representations this way, you can use them to find similar videos without a lot of supervision. This has been really useful in products like Instagram Reels to recommend related videos to people.
Finally, Facebook is extending this to cover text in videos as well. In this approach, they first learned the audio and visual representations using CNNs and transformer models, which are then combined to produce an overall audiovisual representation of the video. Separately, they process the text (whether captions, descriptions, etc.) for the video, with the Transformer model and the recurrent neural network, to produce a representation for each word. Then, they aggregate that across all the words. They use contrastive training to match the representation across the audio-visual and the text modalities. This ensures that the video and text encoders have similar representations for the text and videos that are similar and have different representations for inputs of text and video that are unrelated. Once you have trained representations across modalities to be aligned, this approach is handy for video search applications.
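Facebook's exact training objective isn't spelled out in the talk, but the contrastive matching idea can be sketched with a generic InfoNCE-style loss, where matching (video, text) pairs in a batch are pulled together and mismatched pairs are pushed apart:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) embeddings where row i of each
    tensor comes from the same video/text pair."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature    # (batch, batch) similarities
    targets = torch.arange(video_emb.size(0))           # i-th video matches i-th text
    # Symmetric cross-entropy: video-to-text and text-to-video directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```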
The Compute Challenge: Efficient Training and Inference
Personalizing newsfeed is an example of AI in action at scale. Clearly, Facebook and Instagram’s feed systems are powered by large-scale deep learning models. The diagram below is an overview of the model architecture used for such recommendation systems.
These models take a combination of dense features and sparse (categorical) features.
The sparse features are then mapped to dense representations using embedding tables learned by the network.
There are more neural net layers on the top.
The models have to be trained on tens of billions of examples, and the size of an embedding table can be hundreds of megabytes. The system has a different set of bottlenecks at different levels. For example, the embedding look-ups are memory capacity dominated, but the higher layers are network communication or compute dominated. One key way to address these training compute challenges is by custom designing servers for these workloads. Facebook has released its hardware designs externally through the Open Compute Project.
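A toy PyTorch version of that architecture is sketched below; the feature counts, cardinalities, and layer sizes are invented for illustration and are orders of magnitude smaller than production recommendation models:

```python
import torch
import torch.nn as nn

class TinyRecModel(nn.Module):
    """Dense features go through a bottom MLP; sparse ID features go through
    embedding tables; everything is concatenated and fed to a top MLP."""
    def __init__(self, num_dense=13, sparse_cardinalities=(1000, 500), emb_dim=16):
        super().__init__()
        self.embeddings = nn.ModuleList(
            [nn.Embedding(card, emb_dim) for card in sparse_cardinalities])
        self.bottom_mlp = nn.Sequential(nn.Linear(num_dense, emb_dim), nn.ReLU())
        top_in = emb_dim * (1 + len(sparse_cardinalities))
        self.top_mlp = nn.Sequential(nn.Linear(top_in, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, dense, sparse_ids):
        # dense: (batch, num_dense) floats; sparse_ids: (batch, num_sparse) int64 IDs
        emb = [table(sparse_ids[:, i]) for i, table in enumerate(self.embeddings)]
        x = torch.cat([self.bottom_mlp(dense)] + emb, dim=1)
        return torch.sigmoid(self.top_mlp(x))  # e.g. a click probability
```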
For inference, they do a variety of optimization techniques.
They do FP-16 quantization to reduce the cost of evaluation of fully connected layers. That can reduce the model sizes to an eighth of the original size with minimal impact on accuracy.
They also do factored shared computations. So parts of the model that will be the same within a given batch will be sent over the wire and evaluated only once.
They also share computation across different models. Having a shared trunk with different branches allows them to trade off accuracy versus efficiency much more effectively.
They also use knowledge distillation. The idea here is to train a large teacher model and then train a much smaller and more efficient student model to mimic the teacher's predictions. This technique has been applied now effectively across many domains, including computer vision, NLP, and speech.
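The distillation objective itself is standard and easy to sketch: the student matches the teacher's softened predictions while still fitting the ground-truth labels (the temperature and weighting below are typical defaults, not Facebook's settings):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Knowledge distillation: soft targets from the teacher plus hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                        # rescale to keep gradients comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```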
The Tooling Challenge: Building the AI Platforms
In addition to optimizing the infrastructure to operate as efficiently as possible, Facebook also needs to invest in common tools to enable the engineers and researchers to operate efficiently. Here’s a very high-level overview of the stack that they have built.
On the left, you’ll see tools for preparing data in the right format.
In the middle, you'll see the pieces for building and training models, going bottom-up all the way from hardware, whether CPUs or GPUs. These include frameworks like PyTorch that ease the model-building environment, libraries that are specific to each domain, and the models that are used in products.
And on the right, once you have the trained models, you have the right tools and systems for deploying them in production, whether it’s in a data center or locally on the device.
As you can see, Facebook has tools for data, training and testing, debugging, deploying and serving, and of course, hardware as well. These common tools and models across the company enable a range of product teams across their family of apps to more easily leverage the AI tech for what they’re building. One key challenge that they needed to solve here is that they needed to make the research to production flow smooth, even though the needs of the research stage and the needs in the production stage for large-scale deployment can be very different. To that end, Facebook has built PyTorch as a single framework to enable rapid experimentation of new research ideas and bring those ideas to production seamlessly.
The Bias Challenge: Creating AI Responsibly
Fairness and bias in AI models are key facets of responsibility that Facebook focuses on. Facebook is using AI to benefit billions of people worldwide, and to do that well, they need to ensure the systems work fairly and equally well for everybody.
Not introducing or amplifying bias and creating unfair systems is a challenge because it’s not as simple as using the right tools. Fairness is a process. At each step of the implementation process, there is a risk of bias creeping in — in data and labels, in algorithms, in the predictions, and in the resulting actions based on those predictions. At each step, we need to surface the risks to fairness, resolve those questions (which means defining fairness in that context), and document the decisions and processes behind it.
The Process and Culture Challenges: Creating Best Practices
With the rapid pace of innovation in AI, you see some state-of-the-art results almost every day, and you wonder how you can build on top of it. To do that, you first need to reproduce the advances that those papers claim consistently. The academic community is starting to make reproducibility a part of the paper submission process, and the industry should follow suit as well.
Another challenge is making the engineering velocity faster and enabling continuous integration and continuous deployment for machine learning models. There are usually many differences in performance in offline datasets and how those same models perform online on live data. This slows down the speed of innovation. Facebook has been investing in techniques such as counterfactual evaluation to bridge this gap.
Next, ML models often cascade. For example, a computer vision model may provide signals to a different downstream model, and in such cases it becomes hard to assess the real impact of improvements downstream. Thus, you have to retrain and redeploy the entire cascade of models every time, and this is not easy to do without the right tooling and infrastructure improvements.
The third problem is that ML models are inherently imperfect. A new model may perform better in aggregate but may have worse results for some more important examples. We don’t quite have the right rigorous definitions of model contracts that enable quick and easy changes with confidence.
Facebook believes the best solutions will come from open collaboration by experts across the entire AI community. So they have been releasing more open datasets for some of the hard problems mentioned above. Even creating strong, well-designed data sets internally can spur more people inside the organization to advance state-of-the-art. Additionally, Facebook also believes in bringing holistic thinking to AI problems. They bring together multiple disciplines (product management, design, analytics, user research) to frame the product problems crisply, define the ML tasks to solve more precisely, and design the right evaluation methodologies for these systems.
Srinivas ended the talk with these 5 learnings:
Be rigorous: Pushing AI experimentation to be more rigorous and look at challenges such as model reproducibility and model evaluation.
Be open: Creating more open datasets to foster more collaboration by experts.
Be holistic: A holistic approach to product thinking is better than a tech-centric approach.
Be determined: As a highly empirical science, AI requires not just rigor but also determination to see the results.
Be a finisher: Ensuring that the eventual product value is realized instead of simply handing off the technology.
5 — Applied ML at DoorDash
Andy Fang (Co-Founder and CTO of DoorDash) talked about how DoorDash uses AI to power its marketplaces. Starting from a Stanford dorm room in 2013, today, DoorDash services 20M+ customers and 450K+ merchants, with 1M+ dashers fulfilling deliveries on the platform. It has become the category leader in the restaurant food and convenience delivery verticals in America (currently in the US, Canada, and Australia with further geographical expansion plans).
One of DoorDash's early mantras was to do things that don't scale, which was evident in all the manual techniques they used to power deliveries in the early days. However, as the company grew exponentially, it ultimately had to scale. Not only did they see this as an opportunity to automate many workflows, but also to think about how to apply AI techniques to do things better than they could manually, considering that DoorDash processes millions of calculations per minute to determine how to optimally serve all three sides of the marketplace (consumers, dashers, and merchants). In his talk, Andy dove into two case studies where DoorDash has applied AI to further their business and better serve their constituents.
Case Study 1 — Creating a Rich Item Taxonomy
DoorDash has tens of millions of restaurant items in its catalog. Tens of thousands of new items are added every day, most of which have unique taste profiles to be differentiated. To build this rich taxonomy, DoorDash looked at all the tags they were interested in and then built models to automatically tag every item in the catalog according to the taxonomy. Next, they integrated these models into a human-in-the-loop system (one that requires human interaction), allowing them to collect data efficiently and substantially reduce annotation costs. Their final implementation was a system that grows the taxonomy as they add tags and uses their understanding of the hierarchical relationships between tags to learn new classes quickly and efficiently.
There are 3 critical rules for defining annotation tags:
Make sure that there are different levels of item tagging specificity that don't overlap. Take coffee: you can say it's a drink, you can say it's non-alcoholic, or you can say it's caffeinated. Those are three separate labels at different levels that don't overlap in categorization with each other.
Allow annotators to pick “others” as an option at each level. Having “others” is a great catch-all option that allows DoorDash to process items tagged in this bucket to see further how they can add new tags to enrich their taxonomy.
Make tags as objective as possible. They want to avoid tags like "popular" or "convenient", anything that would require subjectivity from an annotator to determine.
Andy also emphasizes the importance of making tasks high-precision and high-throughput. High precision is critical for accurate tags, while high throughput is critical to ensure that the human tasks are cost-efficient. DoorDash’s taxonomy naturally lends itself towards generating simple binary or multiple choice questions for annotation with minimal background information. So you can still get high precision using less experienced annotators and less detailed instructions, which makes annotator onboarding faster and reduces the risk of an annotator misunderstanding the task objective.
For their human-in-the-loop system, DoorDash had the annotations feed directly into a model. As you can see in the diagram above, the steps with human involvement are in red, and the automated steps are in green. This loop allows them to focus on generating the samples they think will be most impactful for the model. They also have a loop to do QA on the annotations, which makes sure that the model is being given high-quality data. Through this approach, DoorDash has almost doubled recall while maintaining precision for some of their rarest tags, leading directly to substantially improved customer selection.
Case Study 2 — Creating an Optimized Delivery Menu
For a restaurant to create a successful online experience, it often needs to present its menu differently than in-store. DoorDash uses AI to analyze thousands of existing menus on its platform to surface the characteristics of successful online menus. They then translated these characteristics into a series of hypotheses for A/B tests. As a result, they saw a huge improvement in menu performance from experiments involving header photos and more information about the restaurant. DoorDash's data team intends to conduct further experiments on adding different kinds of information to improve menu performance in the near future.
DoorDash defines a successful menu as one with a high conversion rate at the end of the day. To build a set of features for this kind of model, they looked at each layer of a menu, from the high-level menu appearance down to the detailed modifiers for each item. For each layer, they brainstormed features relating to key elements such as menu structure, item customizations, and visual aesthetics. To determine which menu features were the most influential in menu conversion, they used these features as inputs to regression models predicting menu conversion. They built their initial regression models using linear regression and basic tree models to establish a baseline; the results were interpretable, but the error rate was quite high. On top of that, many of the features were correlated. This led to collinearity, which made it difficult to determine how changes in each feature impacted the direction of the target variable.
This lack of clear explainability was a pain point for DoorDash, and it is a pain point for black-box models in general. They turned to Shapley values to solve this problem, a game-theoretic approach to model explainability. Shapley values represent the marginal contribution of each feature to the target variable. They are calculated by computing the average marginal contribution to the prediction across all permutations, before and after withholding that feature.
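In practice, this kind of analysis is commonly done with the shap library on top of a tree-based regressor. The snippet below is a generic sketch with synthetic stand-ins for DoorDash's menu features and conversion rates, not their actual pipeline:

```python
import numpy as np
import shap
import xgboost

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                         # stand-in for menu features
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=500)   # stand-in for conversion rate

model = xgboost.XGBRegressor(n_estimators=50).fit(X, y)

# Shapley values: average marginal contribution of each feature to the prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# The summary plot ranks features by their average impact on predicted conversion.
shap.summary_plot(shap_values, X)
```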
After examining the Shapley plots of the final model, they found that the top success factor was the number of photos on the menu. This is particularly important for the menu's top items, as photo coverage of the top items features much more prominently in the menu's overall appearance.
Another top factor is offering higher customizability for items. Customers enjoy the optionality within the top items, and the ability to customize provides a degree of the familiarity they would find while dining in.
Another factor was menus with a healthy mix of appetizers and sides. This gives customers more choices to complete their meal and can lead to higher cart values for merchants.
Those are just two ways DoorDash uses AI to scale their marketplace and make it easier for customers to find what they want while making it easier for merchants to position themselves in an online world. If you’re curious to read about more case studies, check out the DoorDash engineering blog.
6 — Future of ML Frameworks
Soumith Chintala (the co-creator of PyTorch) gave an excellent talk about machine learning frameworks, particularly how they have evolved within certain dimensions of interest. He presented the three personas involved in an ML development process: modeler, prod, and compiler.
A Modeler is someone who assesses data quality (collecting more labels, pre-processing, and feature engineering steps), builds an architecture suitable for the data and problem (encode enough priors into the learning to make it data-efficient), and builds an efficient training pipeline.
A Prod is someone who versions models and data, verifies the increase or decrease in real-world accuracies, and brings new models from Modeler into production-level performance while coordinating with the Compiler.
A Compiler is someone who maps training and inference pipelines to hardware and squeezes out the best performance per watt or performance per second.
Pre-Deep Learning
Before 2012, you typically had a software stack that looked something like the one above:
There is a lot of focus on pre-processing, feature engineering, and post-processing. You had domain-specific libraries for that.
The ML models themselves have a small API surface: you interact with software packages or libraries that build and train those ML models for you. You give some kind of configuration of what model you're building, what learning rate or regularizer, how many trees are in the forest, and so on.
Once you build that config, you give that to a factory, and then along with that, you give your data in some pre-processed or clean form.
The software engine that implemented a particular machine learning algorithm just handles the entire stack of the training loop and all the implementation details of the model.
The most important thing to recognize here is that the model in this context is typically a configuration (generally small and usually human-readable) plus a blob of weights (stored in some blob format, maybe on disk or in memory).
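A scikit-learn estimator is a handy reminder of what that configuration-plus-weights world looks like (the hyperparameters here are arbitrary, and the fit/serialize lines assume you have cleaned data at hand):

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

# The "model" is a small, human-readable configuration...
config = {"n_estimators": 200, "max_depth": 8, "random_state": 0}

# ...handed to a factory that owns the entire training loop.
model = RandomForestClassifier(**config)
# model.fit(X_clean, y)            # pre-processed, cleaned data goes in
# blob = pickle.dumps(model)       # ...and the learned weights come out as a blob
```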
Post-Deep Learning
In the post-deep learning world, the stack looks like the below.
You have a very large API surface in the middle. Mainstream learning frameworks like PyTorch or TensorFlow have thousands of functions in their API. And these thousands of functions are strung together by Modelers to build models in all forms of shape and size.
And below that, you have data structures, typically dense tensors or sparse tensors with layouts of memory that might make computation more or less efficient.
And then you have a bunch of hand-optimized functions that are typically written by high-performance computing experts that map these APIs efficiently onto accelerator hardware.
In the last few years, compilers have been popping up (XLA, TorchScript, TVM) that take whole models described in the APIs of these frameworks and map them to hardware more efficiently than stringing together hand-optimized functions.
Lastly, a distributed transport layer enables these models to run on multiple devices at once or multiple machines at once.
On top of this API, you have domain-specific libraries that make it easy to train your models within particular domains.
You have prod tooling on top of the stack (TFX, TorchServe, SageMaker, Spark AI).
The general mainstream deep learning frameworks do a full vertical integration across the stack to make things pretty efficient. There are particular solutions by various parties that only focus on particular parts of the stack, and they interface cleanly with the rest of the stack.
The most important thing to recognize here is that, in this post-deep-learning world, the model in a mainstream ML framework (PyTorch or TensorFlow) is described as code in some language, not a configuration file or a JSON blob anymore. It's complicated code that can have loops and the various structures you typically define in a programming language. The model also still includes the weights, blobs of numbers that are stored somewhere.
Model = Code
Why did we enter this “model=code” regime?
This is mostly for the convenience of Modelers, who were pushing the limits of ML frameworks. Their models had dynamic control flow and dynamic shapes for complex unsupervised and semi-supervised learning (see the sketch after the points below). However, this comes at a high cost for Compilers and Prods.
It became harder for the Compiler to write efficient compilers targeting accelerator hardware.
With model = code, Prod had to figure out how to debug models in production.
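The sketch below shows the kind of data-dependent control flow that makes "model = code" convenient for Modelers and painful for Compilers and Prods; the module itself is invented purely for illustration:

```python
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    """A model whose depth depends on the input at run time. This is ordinary
    Python code, something a static config file cannot express and a compiler
    cannot easily specialize for."""
    def __init__(self, dim=32):
        super().__init__()
        self.block = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        steps = 0
        # Data-dependent loop: keep refining while activations stay large.
        while steps < 4 and x.abs().mean() > 0.5:
            x = torch.relu(self.block(x))
            steps += 1
        return self.head(x)
```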
Why do we have such a large API surface?
This is also for the convenience of Modelers, who were building disruptive new building blocks every few months (to implement results from new papers). They were writing complex mathematical functions via untenable custom operations. The larger the set of low/mid-level building blocks, the more architecture tweaks there are to invent. Again, this comes at a high cost for Compilers and Prods.
Why did the Modeler get so much leverage?
The reason is that the modeler is credited largely with making progress in the field. Every 3 years emerges a completely new disruptive architecture. Every few months arises an incremental but substantial improvement in metrics. The pace of value creation via inventing new architectures and training regimes hasn’t slowed down; therefore, catering to Modeler’s needs is essential for Compiler and Prod to stay relevant in 3 years.
Compiler isn't happy with Modeler. These 3-year disruption cycles make extreme hardware optimization untenable. While the only stable primitives have been convolution architectures, anything higher-level than that hasn't lasted more than a few years (LSTM, AlexNet, ResNet50, etc.). In other words, extremely specialized hardware gets outdated by the time it is fabbed and shipped. As Modelers keep expanding the operator set, going lower down the stack in expressing ideas, and breaking abstraction stability, they keep giving Compilers trouble.
Prod isn't happy with Modeler, either. Prods want easily versioned, DevOps-friendly models. They want to be able to roll back, so that if something goes wrong, very few variables have actually changed. They don't want you to pull some random Python function from some random Python package off the internet and use it within your model, because that model then has to ship to production. They want something quickly shippable onto all kinds of mobile or embedded environments as part of a larger application. As Modelers keep expanding operator sets and using custom operators, they keep giving Prods trouble.
The Future of ML Frameworks
Here are questions we want to ask about the future:
When does the Modeler’s leverage end? Will the Modeler’s leverage increase further? Will Compiler and Prod continue to under-fit and be under-leveraged?
Soumith believes the future is a distribution over chains of events. You say, "Well, this thing can happen with probability X. And if that happens, this next thing can happen." And then you chain them. So he laid out a few events that could happen and how the ML framework stack would change in response.
Event 1: Transformers + ConvNets = Stable Architectures
Today, Transformers and ConvNets make up the majority of what people think is the answer to everything. Let's hypothetically say that they become the stable, dominant architectures.
The ML frameworks' API surface would shrink, and the data structures would simplify, so we wouldn't need so many complicated tensor variants. Then pretty much everything lower down the stack would have a much easier time.
The number of hand-optimized functions will shrink, the Compiler will have an easier job, the hardware people can start specializing more (the shapes and sizes of the types of convolutions, the matrix multiplies, the whole Transformer blocks to be computed, etc.).
If that does happen, there will be a next wave of frameworks, which will again look like the classical frameworks: they will just drive everything with config files and then specialize. You won't have to expose a much more generalized scientific computing framework to the general public. Hugging Face is already doing this and might become more dominant temporarily, but other players will come in and try to take charge of this insight.
Event 2: Scaling a current niche fundamentally different architecture to show disruptive results
Let’s say some hardware looks very different from all the existing accelerators. And there are some not-as-used ML models, such as probabilistic graphical models or sparse networks, that have not been mapped efficiently enough to the current accelerators. If the hardware and the models are mapped together, and some disruptive results are shown, we have to rethink everything from scratch completely. New frameworks are going to take the mantle. Such a transformative event can trigger the usage of new languages (like Julia).
Event 3: Data-efficient models via priors
Let's say we reach a regime where models come from pre-trained priors or weights and hence do not need much labeled data. Then there will be vendors that attempt to democratize "prior" libraries (one-line initialization of each prior, packages that stitch together multiple priors, etc.). If the priors are neural networks, PyTorch and TensorFlow will probably continue their status quo. If not, we need to find a way to let prior pipelines inter-operate and talk to each other.
In brief, the only way for mainstream frameworks to keep up is to maintain high velocity and stay well-maintained. Specialized tooling will always pop up, as that is an essential component of disruption. Even though new tooling can move faster and be more efficient, it won't have the advantages of full vertical integration.
Soumith left the talk on an optimistic note:
In science, progress is a function of ideas and tools. If either one is stuck in a local-minima, progress stops. Let us continue to make progress! Be open to new ideas and new tools.
7 — Fireside Chats
Besides the main talks, there were various fireside chats between the guests and Alexandr Wang, CEO and Founder of Scale AI.
7.1 — Autonomous Driving
Drago Anguelov (Head of Research at Waymo) shared key learnings on taking autonomous driving from research to reality. Here are the key ideas that I jotted down:
On the State of Perception: The most interesting perception problems Waymo is working on relate not so much to the common cases but to the long tail (being robust to the rare events that come at you). That potentially requires a lot of training data, which raises the question of how to develop models that need less and less of it. Many powerful techniques have emerged in the last two or three years. Furthermore, it is crucial to determine the right intermediate representations that should come out of the perception model, representations that are most amenable and helpful for prediction, planning, and simulation. Waymo's VectorNet shows that if you model the map as a set of polylines instead of just rendering it as an image, your modeling capabilities and the prediction of agent behavior improve significantly.
On Deep Learning Limitations: In the autonomous driving domain, it's imperative to have robust systems. When building a robust system, Drago is excited about giving the network mechanisms to express a notion of its own confidence in its predictions. When the network tells you that it is not confident, you can have a fallback, such as a more hybrid system with more inductive bias in the domain to handle the cases you have not seen as much. Simulation is a really crucial part of this objective because you need to observe the system's performance over time. To build a strong simulation system, you want a scalable evaluation of your system performing in realistic circumstances, answering two key questions: (1) "What are the right interfaces and right representations to power these interfaces through the stack?" and (2) "What other things need to be passed on that help predict the intent and behavior of the agents in the environment?"
On Behavior Prediction: In the research community, there is a lot of progress being made with machine learning for behavior prediction. This task requires a deep understanding of scene semantics. It requires an understanding of the context ("Which traffic lights are on?" "What are the intersection rules?" "What are the road signs?" "What is the construction here telling you?") and ties the context into how everyone behaves. Behavior prediction is essentially imitation learning in its purest sense, with a specific loss function related to your planner. If you want to build models that understand behavior, you need dramatically more data with mined interesting interactions. The Waymo Open Dataset is a great behavior prediction dataset because it contains many interesting maneuvers like cut-ins, people negotiating at an intersection, or a bicycle weaving between cars. The Waymo team also made the benchmark metrics more stringent and demanding to reflect the demands of behavior prediction.
On Simulation Systems: You want to get two key aspects right to make simulation scenarios more realistic. The first one is sensor realism: as you move around the environment, you want your lidar, radar, and cameras, depending on what sensors are in your vehicle, to be simulated accurately, such that you can apply your perception system and the right outputs come out. The second one is behavior realism: ultimately, the big challenge of driving is navigating and sharing the world with humans, which involves keeping the pedestrians, bicyclists, and vehicles in the loop. Unfortunately, humans are far from perfect and deterministic. The Waymo team wants to replay accidents and see how their autonomous vehicles would do in those scenarios.
7.2 — Democratizing The Benefits of AI
Kevin Scott (Executive Vice President of Technology and Research and the Chief Technology Officer of Microsoft) shares his perspectives on how AI can be realistically used to serve the interest of everyone (as imagined in his book “Reprogramming The American Dream.”) Here are the key ideas that I jotted down:
On AI’s Macro Trends: We have an interesting set of challenges facing us as a society where AI is an important part of the solution. The big ones include tackling climate change, providing better healthcare, and redefining the workforce.
On Technical AI Advancements: Large self-supervised language models have revolutionized how natural language processing works. Their two fascinating properties are self-supervision (which removes the labeled-data constraint) and transfer learning (which works well for these bigger models and enables a huge range of applications). Therefore, we can think about these models as reusable platforms. However, the energy consumed by these models is enormous, and we are not even at the point of diminishing marginal returns yet. There is a clear opportunity to think about more efficient algorithms to train these large models. Furthermore, how can we put the power of these models into many hands? Microsoft has been working on tools to package these models so they can be used safely and responsibly by third-party developers.
On The Economic Effects of AI: The big chunk of software development is going from programming to teaching (“Software 2.0”). You now can harness the power of a computer to solve problems by teaching it to solve the problem, which is a more accessible way than explicit, manual programming.
On Re-Programming The American Dream: Taking healthcare as an example, Microsoft partners with hospitals and clinics to work on the therapeutic and vaccine development lifecycle, and AI as a diagnostic tool has truly been speeding up this process. The 19th-century examples from agriculture and industrial technologies are very similar. Kevin gave a few areas where AI moonshots should focus: solving healthcare and climate change issues, addressing an aging population and the changing nature of societal productivity, understanding the slowing population growth in the developed world, and optimizing agriculture to feed more people.
On Datasets for Ethical AI: We still have an enormous amount of work to do. Microsoft has its own division to focus on this (called Office of Responsible AI). There is an inclusivity challenge since big models are expensive to train. There is a need for Explainable AI in human-in-the-loop systems to assist humans in making decisions. It is crucial to understand the full continuum of things to build the tools safely and responsibly as humanly possible. Microsoft is working on a framework to identify AI failure cases, just like how they have done for the software revolution.
On Strategic AI Initiatives at Microsoft: Microsoft has been plugging fairly sophisticated flavors of ML into tools like PowerBI and Excel, where you don’t have to know much at all about machine learning to be able to solve classification or regression problems. They also acquired a company called Lobe a few years ago to make it easy for people to build AI tools.
7.3 — Fast.AI
Jeremy Howard and Rachel Thomas discussed why they started fast.ai, how it progressed from classes to a software platform, the importance of the community, and the future direction of fast.ai.
The Why: Back in the 2012-2014 era, deep learning emerged as a high-impact technology. However, a fairly homogeneous group was working on it (coming from the same schools and advisor groups, tackling the same problems, etc.). Where were the people with deep domain expertise who could leverage deep learning to solve more important problems? Jeremy and Rachel decided to start fast.ai so that everybody can harness this new technology.
The How: They started by teaching a course to make deep learning more accessible to domain experts, which served as a testbed for understanding learners' pain points. Later on, they came to see the four prongs of fast.ai as research, software, education, and community.
Start with Education: The first version of the course ran in 2016. There were very few publicly available resources at the time, and those that existed focused on theoretical math (building from first principles) and less on implementation. There was actually a lot of pushback against the course from folks in the deep learning community. The reality is that in each subsequent module of the course, they were digging into more and more low-level details of things like automatic differentiation and back-propagation. Many students came out of that course doing breakthrough work: getting published in academic papers, getting patents, building startups, etc. Furthermore, Jeremy and Rachel found a lot of hard edges. In fact, pretty much anything except computer vision was almost impossible to do in practice outside of academia. So after that first course, they went fully into research mode.
Move to Research: Jeremy brought up the example of the ULMFiT work, written with Sebastian Ruder, then a fresh Ph.D. student who had taken fast.ai, and published at a top computational linguistics conference. Many researchers said that transfer learning wouldn't work in NLP, but ULMFiT proved them wrong.
Then Build Software to Implement the Research: They built the software heavily on top of Stephen Merity's LSTM research code, which later became the fastai.text library. Users can install it and implement ULMFiT for their use cases.
Community Made It Special: Before the pandemic, fast.ai used to be taught in person at USF as an evening course, open to anyone not enrolled in a USF degree program. There were diversity and international fellowships for people all over the world. There are also very active online community forums on fast.ai, where people help each other with interviewing and reviewing projects.
The Future: Their focus has been moving gradually away from education towards software development. The software allows them to remove the barriers entirely. Furthermore, Rachel has focused more on the Data Ethics component of the course, which addresses and prevents the risks and harms of Deep Learning’s potential misuses.
If you skipped to the end, here are the key takeaways:
As AI practitioners, we should shift our mindset from building big data to building good data.
ML frameworks will increasingly support automation, large-scale workflows in the cloud, and real-world deployment.
Asking questions about KPIs of ML projects can help you assess a company’s ML maturity.
At large organizations, there are various practical challenges in deploying AI at scale — such as scaling training data, efficient training and inference, developing internal platforms, addressing bias and fairness, and building a culture that enables the best processes.
Self-supervision is the most promising direction to solve simulation, perception, and behavior prediction challenges within the autonomous industry.
There are other excellent sessions in the conference that I have not covered here — including fireside chats with Sam Altman and Fei-Fei Li + panel sessions with builders at Flexport, Aurora, Cruise, iRobot, HuggingFace, Brex, Blend, and more. I would highly recommend watching them.
If you had experience using tools to support building high-quality training data for the modern AI/ML stack, please reach out to trade notes and tell me more at khanhle.1013@gmail.com!
Lastly, congrats Scale on the massive $325M Series E funding round!