What I Learned From Attending Toronto Machine Learning Summit 2020

Last month, I had the opportunity to attend the Toronto Machine Learning Summit 2020, organized by the great people at the Toronto Machine Learning Society. I previously attended their MLOps event in the summer, for which I also wrote an in-depth recap here.

The summit aims to promote and encourage the adoption of successful machine learning initiatives within Canada and abroad. There was a variety of thought-provoking content tailored towards business leaders, practitioners, and researchers. In this long-form post, I would like to dissect the content from the talks that I found most useful.

The post consists of 4 sections:

  1. Case Studies

  2. Research, Business, and Strategy

  3. MLOps

  4. Quantum Computing


1 — Case Studies

1.1 — Combating Bias at Pinterest

Machine learning powers many advanced search and recommendation systems, and user experience strongly depends on how well these systems perform across all data segments. This performance can be impacted by biases, leading to a subpar experience for subsets of users, content providers, applications, or use cases. In her talk “Inclusive Search and Recommendations,” Nadia Fawaz describes sources of bias in machine learning technology, why addressing bias matters, and techniques to mitigate bias, with examples from her work on inclusive AI at Pinterest.

With 442 million global monthly active users, 240 billion pins saved, and 5 billion boards in 30 languages, Pinterest has an amazing dataset for search and recommendations tasks. The most basic task is to predict the likelihood that a pinner would interact with a pin — given the search queries, pinner features, pin features, and past pinner interactions with pins/boards. However, it’s not always easy to find the most relevant results. The majority of queries on Pinterest are less than three words, which presents an interesting serving challenge. Moreover, their current ranking algorithms are heavily influenced by what most people have engaged with over time. This means that some pinners have had to work harder to find what they were looking for. To build a more inclusive search experience, the R&D team at Pinterest defined the key pillars of inclusive AI, starting with analyzing bias at all development stages.

Here are various bias sources in machine learning that Nadia defined:

  • Societal bias is inherently present in the data — due to many diversity dimensions such as demographic, geographic, cultural, application-specific, implicit, etc.

  • Data collection bias entails serving bias, position bias, summary-based presentation bias, and repetitiveness bias.

  • Modeling bias includes statistical under-fitting (overly simple models with too few parameters and insufficient features), model fairness (disparities in performance across groups), the training objective (an aggregate loss function may favor the majority class, be hard to differentiate between error types, and focus on only a single utility), model structure (bias from model sub-components), and coarse model tuning (single or group thresholds are not robust enough).

  • Offline evaluation bias happens with evaluation data (imbalanced classes, biased labeling, static features) and evaluation metrics (coarse overall aggregates, accuracy might favor the majority class and hide performance for under-represented classes, disparities are not evaluated).

  • Experimentation bias occurs during A/B testing: your treatment doesn’t have the same effect on un-engaged users as it does on engaged users, but the engaged users are the ones who show up first and therefore dominate the early experimental results. If you trust the short-term results without accounting for and trying to mitigate this bias, you risk being trapped in the present: building a product for the users you have already activated instead of the users you want to activate in the future.

Nadia Fawaz — “Inclusive Search and Recommendations” (TMLS 2020)

It is desirable for Pinterest to reduce bias in their machine learning applications due to societal and legal requirements, their user-centric mission, and their high standard for technical craftsmanship. Here are the various techniques that Pinterest applies:

  • Randomization at data collection: For the top-k recommendation task, they used an explore-exploit strategy. During exploitation, the model selects items with high predicted scores. During exploration, the model collects feedback on items with lower predicted scores by sampling items at random.

  • Diversity re-ranking at serving: They intentionally boosted the pin scores for deeper skin tones. For models with multiple stages, they performed boosting both at the light-weight ranking and the full-ranking layers. This method is simple but requires manual tuning and coordination among multiple post-processors. Besides boosting, they also attempted fairness-aware re-ranking via greedy and dynamic programming algorithms that can self-adjust the recommendations to meet the target diversity distribution in top-k results (a greedy variant is sketched after this list).

  • Data augmentation: At the data collection stage, they generated synthetic data, performed negative sampling, and ensured diverse manual data labeling. At the modeling stage, they used techniques like SMOTE and IPS for sampling, revised the model via error analysis / architectural changes / ensembling methods, and augmented the training objective with fairness and diversity constraints.

  • Offline evaluation: During the evaluation phase, they ensured that the test dataset has good enough coverage. They looked at objective function beyond aggregates (by incorporating fairness notions to quantify disparities) and evaluation metrics beyond accuracy (by doing error analysis for each class label). Finally, they also experimented with open-source monitoring tools, brought humans into the loop, measured models in production, and re-evaluated their performance over time.
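
To make the greedy re-ranking idea concrete, here is a minimal, illustrative sketch in Python (not Pinterest’s implementation) that repeatedly picks the highest-scored remaining item from whichever group is furthest below its target share. The item IDs, groups, and target shares are made up for the example.

```python
from collections import Counter

def diversity_rerank(items, target_share, k):
    """items: list of (item_id, group, score); target_share: {group: fraction of top-k}."""
    remaining = sorted(items, key=lambda x: x[2], reverse=True)
    selected, counts = [], Counter()
    while remaining and len(selected) < k:
        # How far is each group below its target share in the list built so far?
        deficit = {g: target_share[g] * (len(selected) + 1) - counts[g]
                   for g in target_share}
        # Prefer the group with the largest deficit; break ties by predicted score.
        best = max(remaining, key=lambda x: (deficit.get(x[1], 0.0), x[2]))
        remaining.remove(best)
        selected.append(best)
        counts[best[1]] += 1
    return selected

# Toy example: three skin-tone buckets with equal target representation in the top 6.
items = [("p1", "light", 0.9), ("p2", "light", 0.8), ("p3", "medium", 0.7),
         ("p4", "deep", 0.6), ("p5", "deep", 0.5), ("p6", "medium", 0.4)]
print(diversity_rerank(items, {"light": 1/3, "medium": 1/3, "deep": 1/3}, k=6))
```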

Nadia Fawaz — “Inclusive Search and Recommendations” (TMLS 2020)

A specific example that Nadia brought up in her talk on how Pinterest mitigates bias is their skin-tone models (check out this post for more details). These are closed-box deep learning models for face detection that can extract the dominant color and threshold lightness range. Such utilities give users control, respect their privacy, improve the experience for users with deeper skin tones, and increase engagement with diverse content. During the offline evaluation, the Pinterest team designed various strategies to quantify bias and error patterns: labeling a high-quality, diverse golden dataset, using a confusion matrix to analyze error patterns, and choosing granular metrics per sensitive attribute + fairness metrics to quantify disparities. After several iterations of the model, they also augmented the data (by including body parts, partial faces, and men’s fashion) and created multi-task visual embeddings (fashion classification, beauty detection, internationalization), which contribute to the final skin-tone classification task. Here are specific takeaways that Nadia mentioned:

  • Start with diverse data.

  • Bias can come from modeling choices and the evaluation process.

  • Test the machine learning system again and again, at every step.

  • Quantify bias and analyze error patterns at granular attribute level for fast iteration.

  • Learn from errors, and make your Machine Learning models learn too.

  • Build iteratively to uncover complexity layers: first get the simple case right, then master production scale, and finally expand to harder cases and more product surfaces.

  • Overall accuracy is not synonymous with fairness. It’s crucial to manage biases and improve all metrics for all skin-tones proactively.

Nadia ended her talk with concluding remarks about the benefits of Pinterest’s Inclusive AI approaches. Firstly, they improve user representation and content provider exposure. Secondly, they help Pinterest understand and increase content diversity. Thirdly, they mitigate bias in machine learning models for embedding, retrieval, and ranking tasks. Finally, they enable Pinterest to grow an inclusive product globally. It is noted that inclusive AI is a challenge that goes beyond engineering. This challenge requires contributions from multi-disciplinary teams (product, inclusion and diversity, legal and ethics, communities and society feedback) to effectively model, measure, and address AI bias.

1.2 — Design Patterns for Recommendations at Twitch and Twitter

Building recommendation systems in production that can serve millions of customers goes way beyond just having a great algorithm. The scale of users, size of the catalog, and speed of reaction to user actions make such systems very challenging to build. A set of cooperating systems needs to be built to serve users’ needs. In his talk “Key Design Patterns For Building Recommendation Systems At Scale,” Ashish Bansal distilled learnings from building large-scale recommendation systems across companies like Twitch and Twitter into a set of commonly used design patterns.

He started with a few motivating examples:

  • In a system like Twitter, there are a variety of recommendation use cases: other users to follow (billions of items with months-to-years shelf life), relevant tweets (hundreds of millions of items with hours-long shelf life), and events/trends (hundreds of thousands of items with hours-long shelf life).

  • In a system like Netflix, hundreds of thousands of movies are recommended with years-to-decades shelf life.

  • In a system like Amazon, hundreds of millions of products are recommended with months-to-years shelf life.

Ashish Bansal — “Key Design Patterns For Building Recommendation Systems At Scale” (TMLS 2020)

If we classify recommendation systems by item volume and velocity, then four system-type patterns stand out (as seen above):

  • Few-Short Pattern: These systems capture real-time features and serve real-time inference. Session-based recommendations may be useful.

  • Few-Long Pattern: This is the best spot to be in. Common approaches include end-to-end deep learning and matrix factorization to capture long-lived item and user embeddings, batch pre-computation of the similarity scores, cache serving, etc.

  • Big-Short Pattern: A tough space. Real-time features make a huge difference, so common approaches include two-stage architectures (candidate generation + blender/ranker) and approximate nearest neighbor algorithms.

  • Big-Long Pattern: Here, the complexity lies in managing large models and large user/item data. All the techniques from the few-long pattern will work in this case.

Ashish Bansal — “Key Design Patterns For Building Recommendation Systems At Scale” (TMLS 2020)

Ashish then illustrated the two-stage architectural pattern with this nice diagram above:

  • In stage 1, the candidate generation layer narrows millions of items down to hundreds. This step must be quick (a few milliseconds). Tools like ElasticSearch are great examples; they pre-compute the similarity between items (using metrics like cosine distance).

  • In stage 2, the blending layer combines candidates from different sources, scores the candidates based on a utility function, and ranks them based on the scores. The blender could be rule-based, could be a model estimator, or could take in query parameters from the users.
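
As a rough illustration of this two-stage pattern, here is a sketch under simplifying assumptions (in-memory embeddings, brute-force cosine similarity standing in for ElasticSearch or an ANN index, and a dummy utility function), not any company’s production code:

```python
import numpy as np

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(100_000, 64))                     # pretend catalog embeddings
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)

def generate_candidates(user_vec, n=200):
    """Stage 1: narrow the full catalog down to a few hundred candidates, fast."""
    scores = item_vecs @ (user_vec / np.linalg.norm(user_vec))
    return np.argsort(-scores)[:n]

def blend_and_rank(candidate_ids, user_features, k=20):
    """Stage 2: score the small candidate set with a utility function and rank it."""
    utility = item_vecs[candidate_ids] @ user_features          # stand-in for a learned ranker
    return candidate_ids[np.argsort(-utility)][:k]

user_vec = rng.normal(size=64)
print(blend_and_rank(generate_candidates(user_vec), user_features=user_vec))
```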

He next discussed the three typical recommendation types:

  • User-User modeling involves measuring the similarity between users (useful for cold-start issues). Neighborhood-based methods based on user attributes would suffice.

  • User-Item modeling is the most common pattern. We use matrix factorization, factorization machines, or deep models to capture the user-item interactions directly.

  • Item-Item modeling entails association rules and market-basket analysis. Neighborhood-based methods based on item attributes would suffice.

Ashish concluded the talk with a couple of supporting system patterns that can help with recommendations at scale:

  • Impression store can track served recommendations. It is important to answer beforehand what constitutes an impression.

  • Feedback store can capture user feedback on the relevance of recommendations. It can be used as a filter for irrelevant items.

  • Explicit interests store allows users to guide recommendations. It can be challenging to incorporate such a mechanism into the model.

  • Session tracker tracks impressions to clicks and actions. It can be challenging due to the distance between impressions and other actions (replies, for example).

  • Label generator logs training data for future models in production. However, there is often a large imbalance between positive and negative labels, which may require re-weighting data samples.

1.3 — Moderating Comments at The Washington Post

Patrick Cullen and Ling Jiang gave an informative talk about raising the quality of online conversations with machine learning at The Washington Post (WP). In particular, they shared how they built a system that automatically moderates millions of reader comments.

At WP, comments provide a way for journalists to speak directly to the readers and build a sense of community, as readers share their views on important topics. The commenters are often the most active and engaged readers. However, trolls, bots, and incivility lower the quality of online conversations. Moderating the quality of conversations is a logical next step. WP receives more than 2 million comments a month, so relying on human moderators alone is cost-prohibitive. ModBot is an application that combines machine learning with human moderators to moderate the quality of conversations at a scale of millions of comments.

Patrick Cullen and Ling Jiang — “Raising The Quality of Online Conversations” (TMLS 2020)

As depicted in the diagram above, the ModBot API takes as input the comment and outputs a number between 0 and 1. Scores close to 0 indicate that the comment should be approved, and scores close to 1 indicate that the comment should be deleted from the site because it violates WP’s community guidelines.

ModBot includes a pre-filter, which is a rules-based system that flags comments containing banned words. If the pre-filter passes the input, the machine learning classifier scores the comment and returns the score and moderation decision in the API call.

Patrick Cullen and Ling Jiang — “Raising The Quality of Online Conversations” (TMLS 2020)

To train ModBot, the data science team at WP built a classifier that learns from training data using NLP techniques. They collected over 60,000 human-labeled comments. As noted above, deleted comments often have many offensive words, many hyperlinks, and special symbols. Approved comments, on the other hand, use neutral or positive words with substantial length. After being trained, ModBot can differentiate good comments from bad ones. They then use this trained model to score new, unseen comments.

They ran several different models using bag-of-words features with 10-fold cross-validation on an imbalanced dataset (70% approved and 30% deleted). These models include Logistic Regression, Support Vector Machines, Random Forest, Decision Trees, and Naive Bayes. After initial experiments, they found that Logistic Regression and Support Vector Machines performed better than the others.

Then, they engineered more features that can be predictive, including sentence count, word count, link count, email count, and special-character count. They also experimented with Convolutional Neural Networks and Recurrent Neural Networks, which both outperformed the linear models but are expensive and non-interpretable. Eventually, they settled on a Support Vector Machine model in production due to its simplicity and explainability.
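
For intuition, here is a hedged sketch of this kind of pipeline (not WP’s actual code) using bag-of-words features and a linear classifier. Logistic regression is used below because it yields a 0-to-1 score directly; per the talk, production ultimately used an SVM. The comments and labels are toy stand-ins for the ~60,000 labeled examples.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

comments = ["Great reporting, thank you!", "You are an idiot", "Spam spam click here"]
labels = [0, 1, 1]  # 0 = approve, 1 = delete (toy labels)

modbot = Pipeline([
    ("bow", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(class_weight="balanced")),  # helps with the 70/30 imbalance
])
modbot.fit(comments, labels)

# Score a new comment: probability that it should be deleted.
print(modbot.predict_proba(["What an idiot decision"])[:, 1])
```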

Patrick Cullen and Ling Jiang — “Raising The Quality of Online Conversations” (TMLS 2020)

Another fascinating aspect of this application is the “human-in-the-loop” component, both during training and during inference. In the comment above, ModBot suggested deleting it due to the word “idiot.” However, the human moderator approved the comment because the post allows criticism of public officials. To handle this common issue, the data science team added a named entity filtering layer to pre-process the comments that involve public figures.

In general, handling mislabeled training data requires an iterative process that includes the comments, ModBot API, the predictions, and the human reviewer. After ModBot predicts on comments, the human reviewer can modify the label. The revised data can be fed back into the training process to retrain the model for better accuracy.

Patrick Cullen and Ling Jiang — “Raising The Quality of Online Conversations” (TMLS 2020)

In production, ModBot uses a threshold for automatic moderation. As seen in the slide, anything above 0.8 will be automatically deleted and anything below 0.2 will be automatically approved. Human reviewers can step in if a reader flags a comment. The system also gives these reviewers flexibility to decide the threshold. Note here that the number of comments is not evenly distributed across this score range due to the dataset’s imbalanced nature. Because there is a tradeoff between automatic moderation and accuracy, the data science team needs to work with stakeholders to set this threshold in production.
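
The routing logic itself is simple; a small sketch follows, with illustrative thresholds that, as the talk stresses, would be chosen with stakeholders rather than hard-coded:

```python
def route_comment(score, delete_threshold=0.8, approve_threshold=0.2):
    """Map a ModBot-style score in [0, 1] to an action."""
    if score >= delete_threshold:
        return "auto-delete"
    if score <= approve_threshold:
        return "auto-approve"
    return "human-review"  # the middle band goes to moderators

for s in (0.05, 0.5, 0.93):
    print(s, route_comment(s))
```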

1.4 — Harnessing The Power of NLP at The Vector Institute

Developing and employing Natural Language Processing (NLP) models has become progressively more challenging as model complexity increases, datasets grow in size, and computational requirements rise. These hurdles limit many organizations’ access to NLP capabilities, putting the significant benefits advanced NLP can provide out of reach. Sedef Kocak gave a talk about a collaborative project conducted at The Vector Institute that explores how state-of-the-art NLP models could be applied in business and industry settings at scale.

For context, the Vector Institute drives excellence and leadership in Canada’s knowledge, creation, and use of AI to foster economic growth and improve Canadians’ lives. It was created by visionary scientists and entrepreneurs who have lived the challenges in creating commercial AI technologies. The institute has 500+ researchers with domain-leading expertise in all areas of machine learning, 900+ participants in its programs and courses, and 1000+ MS students enrolled in their academic offerings. It aims to bring industry and researchers together to apply state-of-the-art solutions to specific industry-related problems.

Current NLP solutions require massive infrastructure/computation and trained human resources. The goal of Vector’s NLP project was retraining deep language models at scale to significantly reduce the cost of training NLP models while increasing the accessibility and benefits for businesses and researchers. The project has three focus areas:

1 — Domain-Specific Training: The three domains they focus on are health, finance, and legal.

  • For the health domain, they pre-trained language representations in the biomedical domain by replicating BioBERT and fine-tuning it on Named Entity Recognition, Relation Extraction, and Question Answering tasks. They also conducted an experimental evaluation of Transformer-based language models in the biomedical domain to answer questions like (1) Does domain-specific training improve performance compared to baseline models trained on general-domain corpora? and (2) Is it possible to obtain comparable results from a domain-specific BERT model pre-trained on smaller-sized data? (A minimal fine-tuning sketch appears after the three focus areas.)

  • For the finance domain, they investigated use cases of Transformer-based language models in finance text. In particular, they came up with finance-specific training for BERT, created a finance training corpus that covers versatile styles and sources of finance text, and proposed a semi-automated strategy of generating fine-tuning tasks on any domain.

  • For the legal domain, they looked at tokenization and weight initialization approaches to adapt a contextualized language model to the legal domain.

2 — Pre-Training Large Models: This work addressed the limitations of pre-training large BERT models. They presented optimizations for improving single-device training throughput, distributing the training workload over multiple nodes and GPUs, and overcoming the communication bottleneck introduced by the large data exchanges over the network.

3 — Summarization, Question Answering, and Machine Translation: Finally, they also had other initiatives related to different NLP tasks, including (1) developing domain-specific text summarization datasets, (2) exploring masked sequence-to-sequence multi-node unsupervised machine translation, and (3) building question-answering systems in responding to the COVID-19 open research dataset challenge.
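
As a minimal sketch of the domain-specific fine-tuning mentioned for the health focus area (under assumptions, not the project’s actual code), the snippet below runs one token-classification training step with Hugging Face Transformers. The checkpoint is a generic placeholder (a BioBERT-style biomedical checkpoint would be substituted), and the labels are toy values.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "bert-base-cased"  # placeholder; swap in a biomedical checkpoint in practice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=3)

enc = tokenizer("Aspirin reduces fever in adults", return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])  # toy labels: tag every token as "O" (class 0) ...
labels[0, 1] = 1                             # ... except the first wordpiece, tagged as an entity

out = model(**enc, labels=labels)            # forward pass returns the token-classification loss
out.loss.backward()
torch.optim.AdamW(model.parameters(), lr=3e-5).step()  # one fine-tuning step
print("loss:", float(out.loss))
```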

Sedef Kocak — “Harnessing The Power of NLP at The Vector Institute” (TMLS 2020)

Sedef concluded the talk with a couple of key takeaways:

  • Large language models are challenging — due to their “black-box” nature, dataset size, hyper-parameter sensitivity, and computational resources.

  • Domain knowledge improves the performance of NLP tasks for the domains.

  • Small-sized datasets can be useful for model retraining.

  • Domain-specific pre-training could improve fine-tuning tasks.

  • Collaboration between different subject matter experts is a tough organizational challenge — due to time commitment, participant turnover, and knowledge localization.

  • Best practices to organize this sort of large-scale collaboration are (1) getting quick wins, (2) monitoring group progress, and (3) trying out experimental learning.

1.5 — Generating Synthetic Data at Arima

A synthetic dataset is a data object generated programmatically. It is often necessary for situations where data privacy is a concern or when collecting data is difficult or costly. Although it is a fundamental step for many data science tasks, an efficient and standard framework is absent. In his talk “A Machine Learning-Based Privacy-Preserving Framework For Generating Synthetic Data From Aggregated Sources,” Winston Li from Arima studies a specific synthetic data generation task called downscaling, a procedure to infer high-resolution information from low-resolution variables, and proposes a multi-stage framework.

Here is a quick primer about the synthetic population:

  • It is a statistical reconstruction of individual-level data where the ground truth is not available.

  • It is built from available sources, which are generally aggregated geographically.

  • It is statistically equivalent to a real population from a data science perspective.

Generally speaking, synthetic data is useful for scenarios when we need to work with multiple data sources — data is fragmented (each source has its own structure), privacy is required (data needs to be aggregated/anonymized to preserve privacy), and there are no obvious ways to link datasets (missing data). How can we create a data fusion to produce a more consistent, accurate, and useful dataset?

To fill this gap, in collaboration with other academic labs, the Arima team proposed a multi-stage framework called SynC (Synthetic Population via Gaussian Copula):

  • SynC first removes potential outliers in the data.

  • Then, SynC fits the filtered data with a Gaussian copula model to correctly capture dependencies and marginal distributions of sampled survey data (a compact sketch of this step appears after this list).

  • Finally, SynC leverages predictive models to merge datasets into one and then scales them accordingly to match the marginal constraints.

  • There are two key assumptions at play: (1) Correlation is preserved under aggregation, and (2) Synthetic data needs to be calibrated to match the given aggregated data.
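
To make the copula step more tangible, here is a compact sketch under simplifying assumptions (two continuous variables, empirical marginals, no outlier removal or marginal calibration); it illustrates the idea rather than the SynC implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
survey = np.column_stack([rng.gamma(2, 20, 500),      # e.g., an income-like variable
                          rng.normal(40, 12, 500)])   # e.g., an age-like variable

# 1) Transform each marginal to uniform ranks, then to standard normal scores.
u = (stats.rankdata(survey, axis=0) - 0.5) / len(survey)
z = stats.norm.ppf(u)

# 2) Fit the Gaussian copula: the correlation of the normal scores captures dependencies.
corr = np.corrcoef(z, rowvar=False)

# 3) Sample correlated normal scores and map them back through the empirical marginals.
z_new = rng.multivariate_normal(np.zeros(2), corr, size=1000)
u_new = stats.norm.cdf(z_new)
synthetic = np.column_stack([np.quantile(survey[:, j], u_new[:, j]) for j in range(2)])
print(synthetic[:3])
```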

The main data sources that Arima works with are open data (census, StatCan research), syndicated research (psychographics, healthcare, media usage, financials, leisure), and partnership data (credit card transactions, credit ratings, location tracking). In the end, their synthetic population is a national, individual-level dataset with 4000+ variables based on all Canadians. No personally identifiable information (PII) was used to respect privacy laws.

Winston Li — “A Machine Learning-Based Privacy-Preserving Framework For Generating Synthetic Data From Aggregated Sources” (TMLS 2020)

The Arima team also created a synthetic population matching API, which takes away the tedious work around data acquisition and cleaning. Data scientists can, as a result, focus on building the most robust machine learning models.

1.6 — Understanding Content at Tubi

Jaya Kawale gave a talk about how Tubi, an advertiser-based video-on-demand service that allows its users to watch content online, uses Natural Language Processing (NLP) to understand the content. With more than 33 million monthly active users and 30 thousand titles, Tubi aims to use machine learning to understand user preferences, improve video recommendations, influence buying decisions, address cold-start behavior, and categorize video content.

For a lot of the content, there is a large amount of textual data in user reviews, synopses, plot summaries, and even Wikipedia. Furthermore, there is a large amount of metadata such as actors, ratings, year of release, studio, etc. To make sense of them, the team at Tubi applied various NLP methods ranging from simple (continuous bag of words, Skipgram, word2vec, doc2vec) to complex (BERT, knowledge graphs) ones. Here are the lessons that Jaya shared:

  • Not all text is the same (e.g., reviews vs. subtitles vs. synopsis).

  • Different tasks require different texts (e.g., sentiment analysis vs. text summarization).

  • Averaging is a widely used method but can lead to information loss (e.g., multiple reviews for a title averaged together to generate a title embedding). A toy illustration follows this list.

  • Be careful with the choice of algorithms (e.g., BERT is more suitable for next sentence prediction). There is “no free lunch” in terms of algorithms and representations.

  • Pre-processing and cleaning up are critical.

  • Evaluation is hard but critical (e.g., embedding quality assessment on surrogate tasks).
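
A toy illustration of the averaging pitfall, with made-up review embeddings (the real system works with learned doc2vec/BERT-style vectors): averaging many reviews into a single title embedding washes out opposing signals.

```python
import numpy as np

review_embs = np.array([[ 0.9, 0.1],   # "hilarious comedy"
                        [-0.8, 0.2],   # "boring, fell asleep"
                        [ 0.7, 0.3]])  # "great for family night"

title_emb = review_embs.mean(axis=0)   # widely used, but lossy
print(title_emb)                       # the strongly negative review is nearly invisible
```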

Jaya Kawale — “Understanding Content Using Deep Learning For NLP” (TMLS 2020)

In the end, they built Spock, a platform for data ingestion, pre-processing, and cleaning. It generates a variety of embeddings for different use cases across the product. There are also tools to assess the quality of the embeddings via surrogate tasks. As seen above:

  • The inputs include first and third-party data, viewer-oriented data, and other content metadata.

  • All this information goes into Spock, where it is cleaned and pre-processed into products. These products can take the form of embeddings, models, or mappings that bring titles from the broader content universe into the Tubiverse.

  • Several use cases can use these content understanding products, such as addressing cold-start behavior, assessing content value, analyzing portfolios, setting up pricing tiers, augmenting search, seeding growth, pursuing new audiences, etc.

Jaya ended the talk with three concrete future directions for Spock: (1) improve natural language understanding to better construct embeddings, (2) handle different languages for new geographical regions, and (3) unify embeddings across different use cases.

1.7 — Addressing Cold-Start Issues at Tractable

Despite the remarkable results achieved by deep neural networks in recent years, they are data-hungry, and their performance relies heavily on the quality and size of the training data. In real-world scenarios, this can significantly increase the time-to-value for businesses, considering that collecting huge amounts of labeled data is very time-consuming and costly. This phenomenon — known as the cold start problem — is a pain point for almost any company that wants to scale its machine learning applications. In their talk “Overcoming The Cold Start Problem — How To Make New Tasks Tractable,” Azin Asgarian from Georgian and Franziska Kirschner from Tractable demonstrated how this problem could be addressed by aggregating data across sources and leveraging previously trained models.

Tractable is a UK-based AI company that uses computer vision to automate accident and disaster recovery. Here’s how their product works for the vehicle damage use case:

  1. The vehicle owner uploads an estimate and pictures of the damaged vehicle to his/her claims management system.

  2. Tractable’s AI, which is trained on millions of real images of car accidents and repair operations, then compares the pictures and the estimate to accurately judge the repair operations.

  3. The vehicle owner receives an assessment, flagging any potential inaccuracies.

Operating in 13 countries across 3 continents, Tractable’s AI deals with cars worldwide, which look different and are repaired differently even for the same damage. As a result, a model that classifies car damage well in one country will perform poorly in new geographies due to shifts in the data. The Tractable team partnered with the Georgian team to overcome this cold-start problem and quickly and efficiently adapt to these unique data shifts. The proposed method aims to improve performance for one customer (the target) by leveraging the enormous amount of data and models available from other customers (the source).

Azin Asgarian — “Overcoming The Cold Start Problem — How To Make New Tasks Tractable” (TMLS 2020)

In particular, the cold-start problem is usually caused by three types of data shifts: input shift, output shift, and conditional shift. To address input shift, they rely on two types of transfer learning methods.

  • Instance-based transfer learning methods try to re-weight the samples in the source domain to correct for marginal distribution differences. These re-weighted instances are then directly used in the target domain for training. Using the re-weighted source samples helps the target learner use only the source domain’s relevant information (a rough sketch of this re-weighting idea follows this list).

  • Feature-based transfer learning methods map the feature spaces from both source and target domains into lower-dimensional spaces while minimizing the distances between similar samples in both domains.
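
A rough sketch of the instance re-weighting idea, under simplifying assumptions (synthetic Gaussian data, a logistic-regression domain classifier as the density-ratio estimator); this illustrates the technique, not Tractable’s pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
source = rng.normal(0.0, 1.0, size=(1000, 5))   # abundant data from existing markets
target = rng.normal(0.5, 1.2, size=(200, 5))    # scarce data from the new geography

X = np.vstack([source, target])
d = np.concatenate([np.zeros(len(source)), np.ones(len(target))])  # 1 = target domain

domain_clf = LogisticRegression(max_iter=1000).fit(X, d)
p_target = domain_clf.predict_proba(source)[:, 1]
weights = p_target / (1 - p_target)             # estimated p_target(x) / p_source(x)
weights *= len(weights) / weights.sum()         # normalize to preserve the effective sample size

# These weights would then be passed as sample_weight when training the target-domain model.
print(weights[:5])
```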

To address output and conditional shifts (which happen more frequently in the real world), they use parameter-based transfer learning methods, which transfer knowledge through the shared parameters of the source and target domain learner models. The key idea behind parameter-based methods is that a well-trained model on the source domain has learned a well-defined structure, and if two tasks are related, this structure can be transferred to the target model. They also experimented with ensemble learning, which concatenates various pre-trained models for various tasks into a strong model for a new task.
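
A generic sketch of the parameter-based approach: freeze a backbone that stands in for a source-trained network, attach a new head, and fine-tune only the head on the small target dataset. Everything here (layer sizes, data) is illustrative, not Tractable’s model:

```python
import torch

backbone = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(),
                               torch.nn.Linear(64, 32), torch.nn.ReLU())
for p in backbone.parameters():      # pretend this was trained on the source customers
    p.requires_grad = False          # freeze the shared, transferable structure

head = torch.nn.Linear(32, 4)        # new head for the target market's classes
model = torch.nn.Sequential(backbone, head)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

X_target, y_target = torch.randn(64, 128), torch.randint(0, 4, (64,))
for _ in range(50):                  # fine-tune on the limited target data
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(X_target), y_target)
    loss.backward()
    opt.step()
print("target loss:", loss.item())
```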

Azin Asgarian — “Overcoming The Cold Start Problem — How To Make New Tasks Tractable” (TMLS 2020)

Tractable’s AI builds a visual damage assessment module using the metadata from car images, which then creates an abstract, domain-independent representation of the damages. This representation contains all the necessary information for the domain adaptation task. Having access to it, they built a domain-adaptable layer that adapts the repair methodology for each geography. The business outcomes of starting warm were highly encouraging — quicker and more efficient expansion to new markets, reduced costs for data collection and labeling, faster customer on-boarding, earlier revenue recognition, and elevated brand.


2 — Research, Business, and Strategy

2.1 — From Model Explainability to Model Interpretability

With the widespread use of machine learning, there have been serious societal consequences from using black-box models for high-stakes decisions, including flawed bail and parole decisions in criminal justice. Explanations for black-box models are not reliable and can be misleading. If we use interpretable machine learning models, they come with their own explanations, which are faithful to what the model actually computes. In the talk “Stop Explaining Black-Box Models for High-Stakes Decisions and Use Interpretable Models Instead,” Professor Cynthia Rudin from Duke University discussed the chasm between explaining black-box models and using inherently interpretable models.

The talk was largely based on Cynthia’s Nature article last year, which discusses the key issues with explainable and interpretable machine learning. The article's core thesis is that rather than trying to create inherently interpretable models, there has been a recent explosion of work on ‘explainable ML,’ where a second (post hoc) model is created to explain the first black-box model. This is problematic. Explanations are often not reliable and can be misleading. If we instead use inherently interpretable models, they provide their own explanations, which are faithful to what the model actually computes. Here are the most relevant points that I took away from the talk/article:

  • There is a widespread belief that more complex models are more accurate, meaning that a complicated black box is necessary for top predictive performance. However, this is often not true, particularly when the data are structured, with a good representation of naturally meaningful features. When considering problems that have structured data with meaningful features, there is often no significant difference in performance between more complex classifiers (deep neural networks, boosted decision trees, random forests) and much simpler classifiers (logistic regression, decision lists) after preprocessing.

  • Explainable ML methods provide explanations that are not faithful to what the original model computes. If the explanation was completely faithful to what the original model computes, the explanation would equal the original model, and one would not need the original model in the first place, only the explanation. This leads to the danger that any explanation method for a black-box model can be an inaccurate representation of the original model in parts of the feature space.

  • Even if both models are correct (the original black box is correct in its prediction and the explanation model is correct in its approximation of the black box’s prediction), the explanation may leave out so much information that it makes no sense. For instance, the saliency maps in computer vision can help determine what part of the image is being omitted by the classifier, but this leaves out all information about how relevant information is being used. Knowing where the network is looking within the image does not tell the user what it is doing with that part of the image.

  • In high-stakes decisions, there are often considerations outside the database that need to be combined with a risk calculation. But if the model is a black box, it is challenging to manually calibrate how much this additional information should raise or lower the estimated risk.

  • Black box models with explanations can lead to an overly complicated decision pathway ripe for human error. The multitude of typographical errors has been argued to be a type of procedural unfairness, whereby two identical individuals might be randomly given different parole or bail decisions.

Cynthia Rudin — “Stop Explaining Black-Box Models for High-Stakes Decisions and Use Interpretable Models Instead” (TMLS 2020)

The remainder of the talk went over recent research on interpretable machine learning that Cynthia’s groups have worked on:

  • RiskSLIM (Risk-Supersparse-Linear-Integer-Models) is a specialized cutting-plane algorithm to construct optimal sparse scoring systems for arbitrarily large sample sizes and a moderate number of variables. It adds cutting planes only when the solution to a linear program is integer-valued (a toy scoring system in this style is sketched after this list).

  • 2HELP2B is a simple scoring system to estimate seizure risk in acutely ill patients undergoing continuous electroencephalography. The model was built using the RiskSLIM method above, which is designed to produce accurate, risk-calibrated scoring systems with a limited number of variables and small integer weights. This system has been proven to be a quick, accurate tool to aid clinical judgment of the risk of seizures in critically ill patients.

  • CORELS (Certifiably Optimal Rule Lists) is a custom discrete optimization technique that produces rule lists over a categorical feature space with optimal training performance (according to the regularized empirical risk). This simple method is as accurate for recidivism prediction as the very complicated COMPAS model.

  • This Looks Like That introduces the prototypical part network (ProtoPNet) that dissects an input image by finding prototypical parts and combines evidence from the prototypes to make a final classification. The model thus reasons in a way that is qualitatively similar to the way ornithologists, physicians, and others would explain to people how to solve challenging image classification tasks. It achieves an accuracy that is on par with some of the best-performing deep models and provides a level of interpretability that is absent in other interpretable deep models.

  • Concept Whitening is a mechanism that alters a given layer of the neural network to allow a better understanding of the computation leading up to that layer. When a concept whitening module is added to a ConvNet, the axes of the latent space are aligned with known concepts of interest. Their experiments show that concept whitening provides a much clearer picture of how the network gradually learns concepts over layers than post-hoc analysis of a neural network does.
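
To show what a sparse integer scoring system in the RiskSLIM style looks like in practice, here is a toy example with entirely hypothetical features and point values (not the actual 2HELP2B variables): a handful of small integer points are summed and passed through a logistic link.

```python
import math

POINTS = {"prior_seizure": 2, "abnormal_eeg_pattern": 1, "icu_admission": 1}
INTERCEPT = -3  # small integer weights keep the model checkable by hand

def risk(features):
    """features: {name: bool}; returns an estimated probability of the outcome."""
    score = INTERCEPT + sum(POINTS[f] for f, present in features.items() if present)
    return 1 / (1 + math.exp(-score))

print(round(risk({"prior_seizure": True, "abnormal_eeg_pattern": True,
                  "icu_admission": False}), 3))
```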

Cynthia Rudin — “Stop Explaining Black-Box Models for High-Stakes Decisions and Use Interpretable Models Instead” (TMLS 2020)

2.2 — Algorithmic Decision-Making

Patrick Hall (bnh.ai), Talieh Tabatabaei (TD Bank), and Richard Zuroff (Element AI) had a lively panel discussion about algorithmic decision-making and the common types of AI accidents that can happen, of which algorithmic discrimination and data privacy stand out.

Regarding practices to ensure responsible models, Talieh walked through the model lifecycle management process currently adopted at TD Bank. The first stage is to identify and develop the models. The second stage is to assess the models’ risks by examining their complexity, user population, intended usage, and data reliability. The third stage is to validate the model performance. The final stage is to implement and monitor the models. Throughout this process, the roles and responsibilities are clearly defined: model owners are senior executives and data scientists, model risk managers and validators are responsible for measuring the risk and validating the impact, and an internal audit team is in charge of validating the effectiveness of the first and second lines of defense. Furthermore, there is documentation that captures model design to ensure reproducibility, along with test requirements on feature selection and model fairness/explainability/monitoring.

Adding to the process above, Patrick believed that we need to hold people accountable for the model results. Given the massive number of parties in this process, it is very likely that existing model risk management won’t help people foresee the edge cases of machine learning in the digital age. Incorporating privacy, fairness, and security metrics into this process seems to be the most practical strategy.

Bringing in his legal perspective, Richard pointed out an under-explored area of current law practice, which is to treat AI agents as employees. This means that we can hire, train, and supervise AI agents similarly to how we treat human employees. He also emphasized the importance of the validation team in the framework that Talieh brought up above. So far, model explainability has focused on data scientists as consumers of the products, not on the model validators. We will need technical tools tailored to the validators so they can explain these models.

Here are the final thoughts from the panelists that I could capture:

  • Most stakeholders are obsessed with only the model performance. There should be a multi-objective optimization function that balances accuracy with stability and fairness. Applying constraints and multiple objectives forces the stakeholders to go through the next level of addressing tough socio-technical questions.

  • Doctors are held accountable for the impact of their practices. There should be the same standard for AI engineers (also a cultural problem in data science).

  • There is a broader issue on under-specifying the objective function. Check out Google’s recent paper on under-specification in machine learning models.

  • There has yet to be a concrete definition of fairness. Existing definitions are even incompatible with each other. A potential solution is to push engineers and scientists to learn the legal quantitative definition of fairness.

2.3 — Winning Enterprise Customers

Ari Kalfayan from AWS gave a tactical talk that includes practical strategies, growth hacks, and specific techniques for machine learning startup founders to land their first 50 enterprise customers.

Given that machine learning is still developing as an industry, Ari argued that customers buy the vision (your take on where the field is headed), not the features. He emphasized that the three tenets for a successful product launch are:

  • Listen to customers: Ask questions to dive deeper! Your customers will tell you everything you need to know to solve their problems. Use data to guide your decision making.

  • Have a vision and empower your team: You are the master of your destiny through your vision. Have a strong vision and empower others to do the same. Pin it to your main Slack channel.

  • Resilience: Startups are an ultra marathon. Develop your emotional intelligence and use creativity to solve your problem.

The talk then covered the three common stages of startup development. In the “start” stage, you are trying to land your first customer. Here are the actions that Ari recommended:

  • Validate that businesses will purchase your product.

  • Founders run the sales and marketing process.

  • Launch a simple website with ongoing content marketing (a one-pager and deck for bonus points).

  • Generate leads from warm intros.

  • Use your first users to work out kinks with your product.

  • Early customer insights are more valuable than the contract amount.

  • Celebrate your wins.

To find the customer, you need to deeply understand your sales opportunity by knowing why they buy products, identifying who the users and key decision-makers are, and bringing all the stakeholders together.

  • The best practices are to start with early adopters and find a flagship brand to be your lighthouse customer.

  • It’s critical to prioritize your product roadmap by understanding customers’ pain points.

  • Furthermore, use your creativity to reach customers in these unprecedented times (events, education, podcasts, referrals, etc.).

  • In terms of pricing, ask about your customer’s budget, know how much your competitors are charging, develop a simple pricing model, and create a vision of how the relationship will expand over time. Always try to get a read on customers’ reactions and willingness to pay.

In the “grow” stage, you are trying to land your next 10 customers. Here are the actions that Ari recommended:

  • Generate marketing leads outside of your network.

  • Set up an inbound marketing system.

  • Test different price points and pricing models.

  • Set up tracking and CRM (easy and scalable reporting dashboards).

  • Refine company messaging to highlight customers’ biggest pain points.

  • Hold quarterly business reviews with each of your customers.

  • Ask for references, and build content with your customers.

To grow your list of customers, you will want to experiment with different demand generation techniques: create a hit list of your top 100 enterprise customers, tap into warm introductions (investors, friends, former co-workers, and advisors are key), and use different channels to reach customers (blog posts, email marketing, podcasts, etc.).

In the “scale” stage, you are trying to land your next 50 customers. Here are the actions that Ari recommended:

  • Go from 5 demos to 50 demos a week.

  • Accelerate the lead generation channels that are working: LinkedIn, retargeting, updates, announcements, customer reviews, spotlight ads, conferences, SEO, social and product announcements.

  • Start thinking about renewals and customer lifetime value.

  • Begin building specialization within the sales team.

  • Customer success becomes a key role.

  • Start to think about hiring a sales engineer.

The most important thing at this stage is to reduce friction. You would like to clearly distinguish treatment for free and paid PoCs, automate payments for lower tiers, talk with your customers’ legal and procurement teams early in the sales process, and possibly use discounts and expiration dates to drive your closing process.


3 — MLOps

3.1 — Machine Learning Production Myths

In “Machine Learning Production Myths,” Chip Huyen from SnorkelAI outlined the challenges and approaches to designing, developing, and deploying machine learning systems. She first laid out the differences between ML in Research and ML in Production:

  • In research, the objective is to achieve strong model performance. In production, different stakeholders have different objectives (the ML engineer wants the highest model accuracy, the product manager wants the fastest inference, the sales guy wants to sell more ads, the manager wants to maximize profits, etc.)

  • In research, the computational priority is fast training with high throughput (number of predictions made). In production, the computational priority is fast inference with low latency (time to serve a prediction).

  • In research, components of fairness and interpretability are good-to-have. In production, these components are vital.

Chip then introduced the 6 machine learning production myths:

  1. Deploying is hard. Fact: Deploying is easy. Deploying reliably is hard.

  2. You only deploy one or two ML models at a time. Fact: big companies have hundreds of models running in production (Booking, Uber).

  3. If we don’t do anything, model performance remains the same. Fact: concept drift is a frequent issue, where the statistical properties of the inputs and outputs change over time. A tip here is to train your models on data generated at least 6 months ago and then test them on current data to see how much worse they perform (a rough backtest sketch follows this list).

  4. You won’t need to update your models as much. Fact: the DevOps standard is very high at reputable companies (50 times a day at Etsy, 1,000s of times a day at Netflix, every 11.7 seconds at AWS, etc.). The combination of DevOps and Machine Learning will be crucial. For online systems, you want to update them as fast as possible.

  5. Most ML engineers don’t need to worry about scale. Fact: The majority of engineers work for big companies, where scale is of high priority.

  6. ML can magically transform your business overnight. Fact: Efficiency improves with maturity.
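
A rough sketch of the backtest tip from myth #3, with synthetic data and hypothetical column names: train on older data, then evaluate both on a held-out slice of the same period and on the most recent data, and compare how much performance decays under drift.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
dates = pd.date_range("2019-01-01", "2020-11-01", freq="D")
df = pd.DataFrame({"date": rng.choice(dates.to_numpy(), 20_000),
                   "f1": rng.normal(size=20_000), "f2": rng.normal(size=20_000)})
recent = df["date"] > "2020-05-01"
# Simulate concept drift: the relationship between f2 and the label flips over time.
df["y"] = ((df["f1"] + np.where(recent, -df["f2"], df["f2"])) > 0).astype(int)

old, new = df[~recent], df[recent]
old_train, old_test = train_test_split(old, random_state=0)
model = GradientBoostingClassifier().fit(old_train[["f1", "f2"]], old_train["y"])

print("AUC, same-period holdout:",
      roc_auc_score(old_test["y"], model.predict_proba(old_test[["f1", "f2"]])[:, 1]))
print("AUC, recent data:        ",
      roc_auc_score(new["y"], model.predict_proba(new[["f1", "f2"]])[:, 1]))
```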

Chip then defined the 4 phases of ML adoption:

  • The first phase happens without any machine learning. “If you think that machine learning will give you a 100% boost in performance, then a heuristic will get you 50% of the way there.”

  • The second phase involves the usage of simple machine learning models. You should start with a simple model that allows visibility into its mechanism to validate your hypotheses and the model pipeline.

  • The third phase entails optimizing these simple models. Now you can switch up the objective function, experiment with more data, engineer more robust features, and ensemble various models.

  • The fourth (final) phase moves towards the adoption of complex models. Only after the previous phases should you attempt complex/state-of-the-art methods. Also, don’t be afraid of incorporating rule-based components into these methods.

The talk ended with a nice slide on the 10 principles of good machine learning systems design. If you are interested in learning more about these principles, stay tuned for her Stanford course next spring!

Chip Huyen — “Machine Learning Production Myths” (TMLS 2020)

3.2 — Machine Learning Pipelines

Benedikt Koller from Maiot gave an intriguing talk with the clever title “A Tale of a Thousand Pipelines” — where he shared real-world learnings from putting deep learning models rapidly from research to production through solid operations and orchestration.

A basic DevOps build pipeline consists of many individual processing steps, where each step depends on the previous one and produces its unique artifact. How can we build Machine Learning pipelines that ensure reproducibility — such that we can quickly transition from one experiment to another?

Below are different pipeline types that Benedikt built out at Maiot:

  • Pipeline 0 — Standardization: Initially, they converted the known-good code into a pipeline.

  • Pipeline 25 — Reusing Pipelines: They generated declarative configurations instead of reusing code for faster reproduction.

  • Pipeline 100 — Processing Backends: To keep an acceptable speed on giant datasets (Terabyte+), they updated their tech stack with Apache Beam — which offers a comprehensive pipelining syntax and supports backends like Google Dataflow, Spark, and Flink.

  • Pipeline 101 — Training Backends: Training on large datasets is no fun. At this stage, they split and pre-processed data with distributed workers, then trained their models via a cloud-based platform (AWS, Azure, or GCP).

  • Pipeline 250 — Caching: Caching each stage of their pipeline (reading data, splitting data, or pre-processing data) enabled faster iterations on domain knowledge and model architecture. This practice also helped them reduce their skyrocketing cloud costs (a bare-bones caching sketch follows this list).

  • Pipeline 500 — Data Versioning: At this stage, writing manual SQL queries to collect slices of data wasn’t practical. Thus, they versioned their data by reusing snapshots from the data commits for the distributed splitting and preprocessing steps later in the pipeline.

  • Pipeline 750 — Standardized Evaluation: Using standardized evaluation tools like TensorBoard, TFMA, TFDV, or the What-If Tool helped them battle bias within isolated datasets.

  • Pipeline 1000 — Automated Serving: Finally, they created ownership for any end-to-end machine learning projects by making the serving/deployment layer automated, discoverable, and immutable.
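
A bare-bones sketch of the caching idea from Pipeline 250 (not Maiot’s implementation), keyed on a hash of each step’s name, configuration, and inputs, so an unchanged step reuses its previous artifact instead of re-running:

```python
import hashlib, json, os, pickle

CACHE_DIR = ".pipeline_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

def cached_step(name, config, inputs, fn):
    """Run fn(inputs, **config) once per unique (name, config, inputs) combination."""
    key = hashlib.sha256(json.dumps([name, config, inputs], sort_keys=True,
                                    default=str).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, f"{name}-{key[:12]}.pkl")
    if os.path.exists(path):               # cache hit: reuse the stored artifact
        with open(path, "rb") as f:
            return pickle.load(f)
    artifact = fn(inputs, **config)        # cache miss: run the step and persist the result
    with open(path, "wb") as f:
        pickle.dump(artifact, f)
    return artifact

train, test = cached_step("split_data", {"test_size": 0.2}, list(range(100)),
                          lambda data, test_size: (data[:80], data[80:]))
print(len(train), len(test))
```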

Benedikt Koller — “A Tale of a Thousand Pipelines” (TMLS 2020)

Here are the key learnings that Benedikt concluded the talk with:

  • Version your code, your models, and your data simultaneously.

  • Standardize and reuse code and configuration.

  • Track parameters and artifacts across all pipeline steps.

  • Leverage backends for compute-intensive tasks.

  • Treat evaluation as a first-class citizen in the pipeline.

  • Create ownership and autonomy within your teams.

If you are interested in making machine learning pipelines reproducible, check out ZenML — Maiot’s extensible, open-source MLOps framework for using production-ready pipelines in a simple way.

3.3 — Machine Learning Model Monitoring

The data scientist’s job does not finish when the model is shipped. Models degrade and break in production. The failure modes of machine learning systems are also different from those of traditional software applications. They require purpose-built monitoring and debugging. However, this aspect is often overlooked in practice. In her talk “How Your ML Model Will Fail,” Emeli Dral from Evidently AI showed how to set up model monitoring from scratch and prioritize different metrics.

She first introduced common issues with data quality and data integrity:

  • Data Processing Issues: these include broken data pipelines, data infrastructure updates, wrong data sources, etc.

  • Data Loss at the Surface: these include broken sensors, logging errors, database outages, etc.

  • Data Schema Change: these changes can happen in the upstream system, in the external APIs, in catalog updates, etc.

  • Broken Upstream Model: one model’s broken output might correlate to another model’s corrupted feature.

  • Data Drift: this refers to a change in feature distribution. For instance, in a social consumer app, users get onboarded from a new channel — leading to a change in the distribution of organic/paid search and social demographics.

  • Concept Drift: this refers to a change in the underlying relationships of features. They may have the same distributions but exhibit new patterns.
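
A simple per-feature drift check can already catch several of the issues above; here is a sketch (not Evidently’s implementation) that compares reference and current distributions with a two-sample Kolmogorov-Smirnov test on made-up features:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
reference = {"sessions": rng.poisson(5, 2000), "age": rng.normal(35, 8, 2000)}
current   = {"sessions": rng.poisson(5, 500),  "age": rng.normal(41, 8, 500)}  # "age" has drifted

for feature in reference:
    _, p_value = stats.ks_2samp(reference[feature], current[feature])
    flag = "DRIFT" if p_value < 0.05 else "ok"
    print(f"{feature:10s} p={p_value:.4f} {flag}")
```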

To combat these issues, it’s crucial to monitor models in production. The enterprise reality is that 67% of organizations do not monitor their models for accuracy or drift, which is mind-blowing. There are two major ways to approach this:

  • Add machine learning metrics to service-health monitoring tools such as Prometheus or Grafana.

  • Build machine learning-focused reports and dashboards (via Business Intelligence tools like Tableau/Looker or customized libraries like Matplotlib/Plotly).

How can you start? Emeli listed these factors to consider:

  • Use Case Importance: economic value, error costs, estimated risks, etc.

  • Complexity: data source diversity, pipeline complexity, batch or real-time inference, immediate or delayed response, etc.

  • Team Resources: available engineering resources.

Emeli Dral — “How Your ML Model Will Fail” (TMLS 2020)

She then presented a neat monitoring checklist for practitioners to follow:

  1. Does It Work? (Service Health) Call your model with a different random seed to make sure that it runs properly.

  2. How Does It Perform? Did Anything Break? (Model Performance) Check sanity and ranges in the model output distribution. Compare predicted results with ground-truth results in a hold-out dataset (using interpretable metrics and error distribution) to ensure model quality. Publish a dashboard and share business metrics with the rest of the org!

  3. Where Does It Break? Where To Dig Further? (Data Quality and Integrity) Examine missing data, data range/type compliance, and changes in feature correlation.

  4. Is The Model Still Relevant? (Data and Concept Drift) Look at the key feature drivers and check their distributions visually/statistically.

Emeli recommended looking at model performance by segment, model bias and fairness, data outliers, and model explainability for more comprehensive monitoring. The talk ended with a nice slide summing up the pragmatic approach to model monitoring. I’d recommend checking out Evidently AI’s Machine Learning Monitoring blog series for further exploration!


4 — Quantum Computing

4.1 — The State of Quantum Computation

In “The Quest for The Final Quantum Computer,” Alba Cervera-Lierta from the University of Toronto provides a solid overview of the current state of quantum computation. Quantum mechanics is a field that studies the microscopic or high-energy phenomena of physics. It started at the beginning of the 20th century to solve thermodynamic problems, but it was not until 1926 that we had the postulates of quantum mechanics (the mathematical formalism needed to understand this theory). Later on, during the “infancy” and “childhood” periods (1930–1980), we started to learn and control the quantum mechanical world. Various applications were born — including the transistor, solar cells, GPS, the laser, magnetic resonance, etc. — leading to the first quantum revolution. In the “adolescence” period (1980–2000), we started to study the field of quantum information in much more detail and discovered its practicality for communication and computing (Shor’s factorization algorithm, quantum logic gates, etc.). We are currently in the second quantum revolution, with inventions such as quantum chips, quantum teleportation with satellites, and the quantum supremacy experiment.

Alba Cervera-Lierta — “The Quest for The Final Quantum Computer” (TMLS 2020)

The theoretical framework for Quantum 2.0 includes quantum mechanics, information theory, and quantum information. The three big applications are:

  • Quantum Communication (which includes quantum cryptography) — how to communicate using quantum phenomena.

  • Quantum Computing (which includes quantum simulation) — this was the main focus of Alba’s talk.

  • Quantum Sensing (which includes quantum metrology) — how to construct highly precise sensors.

A quantum computer is a device capable of processing data in a quantum mechanical form. The quantum part is the hardware that processes the data; the classical part is the software that controls the quantum hardware. Inside this quantum device lie the qubits, the minimal units of quantum information.

Alba Cervera-Lierta — “The Quest for The Final Quantum Computer” (TMLS 2020)

How does it work?

  • In classical information theory, we have a bit that can be zero or one. In quantum mechanics, we have qubits, which are superpositions of zero and one.

  • If we want to describe the state of two qubits, we need to keep track of four basis states (0–0, 0–1, 1–0, 1–1). The count keeps doubling as more qubits are involved, making quantum simulation exponentially hard on a classical computer (a small numerical sketch follows below).

  • Another property that arises from quantum superposition is entanglement, where the states of two qubits can no longer be described independently of each other.

  • From an experimental point of view, a qubit is a quantum system with two well-defined states that we can control with pulses, lasers, etc.

Alba Cervera-Lierta — “The Quest for The Final Quantum Computer” (TMLS 2020)
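
To make the exponential-scaling point more tangible, here is a small numerical sketch (my own, not from the talk) that represents qubit states as complex vectors with NumPy: one qubit needs 2 amplitudes, n qubits need 2**n, and the Bell state at the end is an entangled state that cannot be factored into two single-qubit states.

```python
import numpy as np

zero = np.array([1, 0], dtype=complex)  # |0>
one = np.array([0, 1], dtype=complex)   # |1>

# A single qubit is a superposition a|0> + b|1> with |a|^2 + |b|^2 = 1.
plus = (zero + one) / np.sqrt(2)
print(plus)  # [0.707, 0.707]: equal superposition of |0> and |1>

# Two qubits live in a 4-dimensional space (|00>, |01>, |10>, |11>),
# so simulating n qubits requires tracking 2**n complex amplitudes.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)  # Hadamard gate
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

# Hadamard then CNOT produces the Bell state (|00> + |11>) / sqrt(2),
# which cannot be written as a product of two single-qubit states.
bell = CNOT @ np.kron(H @ zero, zero)
print(bell)  # [0.707, 0, 0, 0.707]
```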

Various organizations are involved in this quantum computing movement. On the hardware side, there are well-known companies like Google, IBM, Microsoft, etc., as well as startups with academic origins such as Rigetti, Xanadu, Q-Ctrl, etc. On the software side, there are companies including Zapata, Entropica Labs, 1QBit, Riverlane, QCWare, etc. These organizations are tackling experimental challenges that include:

  • How to choose the right technology for scalability (think million-qubit chips)? Superconducting circuits, ion traps, photons, and NV centers are all viable choices.

  • How to deal with the noise of qubits? More specifically, how to write quantum error correction codes? How to develop noise-resistant (variational) algorithms? How to enable fault-tolerant quantum computation?

Alba ended the talk with a call to action. The field needs people from different backgrounds to develop new quantum algorithms (variational or fault-tolerant), build new applications (in chemistry, materials, finance, biology, physics, etc.), create new architectures (superconducting circuits, trapped ions, photons, etc.), design better quantum control (noise reduction, error correction, gate fidelities, etc.) and better quantum-classical interfaces (programming languages), and think critically about quantum computational complexity so that we solve the right kinds of problems.

4.2 — Quantum Software

Continuing the quantum theme, one of the fundamental goals in the emerging field of quantum machine learning is to build trainable quantum computing algorithms. It turns out that we can, with very minimal changes, port many existing ideas, algorithms, and training strategies from deep learning over to the quantum domain. This allows us to train quantum computers largely as we do neural networks, even using familiar software tools like TensorFlow and PyTorch. In his talk “Software For Quantum Machine Learning,” Nathan Killoran from Xanadu gave a high-level overview of the key ideas that make this possible.

Let’s quickly touch on deep learning fundamentals. Hardware advancements, workhorse algorithms, and specialized, user-friendly software are the main ingredients that made deep learning successful. In a typical deep learning workflow, we (1) build a neural network model with trainable parameters, (2) define a cost function, (3) train the model with gradient descent to minimize the cost function, and (4) use the model for classification/prediction tasks at inference time.

According to Nathan, deep learning is only a small piece of a larger perspective called differentiable programming: “any code should be trainable, not just the machine learning models.” The process for differentiable programming is very similar to that of deep learning: we (1) write code with trainable parameters, (2) define a cost function, (3) train the code with gradient descent, and (4) fine-tune the code to specific purposes after optimization.

There are already many software libraries for differentiable programming on the classical computing side: TensorFlow, PyTorch, MXNet, Zygote, etc. The key feature of all of them is automatic differentiation, meaning that the gradients needed to train the model or the code are handled automatically by the software.
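
As a toy illustration of this recipe, here is a hedged sketch in PyTorch (one of the libraries listed above): an arbitrary piece of code with a trainable parameter, a cost function, and a gradient-descent loop in which the gradients are handled automatically.

```python
import torch

# (1) "Code" with a trainable parameter, not necessarily a neural network.
theta = torch.tensor(1.0, requires_grad=True)

# (2) A cost function defined on that code.
def cost(t):
    return (torch.sin(t) - 0.5) ** 2

# (3) Train with gradient descent; autodiff computes d(cost)/d(theta).
optimizer = torch.optim.SGD([theta], lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    loss = cost(theta)
    loss.backward()
    optimizer.step()

# (4) The optimized code is now tuned to its purpose: sin(theta) is close to 0.5.
print(theta.item())
```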

All the ideas from deep learning can transfer to the realm of quantum machine learning:

  • There are new quantum processors that have become available in the last few years (hardware advancement).

  • Methods like gradient descent apply to the quantum domain (workhorse algorithms).

  • There’s also an emerging software ecosystem for quantum machine learning, such as PennyLane, TensorFlow Quantum, and Tequila (specialized/user-friendly software).

Nathan Killoran — “Software For Quantum Machine Learning” (TMLS 2020)

What do these quantum software frameworks do?

  • We start with a quantum circuit, which consists of a sequence of parametrized gates. The goal is to measure the circuit’s output, which depends smoothly (and differentiably) on the parameters of the gates. This means that the output is ripe for the same optimization tools that we use in deep learning.

  • We can picture the quantum circuit as a black box U(θ) that takes an input x and the gates’ parameters θ, then returns an output f(θ) that happens to be classically intractable to compute.

  • To compute the gradients and handle such intractability, we can use the “parameter-shift rule” trick. The circuit gradients can be determined by querying the circuit at a slightly forward-shifted value and a slightly backward-shifted value for a particular parameter. This trick can be done automatically by the software.

  • After computing the gradients, we can train the circuit using gradient descent and move it towards a particular point in the parameter space. This routine is also called a “variational circuit,” which forms the first tier of “hybrid” quantum-classical algorithms (a minimal sketch follows this list).
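
Here is a minimal sketch of such a variational circuit in PennyLane (discussed below), assuming the built-in “default.qubit” simulator; the manual parameter-shift line reproduces by hand what the library automates.

```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=1)

@qml.qnode(dev, diff_method="parameter-shift")
def circuit(theta):
    # A single parametrized gate followed by an expectation-value measurement.
    qml.RX(theta, wires=0)
    return qml.expval(qml.PauliZ(0))

theta = np.array(0.3, requires_grad=True)

# Parameter-shift rule: query the circuit at forward- and backward-shifted values.
manual_grad = (circuit(theta + np.pi / 2) - circuit(theta - np.pi / 2)) / 2

# Train the circuit output with gradient descent, just like a neural network.
opt = qml.GradientDescentOptimizer(stepsize=0.2)
for _ in range(50):
    theta = opt.step(circuit, theta)
```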

Popular (deep) machine learning libraries use computational graphs, in which computations are broken down into individual steps. Each step is a node, and (directed) edges indicate data flow in the graph. Quantum machine learning software replaces some of these nodes with quantum nodes: the larger model hands part of the computation off to the quantum computer, which evaluates a function on those inputs and returns the outputs to the classical machine learning library.

Nathan Killoran — “Software For Quantum Machine Learning” (TMLS 2020)
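
As an illustrative sketch of such a quantum node, PennyLane can wrap a QNode as a PyTorch layer so that it slots into the classical computational graph; the layer sizes and circuit structure below are arbitrary choices for demonstration.

```python
import torch
import pennylane as qml

n_qubits = 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def qnode(inputs, weights):
    # Encode classical inputs into the circuit, then apply trainable entangling layers.
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weight_shapes = {"weights": (3, n_qubits)}  # 3 entangling layers
quantum_layer = qml.qnn.TorchLayer(qnode, weight_shapes)

# The quantum node sits in the graph like any other layer, so the hybrid
# model below is end-to-end differentiable with standard PyTorch training.
model = torch.nn.Sequential(
    torch.nn.Linear(4, n_qubits),
    quantum_layer,
    torch.nn.Linear(n_qubits, 1),
)
```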

Those are all the steps to make fully end-to-end differentiable hybrid models using classical machine learning tools that connect to quantum computers. If you are interested in training quantum machine learning algorithms, check out:

  • PennyLane: Xanadu’s open-source Python library for differentiable programming of near-term quantum devices.

  • Xanadu’s educational materials for learning how to do quantum machine learning (quantum transfer learning, doubly stochastic gradient descent, quantum GANs, etc.).

  • The quantum machine learning hackathon next February.

4.3 — Quantum Generative Models

Despite significant effort, there has been a disconnect between most quantum ML proposals, the needs of ML practitioners, and the capabilities of near-term quantum devices, which stands in the way of a conclusive demonstration of meaningful quantum advantage in the near future. To round out the quantum theme, I will briefly summarize “Quantum-Assisted Machine Learning with Near-Term Quantum Device” from Alejandro Perdomo-Ortiz of Zapata Computing, which provides concrete examples of intractable ML tasks that could be enhanced with near-term devices.

The first insight is to work on intractable problems of interest to machine learning experts (e.g., generative models in unsupervised learning). A machine learning practitioner will be familiar with classical generative models such as Generative Adversarial Networks, Restricted Boltzmann Machines, and Variational Autoencoders. In the quantum domain, the equivalents are Quantum GANs, Quantum Boltzmann Machines, and Quantum Circuit Born Machines.

The second insight is to focus on hybrid quantum-classical approaches by leveraging available quantum resources to cope with hardware constraints. As seen in the diagrams below:

  • In the quantum-assisted classical ML paradigm, the training data, the model, and the predictions are all classical. We only involve the quantum device during training, for the aspects that classical machine learning models struggle with.

  • In the classical-assisted quantum ML paradigm, the quantum computer is not just involved but committed: the machine learning model itself is quantum in nature.

  • In the ML-assisted quantum ML paradigm, classical machine learning helps train the parameters of the quantum model.

Alejandro Perdomo Ortiz — “Quantum-Assisted Machine Learning with Near-Term Quantum Device” (TMLS 2020)

We know that the success of classical AI came from open-source software, algorithms, and hardware. By the same analogy, the future of quantum AI will depend on the invention of new algorithms, the development of new software, and the design of robust hardware (quantum processing units).


That’s the end of this long recap. Follow the Toronto Machine Learning Society for their events in 2021! Personally, I have learned a ton of new things (especially around the intersection of Quantum Computing and Machine Learning) and have been contemplating ways to incorporate them into my work. My future articles will continue covering lessons learned from conferences and summits in 2021 🎆