Last month, I attended REWORK's AI Applications Virtual Summit, which showcases machine learning tools and techniques for improving the financial, retail, and insurance experience. As a previous attendee of REWORK's in-person summits, I have always enjoyed the unique mix of academia and industry, which enables attendees to meet AI pioneers at the forefront of research and explore real-world case studies to discover the business value of AI.
In this long-form blog recap, I will dissect the content from the talks that I found most useful. The post covers 13 talks divided into 3 sections: (1) AI in Finance and RegTech, (2) AI in Retail and Marketing, and (3) AI in Insurance.
AI In Finance and RegTech
1 - AI Beyond Pattern Recognition: Decision-Making Systems
Existing AI use cases are mostly about prediction and pattern recognition. But what about decision-making? Ultimately, we want to build AI systems that enable us to make good decisions. Kevin Kim (Data Scientist at Nasdaq) addressed this vision with his proposed framework for an AI-enabled decision-making system.
There are three key benefits of framing AI system design as a design-optimization problem:
First, it's easy to set up industry standards for building robust AI systems, as we can achieve scalability and clear division of labor.
Second, we can take advantage of pre-existing methods in design optimization because it becomes much easier to estimate cost and identify resource requirements.
Third, we can meet organizational needs and constraints because we can make them quantifiable.
Any decision-making system consists of two key components: the model of perception that makes observations about the world and the model of action that takes actions based on that understanding of the world. Each of these two models can be an ensemble of smaller models (termed decision-making components). If we layer these components together, we get a diagram like the one below, where the big green box is our decision-making system that observes the world and takes actions, and the world provides feedback to it. This view is very similar to the Reinforcement Learning setting, where the decision-making system is the RL agent and the world is the environment.
What are the traits of the decision-making components?
Rationality: capacity (how rational can I be?) and consistency (how often can I reach my capacity?)
Bias: how biased am I while making the decision?
Awareness: how aware am I of different actions to take?
Assuming these components are quantifiable, we can frame our decision-making system as a design optimization problem: the design objectives are to maximize rationality and minimize bias, while the design constraints are time, money, and compute resources.
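To make the framing concrete, here is a minimal toy sketch (my own illustration, not Kevin's actual system) of a decision-making system whose components carry quantifiable traits that we can score under a cost budget:

```python
from dataclasses import dataclass

@dataclass
class Component:
    """A decision-making component with quantified traits."""
    name: str
    rationality: float   # 0..1, capacity to make the "right" call
    bias: float          # 0..1, lower is better
    cost: float          # e.g., compute or dollar cost

def design_score(components, budget):
    """Design objective: maximize rationality and minimize bias, subject to a cost budget."""
    if sum(c.cost for c in components) > budget:
        return float("-inf")  # violates the design constraint
    rationality = sum(c.rationality for c in components) / len(components)
    bias = sum(c.bias for c in components) / len(components)
    return rationality - bias

# Model of perception and model of action, each an ensemble of smaller components.
perception = [Component("data_pipeline", 0.7, 0.10, 2.0),
              Component("predictive_model", 0.8, 0.20, 5.0)]
action = [Component("security_selection", 0.6, 0.15, 1.0),
          Component("portfolio_optimizer", 0.9, 0.05, 4.0)]

print(design_score(perception + action, budget=15.0))  # compare candidate designs by this score
```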
Kevin then brought up a use case of this framework on creating a scalable smart-beta portfolio. There are 4 components: the data pipeline, the predictive model (model of perception), the security selection, and the portfolio optimization (model of action). He had to deal with questions like how many sub-models to use (for the model of perception) and how to choose between multiple options and turnover constraints (for the model of action).
Portfolio optimization is a specific type of predictive resource allocation and control problem. There are similar problems beyond the finance domains, such as smart energy grids, tax budget planning, poker, etc. Furthermore, we can extend this design theory to all decision-making strategies for societies and organizations. Looking ahead, we want to identify more traits of decision-making components, define the notion of optimality, and research better ways models can interact with the world.
2 - Reduce Compliance Risks with Trustworthy AI
There was a 500% increase in regulatory changes in developed countries between 2008 and 2016. Regulatory technology (RegTech) can increase the efficiency of compliance processes in response. However, the current state of compliance is grim: lack of awareness of true compliance risk, inefficient and ad-hoc processes, information silos, and the failure to understand the relationships between requirements and the organization. Compliance maturity can be categorized into three levels:
Reactive (Past): responds to compliance risks that have already happened.
Proactive (Present): actively seeks identification of compliance risks through analysis of processes.
Predictive (Future): analyzes data to predict potential/future compliance risks and preemptively mitigate risk (e.g., using AI).
Brian Alexander (CEO of Omina Technologies) presented the AI-enabled compliance framework as displayed above:
Starting with public and private data sources, we look at potential compliance data sources such as acts, statutes, regulations, rules, standards, best practices, contracts, etc. (both structured and unstructured data).
The next stage is the content filter, where we look at compliance risk, lines of business, products/services, and jurisdictions.
Then we have the requirement identifier. Here, we need an embedded definition of "requirements" to identify organizational compliance requirements and exclude requirements unrelated to the Regulated Entity.
The last stage is the requirement associator, where we predict associations between the requirement and an organization, given details about the business units, policies, controls, products/services, etc. The output is a compliance risk rating.
Human control is a central checkpoint across these stages to ensure trust and to teach the system through feedback loops.
Now let's look at the AI-enabled regulatory change framework:
Again, we have the public and private compliance data sources just like the previous framework.
The next phase is a filter and change identifier to identify changes to risk-rated requirements and present data relevant to a regulatory change.
The compliance risk identifier is where we identify the impact on the organization and risk-rate the regulatory changes.
The last phase is a compliance treatment predictor to predict compliance treatment (to mitigate risk and maintain compliance) and provide prediction confidence.
Human control remains a central piece of this framework.
Let's explore NLP methods that we can use for both frameworks:
Knowledge-based methods measure semantic similarity between terms based on information derived from one or more knowledge sources. Advantages include the flexibility of a domain-specific knowledge source, the ability to handle ambiguity and measure word block similarity, and computational simplicity. The main disadvantage is the dependence on knowledge sources (which may require updating frequently).
Corpus-based methods measure semantic similarity between terms based on information derived from large corpora. Advantages include the ability to adapt to different languages and the definitional flexibility of word embeddings. The main disadvantage is the requirement of a large and clean corpus, which can be resource-intensive to process.
Neural-based methods measure semantic similarity using convolution and/or pooling operations. Neural networks generally outperform other methods and are capable of estimating similarity between word embeddings. However, they are typically "black boxes" by nature, require large computational resources, and can run into privacy issues when processing sensitive data.
Hybrid methods can offer the advantages of different methods used and minimize the disadvantages of using any single method. However, this approach may require significant organizational resources depending on the method combination.
Effective, trustworthy NLP needs four key components:
Knowledge sources: Ontological/lexical databases, dictionaries, thesauri, etc. (WordNet, Wikipedia, BabelNet, Wiktionary). We can also experiment with legal-, compliance-, and risk-specific sources (such as legal knowledge interchange format, financial industry business ontology, Investopedia, law dictionaries, regulatory definitions, and more).
Organizational data: eGRC platform data, legal/compliance event data, and internal organizational mappings or compliance responsibilities.
Word embeddings: These are vector representations of words, where vectors retain underlying linguistic relationships (word2vec, fasttext, BERT, GloVe, etc.)
Evaluation scheme: We need datasets to evaluate semantic similarity solution performance. These are word or sentence pairs with similarity values. Open-source datasets include STS, SICK, SimLex-999, LiSent, WS-353, WiC, etc.
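To make this concrete, here is a minimal sketch of measuring semantic similarity between a regulatory requirement and internal policy text with off-the-shelf sentence embeddings (my own illustration using the sentence-transformers package and made-up texts, not Omina's actual pipeline):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical texts: a regulatory requirement and two internal policy snippets.
requirement = "Firms must report suspicious transactions to the regulator within 30 days."
policies = [
    "The compliance team files suspicious activity reports within one month of detection.",
    "Marketing materials must be approved by the brand team before publication.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence-embedding model would do
req_vec = model.encode(requirement, convert_to_tensor=True)
pol_vecs = model.encode(policies, convert_to_tensor=True)

# Cosine similarity in embedding space: higher means the policy likely addresses the requirement.
scores = util.cos_sim(req_vec, pol_vecs)[0]
for policy, score in zip(policies, scores):
    print(f"{score.item():.2f}  {policy}")
```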
In conclusion, Brian argued that using trustworthy AI to reduce compliance risk helps increase efficiency with AI-enabled automation and build stakeholder trust to adopt AI-enabled compliance processes.
3 - Fund2Vec: Mutual Funds Similarity Using Graph Learning
Dhagash Mehta explored the use case of product similarity in his work as the senior manager for investment strategies and asset allocation at Vanguard. Product similarity is a frequently arising problem in most business areas (such as recommendation systems). For example, given a product, what are other similar products? Or how similar (or different) are two given products?
In Vanguard's case, this problem specializes to finding similar mutual funds or exchange-traded funds. Identifying mutual funds that are similar with respect to their underlying portfolios has many applications:
Sales and marketing: knowing a customer has a competitor's fund, we can proactively convince the customer to switch to a similar home-grown product.
Alternative portfolio construction: for a given portfolio consisting of competitors' funds, we can construct an alternative portfolio with the same risk-return profile but consisting only of home-grown funds.
Portfolio diversification: two or more similar funds in a portfolio may unintentionally reduce diversification.
Similar fund with a different theme: we want to find a similar fund but with other attributes (e.g., similar ESG fund).
Competitors' analysis: a fund manager can compare various aspects of their managed fund with other similar funds managed by competitors.
Tax-loss harvesting: we want to move from one fund to another similar one for tax-loss harvesting.
Launching new products: we want to launch a new fund similar to one popular in specific markets.
What are the current approaches for product similarity?
Use third-party categorization (e.g., Morningstar/Lipper categories): This is known to rely partly on a qualitative, partially black-box, and sometimes irreproducible process. More importantly, no similarity ranking is provided.
Compute the overlap between two portfolios (with the Jaccard index, weighted Jaccard index, etc.): This captures the bigger picture, but care is needed when granular details matter.
Compute the Euclidean distance between pairs of portfolios in the chosen variables-space: This captures linear relationships.
Compute the cosine similarities between vectors corresponding to different portfolios in the chosen variables-space: This also captures linear relationships.
Many other unsupervised ML techniques (such as clustering): These usually capture only linear relationships or don't scale well.
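For reference, the overlap-based baselines above boil down to a few lines of code (a toy sketch with made-up holdings vectors):

```python
import numpy as np

# Toy holdings of two funds over a shared 4-asset universe (hypothetical weights).
fund_a = np.array([0.30, 0.25, 0.25, 0.20])
fund_b = np.array([0.35, 0.30, 0.10, 0.25])

def weighted_jaccard(u, v):
    """Overlap of holdings: sum of element-wise minimum weights over the maximum weights."""
    return np.minimum(u, v).sum() / np.maximum(u, v).sum()

def cosine_similarity(u, v):
    """Linear similarity of the two holdings vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(weighted_jaccard(fund_a, fund_b), cosine_similarity(fund_a, fund_b))
```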
As you can see, traditional methods are either qualitative, prone to biases and often not reproducible, or are known not to capture all the nuances (non-linearities) among the portfolios from the raw data. Dhagash's idea is to reformulate the data of mutual funds and assets as a network, use a graph neural network to identify the embedded representation of the data, and compute similarity in the learned lower-dimensional representation.
More specifically, he was working with roughly 1,100 index and exchange-traded funds and 16,000 assets. The holdings data contains the weight of each asset in a given fund. The idea is to represent the funds and assets as a bipartite network with two distinct types of nodes (funds + ETFs and assets). Bipartite networks are used to investigate various social networks, movie-actor networks, protein-protein interactions, genome networks, flavor-ingredient networks, etc.
The problem of finding similar products is now transformed into finding similar nodes on the network. There are many ways nodes can be similar to one another on a network. His idea is to create as many such network similarity features as possible and use unsupervised learning to identify overall similar nodes. Since creating similarity features for all the nodes may not even be scalable, he relied on Node2Vec (2016), a method capable of learning such features from the raw data:
Node2Vec follows multiple random walks from all the nodes while interpolating between breadth-first search and depth-first search using different hyper-parameters.
Node2Vec's sampling strategy accepts 4 arguments: the number of walks (random walks generated from each node in the graph), the walk length (how many nodes are in each random walk), the return hyper-parameter P (which controls the probability of going back to a node after visiting another node), and the in-out hyper-parameter Q (which controls the probability of exploring undiscovered parts of the graph).
The final travel probability is a function of the previous node in the walk, P and Q, and the edge weight.
Eventually, Node2Vec learns a lower-dimensional (chosen number of dimensions) embedded representation of the entire graph using a language model called word2vec. This embedded representation can be used to obtain similarities of the nodes, clustering (communities), link prediction, etc.
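Here is a simplified sketch of the Node2Vec idea on a toy fund-asset bipartite graph (my own illustration with hypothetical tickers; the real implementation uses the optimized Node2Vec algorithm rather than this naive walk):

```python
import networkx as nx
import numpy as np
from gensim.models import Word2Vec

def biased_walk(graph, start, walk_length, p, q):
    """One Node2Vec-style walk, biased by the return parameter p and in-out parameter q."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = list(graph.neighbors(cur))
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(np.random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nbr in neighbors:
            w = graph[cur][nbr].get("weight", 1.0)
            if nbr == prev:                  # going back: scaled by 1/p
                weights.append(w / p)
            elif graph.has_edge(nbr, prev):  # staying close: BFS-like
                weights.append(w)
            else:                            # moving outward: DFS-like, scaled by 1/q
                weights.append(w / q)
        probs = np.array(weights) / sum(weights)
        walk.append(np.random.choice(neighbors, p=probs))
    return [str(n) for n in walk]

# Toy bipartite fund-asset graph; edge weight = portfolio weight (hypothetical tickers).
G = nx.Graph()
G.add_edge("FUND_A", "AAPL", weight=0.06)
G.add_edge("FUND_A", "MSFT", weight=0.05)
G.add_edge("FUND_B", "AAPL", weight=0.04)
G.add_edge("FUND_B", "XOM", weight=0.03)

walks = [biased_walk(G, n, walk_length=10, p=1.0, q=0.5)
         for n in G.nodes() for _ in range(20)]      # several walks per node

# word2vec over the walks yields the node embeddings; similar funds end up close in this space.
model = Word2Vec(walks, vector_size=16, window=5, min_count=1, sg=1)
print(model.wv.most_similar("FUND_A"))
```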
Once they learn the lower-dimensional embedded representation of the data, they are then ready to do further analysis. Since their goal was to identify similar funds, they use a few domain-specific metrics to determine the optimal number of clusters in the present work. For the future, Dhagash and colleagues look forward to refining the results (more features and richer datasets for funds) and experimenting with sophisticated clustering methods.
4 - Transformation of Core Business Function Areas Using Advanced Analytics and AI
Financial institutions have already started to embrace AI as part of their core business strategy. AI investments are focused on enhancing customer and risk-control analytics and on transforming and modernizing core business function areas. Nitesh Soni (Director of Advanced Analytics and AI for Technology at Scotiabank) shared why this modernization of core business function areas is important for Scotiabank's corporate strategy:
Improved Customer Experience: Scotiabank is leveraging big data and AI ethically to deliver personalized solutions and enable even better customer experiences with the bank.
Improved Operational Efficiency: Scotiabank is leveraging big data and AI ethically to improve productivity and reduce operational risk to the bank.
Building a Data-Driven Culture: A key part of Scotiabank's AI strategy has been to build a strong talent pool of data scientists and data engineers, fully integrating them with business analytics professionals who have deep banking knowledge.
Customer Trust and Data Ethics: Scotiabank's business is built on trust, so they have a responsibility to safeguard their customers’ data and use it for good.
Establishing the AI strategy framework with embedded analytics for the modernization of business core functions requires a variety of elements:
Holistic Engagement: a buy-in from top leadership, alignment with business/data/analytics/technology, coordination with privacy/compliance/legal.
Evolutionary Approach: descriptive → diagnostic → predictive → prescriptive analytics.
Reusability and Scalability: transition from building foundational to scaling solutions.
Collaboration: cross-functional team with a mix of skills and perspectives, small core AI team focused on solving the problems.
Agile Execution: nimble delivery and collaborative solutions, staffing of specialized resources based on project needs, including SMEs, and custom training on best practices in analytics.
Shorter Project Cycle: deliver high value in a shorter time, develop MVPs, and adopt a test-and-learn mentality.
An example of the AI strategy framework is shown above, including an engagement component and a delivery component.
During the engagement framework, it's crucial to set clear business objectives (based on customer and financial KPIs, corporate strategy), translate those objectives into specific requirements, then estimate and plan activities (based on business value and readiness of data, technology, people, etc.)
During the delivery framework, it's equally important to execute rapidly and receive continuous feedback post-release.
There are various use cases where a blend of AI technologies can be used to solve problems:
Network analytics: The IT function is expected to provide stable and resilient technology systems. Any system failure (VPN, network, email, etc.) can lead to downtime, which can significantly impact operating costs and productivity in the business. AI can help conduct pattern analytics, identify the characteristics of network behavior preceding a failure, and predict network failures before they happen.
Vulnerability management: A flaw in a system's software, applications, or servers can leave it open to attack. A weakness in the computer system itself, in a set of procedures, or in anything else can leave information security exposed to a threat. Self-serve analytics can classify vulnerabilities and detect assets, stitch multiple data sources together to create a cohesive picture, and automate data feeds for continuous monitoring. AI can help (1) optimize remediation efforts and maximize risk reduction by exploiting the vulnerability data, and (2) predict vulnerabilities based on risk by exploiting internal and external security data.
Cybersecurity: Data loss can happen in multiple ways (theft/loss of laptops, emails, connecting to an external network, etc.). We can embed AI to detect and prevent data loss by enhancing traditional policy-based rules and analyzing massive alert data. Additionally, AI can help analyze the massive volume of email logs, mine email content, and prioritize the avalanche of alerts.
Nitesh ended the talk by emphasizing that there are certainly challenges in enabling, executing, and adopting AI across function areas, but progress is on the way.
AI In Retail and Marketing
5 - Zero-to-Hero: Solving the NLP Cold Start Problem
Mailchimp is the world's largest marketing automation platform. Users send over a billion emails every day through the platform. This mass of marketing text data creates many opportunities to leverage natural language processing to improve and create content for users. However, like many natural language processing (NLP) practitioners, data scientists at Mailchimp have found annotating text data to be costly, time-consuming, and in some cases, legally prohibited. So how do they work around it? Muhammed Ahmed did a deep dive into how Mailchimp uses state-of-the-art NLP models and unlabeled data to cold start NLP products.
Given an idea to build a product that involves NLP, we can train a model and engineer a product so that users can contribute labeled data back to the system. This is also known as the data flywheel. The product then enters a virtuous cycle: better dataset -> better model -> better product. However, the first version of the model usually requires a lot of training data to get this feedback loop going. This is known as the cold-start problem.
So why does Mailchimp care about intent? Via user research, they found that their users expect Mailchimp to play an active role in their marketing endeavors. Users would like Mailchimp to generate designs and surface marketing recommendations for them, which requires Mailchimp to know their marketing objective. This calls for intent-aware products, and intent sometimes needs to be predicted since Mailchimp doesn't always have the opportunity to ask.
Let's walk through two major use cases for intent classification at Mailchimp:
Creative Assistant is a content engine whose generated copy is intent-aware, ensuring that (1) generated designs align with a user's original intent, and (2) generated designs and style families are tailored to a user's intent.
Smart Recommendations surface thoughtful recommendations based on what users intend to do and recommend an intent-focused campaign as an action.
A common approach to tackle the cold start problem is to use human annotators. However, this approach is costly and time-consuming, even more so for specialized tasks, and it might face legal hurdles for sensitive documents.
An alternative, promising research approach is Zero-Shot Learning. It learns a classifier on one set of labels and then evaluates on a different set of labels that the classifier has never seen before. More recently, especially in NLP, it's been used broadly to mean getting a model to do something that it was not explicitly trained to do. Zero-shot learning has been used in question answering, language modeling, natural language inference, and more.
Muhammed then zoomed in on the natural language inference (NLI) task: Given two sequences of text, predict whether they contradict, stay neutral, or entail each other.
The input data includes a premise and many hypotheses. The premise is a campaign text, while the hypotheses are statements for the intent classes.
These inputs are then passed through the pre-trained NLI Transformer model one by one.
The entailment scores are the output predictions.
Since they use a Transformer model, they can inspect the attention weights to sanity-check their hypothesis statements. In the example above, the phrase "Post a Review" indicates the output class.
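For readers who want to try the NLI-based zero-shot approach, here is a minimal sketch using the Hugging Face zero-shot classification pipeline (the model choice and candidate intent labels are my own illustration, not Mailchimp's actual ones):

```python
from transformers import pipeline

# NLI-based zero-shot intent classification; model and labels are illustrative only.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

campaign_text = "We just restocked your favorite sneakers - grab a pair before they're gone!"
intents = ["promote a product", "announce an event", "ask for a review", "share a newsletter"]

result = classifier(campaign_text, candidate_labels=intents,
                    hypothesis_template="The goal of this email is to {}.")
print(result["labels"][0], result["scores"][0])   # top intent and its entailment-based score
```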
Next, Mailchimp decided to build a dataset rather than using the model out of the box. There are two reasons for this decision:
Compute: It took them about 1 week to process 250k examples on a V100 GPU (the most powerful GPU available on GCP)
Dataset flexibility: They were able to rigorously quality test the source dataset or add out-of-domain examples.
For their actual training procedure:
They first collected data using sampled campaigns and zero-shot hypotheses statements manually crafted via experiments. The data were then passed through the pre-trained Transformer, and the label outputs were used as the training data.
The second phase took the training dataset from the first phase and passed it through different vectorization/embedding techniques. Then, they trained a downstream intent classifier.
The outcome is a Python package used in various parts of the deployment process to ensure that the intent prediction stays the same no matter where people access it.
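A minimal sketch of this kind of distillation, assuming we already have zero-shot pseudo-labels from a teacher model like the one above (the texts, labels, and the TF-IDF plus logistic regression student are purely illustrative, not Mailchimp's actual setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy pseudo-labeled data: campaign texts paired with the top intent predicted by the
# zero-shot "teacher". In practice this would be hundreds of thousands of rows.
campaigns = [
    "Leave us a review and get 10% off your next order",
    "Join our live webinar next Tuesday at 5pm",
    "Our summer collection just dropped - shop the new arrivals",
    "Tell us how we did and rate your recent purchase",
]
pseudo_labels = ["ask for a review", "announce an event",
                 "promote a product", "ask for a review"]

# A cheap "student" classifier distilled from the zero-shot labels.
student = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
student.fit(campaigns, pseudo_labels)

print(student.predict(["Rate your order and share your feedback"]))
```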
6 - Power Up Your Visual AI with Synthetic Data
Computer vision is rapidly changing the retail landscape for customer experience and in-store day-to-day logistics like inventory monitoring, brand logo detection, shopper behavior analysis, and autonomous checkout. High-volume labeled data is critical to training any computer vision model efficiently. Unfortunately, traditional methods of training models with real-world data are becoming a big bottleneck to the faster deployment of these vision models: 70-80% of the time in a typical computer vision workflow is spent on data collection and labeling. James Fort (Senior PM of Computer Vision at Unity) explained how computer vision engineers use Unity to get faster, cheaper, and less biased access to high-quality synthetic training data and accelerate model deployment.
Real-world data may have thousands of hand-collected images, is annotated manually and prone to error, is biased to what is available, and is sometimes not privacy compliant. On the other hand, synthetic data can have up to millions of computer-generated images, is automatically and perfectly annotated, is randomized for unbiased performance, and is 100% privacy compliant.
However, it can be daunting to get started with synthetic data, especially for 3D assets. Unity has the technology to produce synthetic datasets with structured environments and randomizations that lead to robust model performance.
When creating a dataset in Unity, the assets brought into Unity must match the requirements of the model being trained. Once the 3D assets are created for a project, they are set up to behave correctly frame by frame and provide error-free labeling.
Domain randomization is a technique that helps build robust models by programmatically varying parameters in a dataset. In each frame, the specific objects, positioning, occlusion, and more can vary - allowing for a diverse set of images from even a relatively small set of objects. The objects of interest can then be labeled with simple 2D or 3D bounding boxes or more complex forms of labeling (like segmentation).
Everything about the environment can be randomized to create diversity in the dataset. For example, lighting, textures, camera position, lens properties, signal noise, and more are all available for randomization to ensure that the dataset covers the breadth of the use cases.
If you are interested in using Unity Computer Vision datasets for your synthetic data needs, check out their demo and case studies!
7 - AI in Fashion Size & Fit
Online fashion shopping has been increasingly attracting customers at an unprecedented rate, yet choosing the right size and fit remains a major challenge. The absence of sensory feedback leads to uncertainties in the buying decision and the hurdle of returning items that don’t fit well. This causes frustration on the customer side and a large ecological and economic footprint on the business side. Recent research work on determining the right size and fit for customers is still in its infancy and remains very challenging. In her talk, Nour Karessli (Senior Applied Scientist at Zalando) navigated through the complex size and fit problem space and focused on how intelligent size and fit recommendation systems and machine learning solutions leverage different data sources to address this challenging problem.
Zalando is a leader in European fashion with close to 450M visits per month, >38M active customers, >3,500 brands, and ~700k articles on the shop. However, getting the right fit and size for clothes is a challenging problem for them due to a variety of reasons:
Mass production is far removed from end customers and relies on outdated sizing statistics.
There are no uniform sizing conventions, which vary by brand and country.
Vanity sizing is another issue: Brands intentionally assign smaller sizes to articles to encourage sales depending on target customers.
The problem is multi-dimensional: for example, trousers require us to measure length, waist, fit, etc.
The mission of Nour's team is to ensure Zalando customers get the right fit the first time. Her team is multi-disciplinary with a wide variety of skillsets, but her talk focuses specifically on the applied science work. There are multiple challenges that Nour's team came across when executing this task. Here are a few on the Article side:
Different size systems across categories, brands, and countries.
New articles with no sales or returns yet.
Ambiguity or low quality of fit data.
Delayed feedback due to the return process (at least a 3-week window to allow returns).
Here are a couple of challenges on the Customer side:
Multiple customers (many people behind one account).
New customers and data sparsity (no purchases yet or only a few purchases).
Customer behavior and willingness differ widely across groups.
High expectations and explainability demands, especially from customers who provide size feedback.
Let's examine the type of data sources that Nour's team used for this application. For Article Data:
They look at articles purchased by customers and the selected sizes, which are sparse, particularly per category. They also look at return reasons provided by their customers (online or offline), which are often very subjective, noisy, and delayed.
They look at fashion images from a variety of angles (as seen in the picture below).
They rely on the fitting lab, Zalando's in-house station that collects expert feedback and technical garment data. These experts provide size and fit feedback (too big, too tight, short sleeves, etc.) and article measurements.
Finally, they work with fashion experts to define consistent fit and shape terms. The result is a Unified Fit Taxonomy, a set of fit standards established with experts. Fit is the width of a garment relative to the wearer's body. Shape is the silhouette of a garment relative to the wearer's body.
For Customer Data:
They have a Size and Fit Profile, a destination for customers to interact with their Size and Fit preferences. This portal enables them to (1) communicate with customers with more transparency to increase customer trust and (2) collect feedback on past purchases to improve their algorithms.
They have an onboarding process for size recommendation, where they engage and onboard new customers to algorithmic size advice and ask for a reference item (brand + size).
Zalon Customer is their service to get more personal customer data from heavily engaged customers (Overall, Upper Body, and Lower Body categories).
They recently started collecting body measurements estimated from 2D images, which customers submit of themselves in tight sports clothes.
Currently, Zalando provides two types of size advice use cases: article-centric and customer-centric. The article-centric size advice use case shows customers a small hint on whether the given size is appropriate for that specific customer. The algorithm that powers this feature is called SizeNet, a weakly-supervised teacher-student training framework that leverages the power of statistical models combined with the rich visual information from article images to learn visual cues for size and fit characteristics, capable of tackling the challenging cold start problem.
SizeNet, along with the feedback acquired by the human fitting experts, makes up a strong and smart prior to evaluate fit and size behavior. It takes advantage of article images to tackle the cold-start problem in new articles. For articles with no data from the fitting station, SizeNet helps raise flags earlier.
The customer-centric size advice use case shows existing customers a personalized recommendation of specific sizes (as seen in the picture below). Behind this feature, there are multiple algorithms in place:
A baseline recommender for all countries: Based on the purchase history, it learns a customer size distribution from the sizes kept by the customer and learns an article offset distribution from the sizes kept and returned by all customers. The inputs include the purchase history of the customer, category, and brand of the query article. The output is the recommended size.
A deep learning-based recommender (being tested in some countries): It is trained using as ground-truth the sizes kept by the customer. The inputs include the purchase history of the customer and the features of the query article. The output is also the recommended size.
For cold-start customers with no or few purchases, they use a Customer-In-The-Loop model, which consists of Gradient Boosted Trees trained using as ground-truth the size purchased by the customer.
Specifically for the Zalon customers, the inputs include the questionnaire and the brand of the query article, while the output is the recommended size. Furthermore, the output size is adjusted before recommendation using brand offsets calculated from kept and returned purchases in that brand. It usually takes about 15-20 purchases for the baseline model to catch up with the customer-in-the-loop model.
And for the reference item, the input is the reference item (brand + size), while the output is the recommended size. Note that the size given by the customer is adjusted by their knowledge of the brand's behavior.
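To illustrate the flavor of the baseline recommender described above, here is a toy sketch (my own simplification with made-up purchases; Zalando's actual models are far richer):

```python
from collections import Counter

# Toy purchase records: (customer, brand, numeric size, kept?)
purchases = [
    ("cust_1", "brand_A", 38, True),
    ("cust_1", "brand_A", 40, False),   # returned
    ("cust_1", "brand_B", 38, True),
    ("cust_2", "brand_A", 40, True),
    ("cust_2", "brand_A", 38, False),   # returned
]

def most_kept_size(purchases, customer):
    """The size the customer most often kept - a crude 'customer size distribution'."""
    kept = [s for c, _, s, k in purchases if c == customer and k]
    return Counter(kept).most_common(1)[0][0]

def brand_offset(purchases, brand):
    """Crude stand-in for the article/brand offset learned from kept vs. returned sizes."""
    kept = [s for _, b, s, k in purchases if b == brand and k]
    returned = [s for _, b, s, k in purchases if b == brand and not k]
    if not kept or not returned:
        return 0.0
    return sum(kept) / len(kept) - sum(returned) / len(returned)

def recommend_size(purchases, customer, brand):
    return most_kept_size(purchases, customer) + round(brand_offset(purchases, brand))

print(recommend_size(purchases, "cust_1", "brand_A"))
```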
Mixing the article data and the customer data, they have developed MetalSF, a meta-learning approach to deep learning recommender. MetalSF treats each customer as a new task. It learns the article embeddings and size embeddings and then uses these embeddings in a linear regression model. At test time, this embedded linear regressor is trained on previous purchases, with the size being decoded from the output of the linear regressor. MetalSF can ingest customer or article data easily. During their experiments, by including the Zalon questionnaire as input data, Zalando observed a noticeable improvement of up to 15 purchases.
8 - Machine Learning at Rent the Runway: More Than Just A Deep Closet
Rent The Runway (RTR) started as a one-time rental company, where customers could rent a specific item or accessory for a certain date in the future. Several years ago, RTR shifted to a subscription model, where members pay a subscription fee to get access to a specific number of rentals per month. Currently, RTR has millions of inventory units, tens of thousands of unique styles, 50 distinct high-level shapes, 10 levels of formality, and 750+ designers.
Between rentals and resale, RTR is a very circular business. As observed in the circular diagram above, RTR deals with a variety of data sources. On the right, you can see some traditional e-commerce data sources such as price, recommendations, browses, restocks, and rents/buys. On the left lies data more specific to RTR (returns, wears, treatments, diagnosis/prescription). Unlike traditional retail companies, Rent the Runway intentionally holds inventory for very long periods of time and rents that inventory out over and over again to a long-term subscriber base. The cyclical nature of the business generates a wealth of data and unique opportunities to leverage that data in high-impact algorithms that improve the customer experience and, ultimately, the entire operation.
This entire process is handled by humans today. For her talk, Emily Bailey (RTR's Director of Data Products and Machine Learning) examined a specific piece of data that comes back to her team after a rental occurs, called the Happiness Survey. After every rental and before a customer can rent another piece of inventory, RTR asks her to rate the item: whether she loves it or likes it, and what went wrong if she didn't wear it. Because of how this survey is implemented, RTR has almost a 100% response rate for the initial question, and the Happiness Survey yields valuable data on 98% of rentals. As Emily's team dug into this data, they found that the top two reasons for "didn't wear" or "just ok" are fit issues and not flattering.
The solution that RTR built is the Happy Model, a neural network that predicts the probability that a user will be happy renting a specific style. Initially, they thought of using this model as a re-ranker, selecting top styles based on the probability that the user would pick them and ranking them to make her happy. However, re-ranking all possible styles is expensive and slow to roll out, so Emily's team had to pick a top-N to re-rank, where N is 6,000. That is still far more styles than a customer can see in a single session.
Another method that they looked at is a weighted ranking that blends the pick model and the happy model, as sketched below. The experiment results showed a positive lift in primary, secondary, and guardrail metrics as the weight on the happy model increased.
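A tiny sketch of what such a weighted blend of the two models could look like (illustrative scores and weight only, not RTR's actual implementation):

```python
import numpy as np

# Hypothetical per-style scores for one user.
pick_scores  = np.array([0.8, 0.5, 0.9, 0.3])   # P(user picks the style)
happy_scores = np.array([0.4, 0.9, 0.6, 0.7])   # P(user is happy after renting it)

def blended_ranking(pick, happy, w_happy=0.3):
    """Rank styles by a weighted blend of the pick model and the happy model."""
    blended = (1 - w_happy) * pick + w_happy * happy
    return np.argsort(-blended)                  # style indices, best first

print(blended_ranking(pick_scores, happy_scores))
```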
As observed in the data flywheel below:
RTR first collects happiness survey data. Then, based on the diagnosis and prescription of returned items, RTR collects more quality data, which is then used as input for the Happy Model. The outputs of the Happy Model are item recommendations that make customers happy.
RTR is considering using this data more intelligently to price inventory. For example, items that RTR knows will make customers unhappy can be priced lower to be sold faster (before somebody else rents them and has a negative experience).
Even further upstream, RTR is looking at taking this happiness data and figuring out what inventory to buy to maximize customer happiness. Even more ambitiously, RTR can work with fashion designers and guide them on what makes women happy in their clothes.
9 - The Future of Fashion: Unlocking Discovery
The experience of discovery in digital retail asks a lot of the customer. This is the consequence of a two-decade-old paradigm that forces customers to learn to navigate a set of rigid product taxonomies and hierarchies defined by the retailer. However, recent breakthroughs in AI have unlocked new capabilities that empower us to challenge the historical discovery paradigm, lead a step-change in the space, and create a more powerful, interconnected discovery experience. In her talk, Katie Winterbottom (Senior Manager of Data Science at Nordstrom) discussed how her team is leveraging deep learning to re-imagine how customers discover products and style.
The specific problem that Katie's team is solving is to recreate online the personalized, high-touch, service-oriented shopping experience that customers are used to in-store. When shopping online, a customer has to navigate a rigid browsing process centered around product taxonomy, which does not do a good job of answering natural-language questions.
What if they can take the customer questions and ingest them directly into the data discovery process?
What if they can return relevant results to queries they have not anticipated when creating a taxonomy or tagging structure?
What if they can allow the customer to iterate and improve her own results, talking with the search engine in the same way that she talks to a salesperson?
What if they can take the recommended item and use it as a seed to generate an outfit?
Enter FashionMap, a deep learning model specific to the product discovery space for fashion. It's built on advances in NLP and computer vision: an ensemble model that takes a natural language embedding model (like BERT) and combines it with a computer vision model (like ResNet-50). It outputs a multi-dimensional representation of the fashion concepts embedded in the items, not the words. Each region in this map (shown above) represents a different fashion style.
An application of FashionMap is fashion search. Customers can search esoteric concepts that Nordstrom previously could never include in their manual or supervised classification process. The example above shows a search query for "Skater Girl." With its native zero-shot learning capability, FashionMap enables Nordstrom to show results for the query even without hand-labeled training data for this concept. More specifically, it creates an NLP embedding for the "Skater Girl" query, positions it in a multi-dimensional vector space, searches for its nearest neighbors, and returns the results.
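Conceptually, zero-shot fashion search reduces to embedding the query and finding its nearest neighbors among product embeddings. Here is a toy sketch (my own illustration with a text-only embedding model and made-up catalog entries; FashionMap itself combines text and image signals):

```python
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

# Hypothetical catalog; short text descriptions stand in for FashionMap's multimodal embeddings.
catalog = [
    "black high-top canvas sneakers with chunky sole",
    "pastel floral midi dress with puff sleeves",
    "oversized graphic skate tee with baggy cargo pants",
    "tailored navy wool blazer",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
catalog_vecs = model.encode(catalog)                       # precomputed product embeddings
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(catalog_vecs)

query_vec = model.encode(["skater girl"])                  # no hand-labeled category required
_, idx = index.kneighbors(query_vec)
print([catalog[i] for i in idx[0]])
```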
Another helpful capability unlocked by FashionMap (also enabled by zero-shot learning) is a mechanism for model monitoring and evaluation in the context of an unsupervised problem. They can look at classification performance (at scale) across a set of known categories and compare against benchmark results. For example, given a men's button-down shirt (as seen above), FashionMap evaluates the accuracy of the embeddings. In other words, when asked, "what is this product?" the model spits out a relevance score, essentially which categories represent this product. Then, iterating over a corpus of product images using some set of labeled data, they can get a sense of how well their model predicts the correct category or categories.
The last advanced integration that Nordstrom is working towards is called fashion arithmetic. Essentially, they want to recreate the interactive in-store styling experience. So, for example, a user can look at an item and say, "I like this dress but want something dreamier and in pastel color."
At the end of the day, they are building a platform to make product discovery better, starting with search and moving outward from there. Outfitting is a promising use case. The images above come from a Nordstrom prototype, where the customers can create outfits around a natural language concept ("red floral dress," "loud shirt and blue shoes," etc.).
AI In Insurance
10 - How To Explain an ML Prediction to Non-Expert Users?
Machine Learning has provided new business opportunities in the insurance industry, but its adoption is limited by the difficulty of explaining the rationale behind the prediction provided. There has been a lot of recent research work on transparency (interpretability and explainability), responsibility (bias and fairness), and accountability (safety and robustness).
Specifically, on the topic of Explainable AI (within transparency), we tend to face tradeoffs between accuracy and explainability. Additionally, there has been increasing pressure from regulators and governments to make sure that we use AI responsibly. As a result, there has been a surge of interpretability research, where interpretability is defined as the degree to which a human can either (1) consistently predict the model's result (a perspective from Computer Science, Kim and Doshi-Velez, 2016) or (2) understand the cause of a decision (a perspective from Human-Computer Interaction, Miller, 2017).
Renard et al. (2019) propose a useful interpretability framework:
White-box models are simpler and less accurate. They are intrinsically interpretable due to their simple structure.
Black-box models are less simple and more accurate. They require post-hoc interpretability methods, either model-specific (accessing model internals to generate explanations) or model-agnostic (separating the explanations from the ML model).
For the black-box models, we can extract four main explanation types:
Feature importance is the weight of features on the prediction.
Rules represent a global overview of model behavior and apply to all predictions.
Neighbor instances are the nearest use cases to one instance.
Counterfactuals are the minimum changes to make in a feature value to get a different prediction.
The explainable AI research community aims to:
Make AI systems more transparent so people can make informed decisions.
Provide interfaces that help explain why and how a system produced a specific output for a given input.
Analyze people's perception of explanations of ML systems.
In her latest research, Clara Bove-Ziemann (ML Researcher at AXA and Ph.D. Candidate at LIP6) explores how to enhance, for non-expert users, one type of explanation extracted from interpretability methods such as LIME: local feature importance explanations. She proposes design principles to present these explanations to non-expert users by contextualizing them based on the users' pain points.
ML transparency: The explanations give transparency on the model's purpose and basic operations, guiding how to interpret the explanations.
Domain transparency: The explanations pair each local feature importance explanation with global information provided by a domain expert, providing some brief justification about how this feature might impact the outcome.
External transparency: The explanations provide the users with external global information which may impact the outcome they get, providing additional external information on the context or algorithmic processes.
Contextual display: The explanations match the display of feature importance with the input stage and exploit the field categories users encounter in the form, making it easier to build a mental model of the explanations.
An application of these explanations for the car insurance pricing scenario is shown above:
The onboarding text (E) makes explicit the difference between the predicted and the average prices, explains that the price has been personalized based on the user's information, and introduces the feature-associated cards.
Each feature-associated card is complemented with the main kinds of risk impacted by the feature (G).
There is information on the gender feature (H) that is not included in the model but is likely to be considered important by users.
Features are displayed into three input categories (F): related to the driver, the car, and the driver's residence.
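As background, here is a minimal sketch of how LIME-style local feature importance explanations are typically generated (a toy tabular example with made-up features and a random forest, not AXA's actual pricing model):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# Made-up "car insurance" features and target; purely illustrative.
feature_names = ["driver_age", "car_age", "annual_mileage", "urban_residence"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100).fit(X, y)

explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                 class_names=["average price", "higher price"],
                                 mode="classification")
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())   # local feature importances for this single quote
```

The design principles above are about how such raw feature-importance lists get contextualized and presented to non-expert users.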
Clara also presented preliminary observations collected during a pilot study using an online A/B test to measure objective understanding, perceived understanding, and perceived usefulness of the designed explanations. The experimental results and the qualitative feedback received lead her to believe that contextualization is an interesting direction to explore in order to enhance local explanations.
At the moment, Clara's team at AXA is doing more research on Responsible AI related to governance, transparency, responsibility, accountability, and ethics/regulation. Check out their latest publications at this link. Concerning the landscape of AI applications in insurance, here are the domains they are actively working on:
NLP systems for digital assistants that can be used for internal services (finance's purchasing orders, HR's request management) and external services (call center, FAQs).
Computer vision systems to detect fraud on PDF documents and assess risks from satellite images.
General ML/DL systems for risk management and pricing models.
Knowledge management and rules-based systems to automate smart contracts and claims management.
11 - Developing Early Warning System to Identify Relevant Events in Unstructured Data
In insurance, the traditional methods to identify relevant events become unreliable when information volume rapidly increases. Furthermore, uncoordinated views pose a challenge in taking proactive and strategic actions to manage risks. Swiss Re is a leading player in the global reinsurance sector. Its role is to anticipate, understand, and price risk to help insurers manage their risks and absorb some of their biggest losses. As one way to stay ahead of the curve and provide thought leadership to its clients, Swiss Re is developing an early warning expert community platform based around big data and natural language processing. The platform is intended to work on the front lines, detect events that can change their view on risk drivers, and help them make business decisions in shorter timescales.
Nataliya Le Vine (a data scientist at the Advanced Analytics Center of Excellence at Swiss Re) unpacked the development of this platform and explained how it helps drive the technology transformation in insurance. More specifically, the early-warning expert platform leverages new data techniques to identify signals relevant to Swiss Re's business (the tech-enabled toolkit) and integrates experts and decision-making bodies in a more joined-up process (the ecosystem).
The ecosystem is a group of people with domain and business expertise. They are brought together to triage identified signals and assess the impact on the business while recommending actions to the decision-makers in a more coordinated way.
The tech-enabled toolkit handles large volumes of information, ensures consistent data processing and data visualization, and detects relevant events (signals).
Nataliya's team works with data from multiple domains, regions, sources, and languages for the tech-enabled toolkit. The data must have adequate historical coverage, be licensed for analytical models, and contain useful meta-data fields. Swiss Re has been partnering with Dow Jones DNA to get data that satisfies such requirements. Dow Jones' data offers extensive worldwide news aggregation with historical access back to as early as 1950, covering >1B articles from 8k sources, 28 languages, and 100 regions.
A signal must be relevant to a selected topic, contain new information, be discussed in multiple (reputable) sources, and, most importantly, affect the company's business.
Digging into the inner workings of the tech-enabled toolkit, it (1) uses unstructured data (e.g., news articles) for a topic of interest, (2) extracts important elements and classifies them (e.g., by domain, country, disease), and (3) scores the classifications to identify important articles.
How does the article scoring process work? Nataliya presents two ideas:
Scoring is based on how novel the article is relative to past coverage and how persistent its message is in subsequent coverage.
Scoring is based on communities of concepts for specific time periods.
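A toy sketch of the first idea, scoring novelty as dissimilarity from past coverage (made-up articles, with a generic sentence-embedding model standing in for whatever text representation Swiss Re actually uses):

```python
from sentence_transformers import SentenceTransformer, util

# Made-up articles; a sentence-embedding model stands in for the real text representation.
past_articles = [
    "Regulators propose new capital requirements for reinsurers",
    "Flooding in region X causes record agricultural losses",
]
new_article = "Novel pathogen detected in livestock across several countries"

model = SentenceTransformer("all-MiniLM-L6-v2")
past_vecs = model.encode(past_articles, convert_to_tensor=True)
new_vec = model.encode(new_article, convert_to_tensor=True)

# Novelty: one minus the highest similarity to anything seen before.
novelty = 1 - util.cos_sim(new_vec, past_vecs).max().item()
print(f"novelty score: {novelty:.2f}")   # persistence would be measured the same way, looking forward
```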
A dashboard screenshot of the Early Warning System is shown above. Overall, the Early Warning System allows Swiss Re to triage signals (for experts and business teams), recommend potential actions (for business teams), and focus on relevant risk management (for stakeholders). In other words, the system enables proactive business decisions by identifying and managing relevant risks faster than ever before.
12 - Detecting Deception & Tackling Insurance Fraud Using Conversational AI
Global fraud has grown by 56% in the last decade and now costs the global economy $4 trillion a year. Julie Wall presented a talk on how to use conversational AI to tackle insurance fraud. Currently, there are several ways to accomplish this:
Stress Detection: This entails using the pitch and tone of the voice to detect stress; however, stress doesn't always indicate deception.
Data Analytics: This is synonymous with profiling, which might lead to biases along socio-economic and racial lines.
Biometrics: Also known as "voiceprint," this tech can reasonably indicate a person's identity and match it against a database of fraudsters. But it is only effective when we already know that this identity belongs to a fraudster (i.e., from the second fraudulent call onward).
Conversational AI can identify fraud in voice through various applications such as forensic linguistics, customer analytics/engagement, sentiment analysis, emotion detection, question classification, prosody, named-entity recognition, and cognitive interviewing. In particular, forensic linguistic analysis tends to work well. When conversing, we have a large vocabulary to choose words from, organize grammatically, and say within a 200ms time constraint. When being deceptive, we have to be careful to omit incriminating evidence and be convincing; thus, the time constraint increases the cognitive load, and leakage occurs. For instance, examples of deception that a forensic analyst can detect include missing pronouns, negation, repetition, explainers, hedging, and lack of memory. However, forensic linguistic analysis is laborious - it might take an analyst up to 2 hours to analyze a 2-page transcript and 6 hours to analyze a 2-minute emergency call. Can we automate the process?
Julie's team at the University of East London has been developing an explainable system that identifies and justifies the behavioral elements of a fraudulent claim during a telephone report of an insured loss. The pipeline is called LexiQal.
First, the audio from a call recorder or stream is processed by IV ASR.
Next, a robust set of 'Markers' is identified using AI: (1) lexical markers (the use of specific words and phrases), (2) temporal markers (variations in time-based measures), (3) emotion markers (sentiment), and (4) acoustic markers (pitch, tone, stress, etc.)
Finally, a 'Decision Engine' identifies the significant clusters of markers that trigger the system to raise an alert.
They use various models for the lexical markers, ranging from simple bag-of-words models to SOTA language models such as BERT. BERT is a bidirectional language model trained on Wikipedia and the BookCorpus. It is based on the Transformer architecture but only contains the encoder (LM) part, not the decoder (prediction) part. By adding a single layer on top of the core model, BERT can be fine-tuned for various tasks, ranging from sequence classification (e.g., sentiment) to token classification (e.g., hedging, explainers, punctuation, true casing).
Putting it all together, the Decision Engine of LexiQal contains:
Acoustic and linguistic marker detection to detect the location of each marker in a given transcript.
The proximity model to extract the proximity features of each marker T_k.
Each marker has a corresponding model that predicts the likelihood of a marker hit being part of a deceptive occurrence. The model's output is returned as a Deception Indicator.
The final layer computes the deception score of the transcript given Deception Indicators of each marker.
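To give a feel for the mechanics, here is a heavily simplified, purely illustrative sketch of lexical marker detection and proximity-based scoring (regex markers and an ad-hoc score of my own; the real LexiQal system uses trained models for each marker):

```python
import re

# Toy lexical marker patterns; the real system uses trained models per marker type.
MARKERS = {
    "negation": r"\b(no|not|never|didn't|don't)\b",
    "hedging":  r"\b(maybe|perhaps|i guess|sort of|kind of)\b",
    "memory":   r"\b(i can't remember|i don't recall)\b",
}

def detect_markers(transcript):
    """Return (marker_type, approximate word position) hits found in a transcript."""
    hits = []
    for name, pattern in MARKERS.items():
        for m in re.finditer(pattern, transcript, flags=re.IGNORECASE):
            hits.append((name, len(transcript[:m.start()].split())))
    return hits

def deception_score(hits, window=10):
    """Crude score: count of different-marker pairs occurring close together, per hit."""
    pairs = [(a, b) for i, a in enumerate(hits) for b in hits[i + 1:]
             if a[0] != b[0] and abs(a[1] - b[1]) <= window]
    return len(pairs) / max(len(hits), 1)

transcript = "I don't really remember, maybe the car was never parked there, I guess."
hits = detect_markers(transcript)
print(hits, deception_score(hits))
```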
Based on their analysis of the interaction between any two markers against the probability of deception, there were some interesting stand-out examples:
Disfluencies together with Explainers, Temporal Lacunae, or Negation have a high impact on the deception score when in close proximity.
Disfluency before negation is not sensitive, but the other way around is very sensitive.
Interestingly, negation alone has minimal impact, but a second negation in very close proximity drastically increases the deceptiveness of the conversation.
Furthermore, the marker proximity feature extracts the pattern of different linguistic markers around a given marker. In other words, it shows how different markers come together in a conversation.
LexiQal has been filed as a patent - a system and method for understanding and explaining spoken interaction using speech acoustic and linguistic markers.
13 - NLP for Claims Management
Insurance companies have a lot of unstructured data, which has historically been difficult to tap into. However, NLP techniques can be used to leverage the power of that data to deliver business value and drive greater customer satisfaction. Clara Castellanos Lopez (QBE Insurance Group) discussed how some NLP techniques could be used to this end. Specifically, she argued that contextual embeddings adapted to your specific domain will improve your model performance.
QBE Insurance Group is one of the world's leading insurers and reinsurers, with operations in 27 countries and a 2020 gross written premium of $14,643M. QBE European Operations is a commercial insurer covering multiple lines such as motor, casualty, professional indemnity, marine, and property. They learn from historical claims data to understand risks and improve the claims handling process, using both structured data (dates, codes, categories, etc.) and unstructured data (voice, text, images, etc.). This matters because efficient, good claims handling translates to a better customer experience, a reduced claim lifecycle, and higher retention and renewal rates.
There are various ways that claim handling can be more efficient using NLP:
Setting up claims quickly (responding to the first notification of loss, or FNOL).
Responding to common queries faster.
Dealing with repairs and payments as soon as possible.
Flagging claims that require senior handler involvement.
Sending customer updates before they ask.
In brief, a semantic understanding of your data will lead to a better understanding of your business and business risks (emerging/trending). But, more importantly, semantic understanding goes beyond keyword similarity and requires an understanding of context.
An example of claims classification for risk understanding is shown above.
Let's say we are given descriptions of 6 claims with assigned loss causes. The top 3 are all about theft, even though the word "theft" is never used. The bottom 3 are all about impact, even though the word "impact" is never used.
When a new claim arrives, the system assigns it a loss cause. Reporting is then done by loss cause to measure risk and performance (such as theft property claims). However, we may have duplicate loss causes, or some claims might not have a loss cause assigned at all.
Word embeddings come in handy here. They are multidimensional vector representations of a data point (words, sentences), preserving semantic relationships. Points close in the vector space have a similar meaning. Position (distance and direction) in the vector space can encode semantics in a good embedding. Recent NLP models that use word embeddings include GloVe, Word2Vec (2013), BERT (2018), GPT-2 (2019), etc.
Fine-tuned BERT embeddings are especially useful. Common insurance words differ from general English: for example, claim/damage, policy/premium, and catastrophe/recovery are more frequent in insurance-related documents. Leveraging the power of your data and fine-tuning pre-trained models allows for higher-quality embeddings, where similar words have embeddings that are closer in space.
Going back to the case study above:
Every claim with an assigned loss cause is used in the training data. Conversely, everything without a loss cause gets excluded from the training data.
They add a fully connected layer (with a softmax function) on top of BERT as a classifier. Weights are updated when training the classifier.
As a result, they extract embeddings at the claim level. Then, looking at the claim-level embeddings of test data, they can detect duplicate claims codes and claims that appear to be misclassified.
Furthermore, fine-tuning has captured the semantic similarity in property claims data, according to the experimental results comparing the performance of BERT and fine-tuned BERT, where the smallest distance represents the greatest similarity.
This is how the final model works in action:
First, it identifies and removes duplicates in the embeddings.
Second, it takes the unlabeled claim descriptions and runs them through the fine-tuned BERT. Finally, the output shows the predicted label.
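For readers who want to try the recipe, here is a minimal sketch of fine-tuning a BERT classifier on claim descriptions with Hugging Face (made-up texts and labels; QBE's data, label set, and training details are not public):

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

# Made-up claim descriptions and loss-cause labels (0 = theft, 1 = impact).
texts = ["Window forced open and laptops taken from the office overnight",
         "Vehicle reversed into the shop front causing structural damage"]
labels = [0, 1]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True,
                                          padding="max_length", max_length=64),
                      batched=True)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="claims-bert", num_train_epochs=1,
                                         per_device_train_batch_size=2),
                  train_dataset=dataset)
trainer.train()

# Claim-level embeddings (e.g., the [CLS] vector of the fine-tuned encoder) can then be
# extracted to flag duplicate loss-cause codes or apparently misclassified claims.
```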
Clara ends the talk with 3 key takeaways:
You can leverage the power of your historical data by adapting pre-trained language models to your datasets.
Fine-tuned contextual embeddings will improve the prediction quality of models.
Contextual embeddings have several use cases, such as triaging and prioritizing claims and identifying common queries in emails, amongst others.
That's the end of this long recap. If you enjoyed this piece, be sure to read my previous coverage of REWORK's Deep Learning Virtual Summit back in January as well. In addition, I would highly recommend checking out the list of events that REWORK will organize in 2021! Personally, I look forward to the MLOps Summit on June 2nd.
My team at Superb AI is developing an enterprise-level training data platform that is reinventing the way ML teams manage and deliver training data within organizations. If you are a practitioner/investor/operator excited about best data management practices to successfully build and manage production ML applications in finance/retail/insurance, please reach out to trade notes and tell me more! DM is open 🐦