Last month, I attended another apply(meetup), Tecton’s follow-up virtual event of their ML data engineering conference series. For context, I have written recaps for both of their 2021 events, including the inaugural conference and the follow-up meetup. The content below covers my learnings, ranging from model calibration and ranking systems to real-time analytics and online feature stores.
1 - A Tour of Features In The Wild and A Modern Solution to Manage Them
Mike Del Balso (Co-Founder and CEO of Tecton) kicked off the event and presented his views on building the optimal data stack for Operational ML. This stack allows data engineers to build real-time features quickly and reliably and deploy them to production with enterprise-grade SLAs. By combining feature stores, scalable offline storage, elastic processing, and high-performance online storage, organizations can lay an optimal infrastructure foundation for their ML data.
Let’s imagine you are a payment processor. Your customers are e-commerce companies that process online payments and need a sophisticated fraud detection system. There are a variety of signals you can take into account when making a fraud prediction:
Transaction context - how unusual is this purchase?
User context - does this user look normal?
User history - did this user have suspicious behavior in the past?
Site-wide activity - is there unusual traffic/fraud across the website right now?
You need to get even more concrete and define specific features for each of the signals above:
Transaction context: Is this transaction abnormally large?
User context: Does this user login from a suspicious location? Does the user use a VPN?
User history: Has the user previously been flagged for fraudulent activity?
Site-wide activity: Has there been an abnormally large number of similar transactions observed in the past few minutes? Has there been a lot more fraud detected than usual in the past few minutes?
There are many implementation constraints when constructing a feature pipeline to process these features: How fresh is the data? How complete is the data? Where is the data stored? How can the data be accessed? How compute-heavy is the feature transform? How much historical data do we have? When does the data become available? Is the prediction made in real-time? Are predictions made in batch?
Every one of these questions can have an impact on the end implementation. An ML engineer still needs to extract as much predictive value as possible from the data given the constraints.
For example, let’s look at the feature “User has previously been flagged for fraudulent activity.” You have historical user activity in a data source and run a query or compute job against it to produce a yes/no flag for every user. You then have to store that list offline to power model training and online to be ready to serve a model making real-time predictions.
Let’s look at another feature, “Transaction is abnormally large.” The architecture already looks different. You need two types of data: (1) the user’s normal transaction size, which comes from historical user activity, and (2) the current transaction size, which comes from the web application. You can use different storage systems to hold the data and perform an online transformation that compares the historical normal transaction size to the current transaction size.
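To make that second pattern concrete, here is a minimal sketch (my own illustration, not Tecton’s implementation) of the online transformation: the user’s historical statistics would be precomputed and looked up from the online store, while the current amount arrives with the request. The function name and threshold are hypothetical.

```python
# Hypothetical online transformation for the "Transaction is abnormally large" feature.
# user_mean / user_std would be precomputed offline and fetched from the online store.
def is_abnormally_large(current_amount: float, user_mean: float, user_std: float,
                        z_threshold: float = 3.0) -> bool:
    """Flag amounts more than z_threshold standard deviations above the user's average."""
    if user_std == 0.0:                      # degenerate history: fall back to a simple check
        return current_amount > user_mean
    return (current_amount - user_mean) / user_std > z_threshold
```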
Let’s look at a third feature, “Abnormally large number of similar transactions observed in the past few minutes.” For this one, you will likely consume streaming data, aggregate it across transactions, and continuously update the aggregates so they are available to the real-time serving layer.
With only three features, you already have three different architectural choices. Diverse feature pipelines are hard to build and harder to maintain—even small changes in the idea-space lead to big changes in implementation.
Tecton is built to solve this problem. Its feature platform plugs into the business data sources and serves the feature data both to model inference and training.
Tecton creates production-ready feature pipelines from simple declarative definitions.
Tecton supports common and powerful feature pipeline patterns: SQL query, transformation on real-time inputs, streaming data aggregations, and query to external API or data system.
Tecton encapsulates features as standardized, reusable components and deploys them together to power models in production.
Operational ML is easier to build and maintain if we decouple feature definitions from feature implementations. Feature platforms like Tecton and Feast implement and manage features with modern data infrastructure and best practice architectures for modern ML. Adopting them will make building and maintaining ML systems easier while getting a more efficient and reliable implementation behind the scenes.
2 - Model Calibration in the Etsy Ads Marketplace
Etsy is a global marketplace for unique and creative goods; it was founded in 2005 and went public in 2015. They sell over 100 million unique items on their website. Some of their ML applications are product search, user recommendations, and onsite advertising. Etsy’s Erica Greene (Senior Engineering Manager) and Seoyoon Park (Senior Software Engineer) shared the journey, learnings, and challenges of calibrating their machine learning models and the implications of calibrated outputs.
Etsy sellers can promote their items on the site using Etsy Ads. When a buyer comes to Etsy and searches for something, Etsy Ads returns the most relevant ads to the buyer.
Sellers are charged a small fee when a buyer clicks on their ad.
They score each listing and return listings with the top scores for each ad request. Their scoring function is roughly formulated as Score = pCTR x Bid - in which the Predicted CTR is the estimated probability that a buyer will click on the listing and the Bid is the estimated price that a seller would be willing to pay for a click. Both of these components are backed by ML models.
In a traditional marketplace, there’s only one stakeholder - the buyer. In an ad marketplace, a new stakeholder is introduced - the seller. The scoring function above balances the objectives of both: the pCTR for the buyer and the bid for the seller. If their ad is clicked, sellers pay the minimum amount required to keep their position in the ranked list. The cost for the winning listing would change if the 2nd best listing had a different score.
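Following that description, the cost for a given position is the smallest bid that keeps the listing’s score ahead of the next listing’s score - a generalized-second-price style rule. Here is a hedged sketch of that pricing logic, reconstructed from the talk’s description rather than taken from Etsy’s actual code:

```python
# Sketch of the pricing rule implied above: each clicked listing pays the minimum bid
# that keeps its score ahead of the next listing (values and structure are illustrative).
def rank_and_price(listings):
    """listings: dicts with 'id', 'pctr', 'bid'. Returns [(id, cost_per_click), ...] in rank order."""
    ranked = sorted(listings, key=lambda l: l["pctr"] * l["bid"], reverse=True)
    priced = []
    for i, listing in enumerate(ranked):
        if i + 1 < len(ranked):
            next_score = ranked[i + 1]["pctr"] * ranked[i + 1]["bid"]
            cost = next_score / listing["pctr"]   # minimum bid that keeps this position
        else:
            cost = 0.0                            # no competitor below; reserve price omitted
        priced.append((listing["id"], round(cost, 2)))
    return priced
```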
Thus, model distributions play a large role in ad ranking and seller costs. The variance of model scores affects the relative importance of the CTR and bid components in the scoring function: generally speaking, the larger a component’s variance, the more influence it has on the final ranking.
The distribution of model scores can vary significantly based on many factors, most notably the sampling of training data (changing the distribution of positive and negative examples in the training data or choosing “harder” or “easier” examples for training) and the model architecture (logistic regression, boosted trees, neural networks, weight decay, model complexity).
The instability in their model distributions significantly slowed down model development because the costs charged to sellers and ad quality would vary widely depending on the model. In order to stabilize these costs and the buyer experience, they introduced manually tuned parameters into their scoring function to adjust the model distributions. The real score is a function of the predicted CTR, the bid, and a set of three constants: f(pCTR, Bid, alpha, t, b) - in which alpha weights the pCTR component, while t and b are used to manually fit the bid distribution. With the introduction of these new parameters, it was difficult to decouple the impact of model improvements from the impact of tuning the scoring function.
As a result of these constants, the Etsy Ads team spent extra time in the experiment cycle to tune the scoring function. A typical cycle looks like this:
They first iterate on a model until they find offline improvement gains.
Using these new models, they simulate an offline auction to determine if there are shifts in the cost charged to sellers and the buyer experience.
If they notice a significant shift in the cost and quality of the queries returned, they tune the parameters of the scoring function offline until they arrive at a set of constants they are comfortable taking into an online experiment.
They tune those same parameters again online because the offline results are not directly related to the online business metrics.
After fine-tuning the scoring function and its parameters, they launch the experiment and re-run the entire experiment cycle.
Ideally, they wanted to remove the two scoring-function tuning steps from this cycle because they take a long time. Thus, they needed a way to stabilize their model output distribution. The answer to this problem was calibration.
Calibration is a technique used to fit the output distribution of an ML model to the empirical distribution while preserving the order of the original scores.
Calibrating models to the empirical distribution guarantees stable model distributions because they can force all versions of their models to output the same distribution. The empirical distribution is very stable day-to-day, and they can now interpret their model predictions as probabilities.
The Etsy Ads team chose Platt scaling to calibrate their models. Platt scaling fits a logistic regression layer on the raw outputs of a model: the layer learns two additional parameters that scale and shift the raw outputs prior to the sigmoid activation. Furthermore, they wanted to keep the learnings from the original model, so they froze the weights learned from training the base model before training the calibration layer.
Calibrating models with Platt scaling is extremely easy.
The left-hand side below is Etsy Ads’ old workflow - in which they trained a fully-connected feedforward network using downsampled training data and attached the sigmoid layer to transform the raw outputs to a prediction space.
In order to calibrate this model (the right-hand side), they froze the model’s weights and attached a Platt scaling layer on top of the network before the sigmoid activation layer. They used a holdout set of calibration data to train this new model.
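A post-hoc equivalent of that setup can be sketched with an unregularized logistic regression fit on the frozen model’s raw scores over the held-out calibration set. This is a rough illustration, not Etsy’s in-graph implementation, and the placeholder arrays are hypothetical:

```python
# Platt scaling sketch: learn a and b so that p = sigmoid(a * z + b), where z is the
# frozen base model's raw output on the held-out calibration set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt_scaler(raw_scores_cal: np.ndarray, labels_cal: np.ndarray):
    lr = LogisticRegression(C=1e6)                     # effectively unregularized
    lr.fit(raw_scores_cal.reshape(-1, 1), labels_cal)
    a, b = lr.coef_[0, 0], lr.intercept_[0]
    return lambda z: 1.0 / (1.0 + np.exp(-(a * z + b)))

# raw_scores_cal = base_model(X_cal)   # logits from the frozen base model (placeholder)
# calibrate = fit_platt_scaler(raw_scores_cal, y_cal)
# calibrated_pctr = calibrate(base_model(X_new))
```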
Calibration requires a minor adjustment to the model training and the ETL workflows. They need a separate holdout dataset containing the empirical distribution to calibrate the layer. As seen below, they partition the evaluation dataset into a calibration set and an evaluation set. When training the model, they train it regularly using the downsampled dataset. But when calibrating the model, they use the empirical distribution within the calibration set. As a result, the calibrated model now outputs predictions that more closely resemble the empirical distribution.
They can get a sense of how calibrated a model is by using the calibration ratio, which compares the average of the model’s predictions against the rate observed in the system. A ratio greater than 1 indicates overconfident predictions and vice versa. They can also compute calibration ratios over granular prediction intervals and create a reliability diagram by plotting the average predictions against the average labels. This diagram visualizes the prediction ranges in which the model is less calibrated.
You can see an uncalibrated model on the left-hand side where the model is over-confident in predictions across all intervals. The dots lie closer to the identity function on the right-hand side with the calibrated version, where most predictions within all intervals are well-calibrated.
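A minimal numpy version of those diagnostics might look like this (the binning choices are mine, not Etsy’s):

```python
# Calibration ratio plus the per-interval averages behind a reliability diagram.
import numpy as np

def calibration_diagnostics(y_prob: np.ndarray, y_true: np.ndarray, n_bins: int = 10):
    ratio = y_prob.mean() / y_true.mean()        # > 1 means the model is overconfident
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(y_prob, edges[1:-1])   # assign each prediction to an interval
    points = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():                           # skip empty intervals
            points.append((y_prob[mask].mean(), y_true[mask].mean()))
    return ratio, points                         # plot avg prediction vs. avg label
```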
Finally, they had to perform a final tuning experiment with the new calibrated predictions because calibrating the model significantly changed the output distributions, and launching the calibrated models as-is would cause major disruptions to seller costs and the buyer experience. They were able to simplify the scoring function by removing most of the manually tuned parameters. The new scoring function is the product of the predicted CTR raised to an exponent (a) and the bid: New Score = pCTR^a x Bid, where a is a lever to prioritize either the seller or the buyer experience.
With the stability and interpretability of these predictions, the Etsy Ads team was able to apply the calibrated predictions to more product applications at Etsy. For example, filtering low-quality ads can improve the buyer experience and help long-term marketplace health, so they could introduce thresholds to filter out low-quality listings or listings not relevant to the query. Another use case is to flexibly change the page layout based on the quality of ads: they can now dynamically adjust the layout based on the quality of inventory available.
3 - What Data Engineers Should Know About Real-Time Analytics
The evolution of data processing has gone through three separate phases:
Hadoop started way back in 2006. It is very much a batch-processing system: you collect a lot of data, spend a couple of days building models, and serve them in production a few days later.
Spark showed up around 2012. There are a lot of ETL processes to be done because you need fresh data to run your models. The latency of fresh data is maybe a few minutes to a few hours at most.
Now people are moving towards real-time processing, meaning that you can serve models with latencies of a few milliseconds. Rockset is a part of this new movement.
Generally speaking, real-time systems are hard. How can we build them? Dhruba Borthakur (CTO and Co-Founder of Rockset) argued that the technical requirements of data systems for real-time analytics entail eight characteristics: bursty traffic, complex queries, flexible schemas, low-latency queries, mutability, out-of-order events, rollups and summaries, and exactly-once semantics. He then went on to discuss five of them in detail.
1 - Bursty Traffic
Bursty traffic refers to variations in the volume of new data or in the number of queries coming into the system serving your models. There are two ways to handle bursty traffic:
The backend is designed for the cloud, so there is no need to overprovision resources for experiments, new product features, or bursty seasonal traffic.
The system has a disaggregated architecture and can independently scale up and down resources to support bursts in both data and queries. Disaggregation removes the compute contention, so spikes in queries do not impact data ingestion and vice versa.
Cloud-native disaggregated architectures are key to ensuring real-time systems are affordable.
In designing a disaggregated architecture, you need to separate the compute and storage layers. One approach is to utilize the Aggregator-Leaf-Tailer Architecture (ALT), which separates ingest/query compute for real-time performance. As seen below:
The Tailer pulls incoming data from a static or streaming source into an indexing engine. Its job is to fetch from all data sources, be it a data lake, like S3, or a dynamic source, like Kafka or Kinesis.
The Leaf is a powerful indexing engine. It indexes all data as and when it arrives via the Tailer. The indexing component builds multiple types of indexes—inverted, columnar, document, geo, and many others—on the fields of a data set. The goal of indexing is to make any query on any data field fast.
The scalable Aggregator tier is designed to deliver low-latency aggregations, be it columnar aggregations, joins, relevance sorting, or grouping. The Aggregators leverage indexing so efficiently that complex logic typically executed by pipeline software in other architectures can be executed on the fly as part of the query.
Facebook uses ALT for the MultiFeed system, resulting in a 40% efficiency improvement via total memory and CPU consumption, reduced data latency by 10%, and increased resilience to traffic spikes due to isolation. LinkedIn also uses ALT for the FollowFeed system to bring computation closer to the data and serve relevance algorithms in real-time by querying on demand.
2 - Complex Queries
Real-time systems require complex queries to perform aggregations (sum, count, average to get metrics on large-scale data) and joins (incorporating data from multiple datasets and sources). The challenge here is that not all backend operations can equally support complex queries. Ideally, the system should have a query optimizer designed for SQL Joins to minimize the size of the intermediate table for efficient join operations. Many data systems are generally moving towards SQL because SQL supports complex queries.
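To make this concrete, this is roughly the kind of join-plus-aggregation a real-time backend has to answer at query time. The schema and client call are hypothetical:

```python
# Hypothetical real-time query: per-merchant transaction aggregates over the last
# five minutes, joined with merchant metadata. Issued against whatever SQL API the
# real-time store exposes.
query = """
SELECT t.merchant_id,
       COUNT(*)      AS txn_count_5m,
       AVG(t.amount) AS avg_amount_5m,
       m.risk_tier
FROM transactions t
JOIN merchants m ON m.merchant_id = t.merchant_id
WHERE t.event_time > CURRENT_TIMESTAMP() - INTERVAL 5 MINUTE
GROUP BY t.merchant_id, m.risk_tier
ORDER BY txn_count_5m DESC
LIMIT 20
"""
# results = client.sql(query)   # `client` stands in for the store's SDK
```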
3 - Flexible Schemas
ML models usually run on semi-structured data that arrives not fully formatted. If you want a real-time system, you can’t afford to spend much time cleaning and pipelining the data. In the modern world, schemas change over time. With a flexible schema, you can add or change data fields without manually updating your data model. As a result, it’s easy to work with “messy” data, which can mean that your data has mixed types or null values. Furthermore, flexible schemas enable you to iterate faster on your analytics since you don’t need to define a data model upfront. However, it is challenging to get data out of the system efficiently with flexible schemas.
Rockset recently introduced Smart Schemas - a feature that automatically generates a schema based on the exact fields and types present at the time of ingestion. There is no need to ingest the data with a predefined schema (i.e., schemaless ingest). On the other hand, other data systems limit such schema flexibility:
PostgreSQL doesn’t offer flexible schemas. Changes to Postgres require an ALTER-TABLE command that locks the table and stalls transactions.
CockroachDB can change only one schema at a time and cannot change schemas within transactions.
Hive offers flexible schemas by storing a JSON-like blob. Every blob needs to be deserialized at query time, which is inefficient.
4 - Low Latency Queries
ML systems require low latency because many of them support user-facing applications. Indexing is a technique data systems use to deliver low-latency queries: the system builds pointers into the data so that a query can reach the relevant records directly. Database indexes enable fast queries because they quickly locate data without having to scan through every row.
There are different types of indexes depending on the query. Dhruba recommended a converged index that stores every column of every document in a row-based index, column-based index, and search index. There is no need to configure and maintain indexes, and there are no slow queries because of missing indexes.
With this indexing technique, the query optimizer selects the index with low latency for both highly selective queries and large scans. More specifically, the optimizer picks between a search index (Index Filter operator), a columnar store (Column Scan operator), and a row index (Index Scan operator).
5 - Mutability
Mutability enables you to update an existing record instead of appending a new one to existing records. For example, let’s look at Facebook’s Event record: an ML algorithm was built to determine if event submissions were fraudulent. The ML model would insert a flag on any event submission it thought was fraudulent. This is an example of mutability.
To accomplish that in traditional immutable systems, events would initially be written to a partition. Events flagged as fraudulent would be written to another partition. At query time, the query would first need to find the event and then reconcile it with the fraudulent transactions. This is both compute-intensive and error-prone.
To support updates in data systems, you want to be able to update at the individual field level for greater efficiency. Many immutable systems require the entire event to be updated (copy-on-write), which is compute-intensive.
Dhruba argued that mutable systems are making a comeback. Traditionally, transactional systems were mutable (PostgreSQL, Oracle Database, MySQL, Sybase). Then there was a transition to immutability for warehouses (Snowflake, Amazon Redshift). We’re now coming back to mutability (driven by data enrichment, data backfills, GDPR, etc.), which is compute-efficient, reduces the propensity for errors, and eases the life of a developer.
The modern real-time data stack is SQL-compatible, supports cloud-native services, keeps data operations to a minimum, provides instant insights, and is affordable (efficient in both human and compute terms). Rockset is a part of this stack, so learn more about them by reading their whitepaper on their concept designs and architecture.
4 - Hamilton, A Micro-Framework for Creating Dataframes, and Its Application at Stitch Fix
Stitch Fix has 130+ “Full Stack Data Scientists,” who, in addition to doing data science work, are also expected to engineer and own data pipelines for their production models. One data science team, the Forecasting, Estimation, and Demand (FED) team, was in a bind. Their data scientists are responsible for forecasts that help the business make operational decisions. Their data generation process was causing them iteration and operational frustrations in delivering time-series forecasts for the business. More specifically, featurized data frames have columns that are functions of other columns, making things messy. Scaling featurization code results in inline data frame manipulations, heterogeneity in function definitions and behaviors, and inappropriate code ordering.
Stefan Krawczyk presented Hamilton, a general-purpose micro-framework for creating dataflows from Python functions that solved their pain points by changing their working paradigm.
Specifically, Hamilton enables a simpler paradigm for a Data Science team to create, maintain, and execute code for generating wide data frames, especially when there are lots of intercolumn dependencies.
Hamilton does this by building a DAG of dependencies directly from Python functions defined uniquely, making unit testing and documentation easy.
Users write their functions, where the function name is the output column, and the function inputs are the input columns. Then they can use Python-type hints to check the DAG before executing the code.
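A minimal example in Hamilton’s documented style looks like the sketch below; the spend/signups columns are taken from Hamilton’s own hello-world example rather than Stitch Fix’s code.

```python
# feature_logic.py - each function declares one output column; its parameter names
# are the input columns (or other functions) it depends on.
import pandas as pd

def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling three-week average of marketing spend."""
    return spend.rolling(3).mean()

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend per signup."""
    return spend / signups

# "driver" code - builds the DAG from the module and executes the requested outputs:
#   from hamilton import driver
#   import feature_logic
#   inputs = {"spend": pd.Series([10, 10, 20, 40]), "signups": pd.Series([1, 10, 50, 100])}
#   df = driver.Driver(inputs, feature_logic).execute(["avg_3wk_spend", "spend_per_signup"])
```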
Hamilton has been in production for 2+ years at Stitch Fix, and it has been a huge success. The original project goal was to improve testability, documentation, and the development workflow. Adopting Hamilton enabled workflow improvements thanks to the ability to visualize computation, faster debug cycles, better onboarding and collaboration, and a central feature definition store:
Everything is naturally unit-testable and documentation-friendly. There is a single place to find logic and a single function that needs to be tested. Function signature makes providing inputs very easy and naturally comes with documentation.
Visualization is effortless with a DAG, making onboarding simpler. You can create ‘DOT’ files for export to other visualization packages.
Debugging and collaborating on functions is also simpler as issues can be isolated quickly. It is easy to assess impact and changes when adding a new input, changing the function's name, adding a brand new function, or deleting a function. As a result, code reviews are much faster, and it is easy to pick up where others left off.
A nice byproduct of using Hamilton is the Central Feature Definition Store, which is in a central repository, versioned by git, and organized into thematic modules. Users can easily work on different parts of the DAG, find/use/reuse features, and recreate features from different points in time.
Stefan proceeded to give five pro tips to use Hamilton:
1 - Using Hamilton within your ETL system
Hamilton is compatible with Airflow, Metaflow, Dagster, Prefect, Kubeflow, etc. The ETL recipe is quite simple: You first write Hamilton functions and “driver” code. Then, you publish your Hamilton functions in a package or import them via other means. Next, you include sf-hamilton as a Python dependency and have your ETL system execute your “driver” code. Profit!
2 - Migrating to Hamilton
First, you should build a pipeline that quickly and frequently compares results between your old code and the Hamilton version, integrating it with your Continuous Integration (CI) system if possible to help diagnose bugs in your old and new code early and often.
Second, you should wrap Hamilton in a custom class to match your existing API. When migrating, this allows you to avoid making too many changes and easily insert Hamilton into your context.
3 - Key Concepts to Grok
Common Index: If creating a DataFrame as output, Hamilton relies on the series index to join columns properly. The best practice is to load data, transform and ensure indexes match, and continue with transformations. At Stitch Fix, this meant a standard DateTime index.
Function naming: Naming creates your DAG, drives collaboration and code reuse, and serves as documentation itself. You don’t need to get this right the first time, as you can easily search and replace code as your thinking evolves, but it is something to converge thinking on.
Output Immutability: Functions are only called once; to preserve the “immutability” of outputs, they should not mutate the data structures passed in. You should test for this in your unit tests and clearly document it if you do mutate inputs.
Code Organization and Python Modules: Functions are grouped into modules, which are used as input to create a DAG. You should use this to your development advantage by using modules to model team thinking, which helps isolate what you’re working on and enables you to replace parts of your DAG for different contexts easily.
Function Modifiers (aka, decorators): Hamilton has a bunch of these to modify function behavior, so you should learn to use them.
The initial version of Hamilton is single-threaded, cannot scale to “big data,” can only produce Pandas DataFrames, and does not leverage all the richness of metadata in the graph. As a result, Stefan and his team have been working on extensions: scaling computation (to scale Pandas out of the box) with experimental versions of Hamilton on Dask, Koalas, and Ray; removing the dependence on Pandas and giving control over what final object is returned (a dictionary, a NumPy matrix, or your custom object); and “row-based” execution to enable data chunking and use cases that deal with unstructured data. They will work on Numba integration, data quality, and lineage-surfacing tools in the future.
5 - Data Engineering Isn’t (Like Other Types Of) Software Engineering
Data engineers and data scientists often push to adopt every pattern that software engineers use. But adopting things that are successful in one domain without understanding how they apply to another domain can lead to “cargo cult” type behavior. Erik Bernhardsson (Ex-CTO of Better and Founder of Modal Labs) explained why working with data may require different workflows and systems.
In the traditional software world, engineers ship business logic with a well-defined behavior, test it by writing unit tests, and mostly add features to existing services. Software load is fairly predictable. Almost all features are finished. Services need low latency. There are no particular hardware needs.
A lot of these do not apply to the data world. Data people ship insights or models with fuzzy behavior, run tests on all data, and generally build standalone pipelines. The data load is bursty. Many projects are abandoned. Higher latency is OK. There is a need for GPUs.
Erik argued that software engineering is still maturing. Just 10-15 years ago, these ideas were considered novel: running everything in the cloud, using containers, running Javascript on the backend, using version control (at least 20 years ago), leaning toward trunk-based development, and using test suites. Some ideas are still controversial, like mono-repo vs. many repositories, microservices vs. monolith, and continuous deployment.
Furthermore, the workflows are different across different forms of engineering: frontend, backend, analytics, and data/ML. Specifically, Erik believes the data/ML engineering workflow is broken because we are using tools optimized for a different domain. He is working on a new tool that creates a better feedback loop across these tools to tackle this challenge.
6 - Using Redis as Your Online Feature Store: 2021 Highlights. 2022 Directions
With the growing business demand for real-time predictions, we are witnessing companies making investments in modernizing their data architectures to support online inference. When companies need to deliver real-time ML applications to support large volumes of online traffic, Redis is most often selected as the foundation for the online feature store because of its ability to deliver ultra-low latency with high throughput at scale. Guy Korland (CTO) and Ed Sandoval (Senior Product Manager) shared key observations around customers, architectural patterns, use cases, and industries adopting Redis as an online feature store in 2021 - while pre-empting some directions in 2022.
Redis is an open-source, in-memory data structure store, used as a cache, message broker, and database. Redis (the company) is home to Redis (the open-source). It also offers two commercial products: Redis Cloud and Redis Enterprise. As seen below:
Redis core works with different data structures like strings, sets, bitmaps, sorted sets, bit fields, geo-spatial indexes, hashes, hyperloglogs, lists, and streams.
Redis modules come in formats such as JSON, graph, bloom filter, time series, search, gears, and AI.
Redis Enterprise brings an additional layer of enterprise-like capabilities, including linear scalability, high availability, durability, backup and restore, geo-distribution, tiered-memory access, multi-tenancy, and security.
In the 2000s, we mainly worked with relational structured data thanks to Oracle, IBM, and Microsoft. In the 2010s, we shifted towards NoSQL and document-based databases such as MongoDB, Couchbase, Cassandra, and ElasticSearch. In the 2020s and beyond, Guy believed that we would be moving towards a real-time, in-memory data platform for modern applications (Redis is a shining example).
Redis is the most suitable database to be used with a feature store because it is a super-fast storage for your online feature store:
It offers feature serving to meet ultra-low latency and high throughput requirements in real-time predictive applications.
It offers streaming feature support such as real-time event data ingestion, aggregation, transformation, and storage.
It offers native support for multiple data structures like documents, time-series, graphs, embeddings, lists, sets, geo-location, etc.
It has high availability, scalability, and cost-efficiency.
Redis Enterprise has additional operational and cost efficiencies, such as support for larger feature sets with Redis on Flash, global geo-distribution, and flexible deployments.
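As a concrete illustration of that feature-serving pattern, here is a minimal redis-py sketch (key and field names are hypothetical): the pipeline writes the latest feature values for an entity as a hash, and the prediction service reads them back with a single low-latency lookup.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Batch or streaming pipeline refreshes the latest features for a user.
r.hset("features:user:42", mapping={
    "txn_count_5m": 17,
    "avg_order_value_30d": 83.4,
    "uses_vpn": 1,
})

# Prediction service fetches the whole feature vector in one round trip.
feature_vector = r.hgetall("features:user:42")
```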
In 2021, Redis observed significant organic adoption and collaboration:
In November 2020, DoorDash published one of the first blog posts that cite Redis as an important part of their low-latency feature serving infrastructure.
In April 2021, Redis ran its annual event called RedisConf, in which the team positioned Redis as an online feature store and as a vector similarity or embedding store. More companies started publishing blog posts about using Redis for their online feature stores (Zomato, Swiggy, Lyft, Robinhood, Comcast, Wix, etc.).
In July 2021, Redis collaborated with the open-source project Feast to improve its performance.
In September 2021, another exciting use case came from Udaan, which runs a Feast-based feature store on Azure. This led to an engineering collaboration between Redis and Microsoft to build out “Feast on Azure.”
In October 2021, Uber reported the cost-effectiveness of using Redis for some of its feature sets.
Most recently, in February 2022, Feast published a benchmark blog post of serving features in different data stores.
Ed then walked through three selected use cases of Redis powering online feature stores:
DoorDash uses Redis for its store ranking application. The key requirements include low latency and persistent storage for billions of records and millions of entities (consumers, merchants, menu items), high-read throughput (“store ranking” alone generates millions of predictions per second), real-time feature aggregation (listening to a stream of events, aggregating them into features, and storing them), fast batch writes (features are refreshed periodically in batches), heterogeneous data types (strings, numeric, lists, or embeddings), and high availability and scalability.
AT&T uses Redis to combat fraud. There has been a significant drop in “fraud events NOT stopped” as the company moved from a rules-based fraud detection approach to a 25+ ML model approach. The key requirements include scoring throughput (10 Million transactions per day), scoring latency (under 50ms), hundreds of real-time features (constantly monitoring real-time events, aggregating, and turning them into features readily available for scoring), and high availability and scalability.
Uber incorporates Redis into the design of Michelangelo Palette. After a successful launch of Michelangelo in 2017, the engineering team at Uber spent a couple of years building a scalable platform. The key requirements include low latency (under 10 ms), high throughput (up to tens of millions QPS), extreme personalization (with millions of users, restaurants, menu items, and even restaurant-specific models), choice of serving infrastructure (local Java Cache, Redis, Cassandra), efficient batch data writes, and high availability and scalability.
Ed concluded the talk with some 2022 predictions on how online serving will go:
Feature store vendors will target ultra-low latency scenarios: Take a look at the Feast benchmark for the details.
There will be more online feature store adoption: Expect to see increased adoption in traditional mass consumer markets like financial services, media and advertising, and e-commerce platforms. Furthermore, digital-native brands have high demands for personalization at scale.
There will be a growing importance of low-latency feature serving: Near real-time features are the “MVP signal” for real-time ML-based applications, as they provide a valuable source of user behavior and intent. “Latency is the new outage” mentality will become prevalent: when latency goes up from single-digit ms to 50ms-70ms, this has a significant knock-on effect on user experience. Geo-distribution for additional scalability, high availability, and business continuity will be the norm.
7 - Data Transfer Challenges for MLOps Companies
Companies offering ML platforms as a service need access to customer data during the demo phase. These companies usually require that the data be moved to their own warehouse. This is not a simple task and can become costly if the two partners are using different cloud service providers. Apart from the challenges of data transfer, there is also the matter of compliance and privacy. For sensitive data, a secure transfer is not enough, and masking and other anonymization measures need to be implemented. Pardis Noorzad (Former Head of Data Science at Carbon Health) reviewed the myriad roadblocks faced by MLOps companies in working with customer data during the demo phase and discussed some potential solutions.
When providing enterprise software, one approach allows customers to install it on-premise to avoid data transfer altogether. This works particularly well for both open-source and proprietary solutions. However, the drawback to this approach is that you need to ensure that your customers are savvy enough to use the software as intended at its full potential and make a fair/correct assessment of the software capabilities.
Another approach that platform companies have is to provide a hosted cloud solution to more efficiently manage versioning, bugs, customer service, etc. Additionally, they sometimes need to tune/manage the cloud services in a specific way to manage cost and performance and move data to where GPU costs are affordable. The good news is that many platform companies offer cloud-native solutions spanning multiple cloud providers. However, this is certainly not the case for all platforms. Therefore, sometimes, data transfer is inevitable.
The status quo in data transfer is to set up an SFTP server, download the data from the warehouse into a CSV, and upload the data to the SFTP server. The receivers then download the data and upload it to their warehouse using non-trivial tools. Another approach is to replace the SFTP server with a cloud storage bucket. In both cases, data needs to leave the cloud, be saved as a CSV (where important information may be lost), be temporarily stored on some computer, and be uploaded to a lake or warehouse at the destination with the help of additional tooling. Obviously, the process described above is not fun. The tooling market is fragmented, and solutions tend to be specific to a particular choice of cloud.
Additionally, as these pipelines become more complex, they leave the scope of the data team and fall into DevOps or engineering teams to facilitate the transfer. This is not ideal because platform companies want to interact directly with the teams that will become the platform's users (as they iterate on the required data) and be in charge of making final decisions.
Managing security and staying compliant is not easy. Even during the evaluation phase, customers must stay compliant with HIPAA, GDPR, CCPA, and other legislations depending on the type of data they manage. Without the right tools, it is challenging to remain compliant. Sometimes, customers mistakenly send data containing PII without the right contract in place, do not conform to the correct data schema, and do not ensure data completeness. With the current approach to data sharing and data transfer, it is difficult to send requirements and validate the data.
When customers evaluate multiple platforms at different times, each gets a slightly different version of the data since different people could have been responsible for the transfer. It’s hard to keep track of sources of truth, which causes issues to make a fair evaluation. It’s also tricky to handle audits.
Pardis argued that we need a cloud-agnostic, simple, and secure transfer service. It should share the same version of the data with many platforms. It should handle permissions and processes, as well as hashing, masking, and encryption. Data validation and testing should be automated based on contracts and user-defined based on requirements. She ended the talk by calling for collaboration to establish such a service.
8 - Building Feature Stores on Snowflake
Feature stores can't exist without the underlying compute and storage platforms that they orchestrate. While there are many different options to choose from in the field today, several decisions need to be made to ensure scalability, reliability, and performance, among other concerns.
Snowflake, the Data Cloud, is a cloud-agnostic data platform that offers many of the same core value propositions as a traditional warehouse, but with additional capabilities around data sharing, unstructured and semi-structured data support, and, of course, limitless scalability to meet the demands of your machine learning workloads. In his short talk, Miles Adkins (Senior Partner Sales Engineer, AI & ML) walked through the benefits of building feature stores on Snowflake.
Snowflake is one place to access all relevant data instantly. It helps:
Reduce data collection time by being a single point for discovery and access to a global network of high-quality data.
Bring all data types into your model with ease with native support for structured, semi-structured, and unstructured data.
Build powerful models with shared data/services by easily incorporating shared data, 3rd-party data, and data services via Snowflake Data Marketplace.
Furthermore, Snowflake provides fast data processing with no operational costs. You can:
Prepare data with your language of choice: Snowflake supports ANSI SQL, Java, Scala, and Python with Snowpark for feature engineering.
Handle any amount of data or users: Intelligent multi-cluster compute infrastructure instantly scales to meet your data preparation demands without bottlenecks of user concurrency limitations.
Automate and scale feature pipelines: Snowflake’s Streams and Tasks automate feature engineering pipelines for model inference.
Snowflake checks a lot of boxes in terms of requirements for an offline feature store. It supports diverse feature types (batch features, streaming features, and real-time features) and diverse data types (structured, semi-structured, and unstructured data). Another differentiating attribute of Snowflake is its ability to enable data collaboration across 1st-party data (your organization’s data), 2nd-party data (secure shares among partners), and 3rd-party data (the Data Marketplace).
9 - How to Choose The Right Online Store for Your ML Features
Where should you store your ML features to power real-time ML predictions, and why? Tecton’s Co-Founder and CTO, Kevin Stumpf, discussed the tradeoffs made and lessons learned while building the Feature Stores at Uber, Tecton, and Feast.
A feature store is an interface between data and models. It serves real-time features to your models that run in production making predictions and serves historical features to model training. A feature store typically doesn’t just implement its own database; instead, it abstracts access to an online store and an offline store depending on the usage.
In contrast to an offline store, the online store serves features for online predictions. It typically only serves the most recent feature value for a given entity. The load can be quite high - hundreds/thousands/millions of queries per second. In most use cases, the feature vector is small - a couple of bytes/kilobytes. Latency matters a lot for online stores.
Popular options for an online store are Cassandra, Redis, and DynamoDB. Kevin walked through five considerations that are important to keep in mind when choosing an online store:
Team expertise: Pick the one your team is most comfortable with.
Managed or unmanaged: Go with a managed solution whenever you can. DynamoDB provides a very good user experience.
Latency requirement: How sensitive is your use case to p50 and p99 (tail) latencies? Can you stomach ~100 ms? Do you really need ~10ms? Redis outperformed Dynamo in all of Tecton’s benchmarks.
Read and Write Load: Is your case particularly read-heavy or write-heavy? Do you need more than just individual row-lookups? Do you require ACID properties? Key-value stores perform well at a very high load and will outperform your standard relational database for standard feature store queries.
Cost Efficiency: The pricing model varies across databases. Some stores charge by writes and reads; others charge by node size. Consider using provisioned instances whenever you can. The more write-heavy the workload, the greater Redis’ advantage. The more sparse your traffic, the greater DynamoDB’s advantage.
Kevin’s team recently ran a simple benchmark comparing Redis to on-demand DynamoDB. The test setup included 20,000 feature requests per second and ten feature tables per request - resulting in 200,000 requests per second to the store. As seen below, Redis outperformed DynamoDB specifically on tail latency, while its cost was also significantly lower.
Here are some quick rules of thumb that Kevin ended the talk with:
If you have very high-scale use cases (>10k QPS) or very low (tail) latency requirements, choose Redis.
If your load is sparse or you care a lot about convenience, choose DynamoDB.
If strong consistency or complex search queries are important, choose Aurora.
In the end, not all ML use cases are created equal. Ideally, your system can transparently use a different store for a different use case or feature.
10 - Managing Data Infrastructure with Feast
Feast provides a simple framework for defining and serving machine learning features. In order to serve features reliably, with low latency and at a high scale, Feast relies heavily on cloud infrastructures such as DynamoDB or AWS Lambda. Felix Wang (Software Engineer at Tecton) explained how Feast manages its serving infrastructure reliably and predictably for users and also discussed how the process of managing data infrastructure with Feast can fit into a CI/CD pipeline.
Feast is an open-source feature store that entered the Linux Foundation for AI in September 2020. It has an active community with 3k+ members on Slack and bi-weekly community calls. It has received contributions from tech companies like Twitter, Shopify, Robinhood, Salesforce, Gojek, and IBM. Its architecture displays a strong focus on a cloud-native and serverless experience.
It works well with data warehouses such as BigQuery, Redshift, and Snowflake.
It aims to be modular and extensible, with connectors to AWS, GCP, Azure, Snowflake, Redis, Hive, and more.
It has a couple of abstractions, such as an offline store and an online store, making it easy for contributors to plug into and write their own connectors.
It has all the standard functionality of a feature store: feature storage, feature transformations, feature serving, data quality monitoring, feature registry, and feature discovery.
There are two core flows in the Feast architecture design:
In the offline flow: A data scientist uses Feast to query the offline store and yield some data frames to be plugged into the training algorithm. That algorithm yields a model to be put in production in an application, typically done in a notebook.
In the online flow: A data scientist first materializes features from the offline store into a low-latency online store. They then retrieve features from the online store and serve them to the model in production, which powers predictions. Feast supports various feature retrieval options during the online flow, including the Python SDK, a Python server, a Java server, and AWS Lambda.
Feast manages data infrastructure based on a declarative definition of the feature repository and a central configuration file. feast plan and feast apply commands will preview and apply the declared feature repository, similar to Terraform. In a typical workflow using Feast, you first define data sources and entities, next define feature views with associated features, and finally configure DynamoDB and AWS Lambda to launch your infrastructure.
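For illustration, a feature repository definition along those lines might look like the sketch below; the entity, field, and path names are made up, and argument names vary slightly across Feast versions.

```python
# feature_repo/features.py - declarative definitions that `feast plan` previews
# and `feast apply` turns into serving infrastructure.
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

user = Entity(name="user", join_keys=["user_id"])

user_activity_source = FileSource(
    path="data/user_activity.parquet",
    timestamp_field="event_timestamp",
)

user_activity_stats = FeatureView(
    name="user_activity_stats",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="transactions_7d", dtype=Int64),
        Field(name="avg_transaction_amount", dtype=Float32),
    ],
    source=user_activity_source,
)
```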
Most ML teams currently version features with Git. Data scientists can create a pull request (PR), and CI/CD can run feast apply to update serving infrastructure when the PR is merged. feast plan can make this process even more reliable by revealing in the PR exactly what infrastructure changes will take place and can be automatically run on a PR to leave a comment.
11 - Twitter’s Feature Store Journey
A feature store is an essential piece of a production ML system. Twitter’s journey of building feature stores began several years ago and has gone through multiple iterations since then to facilitate creating, organizing, sharing, and accessing ML features in production. Tzvetelina Tzeneva (Staff ML Engineer at Twitter) touched on the key parts of this journey, the move from a virtual to a managed store, and their decision to adopt Feast.
The need for something to manage ML features surfaced at Twitter about 3 to 4 years ago, and several solutions were attempted. However, none of these attempts stuck, for several technical and organizational reasons: proxy-service solutions ran into operational cost and performance issues; the tools were built by product teams, leading to weak platform ownership and weak feature/data ownership; and they solved only a small part of the feature management problem (e.g., just offline access).
Twitter’s feature store 1.0 was built as a virtual feature store - a framework for registering and accessing features in a uniform way on top of already existing data. A virtual store is quicker to build with a small team of engineers. Twitter already had a vast set of existing data assets, a high cost of duplicating data, and existing frameworks for aggregating batch and real-time data, so making the data easily accessible and shareable was what they needed. The north star here was feature sharing.
At its core, Twitter’s feature store 1.0 is a JVM-based library for registering reusable extraction/transformation logic and client code for accessing the underlying data. It applies these transformations in online and offline settings in a consistent way. Its domain-specific language for registering features allowed for almost all types of transformations that Twitter’s customers came up with (feature crosses, feature aggregations, feature templates, parameterized features, and entity management). The guiding principle is that, after some initial setup, adding a new feature is a “one-line” effort in a configuration file.
This system has been adopted widely at Twitter across the three largest organizations, 10+ teams, and tens of models. It has been serving 300+ feature sources and 3000+ features, enabling a lot of cross-team feature sharing. However, there were notable challenges:
Performance and complexity: Their online client was trying to do it all, so they ran into latency challenges with very high-throughput and low-latency use cases.
Offline experimentation is still slow as the store is based on outdated infrastructure (Hadoop), so feature joins take hours, and jobs are difficult to iterate on.
There was a large operational burden on producers and consumers to manage online storage and caching.
Last year, Twitter’s ML platform team started moving towards GCP in a hybrid cloud strategy, resulting in a new tech stack (BigQuery, Dataflow, and Python) and new operational challenges with getting data to and from GCP (online serving still needs to happen in on-prem data centers). For this feature store 2.0 version, they have a few new goals, such as taking on more of the operational burden, speeding up offline experimentation, and speeding up online access (pre-materialized transformations). As a result, they evaluated Feast and decided to adopt it, considering that Feast aligns with Twitter ML Platform’s tech stack (BigQuery, Python) and strategy (being open-source). As seen below, Twitter’s feature store 2.0 consists of various offline and online storage systems tied together through Feast. Feature transformations are pre-materialized in storage rather than performed on the fly.
Moving forward, the ML platform team at Twitter is building towards a more performant lean online client and a managed service for feature hydration and model serving as an option. Additionally, they plan to support streaming and on-demand transformations, data quality monitoring, and integration with model metadata.
Tzvetelina concluded the talk with these lessons from Twitter’s feature store journey:
Building a global feature store can lead to an active feature marketplace and community.
Do not solve all problems at once. Pick a north star (whether it be feature discovery, feature generation, offline experimentation, online experimentation, versioning, embeddings support, etc.).
Don’t chase long-tail use cases around feature access and transformations.
Guaranteeing consistent runtime performance for every feature access pattern is impossible. Invest early in benchmarking and setting expectations with customers.
12 - ML Projects Aren’t An Island
We have all seen the dismal (and, at this point, annoying) charts and graphs of “>90.x% of ML projects fail” used as marketing ploys by various companies. This oversimplified view of ML project success rates hides the reality in a misleading abstraction: some companies have a 100% success rate with long-running ML projects while others have a 0% success rate.
Ben Wilson (Principal Resident Solutions Architect at Databricks) went through a simple concept that is obvious to the 100% success rate companies but is a mystery to those that fail time and again: a project is not an island. His core thesis is that an ML project developed with entirely self-contained feature engineering isolates otherwise incredibly useful work.
Ben made the analogy that working on an ML project can be like being on an island: gathering requirements for an MVP and getting data can be slow, and each project ends up implementing the same transformations in different ways. Getting features out of a pipeline can be hard. You can create a custom framework as an abstraction on top of data transformation tasks, but do you want to pay for features to be calculated repeatedly for each use case? Who will own that transformation codebase? Do you want to maintain that infrastructure on your own?
It is common knowledge that data science teams spend a lot of time building features, which can be error-prone, challenging to monitor and debug, and expensive to recalculate over and over again, and which often leads to duplicated logic across projects. There is not a lot of time left to operationalize pipelines.
Ben then walked through a hypothetical ML project that needs some complex feature engineering: predicting whether a user will click on an ad or not. You need to perform point-in-time joins, which are tricky to write manually. You need to know if a user clicked on an ad (streaming data), the user history (calculated asynchronously every hour), and the ad metadata (static data updated whenever a new ad is created). This data is coalesced to be used both at training and at inference time.
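The point-in-time join itself can be sketched in plain pandas with merge_asof: for each impression, attach the latest user-history features computed at or before the impression time so that training never leaks future data. Column names here are hypothetical.

```python
import pandas as pd

impressions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2022-03-01 10:05", "2022-03-01 12:30", "2022-03-01 11:00"]),
    "clicked": [0, 1, 0],
}).sort_values("event_time")

user_history = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2022-03-01 10:00", "2022-03-01 12:00", "2022-03-01 10:00"]),
    "clicks_last_24h": [3, 5, 1],
}).sort_values("feature_time")

# For each impression, take the most recent user-history row at or before event_time.
training_df = pd.merge_asof(
    impressions, user_history,
    left_on="event_time", right_on="feature_time",
    by="user_id", direction="backward",
)
```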
In actuality, you create an ML pipeline that compartmentalizes all the data manipulation into its own pipeline (as observed below).
A better solution is to use a feature store. Using the example of Databricks Feature Store, you can create a feature table and give it the construction logic for specific data frames. You can schedule periodic updates to that feature table asynchronously and publish to an online feature store as well. In order to create a full dataset that emulates a particular pipeline utilization of the data to solve a particular use case, you can:
Define the sources and the join keys involved from the registered feature tables.
Register the training dataset that joins the clickstream data to the slowly-moving dimensional tables.
Start the stream-table join and return the data frame object.
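With the Databricks Feature Store client, those three steps look roughly like the sketch below; table, column, and key names are hypothetical, and argument names are approximate, so check the documentation for the exact API.

```python
from databricks.feature_store import FeatureStoreClient, FeatureLookup

fs = FeatureStoreClient()

feature_lookups = [
    FeatureLookup(table_name="ads.user_history_features", lookup_key="user_id"),
    FeatureLookup(table_name="ads.ad_metadata_features", lookup_key="ad_id"),
]

# clickstream_df carries the labels plus the join keys (user_id, ad_id).
clickstream_df = spark.table("ads.click_events")   # `spark` is provided by the Databricks runtime

training_set = fs.create_training_set(
    df=clickstream_df,
    feature_lookups=feature_lookups,
    label="clicked",
)
training_df = training_set.load_df()   # joined DataFrame used to train the model
```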
A feature store fully integrated into an analytics platform helps you operationalize the ML development lifecycle (training, prediction, monitoring, analysis, explainability, etc.). If interested, you should check out the Databricks Feature Store, as well as Ben’s Manning book called “ML Engineering In Action.”
13 - The Data Engineering Lifecycle
Data engineering is finally emerging from the shadows as a key driver of data science and analytics. Data tooling and practices are becoming increasingly simple and abstract. Data and ML engineering will start focusing more on the holistic journey of data and its use in data products. Given the rapid development of the space, Joe Reis (CEO of Ternary Data) argued that understanding the lifecycle of data is important to figure out what’s not likely to change over the upcoming years.
A generic data lifecycle includes universal steps such as data generation, data collection, data processing, data storage, data management, data analysis, data visualization, and data interpretation. The data engineering workflow sits between systems that generate data and downstream use cases like data science and analytics. A data engineer gets raw data, processes it, and creates useful data models for data scientists/analysts.
The data engineering lifecycle resembles the one below. Steps like data ingestion, data transformation, and data serving are unlikely to change soon. What about the undercurrents of data engineering? Where do they go in the data engineering lifecycle?
Security means access control to data and systems.
Data management includes data governance (discoverability, definitions, accountability), data modeling, and data integrity.
DataOps entails best practices of applying DevOps to data around automation, observability and monitoring, and incident response.
Data architecture involves doing trade-off analysis, designing for agility, and adding value to the business.
Orchestration is the process that coordinates workflows, schedules jobs, and manages tasks.
Software engineering encapsulates programming and coding skills, software design patterns, and testing/debugging methodology.
If you want to follow ideas around these undercurrents, check out Joe’s upcoming O’Reilly book called “Fundamentals of Data Engineering.”
14 - Using Feast in a Ranking System
Vitaly Sergeyev (Senior Software Engineer at Better) provided a walkthrough of several architectures that his team considered to manage features and rank/re-rank large volumes of entities.
For context, the ML team at Better has several models that depend on feature vectors, which represent entities in different ML systems. Feature views are distinct groupings of feature sets, and multiple feature views comprise the feature vector that a model depends on. The main reason to group features is that they come from different sources (Excel, S3, Snowflake, Kafka, etc.). Everything then syncs into an online store used for low-latency feature retrieval.
Vitaly’s talk focused on Better’s ranking system. An important consideration from the architectural standpoint is: can you score the entity by itself and later push it to a queue of scores? This determines the ranking algorithm of choice. This ranking system powers Better’s lead-scoring application, which ranks prospects for sales outreach to raise conversion or some other perceived value. New leads are ingested consistently throughout the day. Leads may be engaging with the product, but they become less likely to convert as time passes without some interaction.
Maintaining an efficiently ordered list means constantly interacting with entities and their moving features. For Better’s case, they perform ranking or re-ranking when retrieving new entities, updating features for existing entities, and writing decay functions for entities with low interaction.
Better’s ranking system is set up as seen above:
Data comes from various sources (APIs, scrapers, dumps, and production environments) into an offline store environment consisting of S3 and Redshift. They use dbt to manage data models and training/prediction datasets.
Feast helps materialize things from the offline store to an online store (Redis) and keep track of business logic. For instance, a lead is no longer a lead after 48 hours, so they expire the associated entity and its feature vectors after 48 hours.
Any new lead is ingested from the streaming process. The lead scoring service validates, transforms, and scores the lead. The score is passed to a queue system and also ingested directly into Feast.
Because new features are coming in, they can reuse the ETL process and data infrastructure to accelerate this whole workflow.
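A hedged redis-py sketch (not Better’s actual code) of two of the mechanics described above - keeping the score queue ordered and expiring a lead’s features after 48 hours - could look like this:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def upsert_lead_score(lead_id: str, score: float, ttl_hours: int = 48) -> None:
    """Insert or update a lead's score; the sorted set stays ordered by score."""
    r.zadd("lead_scores", {lead_id: score})
    r.expire(f"lead:{lead_id}:features", ttl_hours * 3600)   # feature vectors expire after 48h

def top_leads(n: int = 10):
    """Highest-scoring leads first, for the sales-outreach queue."""
    return r.zrevrange("lead_scores", 0, n - 1, withscores=True)
```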
Vitaly concluded his talk with two things to consider when building ranking systems:
There are different feature retrieval patterns for ranking. Pointwise Scoring and LearnToRank are two popular approaches. Choose the one that fits the technical and business needs of your system.
Large feature retrieval for large sets of entities is crucial. Redis is great for lookups. You can expire/clean entities in the Online Store and combine Pointwise scoring with Learn To Rank to filter entities as needed.
While the operational ML space is in its infancy, it is clear that industry best practices are being formed. I look forward to events like this that foster active conversations about ideas and trends for ML data engineering as the broader ML community grows. If you are curious and excited about the future of the modern ML stack, please reach out to trade notes and tell me more at james.le@superb-ai.com! 🎆