Datacast Episode 59: Bridging The Gap Between Data and Models with Willem Pienaar

The 59th episode of Datacast is my interview with Willem Pienaar — the Engineering Lead at Tecton and the creator of Feast, a feature store for machine learning.

We had a wide-ranging conversation that covers his entrepreneurial journey founding and selling a networking startup in college, his time building industrial data systems, his work on designing and scaling Gojek’s Machine Learning platform, his well-known open-source feature store Feast, his move to Tecton to build an enterprise-grade feature store, and much more.

Listen to this episode on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) Stitcher, (5) Breaker, (6) iHeart Radio, and (7) TuneIn

Key Takeaways

Here are highlights from my conversation with Willem:

On Selling a Networking Startup in College

Stellenbosch University (Source: http://www.sun.ac.za/english)

I had to put myself through university. We started a company out of my dorm room, where we resold wireless Internet. On our student campus, it was expensive to get Internet. If you had an ADSL line, you could resell it by sharing the line with other people. I put a Wi-Fi router on the rooftop of my room and started to resell it. After 5+ years, it grew to cover the whole city. We had towers up on the mountain. We had employees. We had contracts with buildings and businesses and hundreds of paying customers. It’s somewhat crazy how my need to get through university and generate some income could grow into that business. Once I finished my degree, I sold the business to our competitor.

Don’t do a full-time engineering degree while running a full-time business that requires 24/7 attention. That period was the toughest time of my life. At one point, lightning struck and collapsed one of our towers at 2 AM, and customers started reaching out for us to fix the problem. And I had an exam the next day. That is not a good combination.

I learned that I am capable of a lot. If there is a necessity, I will find a way to do it. This experience required me to get out of my comfort zone. As an 18-year-old college student, going up to management consulting corporates and convincing them to commit to a contract was quite scary. I did not have an understanding of any contractual or legal frameworks at the time. But it happened anyway because we were solving a real problem for people.

On Building Industrial Data Systems

Most industrial companies have existing legacy machinery that needs to be digitized. It’s always the integration points that are the problem: lack of documentation, no connectivity, etc. Given a 30-year-old machine, you suddenly need to turn it into something that produces modern data points.

There were hundreds of little challenges I had to overcome to complete any project. Each project was bounded by a contractual amount, or else the money invested would be lost. Most of the challenges were actually common to consulting: they were about scoping and setting expectations more than about technical problems.

On Joining Gojek

In 2017, Gojek was a rocket ship that was taking off. The name stands for “motorcycle taxi” in Indonesian. The company provides a single app that fulfills all the customers' workday needs (getting food, making purchases, ride-hailing, etc.). They built a super-app that eventually consisted of 16–17 services and products (digital payments, logistics networks, careers, e-commerce, groceries, lifestyle services).

When I joined, Gojek had a core product but no data foundation or machine learning applications. They knew that they were sitting on a mountain of the best data for millions of Southeast Asians. They had a team of data scientists who could not get anything into production, so they decided to hire an engineering team to help these data scientists get something into production.

My team’s initial entry point was building ML systems and getting an uplift in core metrics/reducing fraud activities.

On Designing Gojek’s ML Platform

Our ML platform team focuses on the end-to-end ML lifecycle from idea to production. Our platform was designed to be self-service, so that data scientists could go from nothing to something without involving an engineer.

  • Clockwork is probably the most widely adopted system that our team built. It consists of an Airflow scheduler with an abstraction layer on top. It is an opinionated way for you to define DAGs (Directed Acyclic Graphs) of jobs declaratively. We compiled our top-level YAML graphs down into an Airflow pipeline. For some reason, this was a much safer way for our team to operate Airflow than giving data scientists access to code. We basically containerized their code, ran it as containers, and built an Airflow graph (instead of giving them access to Python, which ended up being extremely inefficient and unstable).

  • Merlin is our model serving system, an abstraction built on top of both KFServing and MLflow. A data scientist trains a model and publishes it into our model registry using Merlin. Then, they can build new versions of that model and deploy them with a one-click button that kickstarts an API. Merlin runs on Istio and KFServing behind the scenes, so it’s a completely serverless environment.

  • Feast is our open-source feature store. It is our operational data system for getting data from our upstream data warehouses/streams into production, where it is used to train ML models and is served to them.

  • Turing is our experimentation system, mostly used for online experiments. If you have a request coming into your model, that request first hits Turing. Then, Turing decides which model to delegate that request to. Users can configure all kinds of experiment settings: balancing/weighting the models, actions to take, fallback options, etc. Ultimately, you are able to tie your models' performance back to outcomes.
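As a rough illustration of the declarative approach described for Clockwork above, here is a minimal sketch of validating a job graph and ordering it for execution. It is not Clockwork's code and does not use Airflow: the spec format, job names, image names, and the `compile_spec` function are all assumptions made for this example, and a plain dictionary stands in for the parsed YAML.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical declarative spec. In a Clockwork-like system this would be
# parsed from a YAML file; the job and image names here are made up.
SPEC = {
    "jobs": {
        "extract": {"image": "example/extract:1", "depends_on": []},
        "features": {"image": "example/features:1", "depends_on": ["extract"]},
        "train": {"image": "example/train:1", "depends_on": ["features"]},
        "report": {"image": "example/report:1", "depends_on": ["features"]},
    }
}


def compile_spec(spec):
    """Validate a declarative job graph and return (job, image) pairs
    in an order where every job comes after its dependencies."""
    jobs = spec["jobs"]
    graph = {}
    for name, job in jobs.items():
        deps = job.get("depends_on", [])
        unknown = [d for d in deps if d not in jobs]
        if unknown:
            raise ValueError(f"{name} depends on unknown jobs: {unknown}")
        graph[name] = set(deps)
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception args.
        raise ValueError(f"job graph has a cycle: {exc.args[1]}") from exc
    return [(name, jobs[name]["image"]) for name in order]


PLAN = compile_spec(SPEC)
```

Because users only write the spec, mistakes such as a missing dependency or a cycle can be rejected before anything is scheduled, which is one plausible reason a declarative layer is easier to operate than user-written pipeline code.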

On Scaling Gojek’s ML Platform

Firstly, a big problem that our team had when getting started at Gojek was that we did not have a data foundation. If we wanted to do ML operations at scale, we would need a proper data foundation. Unfortunately, a lot of data scientists are being forced to do that data engineering work today. Hopefully, in the future, this will become a solved problem for most companies. It’s important to have a unified data foundation across different event logs and historical environments so that the data comes to you.

Secondly, your features should be free. You should not struggle with creating and publishing your features. Ideally, you should reuse features created by other teams.

Thirdly, not breaking abstractions was a key lesson for Gojek, as we have an API-first approach. An important point that is overlooked in the industry, and one that I always hammer on, is that different parts of the ML system have different development lifecycles. The biggest mistake that many teams make (especially during the scaling phase) is not breaking up their system into smaller components. They should modularize their workflow into smaller stages such as data engineering, feature stores, model development, model deployment, model monitoring, etc.

Finally, you should pre-compute everything. Many people think it’s cool to have a real-time system that is always updated with on-the-fly retraining. But it’s much better to do everything in batches (whether serving data or serving models). This is easier to track, fall back from, measure, and reason about.

On Feast’s Inception

Looking holistically at Gojek’s internal ML lifecycle, we realized that we spent a lot of time on engineering features and getting features into production. Data scientists also duplicated their code and did not version their work or track its lineage. We knew that Uber’s Michelangelo team had built something of this kind and solved this problem, so we wanted to build something similar to solve our own problem.

We collaborated with the engineering team at Google Cloud on developing Feast. Here are the core problems at the time:

  1. Features are not being reused.

  2. The definitions of features vary.

  3. It’s hard to serve up-to-date features.

  4. There’s an inconsistency between training and serving.

A feature store is meant to bridge the gap between offline development and online production. A feature store helps data scientists publish/connect data into the operational side. A feature store also helps the data consumers train and serve models on that production-quality data. A feature store sits in the interesting boundary between ML and data.
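To make the offline/online split above more concrete, here is a toy sketch of the idea. It is not Feast's API: the `ToyFeatureStore` class, its method names, and the integer timestamps are assumptions made for this example. The point is that training reads a feature value as it was known at a past point in time, while serving reads the latest value, and both reads come from the same ingested data.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFeatureStore:
    """Illustrative in-memory store for a single feature (hypothetical)."""

    # entity_id -> list of (timestamp, value), kept sorted by timestamp
    history: dict = field(default_factory=dict)

    def ingest(self, entity_id, timestamp, value):
        rows = self.history.setdefault(entity_id, [])
        rows.append((timestamp, value))
        rows.sort(key=lambda row: row[0])

    def get_online(self, entity_id):
        """Latest known value, as a serving path would read it."""
        rows = self.history.get(entity_id)
        return rows[-1][1] if rows else None

    def get_historical(self, entity_id, as_of):
        """Value as it was known at `as_of`, for building training rows
        without leaking data from the future."""
        rows = self.history.get(entity_id, [])
        past = [value for ts, value in rows if ts <= as_of]
        return past[-1] if past else None


store = ToyFeatureStore()
store.ingest("driver_1", 10, 0.80)
store.ingest("driver_1", 20, 0.95)

TRAINING_VALUE = store.get_historical("driver_1", as_of=15)
SERVING_VALUE = store.get_online("driver_1")
```

Here the training lookup at time 15 sees only the value ingested at time 10, while the serving lookup sees the value ingested at time 20.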

Those are the problems that we wanted to address with Feast. When launched, Feast addressed the four problems above, but the one that we did not fully address was feature reuse. We thought that when people published features to Feast, they would start reusing them. It turned out that there was still a large trust factor there. If you do not make it super clear exactly what people are consuming and using in their models, data scientists tend to publish their own data or fork upstream code and publish it into Feast. We had good penetration of feature reuse, and Feast was successful at that, but it was not as high as we had expected.

Another problem that we started to solve with Feast is ensuring data quality. As of v0.7, users can validate training data and production data in real-time and batch deployments. Data quality is a big problem in how ML systems are operated today.
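As a generic illustration of checking production data against training data, here is a minimal sketch. It is not how Feast implements validation: the function names, the min/max range check, and the feature names are assumptions made for this example.

```python
def build_expectations(training_rows):
    """Derive a (min, max) range per feature from training rows."""
    expectations = {}
    for row in training_rows:
        for name, value in row.items():
            lo, hi = expectations.get(name, (value, value))
            expectations[name] = (min(lo, value), max(hi, value))
    return expectations


def validate(row, expectations):
    """Return a list of problems found in one production row."""
    problems = []
    for name, (lo, hi) in expectations.items():
        if name not in row or row[name] is None:
            problems.append(f"{name}: missing")
        elif not lo <= row[name] <= hi:
            problems.append(f"{name}: {row[name]} outside [{lo}, {hi}]")
    return problems


# Hypothetical feature names and values, for illustration only.
EXPECTATIONS = build_expectations(
    [{"trips": 3, "rating": 4.5}, {"trips": 10, "rating": 4.9}]
)
PROBLEMS = validate({"trips": 50}, EXPECTATIONS)
```

A real system would likely track richer statistics (types, null rates, distributions), but the shape is the same: expectations are learned from training data and applied to data seen in production.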

On Feast’s Product Roadmap

Feast Contributors (Source: https://github.com/feast-dev/feast)

Prioritizing a product roadmap can be tricky because everyone wants something different. We pushed out a lot of functionalities that people wanted, but we needed to figure out whether our project vision solves problems for a specific group that we were targeting at the end of the day.

When we started, we focused on solving feature-as-a-service for ML platform teams like ours. We frequently surveyed our users outside of Gojek to understand their pain points. However, we were more often informed by the internal fires that we were fighting at Gojek, so we were more biased towards our internal users.

As time went on and things stabilized internally, we democratically looked at both external and internal users' needs and prioritized those that could be the most impactful.

The most important thing that we did to grow the Feast community was to have RFCs (requests for comments). We designed specific functionalities, shared them with the community, took their feedback, and responded to GitHub issues.

On The Future of Feast

Here are the key lessons that I learned:

  • Feast requires too much infrastructure.

  • Feast is rather monolithic.

  • Ingestion is too complex.

  • Our technology choices hinder generalization (Google BigQuery, Apache Beam, Dataflow, and Kafka). These choices made it hard to migrate to Amazon services, where many of our customers are.

The Future of Feast (Source: https://feast.dev/blog/a-state-of-feast/)

Here is what Feast is moving towards:

  • Right now, for v0.8, we support Amazon. For v0.9, we will support Azure as well.

  • We will go closer to Python instead of Java.

All in all, we will make sure that Feast is a lightweight system. Right now, we are even making Feast completely runnable from a notebook environment. You can expect to see a release related to this lightweight mode of running Feast in the summer.

On Commercial and Open-Source Software

Commercial software is targeted at people who have large problems and are willing to put money towards solving them. It literally takes years for an engineering team to build an in-house feature store at scale. These are the kinds of problems that Tecton is addressing. The requirements can vary:

  • Small companies typically need a production-grade feature store with enterprise capabilities: serving data in real-time to their models or scaling out horizontally, for instance.

  • Large organizations (like banks) with strict regulatory procedures need an engineering-heavy product with guarantees and security.

  • Companies with niche data models (like on-demand data transformations) need personalized solutions.

With Feast, our users are typically small data science and platform teams. As the complexity goes up and the stakes are higher, Tecton is an obvious choice. Tecton’s product is very far ahead of anything I have seen in the space.

Capabilities of Feast and Tecton (Source: https://www.tecton.ai/blog/feast-announcement/)

On Living and Working In Southeast Asia

The South African experience is very suburban, with a car culture. The Southeast Asian experience is much more confined, with tighter spaces. At the same time, Southeast Asia is more diverse culturally, especially in Singapore and Thailand. It was a blast working there. I wouldn’t mind going back at some point in my life.

Overall, it’s extremely rare for a Feast-like system to be built in Southeast Asia. Most Southeast Asian companies focus on implementing solutions, not on building products. The competence is there, but the companies are not run in that fashion.

Show Notes

  • (1:45) Willem discussed his undergraduate degree in Mechatronic Engineering at Stellenbosch University in the early 2010s.

  • (2:34) Willem recalled his entrepreneurial journey founding and selling a networking startup that provides internet access to private residents on campus.

  • (5:37) Willem worked for two years as a Software Engineer focusing on data systems at Systems Anywhere in Cape Town after college.

  • (6:49) Willem talked about his move to Bangkok working as a Senior Software Engineer at INDEFF, a company in industrial control systems.

  • (9:52) Willem went over his decision to join Gojek, a leading Indonesian on-demand multi-service platform and digital payment technology group.

  • (12:16) Willem mentioned the engineering challenges associated with building complex data systems for super-apps.

  • (14:50) Willem dissected Gojek’s ML platform, including these four solutions for various stages of the ML life cycle: Clockwork, Merlin, Feast, and Turing.

  • (19:24) Willem recapped the lessons from designing the ML platform to meet Gojek’s scaling requirements — as delivered at Cloud Next 2018.

  • (23:09) Willem briefly went through the key design components to incorporate Kubeflow pipelines into Gojek’s existing ML platform — as delivered at KubeCon 2019.

  • (26:21) Willem explained the inception of Feast, an open-source feature store that bridges the gap between data and models.

  • (32:20) Willem talked about prioritizing the product roadmap and engaging the community for an open-source project.

  • (35:07) Willem recapped the key lessons learned and envisioned Feast's future to be a lightweight modular feature store.

  • (37:29) Willem explained the differences between commercial and open-source feature stores (given Tecton’s recent backing of Feast).

  • (41:36) Willem reflected on his experience living and working in Southeast Asia.

  • (44:33) Closing segment.

Willem’s Contact Info

Mentioned Content

Feast

Article

Talks

People

  • David Aronchick (Open-Source ML Strategy at Azure, Ex-PM for Kubernetes at Google, Co-Founder of Kubeflow, Advisor to Tecton)

  • Jeremy Lewi (Principal Engineer at Primer.ai, Co-Founder of Kubeflow)

  • Felipe Hoffa (Developer Advocate for BigQuery, Data Cloud Advocate for Snowflake)

Book

Source: https://www.applyconf.com/

Willem will be a speaker at Tecton’s apply() virtual conference (April 21-22, 2021) for data and ML teams to discuss the practical data engineering challenges faced when building ML for the real world. Participants will share best practice development patterns, tools of choice, and emerging architectures they use to successfully build and manage production ML applications. Everything is on the table from managing labeling pipelines, to transforming features in real-time, and serving at scale. Register for free now: https://www.applyconf.com/!

About the show

Datacast features long-form conversations with practitioners and researchers in the data community to walk through their professional journey and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths, from scientists and analysts to founders and investors, to analyze the case for using data in the real world and extract their mental models (“the WHY”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:

If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.