Datacast Episode 55: Making Apache Spark Developer-Friendly and Cost-Effective with Jean-Yves Stephan

The 55th episode of Datacast is my conversation with Jean-Yves Stephan — the CEO & Co-Founder of Data Mechanics, a Y-Combinator-backed startup building a data engineering platform that makes Apache Spark more developer-friendly and more cost-effective. Give it a listen to hear about his M.S. degree at Stanford, his experience leading Spark infrastructure at Databricks, his motivation to build Data Mechanics and go through YC, the tech community in France, and much more.

Jean-Yves (or "J-Y") Stephan is the CEO & Co-Founder of Data Mechanics, a Y-Combinator-backed startup building a data engineering platform that makes Apache Spark more developer-friendly and more cost-effective. Before Data Mechanics, he was a software engineer at Databricks, the unified analytics platform created by Apache Spark's founders.

Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) Stitcher, (5) iHeart Radio, (6) Breaker, and (7) TuneIn

Key Takeaways

Here are highlights from my conversation with Jean-Yves:

On Doing His Undergraduate at Ecole Polytechnique

The undergraduate math and computer science classes that I attended were theoretical.

I wasn’t sure at that time what I wanted to do. Startups weren’t popular yet. Most of the engineers graduating from my school went into consulting and finance.

On Doing His Master's at Stanford

The first class that I took was CS 229 — “Machine Learning” with Andrew Ng. That started my interest in the field. Having a background in Math helped a lot.

However, the class that I enjoyed the most was CS 246 — “Mining Massive Datasets” with Jure Leskovec. This big data class focused on distributed engines and big data technologies, with a mix of computer systems and math knowledge.

I also had the chance to be a Teaching Assistant for these two classes — grading homework and holding office hours. There were students with very different backgrounds. I enjoyed being a good listener and coming up with examples to explain abstract concepts to students in small groups.

On Leading Spark Infrastructure at Databricks

Databricks was a small startup when I joined, with about 40 employees. Even though the company was small, Apache Spark was starting to become famous. Having worked with Hadoop during my summer internship, I could see how Spark was faster and more efficient at processing data in memory. What also made a strong impression on me was their interview process, which was really hard. I reckoned that the team must be very strong as well.

I initially joined the Cloud Automation team, managing the entire cloud infrastructure of Databricks. Gradually, I moved into leading another team, called Cluster.

When I joined, Databricks only had about 20 customers. By the time I left 3 years later, they had maybe a thousand customers. We were launching hundreds of thousands of nodes to the cloud every day.

  • We had to figure out what parts of our code were bottlenecks.

  • We needed to ensure that our service could scale out.

  • As we scaled, all potential corner cases in our code were hit (even the low-probability bugs).

This meant we had a lot of support and firefighting work. Still, we fixed many bugs, gradually made the product more stable and efficient, and launched more machines into the cloud every day.

The last challenge was to become data-driven. In the beginning, as a startup, we didn’t have a big observability stack for our software. As we scaled, my team helped define metrics and measure/improve those KPIs. I learned significant engineering skills related to that growth.

On Founding Data Mechanics

My co-founder, Julien, was a long-time friend. He was a Spark user, while my Spark experience came from the infrastructure provider's side. Together, we were frustrated that Databricks and its competitors did not go far enough in solving our pain points. We had the feeling that we could build a data platform that would solve these problems for profiles like ours.

We wanted to make Spark more developer-friendly. We also wanted to make data infrastructure more cost-effective. Our goal was to build a data platform with automation that solves these problems: automatically choosing the type of cluster instance, automatically sizing the cluster, and automatically tuning the Spark configurations so applications run more efficiently. The end users can focus on building their applications.
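To make this concrete, here is a minimal PySpark sketch of the kinds of knobs he is describing. The pipeline name and values are illustrative, and on a platform like Data Mechanics they would be picked and re-tuned automatically rather than hard-coded like this.

    # Illustrative only: the kinds of Spark knobs the platform automates.
    # The pipeline name and values are made up.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")                                 # placeholder; a real job targets a cluster
        .appName("nightly-etl")                             # hypothetical pipeline
        .config("spark.executor.instances", "10")           # cluster sizing
        .config("spark.executor.cores", "4")                # parallelism per executor
        .config("spark.executor.memory", "8g")              # memory sizing
        .config("spark.dynamicAllocation.enabled", "true")  # scale executors with the workload
        .config("spark.sql.shuffle.partitions", "200")      # shuffle tuning
        .getOrCreate()
    )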

On Three Core Tenets of Data Mechanics

  • Managed service: Typical Spark platforms today expose many knobs and put the maintenance burden on the users. Most data engineers are not Spark experts; they should be experts in the business pipelines they are trying to build. Our goal with Data Mechanics is to automate the choice of parameters and the tuning of the infrastructure. End users just focus on building and submitting their application code while we handle the rest.

  • Integrated into clients’ workflows: We don’t try to solve every problem within the platform. We let our users connect their Jupyter notebooks to our platform. We have a connector for Airflow. We make it easy to build and run their own Docker images. We think that’s what data engineers with a bit of a software engineering background like best. We focus on the more challenging problem that our customers don’t want to solve themselves: managing Spark infrastructure.

  • Built on top of open-source software: This is very important for customers to trust us. They won’t be locked into a proprietary platform. If, for some reason, they want to use another service, they can cancel anytime. We have a pay-as-you-go plan without any commitment.

On Spark-On-Kubernetes

Since 2018, Spark can be deployed on Kubernetes (K8s) as an alternative to Hadoop YARN. With the release of Spark 3.1, the Spark-on-K8s integration is generally available and production-ready.

The first benefit is native containerization:

  • Each Spark application gets its own Docker image. You can use Docker to package all your dependencies and even the Spark distribution itself. Each Spark application is fully isolated with its own Spark version, since there aren’t any shared dependencies (a minimal sketch of this setup follows this list).

  • At the same time, applications share the underlying infrastructure: the K8s nodes (virtual machines) that run the containers, which is cost-effective. If a node is available to run your application, the application starts in about 5 seconds. When it finishes, its containers are gone in about 5 seconds, and the capacity can be used by another running application.
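As a rough illustration of this deployment mode, here is a minimal PySpark sketch of an application running against a Kubernetes cluster with its own container image. The API server address, image name, namespace, and data path are placeholders, not values from the episode.

    # Minimal sketch of Spark-on-K8s with a per-application Docker image.
    # The API server URL, image name, namespace, and S3 path are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("k8s://https://my-cluster.example.com:443")                     # K8s API server (placeholder)
        .appName("word-count")
        .config("spark.kubernetes.container.image", "myrepo/spark-app:3.1.1")   # the app's own image
        .config("spark.kubernetes.namespace", "spark-apps")                     # placeholder namespace
        .config("spark.executor.instances", "4")
        .getOrCreate()
    )

    print(spark.read.text("s3a://my-bucket/logs/").count())  # placeholder workload
    spark.stop()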

The second benefit is cost reduction: a single shared infrastructure with a very fast startup time makes your setup cost-effective.

The third benefit is the K8s ecosystem: K8s is very popular, and there are many tools for K8s monitoring, security, networking, and CI/CD. You get all these tools for free when you deploy Spark on K8s.

I would say the main drawback today is that most commercial Spark platforms still run on YARN. If you need to run Spark on K8s, you probably need to do it yourself using open-source code. That requires expertise.

That’s why we built Data Mechanics: to make Spark-on-K8s easy to use and cost-effective. We manage the K8s clusters so that our users don’t need to become K8s experts. In fact, they don’t need to interact with K8s at all; they just use our API or web UI instead.

On Data Mechanics Delight

Everyone complains about the Spark UI: it’s hard to know what’s going on, and it requires a bit of Spark expertise. It also lacks metrics about CPU usage, memory usage, I/O, disk usage, etc. Typically, data engineers use a separate system like Prometheus or Datadog to view these metrics. However, these separate systems don’t know about Spark, so engineers end up jumping back and forth between the Spark UI and their monitoring system.

That’s why we built Delight: to give a better bird's-eye view of what’s going on in your Spark application.

  • Delight also has new metrics and new visualizations that show users the bottlenecks of their applications. It is cross-platform: compatible with Databricks, Amazon EMR, Google Dataproc, etc.

  • Delight has an open-source agent that can be installed in your infrastructure. The agent streams some metrics to Data Mechanics’ backend. Before being streamed, these metrics are encrypted with a personal key that you created when you signed up. When you log in to our web UI, you see your metrics in Delight. After 30 days, we automatically clean up the data that you produce (a rough configuration sketch follows this list).
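Based on this description, the wiring follows Spark's standard listener mechanism: the agent is added to the application along with the personal access token. The sketch below illustrates the pattern; the package coordinates, listener class, and config keys are recalled from the Delight documentation and should be double-checked there.

    # Rough sketch of attaching the open-source Delight agent to a Spark app.
    # The package coordinates, listener class, and config keys below are
    # illustrative; the exact values live in the Delight documentation.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")                                                               # placeholder
        .appName("pipeline-with-delight")
        .config("spark.jars.packages", "co.datamechanics:delight_2.12:latest-SNAPSHOT")   # agent jar (check docs)
        .config("spark.extraListeners", "co.datamechanics.delight.DelightListener")       # streams metrics
        .config("spark.delight.accessToken.secret", "<personal-access-token>")            # key from sign-up
        .getOrCreate()
    )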

On Doing Y Combinator

We learned how to be bold:

  • Since my co-founder and I are both engineers, saying that we want to change the world didn’t come naturally to us. When there were just two people and one PoC customer, it’s hard to say that.

  • But I had to, because I was setting a vision for myself, for my co-founder, for my investors, for my employees, and for my clients. Clients don’t buy the product only as it is today; they want the product to become better.

We learned to talk to our users (a popular YC mantra):

  • This is important because we needed to make sure that we were solving real problems and differentiating our product from competitors.

  • We couldn’t do this on a sheet of paper by analyzing the pros and cons. Instead, we had to talk to as many users as possible.

We learned to focus:

  • The only way to enter a large market (like the Apache Spark market) is by figuring out the ideal customer profile that we wanted to attract and solving their pain point extremely well.

  • We focus on data engineers (people who write Spark pipelines and process a lot of data) and make Spark more developer-friendly and cost-effective for them.

On Getting Early Customers

There were a couple of things that helped:

  • Doing Y Combinator made us trustworthy.

  • Coming from Databricks, I had some personal relationships that helped find some early customers.

  • Producing high-quality content gave us credibility and visibility. Instead of me doing outbound sales, frustrated or new Spark users find us. This solves the timing issue: we start talking to customers when it’s the right time for them, instead of at random.

On Hiring

For the first few hires, we hired entirely through our network. I would ask my friends, past colleagues, and our investors whether they knew someone who might be a good fit. These hires are more engaged and easier to convert.

Another lesson I learned is to make everyone an owner by giving stock options to our employees. When you join a very early-stage startup, the startup becomes your baby. It’s important that you are rewarded with ownership of the company.

Lastly, it’s crucial to define the culture — including the kind of people we want to work with. Then, trust our instinct to identify those people.

On The Tech Community in France

There are a few assets in France: (1) talented engineers from great engineering schools, and (2) attractive government grants and tax subsidies for startups.

However, most of our customers are in the US:

  • There are 2-to-3 times more Spark users in the US than in Europe, so it’s just a bigger market.

  • There are a lot more startups in the US that would trust a startup like us. In France, the companies that use Spark are more established and bigger, and they only buy software if it is either well-known or recommended by a consulting group.

Timestamps

  • (2:07) JY discussed his college time studying Computer Science and Applied Math at Ecole Polytechnique — a leading French institute in science and technology.

  • (3:04) JY reflected on his time at Stanford getting a Master’s in Management Science and Engineering, where he served as a Teaching Assistant for CS 229 (Machine Learning) and CS 246 (Mining Massive Datasets).

  • (6:14) JY walked through his ML engineering internship at LiveRamp — a data connectivity platform for the safe and effective use of data.

  • (7:54) JY reflected on his next three years at Databricks, first as a software engineer and then as a tech lead for the Spark Infrastructure team.

  • (10:00) JY unpacked the challenges of packaging/managing/monitoring Spark clusters and automating the launch of hundreds of thousands of nodes in the cloud every day.

  • (14:48) JY shared the founding story behind Data Mechanics, whose mission is to give superpowers to the world's data engineers so they can make sense of their data and build applications at scale on top of it.

  • (18:09) JY explained the three tenets of Data Mechanics: (1) managed and serverless, (2) integrated into clients’ workflows, and (3) built on top of open-source software (read the launch blog post).

  • (22:06) JY unpacked the core concepts of Spark-On-Kubernetes and evaluated the benefits/drawbacks of this new deployment mode — as presented in “Pros and Cons of Running Apache Spark on Kubernetes.”

  • (26:00) JY discussed Data Mechanics’ main improvements on the open-source version of Spark-On-Kubernetes — including an intuitive user interface, dynamic optimizations, integrations, and security — as explained in “Spark on Kubernetes Made Easy.”

  • (28:35) JY went over Data Mechanics Delight, a customized Spark UI which was recently open-sourced.

  • (35:40) JY shared the key ideas in his thought-leading piece on how to be successful with Apache Spark in 2021.

  • (38:42) JY went over his experience going through the Y Combinator program in summer 2019.

  • (40:56) JY reflected on the key decisions to get the first cohort of customers for Data Mechanics.

  • (42:26) JY shared valuable hiring lessons for early-stage startup founders.

  • (44:34) JY described the data and tech community in France.

  • (47:19) Closing segment.

His Contact Info

His Recommended Resources

About the show

Datacast features long-form conversations with practitioners and researchers in the data community to walk through their professional journey and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths - from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:

If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.