Datacast Episode 116: Distributed Databases, Open-Source Standards, and Streaming Data Lakehouse with Vinoth Chandar

The 116th episode of Datacast is my conversation with Vinoth Chandar - the creator and PMC chair of the Apache Hudi project and the founder of Onehouse (a cloud-native managed lakehouse to make data lakes easier, faster and cheaper).

Our wide-ranging conversation touches on his educational experience at UT Austin, his first job working on OLTP databases at Oracle, his time building the Voldemort key-value store at LinkedIn, his years at Uber designing and scaling their massive data platform, the evolution of Apache Hudi as a streaming data lake platform, lessons learned in open-source standards, and his current journey with Onehouse bringing cloud-native lakehouses to enterprises.

Please enjoy my conversation with Vinoth!

Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) iHeartRadio, (5) RadioPublic, (6) TuneIn, and (7) Stitcher

Key Takeaways

Here are the highlights from my conversation with Vinoth:

On His College Experience in India

Source: https://mitindia.edu/en/

My undergraduate experience wasn't that of a typical Indian student entering a technical engineering school. There were lots of interesting things to explore.

I was lucky enough to attend the Madras Institute of Technology for engineering. Even though what I studied was essentially computer science, it was still considered an engineering program. This is one of those things I haven't had to explain in a while, so bear with me: in India, "engineering" means something really different. Due to all the outsourcing and service companies in the country, it practically means computer science. This is where I got my first exposure to distributed systems.

  • Back then, this was before Hadoop. If you remember those days, we called it grid computing. There were so many awesome projects around. I got exposed to a lot of scientific computing and MPP-style work, especially in my final year.

  • Initially, I started with a security angle, running group video or multicast applications. But by the end of that year, I was smitten by working with distributed systems and computers talking to each other. We built the university's first real-time application clusters using open-source projects, among other things.

  • I also had the good fortune of working on an interesting project with the Indian Space Research Organization. The university designed a student satellite, and I actually got to write a telemetry controller for it. It was so cool: I wrote a piece of code that ran in our satellite uplink station for a few hours each day, downloading telemetry data while the satellite was in contact.

These experiences made me appreciate networks and how hard it is for two computers to connect over unreliable networks and do something meaningful. This is what I took away from my education.

I learned to appreciate that failures happen, and somehow getting this many computers to cooperate with each other in the face of failures was fascinating. You can clearly achieve a lot with distributed processing, and amazing computing problems can be solved this way. Still, the infrastructure to make those things happen requires solving many hard problems and logical thinking.

I'm not actually super good at numbers, but I think I'm much better at logical and critical thinking. For me, it was a natural draw to things like Paxos and consensus algorithms. All of those things were very interesting to me.

On Pursuing a Master's Degree at UT Austin

Source: https://dl.acm.org/doi/epdf/10.1145/1921168.1921199

While at UT Austin (a great school in a great city), I had the opportunity to participate in a stellar computer science program. I was lucky enough to work in two completely different areas.

  1. The first was a vehicular content distribution project, which I stumbled upon during a course project. I spent the summer on campus working on it as my thesis.

  2. The second was a project I worked on with a researcher at the Texas Advanced Computing Center (TACC). It was fascinating to see the difference between my undergraduate background and the work done at TACC. Our undergrad had around 200 computers linked in a network, while TACC had an 8,000-core supercomputer that crunched all the weather reports for the southern states. The researchers there were running massive MPP programs with parallel computing. I landed an internship there and wanted to work on that technology. It didn't happen in my first semester, but I was persistent and eventually landed a very cool mobile networking project.

Source: https://dl.acm.org/doi/epdf/10.1145/1645164.1645175

I ended up producing papers on both. Ten years later, at Uber, I found myself working on Hudi (big data processing) while also leading Uber's mobile networking team, tackling the same problem as my thesis at UT. It was interesting to see how the field had changed.

At the end of my time at UT, I had to make choices about my career. I could have gone deeper into embedded systems or worked on parallel processing and distributed systems. I found both options fascinating, but I decided to work on distributed databases specifically and ended up going to Oracle. From there, my career has continued to grow.

On Being An Engineer at Oracle

Oracle was my first real full-time job out of school, and I learned that the Oracle database is a remarkable piece of engineering. I learned a lot about writing production code, testing practices, and release management. It's easy to knock enterprise software and say it's boring, but I got to see how much goes into delivering something as stable as an Oracle database. People take it for granted, but you can trust it if you put data in there. I got to see the other side of building mission-critical data systems.

Source: https://docs.oracle.com/en/middleware/fusion-middleware/osa/18.1/understanding-stream-analytics/overview-oracle-streaming-analytics.html

I also got an enjoyable, startup-like experience. Two to three months in, I joined the Oracle Streams team, and during those first months I kept hearing the team talk about a competitor. About three months in, Oracle bought that competitor, which was GoldenGate. So I got to see firsthand how M&A works and how to navigate uncertainty and risk during a recession. I also worked on optimizing performance and integrating the two systems, which was interesting because of the nature of the M&A.

I learned to deliver impact incrementally and to plan and prioritize more complex problems. It was a pretty interesting experience, not a run-of-the-mill first job. Oracle had optimized its native Oracle-to-Oracle integration, like replication, for over a decade. GoldenGate offered a solution that could handle heterogeneous databases and move data from Oracle to SQL Server, DB2, or whatever. There were a lot of different product-level details to understand to navigate that scenario.

Looking back, I consider that experience to have taught me many things I would otherwise not have been exposed to.

On Building and Scaling Voldemort at LinkedIn

I was mainly looking for a faster-paced environment than Oracle, which had given me a great grounding in distributed data fundamentals. This was when data was exploding, and terms like "data science" were being coined. Companies like LinkedIn and Facebook were leading the way in social and search, and there was a lot of excitement around scaling out distributed data systems. I wanted to work in a more consumer-facing environment where I could iterate faster.

With enterprise software, you can only move so quickly, so that was my primary motivation. At the time (around 2011), there was no DynamoDB or Cloud Spanner. You had either an RDBMS or some caches like Memcached, and that was about it. The BigTable and Dynamo papers had just come out, and Facebook was building Cassandra and HBase while LinkedIn was building Voldemort. It was an exciting time.

Source: http://www.project-voldemort.com/voldemort/

Voldemort was started by Jay Kreps at LinkedIn, but he moved on to work on Kafka. When I joined the team, Voldemort was serving approximately 10,000 QPS. By the time I left in 2014, we had scaled the system to around 2.5 million or 3 million QPS. LinkedIn grew from approximately 80 million to over 500 million or 600 million members during that period, so it was a real hyper-growth phase for the company.

My experience at LinkedIn really helped me develop my operational skills. It showed me that it wasn't enough to just be a good engineer; I also needed to understand the SRE function well and work closely with them to own a system as a service for the rest of the company. Voldemort was run as a multi-tenant service, much like a cloud offering. Our team handled all the clusters, capacity planning, multi-tenancy controls, quotas, and performance guarantees for the rest of the company. It was a great experience for me.

I learned a lot from the technical challenges we faced. Beyond Google, no company has really scaled key-value stores to that level. LinkedIn, Facebook, and others were all learning to scale distributed databases. There were tons of operational issues, and I made many mistakes in production, but I learned from them. I got burned many times in production but also learned a lot about new technologies like SSDs and log-structured databases. It was a period of rapid change, and we were constantly learning and adapting to new hardware and software.

For example, we realized that once you make the disk faster, the bottleneck shifts to CPU and garbage collection because the SSDs are so fast. All of the code was fine creating objects until that point because the old spinning disks were very slow. Once you throw an SSD in there, your latency goes from tens of milliseconds to microseconds, and your bottleneck is on the CPU. These were the types of interesting things we learned.

Overall, those three years at LinkedIn taught me how to handle production and shoulder the responsibility for a company like LinkedIn.

On Shaping The Vision for LinkedIn's Private Cloud "Nuage"

Source: https://www.slideshare.net/vinothchandar/voldemort-prototype-to-production-nectar-edits

I don't believe LinkedIn still runs Voldemort; they migrated to the new database system they were building. Towards the end of my time there, I started looking for newer things to work on because Voldemort was in good shape. It performed well, with sub-millisecond p95 latencies across all stores.

To extend Voldemort, we would have needed to add new features like range query support or transactions. However, LinkedIn had already made the choice to build a newer system, and a team was working on it. Voldemort was reaching end-of-life.

I was working on a private cloud project at LinkedIn, which aimed to provide a public cloud-like experience for all the company's data infrastructure services. We had managed to run Kafka and Voldemort quite well, but the user onboarding experience was lacking. There was a lot of manual provisioning and filing of tickets with SRE teams.

We built the first version of the project, but my heart was actually set on building the next data system. That's why I decided to leave LinkedIn.

On Joining Uber As a Founding Engineer On The Data Team

I wanted to return to the distributed data infrastructure space, where I thought I belonged. Incidentally, I knew about Uber: I was actually interviewing with Snapchat when they gave me an Uber coupon for a ride from the airport to their Venice Beach office in LA. I could have become the 15th engineer at Snap, working as their first full-time infrastructure hire, not just data, but cloud and the whole thing. It would have been an exciting place to work.

But when I took an Uber ride, I instantly fell in love with the product. I had spent nights and weekends, as well as my RA time, driving around in cars optimizing mobile communication, so it was an easy decision to look at Uber. I spoke to other companies like Airbnb, but what was really interesting about the Uber opportunity was that it was a blank slate. They just had some Python scripts and a Vertica warehouse. I could see how the company and the product could evolve and the real-time challenges that could arise.

I wanted to be a key member at Uber, on the ground floor of this opportunity. I wanted to build the data architecture for a large company and shoulder company-level responsibility. It was an excellent opportunity for me to set technical vision at that level. And that's how I ended up picking Uber.

On Uber's Early State of Data Infrastructure

I believe Uber was already doing multiple billions in gross bookings by 2014. There were fewer than 200 engineers when I joined, and all of engineering could fit on one floor. We were hiring very quickly at that point.

The company moved very fast and built many new products, experimenting quickly. The company's DNA was to experiment a lot. As we hired more people and built more products, we started feeling the early pains, such as an explosion in data.

Before I joined, the data team consisted of three people, two working on an ETL system using Python and SQL queries to run ETLs on Vertica. We put out dashboards, and people could query the warehouse or build basic dashboards. That's where things were.

Uber is very different from LinkedIn. At LinkedIn, many data challenges arose because LinkedIn built recommendation products, and many of the people consuming data were engineers. LinkedIn also had a smaller set of analysts who would optimize for sales. LinkedIn was a very large enterprise business, not just ads, and it was a unique company in that sense.

In contrast, Uber had many people running cities who were not technical at all. They were super smart but not technical. They were entrepreneurial and could run with things. They would learn SQL and actually write SQL directly against the warehouse. But writing performant SQL that way is not easy. Thus, the data team had to be the buffer between all the engineers building products and producing data and the operations folks consuming the data and making sense of what was happening.

Uber's strategy was to optimize city by city, hyperlocally. The company ran local campaigns specific to each city, which was very unique. This meant we had to build data as a product within the company, as if we were building for an external audience of end-users. The infrastructure was far from where we wanted it to be because all we could do was store data in a warehouse.

My first year there was pretty much spent doing all the basic things right. We introduced and upgraded our Kafka versions, introduced database change capture and log/event collection, and made Kafka reliable at scale across multiple data centers at Uber. Even then, Uber was entering China. We were running in four data centers there and eight in the US. We had to deal with consistently merging data from every place and all kinds of early issues.

So we spent the first one and a half years designing and building these things. We got to a state where we could do stream processing using Apache Samza. There was a basic Hadoop cluster with some Hive tables, and we could run Spark. This was Spark 1.3, which was very old. Everything was just coming up, and we actually managed to scale it to a point where we could store a lot of data on the Hadoop cluster. We would ETL some of it into Vertica, which gave us the performance we needed for reporting. All the other use cases stayed on the Hadoop cluster.

We managed to make it work and focused a lot on the productizing part I talked about, which is building the equivalent of a schema registry in-house at Uber. As an Uber engineer, it was easy to show up, write a schema for the data you wanted to produce, punch a button, and have Kafka topics and tables created for you. Data flowed in, and we built all these services to manage those tables. You didn't have to come and hand-tune a bunch of things. We really built a layer of abstraction between the engineers and the end consumers, the city ops people. They could have conversations at a high level and not deal with ad hoc data requests, debugging queries together, or things like that. That's what the first one and a half years at Uber looked like.

On Uber's Case For Incremental Processing on Hadoop

Even after we reached a state where we could stream data into the Hadoop cluster with reasonably low latency, we encountered several issues with mismatched dependencies between pipelines. I had experience working with batch data systems, such as massively parallel data processing at Oracle, and with stream processing and Voldemort at LinkedIn.

Our key challenge was figuring out how to reconcile new data with existing intermediate results, which is the programming model used in stream processing. In contrast, Hadoop batch processing runs periodically and has no incrementality. We decided to bring the incremental data processing model onto horizontally scalable storage and compute, such as YARN, HDFS, Kubernetes, and S3. We called this approach "incremental data processing" to differentiate it from stream processing.

We found that running on-demand, horizontally scalable compute jobs far more frequently and incrementally, such as every five minutes, improved efficiency and query performance. It's rare in computer science or databases to reduce cost and improve performance simultaneously, but we achieved it by incrementalizing our old-school batch processing. By sticking to this mini-batch model, we were able to write data in columnar formats and save millions of dollars for the business.
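
To make the mini-batch model concrete, here is a minimal, hypothetical PySpark sketch using what this work eventually became (Apache Hudi), assuming the Hudi Spark bundle is on the classpath: instead of recomputing an entire derived table, each run pulls only the records committed since the last checkpoint, recomputes results for just the affected keys, and upserts them back. The table paths, column names, and aggregation are illustrative, not Uber's actual pipeline.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-etl").getOrCreate()

RAW_TRIPS = "s3://lake/raw/trips"               # upstream Hudi table (hypothetical path)
DERIVED_FARES = "s3://lake/derived/city_fares"  # derived Hudi table (hypothetical path)
LAST_COMMIT = "20161201000000"                  # checkpoint saved by the previous run

# 1. Pull only the records written after the last processed commit.
changed = (spark.read.format("hudi")
           .option("hoodie.datasource.query.type", "incremental")
           .option("hoodie.datasource.read.begin.instanttime", LAST_COMMIT)
           .load(RAW_TRIPS))

# 2. Recompute results only for the affected keys (here: cities with new trips),
#    instead of re-scanning and re-aggregating the entire table.
changed_cities = [r["city"] for r in changed.select("city").distinct().collect()]
affected = (spark.read.format("hudi").load(RAW_TRIPS)
            .where(F.col("city").isin(changed_cities)))
updates = affected.groupBy("city").agg(F.sum("fare").alias("total_fare"),
                                       F.max("updated_at").alias("updated_at"))

# 3. Upsert just those rows into the derived table rather than rewriting it.
(updates.write.format("hudi")
 .option("hoodie.table.name", "city_fares")
 .option("hoodie.datasource.write.operation", "upsert")
 .option("hoodie.datasource.write.recordkey.field", "city")
 .option("hoodie.datasource.write.precombine.field", "updated_at")
 .mode("append")
 .save(DERIVED_FARES))
```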

On Introducing “Hoodie”

There were three things we needed to build at Uber. First, we needed the ability to transactionally update or mutate data in tables. This would be very similar to what stream processing state stores do: if you look at Flink or Spark Streaming, both have state stores where you can take a new record and incrementally apply updates.

Second, we needed the ability to emit changes for downstream processing, which is change streams and change data capture. We essentially started designing Hoodie as a database abstraction: we take HDFS and build a transactional database layer on top of it, without rebuilding a query engine from scratch. At that point, we had Hive, which is really good for very large batch jobs and reliability; Spark, which was picking up a lot of steam in data science and generally starting to eat into the ETL space; and Presto, which is really good for interactive query performance. We designed Hoodie as this layer and then integrated it to be queryable from all three engines at Uber.

That's how Hoodie was designed. By the end of 2016, we were already running it in-house at Uber. We would ingest external operational databases into a set of Hudi tables on the lake. Then we would incrementally transform them into derived fact, dimension, and star schema tables. This was being done for all the core, business-critical data sets by the end of 2016.
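
As a rough illustration of those first two capabilities (not code from Uber), here is how change records pulled from an operational database might be upserted into a Hudi table with the Spark datasource. The record key and precombine field let Hudi keep one reconciled row per key, and the resulting table can then be queried as a snapshot or consumed incrementally, as in the earlier sketch. All names and paths are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-ingest").getOrCreate()

# Hypothetical change records captured from an operational database.
changes = spark.createDataFrame(
    [("trip-1", "san_francisco", 12.5, "2016-12-01 10:00:00"),
     ("trip-1", "san_francisco", 14.0, "2016-12-01 10:05:00"),  # later update to the same row
     ("trip-2", "seattle",        8.0, "2016-12-01 10:01:00")],
    ["trip_id", "city", "fare", "updated_at"])

# Upsert into a mutable table on the lake: Hudi keeps one row per record key,
# using the precombine field to pick the latest version when duplicates arrive.
(changes.write.format("hudi")
 .option("hoodie.table.name", "trips")
 .option("hoodie.datasource.write.recordkey.field", "trip_id")
 .option("hoodie.datasource.write.partitionpath.field", "city")
 .option("hoodie.datasource.write.precombine.field", "updated_at")
 .option("hoodie.datasource.write.operation", "upsert")
 .mode("append")
 .save("s3://lake/raw/trips"))

# The result is a regular table on the lake, queryable from Spark here
# (and, once synced to a metastore, from Hive and Presto as well).
spark.read.format("hudi").load("s3://lake/raw/trips").show()
```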

Third, this felt like a general-purpose problem that most people would have. We open-sourced the project in early 2017 so that we could start building it out more in open source. Until then, we had used so much software from the Apache Software Foundation to build out our stack: Kafka, Parquet, Vespa, etc. So we wanted to give back to open source, and that's why we open-sourced it very early.

On The Evolution Of Hudi

I want to share the journey of Hudi, our open-source data management framework. Initially, we had a lot of internal conversations about Hudi's mutability on the lake. We mainly had visionaries in 2017 who were looking for innovative solutions to database change capture problems. We slowly started getting the software to run outside of Uber's infrastructure in a more general-purpose way in 2018, and adoption grew to a point where people began to see Hudi as a storage system in 2019.

By then, the community felt that the right place for Hudi was the Apache Software Foundation, where strict norms and rules ensure projects are governed in a vendor-neutral and inclusive way. During the incubation process, we realized there wasn't a good standard API for integrating Hudi into different engines. Although the situation has slightly improved, there is still no real standard, and everything beyond the SQL language level is non-standard. Therefore, we try to build more platformization into Hudi and make it easy to use.

We provide not only the deep transactional layers but also upper-level applications that make the overall end-to-end use case easy in Hudi, such as streaming ingest and ETL applications.

On Keeping Hudi Vendor-Neutral

When examining the governance structure of Apache, it becomes clear that there is nothing extraordinary about it; much of the credit should go to the ASF. They have put governance in place, such as a project management committee and a PMC chair. The PMC chair reports to the board. If someone believes that a particular cloud vendor is receiving preferential treatment, for example, they can raise it with the board, and we are accountable.

Additionally, all software and licensing are owned by the software foundation, not by any individuals. There are many cases of individuals and PMC members being asked to leave if they violate the code of conduct. Apache is quite strict in enforcing these kinds of things.

That being said, the way decisions are made in open source with this kind of central governance model is that everyone gets to vote plus one or minus one. Out of the, let's say, 15-16 PMC members, Onehouse as a company only has four. There are PMC members from other cloud providers and other large technology companies. Therefore, even though Onehouse is a Hudi company, we do not control the project in any way. For example, we cannot reject a pull request just because we are commercializing that exact functionality in our product and don't want it in open source. Such a thing would not pass muster.

Many practices like this help the community advance and keep the spirit of open source, but also maintain checks and balances to ensure that the project is not cannibalized by any particular group of vendors or companies.

On Establishing Standards For Open-Source Projects

There are some good projects in the data space. For example, the Open Lineage project aims to define a more standard open protocol for lineage. Currently, many parts of companies adopt different systems for data lineage, which can be problematic. However, having a standard protocol for lineage is only useful if you have holistic lineage, right? There are efforts like OpenTelemetry, which is an excellent project in the data collection ecosystem, though it's outside the analytics space.

From my experience working on the backend OLTP side of things, I've always felt that standards are essential. For example, the JVM is a standard that many vendors can implement. But in the data space, the term "standard" is often loosely used to refer to popular projects like Kafka. Although Kafka is a popular system, its protocol is not necessarily a standard; perhaps the Kafka protocol could be standardized.

There aren't many standardization efforts in the data space. For example, before Hudi, there was Hive, which provided an abstraction between writing data and reading it back. But Hive wasn't really a standard. There is something to be said for creating a standard protocol, rather than using marketing tactics to claim to be the "de facto standard" for something, which can cause fragmentation.

Looking at the three storage systems (Hudi, Delta Lake, and Iceberg) and integrating them with five engines, there are many combinations to deal with. With Hive, there was only one choice and five engines; now, there are fifteen combinations to consider.

In summary, while there are some healthy standardization efforts in the data space, there isn't enough standardization compared to the backend side, where any framework can speak REST. There needs to be a common language for data systems to talk to each other.

On Leadership Lessons Obtained at Uber

Uber was a fun and unique experience for me, and I realized there is much more to engineering leadership than just an IC role. The biggest thing we achieved was translating many operational best practices, such as maintaining uptime and availability, into keeping our team and the platform happy and sustainable. For example, we learned to refrain from deploying code on a Friday or right before a big holiday, and to avoid making big changes before leaving for the weekend.

We also picked up plenty of execution lessons and operational practices at scale. This was my first time driving projects and large initiatives that involved many teams, and I had to write roadmaps that affected a hundred engineers. Learning how to scope projects was really valuable: you cannot make them too easy, but you also want them to be ambitious enough.

At Uber, I learned that people are the most important factor. They must feel that their projects are a step up for them and have the right attitude towards their work. I saw that people who approached Uber as a new challenge and were open to growing and learning succeeded, while others did not.

I also learned not to be too complacent and to place trust in people to grow. Just because I succeed in one project does not mean I will succeed in the next one. These lessons came from watching many people at Uber, which attracted some of the best engineering talent from companies like Facebook, LinkedIn, Amazon, and Google.

On ksqlDB at Confluent

Joining Confluent was a natural transition for me. Many of the LinkedIn crew at Confluent had worked very closely with me during my time at LinkedIn. Even back then, I wanted to start my own company, or at least work full-time on Hudi, and I was exploring different paths. Confluent, at the very least, got me to a company where I could work on data infrastructure as a service in the cloud. I felt like I would learn a lot about cloud data infrastructure while helping Confluent with these things. Also, I got to work with some of my older LinkedIn colleagues again, and that's how Confluent happened.

I landed right on the KSQL team. KSQL was an existing project that translated SQL into streaming pipelines. The idea with ksqlDB, however, was to build one system that could do all of this, i.e., understand streaming data as a first-class citizen and unlock the interactive, real-time OLTP-style applications you build on top. We rebranded the project and introduced a lot of database features. We called it a streaming database.
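
To give a flavor of what such a pipeline looks like, here is a small, hypothetical example submitted through ksqlDB's REST endpoint: it declares a stream over a Kafka topic and a continuously maintained table derived from it. The server address, topic, and column names are assumptions for illustration, not anything discussed in the episode.

```python
import json
import urllib.request

KSQLDB_URL = "http://localhost:8088/ksql"   # hypothetical local ksqlDB server

statements = """
    CREATE STREAM rides (ride_id VARCHAR, city VARCHAR, fare DOUBLE)
        WITH (KAFKA_TOPIC='rides', VALUE_FORMAT='JSON');
    CREATE TABLE fares_per_city AS
        SELECT city, SUM(fare) AS total_fare
        FROM rides
        GROUP BY city
        EMIT CHANGES;
"""

# Submit the statements; ksqlDB turns the CREATE TABLE ... AS SELECT into a
# persistent streaming query that keeps fares_per_city continuously up to date.
req = urllib.request.Request(
    KSQLDB_URL,
    data=json.dumps({"ksql": statements, "streamsProperties": {}}).encode(),
    headers={"Content-Type": "application/vnd.ksql.v1+json; charset=utf-8"},
)
print(urllib.request.urlopen(req).read().decode())
```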

Now, with Materialize and Flink, there is a lot of effort around this. The streaming database idea is about abstracting a queue and a database into a single system and reducing the complexity in your microservices. There's lots of work to be done to get to that point.

On Apache Hudi as a Streaming Data Lake Platform

When we first covered the motivations for incremental data processing, our goal was to build the streaming programming model. This model operates more incrementally, consuming new input and computing new output.

Hudi supports this programming model on top of batch data infrastructure: tables live on cloud data lake storage or on HDFS on-prem, and Hudi provides the transactions to organize data, acting like a database on the lake. We also offer platform components on top of this core transactional layer that enable incremental and streaming data processing.

The Hudi stack is organized in layers, starting with cloud storage and open file formats like Parquet or Avro and ending with a database layer supporting standard SQL or programmatic interfaces. In between, we provide a transaction layer that tracks schema versions, partitioning, metadata, file listings, and advanced concurrency control mechanisms.

We also offer platform services like streaming ingestion, which lets users quickly ingest data into a lake with a single command. The platform services are built to support upper-level use cases and to make it easy for organizations to go to production.
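
Open-source Hudi ships that streaming ingest capability as a standalone utility; as a rough, hypothetical sketch of the same idea in PySpark, here is continuous ingestion from a Kafka topic into a Hudi table using Spark Structured Streaming. The broker address, topic, schema, and paths are all assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-ingest").getOrCreate()

schema = (StructType()
          .add("event_id", StringType())
          .add("city", StringType())
          .add("fare", DoubleType())
          .add("event_ts", StringType()))

# Continuously read JSON events from a (hypothetical) Kafka topic.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "ride_events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Continuously upsert the events into a Hudi table on the lake.
query = (events.writeStream.format("hudi")
         .option("hoodie.table.name", "ride_events")
         .option("hoodie.datasource.write.recordkey.field", "event_id")
         .option("hoodie.datasource.write.partitionpath.field", "city")
         .option("hoodie.datasource.write.precombine.field", "event_ts")
         .option("hoodie.datasource.write.operation", "upsert")
         .option("checkpointLocation", "s3://lake/checkpoints/ride_events")
         .outputMode("append")
         .start("s3://lake/raw/ride_events"))
query.awaitTermination()
```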

Many large enterprises, including Uber, Robinhood, Walmart, Amazon, and AstraZeneca, have used Hudi to stream data into their lakes. We want to continue contributing to Hudi and make it a project that provides deep technology and numerous features for users to easily go to production. That is the vision behind the streaming data lake platform mission.

On Indexing and Concurrency Control

Source: https://hudi.apache.org/blog/2021/07/21/streaming-data-lake-platform/

When people compare lakehouse technologies to databases, they often compare them to something like an Oracle OLTP database, but these problems are very different. Many of the challenging aspects of databases revolve around indexing and concurrency control. While the table format layer is relatively simple - there are only a few ways to do it - concurrency control and indexing present open problems.

Writing to a lake table is a throughput problem, whereas writing to something like Voldemort is a latency problem. Database systems measure themselves in TPS - transactions per second - but a data lake will not do a million transactions per second like Oracle. These are very different systems, so concurrency control and other aspects must be designed differently. This is where stream processing concepts can offer more benefits for this specific type of database.

I have been fortunate to be part of ksqlDB and the streaming database story, which has cemented my thesis on this. When people compare different technologies, they often focus on simple integration checkmarks, such as table writes. However, the things that stand the test of time in data systems are the more subtle trade-offs. If you optimize for more lock-free concurrency control, you may end up writing more metadata. Conversely, a system optimized for batch pipelines could write less metadata, but it may not be able to handle as many updates.
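
To show how those trade-offs surface to users, here is a hypothetical set of Hudi writer options touching both dimensions: the index used to locate existing records for updates, and optimistic concurrency control with an external lock provider for multi-writer pipelines. The key names follow Hudi's configuration docs but may vary by version, and all values are illustrative.

```python
# Illustrative Hudi writer options: one knob set for indexing, one for
# concurrency control. (Key names per Hudi's config docs; check your version.)
hudi_write_opts = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "updated_at",

    # Indexing: how Hudi finds the files containing existing record keys.
    # Bloom-filter indexing favors update-heavy workloads over plain scans.
    "hoodie.index.type": "BLOOM",

    # Concurrency control: allow multiple writers with optimistic locking,
    # coordinated through an external lock provider (ZooKeeper here).
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "trips",
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
}

# Usage (sketch): df.write.format("hudi").options(**hudi_write_opts).mode("append").save(path)
```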

There are still many subtle trade-offs and technology challenges in this space. We look forward to contributing more to Hudi, whose core mission is to incrementalize every batch job.

On Open-Source Roadmap Planning and Community Engagement

Roadmap planning is an exciting and challenging task, especially when done in a distributed and open-source community. To ensure transparency, we set official expectations for the community and work hard to meet them. Roadmap planning involves starting a thread on a mailing list to discuss the major release and its features. Debates usually arise on whether to include a feature in this or another release and how to sequence it. My job is to facilitate and sequence the process, ensure alignment, and make sure that the releases are stable, high-quality, and delivered on time. Once we have agreement and alignment across priorities, one of the committers or PMC members is nominated as the release manager, who then has the authority to decide on the release.

The release manager decides when to cut the release candidate, and we put it out for a vote in the community. People can vote plus one or minus one, and we debate negative votes to resolve issues. By and large, we proceed when there are no more negative votes and at least three positive votes from the PMC. It's a collaborative process that takes a lot of energy and time, but the community appreciates it.

In the Hudi community, we answer tens of thousands of Slack messages yearly, have over 2,200 Slack members, and run weekly office hours and support channels. We also spend time providing suggestions and maintaining a good FAQ that people can use. We try to keep it generally useful for everybody. We strive to keep Hudi open and community-driven, so everyone can benefit from it.

On Founding Onehouse

For me, the starting point for Hudi back at Uber was pretty controversial. Traditional Hadoop people were unhappy that we were introducing updates into immutable file storage, which had been the norm for 10 years. However, as the idea gained more validation, such as when it became an incubation project, it became clear that this was the way forward for data architectures.

We saw this as a powerful new architecture we could bring to the world. Given how competitive the space is, with big players like Databricks, we wanted to advance the technology in Hudi and solve some real problems for users.

I was mentally ready to start working on Hudi full-time in 2019, but I didn't have the necessary documents, such as a green card, to start a company back then. So my one and a half years at Confluent were helpful: the good people I worked with wrote letters for me and helped me get the green card. Finally, we started the company in 2021. The pandemic made the previous two years tough to navigate, and we lost all the free open-source time we had, since all the community work had been done on weekends and nights for four years. So it was a struggle to get to the point of starting the company.

Beyond advancing Hudi and the technology, the business problem we want to solve is what we discussed. Most companies start with a warehouse that is closed but fully managed. They pick a fully managed ingest system and click buttons to get to a point where there is data in Snowflake, BigQuery, Redshift, or something where they can write SQL. Then they build some dashboards. Usually, after that, it starts to break down. Even at medium scale, issues appear, such as the data starting to balloon. When companies want to create a data science team, none of this stuff works well, and they need Spark, PySpark, and other tools. Then they start building a lake and check out projects like Hudi, Debezium, and Spark. They hire a data engineering team and start building the stack. While there are success stories in the community of engineers who successfully make the transition, there are also stories where these projects fail. They fail because there is so much custom DIY built into the solution to get a lakehouse, and it takes a long time to operate a lake.

Essentially, we wanted to say: "Hey, we already built Hudi. Can we build a cloud service that gives you the same ease of use and time-to-market as the more closed cloud warehousing stack, but delivers drastically faster data freshness and cost efficiency? And it's future-proof because all your data is in an open format - Hudi." That was the basic idea. This will be the new kind of architecture that companies ultimately end up with, but with something like Onehouse, they have the opportunity to start with that architecture on day one instead of signing up for a migration project two years down the line.

On Onehouse’s Commitment Towards Openness

Source: https://www.onehouse.ai/blog/onehouse-commitment-to-openness

We want to become champions of Apache Hudi but avoid controlling the project in unpredictable or undesirable ways. By and large, what we are trying to convey is that everyone wants openness.

However, today openness comes at a cost: you either hire more engineers for DIY solutions or accept a longer turnaround time for projects. We want to make it possible for you to switch back to running open-source Hudi on your own if you got started with Onehouse and need to move off for some reason.

You may be in a regulated environment or big enough to build an in-house team. For whatever reason, it should be possible to return to running open-source Hudi yourself. There will be some services that you have to build in-house, but by and large, things should work.

If you contrast this with other vendors behind a similar kind of lakehouse format, the approach has been to have a thin format and do all the marketing around it. But if you think about it, the format is a passive thing. It's a means to an end. What people really need is the ingestion service that pumps data into queryable tables. You need compaction and clustering services, which optimize the storage layer for you to improve query efficiency.
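
In open-source Hudi, those table services are exposed as writer configuration; a hypothetical setup enabling inline compaction and clustering might look like the sketch below (key names follow Hudi's config docs and may differ across versions; all values are illustrative).

```python
# Illustrative table-service settings for a MERGE_ON_READ Hudi table:
# compaction folds delta log files into base files, and clustering rewrites
# the data layout (e.g., sorted by query columns) to speed up reads.
table_service_opts = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",

    # Run compaction inline after every N delta commits.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",

    # Periodically re-cluster files, sorting by a commonly filtered column.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clustering.plan.strategy.sort.columns": "city",
}

# Usage (sketch): df.write.format("hudi").options(**table_service_opts)...
```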

Our commitment is that these things already exist in open source. That's how we designed the project, connecting back to our platform vision. We fiercely believe that, as a business, we can add significant value by operating all of this well and taking that headache away from you. Our commitment is to ensure you can start with Onehouse, move to DIY Hudi, and then pick up the managed service again after a year or two, even if your team leaves or changes. We are building this with an eye toward the future.

On Hiring

Source: https://www.onehouse.ai/about-us

So far, we've been fortunate enough to attract some outstanding talent for the Hudi project. In the four years since its inception, the project has gained awareness, which has helped us attract talent.

I could share what I've learned so far about what to look for in engineers and people you bring on at this stage, in case it's useful to others in the same boat. Technical skills are obviously crucial, but it's also important to have people who are excited about the vision and want to make managed lakehouse a reality. We're trying to bring about a new data architecture that will become the mainstay for people, so it's important to be excited about that vision and understand it.

We've also filtered heavily for culture and a hunger to accomplish this and deliver on the vision. After my first few hires, I quickly realized that this was what we needed. When you have a team that's aligned and excited about the vision, they focus all their energy on thinking about how to make it happen, which has many compound effects.

On Fundraising

I feel fortunate to have had the experience of dealing with VC firms. Although it was a little painful at times, I had term sheets in hand before I even had the green card needed to start the company. This gave me the opportunity to speak with almost all of the top VC firms here. While all of them are great and offer to be helpful, it's important to consider the people and the relationships you will be working with. Your VC partners are the people who will stick with you through the ups and downs of your business. Therefore, it is crucial to make sure you feel comfortable working with them.

It's a two-way street; the VCs want to understand your vision and the prospects for your business, and you want to understand who can help you with go-to-market strategies, who knows the customer profiles in your space, and who is connected to the right networks. Ultimately, you want to work with someone you can trust and who has the expertise to support your business for at least four or five years. These things should be top of mind when considering VC firms.

Show Notes

Vinoth's Contact Info

Onehouse's Resources

Apache Hudi's Resources

Mentioned Content

Articles and Presentations

People

Book

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts, or click one of the links below:

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.