Datacast Episode 112: Distributed Systems Research, The Philosophy of Computational Complexity, and Modern Streaming Database with Arjun Narayan
The 112th episode of Datacast is my conversation with Arjun Narayan, the CEO and Co-Founder of Materialize, a streaming database for real-time applications and analytics built on top of a next-generation stream processor — Timely Dataflow.
Our wide-ranging conversation touches on his liberal arts education, his Ph.D. work on differential privacy for distributed systems, his love for teaching and writing, his time as an engineer at Cockroach Labs, his current journey with Materialize building a SQL streaming database on top of cutting-edge research from his co-founder Frank McSherry, lessons learned from building a capital-intensive business and identifying partnerships, thoughts on hiring and fundraising, and much more.
Please enjoy my conversation with Arjun!
Listen to the show on (1) Spotify, (2) Apple, (3) Google, (4) Stitcher, (5) RadioPublic, and (6) iHeartRadio
Key Takeaways
Here are the highlights from my conversation with Arjun:
On His Upbringing
I grew up in Bangalore, which is very different now from when I was young. It was a quiet little retirement town but has since become a major metropolis. I do not recognize it anymore, so I do not enjoy returning.
My experience growing up in Bangalore was different from what people might picture. I went to a boarding school because I received a generous scholarship. I had never planned on leaving home, but it was an opportunity for high school that I could not pass up. A friend convinced me to apply, and when I visited, I saw this amazing campus near Bombay with resources I had never seen before.
Once I saw the place, I knew I wanted to go. It was very formative because it opened my eyes to various experiences and educational possibilities. I consider it a lucky break to be exposed to the liberal arts experience so early. Until then, I had been a stereotypically narrowly focused science and mathematics student.
That experience convinced me to go deeper down that path, so I attended a liberal arts college in the United States. That was the reason I chose to come to the US. Before that, I had always planned on getting into IIT and studying computer science, physics, or engineering. Looking back, I do not know if the outcome would have been that different, given that I eventually returned to study computer science. However, I learned a lot by taking the long road rather than the quick, narrow one.
On The Value of A Liberal Arts Education
The inquisitiveness of asking profound questions is fundamental to a liberal arts curriculum. When I was younger, I focused solely on advancing through my calculus textbook and did not ask many questions. Although I was fascinated by physics, I was only inquisitive in a narrow way. My goal was to understand and explore theoretical physics, and I did not consider other subjects.
It is common for scientifically-minded kids to become obsessed with a particular field, like theoretical physics. However, when I broadened my focus to history, philosophy, literature, economics, and the social sciences, especially sociology, I realized that those fields had equally profound questions. This realization opened up a new world of inquiry for me.
On His Academic Experience at Williams College
I loved my college experience at Williams. It was exactly what I was looking for after finishing high school, though I did not know what to expect since I had never been to the US. My only guide was the U.S. News rankings of colleges and universities, and Williams was highly ranked that year.
I did not know where Williams was located when I got accepted. I remember celebrating with my friends on admissions day and getting quite drunk. One of my friends asked, "Where is Williams?" We did not know the answer. My roommate and I rushed to the one part of campus with internet access at 1:00 AM to Google it. It turned out that Williams was in rural Massachusetts, which was an excellent setting for a college.
As an immigrant, I see many other immigrants in America spending too much time with other immigrants. Doing so is easy and comfortable, but I highly recommend against it. What was nice about Williams was that it forced me to spend more time with Americans, the people who inhabit this country. It was a fortunate setting that also offered a wonderful education.
Initially, I intended to major in a series of subjects, including history, music, economics, and math. However, I discovered that I was mediocre at music. It was possibly the hardest series of classes I have ever taken. That experience taught me that maybe I was not cut out to be a world-renowned composer or performer.
It was also my first time studying academic computer science, which was very different from programming. While programming focuses on achieving specific tasks, computer science focuses on the why, how, and what is possible with algorithmic underpinnings. I fell in love with it, which is why I chose to study computer science and economics.
The liberal arts education at Williams allowed me to choose my major quite late. I could start multiple majors and only commit to finishing one or two toward the end of my second year. I also spent a year abroad at Cambridge, which was a less interesting experience than I had expected. In retrospect, I could have probably spent an additional year at Williams instead. However, it prepared me to follow up my college education with a Ph.D., giving me a clearer picture of what an academic research institution would look like. That was the impetus for going directly to a Ph.D. after college.
On History and Political Philosophy
At Williams, I developed a deep interest in history and philosophy, specifically political philosophy. I am passionate about understanding the causal chain of events, as it is a valuable method of inquiry. Nowadays, most of my leisure reading and spare time activities involve engaging with history or political philosophy.
It is interesting to contrast this with economics, where the gold standard for understanding causality is a randomized controlled trial. In economics, I became interested in understanding which poverty interventions work, using experimental designs to measure the impact of an intervention in two similar villages, one with the intervention and one without. This allows for making causal inferences with high certainty if the trial is designed correctly. I loved all of the experimental design challenges.
However, what is much harder about history is that you cannot do this. You must try to understand causality without clean interventions or instrumental variables under your control to determine what works and what does not. This is perhaps the hardest intellectual challenge: coming up with clear predictive theories of how the world and humans work. Nonetheless, I find it the most profound line of inquiry, and one I am deeply fascinated by.
While I think there are specific applications in business and day-to-day life in determining what we could have done, what we should have done, and what we should do next, I feel obligated to say that I am a liberal arts enthusiast. The liberal arts are valuable for their own sake, not because they will make you a better businessman or improve any specific skill. They are valuable because they enrich the human experience with no end goal beyond that.
On Getting Into Distributed Systems
My interest in economics and finance led me to focus on the branches of computer science that I believed had the most impact on technology and society, namely networks and databases. These areas (and operating systems) have the largest market size and are supported by large corporations with significant innovation budgets and research and development efforts.
While I was also interested in hardware, I considered Intel to be the company that had the most impact in the technology sector. I wanted to work with the foundational elements of computer science, such as chips, operating systems, databases, and networks, and to engage with companies like Cisco, Oracle, Microsoft, and Intel, which had helped to build the industry.
When choosing a field, I prioritized the potential for impact rather than the inherent love of computer science or operating systems. Between networks, distributed systems, and operating systems, I was indifferent and fell into distributed systems by accident. However, choosing an advisor is more important than choosing a field or institution. I was fortunate to have a great mentor who guided my research on differential privacy, a project within the broader field of distributed systems.
On His Ph.D. Experience at the University of Pennsylvania
I fell into my Ph.D. program because I chose an advisor based on how good he would be at mentoring me, and he turned out to be a phenomenal advisor and researcher. He spent a lot of one-on-one time teaching me various things.
Many people misunderstand what a Ph.D. is because it is very different from undergraduate studies. In undergraduate studies, professors teach you things they already know the answers to. They are not pushing the frontier or figuring things out as they go along. They have a syllabus and a defined set of answers. In a Ph.D., however, you are trying to figure out something to which you do not know the answers, and there may not even be an answer.
The experience is less about a teacher-student relationship and more about exploring together. You learn by watching your advisor try to solve problems, and in the beginning, they will likely do most of the work while you help along. Over time, you become more of an equal, and when your institution and advisor decide you can research independently without daily supervision, you graduate.
I have fond memories of working with my advisor, Andreas, because we were trying to build practical systems with differential privacy. Differential privacy is a mathematical guarantee about the amount of privacy preserved when analyzing a dataset. We were trying to build practical systems that enforced these guarantees and could be used like a database. This required us to engage with classic database issues, security, distribution, and performance.
We also used advanced programming languages and type theory to enforce the differential privacy properties we needed. We had to build our own custom query execution runtime to protect against adversarial queries that leaked information through side channels. This gave me a grand tour of computer science, and although I have not done anything with differential privacy since my Ph.D., it was a great opening question to engage with other fields I enjoyed.
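To make the guarantee concrete, here is a minimal sketch of the textbook Laplace mechanism for a counting query, the simplest way a runtime can enforce differential privacy. It is illustrative only, not the system built during the Ph.D.; the dataset, predicate, and epsilon value are made up.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Release a differentially private count: the true count plus Laplace
    noise calibrated to the query's sensitivity (1 for a counting query)."""
    true_count = sum(1 for row in data if predicate(row))
    sensitivity = 1.0  # adding or removing one person changes the count by at most 1
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: a private count over a toy dataset of ages
ages = [23, 35, 41, 29, 52, 38]
print(laplace_count(ages, lambda a: a > 30, epsilon=0.5))
```

The intuition is that the noise is scaled to the query's sensitivity, so any single person's presence or absence shifts the output distribution only slightly.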
On His Ph.D. Dissertation About Differential Privacy
The key claims of differential privacy are rooted in mathematical guarantees. In order to achieve these guarantees, they need to be executed by some runtime, which is often a database system. However, this poses a challenge: while differential privacy protects the data, it makes it harder to answer users' follow-up questions since they cannot see the raw data.
To address this, there has been research on zero knowledge verification, which allows users to gain confidence that a program was executed correctly without learning anything about the inputs or intermediate states. They only know the final result and that the program was executed exactly as specified because of additional metadata. The hard part is ensuring that the metadata does not leak any information about the inputs.
However, these systems are currently academic prototypes and not performant. To make them more practical to use, we pushed the limit on these systems to give privacy guarantees while being somewhat practical. This was particularly important in medical data settings, where it is valuable for researchers to access the data, but there are many barriers due to privacy concerns.
Another challenge is combining the results of multiple data sets without centralizing the data, which creates a single point of failure liable for breaches. We explored the possibility of asking join questions, which involve selecting and joining across multiple data sets kept in two different systems that do not trust each other. While this is possible, it comes at a high-performance cost.
On His Love For Teaching
I loved teaching. Teaching would be my top choice if I were doing something other than what I was doing.
One thing I learned is that I would actually prefer teaching younger kids. I think I would enjoy being a primary or middle school science and math teacher more than a high school or college teacher. One of the most frustrating things is when a student is eager to learn, but they do not understand a concept because they lack the prerequisites. In college, I did a lot of volunteer teaching for high school math students, and the number one issue they faced was not being able to complete a calculus test. I would help them, but it was really sad when I realized they were struggling with algebra.
I would tell them they had a test in two days, but they really needed an extra three months to go back two years and solidify their algebra skills. One of my biggest frustrations and takeaways from teaching was that it is important to intervene as early as possible to shore up the fundamentals.
Teaching also helps build empathy. Nobody in any job is trying to do a bad job. If someone is not performing well, it is usually because they misunderstand something or lack foundational knowledge. Good teaching, particularly one-on-one, involves diagnosing where students' misconceptions began and why they reached an incorrect conclusion. This is a valuable skill that can be applied to managing or leading any team. Everyone is on the same team, trying to do the right thing, but diagnosing why someone may have made a mistake quickly and efficiently is important.
On Joining Cockroach Labs
The story of how I ended up at Cockroach is a bit long. It started halfway through my Ph.D. when I became increasingly interested in distributed systems research. I found it the most enjoyable field among the various ones I explored, such as programming languages, core privacy research, and security.
I delved deep into distributed computing, including streaming. I followed projects like Apache Spark, Google Spanner, and Naiad, which was Frank McSherry's project on stream processing. I was particularly impressed by Naiad, which represented a step-function improvement in what could be done in streaming. It was competitive performance-wise with batch systems, which was a significant breakthrough.
I tried to convince Frank to start a company based on Naiad, but he was not interested. However, I did find Cockroach, an open-source clone of Spanner, which I thought had much potential as a commercial enterprise. Cockroach had made some product decisions that I found more appealing than Google Spanner, such as building a Postgres-compatible layer for the query layer and making it an open-source project.
That is how I ended up at Cockroach, almost directly from my Ph.D. research and engagement with cutting-edge distributed systems projects in academia.
On His Contributions At Cockroach
I would not say I was one of the worst engineers at Cockroach, but I was definitely in the bottom half. The team was amazing, and the product was incredibly complex, which made it tough for me. However, my academic background allowed me to play a key role in synthesizing and explaining complex issues so everyone on the team could understand them. This involved not only external communication but also internal communication, which was critical for shared understanding across the engineering team.
As a result, I began doing a lot of writing. The writing we did at Cockroach was very effective from a marketing standpoint, but it was also essential for internal communication and clear understanding. This is how I became passionate about two subjects: performance and the storage engine.
In terms of performance, we faced many challenges, and I wanted to get more precise about what that meant. It is not just about raw queries per second. We needed to consider the types of queries and other factors, such as what the literature says, how other databases perform, and the tradeoffs of focusing on a single metric. Should we focus on multiple metrics? By synthesizing all of these factors, we were able to build a set of performance goals that rallied the entire engineering team, including myself, to improve our performance. Over two quarters, we achieved more than an order of magnitude improvement on a database benchmark, the TPC-C benchmark. I wrote a guide that outlined our motivations and the work we did, and it was very effective in helping others achieve great performance with Cockroach.
The second piece I was passionate about was the storage engine. At the time, it was RocksDB, an open-source storage engine maintained mainly by Facebook and forked from LevelDB, an open-source storage engine maintained by Google. The Cockroach co-founders, including the CTO, were ex-Googlers and were very familiar with LevelDB. However, RocksDB was not quite the right architecture for Cockroach and was buckling under the workloads we were putting it through. I thought it was salvageable, but the CTO, Peter, thought we needed to build our own storage engine. There were also inefficiencies introduced by the fact that Cockroach was written in Go while RocksDB was written in C++. Go is a managed memory environment and C++ is manually managed, so we had to transfer large amounts of memory across that boundary, essentially copying it redundantly in a way we might have avoided if the storage engine had been written in Go from the start. We were able to eke out many performance gains by being clever about how we pushed data from C++ to Go.
At one point, I had identified a set of queries where we could achieve an order of magnitude speedup if we were cleverer about how we pushed data from C++ to Go. However, it took me about a month, and I made no progress. Spencer, the CEO, was getting frustrated, and he asked why I was not making any progress. Eventually, Peter coded the entire solution in a single day, and I realized that my future lay in identifying performance bottlenecks and pointing Peter at the right problems to solve. The outcome of all of this was the RocksDB guide, which I wrote from the perspective of keeping RocksDB. It was the last work I did at Cockroach, and they now have a new and highly performant storage engine.
On His Scaling Journey With Cockroach
My time at Cockroach Labs was an incredible learning experience to witness the various stages a startup must go through. During my first quarter, our OKR was to ensure that a single-node database system had 24 hours of uptime under a continuous query load. This seemed like a joke of an objective for a database, as one would hope that a database has more than 24 hours of uptime. However, that was the state of things back then.
By the time I left, after all the performance work, we were keeping ten-node clusters up under extremely high query volume. Shortly after my departure, they pushed 100-node clusters that saturate TPC-C load generators. I have not kept up since, but it is probably one of the most scalable systems you can find in any database today.
Watching that journey and knowing how long it took without panicking was the number one skill I learned. The number two skill was understanding the importance of communication in selling or marketing a database, or building user and customer confidence that you know what you are doing.
This approach is very different from what I see many other vendors do, and I think we have copied the Cockroach model quite a bit at Materialize: be extremely detailed and honest. Since you are a startup, most people's instinct is to oversell and overpromise: "This is the most polished database ever. A hundred times better. We solved it. We solved databases. We are done." But that is the absolute opposite of what you should do, as no one will believe a word you say in that press release or any future announcement you make.
We approached communication at Cockroach through introspective and honest blog posts. I remember when we talked about stability and received all these Hacker News comments about how much of a joke our system was and how we could not keep a three-node cluster stable for a week. We were honest about the challenges we faced and the design challenges we had to overcome. This helped build trust with those who really understood how hard it is to solve this non-trivial problem.
Watching how we went from our early communication to commercial viability, and the point at which Cockroach was a scaling revenue operation with a go-to-market team that was bringing in non-trivial revenue, was invaluable.
Databases have a particularly long incubation period for research and development (R&D). This sets them apart from other startups, which often receive commercial signals of success in a quick feedback loop. With databases, however, you may spend three or more years on R&D without receiving such feedback.
To succeed in this environment, you must have a clear internal understanding of what you are doing and be skilled at communicating your progress externally. Without this clarity, it can be difficult to maintain the conviction needed to keep going and not give up.
On Writing About Database Systems In Production
As I mentioned earlier, many things I argued for involved internal writing, such as performance work. I believe that internal writing benefits from the idea that writing is a muscle, meaning that the more you write, the better you become at it. I have written various blog posts on different topics, including some GitHub issues on the Cockroach and Materialize repositories and internal Google Docs.
To me, this is the clearest and most effective way to communicate and help others learn. The amount of onboarding material you leave behind as you go along is phenomenal at bringing people up to speed with your work. Writing has this non-linear accruing effect over time as you accumulate more written content. I enjoy writing blog posts and wish I had more time for them. The transactional isolation and log-structured merge trees blog posts were both written as a result of trying to convince some folks internally at Cockroach of various things.
Kyle Kingsbury, an independent researcher and consultant who tests database guarantees, also writes valuable and in-depth blog posts. His Jepsen blog posts are incredibly valuable for users who are selecting between databases and choosing one that lives up to its claims of correctness. When Kyle did an engagement at Cockroach, we were trying to improve our own understanding of what we were claiming and get to absolute precision. As part of that, I ended up writing a blog post on serializability, strict serializability, and all the various guarantees, and it still holds up.
Similarly, log-structured merge trees, the data structure at the core of RocksDB and of Cockroach's new storage engine, are comparatively under-documented: the idea originated in academia, but as a practical choice made by industry distributed systems engineers, relatively little has been written about it. I wanted to dig into the history of log-structured merge trees and how they evolved, and because we were built on a log-structured merge-based storage engine, I thought it was important for us to understand it better.
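For readers unfamiliar with the data structure, here is a toy sketch of the idea, not RocksDB's actual implementation: writes land in an in-memory memtable, which is flushed as an immutable sorted run once it fills up, and reads check the memtable first and then the runs from newest to oldest. Real engines add write-ahead logging, background compaction, and bloom filters, all omitted here.

```python
import bisect

class TinyLSM:
    """A toy log-structured merge tree: recent writes live in a mutable
    memtable; full memtables are flushed to immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}        # mutable, holds the most recent writes
        self.runs = []            # list of sorted (key, value) lists, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:              # newest data wins
            return self.memtable[key]
        for run in self.runs:                 # then search runs, newest to oldest
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

    def _flush(self):
        run = sorted(self.memtable.items())   # write out an immutable sorted run
        self.runs.insert(0, run)              # a real LSM would also compact runs
        self.memtable = {}

db = TinyLSM()
for i in range(10):
    db.put(f"key{i}", i)
print(db.get("key3"))   # found in an older flushed run
print(db.get("key9"))   # still in the memtable
```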
On "The Philosophy of Computational Complexity"
Tyler Cowen's blog is one of my favorites; I read it nearly daily. There was an interesting article about what philosophy has contributed to society in recent years. It got me thinking about the huge revolution in our understanding of computational complexity in the last 50 years, which I believe is on par with the advancements made in early 20th-century physics.
I tried to convince a close friend of mine that this was the case, but he was skeptical. So I decided to write this blog post to convince him. One of my tips for overcoming writer's block is to write as if speaking to one person rather than a mass audience. This helps to make your writing clearer and more effective.
In this post, I explore the six possible worlds from Russell Impagliazzo's research paper, now known as Impagliazzo's Worlds. These worlds are related to the famous P vs. NP question: whether polynomial-time algorithms can solve NP-complete problems. If P=NP, it could mean that all of these problems are actually easy, or that our categorization of hard problems is wrong, and most of modern cryptography would become impossible.
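For reference, here is a standard textbook formulation of the question, not taken from Arjun's post:

```latex
% P: decision problems solvable in polynomial time.
\mathrm{P} = \{\, L \mid \text{some algorithm decides } L \text{ in } O(n^{k}) \text{ time for a constant } k \,\}
% NP: problems whose "yes" instances have short certificates checkable in polynomial time.
\mathrm{NP} = \{\, L \mid x \in L \iff \exists\, w,\ |w| \le \mathrm{poly}(|x|),\ \text{a polynomial-time verifier accepts } (x, w) \,\}
% Known: every problem in P is in NP. Open: whether the containment is strict.
\mathrm{P} \subseteq \mathrm{NP}, \qquad \text{open question: } \mathrm{P} \stackrel{?}{=} \mathrm{NP}
```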
We do not know which world we live in, but exploring these possibilities is fascinating and worthwhile. While I did not convince my friend, this post was picked up by Marginal Revolution and gained some attention on the internet.
On Writing Evergreen Content
When writing for a technical audience at a company, it is important to recognize the value of evergreen content. Over time, a steady stream of hits on a well-written, detailed article can add up to a significant amount. This is why it is better to write longer, more detailed pieces that appeal to a narrower audience, even if they only get a few hundred views per month. It is more valuable to have an authoritative post that's recognized as such by experts rather than something shallower that gets thousands of views.
For example, a friend of mine, Justin, wrote a definitive guide to a database anomaly called Write Skew. It is the number one hit when you search for "Write Skew" and explains the issue so well that readers come away feeling like Justin is the only person who can save them from their predicament. This is a great position for a company to be in, as it establishes trust and credibility.
In the enterprise world, it is better to go narrow and deep than shallow and wide. Even if you only have a few thousand paying customers, if you can communicate directly and effectively with them, you can build a successful company. So when writing for an enterprise audience, focus on communicating directly with a smaller group of people rather than trying to appeal to a broader audience.
On Founding Materialize
The first time around, there was a vague sense that Frank should do something with his amazing technology. Writing code on the internet and putting it on a GitHub repo was not enough to get people interested. More needed to be done.
I suggested that Frank should commercialize the technology. However, the pitch was not successful because it was too vague. After moving to Cockroach, the pitch became more specific. I stayed in touch with Frank and explained that we needed to build a layer on top of the stream processor that exposes a query language people already understand: SQL. SQL is accessible to most developers who write against databases like Postgres, MySQL, or Snowflake. It is a very accessible way to specify business logic, and the database does the heavy lifting to execute it efficiently at scale and over large volumes of data.
Having seen how Cockroach took a complex, distributed, geo-replicated system and simplified it for the user, I could articulate this more effectively. The pitch was that Cockroach is Postgres that scales infinitely, so you do not have to worry about manually managing your own sharding scheme. I believed that taking the stream processor and wrapping it in SQL was the way to make streaming mainstream.
Spencer was the first person to think this was a phenomenal idea and helped me a lot with fundraising. Our first year was spent prototyping, early recruiting, and building a team. We were fortunate to get our product out in February 2020, just before March 2020, when nobody cared anymore about a new database.
The initial fundraising for the company was short, lasting just the month of January. We raised our Series A then, and Frank and I began working in March.
On The Architecture Design of Materialize
At its core, Materialize has a computation engine that is a stream processor. It is designed to let users drive the stream processor without dealing with the shortcomings of stream processing. Most stream processors today require a lot of manual work, but Materialize aims to let users get value from streaming without even having to care about it.
The hard architectural challenge is how to encapsulate the stream processor. The solution is to give people a control plane, which is SQL for defining the streams. The simplest way to expose streaming use cases is through materialized views.
Materialized views are a common database concept that pre-materializes the result of a view so that it can be read for free. Most traditional databases have limited support for materialized views. However, Materialize can incrementally materialize any view, regardless of its complexity. This makes it easy for users to understand the capabilities of Materialize.
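As a rough sketch of what this looks like from the user's side: because Materialize speaks the Postgres wire protocol, a standard Postgres driver can define and query a materialized view. The connection details and the `orders` source below are hypothetical placeholders.

```python
import psycopg2  # Materialize speaks the Postgres wire protocol, so a standard driver works

# Connection details below are illustrative placeholders.
conn = psycopg2.connect(host="localhost", port=6875, user="materialize", dbname="materialize")
conn.autocommit = True
cur = conn.cursor()

# Define the view once; Materialize keeps the aggregate incrementally up to date
# as new events arrive on the (hypothetical) `orders` source.
cur.execute("""
    CREATE MATERIALIZED VIEW revenue_by_region AS
    SELECT region, sum(amount) AS total_revenue
    FROM orders
    GROUP BY region
""")

# Reading the view is an ordinary SELECT over the already-maintained result.
cur.execute("SELECT region, total_revenue FROM revenue_by_region")
print(cur.fetchall())
```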
In addition to materialized views, Materialize has built more traditional database features: a storage engine, a cloud product, replication for high availability, seamless migrations, and the ability to run multiple isolated instances of compute. This makes Materialize more akin to a relational database management system, with features that make stream processing more widely usable while taking care of the management aspects like availability, replication, cloud interoperability, and seamless migrations.
On Streaming SQL
At Materialize, we simplify the process of exposing and incrementally maintaining streaming materialized views expressed in SQL. We achieve this by using the incremental compute capabilities of Differential Dataflow, the underlying stream processing library.
Other systems have used simplified SQL-like languages to build pipelines, which can be done using some SQL and fairly limited syntax. However, this differs greatly from using the full breadth of complex SQL queries and having a relational database management system to figure out the underlying data flow and efficiently maintain indexes incrementally.
It is a bit like the difference between Snowflake and Hive. Hive was SQL on Hadoop, which allowed you to write SQL using Hive. However, it never really worked well. It is difficult to explain the lived experience of Snowflake versus Hive to someone who has not used these products. With Hive, you build batch pipelines using a SQL-like language with limited expressiveness. Similarly, with streaming, many SQL pipeline builder tools have the same limitations.
When dealing with business logic and SQL that involves 30-40 lines of code, six different data sources, subqueries, and complex aggregations, converting it into a set of staged pipelines that execute the query is the job of the database. Asking users to do that manually is like asking them to manually implement a query optimizer and planner. This defeats the purpose of having a database in the first place and makes SQL a hindrance rather than a help.
Some organizations can effectively deploy microservice-oriented streaming stacks that use Flink or Kafka streams. However, not every organization can recruit and build data teams that can do that. Uber and Netflix can do it, but it is unlikely that every company in the Fortune 500 can.
On Open-Source Engagement
The number one tactic to grow our open-source library's adoption is blogging and technical communication.
I believe there is a distinction between projects that try to make their source code understandable to outsiders and those that do not really care. Some open-source projects by large companies have no commit history or issue tracker and offer little help to users. Our open development philosophy is different. Although it is not truly open source because it does not have an OSI-approved open source license and there are limitations on what users can do with the code, we still believe source availability has huge value for users. They get to see what we are developing, how we are thinking about things, and how we are dealing with bugs. This is something that many vendors and software pieces do not offer, where users submit bug reports and never hear back again.
Our philosophy around source availability is closer to having an open scientific process. It is more akin to open science than to traditional open source, while still leaving room to commercialize. As a company, we have a vested interest in commercializing our work at Materialize, but we do not want to compromise the benefits of open communication. This also applies to our blog posts. Developers and builders want to see how problems are solved, which is why they join a community, participate on GitHub, or read our blog posts.
My main takeaway is that there is not really a limit to how much you can share. It is the same with technical blogging. One good blog post that goes through all of the subtle correctness problems you will face when building a deep streaming pipeline is more valuable than 300 posts that offer little help. It may get fewer views, but it is more likely to reach the right people, the ones who actually face the problem and hold purchasing power.
On Materialize Cloud
This is in early preview, and I cannot wait to make it more broadly available. This summer, we have a huge set of features planned for Materialize Cloud that are particularly tailored toward enterprise use cases and our enterprise early adopters.
Materialize Source, the source-available product, is about building an effective system end-to-end. You have all the SQL; there is no SQL that you cannot write in the free version. But the enterprise product is all about building reliable, highly available systems that can really harness that power at scale.
In critical use cases, this means that people care about high availability through replicated instances of Materialize that are always in sync. The first big tentpole feature of Materialize Cloud Enterprise is seamless replication: if a single machine fails, the user does not even notice, because a replica that has been advancing at the same speed takes over the workload and preserves correctness. The two replicas are exactly in sync, so the failover is invisible.
The second feature is an enterprise-grade storage layer that stores all the historical data on S3. This gives very efficient, cheap storage that performs well for people storing streams. We want our users to be able to point very high-volume, high-throughput streams at Materialize Cloud and just forget about them. The historical data gets compacted, efficiently warehoused in cheap object storage, and remains accessible to Materialize replicas very efficiently.
The third feature is use case isolation. We see this with our users: you start with one use case, but the moment you say, "Oh, I can build everything in SQL," you start to build more and more things in SQL on top of Materialize. Many databases struggle here because they are not good at use case isolation and scalability, not in the sense of scaling a single workload horizontally, but in the sense of scaling to more workloads, so that workloads 1, 2, and 3 can sit side by side without interfering with each other.
If you do not have that, the problem is that use case 1 is an extremely important one; it is the reason you bothered adopting a new system in the first place. If use case 2 threatens use case 1, you will not let use case 2 touch the database, and that limits the upside in value the new system can bring you. With separated storage in an object store and replicas that can scalably share that storage, the replicas can evolve to host different sets of queries. That makes the architecture useful not just for high availability but also for use case separation: cluster 1 and cluster 2 can be identical replicas serving use case 1, sitting behind a load balancer.
But you can have cluster 3 host a different set of views for a completely different use case, and all of the clusters can share the input streams and co-evolve the SQL and the way you model it. This really unlocks the power of adopting a system like Materialize: powering more and more of your business logic in real-time SQL.
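Here is a hedged sketch of what use case isolation can look like in SQL, using the kind of cluster DDL Materialize documents; the cluster names, replica sizes, and the `orders` source are illustrative assumptions, and exact syntax may vary by version.

```python
import psycopg2

# Placeholder connection details, as in the earlier example.
conn = psycopg2.connect(host="localhost", port=6875, user="materialize", dbname="materialize")
conn.autocommit = True
cur = conn.cursor()

# Use case 1: two identically sized replicas of one cluster give high availability.
cur.execute("CREATE CLUSTER serving REPLICAS (r1 (SIZE 'medium'), r2 (SIZE 'medium'))")

# Use case 2: a separate cluster hosts a different set of views, sharing the same
# input streams but isolated from the first workload.
cur.execute("CREATE CLUSTER analytics REPLICAS (r1 (SIZE 'large'))")

cur.execute("""
    CREATE MATERIALIZED VIEW item_order_counts
    IN CLUSTER analytics AS
    SELECT item, count(*) AS order_count
    FROM orders
    GROUP BY item
""")
```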
The fourth feature, which is just as necessary even if it is less flashy, is horizontal scalability: building clusters that can scale horizontally across multiple machines. These features are coming out later this summer, and I am extremely excited because they are the foundational enterprise features for using Materialize at scale and across an entire organization.
On Materialize’s Unbundled Cloud Architecture
That blog post explains how we arrived at these four features in detail. Launching four big foundational features at once is a little unusual, but it turns out that they all use the same underlying core enterprise primitive: the separation of storage and compute. Once you have this separation, all of these features fall out pretty straightforwardly. Frank's blog post provides excellent detail about how we are separating storage and compute.
I think Snowflake is the best example of doing this in batch, and in many ways, what we are trying to do with Materialize is to do what Snowflake has done for batch but for streaming.
To explain the architectural decisions succinctly: what Snowflake was to the Hadoop ecosystem is roughly how we think about Materialize relative to Kafka microservice architectures. Hadoop was not unsuccessful, and many companies got much value out of it. But Snowflake really took the power of the cloud and made it accessible to users who want systems that handle all the things they do not want to think about. They want it to look and feel exactly the same as it always has: a SQL system that just scales under the hood. That is really the core insight in Snowflake, as well as what is going on in streaming today.
All these streaming proponents are correct: the world needs to move to lower and lower latency, and people cannot wait for days for it. This stuff has to be cloud-native, simple, and cheap, and it has to scale out. But Materialize's fundamental thing is that people want to just write the SQL they have always been writing. They do not want to reinvent and throw out their entire data architectures and start over from scratch. That is really the core value and founding principle behind why we started Materialize.
On The Modern Data Stack
When it comes to enterprise go-to-market, it is a challenging and complex process that cannot be accomplished overnight. As with my core lesson from Cockroach, it is going to take time, so you need to start evangelizing and communicating early, even while your product is still in development, to keep up with the evolving ecosystem. You must also take great care to fit within the ecosystem. People do not choose a data platform or tool in isolation. A significant benefit of standard SQL is that it comes out of the box with all the necessary integrations, including BI and data integration tools.
From day one, you need a strategy to engage with the ecosystem as it is. You cannot afford to build a great tool in a vacuum that reinvents everything. Sometimes, in certain spaces, you are required to switch to the provider's custom visualization layer, observability stack, or monitoring stack to adopt their tool. However, this will not work because, from the user's perspective, you are just one of many tools. While you may solve an important problem, if you require them to throw everything out, they will not do it.
Certain parts of it are just table stakes, such as dbt. Evolving your SQL business logic is a complex task, and people have correctly converged on using dbt to specify their models, schemas, and shared workflows because it is a much better way of organizing a team for productivity. Table-stakes integrations that rely on deep SQL integration, such as the ability to lift an existing dbt model written in SQL from batch to Materialize, work well.
Other integrations work well because the two tools together give you a new way of doing things that is superior in a greenfield project. A good example of this is Redpanda, a more developer-friendly streaming platform. If you are building an event-sourced application from scratch, it is the right choice to have a lean, clean streaming architecture where you source all your events into your stream processor and build application use cases using just SQL. You can use whatever web framework you are most productive in and have it read directly from Materialize by running SELECT queries.
For each integration to be valuable, there has to be a clear story about how both tools' users see value coming in either direction. Not everything is bidirectional, and it is very much case-by-case, tool-by-tool. You must be very thoughtful and clear about why these things make sense together. Some partnerships rely on one party more than the other, which is fine too. In those cases, the relationship is unidirectional, and it is all about bringing deals and making more money. Snowflake and Fivetran had such a relationship early on, where Fivetran solved a major problem in Snowflake by getting data into it in a straightforward way, bringing Fivetran more money. But some partnerships are bidirectional; again, it is very much case by case, depending on the circumstances and specifics of the two products.
On Hiring and Culture Building
I believe that the most important responsibility of a CEO is hiring. In fact, the CEO's primary job is to hire.
Clear communication is also crucial for attracting the right people to Materialize. You need to be able to articulate your vision clearly and concisely and convey what you stand for and what is exciting about joining the team.
Many of the people we have hired have come through referrals and our referral network. Referrals are an outsized way to hire, as they come with a prior relationship and a level of trust. Joining a startup can be scary, but if you know and trust the people at that company, it is much easier to take the leap.
Of course, building a diverse team can be challenging when your network is made up of people you already know. But the key is to build trust over time. Recruiting often requires playing the long game, and it can take several years of engagement before someone joins your team.
To build a high-performance culture, clear communication and alignment are essential. Everyone can work independently while minimizing communication when they understand what they are building and why. This leads to highly coordinated behavior, like watching a soccer team pass the ball without looking because they know where their teammates will be.
It all starts with clear communication and a shared understanding of the product direction and values. As a CEO, I have learned to over-invest in this communication and articulation of what we are doing and why.
On Finding Customers
The majority, if not all, of our users come inbound. This means that technical communication is crucial. In fact, many of the use cases were not initially considered by us. The users approached us and said, "You are solving exactly this thing." We were surprised because we did not know much about e-commerce, for instance.
I do not think that is entirely accurate, but the use cases are often determined by our users and customers. They educate us because we build a horizontal platform applicable in multiple verticals, like a database or anything else. As a result, your users always know more about the verticals than you do. Therefore, the communication is geared towards explaining general principles, what you are solving, the core technical problem, and letting your users teach you over time. Every use case on our website or case study represents our users, teaching us a great deal of information about their industry, where they are the experts, and how to use Materialize in ways that are driven entirely by them.
It is also helpful to consider the journey from the user's perspective. Most users begin building their library of workflows in batch. They often explore a database using batch processing, trying to control as many variables as possible. They certainly do not want the data to change while working on it. So, they start building dbt models and a SQL library of insights in their company. The next step is where real-time processing becomes essential, which is when they start to take action regularly on the results of that analytics.
If they do this with a human in the loop, batch processing works fine. You can run a batch job, get the results, and create a report or dashboard a human looks at. However, when you start to do automated actions, speed becomes critical. As people quickly see from the data, there is a huge penalty to increased latency. For instance, e-commerce notification personalization is a use case that we have users for, such as Drizly. They use Materialize to power notifications that need to happen when the analytics engineer determines it is the most valuable time to do so. These notifications are a precious resource; you do not want to send outdated information or something after it is too late. That is when migrating a pipeline from batch processing to streaming becomes a priority. You do not need real-time capabilities if a human is in the loop triggering that action. However, when you are doing automated actions, that is when a system like Materialize or a streaming pipeline becomes far more valuable.
On Fundraising
There are two primary sources of advice when it comes to fundraising. The first is to consider your investors, stakeholders, and the entire lifecycle of your company from their perspective. Spencer at Cockroach provided me with a clear understanding of this, as databases are capital-intensive and have a long time to pay off. Fundraising for them is particularly difficult because you are asking for a lot more money and a longer wait time than investors are used to. This means that the bar you have to meet is very high.
The second source of advice is to think of venture capital as a personal relationship with individuals making a bet on the team and the company rather than just a spreadsheet with an ROI on capital invested. Unlike other asset classes, such as private equity or public markets, investors in venture capital are much more likely to meet or conduct diligence on the team before investing. Therefore, it is essential to take great care in choosing partners because it is a relationship that will last for a decade.
In terms of pitching or convincing investors to give you money, the most effective thing you can do is to have the backing and endorsement of successful founders. This went a long way in our initial fundraising experience and has continued to benefit us since. Therefore, it is a good idea to work at a high-performing, venture-backed startup before starting your own company.
The most important thing you can do is choose the company you work for and the people you work with wisely. High-integrity and generous people are more likely to provide help and endorsement that will go a long way in building relationships with VCs. This was a conscious decision on my part when I started my company, as I knew that the capital-intensive nature of the business would require years of relationship building to gain the trust of investors.
On Being A Researcher vs. Being A Founder
I believe there are many similarities between academia and founding startups. Many successful startup founders have backgrounds in academia. Both fields operate under high degrees of uncertainty and lack a set playbook, requiring individuals to figure things out as they go.
Both fields also have long feedback loops, making it difficult to know if you are on the right track for a long period of time. In addition, both can be quite lonely endeavors. A Ph.D., which takes at least five years to complete, requires a significant amount of individual work. Similarly, building a company can take at least five years or more.
While failed startups and academic careers are unique in their own ways, successful trajectories tend to be more similar than not.
Show Notes
(01:18) Arjun shared formative experiences of his upbringing - growing up in Bangalore, India; going to UWC Mahindra College for high school; and pursuing a liberal arts education in the US.
(04:45) Arjun described his overall academic experience at Williams College - where he studied Computer Science and Economics and did a one-year stint at the Computer Lab at the University of Cambridge.
(11:19) Arjun talked about his specialization within academic computer science: distributed systems.
(14:17) Arjun unpacked the arc of his Ph.D. experience at the University of Pennsylvania, advised by Professor Andreas Haeberlen.
(19:25) Arjun dissected the technical challenges and novelty of his Ph.D. dissertation on distributed systems that compute differentially private results.
(23:20) Arjun shared his love for teaching which benefits his industry career.
(25:55) Arjun walked through his decision to join Cockroach Labs as a software engineer.
(32:25) Arjun unpacked the CockroachDB Performance Guide and a RocksDB deep-dive on the Cockroach Labs blog.
(37:24) Arjun shared valuable lessons learned from his scaling journey with Cockroach.
(41:36) Arjun mentioned how his writing practice benefited his day-to-day work designing database systems in a production setting (Check out his posts on database transaction isolation semantics and the history of log-structured merge trees).
(45:46) Arjun unpacked his 2019 blog post titled "The Philosophy of Computational Complexity."
(52:52) Arjun emphasized the importance of writing evergreen and authoritative long-form content, even if it attracts a narrower audience.
(55:54) Arjun shared the story behind the founding of Materialize, which builds a SQL streaming database on top of Timely Dataflow and Differential Dataflow, two research projects created by his co-founder Frank McSherry.
(01:00:04) Arjun unpacked the architecture design of Materialize at a high level.
(01:04:36) Arjun explained a core capability of Materialize called Streaming SQL.
(01:07:37) Arjun discussed successful tactics to raise the adoption and contribution to Materialize's open-source project.
(01:11:23) Arjun walked through the major enterprise-grade features baked into Materialize Cloud.
(01:15:54) Arjun dissected a blog post about Materialize’s unbundled cloud architecture detailing the shift from the Materialize single binary to Materialize Cloud.
(01:21:13) Arjun envisioned how Materialize fits into the quickly evolving modern data stack.
(01:25:07) Arjun shared valuable hiring lessons to attract the right people who are excited about Materialize's mission.
(01:27:59) Arjun shared his brief take on building a high-performance company culture.
(01:29:19) Arjun discussed the challenges for his team to find the early design partners.
(01:31:17) Arjun walked through notable use cases of Materialize.
(01:34:24) Arjun shared fundraising advice with founders who are seeking the right investors for their startups.
(01:41:21) Arjun highlighted the similarities and differences between being a researcher and a founder.
(01:42:46) Closing segment.
Arjun's Contact Info
Materialize's Resources
Mentioned Content
Research + Articles
People
Book
Zero To One (by Peter Thiel)
Notes
My conversation with Arjun was recorded back in May 2022. Since then, a lot has happened. I recommend looking at the resources below:
About Materialize webpage (which shows the team building Materialize as well as the pedigree)
Guide: What is a Streaming Database (which walks through why Materialize is important and different from a normal database)
Case Study: Real-time Delivery Tracking UI in a Single Sprint at Onward
Tech Demo: CI/CD Workflows for dbt+Materialize (March 2023)
About the show
Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.
Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email khanhle.1013@gmail.com.
Subscribe by searching for Datacast wherever you get podcasts, or click one of the links below:
If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.