Datacast Episode 110: Wisdom in Building Data Infrastructure, Lessons from Open-Source Development, The Missing README, and The Future of Data Engineering with Chris Riccomini

The 110th episode of Datacast is my conversation with Chris Riccomini - an engineer, author, investor, and advisor.

Our wide-ranging conversation touches on his 15+ years of experience working on infrastructure as an engineer and manager at PayPal, LinkedIn, and WePay; his involvement in open source as the original author of Apache Samza and an early contributor to Apache Airflow; tactical advice on conducting technical interviews, building internal data infrastructure, and writing a technical book; his experience investing in and advising startups in the data space; the future of data engineering; and much more.

Please enjoy my conversation with Chris!

Listen to the show on (1) Spotify, (2) Apple Podcasts, (3) Google Podcasts, (4) iHeartRadio, (5) RadioPublic, (6) TuneIn, and (7) Stitcher

Key Takeaways

Here are the highlights from my conversation with Chris:

On His Education at Santa Clara University

I actually did my first year at USC down in Los Angeles. At that point, I was interested in the fusion of video and technology. I naively went down there thinking I could work on both. But USC is very focused on its cinematography department and entertainment work, which required a portfolio. I went there as a Computer Science undergrad, and while USC had a decent CS education, I knew I wanted to work and be involved in tech companies while at university. The opportunities were much better in Silicon Valley than in LA at the time, so I transferred to Santa Clara during my sophomore year.

I got a job initially at a little company called NeoMagic, which was known for making video cards. But by the time I got there, they were working on chipsets for cellphones and had a cool cellphone implementation - essentially a touch screen without buttons. That did not go too far, so I moved over to Intacct Corporation.

On the education side, a nice thing about going to school in Silicon Valley is that many professors, especially at the Master's level, were people who had worked in the industry. I found that really valuable. I had a guy who had worked at Intel for 20 years teaching a project management class. I had someone else who had worked at Oracle teaching a database class. That was my interest since I was much more into applied knowledge.

The undergrad CS department at Santa Clara was heavily focused on math and the theoretical stuff. They also had a computer engineering program - more of an applied computer science program that blends CS, computer engineering, and straight electrical engineering. That would have been a better fit for me. But going in as an undergrad, I did not understand the difference, so I ended up with Computer Science. I favored classes in the data and database space and self-selected for a data focus. Honestly, I couldn't tell you why I gravitated to it.

On His First Engineering Job at PayPal

The biggest learning I had at PayPal was the importance of the people I work with. The team I landed on was called the Advanced Concepts team. We were doing early machine learning, and I, in particular, focused on data visualization. That got me back to the UI and design portions of computer science. I got introduced to someone there named DJ Patil, who was in an area of the organization called eBay Corp. He was really influential in the early part of my career. He eventually left to go to LinkedIn, and I followed him there. He really helped develop my early career and was a mentor for me for a while. My intern manager, whose name was Grahame Jastrebski, was helpful in teaching and guiding me, making sure I had projects that were important, and getting me in front of people. People like DJ and Grahame did something pretty rare: they brought me along when we demoed the systems I built to the VPs and directors of PayPal. As a very young engineer, getting to sit in a room with these folks was very neat. I learned a lot by being in the room and observing what was going on.

Looking back, I don't know whether it was luck or intuition, but the willingness to stick around and wait to get the job at PayPal, and then getting lucky enough to meet these people and learn from them was really a big deal. For me, the big learning was finding people you want to work with and learn from. Finding these mentors completely changed the arc of my career.

On Improving LinkedIn's "People You May Know" Algorithm

The first real project that I worked on at LinkedIn was essentially to try and productionize the "People You May Know" algorithm, also known as the PYMK algorithm. There's a widget on LinkedIn that suggests people you might know and want to connect with. They found that this was highly impactful in driving connections, which was critical for growth and the health of the overall network. At the time I joined, it was essentially running on a bunch of scripts on Oracle, and one of the biggest contributors to the algorithm was something called triangle closing, which is essentially just friends of friends. The triangle-closing calculation is essentially a self-join: from A to B, then from B to C, and the resulting A-to-C pairs become your second-degree connections. So that's a huge join in an Oracle database, and it didn't work. They were running this thing, and it would take weeks to calculate, and so they would get this new calculation and then ship it off to production. The goal was to get it to not take weeks, to be stable, and to have a progress bar so you knew how far along you were.

Source: https://blog.linkedin.com/2010/05/12/linkedin-pymk

We tried a number of sharded database solutions that were out at the time, and most of them were derived from Postgres.

  • The first one we tried was Greenplum, which at the time was a data warehouse that allowed you to run on what was theoretically commodity hardware. It didn't work out so well for us, and we ended up having to get machines that were not really commodity.

  • We then moved to Aster Data, which I believe is now owned by Teradata. Aster was at least able to complete the inner join, but it was very unstable.

  • We eventually ended up trying Hadoop, and it just worked for the algorithm. We built a MapReduce job (sketched below), and you could see the mappers and reducers going and everything, and you could see it making progress. We plugged the MapReduce stuff into Azkaban, which is an orchestration tool, and built a whole orchestration DAG for People You May Know that did all the different calculations. We built a logistic regression algorithm, and it just worked.
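To make the triangle-closing computation concrete, here is a minimal Python sketch (not LinkedIn's actual code) that mimics the map/reduce shape of the job: the "map" step emits candidate pairs that share a connection, and the "reduce" step counts common connections and drops pairs that are already connected.

```python
from collections import defaultdict
from itertools import combinations

# Toy, symmetric connection graph; in production this was billions of edges.
connections = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol"},
}

# "Map" step: every pair of a member's connections is a candidate
# second-degree pair (they share at least this one mutual connection).
candidate_counts = defaultdict(int)
for member, friends in connections.items():
    for a, b in combinations(sorted(friends), 2):
        candidate_counts[(a, b)] += 1

# "Reduce" step: keep pairs that are not already connected, ranked by
# how many connections they have in common.
suggestions = sorted(
    ((pair, common) for pair, common in candidate_counts.items()
     if pair[1] not in connections[pair[0]]),
    key=lambda item: -item[1],
)
print(suggestions)  # e.g. [(('bob', 'carol'), 2), (('alice', 'dave'), 2)]
```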

The team that initially explored the Aster Data and Greenplum space became what is now known as the data science team. At the time, we didn't have the name "data science" yet, so the skill set was somewhat different from what you would expect from an engineering team. One of the initial criteria we were looking for was SQL expertise. The theory was that we had a lot of people who were not very familiar with Java or Python, but were comfortable with R and SQL. Initially, we were not very interested in Hadoop. It wasn't so much that we were resistant to it, but it was lower on our priority list because it lacked SQL support. (Keep in mind that Hive was not yet fully developed at that time, and Pig was not well known.) So, we tried Greenplum and Aster, mainly because they were data warehouses that analytics people and data scientists could understand--and they used SQL, which data scientists were already comfortable with. We started there.

The reason we eventually started using Hadoop was that none of the other systems we tried were working. We ended up with Hadoop, and in hindsight, that was probably the right choice. For several years after that, I would advise people not to use Hadoop unless they absolutely had to--operationally, it was a nightmare, and even from a developer experience perspective, it was a pain to deal with. So, if you can avoid it, you should. But we couldn't avoid it--we were dealing with billions of address book entries and connection entries, and hard disks at the time were much slower and more expensive than they are now. Hadoop was really the only viable option we had at the time.

That was my first real project, and it was something that I'm still really proud of. Especially as a young engineer, getting to work on something of that scale and really driving results with it was gratifying.

On The Operational Challenges at LinkedIn

One thing that was really a shock to me was the lack of observability that we had at LinkedIn. We didn't really have a robust operations culture, and the way things worked was really a black box to the developers. When we wanted things to be deployed, we would have to go to the operations team, and when we wanted to see logs, we would have to sit with them. In theory, we had Splunk, but in reality, the Splunk cluster could not support the volume of logs we had. Same thing with metrics: we had really no charts or graphs that I could see.

The frustrating thing was the ops people had all these things, but they were walled off where the developers couldn't see them. Deploying a web service was also a challenge. They were doing weekly releases, and they would literally turn the site off when they would do a release. They would have all the engineers on an IRC chat, and they would copy and paste all these jars and SCP them where they needed to be in the various machines. It was brutal. They had a wiki with steps that you would fill in, and it became a very complex graph. Ideally, it's acyclic, but in practice, it was actually cyclic in some cases. It took us several years to get out of that kind of dependency hell.

Source: https://engineering.linkedin.com/blog/2019/learnings-from-the-journey-to-continuous-deployment

Towards the end of the year, they spent a few months breaking up the monolith that we had, which was this big Java-based monolith. They built continuous integration and continuous deployment machinery and set up a system called CRT, which was essentially a continuous integration and deployment dashboard. They invested a ton in that area, and it really paid off. We were able to ship at will versus this sort of bi-weekly or weekly cadence. We were able to develop independently. We didn't have to have everything in the monolith. We broke things up into multiple repos, so it made a huge difference.

On Engineering Practices For Early-Stage Startups

One piece of advice I would give to early-stage startups is to adopt existing solutions instead of trying to build their own. At LinkedIn, we suffered from NIH (not invented here) syndrome, and we only succeeded in spite of ourselves, not because of the tools we built. Today, there are many great tools available for DevOps, CI/CD, and related fields. Just choose one and use it. It could be GitLab, GitHub Actions, CircleCI, Travis, or others.

The second piece of advice is to not neglect the continuous integration and deployment part. While it may be tempting to focus solely on building the product and getting it out there, investing in CI/CD is relatively cheap and easy to do. For example, I recently used Vercel and found it simple to set up staging and production environments in less than five minutes. So, my suggestion is to pick an existing solution, use it, and invest in CI/CD without trying to build your own thing.

On Technical Interviews

Looking back, it makes me laugh because I did well over 1000 interviews-- so many interviews! They don't tell you in school that if you're lucky enough to work at a fast-growing company, interviewing will take up a huge amount of your time and can be very disruptive. Initially, I was excited about interviewing people, but interview fatigue quickly set in. Eventually, though, I realized that interviewing is one of the highest-leverage activities an engineer can do. If you find someone who is as good or better than you are, you now have two of you. You literally double what the company can do, and if you do it again, you have three of you. The leverage on hiring is incredibly high, but engineers often get fixated on the cost and impact to their short-term productivity. While that is true, in the long term, hiring pays huge dividends.

Improving your interview skills takes practice, but to get practice, you need guide rails. Shadowing is key. There are two modes of shadowing:

  1. Attend an interview with someone else asking the questions. Observe the candidate and the person asking the questions. Understand what they're doing, why they're asking the questions, and what they're asking.

  2. Lead the interview while someone experienced passively watches. If things go off the rails, they can step in and help out or give feedback.

Reading Joel Spolsky's blog posts on interviewing and hiring is also worthwhile. His posts provide insight on how to bring people into an organization.

At WePay, we had a full-on interviewing training program that worked extremely well. To become a trainer, you would qualify by shadowing as a follower and a leader. If you can find a company with a robust interviewing culture, that is a good way to learn. When interviewing for a job, pay attention to the interviewer's questions and whether they have a good interview program. Overall, shadowing and practicing interviews in a safe environment with guidance is a great way to improve your interview skills.

It's important to note that there are different types of interviews and being good at one does not necessarily mean you'll excel at the others. For instance, there's the phone screen, which has a specific objective and structure that is different from an on-site interview. During an on-site interview, you might be evaluating technical expertise, communication skills, or design abilities. Each of these requires a different approach, and it's important to understand the objectives and be prepared accordingly. Communication and past work interviews are more free-form, and you need to be adept at guiding the conversation. Technical questions, on the other hand, are more predictable and require a different skill set. It's important not to generalize and assume that because you're good at one type of interview, you can handle them all. Each interview requires a unique approach, and you need to adapt accordingly.

Extended on-site interviews, where candidates code with a laptop or pair program, are also a different kind of interview that requires a distinct approach. These are different from whiteboard interviews or one-hour, canned interviews.

On Apache Samza

Samza is a stream processing system that we built. It is important to understand the context around the time it was created, because now there are newer systems like Flink, Dataflow, KSQL, and Kafka Streams. At the time, the main stream processing systems were Spark Streaming and Storm, both of which had limitations. Spark Streaming was a hybrid system for batch and stream processing, but it was not ideal. Storm used ZeroMQ as its transport layer, which was not persistent.

Our philosophy for Samza was to have a separate stream processing system instead of marrying batch and stream processing. We believed that having a single system like Spark was not the right way to go. We also didn't want an internal transport layer like ZeroMQ, so we used Kafka for messaging between each operator in the stream processing flow. This turned out to be a good decision, as it offloaded a lot of the complexity to Kafka's transactional and state management features, like log compaction.

Source: https://engineering.linkedin.com/samza/apache-samza-linkedins-stream-processing-engine

In terms of architecture, Samza was fairly simple. It provided an API that looked a lot like MapReduce. The orchestration layer was tied heavily to YARN, which was Hadoop's resource scheduler at the time. This was a mistake, as not many people were using YARN. We should have built it as a library, like KSQL and Kafka Streams did. Lastly, the transport layer was mostly Kafka, with a System Consumer and System Producer for each source and sink.
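Samza's real API is Java, but to illustrate the shape of the design (Kafka in, process, Kafka out, with no internal transport layer), here is a hypothetical Python analogue using the confluent-kafka client; the broker address, topic names, and "enrichment" step are all placeholders.

```python
# A hypothetical Kafka-in/Kafka-out task loop, NOT Samza's actual API.
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "page-view-enricher",       # hypothetical job name
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})

consumer.subscribe(["page-views"])          # hypothetical input topic
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        # The "process" step: transform each message however the job requires.
        enriched = msg.value().upper()
        producer.produce("page-views-enriched", value=enriched)
        producer.poll(0)                    # serve delivery callbacks
except KeyboardInterrupt:
    pass
finally:
    producer.flush()
    consumer.close()
```

Because every hop between operators goes through a durable Kafka topic, a downstream job can restart and replay from its last committed offset instead of relying on an in-memory transport like ZeroMQ.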

Overall, Samza served its purpose as a stream processing system and helped us learn what worked and what didn't.

I believe that we took the right approach. We presented the project at various companies, both those that would adopt it and those that would contribute to it. Hortonworks and Cloudera were two of the major companies we spoke with at the time. We also talked to companies such as Netflix and startups to spread the word. I think we did a good job of promoting the project through meetups, talks, presentations, and videos. However, I think the most cost-effective way to promote is through blog posts. We also wrote papers, but I would say those have the worst cost-to-benefit ratio unless you have something truly novel.

It's important to promote the project, but as I mentioned earlier, growth can only take you so far. You need to make it easy for customers to use the system, whether it's open source or not. This means providing documentation, hello worlds, demos, and being responsive to questions and feedback.

Source: https://samza.apache.org/

One area where we could have done better is making it easier for customers to adopt our system. We tried to make it easier for them to use YARN, but it was still a tall order. I learned that we need to pay attention to what people are doing and how they are deploying systems to make it easier for them to use our project.

Another thing I learned is that while it's important to respond to every question and request, it's not scalable. As an open-source maintainer, you need to be comfortable ignoring or load-shedding some lower-value tasks because there are only so many hours in a day. You need to pick your battles and where you will spend your time.

On Joining The Data Infrastructure Team at WePay

I worked at LinkedIn for about 6.5 years before leaving to go to WePay, which raises several questions: when and why to leave a company, where to go, and why. The main reason I left LinkedIn was that I had become too much of a specialist and wanted to work with more breadth and be involved in more things. When I joined LinkedIn, it was a small company with about 300 people, and I worked on the data science team. I was involved in data engineering, analytics engineering, ML engineering, developer tools, the developer platform, the web service for Who's Viewed My Profile, and essentially application engineering and development, all in about two years. However, as the company grew, they hired more specialists, such as database administrators, web service developers, and developer tool experts. To survive in that environment, I was also forced to specialize, which led me to specialize in infrastructure, data infrastructure, and Samza. I read many PhD and postdoc-level stream processing papers and spent all my time on Samza, becoming too hyper-focused on one area. I wanted to get back to breadth and working on many different things to develop my skill set. At the time, I wanted to learn about cloud technology, but LinkedIn was nowhere on the cloud landscape and was still running its own data centers. They were building their own equivalents to Docker and Kubernetes rather than adopting them, and it was not the right place for me to learn that kind of stuff.

When thinking about leaving a job, I borrowed an idea from a friend, Monica Rogati, who worked with me at LinkedIn for several years. She suggested thinking about job offers as a weighted average: you list the things you care about, such as location, social impact, money, growth opportunities, vertical, team, and culture, and you weight each by how much it matters to you. For each job offer, you assign a score to each of these factors, multiply by the weights, and sum it all up. The offer with the highest total is the job that you should go for.

Source: https://cnr.sh/essays/choosing-where-to-work
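A rough sketch of that scoring in Python (the factors, weights, and scores below are invented for illustration):

```python
# Invented weights (how much you care about each factor) and 1-10 scores.
weights = {"growth": 0.3, "team": 0.25, "culture": 0.2, "money": 0.15, "location": 0.1}
offers = {
    "Company A": {"growth": 9, "team": 8, "culture": 7, "money": 6, "location": 5},
    "Company B": {"growth": 6, "team": 7, "culture": 9, "money": 9, "location": 7},
}

for name, scores in offers.items():
    total = sum(weights[factor] * scores[factor] for factor in weights)
    print(name, round(total, 2))  # pick the offer with the highest weighted total
```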

In my case, I wanted to learn about cloud technology, work with a wide array of things at the company, work with friends, have a good culture, and prove that I could do something as a leader. WePay offered me that opportunity, so I decided to work there.

One particular dimension that I find nuanced and worth highlighting is the concept of impact, which I also discussed in the article. Many engineers care about the impact they have, and there are two ways to think about it. The first one is the overall, absolute impact on society, while the second one is the impact within the organization, specifically the company you're working for.

  1. For instance, if you work for a huge company like Google, your job could be as simple as changing a button color from red to blue, which can increase engagement by 0.5%. Although the impact on the world is substantial because a billion people use that button, it doesn't seem like it matters much to you or anyone else.

  2. On the other hand, if you work for a startup, the impact may be less visible and tangible, but it's more emotional and rewarding because you're directly in contact with the people you're helping. For example, if you get the first customer onboarded, you can see and feel the difference you've made even though it may not have a significant impact on society as a whole.

Both types of impact are valid and neither is better or worse than the other. However, some people thrive in one environment more than the other, and it's important to know which one you prefer.

On The Evolution of WePay's Data Infrastructure

When I first joined WePay, I spent a year working on their service infrastructure stack, helping them move from a monolithic system to a microservice architecture. During this time, I built one of their first web services, which was a settlement service for settling money into bank accounts. This didn't have much to do with data infrastructure, but it was similar in some ways to processing payments.

Source: https://www.infoq.com/articles/future-data-engineering-riccomini/

After a year, I moved onto the data infrastructure side of things. At the time, our data warehouse was just a replica MySQL instance. This worked well for a while, but as the company grew and our security compliance needs evolved, we needed a real data warehouse to handle the amount of data and govern who had access to it. We decided to use Airflow and BigQuery for the first version of our data warehouse. We chose BigQuery because we were on Google Cloud and it was the shortest path to a reasonable data warehouse. Airflow was written in Python, which was appealing, and had a great UI.

We started with Airflow and BigQuery because we knew we eventually wanted to stream data, but we didn't have the resources to do so at the time. Establishing a basic data pipeline was a way to show value quickly and buy time to build the second version of our data warehouse. For the second version, we used Debezium as the source connector to get data from our OLTP systems into Kafka, and then we used a connector we wrote, KCBQ (Kafka Connect BigQuery), to load data from Kafka into BigQuery. We built views to transform and filter the raw messages we loaded from Kafka into BigQuery so that they looked like the source tables in MySQL.

Source: https://www.infoq.com/articles/future-data-engineering-riccomini/
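For flavor, registering a Debezium MySQL source connector against a Kafka Connect cluster looks roughly like the sketch below. The hostnames, credentials, table names, and connector name are placeholders, and the property names shown are from Debezium 1.x, so they may differ in other versions.

```python
import requests

connector = {
    "name": "payments-mysql-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "184054",
        "database.server.name": "payments",  # prefix for the CDC topics
        "table.include.list": "payments.settlements",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.payments",
    },
}

# Kafka Connect exposes a REST API for creating connectors.
resp = requests.post("http://connect.internal:8083/connectors", json=connector)
resp.raise_for_status()
```

Once the connector is running, each change to the watched tables lands in a Kafka topic, and a sink connector (like KCBQ) streams those change records into BigQuery.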

From the beginning, I invested heavily in data quality checks because no pipeline is perfect. We had data quality checks that counted the number of rows in a table in MySQL vs. BigQuery, and also did row-by-row, column-by-column checks to make sure the data was identical. We got six years of mileage out of those data quality checks and found many bugs and inconsistencies. There are two ways to approach data quality checks.

  1. The first is similar to a unit test, where you define or auto-derive a set of heuristics. For example, you might define that a particular column, like "country", should have a distinct-value count no greater than the number of countries in the world. These checks are static and rigidly defined, and more akin to a unit test.

  2. The second approach to data quality checks is using machine learning to automatically detect anomalies in the data. This is a more interesting approach that we didn't use as much at my previous company. Instead of manually defining cardinality thresholds, for example, a machine learning system would look at the cardinality of that column and detect when it changes in a strange way over time. For instance, if the cardinality for a column has been between 200 and 300 for the last 100 days, but suddenly jumps to 10,000, the system would alert you to the anomaly. This approach is particularly useful for event-based systems, where there is no real upstream source of truth to compare against.

Both approaches have their place, and it's important not to overlook the importance of data quality. Great Expectations is a system that works well for the first approach, while companies like Anomalo use machine learning for the second.
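A bare-bones version of the reconciliation checks described above (row counts compared between MySQL and BigQuery) might look like the following sketch; the table name, connection details, dataset name, and failure behavior are all placeholders.

```python
import mysql.connector
from google.cloud import bigquery

def mysql_row_count(table: str) -> int:
    conn = mysql.connector.connect(host="mysql.internal", user="reader",
                                   password="********", database="payments")
    try:
        cursor = conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        return cursor.fetchone()[0]
    finally:
        conn.close()

def bigquery_row_count(table: str) -> int:
    client = bigquery.Client()
    rows = client.query(f"SELECT COUNT(*) AS n FROM `warehouse.{table}`").result()
    return next(iter(rows)).n

def check_table(table: str) -> None:
    source, target = mysql_row_count(table), bigquery_row_count(table)
    if source != target:
        # In practice this would fail the pipeline run or page someone.
        raise ValueError(f"{table}: MySQL has {source} rows, BigQuery has {target}")

check_table("settlements")  # hypothetical table name
```

Row-by-row, column-by-column comparisons follow the same pattern, just comparing checksums or full-row hashes instead of counts.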

On Apache Airflow

Disclaimer: I have not been actively involved with the Airflow project for a few years now, but I can still provide some insight. When I first started working on Airflow, it was a young project that had a lot of activity and contributions. However, with that came instability and chaos in the early days. There were often bugs and broken features that needed to be debugged and fixed. Upgrading to new releases sometimes caused new issues to arise.

At one point, my team froze Airflow at version 1.10 and cherry-picked the commits we needed. The community agreed that there were features that needed to be implemented, but many of them were backwards-incompatible. The Airflow 2.0 release added many features that I wished we had, such as a robust RESTful interface, tighter integration with Kubernetes, and a more scalable scheduler. However, my team never upgraded to Airflow 2.0.

Source: https://hightouch.com/blog/airflow-alternatives-a-look-at-prefect-and-dagster

If starting from scratch now, I would consider Airflow, Prefect, and Dagster as potential orchestration options. Airflow is widely adopted and has a low level of risk involved in adopting it. Prefect is more focused on the data processing world and has a more robust hosting offering. Dagster has an interesting approach to defining DAGs, where the system derives the DAG from declarations rather than requiring developers to define it explicitly.

It's worth noting that Airflow 2.0 supports more idiomatic Python and decorator-based DAG definitions (the TaskFlow API), making it easier to write code that looks like normal Python.
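For example, a toy Airflow 2.x DAG using the decorator-based TaskFlow API might look like the sketch below; the task names and the "warehouse load" they pretend to do are invented for illustration.

```python
import pendulum
from airflow.decorators import dag, task

@dag(schedule_interval="@daily",
     start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
     catchup=False)
def toy_warehouse_load():
    @task
    def extract():
        # Placeholder for pulling rows from an upstream source.
        return [{"id": 1, "amount": 42}]

    @task
    def load(rows):
        # Placeholder for writing to the warehouse (e.g., BigQuery).
        print(f"loading {len(rows)} rows")

    # Passing the output of one task to another defines the dependency edge.
    load(extract())

toy_warehouse_load()
```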

I would like to point out that orchestration is closely related to lineage, which was not immediately apparent to me. However, there is a whole subset of data processing related to lineage, which involves tracking where the data comes from and where it goes. When you think about it, in ETL, operators create, read, and write data, so it's easy to keep track of where they are reading and writing from. I have seen a lot more synergy and integration between orchestration and lineage than I initially realized. Lineage is not only important for compliance, but also for operations and other purposes.

On Apache Kafka

Source: https://cnr.sh/essays/kafka-escape-hatch

There are two parts to this. First, architecturally, in an organization, there is the concept of an enterprise service bus. This is an old concept that dates back 20-something years. It serves as an integration layer where you connect various systems to a bus, allowing data to flow between them. Prior to Kafka, the issue with this architecture was that there was no technology that could scale, provide low latency, and have the necessary durability to make it work. Companies like Tibco attempted to accomplish this with varying degrees of success, but it was not widely adopted due to technology limitations.

Kafka addressed this issue by marrying a traditional log aggregation system, such as Flume or Scribe, with more of a pub-sub message queue system. The design decisions made to support this were key. The topics were partitioned so that they could scale horizontally, and they were also durable, which is not always an obvious decision to make. Essentially, messages persist according to a tunable retention policy, which can be based on size or time. This allows consumers to go back and forth in the log, and it's okay to not be able to catch up. It also allows for batch processing of messages.

The flexibility to go backward and forward in time, have messages be durable and partitioned, and support consumer groups were key features that Kafka implemented well. Consumer groups allow multiple consumers to read messages independently of each other, ensuring that a given message is seen only once by a particular group. Kafka's scalability and rebalancing features made it stand out from other systems like ActiveMQ, which could not handle large volumes of data.
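To make the consumer group idea concrete, here is a small sketch with the confluent-kafka client (broker address, topic, and group names are placeholders): consumers that share a group.id split the topic's partitions between them, while a consumer with a different group.id independently sees every message, and because the log is durable it can also rewind to an earlier offset and reprocess history.

```python
from confluent_kafka import Consumer, TopicPartition

def make_consumer(group_id: str) -> Consumer:
    return Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder broker
        "group.id": group_id,
        "auto.offset.reset": "earliest",
    })

# Two services, two independent groups: each group gets its own cursor
# into the same durable, partitioned log.
billing = make_consumer("billing-service")
billing.subscribe(["payments"])                 # placeholder topic

backfill = make_consumer("analytics-backfill")
# Rewind: explicitly start from offset 0 of partition 0 to replay history.
backfill.assign([TopicPartition("payments", 0, 0)])

while True:
    msg = backfill.poll(timeout=1.0)
    if msg is None:
        break
    print(msg.offset(), msg.value())

billing.close()
backfill.close()
```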

The implementation of the write-ahead log was also crucial. Being able to append to a file sequentially, without maintaining a binary tree, was a significant improvement over other systems. This meant that Kafka could perform much faster, which was demonstrated in performance charts that showed how scalable it was compared to other systems.

Overall, Kafka's ability to fill the enterprise service bus promise that was previously missing was due to its marrying of traditional log aggregation with pub-sub messaging, its partitioning and durability features, and its support of consumer groups.

On His Proudest Accomplishment at WePay

I'm really proud of the team, especially the engineering team that we built at WePay. Earlier, I mentioned the interview training system that we built. I think that program really paid dividends when it came to the quality and quantity of people that we hired. We hired a lot of great engineers and built a great team. People would ask me why I stayed for six years, and I would say that I've worked on teams where people were strong technically but not nice, or where people were nice but not strong technically. It's rare to work on a team where people are both strong technically and nice, and we had that at WePay. I really treasured the team that we built and worked with. That was probably the thing I was most proud of.

When I joined, the engineering team had 16 people. When I left six years later, WePay as a whole had grown to over 400 people, with about 200 engineers on the team. This healthy growth rate, doubling year over year for the most part, was great. It wasn't crazy growth, which I think is healthy.

On Writing The "Missing README"

When I left LinkedIn, I had an idea to write down what I had learned during my time there. Initially, I wanted to focus on operational stuff because I had seen a lot of growth and change in that area. I thought that the tooling and systems built at LinkedIn for config management, topology management, deployment, and other areas were great.

Fast forward a few years, and I was managing the data infrastructure team. Through a series of organizational changes, I ended up inheriting what was called our core payments team. This team had around 20 engineers, many of whom were fairly entry-level. I found myself repeating the same advice to them in one-on-one meetings: technical advice, organizational advice, career growth advice, and more. It would be great, I thought, to have a handbook to distill this information and help these engineers integrate into the team and be productive.

In 2019, I sent out a tweet asking if I should write this handbook. The response was overwhelmingly positive, and Dmitriy Ryaboy messaged me saying he had similar thoughts and even had a list of ideas for the handbook. We started working on an outline and wrote a few chapters, starting with the testing chapter. We talked to publishers and eventually settled on No Starch Press. We spent most of the pandemic writing the book, which was a collaborative effort between Dmitriy and me. We had never actually met in person, but our previous experience with open source collaboration made the remote writing process feel natural. We worked on multiple chapters simultaneously and provided feedback and edits to each other.

The core thesis of the book is to bootstrap the knowledge that engineers need when they enter the workforce. The book covers topics like agile development, deployment, and postmortems, which are often not covered in traditional computer science curricula. We provide links to more reading for those who want to dive deeper into specific areas, like Site Reliability Engineering (SRE) or project management.

Nowadays, there are two different channels and publishing models to consider when releasing a book. The first channel is physical books, which can be sold either online or in a physical bookstore. The second channel is e-books, which are sold exclusively online. When we wrote our book, we wanted it to be available in paperback form, so that managers could keep a stack on their desk to hand to new hires. We also wanted it to be available in bookstores and as a potential gift for recent graduates, parents, and family members. To ensure this, we partnered with a publishing company that took physical book sales seriously. In our case, we went with No Starch Press, which had a distribution partnership with Penguin.

When deciding whether to go with a publisher or to publish independently, there are several factors to consider. Independent publishing offers more money per book sold, but requires more work in terms of promotion and distribution. Going with a publisher means more help with editing, book cover design, and promoting the book, but less money per book sold. If you have a large network or personal brand, independent publishing may be a good fit for you, but if not, going with a publisher can provide more support.

Finally, if you're interested in writing, I strongly recommend reading "On Writing Well" by William Zinsser, which provides valuable lessons in clear and concise writing.

On "The Future of Data Engineering"

I feel good about my predictions, but I now realize they were incomplete. In fact, a more recent tweet included these predictions as a subset. There are many trends happening in the data space, including real-time data warehouses and analytics, analytics engineering, reverse ETL, headless BI, data quality solutions, data lakehouses (which combine lakes and warehouses), DataOps, and more. While I believe my predictions in 2019 were accurate, they only represent a small subset of current trends. If I were giving this talk now, I would expand the scope to be more comprehensive.

On the topic of data mesh, a crucial concept is the idea of a data product, which involves grouping data logically to provide value to the business. This includes productizing data, treating data as assets, building customer-facing data products, and more. Achieving this goal requires several things, such as data quality checks, a real-time data warehouse, proper execution, data cataloging, and more. To me, data mesh as a philosophy encompasses many of the trends we're seeing in the industry, and we need these trends to mature to achieve the ultimate goal of a data product. Although we're still in the early stages of this concept, I believe it's important because it drives around 5 to 10 of these trends forward until we reach outcomes like a real-time data warehouse.

I'm excited about this because it combines different things that I experienced pain with at WePay. For instance, we wanted to do batch offline processing, real-time time series OLAP processing for risk analysis, and report generation for internal and external customers. However, these use cases required three different systems, such as Snowflake for the data warehouse, Apache Druid or Apache Pinot for dashboarding, and a third system, like a materialized view of the various aggregates in different buckets, for the low latency time-series analytics needed for machine learning.

Real-time data warehouses are now combining all three use cases and replacing three systems with just one. This concept unifies everything rather than having three different teams, operational footprints, and sets of data. The caveat is that it's still early days, and it's still fairly costly to run Elasticsearch, Apache Pinot, or Apache Druid compared to a Snowflake cluster. Nonetheless, I find the technology exciting and believe that it will continue to mature over the next 10 years.

On Angel Investing

I would like to preface this by saying that this is solely based on my personal experience, and I believe that a lot of what worked for me was simply a result of chance and opportunity. However, I think it is crucial, especially early in your career, to seek out and carefully choose who you work with. Building a strong network sets you up for long-term success. Many of the opportunities I had to invest, advise, and collaborate with others were a result of my network. For example, Anomalo's CEO (Elliot Shmukler), StarTree's CEO (Kishore Gopalakrishna), and Confluent's founders (Jay Kreps, Jun Rao, and Neha Narkhede) all used to work at LinkedIn, and we also hired people from WePay.

In terms of practices that can help you grow, writing and presenting have paid off for me. I enjoy the writing aspect, despite being the oddball in a family of writers with Ph.D.s in English and journalism degrees. Writing helps clarify my thoughts and build an audience, and it has also led to valuable connections. Additionally, there is now a wealth of great content out there, especially related to investing and advising. Y Combinator's blog, AngelList, and books like Jason Calacanis' "Angel" are just a few examples. So, my advice is to read, write, work with people you admire, and be patient. Remember that success takes time.

On Hiring Engineers

Hiring engineers for a startup can be tricky and depends heavily on the company, industry, and founders. However, it's important to not only hire great engineers but also ones who are a good fit for the startup. For example, an engineer fresh out of college who is deciding between working for Google or a startup may not like the tech stack of the latter. Startups and larger companies like Google are very different, and it's crucial to consider whether the engineer will be happy working at the startup.

When hiring for startups, technical skills are important, but they are not the only factor to consider. Culture fit is also crucial since early-stage engineers set the tone for the company. It's vital to find people who mesh well with the team and are excited to work on the product; if the product is back-end technical stuff, then finding people who are passionate about that matters too. Making the wrong hiring decision can be very damaging to the company, especially when there are only a handful of people involved.

On Navigating Open-Source Strategy

On the product strategy front, I'll give one specific example. I talk a lot with folks about open source, and I've really come around to the idea of what's called the BDFL form of open source management. This stands for "benevolent dictator for life," and it essentially means that someone or some group is the decider for the project's direction. This stands in contrast to the Apache philosophy, which is very democratic and slow-moving.

As a startup, especially in the open source space, it's really hard to navigate those kinds of community relationships while also starting a company and selling software around it. While it is possible to work with an Apache community and manage the code base, the velocity you get from the BDFL model is pretty nice. Companies like Prefect follow this model, where the company is responsible for driving the roadmap and direction for the open source project.

If you are starting an open source company, I would definitely go the BDFL route instead of the Apache route. However, you need to be very thoughtful about the licensing and the way that you govern the project to make sure that it doesn't hinder your business's ability to ship product.

On Adding Value To His Relationships

I believe that helping others is easier than most people think. When you realize that you can assist someone, it's often easier to just help them without trying to make it transactional or extract any value from it. Personally, if I recognize that I can help someone, I try to do so without expecting anything in return. It's actually less effort than trying to figure out how to extract value. In most cases, I do a lot of people routing--introducing people who I think would hit it off or helping people find jobs. I don't have a formal decision tree for how to help people; instead, I just chat with people and see how they're doing. If I recognize that someone is having an issue, I try to think of a way that I can help. The way that I can help has evolved over the years, from producing code to managing and teaching. These days, I realize that I can provide value by introducing people to each other. So, I enjoy helping, and I do it whenever an opportunity presents itself.

Unfortunately, networking can sometimes come across as sleazy or manipulative. This is often due to the misconception that networking requires an extroverted personality. However, as someone who is a huge introvert, I can attest that networking is not limited to social butterflies. You don't have to attend parties or do a lot of socializing to network effectively. You also don't have to resort to sleazy or manipulative tactics.

The key is simply to be a good person, help others when they need it, and maintain relationships by checking in with people and seeing how they're doing. That's really all there is to it.

Show Notes

  • (01:47) Chris reflected on his educational experience at Santa Clara University in the mid-2000s, where he also interned at NeoMagic and Intacct Corporation.

  • (07:31) Chris recalled valuable lessons from his first job as a software engineer at PayPal, researching new fraud prevention techniques.

  • (11:28) Chris shared the technical and operational challenges associated with his work at LinkedIn as a data scientist - scaling LinkedIn's Hadoop cluster, improving LinkedIn's "People You May Know" algorithm, and delivering the next generation of LinkedIn's "Who's Viewed My Profile" product.

  • (22:00) Chris provided criteria that his team relied on when choosing their big data solutions (which include Aster Data, Greenplum, and Hadoop).

  • (25:22) Chris gave advice to early-stage startups that want to start adopting best practices in observability and deployment.

  • (28:02) Chris expanded on his concept that models and microservices should be running on the same continuous delivery stack.

  • (30:52) Chris discussed his strategy to become a better interviewer - as he performed ~1,500 interviews at LinkedIn and WePay.

  • (37:39) Chris explained the motivation behind the creation of Apache Samza (LinkedIn's streaming system infrastructure built on top of Apache Kafka) and discussed its high-level design philosophy.

  • (46:19) Chris shared lessons learned from evangelizing Samza to the broader open-source community outside of LinkedIn.

  • (52:44) Chris talked about his decision to join the Data Infrastructure team at WePay as a principal software engineer after 7 years at LinkedIn.

  • (01:00:53) Chris shared the technical details behind the evolution of WePay's data infrastructure throughout his time there.

  • (01:12:40) Chris shared an insider perspective on the adoption of Apache Airflow from his experience as a Project Committee Member.

  • (01:20:15) Chris discussed the fundamental design principles that make Apache Kafka such a powerful technology.

  • (01:25:40) Chris reflected on his experience building out WePay's engineering team.

  • (01:27:14) Chris shared the story behind the writing journey of the "Missing README" - which he co-authored with Dmitriy Ryaboy.

  • (01:38:16) Chris revisited his predictions in a 2019 post called "The Future of Data Engineering" and discussed key trends such as real-time data warehouses and data mesh.

  • (01:44:27) Chris gave advice to a smart, driven engineer who wants to explore angel investing - given his experience as a strategic investor and advisor for startups in the data space since 2015.

  • (01:48:17) Chris shared advice on hiring engineers and navigating open-source product strategies for companies he invested in.

  • (01:53:57) Chris reflected on his consistency in adding value to the relationships he has formed over the years.

  • (01:58:00) Closing segment.

Chris's Contact Info

Mentioned Content

Blog Posts

People

Books

Notes

My conversation with Chris was recorded back in May 2022. Earlier this year, Chris released Recap, a dead simple data catalog for engineers, written in Python. Recap makes it easy for engineers to build infrastructure and tools that need metadata. Check out his blog post and get started with Recap's documentation!

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts, or click one of the links below:

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.