Datacast Episode 119: Experimentation Culture, Immutable Data Warehouse, The Data Collaboration Problem, and The Rise of Data Contracts with Chad Sanderson

The 119th episode of Datacast is my conversation with Chad Sanderson, who was previously the Product Lead for Convoy’s Data Platform team. He has built everything from feature stores and experimentation platforms to metrics layers, streaming platforms, analytics tools, data discovery systems, and workflow development platforms.

Our wide-ranging conversation touches on his early career as a freelance journalist; his entrance to data analytics by way of Conversion Rate Optimization; his experience establishing experimentation cultures at Subway, SEPHORA, and Microsoft; his time as the Head of Data Platform at Convoy; the existential threat of data quality; the death of data modeling; the rise of the knowledge layer; lessons learned championing internal products; and much more.

Please enjoy my conversation with Chad!

Listen to the show on (1) Spotify, (2) Google, (3) Stitcher, (4) RadioPublic, and (5) iHeartRadio

Key Takeaways

Here are the highlights from my conversation with Chad:

On His Early Journalism Career

My dad was a literature teacher, specializing in American literature. As such, I grew up reading a lot of books and writing quite a bit. My initial career goal was actually to be an editor or creative writer, either for a magazine or as a freelance novelist.

In college, I took a course about potential writing opportunities. I quickly realized that anything creative around writing would be quite challenging. So, I pivoted towards journalism, and my first job out of college was actually a journalism role in Thailand.

I bought a $500 plane ticket, flew over there, and stayed in a rundown apartment. I would go to a stadium every weekend and cover the martial arts events there. I would write about them in English, and various magazines would pick the stories up. I also published the articles on my website.

I had many super interesting experiences there. I learned how to hustle and focus for long periods of time on one thing. I would write one or two stories every other day, which is a tremendous amount of volume.

It also made me adept at asking good questions. As a journalist, I had to get to the heart of the story - finding both the truth and a narrative. When writing, I couldn't simply report the facts exactly as they were. I had to tie the information cohesively into something readers could take away.

This was a formative part of my early journey and eventually led me to build those types of narratives within the data space.

On The Benefits of Writing

I believe that writing is one of the most valuable skills for anyone in a technical field. Whether you work in data analysis, software engineering, design, or even non-technical roles like product management or marketing, writing allows you to communicate your ideas, philosophy, and hypotheses with clarity and precision.

Being a good writer is especially important if you want to lead a team or convince a business to invest in impactful projects. Writing has also helped me ask the right questions and create a framework for addressing complex problems.

As a product manager, the best way to solve problems is to talk to people and ask questions. The "five whys" methodology is one approach, but it's also about identifying patterns between multiple people and connecting them into a compelling narrative. Writing is a powerful tool for telling that story.

These two skills, which I was able to refine during my time as a journalist, have helped me the most. As I moved into the data field, I applied them to product development instead of martial arts stories.

On His Entry To The Data World

Conversion rate optimization was my first role in data and analytics, and it was pretty interesting. GrillGrate, the company I worked for, produced a metal surface that sits on top of a gas grill and heats food evenly. It was a great invention I still use at home for cooking steak and fish.

However, the company didn't invest in data and had a basic website with no knowledge of customer behavior. That's why I joined the company to set up their early analytics, analyze their digital properties, and make suggestions on where to improve conversion rates.

Conversion is the process of a customer entering into a procurement funnel, such as buying a product, adding an item to their cart, or downloading an app. Conversion rate optimization specialists focus on different areas that can affect conversion rates, including copywriting, user testing, and experimentation.

Initially, I focused on all three areas, using analytics as my entry point. Later, I decided to focus explicitly on experiments and experiment data.

On Diving Into Experimentation

The reason I chose experimentation was to move quickly to a senior position within the technology industry. Coming from a journalism background, I didn't want to spend a long time at the entry level. So, I found a technical niche to specialize in, which kept the pool of specialist applicants for that particular niche at any company quite small. Thus, the amount of competition I would face when applying to a well-known tech brand like Microsoft went down substantially.

Analytics and general CRO (conversion rate optimization) were my entry point to that, and experimentation is a subset of CRO. I would say that experimentation is one of the more technical and statistics-heavy aspects of conversion rate optimization, requiring knowledge of A/B testing, how to construct a hypothesis, how to derive statistical values like t-statistics and confidence intervals, and how to run experiments at scale.
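To make the statistics mentioned here concrete, below is a minimal, hypothetical sketch (not taken from the conversation) of how a two-sample t-test and a confidence interval might be computed for a simple A/B test in Python, assuming simulated conversion data for a control and a treatment group:

```python
# Hypothetical A/B test analysis: Welch's t-test plus a rough 95% confidence
# interval for the difference in conversion rates between two groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated per-user conversions (1 = converted, 0 = did not convert).
control = rng.binomial(1, 0.10, size=5000)    # assumed 10% baseline rate
treatment = rng.binomial(1, 0.12, size=5000)  # assumed 12% variant rate

# Two-sample t-test (Welch's variant, unequal variances).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# 95% confidence interval for the lift, using a normal approximation.
lift = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
ci_low, ci_high = lift - 1.96 * se, lift + 1.96 * se

print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
print(f"lift: {lift:.4f}, 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```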

Specializing in that niche at GrillGrate helped me land a job at Oracle to work on their marketing products. Oracle has a marketing cloud, and within that marketing cloud, they have experimentation software called Maxymiser. As a consultant or customer success associate, my job was to help clients design and run experiments properly, analyze results, and provide insights.

I spent a lot of time on YouTube, taking courses and watching videos to learn the basics, like how statistics work, what hypothesis testing is, how to set up an experiment, and what control and treatment groups are. Whenever I encountered an unfamiliar term, I would look it up, write it down, and review it daily.

At Oracle, I started using R for the first time to understand how statistical models and frameworks work and to generate statistically valid results. This allowed me to develop a relatively comprehensive understanding of how statistics worked and what sort of data, models, and frameworks could be trusted. For someone coming from a non-technical background, this knowledge was invaluable.

On Establishing An Experimentation Culture

While at Oracle, I was based in New York City, and Subway's corporate headquarters was located in Milford, Connecticut, a small town. They were in the process of a digital transformation, which involved Accenture consultants helping them move their data to the cloud, rebuild their engineering workflows, and migrate to modern technologies.

As part of this transition, I was brought in to run an experimentation program. I set up the platform and infrastructure to facilitate experimentation on a larger scale, developed best practices for analytics, and instrumented their websites and mobile applications. We started running experiments on our team with the assumption that demonstrating the value of experimentation to Subway's marketing teams and executives would generate excitement and potentially gain more resources to expand the program, which ultimately happened.

The differences I observed across all the companies I worked for were their maturity in data, experimentation, and data science; the investment in third-party SaaS software and internal applications built by software engineering teams; and the maturity around running the business, specifically the software engineering and product development teams.

  1. Subway, for example, was not known for its mobile applications and was primarily driving revenue through its franchises. Hence, the focus was on expanding franchises, onboarding more franchisees, and testing new sandwiches.

  2. SEPHORA, on the other hand, had already invested in technology, integrated personalization into their systems, and had a comprehensive workflow around experimentation.

  3. At Microsoft, there was an internal team called the EXP team, which had built a product specifically for Microsoft's use cases and did not rely on external SaaS applications for experimentation. In this role, I was more of a product manager, working with a software engineering team to investigate customer problems, develop a roadmap, and write documents to describe what we were working on and why.

Overall, the differences in the companies' maturity and technical team management made for a unique experience at each company.

On Joining Convoy as the Head of Data

Convoy is a digital freight marketplace that sits between a shipper, like Walmart or Starbucks, who needs to move freight from one facility to another, and a carrier, which is a business that owns trucks, ranging from a single truck to hundreds.

Carriers use our application to bid on freight since we have an auction-based model. Once they are awarded the freight, we give them all the necessary details to pick up and deliver it to the facility. We track the truck en route to dropping off the freight so we can report things like the ETA back to the shipper.

Thanks to economies of scale, we have so many shipments in our marketplace at any given time that we create a robust and healthy ecosystem for our truckers, who are always looking to move freight. Additionally, we assure the shipper that we can get their freight moved even if the load is difficult to fulfill or is going to a destination that many carriers wouldn't take.

Data is extraordinarily important to Convoy's business model. Although there are businesses similar to what Convoy does, called freight brokers, they are very manual and ad hoc companies. They get called on the phone by a shipper, go through a black book of all the carriers they know, call the carrier on the phone, and ask if they can take the job. If the carrier cancels, they have to call another carrier. There's not much that differentiates us from a typical freight broker, except data.

Data enables us to price shipments in our marketplace, report on ETAs, and determine whether the price we offer in our marketplace is margin positive for the company or not. Essentially, data facilitates Convoy's entire business model. Ensuring that the data is high-quality, accessible for analytics and machine learning, and discoverable is a big responsibility.

On The Evolution Of Convoy's Data Platform

The biggest evolution in Convoy has been the focus on investing in internal products. When I joined the company, my main responsibility was to run their experimentation platform team, as they wanted to build an internal A/B testing product. Having experience doing that at Microsoft, which is also in Seattle, I was able to bring that expertise to Convoy.

However, over time, I took on ownership of the product side of our machine learning platforms and our big data platform team as well. To approach my work, I used the same process I used as a journalist: asking questions about the problems people had and going deep into root-causing those problems. From that, we identified a set of product gaps, and we began iteratively filling those gaps over time.

We use a typical modern data stack, including AWS, Fivetran, and a CDC system that pushes data from our production databases into Snowflake. We have a lot of tools on Snowflake, including dbt for data transformation, Amundsen for data discovery, and Airflow for orchestration.

But the big change has been the products we've built on top of those third-party applications and open-source systems. We built our experimentation platform from the ground up, as well as a metrics repository and a feature store. We also built a custom instrumentation system that allows our engineering partners to capture and emit data directly from their services. Additionally, we created an events catalog that sits on top of that event data and allows people to search for the data they need and request more data. These are the biggest additions to our infrastructure.

On Solving Data Discovery with Amundsen

Source: https://convoy.com/blog/integrating-slack-with-amundsen-for-ease-of-data-discovery/

Amundsen is an open-source data discovery tool. Our team is in the process of transitioning to the hosted version, Stemma. Data discovery creates a metadata layer on top of your data warehouse, allowing you to search for datasets, see who owns them, and preview what the data looks like and what columns are included in any particular table.

This is particularly important for Convoy, as we have a vast amount of data with hundreds and thousands of tables, some of which are source tables capturing raw event data, while others are downstream tables that have been processed and transformed many times. Without a catalog, navigating the warehouse can be confusing for data analysts, data scientists, or business intelligence engineers. Amundsen simplifies the discovery, navigation, and cataloging process, making it easy to integrate and use. The majority of our data team uses it regularly.

We chose Amundsen over other data catalogs because it was an open-source project that solved the specific, well-scoped problem we cared about. It was easy to test and had a user-friendly interface. Its focus on search aligned with our goal of providing a guided workflow rather than exposing all data assets and letting users stumble through them. Since then, many people have used the tool.

On Experimentation Challenges at Convoy

The most important takeaway from Data Council that I want to emphasize is that setting up your own experimentation system is quite challenging. It's often not enough to rely solely on an external vendor.

The pattern in the experimentation industry for some time now is that there aren't many vendors that go beyond simple experimentation. The vendors that do exist focus more on the customer data platform (CDP) level. This is front-end experimentation, where you can easily change the copy on a website or potentially remove or swap images around. It's essentially an integration with the DOM layer: you have a WYSIWYG editor that you can use to move things around. This is the industry standard.

However, if you look at companies that have been doing experimentation for a long time, like Google, Microsoft, Twitter, Airbnb, and Uber, the way they experiment is by focusing on the entities that are most meaningful for their business, not just the ones that are present in the customer data platform, like customers and sessions. In Convoy's case, we care about things like shipments, shippers, lanes, geographies, contracts, RFPs, and facilities.

None of those things, or at least a very large percentage of those entities, can be captured through a pure front-end layer. You have to access the data directly from the warehouse, or you have to capture this data from production tables and pipe it into the warehouse as a source table or a dimensional table. Then your experimentation layer has to sit on top of that data and read it.

You also have to randomize assignments, which means returning one version of your experiment experience to a certain percentage of that entity type and the other version to the rest. So, you need to think about that.
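A common way to randomize assignment for backend entities (a standard industry technique, not something the episode attributes to Convoy specifically) is deterministic hashing: hash the entity ID together with an experiment name and bucket the result into control or treatment. A minimal sketch, with hypothetical entity and experiment names:

```python
# Deterministic, hash-based assignment of a backend entity (e.g., a shipment)
# to an experiment variant. Names are illustrative, not Convoy's actual system.
import hashlib

def assign_variant(entity_id: str, experiment: str, treatment_pct: float = 0.5) -> str:
    """Bucket an entity into 'control' or 'treatment', stably across calls."""
    digest = hashlib.sha256(f"{experiment}:{entity_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 16**8  # map the first 8 hex chars to [0, 1)
    return "treatment" if bucket < treatment_pct else "control"

# Example: a shipment-level experiment with a 50/50 split.
print(assign_variant("shipment-12345", "dynamic-pricing-v2"))
```

Because the hash is a pure function of the experiment name and entity ID, the same shipment always lands in the same variant without any shared state.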

Instead of just buying an experimentation tool, you should list out all your use cases. What is most valuable to run experiments on? What are the entities that are most critical to your business? What has the highest ROI? What are the metrics that you need to analyze? Do you just need to analyze metrics on the front end, like being able to count the number of people who click the add-to-cart button, or do you need to report on more backend warehouse-centric metrics like margin or profit, which you have to derive by combining many different data sources together?

That was the core of my talk. Really think about the state of your business instead of just buying a tool without considering what you need and what's most valuable.

On Change Data Capture and Building Chassis

CDC stands for change data capture. It is essentially a mechanism for identifying when row-level changes occur in a production database, such as Postgres. At Convoy, we use CDC to push data into our warehouse whenever row-level changes occur. There are batch ways of doing this, such as using Fivetran a few times a day or on demand, but we use a streaming-centric approach using a tool called Debezium. Anytime a row changes, we stream that change directly into the warehouse and update the corresponding row in the raw table.
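For readers unfamiliar with what this looks like in practice, here is a rough sketch (topic, table, and key names are assumptions, not details from the episode) of consuming Debezium-style change events from Kafka and keeping an up-to-date view of the rows:

```python
# Sketch: consume Debezium change events from Kafka and maintain the latest
# state of each row, keyed by primary key. Topic and field names are
# illustrative assumptions, not Convoy's actual configuration.
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "pg.public.shipments",               # assumed Debezium topic for one table
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

current_rows = {}  # primary key -> latest row state

for message in consumer:
    if message.value is None:            # skip tombstone records
        continue
    payload = message.value.get("payload", {})
    op = payload.get("op")               # 'c' create, 'u' update, 'd' delete, 'r' snapshot read
    if op in ("c", "u", "r"):
        row = payload["after"]
        current_rows[row["id"]] = row    # upsert the newest version of the row
    elif op == "d":
        current_rows.pop(payload["before"]["id"], None)
```

In a real pipeline, the upserts would land in the warehouse (for example via a MERGE into the raw table) rather than in an in-memory dictionary.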

CDC is important, especially if you need real-time information, because you don't want to wait for days or hours to see a change in the data. However, there are some problems with CDC. One of them is that it captures everything in a production table, regardless of whether that data is valuable for analytics or not. Additionally, production tables often contain data that is irrelevant to the analytics or machine learning team, or generated in a complex, non-straightforward way. This makes it harder to build clear, business-context-rich tables from this data without going back to the source and talking to an engineer about how the databases were implemented.

At Convoy, we decoupled our services from the data used for analytics using something called semantic events. A semantic event is a real-world behavior captured as a schema, such as a shipment being canceled or an RFP being completed. We captured the event and its associated attributes, including primary keys, foreign keys, and interesting attributes about the event itself, and we did it in real-time using Kafka. We had two pieces of software that enabled this: the Unified Events Definition Framework (UEDF) and Chassis. UEDF was an SDK that handled all the boilerplate code, and Chassis cataloged all the events and contracts for the engineering team.
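UEDF and Chassis are internal Convoy tools, so their interfaces aren't public. Purely as an illustration of the idea, a semantic event like a shipment being canceled might be modeled as a small typed schema along these lines (all field names here are hypothetical):

```python
# Hypothetical shape of a semantic event: a real-world behavior captured as a
# schema, carrying primary keys, foreign keys, and attributes about the event.
# This is an illustrative sketch, not the actual UEDF/Chassis interface.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json
import uuid

@dataclass(frozen=True)
class ShipmentCancelled:
    event_id: str        # unique ID for this event instance
    shipment_id: str     # primary key of the shipment entity
    shipper_id: str      # foreign key to the shipper entity
    cancelled_at: str    # ISO-8601 timestamp of the real-world behavior
    reason: str          # attribute describing why the shipment was cancelled

event = ShipmentCancelled(
    event_id=str(uuid.uuid4()),
    shipment_id="shipment-12345",
    shipper_id="shipper-678",
    cancelled_at=datetime.now(timezone.utc).isoformat(),
    reason="shipper_requested",
)

# In practice, the serialized payload would be published to a Kafka topic.
print(json.dumps(asdict(event), indent=2))
```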

Convoy is more mature than other companies in this space because we focus on events in the backend. Other tools like Segment, Mixpanel, Amplitude, and Heap focus on events in the front end. However, service data is the source of truth, and not all existing tools capture CRUD updates, which are important for knowing the current status of an entity. We believe this method will become a common pattern in the future, but it may take three to five years for people to start applying it.

On The Existential Threat of Data Quality

A lot of these problems are tied to issues with CDC. When we investigated data quality problems at Convoy, we found that they fell into one of two buckets: upstream or downstream problems.

Upstream problems occur when you have production databases that change, and because there's no contract between the producer and the consumer, the consumer (which is the warehouse) can break without the producer's knowledge or concern. This significant problem affects machine learning models, training data sets, dashboards, and reports. It happens quite frequently.

The other problem is downstream. The downstream problem occurs because many businesses have invested in ELT (loading data from various sources, dumping it in the warehouse, and doing all the transformations downstream). If you're living in that world and not going back to the producer to ensure you're generating the correct data you need, data developers and analysts inevitably spend an inordinate amount of time writing SQL against this mess of spaghetti code in the warehouse.

Unfortunately, data people are not trained to write scalable, maintainable, well-documented, and robust code like software engineers. As a result, you get an enormous amount of poorly written queries that capture critical business concepts, which other teams take dependencies on. This data will quickly become outdated and may not represent the current state of the business. It will break and cause significant pipeline issues. Then the upstream problem and the downstream problem converge: production tables break and take pipelines down with them, and downstream tables break and take pipelines down as well.

The modern data stack is built on a fundamentally broken foundation of data. This is one of the primary reasons why it's difficult to scale a data warehouse and why many teams find themselves frustrated working with data. It always seems to be on fire, and they never have enough people to fix it.

On "Immutable Data Warehouse"

I believe the modern data warehouse is broken for several reasons. First, queries are often written in an unscalable way, and upstream producers generate data in an unscalable way. Additionally, we don't talk enough about the semantics of the data or how it's tied back to real-world entities or events. As a result, data personnel can spend weeks talking to people to understand how the business works before they can write queries against the warehouse. This is not scalable and not what the data warehouse was intended to be.

The immutable data warehouse is an attempt to pivot back to the original definition of the data warehouse. The original data warehouse was intended to reflect the real world and how businesses work. The entities that exist in a company and how they interact with each other, the relationships between them, and the real-world behavior that causes those entities to interact with each other can all be used to derive metrics like margin, profit, volume, and growth.

The immutable data warehouse is a way to resolve many of the problems with the original data warehouse. It follows a model of implementing semantic events in the services, documenting those semantic events in a catalog, and using this catalog of clean, high-quality source data to construct data domains or data marts where you have data products. These data products are owned by a single team that has a one-to-one mapping with the service that's generating that data.

Source: https://www.montecarlodata.com/blog-is-the-modern-data-warehouse-broken/

If anyone needs data about shipments, for example, they know exactly where to go. If the data they need doesn't exist, there's a request-based workflow where they can ask for the data they need, and the engineer can implement that data within their service. This is a way to evolve the data warehouse in a healthy, controlled, and iterative fashion.

Most data folks can tell that something is not right in data warehousing today. The data environment is often messy, and there's no clear ownership. Data evolves in a broken, chaotic way, which doesn't feel right or clean. In the world of product development, we have a solid framework for managing changes to any application, and my perspective is that data development needs to follow a very similar model of requests. Engineers implement those requests as APIs in production, and we treat data as a product once it's in the warehouse with clear domain owners.

On The Death of Data Modeling

If you're not familiar, data modeling is the practice of finding relationships between data sets. It involves identifying how certain data sets are connected to each other, which in turn is an abstraction of how real-world entities and business concepts are connected.

There are two places where you can conduct a data modeling exercise. You can do data modeling during the design phase, which is before any data is published. At this stage, you create an entity relationship diagram (ERD), which specifies the entities, primary keys, foreign keys, relationships, and other properties.

Then there is the physical model, which is how the data is actually modeled as it manifests in the warehouse. This includes all the modeling that needs to happen when you combine various source tables and raw tables to form a cohesive layer that anyone in the company can access for analytics.

Data modeling has died because the design phase for data has gone out the window. This is largely due to the shift to Extract, Load, Transform (ELT), where we dump a lot of data from production systems into a lake or a lakehouse environment. The preparation work that was previously done to meticulously think about how we need to collect data, and in what form, has basically gone away. This leaves only physical modeling in the warehouse and in the lake.

The challenge with this is that if you don't have a way to connect those data models to some semantic context, it's not clear why any particular data set should be joined together. It's not clear what that data set even represents. And this frequently requires a data professional to deep dive into a lot of the modeling decisions that were made in the warehouse.

Fundamentally, people can do a few things to improve data modeling. The first thing to recognize is that data modeling often follows a lot of technical frameworks: star schema, data marts, data vault, and other methodologies for how to think about data modeling. At its core, one of the reasons people don't do data modeling these days is that the design phase takes a very long time.

For the next phase of data modeling, it needs to be collaborative, fast, version controlled, and connected to business value. It must be low friction, meaning the cost to start doing data modeling, whether it's the cost to educate yourself or the technical cost, needs to be low. If it's still a long process, takes forever to educate people, or is very technically complex, it will probably always be dead.

On The Knowledge Layer and Data Contracts

The knowledge layer acts as an abstraction between the real world and the production code. It describes, in English or any language of choice, how the company works, the important domains, and the data attributes generated within each domain. It also outlines the various steps in the life cycle of each domain. For example, the life cycle of a payout includes when it is generated, canceled, and validated after it's paid. Similarly, the life cycle of a shipment involves generation by a shipper, unlocking and placing into a marketplace, bidding by carriers, picking up the shipment, delivery to a facility, validation, and confirmation by the shipper team.
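As a tiny illustration of what a knowledge-layer description of a life cycle might boil down to (the step names below paraphrase the shipment life cycle described above and are not an actual Convoy artifact):

```python
# Illustrative encoding of the shipment life cycle as an ordered set of named
# steps. A real knowledge layer would also attach plain-language descriptions,
# owning teams, and the semantic events that mark each step.
from enum import Enum, auto

class ShipmentLifecycle(Enum):
    GENERATED_BY_SHIPPER = auto()
    PLACED_IN_MARKETPLACE = auto()
    BID_ON_BY_CARRIERS = auto()
    PICKED_UP = auto()
    DELIVERED_TO_FACILITY = auto()
    VALIDATED = auto()
    CONFIRMED_BY_SHIPPER = auto()

for step in ShipmentLifecycle:
    print(step.value, step.name)
```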

In many companies, there is no layer that captures the way the business operates. Without this layer, real-world behavior has to be inferred from data represented in tables and columns, which is not an appropriate way to understand how the business operates. However, having a knowledge layer can make discovering and leveraging data much simpler. It brings many different stakeholders into the data generation and usage process. They can better understand the data model by starting from the knowledge layer and working down to the data catalog. The metadata generated by the knowledge layer is then inherited by the data that maps to the various steps in those life cycles.

I haven't seen any tools that tackle this problem, but data contracts can help with it. Tools don't capture this well because there's no incentive to add business-level metadata to a central platform when manipulating warehouse data. There's no personal benefit to you, and no team has end-to-end insight into a single life cycle. Many teams participate in the process, so the cost is high.

However, with an events-oriented architecture and data contracts, each team must describe events using semantic terms. They describe what is happening, when the event fires, and what it represents in the real world. Without that information, engineers won't know what to implement. The event should also be tied back to a central entity like a shipment or a carrier.

As teams request these contracts, our goal is to capture metadata and build relationships between entities and events, which becomes the knowledge layer. There's an incentive to generate the knowledge layer because you get data for analytics or machine learning. The only way to get the data is to provide enough semantic information for the engineer to know what to produce.
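To make this concrete, here is one hypothetical way a data contract for a semantic event could be written down, combining a schema with the semantic and ownership metadata described above (the structure and field names are assumptions, not a published Convoy format):

```python
# Hypothetical data contract for a semantic event: the schema plus the
# semantic description, owning team, and related entity that feed the
# knowledge layer. The structure is illustrative only.
from dataclasses import dataclass, field

@dataclass
class DataContract:
    event_name: str
    description: str   # what is happening in the real world
    fires_when: str    # when the event is emitted
    entity: str        # the central entity the event ties back to
    owner: str         # the producing team responsible for the contract
    schema: dict = field(default_factory=dict)  # field name -> type

shipment_cancelled_contract = DataContract(
    event_name="shipment_cancelled",
    description="A shipper or Convoy cancels a shipment before delivery.",
    fires_when="The cancellation is committed in the shipment service.",
    entity="shipment",
    owner="shipment-service-team",
    schema={
        "shipment_id": "string",
        "shipper_id": "string",
        "cancelled_at": "timestamp",
        "reason": "string",
    },
)
```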

On The Data Collaboration Problem

The core issue with collaboration is that there are many stakeholders involved in generating data. Product managers have a need for data to conduct analysis on their features. Data developers either instrument events in front-end systems or build tables and data marts in the data warehouse. Software engineers produce first-party data that lives in production tables and is brought into the warehouse via CDC or some other ELT mechanism. Unfortunately, there is no good form of communication and collaboration between these three stakeholders.

They are all just doing what they want. The people who build the core tables in the warehouse are often not the same people who talk to the product teams about developing data for their business use cases. The folks who receive requests from product teams are doing so in vertical silos. For example, the payments team may be developing data sets for payments, but they are not talking to the teams that would be interested in consuming those data sets. This results in gaps between the data we have and the data we actually need to answer questions.

Furthermore, software engineers are generating data that is fundamentally critical to powering machine learning models, analytics, and experiments, but they are not talking to anyone about it. The lack of collaboration is one of the most important problems that can and should be solved in the next few years. How do you bring these stakeholders into a single workflow and get everyone on the same page? People need to speak the same language when it comes to data. Once that happens, people can determine the right workflow.

For example, if a product manager has asked for a particular metric, maybe a data developer on their team should bring together tables to generate that data set. Maybe a data developer on another team should generate tables that the asking team can depend on. Maybe a software engineer should produce events or generate additional attributes that the data developers can put together. However, without that shared language and shared ecosystem, the silos will result in an increasing lack of communication and growing tech debt.

In contrast to big tech companies, which inherit a lot of their development modalities and have a certain type of business model, Convoy works with data from the real world. The data comes from shipments they do not control, customers they do not necessarily control, and facilities they do not control. Emitting all this data is challenging, and it is even more challenging to understand the life cycle of the shipment. A shipment can take a linear path or a nonlinear path to completion, and it can be canceled and reopened. There might be issues with the shipment and the contract.

Therefore, the relationships between entities in a business model that marries the technical world and the offline world actually have a much deeper need for collaboration because things can change at any moment, and they have to be iterated upon flexibly and based on semantics. In contrast, Google has a very constrained environment that is tightly controlled by its applications, and there are very predictable ways that the application will be used.

On Customer Centricity for Data Infrastructure Teams

One of the biggest pieces of advice I can give is regularly engaging with your customers. Even if we don't always think of data consumers within our company as customers, that's exactly what they are. They use the products and services that we build and provide. So, talking to data scientists, analysts, and non-technical stakeholders on a weekly basis can be incredibly valuable.

If you're not engaging with them every week, you're probably not doing it enough. Requirements, needs, problems, and pain points can change very quickly. Data teams are often hyper-reactive and focused on responding to problems through a ticketing system. They can get so bogged down with the sheer number of tickets that they never really have time to innovate or solve underlying issues. At Convoy, we try to avoid this by taking innovative approaches to our customers' problems.

Sometimes, this means saying no to spending weeks solving pre-existing data problems or building specific pipelines. For example, we might focus on making pipeline development easier for everyone, which could save two or three days of development and implementation time. This can be a self-service solution and provide significantly more value to the business than just answering one-off tickets.

Customer-centricity doesn't mean doing everything your customers ask for. It means focusing on the problems your customers have and finding holistic solutions to solve them. Then, prioritize those solutions and bring the rest of the business on board with your ideas.

On High-Quality Data UX

I believe that the first step is to talk to your customers and understand their perspectives on the stages of their data experience. While I can tell you what the components are, I think a mental model based on your customers' needs would be ideal.

To begin with, there is a data definition phase. Next, there is the connection to source data. After that, there is a design phase which is distinct from the definition phase. Design involves planning, collaborating, and testing configurations for your data assets.

The deployment phase involves moving from definition to shipping your data assets into production environments, whether a warehouse or a service. Monitoring is also important to maintain high-quality data and to diagnose issues when they occur.

There is also a discovery stage, which involves finding the data required to compose these data assets. Additionally, communication is essential for finding the right person to ask for more details or business context.

At Convoy, we laid out all of these components and assessed whether we had a good workflow in place for each data asset. We examined tables, metrics, machine learning features, experiments, models, and events. While some of the tool sets may overlap, some may be distinct.

I think this is a valuable exercise to start thinking deeply about data UX.

On Evaluating Investments In The Data Space

When evaluating a business idea, I focus on two main criteria. Firstly, is it solving a real customer problem? Is it a "vitamin" or a "pain pill"? In other words, is it a nice-to-have or a must-have? If it's the latter, that's a good sign. Secondly, how well does the team understand their customer? This includes both the user and the buyer.

For example, suppose someone pitches me an amazing new system that will revolutionize pipeline development. In that case, I'll ask whether their customer will love it so much that they'll rip out everything they've built over the last few years and use the new system exclusively. Most people won't do that, so it's crucial to understand the customer's pain point and whether it's deep enough for them to take a risk on a relatively new company. I believe that having a great team is important, but I mainly focus on the problem space.

Companies like Snowflake and Databricks have shown that there is huge potential for data investments, as companies are using more and more data and doing more and more machine learning, artificial intelligence, and analytics. However, it's important to avoid falling in love with a technical idea and instead focus on a solution that truly solves a real customer problem at scale. A simple idea with a great user experience can be more valuable than an incredibly technical product no one needs.

On Internal Product Management

Having technical conversations with our customers has been awesome. It allows us to see things from their perspective, understand the customer journey of multiple stakeholders, and tie workflows together.

On the other hand, having to sell our efforts to the leadership team can be challenging. However, it has taught us to advocate for what we need to build, even if it's hard to measure. This has helped us become better at communicating abstract value.

In summary, these two experiences have been valuable: having conversations with technical customers and communicating abstract value propositions to our leadership team.

Show Notes

Notes

My conversation with Chad was recorded back in July 2022. Since then, I'd recommend looking at:

About the show

Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. For inquiries about sponsoring the podcast, email khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts.

If you’re new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.