What I Learned From The Modern Data Stack Conference 2021

Back in September 2021, I attended the second annual Modern Data Stack Conference, Fivetran’s community-focused event that brings together hundreds of data analysts, data engineers, and data leaders to share the impact and experiences of next-generation analytics. The presenters shared the transformations they experienced with their analytics teams, the new insights and tooling they enabled, and the best practices they employ to drive insights across their organizations.

In this long-form blog recap, I will dissect content from 14 sessions that I found most useful from the conference. These talks are broken down into 4 categories tailored to 4 personas: data engineers, data analysts, product managers, and data team leads. Let’s dive in!

Data Engineer

1 - Your Next Data Warehouse is not a Data Warehouse. Discover Lakehouse

Today, most enterprises struggle with data and AI projects. The challenge starts with the architecture. You need to build out four different stacks to handle your different data workloads. They are different technologies that generally don’t work well together. The overall ecosystem further hampers this architectural complexity. There are many different tools to power each of these different architectures (data warehousing, data engineering, streaming, and data science/ML). Traditionally in data warehousing, you deal with proprietary data formats. Suppose you want to enable advanced use cases. In that case, you have to move the data across different stacks, which becomes very expensive, resource-intensive, and hard to manage from a governance perspective.

Your data teams feel all these complexities. Because these systems are siloed, your teams end up becoming siloed as well. Communication slows down, hindering innovation and speed. Teams end up with different versions of the truth because they have to keep copying the data all over the place. The result is multiple copies, no consistent security/governance model, closed systems, and disconnected/less productive data teams.

The core problem is the technology these stacks have been built upon. Most enterprises have grown up with data lakes and data warehouses for different purposes.

  1. Data lakes do a great job of supporting ML. They have open formats with a big ecosystem and enable distributed training over a large swath of data. You can also process any type of data (structured or unstructured). However, they have poor support for BI. They haven’t had the performance needed and suffer from data quality problems. Fundamentally with data lakes, you are working at the file level rather than a more logical level.

  2. Data warehouses are great at summing up tabular data for BI reports. However, they have limited support for ML models. They are typically proprietary systems and only have a SQL interface.

Jason Pohl argued that unifying the two systems above can be transformational in how we think about data. Delta Lake is a place where you can centralize your data after ingestion. It provides the reliability, performance, and governance you expect from a data warehouse. It also provides openness, scalability, and flexibility from a data lake. With Delta Lake, you deal with a more logical construct of tables.
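
As a rough illustration of that table-level abstraction, here is a minimal PySpark sketch that writes and reads a Delta table. It assumes a Spark environment with the delta-spark package configured; the path and column names are placeholders, not anything from the talk.

```python
# Minimal sketch: writing and reading a Delta table with PySpark.
# Assumes a Spark environment with the delta-spark package installed;
# the path and columns are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Ingest raw events and store them as a Delta table: a logical table
# (ACID writes, schema enforcement, time travel), not just a pile of files.
events = spark.createDataFrame(
    [(1, "signup"), (2, "purchase")], ["user_id", "event_type"]
)
events.write.format("delta").mode("overwrite").save("/tmp/lake/events")

# A BI-style query runs against the same copy of the data, with no export
# to a separate warehouse.
spark.read.format("delta").load("/tmp/lake/events") \
    .groupBy("event_type").count().show()
```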

The Databricks Lakehouse Platform is unique in 3 different ways:

  1. Simple: It unifies your data, analytics, and AI on one common platform for all data use cases.

  2. Open: It unifies your data ecosystem with open source, standards, and formats. It is built on the innovation of some of the most successful open-source data projects in the world (Spark, Delta Lake, mlflow, Koalas, redash).

  3. Collaborative: It unifies your data teams (data analysts, data engineers, data scientists) to collaborate across the entire data and AI workflow.

Here are the key components of the Databricks Lakehouse Platform:

  1. Databricks SQL is geared towards data analysts, letting them explore their data and build dashboards. It provides the best price/performance compared to other cloud data warehouses, along with simplified administration and governance. If you want to connect your favorite BI tools, you can do so through a scalable SQL endpoint.

  2. Databricks ML has two new components this year (besides existing products like Notebooks and MLflow). AutoML is a transparent way to generate baseline models for the discovery and iteration of ML models; users can quickly modify parameters in the search space as they build their models without writing boilerplate code. The Feature Store is a place to centrally discover and manage features and make them available for training and serving. It serves both online and offline features, enabling you to go from prototype to production much faster.

2 - Why Your Warehouse Should Be Your CDP?

A Customer Data Platform (CDP) is a database designed to hold all your customer information. More specifically, it is purposefully built to help companies activate marketing automation. Data is pulled from multiple sources (like your CRM, product, website, etc.), cleaned, and combined to create a single customer profile. This structured data is then made available to other marketing systems.

Tejas Manohar argues that there are six key reasons why you should prefer your data warehouse to an off-the-shelf CDP:

  1. CDPs are not the single source of truth. The data warehouse has all your data.

  2. CDPs require their own event tracking. They take months to implement.

  3. CDPs do not mesh with data teams. Marketing and data teams should work together.

  4. CDPs are not flexible. Every business has a unique data model.

  5. CDPs own your data. You are locked in.

  6. CDPs do not benefit from the data ecosystem. You are siloed.

Let’s dig deeper into each of them:

The data warehouse has all your data

Whether you’re a D2C brand, a B2B SaaS company, an e-commerce marketplace, or even a massive bank like Capital One, chances are your customer data is already in a data warehouse. The number one reason that your CDP should be the data warehouse is that your data warehouse is already your CDP.

Off-the-shelf CDPs take months to implement

Setting up a CDP takes a lot of work. Before even getting started, you’ll need to first spec out all of the data to track across different data sources. Then you need to actually track it, which requires engineers to write a lot of tracking code. On top of that, CDPs require you to use their event tracking libraries, which only support the rigid data model that CDPs understand (accounts, users, and events). For these reasons, it’s not uncommon for a CDP implementation to take over a year.

As mentioned, your warehouse is already your CDP, which means this “implementation” step has mostly already happened. The data exists, it’s modeled in a way that works for your business, and it’s ready to be synced to SaaS tools.

Your data teams love your warehouse

CDPs target marketing teams and primarily sell to CMOs. Ultimately, though, marketers are not the right persona to solve the intricate data problems that CDPs set out to address.

Self-service access and data democratization are important, but they are cross-functional. Data teams should be responsible for understanding your company’s data model and building clean data models for everyone else to consume. Marketing teams should be empowered to analyze customer behavior and iterate on customer segments without being bottlenecked by data teams. CDPs do not recognize this; instead, they give marketers immense capabilities without the processes or guardrails of data and engineering teams.

Your business needs flexibility

CDPs are built around rigid data models. Segment Personas, as an example, offers only two core objects: users and accounts. What’s more, a user can only belong to a single account. In reality, data models aren’t so cookie-cutter. Users can be in multiple accounts, and accounts can have sub-accounts, business units, etc. Beyond users and accounts, companies today have their own proprietary objects and hierarchies.
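
To make the flexibility argument concrete, here is a minimal sketch of a schema in which users can belong to many accounts and accounts can nest. It uses SQLite purely for illustration, and every table and column name is hypothetical rather than taken from any particular CDP or warehouse.

```python
# Minimal sketch of a flexible customer data model: users can belong to
# many accounts, and accounts can nest (sub-accounts). All names are
# illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        parent_account_id INTEGER REFERENCES accounts(account_id)  -- sub-accounts
    );
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        email TEXT NOT NULL
    );
    -- Many-to-many membership: a user may sit in several accounts.
    CREATE TABLE account_memberships (
        user_id INTEGER REFERENCES users(user_id),
        account_id INTEGER REFERENCES accounts(account_id),
        role TEXT,
        PRIMARY KEY (user_id, account_id)
    );

    INSERT INTO accounts VALUES (1, 'Acme Corp', NULL), (2, 'Acme EU', 1);
    INSERT INTO users VALUES (10, 'dana@example.com');
    INSERT INTO account_memberships VALUES (10, 1, 'admin'), (10, 2, 'member');
    """
)

# One user in two accounts: something a users/accounts-only CDP cannot express.
print(conn.execute(
    "SELECT a.name FROM account_memberships m "
    "JOIN accounts a USING (account_id) WHERE m.user_id = 10"
).fetchall())
```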

Privacy matters

Despite the rise of regulation and concerns around data privacy and data security, there is no truly on-premise CDP offering. CDPs offer only restricted access to your customer data, whereas data warehouses give you unrestricted access to it. The best companies recognize that leveraging customer data is a competitive advantage, and therefore they should own their data.

The modern data stack is best practice

Because CDPs are built to play well only within their own proprietary ecosystems, every CDP has to independently address these concerns via proprietary product features. The transformations you need to run often don’t exist, so you have no choice but to file a support ticket. If your CDP is your data warehouse, however, you can use SQL to transform your data any way you wish, with tools like dbt on top to systematically encode and execute those transformations.

Tejas’ company Hightouch fits into the Reverse ETL category of the modern data stack. Reverse ETL solves the “last mile” of analytics: making data actionable. While ETL and ELT move data from sources into the data warehouse (like Fivetran), Reverse ETL moves modeled data from your warehouse back into your SaaS tools. For example, Hightouch Audiences make the data warehouse accessible to Marketers.
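
Conceptually, a Reverse ETL sync boils down to reading modeled rows out of the warehouse and pushing them to a SaaS API. The sketch below illustrates the idea only; the table, endpoint, and field names are hypothetical, and real tools like Hightouch or Census add diffing, batching, rate limiting, and retries on top of this.

```python
# Conceptual sketch of a Reverse ETL sync: read modeled rows from the
# warehouse and push them to a SaaS tool's API. The table, endpoint, and
# fields are hypothetical.
import json
import sqlite3                      # stands in for a warehouse connection
import urllib.request

warehouse = sqlite3.connect(":memory:")
warehouse.executescript(
    """
    CREATE TABLE dim_customers (email TEXT, lifecycle_stage TEXT, lifetime_value REAL);
    INSERT INTO dim_customers VALUES ('dana@example.com', 'active', 420.0);
    """
)

for email, stage, ltv in warehouse.execute(
    "SELECT email, lifecycle_stage, lifetime_value FROM dim_customers"
):
    payload = json.dumps(
        {"email": email, "properties": {"lifecycle_stage": stage, "ltv": ltv}}
    ).encode()
    req = urllib.request.Request(
        "https://api.example-crm.com/v1/contacts",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # in practice: handle errors, rate limits, retries
```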

The talk then discusses two use cases from Blend (by William Tsu) and Auto Trader (by Darren Haken) that adopt Hightouch.

Blend Use Case

Blend has two major challenges: data silos and slow in-house tools.

  • Historically, Asana data was siloed within the platform. If teams wanted to connect project timeline data to Salesforce, they would be forced to manually copy fields and values between the two. Their Services team was forced to context-switch between tools multiple times throughout the day.

  • The convoluted approach to data ingestion cost the team time and money. Pulling just a single column from Salesforce or changing a field could take the team weeks, limiting access to time-critical data. Without the ability to prototype and rapidly iterate, the team had to release straight to production to test their solutions, creating more headaches for the operations team. And with rapid expansion came new tools to manage workflows and processes, like Asana, Marketo, and Lever. Each of those tools also required data to be synced into them to be effective.

Blend’s solution has been to adopt Fivetran and Hightouch. The value of Fivetran is being able to pull in Salesforce, Marketo, Asana, NetSuite, and Lever, then blend the data from historically separate departments together for analysis. Hightouch then pushes the data out to ensure everyone looks at the same metrics.

As a result, Blend’s Finance department is able to close out service team books four days earlier, cutting the team's financial reporting time in half. Blend’s Operations team merges data sources to provide powerful, business-wide data analysis. In brief, Hightouch enables alignment between teams and tools, impacting Blend’s bottom line.

Auto Trader Use Case

Auto Trader’s goal has been to unify their customer data by creating a single 360-degree view of the customer. The big motivation here is to activate data within their product and marketing departments for personalization, retargeting, and product experimentation. This also allows their data scientists to enrich customer data with predictions. Other reasons include doing the right thing for the customer (data governance and privacy) and controlling the data sent to third parties (only anonymized users go to ad networks).

However, they ran into a few challenges. They used a tool called Snowplow for tracking users within their products, but they still had many islands of customer data across marketing, sales, and advertising tools. Each island had duplicate audiences, and the integration work for each ad platform added up (having to rebuild audiences and conversion events in each tool). As a result, engineering bottlenecks slowed down the Marketing team’s ability to experiment.

Auto Trader wanted a composable, finer-grained CDP, so their solution has been to adopt Hightouch Audiences for audience creation and syncing to platforms. It is powered by Auto Trader’s data warehouse on BigQuery and compiles paid-ads audiences and conversions for Facebook and Google, enriched with predictions made by their Data Science team.

3 - Lessons Learned from Spinning Up Multiple Data Stacks

Utsav Kaushish shared what he has learned as an early analytics employee at startups like Zenefits and User Interviews. His MVP analytics stack consists of Fivetran (for ELT), Amazon Redshift (for the warehouse), and Mode Analytics (for BI). This stack is quick to set up (days, not weeks), costs little, and requires minimal maintenance. Getting early wins at this phase is crucial. Utsav distinguishes big wins (important for culture and team growth) from small wins (important for efficiency).

Examples of big wins include:

  1. Setting a “North Star” Metric: Getting the company aligned around engagement (an important cultural shift) so folks can use analytic skills to prove the metric is tied to revenue.

  2. Retention Analysis: Hard to do without multiple, connected data sources, but it produces good ideas for other teams (a minimal cohort sketch follows this list).

  3. A/B Testing: Hardest to pull off, but still great ROI. It is also a powerful example of data answering product questions (another seismic cultural shift).
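
To make the retention example concrete, here is a minimal cohort-retention sketch using pandas. The events DataFrame is made up for illustration; in practice it would come out of the warehouse after the ELT step described above.

```python
# Minimal cohort-retention sketch with pandas. The events data is
# hypothetical.
import pandas as pd

events = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 3, 3],
        "event_date": pd.to_datetime(
            ["2021-01-05", "2021-02-10", "2021-01-20",
             "2021-01-25", "2021-02-01", "2021-03-02", "2021-04-15"]
        ),
    }
)

events["event_month"] = events["event_date"].dt.to_period("M")
cohorts = events.groupby("user_id")["event_month"].min().rename("cohort_month")
events = events.join(cohorts, on="user_id")
events["months_since_signup"] = (
    events["event_month"] - events["cohort_month"]
).apply(lambda offset: offset.n)

# Rows: signup cohort; columns: months since signup; values: active users.
retention = (
    events.groupby(["cohort_month", "months_since_signup"])["user_id"]
    .nunique()
    .unstack(fill_value=0)
)
print(retention)
```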

Examples of small wins include:

  1. Dashboards to make data part of someone’s day-to-day work and quickly produce important insights (What’s working? What’s not?)

  2. Automating Reports: Teams are probably doing lots of manual work in Excel, which can be replaced with scheduled queries. Saving time is valuable.

Utsav concluded his talk by sharing a few things he wishes he had done differently:

  1. Get a deep understanding of each “schema”: What data can change? What data is human-generated? Try to make zero assumptions about your data schema.

  2. Make snapshot tables: This can be as simple as materializing the query “SELECT * FROM table” into a dated table on a schedule (see the sketch after this list). Snapshots help hedge against data mutability and are easy to set up.

  3. Invest in data transformations: The combination of Fivetran and dbt has made setting up the data stack much more effortless, bringing great ROI and facilitating the onboarding of new team members.
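
As a rough sketch of the snapshot tip above (using SQLite purely as a stand-in for the warehouse, with a hypothetical table name), a scheduler such as cron, Airflow, or dbt snapshots would run something like this daily:

```python
# Daily snapshot sketch: copy the current state of a mutable table into a
# dated snapshot table. SQLite stands in for the warehouse; the table name
# is hypothetical.
import sqlite3
from datetime import date

conn = sqlite3.connect("warehouse.db")  # stand-in for the warehouse
conn.execute("CREATE TABLE IF NOT EXISTS subscriptions (id INTEGER, status TEXT)")

snapshot_table = f"subscriptions_snapshot_{date.today():%Y%m%d}"
conn.execute(
    f"CREATE TABLE IF NOT EXISTS {snapshot_table} AS SELECT * FROM subscriptions"
)
conn.commit()
```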

4 - How Chick-fil-A Leveraged Data & Analytics to Prioritize in the Pandemic

At Chick-fil-A, the Enterprise Analytics team is set up in a decentralized way to support different verticals such as marketing, supply chain, financial services, and analytics across the business. The Digital Transformation and Technology team sets up the core infrastructure for all the analytics work to be done. Korri Jones emphasized a strong partnership between these two teams to facilitate analytics projects at the organization.

Many Mom and Pop restaurants were hit extremely hard when the pandemic happened. Cleanliness and food safety became more important to customers. Digital ordering, online menus, delivery, etc., moved from luxuries to necessities. There were severe supply chain disruptions as well as a war for talent.

As a business, Chick-fil-A needed to decide how to support and protect owner/operators, their staff, and the communities they serve. They also wanted to surface restaurant operational issues arising from the pandemic. As a result, they leaned on their forecasting expertise to make better-informed decisions about the business and brand while redoubling their efforts to support their people and families.

To keep the company and the guests healthy, Korri’s team has two main goals:

  1. Know the business well: This entails asking questions such as - what is missing from the data they need to be aware of and who can help them identify and account for this? What are their competitors’ challenges, how do they apply to Chick-fil-A, and are there opportunities to collaborate during this time? What do their customers expect, and how can they meet and exceed those expectations?

  2. Know their people well: Working remotely is one thing, but working remotely and taking care of family is another. They need to accommodate school closures and help people with their remote setups.

From those goals, it was important to identify COVID trends, closings, and state mandates, consolidating the data into an action plan for the business. Here are the analytics projects that Chick-fil-A prioritized during the pandemic:

  1. Scenario Planning: Forecasting worst case, best case, and somewhere in-between the two in order to reassess all business functions while focusing on their owner/operators and their teams, guests, and staff.

  2. New Dashboards: Building hyper-focused dashboards, alerts, and data pipelines, which surfaced whether a restaurant needed to close entirely or partially so that they had a near-real-time understanding of the impact.

  3. ML Projects: Doubling down on MLOps via defining the core architecture required to systematically stand-up a solid core to accelerate and make data science efforts more resilient, leaning heavier into model governance and how that partners with data governance, and investing in monitoring/observability/explainability tools as their data evolved with the new business climate.

The key outcome is more expedited delivery of business insights to those who need it, not through tooling and technology but through tighter collaboration. As a result, they doubled their focus on protecting the business, owner/operators, staff, and customers through data excellence. In the meantime, they are expanding their ML Engineering team to support growing process automation and forecasting work.

Data Analyst

5 - What is a Modern Data Architecture?

As seen above, the journey to a cloud data platform typically looks like this:

  • We start with the on-prem data warehouse, which comes in many flavors, from single-server instances like Oracle or SQL Server to massively parallel processing (MPP) appliances. Given their limited computational power, these on-prem systems can’t handle all of your data.

  • The next evolution is the cloud data warehouse. Now you don’t have to worry about buying infrastructure and hardware. Systems like Redshift and Synapse are on-prem MPP appliances moved to the cloud, so they still face similar limitations.

  • The response to such limitations is the file-based data lake and Hadoop. Given the explosion of data, since you can’t store it all in a relational database, you store it in files on commodity storage. However, this approach gives up the ACID guarantees and fast query responses of SQL databases.

  • From the beginning, Snowflake has been built as a cloud data platform to address both of these concerns.

If you asked almost any current leader in data engineering to draw a “modern” data architecture on a whiteboard, you would most certainly get something like the following:

However, Jeremiah Hansen argued that this architecture has been around for almost 10 years and hasn’t changed much. This architecture comprises three major components: the data warehouse, the data lake, and the data marts.

  • The need for separate data marts and data lakes arose because traditional data warehouses couldn’t scale to meet the different, competing workloads placed on them. Data marts came about because the central data warehouse couldn’t meet the varied workloads and high-concurrency demands of end users. Data lakes followed because the enterprise data warehouse wasn’t able to store and process big data (in terms of volume, variety, and velocity).

  • Data lakes and data marts were created to address a real need in the data engineering space at the time. And even today, data warehouses continue to be unable to support all the varied workloads found in the enterprise. This is true even for the newer cloud data warehouses. These disparate data systems result in siloed data, which is very challenging to derive business value from and govern securely.

But Snowflake Cloud Data Platform has dramatically changed the data landscape and eliminated the need to have separate systems for each of your workloads. Snowflake can be your data warehouse, data marts, and data lake. And that requires us in the data engineering space to think differently about what we’ve been doing. It requires us to understand why we’ve been doing things a certain way and to challenge our assumptions.

Jeremiah has noticed that as data architects begin to work with Snowflake, they continue to fall back on that legacy systems–based data architecture design, using Snowflake only as a data warehouse or maybe expanding it a bit to include some data marts. And most continue to argue for maintaining a separate file-based data lake outside of Snowflake, even when building one from the ground up. But why continue to think this way when Snowflake can replace all of these systems?

In order to move forward, we need to stop thinking about data in terms of existing types of systems, such as legacy data warehouses, data marts, and data lakes. Doing so is not helpful; it introduces unnatural, artificial boundaries in the enterprise data landscape.

At a high level, you can group all enterprise data into the following logical data zones:

The old systems-based thinking will keep data engineering professionals locked into old ways of doing things and will continue to fragment the data landscape. There is no need to divide the data zones into disparate, siloed data systems like lakes, warehouses, and marts with Snowflake. Instead of thinking along system lines, we should consider a single platform for all enterprise data such as this:

Snowflake Cloud Data Platform can support all your data warehouse, data lake, data engineering, data exchange, data application, and data science workloads. With support for just the first two of those workloads alone, you can consolidate your data warehouse, data marts, and data lake into a single platform.
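
To make the "single platform" idea concrete, here is a hedged sketch using the snowflake-connector-python package: one connection serving both a classic relational aggregation and a query over semi-structured JSON stored in a VARIANT column. The credentials, table names, and JSON shape are placeholders, not anything from the talk.

```python
# Sketch: warehouse-style and lake-style queries on the same platform,
# via the snowflake-connector-python package. Credentials, tables, and the
# JSON shape are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="analyst",            # placeholder
    password="...",            # placeholder
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()

# Classic warehouse workload: aggregate a relational fact table.
cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
print(cur.fetchall())

# Lake-style workload on the same platform: query raw JSON in a VARIANT
# column, with no separate file-based data lake required.
cur.execute(
    "SELECT raw:device.type::string AS device_type, COUNT(*) "
    "FROM raw_events GROUP BY 1"
)
print(cur.fetchall())
```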

6 - How Analysts Are Transforming Data Literacy at their Company

Lauren Anderson discussed how the Data and Insights team at Okta has been teaching every employee how to speak enterprise data as a second language. Their overarching goal is to democratize data by empowering the enterprise with trusted data and analytics to make timely, data-driven decisions. Their path to that goal is creating a community of citizen data analysts. In order to create and support this community, Lauren’s team needed to implement and maintain modern technologies and environments for analytics, provide a framework for data ownership and data quality tracking and remediation, and support users with training events and real-time support.

First, they identified the three personas: the consumers (who view high-level KPIs on demand through a centralized reporting portal and have access to view reports in the BI tool), the creators (who leverage certified data sources to publish and share reporting and can perform ad-hoc analysis), and the trailblazers (who have access to production and pre-production data, can create and publish data sources, and even bring their own data).

It’s also important to empower the business units with data ownership and rely on subject matter experts to ensure data is a high-quality asset. Lauren’s team enables a network of data trustees and data stewards to standardize definitions and address quality improvement areas so that the three personas can consume, explore, and make informed decisions. The team supports this process by providing a data quality scorecard that the stewards can use to monitor data quality during monthly and quarterly meetings to ensure that everyone is aligned on priorities. The team also provides a central repository for standard metrics and definitions.
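
The talk did not detail how the scorecard is computed, but as a rough illustration of the idea, a minimal sketch might score each dataset on completeness and freshness; the dataset names, metrics, and thresholds below are entirely hypothetical.

```python
# Illustrative only (not Okta's implementation): a tiny data quality
# "scorecard" computing completeness and freshness per dataset with pandas.
import pandas as pd

def scorecard(df: pd.DataFrame, date_col: str, max_age_days: int = 1) -> dict:
    completeness = 1 - df.isna().mean().mean()           # share of non-null cells
    age_days = (pd.Timestamp.now(tz="UTC") - df[date_col].max()).days
    return {
        "completeness": round(float(completeness), 3),
        "fresh": age_days <= max_age_days,
    }

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "amount": [10.0, None, 12.5],
        "loaded_at": pd.to_datetime(
            ["2021-09-01", "2021-09-01", "2021-09-02"], utc=True
        ),
    }
)
print({"orders": scorecard(orders, "loaded_at")})
```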

They also hold events like Annual Analytics Day, provide training sessions with Udemy courses, and maintain dedicated Slack channels for analytics discussion.

Finally, when a report, a dataset, or a predictive model is ready for prime time, they provide a path to production by ensuring proper monitoring automation and quality assessment. If someone sees the “Verified by okta Data and Insights” watermark on a report, they know they can make trusted decisions off of it.

As of now, Okta has 47% monthly BI tool engagement, with 10% of people creating content and 1% trailblazing. As the team continues to maintain a modern tech stack, lower the barrier to creating and sharing content backed by high-quality data, and provide ongoing encouragement and training, Okta’s community of analysts will keep growing and maturing in its insights.


Gabi Steele is CEO and Co-Founder of Data Culture, whose mission is to empower organizations to solve problems using data and build a lasting data culture. They offer full-stack implementations, fractional data teams as a service, data strategy (CDO as a service), customized data visualization and storytelling, and building data teams from the ground up.

Generally, when entering organizations, their goals are to help organizations reach their next phase of data maturity, close the gap between the infrastructure and business requirements, and empower the business to own and socialize the work that has been implemented.

Gabi broke down the 4 phases of a business’s data maturity as the following:

  1. Data Chaos: Data is stored in silos across disparate products and applications. There is no data team: just hardworking scrappy people or a single data analyst.

  2. Centralization: Data is centralized from multiple sources into the data warehouse. The data team has some data analysts, data engineers, or software engineers doing data engineering.

  3. Data Visibility: Data is centralized and made accessible in an analytics database. Dashboards can easily be built on top of analytics databases. The data team has many data analysts, data engineers, BI developers, BI analysts, business analysts, and product analysts.

  4. Intelligent Products: The organization focuses on building the ability to use data science and ML in the product. The data team has additional data scientists, ML engineers, and data visualization engineers.

While most businesses say roughly 30% of the decisions they’re making are informed by data, up to 73% of company data goes unused for analytics. Gabi believes that building a lasting data culture requires a community of inspired employees who are empowered to solve problems using data.


Brittany City is a quantitative data analyst at Asurion, a Nashville-based leading provider of device insurance, warranty, and support services for cell phones, consumer electronics, home appliances, and jewelry. The company operates in over 15 countries and has over 50 offices serving over 300 million customers.

The diagram below shows how data flows to Asurion’s stakeholders. A customer receives a new device, which for some reason gets shipped back to Asurion (wrong color, wrong size, low quality, etc.), producing reshipment data. That data is then turned into reports, which are given to the stakeholders: the quality department overseeing the repair and inspection of devices before they go to a customer. The reports must efficiently explain what is happening so the stakeholders can take that to suppliers (Apple, Samsung, AT&T, Sprint, etc.).

Analysts need to have clear data literacy and great data products for the stakeholders. Asurion has three types of data products:

  1. Power BI Dashboards: These are visually appealing, with filters for customization for multiple stakeholders. They have the ability to tell a full data story with tabs. However, they are more time-consuming and complex to create.

  2. Excel Dashboards: These are ideal for tabular reports, simpler to understand, and easy to create. However, there are tons of files to keep track of, and the work is often repetitive.

  3. SQL Server Reports: These can be automatically sent to stakeholders and provide straightforward information. However, there can be delays and errors during high-volume times, and they lack visualizations.

The roadblocks that Brittany’s team has faced include several repetitive Excel and SSRS files from ad-hoc requests, repetitive weekly Excel tasks, and Power BI dashboards not providing full support to stakeholders. Given these roadblocks, her team’s future efforts entail removing unnecessary files, combining Excel and SSRS files into Power BI dashboards to tell a full data flow story, creating new dashboards to automate weekly Excel file requests, and updating current dashboards with filters and education.


Archer Newell is a senior data analyst at Fivetran. At Fivetran, analysts are the go-between connecting business users with the data stack. They work with different departments to understand their challenges and translate them into certified datasets. After building the datasets, they build dashboards and perform ad-hoc analysis on the users’ behalf, which sometimes slows things down. Rather than having analysts in this middleman role, it would be even more efficient to put these tools directly into the hands of business users so that they can ask and answer their own data questions and even start contributing to reports themselves.

The first thing analysts can do to support this is create good documentation so users can figure out what datasets and metrics exist. Next, users need an additional layer of education and data literacy to actually put data to use.

To drive long-term adoption of BI tools, users need to be able to easily find helpful data so they can make these tools part of their daily jobs. At Fivetran, 55% of employees use dashboards each week (70%+ excluding Engineering ICs).

  1. The analytics team structures their BI tool so it’s easy to navigate: adding landing pages with top reports, new releases, and important links; creating Analytics-certified reports; and identifying certified reports via a folder structure and naming convention.

  2. The team also partners with power users to build better data products and incentivizes them to champion self-service analytics within their teams.

Analysts can also play a big role in educating users on how to use BI tools and how to interpret results, enabling them to feel confident making decisions off that data.

  1. The team demos new tools and offers targeted training sessions for relevant teams.

  2. The team also offers Office Hours to help business people think like data analysts and provide quick insights for business users and content for analysts.

7 - How to Use the Modern Data Stack to Power Product-Led Growth

Product-led growth is a business methodology in which user acquisition, expansion, conversion, and retention are all driven primarily by the product itself. It creates company-wide alignment across teams (from engineering to sales and marketing) around the product as the largest source of sustainable, scalable business growth. 

Boris Jabes (the CEO at Census) argues that you should care about PLG even if you are not working in a product function. Every PLG company has to deal with three forces: an increasing number of users, an increasing number of channels, and an increasing number of time periods. Considering that data impacts every team in PLG (Sales, Support, Customer Success, Design, Marketing, and Engineering), Product-Led Growth essentially means Data-Led Growth. Data teams should utilize the Modern Data Stack with event tracking, transformation, visualization, cloud warehouse, operations, and ELT.

But how can we go from deploying the Modern Data Stack to reaching Growth and Profit? Boris emphasizes that growth is a loop of feedback to understand how users behave and how to move your product forward.

Buddy Marshburn (Data Engineering Manager at Loom) illustrated a PLG Use Case at his company. Loom is a video messaging tool that lets you instantly share your screen and your video with your team.

Designing Loom’s data infrastructure to support PLG requires four distinct steps:

  1. Collection and Storage: Loom chose a cloud-based data warehouse to store their data and Fivetran as the ETL tool to get the data from apps to the warehouse.

  2. Transformation Layer: Loom utilizes a reliable, high-quality, and well-documented API layer to transform the raw data into production data.

  3. BI Layer: Buddy’s team frequently communicates with their stakeholders to ensure that the dashboards built in the BI layer (Mode) are not repetitive or irrelevant.

  4. Reverse ETL and Operations: Loom uses Census to unlock the operationalization of the data by bringing data from the warehouse to the application layer for sales, marketing, support, and product actions.

8 - Achieving Competitive Advantages with Modern BI

Lucas Thelosen gave a very informative talk about Modern BI, which benefits businesses (better outcomes, increased agility, analytical maturity) and professionals (career advancement, strategic growth, data product management) in multiple ways. At its core, Modern BI consists of people (data culture), process (data product management), and technology (modern data stack). Analytical maturity is the theme that brings these different elements together. In a simplified way, analytical maturity entails hindsight (what has happened?), insights (why did it happen?), and predictions (what will happen?).

Unfortunately, from a technology standpoint, traditional tools are not designed for Modern BI.

  1. Not scalable: They are not designed for modern databases, as data is often siloed, cubed, or put in extracts. This leads to high costs and poor performance.

  2. Non-agile: They don’t support iterative development and limit the developers who want to treat Analytics as a Product.

  3. Limited experiences: They provide a one-size-fits-all approach to analytics, in which every output is a report or dashboard. This leaves behind many consumers.

  4. Lock-in: They have a limited choice of framework platforms and databases and are designed to be difficult to migrate away from.

Looker is designed to be API-first and cloud-native for integrating into existing workflows. It also has a semantic modeling layer for enterprise-wide governance and an in-database architecture for access to real-time data. Its capabilities entail:

  1. Modern BI and Analytics by serving up real-time reports and dashboards that inspire more in-depth analysis.

  2. Integrated Insights by infusing relevant data with your existing tools for an enhanced experience and more effective results.

  3. Data-Driven Workflows by super-charging operational workflows with complete, near-real-time data.

  4. Custom Applications by providing a data tool built to deliver the results you need as needed.

Migrating from a legacy BI tool to Looker can help organizations transition from traditional reporting tools to a modern data platform. Looker has worked with 1,000+ customers and has extensive experience with migration projects. They make the migration process simple with their proven (Legacy)-to-Looker Migration Strategy:

  1. Lift and Shift: Replicate content into Looker with minimal changes or enhancements.

  2. Rationalize: Move only the most valuable/popular content and simplify the process for creating new content.

  3. New and Improved: Evaluate and develop a new data solution based on current business needs.

From a process standpoint, data product management means managing data like a product. This process entails:

  1. Iterative Development: With Modern BI technology available, you have an agile platform. Analytics now can and should always evolve. Business users who are able to drill down and ask more questions on their own (reduced friction) will have more they want to test out. The Data Product Manager meets with the business stakeholders to gather feedback and iterate on the analytics the business needs.

  2. Enhanced Communication: The Data PM develops a roadmap and makes it visible to all (with release notes). The notes highlight what is being released and what will come in the future, including how the business feedback influenced the roadmap.

  3. Operational Analytics: The goal of analytics is to make better, data-driven decisions and take action. Data PM operationalizes analytics by connecting insights to actions.

From a people perspective, a strong data culture has these elements:

  1. Data Literacy: You should never assume that technology is easy to use. It’s your job to educate people on how data relates to their job. You can set up classes people can attend and office hours or chat channels. You can ask management to commit to a Data Culture (e.g., use a certain dashboard in team meetings).

  2. Build Data Experiences: With the Modern BI platform, you can get analytics to where people are (embedded or emailed). You can design the reports for the audience or have the end-users lead the design. In brief, Modern BI lets the user drill down and ask more questions. You can curate a drill path.

  3. Ambassador Network: 1:1 relationships are still the key to driving adoption. A network of people who “get data” embedded within various teams can be the ambassadors to provide these touchpoints. These ambassadors can provide light touch enablement, prototype dashboards, and relay feedback to the core team.

Product Manager

9 - DataSecOps: Embedding Security in Your Data Stack

Ben Herzberg defined DataSecOps as an agile, holistic, security-embedded approach to coordination of the ever-changing data and its users, aimed at delivering quick data-to-value while keeping data private, safe, and well-governed.

To understand DataSecOps, it’s important to understand DevOps first. A primary driver of DevOps has been cloud elasticity, in which software organizations can deploy software in the cloud in a continuous way. As a result, they can advance in small increments, fail fast, and innovate much faster.

However, DevOps faces several common security issues: Many companies have a DevOps team, but they aren’t able to adopt the DevOps mindset. There are issues with misconfigurations and change management as well, which can lead to a high cost of security-as-a-patch. As a result, security needs to be baked into DevOps and embedded into the process, hence the term DevSecOps.

Looking into the data space, Ben argued that “data democracy is the worst form of data governance, except all the others.”

  • Data democratization is bad because it is a big liability for most organizations. There will always be a loss of control, increased risks, a proliferation of data protection, privacy and compliance regulations, and operational challenges.

  • But data democratization can also be good because it allows more people within the organization to make use of data in novel ways and shortens time-to-value from data, thereby bringing business value.

Companies have started to adopt data democratization in various functions such as reporting, business intelligence, and predictive analytics, with new cloud technology as enablers. However, such an adoption process tends to cost a lot of financial resources.

DataOps is the agile coordination of the ever-changing data and its users across the organization. Compared to traditional DevOps, DataOps is used by broader teams (especially business users) and places more emphasis on privacy and data governance, but it is also generally less mature.

DataSecOps is the real enabler of data democratization. In the DataSecOps mindset, more data consumers means higher data exposure risk, and small data infrastructure and security teams are handling a lot of data and data changes. Therefore, you shouldn’t significantly slow down time-to-value, but you also shouldn’t compromise on security needs. You need a holistic, collaborative mindset in which responsibility is shared, rather than resting on specific teams.

Here are the fundamental DataSecOps principles:

  • Security needs to be a continuous part of data operations, not an afterthought.

  • Any ad-hoc process needs to be transformed to be continuous.

  • There needs to be a clear separation of environments, testing, and automation.

  • Prioritization is key - primarily for sensitive data.

  • Data is clearly owned.

  • Data access should be simplified and deterministic.
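
To make a couple of these principles concrete (security as a continuous process, prioritization for sensitive data), here is a toy check that could run on a schedule against a catalog export and flag gaps before they become incidents. The catalog structure, column names, and policies below are entirely hypothetical.

```python
# Toy illustration of "security as a continuous part of data operations":
# scan a hypothetical catalog export and flag tables that hold sensitive
# columns but lack a masking policy or a clear owner. A real DataSecOps
# setup would run a check like this continuously, e.g. in CI.
SENSITIVE = {"email", "ssn", "phone", "date_of_birth"}

catalog = [  # hypothetical catalog entries
    {"table": "dim_customers", "columns": ["id", "email", "ssn"],
     "owner": "growth-team", "masking_policy": None},
    {"table": "fct_orders", "columns": ["id", "amount"],
     "owner": None, "masking_policy": None},
]

def findings(entries):
    for entry in entries:
        sensitive_cols = SENSITIVE.intersection(entry["columns"])
        if sensitive_cols and not entry["masking_policy"]:
            yield f"{entry['table']}: sensitive columns {sorted(sensitive_cols)} lack a masking policy"
        if not entry["owner"]:
            yield f"{entry['table']}: no clear data owner"

for issue in findings(catalog):
    print(issue)
```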

Ben concluded the talk with a few notes on how to start embedding security in your data stack. First, you need to give testing and automation their respect with proper resources. Then, you should include security teams in data projects. Finally, you should always prefer continuous processes over ad-hoc projects.

10 - Insights-as-a-Service: Creating Data Products from your Customers’ Data

In a traditional internal Modern Data Stack (MDS), the keys are available. Use cases are predictable. Architecture requirements are linear. Data volumes grow at a known pace. Stakeholders can reach congruence. Organizational hierarchies are apparent.

However, an external Modern Data Stack (MDS) is the opposite: Keys are out of reach without effort. Use cases are relatively unknown. Architecture requirements are highly variable. Data sources are shrouded and grow at an untold pace. Builders have degrees of separation from stakeholders. Organizational hierarchies are bespoke.

The solution, therefore, is a flexible architecture with customization options that support nuance. But how do we strike that balance? Too much flexibility and customization demands technical expertise and a sophisticated onboarding process, and the solution becomes harder to maintain. Too little flexibility, and your solution quickly hits extensibility limits, and customers leave.

Aaron Peabody argued that solid DataOps practices provide the best bang for the buck to reach that balance and serve as the foundation for Insights-as-a-Service (IaaS). The six core pillars of DataOps are Authentication and Data Onboarding, Partitioning and Security, Workflow and Pipeline Orchestration, Data Quality and Veracity, Repeatability and Redundancy, and CI/CD and Observability.

The API Economy has provided a new paradigm for engineering. The shift requires us to critically examine what we should and should not build for any given project.

  • In the old way, we were taught to build it, source it, buy it. Consequently, fragility is always a risk.

  • In the new way, we buy it, source it, build it. Functional limitations can be a risk, but advantages include scale, speed, and support.

By building an external MDS the right way, the Untitled Platform is not just an External MDS application. It also contains the infrastructure to support the creation and deployment of Internal and External MDS applications. Its role in the MDS ecosystem is to become the leader in Deployment.

Aaron then dug deeper into the DataOps foundations that the Untitled platform was designed with in mind.

For authentication and data onboarding, the platform uses:

  • Fivetran for commonplace data source connections and integration management.

  • Auth0 for infrastructure connections and federating internal/external service bridges.

For partitioning and security, the platform uses:

  • AWS Organization and Account Architecture: Every customer is provided with a unique AWS account for the platform backend, bridged into our root organization architecture.

  • Schemas and Data Models: Customers have their own unique or separate schemas and storage locations throughout the data layer. Multi-tenant locations are isolated and utilized for specific offerings opted in by customers.

  • Row-Level Security and Security Groups: For isolated multi-tenant locations, we utilize object, column, and row-level security partitioning, federated throughout the platform offering ecosystem by Auth0 and Dynamo ARN table.

For workflow and pipeline orchestration, the platform uses:

  • Amazon EventBridge: EventBridge acts as a core routing decision engine for all microservices comprising the deployment ecosystem. This pattern ensures the highest performance outcomes at optimal TCO and decouples infrastructure events from dependent data job events.

  • Airflow: DAGs are triggered via event streams and AWS Lambdas, with orchestration specs set in parameter stores for unique clients. DAGs run two tasks for each dbt model (dbt run and dbt test), staying true to Airflow’s atomic method (a minimal sketch of this pattern follows this list).

  • dbt: dbt is invoked through a BashOperator. DAGs kick off dbt task runs in master and client warehouses. Jobs are sequenced into logical lineages and dependencies by sources, staging transforms, and mart models.

  • Snowflake: This is the master data store for the Untitled platform, which powers its embedded BI offering and ML solutions.
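
The bullets above describe a common Airflow-plus-dbt pattern: one dbt run task followed by one dbt test task per model, invoked through a BashOperator. Here is a minimal, hedged sketch of that pattern; the DAG id, schedule, model name, and project path are placeholders rather than Untitled’s actual configuration.

```python
# Minimal sketch of the Airflow pattern described above: two tasks per dbt
# model (run, then test) via BashOperator. Names and paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_orders_model",
    start_date=datetime(2021, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run_orders",
        bash_command="cd /opt/dbt_project && dbt run --select orders",
    )
    dbt_test = BashOperator(
        task_id="dbt_test_orders",
        bash_command="cd /opt/dbt_project && dbt test --select orders",
    )

    dbt_run >> dbt_test  # keep each model's run and test atomic
```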

For data quality and veracity, the platform uses:

  • dbt Packages and Tests: Package library paths and data sources come with specific templates managed in the root Untitled dbt GitHub repo. Each dbt project and package manifest is assembled during the platform account setup process. There is one repo per client, dependent on updates to its master source libraries. Packages are pre-configured with test automation and table documentation.

  • Untitled serves as the custom sources mapping wizard, answering the question of what to do with data for which you don’t have supporting schemas and pipelines.

  • Castor Docs is used for comprehensive data documentation, transform lineages, and governance solutions.

For repeatability, redundancy, and observability, the platform uses:

  • Terraform (external services), which orchestrates Fivetran deployment architecture and configurations, Snowflake architecture, role, user, and policy configurations, alternative client warehouses and cloud infrastructures for supported destinations, and Auth0 tenant architecture.

  • AWS CloudFormation (internal services), which supports AWS organization account architecture; AWS role, user, and policy configurations; and AWS services, APIs, and parameter stores.

  • New Relic, which allows them to see system telemetry data and ingest event feeds across their entire infrastructure ecosystem. Automatic anomaly flagging and workflows support proactive management of deployments, alongside limited views of sub-account reporting provided to customers for their own deployments.

  • GitHub + CircleCI: All core infrastructure-as-code internal and external modules sit in the master organization repos, inclusive of tool-specific templates. Client organizations are created or bridged, with core repos cloned based upon use cases and configuration settings derived from account creation or modification processes, dependent on master repos.

Bringing it all together, the Untitled platform is technology agnostic, provides data source and MDS tool-specific templates, enables off-the-shelf optionality, and eventually unifies DataOps and IaaS platforms.

Team Lead

11 - Breaking Your Data Team Out of the Service Trap

Emilie Schario, who used to be the Director of Data at Netlify, gave the most enthusiastic talk at the conference. The data stack at Netlify looks like the one above.

  • They use custom Python scripts, Meltano, and Fivetran to move data into the Snowflake data warehouse.

  • They use dbt to transform the data in Snowflake and dbt Cloud to orchestrate dbt jobs.

  • They use Transform as a metrics store, Census for operational analytics, Mode for BI, and Airflow to orchestrate the whole architecture.

Most people think that building a data team just means pulling numbers quickly, but there is, in fact, much more to it than that: telling people no, managing competing priorities across the business, convincing people to track things before they want the data for it, etc.

Emilie emphasized that the purpose of any data team is to help organizations make better decisions. The mission of Netlify’s data team is to empower the entire organization to make the best decisions possible by providing accurate, timely, and useful insights.

Unfortunately, most data teams are failing.

  • Data teams are bombarded with questions: Where are our new users coming from? What does the adoption of this new feature look like? Which marketing source is generating the lowest CAC?

  • But people don’t trust the work of the data team: 1 in 3 executives do not trust the data they use to make decisions. 50% of knowledge workers’ time is wasted hunting for data and searching for confirmatory sources for data they don’t trust. 85% of all Big Data Investments are considered wasted.

Why? Data teams are stuck in a vicious cycle of sadness.

  • The team rushes to put out analyses because there’s so much to do.

  • Expectations shape stakeholders’ impressions of what they get. If it doesn’t match expectations, stakeholders question analysis quality, so the data team spends large amounts of time “proving” numbers.

  • The team can’t get to new work because of the backlog of “proving” numbers and existing technical debt.

Emilie believed the root cause of such an issue is that data teams are seen as service organizations (question comes in, answer goes out), just like IT organizations. This is problematic because data teams are stuck in a reactive position instead of a proactive position. They aren’t creating insights but instead are answering questions. But that’s not the goal of the data team.

The Service Model is all wrong because if we spend all our time answering questions, we will never deliver insights (much as an engineer who is stuck in meetings never has time to code). If we define the team by the insights we deliver, we need to be focused on delivering insights, not just answering questions.

This doesn’t mean we throw the service model out the window. Instead of building our data teams as service organizations, we reframe our work as building a data product that enables the business to make better decisions. But people sometimes don’t know what data as a product means in practice.

A typical data team does five types of work: Operational Analytics, Metrics Management, Data Insights, Experimentation Reporting, and Servicing Other People. There are other things your data team will do too, but we’re focused on impact.

  1. Operational Analytics means moving data from System X to System Y. Netlify uses Census for this. This helps increase access to data across the company and improve business processes.

  2. Metrics Management means taking ownership of specific KPIs, including being able to slice and dice them as needed and truly understand what’s going on. Netlify uses Transform for this. This helps reduce repeated asks for metrics that already exist (self-serve).

  3. Data Insights: The problem here is that people don’t know what they don’t know about their users, even though the answers can be found in the data. Netlify publishes an internal handbook called “Insights Feed” to help discover key strategic levers (read Paige Berry’s “Sharing Your Data Insights to Engage Your Colleagues” post for more information). Furthermore, they build an Insights Hub to serve as an institutional memory of their work. This hub helps onboard people to the work of the data team (independent of their role in the organization) and uses headlines as key takeaways so that people can skim and distill what they learned.

  4. Experimentation Reporting: Product/growth/marketing teams frequently want to understand how their experiments are doing. These experiments help improve activation, conversion, retention, and monetization. At Netlify, each experiment requires an individual analysis, and they are currently exploring Eppo.

  5. Servicing Other People means providing answers to questions from coworkers. Netlify uses GitHub for making requests and Mode for analyses. Such practice helps empower data-informed decision-making. Even when servicing other people, you want to flip their questions on their heads. User stories can help surface areas to dig for insights.

If you are building a data team from scratch, Emilie recommended going from left to right:

  1. Operational Analytics: Start by putting data in other systems and making it easy to access. This will allow you to operationalize data as you go.

  2. KPIs/Metrics: Identify what your inbound asks are. Create metrics that you can set up and let people take ownership of. Build on top of existing work to put metrics in systems where people are already working (e.g., Product Usage Metric in SFDC).

  3. Data Insights: Surface the information you’re finding in a newsfeed (Slack Channel) to your organization. Work loudly. Your work can only be impactful if people know of it.

  4. Experimentation Reporting: Suggest running an A/B test on a proposed change to the pricing page. Volunteer to do the analysis. Share the results widely. Teach about MDE, statistical significance, and other relevant key terms (a minimal significance-test sketch follows this list).

  5. Servicing Others: By focusing on high-impact work, people will be more tolerant of waiting for asks. If you are focused on the right strategic priorities, you’ll also get ahead of the asks.
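
As a small illustration of the statistics worth teaching here, the sketch below runs a two-sided, two-proportion z-test on made-up conversion counts; the numbers and sample sizes are purely hypothetical.

```python
# Minimal A/B test readout: two-proportion z-test on hypothetical
# conversion counts.
from math import sqrt
from statistics import NormalDist

control_conv, control_n = 480, 10_000   # made-up control results
variant_conv, variant_n = 540, 10_000   # made-up variant results

p1, p2 = control_conv / control_n, variant_conv / variant_n
p_pool = (control_conv + variant_conv) / (control_n + variant_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / variant_n))
z = (p2 - p1) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

print(f"lift: {p2 - p1:+.4f}, z = {z:.2f}, p = {p_value:.4f}")
# Before launching, you would also size the test: pick a minimum
# detectable effect (MDE) and compute the sample size needed to see it.
```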

Your data team structure should look like the one below, where everyone in your data team reports to a data leader. Each data team member is entitled to a manager who understands their job. Requests and prioritization happen with other business partners in the organization.

Furthermore, your data team requires resources:

  • Headcount: 3-10% of the company should be focused on data and decision-making. Budget and headcount must scale with the organization.

  • Funding - People: If a person is dedicated to a specific domain, such as Product or Finance, their headcount should be funded by that domain. Get other people to share the value that the data team provides.

  • Funding - Tools: Ensure your tooling is working for you and not against you. Find partners who benefit from the data organization’s value to help fund.

If you are turning a struggling team around, Emilie recommended these tips:

  • Use Slack to start changing data culture: Create a #data-reads channel to share relevant articles/newsletters to help build a data literate culture. Share your data insights in a place where they can last (like an internal blog), but also share them where people are working in Slack, with a link to the internal blog.

  • Find partners in other parts of the business: Collaborate with the UX research team. Get executive sponsors to engage in a habit loop. Include an “Intro to Data” session as part of your onboarding. Host Office Hours where people can bring questions. Focus on fewer teams that are well-served.

  • Prioritize proactive insights: Carve out 2-3 hours of deep work every week. Only work on what you can find in that time window. Share what you’re finding loudly. Insights are your appetizers: they’re small but can lead to something bigger.

To measure success, you don’t want to tell your data team’s ROI story. The value of the data team is a function of the number and volume of decisions you’re enabling, not the data you have or the dashboards you have built. You want to measure business impact, not output (uplift in revenue from work started by an insight or improved onboarding throughput from targeting communications). Consider measuring NPS for overall satisfaction.

12 - Improving ROI with Snowflake + Fivetran

Here are the three major challenges with traditional data pipelines:

  1. Siloed and Diverse Data: Data comes in different shapes and forms with multiple integration styles.

  2. Reliability and Performance Degradation: Data pipelines break frequently, and data transformations are slow as data changes over time.

  3. Complex Pipeline Architecture: Such architecture requires many tools, custom code, and diverse skill sets.

With the Snowflake data cloud, things are different. Snowflake is one global, unified system connecting companies and data providers to the most relevant data for their business. Carlos Bouloy emphasized the benefits of the data cloud:

  1. Access: You can get access to 3rd-party data, ecosystem data, and organization data all within the data cloud.

  2. Governance: You can protect, unlock, and know your data easily.

  3. Action: Product teams, data scientists, analysts, and other business teams can take action on the data.

Snowflake data cloud platform supports data engineering, data lake, data warehouse, data science, data application, and data sharing. It works with partner data, 3rd-party data, SaaS data, applications data, customer data, and data services. 

  • It is fast for any workload: Running any number or type of job across all users and data volumes quickly and reliably.

  • It just works: Replacing manual with automated to operate at scale, optimize costs, and minimize downtime.

  • It is connected to what matters: Extending access and collaboration across teams, workloads, clouds, and data, seamlessly and securely.

The combination of Snowflake, Fivetran, and dbt is powerful. Fivetran is automated data integration that allows access to SaaS solutions (Netsuite, Marketo, Jira, Stripe, etc.). It can work with traditional data sources like SQL Server, Oracle, etc. It enables you to easily get your data into the data cloud without dealing with complex processes and development cycles. Once the data is in Snowflake, you can perform the ELT process by running dbt jobs. Once the data is transformed, your data practitioners can work with that data in the data cloud.

Overall, this combination helps you connect and load data easily, improve pipeline reliability and performance, and reduce pipeline complexity.

13 - Increase Data Productivity by Extending Self-Service Analytics to Business Teams

Most organizations are still struggling to cultivate a data-driven culture. A 2019 report from Deloitte found that 67% of executives are not comfortable using their current tools to access data. Another 2021 report from NewVantage Partners found that only 25% believe they have forged a data culture, and 24% believe they have created a data-driven organization.

Julie Lemieux argues that a human-friendly analytics stack fosters a self-service culture. There are five ways to design an analytics stack that humans will actually use, making your data democratization initiative a success:

  1. Translating data into business terms with a data catalog.

  2. Maintaining an analytical flow state by enabling analysts to work in the same application environment throughout the analytics lifecycle.

  3. Thinking (and working) like a human by having an intuitive and easily configurable UI.

  4. Enabling users to add more data (not more problems) to their analyses.

  5. Fostering the art of (data) storytelling with powerful visualizations.

The foundation of any self-service analytics culture is a stack that is as human-friendly and flexible as other tools commonly used by other business teams. Making data analytics approachable and usable will spark curiosity, and people will take pleasure in exploring data rather than being intimidated by it. By removing the barrier that has stood between them and data for far too long, you will be delighted to see how many people are ready to exercise that curiosity and join the data conversation.

Nelson Cheung then provided a use case on building a self-service analytics culture at Clover. Nelson’s Data Analytics team develops tools and services for Clover teams to make data-driven decisions. For context, Clover builds an ecosystem of hardware and software solutions for merchants to help them run their business.

There are four main components that he explored:

  1. Leveraging cloud technologies: They had limited DevOps resources in the early days. Later, they adopted a cloud-hosted solution with Snowflake, enabling them to spend more time with business stakeholders, flexibly configure compute resources for various workloads, and integrate natively with other analytics tools.

  2. Adopting tools designed for self-service: They learned that certain tools lend themselves more towards empowering users to access data and perform custom analyses on their own. They settled on Heap Analytics to collect user interactions and Sigma Computing to visualize and analyze the data.

  3. Recognizing where users are in their self-serve journey: They categorized their users into 4 phases: 1st-Time User (“I need help. Do you have some time to sit down with me and go step-by-step?”), Spreadsheet Pro (“Just point me to the data, and I can take it from there.”), Super User (“I can create an automated email notification once a threshold is triggered? Awesome!”), and Advocate (“You need to visualize this data over time? I can help!”).

  4. Building solutions that enable each user along their journey: These solutions entail YouTube tutorials for 1st-time users, Sigma/Snowflake/Spreadsheet tables for spreadsheet pros, and sample analytics workflow for super users. The advocates are championed and celebrated for the progress in their journey.

14 - Why Don’t These Numbers Match? Why Information Architecture Matters in Analytics

Mismatched numbers are a common issue that many analytics teams deal with: Why don’t these revenue numbers match? Why did last week’s revenue go down? Which number is correct? Jacob Frackson and Callie White argued that choosing the proper information architecture is the best way to deal with this issue.

  • Attributes of a company with bad information architecture are: Metric naming is on a first-come basis. Everyone has a different conception of “up-to-date.” Data maintainers are more likely to add new columns rather than update existing ones. These lead to miscommunication everywhere.

  • Attributes of a company with good information architecture are: Metric naming is collaborative. “Up-to-date” is defined by the data maintainers. Data maintainers are actively thinking about analysis UX. Communication is contextualized to the persona.

Best practices for designing a strong information architecture can be broken down into two categories: personas and tooling.

Concerning the personas, you should be able to answer the following questions:

  • What are your personas? How will these change over time?

  • Who are your stewards/power users within those personas? What personas should be prioritized given their power users?

  • What language do these personas use? Can we conform to that standard, or does that conflict with our larger architecture?

  • What is the application of the data or information? Is the application more technical or more general usage?

Concerning the tooling, you should be able to answer the following questions:

  • How can you indicate the timeliness or accuracy of your data? How can you indicate which dashboards and tables are up to date and which are deprecated?

  • How can users understand the context around data or visualizations? How can you ensure they are interpreting data correctly? Is there a language that you can use that’s contextualized enough to convey information accurately?

For both personas and tooling, you should constantly think about what the user experience is, what the maintainer experience is, and how you can simplify or improve those with information architecture.

That’s the end of this long recap. I hope you have learned a thing or two on best practices from real-world implementations of the modern data stack. If you have experience building and deploying the modern data stack into production, please reach out to trade notes and tell me more at khanhle.1013@gmail.com! 🎆