What I Learned From DataOps Unleashed 2022

Earlier this month, I attended the second iteration of DataOps Unleashed, a great event that examines the emergence of DataOps, CloudOps, AIOps, and related practices, bringing professionals together to share the latest trends and best practices for running, managing, and monitoring data pipelines and data-intensive analytics workloads.

In this long-form blog recap, I will dissect the session talks I found most useful from the summit. These talks feature DataOps professionals at leading organizations detailing how they establish data predictability, increase reliability, and reduce costs in their data pipelines. If interested, you should also check out my recap of DataOps Unleashed 2021 from last year.

1 - Welcome Remarks

As every company is now a data company, it is natural to see the practice of DataOps exploding. How can data teams keep up with the increasing demand for data? Kunal Agarwal (Co-Founder and CEO of Unravel Data) talked about the 3 Cs that slow data teams down:

  1. Complexity: Complexity arises from the underlying pipelines and stacks that data runs on. Companies cobble together 6-20 systems to make up their modern data stack.

  2. Coordination: Coordination is the collaboration between data engineers, data operations, data architects, and business unit leaders. There is a lack of alignment between these data team members on repeatable and reliable processes.

  3. Crew: There is a shortage of data talent, in which the demand far exceeds the supply.

These challenges continue in the cloud environment, alongside additional challenges around governance and cost.

The DataOps community can help bring about the right tools and processes to tackle them. For context, DataOps uses agile practices to create, deliver, and manage data applications. Done right, it can be the catalyst for accelerating data outcomes:

  1. DataOps tackles complexity challenges by simplifying the management, the flow, the reliability, the quality, and the performance of data applications with full-stack observability. Teams have a single trusted source of truth that they can depend on.

  2. DataOps tackles coordination challenges by automating tasks that would otherwise require experts, allowing teams to scale much faster.

  3. DataOps tackles crew challenges by streamlining the communication between team members with a reliable and repeatable process.

  4. DataOps tackles cloud challenges by promoting intelligent governance, which helps you leverage all the benefits of the cloud without any downside.

2 - The Operational Analytics Loop

Software practices are eating the business, and data is no exception. The rise of DataOps has pushed companies toward more repeatability, flexibility, and speed in data operations and processes. However, despite its best efforts, DataOps has struggled to close the gap between the stakeholders who use data and the data teams that deliver it. Boris Jabes (CEO of reverse ETL and operational analytics pioneer Census) broke down how DataOps teams can finally bridge this gap with operational analytics to achieve the gold standard of the DevOps principles they've adopted: the virtuous cycle of data.

DevOps is “a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production while ensuring high quality.” It comprises various stages, including coding, building, testing, packaging, releasing, configuring, and monitoring (with tools for each stage). Furthermore, this is not just a series of steps but a loop, reducing the time from creation to delivery while ensuring high quality.

Where is the equivalent loop in DataOps? Data teams tend to get trapped in the request-and-wait cycle. The feedback loop between the data team and the business/product teams is ad-hoc. Reverse ETL tools like Census enable Operational Analytics, which takes the data from the warehouse and creates a CI/CD pipeline to deploy analytics live in operational tools. The operational analytics loop looks like the one below:

  1. You get raw data from data sources and build data models to extract business insights from that data.

  2. You next push insights back to operational applications to automate business operations.

  3. You then move new application data into your data sources and close the loop.

Boris Jabes - The Operational Analytics Loop (DataOps Unleashed 2022)

This loop enables the implementation of “Data-as-a-Product” - data artifacts are directly connected to the business in an actionable manner. Instead of being a service provider and order taker, the data team becomes a critical function for maintaining the agility of the business.
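To make the loop concrete, here is a minimal sketch of one pass through it in Python. The warehouse connection (sqlite3 stands in for Snowflake/BigQuery/Redshift), the account_health model, and the CRM endpoint are all hypothetical placeholders, not Census's actual implementation:

```python
import sqlite3
import requests

def pull_insights(conn):
    """Step 1: read a modeled insight (e.g., account health scores) from the warehouse."""
    rows = conn.execute(
        "SELECT account_id, health_score FROM account_health"  # hypothetical data model
    ).fetchall()
    return [{"account_id": r[0], "health_score": r[1]} for r in rows]

def push_to_operational_tool(records):
    """Step 2: sync insights into the operational tool so business teams can act on them."""
    for record in records:
        requests.post(
            "https://crm.example.com/api/accounts/update",  # placeholder CRM endpoint
            json=record,
            timeout=10,
        )

if __name__ == "__main__":
    warehouse = sqlite3.connect("warehouse.db")  # stand-in for the cloud warehouse
    push_to_operational_tool(pull_insights(warehouse))
    # Step 3 happens outside this script: new application data generated by those
    # actions flows back into the warehouse through your ingestion pipelines.
```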

3 - A Farewell to Broken Data Pipelines and Delayed Releases

Achieving data pipeline observability in complex data environments is becoming more and more challenging and is forcing businesses to allocate extra resources and spend hundreds of person-hours carrying out manual impact analyses before making any changes. Joseph Chmielewski (Senior Technical Engineer at Manta) challenged this status quo by presenting how to enable DataOps in your environment, ensure automated monitoring and testing, and ensure that your teams are not wasting their precious time on tedious manual tasks.

A recent publication on DataOps found significant difficulty with orchestrating code/data across tools and monitoring the end-to-end data environment: together, these two challenges made up 44% of responses when those surveyed were asked about the top technical DataOps challenges facing enterprise data teams today.

As data environments become highly complex, blind spots lead to mistakes and incidents. With limited data pipeline observability, those mistakes can’t be easily fixed. Data blind spots are created by “unknown unknowns” - data that was not considered or not known to the user. Complex data landscapes also create “messy data” - multiple datasets following different logic or structure - one of the leading indicators of complexity. In brief, blind spots arise wherever someone lacks complete visibility into the data environment.

Here are some symptoms of unidentified blind spots:

  • A broken dependency causes an application to stop working.

  • An unseen security vulnerability results in a privacy breach.

  • Recent system updates trigger a downstream report failure.

Joseph then walked through a real-world use case of how data blind spots impact an enterprise:

A large automotive company needed to understand its data ecosystem for faster change management and incident prevention. The DataOps team unsuccessfully attempted to make a dependency graph manually. 30-40% of the data engineering team's time was spent on manual tracking of data dependencies and data flows. The DataOps team identified blind spots in their data flow that needed to be resolved, which existed in areas of the organization that required the involvement of other teams. Manual efforts would take too much time, be too error-prone, and were unsustainable long term.

Automated data lineage is a good solution to eliminate blind spots by delivering pipeline visibility to all data users thanks to metadata analysis, end-to-end data flow documentation, and reproducible/auditable/referenceable analysis results over time.
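To illustrate what such a dependency graph gives you, here is a minimal sketch using networkx; the table names are hypothetical, and this stands in for the lineage a tool like Manta extracts automatically from metadata:

```python
import networkx as nx

# Directed edges point from an upstream dataset to its downstream consumer.
lineage = nx.DiGraph()
lineage.add_edges_from([
    ("crm.orders", "staging.orders_clean"),
    ("staging.orders_clean", "analytics.revenue_daily"),
    ("analytics.revenue_daily", "dashboards.exec_kpis"),
])

def impact_of_change(graph, table):
    """Everything downstream of `table` - the blast radius of a schema change."""
    return nx.descendants(graph, table)

print(impact_of_change(lineage, "crm.orders"))
# {'staging.orders_clean', 'analytics.revenue_daily', 'dashboards.exec_kpis'}
```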

Back to the real-world use case: after implementing automated data lineage, the automotive company saw its mappings completed automatically within minutes, gained a complete overview of its data environment, and could map end-to-end data journeys automatically, which helped it understand its processes on an ongoing basis.

As a result, the development, testing, and production data analytics accelerated by 25%. Data operations accelerated by 25%. Application decommissioning and cloud migration accelerated by 30%. Millions of dollars were saved on maintaining compliance with GDPR.

A solution like Manta empowers all data users by giving them clear pipeline visibility. It turns insight into action to accelerate decision making, delivers pipeline visibility to all data users, monitors data conditions over time to ensure data accuracy and trustworthiness, and carries out automated impact analyses to prevent data incidents and accelerate application migration.

4 - Streamline Your Data Stack With Slack

Many teams already use Slack for data pipeline monitoring & alerts - but data engineers at Slack have streamlined their kit with Slack. Ryan Kinney (Senior Data Engineer at Slack) shared how he and his team have integrated their DataOps stack to not only collaborate but to observe their data pipelines and orchestration, enforce data quality and governance, manage their CloudOps, and unlock the entire data science and analytics platform for their customers and stakeholders.

Slack Apps save time and focus. “Off-the-shelf” apps and integrations reduce context switching to keep your work all in one place. Custom development and integrations give you the flexibility to build whatever your stack requires. These Slack apps include JIRA, GitHub, Giphy, Looker, and even custom apps. For custom apps, you can use webhooks to connect almost any outside process, push messages based on outside events (alerts/notifications), trigger outside processes when Slack events occur, create message templates/parameters and populate them with data, and tie processes to a “bot” user to create a conversational interface/allow more flexible messaging and reports.

More specifically, a Slack Bot provides event-driven alerts, scheduled reporting, interactions, and automated responses. Bots can monitor your Slack app’s selected channels for messages, perform actions when the app is mentioned in a message, and be configured for almost any in-Slack behavior.

Additionally, Slack Bolt is Slack’s development framework, with useful functions to manage your Slack app. Bolt is available in JavaScript, Python, and Java, simplifying event handling (with decorator-style listeners) and wrapping API calls in intuitive methods.
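As a flavor of what that looks like, here is a minimal Bolt-for-Python sketch. It assumes SLACK_BOT_TOKEN and SLACK_APP_TOKEN environment variables and a hypothetical /pipeline-status slash command; it is not Slack’s internal code:

```python
import os

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.event("app_mention")
def handle_mention(event, say):
    # Respond in-channel whenever the bot is @-mentioned, e.g. with a pipeline status.
    say(f"Hi <@{event['user']}>, checking the pipeline status for you...")

@app.command("/pipeline-status")
def pipeline_status(ack, respond):
    # Slash command handler: acknowledge first, then reply with a (placeholder) report.
    ack()
    respond("All DAGs green as of the last run.")  # placeholder message

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```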

Ryan Kinney - Streamline Your Data Stack With Slack (DataOps Unleashed 2022)

Midas Touch is an application at Slack built to make the lives of sales reps and the analysts they work with easier. It fundamentally changes how Slack’s sales and customer success teams consume and leverage data. It is built on the Django framework in Python, which provides database support and robust administration out of the box, and Django’s REST API layer allows external systems to interact with it. The application is hosted on Slack’s enterprise Heroku account, enabling the team to deploy and scale easily. Bolt is installed alongside Django as a layer to handle Slack events, generate UI components, and deliver Slack messages.

Ryan Kinney - Streamline Your Data Stack With Slack (DataOps Unleashed 2022)

Midas Touch has a few use cases handled by Bots and connected apps. One implementation of Midas Touch is Midas Slides, an initiative to automate the process of preparing data-driven slide decks for account executives and customer success managers to use in their prospecting and client meetings. This includes not only building the actual presentation but also doing the data work (pulling the data, analyzing the trends to understand companies’ situations/needs, turning trends into compelling visuals, and constructing a solution).

  • A user starts a Midas Slides request with a text command in Slack and sends it to Workbot.

  • The request connects through the Salesforce integration to a list of accounts; Midas Slides then selects the accounts it needs slides for.

  • This request is passed to Looker, which produces CSV summaries of each dashboard/chart/metric, zips them up, and sends the zipped file to the integration management tool Workato via webhook.

  • Workato calls an API endpoint on the core Midas Touch app, triggering a background Celery task (a minimal sketch of this step appears after the diagram below).

  • Celery creates a new copy of the slide template in Google Drive, opens the CSV files, and loads them into a master mapping spreadsheet.

  • The mapping spreadsheet determines how the values will be distributed to the appropriate chart within the slide templates. Those finished slides are shared with the user in a Slack DM via the Google Drive API.

Ryan Kinney - Streamline Your Data Stack With Slack (DataOps Unleashed 2022)
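Here is a minimal, hypothetical sketch of the webhook-to-Celery step described above: a Django endpoint that accepts Workato's call and queues the slide-building work. The function and field names are illustrative, not Slack's actual code:

```python
import json

from celery import shared_task
from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

@shared_task
def build_slides(account_id: str, csv_bundle_url: str) -> None:
    # Download the zipped CSVs, load them into the master mapping spreadsheet,
    # copy the slide template in Google Drive, and DM the finished deck to the user.
    ...

@csrf_exempt
def midas_slides_webhook(request):
    payload = json.loads(request.body)  # assumed JSON payload from Workato
    # Queue the heavy lifting in the background so the webhook can return immediately.
    build_slides.delay(payload["account_id"], payload["csv_bundle_url"])
    return JsonResponse({"status": "queued"})
```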

Ryan’s team also built Midas Signal, another offspring of Midas Touch, to drive impact for Slack's business needs. It monitors account activities and proactively suggests actions to account executives and customer success managers.

  • Signal leverages the Salesforce integration to subscribe to a given account’s signals. It tracks specific customer actions in Salesforce and accesses other accounts’ data from a Snowflake data warehouse.

  • Instead of representing past activities with visualization, Signal drives future activities with insights and suggestions.

  • Slack’s sales team estimated Midas Signal saved them 5,000 hours per month by prioritizing accounts and tracking down contacts while effortlessly driving new deals by surfacing propensity accounts right when they were ready to take the next step.

With a little bit of thought, you can definitely think of innovative ways to incorporate Slack into your own data stack. Not only will it make your work easier and reduce the disruption coming from context switching, but it will also drive measurable business impact by rendering the data more accessible to the people who need it.

5 - Building A Multi-Cloud Data Platform In A Tightly Regulated Industry

Building a data platform in the healthcare industry is complicated when you consider all of the privacy aspects of health data. Babylon Health's Natalie Godec showed how they enabled innovative, AI-driven products while dealing with highly sensitive data and how they chose and integrated data tools to build their multi-cloud platform.

For context, Babylon is on a mission to put an accessible and affordable health service in the hands of every person on Earth. They provide digital-first healthcare when you need it by tapping into a device a lot of people already have (mobile), keeping in touch with one’s health with monitoring capabilities, giving access to a doctor when it’s needed (from wherever you are), and enabling AI-powered tools to provide better care.

Now there’s a lot of data in healthcare, which leads to various challenges such as handling event streams and services all over the world and processing personal information (coping with data privacy, regulatory requirements, data locality, and the public perception of AI). Natalie’s team decided to store sensitive data in the cloud by building a cloud-native data platform:

  • The applications and event streaming run on AWS. EKS manages Kubernetes and MSK manages Kafka. S3 and Redshift are their main data storage.

  • The analytics services run on Google Cloud. BigQuery and Cloud Storage serve as the data lake. They plug in transformation pipelines using Cloud Composer (which manages Airflow) and Cloud Functions (for more lightweight transformations), plus AI notebooks, Kubeflow, and Tableau/Looker for data science, visualization, analytics, and orchestration purposes (a minimal Composer DAG sketch follows this list).
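For a flavor of the orchestration piece, here is a minimal Airflow DAG sketch of the kind Cloud Composer would schedule; the DAG id, schedule, and transform step are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def transform_events():
    # e.g., read raw events from Cloud Storage, clean them, and load them into BigQuery.
    ...

with DAG(
    dag_id="transform_health_events",   # hypothetical name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="transform_events", python_callable=transform_events)
```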

Why did the Babylon team choose these products? That’s because these services best fit their needs:

  • Their runtime platform is on AWS. Databases and cache are on AWS for better latency.

  • Google Cloud has excellent big data and analytics tools. Ease of use was the deciding factor in favor of BigQuery. Google Cloud has hosted Airflow, allowing them to migrate easily.

  • Both cloud providers have a presence in the regions they need. Historically, they attempted to be cloud-agnostic, but using serverless tools is a lot more flexible and cheaper.

Natalie also reminded the audience to build systems with regulations in mind (GDPR, HIPAA, and ISO 27001 are the most widespread). This will determine the products and services you pick and the architecture of the platform you build. It entails checking the certifications of the cloud, SaaS, and PaaS providers; enabling encryption at rest in all data stores and encryption in transit in all communications; enabling audit logging for data access and storing the logs separately; and not bothering with your own PKI unless absolutely required.

Cloud services are public APIs, so how can you protect your data? As serverless products communicate over the Internet, you want to use VPC endpoints (for AWS) and VPC Service Controls (for Google Cloud) to ensure the communication between your private services and the serverless product stays within your sub-network. To connect the two clouds securely, you can use VPN gateways on both sides with the correct configuration and link them with a bi-directional IPSec tunnel.

Moving forward, Natalie’s team is working on getting data from any sources into Kafka and setting up Kafka connect to link up with different endpoints (BigQuery, Cloud Storage, S3, Sheets). They also want to enable global research with ideas such as running federated jobs in different regions.

6 - A Cloud-Native Data Lakehouse Is Only Possible With Open Tech

Torsten Steinbach (Lead Architect at IBM) walked through how his team fostered and incorporated different open tech into a state-of-the-art data lakehouse platform. His session included insights on table formats for consistency, meta stores and catalogs for usability, encryption for data protection, data skipping indexes for performance, and data pipeline frameworks for operationalization.

The evolution of big data systems has gone through 4 phases:

  1. In the 90s, enterprise data warehouses were tightly integrated and optimized systems.

  2. In the 2000s, Hadoop Data Lakes and ELK stacks introduced open data formats and custom scaling on commodity hardware.

  3. In 2015, Cloud Data Lakes became popular with capabilities such as elasticity and use/pay per job, object storage, and disaggregated architecture.

  4. Today, Data Lakehouses have become mainstream thanks to their consistency, data security, performance and scalability, and real-time capabilities.

Torsten Steinbach - A Cloud-Native Data Lakehouse Is Only Possible With Open Tech (DataOps Unleashed 2022)

Big data systems serve the purpose of onboarding big data to analytics. Big data comes from different sources (databases or telemetry streams) and is put into the analytics lifecycle (exploration, preparation, enrichment, optimization, querying, etc.). Traditionally, you need both a cloud data lake and a data warehouse in your big data system to ensure high quality and proper response time SLAs.

A Data Lakehouse is basically a Data Lake with data warehouse quality of service. It handles disaggregated data on object storage; supports open formats (file formats, table formats); has a central table catalog; provides capabilities in elastic and heterogeneous data ingestion, data processing, and query engines; and can be delivered as a service.

Torsten then dissected the six lakehouse-defining data warehouse-style qualities of service: data consistency, schema enforcement, performance and scale, table catalog, data protection, and pipeline automation. It is noteworthy that data lakes have always been dominated by open technology, with examples like Hadoop (an Apache poster child) that provides open formats, open engines, and open hardware. This principle is continued 100% by Data Lakehouses.

1 - Data consistency

Traditional object storage doesn’t allow in-place updates: New records are appended as separate files. Files must be re-written entirely even when only one cell of one row changes or is deleted. Multi-threaded writes and parallel reads can interfere. As a result, consistent versioning of the data cannot be guaranteed within a query execution. Furthermore, in-place reorganization, grooming, and compaction are not possible.

Lakehouse’s solution is to use table formats to version metadata (stored along with data) to handle transactional ACID consistency.

2 - Schema enforcement

Any library, tool, or engine can technically read and write data files on object storage in a cloud data lake. This contributes to the openness of data lake architectures. However, there is no enforcement that the schema of new data files is consistent or at least compatible with existing ones.

Lakehouse’s solution is to use table formats to develop common libraries to read and write, which enforce schema defined in metadata files.

3 - Table catalog

Hive Metastore has been the established gold standard for table catalogs since the Hadoop era. However, the Metastore becomes a bottleneck when file or partition counts reach six or seven digits and must be listed hierarchically.

Lakehouse’s solution is to decentralize file listing information via metadata files stored with the data, using table formats. Additionally, it adds new table catalogs on top of table formats for cross-table transactions and data versioning (Nessie, LakeFS). The next frontier is to converge table catalogs with real-time metadata, Kafka Schema Registry (Confluent), and Apicurio Registry.

Some of the most popular open-source table format projects, released between 2017 and 2019, are Apache Iceberg (Netflix), Delta Lake (Databricks), and Apache Hudi (Uber). A short usage sketch follows the list below.

  • They provide an abstraction between physical files (e.g., Parquet) and logical tables.

  • Data is stored across multiple files comprising data and metadata. Data is ingested and read using dedicated libraries (embedded in engines). Libraries have compound data management functions (compaction, vacuum), enabling scalable metadata via decentralization.

  • They support transactions and versioning at the table level, with snapshot isolation for concurrent query and ingest, time travel queries, “in-place” updates and deletes, and managed schema evolution.
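As one concrete example of what a table format buys you, here is a minimal PySpark sketch using Delta Lake; the table path and sample data are hypothetical, and the session assumes the Delta jars are on the classpath (e.g., via the delta-spark package):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# ACID append: concurrent readers still see a consistent snapshot of the table.
df = spark.createDataFrame([(1, "created"), (2, "shipped")], ["order_id", "status"])
df.write.format("delta").mode("append").save("/tmp/lakehouse/orders")

# Time travel: query the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lakehouse/orders")
v0.show()
```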

4 - Performance and scale

File formats are key factors for performance and scalability. Common formats such as Parquet and ORC include certain statistics such as min/max values and bloom filters (which can be exploited for file-skipping optimization in many engines). Unfortunately, they still require the footer segments of all Parquet/ORC files to be read, and they do not cover full state-of-the-art data warehouse indexing, such as clustered indexes (with include columns).

Lakehouse’s solution is to support dedicated, open-source indexing frameworks (e.g., XSkipper from IBM and Hyperspace from Microsoft). Index data is stored in dedicated Parquet files on object storage so that the lakehouse can index any file format (beyond Parquet and ORC).

5 - Data protection

Traditional data storage uses the Access Control List (ACL) mechanism, with only bucket or object granularity. Fine-grained ACLs (such as per-column access) are not possible this way. Enforcing fine-grained access in query engines is not viable either, as it would kill the open and heterogeneous lakehouse notion. Furthermore, traditional storage places high trust requirements on object storage providers and operators.

Lakehouse needs fine-grained access enforcement and data encryption inside the files. The combined solution is Apache Parquet Encryption, which is independent of the storage layer and keeps trusted data transportable. It supports encryption keys per column and per footer, and ACLs on columns via ACLs on keys (in a key management service).

6 - Pipeline automation

Real-time ETL pipelines for data in motion use Spark or Flink processing on Kafka topic data, optionally abstracted through Apache Beam. They land streaming data on object storage (e.g., into Iceberg) and support stream transformations (topic-2-topic).
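For the streaming landing path specifically, here is a minimal Spark Structured Streaming sketch that reads a Kafka topic and lands it on object storage; the broker address, topic, and bucket paths are placeholders, and the Kafka connector package is assumed to be available:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-lakehouse").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "telemetry-events")            # placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
)

query = (
    events.writeStream
    .format("parquet")                                   # or a table format with the right packages
    .option("path", "s3a://data-lake/raw/telemetry/")    # placeholder bucket
    .option("checkpointLocation", "s3a://data-lake/checkpoints/telemetry/")
    .start()
)
query.awaitTermination()
```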

Batch ETL pipelines with Spark use connectors in open workflow frameworks (MLflow, Luigi, Kubeflow, Argo, Airflow) as well as serverless FaaS with built-in event triggers and sequences.

Torsten Steinbach - A Cloud-Native Data Lakehouse Is Only Possible With Open Tech (DataOps Unleashed 2022)

The diagram above displays IBM’s serverless data lake-as-a-service platform built on open technology (available on IBM Cloud):

  • The central service is SQL Query which performs serverless ingestion, batch ETL processing, cataloging, and querying. It is entirely built on Spark.

  • Cloud Object Storage is fully S3 compatible with various open formats.

  • Event Streams is a managed Kafka service with a schema registry that connects real-time data with Cloud Object Storage.

  • Watson Studio is a Kubeflow-based orchestration service to automate pipeline jobs.

  • The Cloud Functions / Code Engine service integrates serverless and FaaS functions using event triggers and sequences.

  • You can connect this whole platform to your own application with Python, JDBC, REST API, and other connectors such as Dremio.

Torsten Steinbach - A Cloud-Native Data Lakehouse Is Only Possible With Open Tech (DataOps Unleashed 2022)

IBM’s real-time data lakehouse vision looks like the diagram above. The key point is that they plan to re-centralize the metadata layer and integrate it with the Kafka schema registry so that they have one unified metadata store for data at rest and data in motion. This metadata store manages indexing/key mapping information and adopts table formats. Everything that you can do on object storage can be done in real time. Furthermore, open services such as Spark, PrestoDB, Dremio, and Trino can be integrated into IBM’s lakehouse platform to take advantage of its prowess.

7 - The Origins, Purpose, and Practice of Data Observability

Data Observability (DO) is an emerging category that proposes to help organizations identify, resolve, and prevent data quality issues by continuously monitoring the state of their data over time. Kevin Hu (Co-Founder of Metaplane) did a deep dive into DO, starting from its origins (why it matters), defining the scope and components of DO (what it is), and finally closing with actionable advice for putting observability into practice (how to do it).

Since the mid-2010s, data warehouses have emerged at the top of the data value chain, and an ecosystem of tools has co-evolved alongside them, including easy ways to extract and load data from sources into a warehouse, transform that data, and then make it available for consumption by end users. With the adoption of the “modern data stack,” more and more data is centralized in one place, where it is used for critical applications like powering operational analytics, training machine learning models, and powering in-product experiences. Importantly, we can keep changing the data even after it’s in the warehouse, lending flexibility to previously rigid data. Together, these trends have led to increasing amounts of data, increasing importance of data, and increasing fragmentation of vendors.

Data teams can learn a lot from software engineering teams. Software observability tools such as Datadog, Grafana, and New Relic gave every engineering team the ability to quickly collect metrics over time across their systems, giving them instant insight into the health of those systems. They have transformed the software world from a convoluted, low-information environment that resisted any closer look into a highly visible, observable one.

Software Observability is built on top of the three pillars of metrics, traces, and logs.

  1. Metrics are numeric values that describe components of a software system over time, like the CPU utilization of their microservices, the response time of an API endpoint, or the size of a cache in a database.

  2. Traces describe dependencies between pieces of infrastructure, for example, the lifecycle of an application request from an API endpoint to a server to a database.

  3. Logs are the finest-grained pieces of information, describing both the state of a piece of infrastructure and its interaction with the external world.

With these three pillars in mind, software and DevOps engineers can gain increased visibility into their infrastructure throughout time.

Kevin emphasized that the problems data observability tools solve are simple to state: time (debugging in the dark costs time and peace of mind), trust (trust in data, as in all things, is easy to lose and hard to regain), and cost (the cost of bad data is the cost of bad decisions). Inspired by the software pillars above, Kevin derived four key pillars of data observability (a small metrics sketch follows the list):

  1. Metrics: If the data is numeric, properties include summary statistics about the distribution like the mean, standard deviation, and skewness. If the data is categorical, summary statistics can include the number of distinct groups and the uniqueness of values. Across all types of data, metrics like completeness, accuracy, and whether the data includes sensitive information can be computed to describe the data itself.

  2. Metadata: Metadata includes properties such as data volume, data schema, and data freshness - all can be scaled independently while preserving the statistical characteristics.

  3. Lineage: Lineage of data entails bidirectional dependencies between datasets, which range in level of abstraction from lineage between entire systems, tables, columns in tables, and values in columns.

  4. Logs: Logs capture how data interacts with the external world. These interactions can be categorized as machine-machine (movement, transformations) and machine-human (creating new data models, consuming dashboards, building ML models).
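Here is a minimal sketch of what computing the metrics and metadata pillars over a single table could look like with pandas; the table and columns are hypothetical, and this is illustrative rather than how Metaplane works internally:

```python
import pandas as pd

df = pd.read_parquet("orders.parquet")  # stand-in for a warehouse table

observations = {
    # Metrics: describe the data itself.
    "row_count": len(df),
    "amount_mean": df["amount"].mean(),
    "amount_std": df["amount"].std(),
    "amount_skew": df["amount"].skew(),
    "status_distinct_groups": df["status"].nunique(),
    "email_completeness": 1 - df["email"].isna().mean(),
    # Metadata: describe the data from the outside (volume, schema, freshness).
    "schema": dict(df.dtypes.astype(str)),
    "freshness_lag": pd.Timestamp.now() - df["updated_at"].max(),  # assumes naive timestamps
}
print(observations)
```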

In order to put data observability into practice, you need to figure out what business goals your organization wants to achieve. There are five prominent ones (ordered by an increasing level of abstraction): saving engineering time, avoiding the cost of lapses in data quality, increasing data team leverage, expanding data awareness, and preserving trust.

How can you actually measure progress? Kevin pointed out some example metrics for two goals:

  1. If you want to save engineering time, look at the number of data quality issues, the time to identification, and the time to resolution.

  2. If you want to preserve trust, look at qualitative stakeholder surveys (e.g., NPS), the number of inbound tickets, and Service Level Objectives/Agreements.

Data quality involves and impacts the entire organization, but data teams are ultimately held responsible. Therefore, you want to identify the relevant stakeholders to get buy-in early on: be it the data producers sitting in engineering, product, and go-to-market teams; the data analysts performing analysis and constructing data models; or the data consumers like executives, investors, and financial analysts.

Next, you want to set in place processes to drive up the effectiveness of your data observability initiatives: prevention (defining contracts, agile, unit tests), active observability (using CI/CD hooks), passive observability (continuous monitoring), and incident process playbook (PICERL). Many tools can help set up these processes: trying out open-source projects, building in-house, or leveraging commercial solutions.
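As an example of the prevention process, here is a minimal pytest-style sketch of data quality unit tests that could run in a CI/CD hook before changes ship; the model, columns, and expectations are hypothetical:

```python
import pandas as pd

def load_model() -> pd.DataFrame:
    # In CI this would build/query the data model under test; here it's a placeholder file.
    return pd.read_parquet("analytics_revenue_daily.parquet")

def test_primary_key_is_unique():
    df = load_model()
    assert df["date"].is_unique, "duplicate dates break downstream joins"

def test_no_negative_revenue():
    df = load_model()
    assert (df["revenue"] >= 0).all(), "revenue should never be negative"

def test_freshness_within_two_days():
    df = load_model()
    lag = pd.Timestamp.now() - pd.to_datetime(df["date"]).max()
    assert lag <= pd.Timedelta(days=2), "model is stale"
```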

You can set up data observability in less than 10 minutes with Metaplane, so check the product out!

8 - Building a Resilient System: How Wistia Ensured Interoperability and Observability in Their Data Pipelines

Data teams spend what seems like countless hours playing data detective to find out where their pipelines fail. Donny Flynn (customer data architect at reverse ETL pioneer Census) and Chris Bertrand (data scientist at Wistia) broke down the importance of interoperability and observability in building resilient data systems and how Reverse ETL helps you shift from data detective to data hero.

Census is a Reverse ETL tool that enables Operational Analytics, following a “hub-and-spoke” model in the modern data stack. As seen in the diagram below, Snowflake is the cloud data warehouse serving as the single source of truth for your business. Spokes come in from different business applications and SaaS tools. You have transformations (dbt) happening within the data warehouse and insight consumption with BI tools (Looker). You might also have event ingestion coming from Segment into your warehouse. Census syncs data from your warehouse and BI tools to your operational tools.

Census is also interoperable with your existing data pipeline ecosystem, whether that includes ingestion solutions like Snowplow, Kafka, or Fivetran bringing data into your data hub, or job orchestrators like Airflow, Prefect, or Dagster coordinating data jobs.

Observability becomes top of mind for data engineers, considering the various services data pipelines interact with: managed servers, changing data types, memory constraints, CPU constraints, unpredictable data volumes, and buggy code. Furthermore, getting data into 3rd party services layers on record-level failures, schema nuances, rate limits, and API usage cost (with each tool having a different behavior for these concepts). 

Going hand in hand with observability is alerting. How are you alerted when something goes wrong? It is a delicate balance to avoid silent failures but also to only be disrupted when something important is actually broken. It is crucial to determine when and how you are paged when something fails, even as your organization develops and implements new processes.

So how did Wistia leverage Census to make observability a reality? Wistia’s initial data architecture looks like the diagram below:

  • Wistia’s mobile and desktop users interact with the main application, which is written in Rails and backed by MariaDB. Their sales and customer success teams use Salesforce, so it is essential for them to get information from the main application about how people use Wistia. Therefore, they built a custom integration that makes API calls into Salesforce.

  • At the same time, they were using Fivetran to replicate MariaDB instances into their Amazon Redshift warehouse.

Why did they decide to change this architecture? There were five main issues with the custom Salesforce integration.

  1. Changes to the data warehouse required an app-level code review, which was cumbersome.

  2. Changes to the data warehouse required working with a Rails (Ruby) codebase, which limited adoption from non-technical stakeholders.

  3. They only had access to application data.

  4. The architecture is customized for Wistia’s CRM schema.

  5. Any system failures got sent to a garbage fire of a Bugsnag instance, so they were reactive in dealing with these errors.

Wistia’s new data architecture adds Census into the mix. They leverage Census’ API from Python scripts to manage the syncing of data from Redshift to Salesforce (a minimal sketch of such a script appears after the list below). As a result, they no longer need to send data from Rails to Salesforce. This change led to the following improvements:

  1. Changes to the data warehouse now require passing dbt tests.

  2. Changes to the data warehouse require working with SQL, a more familiar language.

  3. They now have access to all of the data in their warehouse.

  4. The architecture is still customized for Wistia’s CRM schema.

  5. System failures now trigger email alerts and monitoring plugs into their internal systems.
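Here is a minimal, hypothetical sketch of the kind of Python script described above: triggering a Census sync and flagging failures so they are never silent. The endpoint path, payload shape, and auth handling here are assumptions for illustration, not a documented Census API reference:

```python
import os
import time
import requests

CENSUS_API = "https://app.getcensus.com/api/v1"   # assumed base URL
TOKEN = os.environ["CENSUS_SECRET_TOKEN"]         # assumed auth scheme
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def trigger_sync(sync_id: int) -> dict:
    """Kick off a sync run and return the API response."""
    resp = requests.post(f"{CENSUS_API}/syncs/{sync_id}/trigger", headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()

def alert_on_failure(sync_id: int, result: dict) -> None:
    # Hook this into email/Slack alerting so failures surface immediately.
    if result.get("status") != "success":
        print(f"Sync {sync_id} did not succeed: {result}")

if __name__ == "__main__":
    result = trigger_sync(sync_id=12345)  # placeholder sync id
    time.sleep(5)
    alert_on_failure(12345, result)
```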

That’s the end of this long recap. DataOps is a topic that I will focus on a lot in the upcoming months. If you have experience using tools to support DataOps for the modern machine learning stack, please reach out to trade notes and tell me more at james.le@superb-ai.com! 🎆