What I Learned From Attending DataOps Unleashed 2021
Last week, I attended DataOps Unleashed, a great event that brought DataOps, CloudOps, AIOps, and other data professionals together to discuss the latest trends and best practices for running, managing, and monitoring data pipelines.
In this long-form recap, I will dissect the session talks I found most useful. They come from DataOps professionals at leading organizations, detailing how they establish data predictability, increase reliability, and create economic efficiencies with their data pipelines.
1 — Data Quality in DataOps
As the world’s leading tool for data quality, Great Expectations occupies a unique position in the DataOps ecosystem. Over the last year, thousands of data scientists, engineers, and analysts have joined the Great Expectations community, making it one of the fastest-growing data communities in the world. Moreover, Great Expectations integrates with many other DataOps tools, giving its developers a unique perspective on how the ecosystem is developing.
Deployment Patterns For Data Quality
Abe Gong shared examples, patterns, and emerging best practices for data quality from the Great Expectations community. He brought up the two common deployment patterns for data quality: the data warehouse pattern and the MLOps pattern.
There are 4 data validation steps within the data warehouse pattern (a minimal code sketch of the source-data check follows this list):
Validate the source data for completeness, freshness, and distribution.
Validate the extracted data as it goes from sources to a staging environment.
Validate the data transformation procedure.
If all the validation steps pass, the data gets promoted into production and goes into a data warehouse.
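To make these checks concrete, here is a minimal sketch using Great Expectations' pandas-style API; the file name, column names, and thresholds are illustrative assumptions rather than anything shown in the talk.

```python
import great_expectations as ge
import pandas as pd

# Load one batch of source data (file and column names are illustrative).
batch = ge.from_pandas(pd.read_csv("orders.csv"))

# Completeness: key columns must be populated and unique.
batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_be_unique("order_id")

# Distribution: amounts should stay within a plausible range.
batch.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

# Run all expectations; only promote the batch downstream if everything passes.
results = batch.validate()
if not results.success:
    raise ValueError("Source validation failed; halting promotion to staging.")
```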
Similarly, there are also 4 data validation steps within the MLOps pattern:
At the data ingestion layer (where one team produces the data and another team consumes the data), data quality checks will keep these teams in sync.
At the data cleaning and feature engineering layers (where specific business and domain logic lives), it is imperative to check for de-duplication, schema changes, uniqueness, and feature drift.
At the model output layer (which is the last point of contact before the model serves predictions to the end-user), we should check the output's histogram and distribution.
Data Quality and DataOps Trends
Abe then shared his learnings on how data quality and DataOps are reshaping data workflows and collaboration.
Unlike observability and monitoring, data quality faces out from engineering. This means that data quality is an external-facing function, not an internal-facing function. There are various stakeholders (operations, finance/accounting, etc.) who will produce and consume the data; thus, data quality will be top of mind for all of them.
Over time, every organization’s data quality layer becomes a rich, shared language. This is the result of the rich collaboration between engineering and domain expertise.
Data quality must integrate with tools at every level of the DataOps stack: database, data warehouse, data exploration, data transformation, etc. In fact, there is an emerging term called the “dAGS” stack (dbt, Airflow, Great Expectations, and Snowflake, or alternatively Databricks, Great Expectations, and Spark).
2021 is the year when data quality becomes a priority for morale and retention. There will be a big dividing line between working in environments where the data platform has a lot of technical debt versus working in places where good data quality is ensured for predictable development.
2 — DataOps Automation and Orchestration with Fivetran and The Modern Data Stack
Many organizations struggle with creating repeatable and standardized processes for their data pipelines. Fivetran reduces pipeline complexity by fully managing the extraction and loading of data from a source to a destination and orchestrating transformations in the warehouse. Nick Acosta from Fivetran gave a talk explaining and evaluating the benefits currently available from a DataOps approach with Fivetran and the rest of the modern data stack.
The Problem With Data Pipelines
An astonishing amount of resources is wasted on data pipelines:
52% of companies use 11 or more data sources (using antiquated integration approaches).
92% of analysts perform data integration tasks.
90% of analysts report several data sources being unreliable.
68% of analysts say they lack time to implement profit-driving ideas.
The root cause for this is that data pipelines are manual and brittle. A common pipeline includes identifying sources, scoping the analyses, defining the schema, building the ETL/ELT pipelines, and reporting the insights. Whenever a schema change or new data is required, the data integration phase breaks down.
Fivetran Benefits
The Fivetran platform provides automated data integration with these three core value props:
Automated: Fivetran creates and maintains a perfect replica of your data, with minimal user intervention across 150+ connectors.
Resilient: The core architecture of Fivetran enables recovery from any point of failure with no user intervention required.
Cloud-Native: Their engineers monitor and update their code as the data sources change.
This approach is simple, stable, secure, and scalable, going directly from sources to insights with fully managed pipelines and zero configuration. It offers automatic data updates, automatic schema migrations, automated recovery from failure, a micro-batched architecture, and an orchestration tool to make the procedure extensible.
Here are the top three use cases for orchestrating ETL/ELT pipelines:
Syncing syncs: This includes identifying transformations across sources, programmatically prioritizing syncs, and reaping the inherent benefits of automated data integration.
Triggering transformations: Transforming the data too late leads to latency problems and SLA issues. Transforming the data too early leads to missing data and integrity issues.
DataOps management: This includes managing data tasks at the source layer and the business intelligence layer, plus integrating projects across data teams.
Fivetran integrations are available in dbt, Prefect, and Airflow now!
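As a hedged sketch of what that orchestration can look like, the Airflow DAG below triggers a Fivetran sync, waits for it to finish, and then runs dbt. It assumes the community airflow-provider-fivetran package, an Airflow connection named fivetran_default, and an illustrative connector_id; check the provider's documentation for the exact module paths and parameters.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# These imports assume the airflow-provider-fivetran package; verify the exact
# module paths and arguments against the version you install.
from fivetran_provider.operators.fivetran import FivetranOperator
from fivetran_provider.sensors.fivetran import FivetranSensor

with DAG(
    dag_id="fivetran_dbt_pipeline",
    start_date=datetime(2021, 3, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Kick off the Fivetran sync for one connector (connector_id is illustrative).
    trigger_sync = FivetranOperator(
        task_id="trigger_fivetran_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="warehouse_orders",
    )

    # Wait until Fivetran reports that the sync has landed fresh data.
    wait_for_sync = FivetranSensor(
        task_id="wait_for_fivetran_sync",
        fivetran_conn_id="fivetran_default",
        connector_id="warehouse_orders",
        poke_interval=60,
    )

    # Transform only after the new data is in the warehouse (avoids the
    # "too early" problem described above).
    run_dbt = BashOperator(
        task_id="run_dbt_models",
        bash_command="dbt run --project-dir /opt/dbt/analytics",
    )

    trigger_sync >> wait_for_sync >> run_dbt
```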
3 — DataOps Principles and Practices
DataOps has grown because of the need to support execution at scale in the data management space. Vijay Kiran, Head of Data Engineering at Soda Data, presented how the practice of DataOps is fundamental to monitoring and managing data as it moves across the stack (from source to data product), so that the business can trust its data and transform how analytics works.
The first principle is continuous delivery — automating and orchestrating all data flows. Necessary steps include: (1) automate deployment with continuous delivery pipelines, (2) discourage manual data wrangling, (3) run the data flows using an orchestrator for backfilling, scheduling, and measuring pipeline metrics (like SLA). Industry-leading tools for continuous data delivery are Airflow, Luigi, and Prefect.
The second principle is continuous integration — testing data quality in all stages of the data lifecycle. Necessary steps include: (1) testing the data arriving from sources with unit tests and schema/SQL/streaming tests, (2) validating data at different stages in the data flow, (3) capturing and publishing metrics, and (4) re-using test tools across projects. Industry-leading tools for continuous data integration are Soda SQL and DbFit.
The third principle is data observability — monitoring the quality and performance metrics across the data flows. Necessary steps include: (1) defining data quality metrics (technical, functional, and performance), (2) visualizing metrics, and (3) configuring meaningful alerts. Notable data testing tools are Soda SQL to capture metrics and Soda Cloud to monitor metrics and raise alerts.
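To illustrate this principle, here is a minimal sketch in plain pandas (not Soda-specific); the metric names and thresholds are assumptions. In a real setup the metrics would be published to a dashboard and the alerts routed to a tool like Slack or PagerDuty.

```python
import pandas as pd

def capture_metrics(df: pd.DataFrame) -> dict:
    """Compute a few simple technical data quality metrics for a table."""
    return {
        "row_count": len(df),
        "missing_customer_id_pct": df["customer_id"].isna().mean() * 100,
        "duplicate_order_pct": df["order_id"].duplicated().mean() * 100,
    }

def check_thresholds(metrics: dict) -> list:
    """Return human-readable alerts for metrics outside their allowed range."""
    alerts = []
    if metrics["row_count"] == 0:
        alerts.append("Table is empty: the upstream load may have failed.")
    if metrics["missing_customer_id_pct"] > 1.0:
        alerts.append("More than 1% of rows are missing customer_id.")
    if metrics["duplicate_order_pct"] > 0.0:
        alerts.append("Duplicate order_id values detected.")
    return alerts

# Example usage with a tiny in-memory table.
orders = pd.DataFrame({"order_id": [1, 2, 2], "customer_id": [10, None, 12]})
for alert in check_thresholds(capture_metrics(orders)):
    print(alert)
```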
The fourth principle is data semantics — building a common data and meta-data model. Necessary steps include: (1) creating a common data model, (2) sharing the same terminologies and schemas across engineering, data, and business teams, and (3) using a data catalog to share knowledge. dbt is the leading tool for data model building. Various data catalog options exist, such as Alation, Collibra, Amundsen, and data.world.
The fifth principle is cross-functional teams — empowering collaboration among data stakeholders. The first step is to use knowledge in cross-functional teams to define important KPIs/metrics and align objectives with the business goals. The second step is to remove bottlenecks for analytics usage by enabling self-service data monitoring and democratizing access to data. Soda Cloud is a tangible option for this.
4 — DataOps For The New Data Stack
Shivnath Babu from Unravel Data demystified the new data stack that thousands of companies deploy to convert data into insights continuously and with high agility. This stack continues to evolve with the emergence of new data roles like analytics engineers and ML engineers and new data technologies like lakehouses and data validation. A new wave of operational challenges has emerged with this stack that will derail its success unless addressed from day one.
Good News and Bad News
The good news is that creating data pipelines is easy with the new data stack:
If you take the batch ingestion of the data, this might involve orchestrating different tasks, managing different entities, and making data available at the right place at the right time. Solutions like Azure Data Factory, Fivetran, Airflow, and Google Dataflow make it easy to run these batch ingestions.
If you take the stream ingestion and stream processing of the data, this is critical to building real-time stores and real-time applications. You can use solutions like Fluentd or Kafka to ingest logs, Spark or Flink to process them, and Druid, Pinot, or MongoDB to serve them in real time (see the sketch after this list).
For business intelligence and advanced analytics use cases, you can rely on data lake solutions (like Databricks’ Delta Lake) or batch processing engines (like Trino or Presto).
If you are more of a data warehouse person, then you have solutions like Snowflake, Amazon Redshift, or Google BigQuery to make the process of extracting insights from the data warehouse easy. This extraction process usually happens in batch or via an advanced transformation tool like dbt.
There are many new solutions for the more complex machine learning use cases, like PyTorch, TensorFlow, and Amazon SageMaker. To make data discovery easier, you have data catalog solutions like DataHub and Amundsen. To publish and consume the rich features generated from the data, you have feature store options such as Tecton.
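To make the stream-ingestion path concrete, here is a minimal PySpark Structured Streaming sketch that reads JSON events from Kafka and computes per-minute counts; the broker address, topic, and schema are illustrative, and running it requires the Spark Kafka connector package.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StructField, StructType, StringType, TimestampType

spark = SparkSession.builder.appName("stream-ingestion-sketch").getOrCreate()

# Schema of the incoming JSON events (illustrative).
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Read raw events from Kafka as a streaming DataFrame.
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Parse the JSON payload and count events per type in one-minute windows.
counts = (
    raw.select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .groupBy(window(col("event_time"), "1 minute"), col("event_type"))
    .count()
)

# Write rolling counts to the console; a real pipeline would write to a
# serving store such as Druid, Pinot, or MongoDB instead.
query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```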
However, the bad news is that the curse of the 3C’s (complexity, crew, and coordination) hinders the building of data pipelines.
Different systems add to the pipeline complexity.
It’s not easy to find skilled engineers and data scientists to manage, monitor, troubleshoot, and tune these pipelines.
Different team members are responsible for different parts of the pipeline.
Based on a survey of 200+ companies on how they manage their data pipelines, the Unravel team identified these common challenges:
Fixing problems takes weeks.
Users are always complaining.
Most of the time is wasted on bad data.
The pipeline schedules are all messed up.
Developers do not test their pipelines.
Migrating to the cloud has been mostly a failure.
SLA misses are creating problems.
Cost reduction is a top priority.
The Data Pipeline Maturity Scale
From these anecdotes, Shivnath brought up the data pipeline maturity scale, which includes six phases, to make these challenges actionable:
Phase 1: Detecting and fixing problems take weeks.
Phase 2: Problems are detected after the fact.
Phase 3: There is one central place with all the data needed to root-cause problems.
Phase 4: Problems are detected proactively.
Phase 5: Root causes of problems are identified automatically.
Phase 6: Data pipelines are self-healing.
He argued that most companies are in phases 1 and 2, but improving practices can get us all to phases 4 and 5! The Unravel product is designed for this, so check them out.
5 — Unleashing Excellent DataOps with LinkedIn DataHub
LinkedIn DataHub was open-sourced to enable other organizations to harness the power of metadata and unleash excellent DataOps practices. Doing DataOps well requires bringing together multiple disciplines of data science, data analytics, and data engineering into a cohesive unit. However, this is complicated because there are a wide variety of data tools that are in use by these different tribes.
Shirshanka Das, who founded and architected DataHub at LinkedIn, described its journey in enabling DataOps use-cases on top of the metadata platform. He started the talk with the macro trend in the data landscape: As enterprises are rushing to be data-driven and investing in the cloud, the data ecosystem has been categorized into segments with multiple good projects attempting to solve those segments well. These segments include data ingestion, data prep, workflow management, data transformation, data quality, data serving, data visualization, online data stores, data stream stores, data lakes, and data warehouses.
However, there are still various pain points faced by both the humans and the machines (let’s say for a typical software company):
Which is the authoritative customer dataset?
What are all the dashboards that are powered by the customer dataset?
Is the customer dataset ready to be consumed?
Are we sharing customer data with third parties?
Do salespeople have access to customer data on my platform? Did they ever?
LinkedIn’s open-sourced DataHub project was built to solve these problems. At the moment, it has 2.8k GitHub stars, 75+ contributors, and 100+ commits per month. It has been adopted by many leading organizations such as Expedia, Typeform, ThoughtWorks, Klarna, Spot Hero, Saxo Bank, Grofers, Viasat, and Geotab.
DataHub’s core principle is that metadata lies at the heart of each of these systems. Therefore, if we can liberate and standardize the metadata locked inside them, we can drive innovation that does not depend on the specific technology being used.
The end-to-end architecture of LinkedIn’s metadata platform (shown in the talk’s architecture diagram) works in four steps:
Step 1 is to liberate metadata. They pulled the metadata out of the data sources with Python code scheduled using Airflow. The metadata can be either (1) written into Kafka and streamed into DataHub automatically (see the sketch after these steps), or (2) pushed straight into the service tier over HTTP.
Step 2 is to store metadata. They wrote metadata into a document store or a key-value store. They also emit metadata commit logs, which allow them to subscribe to the changes in metadata.
Step 3 is to index metadata. Taking the commit log stream from step 2, they applied it to the index they want to use. DataHub supports Elasticsearch for the search index and Neo4j for the graph index.
Step 4 is to serve metadata. Bringing everything together, they have a service API that is scalable for different query types (search, relationship, stream, etc.).
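As an illustration of step 1, here is a minimal sketch of publishing a metadata record to Kafka with kafka-python; the topic name and payload shape are assumptions, since DataHub's real ingestion uses its own MetadataChangeEvent schema.

```python
import json
from datetime import datetime, timezone

from kafka import KafkaProducer

# Connect to the metadata Kafka cluster (address and topic are illustrative).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A simplified metadata record pulled from a source system. DataHub's actual
# events follow its MetadataChangeEvent schema rather than this ad-hoc shape.
metadata_record = {
    "dataset": "warehouse.public.customers",
    "platform": "snowflake",
    "owner": "data-platform-team",
    "fields": [{"name": "customer_id", "type": "NUMBER"}],
    "extracted_at": datetime.now(timezone.utc).isoformat(),
}

producer.send("metadata-change-events", metadata_record)
producer.flush()
```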
The first killer app that most companies need and want is a data discovery system, which is the DataHub frontend. You can explore and browse through different entities, collect good data signals/matches (social search), and understand where data comes from (lineage). Generally speaking, data scientists are more interested in operational questions such as:
Which system did this dataset originate from, and how fresh is it?
Which other systems consume this data?
Is this a time-partitioned dataset or a snapshot dataset?
Will this dataset break my pipeline?
Who is on-call for this dataset?
Can I set alerts on this dataset for important changes?
DataHub is perfectly positioned to answer these questions:
A stream of changes is already being pushed.
There is an end-to-end map of all data so that users can non-intrusively observe data.
Troubleshooting can leverage context and social signals.
There is an efficient routing of “pages” to the right on-call.
It can connect consumers and producers efficiently based on lineage information.
Pro-active data quality monitoring support will prevent incidents.
DataHub is under active development, with more robust observability features on the way: dataset operational health at a glance, deeper insights at the dataset level, data quality integration with Great Expectations, an event-oriented view of what has happened lately, and an event log shown across all data producers.
The future of excellent DataOps will be unfolded across these three phases:
Wide visibility to provide a global snapshot of all datasets, pipelines, and relationships.
Deep visibility to understand how data is changing over time.
Foresight to anticipate how data will change before it does.
6 — ELT-G: Locating Governance In The Modern Data Stack
ELT is a data ingestion pattern that promotes an “extract first, model later” approach to building data workflows. While it saved time for data teams and enabled more agile development, the buzz about ELT has not given proper credit to its silent G: governance. Unlike transformations, proper governance, and in particular, securing access to data, cannot be deferred until later. This process requires clear, consistent principles to be implemented by data teams. Stephen Bailey from Immuta provided a framework for thinking about data governance in an ELT landscape, introduced policy-based access controls, and suggested how data teams can get started with better governance today.
The “modern data stack” enables rapid development of data products and high interoperability between systems. Modern data governance is the purposeful administration of metadata across these technologies and the organization. However, the specific governance processes are so variable right now that it is more important to define where governance should happen than what it should actually do. Thus, Stephen’s ELT-G thesis is that modern data governance is maximally effective at the data product’s point of release.
ELT and Data Decentralization
Traditional data warehouses are powerful, efficient, and bound to a specific infrastructure. They are marked by large up-front investments, extended planning cycles, pre-planned ETL processes, centralized analytic transformations, and controlled distribution. Metadata, stats, user roles, access, and security are all managed on the database management system.
On the other hand, modern applications and cloud platforms are decoupled, modular, and liberated from the infrastructure. These modern ELT workflows increase the efficiency of data migration by decoupling Extract and Load from Transform. The point-and-click data replication creates standardization around replication. Moreover, companies can publish end-to-end data products out of their own data if they know the schema and desired analytic ends because they don’t have to worry about the underlying storage technology. The ELT approach has led to increases in quality, quantity, and variety of data products and tools in the ecosystem.
Data proliferation is a fact of life in modern enterprises. Ease of portability, transformation, and variety of tooling means that data consumers can quickly become data providers as well. Over-exertion of control will inevitably lead to wasted effort and “shadow clouds.” Therefore, the data governance challenge is more acute than ever. If it is not addressed, organizations face economic risks (ideas based on bad data or financial damage from misuse), culture risks (a data-driven culture that is ineffective or stifling), and human risks (privacy violations, security breaches, and unethical use of data causing real human harm).
Governance and Data Processing
The well-known a16z article on the emerging data infrastructure brings up four functions that do not fit neatly into any category: metadata management, quality and testing, entitlements and security, and observability. Stephen believes these functions belong to modern data governance:
Metadata management is about establishing metadata standards across the organization.
Data quality is about evaluating existing data assets against an expectation framework.
Observability is about collecting and translating system metadata into actionable intelligence.
Entitlements and security are about using metadata to enable access and protect assets.
Where does governance happen? Stephen argued that governance happens where people, data, and metadata meet, which is at the point where data is released. In Stephen’s ELT-G framework, there are three layers: the metadata layer, the data layer, and the consumption layer. The Immuta data governance platform sits at the intersection of data processing and data consumption, ingesting data from identity providers, pipelines, and catalogs, and creating secured views so that users can access and use the data in tools like Looker and Tableau.
Because access control is the original metadata problem, Stephen then presented four ways that a metadata-driven approach modernizes data access control. The overall objective is to get the right data to the right people at the right time for the right reasons.
1 — Separates utility and governance transformations
There are two types of transformations that teams have to manage. The first is utility transformations to tailor data to a particular use case (standardization, aggregation, data cleaning, denormalization, metric calculation, and dimensional modeling). The second is governance transformations to secure the data against some adverse outcome (access control, row-level security, data sandbox provisioning, identified masking, data minimization, k-anonymization).
At a small scale, teams can get by with having next to no access controls. The second step is often to create a copy of the data. Then come the requirements to restrict access to some resources, but not others, at both the data source and report levels. Immuta’s approach is to add an external governance engine to allow the logic to be abstracted from the transformation layer.
2 — Abstracts policy from the underlying technology
Let’s hypothesize that you have implemented access controls on your cloud data warehouse and are happily serving data. What if your cloud vendor increases prices and you want to switch to a different system? All those meticulous transformations you enacted on that one system now need to be duplicated on other systems! Immuta’s approach is to externalize governance transforms to provide a method for controlling access that is platform-agnostic.
3 — Establishes a comprehensible language around data
Abstraction of security from the technology requires developing a metadata framework that anyone in the organization can use to identify and control sensitive data. This requires a list of authoritative data and user attributes and clear policies for access controls. Typically, this list includes personal data, privileged data, and data origin/residency. To establish a comprehensible language around data, your team needs to experiment with different levels of access controls that can be role-based (simple to implement but susceptible to role bloat), attribute-based (enabling complex policies yet challenging to set up), and policy-based (ideal for managing access to disparate resources via transparent rationale). Overall, a proper metadata framework can simplify policy writing and enable communication across multiple lines of business.
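For intuition, here is a toy sketch of an attribute/policy-based access decision (not Immuta's implementation); the user attributes, data tags, and the policy itself are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    region: str
    attributes: set = field(default_factory=set)   # e.g. {"pii_trained"}

@dataclass
class Column:
    name: str
    tags: set = field(default_factory=set)         # e.g. {"personal_data"}

def can_read(user: User, column: Column, purpose: str) -> bool:
    """Policy: personal data is readable only by PII-trained users in the EU,
    and only for an approved purpose; everything else is open."""
    if "personal_data" not in column.tags:
        return True
    return (
        "pii_trained" in user.attributes
        and user.region == "EU"
        and purpose in {"fraud_review", "customer_support"}
    )

# Example usage: the same policy applies no matter which warehouse holds the column.
analyst = User("dana", region="EU", attributes={"pii_trained"})
email = Column("email", tags={"personal_data"})
print(can_read(analyst, email, purpose="fraud_review"))  # True
print(can_read(analyst, email, purpose="marketing"))     # False
```

Because the decision is driven by tags and attributes rather than hard-coded roles, a policy like this survives a change of underlying warehouse, which is exactly the abstraction argued for above.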
4 — Unlocks meaningful privacy controls
Current ELT design encourages a “collect first, analyze later” approach to data processing. A better approach is Privacy-by-Design, which requires forethought and principled design decisions (process monitoring, security controls, limit data collection, continuous evaluation, and privacy impact assessments). To unlock meaningful privacy controls, we need to understand why data is being used and what its impact is, not just what the rules are. There are some great examples in practice, such as the Institutional Review Boards, data privacy impact assessments, Google’s Model Cards, and Datasheets for Datasets. A dedicated governance layer can tackle the challenge of reproducing the same “project-level” permissions, access controls, and metadata right inside the database.
Governance and Humans
Currently, a large part of data governance work involves defining principles, policies, rules, and standards that fit the enterprise’s goals and risk appetite. There is not going to be a one-size-fits-all, but data teams can trade notes on best practices to improve each of the governance domains. We can develop excellent practices as a community if we dedicate time and energy to scalable governance and then innovate within that space.
7 — How To Find a Misbehaving Model
Monitoring machine learning models once they are deployed can make the difference between creating a competitive advantage with ML and suffering setbacks that erode trust with your users and customers. But measuring ML model quality in production environments requires a different perspective and toolbox than monitoring normal software applications. Tristan Spaulding of DataRobot shared some practical techniques for identifying decaying models and strategies for providing this protection at scale in large organizations.
Tristan started with a big message: Data science-specific monitoring in production is the difference between beating your competitors with ML and falling on your face. So why don’t we simply measure accuracy metrics like RMSE or AUC? Here are the three major reasons:
There will always be some delays in aligning predictions and “ground-truth” results.
The interventions on your predictions change the outcomes.
Even if you had ground truth, higher accuracy doesn’t mean the model is good.
Warning Signs in Deployed Models
To find a misbehaving deployed model, Tristan proposed a few steps:
Basic monitoring steps are to measure prediction drift (measure and plot the changes of the prediction distribution over time), feature drift (measure and plot each important feature’s difference between training and scoring data, then inspect divergences and new values), and data quality rules (define and check for unexpected conditions, prepare to override outputs, and review how often these rules are triggered to identify model/data improvements). A minimal drift sketch follows this list.
Better monitoring steps are to prioritize and understand (use what you know of the model’s use of the feature to zero in on important cases) and build segments and adjustments (monitor changes for different segments of your population individually and reduce false positives by adjusting for drift you see during training).
World-class monitoring steps involve continuous model competition (shadow your live model with alternatives to catch divergences, put performance changes in context, and have alternatives ready to swap in if the champion fails) and bias and fairness (disparate predictions and performance on sub-populations is a major risk in modeling, and can appear even after deployment).
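As a minimal sketch of the basic drift checks above, the function below computes the population stability index (PSI) between a training sample and a scoring sample with NumPy; the bin count and the 0.2 alert threshold are common rules of thumb, not DataRobot-specific settings.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Bin the expected (training) values and measure how much the actual
    (scoring) distribution has shifted across those bins."""
    cuts = np.percentile(expected, np.linspace(0, 100, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf   # cover values outside the training range

    expected_pct = np.histogram(expected, bins=cuts)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=cuts)[0] / len(actual)

    # Avoid log(0) for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Example usage: flag a feature (or the prediction column) whose serving
# distribution has drifted away from training.
rng = np.random.default_rng(42)
training_scores = rng.normal(0.0, 1.0, 10_000)
serving_scores = rng.normal(0.5, 1.2, 10_000)   # simulated drift

psi = population_stability_index(training_scores, serving_scores)
if psi > 0.2:   # common rule-of-thumb threshold for "significant" drift
    print(f"PSI={psi:.3f}: significant drift, investigate this feature/model.")
```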
There are many more, as this is an area of active innovation! Today, the key question for most organizations who have models deployed is: How can I apply the best practices consistently for all the models I use?
MLOps and Automation
The rest of Tristan’s talk focused on the gap in bringing ML models into production (the practice known as MLOps). The gap exists in three major phases: building models (where multiple teams use multiple tools/languages to build ML models), operating models (where there is a need to unify deployment, monitoring, management, and governance of ML models for all stakeholders to consume), and running models (where these models need to be deployed and managed somehow in “live” complex production environments).
The key challenge to address this production gap is to keep the right people informed (DevOps, data scientists, model validators, and executives). Given the high cost of even a single mistake, all these parties with very different interests somehow need to be kept up to date.
Overall, to proactively fix the problem of misbehaving models, your organization would want to prepare alternative models to fall back to and invest in strategies for accelerated model development.
The DataRobot MLOps platform is designed to tackle most (if not all) of the challenges mentioned above, so give it a try yourself: https://www.datarobot.com/trial/
8 — Universal Data Authorization for Your Data Platform
With all the advances in DataOps, many data-driven initiatives still fail. Why? Because organizations still struggle to resolve two problems as old as data itself: people can retrieve and use data they should not have access to, and other people cannot access data for legitimate business purposes. These problems put immense business pressure on data architects and data platform owners. Mary Flynn from Okera presented a fantastic session on the notion of Universal Data Authorization and how adding it to the modern tech stack brings clarity and appropriate control across the entire data platform.
Fine-Grained Access Control
Fine-grained access control (FGAC) techniques are the new table stakes, and not just for regulatory purposes anymore. These include column-level enforcement (hiding, masking, and tokenizing data), row-level enforcement (filtering out data the user shouldn’t see), and cell-level enforcement (checking for thresholds and performing complex anonymization such as k-anonymization and differential privacy). Most companies implement FGAC in two ways (a toy enforcement sketch follows these bullets):
Curated Data Extracts: They manage policies as pipelines and IAM roles, which is very difficult at scale. Even slight variations in the data can lead to massive data duplication. Furthermore, the definition of “secure” (i.e., what data is sensitive) changes over time.
Policies-as-Database Views: They create SQL views on top of tables, which is unmanageable at scale. This leads to inevitable duplication, making it hard to evolve policies and leverage user context.
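For intuition only (a toy pandas sketch, not how Okera enforces policies), here is what column- and row-level enforcement can look like when applied to a query result; the masking rule and the row filter are assumptions.

```python
import pandas as pd

def enforce_policies(df: pd.DataFrame, user_region: str, can_see_pii: bool) -> pd.DataFrame:
    """Apply simple fine-grained controls before returning data to a user."""
    result = df.copy()

    # Row-level enforcement: users only see rows for their own region.
    result = result[result["region"] == user_region]

    # Column-level enforcement: mask the email column for non-privileged users.
    if not can_see_pii:
        result["email"] = result["email"].str.replace(r".+@", "***@", regex=True)

    return result

# Example usage
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})
print(enforce_policies(orders, user_region="EU", can_see_pii=False))
```

The point of a governance platform is to express rules like these once, as metadata-driven policies, instead of baking them into every view or extract.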
Enterprises with complex data requirements need both role-based access control and attribute-based access control. Role-based access control (RBAC) generalizes access permissions by user or user group and is not metadata-driven. In contrast, attribute-based access control (ABAC) uses abstractions that reference data or users by attribute (rather than by name) and is metadata-driven. Mary argued that ABAC is the key to managing data complexity at scale, and we can leverage machine learning for it with automated data discovery, data classification, and data catalogs.
Universal Data Authorization
The takeaway from the above is that your company, your competitors, and your technology stack vendors (data platform and business application providers) will all need fine-grained access control. Because of that, universal data authorization will be critical for the same reasons you use external identity management: it provides consistency, clarity, and scale for your data platform.
Universal data authorization (UDA) is a framework for dynamically authorizing data access and a part of the data governance practice. Its three core capabilities are (1) universal policy management, (2) on-demand decision-making and policy enforcement, and (3) centralized audit and reporting. UDA also has multiple client access points and is agnostic to the data platform, supporting multiple enforcement patterns.
Critical Success Factors
There are several critical success factors for enterprises adopting the UDA framework into their data platform:
Automate data discovery and data attribute tagging: You want to collect attributes necessary for scalable policy management, dynamic enforcement, and reports. The best way to go about it is to let automated ML-based solutions do the heavy lifting for you.
Take user attribute management seriously: Keeping user attributes standardized and up-to-date gives you a lot of flexibility.
Take an API-first platform approach: APIs automate the request and approval workflows. The enterprise technology stack is complex and fluid, so you want to avoid vendor lock-in.
Check for “universal-fit”: Make sure that you have support for clients across the top and support for data platforms you might be using in the future.
Practice delegated stewardship: Data governance is a collaboration. You want to make sure that policy management and reliable policy enforcement are distinct responsibilities.
Audit everything, automate reports, and allow ad-hoc analysis: This opens up opportunities to provision data better and more efficiently for your end-users.
Mary concluded the talk with three core benefits of the UDA framework (which is offered by Okera):
Accelerate business agility by being data-driven to provision data faster, capture new insights with analytics and data science, and build data-sharing products.
Minimize data security risks by reducing the attack surface of unintended data misuse (both internal and external exposure).
Demonstrate regulatory compliance by accelerating reporting that meets evolving regulations (GDPR, CCPA, HIPAA) for your customer agreements and corporate clients.
That’s the end of this long recap. DataOps is a topic I will focus on a lot in the upcoming months. If you have experience using tools to support DataOps for the modern analytics and ML stack, please reach out to trade notes and tell me more at khanhle.1013@gmail.com! 🎆