Applying DevOps to Data

Software engineering transformed dramatically when DevOps emerged as the methodology for building and operating software at scale: version-controlled code, automated testing (CI), automated deployment (CD), infrastructure as code, and monitoring and alerting in production. Code changes that once required weeks of manual testing and coordination can now be deployed to production multiple times per day with confidence.

Data engineering has historically lagged behind software engineering in adopting these practices. Data pipelines are often written as ad-hoc scripts without version control. Schema changes are applied manually by running SQL against production databases. Deployment involves copying notebooks from development to production without validation. Production data quality issues are discovered by business users, not by automated monitoring.

DataOps applies DevOps principles to data engineering: version control for all data artifacts (SQL, Python, DAGs, schema definitions), automated testing for data pipelines (schema tests, data quality assertions, business logic validation), automated deployment pipelines (CI/CD for data), infrastructure as code for data infrastructure (declarative warehouse and catalog configuration), and automated monitoring for production data quality and freshness.

Version Control for Data Artifacts

In a DataOps-mature organization, all data artifacts are version-controlled in Git:

dbt projects: All SQL transformation logic, schema definitions, data quality tests, and documentation are committed to a Git repository. Pull requests require code review before merge, ensuring that SQL changes are reviewed by a second engineer. The dbt project history provides a complete audit trail of all transformation changes.

Airflow DAGs: All orchestration logic is defined as Python code in a Git repository, deployed to Airflow through a CI/CD pipeline rather than manually uploaded. DAG changes go through pull request review and automated validation before deployment.
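As one illustration of the automated validation step, a CI job could run a first-pass check that every DAG file at least parses as valid Python. This is a minimal sketch using only the standard library; the function name find_syntax_errors and the folder layout are hypothetical, and a fuller check would also load the DAGs with Airflow itself to catch import and dependency errors.

```python
import ast
from pathlib import Path


def find_syntax_errors(dag_folder: str) -> dict[str, str]:
    """Return {file path: error message} for every .py file that fails to parse."""
    errors: dict[str, str] = {}
    for path in sorted(Path(dag_folder).rglob("*.py")):
        try:
            # ast.parse only checks syntax; it does not execute the file.
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors
```

A CI job would fail the build whenever the returned dictionary is non-empty, so a broken DAG file never reaches the Airflow deployment.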

Infrastructure as code: Warehouse configurations, catalog registrations, and compute resource definitions are defined in Terraform or Pulumi and version-controlled, enabling consistent, reproducible infrastructure deployment across development, staging, and production environments.

CI/CD for Data Pipelines

DataOps CI/CD pipelines automate validation and deployment of data code changes:

Continuous Integration (CI): When a pull request is opened against the dbt project, an automated CI pipeline compiles the project (rendering Jinja and resolving model references, which catches templating and dependency errors), runs dbt tests against a development environment with sample data (schema tests, data quality assertions), generates updated documentation, and reports results back to the pull request. The pull request is blocked from merge until all CI checks pass.
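The CI stages above could be driven by a small script along these lines. It is a sketch, not a definitive setup: dbt compile, dbt test, and dbt docs generate are real dbt commands and --target is a real flag, but the target name "ci" and the helper names run_shell and run_ci are assumptions made for illustration. It also assumes the models under test have already been built in that environment.

```python
import subprocess
from typing import Callable, Sequence

# Commands mirror the CI stages described above. The target name "ci" is an
# assumption; it must match a target defined in the project's profiles.yml.
CI_STEPS: list[list[str]] = [
    ["dbt", "compile", "--target", "ci"],
    ["dbt", "test", "--target", "ci"],
    ["dbt", "docs", "generate", "--target", "ci"],
]


def run_shell(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_ci(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = run_shell,
) -> bool:
    """Run steps in order and stop at the first failure.

    Returns True only if every step exits with code 0.
    """
    for cmd in steps:
        if runner(cmd) != 0:
            print(f"FAILED: {' '.join(cmd)}")
            return False
    return True
```

In a CI job, the script would exit non-zero when run_ci(CI_STEPS) returns False, which is what lets the hosting platform block the merge. The runner parameter exists so the ordering logic can be tested without dbt installed.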

Continuous Deployment (CD): When a pull request is merged to the main branch, an automated CD pipeline deploys the changes: running dbt run against the staging environment, validating results, then promoting to production. Deployment to production is gated on staging validation, preventing broken changes from reaching production data.
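The gating described above reduces to a short control flow: production is only built if staging both builds and validates. The sketch below assumes the caller supplies two hypothetical callbacks, build and validate, which in practice might wrap commands such as dbt run and dbt test for a given target; the environment names "staging" and "prod" are also assumptions.

```python
from typing import Callable


def promote(
    build: Callable[[str], bool],
    validate: Callable[[str], bool],
    staging: str = "staging",
    production: str = "prod",
) -> bool:
    """Deploy to production only after staging builds and validates.

    build(env) and validate(env) return True on success.
    Returns True only if the production build ran and succeeded.
    """
    if not build(staging):
        return False  # staging build failed; production is never touched
    if not validate(staging):
        return False  # staging results failed validation; stop here
    return build(production)
```

Because production is reached only through the two staging checks, a change that breaks staging cannot alter production data.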

[Figure: DataOps Pipeline]

DataOps Monitoring and Observability

Production data quality monitoring is a core DataOps practice. Automated data quality monitors check production tables on a schedule (hourly, daily), alerting on anomalies: row count drops outside expected ranges, freshness lag beyond SLA thresholds, null rates exceeding baselines, distribution shifts in key columns.
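Two of the checks listed above, row count drops and rising null rates, can be sketched as a comparison of current table statistics against a baseline. Everything here is illustrative: the TableStats structure, the check_table function, and the default thresholds (a 20% row drop and a 5 percentage point null rate increase) are assumptions, and how the statistics are collected from the warehouse is left out.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    row_count: int
    null_count: int  # nulls observed in one monitored column


def null_rate(stats: TableStats) -> float:
    """Fraction of rows where the monitored column is null."""
    return stats.null_count / stats.row_count if stats.row_count else 0.0


def check_table(
    current: TableStats,
    baseline: TableStats,
    max_row_drop: float = 0.2,
    max_null_rate_increase: float = 0.05,
) -> list[str]:
    """Return a list of alert messages; an empty list means no anomaly."""
    alerts: list[str] = []
    if baseline.row_count > 0:
        drop = (baseline.row_count - current.row_count) / baseline.row_count
        if drop > max_row_drop:
            alerts.append(f"row count dropped {drop:.0%} versus baseline")
    increase = null_rate(current) - null_rate(baseline)
    if increase > max_null_rate_increase:
        alerts.append(f"null rate rose by {increase:.1%} versus baseline")
    return alerts
```

A scheduled job would call check_table for each monitored table and forward any returned messages to the alerting channel.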

Tools like Monte Carlo, Acyl, or custom dbt tests provide automated anomaly detection and alerting. Dremio’s monitoring integrations track query performance, reflection staleness, and source freshness, providing operational observability for the lakehouse serving layer.

In an Iceberg lakehouse, snapshot freshness monitoring is a key DataOps indicator: each Iceberg table’s latest snapshot timestamp is tracked against the expected update frequency, alerting when tables have not been refreshed within their SLA window.
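The snapshot freshness check can be sketched as follows. Iceberg snapshot metadata records each commit time as a millisecond timestamp, so the sketch takes those values as input; how they are fetched from the catalog is left out, and the function name stale_tables and the per-table SLA mapping are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_tables(
    latest_snapshot_ms: dict[str, int],
    sla: dict[str, timedelta],
    now: Optional[datetime] = None,
) -> list[str]:
    """Return tables whose latest snapshot is older than their SLA window.

    latest_snapshot_ms maps table name to the latest snapshot timestamp in
    epoch milliseconds. A table with an SLA but no recorded snapshot is
    treated as stale.
    """
    now = now or datetime.now(timezone.utc)
    stale: list[str] = []
    for table, max_age in sla.items():
        ts_ms = latest_snapshot_ms.get(table)
        if ts_ms is None:
            stale.append(table)
            continue
        snapshot_time = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
        if now - snapshot_time > max_age:
            stale.append(table)
    return stale
```

Passing now explicitly keeps the check deterministic in tests; a scheduled monitor would omit it and alert on any table names returned.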

Learn More

To dive deeper into these architectures and master the modern data ecosystem, check out the comprehensive books by Alex Merced available in our Books section.