
CI/CD for the Data Team – and What Data Should Adopt From Software

January 25, 2024
Alon Nafta

In the fast-moving world of the modern data stack – and the modern data team that supports it – common software engineering best practices often arrive late. One result is a never-ending battle with data quality and governance, which frequently originates from a surprising reality: poorly written code with minimal to no testing.

Adopting long-standing concepts from traditional software engineering in these environments, such as the Software Development Life Cycle (SDLC) and Continuous Integration/Continuous Deployment (CI/CD), can meaningfully increase efficiency, scalability, and reliability.

In this post we’ll provide an overview of how organizations can effectively implement SDLC and CI/CD in the modern data stack, focusing for now on dbt and popular warehouses such as Snowflake and BigQuery.

First, let’s cover the basics

dbt: An open-source transformation tool that enables data analysts and engineers to transform, test, and move data in the warehouse more effectively. It allows writing modular SQL queries (also referred to as models), which it then runs on your data warehouse in the correct order, handling the dependencies between them. dbt natively supports version control, testing, and documentation, which greatly simplifies the organization, management, and maintenance of all the code that manipulates data.
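
To make this concrete, here is a minimal sketch of a dbt model that references another model; the model and staging-table names (orders_daily, stg_orders) are hypothetical:

```sql
-- models/orders_daily.sql (hypothetical model name)
-- dbt resolves ref() to the correct schema and table, and builds
-- stg_orders before this model, handling dependency order automatically.
select
    order_date,
    count(*) as order_count
from {{ ref('stg_orders') }}
group by order_date
```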

SDLC: The Software Development Life Cycle (SDLC) is a structured process for creating and maintaining software reliably and efficiently. It typically defines six stages for building software: Requirement analysis, Planning, Design, Development, Testing, and Deployment. By implementing SDLC, organizations can produce high-quality software that is properly tested and ready for production.

CI/CD: Continuous Integration (CI) and Continuous Deployment (CD) are practices designed to automate the testing and deployment of code changes. They ensure that code changes – for example, changes to data models in the case of the modern data stack – are integrated and deployed smoothly without disrupting existing workflows.

How would this look for data teams?

  • Version Control: Data teams should always use version control – Git, typically hosted on a platform such as GitHub – to deploy code changes. While this may sound obvious, it means that any new entities, such as tables, should always be created through versioned code that goes through Git. The same is true for modifications and, perhaps less intuitively, for queries as well. Ideally, at least from an SDLC point of view, there should be no “ad hoc” queries that are not tracked anywhere. Properly adopting version control ensures that code changes are always tracked and can be rolled back if necessary, while also supporting best practices such as code reviews.
  • Pull Requests and Code Reviews: While this may sound crazy to the typical software team, in data, properly working with pull requests (PRs) and requiring code reviews (CRs) on every PR is far from the norm. Many teams merge code straight from a branch, at most with some local testing. Working with PRs and code reviews slows things down – but it also encourages accountability, supports knowledge sharing, and most importantly, meaningfully increases the chance of finding bugs before they mess up the data!
  • Testing: Writing tests for functional parts such as dbt models allows developers to validate their logic as well as the underlying data. In the case of dbt, testing is not enforced, but it is natively supported and highly encouraged (see the sketch after this list).
  • Documentation: Most modern data stack tools support some sort of documentation. In the case of dbt, using dbt-docs to generate documentation for large dbt models is a convenient way to democratize the understanding and ongoing maintenance of the dbt environment.
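
To illustrate the Testing and Documentation points above, here is a minimal sketch of a dbt schema.yml that attaches built-in tests and descriptions to a model; the model and column names are hypothetical:

```yaml
# models/schema.yml (hypothetical model and column names)
version: 2

models:
  - name: orders_daily
    description: "Daily order counts, aggregated from staging."
    columns:
      - name: order_date
        description: "Calendar date the orders were placed."
        tests:
          - unique
          - not_null
```

dbt runs these checks with dbt test, and dbt docs generate builds a browsable documentation site from the same file.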

And then, for CI/CD:

  • CI Setup: Dedicated tools such as Jenkins, GitHub Actions, or dbt Cloud can help you automate the testing of your dbt models every time a change is made – for example, by configuring the CI pipeline to run dbt tests on every commit (see the workflow sketch after this list). It is important to note that common practices such as syntax checks and end-to-end CI checks are not natively supported by dbt; it is the organization’s responsibility to add the relevant tooling. For syntax checks and linting, SQLFluff is growing in popularity, but many teams still don’t include it as part of shipping code for data.
  • Automated Deployment: Set up continuous deployment (CD) to push changes to the warehouse automatically, triggered after successful CI runs or gated behind manual approval for production deployments.
  • Environment Management: Use separate warehouse environments (e.g., development, staging, production) to isolate and manage deployments, and ensure your CI/CD pipeline deploys to each environment appropriately (see the profiles sketch below).
  • Code quality, data monitoring, and alerting: Implement tools for code quality and data monitoring to track the health of your code and data pipelines, and set up alerts for any warnings, failures, or potential issues.
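
As a sketch of the CI Setup and Automated Deployment points above, the following GitHub Actions workflow lints and tests a dbt project on every pull request. The Snowflake adapter, the secret names, and a profiles.yml committed at the repository root are all assumptions; adapt them to your own project:

```yaml
# .github/workflows/dbt-ci.yml (illustrative; adapter and secret names are assumptions)
name: dbt CI

on:
  pull_request:
    branches: [main]

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dbt and SQLFluff
        run: pip install dbt-snowflake sqlfluff
      - name: Lint SQL with SQLFluff
        run: sqlfluff lint models/ --dialect snowflake
      - name: Build models and run tests against the CI environment
        run: dbt build --target ci
        env:
          DBT_PROFILES_DIR: .   # assumes profiles.yml is committed at the repo root
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
```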
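
For the Environment Management point, dbt’s profiles.yml supports multiple targets that map to separate warehouse environments. A minimal sketch, assuming Snowflake and hypothetical account, database, and warehouse names:

```yaml
# profiles.yml (hypothetical names; credentials are read from environment variables)
my_project:
  target: dev                 # default target for local development
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      warehouse: TRANSFORMING
      database: ANALYTICS_DEV
      schema: dbt_dev
    ci:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      warehouse: TRANSFORMING
      database: ANALYTICS_CI
      schema: dbt_ci
    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      warehouse: TRANSFORMING
      database: ANALYTICS
      schema: analytics
```

The CI workflow above selects the ci target via dbt build --target ci, keeping test runs isolated from production data.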

Best Practices for Scaling

  • Modular Design: In the case of dbt, keep models modular and reusable to manage complexity as your project grows. While dbt pricing might push teams to consolidate models, simple, modular models are crucial for the maintainability of large projects – it’s a lot easier to understand and reference other models when each one handles a focused task (see the project-layout sketch after this list).
  • Performance Optimization: Data engineers should regularly monitor and optimize the performance of their dbt models and warehouse queries, especially as data volume grows.
  • Scalability Considerations: Data teams can leverage warehouse features such as auto-scaling to handle varying loads efficiently.
  • Security and Compliance: Ensure that your CI/CD pipeline and data processes comply with the security policies and data governance standards your company has in place – policies that, in practice, sometimes overlook data and pipelines.
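
One common way to keep a dbt project modular, per the Modular Design point above, is to split models into staging and marts layers and configure each layer by folder in dbt_project.yml. A brief sketch; the project and folder names are a common convention, not a requirement:

```yaml
# dbt_project.yml (excerpt; hypothetical project name)
name: my_project
profile: my_project

models:
  my_project:
    staging:
      +materialized: view    # lightweight models, roughly one per source table
    marts:
      +materialized: table   # consumer-facing models built from staging layers
```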

Why Is This All Needed?

While SDLC and CI/CD are key frameworks for scaling software engineering, in data there are additional reasons why these frameworks help with change management and data governance at scale:

  • A fragmented ownership model - While the software stack typically sits within the responsibility of a single organization – Engineering – data is often handled by multiple teams working in separate projects and workspaces. A backend engineering team can be responsible for ingesting data from the operational environment, while analytics engineers are responsible for modeling and analytics. Changes in one environment have to be communicated and coordinated across all owners to ensure that rollouts do not cause issues that lead to incidents.
  • Data tends to be fragile - Data breaks a lot, and in many cases queries produce unexpected fluctuations due to subtle changes somewhere upstream. The reality today – very few checks, minimal testing, and pipelines that are often not resilient to change – results in breakages far more often than we’d expect.
  • Technology is still developing - Common data platforms such as dbt and Databricks are still evolving in the areas of quality and governance, which is great, but it also means teams need to keep up with the latest releases and handle the variability between tools. For example, SQL dialects differ even between common warehouses such as Snowflake and BigQuery (see the example below).
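
As a small illustration of that last point, the same “orders from the last 7 days” filter is written differently in the two warehouses (the orders table is hypothetical):

```sql
-- Snowflake
select * from orders
where order_date >= dateadd(day, -7, current_date);

-- BigQuery
select * from orders
where order_date >= date_sub(current_date(), interval 7 day);
```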

Our approach at Foundational

At Foundational, we are strong believers that data engineering and software engineering should work well together. SDLC and CI/CD are at the very center of this, and our goal is to build the technology that fills the gaps data platforms often leave:

  1. For every code change throughout the operational and analytical data stacks, developers are able to understand impact and validate their changes.
  2. Foundational automatically validates code, adding complex CI checks that are hard to design or implement manually.
  3. Foundational leverages tools such as data contracts to introduce automation, collaboration, and alerting when issues do occur.

We believe this approach makes it easy for large-scale organizations to think and act around SDLC and CI/CD without investing expensive engineering resources or fundamentally changing how teams develop code together.

Chat with us 

We are solving extremely complex problems that data teams face on a day-to-day basis. Scaling data teams through CI/CD is just one aspect – connect with us to learn more.
