
Scaling dbt deployment: Data lineage, incremental models, and code checks

May 2, 2024
Barak Fargoun

dbt is a powerful, user-friendly framework that has emerged over the past few years as a popular choice for data organizations looking to build and manage data pipelines, also referred to as “transformations”. At the same time, as dbt projects and deployments have grown larger, certain pitfalls and limitations have become more significant. In this article we’ll shed some light on these issues, along with suggested solutions.

Incremental Models

Incremental models are a hugely important feature in dbt. As dbt changed how data teams transform data in the modern data stack, bringing modularity and source control, teams quickly scaled to hundreds, and sometimes thousands, of dbt models. By default, when a model is materialized as a table, dbt rebuilds the entire table from scratch on every run. This has led to big spikes in warehouse costs, with some data teams seeing annual warehouse bills climb into six and even seven figures.

This was clearly a problem, and dbt introduced powerful support for incremental models, allowing data teams to process only new data rather than reprocessing data that has already been handled. However, incremental models are more complex to use correctly, and they introduce one of the most common problems we see at Foundational when checking dbt pull requests before they are merged: missing refreshes.
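To make things concrete, here is a minimal, illustrative sketch of what an incremental model typically looks like; the model and column names are hypothetical, not taken from any specific project:

```sql
-- models/orders.sql (illustrative sketch of a typical incremental model)
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    customer_id,
    order_date,
    amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only process rows newer than what already
  -- exists in the target table; existing rows are left untouched.
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```

The key point is that on a regular incremental run, dbt only processes the new rows; rows that were materialized in earlier runs are never recomputed unless a full refresh is explicitly requested.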

Let’s explain this with an example, using the classic Jaffle Shop project:

Table-level Lineage (Source: Foundational)

Let’s also assume that orders is an incremental model in this example. If we make a change to stg_payments, which feeds orders (or more formally, is an upstream dependency of orders), we need to run a full refresh on orders. If we don’t, the change won’t affect the older records in orders. Similarly, if we make a change to orders itself, we need to run a full refresh as well. In this trivial example, understanding the data lineage and dependencies is simple. However, in a real project containing hundreds or thousands of models, the developer needs to remember to perform these refreshes with every update. This is not enforced by dbt, and we often see developers miss these steps or perform them incorrectly.
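In practice, this means that after merging a change to stg_payments, someone has to remember to run something like [.code]dbt run --select stg_payments+ --full-refresh[.code], which rebuilds stg_payments and everything downstream of it, rather than relying on the regular incremental job; otherwise, rows that were already materialized into orders simply keep their old values.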

Partial vs. full refreshes, and the impact of column-level lineage

Knowing that a change to an incremental model always requires a refresh of a very specific set of models is an important starting point, and this is where data lineage helps. Fortunately, dbt ships with a handy lineage view as part of dbt docs. However, this is also where column-level lineage, which is only supported through dbt Cloud and dbt Explorer, provides a big advantage. Let’s see why:

Column-level Lineage (Source: Foundational)

Here, looking at the column-level dependencies of these models, we can confirm that if a change only touches payment_id within stg_payments, there is no real need to run a full refresh of orders, since payment_id doesn’t directly affect any column in orders.

With only table-level lineage, we would not know this and would assume that orders needs to be refreshed as well. The consequence is that without column-level lineage, refreshes are run over a much bigger list of models, occasionally covering most of the project! This is slow and, even more important, expensive. In a small dbt project we could naively force a full project refresh on every change, but once a project has dozens of models, this becomes non-viable.
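To see why column-level lineage can reach this conclusion, consider a simplified, hypothetical sketch of the orders model, loosely following the Jaffle Shop example:

```sql
-- models/orders.sql (simplified, illustrative version)
select
    o.order_id,
    o.customer_id,
    o.order_date,
    sum(p.amount) as amount  -- the only column orders consumes from stg_payments
from {{ ref('stg_orders') }} o
left join {{ ref('stg_payments') }} p
    on o.order_id = p.order_id
group by 1, 2, 3
```

Since payment_id never appears in this query, column-level lineage can determine that a change confined to payment_id in stg_payments cannot change any value in orders, and therefore no full refresh of orders is needed.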

This is also why analyzing the actual pull request before merging is so important: knowing the exact list of impacted models, determined by accurate column-level lineage derived from the pull request while it is still pending, allows data and analytics engineers to avoid data issues and become far more efficient in both cost and, no less important, speed.

Schema changes

Breaking schema changes, where a column is removed or modified, are considered scary changes, and rightfully so. They are the number one cause of problems, and thankfully many data teams are aware of this and have put a process in place, and hopefully some technology such as our Data Contracts solution, to make these changes easier to roll out.

What you may not know is that adding a column to an incremental model may cause a problem too! This happens because of dbt’s default behavior for incremental models: when you add a new column, dbt ignores it. You must remember to override this default.
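In dbt, this behavior is controlled by the on_schema_change config on the incremental model, which defaults to ignore. Here is a minimal, illustrative sketch of overriding it:

```sql
-- models/orders.sql (illustrative)
{{ config(
    materialized='incremental',
    unique_key='order_id',
    -- the default is 'ignore', which silently skips newly added columns
    on_schema_change='append_new_columns'
) }}

select * from {{ ref('stg_orders') }}

{% if is_incremental() %}
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```

Note that even with append_new_columns, the new column is only populated for new rows; backfilling it for older rows still requires a full refresh.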

This is of course not hard to do, but in a real-life setting where dozens of pull requests land continuously, people forget. It also gets trickier when models are defined implicitly, for example through a [.code]SELECT * FROM Foo[.code] statement. In that case, adding a column to Foo now impacts another model as well. The end result is that the combination of schema changes and incremental models is often tricky, and these issues are *very* commonly overlooked.

Cross-tool column-level lineage

We’ve seen so far how common dbt features need to be considered when making changes within a single dbt project. Now let’s consider an environment with more than one project: this could be multiple dbt projects, or a dbt project that lives alongside another environment such as a warehouse, a BI tool, or a semantic layer such as LookML.

In these cases, schema changes and semantic code changes, where the actual value assignments are modified, need to be propagated across the entire set of tools and repositories that are potentially impacted. Beyond not necessarily knowing what is impacted, the list of impacted objects may now include other environments owned by different teams.

Perhaps the most common issue here is model changes impacting downstream warehouse views. In such cases, the views continue to run and everything executes successfully, but the data is now malformed. Views are often a useful way to feed data coming from dbt into business intelligence and user-facing dashboards, so a simple dbt change may have a dramatic impact when its downstream effects extend outside the project.
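As a hypothetical example, imagine a view that a BI team created directly in the warehouse, outside the dbt project, on top of the orders table that dbt builds:

```sql
-- Hypothetical view living in the warehouse, outside the dbt project
create or replace view analytics.finance_orders_dashboard as
select
    order_id,
    order_date,
    -- if a dbt change alters how amount is computed (say, cents instead of dollars),
    -- this view keeps executing successfully while serving malformed numbers
    amount
from analytics.orders;
```

Because this view is not part of the dbt project, dbt’s own lineage has no visibility into it; only cross-tool lineage that also covers the warehouse and the BI layer can flag that the dbt change impacts it.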

Summary

Ultimately, these issues are not necessarily complex. Sometimes they are, but more often the problem is the dissonance between a very simple code change and the amount of “damage” it can actually cause. Of course, data teams have tools in place, including those provided natively by dbt. However, the long list of things to remember, combined with the need to understand and identify column-level dependencies, often makes that list hard to manage in a large project.

We also need to remember that people make mistakes, even with the strictest pre-merge checklists and the tightest CI/CD process. And even when the process works, we should ask ourselves: are we sacrificing speed and cycle times? This is where technology helps, and leveraging it can dramatically improve data quality as well as the data team’s productivity.

Contact us to learn more about code checks for dbt

At Foundational, we believe that a lot of these checks can and should be automated. By leveraging technology, we help data teams manage growing dbt projects alongside other environments such as operational databases, BI tools, warehouses, and data lakes. Schedule time with us to learn more.
