Scaling Data Lineage for dbt

February 22, 2024
Alon Nafta

dbt has recently announced that dbt Explorer will now support column-level lineage, one of the most highly anticipated feature requests since the early days of dbt, and a key feature in dozens of tools, Foundational included. 

But why is data lineage for dbt such an important capability? And with so many commercial options available, ranging from completely free to enterprise-grade solutions, why do many teams still struggle with lineage?

In this article, we’ll review what functionality dbt offers for tracking data lineage and what other options exist, both open source and commercial. We’ll also explore how to decide between the different options, and why Foundational’s data lineage is unique in this regard.

Data Lineage in dbt

Tracking data lineage is a key need and often a formal requirement across many organizations. Since dbt (data build tool) has become the de facto standard for many companies running cloud-based data stacks, data lineage for dbt has always been a hot topic. Perhaps surprisingly, there are many solutions for lineage tracking in dbt, including ones dbt itself provides. Naturally, premium solutions cost more, and that is also the case for the native dbt options: dbt Core, the open-source framework, and dbt Cloud, the paid version hosted by dbt.

dbt Core, and until recently dbt Cloud as well, have provided data lineage to give both data teams and business users better visibility into dbt projects and data pipelines. dbt surfaces lineage both in graph form, often referred to as a DAG (Directed Acyclic Graph), and in other formats such as an API, to satisfy the variety of needs data engineers often have, for example root cause analysis, discovery, and data cataloging.
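This graph is also available programmatically: dbt commands such as `dbt docs generate` write a `manifest.json` artifact to the `target/` directory, whose `parent_map` and `child_map` record the dependency edges. A minimal sketch, using an inlined, hypothetical excerpt of the artifact instead of reading the real file:

```python
import json

# Tiny, hypothetical excerpt of a dbt manifest.json; in practice you would
# load it with json.load(open("target/manifest.json")).
manifest = {
    "parent_map": {
        "model.jaffle_shop.orders": [
            "model.jaffle_shop.stg_orders",
            "model.jaffle_shop.stg_payments",
        ]
    },
    "child_map": {
        "model.jaffle_shop.stg_orders": ["model.jaffle_shop.orders"]
    },
}

# Upstream dependencies (parents) of a model:
parents = manifest["parent_map"]["model.jaffle_shop.orders"]
print(parents)  # ['model.jaffle_shop.stg_orders', 'model.jaffle_shop.stg_payments']
```

The same two maps are what most lineage tooling built on dbt ultimately consumes.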

dbt highlights the main use cases of data lineage as the following:

  • Root cause analysis - This is a data quality use case, where data engineers attempt to identify the origin, or root cause, of an issue when one is detected somewhere in the data warehouse (on top of which dbt is running). Using the data lineage graph, one can identify the upstream dependencies that could be the root cause of the problem observed.
  • Downstream impact - This is a change management use case, where data engineers intend to make changes somewhere within the dbt project and want to understand all the downstream dependencies that the change may impact. Getting an accurate analysis of all the data flows helps – as the person making the change can validate there are no unintended paths where data may get impacted as a result of the change.
  • Value to business users - This is a data catalog use case, sometimes referred to as discoverability, where data consumers, e.g. a person looking at a dashboard, want to understand the origin of the data and how it was transformed before it reached the visual interface.
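The downstream-impact analysis above boils down to a transitive traversal of the lineage graph: starting from the model being changed, follow child edges until no new nodes appear. A sketch over a small, hypothetical `child_map` (root cause analysis is the same traversal run over the parent map instead):

```python
from collections import deque

# Hypothetical child_map: model -> models that directly depend on it.
child_map = {
    "stg_orders": ["orders"],
    "orders": ["orders_by_region", "revenue"],
    "orders_by_region": [],
    "revenue": ["finance_dashboard"],
    "finance_dashboard": [],
}

def downstream(node, child_map):
    """Return every model transitively downstream of `node` (BFS)."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for child in child_map.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(downstream("orders", child_map)))
# ['finance_dashboard', 'orders_by_region', 'revenue']
```

Changing `orders` therefore impacts not just its direct children but also the dashboard two hops away, which is exactly what a flat list of direct references would miss.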

Despite the availability of dbt data lineage, meaningful limitations have existed around three areas:

  • Coverage - By definition, lineage provided by dbt mainly covers… dbt. Through Exposures, data teams can add downstream elements such as dashboards, but since this is a largely manual process, it hasn’t addressed the problem in full. Moreover, Exposures only cover specific downstream elements, and for many use cases upstream coverage matters a lot, too.
  • Lack of column-level resolution - Until recently, column-level lineage, i.e. the ability to understand column- or field-level dependencies, was not officially supported.
  • Cross-project dependencies - Some teams run more than one dbt project, and in those cases the native lineage solution did not support cross-project dependencies.

The combination of these three limitations, along with the increasing popularity of dbt for data modeling and building data pipelines at scale, has led to the emergence of numerous solutions for data lineage, which at first glance can seem quite similar. What is sometimes more confusing is that different types of tools – for example, both observability tools and catalogs – have generally all supported some form of data lineage, primarily for the dbt use case.

Available solutions for dbt data lineage 

Open-source solutions

Since dbt is built around SQL, calculating data lineage is generally a very solvable problem.

Yes, even column-level lineage is doable: with some setup work, data teams can use one of the open-source projects such as sqlmesh or SQLLineage to analyze SQL queries and extract lineage information for everything running on the data warehouse. More recently, CLI-based tools, typically built on the same open-source technologies, have started providing lineage information directly in Visual Studio Code and other developer environments. Many of them are free of charge or include some sort of free tier.
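These tools work by parsing the SQL itself. As a toy illustration of the core idea only (the real projects use proper SQL parsers and handle CTEs, subqueries, aliases, and dialect differences), table-level sources for a simple statement can be pulled from the FROM/JOIN clauses:

```python
import re

def source_tables(sql):
    """Toy table-level lineage: collect identifiers after FROM/JOIN.

    This is a deliberately naive sketch; tools like SQLLineage or sqlmesh
    do real parsing and also resolve column-level dependencies.
    """
    pattern = r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)"
    return sorted(set(re.findall(pattern, sql, flags=re.IGNORECASE)))

sql = """
    select o.order_id, p.amount
    from analytics.stg_orders o
    join analytics.stg_payments p on o.order_id = p.order_id
"""
print(source_tables(sql))  # ['analytics.stg_orders', 'analytics.stg_payments']
```

The gap between this sketch and production-grade lineage extraction is exactly where the open-source projects spend their effort, and where they diverge in quality.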

Another approach that has gained popularity in the open-source community is OpenLineage, which is certainly powerful, yet the main challenge is often implementation. OpenLineage is also useful for tracking lineage in pipelines built on technologies other than dbt, namely Spark. Similarly, though, getting up and running with OpenLineage requires quite a bit of effort, and it only covers what developers have plugged it into, so “unknown unknowns” remain unhandled.
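For context, OpenLineage models lineage as run events that each pipeline task emits, naming the job, its run, and its input and output datasets. A hand-built sketch of that event shape as a plain dict (the namespaces, job name, and producer URI are hypothetical; in practice the official clients and integrations construct and emit these to a lineage backend for you):

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

# Sketch of an OpenLineage-style run event; field values are illustrative.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "dbt", "name": "jaffle_shop.orders"},  # hypothetical job
    "inputs": [{"namespace": "snowflake://acme", "name": "analytics.stg_orders"}],
    "outputs": [{"namespace": "snowflake://acme", "name": "analytics.orders"}],
    "producer": "https://example.com/my-integration",  # hypothetical producer URI
}
print(json.dumps(event, indent=2))
```

The implementation effort mentioned above is precisely this plumbing: every task in every pipeline has to be instrumented to emit such events, or its framework must have an integration that does so.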

A third open-source approach is catalogs: Amundsen, DataHub, and OpenMetadata. All three are at varying degrees of maturity and offer some connector support for non-dbt environments. Here too, it takes quite a bit of upfront and ongoing effort to get accurate data lineage in large-scale environments, but some teams opt for open source for obvious reasons, which makes this an interesting path.

Generally, the challenges with all of these tools arrive with scale. Larger dbt projects typically mean larger data teams and a lot more SQL. In large projects, more types of operators and queries are introduced, and in those cases pretty much all open-source projects fall short.

Premium solutions

Premium solutions for dbt data lineage have emerged from vendors, primarily due to the previously mentioned limitations around coverage, resolution, and the effort involved in using open source. Overall quality is similar, though some vendors have invested more in actual SQL coverage and accuracy for large-scale projects. Unfortunately, this is hard to tell upfront, and the recommended approach is to test and actually measure accuracy across all the different types of data pipelines running on the warehouse.

Another major differentiator is coverage: the vast majority of vendors today rely on SQL parsing and fall short in environments where non-SQL pipelines are used, namely Spark.

There is also a big gap today in existing vendors’ ability to cover upstream dependencies on the operational side, for a variety of reasons; the result is that data “before” the ETL tool is almost never covered. This is a huge gap in the current landscape. More subtle differences between vendors come down to accuracy and resolution – for example, very few vendors support column-level lineage for Power BI, possibly because there are no good open-source solutions for this either.

When it comes to Foundational, the path we took of directly analyzing the underlying source code, as opposed to only looking at query logs, sets us apart by providing accurate lineage information at build time. It also allows us to compute lineage for code that has not yet been merged – a use case we believe is critical on its own. Of course, this comes with a unique set of challenges that we aim to address through technology.

Foundational analyzes the underlying source code

Choosing the best solution for your data flows

Ultimately, choosing the most appropriate solution for dbt data lineage depends on your main use cases, what coverage is needed and at what accuracy, and the level of effort you’re willing to put in. Budget is a derivative of that, since going with an open-source approach typically means an ongoing investment of data engineering headcount, which can end up more expensive than paying a vendor.

Thankfully, if the main environment is the data warehouse, and you don’t care about upstream dependencies or pending code changes, then all solutions are broadly similar, and you should choose according to whether your main use case is data quality or cataloging. If you’re using dbt Cloud, the new release of dbt Explorer is worth trying to confirm whether your core needs are addressed. Open source is a good route if a) you understand the ongoing overhead this puts on the team, and b) you are okay with the delayed nature of updates to open-source projects. An exception here is probably OpenLineage, which is well suited to the enterprise, though its large upfront investment may not suit all teams.

Data lineage for a dbt project

Foundational’s approach to data quality and governance

At Foundational, we took a unique approach to data lineage by directly analyzing the source code. This allows us to very quickly analyze the existing state of every dbt project in the environment, including cross-project dependencies. It also allows us to inspect pending pull requests and look for issues. While some issues may seem easy to spot, over time we’ve observed that data teams spend an unusually large amount of time root-causing issues and fighting data incidents. Many of these can be prevented before the code is even merged, saving a lot of time and, often, cloud costs as well.
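As a toy illustration of the kind of pre-merge check this enables (Foundational’s actual analysis is far more involved, and all names below are hypothetical), one can diff the columns a changed model produces against the columns its downstream consumers reference:

```python
# Hypothetical column sets: what the edited model selects before and after
# the pull request, and what each downstream model references from it.
before = {"order_id", "customer_id", "amount", "currency"}
after = {"order_id", "customer_id", "amount_usd"}  # columns renamed/dropped

downstream_refs = {
    "revenue": {"order_id", "amount"},
    "orders_by_region": {"order_id", "customer_id"},
}

removed = before - after
for model, refs in sorted(downstream_refs.items()):
    broken = refs & removed
    if broken:
        print(f"PR would break '{model}': references {sorted(broken)}")
# PR would break 'revenue': references ['amount']
```

Flagging the broken reference on the pull request itself, rather than after a scheduled run fails, is what turns lineage from a debugging aid into a prevention mechanism.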

Our goal is to streamline data development across the entire company, helping engineering and data teams deploy changes faster and with greater confidence – Chat with us to see a demo.
