Blog
Technology
Overcoming Challenges in Source-code based Data Lineage

Overcoming Challenges in Source-Code Based Data Lineage

Technology
December 6, 2023
Barak Fargoun
Subscribe to our Newsletter
Get the latest from our team delivered to your inbox
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Ready to get started?
Try It Free

Our goal at Foundational is to validate every code change that may badly affect data, before it actually does. In order to do this successfully for data engineering code, for example in a framework such as dbt, we need to maintain an accurate data lineage graph, and understand every time new code is committed, what kind of lineage impact does the commit have.

Why is this important? Let’s consider one type of issue that is quite common in data, called field type mismatch. In this case, a certain field in a table is dependent on other upstream fields of the same type - for example, Boolean. If the type is changed for one of these upstream fields without changing the type of all the dependent fields, we may encounter a problem (which quite often will not get flagged). To identify this before the new code is merged, we would need to understand the lineage dependencies as well as the exact impact of the code change. 

Unfortunately, this type of analysis is not available in standard lineage information offered by the platforms or even data catalogs, hence the need for code validation such as the one we’re creating at Foundational.

In this blog post we’ll focus on SQL but very similar problems exist for other languages as well. Let’s elaborate on some of these problems:

Problem #1: Ambiguities

Consider the following query:

SELECT a, b
FROM table1, table2

Analyzing this query while having the schema information is generally easy, but in the case of having only the source code, the query analyzer cannot determine which column belongs to which table. For this query, each of the following schemas could be valid:

  • [.code]Table1→ a,b[.code] ; [.code]Table2 → [][.code]
  • [.code]Table1 → a[.code] ; [.code]Table2 → b[.code]
  • [.code]Table1 → [][.code] ; [.code]Table2 → a,b[.code]

How can we then determine which one is it for an actual query? Here are two methods that can be used:

  • Sometimes that above statement is only one part of a larger query, for example it could be that the rest of the query looks like this:
(SELECT a,b
FROM table1, table2
UNION ALL
(SELECT c
FROM table1
WHERE a == '1' and b == '2')

In this case, we can identify the last part of the query that ties [.code]a[.code] and [.code]b[.code] to [.code]table1[.code] through the WHERE clause, and determine that [.code]table1[.code] is the one having these columns.

  • We may have references to [.code]table1[.code] and [.code]table2[.code] in other queries, and through those determine the exact relationship.

Generally, ambiguity can be solved with more information and methods like these help address it when analyzing a large set of queries, for example in a dbt project, relying on source code only.

Problem #2: Warehouse-specific SQL

While SQL is similar for the most part, there are numerous differences between SQL “flavors” used by different types of data warehouses, for example between Redshift and Snowflake as well as between Snowflake and BigQuery.

Different warehouses support different formats and functions and may also accept some types of functions while not accepting others. These behaviors may also be the reason for code issues that could lead to data incidents or performance impact.

Unfortunately, it also seems that the open-source parsers do not handle these well enough, so we had to work hard to make warehouse-specific adjustments to our parsers.

Problem #3: Advanced functions

Overall, SQL that uses advanced functions is harder to analyze, and we couldn’t reference open-source parsers, which generally do not support these well. For example, structs are not well supported, and CASE … WHEN statements are harder to analyze as well.

Another example is dynamic functions such as PIVOT, which rely on the actual data that we do not access. The approach to handling these, similarly to handling ambiguities, is to look for additional references in the subsequent code that may reference the generated columns.

Understand the impact of code changes with Foundational

At Foundational, we are solving helping data teams deploy code changes with confidence. Understanding the full impact on lineage is only one aspect of what we do – Connect with us to learn more.

code snippet <goes here>
<style>.horizontal-trigger {height: calc(100% - 100vh);}</style>
<script src="https://cdnjs.cloudflare.com/ajax/libs/gsap/3.8.0/gsap.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/gsap/3.8.0/ScrollTrigger.min.js"></script>
<script>
// © Code by T.RICKS, https://www.timothyricks.com/
// Copyright 2021, T.RICKS, All rights reserved.
// You have the license to use this code in your projects but not to redistribute it to others
gsap.registerPlugin(ScrollTrigger);
let horizontalItem = $(".horizontal-item");
let horizontalSection = $(".horizontal-section");
let moveDistance;
function calculateScroll() {
 // Desktop
 let itemsInView = 3;
 let scrollSpeed = 1.2;  if (window.matchMedia("(max-width: 479px)").matches) {
   // Mobile Portrait
   itemsInView = 1;
   scrollSpeed = 1.2;
 } else if (window.matchMedia("(max-width: 767px)").matches) {
   // Mobile Landscape
   itemsInView = 1;
   scrollSpeed = 1.2;
 } else if (window.matchMedia("(max-width: 991px)").matches) {
   // Tablet
   itemsInView = 2;
   scrollSpeed = 1.2;
 }
 let moveAmount = horizontalItem.length - itemsInView;
 let minHeight =
   scrollSpeed * horizontalItem.outerWidth() * horizontalItem.length;
 if (moveAmount <= 0) {
   moveAmount = 0;
   minHeight = 0;
   // horizontalSection.css('height', '100vh');
 } else {
   horizontalSection.css("height", "200vh");
 }
 moveDistance = horizontalItem.outerWidth() * moveAmount;
 horizontalSection.css("min-height", minHeight + "px");
}
calculateScroll();
window.onresize = function () {
 calculateScroll();
};let tl = gsap.timeline({
 scrollTrigger: {
   trigger: ".horizontal-trigger",
   // trigger element - viewport
   start: "top top",
   end: "bottom top",
   invalidateOnRefresh: true,
   scrub: 1
 }
});
tl.to(".horizontal-section .list", {
 x: () => -moveDistance,
 duration: 1
});
</script>
Share this post
Subscribe to our Newsletter
Get the latest from our team delivered to your inbox
Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Ready to get started?
Try It Free