Data Lineage for Businesses: The Beginner’s Guide

The Beginner’s Guide to Data Lineage for Businesses

If you want to implement data lineage for your business, you’ve come to the right place. In this beginner’s guide, you’ll learn what data lineage is, why it is important, and the various use cases it can solve. Additionally, you'll discover how to choose a great data lineage tool, implement data lineage effectively, and maintain it over time. We will also address common challenges associated with data lineage.

Let’s get started!

What is data lineage?

You’ve probably seen many definitions of data lineage — data flow, data provenance, data lifecycle, data tracking, and even a data family tree. Most of them describe data lineage as the process of recording data movement in your business from point of entry to point of exit.

But in reality, it’s much more than that.

Data lineage is a living, breathing map of data movement within your business.

For example, this is how we show data lineage in Foundational:

Why is data lineage important?

Data lineage can help your business with:

  • Data quality and governance: See where the data is coming from, where it’s going, who is using it, how it’s being used and transformed, and so on.
  • Troubleshooting and improving data development: See both the upstream and downstream dependencies when making changes or building new functionality.
  • Keeping it all in one place: Make data changes, track them, and maintain them.

For example, you can get answers to the following questions (grouped by data movement direction):

  1. Coverage (left → right):
    • Where is the original data source (left)?
    • How is that data being used now (right)?
  2. Layers (up → down):
    • Who created this table?
    • When was it last updated?
    • Who is using it now?
    • What is its business impact?
    • Who are the business owners?
  3. Graph (point → point):
    • How is the data moved between table 1 and table 2?
    • How is it manipulated as it’s moved?

And so on.

Here is how you’d answer the question “Where is this data coming from?” in Foundational:

  1. Select a table, a column, or a dashboard.
  2. Open up more elements by clicking on the “+” sign.
  3. Select a table you want to view in detail. It’ll turn purple and show you both the upstream and the downstream dependencies.
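
If you are curious what "upstream" and "downstream" mean under the hood, here is a minimal sketch in Python. It is not how Foundational works internally; it only illustrates the idea. The table names and the `EDGES` list are made up for this example.

```python
from collections import deque

# Hypothetical lineage edges: each pair means "data flows from source to target".
EDGES = [
    ("raw_orders", "clean_orders"),
    ("raw_customers", "clean_customers"),
    ("clean_orders", "revenue_by_customer"),
    ("clean_customers", "revenue_by_customer"),
    ("revenue_by_customer", "revenue_dashboard"),
]

def neighbors(edges, node, direction):
    """Return the direct upstream or downstream neighbors of a node."""
    if direction == "upstream":
        return [src for src, dst in edges if dst == node]
    return [dst for src, dst in edges if src == node]

def walk(edges, start, direction):
    """Breadth-first walk that collects every node reachable in one direction."""
    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(edges, current, direction):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Everything that feeds the selected table, and everything that depends on it.
upstream = walk(EDGES, "revenue_by_customer", "upstream")
downstream = walk(EDGES, "revenue_by_customer", "downstream")
print("upstream:", sorted(upstream))
print("downstream:", sorted(downstream))
```

Selecting a table in a lineage view is roughly the same as running both walks from that table and highlighting the results.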

Figure: Lineage in Foundational

What use cases can data lineage solve?

Let’s take a look at a couple of use cases.

Use case #1: Creating a marketing report for a small B2B software company

Let’s say you’re a small B2B software company with about 200 employees. You’re building out your data team with 1-2 data engineers, 2 analysts, a data scientist, and a team lead, and you’re using Snowflake to store your data.

You want to create a marketing dashboard with an ad campaign report.

What do you do?

  • You collect data from all your sources: load your Facebook data into a table in Snowflake and call it the “Facebook table,” then do the same for LinkedIn, Salesforce, and so on.
  • You create another table that merges all of the previous tables, so you can run a report on it.
  • You create more tables, but now it’s hard to keep track of them.

How can data lineage help?

Data lineage can show you what data is going into which table and where it’s going after.
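
To make this concrete, here is a small Python sketch of the marketing example. The table names (`facebook_ads`, `all_campaigns`, and so on) are hypothetical, and the sketch assumes the lineage has no cycles. It lists every path from an original source to the report.

```python
# Hypothetical table names; each table maps to the tables it is built from.
BUILT_FROM = {
    "facebook_ads": [],
    "linkedin_ads": [],
    "salesforce_leads": [],
    "all_campaigns": ["facebook_ads", "linkedin_ads", "salesforce_leads"],
    "ad_campaign_report": ["all_campaigns"],
}

def paths_to(table):
    """List every path from an original source table down to `table`.

    Assumes the lineage is acyclic; a cycle would recurse forever.
    """
    inputs = BUILT_FROM.get(table, [])
    if not inputs:
        return [[table]]  # No inputs: this table is an original source.
    return [path + [table] for src in inputs for path in paths_to(src)]

for path in paths_to("ad_campaign_report"):
    print(" -> ".join(path))
```

Each printed line is one route the data takes from a source into the report, which is the question you need answered once the number of tables grows.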

Use case #2: Reducing the cost of data storage for a mid-sized B2C enterprise

Let’s say you’re a mid-sized B2C e-commerce enterprise with millions of customers and a team of 1,500-2,000 employees. The complexity of your data ecosystem is growing every day. You have a few dozen data engineers, analysts, data scientists, and data platform engineers, all responsible for ingesting, transforming, and making the data usable for your business. And you have a data lake or even multiple data lakes.

You want to reduce the cost of data storage.

What do you do?

  • You take a look at hundreds and hundreds of Snowflake tables.
  • You realize you don’t know if you actually use them all. It’s hard to keep track of so many.
  • You wonder who created this table months ago. Did they already leave the company while you’re still paying for it? Maybe you shouldn’t be paying for it. Maybe it’s obsolete.
  • You see that there is not just one obsolete table but hundreds of them, and you’re unsure which ones you need to delete and which ones you need to keep.

How can data lineage help?

Data lineage can show you the paths to all parts of your data stack so you can understand how it all fits together.
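
As a rough illustration, here is a Python sketch that combines lineage with usage data to flag cleanup candidates. Everything in it is an assumption made for the example: the table names, the owners, the dates, and the 90-day idle threshold. Treat the output as a list to review with the owners, not a list to delete.

```python
from datetime import date

# Hypothetical inventory: table -> (owner, date it was last queried).
TABLES = {
    "orders": ("ana", date(2024, 6, 1)),
    "orders_daily": ("ana", date(2024, 6, 1)),
    "tmp_export_2022": ("former_employee", date(2022, 11, 3)),
    "old_segments": ("former_employee", date(2023, 1, 15)),
}
# Hypothetical lineage edges (source, target).
EDGES = [
    ("orders", "orders_daily"),
    ("orders_daily", "sales_dashboard"),
    ("old_segments", "tmp_export_2022"),
]
# Assets that people actually look at, such as live dashboards.
ENDPOINTS = {"sales_dashboard"}

def feeds_endpoint(table, seen=None):
    """True if any downstream path from `table` reaches a live endpoint."""
    if seen is None:
        seen = set()
    if table in ENDPOINTS:
        return True
    seen.add(table)
    return any(
        feeds_endpoint(dst, seen)
        for src, dst in EDGES
        if src == table and dst not in seen
    )

def cleanup_candidates(today, max_idle_days=90):
    """Tables that are idle and do not feed any live endpoint."""
    candidates = []
    for table, (owner, last_queried) in TABLES.items():
        idle_days = (today - last_queried).days
        if idle_days > max_idle_days and not feeds_endpoint(table):
            candidates.append(table)
    return sorted(candidates)

for table in cleanup_candidates(date(2024, 6, 30)):
    print(table, "owned by", TABLES[table][0])
```

The lineage check matters as much as the idle check: a table nobody queries directly can still feed a dashboard that people rely on.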

How to choose a great lineage tool

You’ve decided to implement a data lineage tool. Congratulations! But how do you choose the right one from so many?

There are tools that are okay, and tools that are great.

An okay data lineage tool will map only some of the data sets in your data stack. It’ll check the box on compliance and basic usage, but it won’t be easy to adopt, so your data team won’t use it much. And it won’t address areas like data management and data quality.

A great data lineage tool will map out all your data sets across your entire data stack (dbt, Snowflake, BigQuery, etc.) and all their origins, transformations, incidents, resting stops, and endpoints. It’ll help you increase your data coverage from 60% to 90%, make critical business decisions, ensure data compliance, and improve data management.

Look for a tool that has:

  1. High resolution: Supports both table-level and column-level lineage across all entities so you can track any data issue from table to table.
  2. Great UX: Provides intuitive and dynamic visualizations so you can see the data movement, search it, navigate it, and zoom in and out to get more detail.
  3. Easy integration: Integrates seamlessly with your existing data stack (databases, data warehouses, ETL tools, analytics platforms, etc.) and has pre-built connectors or APIs for data technologies like Snowflake, Redshift, and BigQuery.
  4. Compliance: Gives you mechanisms such as data masking, role-based access control, and audit trails so you can protect sensitive data and comply with regulations like GDPR, HIPAA, and PCI-DSS.
  5. Metadata management: Lets you annotate your data assets with descriptions, tags, and custom attributes so you can search them, run impact analysis, and see how changes will affect them.
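
To see why resolution matters for impact analysis, here is a short Python sketch that compares column-level and table-level lineage for the same change. The column names and the `table.column` naming scheme are assumptions made for this example.

```python
# Hypothetical column-level edges, written as "table.column".
COLUMN_EDGES = [
    ("orders.amount", "revenue.total"),
    ("orders.status", "ops_report.open_count"),
    ("revenue.total", "finance_dashboard.monthly_revenue"),
]

def table_of(column):
    """Return the table part of a "table.column" name."""
    return column.split(".")[0]

def reachable(edges, start):
    """Collect everything downstream of `start` by following edges."""
    found, stack = set(), [start]
    while stack:
        current = stack.pop()
        for src, dst in edges:
            if src == current and dst not in found:
                found.add(dst)
                stack.append(dst)
    return found

# Table-level lineage is the same graph with the column detail thrown away.
TABLE_EDGES = {(table_of(src), table_of(dst)) for src, dst in COLUMN_EDGES}

# Suppose a pending change touches only orders.status.
column_impact = reachable(COLUMN_EDGES, "orders.status")
table_impact = reachable(TABLE_EDGES, "orders")
print("column-level impact:", sorted(column_impact))
print("table-level impact:", sorted(table_impact))
```

With column-level lineage the change affects one report column. With table-level lineage alone, every asset downstream of the `orders` table looks affected, so reviewers have more to check by hand.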

Figure: Lineage impact of a pending pull request

How to implement a data lineage tool

Here's how to implement a data lineage tool step-by-step:

1. Prioritize what you need your data lineage for

First, you need to prioritize what you need your data lineage for. Is it security? Compliance? Data quality? Data development? And which one is more important to your business?

Let’s go through them one by one.

Security means tracking the movement and transformation of data across your systems and processes so you can spot potential weaknesses and protect sensitive information.

  • Map data access and permissions to spot suspicious activity.
  • Monitor data transfers to detect and prevent security breaches or data leaks.
  • Implement encryption and access controls to protect sensitive data.

Compliance means following regulations like GDPR, HIPAA, and PCI-DSS.

  • Document how data moves and changes to ensure regulatory reporting is accurate.
  • Match the rules and requirements that apply to your industry.
  • Automate checks and audits to follow the rules.

Data quality means tracking where your data is coming from, how it changes, and whether there are any issues with its accuracy, completeness, or consistency.

  • Identify data sources to check data quality.
  • Monitor data transformations to make sure your data is reliable and consistent.
  • Set quality standards to keep improving data quality management.

Data development is nearly impossible without data lineage.

  • Document data dependencies to support agile development methodologies and DevOps practices.
  • Educate your team on data lineage requirements.
  • Optimize data development processes and workflows.

2. Determine the scope

Next, determine the scope. What are the must-haves that your data lineage should cover? And what are the nice-to-haves? Do you need it for your databases? Your BI tools? Your data lake?

Here are some examples of must-haves and nice-to-haves.

Must-haves:

  1. Data storage and compute: databases, data warehouses, cloud storage solutions, and external data feeds.
  2. Data flows: data movement across all your systems, processes, and applications and how it’s ingested, processed, and transformed.
  3. Data transformations: data cleansing, aggregation, and enrichment.
  4. Reporting: Can be internal (business intelligence) or user-facing.

Nice-to-haves:

  1. Data lake support: data lake assets and workflows.
  2. Metadata enrichment: data ownership, data classifications, and data usage policies.
  3. Real-time monitoring: identifying issues, anomalies, or deviations from expected data flows.
  4. Scalability and performance: accommodating the growing volume and complexity of your data ecosystem (performance optimization, resource management, and support for distributed architectures).

3. Figure out your implementation strategy

Next, you need to figure out your implementation strategy.

Do you want to buy a tool, or do you want to build one yourself?

Ultimately, there is no single right data lineage tool for every job. Some tools will be more mature than others, depending on your needs and scope.

For example, if you’re a bank, there is probably a tool that can help you with your databases. But why do you need it and what for? Answering these questions can help you with your implementation strategy.

4. Determine your ongoing strategy for keeping data lineage up to date

You chose a data lineage tool and implemented it. But now you need to determine your ongoing strategy for keeping data lineage up to date (both technical and process).

Let’s say you created a process where everyone who makes a new dashboard needs to document it: every time someone creates a new dashboard, they manually check that its description is accurate. That’s a data lineage process, and most companies don’t have one.

Make sure you create a process for every change and always follow it.
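
One way to make such a process stick is to check it automatically. Below is a minimal Python sketch of a documentation check you could run whenever a dashboard is added. The required fields and the dashboard records are hypothetical; adapt them to whatever your own process asks for.

```python
# Hypothetical rule: every dashboard must name an owner, a description, and its sources.
REQUIRED_FIELDS = ("owner", "description", "sources")

# Hypothetical dashboard records, for example exported from your BI tool.
DASHBOARDS = [
    {
        "name": "ad_campaign_report",
        "owner": "maria",
        "description": "Weekly ad spend by channel",
        "sources": ["all_campaigns"],
    },
    {
        "name": "churn_overview",
        "owner": "",
        "description": "Monthly churn",
        "sources": [],
    },
]

def missing_fields(dashboard):
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not dashboard.get(field)]

def undocumented(dashboards):
    """Map each incomplete dashboard to the fields it still needs."""
    report = {}
    for dashboard in dashboards:
        gaps = missing_fields(dashboard)
        if gaps:
            report[dashboard["name"]] = gaps
    return report

problems = undocumented(DASHBOARDS)
for name, gaps in problems.items():
    print(f"{name} is missing: {', '.join(gaps)}")
```

A check like this does not replace the process, but it tells you when the process was skipped.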

5. Think about how you’ll be publishing your data lineage information

Finally, you need to think about how you’ll be publishing your data lineage information.

How will you make it accessible for a variety of use cases?

Make sure you can tie it back to the use cases—to show people how they should consume and use your data lineage information.

How to maintain data lineage

As soon as you implement data lineage, you’ll need to maintain it. That means regularly monitoring it and updating it, to keep it accurate.

If you use Foundational, your updates will be fully automated and easy to embed into your existing workflow.

With Foundational, you can:

  • Conduct regular audits of your data lineage to ensure accuracy and completeness.
  • Establish processes for updating data lineage when there are changes (adding new data sources, changing data flows, modifying transformations, etc.).
  • Implement governance policies that support data lineage maintenance. This can include roles and responsibilities for managing data lineage information, data quality standards, and procedures for handling changes to data assets.
  • Collect user feedback to improve data lineage and keep aligning it with your business needs.

How to address common data lineage challenges

Finally, let’s talk about the three most common data lineage challenges.

1. Too much legacy

Let’s say you have a legacy database. The problem is, no one in your organization knows how the data sets in that database are being used or who owns what. But you found some code, and you see that it’s using the data somehow. You won’t delete the data, but you need to understand all the different assets.

Here is how you can understand what’s happening:

  1. Inventory and documentation: Conduct a thorough inventory of all your legacy systems and document all your existing data assets. Seeing what data exists and how it's currently used is the first step in integrating these assets into your data lineage tool.
  2. Ownership and accountability: Assign clear ownership for each data set. Knowing who is responsible for each piece of data will help you manage and update lineage information.
  3. Gradual integration: Integrate your legacy systems into your data lineage gradually. Prioritize systems based on their importance to business processes and the quality of data.

2. Poor accuracy

You do something with your data, but you realize that its accuracy is poor. It doesn’t show up correctly.

Here is how to improve it:

  1. Validation and quality checks: Implement data validation and quality checks at key points in your data movement. This will help you identify and correct inaccuracies early.
  2. Feedback loops: Establish feedback loops that allow users to report inaccuracies. User insights can help you spot sources of errors and improve data quality.
  3. Continuous monitoring: Use automated tools to continuously monitor data quality. Automation can help you scale the quality checks and ensure high accuracy levels across large data sets.
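
Here is a minimal Python sketch of what validation checks at a key point in your data movement can look like. The rows, the column names, and the two checks are assumptions chosen for illustration; real checks would run against your warehouse rather than in-memory lists.

```python
# Hypothetical rows before and after a transformation step.
SOURCE_ROWS = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 75.5},
]
TARGET_ROWS = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 3, "amount": 75.5},
]

def check_not_null(rows, column):
    """Fail if any row has no value in the given column."""
    bad = [row for row in rows if row.get(column) is None]
    return {"check": f"{column} not null", "passed": not bad, "failures": len(bad)}

def check_row_count(source, target):
    """Fail if the step silently dropped or duplicated rows."""
    difference = abs(len(source) - len(target))
    return {"check": "row count matches", "passed": difference == 0, "failures": difference}

results = [
    check_not_null(SOURCE_ROWS, "amount"),
    check_row_count(SOURCE_ROWS, TARGET_ROWS),
]
failed = [result["check"] for result in results if not result["passed"]]
print("failed checks:", failed)
```

Running checks at each hop tells you where an inaccuracy first appears, and lineage tells you which downstream assets inherited it.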

3. Constant updates

Implementing data lineage may take anywhere from six to twelve months. Because it takes so long, some businesses just give up. But even after you have successfully implemented it, the whole thing goes out of date very quickly. You need to put in lots of processes to keep it up to date. It’s surprisingly manual, because the technology for it hasn’t changed for years.

Here is how you can keep it updated:

  1. Automation: Wherever possible, automate it.
  2. Change management processes: Implement change management processes that can update data lineage information as part of the workflow for making changes to data assets.
  3. Regular audits: Schedule regular audits to ensure it reflects the current state of your data ecosystem. This can help identify areas where you need updates.
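
An audit can be as simple as comparing the lineage you have documented with the lineage you actually observe, for example from recent query logs. The Python sketch below assumes both are available as sets of (source, target) pairs; the table names are hypothetical.

```python
# Hypothetical documented lineage: what your lineage map says.
DOCUMENTED = {
    ("facebook_ads", "all_campaigns"),
    ("linkedin_ads", "all_campaigns"),
    ("all_campaigns", "ad_campaign_report"),
}
# Hypothetical observed lineage: what recent jobs actually read and wrote.
OBSERVED = {
    ("facebook_ads", "all_campaigns"),
    ("tiktok_ads", "all_campaigns"),
    ("all_campaigns", "ad_campaign_report"),
}

def audit(documented, observed):
    """Report edges that are documented but gone, and edges nobody documented."""
    return {
        "stale": sorted(documented - observed),
        "undocumented": sorted(observed - documented),
    }

report = audit(DOCUMENTED, OBSERVED)
print("stale:", report["stale"])
print("undocumented:", report["undocumented"])
```

Stale edges point to documentation you can retire, and undocumented edges point to changes that skipped your change management process.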

Chat with us

There are some clear advantages to using a great data lineage tool for your business. It’s worth evaluating the many data lineage tools out there to choose the right one for you.

Want to learn more about data lineage for businesses and how we can help you with implementing, automating, and updating it?

Chat with us! We’d love to hear from you.
