What are Data Contracts – and How to Implement Them Effectively

January 18, 2024
Alon Nafta

One of the most common complaints from people working with data today is that data changes unexpectedly, which often breaks dashboards. Other elements, such as queries, jobs, and models, can break too, but it is usually the dashboard that visibly stops working, sending data engineers off to work out where, how, and why the data changed.

By their nature, queries are fragile. In software engineering, cross-system interfaces are typically managed through an API. In the data stack, every query, whether it is part of an Airflow job or a dbt model, defines an interface that needs to be managed, but often is not.

Consider the following scenarios:

  • An Analytics Engineer makes a schema change in a dbt model, without realizing that a Tableau dashboard expects a different column type from the one the new model produces.
  • A Backend Engineer renames a column in Postgres, the operational database from which data is ingested into the data warehouse, and dozens of warehouse tables that depend on that column stop receiving its data.
  • A Data Engineer removes a Kafka topic, without realizing that some parts of the data stack still expect those events.

In all of these cases, a change made in one place is pushed without the necessary downstream adjustments, and that is what causes incidents.

Why Are Data Contracts Needed?

Several underlying reasons explain why data contracts are needed in modern data architectures:

  • Fragmented ownership - Data is handled in different stages by different teams that work in separate projects and workspaces. The tools are often split between Engineering, Data Engineering, and Analytics. A change in one project will not be visible to other teams outside of that project, which creates a process problem even around small changes.
  • The Data Stack is Fragile - Whether the cause is the language itself or the maturity of the tools, things in data break a lot. Even when a query still runs, unexpected changes in values can be equally problematic. In the modern data stack, data is frequently exchanged between different elements with very few ongoing checks, which makes the overall system extremely fragile.
  • Change Management technology in data is often immature - Very often, changes are simply not checked at build time. Some frameworks, such as dbt, do not include syntax checks as part of the build process, and many data teams overlook CI and pull requests. As a result, even simple changes are often feared and avoided.

Therefore, implementing data contracts can be important for several reasons:

  • Consistency - Data contracts make sure that data looks and acts the same way every time it's shared. This helps avoid issues when data is exchanged between different parts of the stack.
  • Quality - Data contracts set standards for data quality. This helps organizations establish trust in the data that is being used.
  • Collaboration - When there is clarity and transparency around how data is formatted and used, it is easier for different types of data developers to work together. Teams can collaborate without sharing a GitHub project or exchanging numerous Slack messages about a simple change.
  • Mission-critical Data - Data contracts help organizations ensure that mission-critical data, such as PII or data used by ML applications, always maintains its integrity and quality.

Data Contracts Explained

Although the concept of data contracts presumably dates back to the 1980s, it has re-emerged as a popular topic in discussions of data quality in the modern data stack and, as such, has received different interpretations and definitions.

At its core, a data contract is a specification that applies to a single element in the data stack, for example, a table. There is no strict requirement as to what should be defined, but two types of definitions commonly appear:

  1. Schema - Which fields or columns exist and what their types are.
  2. Values - What the expected values, or ranges, are for each field or column.
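
As a minimal sketch of these two kinds of definitions, the Python below models a contract for a hypothetical `orders` table. The names `ColumnSpec` and `DataContract`, and the columns themselves, are assumptions made for illustration and do not come from any specific tool.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class ColumnSpec:
    """One column in a contract: its type (schema) and its expected values."""
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    allowed: Optional[frozenset] = None


@dataclass
class DataContract:
    """A specification for a single element of the data stack, e.g. a table."""
    name: str
    columns: dict[str, ColumnSpec] = field(default_factory=dict)

    def validate_row(self, row: dict[str, Any]) -> list[str]:
        """Return a list of human-readable violations for one row."""
        violations = []
        for col, spec in self.columns.items():
            if col not in row:
                violations.append(f"{col}: missing column")
                continue
            value = row[col]
            # Schema check: the value must have the contracted type.
            if not isinstance(value, spec.dtype):
                violations.append(
                    f"{col}: expected {spec.dtype.__name__}, got {type(value).__name__}"
                )
                continue
            # Value checks: ranges and allowed sets are optional.
            if spec.min_value is not None and value < spec.min_value:
                violations.append(f"{col}: {value} is below {spec.min_value}")
            if spec.max_value is not None and value > spec.max_value:
                violations.append(f"{col}: {value} is above {spec.max_value}")
            if spec.allowed is not None and value not in spec.allowed:
                violations.append(f"{col}: {value!r} is not an allowed value")
        return violations


# Hypothetical contract for an `orders` table.
orders_contract = DataContract(
    name="orders",
    columns={
        "order_id": ColumnSpec(dtype=int, min_value=1),
        "amount": ColumnSpec(dtype=float, min_value=0.0),
        "status": ColumnSpec(dtype=str, allowed=frozenset({"open", "paid", "refunded"})),
    },
)
```

Calling `orders_contract.validate_row(row)` returns an empty list for a conforming row and one message per violation otherwise. Returning messages rather than raising keeps it up to the caller to decide whether a violation should block or only warn.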

In addition, we recognize two parts to managing data contracts:

  1. Defining them and maintaining those definitions over time.
  2. Enforcing them when changes are made.
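
To illustrate the enforcement part, the sketch below compares a proposed schema with the contracted one and lists the breaking changes. The function name `check_schema_change`, the table columns, and the type names are hypothetical; a real check would read both schemas from the warehouse or from the code under review.

```python
def check_schema_change(contract: dict[str, str], proposed: dict[str, str]) -> list[str]:
    """Compare a proposed schema with the contracted one and list breaking changes.

    Columns that only exist in `proposed` are treated as additive and allowed.
    """
    problems = []
    for column, expected_type in contract.items():
        if column not in proposed:
            problems.append(f"column '{column}' was removed or renamed")
        elif proposed[column] != expected_type:
            problems.append(
                f"column '{column}' changed type from {expected_type} to {proposed[column]}"
            )
    return problems


# Hypothetical example: a rename plus a type change on a `users` table.
contract = {"user_id": "bigint", "email": "text", "signup_date": "date"}
proposed = {"user_id": "bigint", "email_address": "text", "signup_date": "timestamp"}
problems = check_schema_change(contract, proposed)
```

Here `problems` reports the missing `email` column and the type change on `signup_date`, while the new `email_address` column passes because additions do not break existing consumers.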

Better-defined contracts naturally create stronger resilience to issues, but they also add complexity: the definitions must be managed, updated, and enforced throughout the data stack. A huge challenge in implementing data contracts is therefore introducing them to an existing large-scale data stack. Creating definitions for hundreds or thousands of tables, and enforcing them in a way that does not severely hurt productivity, is extremely hard. And ultimately, the top priority for everyone is to use data to power decisions, so productivity always wins.

A Pragmatic Approach to Implementing Data Contracts

When thinking about implementing data contracts in an existing organization, we need to assume that a mature, large-scale data stack already exists. There are thousands of different elements, and a similar number of dependencies. But ultimately, a person making a change wants to answer a very simple set of questions: 

  1. What will I impact?
  2. What type of impact will it have? 
  3. Will it cause an issue?

These questions should be answered at build time, before a developer merges code to production and while issues can still be fixed without negatively affecting the data. The natural place for this check is the pull request, which gives change management an organized process. Working with pull requests is a cornerstone of change management in software development, and it should be the same for data engineering.
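
The first question, "What will I impact?", can be sketched as a walk over a lineage graph. The graph below and the element names in it are hypothetical; in practice the lineage would have to be extracted from the code and tools in the stack.

```python
from collections import deque

# Hypothetical lineage: each key feeds the elements listed as its value.
LINEAGE = {
    "postgres.users": ["warehouse.raw_users"],
    "warehouse.raw_users": ["dbt.stg_users"],
    "dbt.stg_users": ["dbt.dim_users", "dbt.user_activity"],
    "dbt.dim_users": ["tableau.growth_dashboard"],
    "dbt.user_activity": [],
    "tableau.growth_dashboard": [],
}


def downstream_of(changed: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every element downstream of `changed`."""
    seen = set()
    order = []
    queue = deque(lineage.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already reached through another path
        seen.add(node)
        order.append(node)
        queue.extend(lineage.get(node, []))
    return order


# What would a change to the staging model impact?
impacted = downstream_of("dbt.stg_users", LINEAGE)
```

A pull request check could run a walk like this for every changed element and post the impacted list as a comment, so the developer sees the downstream effect before merging.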

Another aspect of implementing data contracts concerns the process of managing a change:

  1. What changes, assuming they are validated as safe, are okay to deploy without any additional approval?
  2. What changes may require notifying data consumers who use the impacted data?
  3. What changes may require additional approval due to their sensitivity or impact on mission-critical data?
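
One possible way to encode these three outcomes is a small routing function. This is a sketch of a single policy, not a recommendation: the `Action` names, the `MISSION_CRITICAL` set, and the rule that only schema changes trigger a notification are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    AUTO_DEPLOY = "deploy without additional approval"
    NOTIFY = "notify data consumers"
    REQUIRE_APPROVAL = "require additional approval"


# Hypothetical set of elements tagged as mission-critical or sensitive.
MISSION_CRITICAL = {"finance.revenue_report", "ml.churn_features"}


def route_change(impacted: list[str], schema_changed: bool) -> Action:
    """Pick a process for a change that has already been validated as safe."""
    # Anything touching mission-critical data needs a human sign-off.
    if any(element in MISSION_CRITICAL for element in impacted):
        return Action.REQUIRE_APPROVAL
    # Schema changes that reach consumers are announced to them.
    if schema_changed and impacted:
        return Action.NOTIFY
    return Action.AUTO_DEPLOY
```

For example, `route_change(["ml.churn_features"], schema_changed=False)` returns `Action.REQUIRE_APPROVAL`, while a change with no impacted elements is deployed without further steps.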

Different tools and frameworks can be used for these processes according to the organization’s task and change management methodologies.

Our Approach at Foundational

At Foundational, we are pragmatists. We consider the starting point of the data stack as a constraint that is hard, maybe even impossible, to change. Our goal is to take an existing data stack and automatically create the entire set of implied data contracts that already exist, while allowing data teams to add new contracts easily:

  1. For every code change throughout the operational and analytical data stacks, developers are able to understand impact and validate their changes.
  2. Data owners and data consumers can add additional rules (Contracts) to alert about changes or require additional approval.
  3. Most data contracts are already implied by the existing code and as such, can be automated without needing to be explicitly defined.
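
To make the idea of an implied contract concrete, the sketch below derives one from column usage. It is an illustration only and not a description of Foundational's implementation. It assumes the column references of each consumer have already been extracted (the `USAGE` records are hypothetical), which in a real stack is the hard part.

```python
from collections import defaultdict

# Hypothetical, pre-extracted records: (consumer, table, column it reads).
USAGE = [
    ("tableau.growth_dashboard", "dim_users", "signup_date"),
    ("tableau.growth_dashboard", "dim_users", "country"),
    ("dbt.user_activity", "dim_users", "user_id"),
    ("dbt.user_activity", "dim_users", "signup_date"),
]


def implied_contracts(usage):
    """Map each table to the columns its consumers rely on, and who relies on them."""
    contracts = defaultdict(lambda: defaultdict(set))
    for consumer, table, column in usage:
        contracts[table][column].add(consumer)
    return contracts


def consumers_broken_by_drop(contracts, table, column):
    """List the consumers that would break if `column` were dropped from `table`."""
    return sorted(contracts.get(table, {}).get(column, set()))


contracts = implied_contracts(USAGE)
broken = consumers_broken_by_drop(contracts, "dim_users", "signup_date")
```

In this example nobody wrote a contract for `dim_users`, yet the existing consumers already imply that `signup_date` must keep existing, since dropping it would break both of them.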

We believe this approach makes it easy for large-scale organizations to think and act around data contracts without fundamentally changing the development process or negatively impacting productivity.

Chat with us 

At Foundational, we are solving extremely complex problems that data teams face on a day-to-day basis. Automating data contracts is just one aspect. Connect with us to learn more.
