Boosting Data Quality with Automated Data Contracts

March 7, 2024
Alon Nafta

Interest in Data Contracts has surged in the past year, as data quality challenges continue to hinder organizations trying to leverage data to drive business value.

As with any new concept, a lot of great content is being published, along with competing proposals for how data contracts should be defined and enforced.

In this article, we’ll cover the key concepts around data contracts with a focus on implementation, which is often overlooked or underrepresented. For example, writing a detailed contract for every table in your data warehouse is neither easy nor necessary; lighter-weight approaches can achieve similar results.

What are Data Contracts – and how can they contribute to data quality?

Data Contracts are a concept that brings a set of software development principles into data management: versioning, service level agreements (SLAs), and continuous integration (CI). The motivation is to prevent the damage done by unexpected changes.

In software, this problem is most familiar with API endpoints. If I maintain a popular API endpoint and, in a version update, decide to remove a field from the results, every existing consumer that has already integrated with the API is impacted. My responsibility as the owner of the API is to properly inform all of its consumers that I’m about to roll out a breaking change. In practice, that process is burdensome enough that developers aim to always make changes in a backward-compatible way.
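
To make this concrete, here is a toy Python sketch of a consumer breaking when a producer removes a field. The payloads and field names are hypothetical:

# Toy sketch of a breaking API change; all payloads and field
# names here are hypothetical.

# v1 response shape that consumers integrated against:
response_v1 = {"id": 42, "email": "user@example.com", "plan": "pro"}

# v2 removes the "plan" field in a version update:
response_v2 = {"id": 42, "email": "user@example.com"}

def consumer(payload: dict) -> str:
    # Existing consumer code, written against the v1 schema.
    return f"User {payload['id']} is on the {payload['plan']} plan"

print(consumer(response_v1))  # works as before

try:
    consumer(response_v2)
except KeyError as missing:
    print(f"consumer broke: field {missing} was removed upstream")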

A fragmented data stack

Due to the fragmented nature of a typical data platform, and the lack of a clear API-like interface between its components, data professionals experience the same pain points as their engineering counterparts, hence the need to define data contracts. What’s more, in the age of AI, where high-integrity data is needed for training, and as data systems become fragmented by design through the concept of data mesh, data contracts become a very useful tool for improving data quality across different functions. Data contracts contribute directly to data quality: utilized properly, they prevent and contain many, if not most, data quality issues before they reach production.

Consider a simple example: a software engineer changes an event definition in the operational database. The change is needed for a bigger initiative, but the event also happens to be heavily used in dbt models that run on the data warehouse. Exchanging all of that context about every event that flows into the warehouse is a lot of work, and defining individual contracts for every event that goes into dbt is awfully time-consuming.
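
A minimal sketch of that failure mode, with hypothetical event and column names: the upstream rename is harmless on its own, but the warehouse model still expects the old field:

# Hypothetical event payloads before and after an upstream refactor.
event_v1 = {"user_id": 7, "purchase_amount": 19.99}
event_v2 = {"user_id": 7, "amount_cents": 1999}  # field renamed upstream

# Columns a downstream dbt model selects from this event stream;
# illustrative here, though in practice derivable by parsing the model's SQL.
model_columns = {"user_id", "purchase_amount"}

missing = model_columns - set(event_v2)
if missing:
    print(f"dbt model will break: columns no longer produced: {missing}")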

Path to implementation

At Foundational, we look at data contracts as one of two types:

  1. An existing dependency, which needs to be discovered. A company can have hundreds if not thousands of these; we call them dependencies because a change on one side (the data producer) causes issues for the other side (the data consumer).
  2. A custom constraint that can be defined against any data object. The constraint can be highly detailed or just capture key requirements.

This hybrid approach yields a healthy mix: contracts added manually where they matter most, backed by a lot of automation to flag issues everywhere else. For example, I can assign a contract to a critical table to ensure its schema does not change, while more generally getting alerted whenever anything changes in the data sets I use regularly.
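
As a sketch of that second, custom-constraint type, pinning a critical table’s schema could look like the following. The table name, pinned schema, and live_schema lookup are all hypothetical stand-ins:

# A minimal sketch of a custom constraint: pin a critical table's schema
# and report any drift. Column names and types here are illustrative.
PINNED_SCHEMA = {
    "order_id": "string",
    "amount": "numeric",
    "created_at": "timestamp",
}

def live_schema(table: str) -> dict:
    # Stand-in for querying the warehouse's information_schema.
    return {"order_id": "string", "amount": "numeric"}  # created_at is gone

def check_contract(table: str) -> list:
    live = live_schema(table)
    issues = [f"missing column: {col}" for col in PINNED_SCHEMA if col not in live]
    issues += [
        f"type changed on {col}: {PINNED_SCHEMA[col]} -> {live[col]}"
        for col in PINNED_SCHEMA
        if col in live and live[col] != PINNED_SCHEMA[col]
    ]
    return issues

print(check_contract("analytics.orders"))  # ['missing column: created_at']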

Another common constraint is that data organizations already have hundreds or thousands of existing relationships. Implementing a separate contract for each one is a huge lift, yet we still want engineers to be able to push changes without worrying about breaking things, which is also more efficient.

Lastly, every new contract needs an owner, which ultimately may not be a bad thing, but in the short term it creates additional management overhead that the organization is not necessarily expecting.

What should manual contracts include?

The core requirement is schema definitions, which detail the fields and field types that define the asset. The asset could be an event streaming through Kafka or a Snowflake table. Putting these in a contract, together with an enforcement mechanism, ensures that everyone can rely on the asset for further use. Some contract definitions also include value and freshness requirements, which may be helpful in some cases but don’t always justify such a tight definition.

The contract definition can sit in a YAML or JSON file added to git, where it can also be enforced as part of CI/CD. Of course, enforcement should happen across every repository that might have an impact on that data.
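
For illustration, here is a minimal contract file parsed the way a CI step might parse it. The contract layout below is an assumption rather than a standard, and the snippet assumes PyYAML is installed:

# A sketch of a manual contract in YAML, parsed as a CI step might.
# The contract layout is illustrative, not a standard format.
import yaml  # assumes PyYAML (pip install pyyaml)

CONTRACT_YAML = """
asset: analytics.orders
owner: data-platform
schema:
  - {name: order_id, type: string, required: true}
  - {name: amount, type: numeric, required: true}
  - {name: created_at, type: timestamp, required: false}
freshness:
  max_delay_hours: 24
"""

contract = yaml.safe_load(CONTRACT_YAML)
required = [f["name"] for f in contract["schema"] if f.get("required")]
print(f"{contract['asset']} (owner: {contract['owner']}) requires: {required}")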

What should automated data contracts include?

Automated contracts should make every code change that can introduce a problem actionable, whether the problem is breaking a downstream query, introducing a privacy issue, or even driving up the warehouse’s cost. Enforcement should run automatically, with no definitions required.

No contract definition needs to be written anywhere, yet enforcement can still cover the entire system.
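
A toy sketch of what automated enforcement might compute on a pull request: diff the schema before and after the change, and flag only the breaks that downstream consumers would actually feel. The hard-coded lineage set is a stand-in for what a real system would derive from code analysis:

# Toy sketch of automated enforcement: no hand-written contract, just a
# schema diff cross-checked against (hypothetical) downstream lineage.
def breaking_changes(before: dict, after: dict, consumed: set) -> list:
    removed = set(before) - set(after)
    retyped = {c for c in before if c in after and after[c] != before[c]}
    return sorted(c for c in removed | retyped if c in consumed)

schema_before = {"user_id": "int", "plan": "string", "debug_flag": "bool"}
schema_after = {"user_id": "int", "debug_flag": "bool"}  # PR drops "plan"
downstream = {"user_id", "plan"}  # columns dashboards actually read

print(breaking_changes(schema_before, schema_after, downstream))  # ['plan']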

Implementing a set of custom contracts against key assets, while creating a layer of automation against common problems such as schema changes and semantic issues, is a good compromise between running a lengthy implementation project and leaving most of the stack unprotected.

Data contract enforcement is typically done through git

Foundational’s approach to automating data contracts

At Foundational, we took a unique approach to data contracts by deducing most of the definitions directly from the existing code. Because the product natively integrates with git, the effort of adopting a new tool is minimized, and ensuring coverage is straightforward.

Foundational checks and validates every code change across all the different repositories and stakeholders in the business, covering engineering changes happening upstream as well as changes to models and pipelines running on the warehouse. The goal is to unify data management across every function that may process or modify data. As a bonus, since we analyze code, many of the issues that would otherwise reach production, whether data quality, privacy, or even cloud cost issues, can be addressed and in most cases completely avoided.

Our goal is to streamline data development across the entire company, helping engineering and data teams deploy changes faster and with greater confidence. Chat with us to see a demo.
