There’s a counterintuitive difference between writing code for software and writing code for data: in software, you are surprised a lot less often. Why counterintuitive? Because you’d think that SQL would not surprise you that much, but it’s often the opposite:
Most importantly, when writing code for data, the data layer plays a huge role in how your code behaves, yet it is usually not visible while the code is being written. That’s a big deal.
Another way to think about this: in software, a code commit almost entirely describes what’s changing. This creates predictability, and one obvious outcome is that rolling back a faulty version to the previous one is straightforward. In data, rolling back a faulty commit requires a code change but also carefully replaying the data, which typically makes it a lot more complex.
There are certainly several ways to go about this, but one aspect of why code for data is different is the process it goes through when being deployed. Let’s think about this using a simple example: what would happen if a data engineer changes a field’s type in a table? Would this cause a problem? The answer is, of course, sometimes. For example, if that field is involved in a comparison and its type is now boolean, the comparison no longer works as intended, which would cause a problem. The question is, would *anything* in the build process flag this? In the vast majority of cases, the answer is no. This is because in most data frameworks the code is checked for very little: sometimes for basic syntax, and that’s it.
Another aspect is that in data, different projects, representing different tools and environments, are isolated from each other. For example, a person pushing new code in a dbt project has no default mechanism for constraining what is and isn’t allowed when that new code affects downstream dashboards in Looker or Tableau. These dependencies exist outside the boundaries where the dbt engineer is working, and there’s no exposure to them when writing, building, and deploying the code. The dbt project will build correctly, and the code will get deployed.
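To make the type-change example concrete, here is a minimal sketch in plain Python rather than SQL. The rows, the `is_active` field, and the `active_user_ids` filter are all hypothetical and stand in for a table and a downstream query. The point is only that the comparison keeps running after the type change and fails silently instead of raising an error.

```python
# Hypothetical rows before the change: `is_active` is stored as a string.
rows_before = [
    {"user_id": 1, "is_active": "true"},
    {"user_id": 2, "is_active": "false"},
    {"user_id": 3, "is_active": "true"},
]

# The same rows after an upstream change turned `is_active` into a boolean.
rows_after = [
    {"user_id": 1, "is_active": True},
    {"user_id": 2, "is_active": False},
    {"user_id": 3, "is_active": True},
]


def active_user_ids(rows):
    """Downstream filter written back when `is_active` was still a string."""
    return [row["user_id"] for row in rows if row["is_active"] == "true"]


# No exception is raised in either case; only the result changes.
print(active_user_ids(rows_before))  # [1, 3]
print(active_user_ids(rows_after))   # []
```

Nothing resembling a build step would complain here: the code is syntactically valid in both cases, and the only symptom is an empty result downstream.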
While there are some mechanisms that you could work with, for example Exposures in dbt, SQLFluff, and others, the simple examples above would still pass.
Of course, one can argue that bugs exist in software too. While this is definitely the case, there is still an argument that in data engineering a surprising number of simple code changes cause devastating effects. Even a straightforward change, such as renaming a column in a single table, is often seen as something most data teams avoid unless absolutely necessary.
Why is that the case? Because renaming a column is potentially a breaking change for downstream queries, similar to renaming a field in an API. We have to assume that it’s a change that needs to be carefully thought out, which means understanding the actual dependencies and determining how they would be impacted. Another way to think about this is to treat these dependencies as implied contracts that data engineers continuously create by adding new queries, jobs, and pipelines. To avoid breaking changes that violate these data contracts, we need to analyze the actual code change, understand all of its downstream and upstream dependencies, and determine the implied contracts and whether any of them are violated. Ideally we can do this at the time of the change, before it’s merged and can break dashboards or cause a data incident.
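As a rough illustration of checking a rename against its implied contracts, the sketch below lists which downstream consumers still reference a column that is about to be renamed. Everything in it is an assumption made for the example: the consumer names, the queries, and the `consumers_broken_by_rename` helper are hypothetical, and plain text matching is a crude stand-in for real SQL parsing and lineage analysis across tools.

```python
import re

# Hypothetical downstream consumers and the SQL each one runs against `orders`.
DOWNSTREAM_QUERIES = {
    "revenue_dashboard": "SELECT order_date, SUM(amount) FROM orders GROUP BY order_date",
    "churn_job": "SELECT customer_id, MAX(order_date) FROM orders GROUP BY customer_id",
    "ops_report": "SELECT status, COUNT(*) FROM orders GROUP BY status",
}


def consumers_broken_by_rename(old_column, queries):
    """Return, sorted by name, the consumers whose SQL still mentions `old_column`."""
    # Whole-word, case-insensitive match; this does not understand SQL scoping or aliases.
    pattern = re.compile(rf"\b{re.escape(old_column)}\b", re.IGNORECASE)
    return sorted(name for name, sql in queries.items() if pattern.search(sql))


# Proposed change: rename `orders.order_date` to something else.
broken = consumers_broken_by_rename("order_date", DOWNSTREAM_QUERIES)
print(broken)  # ['churn_job', 'revenue_dashboard']
```

Run as part of a pull request check, a non-empty result like this could block the merge until the affected consumers are updated, which is the "at the time of the change" timing described above.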
However, doing this in a typical data stack is not easy:
What this means is that we need a combination of technology, together with process:
One encouraging aspect of this representation is that we already have a strong parallel for the process: we know what software engineering and CI/CD look like in modern software development. However, we still need to understand what supporting technology is needed to make it work seamlessly with data.
At Foundational, we are solving extremely complex problems that data teams face on a day-to-day basis. Identifying issues in pending pull requests is only one aspect of it. Connect with us to learn more.