Stop Data Bugs: Semantic Issue Detection for SQL

Companies nowadays write a lot more SQL than before, more so with the increased popularity of tools like dbt, which allow data teams to quickly create models and pipelines. The increase in SQL is a byproduct of the cloud data warehouse era, where companies have better access to data and more tools to leverage it into reporting and business intelligence with far less bottlenecks than before.

In turn, this is also leading to a dramatic increase in SQL (as can be seen in this Reddit thread about dbt models) and ultimately to the amount of code that data teams own and maintain. At Foundational, we often see companies with hundreds or thousands of dbt models, where some dbt models can be as big as 600 lines of code with multiple references to other models, which makes them quite complex.

This situation leads to the “average” SQL code being far more complex than what should be reasonable for data engineers to maintain. It’s rare to see “simple” transformations, and many teams today create and maintain complex transformations code that results in big dbt DAGs with numerous dependencies across them.

This complexity typically leads to one or more of these outcomes:

Reduced development velocity: It’s a lot harder for data teams to deploy changes since it’s harder to build (and test) due to the increased complexity. Data teams are forced to do more rigorous testing to verify nothing breaks across the hundreds of existing models.
Poor data quality: Intuitively, as with any complex system, data teams simply experience a lot more bugs. Some of these bugs will slip all the way to production and impact data quality, and some of them will further turn into data incidents.

The situation of having bugs in data engineering code is similar to what usually happens in traditional software development of complex systems. A big difference however is that in software, once the bug is fixed the developer can usually move on, and in the case of data fixing the bug does not remediate the bad data that’s still running through the pipelines. Developers are then left with either fixing the bad data that’s already live (which is time-consuming and usually gets deprioritized), or with adding some code to ignore the bad data, which further increases complexity.

Most teams will usually suffer from one of these problems more than the other, depending on choice - they either sacrifice developer velocity in favor of fewer incidents, or sacrifice data quality in favor of moving faster. Yet, is there a better way here? Can better tooling, like those that exist for software development, allow teams to move fast while maintaining high data quality? This is one of the topics we had in mind when we built the technology for Foundational, and we made it our mission to allow data developers to deploy changes efficiently while maintaining high data quality.

One of the solutions we have built for this is leveraging static code analysis where the SQL code from the latest version is analyzed together with up-to-date data lineage information, which we extract and maintain as well. This allows us to find semantic bugs at the coding stage, where the relevant pull request is still pending or is in review, before it gets merged or pushed to production. This approach allows data developers to move faster while relying on Foundational to flag issues that degrade data quality, in line with our mission statement.

What are semantic bugs?

Semantic bugs are cases where the SQL or dbt code is perfectly valid, and will run properly in the data warehouse, but will produce unexpected or simply wrong data. For example, consider the following query:

(SELECT price_before_tax, total_price
FROM us_products_list)
UNION ALL
(SELECT total_price, price_before_tax
FROM eu_products_list)

This is a valid query, which will run without any issues in a data warehouse. However, this query performs the wrong calculation, since it has a mismatch in the UNION clause, where it unions price_before_tax with total_price and then does the opposite. Running this query will create bad rows in the warehouse tables, eventually leading to skews, in this example in revenue numbers. It also doesn’t help that in some of these cases, only some parts (e.g., rows) of the data are impacted, which typically means that testing and monitoring may not effectively flag this.

How does Foundational detect semantic bugs?

As part of the product, we’ve built a unique code analysis engine that analyzes every SQL query and understands the semantics, to try and find potential semantic bugs. For this example above, Foundational will identify that there is a UNION between two tables, but the order of the fields is different between the two clauses, which will result in a UNION Mismatch that will get flagged as an issue.

Since Foundational is also calculating column-level lineage, which the code analysis engine can access, we can detect even more complex cases where some of the tables in the UNION operator read from other tables i.e., are only linked through some column-level dependencies which without accurate column-level lineage, would have been impossible to detect.

Foundational utilizes a learning engine, some parts being automatic while others are inspected by our research team, so if we see a bug that we missed, new rules will be created so that bug is then flagged across our entire customer base. So far we have analyzed dozens of thousands of pull requests with many more queries.

Are these bugs really happening?

Yes! Depending on the number of developers (and level of testing), every week we may detect a few to multiple bugs in every customer. Some of these bugs are found early (E.g., as part of the initial pull request review) and get fixed in the same pull request. We’ve seen cases, before enabling this feature, where bugs make their way to production and can stay there for a while, as it may take time until they’re detected when the skew is large enough to get noticeable.

When we onboard new customers, we typically find multiple semantic bugs that are live in the HEAD version that runs in production. Unsurprisingly, these bugs are usually ones that introduce a moderate data skew and therefore are much harder to detect and pass most tests and monitoring tools.

Conclusion

Semantic bugs are a real thing, and in this article we gave a few examples for how such bugs can be found. Ultimately, a well trained engineer that carefully looks at a single query can find semantic bugs, and usually they do. However, since pipeline queries tend to be a lot bigger, and since there can be cross-query dependencies that require that engineer to look at hundreds of lines of code, this task becomes non-scalable, making semantic bugs extremely prevalent.

Our team has built a unique engine that finds these bugs at the pull request stage, allowing teams to move faster while avoiding semantic bugs that degrade the data quality.

We check and validate every code change throughout all the different repositories and stakeholders in the business, also covering engineering changes happening upstream. The goal is to unify data management across every function that may process or modify data.

Our goal is to streamline data development across the entire company, helping engineering and data teams deploy changes faster and with greater confidence – Schedule time with us to learn more.

We built semantic issue detection for SQL

What are semantic bugs?

(SELECT price_before_tax, total_price
FROM us_products_list)
UNION ALL
(SELECT total_price, price_before_tax
FROM eu_products_list)

How does Foundational detect semantic bugs?

Are these bugs really happening?

Conclusion

Related posts

Supercharge Your AI: Connect Claude and Cursor to Your Foundational Environment

Announcing Foundational's Data Governance AI Agents

Spark Lineage via Code Analysis

Next-gen Data Management.
For Everyone

Related posts

Supercharge Your AI: Connect Claude and Cursor to Your Foundational Environment

Announcing Foundational's Data Governance AI Agents

Spark Lineage via Code Analysis

Next-gen Data Management.For Everyone

Next-gen Data Management.
For Everyone