Escaping the SQL Jungle - Bringing Software Engineering Discipline to Data Transformations

Most data platforms don’t break overnight; they grow into complexity, query by query. Over time, business logic spreads across SQL scripts, dashboards, and scheduled jobs until the system becomes a “SQL jungle.” This article explores how that happens and how to bring structure back.


Most data systems don’t collapse overnight. They grow slowly, query by query.

"What breaks when I change a table?"

A dashboard needs a new metric, so someone writes a quick SQL query. Another team needs a slightly different version of the same dataset, so they copy the query and modify it. A scheduled job appears. A stored procedure is added. Someone creates a derived table directly in the warehouse.

Months later, the system looks nothing like the simple set of transformations it once was.

Business logic is scattered across scripts, dashboards, and scheduled queries. Nobody is entirely sure which datasets depend on which transformations. Making even a small change feels risky. A handful of engineers become the only ones who truly understand how the system works because there is no documentation.

Many organizations eventually find themselves trapped in what can only be described as a SQL jungle.

In this article we explore how systems end up in this state, how to recognize the warning signs, and how to bring structure back to analytical transformations. We’ll look at the principles behind a well-managed transformation layer, how it fits into a modern data platform, and common anti-patterns to avoid:

  1. How the SQL jungle came to be
  2. Requirements of a transformation layer
  3. Where the transformation layer fits in a data platform
  4. Common anti-patterns
  5. How to recognize when your organization needs a transformation framework

1. How the SQL jungle came to be

To understand the "SQL jungle" we first need to look at how modern data architectures evolved.

1.1 The shift from ETL to ELT

Historically, data engineers built pipelines that followed an ETL structure:

Extract --> Transform --> Load

Data was extracted from operational systems, transformed using pipeline tools, and then loaded into a data warehouse. Transformations were implemented in tools such as SSIS, Spark or Python pipelines.

Because these pipelines were complex and infrastructure-heavy, analysts depended heavily on data engineers to create new datasets or transformations.

Modern architectures have largely flipped this model:

Extract --> Load --> Transform

Instead of transforming data before loading it, organizations now load raw data directly into the warehouse, and transformations happen there. This architecture dramatically simplifies ingestion and enables analysts to work directly with SQL in the warehouse.

It also introduced an unintended side effect.


1.2 Consequences of ELT

In the ELT architecture, analysts can transform data themselves. This unlocked much faster iteration but also introduced a new challenge. The dependency on data engineers disappeared, but so did the structure that engineering pipelines provided.

Transformations can now be created by anyone (analysts, data scientists, engineers) in any place (BI tools, notebooks, warehouse tables, SQL jobs).

Over time, business logic grew organically inside the warehouse. Transformations accumulated as scripts, stored procedures, triggers and scheduled jobs. Before long, the system turned into a dense jungle of SQL logic and a lot of manual (re-)work.

In summary:

ETL centralized transformation logic in engineering pipelines.
ELT democratized transformations by moving them into the warehouse.

Without structure, transformations grow unmanaged, resulting in a system that becomes undocumented, fragile and inconsistent. A system in which different dashboards may compute the same metric in different ways and business logic becomes duplicated across queries, reports, and tables.


1.3 Bringing back structure with a transformation layer

In this article we use a transformation layer to manage transformations inside the warehouse effectively. This layer combines the engineering discipline of ETL pipelines with the speed and flexibility of the ELT architecture:

The transformation layer brings engineering discipline to analytical transformations.

When implemented successfully, the transformation layer becomes the single place where business logic is defined and maintained. It acts as the semantic backbone of the data platform, bridging the gap between raw operational data and business-facing analytical models.

Without the transformation layer, organizations often accumulate large amounts of data but have difficulty turning it into reliable information, because business logic tends to spread across the platform: metrics get redefined in dashboards, notebooks, ad-hoc queries, and so on.

Over time this leads to one of the most common problems in analytics: multiple conflicting definitions of the same metric.



2. Requirements of a Transformation Layer

If the core problem is unmanaged transformations, the next logical question is:

What would well-managed transformations look like?

Analytical transformations should follow the same engineering principles we expect in software systems, going from ad-hoc scripts scattered across databases to "transformations as maintainable software components".

In this chapter, we discuss what requirements a transformation layer must meet in order to properly manage transformations and, in doing so, tame the SQL jungle.


2.1 From SQL scripts to modular components

Instead of large SQL scripts or stored procedures, transformations are broken up into small, composable models.

To be clear: a model is just a SQL query stored as a file. This query defines how one dataset is built from other datasets.

The examples below show how the data transformation and modeling tool dbt defines models. Each tool has its own syntax; the principle of turning scripts into components matters more than the specific implementation.

Examples:

-- models/staging/stg_orders.sql
select
    order_id,
    customer_id,
    amount,
    order_date
from raw.orders

When executed, this query materializes as a table (staging.stg_orders) or view in your warehouse. Models can then build on top of each other through references:

-- models/intermediate/int_customer_orders.sql
select
    customer_id,
    sum(amount) as total_spent
from {{ ref('stg_orders') }}
group by customer_id

And:

-- models/marts/customer_revenue.sql
select
    c.customer_id,
    c.name,
    o.total_spent
from {{ ref('int_customer_orders') }} o
join {{ ref('stg_customers') }} c using (customer_id)

This creates a dependency graph:

stg_orders
      ↓
int_customer_orders
      ↓
customer_revenue

Each model has a single responsibility and builds upon other models by referencing them (e.g. ref('stg_orders')). This approach has major advantages:

  • You can see exactly where data comes from
  • You know what will break if something changes
  • You can safely refactor transformations
  • You avoid duplicating logic across queries

This structure makes the transformation system easier to read, understand, maintain, and evolve.
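Dependency resolution happens at compile time: frameworks like dbt replace each ref() call with the concrete relation name before the query runs. For the int_customer_orders model above, the compiled SQL looks roughly like this (the target schema name is an assumption, since it depends on project configuration):

```sql
-- compiled form of int_customer_orders; 'staging' is an assumed
-- target schema, actual names depend on your project configuration
select
    customer_id,
    sum(amount) as total_spent
from staging.stg_orders
group by customer_id
```

Because the framework sees every ref(), it can also derive the full dependency graph and run models in the correct order.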


2.2 Transformations that live in code

A managed system stores transformations in version-controlled code repositories. Think of this as a project that contains SQL files instead of SQL being stored in a database. It's similar to how a software project contains source code.

This enables practices that are standard in software engineering but historically rare in data pipelines:

  • pull requests
  • code reviews
  • version history
  • reproducible deployments

Instead of editing SQL directly in production databases, engineers and analysts work in a controlled development workflow and can even experiment safely in branches.


2.3 Data Quality as part of development

Another key capability a managed transformation system should provide is the ability to define and run data tests.

Typical examples include:

  • ensuring columns are not null
  • verifying uniqueness of primary keys
  • validating relationships between tables
  • enforcing accepted value ranges

These tests validate assumptions about the data and help catch issues early. Without them, pipelines often fail silently: incorrect results propagate downstream until someone notices a broken dashboard.
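In dbt, for example, such tests can be declared next to the models in a YAML file. A minimal sketch for the stg_orders model from section 2.1 (the schema.yml file name is a convention, not a requirement):

```yaml
# models/staging/schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique      # primary key must be unique
          - not_null    # and never missing
      - name: customer_id
        tests:
          - not_null
          - relationships:            # every order points to a customer
              to: ref('stg_customers')
              field: customer_id
```

Running `dbt test` turns each declaration into a query that selects violating rows; any non-empty result fails the test.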


2.4 Clear lineage and documentation

A managed transformation framework also provides visibility into the data system itself.

This typically includes:

  • automatic lineage graphs (where does the data come from?)
  • dataset documentation
  • descriptions of models and columns
  • dependency tracking between transformations

This dramatically reduces reliance on tribal knowledge. New team members can explore the system rather than relying on a single person who “knows how everything works.”


2.5 Structured modeling layers

Another common pattern introduced by managed transformation frameworks is the ability to separate transformation layers.

For example, you might utilize the following layers:

raw
staging
intermediate
marts

These layers are often implemented as separate schemas in the warehouse.

Each layer has a specific purpose:

  • raw: ingested data from source systems
  • staging: cleaned and standardized tables
  • intermediate: reusable transformation logic
  • marts: business-facing datasets

This layered approach prevents analytical logic from becoming tightly coupled to raw ingestion tables.
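To keep these layers decoupled, the raw layer should only be read by staging models. In dbt this is typically done by declaring raw tables as sources and referencing them with the source() macro; a minimal sketch, assuming a source named raw with a customers table declared in YAML:

```sql
-- models/staging/stg_customers.sql
-- the only model that touches the raw customers table;
-- everything downstream uses ref('stg_customers') instead
select
    id as customer_id,        -- standardize key names
    trim(full_name) as name,  -- light cleaning only
    lower(email) as email
from {{ source('raw', 'customers') }}
```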



3. Where the Transformation Layer Fits in a Data Platform

With the previous chapters in mind, it becomes clear where a managed transformation framework fits within a broader data architecture.

A simplified modern data platform often looks like this:

Operational systems / APIs
           ↓
      1. Data ingestion
           ↓
      2. Raw data
           ↓
  3. Transformation layer
           ↓
    4. Analytics layer

Each layer has a distinct responsibility.

3.1 Ingestion layer

Responsibility: moving data into the warehouse with minimal transformation. Tools typically include custom ingestion scripts, Kafka or Airbyte.

3.2 Raw data layer

Responsible for storing data as close as possible to the source system. Prioritizes completeness, reproducibility and traceability of data. Very little transformation should happen here.

3.3 Transformation layer

This is where the main modeling work happens.

This layer converts raw datasets into structured, reusable analytical models. Typical tasks consist of cleaning and standardizing data, joining datasets, defining business logic, creating aggregated tables and defining metrics.

This is the layer where frameworks like dbt or SQLMesh operate. Their role is to ensure these transformations are:

  • structured
  • version controlled
  • testable
  • documented

Without this layer, transformation logic tends to fragment across queries, dashboards, and scripts.

3.4 Analytics layer

This layer consumes the modeled datasets. Typical consumers include BI tools like Tableau or PowerBI, data science workflows, machine learning pipelines and internal data applications.

These tools can rely on consistent definitions of business metrics since transformations are centralized in the transformation layer.


3.5 Transformation tools

Several tools attempt to address the challenge of the transformation layer. Two well-known examples are dbt and SQLMesh. These tools make it easy to get started applying structure to your transformations.

Just remember that these tools are not the architecture itself; they are simply frameworks that help implement the architectural layer we need.



4. Common Anti-Patterns

Even when organizations adopt modern data warehouses, the same problems often reappear if transformations remain unmanaged.

Below are common anti-patterns that, individually, may seem harmless, but together they create the conditions for the SQL jungle. When business logic is fragmented, pipelines are fragile and dependencies are undocumented, onboarding new engineers is slow and systems become difficult to maintain and evolve.

4.1 Business logic implemented in BI tools

One of the most common problems is business logic moving into the BI layer. Think about "calculating revenue in a Tableau dashboard".

At first this seems convenient since analysts can quickly build calculations without waiting for engineering support. In the long run, however, this leads to several issues:

  • metrics become duplicated across dashboards
  • definitions diverge over time
  • difficulty debugging

Instead of being centralized, business logic becomes fragmented across visualization tools. A healthy architecture keeps business logic in the transformation layer, not in dashboards.


4.2 Giant SQL queries

Another common anti-pattern is writing extremely large SQL queries that perform many transformations at once. Think about queries that:

  • join dozens of tables
  • contain deeply nested subqueries
  • implement multiple stages of transformation in a single file

These queries quickly become difficult to read, debug, reuse and maintain. Each model should ideally have a single responsibility. Break transformations into small, composable models to increase maintainability.
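When a full split into separate models is too big a first step, restructuring a monolithic query into named CTEs already helps: each CTE gets a single responsibility and becomes an obvious candidate for extraction into its own model later. A sketch with illustrative table and column names:

```sql
with orders_cleaned as (
    -- responsibility 1: cleaning
    select order_id, customer_id, amount
    from raw.orders
    where amount is not null
),

customer_totals as (
    -- responsibility 2: aggregation
    select customer_id, sum(amount) as total_spent
    from orders_cleaned
    group by customer_id
)

-- final shape exposed to consumers
select *
from customer_totals
```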


4.3 Mixing transformation layers

Avoid mixing transformation responsibilities within the same models, like:

  • joining raw ingestion tables directly with business logic
  • mixing data cleaning with metric definitions
  • creating aggregated datasets directly from raw data

Without separation between layers, pipelines become tightly coupled to raw source structures. To remedy this, introduce clear layers such as the raw, staging, intermediate, and marts layers discussed earlier.

This helps isolate responsibilities and keeps transformations easier to evolve.


4.4 Lack of testing

In many systems, data transformations run without any form of validation. Pipelines execute successfully even when the resulting data is incorrect.

Introducing automated data tests helps detect issues like duplicate primary keys, unexpected null values and broken relationships between tables before they propagate into reports and dashboards.
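Besides the declarative tests described in section 2.3, dbt also supports singular tests: plain SQL files that select the rows violating an assumption, where any returned row fails the test. A sketch (the tests/ directory is dbt's convention; the file name is illustrative):

```sql
-- tests/assert_order_id_unique.sql
-- fails if any order_id appears more than once
select
    order_id,
    count(*) as occurrences
from {{ ref('stg_orders') }}
group by order_id
having count(*) > 1
```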


4.5 Editing transformations directly in production

One of the most fragile patterns is modifying SQL directly inside the production warehouse. This causes many problems:

  • changes are undocumented
  • errors immediately affect downstream systems
  • rollbacks are difficult

In a good transformation layer, transformations are treated as version-controlled code, allowing changes to be reviewed and tested before deployment.



5. How to Recognize When Your Organization Needs a Transformation Framework

Not every data platform needs a fully structured transformation framework from day one. In small systems, a handful of SQL queries may be perfectly manageable.

However, as the number of datasets and transformations grows, unmanaged SQL logic tends to accumulate. At some point the system becomes difficult to understand, maintain, and evolve.

There are several signs that your organization may be reaching this point.

  1. The number of transformation queries keeps growing
    Think of dozens or hundreds of derived tables
  2. Business metrics are defined in multiple places
    Example: different definitions of "active users" across teams
  3. Difficulty understanding the system
    Onboarding new engineers takes weeks or months, and tribal knowledge is required for questions about data origins, dependencies, and lineage
  4. Small changes have unpredictable consequences
    Renaming a column may break several downstream datasets or dashboards
  5. Data issues are discovered too late
    Quality issues surface only after a customer discovers incorrect numbers on a dashboard, the result of incorrect data propagating unchecked through several layers of transformations.

When these symptoms begin to appear, it is usually time to introduce a structured transformation layer. Frameworks like dbt or SQLMesh are designed to help teams introduce this structure while preserving the flexibility that modern data warehouses provide.



Conclusion

Modern data warehouses have made working with data faster and more accessible by shifting from ETL to ELT. Analysts can now transform data directly in the warehouse using SQL, which greatly improves iteration speed and reduces dependence on complex engineering pipelines.

But this flexibility comes with a risk. Without structure, transformations quickly become fragmented across scripts, dashboards, notebooks, and scheduled queries. Over time this leads to duplicated business logic, unclear dependencies, and systems that are difficult to maintain: the SQL jungle.

The solution is to introduce engineering discipline into the transformation layer. By treating SQL transformations as maintainable software components — version controlled, modular, tested, and documented — organizations can build data platforms that remain understandable as they grow.

Frameworks like dbt or SQLMesh can help implement this structure, but the most important change is adopting the underlying principle: managing analytical transformations with the same discipline we apply to software systems.

With this we can create a data platform where business logic is transparent, metrics are consistent, and the system remains understandable even as it grows. When that happens, the SQL jungle turns into something far more valuable: a structured foundation that the entire organization can trust.


I hope this article was as clear as I intended; if not, please let me know what I can do to clarify further. In the meantime, check out my other articles on all kinds of programming-related topics.

Happy coding!

— Mike

P.S.: like what I'm doing? Follow me!