SQL vs. Python: What language is best for data transformation?
SQL is the most widely used data language, but it’s not always the best choice for data transformation. Good thing dbt supports Python now.
By Kyle Dempsey (Head of CX and Solutions Architecture @ AirOps) & Jonathan Natkins (Regional Director, Solutions Architecture @ dbt Labs)
For years, dbt didn’t support Python. This was an intentional decision.
Back in 2017, large-scale data processing in Python was increasingly capable, but it wasn’t truly accessible. There wasn’t a good way to deliver the same “magic” that’s possible with SQL, which is why dbt remained a SQL-only modeling tool.
Here’s how dbt’s co-founder and CEO, Tristan Handy, framed the company’s decision not to support Python in the past (emphasis mine):
“We’re excited to support languages beyond SQL once they meet the same bar for user experience that SQL provides today. And over the past five years, that’s happened.”
As Tristan explained, things have changed a lot over the last five years, namely:
- Snowflake launched Python support in Snowpark.
- BigQuery released a Spark connector, allowing BQ users to process data natively in both SQL and Python.
- Databricks launched Databricks SQL analytics and now provides both SQL and Python interfaces to the same data.
These updates led dbt to reassess support for Python, because the changes mentioned above allow users of Snowflake, BigQuery, and Databricks to do things like:
- Reference the same data in both SQL and non-SQL workloads. No need to move it.
- Manage development environments in an abstracted way thanks to fully-native cloud solutions.
Additionally, the APIs needed to fuel a native dbt development experience were finally out in the wild.
All of these factors culminated in an exciting product update: As of November 2022, dbt Core v1.3 includes support for Python models.
This is great news! Plenty of analytics engineers and other data people know SQL. Some know Python, especially on the data engineering and data science side. Some folks also know R… but we won’t talk about that right now, even though it’s pretty cool.
But, as with all exciting changes, questions inevitably arise. Namely, “When should I use SQL for data transformations? When does it make more sense to use Python?”
Here’s a cheat sheet you can use to make the right call.
SQL: It’s all about speed and simplicity
In general, it’s best to use SQL when…
You need to run simple queries, fast.
SQL’s beauty lies in its simplicity. Compared to Python, SQL is a much simpler language. It’s also exclusively used for data. That means it’s easier to learn, and it provides the quickest, most efficient means of performing simple data analyses.
You need to retrieve, update, or insert data from a database or data warehouse.
Because SQL is a data-first language, it’s designed to communicate with databases and is the best option for accessing, modifying, and managing your data warehouse.
You need to perform simple join and aggregate operations.
The “simplicity” theme returns when we’re talking about uncomplicated join and aggregate operations. Since SQL runs directly on the database, using it for these types of jobs means that SQL typically provides a faster and easier way to develop a query to perform simple analyses.
These analyses could include things like:
- Selecting and filtering tables
- Joining multiple tables into a single output
- Performing aggregations over large numbers of rows
Plus, data warehouses, such as Snowflake, are designed to perform these operations very efficiently and can process huge data volumes with ease.
You need to perform multi step transformation jobs.
Tools like dbt make it easy to use SQL to create staging and intermediate tables as part of a transformation process.
SQL cons: Powerful, but with limitations for more advanced use cases
Of course, no coding language is perfect, and there are some cons to SQL that are worth noting.
SQL isn’t as flexible.
SQL queries almost exclusively rely on combinations of joins, aggregate functions, subqueries, and window functions. This is great for non-complex analyses, More complex use cases, such as data science, tend to be a better fit for Python.
A lack of libraries and packages = limited functions and operations.
Since there isn’t a SQL package manager that allows users to import functions, you’ll have to write your SQL queries from scratch.
SQL can be difficult to debug.
This is another key area in the SQL vs. Python debate where the developer experience isn’t quite as robust. Debugging SQL code is more difficult because there’s no option to set up a breakpoint to halt execution – with SQL, you have to execute a complete statement all at once.
You could use CTEs (common table expressions) and split your dbt models into several files, which makes debugging intermediate models a lot easier. But, this isn’t as powerful (or straightforward) as setting a breakpoint in your code.
SQL syntax can differ between databases.
While the basic structure of your SQL queries will always be the same (all databases support SELECT , FROM , and GROUP BY) differences in syntax do exist among databases.
For example, both MySQL and PostgreSQL databases support stored procedures. However, MySQL only supports standard SQL syntaxes, whereas PostgreSQL supports more advanced procedures.
The cross-database macros in dbt_utils can help resolve this, but it’s still a challenge worth noting.
Data analysts and analytics engineers know and love SQL.
Even if there are certain things SQL doesn’t do well, like data science workflows and large-scale data manipulations (more on that below), it’s still a powerful language that helps data professionals do their jobs efficiently and effectively. We all know it, there’s a strong community around it, and it works perfectly for a large percentage of what we need to do.
That being said, there are certainly times when Python is the superior choice.
Python: Best for complex, large-scale data operations
On the flip side, it makes more sense to use Python when…
You need to perform data science workflows.
SQL’s simplicity is both its biggest upside, and also its biggest downside. Things like complex transformations on large amounts of data and linear regressions require hundreds of lines of code in SQL.
Basically, if you want to perform data science workflows such as:
- Regression analysis
- Predictive modeling / forecasting
- Key driver analysis
- Machine learning
… then you’ll want to use Python.
You need to perform large scale data manipulations.
Performing certain tasks is extremely cumbersome in SQL. Python’s power and flexibility makes it much easier to do things like:
- Transpose and pivot tables (i.e., convert rows to columns)
- Manipulate text / string data
- Perform complex time series operations (e.g., period over period comparisons)
Here’s an example that shows the difference between how Python and SQL each handle creating a week-over-week percent change analysis:
While SQL requires each logical step of the calculation to be explicitly defined, the Pandas package used in Python includes the pct_change() method to easily run the calculation in one line.
You need a more robust set of programming libraries to get the job done.
The queries that SQL produces depend on functions, which are inherently limited since they can only perform very specific tasks.
Python uses programming libraries instead (to the tune of more than 300k available libraries as mentioned above). These libraries can be applied to a broader range of projects – there are a sizable number of options for data transformation alone.
However, keep in mind that there are limitations on what you can import – some packages aren’t available on all platforms, including Snowpark and Databricks, for example.
So, which language should I use for data transformation in dbt – SQL or Python?
The answer to this question depends entirely on the data you’re transforming and your goals for the project.
SQL is great for simple queries where you need a quick, efficient means of getting the job done. Python is ideal for more complex data science workflows and large-scale data manipulation.
Ideally, you know how to work with both languages and can choose the best one for your transformation work.
However, even if you aren’t proficient (or fast) with SQL and Python, there’s a tool that makes working with both languages infinitely easier: Data Sidekick from AirOps.
Data Sidekick uses AI to make working with Python and SQL significantly faster. It’s a free tool that allows you to use natural language to generate SQL and Python queries more accurately than ever before.
Sidekick can read a SQL query, understand its output, and then anticipate the query’s structure and generate Python operations to perform the next step. For example, you can use the tool to easily pivot a table, run a forecast, or parse a string field. It’s designed to be a schema aware “Sidekick” to speed up Analytics Engineering tools.
It can perform around 10 different operations, from generating single Python functions to drafting full scripts. Click here to try Sidekick today – dbt cloud users get priority access.