DuckDB vs. Azure Synapse SQL on-demand with Parquet
Short Disclaimer: This post is comparing apples to oranges, because DuckDB is an embedded database designed to execute analytical SQL queries on your local machine, whereas Azure Synapse SQL-on-Demand is Microsoft's new cloud offering for a serverless SQL on-demand (preview) endpoint to query data in Azure Blob Storage/Data Lake.
But in the end I was still triggered by Uwe Korn's post: Taking DuckDB for a spin to do a comparison, because even though they are fundamentally different offerings, at a high level both solutions offer a simple way to run analytical queries on datasets without needing to install or manage a server.
Load the data as Parquet data
To get started, convert the yellow tripdata sample data to a Parquet file and upload it to Azure Blob Storage:
import pandas as pd
from simplekv.net.azurestore import AzureBlockBlobStore
filename = "yellow_tripdata_2016-01.csv"
df = pd.read_csv(
filename,
dtype={"store_and_fwd_flag": "bool"},
parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
index_col=False,
infer_datetime_format=True,
true_values=["Y"],
false_values=["N"],
)
df.to_parquet("yellow_tripdata_2016-01.parquet")
conn_string = 'DefaultEndpointsProtocol=https;AccountName=blob;AccountKey=xxx;'
store = AzureBlockBlobStore(conn_string=conn_string, container='benchmark', public=False)
store.putfile("yellow_tripdata_2016-01.parquet", "yellow_tripdata_2016-01.parquet")
Once the data is available in Azure Blob Storage/Data Lake, you can use the openrowset function to read the remote dataset in Azure SQL-on-Demand:
Count Distinct
The minimal COUNT DISTINCT example needs to read all selected columns from Azure Blob Storage once and then execute the count on the data.

The query roughly takes 4s, which is not too bad and is faster than the 5.58s for DuckDB and the 25s for SQLite.
Frequency of events
This type of query is a simple way to test aggregation performance.
SELECT
MIN(cnt),
AVG(cnt),
-- MEDIAN(cnt),
MAX(cnt)
FROM
(
SELECT
COUNT(*) as cnt
FROM
OPENROWSET(
BULK 'https://blob.dfs.core.windows.net/benchmark/yellow_tripdata_2016-01.parquet',
FORMAT='PARQUET'
) AS [r]
GROUP BY
DATEPART(day, tpep_pickup_datetime),
DATEPART(hour, tpep_pickup_datetime)
) as [stats];
As the query only needs to read a subset of the columns, it can be executed very fast in <1s, which again beats DuckDB 2.05s and SQLite 10.2s.
Simple fare regression
with yellow_tripdata_2016_01
as (select *
FROM OPENROWSET(
BULK 'https://blob.dfs.core.windows.net/benchmark/yellow_tripdata_2016-01.parquet',
FORMAT='PARQUET'
) AS [r]
)
SELECT
(SUM(trip_distance * fare_amount) - SUM(trip_distance) * SUM(fare_amount) / COUNT(*)) /
(SUM(trip_distance * trip_distance) - SUM(trip_distance) * SUM(trip_distance) / COUNT(*)) AS beta,
AVG(fare_amount) AS avg_fare_amount,
AVG(trip_distance) AS avg_trip_distance
FROM
yellow_tripdata_2016_01,
(
SELECT
AVG(fare_amount) + 3 * STDEV(fare_amount) as max_fare,
AVG(trip_distance) + 3 * STDEV(trip_distance) as max_distance
FROM yellow_tripdata_2016_01
) AS [sub]
WHERE
fare_amount > 0 AND
fare_amount < sub.max_fare AND
trip_distance > 0 AND
trip_distance < sub.max_distance
;
The result is returned again in less than 1s, which is on par with DuckDB 972 ms and beats SQLite 11.7s.
Conclusion
The performance result that SQL-on-Demand beats DuckDB is irrelevant, because the two database solutions are completely different.
What is really interesting is that while DuckDB provides the convenience of an easily embeddable database on your local machine, Azure Synapse SQL-on-Demand provides similar convenience for running T-SQL statements on Parquet files as a managed, elastic service in the cloud. You can just click on your SQL endpoint in Azure and start querying Parquet files without the need to run or manage a server, and you only pay for the data scanned ($5 per TB). If you are doing data engineering in Azure, this is a cool new tool in your tool belt.
And it scales horizontally. While the above examples were done on
a tiny 1.7MB Parquet file, we have done some scaling tests at
blueyonder up to terabytes of data and thousands of
Parquet files. Stay tuned for a post with the results and an example of how to use turbodbc as a high-performance interface for data pipelines...
