Inspired by Uwe Korns post on DuckDB this post shows how to use Azure Synapse SQL-on-Demand to query parquet files with T-SQL on a serverless cloud infrastructure.
Short Disclaimer: This post is comparing apples to oranges, because DuckDB is an embedded database designed to execute analytical SQL queries on your local machine, whereas Azure Synapse SQL-on-Demand is Microsoft new cloud offering for a serverless SQL on-demand (preview) endpoint to query data in the Azure Blob Store/Data Lake.
But in the end I was still triggerd by Uwe Korn's post: Taking DuckDB for a spin to do a comparison, because even if they are fundamentally different offerings, from a high level both solutions offer a simple way to run analytical queries on datasets without the need to install/manage a server.
To get started you need to convert the yellow tripdata sampledata to a parquet file and upload it to the Azure Blob Storage:
import pandas as pd
from simplekv.net.azurestore import AzureBlockBlobStore
filename = "yellow_tripdata_2016-01.csv"
df = pd.read_csv(
filename,
dtype={"store_and_fwd_flag": "bool"},
parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
index_col=False,
infer_datetime_format=True,
true_values=["Y"],
false_values=["N"],
)
df.to_parquet("yellow_tripdata_2016-01.parquet")
conn_string = 'DefaultEndpointsProtocol=https;AccountName=blob;AccountKey=xxx;'
store = AzureBlockBlobStore(conn_string=conn_string, container='benchmark', public=False)
store.putfile("yellow_tripdata_2016-01.parquet", "yellow_tripdata_2016-01.parquet")
Once the data is available in the Azure Blob Storage/Data Lake you can use the openrowset function to read the remote dataset in Azure SQL-on-Demand:
The minimal count distinct example needs to once read all selected columns from the azure blob storage and execute the count on the data.
The query roughly takes 4s
, which is not too bad and quite faster than the 5.58s
for DuckDB and the 25s
for sqlite.
This type of query is a nice and simple form for testing the aggregation performance.
SELECT
MIN(cnt),
AVG(cnt),
-- MEDIAN(cnt),
MAX(cnt)
FROM
(
SELECT
COUNT(*) as cnt
FROM
OPENROWSET(
BULK 'https://blob.dfs.core.windows.net/benchmark/yellow_tripdata_2016-01.parquet',
FORMAT='PARQUET'
) AS [r]
GROUP BY
DATEPART(day, tpep_pickup_datetime),
DATEPART(hour, tpep_pickup_datetime)
) as [stats];
As the query only has to read a subset of the columns it can be executed very fast in <1s
, which again beats DuckDB 2.05s
and SQLite 10.2s
.
with yellow_tripdata_2016_01
as (select *
FROM OPENROWSET(
BULK 'https://blob.dfs.core.windows.net/benchmark/yellow_tripdata_2016-01.parquet',
FORMAT='PARQUET'
) AS [r]
)
SELECT
(SUM(trip_distance * fare_amount) - SUM(trip_distance) * SUM(fare_amount) / COUNT(*)) /
(SUM(trip_distance * trip_distance) - SUM(trip_distance) * SUM(trip_distance) / COUNT(*)) AS beta,
AVG(fare_amount) AS avg_fare_amount,
AVG(trip_distance) AS avg_trip_distance
FROM
yellow_tripdata_2016_01,
(
SELECT
AVG(fare_amount) + 3 * STDEV(fare_amount) as max_fare,
AVG(trip_distance) + 3 * STDEV(trip_distance) as max_distance
FROM yellow_tripdata_2016_01
) AS [sub]
WHERE
fare_amount > 0 AND
fare_amount < sub.max_fare AND
trip_distance > 0 AND
trip_distance < sub.max_distance
;
The result is fetched again in less than 1s
which is on par with DuckDB 972 ms
and beats SQLite 11.7 s
.
The performance comparison result that SQL-on-Demand beats DuckDB is just irrelevant, because the two database solutions are completely different.
What's really interesting is that while DuckDB provides the comfort of an easily embeddedable database on your local machine, Azure Synapse SQL-on-Demand offers the same comfort on running T-SQL statements on parquet files as a managed, elastic service in the cloud. You can just click your SQL endpoint in Azure and start querying parquet files without the need to run or manage a server and you only pay for data scanned (5$ per TB). If you are doing data engineering in the Azure Cloud this is cool new tool in your toolbelt.
And one that scales horizontally. While the above examples have been done on
a tiny 1.7MB
parquet file, we have done some scaling tests at
blueyonder up to terabytes of data and thousands of
parquet files. Stay tuned for a post on the results and an example of how to use turbodbc as a high performance interface for data pipelines...