#PANDAS

DuckDB vs Azure Synapse SQL-on-Demand with parquet

Inspired by Uwe Korn's post on DuckDB, this post shows how to use Azure Synapse SQL-on-Demand to query Parquet files with T-SQL on serverless cloud infrastructure.
#python #parquet #pydata #pandas #azure

Peter Hoffmann

Understand predicate pushdown on row group level in Parquet with pyarrow and python

Apache Parquet is a columnar file format for working with gigabytes of data. pyarrow exposes efficient reading and writing of Parquet files to Python. Additional statistics stored in the file allow clients to use predicate pushdown to read only the subsets of data they need, which reduces I/O. Organizing data by column allows for better compression, as the data within a column is more homogeneous. Better compression also reduces the bandwidth required to read the input.
#python #pydata #parquet #arrow #pandas

Peter Hoffmann
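
To make the idea of predicate pushdown on row group level more concrete, here is a minimal sketch in plain Python. It is a toy model and not pyarrow's API: the `RowGroup` class and the helpers `write_row_groups` and `read_greater_than` are hypothetical names invented for this illustration. The only assumption carried over from Parquet is that each row group records min/max statistics for a column, so a reader can skip whole row groups that cannot match a filter.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group holding one integer column."""
    values: list
    min_value: int  # statistic recorded at write time
    max_value: int  # statistic recorded at write time


def write_row_groups(values, row_group_size):
    """Split a column into row groups and record min/max for each one."""
    groups = []
    for start in range(0, len(values), row_group_size):
        chunk = values[start:start + row_group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def read_greater_than(row_groups, threshold):
    """Return rows with value > threshold and the number of row groups read."""
    rows, groups_read = [], 0
    for group in row_groups:
        # Pushdown step: the statistics prove no row in this group can match,
        # so the group's data is never touched.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        rows.extend(v for v in group.values if v > threshold)
    return rows, groups_read


row_groups = write_row_groups(list(range(100)), row_group_size=10)
rows, groups_read = read_greater_than(row_groups, threshold=84)
print(len(rows), groups_read)  # 15 rows returned, only 2 of 10 row groups read
```

With sorted data the statistics are very selective: eight of the ten row groups are skipped. With randomly ordered data the min/max ranges of the row groups would overlap and far fewer could be skipped. In pyarrow itself, filters of this kind can be passed through the `filters` argument of `pyarrow.parquet.read_table`.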