Peter Hoffmann

EuroSciPy 2018 - Apache Parquet as a Columnar Storage for Large Datasets

Apache Parquet Data Format

Apache Parquet is a binary, columnar data format that stores data in a CPU- and I/O-efficient way using techniques such as row groups, page compression within column chunks, and dictionary encoding for columns. Index hints and column statistics let query engines quickly skip irrelevant chunks of data, enabling efficient queries on large datasets.
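
A minimal sketch of how these knobs can be set when writing a Parquet file with PyArrow (the file name and sample data below are illustrative):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({
        # low-cardinality string column: benefits from dictionary encoding
        "user": ["alice", "bob", "alice", "carol"],
        "value": [1.0, 2.5, 3.2, 0.7],
    })
    table = pa.Table.from_pandas(df)

    pq.write_table(
        table,
        "events.parquet",
        row_group_size=100_000,   # rows per row group; engines can skip whole groups
        compression="snappy",     # page compression within each column chunk
        use_dictionary=True,      # dictionary-encode column values
    )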

Apache Parquet with Pandas & Dask

Apache Parquet files can be read into pandas DataFrames with the fastparquet and PyArrow (Apache Arrow) libraries. While pandas is mostly used for data that fits in memory, Dask makes it possible to work with datasets larger than memory, and even larger than local disk space. Data can be split into partitions and stored in cloud object storage systems such as Amazon S3 or Azure Blob Storage.
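
A minimal sketch of both reading paths (the local and S3 paths are illustrative; reading from S3 additionally requires the s3fs library):

    import pandas as pd
    import dask.dataframe as dd

    # pandas loads the whole file into memory; engine can be
    # "pyarrow" or "fastparquet".
    df = pd.read_parquet("events.parquet", engine="pyarrow")

    # Dask builds a lazy DataFrame with one partition per file,
    # so the dataset as a whole can be larger than memory.
    ddf = dd.read_parquet("s3://my-bucket/events/*.parquet")
    print(ddf["value"].mean().compute())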

Exploiting metadata from partition filenames, Parquet column statistics, and dictionary filtering speeds up selective queries, since irrelevant data never has to be read. This talk shows how partitioning, row-group skipping, and data layout choices can be combined to query large amounts of data efficiently.
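
A minimal sketch of such a selective query with Dask, assuming a dataset partitioned by a hypothetical year column (hive-style directories like year=2017/, year=2018/):

    import dask.dataframe as dd

    # The filter is matched against partition directory names and
    # row-group statistics, so non-matching data is never read.
    ddf = dd.read_parquet(
        "s3://my-bucket/events/",
        filters=[("year", "==", 2018)],
    )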
