Peter Hoffmann

EuroPython 2018 - Using pandas and Dask to work with large columnar datasets in Apache Parquet

Apache Parquet Data Format

Apache Parquet is a binary, efficient, columnar data format. It stores data in a CPU- and I/O-efficient way using row groups, page-level compression within column chunks, and dictionary encoding for columns. Indexes and statistics let query engines skip over irrelevant chunks of data, which makes queries on large datasets efficient.
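
As a minimal sketch of this structure, the pyarrow library can expose a file's row groups and the per-column-chunk statistics that query engines use for skipping; the file name below is only a placeholder:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")

    # File-level metadata: number of row groups, columns and rows.
    meta = pf.metadata
    print(meta.num_row_groups, meta.num_columns, meta.num_rows)

    # Per-column-chunk metadata of the first row group; the min/max
    # statistics let a reader skip chunks that cannot match a predicate.
    rg = meta.row_group(0)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(col.path_in_schema, col.compression, col.statistics)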

Apache Parquet with Pandas & Dask

Apache Parquet files can be read into pandas DataFrames with either of two libraries: fastparquet and Apache Arrow (pyarrow). While pandas is mostly used to work with data that fits into memory, Apache Dask allows us to work with data larger than memory and even larger than local disk space. The data can be split into partitions and stored in cloud object storage systems like Amazon S3 or Azure Storage.
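
A rough sketch of both reading paths (the local file, the S3 bucket and the column name are placeholders, and reading from S3 assumes s3fs is installed):

    import pandas as pd
    import dask.dataframe as dd

    # pandas: loads the whole file into memory, here via the pyarrow engine.
    df = pd.read_parquet("example.parquet", engine="pyarrow")

    # Dask: each Parquet file in the directory becomes a lazy partition,
    # so the dataset never has to fit into memory at once.
    ddf = dd.read_parquet("s3://my-bucket/dataset/")
    print(ddf.npartitions)

    # Work is executed partition by partition when compute() is called.
    counts = ddf.groupby("user_id").size().compute()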

Metadata from partition filenames, Parquet column statistics, and dictionary filtering allow selective queries to skip irrelevant data instead of reading the whole dataset. This talk will show how to use partitioning, row-group skipping, column pruning, and data layout to speed up queries on large datasets.
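
A hedged example of such a selective query with Dask; the dataset path, the hive-style partition column date and the selected columns are assumptions:

    import dask.dataframe as dd

    # Only the listed columns are read (column pruning), and the filter is
    # matched against hive-style partition directories (e.g. date=2018-07-01/)
    # and against row-group statistics, so non-matching chunks are never read.
    ddf = dd.read_parquet(
        "s3://my-bucket/events/",
        columns=["user_id", "value"],
        filters=[("date", "==", "2018-07-01")],
    )

    print(ddf.head())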