Last summer, Microsoft rebranded the Azure Kusto Query engine as Azure Data Explorer. While it does not support fully elastic scaling, it at least allows you to scale a cluster up and out via an API or the Azure portal to adapt to different workloads. It also offers Parquet support out of the box, which prompted me to spend some time looking into it.
#python #pydata #azure #parquet
Apache Parquet is a columnar file format for
working with gigabytes of data. Reading and writing Parquet files is efficiently
exposed to Python with pyarrow. Additional statistics stored with the data allow
clients to use predicate pushdown to read only subsets of the data and reduce I/O.
Organizing data by column allows for better
compression, as the data within a column is more homogeneous. Better compression
also reduces the bandwidth required to read the input.
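The effect of homogeneous columns can be illustrated with a toy run-length encoder. This is only a sketch under simplifying assumptions: `run_length_encode` is a hypothetical helper written for this example, the records are made up, and Parquet's real encodings (such as dictionary encoding and its RLE/bit-packing hybrid) are more sophisticated than this.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# Toy records (illustrative data): (country, status), 1,000 rows.
rows = [
    ("DE" if i < 600 else "US", "ok" if i < 900 else "failed")
    for i in range(1000)
]

# Row-oriented layout: the values of each record sit next to each other.
row_layout = [value for row in rows for value in row]

# Column-oriented layout: all countries first, then all statuses.
column_layout = [row[0] for row in rows] + [row[1] for row in rows]

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(column_layout)

# Compare how many runs each layout needs for the same data.
print(len(row_runs), len(column_runs))
```

In the row layout, neighbouring values come from different columns, so there is nothing to collapse. In the column layout, the same values form a handful of long runs.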
#python #pydata #parquet #arrow #pandas
Apache Spark is a computational engine for large-scale data processing.
PySpark exposes the Spark programming model to Python. It defines an API
for Resilient Distributed Datasets (RDDs) and the DataFrame API.
#python #pydata #spark #talk