Apache Parquet is a columnar file format designed for working with gigabytes of data. Reading and writing Parquet files is efficiently
exposed to Python through pyarrow. Per-column statistics allow clients to use
predicate pushdown and read only the relevant subsets of the data, reducing I/O.
Organizing data by column also enables better
compression, since the data within a column is more homogeneous. Better compression in turn reduces the
bandwidth required to read the input.
#python #pydata #parquet #arrow #pandas
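
A minimal sketch of writing a Parquet file and reading it back with a pushed-down predicate, assuming a recent pyarrow where `read_table` accepts a `filters` argument; the file name and column names are illustrative:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Write a pandas DataFrame to Parquet via an Arrow Table.
df = pd.DataFrame({"year": [2017, 2018, 2018], "value": [1.0, 2.0, 3.0]})
pq.write_table(pa.Table.from_pandas(df), "data.parquet")

# Predicate pushdown: row groups whose statistics rule out the filter
# are skipped entirely, so only matching subsets are read from disk.
result = pq.read_table("data.parquet", filters=[("year", "=", 2018)])
print(result.to_pandas())
```

Selecting only the needed columns (`columns=["value"]`) further cuts I/O, since the columnar layout lets the reader skip whole columns.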
Applying Infrastructure-as-Code principles, where configuration lives in machine-processable definition files, together with the adoption of cloud computing provides faster feedback cycles in development and testing and reduces the risk of deployments to production.
This talk gives an overview of how to deploy web services on the Azure Cloud with tools such as Azure Resource Manager templates, the Azure SDK for Python, and the Azure modules for Ansible, and presents best practices learned while moving a company to the Azure Cloud.
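
A minimal sketch of one of these approaches, deploying an ARM template with the Azure SDK for Python; the packages used (`azure-identity`, `azure-mgmt-resource`), the template file, and all resource names are assumptions for illustration:

```python
import json
import os

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Authenticate and scope the client to one subscription.
client = ResourceManagementClient(
    DefaultAzureCredential(), os.environ["AZURE_SUBSCRIPTION_ID"]
)

# Resource group and template deployment are both declarative:
# the same definition can be re-applied idempotently.
client.resource_groups.create_or_update("my-rg", {"location": "westeurope"})

with open("webservice.template.json") as f:  # hypothetical ARM template
    template = json.load(f)

poller = client.deployments.begin_create_or_update(
    "my-rg",
    "webservice-deployment",
    {"properties": {"mode": "Incremental", "template": template}},
)
poller.result()  # block until the deployment finishes
```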
PyScaffold helps you easily set up a new Python project.
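
A minimal sketch, assuming PyScaffold's documented Python API (`create_project` in `pyscaffold.api`); the project is usually created with the `putup` command-line tool instead:

```python
# Assumption: PyScaffold 4-style API; older releases used the
# keyword `project` instead of `project_path`.
from pyscaffold.api import create_project

# Creates ./my_project with a package skeleton: setup.cfg, tests, docs.
create_project(project_path="my_project")
```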
Apache Spark is a computational engine for large-scale data processing.
PySpark exposes the Spark programming model to Python, providing both
the Resilient Distributed Dataset (RDD) API and the higher-level DataFrame API.
#python #pydata #spark #talk
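
A minimal sketch showing the two APIs side by side, assuming a local Spark installation; the app name and data are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# RDD API: low-level, functional transformations on distributed data.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
squares = rdd.map(lambda x: x * x).collect()  # [1, 4, 9, 16]

# DataFrame API: declarative operations, optimized by Spark's planner.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.filter(df.id > 1).show()

spark.stop()
```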