#PYDATA

Azure Synapse SQL-on-Demand Openrowset Common Table Expression with SQLAlchemy

Using SQLAlchemy to create openrowset common table expressions for Azure Synapse SQL-on-Demand
#python #sql #pydata #azure

Peter Hoffmann

Using turbodbc to access Azure Synapse SQL-on-Demand endpoints

Azure Synapse SQL-on-Demand offers a web client, the desktop client Azure Data Studio, and ODBC access with turbodbc to query Parquet files in the Azure Data Lake.
#python #sql #pydata #azure

Peter Hoffmann

DuckDB vs Azure Synapse SQL-on-Demand with parquet

Inspired by Uwe Korn's post on DuckDB, this post shows how to use Azure Synapse SQL-on-Demand to query Parquet files with T-SQL on a serverless cloud infrastructure.
#python #parquet #pydata #pandas #azure

Peter Hoffmann

Azure Data Explorer and Parquet files in the Azure Blob Storage

Last summer Microsoft rebranded the Azure Kusto query engine as Azure Data Explorer. While it does not support fully elastic scaling, it at least allows scaling a cluster up and out via an API or the Azure portal to adapt to different workloads. It also offers Parquet support out of the box, which prompted me to spend some time looking into it.
#python #pydata #azure #parquet

Peter Hoffmann

Understand predicate pushdown on row group level in Parquet with pyarrow and python

Apache Parquet is a columnar file format for working with gigabytes of data. Reading and writing Parquet files is efficiently exposed to Python with pyarrow. Statistics stored in the file metadata allow clients to use predicate pushdown to read only subsets of the data, reducing I/O. Organizing data by column allows for better compression, as the data is more homogeneous. Better compression also reduces the bandwidth required to read the input.
#python #pydata #parquet #arrow #pandas

Peter Hoffmann

Azure Data Lake Storage Gen 2 with Python

Microsoft has released a beta version of the python client azure-storage-file-datalake for the Azure Data Lake Storage Gen 2 service with support for hierarchical namespaces.
#python #pydata

Peter Hoffmann

Exasol User Group Karlsruhe

Exasol on Microsoft Azure – automatic deployment in less than 30 minutes
#exasol #azure #pydata #talk

Peter Hoffmann

Getting started with the Cloudera Kudu storage engine in python

Cloudera Kudu is a distributed storage engine for fast data analytics. The Python API is in alpha stage but already usable.
#python #pydata #spark

Peter Hoffmann

EuroPython 2015 PySpark - Data Processing in Python on top of Apache Spark

Apache Spark is a computational engine for large-scale data processing. PySpark exposes the Spark programming model to Python. It defines an API for Resilient Distributed Datasets (RDDs) and the DataFrame API.
#python #pydata #spark #talk

Peter Hoffmann

PyData 2015 Berlin - Introduction to the PySpark DataFrame API

This talk from PyData 2015 Berlin gives an overview of the PySpark DataFrame API.
#python #pydata #spark #talk

Peter Hoffmann