Peter Hoffmann

PyData 2015 Berlin - Introduction to the PySpark DataFrame API

This talk from PyData 2015 Berlin gives an overview of the PySpark DataFrame API. While Spark Core is written in Scala and runs on the JVM, PySpark exposes the Spark programming model to Python. The DataFrame API was introduced in Spark 1.3. DataFrames build on Spark’s Resilient Distributed Dataset (RDD) model and are inspired by pandas and R data frames. The API provides concise operations for filtering, aggregating, and projecting over large datasets. It supports a range of data sources, including JSON, Parquet, Hive tables, and JDBC connections, and queries benefit from the Catalyst optimizer and, in later Spark releases, the Tungsten execution engine.
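As a rough illustration of the style of API the talk covers, the sketch below reads a JSON source, projects and filters columns, aggregates, and writes Parquet. It uses the modern SparkSession entry point rather than the Spark 1.3-era SQLContext shown at the time, and the file paths and column names (events.json, user_id, amount) are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    # Load a JSON data source into a DataFrame (path is hypothetical).
    events = spark.read.json("events.json")

    # Projection and filtering: keep two columns and positive amounts.
    recent = events.select("user_id", "amount").filter(F.col("amount") > 0)

    # Aggregation: total amount per user.
    totals = recent.groupBy("user_id").agg(F.sum("amount").alias("total_amount"))

    # Catalyst builds an optimized query plan; nothing executes until an
    # action such as show() or a write is triggered.
    totals.show()

    # The same DataFrame can be written back out as Parquet.
    totals.write.mode("overwrite").parquet("totals.parquet")

    spark.stop()

Because transformations are lazy, the filter, projection, and aggregation above are collapsed into a single optimized plan before any data is read.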