Peter Hoffmann, Director Data Engineering at Blue Yonder. Python Developer, Conference Speaker, Mountaineer.

Python Support in Snowflake

Snowflake offers several ways to access and call Python from within its compute infrastructure. This post shows how to use Python in user defined functions, in stored procedures, and with Snowpark.

The Snowflake cloud-based data storage and analytics service includes Python capabilities as part of its offering. Snowpark provides native Python integration with Snowflake's execution engine, so Python can be used to extend, call, and trigger data pipelines inside its managed virtual warehouse infrastructure.

The Snowpark Python integration offers three ways to interact with Python inside Snowflake:

Python User Defined Functions: A user defined function in Snowflake is called as part of a SQL statement to provide functionality that is not part of the standard SQL interface. To address the performance issues that come with row-wise execution, Snowpark also offers a vectorized mini-batch interface for user defined functions.

Python Stored Procedures: Stored procedures in Snowflake are called as independent statements; you cannot call a stored procedure as part of an expression. A stored procedure can return a value, but that value cannot be passed on to another operation. It is possible to execute multiple statements within a single stored procedure.

Snowpark Python DataFrame API: A DataFrame API, similar to PySpark, to query Snowflake data and execute data pipelines. Snowflake transparently translates the DataFrame statements to SQL expressions at execution time and thus benefits heavily from the SQL query optimizer. A minimal sketch follows below.
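
As a quick illustration of the DataFrame API, here is a minimal sketch; the connection parameters are placeholders, and sample_product_data is the example table created later in this post:

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Placeholder connection parameters; fill in your own account details.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# DataFrames are lazy: filter and select are translated into a single SQL
# query that only runs when an action like show() or collect() is called.
df = session.table("sample_product_data").filter(col("category_id") == 1).select("id", "name")
df.show()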

Creating a scalar user defined function in Python

You can define user defined Python functions and call them like normal SQL functions from Snowflake. The UDFs are scalar functions: each row is passed into the UDF and a single value is returned. Compared to built-in SQL functions or UDFs in JavaScript, the runtime performance is rather poor, as Snowflake has to convert every value into a Python type and do the same on the output side. For performance-critical statements, Snowflake offers a batch UDF API that works with pandas DataFrames (see below).

Still, Python scalar UDFs are incredibly useful if you want to extend your SQL statements with the power of Python code.

CREATE OR REPLACE FUNCTION sizeof_fmt(val number)
returns text
language python
runtime_version = '3.8'
handler = 'fn'
AS
$$

def fn(val):
    # Walk up the binary prefixes until the value fits below 1024.
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(val) < 1024.0:
            return "{:3.1f}{}B".format(val, unit)
        val /= 1024.0
    return "{:.1f}{}B".format(val, 'Yi')
$$
;

The example below uses the UDF to calculate a human-readable representation of large byte counts within a query that gets the database sizes from the information schema:

select
    usage_date,
    database_name,
    average_database_bytes,
    sizeof_fmt(average_database_bytes)
from
    table(snowflake.information_schema.database_storage_usage_history(
            dateadd('days',-10,current_date()),current_date()));

This gives us a nice human-readable representation in the result set.
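
For a quick sanity check, the UDF can also be called on a constant (an illustrative value; the expected result follows from the Python code above):

select sizeof_fmt(1234567890);
-- 1.1GiB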

Creating a user defined function with the Python UDF Batch API

The Python UDF Batch API offers a much more performant way to process data in batches of rows. This is achieved by exposing an interface that works directly on pandas DataFrames or NumPy arrays.

The following example is a trivial one that just uses arithmetic in pandas. In a follow-up blog post we will use this functionality to do online scoring with a logistic regression from scikit-learn.

create function add_one_to_inputs(x number(10, 0), y number(10, 0))
returns number(10, 0)
language python
runtime_version = '3.8'
packages = ('pandas')
handler = 'add_one_to_inputs'
as $$
import pandas
from _snowflake import vectorized

@vectorized(input=pandas.DataFrame, max_batch_size=1000)
def add_one_to_inputs(df):
  # Input columns arrive keyed by position: df[0] is x, df[1] is y.
  return df[0] + df[1] + 1
$$;

The pandas-based user defined function can then be used as usual within SQL statements:

with features as (
    select
        row_number() over (order by false) as a,
        pow(2, row_number() over (order by false)) as b,
        uniform(1, 100, random()) as c
    from table(generator(rowcount => 10))
)

select a, b, add_one_to_inputs(a, b) from features;
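
To see what the handler receives, the same logic can be exercised locally with plain pandas. This is a sketch of the batch interface, in which Snowflake passes the input columns keyed by position:

import pandas

# Simulate one batch the way Snowflake passes it: integer column labels,
# one column per UDF argument.
batch = pandas.DataFrame({0: [1, 2, 3], 1: [2, 4, 8]})

def add_one_to_inputs(df):
    return df[0] + df[1] + 1

print(add_one_to_inputs(batch).tolist())  # [4, 7, 12]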

Python Stored Procedures

Stored procedures in Snowflake are called as independent statements; you cannot call a stored procedure as part of an expression. A stored procedure can return a value, but that value cannot be passed on to another operation.

It is possible to execute multiple statements within a stored procedure. Inside a stored procedure you have access to the same session object as within the Snowpark Python API; it is passed implicitly into the handler function.

CREATE OR REPLACE PROCEDURE MYPROC()
  RETURNS STRING
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.8'
  PACKAGES = ('snowflake-snowpark-python')
  HANDLER = 'run'
AS
$$
def run(session):
  # The session object is injected by Snowflake; no connection setup is needed.
  stm = 'CREATE OR REPLACE TABLE sample_product_data (id INT, parent_id INT, category_id INT, name VARCHAR, serial_number VARCHAR, key INT, "3rd" INT)'
  res = session.sql(stm).collect()
  return str(res)
$$;

Because multiple statements can be executed within one stored procedure, stored procedures lend themselves to database maintenance tasks and similar jobs; a sketch of such a procedure closes this post.

The stored procedure can be called like any other native stored procedure:

call MYPROC();

It will create the sample_product_data table and return the status of the CREATE statement as a string (something like Table SAMPLE_PRODUCT_DATA successfully created.).
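
To sketch the maintenance use case mentioned above (the CLEANUP procedure and its statements are hypothetical, reusing the sample_product_data table from the example):

CREATE OR REPLACE PROCEDURE CLEANUP()
  RETURNS STRING
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.8'
  PACKAGES = ('snowflake-snowpark-python')
  HANDLER = 'run'
AS
$$
def run(session):
  # Several statements run sequentially against the same implicit session.
  session.sql('DELETE FROM sample_product_data WHERE id IS NULL').collect()
  session.sql('ALTER TABLE sample_product_data CLUSTER BY (category_id)').collect()
  return 'cleanup done'
$$;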