Sal
Peter Hoffmann Director Data Engineering at Blue Yonder. Python Developer, Conference Speaker, Mountaineer

Using docker multistage build to build turbodbc with pyarrow support on Debian 11

turbodbc

Turbodbc is a Python module to access relational databases via the Open Database Connectivity (ODBC) interface. For maximum performance, turbodbc offers built-in NumPy and Apache Arrow support and internally relies on batched data transfer instead of single-record communication as other popular ODBC modules do.
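
To make the Arrow support concrete, here is a minimal sketch of reading a query result as an Arrow table. The helper name fetch_as_arrow and its injectable connect parameter are my own additions for illustration; connect(), cursor(), execute() and fetchallarrow() are turbodbc calls, and fetchallarrow() only works if turbodbc was built with pyarrow support.

```python
def fetch_as_arrow(sql, connect=None, **connect_kwargs):
    """Run `sql` and return the whole result set as a pyarrow.Table.

    `connect` defaults to turbodbc.connect; passing it in explicitly is a
    hypothetical convenience so the helper can be tried without a database.
    """
    if connect is None:
        # Imported lazily so this module loads even where turbodbc is missing.
        from turbodbc import connect

    connection = connect(**connect_kwargs)
    try:
        cursor = connection.cursor()
        cursor.execute(sql)
        # Requires a turbodbc build that detected pyarrow at compile time.
        return cursor.fetchallarrow()
    finally:
        connection.close()
```

Assuming an ODBC data source named my_dsn exists (a placeholder, not something from this setup), the call would look like fetch_as_arrow("SELECT 42", dsn="my_dsn").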

Building turbodbc with pyarrow support has some caveats: the build detects at compile time whether pyarrow is installed, and the C++ compilation needs pybind11 and several Debian dev packages.

By using Docker multi-stage builds we can build turbodbc with pyarrow support natively without carrying the dev packages into the final image.

The first step is the base image, which contains all Debian packages needed to run turbodbc later on:

# syntax=docker/dockerfile:1

FROM debian:bullseye as base

# Create a non-root user; the UID should be greater than 1000
RUN useradd --uid 1100 app --create-home

# Update and install in one layer so a cached, stale package index is never reused
RUN --mount=type=cache,target=/var/cache/apt apt-get update \
    && apt-get install --yes python3 python3-venv git \
        libodbc1 odbcinst odbcinst1debian2 binutils-x86-64-linux-gnu
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:${PATH}"
WORKDIR /app/
ENV PYTHONPATH=/app/

In the second stage we install the build requirements that are only needed to compile turbodbc with arrow support. There are two important notes:

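First, pyarrow has to be installed before turbodbc is built, because the turbodbc build process automatically detects whether pyarrow is available.

Second, to make the detection work you need to pass --no-build-isolation when installing turbodbc, so that the build sees the pyarrow already installed in the venv instead of an empty isolated build environment, and you need to make sure the Arrow libraries are linked correctly.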
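
As a rough illustration of what the build needs to find, the following sketch prints the header and library locations that pyarrow exposes for compiling against it. The function name arrow_build_info is made up for this example; get_include(), get_library_dirs() and get_libraries() are pyarrow helpers. It returns None when pyarrow is not importable, which is the situation turbodbc would see inside an isolated build environment.

```python
import importlib.util


def arrow_build_info():
    """Return the paths a compiler needs to build against pyarrow, or None."""
    if importlib.util.find_spec("pyarrow") is None:
        # pyarrow is not visible from this interpreter, so a build started
        # here could not detect it either.
        return None

    import pyarrow

    return {
        "version": pyarrow.__version__,
        "include_dir": pyarrow.get_include(),
        "library_dirs": pyarrow.get_library_dirs(),
        "libraries": pyarrow.get_libraries(),
    }
```

Running print(arrow_build_info()) in the venv before installing turbodbc is a quick way to confirm that pyarrow is visible to the build.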

FROM base as builder
RUN  --mount=type=cache,target=/var/cache/apt  apt-get -yq install \
    build-essential \
    gdb \
    lcov \
    libbz2-dev \
    libffi-dev \
    libgdbm-dev \
    liblzma-dev \
    libboost-dev \
    libncurses5-dev \
    libreadline6-dev \
    libsqlite3-dev \
    libssl-dev \
    lzma \
    lzma-dev \
    python3-dev \
    tk-dev \
    unixodbc-dev \
    uuid-dev \
    xvfb \
    zlib1g-dev


RUN pip3 install -U pip==22.0.4 setuptools==45.2.0 wheel==0.37.1

RUN pip3 install -U pybind11==2.10.1 numpy==1.23.5 pandas==1.5.2 six==1.16.0 pyarrow==5.0.0

RUN python3 -c "import pyarrow; pyarrow.create_library_symlinks()" \
    && CPPFLAGS="-D_GLIBCXX_USE_CXX11_ABI=0" pip3 install  --no-build-isolation turbodbc==4.5.5
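
Because the detection happens silently at build time, it can be worth checking the result before shipping the image. The sketch below assumes that turbodbc places its Arrow bindings in a separate extension module called turbodbc_arrow_support, which is how turbodbc 4.x works to my knowledge; the function name has_arrow_support is my own.

```python
import importlib.util


def has_arrow_support():
    """Return True if turbodbc's Arrow extension module can be found."""
    # Assumption: the extension is only built and installed when pyarrow
    # was detected while compiling turbodbc.
    return importlib.util.find_spec("turbodbc_arrow_support") is not None
```

Calling has_arrow_support() inside the builder stage and failing the build when it returns False would catch a turbodbc that was compiled without pyarrow.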

In the third stage we start again from the clean base image and only copy over the venv that contains the compiled turbodbc packages:

FROM base as runner
COPY --from=builder /opt/venv /opt/venv

COPY requirements.txt /app/requirements.txt

RUN --mount=type=cache,target=/root/.cache  pip install  --requirement /app/requirements.txt

# Set the User we created above
USER 1100

CMD []