This post will look at two relatively new programs for transforming data sets and their applicability to moving data into a Postgres database. Both programs have Python APIs, which are what this exercise uses. Polars is written in Rust and also has a Rust API. DuckDB is written in C++ and has a multitude of APIs. Both programs also offer CLI versions that leverage the SQLite3 CLI. The DuckDB CLI is quite usable; the Polars CLI, not so much. The data set to be transformed and transferred comes from the New York City taxi data page Trip data. The specific data is that for September 2023, in the form of a Parquet file of approximately 2.8 million rows. As to why this month in particular, the query against the data will answer that.
The software used for this demo:
adbc-driver-manager==0.10.0
adbc-driver-postgresql==0.10.0
duckdb==0.10.0
polars==0.20.7
SQLAlchemy==2.0.23
Machine specifications.
Intel® Core™ i5-8400 CPU @ 2.80GHz × 6
Memory 16.0 GiB
The code to do this is actually quite straightforward:
import polars as pl
import duckdb
import sqlalchemy as sa

uri = "postgresql://postgres:postgres@localhost:5452/test"

# Polars: read the Parquet file into a data frame and write it to Postgres via adbc.
pl.read_parquet('/home/aklaver/Downloads/yellow_tripdata_2023-09.parquet').write_database('nyc_taxi_pl', uri, engine='adbc', if_table_exists='replace')
# Wall time: 10.7 s

# DuckDB: attach the Postgres database and create the table straight from the Parquet file.
duckdb.load_extension('postgres')
duckdb.sql("ATTACH 'postgres:dbname=test port=5452 user=postgres host=localhost' AS postgres")
duckdb.sql("create or replace table postgres.nyc_taxi_duckdb as select * from read_parquet('/home/aklaver/Downloads/yellow_tripdata_2023-09.parquet')")
# Wall time: 5.4 s
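Neither call shows what actually landed in Postgres, so a quick sanity check doesn't hurt. A minimal sketch, reusing the SQLAlchemy import and uri from above, to confirm both tables hold the full ~2.8 million rows:

# Confirm the row counts in Postgres match the source Parquet file.
eng = sa.create_engine(uri)
with eng.connect() as con:
    for tbl in ('nyc_taxi_pl', 'nyc_taxi_duckdb'):
        print(tbl, con.execute(sa.text(f'select count(*) from {tbl}')).scalar())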
For the Polars case write_database() takes the data frame created by read_parquet() and writes it out to the Postgres table nyc_taxi_pl. The if_table_exists='replace' argument means an existing version of the table will be dropped, then a new version of the table is created and populated with the data from the data frame. The structure of the table, column names and types, will be seen later when the query is run. Probably the most important part is the engine='adbc' argument, where 'adbc' means Arrow Database Connectivity from the Apache Arrow project. To use it the appropriate packages need to be installed in the Python environment:
pip install adbc-driver-manager adbc-driver-postgresql
write_database() will also accept an SQLAlchemy connection. The issue is that the SQLAlchemy connection uses INSERTs to move the data into the table and is painfully slow. The adbc connection uses COPY and is fast.
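For comparison, the SQLAlchemy path is nearly identical code; only the engine argument changes. A sketch (the nyc_taxi_pl_sa table name is hypothetical):

# Same transfer via SQLAlchemy; write_database() issues INSERTs here,
# so expect this to run far slower than the adbc version above.
pl.read_parquet('/home/aklaver/Downloads/yellow_tripdata_2023-09.parquet').write_database('nyc_taxi_pl_sa', uri, engine='sqlalchemy', if_table_exists='replace')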
The DuckDB case is best understood once you realize the Python code is a thin layer over the DuckDB CLI interface, and that the CLI emulates the SQLite CLI. See the DuckDB CLI link above for the deep dive. In this case:
duckdb.load_extension('postgres')
loads the DuckDB Postgres extension. For this to work the extension needs to have been installed at some point beforehand, either from Python:
duckdb.install_extension('postgres')
or from the DuckDB CLI:
INSTALL postgres;
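To check the extension's status, DuckDB exposes the duckdb_extensions() table function; a quick sketch:

# The postgres row should show installed = true and loaded = true.
duckdb.sql("select extension_name, installed, loaded from duckdb_extensions() where extension_name = 'postgres'").show()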
Then:
duckdb.sql("ATTACH 'postgres:dbname=test port=5452 user=postgres host=localhost' AS postgres")
connects to the Postgres instance and attaches it to the DuckDB in-memory database. At this point the Postgres database can be treated essentially as a schema in the DuckDB database, which is where this:
duckdb.sql("create or replace table postgres.nyc_taxi_duckdb as select * from read_parquet('/home/aklaver/Downloads/yellow_tripdata_2023-09.parquet')")
comes into play. The query reads the Parquet file and then creates the Postgres table if it does not exist, or replaces it if it does.
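To see the attached catalog in action, DuckDB's SHOW ALL TABLES lists tables across every attached database; a quick sketch (output layout may vary by version):

# The Postgres table should appear under the 'postgres' database alias.
duckdb.sql('SHOW ALL TABLES').show()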
A quick explanation of what led me to the query detailed in the next section.
# Load the Parquet file and pull summary statistics for the trip_distance column.
polars_df = pl.read_parquet('/home/aklaver/Downloads/yellow_tripdata_2023-09.parquet')
polars_df.select("trip_distance").describe()
When I did this originally I ran polars_df.describe(), which showed all the columns and their statistics. I have narrowed it down here to highlight the column that caught my attention, in particular its max value of 265869.44 miles.
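For a sense of scale, a sketch of counting those outliers directly (using the polars_df from above):

# How many trips claim 1000 or more miles?
print(polars_df.filter(pl.col('trip_distance') >= 1000).height)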
Now to the Postgres query against the data. The query displays the total trip distance, the total fare amount, the drop-off time, the pickup time, the difference between the times as duration, and the average mph for the trip. Results are restricted to trips of >= 1000 miles, these being the high achieving cabs. Further, only the 10 fastest cabs are displayed, in descending order.
Using Polars.
Note that for this case SQLAlchemy is being used instead of the adbc driver that added the data to the table. The reason is that the Postgres adbc driver is a work in progress and, at the time of writing, does not support the interval type represented by the duration values. The set_fmt_float="full" setting keeps the large numbers from being displayed in scientific notation.
pl_qry = """SELECT
trip_distance, total_amount, tpep_dropoff_datetime, tpep_pickup_datetime,
tpep_dropoff_datetime - tpep_pickup_datetime as duration,
round(trip_distance/(extract(epoch from tpep_dropoff_datetime - tpep_pickup_datetime)/3600)) as mph
FROM
nyc_taxi_pl
WHERE
trip_distance >= 1000
ORDER BY
mph desc
LIMIT
10
"""
import sqlalchemy as sa

pl_con = sa.create_engine('postgresql://postgres:postgres@localhost:5452/test')
with pl.Config(set_fmt_float="full"):
    print(pl.read_database(pl_qry, pl_con))
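If you would rather stay on the adbc driver for the read as well, one workaround, sketched here and untested against the driver's exact type support, is to sidestep the interval type by returning the duration as plain seconds:

# Hypothetical adbc-friendly variant: duration as numeric seconds, no interval type.
adbc_qry = """SELECT
trip_distance, total_amount, tpep_dropoff_datetime, tpep_pickup_datetime,
extract(epoch from tpep_dropoff_datetime - tpep_pickup_datetime) as duration_secs,
round(trip_distance/(extract(epoch from tpep_dropoff_datetime - tpep_pickup_datetime)/3600)) as mph
FROM
nyc_taxi_pl
WHERE
trip_distance >= 1000
ORDER BY
mph desc
LIMIT
10
"""
print(pl.read_database_uri(adbc_qry, uri, engine='adbc'))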
Using DuckDB. The query logic is the same; the table name differs to match the one in the Postgres database. The con = duckdb.connect(":default:") call sets up the connection to DuckDB's default in-memory database. Then, since the Postgres database was attached to the DuckDB database, the FROM postgres.nyc_taxi_duckdb fetches from Postgres.
duck_qry = """SELECT
trip_distance, total_amount, tpep_dropoff_datetime, tpep_pickup_datetime,
tpep_dropoff_datetime - tpep_pickup_datetime as duration,
round(trip_distance/(extract(epoch from tpep_dropoff_datetime - tpep_pickup_datetime)/3600)) as mph
FROM
postgres.nyc_taxi_duckdb
WHERE
trip_distance >= 1000
ORDER BY
mph desc
LIMIT
10
"""
con = duckdb.connect(":default:")
con.sql(duck_qry)
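con.sql() returns a DuckDB relation and prints as a table; if a Polars data frame is wanted instead, the relation converts directly. A small usage sketch:

# Convert the DuckDB result to a Polars DataFrame for further processing.
result_df = con.sql(duck_qry).pl()
print(result_df)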
My takeaway is that the cab at the top of the list is the Millennium Falcon disguised as a NYC Yellow cab. Also, a 42,652.39 mile trip in 1 minute for $20.48 is the deal of the millennium.