{"id":589,"date":"2024-03-08T15:37:06","date_gmt":"2024-03-08T23:37:06","guid":{"rendered":"https:\/\/aklaver.org\/wordpress\/?p=589"},"modified":"2024-03-08T15:37:06","modified_gmt":"2024-03-08T23:37:06","slug":"using-polars-duckdb-with-postgres","status":"publish","type":"post","link":"https:\/\/aklaver.org\/wordpress\/2024\/03\/08\/using-polars-duckdb-with-postgres\/","title":{"rendered":"Using Polars &#038; DuckDB with Postgres"},"content":{"rendered":"\n<p>This post will look at two relatively new programs for transforming data sets and their applicability to moving data into a Postgres database. Both programs have Python API&#8217;s that will be used for this exercise. <a href=\"http:\/\/Polars (https:\/\/pola.rs\/)\">Polars<\/a> is written in Rust and also has a <a href=\"https:\/\/docs.rs\/polars\/latest\/polars\/\">Rust API<\/a>. <a href=\"https:\/\/duckdb.org\/\">DuckDB<\/a> is written in C++ and has multitude of <a href=\"https:\/\/duckdb.org\/docs\/api\/overview\">API&#8217;s<\/a>. Both programs also offer CLI versions that leverage the SQLite3 CLI. The DuckDB <a href=\"https:\/\/duckdb.org\/docs\/api\/cli\/overview\">CLI<\/a> is quite useable. The Polars <a href=\"https:\/\/github.com\/pola-rs\/polars-cli\">CLI<\/a> not so much. The data set to be transformed and transferred comes from the New York City taxi data page <a href=\"https:\/\/www.nyc.gov\/site\/tlc\/about\/tlc-trip-record-data.page\">Trip data<\/a>. The specific data is that for September 2023 in the form of a Parquet file and numbers approximately 2.8 million rows. As to why this month, the query against this data will answer that.<\/p>\n\n\n\n<p>The software used for this demo:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>adbc-driver-manager==0.10.0\nadbc-driver-postgresql==0.10.0\nduckdb==0.10.0\npolars==0.20.7\nSQLAlchemy==2.0.23\n\n<\/code><\/pre>\n\n\n\n<p>Machine specifications. <\/p>\n\n\n\n<p>Intel\u00ae Core&#x2122; i5-8400 CPU @ 2.80GHz \u00d7 6<\/p>\n\n\n\n<p>Memory 16.0&nbsp;GiB<\/p>\n\n\n\n<p>The code to do this is actually quite straightforward:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import polars as pl<br>import duckdb<br>import sqlalchemy as sa<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>uri = \"postgresql:\/\/postgres:postgres@localhost:5452\/test\"\npl.read_parquet('\/home\/aklaver\/Downloads\/yellow_tripdata_2023-09.parquet').write_database('nyc_taxi_pl', uri, engine='adbc', if_table_exists='replace')\n# Wall time: 10.7 s<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>duckdb.load_extension('postgres')\nduckdb.sql(\"ATTACH 'postgres:dbname=test port=5452 user=postgres host=localhost' AS postgres\")\nduckdb.sql('create or replace table postgres.nyc_taxi_duckdb as select * from read_parquet(\"\/home\/aklaver\/Downloads\/yellow_tripdata_2023-09.parquet\")')\n# Wall time: 5.4 s<\/code><\/pre>\n\n\n\n<p>For the Polars case write_database() takes the data frame created by read_parquet() and writes it out to the Postgres table nyc_taxi_pl. The if_table_exists=&#8217;replace&#8217; argument means an existing version of the table will be dropped and then a new version of the table is created and populated with the data from the data frame. The structure of the table, column names and types will be seen later when the query is run. Probably the most important part is the engine=&#8217;adbc&#8217; argument, where &#8216;adbc&#8217; means <a href=\"https:\/\/arrow.apache.org\/adbc\/current\/index.html\">Arrow Database Connectivity<\/a> from the Apache Arrow project. To use this the appropriate packages will need to be installed in the Python environment:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install adbc-driver-manager adbc-driver-postgresql<\/code><\/pre>\n\n\n\n<p>write_database() will also accept an SQLAlchemy connection. The issue is that<br>the SLA connection uses INSERTs to move the data into the table and is<br>painfully slow. The adbc connection uses COPY and is fast.<\/p>\n\n\n\n<p>The DuckDB case is best understood when you realize the code is a layer over the<br>DuckDB CLI interface and that the CLI emulates the SQLite CLI. See the DuckDB<br>CLI link above to get the deep dive. In this case the:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>duckdb.load_extension('postgres')<\/code><\/pre>\n\n\n\n<p>loads the <a href=\"https:\/\/duckdb.org\/docs\/extensions\/postgres)\">DuckDB Postgres<\/a> extension. In order for this to work at some point prior to this the extension would have needed be installed with<\/p>\n\n\n\n<p>Python<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>duckdb.install_extension('postgres')<\/code><\/pre>\n\n\n\n<p>or<\/p>\n\n\n\n<p>CLI<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>install postgres<\/code><\/pre>\n\n\n\n<p>Then:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>duckdb.sql(\"ATTACH 'postgres:dbname=test port=5452 user=postgres host=localhost' AS postgres\")<\/code><\/pre>\n\n\n\n<p>connects to the Postgres instance and attaches it to the DuckDB in memory<br>database. At this point the Postgres database can be treated essentially as a<br>schema in the DuckDB database. Which is where this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>duckdb.sql('create or replace table postgres.nyc_taxi_duckdb as select * from read_parquet(\"\/home\/aklaver\/Downloads\/yellow_tripdata_2023-09.parquet\")')<\/code><\/pre>\n\n\n\n<p>comes into play. The query reads the Parquet file and then creates a Postgres<br>table if it does not exist or replaces it if it does exist.<\/p>\n\n\n\n<p>A quick explanation on what led me to the query detailed in next section.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>polars_df = pl.read_parquet('\/home\/aklaver\/Downloads\/yellow_tripdata_2023-09.parquet')<br>polars_df.select(\"trip_distance\").describe()<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><a href=\"https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-14-59-18.png\"><img loading=\"lazy\" decoding=\"async\" width=\"252\" height=\"268\" src=\"https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-14-59-18.png\" alt=\"\" class=\"wp-image-621\" style=\"width:234px;height:auto\"\/><\/a><\/figure>\n\n\n\n<p>When I did this originally I ran polars_df.describe() which showed all the columns<br>and their statistics. I have narrowed it down here to highlight the column that<br>caught my attention, in particular the max value of 265869.44 miles.<\/p>\n\n\n\n<p>Now to the Postgres query against the data. The query displays the total trip<br>distance, total fare amount, time of drop off, time of pickup, the difference<br>between the times as duration and the average mph for the trip. Trip distances<br>are limited to trip distances of &gt;= 1000 miles, these are the high achieving<br>cabs. Further only the 10 fastest cabs, in descending order, are displayed.<\/p>\n\n\n\n<p>Using Polars.<\/p>\n\n\n\n<p>Note for this case SQLAlchemy is being used instead of the adbc<br>driver used to add data to the table. The reason is the Postgres adbc driver is<br>a work in progress and at the time of writing does not support the interval<br>type represented by the duration values. The set_fmt_float=&#8221;full&#8221; is done to<br>keep the large numbers from being displayed in scientific notation.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pl_qry = \"\"\"SELECT\ntrip_distance, total_amount, tpep_dropoff_datetime, tpep_pickup_datetime,\ntpep_dropoff_datetime - tpep_pickup_datetime as duration,\nround(trip_distance\/(extract(epoch from tpep_dropoff_datetime - tpep_pickup_datetime)\/3600)) as mph\nFROM\nnyc_taxi_pl\nWHERE\ntrip_distance &gt;= 1000\nORDER BY\nmph desc\nLIMIT\n10\n\"\"\"\n\nimport sqlalchemy as sa\n\npl_con = sa.create_engine('postgresql:\/\/postgres:@localhost:5452\/test')\n\nwith pl.Config(set_fmt_float=\"full\"):\nprint(pl.read_database(pl_qry, pl_con))<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-14-50-37.png\"><img loading=\"lazy\" decoding=\"async\" width=\"837\" height=\"289\" src=\"https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-14-50-37.png\" alt=\"\" class=\"wp-image-617\" srcset=\"https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-14-50-37.png 837w, https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-14-50-37-300x104.png 300w, https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-14-50-37-768x265.png 768w, https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-14-50-37-500x173.png 500w\" sizes=\"auto, (max-width: 837px) 100vw, 837px\" \/><\/a><\/figure>\n\n\n\n<p>Using DuckDB. The query logic is the same, the table name is different to match<br>that in the Postgres database. The con = duckdb.connect(&#8220;:default:&#8221;) sets up<br>the connect to the DuckDB default in memory database. Then since the Postgres<br>database was attached to the DuckDB database the FROM postgres.nyc_taxi_duckdb<br>fetches from Postgres.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>duck_qry = \"\"\"SELECT\ntrip_distance, total_amount, tpep_dropoff_datetime, tpep_pickup_datetime,\ntpep_dropoff_datetime - tpep_pickup_datetime as duration,\nround(trip_distance\/(extract(epoch from tpep_dropoff_datetime - tpep_pickup_datetime)\/3600)) as mph\nFROM\npostgres.nyc_taxi_duckdb\nWHERE\ntrip_distance &gt;= 1000\nORDER BY\nmph desc\nLIMIT\n10\n\"\"\"\n\ncon = duckdb.connect(\":default:\")\ncon.sql(duck_qry)\n<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-15-03-15.png\"><img loading=\"lazy\" decoding=\"async\" width=\"831\" height=\"293\" src=\"https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-15-03-15.png\" alt=\"\" class=\"wp-image-625\" srcset=\"https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-15-03-15.png 831w, https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-15-03-15-300x106.png 300w, https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-15-03-15-768x271.png 768w, https:\/\/aklaver.org\/wordpress\/wp-content\/uploads\/2024\/03\/Screenshot-from-2024-03-08-15-03-15-500x176.png 500w\" sizes=\"auto, (max-width: 831px) 100vw, 831px\" \/><\/a><\/figure>\n\n\n\n<p>My take away is that the cab at the top of the list is the Millennium Falcon disguised as a NYC Yellow cab. Also that a 42,652.39 mile trip in 1 minute for $20.48 is the deal of the millennium.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post will look at two relatively new programs for transforming data sets and their applicability to moving data into a Postgres database. Both programs have Python API&#8217;s that will be used for this exercise. Polars is written in Rust &hellip; <a href=\"https:\/\/aklaver.org\/wordpress\/2024\/03\/08\/using-polars-duckdb-with-postgres\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ngg_post_thumbnail":0,"footnotes":""},"categories":[5],"tags":[],"class_list":["post-589","post","type-post","status-publish","format-standard","hentry","category-postgres"],"_links":{"self":[{"href":"https:\/\/aklaver.org\/wordpress\/wp-json\/wp\/v2\/posts\/589","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aklaver.org\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aklaver.org\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aklaver.org\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/aklaver.org\/wordpress\/wp-json\/wp\/v2\/comments?post=589"}],"version-history":[{"count":37,"href":"https:\/\/aklaver.org\/wordpress\/wp-json\/wp\/v2\/posts\/589\/revisions"}],"predecessor-version":[{"id":632,"href":"https:\/\/aklaver.org\/wordpress\/wp-json\/wp\/v2\/posts\/589\/revisions\/632"}],"wp:attachment":[{"href":"https:\/\/aklaver.org\/wordpress\/wp-json\/wp\/v2\/media?parent=589"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aklaver.org\/wordpress\/wp-json\/wp\/v2\/categories?post=589"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aklaver.org\/wordpress\/wp-json\/wp\/v2\/tags?post=589"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}