Installation
The DuckDB Python API can be installed using pip: pip install duckdb. Please see the installation page for details. It is also possible to install DuckDB using conda: conda install python-duckdb -c conda-forge.
Basic API Usage
The standard DuckDB Python API provides a SQL interface compliant with the DB-API 2.0 specification described by PEP 249, similar to the SQLite Python API.
Startup & Shutdown
To use the module, you must first create a Connection object that represents the database. The connection object takes as a parameter the database file to read from and write to. The special value :memory: (the default) can be used to create an in-memory database. Note that for an in-memory database no data is persisted to disk (i.e., all data is lost when you exit the Python process). If you would like to connect to an existing database in read-only mode, set the read_only flag to True. Read-only mode is required if multiple Python processes want to access the same database file at the same time.
import duckdb
# to start an in-memory database
con = duckdb.connect(database=':memory:')
# to use a database file (not shared between processes)
con = duckdb.connect(database='my-db.duckdb', read_only=False)
# to use a database file (shared between processes)
con = duckdb.connect(database='my-db.duckdb', read_only=True)
If you want to create a second connection to an existing database, you can use the cursor() method. This can be useful, for example, to allow parallel threads to run queries independently. A single connection is thread-safe, but it is locked for the duration of each query, effectively serializing database access in this case.
Connections are closed implicitly when they go out of scope or when they are explicitly closed using close(). Once the last connection to a database instance is closed, the database instance is closed as well.
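As a small sketch of both points, using a throwaway in-memory database (the tmp_con name is only for illustration):
# open a temporary database, run a query through a second cursor, then close it
tmp_con = duckdb.connect(database=':memory:')
cursor = tmp_con.cursor()
cursor.execute("SELECT 42")
print(cursor.fetchall())
# [(42,)]
tmp_con.close()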
Querying
SQL queries can be sent to DuckDB using the execute() method of connections. Once a query has been executed, results can be retrieved using the fetchone and fetchall methods on the connection. Below is a short example:
# create a table
con.execute("CREATE TABLE items(item VARCHAR, value DECIMAL(10,2), count INTEGER)")
# insert two items into the table
con.execute("INSERT INTO items VALUES ('jeans', 20.0, 1), ('hammer', 42.2, 2)")
# retrieve the items again
con.execute("SELECT * FROM items")
print(con.fetchall())
# [('jeans', 20.0, 1), ('hammer', 42.2, 2)]
The description property of the connection object contains the column names as per the standard.
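For example, the column names of the most recent query can be listed as follows (a small sketch, assuming the items table created above):
# list the column names of the last query result via the DB-API description property
con.execute("SELECT * FROM items")
print([col[0] for col in con.description])
# ['item', 'value', 'count']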
DuckDB also supports prepared statements in the API with the execute and executemany methods. In this case, the parameters are passed as an additional argument after a query that contains ? placeholders. Here is an example:
# insert a row using prepared statements
con.execute("INSERT INTO items VALUES (?, ?, ?)", ['laptop', 2000, 1])
# insert several rows using prepared statements
con.executemany("INSERT INTO items VALUES (?, ?, ?)", [['chainsaw', 500, 10], ['iphone', 300, 2]])
# query the database using a prepared statement
con.execute("SELECT item FROM items WHERE value > ?", [400])
print(con.fetchall())
# [('laptop',), ('chainsaw',)]
Do not use executemany to insert large amounts of data into DuckDB. See below for better options.
Efficient Transfer
Transferring large datasets to and from DuckDB uses a separate API built around NumPy and Pandas. This API works with entire columns of data instead of scalar values and is therefore far more efficient.
DuckDB supports “registering” a Pandas data frame as a virtual table, comparable to a SQL VIEW. Below is an example:
import pandas as pd
test_df = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
con.register('test_df_view', test_df)
con.execute('SELECT * FROM test_df_view')
con.fetchall()
# [(1, 'one'), (2, 'two'), (3, 'three'), (4, 'four')]
You can now use the registered view to create a persistent table in DuckDB:
con.execute('CREATE TABLE test_df_table AS SELECT * FROM test_df_view')
DuckDB keeps a reference to the Pandas data frame after registration. This prevents the data frame from being garbage-collected by the Python runtime. The reference is cleared when the connection is closed, but it can also be cleared manually using the unregister() method.
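For example:
# manually release the reference to the registered data frame
con.unregister('test_df_view')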
When retrieving the data from DuckDB back into Python, the standard method of calling fetchall() is inefficient, as individual Python objects need to be created for every value in the result set. When retrieving a lot of data, this can become very slow. In DuckDB, there are two additional methods that can be used to retrieve data efficiently: fetchnumpy() and fetchdf(). fetchnumpy() fetches the data as a dictionary of NumPy arrays. fetchdf() fetches the data as a Pandas DataFrame.
Below is an example of using this functionality:
# fetch as pandas data frame
df = con.execute("SELECT * FROM items").fetchdf()
print(df)
# item value count
# 0 jeans 20.0 1
# 1 hammer 42.2 2
# 2 laptop 2000.0 1
# 3 chainsaw 500.0 10
# 4 iphone 300.0 2
# fetch as dictionary of numpy arrays
arr = con.execute("SELECT * FROM items").fetchnumpy()
print(arr)
# {'item': masked_array(data=['jeans', 'hammer', 'laptop', 'chainsaw', 'iphone'],
# mask=[False, False, False, False, False],
# fill_value='?',
# dtype=object), 'value': masked_array(data=[20.0, 42.2, 2000.0, 500.0, 300.0],
# mask=[False, False, False, False, False],
# fill_value=1e+20), 'count': masked_array(data=[1, 2, 1, 10, 2],
# mask=[False, False, False, False, False],
# fill_value=999999,
# dtype=int32)}
Also refer to the data import documentation for more options for importing data efficiently.
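As a sketch of the pattern recommended over executemany() above, rows can be bulk-inserted from a registered data frame; the new_items_df frame and view name below are made up for illustration:
# bulk-insert rows from a Pandas data frame instead of using executemany()
import pandas as pd
new_items_df = pd.DataFrame({"item": ["drill"], "value": [79.9], "count": [3]})
con.register('new_items_view', new_items_df)
con.execute("INSERT INTO items SELECT * FROM new_items_view")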
Relational API
In addition to the SQL API, DuckDB supports a programmatic method of constructing queries. See https://github.com/duckdb/duckdb/blob/master/examples/python/duckdb-python.py for an example.
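A minimal sketch of what this looks like, assuming the table(), filter() and project() methods of the relational API and the items table created above:
# build a query programmatically instead of writing SQL
rel = con.table('items')
print(rel.filter('value > 100').project('item').fetchall())
# e.g. [('laptop',), ('chainsaw',), ('iphone',)]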