The httpfs extension is a loadable extension implementing a file system that allows reading remote/writing remote files. For pure HTTP(S), only file reading is supported. For object storage using the S3 API, the httpfs extension supports reading/writing/globbing files.
Some clients come prebundled with this extension, in which case it’s not necessary to first install or even load the extension.
Depending on the client you use, no action may be required, or you might have to
INSTALL httpfs on first use and use
LOAD httpfs at the start of every session.
With the httpfs extension, it is possible to directly query files over HTTP(S). This currently works for CSV and Parquet files.
SELECT * FROM 'https://domain.tld/file.extension';
For CSV files, files will be downloaded entirely in most cases, due to the row-based nature of the format. For parquet files, DuckDB can use a combination of the Parquet metadata and HTTP range requests to only download the parts of the file that are actually required by the query. For example, the query:
SELECT column_a FROM 'https://domain.tld/file.parquet';
will only read the parquet metadata and the data for the
column_a column. In some cases even, no actual data needs to
be read at all as they only require reading the metadata:
SELECT COUNT(*) FROM 'https://domain.tld/file.parquet';
Scanning multiple files over HTTP(S) is also supported:
SELECT * FROM parquet_scan(['https://domain.tld/file1.parquet', 'https://domain.tld/file2.parquet']);
The httpfs extension supports reading/writing/globbing files on object storage servers using the S3 API.
The httpfs filesystem is tested with AWS S3, Minio, and Google cloud. Other services that implement the S3 API should also work, but not all features may be supported. Below is a list of which parts of the S3 API are required for each httpfs feature.
|Feature||Required S3 API features|
|Public file reads||HTTP Range requests|
|Private file reads||Secret key or session token authentication|
|File writes||Multipart upload|
To be able to read or write from S3, the correct region should be set:
Optionally, the endpoint can be configured in case a non-AWS object storage server is used:
Switching between path-style and vhost-style urls (see AWS docs) is possible using:
However, note that this may also require updating the endpoint. For example for AWS S3 it is required to change the
After configuring the correct endpoint and region, public files can be read. To also read private files, authentication credentials can be added:
SET s3_access_key_id='<AWS access key id>'; SET s3_secret_access_key='<AWS secret access key>';
Alternatively, session tokens are also supported and can be used instead:
SET s3_session_token='<AWS session token>';
Reading files from S3 is now as simple as:
SELECT * FROM 's3://bucket/file.extension';
Multiple files are also possible, for example:
SELECT * FROM parquet_scan(['s3://bucket/file1.parquet', 's3://bucket/file2.parquet']);
File globbing is implemented using the ListObjectV2 API call and allows to use filesystem-like glob patterns to match multiple files, for example:
SELECT * from parquet_scan('s3://bucket/*.parquet')
This query matches all files in the root of the bucket with the parquet extension.
Several features for matching are supported, such as
* to match any number of any character,
? for any single
[0-9] for a single character in a range of characters:
SELECT COUNT(*) FROM parquet_scan('s3://bucket/folder*/100?/t[0-9].parquet')
A useful feature when using globs is the
filename option which adds a column with the file that a row originated from:
SELECT * FROM parquet_scan('s3://bucket/*.parquet', FILENAME = 1);
could for example result in:
DuckDB also offers support for the Hive partitioning scheme. In the Hive partitioning scheme, data is partitioned in separate files. The columns by which the data is partitioned, are not actually in the files, but are encoded in the file path. So for example let us consider three parquet files Hive paritioned by year:
s3://bucket/year=2012/file.parquet s3://bucket/year=2013/file.parquet s3://bucket/year=2014/file.parquet
If scanning these files with the HIVE_PARTITIONING option enabled:
SELECT * FROM parquet_scan('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1);
could result in:
Note that the year column does not actually exist in the parquet files, it is parsed from the filenames. Within DuckDB however, these columns behave just like regular columns. For example, filters can be applied on Hive partition columns:
SELECT * FROM parquet_scan('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1) where year=2013;
Writing to S3 uses the multipart upload API. This allows DuckDB to robustly upload files at high speed. Writing to S3 works for both CSV and Parquet:
COPY table_name TO 's3://bucket/file.extension';
Partioned copy to S3 also works:
COPY table TO 's3://my-bucket/partitioned' (FORMAT PARQUET, PARTITION_BY (part_col_a, part_col_b));
An automatic check is performed for existing files/directories, which is currently quite conservative (and on S3 will add a bit of latency). To disable this check and force writing, an
ALLOW_OVERWRITE flag is added:
COPY table TO 's3://my-bucket/partitioned' (FORMAT PARQUET, PARTITION_BY (part_col_a, part_col_b), ALLOW_OVERWRITE TRUE);
The naming scheme of the written files looks like this:
Some additional configuration options exist for the S3 upload, though the default values should suffice for most use cases.
|s3_uploader_max_parts_per_file||used for part size calculation, see AWS docs|
|s3_uploader_max_filesize||used for part size calculation, see AWS docs|
|s3_uploader_thread_limit||maximum number of uploader threads|