HTTPFS
Version current
Version:

The httpfs extension is a loadable extension implementing a file system that allows reading remote/writing remote files. For pure HTTP(S), only file reading is supported. For object storage using the S3 API, the httpfs extension supports reading/writing/globbing files.

HTTP(S)

With the httpfs extension, it is possible to directly query files over HTTP(S). This currently works for CSV and Parquet files.

SELECT * FROM 'https://domain.tld/file.extension';

For CSV files, files will be downloaded entirely in most cases, due to the row-based nature of the format. For parquet files, DuckDB can use a combination of the Parquet metadata and HTTP range requests to only download the parts of the file that are actually required by the query. For example, the query:

SELECT column_a FROM 'https://domain.tld/file.parquet';

will only read the parquet metadata and the data for the column_a column. In some cases even, no actual data needs to be read at all as they only require reading the metadata:

SELECT COUNT(*) FROM 'https://domain.tld/file.parquet';

Scanning multiple files over HTTP(S) is also supported:

SELECT * FROM parquet_scan(['https://domain.tld/file1.parquet', 'https://domain.tld/file2.parquet']);

S3

The httpfs extension supports reading/writing/globbing files on object storage servers using the S3 API.

Requirements

The httpfs filesystem is tested with AWS S3, Minio, and Google cloud. Other services that implement the S3 API should also work, but not all features may be supported. Below is a list of which parts of the S3 API are required for each httpfs feature.

Feature Required S3 API features
Public file reads HTTP Range requests
Private file reads Secret key or session token authentication
File glob ListObjectV2
File writes Multipart upload

Configuration

To be able to read or write from S3, the correct region should be set:

SET s3_region='us-east-1';

Optionally, the endpoint can be configured in case a non-AWS object storage server is used:

SET s3_endpoint='<domain>.<tld>:<port>';

Switching between path-style and vhost-style urls (see AWS docs) is possible using:

SET s3_url_style='path';

However, note that this may also require updating the endpoint. For example for AWS S3 it is required to change the endpoint to s3.<region>.amazonaws.com.

After configuring the correct endpoint and region, public files can be read. To also read private files, authentication credentials can be added:

SET s3_access_key_id='<AWS access key id>';
SET s3_secret_access_key='<AWS secret access key>';

Alternatively, session tokens are also supported and can be used instead:

SET s3_session_token='<AWS session token>';

Reading

Reading files from S3 is now as simple as:

SELECT * FROM 's3://bucket/file.extension';

Multiple files are also possible, for example:

SELECT * FROM parquet_scan(['s3://bucket/file1.parquet', 's3://bucket/file2.parquet']);

Glob

File globbing is implemented using the ListObjectV2 API call and allows to use filesystem-like glob patterns to match multiple files, for example:

SELECT * from parquet_scan('s3://bucket/*.parquet')

This query matches all files in the root of the bucket with the parquet extension.

Several features for matching are supported, such as * to match any number of any character, ? for any single character or [0-9] for a single character in a range of characters:

SELECT COUNT(*) FROM parquet_scan('s3://bucket/folder*/100?/t[0-9].parquet')

A useful feature when using globs is the filename option which adds a column with the file that a row originated from:

SELECT * FROM parquet_scan('s3://bucket/*.parquet', FILENAME = 1);

could for example result in:

column_a column_b filename
1 examplevalue1 s3://bucket/file1.parquet
2 examplevalue1 s3://bucket/file2.parquet

Hive partitioning

DuckDB also offers support for the Hive partitioning scheme. In the Hive partitioning scheme, data is partitioned in separate files. The columns by which the data is partitioned, are not actually in the files, but are encoded in the file path. So for example let us consider three parquet files Hive paritioned by year:

s3://bucket/year=2012/file.parquet
s3://bucket/year=2013/file.parquet
s3://bucket/year=2014/file.parquet

If scanning these files with the HIVE_PARTITIONING option enabled:

SELECT * FROM parquet_scan('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1);

could result in:

column_a column_b year
1 examplevalue1 2012
2 examplevalue2 2013
3 examplevalue3 2014

Note that the year column does not actually exist in the parquet files, it is parsed from the filenames. Within DuckDB however, these columns behave just like regular columns. For example, filters can be applied on Hive partition columns:

SELECT * FROM parquet_scan('s3://bucket/*/file.parquet', HIVE_PARTITIONING = 1) where year=2013;

Writing

Writing to S3 uses the multipart upload API. This allows DuckDB to robustly upload files at high speed. Writing to S3 works for both CSV and Parquet:

COPY table_name TO 's3://bucket/file.extension';

Configuration

Some additional configuration options exist for the S3 upload, though the default values should suffice for most use cases.

setting description
s3_uploader_max_parts_per_file used for part size calculation, see AWS docs
s3_uploader_max_filesize used for part size calculation, see AWS docs
s3_uploader_thread_limit maximum number of uploader threads