Search Shortcut cmd + k | ctrl + k
v0.10.0 (stable)
S3 API Support

The httpfs extension supports reading/writing/globbing files on object storage servers using the S3 API. S3 offers a standard API to read and write to remote files (while regular http servers, predating S3, do not offer a common write API). DuckDB conforms to the S3 API, that is now common among industry storage providers.

Platforms

The httpfs filesystem is tested with AWS S3, Minio, Google Cloud, and lakeFS. Other services that implement the S3 API (such as Cloudflare R2) should also work, but not all features may be supported.

The following table shows which parts of the S3 API are required for each httpfs feature.

Feature Required S3 API features
Public file reads HTTP Range requests
Private file reads Secret key or session token authentication
File glob ListObjectV2
File writes Multipart upload

Configuration and Authentication

The preferred way to configure and authenticate to S3 endpoints is to use secrets. Multiple secret providers are available.

Deprecated Prior to version 0.10.0, DuckDB did not have a Secrets manager. Hence, the configuration of and authentication to S3 endpoints was handled via variables. See the legacy authentication scheme for the S3 API.

CONFIG Provider

The default provider, CONFIG (i.e., user-configured), allows access to the S3 bucket by manually providing a key. For example:

CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    REGION 'us-east-1'
);

Now, to query using the above secret, simply query any s3:// prefixed file:

SELECT * FROM 's3://my-bucket/file.parquet'

CREDENTIAL_CHAIN Provider

The CREDENTIAL_CHAIN provider allows automatically fetching credentials using mechanisms provided by the AWS SDK. For example, to use the AWS SDK default provider:

CREATE SECRET secret2 (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);

Again, to query a file using the above secret, simply query any s3:// prefixed file.

DuckDB also allows specifying a specific chain using the CHAIN keyword. This takes a ; separated list of providers that will be tried in order. For example:

CREATE SECRET secret3 (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'env;config'
);

The possible values for CHAIN are the following:

The CREDENTIAL_CHAIN provider also allows overriding the automatically fetched config. For example, to automatically load credentials, and then override the region, run:

CREATE SECRET secret4 (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'config',
    REGION 'eu-west-1'
);

Overview of S3 Secret Parameters

Below is a complete list of the supported parameters that can be used for both the CONFIG and CREDENTIAL_CHAIN providers:

Name Description Secret Type Default
KEY_ID The ID of the key to use S3, GCS, R2 STRING -
SECRET The secret of the key to use S3, GCS, R2 STRING -
REGION The region for which to authenticate (should match the region of the bucket to query) S3, GCS, R2 STRING us-east-1
SESSION_TOKEN Optionally, a session token can be passed to use temporary credentials S3, GCS, R2 STRING -
ENDPOINT Specify a custom S3 endpoint S3, GCS, R2 STRING s3.amazonaws.com for S3,
URL_STYLE Either vhost or path S3, GCS, R2 STRING vhost for S3, path for R2 and GCS
USE_SSL Whether to use HTTPS or HTTP S3, GCS, R2 BOOLEAN true
URL_COMPATIBILITY_MODE Can help when urls contain problematic characters. S3, GCS, R2 BOOLEAN true
ACCOUNT_ID The R2 account ID to use for generating the endpoint url R2 STRING -

Platform-Specific Secret Types

R2 Secrets

While Cloudflare R2 uses the regular S3 API, DuckDB has a special Secret type, R2, to make configuring it a bit simpler:

CREATE SECRET secret5 (
    TYPE R2,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    ACCOUNT_ID 'my_account_id'
);

Note the addition of the ACCOUNT_ID which is used to generate to correct endpoint url for you. Also note that for R2 Secrets can also use both the CONFIG and CREDENTIAL_CHAIN providers. Finally, R2 secrets are only available when using urls starting with r2://, for example:

SELECT * FROM read_parquet('r2://some/file/that/uses/r2/secret/file.parquet')

GCS Secrets

While Google Cloud Storage is accessed by DuckDB using the S3 API, DuckDB has a special Secret type, GCS, to make configuring it a bit simpler:

CREATE SECRET secret6 (
    TYPE GCS,
    KEY_ID 'my_key',
    SECRET 'my_secret'
)

Note that the above secret, will automatically have the correct Google Cloud Storage endpoint configured. Also note that for GCS Secrets can also use both the CONFIG and CREDENTIAL_CHAIN providers. Finally, GCS secrets are only available when using urls starting with gcs:// or gs://, for example:

SELECT *
FROM read_parquet('gcs://some/file/that/uses/gcs/secret/file.parquet')

Reading

Reading files from S3 is now as simple as:

SELECT * FROM 's3://bucket/file.extension';

Multiple files are also possible, for example:

SELECT * FROM read_parquet([
    's3://bucket/file1.parquet',
    's3://bucket/file2.parquet'
]);

Glob

File globbing is implemented using the ListObjectV2 API call and allows to use filesystem-like glob patterns to match multiple files, for example:

SELECT * FROM read_parquet('s3://bucket/*.parquet');

This query matches all files in the root of the bucket with the Parquet extension.

Several features for matching are supported, such as * to match any number of any character, ? for any single character or [0-9] for a single character in a range of characters:

SELECT count(*) FROM read_parquet('s3://bucket/folder*/100?/t[0-9].parquet');

A useful feature when using globs is the filename option, which adds a column named filename that encodes the file that a particular row originated from:

SELECT * FROM read_parquet('s3://bucket/*.parquet', filename = true);

could for example result in:

column_a column_b filename
1 examplevalue1 s3://bucket/file1.parquet
2 examplevalue1 s3://bucket/file2.parquet

Hive Partitioning

DuckDB also offers support for the Hive partitioning scheme, which is available when using HTTP(S) and S3 endpoints.

Writing

Writing to S3 uses the multipart upload API. This allows DuckDB to robustly upload files at high speed. Writing to S3 works for both CSV and Parquet:

COPY table_name TO 's3://bucket/file.extension';

Partitioned copy to S3 also works:

COPY table TO 's3://my-bucket/partitioned' (
    FORMAT PARQUET,
    PARTITION_BY (part_col_a, part_col_b)
);

An automatic check is performed for existing files/directories, which is currently quite conservative (and on S3 will add a bit of latency). To disable this check and force writing, an OVERWRITE_OR_IGNORE flag is added:

COPY table TO 's3://my-bucket/partitioned' (
    FORMAT PARQUET,
    PARTITION_BY (part_col_a, part_col_b),
    OVERWRITE_OR_IGNORE true
);

The naming scheme of the written files looks like this:

s3://my-bucket/partitioned/part_col_a=<val>/part_col_b=<val>/data_<thread_number>.parquet

Configuration

Some additional configuration options exist for the S3 upload, though the default values should suffice for most use cases.

Name Description
s3_uploader_max_parts_per_file used for part size calculation, see AWS docs
s3_uploader_max_filesize used for part size calculation, see AWS docs
s3_uploader_thread_limit maximum number of uploader threads
About this page

Last modified: 2024-02-26