S3 API Support

Search Shortcut cmd + k | ctrl + k

Search cmd+k ctrl+k

Installation
Documentation

Overview
Connect

Data Import

Client APIs

Configuration

Extensions

Guides

Development

Internals

Documentation / Extensions / httpfs (HTTP and S3)

0.10 (stable)

0.10 (stable)

S3 API Support

The httpfs extension supports reading/writing/globbing files on object storage servers using the S3 API. S3 offers a standard API to read and write to remote files (while regular http servers, predating S3, do not offer a common write API). DuckDB conforms to the S3 API, that is now common among industry storage providers.

Platforms

The httpfs filesystem is tested with AWS S3, Minio, Google Cloud, and lakeFS. Other services that implement the S3 API (such as Cloudflare R2) should also work, but not all features may be supported.

The following table shows which parts of the S3 API are required for each httpfs feature.

Feature	Required S3 API features
Public file reads	HTTP Range requests
Private file reads	Secret key or session token authentication
File glob	ListObjectV2
File writes	Multipart upload

Configuration and Authentication

The preferred way to configure and authenticate to S3 endpoints is to use secrets. Multiple secret providers are available.

Deprecated Prior to version 0.10.0, DuckDB did not have a Secrets manager. Hence, the configuration of and authentication to S3 endpoints was handled via variables. See the legacy authentication scheme for the S3 API.

`CONFIG` Provider

The default provider, CONFIG (i.e., user-configured), allows access to the S3 bucket by manually providing a key. For example:

CREATE SECRET secret1 (
    TYPE S3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    REGION 'us-east-1'
);

Tip If you get an IO Error (Connection error for HTTP HEAD), configure the endpoint explicitly via ENDPOINT 's3.⟨your-region⟩.amazonaws.com'.

Now, to query using the above secret, simply query any s3:// prefixed file:

SELECT *
FROM 's3://my-bucket/file.parquet';

`CREDENTIAL_CHAIN` Provider

The CREDENTIAL_CHAIN provider allows automatically fetching credentials using mechanisms provided by the AWS SDK. For example, to use the AWS SDK default provider:

CREATE SECRET secret2 (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);

Again, to query a file using the above secret, simply query any s3:// prefixed file.

DuckDB also allows specifying a specific chain using the CHAIN keyword. This takes a ; separated list of providers that will be tried in order. For example:

CREATE SECRET secret3 (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'env;config'
);

The possible values for CHAIN are the following:

The CREDENTIAL_CHAIN provider also allows overriding the automatically fetched config. For example, to automatically load credentials, and then override the region, run:

CREATE SECRET secret4 (
    TYPE AZURE,
    PROVIDER CREDENTIAL_CHAIN,
    CHAIN 'config',
    REGION 'eu-west-1'
);

Overview of S3 Secret Parameters

Below is a complete list of the supported parameters that can be used for both the CONFIG and CREDENTIAL_CHAIN providers:

Name	Description	Secret	Type	Default
`KEY_ID`	The ID of the key to use	`S3`, `GCS`, `R2`	`STRING`	-
`SECRET`	The secret of the key to use	`S3`, `GCS`, `R2`	`STRING`	-
`REGION`	The region for which to authenticate (should match the region of the bucket to query)	`S3`, `GCS`, `R2`	`STRING`	`us-east-1`
`SESSION_TOKEN`	Optionally, a session token can be passed to use temporary credentials	`S3`, `GCS`, `R2`	`STRING`	-
`ENDPOINT`	Specify a custom S3 endpoint	`S3`, `GCS`, `R2`	`STRING`	`s3.amazonaws.com` for `S3`,
`URL_STYLE`	Either `vhost` or `path`	`S3`, `GCS`, `R2`	`STRING`	`vhost` for `S3`, `path` for `R2` and `GCS`
`USE_SSL`	Whether to use HTTPS or HTTP	`S3`, `GCS`, `R2`	`BOOLEAN`	`true`
`URL_COMPATIBILITY_MODE`	Can help when urls contain problematic characters.	`S3`, `GCS`, `R2`	`BOOLEAN`	`true`
`ACCOUNT_ID`	The R2 account ID to use for generating the endpoint url	`R2`	`STRING`	-

Platform-Specific Secret Types

R2 Secrets

While Cloudflare R2 uses the regular S3 API, DuckDB has a special Secret type, R2, to make configuring it a bit simpler:

CREATE SECRET secret5 (
    TYPE R2,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY',
    ACCOUNT_ID 'my_account_id'
);

Note the addition of the ACCOUNT_ID which is used to generate to correct endpoint url for you. Also note that for R2 Secrets can also use both the CONFIG and CREDENTIAL_CHAIN providers. Finally, R2 secrets are only available when using urls starting with r2://, for example:

SELECT *
FROM read_parquet('r2://some/file/that/uses/r2/secret/file.parquet');

GCS Secrets

While Google Cloud Storage is accessed by DuckDB using the S3 API, DuckDB has a special Secret type, GCS, to make configuring it a bit simpler:

CREATE SECRET secret6 (
    TYPE GCS,
    KEY_ID 'my_key',
    SECRET 'my_secret'
);

Note that the above secret, will automatically have the correct Google Cloud Storage endpoint configured. Also note that for GCS Secrets can also use both the CONFIG and CREDENTIAL_CHAIN providers. Finally, GCS secrets are only available when using urls starting with gcs:// or gs://, for example:

SELECT *
FROM read_parquet('gcs://some/file/that/uses/gcs/secret/file.parquet');

Reading

Reading files from S3 is now as simple as:

SELECT *
FROM 's3://bucket/file.extension';

Multiple files are also possible, for example:

SELECT *
FROM read_parquet([
    's3://bucket/file1.parquet',
    's3://bucket/file2.parquet'
]);

Glob

File globbing is implemented using the ListObjectV2 API call and allows to use filesystem-like glob patterns to match multiple files, for example:

SELECT *
FROM read_parquet('s3://bucket/*.parquet');

This query matches all files in the root of the bucket with the Parquet extension.

Several features for matching are supported, such as * to match any number of any character, ? for any single character or [0-9] for a single character in a range of characters:

SELECT count(*) FROM read_parquet('s3://bucket/folder*/100?/t[0-9].parquet');

A useful feature when using globs is the filename option, which adds a column named filename that encodes the file that a particular row originated from:

SELECT *
FROM read_parquet('s3://bucket/*.parquet', filename = true);

could for example result in:

column_a	column_b	filename
1	examplevalue1	s3://bucket/file1.parquet
2	examplevalue1	s3://bucket/file2.parquet

Hive Partitioning

DuckDB also offers support for the Hive partitioning scheme, which is available when using HTTP(S) and S3 endpoints.

Writing

Writing to S3 uses the multipart upload API. This allows DuckDB to robustly upload files at high speed. Writing to S3 works for both CSV and Parquet:

COPY table_name TO 's3://bucket/file.extension';

Partitioned copy to S3 also works:

COPY table TO 's3://my-bucket/partitioned' (
    FORMAT PARQUET,
    PARTITION_BY (part_col_a, part_col_b)
);

An automatic check is performed for existing files/directories, which is currently quite conservative (and on S3 will add a bit of latency). To disable this check and force writing, an OVERWRITE_OR_IGNORE flag is added:

COPY table TO 's3://my-bucket/partitioned' (
    FORMAT PARQUET,
    PARTITION_BY (part_col_a, part_col_b),
    OVERWRITE_OR_IGNORE true
);

The naming scheme of the written files looks like this:

s3://my-bucket/partitioned/part_col_a=⟨val⟩/part_col_b=⟨val⟩/data_⟨thread_number⟩.parquet

Configuration

Some additional configuration options exist for the S3 upload, though the default values should suffice for most use cases.

Name	Description
`s3_uploader_max_parts_per_file`	used for part size calculation, see AWS docs
`s3_uploader_max_filesize`	used for part size calculation, see AWS docs
`s3_uploader_thread_limit`	maximum number of uploader threads

About this page

Last modified: 2024-04-25

About this page

In this article