S3 or GCS Parquet Import
Version current

How to load a Parquet file directly from S3 or GCS

To load a Parquet file from S3, the HTTPFS extension is required. This can be installed use the INSTALL SQL command. This only needs to be run once.

INSTALL httpfs;

To load the HTTPFS extension for usage, use the LOAD SQL command:

LOAD httpfs;

After loading the HTTPFS extension, set up the credentials and S3 region to read data. Firstly, the region where the data resides needs to be configured:

SET s3_region='us-east-1';

With the only the region set, public S3 data can be queried. To query private S3 data, you need to either use an access key and secret:

SET s3_access_key_id='<AWS access key id>';
SET s3_secret_access_key='<AWS secret access key>';

or a session token:

SET s3_session_token='<AWS session token>';

After the HTTPFS extension is set up and the S3 configuration is set correctly, Parquet files can be read from S3 using the following command:

SELECT * FROM read_parquet('s3://<bucket>/<file>');

For Google Cloud Storage (GCS), the Interoperability API enables you to have access to it like an S3 connection. You need to create HMAC keys and declare them:

SET s3_endpoint='storage.googleapis.com';
SET s3_access_key_id='key_id';
SET s3_secret_access_key='access_key';

Please note you will need to use the s3:// URL to read your data.

SELECT * FROM read_parquet('s3://<gcs_bucket>/<file>');
Search Shortcut cmd + k | ctrl + k