The azure extension will be transparently autoloaded on first use from the official extension repository.
If you would like to install and load it manually, run:
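```sql
INSTALL azure;
LOAD azure;
```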
| Name | Description | Type | Default |
|------|-------------|------|---------|
| `azure_read_transfer_concurrency` | Maximum number of threads the Azure client can use for a single parallel read. If `azure_read_transfer_chunk_size` is less than `azure_read_buffer_size`, setting this to a value greater than 1 allows the Azure client to issue concurrent requests to fill the buffer. | `BIGINT` | `5` |
| `azure_read_transfer_chunk_size` | Maximum size in bytes that the Azure client reads in a single request. It is recommended that this is a factor of `azure_read_buffer_size`. | `BIGINT` | `1024*1024` |
| `azure_read_buffer_size` | Size of the read buffer. It is recommended that this is evenly divisible by `azure_read_transfer_chunk_size`. | `UBIGINT` | `1024*1024` |
| `azure_transport_option_type` | Underlying adapter to use in the Azure SDK. Valid values are `default` or `curl`. | `VARCHAR` | `default` |
| `azure_context_caching` | Enable/disable caching of the underlying Azure SDK HTTP connection in the DuckDB connection context when performing queries. If you suspect that it causes side effects, you can try disabling it by setting it to `false` (not recommended). | `BOOLEAN` | `true` |
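These options are regular DuckDB settings and can be changed per session with `SET`. A minimal sketch (the values shown are the defaults, not tuning recommendations):

```sql
SET azure_read_transfer_chunk_size = 1048576;  -- 1024*1024
SET azure_read_buffer_size = 1048576;          -- 1024*1024
```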
Setting `azure_transport_option_type` explicitly to `curl` will have the following effects:
* On Linux, this may solve certificate issues (`Error: Invalid Error: Fail to get a new connection for: https://storage_account_name.blob.core.windows.net/. Problem with the SSL CA cert (path? access rights?)`), because when `curl` is specified, the extension tries to find the CA bundle certificate in various paths (which curl does not do by default and which might be wrong due to static linking).
* On Windows, this replaces the default adapter (WinHTTP), allowing you to use all curl capabilities (for example, using a SOCKS proxy).
* On all operating systems, it will honor the following environment variables:
  * `CURL_CA_INFO`: Path to a PEM-encoded file containing the certificate authorities sent to libcurl. Note that this option is known to only work on Linux and might throw an error if set on other platforms.
  * `CURL_CA_PATH`: Path to a directory which holds PEM-encoded files containing the certificate authorities sent to libcurl.
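For example, the curl adapter can be selected per session:

```sql
SET azure_transport_option_type = 'curl';
```

The `CURL_CA_INFO`/`CURL_CA_PATH` variables, if needed, must be exported in the environment before the DuckDB process starts.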
Multiple Secret Providers are available for the Azure extension:
If you need to define different secrets for different storage accounts, use the `SCOPE` configuration. Note that the `SCOPE` requires a trailing slash (`SCOPE 'azure://some_container/'`).
If you use a fully qualified path, then the `ACCOUNT_NAME` attribute is optional.
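A scoped secret might look like the following sketch (the secret name and the connection string value are placeholders):

```sql
CREATE SECRET azure_secret_1 (
    TYPE azure,
    CONNECTION_STRING 'DefaultEndpointsProtocol=https;AccountName=placeholder;...',
    SCOPE 'azure://some_container/'
);
```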
The credential_chain provider allows connecting using credentials automatically fetched by the Azure SDK via the Azure credential chain.
By default, the `DefaultAzureCredential` chain is used, which tries credentials in the order specified by the Azure documentation.
For example:
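A secret using the default chain might be created as follows (the secret and account names are placeholders):

```sql
CREATE SECRET azure_default_chain (
    TYPE azure,
    PROVIDER credential_chain,
    ACCOUNT_NAME 'some_account'
);
```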
DuckDB also allows specifying a specific chain using the CHAIN keyword. This takes a semicolon-separated list (a;b;c) of providers that will be tried in order. For example:
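A sketch using an explicit chain (the secret and account names are placeholders; `cli` and `env` are provider values from the `azure_credential_chain` description below):

```sql
CREATE SECRET azure_cli_env (
    TYPE azure,
    PROVIDER credential_chain,
    CHAIN 'cli;env',
    ACCOUNT_NAME 'some_account'
);
```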
| Name | Description | Type | Default |
|------|-------------|------|---------|
| `azure_storage_connection_string` | Azure connection string, used for authenticating and configuring Azure requests. | `STRING` | - |
| `azure_account_name` | Azure account name. When set, the extension will attempt to automatically detect credentials (not used if you pass the connection string). | `STRING` | - |
| `azure_endpoint` | Override the Azure endpoint when the Azure credential providers are used. | `STRING` | `blob.core.windows.net` |
| `azure_credential_chain` | Ordered list of Azure credential providers, in string format separated by `;`. For example: `'cli;managed_identity;env'`. See the list of possible values in the credential_chain provider section. Not used if you pass the connection string. | `STRING` | - |
| `azure_http_proxy` | Proxy to use when logging in and performing requests to Azure. | | |
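For example, the proxy can be set per session (the address is illustrative):

```sql
SET azure_http_proxy = 'http://localhost:3128';
```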
The Azure extension relies on the Azure SDK to connect to Azure Blob storage and supports printing the SDK logs to the console.
To control the log level, set the AZURE_LOG_LEVEL environment variable.
For instance, verbose logs can be enabled as follows in Python:
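A minimal sketch (the DuckDB lines are commented out so the snippet stands alone; the key point is that the variable must be set before the extension loads):

```python
import os

# AZURE_LOG_LEVEL is read by the Azure SDK when the extension initializes,
# so it must be set before the azure extension is loaded.
os.environ["AZURE_LOG_LEVEL"] = "verbose"

# import duckdb                      # queries issued after this point
# duckdb.sql("LOAD azure;")          # will print verbose Azure SDK logs
```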
Even though ADLS implements functionality similar to the Blob storage, there are some important performance benefits to using the ADLS endpoints for globbing, especially with (complex) glob patterns.
To demonstrate, let's look at an example of how a glob is performed internally using the Blob and ADLS endpoints, respectively.
1. Filter and list subdirectories: `root/l_receipmonth=1997-10`, `root/l_receipmonth=1997-11`, `root/l_receipmonth=1997-12`
2. List the subdirectories of each of these:
   * `root/l_receipmonth=1997-10/l_shipmode=SHIP`
   * `root/l_receipmonth=1997-10/l_shipmode=AIR`
   * `root/l_receipmonth=1997-10/l_shipmode=TRUCK`
   * `root/l_receipmonth=1997-11/l_shipmode=SHIP`
   * `root/l_receipmonth=1997-11/l_shipmode=AIR`
   * `root/l_receipmonth=1997-11/l_shipmode=TRUCK`
   * `root/l_receipmonth=1997-12/l_shipmode=SHIP`
   * `root/l_receipmonth=1997-12/l_shipmode=AIR`
   * `root/l_receipmonth=1997-12/l_shipmode=TRUCK`
3. Filter and list subdirectories: `root/l_receipmonth=1997-10/l_shipmode=SHIP`, `root/l_receipmonth=1997-11/l_shipmode=SHIP`, `root/l_receipmonth=1997-12/l_shipmode=SHIP`
As you can see, because the Blob endpoint does not support the notion of directories, the filter can only be applied after the listing, whereas the ADLS endpoint lists files recursively. Especially with higher partition/directory counts, the performance difference can be very significant.
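To take advantage of the ADLS endpoint, query through the `abfss://` scheme rather than `azure://` (a sketch; the filesystem name and paths mirror the example above and are illustrative):

```sql
SELECT count(*)
FROM 'abfss://my_filesystem/root/l_receipmonth=*/l_shipmode=SHIP/*.parquet';
```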