Read cached filesystem for httpfs
Installing and Loading
INSTALL cache_httpfs FROM community;
LOAD cache_httpfs;
Example
SELECT cache_httpfs_get_cache_size();
About cache_httpfs
This extension adds a read cache filesystem to DuckDB, which acts as a wrapper of httpfs extention. It supports a few key features:
- Supports both metadata cache and data block cache
- Supports both on-disk cache and in-memory cache, with block size and cache mode tunable
- Supports disk cache file eviction based on access timestamp
- Supports parallel IO request, with request size tunable
- Supports profiling for IO latency and cache hit / miss ratio, which provides an insight on workload characterization
- Exposes function to get cache size and cleanup cache
- Provides an option to disable / enable cache, which could act as a drop-in replacement for httpfs
Added Functions
function_name | function_type | description | comment | examples |
---|---|---|---|---|
cache_httpfs_clear_cache | scalar | NULL | NULL | [] |
cache_httpfs_clear_cache_for_file | scalar | NULL | NULL | [] |
cache_httpfs_clear_profile | scalar | NULL | NULL | [] |
cache_httpfs_get_cache_size | scalar | NULL | NULL | [] |
cache_httpfs_get_profile | scalar | NULL | NULL | [] |
Added Settings
name | description | input_type | scope |
---|---|---|---|
ca_cert_file | Path to a custom certificate file for self-signed certificates. | VARCHAR | GLOBAL |
cache_httpfs_cache_block_size | Block size for cache, applies to both in-memory cache filesystem and on-disk cache filesystem. It's worth noting for on-disk filesystem, all existing cache files are invalidated after config update. | UBIGINT | GLOBAL |
cache_httpfs_cache_directory | The disk cache directory that stores cached data | VARCHAR | GLOBAL |
cache_httpfs_enable_metadata_cache | Whether metadata cache is enable for cache filesystem. By default enabled. | BOOLEAN | GLOBAL |
cache_httpfs_ignore_sigpipe | Whether to ignore SIGPIPE for the extension. By default not ignored. Once ignored, it cannot be reverted. | BOOLEAN | GLOBAL |
cache_httpfs_max_fanout_subrequest | Cached httpfs performs parallel request by splittng them into small request, with request size decided by config [cache_httpfs_cache_block_size]. The setting limits the maximum request to issue for a single filesystem read request. 0 means no limit, by default we set no limit. | BIGINT | GLOBAL |
cache_httpfs_max_in_mem_cache_block_count | Max in-memory cache block count for in-memory cache filesystem. It's worth noting it should be set only once before all filesystem access, otherwise there's no affect. | UBIGINT | GLOBAL |
cache_httpfs_profile_type | Profiling type for cached filesystem. There're three options available: noop , temp , and duckdb . temp option stores the latest IO operation profiling result, which potentially suffers concurrent updates; duckdb stores the IO operation profiling results into duckdb table, which unblocks advanced analysis. |
VARCHAR | GLOBAL |
cache_httpfs_type | Type for cached filesystem. Currently there're two types available, one is in_mem , another is on_disk . By default we use on-disk cache. Set to noop to disable, which behaves exactly same as httpfs extension. |
VARCHAR | GLOBAL |
enable_server_cert_verification | Enable server side certificate verification. | BOOLEAN | GLOBAL |
force_download | Forces upfront download of file | BOOLEAN | GLOBAL |
hf_max_per_page | Debug option to limit number of items returned in list requests | UBIGINT | GLOBAL |
http_keep_alive | Keep alive connections. Setting this to false can help when running into connection failures | BOOLEAN | GLOBAL |
http_retries | HTTP retries on I/O error | UBIGINT | GLOBAL |
http_retry_backoff | Backoff factor for exponentially increasing retry wait time | FLOAT | GLOBAL |
http_retry_wait_ms | Time between retries | UBIGINT | GLOBAL |
http_timeout | HTTP timeout read/write/connection/retry (in seconds) | UBIGINT | GLOBAL |
s3_access_key_id | S3 Access Key ID | VARCHAR | GLOBAL |
s3_endpoint | S3 Endpoint | VARCHAR | GLOBAL |
s3_region | S3 Region | VARCHAR | GLOBAL |
s3_secret_access_key | S3 Access Key | VARCHAR | GLOBAL |
s3_session_token | S3 Session Token | VARCHAR | GLOBAL |
s3_uploader_max_filesize | S3 Uploader max filesize (between 50GB and 5TB) | VARCHAR | GLOBAL |
s3_uploader_max_parts_per_file | S3 Uploader max parts per file (between 1 and 10000) | UBIGINT | GLOBAL |
s3_uploader_thread_limit | S3 Uploader global thread limit | UBIGINT | GLOBAL |
s3_url_compatibility_mode | Disable Globs and Query Parameters on S3 URLs | BOOLEAN | GLOBAL |
s3_url_style | S3 URL style | VARCHAR | GLOBAL |
s3_use_ssl | S3 use SSL | BOOLEAN | GLOBAL |