Read cached filesystem for httpfs
Installing and Loading
INSTALL cache_httpfs FROM community;
LOAD cache_httpfs;
Example
SELECT cache_httpfs_get_cache_size();
About cache_httpfs
This extension adds a read cache filesystem to DuckDB, which acts as a wrapper of httpfs extention. It supports a few key features:
- Supports both file metadata, glob, file handle and data block cache
- Supports both on-disk cache and in-memory cache for data blocks, with block size and cache mode tunable
- Supports disk cache file eviction based on access timestamp, allows tunable disk space reservation
- Supports parallel IO request, with request size and parallelism tunable
- Supports profiling for IO latency and cache hit / miss ratio for a few operations (i.e open, read, glob), which provides an insight on workload characterization
- Exposes function to get cache size and cleanup cache
- Provides an option to disable / enable cache, which could act as a drop-in replacement for httpfs
Added Functions
function_name | function_type | description | comment | examples |
---|---|---|---|---|
cache_httpfs_cache_access_info_query | table | NULL | NULL | [] |
cache_httpfs_cache_status_query | table | NULL | NULL | [] |
cache_httpfs_clear_cache | scalar | NULL | NULL | [] |
cache_httpfs_clear_cache_for_file | scalar | NULL | NULL | [] |
cache_httpfs_clear_profile | scalar | NULL | NULL | [] |
cache_httpfs_get_ondisk_data_cache_size | scalar | NULL | NULL | [] |
cache_httpfs_get_profile | scalar | NULL | NULL | [] |
Added Settings
name | description | input_type | scope |
---|---|---|---|
ca_cert_file | Path to a custom certificate file for self-signed certificates. | VARCHAR | GLOBAL |
cache_httpfs_cache_block_size | Block size for cache, applies to both in-memory cache filesystem and on-disk cache filesystem. It's worth noting for on-disk filesystem, all existing cache files are invalidated after config update. | UBIGINT | GLOBAL |
cache_httpfs_cache_directory | The disk cache directory that stores cached data | VARCHAR | GLOBAL |
cache_httpfs_enable_file_handle_cache | Whether file handle cache is enable for cache filesystem. By default enabled. | BOOLEAN | GLOBAL |
cache_httpfs_enable_glob_cache | Whether glob cache is enable for cache filesystem. By default enabled. | BOOLEAN | GLOBAL |
cache_httpfs_enable_metadata_cache | Whether metadata cache is enable for cache filesystem. By default enabled. | BOOLEAN | GLOBAL |
cache_httpfs_file_handle_cache_entry_size | Max cache size for file handle cache. | UBIGINT | GLOBAL |
cache_httpfs_file_handle_cache_entry_timeout_millisec | Cache entry timeout in milliseconds for file handle cache. | UBIGINT | GLOBAL |
cache_httpfs_glob_cache_entry_size | Max cache size for glob cache. | UBIGINT | GLOBAL |
cache_httpfs_glob_cache_entry_timeout_millisec | Cache entry timeout in milliseconds for glob cache. | UBIGINT | GLOBAL |
cache_httpfs_ignore_sigpipe | Whether to ignore SIGPIPE for the extension. By default not ignored. Once ignored, it cannot be reverted. | BOOLEAN | GLOBAL |
cache_httpfs_in_mem_cache_block_timeout_millisec | Data block cache entry timeout in milliseconds. | UBIGINT | GLOBAL |
cache_httpfs_max_fanout_subrequest | Cached httpfs performs parallel request by splittng them into small request, with request size decided by config [cache_httpfs_cache_block_size]. The setting limits the maximum request to issue for a single filesystem read request. 0 means no limit, by default we set no limit. | BIGINT | GLOBAL |
cache_httpfs_max_in_mem_cache_block_count | Max in-memory cache block count for in-memory caches for all cache filesystems, so users are able to configure the maximum memory consumption. It's worth noting it should be set only once before all filesystem access, otherwise there's no affect. | UBIGINT | GLOBAL |
cache_httpfs_metadata_cache_entry_size | Max cache size for metadata LRU cache. | UBIGINT | GLOBAL |
cache_httpfs_metadata_cache_entry_timeout_millisec | Cache entry timeout in milliseconds for metadata LRU cache. | UBIGINT | GLOBAL |
cache_httpfs_min_disk_bytes_for_cache | Min number of bytes on disk for the cache filesystem to enable on-disk cache; if left bytes is less than the threshold, LRU based cache file eviction will be performed.By default, 5% disk space will be reserved for other usage. When min disk bytes specified with a positive value, the default value will be overriden. | UBIGINT | GLOBAL |
cache_httpfs_profile_type | Profiling type for cached filesystem. There're three options available: noop , temp , and duckdb . temp option stores the latest IO operation profiling result, which potentially suffers concurrent updates; duckdb stores the IO operation profiling results into duckdb table, which unblocks advanced analysis. |
VARCHAR | GLOBAL |
cache_httpfs_type | Type for cached filesystem. Currently there're two types available, one is in_mem , another is on_disk . By default we use on-disk cache. Set to noop to disable, which behaves exactly same as httpfs extension. |
VARCHAR | GLOBAL |
enable_server_cert_verification | Enable server side certificate verification. | BOOLEAN | GLOBAL |
force_download | Forces upfront download of file | BOOLEAN | GLOBAL |
hf_max_per_page | Debug option to limit number of items returned in list requests | UBIGINT | GLOBAL |
http_keep_alive | Keep alive connections. Setting this to false can help when running into connection failures | BOOLEAN | GLOBAL |
http_retries | HTTP retries on I/O error | UBIGINT | GLOBAL |
http_retry_backoff | Backoff factor for exponentially increasing retry wait time | FLOAT | GLOBAL |
http_retry_wait_ms | Time between retries | UBIGINT | GLOBAL |
http_timeout | HTTP timeout read/write/connection/retry (in seconds) | UBIGINT | GLOBAL |
s3_access_key_id | S3 Access Key ID | VARCHAR | GLOBAL |
s3_endpoint | S3 Endpoint | VARCHAR | GLOBAL |
s3_kms_key_id | S3 KMS Key ID | VARCHAR | GLOBAL |
s3_region | S3 Region | VARCHAR | GLOBAL |
s3_secret_access_key | S3 Access Key | VARCHAR | GLOBAL |
s3_session_token | S3 Session Token | VARCHAR | GLOBAL |
s3_uploader_max_filesize | S3 Uploader max filesize (between 50GB and 5TB) | VARCHAR | GLOBAL |
s3_uploader_max_parts_per_file | S3 Uploader max parts per file (between 1 and 10000) | UBIGINT | GLOBAL |
s3_uploader_thread_limit | S3 Uploader global thread limit | UBIGINT | GLOBAL |
s3_url_compatibility_mode | Disable Globs and Query Parameters on S3 URLs | BOOLEAN | GLOBAL |
s3_url_style | S3 URL style | VARCHAR | GLOBAL |
s3_use_ssl | S3 use SSL | BOOLEAN | GLOBAL |