Search Shortcut cmd + k | ctrl + k
cache_httpfs

Read cached filesystem for httpfs

Maintainer(s): dentiny, DouEnergy

Installing and Loading

INSTALL cache_httpfs FROM community;
LOAD cache_httpfs;

Example

SELECT cache_httpfs_get_cache_size();

About cache_httpfs

This extension adds a read cache filesystem to DuckDB, which acts as a wrapper of httpfs extention. It supports a few key features:

  • Supports both file metadata, glob, file handle and data block cache
  • Supports both on-disk cache and in-memory cache for data blocks, with block size and cache mode tunable
  • Supports disk cache file eviction based on access timestamp, allows tunable disk space reservation
  • Supports parallel IO request, with request size and parallelism tunable
  • Supports profiling for IO latency and cache hit / miss ratio for a few operations (i.e open, read, glob), which provides an insight on workload characterization
  • Exposes function to get cache size and cleanup cache
  • Provides an option to disable / enable cache, which could act as a drop-in replacement for httpfs

Added Functions

function_name function_type description comment examples
cache_httpfs_cache_access_info_query table NULL NULL []
cache_httpfs_cache_status_query table NULL NULL []
cache_httpfs_clear_cache scalar NULL NULL []
cache_httpfs_clear_cache_for_file scalar NULL NULL []
cache_httpfs_clear_profile scalar NULL NULL []
cache_httpfs_get_ondisk_data_cache_size scalar NULL NULL []
cache_httpfs_get_profile scalar NULL NULL []

Added Settings

name description input_type scope
ca_cert_file Path to a custom certificate file for self-signed certificates. VARCHAR GLOBAL
cache_httpfs_cache_block_size Block size for cache, applies to both in-memory cache filesystem and on-disk cache filesystem. It's worth noting for on-disk filesystem, all existing cache files are invalidated after config update. UBIGINT GLOBAL
cache_httpfs_cache_directory The disk cache directory that stores cached data VARCHAR GLOBAL
cache_httpfs_enable_file_handle_cache Whether file handle cache is enable for cache filesystem. By default enabled. BOOLEAN GLOBAL
cache_httpfs_enable_glob_cache Whether glob cache is enable for cache filesystem. By default enabled. BOOLEAN GLOBAL
cache_httpfs_enable_metadata_cache Whether metadata cache is enable for cache filesystem. By default enabled. BOOLEAN GLOBAL
cache_httpfs_file_handle_cache_entry_size Max cache size for file handle cache. UBIGINT GLOBAL
cache_httpfs_file_handle_cache_entry_timeout_millisec Cache entry timeout in milliseconds for file handle cache. UBIGINT GLOBAL
cache_httpfs_glob_cache_entry_size Max cache size for glob cache. UBIGINT GLOBAL
cache_httpfs_glob_cache_entry_timeout_millisec Cache entry timeout in milliseconds for glob cache. UBIGINT GLOBAL
cache_httpfs_ignore_sigpipe Whether to ignore SIGPIPE for the extension. By default not ignored. Once ignored, it cannot be reverted. BOOLEAN GLOBAL
cache_httpfs_in_mem_cache_block_timeout_millisec Data block cache entry timeout in milliseconds. UBIGINT GLOBAL
cache_httpfs_max_fanout_subrequest Cached httpfs performs parallel request by splittng them into small request, with request size decided by config [cache_httpfs_cache_block_size]. The setting limits the maximum request to issue for a single filesystem read request. 0 means no limit, by default we set no limit. BIGINT GLOBAL
cache_httpfs_max_in_mem_cache_block_count Max in-memory cache block count for in-memory caches for all cache filesystems, so users are able to configure the maximum memory consumption. It's worth noting it should be set only once before all filesystem access, otherwise there's no affect. UBIGINT GLOBAL
cache_httpfs_metadata_cache_entry_size Max cache size for metadata LRU cache. UBIGINT GLOBAL
cache_httpfs_metadata_cache_entry_timeout_millisec Cache entry timeout in milliseconds for metadata LRU cache. UBIGINT GLOBAL
cache_httpfs_min_disk_bytes_for_cache Min number of bytes on disk for the cache filesystem to enable on-disk cache; if left bytes is less than the threshold, LRU based cache file eviction will be performed.By default, 5% disk space will be reserved for other usage. When min disk bytes specified with a positive value, the default value will be overriden. UBIGINT GLOBAL
cache_httpfs_profile_type Profiling type for cached filesystem. There're three options available: noop, temp, and duckdb. temp option stores the latest IO operation profiling result, which potentially suffers concurrent updates; duckdb stores the IO operation profiling results into duckdb table, which unblocks advanced analysis. VARCHAR GLOBAL
cache_httpfs_type Type for cached filesystem. Currently there're two types available, one is in_mem, another is on_disk. By default we use on-disk cache. Set to noop to disable, which behaves exactly same as httpfs extension. VARCHAR GLOBAL
enable_server_cert_verification Enable server side certificate verification. BOOLEAN GLOBAL
force_download Forces upfront download of file BOOLEAN GLOBAL
hf_max_per_page Debug option to limit number of items returned in list requests UBIGINT GLOBAL
http_keep_alive Keep alive connections. Setting this to false can help when running into connection failures BOOLEAN GLOBAL
http_retries HTTP retries on I/O error UBIGINT GLOBAL
http_retry_backoff Backoff factor for exponentially increasing retry wait time FLOAT GLOBAL
http_retry_wait_ms Time between retries UBIGINT GLOBAL
http_timeout HTTP timeout read/write/connection/retry (in seconds) UBIGINT GLOBAL
s3_access_key_id S3 Access Key ID VARCHAR GLOBAL
s3_endpoint S3 Endpoint VARCHAR GLOBAL
s3_kms_key_id S3 KMS Key ID VARCHAR GLOBAL
s3_region S3 Region VARCHAR GLOBAL
s3_secret_access_key S3 Access Key VARCHAR GLOBAL
s3_session_token S3 Session Token VARCHAR GLOBAL
s3_uploader_max_filesize S3 Uploader max filesize (between 50GB and 5TB) VARCHAR GLOBAL
s3_uploader_max_parts_per_file S3 Uploader max parts per file (between 1 and 10000) UBIGINT GLOBAL
s3_uploader_thread_limit S3 Uploader global thread limit UBIGINT GLOBAL
s3_url_compatibility_mode Disable Globs and Query Parameters on S3 URLs BOOLEAN GLOBAL
s3_url_style S3 URL style VARCHAR GLOBAL
s3_use_ssl S3 use SSL BOOLEAN GLOBAL