cache_httpfs

Read cached filesystem for httpfs

Maintainer(s): dentiny, DouEnergy

Installing and Loading

INSTALL cache_httpfs FROM community;
LOAD cache_httpfs;

Example

SELECT cache_httpfs_get_ondisk_data_cache_size();

About cache_httpfs

This extension adds a read cache filesystem to DuckDB, which acts as a wrapper around the httpfs extension. Its key features are:

  • Caches file metadata, glob results, file handles, and data blocks
  • Supports both on-disk and in-memory caches for data blocks, with tunable cache mode, block size, and cache directories
  • Evicts on-disk cache files based on access timestamp or LRU, with tunable disk space reservation
  • Issues IO requests in parallel, with tunable request size and parallelism
  • Profiles IO latency and cache hit / miss ratio for several operations (e.g. open, read, glob), which gives insight into workload characteristics
  • Exposes functions to get the cache size and to clear the cache
  • Provides an option to enable or disable the cache, so the extension can act as a drop-in replacement for httpfs
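
As a rough illustration of the tunables listed above, the following sketch selects a cache type, a cache directory, and a block size using setting names from the Added Settings table. The directory path and the block size value are illustrative assumptions, not recommended defaults.

```sql
-- Use the on-disk cache (documented as the default).
-- The other documented values are 'in_mem' and 'noop'.
SET cache_httpfs_type = 'on_disk';

-- Hypothetical cache directory; pick any writable path.
SET cache_httpfs_cache_directory = '/tmp/cache_httpfs_demo';

-- Illustrative block size of 1 MiB (value chosen for this example only).
-- Changing it invalidates existing on-disk cache files.
SET cache_httpfs_cache_block_size = 1048576;

-- Disable caching entirely, falling back to plain httpfs behavior.
SET cache_httpfs_type = 'noop';
```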

Added Functions

| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| cache_httpfs_cache_access_info_query | table | NULL | NULL | |
| cache_httpfs_cache_status_query | table | NULL | NULL | |
| cache_httpfs_clear_cache | scalar | NULL | NULL | |
| cache_httpfs_clear_cache_for_file | scalar | NULL | NULL | |
| cache_httpfs_clear_profile | scalar | NULL | NULL | |
| cache_httpfs_get_cache_filesystems | table | NULL | NULL | |
| cache_httpfs_get_ondisk_data_cache_size | scalar | NULL | NULL | |
| cache_httpfs_get_profile | scalar | NULL | NULL | |
| cache_httpfs_wrap_cache_filesystem | scalar | NULL | NULL | |
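
The table above gives only function names and types, so the sketch below assumes that each of these functions can be called without arguments; this page does not confirm their signatures. Scalar functions appear in a SELECT list and table functions in a FROM clause.

```sql
-- Scalar functions (assumed here to take no arguments).
SELECT cache_httpfs_get_ondisk_data_cache_size();  -- size of the on-disk data cache
SELECT cache_httpfs_get_profile();                 -- current profiling result
SELECT cache_httpfs_clear_profile();               -- reset collected profiling data
SELECT cache_httpfs_clear_cache();                 -- clear the cache

-- Table functions (assumed here to take no arguments).
SELECT * FROM cache_httpfs_cache_status_query();
SELECT * FROM cache_httpfs_cache_access_info_query();
SELECT * FROM cache_httpfs_get_cache_filesystems();
```

cache_httpfs_clear_cache_for_file and cache_httpfs_wrap_cache_filesystem are left out because their names suggest they need an argument, and the argument is not documented here.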

Added Settings

| name | description | input_type | scope | aliases |
|---|---|---|---|---|
| auto_fallback_to_full_download | Allows automatically falling back to full file downloads when possible. | BOOLEAN | GLOBAL | [] |
| ca_cert_file | Path to a custom certificate file for self-signed certificates. | VARCHAR | GLOBAL | [] |
| cache_httpfs_cache_block_size | Block size for the cache; applies to both the in-memory and the on-disk cache filesystem. Note that for the on-disk filesystem, all existing cache files are invalidated after this setting is updated. | UBIGINT | GLOBAL | [] |
| cache_httpfs_cache_directories_config | Advanced configuration for the on-disk cache. It supports multiple directories, separated by semicolons (';'). Cache blocks are distributed evenly and deterministically across the directories. The same cache directories are expected to be provided across runs; otherwise cache files are not guaranteed to still exist and be accessible. Overrides 'cache_httpfs_cache_directory' if set. | VARCHAR | GLOBAL | [] |
| cache_httpfs_cache_directory | The disk cache directory that stores cached data. | VARCHAR | GLOBAL | [] |
| cache_httpfs_enable_file_handle_cache | Whether the file handle cache is enabled for the cache filesystem. Enabled by default. | BOOLEAN | GLOBAL | [] |
| cache_httpfs_enable_glob_cache | Whether the glob cache is enabled for the cache filesystem. Enabled by default. | BOOLEAN | GLOBAL | [] |
| cache_httpfs_enable_metadata_cache | Whether the metadata cache is enabled for the cache filesystem. Enabled by default. | BOOLEAN | GLOBAL | [] |
| cache_httpfs_evict_policy | Eviction policy for on-disk cache blocks. The default is creation-timestamp based ('creation_timestamp'), which deletes all cache blocks created earlier than a threshold. The other supported policy is 'lru_single_proc', which performs LRU-based eviction and is mainly intended for single-process usage. | VARCHAR | GLOBAL | [] |
| cache_httpfs_file_handle_cache_entry_size | Max cache size for the file handle cache. | UBIGINT | GLOBAL | [] |
| cache_httpfs_file_handle_cache_entry_timeout_millisec | Cache entry timeout in milliseconds for the file handle cache. | UBIGINT | GLOBAL | [] |
| cache_httpfs_glob_cache_entry_size | Max cache size for the glob cache. | UBIGINT | GLOBAL | [] |
| cache_httpfs_glob_cache_entry_timeout_millisec | Cache entry timeout in milliseconds for the glob cache. | UBIGINT | GLOBAL | [] |
| cache_httpfs_ignore_sigpipe | Whether to ignore SIGPIPE for the extension. Not ignored by default. Once ignored, it cannot be reverted. | BOOLEAN | GLOBAL | [] |
| cache_httpfs_in_mem_cache_block_timeout_millisec | Data block cache entry timeout in milliseconds. | UBIGINT | GLOBAL | [] |
| cache_httpfs_max_fanout_subrequest | Cached httpfs performs parallel requests by splitting a read into smaller requests, with the request size determined by cache_httpfs_cache_block_size. This setting limits the maximum number of subrequests issued for a single filesystem read request. 0 means no limit, which is the default. | BIGINT | GLOBAL | [] |
| cache_httpfs_max_in_mem_cache_block_count | Max in-memory cache block count across the in-memory caches of all cache filesystems, so users can bound the maximum memory consumption. Note that it should be set only once, before any filesystem access; otherwise it has no effect. | UBIGINT | GLOBAL | [] |
| cache_httpfs_metadata_cache_entry_size | Max cache size for the metadata LRU cache. | UBIGINT | GLOBAL | [] |
| cache_httpfs_metadata_cache_entry_timeout_millisec | Cache entry timeout in milliseconds for the metadata LRU cache. | UBIGINT | GLOBAL | [] |
| cache_httpfs_min_disk_bytes_for_cache | Minimum number of bytes that must remain on disk for the cache filesystem to enable the on-disk cache; if the remaining bytes fall below the threshold, LRU-based cache file eviction is performed. By default, 5% of disk space is reserved for other usage. A positive value overrides this default. | UBIGINT | GLOBAL | [] |
| cache_httpfs_profile_type | Profiling type for the cached filesystem. Three options are available: noop, temp, and duckdb. The temp option stores the latest IO operation profiling result, which can be affected by concurrent updates; the duckdb option stores the IO operation profiling results in a DuckDB table, which enables advanced analysis. | VARCHAR | GLOBAL | [] |
| cache_httpfs_type | Type of the cached filesystem. Two cache types are available: in_mem and on_disk. The on-disk cache is the default. Set to noop to disable caching, in which case the extension behaves exactly the same as the httpfs extension. | VARCHAR | GLOBAL | [] |
| enable_curl_server_cert_verification | Enable server side certificate verification for CURL backend. | BOOLEAN | GLOBAL | [] |
| enable_server_cert_verification | Enable server side certificate verification. | BOOLEAN | GLOBAL | [] |
| force_download | Forces upfront download of file. | BOOLEAN | GLOBAL | [] |
| hf_max_per_page | Debug option to limit number of items returned in list requests. | UBIGINT | GLOBAL | [] |
| http_keep_alive | Keep alive connections. Setting this to false can help when running into connection failures. | BOOLEAN | GLOBAL | [] |
| http_retries | HTTP retries on I/O error. | UBIGINT | GLOBAL | [] |
| http_retry_backoff | Backoff factor for exponentially increasing retry wait time. | FLOAT | GLOBAL | [] |
| http_retry_wait_ms | Time between retries. | UBIGINT | GLOBAL | [] |
| http_timeout | HTTP timeout read/write/connection/retry (in seconds). | UBIGINT | GLOBAL | [] |
| httpfs_client_implementation | Select which HTTPUtil implementation is used. | VARCHAR | GLOBAL | [] |
| s3_access_key_id | S3 Access Key ID | VARCHAR | GLOBAL | [] |
| s3_endpoint | S3 Endpoint | VARCHAR | GLOBAL | [] |
| s3_kms_key_id | S3 KMS Key ID | VARCHAR | GLOBAL | [] |
| s3_region | S3 Region | VARCHAR | GLOBAL | [] |
| s3_requester_pays | S3 use requester pays mode | BOOLEAN | GLOBAL | [] |
| s3_secret_access_key | S3 Access Key | VARCHAR | GLOBAL | [] |
| s3_session_token | S3 Session Token | VARCHAR | GLOBAL | [] |
| s3_uploader_max_filesize | S3 Uploader max filesize (between 50GB and 5TB) | VARCHAR | GLOBAL | [] |
| s3_uploader_max_parts_per_file | S3 Uploader max parts per file (between 1 and 10000) | UBIGINT | GLOBAL | [] |
| s3_uploader_thread_limit | S3 Uploader global thread limit | UBIGINT | GLOBAL | [] |
| s3_url_compatibility_mode | Disable Globs and Query Parameters on S3 URLs | BOOLEAN | GLOBAL | [] |
| s3_url_style | S3 URL style | VARCHAR | GLOBAL | [] |
| s3_use_ssl | S3 use SSL | BOOLEAN | GLOBAL | [] |
| unsafe_disable_etag_checks | Disable checks on ETag consistency | BOOLEAN | GLOBAL | [] |
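
A minimal sketch of how a few of the cache settings above might be combined, assuming a single-process workload with two cache disks. All paths and numeric values are hypothetical and chosen only for illustration; the setting names and the string values come from the table.

```sql
-- Spread cache blocks over two hypothetical directories (semicolon-separated).
-- This overrides cache_httpfs_cache_directory.
SET cache_httpfs_cache_directories_config = '/mnt/disk1/cache;/mnt/disk2/cache';

-- Switch from the default 'creation_timestamp' policy to LRU eviction,
-- which is described as intended for single-process usage.
SET cache_httpfs_evict_policy = 'lru_single_proc';

-- Keep at least 10 GiB free on disk (illustrative value) instead of the
-- default 5% reservation.
SET cache_httpfs_min_disk_bytes_for_cache = 10737418240;

-- Cap the subrequests issued per filesystem read at 8 (illustrative value);
-- 0, the default, means no limit.
SET cache_httpfs_max_fanout_subrequest = 8;

-- Keep only the latest IO profiling result; 'duckdb' and 'noop' are the
-- other documented options.
SET cache_httpfs_profile_type = 'temp';
```

Because cache_httpfs_max_in_mem_cache_block_count only takes effect when set before any filesystem access, settings like these would typically go at the start of a session, right after LOAD cache_httpfs.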