gaggle

A DuckDB extension for working with Kaggle datasets

Maintainer(s): habedi

Installing and Loading

INSTALL gaggle FROM community;
LOAD gaggle;

Example

-- 0. Assuming the extension is already installed and loaded

-- 1. Get extension version
SELECT gaggle_version();

-- 2. List files in the dataset
SELECT * FROM gaggle_ls('habedi/flickr-8k-dataset-clean') LIMIT 5;

-- 3. Read a Parquet file from the local cache using a prepared statement
PREPARE rp AS SELECT * FROM read_parquet(?) LIMIT 10;
EXECUTE rp(gaggle_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));

-- 4. Alternatively, use a replacement scan to read the file directly via the `kaggle:` prefix
SELECT COUNT(*) FROM 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';

-- 5. Check cache info
SELECT gaggle_cache_info();

-- 6. Check whether the cached dataset is current (that is, the newest version)
SELECT gaggle_is_current('habedi/flickr-8k-dataset-clean');

About gaggle

Gaggle is a DuckDB extension that uses the Kaggle API to let you query Kaggle datasets directly with SQL. It aims to simplify data science workflows by hiding the complexity of manually downloading, extracting, and managing dataset files from Kaggle.

For more information, such as the API reference and usage examples, visit the project's GitHub repository.

Added Functions

function_name               function_type  description  comment  examples
gaggle_cache_info           scalar         NULL         NULL
gaggle_clear_cache          scalar         NULL         NULL
gaggle_download             scalar         NULL         NULL
gaggle_enforce_cache_limit  scalar         NULL         NULL
gaggle_file_path            scalar         NULL         NULL
gaggle_info                 scalar         NULL         NULL
gaggle_is_current           scalar         NULL         NULL
gaggle_json_each            scalar         NULL         NULL
gaggle_last_error           scalar         NULL         NULL
gaggle_ls                   table          NULL         NULL
gaggle_search               scalar         NULL         NULL
gaggle_set_credentials      scalar         NULL         NULL
gaggle_update_dataset       scalar         NULL         NULL
gaggle_version              scalar         NULL         NULL
gaggle_version_info         scalar         NULL         NULL
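
A similar listing can likely be reproduced locally with DuckDB's built-in `duckdb_functions()` table function. The query below is a minimal sketch, not taken from the extension's documentation. It assumes the extension is already installed and loaded, and that every function it registers starts with `gaggle` (an assumption based on the names listed above).

```sql
-- Sketch: list the functions registered under the (assumed) gaggle prefix.
-- duckdb_functions() is built into DuckDB; DISTINCT collapses overloads,
-- since each overload of a function appears as its own row.
SELECT DISTINCT function_name, function_type
FROM duckdb_functions()
WHERE function_name LIKE 'gaggle%'
ORDER BY function_name;
```

If the query returns no rows, the extension is probably not loaded in the current session.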