A DuckDB extension for working with Kaggle datasets
Maintainer(s):
habedi
Installing and Loading
INSTALL gaggle FROM community;
LOAD gaggle;
Example
-- 0. Assuming the extension is already installed and loaded
-- 1. Get extension version
SELECT gaggle_version();
-- 2. List files in the dataset
SELECT * FROM gaggle_ls('habedi/flickr-8k-dataset-clean') LIMIT 5;
-- 3. Read a Parquet file from the local cache using a prepared statement
PREPARE rp AS SELECT * FROM read_parquet(?) LIMIT 10;
EXECUTE rp(gaggle_file_path('habedi/flickr-8k-dataset-clean', 'flickr8k.parquet'));
-- 4. Alternatively, use a replacement scan to read the file directly via the `kaggle:` prefix
SELECT COUNT(*) FROM 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';
-- 5. Check cache info
SELECT gaggle_cache_info();
-- 6. Check whether the cached dataset is current (that is, whether it is the newest version)
SELECT gaggle_is_current('habedi/flickr-8k-dataset-clean');
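Building on the replacement scan in step 4, the sketch below copies a dataset file into a regular DuckDB table, which can be convenient when the same data is queried repeatedly. It relies only on standard DuckDB statements and the `kaggle:` prefix shown above. The table name `flickr8k` is a hypothetical choice for illustration, and the statements assume the extension is loaded and Kaggle credentials are configured.

```sql
-- Materialize the Parquet file into a local table (the name `flickr8k` is illustrative)
CREATE OR REPLACE TABLE flickr8k AS
SELECT * FROM 'kaggle:habedi/flickr-8k-dataset-clean/flickr8k.parquet';

-- Inspect the column names and types of the new table
DESCRIBE flickr8k;

-- Later queries read the local table rather than resolving the `kaggle:` path again
SELECT COUNT(*) FROM flickr8k;
```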
About gaggle
Gaggle is a DuckDB extension that uses the Kaggle API to let you query Kaggle datasets directly with SQL. It aims to simplify data science workflows by hiding the complexity of manually downloading, extracting, and managing dataset files from Kaggle.
For more information, such as API references and usage examples, visit the project's GitHub repository.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| gaggle_cache_info | scalar | NULL | NULL | |
| gaggle_clear_cache | scalar | NULL | NULL | |
| gaggle_download | scalar | NULL | NULL | |
| gaggle_enforce_cache_limit | scalar | NULL | NULL | |
| gaggle_file_path | scalar | NULL | NULL | |
| gaggle_info | scalar | NULL | NULL | |
| gaggle_is_current | scalar | NULL | NULL | |
| gaggle_json_each | scalar | NULL | NULL | |
| gaggle_last_error | scalar | NULL | NULL | |
| gaggle_ls | table | NULL | NULL | |
| gaggle_search | scalar | NULL | NULL | |
| gaggle_set_credentials | scalar | NULL | NULL | |
| gaggle_update_dataset | scalar | NULL | NULL | |
| gaggle_version | scalar | NULL | NULL | |
| gaggle_version_info | scalar | NULL | NULL | |