The Encoding Extension enables text encoding detection and UTF-8 conversion from within DuckDB
Installing and Loading
INSTALL encoding FROM community;
LOAD encoding;
Example
-- Forces text with potentially mixed or unknown encoding to UTF-8
D SELECT force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as utf8_text;
┌─────────────────────────────────────┐
│ utf8_text │
│ varchar │
├─────────────────────────────────────┤
│ [{"Langue":["Français","English"]}] │
└─────────────────────────────────────┘
-- Use with json() function to parse converted text
D SELECT json(force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D')) as parsed_json;
┌─────────────────────────────────────┐
│ parsed_json │
│ json │
├─────────────────────────────────────┤
│ [{"Langue":["Français","English"]}] │
└─────────────────────────────────────┘
-- Detect the encoding of input text
D SELECT detect_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as detected_encoding;
┌───────────────────┐
│ detected_encoding │
│ varchar │
├───────────────────┤
│ windows-1252 │
└───────────────────┘
-- Convert simple hex-encoded text
D SELECT force_utf8_encoding('\x48\x65\x6C\x6C\x6F') as simple_conversion;
┌───────────────────┐
│ simple_conversion │
│ varchar │
├───────────────────┤
│ Hello │
└───────────────────┘
-- Handle regular UTF-8 text (pass-through)
D SELECT force_utf8_encoding('Hello World') as passthrough;
┌─────────────┐
│ passthrough │
│ varchar │
├─────────────┤
│ Hello World │
└─────────────┘
About encoding
This community extension implements text encoding detection and UTF-8 conversion functions for DuckDB. It solves the common problem of working with text data from various sources that may not be UTF-8 encoded.
Problem Solved
DuckDB's built-in decode()
function fails when trying to convert non-UTF-8 encoded data:
SELECT json(decode('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D'));
-- Error: Failure in decode: could not convert blob to UTF8 string,
-- the blob contained invalid UTF8 characters
This extension automatically detects the encoding and converts it to UTF-8:
SELECT json(force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D')) as json;
-- Returns: [{"Langue":["Français","English"]}]
Functions
force_utf8_encoding(input_text)
Forces text with potentially mixed or unknown encoding to UTF-8. Handles hex-encoded byte sequences and attempts automatic encoding detection.
Parameters:
input_text
(VARCHAR): Text that may contain encoded bytes (like '\x5B\x7B…')
Returns: VARCHAR - The text converted to UTF-8, or NULL on error
Examples:
-- Convert hex-encoded JSON with French characters
SELECT force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as utf8_text;
-- Returns: [{"Langue":["Français","English"]}]
-- Convert simple hex-encoded text
SELECT force_utf8_encoding('\x48\x65\x6C\x6C\x6F') as text;
-- Returns: Hello
-- Handle regular UTF-8 text (pass-through)
SELECT force_utf8_encoding('Hello World') as text;
-- Returns: Hello World
-- Mixed hex and regular characters
SELECT force_utf8_encoding('Hello \x57\x6F\x72\x6C\x64') as text;
-- Returns: Hello World
detect_encoding(input_text)
Detects the likely encoding of input text and returns the encoding name.
Parameters:
input_text
(VARCHAR): Text that may contain encoded bytes
Returns: VARCHAR - The detected encoding name, or NULL on error
Examples:
-- Detect encoding of regular UTF-8 text
SELECT detect_encoding('Hello World') as encoding;
-- Returns: UTF-8
-- Detect encoding of hex-encoded text
SELECT detect_encoding('\x48\x65\x6C\x6C\x6F') as encoding;
-- Returns: UTF-8
-- Detect encoding of French text with Windows-1252 encoding
SELECT detect_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as encoding;
-- Returns: windows-1252
Supported Encodings
The extension automatically detects and converts from these encodings:
- UTF-8: Already valid UTF-8 (pass-through)
- Windows-1252: Very common for Western text
- Windows-1251: Cyrillic text
- Windows-1250: Central European text
- Shift_JIS: Japanese text
- GB18030: Chinese text
- EUC-KR: Korean text
- ISO-8859-15: Latin-9 (with Euro sign)
- ISO-8859-2: Latin-2
Usage Examples
Basic Encoding Conversion
-- Convert encoded data to UTF-8 for JSON parsing
SELECT json(force_utf8_encoding('\x5B\x7B\x22test\x22\x3A\x22value\x22\x7D\x5D')) as data;
-- Convert and process in WHERE clause
SELECT * FROM documents
WHERE json_extract(force_utf8_encoding(raw_data), '$.language') = 'français';
-- Batch convert multiple encoded strings
SELECT
id,
force_utf8_encoding(encoded_text) as clean_text
FROM raw_data_table
WHERE force_utf8_encoding(encoded_text) IS NOT NULL;
Encoding Detection and Analysis
-- Analyze encoding distribution in your data
SELECT
detect_encoding(raw_text) as encoding,
COUNT(*) as count
FROM documents
GROUP BY detect_encoding(raw_text)
ORDER BY count DESC;
-- Find records that need encoding conversion
SELECT *
FROM documents
WHERE detect_encoding(raw_text) != 'UTF-8'
AND detect_encoding(raw_text) IS NOT NULL;
This extension is experimental and potentially unstable. See README for full examples.
Added Functions
function_name | function_type | description | comment | examples |
---|---|---|---|---|
force_utf8_encoding | scalar | Forces text with potentially mixed or unknown encoding to UTF-8. Handles hex-encoded byte sequences and attempts automatic encoding detection. | NULL | [SELECT force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D');] |
detect_encoding | scalar | Detects the likely encoding of input text and returns the encoding name. | NULL | [SELECT detect_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D');] |