Search Shortcut cmd + k | ctrl + k
encoding

The Encoding Extension enables text encoding detection and UTF-8 conversion from within DuckDB

Maintainer(s): onnimonni

Installing and Loading

INSTALL encoding FROM community;
LOAD encoding;

Example

-- Forces text with potentially mixed or unknown encoding to UTF-8
D SELECT force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as utf8_text;
┌─────────────────────────────────────┐
             utf8_text               
               varchar               
├─────────────────────────────────────┤
 [{"Langue":["Français","English"]}] 
└─────────────────────────────────────┘

-- Use with json() function to parse converted text
D SELECT json(force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D')) as parsed_json;
┌─────────────────────────────────────┐
             parsed_json             
                json                 
├─────────────────────────────────────┤
 [{"Langue":["Français","English"]}] 
└─────────────────────────────────────┘

-- Detect the encoding of input text
D SELECT detect_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as detected_encoding;
┌───────────────────┐
 detected_encoding 
      varchar      
├───────────────────┤
 windows-1252      
└───────────────────┘

-- Convert simple hex-encoded text
D SELECT force_utf8_encoding('\x48\x65\x6C\x6C\x6F') as simple_conversion;
┌───────────────────┐
 simple_conversion 
      varchar      
├───────────────────┤
 Hello             
└───────────────────┘

-- Handle regular UTF-8 text (pass-through)
D SELECT force_utf8_encoding('Hello World') as passthrough;
┌─────────────┐
 passthrough 
   varchar   
├─────────────┤
 Hello World 
└─────────────┘

About encoding

This community extension implements text encoding detection and UTF-8 conversion functions for DuckDB. It solves the common problem of working with text data from various sources that may not be UTF-8 encoded.

Problem Solved

DuckDB's built-in decode() function fails when trying to convert non-UTF-8 encoded data:

SELECT json(decode('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D'));
-- Error: Failure in decode: could not convert blob to UTF8 string,
-- the blob contained invalid UTF8 characters

This extension automatically detects the encoding and converts it to UTF-8:

SELECT json(force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D')) as json;
-- Returns: [{"Langue":["Français","English"]}]

Functions

force_utf8_encoding(input_text)

Forces text with potentially mixed or unknown encoding to UTF-8. Handles hex-encoded byte sequences and attempts automatic encoding detection.

Parameters:

  • input_text (VARCHAR): Text that may contain encoded bytes (like '\x5B\x7B…')

Returns: VARCHAR - The text converted to UTF-8, or NULL on error

Examples:

-- Convert hex-encoded JSON with French characters
SELECT force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as utf8_text;
-- Returns: [{"Langue":["Français","English"]}]

-- Convert simple hex-encoded text
SELECT force_utf8_encoding('\x48\x65\x6C\x6C\x6F') as text;
-- Returns: Hello

-- Handle regular UTF-8 text (pass-through)
SELECT force_utf8_encoding('Hello World') as text;
-- Returns: Hello World

-- Mixed hex and regular characters
SELECT force_utf8_encoding('Hello \x57\x6F\x72\x6C\x64') as text;
-- Returns: Hello World

detect_encoding(input_text)

Detects the likely encoding of input text and returns the encoding name.

Parameters:

  • input_text (VARCHAR): Text that may contain encoded bytes

Returns: VARCHAR - The detected encoding name, or NULL on error

Examples:

-- Detect encoding of regular UTF-8 text
SELECT detect_encoding('Hello World') as encoding;
-- Returns: UTF-8

-- Detect encoding of hex-encoded text
SELECT detect_encoding('\x48\x65\x6C\x6C\x6F') as encoding;
-- Returns: UTF-8

-- Detect encoding of French text with Windows-1252 encoding
SELECT detect_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D') as encoding;
-- Returns: windows-1252

Supported Encodings

The extension automatically detects and converts from these encodings:

  • UTF-8: Already valid UTF-8 (pass-through)
  • Windows-1252: Very common for Western text
  • Windows-1251: Cyrillic text
  • Windows-1250: Central European text
  • Shift_JIS: Japanese text
  • GB18030: Chinese text
  • EUC-KR: Korean text
  • ISO-8859-15: Latin-9 (with Euro sign)
  • ISO-8859-2: Latin-2

Usage Examples

Basic Encoding Conversion

-- Convert encoded data to UTF-8 for JSON parsing
SELECT json(force_utf8_encoding('\x5B\x7B\x22test\x22\x3A\x22value\x22\x7D\x5D')) as data;

-- Convert and process in WHERE clause
SELECT * FROM documents
WHERE json_extract(force_utf8_encoding(raw_data), '$.language') = 'français';

-- Batch convert multiple encoded strings
SELECT
    id,
    force_utf8_encoding(encoded_text) as clean_text
FROM raw_data_table
WHERE force_utf8_encoding(encoded_text) IS NOT NULL;

Encoding Detection and Analysis

-- Analyze encoding distribution in your data
SELECT
    detect_encoding(raw_text) as encoding,
    COUNT(*) as count
FROM documents
GROUP BY detect_encoding(raw_text)
ORDER BY count DESC;

-- Find records that need encoding conversion
SELECT *
FROM documents
WHERE detect_encoding(raw_text) != 'UTF-8'
AND detect_encoding(raw_text) IS NOT NULL;

This extension is experimental and potentially unstable. See README for full examples.

Added Functions

function_name function_type description comment examples
force_utf8_encoding scalar Forces text with potentially mixed or unknown encoding to UTF-8. Handles hex-encoded byte sequences and attempts automatic encoding detection. NULL [SELECT force_utf8_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D');]
detect_encoding scalar Detects the likely encoding of input text and returns the encoding name. NULL [SELECT detect_encoding('\x5B\x7B\x22Langue\x22\x3A\x5B\x22Fran\xE7ais\x22,\x22English\x22\x5D\x7D\x5D');]