Loading JSON

The DuckDB JSON reader can automatically infer which configuration flags to use by analyzing the JSON file. This will work correctly in most situations, and should be the first option attempted. In rare situations where the JSON reader cannot figure out the correct configuration, it is possible to manually configure the JSON reader to correctly parse the JSON file.

The read_json Function

The read_json function is the simplest method of loading JSON files: it automatically attempts to figure out the correct configuration of the JSON reader, and it deduces the column types. In the following example, we use the todos.json file:

SELECT *
FROM read_json('todos.json')
LIMIT 5;
| userId | id | title                                                           | completed |
|--------|----|-----------------------------------------------------------------|-----------|
| 1      | 1  | delectus aut autem                                              | false     |
| 1      | 2  | quis ut nam facilis et officia qui                              | false     |
| 1      | 3  | fugiat veniam minus                                             | false     |
| 1      | 4  | et porro tempora                                                | true      |
| 1      | 5  | laboriosam mollitia et enim quasi adipisci quia provident illum | false     |

We can use read_json to create a persistent table as well:

CREATE TABLE todos AS
    SELECT *
    FROM read_json('todos.json');
DESCRIBE todos;
| column_name | column_type | null | key  | default | extra |
|-------------|-------------|------|------|---------|-------|
| userId      | UBIGINT     | YES  | NULL | NULL    | NULL  |
| id          | UBIGINT     | YES  | NULL | NULL    | NULL  |
| title       | VARCHAR     | YES  | NULL | NULL    | NULL  |
| completed   | BOOLEAN     | YES  | NULL | NULL    | NULL  |

If we specify the columns, we can bypass the automatic detection. Note that not all columns need to be specified:

SELECT *
FROM read_json(
    'todos.json',
    columns = {userId: 'UBIGINT', completed: 'BOOLEAN'}
);

Multiple files can be read at once by providing a glob or a list of files. Refer to the multiple files section for more information.
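For instance, with hypothetical files todos1.json and todos2.json, both forms below read them together:

SELECT *
FROM read_json('todos*.json'); -- glob pattern

SELECT *
FROM read_json(['todos1.json', 'todos2.json']); -- explicit list of files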

Functions for Reading JSON Objects

The following table functions are used to read JSON:

| Function | Description |
|----------|-------------|
| read_json_objects(filename) | Reads JSON objects from filename, where filename can also be a list of files or a glob pattern. |
| read_ndjson_objects(filename) | Alias for read_json_objects with the format parameter set to 'newline_delimited'. |
| read_json_objects_auto(filename) | Alias for read_json_objects with the format parameter set to 'auto'. |

Parameters

These functions have the following parameters:

| Name | Description | Type | Default |
|------|-------------|------|---------|
| compression | The compression type for the file. By default this is detected automatically from the file extension (e.g., t.json.gz uses gzip, t.json uses none). Options are 'none', 'gzip', 'zstd', and 'auto'. | VARCHAR | 'auto' |
| filename | Whether an extra filename column should be included in the result. | BOOL | false |
| format | Can be one of ['auto', 'unstructured', 'newline_delimited', 'array']. | VARCHAR | 'array' |
| hive_partitioning | Whether to interpret the path as a Hive partitioned path. | BOOL | false |
| ignore_errors | Whether to ignore parse errors (only possible when format is 'newline_delimited'). | BOOL | false |
| maximum_sample_files | The maximum number of JSON files sampled for auto-detection. | BIGINT | 32 |
| maximum_object_size | The maximum size of a JSON object (in bytes). | UINTEGER | 16777216 |
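As a sketch of these parameters in combination (the gzipped file name here is hypothetical), the following reads a compressed newline-delimited file and adds a column recording which file each object came from:

SELECT *
FROM read_json_objects(
    'birds-nd.json.gz', -- hypothetical gzipped NDJSON file
    format = 'newline_delimited',
    compression = 'gzip',
    filename = true
);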

The format parameter specifies how to read the JSON from a file. With 'unstructured', the top-level JSON is read, e.g., for birds.json:

{
  "duck": 42
}
{
  "goose": [1, 2, 3]
}
FROM read_json_objects('birds.json', format = 'unstructured');

will result in two objects being read:

┌──────────────────────────────┐
│             json             │
│             json             │
├──────────────────────────────┤
│ {\n    "duck": 42\n}         │
│ {\n    "goose": [1, 2, 3]\n} │
└──────────────────────────────┘

With 'newline_delimited', NDJSON is read, where each JSON object is separated by a newline (\n), e.g., for birds-nd.json:

{"duck": 42}
{"goose": [1, 2, 3]}
FROM read_json_objects('birds-nd.json', format = 'newline_delimited');

will also result in two objects being read:

┌──────────────────────┐
│         json         │
│         json         │
├──────────────────────┤
│ {"duck": 42}         │
│ {"goose": [1, 2, 3]} │
└──────────────────────┘
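Because ignore_errors is only available when format is 'newline_delimited', a malformed line can be skipped instead of aborting the scan. A minimal sketch, assuming a hypothetical birds-dirty.json that contains one invalid line:

FROM read_json_objects(
    'birds-dirty.json', -- hypothetical file with a malformed line
    format = 'newline_delimited',
    ignore_errors = true
);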

With 'array', each array element is read, e.g., for birds-array.json:

[
    {
        "duck": 42
    },
    {
        "goose": [1, 2, 3]
    }
]
FROM read_json_objects('birds-array.json', format = 'array');

will again result in two objects being read:

┌──────────────────────────────────────┐
│                 json                 │
│                 json                 │
├──────────────────────────────────────┤
│ {\n        "duck": 42\n    }         │
│ {\n        "goose": [1, 2, 3]\n    } │
└──────────────────────────────────────┘
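With format = 'auto' (which read_json_objects_auto uses), the reader inspects the file and picks the matching layout, so the same file can be read without specifying 'array' explicitly:

FROM read_json_objects('birds-array.json', format = 'auto');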

Functions for Reading JSON as a Table

DuckDB also supports reading JSON as a table, using the following functions:

| Function | Description |
|----------|-------------|
| read_json(filename) | Reads JSON from filename, where filename can also be a list of files or a glob pattern. |
| read_json_auto(filename) | Alias for read_json. |
| read_ndjson(filename) | Alias for read_json with the format parameter set to 'newline_delimited'. |
| read_ndjson_auto(filename) | Alias for read_json with the format parameter set to 'newline_delimited'. |

Parameters

Besides maximum_object_size, format, ignore_errors, and compression, these functions have the following additional parameters:

| Name | Description | Type | Default |
|------|-------------|------|---------|
| auto_detect | Whether to auto-detect the key names and value types. | BOOL | false |
| columns | A struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled, these will be inferred. | STRUCT | (empty) |
| dateformat | Specifies the date format to use when parsing dates. See Date Format. | VARCHAR | 'iso' |
| maximum_depth | Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types. | BIGINT | -1 |
| records | Can be one of ['auto', 'true', 'false']. | VARCHAR | 'auto' |
| sample_size | Number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. | UBIGINT | 20480 |
| timestampformat | Specifies the format to use when parsing timestamps. See Date Format. | VARCHAR | 'iso' |
| union_by_name | Whether the schemas of multiple JSON files should be unified. | BOOL | false |
| map_inference_threshold | Controls the threshold for the number of subfields whose schema is auto-detected: if schema auto-detection would infer a STRUCT type with more than this many subfields, it infers a MAP type instead. Set to -1 to disable MAP inference. | BIGINT | 200 |
| field_appearance_threshold | The JSON reader divides the number of appearances of each JSON field by the auto-detection sample size. If the average over the fields of an object is less than this threshold, it defaults to using a MAP type whose value type is the merged field types. | DOUBLE | 0.1 |
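As an example of the temporal parameters, dateformat controls how date strings are parsed. A sketch, assuming a hypothetical events.json whose dates are written as DD/MM/YYYY:

SELECT *
FROM read_json(
    'events.json', -- hypothetical file
    columns = {event: 'VARCHAR', day: 'DATE'},
    dateformat = '%d/%m/%Y'
);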

Note that DuckDB can convert JSON arrays directly to its internal LIST type, and missing keys become NULL:

SELECT *
FROM read_json(
    ['birds1.json', 'birds2.json'],
    columns = {duck: 'INTEGER', goose: 'INTEGER[]', swan: 'DOUBLE'}
);
| duck | goose     | swan |
|------|-----------|------|
| 42   | [1, 2, 3] | NULL |
| 43   | [4, 5, 6] | 3.3  |

DuckDB can automatically detect the types like so:

SELECT goose, duck FROM read_json('*.json.gz');
SELECT goose, duck FROM '*.json.gz'; -- equivalent
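When files share some but not all keys, union_by_name aligns them by key name and fills missing keys with NULL. A sketch, reusing birds1.json and birds2.json from the example above:

SELECT *
FROM read_json(
    ['birds1.json', 'birds2.json'],
    union_by_name = true
);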

DuckDB can read (and auto-detect) a variety of formats, specified with the format parameter. Querying a JSON file that contains an 'array', e.g.:

[
  {
    "duck": 42,
    "goose": 4.2
  },
  {
    "duck": 43,
    "goose": 4.3
  }
]

Can be queried exactly the same as a JSON file that contains 'unstructured' JSON, e.g.:

{
    "duck": 42,
    "goose": 4.2
}
{
    "duck": 43,
    "goose": 4.3
}

Both can be read as the table:

| duck | goose |
|------|-------|
| 42   | 4.2   |
| 43   | 4.3   |
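Both layouts can therefore be read with the same call. A sketch, assuming the snippets above are saved as birds-array.json and birds-unstructured.json (the latter file name is hypothetical):

SELECT * FROM read_json('birds-array.json', format = 'auto');
SELECT * FROM read_json('birds-unstructured.json', format = 'auto'); -- same table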

If your JSON file does not contain 'records', i.e., it contains any type of JSON other than objects, DuckDB can still read it. This is controlled with the records parameter, which specifies whether the JSON contains records that should be unpacked into individual columns. For example, reading the following file with records:

{"duck": 42, "goose": [1, 2, 3]}
{"duck": 43, "goose": [4, 5, 6]}

Results in two columns:

| duck | goose   |
|------|---------|
| 42   | [1,2,3] |
| 43   | [4,5,6] |

You can read the same file with records set to 'false' to get a single column, which is a STRUCT containing the data.
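Assuming the two records above are stored in a file records.json (a hypothetical name), the query would look along these lines:

SELECT *
FROM read_json('records.json', records = 'false');

This returns: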

| json                           |
|--------------------------------|
| {'duck': 42, 'goose': [1,2,3]} |
| {'duck': 43, 'goose': [4,5,6]} |

For additional examples reading more complex data, please see the “Shredding Deeply Nested JSON, One Vector at a Time” blog post.

Loading with the COPY Statement Using FORMAT JSON

When the json extension is installed, FORMAT JSON is supported for COPY FROM, IMPORT DATABASE, as well as COPY TO, and EXPORT DATABASE. See the COPY statement and the IMPORT / EXPORT clauses.
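As a minimal sketch (the table and file names are illustrative), loading newline-delimited JSON into an existing table looks like this:

CREATE TABLE birds (duck BIGINT, goose BIGINT[]);
-- birds-records.json is hypothetical: one {"duck": ..., "goose": [...]} object per line
COPY birds FROM 'birds-records.json' (FORMAT JSON);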

By default, COPY expects newline-delimited JSON. If you prefer copying data to/from a JSON array, you can specify ARRAY true, e.g.,

COPY (SELECT * FROM range(5) r(i))
TO 'numbers.json' (ARRAY true);

will create the following file:

[
	{"i":0},
	{"i":1},
	{"i":2},
	{"i":3},
	{"i":4}
]

This can be read back into DuckDB as follows:

CREATE TABLE numbers (i BIGINT);
COPY numbers FROM 'numbers.json' (ARRAY true);

The format can also be detected automatically like so:

CREATE TABLE numbers (i BIGINT);
COPY numbers FROM 'numbers.json' (AUTO_DETECT true);

We can also create a table from the auto-detected schema:

CREATE TABLE numbers AS
    FROM 'numbers.json';

Parameters

| Name | Description | Type | Default |
|------|-------------|------|---------|
| auto_detect | Whether to auto-detect the key names and value types. | BOOL | false |
| columns | A struct that specifies the key names and value types contained within the JSON file (e.g., {key1: 'INTEGER', key2: 'VARCHAR'}). If auto_detect is enabled, these will be inferred. | STRUCT | (empty) |
| compression | The compression type for the file. By default this is detected automatically from the file extension (e.g., t.json.gz uses gzip, t.json uses none). Options are 'uncompressed', 'gzip', 'zstd', and 'auto_detect'. | VARCHAR | 'auto_detect' |
| convert_strings_to_integers | Whether strings representing integer values should be converted to a numerical type. | BOOL | false |
| dateformat | Specifies the date format to use when parsing dates. See Date Format. | VARCHAR | 'iso' |
| filename | Whether an extra filename column should be included in the result. | BOOL | false |
| format | Can be one of ['auto', 'unstructured', 'newline_delimited', 'array']. | VARCHAR | 'array' |
| hive_partitioning | Whether to interpret the path as a Hive partitioned path. | BOOL | false |
| ignore_errors | Whether to ignore parse errors (only possible when format is 'newline_delimited'). | BOOL | false |
| maximum_depth | Maximum nesting depth to which the automatic schema detection detects types. Set to -1 to fully detect nested JSON types. | BIGINT | -1 |
| maximum_object_size | The maximum size of a JSON object (in bytes). | UINTEGER | 16777216 |
| records | Can be one of ['auto', 'true', 'false']. | VARCHAR | 'auto' |
| sample_size | Number of sample objects for automatic JSON type detection. Set to -1 to scan the entire input file. | UBIGINT | 20480 |
| timestampformat | Specifies the format to use when parsing timestamps. See Date Format. | VARCHAR | 'iso' |
| union_by_name | Whether the schemas of multiple JSON files should be unified. | BOOL | false |
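As with read_json, these parameters can be combined in the option list of a COPY statement. A sketch, reusing the hypothetical events.json with DD/MM/YYYY dates:

CREATE TABLE events (event VARCHAR, day DATE);
COPY events FROM 'events.json' (FORMAT JSON, DATEFORMAT '%d/%m/%Y');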