Shredding Deeply Nested JSON, One Vector at a Time

Laurens Kuiper, 2023-03-03

TL;DR: We've recently improved DuckDB's JSON extension so JSON files can be directly queried as if they were tables.

JSON is not scary anymore! Jason IS scary though, even as a duck.

DuckDB has a JSON extension that can be installed and loaded through SQL:

INSTALL 'json';
LOAD 'json';

The JSON extension supports various functions to create, read, and manipulate JSON strings. These functions are similar to the JSON functionality provided by other databases such as PostgreSQL and MySQL. DuckDB uses yyjson internally to parse JSON, a high-performance JSON library written in ANSI C. Many thanks to the yyjson authors and contributors!
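
For example, here is a minimal sketch of a few of these scalar functions (the JSON literals are made up for illustration):

SELECT json_valid('{"duck": "DuckDB"}');                                   -- true
SELECT json_extract('{"duck": {"name": "Jason"}}', '$.duck.name');        -- "Jason" (as JSON)
SELECT json_extract_string('{"duck": {"name": "Jason"}}', '$.duck.name'); -- Jason (as VARCHAR)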

Besides these functions, DuckDB is now able to read JSON directly! This is done by automatically detecting the types and column names, then converting the values within the JSON to DuckDB's vectors. The automated schema detection dramatically simplifies working with JSON data, and subsequent queries on DuckDB's vectors are significantly faster!

Reading JSON Automatically with DuckDB

Since the 0.7.0 update, DuckDB has added JSON table functions. To demonstrate these, we will read todos.json, a fake TODO list containing 200 fake TODO items (only the first two items are shown):

[
  {
    "userId": 1,
    "id": 1,
    "title": "delectus aut autem",
    "completed": false
  },
  {
    "userId": 1,
    "id": 2,
    "title": "quis ut nam facilis et officia qui",
    "completed": false
  },
]

Each TODO item is an entry in the JSON array, but in DuckDB, we'd like a table where each entry is a row. This is now (since DuckDB's 0.7.0 release in February 2023) as easy as:

SELECT * FROM 'todos.json';

| userId | id | title | completed |
|--------|----|-------|-----------|
| 1 | 1 | delectus aut autem | false |
| 1 | 2 | quis ut nam facilis et officia qui | false |
| 1 | 3 | fugiat veniam minus | false |
| 1 | 4 | et porro tempora | true |
| 1 | 5 | laboriosam mollitia et enim quasi adipisci quia provident illum | false |

(Note: Only 5 rows shown)

Now, finding out which user completed the most TODO items is as simple as:

SELECT userId, sum(completed::int) total_completed
FROM 'todos.json'
GROUP BY userId
ORDER BY total_completed DESC
LIMIT 1;

| userId | total_completed |
|--------|-----------------|
| 5 | 12 |

Under the hood, DuckDB recognizes the .json file extension in 'todos.json', and calls read_json_auto('todos.json') instead. This function is similar to our read_csv_auto function, which automatically infers column names and types for CSV files.

Like our other table functions, read_json_auto supports reading multiple files by passing a list, e.g., read_json_auto(['file1.json', 'file2.json']), but also globbing, e.g., read_json_auto('file*.json'). DuckDB will read multiple files in parallel.
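
For example (a sketch; more_todos.json is a hypothetical second file):

SELECT * FROM read_json_auto(['todos.json', 'more_todos.json']);
SELECT * FROM read_json_auto('todos*.json');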

Newline Delimited JSON

Not all JSON adheres to the format used in todos.json, which is an array of 'records'. Newline-delimited JSON, or NDJSON, stores each row on a new line. DuckDB also supports reading (and writing!) this format. First, let's write our TODO list as NDJSON:

COPY (SELECT * FROM 'todos.json') TO 'todos2.json';

Again, DuckDB recognizes the .json suffix in the output file and automatically infers that we mean to use (FORMAT JSON). The created file looks like this (only the first two records are shown):

{"userId":1,"id":1,"title":"delectus aut autem","completed":false}
{"userId":1,"id":2,"title":"quis ut nam facilis et officia qui","completed":false}

DuckDB can read this file in precisely the same way as the original one:

SELECT * FROM 'todos2.json';

If your JSON file is newline-delimited, DuckDB can parallelize reading. This is specified with the nd variant of the function or with the lines parameter:

SELECT * FROM read_ndjson_auto('todos2.json');
SELECT * FROM read_json_auto('todos2.json', lines='true');

You can also set lines='auto' to auto-detect whether the JSON file is newline-delimited.
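
For example (equivalent to the queries above):

SELECT * FROM read_json_auto('todos2.json', lines='auto');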

Other JSON Formats

If using the read_json function directly, the format of the JSON can be specified using the json_format parameter. This parameter defaults to 'auto', which tells DuckDB to infer what kind of JSON we are dealing with. The format of our original todos.json is 'array_of_records', while the newline-delimited todos2.json is 'records'. This can be specified like so:

SELECT * FROM read_json('todos.json', auto_detect=true, json_format='array_of_records');
SELECT * FROM read_json('todos2.json', auto_detect=true, json_format='records');

Other supported formats are 'values' and 'array_of_values', which are similar to 'records' and 'array_of_records'. However, with these formats, each 'record' is not required to be a JSON object but can also be a JSON array, string, or anything supported in JSON.

Manual Schemas

What you may also have noticed is the auto_detect parameter. This parameter tells DuckDB to infer the schema, i.e., determine the names and types of the returned columns. These can manually be specified like so:

SELECT * FROM read_json('todos.json',
                        columns={userId: 'INT', id: 'INT', title: 'VARCHAR', completed: 'BOOLEAN'},
                        json_format='array_of_records');

You don't have to specify all fields, just the ones you're interested in:

SELECT * FROM read_json('todos.json',
                        columns={userId: 'INT', completed: 'BOOLEAN'},
                        json_format='array_of_records');

Now that we know how to use the new DuckDB JSON table functions, let's dive into some analytics!

GitHub Archive Examples

GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. Every hour, a GZIP compressed, newline-delimited JSON file containing all public events on GitHub is uploaded. I've downloaded a whole day (2023-02-08) of activity using wget and stored the 24 files in a directory called gharchive_gz.

wget https://data.gharchive.org/2023-02-08-0.json.gz
wget https://data.gharchive.org/2023-02-08-1.json.gz
...
wget https://data.gharchive.org/2023-02-08-23.json.gz

Keep in mind that the data is compressed:

du -sh gharchive_gz
2.3G  gharchive_gz
gunzip -dc gharchive_gz/* | wc -c
18396198934

One day of GitHub activity amounts to more than 18GB of JSON, which compresses to 2.3GB with GZIP.

To get a feel for what the data looks like, we run the following query:

SELECT json_group_structure(json)
FROM (
  SELECT *
  FROM read_ndjson_objects('gharchive_gz/*.json.gz')
  LIMIT 2048
);

Here, we use our read_ndjson_objects function, which reads the JSON objects in the file as raw JSON, i.e., as strings. The query reads the first 2048 records of JSON from the JSON files in the gharchive_gz directory and describes the structure. You can also directly query the JSON files from GH Archive using DuckDB's httpfs extension, but we will be querying the files multiple times, so it is better to download them in this case.
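
If you do want a quick look without downloading first, here is a sketch of querying a single hour directly over HTTP with the httpfs extension:

INSTALL httpfs;
LOAD httpfs;
-- count the events in a single hour of the archive, read directly over HTTP
SELECT count(*)
FROM read_ndjson_objects('https://data.gharchive.org/2023-02-08-0.json.gz');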

I've formatted the resulting JSON structure using an online JSON formatter & validator:

{
   "id":"VARCHAR",
   "type":"VARCHAR",
   "actor":{
      "id":"UBIGINT",
      "login":"VARCHAR",
      "display_login":"VARCHAR",
      "gravatar_id":"VARCHAR",
      "url":"VARCHAR",
      "avatar_url":"VARCHAR"
   },
   "repo":{
      "id":"UBIGINT",
      "name":"VARCHAR",
      "url":"VARCHAR"
   },
   "payload":{"..."},
   "public":"BOOLEAN",
   "created_at":"VARCHAR",
   "org":{
      "id":"UBIGINT",
      "login":"VARCHAR",
      "gravatar_id":"VARCHAR",
      "url":"VARCHAR",
      "avatar_url":"VARCHAR"
   }
}

I've left "payload" out because it consists of deeply nested JSON, and its formatted structure takes up more than 1000 lines!

So, how many records are we dealing with exactly? Let's count it using DuckDB:

SELECT count(*) count FROM 'gharchive_gz/*.json.gz';

| count |
|-------|
| 4434953 |

That's around 4.4M daily events, which amounts to almost 200K events per hour. This query takes around 7.3 seconds on my laptop, a 2020 MacBook Pro with an M1 chip and 16GB of memory. This is the time it takes to decompress the GZIP-compressed files and parse every JSON record.

To see how much time is spent decompressing GZIP in the query, I've also created a gharchive directory containing the same data but uncompressed. Running the same query on the uncompressed data takes around 5.4 seconds, almost 2 seconds faster. However, we now read more than 18GB from storage, as opposed to 2.3GB when compressed, so this comparison really depends on the speed of your storage. I prefer to keep the data compressed.

As a side note, the speed of this query really shows how fast yyjson is!

So, what kind of events are in the GitHub data?

SELECT type, count(*) count
FROM 'gharchive_gz/*.json.gz'
GROUP BY type
ORDER BY count DESC;

| type | count |
|------|-------|
| PushEvent | 2359096 |
| CreateEvent | 624062 |
| PullRequestEvent | 366090 |
| IssueCommentEvent | 238660 |
| WatchEvent | 231486 |
| DeleteEvent | 154383 |
| PullRequestReviewEvent | 131107 |
| IssuesEvent | 88917 |
| PullRequestReviewCommentEvent | 79540 |
| ForkEvent | 64233 |
| CommitCommentEvent | 36823 |
| ReleaseEvent | 23004 |
| MemberEvent | 14872 |
| PublicEvent | 14500 |
| GollumEvent | 8180 |

This query takes around 7.4s, not much more than the count(*) query. So as we can see, data analysis is very fast once everything has been decompressed and parsed.

The most common event type is the PushEvent, taking up more than half of all events, unsurprisingly, which is people pushing their committed code to GitHub. The least common event type is the GollumEvent, taking up less than 1% of all events, which is a creation or update of a wiki page.

If we want to analyze the same data multiple times, decompressing and parsing every time is redundant. Instead, we can create a DuckDB table like so:

CREATE TABLE events AS
SELECT * EXCLUDE (payload)
FROM 'gharchive_gz/*.json.gz';

This takes around 9 seconds if you're using an in-memory database. If you're using an on-disk database, it takes around 13 seconds and results in a database size of 444MB. When using an on-disk database, DuckDB ensures the table is persistent and performs all kinds of compression. Note that we have temporarily ignored the payload field using the convenient EXCLUDE clause.

To get a feel for what we read, we can ask DuckDB to describe the table:

DESCRIBE SELECT * FROM events;

This gives us the following:

| cid | name | type | notnull | dflt_value | pk |
|-----|------|------|---------|------------|----|
| 0 | id | BIGINT | false | | false |
| 1 | type | VARCHAR | false | | false |
| 2 | actor | STRUCT(id UBIGINT, login VARCHAR, display_login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR) | false | | false |
| 3 | repo | STRUCT(id UBIGINT, name VARCHAR, url VARCHAR) | false | | false |
| 4 | public | BOOLEAN | false | | false |
| 5 | created_at | TIMESTAMP | false | | false |
| 6 | org | STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR) | false | | false |

As we can see, the "actor", "repo" and "org" fields, which are JSON objects, have been converted to DuckDB structs. The "id" column was a string in the original JSON but has been converted to a BIGINT by DuckDB's automatic type detection. DuckDB can also detect a few different DATE/TIMESTAMP formats within JSON strings, as well as TIME and UUID.
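
The struct fields can be queried directly with dot notation, for example (a quick sketch):

-- the nested struct fields behave like regular columns
SELECT actor.login, repo.name, created_at
FROM events
LIMIT 5;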

Now that we've created the table, we can analyze it like any other DuckDB table! Let's see how much activity there was in the duckdb/duckdb GitHub repository on this specific day:

SELECT type, count(*) count
FROM events
WHERE repo.name = 'duckdb/duckdb'
GROUP BY type
ORDER BY count DESC;

| type | count |
|------|-------|
| PullRequestEvent | 35 |
| IssueCommentEvent | 30 |
| WatchEvent | 29 |
| PushEvent | 15 |
| PullRequestReviewEvent | 14 |
| IssuesEvent | 9 |
| PullRequestReviewCommentEvent | 7 |
| ForkEvent | 3 |

That's a lot of pull request activity! Note that this doesn't mean that 35 pull requests were opened on this day; activity within a pull request is also counted. If we search through the pull requests for that day, we see that there are only 15. This is more activity than normal because most of the DuckDB developers were busy fixing bugs for the 0.7.0 release.

Now, let's see who was the most active:

SELECT actor.login, count(*) count
FROM events
WHERE repo.name = 'duckdb/duckdb'
  AND type = 'PullRequestEvent'
GROUP BY actor.login
ORDER BY count desc
LIMIT 5;

| login | count |
|-------|-------|
| Mytherin | 19 |
| Mause | 4 |
| carlopi | 3 |
| Tmonster | 2 |
| lnkuiper | 2 |

As expected, Mark (Mytherin, co-founder of DuckDB Labs) was the most active! My activity (lnkuiper, software engineer at DuckDB Labs) also shows up.

Handling Inconsistent JSON Schemas

So far, we have ignored the "payload" of the events. We've ignored it because the contents of this field are different based on the type of event. We can see how they differ with the following query:

SELECT json_group_structure(payload) structure
FROM (SELECT *
  FROM read_json(
    'gharchive_gz/*.json.gz',
    columns={
      id: 'BIGINT',
      type: 'VARCHAR',
      actor: 'STRUCT(id UBIGINT,
                     login VARCHAR,
                     display_login VARCHAR,
                     gravatar_id VARCHAR,
                     url VARCHAR,
                     avatar_url VARCHAR)',
      repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
      payload: 'JSON',
      public: 'BOOLEAN',
      created_at: 'TIMESTAMP',
      org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
    },
    lines='true'
  )
  WHERE type = 'WatchEvent'
  LIMIT 2048
);

| structure |
|-----------|
| {"action":"VARCHAR"} |

The "payload" field is simple for events of type WatchEvent. However, if we change the type to PullRequestEvent, we get a JSON structure of more than 500 lines when formatted with a JSON formatter. We don't want to look through all those fields, so we cannot use our automatic schema detection, which will try to get them all. Instead, we can manually supply the structure of the fields we're interested in. DuckDB will skip reading the other fields. Another approach is to store the "payload" field as DuckDB's JSON data type and parse it at query time (see the example later in this post!).

I've stripped down the JSON structure for the "payload" of events with the type PullRequestEvent to the things I'm actually interested in:

{
   "action":"VARCHAR",
   "number":"UBIGINT",
   "pull_request":{
      "url":"VARCHAR",
      "id":"UBIGINT",
      "title":"VARCHAR",
      "user":{
         "login":"VARCHAR",
         "id":"UBIGINT",
      },
      "body":"VARCHAR",
      "created_at":"TIMESTAMP",
      "updated_at":"TIMESTAMP",
      "assignee":{
         "login":"VARCHAR",
         "id":"UBIGINT",
      },
      "assignees":[
         {
            "login":"VARCHAR",
            "id":"UBIGINT",
         }
      ],
  }
}

This is technically not valid JSON because there are trailing commas. However, we try to allow trailing commas wherever possible in DuckDB, including JSON!

We can now plug this into the columns parameter of read_json, but we need to convert it to a DuckDB type first. I'm lazy, so I prefer to let DuckDB do this for me:

SELECT typeof(json_transform('{}', '{
   "action":"VARCHAR",
   "number":"UBIGINT",
   "pull_request":{
      "url":"VARCHAR",
      "id":"UBIGINT",
      "title":"VARCHAR",
      "user":{
         "login":"VARCHAR",
         "id":"UBIGINT",
      },
      "body":"VARCHAR",
      "created_at":"TIMESTAMP",
      "updated_at":"TIMESTAMP",
      "assignee":{
         "login":"VARCHAR",
         "id":"UBIGINT",
      },
      "assignees":[
         {
            "login":"VARCHAR",
            "id":"UBIGINT",
         }
      ],
  }
}'));

This gives us back a DuckDB type that we can plug into our function! Note that because we are not auto-detecting the schema, we have to supply timestampformat to be able to parse the timestamps correctly. The key "user" must be surrounded by quotes because it is a reserved keyword in SQL:

CREATE TABLE pr_events as
SELECT *
FROM read_json(
  'gharchive_gz/*.json.gz',
  columns={
    id: 'BIGINT',
    type: 'VARCHAR',
    actor: 'STRUCT(id UBIGINT,
                   login VARCHAR,
                   display_login VARCHAR,
                   gravatar_id VARCHAR,
                   url VARCHAR,
                   avatar_url VARCHAR)',
    repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
    payload: 'STRUCT(
                action VARCHAR,
                number UBIGINT,
                pull_request STRUCT(
                  url VARCHAR,
                  id UBIGINT,
                  title VARCHAR,
                  "user" STRUCT(
                    login VARCHAR,
                    id UBIGINT
                  ),
                  body VARCHAR,
                  created_at TIMESTAMP,
                  updated_at TIMESTAMP,
                  assignee STRUCT(login VARCHAR, id UBIGINT),
                  assignees STRUCT(login VARCHAR, id UBIGINT)[]
                )
              )',
    public: 'BOOLEAN',
    created_at: 'TIMESTAMP',
    org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
  },
  json_format='records',
  lines='true',
  timestampformat='%Y-%m-%dT%H:%M:%SZ'
)
WHERE type = 'PullRequestEvent';

This query completes in around 36s with an on-disk database (resulting size is 478MB) and 9s with an in-memory database. If you don't care about preserving insertion order, you can speed the query up with this setting:

SET preserve_insertion_order = false;

With this setting, the query completes in around 27s with an on-disk database and 8.5s with an in-memory database. The difference between the on-disk and in-memory case is quite substantial here because DuckDB has to compress and persist much more data.

Now we can analyze pull request events! Let's see what the maximum number of assignees is:

SELECT max(length(payload.pull_request.assignees)) max_assignees
FROM pr_events;

| max_assignees |
|---------------|
| 10 |

That's a lot of people reviewing a single pull request!

We can check who was assigned the most:

WITH assignees AS (
  SELECT payload.pull_request.assignee.login assignee
  FROM pr_events
  UNION ALL
  SELECT unnest(payload.pull_request.assignees).login assignee
  FROM pr_events
)
SELECT assignee, count(*) count
FROM assignees
WHERE assignee NOT NULL
GROUP BY assignee
ORDER BY count DESC
LIMIT 5;

| assignee | count |
|----------|-------|
| poad | 494 |
| vinayakkulkarni | 268 |
| tmtmtmtm | 198 |
| fisker | 98 |
| icemac | 84 |

That's a lot of assignments! Although I suspect there are duplicates in here.
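
To check, here is a sketch that counts distinct pull requests per assignee rather than assignment events, using the pull request id from the payload:

WITH assignees AS (
  SELECT payload.pull_request.id pr_id,
         payload.pull_request.assignee.login assignee
  FROM pr_events
  UNION ALL
  SELECT payload.pull_request.id pr_id,
         unnest(payload.pull_request.assignees).login assignee
  FROM pr_events
)
SELECT assignee, count(DISTINCT pr_id) count
FROM assignees
WHERE assignee NOT NULL
GROUP BY assignee
ORDER BY count DESC
LIMIT 5;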

Storing as JSON to Parse at Query Time

Specifying the JSON schema of the "payload" field was helpful because it allowed us to directly analyze what is there, and subsequent queries are much faster. Still, it can also be quite cumbersome if the schema is complex. If you don't want to specify the schema of a field, you can set the type as 'JSON':

CREATE TABLE pr_events AS
SELECT *
FROM read_json(
  'gharchive_gz/*.json.gz',
  columns={
    id: 'BIGINT',
    type: 'VARCHAR',
    actor: 'STRUCT(id UBIGINT,
                   login VARCHAR,
                   display_login VARCHAR,
                   gravatar_id VARCHAR,
                   url VARCHAR,
                   avatar_url VARCHAR)',
    repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
    payload: 'JSON',
    public: 'BOOLEAN',
    created_at: 'TIMESTAMP',
    org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
  },
  json_format='records',
  lines='true',
  timestampformat='%Y-%m-%dT%H:%M:%SZ'
)
WHERE type = 'PullRequestEvent';

This will load the "payload" field as a JSON string, and we can use DuckDB's JSON functions to analyze it when querying. For example:

SELECT payload->>'action' AS action, count(*) count
FROM pr_events
GROUP BY action
ORDER BY count DESC;

The ->> arrow is short-hand for our json_extract_string function. Creating the entire "payload" field as a column with type JSON is not the most efficient way to get just the "action" field, but this example is just to show the flexibility of read_json. The query results in the following table:

| action | count |
|--------|-------|
| opened | 189096 |
| closed | 174914 |
| reopened | 2080 |

As we can see, only a few pull requests have been reopened.
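
Equivalently, the arrow shorthand can be replaced by a direct call to json_extract_string:

SELECT json_extract_string(payload, '$.action') AS action, count(*) count
FROM pr_events
GROUP BY action
ORDER BY count DESC;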

Conclusion

DuckDB tries to be an easy-to-use tool that can read all kinds of data formats. In the 0.7.0 release, we have added support for reading JSON. JSON comes in many formats and all kinds of schemas. DuckDB's rich support for nested types (LIST, STRUCT) allows it to fully 'shred' the JSON to a columnar format for more efficient analysis.

We are excited to hear what you think about our new JSON functionality. If you have any questions or suggestions, please reach out to us on Discord or GitHub!