You can use DuckDB to analyze Git logs using the output of the git log
command.
We start by using a trick: we select a character that doesn't occur in any part of commit log (author names, messages, etc). Since version v1.2.0, DuckDB's CSV reader supports 4-byte delimiters, making it possible to use emojis.
We assume that the Fish Cake with Swirl emoji (π₯) is not a common occurrence in codebases and export the log as follows:
git log --date=iso-strict --pretty=format:%adπ₯%hπ₯%anπ₯%s > git-log.csv
The resulting file looks like this:
2025-02-21T16:59:07+01:00π₯cd0d0da9b1π₯Markπ₯docs: tweak explanation of median for even cardinality inputs (#13726)
2025-02-21T16:53:53+01:00π₯c8a53c3307π₯Markπ₯Linux CLI: override platform for ARM manylinux (#16347)
2025-02-21T16:53:29+01:00π₯0e98b26009π₯Markπ₯[chore] No ccache for OSX Python (#16348)
2025-02-21T15:01:33+01:00π₯96db35892aπ₯Carlo Piovesanπ₯[chore] No ccache for OSX Python
2025-02-21T14:26:56+01:00π₯4addad3422π₯Carlo Piovesanπ₯Linux CLI: override platform for ARM manylinux
Start DuckDB and read the log as a CSV (or βπ₯SVβ) file:
CREATE TABLE commits AS
FROM read_csv(
'git-log.csv',
delim = 'π₯',
header = false,
column_names = ['timestamp', 'hash', 'author', 'message']
);
This will result in a nice DuckDB table:
SELECT timestamp, author, message
FROM commits
LIMIT 5;
βββββββββββββββββββββββ¬βββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β timestamp β author β message β
β timestamp β varchar β varchar β
βββββββββββββββββββββββΌβββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β 2025-02-21 15:59:07 β Mark β docs: tweak explanation of median for even cardinality inputs (#13726) β
β 2025-02-21 15:53:53 β Mark β Linux CLI: override platform for ARM manylinux (#16347) β
β 2025-02-21 15:53:29 β Mark β [chore] No ccache for OSX Python (#16348) β
β 2025-02-21 14:01:33 β Carlo Piovesan β [chore] No ccache for OSX Python β
β 2025-02-21 13:26:56 β Carlo Piovesan β Linux CLI: override platform for ARM manylinux β
βββββββββββββββββββββββ΄βββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
We can analyze this table as any other in DuckDB. For example, we can ask: which topics were most commonly mentioned in the commit messages β CI, CLI, or Python?
SELECT
message.lower().regexp_extract('\b(ci|cli|python)\b') AS topic,
count(*) AS num_commits
FROM commits
WHERE topic <> ''
GROUP BY ALL
ORDER BY num_commits DESC;
βββββββββββ¬ββββββββββββββ
β topic β num_commits β
β varchar β int64 β
βββββββββββΌββββββββββββββ€
β ci β 826 β
β python β 664 β
β cli β 47 β
βββββββββββ΄ββββββββββββββ
We can also visualize the number of commits each year:
SELECT
year(timestamp) AS year,
count(*) AS num_commits,
num_commits.bar(0, 20_000) AS num_commits_viz
FROM commits
GROUP BY ALL
ORDER BY ALL;
βββββββββ¬ββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β year β num_commits β num_commits_viz β
β int64 β int64 β varchar β
βββββββββΌββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β 2018 β 870 β ββββ β
β 2019 β 1621 β βββββββ β
β 2020 β 3484 β ββββββββββββββ β
β 2021 β 6488 β ββββββββββββββββββββββββββ β
β 2022 β 9817 β ββββββββββββββββββββββββββββββββββββββββ β
β 2023 β 14585 β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β 2024 β 15949 β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β 2025 β 1680 β βββββββ β
βββββββββ΄ββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Enjoy!