You can use DuckDB to analyze Git logs using the output of the git log command.
Exporting the Git Log
We start by picking a character that doesn't occur in any part of the commit log (author names, messages, etc). Since version v1.2.0, DuckDB's CSV reader supports 4-byte delimiters, making it possible to use emojis! π
Despite being featured in the Emoji Movie (IMDb rating: 3.4),
we can assume that the Fish Cake with Swirl emoji (π₯) is not a common occurrence in most Git logs.
So, let's clone the duckdb/duckdb repository and export its log as follows:
git log --date=iso-strict --pretty=format:%adπ₯%hπ₯%anπ₯%s > git-log.csv
The resulting file looks like this:
2025-02-25T18:12:54+01:00π₯d608a31e13π₯Markπ₯MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation (#16400)
2025-02-25T15:05:56+01:00π₯920b39ad96π₯Markπ₯Read support for Parquet Float16 (#16395)
2025-02-25T13:43:52+01:00π₯61f55734b9π₯Carlo Piovesanπ₯MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation
2025-02-25T12:35:28+01:00π₯87eff7ebd3π₯Markπ₯Fix issue #16377 (#16391)
2025-02-25T10:33:49+01:00π₯35af26476eπ₯Hannes MΓΌhleisenπ₯Read support for Parquet Float16
Loading the Git Log into DuckDB
Start DuckDB and read the log as a CSV π₯SV:
CREATE TABLE commits AS 
    FROM read_csv(
            'git-log.csv',
            delim = 'π₯',
            header = false,
            column_names = ['timestamp', 'hash', 'author', 'message']
        );
This will result in a nice DuckDB table:
FROM commits
LIMIT 5;
βββββββββββββββββββββββ¬βββββββββββββ¬βββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β      timestamp      β    hash    β      author      β                                    message                                    β
β      timestamp      β  varchar   β     varchar      β                                    varchar                                    β
βββββββββββββββββββββββΌβββββββββββββΌβββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β 2025-02-25 17:12:54 β d608a31e13 β Mark             β MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation (#16400) β
β 2025-02-25 14:05:56 β 920b39ad96 β Mark             β Read support for Parquet Float16 (#16395)                                     β
β 2025-02-25 12:43:52 β 61f55734b9 β Carlo Piovesan   β MAIN_BRANCH_VERSIONING: Adopt also for Python build and amalgamation          β
β 2025-02-25 11:35:28 β 87eff7ebd3 β Mark             β Fix issue #16377 (#16391)                                                     β
β 2025-02-25 09:33:49 β 35af26476e β Hannes MΓΌhleisen β Read support for Parquet Float16                                              β
βββββββββββββββββββββββ΄βββββββββββββ΄βββββββββββββββββββ΄ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Analyzing the Log
We can analyze the table as any other in DuckDB.
Common Topics
Let's start with a simple question: which topic was the most commonly mentioned in the commit messages: CI, CLI, or Python?
SELECT
    message.lower().regexp_extract('\b(ci|cli|python)\b') AS topic,
    count(*) AS num_commits
FROM commits
WHERE topic <> ''
GROUP BY ALL
ORDER BY num_commits DESC;
βββββββββββ¬ββββββββββββββ
β  topic  β num_commits β
β varchar β    int64    β
βββββββββββΌββββββββββββββ€
β ci      β         828 β
β python  β         666 β
β cli     β          49 β
βββββββββββ΄ββββββββββββββ
Out of these three topics, commits related to continuous integration dominate the log!
We can also do a more exploratory analysis by looking at all words in the commit messages. To do so, we first tokenize the messages:
CREATE TABLE words AS
    SELECT unnest(
        message
            .lower()
            .regexp_replace('\W', ' ')
            .trim(' ')
            .string_split_regex('\W')
        ) AS word    
FROM commits;
Then, we remove stopwords using a pre-defined list:
CREATE TABLE stopwords AS
    SELECT unnest(['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'did', 'do', 'does', 'doing', 'don', 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'itself', 'just', 'me', 'more', 'most', 'my', 'myself', 'no', 'nor', 'not', 'now', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 's', 'same', 'she', 'should', 'so', 'some', 'such', 't', 'than', 'that', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', 'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will', 'with', 'you', 'your', 'yours', 'yourself', 'yourselves']) AS word;
CREATE OR REPLACE TABLE words AS
    FROM words
    NATURAL ANTI JOIN stopwords
    WHERE word != '';
We use the
NATURAL ANTI JOINclause here, which allows us to elegantly filter out values that occur in thestopwordstable.
Finally, we select the top-20 most common words.
SELECT word, count(*) AS count FROM words
GROUP BY ALL
ORDER BY count DESC
LIMIT 20;
ββββββββββββ¬ββββββββ
β    w     β count β
β varchar  β int64 β
ββββββββββββΌββββββββ€
β merge    β 12550 β
β fix      β  6402 β
β branch   β  6005 β
β pull     β  5950 β
β request  β  5945 β
β add      β  5687 β
β test     β  3801 β
β master   β  3289 β
β tests    β  2339 β
β issue    β  1971 β
β main     β  1935 β
β remove   β  1884 β
β format   β  1819 β
β duckdb   β  1710 β
β use      β  1442 β
β mytherin β  1410 β
β fixes    β  1333 β
β hawkfish β  1147 β
β feature  β  1139 β
β function β  1088 β
ββββββββββββ΄ββββββββ€
β     20 rows      β
ββββββββββββββββββββ
As expected, there are many Git terms (merge, branch, pull, etc.), followed by terminology related to development (fix, test/tests, issue, format).
We also see the account names of some developers (mytherin, hawkfish), which are likely there due to commit message for merging pull requests (e.g., βMerge pull request #13776 from Mytherin/expressiondepthβ).
Finally, we also see some DuckDB-related terms such as duckdb (shocking!) and function.
Visualizing the Number of Commits
Let's visualize the number of commits each year:
SELECT
    year(timestamp) AS year,
    count(*) AS num_commits,
    num_commits.bar(0, 20_000) AS num_commits_viz
FROM commits
GROUP BY ALL
ORDER BY ALL;
βββββββββ¬ββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β year  β num_commits β                                 num_commits_viz                                  β
β int64 β    int64    β                                     varchar                                      β
βββββββββΌββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β  2018 β         870 β ββββ                                                                             β
β  2019 β        1621 β βββββββ                                                                          β
β  2020 β        3484 β ββββββββββββββ                                                                   β
β  2021 β        6488 β ββββββββββββββββββββββββββ                                                       β
β  2022 β        9817 β ββββββββββββββββββββββββββββββββββββββββ                                         β
β  2023 β       14585 β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ                      β
β  2024 β       15949 β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ                 β
β  2025 β        1788 β ββββββββ                                                                         β
βββββββββ΄ββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
We see a steady growth over the years β especially considering that many of DuckDB's functionalities and clients, which were originally part of the main repository, are now maintained in separate repositories (e.g., Java, R).
Happy hacking!