The Return of the H2O.ai Database-like Ops Benchmark
TL;DR: We’ve resurrected the H2O.ai database-like ops benchmark with up to date libraries and plan to keep re-running it.
The H2O.ai Database-like Ops Benchmark is a well-known benchmark in the data analytics and R community. The benchmark measures the groupby and join performance of various analytical tools like data.table, polars, dplyr, clickhouse, duckdb and more. Since July 2nd 2021, the benchmark has been dormant, with no result updates or maintenance. Many of the analytical systems measured in the benchmark have since undergone substantial improvements, leaving many of the maintainers curious as to where their analytical tool ranks on the benchmark.
DuckDB has decided to give the H2O.ai benchmark new life and maintain it for the foreseeable future. One reason the DuckDB project has decided to maintain the benchmark is because DuckDB has had 10 new minor releases since the most recent published results on July 2nd, 2021. After managing to run parts of the benchmark on a r3-8xlarge AWS box, DuckDB ranked as a top performer on the benchmark. Additionally, the DuckDB project wants to demonstrate it’s commitment to performance by consistently comparing DuckDB with other analytical systems. While DuckDB delivers differentiated ease of use, raw performance and scalability are critically important for solving tough problems fast. Plus, just like many of our fellow data folks, we have a need for speed. Therefore, the decision was made to fork the benchmark, modernize underlying dependencies and run the benchmark on the latest versions of the included systems. You can find the repository here.
The results of the new benchmark are very interesting, but first a quick summary of the benchmark and what updates took place.
The H2O.ai Database-like Ops Benchmark
There are 5 basic grouping tests and 5 advanced grouping tests. The 10 grouping queries all focus on a combination of the following
- Low cardinality (a few big groups)
- High cardinality (lots of very small groups)
- Grouping integer types
- Grouping string types
Each query is run only twice with both results being reported. This way we can see the performance of a cold run and any effects data caching may have. The idea is to avoid reporting any potential “best” results on a hot system. Data analysts only need to run a query once to get their answer. No one drives to the store a second time to get another litre of milk faster.
The time reported is the sum of the time it takes to run all 5 queries twice.
More information about the specific queries can be found below.
The Data and Queries
The queries have not changed since the benchmark went dormant. The data is generated in a rather simple manner. Inspecting the datagen files you can see that the columns are generated with small, medium, and large groups of char and int values. Similar generation logic applies to the join data generation.
||Sum over large cardinality groups, grouped by varchar|
||Sum over medium cardinality groups, grouped by varchars|
||Sum and mean over many small cardinality groups, grouped by varchar|
||Mean over many large cardinality groups, grouped by integer|
||Sum over many small groups, grouped by integer|
|advanced groupby #1||
|advanced groupby #2||
||Range selection over small cardinality groups, grouped by integer|
|advanced groupby #3||
||Advanced group by query|
|advanced groupby #4||
||Arithmetic over medium sized groups, grouped by varchar, integer.|
|advanced groupby #5||
||Many many small groups, the number of groups is the cardinality of the dataset|
||Joining a large table (x) with a small-sized table on integer type|
||Joining a large table (x) with a medium-sized table on integer type|
||Left join a large table (x) with a medium-sized table on integer type|
||Join a large table (x) with a medium table on varchar type|
||Join a large table (x) with a large table on integer type.|
You can find more information about the queries in the Efficiency of Data Processing slides.
Modifications to the Benchmark & Hardware
No modifications have been made to the queries or the data generation. Some scripts required minor modifications so that the current version of the library could be run. The hardware used is slightly different as the exact AWS offering the benchmark previously used is no longer available. Base libraries have been updated as well. GPU libraries were not tested.
AWS is a m4.10xlarge
- CPU model: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
- CPU cores: 40
- RAM model: Unknown
- Memory: 160GB
- NO GPU specifications
- R upgraded, 4.0.0 -> 4.2.2
- Python upgraded 3.[6|7] -> 3.10
Changes made to install scripts of other systems
Pandas, Polars, Dask, and Clickhouse required changes to their setup/install scripts. The changes were relatively minor consisting mostly of syntax updates and data ingestion updates. Data ingestion did not affect the reporting timing results.
You can also look at the results here. DuckDB’s timings have improved significantly since v0.2.7 released over two years ago). A major contributor to our increased performance is parallel grouped aggregation, merged in March 2022, and parallel result set materialization. In addition, DuckDB now supports enum types, which makes DuckDB
group by aggregation even faster. Improvements to the out-of-core hash join were merged as well, further improving the performance of our joins.
Questions about certain results?
Some solutions may report internal errors for some queries. Feel free to investigate the errors by using the _utils/repro.sh script and file a github issue to resolve any confusion. In addition, there are many areas in the code where certain query results are automatically nullified. If you believe that is the case for a query for your system or if you have any other questions, you can create a github issue to discuss.
DuckDB will continue to maintain this benchmark for the forseeable future. The process for re-running the benchmarks with updated library versions must still be decided.
Do you have any other questions? Would you like to have your system added to the benchmark? Please feel free to read the ReadMe.md in the repository, and if you still have questions, you can reach out to me at [email protected] or on our Discord!back to news archive