TLDR: DuckDB, a free and open source analytical data management system, can run SQL queries directly on Parquet files and automatically take advantage of the advanced features of the Parquet format.
Apache Parquet is the most common “Big Data” storage format for analytics. In Parquet files, data is stored in a columnar-compressed binary format. Each Parquet file stores a single table. The table is partitioned into row groups, which each contain a subset of the rows of the table. Within a row group, the table data is stored in a columnar fashion.→ continue reading
TLDR: DuckDB, a free and open source analytical data management system, can efficiently run SQL queries directly on Pandas DataFrames.
Recently, an article was published advocating for using SQL for Data Analysis. Here at team DuckDB, we are huge fans of SQL. It is a versatile and flexible language that allows the user to efficiently perform a wide variety of data transformations, without having to care about how the data is physically represented or how to do these data transformations in the most optimal way.→ continue reading
TLDR: DuckDB now has full-text search functionality, similar to the FTS5 extension in SQLite. The main difference is that our FTS extension is fully formulated in SQL. We tested it out on TREC disks 4 and 5.
Searching through textual data stored in a database can be cumbersome, as SQL does not provide a good way of formulating questions such as “Give me all the documents about Mallard Ducks”: string patterns with
LIKE will only get you so far. Despite SQL’s shortcomings here, storing textual data in a database is commonplace. Consider the table
products (id INT, name VARCHAR, description VARCHAR) - it would be useful to search through the
description columns for a website that sells these products.