Vectors represent a horizontal slice of a column. They hold a number of values of a specific type, similar to an array. Vectors are the core data representation used in DuckDB. Vectors are typically stored within data chunks.
The vector and data chunk interfaces are the most efficient way of interacting with DuckDB, allowing for the highest performance. However, the interfaces are also difficult to use and care must be taken when using them.
Vector Format
Vectors are arrays of a specific data type. The logical type of a vector can be obtained using duckdb_vector_get_column_type. The type id of the logical type can then be obtained using duckdb_get_type_id.
Vectors themselves do not have sizes. Instead, the parent data chunk has a size (that can be obtained through duckdb_data_chunk_get_size). All vectors that belong to a data chunk have the same size.
Primitive Types
For primitive types, the underlying array can be obtained using the duckdb_vector_get_data method. The array can then be accessed using the correct native type. Below is a table that contains a mapping of the duckdb_type to the native type of the array.
duckdb_type | NativeType |
---|---|
DUCKDB_TYPE_BOOLEAN | bool |
DUCKDB_TYPE_TINYINT | int8_t |
DUCKDB_TYPE_SMALLINT | int16_t |
DUCKDB_TYPE_INTEGER | int32_t |
DUCKDB_TYPE_BIGINT | int64_t |
DUCKDB_TYPE_UTINYINT | uint8_t |
DUCKDB_TYPE_USMALLINT | uint16_t |
DUCKDB_TYPE_UINTEGER | uint32_t |
DUCKDB_TYPE_UBIGINT | uint64_t |
DUCKDB_TYPE_FLOAT | float |
DUCKDB_TYPE_DOUBLE | double |
DUCKDB_TYPE_TIMESTAMP | duckdb_timestamp |
DUCKDB_TYPE_DATE | duckdb_date |
DUCKDB_TYPE_TIME | duckdb_time |
DUCKDB_TYPE_INTERVAL | duckdb_interval |
DUCKDB_TYPE_HUGEINT | duckdb_hugeint |
DUCKDB_TYPE_UHUGEINT | duckdb_uhugeint |
DUCKDB_TYPE_VARCHAR | duckdb_string_t |
DUCKDB_TYPE_BLOB | duckdb_string_t |
DUCKDB_TYPE_TIMESTAMP_S | duckdb_timestamp |
DUCKDB_TYPE_TIMESTAMP_MS | duckdb_timestamp |
DUCKDB_TYPE_TIMESTAMP_NS | duckdb_timestamp |
DUCKDB_TYPE_UUID | duckdb_hugeint |
DUCKDB_TYPE_TIME_TZ | duckdb_time_tz |
DUCKDB_TYPE_TIMESTAMP_TZ | duckdb_timestamp |
NULL Values
Any value in a vector can be NULL. When a value is NULL, the value contained within the primary array at that index is undefined (and can be uninitialized). Whether a value is NULL is recorded in the validity mask, a bitmask consisting of uint64_t elements. For every 64 values in the vector, one uint64_t element exists (rounded up). The validity mask has its bit set to 1 if the value is valid, or set to 0 if the value is invalid (i.e., NULL).
The bits of the bitmask can be read directly, or the slower helper method duckdb_validity_row_is_valid can be used to check whether or not a value is NULL.
The duckdb_vector_get_validity function returns a pointer to the validity mask. Note that if all values in a vector are valid, this function might return a null pointer, in which case the validity mask does not need to be checked.
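Reading the bitmask directly can be sketched as follows. This is a minimal, self-contained illustration that operates on a plain uint64_t array rather than on a real DuckDB vector; the function names row_is_valid and validity_entry_count are hypothetical helpers, not part of the DuckDB API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of uint64_t entries needed to cover row_count rows (rounded up). */
uint64_t validity_entry_count(uint64_t row_count) {
    return (row_count + 63) / 64;
}

/* Returns true if `row` is valid (not NULL).
 * A null `validity` pointer is treated as "all rows are valid". */
bool row_is_valid(const uint64_t *validity, uint64_t row, uint64_t row_count) {
    assert(row < row_count); /* reading past the chunk size is a caller bug */
    if (validity == NULL) {
        return true;
    }
    uint64_t entry_idx = row / 64;    /* which uint64_t holds the bit */
    uint64_t idx_in_entry = row % 64; /* position of the bit inside it */
    return ((validity[entry_idx] >> idx_in_entry) & 1u) != 0;
}
```

In real code, the pointer would come from duckdb_vector_get_validity and the row count from duckdb_data_chunk_get_size.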
Strings
String values are stored as a duckdb_string_t. This is a special struct that stores the string inline (if it is short, i.e., <= 12 bytes) or a pointer to the string data if it is longer than 12 bytes.
typedef struct {
    union {
        struct {
            uint32_t length;
            char prefix[4];
            char *ptr;
        } pointer;
        struct {
            uint32_t length;
            char inlined[12];
        } inlined;
    } value;
} duckdb_string_t;
Whether a string is inlined can be determined either by checking the length directly, or by using the duckdb_string_is_inlined function.
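The two storage forms can be handled with a small accessor, sketched below. To stay self-contained, the sketch declares example_string_t, a local mirror of the layout shown above; real code would include duckdb.h and use duckdb_string_t directly. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the duckdb_string_t layout, for illustration only. */
typedef struct {
    union {
        struct { uint32_t length; char prefix[4]; char *ptr; } pointer;
        struct { uint32_t length; char inlined[12]; } inlined;
    } value;
} example_string_t;

/* Both union members start with `length`, so either one can be read. */
uint32_t example_string_length(const example_string_t *s) {
    assert(s != NULL);
    return s->value.inlined.length;
}

/* Returns the start of the character data: the inline buffer for strings
 * of at most 12 bytes, the external pointer otherwise. The data should be
 * treated as length-delimited, as the examples in this document do by
 * printing with "%.*s". */
const char *example_string_data(const example_string_t *s) {
    if (example_string_length(s) <= 12) {
        return s->value.inlined.inlined;
    }
    return s->value.pointer.ptr;
}
```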
Decimals
Decimals are stored as integer values internally. The exact native type depends on the width of the decimal type, as shown in the following table:
Width | NativeType |
---|---|
<= 4 | int16_t |
<= 9 | int32_t |
<= 18 | int64_t |
<= 38 | duckdb_hugeint |
The duckdb_decimal_internal_type function can be used to obtain the internal type of the decimal.
Decimals are stored as integer values multiplied by 10^scale. The scale of a decimal can be obtained using duckdb_decimal_scale. For example, a decimal value of 10.5 with type DECIMAL(8, 3) is stored internally as an int32_t value of 10500. In order to obtain the correct decimal value, the stored value should be divided by 10^scale.
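The conversion described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes the stored value fits in an int64_t (widths up to 18); in real code the scale would be obtained from duckdb_decimal_scale. Converting to double loses exactness for large values, so the sketch is intended for display purposes only.

```c
#include <assert.h>
#include <stdint.h>

/* Computes 10^scale as a double. DuckDB decimals have a width of at
 * most 38, so the scale cannot exceed 38 either. */
double example_pow10(uint8_t scale) {
    assert(scale <= 38);
    double result = 1.0;
    for (uint8_t i = 0; i < scale; i++) {
        result *= 10.0;
    }
    return result;
}

/* Converts the internally stored integer back to its decimal value. */
double example_decimal_to_double(int64_t stored, uint8_t scale) {
    return (double)stored / example_pow10(scale);
}
```

For the DECIMAL(8, 3) example, example_decimal_to_double(10500, 3) yields 10.5.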
Enums
Enums are stored as unsigned integer values internally. The exact native type depends on the size of the enum dictionary, as shown in the following table:
Dictionary size | NativeType |
---|---|
<= 255 | uint8_t |
<= 65535 | uint16_t |
<= 4294967295 | uint32_t |
The duckdb_enum_internal_type function can be used to obtain the internal type of the enum.
In order to obtain the actual string value of the enum, the duckdb_enum_dictionary_value function must be used to obtain the enum value that corresponds to the given dictionary entry. Note that the enum dictionary is the same for the entire column, and so only needs to be constructed once.
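Selecting the right native type when reading enum data can be sketched as below. The example_enum_storage type and both helpers are hypothetical stand-ins written for this illustration; in real code the internal type would come from duckdb_enum_internal_type, and the resulting index would be passed to duckdb_enum_dictionary_value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the three unsigned types an enum can use. */
typedef enum { EXAMPLE_UINT8, EXAMPLE_UINT16, EXAMPLE_UINT32 } example_enum_storage;

/* Picks the storage type from the dictionary size, following the table above. */
example_enum_storage example_enum_storage_for(uint64_t dictionary_size) {
    assert(dictionary_size <= UINT32_MAX);
    if (dictionary_size <= UINT8_MAX) {
        return EXAMPLE_UINT8;
    }
    if (dictionary_size <= UINT16_MAX) {
        return EXAMPLE_UINT16;
    }
    return EXAMPLE_UINT32;
}

/* Reads the dictionary index stored for `row` from the raw data pointer. */
uint32_t example_enum_index(const void *data, example_enum_storage storage, uint64_t row) {
    switch (storage) {
    case EXAMPLE_UINT8:
        return ((const uint8_t *)data)[row];
    case EXAMPLE_UINT16:
        return ((const uint16_t *)data)[row];
    default:
        return ((const uint32_t *)data)[row];
    }
}
```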
Structs
Structs are nested types that contain any number of child types. Think of them like a struct in C. The way to access struct data using vectors is to access the child vectors recursively using the duckdb_struct_vector_get_child method.
The struct vector itself does not have any data (i.e., you should not use the duckdb_vector_get_data method on the struct). However, the struct vector itself does have a validity mask. The reason for this is that the child elements of a struct can be NULL, but the struct itself can also be NULL.
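Because NULL can occur at two levels, reading a struct field means checking the struct's validity mask first and the child's mask second. The sketch below shows that order on plain uint64_t arrays; the type and function names are hypothetical, and a null mask pointer is treated as "all valid".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    EXAMPLE_STRUCT_IS_NULL, /* the whole struct is NULL for this row */
    EXAMPLE_FIELD_IS_NULL,  /* the struct is present, the field is NULL */
    EXAMPLE_FIELD_IS_VALID  /* the field can be read from the child data */
} example_field_state;

/* Tests one bit of a validity mask; a null mask means every row is valid. */
bool example_mask_row_is_valid(const uint64_t *mask, uint64_t row) {
    return mask == NULL || ((mask[row / 64] >> (row % 64)) & 1u) != 0;
}

/* Checks the parent mask before the child mask, since the child entry
 * carries no meaning when the struct itself is NULL. */
example_field_state example_struct_field_state(const uint64_t *struct_validity,
                                               const uint64_t *child_validity,
                                               uint64_t row, uint64_t row_count) {
    assert(row < row_count);
    if (!example_mask_row_is_valid(struct_validity, row)) {
        return EXAMPLE_STRUCT_IS_NULL;
    }
    if (!example_mask_row_is_valid(child_validity, row)) {
        return EXAMPLE_FIELD_IS_NULL;
    }
    return EXAMPLE_FIELD_IS_VALID;
}
```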
Lists
Lists are nested types that contain a single child type, repeated x times per row. Think of them like a variable-length array in C. The way to access list data using vectors is to access the child vector using the duckdb_list_vector_get_child method.
The duckdb_vector_get_data method must be called on the list vector to get the offsets and lengths of the lists, stored as duckdb_list_entry values, which can then be applied to the child vector.
typedef struct {
    uint64_t offset;
    uint64_t length;
} duckdb_list_entry;
Note that both the list entries themselves and any children stored in the lists can be NULL. This must be checked using the respective validity masks of the list vector and the child vector.
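Applying an offset and length to the child data, with both validity checks, can be sketched as follows. The sketch uses plain arrays and a local example_list_entry mirror of the struct above so that it compiles without DuckDB; all names are hypothetical. The child_size parameter stands for the value that duckdb_list_vector_get_size would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of duckdb_list_entry, for illustration only. */
typedef struct {
    uint64_t offset;
    uint64_t length;
} example_list_entry;

/* Tests one bit of a validity mask; a null mask means every entry is valid. */
bool example_list_bit_is_valid(const uint64_t *mask, uint64_t idx) {
    return mask == NULL || ((mask[idx / 64] >> (idx % 64)) & 1u) != 0;
}

/* Sums the non-NULL children of the list stored at `row`.
 * Returns false (leaving *sum_out untouched) if the list itself is NULL. */
bool example_sum_list(const example_list_entry *entries, const uint64_t *list_validity,
                      const int64_t *child_data, const uint64_t *child_validity,
                      uint64_t child_size, uint64_t row, int64_t *sum_out) {
    if (!example_list_bit_is_valid(list_validity, row)) {
        return false;
    }
    example_list_entry entry = entries[row];
    assert(entry.offset + entry.length <= child_size);
    int64_t sum = 0;
    /* the child mask is indexed by position in the child vector, not by row */
    for (uint64_t i = entry.offset; i < entry.offset + entry.length; i++) {
        if (example_list_bit_is_valid(child_validity, i)) {
            sum += child_data[i];
        }
    }
    *sum_out = sum;
    return true;
}
```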
Arrays
Arrays are nested types that contain a single child type, repeated exactly array_size times per row. Think of them like a fixed-size array in C. Arrays work exactly the same as lists, except the length and offset of each entry is fixed. The fixed array size can be obtained by using duckdb_array_type_array_size. The data for entry n then resides at offset = n * array_size, and always has length = array_size.
Note that much like lists, arrays can still be NULL, which must be checked using the validity mask.
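The fixed offset arithmetic can be sketched as below, again on plain arrays. The helper names are hypothetical; in real code array_size would come from duckdb_array_type_array_size and child_data from calling duckdb_vector_get_data on the child vector.

```c
#include <assert.h>
#include <stdint.h>

/* Index into the child vector of element `element` of the array at `row`. */
uint64_t example_array_child_index(uint64_t row, uint64_t array_size, uint64_t element) {
    assert(element < array_size);
    return row * array_size + element;
}

/* Reads one element of a fixed-size BIGINT array from the child data. */
int64_t example_array_element(const int64_t *child_data, uint64_t row,
                              uint64_t array_size, uint64_t element) {
    return child_data[example_array_child_index(row, array_size, element)];
}
```

The same index would also be used when consulting the child vector's validity mask.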
Examples
Below are several full end-to-end examples of how to interact with vectors.
Example: Reading an int64 Vector with NULL Values
duckdb_database db;
duckdb_connection con;
duckdb_open(NULL, &db);
duckdb_connect(db, &con);
duckdb_result res;
duckdb_query(con, "SELECT CASE WHEN i%2=0 THEN NULL ELSE i END res_col FROM range(10) t(i)", &res);
// iterate until result is exhausted
while (true) {
    duckdb_data_chunk result = duckdb_fetch_chunk(res);
    if (!result) {
        // result is exhausted
        break;
    }
    // get the number of rows from the data chunk
    idx_t row_count = duckdb_data_chunk_get_size(result);
    // get the first column
    duckdb_vector res_col = duckdb_data_chunk_get_vector(result, 0);
    // get the native array and the validity mask of the vector
    int64_t *vector_data = (int64_t *) duckdb_vector_get_data(res_col);
    uint64_t *vector_validity = duckdb_vector_get_validity(res_col);
    // iterate over the rows
    for (idx_t row = 0; row < row_count; row++) {
        if (duckdb_validity_row_is_valid(vector_validity, row)) {
            // cast so that the argument matches the %lld format on every platform
            printf("%lld\n", (long long) vector_data[row]);
        } else {
            printf("NULL\n");
        }
    }
    duckdb_destroy_data_chunk(&result);
}
// clean-up
duckdb_destroy_result(&res);
duckdb_disconnect(&con);
duckdb_close(&db);
Example: Reading a String Vector
duckdb_database db;
duckdb_connection con;
duckdb_open(NULL, &db);
duckdb_connect(db, &con);
duckdb_result res;
duckdb_query(con, "SELECT CASE WHEN i%2=0 THEN CONCAT('short_', i) ELSE CONCAT('longstringprefix', i) END FROM range(10) t(i)", &res);
// iterate until result is exhausted
while (true) {
    duckdb_data_chunk result = duckdb_fetch_chunk(res);
    if (!result) {
        // result is exhausted
        break;
    }
    // get the number of rows from the data chunk
    idx_t row_count = duckdb_data_chunk_get_size(result);
    // get the first column
    duckdb_vector res_col = duckdb_data_chunk_get_vector(result, 0);
    // get the native array and the validity mask of the vector
    duckdb_string_t *vector_data = (duckdb_string_t *) duckdb_vector_get_data(res_col);
    uint64_t *vector_validity = duckdb_vector_get_validity(res_col);
    // iterate over the rows
    for (idx_t row = 0; row < row_count; row++) {
        if (duckdb_validity_row_is_valid(vector_validity, row)) {
            duckdb_string_t str = vector_data[row];
            if (duckdb_string_is_inlined(str)) {
                // use inlined string; the precision argument of %.*s must be an int
                printf("%.*s\n", (int) str.value.inlined.length, str.value.inlined.inlined);
            } else {
                // follow string pointer
                printf("%.*s\n", (int) str.value.pointer.length, str.value.pointer.ptr);
            }
        } else {
            printf("NULL\n");
        }
    }
    duckdb_destroy_data_chunk(&result);
}
// clean-up
duckdb_destroy_result(&res);
duckdb_disconnect(&con);
duckdb_close(&db);
Example: Reading a Struct Vector
duckdb_database db;
duckdb_connection con;
duckdb_open(NULL, &db);
duckdb_connect(db, &con);
duckdb_result res;
duckdb_query(con, "SELECT CASE WHEN i%5=0 THEN NULL ELSE {'col1': i, 'col2': CASE WHEN i%2=0 THEN NULL ELSE 100 + i * 42 END} END FROM range(10) t(i)", &res);
// iterate until result is exhausted
while (true) {
    duckdb_data_chunk result = duckdb_fetch_chunk(res);
    if (!result) {
        // result is exhausted
        break;
    }
    // get the number of rows from the data chunk
    idx_t row_count = duckdb_data_chunk_get_size(result);
    // get the struct column
    duckdb_vector struct_col = duckdb_data_chunk_get_vector(result, 0);
    uint64_t *struct_validity = duckdb_vector_get_validity(struct_col);
    // get the child columns of the struct
    duckdb_vector col1_vector = duckdb_struct_vector_get_child(struct_col, 0);
    int64_t *col1_data = (int64_t *) duckdb_vector_get_data(col1_vector);
    uint64_t *col1_validity = duckdb_vector_get_validity(col1_vector);
    duckdb_vector col2_vector = duckdb_struct_vector_get_child(struct_col, 1);
    int64_t *col2_data = (int64_t *) duckdb_vector_get_data(col2_vector);
    uint64_t *col2_validity = duckdb_vector_get_validity(col2_vector);
    // iterate over the rows
    for (idx_t row = 0; row < row_count; row++) {
        if (!duckdb_validity_row_is_valid(struct_validity, row)) {
            // entire struct is NULL
            printf("NULL\n");
            continue;
        }
        // read col1
        printf("{'col1': ");
        if (!duckdb_validity_row_is_valid(col1_validity, row)) {
            // col1 is NULL
            printf("NULL");
        } else {
            printf("%lld", (long long) col1_data[row]);
        }
        // read col2
        printf(", 'col2': ");
        if (!duckdb_validity_row_is_valid(col2_validity, row)) {
            // col2 is NULL
            printf("NULL");
        } else {
            printf("%lld", (long long) col2_data[row]);
        }
        printf("}\n");
    }
    duckdb_destroy_data_chunk(&result);
}
// clean-up
duckdb_destroy_result(&res);
duckdb_disconnect(&con);
duckdb_close(&db);
Example: Reading a List Vector
duckdb_database db;
duckdb_connection con;
duckdb_open(NULL, &db);
duckdb_connect(db, &con);
duckdb_result res;
duckdb_query(con, "SELECT CASE WHEN i % 5 = 0 THEN NULL WHEN i % 2 = 0 THEN [i, i + 1] ELSE [i * 42, NULL, i * 84] END FROM range(10) t(i)", &res);
// iterate until result is exhausted
while (true) {
    duckdb_data_chunk result = duckdb_fetch_chunk(res);
    if (!result) {
        // result is exhausted
        break;
    }
    // get the number of rows from the data chunk
    idx_t row_count = duckdb_data_chunk_get_size(result);
    // get the list column
    duckdb_vector list_col = duckdb_data_chunk_get_vector(result, 0);
    duckdb_list_entry *list_data = (duckdb_list_entry *) duckdb_vector_get_data(list_col);
    uint64_t *list_validity = duckdb_vector_get_validity(list_col);
    // get the child column of the list
    duckdb_vector list_child = duckdb_list_vector_get_child(list_col);
    int64_t *child_data = (int64_t *) duckdb_vector_get_data(list_child);
    uint64_t *child_validity = duckdb_vector_get_validity(list_child);
    // iterate over the rows
    for (idx_t row = 0; row < row_count; row++) {
        if (!duckdb_validity_row_is_valid(list_validity, row)) {
            // entire list is NULL
            printf("NULL\n");
            continue;
        }
        // read the list offsets for this row
        duckdb_list_entry list = list_data[row];
        printf("[");
        for (idx_t child_idx = list.offset; child_idx < list.offset + list.length; child_idx++) {
            if (child_idx > list.offset) {
                printf(", ");
            }
            if (!duckdb_validity_row_is_valid(child_validity, child_idx)) {
                // child element is NULL
                printf("NULL");
            } else {
                printf("%lld", (long long) child_data[child_idx]);
            }
        }
        printf("]\n");
    }
    duckdb_destroy_data_chunk(&result);
}
// clean-up
duckdb_destroy_result(&res);
duckdb_disconnect(&con);
duckdb_close(&db);
API Reference Overview
duckdb_logical_type duckdb_vector_get_column_type(duckdb_vector vector);
void *duckdb_vector_get_data(duckdb_vector vector);
uint64_t *duckdb_vector_get_validity(duckdb_vector vector);
void duckdb_vector_ensure_validity_writable(duckdb_vector vector);
void duckdb_vector_assign_string_element(duckdb_vector vector, idx_t index, const char *str);
void duckdb_vector_assign_string_element_len(duckdb_vector vector, idx_t index, const char *str, idx_t str_len);
duckdb_vector duckdb_list_vector_get_child(duckdb_vector vector);
idx_t duckdb_list_vector_get_size(duckdb_vector vector);
duckdb_state duckdb_list_vector_set_size(duckdb_vector vector, idx_t size);
duckdb_state duckdb_list_vector_reserve(duckdb_vector vector, idx_t required_capacity);
duckdb_vector duckdb_struct_vector_get_child(duckdb_vector vector, idx_t index);
duckdb_vector duckdb_array_vector_get_child(duckdb_vector vector);
Validity Mask Functions
bool duckdb_validity_row_is_valid(uint64_t *validity, idx_t row);
void duckdb_validity_set_row_validity(uint64_t *validity, idx_t row, bool valid);
void duckdb_validity_set_row_invalid(uint64_t *validity, idx_t row);
void duckdb_validity_set_row_valid(uint64_t *validity, idx_t row);
duckdb_vector_get_column_type
Retrieves the column type of the specified vector.
The result must be destroyed with duckdb_destroy_logical_type.
Syntax
duckdb_logical_type duckdb_vector_get_column_type(
duckdb_vector vector
);
Parameters
vector: The vector to get the type from
Return Value
The type of the vector
duckdb_vector_get_data
Retrieves the data pointer of the vector.
The data pointer can be used to read or write values from the vector. How to read or write values depends on the type of the vector.
Syntax
void *duckdb_vector_get_data(
duckdb_vector vector
);
Parameters
vector: The vector to get the data from
Return Value
The data pointer
duckdb_vector_get_validity
Retrieves the validity mask pointer of the specified vector.
If all values are valid, this function MIGHT return NULL!
The validity mask is a bitset that signifies null-ness within the data chunk. It is a series of uint64_t values, where each uint64_t value contains validity for 64 tuples. The bit is set to 1 if the value is valid (i.e., not NULL) or 0 if the value is invalid (i.e., NULL).
Validity of a specific value can be obtained like this:
idx_t entry_idx = row_idx / 64;
idx_t idx_in_entry = row_idx % 64;
bool is_valid = validity_mask[entry_idx] & ((uint64_t) 1 << idx_in_entry);
Alternatively, the (slower) duckdb_validity_row_is_valid function can be used.
Syntax
uint64_t *duckdb_vector_get_validity(
duckdb_vector vector
);
Parameters
vector: The vector to get the validity mask from
Return Value
The pointer to the validity mask, or NULL if no validity mask is present
duckdb_vector_ensure_validity_writable
Ensures the validity mask is writable by allocating it.
After this function is called, duckdb_vector_get_validity will ALWAYS return non-NULL.
This allows NULL values to be written to the vector, regardless of whether a validity mask was present before.
Syntax
void duckdb_vector_ensure_validity_writable(
duckdb_vector vector
);
Parameters
vector: The vector to alter
duckdb_vector_assign_string_element
Assigns a string element in the vector at the specified location.
Syntax
void duckdb_vector_assign_string_element(
duckdb_vector vector,
idx_t index,
const char *str
);
Parameters
vector: The vector to alter
index: The row position in the vector to assign the string to
str: The null-terminated string
duckdb_vector_assign_string_element_len
Assigns a string element in the vector at the specified location. You may also use this function to assign BLOBs.
Syntax
void duckdb_vector_assign_string_element_len(
duckdb_vector vector,
idx_t index,
const char *str,
idx_t str_len
);
Parameters
vector: The vector to alter
index: The row position in the vector to assign the string to
str: The string
str_len: The length of the string (in bytes)
duckdb_list_vector_get_child
Retrieves the child vector of a list vector.
The resulting vector is valid as long as the parent vector is valid.
Syntax
duckdb_vector duckdb_list_vector_get_child(
duckdb_vector vector
);
Parameters
vector: The vector
Return Value
The child vector
duckdb_list_vector_get_size
Returns the size of the child vector of the list.
Syntax
idx_t duckdb_list_vector_get_size(
duckdb_vector vector
);
Parameters
vector: The vector
Return Value
The size of the child list
duckdb_list_vector_set_size
Sets the total size of the underlying child-vector of a list vector.
Syntax
duckdb_state duckdb_list_vector_set_size(
duckdb_vector vector,
idx_t size
);
Parameters
vector: The list vector.
size: The size of the child list.
Return Value
The duckdb state. Returns DuckDBError if the vector is nullptr.
duckdb_list_vector_reserve
Sets the total capacity of the underlying child-vector of a list.
After calling this method, you must call duckdb_vector_get_validity and duckdb_vector_get_data to obtain the current data and validity pointers.
Syntax
duckdb_state duckdb_list_vector_reserve(
duckdb_vector vector,
idx_t required_capacity
);
Parameters
vector: The list vector.
required_capacity: The total capacity to reserve.
Return Value
The duckdb state. Returns DuckDBError if the vector is nullptr.
duckdb_struct_vector_get_child
Retrieves the child vector of a struct vector.
The resulting vector is valid as long as the parent vector is valid.
Syntax
duckdb_vector duckdb_struct_vector_get_child(
duckdb_vector vector,
idx_t index
);
Parameters
vector: The vector
index: The child index
Return Value
The child vector
duckdb_array_vector_get_child
Retrieves the child vector of an array vector.
The resulting vector is valid as long as the parent vector is valid. The resulting vector has the size of the parent vector multiplied by the array size.
Syntax
duckdb_vector duckdb_array_vector_get_child(
duckdb_vector vector
);
Parameters
vector: The vector
Return Value
The child vector
duckdb_validity_row_is_valid
Returns whether or not a row is valid (i.e., not NULL) in the given validity mask.
Syntax
bool duckdb_validity_row_is_valid(
uint64_t *validity,
idx_t row
);
Parameters
validity: The validity mask, as obtained through duckdb_vector_get_validity
row: The row index
Return Value
true if the row is valid, false otherwise
duckdb_validity_set_row_validity
In a validity mask, sets a specific row to either valid or invalid.
Note that duckdb_vector_ensure_validity_writable should be called before calling duckdb_vector_get_validity, to ensure that there is a validity mask to write to.
Syntax
void duckdb_validity_set_row_validity(
uint64_t *validity,
idx_t row,
bool valid
);
Parameters
validity: The validity mask, as obtained through duckdb_vector_get_validity.
row: The row index
valid: Whether to set the row to valid (true) or invalid (false)
duckdb_validity_set_row_invalid
In a validity mask, sets a specific row to invalid.
Equivalent to duckdb_validity_set_row_validity with valid set to false.
Syntax
void duckdb_validity_set_row_invalid(
uint64_t *validity,
idx_t row
);
Parameters
validity: The validity mask
row: The row index
duckdb_validity_set_row_valid
In a validity mask, sets a specific row to valid.
Equivalent to duckdb_validity_set_row_validity with valid set to true.
Syntax
void duckdb_validity_set_row_valid(
uint64_t *validity,
idx_t row
);
Parameters
validity: The validity mask
row: The row index