SQL (Structured Query Language) Querying
Introduction
SQL is a declarative data manipulation languate. Simple SELECT
queries follow the general form of
SELECT [projection] FROM [tables] WHERE [selection]
.
[projection]
can be column names from the tables, constants or comutations between them. A*
stands for "all columns".[tables]
can be one or multiple table names.[selection]
is a set of logical data filter conditions combined withAND
orOR
Most of the query parts are optional. A very simple query is SELECT 42
, which results in a single-row single-column result set containing the value 42
. A more common simple query is SELECT * FROM sometable
, which retrieves all row from sometable
. In the following, we will slowly expand from these simple queries.
For the more formally inclined, you can look at SQLite's formal grammar for SELECT
queries
The sample data used below is based on the material published in the book J.R. Bruijn, F.S. Gaastra and I. Schaar, Dutch-Asiatic Shipping in the 17th and 18th Centuries, which gives an account of the trips made to the East and ships returned safely (or wrecked on the way). This is the main table, voyages
:
The exercises here will work as follows: For every task, you write a SQL query into the text box and press the "play" button left of it. This will execute the query and check the result. The following query is already given, so you can directly run it. Try it, you will see the query result. If you get the result right, the task box turns green and the result is recorded.
Basic Queries
Projections
Projections determine the columns that are retrieved from the database. Columns are usually data, but can also be the the result of computation on the data or constants. Several columns can be specified in the query separated by ,
. For example, SELECT boatname, master FROM voyages
retrieves the two columns boatname
and master
from the table voyages
.
voyages
.
We can also modify the data we retrieve. For example, the (pointless) query SELECT number, number + 1 FROM voyages
will add 1
to every value in the numbers
column. A quite large amount of computation is possible here, comparable to common programming languages.
2
from the table voyages
.
In the previous task, you can see that the column name of the result of the computation on the tonnage
column is tonnage*2
. This is sometimes quite unhandy, especially when continuing to use that column in some computation. For this reason, columns and computation results can be renamed using the AS
keyword. For example, SELECT boatname AS shipname FROM voyages
will retrieve the boatname column, but name it shipname
.
2
renamed to tonnage_times_two
from the table voyages
.
Selections
To determine which rows are retrieved, we can filter the table with some criteria in the WHERE
clause. For example, the query SELECT boatname FROM voyages WHERE number > 8395
will only retrieve voyages where the value for the number
column is greater than 8395
. Note that the number
column is not mentioned in the projection list, this is common.
150
from the table voyages
.
You can specify multiple such filter criteria and connect them with Boolean logic using AND
and OR
. For example, the query SELECT boatname FROM voyages WHERE number > 8395 AND type_of_boat = 'fluit'
will only retrieve rows where both conditions are met.
150
and for ships departing from the harbour of Batavia from the table voyages
. Note that string constants (Batavia
) need to be quoted with single quotes in SQL, like so: 'Batavia'
.
One special case for selections are NULL
values. Those typically cannot be filtered out with (non-)equality checks (because NULL != NULL
, but have special syntax. In particular, WHERE departure_date IS NULL
selects rows where departure_date
has the value NULL
, WHERE departure_date IS NOT NULL
does the opposite.
Another special case is string pattern matching. Sometimes, we want not to compare exact equality of strings, but use pattern matching instead (much like regular expressions). SQL uses the LIKE
keyword for this. Patterns can contain two special "magic" characters, %
and _
. %
matches an arbitrary number (zero or more) characters, _
matches a single arbitrary character. For example, if one wanted to retrieve all boat names from the voyages
table where the boat name starts with D
, the query would be SELECT boatname FROM voyages WHERE boatname LIKE 'D%'
. If we wanted to retrieve all five-character boat names starting with D
, we would use LIKE 'D____'
. There is also the ILIKE
(Borat) variant, which is case-insensitive.
NULL
and the boat name consists of five-characters ending with AI
.
Output Modifiers
Sometimes we want to change the result set independent from the query. Two common tasks are limiting the amount of rows that are retrieved (keyword LIMIT
) and changing the ordering of the result set (keyword ORDER BY
. Using LIMIT
is generally a good idea since databases can grow to billions of records and they will duly output all of them if so instructed.
Note that the relational model does not specify an explicit order of rows, hence it is important to specify an order when using LIMIT
. For example, to retrieve the first five boat names in alphabetical order, we can use the query SELECT boatname FROM voyages ORDER BY boatname LIMIT 5;
. We can order by multiple columns and use the DESC
keyword to invert the order.
All Combined
Now, let's combine all of the above in a single query.
2
renamed to half_tonnage
from the table voyages
for boats built after 1788
and the chamber code is A
. The result should be ordered by departure date and only include the top 3 results.
Aggregation Queries
Often, we are not interested in "raw" data, but aggregates like averages or counts. These can be expressed in SQL by adding an additional GROUP BY
clause after the WHERE
clause of the query. In addition, the projection list is modified by adding aggregation functions and groups. The commonly supported aggregate functions are COUNT
, MIN
, MAX
and AVG
.
Ungrouped Aggregation
The easiest case of aggregating values is aggregating without a GROUP BY
clause. In this case, the grouping is implicit and all rows fall into a single group. For example, the query SELECT MIN(number) FROM voyages
will compute the smallest value for the column number
.
max_tonnage
from the voyages
table.
Single-Column Groups
The next step of aggregation is the aggregation by a group determined by the values in the data. For example, the query SELECT type_of_boat, COUNT() AS n FROM voyages GROUP BY type_of_boat
generates the aggregate COUNT
for each distinct value of type_of_boat
. Its important to know that the projection of the query (the part behind the SELECT
) can only contain column names of columns that are used in the GROUP BY
part of the query as well. All other columns need to be wrapped in an aggregation function.
chamber
and the maximum tonnage for each distinct value of chamber
as the column max_tonnage
from the voyages
table grouped by chamber
.
Multi-Column Groups
We can also group by multiple columns. In this case, all combinations of values between the two columns that occur in the data are grouping values. The query will simply list all grouping columns separated by ,
after the GROUP BY
keyword.
n
from the voyages
table grouped by chamber
and type_of_boat
.
Filtering Values and Groups
In one of the previous excersises, we have seen how we can select (WHERE ...
) only part of the table to be relevant to the query. This also applies to groups. For example, before grouping, we might decide to filter out values. We can also filter out groups based on their aggregation values using the HAVING
keyword.
For example, we could aggregate only the type of boat for chamber A
using a query like SELECT type_of_boat, COUNT() AS n FROM voyages WHERE chamber = 'A' GROUP BY type_of_boat
. The logic here tells the database to conceptually first compute the result of the WHERE
clause and then run the aggregation according to the GROUP BY
column. The result of this query is:
n
from the voyages
table grouped by departure_harbour
while filtering out rows where the field departure_harbour
is NULL
(IS NOT NULL
, see above).
From the previous result, we have seen some groups with low values for n
. Let's say we want to remove those. For this we can use the HAVING
clause, which operates on the result of aggregation functions. For example, we can modify the example from above to not consider boat types for which the COUNT
value in n
is less than 5
: SELECT type_of_boat, COUNT() AS n FROM voyages GROUP BY type_of_boat HAVING n > 5
. All expressions that can used following the SELECT
in a grouped query are acceptable in the HAVING
clause to filter the groups.
chamber
and sum of tonnage
as field s
from the voyages
table grouped by chamber
while filtering out groups where the sum of tonnage is less than or equal to 5000
.
Distinct Values
A special case of grouping is the DISTINCT
keyword. It retrieves the set of unique values from a set of columns. An example is SELECT DISTINCT type_of_boat FROM voyages
. This is set-equivalent to the query SELECT type_of_boat FROM voyages GROUP BY type_of_boat
. Try it!
voyages
table.
All Combined
Now let's try to combine a lot of what we have learned so far about grouping (and before).
n
, the minimum and maximum tonnage in each group (min_t
/ max_t
) from the voyages
table. Group by departure harbour and departure date. Filter out rows where departure harbour is NULL
or equal to Batavia
. Filter the groups to have at least two voyages in them.
Join
One of the most powerful features of relational databases is the JOIN
, the horizontal combination of two tables according to the data. For example, if one table contains information about voyages and another table contains information about the invoices (value of ship's cargo) for those voyages, the JOIN
operator can be used to combine the two.
To extend the example, we use an additional table, chambers
, which contains the expansion of the VOC chamber (department) code we used in the previous section:
We will also use the table invoices
, which contains the total amount of money charged for a voyage:
Equi-Join
The basic form of JOIN
is the "Equi-Join". "Equi" stands for "Equality" and means that values have to be equal to create a match for the join. In SQL, there are various syntaxes for this, but generally we need to name the two tables in the FROM
clause and then either immediately or later in the WHERE
specify the columns that should be considered.
For example, we can join the voyages
table and the chambers
using an Equi-Join on the respective chamber
columns to add the actual chamber name to the output with SELECT boatname, chamber, name FROM voyages JOIN chambers USING (chamber) LIMIT 5;
. The result of this query is:
Note the list of tables in the FROM
clause and the USING
keyword, which specifies that values from the column chamber
, which exists in both tables, should be compared to generate the join result. There are at least two equivalent ways to formulate this query. First, the join condition can also be more verbose: SELECT boatname, voyages.chamber, name FROM voyages JOIN chambers ON voyages.chamber = chambers.chamber LIMIT 5;
. Here, we use the ON
keyword to specify a more precise and expressive join condition. We explicitly name the tables from which the join columns are coming using the tablename.columname
syntax. We also explicitly use the =
comparision operator. Both of these definitions are implicit when using USING
. The USING
keyword also adds an implicit projection which removes one of the chamber
results from the result set. The ON
version does not, this is why we need to explicitly name one of them in the SELECT
part of the query. Second, the join condition can also be in the WHERE
clause of the query: SELECT boatname, voyages.chamber, name FROM voyages JOIN chambers WHERE voyages.chamber = chambers.chamber LIMIT 5;
. This method of specifying a join condition is widespread, but confusing and thus discouraged.
In the result above, also note how several rows from voyages
(# 1, 3 and 5) are joined up with the same row from chambers
. This is normal and expected behavior, rows are re-used if they match multiple rows in the join partner.
Now try to write a JOIN query yourself!
voyages
and invoices
table using equality on the number
column. Project only the boatname
and invoice
column.
Joins can (of course) use more than one pair of columns to determine that a match is present. For example, the query SELECT boatname, invoice FROM voyages JOIN invoices USING (number, chamber)
joins the two tables by checking equality for both the number
and chamber
columns. The equivalent form with the ON
keyword would be SELECT boatname, invoice FROM voyages JOIN invoices ON voyages.number = invoices.number AND voyages.chamber = invoices.chamber
.
Natural Join
A special case of the Equi-join is the NATURAL JOIN
. What it does is perform a join between two tables while using all columns from both tables with matching names as join criteria. The following two queries are equivalent: SELECT boatname, invoice FROM voyages JOIN invoices USING (number, chamber)
and SELECT boatname, invoice FROM voyages NATURAL JOIN invoices
, because the columns number
and chamber
are the only column names the two tables have in common. Think of natural joins as syntactic sugar, which often reduces readability of your query due to its implicit nature.
Outer Join
The previous section used (by default) what is called an INNER JOIN
. Here, only rows where a match in the other table is found for are candidates for the result. There is also an opposite version, the LEFT OUTER JOIN
(The OUTER
keyword is optional). Here, all rows from the left side of the join (the table named first) are included in the result. If a matching row from the right side is found, it is joined. If none is found, the left row is still returned with the missing fields from the right side set to NULL
voyages
table have a match in the invoices
table. Use a LEFT OUTER JOIN
on number
and a filter for invoice
equal to NULL
to only retrieve rows where no match was found. Retrieve only the number
column.
Self-Join
A powerful use case of joins in SQL is the self-join. Here, we combine the table with itself, which can for example be used to express queries about relationships between rows in the same table. In our dataset, we have such a relationship with the next_voyage
column in the voyages
table. This column indicates the next voyage number of a particular ship. Since both table have the same name in a self-join, it is required to rename them using the AS
statement to unique temporary table names. For example, SELECT v1.boatname FROM voyages AS v1 where v1.chamber='A'
demonstrates such a renaming.
c1
and c2
.
More Tables
We can join over more than two tables by chaining JOIN
keywords. We also use multiple ON
or USING
statements to define the join criteria.
voyages
, invoices
and chambers
. Join voyages
and invoices
on the number
and chamber
columns, and the resulting table with chambers
using only chamber
column. Limit the result set to 5 rows.
Subqueries
SQL supports nested queries, the so-called subqueries. They can be used in all parts of the query and often simplify query expression as opposed to a JOIN
. Other uses are creation of more convenient join partners or computation of projection results. These queries mostly differ in the allowed cardinality of their results. In projections, they can only return a single value whiled in the FROM
clause, entire tables may be produced.
Filters
Subqueries in a WHERE
clause are typically comparing an attribute, either against a single value (e.g. using =
) or against a set of values (e.g. with IN
or EXISTS
). In both cases, it is possible to refer to tuple variables from the enclosing query but not the other way around. Subqueries are enclosed in brackets ()
and are otherwise complete queries on their own. For example, the rather over-complicated query SELECT boatname FROM voyages WHERE tonnage = (SELECT 884)
uses a subquery to retrieve all ships with a tonnage of 884
.
WHERE
clause to retrieve boat names from the voyages
table where there is no matching entry in the invoices
table for the particular voyage number. Use a subquery with either NOT IN
or NOT EXISTS
(with a reference to the outer query).
Tables
The result of a subquery is also a table, hence they can also be used to use a subquery result where a table could be used, i.e. the FROM
clause. For example, the (again) over-complicated query SELECT number FROM (SELECT * FROM voyages) AS v
uses a table-creating subquery. Note how we have to use the AS
clause to give the table created by the subquery a name. We often use subqueries to avoid complex filter conditions after joining tables.
FROM
clause to only retrieve invoices
from chamber 'H'
and the invoice
amount of larger than 10000
and join the result with the voyages
table using the number
column. Project to only retrieve the boatname
and the invoice
amount of the join result. Order by invoice
amount.
Set Operations
SQL is grounded in the set semantics of the relational model. Hence, result sets can be interpreted as sets of tuples and we can use set operations to modify them. The most common set operation is the UNION
, but intersections and set differences are also possible. These operators are often used to combine tuples from different relations.
Unions
The UNION
keyword combines two result sets while eliminating duplicates, the UNION ALL
keyword does the same while keeping duplicates. For example, the query SELECT name, chamber FROM chambers UNION ALL SELECT boatname, chamber FROM voyages
will stack the otherwise unrelated boat and chamber names into one result set. Often, constants are also selected to denote the source of the respective tables.
UNION
keyword to add a (modern-day) province
column to the chambers
values. 'Noord-Holland
for 'Amsterdam'
, 'Hoorn'
and 'Enkhuizen'
, 'Zeeland
for 'Zeeland'
, 'Zuid-Holland
for 'Delft'
and 'Rotterdam'
. Select chamber code, name and province. Order entire result set by chamber code.
You made it to the end, congratulations.