SQL (Structured Query Language) Querying
SQL is a declarative data manipulation languate. Simple
SELECT queries follow the general form of
SELECT [projection] FROM [tables] WHERE [selection]
[projection]can be column names from the tables, constants or comutations between them. A
*stands for "all columns".
[tables]can be one or multiple table names.
[selection]is a set of logical data filter conditions combined with
Most of the query parts are optional. A very simple query is
SELECT 42, which results in a single-row single-column result set containing the value
42. A more common simple query is
SELECT * FROM sometable, which retrieves all row from
sometable. In the following, we will slowly expand from these simple queries.
For the more formally inclined, you can look at SQLite's formal grammar for
The sample data used below is based on the material published in the book J.R. Bruijn, F.S. Gaastra and I. Schaar, Dutch-Asiatic Shipping in the 17th and 18th Centuries, which gives an account of the trips made to the East and ships returned safely (or wrecked on the way). This is the main table,
The exercises here will work as follows: For every task, you write a SQL query into the text box and press the "play" button left of it. This will execute the query and check the result. The following query is already given, so you can directly run it. Try it, you will see the query result. If you get the result right, the task box turns green and the result is recorded.
Projections determine the columns that are retrieved from the database. Columns are usually data, but can also be the the result of computation on the data or constants. Several columns can be specified in the query separated by
,. For example,
SELECT boatname, master FROM voyages retrieves the two columns
master from the table
We can also modify the data we retrieve. For example, the (pointless) query
SELECT number, number + 1 FROM voyages will add
1 to every value in the
numbers column. A quite large amount of computation is possible here, comparable to common programming languages.
2from the table
In the previous task, you can see that the column name of the result of the computation on the
tonnage column is
tonnage*2. This is sometimes quite unhandy, especially when continuing to use that column in some computation. For this reason, columns and computation results can be renamed using the
AS keyword. For example,
SELECT boatname AS shipname FROM voyages will retrieve the boatname column, but name it
tonnage_times_twofrom the table
To determine which rows are retrieved, we can filter the table with some criteria in the
WHERE clause. For example, the query
SELECT boatname FROM voyages WHERE number > 8395 will only retrieve voyages where the value for the
number column is greater than
8395. Note that the
number column is not mentioned in the projection list, this is common.
150from the table
You can specify multiple such filter criteria and connect them with Boolean logic using
OR. For example, the query
SELECT boatname FROM voyages WHERE number > 8395 AND type_of_boat = 'fluit' will only retrieve rows where both conditions are met.
150and for ships departing from the harbour of Batavia from the table
voyages. Note that string constants (
Batavia) need to be quoted with single quotes in SQL, like so:
One special case for selections are
NULL values. Those typically cannot be filtered out with (non-)equality checks (because
NULL != NULL, but have special syntax. In particular,
WHERE departure_date IS NULL selects rows where
departure_date has the value
WHERE departure_date IS NOT NULL does the opposite.
Another special case is string pattern matching. Sometimes, we want not to compare exact equality of strings, but use pattern matching instead (much like regular expressions). SQL uses the
LIKE keyword for this. Patterns can contain two special "magic" characters,
% matches an arbitrary number (zero or more) characters,
_ matches a single arbitrary character. For example, if one wanted to retrieve all boat names from the
voyages table where the boat name starts with
D, the query would be
SELECT boatname FROM voyages WHERE boatname LIKE 'D%'. If we wanted to retrieve all five-character boat names starting with
D, we would use
LIKE 'D____'. There is also the
ILIKE (Borat) variant, which is case-insensitive.
NULLand the boat name consists of five-characters ending with
Sometimes we want to change the result set independent from the query. Two common tasks are limiting the amount of rows that are retrieved (keyword
LIMIT) and changing the ordering of the result set (keyword
ORDER BY. Using
LIMIT is generally a good idea since databases can grow to billions of records and they will duly output all of them if so instructed.
Note that the relational model does not specify an explicit order of rows, hence it is important to specify an order when using
LIMIT. For example, to retrieve the first five boat names in alphabetical order, we can use the query
SELECT boatname FROM voyages ORDER BY boatname LIMIT 5;. We can order by multiple columns and use the
DESC keyword to invert the order.
Now, let's combine all of the above in a single query.
half_tonnagefrom the table
voyagesfor boats built after
1788and the chamber code is
A. The result should be ordered by departure date and only include the top 3 results.
Often, we are not interested in "raw" data, but aggregates like averages or counts. These can be expressed in SQL by adding an additional
GROUP BY clause after the
WHERE clause of the query. In addition, the projection list is modified by adding aggregation functions and groups. The commonly supported aggregate functions are
The easiest case of aggregating values is aggregating without a
GROUP BY clause. In this case, the grouping is implicit and all rows fall into a single group. For example, the query
SELECT MIN(number) FROM voyages will compute the smallest value for the column
The next step of aggregation is the aggregation by a group determined by the values in the data. For example, the query
SELECT type_of_boat, COUNT() AS n FROM voyages GROUP BY type_of_boat generates the aggregate
COUNT for each distinct value of
type_of_boat. Its important to know that the projection of the query (the part behind the
SELECT) can only contain column names of columns that are used in the
GROUP BY part of the query as well. All other columns need to be wrapped in an aggregation function.
chamberand the maximum tonnage for each distinct value of
chamberas the column
voyagestable grouped by
We can also group by multiple columns. In this case, all combinations of values between the two columns that occur in the data are grouping values. The query will simply list all grouping columns separated by
, after the
GROUP BY keyword.
voyagestable grouped by
Filtering Values and Groups
In one of the previous excersises, we have seen how we can select (
WHERE ...) only part of the table to be relevant to the query. This also applies to groups. For example, before grouping, we might decide to filter out values. We can also filter out groups based on their aggregation values using the
For example, we could aggregate only the type of boat for chamber
A using a query like
SELECT type_of_boat, COUNT() AS n FROM voyages WHERE chamber = 'A' GROUP BY type_of_boat. The logic here tells the database to conceptually first compute the result of the
WHERE clause and then run the aggregation according to the
GROUP BY column. The result of this query is:
voyagestable grouped by
departure_harbourwhile filtering out rows where the field
IS NOT NULL, see above).
From the previous result, we have seen some groups with low values for
n. Let's say we want to remove those. For this we can use the
HAVING clause, which operates on the result of aggregation functions. For example, we can modify the example from above to not consider boat types for which the
COUNT value in
n is less than
SELECT type_of_boat, COUNT() AS n FROM voyages GROUP BY type_of_boat HAVING n > 5. All expressions that can used following the
SELECT in a grouped query are acceptable in the
HAVING clause to filter the groups.
chamberand sum of
voyagestable grouped by
chamberwhile filtering out groups where the sum of tonnage is less than or equal to
A special case of grouping is the
DISTINCT keyword. It retrieves the set of unique values from a set of columns. An example is
SELECT DISTINCT type_of_boat FROM voyages. This is set-equivalent to the query
SELECT type_of_boat FROM voyages GROUP BY type_of_boat. Try it!
Now let's try to combine a lot of what we have learned so far about grouping (and before).
n, the minimum and maximum tonnage in each group (
max_t) from the
voyagestable. Group by departure harbour and departure date. Filter out rows where departure harbour is
NULLor equal to
Batavia. Filter the groups to have at least two voyages in them.
One of the most powerful features of relational databases is the
JOIN, the horizontal combination of two tables according to the data. For example, if one table contains information about voyages and another table contains information about the invoices (value of ship's cargo) for those voyages, the
JOIN operator can be used to combine the two.
To extend the example, we use an additional table,
chambers, which contains the expansion of the VOC chamber (department) code we used in the previous section:
We will also use the table
invoices, which contains the total amount of money charged for a voyage:
The basic form of
JOIN is the "Equi-Join". "Equi" stands for "Equality" and means that values have to be equal to create a match for the join. In SQL, there are various syntaxes for this, but generally we need to name the two tables in the
FROM clause and then either immediately or later in the
WHERE specify the columns that should be considered.
For example, we can join the
voyages table and the
chambers using an Equi-Join on the respective
chamber columns to add the actual chamber name to the output with
SELECT boatname, chamber, name FROM voyages JOIN chambers USING (chamber) LIMIT 5;. The result of this query is:
Note the list of tables in the
FROM clause and the
USING keyword, which specifies that values from the column
chamber, which exists in both tables, should be compared to generate the join result. There are at least two equivalent ways to formulate this query. First, the join condition can also be more verbose:
SELECT boatname, voyages.chamber, name FROM voyages JOIN chambers ON voyages.chamber = chambers.chamber LIMIT 5;. Here, we use the
ON keyword to specify a more precise and expressive join condition. We explicitly name the tables from which the join columns are coming using the
tablename.columname syntax. We also explicitly use the
= comparision operator. Both of these definitions are implicit when using
USING keyword also adds an implicit projection which removes one of the
chamber results from the result set. The
ON version does not, this is why we need to explicitly name one of them in the
SELECT part of the query. Second, the join condition can also be in the
WHERE clause of the query:
SELECT boatname, voyages.chamber, name FROM voyages JOIN chambers WHERE voyages.chamber = chambers.chamber LIMIT 5;. This method of specifying a join condition is widespread, but confusing and thus discouraged.
In the result above, also note how several rows from
voyages (# 1, 3 and 5) are joined up with the same row from
chambers. This is normal and expected behavior, rows are re-used if they match multiple rows in the join partner.
Now try to write a JOIN query yourself!
invoicestable using equality on the
numbercolumn. Project only the
Joins can (of course) use more than one pair of columns to determine that a match is present. For example, the query
SELECT boatname, invoice FROM voyages JOIN invoices USING (number, chamber) joins the two tables by checking equality for both the
chamber columns. The equivalent form with the
ON keyword would be
SELECT boatname, invoice FROM voyages JOIN invoices ON voyages.number = invoices.number AND voyages.chamber = invoices.chamber.
A special case of the Equi-join is the
NATURAL JOIN. What it does is perform a join between two tables while using all columns from both tables with matching names as join criteria. The following two queries are equivalent:
SELECT boatname, invoice FROM voyages JOIN invoices USING (number, chamber) and
SELECT boatname, invoice FROM voyages NATURAL JOIN invoices, because the columns
chamber are the only column names the two tables have in common. Think of natural joins as syntactic sugar, which often reduces readability of your query due to its implicit nature.
The previous section used (by default) what is called an
INNER JOIN. Here, only rows where a match in the other table is found for are candidates for the result. There is also an opposite version, the
LEFT OUTER JOIN (The
OUTER keyword is optional). Here, all rows from the left side of the join (the table named first) are included in the result. If a matching row from the right side is found, it is joined. If none is found, the left row is still returned with the missing fields from the right side set to
voyagestable have a match in the
invoicestable. Use a
LEFT OUTER JOINon
numberand a filter for
NULLto only retrieve rows where no match was found. Retrieve only the
A powerful use case of joins in SQL is the self-join. Here, we combine the table with itself, which can for example be used to express queries about relationships between rows in the same table. In our dataset, we have such a relationship with the
next_voyage column in the
voyages table. This column indicates the next voyage number of a particular ship. Since both table have the same name in a self-join, it is required to rename them using the
AS statement to unique temporary table names. For example,
SELECT v1.boatname FROM voyages AS v1 where v1.chamber='A' demonstrates such a renaming.
We can join over more than two tables by chaining
JOIN keywords. We also use multiple
USING statements to define the join criteria.
chambercolumns, and the resulting table with
chambercolumn. Limit the result set to 5 rows.
SQL supports nested queries, the so-called subqueries. They can be used in all parts of the query and often simplify query expression as opposed to a
JOIN. Other uses are creation of more convenient join partners or computation of projection results. These queries mostly differ in the allowed cardinality of their results. In projections, they can only return a single value whiled in the
FROMclause, entire tables may be produced.
Subqueries in a
WHERE clause are typically comparing an attribute, either against a single value (e.g., using
=) or against a set of values (e.g., with
EXISTS). In both cases, it is possible to refer to tuple variables from the enclosing query but not the other way around. Subqueries are enclosed in brackets
() and are otherwise complete queries on their own. For example, the rather over-complicated query
SELECT boatname FROM voyages WHERE tonnage = (SELECT 884) uses a subquery to retrieve all ships with a tonnage of
WHEREclause to retrieve boat names from the
voyagestable where there is no matching entry in the
invoicestable for the particular voyage number. Use a subquery with either
NOT EXISTS(with a reference to the outer query).
The result of a subquery is also a table, hence they can also be used to use a subquery result where a table could be used, i.e., the
FROM clause. For example, the (again) over-complicated query
SELECT number FROM (SELECT * FROM voyages) AS v uses a table-creating subquery. Note how we have to use the
AS clause to give the table created by the subquery a name. We often use subqueries to avoid complex filter conditions after joining tables.
FROMclause to only retrieve
invoiceamount of larger than
10000and join the result with the
voyagestable using the
numbercolumn. Project to only retrieve the
invoiceamount of the join result. Order by
SQL is grounded in the set semantics of the relational model. Hence, result sets can be interpreted as sets of tuples and we can use set operations to modify them. The most common set operation is the
UNION, but intersections and set differences are also possible. These operators are often used to combine tuples from different relations.
UNION keyword combines two result sets while eliminating duplicates, the
UNION ALL keyword does the same while keeping duplicates. For example, the query
SELECT name, chamber FROM chambers UNION ALL SELECT boatname, chamber FROM voyages will stack the otherwise unrelated boat and chamber names into one result set. Often, constants are also selected to denote the source of the respective tables.
UNIONkeyword to add a (modern-day)
provincecolumn to the
'Rotterdam'. Select chamber code, name and province. Order entire result set by chamber code.
You made it to the end, congratulations.