SQL (Structured Query Language) Querying

Introduction

SQL is a declarative data manipulation languate. Simple SELECT queries follow the general form of SELECT [projection] FROM [tables] WHERE [selection] .

  • [projection] can be column names from the tables, constants or comutations between them. A * stands for "all columns".
  • [tables] can be one or multiple table names.
  • [selection] is a set of logical data filter conditions combined with AND or OR

Most of the query parts are optional. A very simple query is SELECT 42, which results in a single-row single-column result set containing the value 42. A more common simple query is SELECT * FROM sometable, which retrieves all row from sometable. In the following, we will slowly expand from these simple queries.

For the more formally inclined, you can look at SQLite's formal grammar for SELECT queries

The sample data used below is based on the material published in the book J.R. Bruijn, F.S. Gaastra and I. Schaar, Dutch-Asiatic Shipping in the 17th and 18th Centuries, which gives an account of the trips made to the East and ships returned safely (or wrecked on the way). This is the main table, voyages:

The exercises here will work as follows: For every task, you write a SQL query into the text box and press the "play" button left of it. This will execute the query and check the result. The following query is already given, so you can directly run it. Try it, you will see the query result. If you get the result right, the task box turns green and the result is recorded.

Basic Queries

Projections

Projections determine the columns that are retrieved from the database. Columns are usually data, but can also be the the result of computation on the data or constants. Several columns can be specified in the query separated by ,. For example, SELECT boatname, master FROM voyages retrieves the two columns boatname and master from the table voyages.

For the first task, write a query below that retrieves the boat name and the departure date from the table voyages.

We can also modify the data we retrieve. For example, the (pointless) query SELECT number, number + 1 FROM voyages will add 1 to every value in the numbers column. A quite large amount of computation is possible here, comparable to common programming languages.

Write a query below that retrieves the boat name and the tonnage multiplied by 2 from the table voyages.

In the previous task, you can see that the column name of the result of the computation on the tonnage column is tonnage*2. This is sometimes quite unhandy, especially when continuing to use that column in some computation. For this reason, columns and computation results can be renamed using the AS keyword. For example, SELECT boatname AS shipname FROM voyages will retrieve the boatname column, but name it shipname.

Write a query below that retrieves the boat name and the tonnage multiplied by 2 renamed to tonnage_times_two from the table voyages.

Selections

To determine which rows are retrieved, we can filter the table with some criteria in the WHERE clause. For example, the query SELECT boatname FROM voyages WHERE number > 8395 will only retrieve voyages where the value for the number column is greater than 8395. Note that the number column is not mentioned in the projection list, this is common.

Write a query below that retrieves the boat name for boats with a tonnage of less than 150 from the table voyages.

You can specify multiple such filter criteria and connect them with Boolean logic using AND and OR. For example, the query SELECT boatname FROM voyages WHERE number > 8395 AND type_of_boat = 'fluit' will only retrieve rows where both conditions are met.

Write a query below that retrieves the boat name only for boats with a tonnage of less than 150 and for ships departing from the harbour of Batavia from the table voyages. Note that string constants (Batavia) need to be quoted with single quotes in SQL, like so: 'Batavia'.

One special case for selections are NULL values. Those typically cannot be filtered out with (non-)equality checks (because NULL != NULL, but have special syntax. In particular, WHERE departure_date IS NULL selects rows where departure_date has the value NULL, WHERE departure_date IS NOT NULL does the opposite.

Another special case is string pattern matching. Sometimes, we want not to compare exact equality of strings, but use pattern matching instead (much like regular expressions). SQL uses the LIKE keyword for this. Patterns can contain two special "magic" characters, % and _. % matches an arbitrary number (zero or more) characters, _ matches a single arbitrary character. For example, if one wanted to retrieve all boat names from the voyages table where the boat name starts with D, the query would be SELECT boatname FROM voyages WHERE boatname LIKE 'D%'. If we wanted to retrieve all five-character boat names starting with D, we would use LIKE 'D____'. There is also the ILIKE (Borat) variant, which is case-insensitive.

Write a query below that retrieves the boat name and the master only for boats where the type of boat is not NULL and the boat name consists of five-characters ending with AI.

Output Modifiers

Sometimes we want to change the result set independent from the query. Two common tasks are limiting the amount of rows that are retrieved (keyword LIMIT) and changing the ordering of the result set (keyword ORDER BY. Using LIMIT is generally a good idea since databases can grow to billions of records and they will duly output all of them if so instructed.

Note that the relational model does not specify an explicit order of rows, hence it is important to specify an order when using LIMIT. For example, to retrieve the first five boat names in alphabetical order, we can use the query SELECT boatname FROM voyages ORDER BY boatname LIMIT 5;. We can order by multiple columns and use the DESC keyword to invert the order.

Write a query below that retrieves the top six boat names ordered by tonnage.

All Combined

Now, let's combine all of the above in a single query.

Write a query below that retrieves the boat name and the tonnage divided by 2 renamed to half_tonnage from the table voyages for boats built after 1788 and the chamber code is A. The result should be ordered by departure date and only include the top 3 results.

Aggregation Queries

Often, we are not interested in "raw" data, but aggregates like averages or counts. These can be expressed in SQL by adding an additional GROUP BY clause after the WHERE clause of the query. In addition, the projection list is modified by adding aggregation functions and groups. The commonly supported aggregate functions are COUNT, MIN, MAX and AVG.

Ungrouped Aggregation

The easiest case of aggregating values is aggregating without a GROUP BY clause. In this case, the grouping is implicit and all rows fall into a single group. For example, the query SELECT MIN(number) FROM voyages will compute the smallest value for the column number.

Write a query below that retrieves the maximum tonnage as the column max_tonnage from the voyages table.

Single-Column Groups

The next step of aggregation is the aggregation by a group determined by the values in the data. For example, the query SELECT type_of_boat, COUNT() AS n FROM voyages GROUP BY type_of_boat generates the aggregate COUNT for each distinct value of type_of_boat. Its important to know that the projection of the query (the part behind the SELECT) can only contain column names of columns that are used in the GROUP BY part of the query as well. All other columns need to be wrapped in an aggregation function.

Write a query below that retrieves the chamber and the maximum tonnage for each distinct value of chamber as the column max_tonnage from the voyages table grouped by chamber.

Multi-Column Groups

We can also group by multiple columns. In this case, all combinations of values between the two columns that occur in the data are grouping values. The query will simply list all grouping columns separated by , after the GROUP BY keyword.

Write a query below that retrieves the chamber, the type of boat, and the number of tuples in each group as the column n from the voyages table grouped by chamber and type_of_boat.

Filtering Values and Groups

In one of the previous excersises, we have seen how we can select (WHERE ...) only part of the table to be relevant to the query. This also applies to groups. For example, before grouping, we might decide to filter out values. We can also filter out groups based on their aggregation values using the HAVING keyword.

For example, we could aggregate only the type of boat for chamber A using a query like SELECT type_of_boat, COUNT() AS n FROM voyages WHERE chamber = 'A' GROUP BY type_of_boat. The logic here tells the database to conceptually first compute the result of the WHERE clause and then run the aggregation according to the GROUP BY column. The result of this query is:

Write a query below that retrieves the departure harbour and the number of tuples in each group as the column n from the voyages table grouped by departure_harbour while filtering out rows where the field departure_harbour is NULL (IS NOT NULL, see above).

From the previous result, we have seen some groups with low values for n. Let's say we want to remove those. For this we can use the HAVING clause, which operates on the result of aggregation functions. For example, we can modify the example from above to not consider boat types for which the COUNT value in n is less than 5: SELECT type_of_boat, COUNT() AS n FROM voyages GROUP BY type_of_boat HAVING n > 5. All expressions that can used following the SELECT in a grouped query are acceptable in the HAVING clause to filter the groups.

Write a query below that retrieves the chamber and sum of tonnage as field s from the voyages table grouped by chamber while filtering out groups where the sum of tonnage is less than or equal to 5000.

Distinct Values

A special case of grouping is the DISTINCT keyword. It retrieves the set of unique values from a set of columns. An example is SELECT DISTINCT type_of_boat FROM voyages. This is set-equivalent to the query SELECT type_of_boat FROM voyages GROUP BY type_of_boat. Try it!

Write a query below that retrieves all unique values for departure harbour from the voyages table.

All Combined

Now let's try to combine a lot of what we have learned so far about grouping (and before).

Write a query below that retrieves the departure harbour, the departure date, the amount of voyages (rows) as column n, the minimum and maximum tonnage in each group (min_t / max_t) from the voyages table. Group by departure harbour and departure date. Filter out rows where departure harbour is NULL or equal to Batavia. Filter the groups to have at least two voyages in them.

Join

One of the most powerful features of relational databases is the JOIN, the horizontal combination of two tables according to the data. For example, if one table contains information about voyages and another table contains information about the invoices (value of ship's cargo) for those voyages, the JOIN operator can be used to combine the two.

To extend the example, we use an additional table, chambers, which contains the expansion of the VOC chamber (department) code we used in the previous section:

We will also use the table invoices, which contains the total amount of money charged for a voyage:

Equi-Join

The basic form of JOIN is the "Equi-Join". "Equi" stands for "Equality" and means that values have to be equal to create a match for the join. In SQL, there are various syntaxes for this, but generally we need to name the two tables in the FROM clause and then either immediately or later in the WHERE specify the columns that should be considered.

For example, we can join the voyages table and the chambers using an Equi-Join on the respective chamber columns to add the actual chamber name to the output with SELECT boatname, chamber, name FROM voyages JOIN chambers USING (chamber) LIMIT 5;. The result of this query is:

Note the list of tables in the FROM clause and the USING keyword, which specifies that values from the column chamber, which exists in both tables, should be compared to generate the join result. There are at least two equivalent ways to formulate this query. First, the join condition can also be more verbose: SELECT boatname, voyages.chamber, name FROM voyages JOIN chambers ON voyages.chamber = chambers.chamber LIMIT 5;. Here, we use the ON keyword to specify a more precise and expressive join condition. We explicitly name the tables from which the join columns are coming using the tablename.columname syntax. We also explicitly use the = comparision operator. Both of these definitions are implicit when using USING. The USING keyword also adds an implicit projection which removes one of the chamber results from the result set. The ON version does not, this is why we need to explicitly name one of them in the SELECT part of the query. Second, the join condition can also be in the WHERE clause of the query: SELECT boatname, voyages.chamber, name FROM voyages JOIN chambers WHERE voyages.chamber = chambers.chamber LIMIT 5;. This method of specifying a join condition is widespread, but confusing and thus discouraged.

In the result above, also note how several rows from voyages (# 1, 3 and 5) are joined up with the same row from chambers. This is normal and expected behavior, rows are re-used if they match multiple rows in the join partner.

Now try to write a JOIN query yourself!

Join the voyages and invoices table using equality on the number column. Project only the boatname and invoice column.

Joins can (of course) use more than one pair of columns to determine that a match is present. For example, the query SELECT boatname, invoice FROM voyages JOIN invoices USING (number, chamber) joins the two tables by checking equality for both the number and chamber columns. The equivalent form with the ON keyword would be SELECT boatname, invoice FROM voyages JOIN invoices ON voyages.number = invoices.number AND voyages.chamber = invoices.chamber.

Natural Join

A special case of the Equi-join is the NATURAL JOIN. What it does is perform a join between two tables while using all columns from both tables with matching names as join criteria. The following two queries are equivalent: SELECT boatname, invoice FROM voyages JOIN invoices USING (number, chamber) and SELECT boatname, invoice FROM voyages NATURAL JOIN invoices, because the columns number and chamber are the only column names the two tables have in common. Think of natural joins as syntactic sugar, which often reduces readability of your query due to its implicit nature.

Outer Join

The previous section used (by default) what is called an INNER JOIN. Here, only rows where a match in the other table is found for are candidates for the result. There is also an opposite version, the LEFT OUTER JOIN (The OUTER keyword is optional). Here, all rows from the left side of the join (the table named first) are included in the result. If a matching row from the right side is found, it is joined. If none is found, the left row is still returned with the missing fields from the right side set to NULL

Not all rows in the voyages table have a match in the invoices table. Use a LEFT OUTER JOIN on number and a filter for invoice equal to NULL to only retrieve rows where no match was found. Retrieve only the number column.

Self-Join

A powerful use case of joins in SQL is the self-join. Here, we combine the table with itself, which can for example be used to express queries about relationships between rows in the same table. In our dataset, we have such a relationship with the next_voyage column in the voyages table. This column indicates the next voyage number of a particular ship. Since both table have the same name in a self-join, it is required to rename them using the AS statement to unique temporary table names. For example, SELECT v1.boatname FROM voyages AS v1 where v1.chamber='A' demonstrates such a renaming.

Retrieve all voyages where the next voyage of a ship was for a different chamber. Project ot boatname, and the two (differing) chambers as c1 and c2.

More Tables

We can join over more than two tables by chaining JOIN keywords. We also use multiple ON or USING statements to define the join criteria.

Join all three tables voyages, invoices and chambers. Join voyages and invoices on the number and chamber columns, and the resulting table with chambers using only chamber column. Limit the result set to 5 rows.

Subqueries

SQL supports nested queries, the so-called subqueries. They can be used in all parts of the query and often simplify query expression as opposed to a JOIN. Other uses are creation of more convenient join partners or computation of projection results. These queries mostly differ in the allowed cardinality of their results. In projections, they can only return a single value whiled in the FROMclause, entire tables may be produced.

Filters

Subqueries in a WHERE clause are typically comparing an attribute, either against a single value (e.g., using =) or against a set of values (e.g., with IN or EXISTS). In both cases, it is possible to refer to tuple variables from the enclosing query but not the other way around. Subqueries are enclosed in brackets () and are otherwise complete queries on their own. For example, the rather over-complicated query SELECT boatname FROM voyages WHERE tonnage = (SELECT 884) uses a subquery to retrieve all ships with a tonnage of 884.

Use a subquery in the WHERE clause to retrieve boat names from the voyages table where there is no matching entry in the invoices table for the particular voyage number. Use a subquery with either NOT IN or NOT EXISTS (with a reference to the outer query).

Tables

The result of a subquery is also a table, hence they can also be used to use a subquery result where a table could be used, i.e., the FROM clause. For example, the (again) over-complicated query SELECT number FROM (SELECT * FROM voyages) AS v uses a table-creating subquery. Note how we have to use the AS clause to give the table created by the subquery a name. We often use subqueries to avoid complex filter conditions after joining tables.

Use a subquery in the FROM clause to only retrieve invoices from chamber 'H' and the invoice amount of larger than 10000 and join the result with the voyages table using the number column. Project to only retrieve the boatname and the invoice amount of the join result. Order by invoice amount.

Set Operations

SQL is grounded in the set semantics of the relational model. Hence, result sets can be interpreted as sets of tuples and we can use set operations to modify them. The most common set operation is the UNION, but intersections and set differences are also possible. These operators are often used to combine tuples from different relations.

Unions

The UNION keyword combines two result sets while eliminating duplicates, the UNION ALL keyword does the same while keeping duplicates. For example, the query SELECT name, chamber FROM chambers UNION ALL SELECT boatname, chamber FROM voyages will stack the otherwise unrelated boat and chamber names into one result set. Often, constants are also selected to denote the source of the respective tables.

Use the UNION keyword to add a (modern-day) province column to the chambers values. 'Noord-Holland for 'Amsterdam', 'Hoorn' and 'Enkhuizen', 'Zeeland for 'Zeeland', 'Zuid-Holland for 'Delft' and 'Rotterdam'. Select chamber code, name and province. Order entire result set by chamber code.


You made it to the end, congratulations.