Sunday, October 25, 2009

Section 10.2. Subqueries







10.2. Subqueries

A subquery is a SELECT statement nested within another SQL statement. This feature became
available as of version 4.1 of MySQL. Although the same results can be
accomplished by using the JOIN clause or
UNION, depending on the situation, subqueries are a
cleaner approach that is sometimes easier to read. They make a complex
query more modular, which makes it easier to create and to troubleshoot.
Here is a simple example of a subquery:

SELECT *
FROM
(SELECT col1, col2
FROM table1
WHERE col_id = 1000) AS derived1
ORDER BY col2;


In this example, the subquery or inner query
is a SELECT statement specifying two column names. The other query is called the main or
outer query. It doesn't have to be a
SELECT. It can be an INSERT, a
DELETE, a DO, an
UPDATE, or even a SET statement. The
outer query generally can't select data or modify data from the same table
as an inner query, but this doesn't apply if the subquery is part of a
FROM clause. A subquery can return a value (a scalar),
a field, multiple fields containing values, or a full results set that
serves as a derived table.

You can encounter performance problems with subqueries if they are
not well constructed. One problem occurs when a subquery is placed within
an IN⁠(⁠ ⁠ ⁠) clause as part of a
WHERE clause. It's generally better to use the
= operator for each value, along with
AND for each parameter/value pair.

When you see a performance problem with a subquery, try
reconstructing the SQL statement with JOIN and compare
the differences using the BENCHMARK⁠(⁠ ⁠ ⁠) function.
If the performance is better without a subquery, don't give up on
subqueries. Only in some situations is performance poorer. For those
situations where there is a performance drain, MySQL AB is working on
improving MySQL subqueries. So performance problems you experience now may
be resolved in future versions. You may just need to upgrade to the
current release or watch for improvements in future releases.

10.2.1. Single Field Subqueries

The most basic subquery is one that returns a scalar or single
value. This type of subquery is particularly useful in a WHERE clause in conjunction with an
= operator, or in other instances where a single
value from an expression is permitted.

As an example of this situation, suppose that at our fictitious
college one of the music teachers, Sonia Oram, has called us saying that
she wants a list of students for one of her classes so that she can call
them to invite them to a concert. She wants the names and telephone
numbers for only the students in her first period Monday morning
class.

The way most databases store this data, the course number would be
a unique key and would make it easy to retrieve the other data without a
subquery. But Sonia doesn't know the course number, so we enter an SQL
statement like this:

SELECT CONCAT(name_first, ' ', name_last) AS student,
phone_home, phone_dorm
FROM students
JOIN course_rosters USING (student_id)
WHERE course_id =
(SELECT course_id
FROM course_schedule
JOIN teachers USING (teacher_id)
WHERE semester_code = '2007AU'
AND class_time = 'monday_01'
AND name_first = 'Sonia'
AND name_last = 'Oram');


Notice in the subquery that we're joining the
course_schedule table with
teachers so we can give the teacher's first and last
name in the WHERE clause of the subquery. We're also
indicating in the WHERE clause a specific semester
(Autumn 2007) and time slot (Monday, first period). The results of these
specifics should be one course identification number because a teacher
won't teach more than one class during a particular class period. That
single course number will be used by the WHERE clause
of the main query to return the list of students on the class roster for
the course, along with their telephone numbers.

If by chance more than one value is returned by the subquery in
the previous example, MySQL will
return an error:

ERROR 1242 (ER_SUBSELECT_NO_1_ROW)
SQLSTATE = 21000
Message = "Subquery returns more than 1 row"


Despite our supposition, it is possible that a teacher might teach
more than one class at a time: perhaps the teacher is teaching one
course in violin and another in viola, but each class had so few
students that the department head put them together. In such a
situation, the teacher would want the data for both course numbers. To
use multiple fields derived from a subquery in a
WHERE clause like this, we would have to use
something other than the = operator, such as
IN. For this kind of situation, see the next section
on Section 10.2.2."

10.2.2. Multiple Fields Subqueries

In the previous section, we discussed instances where one scalar
value was obtained from a subquery in a WHERE clause.
However, there are times when you may want to match multiple values. For
those situations you will need to use the subquery in conjunction with
an operator or a clause: ALL, ANY,
EXISTS, IN, or
SOME.

As an example of a multiple fields subquery—and specifically of a
subquery using IN (or using ANY or
SOME)—let's adapt the example from the previous
section to a situation where the teacher wants the contact information
for students in all of her classes. To do this, we can enter the
following SQL statement:

SELECT CONCAT(name_first, ' ', name_last) AS student,
phone_home, phone_dorm
FROM students
JOIN course_rosters USING (student_id)
WHERE course_id IN
(SELECT course_id
FROM course_schedule
JOIN teachers USING (teacher_id)
WHERE semester_code = '2007AU'
AND name_first = 'Sonia'
AND name_last = 'Oram');


In this example, notice that the subquery is contained within the
parentheses of the IN clause. Subqueries are executed
first, so the results will be available before the
WHERE clause is executed. Although a comma-separated
list isn't returned, MySQL still accepts the results so that they may be
used by the outer query. The criteria of the WHERE
clause here does not specify a specific time slot as the earlier example
did, so multiple values are much more likely to be returned.

Instead of IN, you can use ANY or SOME
to obtain the same results by the same methods. (ANY
and SOME are synonymous.) These two keywords must be
preceded by a comparison operator (e.g., =,
<, >). For example, we could
replace the IN in the SQL previous statement with
= ANY or with = SOME and the same
results will be returned. IN can be preceded with
NOT for negative comparisons: NOT
IN(...)
. This is the same as != ANY (...)
and != SOME (...).

Let's look at another subquery returning multiple values but using
the ALL operator. The ALL operator must be preceded by a
comparison operator (e.g., =,
<, >). As an example of this
usage, suppose one of the piano teachers provides weekend seminars for
students. Suppose also that he heard a few students are enrolled in all
of the seminars he has scheduled for the semester and he wants a list of
their names and telephone numbers in advance. We should be able to get
that data by entering an SQL statement like the following (though
currently it doesn't work, for reasons to be explained shortly):

SELECT DISTINCT student_id, 
CONCAT(name_first, ' ', name_last) AS student
FROM students
JOIN seminar_rosters USING (student_id)
WHERE seminar_id = ALL
(SELECT seminar_id
FROM seminar_schedule
JOIN teachers ON (instructor_id = teacher_id)
WHERE semester_code = '2007AU'
AND name_first = 'Sam'
AND name_last = 'Oram');


In this example, a couple of the tables have different column
names for the ID we want, and we have to join one of them with
ON instead of USING, but that has
nothing to do with the subquery. What's significant is that this
subquery returns a list of seminar identification numbers and is used in
the WHERE clause of the main query with =
ALL
. Unfortunately, although this statement is constructed
correctly, it doesn't work with MySQL at the time of this writing and
just returns an empty set. However, it should work in future releases of
MySQL, so I've included it for future reference. For now, we would have
to reorganize the SQL statement like so:

SELECT student_id, student
FROM
(SELECT student_id, COUNT(*)
AS nbr_seminars_registered,
CONCAT(name_first, ' ', name_last)
AS student
FROM students
JOIN seminar_rosters USING (student_id)
WHERE seminar_id IN
(SELECT seminar_id
FROM seminar_schedule
JOIN teachers
ON (instructor_id = teacher_id)
WHERE semester_code = '2007AU'
AND name_first = 'Sam'
AND name_last = 'Oram')
GROUP BY student_id) AS students_registered
WHERE nbr_seminars_registered =
(SELECT COUNT(*) AS nbr_seminars
FROM seminar_schedule
JOIN teachers
ON (instructor_id = teacher_id)
WHERE semester_code = '2007AU'
AND name_first = 'Sam'
AND name_last = 'Oram');



This is much more involved, but it does work with the latest
release of MySQL.

The first subquery is used to get the student's name. This
subquery's WHERE clause uses another subquery to
retrieve the list of seminars taught by the professor for the semester,
to determine the results set from which the main query will draw its
ultimate data. The third subquery
counts the number of seminars that the same professor is teaching for
the semester. This single value is used with the
WHERE clause of the main query. In essence, we're
determining the number of seminars the professor is teaching and which
students are registered for all of them.

The last possible method for using multiple fields in a subquery
uses EXISTS. With
EXISTS, in order for it to return meaningful or
desired results, you need to stipulate in the WHERE
clauses of the subquery a point in which it is joined to the outer
query. Using the example from the previous section involving the teacher
Sonia Oram, let's suppose that we want to retrieve a list of courses
that she teaches:

SELECT DISTINCT course_id, course_name 
FROM courses
WHERE EXISTS
(SELECT course_id
FROM course_schedule
JOIN teachers USING (teacher_id)
WHERE semester_code = '2007AU'
AND name_first = 'Sonia'
AND name_last = 'Oram'
AND courses.course_id = course_schedule.course_id);


As you can see here, we've added EXISTS to the
WHERE clause with the subquery in parentheses,
similar to using IN. The significant difference is
that we added courses.course_id =
course_schedule.course_id
to the end. Without it, a list of
all courses would be returned regardless of the criteria of the
WHERE clause in the subquery. Incidentally, if we
specified NOT EXISTS instead, we would get all
courses except for the ones taught by the teacher
given.

10.2.3. Results Set Subqueries

A subquery can be used to generate a results set, which is a table
from which an outer query can select data. That is, a subquery can be
used in a FROM clause as if it were another table in
a database. It is a derived table. Along these
lines, each derived table must be named. This is done with
AS following the parentheses containing the subquery.
A subquery contained in a FROM clause generally
cannot be a correlated subquery—that is, it cannot reference the same
table as the outer query. The exception is if it's constructed with a
JOIN.

In the following example, let's consider the subquery separately
as though it were a plain query and not a subquery. It will generate a
results set containing the student's ID and the student's average exam
score for a specific course taught during a specific semester. The query
uses AVG⁠(⁠ ⁠ ⁠), which requires a
GROUP BY clause. The problem with GROUP BY is that it will order
data only by the columns by which it's given to group data. In this
case, it will order the data by student_id and not
list the results by any other, more useful column. If we want to order
the data so that the highest student average is first, descending in
order to the lowest student average, we have to turn our query into a
subquery and have the outer query re-sort the results:

SELECT CONCAT(name_first, ' ', name_last) AS student,
student_id, avg_grade
FROM students
JOIN
(SELECT student_id,
AVG(exam_grade) AS avg_grade
FROM exams
WHERE semester_code = '2007AU'
AND course_id = 1489
GROUP BY student_id) AS grade_averages
USING(student_id)
ORDER BY avg_grade DESC;


The results set (the derived table generated by the subquery in
the FROM clause) is named
grade_averages. Notice that although the column
student_id exists in the derived table, in the table
from which it gets its data (i.e., exams) and in the
primary table used in the main query (i.e.,
students), there is no ambiguity. No error is
generated. However, if we wanted to specify that the data be taken from
the derived table, we could put
grade_averages.student_id in the
SELECT of the outer query.

This subquery is a correlated subquery, which is generally not permitted in a
FROM clause. It's allowed in this example because we
are using a JOIN to join the results set to the table
referenced in the outer query.








No comments: