Programmer's Life: 7.8 Dividing a Summary into Subgroups

7.8 Dividing a Summary into Subgroups

7.8.1 Problem

You want to calculate a summary for each
subgroup of a set of rows, not an overall summary value.

7.8.2 Solution

Use a GROUP BY clause to arrange
rows into groups.

7.8.3 Discussion

The summary queries shown so far calculate summary values over all
rows in the result set. For example, the following query determines
the number of daily driving records in the
driver_log table, and thus the total number of
days that drivers were on the road:

mysql> SELECT COUNT(*) FROM driver_log;
+----------+
| COUNT(*) |
+----------+
|       10 |
+----------+

But sometimes it's desirable to break a set of rows
into subgroups and summarize each group. This is done by using
aggregate functions in conjunction
with a GROUP BY clause. To
determine the number of days driven by each driver, group the rows by
driver name, count how many rows there are for each name, and display
the names with the counts:

mysql> SELECT name, COUNT(name) FROM driver_log GROUP BY name;
+-------+-------------+
| name  | COUNT(name) |
+-------+-------------+
| Ben   |           3 |
| Henry |           5 |
| Suzi  |           2 |
+-------+-------------+
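
If you'd like to try the per-driver count outside the mysql client, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for a MySQL server, and it loads only the name column of the ten driver_log rows listed later in this section. The one-column table layout and the added ORDER BY are assumptions of the sketch, not part of the original query.

```python
import sqlite3

# Driver names in rec_id order, copied from the driver_log listing in this section.
NAMES = ["Ben", "Suzi", "Henry", "Henry", "Ben",
         "Henry", "Suzi", "Henry", "Ben", "Henry"]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.execute("CREATE TABLE driver_log (name TEXT)")  # simplified one-column layout
conn.executemany("INSERT INTO driver_log VALUES (?)", [(n,) for n in NAMES])

# Overall summary: one value for the whole table.
total_days = conn.execute("SELECT COUNT(*) FROM driver_log").fetchone()[0]

# Per-group summary: one row per driver. ORDER BY makes the row order explicit.
days_per_driver = conn.execute(
    "SELECT name, COUNT(name) FROM driver_log GROUP BY name ORDER BY name"
).fetchall()

print(total_days)       # 10
print(days_per_driver)  # [('Ben', 3), ('Henry', 5), ('Suzi', 2)]
```

The per-driver counts add up to the overall count, which is a handy sanity check for any grouped summary.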

That query summarizes the same column used for grouping
(name), but that's not always
necessary. Suppose you want a quick characterization of the
driver_log table, showing for each person listed
in it the total number of miles driven and the average number of
miles per day. In this case, you still use the
name column to place the rows in groups, but the
summary functions operate on the miles values:

mysql> SELECT name,
    -> SUM(miles) AS 'total miles',
    -> AVG(miles) AS 'miles per day'
    -> FROM driver_log GROUP BY name;
+-------+-------------+---------------+
| name  | total miles | miles per day |
+-------+-------------+---------------+
| Ben   |         362 |      120.6667 |
| Henry |         911 |      182.2000 |
| Suzi  |         893 |      446.5000 |
+-------+-------------+---------------+
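
The same summary can be sketched in Python, again assuming the built-in sqlite3 module in place of MySQL. The (name, miles) pairs come from the driver_log listing later in this section. The two-column table layout, the ORDER BY, and the rounding to four decimals (to mirror the display above) are assumptions of the sketch.

```python
import sqlite3

# (name, miles) pairs in rec_id order, from the driver_log listing in this section.
TRIPS = [
    ("Ben", 152), ("Suzi", 391), ("Henry", 300), ("Henry", 96), ("Ben", 131),
    ("Henry", 115), ("Suzi", 502), ("Henry", 197), ("Ben", 79), ("Henry", 203),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.execute("CREATE TABLE driver_log (name TEXT, miles INTEGER)")
conn.executemany("INSERT INTO driver_log VALUES (?, ?)", TRIPS)

# Group on name, but summarize a different column (miles).
rows = conn.execute(
    "SELECT name, SUM(miles), AVG(miles) FROM driver_log "
    "GROUP BY name ORDER BY name"
).fetchall()

# Round the average to four decimals for display purposes only.
summary = [(name, total, round(avg, 4)) for name, total, avg in rows]
for name, total, avg in summary:
    print(f"{name:<6} {total:>5} {avg:>10.4f}")
```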

Use as many grouping columns as necessary to achieve as fine-grained
a summary as you require. The following query produces a coarse
summary showing how many messages were sent by each message sender
listed in the mail table:

mysql> SELECT srcuser, COUNT(*) FROM mail
    -> GROUP BY srcuser;
+---------+----------+
| srcuser | COUNT(*) |
+---------+----------+
| barb    |        3 |
| gene    |        6 |
| phil    |        5 |
| tricia  |        2 |
+---------+----------+

To be more specific and find out how many messages each sender sent
from each host, use two grouping columns. This produces a result with
nested groups (groups within groups):

mysql> SELECT srcuser, srchost, COUNT(*) FROM mail
    -> GROUP BY srcuser, srchost;
+---------+---------+----------+
| srcuser | srchost | COUNT(*) |
+---------+---------+----------+
| barb    | saturn  |        2 |
| barb    | venus   |        1 |
| gene    | mars    |        2 |
| gene    | saturn  |        2 |
| gene    | venus   |        2 |
| phil    | mars    |        3 |
| phil    | venus   |        2 |
| tricia  | mars    |        1 |
| tricia  | saturn  |        1 |
+---------+---------+----------+
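
To see how the coarse and fine summaries relate, here is a small sketch that assumes Python's sqlite3 module in place of MySQL. The full mail table isn't listed in this section, so the sketch rebuilds only its srcuser and srchost columns from the counts shown in the two-column result above. The two-column table layout is an assumption made for illustration.

```python
import sqlite3

# (srcuser, srchost, count) triples copied from the two-column summary above.
PAIR_COUNTS = [
    ("barb", "saturn", 2), ("barb", "venus", 1),
    ("gene", "mars", 2), ("gene", "saturn", 2), ("gene", "venus", 2),
    ("phil", "mars", 3), ("phil", "venus", 2),
    ("tricia", "mars", 1), ("tricia", "saturn", 1),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.execute("CREATE TABLE mail (srcuser TEXT, srchost TEXT)")  # reduced layout
for user, host, n in PAIR_COUNTS:
    # Insert one row per message implied by the count.
    conn.executemany("INSERT INTO mail VALUES (?, ?)", [(user, host)] * n)

by_user = conn.execute(
    "SELECT srcuser, COUNT(*) FROM mail GROUP BY srcuser ORDER BY srcuser"
).fetchall()
by_user_host = conn.execute(
    "SELECT srcuser, srchost, COUNT(*) FROM mail "
    "GROUP BY srcuser, srchost ORDER BY srcuser, srchost"
).fetchall()

# Adding up the nested counts for each sender gives back the coarse counts.
rollup = {}
for user, _host, n in by_user_host:
    rollup[user] = rollup.get(user, 0) + n

print(by_user)  # [('barb', 3), ('gene', 6), ('phil', 5), ('tricia', 2)]
print(rollup == dict(by_user))  # True
```

Each extra grouping column splits the existing groups further; it never moves a row from one sender's group to another's.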

Getting Distinct Values Without Using DISTINCT

If you use GROUP BY without selecting the value of any aggregate
function, you achieve the same effect as DISTINCT without using
DISTINCT explicitly:

mysql> SELECT name FROM driver_log GROUP BY name;
+-------+
| name  |
+-------+
| Ben   |
| Henry |
| Suzi  |
+-------+

Normally with this kind of query you'd select a
summary value (for example, by invoking
COUNT(name) to count the instances of each name),
but it's legal not to. The net effect is to produce
a list of the unique grouped values. I prefer to use
DISTINCT, because it makes the point of the query
more obvious. (Internally, MySQL actually maps the
DISTINCT form of the query onto the
GROUP BY form.)
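
A quick way to convince yourself that the two forms return the same set of values is to run both and compare. This sketch assumes Python's sqlite3 module rather than MySQL, and a one-column version of driver_log holding the names from the listing later in this section. It sorts both results in Python because neither query promises any particular row order.

```python
import sqlite3

# Driver names in rec_id order, from the driver_log listing in this section.
NAMES = ["Ben", "Suzi", "Henry", "Henry", "Ben",
         "Henry", "Suzi", "Henry", "Ben", "Henry"]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.execute("CREATE TABLE driver_log (name TEXT)")  # simplified one-column layout
conn.executemany("INSERT INTO driver_log VALUES (?)", [(n,) for n in NAMES])

# Same set of values either way; sort in Python to compare independent of order.
grouped = sorted(conn.execute("SELECT name FROM driver_log GROUP BY name"))
distinct = sorted(conn.execute("SELECT DISTINCT name FROM driver_log"))

print(grouped)              # [('Ben',), ('Henry',), ('Suzi',)]
print(grouped == distinct)  # True
```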

The preceding examples in this section have used COUNT( ), SUM( ), and
AVG( ) for per-group summaries. You can use MIN( ) or MAX( ), too. With
a GROUP BY clause, they tell you the smallest or largest value per
group. The following query groups mail table rows by message sender,
displaying for each one the size of the largest message sent and the
date of the most recent message:

mysql> SELECT srcuser, MAX(size), MAX(t) FROM mail GROUP BY srcuser;
+---------+-----------+---------------------+
| srcuser | MAX(size) | MAX(t)              |
+---------+-----------+---------------------+
| barb    |     98151 | 2001-05-14 14:42:21 |
| gene    |    998532 | 2001-05-19 22:21:51 |
| phil    |     10294 | 2001-05-17 12:49:23 |
| tricia  |   2394482 | 2001-05-14 17:03:01 |
+---------+-----------+---------------------+
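
The mail table's rows aren't listed in this section, so the following sketch applies the same idea to driver_log instead, using the rows from the listing later in this section. It assumes Python's sqlite3 module in place of MySQL, a three-column table layout, and dates stored as ISO-format text (which compare correctly as strings).

```python
import sqlite3

# (name, trav_date, miles) in rec_id order, from the driver_log listing.
TRIPS = [
    ("Ben", "2001-11-30", 152), ("Suzi", "2001-11-29", 391),
    ("Henry", "2001-11-29", 300), ("Henry", "2001-11-27", 96),
    ("Ben", "2001-11-29", 131), ("Henry", "2001-11-26", 115),
    ("Suzi", "2001-12-02", 502), ("Henry", "2001-12-01", 197),
    ("Ben", "2001-12-02", 79), ("Henry", "2001-11-30", 203),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.execute("CREATE TABLE driver_log (name TEXT, trav_date TEXT, miles INTEGER)")
conn.executemany("INSERT INTO driver_log VALUES (?, ?, ?)", TRIPS)

# Shortest trip, longest trip, and most recent travel date per driver.
# Each aggregate is computed independently over the driver's rows.
extremes = conn.execute(
    "SELECT name, MIN(miles), MAX(miles), MAX(trav_date) "
    "FROM driver_log GROUP BY name ORDER BY name"
).fetchall()

for row in extremes:
    print(row)
```

Note that the values in one output row need not come from the same table row. Ben's longest trip (152 miles) was on 2001-11-30, but his most recent date is 2001-12-02.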

You can group by multiple
columns and display a maximum for each combination of values in those
columns. This query finds the size of the largest message sent
between each pair of sender and recipient values listed in the
mail table:

mysql> SELECT srcuser, dstuser, MAX(size) FROM mail GROUP BY srcuser, dstuser;
+---------+---------+-----------+
| srcuser | dstuser | MAX(size) |
+---------+---------+-----------+
| barb    | barb    |     98151 |
| barb    | tricia  |     58274 |
| gene    | barb    |      2291 |
| gene    | gene    |     23992 |
| gene    | tricia  |    998532 |
| phil    | barb    |     10294 |
| phil    | phil    |      1048 |
| phil    | tricia  |      5781 |
| tricia  | gene    |    194925 |
| tricia  | phil    |   2394482 |
+---------+---------+-----------+

When using aggregate functions to produce per-group summary values,
watch out for the following trap. Suppose you want to know the
longest trip per driver in the driver_log table.
That's produced by this query:

mysql> SELECT name, MAX(miles) AS 'longest trip'
    -> FROM driver_log GROUP BY name;
+-------+--------------+
| name  | longest trip |
+-------+--------------+
| Ben   |          152 |
| Henry |          300 |
| Suzi  |          502 |
+-------+--------------+

But what if you also want to show the date on which each
driver's longest trip occurred? Can you just add
trav_date to the output column list? Sorry, that
won't work:

mysql> SELECT name, trav_date, MAX(miles) AS 'longest trip'
    -> FROM driver_log GROUP BY name;
+-------+------------+--------------+
| name  | trav_date  | longest trip |
+-------+------------+--------------+
| Ben   | 2001-11-30 |          152 |
| Henry | 2001-11-29 |          300 |
| Suzi  | 2001-11-29 |          502 |
+-------+------------+--------------+

The query does produce a result, but if you compare it to the full
table (shown below), you'll see that although the
dates for Ben and Henry are correct, the date for Suzi is not:


+--------+-------+------------+-------+
| rec_id | name  | trav_date  | miles |
+--------+-------+------------+-------+
|      1 | Ben   | 2001-11-30 |   152 |   <-- Ben's longest trip
|      2 | Suzi  | 2001-11-29 |   391 |
|      3 | Henry | 2001-11-29 |   300 |   <-- Henry's longest trip
|      4 | Henry | 2001-11-27 |    96 |
|      5 | Ben   | 2001-11-29 |   131 |
|      6 | Henry | 2001-11-26 |   115 |
|      7 | Suzi  | 2001-12-02 |   502 |   <-- Suzi's longest trip
|      8 | Henry | 2001-12-01 |   197 |
|      9 | Ben   | 2001-12-02 |    79 |
|     10 | Henry | 2001-11-30 |   203 |
+--------+-------+------------+-------+

So what's going on? Why does the summary query produce incorrect
results? This happens because when you include a GROUP BY clause in a
query, the only values you can meaningfully select are the grouping
columns and summary values calculated over each group. If you display
additional columns, they're not tied to the grouped columns and the
values displayed for them are indeterminate. (For the query just
shown, it appears that MySQL may simply be picking the first date for
each driver, whether or not it matches the driver's maximum mileage
value.) In MySQL 5.7.5 and later, the default SQL mode includes
ONLY_FULL_GROUP_BY, and a query like this one is rejected with an
error instead of returning indeterminate values.

The general solution to the problem of displaying contents of rows
associated with minimum or maximum group values involves a join. The
technique is described in Chapter 12. If you don't want to read ahead,
or you don't want to use another table, consider using the MAX-CONCAT
trick described earlier. It produces the correct result, although the
query is fairly ugly. (The padding width of 3 assumes that no miles
value has more than three digits.)

mysql> SELECT name,
    -> SUBSTRING(MAX(CONCAT(LPAD(miles,3,' '), trav_date)),4) AS date,
    -> LEFT(MAX(CONCAT(LPAD(miles,3,' '), trav_date)),3) AS 'longest trip'
    -> FROM driver_log GROUP BY name;
+-------+------------+--------------+
| name  | date       | longest trip |
+-------+------------+--------------+
| Ben   | 2001-11-30 | 152          |
| Henry | 2001-11-29 | 300          |
| Suzi  | 2001-12-02 | 502          |
+-------+------------+--------------+
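
For comparison, here is one way the join-based approach might look. This is a sketch under stated assumptions, not necessarily the form used in Chapter 12: it uses Python's sqlite3 module in place of MySQL, a three-column version of driver_log, and a derived table (a subquery in the FROM clause) that holds each driver's maximum mileage. The aliases d and m are made up for the example.

```python
import sqlite3

# (name, trav_date, miles) in rec_id order, from the driver_log listing.
TRIPS = [
    ("Ben", "2001-11-30", 152), ("Suzi", "2001-11-29", 391),
    ("Henry", "2001-11-29", 300), ("Henry", "2001-11-27", 96),
    ("Ben", "2001-11-29", 131), ("Henry", "2001-11-26", 115),
    ("Suzi", "2001-12-02", 502), ("Henry", "2001-12-01", 197),
    ("Ben", "2001-12-02", 79), ("Henry", "2001-11-30", 203),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.execute("CREATE TABLE driver_log (name TEXT, trav_date TEXT, miles INTEGER)")
conn.executemany("INSERT INTO driver_log VALUES (?, ?, ?)", TRIPS)

# Step 1 (inner query): find each driver's maximum mileage.
# Step 2 (join): match that summary back to the full rows to pick up the date.
longest_trips = conn.execute(
    """
    SELECT d.name, d.trav_date, d.miles
    FROM driver_log AS d
    JOIN (SELECT name, MAX(miles) AS miles
          FROM driver_log GROUP BY name) AS m
      ON d.name = m.name AND d.miles = m.miles
    ORDER BY d.name
    """
).fetchall()

for row in longest_trips:
    print(row)
```

Unlike the string-based trick, the join returns every matching row, so a driver with two trips tied for the maximum would appear twice.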

Tuesday, November 3, 2009
