Wednesday, December 16, 2009

7.16 Date-Based Summaries








7.16.1 Problem



You want to produce a summary based on date or time values.





7.16.2 Solution



Use GROUP BY to categorize
temporal values into bins of the appropriate duration. Often this
will involve using expressions to extract the significant parts of
dates or times.





7.16.3 Discussion



To put records in time order, you use an ORDER BY clause to sort a
column that has a temporal type. If instead you want to summarize
records based on groupings into time intervals, you need to determine
how to categorize each record into the proper interval and use
GROUP BY to group them accordingly.



Sometimes you can use temporal values directly if they group
naturally into the desired categories. This is quite likely if a
table represents date or time parts using separate columns. For
example, the baseball1.com master ballplayer
table represents birth dates using separate year, month, and day
columns. To see how many ballplayers were born on each day of the
year, perform a calendar date summary that uses the month and day
values but ignores the year:



mysql> SELECT birthmonth, birthday, COUNT(*)
    -> FROM master
    -> WHERE birthmonth IS NOT NULL AND birthday IS NOT NULL
    -> GROUP BY birthmonth, birthday;
+------------+----------+----------+
| birthmonth | birthday | COUNT(*) |
+------------+----------+----------+
|          1 |        1 |       47 |
|          1 |        2 |       40 |
|          1 |        3 |       50 |
|          1 |        4 |       38 |
...
|         12 |       28 |       33 |
|         12 |       29 |       32 |
|         12 |       30 |       32 |
|         12 |       31 |       27 |
+------------+----------+----------+


A less fine-grained summary can be obtained by using only the month
values:



mysql> SELECT birthmonth, COUNT(*)
    -> FROM master
    -> WHERE birthmonth IS NOT NULL
    -> GROUP BY birthmonth;
+------------+----------+
| birthmonth | COUNT(*) |
+------------+----------+
|          1 |     1311 |
|          2 |     1144 |
|          3 |     1243 |
|          4 |     1179 |
|          5 |     1118 |
|          6 |     1105 |
|          7 |     1244 |
|          8 |     1438 |
|          9 |     1314 |
|         10 |     1438 |
|         11 |     1314 |
|         12 |     1269 |
+------------+----------+


Sometimes temporal values can be used directly, even when not
represented as separate columns. To determine how many drivers were
on the road and how many miles were driven each day, group the
records in the driver_log table by date:



mysql> SELECT trav_date,
    -> COUNT(*) AS 'number of drivers', SUM(miles) AS 'miles logged'
    -> FROM driver_log GROUP BY trav_date;
+------------+-------------------+--------------+
| trav_date  | number of drivers | miles logged |
+------------+-------------------+--------------+
| 2001-11-26 |                 1 |          115 |
| 2001-11-27 |                 1 |           96 |
| 2001-11-29 |                 3 |          822 |
| 2001-11-30 |                 2 |          355 |
| 2001-12-01 |                 1 |          197 |
| 2001-12-02 |                 2 |          581 |
+------------+-------------------+--------------+


However, this summary will grow lengthier as you add more records to
the table. At some point, the number of distinct dates likely will
become so large that the summary fails to be useful, and
you'd probably decide to change the category size
from daily to weekly or monthly.
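
As a rough sketch of that coarser grouping, the following query
collapses the driver_log records into monthly categories. The aliases
trav_year and trav_month are names chosen here for illustration, and
the explicit ORDER BY is added on the assumption that your server does
not sort grouped output on its own:

```sql
-- Monthly version of the daily driver_log summary.
-- COUNT(*) counts log records, so a driver with several entries in a
-- month is counted once per entry, not once per month.
SELECT YEAR(trav_date)  AS trav_year,
       MONTH(trav_date) AS trav_month,
       COUNT(*)         AS 'log entries',
       SUM(miles)       AS 'miles logged'
FROM driver_log
GROUP BY trav_year, trav_month
ORDER BY trav_year, trav_month;
```

If the table holds only the rows summarized in the daily output, this
should yield two rows: November 2001 with 7 entries and 1388 miles, and
December 2001 with 3 entries and 778 miles.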



When a temporal column
contains so many distinct values that it fails to categorize well,
it's typical for a summary to group records using
expressions that map the relevant parts of the date or time values
onto a smaller set of categories. For example, to produce a
time-of-day summary for records in the mail table,
do this:[1]


[1] Note that the result includes an entry only
for hours of the day actually represented in the data. To generate a
summary with an entry for every hour, use a join to fill in the
"missing" values. See Recipe 12.10.



mysql> SELECT HOUR(t) AS hour,
    -> COUNT(*) AS 'number of messages',
    -> SUM(size) AS 'number of bytes sent'
    -> FROM mail
    -> GROUP BY hour;
+------+--------------------+----------------------+
| hour | number of messages | number of bytes sent |
+------+--------------------+----------------------+
|    7 |                  1 |                 3824 |
|    8 |                  1 |                  978 |
|    9 |                  2 |                 2904 |
|   10 |                  2 |              1056806 |
|   11 |                  1 |                 5781 |
|   12 |                  2 |               195798 |
|   13 |                  1 |                  271 |
|   14 |                  1 |                98151 |
|   15 |                  1 |                 1048 |
|   17 |                  2 |              2398338 |
|   22 |                  1 |                23992 |
|   23 |                  1 |                10294 |
+------+--------------------+----------------------+
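
The join mentioned in the footnote might look something like the
following sketch. The hour_ref table is a hypothetical helper
introduced here for illustration; it is not one of the tables used in
this recipe, and Recipe 12.10 may well take a different approach:

```sql
-- Hypothetical reference table with one row per hour of the day.
CREATE TABLE hour_ref (h TINYINT UNSIGNED NOT NULL PRIMARY KEY);
INSERT INTO hour_ref (h) VALUES
  (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
  (12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23);

-- LEFT JOIN keeps every hour, including hours with no mail records.
-- COUNT(mail.t) counts only matched rows, so empty hours report 0;
-- SUM() returns NULL for them, which IFNULL() turns into 0.
SELECT hour_ref.h                AS hour,
       COUNT(mail.t)             AS 'number of messages',
       IFNULL(SUM(mail.size), 0) AS 'number of bytes sent'
FROM hour_ref LEFT JOIN mail ON HOUR(mail.t) = hour_ref.h
GROUP BY hour_ref.h
ORDER BY hour_ref.h;
```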


To produce a day-of-week summary instead, use the DAYOFWEEK( )
function:



mysql> SELECT DAYOFWEEK(t) AS weekday,
    -> COUNT(*) AS 'number of messages',
    -> SUM(size) AS 'number of bytes sent'
    -> FROM mail
    -> GROUP BY weekday;
+---------+--------------------+----------------------+
| weekday | number of messages | number of bytes sent |
+---------+--------------------+----------------------+
|       1 |                  1 |                  271 |
|       2 |                  4 |              2500705 |
|       3 |                  4 |              1007190 |
|       4 |                  2 |                10907 |
|       5 |                  1 |                  873 |
|       6 |                  1 |                58274 |
|       7 |                  3 |               219965 |
+---------+--------------------+----------------------+


To make the output more meaningful, you might want to use
DAYNAME( ) to display weekday names instead. However, because day
names sort lexically (for example, "Tuesday" sorts after "Friday"),
use DAYNAME( ) only for display purposes. Continue to group on the
numeric day values so that output rows sort that way:



mysql> SELECT DAYNAME(t) AS weekday,
    -> COUNT(*) AS 'number of messages',
    -> SUM(size) AS 'number of bytes sent'
    -> FROM mail
    -> GROUP BY DAYOFWEEK(t);
+-----------+--------------------+----------------------+
| weekday   | number of messages | number of bytes sent |
+-----------+--------------------+----------------------+
| Sunday    |                  1 |                  271 |
| Monday    |                  4 |              2500705 |
| Tuesday   |                  4 |              1007190 |
| Wednesday |                  2 |                10907 |
| Thursday  |                  1 |                  873 |
| Friday    |                  1 |                58274 |
| Saturday  |                  3 |               219965 |
+-----------+--------------------+----------------------+
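
On servers that run with the ONLY_FULL_GROUP_BY SQL mode enabled (the
default in recent MySQL releases), selecting DAYNAME(t) while grouping
only on DAYOFWEEK(t) is likely to be rejected, and grouped output is
not necessarily sorted unless you ask for it. A variant that should
work in that setting groups on both expressions and sorts explicitly.
The alias daynum is a name chosen here for illustration, and the query
returns it as an extra column:

```sql
-- Group on the numeric day and the name together (they always agree),
-- then sort on the numeric value so rows run Sunday through Saturday.
SELECT DAYOFWEEK(t) AS daynum,
       DAYNAME(t)   AS weekday,
       COUNT(*)     AS 'number of messages',
       SUM(size)    AS 'number of bytes sent'
FROM mail
GROUP BY daynum, weekday
ORDER BY daynum;
```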


A similar technique can be used for summarizing month-of-year
categories that are sorted by numeric value but displayed by month
name.
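
For instance, a month-of-year summary of the mail table might be
sketched like this, with MONTHNAME( ) supplying the display value and
MONTH( ) the sort order. The aliases monthnum and month_name are
illustrative, and the explicit ORDER BY assumes that your server does
not sort grouped output implicitly:

```sql
-- Month-of-year summary: display the name, sort by the number.
-- Records from the same month of different years fall into one row.
SELECT MONTH(t)     AS monthnum,
       MONTHNAME(t) AS month_name,
       COUNT(*)     AS 'number of messages',
       SUM(size)    AS 'number of bytes sent'
FROM mail
GROUP BY monthnum, month_name
ORDER BY monthnum;
```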



Uses for temporal categorizations are plentiful:




  • DATETIME or TIMESTAMP columns have the potential to contain many
    unique values. To produce daily summaries, strip off the time of
    day part to collapse all values occurring within a given day to the
    same value. Any of the following GROUP BY clauses will do this,
    though the last one is likely to be slowest:

    GROUP BY FROM_DAYS(TO_DAYS(col_name))
    GROUP BY YEAR(col_name), MONTH(col_name), DAYOFMONTH(col_name)
    GROUP BY DATE_FORMAT(col_name,'%Y-%m-%e')

  • To produce monthly or quarterly sales reports, group by
    MONTH(col_name) or QUARTER(col_name) to place dates into the
    correct part of the year.


  • To summarize web server activity, put your server's
    logs into MySQL and run queries that collapse the records into
    different time categories. Chapter 18 discusses how
    to do this for Apache.
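
As a sketch of the first two items, the queries below apply them to the
t column of the mail table. The first uses DATE( ), which is not among
the expressions listed above but which current MySQL versions provide,
to strip the time-of-day part. The second adds YEAR( ) to the monthly
grouping on the assumption that the data spans more than one year, so
that the same month of different years is kept apart. All alias names
are illustrative:

```sql
-- Daily summary: DATE() reduces each DATETIME value to its date part.
SELECT DATE(t)   AS mail_date,
       COUNT(*)  AS messages,
       SUM(size) AS bytes_sent
FROM mail
GROUP BY mail_date
ORDER BY mail_date;

-- Monthly summary that keeps each year's months separate.
SELECT YEAR(t)   AS mail_year,
       MONTH(t)  AS mail_month,
       COUNT(*)  AS messages,
       SUM(size) AS bytes_sent
FROM mail
GROUP BY mail_year, mail_month
ORDER BY mail_year, mail_month;
```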









