Wednesday, November 18, 2009

2.4 Denormalization



[ Team LiB ]






2.4 Denormalization






Denormalization is
the process of consciously removing entities created through the
normalization process. An unnormalized database is
not a denormalized database. A database can be
denormalized only after it has been sufficiently normalized, and
solid justifications need to exist to support every act of
denormalization.



Nevertheless, fully normalized databases can require complex
programming and generally require more joins than their unnormalized
or denormalized counterparts. Joins are resource-intensive
operations; thus, the more joins, the more time a query will take.



To deal with queries that take too long or are too complex to be
maintainable, a database architect denormalizes the database. As we
have seen from the process of normalization, each lower normal form
introduces database anomalies that can compromise the integrity,
maintainability, and extensibility of the database. Denormalization
is thus a reasoned trade-off between query complexity/performance and
system integrity, maintenance, and extensibility.




The Perils of Denormalization



Denormalization must be approached with caution. In general, a table
should have proven it requires denormalization in testing or even in
production before you actually denormalize it. Data architects very
commonly denormalize based on hunches about performance or experience
with similar applications in the past梐 practice that leads
down the path to a poorly designed database.



Denormalization can in some circumstances incur performance
penalties. More important, however, most of the time you do not see
the kinds of performance improvements from denormalization that
actually make a difference. When you denormalize without concrete
performance benchmarks backing the denormalization, you end up:



  • Denormalizing tables without appreciable performance improvement

  • Denormalizing again later, after you have done performance testing


The result is a database that looks more unnormalized than
denormalized. The best rule of thumb is to prove the database needs
denormalization and document that need for the people who will be
maintaining the database. Subsequently, you should prove that your
denormalization actually improves performance and back out the
changes if they fail to address the performance concerns.





In most cases, you can deal with complexity simply by creating views
that hide the complexity. Performance is thus the general
driver of denormalization. To determine whether denormalization makes
sense, I recommend

Craig
Mullins's simple guidelines posted in an online
article for The Data Administration Newsletter
in an article called "Denormalization
Guidelines" (http://www.tdan.com/i001fe02.htm):



  • Can you achieve performance goals without denormalization?

  • Will the system still fail to achieve performance goals with
    denormalization?

  • Will the system be less reliable as a result of denormalization?


If you answer "yes" to any of these
questions, you should not use denormalization as your performance
tuning tool.




BEST PRACTICE: Denormalize only when you have concrete proof that denormalization will boost performance.




The most common temptation to denormalize comes from queries that
require joins to retrieve a single value. Any query pulling a
film's suitability for children along with the film
from the database would fall into this category. For example:



SELECT Film.title, Film.language, Film.year, Rating.forChildren
FROM Film, Rating
WHERE Film.filmID = 2
AND Film.rating = Rating.code;


Denormalization would move the rating code and suitability for
children back into the Film table.
Wouldn't the query perform much better without that
join? Actually, it probably would not perform
noticeably better梩he join is done using a
unique index (Rating.code). Denormalization,
however, would incur all of the anomalies that led us to normalize
the table in the first place.



A better candidate for normalization might be pulling a state name
into an Address table along with the state code
used in the join. If most queries actually want the state name and
the query would definitely benefit from avoiding the join to the
State table, it can make sense to add an extra
column to Address for
stateName. You do not, however, remove the
State table. This denormalization
works梐ssuming real performance benefits are achieved for the
application梑ecause the state name is a candidate key for the
State table. Though stateName
would technically be a transitive depencency in the
Address table (and thus violate 3NF), its status
as a candidate key for State makes it almost a
functional depencency and consequently almost acceptable to put into
the Address table.



A common situation in which performance does truly become a problem
is reporting. For reporting, database normalization is just one of
many factors that lead to performance degradation. Because complex
reports generally eat server resources regardless of normalization
issues, it is generally a bad idea to empower users to execute
complex reports against live tables. Instead, you can denormalize by
replicating the data into special tables designed to support
reporting needs. To create a table for reporting on westerns, we
might create a WesternReport table that looks like
the table in Table 2-9.



Table 2-9. A table for reporting on westerns

Attribute



Domain



Notes



NULL?



filmID



BIGINT



PRIMARY KEY



No



title



VARCHAR(100)


 

No



rating



CHAR(5)


 

Yes



forChildren



CHAR(1)



DEFAULT 'N'



No



otherGenres



VARCHAR(255)


 

Yes



directors



VARCHAR(255)


 

Yes



actors



VARCHAR(255)


 

Yes



ranking



INT


 

No



year



INT


 

No




Reporting on all of the westerns from 1992 would look like this:



SELECT * FROM WesternReport
WHERE year = 1992;


The alternative is to have users constantly executing the following
query against the tables that actually maintain your data:



SELECT Film.filmID, Film.title, Film.rating,
IFNULL(Rating.forChildren, 'N'), Film.ranking, Film.year
FROM Film, FilmGenre
LEFT OUTER JOIN Rating ON Film.rating = Rating.code

WHERE FilmGenre.code = 'WES'
AND year = 1992
AND Film.filmID = FilmGenre.filmID;


Use follow-up queries to get other genres, directors, and actors
associated with the film.







    [ Team LiB ]



    No comments: