[ Team LiB ] |
URL Abbreviation with mod_rewrite

URL abbreviation is one of the most effective techniques you can use to optimize your HTML. First seen on Yahoo!'s home page, URL abbreviation substitutes short redirect URLs (like r/ci) for longer ones (like Computers_and_Internet/) to save space. The Apache and IIS web servers, as well as Manila (http://www.userland.com) and Zope (http://www.zope.org), all support this technique. In Apache, the mod_rewrite module transparently handles URL expansion. For IIS, ISAPI filters handle URL rewrites. Here are some IIS rewriting filters:
URL abbreviation is especially effective for home or index pages, which typically have a lot of links. As you will discover in Chapter 19, "Case Studies: Yahoo.com and WebReference.com," URL abbreviation can trim 20 to 30 percent off your HTML file size. The more links you have, the more you'll save.

NOTE: As with most of these techniques, there's always a tradeoff. Using abbreviated URLs can lower search engine relevance, although you can alleviate this somewhat with clever expansions with mod_rewrite.

The popular Apache web server[9] has an optional module, mod_rewrite, that enables your server to automatically rewrite URLs.[10] Created by Ralf Engelschall, this versatile module has been called the "Swiss Army knife of URL manipulation."[11] mod_rewrite can handle everything from URL layout and load balancing to access restriction. We'll use only a small portion of this module's power: regular expressions that substitute expanded URLs for abbreviated ones.
The module first examines each requested URL. If it matches one of the patterns you specify, the URL is rewritten according to the rule conditions you set. Essentially, mod_rewrite replaces one URL with another, allowing abbreviations and redirects. This URL rewriting machine manipulates URLs based on various tests including environment variables, time stamps, and even database lookups. You can have up to 50 global rewrite rules without any discernible effect on server performance.[12] Abbreviated URI expansion requires only one.
Tuning mod_rewrite

To install mod_rewrite on your Apache web server, you or your IT department needs to edit one of your server configuration files. The best way to run mod_rewrite is through the httpd.conf file, because it is read only once, at server startup. Without configuration file access, you'll have to use .htaccess for each directory. Keep in mind that the same mod_include performance caveats apply; .htaccess files are slower because, for every requested URL, the server must traverse each directory in the path and read any .htaccess file it finds there.

The Abbreviation Challenge

The strategy is to abbreviate the longest, most frequently accessed URLs with the shortest abbreviations. Most webmasters choose one-, two-, or three-letter abbreviations for directories. On WebReference.com, the goal was to create a mod_rewrite rule that would expand URLs like this:
Into this:
As on Yahoo!, r is the flag we've chosen for redirects. But why stop there? We can extend this concept into more directories. So turn this:
Into this:
and so on. Note that the lack of a trailing forward slash in this second example allows us to intelligently append column numbers. With the right RewriteRule, the abbreviation /r/c/66 expands into the string /dhtml/column66/.

The RewriteRule Solution

To accomplish this expansion, you need to write a RewriteRule regular expression. First, you need to find the URI pattern /r/d, and then extract /d and turn it into /dhtml. Next, append a trailing slash. Apache requires two directives to turn on and configure the mod_rewrite module: the RewriteEngine Boolean and the RewriteRule directive. The RewriteRule directive uses a regular expression to transform one URI into another. The syntax is shown here:
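As a reference point, the mod_rewrite documentation describes the directive as a pattern, a substitution, and an optional bracketed list of flags. The sketch below shows that general form in comments, followed by one concrete rule; the /old/ and /new/ paths are hypothetical placeholders, not part of the WebReference.com setup.

```apache
# General form (kept in comments, since the placeholders are not valid config):
#   RewriteEngine On|Off
#   RewriteRule Pattern Substitution [flags]

# Turn the rewriting engine on for this server or virtual host.
RewriteEngine On

# Hypothetical example: rewrite /old/anything to /new/anything.
# $1 is whatever the parenthesized (.*) captured; [L] stops further rule processing.
RewriteRule ^/old/(.*) /new/$1 [L]
```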
So to create a RewriteRule to solve this problem, you need to add the following two mod_rewrite directives to your server configuration file (httpd.conf or .htaccess):
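A minimal sketch of those two directives, inferred from the explanation of the pattern rather than taken from a verified listing. It assumes the rule lives in httpd.conf, where the pattern is matched against the full URL path; in a per-directory .htaccess file the directory prefix is stripped before matching, so the pattern would need adjusting.

```apache
# Enable the rewriting engine.
RewriteEngine On

# Expand /r/d... into /dhtml...
#   ^/r/d  anchors the match at the start of the URL path
#   (.*)   captures everything after the d (possibly nothing) as $1
RewriteRule ^/r/d(.*) /dhtml$1
```

With this rule in place, a request for /r/d/diner/ is rewritten to /dhtml/diner/.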
This regular expression matches a URL that begins with /r/ (the ^ character at the beginning means to match from the beginning of the string). Following that pattern is d(.*), which matches zero or more characters after the d. Note that /r/dodo would expand to /dhtmlodo, so make sure that anything after /r/d always begins with a /. So when a request comes in for the URI <a href="/r/d/diner/">DHTML Diner</a>, this rule expands the abbreviated URI into <a href="/dhtml/diner/">DHTML Diner</a>.

The RewriteMap Solution for Multiple Abbreviations

The RewriteRule solution would work well for a few abbreviations, but what if you want to abbreviate a large number of links? That's where the RewriteMap directive comes in. This feature allows you to group multiple lookup keys (abbreviations) and their corresponding expanded values into one tab-delimited file. Here's an example map file (at /www/misc/redir/abbr_webref.txt):
The MapName specifies a mapping function between keys and values for a rewriting rule using the following syntax:
When you use a mapping construct, you generalize the RewriteRule regular expression. Instead of a hard-coded value, the MapName is consulted and the LookupKey is looked up. If there is a key match, the mapping function substitutes the expanded value into the substitution string. If there is no match, the rule substitutes the default value or a blank string. To use this external map file, we'll add the RewriteMap directive and tweak the regular expression correspondingly. The following httpd.conf commands turn rewriting on, show where to look for your rewrite map, and show the definition of the RewriteRule:
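Pieced together from the description in this section, the three directives might look like the sketch below. The map name abbr, the map file path, and the pg entry come from the text; the other map entries shown in the comments are illustrative guesses, and the flag spelling is one of the forms mod_rewrite accepts. Note that RewriteMap has to be declared in the server or virtual host configuration, not in .htaccess.

```apache
# Map file /www/misc/redir/abbr_webref.txt holds one "key value" pair per line,
# for example (only the pg entry is taken from the text; d and ht are assumed):
#   d    dhtml/
#   pg   programming/
#   ht   html/tools/

# 1. Turn rewriting on.
RewriteEngine On

# 2. Declare a map named abbr backed by the plain-text file.
RewriteMap abbr txt:/www/misc/redir/abbr_webref.txt

# 3. Capture the abbreviation ($1) and the rest of the path ($2), look $1 up in
#    the map, and send a permanent (301) redirect; "last" ends rule processing.
RewriteRule ^/r/([^/]*)/?(.*) ${abbr:$1}$2 [redirect=permanent,last]

# Variant with a fallback when the key is missing (notfound/ is hypothetical):
# RewriteRule ^/r/([^/]*)/?(.*) ${abbr:$1|notfound/}$2 [redirect=permanent,last]
```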
The first directive turns on rewrites as before. The second points the rewrite module to the text version of our map file. The third tells the processor to look up the value of the matching expression in the map file. Note that the RewriteRule has two flags appended to it: permanent redirect (301 instead of 302) and last. Once an abbreviation is found for this URL, no further rewrite rules are processed for it, which speeds up lookups. Here we've set the rewrite MapName to abbr and the map file location (text format) to the following:
The RewriteRule processes requested URLs using the regular expression:
This regular expression matches a URL that begins with /r/. (The ^ character at the beginning means to match from the beginning of the string.) Then the regular expression ([^/]*) matches as many non-slash characters as it can, up to the next slash or the end of the string. This effectively pulls out the first string between two slashes following the /r. For example, in the URL /r/pg/javascript/, this portion of the regular expression matches pg. It also will match ht in /r/ht. (Because there are no slashes following, it just continues until it reaches the end of the URL.) The rest of the pattern, /?(.*), matches zero or one forward slash followed by any characters that remain. These two parenthesized expressions will be used in the replacement pattern.

The Replacement Pattern

The substitution (${abbr:$1}$2) is the replacement pattern that will be used to build the new URL. The $1 and $2 variables are backreferences to the first and second patterns found in the supplied URL. They represent the first set of parentheses and the second set of parentheses in the regular expression, respectively. Thus for /r/pg/javascript/, $1 = "pg" and $2 = "javascript/". Replacing these in the example produces the following:
The ${abbr:pg} is a mapping directive that says, "Refer to the map abbr (recall our map command, RewriteMap abbr txt:/www/misc/redir/abbr_webref.txt), look up the key pg, and return the corresponding data value for that key." In this case, that value is programming/. Thus the abbreviated URL, /r/pg/javascript/, is replaced by the following:
Voila! So you've effectively created an abbreviation expander using a regular expression and a mapping file. Using the preceding rewrite map file, the following URL expansions would occur:
The server, upon seeing a matching abbreviation in the map file, will automatically rewrite the URL to the longer value. But what happens if you have many keys in your RewriteMap file? Scanning a long text file every time a user clicks a link can slow down lookups. That's where binary hash files come in handy.

Binary Hash RewriteMap

For maximum speed, convert your text RewriteMap file into a binary *DBM hash file. This binary hash version of your key and value pairs is optimized for maximum lookup speed. Convert your text file with a DBM tool or the txt2dbm Perl script provided at http://httpd.apache.org/docs-2.0/mod/mod_rewrite.html.

NOTE: This example is specific to Apache on Unix. Your platform may vary.

Next, change the RewriteMap directive to point to your optimized DBM hash file:
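A sketch of the changed directive, assuming the converted file was written next to the text version. The DBM file name below is hypothetical, and the exact name and extension depend on the DBM library and conversion tool on your platform.

```apache
RewriteEngine On

# Point the abbr map at the binary DBM hash file instead of the text file.
# /www/misc/redir/abbr_webref is an assumed output name from the conversion step.
RewriteMap abbr dbm:/www/misc/redir/abbr_webref

# The rule itself is unchanged; only the map source differs.
RewriteRule ^/r/([^/]*)/?(.*) ${abbr:$1}$2 [redirect=permanent,last]
```

More recent Apache releases also ship an httxt2dbm utility that can perform the text-to-DBM conversion.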
That's the abbreviated version of how you set up link abbreviation on an Apache server. It is a bit of work, but once you've got your site hierarchy fixed, you can do this once and forget it. This technique saves space by allowing abbreviated URLs on the client side and shunting the longer actual URLs to the server. The delay using this technique is hardly noticeable. (If Yahoo! can do it, anyone can.)

Done correctly, the rewriting can be transparent to the client. The abbreviated URL is requested, the server expands it, and serves back the content at the expanded location without telling the browser what it has done. You also can use the /r/ flag or the RewriteLog directive to track click-throughs in your server logs.

This technique works well for sites that don't change very often: You would manually abbreviate your URIs to match your RewriteMap abbreviations stored on your server. But what about sites that are updated every day, or every hour, or every minute? Wouldn't it be nice if you could make the entire abbreviation process automatic? That's where the magic of Perl and cron jobs (or the Scheduled Tasks GUI in Windows) comes in.

Automatic URL Abbreviation

You can create a Perl or shell script (insert your favorite CGI scripting language here) to look for URLs that match the lookup keys in your map file and automatically abbreviate your URLs. We use this technique on WebReference.com's home page. To make it easy for other developers to auto-abbreviate their URLs, we've created an open source script called shorturls.pl. It is available at http://www.webreference.com/scripts/.

NOTE: XSLT gives you another way to abbreviate URLs automatically. Just create the correct templates to abbreviate all the local links in your files.

The shorturls.pl script allows you to abbreviate URLs automatically and exclude portions of your HTML code from optimization with simple XML tags (<NOABBREV> ...</NOABBREV>).
Using this URL abbreviation technique, we saved over 20 percent (5KB) off our 24KB hand-optimized front page. We could have saved even more space, but for various reasons, we excluded some URLs from abbreviation. This gives you an idea of the link abbreviation process, but what about all the other areas of WebReference? Here is a truncated version of our abbreviation file to give you an idea of what it looks like (the full version is available at http://www.webreference.com/scripts/):
Note that we use two- and three-letter abbreviations to represent longer URLs on WebReference.com. Yahoo! uses two-letter abbreviations throughout their home page. How brief you make your abbreviations depends on how many links you need to abbreviate, and how descriptive you want the URLs to be.

The URL Abbreviation/Expansion Process: Step by Step

To enable automatic link abbreviation (with shorturls.pl) and expansion (with mod_rewrite), do the following:
That's it. Now any new content that appears on your home page will be automatically abbreviated according to the RewriteMap file that you created, listing the abbreviations you want.