URL Encoding
By themselves, URLs are nothing but
alphanumeric strings, with some other symbols thrown in. The character set
chosen to express a URL string consists of the following symbols:
Symbols | Values |
Alphanumeric symbols | A-Z, a-z, 0-9 |
Reserved symbols | ; / ? : @ & = + $ , < > # % " |
Other special characters | - _ . ! ~ * ' ( ) {} | \^ [ ] ` |
For the most part, a URL string consists of
letters, numbers, and reserved symbols that have special meaning within the URL
string. Other special characters are found in some URL strings, although they
don't have any special meaning as far as the URL is concerned. However, they
may have special meaning for the Web server receiving the URL or the
application that is requested via the Web server.
Interpretations of some of these special
characters are presented in Table 5-2.
Meta-Characters
Characters such as * and ; and | and ` have
special meanings as meta-characters in applications and scripts. These
characters don't affect the URL in any way, but if they end up making their way
into applications, they may change the meaning of the input altogether and
sometimes create gaping security holes.
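To make the risk concrete, here is a minimal sketch of a check an application might run on a parameter value before passing it to a script or shell. The function name `contains_meta_characters` and the sample values are hypothetical, and the set covers only the four meta-characters named above; a real application would more likely accept only a known-good set of characters.

```python
# Hypothetical helper: flags the four meta-characters mentioned in the text.
META_CHARACTERS = set("*;|`")

def contains_meta_characters(value: str) -> bool:
    """Return True if the value contains any of * ; | or a back-tick."""
    return any(ch in META_CHARACTERS for ch in value)

# A harmless file name versus one that smuggles in a second command.
print(contains_meta_characters("pride.txt"))                    # False
print(contains_meta_characters("pride.txt; cat /etc/passwd"))   # True
```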
Table 5-2. Special | |
Special Characters | Interpretation |
? | Query String separator. The part of the URL string to the right |
& | Parameter delimiter. Used to separate name=value parameter pairs |
= | Separates the parameter name from the parameter value while |
+ | Is translated into a space. |
: | Protocol separator. The portion of the URL string from the |
# | Used to specify an anchor point within a Web page. For example |
% | Used as an escape character for specifying hexadecimal encoded |
@ | Used in mailto: URLs while specifying Internet e-mail addresses |
~ | Used for specifying a user's home directory on a multiuser |
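As a rough illustration of how the separators in Table 5-2 partition a URL, the sketch below uses Python's standard `urllib.parse` module. The URL is a made-up example, not one taken from the text.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL chosen only to exercise the separators in Table 5-2.
url = "http://www.example.com/search.cgi?title=war+and+peace&lang=en#results"

parts = urlsplit(url)
print(parts.scheme)    # protocol: the text to the left of the first ':'
print(parts.query)     # Query String: the text between '?' and '#'
print(parts.fragment)  # anchor point: the text after '#'

# '&' separates the pairs, '=' splits name from value, '+' becomes a space.
params = parse_qsl(parts.query)
print(params)
```

Here `parse_qsl` returns the pairs `("title", "war and peace")` and `("lang", "en")`, showing the `+` to space translation listed in the table.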
Many meta-characters are interpreted
differently by different Web servers. Table 5-3 describes
how various meta-characters are interpreted inside applications.
Specifying Special Characters on the URL String
The question that arises now is, "What
if we want to specify special characters such as % or ? or & or + without
giving them any special meaning?" For example, suppose we want to pass two
parameters, book=pride&prejudice and shipping=snailmail, on the Query
String. In this case, the URL is:
http://mycheapbookshop.com/purchase.cgi?book=pride&prejudice&shipping=snailmail
Table 5-3. Meta-Characters | |
Meta-Character | Interpretation/Use |
* | The star character is used as a wild card or a file globbing |
; | The semicolon character has many meanings in many different |
| | The pipe character, if sneaked through without proper checking, |
` | The grave accent character (commonly called a back-tick or a |
Meta-Characters and Input Validation
The single most prominent cause of over 90% |
The result is an ambiguous URL because there
are two & symbols in the Query String. Most likely, a Web server would
split such a Query String into three parameters instead of two, namely,
book=pride, prejudice= and shipping=snailmail.
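The split can be reproduced with Python's standard `urllib.parse.parse_qs`, used here as a stand-in for whatever parser a given Web server applies. Server behavior varies, so treat this as a sketch of one plausible interpretation.

```python
from urllib.parse import parse_qs

# The ambiguous Query String from the example, with an unencoded '&'.
query = "book=pride&prejudice&shipping=snailmail"

# keep_blank_values=True retains "prejudice", which has no value.
parsed = parse_qs(query, keep_blank_values=True)
print(parsed)
# {'book': ['pride'], 'prejudice': [''], 'shipping': ['snailmail']}
```

The parser reports three parameters, and the book title has been cut down to "pride".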
If we want to pass the & symbol as part
of the parameter value, the URL specification allows us to express reserved and
special characters in a two-digit hexadecimal encoded ASCII format, prefixed
with a % symbol, as follows:
Characters | Hex Values |
All hex encoded characters | %XX (%00-%FF) |
Control characters | %00-%1F, %7F |
Upper 8-bit ASCII characters | %80-%FF |
Space | %20 or + |
Carriage return | %0d |
Line feed | %0a |
In the preceding example, the ASCII value of
the & symbol is 38 in decimal and 26 in hexadecimal. Therefore, if we want
to express the & symbol, we can use %26 in its place. The URL in the
example would become:
http://mycheapbookshop.com/purchase.cgi?book=pride%26prejudice&shipping=snailmail
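In practice the encoding is rarely done by hand. The sketch below uses Python's standard `urllib.parse` functions to produce the encoded Query String and to confirm that it decodes back to two parameters. The value "snail mail" with a space is an added illustration, not part of the original example.

```python
from urllib.parse import quote, quote_plus, urlencode, parse_qs

# '&' is ASCII 38 decimal, 26 hexadecimal, so it is encoded as %26.
print(quote("pride&prejudice", safe=""))   # pride%26prejudice

# A space may be written as %20 or as '+', depending on the function used.
print(quote("snail mail"))                 # snail%20mail
print(quote_plus("snail mail"))            # snail+mail

# urlencode encodes each value, then joins the pairs with '&'.
query = urlencode({"book": "pride&prejudice", "shipping": "snailmail"})
print(query)                               # book=pride%26prejudice&shipping=snailmail

# Decoding now yields two parameters, with the '&' restored in the value.
decoded = parse_qs(query)
print(decoded)
```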
Unicode Encoding
Hexadecimal ASCII encoding, while adequate
for the most part, isn't broad enough to represent character sets
larger than 256 symbols. Most modern operating systems and applications support
multibyte representations of character sets of languages other than English. Microsoft's
IIS Web server supports URLs containing characters encoded with the multibyte UCS
Transformation Format (UTF-8), in addition to hexadecimal ASCII encoding.
The Acme Art, Inc., |
The Universal Character Set (UCS) is defined
by the International Organization for Standardization's (ISO) draft ISO 10646. Although UCS is
maintained by ISO, a separate group was formed (primarily by software vendors)
to allow representation of a variety of character sets with one unified scheme.
This group came to be known as the Unicode Consortium (http://www.unicode.org). As
standards were developed, both groups decided to adopt a common representation
scheme so that the computing world didn't have to deal with separate standards
for the same thing. UTF-8 encoding is defined in ISO 10646-1:2000 and in RFC
2279. For operating systems that have been designed around the ASCII character
encoding scheme, UTF-8 allows for easy conversion and representation of multibyte
Unicode characters using ASCII mappings.
Without going into the intricacies of how
UTF-8 works, let's look at Unicode encoding from a URL's point of view. Two-byte
Unicode characters are encoded by using %uXXYY, where XX and YY are hexadecimal
values of the higher and lower byte respectively. For the standard ASCII
characters %00 to %FF, the Unicode representation is %u0000 to %u00FF. The Web
server decodes 16 bits at a time when dealing with Unicode encoded symbols.
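The %uXXYY form can be decoded with a few lines of code. As far as I know, Python's `urllib.parse.unquote` does not handle this notation, so the sketch below uses a regular expression instead. The function name `decode_percent_u` is hypothetical, and the sketch assumes each sequence names a single 16-bit character, as described above.

```python
import re

# Matches %u followed by exactly four hexadecimal digits (XX and YY).
_PERCENT_U = re.compile(r"%u([0-9a-fA-F]{4})")

def decode_percent_u(text: str) -> str:
    """Replace each %uXXYY sequence with the 16-bit character it names."""
    return _PERCENT_U.sub(lambda m: chr(int(m.group(1), 16)), text)

# Standard ASCII characters have a zero high byte: %u0041 is 'A'.
print(decode_percent_u("%u0041%u0042"))   # AB

# %u002F is '/', the same character that %2F represents in hex encoding.
print(decode_percent_u("a%u002Fb"))       # a/b
```

The second call shows that one character can have more than one encoded form, which matters to any application that inspects a URL before decoding it.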