Tuesday, October 27, 2009

Section 6.7. Special Features of Strings










6.7. Special Features of Strings



6.7.1. Special or Control Characters


As in most other high-level and scripting languages, a backslash paired with another single character denotes a "special" character, usually a nonprintable one; the two-character pair is replaced by that single special character. These are the same special characters discussed above, the ones that are not interpreted when the raw string operator precedes a string containing them.
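The difference between an interpreted escape and a raw string can be sketched as follows. The variable names are illustrative only, and the snippet is written so that it should behave the same under Python 2 and Python 3:

```python
# Regular literal: backslash-n is replaced by a single NEWLINE character.
cooked = 'a\nb'
# Raw literal: the r prefix leaves the backslash and the n as two characters.
raw = r'a\nb'

print(len(cooked))   # 3 characters: a, NEWLINE, b
print(len(raw))      # 4 characters: a, backslash, n, b
print(repr(raw))     # repr() shows the backslash doubled
```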


In addition to the well-known characters such as NEWLINE ( \n ) and (horizontal) tab ( \t ), specific characters via their ASCII values may be used as well: \OOO or \xXX where OOO and XX are their respective octal and hexadecimal ASCII values. Here are the base 10, 8, and 16 representations of 0, 65, and 255:


 

               ASCII     ASCII     ASCII
Decimal        0         65        255
Octal          \000      \101      \377
Hexadecimal    \x00      \x41      \xFF



Special characters, including the backslash-escaped ones, can be stored in Python strings just like regular characters.
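As a small check of the octal and hexadecimal forms described above, the following sketch spells the same character three ways. The variable names are made up for the illustration:

```python
# Three spellings of the same one-character string (value 65).
by_octal = '\101'
by_hex = '\x41'
by_name = 'A'

print(by_octal == by_hex == by_name)   # True
print(ord(by_hex))                     # 65
print('%o %x' % (65, 65))              # 101 41

# The largest value that fits the \xXX form is 255, which is octal 377.
print('\377' == '\xff')                # True
```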


Another way in which Python strings differ from those in C is that they are not terminated by the NUL (\000) character (ASCII value 0). A NUL is just like any other backslash-escaped character: a Python string may contain any number of them, at any position, and they are no more special than the other control characters. Table 6.7 summarizes the escape characters supported by most versions of Python.


Table 6.7. String Literal Backslash Escape Characters

\X      Oct   Dec   Hex    Char   Description
\0      000   0     0x00   NUL    Null character
\a      007   7     0x07   BEL    Bell
\b      010   8     0x08   BS     Backspace
\t      011   9     0x09   HT     Horizontal tab
\n      012   10    0x0A   LF     Linefeed/Newline
\v      013   11    0x0B   VT     Vertical tab
\f      014   12    0x0C   FF     Form feed
\r      015   13    0x0D   CR     Carriage return
\033    033   27    0x1B   ESC    Escape
\"      042   34    0x22   "      Double quote
\'      047   39    0x27   '      Single quote/apostrophe
\\      134   92    0x5C   \      Backslash

Python has no single-letter escape for ESC: \e is not recognized and is left in the string as a backslash followed by e, so write the character as \033 or \x1b.
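The point made before the table about NUL characters can be checked directly. This is a minimal sketch with an arbitrary sample string:

```python
# NUL characters are ordinary members of the string; none of them ends it.
s = 'ab\0cd\0\0'

print(len(s))           # 7
print(s.count('\0'))    # 3
print(s.split('\0'))    # ['ab', 'cd', '', '']
```

One caution when writing \0 by hand: if octal digits follow it, they are read as part of the same escape, so '\012' is a single NEWLINE rather than a NUL followed by '12'.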



As mentioned before, explicit octal or hexadecimal character values can be given, and a NEWLINE can be escaped to continue a statement on the next line. The values that can be written this way range from 0 to 255 (octal 0377, hexadecimal 0xFF); ASCII proper covers only 0 through 127 (octal 0177, hexadecimal 0x7F).


\OOO    Octal value OOO (range is 000 to 377)
\xXX    'x' plus hexadecimal value XX (range is 0x00 to 0xFF)
\       escape NEWLINE for statement continuation
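The escaped NEWLINE is the least obvious of these forms, so here is a brief sketch of it both inside and outside a string literal. The names joined and total are illustrative:

```python
# A backslash at the very end of a line escapes the NEWLINE, so the
# literal continues on the next line with no line break in its value.
joined = 'abc\
def'
print(joined)          # abcdef

# The same escape continues a statement outside of a string literal.
total = 1 + \
        2
print(total)           # 3
```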


One use of control characters in strings is to serve as delimiters. In database or Internet/Web processing, it is more than likely that most printable characters are allowed as data items, meaning that they would not make good delimiters.


It becomes difficult to tell whether a given character is a delimiter or part of a data item, and by using a printable character such as a colon ( : ) as a delimiter, you limit the set of characters allowed in your data, which may not be desirable.


One popular solution is to employ seldom-used, nonprintable ASCII values as delimiters. These make ideal delimiters, freeing up the colon and the other printable characters for more important uses.
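A minimal sketch of the idea follows. The choice of ASCII 31 (the "unit separator") and the sample field values are assumptions made for the illustration, not something the text prescribes:

```python
# ASCII 31 (unit separator); picking this character is an assumption.
DELIM = '\x1f'

# Every field contains a colon, so a colon could not serve as the delimiter.
fields = ['10:30', 'http://example.com/a:b', 'note: ok']

record = DELIM.join(fields)      # pack the fields into one string
unpacked = record.split(DELIM)   # recover them; the colons are untouched

print(unpacked == fields)        # True
```

The scheme only works as long as the delimiter itself can never occur in the data, which is why a rarely used control character is a safer pick than anything printable.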




6.7.2. Triple Quotes


Although strings can be delimited by single or double quotes, it is often difficult to manipulate strings containing special or nonprintable characters, especially the NEWLINE character. Python's triple quotes come to the rescue by allowing strings to span multiple lines, including verbatim NEWLINEs, tabs, and any other special characters.


The syntax for triple quotes consists of three consecutive single or double quotes (used in pairs, naturally):


>>> hi = '''hi
... there'''
>>> hi # repr()
'hi\nthere'
>>> print hi # str()
hi
there


Triple quotes let the developer avoid playing quote and escape character games, all the while bringing at least a small chunk of text closer to WYSIWYG (what you see is what you get) format.
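To make the "quote and escape character games" concrete, here is a short comparison. The sample sentence is invented for the illustration:

```python
# With ordinary quotes, the embedded apostrophe has to be escaped.
single = 'She said, "it\'s done."'

# With triple quotes, both kinds of quote can appear as-is.
triple = '''She said, "it's done."'''

print(single == triple)   # True
```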


The most compelling use cases are large blocks of HTML or SQL, which would be very inconvenient to build by concatenation or to wrap with backslash escapes:


errHTML = '''
<HTML><HEAD><TITLE>
Friends CGI Demo</TITLE></HEAD>

<BODY><H3>ERROR</H3>
<B>%s</B><P>
<FORM><INPUT TYPE=button VALUE=Back
ONCLICK="window.history.back()"></FORM>
</BODY></HTML>
'''

cursor.execute('''
CREATE TABLE users (
login VARCHAR(8),
uid INTEGER,
prid INTEGER)
''')
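The two fragments above depend on context that is not shown (a value for %s and an existing cursor). A self-contained sketch, assuming the standard sqlite3 module with an in-memory database and a shortened, made-up template and row, might look like this:

```python
import sqlite3

# A triple-quoted template; %s is filled in with the string format operator.
err_html = '''
<HTML><BODY><H3>ERROR</H3>
<B>%s</B>
</BODY></HTML>
'''
page = err_html % 'Login name is missing'
print(page)

# In-memory database (an assumption), so the sketch leaves nothing on disk.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE users (
        login VARCHAR(8),
        uid INTEGER,
        prid INTEGER)
''')
# The sample row is hypothetical data for the demonstration.
cursor.execute('INSERT INTO users VALUES (?, ?, ?)', ('wesley', 1, 100))
row = cursor.execute('SELECT login, uid, prid FROM users').fetchone()
print(row)
conn.close()
```

The NEWLINEs and indentation inside the SQL literal are passed to the database as-is, which is harmless because SQL ignores extra whitespace.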




6.7.3. String Immutability


In Section 4.7.2, we discussed how strings are immutable data types, meaning that their values cannot be changed or modified. This means that if you do want to update a string, either by taking a substring, concatenating another string on the end, or concatenating the string in question to the end of another string, etc., a new string object must be created for it.


This sounds more complicated than it really is. Since Python manages memory for you, you won't really notice when this occurs. Any operation that appears to modify a string actually causes Python to allocate a new string for you. In the following example, Python allocates space for the strings 'abc' and 'def'. But when the addition operation creates the string 'abcdef', new space is allocated automatically for the new string.


>>> 'abc' + 'def'
'abcdef'


Assigning values to variables is no different:


>>> s = 'abc'
>>> s = s + 'def'
>>> s
'abcdef'


In the above example, it looks like we assigned the string 'abc' to s, then appended the string 'def' to s. To the naked eye, strings look mutable. What you cannot see, however, is that a new string was created when the operation "s + 'def'" was performed, and that the new object was then assigned back to s. The old string 'abc' was left without a reference and could be deallocated.
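One way to observe that a new object is created is to keep a second name bound to the original string before the "append". In this sketch the name original is introduced only for the demonstration, and because it still refers to 'abc', that object is not deallocated here:

```python
s = 'abc'
original = s          # a second name bound to the same 'abc' object
s = s + 'def'         # builds a new string and rebinds s to it

print(s)              # abcdef
print(original)       # abc  (the first object was never changed)
print(s is original)  # False
```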


Once again, we can use the id() built-in function to help show us exactly what happened. If you recall, id() returns the "identity" of an object. This value is as close to a "memory address" as we can get in Python.


>>> s = 'abc'
>>>
>>> id(s)
135060856
>>>
>>> s += 'def'
>>> id(s)
135057968


Note how the identities are different for the string before and after the update. Another test of mutability is to try to modify individual characters or substrings of a string. We will now show how any update of a single character or a slice is not allowed:


>>> s
'abcdef'
>>>
>>> s[2] = 'C'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>>
>>> s[3:6] = 'DEF'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment


Both operations result in an error. In order to perform the actions that we want, we will have to create new strings using substrings of the existing string, then assign those new strings back to s:


>>> s
'abcdef'
>>>
>>> s = '%sC%s' % (s[0:2], s[3:])
>>> s
'abCdef'
>>>
>>> s[0:3] + 'DEF'
'abCDEF'


So for immutable objects like strings, the only valid target on the left-hand side of an assignment (to the left of the equals sign [ = ]) is a variable that stands for the entire object, not a single character or a substring. There is no such restriction on the expression on the right-hand side.
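The rebuild-and-rebind pattern can be wrapped in a small helper. The function replace_at below is a hypothetical name, not a built-in, and the sketch assumes a non-negative index. It also contrasts strings with lists, which are mutable; current interpreters report the rejected item assignment as a TypeError:

```python
def replace_at(text, index, new_char):
    """Return a copy of text with the character at a non-negative index replaced."""
    return text[:index] + new_char + text[index + 1:]

s = 'abcdef'
s = replace_at(s, 2, 'C')   # rebinding the whole name is allowed
print(s)                    # abCdef

try:
    s[2] = 'c'              # item assignment on a string is rejected
except TypeError as exc:
    print('not allowed: %s' % exc)

chars = list(s)             # lists are mutable, so slice assignment works
chars[3:6] = 'DEF'
print(''.join(chars))       # abCDEF
```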












