Wednesday, November 4, 2009

Opening HTML Documents

















import urllib
u = urllib.urlopen(webURL)      # open a web-based URL, or...
u = urllib.urlopen(localURL)    # ...open a local file instead
buffer = u.read()
print u.info()
print "Read %d bytes from %s.\n" % \
    (len(buffer), u.geturl())




The urllib and urllib2 modules included with Python provide the functionality to open and fetch data from URLs, including HTML documents.


To use the urllib module to open an HTML document, pass the URL location of the document, including the filename, to the urlopen(url [,data]) function. The urlopen function opens the location, whether it is a web address or a local file, and returns a file-like object that can be used to read data from the HTML document.
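As a minimal sketch of the same call on a current interpreter: in Python 3, urlopen lives in urllib.request rather than in urllib itself, and a local file must be given as a full file:// URL. The temporary directory, the file name test.html, and the sample markup below are assumptions made only so that the example runs without a network connection.

```python
import pathlib
import tempfile
import urllib.request  # Python 3 location of urlopen

# Hypothetical sample document, written to a temporary directory.
sample_path = pathlib.Path(tempfile.mkdtemp()) / "test.html"
sample_path.write_bytes(b"<html><body><p>Hello</p></body></html>\n")

# Python 3's urlopen rejects a bare path, so convert it to a file:// URL.
local_url = sample_path.as_uri()

u = urllib.request.urlopen(local_url)
try:
    # The returned object is file-like: it exposes read(), readline(), and so on.
    has_read = hasattr(u, "read")
finally:
    u.close()

print(has_read)
```

The same call accepts an http:// or https:// address in place of local_url.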


Once you have opened the HTML document, you can read it using the read([nbytes]), readline(), and readlines() functions, just as you would a normal file. To read the entire contents of the HTML document, call read() with no argument; it returns the file contents as a string.
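The three read functions can be compared side by side in a short sketch. It assumes Python 3, where urlopen is found in urllib.request and the read functions return bytes rather than str; the temporary file and its contents are hypothetical stand-ins for a real HTML document.

```python
import pathlib
import tempfile
import urllib.request

# Hypothetical five-line document stored in a temporary file.
html = b"<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n"
sample_path = pathlib.Path(tempfile.mkdtemp()) / "test.html"
sample_path.write_bytes(html)
url = sample_path.as_uri()

with urllib.request.urlopen(url) as u:
    first_ten = u.read(10)   # read([nbytes]): at most 10 bytes
    rest = u.read()          # read(): everything that remains

with urllib.request.urlopen(url) as u:
    first_line = u.readline()     # one line, newline included
    other_lines = u.readlines()   # list of the remaining lines

print(first_ten)
print(first_line)
print(len(other_lines))
```

Because each read advances the position in the stream, the document is reopened before switching from read() to readline().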


After you open a location, you can retrieve its URL using the geturl() function. The geturl() function returns the URL as a string, taking into account any redirection that might have taken place when accessing the HTML file.
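One way to use geturl() is to compare the address that was requested with the address that was finally served. The sketch below assumes Python 3 (urllib.request) and uses a hypothetical helper, final_location, together with a temporary local file; a local file is never redirected, so seeing the two values differ would require a web server that issues a redirect.

```python
import pathlib
import tempfile
import urllib.request


def final_location(url):
    """Hypothetical helper: open url and return (requested, final) URLs."""
    with urllib.request.urlopen(url) as u:
        return url, u.geturl()


# Hypothetical sample document so the example runs offline.
sample_path = pathlib.Path(tempfile.mkdtemp()) / "test.html"
sample_path.write_bytes(b"<html></html>\n")

requested, final = final_location(sample_path.as_uri())
print("Requested:", requested)
print("Final:    ", final)
```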


Note



Another helpful function included in the file-like object returned from urlopen is the info() function. The info() function returns the available metadata about the URL location, including content length, content type, and so on.
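The metadata returned by info() can also be queried field by field. This is a sketch under the assumption of Python 3, where info() returns an email.message.Message object that supports case-insensitive header lookup; the temporary file test.html is a hypothetical stand-in for a real document.

```python
import pathlib
import tempfile
import urllib.request

# Hypothetical sample document so the example runs offline.
sample_path = pathlib.Path(tempfile.mkdtemp()) / "test.html"
sample_path.write_bytes(b"<html><body><p>Hello</p></body></html>\n")

with urllib.request.urlopen(sample_path.as_uri()) as u:
    meta = u.info()   # headers describing the opened location
    body = u.read()

content_type = meta.get_content_type()          # guessed from the .html suffix
content_length = int(meta["Content-Length"])    # header lookup ignores case

print(content_type)
print(content_length == len(body))
```

For a web-based URL the same object carries the headers sent by the server, so the set of available fields depends on the server.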




import urllib

webURL = "http://www.python.org"
localURL = "/books/python/CH8/code/test.html"

# Open web-based URL
u = urllib.urlopen(webURL)
buffer = u.read()
print u.info()
print "Read %d bytes from %s.\n" % \
    (len(buffer), u.geturl())

# Open local-based URL
u = urllib.urlopen(localURL)
buffer = u.read()
print u.info()
print "Read %d bytes from %s." % \
    (len(buffer), u.geturl())


html_open.py


Date: Tue, 18 Jul 2006 18:28:19 GMT
Server: Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5 mod_ssl/2.0.54 OpenSSL/0.9.7e
Last-Modified: Mon, 17 Jul 2006 23:06:04 GMT
ETag: "601f6-351c-1310af00"
Accept-Ranges: bytes
Content-Length: 13596
Connection: close
Content-Type: text/html

Web-Based URL
Read 13596 bytes from http://www.python.org.
Content-Type: text/html
Content-Length: 433
Last-modified: Thu, 13 Jul 2006 22:07:53 GMT

Local-Based URL
Read 433 bytes from file:///books/python/CH8/code/test.html.


Output from html_open.py code











