Module htmldata

Manipulate HTML or XHTML documents.

Version 1.1.1. This source code has been placed in the public domain by Connelly Barnes.

See the examples below for a quick start.
Classes
  URLMatch
A matched URL inside an HTML document or stylesheet.

Function Summary
  examples()
Examples of the htmldata module.
  tagextract(doc)
Convert HTML to data structure.
  tagjoin(L)
Convert data structure back to HTML.
  urlextract(doc, siteurl, mimetype)
Extract URLs from HTML or stylesheet.
  urljoin(s, L)
Write back document with modified URLs (reverses urlextract).

Function Details

examples()

Examples of the htmldata module.

Example 1: Print all absolutized URLs from Google.

Here we use urlextract to obtain all URLs in the document.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> for u in htmldata.urlextract(contents, url):
...   print u.url
...

http://www.google.com/images/logo.gif
http://www.google.com/search
(More output)

Note that the second argument to urlextract causes the URLs to be made absolute with respect to that base URL.

Example 2: Print all image URLs from Google in relative form.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> for u in htmldata.urlextract(contents):
...   if u.tag_name == 'img':
...     print u.url
...

/images/logo.gif

Equivalently, one can use tagextract, and look for occurrences of <img> tags. The urlextract function is mostly a convenience function for when one wants to extract and/or modify all URLs in a document.
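As a sketch, the tagextract equivalent of Example 2 reads the 'src' attribute by hand (reusing contents from above; <img> tags without a src= attribute are skipped):
>>> for item in htmldata.tagextract(contents):
...   if isinstance(item, tuple) and item[0] == 'img' and 'src' in item[1]:
...     print item[1]['src']
...

(The same relative image URLs as above)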

Example 3: Replace all <a href> links on Google with the Microsoft web page.

Here we use tagextract to turn the HTML into a data structure, and then loop over the in-order list of tags (items which are not tuples are plain text, which is ignored).
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> L = htmldata.tagextract(contents)
>>> for item in L:
...   if isinstance(item, tuple) and item[0] == 'a':
...     # It's an HTML <a> tag!  Give it an href=.
...     item[1]['href'] = 'http://www.microsoft.com/'
...
>>> htmldata.tagjoin(L)
(Microsoftized version of Google)
Example 4: Make all URLs in an HTML document absolute.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> htmldata.urljoin(contents, htmldata.urlextract(contents, url))
(Google HTML page with absolute URLs)
Example 5: Properly quote all HTML tag values for pedants.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> htmldata.tagjoin(htmldata.tagextract(contents))
(Properly quoted version of the original HTML)
Example 6: Modify all URLs in a document so that they are appended to our proxy CGI script http://mysite.com/proxy.cgi.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> proxy_url = 'http://mysite.com/proxy.cgi?url='
>>> L = htmldata.urlextract(contents, url)
>>> for u in L:
...   u.url = proxy_url + u.url
...
>>> htmldata.urljoin(contents, L)
(Document with all URLs wrapped in our proxy script)
Example 7: Download all images from a website.
>>> import urllib, htmldata, time
>>> url = 'http://www.google.com/'
>>> contents = urllib.urlopen(url).read()
>>> for u in htmldata.urlextract(contents, url):
...   if u.tag_name == 'img':
...     filename = urllib.quote_plus(u.url)
...     urllib.urlretrieve(u.url, filename)
...     time.sleep(0.5)
...

(Images are downloaded to the current directory)
Many sites will protect against bandwidth-draining robots by checking the HTTP Referer [sic] and User-Agent fields. To circumvent this, one can create a urllib2.Request object with a legitimate Referer and a User-Agent such as "Mozilla/4.0 (compatible; MSIE 5.5)". Then use urllib2.urlopen to download the content. Be warned that some website operators will respond to rapid robot requests by banning the offending IP address.
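For instance, a minimal sketch of such a request (the header values here are illustrative, not required by this module):
>>> import urllib2
>>> request = urllib2.Request(url, headers={
...     'Referer': 'http://www.google.com/',
...     'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5)'})
>>> contents = urllib2.urlopen(request).read()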

tagextract(doc)

Convert HTML to data structure.

Returns a list. HTML tags become (name, keyword_dict) tuples within the list, while plain text becomes strings within the list. All tag names are lowercased and stripped of whitespace. Tags which end with a forward slash have a single forward slash placed at the end of their name, to indicate that they are XML-style self-closing tags.

Example:
>>> tagextract('<img src=hi.gif alt="hi">foo<br><br/></body>')
[('img', {'src': 'hi.gif', 'alt': 'hi'}), 'foo',
 ('br', {}), ('br/', {}), ('/body', {})]
Text between '<script>' and '</script>', and between '<style>' and '</style>', is rendered directly to plain text. This prevents rogue '<' or '>' characters inside scripts and stylesheets from interfering with parsing.
>>> tagextract('<script type="a"><blah>var x; </script>')
[('script', {'type': 'a'}), '<blah>var x; ', ('/script', {})]
Comment strings and XML directives are rendered as a single long tag with no attributes. The case of the tag "name" is not changed:
>>> tagextract('<!-- blah -->')
[('!-- blah --', {})]

>>> tagextract('<?xml version="1.0" encoding="utf-8" ?>')
[('?xml version="1.0" encoding="utf-8" ?', {})]

>>> tagextract('<!DOCTYPE html PUBLIC etc...>')
[('!DOCTYPE html PUBLIC etc...', {})]
Greater-than and less-than characters occurring inside comments or CDATA blocks are correctly kept as part of the block:
>>> tagextract('<!-- <><><><>>..> -->')
[('!-- <><><><>>..> --', {})]

>>> tagextract('<!CDATA[[><>><>]<> ]]>')
[('!CDATA[[><>><>]<> ]]', {})]
Note that if one modifies these tags, it is important to retain the "--" (for comments) or "]]" (for CDATA) at the end of the tag name, so that output from tagjoin will be correct HTML/XHTML.
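For instance, a sketch of editing a comment in place (the replacement tuple keeps the trailing "--" in the tag name):
>>> L = tagextract('<!-- old comment -->')
>>> L[0] = ('!-- new comment --', {})
>>> tagjoin(L)
(a well-formed comment, e.g. '<!-- new comment -->')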

tagjoin(L)

Convert data structure back to HTML.

This reverses the tagextract function.

More precisely, if an HTML string is turned into a data structure, then back into HTML, the resulting string will be functionally equivalent to the original HTML.
>>> tagjoin(tagextract(s))
(string that is functionally equivalent to s)
Three changes are made to the HTML by tagjoin: tags are lowercased, key=value pairs are sorted, and values are placed in double-quotes.
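A sketch of these three normalizations on a small fragment:
>>> tagjoin(tagextract('<IMG SRC=hi.gif ALT=hi>'))
(the same tag lowercased, with sorted, double-quoted attributes,
 e.g. '<img alt="hi" src="hi.gif">')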

urlextract(doc, siteurl=None, mimetype='text/html')

Extract URLs from HTML or stylesheet.

Extracts only URLs that are linked to or embedded in the document. URLs that merely appear as plain text are ignored.

Returns a list of URLMatch objects.
>>> L = urlextract('<img src="a.gif"><a href="www.google.com">')
>>> L[0].url
'a.gif'

>>> L[1].url
'www.google.com'
If siteurl is specified, all URLs are made into absolute URLs by assuming that doc is located at the URL siteurl.
>>> doc = '<img src="a.gif"><a href="/b.html">'
>>> L = urlextract(doc, 'http://www.python.org/~guido/')
>>> L[0].url
'http://www.python.org/~guido/a.gif'

>>> L[1].url
'http://www.python.org/b.html'

If mimetype is "text/css", the document will be parsed as a stylesheet.

If a stylesheet is embedded inside an HTML document, then urlextract will extract the URLs from both the HTML and the stylesheet.
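As a sketch, extracting from a small standalone stylesheet (the CSS fragment is illustrative):
>>> css = 'body { background: url("bg.png"); }'
>>> for u in urlextract(css, mimetype='text/css'):
...   print u.url
...

(the URL inside url(...), here 'bg.png')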

urljoin(s, L)

Write back document with modified URLs (reverses urlextract).

Given a list L of URLMatch objects obtained from urlextract, substitutes changed URLs into the original document s, and returns the modified document.

One should only modify the .url attribute of the URLMatch objects. The ordering of the URLs in the list is not important.
>>> doc = '<img src="a.png"><a href="b.png">'
>>> L = urlextract(doc)
>>> L[0].url = 'foo'
>>> L[1].url = 'bar'
>>> urljoin(doc, L)
'<img src="foo"><a href="bar">'