Module htmldata


Manipulate HTML or XHTML documents.

Version 1.1.1. This source code has been placed in the public domain by Connelly Barnes.

See the examples() function for a quick start.
Classes
URLMatch A matched URL inside an HTML document or stylesheet.
_HTMLTag HTML tag extracted by _full_tag_extract.
_TextTag Text extracted from an HTML document by _full_tag_extract.

Function Summary
  examples()
Examples of the htmldata module.
  tagextract(doc)
Convert HTML to data structure.
  tagjoin(L)
Convert data structure back to HTML.
  urlextract(doc, siteurl, mimetype)
Extract URLs from HTML or stylesheet.
  urljoin(s, L)
Write back document with modified URLs (reverses urlextract).
  _cast_to_str(arg, str_class)
Casts string components of several data structures to str_class.
  _enumerate(L)
Like enumerate, provided for compatibility with Python < 2.3.
  _finditer(pattern, string)
Like re.finditer, provided for compatibility with Python < 2.3.
  _full_tag_extract(s)
Like tagextract, but different return format.
  _html_split(s)
Helper routine: Split string into a list of tags and non-tags.
  _ignore_tag_index(s, i)
Helper routine: Find index within _IGNORE_TAGS, or -1.
  _is_str(s)
True iff s is a string (checks via duck typing).
  _python_has_unicode()
True iff Python was compiled with unicode().
  _remove_comments(doc)
Replaces commented out characters with spaces in a CSS document.
  _shlex_split(s)
Like shlex.split, but reversible, and for HTML.
  _tag_dict(s)
Helper routine: Extracts a dict from an HTML tag string.
  _test()
Unit test main routine.
  _test_remove_comments()
Unit test for _remove_comments.
  _test_shlex_split()
Unit test for _shlex_split.
  _test_tag_dict()
Unit test for _tag_dict.
  _test_tagextract(str_class)
Unit tests for tagextract and tagjoin.
  _test_tuple_replace()
Unit test for _tuple_replace.
  _test_urlextract(str_class)
Unit tests for urlextract and urljoin.
  _tuple_replace(s, Lindices, Lreplace)
Replace slices of a string with new substrings.

Variable Summary
str __version__ = '1.1.1'
str _BEGIN_CDATA = '<![CDATA['
str _BEGIN_COMMENT = '<!--'
list _CSS_MIMETYPES = ['text/css']
str _END_CDATA = ']]>'
str _END_COMMENT = '-->'
list _HTML_MIMETYPES = ['text/html', 'application/xhtml', 'ap...
list _IGNORE_TAGS = [('script', '/script'), ('style', '/style...
list _URL_TAGS = [('a', 'href'), ('applet', 'archive'), ('app...

Function Details

examples()

Examples of the htmldata module.

Example 1: Print all absolutized URLs from Google.

Here we use urlextract to obtain all URLs in the document.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> for u in htmldata.urlextract(contents, url):
...   print u.url
...

http://www.google.com/images/logo.gif
http://www.google.com/search
(More output)

Note that the second argument to urlextract causes the URLs to be made absolute with respect to that base URL.

Example 2: Print all image URLs from Google in relative form.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> for u in htmldata.urlextract(contents):
...   if u.tag_name == 'img':
...     print u.url
...

/images/logo.gif

Equivalently, one can use tagextract, and look for occurrences of <img> tags. The urlextract function is mostly a convenience function for when one wants to extract and/or modify all URLs in a document.

Example 3: Replace all <a href> links on Google with the Microsoft web page.

Here we use tagextract to turn the HTML into a data structure, and then loop over the in-order list of tags (items which are not tuples are plain text, which is ignored).
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> L = htmldata.tagextract(contents)
>>> for item in L:
...   if isinstance(item, tuple) and item[0] == 'a':
...     # It's an HTML <a> tag!  Give it an href=.
...     item[1]['href'] = 'http://www.microsoft.com/'
...
>>> htmldata.tagjoin(L)
(Microsoftized version of Google)
Example 4: Make all URLs in an HTML document absolute.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> htmldata.urljoin(contents, htmldata.urlextract(contents, url))
(Google HTML page with absolute URLs)
Example 5: Properly quote all HTML tag values for pedants.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> htmldata.tagjoin(htmldata.tagextract(contents))
(Properly quoted version of the original HTML)
Example 6: Modify all URLs in a document so that each is routed through our proxy CGI script http://mysite.com/proxy.cgi.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> proxy_url = 'http://mysite.com/proxy.cgi?url='
>>> L = htmldata.urlextract(contents)
>>> for u in L:
...   u.url = proxy_url + u.url
...
>>> htmldata.urljoin(contents, L)
(Document with all URLs wrapped in our proxy script)
Example 7: Download all images from a website.
>>> import urllib, htmldata, time
>>> url = 'http://www.google.com/'
>>> contents = urllib.urlopen(url).read()
>>> for u in htmldata.urlextract(contents, url):
...   if u.tag_name == 'img':
...     filename = urllib.quote_plus(u.url)
...     urllib.urlretrieve(u.url, filename)
...     time.sleep(0.5)
...

(Images are downloaded to the current directory)
Many sites will protect against bandwidth-draining robots by checking the HTTP Referer [sic] and User-Agent fields. To circumvent this, one can create a urllib2.Request object with a legitimate Referer and a User-Agent such as "Mozilla/4.0 (compatible; MSIE 5.5)". Then use urllib2.urlopen to download the content. Be warned that some website operators will respond to rapid robot requests by banning the offending IP address.
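On modern Python 3, urllib2 has become urllib.request; a minimal sketch of building such a request follows (the URL and header values here are only placeholders, not part of the module):

```python
import urllib.request  # urllib2's modern (Python 3) successor

# Build a request carrying a legitimate-looking Referer and User-Agent.
# The URL and header values below are illustrative placeholders.
req = urllib.request.Request(
    'http://www.example.com/page.html',
    headers={
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5)',
        'Referer': 'http://www.example.com/',
    })
# urllib.request.urlopen(req).read() would then fetch the content.
```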

tagextract(doc)

Convert HTML to data structure.

Returns a list. HTML tags become (name, keyword_dict) tuples within the list, while plain text becomes strings within the list. All tag names are lowercased and stripped of whitespace. Tags which end with forward slashes have a single forward slash placed at the end of their name, to indicate that they are XML unclosed tags.

Example:
>>> tagextract('<img src=hi.gif alt="hi">foo<br><br/></body>')
[('img', {'src': 'hi.gif', 'alt': 'hi'}), 'foo',
 ('br', {}), ('br/', {}), ('/body', {})]
Text between '<script>' and '</script>', and between '<style>' and '</style>', is rendered directly as plain text. This prevents rogue '<' or '>' characters from interfering with parsing.
>>> tagextract('<script type="a"><blah>var x; </script>')
[('script', {'type': 'a'}), '<blah>var x; ', ('/script', {})]
Comment strings and XML directives are rendered as a single long tag with no attributes. The case of the tag "name" is not changed:
>>> tagextract('<!-- blah -->')
[('!-- blah --', {})]

>>> tagextract('<?xml version="1.0" encoding="utf-8" ?>')
[('?xml version="1.0" encoding="utf-8" ?', {})]

>>> tagextract('<!DOCTYPE html PUBLIC etc...>')
[('!DOCTYPE html PUBLIC etc...', {})]
Greater-than and less-than characters occurring inside comments or CDATA blocks are correctly kept as part of the block:
>>> tagextract('<!-- <><><><>>..> -->')
[('!-- <><><><>>..> --', {})]

>>> tagextract('<!CDATA[[><>><>]<> ]]>')
[('!CDATA[[><>><>]<> ]]', {})]
Note that if one modifies these tags, it is important to retain the "--" (for comments) or "]]" (for CDATA) at the end of the tag name, so that output from tagjoin will be correct HTML/XHTML.

tagjoin(L)

Convert data structure back to HTML.

This reverses the tagextract function.

More precisely, if an HTML string is turned into a data structure, then back into HTML, the resulting string will be functionally equivalent to the original HTML.
>>> tagjoin(tagextract(s))
(string that is functionally equivalent to s)
Three changes are made to the HTML by tagjoin: tags are lowercased, key=value pairs are sorted, and values are placed in double-quotes.
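These three changes can be illustrated with a hypothetical single-tag serializer; this is only a sketch of the quoting rules, not the module's actual implementation:

```python
# Hypothetical sketch of how one tag tuple could be serialized: lowercase
# the name, sort the key=value pairs, and place values in double quotes.
def join_tag(name, attrs):
    parts = [name.lower()]
    for key in sorted(attrs):
        if attrs[key] is None:      # value-less singleton, e.g. 'checked'
            parts.append(key)
        else:
            parts.append('%s="%s"' % (key, attrs[key]))
    return '<%s>' % ' '.join(parts)
```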

urlextract(doc, siteurl=None, mimetype='text/html')

Extract URLs from HTML or stylesheet.

Extracts only URLs that are linked to or embedded in the document. Ignores plain text URLs that occur in the non-HTML part of the document.

Returns a list of URLMatch objects.
>>> L = urlextract('<img src="a.gif"><a href="www.google.com">')
>>> L[0].url
'a.gif'

>>> L[1].url
'www.google.com'
If siteurl is specified, all URLs are made into absolute URLs by assuming that doc is located at the URL siteurl.
>>> doc = '<img src="a.gif"><a href="/b.html">'
>>> L = urlextract(doc, 'http://www.python.org/~guido/')
>>> L[0].url
'http://www.python.org/~guido/a.gif'

>>> L[1].url
'http://www.python.org/b.html'
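This absolutization follows the standard library's URL joining semantics; a sketch on modern Python, using the same base URL as the doctest above:

```python
from urllib.parse import urljoin  # Python 3; urlparse.urljoin on Python 2

base = 'http://www.python.org/~guido/'
# A relative path resolves against the document's directory...
a = urljoin(base, 'a.gif')
# ...while a root-relative path resolves against the site root.
b = urljoin(base, '/b.html')
```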

If mimetype is "text/css", the document will be parsed as a stylesheet.

If a stylesheet is embedded inside an HTML document, then urlextract will extract the URLs from both the HTML and the stylesheet.
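A rough sketch of the kind of pattern that pulls url(...) references out of a stylesheet; this regex is an illustration, not the module's actual CSS parser:

```python
import re

# Illustrative pattern only: matches url("..."), url('...') and bare url(...).
_CSS_URL = re.compile(r'''url\(\s*['"]?([^'")\s]+)['"]?\s*\)''')

css = 'body { background: url("bg.png") } .dot { list-style: url(dot.gif) }'
urls = _CSS_URL.findall(css)
```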

urljoin(s, L)

Write back document with modified URLs (reverses urlextract).

Given a list L of URLMatch objects obtained from urlextract, substitutes changed URLs into the original document s, and returns the modified document.

One should only modify the .url attribute of the URLMatch objects. The ordering of the URLs in the list is not important.
>>> doc = '<img src="a.png"><a href="b.png">'
>>> L = urlextract(doc)
>>> L[0].url = 'foo'
>>> L[1].url = 'bar'
>>> urljoin(doc, L)
'<img src="foo"><a href="bar">'

_cast_to_str(arg, str_class)

Casts string components of several data structures to str_class.

Casts string, list of strings, or list of tuples (as returned by tagextract) such that all strings are made to type str_class.

_enumerate(L)

Like enumerate, provided for compatibility with Python < 2.3.

Returns a list instead of an iterator.

_finditer(pattern, string)

Like re.finditer, provided for compatibility with Python < 2.3.

Returns a list instead of an iterator. Otherwise the return format is identical to re.finditer (except possibly in the details of empty matches).

_full_tag_extract(s)

Like tagextract, but different return format.

Returns a list of _HTMLTag and _TextTag instances.

The return format is inconvenient for manipulating HTML; it is mainly useful when you need the exact locations where tags occur in the original HTML document.

_html_split(s)

Helper routine: Split string into a list of tags and non-tags.
>>> _html_split(' blah <tag text> more </tag stuff> ')
[' blah ', '<tag text>', ' more ', '</tag stuff>', ' ']

Tags begin with '<' and end with '>'.

The identity ''.join(L) == s is always satisfied.

Exceptions to the normal parsing of HTML tags:

'<script>', '<style>', and HTML comment tags ignore all HTML until the closing pair, and are added as three elements:
>>> _html_split(' blah<style><<<><></style><!-- hi -->' +
...             ' <script language="Javascript"></>a</script>end')
[' blah', '<style>', '<<<><>', '</style>', '<!--', ' hi ', '-->',
 ' ', '<script language="Javascript">', '</>a', '</script>', 'end']
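For input without the script/style/comment special cases, the split can be sketched with a single regex that preserves the ''.join(L) == s identity (a simplification, not the real routine):

```python
import re

# Simplified sketch: split on <...> tags, keeping the delimiters via a
# capturing group. The real _html_split additionally special-cases
# script, style, and comment blocks.
def html_split_sketch(s):
    return [piece for piece in re.split(r'(<[^>]*>)', s) if piece != '']
```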

_ignore_tag_index(s, i)

Helper routine: Find index within _IGNORE_TAGS, or -1.

If s[i:] begins with an opening tag from _IGNORE_TAGS, return the index. Otherwise, return -1.

_is_str(s)

True iff s is a string (checks via duck typing).

_python_has_unicode()

True iff Python was compiled with unicode().

_remove_comments(doc)

Replaces commented-out characters with spaces in a CSS document.

_shlex_split(s)

Like shlex.split, but reversible, and for HTML.

Splits a string into a list L of strings. List elements contain either an HTML tag name=value pair, an HTML name singleton (e.g. "checked"), or whitespace.

The identity ''.join(L) == s is always satisfied.
>>> _shlex_split('a=5 b="15" name="Georgette A"')
['a=5', ' ', 'b="15"', ' ', 'name="Georgette A"']
>>> _shlex_split('a = a5 b=#b19 name="foo bar" q="hi"')
['a = a5', ' ', 'b=#b19', ' ', 'name="foo bar"', ' ', 'q="hi"']
>>> _shlex_split('a="9"b="15"')
['a="9"', 'b="15"']
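A much-simplified sketch of such a reversible splitter; unlike the real function, it does not handle spaces around '=':

```python
import re

# Simplified, hypothetical splitter: alternates name[=value] tokens with
# runs of whitespace, so joining the result reproduces the input.
_TOKEN = re.compile(r'''[^\s=]+(?:=(?:"[^"]*"|'[^']*'|\S*))?|\s+''')

def shlex_split_sketch(s):
    parts = _TOKEN.findall(s)
    assert ''.join(parts) == s   # the reversibility identity
    return parts
```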

_tag_dict(s)

Helper routine: Extracts a dict from an HTML tag string.
>>> _tag_dict('bgcolor=#ffffff text="#000000" blink')
({'bgcolor':'#ffffff', 'text':'#000000', 'blink': None},
 {'bgcolor':(0,7),  'text':(16,20), 'blink':(31,36)},
 {'bgcolor':(8,15), 'text':(22,29), 'blink':(36,36)})

Returns a 3-tuple. First element is a dict of (key, value) pairs from the HTML tag. Second element is a dict mapping keys to (start, end) indices of the key in the text. Third element maps keys to (start, end) indices of the value in the text.

Names are lowercased.

Raises ValueError for unmatched quotes and other errors.

_test()

Unit test main routine.

_test_remove_comments()

Unit test for _remove_comments.

_test_shlex_split()

Unit test for _shlex_split.

_test_tag_dict()

Unit test for _tag_dict.

_test_tagextract(str_class=str)

Unit tests for tagextract and tagjoin.

Strings are cast to the string class argument str_class.

_test_tuple_replace()

Unit test for _tuple_replace.

_test_urlextract(str_class=str)

Unit tests for urlextract and urljoin.

Strings are cast to the string class argument str_class.

_tuple_replace(s, Lindices, Lreplace)

Replace slices of a string with new substrings.

Given a list of slice tuples in Lindices, replace each slice in s with the corresponding replacement substring from Lreplace.

Example:
>>> _tuple_replace('0123456789',[(4,5),(6,9)],['abc', 'def'])
'0123abc5def9'
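The behavior can be sketched as follows, assuming (as the example above implies) that the slice tuples are sorted and non-overlapping:

```python
def tuple_replace_sketch(s, Lindices, Lreplace):
    # Walk the slices in order, copying the untouched gaps between them.
    out, prev = [], 0
    for (start, end), repl in zip(Lindices, Lreplace):
        out.append(s[prev:start])   # text before this slice
        out.append(repl)            # replacement for s[start:end]
        prev = end
    out.append(s[prev:])            # trailing text after the last slice
    return ''.join(out)
```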

Variable Details

__version__

Type:
str
Value:
'1.1.1'                                                                

_BEGIN_CDATA

Type:
str
Value:
'<![CDATA['                                                            

_BEGIN_COMMENT

Type:
str
Value:
'<!--'                                                                 

_CSS_MIMETYPES

Type:
list
Value:
['text/css']                                                           

_END_CDATA

Type:
str
Value:
']]>'                                                                  

_END_COMMENT

Type:
str
Value:
'-->'                                                                  

_HTML_MIMETYPES

Type:
list
Value:
['text/html',
 'application/xhtml',
 'application/xhtml+xml',
 'text/xml',
 'application/xml']                                                    

_IGNORE_TAGS

Type:
list
Value:
[('script', '/script'), ('style', '/style')]                           

_URL_TAGS

Type:
list
Value:
[('a', 'href'),
 ('applet', 'archive'),
 ('applet', 'code'),
 ('applet', 'codebase'),
 ('area', 'href'),
 ('base', 'href'),
 ('blockquote', 'cite'),
 ('body', 'background'),
...