Module htmldata


Manipulate HTML or XHTML documents.

Version 1.1.1. This source code has been placed in the public domain by Connelly Barnes.

See the examples() function for a quick start.
Classes
URLMatch A matched URL inside an HTML document or stylesheet.
_HTMLTag HTML tag extracted by _full_tag_extract.
_TextTag Text extracted from an HTML document by _full_tag_extract.

Function Summary
  examples()
Examples of the htmldata module.
  tagextract(doc)
Convert HTML to data structure.
  tagjoin(L)
Convert data structure back to HTML.
  urlextract(doc, siteurl, mimetype)
Extract URLs from HTML or stylesheet.
  urljoin(s, L)
Write back document with modified URLs (reverses urlextract).
  _cast_to_str(arg, str_class)
Casts string components of several data structures to str_class.
  _enumerate(L)
Like enumerate, provided for compatibility with Python < 2.3.
  _finditer(pattern, string)
Like re.finditer, provided for compatibility with Python < 2.3.
  _full_tag_extract(s)
Like tagextract, but different return format.
  _html_split(s)
Helper routine: Split string into a list of tags and non-tags.
  _ignore_tag_index(s, i)
Helper routine: Find index within _IGNORE_TAGS, or -1.
  _is_str(s)
True iff s is a string (checks via duck typing).
  _python_has_unicode()
True iff Python was compiled with unicode().
  _remove_comments(doc)
Replaces commented out characters with spaces in a CSS document.
  _shlex_split(s)
Like shlex.split, but reversible, and for HTML.
  _tag_dict(s)
Helper routine: Extracts a dict from an HTML tag string.
  _test()
Unit test main routine.
  _test_remove_comments()
Unit test for _remove_comments.
  _test_shlex_split()
Unit test for _shlex_split.
  _test_tag_dict()
Unit test for _tag_dict.
  _test_tagextract(str_class)
Unit tests for tagextract and tagjoin.
  _test_tuple_replace()
Unit test for _tuple_replace.
  _test_urlextract(str_class)
Unit tests for urlextract and urljoin.
  _tuple_replace(s, Lindices, Lreplace)
Replace slices of a string with new substrings.

Variable Summary
str __version__ = '1.1.1'
str _BEGIN_CDATA = '<![CDATA['
str _BEGIN_COMMENT = '<!--'
list _CSS_MIMETYPES = ['text/css']
str _END_CDATA = ']]>'
str _END_COMMENT = '-->'
list _HTML_MIMETYPES = ['text/html', 'application/xhtml', 'ap...
list _IGNORE_TAGS = [('script', '/script'), ('style', '/style...
list _URL_TAGS = [('a', 'href'), ('applet', 'archive'), ('app...

Function Details

examples()

Examples of the htmldata module.

Example 1: Print all absolutized URLs from Google.

Here we use urlextract to obtain all URLs in the document.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> for u in htmldata.urlextract(contents, url):
...   print u.url
...

http://www.google.com/images/logo.gif
http://www.google.com/search
(More output)

Note that the second argument to urlextract causes the URLs to be made absolute with respect to that base URL.

Example 2: Print all image URLs from Google in relative form.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> for u in htmldata.urlextract(contents):
...   if u.tag_name == 'img':
...     print u.url
...

/images/logo.gif

Equivalently, one can use tagextract, and look for occurrences of <img> tags. The urlextract function is mostly a convenience function for when one wants to extract and/or modify all URLs in a document.

Example 3: Replace all <a href> links on Google with the Microsoft web page.

Here we use tagextract to turn the HTML into a data structure, and then loop over the in-order list of tags (items which are not tuples are plain text, which is ignored).
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> L = htmldata.tagextract(contents)
>>> for item in L:
...   if isinstance(item, tuple) and item[0] == 'a':
...     # It's an HTML <a> tag!  Give it an href=.
...     item[1]['href'] = 'http://www.microsoft.com/'
...
>>> htmldata.tagjoin(L)
(Microsoftized version of Google)
Example 4: Make all URLs in an HTML document absolute.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> htmldata.urljoin(contents, htmldata.urlextract(contents, url))
(Google HTML page with absolute URLs)
Example 5: Properly quote all HTML tag values for pedants.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> htmldata.tagjoin(htmldata.tagextract(contents))
(Properly quoted version of the original HTML)
Example 6: Modify all URLs in a document so that each is routed through our proxy CGI script http://mysite.com/proxy.cgi.
>>> import urllib2, htmldata
>>> url = 'http://www.google.com/'
>>> contents = urllib2.urlopen(url).read()
>>> proxy_url = 'http://mysite.com/proxy.cgi?url='
>>> L = htmldata.urlextract(contents)
>>> for u in L:
...   u.url = proxy_url + u.url
...
>>> htmldata.urljoin(contents, L)
(Document with all URLs wrapped in our proxy script)
Example 7: Download all images from a website.
>>> import urllib, htmldata, time
>>> url = 'http://www.google.com/'
>>> contents = urllib.urlopen(url).read()
>>> for u in htmldata.urlextract(contents, url):
...   if u.tag_name == 'img':
...     filename = urllib.quote_plus(u.url)
...     urllib.urlretrieve(u.url, filename)
...     time.sleep(0.5)
...

(Images are downloaded to the current directory)
Many sites will protect against bandwidth-draining robots by checking the HTTP Referer [sic] and User-Agent fields. To circumvent this, one can create a urllib2.Request object with a legitimate Referer and a User-Agent such as "Mozilla/4.0 (compatible; MSIE 5.5)". Then use urllib2.urlopen to download the content. Be warned that some website operators will respond to rapid robot requests by banning the offending IP address.
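On modern Python 3, urllib2 has become urllib.request; a minimal sketch of building such a request follows (the URL and header values here are only placeholders, not part of the module):

```python
import urllib.request  # urllib2's modern (Python 3) successor

# Build a request carrying a legitimate-looking Referer and User-Agent.
# The URL and header values below are illustrative placeholders.
req = urllib.request.Request(
    'http://www.example.com/page.html',
    headers={
        'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5)',
        'Referer': 'http://www.example.com/',
    })
# urllib.request.urlopen(req).read() would then fetch the content.
```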

tagextract(doc)

Convert HTML to data structure.

Returns a list. HTML tags become (name, keyword_dict) tuples within the list, while plain text becomes strings within the list. All tag names are lowercased and stripped of whitespace. Tags which end with forward slashes have a single forward slash placed at the end of their name, to indicate that they are XML unclosed tags.

Example:
>>> tagextract('<img src=hi.gif alt="hi">foo<br><br/></body>')
[('img', {'src': 'hi.gif', 'alt': 'hi'}), 'foo',
 ('br', {}), ('br/', {}), ('/body', {})]
Text between '<script>' and '</script>', and between '<style>' and '</style>', is rendered directly as plain text. This prevents rogue '<' or '>' characters from interfering with parsing.
>>> tagextract('<script type="a"><blah>var x; </script>')
[('script', {'type': 'a'}), '<blah>var x; ', ('/script', {})]
Comment strings and XML directives are rendered as a single long tag with no attributes. The case of the tag "name" is not changed:
>>> tagextract('<!-- blah -->')
[('!-- blah --', {})]

>>> tagextract('<?xml version="1.0" encoding="utf-8" ?>')
[('?xml version="1.0" encoding="utf-8" ?', {})]

>>> tagextract('<!DOCTYPE html PUBLIC etc...>')
[('!DOCTYPE html PUBLIC etc...', {})]
Greater-than and less-than characters occurring inside comments or CDATA blocks are correctly kept as part of the block:
>>> tagextract('<!-- <><><><>>..> -->')
[('!-- <><><><>>..> --', {})]

>>> tagextract('<!CDATA[[><>><>]<> ]]>')
[('!CDATA[[><>><>]<> ]]', {})]
Note that if one modifies these tags, it is important to retain the "--" (for comments) or "]]" (for CDATA) at the end of the tag name, so that output from tagjoin will be correct HTML/XHTML.

tagjoin(L)

Convert data structure back to HTML.

This reverses the tagextract function.

More precisely, if an HTML string is turned into a data structure, then back into HTML, the resulting string will be functionally equivalent to the original HTML.
>>> tagjoin(tagextract(s))
(string that is functionally equivalent to s)
Three changes are made to the HTML by tagjoin: tags are lowercased, key=value pairs are sorted, and values are placed in double-quotes.
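These three changes can be illustrated with a hypothetical single-tag serializer; this is only a sketch of the quoting rules, not the module's actual implementation:

```python
# Hypothetical sketch of how one tag tuple could be serialized: lowercase
# the name, sort the key=value pairs, and place values in double quotes.
def join_tag(name, attrs):
    parts = [name.lower()]
    for key in sorted(attrs):
        if attrs[key] is None:      # value-less singleton, e.g. 'checked'
            parts.append(key)
        else:
            parts.append('%s="%s"' % (key, attrs[key]))
    return '<%s>' % ' '.join(parts)
```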

urlextract(doc, siteurl=None, mimetype='text/html')

Extract URLs from HTML or stylesheet.

Extracts only URLs that are linked to or embedded in the document. Ignores plain text URLs that occur in the non-HTML part of the document.

Returns a list of URLMatch objects.
>>> L = urlextract('<img src="a.gif"><a href="www.google.com">')
>>> L[0].url
'a.gif'

>>> L[1].url
'www.google.com'
If siteurl is specified, all URLs are made into absolute URLs by assuming that doc is located at the URL siteurl.
>>> doc = '<img src="a.gif"><a href="/b.html">'
>>> L = urlextract(doc, 'http://www.python.org/~guido/')
>>> L[0].url
'http://www.python.org/~guido/a.gif'

>>> L[1].url
'http://www.python.org/b.html'
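This absolutization follows the standard library's URL joining semantics; a sketch on modern Python, using the same base URL as the doctest above:

```python
from urllib.parse import urljoin  # Python 3; urlparse.urljoin on Python 2

base = 'http://www.python.org/~guido/'
# A relative path resolves against the document's directory...
a = urljoin(base, 'a.gif')
# ...while a root-relative path resolves against the site root.
b = urljoin(base, '/b.html')
```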

If mimetype is "text/css", the document will be parsed as a stylesheet.

If a stylesheet is embedded inside an HTML document, then urlextract will extract the URLs from both the HTML and the stylesheet.
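A rough sketch of the kind of pattern that pulls url(...) references out of a stylesheet; this regex is an illustration, not the module's actual CSS parser:

```python
import re

# Illustrative pattern only: matches url("..."), url('...') and bare url(...).
_CSS_URL = re.compile(r'''url\(\s*['"]?([^'")\s]+)['"]?\s*\)''')

css = 'body { background: url("bg.png") } .dot { list-style: url(dot.gif) }'
urls = _CSS_URL.findall(css)
```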

urljoin(s, L)

Write back document with modified URLs (reverses urlextract).

Given a list L of URLMatch objects obtained from urlextract, substitutes changed URLs into the original document s, and returns the modified document.

One should only modify the .url attribute of the URLMatch objects. The ordering of the URLs in the list is not important.
>>> doc = '<img src="a.png"><a href="b.png">'
>>> L = urlextract(doc)
>>> L[0].url = 'foo'
>>> L[1].url = 'bar'
>>> urljoin(doc, L)
'<img src="foo"><a href="bar">'

_cast_to_str(arg, str_class)

Casts string components of several data structures to str_class.

Casts string, list of strings, or list of tuples (as returned by tagextract) such that all strings are made to type str_class.

_enumerate(L)

Like enumerate, provided for compatibility with Python < 2.3.

Returns a list instead of an iterator.

_finditer(pattern, string)

Like re.finditer, provided for compatibility with Python < 2.3.

Returns a list instead of an iterator. Otherwise the return format is identical to re.finditer (except possibly in the details of empty matches).

_full_tag_extract(s)

Like tagextract, but different return format.

Returns a list of _HTMLTag and _TextTag instances.

The return format is inconvenient for manipulating HTML; it is mainly useful when you need the exact locations where tags occur in the original HTML document.

_html_split(s)

Helper routine: Split string into a list of tags and non-tags.
>>> _html_split(' blah <tag text> more </tag stuff> ')
[' blah ', '<tag text>', ' more ', '</tag stuff>', ' ']

Tags begin with '<' and end with '>'.

The identity ''.join(L) == s is always satisfied.

Exceptions to the normal parsing of HTML tags:

'<script>', '<style>', and HTML comment tags ignore all HTML until the closing pair, and are added as three elements:
>>> _html_split(' blah<style><<<><></style><!-- hi -->' +
...             ' <script language="Javascript"></>a</script>end')
[' blah', '<style>', '<<<><>', '</style>', '<!--', ' hi ', '-->',
 ' ', '<script language="Javascript">', '</>a', '</script>', 'end']
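For input without the script/style/comment special cases, the split can be sketched with a single regex that preserves the ''.join(L) == s identity (a simplification, not the real routine):

```python
import re

# Simplified sketch: split on <...> tags, keeping the delimiters via a
# capturing group. The real _html_split additionally special-cases
# script, style, and comment blocks.
def html_split_sketch(s):
    return [piece for piece in re.split(r'(<[^>]*>)', s) if piece != '']
```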

_ignore_tag_index(s, i)

Helper routine: Find index within _IGNORE_TAGS, or -1.

If s[i:] begins with an opening tag from _IGNORE_TAGS, return the index. Otherwise, return -1.

_is_str(s)

True iff s is a string (checks via duck typing).

_python_has_unicode()

True iff Python was compiled with unicode().

_remove_comments(doc)

Replaces commented-out characters with spaces in a CSS document.

_shlex_split(s)

Like shlex.split, but reversible, and for HTML.

Splits a string into a list L of strings. List elements contain either an HTML tag name=value pair, an HTML name singleton (e.g. "checked"), or whitespace.

The identity ''.join(L) == s is always satisfied.
>>> _shlex_split('a=5 b="15" name="Georgette A"')
['a=5', ' ', 'b="15"', ' ', 'name="Georgette A"']
>>> _shlex_split('a = a5 b=#b19 name="foo bar" q="hi"')
['a = a5', ' ', 'b=#b19', ' ', 'name="foo bar"', ' ', 'q="hi"']
>>> _shlex_split('a="9"b="15"')
['a="9"', 'b="15"']
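A much-simplified sketch of such a reversible splitter; unlike the real function, it does not handle spaces around '=':

```python
import re

# Simplified, hypothetical splitter: alternates name[=value] tokens with
# runs of whitespace, so joining the result reproduces the input.
_TOKEN = re.compile(r'''[^\s=]+(?:=(?:"[^"]*"|'[^']*'|\S*))?|\s+''')

def shlex_split_sketch(s):
    parts = _TOKEN.findall(s)
    assert ''.join(parts) == s   # the reversibility identity
    return parts
```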

_tag_dict(s)

Helper routine: Extracts a dict from an HTML tag string.
>>> _tag_dict('bgcolor=#ffffff text="#000000" blink')
({'bgcolor':'#ffffff', 'text':'#000000', 'blink': None},
 {'bgcolor':(0,7),  'text':(16,20), 'blink':(31,36)},
 {'bgcolor':(8,15), 'text':(22,29), 'blink':(36,36)})

Returns a 3-tuple. First element is a dict of (key, value) pairs from the HTML tag. Second element is a dict mapping keys to (start, end) indices of the key in the text. Third element maps keys to (start, end) indices of the value in the text.

Names are lowercased.

Raises ValueError for unmatched quotes and other errors.

_test()

Unit test main routine.

_test_remove_comments()

Unit test for _remove_comments.

_test_shlex_split()

Unit test for _shlex_split.

_test_tag_dict()

Unit test for _tag_dict.

_test_tagextract(str_class=str)

Unit tests for tagextract and tagjoin.

Strings are cast to the string class argument str_class.

_test_tuple_replace()

Unit test for _tuple_replace.

_test_urlextract(str_class=str)

Unit tests for urlextract and urljoin.

Strings are cast to the string class argument str_class.

_tuple_replace(s, Lindices, Lreplace)

Replace slices of a string with new substrings.

Given a list of slice tuples in Lindices, replace each slice in s with the corresponding replacement substring from Lreplace.

Example:
>>> _tuple_replace('0123456789',[(4,5),(6,9)],['abc', 'def'])
'0123abc5def9'
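The behavior can be sketched as follows, assuming (as the example above implies) that the slice tuples are sorted and non-overlapping:

```python
def tuple_replace_sketch(s, Lindices, Lreplace):
    # Walk the slices in order, copying the untouched gaps between them.
    out, prev = [], 0
    for (start, end), repl in zip(Lindices, Lreplace):
        out.append(s[prev:start])   # text before this slice
        out.append(repl)            # replacement for s[start:end]
        prev = end
    out.append(s[prev:])            # trailing text after the last slice
    return ''.join(out)
```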

Variable Details

__version__

Type:
str
Value:
'1.1.1'                                                                

_BEGIN_CDATA

Type:
str
Value:
'<![CDATA['                                                            

_BEGIN_COMMENT

Type:
str
Value:
'<!--'                                                                 

_CSS_MIMETYPES

Type:
list
Value:
['text/css']                                                           

_END_CDATA

Type:
str
Value:
']]>'                                                                  

_END_COMMENT

Type:
str
Value:
'-->'                                                                  

_HTML_MIMETYPES

Type:
list
Value:
['text/html',
 'application/xhtml',
 'application/xhtml+xml',
 'text/xml',
 'application/xml']                                                    

_IGNORE_TAGS

Type:
list
Value:
[('script', '/script'), ('style', '/style')]                           

_URL_TAGS

Type:
list
Value:
[('a', 'href'),
 ('applet', 'archive'),
 ('applet', 'code'),
 ('applet', 'codebase'),
 ('area', 'href'),
 ('base', 'href'),
 ('blockquote', 'cite'),
 ('body', 'background'),
...