Module htmldata
Manipulate HTML or XHTML documents.

Version 1.1.1. This source code has been placed in the public domain by
Connelly Barnes.

Features:

  - Translate HTML or XHTML back and forth to data structures
    (tagextract, tagjoin).
  - Extract and modify URLs in an HTML document or stylesheet
    (urlextract, urljoin).

See the examples function for a quick start.
Classes

  URLMatch
      A matched URL inside an HTML document or stylesheet.
  _HTMLTag
      HTML tag extracted by _full_tag_extract.
  _TextTag
      Text extracted from an HTML document by _full_tag_extract.
Function Summary

  examples()
      Examples of the htmldata module.
  tagextract(doc)
      Convert HTML to data structure.
  tagjoin(L)
      Convert data structure back to HTML.
  urlextract(doc, siteurl=None, mimetype='text/html')
      Extract URLs from HTML or stylesheet.
  urljoin(s, L)
      Write back document with modified URLs (reverses urlextract).
  _cast_to_str(arg, str_class)
      Casts string components of several data structures to str_class.
  _enumerate(L)
      Like enumerate, provided for compatibility with Python < 2.3.
  _finditer(pattern, string)
      Like re.finditer, provided for compatibility with Python < 2.3.
  _full_tag_extract(s)
      Like tagextract, but different return format.
  _html_split(s)
      Helper routine: Split string into a list of tags and non-tags.
  _ignore_tag_index(s, i)
      Helper routine: Find index within _IGNORE_TAGS, or -1.
  _is_str(s)
      True iff s is a string (checks via duck typing).
  _python_has_unicode()
      True iff Python was compiled with unicode().
  _remove_comments(doc)
      Replaces commented-out characters with spaces in a CSS document.
  _shlex_split(s)
      Like shlex.split, but reversible, and for HTML.
  _tag_dict(s)
      Helper routine: Extracts a dict from an HTML tag string.
  _test()
      Unit test main routine.
  _test_remove_comments()
      Unit test for _remove_comments.
  _test_shlex_split()
      Unit test for _shlex_split.
  _test_tag_dict()
      Unit test for _tag_dict.
  _test_tagextract(str_class=<type 'str'>)
      Unit tests for tagextract and tagjoin.
  _test_tuple_replace()
      Unit test for _tuple_replace.
  _test_urlextract(str_class=<type 'str'>)
      Unit tests for urlextract and urljoin.
  _tuple_replace(s, Lindices, Lreplace)
      Replace slices of a string with new substrings.
Variable Summary

  str   __version__ = '1.1.1'
  str   _BEGIN_CDATA = '<![CDATA['
  str   _BEGIN_COMMENT = '<!--'
  list  _CSS_MIMETYPES = ['text/css']
  str   _END_CDATA = ']]>'
  str   _END_COMMENT = '-->'
  list  _HTML_MIMETYPES = ['text/html', 'application/xhtml', 'ap...
  list  _IGNORE_TAGS = [('script', '/script'), ('style', '/style...
  list  _URL_TAGS = [('a', 'href'), ('applet', 'archive'), ('app...
Function Details
examples()

  Examples of the htmldata module.

  Example 1: Print all absolutized URLs from Google.

  Here we use urlextract to obtain all URLs in the document.

  >>> import urllib2, htmldata
  >>> url = 'http://www.google.com/'
  >>> contents = urllib2.urlopen(url).read()
  >>> for u in htmldata.urlextract(contents, url):
  ...   print u.url
  ...
  http://www.google.com/images/logo.gif
  http://www.google.com/search
  (More output)

  Note that the second argument to urlextract causes the URLs to be
  made absolute with respect to the base URL of the document.

  Example 2: Print all image URLs from Google, in relative form.

  >>> import urllib2, htmldata
  >>> url = 'http://www.google.com/'
  >>> contents = urllib2.urlopen(url).read()
  >>> for u in htmldata.urlextract(contents):
  ...   if u.tag_name == 'img':
  ...     print u.url
  ...
  /images/logo.gif

  Equivalently, one can use tagextract and inspect the 'img' tags
  directly.

  Example 3: Replace every <a href> link with a link to the Microsoft
  homepage.

  Here we use tagextract to turn the HTML into a data structure, and
  then loop over the in-order list of tags (items which are not tuples
  are plain text, which is ignored).

  >>> import urllib2, htmldata
  >>> url = 'http://www.google.com/'
  >>> contents = urllib2.urlopen(url).read()
  >>> L = htmldata.tagextract(contents)
  >>> for item in L:
  ...   if isinstance(item, tuple) and item[0] == 'a':
  ...     # It's an HTML <a> tag!  Give it an href=.
  ...     item[1]['href'] = 'http://www.microsoft.com/'
  ...
  >>> htmldata.tagjoin(L)
  (Microsoftized version of Google)

  Example 4: Make all URLs in an HTML document absolute.

  >>> import urllib2, htmldata
  >>> url = 'http://www.google.com/'
  >>> contents = urllib2.urlopen(url).read()
  >>> htmldata.urljoin(contents, htmldata.urlextract(contents, url))
  (Google HTML page with absolute URLs)

  Example 5: Properly quote all HTML tag values for pedants.

  >>> import urllib2, htmldata
  >>> url = 'http://www.google.com/'
  >>> contents = urllib2.urlopen(url).read()
  >>> htmldata.tagjoin(htmldata.tagextract(contents))
  (Properly quoted version of the original HTML)

  Example 6: Modify all URLs in a document so that they are appended
  to our proxy CGI script http://mysite.com/proxy.cgi.

  >>> import urllib2, htmldata
  >>> url = 'http://www.google.com/'
  >>> contents = urllib2.urlopen(url).read()
  >>> proxy_url = 'http://mysite.com/proxy.cgi?url='
  >>> L = htmldata.urlextract(contents)
  >>> for u in L:
  ...   u.url = proxy_url + u.url
  ...
  >>> htmldata.urljoin(contents, L)
  (Document with all URLs wrapped in our proxy script)

  Example 7: Download all images from a website.

  >>> import urllib, htmldata, time
  >>> url = 'http://www.google.com/'
  >>> contents = urllib.urlopen(url).read()
  >>> for u in htmldata.urlextract(contents, url):
  ...   if u.tag_name == 'img':
  ...     filename = urllib.quote_plus(u.url)
  ...     urllib.urlretrieve(u.url, filename)
  ...     time.sleep(0.5)
  ...
  (Images are downloaded to the current directory)

  Many sites will protect against bandwidth-draining robots by
  checking the HTTP Referer [sic] and User-Agent fields.  To
  circumvent this, one can create a urllib2.Request object with a
  legitimate Referer and a User-Agent such as
  "Mozilla/4.0 (compatible; MSIE 5.5)", as sketched below.  Then use
  urllib2.urlopen to download the content.  Be warned that some
  website operators will respond to rapid robot requests by banning
  the offending IP address.
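  The following sketch illustrates that technique (the Referer and
  User-Agent values shown are only examples, not required by this
  module):

  >>> import urllib2
  >>> request = urllib2.Request('http://www.google.com/images/logo.gif')
  >>> request.add_header('Referer', 'http://www.google.com/')
  >>> request.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5)')
  >>> contents = urllib2.urlopen(request).read()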
tagextract(doc)

  Convert HTML to data structure.

  Returns a list.  HTML tags become (name, attribute-dict) 2-tuples
  within the list, while plain text becomes strings:

  >>> tagextract('<img src=hi.gif alt="hi">foo<br><br/></body>')
  [('img', {'src': 'hi.gif', 'alt': 'hi'}), 'foo', ('br', {}),
   ('br/', {}), ('/body', {})]

  Text between '<script>' and '</script>' (and likewise between
  '<style>' and '</style>') is rendered directly to plain text.  This
  prevents rogue '<' or '>' characters from interfering with parsing.

  >>> tagextract('<script type="a"><blah>var x; </script>')
  [('script', {'type': 'a'}), '<blah>var x; ', ('/script', {})]

  Comment strings and XML directives are rendered as a single long tag
  with no attributes.  The case of the tag "name" is not changed:

  >>> tagextract('<!-- blah -->')
  [('!-- blah --', {})]
  >>> tagextract('<?xml version="1.0" encoding="utf-8" ?>')
  [('?xml version="1.0" encoding="utf-8" ?', {})]
  >>> tagextract('<!DOCTYPE html PUBLIC etc...>')
  [('!DOCTYPE html PUBLIC etc...', {})]

  Greater-than and less-than characters occurring inside comments or
  CDATA blocks are correctly kept as part of the block:

  >>> tagextract('<!-- <><><><>>..> -->')
  [('!-- <><><><>>..> --', {})]
  >>> tagextract('<!CDATA[[><>><>]<> ]]>')
  [('!CDATA[[><>><>]<> ]]', {})]

  Note that if one modifies these tags, it is important to retain the
  "--" (for comments) or "]]" (for CDATA) at the end of the tag name,
  so that output from tagjoin will be correct HTML/XHTML.
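  For instance, here is a minimal sketch of editing a comment tag
  while preserving the trailing "--" (the tagjoin output shown is
  inferred from the rules above, not quoted from a run):

  >>> L = tagextract('<!-- old note -->')
  >>> L[0] = ('!-- new note --', {})
  >>> tagjoin(L)
  '<!-- new note -->'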
tagjoin(L)

  Convert data structure back to HTML.

  This reverses the tagextract function:

  >>> tagjoin(tagextract(s))
  (string that is functionally equivalent to s)

  Three changes are made to the HTML by tagjoin: tags are lowercased,
  key=value pairs are sorted, and values are placed in double-quotes.
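  As an illustration of those three changes (the expected output is
  inferred from the rules above, not from an actual run):

  >>> tagjoin(tagextract('<IMG SRC=hi.gif ALT=Hello>'))
  '<img alt="Hello" src="hi.gif">'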
urlextract(doc, siteurl=None, mimetype='text/html')

  Extract URLs from HTML or stylesheet.

  Extracts only URLs that are linked to or embedded in the document.
  Ignores plain text URLs that occur in the non-HTML part of the
  document.

  Returns a list of URLMatch objects.

  >>> L = urlextract('<img src="a.gif"><a href="www.google.com">')
  >>> L[0].url
  'a.gif'
  >>> L[1].url
  'www.google.com'

  If siteurl is specified, all URLs are made into absolute URLs by
  assuming that doc is located at the URL siteurl.

  >>> doc = '<img src="a.gif"><a href="/b.html">'
  >>> L = urlextract(doc, 'http://www.python.org/~guido/')
  >>> L[0].url
  'http://www.python.org/~guido/a.gif'
  >>> L[1].url
  'http://www.python.org/b.html'

  If mimetype is 'text/css', the document is parsed as a stylesheet.
  If a stylesheet is embedded in an HTML document, then urlextract
  will extract the URLs from both the HTML and the stylesheet.
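  A sketch of stylesheet extraction under that mimetype (this assumes
  CSS url(...) references are recognized and that surrounding quotes
  are stripped; both details are assumptions, not quoted output):

  >>> L = urlextract('body { background: url("bg.png") }', mimetype='text/css')
  >>> L[0].url
  'bg.png'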
urljoin(s, L)

  Write back document with modified URLs (reverses urlextract).

  Given the original document s and a list L of URLMatch objects
  obtained from urlextract(s), substitutes each URL in s with the
  (possibly modified) .url attribute of the corresponding URLMatch
  object.  The ordering of the URLs in the list is not important.

  >>> doc = '<img src="a.png"><a href="b.png">'
  >>> L = urlextract(doc)
  >>> L[0].url = 'foo'
  >>> L[1].url = 'bar'
  >>> urljoin(doc, L)
  '<img src="foo"><a href="bar">'
_cast_to_str(arg, str_class)

  Casts string components of several data structures to str_class.

  Casts a string, a list of strings, or a list of tuples (as returned
  by tagextract) such that all strings are made to type str_class.
_enumerate(L)

  Like enumerate, provided for compatibility with Python < 2.3.
_finditer(pattern, string)

  Like re.finditer, provided for compatibility with Python < 2.3
  (except possibly in the details of empty matches).
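  A minimal sketch of how such compatibility shims can be written
  (hypothetical implementations, not necessarily the module's own):

    import re

    def _enumerate(L):
        # Pair each item with its index, like the Python >= 2.3 built-in.
        return zip(range(len(L)), L)

    def _finditer(pattern, string):
        # Collect regex match objects left to right, like re.finditer.
        matches = []
        pos = 0
        compiled = re.compile(pattern)
        while pos <= len(string):
            m = compiled.search(string, pos)
            if m is None:
                break
            matches.append(m)
            if m.end() == m.start():
                # Step past empty matches to avoid looping forever; this
                # is where behavior may differ from re.finditer in detail.
                pos = m.end() + 1
            else:
                pos = m.end()
        return matches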
_full_tag_extract(s)

  Like tagextract, but different return format.

  Returns a list of _HTMLTag and _TextTag instances.
_html_split(s)

  Helper routine: Split string into a list of tags and non-tags.

  >>> _html_split(' blah <tag text> more </tag stuff> ')
  [' blah ', '<tag text>', ' more ', '</tag stuff>', ' ']

  Tags begin with '<' and end with '>'.  The identity
  ''.join(L) == s is always satisfied.

  Exceptions to the normal parsing of HTML tags: '<script>',
  '<style>', and HTML comment tags ignore all HTML until the closing
  pair, and are added as three elements:

  >>> _html_split(' blah<style><<<><></style><!-- hi -->' +
  ...             ' <script language="Javascript"></>a</script>end')
  [' blah', '<style>', '<<<><>', '</style>', '<!--', ' hi ', '-->',
   ' ', '<script language="Javascript">', '</>a', '</script>', 'end']
_ignore_tag_index(s, i)

  Helper routine: Find index within _IGNORE_TAGS, or -1.

  If s[i:] begins with an opening tag from _IGNORE_TAGS, return the
  index.  Otherwise, return -1.
_is_str(s)

  True iff s is a string (checks via duck typing).

_python_has_unicode()

  True iff Python was compiled with unicode().

_remove_comments(doc)

  Replaces commented-out characters with spaces in a CSS document.
_shlex_split(s)

  Like shlex.split, but reversible, and for HTML.

  Splits a string into a list L of tokens, such that the identity
  ''.join(L) == s is always satisfied.

  >>> _shlex_split('a=5 b="15" name="Georgette A"')
  ['a=5', ' ', 'b="15"', ' ', 'name="Georgette A"']

  >>> _shlex_split('a = a5 b=#b19 name="foo bar" q="hi"')
  ['a = a5', ' ', 'b=#b19', ' ', 'name="foo bar"', ' ', 'q="hi"']

  >>> _shlex_split('a="9"b="15"')
  ['a="9"', 'b="15"']
_tag_dict(s)

  Helper routine: Extracts a dict from an HTML tag string.

  >>> _tag_dict('bgcolor=#ffffff text="#000000" blink')
  ({'bgcolor':'#ffffff', 'text':'#000000', 'blink': None},
   {'bgcolor':(0,7), 'text':(16,20), 'blink':(31,36)},
   {'bgcolor':(8,15), 'text':(22,29), 'blink':(36,36)})

  Returns a 3-tuple.  The first element is a dict mapping attribute
  names to values; the second maps each name to the (start, end)
  indices of that name within s; the third maps each name to the
  (start, end) indices of its value within s.  Names are lowercased.

  Raises ValueError for unmatched quotes and other errors.
_test()

  Unit test main routine.

_test_remove_comments()

  Unit test for _remove_comments.

_test_shlex_split()

  Unit test for _shlex_split.

_test_tag_dict()

  Unit test for _tag_dict.

_test_tagextract(str_class=<type 'str'>)

  Unit tests for tagextract and tagjoin.

_test_tuple_replace()

  Unit test for _tuple_replace.

_test_urlextract(str_class=<type 'str'>)

  Unit tests for urlextract and urljoin.
_tuple_replace(s, Lindices, Lreplace)

  Replace slices of a string with new substrings.

  Given a list of slice tuples in Lindices, replaces each slice
  s[start:end] with the corresponding replacement string from
  Lreplace:

  >>> _tuple_replace('0123456789', [(4,5),(6,9)], ['abc', 'def'])
  '0123abc5def9'
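  A minimal sketch of such a routine (a hypothetical implementation
  consistent with the example above, assuming the slices do not
  overlap):

    def _tuple_replace(s, Lindices, Lreplace):
        # Pair each (start, end) slice with its replacement string.
        pairs = zip(Lindices, Lreplace)
        pairs.sort()                      # process slices in string order
        pieces = []
        prev = 0
        for ((start, end), repl) in pairs:
            pieces.append(s[prev:start])  # keep text before this slice
            pieces.append(repl)           # substitute the new substring
            prev = end
        pieces.append(s[prev:])           # keep the tail after the last slice
        return ''.join(pieces)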
Generated by Epydoc 2.1 on Thu Sep 15 10:52:17 2005 (http://epydoc.sf.net)