Module htmldata

Manipulate HTML or XHTML documents.

Version 1.1.1. This source code has been placed in the public domain by Connelly Barnes.

See the examples() function below for a quick start.
Classes

  URLMatch
      A matched URL inside an HTML document or stylesheet.
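The examples below use two attributes of these match objects, .url and .tag_name. Conceptually, a URLMatch behaves like a small record; the sketch below is a hypothetical stand-in for illustration, not the class's real definition:

```python
from dataclasses import dataclass

@dataclass
class URLMatchSketch:
    """Hypothetical stand-in for htmldata.URLMatch, showing the two
    attributes used throughout the examples in this document."""
    url: str        # the matched URL text (possibly made absolute)
    tag_name: str   # name of the HTML tag the URL appeared in, e.g. 'img'

m = URLMatchSketch(url='/images/logo.gif', tag_name='img')
```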
Function Summary

  examples()
      Examples of the htmldata module.
  tagextract(doc)
      Convert HTML to data structure.
  tagjoin(L)
      Convert data structure back to HTML.
  urlextract(doc, siteurl=None, mimetype='text/html')
      Extract URLs from HTML or stylesheet.
  urljoin(s, L)
      Write back document with modified URLs (reverses urlextract).
Function Details
examples()

    Examples of the htmldata module.

    Example 1: Print all absolutized URLs from Google. Here we use
    urlextract to obtain all URLs in the document.

    >>> import urllib2, htmldata
    >>> url = 'http://www.google.com/'
    >>> contents = urllib2.urlopen(url).read()
    >>> for u in htmldata.urlextract(contents, url):
    ...   print u.url
    ...
    http://www.google.com/images/logo.gif
    http://www.google.com/search
    (More output)

    Note that the second argument to urlextract makes the extracted URLs
    absolute with respect to that base URL.

    Example 2: Print all image URLs from Google in relative form.

    >>> import urllib2, htmldata
    >>> url = 'http://www.google.com/'
    >>> contents = urllib2.urlopen(url).read()
    >>> for u in htmldata.urlextract(contents):
    ...   if u.tag_name == 'img':
    ...     print u.url
    ...
    /images/logo.gif

    Example 3: Replace all hyperlinks with links to Microsoft. Here we
    use tagextract to turn the HTML into a data structure, and then loop
    over the in-order list of tags (items which are not tuples are plain
    text, which is ignored).

    >>> import urllib2, htmldata
    >>> url = 'http://www.google.com/'
    >>> contents = urllib2.urlopen(url).read()
    >>> L = htmldata.tagextract(contents)
    >>> for item in L:
    ...   if isinstance(item, tuple) and item[0] == 'a':
    ...     # It's an HTML <a> tag!  Give it an href=.
    ...     item[1]['href'] = 'http://www.microsoft.com/'
    ...
    >>> htmldata.tagjoin(L)
    (Microsoftized version of Google)

    Example 4: Make all URLs in an HTML document absolute.

    >>> import urllib2, htmldata
    >>> url = 'http://www.google.com/'
    >>> contents = urllib2.urlopen(url).read()
    >>> htmldata.urljoin(contents, htmldata.urlextract(contents, url))
    (Google HTML page with absolute URLs)

    Example 5: Properly quote all HTML tag values for pedants.

    >>> import urllib2, htmldata
    >>> url = 'http://www.google.com/'
    >>> contents = urllib2.urlopen(url).read()
    >>> htmldata.tagjoin(htmldata.tagextract(contents))
    (Properly quoted version of the original HTML)

    Example 6: Modify all URLs in a document so that they are appended
    to our proxy CGI script http://mysite.com/proxy.cgi.

    >>> import urllib2, htmldata
    >>> url = 'http://www.google.com/'
    >>> contents = urllib2.urlopen(url).read()
    >>> proxy_url = 'http://mysite.com/proxy.cgi?url='
    >>> L = htmldata.urlextract(contents)
    >>> for u in L:
    ...   u.url = proxy_url + u.url
    ...
    >>> htmldata.urljoin(contents, L)
    (Document with all URLs wrapped in our proxy script)

    Example 7: Download all images from a website.

    >>> import urllib, htmldata, time
    >>> url = 'http://www.google.com/'
    >>> contents = urllib.urlopen(url).read()
    >>> for u in htmldata.urlextract(contents, url):
    ...   if u.tag_name == 'img':
    ...     filename = urllib.quote_plus(u.url)
    ...     urllib.urlretrieve(u.url, filename)
    ...     time.sleep(0.5)
    ...
    (Images are downloaded to the current directory)

    Many sites protect against bandwidth-draining robots by checking the
    HTTP Referer [sic] and User-Agent fields. To circumvent this, one
    can create a urllib2.Request object with a legitimate Referer and a
    User-Agent such as "Mozilla/4.0 (compatible; MSIE 5.5)", then use
    urllib2.urlopen to download the content. Be warned that some website
    operators will respond to rapid robot requests by banning the
    offending IP address.
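In Python 3, urllib2 was merged into urllib.request, but the same idea applies. A minimal sketch of building such a request with spoofed headers (the URL is a placeholder, and no request is actually sent here):

```python
import urllib.request

# Hypothetical target page; replace with the page you want to fetch.
url = 'http://www.example.com/page.html'

# Build a Request whose headers mimic an ordinary browser visit.
req = urllib.request.Request(url, headers={
    'Referer': 'http://www.example.com/',  # a plausible referring page
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5)',
})

# To actually download the content:
#   contents = urllib.request.urlopen(req).read()
```

Remember to throttle requests (e.g. time.sleep between fetches) so the robot does not hammer the server.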
tagextract(doc)

    Convert HTML to data structure. Returns a list in which HTML tags
    become (name, attribute-dict) tuples and plain text becomes strings:

    >>> tagextract('<img src=hi.gif alt="hi">foo<br><br/></body>')
    [('img', {'src': 'hi.gif', 'alt': 'hi'}), 'foo', ('br', {}), ('br/', {}), ('/body', {})]

    Text inside '<script>' and '<style>' blocks is rendered directly to
    plain text. This prevents rogue '<' or '>' characters from
    interfering with parsing.

    >>> tagextract('<script type="a"><blah>var x; </script>')
    [('script', {'type': 'a'}), '<blah>var x; ', ('/script', {})]

    Comment strings and XML directives are rendered as a single long tag
    with no attributes. The case of the tag "name" is not changed:

    >>> tagextract('<!-- blah -->')
    [('!-- blah --', {})]
    >>> tagextract('<?xml version="1.0" encoding="utf-8" ?>')
    [('?xml version="1.0" encoding="utf-8" ?', {})]
    >>> tagextract('<!DOCTYPE html PUBLIC etc...>')
    [('!DOCTYPE html PUBLIC etc...', {})]

    Greater-than and less-than characters occurring inside comments or
    CDATA blocks are correctly kept as part of the block:

    >>> tagextract('<!-- <><><><>>..> -->')
    [('!-- <><><><>>..> --', {})]
    >>> tagextract('<!CDATA[[><>><>]<> ]]>')
    [('!CDATA[[><>><>]<> ]]', {})]

    Note that if one modifies these tags, it is important to retain the
    "--" (for comments) or "]]" (for CDATA) at the end of the tag name,
    so that output from tagjoin will be correct HTML/XHTML.
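For comparison, Python 3's standard html.parser can produce a similar in-order list of (name, attribute-dict) tuples and text strings. This is a rough stdlib sketch of the same output shape, not htmldata's implementation (it does not reproduce htmldata's script/comment/CDATA handling):

```python
from html.parser import HTMLParser

class TagLister(HTMLParser):
    """Collect an in-order list of (name, attrs) tuples and text
    strings, loosely mirroring tagextract's output shape."""
    def __init__(self):
        super().__init__()
        self.items = []
    def handle_starttag(self, tag, attrs):
        self.items.append((tag, dict(attrs)))
    def handle_endtag(self, tag):
        self.items.append(('/' + tag, {}))
    def handle_data(self, data):
        self.items.append(data)

p = TagLister()
p.feed('<img src="hi.gif" alt="hi">foo<br>')
# p.items -> [('img', {'src': 'hi.gif', 'alt': 'hi'}), 'foo', ('br', {})]
```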
tagjoin(L)

    Convert data structure back to HTML. This reverses the tagextract
    function:

    >>> tagjoin(tagextract(s))
    (string that is functionally equivalent to s)

    Three changes are made to the HTML by tagjoin: tags are lowercased,
    key=value pairs are sorted, and values are placed in double-quotes.
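The three normalizations can be sketched as a small helper for a single tag. This is an illustration of the stated rules, not htmldata's actual implementation (it skips value escaping and valueless attributes):

```python
def join_tag(name, attrs):
    """Render one (name, attrs) tuple as normalized HTML:
    lowercased tag, attributes sorted by key, values double-quoted."""
    parts = [name.lower()]
    for key in sorted(attrs):
        parts.append('%s="%s"' % (key.lower(), attrs[key]))
    return '<' + ' '.join(parts) + '>'

print(join_tag('IMG', {'src': 'hi.gif', 'alt': 'hi'}))
# -> <img alt="hi" src="hi.gif">
```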
urlextract(doc, siteurl=None, mimetype='text/html')

    Extract URLs from HTML or stylesheet. Extracts only URLs that are
    linked to or embedded in the document; plain-text URLs that occur in
    the non-HTML part of the document are ignored. Returns a list of
    URLMatch objects.

    >>> L = urlextract('<img src="a.gif"><a href="www.google.com">')
    >>> L[0].url
    'a.gif'
    >>> L[1].url
    'www.google.com'

    If siteurl is specified, all URLs are made into absolute URLs by
    assuming that doc is located at the URL siteurl.

    >>> doc = '<img src="a.gif"><a href="/b.html">'
    >>> L = urlextract(doc, 'http://www.python.org/~guido/')
    >>> L[0].url
    'http://www.python.org/~guido/a.gif'
    >>> L[1].url
    'http://www.python.org/b.html'

    If mimetype is 'text/css', the document is parsed as a stylesheet.
    For an HTML document with embedded stylesheets, urlextract will
    extract the URLs from both the HTML and the stylesheet.
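The absolutizing behavior follows standard relative-URL resolution, which can be checked against urllib.parse.urljoin from the Python 3 standard library:

```python
from urllib.parse import urljoin

base = 'http://www.python.org/~guido/'

# Relative path: resolved against the base document's directory.
print(urljoin(base, 'a.gif'))    # http://www.python.org/~guido/a.gif

# Root-relative path: resolved against the site root.
print(urljoin(base, '/b.html'))  # http://www.python.org/b.html
```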
urljoin(s, L)

    Write back document with modified URLs (reverses urlextract). Given
    the original document s and a list L of URLMatch objects obtained
    from urlextract, substitutes into s any changes made to the .url
    attribute of the URLMatch objects. The ordering of the URLs in the
    list is not important.

    >>> doc = '<img src="a.png"><a href="b.png">'
    >>> L = urlextract(doc)
    >>> L[0].url = 'foo'
    >>> L[1].url = 'bar'
    >>> urljoin(doc, L)
    '<img src="foo"><a href="bar">'
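A common way to implement such write-back, sketched here under the assumption that each match records its character span in the original document, is to apply substitutions from right to left so earlier offsets stay valid, which is why the input order does not matter:

```python
def substitute_spans(doc, matches):
    """Replace each (start, end, new_text) span in doc.
    Sorting and applying spans right-to-left keeps earlier offsets
    valid, so the input order of matches is irrelevant."""
    for start, end, new_text in sorted(matches, reverse=True):
        doc = doc[:start] + new_text + doc[end:]
    return doc

doc = '<img src="a.png"><a href="b.png">'
# Character spans of the two quoted URLs: a.png at 10..15, b.png at 26..31.
print(substitute_spans(doc, [(10, 15, 'foo'), (26, 31, 'bar')]))
# -> <img src="foo"><a href="bar">
```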
Generated by Epydoc 2.1 on Thu Sep 15 10:52:17 2005 (http://epydoc.sf.net)