1.1.1 2007-10-26
  * Handle non-space characters inside tags the same as spaces.
    (Previously, would fail on ).

1.1.0 2005-09-15
  * Short version: Got rid of some exceptions that occurred on malformed
    input, and improved handling of malformed input cases.
  * Long version: (skip if you don't care about malformed input).

    Performed round-trip parsing (urlextract => urljoin,
    tagextract => tagjoin) of some Fortune 500 Web pages, and roughly
    6000 pages from a "free for all links" site. Malformed HTML would
    previously cause exceptions to be raised, but this is undesirable,
    since we really want to act like a browser, and take our "best guess"
    at parsing the HTML in malformed and ambiguous cases. Got rid of the
    exceptions.

    There are new algorithms intended for recovering from malformed
    quotes and ">" within a quoted value. I looked in detail at five of
    the sites that were previously raising errors. The algorithms seem to
    be working (i.e. document parsing continues, similarly to what a
    human might do, instead of considering the entire rest of the
    document as plaintext).

    I plan to build a catalogue of the malformed documents and
    heuristically tweak the algorithm to "do well" on them. In the end,
    there will probably be two sets of unit tests, one for correct (or
    close enough to be unambiguous) documents, and one for malformed
    documents. Of course, the former set of unit tests must always pass.
    Feel free to report either kind of bug (the correctness bugs, or the
    malformed input pseudobugs).

    I guess I could modify this module to use a real parser and the
    Mozilla grammar DTDs, but that's a lot of work, and thus it only
    seems worthwhile if *this* code is plagued with bugs...
  - Connelly Barnes

1.0.9 2005-09-15
  * Better mime type handling for urlextract().
  * Duck typing, so string-like objects can be passed in.
  * Naive Unicode tests.
  * The function tagjoin() handles HTML attribute values with single, or
    double quotes, but not both (if both, then an error is raised).
  - Connelly Barnes

1.0.8 2005-07-14
  * Fixed parsing of single quoted attribute values in HTML (eg ).
  - Connelly Barnes

1.0.7 2005-04-26
  * Fixed bug where duplicate matches would be returned in urlextract().
    This would cause urljoin() to fail.
  - Connelly Barnes

1.0.6 2005-02-06
  * urlextract() finds URLs inside style="..." tag attributes.
  - Connelly Barnes

1.0.5 2005-02-06
  * Correctly parses tags like
  * urlextract() handles @import statements in CSS.
  - Connelly Barnes

1.0.4 2004-12-10
  * Python 2.0-2.4 compatibility.
  - Connelly Barnes

1.0.3 2004-12-10
  * Python 2.2 compatibility.
  * Fixed XHTML parsing (which didn't work correctly with and
    directives).
  * Added rules for XML directives.
  * Changed comments so that a comment becomes [('!-- comment --', {})]
    after being parsed by tagextract(), instead of
    [('!--', {}), ' comment ', ('--', {})].
  - Connelly Barnes

1.0.2 2004-10-07
  * Stopped parsing HTML tags by accident inside comments.
  * Fixed HTML decoding %ff bug.
  * Fixed HTML dropping characters inside comment.
  * Changed interface for HTML <-> Data structure (use tagextract() and
    tagjoin() now).
  * Added URL extraction and modification functions. (urlextract() and
    urljoin()).
  - Connelly Barnes

1.0.1 2004-10-01
  * Fixed bug for parsing tag inside comment.
  - Connelly Barnes

1.0.0 2004-09-30
  * Initial release
  - Connelly Barnes
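
Illustration (not part of the module): the 1.0.3 and 1.0.9 entries describe a list of plain strings and (name, attributes) tuples, and a join step that picks a quote character for each attribute value. The sketch below is a rough, self-contained guess at how such a join could work. The names toy_tagjoin and quote_attr are hypothetical, the ('/a', {}) form for a closing tag is an assumption, and none of this is taken from the module's actual source.

```python
def quote_attr(value):
    """Wrap an attribute value in whichever quote character it does not contain.

    Hypothetical helper: mirrors the 1.0.9 note that a value may hold single
    or double quotes, but not both.
    """
    if '"' not in value:
        return '"' + value + '"'
    if "'" not in value:
        return "'" + value + "'"
    raise ValueError("attribute value contains both quote characters")


def toy_tagjoin(items):
    """Render a list of text strings and (name, attrs) tuples as HTML text.

    Hypothetical stand-in for a join step; not the module's implementation.
    """
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)  # plain text passes through unchanged
            continue
        name, attrs = item
        parts = [name]
        for key, value in attrs.items():
            parts.append(key + "=" + quote_attr(value))
        out.append("<" + " ".join(parts) + ">")
    return "".join(out)


if __name__ == "__main__":
    # Closing tag written as ('/a', {}) is an assumption of this sketch.
    print(toy_tagjoin([("a", {"href": "x.html"}), "link", ("/a", {})]))
    # Single-tuple comment form described in the 1.0.3 entry.
    print(toy_tagjoin([("!-- comment --", {})]))
```

With this representation a comment needs no special case in the join step: the single tuple ('!-- comment --', {}) is wrapped in angle brackets like any other tag, which is one plausible reason for preferring it over the older three-item form.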