Module web_search

Query Web search engines.

This module works by filtering the HTML returned by the search engine and thus tends to break when search engines modify their HTML output.

Public domain, Connelly Barnes 2005-2007. Compatible with Python 2.3-2.5.

See examples for a quick start. See description for the full explanation, precautions, and legal disclaimers.

Function Summary
  description()
Full explanation and precautions for web_search module.
  examples()
Examples of the web_search module.
  ask(query, max_results, blocking)
See docstring for web_search module.
  dmoz(query, max_results, blocking)
See docstring for web_search module.
  excite(query, max_results, blocking)
See docstring for web_search module.
  fix_url(url)
Given url str, trim redirect stuff and return actual URL.
  get_search_page_links(page, results_per_page, begin, end, link_re)
Given str contents of search result page, return list of links.
  google(query, max_results, blocking)
See docstring for web_search module.
  html_to_text(s)
Given an HTML formatted str, convert it to a text str.
  make_searcher(query_url, results_per_page, page_url, page_mode, begin, end, link_re)
Return a search function for the given search engine.
  msn(query, max_results, blocking)
See docstring for web_search module.
  nonblocking(f, blocking_return, sleep_time)
Wrap a callable which returns an iter so that it no longer blocks.
  quote_plus(s)
A variant of urllib.quote_plus which handles ASCII and Unicode.
  read_url(url, headers, blocking)
Read str contents of given str URL.
  test()
Unit test main routine.
  test_engine(search)
Test a search engine function returned by make_searcher().
  yahoo(query, max_results, blocking)
See docstring for web_search module.

Variable Summary
str __version__ = '1.0.2'
dict DEFAULT_HEADERS = {'User-Agent': 'Mozilla/4.0 (compatibl...
int DEFAULT_MAX_RESULTS = 10                                                                    
list SEARCH_ENGINES = ['ask', 'dmoz', 'excite', 'google', 'ms...

Function Details

description()

Full explanation and precautions for web_search module.

The search functions in this module follow a common interface:
   search(query, max_results=10, blocking=True) =>
     iterator of (name, url, description) search results.

Here query is the query string, max_results gives the maximum number of search results, and the items in the returned iterator are string 3-tuples containing the Website name, URL, and description for each search result.

If blocking=False, then an iterator is returned which does not block execution: the iterator yields None when the next search result is not yet available (a background thread is created).
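As a rough illustration of the blocking=False contract, the sketch below drains an iterator that yields None while results are pending. It uses Python 3 syntax, and fake_search, collect, and poll_interval are hypothetical names introduced only so the example runs without network access; they are not part of web_search.

```python
import time

def fake_search(query, max_results=10, blocking=True):
    """Hypothetical stand-in for a search function; not part of web_search."""
    for i in range(max_results):
        if not blocking:
            yield None  # pretend the next result is not available yet
        yield ("Example %d" % i, "http://example.com/%d" % i, "about " + query)

def collect(results_iter, poll_interval=0.01):
    """Drain an iterator that may yield None while results are pending."""
    found = []
    for item in results_iter:
        if item is None:
            # A real caller could do other work here instead of sleeping.
            time.sleep(poll_interval)
            continue
        found.append(item)
    return found

hits = collect(fake_search("python", 3, blocking=False))
for name, url, desc in hits:
    print(name, url)
```

The only point of the sketch is that a caller must test each item for None before unpacking it into (name, url, description).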

Supported search engines are 'ask', 'dmoz', 'excite', 'google', 'msn', 'yahoo'. This module is not associated with or endorsed by any of these search engine corporations.

Be warned that if searches are made too frequently, or max_results is large and you enumerate all search results, then you will be a drain on the search engine's bandwidth, and the search engine organization may respond by banning your IP address or IP address range.

This software has been placed in the public domain with the following legal notice:
   http://oregonstate.edu/~barnesc/documents/public_domain.txt

examples()

Examples of the web_search module.

Example 1:
>>> from web_search import google
>>> for (name, url, desc) in google('python', 20):
...   print name, url
...

(Prints the first 20 results of a Google search for "python".)

Example 2:
>>> from web_search import dmoz
>>> list(dmoz('abc', 10))
[('ABC.com', 'http://www.abc.com', "What's on ABC..."), ...]

ask(query, max_results=10, blocking=True)

See docstring for web_search module.

dmoz(query, max_results=10, blocking=True)

See docstring for web_search module.

excite(query, max_results=10, blocking=True)

See docstring for web_search module.

fix_url(url)

Given url str, trim redirect stuff and return actual URL.

Currently this just returns the URL unmodified.

get_search_page_links(page, results_per_page, begin, end, link_re)

Given str contents of search result page, return list of links.

Returns list of (name, url, desc) str tuples. See make_searcher() for a description of results_per_page and link_re.

google(query, max_results=10, blocking=True)

See docstring for web_search module.

html_to_text(s)

Given an HTML formatted str, convert it to a text str.
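The documentation does not say how the conversion is done. One plausible approach, sketched here in Python 3 with the standard html.parser module, is to drop tags, decode character references, and collapse whitespace. The name html_to_text_sketch is hypothetical, and the module's real function may behave differently.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text_sketch(s):
    """Hypothetical helper: strip tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(s)
    parser.close()
    return ' '.join(''.join(parser.parts).split())

text = html_to_text_sketch('<b>Tom &amp; Jerry</b>  &lt;official&gt; <i>site</i>')
print(text)
```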

make_searcher(query_url, results_per_page, page_url, page_mode, begin, end, link_re)

Return a search function for the given search engine.

Here query_url is the URL for the initial search, with %(q)s standing for the query string, and results_per_page is the number of search results per page. The page_url argument is the URL for the second and subsequent pages of search results, with %(q)s standing for the query string and %(n)s for the page "number". The page_mode argument controls the actual value substituted for the page "number":
  • page_mode='page0': Use 0-based index of the page.
  • page_mode='page1': Use 1-based index of the page.
  • page_mode='offset0': Use 0-based index of the search result, which is a multiple of results_per_page.
  • page_mode='offset1': Use 1-based index of the search result (one plus a multiple of results_per_page).
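The four modes above can be read as the following mapping. This is a sketch in Python 3, not the module's code: page_number and page_index (0 for the first results page) are hypothetical names, and the page_url shown is a made-up example.

```python
def page_number(page_mode, page_index, results_per_page):
    """Value substituted for %(n)s; page_index is 0 for the first page."""
    if page_mode == 'page0':
        return page_index
    if page_mode == 'page1':
        return page_index + 1
    if page_mode == 'offset0':
        return page_index * results_per_page
    if page_mode == 'offset1':
        return page_index * results_per_page + 1
    raise ValueError('unknown page_mode: %r' % (page_mode,))

# Hypothetical URL template for the 2nd and subsequent result pages.
page_url = 'http://search.example.com/?q=%(q)s&start=%(n)s'

# Third page (page_index 2) of an engine that lists 10 results per page.
url = page_url % {'q': 'python', 'n': page_number('offset0', 2, 10)}
print(url)
```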

If begin is not None, then only text after the first occurrence of begin will be used in the search results page. If end is not None, then only text before the first occurrence of end will be used.

Finally, link_re is a regex string (see module re) which matches three named groups: 'name', 'url', and 'desc'. These correspond to the name, URL and description of each search result. The regex is applied in re.DOTALL mode.
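To make the begin, end, and link_re arguments concrete, here is a small Python 3 sketch of trimming a page and then extracting the three named groups. The markup, the pattern, and the helper trim_page are all hypothetical. Leaving the page unchanged when a marker is missing is an assumption, since the documentation does not cover that case.

```python
import re

def trim_page(page, begin=None, end=None):
    """Keep text after the first `begin` and before the first `end`."""
    if begin is not None:
        i = page.find(begin)
        if i >= 0:  # assumption: a missing marker leaves the page as-is
            page = page[i + len(begin):]
    if end is not None:
        i = page.find(end)
        if i >= 0:
            page = page[:i]
    return page

# Hypothetical pattern with the three required named groups.
link_re = r'<a href="(?P<url>[^"]*)">(?P<name>.*?)</a>\s*<p>(?P<desc>.*?)</p>'

# Hypothetical result page: navigation, two results, then a footer ad.
page = ('<html>nav<!--results-->'
        '<a href="http://a.example">Site A</a>\n<p>First\nsite</p>'
        '<a href="http://b.example">Site B</a> <p>Second</p>'
        '<!--footer--><a href="http://ads.example">Ad</a><p>x</p></html>')

body = trim_page(page, begin='<!--results-->', end='<!--footer-->')

# re.DOTALL lets '.' match newlines, so a description may span lines.
links = [(m.group('name'), m.group('url'), m.group('desc'))
         for m in re.finditer(link_re, body, re.DOTALL)]
print(links)
```

Trimming first keeps the pattern from matching links outside the result list, such as the footer ad in this sample.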

Returns a search() function which has the same interface as described in the module docstring.

msn(query, max_results=10, blocking=True)

See docstring for web_search module.

nonblocking(f, blocking_return=None, sleep_time=0.01)

Wrap a callable which returns an iter so that it no longer blocks.

The wrapped iterator returns blocking_return while callable f is blocking. The callable f is called in a background thread. If the wrapped iterator is deleted, then the iterator returned by f is deleted also and the background thread is terminated.
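The general technique can be sketched in Python 3 with a worker thread feeding a queue. This is not the module's implementation. It assumes that nonblocking returns a new callable, omits sleep_time because the documentation does not describe its exact role, and does not stop the worker when the iterator is deleted. The names nonblocking_sketch and slow_numbers are hypothetical.

```python
import queue
import threading
import time

_DONE = object()  # sentinel marking the end of the wrapped iterator

def nonblocking_sketch(f, blocking_return=None):
    """Return a callable whose iterator yields blocking_return instead of blocking."""
    def wrapper(*args, **kwargs):
        results = queue.Queue()

        def worker():
            # Runs f in the background and forwards each item to the queue.
            try:
                for item in f(*args, **kwargs):
                    results.put(item)
            finally:
                results.put(_DONE)

        threading.Thread(target=worker, daemon=True).start()
        while True:
            try:
                item = results.get_nowait()
            except queue.Empty:
                yield blocking_return  # nothing ready yet
                continue
            if item is _DONE:
                return
            yield item
    return wrapper

def slow_numbers():
    """Hypothetical slow iterator standing in for a search."""
    for i in range(3):
        time.sleep(0.02)
        yield i

values = []
for item in nonblocking_sketch(slow_numbers)():
    if item is None:
        time.sleep(0.005)  # the caller is free to do other work here
    else:
        values.append(item)
print(values)
```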

quote_plus(s)

A variant of urllib.quote_plus which handles ASCII and Unicode.
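For comparison, a Python 3 sketch of the same idea follows. In Python 3 the standard function lives at urllib.parse.quote_plus. Encoding text as UTF-8 is an assumption here, since the documentation does not state which encoding the module uses, and quote_plus_sketch is a hypothetical name.

```python
from urllib.parse import quote_plus

def quote_plus_sketch(s):
    """Hypothetical helper: URL-quote text or bytes, assuming UTF-8 for text."""
    if isinstance(s, str):
        s = s.encode('utf-8')  # assumption: UTF-8 is the wanted encoding
    return quote_plus(s)

ascii_q = quote_plus_sketch('monty python')
unicode_q = quote_plus_sketch('caf\u00e9 au lait')
print(ascii_q, unicode_q)
```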

read_url(url, headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5)'}, blocking=True)

Read str contents of given str URL.

Here headers is a map of str -> str for HTTP request headers. If blocking is True, returns the page contents as a str. If blocking is False, returns an iterator which yields None until the read succeeds, at which point it yields the page contents as a str.
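A blocking read with custom headers might look like the following Python 3 sketch, which uses urllib.request rather than the Python 2 library the module itself would have used. The names build_request and read_url_sketch are hypothetical, and the timeout parameter and the UTF-8 fallback are assumptions that do not appear in the documented signature.

```python
import urllib.request

DEFAULT_HEADERS = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5)'}

def build_request(url, headers=DEFAULT_HEADERS):
    """Attach the given HTTP headers to a request for url."""
    return urllib.request.Request(url, headers=dict(headers))

def read_url_sketch(url, headers=DEFAULT_HEADERS, timeout=10):
    """Blocking read: return the decoded page contents as a str."""
    request = build_request(url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: fall back to UTF-8 when the server names no charset.
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset, 'replace')

# Building the request does not touch the network.
req = build_request('http://example.com/')
print(req.full_url, req.header_items())
```

Calling read_url_sketch('http://example.com/') would perform the actual fetch; it is left out of the example so that the block runs offline.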

test()

Unit test main routine.

test_engine(search)

Test a search engine function returned by make_searcher().

yahoo(query, max_results=10, blocking=True)

See docstring for web_search module.

Variable Details

__version__

Type:
str
Value:
'1.0.2'                                                                

DEFAULT_HEADERS

Type:
dict
Value:
{'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5)'}                   

DEFAULT_MAX_RESULTS

Type:
int
Value:
10                                                                    

SEARCH_ENGINES

Type:
list
Value:
['ask', 'dmoz', 'excite', 'google', 'msn', 'yahoo']                    

Generated by Epydoc 2.1 on Sat Feb 3 16:45:05 2007 http://epydoc.sf.net