lazyscraper package

Submodules

lazyscraper.consts module

lazyscraper.htmltools module

lazyscraper.htmltools.table_to_dict(node, strip_lf=True)

Extracts data from table
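
As a rough illustration of what table extraction involves, the sketch below turns an HTML table into a list of dicts keyed by the header row. It is not the implementation of table_to_dict: it uses only the standard library's html.parser, the names TableRows and table_rows_to_dicts are hypothetical, and it assumes that strip_lf means "replace line breaks inside cell text".

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self, strip_lf=True):
        super().__init__()
        self.strip_lf = strip_lf  # assumption: replace CR/LF inside cells
        self.rows = []            # finished rows
        self._row = None          # cells of the row currently being read
        self._cell = None         # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = "".join(self._cell)
            if self.strip_lf:
                text = text.replace("\r", " ").replace("\n", " ")
            self._row.append(text.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows_to_dicts(html, strip_lf=True):
    """Return one dict per body row, using the first row as field names."""
    parser = TableRows(strip_lf)
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

Treating the first row as the header is a design choice of this sketch; tables without a header row would need explicit field names instead.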

lazyscraper.htmltools.taglist_to_dict(tags, fields, strip_lf=True)

Converts list of tags into dict
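
The idea of mapping tags to records by field name can be sketched as follows. This is an illustration only, not the library's code: it uses xml.etree.ElementTree elements as stand-ins for parsed tags, the function name tags_to_records is hypothetical, and it assumes that the special field "_text" refers to the text content of the tag while every other field names an attribute.

```python
import xml.etree.ElementTree as ET


def tags_to_records(tags, fields, strip_lf=True):
    """Build one dict per tag, with one key per requested field."""
    records = []
    for tag in tags:
        record = {}
        for field in fields:
            if field == "_text":
                # Assumption: "_text" means the text inside the tag.
                value = "".join(tag.itertext())
            else:
                # Any other field is read as an attribute; missing -> "".
                value = tag.get(field, "")
            if strip_lf:
                value = value.replace("\r", "").replace("\n", "")
            record[field] = value.strip()
        records.append(record)
    return records


# Usage: two links taken from a small, well-formed fragment.
root = ET.fromstring(
    '<ul><li><a href="/a">First</a></li>'
    '<li><a href="/b">Second\n</a></li></ul>'
)
links = root.findall(".//a")
records = tags_to_records(links, "href,_text".split(","))
```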

lazyscraper.patterns module

lazyscraper.patterns.pattern_extract_exturls(tree, nodeclass, nodeid, fields)

Pattern to extract external urls
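
One plausible way to decide whether a link is external is to resolve it against the page URL and compare hosts. The sketch below shows that test with urllib.parse; the function name external_urls and the host-comparison rule are assumptions made for illustration, and the pattern itself may apply different criteria.

```python
from urllib.parse import urljoin, urlparse


def external_urls(page_url, hrefs):
    """Return absolute http(s) links whose host differs from the page host."""
    page_host = urlparse(page_url).netloc.lower()
    result = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves relative links
        parsed = urlparse(absolute)
        is_web = parsed.scheme in ("http", "https")
        if is_web and parsed.netloc.lower() != page_host:
            result.append(absolute)
    return result
```

With this rule, relative links and non-web schemes such as mailto: are never reported as external.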

lazyscraper.patterns.pattern_extract_forms(tree, nodeclass, nodeid, fields)

Extracts web forms from page

lazyscraper.patterns.pattern_extract_simpleoptions(tree, nodeclass, nodeid, fields)

Simple SELECT / OPTION extractor pattern

lazyscraper.patterns.pattern_extract_simpleul(tree, nodeclass, nodeid, fields)

Simple UL lists extractor pattern

lazyscraper.scraper module

lazyscraper.scraper.extract_data_xpath(url, filename=None, xpath=None, fieldnames=None, absolutize=False, post=None, pagekey=None, pagerange=None)

Extract data with xpath

Parameters:
  • url (str|unicode) – HTML webpage url
  • xpath (str|unicode) – xpath expression
  • fieldnames (str|unicode) – string with list of fields like “src,alt,href,_text”
  • absolutize (bool) – If True, convert URLs returned in href and other URL-like fields to absolute URLs
  • post (bool) – If True use POST for HTTP requests
  • pagekey (str|unicode) – Key of the page listing. GET or POST parameter
  • pagerange (str|unicode) – Range of pages to process. String with format ‘min,max,step’, example: ‘1,72,1’
Returns: array of extracted values

Return type: array
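
To make the pagekey and pagerange parameters concrete, the sketch below expands a 'min,max,step' string into page numbers and builds one GET URL per page. It is not taken from the library: parse_pagerange and page_urls are hypothetical names, and treating max as inclusive is an assumption that the documentation above does not confirm.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def parse_pagerange(pagerange):
    """Turn 'min,max,step' into a range of page numbers."""
    parts = [int(part) for part in pagerange.split(",")]
    if len(parts) != 3:
        raise ValueError("pagerange must look like 'min,max,step'")
    start, stop, step = parts
    if step <= 0:
        raise ValueError("step must be positive")
    # Assumption: the upper bound is inclusive.
    return range(start, stop + 1, step)


def page_urls(url, pagekey, pagerange):
    """Build one URL per page by setting pagekey in the query string."""
    parsed = urlparse(url)
    # Keep existing query parameters, but drop any old page value.
    base_query = [(k, v) for k, v in parse_qsl(parsed.query) if k != pagekey]
    urls = []
    for page in parse_pagerange(pagerange):
        query = urlencode(base_query + [(pagekey, str(page))])
        urls.append(urlunparse(parsed._replace(query=query)))
    return urls
```

For POST requests the same page numbers would go into the request body rather than the query string.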

lazyscraper.scraper.get_table(url, nodeid=None, nodeclass=None, pagekey=False, pagerange=False, agent=None)

Extracts table with data from html

Parameters:
  • url (str|unicode) – HTML webpage url
  • nodeid (str|unicode) – id key for nodes
  • nodeclass (str|unicode) – class key for nodes
  • pagekey (str|unicode) – Key of the page listing. GET or POST parameter
  • pagerange (str|unicode) – Range of pages to process. String with format ‘min,max,step’, example: ‘1,72,1’
Returns: array of extracted values

Return type: array

lazyscraper.scraper.use_pattern(url, pattern, nodeid=None, nodeclass=None, fieldnames=None, absolutize=False, pagekey=False, pagerange=False)

Uses predefined pattern to extract page data

Parameters:
  • url (str|unicode) – HTML webpage url
  • nodeid (str|unicode) – id key for nodes
  • nodeclass (str|unicode) – class key for nodes
  • fieldnames (str|unicode) – string with list of fields like “src,alt,href,_text”
  • absolutize (bool) – If True, convert URLs returned in href and other URL-like fields to absolute URLs
  • pagekey (str|unicode) – Key of the page listing. GET or POST parameter
  • pagerange (str|unicode) – Range of pages to process. String with format ‘min,max,step’, example: ‘1,72,1’
Returns: array of extracted values

Return type: array
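
The effect of the absolutize option can be illustrated with urllib.parse.urljoin. The sketch below is an assumption about the behaviour, not the library's code: absolutize_records is a hypothetical name, and the set of fields treated as URL-like (href and src) is chosen here only for the example.

```python
from urllib.parse import urljoin

# Assumption: which fields count as "url-like" is not specified above.
URL_FIELDS = ("href", "src")


def absolutize_records(records, base_url, url_fields=URL_FIELDS):
    """Return copies of the records with relative URLs made absolute."""
    result = []
    for record in records:
        fixed = dict(record)  # leave the input records untouched
        for field in url_fields:
            if fixed.get(field):
                fixed[field] = urljoin(base_url, fixed[field])
        result.append(fixed)
    return result
```

Values that are already absolute pass through urljoin unchanged, so the function is safe to apply to mixed input.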

lazyscraper.urltools module

lazyscraper.urltools.get_cached_post(url, postdata, host=None, port=11211, agent='Mozilla/5.0 (Linux; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Mobile Safari/537.36')

Returns data fetched from url with a POST request

lazyscraper.urltools.get_cached_url(url, timeout=3600, host=None, port=11211, agent='Mozilla/5.0 (Linux; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Mobile Safari/537.36')

Returns url data from url or from local memcached
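
The fetch-or-reuse pattern that a cached getter implies can be sketched with an in-memory stand-in for memcached. Everything here is hypothetical and for illustration only: TimedCache replaces the memcached client, get_cached is not the library function, and the fetch callable is injected so the example runs without network access. The timeout is assumed to be a lifetime in seconds.

```python
import time


class TimedCache:
    """Tiny in-memory cache whose entries expire after a timeout."""

    def __init__(self, clock=time.monotonic):
        self._store = {}
        self._clock = clock

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._store[key]  # entry is stale, drop it
            return None
        return value

    def set(self, key, value, timeout):
        self._store[key] = (value, self._clock() + timeout)


def get_cached(url, fetch, cache, timeout=3600):
    """Return cached data for url, calling fetch(url) only on a miss."""
    data = cache.get(url)
    if data is None:
        data = fetch(url)
        cache.set(url, data, timeout)
    return data
```

Passing the clock and the fetch function as arguments keeps the sketch deterministic and easy to test; a real setup would talk to a memcached server on the configured host and port.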

lazyscraper.urltools.get_from_file(filename, encoding='utf-8')

Returns parsed data from file

Module contents