lazyscraper package

Submodules

lazyscraper.consts module

lazyscraper.htmltools module

lazyscraper.htmltools.table_to_dict(node, strip_lf=True): Extracts data from a table

lazyscraper.htmltools.taglist_to_dict(tags, fields, strip_lf=True): Converts a list of tags into a dict
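To make the idea of turning tags into dicts concrete, here is a minimal sketch using only the standard library. It is not lazyscraper's implementation: the helper name `tags_to_dicts`, the meaning of `_text` (the element's text content), and the behaviour of `strip_lf` (drop CR/LF characters and trim whitespace) are all assumptions chosen for illustration, based on the field-name convention 'src,alt,href,_text' documented for `extract_data_xpath`.

```python
import xml.etree.ElementTree as ET


def tags_to_dicts(tags, fields, strip_lf=True):
    """Hypothetical helper: build one dict per element, keyed by field name."""
    rows = []
    for tag in tags:
        row = {}
        for field in fields:
            if field == "_text":
                # Assumption: "_text" refers to the element's text content.
                value = "".join(tag.itertext())
            else:
                # Any other field name is read as an attribute of the element.
                value = tag.get(field, "")
            if strip_lf:
                # Assumption: strip_lf removes CR/LF and surrounding whitespace.
                value = value.replace("\r", "").replace("\n", "").strip()
            row[field] = value
        rows.append(row)
    return rows


# A small well-formed snippet; real pages usually need an HTML-tolerant parser.
snippet = """<div>
  <a href="/docs">Docs
  </a>
  <a href="https://example.com">Example</a>
</div>"""

root = ET.fromstring(snippet)
links = tags_to_dicts(root.findall(".//a"), "href,_text".split(","))
print(links)
```

The result here is a list with one dict per tag; whether the library returns a list, a single dict, or something else is not stated in this reference.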

lazyscraper.patterns module

lazyscraper.patterns.pattern_extract_exturls(tree, nodeclass, nodeid, fields): Pattern to extract external urls

lazyscraper.patterns.pattern_extract_forms(tree, nodeclass, nodeid, fields): Extracts web forms from a page

lazyscraper.patterns.pattern_extract_simpleoptions(tree, nodeclass, nodeid, fields): Simple SELECT / OPTION extractor pattern

lazyscraper.patterns.pattern_extract_simpleul(tree, nodeclass, nodeid, fields): Simple UL lists extractor pattern

lazyscraper.scraper module

lazyscraper.scraper.extract_data_xpath(url, filename=None, xpath=None, fieldnames=None, absolutize=False, post=None, pagekey=None, pagerange=None)

Extracts data with an XPath expression

Parameters:
- url (str|unicode) – HTML webpage url
- xpath (str|unicode) – xpath expression
- fieldnames (str|unicode) – string with a comma-separated list of fields, like 'src,alt,href,_text'
- absolutize – Absolutize all urls returned as href and other url-like fields
- post (bool) – If True, use POST for HTTP requests
- pagekey (str|unicode) – Key of the page listing. GET or POST parameter
- pagerange (str|unicode) – Range of pages to process. String with format 'min,max,step', example: '1,72,1'
Returns:	Array of extracted values
Return type:	`array`
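The 'min,max,step' format of pagerange, combined with pagekey, can be illustrated with a short standalone sketch. The helpers `expand_pagerange` and `page_urls` are hypothetical and not part of lazyscraper; treating max as inclusive and passing the page number as a GET query parameter are assumptions.

```python
from urllib.parse import urlencode


def expand_pagerange(pagerange):
    """Hypothetical helper: turn 'min,max,step' into a list of page numbers."""
    start, stop, step = (int(part) for part in pagerange.split(","))
    if step <= 0:
        raise ValueError("step must be a positive integer")
    # Assumption: 'max' is inclusive, so '1,72,1' covers pages 1 through 72.
    return list(range(start, stop + 1, step))


def page_urls(url, pagekey, pagerange):
    """Hypothetical helper: one GET url per page, using pagekey as the parameter."""
    separator = "&" if "?" in url else "?"
    return [
        url + separator + urlencode({pagekey: page})
        for page in expand_pagerange(pagerange)
    ]


pages = expand_pagerange("1,72,1")
urls = page_urls("https://example.com/list", "page", "1,5,2")
for u in urls:
    print(u)
```

With POST requests the same page numbers would presumably be sent in the request body under the pagekey name instead of the query string.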

lazyscraper.scraper.get_table(url, nodeid=None, nodeclass=None, pagekey=False, pagerange=False, agent=None)

Extracts a table with data from HTML

Parameters:
- url – HTML webpage url
- nodeid (str|unicode) – id key for nodes
- nodeclass (str|unicode) – class key for nodes
- pagekey (str|unicode) – Key of the page listing. GET or POST parameter
- pagerange (str|unicode) – Range of pages to process. String with format 'min,max,step', example: '1,72,1'
Returns:	Array of extracted values
Return type:	`array`
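As a rough illustration of selecting a table by its id and extracting its rows, the sketch below uses the standard library's `html.parser`. It does not reproduce how `get_table` works internally: the class `TableRows`, the use of the first row as a header, and the dict-per-row output are assumptions, and nested tables are not handled separately.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Hypothetical helper: collect cell text per <tr> of the table with a given id."""

    def __init__(self, nodeid):
        super().__init__()
        self.nodeid = nodeid
        self.depth = 0    # greater than 0 while inside the target table
        self.rows = []    # one list of cell strings per row
        self.cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.depth or dict(attrs).get("id") == self.nodeid:
                self.depth += 1
        elif self.depth and tag == "tr":
            self.rows.append([])
        elif self.depth and tag in ("td", "th") and self.rows:
            self.cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif self.depth and tag in ("td", "th") and self.cell is not None:
            self.rows[-1].append("".join(self.cell).strip())
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


page = """
<table id="other"><tr><td>skip</td></tr></table>
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td> 1.20 </td></tr>
  <tr><td>Pear</td><td>0.95</td></tr>
</table>
"""

parser = TableRows("prices")
parser.feed(page)
header, *body = parser.rows
# Assumption: the first row holds the column names.
records = [dict(zip(header, row)) for row in body]
print(records)
```

Selecting by class instead of id would follow the same pattern, comparing the `class` attribute in `handle_starttag`.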

lazyscraper.scraper.use_pattern(url, pattern, nodeid=None, nodeclass=None, fieldnames=None, absolutize=False, pagekey=False, pagerange=False)

Uses predefined pattern to extract page data :param url: