lazyscraper package
Submodules
lazyscraper.consts module
lazyscraper.htmltools module
lazyscraper.patterns module
lazyscraper.patterns.pattern_extract_exturls(tree, nodeclass, nodeid, fields)
Pattern to extract external URLs.
lazyscraper.patterns.pattern_extract_forms(tree, nodeclass, nodeid, fields)
Extracts web forms from the page.
lazyscraper.scraper module
lazyscraper.scraper.extract_data_xpath(url, filename=None, xpath=None, fieldnames=None, absolutize=False, post=None, pagekey=None, pagerange=None)
Extract data with an XPath expression.
Parameters:
- url (str|unicode) – HTML webpage URL
- xpath (str|unicode) – XPath expression
- fieldnames (str|unicode) – comma-separated string with the list of fields, like “src,alt,href,_text”
- absolutize – Absolutize all URLs returned as href and other URL-like fields
- post (bool) – If True, use POST for HTTP requests
- pagekey (str|unicode) – Key of the page listing (a GET or POST parameter)
- pagerange (str|unicode) – Range of pages to process. String with format ‘min,max,step’, for example ‘1,72,1’
Returns: array of extracted values
Return type: array
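The ‘min,max,step’ format of pagerange can be illustrated with a short sketch. The helpers `expand_pagerange` and `page_urls` below are hypothetical and are not part of lazyscraper; the sketch assumes that max is inclusive and that pagekey is sent as a GET parameter, neither of which is confirmed by the documentation above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def expand_pagerange(pagerange):
    """Expand a 'min,max,step' string into a list of page numbers.

    Hypothetical helper, not lazyscraper's own code.
    Assumption: 'max' is inclusive.
    """
    parts = [int(p) for p in pagerange.split(",")]
    if len(parts) != 3:
        raise ValueError("expected 'min,max,step', got %r" % pagerange)
    start, stop, step = parts
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(start, stop + 1, step))


def page_urls(url, pagekey, pagerange):
    """Build one URL per page by appending pagekey=<n> to the query string."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    urls = []
    for page in expand_pagerange(pagerange):
        # Keep any existing query parameters and add the page number last.
        params = parse_qsl(query, keep_blank_values=True) + [(pagekey, str(page))]
        urls.append(urlunsplit((scheme, netloc, path, urlencode(params), fragment)))
    return urls
```

Under these assumptions, the documented example ‘1,72,1’ would cover pages 1 through 72, one request per page.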
lazyscraper.scraper.get_table(url, nodeid=None, nodeclass=None, pagekey=False, pagerange=False, agent=None)
Extracts a table with data from HTML.
Parameters:
- url – HTML webpage URL
- nodeid (str|unicode) – id key for nodes
- nodeclass (str|unicode) – class key for nodes
- pagekey (str|unicode) – Key of the page listing (a GET or POST parameter)
- pagerange (str|unicode) – Range of pages to process. String with format ‘min,max,step’, for example ‘1,72,1’
Returns: array of extracted values
Return type: array
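To make the role of nodeid and nodeclass concrete, here is a minimal sketch of selecting a table by its id or class and returning its rows as lists of cell text. It uses only the standard library and is an illustration of the idea, not lazyscraper's implementation; the names `TableRows` and `table_rows` are hypothetical, and the sketch assumes the matching table contains no nested tables.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect cell text from the first <table> matching an id and/or class.

    Hypothetical helper; assumes the matching table has no nested tables.
    """

    def __init__(self, nodeid=None, nodeclass=None):
        super().__init__()
        self.nodeid = nodeid
        self.nodeclass = nodeclass
        self.inside = False   # True while inside the matching table
        self.done = False     # True once the matching table has closed
        self.rows = []
        self._row = None
        self._cell = None

    def _matches(self, attrs):
        attrs = dict(attrs)
        if self.nodeid is not None and attrs.get("id") != self.nodeid:
            return False
        if self.nodeclass is not None:
            if self.nodeclass not in (attrs.get("class") or "").split():
                return False
        return True

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if not self.inside and not self.done and self._matches(attrs):
                self.inside = True
        elif self.inside and tag == "tr":
            self._row = []
        elif self.inside and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if not self.inside:
            return
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table":
            self.inside = False
            self.done = True


def table_rows(html, nodeid=None, nodeclass=None):
    """Return the rows of the first matching table as lists of strings."""
    parser = TableRows(nodeid=nodeid, nodeclass=nodeclass)
    parser.feed(html)
    parser.close()
    return parser.rows
```

With neither filter given, this sketch simply takes the first table on the page; whether get_table behaves the same way is not stated in the documentation above.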
lazyscraper.scraper.use_pattern(url, pattern, nodeid=None, nodeclass=None, fieldnames=None, absolutize=False, pagekey=False, pagerange=False)
Uses a predefined pattern to extract page data.
Parameters:
- url – HTML webpage URL
- nodeid (str|unicode) – id key for nodes
- nodeclass (str|unicode) – class key for nodes
- fieldnames (str|unicode) – comma-separated string with the list of fields, like “src,alt,href,_text”
- absolutize – Absolutize all URLs returned as href and other URL-like fields
- pagekey (str|unicode) – Key of the page listing (a GET or POST parameter)
- pagerange (str|unicode) – Range of pages to process. String with format ‘min,max,step’, for example ‘1,72,1’
Returns: array of extracted values
Return type: array
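The fieldnames and absolutize parameters can be pictured with the following sketch, which splits a field list like “src,alt,href,_text” and resolves URL-like fields against the page URL. The helper `select_fields`, the `URL_FIELDS` tuple, and the dictionary representation of a node are assumptions made for illustration; they are not taken from lazyscraper.

```python
from urllib.parse import urljoin

# Assumption: which fields count as "url-like" is not specified in the docs.
URL_FIELDS = ("href", "src")


def select_fields(node, fieldnames, base_url=None):
    """Pick the requested fields from a node and optionally absolutize URLs.

    Hypothetical helper. `node` is a plain dict mapping attribute names to
    values, with the node's text stored under the key '_text'.
    """
    record = {}
    for name in fieldnames.split(","):
        name = name.strip()
        value = node.get(name)
        # Resolve relative URLs against the page URL when one is given.
        if base_url and name in URL_FIELDS and value is not None:
            value = urljoin(base_url, value)
        record[name] = value
    return record
```

For example, with the page URL `http://example.com/docs/index.html`, a relative `img/logo.png` resolves to `http://example.com/docs/img/logo.png`, while a root-relative `/about` resolves to `http://example.com/about`.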
lazyscraper.urltools module
lazyscraper.urltools.get_cached_post(url, postdata, host=None, port=11211, agent='Mozilla/5.0 (Linux; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Mobile Safari/537.36')
Returns the data at url, requested with a POST request.