lazyscraper package
Submodules
lazyscraper.consts module
lazyscraper.htmltools module
lazyscraper.patterns module
lazyscraper.patterns.pattern_extract_exturls(tree, nodeclass, nodeid, fields)
Pattern that extracts external URLs.
lazyscraper.patterns.pattern_extract_forms(tree, nodeclass, nodeid, fields)
Pattern that extracts web forms from the page.
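
The page does not say how pattern_extract_exturls decides that a URL is external. As a rough illustration only, the sketch below treats a link as external when its host differs from the host of the page. It uses only the standard library and does not call lazyscraper; the function name external_urls and the sample URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def external_urls(page_url, hrefs):
    """Return absolute http(s) URLs whose host differs from page_url's host.

    Assumption: "external" means "served from a different host". The real
    pattern may use a different rule.
    """
    page_host = urlparse(page_url).netloc.lower()
    result = []
    for href in hrefs:
        # Resolve relative and protocol-relative links against the page URL.
        absolute = urljoin(page_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() != page_host:
            result.append(absolute)
    return result


links = ["/about", "https://example.org/x", "mailto:a@b.c", "//cdn.example.net/lib.js"]
found = external_urls("https://example.com/index.html", links)
print(found)
```

Relative links and mailto: links are dropped here because they either resolve to the same host or are not HTTP URLs.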
lazyscraper.scraper module
lazyscraper.scraper.extract_data_xpath(url, filename=None, xpath=None, fieldnames=None, absolutize=False, post=None, pagekey=None, pagerange=None)
Extracts data from a page using an XPath expression.
Parameters:
- url (str|unicode) – URL of the HTML page
- xpath (str|unicode) – XPath expression
- fieldnames (str|unicode) – comma-separated list of fields, for example “src,alt,href,_text”
- absolutize – if set, make the URLs returned in href and other URL-like fields absolute
- post (bool) – if True, use POST for HTTP requests
- pagekey (str|unicode) – key of the page listing, passed as a GET or POST parameter
- pagerange (str|unicode) – range of pages to process, as a string in the format ‘min,max,step’, for example ‘1,72,1’
Returns: array of extracted values
Return type: array
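
To make the fieldnames convention concrete, here is a minimal sketch of how a string such as “href,_text” could be applied to the nodes matched by an XPath expression. It is not lazyscraper's implementation: it uses the standard library's xml.etree.ElementTree, which only parses well-formed markup and supports a subset of XPath, and it assumes that _text means the text content of the node and that every other name is an attribute. The function extract_fields and the row layout (one dict per node) are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_fields(root, xpath, fieldnames):
    """Collect the requested fields from every node matched by xpath.

    Assumptions: "_text" is the node's text content, any other field name
    is read as an attribute, and a missing attribute yields None.
    """
    fields = [name.strip() for name in fieldnames.split(",") if name.strip()]
    rows = []
    for node in root.findall(xpath):
        row = {}
        for field in fields:
            if field == "_text":
                # Join the text of the node and all of its descendants.
                row[field] = "".join(node.itertext()).strip()
            else:
                row[field] = node.get(field)
        rows.append(row)
    return rows


doc = ET.fromstring(
    '<body><a href="/a" title="first">One</a>'
    '<p><a href="/b">Two <b>bold</b></a></p></body>'
)
rows = extract_fields(doc, ".//a", "href,_text")
print(rows)
```

With the real function, the same idea would be expressed by passing xpath and fieldnames as keyword arguments to extract_data_xpath.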
lazyscraper.scraper.get_table(url, nodeid=None, nodeclass=None, pagekey=False, pagerange=False, agent=None)
Extracts a table of data from an HTML page.
Parameters:
- url – URL of the HTML page
- nodeid (str|unicode) – id key for nodes
- nodeclass (str|unicode) – class key for nodes
- pagekey (str|unicode) – key of the page listing, passed as a GET or POST parameter
- pagerange (str|unicode) – range of pages to process, as a string in the format ‘min,max,step’, for example ‘1,72,1’
Returns: array of extracted values
Return type: array
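
The pagekey and pagerange parameters describe a paged listing. The sketch below shows one way a ‘min,max,step’ string could be expanded into a list of page URLs for GET requests. It is an illustration, not the library's code: the helper names parse_pagerange and page_urls are hypothetical, and treating max as inclusive is an assumption the documentation does not confirm.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def parse_pagerange(pagerange):
    """Turn a 'min,max,step' string into a range of page numbers.

    Assumption: max is inclusive, so '1,72,1' yields pages 1 through 72.
    """
    parts = [int(part) for part in pagerange.split(",")]
    if len(parts) != 3:
        raise ValueError("pagerange must have the form 'min,max,step'")
    start, stop, step = parts
    if step <= 0:
        raise ValueError("step must be positive")
    return range(start, stop + 1, step)


def page_urls(url, pagekey, pagerange):
    """Build one URL per page by setting pagekey in the query string."""
    parsed = urlparse(url)
    # Keep existing query parameters, but drop any old value of pagekey.
    base_query = [(k, v) for k, v in parse_qsl(parsed.query) if k != pagekey]
    urls = []
    for page in parse_pagerange(pagerange):
        query = urlencode(base_query + [(pagekey, str(page))])
        urls.append(urlunparse(parsed._replace(query=query)))
    return urls


urls = page_urls("https://example.com/list?sort=asc", "page", "1,3,1")
print(urls)
```

For POST requests the same page numbers would go into the request body rather than the query string.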
lazyscraper.scraper.use_pattern(url, pattern, nodeid=None, nodeclass=None, fieldnames=None, absolutize=False, pagekey=False, pagerange=False)
Uses a predefined pattern to extract data from a page.
Parameters:
- url – URL of the HTML page
- nodeid (str|unicode) – id key for nodes
- nodeclass (str|unicode) – class key for nodes
- fieldnames (str|unicode) – comma-separated list of fields, for example “src,alt,href,_text”
- absolutize – if set, make the URLs returned in href and other URL-like fields absolute
- pagekey (str|unicode) – key of the page listing, passed as a GET or POST parameter
- pagerange (str|unicode) – range of pages to process, as a string in the format ‘min,max,step’, for example ‘1,72,1’
Returns: array of extracted values
Return type: array
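
The absolutize option concerns relative URLs in extracted fields. As a hedged illustration of what resolving them against the page URL could look like, the sketch below uses urllib.parse.urljoin from the standard library. Which fields count as "URL-like" is not stated in this reference, so the URL_FIELDS tuple is an assumption, and absolutize_row is a hypothetical helper, not part of lazyscraper.

```python
from urllib.parse import urljoin

# Assumption: only these fields hold URLs. The library may use another list.
URL_FIELDS = ("href", "src")


def absolutize_row(base_url, row, url_fields=URL_FIELDS):
    """Return a copy of row with URL-like fields resolved against base_url."""
    result = dict(row)
    for field in url_fields:
        value = result.get(field)
        if value:
            result[field] = urljoin(base_url, value)
    return result


row = {"href": "../docs/page.html", "src": "img/logo.png", "_text": "Docs"}
absolute = absolutize_row("https://example.com/a/b/index.html", row)
print(absolute)
```

Fields outside url_fields, such as _text, are copied unchanged.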
lazyscraper.urltools module
lazyscraper.urltools.get_cached_post(url, postdata, host=None, port=11211, agent='Mozilla/5.0 (Linux; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Mobile Safari/537.36')
Returns the data fetched from url with a POST request.
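
The default port 11211 is the conventional memcached port, which suggests that responses are cached in a memcached server, although this page does not say so. The sketch below shows only the general idea of caching a POST response under a key derived from the URL and the form data. Everything in it is an assumption: postdata is taken to be a dict, a plain dict stands in for the cache server, and the names cache_key, cached_post and fake_fetch are hypothetical.

```python
import hashlib
from urllib.parse import urlencode


def cache_key(url, postdata):
    """Derive a short, stable key from the URL and the POST fields."""
    # Sort the fields so that dict ordering does not change the key.
    body = urlencode(sorted(postdata.items()))
    digest = hashlib.sha256((url + "\n" + body).encode("utf-8")).hexdigest()
    return "post:" + digest


def cached_post(url, postdata, cache, fetch):
    """Return the cached response if present, otherwise fetch and store it.

    cache is any dict-like object; fetch is a callable (url, postdata) -> data
    that performs the actual HTTP POST.
    """
    key = cache_key(url, postdata)
    if key in cache:
        return cache[key]
    data = fetch(url, postdata)
    cache[key] = data
    return data


calls = []


def fake_fetch(url, postdata):
    # Stand-in for a real HTTP request, so the sketch runs offline.
    calls.append((url, dict(postdata)))
    return "response for page %s" % postdata["page"]


cache = {}
first = cached_post("https://example.com/search", {"page": "1"}, cache, fake_fetch)
second = cached_post("https://example.com/search", {"page": "1"}, cache, fake_fetch)
print(first, len(calls))
```

The second call is served from the cache, so fake_fetch runs only once.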