scrapper

Iterates paginated list pages and then scrapes each detailed page to extract values from it. The output is written as a CSV file.

Usage:

The executable is inside the ./bin folder.

You can either run it from the CLI:

scrapper -f <config filename>
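
For example, pointing it at the sample config shipped with the repo:

scrapper -f config_example.json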

OR

use it from your Node.js script:

var Scrapper = require('scrapper');
var config = {
  // ... your configuration, explained below
}; // The config object
(new Scrapper(config)).execute();

Explaining the config

An example is provided in config_example.json; a condensed sketch also follows the list below.

  • site - The domain you want to scrape from, e.g. example.com
  • list - Configuration for the list page.
    • url - The URL of the list page, e.g. http://www.example.com?pagetype=list&page=%page%
    • startIndex - The starting page id. This value replaces %page% in the URL above.
    • pageLimit - The ending page id. The code iterates pages from startIndex to pageLimit, substituting each id for %page%.
    • selectorForLink - The selector used to find the links to the detailed pages
  • browserDetails - Configuration to mimic a browser
    • userAgent - A valid user agent string
    • cookie - Add this if a cookie is required
  • throttleTime - Time in milliseconds. Throttles the page fetch speed.
  • listPageThrottleTime (optional) - Time in milliseconds. Throttles the fetch speed for list pages. If not present, throttleTime is used.
  • detailed - Configuration for the detailed page
    • scrapValues - The values to scrape from the detailed page. It is an array of scrap options (see below).
  • output - Configuration for the output
    • location - The file location where the output CSV is saved
    • bufferLength - The number of scraped values held in the buffer. Once the buffer is full, the program flushes it to the output CSV file.
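
To tie the fields together, here is a minimal sketch of a config as it might be built in a Node.js script. Every value is an illustrative placeholder, not taken from config_example.json:

var config = {
  site: "example.com",                             // placeholder domain
  list: {
    url: "http://www.example.com?pagetype=list&page=%page%",
    startIndex: 1,                                 // first page id substituted for %page%
    pageLimit: 10,                                 // last page id; pages 1..10 are fetched
    selectorForLink: ".item a"                     // hypothetical selector for detailed-page links
  },
  browserDetails: {
    userAgent: "Mozilla/5.0 (X11; Linux x86_64)",  // any valid user agent string
    cookie: "sessionid=abc123"                     // only if the site requires a cookie
  },
  throttleTime: 2000,                              // ms between detailed-page fetches
  listPageThrottleTime: 5000,                      // ms between list-page fetches (falls back to throttleTime)
  detailed: {
    scrapValues: [
      { selector: ".title", key: "title" }         // see "Scrap options" below
    ]
  },
  output: {
    location: "./output.csv",                      // where the CSV is written
    bufferLength: 50                               // rows buffered before flushing to the file
  }
};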

Scrap options

  • selector - The CSS selector where the content lies
  • key - The field name in the output
  • split - Configuration to split the selector's content
    • sep - The separator by which the content is split
    • idx - The index of the resulting array to pick as the value
  • func - For complex processing of the selector's content you can provide a function here. The function is passed one argument, the selector's content, and its return value is used as the value.
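
As an illustration of how these options combine, here is a hypothetical scrapValues array (the selectors, separators, and keys are made up, and it assumes func can be given as a plain JavaScript function when the config is built in a script):

scrapValues: [
  // simplest case: the text matched by the selector becomes the value
  { selector: "h1.title", key: "title" },

  // split: break the content on " | " and keep the piece at index 1
  { selector: ".meta", key: "author", split: { sep: " | ", idx: 1 } },

  // func: arbitrary post-processing of the selector's content
  {
    selector: ".price",
    key: "price",
    func: function (content) {
      // e.g. strip the currency symbol and surrounding whitespace
      return content.replace("$", "").trim();
    }
  }
]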
