scrapper

Iterates paginated list pages and then scrapes each detailed page to extract values from it. The output is written as a CSV file.

Usage:

The executable is inside the ./bin folder.

You can either run it from the CLI:

scrapper -f <config filename>
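
For example, pointing it at the sample config shipped with the repo:

scrapper -f config_example.json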

OR

use it from your Node.js script:

var Scrapper = require('scrapper');
var config = {
  // ... your configuration, explained below
}; // The config object
(new Scrapper(config)).execute();

Explaining the config

An example is provided in config_example.json; a condensed sketch also follows the list below.

  • site - The domain you want to scrape from, e.g. example.com
  • list - Configuration for the list page.
    • url - The URL of the list page, e.g. http://www.example.com?pagetype=list&page=%page%
    • startIndex - The starting page id. This value replaces %page% in the URL above.
    • pageLimit - The ending page id. The code iterates pages from startIndex to pageLimit, substituting each id for %page%.
    • selectorForLink - The selector used to find the links to the detailed pages
  • browserDetails - Configuration to mimic a browser
    • userAgent - A valid user agent string
    • cookie - Add this if a cookie is required
  • throttleTime - Time in milliseconds. Throttles the page fetch speed.
  • listPageThrottleTime (optional) - Time in milliseconds. Throttles the fetch speed for list pages. If not present, throttleTime is used.
  • detailed - Configuration for the detailed page
    • scrapValues - The values to scrape from the detailed page. It is an array of scrap options (see below).
  • output - Configuration for the output
    • location - The file location where the output CSV is saved
    • bufferLength - The number of scraped values held in the buffer. Once the buffer is full, the program flushes it to the output CSV file.
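
To tie the fields together, here is a minimal sketch of a config as it might be built in a Node.js script. Every value is an illustrative placeholder, not taken from config_example.json:

var config = {
  site: "example.com",                             // placeholder domain
  list: {
    url: "http://www.example.com?pagetype=list&page=%page%",
    startIndex: 1,                                 // first page id substituted for %page%
    pageLimit: 10,                                 // last page id; pages 1..10 are fetched
    selectorForLink: ".item a"                     // hypothetical selector for detailed-page links
  },
  browserDetails: {
    userAgent: "Mozilla/5.0 (X11; Linux x86_64)",  // any valid user agent string
    cookie: "sessionid=abc123"                     // only if the site requires a cookie
  },
  throttleTime: 2000,                              // ms between detailed-page fetches
  listPageThrottleTime: 5000,                      // ms between list-page fetches (falls back to throttleTime)
  detailed: {
    scrapValues: [
      { selector: ".title", key: "title" }         // see "Scrap options" below
    ]
  },
  output: {
    location: "./output.csv",                      // where the CSV is written
    bufferLength: 50                               // rows buffered before flushing to the file
  }
};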

Scrap options

  • selector - The CSS selector where the content lies
  • key - The field name in the output
  • split - Configuration to split the selector's content
    • sep - The separator by which the content is split
    • idx - The index of the resulting array to pick as the value
  • func - For complex processing of the selector's content you can provide a function here. The function is passed one argument, the selector's content, and its return value is used as the value.
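
As an illustration of how these options combine, here is a hypothetical scrapValues array (the selectors, separators, and keys are made up, and it assumes func can be given as a plain JavaScript function when the config is built in a script):

scrapValues: [
  // simplest case: the text matched by the selector becomes the value
  { selector: "h1.title", key: "title" },

  // split: break the content on " | " and keep the piece at index 1
  { selector: ".meta", key: "author", split: { sep: " | ", idx: 1 } },

  // func: arbitrary post-processing of the selector's content
  {
    selector: ".price",
    key: "price",
    func: function (content) {
      // e.g. strip the currency symbol and surrounding whitespace
      return content.replace("$", "").trim();
    }
  }
]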
