Crawler Instructions

If you have any questions, please contact @Anfernee Chang

Product Database Schema

Validation

  • Please run your spider and make sure its output passes scraper/pipelines/validation.py before sending it.
  • Please make sure the spider doesn't raise any errors when run with 'scrapy crawl spider' before sending it.
  • Any spiders sent without these checks will result in penalties!
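
For example, a pre-submission check might look like the following. The spider name 'store' and the item-count cap are illustrative; CLOSESPIDER_ITEMCOUNT and LOG_FILE are standard Scrapy settings.

```shell
# Crawl a small sample and write the log to a file.
scrapy crawl store -s CLOSESPIDER_ITEMCOUNT=20 -s LOG_FILE=check.log

# Look for errors before sending the spider.
grep "ERROR" check.log && echo "fix these before sending!" || echo "no errors logged"
```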

Notes

  1. Please follow the PEP8 style guide.
  2. Please use 'parse_product' as the parsing method for a product, and pass no meta in if you can.
  3. Please add each node's XPath to the spider class variable 'xpaths' (a dict). We will use this information to check your spider.
  4. Please raise ValueError('XXX!') if the page has no data at the XPath for any required field.
  5. Please use 'copy.deepcopy' or a new ProductItem() to generate a fresh item for each product variation (colors, etc.).
  6. Since we use a duplicate filter to save crawled URLs, please use 'dont_filter' carefully.
  7. To complete the job, we only require the spiders/store.py file from you. Please send it by email.
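
Notes 2–5 can be sketched as follows. This is a stdlib-only illustration (no scrapy import, so it is not a runnable spider); the field names, the 'xpaths' entries, and the dict standing in for extracted response data are all hypothetical, not the project's real schema.

```python
import copy

class StoreSpider:
    """Hypothetical spider-style class illustrating notes 2-5."""
    name = "store"

    # Note 3: keep the XPath used for each field in a class-level dict.
    xpaths = {
        "title": "//h1[@class='product-title']/text()",   # required
        "price": "//span[@class='price']/text()",         # required
        "colors": "//ul[@class='colors']/li/text()",      # optional
    }

    def parse_product(self, extracted):
        """Note 2: a single parsing method per product page.

        `extracted` stands in for the values a real response.xpath()
        call would return; here it is just a dict of field -> value.
        """
        # Note 4: raise ValueError when a required field has no data.
        for field in ("title", "price"):
            if not extracted.get(field):
                raise ValueError("%s missing!" % field)

        base = {"title": extracted["title"], "price": extracted["price"]}
        # Note 5: deep-copy a fresh item for each product variation,
        # so the variations do not share mutable state.
        for color in extracted.get("colors") or [None]:
            item = copy.deepcopy(base)
            item["color"] = color
            yield item
```

Yielding a deep copy per variation is what keeps two colors of the same product from overwriting each other when the items pass through the pipelines.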

Running Your Test Crawlers

https://github.com/titanjer/scraper/wiki/Testing

Set local environment variables

Read about testing first. You don't want to use scrapyd_settings.py while developing the scraper on your local machine, and you probably want to use Scrapy's HTTP cache (HTTPCACHE_ENABLED).

To get different crawler behaviour in the dev and production environments, you can create a .env file with your local environment variables; it will be loaded automatically.

scrapyd-deploy won't deploy anything outside the project module, so the .env file won't be deployed.

So put the .env file in the folder where scrapy.cfg is located.

Example:

HOST=local
LOG_LEVEL='INFO'
HTTPCACHE_ENABLED=True
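
The project doesn't show how the .env file gets loaded; tools like python-dotenv are commonly used for this. Below is a minimal stdlib-only sketch of the idea, with a hypothetical load_env helper — the project's actual loader may differ.

```python
import os

def load_env(path=".env"):
    """Hypothetical sketch: read KEY=VALUE lines into os.environ.

    Existing environment variables are deliberately left untouched,
    so production settings always win over the local .env file.
    """
    try:
        with open(path) as fh:
            lines = fh.readlines()
    except IOError:
        return  # no .env on this machine (e.g. in production)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional quotes, as in LOG_LEVEL='INFO'.
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))
```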

Pylint & PEP8 checkers

You can check whether your code passes the pylint and pep8 checkers. Activate your virtual environment and run the ./ci.sh script.
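
The contents of ci.sh aren't shown here; a script of that kind typically just chains the two checkers, along the lines of this hypothetical sketch:

```shell
#!/bin/sh
# Hypothetical sketch of a ci.sh-style check; the real script may differ.
set -e
pylint scraper/   # error and style checks
pep8 scraper/     # PEP8 conformance (newer projects use pycodestyle)
```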

About

scrapy template
