The crawler is found in crawler/crawler_2.0.
First, install selenium with
pip install selenium
At this point the driver should run correctly, assuming you are using Chrome. If you use another browser, you must change the Selenium driver setup, which looks like this:
driver = webdriver.Chrome()
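As a rough sketch of switching browsers, the setup line could be wrapped in a small helper. The names `make_driver` and `SUPPORTED_BROWSERS` are hypothetical and not part of the crawler; `webdriver.Chrome`, `webdriver.Firefox`, and `webdriver.Edge` are the standard Selenium classes. The chosen browser has to be installed on the machine.

```python
# Hypothetical helper, not part of the crawler.
# Maps a browser name to the matching class name in selenium.webdriver.
SUPPORTED_BROWSERS = {"chrome": "Chrome", "firefox": "Firefox", "edge": "Edge"}


def make_driver(browser="chrome"):
    """Return a Selenium WebDriver for the named browser."""
    name = browser.strip().lower()
    if name not in SUPPORTED_BROWSERS:
        raise ValueError(f"unsupported browser: {browser!r}")
    # Imported here so the name check above runs before Selenium is needed.
    from selenium import webdriver
    return getattr(webdriver, SUPPORTED_BROWSERS[name])()


# Example: replace the Chrome setup line with
# driver = make_driver("firefox")
```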
Amazon constantly changes the names and paths of the page elements we crawl. If the crawler throws an error, find the element reference that failed in the code, locate that element in the browser's inspection pane, and update the reference. You may need to do this several times. In some cases, searching the page for part of the old reference will turn up the new one.
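One way to soften these breakages is to try several candidate locators in order and report which one matched. This is only a sketch: `find_first` is a hypothetical helper and the selectors in the usage comment are made up. It relies on Selenium's `find_elements(by, value)`, which returns an empty list when nothing matches, and on the fact that locator strategies such as `By.CSS_SELECTOR` and `By.XPATH` are the plain strings `"css selector"` and `"xpath"`.

```python
def find_first(driver, candidates):
    """Return (element, locator) for the first candidate locator that matches.

    candidates is a list of (by, value) pairs, newest reference first.
    Raises LookupError if none of them match, so the failure names every
    reference that was tried.
    """
    for by, value in candidates:
        matches = driver.find_elements(by, value)  # [] when nothing matches
        if matches:
            return matches[0], (by, value)
    raise LookupError(f"no candidate locator matched: {candidates}")


# Usage with made-up selectors:
# element, used = find_first(driver, [
#     ("css selector", "h1.skill-title"),   # current reference (hypothetical)
#     ("xpath", "//h1"),                    # looser fallback (hypothetical)
# ])
```

Logging the `used` locator shows when the crawler has fallen back to an older or looser reference, which is a hint that the primary one needs updating.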
To start the crawler, run main.py. The crawler is not designed to run in parallel, although doing so is possible. Data from earlier runs already exists, so check that you are not overwriting anything.
In some cases, the crawler may hit a CAPTCHA. Solve it by hand when it appears.
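A minimal sketch of guarding against overwrites, assuming output is written to ordinary files: pick a fresh file name whenever the intended one already exists. The helper `safe_output_path` and the file name in the usage comment are hypothetical.

```python
from pathlib import Path


def safe_output_path(path):
    """Return path if it is unused, otherwise path with _1, _2, ... appended."""
    path = Path(path)
    if not path.exists():
        return path
    counter = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1


# Usage (hypothetical file name):
# with open(safe_output_path("skills.json"), "w") as handle:
#     handle.write(data)
```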
Assorted tools to get data about the crawler files and generate invocations. Each is a self-contained Python file. Make sure the input and output paths point where you want them.
You will need to install the Python OpenAI library. Selenium should already be installed, but if it is not, install it too.
pip install openai
You will need an OpenAI account. Set up a project and get an API key.
The model is currently set to gpt-4.0. If you want a cheaper option, choose gpt-4.0-mini. It is rate limited and cannot handle parallel instances.
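Because requests are rate limited, a retry with exponential backoff can keep a single instance going when a request is rejected. The sketch below is generic and does not call the OpenAI library itself; `call_with_backoff` and its parameters are hypothetical. With the 1.x `openai` package, passing `retry_on=(openai.RateLimitError,)` would restrict retries to rate-limit errors.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays on failure.

    The delay before retry k (0-based) is base_delay * 2**k plus random
    jitter of up to base_delay seconds. The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


# Usage (request_invocations is a hypothetical function that makes the API call):
# reply = call_with_backoff(lambda: request_invocations(skill))
```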
You will need a developer account, with the username, password, and URL of the developer portal set. You may have to start a skill to access the dev portal.
Make sure the invocations are in the same directory and that output files will not collide.
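One option for supplying these three values without writing them into the script is to read them from environment variables. This is an assumption about setup, not how the tool currently reads them; the variable names and `load_portal_settings` are hypothetical.

```python
import os

# Hypothetical variable names; adjust to whatever the script expects.
REQUIRED_SETTINGS = ("DEV_PORTAL_USERNAME", "DEV_PORTAL_PASSWORD", "DEV_PORTAL_URL")


def load_portal_settings(env=None):
    """Return the three portal settings, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SETTINGS if not env.get(name)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Failing before the browser opens makes a missing value obvious instead of surfacing later as a login error.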
Run
python3 main.py
If you want to run in parallel, you can specify a category and then a starting letter as input arguments. This keeps multiple instances from working on the same skills, which would cause race conditions or wasted work. All three patterns are supported:
python3 main.py
python3 main.py CategoryName
python3 main.py CategoryName A
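The three invocation patterns could be handled along these lines. This is a sketch, not the actual main.py: `parse_filters` and `should_process` are hypothetical names, and it assumes the letter is compared case-insensitively against the first character of the skill name.

```python
import sys


def parse_filters(argv):
    """Turn [script, category?, letter?] into (category, letter).

    Arguments that were not given come back as None, meaning "no filter".
    """
    args = argv[1:]
    if len(args) > 2:
        raise SystemExit("usage: main.py [CategoryName [StartingLetter]]")
    category = args[0] if len(args) >= 1 else None
    letter = args[1].upper() if len(args) == 2 else None
    return category, letter


def should_process(skill_category, skill_name, category, letter):
    """Return True if this instance is responsible for the given skill."""
    if category is not None and skill_category != category:
        return False
    if letter is not None and not skill_name.upper().startswith(letter):
        return False
    return True


if __name__ == "__main__":
    category, letter = parse_filters(sys.argv)
    print(f"category filter: {category}, letter filter: {letter}")
```

Giving each parallel instance a different category, or a different letter within one category, means no two instances select the same skill.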
If too many parallel instances run at once, the chatbot may crash. Also, some skills freeze indefinitely (about five in our run). Kill the Chrome windows and delete these skills' invocations; otherwise they may freeze Chrome and possibly overheat your machine.
There are three folders of analysis tools: one for resource analysis, one for content analysis, and one for performance analysis. Each contains Python analysis files. All are self-contained, although you may need to install boto3.
Also included is sanitized result data.