Set Up:
git clone https://github.com/dh-thesis/crawl.git
cd crawl/
virtualenv -p python3 env
source env/bin/activate
pip install -r requirements.txt
deactivate
- Make sure you have Gecko Driver available on your PATH
- Start crawling:
./main
The following scripts are used to crawl information about the current Max Planck Institutes (MPIs), their research domains (category) and research areas (tag) from the website of the Max Planck Society.
Requirements: Selenium, Firefox and Gecko Driver
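Before starting a crawl, it can help to confirm that the Gecko Driver binary is actually reachable. The following is a minimal sketch, assuming the executable is called geckodriver (its usual name); the helper find_driver is hypothetical and not part of this repository.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("geckodriver was not found on PATH")
    else:
        print("geckodriver found at", path)
```

If the check reports that the driver is missing, add the directory that contains the binary to PATH before running ./main.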
The following scripts can be used to map the crawled MPIs to their corresponding identifiers in MPG.PuRe and to find the associated contexts as well as the categories and thematic tags of the institutes. Important: the crawl described above must have been run first!
Use these scripts (except src/map_post.py) by running:
./map
This will create a mapping (mpi_ous.json) from institutes to identifiers at path ../base/data/mpis/map/. The mapping should be refined manually: check whether institutes that were not found do in fact have corresponding identifiers in MPG.PuRe. After that, you can run the post-mapping procedure:
source env/bin/activate
python -m src.map_post
Requirements: Pybman
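
The manual check of mpi_ous.json can be supported by a small helper that lists the entries still lacking an identifier. This is only a sketch under a loud assumption: it treats the file as a JSON object that maps institute names to identifiers, with an empty value where nothing was found. The real file may be structured differently, and the names unmapped and load_mapping as well as the example data are hypothetical.

```python
import json


def load_mapping(path):
    """Read a JSON file and return its content (assumed to be a dict)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def unmapped(mapping):
    """Return the sorted keys whose value is empty (None, "", [] or {})."""
    return sorted(key for key, value in mapping.items() if not value)


if __name__ == "__main__":
    # Made-up example data; not taken from the real mpi_ous.json.
    example = {"Institute A": "ou_1", "Institute B": "", "Institute C": None}
    for name in unmapped(example):
        print(name)
```

Each name printed this way is a candidate for a manual lookup in MPG.PuRe before the post-mapping step is run.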