dh-thesis/crawl

Crawl Metadata of Max Planck Institutes

Set Up:

git clone https://github.com/dh-thesis/crawl.git
cd crawl/
virtualenv -p python3 env
source env/bin/activate
pip install -r requirements.txt
deactivate
  • Make sure Gecko Driver (geckodriver) is available on your PATH.
  • Start crawling:
./main
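
Before starting a crawl, it can help to confirm that the driver is actually reachable. The following is a minimal sketch, not part of the repository: the function name driver_status is hypothetical, and it assumes the binary is called geckodriver, as it usually is.

```python
import shutil


def driver_status(name: str = "geckodriver") -> str:
    """Report whether an executable called `name` can be found on PATH.

    `driver_status` is a hypothetical helper for illustration; it is not
    part of this repository.
    """
    path = shutil.which(name)  # returns None when nothing matches on PATH
    if path is None:
        return f"{name} not found on PATH"
    return f"{name} found at {path}"


print(driver_status())
```

If the message reports that the driver was not found, install Gecko Driver or extend PATH before running ./main.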

Scripts

The following scripts crawl information about the current Max Planck Institutes (MPIs), their research domains (category) and research areas (tag) from the website of the Max Planck Society.

Requirements: Selenium, Firefox and Gecko Driver

Mapping of MPIs to MPG.PuRe Entities

The following scripts can be used to map the crawled MPIs to their corresponding identifiers in MPG.PuRe and to find the associated contexts as well as the categories and thematic tags of the institutes. Important: run the crawl described above first, because the mapping relies on the retrieved data.

Use these scripts (except src/map_post.py) by running:

./map

This creates a mapping (mpi_ous.json) from institutes to identifiers at path ../base/data/mpis/map/. The mapping should be refined manually: check whether institutes that were not found do in fact have corresponding identifiers in MPG.PuRe. Once this is done, run the post-mapping procedure:

source env/bin/activate
python -m src.map_post

Requirements: Pybman
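
To support the manual refinement step, the institutes that were not matched can be listed programmatically. The sketch below is an illustration under assumptions: the layout of mpi_ous.json is not documented here, so it assumes a JSON object that maps institute names to identifiers, with an empty or null value where nothing was found. The function names and the sample identifiers are hypothetical.

```python
import json


def load_mapping(path: str) -> dict:
    """Load a mapping file such as mpi_ous.json (assumed: one JSON object)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def unmapped_institutes(mapping: dict) -> list:
    """Return the sorted names whose identifier is missing or empty.

    Assumption: unmatched institutes carry an empty string or null value.
    """
    return sorted(name for name, identifier in mapping.items() if not identifier)


# Made-up sample data; real names and identifiers will differ.
example = {"Institute A": "ou_0001", "Institute B": "", "Institute C": None}
print(unmapped_institutes(example))  # ['Institute B', 'Institute C']
```

With the real file, the call would be unmapped_institutes(load_mapping("../base/data/mpis/map/mpi_ous.json")), adjusted if the actual structure differs from the assumed one.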
