PySciQuery is a Python package for querying scientific databases like PubMed and PMC (PubMed Central). It provides a command-line interface for easy searching and downloading of scientific articles.
- Ensure you have Python 3.10 installed.
- Install the package from GitHub:
pip install git+https://github.com/Wyss/pysciquery.git
PySciQuery provides two main commands: search and download.
Search PubMed or PMC databases using queries from an input JSON file.
pysciquery search <database> <input_file>
- <database>: Either 'pubmed' or 'pmc'
- <input_file>: Path to the input JSON file containing search queries and parameters
Example:
pysciquery search pubmed input_file.json
The search is performed using the NCBI API. Every term in the query list is queried on its own, and then every term is queried again with each term in the modifier list appended (ex. "thermal proteome profiling" and "thermal proteome profiling xenopus"). "Strict" queries only return results that contain the exact phrase (ex. "thermal proteome profiling") across all fields. "Full" queries return results that contain every word of the query across all fields, not necessarily as an exact phrase (ex. thermal[All Fields] AND ("proteome"[MeSH Terms] OR "proteome"[All Fields]) AND profiling[All Fields]).
"All Fields" include:
- Title
- Abstract
- Author names
- Journal name
- MeSH (Medical Subject Headings) terms
- Substance names
- Publication types
- Personal name as subject
- Corporate author
- Secondary source ID
- Comment/correction relations
- Other terms field
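The query expansion described above can be sketched in a few lines of Python. This is an illustration only, not the package's actual code: the function name `expand_queries` is hypothetical, and joining a query and a modifier with a single space is an assumption based on the "thermal proteome profiling xenopus" example.

```python
from itertools import product


def expand_queries(query_list, modifier_list):
    """Return the base queries followed by every query + modifier pair.

    Hypothetical helper: mirrors the behaviour described in this README,
    not necessarily how PySciQuery builds its queries internally.
    """
    base = list(query_list)
    # Assumption: a combined query is the base term, a space, then the modifier.
    combined = [f"{q} {m}" for q, m in product(query_list, modifier_list)]
    return base + combined


queries = expand_queries(
    ["thermal proteome profiling", "ketamine"],
    ["xenopus", "zebrafish"],
)
for q in queries:
    print(q)
```

With N queries and M modifiers this produces N + N*M searches, which is worth keeping in mind when sizing the two lists.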
The input JSON file should contain the databases to query (DATABASE_LIST), the query type (NCBI_QUERY_TYPE, either "strict" or "full"), the query list (QUERY_LIST), the modifier list (MODIFIER_LIST), and your NCBI email (NCBI_EMAIL). The NCBI email is optional. Example structure:
{
"DATABASE_LIST": [
"pubmed", "pmc"
],
"NCBI_QUERY_TYPE": "full",
"NCBI_EMAIL": "my_email@domain.com",
"QUERY_LIST": [
"thermal proteome profiling",
"ketamine",
"dexmedetomidine",
"etomidate"
],
"MODIFIER_LIST": [
"xenopus",
"xenopus laevis",
"ketamine",
"dexmedetomidine",
"etomidate",
"zebrafish",
"danio rerio",
"human",
"homo sapiens",
"mouse",
"mus musculus",
"anesthetic"
]
}
The search command writes two Excel files to the current working directory:
- <database>_api_<query_type>_<timestamp>.xlsx: Contains detailed information about each article
- <database>_total_results_<query_type>_<timestamp>.xlsx: Contains a summary of the total number of results for each query
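Before running a search, the input JSON file can be sanity-checked with a short script. The sketch below is not part of PySciQuery: `validate_search_config` is a hypothetical helper, and treating every key except NCBI_EMAIL as required is an assumption drawn from the description of the input file in this README.

```python
import json

ALLOWED_DATABASES = {"pubmed", "pmc"}
ALLOWED_QUERY_TYPES = {"strict", "full"}
# Assumption: all keys except the optional NCBI_EMAIL must be present.
REQUIRED_KEYS = ("DATABASE_LIST", "NCBI_QUERY_TYPE", "QUERY_LIST", "MODIFIER_LIST")


def validate_search_config(config):
    """Return a list of problems found in a search input dict (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    unknown = set(config.get("DATABASE_LIST", [])) - ALLOWED_DATABASES
    if unknown:
        problems.append(f"unknown databases: {sorted(unknown)}")
    query_type = config.get("NCBI_QUERY_TYPE")
    if query_type is not None and query_type not in ALLOWED_QUERY_TYPES:
        problems.append(f"unknown query type: {query_type}")
    if "QUERY_LIST" in config and not config["QUERY_LIST"]:
        problems.append("QUERY_LIST is empty")
    return problems


config = json.loads("""
{
  "DATABASE_LIST": ["pubmed", "pmc"],
  "NCBI_QUERY_TYPE": "full",
  "QUERY_LIST": ["thermal proteome profiling"],
  "MODIFIER_LIST": ["xenopus"]
}
""")
print(validate_search_config(config))
```

To check a file on disk, load it with `json.load` and pass the resulting dict to the same function.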
Download PDF articles from PMC using a list of PMIDs from a JSON file. The script matches PMIDs to PMCIDs, downloads the full text of the articles, and saves them in a specified directory.
pysciquery download <id_file> --email <your_email> [--output-dir <directory>]
- <id_file>: Path to the JSON file containing article PMIDs
- --output-dir: (Optional) Directory to save downloaded files (default: ./downloads)
Example:
pysciquery download ids.json --output-dir ./pubmed_articles
- Use PMIDs (e.g., "39290210")
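A quick way to catch non-PMID entries (for example PMCIDs pasted in by mistake) before building the ID file is sketched below. The helpers `is_pmid` and `split_ids` are hypothetical and not part of PySciQuery, "PMC1234567" is a made-up placeholder, and requiring IDs to be strings of digits is an assumption based on the quoted example above.

```python
def is_pmid(value):
    """True if value looks like a PMID: a non-empty string of ASCII digits."""
    return isinstance(value, str) and value.isascii() and value.isdigit()


def split_ids(ids):
    """Separate PMID-like entries from everything else."""
    valid = [i for i in ids if is_pmid(i)]
    rejected = [i for i in ids if not is_pmid(i)]
    return valid, rejected


# "PMC1234567" is a placeholder PMCID; the bare integer is rejected because
# this sketch assumes IDs are given as strings.
valid, rejected = split_ids(["39290210", "39028932", "PMC1234567", 39290210])
print("valid:", valid)
print("rejected:", rejected)
```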
The JSON file with IDs should have the following structure:
{
"NCBI_EMAIL": "your.email@example.com",
"PMIDS": [
"39290210",
"39028932"
]
}
To set up the development environment:
- Clone the repository (if you haven't already)
- Install development dependencies:
pipenv install
- Activate the virtual environment:
pipenv shell