YayPDF is a powerful Python CLI tool designed to crawl websites and download all discovered PDF files. It supports recursive crawling, multi-threaded downloading, and custom HTTP headers for authenticated sessions.
- Recursive Crawling: Crawl through pages to find PDFs at a specified depth (see the sketch after this list).
- Concurrency: Fast multi-threaded downloading.
- Smart Filtering: Options to restrict downloads to the same domain and verify content types.
- Custom Headers: Support for Cookies and Authorization headers for protected content.
- Polite Crawling: Configurable delay between requests to respect server limits.
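
Under the hood the tool builds on `requests` and `beautifulsoup4`, which are installed in the next step. The snippet below is only an illustrative sketch of the core crawling idea, finding PDF links on a single page; it is not YayPDF's actual implementation, and the `find_pdf_links` helper is hypothetical:

```python
# Illustrative sketch only: fetch a page, collect links that look like PDFs.
# The real app.py may be structured differently.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def find_pdf_links(page_url, same_domain=False):
    """Return absolute URLs of links on page_url that point to .pdf files."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(page_url, a["href"])          # resolve relative links
        if not url.lower().split("?")[0].endswith(".pdf"):
            continue
        if same_domain and urlparse(url).netloc != urlparse(page_url).netloc:
            continue                                # skip off-domain links
        links.add(url)
    return sorted(links)


if __name__ == "__main__":
    for pdf_url in find_pdf_links("https://example.com/resources"):
        print(pdf_url)
```

Recursive crawling (`--depth` greater than 0) repeats this kind of discovery on the HTML pages linked from the starting page, and the `same_domain` filter in the sketch mirrors what the `--same-domain` option restricts.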
Ensure you have Python 3 installed. Then install the required dependencies:
```bash
pip install requests beautifulsoup4
```

Basic usage:

```bash
python app.py "https://example.com/resources"
```

| Option | Description | Default |
|---|---|---|
| `url` | Starting page URL | (Required) |
| `--out` | Output directory for PDFs | `downloaded_pdfs` |
| `--depth` | Crawl depth (0 = only the starting page) | `0` |
| `--same-domain` | Restrict crawling/downloading to the starting domain | `False` |
| `--concurrency` | Number of parallel downloads (see the sketch below) | `6` |
| `--delay` | Delay between page fetches (seconds) | `0.0` |
| `--header` | Add custom HTTP header (repeatable) | `[]` |
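
The `--concurrency` option caps how many PDFs are downloaded at once (six by default). The sketch below is only an illustration of that idea using a thread pool; it is not necessarily how `app.py` implements it, and the `download_pdf` helper and URLs are hypothetical:

```python
# Hypothetical sketch of what --concurrency controls: downloading several PDFs
# in parallel with a thread pool. app.py's actual implementation may differ.
import os
from concurrent.futures import ThreadPoolExecutor

import requests


def download_pdf(url, out_dir="downloaded_pdfs"):
    """Stream one PDF to disk and return its local path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, url.rstrip("/").split("/")[-1] or "file.pdf")
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=65536):
                fh.write(chunk)
    return path


pdf_urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]  # placeholder URLs
with ThreadPoolExecutor(max_workers=6) as pool:  # 6 mirrors the --concurrency default
    for saved in pool.map(download_pdf, pdf_urls):
        print("saved", saved)
```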
Download all PDFs from a single page into the pdfs folder:
python app.py "https://example.com/books" --out pdfsCrawl the starting page and links found on it (depth 1), downloading only from the same domain:
python app.py "https://university.edu/papers" --depth 1 --same-domainDownload from a site requiring login cookies, with a 1-second delay between requests to be polite:
python app.py "https://site.com/protected" \
--delay 1.0 \
--header "Cookie: session_id=12345" \
--header "User-Agent: MyCustomCrawler"This project is licensed under the terms of the included LICENSE file.