YayPDF is a powerful Python CLI tool designed to crawl websites and download all discovered PDF files. It supports recursive crawling, multi-threaded downloading, and custom HTTP headers for authenticated sessions.
- Recursive Crawling: Crawl through pages to find PDFs at a specified depth (see the sketch after this list).
- Concurrency: Fast multi-threaded downloading.
- Smart Filtering: Options to restrict downloads to the same domain and verify content types.
- Custom Headers: Support for Cookies and Authorization headers for protected content.
- Polite Crawling: Configurable delay between requests to respect server limits.
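
Under the hood the tool builds on `requests` and `beautifulsoup4`, which are installed in the next step. The snippet below is only an illustrative sketch of the core crawling idea, finding PDF links on a single page; it is not YayPDF's actual implementation, and the `find_pdf_links` helper is hypothetical:

```python
# Illustrative sketch only: fetch a page, collect links that look like PDFs.
# The real app.py may be structured differently.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def find_pdf_links(page_url, same_domain=False):
    """Return absolute URLs of links on page_url that point to .pdf files."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(page_url, a["href"])          # resolve relative links
        if not url.lower().split("?")[0].endswith(".pdf"):
            continue
        if same_domain and urlparse(url).netloc != urlparse(page_url).netloc:
            continue                                # skip off-domain links
        links.add(url)
    return sorted(links)


if __name__ == "__main__":
    for pdf_url in find_pdf_links("https://example.com/resources"):
        print(pdf_url)
```

Recursive crawling (`--depth` greater than 0) repeats this kind of discovery on the HTML pages linked from the starting page, and the `same_domain` filter in the sketch mirrors what the `--same-domain` option restricts.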
Ensure you have Python 3 installed. Then install the required dependencies:
```bash
pip install requests beautifulsoup4
```

Basic usage:

```bash
python app.py "https://example.com/resources"
```

| Option | Description | Default |
|---|---|---|
| `url` | Starting page URL | (Required) |
| `--out` | Output directory for PDFs | `downloaded_pdfs` |
| `--depth` | Crawl depth (0 = only the starting page) | `0` |
| `--same-domain` | Restrict crawling/downloading to the starting domain | `False` |
| `--concurrency` | Number of parallel downloads (see the sketch below) | `6` |
| `--delay` | Delay between page fetches (seconds) | `0.0` |
| `--header` | Add custom HTTP header (repeatable) | `[]` |
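
The `--concurrency` option caps how many PDFs are downloaded at once (six by default). The sketch below is only an illustration of that idea using a thread pool; it is not necessarily how `app.py` implements it, and the `download_pdf` helper and URLs are hypothetical:

```python
# Hypothetical sketch of what --concurrency controls: downloading several PDFs
# in parallel with a thread pool. app.py's actual implementation may differ.
import os
from concurrent.futures import ThreadPoolExecutor

import requests


def download_pdf(url, out_dir="downloaded_pdfs"):
    """Stream one PDF to disk and return its local path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, url.rstrip("/").split("/")[-1] or "file.pdf")
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=65536):
                fh.write(chunk)
    return path


pdf_urls = ["https://example.com/a.pdf", "https://example.com/b.pdf"]  # placeholder URLs
with ThreadPoolExecutor(max_workers=6) as pool:  # 6 mirrors the --concurrency default
    for saved in pool.map(download_pdf, pdf_urls):
        print("saved", saved)
```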
Download all PDFs from a single page into the pdfs folder:
python app.py "https://example.com/books" --out pdfsCrawl the starting page and links found on it (depth 1), downloading only from the same domain:
python app.py "https://university.edu/papers" --depth 1 --same-domainDownload from a site requiring login cookies, with a 1-second delay between requests to be polite:
python app.py "https://site.com/protected" \
--delay 1.0 \
--header "Cookie: session_id=12345" \
--header "User-Agent: MyCustomCrawler"This project is licensed under the terms of the included LICENSE file.