A robust, automated pipeline designed to download your daily newspaper, process it, and email it to you. This solution is built for reliability and portability, capable of running in various environments (local, cloud, CI/CD) with minimal dependencies.
Primary Goal: To provide a reliable, "set-and-forget" tool for self-hosters to automate the daily retrieval, archival, and delivery of digital newspaper editions (PDFs).
Target Users: Home lab enthusiasts, digital archivists, and subscribers who prefer offline or email-based reading workflows.
Key Features:
- Automated Delivery: Fetches the daily edition, generates a visual thumbnail, and emails it to your inbox.
- Flexible Storage: Supports stateless cloud storage (S3, Cloudflare R2) or persistent local storage.
- Security First: Includes an interactive wizard (configure.py) that encrypts sensitive credentials (SMTP passwords, API keys) at rest.
- Resilience: Features robust error handling, retries, and a fallback HTTP client for restricted network environments.
- Observability: Integrated health checks and rich CLI status reporting.
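The retry behaviour listed under Resilience is not specified further in this README. As a rough illustration only, retries with exponential backoff could look like the sketch below. The function name `retry`, its parameters, and the choice of `OSError` as the retryable error are assumptions made for this example, not the project's actual code.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `attempts` calls have failed.

    Hypothetical helper: the real project may retry differently.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test, because a test can record the delays instead of actually waiting.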
Non-Goals:
- This is not a general-purpose web scraper. It is designed for specific, predictable URL patterns.
- It does not bypass paywalls; users must provide valid URLs or authentication for their content.
The system operates as a linear pipeline orchestrated by main.py.
- Configuration: Centralized settings management (config.py) loads parameters from config.yaml and environment variables (.env).
- Download: The website module authenticates (if necessary) and downloads the newspaper edition for the target date.
- Storage: The storage module handles interactions with S3-compatible cloud storage or the local filesystem.
- Processing: The thumbnail module generates a preview image of the newspaper's front page (PDF only).
- Notification: The email_sender module constructs an HTML email with download links and the inline thumbnail, sending it via SMTP.
- Cleanup: Old files are automatically purged from storage based on retention policy.
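The linear flow above can be pictured as a list of steps run in order over a shared context. The sketch below is a simplified model, not the contents of main.py: `RunContext`, `run_pipeline`, the placeholder step functions, and the 7-day retention window are all hypothetical and stand in for the real website, storage, thumbnail, and email_sender modules.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable, Dict, List


@dataclass
class RunContext:
    """State passed from step to step (hypothetical structure)."""
    target_date: date
    artifacts: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


Step = Callable[[RunContext], None]


def run_pipeline(ctx: RunContext, steps: List[Step]) -> RunContext:
    """Run each step in order; an exception in any step stops the run."""
    for step in steps:
        step(ctx)
        ctx.log.append(step.__name__)
    return ctx


# Placeholder steps: each records what the real module would produce.
def download(ctx: RunContext) -> None:
    ctx.artifacts["pdf"] = f"{ctx.target_date.isoformat()}.pdf"


def store(ctx: RunContext) -> None:
    ctx.artifacts["url"] = "archive/" + ctx.artifacts["pdf"]


def make_thumbnail(ctx: RunContext) -> None:
    ctx.artifacts["thumbnail"] = ctx.artifacts["pdf"].replace(".pdf", ".jpg")


def notify(ctx: RunContext) -> None:
    ctx.artifacts["email_subject"] = f"Newspaper for {ctx.target_date.isoformat()}"


def cleanup(ctx: RunContext) -> None:
    # Assumed retention policy: purge anything older than 7 days.
    cutoff = ctx.target_date - timedelta(days=7)
    ctx.artifacts["purge_before"] = cutoff.isoformat()


if __name__ == "__main__":
    result = run_pipeline(
        RunContext(date(2023, 10, 27)),
        [download, store, make_thumbnail, notify, cleanup],
    )
    print(result.log)
```

Keeping each stage as a plain function over a shared context makes the order explicit and lets a single stage be tested in isolation.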
- Python 3.8+
- SMTP Credentials: Access to an SMTP server (e.g., Gmail, SendGrid, or self-hosted) for sending emails.
- (Optional) S3 Credentials: Access Key ID and Secret Key for S3-compatible storage (AWS S3, Cloudflare R2, MinIO). Local storage is also supported.
- Clone the repository:
    git clone https://github.com/yourusername/newspaper-emailer.git
    cd newspaper-emailer
- Create and activate a virtual environment:
    python3 -m venv .venv
    source .venv/bin/activate    # Linux/macOS
    # .venv\Scripts\activate     # Windows
- Install dependencies:
    pip install -r requirements.txt
You can configure the application using the interactive wizard (recommended) or manually.
Option A: Interactive Wizard (Recommended)
Run the setup script to generate config.yaml and a secure .env file with encrypted credentials.
    python configure.py
Option B: Manual Configuration
- config.yaml: Copy config.yaml to configure defaults.
    newspaper:
      url: "https://example.com"
      download_path_pattern: "newspaper/download/{date}"
    storage:
      endpoint_url: "https://<account>.r2.cloudflarestorage.com"
      bucket: "newspaper-archive"
    email:
      sender: "bot@example.com"
      recipients: ["user@example.com"]
      smtp_host: "smtp.example.com"
- .env: Copy .env.example to .env and fill in your secrets.
    cp .env.example .env
    nano .env
Note: The system supports both plain text and encrypted credentials (using the _ENC suffix) in .env.
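One way to picture the _ENC convention is a lookup that checks for an encrypted `NAME_ENC` entry and otherwise falls back to the plain `NAME` entry. The sketch below assumes that precedence and leaves the cipher pluggable, since this README does not say which encryption scheme configure.py uses. `resolve_secret`, `demo_decrypt`, and the `API_KEY` variable name are hypothetical.

```python
import base64
from typing import Callable, Mapping, Optional


def resolve_secret(
    env: Mapping[str, str],
    name: str,
    decrypt: Callable[[str], str],
) -> Optional[str]:
    """Return the secret for `name`, preferring the NAME_ENC entry.

    Assumption: an encrypted entry takes precedence over a plain one.
    """
    encrypted = env.get(name + "_ENC")
    if encrypted:
        return decrypt(encrypted)
    return env.get(name)


def demo_decrypt(token: str) -> str:
    # Stand-in for the example only: base64 is an encoding, NOT encryption.
    # A real setup would call whatever cipher configure.py actually uses.
    return base64.b64decode(token).decode("utf-8")


if __name__ == "__main__":
    env = {
        "EMAIL_SMTP_PASS_ENC": base64.b64encode(b"s3cret").decode("ascii"),
        "API_KEY": "plain-value",
    }
    print(resolve_secret(env, "EMAIL_SMTP_PASS", demo_decrypt))
    print(resolve_secret(env, "API_KEY", demo_decrypt))
```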
Key Environment Variables (.env):
- NEWSPAPER_URL: Base URL for the publication.
- STORAGE_TYPE: s3 (default) or local.
- EMAIL_SMTP_HOST / EMAIL_SMTP_PASS: SMTP server details.
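As a rough sketch of how these variables might be read and validated, the example below applies the documented `s3` default for STORAGE_TYPE and rejects unknown values. The `EnvSettings` class and `load_env_settings` function are hypothetical, and treating NEWSPAPER_URL and EMAIL_SMTP_HOST as required is an assumption made for the example. The project's own config.py may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvSettings:
    """Subset of settings taken from the environment (hypothetical)."""
    newspaper_url: str
    storage_type: str
    smtp_host: str
    smtp_pass: Optional[str]


def load_env_settings(env: Mapping[str, str] = os.environ) -> EnvSettings:
    # STORAGE_TYPE falls back to "s3", the documented default.
    storage_type = env.get("STORAGE_TYPE", "s3").strip().lower()
    if storage_type not in ("s3", "local"):
        raise ValueError(f"STORAGE_TYPE must be 's3' or 'local', got {storage_type!r}")

    # Assumption for this sketch: these two variables are mandatory.
    missing = [key for key in ("NEWSPAPER_URL", "EMAIL_SMTP_HOST") if not env.get(key)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))

    return EnvSettings(
        newspaper_url=env["NEWSPAPER_URL"],
        storage_type=storage_type,
        smtp_host=env["EMAIL_SMTP_HOST"],
        smtp_pass=env.get("EMAIL_SMTP_PASS"),
    )
```

Failing early with a message that names the missing variables is usually easier to debug than an SMTP or storage error later in the run.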
- Run Health Check: Verify your environment, dependencies, and configuration validity.
    python healthcheck.py
- Manual Dry-Run: Simulate a run without downloading or sending emails.
    export MAIN_PY_DRY_RUN=true
    python main.py
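The dry-run switch is an ordinary environment variable. A minimal sketch of how such a flag could be parsed and used to guard a side effect follows. `env_flag`, `deliver`, and the set of accepted truthy strings are assumptions for illustration and are not taken from main.py.

```python
import os
from typing import Callable, Mapping

# Assumed set of truthy spellings; the project may accept a different set.
TRUE_VALUES = {"1", "true", "yes", "on"}


def env_flag(name: str, env: Mapping[str, str] = os.environ, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


def deliver(send: Callable[[], None], dry_run: bool) -> str:
    """Call `send` unless this is a dry run; report what happened."""
    if dry_run:
        return "dry-run: skipped sending"
    send()
    return "sent"


if __name__ == "__main__":
    dry_run = env_flag("MAIN_PY_DRY_RUN")
    print(deliver(lambda: print("sending email..."), dry_run))
```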
To run the pipeline for today's date:
    python main.py
To target a specific date:
    export MAIN_PY_TARGET_DATE="2023-10-27"
    python main.py
Schedule main.py using cron (Linux) or Task Scheduler (Windows).
Example Cron Job (Daily at 6:00 AM):
    0 6 * * * /path/to/repo/.venv/bin/python /path/to/repo/main.py >> /path/to/repo/logs/cron.log 2>&1
- Running Tests: python run_tests.py performs static analysis and structure checks.
- Health Check: python healthcheck.py runs a full diagnostic suite.
- Docstrings: All code is documented using Google Style Python Docstrings.
MIT