RFetcher

RFetcher is a powerful Python CLI tool for scraping and categorizing Reddit content with intelligent filtering. Designed for researchers, data scientists, and content analysts, it provides structured access to Reddit discussions while filtering out noise and irrelevant content.

- Smart Content Filtering 🧠: Automatically skip Reddit-specific references (subreddit links, meta-discussions)
- Custom Category Management 🗂️: Define keyword-based categories and filter content dynamically
- Multi-Mode Scraping ⚙️: Supports hot/new/top/rising posts with pagination
- Comment Processing 💬: Recursive comment scraping with nested replies
- Data Organization 📂: Automatic JSON output with timestamps to the data/ folder
- API-Friendly 🤝: Built-in rate limiting and error handling

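To make the "Smart Content Filtering" idea concrete, here is a minimal sketch of detecting Reddit-specific references in a piece of text. The pattern and the function names `has_reddit_reference` and `drop_reddit_references` are illustrative assumptions, not the rules or API that fetcher.py actually uses.

```python
import re

# Hypothetical pattern (an assumption, not taken from fetcher.py):
# matches subreddit links (r/name), user mentions (u/name),
# and URLs pointing at reddit.com.
REDDIT_REFERENCE = re.compile(
    r"(?:^|[\s(/])(?:r|u)/[A-Za-z0-9_-]{2,}"
    r"|https?://(?:\w+\.)?reddit\.com/\S*",
    re.IGNORECASE,
)


def has_reddit_reference(text: str) -> bool:
    """Return True if the text contains a Reddit-specific reference."""
    return REDDIT_REFERENCE.search(text) is not None


def drop_reddit_references(texts):
    """Keep only the texts that contain no Reddit-specific reference."""
    return [t for t in texts if not has_reddit_reference(t)]


if __name__ == "__main__":
    samples = ["See r/python for more", "Rust is faster than C"]
    print(drop_reddit_references(samples))
```

A regular expression keeps the check cheap enough to run on every post and comment; detecting meta-discussions would likely need additional rules.
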
- Python 3.9+
- Reddit API credentials
- Clone repository:
  git clone https://github.com/NouroGhoul/rfetcher.git
  cd rfetcher
- Install dependencies:
  pip install -r requirements.txt
- Get Reddit API credentials:
  - Go to Reddit App Preferences
  - Click "Create App" (select "script" type)
  - Note these values:
    - Client ID (under app name)
    - Client Secret (next to "secret")
    - Your Reddit username
    - Your Reddit password
- Create .env file:
  REDDIT_CLIENT_ID=your_client_id_here
  REDDIT_CLIENT_SECRET=your_client_secret_here
  REDDIT_USERNAME=your_reddit_username
  REDDIT_PASSWORD=your_reddit_password

python fetcher.py

- Configure categories (optional):
- Define keyword groups for content filtering
- Example:
Programming: python,java,rust
- Run fetcher:
==================================================
Reddit Fetcher - Configuration
==================================================
Enter subreddit URL or name: programming
Select post type [1-4]: 1
Number of posts: 50
Fetch Mode [1-3]: 1
- Select category:
Available Categories:
1. Programming
2. Technology
3. Web Development
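
The keyword-based categories above can be pictured as a simple whole-word match. The following is a minimal sketch, assuming categories are written in the `Name: kw1,kw2` form shown in the example; `parse_category` and `matches_category` are hypothetical helpers, and fetcher.py may match keywords differently.

```python
import re


def parse_category(line: str):
    """Split a 'Name: kw1,kw2' line into (name, [lowercased keywords])."""
    name, _, keywords = line.partition(":")
    return name.strip(), [k.strip().lower() for k in keywords.split(",") if k.strip()]


def matches_category(text: str, keywords) -> bool:
    """Return True if any keyword appears in the text as a whole word.

    Whole-word matching means 'java' does not match 'JavaScript'.
    Multi-word keywords are not handled in this sketch.
    """
    words = set(re.findall(r"[a-z0-9+#]+", text.lower()))
    return any(k in words for k in keywords)


if __name__ == "__main__":
    name, keywords = parse_category("Programming: python,java,rust")
    for title in ["Why Rust is great.", "JavaScript tips"]:
        print(name, title, matches_category(title, keywords))
```
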
- All data saved to data/ folder
- Filename format: data/{subreddit}_{category}_{timestamp}.json
- Example: data/programming_web_development_20230815_143022.json

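The filename format can be reproduced with a few lines of standard-library Python. This is a sketch under assumptions: `build_output_path` is a hypothetical helper, and the slug rule (lowercase, spaces replaced by underscores) is inferred from the example filename rather than taken from fetcher.py.

```python
from datetime import datetime
from pathlib import Path


def build_output_path(subreddit, category, when=None, data_dir="data"):
    """Build data/{subreddit}_{category}_{timestamp}.json as a Path."""
    when = when or datetime.now()
    # Assumed slug rule: lowercase the category and join words with "_".
    slug = "_".join(category.lower().split())
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(data_dir) / f"{subreddit}_{slug}_{stamp}.json"


if __name__ == "__main__":
    path = build_output_path(
        "programming", "Web Development", datetime(2023, 8, 15, 14, 30, 22)
    )
    print(path.as_posix())
```
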
{
"category": "Web Development",
"posts": [
{
"id": "t3_abc123",
"title": "React 18 performance improvements",
"author": "js_dev",
"selftext": "Discussion about new features...",
"score": 142,
"url": "https://reddit.com/...",
"created_utc": 1689264000,
"num_comments": 38,
"comments": [
{
"id": "t1_def456",
"author": "react_fan",
"body": "This update is game-changing!",
"score": 42,
"created_utc": 1689264120,
"replies": [...]
}
]
}
]
}

rfetcher/
├── data/ # Output directory (auto-created)
├── fetcher.py # Main application
├── .gitignore # Ignores sensitive data
├── requirements.txt
└── .env # For API credentials (EXAMPLE - create your own)
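
Files written to data/ store comments as a tree, with each comment carrying its own `replies` list. Below is a minimal sketch of walking that tree when analyzing a saved file; the helper names `iter_comments`, `count_comments`, and `load_posts` are illustrative and not part of fetcher.py.

```python
import json


def iter_comments(comments, depth=0):
    """Yield (depth, comment) for every comment, descending into replies."""
    for comment in comments:
        yield depth, comment
        yield from iter_comments(comment.get("replies", []), depth + 1)


def count_comments(post):
    """Count all comments of a post, including nested replies."""
    return sum(1 for _ in iter_comments(post.get("comments", [])))


def load_posts(path):
    """Read one output file and return its list of posts."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["posts"]


if __name__ == "__main__":
    # Small in-memory post with the same shape as the saved JSON.
    post = {
        "id": "t3_abc123",
        "comments": [
            {"id": "t1_a", "body": "top", "replies": [
                {"id": "t1_b", "body": "reply", "replies": []},
            ]},
            {"id": "t1_c", "body": "another top-level comment"},
        ],
    }
    for depth, comment in iter_comments(post["comments"]):
        print("  " * depth + comment["id"])
    print(count_comments(post))
```

A generator keeps memory use flat even for long threads, and the `depth` value makes it easy to indent or filter by nesting level.
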
We welcome contributions! Please follow these steps:
- Setup environment:
  python -m venv venv
  source venv/bin/activate
  pip install -r requirements.txt
- Development workflow:
  git checkout -b feature/your-feature
  # Make changes
  git commit -m 'Add new feature'
  git push origin feature/your-feature
- Testing:
  - Place tests in tests/ directory
  - Maintain consistent coding style
  - Include docstrings for new functions
  - Test edge cases and error handling

This project is licensed under the MIT License - see the LICENSE file for details.
Created by: https://github.com/NouroGhoul
For educational purposes only
Important Notes:
- Respect Reddit's API Rules
- Data is saved in data/ folder - ensure directory exists
- Never commit your .env file with credentials
- The tool includes rate limiting to comply with API guidelines