A Chrome extension that automatically crawls and extracts posts from social media platforms including Twitter/X, LinkedIn, Facebook, Instagram, and Reddit.
- Multi-Platform Support: Crawl posts from Twitter/X, LinkedIn, Facebook, Instagram, Reddit, and TikTok
- Smart Post Extraction: Automatically extracts post content, author information, timestamps, engagement metrics, and media
- Intelligent Scrolling: Uses post-height-based scrolling to efficiently navigate through feeds
- Duplicate Detection: Prevents re-crawling of already saved posts
- Real-time Control Panel: Floating control panel with live status updates and post counting
- Data Export: Export collected data in JSON or CSV formats
- Filtering & Search: Filter posts by platform and company before export
- Persistent Storage: Uses IndexedDB for reliable local data storage
- Rate Limiting: Built-in protection against excessive API calls and duplicate processing
-
Clone the repository
git clone <your-repo-url> cd crawlers
-
Install in Chrome
- Open Chrome and navigate to
chrome://extensions/ - Enable "Developer mode" (toggle in top-right corner)
- Click "Load unpacked" and select the project folder
- The extension icon should appear in your toolbar
- Open Chrome and navigate to
-
Navigate to Target Page: Go to any supported social media platform (Twitter/X, LinkedIn, Facebook, Instagram, Reddit, or TikTok)
-
Activate Extension: Click the extension icon in your Chrome toolbar
-
Start Crawler: Click "Start Crawler" in the popup to activate the floating control panel
The floating control panel appears in the top-right corner of the webpage and includes:
- Platform Detection: Automatically detects the current social media platform
- Status Indicator: Shows current crawling status (Stopped/Running)
- Post Counter: Real-time count of crawled posts
- Currently Viewing: Shows the post currently being processed
- Start/Stop Controls: Toggle crawling on/off
- Export Options: Download data in JSON or CSV format
- Data Management: Clear all stored data
Automatic Mode:
- The crawler automatically scrolls through the feed
- Posts are processed one by one based on their height
- Each post is analyzed for content, author, metrics, and media
- Duplicate posts are automatically skipped (shows "Already Recorded" indicator)
- New posts are saved to local database
What Gets Extracted:
- Post Content: Full text content and titles
- Author Information: Name, handle, avatar, profile URL
- Engagement Metrics: Likes, shares, comments, views (platform-dependent)
- Timestamps: Post creation time
- Media Content: Images, videos, GIFs, embedded content
- Platform Metadata: Post type, tags, categories
- Company Information: Extracted from profile or URL
-
Apply Filters (optional):
- Select specific platform from dropdown
- Filter by company name
- Choose "All" for no filtering
-
Export Format:
- JSON: Complete data with nested objects and arrays
- CSV: Flattened data suitable for spreadsheet analysis
-
Download: Files are automatically named with date and post count
| Platform | Status | Features |
|---|---|---|
| Twitter/X | ✅ Full Support | Posts, replies, retweets, metrics, media |
| ✅ Full Support | Posts, articles, company updates, engagement | |
| ✅ Full Support | Posts, comments, reactions, media | |
| ✅ Full Support | Posts, stories, captions, hashtags | |
| ✅ Full Support | Posts, comments, subreddits, awards | |
| TikTok | ✅ Full Support | Videos, captions, transcripts, music, effects |
- Base Crawler: Abstract class with common crawling logic
- Platform Crawlers: Specialized extractors for each social media platform
- IndexedDB Storage: Client-side database for persistent data storage
- Smart Scrolling: Post-height-based navigation system
- Rate Limiting: Prevents duplicate processing and API overload
{
id: "unique_post_identifier",
platform: "twitter|linkedin|facebook|instagram|reddit|tiktok",
company: "extracted_company_name",
author: {
name: "Author Name",
handle: "@username",
avatar: "avatar_url",
profileUrl: "profile_url"
},
text: "post_content",
transcript: "video_transcript", // for TikTok videos
title: "post_title", // if applicable
timestamp: "ISO_datetime",
url: "post_url",
crawledAt: "ISO_datetime",
metrics: {
likes: 0,
shares: 0,
comments: 0,
views: 0, // platform dependent
collects: 0 // TikTok saves/bookmarks
},
media: {
images: [...],
videos: [...],
gifs: [...],
embeds: [...]
},
// Platform-specific fields...
}- Intelligent Viewport Detection: Only processes visible posts
- Memory Management: Automatic cleanup of processed post references
- Scroll Optimization: Precise scroll amounts based on actual post heights
- Background Processing: Non-blocking data extraction and storage
The crawler includes built-in rate limiting:
- Maximum 3 processing attempts per post within 15 seconds
- Automatic retry logic for failed extractions
- Memory cleanup for optimal performance
- Auto-scroll Interval: 1 second (configurable)
- Viewport Detection: 30% visibility threshold
- Scroll Amount: Calculated based on current post height
- Recovery Mode: Automatic detection and recovery from stuck states
Extension Not Working:
- Ensure you're on a supported platform
- Check that the extension is enabled in Chrome
- Refresh the page and try again
No Posts Being Detected:
- Platform may have updated their HTML structure
- Check browser console for error messages
- Try scrolling manually to load content
Control Panel Not Visible:
- Panel appears in top-right corner
- Check if it's behind other page elements
- Try clicking the extension icon again
Data Export Issues:
- Ensure you have crawled posts before exporting
- Check filter settings (try "All" options)
- Verify browser allows file downloads
Enable Chrome DevTools (F12) to see detailed logging:
[Platform] Post saved: ...- Successful post extraction[Platform] Rate limiting: ...- Duplicate processing prevention[Platform] Error: ...- Extraction or storage errors
- Local Storage Only: All data is stored locally in your browser
- No External Servers: No data is transmitted to external services
- HTTPS Only: Extension only works on secure connections
- Minimal Permissions: Only requests necessary browser permissions
This extension follows a modular architecture making it easy to add new platforms:
- Create new platform crawler extending
BaseCrawler - Implement required methods:
getSelectors(),extractPostData() - Add platform detection in
content.js - Update manifest permissions for new domains
This project is provided as-is for educational and research purposes.
For issues, questions, or feature requests, please check:
- Browser console for error messages
- Extension permissions and settings
- Platform compatibility
Note: This extension is designed for educational and research purposes. Please respect platform terms of service and rate limits when crawling data.