Social Media Posts Crawler

A Chrome extension that automatically crawls and extracts posts from social media platforms including Twitter/X, LinkedIn, Facebook, Instagram, and Reddit.

✨ Features

Multi-Platform Support: Crawl posts from Twitter/X, LinkedIn, Facebook, Instagram, Reddit, and TikTok
Smart Post Extraction: Automatically extracts post content, author information, timestamps, engagement metrics, and media
Intelligent Scrolling: Uses post-height-based scrolling to efficiently navigate through feeds
Duplicate Detection: Prevents re-crawling of already saved posts
Real-time Control Panel: Floating control panel with live status updates and post counting
Data Export: Export collected data in JSON or CSV formats
Filtering & Search: Filter posts by platform and company before export
Persistent Storage: Uses IndexedDB for reliable local data storage
Rate Limiting: Built-in protection against excessive API calls and duplicate processing

🚀 Installation

From Source

Clone the repository
```
git clone <your-repo-url>
cd crawlers
```
Install in Chrome
- Open Chrome and navigate to chrome://extensions/
- Enable "Developer mode" (toggle in top-right corner)
- Click "Load unpacked" and select the project folder
- The extension icon should appear in your toolbar

📖 How to Use

1. Basic Setup

Navigate to Target Page: Go to any supported social media platform (Twitter/X, LinkedIn, Facebook, Instagram, Reddit, or TikTok)
Activate Extension: Click the extension icon in your Chrome toolbar
Start Crawler: Click "Start Crawler" in the popup to activate the floating control panel

2. Using the Control Panel

The floating control panel appears in the top-right corner of the webpage and includes:

Platform Detection: Automatically detects the current social media platform
Status Indicator: Shows current crawling status (Stopped/Running)
Post Counter: Real-time count of crawled posts
Currently Viewing: Shows the post currently being processed
Start/Stop Controls: Toggle crawling on/off
Export Options: Download data in JSON or CSV format
Data Management: Clear all stored data

3. Crawling Process

Automatic Mode:

The crawler automatically scrolls through the feed
Posts are processed one by one based on their height
Each post is analyzed for content, author, metrics, and media
Duplicate posts are automatically skipped (shows "Already Recorded" indicator)
New posts are saved to local database

What Gets Extracted:

Post Content: Full text content and titles
Author Information: Name, handle, avatar, profile URL
Engagement Metrics: Likes, shares, comments, views (platform-dependent)
Timestamps: Post creation time
Media Content: Images, videos, GIFs, embedded content
Platform Metadata: Post type, tags, categories
Company Information: Extracted from profile or URL

4. Data Export

Apply Filters (optional):
- Select specific platform from dropdown
- Filter by company name
- Choose "All" for no filtering
Export Format:
- JSON: Complete data with nested objects and arrays
- CSV: Flattened data suitable for spreadsheet analysis
Download: Files are automatically named with date and post count

🎯 Supported Platforms

Platform	Status	Features
Twitter/X	✅ Full Support	Posts, replies, retweets, metrics, media
LinkedIn	✅ Full Support	Posts, articles, company updates, engagement
Facebook	✅ Full Support	Posts, comments, reactions, media
Instagram	✅ Full Support	Posts, stories, captions, hashtags
Reddit	✅ Full Support	Posts, comments, subreddits, awards
TikTok	✅ Full Support	Videos, captions, transcripts, music, effects

🔧 Technical Details

Architecture

Base Crawler: Abstract class with common crawling logic
Platform Crawlers: Specialized extractors for each social media platform
IndexedDB Storage: Client-side database for persistent data storage
Smart Scrolling: Post-height-based navigation system
Rate Limiting: Prevents duplicate processing and API overload

Data Schema

{
  id: "unique_post_identifier",
  platform: "twitter|linkedin|facebook|instagram|reddit|tiktok",
  company: "extracted_company_name",
  author: {
    name: "Author Name",
    handle: "@username",
    avatar: "avatar_url",
    profileUrl: "profile_url"
  },
  text: "post_content",
  transcript: "video_transcript", // for TikTok videos
  title: "post_title", // if applicable
  timestamp: "ISO_datetime",
  url: "post_url",
  crawledAt: "ISO_datetime",
  metrics: {
    likes: 0,
    shares: 0,
    comments: 0,
    views: 0, // platform dependent
    collects: 0 // TikTok saves/bookmarks
  },
  media: {
    images: [...],
    videos: [...],
    gifs: [...],
    embeds: [...]
  },
  // Platform-specific fields...
}

Performance Features

Intelligent Viewport Detection: Only processes visible posts
Memory Management: Automatic cleanup of processed post references
Scroll Optimization: Precise scroll amounts based on actual post heights
Background Processing: Non-blocking data extraction and storage

⚙️ Configuration

Rate Limiting

The crawler includes built-in rate limiting:

Maximum 3 processing attempts per post within 15 seconds
Automatic retry logic for failed extractions
Memory cleanup for optimal performance

Scroll Behavior

Auto-scroll Interval: 1 second (configurable)
Viewport Detection: 30% visibility threshold
Scroll Amount: Calculated based on current post height
Recovery Mode: Automatic detection and recovery from stuck states

🐛 Troubleshooting

Common Issues

Extension Not Working:

Ensure you're on a supported platform
Check that the extension is enabled in Chrome
Refresh the page and try again

No Posts Being Detected:

Platform may have updated their HTML structure
Check browser console for error messages
Try scrolling manually to load content

Control Panel Not Visible:

Panel appears in top-right corner
Check if it's behind other page elements
Try clicking the extension icon again

Data Export Issues:

Ensure you have crawled posts before exporting
Check filter settings (try "All" options)
Verify browser allows file downloads

Browser Console

Enable Chrome DevTools (F12) to see detailed logging:

[Platform] Post saved: ... - Successful post extraction
[Platform] Rate limiting: ... - Duplicate processing prevention
[Platform] Error: ... - Extraction or storage errors

🔐 Privacy & Security

Local Storage Only: All data is stored locally in your browser
No External Servers: No data is transmitted to external services
HTTPS Only: Extension only works on secure connections
Minimal Permissions: Only requests necessary browser permissions

🤝 Contributing

This extension follows a modular architecture making it easy to add new platforms:

Create new platform crawler extending BaseCrawler
Implement required methods: getSelectors(), extractPostData()
Add platform detection in content.js
Update manifest permissions for new domains

📄 License

This project is provided as-is for educational and research purposes.

🆘 Support

For issues, questions, or feature requests, please check:

Browser console for error messages
Extension permissions and settings
Platform compatibility

Note: This extension is designed for educational and research purposes. Please respect platform terms of service and rate limits when crawling data.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.idea		.idea
crawlers		crawlers
README.md		README.md
background.js		background.js
content.js		content.js
control-panel.css		control-panel.css
db.js		db.js
index.js		index.js
manifest.json		manifest.json
package-lock.json		package-lock.json
popup.html		popup.html
popup.js		popup.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Social Media Posts Crawler

✨ Features

🚀 Installation

From Source

📖 How to Use

1. Basic Setup

2. Using the Control Panel

3. Crawling Process

4. Data Export

🎯 Supported Platforms

🔧 Technical Details

Architecture

Data Schema

Performance Features

⚙️ Configuration

Rate Limiting

Scroll Behavior

🐛 Troubleshooting

Common Issues

Browser Console

🔐 Privacy & Security

🤝 Contributing

📄 License

🆘 Support

About

Uh oh!

Releases

Packages

Languages

atmaxmoj/socialmediascrawler

Folders and files

Latest commit

History

Repository files navigation

Social Media Posts Crawler

✨ Features

🚀 Installation

From Source

📖 How to Use

1. Basic Setup

2. Using the Control Panel

3. Crawling Process

4. Data Export

🎯 Supported Platforms

🔧 Technical Details

Architecture

Data Schema

Performance Features

⚙️ Configuration

Rate Limiting

Scroll Behavior

🐛 Troubleshooting

Common Issues

Browser Console

🔐 Privacy & Security

🤝 Contributing

📄 License

🆘 Support

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages