
TopicStreams

Real-time news aggregation system that continuously scrapes Google (not Google News) for any topics (search keywords) and streams updates via WebSocket.

Why TopicStreams?

The Limitations with Google News & RSS

Google News (https://news.google.com) and Google News RSS (https://news.google.com/rss?search=<keyword>) provide curated news collections based on Google's algorithms. While convenient, they have limitations:

  • Results are not necessarily the latest - articles may be hours or days old
  • Google filters by quality and relevance, potentially missing breaking news
  • No control over what Google considers "newsworthy"

Google News Search result - hours or days old

Google News RSS - same as Google News search

TopicStreams' Approach

TopicStreams scrapes Google Search → News Tab with time filters, giving you:

  • Real-time results - All news indexed by Google, regardless of quality rating
  • Unfiltered access - No curation, you decide what's relevant
  • Near-instant updates - Scrape frequently enough and catch news as it breaks
  • Full control - Customize topics (search keywords) and scrape intervals

Google Search News Tab - Latest, Unfiltered Results
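
For illustration, the approach above boils down to loading URLs of roughly the following shape. This is a minimal sketch, not the project's actual code; it assumes Google's standard tbm=nws (News tab) and tbs (time filter) query parameters, which may differ from what the scraper actually sends:

# Minimal sketch of a Google Search News-tab URL with a time filter.
# Assumes the standard tbm/tbs parameters; the scraper's real URLs may differ.
from urllib.parse import urlencode

def news_tab_url(keyword: str, past_hour: bool = True) -> str:
    params = {
        "q": keyword,                               # the topic (search keywords)
        "tbm": "nws",                               # News tab
        "tbs": "qdr:h" if past_hour else "qdr:d",   # time filter: past hour / past day
    }
    return "https://www.google.com/search?" + urlencode(params)

print(news_tab_url("bitcoin"))
# -> https://www.google.com/search?q=bitcoin&tbm=nws&tbs=qdr%3Ah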

Try It Live

Experience TopicStreams in action: http://topicstreams.dongziyu.com:8080

Quick Demo

# Add a topic (ensures it exists)
curl -X POST http://topicstreams.dongziyu.com:8080/api/v1/topics \
  -H "Content-Type: application/json" \
  -d '{"name": "Bitcoin"}'

# List all active topics (the list should now contain "bitcoin")
curl http://topicstreams.dongziyu.com:8080/api/v1/topics | jq

# Get latest news for "Bitcoin"
curl http://topicstreams.dongziyu.com:8080/api/v1/news/bitcoin?limit=5 | jq

WebSocket Streaming

For real-time news updates, connect via WebSocket:

# Real-time WebSocket news stream for "China" (automatically adds the topic if not present)
websocat ws://topicstreams.dongziyu.com:8080/api/v1/ws/news/china | jq

The WebSocket delivers live news updates as they're scraped, showing the same content you'd see by continuously refreshing Google's news search page.

WebSocket Real-time News Stream - Live updates as articles are scraped
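
Besides websocat, any WebSocket client works. Here is a minimal Python sketch, assuming the third-party websockets package (pip install websockets) and the message fields documented in the API Reference below:

# Minimal Python WebSocket client sketch (assumes the "websockets" package).
import asyncio
import json

import websockets

async def stream_topic(topic: str):
    url = f"ws://topicstreams.dongziyu.com:8080/api/v1/ws/news/{topic}"
    async with websockets.connect(url) as ws:
        async for message in ws:                 # one JSON message per new article
            article = json.loads(message)
            print(article["scraped_at"], article["title"])

asyncio.run(stream_topic("bitcoin"))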

What TopicStreams Offers

  • Real-time news streaming on customizable topics (any search keywords)
  • Self-hosted - No third-party news API costs

Limitations

  • Google Dependency - Black box algorithms, no source control, variable indexing speed, geographic filtering
  • Inconsistent Results - Same queries return different results based on IP, geolocation, browser, A/B testing
  • No Quality Control - All news included, credible or not
  • Access Risks - Google may detect scraping and rate-limit or block access; mitigated by the anti-bot detection measures described below

Features

  • Real-time News Aggregation - Continuously scrapes Google Search News tab (not Google News site) for the latest articles
  • Multi-Topic Tracking - Monitor multiple news topics simultaneously with configurable scrape intervals
  • WebSocket Streaming - Subscribe to live news updates per topic via WebSocket connections
  • REST API - Manage topics and retrieve historical news entries through HTTP endpoints
  • Anti-Bot Detection - Playwright with stealth patches, realistic browser fingerprinting, and configurable geolocation (details)

Architecture

TopicStreams consists of three main components:

┌─────────────────────────┐
│         Client          │
│ (REST API / WebSocket)  │
└────────────┬────────────┘
             │                               
             ▼                               
┌─────────────────────────┐    ┌──────────────────────────────┐
│     FastAPI Server      │    │      Scraper Service         │
│                         │    │                              │
│  - REST endpoints       │    │  - Playwright browser        │
│  - WebSocket streams    │    │  - BeautifulSoup parser      │
│  - PostgreSQL listener  │    │  - Continuous scraping loop  │
└────────────┬────────────┘    └─────────────┬────────────────┘
             │                               │
             ▼                               ▼
┌─────────────────────────────────────────────────────────────┐
│                   PostgreSQL Database                       │
│                                                             │
│          - Topics (tracked keywords)                        │
│          - News Entries (scraped articles)                  │
│          - Scraper Logs (monitoring)                        │
│          - LISTEN/NOTIFY for real-time updates              │
└─────────────────────────────────────────────────────────────┘

Data Flow

  1. Scraper Service continuously scrapes Google Search News tab for tracked topics
  2. New articles are inserted into PostgreSQL with automatic deduplication
  3. Database triggers send NOTIFY events on new inserts
  4. FastAPI Server listens for these events via PostgreSQL's LISTEN/NOTIFY (see the sketch after this list)
  5. Updates are pushed to connected WebSocket clients in real-time
  6. Clients can also fetch historical data via REST API
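
A minimal sketch of step 4, assuming the asyncpg driver, a notification channel named news_updates carrying the new row as JSON, and a connection from inside the Docker network (the database is not exposed to the host). The actual driver, channel name, and payload in TopicStreams may differ:

# Sketch of a PostgreSQL LISTEN/NOTIFY consumer (assumptions: asyncpg driver,
# channel name "news_updates", JSON payload; not the actual TopicStreams code).
import asyncio
import json

import asyncpg

async def listen_for_news():
    conn = await asyncpg.connect("postgresql://newsuser:newspass@postgres/newsdb")

    def on_notify(connection, pid, channel, payload):
        entry = json.loads(payload)              # assume the trigger sends the row as JSON
        print("new article:", entry.get("title"))

    await conn.add_listener("news_updates", on_notify)
    await asyncio.sleep(3600)                    # keep the connection open to receive events

asyncio.run(listen_for_news())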

Key Technologies

  • FastAPI - Web framework for REST and WebSocket
  • Playwright - Browser automation with anti-bot detection (see how it works)
  • PostgreSQL - Reliable storage with LISTEN/NOTIFY for real-time events
  • Docker - Container orchestration for easy deployment

Prerequisites

Docker with Docker Compose is all you need. All other dependencies (Python, PostgreSQL, Playwright browsers) are handled inside containers.

Optional: Install websocat for WebSocket testing (used in the demos throughout this README), or use any WebSocket client you prefer.

Web UI

TopicStreams includes a modern, responsive Web UI that provides a complete dashboard for monitoring and managing your news aggregation system.

Features

  • System Status Dashboard - Real-time monitoring of scraper health and activity
  • Topic Management - Easy add/remove topics with visual feedback
  • Real-time News Feed - Live updates with WebSocket connections
  • Scraper Logs Panel - Historical activity monitoring

Access the Web UI

After completing the Quick Start, open your browser and navigate to:

http://localhost:5000

TopicStreams Web UI - Complete dashboard for real-time news aggregation
TopicStreams Web UI - Complete dashboard for real-time news aggregation

Quick Start

1. Clone the Repository

git clone https://github.com/zydo/topicstreams.git
cd topicstreams

2. Configure Environment

Copy .env.example to .env and customize if needed:

cp .env.example .env

Default settings work out-of-the-box.

3. Start Services

docker compose up -d

This will start three containers: the FastAPI server (api), the scraper service (scraper), and the PostgreSQL database (postgres).

4. Add Topics to Track

# Add a topic
curl -X POST http://localhost:5000/api/v1/topics \
  -H "Content-Type: application/json" \
  -d '{"name": "artificial intelligence"}'

Scraping of the new topic starts on the next scrape cycle.

5. Access Real-Time News

WebSocket (for real-time):

# Using websocat
websocat ws://localhost:5000/api/v1/ws/news/artificial+intelligence

# Or with jq for prettier formatted output
websocat ws://localhost:5000/api/v1/ws/news/artificial+intelligence | jq

REST API (for historical data):

# Get recent news for a topic with pagination (results 11 to 15, newest first)
curl "http://localhost:5000/api/v1/news/artificial+intelligence?offset=10&limit=5" | jq

# List all actively scraping topics
curl http://localhost:5000/api/v1/topics | jq

# List the 10 most recent scraper logs (each log represents one Google page load, typically yielding up to 10 news entries)
curl http://localhost:5000/api/v1/logs?limit=10 | jq

See the API Reference section below for complete endpoint documentation.

6. Monitor Logs

# Background scraper logs
docker compose logs -f scraper

# FastAPI server logs
docker compose logs -f api

Stop Services

docker compose down

Configuration

All configuration is done via environment variables in the .env file. Copy .env.example to get started.

Database Settings

Variable Default Description
POSTGRES_HOST postgres PostgreSQL hostname (use the Docker service name postgres for access within the Docker network)
POSTGRES_PORT 5432 PostgreSQL port
POSTGRES_DB newsdb Database name
POSTGRES_USER newsuser Database username
POSTGRES_PASSWORD newspass Database password

Note: The PostgreSQL service is only accessible within the Docker network (not exposed to the host). Simple passwords are acceptable since the database is not publicly accessible. For direct database access, see Database Access.

API Settings

Variable Default Description
API_PORT 5000 Port for FastAPI server

Note: The API port is mapped to the same port on the host (Docker internal 5000 → host localhost:5000). If port 5000 is already in use on your host, modify the port mapping in docker-compose.yml (e.g., change "5000:5000" to "8000:5000" to access via localhost:8000).

Scraper Settings

Variable Default Description
SCRAPE_INTERVAL 60 Interval in seconds between scrape cycles (measured from start to start). Set to 0 or negative for continuous scraping with no delay. See Scrape Interval Behavior.
MAX_PAGES 1 Number of result pages to scrape. Increase if you have high-volume topics or longer intervals.

Browser Fingerprinting (Anti-Detection)

Variable Default Description
BROWSER_TIMEZONE America/Los_Angeles Browser timezone identifier. Recommended to match your machine's IP location for realism. See the IANA list of timezones.
BROWSER_GEOLOCATION_LATITUDE 37.3273 Latitude coordinate (default: San Jose, CA)
BROWSER_GEOLOCATION_LONGITUDE -121.954 Longitude coordinate (default: San Jose, CA)

Example configurations:

# New York
BROWSER_TIMEZONE=America/New_York
BROWSER_GEOLOCATION_LATITUDE=40.7128
BROWSER_GEOLOCATION_LONGITUDE=-74.0060

# London
BROWSER_TIMEZONE=Europe/London
BROWSER_GEOLOCATION_LATITUDE=51.5074
BROWSER_GEOLOCATION_LONGITUDE=-0.1278

# Tokyo
BROWSER_TIMEZONE=Asia/Tokyo
BROWSER_GEOLOCATION_LATITUDE=35.6762
BROWSER_GEOLOCATION_LONGITUDE=139.6503

Anti-Bot Detection

TopicStreams uses sophisticated techniques to make the scraper appear as a real human user, minimizing the risk of being blocked by Google.

How It Works

The scraper uses Playwright (headless Chromium browser) combined with playwright-stealth patches to hide automation signals and mimic genuine user behavior.

1. Browser Launch Arguments

browser.launch(
    headless=True,
    args=[
        "--no-sandbox",                                   # Docker compatibility
        "--disable-setuid-sandbox",                       # Docker compatibility
        "--disable-blink-features=AutomationControlled"   # Hide automation flag
    ]
)
  • --disable-blink-features=AutomationControlled prevents navigator.webdriver from being exposed

2. Realistic Browser Context

The browser context is configured to match a real macOS Chrome user:

context = browser.new_context(
    user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ... Chrome/131.0.0.0",
    viewport={"width": 1920, "height": 1080},
    locale="en-US",
    timezone_id="America/Los_Angeles",                          # Configurable
    geolocation={"latitude": 37.3273, "longitude": -121.954},   # Configurable
    color_scheme="light",
    extra_http_headers={
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,..."
    }
)

Key points:

  • User Agent: Latest Chrome version (131) on macOS
  • Timezone & Geolocation: Recommended to match your server's IP location (see Configuration)
  • HTTP Headers: Realistic Accept-Language and content type preferences

3. Playwright-Stealth Patches

After creating each page, we apply stealth patches:

from playwright_stealth import Stealth

stealth = Stealth()
stealth.apply_stealth_sync(page)

This patches ~20 automation detection vectors:

Detection Vector Before After
navigator.webdriver true false
navigator.plugins Empty (0) 3 fake plugins
window.chrome Missing Present
Canvas fingerprints Generic Realistic
WebGL fingerprints Generic Realistic

4. Memory Management

To prevent memory leaks in long-running scrapers:

for topic in topics:
    page = context.new_page()          # Fresh page per topic
    stealth.apply_stealth_sync(page)
    try:
        scrape_news(page, topic)       # page.goto(...) one or multiple URLs
    finally:
        page.close()                   # Always cleanup

What Google Sees

After all patches, Google's JavaScript sees:

navigator.webdriver        // false (was true)
navigator.plugins.length   // 3 (was 0)
window.chrome              // Object (was undefined)
navigator.languages        // ["en-US", "en"]
navigator.platform         // "MacIntel"

Limitations

This is NOT perfect invisibility:

  • Google can still detect patterns (same IP scraping many topics)
  • Browser fingerprints are static (not randomized per request)
  • High request rates will still trigger blocks

For high-volume or 24/7 scraping, consider proxy rotation to distribute load across multiple IPs.

Best practices:

  • Match timezone/geolocation to your server's IP location
  • Keep SCRAPE_INTERVAL reasonable (default 60s is safe)
  • Monitor scraper logs for HTTP 429 (rate limit) or 403 (blocked)

See the Configuration section to customize timezone and geolocation settings.

Scraping Behavior

Sequential Execution

Topics are scraped one after another sequentially, not concurrently:

for topic in topics:
    scrape_news(page, topic)
    # Next topic starts after previous completes

Why sequential?

  • Avoids unusually high QPS (queries per second) that could trigger Google's rate limiting
  • Reduces the chance of being blocked
  • Simulates natural browsing behavior (humans don't open 10 Google searches simultaneously)

Topic order:

  • Topics are randomized at the start of each cycle (shuffle(topics)) to avoid a deterministic request pattern
  • Different loop iterations scrape topics in different orders
  • Further mimics human behavior and distributes load

Scrape Interval Behavior

The SCRAPE_INTERVAL setting (default: 60 seconds) controls how often to scrape all topics:

Normal case (scraping finishes within interval):

Cycle 1: Scrape all topics (30s) → Wait 30s → Cycle 2 starts at exactly 60s

Long-running case (scraping exceeds interval):

Cycle 1: Scrape all topics (90s) → No wait → Cycle 2 starts immediately at 90s

Result pages:

During each cycle, only the first MAX_PAGES (default: 1) pages of each topic are scraped. This strategy assumes that between scrape intervals, the number of new articles per topic doesn't exceed one page (typically up to 10 entries). If you have high-volume topics or longer intervals (e.g., >5 minutes), increase MAX_PAGES to 2-3 to avoid missing articles.

Key points:

  • The interval is from the start of one cycle to the start of the next
  • If scraping takes longer than the interval, the next cycle starts immediately after completion
  • No cycles are skipped - every topic gets scraped eventually
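
Putting the pieces together, the cycle logic described above can be sketched roughly as follows (simplified, not the actual scraper implementation):

# Simplified sketch of the scrape cycle: shuffle, sequential scraping,
# start-to-start interval. Not the actual scraper code.
import random
import time

SCRAPE_INTERVAL = 60    # seconds, measured from cycle start to cycle start

def run_cycles(topics, scrape_news):
    while True:
        cycle_start = time.monotonic()
        random.shuffle(topics)                   # randomized topic order each cycle
        for topic in topics:                     # sequential, one topic at a time
            scrape_news(topic)
        elapsed = time.monotonic() - cycle_start
        remaining = SCRAPE_INTERVAL - elapsed
        if remaining > 0:
            time.sleep(remaining)                # next cycle starts exactly at the interval
        # else: scraping exceeded the interval; start the next cycle immediately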

Monitoring Scrape Performance

To monitor how long each cycle takes:

# Watch scraping performance in real-time
docker compose logs -f scraper | grep 'topics took'

Example output:

topicstreams-scraper  | 2025-12-03 22:47:50,978 - INFO - 50 topics took 72.1s (exceeds 60s interval), starting next cycle immediately
topicstreams-scraper  | 2025-12-03 22:49:27,978 - INFO - 5 topics took 8.3s, waiting 51.7s until next scrape...

What to look for:

  • If cycles consistently exceed the interval, consider:
    • Increasing SCRAPE_INTERVAL
    • Reducing MAX_PAGES (scrape fewer pages per topic)
    • Reducing the number of tracked topics
  • If you see frequent HTTP 429 or 403 errors in logs (check via scraper logs API), you're being rate-limited or blocked

Proxy Rotation

Not implemented yet - This is a recommended enhancement for high-volume scraping scenarios.

The Problem: Scaling Beyond Sequential Scraping

The current implementation uses sequential scraping with reasonable intervals (default 60s) to avoid detection. However, for aggressive scraping needs:

  • High QPS requirements - Scraping many topics frequently (e.g., SCRAPE_INTERVAL=0 for continuous scraping)
  • 24/7 operation - Long-running scrapers from the same IP
  • Concurrent scraping - Switching from sequential to parallel topic scraping for speed

These scenarios significantly increase the risk of being rate-limited or blocked by Google, even with anti-bot detection measures.

Recommended Solution: Proxy Rotation

Implement rotating proxies for outbound traffic instead of direct connections from your server:

# Future implementation example
proxies = [
    {"server": "http://proxy1.example.com:8080", "country": "US"},
    {"server": "http://proxy2.example.com:8080", "country": "UK"},
    {"server": "http://proxy3.example.com:8080", "country": "CA"},
]

for topic in topics:
    proxy = select_next_proxy()  # Rotate through proxy pool
    context = browser.new_context(proxy=proxy, ...)
    page = context.new_page()
    scrape_news(page, topic)

Advanced: Different Personas per Proxy

For maximum stealth, pair each proxy with a unique browser fingerprint ("persona"):

personas = {
    "proxy1": {
        "timezone": "America/New_York",
        "geolocation": {"latitude": 40.7128, "longitude": -74.0060},
        "user_agent": "Chrome/131.0.0.0 on Windows",
        "viewport": {"width": 1920, "height": 1080},
    },
    "proxy2": {
        "timezone": "Europe/London",
        "geolocation": {"latitude": 51.5074, "longitude": -0.1278},
        "user_agent": "Chrome/131.0.0.0 on macOS",
        "viewport": {"width": 1440, "height": 900},
    },
    # ...more personas
}

This makes each proxy appear as a completely different user (different location, device, browser).

Benefits

  • Distribute load - Requests come from different IPs, avoiding single-IP rate limits
  • Reduce blocking risk - Even if one proxy gets blocked, others continue working
  • Enable concurrency - Scrape multiple topics in parallel without triggering detection
  • 24/7 operation - Sustained high-volume scraping becomes feasible
  • Geographic diversity - Appear as users from different locations

Implementation Considerations

  • Proxy pool management and health checks
  • Persona configuration and rotation strategy
  • Error handling for proxy failures

When You Need This

You probably don't need proxies if:

  • Scraping ≤10 topics with default settings (60s interval)
  • Sequential scraping is fast enough for your use case
  • You're okay with occasional rate limiting

You should consider proxies if:

  • Scraping >20 topics with aggressive intervals (<30s)
  • Need true real-time updates (interval near 0)
  • Switching to concurrent scraping for performance
  • Experiencing frequent HTTP 429/403 blocks

Future Plans

This feature is not currently implemented but is on the roadmap. Contributions welcome!

Authentication & Security

Not implemented yet - The current project is designed for localhost/LAN access only and has no authentication or security features.

Current State: Localhost/LAN Only

TopicStreams is intentionally minimal and assumes deployment in a trusted environment:

  • Perfect for: Local machine, home network, trusted team LAN
  • NOT safe for: Public internet exposure without additional security layers

No built-in security:

  • No user authentication or authorization
  • No API rate limiting (anyone can flood the API)
  • No protection against DDoS or malicious attacks
  • No HTTPS/SSL encryption
  • No input sanitization beyond basic validation
  • No CORS configuration for browser security

Recommended Solutions

1. Authentication & Authorization

API Key Authentication (Simple):

# Future implementation example
from fastapi import Request
from fastapi.responses import JSONResponse

valid_api_keys = {"example-key"}          # placeholder; load from environment or a secrets store

@app.middleware("http")
async def verify_api_key(request: Request, call_next):
    api_key = request.headers.get("X-API-Key")
    if api_key not in valid_api_keys:
        return JSONResponse(status_code=401, content={"error": "Unauthorized"})
    return await call_next(request)

JWT Token Authentication (Advanced):

# User login returns JWT token
# All subsequent requests include: Authorization: Bearer <token>
# Supports user roles, expiration, refresh tokens
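
A minimal sketch of what this could look like, assuming the PyJWT package and FastAPI's OAuth2PasswordBearer helper (illustrative only, not a complete auth system):

# Future implementation sketch with PyJWT (secret and token lifetime are placeholders).
from datetime import datetime, timedelta, timezone

import jwt                                       # PyJWT
from fastapi import Depends, HTTPException
from fastapi.security import OAuth2PasswordBearer

SECRET_KEY = "change-me"                         # placeholder; load from environment
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="login")

def create_token(username: str) -> str:
    payload = {"sub": username,
               "exp": datetime.now(timezone.utc) + timedelta(hours=1)}
    return jwt.encode(payload, SECRET_KEY, algorithm="HS256")

def current_user(token: str = Depends(oauth2_scheme)) -> str:
    try:
        return jwt.decode(token, SECRET_KEY, algorithms=["HS256"])["sub"]
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid or expired token")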

OAuth2/OpenID Connect:

  • Integrate with existing identity providers (Google, GitHub, Auth0)
  • Best for multi-user scenarios

2. API Rate Limiting

Protect against abuse and DDoS:

# Future implementation with slowapi
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/api/v1/topics")
@limiter.limit("100/minute")               # Max 100 requests per minute per IP
async def get_topics(request: Request):    # slowapi requires the Request parameter
    ...

3. Cloudflare (Recommended for Public Deployment)

Put Cloudflare in front of your service:

Internet → Cloudflare → Your Server

Free tier includes:

  • DDoS protection (automatic)
  • SSL/TLS encryption (automatic)
  • CDN caching (for API responses if configured)
  • Web Application Firewall (WAF) rules
  • Rate limiting (configurable rules)
  • Bot protection
  • Analytics and logging

Paid tiers add:

  • Advanced WAF rules
  • Image optimization
  • Argo smart routing (faster)
  • Higher rate limits

4. Additional Security Measures

CORS Configuration:

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://yourdomain.com"],  # Specific domains only
    allow_credentials=True,
    allow_methods=["GET", "POST", "DELETE"],
    allow_headers=["*"],
)

Monitoring & Alerting:

  • Log all authentication failures
  • Monitor API usage patterns
  • Alert on unusual activity (sudden traffic spikes, repeated 401s)
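
As a starting point, authentication failures can be logged from a small middleware. A sketch assuming an existing FastAPI app object (alerting on top of these logs is left to your monitoring stack):

# Sketch: log 401/403 responses so unusual patterns can be alerted on.
# Assumes an existing FastAPI "app"; not part of the current codebase.
import logging

from fastapi import Request

security_log = logging.getLogger("security")

@app.middleware("http")
async def log_auth_failures(request: Request, call_next):
    response = await call_next(request)
    if response.status_code in (401, 403):
        security_log.warning("auth failure: %s %s from %s",
                             request.method, request.url.path,
                             request.client.host if request.client else "unknown")
    return response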

WebSocket Scalability

Not implemented yet - The current WebSocket implementation uses simple broadcasting that doesn't scale beyond a limited number of subscribers.

Current State: Simple Broadcasting

TopicStreams uses a straightforward approach for WebSocket message distribution:

# Current implementation (simplified)
async def broadcast_news_update(news_entry: NewsEntry):
    subscribers = connected_websockets[news_entry.topic]
    for websocket in list(subscribers):        # copy: we may remove entries while iterating
        try:
            await websocket.send_json(news_entry.dict())
        except ConnectionClosedOK:
            subscribers.remove(websocket)      # drop closed connections

How it works:

  • Each WebSocket connection is stored in memory
  • When new news arrives, the server iterates through ALL connected clients for that topic
  • Messages are sent one-by-one to each subscriber
  • Failed connections are cleaned up during broadcasting

Scalability Limitations

This approach has significant limitations:

  1. O(n) Broadcast Cost - Sending to 1000 subscribers requires 1000 separate send operations
  2. Memory Usage - All WebSocket connections stored in server memory
  3. Single Point of Failure - If one server restarts, all connections are lost
  4. No Load Distribution - All broadcasting load handled by single server instance
  5. Slow Message Delivery - Large subscriber bases experience delivery delays

Recommended Solutions

1. Redis Pub/Sub

# Future implementation with Redis (async client)
import json

import redis.asyncio as redis

redis_client = redis.Redis()                      # shared async client

async def publish_news_update(news_entry: NewsEntry):
    await redis_client.publish(
        f"news_updates:{news_entry.topic}",
        json.dumps(news_entry.dict())
    )

# WebSocket servers subscribe to Redis channels
async def redis_subscriber():
    pubsub = redis_client.pubsub()
    await pubsub.psubscribe("news_updates:*")     # pattern-subscribe to all topic channels

    async for message in pubsub.listen():
        if message["type"] == "pmessage":         # data messages for pattern subscriptions
            await broadcast_to_local_clients(message["data"])

Benefits:

  • O(1) Publish Cost - Single publish operation regardless of subscriber count
  • Horizontal Scaling - Multiple WebSocket servers can subscribe to same Redis channels
  • Fault Tolerance - Redis handles message queuing and delivery
  • Low Latency - Optimized binary protocol for message distribution

Implementation:

  • Add Redis to docker-compose.yml
  • Modify WebSocket servers to publish/subscribe via Redis
  • Each server only manages its local connections
  • Redis handles cross-server message distribution

2. Apache Kafka

For very large-scale deployments (10K+ subscribers):

# Future implementation with Kafka
import json

from kafka import KafkaProducer

producer = KafkaProducer(                         # create once, reuse across calls
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode()
)

async def publish_news_update(news_entry: NewsEntry):
    # Kafka topic names cannot contain spaces, so replace them
    kafka_topic = f"news-updates-{news_entry.topic.replace(' ', '-')}"
    producer.send(kafka_topic, news_entry.dict())

Benefits:

  • Message Persistence - Messages stored on disk, can replay missed updates
  • Partitioning - Natural load distribution across multiple consumers
  • Exactly-Once Semantics - Guaranteed message delivery without duplication
  • Backpressure Handling - Natural flow control for slow consumers

API Reference

Base URL: http://localhost:5000/api/v1

Topics

List Topics

GET /api/v1/topics

Query Parameters:

Parameter Type Default Description
all boolean false Include inactive (soft deleted) topics

Response:

[
  {
    "id": 1,
    "name": "artificial intelligence",
    "created_at": "2025-12-03T10:30:00",
    "is_active": true
  }
]

Example:

# Get active topics only
curl http://localhost:5000/api/v1/topics

# Get all topics including inactive
curl http://localhost:5000/api/v1/topics?all=true

Add Topic

POST /api/v1/topics

Request Body:

{
  "name": " Climate    CHANGE "
}

Response: 204 No Content

Notes:

  • Topic names are automatically normalized (lowercased, trimmed, internal whitespace collapsed); in the example above, the name is normalized to climate change (see the sketch below)
  • Adding an existing inactive topic reactivates it
  • Adding an existing active topic is idempotent (no error)
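
For illustration, a plausible normalization rule (the exact rule may differ; collapsing internal whitespace is assumed from the example above):

# Plausible topic-name normalization (assumption: internal whitespace is collapsed).
import re

def normalize_topic(name: str) -> str:
    return re.sub(r"\s+", " ", name).strip().lower()

normalize_topic(" Climate    CHANGE ")   # -> "climate change"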

Example:

curl -X POST http://localhost:5000/api/v1/topics \
  -H "Content-Type: application/json" \
  -d '{"name": "Quantum Computing"}'

Delete Topic

DELETE /api/v1/topics/{topic_name}

Path Parameters:

Parameter Type Description
topic_name string Topic name (will be normalized)

Response: 204 No Content

Notes:

  • Soft delete (marks is_active = false)
  • Deleting a non-existent topic succeeds (idempotent)
  • Use URL encoding for topics with spaces: Quantum%20Computing or Quantum+Computing
  • In the example below, the topic with normalized name quantum computing is soft deleted

Example:

curl -X DELETE http://localhost:5000/api/v1/topics/Quantum%20Computing

News

Get News for Topic

GET /api/v1/news/{topic_name}

Path Parameters:

Parameter Type Description
topic_name string Topic name (will be normalized)

Query Parameters:

Parameter Type Default Range Description
limit integer 20 1-100 Number of entries per page
offset integer 0 ≥0 Pagination offset

Response:

{
  "topic": "artificial intelligence",
  "entries": [
    {
      "id": 123,
      "topic": "artificial intelligence",
      "title": "AI Breakthrough in Healthcare",
      "url": "https://example.com/article",
      "domain": "example.com",
      "source": "Tech News",
      "scraped_at": "2025-12-03T10:45:00"
    }
  ],
  "total": 150,
  "limit": 20,
  "offset": 0
}

Notes:

  • Results ordered by scraped_at DESC (newest first)
  • Use pagination for large result sets

Example:

# Get first (newest) 20 news entries
curl http://localhost:5000/api/v1/news/Artificial%20Intelligence

# Get entries 21-40 (pagination)
curl "http://localhost:5000/api/v1/news/ARTIFICIAL+INTELLIGENCE?limit=20&offset=20"

# Get latest 5 entries
curl http://localhost:5000/api/v1/news/artificial+intelligence?limit=5

Logs

Get Scraper Logs

GET /api/v1/logs

Query Parameters:

Parameter Type Default Range Description
limit integer 20 1-100 Number of log entries to return

Response:

[
  {
    "id": 456,
    "topic": "artificial intelligence",
    "scraped_at": "2025-12-03T10:50:00",
    "success": true,
    "http_status_code": 200,
    "error_message": null
  },
  {
    "id": 455,
    "topic": "climate change",
    "scraped_at": "2025-12-03T10:49:30",
    "success": false,
    "http_status_code": 429,
    "error_message": null
  }
]

Notes:

  • Results ordered by scraped_at DESC (newest first)
  • One log entry = one webpage load attempt
  • success = false indicates scraping failure (check http_status_code and error_message)

Example:

# Get last 10 scraper logs
curl http://localhost:5000/api/v1/logs?limit=10

WebSocket

Real-Time News Updates

ws://localhost:5000/api/v1/ws/news/{topic_name}

Path Parameters:

Parameter Type Description
topic_name string Topic name (will be normalized)

Behavior:

  • Automatically adds the topic if it doesn't exist and starts scraping it; there is no need to POST /api/v1/topics in advance
  • Pushes JSON messages when new news entries are scraped
  • Connection stays open until client disconnects

Message Format:

{
  "id": 789,
  "topic": "artificial intelligence",
  "title": "Breaking: New AI Model Released",
  "url": "https://example.com/breaking-news",
  "domain": "example.com",
  "source": "Tech Times",
  "scraped_at": "2025-12-03T10:55:00"
}

Example:

# Using websocat
websocat ws://localhost:5000/api/v1/ws/news/Bitcoin

# With formatted output
websocat ws://localhost:5000/api/v1/ws/news/bitcoin | jq

// Using JavaScript
const ws = new WebSocket('ws://localhost:5000/api/v1/ws/news/bitcoin');
ws.onmessage = (event) => {
  const news = JSON.parse(event.data);
  console.log('New article:', news.title);
};

Error Responses

All errors return JSON with this structure:

{
  "error": "ERROR_CODE",
  "message": "Human-readable error message",
  "status": "error"
}

Common HTTP Status Codes:

Code Error Type Description
400 BAD_REQUEST Invalid request (e.g., topic name too long)
422 VALIDATION_ERROR Request validation failed (see details field)
500 INTERNAL_SERVER_ERROR Unexpected server error

Example Error:

{
  "error": "VALIDATION_ERROR",
  "message": "Invalid request parameters",
  "details": [
    {
      "loc": ["query", "limit"],
      "msg": "ensure this value is less than or equal to 100",
      "type": "value_error"
    }
  ],
  "status": "error"
}

Database Access

For debugging or manual inspection, you can access the PostgreSQL database directly using psql:

docker compose exec postgres psql -U newsuser -d newsdb

Common SQL queries:

-- View all active topics
SELECT id, name, created_at, is_active FROM topics WHERE is_active = TRUE ORDER BY created_at DESC;

-- Count news entries per topic
SELECT topic, COUNT(*) as count FROM news_entries GROUP BY topic ORDER BY count DESC;

-- View recent news entries for a topic
SELECT id, title, url, source, scraped_at
FROM news_entries
WHERE topic = 'artificial intelligence'
ORDER BY scraped_at DESC
LIMIT 10;

-- Check scraper success rate
SELECT
    success,
    COUNT(*) as count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
FROM scraper_logs
GROUP BY success;

-- View failed scrapes with error details
SELECT topic, scraped_at, http_status_code, error_message
FROM scraper_logs
WHERE success = FALSE
ORDER BY scraped_at DESC
LIMIT 20;

-- Database size
SELECT pg_size_pretty(pg_database_size('newsdb')) as database_size;

-- Table sizes
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

Backup and restore:

# Backup database
docker compose exec postgres pg_dump -U newsuser newsdb > backup.sql

# Restore database
docker compose exec -T postgres psql -U newsuser -d newsdb < backup.sql

License

MIT