fix: Reduce token bloat - Stop scraping closed issues and unnecessary metadata #169

@yusufkaraaslan

Community Feedback

"I'm also concerned that it's scraping things that don't need to be scraped. 'GitHub Issues (open/closed, labels, milestones)'. Why are you scraping anything but open issues? I could see maybe scraping some closed issues that are recent (and prior to a release that hasn't arrived), but I'm not sure a lot of these things are relevant. People have token usage concerns."

The Problem

Current behavior (cli/github_scraper.py:403):

# Fetch recent issues (open + closed)
issues = self.repo.get_issues(state='all', sort='updated', direction='desc')

# Default: max_issues = 100

What gets scraped:

  • ✅ Open issues (relevant - active bugs/features)
  • ❌ Closed issues (often irrelevant historical noise)
  • ❌ Issue body (500 chars each × 100 issues = 50KB text)
  • ❌ All labels (metadata bloat)
  • ❌ All milestones (often outdated)
  • ❌ Created/updated/closed timestamps (unnecessary)

Example: React repo has 12,000+ closed issues. Why include these in a skill?

Why This Is Critical: Token Economy

The Token Problem

Users pay per token when using Claude:

  • Claude Pro: 200K context window; larger skills consume more of it and count against usage limits
  • API usage: Direct token costs ($$$)
  • Sonnet: ~$3 per million input tokens

Bloated skills = higher costs for users.

Current Token Waste

Example: Scraping facebook/react:

100 issues scraped:
- 100 titles × 50 chars = 5,000 chars
- 100 bodies × 500 chars = 50,000 chars
- 100 × labels/milestones/dates = 10,000 chars
Total: ~65,000 chars = ~16,000 tokens

Cost per skill use: ~$0.05 just from issues (16,000 tokens × $3 per million)
If 50% of those issues are closed: ~$0.025 wasted per use

Multiply by thousands of users = significant waste.
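
The arithmetic above can be reproduced with a small estimator. This is a rough sketch only: it assumes about 4 characters per token (a rule of thumb, not an exact tokenizer count) and the ~$3 per million input tokens quoted above. The function and constant names are illustrative and are not part of the scraper.

```python
CHARS_PER_TOKEN = 4            # assumption: rough rule of thumb for English text
USD_PER_MILLION_TOKENS = 3.0   # assumption: the Sonnet input price quoted above


def estimate_tokens(num_chars: int) -> int:
    """Approximate token count for a given number of characters."""
    return num_chars // CHARS_PER_TOKEN


def estimate_cost_usd(num_tokens: int) -> float:
    """Approximate input cost in USD for a given number of tokens."""
    return num_tokens * USD_PER_MILLION_TOKENS / 1_000_000


# Per-issue character budget from the breakdown above: title + body + metadata.
chars_per_issue = 50 + 500 + 100
total_chars = 100 * chars_per_issue            # 100 issues
total_tokens = estimate_tokens(total_chars)
total_cost = estimate_cost_usd(total_tokens)
```

Swapping in a different issue count or per-issue budget gives a quick way to compare the options discussed below before changing any defaults.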

Real-World Impact

User perspective:

  • "Why does this React skill cost 30K tokens?"
  • "Half of these issues are closed from 2019"
  • "I just wanted current API docs, not bug history"

What Should Be Scraped?

Option 1: Only Open Issues (Recommended)

# Only active issues - what users care about
issues = self.repo.get_issues(state='open', sort='updated', direction='desc')

Why:

  • ✅ Open = current, relevant information
  • ✅ Bugs to be aware of
  • ✅ Planned features
  • ✅ Active discussions
  • ❌ No historical noise

Token savings: 50-80% reduction in issue data
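
One detail worth handling alongside Option 1: the GitHub issues endpoint also returns pull requests, and PyGithub exposes a `pull_request` attribute that is `None` for plain issues. The sketch below caps the result and skips pull requests. The helper name `take_open_issues` is hypothetical, and it assumes `issues` is whatever iterable `get_issues(state='open', ...)` returned.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def take_open_issues(issues: Iterable[Any], max_issues: int = 20) -> Iterator[Any]:
    """Return an iterator over at most max_issues real issues.

    Items whose `pull_request` attribute is not None are treated as
    pull requests and skipped.
    """
    real_issues = (i for i in issues if getattr(i, "pull_request", None) is None)
    # islice stops consuming the source as soon as the cap is reached.
    return islice(real_issues, max_issues)
```

Because the filtering is lazy, a paginated source should only be read as far as needed to fill the cap.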

Option 2: Open + Recent Closed (Balanced)

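from datetime import datetime, timedelta, timezone

# Open issues + closed issues updated in the last 30 days.
# Note: `since` filters on the last-updated time, not the closed time.
cutoff = datetime.now(timezone.utc) - timedelta(days=30)
issues_open = self.repo.get_issues(state='open', sort='updated', direction='desc')
issues_closed_recent = self.repo.get_issues(
    state='closed',
    sort='updated',
    direction='desc',
    since=cutoff,
)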

Why:

  • ✅ Current active issues
  • ✅ Recently fixed bugs (still relevant)
  • ✅ Pre-release closed issues
  • ❌ No ancient history

Token savings: 40-60% reduction
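
Since the API's `since` parameter filters on update time, an issue closed years ago but commented on yesterday would still come back. A client-side check on `closed_at` can enforce the "closed in the last N days" rule. This is a sketch under assumptions: the helper name is hypothetical, issues are expected to expose `state` and `closed_at` as PyGithub issues do, and `closed_at` is assumed to be timezone-aware (naive datetimes would need normalizing first).

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Iterable, List, Optional


def open_or_recently_closed(
    issues: Iterable[Any],
    max_age_days: int = 30,
    now: Optional[datetime] = None,
) -> List[Any]:
    """Keep open issues plus issues closed within the last max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for issue in issues:
        if issue.state == "open":
            kept.append(issue)
        elif issue.closed_at is not None and issue.closed_at >= cutoff:
            kept.append(issue)
    return kept
```

Passing `now` explicitly keeps the function deterministic, which makes it easier to test.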

Option 3: Configurable (Most Flexible)

config = {
    "include_issues": True,
    "issue_state": "open",  # "open", "closed", "all"
    "issue_age_days": 30,   # Only closed issues from last N days
    "max_issues": 20,       # Reduced from 100
    "include_labels": False,
    "include_milestones": False,
    "include_issue_body": False   # Just title + URL
}

Why:

  • ✅ User controls token usage
  • ✅ Different use cases (docs vs bug tracking)
  • ✅ Defaults optimized for tokens
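
One possible way to apply such a config is to overlay user settings on token-lean defaults and validate the result. The sketch below is illustrative, not the project's implementation: `resolve_issue_config` and the constant names are hypothetical, and the key names follow the Proposed Changes section of this issue.

```python
from typing import Any, Dict

# Token-lean defaults proposed in this issue.
DEFAULT_ISSUE_CONFIG: Dict[str, Any] = {
    "include_issues": True,
    "issue_state": "open",
    "issue_age_days": 30,
    "max_issues": 20,
    "include_labels": False,
    "include_milestones": False,
    "include_issue_body": False,
}

VALID_STATES = {"open", "closed", "all"}


def resolve_issue_config(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay user settings on the defaults and reject invalid values."""
    resolved = dict(DEFAULT_ISSUE_CONFIG)
    # Only issue-related keys are copied; unrelated keys such as "repo" are ignored.
    resolved.update({k: v for k, v in user_config.items() if k in DEFAULT_ISSUE_CONFIG})
    if resolved["issue_state"] not in VALID_STATES:
        raise ValueError(f"issue_state must be one of {sorted(VALID_STATES)}")
    if resolved["max_issues"] < 0:
        raise ValueError("max_issues must be non-negative")
    return resolved
```

A user who wants the old behavior would pass `{"issue_state": "all", "max_issues": 100}` and get everything else from the defaults.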

What About Milestones and Labels?

Labels

Current: Scrapes all labels for every issue
Problem: Often irrelevant ("good first issue", "documentation", "enhancement")
Proposal: Make optional, default OFF

if config.get('include_labels', False):  # Default: False
    labels = [label.name for label in issue.labels]

Milestones

Current: Scrapes milestone title for every issue
Problem: Often outdated ("v1.0", "Q3 2022")
Proposal: Make optional, default OFF

if config.get('include_milestones', False):  # Default: False
    milestone = issue.milestone.title if issue.milestone else None

Issue Body

Current: First 500 chars of every issue body
Problem: 100 issues × 500 chars = 50KB of often-irrelevant text
Proposal: Make optional, reduce length, or skip entirely

body_length = config.get('issue_body_length', 0)  # Default: 0 (skip)
if body_length > 0:
    # issue.body is None when the issue has no description
    issue_data['body'] = (issue.body or '')[:body_length]
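
Taken together, the three toggles above could feed a single record builder. The sketch below assumes issue objects with PyGithub-style attributes (`title`, `html_url`, `state`, `labels`, `milestone`, `body`). The function name `build_issue_record` is hypothetical.

```python
from typing import Any, Dict


def build_issue_record(issue: Any, config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a minimal issue record, adding optional fields only on request."""
    record: Dict[str, Any] = {
        "title": issue.title,
        "url": issue.html_url,
        "state": issue.state,
    }
    if config.get("include_labels", False):
        record["labels"] = [label.name for label in issue.labels]
    if config.get("include_milestones", False):
        record["milestone"] = issue.milestone.title if issue.milestone else None
    body_length = config.get("issue_body_length", 0)
    if body_length > 0:
        # issue.body may be None when an issue has no description.
        record["body"] = (issue.body or "")[:body_length]
    return record
```

With an empty config the record carries only title, URL, and state, which matches the "minimal metadata" target described later in this issue.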

Proposed Changes

Immediate (Breaking but Necessary)

  1. Change default to open only:

    state = config.get('issue_state', 'open')  # Changed from 'all'
    issues = self.repo.get_issues(state=state, ...)

  2. Reduce max_issues:

    max_issues = config.get('max_issues', 20)  # Changed from 100

  3. Skip body by default:

    include_body = config.get('include_issue_body', False)  # New, default False

  4. Skip metadata by default:

    include_labels = config.get('include_labels', False)
    include_milestones = config.get('include_milestones', False)

Migration Path

Update existing configs to be explicit:

{
  "repo": "facebook/react",
  "include_issues": true,
  "issue_state": "open",
  "max_issues": 20,
  "include_labels": false,
  "include_milestones": false,
  "include_issue_body": false
}

Add warning for old behavior:

if config.get('issue_state') is None:
    logger.warning("No issue_state specified - defaulting to 'open' (changed from 'all')")
    logger.warning("To scrape closed issues, set 'issue_state': 'all' in config")

Expected Results

Token Reduction

Before (current):

  • 100 issues (50 open, 50 closed)
  • Full metadata (labels, milestones, body)
  • ~65,000 chars = ~16,000 tokens

After (proposed):

  • 20 issues (open only)
  • Minimal metadata (title, URL, state)
  • ~2,000 chars = ~500 tokens

Savings: 97% token reduction on issue data

User Impact

Before:

  • User: "Why is this skill so large?"
  • User: "Half these issues are from 2019"
  • User: "Too expensive to use"

After:

  • User: "Lean skill, just what I need"
  • User: "Only current, relevant info"
  • User: "Reasonable token usage"

Implementation Checklist

  • Change default issue_state from 'all' to 'open'
  • Reduce default max_issues from 100 to 20
  • Add include_issue_body config option (default: False)
  • Add include_labels config option (default: False)
  • Add include_milestones config option (default: False)
  • Add issue_age_days config option for recent closed issues
  • Update all example configs with new defaults
  • Update documentation about token optimization
  • Add CLI flags:
    • --include-closed-issues
    • --max-issues N
    • --include-issue-metadata
  • Add warning when using old defaults
  • Create migration guide for existing users
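
The CLI flags in the checklist could be wired up roughly as follows. This is a sketch only: the flag names come from the checklist, but how each flag maps onto config keys is an assumption (here `--include-closed-issues` selects `'all'`, and `--include-issue-metadata` turns on both labels and milestones), and the function names are hypothetical.

```python
import argparse
from typing import Any, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the issue-related flags only."""
    parser = argparse.ArgumentParser(description="Issue scraping options (sketch)")
    parser.add_argument("--include-closed-issues", action="store_true",
                        help="scrape closed issues as well as open ones")
    parser.add_argument("--max-issues", type=int, default=20, metavar="N",
                        help="maximum number of issues to scrape")
    parser.add_argument("--include-issue-metadata", action="store_true",
                        help="include labels and milestones")
    return parser


def args_to_config(argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Translate CLI flags into the config keys used in this issue."""
    args = build_parser().parse_args(argv)
    return {
        # Assumed mapping: the flag widens the state filter from 'open' to 'all'.
        "issue_state": "all" if args.include_closed_issues else "open",
        "max_issues": args.max_issues,
        "include_labels": args.include_issue_metadata,
        "include_milestones": args.include_issue_metadata,
    }
```

Running with no flags yields the lean defaults, so existing invocations would pick up the new behavior without changes.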

Priority

High - This directly impacts:

  • User costs (token usage)
  • Skill quality (signal vs noise)
  • Project adoption (bloated skills = bad UX)

Community Impact

Addressing this concern shows:

  • ✅ We listen to technical feedback
  • ✅ We respect token economy
  • ✅ We prioritize signal over noise
  • ✅ We care about user costs

"People have token usage concerns" is a critical insight - we should act on it.

Metadata

  • Assignees: No one assigned
  • Labels: bug, enhancement
  • Projects: No projects
  • Milestone: No milestone