Description
Community Feedback
"I'm also concerned that it's scraping things that don't need to be scraped. 'GitHub Issues (open/closed, labels, milestones)'. Why are you scraping anything but open issues? I could see maybe scraping some closed issues that are recent (and prior to a release that hasn't arrived), but I'm not sure a lot of these things are relevant. People have token usage concerns."
The Problem
Current behavior (cli/github_scraper.py:403):
# Fetch recent issues (open + closed)
issues = self.repo.get_issues(state='all', sort='updated', direction='desc')
# Default: max_issues = 100
What gets scraped:
- ✅ Open issues (relevant - active bugs/features)
- ❌ Closed issues (often irrelevant historical noise)
- ❌ Issue body (500 chars each × 100 issues = 50KB text)
- ❌ All labels (metadata bloat)
- ❌ All milestones (often outdated)
- ❌ Created/updated/closed timestamps (unnecessary)
Example: React repo has 12,000+ closed issues. Why include these in a skill?
Why This Is Critical: Token Economy
The Token Problem
Skill size costs users money or quota on every plan:
- Claude Pro: flat subscription with usage limits and a 200K context window, so larger skills use up both faster
- API usage: Direct token costs ($$$)
- Sonnet: ~$3 per million input tokens
Bloated skills = higher costs for users.
Current Token Waste
Example: Scraping facebook/react:
100 issues scraped:
- 100 titles × 50 chars = 5,000 chars
- 100 bodies × 500 chars = 50,000 chars
- 100 × labels/milestones/dates = 10,000 chars
Total: ~65,000 chars = ~16,000 tokens (at roughly 4 chars per token)
Cost per skill use: ~$0.05 just from issues (16,000 tokens at ~$3 per million)
With 50% closed issues: ~$0.025 wasted per use
Multiply by thousands of users = significant waste.
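The estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch: the 4-characters-per-token ratio is an assumed average for English text, the per-issue metadata size of 100 chars is derived from the 10,000-char total above, and `estimate_issue_cost` is a hypothetical helper, not part of the scraper.

```python
CHARS_PER_TOKEN = 4              # assumption: rough average for English text
PRICE_PER_MILLION_INPUT = 3.00   # USD, the Sonnet figure quoted above


def estimate_issue_cost(n_issues, title_chars=50, body_chars=500, metadata_chars=100):
    """Return (chars, tokens, usd) for the issue data loaded with one skill use."""
    chars = n_issues * (title_chars + body_chars + metadata_chars)
    tokens = chars / CHARS_PER_TOKEN
    usd = tokens / 1_000_000 * PRICE_PER_MILLION_INPUT
    return chars, tokens, usd


chars, tokens, usd = estimate_issue_cost(100)
print(f"{chars:,} chars, about {tokens:,.0f} tokens, about ${usd:.3f} per use")
```

Changing `n_issues` or setting `body_chars=0` shows how much each proposal below would save.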
Real-World Impact
User perspective:
- "Why does this React skill cost 30K tokens?"
- "Half of these issues are closed from 2019"
- "I just wanted current API docs, not bug history"
What Should Be Scraped?
Option 1: Only Open Issues (Recommended)
# Only active issues - what users care about
issues = self.repo.get_issues(state='open', sort='updated', direction='desc')
Why:
- ✅ Open = current, relevant information
- ✅ Bugs to be aware of
- ✅ Planned features
- ✅ Active discussions
- ❌ No historical noise
Token savings: 50-80% reduction in issue data
Option 2: Open + Recent Closed (Balanced)
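from datetime import datetime, timedelta, timezone

# Open issues + closed issues from the last 30 days
issues_open = repo.get_issues(state='open', sort='updated')
issues_closed_recent = repo.get_issues(
    state='closed',
    sort='updated',
    # Note: `since` filters on last-updated time, not on closed time
    since=datetime.now(timezone.utc) - timedelta(days=30),
)
Why: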
- ✅ Current active issues
- ✅ Recently fixed bugs (still relevant)
- ✅ Pre-release closed issues
- ❌ No ancient history
Token savings: 40-60% reduction
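Because GitHub's `since` parameter matches on last-updated time, an old closed issue that receives a new comment would still be returned. If the intent is "closed in the last N days", a client-side check on the close date is one way to get it. The sketch below uses a hypothetical `IssueRecord` stand-in rather than real PyGithub objects, so the names are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass
class IssueRecord:
    """Hypothetical stand-in for the fields the scraper reads from an issue."""
    title: str
    state: str                          # "open" or "closed"
    closed_at: Optional[datetime] = None


def select_relevant(issues: Iterable[IssueRecord],
                    max_age_days: int = 30,
                    now: Optional[datetime] = None) -> List[IssueRecord]:
    """Keep open issues plus issues closed within the last max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for issue in issues:
        if issue.state == "open":
            kept.append(issue)
        elif issue.closed_at is not None and issue.closed_at >= cutoff:
            kept.append(issue)
    return kept
```

Passing `now` explicitly keeps the function deterministic in tests; the scraper itself would leave it unset.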
Option 3: Configurable (Most Flexible)
config = {
    "include_issues": True,
    "issue_state": "open",        # "open", "closed", "all"
    "issue_age_days": 30,         # Only closed issues from last N days
    "max_issues": 20,             # Reduced from 100
    "include_labels": True,
    "include_milestones": False,
    "include_issue_body": False,  # Just title + URL
}
Why:
- ✅ User controls token usage
- ✅ Different use cases (docs vs bug tracking)
- ✅ Defaults optimized for tokens
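To make the effect of these switches concrete, here is a minimal sketch of how a per-issue record could be built from such a config. It works on plain dicts instead of PyGithub objects, `serialize_issue` and `BODY_PREVIEW_CHARS` are hypothetical names, and the body switch is read from `include_issue_body`, the spelling used in the migration config further down.

```python
from typing import Any, Dict, Optional

BODY_PREVIEW_CHARS = 500  # mirrors the current 500-character truncation


def serialize_issue(issue: Dict[str, Any],
                    config: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return only the fields the config asks for; title, URL and state always stay."""
    config = config or {}
    data = {"title": issue["title"], "url": issue["url"], "state": issue["state"]}
    if config.get("include_labels", False):
        data["labels"] = list(issue.get("labels") or [])
    if config.get("include_milestones", False):
        data["milestone"] = issue.get("milestone")
    if config.get("include_issue_body", False):
        # Issues can have an empty body, so fall back to "" before slicing.
        data["body"] = (issue.get("body") or "")[:BODY_PREVIEW_CHARS]
    return data


example = {
    "title": "Crash on startup",
    "url": "https://example.invalid/issues/1",
    "state": "open",
    "labels": ["bug"],
    "milestone": "v1.0",
    "body": None,
}
print(serialize_issue(example))
print(serialize_issue(example, {"include_labels": True}))
```

With an empty config the record shrinks to three short fields, which is where most of the token savings would come from.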
What About Milestones and Labels?
Labels
Current: Scrapes all labels for every issue
Problem: Often irrelevant ("good first issue", "documentation", "enhancement")
Proposal: Make optional, default OFF
if config.get('include_labels', False):  # Default: False
    labels = [label.name for label in issue.labels]
Milestones
Current: Scrapes milestone title for every issue
Problem: Often outdated ("v1.0", "Q3 2022")
Proposal: Make optional, default OFF
if config.get('include_milestones', False):  # Default: False
    milestone = issue.milestone.title if issue.milestone else None
Issue Body
Current: First 500 chars of every issue body
Problem: 100 issues × 500 chars = 50KB of often-irrelevant text
Proposal: Make optional, reduce length, or skip entirely
body_length = config.get('issue_body_length', 0)  # Default: 0 (skip)
if body_length > 0:
    # issue.body can be None for issues without a description
    issue_data['body'] = (issue.body or '')[:body_length]
Proposed Changes
Immediate (Breaking but Necessary)
- Change default to open only:
  state = config.get('issue_state', 'open')  # Changed from 'all'
  issues = self.repo.get_issues(state=state, ...)
- Reduce max_issues:
  max_issues = config.get('max_issues', 20)  # Changed from 100
- Skip body by default:
  include_body = config.get('include_issue_body', False)  # New, default False
- Skip metadata by default:
  include_labels = config.get('include_labels', False)
  include_milestones = config.get('include_milestones', False)
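The four changes above could also live in one place instead of being scattered across `config.get` calls. The following is a sketch under that assumption; `NEW_DEFAULTS` and `resolve_issue_config` are hypothetical names, and the validation of `issue_state` is an addition not requested in this issue.

```python
NEW_DEFAULTS = {
    "issue_state": "open",        # changed from 'all'
    "max_issues": 20,             # changed from 100
    "include_issue_body": False,
    "include_labels": False,
    "include_milestones": False,
}

VALID_STATES = {"open", "closed", "all"}


def resolve_issue_config(config):
    """Return a new dict with the proposed defaults filled in; input is not mutated."""
    resolved = {**NEW_DEFAULTS, **config}
    if resolved["issue_state"] not in VALID_STATES:
        raise ValueError(f"issue_state must be one of {sorted(VALID_STATES)}")
    return resolved


print(resolve_issue_config({"repo": "facebook/react"}))
print(resolve_issue_config({"repo": "facebook/react", "issue_state": "all", "max_issues": 100}))
```

The second call shows how an existing user could opt back into the old behavior explicitly.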
Migration Path
Update existing configs to be explicit:
{
  "repo": "facebook/react",
  "include_issues": true,
  "issue_state": "open",
  "max_issues": 20,
  "include_labels": false,
  "include_milestones": false,
  "include_issue_body": false
}
Add warning for old behavior:
if config.get('issue_state') is None:
    logger.warning("No issue_state specified - defaulting to 'open' (changed from 'all')")
    logger.warning("To scrape closed issues, set 'issue_state': 'all' in config")
Expected Results
Token Reduction
Before (current):
- 100 issues (50 open, 50 closed)
- Full metadata (labels, milestones, body)
- ~65,000 chars = ~16,000 tokens
After (proposed):
- 20 issues (open only)
- Minimal metadata (title, URL, state)
- ~2,000 chars = ~500 tokens
Savings: 97% token reduction on issue data
User Impact
Before:
- User: "Why is this skill so large?"
- User: "Half these issues are from 2019"
- User: "Too expensive to use"
After:
- User: "Lean skill, just what I need"
- User: "Only current, relevant info"
- User: "Reasonable token usage"
Implementation Checklist
- Change default issue_state from 'all' to 'open'
- Reduce default max_issues from 100 to 20
- Add include_issue_body config option (default: False)
- Add include_labels config option (default: False)
- Add include_milestones config option (default: False)
- Add issue_age_days config option for recent closed issues
- Update all example configs with new defaults
- Update documentation about token optimization
- Add CLI flags:
  - --include-closed-issues
  - --max-issues N
  - --include-issue-metadata
- Add warning when using old defaults
- Create migration guide for existing users
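For the CLI flags in the checklist, a minimal `argparse` sketch might look like this. The flag names come from the checklist; how they map onto config keys (`args_to_config`) is an assumption, in particular treating `--include-issue-metadata` as a switch for both labels and milestones.

```python
import argparse


def build_parser():
    """Build a parser with the three proposed issue-scraping flags."""
    parser = argparse.ArgumentParser(description="Sketch of the proposed issue flags")
    parser.add_argument("--include-closed-issues", action="store_true",
                        help="scrape closed issues as well as open ones")
    parser.add_argument("--max-issues", type=int, default=20, metavar="N",
                        help="maximum number of issues to scrape (default: 20)")
    parser.add_argument("--include-issue-metadata", action="store_true",
                        help="include labels and milestones")
    return parser


def args_to_config(args):
    """Translate parsed flags into the config keys proposed in this issue."""
    return {
        "issue_state": "all" if args.include_closed_issues else "open",
        "max_issues": args.max_issues,
        "include_labels": args.include_issue_metadata,
        "include_milestones": args.include_issue_metadata,
    }


args = build_parser().parse_args(["--max-issues", "50", "--include-closed-issues"])
print(args_to_config(args))
```

Running with no flags yields the lean defaults, so the token-heavy behavior is always an explicit opt-in.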
Related Issues
- #168 - feat: Add modern Python packaging (pyproject.toml + uv support) (related to overall quality)
- F2.1-F2.5 - Incremental updates (related to efficiency)
Priority
High - This directly impacts:
- User costs (token usage)
- Skill quality (signal vs noise)
- Project adoption (bloated skills = bad UX)
Community Impact
Addressing this concern shows:
- ✅ We listen to technical feedback
- ✅ We respect token economy
- ✅ We prioritize signal over noise
- ✅ We care about user costs
"People have token usage concerns" is a critical insight - we should act on it.