@Scriptwonder

Unity JSON config upload for scraping the corresponding Unity skills

Unity JSON upload
@yusufkaraaslan
Owner

Hi @Scriptwonder!

Thank you for this comprehensive Unity configuration! This is one of the most detailed configs we've received. 🎉

I've reviewed the PR and found it to be well-structured overall. However, I have a few concerns that need to be addressed before merging:

❌ Critical Issue: max_pages too high

"max_pages": 50000  // This would take 10+ hours to scrape!

Unity docs have approximately 15,000-20,000 pages total. With this setting, scraping would run for 10+ hours and create unnecessarily large skills. I recommend:

  • If using split strategy: "max_pages": 20000 (will split into ~5 skills of 4,000 pages each)
  • If NOT using split: "max_pages": 5000 (more manageable single skill)

For reference, our Godot config (similar game engine) uses "max_pages": 500 and works well.
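
To make the numbers above concrete, here is a rough sketch of the arithmetic. It assumes the split strategy cuts at a fixed page count (4,000, taken from the estimate above) and that the only known cost per page is the 0.3s rate limit, so the hours it prints are a lower bound, not a measured runtime. `estimate_scrape` and `pages_per_skill` are illustrative names, not part of the codebase.

```python
import math


def estimate_scrape(max_pages: int, rate_limit_s: float, pages_per_skill: int) -> dict:
    """Lower bound on scrape time and the number of sub-skills after splitting."""
    # Only the configured delay between requests is counted here; download
    # and parsing time come on top, so real runs will take longer.
    min_hours = max_pages * rate_limit_s / 3600
    # Assumption: the split strategy produces chunks of a fixed page count.
    sub_skills = math.ceil(max_pages / pages_per_skill)
    return {"min_hours": round(min_hours, 2), "sub_skills": sub_skills}


for cap in (50000, 20000, 5000):
    print(cap, estimate_scrape(cap, rate_limit_s=0.3, pages_per_skill=4000))
```

Even counting nothing but the delay, the 50000 cap costs several hours and would not produce the ~5 sub-skills that 20000 does.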

⚠️ Unsupported Fields

These fields are not yet implemented in the codebase and will be silently ignored:

  • version_handling
  • special_handling
  • content_processing

I verified this by checking the source code - these fields aren't processed anywhere. Options:

  1. Remove them (recommended for clarity - avoids confusion about supported features)
  2. Prefix with _future_ to mark as planned features (e.g., _future_version_handling)
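
If it helps, both options can be applied with a few lines of Python. This is only a sketch: it assumes the three fields sit at the top level of the config, and the `example` dict and its values are made up for illustration rather than copied from the PR.

```python
import json

# Fields named above as not yet processed by the scraper.
UNSUPPORTED = ("version_handling", "special_handling", "content_processing")


def clean_config(config: dict, mode: str = "remove") -> dict:
    """Return a copy of config with unsupported top-level fields removed or renamed."""
    if mode not in ("remove", "prefix"):
        raise ValueError("mode must be 'remove' or 'prefix'")
    cleaned = {}
    for key, value in config.items():
        if key in UNSUPPORTED:
            if mode == "prefix":
                # Option 2: keep the data but mark it as a planned feature.
                cleaned["_future_" + key] = value
            continue  # Option 1: drop the field entirely.
        cleaned[key] = value
    return cleaned


# Hypothetical config fragment, for illustration only.
example = {"name": "unity", "max_pages": 5000, "version_handling": {"default": "latest"}}
print(json.dumps(clean_config(example, mode="prefix"), indent=2))
```

With `mode="remove"` the same call returns only `name` and `max_pages`.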

✅ What's Great

  • Excellent category organization (17 categories - most comprehensive we have!)
  • Proper use of split_strategy and checkpoint (these ARE fully supported ✓)
  • Correct selectors and URL patterns (I verified div.content exists on Unity docs)
  • Good rate limiting (0.3s is appropriate)
  • Thoughtful exclusions (legacy versions 2017-2020)
  • 11 well-chosen start URLs covering all major Unity documentation sections

🔧 Requested Changes

Could you please update the PR to:

  1. Reduce max_pages to a realistic value:

    • 5000 for single skill
    • 20000 if using the split strategy (will create ~5 sub-skills)
  2. Remove or rename unsupported fields:

    • Delete version_handling, special_handling, content_processing
    • OR prefix them with _future_ to indicate they're planned features
  3. (Optional but recommended) Test the config with a small scrape first:

    skill-seekers scrape --config configs/unity.json --max-pages 100 --dry-run

📊 Comparison

For context, here are max_pages values from similar configs:

  • Godot (game engine): 500
  • React (framework): 100
  • FastAPI (framework): 100
  • Django (framework): 200

Once these changes are made, this will be an excellent addition to our config collection! Unity is a highly requested game engine and your comprehensive categorization will make for a great skill.

Let me know if you have any questions or need help with the changes. Thanks again for contributing! 🚀

Owner

@yusufkaraaslan yusufkaraaslan left a comment

Requesting changes as outlined in the review comment above. The config is excellent overall but needs:

  1. Reduced max_pages (50000 → 5000 or 20000)
  2. Removal of unsupported fields (version_handling, special_handling, content_processing), or renaming them with the _future_ prefix

Please update when you have a chance. Happy to help if you have questions!
