It uses a Cloudflare's HTMLRewriter to scrape the page for metadata and returns it as JSON.
- super fast metadata scraping using rules based on Metascraper's rules to pick out the metadata for a given property. These rules can easily be modified to suit your needs, they are defined in
src/scraper-rules.ts - resolves short urls (e.g. https://t.co/wy9S5P0Cd2) and unshortens them
- option to clean url tracking params (e.g.
utm_*andfbclidusing thecleanUrlquery param, see below) - tries to infer the link type and returns that in the
urlTypeproperty, will be one of:'link' | 'video' | 'audio' | 'recipe' | 'image' | 'document' | 'article' | 'game' | 'book' | 'event' | 'product' | 'note' | 'file' - the worker randomises user-agent strings to avoid being blocked
url- the URL to scrapecleanUrl- iftrue, the URL will be cleaned up to remove any tracking params
Once deployed to Cloudflare, add a url query param for the URL you want to scrape, e.g.
# Basic example
https://your-worker-name.cloudflare.com/?url=https://zander.wtf
# Example with `cleanUrl=true`
https://your-worker-name.cloudflare.com/?cleanUrl=true&url=https://poetsroad.bandcamp.com/?from=search&search_item_id=1141951669&search_item_type=b&search_match_part=%3F&search_page_id=1748155363&search_page_no=1&search_rank=1&search_sig=a9a9cbdfc454df7c2999f097dc8a216bResponse:
From my website (https://zander.wtf):
{
"title": "Hi! I'm Zander, I make websites",
"description": "Zander Martineau's personal site. I'm a contractor with 15+ years of experience helping companies get products to market, rewriting apps, creating POCs and more. I specialize in front-end but also work full-stack.",
"author": "Zander Martineau",
"image": "https://zander.wtf/opengraph.jpg",
"feeds": [
"https://zander.wtf/blog.rss.xml",
"https://zander.wtf/links.rss.xml"
],
"date": "2023-09-07T00:00:00.000Z",
"lang": "en",
"logo": "",
"video": "",
"keywords": "",
"jsonld": "",
"cleaned_url": "https://zander.wtf",
"url": "https://zander.wtf",
"urlType": "link"
}From a YouTube video:
{
"title": "World's Best FPV Drone Shot? (extreme mountain biking) - YouTube",
"description": "Dive into the hardest mountain bike race through the eyes of an intense FPV drone shot. The @dutchdronegods followed Kade Edwards down the Red Bull Hardline ...",
"author": "",
"image": "https://i.ytimg.com/vi/ctEksNz7tqg/maxresdefault.jpg",
"feeds": [],
"date": "2023-09-19T07:00:07-07:00",
"lang": "en",
"logo": "",
"video": "https://www.youtube.com/embed/ctEksNz7tqg",
"keywords": "red bull, redbull, action sports, extreme sports, sport videos, action, sport, red bull bike, bike, downhill, pov, mtb, pov mtb, urban downhill, urban, downhill mtb, urban downhill racing, racing, DRONE, drone, fpv drone, dutch drone gods, drone shot, hardline, red bull hardline, hardest mountain bike race, hardest race, hard line, hardest drone shot, downhill mountain bike, downhill race, hardest mountain bike, hardest mtb, kade edwards, kade, edwards, wales, welsh, one shot",
"jsonld": {
"@context": "http://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{
"@type": "ListItem",
"position": 1,
"item": {
"@id": "http://www.youtube.com/@redbull",
"name": "Red Bull"
}
}
]
},
"url": "https://www.youtube.com/watch?v=ctEksNz7tqg",
"urlType": "video"
}- Run
npm start - The test file in
src/test/index.htmlcan be used to test the worker locally. Runnpm run serve:testto start a local server and then run a GET request againsthttp://127.0.0.1:8787/?url=http://localhost:1234to view the output.
Run npm run test to run the small suite of unit tests.
Scraping metadata from a page is a tricky business, so if you have any suggestions or improvements, please open an issue or submit a PR, they are always welcome!
Made by Zander • zander.wtf • GitHub • Mastodon