CalbucciLib.HtmlParser

CalbucciLib.HtmlParser is a SAX-like, HTML5-compliant parser for .NET. It reports text, open tags, and close tags through callbacks as it reads the document, so HTML can be parsed quickly without building a DOM.

Typical Scenarios

  • Scrape pieces of HTML
  • Detect META / LINK tags (e.g. Open Graph tags)
  • Optimize the output HTML (remove whitespace, clear empty tags)
  • Detect HTML syntax errors and notify developers
  • Extract text from the HTML

Sample

Get the RSS Feed of a website

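			// Scan the document and stop as soon as the RSS <link> tag is found.
			var parser = new HtmlParser(TestContent.BlogPost);
			string rssFeed = null;
			// The text callback is not needed here, so null is passed in its place.
			parser.Parse(null, (HtmlElement element, bool isEmptyTag) =>
			{
				if (element.TagName == "link")
				{
					if (element.GetAttributeValue("type") == "application/rss+xml")
					{
						rssFeed = element.GetAttributeValue("href");
						parser.Stop(); // skip the rest of the document
					}
				}
			});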
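
The same open-tag callback can cover the "Detect META / LINK tags" scenario. The following is a minimal sketch that collects Open Graph tags into a dictionary. It uses only the calls shown in the sample above (`Parse`, `TagName`, `GetAttributeValue`); the `CalbucciLib` namespace, the `OpenGraphSketch` class, and the `ReadOpenGraphTags` helper are assumptions made for illustration, and the null-or-empty check is there because the README does not say what `GetAttributeValue` returns for a missing attribute.

```csharp
using System;
using System.Collections.Generic;
using CalbucciLib; // assumed namespace; adjust to match the installed package

public static class OpenGraphSketch
{
    // Hypothetical helper: returns every <meta property="og:..."> value found.
    public static Dictionary<string, string> ReadOpenGraphTags(string html)
    {
        var tags = new Dictionary<string, string>();
        var parser = new HtmlParser(html);

        // As in the RSS sample, null is passed for the text callback.
        parser.Parse(null, (HtmlElement element, bool isEmptyTag) =>
        {
            if (element.TagName != "meta")
                return;

            string property = element.GetAttributeValue("property");
            if (string.IsNullOrEmpty(property)
                || !property.StartsWith("og:", StringComparison.Ordinal))
                return;

            // If a property repeats, the last value wins in this sketch.
            tags[property] = element.GetAttributeValue("content");
        });

        return tags;
    }
}
```

Unlike the RSS sample, this sketch never calls `parser.Stop()`, because more than one matching tag is expected.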

Remove whitespace

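			// Requires: using System.Net; using System.Text;
			var parser = new HtmlParser(html);
			parser.PreserveCRLFTab = false;
			StringBuilder sb = new StringBuilder(html.Length);
			parser.Parse(
				// Text callback: re-encode the text before writing it back out.
				(text, parent) =>
				{
					sb.Append(WebUtility.HtmlEncode(text));
				},
				// Open-tag callback: write the open tag.
				(parent, isEmptyTag) =>
				{
					sb.Append(parent.GetOpenTag(false, false));
				},
				// Close-tag callback: write the matching close tag.
				(closeTag) =>
				{
					sb.AppendFormat("</{0}>", closeTag);
				});
			string optimizedHtml = sb.ToString();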
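
The three-callback form above also suggests a way to handle the "Extract text from the HTML" scenario: keep the text callback and ignore the tags. This is a rough sketch under a few assumptions. The `CalbucciLib` namespace, the `TextExtractionSketch` class, and the `ExtractText` helper are hypothetical, and the text is appended as-is on the assumption that the parser hands back decoded text (the sample above re-encodes it with `WebUtility.HtmlEncode`, which hints at that).

```csharp
using System.Text;
using CalbucciLib; // assumed namespace; adjust to match the installed package

public static class TextExtractionSketch
{
    // Hypothetical helper: concatenates every text node reported by the parser.
    public static string ExtractText(string html)
    {
        var sb = new StringBuilder(html.Length);
        var parser = new HtmlParser(html);

        parser.Parse(
            // Text callback: keep the text, ignore the parent element.
            (text, parent) => { sb.Append(text); },
            // Open-tag and close-tag callbacks: nothing to do.
            (parent, isEmptyTag) => { },
            (closeTag) => { });

        return sb.ToString();
    }
}
```

The samples do not show whether text inside `script` or `style` elements reaches the text callback, so a real extractor may need to check `parent` and skip those elements.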
