dbennett455/DetectHtml

DetectHtml

A Java static method to detect text that has been marked up with HTML tags or entities.

I needed to detect self-contained HTML tags or entities in user-supplied data to make formatting determinations. After searching the Internet, I found a few examples written as regular expressions. Most of them failed my initial test cases and didn't handle conditions such as text that has no tags but does contain HTML entity escape codes.

I continued to refine the regular expression until I came up with a good meta expression that handles:

  • Start and end tag combinations in single-line or multi-line text values.
  • Text marked up with self-closing tags such as <br/> or <hr/>.
  • Text marked up with HTML entity escape sequences like &lt; or &frac12;.

I also wanted to make sure that it doesn't match other common text phrases that might be misinterpreted as HTML:

  • Logic expressions such as: "If A<B then B>A"
  • Ampersand usage: AT&T, D&B, etc.
  • Malformed or partial HTML: </body></html>
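To make the cases above concrete, here is a minimal sketch of how such a combined expression could be put together. It is not the expression used by DetectHtml; the class name HtmlDetectSketch, the method looksLikeHtml, and the three sub-patterns are assumptions made for illustration only, and they cover just the cases listed above.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT the DetectHtml class or its regex.
final class HtmlDetectSketch {

    // A start tag followed later by its matching end tag, e.g. <a href="...">text</a>.
    // The backreference \1 requires the end tag to use the same name as the start tag.
    private static final String TAG_PAIR =
            "<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>.*?</\\1\\s*>";

    // A self-closing tag such as <br/> or <hr />.
    private static final String SELF_CLOSING =
            "<[A-Za-z][A-Za-z0-9]*\\b[^>]*/>";

    // A named, decimal, or hexadecimal entity; the trailing semicolon is required,
    // which is what keeps "AT&T" and "D&B" from matching.
    private static final String ENTITY =
            "&(?:[A-Za-z][A-Za-z0-9]{1,31}|#[0-9]{1,7}|#[xX][0-9A-Fa-f]{1,6});";

    // DOTALL lets ".*?" span line breaks, so multi-line start/end pairs match.
    private static final Pattern HTML = Pattern.compile(
            TAG_PAIR + "|" + SELF_CLOSING + "|" + ENTITY, Pattern.DOTALL);

    private HtmlDetectSketch() {}

    // Returns true if any of the three sub-patterns is found anywhere in the text.
    static boolean looksLikeHtml(String text) {
        return text != null && HTML.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a href=\"http://www.example.com/\">\nclick here\n</a>",
            "line one<br/>line two",
            "&frac12; cup of sugar",
            "If A<B then B>A",
            "AT&T, D&B, etc.",
            "</body></html>"
        };
        for (String s : samples) {
            System.out.println(looksLikeHtml(s) + " : " + s.replace("\n", "\\n"));
        }
    }
}
```

In this sketch the first three samples are reported as HTML and the last three are not. Unbalanced end tags are rejected because every tag alternative must begin with "<" followed directly by a letter, and a lone "<" or ">" in a logic expression never completes a tag pair or a self-closing tag.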

Sample Usage

    String htmlContent="<a href=\"http://www.example.com/\">\nclick here\n</a>";
    if (DetectHtml.isHtml(htmlContent))
      System.out.println("htmlContent is HTML");

Please Note:

This does not check user-provided HTML for safety in any way. You still need to sanitize your HTML; I recommend OWASP for that.


No dependencies are required. Just refactor the class into your project and you're done.

--Dave
