A Java static method to detect text that has been marked up with HTML tags or entities.
I needed to detect self-contained HTML tags or entities in user-supplied data to make formatting decisions. After searching the Internet, I found a few examples written as regular expressions. Most of them failed my initial test cases, and none handled conditions such as text that contains HTML entity escape codes but no tags.
I continued to refine the regular expression until I came up with a good meta expression that handled:
- Start and End tag combinations in single or multi-line text values.
- Text marked up with self-closing tags such as <br/> or <hr/>
- Text marked up with HTML entity escape sequences like &lt; or &frac12;
I also wanted to make sure that it didn't match other common text that might be misinterpreted as HTML:
- Logic expressions such as: "If A<B then B>A"
- Ampersand usage: AT&T, D&B, etc...
- Malformed or partial HTML: </body></html>
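To make the rules above concrete, here is a minimal sketch of how a combined expression along these lines might be put together. It is an illustration only, not the DetectHtml class itself: the class name HtmlDetectorSketch, the constant names, and the exact sub-patterns are all assumptions made for this example.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only (hypothetical names); not the actual DetectHtml class. */
final class HtmlDetectorSketch {

    // Zero or more attributes: a name, optionally followed by =value
    // (double-quoted, single-quoted, or unquoted).
    private static final String ATTRIBUTES =
        "(?:\\s+[\\w-]+(?:\\s*=\\s*(?:\"[^\"]*\"|'[^']*'|[^'\"\\s>]+))?)*";

    private static final String TAG_START = "<\\w+" + ATTRIBUTES + "\\s*>";
    private static final String TAG_END = "</\\w+\\s*>";
    private static final String TAG_SELF_CLOSING = "<\\w+" + ATTRIBUTES + "\\s*/>";

    // Named (&lt;), decimal (&#189;), or hexadecimal (&#xBD;) entity.
    private static final String ENTITY =
        "&(?:[a-zA-Z][a-zA-Z0-9]+|#[0-9]+|#[xX][0-9a-fA-F]+);";

    // DOTALL lets ".*" span line breaks between a start tag and an end tag.
    private static final Pattern HTML = Pattern.compile(
        "(?:" + TAG_START + ".*" + TAG_END + ")"
            + "|(?:" + TAG_SELF_CLOSING + ")"
            + "|(?:" + ENTITY + ")",
        Pattern.DOTALL);

    private HtmlDetectorSketch() { }

    /** Returns true if the text contains a start/end tag pair, a self-closing tag, or an entity. */
    static boolean isHtml(String text) {
        return text != null && HTML.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a href=\"http://www.example.com/\">\nclick here\n</a>",
            "line one<br/>line two",
            "1 &lt; 2",
            "If A<B then B>A",
            "AT&T, D&B, etc...",
            "</body></html>"
        };
        for (String sample : samples) {
            System.out.println(isHtml(sample) + " <- " + sample.replace("\n", "\\n"));
        }
    }
}
```

This sketch only requires some start tag to be followed by some end tag; it does not check that the tag names match, so it is a detection heuristic and not a validator. A lone end tag or a lone start tag is deliberately not enough to count as HTML.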
String htmlContent = "<a href=\"http://www.example.com/\">\nclick here\n</a>";
if (DetectHtml.isHtml(htmlContent)) {
    System.out.println("htmlContent is HTML");
}

Please Note:
This does not check user-provided HTML for safety in any way. You still need to sanitize your HTML, and I recommend using OWASP's tools to do so.
No dependencies required. Just copy the class into your project and you're done.
--Dave