Java Regular Expression

Zuned Ahmed edited this page Jul 13, 2014 · 6 revisions

What Are Regular Expressions?

Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set. They can be used to search, edit, or manipulate text and data.

Metacharacters

The metacharacters supported by this API are: <([{\^-=$!|]})?*+.>

There are two ways to force a metacharacter to be treated as an ordinary character:

  • precede the metacharacter with a backslash
  • or enclose it within \Q (which starts the quote) and \E (which ends it).
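
As a minimal sketch of both quoting styles, the snippet below compares an unescaped dot with the backslash form and the \Q...\E form. The class name EscapeDemo and its matches helper are illustrative names, not part of the original page; Pattern.quote is the standard library method that builds a quoted pattern for you.

```java
import java.util.regex.Pattern;

// Illustrative class name; not taken from the original page.
class EscapeDemo {
    // Returns true only if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unescaped: '.' is a metacharacter and matches any character.
        System.out.println(matches("1.5", "1x5"));        // true
        // Backslash escape (doubled inside a Java string literal).
        System.out.println(matches("1\\.5", "1x5"));      // false
        System.out.println(matches("1\\.5", "1.5"));      // true
        // \Q starts the quote and \E ends it.
        System.out.println(matches("\\Q1.5\\E", "1.5"));  // true
        // Pattern.quote produces a literal pattern for an arbitrary string.
        System.out.println(matches(Pattern.quote("a+b"), "a+b")); // true
    }
}
```

The backslash form is handy for a single character, while \Q...\E (or Pattern.quote) is usually easier when a longer literal contains several metacharacters.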

Character Classes

Construct Description
[abc] a, b, or c (simple class)
[^abc] Any character except a, b, or c (negation)
[a-zA-Z] a through z, or A through Z, inclusive (range)
[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]] d, e, or f (intersection)
[a-z&&[^bc]] a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] a through z, and not m through p: [a-lq-z] (subtraction)

Regular Expression Description
[bcr]at A three-character string that starts with b, c, or r and ends with at (bat, cat, rat)
[^bcr]at A three-character string that ends with at and starts with any character other than b, c, or r (for example, hat)
[0-4[6-8]] Union of two sets: a single character that is one of 0, 1, 2, 3, 4, 6, 7, 8
[0-9&&[345]] Intersection of two sets: a single character that is one of 3, 4, 5
[2-8&&[4-6]] Intersection of two sets: a single character that is one of 4, 5, 6
[0-9&&[^345]] Subtraction: a single digit from 0 to 9 other than 3, 4, or 5
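
The rows above can be checked with a short sketch like the following. CharClassDemo and its matches helper are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Pattern;

// Illustrative class name; not taken from the original page.
class CharClassDemo {
    // Returns true only if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Simple class and negation
        System.out.println(matches("[bcr]at", "cat"));        // true
        System.out.println(matches("[^bcr]at", "cat"));       // false
        System.out.println(matches("[^bcr]at", "hat"));       // true
        // Union: 0-4 or 6-8
        System.out.println(matches("[0-4[6-8]]", "7"));       // true
        System.out.println(matches("[0-4[6-8]]", "5"));       // false
        // Intersection: digits that are also in 3, 4, 5
        System.out.println(matches("[0-9&&[345]]", "4"));     // true
        System.out.println(matches("[0-9&&[345]]", "6"));     // false
        // Subtraction: digits other than 3, 4, 5
        System.out.println(matches("[0-9&&[^345]]", "6"));    // true
        System.out.println(matches("[0-9&&[^345]]", "4"));    // false
    }
}
```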

Predefined Character Classes

The Pattern API contains a number of useful predefined character classes, which offer convenient shorthands for commonly used regular expressions.

Construct Description
. Any character (may or may not match line terminators)
\d A digit: [0-9]
\D A non-digit: [^0-9]
\s A whitespace character: [ \t\n\x0B\f\r]
\S A non-whitespace character: [^\s]
\w A word character: [a-zA-Z_0-9]
\W A non-word character: [^\w]

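How do you use a construct with an escape character?

Constructs beginning with a backslash are called escaped constructs. Inside a Java string literal the backslash itself must be escaped with a second backslash, otherwise the string will not compile: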

private final String REGEX = "\\d"; // a single digit
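
A short sketch of the predefined classes in use, assuming nothing beyond the standard library. The class name PredefinedDemo and the helper names are illustrative only.

```java
import java.util.regex.Pattern;

// Illustrative class name; not taken from the original page.
class PredefinedDemo {
    // Returns true only if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Replaces every digit in the text with '#'.
    static String maskDigits(String text) {
        return text.replaceAll("\\d", "#");
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d", "7"));          // true
        System.out.println(matches("\\D", "7"));          // false
        System.out.println(matches("\\s", " "));          // true
        System.out.println(matches("\\w", "_"));          // true
        System.out.println(matches("\\W", "_"));          // false
        System.out.println(matches("\\d\\d\\d", "123"));  // true
        System.out.println(maskDigits("a1b22c"));         // a#b##c
    }
}
```

Note that every backslash in the regex is written twice in the Java source.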

Quantifiers

Differences Among Greedy, Reluctant, and Possessive Quantifiers:

  • Greedy quantifiers are considered "greedy" because they force the matcher to read in, or eat, the entire input string prior to attempting the first match. If the first match attempt (the entire input string) fails, the matcher backs off the input string by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from. Depending on the quantifier used in the expression, the last thing it will try matching against is 1 or 0 characters.
  • The reluctant quantifiers, however, take the opposite approach: They start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string.
  • Finally, the possessive quantifiers always eat the entire input string, trying once (and only once) for a match. Unlike the greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.

To illustrate, consider the input string xfooxxxxxxfoo.

 
Enter your regex: .*foo  // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.

The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f" "o" "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f" "o" "o") have already been consumed. So the matcher slowly backs off one letter at a time until the rightmost occurrence of "foo" has been regurgitated, at which point the match succeeds and the search ends.

The second example, however, is reluctant, so it starts by first consuming "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13.

The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+, leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.
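
The three runs can also be reproduced without an interactive console. The sketch below is one way to do it; QuantifierDemo and findAll are hypothetical names, and each result is formatted as the matched text followed by its start and end index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not taken from the original page.
class QuantifierDemo {
    // Collects every match as "text start-end".
    static List<String> findAll(String regex, String input) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            results.add(m.group() + " " + m.start() + "-" + m.end());
        }
        return results;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: one match covering the whole input.
        System.out.println(findAll(".*foo", input));   // [xfooxxxxxxfoo 0-13]
        // Reluctant: two shorter matches.
        System.out.println(findAll(".*?foo", input));  // [xfoo 0-4, xxxxxxfoo 4-13]
        // Possessive: .*+ keeps the whole input, so "foo" can never match.
        System.out.println(findAll(".*+foo", input));  // []
    }
}
```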

Greedy Reluctant Possessive Meaning
X? X?? X?+ X, once or not at all
X* X*? X*+ X, zero or more times
X+ X+? X++ X, one or more times
X{n} X{n}? X{n}+ X, exactly n times
X{n,} X{n,}? X{n,}+ X, at least n times
X{n,m} X{n,m}? X{n,m}+ X, at least n but not more than m times

Capturing Groups

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d" "o" and "g".
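
A minimal sketch of reading groups back from a match follows. GroupDemo and firstGroup are illustrative names; group(0) is the whole match and numbered groups start at 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not taken from the original page.
class GroupDemo {
    // Returns the text captured by group 1 of the first match, or null if there is no match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The group (dog) captures the letters d, o, g.
        System.out.println(firstGroup("(dog)", "hotdog stand"));   // dog
        // A quantifier after the parentheses applies to the whole group.
        System.out.println(Pattern.matches("(dog)+", "dogdogdog")); // true

        Matcher m = Pattern.compile("(\\d+)-(\\d+)").matcher("range 10-25");
        if (m.find()) {
            System.out.println(m.groupCount()); // 2
            System.out.println(m.group(0));     // 10-25
            System.out.println(m.group(1));     // 10
            System.out.println(m.group(2));     // 25
        }
    }
}
```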

Boundary Matchers

Boundary Construct Description
^ The beginning of a line
$ The end of a line
\b A word boundary
\B A non-word boundary
\A The beginning of the input
\G The end of the previous match
\Z The end of the input but for the final terminator, if any
\z The end of the input
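
The effect of each boundary matcher is easiest to see by counting matches. The following is a sketch under the assumption of default flags, except where the embedded (?m) flag turns on multiline mode; BoundaryDemo and count are hypothetical names. Without multiline mode, ^ and $ anchor to the input as a whole rather than to each line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not taken from the original page.
class BoundaryDemo {
    // Counts how many times the regex is found in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "dog doggie hotdog";
        System.out.println(count("dog", text));          // 3
        System.out.println(count("\\bdog\\b", text));    // 1 (only the standalone word)
        System.out.println(count("\\bdog", text));       // 2 (dog, doggie)
        System.out.println(count("dog\\B", text));       // 1 (doggie)
        // ^ and $ match per line only in multiline mode.
        System.out.println(count("^dog$", "dog\ndog"));      // 0
        System.out.println(count("(?m)^dog$", "dog\ndog"));  // 2
        // \G requires each match to start where the previous one ended.
        System.out.println(count("\\Gdog", "dogdog dog"));   // 2
        // \Z ignores a final line terminator, \z does not.
        System.out.println(count("dog\\Z", "hotdog\n"));     // 1
        System.out.println(count("dog\\z", "hotdog\n"));     // 0
    }
}
```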

How Are Regular Expressions Represented in java.util.regex Package?

The java.util.regex package primarily consists of three classes: Pattern, Matcher, and PatternSyntaxException.

  • A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, you must first invoke one of its public static compile methods, which then returns a Pattern object. These methods accept a regular expression as the first argument.
  • A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher by invoking the matcher method on a Pattern object.
  • A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.

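    import java.io.Console;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The class name is illustrative; the original snippet showed only main().
    public class RegexTestHarness {

        public static void main(String[] args) {
            // System.console() returns null when no interactive console is attached.
            Console console = System.console();
            if (console == null) {
                System.err.println("No console.");
                System.exit(1);
            }
            while (true) {
                // Compile the regular expression typed by the user.
                Pattern pattern =
                    Pattern.compile(console.readLine("%nEnter your regex: "));

                // Bind the compiled pattern to the input string.
                Matcher matcher =
                    pattern.matcher(console.readLine("Enter input string to search: "));

                boolean found = false;
                // Each successful find() advances to the next match.
                while (matcher.find()) {
                    console.format("I found the text" +
                        " \"%s\" starting at " +
                        "index %d and ending at index %d.%n",
                        matcher.group(),
                        matcher.start(),
                        matcher.end());
                    found = true;
                }
                if (!found) {
                    console.format("No match found.%n");
                }
            }
        }
    }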
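
The loop above lets a malformed regex propagate as a PatternSyntaxException, which ends the program. As a hedged sketch of how that exception could be caught instead, the snippet below validates a regex and reports the problem. SyntaxErrorDemo and check are hypothetical names; getDescription and getIndex are accessors on PatternSyntaxException.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Illustrative class name; not taken from the original page.
class SyntaxErrorDemo {
    // Returns null when the regex compiles, otherwise a short error message.
    static String check(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            // getIndex() is the approximate position of the error, or -1 if unknown.
            return e.getDescription() + " near index " + e.getIndex();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("[a-z]+")); // null: the pattern is valid
        System.out.println(check("[a-z"));   // a message describing the unclosed class
        System.out.println(check("(dog"));   // a message describing the unclosed group
    }
}
```

Because PatternSyntaxException is unchecked, the compiler does not force this try/catch; it is worth adding wherever the regex comes from user input.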

Let's see how to use the API:

Pattern pattern = Pattern.compile(console.readLine("%nEnter your regex: "));

Here, console.readLine() reads one line from the console; this is where you type the regular expression string.
To obtain the console, use java.io.Console console = System.console();

Matcher matcher = pattern.matcher(console.readLine("Enter input string to search: "));

Note that the Matcher object is obtained from the Pattern object: the matcher method takes the input string that will be matched against the regular expression.

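matcher.find()

find() scans the input for the next subsequence that matches the pattern and returns true if one is found; group(), start(), and end() then describe that match.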
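
System.console() returns null in many IDEs, so the following sketch runs the same compile, matcher, and find steps on fixed strings instead. FindDemo and search are hypothetical names; the messages mirror the ones printed by the console loop.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative class name; not taken from the original page.
class FindDemo {
    // Runs one search and returns the messages the console loop would print.
    static String search(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);

        StringBuilder report = new StringBuilder();
        while (matcher.find()) {
            report.append(String.format(
                "I found the text \"%s\" starting at index %d and ending at index %d.%n",
                matcher.group(), matcher.start(), matcher.end()));
        }
        if (report.length() == 0) {
            report.append(String.format("No match found.%n"));
        }
        return report.toString();
    }

    public static void main(String[] args) {
        System.out.print(search("foo", "xfooxxxxxxfoo"));
        System.out.print(search("bar", "xfooxxxxxxfoo"));
    }
}
```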

Regular Expression Examples
Another Good Tutorial
Good Read
