Lab 24: Regex

A. Intro

In this lab you'll be learning to work with regular expressions, often abbreviated as regex. Regular expressions are a special kind of syntax that allow easy searching of long documents for Strings that match special patterns. Most general purpose programming languages you come across will allow you to use regex, so rest assured that what you learn today will not only apply to Java. Regex is an essential tool for any programmer!

A Motivating Example

Imagine you have a document, like so:

Dear Bob,

The time for revolution is fast approaching. We must act quickly before the government gets wind of our plans. 
Please forward this message to our trusty allies, eve@shadywebsite.net and mallory@vivalarevolucion.com.
And don't let anyone else see this message!!

Sincerely,
Alice

You'd like to create a program that reads in documents like this and ouputs a list of all email addresses contained within the document. How would you do it?

If we were to give you this task before today, you might start off like so: First, split the document into its separate words by looking for spaces. Then for each word, check if it contains an @ sign. But only one @ sign. And there has to be text before and after the @ sign. And that text could include numbers, and letters before (and a couple of other things), but only letters after. And for sure it must start with a letter. And there can be multiple periods, but you couldn't have two periods in a row.

See the problem? If you were to write this program, it would involve a lot of crazy conditionals all over the place, and be quite long. But it's such a mundane task. Certainly we can do better, right?

Luckily, regex steps up to the challenge. You can solve this problem essentially using a single line of regex. This particular regex is a bit complicated, so we'll start more simply. But we'll get there.

Checking if a String Matches a Pattern

The simplest way we can use regex is to check if a String matches a pattern. The String class has a method for this. Here's an example.

String wugPattern = "wug";
String w = "wug";
System.out.println(w.matches(wugPattern)); // prints true
w = "love";
System.out.println(w.matches(wugPattern)); // prints false

If this is all we could do with regex, it would be pretty useless. Luckily, regex patterns can be more general than just specific sequences of characters.

B. Character Classes

Character Classes

In the previous example, we asked if a String matched a sequence of characters. We can add to this, and ask if a String matches a sequences of characters and character clases. The fastest way to explain a character class is probably just to show it.

String pattern = "w[aeiouy]g";
String w = "wug";
System.out.println(w.matches(pattern)); // prints true
w = "wog";
System.out.println(w.matches(pattern)); // also true!
w = "waeg";
System.out.println(w.matches(pattern)); // false

What's going on? The String "wug" doesn't look anything like the String "w[aeiouy]g"! That's because square brackets are a special symbol in regex. A string will match a regex pattern if it matches any one character inside the brackets. You can think of it as a big OR symbol. This pattern is saying: first match exactly one w, then match exactly one of a, e, i, o, u, or y, then match exactly one g.

What if you wanted to match all the letters of the alphabet, a - z? You could do:

String pattern = "[abcdefghijklmnopqrstuvwxyz]ug";
String w = "hug";
System.out.println(w.matches(pattern)); // prints true

This seems a little annoying. Luckily, regex has a shortcut for us:

String pattern = "[a-z]ug";
String w = "hug";
System.out.println(w.matches(pattern)); // prints true

Yes, [a-z] will match all the characters between a and z. This works for numbers too. [0-9] will match all the digits.

Not

If you include a ^ at the beginning of a character class, you will match any charater that's not in the class. For example, the pattern w[^aeiouy]gwill match whg, wng, wvg, etc., but not wug, wig, wog, etc.

C. Special Characters

Repetition

The true power of computer science, of course, is in its ability to repeat things. Naturally, regex supports some kind of notion of repetion. Here's an example of one in action:

String laughingPattern = "lo+l";
String w = "lol";
System.out.println(w.matches(laughingPattern)); // true
w = "loooool";
System.out.println(w.matches(laughingPattern)); // true
w = "ll";
System.out.println(w.matches(laughingPattern)); // false

There's another special regex character! The + sign matches one or more repetitions of the character, as shown above. There are some other special characters that work similarly to, but not exactly the same as, the + sign. They are:

You can repeat a group of things using parentheses. For example

String laughingPattern = "l(ol)+"; // specifies you can repeat ol
String w = "lol";
System.out.println(w.matches(laughingPattern)); // true
w = "lololololol";
System.out.println(w.matches(laughingPattern)); // true
w = "loooooool";
System.out.println(w.matches(laughingPattern)); // false

Special Characters and Escaping

Usually when you include a character in a regex pattern, that character will only match itself. For example, the regex pattern "wug" will match the String "wug" and nothing else. As we saw earlier, however, there are certain special characters that have other functions. [ and ] were used to indicate character classes, allowing us to match one of several characters, while +, ?, and * were used to match repetitons. We'd like to now introduce one more common special character:

Great. So now we know some special characters in regex. What if we actually wanted to match one of the special characters? What if you wanted to specifically match a String that included a +, for example? In this case, you have to escape the character by putting a \ in front of it.

Note that this means \ is itself a special character. So if you wanted to match a string that included \, you'd need to include \\.

Note! The \ is a bit of crazy character, because it's not just a special character in regex; it's a special character in any Java string, regardless of if it's intended to be interpreted as regex. (For example, try System.out.println("\n");, and you won't see a \ appear). Therefore, if you want to have the regex \+, you need to escape the \ in Java so that it can be treated as a normal Java character so that it can be a special character in regex. Hence, to match the string "+", you will need the pattern "\\+".

Special Characters and Character Classes

The characters that are considered special inside of [] are different than the characters that are considered special elsewhere. The only special characters inside [] are ^, -, and \. The \ is still used for escaping, allowing you to escape ], ^, and - (and \ itself) inside the character class.

Convenient Character Classes

Asside from escaping, the \ in regex has an additional use. Notice that since there's no reason you'd even want to escape normal characters, such as w, you might imagine the sequence \w is useless. It's not. Intead, regex uses sequences like these for special meanings:

C. Java's Regex Objects

Using Java's Pattern and Matcher Objects

Using String's matches method is cute and all, but this is Java, so we should expect objects to get involved somehow. This turns out to be the case. In order to do more complicated things with regex, we'll need to take advantage of two new classes: Pattern and Matcher. These need to be imported from java.util.regex.

Before we were representing patterns with just Strings, like so:

String laughingPattern = "lo+l"; 

Now we'll transition to using a Pattern object instead.

Pattern laughingPattern = Pattern.compile("lo+l");

We want to check if various Strings match this pattern. We have to create a Matcher object for each one out of the pattern.

Matcher matcher1 = laughingPattern.matcher("loool");
System.out.println(matcher1.matches()); // prints true

Matcher matcher2 = laughingPattern.matcher("loooooooool");
System.out.println(matchter2.matches()); // prints true

Seems like a lot of work just to do the same thing, right? But it's worth it, because because these objects come with some extra functionality.

Flags

For example, you can pass special flags into the constructor for Pattern to subtly change its behavior. One of the most useful ones is one that allows it to to ignore case.

Pattern laughingPattern = Pattern.compile("lo+l", Pattern.CASE_INSENSTIVE);

Matcher matcher1 = laughingPattern.matcher("loool");
System.out.println(matcher1.matches()); // prints true

Matcher matcher2 = laughingPattern.matcher("LOoOoOOL");
System.out.println(matchter2.matches()); // prints true

Finding

Notably, Matcher has a find method that will check if any substring matches the pattern.

Matcher matcher3 = laughingPattern.matcher("loooooolz");
System.out.println(matcher3.matches()); // prints false, because the pattern doesn't have a z

Matcher matcher4 = laughingPattern.matcher("looooooolz");
System.out.println(matcher4.find()); // prints true, because some substring does match the pattern

Why did I create a matcher4 instead of just reusing matcher3 and calling its find method? It turns out the Matcher objects cannot be reused normally. If you do want to reuse one, you would have to call its .reset() method.

Finding Multiple Things

What if we want to check if multiple substrings match the pattern? We can try to use the find method multiple times to see if there are additional matches. Suppose we have the following method:

public static int countLaughs(String text) {
    Pattern lolPattern = Pattern.compile("lo+l");
    Matcher lolMatcher = lolPattern.matcher(text);
    int count = 0;
    while (lolMatcher.find()) {
        count++;
    }
    return count;
}

We could try the following code:

System.out.println(countLaughs("looooolz you are the funniest person I have ever met lolol"));

/* Prints 2. We say that find _consumes_ characters left-to-right. Once a character has been included
in a match, it cannot be reused. Hence, we only match twice, not three times. */

D. Capturing

Capturing

One of the most useful things we can do with regex is capturing. Capturing allows us to extract the part of a String that matches a pattern. For example, suppose I have a document that includes dates formatted like so: 2015-03-14 (YYYY-MM-DD). Then I just want to return a list of all the years mentioned in the document. I will use the following pattern:

Pattern yearCapturer = Pattern.compile("([0-9]{4})-[0-9]{2}-[0-9]{2}");

(Here I'm making the simplifying assumption that a valid date could have any digit in any spot. This is not the most precise regex I could have written! We know that months are only 01-12 and days are only 01-31, for example.)

Notice that the pattern contains ( and ). Parentheses are special characters in regex that indicate a capturing group, which mean they are a part of the pattern we wish to extract. Here is an example of it in use:

String document = "Please meet me on 2016-03-14. That's right. Let's wait a year. I didn't mean 2015-03-14.";
Matcher yearMatcher = yearCapturer.matcher(document);
while (yearMatcher.find()) {
    System.out.println(yearMatcher.group(1)); // prints out the thing we captured from this find. First 2016, then 2015
}

Why is there a 1 in .group(1)? It turns out we can include multiple capturing groups in an expression and reference them by number.

Pattern dateCapturer = Pattern.compile("([0-9]{4})-([0-9]{2})-([0-9]{2})");
Matcher dateMatcher = dateCapturer.matcher(document);
while (dateMatcher.find()) {
    System.out.println(dateMatcher.group(1)); // first 2016, then 2015
    System.out.println(dateMather.group(2)); // first 03, then 03
    System.out.println(dateMather.group(3)); // first 14, then 14
}

Why do I start counting at 1, and not 0? It turns out 0 contains the whole matching expression, ignoring the parentheses. Above, first dateMatcher.group(0) would contain 2016-03-14, then next it would contain 2015-03-04.

Greedy vs. Reluctant

What would you expect the following code to do? Recall that group(0) is the whole substring that find matches.

Pattern weirdLaughingPattern = Pattern.compile("l.+l");
Matcher weirdMatcher = weirdLaughingPattern.matcher("looool you are the funniest person I have ever met lolol");
while (weirdMatcher.find()) {
    System.out.println(weirdMatcher.group(0));
}

Perhaps you expect it to print out looool and lol. This is not the case. It actually prints out looool you are the funniest person I have ever met lolol. Why's that? It turns out the + operate is greedy, which means it will try to match as many characters as possible. Notice the String we matched starts with l, ends with l, and has one or more other characters in the middle. And it has as many characters in the middle as possible. ? and * are also greedy. To make them reluctant, meaning they will match as few characters as possible, include a ? after them. To get the code to print looool and lol, the correct pattern is l.+?l.

Find and Replace

A natural extention of finding is find-and-replace. The Matcher class has two methods, replaceAll and replaceFirst that might be helpful. For example, the following makes a famous quote gender-neutral:

Pattern malePattern = Pattern.compile("men");
Matcher neutralizer = malePattern.matcher("We hold these truths to be self-evident, that all men are created equal...");
System.out.println(neutralizer.replaceAll("people"));

However, this regex doesn't do what we want in the following case:

Matcher neutralizer2 = malePattern.matcher("We hold these truth to be self-evident, that all men are equal mentally...");

It prints:

We hold these truth to be self-evident, that all people are equal peopletally...

That's right, it replaced the men in mentally, too! That's not what we wanted. How do we fix this problem?

E. Boundaries

Boundaries

We introduce the concept of boundaries. Boundaries help us specify that we only want to match patterns within certain boundary conditions. Here are some useful boundaries we can use in regex:

In this case, we probably only want to match the word men, so we should demarcate it with word boundaries. Here's the regex we want:

Pattern malePattern = Pattern.compile("\\bmen\\b");

Again, we want to treat the \ as a normal character, not a special character, so we must escape it.

F. Exercises

URLs

The url we are looking for has a very particular pattern and is surrounded by parentheses (). The following are examples of valid urls: (randomstuff1234https://www.eecs.berkeley.edu/blah.htmlyoullneverfindyourextracredit), (https://en.wikipedia.org/greed.htmltry23andfindmenow), (http://www.cs61bl.github.io/yolo.html).

Here's the exact pattern specification:

A valid domain name is a sequence of any length of word characters followed by a .. For example eecs.berkeley is a valid domain or simply www.berkeley, or just facebook.

Note: the text you'll be parsing is inundated with text preceeding https and urls ending in html, so if you use a greedy pattern you will match thousands of urls—you only want one. Use a reluctant regex any time you match on 0 or more things.

Finding Startup Names

You wake up one day and find yourself as some strange amalgam of Marc Andreessen, Peter Thiel, Sam Altman, Ben Horowitz as well as the CEO of Y Combinator. As a well-respected VC, your job is to discover and incubate innovative, game-changing, data-driven, cutting edge startups that will disrupt and change the world. First, these startups must have a worthy name and because you received thousands of applications per day, you decide to write a Regex parser for your first round of filtering. The naming criteria are listed as follows:

Hint: Although we didn't mention it in this lab, you might want to look up Regex alternation constructs.

Recover the Image!

You are given a text file named mystery.txt where each line contains a set of RGB pixel values for a particular (x, y) coordinate pair. The RGB values are collection of a three integer values between square brackets, separated by commas (e.g. [240, 100, 76]). The corresponding coordinate pair is represented using a length-2 tuple of form ([x-value], [y-value]). However, each line may contain other arrays and tuples which do not fit our criteria. Write a Regex pattern to match and extract each set of (x, y)-RGB values and uncover the mystery image (link to text file here).

G. Conclusion

Submission

Please submit a completed RegexSolution.java file. Our tests will check the correctness of your solution using randomized tests, so keep that in mind as you're writing your solutions. And with that, congratulations on completing your last lab of 61B!

Additional Resources

Don't worry if all of these symbols seem like complete gibberish. With repeated practice, you'll eventually find that the necessary characters come to you more quickly and intuitively. Until then, feel free to consult Regex cheatsheets like the ones listed below:

* http://www.omicentral.com/cheatsheets/JavaRegularExpressionsCheatSheet.pdf
* http://regexlib.com/

If you enjoyed the exercises in this lab and would like some additional practice, also consider playing some Regex Golf (http://regex.alf.nu/).