Getting Started With Regular Expressions

What And Why

A regular expression is a pattern that describes a set of strings, written so that it can be matched against other text.

For example, if you wished to search a configuration file for all references to your hostname drongo and change it to dragon, you would not need a regular expression. You could replace drongo with dragon, and all would be good.

But suppose you were looking through a group of configuration files that referred to different servers. Let's further suppose you wanted to replace all the IP addresses with host names. Then you'd need a regular expression that matched only IP addresses and nothing else.

Let's say you weren't doing any replacing, but wanted to find all instances in a file where the word no was the last word on a line. Some feature had gotten turned off, and you wanted to fix the configuration keyword that disabled that feature. You could search for no, but you'd read about noses, noggins and phenotypes. You could search for no as a stand-alone word if you had an editor which could do that, but regular expressions are the solution when you want the no keyword appearing at the end of a line.

Another great use for a regular expression is to verify that data is in a particular format. A regular expression can check that a phone number, postal code or monetary field is properly entered, or at least that it doesn't contain extra spaces or unwanted punctuation.

Let's suppose you have to fix files that refer to the IP address 10.1.1.135. But the idiot who had the job before you sometimes got it mixed up and specified 192.168.1.135. Then he must have been drinking beer when he changed the address in some of the files to 10.1.1.136! Trying to fix this one IP address at a time by hand could become a real error-prone pain. But with regular expressions it can be a snap.

Or let's say that you want to go to
http://braille.wunderground.com/

which is the text-only site for Weather Underground. Let's further suppose that you want your system to extract a daily weather report, and every morning at 8 AM, play relevant portions of it using synthesized speech over speakers in your ceiling. (Yes, I have a nifty cron job to do this.)

You can use regular expressions to tease out the weather report from all the unwanted parts of the page.

Regular expressions can clean up results from imperfect OCR or simply bad typing. They can seek out and remove unwanted extra tabs, spaces, empty lines, decorative borders, or characters that give a speech synthesizer fits.

Regular expressions can also make it easy to track down a name if you know how it sounds, but not how it is spelled, or operate on certain lines in a file but only if a specific condition exists. They can convert case, turning all those pesky upper-case and mixed-case Windows filenames to lower-case. They can replace spaces with underscores in filenames, and batch process them in a script.

Delimiting an Expression

Regular expressions usually begin and end with a slash, and though other delimiters can be used, for this discussion we will assume that slash is our delimiter.

Simple Expressions

Regular expressions can be simple strings, for example
/dragon/

matches occurrences of dragon.

In a slightly more complex expression, the caret symbol matches the beginning of a line and the dollar sign matches the end. So
/^dragon/

matches dragon only if it is at the start of a line. And
/dragon$/

matches dragon only if it is at the end of a line. Therefore
/^dragon$/

matches dragon only if it is the only thing on a line and
/^$/

matches an empty line.

I have yet to find a Windows editor or word processor that will jump directly to the next blank line in a document, but with regular expressions, simply searching for /^$/ will do the trick.

The caret (^) and the dollar sign ($) are called anchors. They match no characters; instead they match a position, which is why the word anchor is used. They anchor the search string to that position.
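
If you'd like to try the anchors outside an editor, here is a minimal sketch in Python, assuming its standard re module. Python writes patterns as strings rather than between slashes, and the sample text is made up for illustration.

```python
import re

# Made-up sample text: five lines, the fourth one empty.
text = "dragon slayer\nthe last dragon\ndragon\n\nend"

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# not only at the start and end of the whole string.
starts = re.findall(r"^dragon", text, flags=re.MULTILINE)
ends = re.findall(r"dragon$", text, flags=re.MULTILINE)
alone = re.findall(r"^dragon$", text, flags=re.MULTILINE)

# Jump to the first empty line, counting lines from 1.
lines = text.split("\n")
first_blank = next(
    number for number, line in enumerate(lines, start=1)
    if re.search(r"^$", line)
)

print(len(starts), len(ends), len(alone), first_blank)  # 2 2 1 4
```

Notice that the anchors never show up in the matched text. Every hit is just dragon, because ^ and $ match positions and not characters.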

Special Characters

Besides the anchors we learned about already, there are many more special characters. The period is used in a regular expression to match any single character. So the expression
/gr.y/

matches gray spelled with an A and grey spelled with an E. Two periods together match two characters so that
/c..t/

matches coot and coat, but not cut or cat.

The problem with gr.y is that not only will it match gray with an a and grey with an e but it will match griy, gruy, grky or gryy. You can fix this problem with character classes discussed below.

Do not confuse regular expressions with either DOS wildcards or shell globbing. These are similar concepts, but regular expressions are more complex and exacting.

For example, the asterisk character matches zero or more occurrences of a character. But not just any character. In a shell, you can specify foo* to match any file beginning with foo. In DOS, specifying Foo*.* does the same thing. But in a regular expression, the * indicates zero or more occurrences of the character immediately preceding it.

So
/foo.*/

would match foo followed by anything at all, because the period stands for any character and the asterisk following it indicates zero or more occurrences.

For another asterisk example
/192*/

would match 19, 192, 1922, 19222, 19222224 and anything else made of 19 followed by the two repeated zero or more times. It also finds a match inside 190 and 191, because the 19 alone is enough.
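
The period and the asterisk are easy to misread, so here is a small sketch you can run to check your understanding. It assumes Python's standard re module, and the word lists are invented for the example. re.fullmatch requires the whole string to match, while re.search looks for a match anywhere inside it.

```python
import re

# The period matches any single character, so gr.y is looser than it looks.
words = ["gray", "grey", "gruy", "gry", "groovy"]
dot_hits = [w for w in words if re.fullmatch(r"gr.y", w)]

# 2* means "zero or more 2s", so 192* is really 19 followed by any number of 2s.
numbers = ["19", "192", "19222", "190", "182"]
star_hits = [n for n in numbers if re.search(r"192*", n)]

# Only the 19 in 190 is matched; the 0 is not part of the match.
matched_text = re.search(r"192*", "190").group()

print(dot_hits)      # ['gray', 'grey', 'gruy']
print(star_hits)     # ['19', '192', '19222', '190']
print(matched_text)  # 19
```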

Matching Start or End Of A Word

You might by now be wondering how this can be all that useful. How often are you looking say for either goat or coat, cake or cape?

But regular expressions are, as I said, complex, so you can actually do more than I've implied. The \< anchors to the beginning of a word and \> to the end. So
/\<dog/

matches words that begin with dog, like dogbert, but not words that end with dog like hotdog. For a less frivolous example,
/put\>/

finds output or input or shotput or put, but not words beginning with put, such as putty or puttanesca. It also won't get stuck matching words with put in the middle, like computer.

Remember that \< and \>, being anchors, match positions rather than characters in a search string.
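
Not every program spells the word anchors the same way. Python's re module, for one, does not understand \< and \>. It offers \b instead, which matches at either edge of a word. The sketch below uses \b as a stand-in, and the sentence is made up for the example.

```python
import re

sentence = "dogbert ate a hotdog while the computer sent input and output to putty"

# \bdog stands in for \<dog : dog at the start of a word.
begins_with_dog = re.findall(r"\bdog\w*", sentence)

# put\b stands in for put\> : put at the end of a word.
ends_with_put = re.findall(r"\w*put\b", sentence)

print(begins_with_dog)  # ['dogbert']
print(ends_with_put)    # ['input', 'output']
```

The \w* on either side only collects the rest of the word so the whole word is printed. The anchoring is done by \b alone.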

Character Classes

You can also create character classes, to treat a group of characters as one. Classes are enclosed in square brackets, so the class
[aeiou]

contains all the vowels and the class
[12345]

contains the first five digits. So
/d[aeiou]g/

would locate dog, dig, dug, but not did or door. Add an asterisk after a class and you are now looking for zero or more occurrences of that class, so
/d[aeiou]*r/

would locate dear, deer, dire, dare, door, and dr (the abbreviation Dr. only if your search ignores case). Basically with this last expression, we said find a d, then any number of vowels including none at all, and then find an r. This expression doesn't match "dictator" even though dictator starts with a d, has a vowel and ends with an r. Dictator follows the vowel with a consonant, and at that letter c the search stops.

If you want to find a vowel that begins a line, you can anchor it with a caret as in
/^[aeiou]/

But here's the confusing part. The caret has another use as well. When it is the first character inside the brackets, it negates the class, so
/[^aeiou]/

actually tells the expression to match any character that is *NOT* a vowel: consonants, but also digits, spaces and punctuation. Therefore to match a non-vowel at the start of the line, you'd have two carets; the one outside the brackets is the start of line anchor, and the one inside the brackets negates the character class. That expression is
/^[^aeiou]/

Returning to our grey/gray example, we now see that typing
/gr[ae]y/

matches gray or grey but not other things that begin with gr, end with y and use some other letter after the r.
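
To see character classes at work, here is a short sketch, again assuming Python's re module and a few invented word lists. It tries a class on its own, a class followed by an asterisk, and the two meanings of the caret.

```python
import re

words = ["dog", "dig", "dug", "did", "door", "dear", "dr", "dictator"]

# A class with nothing after it matches exactly one character from the class.
one_vowel = [w for w in words if re.search(r"d[aeiou]g", w)]

# Add an asterisk and the class may repeat, or be missing altogether.
any_vowels = [w for w in words if re.search(r"d[aeiou]*r", w)]

lines = ["apple", "banana", "9 lives", "under"]

# Caret outside the brackets: anchor to the start of the line.
starts_with_vowel = [s for s in lines if re.search(r"^[aeiou]", s)]

# Caret inside the brackets as well: anything except a vowel, digits included.
starts_with_non_vowel = [s for s in lines if re.search(r"^[^aeiou]", s)]

print(one_vowel)              # ['dog', 'dig', 'dug']
print(any_vowels)             # ['door', 'dear', 'dr']
print(starts_with_vowel)      # ['apple', 'under']
print(starts_with_non_vowel)  # ['banana', '9 lives']
```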

Matching a Special Character

Suppose you need to match a special character like a bracket, asterisk or period. You simply precede it with a backslash. So if your script referred to files with extensions of .txt and you wanted to find that period and not just any character, your search expression would be
/\.txt/

If you wanted to find a Unix filepath like /usr/bin/, your expression would need to "escape" the slashes in the path with backslashes, so it would look like this:
/\/usr\/bin\//

If you need to match a backslash, the same rule applies: simply use two backslashes. So in matching a Windows path with backslashes, you'd end up with an expression like:
/\\program files\\/
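
Escaping can be checked with a quick sketch like this one, which assumes Python's re module. Because Python has no slash delimiters, the slashes in a Unix path need no backslashes there, but the period and the backslash still do. The file names and paths are made up for the example.

```python
import re

# Escape the period so it matches a literal dot rather than any character.
names = ["notes.txt", "notes_txt", "readme.txt.bak"]
dot_txt = [n for n in names if re.search(r"\.txt", n)]

# No slash delimiters in Python, so the path is written as it is.
unix_hit = re.search(r"/usr/bin/", "PATH=/usr/bin/:/bin") is not None

# A literal backslash is written as two backslashes in the pattern.
win_hit = re.search(r"\\program files\\", r"c:\program files\app") is not None

# re.escape adds the backslashes for you.
escaped = re.escape("a.b*c")

print(dot_txt)   # ['notes.txt', 'readme.txt.bak']
print(unix_hit)  # True
print(win_hit)   # True
print(escaped)   # a\.b\*c
```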

Repetition And Ranges

Besides character classes, we have the concept of repetition. If you want to match exactly two occurrences of the letter o you'd specify
/o{2}/

This seems just as easy as writing
/oo/

and it is. But suppose you want to search for forty underscores. Then it's easier to type
/_{40}/

instead of typing all forty of them.

Inside the curly braces, you can also specify a range. So to search for twenty to forty underscores, you'd type
/_{20,40}/

and that would do it.

You can also specify a range inside the square brackets for a character class. If you wanted 0 through 9, you'd specify
/[0-9]/

to search for any digit. Note that usually when specifying ranges within the square brackets for a character class, you use a hyphen, whereas when you specify a range of repeats within the curly braces, you use a comma. The syntax does vary a bit from one program to the next. You can find programs that want all ranges separated by commas, and others that separate the start and end of a range with a dash. But once you understand the concept of a range, it's fairly easy to adapt to a different syntax.
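
Here is a small sketch of repeat counts and ranges, assuming Python's re module, which takes the braces without a backslash and separates the two ends of a repeat range with a comma. The test strings are built on the spot.

```python
import re

rule = "_" * 40  # forty underscores, built without typing them all

# fullmatch insists that the whole string fits the pattern.
exactly_forty = re.fullmatch(r"_{40}", rule) is not None
too_short = re.fullmatch(r"_{20,40}", "_" * 19) is not None
in_range = re.fullmatch(r"_{20,40}", "_" * 25) is not None

# A hyphen inside square brackets gives a range of characters.
digits = re.findall(r"[0-9]", "room 42b")

print(exactly_forty, too_short, in_range)  # True False True
print(digits)                              # ['4', '2']
```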

Other Syntax Gotchas

You can search for one or more occurrences of something with a plus sign and zero or one occurrence with a question mark, but unfortunately that syntax is not the same everywhere. Some programs take a bare + and ?, while others want them preceded by a backslash or don't support them at all. Your man pages for vi, grep, egrep, less and ed will tell how to do it in each case.

It is best to practice with egrep, which nondestructively searches for patterns, or with ed, which can let you edit and locate lines based on expressions. The less pager, which is used by default by the man command, also supports regular expressions for searching.

More than one character

Many times you are not looking for just one character but for a particular group of characters that may occur zero or more times, or that may start a line followed by something else. Whenever you want a special character, like an asterisk, or a character class, or repetition to apply to more than one character, you enclose the expression you wish it to apply to in parentheses. For example to match a line that has samba at its start repeated 1 or more times, you'd type
/^(samba)+/

To match a line with Debee repeated zero or more times followed by a space and the word setup, you'd type
/(Debee)* setup/
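
The difference the parentheses make is easiest to see side by side. This is a rough sketch assuming Python's re module, where a bare + and * work as described above. The sample lines are invented.

```python
import re

line = "sambasamba is running"

# With parentheses, the + repeats the whole word samba.
grouped = re.search(r"^(samba)+", line).group()

# Without them, the + applies only to the final a.
ungrouped = re.search(r"^samba+", line).group()

# Zero or more copies of Debee, then a space and the word setup.
candidates = ["Debee setup", "DebeeDebee setup", " setup", "Debeesetup"]
optional = [c for c in candidates if re.search(r"(Debee)* setup", c)]

print(grouped)    # sambasamba
print(ungrouped)  # samba
print(optional)   # ['Debee setup', 'DebeeDebee setup', ' setup']
```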

Either Or

To match either this or that, the concept is called alternation and it uses the vertical bar, which is the same symbol used for pipe. The program you are using knows it's not a pipe because it occurs inside the regular expression delimiters. If you type the expression on a shell command line, put it in quotes so the shell leaves the bar alone. For example to match this or that literally, you'd type
/(this)|(that)/

If you were using alternation with just single characters, you could skip the parentheses. To match a t or an h, type
/t|h/

You can combine alternation with other special characters, so to match a yes or a no but only at the end of a line, type
/(yes$)|(no$)/
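
Alternation handles the stray no keyword from the beginning of this page. The sketch below assumes Python's re module, and the configuration lines are hypothetical, made up only to show which ones are caught.

```python
import re

# Hypothetical configuration text, one setting or comment per line.
config = "EnableCache yes\nnoses and noggins\nUseDNS no\nno comment here"

# finditer yields one match object per hit; group() is the matched text.
end_flags = [
    m.group()
    for m in re.finditer(r"(yes$)|(no$)", config, flags=re.MULTILINE)
]

# With single characters the parentheses can be skipped.
letters = re.findall(r"t|h", "the hat")

print(end_flags)  # ['yes', 'no']
print(letters)    # ['t', 'h', 'h', 't']
```

Only the first and third lines produce a hit. The other two contain no, but not at the end of the line.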

One last real-world example

To match an IP address, which is four groups of one to three digits separated by periods, try this:
/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/

Remember in some programs you need a comma and not a dash symbol for a range.
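
Here is how the job from the start of this page, replacing IP addresses with host names, might be sketched in Python with its re module. The HOSTS table and the function name are hypothetical, invented for the example. The pattern allows one to three digits per group, which is an assumption made so that short groups like the 10 and the 1 in 10.1.1.135 match. It does not check that each number is 255 or less.

```python
import re

# Hypothetical lookup table; the names are made up for illustration.
HOSTS = {"10.1.1.135": "dragon", "192.168.1.1": "gateway"}

# Four groups of one to three digits, separated by literal periods.
IP_PATTERN = re.compile(r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}")


def ips_to_hostnames(text):
    """Replace each known IP address with its host name; leave the rest alone."""
    return IP_PATTERN.sub(lambda m: HOSTS.get(m.group(), m.group()), text)


line = "server 10.1.1.135 via 192.168.1.1, backup 10.1.1.136"
print(ips_to_hostnames(line))
# server dragon via gateway, backup 10.1.1.136
```

Addresses missing from the table, like the mistyped 10.1.1.136, are left untouched so you can spot them and fix them by hand.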

Special Characters Summary

Here's a quick summary of the special characters:

Represent any single character: Period
Repeat zero or more: Asterisk
Repeat zero or once: Question mark
Repeat once or more: Plus sign
Anchor to start of word: Backslash Left angle bracket
Anchor to end of word: Backslash Right angle bracket
Anchor to start of line: Caret
Anchor to end of line: Dollar sign
Specify a range: Lower end of range followed by a dash or comma, followed by the upper end of the range
Treat as one expression: Enclose in parentheses
Treat as character class: Enclose in square brackets, may include a range but doesn't have to
Repeat last expression or character: Enclose a digit or range of digits specifying number of repeats in curly braces
Use alternation: Place a vertical bar between two expressions
Delimit the entire expression: Forward Slash
Escape a special character: Precede with backslash