Welcome to the Cisco Academy for Vision Impaired Linux Wiki.

Getting Started With Regular Expressions

What And Why

A regular expression is a way of describing a pattern of text so that it can be matched against other strings. For example, if you wished to search a configuration file for all references to your hostname drongo and change it to dragon, you would not need a regular expression. You could replace drongo with dragon, and all would be good. But suppose you were looking through a group of configuration files that referred to different servers. Let's further suppose you wanted to replace all the IP addresses with host names. Then you'd need a regular expression that matched only IP addresses and nothing else.

Let's say you weren't doing any replacing, but wanted to find all instances in a file where the word No was the last word on a line. Some feature had gotten turned off, and you wanted to fix the configuration keyword that disabled that feature. You could search for No, but you'd read about noses, noggins and phenotypes. You could search for No as a stand-alone word if you had an editor which could do that, but regular expressions are the solution when you want the No keyword appearing at the end of a line.

Another great use for a regular expression is to verify that data is in a particular format. A regular expression can check that a phone number, postal code or monetary field is properly entered, or at least that it doesn't contain extra spaces or unwanted punctuation.

Let's suppose you have to fix files that refer to the IP address 10.1.1.135. But the idiot who had the job before you sometimes got it mixed up and specified 192.168.1.135. Then he must have been drinking beer when he changed the address in some of the files to 10.1.1.136! Trying to fix this one IP address at a time by hand could become a real error-prone pain. But with regular expressions it can be a snap.

Or let's say that you want to go to the text-only site for Weather Underground.
Let's further suppose that you want your system to extract a daily weather report and, every morning at 8 AM, play relevant portions of it using synthesized speech over speakers in your ceiling. (Yes, I have a nifty cron job to do this.) You can use regular expressions to tease out the weather report from all the unwanted parts of the page.

Regular expressions can clean up results from imperfect OCR or simply bad typing. They can seek out and remove unwanted extra tabs, spaces, empty lines, decorative borders, or characters that give a speech synthesizer fits. Regular expressions can also make it easy to track down a name if you know how it sounds but not how it is spelled, or operate on certain lines in a file only if a specific condition exists. They can convert case, turning all those pesky upper-case and mixed-case Windows filenames to lower-case. They can replace spaces with underscores in filenames, and batch process them in a script.

Delimiting an Expression

Regular expressions usually begin and end with a slash. Other delimiters can be used, but for this discussion we will assume that slash is our delimiter.

Simple Expressions

Regular expressions can be simple strings; for example, /dragon/ matches occurrences of dragon. In a slightly more complex expression, the caret symbol matches the beginning of a line and the dollar sign matches the end. So /^dragon/ matches dragon only if it is at the start of a line, and /dragon$/ matches dragon only if it is at the end of a line. Therefore /^dragon$/ matches dragon only if it is the only word on a line, and /^$/ matches an empty line. I have yet to find a Windows editor or word processor that will jump directly to the next blank line in a document, but with regular expressions, simply searching for /^$/ will do the trick.

The caret (^) and the dollar sign ($) are called anchors. They match no characters; instead they match a position, which is why the word anchor is used. They anchor the search string to their position.
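
The anchor behaviour described above can be tried out in a short script. This is only a minimal sketch, assuming Python's re module as a stand-in for an editor or egrep; Python writes patterns without the slash delimiters, and the sample lines and the helper name matching are made up for illustration.

```python
import re

# Sample lines invented for illustration; the third one is deliberately empty.
lines = ["dragon guards the gate", "beware the dragon", "", "dragon"]

def matching(pattern):
    """Return the 1-based numbers of the lines in which the pattern is found."""
    return [n for n, line in enumerate(lines, start=1) if re.search(pattern, line)]

anywhere = matching(r"dragon")      # /dragon/   found anywhere in the line
at_start = matching(r"^dragon")     # /^dragon/  only at the start of a line
at_end = matching(r"dragon$")       # /dragon$/  only at the end of a line
whole_line = matching(r"^dragon$")  # /^dragon$/ the only word on the line
blank = matching(r"^$")             # /^$/       an empty line

print(anywhere, at_start, at_end, whole_line, blank)
```

Reporting line numbers rather than the matched text reflects the point that anchors match positions: the blank-line pattern matches line 3 without consuming any characters at all.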

Special Characters

Besides the anchors we learned about already, there are many more special characters. The period is used in a regular expression to stand for any single character. So the expression /gr.y/ matches gray spelled with an A and grey spelled with an E. Two periods together match two characters, so /c..t/ matches coot but not cut, and coat but not cat. The problem with /gr.y/ is that not only will it match gray with an a and grey with an e, but it will also match griy, gruy, grky or gryy. You can fix this problem with character classes, discussed below.

Do not confuse regular expressions with either DOS wildcards or shell globbing. These are similar concepts, but regular expressions are more complex and exacting. For example, the asterisk character matches zero or more occurrences of a character. But not just any character. In a shell, you can specify foo* to match any file beginning with foo. In DOS, specifying Foo*.* does the same thing. But in a regular expression, the * indicates zero or more occurrences of the character immediately preceding it. So /foo.*/ would match foo followed by anything at all, because the period stands for any character and the asterisk following it indicates zero or more occurrences. For another asterisk example, /192*/ would match 190, 191, 192, 1922, 19222, 19222224 and anything else containing 19 followed by the digit 2 repeated zero or more times. In 190 and 191 the 2 simply occurs zero times.

Matching Start or End Of A Word

You might by now be wondering how this can be all that useful. How often are you looking, say, for either goat or coat, cake or cape? But regular expressions are, as I said, complex, so you can actually do more than I've implied. The \< anchors to the beginning of a word and \> to the end. So /\<dog/ matches words that begin with dog, like dogbert, but not words that end with dog, like hotdog. For a less frivolous example, /put\>/ searches for output or input or shotput or put, but not words beginning with put, such as putty or puttanesca. It also won't get stuck matching words with put in the middle, like computer.
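
The period, asterisk and word anchors can be checked the same way. The sketch below again assumes Python's re module rather than the tools this page discusses. Python's re does not support \< and \>, so the closest equivalent, the \b word-boundary anchor, is swapped in. The helper name hits and the word lists are illustrative only.

```python
import re

def hits(pattern, words):
    """Return the words in which the pattern is found somewhere."""
    return [w for w in words if re.search(pattern, w)]

# /gr.y/: the period stands for any one character, so griy matches too.
any_char = hits(r"gr.y", ["gray", "grey", "griy", "gry"])

# /c..t/: two periods require exactly two characters between c and t.
two_chars = hits(r"c..t", ["coot", "cut", "coat", "cat"])

# /192*/: the asterisk applies only to the 2 right before it,
# so 190 matches on its leading 19 with zero 2s.
star = hits(r"192*", ["190", "192", "19222", "18"])

# /\<dog/ and /put\>/, written here with Python's \b word boundary.
word_start = hits(r"\bdog", ["dogbert", "hotdog"])
word_end = hits(r"put\b", ["output", "input", "put", "putty", "computer"])

print(any_char, two_chars, star, word_start, word_end)
```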

Remember that \< and \>, being anchors, match positions rather than characters in a search string.

Character Classes

You can also create character classes, to treat a group of characters as one. Classes are enclosed in square brackets, so the class [aeiou] contains all the vowels and the class [01234] contains the first five digits. So /d[aeiou]g/ would locate dog, dig, dug, but not did or door. Add an asterisk after a class and you are now looking for zero or more occurrences of that class, as in /d[aeiou]*r/, which would locate dear, deer, dire, dare, door, and dr. Basically, with this last expression we said: find a d, followed by an optional vowel that may repeat any number of times, and then find an r. This expression doesn't match "dictator", even though dictator starts with a d, has a vowel and ends with an r. Dictator follows the vowel with a consonant, and at that letter c the search stops.

If you want to find a vowel that begins a line, you can anchor it with a caret, as in /^[aeiou]/. But here's the confusing part. The caret has another use as well. When it is the first character inside the brackets, it negates the class, so /[^aeiou]/ actually tells the expression to match any character that is *NOT* a vowel, such as a consonant. Therefore to match a consonant at the start of the line, you'd have two carets; the one outside the brackets is the start of line anchor, and the one inside the brackets negates the character class. That expression is /^[^aeiou]/.

Returning to our grey/gray example, we now see that typing /gr[ae]y/ matches gray or grey but not other things that begin with gr, end with y and use some other letter after the r.

Matching a Special Character

Suppose you need to match a special character like a bracket, asterisk or period. You simply precede it with a backslash.
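
For the character classes just described, a small experiment can make the two meanings of the caret easier to tell apart. This is a hedged sketch using Python's re module (an assumption, not a tool this page covers); the helper hits and the sample words are invented for the example.

```python
import re

def hits(pattern, words):
    """Return the words in which the pattern is found somewhere."""
    return [w for w in words if re.search(pattern, w)]

# /d[aeiou]g/: exactly one vowel between d and g.
one_vowel = hits(r"d[aeiou]g", ["dog", "dig", "dug", "did", "door"])

# /d[aeiou]*r/: any run of vowels, including none, between d and r.
vowel_run = hits(r"d[aeiou]*r", ["dear", "deer", "dire", "door", "dr", "dictator"])

samples = ["apple", "pear", "9 lives"]
# /^[aeiou]/: caret outside the brackets anchors to the start of the line.
starts_with_vowel = hits(r"^[aeiou]", samples)
# /^[^aeiou]/: the caret inside the brackets negates the class, so any
# first character that is not a vowel matches, including a digit.
starts_with_other = hits(r"^[^aeiou]", samples)

# /gr[ae]y/: only a or e is allowed after gr.
gray_or_grey = hits(r"gr[ae]y", ["gray", "grey", "griy"])

print(one_vowel, vowel_run, starts_with_vowel, starts_with_other, gray_or_grey)
```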
So if your script referred to files with extensions of .txt and you wanted to find that period and not just any character, your search expression would be /\.txt/. If you wanted to find a Unix filepath like /usr/bin/, your expression would need to "escape" the slashes in the path with backslashes, so it would look like this: /\/usr\/bin\//. If you need to match a backslash, the same rule applies: simply use two backslashes. So in matching a Windows path, every backslash in the path has to be doubled.

Repetition And Ranges

Besides character classes, we have the concept of repetition. If you want to match exactly two occurrences of the letter O, you'd specify /o{2}/. This seems just as easy as writing /oo/, and it is. But suppose you want to search for forty underscores. Then it's easier to type /_{40}/. Inside the curly braces, you can also specify a range. So to search for twenty to forty underscores, you'd type /_{20,40}/ and that would do it.

You can also specify a range inside the square brackets for a character class. If you wanted 0 through 9, you'd specify /[0-9]/ to search for any digit. Note that usually when specifying ranges within the square brackets for a character class, you use a hyphen, whereas when you specify a range of repeats within the curly braces, you use a comma. The syntax does vary a bit from one program to the next. You can find programs that want all ranges separated by commas, and others that separate the start and end of a range with a dash. But once you understand the concept of a range, it's fairly easy to adapt to a different syntax.

Other Syntax Gotchas

You can search for one or more occurrences of something with a plus sign and zero or one occurrence with a question mark, but unfortunately that syntax is not standard. Your man pages for vi, grep, egrep, less and ed will tell how to do it in each case. It is best to practice with egrep, which nondestructively searches for patterns, or with ed, which lets you locate and edit lines based on expressions.
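
Escaping, repeat counts and ranges can be exercised in a script as well. The following is a minimal sketch in Python's re syntax, which is one dialect among several and an assumption here, so the plus sign and question mark in particular may be spelled differently in vi, grep or ed. The file names and the Windows path C:\temp are hypothetical. Because Python has no slash delimiters, the slashes in a Unix path need no escaping there.

```python
import re

# /\.txt/: the backslash makes the period literal.
literal_dot = bool(re.search(r"\.txt", "notes-txt"))   # period must be a real "."
any_char_dot = bool(re.search(r".txt", "notes-txt"))   # unescaped: "-" is accepted

# A Unix path; no delimiter slashes in Python, so no escapes are needed.
unix_path = bool(re.search(r"/usr/bin/", "PATH=/usr/bin/:/bin"))

# A hypothetical Windows path: each backslash is doubled in the pattern.
windows_path = bool(re.search(r"C:\\temp", r"C:\temp\report.txt"))

# /o{2}/ and /_{40}/: exact repeat counts.
double_o = bool(re.search(r"o{2}", "foo"))
forty = bool(re.fullmatch(r"_{40}", "_" * 40))
thirty_nine = bool(re.fullmatch(r"_{40}", "_" * 39))

# /_{20,40}/: a comma-separated range of repeats, tested on whole strings.
in_range = [n for n in (19, 20, 40, 41) if re.fullmatch(r"_{20,40}", "_" * n)]

# /[0-9]/: a hyphen-separated range inside a character class.
digits = re.findall(r"[0-9]", "room 42b")

# Plus (one or more) and question mark (zero or one) in Python's dialect.
one_or_more = [w for w in ["lk", "lok", "look"] if re.search(r"lo+k", w)]
zero_or_one = [w for w in ["color", "colour", "colouur"] if re.search(r"colou?r", w)]

print(literal_dot, any_char_dot, unix_path, windows_path)
print(double_o, forty, thirty_nine, in_range, digits, one_or_more, zero_or_one)
```

re.fullmatch is used for the underscore counts because an unanchored search would happily find forty underscores inside a longer run.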

The less pager, which is used by default by the man command, also supports regular expressions for searching.

More Than One Character

Many times you are not looking for just one character but for a particular group of characters that may occur zero or more times, or that may start a line and be followed by something else. Whenever you want a special character like an asterisk, a character class, or repetition to apply to more than one character, you enclose the expression you wish it to apply to in parentheses. For example, to match a line that has samba at its start repeated one or more times, you'd type /^(samba)+/. To match a line with Debee repeated zero or more times followed by a space and the word setup, you'd type /(Debee)* setup/.

Either Or

To match either this or that, the concept is called alternation, and it uses the vertical bar, which is the same symbol used for a pipe. The shell, or whatever program you are using, knows it's not a pipe because it occurs inside regular expression delimiters. For example, to match this or that literally, you'd type /(this|that)/. If you were using alternation with just single characters, you could skip the parentheses: to match a t or an h, type /t|h/. You can combine alternation with other special characters, so to match a yes or a no but only at the end of a line, type /(yes|no)$/.

One Last Real-World Example

To match an IP address, try this: Remember that in some programs you need a comma and not a dash symbol for a range.

Special Characters Summary

Here's a quick summary of the special characters:
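
The grouping and alternation ideas from the last few sections can also be tried in a script. This is a hedged sketch in Python's re syntax, not the syntax of any one tool named on this page. The IP address pattern in it is an assumed illustration built from the digit class, repeat range and escaped period covered earlier; it is not necessarily the expression the author had in mind, and it does not check that each number is 255 or less. The sample strings are invented.

```python
import re

# /^(samba)+/: the plus applies to the whole parenthesized group.
group_plus = re.search(r"^(samba)+", "sambasamba rocks").group(0)
not_at_start = re.search(r"^(samba)+", "my samba")  # None: samba is not at the start

# /(Debee)* setup/: zero or more copies of the group, then " setup".
group_star = re.search(r"(Debee)* setup", "DebeeDebee setup").group(0)

# /(this|that)/: alternation between two whole words.
either = re.findall(r"(this|that)", "this and that")

# /(yes|no)$/: alternation combined with the end-of-line anchor.
lines = ["enabled yes", "no way", "logging no"]
ends_yes_no = [s for s in lines if re.search(r"(yes|no)$", s)]

# Assumed IP address pattern: four groups of 1 to 3 digits separated by
# literal periods. Hyphen for the class range, comma for the repeat range.
ip_pattern = r"[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"
ips = re.findall(ip_pattern, "server 10.1.1.135 replaced 192.168.1.135")

print(group_plus, not_at_start, group_star, either, ends_yes_no, ips)
```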