Unix Regular Expression
CS-2008-019
Regular Expression:
A regular expression provides a concise and flexible means of matching strings of text, such as particular characters, words, or
patterns of characters.
A regular expression is written in a formal language that can be interpreted by a regular expression processor, a
program that either serves as a parser generator or examines text and identifies parts that match the provided
specification.
It is simple to search for a specific word or string of characters. Almost every editor on every computer system
can do this. Regular expressions are more powerful and flexible. You can search for words of a certain size. You
can search for a word with four or more vowels that ends with an "s". Numbers, punctuation characters, you
name it, a regular expression can find it. What happens once the program you are using finds it is another matter.
Some just search for the pattern. Others print out the line containing the pattern. Editors can replace the string
with a new pattern. It all depends on the utility.
There are three important parts to a regular expression. Anchors are used to specify the position of the pattern
in relation to a line of text. Character Sets match one or more characters in a single position. Modifiers specify
how many times the previous character set is repeated.
There are also two types of regular expressions: the "Basic" regular expression, and the "extended" regular
expression.
Some characters have a special meaning in regular expressions. If you want to search for such a character, escape
it with a backslash.
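As a small illustration of escaping, here is a sketch that assumes a POSIX-style grep and a few made-up sample lines. The "." character is special (it matches any single character), so searching for a literal period requires the backslash. The variable names and sample text are hypothetical.

```shell
# Hypothetical sample lines, made up for this illustration.
sample='a.c
abc
a-c'

# Unescaped: "." is special and matches any single character,
# so all three lines match.
unescaped=$(printf '%s\n' "$sample" | grep -c 'a.c')

# Escaped: "\." matches only a literal period,
# so only the first line matches.
escaped=$(printf '%s\n' "$sample" | grep -c 'a\.c')

echo "unescaped: $unescaped, escaped: $escaped"
```

The single quotes keep the shell from interpreting the backslash before grep sees it.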
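Square brackets define a character set. This pattern matches a line that contains exactly one digit:
^[0123456789]$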
This is verbose. You can use the hyphen between two characters to specify a range:
^[0-9]$
You can intermix explicit characters with character ranges. This pattern will match a single character that is a
letter, number, or underscore:
[A-Za-z0-9_]
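To see the range patterns in action, the following sketch runs them through grep on some invented input. It assumes a POSIX-style grep. The sample lines and variable names are hypothetical, and the second pattern adds the "^" and "$" anchors so that it only accepts lines that are a single character long.

```shell
# Hypothetical sample lines, made up for this illustration.
sample='7
42
x
_'

# A line consisting of exactly one digit: only "7" qualifies.
single_digit=$(printf '%s\n' "$sample" | grep '^[0-9]$')

# A line consisting of exactly one letter, digit, or underscore:
# "7", "x" and "_" qualify, "42" is two characters long.
word_chars=$(printf '%s\n' "$sample" | grep -c '^[A-Za-z0-9_]$')

echo "single digit line: $single_digit"
echo "single word-character lines: $word_chars"
```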
The "*" modifier matches zero or more copies of the previous character set. This explains why the pattern "^#*" is
useless: it matches any number of "#" characters at the beginning of the line, including zero. Therefore it will
match every line, because every line starts with zero or more "#" characters.
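A quick way to convince yourself is to count matching lines with grep. This is only a sketch, assuming a POSIX-style grep. The sample lines are invented, and the pattern "^##*" (one "#" followed by zero or more) is shown as one possible alternative, not as something taken from the text above.

```shell
# Hypothetical sample lines, made up for this illustration.
sample='# comment
code
## heading'

# "^#*" allows zero "#" characters, so every line matches.
all_lines=$(printf '%s\n' "$sample" | grep -c '^#*')

# "^##*" requires at least one "#", so only the two
# lines that really start with "#" match.
hash_lines=$(printf '%s\n' "$sample" | grep -c '^##*')

echo "matched by ^#*:  $all_lines"
echo "matched by ^##*: $hash_lines"
```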
You can specify the minimum and maximum number of repeats by putting the two numbers between "\{" and "\}".
An example is in order. The regular expression to match 4, 5, 6, 7 or 8 lower case letters is
[a-z]\{4,8\}
Any numbers between 0 and 255 can be used. The second number may be omitted, which removes the upper
limit. If the comma and the second number are omitted, the pattern must be duplicated the exact number of
times specified by the first number.
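The three forms of the modifier can be compared side by side. The sketch below assumes a grep that supports the "\{min,max\}" syntax. The sample words and variable names are made up, and the "^" and "$" anchors are added so that a long word cannot match through a shorter piece of itself.

```shell
# Hypothetical sample lines, made up for this illustration.
sample='abc
abcd
abcdefgh
abcdefghi'

# Between 4 and 8 lower case letters: "abcd" and "abcdefgh".
four_to_eight=$(printf '%s\n' "$sample" | grep -c '^[a-z]\{4,8\}$')

# Second number omitted, no upper limit: everything except "abc".
at_least_four=$(printf '%s\n' "$sample" | grep -c '^[a-z]\{4,\}$')

# Comma and second number omitted, exactly 4 letters: "abcd".
exactly_four=$(printf '%s\n' "$sample" | grep '^[a-z]\{4\}$')

echo "4 to 8: $four_to_eight, 4 or more: $at_least_four, exactly 4: $exactly_four"
```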
Matching words with \< and \>
Searching for a word isn't quite as simple as it at first appears. The string "the" will match the word "other". You
can put spaces before and after the letters and use this regular expression: " the ". However, this does not match
words at the beginning or end of the line. And it does not match the case where there is a punctuation mark
after the word. There is an easy solution. The characters "\<" and "\>" are similar to the "^" and "$" anchors, as
they don't occupy the position of a character. They do "anchor" the expression between them so that it only matches on a
word boundary. The pattern to search for the word "the" would be "\<[tT]he\>". The character before the "t"
must be either a new line character, or anything except a letter, number, or underscore. The character after the
"e" must also be a character other than a number, letter, or underscore or it could be the end of line character.
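The difference between the three approaches shows up clearly when you count matching lines. This is a sketch that assumes a grep supporting "\<" and "\>" (GNU grep does, other versions may not). The sample lines and variable names are invented for the illustration.

```shell
# Hypothetical sample lines, made up for this illustration.
sample='the end
other things
Then there
over the, hill
see The
in the box'

# Plain string: also matches "other" and "there".
plain=$(printf '%s\n' "$sample" | grep -c 'the')

# Spaces on both sides: misses the word at the start of a line
# and the word followed by punctuation.
spaced=$(printf '%s\n' "$sample" | grep -c ' the ')

# Word anchors: matches "the" or "The" only as a whole word.
word=$(printf '%s\n' "$sample" | grep -c '\<[tT]he\>')

echo "plain: $plain, spaced: $spaced, word anchors: $word"
```

With this input the plain string matches 5 lines, the spaced version only 1, and the anchored pattern the 4 lines that actually contain the word.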
Potential Problems
That completes a discussion of the Basic regular expression. Before I discuss the extensions the extended
expressions offer, I wanted to mention two potential problem areas. The "\<" and "\>" characters were
introduced in the vi editor. The other programs didn't have this ability at that time. Also, the "\{min,max\}"
modifier is new and earlier utilities didn't have this ability. This made it difficult for the novice user of regular
expressions, because it seemed each utility had a different convention. Sun has retrofitted the newest regular
expression library to all of their programs, so they all have the same ability. If you try to use these newer
features on other vendors' machines, you might find they don't work the same way.
The other potential point of
confusion is the extent of the pattern matches. Regular expressions match the longest possible pattern. That is,
the regular expression
A.*B
matches "AAB" as well as "AAAABBBBABCCCCBBBAAAB". This doesn't cause many problems using grep, because
an oversight in a regular expression will just match more lines than desired. If you use sed and your patterns get
carried away, you may end up deleting more than you wanted.
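Here is a sketch of how the longest match can bite in sed. It assumes a POSIX-style sed, and the sample line and variable names are made up. The second command, which uses "[^B]*" in place of ".*", is one common way to stop the match at the first "B". It is offered as an illustration, not as the only fix.

```shell
# Hypothetical sample line, made up for this illustration.
line='keep A1B and A2B here'

# ".*" grabs as much as it can: the match runs from the first "A"
# to the LAST "B", so the text in between is lost as well.
greedy=$(printf '%s\n' "$line" | sed 's/A.*B/X/')

# "[^B]*" cannot cross a "B", so the match stops at the first one.
narrow=$(printf '%s\n' "$line" | sed 's/A[^B]*B/X/')

echo "greedy: $greedy"   # greedy: keep X here
echo "narrow: $narrow"   # narrow: keep X and A2B here
```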