Unix Regular Expressions

Done By: Mohammed Salah Khlil

CS-2008-019

Regular Expression:

A regular expression provides a concise and flexible means for matching strings of text, such as particular
characters, words, or patterns of characters.

A regular expression is written in a formal language that can be interpreted by a regular expression processor, a
program that either serves as a parser generator or examines text and identifies parts that match the provided
specification.

Most of the UNIX utilities operate on ASCII files a line at a time.

It is simple to search for a specific word or string of characters. Almost every editor on every computer system
can do this. Regular expressions are more powerful and flexible. You can search for words of a certain size. You
can search for a word with four or more vowels that ends with an "s". Numbers, punctuation characters, you
name it, a regular expression can find it. What happens once the program you are using finds it is another matter.
Some just search for the pattern. Others print out the line containing the pattern. Editors can replace the string
with a new pattern. It all depends on the utility.

There are three important parts to a regular expression. Anchors are used to specify the position of the pattern
in relation to a line of text. Character Sets match one or more characters in a single position. Modifiers specify
how many times the previous character set is repeated.

There are also two types of regular expressions: the "Basic" regular expression, and the "extended" regular
expression.

UNIX Basic Regular Expression:

The Basic Regular Expression syntax was designed mostly for backward compatibility with the traditional (Simple
Regular Expression) syntax, but it provided a common standard that has since been adopted as the default syntax of
many Unix regular expression tools, though there is often some variation or additional features. Many such tools
also support the Extended Regular Expression syntax through command-line arguments.

The Anchor Characters: ^ and $


Most UNIX text facilities are line oriented. Searching for patterns that span several lines is not easy to do. You
see, the end-of-line character is not included in the block of text that is searched. It is a separator. Regular
expressions examine the text between the separators. If you want to search for a pattern that is at one end or
the other, you use anchors. The character "^" is the starting anchor, and the character "$" is the end anchor. The
regular expression "^A" will match all lines that start with a capital A. The expression "A$" will match all lines
that end with a capital A. If the anchor characters are not used at the proper end of the pattern, then they no
longer act as anchors. That is, the "^" is only an anchor if it is the first character in a regular expression. The "$" is
only an anchor if it is the last character. The expression "$1" does not have an anchor. Neither does "1^". If you
need to match a "^" at the beginning of the line, or a "$" at the end of a line, you can escape the special
character with a backslash.

The use of "^" and "$" as indicators of the beginning or end of a line is a convention other utilities use. The vi
editor uses these two characters as commands to go to the beginning or end of a line. The C shell uses "!^" to
specify the first argument of the previous line, and "!$" is the last argument on the previous line. It is one of
those choices that other utilities go along with to maintain consistency. For instance, "$" can refer to the last
line of a file when using ed and sed. The command "cat -e" marks the end of each line with a "$".

Pattern    Matches
^A         "A" at the beginning of a line
A$         "A" at the end of a line
A^         "A^" anywhere on a line
$A         "$A" anywhere on a line
^^         "^" at the beginning of a line
$$         "$" at the end of a line
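
The anchor rules above can be tried with a short sketch. The sample lines and variable names are made up for illustration, and the commands assume a grep that follows the Basic syntax described here.

```shell
# Hypothetical sample input: four lines fed to grep on standard input.
lines='Apple
banana A
A^B
^caret first'

# ^A : "^" is first, so it anchors "A" to the start of the line.
starts=$(printf '%s\n' "$lines" | grep '^A')

# A$ : "$" is last, so it anchors "A" to the end of the line.
ends=$(printf '%s\n' "$lines" | grep 'A$')

# A^ : "^" is not first, so it is an ordinary character here.
literal=$(printf '%s\n' "$lines" | grep 'A^')

# ^^ : the first "^" is the anchor, the second is a literal caret.
caret_first=$(printf '%s\n' "$lines" | grep '^^')

printf 'starts with A:\n%s\n' "$starts"
printf 'ends with A: %s\n' "$ends"
printf 'contains A^: %s\n' "$literal"
printf 'starts with ^: %s\n' "$caret_first"
```

Single quotes around each pattern keep the shell from treating "$" or "^" specially before grep sees them.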

Match any character with .


The character "." is one of those special meta-characters. By itself it will match any character, except the
end-of-line character. The pattern that will match a line with a single character is ^.$

Matching a character with a character set


The simplest character set is a character. The regular expression "the" contains three character sets: "t", "h" and
"e". It will match any line with the string "the" inside it. This would also match the word "other". To prevent this,
put spaces before and after the pattern: " the ". You can combine the string with an anchor. The pattern
"^From: " will match the lines of a mail message that identify the sender. Use this pattern with grep to print every
address in your incoming mailbox:

grep '^From: ' /usr/spool/mail/$USER

Some characters have a special meaning in regular expressions. If you want to search for such a character, escape
it with a backslash.
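
As a small illustration of escaping, the sketch below compares an unescaped and an escaped period. The sample lines are invented for the example.

```shell
# Hypothetical sample input.
lines='3.14
3514
pi'

# Unescaped "." matches any character, so "3.14" and "3514" both match.
any=$(printf '%s\n' "$lines" | grep -c '3.1')

# Escaped "\." matches only a literal period, so only "3.14" matches.
dot=$(printf '%s\n' "$lines" | grep -c '3\.1')

echo "unescaped: $any, escaped: $dot"
```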

Specifying a Range of Characters with [...]


If you want to match specific characters, you can use the square brackets to identify the exact characters you are
searching for. The pattern that will match any line of text that contains exactly one digit, and nothing else, is

^[0123456789]$

This is verbose. You can use the hyphen between two characters to specify a range:

^[0-9]$

You can intermix explicit characters with character ranges. This pattern will match a single character that is a
letter, number, or underscore:

[A-Za-z0-9_]
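
A quick way to check these bracket patterns is to run them over a few one-item lines. This is only a sketch: the sample data is invented, and LC_ALL=C is an added assumption so that ranges such as "A-Z" follow plain ASCII order.

```shell
# Hypothetical sample input, one candidate per line.
lines='7
42
x
_
-'

# ^[0-9]$ : the whole line is exactly one digit.
one_digit=$(printf '%s\n' "$lines" | LC_ALL=C grep '^[0-9]$')

# ^[A-Za-z0-9_]$ : the whole line is one letter, digit, or underscore.
word_chars=$(printf '%s\n' "$lines" | LC_ALL=C grep -c '^[A-Za-z0-9_]$')

echo "single digit: $one_digit"
echo "single word characters: $word_chars"
```

Here "42" fails both patterns because the anchors leave room for only one character, and "-" fails because it is not in either set.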

Exceptions in a character set


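A "^" as the first character after the "[" inverts the set, so that it matches any character except those listed.
The characters "]" and "-" lose their special meaning when they directly follow the "[" (or the "[^").

Regular Expression   Matches
[]                   The characters "[]"
[0]                  The character "0"
[0-9]                Any number
[^0-9]               Any character other than a number
[-0-9]               Any number or a "-"
[0-9-]               Any number or a "-"
[^-0-9]              Any character except a number or a "-"
[]0-9]               Any number or a "]"
[0-9]]               Any number followed by a "]"
[0-9-z]              Any number, or any character between "9" and "z"
[0-9\-a\]]           Any number, or a "-", an "a", or a "]"

The last row assumes a tool that treats a backslash inside brackets as an escape. POSIX tools such as grep treat it
as a literal character there, so placing the "-" or "]" first, as in the earlier rows, is the portable form.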
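
Some of the rows above can be checked with grep. The sketch below uses invented sample lines and covers only the forms where "-" or "]" comes first in the set.

```shell
# Hypothetical sample input.
lines='abc
123
12-3
a1
]'

# [^0-9] : lines containing at least one character that is not a digit.
non_digit=$(printf '%s\n' "$lines" | grep -c '[^0-9]')

# [^-0-9] : lines containing a character that is neither a digit nor "-".
no_digit_or_dash=$(printf '%s\n' "$lines" | grep '[^-0-9]')

# []0-9] : lines containing a digit or a "]".
bracket_or_digit=$(printf '%s\n' "$lines" | grep -c '[]0-9]')

echo "lines with a non-digit: $non_digit"
printf 'lines with something other than digits and dashes:\n%s\n' "$no_digit_or_dash"
echo "lines with a digit or ]: $bracket_or_digit"
```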

Repeating character sets with *


The third part of a regular expression is the modifier. It is used to specify how many times you expect to see the
previous character set. The special character "*" matches zero or more copies. That is, the regular expression
"0*" matches zero or more zeros, while the expression "[0-9]*" matches zero or more numbers.

This explains why the pattern "^#*" is useless, as it matches any number of "#"s at the beginning of the
line, including zero. Therefore this will match every line, because every line starts with zero or more "#"s.
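
The point about "^#*" can be seen by counting matching lines. In this sketch the sample input is invented, and "^##*" (one "#" followed by zero or more) is shown as one way to require at least one "#".

```shell
# Hypothetical sample input: two lines start with "#", one does not.
lines='# comment
code
## heading'

# ^#* : zero or more "#" at the start, which every line satisfies.
every=$(printf '%s\n' "$lines" | grep -c '^#*')

# ^##* : one "#" followed by zero or more "#", so at least one is required.
hashed=$(printf '%s\n' "$lines" | grep -c '^##*')

echo "lines matched by ^#*: $every"
echo "lines matched by ^##*: $hashed"
```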

Matching a specific number of sets with \{ and \}


You cannot specify a maximum number of sets with the "*" modifier. There is a special pattern you can use to
specify the minimum and maximum number of repeats. This is done by putting those two numbers between "\{"
and "\}". The backslashes deserve a special discussion. Normally a backslash turns off the special meaning for a
character. A period is matched by a "\." and an asterisk is matched by a "\*".
If a backslash is placed before a "<", ">", "{", "}", "(", ")", or before a digit, the backslash turns on a special
meaning. This was done because these special functions were added late in the life of regular expressions.
Changing the meaning of "{" would have broken old expressions. This is a horrible crime punishable by a year of
hard labor writing COBOL programs. Instead, adding a backslash added functionality without breaking old
programs. Rather than complain about the asymmetry, view it as evolution.

Regular Expression   Matches
*                    Any line with an asterisk
\*                   Any line with an asterisk
\\                   Any line with a backslash
^*                   Any line starting with an asterisk
^A*                  Any line
^A\*                 Any line starting with an "A*"
^AA*                 Any line starting with at least one "A"
^AA*B                Any line starting with one or more "A"s followed by a "B"
^A\{4,8\}B           Any line starting with 4, 5, 6, 7 or 8 "A"s followed by a "B"
^A\{4,\}B            Any line starting with 4 or more "A"s followed by a "B"
^A\{4\}B             Any line starting with "AAAAB"
\{4,8\}              Any line with "{4,8}"
A{4,8}               Any line with "A{4,8}"

Having convinced you that "\{" isn't a plot to confuse you, an example is in order. The regular expression to
match 4, 5, 6, 7 or 8 lower case letters is

[a-z]\{4,8\}

Any numbers between 0 and 255 can be used. The second number may be omitted, which removes the upper
limit. If the comma and the second number are omitted, the pattern must be duplicated the exact number of
times specified by the first number.
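
The three forms of the "\{" and "\}" modifier can be compared on lines with a known number of "A"s. This is a sketch with invented input; the helper variable a4 exists only to make the counts easy to read.

```shell
# Build hypothetical lines with 3, 4, 8 and 9 "A"s, each followed by "B".
a4='AAAA'
lines=$(printf '%s\n' "AAAB" "${a4}B" "${a4}${a4}B" "${a4}${a4}AB")

# \{4,8\} : between 4 and 8 repeats.
ranged=$(printf '%s\n' "$lines" | grep '^A\{4,8\}B')

# \{4,\} : 4 or more repeats, no upper limit.
at_least=$(printf '%s\n' "$lines" | grep -c '^A\{4,\}B')

# \{4\} : exactly 4 repeats.
exact=$(printf '%s\n' "$lines" | grep '^A\{4\}B')

printf '4 to 8:\n%s\n' "$ranged"
echo "4 or more: $at_least line(s)"
echo "exactly 4: $exact"
```

The "^" anchor matters here: without it, a line with nine "A"s would still contain a run of eight followed by "B".
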

Matching words with \< and \>

Searching for a word isn't quite as simple as it at first appears. The string "the" will match the word "other". You
can put spaces before and after the letters and use this regular expression: " the ". However, this does not match
words at the beginning or end of the line. And it does not match the case where there is a punctuation mark
after the word. There is an easy solution. The characters "\<" and "\>" are similar to the "^" and "$" anchors, as
they don't occupy a position of a character. They do "anchor" the expression between them so that it only matches
on a word boundary. The pattern to search for the word "the" would be "\<[tT]he\>". The character before the "t"
must be either a new line character, or anything except a letter, number, or underscore. The character after the
"e" must also be a character other than a number, letter, or underscore or it could be the end of line character.

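The three ways of searching for "the" discussed above can be compared side by side. This sketch uses invented sample lines and assumes a grep that supports "\<" and "\>" (GNU grep does; not every implementation is guaranteed to).

```shell
# Hypothetical sample input.
lines='the end
other
The cat, the dog.
bathe'

# Plain string: also matches inside "other" and "bathe".
loose=$(printf '%s\n' "$lines" | grep -c 'the')

# Spaces on both sides: misses "the" at the start of a line.
spaced=$(printf '%s\n' "$lines" | grep -c ' the ')

# Word anchors: matches "the" or "The" only as a whole word.
whole=$(printf '%s\n' "$lines" | grep '\<[tT]he\>')

echo "plain: $loose, spaced: $spaced"
printf 'whole word:\n%s\n' "$whole"
```
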
Backreferences - Remembering patterns with \(, \) and \1


Another pattern that requires a special mechanism is searching for repeated words. The expression "[a-z][a-z]"
will match any two lower case letters. If you wanted to search for lines that had two adjoining identical letters,
the above pattern wouldn't help. You need a way of remembering what you found, and seeing if the same
pattern occurred again. You can mark part of a pattern using "\(" and "\)". You can recall the remembered
pattern with "\" followed by a single digit. Therefore, to search for two identical letters, use "\([a-z]\)\1". You
can have 9 different remembered patterns. Each occurrence of "\(" starts a new pattern. The regular expression
that would match a 5 letter palindrome, (e.g. "radar"), would be \([a-z]\)\([a-z]\)[a-z]\2\1
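
Both backreference patterns above can be run through grep as written. The word list in this sketch is invented for illustration.

```shell
# Hypothetical sample input.
lines='radar
level
hello
abcde'

# \([a-z]\)\1 : a lower case letter followed by the same letter again.
doubled=$(printf '%s\n' "$lines" | grep '\([a-z]\)\1')

# Five-letter palindrome: \2 and \1 recall the second and first letters.
palindromes=$(printf '%s\n' "$lines" | grep '\([a-z]\)\([a-z]\)[a-z]\2\1')

echo "doubled letter: $doubled"
printf 'palindromes:\n%s\n' "$palindromes"
```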

Potential Problems

That completes a discussion of the Basic regular expression. Before I discuss the extensions the extended
expressions offer, I wanted to mention two potential problem areas. The "\<" and "\>" characters were
introduced in the vi editor. The other programs didn't have this ability at that time. Also the "\{min,max\}"
modifier is new and earlier utilities didn't have this ability. This made it difficult for the novice user of regular
expressions, because it seemed each utility had a different convention. Sun has retrofitted the newest regular
expression library to all of their programs, so they all have the same ability. If you try to use these newer
features on other vendors' machines, you might find they don't work the same way.

The other potential point of confusion is the extent of the pattern matches. Regular expressions match the
longest possible pattern. That is, the regular expression

A.*B

matches "AAB" as well as "AAAABBBBABCCCCBBBAAAB". This doesn't cause many problems using grep, because
an oversight in a regular expression will just match more lines than desired. If you use sed, and your patterns get
carried away, you may end up deleting more than you wanted to.
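
The effect of longest-match behavior in sed can be sketched as follows. The input string is invented, and the narrower pattern "A[^B]*B" is just one possible way to stop at the first "B", not something the text above prescribes.

```shell
# Hypothetical input containing two "A...B" stretches.
text='xxAAByyAzzBww'

# Mark what A.*B actually matches: it runs from the first "A" to the last "B".
marked=$(printf '%s\n' "$text" | sed 's/A.*B/[&]/')

# Deleting with the same pattern removes everything in between as well.
deleted=$(printf '%s\n' "$text" | sed 's/A.*B//')

# A narrower pattern: "A", then any characters that are not "B", then "B".
narrow=$(printf '%s\n' "$text" | sed 's/A[^B]*B//')

echo "matched span: $marked"
echo "after greedy delete: $deleted"
echo "after narrow delete: $narrow"
```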

UNIX Extended Regular Expression:


With these extensions, special characters preceded by a backslash no longer have the special meaning: "\{" ,
"\}", "\<", "\>", "\(", "\)" as well as the "\digit". There is a very good reason for this, which I will delay explaining
to build up suspense.

Regular Expression   Class      Type            Meaning
.                    all        Character Set   A single character (except newline)
^                    all        Anchor          Beginning of line
$                    all        Anchor          End of line
[...]                all        Character Set   Range of characters
*                    all        Modifier        Zero or more duplicates
\<                   Basic      Anchor          Beginning of word
\>                   Basic      Anchor          End of word
\(..\)               Basic      Backreference   Remembers pattern
\1..\9               Basic      Reference       Recalls pattern
\{M,N\}              Basic      Modifier        M to N duplicates
+                    Extended   Modifier        One or more duplicates
?                    Extended   Modifier        Zero or one duplicate
(...|...)            Extended   Anchor          Shows alternation
\(...\|...\)         EMACS      Anchor          Shows alternation
\w                   EMACS      Character set   Matches a letter in a word
\W                   EMACS      Character set   Opposite of \w
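
The Extended rows of the table can be tried with grep's -E option, which selects the extended syntax. Using grep -E here is an assumption of this sketch, since the text above does not name a specific tool, and the sample words are invented.

```shell
# Hypothetical sample input.
lines='color
colour
colouur
cat
dog
bird'

# ? : zero or one "u".
optional=$(printf '%s\n' "$lines" | grep -E -c '^colou?r$')

# + : one or more "u".
repeated=$(printf '%s\n' "$lines" | grep -E -c '^colou+r$')

# (...|...) : either alternative.
either=$(printf '%s\n' "$lines" | grep -E '^(cat|dog)$')

echo "colou?r matches $optional line(s), colou+r matches $repeated line(s)"
printf 'cat or dog:\n%s\n' "$either"
```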
