Revision: $Revision: 1.4 $ ($Date: 2007-01-10 15:36:27 $)
This topic has a total weight of 3 points and contains the following objective:
Candidates should be able to write simple scripts to automate tasks using different common scripting languages. Tasks for automation include checking processes, process execution, parsing logs, synchronising files across machines, monitoring files for changes, generating and sending e-mail alerts and notifying administrators when specified users log in or out.
Key knowledge area(s):
| Standard text manipulation software such as awk and sed |
| BASH |
| cron configuration files |
| at daemon usage |
| Remote copying software such as rsync and scp |
| Perl: Basic commands |
The following is a partial list of the used files, terms and utilities:
| perl |
| bash |
| awk |
| sed |
| crontab |
| at |
This chapter discusses three main programs that work with Regular Expressions: Perl, sed and awk. Because Regular Expressions require a rather extended explanation and are common in all three programs, the Regular Expressions are discussed first.
The chapter concludes with topics about rsync, crontab and monitoring your system, as listed in the LPIC objectives above.
Revision: $Revision: 1.10 $ ($Date: 2007-01-11 08:31:04 $)
Resources: (none).
Further reading: Friedl01, manpages for the various commands.
Regular Expressions are, as you will read further on in this chapter, the salt of all Unix systems. Regular Expression patterns are used in grep and derivatives, Perl, awk and sed. In general, there are lots of similarities between the Regular Expression dialects used by each of these programs. When able to work with one dialect, it is quite easy to work with the other dialects too. There are, however, annoying and dangerous differences. In this section the commonalities will be described and the differences clarified. Unfortunately, portability is an issue: there are even different Regular Expression dialects for the same program on different operating systems. Linux Regular Expression support is less complicated, but even here differences exist. This point will gradually become clear in this section.
In keeping with common practice in Perl documentation, the term Perl will be used for the Perl language and perl will be used for the program.
A Regular Expression is a pattern. The pattern is matched against some text. The result: the pattern will either fit or not fit. There is a lot to say about this pattern language. The text is frequently called the input. On the command line or in scripts, it can originate from an input stream (standard input) or from a set of files. In these cases the input will almost always be a line of text. When using a programming language that supports Regular Expressions, the pattern will probably be matched against the contents of a buffer.
When using Regular Expressions on the command line, such as with
grep, you must protect the Regular Expression from
expansion by the shell. Putting a single (forward)
quote on each side of the Regular Expression will do the job:
grep 'loo*ny' tune.texts
This (rather long) section will discuss Regular Expressions in detail. First, the Regular Expression language (including variants) is introduced. Next, Regular Expressions are applied using specific programs. So, take a deep breath and start reading.
The parts of a Regular Expression most commonly used are the primitive and the multiplier. They frequently occur in pairs. An example of such a pair is:
E{4}
In this example, E is the primitive and
{4} is the multiplier, meaning four times the thing
before it.
This Regular Expression will match four consecutive capital
E characters somewhere in the
input.
Given the Regular Expression above, as an example, and an input line such as the following:
I like lots of EEEE's
the Regular Expression will match the four E's in the input. Moreover, the complete line will match, since the Regular Expression fits somewhere in the line. A member of the grep family, such as egrep, will show the input line if the Regular Expression matches part or all of it. So, if egrep were to handle something like this
echo "Lots of EEEE's" | egrep 'E{4}'
then egrep would show the whole line.
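The behaviour described above can be checked with a small shell sketch. It uses grep -E, which is equivalent to egrep; the two input lines are made up for this illustration.

```shell
# Two made-up input lines; only the first contains four consecutive E's.
# grep -E behaves like egrep; -c prints the number of matching lines.
matches=$(printf '%s\n' "I like lots of EEEE's" 'Only three: EEE' |
    grep -E -c 'E{4}')
echo "lines matching E{4}: $matches"
```

Only the line that contains the four E's somewhere is counted; the line with three E's does not fit.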
A multiplier may be omitted, in which case a multiplier of one time is added automatically to each primitive. So the Regular Expression:
head
effectively means:
one h, followed by one e, followed by one a, followed by one d.
Both primitives and multipliers are discussed below (the section called “Primitives” and the section called “Multipliers”). More facilities are available for Regular Expressions: anchoring, grouping and alternation. These are discussed later on.
There are three types of primitives: a regular character, the dot placeholder and the character class. The character class is the most complicated of these.
Any regular character can be a primitive. This includes
letters (like E) and digits.
For instance, in the following example the capital
E is the primitive:
E{4}
What is the primitive here:
4{2}
The correct answer is 4.
Other characters can also be used as primitives in a Regular Expression. For example:
ls -la | egrep ' {11}'
(the Regular Expression is a space followed by multiplier
{11}). This will show lines containing at least
11 consecutive spaces.
Some characters have special meanings, such as the curly brace
({). These characters must be
escaped (they are preceded by a backslash
(\)) to remove the special meaning.
This will be discussed further later on.
The dot has a special meaning. As a primitive, it can be a placeholder to allow any character in that position. For instance:
a.c
allows for one (remember the one-time multiplier being added?)
a, followed by any single character,
followed by one c. So
echo reading the abc | egrep 'a.c'
fits.
And what will this fit on?
a.{3}
The answer: one a followed by three arbitrary
characters.
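A minimal sketch to try the dot placeholder on a few made-up inputs. The fits helper is a hypothetical name introduced only for this illustration; it wraps grep -E (the same as egrep) and reports whether the pattern fits somewhere in the text.

```shell
# fits PATTERN TEXT: succeeds if the extended Regular Expression matches
# somewhere in TEXT (hypothetical helper, not part of any tool).
fits() { printf '%s\n' "$2" | grep -E -q -e "$1"; }

# a.c needs exactly one arbitrary character between the a and the c.
for text in abc a-c ac abbc; do
    if fits 'a.c' "$text"; then
        echo "$text: fits a.c"
    else
        echo "$text: does not fit a.c"
    fi
done

# a.{3} needs an a followed by three arbitrary characters.
fits 'a.{3}' 'a123' && echo 'a123: fits a.{3}'
fits 'a.{3}' 'a12'  || echo 'a12: does not fit a.{3}'
```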
A character class is a placeholder for a number of characters. It is more restrictive than the dot placeholder, since it allows you to specify the characters the placeholder should match.
Keep in mind that the character class is a primitive: it is a placeholder for exactly one of the specified characters.
To set up a character class, start with a [
(opening square bracket) and end it with ]
(closing square bracket). Any combination of
sets and ranges can be
specified between these square brackets.
To specify a set (a series) of characters in a character class, simply put them between the square brackets. So,
[aouiyA]
specifies a set consisting of an a (both
lower and upper case), an o,
a u, an i or a y.
Remember that the character class is a primitive. That is, it can be multiplied:
[aouiyA]{2}
This specifies two characters, each of which may be one of the characters inside the square brackets.
The order of the characters in the set is irrelevant. That is,
[abcABC] is the same as the
[aAbBcC] character class. So,
echo reading the abc | egrep 'a[bB]c'
fits, as would
echo reading the aBc | egrep 'a[bB]c'
A character range consists of two characters separated by a minus sign. This specifies one digit:
[0-9]
This is one lower case letter:
[a-z]
Ranges can be combined:
[a-z0-9]
Some sets can be written as ranges. For example, the Regular Expression:
[ghiGHI]{2}
can also be written as:
[g-iG-I]{2}
It matches two of the indicated characters.
For now, the order is defined by the ASCII character set (or derivatives like iso8859-x). In the ASCII character set, letters and digits are not adjacent. That is why they need to be specified separately as ranges:
[a-z0-9]
More and more systems support so-called locale handling (such as the character set for a given country) of Regular Expressions. Emerging character sets such as Unicode will probably change this.
Ranges and sets can be combined at will. This specifies one of the named vowels or a digit:
[aouiy0-9]
And this one matches three characters:
[A-Z][aouiy0-9][A-Z]
Which ones? Answer: three characters, the first and the third an uppercase letter with a vowel or digit in the middle. In the following example the Regular Expression matches:
echo This is JaN | egrep '[A-Z][aouiy0-9][A-Z]'
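The combination of sets and ranges can be tried out with the sketch below. The fits helper and the sample texts are assumptions made for this illustration; LC_ALL=C is set so that the ranges follow the ASCII ordering discussed above.

```shell
# fits PATTERN TEXT (hypothetical helper). LC_ALL=C pins the ASCII ordering
# that the ranges in this section rely on.
fits() { printf '%s\n' "$2" | LC_ALL=C grep -E -q -e "$1"; }

pattern='[A-Z][aouiy0-9][A-Z]'
for text in 'This is JaN' 'Room B7C' 'This is Jan' 'JeN'; do
    if fits "$pattern" "$text"; then
        echo "match:    $text"
    else
        echo "no match: $text"
    fi
done
```

Note that JeN does not fit: the set in the middle names a, o, u, i and y, but not e.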
It is often desirable to exclude certain characters from a character class. That's where the inverted character class comes in.
The inverted character class starts with a
caret (^) as the first character after the
opening square bracket ([) that starts the
character class.
The caret sign at the beginning inverts (negates) the meaning of
the character class. Where
[0-9]
fits one digit,
[^0-9]
will fit any character that is not a digit.
This, of course, can be extended with every character class:
[^aeiouy0-9]
will fit anything that is neither a vowel nor a digit.
The POSIX character classes form an extension to the character classes that were discussed above. They come in the following format:
[[:keyword:]]
The keyword is a word describing the class, for
example, alnum or blank.
Current implementations of GNU grep and
egrep, as well as GNU awk
(gawk), come with manpages that describe the POSIX
character classes. Let's look at some examples.
The outer square brackets are the delimiters of any character class. The inner square brackets and colons plus the keyword define the POSIX character set.
Table 13.1. Overview of character classes
| Notation | Description |
|---|---|
| [[:alnum:]] | It fits exactly one alphanumeric character. It is the same as [a-zA-Z0-9] |
| [[:alpha:]] | This class matches any letter |
| [[:upper:]] | This class matches any uppercase letter (same as: [A-Z]) |
| [[:lower:]] | This class matches any lowercase letter (same as: [a-z]) |
| [[:digit:]] | This class matches any digit (same as: [0-9]) |
| [[:xdigit:]] | This class matches any hexadecimal digit (same as: [0-9A-Fa-f]) |
| [[:blank:]] | It matches a space or a tab character (a subset of the wider set called white space) |
| [[:space:]] | It matches each white-space character, including space, tab, formfeed and carriage return |
| [[:punct:]] | This class matches all punctuation characters |
Multiple POSIX-character-set definitions can be combined in one character class:
[[:punct:][:space:]]
This one fits a character which is either a punctuation character (such as a semicolon or a plus sign) or a white-space character.
The POSIX character set specifiers can be combined with any other character-class element, as shown below:
[0-9]{4}[[:space:],.:][0-9]{2}
In words: four digits, then one character that can be either a white-space character, a comma, a dot or a colon, followed by two digits. Inside a character class the dot is taken literally, so it does not need a backslash there.
Since the POSIX character class extensions are extensions to the regular character class, they can be inverted the same way:
[^[:digit:]]
This matches one non-digit. It is identical to:
[^0-9]
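The equivalence of the two notations can be checked with a short sketch. The fits helper and the sample inputs are hypothetical and only serve this illustration; a grep with POSIX character-class support (such as GNU grep) is assumed.

```shell
# fits PATTERN TEXT (hypothetical helper): succeeds if PATTERN fits in TEXT.
fits() { printf '%s\n' "$2" | grep -E -q -e "$1"; }

# Both spellings of "one non-digit" should agree on every input.
for text in 12345 '12 45' 12a45; do
    for pattern in '[^0-9]' '[^[:digit:]]'; do
        if fits "$pattern" "$text"; then
            echo "'$text' contains a non-digit according to $pattern"
        else
            echo "'$text' has no non-digit according to $pattern"
        fi
    done
done
```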
A multiplier (also known as quantifier) works in conjunction with a primitive. The multiplier is placed after the primitive. Together they form a specification of how many of which characters are wanted.
Unfortunately, before multipliers can be explained in detail, the two main variants of Regular Expressions, which mostly affect the way multipliers are spelled, need to be introduced. The classic Regular Expressions are the ones implemented in grep and sed. The modern, or extended Regular Expressions, on the other hand, are the ones used by awk, egrep, Perl and flex, to name a few. The multipliers shown so far were in the extended style. For this reason, egrep was used in the examples.
We will discuss 4 multipliers. They are
the *, ? and
+ multipliers (and their classic counterparts
*, \? and
\+, yes, the * is the same in
both cases).
You can also use the curly-brace multipliers we've already met.
These are part of the POSIX standard.
All relevant GNU programs have support for curly-brace multipliers
on board; even versions for classic Regular Expression exist. In
GNU awk (gawk) a special --posix
option must be given to turn on support for curly-brace multipliers.
Table 13.2, “Multipliers” at the end of this section
will provide an overview of Regular Expression multipliers.
The * is the oldest multiplier around.
As far as I know, it is implemented in every Regular Expression
language. It is the same in classic and extended variants.
The * means: zero or more
times. So,
a*
means zero or more a's. This may not be that
meaningful, since it is always true. In larger combinations,
however, its use can be more meaningful:
ba*c
That is, one b, and one c,
possibly with a's between them.
The ? multiplier means zero or one
time, in other words, present or
not present. The extended version is
?, the version to use in classic Regular
Expressions is \?.
An example: in sgml/xml sections start with
<section followed by a lot of things, and end
with </section, also followed by other things.
To fetch both lines, use
egrep '</?section' regexp.xml
The classic variant is
grep '</\?section' regexp.xml
The + multiplier means one or more
times.
The extended version is +, the version to use in
classic Regular Expressions is \+.
This (extended) Regular Expression
ba+c
means one b, followed by at least one
a, followed by a c.
To find long, regardless of how many
o's it contains, use the (extended) Regular
Expression
lo+ng
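The two spellings of the + multiplier can be compared side by side. This is a sketch that assumes GNU grep (the \+ spelling in classic Regular Expressions is a GNU feature); the word list is made up.

```shell
# Made-up word list: zero, one and three o's.
words=$(printf '%s\n' lng long looong)

# Extended spelling (grep -E is the same as egrep) ...
extended=$(printf '%s\n' "$words" | grep -E 'lo+ng')
# ... and the classic spelling with a backslash before the plus.
classic=$(printf '%s\n' "$words" | grep 'lo\+ng')

printf 'extended:\n%s\nclassic:\n%s\n' "$extended" "$classic"
```

Both commands should select the same lines; lng is rejected because + requires at least one o.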
In the extended variant, a normal set of curly braces is used, as
in {4}.
In the classic variant, a matching pair of \{ and
\} (each curly brace is prefixed by
a backslash) is used, as in \{4\}.
The curly-brace multipliers come in four variants (only the extended Regular Expressions will be shown). The general form is
{m,n}
In this, both m and n are numbers indicating the lowest (m) and the highest (n) number. It matches at least m, but at most n times. For instance, to match at least two, but at most four digits:
[0-9]{2,4}
To specify a minimum number of times, omit the n (but not the comma!):
{m,}
For instance, to match at least five digits:
[0-9]{5,}
And to show lines that are at least 78 characters long (with line numbers), use
egrep -n '.{78,}' file1 file2 ...
To specify a maximum number of times, omit the m (but not the comma!):
{,n}
For instance, to match up to three digits:
[0-9]{,3}
In this case, the minimum will be zero.
The example alone will not limit the input to three digits: it simply matches up to three of them, even if more digits follow. If additional adjacent digits are not wanted, anchors must be used - see the section called “Anchors”.
To match something no more and no less than n times, specify only n (do not use a comma):
{n}
To match exactly four capital alphabetic characters, for instance:
[A-Z]{4}
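The bounds of a curly-brace multiplier only limit how much a single match consumes; they do not prevent more of the same characters from following. A small sketch, assuming a grep that supports the -o option (a GNU/BSD extension that prints every matched part on its own line):

```shell
# Seven digits, matched by "two to four digits": the first match takes as
# many digits as allowed, the second match takes the rest.
parts=$(echo 1234567 | grep -E -o '[0-9]{2,4}')
echo "$parts"

# Count the lines that contain at least five consecutive digits.
atleast5=$(printf '%s\n' 1234 12345 1234567 | grep -E -c '[0-9]{5,}')
echo "lines with five or more digits: $atleast5"
```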
Support and portability.
Curly-brace multipliers are supported by Perl.
In GNU awk, support must be enabled explicitly with the
--posix option; other awk implementations
might not (yet) support these multipliers (called
interval expressions in awk terminology).
GNU egrep supports them. GNU grep supports the classic
version. In other versions of (e)grep, support may be lacking.
GNU sed supports the classic implementation. There are
implementations of sed that do not support curly-brace multipliers
at all. In yet other implementations of sed, extended Regular
Expressions can be selected.
POSIX standards are said to require curly-brace multipliers, so
support is expected to be implemented elsewhere over time.
Table 13.3, “Portable multipliers” provides an
overview.
Table 13.2, “Multipliers” shows all multipliers and variants together.
Table 13.2. Multipliers
| extended RE | classic RE | meaning |
|---|---|---|
| * | * | zero or more times |
| ? | \? | zero or one time |
| + | \+ | one or more times |
| {m,n} | \{m,n\} | at least m, but maximally n times |
| {,n} | \{,n\} | up to n times |
| {m,} | \{m,\} | at least m times |
| {m} | \{m\} | exactly m times |

| extended RE: | Perl, GNU awk with --posix, GNU egrep |
| classic RE: | GNU grep, GNU sed |
GNU awk without --posix and other awk
implementations currently support only the *,
+ and ? multipliers.
Portability.
If your Regular Expressions should also be used on non-GNU
platforms portability becomes an issue.
Older versions of grep will probably not support the
\+ and \? multipliers.
Curly-brace multipliers (called interval
expressions in awk terminology) will probably not
be supported in other awks (though they are reported to be part of
the POSIX standard). Even GNU awk (gawk) does
not enable interval expressions by default: they can only be
enabled by invoking either the --posix or
the --re-interval option.
Table 13.3, “Portable multipliers” lists multipliers that you can
use safely across platforms. But even here, exceptions do exist.
Table 13.3. Portable multipliers
| Perl | awk, egrep | sed, grep | meaning |
|---|---|---|---|
| * | * | * | zero or more times |
| ? | ? | | zero or one time |
| + | + | | one or more times |
| {m,n} | | | at least m, but maximally n times |
| {,n} | | | up to n times |
| {m,} | | | at least m times |
| {n} | | | exactly n times |
Sometimes a Regular Expression should only match if a pattern matches at a certain position, for instance, at the beginning of the input or at the beginning of a word. This can be achieved by adding an anchor to your Regular Expression.
Anchors are also called zero width assertions because they fit between or adjacent to something.
There are two types of anchors: anchors specifying a beginning or ending of a line and anchors specifying a word boundary. Both types will be discussed.
When an anchor is used in a Regular Expression, the Regular Expression will only match if the string is in the position specified by the anchor.
Generally, newlines are stripped before the match is attempted.
This is true for at least awk, sed and all grep variants.
But, in Perl, you must either remove the newline using the
chomp operator or include it in the Regular
Expression. Furthermore, if the so-called multi-line
mode is turned on, the meaning of begin and end anchors
changes altogether in Perl. More in the section called “Perl Regular Expressions”.
Portability. Begin and end anchors are supported in all Regular Expression variants and use the same notation everywhere.
Let's look at the begin and end anchors in more detail.
The ^ (caret) is the anchor that specifies the
beginning of the input. If you attach ^ at the
beginning of your Regular Expression, matches will
then only be found at the beginning of the input.
In the example below there is a match:
echo Once upon ... | grep '^Once'
There is, however, no match in this one:
echo He said: Once ... | grep '^Once'
The $ (dollar sign) is the anchor that
specifies placement at the end of the input.
The $ is typically
attached to the end of the Regular Expression. The Regular
Expression will then only match strings
at the end of the input.
In the next example all lines that end in a slash (that is, containing directory names) will match:
ls -lF | grep '/$'
In this example, there will not be a match, since pwd does not put a slash at the end of its output:
pwd | grep '/$'
By using the begin and end anchors together, you can create a Regular Expression that matches the entire input, not just some part of it:
^[0-9]+$
specifies that the complete input should consist of digits and nothing else. So, if the input contains one non-digit, the Regular Expression will not match.
In the following example, egrep will not show empty lines or lines that only contain spaces:
egrep -v '^ *$' listofnames
So, the lines that will be shown contain at least one character that is not a space.
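Both uses of the begin and end anchors can be sketched as follows. The function name is_number and the sample values are assumptions made for this illustration.

```shell
# is_number VALUE: succeeds only if the complete input consists of digits
# (hypothetical helper built around the anchored Regular Expression).
is_number() { printf '%s\n' "$1" | grep -E -q '^[0-9]+$'; }

for value in 2007 '20 07' 20a7 ''; do
    if is_number "$value"; then
        echo "'$value' is a number"
    else
        echo "'$value' is not a number"
    fi
done

# grep -v with '^ *$' drops empty lines and lines that hold only spaces.
names=$(printf '%s\n' alice '' '   ' bob | grep -E -v '^ *$')
echo "$names"
```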
Word-boundary anchors fit between a word and something else.
In most Regular Expressions word is defined as a series of
alphanumeric characters. So, a word boundary fits between an
alphanumeric character and a non-alphanumeric character.
Furthermore, a word-boundary anchor will also fit between
an alphanumeric character and the beginning or end (where
^ and $ would match).
There are two variants of word-boundary anchors. The first is
\b, which matches on either side of a word.
The second variant consists of two anchors: \<
and \>.
The \< anchor fits either between a
non-alphanumeric character and an alphanumeric character or before
the beginning of a line if the line starts with an alphanumeric
character (like ^).
The \> anchor fits either between an
alphanumeric character and a non-alphanumeric character or
after an alphanumeric character at the end of a line (like
$).
The spelling of word-boundary anchors is the same everywhere.
Portability.
Word-boundary anchors are not supported widely, but can be found in
all GNU implementations of the programs. The only implementation
you can really be sure of is \b in Perl.
Let's look at word-boundary anchors in more detail.
The \b word-boundary anchor fits on either side
of a word. This one matches:
echo The one and only | grep '\bone'
This one does not:
echo gone with the wind | grep '\bone'
The \b anchor can also be used at the end of a
word. This one does not match:
echo printed onesided | grep 'one\b'
A more complicated example is:
\b[0-9]{6,10}\b
This will only match if the input contains a series of six, seven, eight, nine or ten digits.
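A sketch to see the word-boundary anchors at work. It assumes GNU grep, since \b is not available in every grep; the fits helper and the sample lines are hypothetical.

```shell
# fits PATTERN TEXT (hypothetical helper): succeeds if PATTERN fits in TEXT.
fits() { printf '%s\n' "$2" | grep -E -q -e "$1"; }

pattern='\b[0-9]{6,10}\b'
for text in 'order 1234567 shipped' 'order 12345 shipped' \
            'id 12345678901' 'abc123456'; do
    if fits "$pattern" "$text"; then
        echo "match:    $text"
    else
        echo "no match: $text"
    fi
done
```

Eleven digits do not fit, because there is no word boundary after the tenth digit; abc123456 does not fit either, because there is no boundary between the c and the 1.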
Table 13.4, “Anchors” provides an overview of anchors and their support.
Table 13.4. Anchors
| anchor | awk | egrep | Perl | grep, sed | meaning |
|---|---|---|---|---|---|
| ^ | yes | yes | yes | yes | beginning |
| $ | yes | yes | yes | yes | ending |
| \b | no | yes | yes | yes | word boundary |
| \< | no | yes | no | yes | left word boundary |
| \> | no | yes | no | yes | right word boundary |

| Notes |
|---|
| ^ and $ in Perl can have a slightly different meaning in conjunction with newlines. |
| GNU awk (gawk) does support \< and \> |
| GNU grep and egrep do support word boundaries. Other greps might not do this. |
Grouping is making part of a Regular Expression a separate entity. This is done by putting that part inside parentheses, like:
[0-9]:([0-9]:)
In the above, the second [0-9]: is inside
parentheses, hence it is grouped.
The group (the part between the parentheses) itself must be a correct Regular Expression.
Why would you use grouping? There are three important reasons:
a multiplier can be applied to a group. Discussed next.
the input the group matches can be re-used (so-called backreferences). Discussed next.
to use alternation in part of the Regular Expression. Will be discussed later in the section called “Grouping and Alternation”.
Unfortunately, the difference between classic and extended Regular Expressions pops up again here. Table 13.5, “Grouping operators” shows when to use which spelling.
Support and portability. Grouping support is present in Perl together with various backreference mechanisms. Grouping is also supported by all awk variants, but backreferences are not. GNU grep, egrep and sed support both grouping and backreferences. Every program that allows grouping also allows multiplication applied to a group.
A multiplier can be applied to a group. Simply put the multiplier
right after the closing parenthesis of a group.
To find a representation of time in the typical format
two digits - colon - two digits - colon - two digits
(such as 14:10:17), for example, use the
following (extended) Regular Expression:
([0-9]{2}:){2}[0-9]{2}
To find all words with “ing” twice
(such as singing), use this extended Regular
Expression:
(ing){2}
If a group is not followed by a multiplier, then multiplication by one is chosen automatically.
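The two group-plus-multiplier examples can be tried with the sketch below. The fits helper and the sample texts are assumptions for this illustration; grep -E is the same as egrep.

```shell
# fits PATTERN TEXT (hypothetical helper): succeeds if PATTERN fits in TEXT.
fits() { printf '%s\n' "$2" | grep -E -q -e "$1"; }

# Two times "two digits and a colon", followed by two more digits.
time_re='([0-9]{2}:){2}[0-9]{2}'
for text in 'started at 14:10:17' 'started at 14:10'; do
    if fits "$time_re" "$text"; then
        echo "time found in: $text"
    else
        echo "no time in:    $text"
    fi
done

# The group (ing) must occur twice in a row.
fits '(ing){2}' singing && echo 'singing contains ing twice'
fits '(ing){2}' sing    || echo 'sing does not'
```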
The part of the input on which a group matches becomes available for
re-use.
A special mechanism for backward referencing, called a
backreference
is available for this. The backreference corresponding to
the first group in a Regular Expression is indicated with
“\1” (a backslash and the digit
one), the one corresponding to the second group
\2, and so on.
If, for example, you are looking for the occurrence of four
identical digits, you might use:
([0-9])\1\1\1
Suppose this is tried on:
The 1-2-3 of 4444
First, the RE finds 1 as the first digit.
There is a group around it, so the value is saved.
In the Regular Expression there is a \1 after the
group. This tells
the Regular Expression engine to use the same value (digit 1) again.
This fails, since there is a dash after the 1.
This sequence is repeated for the 2 and the
3, but both fail.
However, when it finds the
4 it then finds the other three
4s, so the Regular Expression matches at
last.
In an earlier paragraph I needed to find words that contain
twice the same three characters, like
singing. Here is how I found words like that:
egrep '(...)\1' /usr/share/dict/american-english
(I know, I should have used [A-Za-z]{3} instead of
the dots. Since there are only letters in the file, dots work as
well).
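To avoid depending on a dictionary file, the backreference examples can also be tried on a few made-up words. This sketch assumes GNU grep, which supports backreferences in extended Regular Expressions as well as the -o option (print only the matched part).

```shell
# Four identical digits: the group saves one digit, \1 repeats it three times.
found=$(echo 'The 1-2-3 of 4444' | grep -E -o '([0-9])\1\1\1')
echo "four identical digits: $found"

# Words that contain the same three characters twice in a row.
doubled=$(printf '%s\n' singing single bonbon button | grep -E '(...)\1')
echo "$doubled"
```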
Extracting the matching part (or even subparts) using grouping is very important in Perl. More in the section called “Perl Regular Expressions”.
Using alternation, one can specify either/or constructions in a Regular Expression.
Again, the difference between classic and extended Regular Expressions pops up. Table 13.6, “Alternation operator” shows when to use which spelling.
As an example, suppose a file contains phone numbers of mobile phones.
Wanted: only the numbers that apply to The Netherlands.
In the Netherlands, mobile phone numbers start with
06. When called from outside the Netherlands,
this should be 00316. This is not true in all
countries, so some numbers start with +316.
Here is the Regular Expression to handle all three possibilities:
egrep '06|00316|\+316' phonenumbers
Alternation can be used inside a group. Then, the alternation applies only to the part in the group.
Suppose the mobile phone numbers should be at the beginning of the
input line.
Then we'll need to use an anchor. But if we use an anchor followed by the
existing Regular Expression (^06|...), then the
anchor applies only to the first number.
The following is the correct solution:
egrep '^(06|00316|\+316)' phonenumbers
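The difference between the anchored alternation with and without a group shows up as soon as a number appears later in a line. A sketch with made-up names and numbers:

```shell
# Made-up sample data; the fourth line has its number at the end.
phonenumbers=$(printf '%s\n' \
    '0612345678 Anna' \
    '0031612345678 Bert' \
    '+31612345678 Carla' \
    'Dirk, reachable at 0031612345678' \
    '0201234567 Office')

# Without a group the anchor belongs to the first alternative only.
ungrouped=$(printf '%s\n' "$phonenumbers" | grep -E -c '^06|00316|\+316')
# With a group the anchor applies to all three alternatives.
grouped=$(printf '%s\n' "$phonenumbers" | grep -E -c '^(06|00316|\+316)')

echo "without group: $ungrouped lines, with group: $grouped lines"
```

The line for Dirk is selected by the first Regular Expression only, since 00316 is found in the middle of the line.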
Another example: Find numbers in the range 1 .. 12. Leading zeroes are not allowed. Solution (extended Regular Expression):
\b([1-9]|1[0-2])\b
In words: one digit in the range 1-9 or two digits in the range 10-12.
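A quick way to convince yourself that the range check works is to feed it a few candidates. This sketch assumes GNU grep (for \b); the candidate list is made up.

```shell
# Collect every candidate that the Regular Expression accepts.
valid=''
for n in 0 1 7 07 10 12 13 123; do
    if printf '%s\n' "$n" | grep -E -q '\b([1-9]|1[0-2])\b'; then
        valid="$valid $n"
    fi
done
echo "accepted:$valid"
```

The word-boundary anchors are what rejects 07, 13 and 123: without them, the 7, the 1 or the 12 inside those inputs would fit.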
Two topics are discussed here:
setting or removing the special meaning of a character
white space
Rule 1. If a character has a special meaning by itself, it will lose this special meaning if a backslash is put before it.
The dot by itself has a special meaning, but \.
will match a literal dot, as in
[0-9]+\.[0-9]{2}
Also, to find a literal star it needs to be prefixed with a backslash. The following will match zero or more digits or stars.
[0-9\*]*
To find a phone number starting with a literal plus followed by one or more digits, use this egrep command line:
egrep '\+[0-9]+' holidayaddress.mail
Rule 2. If a backslash and a character together have a special meaning, then omitting the backslash will remove the special meaning.
In some Regular Expression variants, the \b has a
special meaning as word boundary. When the b is
used alone, it just means the letter b.
\bback
In classic Regular Expressions, the \+ has a
special meaning (a multiplier). A + without a
backslash matches a literal +:
grep '+[0-9]\+' holidayaddress.mail
which is, of course, completely the opposite of the earlier extended Regular Expression example.
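Both rules can be seen in action by extracting the same phone number with each spelling. The sketch assumes GNU grep (for \+ in classic Regular Expressions and for the -o option); the input line is made up.

```shell
line='Call +31612345678 after six'

# Extended: \+ is a literal plus, + is the multiplier.
extended=$(printf '%s\n' "$line" | grep -E -o '\+[0-9]+')
# Classic: + is a literal plus, \+ is the multiplier.
classic=$(printf '%s\n' "$line" | grep -o '+[0-9]\+')

echo "extended: $extended"
echo "classic:  $classic"
```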
A white-space character is either a space character or a tab character. Sometimes the definition is extended with a formfeed character and a carriage-return character.
In classic Regular Expressions, handling white space, especially tab characters, can be a burden. In extended Regular expressions, this is done comfortably. Perl is the most luxurious. All will be discussed next.
In some classic Regular Expression variants (notably
sed), there is no easy primitive for a tab
character, let alone for white space.
It is probably best to refrain from using classic Regular
Expressions for tab handling. Two workarounds are available.
One way to handle white space is to create a character class
consisting of a space and a literal tab character. That is the
[, followed by a space, followed by a real tab
character, followed by ].
This will not work on the command line. Most shells intercept the tab character for other purposes.
In an editor you can usually type the tab character as ^I (control-I). In some editors, tab has a special function, so something must be done first. In emacs, for example, you must type ^Q^I to get a literal tab character.
Make sure your editor does not change tabs automatically when
saving the file (check with od, for example).
The second workaround is to replace tab characters before they
reach your classic-Regular-Expression program. Using
tr (transliterate) is one option.
... | tr '\011' ' ' | sed ...
Here, any tab character (denoted by \011) is
changed to a space character before the information is sent to
sed.
tr can only read from standard
input (e.g., a pipe).
If your software supports POSIX character-classes
(Linux aims towards POSIX compliance), then you probably
have the [[:blank:]] and
[[:space:]] character classes available.
In the following example the [[:blank:]]
character class is used in a Regular Expression that matches a
#, possibly preceded by white-space characters, at
the beginning of the line.
Since -v is used, grep will
show all lines on which the Regular Expression does
not match:
grep -v '^[[:blank:]]*#' whatever.conf
The resulting output consists of all lines that do not start with
a # and do not start with one or more
white-space characters and a #.
This is a way to show a file without the comments in it.
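Instead of a real configuration file, the sketch below feeds a few made-up lines to the same command; printf is used to produce a real tab character in front of one of the comments.

```shell
# Five made-up configuration lines; \t becomes a real tab character.
active=$(printf '# main settings\nport = 22\n\t# indented comment\n  # another\nhost = example\n' |
    grep -v '^[[:blank:]]*#')
echo "$active"
```

Only the two lines that are not comments remain, regardless of whether the # is preceded by spaces or by a tab.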
In awk and Perl, the extended Regular Expressions have the \t
notation for a tab character (GNU egrep does not interpret \t; use
[[:blank:]] there). With that, it is easy to build a
white-space character class:
[ \t]
This means: either a space or a tab character. Applied:
[0-9]{1,2}[ \t]+[A-Z]
That is, one or two digits, followed by one or more white-space characters and an uppercase letter.
POSIX character classes are used exactly the same way as in classic Regular Expressions (above).
If you can use Perl for white-space handling, you are lucky: Perl has
the \s primitive that matches a white-space
character. Applied:
[0-9]{1,2}\s+[A-Z]
Or, to see lines starting with a # possibly
preceded by white space:
^\s*#
For more on Perl Regular Expressions, see the section called “Perl Regular Expressions”.
This section will only describe what Regular Expressions look like in sed. How and where Regular Expressions are applied is described in the section called “Using sed”.
The sed program uses the classic Regular Expression version. Probably because of a fear of breaking old software, few new features have been added, not even in GNU sed.
Remember that sed is line oriented. Operations work on a line that has been read from a file or standard input. The global modifier for the substitute command works on a line basis. When global is turned on, the replacement is done as often as possible in each line. When global is kept off, then the replacement will work once in each input line.
As described earlier, parsing white
space is a burden in sed.
Luckily, GNU sed 3.02 supports the POSIX character-set primitives,
so [[:space:]]\+ can be used to match one or more
white-space characters.
This section will only describe what Regular Expressions look like in awk. How and where Regular Expressions are applied is described in the section called “Using awk”.
There are at least three variants of awk for Linux.
Of those, GNU awk (gawk) is the most sophisticated.
The gawk distribution comes with great documentation.
All awk variants accept the extended Regular Expressions described
earlier, but, currently, without the curly-brace multipliers (called
interval expressions in awk terminology).
Keep in mind that gawk has some extensions
that the other awk variants do not accept, such as POSIX character-class
extensions and the \< and \>
word anchors. The interval expressions are
supported in gawk, but are not enabled by default for
portability with other awk implementations. To enable these, use either
the --posix or the --re-interval
option.
You might have wondered why most awk variants do not have support for word anchors. Since each awk splits its input into fields (which are normal words by default), these fields can be checked with normal begin and end anchors. The following checks, for example, if the second field (word) begins with an uppercase letter:
echo The Ape | awk '$2 ~ /^[A-Z]/ {print $2 ": starts upcase"}'
Creating a white-space
character class in awk is easy: [ \t] (as was shown
earlier).
In gawk, the POSIX white-space character classes can
be used instead. This is, of course, not portable to other awks.
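As a small sketch of the portable [ \t] class in use, the following counts the lines that are empty or contain only spaces and tabs. The sample input and the variable name are invented for the illustration.

```shell
# Four input lines: "one", a line of blanks and a tab, "two", and an empty line.
blank_count=$(printf 'one\n \t \ntwo\n\n' |
    awk '/^[ \t]*$/ { n++ } END { print n + 0 }')

printf '%s\n' "$blank_count"
```

The n + 0 in the END block makes sure a number is printed even when no line matched.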
Further Reading: lots of Perl documentation, e.g. Wall01. See perldoc perlre on your computer for more information about Perl Regular Expressions.
Perl has the most powerful Regular Expression variant. If at all possible, use Perl for your Regular Expression tasks.
This section will only describe the Perl Regular Expressions and how to use them in Perl. The Perl language itself is described in the section called “Using Perl”.
The Perl Regular Expression language is basically the extended Regular Expression language introduced earlier. There are, however, some subtle differences and extensions, which will be discussed below. The section on Perl Regular Expressions will end with a presentation on how to use grouping to extract parts of a Regular Expression match.
Perl has extra primitives, many of which turn out to be very handy in everyday use. Table 13.7, “Extra primitives in Perl” shows a selection.
Table 13.7. Extra primitives in Perl
| in Perl | description | same in awk |
|---|---|---|
| \d | a digit | [0-9] |
| \D | a non-digit | [^0-9] |
| \s | a white-space character | [ \t\n\f\r] |
| \S | a non-white-space character | [^ \t\n\f\r] |
| \w | a “word” character | [a-zA-Z0-9_] |
| \W | a non-“word” character | [^a-zA-Z0-9_] |
These primitives can be used both as a standalone primitive
and inside a character class.
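A minimal sketch of a primitive used inside a character class: [\d.] accepts digits and dots, so the pattern below accepts a dotted version number and rejects a string that contains a letter. The test strings and variable names are assumptions chosen for the example.

```shell
# \d inside a class: the class matches any digit or a literal dot.
is_version=$(perl -e 'print "yes" if "2.6.18" =~ /^[\d.]+$/')

# The x is neither a digit nor a dot, so nothing is printed here.
not_version=$(perl -e 'print "yes" if "2.6.x" =~ /^[\d.]+$/')

printf 'first: "%s", second: "%s"\n' "$is_version" "$not_version"
```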
The dot means: any character, but not the newline character. In the single line mode, which is not discussed further in this book, the dot will also match the newline.
In Perl, it is easy to distinguish between characters that need to
be prefixed with a backslash to remove a (possible) special meaning
and those that do not, namely \W characters
versus \w characters.
\w characters.
Any primitive that matches \w is a normal
character without a special meaning. The letter
a, for example, falls into this category.
\W characters.
Any character that matches \W may have a special
meaning.
Prefix it with a backslash to make sure it loses that special
meaning. If it does not have a special meaning, putting a backslash
before it will do no harm.
The + character, for example, falls in
this category.
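The backslash rule can be tried out with a short sketch like the one below. The input string is made up; the point is only that the + needs a backslash to be taken literally, while the backslash before the = is harmless.

```shell
# + is a \W character with a special meaning, so it is escaped.
# = has no special meaning; escaping it anyway does no harm.
escaped=$(perl -e 'print "match" if "1+1=2" =~ /^1\+1\=2$/')

# Unescaped, 1+ means "one or more 1s", so the literal + in the input
# is never matched and nothing is printed.
unescaped=$(perl -e 'print "match" if "1+1=2" =~ /^1+1=2$/')

printf 'escaped: "%s", unescaped: "%s"\n' "$escaped" "$unescaped"
```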
Line-boundary handling is very sophisticated in Perl.
Consider, for example, that the meaning of ^
and $ changes when multi-line
mode is used.
For more information you might start with
perldoc perlre, then read other documentation.
Newlines are not removed automatically.
When reading lines (from a file or information stream) Perl does not
chop off newlines automatically. You can either remove the newline
before the match is applied, or you can include the newline in your
Regular Expression.
Here is an example that removes the newline using the
chomp operator:
while (<>)                          # read line, put it in $_
{
    chomp;                          # remove newline from $_
    print $_, "\n" if (/9{2}$/);    # show $_ if it ends in 99
}
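If you keep the newline character, you can include the newline in your Regular Expression explicitly:
/9{2}\n$/
This matches if there are two nines and a newline character at the end of the input. Strictly speaking, the $ anchor in Perl also matches just before a newline at the very end of the input, so /9{2}$/ would match here as well; use \z if the match must end at the absolute end of the input.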
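A short experiment shows how the end anchors behave when the newline is kept. The sketch relies on documented Perl behaviour: $ matches at the end of the string or just before a final newline, while \z matches only at the very end. The input string and variable names are chosen for the illustration.

```shell
# The input ends in a newline, as a line read with <> would.
dollar=$(perl -e 'print "match" if "1999\n" =~ /9{2}$/')
explicit=$(perl -e 'print "match" if "1999\n" =~ /9{2}\n$/')

# \z anchors at the absolute end, which lies after the newline,
# so this pattern does not match and nothing is printed.
strict=$(perl -e 'print "match" if "1999\n" =~ /9{2}\z/')

printf 'dollar: "%s", explicit: "%s", strict: "%s"\n' "$dollar" "$explicit" "$strict"
```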
As far as we know, the anchor \b originated in Perl,
and is now available in some other Regular Expression variants, as
shown earlier. Perl has the \b, but also the
opposite \B.
Table 13.8, “Perl (non-)word boundary anchors” shows both.
Table 13.8. Perl (non-)word boundary anchors
| primitive | description |
|---|---|
| \b | word boundary (between \w and \W) |
| \B | not a word boundary |
Perl has two ways to re-use the matching parts (or even subparts). The first is done by reading some special pseudo variables, the second involves grouping.
After a successful match, the three related pseudo variables
$`, $& and
$' will contain the following:
$&
The $& pseudo variable
(correct spelling: a dollar sign and an ampersand character)
contains that part of the input that the Regular Expression
precisely matched.
$`
The $` pseudo variable
(correct spelling: a dollar sign and a backward quote
a.k.a. backtick)
contains the part of the input before
the matching part.
$'
The $' pseudo variable
(correct spelling: a dollar sign and a forward single quote)
contains the part of the input after
the matching part.
Be sure to confirm that the Regular Expression really matched.
Otherwise, $`, $& and $' are not updated and may still
hold the values left by an earlier successful match.
An example:
$_ = 'This is a nice book';
if (/\w+ic\w+/)
{
    print
        "Input: \"$_\"\n",
        "Part before match: \"$`\"\n",
        "Matching part: \"$&\"\n",
        "Part after match: \"$'\"\n";
}
The output of this program part is:
Input: "This is a nice book"
Part before match: "This is a "
Matching part: "nice"
Part after match: " book"
Note that the contents of $` end with a
trailing space and those of $' start with a leading
space.
In the Perl Regular Expression language, one can use grouping to extract matching parts or even subparts from the input.
Suppose the input is:
Today is 7 January 2002, a monday.
The following Perl Regular Expression matches the complete date in the input:
\b\d{1,2}\s+[A-Za-z][a-z]+\s+\d{4}\b
In Perl, you can also group the day, the month and the year parts:
\b(\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4})\b
When this Regular Expression is matched against the input, three
pseudo variables, namely $1,
$2 and $3, will be filled in
with the corresponding parts of the date. The information
corresponding to the first group (seen from the left) is stored in
$1, the second in $2, and so
on:
$_ = "Today is 7 January 2002, a monday.";
if (/\b(\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4})\b/)
{
    print "Day: $1\n",
          "Month: $2\n",
          "Year: $3\n";
}
Nested groups. Groups can be nested:
\b((\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4}))\b
Groups are numbered by the position of their opening parenthesis,
counted from the left. Information corresponding to the outermost group
(here: the whole date) gets into $1. The parts inside that group
fall into $2, $3 and so on.
To see this working, use something like this:
$_ = "Today is 7 January 2002, a monday.";
if (/\b((\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4}))\b/)
{
    print "Date: $1\n",
          "Day: $2\n",
          "Month: $3\n",
          "Year: $4\n";
}
There is an alternative to the use of the pseudo variables
($1 etc.).
The parts of the input that correspond to the groups in the Regular
Expression can be stored in an array.
In the program part below, these subparts of the input are stored in
the array @result:
my $input = "Today is 7 January 2002, a monday.";
my @result = $input =~ /\b((\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4}))\b/;
if (@result) # if RE fits
# etc
It is important to realize that
$input =~ /RE/
actually is an expression that returns a list of the captured parts when it is evaluated in list context. To capture this list, put an array assignment before it:
@result = $input =~ /RE/
This really is correct Perl! Remember that the =~
operator is called the binding operator. It is
not an assignment.
This program part puts everything in perspective:
my $input = "Today is 7 January 2002, a monday.";
my @result = $input =~ /\b((\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4}))\b/;
if (@result) # if RE fits
{
    for (my $i = 0; $i < scalar @result; $i++)
    {
        print "Field $i: $result[$i]\n";
    }
}
Elements of the array will be filled in the same order the
pseudo variables were assigned. The first element,
$result[0], will get the complete date,
$result[1] will get the day,
$result[2] the month and
$result[3] the year.