Chapter 13. System Customization and Automation (2.213)

Revision: $Revision: 1.4 $ ($Date: 2007-01-10 15:36:27 $)

This topic has a total weight of 3 points and contains the following objective:

Objective 2.213.1; Automating tasks using scripts (3)

Candidates should be able to write simple scripts to automate tasks using different common scripting languages. Tasks for automation include checking processes, process execution, parsing logs, synchronising files across machines, monitoring files for changes, generating and sending e-mail alerts and notifying administrators when specified users log in or out.

Key knowledge area(s):

Standard text manipulation software such as awk and sed
BASH
cron configuration files
at daemon usage
Remote copying software such as rsync and scp
Perl: Basic commands

The following is a partial list of the used files, terms and utilities:

perl
bash
awk
sed
crontab
at

This chapter discusses three main programs that work with Regular Expressions: Perl, sed and awk. Because Regular Expressions require a rather extended explanation and are common in all three programs, the Regular Expressions are discussed first.

The chapter concludes with topics about rsync, crontab and monitoring your system, as listed in the LPIC objectives above.

Regular Expressions

Revision: $Revision: 1.10 $ ($Date: 2007-01-11 08:31:04 $)

Resources: (none).

Further reading: Friedl01, manpages for the various commands.

Regular Expressions are, as you will read further on in this chapter, the salt of all Unix systems. Regular Expression patterns are used in grep and derivatives, Perl, awk and sed. In general, there are lots of similarities between the Regular Expression dialects used by each of these programs. When able to work with one dialect, it is quite easy to work with the other dialects too. There are, however, annoying and dangerous differences. In this section the commonalities will be described and the differences clarified. Unfortunately, portability is an issue: there are even different Regular Expression dialects for the same program on different operating systems. Linux Regular Expression support is less complicated, but even here differences exist. This point will gradually become clear in this section.

In keeping with common practice in Perl documentation, the term Perl will be used for the Perl language and perl will be used for the program.

Introducing Regular Expressions

A Regular Expression is a pattern. The pattern is matched against some text. The result: the pattern will either fit or not fit. There is a lot to say about this pattern language. The text is frequently called the input. On the command line or in scripts, it can originate from an input stream (standard input) or from a set of files. In these cases the input will almost always be a line of text. When using a programming language that supports Regular Expressions, the pattern will probably be matched against the contents of a buffer.

Note

When using Regular Expressions on the command line, such as with grep, you must protect the Regular Expression from expansion by the shell. Putting a single (forward) quote on each side of the Regular Expression will do the job:

    grep 'loo*ny' tune.texts

This (rather long) section will discuss Regular Expressions in detail. First, the Regular Expression language (including variants) is introduced. Next, Regular Expressions are applied using specific programs. So, take a deep breath and start reading.

Primitives and Multipliers

The parts of a Regular Expression most commonly used are the primitive and the multiplier. They frequently occur in pairs. An example of such a pair is:

    E{4}

In this example, E is the primitive and {4} is the multiplier, meaning four times the thing before it. This Regular Expression will match four consecutive capital E characters somewhere in the input.

Given the Regular Expression above, as an example, and an input line such as the following:

    I like lots of EEEE's

the Regular Expression will match the four E's in the input. Moreover, the complete line will match, since the Regular Expression fits somewhere in the line. A member of the grep family, such as egrep, will show the input line if the Regular Expression matches part or all of it. So, if egrep were to handle something like this

    echo "Lots of EEEE's" | egrep 'E{4}'

then egrep would show the whole line.
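The following is a minimal sketch of that behaviour, not part of the original text. The two sample lines are made up for illustration, and grep -E is used as the standard spelling of egrep. It shows that a line is printed in full as soon as the pattern fits anywhere in it, and that a line with only three E's is skipped.

```shell
# Two hypothetical input lines: only the first contains four consecutive E's.
matched=$(printf '%s\n' "Lots of EEEE's" "Only EEE here" | grep -E 'E{4}')
echo "$matched"

# grep -c counts the matching lines instead of printing them.
count=$(printf '%s\n' "Lots of EEEE's" "Only EEE here" | grep -Ec 'E{4}')
echo "$count"
```

Counting with -c is a convenient way to check a pattern in a script without having to inspect the printed lines.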

A multiplier may be omitted, in which case a multiplier of one time is added automatically to each primitive. So the Regular Expression:

    head

effectively means:

one h, followed by one e followed by one a followed by one d.

Both primitives and multipliers are discussed below (the section called “Primitives” and the section called “Multipliers”). More facilities are available for Regular Expressions: anchoring, grouping and alternation. These are discussed later on.

Primitives

There are three types of primitives: a regular character, the dot placeholder and the character class. The character class is the most complicated of these.

A digit, letter or other character

Any regular character can be a primitive. This includes letters (like E) and digits. For instance, in the following example the capital E is the primitive:

    E{4}

What is the primitive here:

    4{2}

The correct answer is 4.

Other characters can also be used as primitives in a Regular Expression. For example:

    ls -la | egrep ' {11}'

(the Regular Expression is a space followed by multiplier {11}). This will show lines containing at least 11 consecutive spaces.

Some characters have special meanings, such as the curly brace ({). These characters must be escaped (they are preceded by a backslash (\)) to remove the special meaning. This will be discussed further later on.

The dot: placeholder for any character

The dot has a special meaning. As a primitive, it can be a placeholder to allow any character in that position. For instance:

    a.c

allows for one (remember the one-time multiplier being added?) a, followed by any single character, followed by one c. So

    echo reading the abc | egrep 'a.c'

fits.

And what will this fit on?

    a.{3}

The answer: one a followed by three arbitrary characters.

Note

To find a literal dot, use the \. primitive.

Character classes

A character class is a placeholder for a number of characters. It is more restrictive than the dot placeholder, since it allows you to specify the characters the placeholder should match.

Keep in mind that the character class is a primitive: it is a placeholder for exactly one of the specified characters.

To set up a character class, start with a [ (opening square bracket) and end it with ] (closing square bracket). Any combination of sets and ranges can be specified between these square brackets.

A character set in a character class

To specify a set (a series) of characters in a character class, simply put them between the square brackets. So,

    [aouiyA]

specifies a set consisting of an a (either lower or upper case), an o, a u, an i or a y.

Remember that the character class is a primitive. That is, it can be multiplied:

    [aouiyA]{2}

This specifies two characters, each of which may be one of the characters inside the square brackets.

The order of the characters in the set is irrelevant. That is, [abcABC] is the same as the [aAbBcC] character class. So,

    echo reading the abc | egrep 'a[bB]c'

fits, as would

    echo reading the aBc | egrep 'a[bB]c'

A character range in a character class

A character range consists of two characters separated by a minus sign. This specifies one digit:

    [0-9]

This is one lower case letter:

    [a-z]

Ranges can be combined:

    [a-z0-9]

Some sets can be written as ranges. For example, the Regular Expression:

    [ghiGHI]{2}

can also be written as:

    [g-iG-I]{2}

It matches two of the indicated characters.

For now, the order is defined by the ASCII character set (or derivatives like iso8859-x). In the ASCII character set, letters and digits are not adjacent. That is why they need to be specified separately as ranges:

    [a-z0-9]

More and more systems support so-called locale handling (such as the character set for a given country) of Regular Expressions. Emerging character encodings such as Unicode will probably change this.
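As a small illustration of sets versus ranges, the sketch below runs both spellings over the same made-up word list. Setting LC_ALL=C is an assumption added here so that the range order is plain ASCII regardless of the locale in use.

```shell
# Hypothetical word list; "tag" contains only one character from the class.
words='high
gig
Hi
tag'

# The set and the range spelling should select the same lines.
set_hits=$(printf '%s\n' "$words" | LC_ALL=C grep -E '[ghiGHI]{2}')
range_hits=$(printf '%s\n' "$words" | LC_ALL=C grep -E '[g-iG-I]{2}')

printf '%s\n' "$set_hits"
[ "$set_hits" = "$range_hits" ] && echo "set and range agree"
```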

Combinations of sets and ranges

Ranges and sets can be combined at will. This specifies one of the named vowels or a digit:

    [aouiy0-9]

And this one matches three characters:

    [A-Z][aouiy0-9][A-Z]

Which ones? Answer: three characters, the first and the third an uppercase letter with a vowel or digit in the middle. In the following example the Regular Expression matches:

    echo This is JaN | egrep '[A-Z][aouiy0-9][A-Z]'

Note

To include a minus sign in a character set, place it at the beginning or the end of the character class.

The inverted character class

It is often desirable to exclude certain characters from a character class. That's where the inverted character class comes in.

The inverted character class starts with a caret (^) as the first character after the opening square bracket ([) that starts the character class. The caret sign at the beginning inverts (negates) the meaning of the character class. Where

    [0-9]

fits one digit,

    [^0-9]

will fit any character that is not a digit.

This, of course, can be extended with every character class:

    [^aeouy0-9]

will fit anything that is neither one of the listed vowels nor a digit.

Note

To include a caret sign in a character class, do not put it at the beginning.
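The sketch below tries the inverted class on a few made-up lines. It also shows a caret placed away from the first position, where it is just an ordinary member of the set. LC_ALL=C is an assumption used to keep the digit range locale-independent.

```shell
lines='12345
123a5
2007-01-10'

# Lines that contain at least one character that is not a digit.
non_digit_lines=$(printf '%s\n' "$lines" | LC_ALL=C grep '[^0-9]')
printf '%s\n' "$non_digit_lines"

# Here the caret is not first, so it is a literal member of the class:
# only 'x^2' has a digit or a caret between the x and the 2.
caret_count=$(printf '%s\n' 'x^2' 'x+2' | LC_ALL=C grep -c 'x[0-9^]2')
echo "$caret_count"
```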

POSIX character classes

The POSIX character classes form an extension to the character classes that were discussed above. They come in the following format:

[[:keyword:]]

The keyword is a word describing the class, for example, alnum or blank. Current implementations of GNU grep and egrep, as well as GNU awk (gawk), come with manpages that describe the POSIX character classes. Let's look at some examples.

Note

The outer square brackets are the delimiters of any character class. The inner square brackets and colons plus the keyword define the POSIX character set.

Table 13.1. Overview of character classes

Notation       Description
[[:alnum:]]    Fits exactly one alphanumeric character (same as [a-zA-Z0-9])
[[:alpha:]]    Matches any letter
[[:upper:]]    Matches any uppercase letter (same as [A-Z])
[[:lower:]]    Matches any lowercase letter (same as [a-z])
[[:digit:]]    Matches any digit (same as [0-9])
[[:xdigit:]]   Matches any hexadecimal digit (same as [0-9A-Fa-f])
[[:blank:]]    Matches a blank or a tab sign (a subset of the wider set called white space)
[[:space:]]    Matches each white-space character, including space, tab, formfeed and carriage return
[[:punct:]]    Matches all punctuation characters


Multiple POSIX-character-set definitions can be combined in one character class:

    [[:punct:][:space:]]

This one fits a character which is either a punctuation character (such as a semicolon or a plus sign) or a white-space character.

The POSIX character set specifiers can be combined with any other character-class element, as shown below:

    [0-9]{4}[[:space:],\.:][0-9]{2}

In words: four digits, then one character that can be either a white-space character, a comma, a dot or a colon, followed by two digits.

Since the POSIX character class extensions are extensions to the regular character class, they can be inverted the same way:


    [^[:digit:]]

This matches one non-digit. It is identical to:

    [^0-9]
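A short sketch of POSIX classes in use follows. The sample lines are invented, and the pattern is a simplified variant of the one above (the dot is written without a backslash, since it is already literal inside a character class).

```shell
sample='2007 01
2007:01
2007-01
200701'

# Four digits, one separator (white space, comma, dot or colon), two digits.
hits=$(printf '%s\n' "$sample" | grep -E '[[:digit:]]{4}[[:space:],.:][[:digit:]]{2}')
printf '%s\n' "$hits"

# The inverted POSIX class: count lines containing at least one non-digit.
non_digit_count=$(printf '%s\n' 42 4x2 | grep -c '[^[:digit:]]')
echo "$non_digit_count"
```

The dash and the missing separator are rejected because neither is a member of the middle character class.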

Multipliers

A multiplier (also known as quantifier) works in conjunction with a primitive. The multiplier is placed after the primitive. Together they form a specification of how many of which characters are wanted.

Unfortunately, before multipliers can be explained in detail, the two main variants of Regular Expressions need to be introduced, since they mostly affect the way multipliers are spelled. The classic Regular Expressions are the ones implemented in grep and sed. The modern, or extended, Regular Expressions, on the other hand, are the ones used by awk, egrep, Perl and flex, to name a few. The multipliers shown so far were in the extended style. For this reason, egrep was used in the examples.

We will discuss four multipliers. The first three are the *, ? and + multipliers (and their classic counterparts *, \? and \+; yes, the * is the same in both cases).

You can also use the curly-brace multipliers we've already met. These are part of the POSIX standard. All relevant GNU programs have support for curly-brace multipliers on board; even versions for classic Regular Expression exist. In GNU awk (gawk) a special --posix option must be given to turn on support for curly-brace multipliers. Table 13.2, “Multipliers” at the end of this section will provide an overview of Regular Expression multipliers.

The * multiplier

The * is the oldest multiplier around. As far as I know, it is implemented in every Regular Expression language. It is the same in classic and extended variants.

The * means: zero or more times. So,

    a*

means zero or more a's. This may not be that meaningful, since it is always true. In larger combinations, however, its use can be more meaningful:

    ba*c

That is, one b, and one c, possibly with a's between them.

The ? multiplier

The ? multiplier means zero or one time, in other words, present or not present. The extended version is ?, the version to use in classic Regular Expressions is \?.

An example: in sgml/xml sections start with <section followed by a lot of things, and end with </section, also followed by other things. To fetch both lines, use

    egrep '</?section' regexp.xml

The classic variant is

    grep '</\?section' regexp.xml
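To compare the two spellings side by side, here is a small sketch on an invented three-line XML fragment. It assumes GNU grep, since \? in classic Regular Expressions is not guaranteed elsewhere.

```shell
xml='<section id="intro">
<para>text</para>
</section>'

# Extended spelling: the slash is optional.
extended=$(printf '%s\n' "$xml" | grep -E '</?section')

# Classic spelling of the same pattern (assumes GNU grep).
classic=$(printf '%s\n' "$xml" | grep '</\?section')

printf '%s\n' "$extended"
[ "$extended" = "$classic" ] && echo "both spellings agree"
```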

The + multiplier

The + multiplier means one or more times. The extended version is +, the version to use in classic Regular Expressions is \+.

This (extended) Regular Expression

    ba+c

means one b, followed by at least one a, followed by a c.

To find long, regardless of how many o's it contains, use the (extended) Regular Expression

    lo+ng

The curly brace multiplier

In the extended variant, a normal set of curly braces is used, as in {4}. In the classic variant, a matching pair of \{ and \} (each curly brace is prefixed by a backslash) is used, as in \{4\}.

The curly-brace multipliers come in four variants (only the extended Regular Expressions will be shown). The general form is

{m,n}

In this, both m and n are numbers indicating the lowest (m) and the highest (n) number. It matches at least m, but at most n times. For instance, to match at least two, but at most four digits:

    [0-9]{2,4}

To specify a minimum number of times, omit the n (but not the comma!):

{m,}

For instance, to match at least five digits:

    [0-9]{5,}

And to show lines that are at least 78 characters long (with their line numbers), use

    egrep -n '.{78,}' file1 file2 ...

To specify a maximum number of times, omit the m (but not the comma!):

{,n}

For instance, to match up to three digits:

    [0-9]{,3}

In this case, the minimum will be zero.

Note

On its own, this example will not reject longer runs of digits: it just matches up to three of them. If more digits adjacent to these three are not wanted, anchors must be used - see the section called “Anchors”.

To match something no more and no less than n times, specify only n (do not use a comma):

{n}

To match exactly four capital alphabetic characters, for instance:

    [A-Z]{4}
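The variants can be tried out with a sketch like the one below, using made-up numbers. The -x option of grep (the whole line must fit) is used here as a stand-in for the anchors that are discussed later; without it, longer runs of digits also match, because the pattern fits somewhere inside them.

```shell
nums='7
42
12345
1234567'

# Whole line must be two to four digits.
two_to_four=$(printf '%s\n' "$nums" | grep -Ex '[0-9]{2,4}')
echo "$two_to_four"

# Whole line must be at least five digits.
five_plus=$(printf '%s\n' "$nums" | grep -Ex '[0-9]{5,}')
printf '%s\n' "$five_plus"

# Without -x the pattern also fits inside the longer numbers.
unanchored=$(printf '%s\n' "$nums" | grep -Ec '[0-9]{2,4}')
echo "$unanchored"
```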

Support and portability.  Curly-brace multipliers are supported by Perl. In GNU awk, support must be enabled explicitly with the --posix or the --re-interval option; other awk implementations might not (yet) support these multipliers (called interval expressions in awk terminology). GNU egrep supports them, and GNU grep supports the classic version. In other versions of (e)grep, support may be lacking. GNU sed supports the classic spelling. There are implementations of sed that do not support curly-brace multipliers at all. In yet other implementations of sed, extended Regular Expressions can be selected. The POSIX standard is said to require curly-brace multipliers, so support is expected to appear elsewhere over time. Table 13.3, “Portable multipliers” provides an overview.

Multiplier overview

Table 13.2, “Multipliers” shows all multipliers and variants together.

Table 13.2. Multipliers

extended RE   classic RE   meaning
*             *            zero or more times
?             \?           zero or one time
+             \+           one or more times
{m,n}         \{m,n\}      at least m, but at most n times
{,n}          \{,n\}       up to n times
{m,}          \{m,\}       at least m times
{m}           \{m\}        exactly m times
extended RE: Perl, GNU awk with --posix, GNU egrep
classic RE: GNU grep, GNU sed


Note

GNU awk without --posix and other awk implementations currently support only the *, + and ? multipliers.

Portability.  If your Regular Expressions should also be used on non-GNU platforms, portability becomes an issue. Older versions of grep will probably not support the \+ and \? multipliers. Curly-brace multipliers (called interval expressions in awk terminology) will probably not be supported in other awks (though they are reported to be part of the POSIX standard). Even GNU awk (gawk) does not enable interval expressions by default: they can only be enabled by invoking either the --posix or the --re-interval option. Table 13.3, “Portable multipliers” lists multipliers that you can use safely across platforms. But even here, exceptions do exist.

Table 13.3. Portable multipliers

Perl     awk, egrep   sed, grep   meaning
*        *            *           zero or more times
?        ?                        zero or one time
+        +                        one or more times
{m,n}                             at least m, but maximally n times
{,n}                              up to n times
{m,}                              at least m times
{n}                               exactly n times


Anchors, Grouping and Alternation

Anchors

Sometimes a Regular Expression should only match if a pattern matches at a certain position, for instance, at the beginning of the input or at the beginning of a word. This can be achieved by adding an anchor to your Regular Expression.

Anchors are also called zero width assertions because they fit between or adjacent to something.

There are two types of anchors: anchors specifying a beginning or ending of a line and anchors specifying a word boundary. Both types will be discussed.

Begin and end anchors

When an anchor is used in a Regular Expression, the Regular Expression will only match if the string is in the position specified by the anchor.

Generally, newlines are stripped before the match is attempted. This is true for at least awk, sed and all grep variants. But, in Perl, you must either remove the newline using the chomp operator or include it in the Regular Expression. Furthermore, if the so-called multi-line mode is turned on, the meaning of begin and end anchors changes altogether in Perl. More in the section called “Perl Regular Expressions”.

Portability.  Begin and end anchors are supported in all Regular Expression variants and use the same notation everywhere.

Let's look at the begin and end anchors in more detail.

The ^ begin anchor

The ^ (caret) is the anchor that specifies the beginning of the input. If you attach ^ at the beginning of your Regular Expression, matches will then only be found at the beginning of the input.

In the example below there is a match:

    echo Once upon ... | grep '^Once'

There is, however, no match in this one:

    echo He said: Once ... | grep '^Once'

The $ end anchor

The $ (dollar sign) is the anchor that specifies placement at the end of the input. The $ is typically attached to the end of the Regular Expression. The Regular Expression will then only match strings at the end of the input.

In the next example all lines that end in a slash (that is, containing directory names) will match:

    ls -lF | grep '/$'

In this example, there will not be a match, since pwd does not put a slash at the end of its output:

    pwd | grep '/$'

Matching complete text

By using the begin and end anchors together, you can create a Regular Expression that matches the entire input, not just some part of it:

    ^[0-9]+$

specifies that the complete input should consist of digits and nothing else. So, if the input contains one non-digit, the Regular Expression will not match.

In the following example, egrep will not show empty lines or lines that only contain spaces:

    egrep -v '^ *$' listofnames

So, the lines that will be shown contain at least one character that is not a space.
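Anchoring on both sides is a common way to validate a value in a script. The sketch below wraps the digits-only pattern in a small helper; the function name is_number is hypothetical and the helper is meant for single-line values only.

```shell
# is_number (hypothetical helper): succeeds only when the whole
# argument consists of one or more digits.
is_number() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+$'
}

if is_number 2007; then r1=yes; else r1=no; fi
if is_number 20o7; then r2=yes; else r2=no; fi
if is_number '';   then r3=yes; else r3=no; fi

echo "$r1 $r2 $r3"
```

The -q option suppresses the output, so only the exit status of grep is used to decide.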

Word-boundary anchors

Word-boundary anchors fit between a word and something else. In most Regular Expressions word is defined as a series of alphanumeric characters. So, a word boundary fits between an alphanumeric character and a non-alphanumeric character. Furthermore, a word-boundary anchor will also fit between an alphanumeric character and the beginning or end (where ^ and $ would match).

There are two variants of word-boundary anchors. The first is \b, which matches on either side of a word. The second variant consists of two anchors: \< and \>. The \< anchor fits either between a non-alphanumeric character and an alphanumeric character or before the beginning of a line if the line starts with an alphanumeric character (like ^). The \> anchor fits either between an alphanumeric character and a non-alphanumeric character or after an alphanumeric character at the end of a line (like $). The spelling of word-boundary anchors is the same everywhere.

Portability.  Word-boundary anchors are not supported widely, but can be found in all GNU implementations of the programs. The only implementation you can really be sure of is \b in Perl.

Let's look at word-boundary anchors in more detail.

The \b word-boundary anchor

The \b word-boundary anchor fits on either side of a word. This one matches:

    echo The one and only | grep '\bone'

This one does not:

    echo gone with the wind | grep '\bone'

The \b anchor can also be used at the end of a word. This one does not match:

    echo printed onesided | grep 'one\b'

A more complicated example is:

    \b[0-9]{6,10}\b

This will only match if the input contains a series of six, seven, eight, nine or ten digits.

The \< and \> word-boundary anchors

The \< and \> word-boundary anchors fit, respectively, on the left and right word boundaries. They can be used together, but this is not mandatory.

In:

    echo The onesided page is gone | grep '\<one\>'

the Regular Expression will not match.
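The difference between anchoring one side and both sides of a word can be seen in the following sketch, which reuses the sample phrases from this section. It assumes GNU grep, as word-boundary anchors may be missing elsewhere.

```shell
text='The one and only
gone with the wind
printed onesided'

# Both boundaries: only the free-standing word "one" fits.
whole_word=$(printf '%s\n' "$text" | grep '\<one\>')
echo "$whole_word"

# Left boundary only: "onesided" fits as well, "gone" does not.
left_only=$(printf '%s\n' "$text" | grep '\bone')
printf '%s\n' "$left_only"
```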

Anchor overview

Table 13.4, “Anchors” provides an overview of anchors and their support.

Table 13.4. Anchors

anchor   awk   egrep   Perl   grep, sed   meaning
^        yes   yes     yes    yes         beginning
$        yes   yes     yes    yes         ending
\b       no    yes     yes    yes         word boundary
\<       no    yes     no     yes         left word boundary
\>       no    yes     no     yes         right word boundary
Notes:
^ and $ in Perl can have a slightly different meaning in conjunction with newlines.
GNU awk (gawk) does support \< and \>.
GNU grep and egrep do support word boundaries. Other greps might not do this.


Grouping

Grouping is making part of a Regular Expression a separate entity. This is done by putting that part inside parentheses, like:

    [0-9]:([0-9]:)

In the above, the second [0-9]: is inside parentheses, hence it is grouped.

Note

The group (the part between the parentheses) itself must be a correct Regular Expression.

Why would you use grouping? There are three important reasons:

  1. a multiplier can be applied to a group. Discussed next.

  2. the input the group matches can be re-used (so-called backreferences). Discussed next.

  3. to use alternation in part of the Regular Expression. Will be discussed later in the section called “Grouping and Alternation”.

Unfortunately, the difference between classic and extended Regular Expressions pops up again here. Table 13.5, “Grouping operators” shows when to use which spelling.

Table 13.5. Grouping operators

spelling   program
(...)      awk, Perl, egrep
\(...\)    grep, sed


Support and portability.  Grouping support is present in Perl together with various backreference mechanisms. Grouping is also supported by all awk variants, but backreferences are not. GNU grep, egrep and sed support both grouping and backreferences. Every program that allows grouping also allows multiplication applied to a group.

Applying a multiplier to a group

A multiplier can be applied to a group. Simply put the multiplier right after the closing parenthesis of a group. To find a representation of time in the typical format two digits - colon - two digits - colon - two digits (such as 14:10:17), for example, use the following (extended) Regular Expression:

    ([0-9]{2}:){2}[0-9]{2}

To find all words with ing twice (such as singing), use this extended Regular Expression:

    (ing){2}

If a group is not followed by a multiplier, then multiplication by one is chosen automatically.

Grouping and backreferences

The part of the input on which a group matches becomes available for re-use. A special mechanism for backward referencing, called a backreference, is available for this. The backreference corresponding to the first group in a Regular Expression is indicated with \1 (a backslash and the digit one), the one corresponding to the second group \2, and so on. If, for example, you are looking for the occurrence of four identical digits, you might use:

    ([0-9])\1\1\1

Suppose this is tried on:

    The 1-2-3 of 4444

First, the RE finds 1 as the first digit. There is a group around it, so the value is saved. In the Regular Expression there is a \1 after the group. This tells the Regular Expression engine to use the same value (digit 1) again. This fails, since there is a dash after the 1. This sequence is repeated for the 2 and the 3, but both fail. However, when it finds the 4 it then finds the other three 4s, so the Regular Expression matches at last.

In an earlier paragraph I needed to find words that contain twice the same three characters, like singing. Here is how I found words like that:

    egrep '(...)\1' /usr/share/dict/american-english

(I know, I should have used [A-Za-z]{3} instead of the dots. Since there are only letters in the file, dots work as well).

Extracting the matching part (or even subparts) using grouping is very important in Perl. More in the section called “Perl Regular Expressions”.
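The four-identical-digits example can be checked with a sketch like the following, in both the extended and the classic spelling. It assumes GNU grep, which is the implementation the text names as supporting backreferences; the second input line is made up to show a non-match.

```shell
line='The 1-2-3 of 4444'

# Extended spelling: (...) groups, \1 refers back to the digit found.
found_ext=$(printf '%s\n' "$line" | grep -E '([0-9])\1\1\1')

# Classic spelling: the parentheses are escaped.
found_classic=$(printf '%s\n' "$line" | grep '\([0-9]\)\1\1\1')

# Four digits that are not all identical do not fit; grep -c prints 0
# (the "|| true" only hides grep's non-zero exit status).
miss=$(printf '%s\n' 'The 1-2-3 of 4445' | grep -Ec '([0-9])\1\1\1' || true)

echo "$found_ext"
echo "$miss"
```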

Alternation

Using alternation, one can specify either/or constructions in a Regular Expression.

Again, the difference between classic and extended Regular Expressions pops up. Table 13.6, “Alternation operator” shows when to use which spelling.

Table 13.6. Alternation operator

spelling   program
|          awk, Perl, egrep
\|         grep, sed


As an example, suppose a file contains phone numbers of mobile phones. Wanted: only the numbers that apply to The Netherlands. In the Netherlands, mobile phone numbers start with 06. When called from outside the Netherlands, this should be 00316. This is not true in all countries, so some numbers start with +316. Here is the Regular Expression to handle all three possibilities:

    egrep '06|00316|\+316' phonenumbers

Grouping and Alternation

Alternation can be used inside a group. Then, the alternation applies only to the part in the group.

Suppose the mobile phone numbers should be at the beginning of the input line. Then we need to use an anchor. But if we use an anchor followed by the existing Regular Expression (^06|...), then the anchor applies only to the first number. The following is the correct solution:

    egrep '^(06|00316|\+316)' phonenumbers

Another example: Find numbers in the range 1 .. 12. Leading zeroes are not allowed. Solution (extended Regular Expression):

    \b([1-9]|1[0-2])\b

In words: one digit in the range 1-9 or two digits in the range 10-12.
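Both points of this section can be tried with the sketch below: the effect of the group on the anchor, and the 1 .. 12 range. All input lines are invented, and the \b anchor assumes GNU grep.

```shell
numbers='0612345678
call 0612345678
x00316123'

# Without a group the anchor binds to 06 only, so 00316 fits anywhere.
loose=$(printf '%s\n' "$numbers" | grep -Ec '^06|00316')

# With a group the anchor applies to every alternative.
strict=$(printf '%s\n' "$numbers" | grep -Ec '^(06|00316)')
echo "$loose $strict"

# Numbers in the range 1 .. 12 without leading zeroes.
candidates='1
07
12
13
120'
in_range=$(printf '%s\n' "$candidates" | grep -E '\b([1-9]|1[0-2])\b')
printf '%s\n' "$in_range"
```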

Special characters

Two topics are discussed here:

  1. setting or removing the special meaning of a character

  2. white space

Characters and combinations with a special meaning

Rule 1.  If a character has a special meaning by itself, it will lose this special meaning if a backslash is put before it.

The dot by itself has a special meaning, but \. will match a literal dot, as in

    [0-9]+\.[0-9]{2}

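Also, to find a literal star it needs to be prefixed with a backslash. The following will match zero or more digits or stars (note that inside a character class the star is already literal, and POSIX tools such as grep treat a backslash inside a class as a literal backslash, so [0-9*]* is the safer spelling there).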

    [0-9\*]*

To find a phone number starting with a literal plus followed by one or more digits, use this egrep command line:

    egrep '\+[0-9]+' holidayaddress.mail

Rule 2.  If a backslash and a character together have a special meaning, then omitting the backslash will remove the special meaning.

In some Regular Expression variants, the \b has a special meaning as word boundary. When the b is used alone, it just means the letter b.

    \bback

In classic Regular Expressions, the \+ has a special meaning (a multiplier). A + without a backslash matches a literal +:

    grep '+[0-9]\+' holidayaddress.mail

which is, of course, completely the opposite of the earlier extended Regular Expression example.
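The two rules can be illustrated with a short sketch on made-up input. The classic \+ multiplier assumes GNU grep.

```shell
mail='phone: +31612345678
sum: 1 + 2'

# Extended: \+ is a literal plus, + is the multiplier.
ext=$(printf '%s\n' "$mail" | grep -E '\+[0-9]+')

# Classic: + is a literal plus, \+ is the multiplier (assumes GNU grep).
cls=$(printf '%s\n' "$mail" | grep '+[0-9]\+')
echo "$ext"

# Rule 1 applied to the dot: escaped it only fits a literal dot,
# unescaped it fits any character.
escaped=$(printf '%s\n' 12.50 12x50 | grep -Ec '[0-9]+\.[0-9]{2}')
unescaped=$(printf '%s\n' 12.50 12x50 | grep -Ec '[0-9]+.[0-9]{2}')
echo "$escaped $unescaped"
```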

White space

A white-space character is either a space character or a tab character. Sometimes the definition is extended with a formfeed character and a carriage-return character.

In classic Regular Expressions, handling white space, especially tab characters, can be a burden. In extended Regular expressions, this is done comfortably. Perl is the most luxurious. All will be discussed next.

Classic Regular Expressions

In some classic Regular Expression variants (notably sed), there is no easy primitive for a tab character, let alone for white space. It is probably best to refrain from using classic Regular Expressions for tab handling. Two workarounds are available.

Tab characters

One way to handle white space is to create a character class consisting of a space and a literal tab character. That is the [, followed by a space, followed by a real tab character, followed by ].

Note

This will not work on the command line. Most shells intercept the tab character for other purposes.

In an editor you can usually type the tab character as ^I (control-I). In some editors, tab has a special function, so something must be done first. In emacs, for example, you must type ^Q^I to get a literal tab character.

Note

Make sure your editor does not change tabs automatically when saving the file (check with od, for example).

Replacing tabs beforehand

The second workaround is to replace tab characters before they reach your classic-Regular-Expression program. Using tr (transliterate) is one option.

    ... | tr '\011' ' ' | sed ...

Here, any tab character (denoted by \011) is changed to a space character before the information is sent to sed.

Note

tr can only read from standard input (e.g., a pipe).
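A minimal sketch of this workaround follows. The input string is invented, and grep stands in for the classic-Regular-Expression program at the end of the pipe.

```shell
# The tab between the words prevents a match on 'alpha beta';
# grep -c prints 0 (the "|| true" only hides grep's exit status).
before=$(printf 'alpha\tbeta\n' | grep -c 'alpha beta' || true)

# After tr has turned the tab (octal 011) into a space, the pattern fits.
after=$(printf 'alpha\tbeta\n' | tr '\011' ' ' | grep -c 'alpha beta')

echo "$before $after"
```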

POSIX white space

If your software supports POSIX character-classes (Linux aims towards POSIX compliance), then you probably have the [[:blank:]] and [[:space:]] character classes available. In the following example the [[:blank:]] character class is used in a Regular Expression that matches a #, possibly preceded by white-space characters, at the beginning of the line. Since -v is used, grep will show all lines on which the Regular Expression does not match:

    grep -v '^[[:blank:]]*#' whatever.conf

The resulting output consists of all lines that do not start with a # and do not start with one or more white-space characters and a #. This is a way to show a file without the comments in it.
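The same command can be tried without a real configuration file, as in this sketch. The file contents are invented and generated with printf so that the second line starts with a real tab character.

```shell
# Hypothetical configuration text: a comment, a tab-indented comment
# and two settings.
conf=$(printf '# main settings\n\t# indented comment\nport = 22\n  host = example\n')

# Drop every line whose first non-blank character is a #.
active=$(printf '%s\n' "$conf" | grep -v '^[[:blank:]]*#')
printf '%s\n' "$active"
```

Only the two setting lines should remain, including the one that is indented with spaces.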

Extended Regular Expressions

Most extended Regular Expression implementations, such as awk and Perl, have the \t notation for a tab character (GNU egrep does not; use [[:blank:]] there). With that, it is easy to build a white-space character class:

    [ \t]

This means: either a space or a tab character. Applied:

    [0-9]{1,2}[ \t]+[A-Z]

That is, one or two digits, followed by one or more white-space characters and an uppercase letter.

POSIX character classes are used exactly the same way as in classic Regular Expressions (above).

If you can use Perl for white-space handling, you are lucky: Perl has the \s primitive that matches a white-space character. Applied:

    [0-9]{1,2}\s+[A-Z]

Or, to see lines starting with a # possibly preceded by white space:

    ^\s*#

For more on Perl Regular Expressions, see the section called “Perl Regular Expressions”.

Regular Expressions in sed

This section will only describe what Regular Expressions look like in sed. How and where Regular Expressions are applied is described in the section called “Using sed”.

The sed program uses the classic Regular Expression version. Probably because of a fear of breaking old software, few new features have been added, not even in GNU sed.

Remember that sed is line oriented. Operations work on a line that has been read from a file or standard input. The global modifier for the substitute command works on a line basis. When global is turned on, the replacement is done as often as possible in each line. When global is kept off, then the replacement will work once in each input line.
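
The difference made by the global modifier is easiest to see side by side. A minimal sketch with a made-up input line:

```shell
# Without g: only the first "-" on the line is replaced.
once=$(echo 'a-b-c' | sed 's/-/+/')

# With g: every "-" on the line is replaced.
all=$(echo 'a-b-c' | sed 's/-/+/g')

echo "$once"   # a+b-c
echo "$all"    # a+b+c
```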

As described earlier, parsing white space is a burden in sed. Luckily, GNU sed (version 3.02 and later) supports the POSIX character classes, so [[:space:]]\+ can be used to match one or more white-space characters.
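
A short sketch of white-space squeezing with sed follows; the input line is made up. The \+ multiplier is a GNU extension to classic Regular Expressions, so a spelling that should also work in other sed implementations is shown next to it:

```shell
# Made-up input with a run of spaces and a tab between the two words.
line=$(printf 'one  \t two')

# GNU sed: \+ means "one or more" of the preceding item.
gnu=$(printf '%s\n' "$line" | sed 's/[[:space:]]\+/ /g')

# Same idea without \+: one white-space character, then zero or more.
portable=$(printf '%s\n' "$line" | sed 's/[[:space:]][[:space:]]*/ /g')

echo "$gnu"
echo "$portable"
```

With GNU sed both commands should reduce the white-space run to a single space.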

Regular Expressions in awk

This section will only describe what Regular Expressions look like in awk. How and where Regular Expressions are applied is described in the section called “Using awk”.

There are at least three variants of awk for Linux. Of those, GNU awk (gawk) is the most sophisticated. The gawk distribution comes with great documentation.

All awk variants accept the extended Regular Expressions described earlier, but not all of them accept the curly-brace multipliers (called interval expressions in awk terminology). Keep in mind that gawk has some extensions that the other awk variants do not accept, such as POSIX character-class extensions and the \< and \> word anchors. In gawk versions before 4.0, interval expressions are supported but not enabled by default, for compatibility with other awk implementations. To enable them there, use either the --posix or the --re-interval option. From gawk 4.0 onwards they are enabled by default.
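
Where interval expressions are not available, a short one can be written out by hand. In the sketch below (the input lines are made up), [0-9][0-9]? stands in for [0-9]{1,2} and should be accepted by any awk variant:

```shell
# [0-9][0-9]? is "one digit, optionally followed by a second digit",
# which is what [0-9]{1,2} would express with an interval expression.
hits=$(printf '7 January\n12\tMarch\nno date here\n' |
    awk '/^[0-9][0-9]?[ \t]+[A-Z]/ { print $2 }')

echo "$hits"
```

This only stays readable for small ranges; for something like {3,8} an awk with interval support is the more practical choice.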

You might have wondered why most awk variants do not have support for word anchors. Since awk splits its input into fields (which are normal words by default), these fields can be checked with the normal begin and end anchors. The following checks, for example, whether the second field (word) begins with an uppercase letter:

    echo The Ape | awk '$2 ~ /^[A-Z]/ {print $2 ": starts upcase"}'
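
The same idea extends to whole-word matching. The sketch below, using a made-up input line, anchors each field at both ends, which takes the place of the \< and \> word anchors:

```shell
# Look for the whole word "ape"; "grapes" contains "ape" but must not match.
found=$(echo 'the ape ate grapes' |
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^ape$/) print "word " i ": " $i }')

echo "$found"
```

Note that this relies on the default field splitting: punctuation glued to a word (as in "ape,") is part of the field and would prevent the match.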

Creating a white-space character class in awk is easy: [ \t] (as was shown earlier). In gawk, the POSIX white-space character classes can be used instead. This is, of course, not portable to other awks.

Perl Regular Expressions

Further Reading: lots of Perl documentation, e.g. Wall01. See perldoc perlre on your computer for more information about Perl Regular Expressions.

Perl has the most powerful Regular Expression variant. If at all possible, use Perl for your Regular Expression tasks.

Note

This section will only describe the Perl Regular Expressions and how to use them in Perl. The Perl language itself is described in the section called “Using Perl”.

The Perl Regular Expression language is basically the extended Regular Expression language introduced earlier. There are, however, some subtle differences and extensions, which will be discussed below. The section on Perl Regular Expressions will end with a presentation on how to use grouping to extract parts of a Regular Expression match.

Perl character class extensions

Perl has extra primitives, many of which turn out to be very handy in everyday use. Table 13.7, “Extra primitives in Perl” shows a selection.

Table 13.7. Extra primitives in Perl

in Perl    description                   same in awk
\d         a digit                       [0-9]
\D         a non-digit                   [^0-9]
\s         a white-space character       [ \t\f\r]
\S         a non-white-space character   [^ \t\f\r]
\w         a word character              [a-zA-Z0-9_]
\W         a non-word character          [^a-zA-Z0-9_]


These primitives can be used both as a standalone primitive and inside a character class.

Note

The Perl definition of \w is not the same as an alphanumeric character. Instead, it is defined as an alphanumeric character or an underscore character ([a-zA-Z0-9_]).

Difference: the dot

The dot means: any character, but not the newline character. In the single line mode, which is not discussed further in this book, the dot will also match the newline.

Special characters

In Perl, it is easy to distinguish between characters that need to be prefixed with a backslash to remove a (possible) special meaning and those that do not, namely \W characters versus \w characters.

\w characters.  Any character that matches \w is a normal character without a special meaning. The letter a, for example, falls into this category.

\W characters.  Any character that matches \W may have a special meaning. Prefix it with a backslash to make sure it loses that special meaning. If it does not have a special meaning, putting a backslash before it will do no harm. The + character, for example, falls in this category.

Anchors: line boundaries

Line-boundary handling is very sophisticated in Perl. Consider, for example, that the meaning of ^ and $ changes when multi-line mode is used. For more information, start with perldoc perlre, then read other documentation.

Newlines are not removed automatically.  When reading lines (from a file or information stream) Perl does not chop off newlines automatically. You can either remove the newline before the match is applied, or you can include the newline in your Regular Expression. Here is an example that removes the newline using the chomp operator:

    while (<>)  # read line, put it in $_
    {
        chomp;  # remove newline from $_

        print $_,"\n" if (/9{2}$/); # show $_ if it ends in 99
    }

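If you keep the newline character, the $ anchor still matches just before a newline at the very end of the input, so /9{2}$/ keeps working. If you want the newline itself to be part of the match, include it in your Regular Expression: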

    /9{2}\n$/

This matches if there are two nines and a newline-character at the end of the input.

Anchors: word boundaries

As far as we know, the anchor \b originated in Perl, and is now available in some other Regular Expression variants, as shown earlier. Perl has the \b, but also the opposite \B. Table 13.8, “Perl (non-)word boundary anchors” shows both.

Table 13.8. Perl (non-)word boundary anchors

primitive   description
\b          word boundary (between \w and \W)
\B          not a word boundary


Extracting matching parts

Perl has two ways to re-use the matching parts (or even subparts). The first is done by reading some special pseudo variables, the second involves grouping.

Method 1: using $`, $& and $'

After a successful match, the three related pseudo variables $`, $& and $' will contain the following:

$&

The $& pseudo variable (correct spelling: a dollar sign and an ampersand character) contains that part of the input that the Regular Expression precisely matched.

$`

The $` pseudo variable (correct spelling: a dollar sign and a backward quote a.k.a. backtick) contains the part of the input before the matching part.

$'

The $' pseudo variable (correct spelling: a dollar sign and a forward single quote) contains the part of the input after the matching part.

Note

Be sure to confirm that the Regular Expression really matched. Otherwise, $`, $& and $' do not describe the current input: they are either undefined or still hold the values of an earlier successful match.

An example:

    $_ = 'This is a nice book';

    if (/\w+ic\w+/)
    {
        print 
            "Input:             \"$_\"\n",
            "Part before match: \"$`\"\n",
            "Matching part:     \"$&\"\n",
            "Part after match:  \"$'\"\n";
    }

The output of this program part is:

    Input:             "This is a nice book"
    Part before match: "This is a "
    Matching part:     "nice"
    Part after match:  " book"

Note that the contents of $` end with a space and those of $' begin with one, because the spaces around the matched word are not part of the match.

Method 2: grouping

In the Perl Regular Expression language, one can use grouping to extract matching parts or even subparts from the input.

Suppose the input is:

    Today is 7 January 2002, a monday.

The following Perl Regular Expression matches the complete date in the input:

    \b\d{1,2}\s+[A-Za-z][a-z]+\s+\d{4}\b

In Perl, you can also group the day, the month and the year parts:

    \b(\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4})\b

When this Regular Expression is matched against the input, three pseudo variables, namely $1, $2 and $3, will be filled in with the corresponding parts of the date. The information corresponding to the first group (seen from the left) is inserted in $1, the second in $2, and so on:

     $_ = "Today is 7 January 2002, a monday.";

     if (/\b(\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4})\b/)
     {
         print "Day:   $1\n",
               "Month: $2\n",
               "Year:  $3\n";
     }

Nested groups.  Groups can be nested:

     \b((\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4}))\b

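Groups are numbered by the position of their opening parenthesis, counted from the left. Information corresponding to the outermost group (here: the whole date) therefore gets into $1. The parts inside that group fall into $2, $3 and so on. To see this working, use something like this: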

     $_ = "Today is 7 January 2002, a monday.";

     if (/\b((\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4}))\b/)
     {
         print "Date:  $1\n",
               "Day:   $2\n",
               "Month: $3\n",
               "Year:  $4\n";
     }

Alternate method 2: using the binding operator

There is an alternative to the use of the pseudo variables ($1 etc.). The parts of the input that correspond to the groups in the Regular Expression can be stored in an array. In the program part below, these subparts of the input are stored in the array @result:

    my $input = "Today is 7 January 2002, a monday.";

    my @result = $input =~ /\b((\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4}))\b/;

    if (@result)    # if RE fits
    {
        # etc
    }

It is important to realize that

    $input =~ /RE/

actually is an expression that, when used in list context, returns a list of the parts matched by the groups. To capture this, put an array assignment before it:

    @result = $input =~ /RE/

This really is correct Perl! Remember that the =~ operator is called the binding operator. It is not an assignment.

This program part puts everything in perspective:

    my $input = "Today is 7 January 2002, a monday.";

    my @result = $input =~ /\b((\d{1,2})\s+([A-Za-z][a-z]+)\s+(\d{4}))\b/;

    if (@result)    # if RE fits
    {
        for (my $i = 0; $i < scalar @result; $i++)
        {
            print "Field $i: $result[$i]\n";
        }
    }

Elements of the array will be filled in the same order the pseudo variables were assigned. The first element, $result[0], will get the complete date, $result[1] will get the day, $result[2] the month and $result[3] the year.

Copyright Snow B.V. The Netherlands