Using sed

Revision: $Revision: 1.3 $ ($Date: 2004-01-30 10:22:06 $)

sed is an abbreviation of stream editor. It can edit information on the fly while reading information from either standard input or one or more files.

As shown in the section called “Regular Expressions”, sed operates with the classic Regular Expressions. The Regular Expressions described here are valid for GNU sed (the examples are tested with GNU sed version 3.02). Note that very old, non-POSIX compliant versions of sed may not have the extra multipliers and grouping support.

Behaviour

Note that sed will pass all information it reads on to standard output by default.

Since we can control information when it is processed by sed, we can delete part of the information before it reaches standard output or we can change part of the information so that what comes out on standard output is different from what was read.

But, if we do not intervene, information is shown as-is by sed.

This behavior can also be reversed. When the -n option is specified, sed will only show information when explicitly asked to do so.
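A minimal sketch of the difference, assuming a POSIX shell and a reasonably recent sed. The empty expression '' (do nothing) and the p (print) command are not discussed in this section; they are used here only to make the default behavior and the effect of -n visible. The variable names are made up for the illustration.

```shell
# Default: every line read is passed on unchanged.
default_out=$(printf 'one\ntwo\n' | sed '')

# With -n nothing is shown unless a command asks for it.
quiet_out=$(printf 'one\ntwo\n' | sed -n '')

# The p command explicitly prints the addressed line (here: line 2).
picked_out=$(printf 'one\ntwo\n' | sed -n '2p')

printf 'default: %s\n' "$default_out"
printf 'quiet:   [%s]\n' "$quiet_out"
printf 'picked:  %s\n' "$picked_out"
```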

Calling sed

There are two ways to call sed. In the first form, sed reads from standard input:

... | sed [flags] 'sed expression'

In the second form, sed will read information from one or more files:

sed [flags] 'sed expression' file(s)

The crucial part in both forms is the sed expression. The sed expression specifies what will be done to the information. What it looks like will be discussed further on. How a sed expression is specified is shown here.

Note

The sed expression may contain spaces and characters that are subject to shell expansion. To prevent expansion by the shell, the complete sed expression must be enclosed in single quotes, as shown above. sed will not see these quotes.

When multiple sed expressions are desired, each must have a -e flag before it. The first form:

... | sed -e 'sed expr1' -e 'sed expr2'

The second form:

sed -e 'sed expr1' -e 'sed expr2' file(s)
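As a small sketch of both calling forms with two expressions, assuming a POSIX shell with mktemp available. The file contents and variable names are invented for the illustration, and the d (delete) command used in the expressions is explained further on.

```shell
# Create a throw-away input file with three lines.
tmpfile=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$tmpfile"

# First form: sed reads from standard input.
from_stdin=$(cat "$tmpfile" | sed -e '1d' -e '/gamma/d')

# Second form: sed reads the file itself.
from_file=$(sed -e '1d' -e '/gamma/d' "$tmpfile")

rm -f "$tmpfile"
printf '%s\n' "$from_stdin" "$from_file"
```

Both forms apply the same two expressions, so both should show only the line beta.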

The sed expression

The sed expression allows you to control the information passing through sed. It can, however, have many different forms.

The general form of a sed expression is

addresssedcommand

That is: an address immediately followed by a sedcommand. If address is valid for a line of information, then sedcommand will be applied.

An address can be a Regular Expression (between slashes), or it can be a line-number specification. Some sedcommands may need extra additions.

Regular Expression specification.  A form with a Regular Expression specification as address looks like:

/regex/sedcommand

An example of this is:

    /=/d

The Regular Expression specification is /=/, the sedcommand is d. Together, these tell sed not to display lines that contain an = character.

Line number specification.  The address can be a line-number specification (abbreviated as linespec):

linespecsedcommand

(that is: linespec immediately followed by sedcommand). The linespec is a specification of line number(s). They can be specified as a single number or as a range: two numbers with a comma between them. An example of this is:

    1,3d

The result of this is that the first three lines read by sed will not be shown. The other lines will, of course, be shown.
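The two kinds of address can be tried side by side. This is only a sketch with invented sample text, assuming a POSIX shell:

```shell
# Five lines of made-up input.
sample='# settings
name=demo
plain line
size=3
last line'

# Regular Expression address: drop every line that contains an = character.
no_assign=$(printf '%s\n' "$sample" | sed '/=/d')

# Line-number address: drop lines 1 through 3.
skip_three=$(printf '%s\n' "$sample" | sed '1,3d')

printf '%s\n' "$no_assign"
printf '%s\n' '--' "$skip_three"
```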

The most frequently used sedcommands

The delete and substitute commands are probably the most frequently used commands.

The delete command

The general form is

addressd

That is: address immediately followed by a d. Using this, sed will omit the information that matches address from the information that is being processed by sed. Information not matching address will, of course, be passed on.

An example of this is:

    sed '1d' /etc/passwd

This command line[21] will result in all lines of /etc/passwd, except the first one, being sent to standard output (the screen in this case).

This is another example, using a range:

    sed '1,10d' myfile

This will cause the contents of myfile to be shown, minus the first ten lines.

Using a Regular Expression as address works the same way:

    sed '/^$/d' myscript | less

This will pass all non-empty lines from myscript to the less command.

A little extension:

    sed '/^ *$/d' myscript > newscript

Here the result will be that all lines in myscript, except those that contain nothing or only space(s), will be written to newscript.
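The difference between the two expressions shows up on a line that contains only spaces. A small sketch, with invented input and variable names, assuming a POSIX shell:

```shell
# Four lines: a command, an empty line, a line of three spaces, a command.
sample=$(printf 'echo start\n\n   \necho end\n')

# /^$/ matches only the truly empty line.
no_empty=$(printf '%s\n' "$sample" | sed '/^$/d')

# /^ *$/ also matches the line that holds nothing but spaces.
no_blank=$(printf '%s\n' "$sample" | sed '/^ *$/d')

printf '%s\n' "$no_empty" | wc -l   # 3 lines remain: the spaces-only line survives
printf '%s\n' "$no_blank" | wc -l   # 2 lines remain
```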

The substitute command

With the substitute command, you can change some of the information before it is sent to the standard output. The other information will go there as well, but unchanged.

The most general form of the substitute sedcommand is:

addresss/regex/replacementstring/modifiers

The address works as a line selector, as shown earlier. The address part can be omitted, in which case the substitute is attempted on all lines. The modifiers can also be omitted, in which case the defaults apply (more on this later). Since both the address and the modifiers can be omitted, a simplified command would look like this:

s/regex/replacementstring/

This is the form most often seen in practice.

The substitute works as follows: the part of the line on which regex fits is replaced by replacementstring.

sed reads its information line by line, as do many other Unix commands. A substitute works only once per line: the first location (seen from the left) where the Regular Expression matches is replaced by the replacementstring.

This behavior can be changed by the g modifier. When the g modifier is specified, each part of the input string where the Regular Expression fits is replaced by the replacement string. The g (global) means: as many times as possible within the input line.

Now, let's look at this in practice. Suppose the following information is generated by echo:

    echo http://www.x.z/HTML Files/Pag 1.html | sed 's/ /%20/'

In this example, the input string contains two spaces. The sed command-line replaces one space (the first) by %20. So the result will be:

    http://www.x.z/HTML%20Files/Pag 1.html

That is, the first space is replaced by %20, the second is not.

Note

What is shown here is part of so-called URL encoding, which is not covered completely here. At the very least, percent signs in the input should be replaced (by %25) first.

To replace all spaces, the g modifier needs to be specified:

    echo http://www.x.z/HTML Files/Pag 1.html | sed 's/ /%20/g'

Now the result is:

    http://www.x.z/HTML%20Files/Pag%201.html

That is, both spaces are replaced by %20.
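The note above says that percent signs should be replaced first. The following hedged sketch shows why the order matters. The file name is invented, and %25 is the URL encoding of the percent sign itself. This is still far from a complete URL encoder.

```shell
name='100% sure.html'

# Percent signs first, then spaces: each character is encoded exactly once.
right_order=$(printf '%s\n' "$name" | sed -e 's/%/%25/g' -e 's/ /%20/g')

# Spaces first: the % of the freshly inserted %20 is encoded a second time.
wrong_order=$(printf '%s\n' "$name" | sed -e 's/ /%20/g' -e 's/%/%25/g')

printf '%s\n' "$right_order"   # 100%25%20sure.html
printf '%s\n' "$wrong_order"   # 100%25%2520sure.html
```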

Suppose you've got a file containing lines with URLs as well as lines containing other information. You want to replace spaces in the URLs, but only in the URLs. In this case, an address can be used to select the URLs:

    /^http:/s/ /%20/g

Suppose file mystuff contains the following:

    http://a.b.c/Item 1/index Main.html
    some other information on a separate line
    http://a.b.c/Item 2/index Subpage.html

This file can be processed as follows:

    sed '/^http:/s/ /%20/g' mystuff

The output will be:

    http://a.b.c/Item%201/index%20Main.html
    some other information on a separate line
    http://a.b.c/Item%202/index%20Subpage.html

Outputting changed lines only

Recall that the -n flag reverses the behavior of sed: information read is not passed on by default. To show information some action must be taken. The substitute command has a p modifier that does exactly that: outputting those lines that are changed. Using both the -n flag and the p modifier together will output only those lines that are changed.

sed -n 's/regex/replacement/p' infile(s)

Another example is this:

    sed -n 's/\b[Ll]inux\b/LINUX/p' txtfile

This will read all lines from txtfile, but show only the lines that contain the word linux or Linux, with its first occurrence on each line replaced by LINUX (add the g modifier to replace every occurrence).
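The p modifier can be combined with g. A sketch with invented input; the \b word-boundary anchors are left out here so that the example does not depend on GNU sed:

```shell
text='linux and Linux
nothing here
Linux only'

# p alone: only the first match on each changed line is replaced.
first_only=$(printf '%s\n' "$text" | sed -n 's/[Ll]inux/LINUX/p')

# g and p together: every match on each changed line is replaced.
all_hits=$(printf '%s\n' "$text" | sed -n 's/[Ll]inux/LINUX/gp')

printf '%s\n' "$first_only" '--' "$all_hits"
```

In both cases the line "nothing here" is not shown, because no substitution took place on it.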

The &: the matching part

The precise part of an input line on which the Regular Expression matches is represented by &, which can then be used in the replacement part. An example is (split across lines for readability):

    echo Now is 15 Feb 2002 | \
          sed 's/[0-9]\{1,2\} \+[A-Za-z][a-z]\+/=&=/'

This will replace the date part with the date surrounded by = characters. The part of the input that does not match the Regular Expression is not replaced. Instead it is shown literally. So the output is:

    Now is =15 Feb= 2002
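Because & has this special meaning, a literal ampersand in the replacement part must be written as \&. A short sketch with invented input, using only constructs that should work in any POSIX sed:

```shell
# & stands for the matching part: the number is wrapped in angle brackets.
marked=$(echo 'port 8080 open' | sed 's/[0-9][0-9]*/<&>/')

# \& is a literal ampersand: the number is replaced by the & character.
literal=$(echo 'port 8080 open' | sed 's/[0-9][0-9]*/\&/')

printf '%s\n' "$marked"    # port <8080> open
printf '%s\n' "$literal"   # port & open
```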

Grouping in sed

Grouping can be used in sed, at least in the GNU version that is used by default in Linux. A group is opened with \( (a backslash and an opening parenthesis) and closed with \) (a backslash and a closing parenthesis). Grouping can be used in combination with back-referencing.

Back-references

Remember that a back-reference is the re-use of a part of a Regular Expression selected by grouping.

Back-references in sed can be used in both a Regular Expression and in the replacement part of the substitute command.

For example, the following will not show lines in files f1 and f2 that contain four identical uppercase letters in a row:

    sed '/\([A-Z]\)\1\1\1/d' f1 f2

The following substitute will replace each series of four identical uppercase-letters by four X characters:

    ls -la | sed 's/\([A-Z]\)\1\1\1/XXXX/g'

The following sed expression places any uppercase letter between square brackets:

    s/\([A-Z]\)/[\1]/g

Remember that the replacement part is not a Regular Expression, so the square brackets are literal characters.

For example:

    echo Hello There | sed 's/\([A-Z]\)/[\1]/g'

The output of the above command line will be:

    [H]ello [T]here
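More than one group can be used, in which case \1 refers to the first group and \2 to the second. A sketch with invented input, assuming a POSIX shell:

```shell
# Two groups in the Regular Expression, used in reverse order in the replacement.
swapped=$(echo 'Smith John' | sed 's/\([A-Za-z]*\) \([A-Za-z]*\)/\2 \1/')

# A back-reference inside the Regular Expression: four identical letters in a row.
masked=$(echo 'AAAA BBB CCCC' | sed 's/\([A-Z]\)\1\1\1/XXXX/g')

printf '%s\n' "$swapped"   # John Smith
printf '%s\n' "$masked"    # XXXX BBB XXXX
```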

Replacing the whole input

Remember that a sed substitute command will replace only the part on which the Regular Expression matches. Parts of the line that do not match are sent unchanged to the standard output.

When you do want to replace the whole line (for instance, to keep only part of it), the following must be true:

  1. the Regular Expression must match the complete line

  2. the desired part must be selected using grouping

Let's look again at the date example discussed earlier (split across lines for readability):

    echo Now is 15 Feb 2002 | \
        sed 's/^.*\b\([0-9]\{1,2\} \+[A-Za-z][a-z]\+\).*$/\1/'

The output is exactly:

    15 Feb

Do not forget the \b word-boundary anchor: without it, the .* will match too many characters.
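The effect of the greedy .* can be made visible with a sketch that avoids the GNU-only \b and \+. Here a literal space in front of the group plays the role of the anchor, which is an assumption that fits this particular input only:

```shell
# No anchor: the leading .* swallows the 1 of 15.
greedy=$(echo 'Now is 15 Feb 2002' |
    sed 's/^.*\([0-9]\{1,2\}  *[A-Za-z][a-z][a-z]*\).*$/\1/')

# A space before the group stops .* from eating into the number.
anchored=$(echo 'Now is 15 Feb 2002' |
    sed 's/^.* \([0-9]\{1,2\}  *[A-Za-z][a-z][a-z]*\).*$/\1/')

printf '%s\n' "$greedy"     # 5 Feb
printf '%s\n' "$anchored"   # 15 Feb
```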

White space

Remember that sed uses the classical Regular Expressions, in which there is no easy way to specify a tabulation character.

In modern, GNU versions of sed, POSIX character-classes are accepted, so [[:blank:]] and [[:space:]] can be used to match white space. However, since POSIX support (required by some governments) is not yet present on all non-Linux systems, the use of POSIX character-classes may be non-portable.
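Where POSIX character-classes are available, they can be used as in the following sketch. The input is invented; printf is used to produce a real tabulation character.

```shell
# Squeeze every run of white space (spaces or tabs) into a single space.
squeezed=$(printf 'a\tb  c\n' | sed 's/[[:space:]][[:space:]]*/ /g')

# Show only the lines that contain a space or a tab.
with_blank=$(printf 'x\ty\nxy\n' | sed -n '/[[:blank:]]/p' | wc -l | tr -d ' ')

printf '%s\n' "$squeezed"     # a b c
printf '%s\n' "$with_blank"   # 1
```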

Advanced sed

Lots of things can be done with sed. Among sed's facilities are command-grouping and pattern-space. Things might grow too complex to use on the command line, so this section will show how to use a separate file with sed commands.

Putting sedcommands in a file

It is possible to store individual sed expressions in a so-called sed script-file. Each line should contain exactly one sed expression. There should be no quotes around the sed expressions in the sedfile:

    s/red/green/
    /blue/d

The sed script-file can be called by sed using the -f flag. The first form:

... | sed -f sedfile

The second form:

sed -f sedfile file(s)

Other flags can be added (e.g. -n) before the -f flag or after sedfile. The -f and sedfile should be used together. Do not put other parameters between -f and sedfile.

Consider for example, the following sed command:

    sed -e 's/<title>/title(/' -e 's/<\/title>/)/' f1

This can be put in a special sed command-file. We call it chgtitle.sed:

    s/<title>/title(/
    s/<\/title>/)/

Now all we have to do is:

    sed -f chgtitle.sed f1

This is the contents of file f1:

    <title>Everything about sed</title>

The output of the sed command line will be:

    title(Everything about sed)

Using a separate sedfile allows you to do more complex things. Among these are command grouping and the pattern buffer.

Command grouping

sed commands, such as d or s, can be grouped. This can be explained using an example task: an ASCII art table that should be converted into HTML.

Suppose the file orgtable contains the following table in ASCII art:

    Table 1
     20-3  3-4 10-4 17-4  1-5  8-5 15-5 22-5  5-6 19-6  pct
    -------------------------------------------------------
    |    |+5  |    |    |-9  |-8  |-6  |+2b |+1  |+8  | 57%|
    |-2b |+8b |-6b |-9  |+1b |+7  |+8b |+1c |+6  |-7  | 60%|
    |+7b |-7  |-6  |+2c |+10c|+1  |    |-9  |    |+3b | 63%|

We want to convert the lines inside the table into HTML. After conversion, the first line should look roughly like this (wrapped here for readability):

    <TR><TD>&nbsp;&nbsp;</TD><TD>+5</TD>
    <TD>&nbsp;&nbsp;</TD><TD>&nbsp;&nbsp;</TD>
    <TD>-9</TD> <TD>-8</TD> <TD>-6</TD> <TD>+2b</TD> <TD>+1</TD> <TD>+8</TD>
    <TD>57%</TD> </TR>

To accomplish this, we have to make several substitutions, among them the substitution of white space into &nbsp; (the non-breakable-space character in HTML). We only want to do this in the table entries (that is: all lines that start with a |), not in the table header. The following commands in the sedfile called htmlise.sed will change only those lines that start with a |. Note that sed tests each address against the line as it is at that moment: once the first command has replaced the leading |, the line starts with <TR>, so the remaining commands use /^<TR>/ as their address:

    /^|/s/^|/<TR><TD>/
    /^<TR>/s/|$/<\/TD><\/TR>/
    /^<TR>/s/|/<\/TD><TD>/g
    /^<TR>/s/ /\&nbsp;/g
    /^---/d

As a bonus it gets rid of the dashed line.

Note

The literal slashes in the replacement part need to be prefixed by a backslash to prevent sed from mixing them up with the slash separator. Likewise, the & in &nbsp; needs a backslash, because an unescaped & in the replacement part stands for the matching part.

Here is the complete call to convert the contents of orgtable to HTML using htmlise.sed:

    sed -f htmlise.sed orgtable

There is another way to write the same thing. As can be seen, every command needs an address of its own, and the address had to change after the first substitution. Commands can be grouped by writing the address once, followed by the commands inside curly braces. The address is then tested only once, before any of the commands has changed the line. This is a so-called command group:

    /^|/  {
        s/^|/<TR><TD>/
        s/|$/<\/TD><\/TR>/
        s/|/<\/TD><TD>/g
        s/ /\&nbsp;/g
    }
    /^---/d

We begin by matching every input line with the /^|/ pattern. If this matches, all the commands between the curly braces are applied to this line. The first and the last vertical bars are exceptions because they need only <TR><TD> (row and cell open in HTML) and </TD></TR> (cell and row close in HTML), so we substitute them first. Next we substitute each remaining vertical bar with </TD><TD>. Last, we substitute each space by &nbsp;.
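The command group can be tried on a shortened table. This is a sketch only: the three input lines are a cut-down, invented version of orgtable, and the sedfile is written to a temporary file with mktemp. In this sketch the slash in </TD> and the & in &nbsp; are both escaped with a backslash, since an unescaped & in the replacement part stands for the matching part.

```shell
# Write the command group to a temporary sedfile.
# The quoted 'EOF' keeps the shell from touching the backslashes.
sedfile=$(mktemp)
cat > "$sedfile" <<'EOF'
/^|/ {
    s/^|/<TR><TD>/
    s/|$/<\/TD><\/TR>/
    s/|/<\/TD><TD>/g
    s/ /\&nbsp;/g
}
/^---/d
EOF

# A title line, a dashed line and one table row with two cells.
html_rows=$(printf '%s\n' 'Table 1' '----------' '|+5  | 57%|' | sed -f "$sedfile")
rm -f "$sedfile"

printf '%s\n' "$html_rows"
```

The title line passes through unchanged, the dashed line disappears, and the table row comes out as <TR><TD>+5&nbsp;&nbsp;</TD><TD>&nbsp;57%</TD></TR>.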

Note

White space at the start of a command is allowed, so you can indent your script to reflect the grouping of the commands. White space after a command is not accepted by every version of sed, so it is safest to end each command directly with a newline or a semicolon. The semicolon is supported in the GNU version of sed, but older versions of sed may not accept it.

The pattern buffer

Suppose you want to match a fixed group of words, such as a name, but they are not on the same line. For example, in a manual you want to change the words local unix guru into Local Expert. But, the word local could be on one line and the words unix guru on the next. This would be hard to match with the commands discussed so far. Enter the pattern buffer.

The pattern buffer is the place where the current line is stored (that is: the buffer to which the Regular Expression is matched). This is the buffer on which the commands are executed and which is printed after the last command is finished. Lines can be added to this buffer by using the N command. When the new line is appended to the former, a newline character is inserted between them.

It is now possible to match input like this:

    Yesterday, our local
    unix guru explained that ...

In this case, we have to match every combination in which the words can be divided over two lines (three possibilities). Even better: we can first remove the newline character. Our sedfile would look like this:

    /local/ {
        N
        s/ *\n/ /
        s/local unix guru/Local Expert/
    }

The output will be:

    Yesterday, our Local Expert explained that ...

There is still a small problem though. When you run this script on the following input, the result might be not quite what you expect. Here is the input:

    When all else fails, ask your local unix guru.

        This is an extra indented line.

    This story was brought to you by:
    your local unix guru

The result is:

    When all else fails, ask your Local Expert. 
        This is an extra indented line.

    This story was brought to you by:
    your local unix guru

First, the empty line before This is an extra ... has been removed. This is probably not what we intended. Second, the last line is unchanged (in some non-GNU versions of sed, the last line may disappear, because there the N command quits without printing the pattern buffer when there is no next line).

The problem is that the phrase we looked for (local unix guru) was completely on one line. So we did not have to read the second line before making the substitution. Because we did so anyway, the newline was removed and the empty line was joined to the line that contained the match. The second time this happened, the N command did not find a next line, so sed stopped before reaching the substitute command and the last line of input was not changed at all.

To fix this we could add a rule to match the pattern when it appears completely on one line. This gives us the following script:

    s/local unix guru/Local Expert/
    /local/ {
        N
        s/ *\n/ /
        s/local unix guru/Local Expert/
    }

The output now becomes:

    When all else fails, ask your Local Expert.

        This is an extra indented line.

    This story was brought to you by:
    your Local Expert

Note that, of the commands discussed here, N is the only way to remove newlines from a file with sed. Because sed operates on one line at a time, it normally ignores the newline. So you can't match it, unless you add the next line to the pattern buffer. When you do that, the newline is reinserted, and you can match it using \n.
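The corrected script can be run end to end as follows. This is a sketch that assumes GNU sed (for \n in the Regular Expression) and a POSIX shell with mktemp; the temporary file and the variable names are invented for the illustration.

```shell
# Store the two-rule script in a temporary sedfile.
sedfile=$(mktemp)
cat > "$sedfile" <<'EOF'
s/local unix guru/Local Expert/
/local/ {
    N
    s/ *\n/ /
    s/local unix guru/Local Expert/
}
EOF

# The sample input from the text, one argument per line.
fixed_out=$(printf '%s\n' \
    'When all else fails, ask your local unix guru.' \
    '' \
    '    This is an extra indented line.' \
    '' \
    'This story was brought to you by:' \
    'your local unix guru' | sed -f "$sedfile")
rm -f "$sedfile"

printf '%s\n' "$fixed_out"
```

Keep in mind that the group still joins two lines whenever a line contains local without the rest of the phrase, so this remains a sketch rather than a general solution.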

A stand-alone sed script

Let's go back to the htmlise.sed file used earlier. If you are getting bored of typing sed -f htmlise.sed, consider making this a stand-alone sed script. That is a sed script-file that becomes a command of its own.

First, start by adding a line at the beginning:

    #!/bin/sed -f

Of course, the location of the sed program may be different.

The total file will now look like this:

    #!/bin/sed -f
    /^|/  {
        s/^|/<TR><TD>/
        s/|$/<\/TD><\/TR>/
        s/|/<\/TD><TD>/g
        s/ /\&nbsp;/g
    }
    /^---/d

Now rename it to htmlise, install it in a directory that is named in the PATH environment variable, give it execute permission, and you are ready to run the command everywhere:

    htmlise orgtable
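The steps can be tried out safely in a temporary directory. The sketch below makes a few assumptions: the script name recolor and its two commands are invented, the location of sed is looked up with command -v instead of being hard-coded, and the script is called by its full path rather than installed in a PATH directory.

```shell
workdir=$(mktemp -d)
sed_path=$(command -v sed)

# First line: the interpreter line; the rest: ordinary sed commands.
{
    printf '#!%s -f\n' "$sed_path"
    printf '%s\n' '/^---/d' 's/red/green/'
} > "$workdir/recolor"

# Give the script execute permission.
chmod +x "$workdir/recolor"

# Run it like any other command, with an input file as argument.
printf 'red\n---\nblue\n' > "$workdir/in.txt"
result=$("$workdir/recolor" "$workdir/in.txt")

rm -rf "$workdir"
printf '%s\n' "$result"
```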



[21] The quotes are not really needed here, but added for completeness.

Copyright Snow B.V. The Netherlands