Revision: $Revision: 1.3 $ ($Date: 2004-01-30 10:22:06 $)
sed is an abbreviation of stream
editor. It can edit information on the fly while reading
information from either standard input or one or more
files.
As shown in the section called “Regular Expressions”, sed operates with the classic Regular Expressions. The Regular Expressions described here are valid for GNU sed (the examples are tested with GNU sed version 3.02). Note that very old, non-POSIX compliant versions of sed may not have the extra multipliers and grouping support.
Note that sed will pass all information it reads on to standard output by default.
Since we can control information when it is processed by sed, we can delete part of the information before it reaches standard output or we can change part of the information so that what comes out on standard output is different from what was read.
But, if we do not intervene, information is shown as-is by sed.
This behavior can also be reversed.
When the -n option is specified,
sed will only show information when explicitly asked
to do so.
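The difference between the default behavior and -n can be sketched as follows. The three input lines are made up for this illustration, and the p command (print the current line) is a standard sed command that is not otherwise discussed in this section.

```shell
# Three lines of sample input (made up for this sketch).
input=$(printf 'alpha\nbeta\ngamma\n')

# Default: every line read is passed on to standard output.
all_lines=$(printf '%s\n' "$input" | sed '')

# With -n nothing is shown unless asked for; here 'p' prints line 2 only.
only_second=$(printf '%s\n' "$input" | sed -n '2p')

printf '%s\n' "$all_lines"
printf '%s\n' "$only_second"
```

The first printf shows all three lines; the second shows only beta.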
There are two ways to call sed. In the first form, sed reads from standard input:
... | sed [flags] 'sed expression'
In the second form, sed will read information from one or more files:
sed [flags] 'sed expression' file(s)
The crucial part in both forms is the sed expression. The sed expression specifies what will be done to the information. What it looks like will be discussed further on. How a sed expression is specified is shown here.
The sed expression may contain spaces and characters that are subject to shell expansion. To prevent expansion by the shell, the complete sed expression must be enclosed in single quotes, as shown above. sed will not see these quotes.
When multiple sed expressions are desired, each
must have a -e flag before it.
The first form:
... | sed -e 'sed expr1' -e 'sed expr2'
The second form:
sed -e 'sed expr1' -e 'sed expr2' file(s)
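As a small illustration of the first form, the following sketch applies two substitute expressions to one made-up input line. The expressions are applied in the order given.

```shell
# Each -e introduces one sed expression; they are applied in order.
result=$(printf 'red sky, blue sea\n' | sed -e 's/red/green/' -e 's/blue/yellow/')
printf '%s\n' "$result"
```

This prints green sky, yellow sea.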
The sed expression allows you to control the information passing through sed. It can, however, have many different forms.
The general form of a sed expression is
addresssedcommand
That is: an address immediately followed by a sedcommand. If address is valid for a line of information, then sedcommand will be applied.
An address can be a Regular Expression (between slashes), or it can be a line-number specification. Some sedcommands may need extra additions.
Regular Expression specification. A form with a Regular Expression specification as address looks like:
/regex/sedcommand
An example of this is:
/=/d
The Regular Expression specification is /=/, the
sedcommand is d.
Together, these tell sed not to display lines that
contain an = character.
Line number specification. The address can be a line-number specification (abbreviated as linespec):
linespecsedcommand
(that is: linespec immediately followed by sedcommand). The linespec is a specification of line number(s). They can be specified as a single number or as a range: two numbers with a comma between them. An example of this is:
1,3d
The result of this is that the first three lines read by sed will not be shown. The other lines will, of course, be shown.
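Both kinds of address can be tried out on a few lines of input. The sketch below uses a made-up four-line input and the two example expressions from above.

```shell
# Hypothetical four-line input for this sketch.
input=$(printf 'name=value\nplain text\nx=1\nlast line\n')

# Regular Expression address: drop every line containing '='.
no_equals=$(printf '%s\n' "$input" | sed '/=/d')

# Line-number address: drop lines 1 through 3.
after_three=$(printf '%s\n' "$input" | sed '1,3d')

printf '%s\n' "$no_equals"
printf '%s\n' "$after_three"
```

The first result contains the two lines without an = character; the second contains only the fourth line.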
The delete and substitute commands are probably the most frequently used commands.
The general form is
address
d
That is: address immediately followed by a
d. Using this, sed will omit the
information that matches address from the
information that is being processed by sed.
Information not matching
address will,
of course, be passed on.
An example of this is:
sed '1d' /etc/passwd
This command line
will send all lines of
/etc/passwd, except the first one, to
standard output (the screen in this case).
This is another example, using a range:
sed '1,10d' myfile
This will cause the contents of myfile to be
shown, minus the first ten lines.
Using a Regular Expression as address works the same way:
sed '/^$/d' myscript | less
This will pass all non-empty lines from myscript
to the less command.
A little extension:
sed '/^ *$/d' myscript > newscript
Here the result will be that all lines in myscript,
except those that contain nothing or only space(s), will be written to
newscript.
With the substitute command, you can change some of the information before it is sent to the standard output. The other information will go there as well, but unchanged.
The most general form of the substitute sedcommand is:
address
s/regex/replacementstring/modifiers
The address works as a line selector, as shown earlier. The address part can be omitted, in which case the substitute is attempted on all lines. The modifiers can also be omitted, in which case the defaults apply (more on this later). Since both the address and the modifiers can be omitted, a simplified command would look like this:
s/regex/replacementstring/
This format, of course, is very famous.
The substitute works as follows: the part of the line on which regex fits is replaced by replacementstring.
sed reads its information line by line, as do many other Unix commands. A substitute works only once per line: the first location (seen from the left) where the Regular Expression matches is replaced by the replacementstring.
This behavior can be changed by the g modifier.
When the g modifier is specified, each part of the
input string where the Regular Expression fits is replaced by the
replacement string. The g (global) means: as many
times as possible within the input line.
Now, let's look at this in practice. Suppose the following information is generated by echo:
echo http://www.x.z/HTML Files/Pag 1.html | sed 's/ /%20/'
In this example, the input string contains two spaces.
The sed command-line replaces one space (the first)
by %20. So the result will be:
http://www.x.z/HTML%20Files/Pag 1.html
That is, the first space is replaced by %20, the
second is not.
What is shown here resembles so-called URL encoding, but it does not cover it completely. At the very least, percent signs in the input should be replaced (by %25) first.
To replace all spaces, the g modifier needs to be
specified:
echo http://www.x.z/HTML Files/Pag 1.html | sed 's/ /%20/g'
Now the result is:
http://www.x.z/HTML%20Files/Pag%201.html
That is, both spaces are replaced by
%20.
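Why percent signs should be handled first can be seen in the following sketch. The URL is made up for the illustration, and only percent signs and spaces are treated, so this is still far from a complete URL encoder.

```shell
# Sample URL (made up) containing a literal percent sign and two spaces.
url='http://www.x.z/100% pure/Pag 1.html'

# Encode '%' first, then spaces.
encoded=$(printf '%s\n' "$url" | sed -e 's/%/%25/g' -e 's/ /%20/g')

# Wrong order: the '%' of every freshly inserted '%20' is encoded again.
double_encoded=$(printf '%s\n' "$url" | sed -e 's/ /%20/g' -e 's/%/%25/g')

printf '%s\n' "$encoded"
printf '%s\n' "$double_encoded"
```

With the percent sign handled first, the result is http://www.x.z/100%25%20pure/Pag%201.html. In the wrong order every space ends up as %2520.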
Suppose you've got a file containing lines with URLs as well as lines containing other information. You want to replace spaces in the URLs, but only in the URLs. In this case, an address can be used to select the URLs:
/^http:/s/ /%20/g
Suppose file mystuff contains the following:
http://a.b.c/Item 1/index Main.html
some other information on a separate line
http://a.b.c/Item 2/index Subpage.html
This file can be processed as follows:
sed '/^http:/s/ /%20/g' mystuff
The output will be:
http://a.b.c/Item%201/index%20Main.html
some other information on a separate line
http://a.b.c/Item%202/index%20Subpage.html
Recall that the -n flag reverses the behavior of
sed: information read is not passed on by default.
To show information some action must be taken.
The substitute command has a p modifier that does
exactly that: outputting those lines that are changed.
Using both the -n flag and the p
modifier together will output only those lines that are changed.
sed -n 's/regex/replacement/p' infile(s)
Another example is this:
sed -n 's/\b[Ll]inux\b/LINUX/p' txtfile
This will read all lines from txtfile, but
show only lines that contain either linux or
Linux with both linux and
Linux replaced by LINUX.
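The combination of -n and the p modifier can be checked with a few lines of made-up input. This sketch leaves out the \b anchors so that it does not depend on GNU extensions.

```shell
# Three sample lines (made up); only two of them mention linux/Linux.
input=$(printf 'I use linux daily\nnothing to see here\nLinux is a kernel\n')

# -n suppresses the default output; the p modifier prints changed lines only.
changed=$(printf '%s\n' "$input" | sed -n 's/[Ll]inux/LINUX/p')
printf '%s\n' "$changed"
```

Only the two changed lines are shown; the middle line is dropped.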
The precise part of an input line on which the Regular Expression
matches is represented by &, which can then
be used in the replacement part.
An example is (split across lines for readability):
echo Now is 15 Feb 2002 | \
sed 's/[0-9]\{1,2\} \+[A-Za-z][a-z]\+/=&=/'
This will replace the date part with the date surrounded by
= characters. The part of the input that does not
match the Regular Expression is not replaced. Instead it is shown
literally. So the output is:
Now is =15 Feb= 2002
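Two further uses of & are sketched below with made-up input: wrapping every number in angle brackets, and writing \& when a literal ampersand is wanted in the replacement.

```shell
# & stands for whatever the Regular Expression matched.
wrapped=$(echo 'pay 15 now and 200 later' | sed 's/[0-9][0-9]*/<&>/g')

# A backslash turns & into an ordinary character in the replacement.
literal=$(echo 'salt and pepper' | sed 's/and/\&/')

printf '%s\n' "$wrapped"
printf '%s\n' "$literal"
```

The first command gives pay <15> now and <200> later, the second gives salt & pepper.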
Grouping can be used in sed, at least in the GNU
version that is used by default in Linux.
A group is opened with “\(”
(a backslash and an opening parenthesis) and
closed with “\)” (a backslash and a
closing parenthesis).
Grouping can be used in combination with
back-referencing.
Remember that a back-reference is the re-use of a part of a Regular Expression selected by grouping.
Back-references in sed can be used in both a Regular Expression and in the replacement part of the substitute command.
For example, the following will not
show lines in files f1 and
f2 that contain four identical uppercase-letters:
sed '/\([A-Z]\)\1\1\1/d' f1 f2
The following substitute will replace each series
of four identical uppercase-letters by four X
characters:
ls -la | sed 's/\([A-Z]\)\1\1\1/XXXX/g'
The following sed expression places any uppercase-letter between square brackets:
s/\([A-Z]\)/[\1]/g
Remember that the replacement part is not a Regular Expression, so the square brackets are literal characters.
For example:
echo Hello There | sed 's/\([A-Z]\)/[\1]/g'
The output of the above command line will be:
[H]ello [T]here
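With more than one group, the back-references \1, \2 and so on can also be used to reorder parts of a line. A minimal sketch, using a made-up input string:

```shell
# \1 is the month name, \2 is the day number; the replacement swaps them.
swapped=$(echo 'Feb 15' | sed 's/\([A-Za-z][a-z]*\) \([0-9][0-9]*\)/\2 \1/')
printf '%s\n' "$swapped"
```

This prints 15 Feb.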
Remember that a sed substitute command will replace only the part on which the Regular Expression matches. Parts of the line that do not match are sent unchanged to the standard output.
When you do want to match the whole line, the following must be true:
the Regular Expression must match the complete line
the desired part must be selected using grouping
Let's look again at the date example discussed earlier (split across lines for readability):
echo Now is 15 Feb 2002 | \
sed 's/^.*\b\([0-9]\{1,2\} \+[A-Za-z][a-z]\+\).*$/\1/'
The output is exactly:
15 Feb
Do not forget the \b word-boundary anchor: without
it, the .* will match too many characters.
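The effect of leaving out the anchor can be sketched as follows. Because \b and \+ are GNU extensions, this sketch writes the repetitions in the classic way and, as one possible portable substitute for \b, requires a non-digit character in front of the group. That substitute is an assumption of this example, not part of the original command.

```shell
line='Now is 15 Feb 2002'

# Without an anchor, the greedy .* also swallows the "1" of "15".
greedy=$(echo "$line" | sed 's/^.*\([0-9]\{1,2\}  *[A-Za-z][a-z]*\).*$/\1/')

# Requiring a non-digit before the group keeps both digits.
anchored=$(echo "$line" | sed 's/^.*[^0-9]\([0-9]\{1,2\}  *[A-Za-z][a-z]*\).*$/\1/')

printf '%s\n' "$greedy"
printf '%s\n' "$anchored"
```

The first command gives 5 Feb, the second gives 15 Feb. Note that the second form needs at least one character in front of the date, so it would not match a line that starts with the date.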
Remember that sed uses the classical Regular Expressions, in which there is no easy way to specify a tabulation character.
In modern, GNU versions of sed, POSIX
character-classes are accepted, so [[:blank:]] and
[[:space:]] can be used to match white space.
However, since POSIX support (required by some governments) is not yet
present on all non-Linux systems, the use of POSIX character-classes may
be non-portable.
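On a sed that accepts POSIX character-classes, runs of mixed tabs and spaces can be squeezed into a single space as sketched below. The input string is made up, and printf is used to produce real tab characters.

```shell
# \t in the printf format becomes a real tab in the input.
squeezed=$(printf 'one\ttwo   three \t four\n' | sed 's/[[:space:]][[:space:]]*/ /g')
printf '%s\n' "$squeezed"
```

This prints one two three four.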
Lots of things can be done with sed. Among sed's facilities are command-grouping and pattern-space. Things might grow too complex to use on the command line, so this section will show how to use a separate file with sed commands.
It is possible to store individual sed expressions in a so-called “sed script-file”. Each line should contain exactly one sed expression. There should be no quotes around the sed expressions in the sedfile:
s/red/green/
/blue/d
The sed script-file can be called by sed using the
-f flag.
The first form:
... | sed -f sedfile
The second form:
sed -f sedfile file(s)
Other flags can be added (e.g. -n)
before the
-f flag or
after sedfile.
The -f and sedfile should be
used together. Do not put other parameters between
-f and sedfile.
Consider for example, the following sed command:
sed -e 's/<title>/title(/' -e 's/<\/title>/)/' f1
This can be put in a special sed command-file. We call it
chgtitle.sed:
s/<title>/title(/
s/<\/title>/)/
Now all we have to do is:
sed -f chgtitle.sed f1
This is the contents of file f1:
<title>Everything about sed</title>
The output of the sed command line will be:
title(Everything about sed)
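The first form, combined with an extra flag in front of -f, might look like the sketch below. The sedfile is created as a temporary file with mktemp, and its contents and the input lines are made up for this illustration.

```shell
# Create a throw-away sedfile: one sed expression per line, no quotes.
sedfile=$(mktemp)
cat > "$sedfile" <<'EOF'
/blue/d
s/red/green/p
EOF

# -n is placed before -f; only lines changed by the substitute are shown.
result=$(printf 'red car\nblue car\nold car\n' | sed -n -f "$sedfile")
printf '%s\n' "$result"

rm -f "$sedfile"
```

Only green car is printed: the blue line is deleted, and the unchanged line is suppressed by -n.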
Using a separate sedfile allows you to do more complex things. Among these are command grouping and the pattern buffer.
sed commands, such as
d or s, can be grouped.
This can be explained using an example task: an ASCII art table
that should be converted into HTML.
Suppose the file orgtable contains the following
table in ASCII art:
Table 1
20-3 3-4 10-4 17-4 1-5 8-5 15-5 22-5 5-6 19-6 pct
-------------------------------------------------------
| |+5 | | |-9 |-8 |-6 |+2b |+1 |+8 | 57%|
|-2b |+8b |-6b |-9 |+1b |+7 |+8b |+1c |+6 |-7 | 60%|
|+7b |-7 |-6 |+2c |+10c|+1 | |-9 | |+3b | 63%|
We want to convert the lines inside the table into HTML. The first line should look like this after conversion:
<TR><TD>&nbsp;</TD><TD>+5</TD>
<TD>&nbsp;</TD><TD>&nbsp;</TD>
<TD>-9</TD> <TD>-8</TD> <TD>-6</TD> <TD>+2b</TD> <TD>+1</TD> <TD>+8</TD>
<TD>57%</TD> </TR>
To accomplish this, we have to make several substitutions, among them
the substitution of white space into “&nbsp;” (the
non-breaking-space entity in HTML). We only want to do this in the table
entries (that is: all lines that start with a |),
not in the table header. The following commands in the
sedfile called htmlise.sed
will replace only those lines that start with a |:
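/^|/s/|$/<\/TD><\/TR>/
/^|/s/\(.\)|/\1<\/TD><TD>/g
/^|/s/ /\&nbsp;/g
/^|/s/^|/<TR><TD>/
/^---/d
As a bonus it gets rid of the dashed line. Each address is tested against the line as changed by the preceding commands, so the command that replaces the leading | has to come last; the pattern \(.\)| leaves that leading bar alone because no character precedes it.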
The literal slashes in the replacement part need to be prefixed by a backslash to prevent sed from mixing it up with the slash separator.
Here is the complete call to convert the contents of
orgtable to html using
htmlise.sed:
sed -f htmlise.sed orgtable
There is another way to write the same thing. As can be seen, the
address /^|/ is used four
times. Commands can be grouped by using the same
address followed by the same commands inside
curly braces. This is a so-called command group:
/^|/ {
s/^|/<TR><TD>/
s/|$/<\/TD><\/TR>/
s/|/<\/TD><TD>/g
s/ /\&nbsp;/g
}
/^---/d
We begin by matching every input line with the
/^|/ pattern. If this matches,
all the commands between the curly braces are applied
to this line. The first and the last vertical bars are
exceptions because they need only
“<TR><TD>”
(row and cell open in HTML)
and “</TD></TR>”
(cell and row close in HTML),
so we substitute them first.
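Next we substitute each remaining vertical bar with
“</TD><TD>”. Last,
we substitute each space by
“&nbsp;”; the & is written as \& in the script because an unescaped & in the replacement stands for the matched text.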
White space at the start of a command is allowed, so you can indent your script to reflect the grouping of the commands. You cannot put whitespace after a command: the end of a command must be either a newline or a semicolon. The semicolon is not part of the POSIX standard, but is supported in the GNU version of sed.
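To see a command group at work without creating any files, the sketch below feeds a small stand-in for orgtable (a header, a dashed line and one shortened row, all made up here) through an indented command group given on the command line. The address is tested once, when the group is entered, so all four substitutes run on a matching line.

```shell
# Small stand-in for orgtable (header, dashed line, one row); made up here.
table=$(printf '20-3 3-4 pct\n------------\n|+5 |-9 | 57%%|\n')

# The group is indented for readability; each command ends at a newline.
html=$(printf '%s\n' "$table" | sed -e '
/^|/ {
    s/^|/<TR><TD>/
    s/|$/<\/TD><\/TR>/
    s/|/<\/TD><TD>/g
    s/ /\&nbsp;/g
}
/^---/d
')
printf '%s\n' "$html"
```

The header line passes through unchanged, the dashed line disappears, and the row becomes a single HTML table row.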
Suppose you want to match a fixed group of words,
such as a name, but they are not on the same line. For example, in
a manual you want to change the words
“local unix guru” into
“Local Expert”. But, the word
“local”
could be on one line and the words “unix
guru” on the next. This would be hard to match with
the commands discussed so far. Enter the pattern
buffer.
The pattern buffer is the place where the current line
is stored (that is: the buffer to which the Regular Expression is
matched). This is the buffer on which the commands
are executed and which is printed after the last command
is finished.
Lines can be added to this buffer by using the N
command. When the new line is appended to the former, a
newline character is inserted between them.
It is now possible to match input like this:
Yesterday, our local
unix guru explained that ...
In this case, we have to match every combination in which the words can be divided over two lines (three possibilities). Even better: we can first remove the newline character. Our sedfile would look like this:
/local/ {
N
s/ *\n/ /
s/local unix guru/Local Expert/
}
The output will be:
Yesterday, our Local Expert explained that ...
There is still a small problem though. When you run this script on the following input, the result might be not quite what you expect. Here is the input:
When all else fails, ask your local unix guru.

    This is an extra indented line.
This story was brought to you by:
your local unix guru
The result is:
When all else fails, ask your Local Expert.
    This is an extra indented line.
This story was brought to you by:
your local unix guru
First, the empty line before This is an extra ...
has been removed. This is probably not what we intended.
Second, the last line is unchanged (in some non-GNU versions of
sed, the last line may even disappear, because those versions quit
without printing the pattern buffer when N finds no next line).
The problem is that the pattern we looked for (i.e.,
local) was on one line, together with the rest of the phrase.
So we did not have to read the second line
before making the substitution. Because we did so anyway,
the newline was removed and the empty line was joined to
the line that contained the match. The second time
this happened, N could not find a next line; GNU sed then prints the
pattern buffer and stops without running the remaining commands,
so the last line of input is not changed at all.
To fix this we could add a rule to match the pattern when it appears completely on one line. This gives us the following script:
s/local unix guru/Local Expert/
/local/ {
N
s/ *\n/ /
s/local unix guru/Local Expert/
}
The output now becomes:
When all else fails, ask your Local Expert.

    This is an extra indented line.
This story was brought to you by:
your Local Expert
Note that adding the next line to the pattern buffer is how newlines
are removed from a file with sed. Because sed operates
on one line at a time, the pattern buffer normally does not contain the newline.
So you can't match it, unless you add the next line to
the pattern buffer. When you do that, a newline is
inserted between the two lines, and you can match it using
“\n”.
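A minimal sketch of this technique joins each pair of input lines into one. The address $!N (apply N to every line except the last) is standard sed syntax that is not covered above; it is used here so that an odd last line is not affected by the problem with N at the end of input. The input lines are made up.

```shell
# $!N appends the next line unless the current line is the last one;
# the substitute then replaces the embedded newline with a space.
joined=$(printf 'one\ntwo\nthree\nfour\nfive\n' | sed -e '$!N' -e 's/\n/ /')
printf '%s\n' "$joined"
```

The five input lines come out as three: one two, three four and five.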
Let's go back to the htmlise.sed file used
earlier. If you are getting bored of typing sed -f
htmlise.sed, consider making this a stand-alone sed script.
That is, a sed script-file that becomes a command of its own.
First, start by adding a line at the beginning:
#!/bin/sed -f
Of course, the location of the sed program may be different.
The total file will now look like this:
#!/bin/sed -f
/^|/ {
s/^|/<TR><TD>/
s/|$/<\/TD><\/TR>/
s/|/<\/TD><TD>/g
s/ /\&nbsp;/g
}
/^---/d
Now rename it to htmlise, install it in a
directory that is named in the PATH environment
variable, give it execute permission, and you are ready to run
the command everywhere:
htmlise orgtable
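The same steps can be tried out in a temporary directory, as in the sketch below. The script name recolor, its two commands and the input file are hypothetical. Since the location of sed differs between systems, the first line of the script is built from the output of command -v sed rather than hard-coded, and the script is run by its full path instead of being installed in a PATH directory.

```shell
# Work in a throw-away directory.
workdir=$(mktemp -d)
sed_path=$(command -v sed)

# Write the stand-alone script: a #! line followed by the sed commands.
{
    printf '#!%s -f\n' "$sed_path"
    printf '%s\n' 's/red/green/' '/blue/d'
} > "$workdir/recolor"
chmod +x "$workdir/recolor"

# Run the script as a command of its own on a small input file.
printf 'red car\nblue car\n' > "$workdir/cars.txt"
result=$("$workdir/recolor" "$workdir/cars.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```

This prints green car. The sketch assumes that files in the temporary directory are allowed to be executed.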