Revision: $Revision: 1.10 $
A candidate should be able to recognize and identify boot loader and kernel specific stages and utilize kernel boot messages to diagnose kernel errors. This objective includes being able to identify and correct common hardware issues, and be able to determine if the problem is hardware or software.
Key files, terms and utilities include:
screen output during bootup
/bin/dmesg
kernel syslog entries in system logs (if entry is able to be gained)
various system and daemon log files in /var/log/
/sbin/lspci
/usr/bin/lsdev
/sbin/lsmod
/sbin/modprobe
/sbin/insmod
/bin/uname
location of system kernel and attending modules /, /boot, and /lib/modules
/proc filesystem
strace
strings
ltrace
lsof
Resources: the man pages for the various commands, Vasudevan02.
Debugging boot problems (or any other problems, for that matter) can be complex at times. Your best teacher is experience. However, by carefully studying the boot messages and understanding the mechanisms involved, you should be able to solve most, if not all, common Linux problems.
You need a good working knowledge of the boot process to be able to solve boot problems. In previous sections the boot process was described in detail as was system initialization. By now you should be able to determine in what stage the boot process is from the messages displayed on the screen. You should be able to utilize kernel boot messages to diagnose kernel errors. If not, please re-read the previous sections carefully. We will provide a number of suggestions on how to solve common problems in the next sections. However, it is beyond the scope of this book to cover all possible permutations. If your problem is not covered here, search the web for peers with the same problem, consult your colleagues, read more documentation, investigate and experiment.
In an ideal world you always have plenty of time to find out the exact nature of the problem and then solve it. In the real world you have to be aware of cost effectiveness. Look for the most cost-effective way to solve the problem. For example: let's say that initial investigation indicates that the boot disk has hardware problems. You do not yet know the exact nature of the problems, but you have eliminated the most common causes, and the disk still does not work reliably. Hence, you would need more time to investigate further. If your customer has a recent backup and, after consultation, agrees to go back to that state, you might suggest installing a brand new disk and restoring the most recent backup instead. The total cost of your time so far plus the new hardware is probably less than the cost of investigating any further, even more so if you consider the probability that you need to replace the disk after all.
This book is not an omniscient encyclopedia that describes solutions for all problems you may encounter. If you get stuck, there are many ways to get help. First, you could call or mail the distributor of your Linux version. Often you are granted installation support or 30 day generic support. It may be worth your money to subscribe to a support network - most distributions offer such a service at nominal fees.
Some distributions grant on-line (Internet) access to support databases. In these databases you can find a wealth of information on hardware and software problems. You can often search for keywords or error messages, and most entries are cross-referenced to help you find related topics.
You can also grep or zgrep for (parts of) an error message or keyword in the documentation on your system, typically in and under /usr/doc/, /usr/share/doc/ or /usr/src/linux/Documentation. Additionally, you could enter the error message in an Internet search engine, for example http://www.google.com. This often returns URLs of FAQs, HOWTOs and other documents that may contain clues on how to solve your problem. Usenet news archives, like http://www.deja.com, are another resource to use.
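As a minimal sketch of the documentation search described above, the function below looks for a pattern in both plain and gzip-compressed files. The function name search_docs is made up for this illustration, and the default directories are simply the ones named in the text; not every system has all of them.

```shell
#!/bin/bash
# search_docs PATTERN [DIR...]
# List documentation files (plain or gzip-compressed) that mention PATTERN.
# search_docs is a hypothetical helper name, not a standard command.
search_docs() {
    local pattern=$1
    shift
    # Fall back to the usual documentation directories when none are given.
    [ $# -eq 0 ] && set -- /usr/doc /usr/share/doc /usr/src/linux/Documentation
    local dir
    for dir in "$@"; do
        # Silently skip directories that do not exist on this system.
        [ -d "$dir" ] || continue
        # zgrep reads compressed and uncompressed files alike;
        # -i ignores case, -l prints only the names of matching files.
        find "$dir" -type f -exec zgrep -i -l -e "$pattern" {} + 2>/dev/null
    done
    return 0
}
```

A call such as search_docs 'unable to mount root' prints the names of the files worth reading; pass one or more directories as extra arguments to narrow the search.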
Often, Linux refuses to boot due to hardware problems. If a system boots and seems to be working fine, it still may have a hardware problem. If it is not working with Linux, but is working fine with other operating systems, for example DOS or Windows, this too often signifies hardware problems. Linux assumes the hardware to be up to specifications and will try to use it to its (specified) limits.
In general: regularly check the system log files for write and read errors. They indicate that the hardware is slowly becoming less reliable. Other indications of lurking hardware problems are: problems when accessing the CDROM (halt, long delays, bus errors, segmentation faults), kernel generation or compilation of other programs aborts with signal 11 or signal 7, scrambled or incorrect file contents, memory access errors, graphics that are not displayed correctly, CRC errors when accessing the floppy disk drive, crashes or halts during boot and errors when creating a filesystem.
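A rough sketch of such a regular log check follows. The function name scan_disk_errors and the list of patterns are assumptions made for this example; the exact wording of error messages differs between kernels and drivers, so adapt the patterns to what your own logs contain.

```shell
#!/bin/bash
# scan_disk_errors LOGFILE...
# Print log lines that look like read/write trouble.
# The pattern list is an illustrative assumption, not a complete catalogue.
scan_disk_errors() {
    # -i ignores case, -E enables the alternation used in the pattern.
    grep -i -E -e 'I/O error|read error|write error|CRC error' "$@"
}
```

You could run it against one of the files in /var/log/, for example scan_disk_errors /var/log/messages (the file name varies per distribution), and look into any output it produces.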
To discover lurking hardware errors, you can use a simple, yet effective test: create a small script that compiles the kernel in an endless loop, for example:
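#!/bin/bash
#
# adapted from http://www.bitwizard.nl/sig11
#
cd /usr/src/linux || exit 1

c=0
while true
do
    # start every iteration from a clean tree
    make clean > /dev/null 2>&1
    # keep going after errors (-k) and save the build output of this run
    make -k bzImage > log.${c} 2> /dev/null
    c=$((c + 1))
done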
Every iteration of this loop should create a log file, and all log files should have exactly the same content. You could use sum or md5sum to check this. Differences between the log files often indicate that a hardware problem exists.
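The comparison of the log files can be sketched as follows. The function name count_distinct_logs is hypothetical; the default directory is the one the loop above writes its log.N files to.

```shell
#!/bin/bash
# count_distinct_logs [DIR]
# Print how many different checksums occur among the DIR/log.* files.
# count_distinct_logs is a hypothetical helper name.
count_distinct_logs() {
    local dir=${1:-/usr/src/linux}
    # md5sum prints "<checksum>  <file>"; keep the checksum column only,
    # reduce it to the unique values and count those.
    md5sum "$dir"/log.* | awk '{ print $1 }' | sort -u | grep -c .
}
```

A result of 1 means all logs are identical. Any higher number means at least one build run produced different output, which is the signal the test is designed to catch.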
If your system does not boot (anymore) you want to retrieve basic system functionality: your system should boot and the kernel should load. After that, you often are able to resolve the other problems using the rich set of debugging tools Linux offers.
There are a number of common boot problems. One group relates to the MBR and bootstrap files. Data corruption or accidental deletion of files or the boot partition will prevent your drive from booting. Another group clearly relates to hardware failures on the boot-drive.
hardware problems. If a hardware component failure causes the system to refuse to boot you typically get one of the following clues: the BIOS reports errors like “No Fixed Disk Found” or “Disk Controller Error”. Sometimes, numerical messages are displayed, often in the 1700 range, e.g. “1701”, “1791” etc. If the controller is in error, a message that indicates this is often shown. In the most obvious case, the disk does not even start spinning.
Start by ensuring that your BIOS was set up correctly. The disk geometry needs to be set correctly to enable your BIOS to see the drive. Often, a Plug And Play option can be set in the BIOS. Set the option so that PNP is deactivated. If an additional option “Reset Configuration Data” or “Update ESCD” is offered, please set it to “yes” or alternatively to “enabled”.
If you have problems with a drive you just added to the system, the problems are often caused by incorrect cabling or jumper selections. Make sure all connectors are properly seated. Ensure that IDE master and slave drives are jumpered and cabled properly. UDMA/66 and faster drives need an 80-conductor cable instead of the older 40-conductor one. Your BIOS needs to be able to access the disk during boot, so make sure your drive geometry and/or type has been correctly specified in your BIOS.
You should verify the seating of the connectors, check the cables between disk and motherboard. Check for corrosion. Also check the power cables. You can remove the cables and measure them, using a simple ohm meter or buzzer, and/or replace them with new ones. A “hard disk controller” error message can often be caused by bad cabling too. Always check your cabling first, even when the symptoms seem to indicate a controller error. If the error persists try to replace the controller card.
software problems. If a software failure causes the system to refuse to boot, you typically get one of the following clues: LILO does not display all four of its letters, LILO hangs, you get a screen with scrolling error codes (e.g. “010101..”), the system reports something like “drive not bootable, insert system disk”, or the system boots a different operating system than intended.
If you are able to boot from a floppy you can use a boot floppy or rescue disk to boot a kernel (the section called “Why we need bootdisks”). You could use a rescue floppy that boots a kernel that has the root filesystem set to your hard disk (using the rdev command). Or you could use a special tiny Linux distribution, like tomsrtbt (http://www.toms.net/rb/), and mount the root filesystem by hand. Make sure the rescue disk contains the necessary functionality/modules to fit your hardware, e.g. if you have SCSI disks, be sure the drivers are either compiled into your boot kernel or available as modules.
Once the boot floppy has booted, start by checking the validity of your MBR. When the LILO bootup messages indicate a geometry error or a related error, you should verify that /sbin/lilo ran correctly. If you are in doubt, you can rerun it, provided you have access to (a backup copy of) the lilo.conf file. Also check whether the error is caused by the 1024 cylinder boundary problem: your kernel(s) and related files must be accessible to your BIOS, and some BIOSes (as a rule of thumb: those made before 1998) are not able to access disk cylinders above 1024.
Next, verify that the partition table in the MBR is correct by running fdisk. Verify that at least one partition is marked as “boot” or “active”: some BIOSes require the Linux boot partition to be marked “bootable”, and some distributions may require this too. If the partition table is incorrect, you will need to repair it. If you have a backup of your boot system's MBR (including the partition table, hence: the first 512 bytes of your hard disk) and have put it on a floppy, now is the time to restore it. Alternatively, you may have printed the partition list; if so, you can use fdisk to recreate the partition table.
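Making such a backup of the first 512 bytes can be sketched with dd, as shown below. The function names backup_mbr and has_boot_signature are hypothetical. The check relies on the convention that a valid MBR ends with the two bytes 0x55 0xAA.

```shell
#!/bin/bash
# backup_mbr DISK OUTFILE
# Copy the first 512-byte sector (boot code plus partition table) of DISK.
# backup_mbr and has_boot_signature are hypothetical helper names.
backup_mbr() {
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# has_boot_signature FILE
# Succeed if bytes 511 and 512 of FILE hold the 0x55 0xAA boot signature.
has_boot_signature() {
    local sig
    # Read two bytes at offset 510 and print them as plain hex digits.
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

As root you might call backup_mbr /dev/hda /mnt/floppy/mbr.bin, where both the disk name and the mount point are examples only. Restoring swaps the roles of the two files in the dd command; since that overwrites the partition table, double-check the target device before you run it.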
If the partition table looks correct when you print it in fdisk, the next step is to take a look at the root filesystem of your drive. You can use the fsck command to repair filesystem problems on your root filesystem. If all else fails, you will have to resort to repartitioning and/or reformatting your disk and restoring the latest backup, or you may even need to replace the disk.
If the initial boot sequence could be completed, the kernel is loaded and tries to mount its root filesystem. This can either be a RAMDISK (initrd) or a partition on a hard disk. Of course, the kernel needs to be able to access (all of) its memory and the filesystem on the disk, which may require certain device drivers in the kernel. When the root filesystem could be mounted additional programs can be executed and additional kernel modules may be loaded to add functionality. If the root filesystem is a RAM disk, it may contain programs to load the modules needed to address other hardware devices such as disks. In these phases you could experience module loading problems, which can result in (partial) hardware inaccessibility.
In many cases it is not clear whether or not the cause of a problem lies with the hardware or the software. However, most of these errors result from invalid configuration.
hardware problems. If a hardware problem causes the system to refuse to boot, you typically get one of the following clues: “PANIC VFS unable to mount root fs on ##:##”, modprobe reports errors loading a module, IRQ/DMA conflicts are reported, parts of your hardware do not work, or intermittent errors occur. As a rule of thumb, if your kernel boots but your system subsequently hangs or issues error messages, you should check for software problems and configuration problems first. If this does not resolve the problem, check your hardware (the section called “Generic issues with hardware problems”).
software problems. If a software problem causes the system to refuse to boot, you typically also get one of the clues we listed under “hardware errors”: “PANIC VFS unable to mount root fs on ##:##”, modprobe reports errors loading a module, IRQ/DMA conflicts can be reported.
The “PANIC VFS unable to mount...” message occurs when the kernel could be
loaded, but either the ramdisk or the physical partition could not be mounted.
This can be the result of forgetting to run /sbin/lilo
after a kernel change or update, or forgetting to run rdev
in case you put the kernel image directly into the boot sectors of a partition.
The PANIC message is sometimes caused by inaccessibility of (parts of) your
system's memory, for example when it tries to mount the RAM-disk as its root
filesystem. Older kernels may require you to set the amount of RAM by rebooting
and setting the boot parameter:
mem=<size-of-memory-in-Kbytes>K
In order to recognize memory above 64 Meg, it may be necessary to append the
“mem=” option to the kernel command line permanently. If you are using LILO
as your boot loader, you would do this in the lilo.conf
file. For example, if you had a machine with 128 Meg you would type:
append="mem=131072K"
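The value in this example is simply the memory size in megabytes multiplied by 1024. A small sketch of that arithmetic, using a hypothetical helper named mem_append:

```shell
#!/bin/bash
# mem_append MEGABYTES
# Print a lilo.conf append line for the given amount of RAM.
# mem_append is a hypothetical helper name.
mem_append() {
    # 1 megabyte = 1024 kilobytes; the K suffix tells the kernel the unit.
    echo "append=\"mem=$(( $1 * 1024 ))K\""
}
```

Calling mem_append 128 prints the line used in the example above. Remember to rerun /sbin/lilo after changing lilo.conf.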
To help you determine which device the kernel tries to mount, the “PANIC” message contains the major and minor number of the device, e.g. 9:0 for /dev/md0 (the first virtual RAID disk, the section called “Software RAID”) or 8:3 for /dev/sda3, the third partition on the first SCSI disk. This can pinpoint the area where you should start your search for configuration errors. For example: if this message points to the /dev/md0 device, you should check if the kernel was compiled with support for software RAID; the same goes for SCSI disks.
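On a system that does boot (for example from a rescue disk), you can translate such a major:minor pair into a device name using the table the kernel exposes in /proc/partitions. The sketch below assumes the usual four-column layout of that file (major, minor, blocks, name); device_for is a hypothetical helper name. Some kernels print the numbers in the panic message in hexadecimal, in which case you need to convert them to decimal first.

```shell
#!/bin/bash
# device_for MAJOR MINOR [TABLE]
# Print the device node that belongs to a decimal major/minor pair.
# TABLE defaults to /proc/partitions; device_for is a hypothetical name.
device_for() {
    local table=${3:-/proc/partitions}
    # Only look at lines that start with a number, which skips the
    # header and the blank line, then compare both columns numerically.
    awk -v maj="$1" -v min="$2" \
        '$1 ~ /^[0-9]+$/ && $1 == maj && $2 == min { print "/dev/" $4 }' "$table"
}
```

For the example in the text, device_for 8 3 would print /dev/sda3 on a machine that has such a partition. No output means the running kernel does not know a block device with those numbers.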
In most cases, you should have your system booted by now. It may be that you still see numerous errors, e.g. daemons won't start, or your network card or sound card does not work. In the next sections we will describe tools that aid you with troubleshooting and give hints and tips on how to resolve more problems.
IRQ conflicts are a common source of problems. You can use dmesg
to see which interrupts were requested by the drivers and compare this with the
contents of /proc/interrupts or the output of lsdev
to detect conflicts. The IRQ a PCI device uses is also reported in
/proc/pci or by the lspci program.
Sometimes a card could not obtain an IRQ since the BIOS assigned all of them to non-PCI
(ISA) cards. Check your BIOS settings if you suspect this to be the case.
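As a starting point for such a comparison, the sketch below lists the lines of /proc/interrupts on which more than one driver is registered for the same IRQ. It assumes that the file separates the names of drivers sharing an IRQ with a comma and a space; shared_irqs is a hypothetical helper name. A shared IRQ is not necessarily a conflict, but these are the lines to look at first.

```shell
#!/bin/bash
# shared_irqs [FILE]
# Print the interrupt lines on which several drivers share one IRQ.
# FILE defaults to /proc/interrupts; shared_irqs is a hypothetical name.
shared_irqs() {
    # Numbered IRQ lines look like " 5:  1000  XT-PIC  eth0, es1371";
    # a ", " in the line means a second driver is registered.
    awk '/^ *[0-9]+:/ && /, / { print }' "${1:-/proc/interrupts}"
}
```

Run without arguments it inspects the live system. An empty result means no IRQ is currently shared between registered drivers.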
Under some conditions IRQs can be shared between two devices. Devices on the PCI bus may share the same IRQ with other devices on the PCI bus, provided the driver software supports this. In other cases where there is potential for conflict, there should be no problem as long as no two devices with the same IRQ are ever in use at the same time. Even if devices with conflicting IRQs are used simultaneously, one of them will likely have its interrupts caught by its device driver and may work. The other device(s) will likely behave as if they were configured with the wrong interrupts.
Linux has a rich set of tools that aid you in troubleshooting. An overview of the most commonly used commands and their functionality follows:
lspci.
this command displays information about all PCI buses in the
system and all devices connected to them. The command is
described in more detail in the section called “Querying your PCI bus”.
dmesg.
the kernel logs messages into a ring buffer. dmesg
dumps the contents of the ring buffer to standard output. Often,
the dmesg command is issued at the end of the boot sequence,
for example in one of the start up scripts, to dump the bootup messages
into a file (e.g. boot.messages).
lsdev.
is a front end to the /proc filesystem and
prints out information about interrupts, I/O ports and dma settings.
This gives an overview of which hardware uses what IO addresses,
IRQ's and DMA channels and can aid in determining conflicts.
lsmod.
a program that displays the modules currently in use by the
kernel. Name, size, use count and a list of referring
modules are displayed. The information displayed is identical
to that in /proc/modules. lsmod
is frequently used to check if the proper modules could be loaded.
modprobe.
a high level interface to insmod. Often, modules
depend on each other and/or need to be loaded in a certain order.
modprobe is used to make this easier for
system administrators. It uses a dependency file, which is
created by depmod, to load modules in the
right order from certain specified locations. The normal use of
depmod is to include it somewhere in the
rc-files in /etc/rc.d, so that the
correct module dependencies will be available immediately
after booting the system. The configuration file
/etc/modules.conf can be used to steer
the behavior of depmod and modprobe.
modprobe will unload all modules
in a dependency chain if one of them fails to load. See also
the section called “Kernel Components (201.1)”.
insmod.
installs a loadable module in the running kernel. It tries to
do this by resolving all symbols from the kernel's exported
symbol table. You can specify the (object) file name.
If the file name is given without extension, insmod
will search for the module in common default directories. These
default locations can be overridden by the contents of an environment
variable (MODPATH) or in the configuration file
/etc/modules.conf.
uname.
displays machine type, network hostname, OS release,
OS name, OS version and processor type of the host.
/proc filesystem.
is a direct reflection of the system in memory presented to you
as files and directories. It provides an easy way to view kernel information
and information about currently running processes. In Linux some commands read
/proc directly to get information about the state of the
system. It allows you to view statistical information, hardware information,
network and host parameters, memory and performance information and lets you
modify some parameters at run time by writing values to it.
strace.
a very handy diagnostic tool. It runs a program or connects to a
running process (e.g. a daemon) and intercepts the system calls
made by it and the signals it receives. By default it reports the name
of the system call, its arguments and the return value on
standard error. It is very useful in cases where you do not
have access to the source code and also serves as a tool to
gain a better understanding of the inner workings of
certain programs. The program to be traced need not be recompiled
for this.
ltrace.
similar to strace, but instead of recording system
calls it runs the specified command and intercepts and records the
dynamic library calls which are called by the executed process and the
signals which are received by that process. It can also intercept and
print the system calls executed by the program. The program to be traced
need not be recompiled for this, so you can use it on binaries for which
you don't have the source handy.
strings.
can print out strings - sequences of printable characters -
that are hidden within non-text files, such as executables. Often
used to check for names of environment variables and configuration
files used by an executable.
fuser.
accepts a filename and displays the PIDs of processes using the
specified files or filesystems. Comes in handy if you want to know
which process uses a certain file, for example: if you are not able
to unmount a filesystem this often is caused by a process that still
uses a file on that filesystem. fuser can be used
to find the PID(s) of the process(es). A subsequent
ps -p $PID will name the process.
lsof.
by default lists all open files belonging to all active
processes. Since Unix uses the file metaphor for devices too,
an open file can be a regular file, a directory, a block special
file, a character special file, an executing text reference, a
library, a stream or a network file (Internet socket, NFS file or
UNIX domain socket). The utility can be used to see which processes
use which resources.
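Several of these tools read their information from /proc, so a script can do the same. The sketch below checks whether a module is loaded by looking at the first column of /proc/modules, which is also what lsmod displays. The function name module_loaded is hypothetical.

```shell
#!/bin/bash
# module_loaded NAME [FILE]
# Succeed if a module called NAME is listed in FILE.
# FILE defaults to /proc/modules; module_loaded is a hypothetical name.
module_loaded() {
    # The module name is the first field of every line.
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}
```

A typical use would be module_loaded 3c59x || modprobe 3c59x, where 3c59x is just an example module name. Testing first keeps the output of a boot script quiet when the module is already present.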