General troubleshooting (2.214.5)

Revision: $Revision: 1.10 $

A candidate should be able to recognize and identify boot loader and kernel specific stages and utilize kernel boot messages to diagnose kernel errors. This objective includes being able to identify and correct common hardware issues, and be able to determine if the problem is hardware or software.

Key files, terms and utilities include:

screen output during bootup
/bin/dmesg
kernel syslog entries in system logs (if entry is able to be gained)
various system and daemon log files in /var/log/
/sbin/lspci
/usr/bin/lsdev
/sbin/lsmod
/sbin/modprobe
/sbin/insmod
/bin/uname
location of system kernel and attending modules /, /boot, and /lib/modules
/proc filesystem
strace
strings
ltrace
lsof

Resources: the man pages for the various commands, Vasudevan02.

A word of caution

Debugging boot problems (or any other problems for that matter) can be complex at times. Your best teacher is experience. However, by carefully studying the boot messages and understanding the mechanisms at hand, you should be able to solve most, if not all, common Linux problems.

You need a good working knowledge of the boot process to be able to solve boot problems. In previous sections the boot process was described in detail as was system initialization. By now you should be able to determine in what stage the boot process is from the messages displayed on the screen. You should be able to utilize kernel boot messages to diagnose kernel errors. If not, please re-read the previous sections carefully. We will provide a number of suggestions on how to solve common problems in the next sections. However, it is beyond the scope of this book to cover all possible permutations. If your problem is not covered here, search the web for peers with the same problem, consult your colleagues, read more documentation, investigate and experiment.

Cost effectiveness

In an ideal world you always have plenty of time to find out the exact nature of the problem and consequently solve it. In the real world you will have to be aware of cost effectiveness. Look for the most (cost-) effective way to solve the problem. For example: let's say that initial investigation indicates that the boot disk has hardware problems. You do not know yet the exact nature of the problems, but have eliminated the most common causes. The disk still does not work reliably. Hence, you need more time to investigate further. If your customer has a recent backup and, after some deliberation, agrees to go back to that state, you might suggest installing a brand new disk and restoring the most recent backup instead. The total cost of your time so far plus the new hardware is probably less than the cost of investigating any further - even more so if you consider the probability that you need to replace the disk after all.

Getting help

This book is not an omniscient encyclopedia that describes solutions for all problems you may encounter. If you get stuck, there are many ways to get help. First, you could call or mail the distributor of your Linux version. Often you are granted installation support or 30 day generic support. It may be worth your money to subscribe to a support network - most distributions offer such a service at nominal fees.

Some distributions grant on-line (Internet) access to support databases. In these databases you can find a wealth of information on hardware and software problems. You can often search for keywords or error messages and most of them are cross-referenced to enable you to find related topics. Some URLs you may check follow:

You can also grep or zgrep for (parts of) an error message or keyword in the documentation on your system, typically in and under /usr/doc/, /usr/share/doc/ or /usr/src/linux/Documentation. Additionally, you could try to enter the error message in an Internet search engine, for example http://www.google.com. This often returns URLs to FAQs, HOWTOs and other documents that may contain clues on how to solve your problem. Usenet news archives, like http://www.deja.com, are another resource to use.

Generic issues with hardware problems

Often, Linux refuses to boot due to hardware problems. If a system boots and seems to be working fine, it still may have a hardware problem. If it is not working with Linux, but is working fine with other operating systems, for example DOS or Windows, this too often signifies hardware problems. Linux assumes the hardware to be up to specifications and will try to use it to its (specified) limits.

In general: regularly check the system log files for write and read errors. They indicate that the hardware is slowly becoming less reliable. Other indications of lurking hardware problems are: problems when accessing the CDROM (halt, long delays, bus errors, segmentation faults), kernel generation or compilation of other programs aborts with signal 11 or signal 7, scrambled or incorrect file contents, memory access errors, graphics that are not displayed correctly, CRC errors when accessing the floppy disk drive, crashes or halts during boot and errors when creating a filesystem.

To discover lurking hardware errors, you can use a simple, yet effective test: create a small script that compiles the kernel in an endless loop, for example:

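#!/bin/sh
#
# adapted from http://www.bitwizard.nl/sig11
#
# stop if the kernel source tree is missing
cd /usr/src/linux || exit 1
#
c=0
while true
do
   # start every iteration from a clean tree
   make clean > /dev/null 2>&1
   # keep only standard output, one log file per iteration
   make -k bzImage > log.${c} 2> /dev/null
   c=$((c + 1))
done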

Every iteration of this loop should create a log file - all log files should have the exact same content. You could use sum or md5sum to check this. If you detect differences between the log files this often is an indication some hardware problem exists.
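The comparison described above can be sketched as a small helper. The function name count_distinct_logs is hypothetical; the sketch assumes md5sum, awk and sort are available, as they are on most Linux systems.

```shell
# Print the number of distinct md5 checksums among the given files.
# A result of 1 means that all files have identical content.
count_distinct_logs() {
    md5sum "$@" | awk '{ print $1 }' | sort -u | wc -l
}

# Example, to be run in /usr/src/linux after a few iterations of the loop:
#   count_distinct_logs log.*
```

Any result larger than 1 means that at least one compile run produced different output and the machine deserves a closer look.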

Resolving initial boot problems

If your system does not boot (anymore) you want to retrieve basic system functionality: your system should boot and the kernel should load. After that, you often are able to resolve the other problems using the rich set of debugging tools Linux offers.

There are a number of common boot problems. One group relates to the MBR and bootstrap files. Data corruption or accidental deletion of files or the boot partition will prevent your drive from booting. Another group clearly relates to hardware failures on the boot-drive.

hardware problems.  If a hardware component failure causes the system to refuse to boot you typically get one of the following clues: the BIOS reports errors like No Fixed Disk Found or Disk Controller Error. Sometimes, numerical messages are displayed, often in the 1700 range, e.g. 1701, 1791 etc. If the controller is in error, a message that indicates this is often shown. In the most obvious case, the disk does not even start spinning.

Start by ensuring that your BIOS was set up correctly. The disk geometry needs to be set correctly to enable your BIOS to see the drive. Often, a Plug And Play option can be set in the BIOS. Set the option so that PNP is deactivated. If an additional option Reset Configuration Data or Update ESCD is offered, please set it to yes or alternatively to enabled.

If you have problems with a drive you just added to the system, the problems are often caused by incorrect cabling or jumper selections. Make sure all connectors are properly seated. Ensure that IDE master and slave drives are jumpered and cabled properly. UDMA drives (UDMA/66 and faster) need a special 80-conductor cable. Your BIOS needs to be able to access the disk during boot, so make sure your drive geometry and/or type has been correctly specified in your BIOS.

You should verify the seating of the connectors, check the cables between disk and motherboard. Check for corrosion. Also check the power cables. You can remove the cables and measure them, using a simple ohm meter or buzzer, and/or replace them with new ones. A hard disk controller error message can often be caused by bad cabling too. Always check your cabling first, even when the symptoms seem to indicate a controller error. If the error persists try to replace the controller card.

software problems.  If a software failure causes the system to refuse to boot, you typically get one of the following clues: LILO does not display all of its four letters, LILO hangs, you get a screen with scrolling error codes, e.g. 010101.., the system reports something like drive not bootable, insert system disk, or the system boots a different operating system than intended.

If you are able to boot from a floppy you can use a boot floppy or rescue disk to boot a kernel (the section called “Why we need bootdisks”). You could use a rescue floppy that boots a kernel that has the root filesystem set to your hard disk (using the rdev command). Or you could use a special tiny Linux distribution, like tomsrtbt (http://www.toms.net/rb/ ) and mount the root filesystem by hand. Make sure the rescue disk contains the necessary functionality/modules to fit your hardware, e.g. if you have SCSI disks, be sure the drivers are either compiled into your boot kernel or available as modules. Once the boot floppy has booted you should start by checking the validity of your MBR. When the LILO bootup messages indicate a geometry error or a related error, you should verify that /sbin/lilo ran correctly. If you are in doubt, you can rerun it, provided you have access to (a backup copy of) the lilo.conf file. Also check if the error is caused by the 1024 cylinder boundary problem: your kernel(s) and related files should be accessible for your BIOS, and some BIOSes (as a rule of thumb: made before 1998) are not able to access disk cylinders above 1024.

Next, verify that the partition table in the MBR is correct by running fdisk. Verify that at least one partition is marked as boot or active - some BIOSes require the Linux boot partition to be marked bootable, some distributions may require this too. If the partition table is incorrect, you will need to repair it. If you have access to a backup of your boot system's MBR (including the partition table, hence: the first 512 bytes of your hard disk) and have put it on a floppy, now is the time to recover it. Alternatively, you may have printed the partition list; if so you can use fdisk to restore the partition table.
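Such a backup of the first 512 bytes can be made with dd. The following is a minimal sketch; the function name backup_mbr and the device and file names in the comments are assumptions and need to be adapted to your system.

```shell
# Copy the first 512 bytes (boot code plus partition table) of a disk
# to a backup file.
#   $1: disk device, for example /dev/hda
#   $2: backup file, for example a file on a mounted floppy
backup_mbr() {
    dd if="$1" of="$2" bs=512 count=1 2> /dev/null
}

# Example:
#   backup_mbr /dev/hda /mnt/floppy/mbr.bak
# Restoring works the other way around and overwrites the MBR, so
# double check the target device before running it:
#   dd if=/mnt/floppy/mbr.bak of=/dev/hda bs=512 count=1
```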

If the partition table looks correct when you print it in fdisk, the next step is to take a look at the root filesystem of your drive. You can use the fsck command to repair filesystem problems on your root filesystem. If all else fails, you have to resort to reformatting and/or repartitioning your disk and restoring the latest backup, or you may even need to replace the disk.

Resolving kernel boot problems

If the initial boot sequence could be completed, the kernel is loaded and tries to mount its root filesystem. This can either be a RAMDISK (initrd) or a partition on a hard disk. Of course, the kernel needs to be able to access (all of) its memory and the filesystem on the disk, which may require certain device drivers in the kernel. Once the root filesystem has been mounted, additional programs can be executed and additional kernel modules may be loaded to add functionality. If the root filesystem is a RAM disk, it may contain programs to load the modules needed to address other hardware devices such as disks. In these phases you could experience module loading problems, which can result in (partial) hardware inaccessibility.

In many cases it is not clear whether or not the cause of a problem lies with the hardware or the software. However, most of these errors result from invalid configuration.

hardware problems.  If a hardware problem causes the system to refuse to boot, you typically get one of the following clues: PANIC VFS unable to mount root fs on ##:##, modprobe reports errors loading a module, IRQ/DMA conflicts can be reported, parts of your hardware do not work or intermittent errors occur. As a rule of thumb, if your kernel boots, but your system subsequently hangs or issues error messages, you should check for software problems and configuration problems first. If this does not resolve the problem, check your hardware (the section called “Generic issues with hardware problems”).

software problems.  If a software problem causes the system to refuse to boot, you typically also get one of the clues we listed under hardware errors: PANIC VFS unable to mount root fs on ##:##, modprobe reports errors loading a module, IRQ/DMA conflicts can be reported.

The PANIC VFS unable to mount... message occurs when the kernel could be loaded, but either the ramdisk or the physical partition could not be mounted. This can be the result of forgetting to run /sbin/lilo after a kernel change or update, or forgetting to run rdev in case you put the kernel image directly into the boot sectors of a partition. The PANIC message is sometimes caused by inaccessibility of (parts of) your system's memory, for example when the kernel tries to mount the RAM disk as its root filesystem. Older kernels may require you to set the amount of RAM by rebooting and setting the boot parameter:

mem=<size-of-memory-in-Kbytes>

In order to recognize memory above 64 Meg, it may be necessary to append the mem= option to the kernel command line permanently. If you are using LILO for your boot loader, you would do this in the lilo.conf file. For example, if you had a machine with 128 Meg you would type:

append="mem=131072K"
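The value in this line is simply the amount of memory in megabytes multiplied by 1024. As a small sketch, the hypothetical helper mem_append_line below prints the lilo.conf line for a given amount of RAM:

```shell
# Print the lilo.conf append line for a machine with $1 megabytes of RAM.
mem_append_line() {
    echo "append=\"mem=$(( $1 * 1024 ))K\""
}

# 128 Meg * 1024 = 131072 Kbytes
mem_append_line 128
# prints: append="mem=131072K"
```

Remember that /sbin/lilo needs to be rerun after every change to lilo.conf.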

To help you to determine what device the kernel tries to mount, the PANIC message contains the major and minor number of the device, e.g. 9:0 for /dev/md0 (the first virtual RAID disk, the section called “Software RAID”) or 8:3 for /dev/sda3, the third partition on the first SCSI disk. This can pinpoint the area where you should start your search for configuration errors. For example: if this message points to the /dev/md0 device, you should check if the kernel was compiled with support for software RAID, ditto for SCSI disks.
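For SCSI disks (major number 8) the minor number can be translated by hand: every disk owns 16 consecutive minor numbers, the first for the whole disk and the rest for its partitions. The hypothetical helper below sketches this translation for minor numbers 0 to 255; it is an illustration, not an existing utility.

```shell
# Translate a minor number of major 8 (SCSI disk) into a device name.
#   $1: minor number, 0 to 255
decode_sd_minor() {
    minor=$1
    disk=$(( minor / 16 ))   # 0 = sda, 1 = sdb, ...
    part=$(( minor % 16 ))   # 0 = whole disk, 1..15 = partition number
    letter=$(echo "abcdefghijklmnop" | cut -c $(( disk + 1 )))
    if [ "$part" -eq 0 ]; then
        echo "/dev/sd${letter}"
    else
        echo "/dev/sd${letter}${part}"
    fi
}

decode_sd_minor 3    # prints /dev/sda3
decode_sd_minor 17   # prints /dev/sdb1
```

On a running system, ls -l on the device file shows the same major and minor numbers, which lets you verify the result.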

In most cases, you should have your system booted by now. You may still see numerous errors, e.g. daemons won't start, or your network card or sound card does not work. In the next sections we will describe tools that aid you with troubleshooting and give hints and tips on how to resolve more problems.

Resolving IRQ/DMA conflicts

IRQ conflicts are a common source of problems. You can use dmesg to see which interrupts were requested by the drivers and compare this with the contents of /proc/interrupts or the output of lsdev to determine conflicts. The IRQ a PCI device uses is also reported in /proc/pci or by using the lspci program. Sometimes a card could not obtain an IRQ since the BIOS assigned all of them to non-PCI (ISA) cards. Check your BIOS settings if you suspect this to be the case.

Under some conditions IRQs can be shared between two devices. Devices on the PCI bus may share the same IRQ with other devices on the PCI bus, provided the driver software supports this. In other cases where there is potential for conflict, there should be no problem if no two devices with the same IRQ are ever in use at the same time. Even if devices with conflicting IRQs are used simultaneously, one of them will likely have its interrupts caught by its device driver and may work. The other device(s) will likely behave as if they were configured with the wrong interrupts.
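When several drivers have registered for the same IRQ, /proc/interrupts lists their names on one line, separated by commas. The sketch below relies on that layout to show only the shared lines; the function name shared_irqs is hypothetical and the exact column layout differs between kernel versions, so treat the output as a starting point.

```shell
# Print the lines of an interrupt table that list more than one device.
#   $1: file in /proc/interrupts format (defaults to /proc/interrupts)
shared_irqs() {
    grep ',' "${1:-/proc/interrupts}"
}

# Example:
#   shared_irqs
```

A shared line is not an error in itself. It only tells you which devices to look at first when one of them misbehaves.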

Troubleshooting tools

Linux has a rich set of tools that aid you in troubleshooting. An overview of the most commonly used commands and their functionality follows:

lspci this command displays information about all PCI buses in the system and all devices connected to them. The command is described in more detail in the section called “Querying your PCI bus”.

dmesg the kernel logs messages into a ring buffer. dmesg dumps the contents of the ring buffer to standard output. Often, the dmesg command is issued at the end of the boot sequence, for example in one of the startup scripts, to dump the bootup messages into a file (e.g. boot.messages).

lsdev is a front end to the /proc filesystem and prints out information about interrupts, I/O ports and DMA settings. This gives an overview of which hardware uses which I/O addresses, IRQs and DMA channels and can aid in determining conflicts.

lsmod a program that displays the modules currently in use by the kernel. Name, size, use count, and a list of referring modules are displayed. The information displayed is identical to that in /proc/modules. lsmod is frequently used to check if the proper modules could be loaded.
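Since the module name is the first field on every line of /proc/modules, a check for a single module is easy to script. The following is a minimal sketch; the function name module_loaded is hypothetical.

```shell
# Succeed (exit status 0) if the named module is listed as loaded.
#   $1: module name
#   $2: file in /proc/modules format (defaults to /proc/modules)
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
        "${2:-/proc/modules}"
}

# Example:
#   module_loaded ext3 || echo "ext3 is not loaded"
```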

modprobe a high level interface to insmod. Often, modules depend on each other and/or need to be loaded in a certain order. modprobe is used to make this easier for system administrators. It uses a dependency file, which is created by depmod, to load modules in the right order from certain specified locations. The normal use of depmod is to include it somewhere in the rc-files in /etc/rc.d, so that the correct module dependencies will be available immediately after booting the system. The configuration file /etc/modules.conf can be used to steer the behavior of depmod and modprobe. modprobe will unload all modules in a dependency chain if one of them fails to load. See also the section called “Kernel Components (201.1)”.

insmod installs a loadable module in the running kernel. It tries to do this by resolving all symbols from the kernel's exported symbol table. You can specify the (object) file name. If the file name is given without extension, insmod will search for the module in common default directories. These default locations can be overridden by the contents of an environment variable (MODPATH) or in the configuration file /etc/modules.conf.

uname displays machine type, network hostname, OS release, OS name, OS version and processor type of the host.

/proc filesystem.  is a direct reflection of the system in memory presented to you as files and directories. It provides an easy way to view kernel information and information about currently running processes. In Linux some commands read /proc directly to get information about the state of the system. It allows you to view statistical information, hardware information, network and host parameters, memory and performance information and lets you modify some parameters at runtime by writing values to it.

strace a very handy diagnostic tool. It runs a program or connects to a running process (e.g. a daemon) and intercepts the system calls made by it and the signals it receives. By default it reports the name of the system call, its arguments and the return value on standard error. It is very useful in cases where you do not have access to the source code and also serves as a tool to gain a better understanding of the inner workings of certain programs. The program to be traced need not be recompiled for this.

ltrace similar to strace, but instead of recording system calls it runs the specified command and intercepts and records the dynamic library calls which are called by the executed process and the signals which are received by that process. It can also intercept and print the system calls executed by the program. The program to be traced need not be recompiled for this, so you can use it on binaries for which you don't have the source handy.

strings can print out strings - sequences of printable characters - that are hidden within non-text files, such as executables. Often used to check for names of environment variables and configuration files used by an executable.

fuser accepts a filename and displays the PIDs of processes using the specified files or filesystems. Comes in handy if you want to know which process uses a certain file, for example: if you are not able to unmount a filesystem, this is often caused by a process that still uses a file on that filesystem. fuser can be used to find the PID(s) of the process(es). A subsequent ps -p $PID will name the process.

lsof by default lists all open files belonging to all active processes. Since Unix also uses the file metaphor for devices, an open file can be a regular file, a directory, a block special file, a character special file, an executing text reference, a library, a stream or a network file (Internet socket, NFS file or UNIX domain socket). The utility can be used to see which processes use which resources.

Copyright Snow B.V. The Netherlands