Enable kdump Over Ethernet¶
Overview¶
kdump is an feature of the Linux kernel that allows the system to be booted from the context of another kernel. This second kernel reserves a small amount of memory and its only purpose is to capture the core dump in the event of a kernel crash. The ability to analyze the core dump helps to determine causes of system failures.
xCAT Interface¶
The following attributes of an osimage should be modified to enable kdump:
- pkglist
- exlist
- postinstall
- dump
- crashkernelsize
- postscripts
Configure the pkglist file¶
The pkglist for the osimage needs to include the appropriate RPMs. The following list of RPMs are provided as a sample, always refer to the Operating System specific documentataion to ensure the required packages are there for kdump support.
[RHELS]
kexec-tools crash
[SLES]
kdump kexec-tools makedumpfile
[Ubuntu]
<TODO>
Modify the exlist file¶
The default diskless image created by copycds excludes the /boot directory in the exclude list file, but this is required for kdump.
Update the exlist for the target osimage and remove the line /boot:
./boot* # <-- remove this line
Run packimage to update the diskless image with the changes.
The postinstall file¶
The kdump will create a new initrd which used in the dumping stage. The /tmp or /var/tmp directory will be used as the temporary directory. These 2 directory only are allocated 10M space by default. You need to enlarge it to 200M. Modify the postinstall file to increase /tmp space.
[RHELS]
tmpfs /var/tmp tmpfs defaults,size=200m 0 2
[SLES11]
tmpfs /tmp tmpfs defaults,size=200m 0 2
[Ubuntu]
<TODO>
The dump attribute¶
To support kernel dumps, the dump attribute must be set on the osimage definition. If not set, kdump service will not be enabled. The dump attribute defines the NFS remote path where the crash information is to be stored.
Use the chdef command to set a value of the dump attribute:
chdef -t osimage <image name> dump=nfs://<nfs_server_ip>/<kdump_path>
If the NFS server is the Service Node or Management Node, the server can be left out:
chdef -t osimage <image name> dump=nfs:///<kdump_path>
Note: Only NFS is currently supported as a storage location. Make sure the NFS remote path(nfs://<nfs_server_ip>/<kdump_path>) is exported and it is read-writeable to the node where kdump service is enabled.
The crashkernelsize attribute¶
To allow the Operating System to automatically reserve the appropriate amount of memory for the kdump kernel, set crashkernelsize=auto.
For setting specific sizes, use the following example:
For System X machines, set the
crashkernelsizeusing this format:chdef -t osimage <image name> crashkernelsize=<size>M
For Power System AC922, set the
crashkernelsizeusing this format:chdef -t osimage <image name> crashkernelsize=<size>M
For System P machines, set the
crashkernelsizeusing this format:chdef -t osimage <image name> crashkernelsize=<size>@32M
Notes: the value of the crashkernelsize depends on the total physical memory size on the machine. For more about size, refer to Appedix
If kdump start error like this:
Your running kernel is using more than 70% of the amount of space you reserved for kdump, you should consider increasing your crashkernel
The crashkernelsize is not large enough, you should change the crashkernelsize larger until the error message disappear.
The enablekdump postscript¶
xCAT provides a postscript enablekdump that can be added to the Nodes to automatically start the kdump service when the node boots. Add to the nodes using the following command:
chdef -t node <node range> -p postscripts=enablekdump
Manually trigger a kernel panic on Linux¶
Normally, kernel panic() will trigger booting into capture kernel. Once the kernel panic is triggered, the node will reboot into the capture kernel, and a kernel dump (vmcore) will be automatically saved to the directory on the specified NFS server (<nfs_server_ip>).
Check your Operating System specific documentation for the path where the kernel dump is saved. For example:
[RHELS6]
<kdump_path>/var/crash/<node_ip>-<time>/
[SLES11]
<kdump_path>/<node hostname>/<date>
To trigger a dump, use the following commands:
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger
This will force the Linux kernel to crash, and the address-YYYY-MM-DD-HH:MM:SS/vmcore file should be copied to the location you set on the NFS server.
Dump Analysis¶
Once the system has returned from recovering the crash, you can analyze the kernel dump using the crash tool.
Locate the recent vmcore dump file.
Locate the kernel file for the crash server. The kernel is under
/tftpboot/xcat/netboot/<OS name="">/<ARCH>/<profile>/kernelon the managenent node.Once you have located a vmcore dump file and kernel file, call
crash:crash <vmcore_dump_file> <kernel_file>
Note: If crash cannot find any files, make sure you have the kernel-debuginfo package installed.
Appedix¶
OS Documentations on kdump configuration:
- http://www.novell.com/support/kb/doc.php?id=3374462.
- https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Deployment_Guide/s2-kdump-configuration-cli.html.
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/kernel_crash_dump_guide/sect-kdump-config-cli.
OS Documentation on dump analysis: