Tuesday, 23 December 2025

Analyze kernel core dump (vmcore) files on Rocky Linux 8.10 after a kernel panic.

This is production-grade, exactly how SRE teams do RCA.


🧠 End-to-End Flow (Rocky Linux 8.10)

Image

Image

Image

Kernel Panic
   ↓
kdump captures vmcore
   ↓
System reboots
   ↓
vmcore saved in /var/crash
   ↓
Analyze using crash + kernel-debuginfo
   ↓
Root Cause Analysis

✅ STEP 0: Confirm OS & Kernel (Baseline)

cat /etc/os-release
uname -r

Expected:

Rocky Linux 8.10
4.18.0-553.el8_10.x86_64   (example)

⚠️ Kernel version MUST match debuginfo


1️⃣ STEP 1: Confirm It Was a Kernel Panic

After the node rebooted:

journalctl -b -1 -k | tail -50

Look for:

Kernel panic - not syncing
BUG: unable to handle kernel NULL pointer dereference
watchdog: soft lockup

Check reboot reason:

last -x | grep reboot

If kernel panic is confirmed → continue.


2️⃣ STEP 2: Verify kdump Is Enabled (MANDATORY)

Check kdump service

systemctl status kdump

Expected:

Active: active (exited)

Check crashkernel parameter

cat /proc/cmdline

Must include:

crashkernel=512M

❌ If missing → vmcore will NOT be generated


3️⃣ STEP 3: Locate vmcore Files

Rocky Linux stores core dumps here:

ls -lh /var/crash/

Example:

/var/crash/127.0.0.1-2025-12-22-14:32/
 ├── vmcore
 ├── vmcore-dmesg.txt

📌 Files meaning:

  • vmcore → full memory dump (used by crash tool)

  • vmcore-dmesg.txt → kernel logs at crash time (fast RCA)


4️⃣ STEP 4: Install Required Packages (Safe in Production)

Install crash utility

yum install -y crash

Install matching kernel debuginfo

dnf debuginfo-install kernel-$(uname -r)

Verify debuginfo installed correctly

ls -lh /usr/lib/debug/lib/modules/$(uname -r)/vmlinux

Expected:

-rwxr-xr-x 1 root root 300M+ vmlinux

❌ If vmlinux is missing → analysis will fail


5️⃣ STEP 5: Start vmcore Analysis (MOST IMPORTANT)

Run:

crash \
/usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
/var/crash/*/vmcore

You will enter:

crash>

6️⃣ STEP 6: Mandatory crash Commands (DO NOT SKIP)

🔴 1. Check panic message

crash> log

Shows:

  • Panic reason

  • RIP (crashed function)

  • Kernel BUG info


🔴 2. Stack trace of crashed CPU

crash> bt

This usually directly shows the faulty module or function.


🔴 3. Stack trace of all CPUs

crash> bt -a

Use this to detect:

  • Soft lockups

  • Hung CPUs

  • Deadlocks


🔴 4. Loaded kernel modules

crash> mod

Look for:

  • NIC drivers (mlx5_core, ixgbe)

  • Storage drivers (nvme, dm_multipath)

  • Third-party modules


🔴 5. Memory status

crash> kmem -i

Checks:

  • Memory exhaustion

  • Fragmentation

  • Corruption indicators


🔴 6. Slab corruption (VERY COMMON)

crash> kmem -s

Slab corruption = bad driver / kernel bug


7️⃣ STEP 7: Identify Root Cause (How to Read Output)

Example crash output

RIP: mlx5e_napi_poll
Call Trace:
 mlx5e_poll_rx_cq
 net_rx_action
 __do_softirq

Interpretation

mlx5e_*  → Mellanox NIC driver
RX path  → Network traffic triggered

Root Cause: NIC driver kernel panic


8️⃣ STEP 8: Quick RCA Using vmcore-dmesg.txt (Fastest)

When crash tool is not available:

cat /var/crash/*/vmcore-dmesg.txt | tail -50

Look for:

Kernel panic - not syncing
RIP: function_name

🔥 Often enough for initial RCA


Common Panic Patterns (Rocky Linux 8.10)

Image

Image

Image

Pattern in OutputMeaning
mlx5_coreNIC driver issue
nvmeDisk / firmware
BUG:Kernel bug
watchdogCPU soft lockup
slab corruptionMemory overwrite
net_rx_actionNetwork flood / driver

9️⃣ STEP 9: (If Kubernetes Node) Correlate with K8s

kubectl describe node <node-name>
kubectl get events -A --sort-by=.lastTimestamp

Look for:

  • Node reboot time

  • Pod evictions

  • CNI / CSI restarts

  • High CPU / DPDK pods


🔟 STEP 10: Final RCA Template (Use This)

Incident: Kernel Panic on Worker Node
OS: Rocky Linux 8.10
Kernel: 4.18.0-553.el8_10
Time: 22-Dec-2025 14:32 IST

Root Cause:
Kernel panic caused by mlx5_core NIC driver
NULL pointer dereference during RX polling

Evidence:
- vmcore backtrace shows mlx5e_napi_poll
- vmcore-dmesg confirms RIP in NIC driver

Impact:
- Node rebooted
- Pods evicted
- 6 minutes downtime

Fix:
- Upgraded NIC firmware
- Kernel errata applied

Prevention:
- Enable reboot alerts
- Maintain kernel debuginfo cache

✅ Production Best Practices (MUST FOLLOW)

✔ Keep kdump always enabled
✔ Cache kernel-debuginfo
✔ Monitor node reboots
✔ Avoid privileged containers
✔ Keep kernel & firmware aligned
✔ Archive vmcore after RCA


No comments:

Post a Comment