Tuesday, 23 December 2025

Analyze kernel core dump (vmcore) files on Rocky Linux 8.10 after a kernel panic.

This is a production-grade workflow, the same way SRE teams perform root cause analysis (RCA).


🧠 End-to-End Flow (Rocky Linux 8.10)


Kernel Panic
   ↓
kdump captures vmcore
   ↓
System reboots
   ↓
vmcore saved in /var/crash
   ↓
Analyze using crash + kernel-debuginfo
   ↓
Root Cause Analysis

✅ STEP 0: Confirm OS & Kernel (Baseline)

cat /etc/os-release
uname -r

Expected:

Rocky Linux 8.10
4.18.0-553.el8_10.x86_64   (example)

⚠️ Kernel version MUST match debuginfo
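Note: the debuginfo must match the kernel that actually crashed, which can differ from the currently running kernel if an update was applied before the reboot. Two hedged ways to confirm which kernel produced the dump:

grep "Linux version" /var/crash/*/vmcore-dmesg.txt | head -1
crash --osrelease /var/crash/*/vmcore    # after installing crash in STEP 4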


1️⃣ STEP 1: Confirm It Was a Kernel Panic

After the node rebooted:

journalctl -b -1 -k | tail -50

Look for:

Kernel panic - not syncing
BUG: unable to handle kernel NULL pointer dereference
watchdog: soft lockup

Check reboot reason:

last -x | grep reboot

If kernel panic is confirmed → continue.


2️⃣ STEP 2: Verify kdump Is Enabled (MANDATORY)

Check kdump service

systemctl status kdump

Expected:

Active: active (exited)

Check crashkernel parameter

cat /proc/cmdline

Must include a crashkernel reservation, for example:

crashkernel=512M

❌ If missing → vmcore will NOT be generated
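If the parameter is missing, a hedged way to add it on Rocky 8 and enable kdump (512M mirrors the example above; size the reservation to your system):

grubby --update-kernel=ALL --args="crashkernel=512M"
systemctl enable --now kdump
# the reservation only takes effect after a reboot; schedule a maintenance window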


3️⃣ STEP 3: Locate vmcore Files

Rocky Linux stores core dumps here:

ls -lh /var/crash/

Example:

/var/crash/127.0.0.1-2025-12-22-14:32/
 ├── vmcore
 └── vmcore-dmesg.txt

📌 What the files mean:

  • vmcore → full memory dump (used by crash tool)

  • vmcore-dmesg.txt → kernel logs at crash time (fast RCA)


4️⃣ STEP 4: Install Required Packages (Safe in Production)

Install crash utility

yum install -y crash

Install matching kernel debuginfo

dnf debuginfo-install kernel-$(uname -r)

Verify debuginfo installed correctly

ls -lh /usr/lib/debug/lib/modules/$(uname -r)/vmlinux

Expected:

-rwxr-xr-x 1 root root 300M+ vmlinux

❌ If vmlinux is missing → analysis will fail
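If debuginfo-install cannot find a matching package, the debuginfo repositories are likely disabled. A hedged way to locate and enable them (<repo-id> is a placeholder; names vary by release and mirror configuration):

dnf repolist --all | grep -i debug
dnf config-manager --set-enabled <repo-id>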


5️⃣ STEP 5: Start vmcore Analysis (MOST IMPORTANT)

Run:

crash \
/usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
/var/crash/*/vmcore

You will enter:

crash>

6️⃣ STEP 6: Mandatory crash Commands (DO NOT SKIP)

🔴 1. Check panic message

crash> log

Shows:

  • Panic reason

  • RIP (crashed function)

  • Kernel BUG info


🔴 2. Stack trace of crashed CPU

crash> bt

This usually directly shows the faulty module or function.


🔴 3. Stack trace of all CPUs

crash> bt -a

Use this to detect:

  • Soft lockups

  • Hung CPUs

  • Deadlocks


🔴 4. Loaded kernel modules

crash> mod

Look for:

  • NIC drivers (mlx5_core, ixgbe)

  • Storage drivers (nvme, dm_multipath)

  • Third-party modules
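
crash can also flag tainted modules directly, which is often the fastest pointer to an out-of-tree or third-party driver (hedged: the -t option is available in current crash releases):

crash> mod -t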


🔴 5. Memory status

crash> kmem -i

Checks:

  • Memory exhaustion

  • Fragmentation

  • Corruption indicators


🔴 6. Slab corruption (VERY COMMON)

crash> kmem -s

Slab corruption = bad driver / kernel bug


7️⃣ STEP 7: Identify Root Cause (How to Read Output)

Example crash output

RIP: mlx5e_napi_poll
Call Trace:
 mlx5e_poll_rx_cq
 net_rx_action
 __do_softirq

Interpretation

mlx5e_*  → Mellanox NIC driver
RX path  → Network traffic triggered

Root Cause: NIC driver kernel panic
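
To turn this into actionable evidence, capture the driver and firmware versions from the affected node (the interface name eth0 is illustrative):

modinfo mlx5_core | grep -E "^(filename|version|vermagic)"
ethtool -i eth0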


8️⃣ STEP 8: Quick RCA Using vmcore-dmesg.txt (Fastest)

When crash tool is not available:

cat /var/crash/*/vmcore-dmesg.txt | tail -50

Look for:

Kernel panic - not syncing
RIP: function_name

🔥 Often enough for initial RCA
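
A hedged one-liner that pulls the usual suspects from the crash-time log:

grep -iE "kernel panic|BUG:|RIP:|soft lockup|Call Trace" /var/crash/*/vmcore-dmesg.txt | tail -20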


Common Panic Patterns (Rocky Linux 8.10)


Pattern in Output   | Meaning
--------------------+--------------------------
mlx5_core           | NIC driver issue
nvme                | Disk / firmware
BUG:                | Kernel bug
watchdog            | CPU soft lockup
slab corruption     | Memory overwrite
net_rx_action       | Network flood / driver

9️⃣ STEP 9: (If Kubernetes Node) Correlate with K8s

kubectl describe node <node-name>
kubectl get events -A --sort-by=.lastTimestamp

Look for:

  • Node reboot time

  • Pod evictions

  • CNI / CSI restarts

  • High CPU / DPDK pods
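
To line the panic time up with cluster activity, node-scoped events are usually the quickest filter (hedged example):

kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp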


🔟 STEP 10: Final RCA Template (Use This)

Incident: Kernel Panic on Worker Node
OS: Rocky Linux 8.10
Kernel: 4.18.0-553.el8_10
Time: 22-Dec-2025 14:32 IST

Root Cause:
Kernel panic caused by mlx5_core NIC driver
NULL pointer dereference during RX polling

Evidence:
- vmcore backtrace shows mlx5e_napi_poll
- vmcore-dmesg confirms RIP in NIC driver

Impact:
- Node rebooted
- Pods evicted
- 6 minutes downtime

Fix:
- Upgraded NIC firmware
- Kernel errata applied

Prevention:
- Enable reboot alerts
- Maintain kernel debuginfo cache

✅ Production Best Practices (MUST FOLLOW)

✔ Keep kdump always enabled
✔ Cache kernel-debuginfo
✔ Monitor node reboots
✔ Avoid privileged containers
✔ Keep kernel & firmware aligned
✔ Archive vmcore after RCA (see example below)
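
A minimal archiving sketch, using the crash directory name from the Step 3 example (adjust the path and destination to your environment):

cd /var/crash
tar czf vmcore-archive-2025-12-22.tar.gz 127.0.0.1-2025-12-22-14:32/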


Monday, 15 December 2025

k8sgpt

 Run Kubernetes AI Debugging Locally Using k8sgpt + Ollama (No OpenAI, 100% Free)

As Kubernetes clusters grow, debugging issues like ImagePullBackOff, CrashLoopBackOff, or scheduling failures becomes time-consuming.
k8sgpt solves this by analyzing your cluster and explaining issues in plain English using AI.

In this blog, I’ll show how to run k8sgpt locally with Ollama (no OpenAI key required) using Minikube on Windows.

This setup is ideal for:

  • Kubernetes SREs

  • DevOps Engineers

  • Platform teams

  • Anyone who wants AI-assisted debugging without cloud dependency


🧱 Architecture

Minikube (Kubernetes)
   |
k8sgpt (CLI)
   |
Ollama (Local LLM - llama3.1)

✔ Fully local
✔ No API key
✔ No cost
✔ Works offline


✅ Prerequisites

  • Windows 10/11 (64-bit)

  • Minikube installed and running

  • kubectl configured

  • Ollama installed

Verify:

kubectl get nodes
ollama list

Expected:

minikube   Ready
llama3.1

🔹 Step 1: Download k8sgpt (Windows)

Go to:
👉 https://github.com/k8sgpt-ai/k8sgpt/releases

Download:

k8sgpt_Windows_x86_64.zip

Extract it and move:

k8sgpt.exe → C:\Program Files\k8sgpt\

Add this directory to your PATH.

Verify:

k8sgpt version

🔹 Step 2: Verify Kubernetes Context

kubectl config current-context

Output:

minikube

🔹 Step 3: Remove OpenAI Backend (Important)

If OpenAI was previously configured:

k8sgpt auth remove --backends openai

This avoids quota and authentication errors.


🔹 Step 4: Configure Ollama as AI Provider

Add Ollama with explicit model name:

k8sgpt auth add --backend ollama --model llama3.1

Set Ollama as default provider:

k8sgpt auth default --provider ollama

Verify:

k8sgpt auth list

Expected:

Default:
> ollama
Active:
> ollama

🔹 Step 5: Verify Ollama Endpoint

curl http://localhost:11434/api/tags

You should see:

llama3.1

🔹 Step 6: Run k8sgpt Analysis

Basic analysis:

k8sgpt analyze

With AI explanation:

k8sgpt analyze --explain

For cleaner output:

k8sgpt analyze --explain --filter=Pod,Node,Deployment
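
To see which analyzers are available for the --filter flag (the filters subcommand is present in current k8sgpt releases):

k8sgpt filters list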

🧪 Step 7: Test with a Real Failure

Create a broken pod:

apiVersion: v1
kind: Pod
metadata:
  name: broken-pod
spec:
  containers:
  - name: test
    image: nginx:doesnotexist

Apply:

kubectl apply -f broken.yaml

Now run:

k8sgpt analyze --explain --filter=Pod

✅ Output (Example)

  • Detects ImagePullBackOff

  • Explains root cause

  • Suggests fix

  • Generated locally using llama3.1
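
Clean up the test pod once you are done:

kubectl delete pod broken-pod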


🧠 Why This Setup Is Powerful

Feature    | Benefit
-----------+---------------------------
Local LLM  | No internet required
No OpenAI  | Zero cost
Minikube   | Safe learning environment
k8sgpt     | Fast RCA
Ollama     | Production-grade local AI

๐Ÿ” Production Notes

  • This setup works the same on large clusters (50+ nodes)

  • In production, you can:

    • Run k8sgpt as CronJob

    • Integrate with Slack / MCP / ChatOps

    • Use with EFK / OpenSearch logs

    • Extend to Robin.io environments


๐Ÿ Conclusion

By combining k8sgpt + Ollama, you get an AI-powered Kubernetes debugging assistant that:

  • Runs locally

  • Costs nothing

  • Protects data privacy

  • Scales from Minikube → Production

This is an excellent way for SREs to adopt AI safely.



MCP server

Minikube + MCP server + VS Code + Continue + Ollama


VS Code
   ↓
Continue Extension
   ↓
MCP Protocol (SSE)
   ↓
kubernetes-mcp-server
   ↓
Kubernetes API (via ServiceAccount)



VS Code
 └─ Continue Extension
     ├─ LLM (Ollama / OpenAI / etc.)
     └─ MCP Server → http://localhost:3000/mcp
         └─ kubernetes-mcp-server
             └─ Minikube cluster

🥇 OPTION 1 (RECOMMENDED): Build image locally & load into Minikube

This avoids GHCR entirely.

Step 1: Clone repo locally

git clone https://github.com/containers/kubernetes-mcp-server.git
cd kubernetes-mcp-server

Step 2: Build Docker image

If using Docker (recommended):

docker build -t kubernetes-mcp-server:local .

If using Podman:

podman build -t kubernetes-mcp-server:local .

Step 3: Load image into Minikube

minikube image load kubernetes-mcp-server:local
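
Verify the image is now visible inside Minikube:

minikube image ls | grep kubernetes-mcp-server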



RBAC YAML (ClusterRole + ClusterRoleBinding)

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: mcp-reader
rules:
  - apiGroups: [""]
    resources:
      - pods
      - services
      - nodes
      - events
      - namespaces
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - deployments
      - replicasets
      - statefulsets
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mcp-reader-binding
subjects:
  - kind: ServiceAccount
    name: mcp-sa
    namespace: mcp
roleRef:
  kind: ClusterRole
  name: mcp-reader
  apiGroup: rbac.authorization.k8s.io
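
The binding above references a ServiceAccount named mcp-sa in the mcp namespace, so create both before applying the manifests (rbac.yaml is an illustrative file name for the YAML above):

kubectl create namespace mcp
kubectl create serviceaccount mcp-sa -n mcp
kubectl apply -f rbac.yaml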
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-mcp-server
  namespace: mcp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubernetes-mcp-server
  template:
    metadata:
      labels:
        app: kubernetes-mcp-server
    spec:
      serviceAccountName: mcp-sa
      containers:
        - name: mcp
          image: kubernetes-mcp-server:local
          imagePullPolicy: IfNotPresent
          args:
            - --port
            - "3000"
          env:
            - name: KUBERNETES_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          ports:
            - containerPort: 3000

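The port-forward in the next step targets a Service, so expose the Deployment first (the full Service manifest is shown in the step-by-step section later in this post):

kubectl expose deployment kubernetes-mcp-server -n mcp --port 3000 --target-port 3000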

PS C:\Users\Raj Kumar Gupta\Desktop\Raj\minikube> kubectl port-forward -n mcp svc/kubernetes-mcp-server 3000:3000

Forwarding from 127.0.0.1:3000 -> 3000

Forwarding from [::1]:3000 -> 3000

Handling connection for 3000

============================================

STEP 1️⃣ Install Ollama (Windows)

Download Ollama

👉 https://ollama.com/download

  1. Download Windows installer

  2. Install (default options)

  3. Reboot recommended


Verify Ollama installation

Open PowerShell:

ollama --version


PS C:\WINDOWS\system32> cd C:\Users\"Raj Kumar Gupta"\Desktop\Raj\minikube

PS C:\Users\Raj Kumar Gupta\Desktop\Raj\minikube> ollama --version

ollama version is 0.13.0

PS C:\Users\Raj Kumar Gupta\Desktop\Raj\minikube> ollama pull llama3.1

pulling manifest

pulling 667b0c1932bc: 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.9 GB

pulling 948af2743fc7: 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.5 KB

pulling 0ba8f0e314b4: 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  12 KB

pulling 56bb8bd477a5: 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   96 B

pulling 455f34728c9b: 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  487 B

verifying sha256 digest

writing manifest

success

PS C:\Users\Raj Kumar Gupta\Desktop\Raj\minikube> ollama run llama3.1

>>> hello

Hello! How are you today? Is there something I can help you with or would you like to chat?


============================================

Continue config file location: C:\Users\Raj Kumar Gupta\.continue

{
  "models": [
    {
      "title": "Ollama (Local)",
      "provider": "ollama",
      "model": "llama3.1"
    }
  ],
  "mcpServers": [
    {
      "name": "kubernetes",
      "transport": "http",
      "url": "http://localhost:3000/mcp"
    }
  ]
}

Local AI for Kubernetes: Ollama + Continue + MCP Step-by-Step

📌 Prerequisites
✔ Windows / Linux / macOS
✔ VS Code installed
✔ Kubernetes cluster (Minikube used here)
✔ kubectl configured
✔ Docker installed
✔ Basic Kubernetes knowledge

🧠 Architecture Overview

VS Code (Continue Extension)
        |
        |  (SSE / MCP)
        v
Kubernetes MCP Server
        |
        |  (Kubernetes API)
        v
Minikube Cluster
        ^
        |
Local LLM (Ollama – Llama 3.1)

Key idea:

Continue talks to Ollama for AI reasoning and to Kubernetes MCP for real cluster data.


🚀 Step 1: Install Ollama (Local LLM – Free)

Download Ollama

👉 https://ollama.ai/download

Verify installation

ollama --version

Pull model (IMPORTANT)

ollama pull llama3.1

Verify model

ollama list

🧩 Step 2: Install Continue Extension in VS Code

  1. Open VS Code

  2. Go to Extensions

  3. Search Continue

  4. Install Continue.dev

  5. Reload VS Code


🤖 Step 3: Add Ollama Model in Continue

  1. Open Continue panel (left sidebar)

  2. Click Select model → Add Chat Model

  3. Fill details:

    • Provider: Ollama

    • Model: Llama3.1 Chat

  4. Click Connect

  5. Select Llama3.1 Chat

✅ At this point, Continue works with local AI.


☸️ Step 4: Deploy Kubernetes MCP Server

Create namespace

kubectl create namespace mcp

Deployment YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubernetes-mcp-server
  namespace: mcp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubernetes-mcp-server
  template:
    metadata:
      labels:
        app: kubernetes-mcp-server
    spec:
      containers:
        - name: mcp
          image: ghcr.io/containers/kubernetes-mcp-server:latest
          args:
            - "--port"
            - "3000"
          ports:
            - containerPort: 3000

Service YAML

apiVersion: v1
kind: Service
metadata:
  name: kubernetes-mcp-server
  namespace: mcp
spec:
  selector:
    app: kubernetes-mcp-server
  ports:
    - port: 3000
      targetPort: 3000

Apply

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

🔌 Step 5: Port-forward MCP Server

kubectl port-forward -n mcp svc/kubernetes-mcp-server 3000:3000

Keep this terminal open.


⚙️ Step 6: Configure Continue MCP (config.yaml)

Path:

C:\Users\<username>\.continue\config.yaml

Final Working Config (VERY IMPORTANT)

name: Local Config
version: 1.0.0
schema: v1

models:
  - name: Llama3.1 Chat
    provider: ollama
    model: llama3.1

mcpServers:
  - name: kubernetes
    type: sse
    url: http://localhost:3000/mcp

Reload VS Code

Ctrl + Shift + P → Reload Window

✅ Step 7: Verify Everything Works

Test AI

hello

Discover MCP tools

What tools are available?

Kubernetes real data

List pods in the mcp namespace

🎉 If you see real cluster output → success!



Sunday, 9 November 2025

Managing Robin Clusters Remotely Using the Robin Client

Robin’s CLI is powerful, but what if you need to manage multiple clusters from a single machine? That’s where the Robin Remote Client comes in. It mirrors the native CLI functionality and introduces the concept of contexts, allowing you to switch between clusters seamlessly.

This blog post walks you through downloading the remote client, adding and managing contexts, and using them to control your Robin clusters efficiently.


📥 4.11.1 Downloading the Robin Client

To begin, download the Robin client from your cluster’s master node:

curl -k 'https://<masterip>:<port>/api/v3/robin_server/download?file=robincli&os=<os>' -o robin

🔹 Example

curl -k 'https://vnode42:29442/api/v3/robin_server/download?file=robincli&os=linux' -o robin

Output:

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100 10.1M  100 10.1M    0     0  1421k      0  0:00:07  0:00:07 --:--:-- 1483k

Verify the download:

ls -lart
-rw-r--r--    1 demo  staff    10655536 Mar 26 14:12 robin
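
Make the downloaded binary executable and, optionally, move it onto your PATH:

chmod +x robin
sudo mv robin /usr/local/bin/robin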


🧩 4.11.2 Adding a Context

A context defines a Robin cluster for the remote client. You can add one using:

robin client add-context <server>
                        --port <port>
                        --file-port <fileport>
                        --event-port <eventport>
                        --watchdog-port <watchport>
                        --metrics-port <metricsport>
                        --log-level <loglevel>
                        --product <producttype>
                        --set-current

🔹 Example

robin client add-context centos-60-214 --port 29442

Output:

Context robin-cluster-centos-60-214 created successfully

💡 If the cluster is highly available, use the VIP and set the port to 29465.


📋 4.11.3 Listing All Contexts

To view all registered contexts with details:

robin client list-contexts --full

✅ Output

   | Server                            | Port  | Version    | Tenant         | Last Login           | Tenants        | FPort | WPort | MPort | LogLevel
---+-----------------------------------+-------+------------+----------------+----------------------+----------------+-------+-------+-------+----------
   | master.robin-server.service.robin | 29442 | -          | -              | -                    |                | 29445 | 29444 | 29446 | ERR
   | centos-60-214                     | 29443 | -          | Administrators | -                    |                | 29445 | 29444 | 29446 | ERR
 * | 172.19.174.194                    | 29442 | 5.2.3-9842 | Administrators | 26 Mar 2020 16:10:58 | Administrators | 29445 | 29444 | 29446 | ERR

The asterisk * indicates the current context.


🔄 4.11.4 Setting the Current Context

To switch to a specific cluster:

robin client set-current <context>

🔹 Example

robin client set-current centos-60-214

Output:

Current context set to robin-cluster-centos-60-214

🔧 4.11.5 Updating the Current Context

If cluster attributes change (e.g., after reinstallation), update the context:

robin client update-context --port <port>
                            --file-port <fileport>
                            --event-port <eventport>
                            --watchdog-port <watchport>
                            --metrics-port <metricsport>
                            --log-level <log_level>

🔹 Example

robin client update-context --port 29942 --file-port 29445 --watchdog-port 29444 --metrics-port 29446

Output:

Updating attributes for context robin-cluster-centos-60-214
Server: centos-60-214
Context config updated for robin-cluster-centos-60-214

🗑️ 4.11.6 Deleting a Context

To remove a context:

robin client delete-context <context>

🔹 Example

robin client delete-context centos-60-214

Output:

Context centos-60-214 deleted

๐Ÿ“ Summary

Task            | Command                                | Description
----------------+----------------------------------------+------------------------------
Download client | curl                                   | Get Robin CLI for remote use
Add context     | robin client add-context               | Register a cluster
List contexts   | robin client list-contexts --full      | View all registered clusters
Set current     | robin client set-current <context>     | Switch active cluster
Update context  | robin client update-context            | Modify cluster attributes
Delete context  | robin client delete-context <context>  | Remove a cluster