Kubernetes Incident Response Playbook: Master Security & Protect Your Cluster

In the ephemeral, distributed world of cloud-native infrastructure, traditional forensic methods often fail. Kubernetes Incident Response requires a paradigm shift from treating servers as pets to handling volatile, containerized workloads that can vanish in seconds. For expert practitioners, the challenge isn't just detecting an intrusion—it's performing containment and forensics without alerting the attacker or destroying the evidence in a self-healing environment.

This guide serves as a technical playbook for SREs and Platform Engineers. We will bypass basic definitions and dive straight into the architectural strategies, `kubectl` patterns, and runtime security configurations necessary to execute a professional response to a cluster compromise.

The Kubernetes Incident Response Lifecycle

Effective response follows the NIST SP 800-61r2 framework, adapted for the Kubernetes control plane and data plane. The lifecycle consists of four critical phases:

  • Preparation: Audit logging, runtime security (Falco/Tetragon), and immutable infrastructure.
  • Detection & Analysis: Anomalous behavior identification via metrics and API logs.
  • Containment & Eradication: Network isolation, node cordoning, and malware removal.
  • Post-Mortem: Forensic analysis and policy updates.

Phase 1: Preparation & Hardening

You cannot respond to what you cannot see. In a Kubernetes environment, the API server is the brain, and its logs are the flight recorder.

Enabling Deep Visibility with Audit Logs

Standard logging often misses the context required for forensics. You must configure the Kubernetes API server with a robust audit-policy.yaml. For production clusters, relying on "Metadata" level logging is insufficient for deep investigation; "RequestResponse" is preferred for critical resources, despite the volume.

apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Track access to secrets and configmaps at Metadata level
  # (RequestResponse here would record secret values in the audit log)
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Log full request and response bodies for critical modifications
  - level: RequestResponse
    verbs: ["update", "patch", "create", "delete", "deletecollection"]
    resources:
      - group: ""
        resources: ["pods", "serviceaccounts", "services"]
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  # Catch-all for everything else
  - level: Metadata
    omitStages:
      - "RequestReceived"
Pro-Tip: Ensure your audit logs are shipped off-cluster immediately (e.g., to Splunk, ELK, or CloudWatch). Attackers with root access to the control plane nodes will attempt to wipe local /var/log/kube-apiserver directories.
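The policy file alone does nothing until the API server is pointed at it. Below is a minimal sketch of the relevant flags in a kubeadm-style static pod manifest; the file paths and retention values are illustrative and should be adapted to your cluster.

```yaml
# Excerpt from a kube-apiserver static pod manifest (paths are illustrative)
spec:
  containers:
    - command:
        - kube-apiserver
        - --audit-policy-file=/etc/kubernetes/audit-policy.yaml
        - --audit-log-path=/var/log/kube-apiserver/audit.log
        - --audit-log-maxage=7      # days to retain rotated logs
        - --audit-log-maxbackup=10  # rotated files to keep
        - --audit-log-maxsize=100   # megabytes before rotation
```

Remember to mount the policy file and the log directory into the kube-apiserver pod (e.g., via hostPath volumes), or the API server will fail to start with these flags.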

Phase 2: Detection & Analysis

Detection in Kubernetes Incident Response relies on identifying deviations from the declarative state. If a container spawns a shell or initiates an outbound connection to an unknown IP, it is a high-fidelity indicator of compromise (IoC).

Runtime Detection with Falco

Static analysis (image scanning) is preventive; runtime security is what catches active exploitation. Tools like Falco monitor kernel system calls (via a kernel module or eBPF). Below is a rule, based on Falco's stock "Terminal shell in container" rule, that fires when an interactive shell spawns inside a container, a common post-exploitation foothold.

- rule: Terminal shell in container
  desc: A shell was used as the entrypoint for a container.
  condition: >
    spawned_process and container
    and shell_procs and proc.tty != 0
    and container_entrypoint
  output: >
    Shell spawned in a container (user=%user.name container_id=%container.id
    image=%container.image.repository shell=%proc.name parent=%proc.pname
    cmdline=%proc.cmdline)
  priority: WARNING

Investigating "Exec" Anomalies

Attackers often use kubectl exec to move laterally. You can hunt for this in your audit logs by filtering for the pods/exec subresource.

jq 'select(.objectRef.subresource == "exec") | {user: .user.username, pod: .objectRef.name, ns: .objectRef.namespace, cmd: .requestURI}' audit.log
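To triage at scale, it helps to aggregate these events by user. Assuming the audit log is newline-delimited JSON (the default file format), a quick frequency count looks like this:

```shell
# Count exec events per user to spot anomalous actors
jq -r 'select(.objectRef.subresource == "exec") | .user.username' audit.log \
  | sort | uniq -c | sort -rn
```

A service account that suddenly appears at the top of this list is worth an immediate look.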

Phase 3: Containment, Eradication & Forensics

This is the most critical phase. Do not delete the compromised pod immediately: deleting it destroys the memory state and ephemeral storage, effectively erasing the crime scene.

Step 1: Isolate the Pod (Network Segmentation)

Instead of killing the pod, cut off its network access to the rest of the cluster and the internet, while allowing access from a forensic workstation.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-compromised-pod
  namespace: target-namespace
spec:
  podSelector:
    matchLabels:
      app: compromised-app  # Ensure you label the pod first
  policyTypes:
    - Ingress
    - Egress
  # Egress has no rules, so all outbound traffic is denied;
  # ingress is allowed only from the forensic workstation pods
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: forensic-investigator
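The selector above only matches pods that carry the label, so label the suspect pod before applying the policy. A sketch, with the pod, namespace, and file names as placeholders:

```shell
# Names below are illustrative; substitute the real pod and manifest
kubectl -n target-namespace label pod suspicious-pod-abc12 app=compromised-app --overwrite
kubectl -n target-namespace apply -f quarantine-networkpolicy.yaml

# Verify the quarantine is in place
kubectl -n target-namespace describe networkpolicy quarantine-compromised-pod
```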

Step 2: Cordon and Taint the Node

Prevent the Kubernetes scheduler from placing new workloads on the compromised node to limit the blast radius.

kubectl cordon worker-node-01
kubectl taint nodes worker-node-01 security=compromised:NoSchedule
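Before moving on, confirm the node is actually unschedulable and carries the taint:

```shell
# A cordoned node has spec.unschedulable set to true
kubectl get node worker-node-01 -o jsonpath='{.spec.unschedulable}{"\n"}'
kubectl describe node worker-node-01 | grep -A1 Taints
```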

Step 3: Live Forensics with Ephemeral Containers

Avoid installing tools (like tcpdump or gdb) into the running container, as this alters the filesystem hash. Instead, use Ephemeral Containers to attach a "sidecar" with your forensic toolkit.

kubectl debug -it --image=nicolaka/netshoot:latest target-pod-name --target=main-container-name

This attaches the netshoot container to the process namespace of the target, allowing you to run ps, netstat, and capture traffic without polluting the original container image.
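Once attached, a first triage pass from inside the debug container might look like this (netshoot ships all three tools):

```shell
# Run inside the netshoot debug container (shares the target's namespaces)
ps auxwf                             # full process tree: look for unexpected children
ss -tunap                            # live TCP/UDP sockets with owning processes
tcpdump -i any -w /tmp/suspect.pcap  # capture traffic for offline analysis
```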

Advanced Concept: If you need to dump the memory of a process, you may need to access the node directly. Use nsenter from the host to enter the container's network, PID, and mount namespaces:
nsenter -t <PID> -n -p -m
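Finding the host PID for nsenter goes through the container runtime. A sketch using crictl on the node (the container name is illustrative, and jq is assumed to be available on the host):

```shell
# Resolve the container's host PID via the CRI runtime, then enter its namespaces
CID=$(crictl ps --name main-container-name -q)
PID=$(crictl inspect "$CID" | jq -r '.info.pid')
nsenter -t "$PID" -n -p -m
```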

Phase 4: Post-Mortem & Recovery

Once the threat is neutralized (e.g., by patching the vulnerability, rotating secrets, and redeploying fresh images), the focus shifts to recovery and documentation.

  • Snapshot Volumes: Before deleting the pod, take a snapshot of any PersistentVolumeClaims (PVCs) for deeper offline analysis.
  • Rotate Credentials: Assume all ServiceAccount tokens and Secrets mounted in the pod are compromised. Rotate them immediately.
  • Update Policies: Update NetworkPolicies and OPA Gatekeeper/Kyverno constraints to prevent recurrence.
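The snapshot step can be done declaratively through the CSI snapshot API. A minimal sketch, assuming a CSI driver and a VolumeSnapshotClass are installed in the cluster (all names below are illustrative):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: compromised-app-evidence
  namespace: target-namespace
spec:
  volumeSnapshotClassName: csi-snapclass             # illustrative class name
  source:
    persistentVolumeClaimName: compromised-app-data  # illustrative PVC name
```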

Frequently Asked Questions (FAQ)

How does Kubernetes Incident Response differ from traditional IR?

In traditional IR, servers are static. In Kubernetes, pods are ephemeral. IP addresses change frequently, and logs can be lost if a pod restarts. This requires a shift to label-based identification and centralized, immutable logging streams.

Should I delete a compromised pod immediately?

No. Deleting a pod destroys volatile evidence (RAM, temporary files, network connections). Always isolate (quarantine) and pause the pod first, capture forensic data, and then delete/redeploy.

What tools are essential for K8s forensics?

Essential tools include Falco for runtime detection, kubectl debug for ephemeral attachment, Sysdig or Inspektor Gadget for tracing, and Velero for taking volume snapshots.


Conclusion

Mastering Kubernetes Incident Response is about speed and precision. By preparing your cluster with robust audit logging and runtime security enforcement, you gain the visibility needed to detect breaches early. When an incident occurs, resisting the urge to "kill the pod" and instead following a disciplined process of isolation and forensic analysis ensures you not only neutralize the threat but also learn from it, hardening your infrastructure against future attacks.

Thank you for reading the huuphan.com page!
