
Operational Playbooks: What to Do When Things Go Wrong

Everything fails eventually. Hardware degrades, software has bugs, configurations drift, and humans make mistakes. The difference between a minor incident and a major outage isn't whether things fail - it's how quickly and effectively you respond when they do.

In air-gapped and secure environments, the response challenge is amplified. You can't Google the error message. You can't quickly pull in a patch from upstream. Your monitoring can't phone home to a cloud dashboard. Whatever you need to diagnose and fix problems has to already be inside the perimeter.

This post covers how we approach incident response with Lattice - the playbooks we build, the patterns we follow, and the thinking behind preparing for failure before it happens.

Why Playbooks Matter

Under pressure, people make worse decisions. Stress narrows thinking, time pressure encourages shortcuts, and the urgency to restore service can lead to actions that make things worse.

Playbooks counter this. A well-written playbook provides a structured path through a problem even when the person following it is tired, stressed, and short on time. It's not about removing judgement - it's about ensuring the basics are covered before judgement is needed.

Good playbooks share common characteristics:

Specific - "Check Longhorn volume health" is useful. "Investigate storage issues" is not.

Sequential - Steps are ordered. Do this first, then this, then this. Decision points are explicit.

Observable - Each step tells you what you should see if things are working. Expected output matters as much as the command itself.

Bounded - Playbooks define when to escalate. If step five doesn't resolve the issue, stop and escalate rather than improvising.

Categories of Failure

Not all failures are the same, and different categories need different responses:

Infrastructure Failures

Hardware and OS-level problems - node crashes, disk failures, network partitions, power loss. These are often sudden and total for the affected component.

Response pattern: identify the scope of impact, verify redundancy is handling the failure, plan replacement or recovery of the failed component.

Platform Failures

Kubernetes and its components - API server unresponsive, etcd degradation, CNI issues, storage controller problems. These affect everything running on the platform.

Response pattern: assess platform health systematically, identify the failing component, determine whether the failure is isolated or cascading.

Application Failures

Workloads running on the platform - pods crashing, services unreachable, performance degradation. These affect specific applications but the platform itself is healthy.

Response pattern: confirm the platform is healthy first (it almost always looks like a platform problem initially), then focus on the application layer.

Configuration Drift

Gradual divergence from intended state - certificates expiring, resource limits drifting, security policies weakened. These are slow-burn problems that become incidents when a threshold is crossed.

Response pattern: compare current state against intended state, identify what changed, restore the intended configuration.
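In practice, the comparison step can be as simple as diffing the rendered intended configuration against an export of the live state. A minimal sketch, with illustrative file names standing in for real exports:

```shell
# Compare intended configuration against live state; a non-empty diff
# means drift. File names here are illustrative stand-ins.
drift_check() {
  diff -u "$1" "$2" && echo "no drift" || echo "drift detected"
}

# Fixture files in place of real rendered/exported configuration:
printf 'replicas: 3\nlogLevel: info\n' > intended.yaml
printf 'replicas: 3\nlogLevel: debug\n' > live.yaml

drift_check intended.yaml live.yaml   # prints the diff, then "drift detected"
```

The diff itself tells you what changed; restoring the intended configuration is then a matter of re-applying the source of truth.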

The Playbook Structure

Every Lattice playbook follows a consistent structure:

1. Detection

How do you know there's a problem? This section describes:

  • What alert fired, or what symptom was observed
  • Where to look first (which dashboard, which logs)
  • How to confirm the problem is real (not a false alarm)

False alarms erode trust in monitoring. Every playbook starts by confirming the problem before taking action.

2. Assessment

What's the scope and impact? This section covers:

  • What's affected (which services, which users, which data)
  • Is the problem getting worse, stable, or recovering on its own
  • What's the urgency (data loss risk, service outage, degraded performance)

Assessment drives prioritisation. A degraded dashboard is less urgent than a degrading database.

3. Diagnosis

What's actually wrong? This section provides:

  • Specific commands to run and what output to expect
  • Logs to check and what to look for
  • Common causes for this type of failure

Diagnosis is where most time is spent. Good playbooks accelerate this by pointing people at the right places.

4. Resolution

How do you fix it? This section includes:

  • Step-by-step remediation actions
  • Verification steps after each action (did it work?)
  • Rollback procedures if the fix makes things worse

5. Post-Incident

What happens after the immediate problem is resolved:

  • Verify full recovery (run the test suite)
  • Document what happened, what was done, and what was learned
  • Identify follow-up actions to prevent recurrence

Common Playbooks

These are the incident types that come up most frequently in Lattice deployments:

Node Not Ready

Detection: Alert fires on node status change. Dashboard shows a node in NotReady state.

Assessment: How many nodes are affected? Are pods being rescheduled? Is storage replicating?

Diagnosis:

  • Check node status and conditions - is it a network issue, resource pressure, or kubelet failure?
  • Check system logs on the affected node (if reachable via SSH)
  • Check for resource exhaustion - disk pressure, memory pressure, PID pressure

Resolution path:

  • If the node is reachable, check and restart kubelet
  • If resource pressure, identify and address the resource constraint
  • If the node is unreachable, verify hardware status and consider replacement
  • Verify pod rescheduling completed, storage replicas rebuilt
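The first diagnosis step usually starts from `kubectl get nodes`. A small filter over that tabular output pulls out the nodes needing attention; the sample below stands in for live cluster output:

```shell
# List nodes whose STATUS column is anything other than Ready.
# Operates on the tabular output of `kubectl get nodes`.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Sample output standing in for a live cluster:
sample='NAME     STATUS     ROLES           AGE   VERSION
node-1   Ready      control-plane   90d   v1.29.4
node-2   NotReady   worker          90d   v1.29.4
node-3   Ready      worker          90d   v1.29.4'

printf '%s\n' "$sample" | not_ready_nodes   # prints: node-2 NotReady
```

On a live cluster, `kubectl get nodes | not_ready_nodes` identifies the node, and `kubectl describe node <name>` then shows the conditions (network, resource pressure, kubelet) that explain why.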

Storage Volume Degraded

Detection: Longhorn dashboard shows degraded volumes. Alert fires on replica count drop.

Assessment: How many volumes are degraded? Is data still accessible? Are applications affected?

Diagnosis:

  • Check which node lost replicas - is it a node failure or disk failure?
  • Check Longhorn manager logs for error details
  • Verify remaining replicas are healthy

Resolution path:

  • If the node is recovering, wait for automatic rebuild
  • If the node is lost, Longhorn will rebuild on available nodes (monitor progress)
  • If disk capacity is insufficient for rebuild, expand storage or evict data
  • Verify all volumes return to healthy state with full replica count
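The CLI view of volume health is `kubectl get volumes.longhorn.io -n longhorn-system`, whose ROBUSTNESS column reports healthy, degraded, or faulted. A filter sketch - the column layout is assumed from a typical Longhorn install, so verify it against your version:

```shell
# Print volumes whose ROBUSTNESS column is not "healthy".
# Assumes the column order of `kubectl get volumes.longhorn.io -n longhorn-system`:
# NAME  STATE  ROBUSTNESS  ...
degraded_volumes() {
  awk 'NR > 1 && $3 != "healthy" { print $1, $3 }'
}

sample='NAME       STATE      ROBUSTNESS   SCHEDULED   SIZE          NODE     AGE
pvc-0a1b   attached   healthy      True        10737418240   node-1   30d
pvc-9f8e   attached   degraded     True        10737418240   node-2   30d'

printf '%s\n' "$sample" | degraded_volumes   # prints: pvc-9f8e degraded
```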

Certificate Expiring

Detection: x509 exporter alert fires on certificate approaching expiry. Dashboard shows certificate with less than the threshold remaining.

Diagnosis:

  • Which certificate is expiring? (the alert should identify it)
  • Is it a Lattice-managed certificate or an externally provided one?
  • What services depend on this certificate?

Resolution path:

  • For Lattice-managed certificates, run the certificate renewal playbook in Ansible
  • For externally provided certificates, coordinate with the certificate authority
  • After renewal, verify services are using the new certificate
  • Check that the x509 exporter now shows healthy expiry dates
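Independently of the exporter, a certificate's remaining lifetime can be checked directly: `openssl x509 -noout -enddate -in <cert>` prints the `notAfter` date, and days remaining is simple arithmetic. A sketch, assuming GNU `date` for the `-d` flag:

```shell
# Days until a certificate expiry date, given the date string that
# `openssl x509 -noout -enddate` prints after "notAfter=".
# Requires GNU date for the -d flag.
days_until() {
  expiry=$(date -d "$1" +%s)
  now=$(date +%s)
  echo $(( (expiry - now) / 86400 ))
}

# Example with a fixed date; a real check would feed in openssl output:
days_until "Dec 31 23:59:59 2034 GMT"
```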

DNS Resolution Failing

Detection: Applications reporting connection failures. DNS-related errors in pod logs.

Assessment: Is DNS failing cluster-wide or for specific services? Is CoreDNS running?

Diagnosis:

  • Check CoreDNS pod status and logs
  • Test resolution from a debug pod - can you resolve internal services? External domains?
  • Check CoreDNS resource usage - is it overloaded?
  • Check for NetworkPolicy changes that might be blocking DNS traffic

Resolution path:

  • If CoreDNS pods are unhealthy, check node resources and restart if necessary
  • If resolution is intermittent, check for DNS cache issues or upstream DNS problems
  • If a NetworkPolicy is blocking DNS, correct the policy
  • Verify resolution is working from multiple pods across namespaces
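The debug-pod test in the diagnosis step is typically something like `kubectl run -it --rm dnstest --image=busybox --restart=Never -- nslookup kubernetes.default`. Triaging its output can be scripted; the failure strings below are assumed from common resolver errors, not an exhaustive list:

```shell
# Classify resolver output as OK or FAIL by looking for common
# failure strings from nslookup/busybox resolvers.
dns_verdict() {
  if grep -qiE "can't resolve|NXDOMAIN|no servers could be reached|connection timed out"; then
    echo FAIL
  else
    echo OK
  fi
}

# Sample failing output standing in for a real debug-pod run:
printf "Server: 10.96.0.10\nnslookup: can't resolve 'myservice'\n" | dns_verdict   # prints: FAIL
```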

Monitoring Stack Unhealthy

Detection: Grafana dashboards show gaps or errors. AlertManager not delivering alerts. Prometheus targets showing as down.

Assessment: Which monitoring components are affected? Is data being lost?

Diagnosis:

  • Check Prometheus pod status - is it running? Is storage full?
  • Check Prometheus targets - which scrape targets are failing?
  • Check Loki pod status and ingestion rate
  • Check AlertManager configuration and connectivity

Resolution path:

  • If Prometheus storage is full, expand PVC or adjust retention settings
  • If scrape targets are failing, check the target services and network connectivity
  • If Loki is behind, check resource limits and storage backend
  • After recovery, verify dashboards are populating and alerts are firing

Istio Control Plane Issues

Detection: Services intermittently failing. Sidecar injection not working. Kiali showing errors.

Assessment: Is istiod running? Are existing proxies still functional, or is new configuration failing to distribute?

Diagnosis:

  • Check istiod pod status, logs, and resource usage
  • Check proxy sync status - are sidecars receiving configuration updates?
  • If using ambient mode, check ztunnel status on affected nodes
  • Check for certificate distribution issues (Istio manages its own CA)

Resolution path:

  • If istiod is unhealthy, check resource constraints and restart if necessary
  • If configuration distribution is stalled, check for invalid configuration resources
  • If certificates are the issue, check Istio's CA and certificate rotation
  • Verify mesh connectivity by testing service-to-service communication
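Proxy sync status comes from `istioctl proxy-status`: proxies whose xDS columns show STALE rather than SYNCED have not received current configuration. A filter sketch - the column layout is assumed from recent istioctl versions:

```shell
# Print proxies with any STALE xDS column in `istioctl proxy-status` output.
stale_proxies() {
  awk 'NR > 1 && /STALE/ { print $1 }'
}

sample='NAME                 CLUSTER    CDS      LDS      EDS      RDS      ISTIOD            VERSION
app-a-6d9f.default   cluster1   SYNCED   SYNCED   SYNCED   SYNCED   istiod-5f7b-abc   1.22.1
app-b-77c4.default   cluster1   STALE    SYNCED   SYNCED   SYNCED   istiod-5f7b-abc   1.22.1'

printf '%s\n' "$sample" | stale_proxies   # prints: app-b-77c4.default
```

A stale proxy points the investigation at istiod and at whatever configuration resource it is failing to push.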

Cluster Upgrade Failure

Detection: Upgrade process errors out or completes with failing tests.

Assessment: What state is the cluster in? Is it partially upgraded? Are workloads still running?

Diagnosis:

  • Which step of the upgrade failed?
  • Are nodes running mixed versions?
  • Check Ansible output for specific error messages
  • Run the test suite to identify what's broken

Resolution path:

  • If the failure is early, roll back to the previous version
  • If partially upgraded, assess whether to continue forward or roll back
  • If rollback is needed, use the documented rollback procedure
  • Never leave a cluster in a partially upgraded state - resolve in one direction
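Checking for mixed node versions is a one-liner over `kubectl get nodes`, since VERSION is the last column of the default output:

```shell
# Count distinct kubelet versions across nodes; more than one means
# the cluster is mid-upgrade or was left partially upgraded.
version_count() {
  awk 'NR > 1 { print $NF }' | sort -u | wc -l
}

sample='NAME     STATUS   ROLES           AGE   VERSION
node-1   Ready    control-plane   90d   v1.30.2
node-2   Ready    worker          90d   v1.30.2
node-3   Ready    worker          90d   v1.29.4'

printf '%s\n' "$sample" | version_count   # prints: 2
```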

Cluster Expansion

Detection: This isn't a failure - it's planned growth. But it's operationally complex enough to warrant the same structured approach. Triggers include capacity thresholds being reached, new workloads requiring additional resources, or resilience requirements demanding more nodes for replication.

Assessment: What's driving the expansion? If it's resource pressure, understand which resource - CPU, memory, storage - before adding nodes. Adding compute nodes doesn't help if the constraint is disk I/O.

Planning:

  • How many nodes, and in which roles (control plane, worker, or both)?
  • Do the new nodes meet the same hardware and OS baseline as existing nodes?
  • Are there enough IP addresses, DNS entries, and certificates provisioned?
  • In air-gapped environments, are the new nodes pre-loaded with the required images and packages?

Execution:

  • Update Ansible inventory with the new nodes
  • Run the deployment playbook - Ansible's idempotent design means existing nodes are validated while new nodes are provisioned
  • Verify the new nodes join the cluster and report Ready
  • Confirm Longhorn extends storage to the new nodes (if configured)
  • Confirm Istio components (ztunnel in ambient mode, or sidecar injection) are functioning on new nodes
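The inventory update in the first step is usually just new host entries in the relevant groups. A sketch of an INI-style inventory fragment - hostnames and group names here are illustrative, not Lattice's actual inventory layout:

```ini
# Hypothetical inventory fragment; hostnames and groups are illustrative.
[control_plane]
node-1
node-2
node-3

[workers]
# node-4 and node-5 are the new additions; existing entries stay in
# place so the idempotent playbook run validates them unchanged.
node-4
node-5
```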

Verification:

  • Run the full test suite to confirm cluster health with the expanded topology
  • Check that workloads can schedule onto new nodes
  • Verify storage replication is balanced across the expanded node set
  • Confirm monitoring is scraping the new nodes and dashboards reflect the updated topology

The key risk with expansion in air-gapped environments is inadequate preparation. In connected environments, a new node pulls what it needs during provisioning. In air-gapped environments, everything must be staged beforehand - images, packages, certificates, configuration. A node that joins the cluster but can't pull workload images is worse than no node at all, because it consumes scheduler attention without providing capacity.

Building Your Own Playbooks

The playbooks above are starting points. Real environments generate their own failure patterns, and playbooks should evolve with experience.

Document Every Incident

After every incident, however minor, document:

  • What happened (symptoms, root cause)
  • How it was detected (alert, user report, chance discovery)
  • How it was resolved (what steps, how long)
  • What could prevent recurrence

This documentation feeds directly into new or improved playbooks.

Practice Regularly

A playbook that's never been followed is theoretical. Run through playbooks periodically - ideally in a staging environment - to verify they're accurate and that the team is familiar with them.

Game days, where failures are deliberately injected (as discussed in Part 6), are invaluable for testing both playbooks and team readiness.

Keep Playbooks Current

Infrastructure changes. Components get upgraded. Configuration evolves. Playbooks that reference old commands, deprecated tools, or removed dashboards are worse than useless - they waste time when time matters most.

Review playbooks whenever the platform changes. Include playbook updates as part of the change process.

Don't Over-Specify

Playbooks should guide, not constrain. Over-specified playbooks that script every keystroke become brittle and discourage thinking. The goal is to get the responder to the right place quickly, not to replace their expertise.

Leave room for judgement. "Check Prometheus resource usage and assess whether the issue is CPU, memory, or storage" is better than a rigid decision tree that can't handle the unexpected.

Communication During Incidents

Technical resolution is only half of incident management. Communication matters:

Declare early - It's better to declare an incident that turns out to be minor than to discover too late that a minor issue was actually major.

Update regularly - Stakeholders need to know what's happening, even if the update is "still investigating." Silence breeds anxiety.

Separate roles - In larger incidents, the person debugging should not be the person communicating. Context switching between technical work and stakeholder communication slows both.

Be honest - "We don't know yet" is always better than a premature root cause that turns out to be wrong.

Tooling for Incident Response

Lattice's built-in tooling supports incident response:

Dashboards - Platform overview, infrastructure health, Longhorn status, certificate expiry - these provide the first-look diagnostic information that guides initial assessment.

Alerts - Configured to fire on conditions that require attention, not just conditions that exist. Alert fatigue kills incident response effectiveness.

Test suites - Can be run during and after incidents to verify platform health and confirm resolution.

Ansible - The same playbooks that deploy the platform can remediate it. Re-running deployment playbooks is a valid recovery strategy for many configuration issues.

Logs - Centralised in Loki, queryable through Grafana. For air-gapped environments, having logs in one place rather than scattered across nodes saves critical time.

The On-Call Reality

In smaller teams - which describes many organisations operating in secure environments - on-call is a reality. Not everyone can specialise, and the person responding to an incident at 2 AM might not be the person who built the component that's failing.

This is exactly why playbooks exist. They encode the knowledge of the builder in a form the responder can use. They're knowledge transfer that works under pressure.

Invest in playbooks proportionally to the consequences of failure. For a development cluster, minimal playbooks are fine. For a production platform in a secure environment, comprehensive playbooks are essential.

Lessons Learned

Prepare before you need to respond - Playbooks written during an incident are too late. Write them when things are calm.

Test your response, not just your infrastructure - Game days reveal gaps in playbooks, tooling, and team knowledge that no amount of infrastructure testing catches.

Iterate on every incident - Every incident is a learning opportunity. Capture it, improve playbooks, and prevent recurrence where possible.

Monitoring is your first responder - If your monitoring doesn't detect the problem, your playbooks never get triggered. Invest in detection as much as resolution.

Keep it simple - Complex incident response procedures fail under pressure. Simple, clear steps that anyone on the team can follow are more valuable than comprehensive procedures that only one person understands.

What's Next

The final post in this series will cover building reusable infrastructure - how to treat your platform as a product, making it adaptable across environments and organisations.


Lattice is a project developed by Digital Native Group.