When something breaks in a connected environment, you've got options. Cloud dashboards, managed logging services, external alerting platforms are all a click away. In an air-gapped environment, you have what you brought with you. Nothing more.
This post covers how we approach observability in isolated Kubernetes deployments: the stack we use, what we monitor, and the lessons we've learned about operating without external dependencies.
The Self-Hosted Stack
Lattice deploys a complete observability stack that runs entirely within the cluster:
Prometheus handles metrics collection. It scrapes endpoints across the cluster - nodes, pods, services, and custom exporters - storing time-series data locally. No external metrics service required.
Grafana provides dashboards and visualisation. Pre-configured dashboards give immediate visibility into cluster health, with the flexibility to build custom views as needed.
Loki collects and indexes logs. Unlike traditional logging solutions that require significant storage and indexing infrastructure, Loki's label-based approach keeps resource requirements manageable while still enabling effective log search and correlation.
AlertManager routes alerts based on configurable rules. When something needs attention, it can notify through whatever channels are available in the environment, even if that's just a webhook to an internal system.
This combination isn't novel; it's the standard open-source observability stack. What matters is having it deployed, configured, and tested before you need it.
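The custom exporters mentioned above just serve plain text in Prometheus's exposition format over HTTP. A minimal sketch of rendering that format (the metric names, help strings, and values are illustrative, not part of Lattice):

```python
def render_metrics(metrics: dict[str, tuple[str, float]]) -> str:
    """Render gauges in the Prometheus text exposition format.

    `metrics` maps a metric name to (help text, current value).
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Two hypothetical gauges a custom exporter might expose:
body = render_metrics({
    "app_jobs_queued": ("Jobs waiting to be processed.", 12),
    "app_cache_bytes": ("Current cache size in bytes.", 4096),
})
```

Serving this text from a `/metrics` endpoint is enough for Prometheus to scrape it, given a matching scrape configuration.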
What We Monitor
Knowing the tools is one thing. Knowing what to watch is another.
Platform Health
The platform dashboards give an at-a-glance view of cluster state:
- Node status - Are all nodes healthy and schedulable?
- Resource utilisation - CPU, memory, and disk across the cluster
- Pod health - Restart counts, failed pods, pending workloads
- Control plane - API server latency, etcd health, scheduler queue depth
These metrics answer the basic question: is the platform itself healthy?
Infrastructure Components
Each infrastructure component needs its own visibility:
Storage - Longhorn dashboards show volume health, replication status, and disk utilisation. In environments where you can't easily add storage, knowing when you're approaching capacity matters. PVC dashboards track persistent volume claims across namespaces, highlighting volumes that are filling up or experiencing I/O issues.
Networking - Service mesh metrics (when Istio is deployed) show traffic flow, latency percentiles, and error rates between services. Even without a mesh, basic network metrics help identify connectivity issues.
DNS - CoreDNS metrics reveal query rates and failure patterns. DNS issues cause subtle, hard-to-diagnose problems; having visibility prevents hours of debugging.
Certificate Monitoring
This one's important enough to call out specifically. In air-gapped environments, certificate expiry is a silent killer.
The x509 exporter monitors certificates across the cluster - TLS secrets, CA bundles, anything with an expiry date. Alerts fire well before expiration, giving time to rotate certificates through whatever process the environment requires.
Without this, you discover certificate problems when services start failing. With it, you discover them weeks in advance.
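The early-warning logic behind this is simple date arithmetic. A sketch of the threshold check, using Python's standard library (the 30-day warning window is an illustrative choice, not the exporter's default):

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now: datetime) -> int:
    """Days until a certificate expires.

    `not_after` is a notAfter string in the format returned by
    ssl.SSLSocket.getpeercert(), e.g. 'Jun  1 12:00:00 2026 GMT'.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days

def should_alert(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """Fire well before expiry, leaving time to rotate."""
    return days_until_expiry(not_after, now) < warn_days

now = datetime(2026, 5, 10, tzinfo=timezone.utc)
should_alert("Jun  1 12:00:00 2026 GMT", now)  # 22 days left -> True
```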
Application Stacks
Beyond the platform itself, common application patterns need monitoring:
- Database health - Connection pools, query latency, replication lag
- Message queues - Queue depth, consumer lag, dead letter accumulation
- Web services - Request rates, error rates, response times
Lattice includes overview dashboards for common application stacks, providing sensible defaults that can be extended as applications mature.
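Consumer lag is a good example of a symptom worth evaluating rather than just plotting. A hypothetical lag check (the warning and critical thresholds are made-up defaults for illustration):

```python
def consumer_lag(produced_offset: int, consumed_offset: int) -> int:
    """Messages the consumer still has to work through."""
    return max(0, produced_offset - consumed_offset)

def lag_status(produced: int, consumed: int,
               warn: int = 1_000, crit: int = 10_000) -> str:
    """Classify lag into ok / warning / critical bands."""
    lag = consumer_lag(produced, consumed)
    if lag >= crit:
        return "critical"
    if lag >= warn:
        return "warning"
    return "ok"

lag_status(52_300, 51_900)  # lag of 400 -> "ok"
```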
Dashboards as Documentation
In an air-gapped environment, you can't search Stack Overflow during an incident. The dashboards themselves become documentation: they encode knowledge about what matters and what normal looks like.
We approach dashboard design with this in mind:
Overview first - Top-level dashboards answer "is everything okay?" without requiring deep knowledge. Traffic lights, summary metrics, obvious problem indicators.
Drill-down available - When the overview shows a problem, detailed dashboards help investigate. But the detail is one click away, not the starting point.
Context included - Dashboard titles, panel descriptions, and threshold annotations explain what the metrics mean and what values indicate problems. The person debugging at 2 AM might not be the person who built the system.
Tested in anger - Dashboards that look good in demos sometimes fail in real incidents. We validate dashboards during failure injection testing, ensuring they actually help when things break.
Alerting Philosophy
Alerts in isolated environments need particular care. There's no PagerDuty, no Slack integration to an external service, no SMS gateway (usually). Alerts need to reach people through whatever channels exist internally.
More importantly, alert fatigue in an air-gapped environment is dangerous. When every alert requires manual investigation without easy access to external knowledge bases, false positives consume disproportionate energy.
Our approach:
Alert on symptoms, not causes - "Service error rate elevated" rather than "CPU usage high." The former definitely needs attention; the latter might be fine.
Require action - Every alert should have a clear response. If there's nothing to do, it shouldn't be an alert — maybe a dashboard panel, maybe a log entry, but not an alert.
Include context - Alert messages include what's wrong, what the current value is, what the threshold is, and links to relevant dashboards. Reduce the time from "I got paged" to "I understand the problem."
Tune relentlessly - An alert that fires and gets ignored is worse than no alert. It trains people to ignore alerts. We track alert frequency and response, removing or fixing alerts that don't drive action.
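The "include context" rule can be enforced by how alert messages are built. A sketch of a message constructor (the field values and the dashboard path are fabricated for illustration):

```python
def format_alert(name: str, current: float, threshold: float,
                 unit: str, dashboard: str) -> str:
    """Build an alert message carrying enough context to act on:
    what's wrong, current value, threshold, and where to look next."""
    return (
        f"[ALERT] {name}: {current}{unit} "
        f"(threshold {threshold}{unit}) - see {dashboard}"
    )

msg = format_alert(
    "Service error rate elevated",
    current=4.2, threshold=1.0, unit="%",
    dashboard="/d/svc-overview",  # hypothetical internal Grafana path
)
```

The point is that the reader at 2 AM should never have to go look up what the threshold was.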
The Retention Question
Storage in air-gapped environments is often constrained, which forces decisions about metric and log retention.
Short retention (days to weeks) keeps storage manageable but limits historical analysis. You can see what's happening now, but "was it always like this?" becomes unanswerable.
Longer retention (months) enables capacity planning and trend analysis but requires more storage and can slow queries.
We typically configure:
- Metrics - 15-30 days at full resolution, with downsampling for longer-term trends where supported
- Logs - 7-14 days, with the option to archive to longer-term storage if available
The right answer depends on the environment's storage constraints and operational needs. The important thing is making a conscious decision rather than discovering you've run out of disk.
When You Can't Phone Home
Some scenarios that are trivial in connected environments require forethought when isolated:
"What does this error mean?" - Build a local knowledge base. Document errors you encounter, their causes, and resolutions. Next time, the answer is internal search, not Google.
"Is this metric value normal?" - Baseline everything. Record what normal looks like during stable operation. Without baselines, you can't tell if current values are concerning.
"Has anyone else seen this?" - Probably, but you can't easily find out. Invest in thorough incident documentation so your own team's experience becomes searchable knowledge.
"Is there a newer version that fixes this?" - Maybe, but you can't easily check. Maintain a schedule for reviewing upstream releases during maintenance windows when you have connectivity.
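Baselining, mentioned above, can be as simple as recording a mean and spread during stable operation and flagging deviations later. A sketch using a z-score (the 3-sigma cutoff is a common but arbitrary choice):

```python
import statistics

def baseline(samples: list[float]) -> tuple[float, float]:
    """Mean and standard deviation recorded during stable operation."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_anomalous(value: float, mean: float, stdev: float,
                 sigma: float = 3.0) -> bool:
    """Flag values more than `sigma` standard deviations from baseline."""
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > sigma

# Illustrative error-rate samples captured during a quiet week:
mean, stdev = baseline([0.8, 1.0, 1.1, 0.9, 1.2, 1.0])
is_anomalous(4.0, mean, stdev)  # far above baseline -> True
```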
Testing Observability
Observability infrastructure needs testing just like application code. We validate that:
- Metrics are being collected (not just that Prometheus is running)
- Logs are flowing to Loki (not just that the agents are deployed)
- Alerts fire when conditions are met (not just that AlertManager is configured)
- Dashboards load and show meaningful data (not just that Grafana is accessible)
This happens during deployment testing, not after production issues reveal gaps.
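Checking that metrics are actually collected means inspecting what Prometheus reports about its scrape targets, not just that its pod is running. A sketch that evaluates the shape of a `GET /api/v1/targets` response (the sample payload is fabricated):

```python
def down_targets(targets_payload: dict) -> list[str]:
    """Return scrape URLs of targets Prometheus reports as not up.

    `targets_payload` mirrors the JSON shape of Prometheus's
    GET /api/v1/targets response.
    """
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return [t["scrapeUrl"] for t in active if t.get("health") != "up"]

# Fabricated example payload:
payload = {"data": {"activeTargets": [
    {"scrapeUrl": "http://10.0.0.5:9100/metrics", "health": "up"},
    {"scrapeUrl": "http://10.0.0.7:9100/metrics", "health": "down"},
]}}
down_targets(payload)  # -> ["http://10.0.0.7:9100/metrics"]
```

An empty result from this check is a far stronger signal than a green pod status.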
Failure injection - deliberately breaking things - validates that the observability stack actually helps during incidents. Kill a node. Fill a disk. Saturate CPU. Does the dashboard show it? Does the alert fire? Can you find the relevant logs? If not, fix it before a real incident.
Lessons Learned
Deploy observability first - It should be among the first things running, not an afterthought. You want visibility into the rest of the deployment.
Pre-build dashboards - Creating dashboards during an incident is slow and error-prone. Have them ready.
Monitor the monitors - Prometheus disk filling up, Loki falling behind on ingestion, Grafana running out of memory — these need alerts too.
Practise without external resources - Can your team debug issues using only internal tools? If not, practise. The air-gapped environment isn't the place to learn.
Export matters - Sometimes you need to get data out for analysis with external tools or for reporting. Have a process for exporting metrics and logs when needed.
What's Next
The next post in this series will cover security hardening - CIS benchmarks, RBAC patterns, network policies, and the trade-offs between security and operability.
Lattice is a project developed by Digital Native Group.