Application testing is well understood. Unit tests, integration tests, end-to-end tests: there are established patterns, frameworks, and practices. Infrastructure testing is less mature. Too often, the validation of a deployment is "it didn't error" or "kubectl get nodes shows Ready."
That's not good enough. A cluster can be running without being correct. Services can be deployed without being functional. Configurations can be applied without being effective.
This post covers how we approach infrastructure testing in Lattice - what we test, why we test it, and how automated validation builds confidence that deployments actually work.
Why Infrastructure Testing Matters
Infrastructure failures are different from application failures. When an application bug slips through, you might get incorrect behaviour or degraded functionality. When an infrastructure problem slips through, you might get an outage that affects everything running on that infrastructure.
The blast radius is larger, the debugging is harder and the pressure is higher.
And yet, infrastructure often gets less testing rigour than applications. "It's just configuration" or "we'll see if it works in staging" aren't testing strategies.
In air-gapped and secure environments, the stakes are higher still. You can't easily Google error messages. You can't quickly pull in a fix. What you deploy needs to work, because the feedback loop for fixing it is slow.
What We Test
Lattice includes test suites that validate deployments across several dimensions:
Connectivity Tests
Can things that should communicate actually communicate?
- Node-to-node - Can nodes reach each other on required ports?
- Pod-to-pod - Can pods in the same namespace communicate? Across namespaces where permitted?
- Pod-to-service - Does service discovery work? Do ClusterIP services route correctly?
- Ingress paths - Does traffic flow from ingress to services to pods?
- Egress where permitted - Can pods reach external services they're supposed to reach?
These sound basic, but network policies, CNI misconfigurations, and firewall rules can break any of them. Finding out during a test is better than finding out when an application fails.
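At its core, a connectivity check is just a timed TCP dial. A minimal sketch in Python - the host/port pairs are illustrative, and in a real suite they would come from the cluster's expected topology:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_paths(must_reach, must_not_reach):
    """Verify both directions of intent: paths that must work,
    and paths a network policy or firewall is supposed to block."""
    failures = []
    for host, port in must_reach:
        if not can_connect(host, port):
            failures.append(f"expected {host}:{port} reachable, but it is not")
    for host, port in must_not_reach:
        if can_connect(host, port):
            failures.append(f"expected {host}:{port} blocked, but it is reachable")
    return failures
```

Checking the paths that should be blocked is as important as checking the ones that should work - it catches network policies that were never applied.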
DNS Resolution
DNS issues cause maddening problems: intermittent failures, timeouts, applications that work only sometimes. We test:
- Internal resolution - Can pods resolve service names? Other pods?
- External resolution - Where permitted, can pods resolve external domains?
- Resolution latency - Is DNS fast enough? Slow DNS causes application timeouts.
- CoreDNS health - Is the DNS service itself healthy and not overloaded?
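A resolution check that also enforces a latency budget can be sketched with nothing but the standard library. The 500ms default below is illustrative, not a recommendation - the right budget depends on the applications' own timeouts:

```python
import socket
import time

def resolve_with_latency(name: str, budget_ms: float = 500.0):
    """Resolve a name and fail if resolution is missing or too slow.

    Returns (addresses, elapsed_ms); raises if over the latency budget,
    since slow DNS surfaces later as application timeouts.
    """
    start = time.monotonic()
    infos = socket.getaddrinfo(name, None)
    elapsed_ms = (time.monotonic() - start) * 1000
    addrs = sorted({info[4][0] for info in infos})
    if elapsed_ms > budget_ms:
        raise RuntimeError(
            f"DNS for {name} took {elapsed_ms:.0f}ms (budget {budget_ms:.0f}ms)"
        )
    return addrs, elapsed_ms
```

Run inside a pod, the same function covers internal names (service discovery via the cluster DNS) and, where egress permits, external domains.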
Storage Functionality
Storage that's provisioned isn't necessarily storage that works:
- PVC provisioning - Can we dynamically create persistent volume claims?
- Read/write operations - Can pods actually write to and read from volumes?
- Replication - For distributed storage like Longhorn, is data replicated to the expected number of nodes?
- Failure recovery - If a storage node goes away, does the volume remain accessible?
- Performance baselines - Is I/O latency within acceptable bounds?
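The read/write check reduces to a round trip through the mounted path. In a real suite this runs inside a pod against a freshly bound PVC; the sketch below shows just the core logic:

```python
import os
import uuid

def check_volume_rw(mount_path: str) -> bool:
    """Write a unique marker file, read it back, and clean up.

    Passing means the path is mounted, writable, and readable -
    provisioned storage that fails this check is not actually usable.
    """
    marker = os.path.join(mount_path, f".rw-check-{uuid.uuid4().hex}")
    payload = uuid.uuid4().hex.encode()
    try:
        with open(marker, "wb") as f:
            f.write(payload)
        with open(marker, "rb") as f:
            return f.read() == payload
    finally:
        try:
            os.remove(marker)
        except OSError:
            pass
```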
Certificate Validity
In secure environments, certificate problems are common and painful:
- Expiration checks - Are all certificates valid with sufficient remaining lifetime?
- Chain validation - Are certificate chains complete and valid?
- Trust verification - Do services trust each other's certificates?
The x509 exporter we mentioned in the observability post feeds into this — tests can verify that certificate monitoring is working and that no certificates are approaching expiry.
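The expiration check itself is just arithmetic on a certificate's notAfter field. A sketch using the standard library's ssl helper, with a hypothetical 30-day minimum lifetime:

```python
import ssl
import time

def days_until_expiry(not_after: str) -> float:
    """Days remaining on a certificate, given its notAfter field in the
    format ssl.getpeercert() returns, e.g. 'Jun  4 21:46:38 2030 GMT'."""
    return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

def check_expiry(not_after: str, min_days: float = 30.0) -> None:
    """Fail if the certificate has less than min_days of lifetime left."""
    remaining = days_until_expiry(not_after)
    if remaining < min_days:
        raise RuntimeError(
            f"certificate expires in {remaining:.1f} days (minimum {min_days})"
        )
```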
Control Plane Health
The cluster's brain needs to be healthy:
- API server responsiveness - Is the API server responding quickly to requests?
- etcd health - Is the datastore healthy? Are all members synchronised?
- Scheduler function - Are pods being scheduled promptly?
- Controller manager - Are controllers reconciling state correctly?
Component Health
Each component in the stack needs validation:
- Prometheus - Is it scraping targets? Is data being stored?
- Grafana - Are dashboards loading? Can it query Prometheus?
- Loki - Are logs being ingested? Can we query recent logs?
- AlertManager - Is it receiving alerts? Are routes configured correctly?
- Istio - Is the control plane healthy? Are sidecars injecting?
- Longhorn - Are volumes healthy? Is replication working?
A component can be "running" (pod is up, container isn't crashing) without being "working" (actually doing its job). Tests validate function, not just existence.
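For example, a "working" check for Prometheus validates the body of an instant-query response rather than the pod's phase. The response shape below is Prometheus's documented HTTP API format; querying a metric such as `up` proves scraping and storage are functioning:

```python
def prometheus_is_working(query_response: dict) -> bool:
    """Validate the JSON body of a Prometheus instant query.

    "Running" would be pod phase; "working" means the API reports
    success AND returns at least one sample.
    """
    if query_response.get("status") != "success":
        return False
    result = query_response.get("data", {}).get("result", [])
    return len(result) > 0
```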
Security Posture
Hardening configurations need verification:
- Network policies active - Are deny-by-default policies in place?
- Pod security standards enforced - Does the cluster reject non-compliant pods?
- RBAC restrictions - Do service accounts have only their intended permissions?
- Audit logging enabled - Are API operations being logged?
These tests confirm that security controls are actually applied, not just intended.
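The network-policy check, for instance, can inspect policies as Kubernetes returns them (e.g. from `kubectl get networkpolicy -o json`) and look for the canonical deny-all-ingress shape: an empty podSelector, an Ingress policyType, and no ingress rules. A sketch:

```python
def has_default_deny(policies) -> bool:
    """Check a list of NetworkPolicy objects for a deny-all-ingress policy:
    selects every pod (empty podSelector), declares the Ingress policyType,
    and allows no ingress traffic (no ingress rules)."""
    for pol in policies:
        spec = pol.get("spec", {})
        selects_all = spec.get("podSelector", {}) in ({}, {"matchLabels": {}})
        denies_ingress = (
            "Ingress" in spec.get("policyTypes", [])
            and not spec.get("ingress")
        )
        if selects_all and denies_ingress:
            return True
    return False
```

The same pattern - assert on the applied objects, not the manifests you intended to apply - works for RBAC and pod security checks.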
When Tests Run
Tests aren't just for initial deployment. Lattice's test suites run at multiple points:
Post-Deployment Validation
Immediately after deployment, tests verify the cluster is correctly configured. This catches deployment errors before anyone tries to use the cluster.
A deployment that completes successfully but fails tests hasn't really succeeded. The automation treats test failure as deployment failure.
Upgrade Verification
After upgrades - whether K3s version bumps, component updates, or configuration changes - tests verify nothing broke. Upgrades that pass in isolation sometimes cause unexpected interactions. Tests catch these.
Ongoing Health Checks
Tests can run periodically to detect drift or degradation. A cluster that was healthy at deployment can become unhealthy over time: storage filling up, certificates expiring, components crashing and restarting.
Scheduled test runs catch these before they become incidents.
Pre-Change Validation
Before making changes, tests establish a baseline. After the change, the same tests verify the cluster still functions. Comparing before and after shows exactly what changed and whether anything broke.
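When test results are recorded as pass/fail by name, the before/after comparison is a small diff. A minimal sketch:

```python
def regressions(baseline: dict, after: dict) -> list:
    """Return the names of tests that passed in the baseline run
    but fail (or are missing) in the post-change run."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not after.get(name, False)
    )
```

Tests that were already failing before the change are deliberately excluded - the question pre-change validation answers is "what did this change break?", not "is the cluster perfect?".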
Tests as Documentation
Test suites document expectations. Reading the tests tells you:
- What connectivity paths should exist
- What components are expected to be healthy
- What security controls should be in place
- What performance baselines are acceptable
This is living documentation and it's verified every time tests run, so it can't drift from reality the way written documentation can.
When someone asks "how do I know if the cluster is healthy?" the answer is "run the tests." When someone asks "what does a healthy cluster look like?" the answer is "read the tests."
Test Design Principles
Not all tests are equally valuable. We've learned some principles for infrastructure testing:
Test Behaviour, Not Implementation
A test that checks "is the Prometheus pod running?" is less valuable than one that checks "can we query metrics from Prometheus?" The first can pass while the second fails. The second tells you what actually matters.
Implementation tests are fragile: they break when you change how something works, even if what it does stays the same. Behaviour tests are stable: they verify outcomes that matter.
Test the Boundaries
Integration points are where things break. Tests that cross boundaries - pod to service, service to ingress, component to component - catch more real problems than tests that stay within a single component.
Make Tests Fast
Tests that take too long don't get run. Infrastructure tests can be slow (spinning up pods, waiting for provisioning), so we optimise where possible and parallelise where we can't.
A test suite that takes an hour won't be run before every change. A test suite that takes five minutes will.
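Most infrastructure probes spend their time waiting on the network, so they parallelise well. A sketch of fanning independent checks out over a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def run_checks(checks: dict) -> dict:
    """Run independent, I/O-bound check functions concurrently.

    Wall-clock time drops to roughly that of the slowest check,
    instead of the sum of every check run sequentially.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Checks with ordering dependencies (e.g. "PVC provisions" before "volume is writable") still run sequentially; everything else fans out.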
Make Tests Reliable
Flaky tests - ones that sometimes pass and sometimes fail for the same state - erode trust. When tests cry wolf, people stop listening. We invest in making tests deterministic, with appropriate timeouts and retries for operations that are legitimately eventually-consistent.
Make Tests Understandable
When a test fails, the failure message should explain what went wrong and where to look. "Test failed" is useless. "DNS resolution for service X in namespace Y timed out after 30s - check CoreDNS logs" is actionable.
Failure Injection
Testing that things work when they should is necessary but not sufficient. We also need to know that failures are detected and handled correctly.
Failure injection - deliberately breaking things - validates that:
- Monitoring detects failures - Does the dashboard show the problem? Does the alert fire?
- Failures are contained - Does a node failure take down just that node, or cascade?
- Recovery works - When the failure is resolved, does the system recover?
This is uncomfortable. Deliberately breaking a system you've carefully built feels wrong, though with fully automated deployment it shouldn't matter: the cluster can be rebuilt. But discovering your recovery procedures don't work during a real incident is worse.
Common failure injection scenarios:
- Kill a node - Does storage failover? Do pods reschedule? Do alerts fire?
- Fill a disk - Is the alert triggered before things break? Can the system recover?
- Block network traffic - Do network policies behave as expected? Is the failure visible?
- Expire a certificate - Does monitoring catch it? What breaks, and is it contained?
Lessons Learned
Test early and continuously - Discovering a problem right after deployment is better than discovering it in production. Discovering it automatically is better than discovering it manually.
Treat test failures as deployment failures - A deployment that passes but produces a broken cluster hasn't really passed. Build this into automation.
Invest in test quality - Flaky tests, slow tests, and unclear failures all reduce the value of testing. Treat test code with the same care as production code.
Test what matters - Not everything needs a test. Focus on things that break, things that are hard to debug, and things that have high impact when they fail.
Document through tests - Tests are executable documentation. They describe expected behaviour and verify it's still true.
What's Next
The next post in this series will cover storage in Kubernetes - distributed storage with Longhorn, backup strategies, disaster recovery, and why "just use cloud storage" isn't always an option.
Lattice is a project developed by Digital Native Group.