Building Kubernetes Infrastructure for Secure Environments

When you're deploying Kubernetes in environments where reliability isn't optional and network connectivity can't be guaranteed, the standard approach of "just use a managed service" doesn't apply. We built Lattice to solve this problem - an extensible, air-gap-capable Kubernetes infrastructure stack that provisions production-ready clusters with enterprise-grade components.

This post covers the thinking behind Lattice and why we made the architectural choices we did.

The Problem

Government, defence, and regulated commercial environments often have requirements that don't fit neatly into the cloud-native playbook:

  • Air-gapped networks - No internet access means no pulling images from Docker Hub or relying on external package repositories during deployment.
  • Repeatable deployments - The same configuration needs to produce identical results across development, staging, and production environments, often months apart.
  • Security hardening - Default configurations are rarely sufficient when compliance frameworks mandate specific controls.
  • Operational simplicity - Teams maintaining these systems may not be Kubernetes experts, so the platform needs to be understandable and debuggable.

We needed a platform that could handle all of this while remaining flexible enough to adapt to different project requirements.

Why K3s?

We evaluated several Kubernetes distributions before settling on K3s as our foundation:

Lightweight footprint - K3s runs as a single binary with significantly lower resource overhead than traditional distributions. This matters when you're deploying to constrained environments or need to spin up clusters quickly for testing.

Certified Kubernetes - Despite its size, K3s is fully conformant. Applications that run on any Kubernetes distribution will run on K3s.

Air-gap friendly - K3s can be packaged as a single tarball with embedded images, making offline deployment straightforward.
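In Ansible terms, the offline install amounts to staging two artifacts on each node before running the installer. A minimal sketch (task names and source paths are illustrative; the destination paths follow the K3s air-gap documentation, where containerd imports any image tarballs it finds on startup):

```yaml
# Stage the K3s binary and its embedded-images tarball for an
# offline install; no registry or internet access is required.
- name: Copy K3s binary
  ansible.builtin.copy:
    src: artifacts/k3s
    dest: /usr/local/bin/k3s
    mode: "0755"

- name: Copy air-gap images tarball
  ansible.builtin.copy:
    src: artifacts/k3s-airgap-images-amd64.tar
    dest: /var/lib/rancher/k3s/agent/images/
    mode: "0644"
```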

Active community - Backed by SUSE/Rancher with strong community adoption, meaning issues get addressed and the project continues to evolve.

The trade-off is that K3s makes some opinionated choices (such as using SQLite by default for single-node deployments), but these can be overridden when needed. We do override them for production deployments, where etcd provides the distributed consensus guarantees we need.
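That override is a one-line change in K3s's standard config file. A sketch for the first server node (the token value is a placeholder):

```yaml
# /etc/rancher/k3s/config.yaml on the first server node.
# cluster-init switches K3s from its default SQLite datastore to
# embedded etcd, giving distributed consensus across server nodes.
cluster-init: true
token: "<shared-cluster-token>"
```

Additional servers then join the etcd cluster by pointing their `server:` setting at the first node.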

The Component Stack

Lattice assembles a curated set of components that work well together:

Distributed Storage - Longhorn provides replicated block storage that survives node failures. It's Kubernetes-native, doesn't require specialised hardware, and handles the complexity of distributed storage without requiring a storage engineering background to operate.

Service Mesh - Istio handles traffic management, security (mTLS between services), and observability. There's a learning curve, but the benefits of having consistent security policies and traffic control across services justify the investment.

Observability - Prometheus for metrics, Grafana for dashboards, AlertManager for alerting, and Loki for logs. This combination is battle-tested and provides the visibility needed to operate clusters confidently.

Each component is optional and can be enabled or disabled based on project requirements. Not every deployment needs a service mesh; not every environment requires distributed storage.
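In practice this is a handful of flags in group variables. A sketch of the idea (the variable names here are illustrative, not Lattice's actual interface):

```yaml
# group_vars/all.yml - enable only the components this project needs.
longhorn_enabled: true     # replicated block storage
istio_enabled: false       # no service mesh for this deployment
monitoring_enabled: true   # Prometheus, Grafana, AlertManager, Loki
```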

Ansible as the Deployment Engine

We chose Ansible over alternatives like Terraform or Pulumi for several reasons:

Agentless - Ansible connects over SSH and doesn't require installing agents on target machines. In secure environments where software installation is controlled, this is a significant advantage.

Idempotent - Running the same playbook multiple times produces the same result. This is crucial for both initial deployment and ongoing maintenance.

Readable - Ansible playbooks are YAML files that can be understood without deep tooling knowledge. When something goes wrong at 2 AM, clarity matters.

Extensible - Custom modules and roles can be written in Python, making it straightforward to handle specialised requirements.

The structure follows standard Ansible conventions: inventory files define the target environment, group variables configure component options, and roles handle the actual work of installing and configuring each component.
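As a concrete example, a YAML inventory for a small cluster might look like this (hostnames and group names are illustrative):

```yaml
# inventory/production.yml - which machines play which role.
all:
  children:
    k3s_servers:
      hosts:
        server-01:
        server-02:
        server-03:
    k3s_agents:
      hosts:
        agent-01:
        agent-02:
```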

Testing as a First-Class Concern

One lesson learned from operating infrastructure in sensitive environments: if you can't test it, you can't trust it.

Lattice includes extensive built-in tests that validate deployments:

  • Connectivity tests verify that components can communicate as expected
  • Functional tests confirm that services (storage, networking, monitoring) work correctly
  • Security tests check that hardening measures are in place

These tests run automatically after deployment and can be re-run at any time to verify cluster health. When you need to demonstrate to an auditor that your platform meets its security requirements, having automated tests that prove it saves considerable time and uncertainty.
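A connectivity check can be as simple as asserting on `kubectl` output. A hedged sketch (Lattice's real tests are more thorough; the role path is illustrative):

```yaml
# roles/tests/tasks/nodes.yml - verify every node reports Ready.
- name: Query node status
  ansible.builtin.command: kubectl get nodes --no-headers
  register: nodes
  changed_when: false

- name: Assert all nodes are Ready
  ansible.builtin.assert:
    that:
      - "'NotReady' not in nodes.stdout"
    fail_msg: "One or more nodes are not Ready"
```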

Security Hardening

Default configurations are optimised for ease of use, not security. Lattice applies hardening measures including:

  • Appropriate RBAC policies that follow least-privilege principles
  • Network policies that restrict traffic to what's actually needed
  • Security contexts that limit container capabilities
  • Audit logging for compliance and forensics

The specific controls depend on the target environment's requirements, but the framework is designed to make applying these controls straightforward rather than an afterthought.
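A common starting point for the network-policy control is a per-namespace default-deny rule, with traffic explicitly allowed from there. This is a standard Kubernetes manifest, not Lattice-specific (the namespace name is a placeholder):

```yaml
# Deny all ingress and egress in a namespace by default;
# workloads then get explicit allow rules for what they need.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: app
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```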

Lessons Learned

Building Lattice taught us several things:

Modularity pays off - Making components optional from the start meant we could adapt the platform to different requirements without maintaining multiple forks.

Documentation is infrastructure - In secure environments, you often can't search Stack Overflow during an incident. Good documentation is as important as the code itself.

Test everything - Automated tests caught issues that would have been embarrassing (or worse) in production. The investment in testing infrastructure was worth it many times over.

Understand the ecosystem - Kubernetes moves fast. Keeping up with deprecations, new features, and security patches is ongoing work, not a one-time effort.

What's Next

In future posts, we'll dive deeper into specific aspects of building and operating Kubernetes platforms for secure environments:

  • Air-gap deployment strategies and image management
  • Implementing effective observability without external dependencies
  • Security hardening patterns and compliance considerations
  • Operational playbooks for common scenarios

If you're building similar infrastructure, we'd be interested to hear about your approach.

---

Lattice is a project developed by Digital Native Group.