Kubernetes out of the box is optimised for ease of use, not security. The defaults let you get started quickly, but they're not what you want running in production, especially in environments where security isn't just a checkbox but a genuine requirement.
This post covers how we approach hardening Kubernetes deployments, the trade-offs involved, and the lessons we've learned about making security practical rather than theatrical.
The Gap Between Default and Secure
A fresh Kubernetes installation has some characteristics that should concern anyone deploying sensitive workloads:
- Pods can communicate with any other pod by default
- Containers often run as root
- Service accounts get mounted into pods whether needed or not
- API access is permissive until you lock it down
- Audit logging may be minimal or disabled
None of this is a flaw; it's a design choice that prioritises getting started over defence in depth. But it means hardening is your job, not something that happens automatically.
CIS Benchmarks: A Starting Point
The Center for Internet Security publishes benchmarks for Kubernetes that provide a structured approach to hardening. They're useful as a baseline and often required for compliance, but they need context.
What the benchmarks cover:
- Control plane configuration (API server, controller manager, scheduler, etcd)
- Node configuration (kubelet, container runtime)
- Policies (RBAC, network policies, pod security)
- Secrets management
- General practices
What they don't tell you:
- Which controls matter most for your threat model
- How to implement controls without breaking your workloads
- The operational cost of each recommendation
We use CIS benchmarks as a checklist, not a specification. Every control gets evaluated against the actual environment: what risk does it address? What does it cost to implement? Does it make sense here?
Some controls are non-negotiable. Others are sensible defaults. A few are genuinely difficult to implement without significant operational overhead. Knowing which is which requires understanding both the security landscape and the operational reality.
RBAC: Getting It Right
Role-Based Access Control is Kubernetes' primary authorisation mechanism, and it's where many hardening efforts start. It's also where many go wrong.
The Principle of Least Privilege
Everyone agrees with least privilege in theory. In practice, it's tempting to grant broad permissions to "make things work" and never revisit them.
Effective RBAC requires:
Defined roles based on actual need - What does this service account actually need to do? Not "what might it need" or "what's convenient," but what's genuinely required for it to function.
Separation between namespaces - A workload in one namespace shouldn't be able to read secrets from another. This seems obvious, but default service accounts and overly broad ClusterRoles can undermine it.
Regular review - Permissions granted for a specific task tend to persist after the task is complete. Periodic audit of who can do what catches drift.
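Least-privilege RBAC in practice means namespace-scoped Roles bound to dedicated service accounts. A minimal sketch, assuming illustrative names (`app-reader`, `my-app`, `team-a`):

```yaml
# Namespace-scoped Role granting only the verbs this workload needs.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader
  namespace: team-a
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]
---
# Bind the Role to a dedicated service account, not the namespace default.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reader-binding
  namespace: team-a
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: team-a
roleRef:
  kind: Role
  name: app-reader
  apiGroup: rbac.authorization.k8s.io
```

Note what's absent: no wildcards, no cluster scope, no write verbs the workload doesn't need.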
Common Mistakes
Overusing ClusterRoles - ClusterRoles grant permissions cluster-wide. Most workloads don't need cluster-wide anything. Prefer namespace-scoped Roles unless there's a genuine requirement.
Wildcard permissions - `verbs: ["*"]` or `resources: ["*"]` grants everything. It's convenient and dangerous. Be explicit about what's permitted.
Ignoring service accounts - Every pod gets a service account, and that account might have permissions the pod doesn't need. Either use dedicated service accounts with minimal permissions or disable token mounting entirely for pods that don't need API access.
Forgetting about users - RBAC for service accounts gets attention; RBAC for human access sometimes doesn't. Who can kubectl into the cluster? What can they do? This needs the same rigour as service account permissions.
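Disabling token mounting for pods that don't need API access is a one-line change. A sketch, with illustrative names:

```yaml
# A pod that never talks to the Kubernetes API: don't mount a token at all.
apiVersion: v1
kind: Pod
metadata:
  name: no-api-access
spec:
  serviceAccountName: minimal-sa
  automountServiceAccountToken: false  # no token mounted into the pod
  containers:
  - name: app
    image: example.com/app:1.0
```

The same `automountServiceAccountToken: false` field can be set on the ServiceAccount itself to make it the default for every pod using that account.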
Network Policies: Default Deny
By default, any pod can talk to any other pod. Network policies change this, but only if you implement them.
The Default Deny Foundation
Start with a default deny policy in each namespace:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
```
This breaks everything, and that's the point: now you explicitly allow only the traffic that should flow.
Building Up From Deny
With default deny in place, add policies that permit specific communication:
- Frontend pods can receive ingress traffic from the ingress controller
- Frontend pods can connect to backend pods on specific ports
- Backend pods can connect to database pods
- Database pods can connect to other database pods for replication
- All pods can reach DNS (or nothing works)
Each policy documents an intended communication path. Unexpected traffic is blocked by default, not permitted by default.
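Two of these paths sketched as policies, assuming illustrative labels (`app: frontend`, `app: backend`) and the common `k8s-app: kube-dns` label on the cluster DNS pods:

```yaml
# Allow all pods in the namespace to reach cluster DNS - without this,
# service discovery fails and "nothing works".
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
---
# Allow only frontend pods to reach backend pods, on one specific port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
```

Each policy is small and readable, which is what makes the set auditable: the YAML is the documentation of intended communication paths.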
The Operational Cost
Network policies add friction. New services need policies before they can communicate. Debugging connectivity issues requires understanding the policy set. Teams need to think about network architecture, not just deploy and hope.
This friction is deliberate: it forces explicit decisions about communication paths. But it needs tooling and process support to be sustainable:
- Policy templates for common patterns
- Visibility into what's being blocked (network policy logging)
- Clear process for requesting new policies
- Testing that validates policies work as intended
Pod Security Standards
Kubernetes deprecated Pod Security Policies and replaced them with Pod Security Standards, enforced through Pod Security Admission.
The Three Profiles
Privileged - No restrictions. Useful for system workloads that genuinely need elevated access, but not for general applications.
Baseline - Prevents known privilege escalations while remaining broadly compatible with common workloads. Blocks things like hostPath mounts, host networking, and privileged containers.
Restricted - Heavily locked down. Requires pods to run as non-root, drop all capabilities, use read-only root filesystems. Many applications need modification to run here.
Practical Application
We typically apply:
- Restricted to application namespaces where workloads can be designed to comply
- Baseline to namespaces with third-party software that may not meet restricted requirements
- Privileged only to system namespaces (kube-system, monitoring) where components genuinely need elevated access
The enforcement mode matters too:
- Enforce blocks non-compliant pods from running
- Warn allows pods but generates warnings
- Audit logs violations without blocking or warning
Starting with audit mode reveals what would break before you break it. Graduating through warn to enforce gives teams time to fix issues.
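Both the profile and the mode are applied as labels on the namespace. A sketch showing the graduated approach (namespace name illustrative): enforce baseline now, while warning and auditing against restricted to surface what would break:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    # Block anything that violates baseline.
    pod-security.kubernetes.io/enforce: baseline
    # Warn and log against restricted, without blocking yet.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

When the warnings stop appearing, the enforce label can be raised to `restricted`.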
Beyond Admission
Pod Security Admission controls what can be created, but runtime security goes further. Tools like Falco monitor runtime behaviour, detecting unexpected process execution, file access, or network connections that might indicate compromise.
This is defence in depth: admission control prevents obviously dangerous configurations; runtime monitoring catches behaviour that's permitted but suspicious.
Audit Logging
If something goes wrong, audit logs tell you what happened. Without them, you're guessing.
What to Capture
Kubernetes API audit logging can capture:
- Who made the request (user, service account)
- What they did (verb, resource, namespace)
- When it happened
- Whether it succeeded
- Request and response bodies (optionally, with size implications)
The audit policy controls what gets logged. Logging everything generates enormous volume; logging nothing leaves you blind. The balance depends on compliance requirements and storage constraints.
At minimum, capture:
- Authentication decisions (who's trying to access the cluster)
- Secrets access (who's reading sensitive data)
- RBAC changes (who's modifying permissions)
- Workload changes in sensitive namespaces (who's deploying what)
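Those minimums can be expressed as an audit policy along these lines. This is a sketch: the namespace name is an assumption, and a real policy needs tuning for volume. Rules are evaluated in order, so the catch-all goes last:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Secrets access: log who touched what, but not the secret contents.
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]
# RBAC changes: capture full request and response bodies.
- level: RequestResponse
  resources:
  - group: "rbac.authorization.k8s.io"
    resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
# Workload changes in a sensitive namespace (name illustrative).
- level: Metadata
  verbs: ["create", "update", "patch", "delete"]
  namespaces: ["production"]
# Everything else: drop, to keep volume manageable.
- level: None
```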
Where Logs Go
Audit logs need to be stored somewhere they can't be tampered with by someone who compromises the cluster. If an attacker can delete the logs that would reveal their presence, the logs don't help.
Options include:
- Shipping to external log aggregation immediately
- Writing to append-only storage
- Forwarding to a separate security monitoring system
In air-gapped environments, "external" still means something inside the boundary, but separate from the cluster being monitored.
Secrets Management
Kubernetes Secrets are base64 encoded, not encrypted. Anyone with read access to secrets in a namespace can decode them trivially.
Encryption at Rest
Enabling encryption at rest for secrets stored in etcd is a baseline requirement. Without it, anyone with access to etcd (or etcd backups) can read all secrets in plain text.
This is configuration, not code, but it needs to be verified, not assumed. Check that encryption is actually enabled and that the encryption keys are themselves protected.
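Encryption at rest is enabled by pointing the API server's `--encryption-provider-config` flag at a file like this sketch (key name and placeholder illustrative; stronger providers such as a KMS plugin exist where available):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  # First provider is used for writes; later ones only for reads.
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>
  # identity allows reading secrets written before encryption was enabled.
  - identity: {}
```

Note the ordering: new writes use the first provider, so existing secrets must be rewritten (e.g. by re-applying them) before they're actually encrypted in etcd.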
Limiting Access
RBAC controls who can read secrets, but the defaults may be more permissive than you'd like. Review:
- Which service accounts can read secrets in which namespaces
- Whether pods that don't need secrets have them mounted anyway
- Who has kubectl access that includes secret reading
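One way to review this is to ask the API server directly. A sketch using `kubectl auth can-i`, with illustrative account and namespace names:

```shell
# Can this service account read secrets in its namespace?
kubectl auth can-i get secrets \
  --as=system:serviceaccount:team-a:my-app -n team-a

# List everything the subject is permitted to do in the namespace.
kubectl auth can-i --list \
  --as=system:serviceaccount:team-a:my-app -n team-a
```

Running this across namespaces and subjects turns "review who can read secrets" from a policy document exercise into a repeatable check.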
External Secrets Management
For higher security requirements, external secrets management (HashiCorp Vault, for example) keeps secrets outside the cluster entirely. Pods retrieve secrets at runtime from the external store, with fine-grained access control and audit logging.
This adds complexity and infrastructure, but it provides capabilities that native Kubernetes secrets don't — dynamic secrets, automatic rotation, detailed access logging, and secrets that never persist on disk.
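As one example of the pattern, the Vault Agent injector delivers secrets into a pod via annotations; the secret lands in an in-memory volume rather than a Kubernetes Secret. A sketch assuming the `vault.hashicorp.com` injector is installed; the role and secret path are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
  annotations:
    # Ask the injector to add a Vault Agent sidecar to this pod.
    vault.hashicorp.com/agent-inject: "true"
    # Vault role the pod authenticates as (illustrative name).
    vault.hashicorp.com/role: "my-app"
    # Render this Vault secret to /vault/secrets/db-creds in the pod.
    vault.hashicorp.com/agent-inject-secret-db-creds: "secret/data/my-app/db"
spec:
  serviceAccountName: my-app
  containers:
  - name: app
    image: example.com/app:1.0
```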
The Trade-Off: Security vs. Operability
Every hardening measure has an operational cost. Locked-down systems are harder to debug. Restricted permissions slow down development. Network policies break things until you get them right.
The goal isn't maximum security - it's appropriate security. What threats are realistic? What's the impact of compromise? What's the cost of controls?
Some guidance:
Start with high-impact, low-friction controls - Encryption at rest, basic RBAC, default deny network policies. These provide significant security improvement without crippling operations.
Add friction deliberately - Restricted pod security standards, comprehensive network policies, minimal RBAC. These require more effort to implement and operate, but they're justified for sensitive environments.
Know what you're accepting - Some risks may be accepted rather than mitigated. That's fine if it's a conscious decision. It's not fine if it's an oversight.
Automate compliance checking - Manual verification doesn't scale. Tools that continuously check for drift from security baselines catch issues before they become incidents.
Lessons Learned
Security is a process, not a state - A hardened cluster today can drift toward insecurity tomorrow. Continuous monitoring and regular review matter as much as initial configuration.
Break things in testing - Apply hardening in test environments first. Discover what breaks before production.
Document exceptions - When something can't meet a security requirement, document why. Undocumented exceptions become invisible technical debt.
Train the team - Security controls that people don't understand get worked around. Invest in making sure everyone knows why controls exist and how to work within them.
Assume compromise - Defence in depth means controls that help even after something goes wrong. Segmentation limits blast radius. Audit logs enable investigation. Runtime monitoring catches behaviour that bypassed admission control.
What's Next
The next post in this series steps back to look at the bigger picture: how Lattice compares to other Kubernetes distributions and platforms, and where it fits in the landscape.
Lattice is a project developed by Digital Native Group.