Service mesh is one of those technologies that generates strong opinions. Advocates describe it as essential infrastructure for any serious Kubernetes deployment. Sceptics call it unnecessary complexity. Both are right, depending on context.
Lattice includes Istio as an optional component. The word "optional" is doing real work in that sentence - service mesh adds genuine capability, but it also adds operational overhead. Understanding when it's worth it helps you make the right call.
This post covers what a service mesh actually does, when it justifies its cost, and how to introduce it without drowning in complexity.
What a Service Mesh Actually Does
Strip away the marketing and a service mesh does three things:
Traffic Management
Control how requests flow between services. Route traffic based on headers, percentages, or other criteria. Implement retries, timeouts, and circuit breakers consistently across all services without changing application code.
This matters when you have:
- Multiple versions of a service running simultaneously (canary deployments, A/B testing)
- Complex retry logic that shouldn't be duplicated in every service
- Traffic shifting requirements during deployments or incidents
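As a rough illustration of what "without changing application code" means in practice, retries, timeouts, and circuit breaking can be expressed as mesh configuration. This is a minimal sketch - the `orders` service name is hypothetical, and the resource versions assume Istio's v1 APIs:

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: orders
spec:
  hosts:
  - orders
  http:
  - route:
    - destination:
        host: orders
    timeout: 5s            # overall request deadline
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,reset,connect-failure
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: orders
spec:
  host: orders
  trafficPolicy:
    outlierDetection:       # simple circuit breaker: eject failing endpoints
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```

Every client of `orders` gets this behaviour automatically - no retry loops duplicated across codebases.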
Security
Encrypt all traffic between services with mutual TLS (mTLS). Enforce authentication and authorisation policies at the network level. Verify service identity automatically.
This matters when you need:
- Encryption in transit between services (compliance requirement in many environments)
- Service-to-service authentication that doesn't depend on application implementation
- Fine-grained access control between services
Observability
Capture detailed telemetry about every request - latency, success rates, traffic volume - without modifying or instrumenting application code. Trace requests across multiple services to understand dependencies and bottlenecks.
This matters when you need:
- Visibility into service-to-service communication patterns
- Latency breakdowns across complex request paths
- Understanding of dependencies between services
The Honest Cost
These capabilities come at a price:
Resource Overhead
Traditionally, Istio injects a sidecar proxy (Envoy) into every pod. Each sidecar consumes CPU and memory. In a cluster with hundreds of pods, this overhead is significant:
- Memory: roughly 50-100MB per sidecar, more under load
- CPU: depends on traffic volume, but not free
- Network: additional hop for every request (in and out of the sidecar)
Istio's newer Ambient Mesh mode changes this model significantly. Instead of per-pod sidecars, Ambient uses a shared node-level proxy (ztunnel) for Layer 4 traffic - handling mTLS and basic connectivity without injecting anything into individual pods. Layer 7 processing (routing rules, retries, observability) is handled by optional waypoint proxies deployed per namespace or per service, only where needed.
The practical impact:
- Lower baseline overhead - No sidecar per pod means dramatically reduced memory and CPU consumption across the cluster
- Simpler adoption - No pod restarts required to join the mesh, no sidecar injection configuration to manage
- Pay for what you use - L7 capabilities are only deployed where needed, not everywhere
- Reduced operational complexity - Fewer sidecars means fewer things to version-match during upgrades
For resource-constrained environments - which describes many air-gapped deployments - Ambient Mesh makes the overhead argument against service mesh considerably weaker. The mTLS and identity benefits come at a fraction of the previous cost.
That said, sidecar mode remains more mature and better documented. Ambient is production-ready but newer, and some advanced traffic management scenarios still favour sidecars. Lattice supports both modes, allowing teams to choose based on their requirements and comfort level.
Other Recent Developments
Istio has matured significantly in recent releases, with several developments relevant to how Lattice uses it:
Gateway API as the standard - Istio now fully supports the Kubernetes Gateway API (promoted to Stable in 1.22, with full v1.4 support in 1.28), replacing the older Istio-specific Ingress model. Gateway API provides a more standardised, portable way to configure traffic routing. For Lattice, this means less Istio-specific configuration and better alignment with the broader Kubernetes ecosystem.
v1 APIs across the board - Istio's core networking, security, and telemetry APIs have all graduated to v1 as of 1.22. For risk-averse environments - which describes most of our deployment targets - this means stable APIs with formal deprecation policies. A stable validation policy can even enforce that only v1 APIs and fields are used, preventing accidental use of experimental features.
Telemetry API - The stable Telemetry API replaces the older MeshConfig approach for configuring metrics, access logs, and tracing. It supports granular configuration at mesh, namespace, or workload level. This means observability can be tuned precisely; higher trace sampling for critical services, different log formats per namespace, custom metric dimensions - all through Kubernetes-native resources rather than global configuration.
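To make the Telemetry API's granularity concrete, here is a sketch of raising trace sampling for one namespace. The `payments` namespace is hypothetical; the resource shape follows the stable `telemetry.istio.io/v1` API:

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: payments-tracing
  namespace: payments      # applies only to workloads in this namespace
spec:
  tracing:
  - randomSamplingPercentage: 50.0   # higher sampling for a critical service
```

A similar resource in the root namespace would set the mesh-wide default, with namespace-level resources overriding it where needed.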
CNCF graduation - Istio became a CNCF graduated project in July 2023, confirming its maturity and long-term viability. For organisations making technology decisions that need to last, graduation provides confidence that the project has broad community support and governance.
Multi-cluster ambient - The 2025-2026 roadmap focuses on bringing multi-cluster traffic management to ambient mode, enabling cross-cluster failover and load balancing. While not immediately relevant for single-cluster air-gapped deployments, this capability opens future possibilities for organisations operating across multiple isolated environments.
AI workload support - Istio 1.28 introduced InferencePool v1 for managing AI inference workloads, reflecting the growing need for intelligent traffic management of GPU-backed services. As AI capabilities increasingly appear in defence and government contexts, having mesh-level traffic management for these workloads becomes relevant.
Operational Complexity
Istio is a complex system. The control plane (istiod) manages configuration distribution, certificate issuance, and telemetry aggregation. Learning to operate it takes time:
- Debugging connectivity issues now involves checking mesh proxy configuration, not just Kubernetes networking - whether that's sidecar logs or ztunnel/waypoint behaviour
- Certificate management adds another layer of infrastructure to maintain
- Configuration errors can cause subtle failures - services that mostly work but occasionally drop requests
Upgrade Complexity
Istio moves quickly. Major versions introduce breaking changes. Upgrading requires careful planning:
- In sidecar mode, sidecar versions across the cluster need to be compatible with the control plane. Ambient simplifies this but introduces its own upgrade sequencing for ztunnel and waypoint proxies.
- Custom configuration may need migration between versions
- Testing needs to cover both the mesh infrastructure and application behaviour through it
Learning Curve
Teams need to understand Istio's configuration model - VirtualServices, DestinationRules, PeerAuthentication, AuthorizationPolicies. It's a new abstraction layer with its own terminology and behaviours.
This isn't a weekend project. Effective Istio operation requires investment in training and practice.
When Service Mesh Is Worth It
You Need mTLS Everywhere
If compliance or security requirements mandate encryption between all services, a service mesh is the most practical way to achieve this. The alternative - implementing TLS in every service, managing certificates per service, handling rotation - is feasible but significantly more work and more fragile.
Istio makes mTLS transparent. In sidecar mode, services communicate in plain text to their local sidecar, which handles encryption to the destination. In Ambient mode, the node-level ztunnel handles it instead - same result, no per-pod injection required. Either way, applications don't need to know about TLS at all.
For environments operating under security classification requirements, this is often the deciding factor.
You Have Complex Traffic Requirements
If you regularly need canary deployments, traffic splitting, or sophisticated routing, the mesh's traffic management capabilities pay for themselves quickly. Doing this with standard Kubernetes Ingress and Services is limited; doing it with Istio is a matter of configuration.
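For example, a 90/10 canary split can be expressed declaratively. This sketch assumes a hypothetical `checkout` service with pods labelled `version: v1` and `version: v2`:

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: checkout
spec:
  hosts:
  - checkout
  http:
  - route:
    - destination:
        host: checkout
        subset: v1
      weight: 90           # 90% of traffic stays on the stable version
    - destination:
        host: checkout
        subset: v2
      weight: 10           # 10% goes to the canary
---
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
```

Shifting the canary forward is a one-line weight change, applied without redeploying either version.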
You Need Service-Level Observability
If understanding traffic patterns between services matters - for capacity planning, debugging, or compliance - the mesh's automatic telemetry is valuable. Every request is measured without any application changes.
You Have Enough Services
A service mesh for three services is overkill. The overhead exceeds the value. A service mesh for thirty services starts making sense - the consistency and automation it provides scales, while manual approaches don't.
When to Skip It
Resource-Constrained Environments
If the cluster is small and every resource matters, traditional sidecar overhead is hard to justify. The CPU and memory consumed by per-pod proxies could run actual workloads instead. Ambient Mesh reduces this concern significantly, but there's still overhead from the node-level ztunnel and any waypoint proxies - evaluate whether the benefits justify even this lower cost.
Simple Architectures
A handful of services with straightforward communication patterns don't need a mesh. Standard Kubernetes Services and NetworkPolicies handle routing and security adequately.
Teams Without Capacity to Learn It
Deploying Istio without understanding it creates fragile infrastructure. If the team can't invest time in learning the mesh, the complexity it adds will cause more problems than it solves.
When Network Policies Suffice
If your security requirements are met by Kubernetes NetworkPolicies and application-level TLS, a mesh adds capability you don't need at a cost you do pay.
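For comparison, this is the kind of control plain NetworkPolicies already give you - a sketch restricting ingress to a hypothetical `api` workload so that only `frontend` pods can reach it:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: api             # policy applies to the api pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend    # only frontend pods may connect
    ports:
    - protocol: TCP
      port: 8080
```

If this level of network-layer control, plus TLS handled in the applications, meets your requirements, the mesh adds little.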
Lattice's Approach: Optional and Incremental
Lattice includes Istio as an optional component precisely because not every deployment needs it. The configuration is straightforward:
Need a mesh? Enable it. Istio deploys with sensible defaults - mTLS in permissive mode (accepts both encrypted and plain text), basic telemetry enabled, Kiali available for visualisation.
Don't need a mesh? Disable it. The rest of the stack - storage, monitoring, security hardening - works independently.
This optionality is a deliberate design choice. Forcing a service mesh on every deployment would add unnecessary overhead for simpler use cases.
Incremental Adoption
For deployments that do use Istio, we recommend incremental adoption rather than big-bang deployment:
Phase 1: Observability only - Deploy the mesh with mTLS in permissive mode. Don't enforce any policies yet. Use this phase to understand traffic patterns, identify service dependencies, and build familiarity with Istio's tooling. Kiali's service graph is particularly useful here - it shows how services actually communicate, which often differs from how people think they communicate.
Phase 2: mTLS enforcement - Switch to strict mTLS. This is where you'll find services that weren't going through the mesh, external connections that need exceptions, and edge cases in certificate handling. Handle these before moving on.
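The switch to strict mTLS is a small piece of configuration - which is part of why flipping it prematurely catches people out. A sketch of a mesh-wide strict policy, assuming `istio-system` is the mesh's root namespace:

```yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace makes this mesh-wide
spec:
  mtls:
    mode: STRICT            # reject any plain-text service-to-service traffic
```

Applying the same resource in a single namespace instead lets you enforce strictness incrementally, one namespace at a time.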
Phase 3: Traffic management - Introduce routing rules, retries, and circuit breakers. Start with one service, validate the behaviour, then expand.
Phase 4: Access policies - Implement AuthorizationPolicies to restrict which services can communicate. This is the most powerful and most dangerous phase - incorrect policies break things.
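To show what this phase looks like, here is a sketch of an AuthorizationPolicy allowing only a hypothetical `frontend` service account to make GET requests to an `api` workload:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: api-allow-frontend
  namespace: app
spec:
  selector:
    matchLabels:
      app: api
  action: ALLOW
  rules:
  - from:
    - source:
        # identity comes from the workload's mTLS certificate
        principals: ["cluster.local/ns/app/sa/frontend"]
    to:
    - operation:
        methods: ["GET"]
        paths: ["/api/*"]
```

Note the failure mode: once any ALLOW policy matches a workload, everything not explicitly allowed is denied - which is exactly how incorrect policies break things.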
Each phase adds capability and complexity. Stopping at any phase gives you value without committing to everything.
Kiali: Seeing the Mesh
Lattice includes Kiali as an optional companion to Istio. It's a web-based console that visualises the service mesh:
- Service graph - Real-time visualisation of traffic between services, including request rates, error rates, and latency
- Configuration validation - Identifies misconfigurations in Istio resources before they cause problems
- Health assessment - Shows the health of services, workloads, and applications based on mesh telemetry
In air-gapped environments, this visualisation is particularly valuable. You can't easily reach out to external tracing services or cloud-based observability platforms. Kiali gives you mesh visibility entirely within the cluster.
Common Pitfalls
Enabling strict mTLS too early - Services that communicate outside the mesh will break. Start permissive, identify all communication paths, then enforce.
Ignoring proxy resource limits - In sidecar mode, sidecars without resource limits can consume more than expected under load. In Ambient mode, ztunnel and waypoint proxies need the same attention. Set requests and limits based on observed behaviour.
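In sidecar mode, per-pod proxy resources can be tuned with pod annotations rather than global mesh settings. A hypothetical Deployment fragment (service name and values are illustrative - set them from observed behaviour):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
      annotations:
        # annotations go on the pod template, not the Deployment
        sidecar.istio.io/proxyCPU: "100m"
        sidecar.istio.io/proxyMemory: "128Mi"
        sidecar.istio.io/proxyCPULimit: "500m"
        sidecar.istio.io/proxyMemoryLimit: "256Mi"
    spec:
      containers:
      - name: api
        image: registry.example/api:1.0
```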
Configuration sprawl - VirtualServices, DestinationRules, and AuthorizationPolicies accumulate. Without discipline, the configuration becomes hard to understand and harder to debug. Treat mesh configuration like code - version controlled, reviewed, tested.
Forgetting about non-mesh traffic - Not everything goes through the mesh. External services, databases outside the cluster, and system-level traffic may bypass the mesh entirely. Understand what's covered and what isn't, regardless of which mode you're running.
Over-engineering routing - Complex traffic routing rules are powerful but fragile. Start simple. Add complexity only when specific requirements demand it.
Service Mesh in Secure Environments
For the environments Lattice targets, a few service mesh capabilities have particular relevance:
mTLS provides defence in depth - Even if network-level controls are compromised, service-to-service traffic is encrypted and authenticated. Attackers who gain network access can't simply sniff or inject traffic.
AuthorizationPolicies complement NetworkPolicies - NetworkPolicies control which pods can communicate at the network level. AuthorizationPolicies control which services can communicate at the application level, including HTTP method and path restrictions. Together, they provide layered access control.
Audit trail - Mesh telemetry provides a detailed record of service-to-service communication. For environments that require comprehensive audit logs, this data is valuable.
Zero-trust networking - The mesh enables a zero-trust model where every service verifies the identity of every other service on every request. Nothing is trusted by default, even within the cluster network.
Lessons Learned
Don't deploy Istio because it's cool - Deploy it because you have specific requirements it addresses. If you can't articulate what problems it solves for you, you don't need it yet.
Invest in understanding - A mesh you don't understand is a liability. Dedicate time to learning how Istio works, not just how to install it.
Start small - Enable the mesh for one namespace or a few services first. Learn in a controlled scope before expanding.
Monitor the mesh itself - Istio's control plane needs monitoring like any other component. istiod resource usage, certificate rotation, configuration distribution - these all need visibility.
Plan for upgrades - Istio's release cadence means regular upgrades. Have a tested upgrade process before you need it.
What's Next
The next post in this series will cover operational playbooks - what to do when things go wrong, and how to prepare for incidents before they happen.
Lattice is a project developed by Digital Native Group.