Zero Trust Security for Microservices

Service-mesh zero trust, in practice: SPIFFE identities, mTLS by default, Istio and Linkerd policy at the sidecar, network policies underneath, and the places where you still need app-level authorization.

Security · Architecture · DevOps

A payments API compromise I was called in on went like this: a vulnerable image-processing sidecar in an unrelated namespace got RCE, the attacker scanned the pod network, found the internal billing service on port 8080, and issued refunds to an attacker-controlled account. No auth. Not because the billing team was lazy — because their threat model said “internal-only, behind the cluster ingress, protected by network policy.” The network policy existed. It just hadn’t been applied to the namespace the attacker landed in.

That’s the failure mode zero trust exists to prevent: you assumed network position meant identity. It doesn’t. Once an attacker has pod-level execution anywhere in your cluster, any service that trusts 10.0.0.0/8 trusts the attacker. The fix isn’t more firewalls. It’s making every service-to-service call prove, cryptographically, who’s on the other end of the socket — and making the default answer to “can A talk to B?” be no.

This post is about zero trust as a service-mesh architecture: workload identity, mTLS by default, policy enforced at the sidecar, and the handful of places you still need application-level checks. The user-auth side of zero trust (short token TTLs, impossible-travel detection, continuous validation) is a separate concern, covered in Authentication Patterns for Distributed Systems. Here I’m focused on service-to-service.

Threat Model First

Before any YAML, name the threats. Service-mesh zero trust defends a specific class, and you should be able to point at each control and say what it stops.

Threat | Defense
Lateral movement after pod compromise | mTLS with peer identity, default-deny authorization policy
Namespace escape via shared network | NetworkPolicy + mesh AuthorizationPolicy at namespace boundary
Spoofed service identity (bearer-token impersonation) | SPIFFE X.509 SVIDs, rotated short-lived certs, no shared secrets
Stolen long-lived service credential | Cert TTL ≤ 1h, automatic rotation, identity tied to workload not pod
Unauthorized egress (data exfiltration) | Egress gateways, Sidecar resource with allowed hosts, DNS policy
Sidecar bypass (attacker binds to pod IP directly) | NetworkPolicy at L3/L4 independent of mesh
Policy drift (a service silently gets too much access) | Policy-as-code in Git, CI diff review, AUDIT logging
Trust-on-first-use bootstrapping | Node attestation (SPIRE), workload attestation by k8s SA + image
“We enabled mTLS” but it’s PERMISSIVE | STRICT mode enforced cluster-wide, alerts on permissive exceptions
App-level bug still allows tenant data crossover | App-level authorization on top of mesh identity, not instead of it

If a control isn’t defending one of those, it’s probably theater. I’ll reference this table throughout.

What Zero Trust Actually Means at the Mesh Layer

Strip the vendor marketing and zero trust in a mesh comes down to three invariants:

  1. Every workload has a cryptographic identity that the platform issues and rotates. Not a secret checked into Git. Not a token in a ConfigMap. A certificate bound to the workload, rotated on a schedule short enough that leaking it is a nuisance, not a catastrophe.
  2. Every connection is mutually authenticated and authorized against that identity. Both sides prove who they are at the transport layer. Every request is evaluated against a policy that says “is this caller allowed to do this thing on this target?” The default answer is no.
  3. Policy lives with the platform, not in application code. The sidecar enforces. The app trusts what the sidecar tells it. This keeps security invariants out of the bug surface of every individual team’s service.

“Perimeter security is dead” is a cliche. The more useful framing is: network location is a bad proxy for identity, and zero trust stops using it as one.

Identity: SPIFFE, SPIRE, and Why You Don’t Mint This Yourself

The foundation of service-mesh zero trust is identity. Every workload needs a name the platform can attest to. SPIFFE (Secure Production Identity Framework For Everyone) is the open standard for this. SPIRE is the reference runtime.

A SPIFFE ID looks like a URI:

spiffe://cluster.example.com/ns/prod/sa/orders-api

It’s embedded as a URI SAN in an X.509 certificate (an SVID — SPIFFE Verifiable Identity Document). The cert itself is short-lived, typically 1 hour, rotated automatically by an agent running on the node. The workload doesn’t hold the signing key. It fetches the current SVID over a local Unix socket (the Workload API) whenever it needs one.

Why does this matter? Because you never ship a long-lived service credential anywhere. There is no “service account token” sitting in a secrets manager waiting to leak. The workload gets its identity by attestation — SPIRE checks the pod’s Kubernetes service account, the node it’s running on, and optionally the container image hash, and issues a cert only if the attestation matches a registered entry.

Registering a workload with SPIRE looks roughly like this:

spire-server entry create \
  -spiffeID spiffe://cluster.example.com/ns/prod/sa/orders-api \
  -parentID spiffe://cluster.example.com/spire/agent/k8s_psat/cluster/NODE-ID \
  -selector k8s:ns:prod \
  -selector k8s:sa:orders-api \
  -selector k8s:container-image:registry.example.com/orders-api@sha256:3f4c2e7d91a2b5e8f1d6c3a7b4e9f2c5d8a1b4e7f0c3d6a9b2e5f8c1d4a7b0e3 \
  -ttl 3600

Three selectors, three checks. The workload must be in ns:prod, running under service account orders-api, from a specific image digest. Change any of them — wrong namespace, wrong SA, tampered image — and SPIRE refuses to issue an SVID.

In practice, you don’t talk to SPIRE directly. Istio and Linkerd both consume SPIFFE identities internally (Istio’s Citadel uses SPIFFE URIs under the hood; Linkerd can be configured to use SPIRE). You register workloads, the mesh distributes SVIDs to sidecars, and your application never touches a certificate file. That’s the point.

What I’d actually choose: if you’re on a single cluster and just need mesh identity, Istio’s built-in CA is fine. If you’re running multi-cluster, multi-cloud, or mixing Kubernetes with VMs, stand up SPIRE and let every environment plug into the same trust domain. The migration from built-in CA to SPIRE is painful; start with SPIRE if you know you’ll need it.

One cultural shift worth naming: once SPIRE is in place, there are no more secrets-in-Git for service-to-service auth. If a team asks you to provision a static token for service X to call service Y, the answer is no. They register a workload entry, they get an SVID, the mesh does the rest. Static credentials are a regression, not a convenience.

mTLS by Default: STRICT Mode or Bust

Once workloads have identities, mTLS becomes the transport. Every connection between sidecars negotiates TLS 1.3, both sides present certs, both sides verify.

Istio has three mTLS modes: DISABLE, PERMISSIVE, and STRICT. PERMISSIVE is the trap. It accepts both mTLS and plaintext, which is great during migration and terrible as a steady state. I’ve seen clusters sit in PERMISSIVE for two years because “we’ll flip to strict next quarter.” Meanwhile, a misconfigured client bypasses mTLS entirely and nobody notices, because the connection works.

Enforce STRICT at the mesh level and treat any exception as an incident:

# Cluster-wide default: every service must be called over mTLS.
# Any plaintext connection between sidecar-injected pods is rejected.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT

Apply this at istio-system and it covers the whole mesh. If a team needs PERMISSIVE for a specific service during a migration (say, an external legacy client), they create a namespaced override with a clear expiry date, an owner, and an alerting rule. Not a permanent escape hatch.
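What that time-boxed exception can look like, sketched below. The annotation keys under security.example.com/ are a house convention read by your own CI and alerting, not anything Istio recognizes; Istio only acts on spec.

```yaml
# Namespaced PERMISSIVE exception: scoped to one workload, owned, dated.
# The annotations are metadata for YOUR tooling to alert on and expire.
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: billing-legacy-permissive
  namespace: prod
  annotations:
    security.example.com/owner: team-billing            # assumed convention
    security.example.com/expires: "2026-03-31"          # CI fails the build after this
    security.example.com/reason: "legacy on-prem client; see migration ticket"
spec:
  selector:
    matchLabels:
      app: billing-legacy
  mtls:
    mode: PERMISSIVE
```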

The failure mode when mTLS is half-enabled: one side thinks it’s strict, the other side is permissive, a bug or a config drift lets a plaintext client in, and you have an encrypted-looking data plane with an unencrypted hole. Alert on istio_requests_total{connection_security_policy="none"} — if that ever goes above zero in a namespace that should be strict, page someone.
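A sketch of that alert as a prometheus-operator rule, assuming stock Istio telemetry is scraped. The label names match standard Istio metrics, but verify them against your Istio version; the namespace filter is an assumption for a prod-only mesh.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-plaintext-detected
  namespace: istio-system
spec:
  groups:
    - name: zero-trust
      rules:
        - alert: PlaintextInStrictMesh
          # Fires if any request between mesh workloads was served without mTLS.
          expr: |
            sum by (source_workload, destination_workload) (
              rate(istio_requests_total{connection_security_policy="none",
                                        destination_service_namespace="prod"}[5m])
            ) > 0
          for: 5m
          labels:
            severity: page
          annotations:
            summary: "Plaintext request inside a STRICT-mTLS namespace"
```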

Linkerd’s take: Linkerd is mTLS-by-default from day one — there’s no equivalent to PERMISSIVE at the mesh level. That simplicity is why I reach for Linkerd on greenfield Kubernetes-only clusters. The tradeoff is Linkerd’s policy language is less expressive than Istio’s; for heavy multi-tenant policy needs, Istio wins.

Authorization: Default Deny, Explicit Allow

mTLS tells you who is calling. Authorization policy says whether they’re allowed. Without both, you have an encrypted free-for-all.

The pattern I enforce in every mesh is default deny at the namespace level, plus explicit allow policies per workload:

# Step 1: deny everything in the namespace.
# Explicit DENY with a match-all rule — do NOT rely on `spec: {}` shorthand.
# An empty spec with a `selector:` added later silently becomes ALLOW-nothing
# instead of DENY-all. Be explicit.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: default-deny
  namespace: prod
spec:
  action: DENY
  rules:
    - {}  # matches everything
---
# Step 2: allow specific callers to specific paths on orders-api.
# IMPORTANT: the principal trust-domain MUST match your mesh's configured
# trust domain. Istio defaults to `cluster.local`; if you're running SPIRE
# with a custom trust domain like `cluster.example.com`, replace below.
# Never use `*` or wildcard principals — exact match only, anchored on your
# trust domain. A wildcard principal in a default-deny mesh is a permit-all.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: orders-api-allow
  namespace: prod
spec:
  selector:
    matchLabels:
      app: orders-api
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/prod/sa/checkout-api
      to:
        - operation:
            methods: ["POST"]
            paths: ["/v1/orders", "/v1/orders/*/cancel"]
    - from:
        - source:
            principals:
              - cluster.local/ns/prod/sa/fulfillment-worker
      to:
        - operation:
            methods: ["GET"]
            paths: ["/v1/orders/*"]

Note the principals are SPIFFE-style identities backed by Kubernetes service accounts, not IP ranges or pod labels. An attacker who lands in a pod with a different service account cannot impersonate checkout-api even if they sit on the same node.

Reconciling trust domains. The examples above use cluster.local because that’s Istio’s default. But if you’re using SPIRE with a custom trust domain (like the spiffe://cluster.example.com/... identities shown earlier), you must use THAT trust domain as the principal prefix. Trust-domain mismatches fail closed — every call gets denied — and the usual “fix” people reach for is a wildcard principal, which breaks the whole policy. Pick your trust domain once, write it down, use it in SPIRE entries AND Istio policies.

The explicit DENY policy is the important half. A lot of teams ship only ALLOW policies and assume “if no policy matches, deny.” That’s not how Istio works by default. Without an explicit deny, absence of a matching ALLOW means absence of a restriction — the request goes through. Ship the deny policy first, then the allows, and your posture defaults to safe.

Review these in PRs like you review code. An AuthorizationPolicy change is a security boundary change. If your CODEOWNERS doesn’t route mesh policy changes to a security reviewer, fix that before the next release.

OPA as Policy Decision Point: When the Mesh Isn’t Enough

Istio’s AuthorizationPolicy covers method, path, principal, headers, and JWT claims. That handles most service-to-service cases. When it doesn’t — multi-tenant SaaS with dynamic tenant-based rules, per-object ownership checks, policies that depend on data outside the request — you reach for a dedicated policy engine. Open Policy Agent (OPA) is what I use.

The integration pattern is OPA as an external authorizer for the mesh. Istio forwards authorization decisions to OPA via an ext-authz filter:

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: ext-authz-orders
  namespace: prod
spec:
  selector:
    matchLabels:
      app: orders-api
  action: CUSTOM
  provider:
    name: opa-ext-authz  # defined in istio mesh config
  rules:
    - to:
        - operation:
            paths: ["/v1/orders/*"]

OPA evaluates a Rego policy that can reach into external data (tenant membership, feature flags, customer tier):

package istio.authz

import rego.v1

default allow := false

# Extract caller SPIFFE ID from the mTLS peer.
caller := input.attributes.source.principal

# Extract tenant from the request path: /v1/tenants/{tenant}/orders/...
# The path is attacker-authored — validate shape BEFORE using the segment
# as a lookup key. Empty tenant or wrong-length path means deny.
path_parts := split(trim_prefix(input.attributes.request.http.path, "/"), "/")
valid_path if {
	count(path_parts) >= 4
	path_parts[0] == "v1"
	path_parts[1] == "tenants"
	path_parts[2] != ""           # no empty tenant
	not contains(path_parts[2], ".")  # no path traversal sneaking into the key
}
tenant := path_parts[2] if valid_path

# The caller must be registered in the tenant's allow-list.
# This list comes from a bundle OPA pulls every 30s from a control plane.
allow if {
    data.tenant_allowed_callers[tenant][caller]
}

OPA is separated from both the app and the mesh. Policy authors write Rego, CI runs opa test, bundles ship to OPA sidecars via the bundle API. The mesh enforces, OPA decides. That separation is the whole point: policy stays out of application code and out of ad-hoc YAML, in a language designed for it.

When do you actually need OPA? When your policy depends on data the mesh doesn’t see (external tenant registries, RBAC graphs, dynamic flags). If your policy is “service A can call path X on service B,” stay in AuthorizationPolicy. If your policy is “a caller can act on an order only if it belongs to a tenant the caller is registered against, and the tenant’s plan allows it,” you want OPA.

Network Policies: Defense in Depth Under the Mesh

AuthorizationPolicy operates at L7, after the sidecar has accepted the connection. If an attacker finds a way to bypass the sidecar — a CNI bug, a privileged pod that disables iptables rules, a sidecar injection failure that shipped to prod — the mesh stops protecting you. NetworkPolicy at L3/L4 is the second layer:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: orders-api-ingress
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: orders-api
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: checkout-api
        - podSelector:
            matchLabels:
              app: fulfillment-worker
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: orders-db
      ports:
        - protocol: TCP
          port: 5432
    - to:  # allow DNS
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53

This is not a replacement for AuthorizationPolicy. It’s complementary. AuthorizationPolicy speaks identity and paths; NetworkPolicy speaks pods and ports. An attacker who compromises a pod labelled app: checkout-api will satisfy the NetworkPolicy — but the AuthorizationPolicy still requires they present a valid cluster.local/ns/prod/sa/checkout-api SPIFFE ID, which requires the checkout-api service account’s key material, which SPIRE rotated an hour ago.

Two locks. Different keys. That’s defense in depth.

The biggest NetworkPolicy mistake: forgetting DNS egress. If you write a strict egress policy and don’t allow UDP/53 to kube-system, pods can’t resolve anything and the failure mode is confusing. Always include DNS explicitly.

Namespace Isolation and Pod Security Standards

Namespaces are your blast-radius boundary. A zero-trust cluster treats each namespace as a trust domain:

  • Per-namespace AuthorizationPolicy defaults. default-deny in every namespace. Cross-namespace calls require explicit ALLOW on both sides.
  • Pod Security Standards enforced. restricted profile on application namespaces. No privileged pods, no host networking, no host path mounts.
  • Dedicated service accounts per workload. Never share. The SA is the identity SPIRE attests to.
  • Egress gateways for external traffic. All egress to the internet flows through a controlled gateway with its own policy, logged and rate-limited.

apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    istio.io/rev: stable

The restricted profile blocks the pod-escape vectors that turn a contained RCE into a cluster compromise. The Istio revision label ensures sidecar injection. Both are free and both close real classes of attack.

Observability: Who Called What, Why Was It Allowed

Zero trust without observability is just YAML. You need to be able to answer, for any incident: who connected to what, was it allowed, on what basis, and when did the policy change that made it possible.

The three signals I require on every mesh:

  1. Access logs with principal and policy decision. Every request, every sidecar. Ship to a durable store with >90 day retention.
  2. Policy audit events. Istio’s AUDIT action logs but doesn’t block. Apply it to sensitive paths even when you also ALLOW them — the log is the forensic record.
  3. Metrics on denies and permissive connections. istio_requests_total{response_code="403"} broken down by source/destination. Any sustained 403 rate means either a policy bug or an attacker probing.

apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: audit-sensitive
  namespace: prod
spec:
  selector:
    matchLabels:
      app: orders-api
  action: AUDIT
  rules:
    - to:
        - operation:
            methods: ["DELETE", "PATCH"]
            paths: ["/v1/orders/*"]

AUDIT runs alongside ALLOW. The request proceeds, and a structured log line records it: principal, path, method, decision, timestamp. For compliance, that’s your audit trail. For incident response, that’s how you answer “did the attacker actually call this endpoint?”

When You Still Need Application-Level Authorization

The mesh gives you “service A can call path P on service B.” That’s not enough for:

  • Multi-tenancy. The mesh doesn’t know that order 42 belongs to tenant X. Your app does.
  • Object-level ownership. “Can user U modify order O?” is data-dependent. The mesh can’t evaluate it without the order.
  • Fine-grained RBAC inside a service. The mesh allows the call; the app decides which fields the caller can see in the response.
  • User identity propagation. The mesh authenticates the service. Whose request the service is acting on is user-auth concern — JWTs, OIDC, the stuff in the auth patterns post.

The clean division I use:

Mesh is responsible for: transport security, service identity, can-A-call-B authorization, audit of connections.

App is responsible for: can-this-user-do-this-thing, tenant isolation, object ownership, field-level filtering.

If you put tenant checks in the mesh, your policy engine becomes coupled to your data model and you’ll end up reimplementing half your domain in Rego. If you put transport security in the app, you’ll reimplement it thirty times in thirty codebases with thirty bug surfaces.

A minimal app-level guard, given the mesh has already authenticated the caller service and the user JWT is propagated in a header:

// app/middleware/tenant.go
package middleware

import (
	"context"
	"net/http"
)

type tenantKey struct{}

// RequireTenantMatch checks that the authenticated user's tenant matches
// the tenant in the URL. The mesh has already authenticated the calling
// service and verified the user JWT's signature; we only check the business
// rule here.
func RequireTenantMatch(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		userTenant, ok := userTenantFrom(r.Context())
		if !ok || userTenant == "" {
			http.Error(w, "no user context", http.StatusUnauthorized)
			return
		}
		urlTenant := r.PathValue("tenant")
		if urlTenant == "" {
			// Routing misconfigured; do not fall through to a comparison
			// of two empty strings.
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if userTenant != urlTenant {
			// Generic message on the wire; detailed log server-side.
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		ctx := context.WithValue(r.Context(), tenantKey{}, urlTenant)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

This is the only auth this handler does. It trusts the mesh for transport, trusts the JWT library for user identity (validated per the auth patterns post — RS256, JWKS, iss and aud pinned), and does the one check the mesh can’t: tenant match. That’s the pattern: thin app-level checks that layer on top of mesh guarantees, not duplicating them.

Certificate Rotation: Fail Loud, Not Silent

A pattern I see in DIY implementations that I want to call out specifically, because it burned someone I know in production: silent cert rotation failure. A background goroutine reloads certs from disk; if the reload fails, it logs nothing and continues with the old cert. A week later the old cert expires, every request fails, the on-call gets paged at 3am, and the rotation has been silently broken since the last cert-manager issue.

If you are, for some reason, doing cert loading in application code (you’re running outside a mesh, or you’re writing the mesh itself), fail loud:

// security/rotation.go
package security

import (
	"context"
	"crypto/tls"
	"log/slog"
	"sync/atomic"
	"time"
)

type RotatingCert struct {
	certFile, keyFile string
	cert              atomic.Pointer[tls.Certificate]
	logger            *slog.Logger
	onFailure         func(error) // alerting hook: PagerDuty, etc.
}

func NewRotatingCert(ctx context.Context, certFile, keyFile string, logger *slog.Logger, onFailure func(error)) (*RotatingCert, error) {
	rc := &RotatingCert{
		certFile:  certFile,
		keyFile:   keyFile,
		logger:    logger,
		onFailure: onFailure,
	}
	// Load synchronously at startup. If this fails, the service does not start.
	if err := rc.reload(); err != nil {
		return nil, err
	}
	go rc.rotateLoop(ctx)
	return rc, nil
}

func (rc *RotatingCert) Current() *tls.Certificate {
	return rc.cert.Load()
}

// safeAlert invokes the user-supplied onFailure hook but recovers from any
// panic it throws. A crashing alert client (nil ptr in a PagerDuty SDK, a
// dropped connection wrapped in panic) must not kill the rotation goroutine —
// that would silently stop rotations and we'd find out when the cert expired.
func (rc *RotatingCert) safeAlert(err error) {
	defer func() {
		if r := recover(); r != nil {
			rc.logger.Error("onFailure hook panicked",
				slog.Any("recovered", r))
		}
	}()
	rc.onFailure(err)
}

func (rc *RotatingCert) reload() error {
	cert, err := tls.LoadX509KeyPair(rc.certFile, rc.keyFile)
	if err != nil {
		return err
	}
	rc.cert.Store(&cert)
	return nil
}

func (rc *RotatingCert) rotateLoop(ctx context.Context) {
	ticker := time.NewTicker(15 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := rc.reload(); err != nil {
				rc.logger.Error("cert rotation failed",
					slog.String("cert_file", rc.certFile),
					slog.String("error", err.Error()))
				rc.safeAlert(err) // fire the alert, recover if the hook panics
				// Keep serving with the old cert so we don't take down
				// the service, but the alert must be actionable.
				continue
			}
			rc.logger.Info("cert rotated", slog.String("cert_file", rc.certFile))
		}
	}
}

Three things this gets right that the naive version doesn’t:

  1. Structured logging with context. You can grep for cert rotation failed across the fleet and filter by cert_file. No silent continue.
  2. An alerting hook. onFailure is called on every failed reload. Wire it to your pager. A single failure can be a blip; three in a row means cert-manager is broken and your certs are going to expire.
  3. Startup is synchronous and fatal. If the initial load fails, NewRotatingCert returns an error and the service doesn’t start. Better to fail the deploy than to start with stale or missing certs.

The rotation interval (15 minutes) is aggressive for a reason: if SPIRE/cert-manager rotates every hour, you want to pick up the new cert well before the old one expires. Twelve hours, as in the naive version, leaves you with effectively no margin.

Critical wiring detail: you must hand this to tls.Config via a callback, not as a snapshot. The common bug here mirrors the trust-bundle mistake — readers write tls.Config{Certificates: []tls.Certificate{*rc.Current()}} at startup, which captures the cert once and never sees a rotation. Use GetCertificate (server side) and GetClientCertificate (client side) so Go re-reads the atomic on every handshake:

// trustBundle holds the current CA pool. A background goroutine (not shown)
// refreshes it from the SPIRE Workload API on its own interval.
var trustBundle atomic.Pointer[x509.CertPool]

// Base TLS config. ClientCAs is deliberately nil here — it's injected per
// handshake via GetConfigForClient so rotation actually takes effect.
baseTLSConfig := &tls.Config{
	MinVersion: tls.VersionTLS13,
	ClientAuth: tls.RequireAndVerifyClientCert, // mTLS: server requires+validates client cert
	GetCertificate: func(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
		return rc.Current(), nil
	},
	GetClientCertificate: func(_ *tls.CertificateRequestInfo) (*tls.Certificate, error) {
		return rc.Current(), nil
	},
}

// This is the config you pass to http.Server. GetConfigForClient runs on
// every handshake, clones the base, and attaches the CURRENT trust bundle.
// Snapshotting ClientCAs at startup freezes your trust store.
serverTLSConfig := &tls.Config{
	GetConfigForClient: func(*tls.ClientHelloInfo) (*tls.Config, error) {
		cfg := baseTLSConfig.Clone()
		cfg.ClientCAs = trustBundle.Load()
		return cfg, nil
	},
}

Three things to internalize:

  1. ClientAuth: RequireAndVerifyClientCert is mandatory. The zero value is NoClientCert — a “mTLS” server without it is really just a one-way-TLS server that happens to have a client cert loaded for outbound calls. This is the most common way “we have mTLS” turns out to mean “we have TLS.”
  2. ClientCAs must come from GetConfigForClient. A ClientCAs: trustBundle.Load() set once on the base config captures a snapshot — every subsequent handshake reads that frozen pool. When SPIRE rotates, the service keeps trusting the retired CAs and doesn’t know about the new ones until restart.
  3. CRL/OCSP stapling is not the answer for SPIFFE SVIDs. Short TTLs (one hour is typical) plus live trust-bundle reloads cover revocation. An OCSP responder for workloads that rotate hourly is theater.

But the honest answer: don’t write this code. Run in a mesh. Let the mesh handle cert distribution and rotation. The code above is for the case where you’re on bare metal or VMs and the mesh isn’t an option.

Istio vs Linkerd: What I’d Actually Pick

The mesh choice matters less than people think for zero trust — both give you mTLS and policy. The differences:

Istio

  • More expressive policy (AuthorizationPolicy is richer than Linkerd’s Server/ServerAuthorization)
  • Heavier footprint (sidecar memory, control plane complexity)
  • Better for multi-cluster, VM workloads, external traffic control via gateways
  • Steeper learning curve, more ways to misconfigure
  • First-class ext-authz for OPA integration

Linkerd

  • mTLS and identity are on by default, no permissive trap
  • Much lighter sidecar (Rust proxy, tiny memory footprint)
  • Simpler policy model, easier to reason about
  • Excellent operational ergonomics — linkerd viz is a joy
  • Less flexible for heavy policy needs or multi-tenant SaaS

What I’d actually choose: Linkerd on single-cluster Kubernetes-only greenfield projects. Istio when I need multi-cluster federation, VM workloads in the mesh, or heavy policy integration with OPA. I will not DIY a mesh in 2026 unless there’s a compliance constraint that rules out both.

The Mistake I See Most

The most common zero-trust failure I see on audits isn’t a missing control — it’s partial adoption. Specifically:

  • mTLS enabled, AuthorizationPolicy missing. Every service can still call every other service, just over TLS. This is encryption theater.
  • default-deny policy missing. ALLOW policies exist, but the namespace default is “no policy = permit.” Result: new services ship with implicit full access.
  • PERMISSIVE mTLS “temporarily” for two years. One plaintext client slipped in during a migration and nobody noticed.
  • NetworkPolicy missing. Attacker bypasses sidecar via privileged pod, mesh is irrelevant.
  • Policy changes landing without review. Security-sensitive YAML in a dev PR, merged by anyone, no audit trail.

Zero trust is not a product you install. It is an operating posture: default deny, identity-based, cryptographically verified, policy-as-code, audited. Every one of the items above is a failure to maintain that posture.

What I’d Actually Ship

If I’m standing up a new production Kubernetes cluster today:

  1. Mesh: Linkerd or Istio, mTLS STRICT cluster-wide from day one. No PERMISSIVE.
  2. Identity: SPIRE if multi-environment; mesh built-in CA if single-cluster. Either way, every workload gets a dedicated service account and a SPIFFE ID.
  3. AuthorizationPolicy: default-deny in every app namespace. ALLOW policies per workload, reviewed in PR by security-tagged CODEOWNERS. AUDIT on mutating endpoints.
  4. NetworkPolicy: per-workload, ingress + egress, DNS allowed explicitly. Cilium if I want eBPF-level L7 policy underneath the mesh.
  5. Pod Security: restricted profile enforced. No exceptions without a dated waiver.
  6. OPA: only when policy needs data the mesh can’t see. Start without it.
  7. Observability: access logs with principal + decision, 90+ day retention, alerts on 403 rate spikes and on any connection_security_policy="none".
  8. App-level auth: thin, on top of the mesh. Tenant checks, object ownership, JWT user identity — see the auth patterns post.

Zero trust at the mesh layer is table stakes for any cluster running more than a handful of services. It is no longer an advanced topic. The tools are mature, the patterns are well-understood, and the attack that kills you is the one that walked in on an unauthenticated internal port because someone trusted the network.

Stop trusting the network.
