From Manual Deployments to GitOps with Flux

How I moved from manual kubectl applies to fully automated GitOps deployments using Flux CD — and why it changed everything.

DevOps · Kubernetes · Infrastructure

The kubectl-from-laptop Era

I used to deploy Kubernetes manifests by running kubectl apply from my laptop. It worked until it didn’t — configuration drift, no audit trail, and the inevitable “who changed that in production?” conversation. One Friday afternoon, someone ran a kubectl apply with a stale local copy and overwrote a config change that had been made directly on the cluster. We spent the weekend figuring out what the correct state should be.

Switching to GitOps with Flux solved all of it. Git became the single source of truth. Every change went through a PR. The cluster reconciled itself automatically. And when something broke, git log told us exactly who changed what and when.

This post covers how I set it up and the patterns that work at scale.

What GitOps Actually Means

GitOps is a specific pattern, not a buzzword:

  1. Declarative configuration — the entire desired state of your cluster lives in Git
  2. Automated sync — an operator in the cluster continuously reconciles actual state to match Git
  3. Drift detection — if someone runs kubectl edit directly, the operator reverts it
  4. Pull-based deployment — the cluster pulls its own config, rather than CI pushing to it

The practical result: infrastructure changes follow the exact same workflow as application code. PRs, reviews, approvals, merge, automatic deployment. No more SSH-ing into boxes or running kubectl from laptops.
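Drift detection in particular is worth seeing once. A minimal sketch with the Flux CLI — the Kustomization name `apps`, the namespace, and the Deployment name are placeholders, not anything fixed by Flux:

```shell
# Someone hand-edits a replica count directly on the live cluster...
kubectl -n apps scale deployment app1 --replicas=10

# ...and on the next sync interval Flux puts it back to what Git declares.
# Trigger the reconcile instead of waiting, then confirm the revert:
flux reconcile kustomization apps --with-source
kubectl -n apps get deployment app1 -o jsonpath='{.spec.replicas}'
```

The hand-edit survives for at most one sync interval; the only durable way to change the cluster is a commit.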

Why Flux

I chose Flux over ArgoCD for a few reasons: it’s a CNCF graduated project, it runs as a set of plain controllers inside the cluster, and it composes well with Kustomize and Helm without requiring a UI or a separate API server. ArgoCD is fine too — it is also pull-based and also CNCF graduated — but Flux fits better when you want GitOps as infrastructure, not as an application you manage.

Setting Up Flux

You need a Kubernetes cluster (1.26+), kubectl configured, and the Flux CLI installed. I strongly recommend a dedicated Git repository for infrastructure config, separate from application code — different access controls, different review cadence.

Bootstrap

# Prompt for the token so it doesn't land in shell history
read -rsp 'GitHub PAT: ' GITHUB_TOKEN && export GITHUB_TOKEN
export GITHUB_OWNER=<your-org-or-username>
export GITHUB_REPO=flux-infrastructure

# Bootstrap Flux (--personal=true for a user account, false for an org)
flux bootstrap github \
  --owner=$GITHUB_OWNER \
  --repository=$GITHUB_REPO \
  --branch=main \
  --path=./clusters/production \
  --personal=false \
  --private=true

A word on that GITHUB_TOKEN: never commit it, never let it land in shell history you sync, and never reuse a long-lived org-wide token. Use read -rs (as above) or gh auth token rather than putting the value on the command line. It should be a short-lived fine-grained PAT scoped to exactly the one repository, with the minimum permissions bootstrap needs (contents: read/write, administration: write for deploy keys). Treat it as CI-only — generate it, run bootstrap, revoke it. The moment flux bootstrap returns, scrub it from the shell:

unset GITHUB_TOKEN
# then revoke the PAT in GitHub settings

This command:

  1. Creates a repository if it doesn’t exist
  2. Generates a deploy key with read/write access
  3. Installs Flux components in your cluster
  4. Configures Flux to sync the specified path in your repository
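Before relying on any of that, confirm the bootstrap actually converged:

```shell
# Validate cluster prerequisites and controller health
flux check

# All Flux controllers should be Running in flux-system
kubectl -n flux-system get pods

# The initial sync should report Ready=True
flux get kustomizations --all-namespaces
```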

For production, bootstrap Flux itself with a read-only deploy key and add image automation separately. The two needs are different: Flux core only reads from Git, while image-automation writes tag updates back. Bundling them onto a single read/write key widens the blast radius of every Flux component to “can rewrite the GitOps repo”:

# Step 1: bootstrap Flux core with a READ-ONLY deploy key.
# No image-automation components here, so --read-write-key=false is honest.
flux bootstrap github \
  --owner=$GITHUB_OWNER \
  --repository=$GITHUB_REPO \
  --branch=main \
  --path=./clusters/production \
  --personal=false \
  --private=true \
  --read-write-key=false \
  --network-policy=true

Read-only is the right default for Flux core because the blast radius of a leaked write key is much larger. A leaked read key exposes your manifests — embarrassing, but the attacker can’t change what runs in your cluster through that path. A leaked write key lets an attacker push malicious manifests back to the GitOps repo, where Flux will then obediently reconcile them onto your cluster. You’ve just given them remote code execution on production via a deploy key that nobody rotates because “it’s just a Git key”.

Add image automation as a second step. The hardened path is a separate deploy key attached to a second GitRepository that only ImageUpdateAutomation references, so Flux core keeps its read-only key and the write permission is isolated to one controller:

# Step 2: generate a dedicated write-capable deploy key for image automation,
# attach it to the repo manually (GitHub -> Settings -> Deploy keys, allow write),
# then create a Secret + GitRepository that ONLY the ImageUpdateAutomation uses.
umask 077
WORK="$(mktemp -d)"
ssh-keygen -t ed25519 -N '' -f "$WORK/flux-image-automation" -C flux-image-automation
# Paste $WORK/flux-image-automation.pub into GitHub as a deploy key with WRITE access.

# Pin GitHub's SSH host keys against the published fingerprints before trusting
# the output of ssh-keyscan. TOFU ("we'll verify later") is not verification --
# if the first scan is hijacked, the write-capable deploy key talks to the
# attacker's host and pushes tag updates into an attacker-controlled mirror.
# Current fingerprints are published at:
#   https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/githubs-ssh-key-fingerprints
EXPECTED_RSA="SHA256:uNiVztksCsDhcc0u9e8BujQXVUpKZIDTMczCvj3tD2s"
EXPECTED_ED25519="SHA256:+DiY3wvvV6TuJJhbpZisF/zLDA0zPMSvHdkr4UvCOqU"
# Require BOTH expected fingerprints: a single `grep -E "a|b"` over the
# combined output would pass even if one of the two scanned keys had been
# swapped for an attacker's.
ssh-keyscan -t rsa,ed25519 github.com | tee "$WORK/kh" | ssh-keygen -lf - > "$WORK/fp"
grep -q "$EXPECTED_RSA" "$WORK/fp" && grep -q "$EXPECTED_ED25519" "$WORK/fp" \
  || { echo "host key mismatch -- refusing to build known_hosts Secret"; exit 1; }
# Sanity-check the file isn't empty (network failure, DNS hijack to a silent host).
# Note: don't redirect ssh-keyscan stderr to /dev/null -- that hides the failure
# signal, and the only reason we can detect a silent-host scenario is by checking
# output non-emptiness here.
[ -s "$WORK/kh" ] || { echo "ssh-keyscan produced no output"; exit 1; }

kubectl -n flux-system create secret generic flux-image-automation-key \
  --from-file=identity="$WORK/flux-image-automation" \
  --from-file=identity.pub="$WORK/flux-image-automation.pub" \
  --from-file=known_hosts="$WORK/kh"

# Tear down the ephemeral working directory. Plain rm is the right tool here;
# `shred` is ineffective on journaled/COW filesystems -- see the GPG section
# below for the full rationale.
rm -rf "$WORK"

On rm vs shred: shred -u was the old advice, but it is effectively a no-op on ext4 with journaling, btrfs, zfs, APFS, and any tmpfs-backed path — the filesystem may redirect writes to new blocks, leaving the original key material in freed extents. The meaningful controls are (1) generate the key inside an $(mktemp -d) with umask 077 so intermediate files never sit in /tmp with world-readable defaults, (2) delete the directory after use, and (3) rely on full-disk encryption at rest on the workstation running this script. For production automation, generate the key inside an HSM-backed keyring or an ephemeral CI container that’s torn down after the Secret is created, not on a long-lived laptop.

Then commit a second GitRepository that uses this write key, install the image automation controllers (flux install --components-extra=image-reflector-controller,image-automation-controller --export > ... && git commit), and reference the new GitRepository from ImageUpdateAutomation.sourceRef. The existing flux-system GitRepository stays read-only.
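Concretely, that second GitRepository is a sketch like this — the resource name is the one ImageUpdateAutomation.sourceRef will point at, and the Secret is the write-capable key created in Step 2:

```yaml
# clusters/production/flux-system/image-automation-source.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system-image-automation
  namespace: flux-system
spec:
  interval: 1m
  ref:
    branch: main
  url: ssh://git@github.com/organization/flux-infrastructure
  secretRef:
    name: flux-image-automation-key  # write-capable deploy key from Step 2
```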

The pragmatic shortcut — if you’d rather not manage two keys — is to re-run flux bootstrap with --read-write-key=true and the image-automation components; but be honest about what you’re doing: that rotates the single Flux deploy key to read/write, so every Flux controller (source-controller, kustomize-controller, image-automation-controller) gets write access to the repo, not just image-automation. That’s a wider blast radius than the two-keys path above. For a homelab or dev cluster, fine. For production, do the two keys.

Branch Protection Is a Prerequisite, Not a Nice-to-Have

Everything GitOps claims — audit trail, single source of truth, reviewed changes — collapses if main is not protected. Before you merge the bootstrap PR, configure on main:

  • Required pull request reviews (at least 1, 2 for production repos).
  • Required signed commits.
  • Required status checks (lint, policy, schema validation).
  • Restrict who can push directly to the branch (ideally: nobody).
  • No force-push, no branch deletion, linear history on.
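These settings can be scripted with the GitHub CLI. A sketch against the REST branch-protection API — cross-check the field names against current GitHub documentation before relying on it, and note that required signatures are a separate endpoint:

```shell
# Apply protection rules to main (field names per GitHub's REST API)
gh api -X PUT "repos/$GITHUB_OWNER/$GITHUB_REPO/branches/main/protection" \
  --input - <<'EOF'
{
  "required_pull_request_reviews": { "required_approving_review_count": 2 },
  "required_status_checks": { "strict": true, "contexts": ["lint", "policy"] },
  "enforce_admins": true,
  "restrictions": null,
  "allow_force_pushes": false,
  "allow_deletions": false,
  "required_linear_history": true
}
EOF

# Required signed commits live under their own endpoint
gh api -X POST "repos/$GITHUB_OWNER/$GITHUB_REPO/branches/main/protection/required_signatures"
```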

Without these, any developer PAT with write access becomes a path straight to production: commit to main, Flux reconciles it in minutes. The exposure window for an unreviewed commit is one sync interval (10 minutes in the examples below); assume an attacker knows your interval and will use it.

Verify Commit Signatures at Reconcile Time

Branch protection enforces signed commits in GitHub. Flux can verify those signatures independently before applying anything, which closes the gap where a compromised GitHub account or a bypass of branch protection (admin override, stale webhooks) slips an unsigned commit onto main:

# clusters/production/flux-system/gotk-sync.yaml (excerpt)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  ref:
    branch: main
  url: ssh://git@github.com/organization/flux-infrastructure
  secretRef:
    name: flux-system  # deploy-key Secret provisioned by `flux bootstrap`
  verify:
    mode: HEAD
    secretRef:
      name: git-signing-keys

Note the ssh:// URL: flux bootstrap github provisions an SSH deploy key by default, so the GitRepository authenticates over SSH, not HTTPS. If you’re using HTTPS (token-based) auth, replace the URL with https://... and reference a token Secret — the two must match.

git-signing-keys is a Secret holding the public keys (GPG, or cosign/gitsign public key) of committers allowed to change production. If a commit on main isn’t signed by a key in that Secret, Flux refuses to reconcile and raises an alert. This is the trust anchor that makes “git is the source of truth” actually true — without it, Flux trusts whoever wrote the last commit, which is too much trust to hand an external system.

Secrets Do Not Go in Git

The single most common GitOps failure mode: someone commits a Helm values.yaml with a database password, notices a week later, force-pushes to remove it (which doesn’t actually remove it), and now you have a plaintext credential in git history that your compliance auditor will find. Don’t be that team.

Pick one of these and standardize early:

  • SOPS (originally from Mozilla, now a CNCF project): encrypts YAML fields with KMS/age keys; Flux decrypts at reconcile time via the kustomize-controller’s decryption config. My default for small-to-midsize teams — clear ergonomics, files are readable diffs, keys rotate cleanly.
  • External Secrets Operator: git holds references; actual secrets live in Vault / AWS Secrets Manager / GCP Secret Manager. My default when there’s already a secrets backend in place, or when non-Flux workloads need the same secret.
  • Sealed Secrets (Bitnami): encrypts Secrets with a cluster-bound public key. Simple to start, but the controller’s key is the single root of trust and rotation is awkward. I’d pick SOPS over this today.

What I won’t do: commit plaintext secrets “just for dev.” Dev environments leak. Treat the repo as untrusted for secret material from day one.
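The SOPS path, sketched end-to-end under the assumption of an age key — the recipient string, path regex, and Secret name are placeholders:

```yaml
# .sops.yaml -- repo-level rule: encrypt only Secret data fields with age
creation_rules:
  - path_regex: .*/secrets/.*\.yaml$
    encrypted_regex: ^(data|stringData)$
    age: <your-age-recipient>
---
# clusters/production/apps.yaml (excerpt -- the decryption block is the
# addition). kustomize-controller decrypts at reconcile time using the age
# private key from a Secret created out-of-band:
#   kubectl -n flux-system create secret generic sops-age --from-file=age.agekey
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      name: sops-age
```

Encrypted files still diff cleanly in PRs because only the value fields are ciphertext; keys and structure stay readable.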

Repository Structure

This is the structure I use for multi-environment deployments. It follows the Kustomize bases/overlays pattern:

├── clusters/
│   ├── development/
│   │   ├── flux-system/        # Flux components for dev
│   │   ├── infrastructure.yaml # Infrastructure for dev
│   │   └── apps.yaml           # Applications for dev
│   ├── staging/
│   │   ├── flux-system/        # Flux components for staging
│   │   ├── infrastructure.yaml # Infrastructure for staging
│   │   └── apps.yaml           # Applications for staging
│   └── production/
│       ├── flux-system/        # Flux components for prod
│       ├── infrastructure.yaml # Infrastructure for prod
│       └── apps.yaml           # Applications for prod
├── infrastructure/
│   ├── base/                   # Base infrastructure definitions
│   │   ├── ingress-nginx/      # Ingress controller
│   │   ├── cert-manager/       # Certificate management
│   │   └── monitoring/         # Prometheus and Grafana
│   └── overlays/               # Environment-specific overrides
│       ├── development/
│       ├── staging/
│       └── production/
└── apps/
    ├── base/                   # Base application definitions
    │   ├── app1/
    │   └── app2/
    └── overlays/               # Environment-specific overrides
        ├── development/
        ├── staging/
        └── production/

The key insight: base/ defines the common configuration, overlays/ applies environment-specific patches. You define ingress-nginx once and override replica counts and resource limits per environment.
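For example, the production infrastructure overlay is just a kustomization.yaml that pulls in the base and applies the patch files (names match the tree above):

```yaml
# infrastructure/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: ingress-nginx/release-patch.yaml
    target:
      kind: HelmRelease
      name: ingress-nginx
```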

Core Workflows

Infrastructure First

Infrastructure components (ingress, cert-manager, monitoring) deploy before applications. Start by declaring your Helm repository sources:

# infrastructure/base/sources/helm-repositories.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: ingress-nginx
  namespace: flux-system
spec:
  interval: 1h
  url: https://kubernetes.github.io/ingress-nginx
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: cert-manager
  namespace: flux-system
spec:
  interval: 1h
  url: https://charts.jetstack.io
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: flux-system
spec:
  interval: 1h
  url: https://prometheus-community.github.io/helm-charts

Then define the ingress controller HelmRelease. Chart versions in this post are illustrative — pin to a current release that you’ve cross-checked against the project’s security advisories before applying to a cluster:

# infrastructure/base/ingress-nginx/release.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  interval: 1h
  chart:
    spec:
      chart: ingress-nginx
      version: "4.0.13"
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
        namespace: flux-system
  values:
    controller:
      metrics:
        enabled: true
        serviceMonitor:
          enabled: true
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 512Mi

Production gets more resources and autoscaling via an overlay patch:

# infrastructure/overlays/production/ingress-nginx/release-patch.yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  values:
    controller:
      autoscaling:
        enabled: true
        minReplicas: 3
        maxReplicas: 10
        targetCPUUtilizationPercentage: 80
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          cpu: 1000m
          memory: 1Gi

A Flux Kustomization ties it all together and adds health checks:

# clusters/production/infrastructure.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m
  path: ./infrastructure/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: ingress-nginx-controller
      namespace: ingress-nginx

The healthChecks field is critical — Flux won’t report the Kustomization as ready until the ingress controller deployment is actually running. Without health checks, you get false positives.

Application Deployment

Applications follow the same pattern. The important bit is the dependsOn field that ensures infrastructure deploys first:

# apps/base/app1/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1
  namespace: apps
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app1
  template:
    metadata:
      labels:
        app: app1
    spec:
      containers:
      - name: app1
        # The trailing marker is required by the Setters update strategy;
        # without it, ImageUpdateAutomation has no field to rewrite.
        image: ghcr.io/organization/app1:v1.0.0 # {"$imagepolicy": "flux-system:app1"}
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi

Production scales up via overlay:

# apps/overlays/production/app1/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1
  namespace: apps
spec:
  replicas: 5

The Flux Kustomization for apps references infrastructure as a dependency:

# clusters/production/apps.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  path: ./apps/overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: infrastructure

dependsOn: infrastructure means Flux won’t even attempt to deploy apps until infrastructure health checks pass. This prevents the common failure mode of apps starting before their ingress controller or cert-manager is ready.

Automated Image Updates

This is one of Flux’s most powerful features and the one that surprised me most. Flux can watch a container registry, detect new image tags matching a policy, update the manifests in Git, and apply the changes — fully automated.

Configure an image repository to scan:

# apps/base/app1/image-repository.yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageRepository
metadata:
  name: app1
  namespace: flux-system
spec:
  image: ghcr.io/organization/app1
  interval: 1m

Define a semver policy for which tags to accept. Keep the range tight — patch-only for anything that touches production, so a backdoored minor release can’t auto-deploy before a human reads the changelog:

# apps/base/app1/image-policy.yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata:
  name: app1
  namespace: flux-system
spec:
  imageRepositoryRef:
    name: app1
  policy:
    semver:
      # Unambiguous explicit bounds. `~` and `^` semantics drift across
      # toolchains (npm, Composer, Flux, Helm) -- spell out the range you
      # actually mean so the policy doesn't silently change if the library
      # Flux uses for semver parsing swaps interpretation.
      range: '>=1.2.0 <1.3.0'  # patches only (1.2.x), no minor bumps

Auto-promoting an unsigned image from a registry invites the same class of supply-chain compromise seen in the Codecov and SolarWinds incidents — don’t skip cosign verification. Important: ImageRepository has no native verify field. Flux verifies signatures natively only on OCIRepository (used for Helm charts and OCI-packaged manifests). For container images scanned by ImageRepository, verification must be enforced at admission time — the reflector will happily discover an unsigned tag otherwise.

Two concrete options. Pick one; don’t just add a “(with verification)” comment and call it done.

Option A: co-locate a Kyverno ClusterPolicy with a verifyImages rule. This is the admission-time gate that rejects any unsigned image when Flux applies the Deployment, regardless of how the tag got into the manifest:

# infrastructure/base/policy/verify-app1-signatures.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-app1-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              # Match at the controller level too, otherwise `mutateDigest: true`
              # rewrites the Pod spec only after the controller has already
              # created a Pod from a mutable tag -- the Deployment/StatefulSet
              # in etcd still references the tag, and the next rollout fetches
              # whatever image the tag points to at that moment. Including the
              # controller kinds makes the digest pin happen on the template,
              # which is the object Flux actually reconciles.
              kinds:
                - Pod
                - Deployment
                - StatefulSet
                - DaemonSet
                - ReplicaSet
                - Job
                - CronJob
      verifyImages:
        - imageReferences:
            - "ghcr.io/organization/app1:*"
          failureAction: Enforce
          mutateDigest: true          # pin resolved digest into the Pod spec
          verifyDigest: true
          required: true
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...your cosign pubkey...
                      -----END PUBLIC KEY-----
                    rekor:
                      url: https://rekor.sigstore.dev

With this in place, the existing ImageRepository stays as-is; Kyverno rejects any Pod whose image lacks a valid cosign signature from the listed key, so ImageUpdateAutomation can’t land an unsigned tag into production even if the registry is compromised.

Option B: switch chart/manifest sources to OCIRepository with native cosign verify. This works for Helm charts and OCI-packaged manifests (not for container images consumed via ImageRepository):

# infrastructure/base/sources/app1-oci.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: app1-manifests
  namespace: flux-system
spec:
  interval: 5m
  url: oci://ghcr.io/organization/app1-manifests
  ref:
    semver: ">=1.2.0 <1.3.0"  # explicit bounds; see ImagePolicy note on `~`/`^` ambiguity
  verify:
    provider: cosign
    secretRef:
      name: cosign-pub  # Secret containing cosign.pub

For the container-image path this article uses (ImageRepository + ImageUpdateAutomation), Option A is the required control. Option B is the right call if you’re already packaging manifests as OCI artifacts.

And configure the automation to update manifests. For production, don’t push directly to main — commit to a separate branch and open a PR, so the human review and required-signatures gate still apply:

# apps/base/app1/image-update.yaml
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImageUpdateAutomation
metadata:
  name: app1
  namespace: flux-system
spec:
  interval: 1h
  sourceRef:
    kind: GitRepository
    # Points at the write-capable GitRepository from the two-keys bootstrap path,
    # NOT the read-only flux-system GitRepository. Only image-automation holds
    # the write key; other controllers reconcile through flux-system (read-only).
    name: flux-system-image-automation
  git:
    checkout:
      ref:
        branch: main
    commit:
      author:
        email: fluxcdbot@users.noreply.github.com
        name: fluxcdbot
      signingKey:
        secretRef:
          name: flux-bot-gpg
      messageTemplate: 'Update app1 to {{.NewTag}}'
    push:
      # Flux pushes to this branch but does NOT open the PR itself.
      # Wire up one of: a GitHub Actions workflow triggered on push to
      # `flux-image-updates` that runs `gh pr create --base main`, or a
      # Flux Notification-controller Provider + Alert that calls a webhook
      # which opens the PR. Without either, the branch accumulates commits
      # that never reach main and the production rollout stalls silently.
      branch: flux-image-updates  # production: separate branch + PR
  update:
    path: ./apps/base/app1
    strategy: Setters

The signingKey.secretRef: flux-bot-gpg is a Secret holding the bot’s GPG private key. Create it out-of-band (it must not be committed) and add the corresponding public key to the git-signing-keys Secret referenced by GitRepository.verify, otherwise Flux will reject the bot’s own commits:

# Generate a dedicated GPG key for the bot. The key has no passphrase because
# a non-interactive controller can't supply one; compensate by keeping the
# private material inside an ephemeral, isolated GNUPGHOME that we delete
# immediately after creating the Kubernetes Secret. For production, generate
# this key in an HSM-backed keyring (smartcard, KMS-backed pkcs11, cloud HSM)
# or inside an ephemeral CI container that's torn down after the Secret exists.
# Don't do this on a shared laptop.
umask 077
EPHEMERAL_HOME="$(mktemp -d)"
export GNUPGHOME="$EPHEMERAL_HOME"

# Expiration: 1y. Automation keys that "never" expire outlive the team that
# owns them, and a leak discovered two years later still reconciles commits.
# Rotation procedure: every ~10 months, generate a new bot key in a fresh
# ephemeral home, append its public key to git-signing-keys (ADDITIVE -- see
# below), cut over ImageUpdateAutomation to the new flux-bot-gpg Secret, then
# remove the old public key from git-signing-keys once no commits signed by
# the old key remain on main.
gpg --batch --passphrase '' --quick-gen-key 'fluxcdbot <fluxcdbot@users.noreply.github.com>' default default 1y

KEY_ID=$(gpg --list-secret-keys --with-colons fluxcdbot@users.noreply.github.com | awk -F: '/^sec:/ {print $5; exit}')

# Private key -> Secret used by ImageUpdateAutomation to sign commits.
# Flux expects the armored private key in a field named `git.asc`.
# Streamed directly into kubectl; never touches /tmp or a persistent FS path.
kubectl create secret generic flux-bot-gpg \
  --namespace=flux-system \
  --from-literal="git.asc=$(gpg --export-secret-keys --armor "$KEY_ID")"

# Public key -> ADD to git-signing-keys so GitRepository.verify accepts bot
# commits. This is the critical part: the existing git-signing-keys Secret
# already contains every human committer's public key. A naive `create secret
# ... --dry-run=client | kubectl apply -f -` REPLACES the Secret with only
# fluxbot.asc, evicting every human key, after which Flux rejects every
# human-signed commit and the cluster stops reconciling.
#
# Correct pattern: patch the existing Secret so the bot key is added alongside
# the humans' keys. `kubectl patch` with a strategic merge on `data` keeps
# every other key intact.
FLUXBOT_PUB_B64=$(gpg --export --armor "$KEY_ID" | base64 -w0)
kubectl patch secret git-signing-keys \
  --namespace=flux-system \
  --type=strategic \
  -p "{\"data\":{\"fluxbot.asc\":\"$FLUXBOT_PUB_B64\"}}"
# If git-signing-keys does not yet exist, create it once from a directory that
# holds every committer's .asc file (bot + humans) and commit that directory
# to a sealed-secrets / SOPS workflow:
#   kubectl -n flux-system create secret generic git-signing-keys \
#     --from-file=keys/
# Never rebuild the Secret from scratch in a script -- the pattern must be
# additive.

# Tear down the ephemeral GNUPGHOME. Plain `rm -rf` is the right tool here;
# `shred` is ineffective on journaled/COW filesystems (ext4, btrfs, zfs, APFS)
# and on tmpfs it's a no-op. Rely on disk encryption at rest for the residue.
unset GNUPGHOME
rm -rf "$EPHEMERAL_HOME"

If you’d rather not manage a bot signing key, skip signingKey entirely — but then exclude the bot from git-signing-keys and route every bot commit through the PR flow. The human-signed merge commit becomes what GitRepository.verify sees on main, so the trust anchor is the reviewer’s key, not the bot’s.

Here the bot signs its own commits (so commit-signature verification on GitRepository still holds), pushes to flux-image-updates, and a CI job or GitHub automation opens a PR against main. The PR runs policy checks, waits for review, and only then merges — at which point Flux reconciles the change. For staging and dev I’ll still push.branch: main and skip the PR, because the whole point of lower environments is fast feedback. For production, the extra hop is worth it. The Git history shows exactly when each image version was deployed, and every deployment is traceable to an approved PR.
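One way to wire the PR half of that flow, sketched as a GitHub Actions workflow. This is an assumption, not part of Flux, and the default token can only open PRs if the repository setting “Allow GitHub Actions to create and approve pull requests” is enabled:

```yaml
# .github/workflows/open-image-update-pr.yaml
name: open-image-update-pr
on:
  push:
    branches: [flux-image-updates]
permissions:
  contents: read
  pull-requests: write
jobs:
  open-pr:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Open (or update) the promotion PR
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh pr create \
            --base main \
            --head flux-image-updates \
            --title "Flux image updates" \
            --body "Automated image tag updates; review before merge." \
          || echo "PR already exists; this push updated it."
```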

Advanced Patterns

Multi-Tenancy

When multiple teams share a cluster, each team gets its own namespace with RBAC isolation — and, critically, their Flux Kustomization runs as a scoped ServiceAccount, not as the default flux-system controller SA. This is the point where most GitOps multi-tenancy setups fail quietly:

# tenants/team-a/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    # Enforce restricted PodSecurity so a wildcard Role can't spawn privileged pods
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
---
# tenants/team-a/reconciler-sa.yaml
# This is the SA Flux impersonates when reconciling team-a's manifests.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: team-a-reconciler
  namespace: team-a
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-reconciler
  namespace: team-a
subjects:
  - kind: ServiceAccount
    name: team-a-reconciler
    namespace: team-a
roleRef:
  kind: Role
  name: team-a-namespace-admin
  apiGroup: rbac.authorization.k8s.io
---
# tenants/team-a/netpol-default-deny.yaml
# Default-deny ingress + egress for the namespace. Required mitigation (2)
# referenced by the Role below. Without this, a compromised tenant pod can
# reach the cluster API, cloud metadata (169.254.169.254), and sibling tenants.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: team-a
spec:
  podSelector: {}
  policyTypes: ["Ingress", "Egress"]
  # No ingress/egress rules = deny-all. Tenants add explicit allow rules
  # (DNS, specific services) on top of this baseline.
---
# tenants/team-a/rbac.yaml
# Namespace-scoped Role. Wildcards here are acceptable ONLY because all
# three mitigations are in place:
# (1) PodSecurity=restricted blocks privileged pods (Namespace labels above),
# (2) default-deny NetworkPolicy above caps lateral/egress movement,
# (3) the RoleBinding above is the ONLY binding to this Role.
# If you can't guarantee all three, enumerate verbs instead of "*" and
# explicitly EXCLUDE "escalate", "bind", and "impersonate" -- those three
# let a holder of this Role grant themselves additional permissions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-namespace-admin
  namespace: team-a
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]

The critical field on the Kustomization is spec.serviceAccountName. Without it, Flux reconciles tenant manifests using the kustomize-controller SA in flux-system, which has cluster-admin. That means any manifest team-a commits — including a ClusterRoleBinding granting themselves cluster-admin — gets applied with cluster-admin privileges. The namespace boundary becomes cosmetic:

# tenants/team-a/flux.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-a
  namespace: flux-system
spec:
  interval: 10m
  path: ./tenants/team-a
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  targetNamespace: team-a
  # This is the line that makes multi-tenancy real.
  # Flux will impersonate this SA; any resource it tries to create
  # outside team-a's RBAC is rejected by the API server.
  serviceAccountName: team-a-reconciler

With serviceAccountName set, the API server enforces the tenant boundary for you: if team-a commits a ClusterRoleBinding, Flux tries to apply it as system:serviceaccount:team-a:team-a-reconciler, the API server says no, and reconciliation fails with a permission error. That failure is visible, auditable, and non-damaging. Skipping this field is the single most common Flux multi-tenancy mistake, and it turns a “tenant namespace” into a decorative label.
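If you run many tenants, don't rely on every Kustomization remembering to set the field — Flux supports enforcing it cluster-wide via its documented multi-tenancy lockdown flags. A sketch of the bootstrap patch (path and controller list per the Flux docs; adjust to your repo layout):

```yaml
# clusters/production/flux-system/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Tenants can no longer point a Kustomization or HelmRelease at a
  # Source in another namespace (e.g. flux-system's repo credentials).
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --no-cross-namespace-refs=true
    target:
      kind: Deployment
      name: "(kustomize-controller|helm-controller|notification-controller)"
  # If serviceAccountName is omitted, impersonate "default" in the
  # target namespace instead of falling back to cluster-admin.
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --default-service-account=default
    target:
      kind: Deployment
      name: "(kustomize-controller|helm-controller)"
```

With `--default-service-account` set, forgetting `serviceAccountName` fails safe: reconciliation runs with the near-empty `default` SA rather than cluster-admin.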

Promotion Workflows

For regulated environments, changes must flow through dev, staging, then production. I’ve used both branch-based and path-based promotion, and I strongly prefer path-based.

Branch-based promotion (each environment syncs from a different branch):

# clusters/development/flux-system/gotk-sync.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  ref:
    branch: development
  url: ssh://git@github.com/organization/flux-infrastructure
  secretRef:
    name: flux-system

Path-based promotion (all environments on main, different paths) is simpler and what I recommend:

# clusters/development/flux-system/gotk-sync.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  ref:
    branch: main
  url: ssh://git@github.com/organization/flux-infrastructure
  secretRef:
    name: flux-system
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/development
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system

Promotion is a PR that copies config from one path to another. With branch-based promotion, you end up with merge conflicts and cherry-pick headaches. Concrete example: you land a dev-only experiment on the development branch that touches values.yaml for ingress-nginx. Two weeks later you want to promote an unrelated fix on the same file to staging and production. Now you’re cherry-picking individual commits across three long-lived branches, each of which has drifted independently, and Git happily gives you a three-way conflict every time someone forgot which branch was ahead. Path-based keeps everything on main and promotion is just a file copy — much cleaner.
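As a concrete sketch of that file copy (paths and filenames are illustrative — match your own repo layout):

```shell
# Promote the reviewed ingress-nginx change from staging to production.
# Everything happens on a short-lived branch off main -- no long-lived
# environment branches to drift or cherry-pick across.
git checkout -b promote-ingress-nginx
cp clusters/staging/ingress-nginx/values.yaml \
   clusters/production/ingress-nginx/values.yaml
git add clusters/production/ingress-nginx/values.yaml
git commit -m "Promote ingress-nginx fix to production"
# Open a PR from this branch; once it merges to main, the production
# cluster's Kustomization (path: ./clusters/production) picks it up.
```

The diff in the PR is exactly the delta between environments, which also makes review trivial.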

Policy Enforcement with Kyverno

GitOps handles how changes get deployed. But you also need to enforce what can be deployed. I pair Flux with Kyverno for policy enforcement.

Kyverno is a cluster-wide admission controller — a compromised Kyverno chart means an attacker can rewrite or silently bypass every admission policy, which is effectively full-cluster takeover. For admission-controller charts, I pull them as a cosign-verified OCI artifact rather than a traditional HelmRepository, so Flux refuses to reconcile an unsigned chart:

# infrastructure/base/policy/kyverno.yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: kyverno
  namespace: flux-system
spec:
  interval: 1h
  url: oci://ghcr.io/kyverno/charts/kyverno
  ref:
    semver: "3.2.x"  # illustrative -- pin to a version you've reviewed
  verify:
    provider: cosign
    secretRef:
      name: kyverno-cosign-pub  # Secret holding the Kyverno project's cosign.pub
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kyverno
  namespace: kyverno
spec:
  interval: 1h
  chartRef:
    kind: OCIRepository
    name: kyverno
    namespace: flux-system
  values:
    admissionController:
      replicas: 3

Versions above are illustrative — cross-check current security advisories and the project’s published cosign key before pinning. For non-cluster-wide charts (app-level releases), a regular HelmRepository is usually fine; for anything that gates admission, signs images, or holds cluster-admin, OCIRepository + spec.verify is the baseline.

Kyverno policies run as admission controllers — they reject resources that violate your rules before they’re created:

# infrastructure/base/policy/require-labels.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce  # Pascal-case; lowercase is deprecated in Kyverno v1.10+
  rules:
  - name: require-team-label
    match:
      # match.resources without any/all is deprecated since Kyverno 1.8
      any:
      - resources:
          kinds:
          - Deployment
          - Service
    validate:
      message: "The label 'team' is required"
      pattern:
        metadata:
          labels:
            team: "?*"

If someone submits a Deployment without a team label, the admission controller rejects it. Combined with Flux, this means the GitOps reconciliation will fail and Flux will report the error — giving you a clear signal that the manifest in Git doesn’t comply.
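For reference, a minimal Deployment that passes the policy — the only thing the `"?*"` pattern requires is a non-empty team label (all names and the image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: team-a
  labels:
    team: team-a  # satisfies the "?*" pattern (any non-empty value)
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/api:1.0.0  # illustrative image
```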

Monitoring

Flux exposes Prometheus metrics. I scrape them with a PodMonitor:

# infrastructure/base/monitoring/flux-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flux-system
  namespace: monitoring
spec:
  selector:
    # Match all Flux controllers, not just one -- each exports its
    # own reconciliation metrics on the http-prom port.
    matchExpressions:
    - key: app
      operator: In
      values:
      - source-controller
      - kustomize-controller
      - helm-controller
      - notification-controller
  podMetricsEndpoints:
  - port: http-prom
    interval: 15s
  namespaceSelector:
    matchNames:
    - flux-system

The metrics I alert on: reconciliation failures (something in Git doesn’t apply cleanly), reconciliation duration spikes (the cluster is struggling to converge), and source fetch failures (Git or Helm repo is unreachable). A Grafana dashboard showing these three things gives you full visibility into your GitOps pipeline health.
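The reconciliation-failure alert can be expressed as a PrometheusRule. A sketch using the controller-runtime error counter, which every Flux controller exports (names and thresholds are illustrative; the Flux-specific gotk_* metrics vary by release, so check your controllers' /metrics output before relying on them):

```yaml
# infrastructure/base/monitoring/flux-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: flux-alerts
  namespace: monitoring
spec:
  groups:
    - name: flux
      rules:
        - alert: FluxReconciliationErrors
          # Any Flux controller logging reconcile errors for 10 minutes.
          expr: sum(rate(controller_runtime_reconcile_errors_total{namespace="flux-system"}[5m])) by (controller) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Flux controller {{ $labels.controller }} is failing to reconcile"
```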

Lessons Learned

After running Flux in production across multiple clusters, here’s what I wish I’d known at the start:

Start with prune: false. When you’re first setting up Flux, disable pruning until you’re confident in your manifests. Pruning means Flux deletes resources that are no longer in Git — which is exactly what you want eventually, but terrifying when you’re still learning the repo structure. Enable it once you trust the workflow.

Pin your Helm chart versions. Never use version: "*" or omit the version field. A Helm chart upgrade that you didn’t review will eventually break something. Pin versions, update deliberately, test in staging first.

Use dependsOn liberally. Infrastructure before apps. CRDs before the controllers that use them. cert-manager before anything that needs TLS. The dependency graph is your safety net.
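dependsOn is a first-class field on the Kustomization. A sketch, with the names (infrastructure, apps) assumed from a typical repo layout:

```yaml
# clusters/production/apps.yaml (illustrative names and paths)
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  # Don't reconcile apps until the infrastructure Kustomization
  # (CRDs, cert-manager, ingress) reports Ready.
  dependsOn:
    - name: infrastructure
  interval: 10m
  path: ./apps/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```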

Path-based promotion over branch-based. I’ve tried both. Branch promotion creates merge conflicts and makes it hard to see the current state of all environments at once. Path-based keeps everything on main and promotion is just a file copy in a PR.

GitOps does not mean secure by default. The audit trail is only as good as your branch protection and commit-signature verification. The tenancy model is only as good as the reconciler ServiceAccount it impersonates. The image automation is only as safe as the signatures you verify. I’ve watched more than one team adopt Flux, skip all three, and treat the result as “we have GitOps now” — right up until a compromised developer token or a typosquatted image made the point for them. Budget a day for the security controls during bootstrap, not later.

The hardest part isn’t technical. It’s convincing teams to stop SSH-ing into boxes and running kubectl directly. The first time someone manually edits a resource and Flux reverts it, there will be frustration. That’s the system working as designed. Once people internalize that Git is the only way to make changes, everything gets better.

GitOps with Flux is the best operational pattern I’ve adopted in the last five years. The audit trail alone is worth it — but the real payoff is the confidence that what’s in Git is what’s running in your cluster. No drift, no surprises, no more weekend debugging sessions caused by stale local manifests.
