Every Kubernetes cluster I audit has the same problem. Resource requests and limits were guessed once during initial deployment and never touched again. A service asks for 2 CPU and 4Gi of memory, peaks at 200m and 500Mi, and nobody notices because the dashboards are green. Multiply that by 200 services across three environments and you’re paying for tens of thousands of dollars of idle capacity every month.
The Vertical Pod Autoscaler (VPA) helps at the pod level but can’t express org-wide policy, and its Auto mode restarts pods in ways that make production teams nervous. The solution that has worked for me on multiple client engagements is a custom controller: watch a CRD, read real usage from Prometheus, patch Deployment requests with guard rails. It’s the single highest-ROI piece of infrastructure code I’ve shipped.
This post is about how to build that kind of controller without shooting yourself in the foot. I’ll use a resource-optimization operator as the running example, but the patterns apply to any controller you build. The controller is the easy part. The discipline around reconcile semantics, watches, idempotency, and blast radius is what separates a controller that saves you money from one that takes down your cluster at 2am.
One thing up front: I’m showing you the shape of the code, not a drop-in operator. Build your own from these patterns and you’ll understand the reconciliation loop. Copy-paste without understanding and you’ll ship an incident when someone applies a malformed CR.
What Can Actually Go Wrong
Before any code, name the threats. Controllers run with broad RBAC, react to cluster-wide events, and write back to the API server. The failure modes are specific and worth enumerating:
| Failure | What happens | Defense |
|---|---|---|
| Runaway reconciliation | Controller requeues too aggressively, burns CPU, hammers the API server, trips client-side rate limiter | Bounded requeue, rate-limited workqueue, exponential backoff on errors |
| Update storm | Controller patches N deployments in a tight loop, cascades pod restarts | Predicates on spec-changed only, status subresource, no-op detection |
| Panic on malformed CR | User applies a CR missing a required field, controller crashes, manager restarts loop, nothing reconciles | Nil guards on every optional field, CRD validation, admission webhook |
| Stale cache decisions | Controller reads its own informer cache, writes based on stale state, loses the race with another writer | resourceVersion conflict handling, retry on conflict, never trust the cache for mutations |
| Controller OOMs the cluster | Informer holds every Pod in cluster memory, controller gets OOMKilled, gets rescheduled, holds every Pod again | Scoped watches with label selectors, field selectors, or per-namespace caches |
| Silent reconcile failures | Errors swallowed, user sees a CR with no status, no events | Status conditions, event recording, structured logs |
| Conflicting controllers | Two controllers fight over the same field (HPA vs VPA vs your optimizer) | Field ownership via server-side apply, explicit conflict resolution in spec |
| Cluster-admin blast radius | Compromised controller = compromised cluster | Narrow RBAC per-verb per-resource, no wildcards |
Every pattern below defends against one of these. If a pattern isn’t on this list, think twice before adding it.
Why controller-runtime, Not Raw client-go
You can write a controller against client-go directly. You’ll end up reinventing an informer factory, a workqueue, leader election, webhook scaffolding, and a dozen other things. controller-runtime gives you all of that in an opinionated package, with the same primitives Kubebuilder generates on top of.
The mental model is simple: the manager owns a shared informer cache and a client that reads from it. You register one or more controllers. Each controller has a reconciler and a set of watches. Events from the watches land on a rate-limited workqueue keyed by the object’s NamespacedName. The reconciler pulls one key at a time, fetches the current object, and does the work.
The thing that trips people up is that the reconciler receives a key, not an object. By the time you call Get, the object may already have moved on, been deleted, or been updated. Your reconcile must be safe to call on a key for any reason at any time. This is idempotency, and it’s the single most important property of a controller.
Custom Resource: When To Define a CRD
Not every automation needs a CRD. I define one when:
- The policy needs to be queryable and auditable as a first-class object (`kubectl get resourceoptimizationpolicies`).
- The policy has status that the controller reports back (observed generation, last reconcile time, conditions).
- The policy is declarative config the user edits repeatedly.
If the policy fits in a ConfigMap or a handful of annotations and nobody needs to see its status, skip the CRD. ConfigMaps compose with Helm and Kustomize trivially; CRDs add a whole API surface to maintain.
For resource optimization, a CRD earns its keep. Here’s the type in trimmed form:
// pkg/apis/optimization/v1alpha1/types.go
package v1alpha1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:printcolumn:name="Optimized",type=integer,JSONPath=`.status.resourcesOptimized`
// +kubebuilder:printcolumn:name="LastRun",type=date,JSONPath=`.status.lastOptimizationTime`
type ResourceOptimizationPolicy struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec ResourceOptimizationPolicySpec `json:"spec"`
Status ResourceOptimizationPolicyStatus `json:"status,omitempty"`
}
type ResourceOptimizationPolicySpec struct {
// Target selects deployments to optimize. LabelSelector is required — a policy
// with no selector would match every deployment in scope, which is almost
// always a bug. Enforced in validation below.
Target ResourceSelector `json:"target"`
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=168
UpdateFrequencyHours int `json:"updateFrequencyHours"`
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=720
ObservationWindowHours int `json:"observationWindowHours"`
CPU OptimizationConfig `json:"cpu"`
Memory OptimizationConfig `json:"memory"`
// MinResources / MaxResources are safety bounds. The controller must
// never emit a patch that exceeds these, even if Prometheus says so.
MinResources ResourceRequirements `json:"minResources"`
MaxResources ResourceRequirements `json:"maxResources"`
}
type ResourceSelector struct {
// +kubebuilder:validation:MinItems=1
Namespaces []string `json:"namespaces"`
// +kubebuilder:validation:Required
LabelSelector *metav1.LabelSelector `json:"labelSelector"`
}
Three decisions baked into this type:
- `UpdateFrequencyHours` has kubebuilder min/max validation. A zero value would make the controller requeue immediately in a tight loop; I’ve debugged exactly that bug twice in clients’ clusters. Enforcing minimum 1 at admission time is cheaper than defending against it at reconcile time — though I still do both.
- `LabelSelector` is required, not optional. A missing selector matches every pod and silently expands scope. Make the user opt in explicitly.
- `MinResources`/`MaxResources` are safety bounds. The controller caps its own writes. The policy author can’t typo a Prometheus query and get 100 CPUs provisioned.
The CRD validation catches the obvious cases at kubectl apply time. It doesn’t replace in-code defensive checks — API server validation is defense in depth, not defense in place.
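The bounds enforcement itself is small enough to spell out. Here’s a hedged sketch of the clamp that `computeDesiredResources` should apply before any patch is emitted — the real code works on `resource.Quantity`; plain millicore integers here keep the sketch self-contained:

```go
package main

import "fmt"

// clamp forces a suggested value into [min, max]. Whatever Prometheus
// says, the controller's output never escapes the policy author's
// MinResources/MaxResources bounds.
func clamp(suggested, min, max int64) int64 {
	if suggested < min {
		return min
	}
	if suggested > max {
		return max
	}
	return suggested
}

func main() {
	// A query typo summed usage across all pods and suggested 100 CPUs
	// (100000m); the policy caps at 4000m.
	fmt.Println(clamp(100000, 100, 4000)) // 4000
	fmt.Println(clamp(50, 100, 4000))     // 100 — never below the floor
}
```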
Defaulting and Validation: Do Both
There are two places to enforce invariants on a CR: at admission (via a validating/defaulting webhook or CRD OpenAPI schema) and in the controller itself. Do both. Here’s why.
Admission webhooks catch bad input at apply time, which gives the user a fast feedback loop. They’re the right place for defaults, range checks, and cross-field constraints. They do not protect against CRs that were applied before the webhook existed, or against webhook bypasses (fail-open webhooks exist).
In-controller validation catches whatever slipped through. If a reconcile receives a CR that should be invalid, log it, set a Degraded condition, and return without requeueing by error — a bad spec won’t fix itself, so don’t panic, don’t crash the process, don’t spam events.
Here’s a constructor-level defaulter and validator I add to the controller:
// pkg/controller/optimization/validate.go
package optimization
import (
"errors"
"fmt"
optimizationv1alpha1 "github.com/example/resource-optimizer/pkg/apis/optimization/v1alpha1"
)
var errInvalidPolicy = errors.New("invalid policy")
// tenantAllowlist maps a policy's own namespace to the namespaces it is
// permitted to target. This is the enforcement point for tenancy: without
// it, a tenant in team-a can write a policy that targets kube-system or
// another tenant's namespaces and coerce the controller into patching
// workloads that carry the managed label cluster-wide. Load this from a
// ConfigMap or operator config; hardcoded here for clarity.
var tenantAllowlist = map[string][]string{
"team-a-optimizer": {"team-a", "team-a-stage"},
"team-b-optimizer": {"team-b"},
}
const (
maxNamespaces = 32
maxMatchExpressions = 16
maxLabelValueCardinal = 64
)
// normalize applies defaults and validates the spec.
// Returns a copy — never mutate the object you got from the cache.
func normalize(in *optimizationv1alpha1.ResourceOptimizationPolicy) (*optimizationv1alpha1.ResourceOptimizationPolicy, error) {
if in == nil {
return nil, errInvalidPolicy
}
out := in.DeepCopy()
// Defaults. These duplicate the webhook but protect against CRs that
// predate the webhook or were applied with it disabled.
if out.Spec.UpdateFrequencyHours <= 0 {
out.Spec.UpdateFrequencyHours = 24
}
if out.Spec.ObservationWindowHours <= 0 {
out.Spec.ObservationWindowHours = 168 // one week
}
// Nil guards. LabelSelector is declared required, but a bad actor or
// a migration glitch can still produce a CR without it. A missing
// selector would match everything — refuse to proceed.
if out.Spec.Target.LabelSelector == nil {
return nil, fmt.Errorf("%w: spec.target.labelSelector is required", errInvalidPolicy)
}
if len(out.Spec.Target.Namespaces) == 0 {
return nil, fmt.Errorf("%w: spec.target.namespaces must contain at least one namespace", errInvalidPolicy)
}
if len(out.Spec.Target.Namespaces) > maxNamespaces {
return nil, fmt.Errorf("%w: spec.target.namespaces exceeds cap of %d", errInvalidPolicy, maxNamespaces)
}
// Tenancy: the policy may only target namespaces allowlisted for
// its own namespace. This is the boundary that prevents a tenant's
// CR from coercing the controller into patching kube-system or
// another tenant's workloads.
allowed, ok := tenantAllowlist[out.Namespace]
if !ok {
return nil, fmt.Errorf("%w: no tenant allowlist for policy namespace %q", errInvalidPolicy, out.Namespace)
}
allowedSet := make(map[string]struct{}, len(allowed))
for _, n := range allowed {
allowedSet[n] = struct{}{}
}
for _, ns := range out.Spec.Target.Namespaces {
if _, permitted := allowedSet[ns]; !permitted {
return nil, fmt.Errorf("%w: policy in %q may not target namespace %q", errInvalidPolicy, out.Namespace, ns)
}
}
// Cap selector complexity. matchExpressions is attacker-authored,
// runs on every list call, and Values cardinality bounds the memory
// cost of the label index lookup. Without a cap, a tenant can ship
// a pathological selector that amplifies every reconcile.
sel := out.Spec.Target.LabelSelector
if len(sel.MatchExpressions) > maxMatchExpressions {
return nil, fmt.Errorf("%w: matchExpressions exceeds cap of %d", errInvalidPolicy, maxMatchExpressions)
}
for _, expr := range sel.MatchExpressions {
if len(expr.Values) > maxLabelValueCardinal {
return nil, fmt.Errorf("%w: matchExpressions[%q].values exceeds cap of %d", errInvalidPolicy, expr.Key, maxLabelValueCardinal)
}
}
// Safety bounds must be consistent.
if err := validateBounds(out.Spec.MinResources, out.Spec.MaxResources); err != nil {
return nil, fmt.Errorf("%w: %s", errInvalidPolicy, err)
}
return out, nil
}
Two bugs I’ve actually debugged in production:
- `UpdateFrequencyHours` set to zero. Without the guard, `timeUntilNextRun` returns 0, the reconciler requeues immediately, and the controller pins a core of the API server. Caught it because the apiserver’s audit log filled a disk.
- Missing `LabelSelector`. Without the guard, the controller’s list call returned every deployment in the selected namespaces and the controller tried to optimize all of them, including the monitoring stack. The next resource update nearly restarted Prometheus at the exact moment it was needed for debugging.
Both bugs took me an hour to find and five minutes to fix. Five minutes of upfront defensive code would have saved both hours.
Watches and Predicates: Filter Aggressively
A controller that reconciles on every event from every resource it watches will melt. The workqueue doesn’t care about event volume; it cares about work volume. Filter events before they become work.
controller-runtime gives you predicates for this. Here’s the controller setup for our operator:
// pkg/controller/optimization/controller.go
package optimization
import (
"context"
"fmt"
"strings"
"time"
appsv1 "k8s.io/api/apps/v1"
apierrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/client-go/tools/record"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/builder"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/event"
"sigs.k8s.io/controller-runtime/pkg/handler"
"sigs.k8s.io/controller-runtime/pkg/manager"
"sigs.k8s.io/controller-runtime/pkg/predicate"
optimizationv1alpha1 "github.com/example/resource-optimizer/pkg/apis/optimization/v1alpha1"
"github.com/example/resource-optimizer/pkg/metrics"
)
type Reconciler struct {
client.Client
Scheme *runtime.Scheme
Recorder record.EventRecorder
Metrics metrics.Provider
}
func SetupWithManager(mgr manager.Manager, m metrics.Provider) error {
r := &Reconciler{
Client: mgr.GetClient(),
Scheme: mgr.GetScheme(),
Recorder: mgr.GetEventRecorderFor("resource-optimizer"),
Metrics: m,
}
return ctrl.NewControllerManagedBy(mgr).
Named("resource-optimizer").
For(&optimizationv1alpha1.ResourceOptimizationPolicy{},
builder.WithPredicates(policyChanged())).
Watches(
&appsv1.Deployment{},
handler.EnqueueRequestsFromMapFunc(r.mapDeploymentToPolicies),
builder.WithPredicates(deploymentResourcesChanged()),
).
Complete(r)
}
// policyChanged ignores status-only updates. Without this, every status write
// we do ourselves would re-enqueue the object. Classic self-inflicted update storm.
func policyChanged() predicate.Predicate {
return predicate.Funcs{
UpdateFunc: func(e event.UpdateEvent) bool {
oldP := e.ObjectOld.(*optimizationv1alpha1.ResourceOptimizationPolicy)
newP := e.ObjectNew.(*optimizationv1alpha1.ResourceOptimizationPolicy)
return oldP.Generation != newP.Generation
},
}
}
// deploymentResourcesChanged only re-enqueues when container resources or replicas changed.
// Image bumps, annotation churn, and rollout status changes are irrelevant to us.
func deploymentResourcesChanged() predicate.Predicate {
return predicate.Funcs{
UpdateFunc: func(e event.UpdateEvent) bool {
oldD := e.ObjectOld.(*appsv1.Deployment)
newD := e.ObjectNew.(*appsv1.Deployment)
return !resourcesEqual(oldD, newD)
},
}
}
The policyChanged predicate compares Generation, not ResourceVersion. The status subresource does not bump Generation; only spec changes do. This means a controller that writes status on every reconcile will not re-enqueue itself. If you compare ResourceVersion instead, you get an infinite self-retrigger loop. I’ve watched a junior engineer ship that bug and DDoS their own cluster.
The deploymentResourcesChanged predicate is where you avoid doing work for irrelevant events. An image tag bump from CI shouldn’t wake the resource optimizer. A replica count change might, depending on your domain logic — I’ve picked “no” because my optimizer operates on per-pod averages, and a scale-out doesn’t change the per-pod request.
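The setup above references `mapDeploymentToPolicies` without showing it. It’s the fan-in point behind `EnqueueRequestsFromMapFunc`: one deployment event can concern several policies, and each matching policy key becomes one workqueue item. Here’s a hedged stdlib-only model — the real version lists policies from the informer cache and matches with `metav1.LabelSelector`; the `nn` and `policy` types are stand-ins:

```go
package main

import "fmt"

// nn stands in for types.NamespacedName; policy models the fields of a
// ResourceOptimizationPolicy that matter for routing.
type nn struct{ Namespace, Name string }

type policy struct {
	Key        nn
	Namespaces []string          // spec.target.namespaces
	Selector   map[string]string // simplified matchLabels
}

// mapDeploymentToPolicies returns the keys of every policy whose target
// covers the deployment, so a deployment change re-enqueues only the
// policies that care about it.
func mapDeploymentToPolicies(depNS string, depLabels map[string]string, policies []policy) []nn {
	var out []nn
	for _, p := range policies {
		if !contains(p.Namespaces, depNS) {
			continue
		}
		if matches(p.Selector, depLabels) {
			out = append(out, p.Key)
		}
	}
	return out
}

func contains(ss []string, s string) bool {
	for _, v := range ss {
		if v == s {
			return true
		}
	}
	return false
}

func matches(sel, lbls map[string]string) bool {
	for k, v := range sel {
		if lbls[k] != v {
			return false
		}
	}
	return true
}

func main() {
	ps := []policy{{
		Key:        nn{"team-a-optimizer", "default"},
		Namespaces: []string{"team-a"},
		Selector:   map[string]string{"resource-optimizer.example.com/managed": "true"},
	}}
	got := mapDeploymentToPolicies("team-a",
		map[string]string{"resource-optimizer.example.com/managed": "true"}, ps)
	fmt.Println(len(got)) // 1
}
```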
The Reconcile Loop
Here’s the reconciler itself. This is where the idempotency discipline shows up.
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := ctrl.LoggerFrom(ctx).WithValues("policy", req.NamespacedName)
var raw optimizationv1alpha1.ResourceOptimizationPolicy
if err := r.Get(ctx, req.NamespacedName, &raw); err != nil {
if apierrors.IsNotFound(err) {
// Policy was deleted. Informer may still replay events for a while.
return ctrl.Result{}, nil
}
return ctrl.Result{}, fmt.Errorf("get policy: %w", err)
}
policy, err := normalize(&raw)
if err != nil {
// Bad CR. Set a condition and do NOT requeue by error —
// the user has to edit the spec, and a bad CR isn't a transient failure.
log.Error(err, "policy failed validation")
r.Recorder.Event(&raw, "Warning", "InvalidSpec", err.Error())
return ctrl.Result{}, r.setDegraded(ctx, &raw, err)
}
// Respect the user-specified cadence. This is idempotent — if we've
// already run recently, we simply requeue for the remaining interval.
if wait := timeUntilNextRun(policy); wait > 0 {
return ctrl.Result{RequeueAfter: wait}, nil
}
targets, err := r.listTargets(ctx, policy)
if err != nil {
return ctrl.Result{}, fmt.Errorf("list targets: %w", err)
}
optimized := 0
var firstErr error
for i := range targets {
// safeReconcile wraps reconcileDeployment with recover() so a panic
// in one deployment (bad Prometheus response, math edge case, nil deref)
// doesn't bypass writeStatus and freeze the CR's conditions at their
// previous value. Without this, operators see "Ready=True" while the
// manager restarts in a crash loop and nothing is actually reconciling.
changed, err := r.safeReconcile(ctx, policy, &targets[i])
if err != nil {
log.Error(err, "reconcile deployment", "deployment", targets[i].Name)
if firstErr == nil {
firstErr = err
}
continue
}
if changed {
optimized++
}
}
// Status update is the last thing we do, and only with a patch —
// never Update(), which would race with other writers. Because
// safeReconcile swallows panics into errors, this line always runs.
if err := r.writeStatus(ctx, policy, optimized, firstErr); err != nil {
return ctrl.Result{}, fmt.Errorf("write status: %w", err)
}
// Requeue for the next scheduled run. Errors trigger exponential backoff
// from the workqueue; this path is the happy-path cadence.
next := time.Duration(policy.Spec.UpdateFrequencyHours) * time.Hour
return ctrl.Result{RequeueAfter: next}, firstErr
}
Things worth internalizing from this function:
- We use the normalized copy everywhere downstream, and pass the raw object only to the event recorder (the recorder needs the original UID and ResourceVersion). Never mutate the cached object — the informer cache is shared, and mutation causes heisenbugs that only show up under concurrent reconciles.
- Validation errors don’t requeue by error. A bad spec is not a transient failure; re-running in five seconds won’t fix it. We set a condition, emit an event, and let the workqueue sit until the user edits the spec.
- Per-target errors don’t abort the reconcile. One deployment that can’t be patched shouldn’t block the other 49. We collect the first error, continue the loop, and return it at the end to trigger backoff.
- Status is written via patch, not Update. Two writers racing on Update will fight each other; patches don’t.
Idempotency: The Only Thing That Actually Matters
“Reconcile must be safe to call any number of times” is the framing you hear. The practical consequence: every write the controller makes must be a no-op if the desired state is already in place.
Before the inner loop, the panic boundary. A single bad deployment should not prevent the other 49 from reconciling, and it should never prevent writeStatus from running:
// safeReconcile wraps reconcileDeployment so a panic on any single target
// converts to an error the outer loop can record. Without this, a panic
// unwinds past writeStatus, the manager restarts, and the CR's conditions
// stay frozen at their last value — operators read "Ready=True" while
// nothing is actually being reconciled.
func (r *Reconciler) safeReconcile(
ctx context.Context,
policy *optimizationv1alpha1.ResourceOptimizationPolicy,
dep *appsv1.Deployment,
) (changed bool, err error) {
defer func() {
if rec := recover(); rec != nil {
// Log the panic value (add string(debug.Stack()) from runtime/debug if
// you want the full trace) but don't re-panic; let the outer loop
// record the failure and proceed to writeStatus.
ctrl.LoggerFrom(ctx).Error(nil, "reconcile panic recovered",
"deployment", dep.Name, "namespace", dep.Namespace, "panic", rec)
err = fmt.Errorf("panic reconciling %s/%s: %v", dep.Namespace, dep.Name, rec)
changed = false
}
}()
return r.reconcileDeployment(ctx, policy, dep)
}
For our optimizer, the inner loop itself:
func (r *Reconciler) reconcileDeployment(
ctx context.Context,
policy *optimizationv1alpha1.ResourceOptimizationPolicy,
dep *appsv1.Deployment,
) (changed bool, err error) {
desired, err := r.computeDesiredResources(ctx, policy, dep)
if err != nil {
return false, fmt.Errorf("compute resources: %w", err)
}
// No-op short-circuit. If the spec already matches what we'd set,
// don't write. This is the difference between a controller that
// reconciles at a 24h cadence and one that patches every 30 seconds.
if resourcesMatch(dep.Spec.Template.Spec.Containers, desired) {
return false, nil
}
patch := client.MergeFrom(dep.DeepCopy())
applyResources(dep, desired)
// Bound the API-server call. The outer reconcile context is long-lived
// (per-request timeouts live at the manager level), and a wedged API
// server should not hang the reconciler for every target in the loop.
patchCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
if err := r.Patch(patchCtx, dep, patch); err != nil {
if apierrors.IsConflict(err) {
// Stale cache write. The workqueue will requeue; next reconcile
// will see fresh state and decide again.
return false, nil
}
return false, fmt.Errorf("patch deployment %s/%s: %w", dep.Namespace, dep.Name, err)
}
// Record on the deployment (operator looking at the workload) AND the
// policy (tenant auditing their CR). Single-sided events force users
// to know which object to `kubectl describe` to find the trail.
r.Recorder.Eventf(dep, "Normal", "ResourcesOptimized",
"Policy %s/%s updated container resources", policy.Namespace, policy.Name)
r.Recorder.Eventf(policy, "Normal", "DeploymentOptimized",
"Updated resources on %s/%s", dep.Namespace, dep.Name)
return true, nil
}
The no-op short-circuit is the single most important line in the whole controller. Without it, every reconcile patches, every patch triggers a deployment rollout, and a 24-hour cadence turns into a rolling restart every time the controller’s informer resyncs. I’ve seen this bug ship. The pod restart rate tripled, SRE paged, nobody could explain it for a day.
The conflict handling matters too. Patch returns IsConflict when the ResourceVersion you sent is stale. That’s not an error — it means someone wrote to the deployment between your cache read and your write. Drop the error, let the workqueue bring you back, and reconcile against fresh state.
Status Subresource: Observed Generation and Conditions
The status subresource is how a controller communicates with the outside world. A CR with no status is a CR the user can’t debug.
There are two things the status must carry:
- `ObservedGeneration`: the `Generation` of the spec the controller has acted on. If `metadata.generation != status.observedGeneration`, the user knows the controller hasn’t caught up yet. This is what `kubectl rollout status` uses for every built-in resource.
- `Conditions`: a list of typed state reports with `type`, `status`, `lastTransitionTime`, `reason`, `message`. `Ready`, `Degraded`, `Progressing` are the usual suspects.
func (r *Reconciler) writeStatus(
ctx context.Context,
policy *optimizationv1alpha1.ResourceOptimizationPolicy,
optimized int,
reconcileErr error,
) error {
base := policy.DeepCopy()
policy.Status.ObservedGeneration = policy.Generation
policy.Status.ResourcesOptimized = optimized
policy.Status.LastOptimizationTime = metav1.Now()
cond := metav1.Condition{
Type: "Ready",
ObservedGeneration: policy.Generation,
LastTransitionTime: metav1.Now(),
}
if reconcileErr != nil {
cond.Status = metav1.ConditionFalse
cond.Reason = "ReconcileError"
// The underlying error embeds deployment names/namespaces that came
// from attacker-authored CRs via LabelSelector matches. Anything
// going into CR status is read by dashboards, shipped to log
// aggregators, and echoed on `kubectl describe`. Sanitize and cap.
cond.Message = sanitizeMessage(reconcileErr.Error())
} else {
cond.Status = metav1.ConditionTrue
cond.Reason = "ReconcileSucceeded"
cond.Message = fmt.Sprintf("optimized %d deployments", optimized)
}
setCondition(&policy.Status.Conditions, cond)
return r.Status().Patch(ctx, policy, client.MergeFrom(base))
}
// sanitizeMessage strips control characters and caps length before the
// message reaches CR status. Deployment names are user-authored through
// policy selection, and the status field is persisted + displayed by
// every tool that talks to the API server.
func sanitizeMessage(s string) string {
const maxBytes = 256
var b strings.Builder
b.Grow(len(s))
for _, r := range s {
if b.Len() >= maxBytes {
break
}
// Allow tab plus printable runes (not just ASCII); replace control
// characters (including CR/LF, which break kubectl table rendering) with space.
if r == '\t' || (r >= 0x20 && r != 0x7f) {
b.WriteRune(r)
} else {
b.WriteByte(' ')
}
}
out := b.String()
if len(out) > maxBytes {
// Back up to a rune boundary (0b10xxxxxx marks a UTF-8 continuation
// byte) so the ellipsis never splits a multibyte character.
cut := maxBytes - 3
for cut > 0 && out[cut]&0xC0 == 0x80 {
cut--
}
out = out[:cut] + "..."
}
return out
}
Two non-obvious things:
- Use `r.Status().Patch`, not `r.Patch`. The status subresource is a separate endpoint and writing spec fields here is silently ignored — or worse, allowed, depending on your CRD definition. Keep spec/status ownership disjoint.
- Set `LastTransitionTime` only when the status changes, not every reconcile. The helper `setCondition` compares against the existing condition and preserves the transition time on no-change. Users who hover over conditions want to see "failing for 3 hours," not "failing for 2 seconds" on every requeue.
The controller never writes to spec. Spec is the user’s property. Status is the controller’s report card. Violating this produces the worst kind of bug: the user changes a field, the controller changes it back, and everyone loses their mind for an afternoon.
Watches That Don’t OOM The Cluster
A default informer watches every object of the type it’s registered for, cluster-wide. For a Deployment watch in a 500-namespace cluster, that’s every deployment, held in the controller’s process memory, forever. Controllers have OOMKilled themselves out of clusters doing this.
Three knobs to tighten the watch:
// cmd/main.go
mgr, err := ctrl.NewManager(cfg, ctrl.Options{
Scheme: scheme,
Cache: cache.Options{
// DefaultNamespaces applies first: the cache only considers objects
// living in these namespaces.
DefaultNamespaces: map[string]cache.Config{
"team-a": {},
"team-b": {},
},
// ByObject composes multiplicatively on top of DefaultNamespaces.
// Here: deployments in team-a OR team-b that ALSO carry the
// managed label. If you want a Deployment watch that isn't
// restricted to DefaultNamespaces, set ByObject[&appsv1.Deployment{}]
// .Namespaces explicitly — it overrides DefaultNamespaces for that type.
ByObject: map[client.Object]cache.ByObject{
&appsv1.Deployment{}: {
Label: labels.SelectorFromSet(labels.Set{
"resource-optimizer.example.com/managed": "true",
}),
},
},
},
LeaderElection: true,
LeaderElectionID: "resource-optimizer.example.com",
})
The controller itself needs CPU and memory limits in its Deployment manifest, same as anything else. A controller with resources: {} will happily consume a node’s worth of memory during a cache rehydrate and take out your cluster’s control plane.
For leader-elected controllers, set Guaranteed QoS. That means requests == limits on both CPU and memory. A leader-elected controller that gets CPU-throttled or evicted under node pressure will lose its lease to another replica, incurring a reconcile outage while leadership transfers and the new leader warms its informer cache. Guaranteed QoS (requests: 500m/512Mi, limits: 500m/512Mi) puts the pod in the last eviction tier and keeps throttling predictable. Start there and tune both numbers together from observed usage; if the controller needs more than 512Mi, the watch scope is too wide.
Distinguishing Watch Errors
The old Watch()-per-source pattern had a real footgun: when a watch returned an error, you couldn’t tell which resource’s watch failed. The builder pattern above fixes this — each Watches call is named, and errors from setup bubble up with enough context to attribute. If you’re still doing manual c.Watch(...) calls in setup code, wrap each call:
import (
"fmt"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
"sigs.k8s.io/controller-runtime/pkg/source"
)
// c is controller.Controller, cache is cache.Cache from the manager,
// deploymentHandler / podHandler are handler.EventHandler values.
if err := c.Watch(source.Kind(cache, &appsv1.Deployment{}, deploymentHandler)); err != nil {
return fmt.Errorf("watch deployments: %w", err)
}
if err := c.Watch(source.Kind(cache, &corev1.Pod{}, podHandler)); err != nil {
return fmt.Errorf("watch pods: %w", err)
}
Wrapping each error with a %w that names the resource is cheap. Operating a controller whose startup fails with “watch failed” and no attribution is expensive.
Observability: Without Metrics, You’re Flying Blind
controller-runtime exposes a handful of metrics out of the box via its /metrics endpoint. The ones I always look at:
- `controller_runtime_reconcile_total` — throughput by result (`success`, `error`, `requeue`).
- `controller_runtime_reconcile_time_seconds` — histogram of reconcile duration. Anything over a second per reconcile is a smell; over ten seconds is a bug.
- `workqueue_depth` — how many keys are waiting. If this grows without bound, your reconciler can’t keep up.
- `workqueue_retries_total` — per-controller retry counts. Spiking retries mean transient errors are happening; climbing retries mean a CR is stuck in a permanent error.
I also add a domain metric per controller:
// pkg/metrics/metrics.go
package metrics
import "github.com/prometheus/client_golang/prometheus"
var (
DeploymentsOptimized = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "resource_optimizer_deployments_optimized_total",
Help: "Count of deployment patches emitted by the optimizer.",
},
[]string{"policy", "namespace", "resource"},
)
ResourceChangeRatio = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "resource_optimizer_change_ratio",
Help: "Ratio of new resource request to previous value.",
Buckets: []float64{0.25, 0.5, 0.75, 0.9, 1.0, 1.1, 1.5, 2.0, 4.0},
},
[]string{"policy", "resource"},
)
)
func MustRegister(reg prometheus.Registerer) {
reg.MustRegister(DeploymentsOptimized, ResourceChangeRatio)
}
The ResourceChangeRatio histogram is what tells me the controller is behaving. A healthy distribution sits near 1.0 (small corrections). A bimodal distribution with spikes at 0.25 and 4.0 means the controller is oscillating — almost always a bug in the observation window or target utilization.
RBAC: Don’t Hand Out Cluster-Admin
A compromised controller with cluster-admin is a compromised cluster. There are two shapes the RBAC can take, and the tenancy story drives the choice.
Preferred (multi-tenant): Role + RoleBinding per tenant namespace. This is the shape I reach for whenever the operator runs a multi-tenant story. The controller carries no cluster-scope deployment permissions at all. Each tenant namespace explicitly grants the operator’s ServiceAccount the right to patch its workloads, and nothing else. A malicious ResourceOptimizationPolicy in team-a cannot coerce the operator into touching kube-system because the operator literally has no verbs there.
# One Role per tenant namespace, granting the controller scoped write.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: resource-optimizer-tenant
namespace: team-a
rules:
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list", "watch", "patch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["create", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: resource-optimizer-tenant
namespace: team-a
subjects:
- kind: ServiceAccount
name: resource-optimizer
namespace: resource-optimizer-system
roleRef:
kind: Role
name: resource-optimizer-tenant
apiGroup: rbac.authorization.k8s.io
The policy CRD itself still needs a cluster-scoped read, because the controller watches CRs across namespaces (filtered in code by the tenant allowlist from normalize):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: resource-optimizer-policies
rules:
- apiGroups: ["optimization.example.com"]
resources: ["resourceoptimizationpolicies"]
verbs: ["get", "list", "watch"]
- apiGroups: ["optimization.example.com"]
resources: ["resourceoptimizationpolicies/status"]
verbs: ["get", "update", "patch"]
Fallback (single-tenant / cluster-wide operator): a narrow ClusterRole for deployments. Use this only when the operator genuinely owns every deployment in the cluster and no tenancy boundary exists:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: resource-optimizer-cluster
rules:
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list", "watch", "patch"]
- apiGroups: [""]
resources: ["events"]
verbs: ["create", "patch"]
Even then, the in-controller tenant allowlist in normalize still matters — cluster-wide deploy-patch with no application-level namespace guard means one bad CR can touch any workload that carries the managed label, including kube-system.
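That application-level guard is small enough to show in full. A sketch of what I mean — the function name, the hard-coded system denylist, and the default-deny behavior are my choices here; in the real controller the allowlist comes from the normalized policy spec, not a literal:

```go
package main

import "fmt"

// systemNamespaces are never patched, even if a CR lists them.
// Hypothetical guard: extend the set to match your control plane.
var systemNamespaces = map[string]bool{
	"kube-system":     true,
	"kube-public":     true,
	"kube-node-lease": true,
}

// namespaceAllowed is the application-level check that runs before any
// deployment patch, regardless of what RBAC would permit.
func namespaceAllowed(ns string, allowlist []string) bool {
	if systemNamespaces[ns] {
		return false // control-plane namespaces are denied unconditionally
	}
	for _, allowed := range allowlist {
		if ns == allowed {
			return true
		}
	}
	return false // default deny: unlisted namespaces are out of scope
}

func main() {
	fmt.Println(namespaceAllowed("team-a", []string{"team-a"}))           // true
	fmt.Println(namespaceAllowed("team-b", []string{"team-a"}))           // false
	fmt.Println(namespaceAllowed("kube-system", []string{"kube-system"})) // false
}
```

Even with the per-namespace Role shape, this check matters: it turns what would be an RBAC Forbidden error at patch time into a clean validation failure you can surface in the CR's status.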
Note the split in both shapes: deployments gets read-and-patch only (no create, no delete). deployments/scale is absent — the optimizer has no business changing replica counts. pods is absent — we read Prometheus for usage, not the pod objects themselves. The status subresource is a separate rule with its own verbs.
Every wildcard verb in your RBAC is a threat-model failure to name. If you find yourself writing verbs: ["*"], stop and list the verbs you actually use.
When NOT To Write a Controller
Controllers are tempting because they feel like the "Kubernetes-native" answer. They're not the answer for:
- One-shot bootstrapping. If the task runs once per install, a `Job` is fine. Don't build a controller to do what `kubectl apply -k` does on day one.
- Simple tag-based automation. "Label the pod, and it gets rotated nightly" can be a CronJob. Not everything needs a CRD.
- Policy enforcement. OPA/Gatekeeper and Kyverno exist. If your policy is “reject this at admission,” use a policy engine. Controllers are for closed-loop reconciliation, not for saying no.
- Anything that fits in CI/GitOps. If ArgoCD or Flux can do it by reconciling manifests from git, let them. A controller that watches deployments to apply config changes is a worse ArgoCD.
- Pure data transformation. A webhook is often the right tool for “mutate this object at apply time.” Controllers respond to cluster state; webhooks intervene in the apply path.
I use controllers when the desired state is computed from live runtime signals (metrics, cluster load, external systems) and has to be re-evaluated on an ongoing basis. Resource optimization fits — the answer changes every week. Installing a dashboard does not.
What I’d Actually Choose
If I’m building a controller today, these are my defaults:
Framework: controller-runtime directly. Kubebuilder is fine scaffolding but I usually don’t need its CLI after day one. The SetupWithManager pattern is all you need.
CRD versioning: Start at v1alpha1 and mean it. Commit to breaking compatibility within alpha — a single-alpha CRD you can’t migrate off of is a worse debt than a rewrite.
Defaulting: Kubebuilder markers for schema defaults, plus a normalize function inside the controller that catches what slipped through. Belt and suspenders.
Watches: Scoped with label selectors or namespace lists from day one. Never watch everything and filter in code — scope the watch itself.
Status: Conditions using metav1.Condition with ObservedGeneration on every condition. Don’t invent your own condition types — Ready, Degraded, Progressing, Available cover everything, and users recognize them.
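The mechanics of "ObservedGeneration on every condition" fit in one helper. This is a self-contained stand-in — the `Condition` struct here mirrors the metav1.Condition fields that matter, but in a real controller you'd use the real type and `meta.SetStatusCondition` rather than hand-rolling this:

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a stand-in for metav1.Condition so this sketch compiles
// without the apimachinery dependency.
type Condition struct {
	Type               string
	Status             string // "True", "False", "Unknown"
	Reason             string
	ObservedGeneration int64
	LastTransitionTime time.Time
}

// setCondition upserts a condition, bumping LastTransitionTime only when
// Status actually flips — but always recording the generation observed,
// so readers can tell a stale Ready=True from a current one.
func setCondition(conds []Condition, c Condition) []Condition {
	c.LastTransitionTime = time.Now()
	for i, existing := range conds {
		if existing.Type != c.Type {
			continue
		}
		if existing.Status == c.Status {
			c.LastTransitionTime = existing.LastTransitionTime // no flip, keep timestamp
		}
		conds[i] = c
		return conds
	}
	return append(conds, c)
}

func main() {
	conds := setCondition(nil, Condition{Type: "Ready", Status: "True", ObservedGeneration: 1})
	t0 := conds[0].LastTransitionTime
	// Spec changed (generation 2) but we're still Ready: generation updates,
	// transition time does not.
	conds = setCondition(conds, Condition{Type: "Ready", Status: "True", ObservedGeneration: 2})
	fmt.Println(len(conds), conds[0].ObservedGeneration, conds[0].LastTransitionTime.Equal(t0)) // 1 2 true
}
```

The stale-read case is exactly why the generation matters: `Ready=True` with `ObservedGeneration: 3` on a CR at generation 5 tells you the controller hasn't caught up yet, which no timestamp can.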
Requeue strategy: Happy path uses RequeueAfter with the domain cadence. Error path returns the error and lets the workqueue rate limiter do its job. Never write your own backoff.
Observability: /metrics from the manager, plus one or two domain-specific metrics. If a controller isn’t exporting metrics, I can’t operate it.
RBAC: Per-verb, per-resource, no wildcards. Write it by hand once, keep it under review on every change.
The biggest mistake I see teams make: building a controller for something that’s already solved by VPA, HPA, KEDA, Kyverno, or ArgoCD. Kubernetes has a rich ecosystem of existing operators. Read the ones that exist before you write a new one. Your controller is code you now own — pick battles that are worth owning.