Enterprise Microservices in Go

The architecture patterns I rely on for building microservices that actually survive production — DDD boundaries, outbox-based events, gRPC, and the currency mistake almost everyone ships.

Architecture · Golang · Microservices

I’ve been building microservices in Go for years, and the hardest part isn’t the code. It’s getting service boundaries right, and it’s getting data consistency right across those boundaries. Split services too fine and you end up with a distributed monolith that pays distribution tax for nothing. Split too coarse and you lose the independence you were after. Then, once you have boundaries, the next class of bugs hits: systems that were supposed to be eventually consistent but are actually silently lossy.

This post covers the patterns I keep coming back to for enterprise Go microservices: domain-driven design for boundaries, the transactional outbox for events that can’t get dropped, gRPC where type safety pays off, and the operational scaffolding (observability, graceful shutdown, health probes) that separates a demo from a production service.

I’ll point at two classes of mistakes that eat teams alive — currency as float64 and event publishing as a best-effort side effect. Both look fine in code review and both cost money in production.

The Threats Worth Naming

Before any code, name what can go wrong in a distributed system. If a pattern below doesn’t defend against one of these, it shouldn’t be in the article.

Threat → Defense

  • Data loss between services (event dropped after DB commit) → Transactional outbox, relay with at-least-once delivery
  • Money arithmetic drift (rounding, accumulation) → Integer minor units or decimal library; never float
  • Eventual consistency gone wrong (clients read their own writes) → Read-your-writes on the owning service; no cross-service joins
  • Bounded context leakage (shared schema pulled across services) → Each service owns its schema; contracts travel over the wire only
  • Retry storms on downstream failure → Timeouts + circuit breakers + jittered backoff; idempotency keys
  • Silent drops in release (traffic cut during rollout) → Readiness probes + preStop + graceful shutdown
  • Unobservable failure (you find out from customers) → Structured logs + OTel traces + correlation IDs end-to-end
  • Service-to-service spoofing → mTLS with workload identity, not shared bearer secrets

I’ll reference this table as I go.

Bounded Contexts Before Services

The most common failure I see: a team lists its nouns — User, Order, Product, Invoice — and makes one service per noun. Six months later, every feature touches four services, and every deploy is coordinated across three teams. That’s a distributed monolith wearing microservice makeup.

The fix is to design bounded contexts first and only then decide how many services each context warrants. A bounded context is a region of the domain where a single model and vocabulary apply. “Customer” in the billing context is a different concept from “Customer” in the fulfillment context — same person, different invariants, different data, different lifecycle. They should not share a model.

For a typical e-commerce system, I’d start with these contexts:

  1. Order Management — order lifecycle, pricing, placement.
  2. Inventory — stock levels, reservations, replenishment.
  3. Payments — payment methods, transactions, refunds.
  4. Fulfillment — picking, packing, shipping, tracking.
  5. Customer — accounts, profile data, preferences.

Each context owns its data. No service reads another’s database. When Order needs a product price, it asks Inventory over a contract; it does not query the inventory schema. This rule is the whole game — break it and you’ve built a distributed monolith with extra network hops.

One service per context is the default. Split a context into multiple services only when you have a concrete operational reason (independent scaling profile, different release cadence, regulatory isolation). Never split because the code “feels big.”

The Order Aggregate (And the Currency Trap)

Here’s the Order aggregate root, with the currency handling people get wrong on their first pass:

// domain/order.go
package domain

import (
    "errors"
    "time"

    "github.com/google/uuid"
)

// OrderStatus is the order lifecycle state.
type OrderStatus string

const (
    OrderStatusDraft     OrderStatus = "DRAFT"
    OrderStatusPlaced    OrderStatus = "PLACED"
    OrderStatusPaid      OrderStatus = "PAID"
    OrderStatusShipped   OrderStatus = "SHIPPED"
    OrderStatusDelivered OrderStatus = "DELIVERED"
    OrderStatusCancelled OrderStatus = "CANCELLED"
)

var (
    ErrInvalidOrderState = errors.New("invalid order state")
    ErrEmptyOrder        = errors.New("order has no items")
    ErrCurrencyMismatch  = errors.New("mixed currencies in order")
)

// Order is the aggregate root. All mutations go through methods on Order
// so invariants hold regardless of which handler mutates the order.
type Order struct {
    ID          uuid.UUID
    CustomerID  uuid.UUID
    Items       []OrderItem
    Status      OrderStatus
    TotalAmount Money
    CreatedAt   time.Time
    UpdatedAt   time.Time
}

type OrderItem struct {
    ProductID uuid.UUID
    Quantity  int32
    UnitPrice Money
    Subtotal  Money
}

So far so normal. The trap is the Money type. The obvious first draft looks like this — and is wrong:

// DO NOT DO THIS. This is the bug almost every team ships at least once.
type Money struct {
    Amount   float64
    Currency string
}

Floating point cannot represent most decimal fractions exactly. 0.1 + 0.2 = 0.30000000000000004 in IEEE 754. When you sum line items, apply tax, apply a discount, and compare totals, the errors accumulate. Eventually a customer’s invoice is off by a cent, your reconciliation job fails, and somebody in finance opens a ticket that your engineering team can’t reproduce. I’ve seen this land a team in a two-week forensics exercise over pennies.

Use integer minor units. Store money as an int64 representing the smallest unit of the currency (cents for USD, yen for JPY, etc.). Arithmetic is exact. Comparisons are exact. You only convert to decimal at the edges — at display time, at persistence time if your DB column is NUMERIC, or at the API boundary.

// domain/money.go
package domain

import (
    "fmt"
)

// Money represents an amount in a single currency using integer minor units
// (cents for USD, pence for GBP, yen for JPY). Never use float64 for money.
//
// Why int64: exact arithmetic, no rounding drift, direct mapping to database
// BIGINT columns and Stripe/PSP APIs. The range fits every plausible amount
// in any currency.
type Money struct {
    Minor    int64  // e.g. 1999 for $19.99
    Currency string // ISO 4217 code, e.g. "USD"
}

func NewMoney(minor int64, currency string) Money {
    return Money{Minor: minor, Currency: currency}
}

// Add returns a+b. It errors on currency mismatch rather than silently
// producing nonsense. Silent currency coercion is how you lose audit trails.
// ErrCurrencyMismatch is the sentinel declared in order.go — money.go shares
// the domain package, so there's no need for a duplicate error value.
func (a Money) Add(b Money) (Money, error) {
    if a.Currency != b.Currency {
        return Money{}, fmt.Errorf("%w: %s vs %s", ErrCurrencyMismatch, a.Currency, b.Currency)
    }
    return Money{Minor: a.Minor + b.Minor, Currency: a.Currency}, nil
}

// Mul multiplies by an integer quantity. For percentages and tax rates use
// a dedicated method with documented rounding — not a float multiplier.
func (a Money) Mul(qty int64) Money {
    return Money{Minor: a.Minor * qty, Currency: a.Currency}
}

func (a Money) String() string {
    // Minimal display for two-decimal currencies. The sign is taken from the
    // whole amount so -50 minor renders as "-0.50", not "0.50". Real
    // formatting should use x/text/currency with the user's locale — that's
    // internationalization, not domain logic.
    sign := ""
    if a.Minor < 0 {
        sign = "-"
    }
    return fmt.Sprintf("%s%d.%02d %s", sign, abs(a.Minor)/100, abs(a.Minor)%100, a.Currency)
}

func abs(x int64) int64 {
    if x < 0 {
        return -x
    }
    return x
}

A couple of honest caveats about int64 cents:

  • Not every currency has two decimal places. JPY has zero. TND and BHD have three. Store currency exponents alongside the code, or use a library that knows (like github.com/bojanz/currency).
  • Percentage-based math (tax rates, discounts) needs care. Multiply by the rate’s numerator and divide by the denominator with documented rounding (banker’s rounding is standard). Never float64(price) * 0.175.
  • Cross-currency operations are a business decision, not a math problem. FX rates change. Pin the rate on the order at capture time and store it.
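The percentage-math caveat is worth making concrete. Here is a sketch of a rate multiplier with banker’s rounding, assuming the article’s Money type — MulRate and the 175/1000 rate are illustrative names, not part of the domain package above:

```go
package main

import "fmt"

// Money as in the article: integer minor units plus ISO 4217 code.
type Money struct {
	Minor    int64
	Currency string
}

// MulRate applies a rate expressed as an integer fraction num/den
// (17.5% VAT = 175/1000) with banker's rounding: round half to even.
// Hypothetical helper sketching the "documented rounding" the caveat
// above calls for; it ignores int64 overflow for brevity.
func (a Money) MulRate(num, den int64) Money {
	p := a.Minor * num
	q, r := p/den, p%den
	if r < 0 { // Go truncates toward zero; normalize r into [0, den)
		r += den
		q--
	}
	switch {
	case 2*r > den: // more than half: round up
		q++
	case 2*r == den && q%2 != 0: // exactly half: round to the even neighbor
		q++
	}
	return Money{Minor: q, Currency: a.Currency}
}

func main() {
	price := Money{Minor: 1999, Currency: "USD"} // $19.99
	fmt.Println(price.MulRate(175, 1000).Minor)  // 17.5% tax: 349.825 minor → 350
}
```

The half-even cases are the ones a naive `/ den` silently gets wrong: 2.50 rounds down to 2, 3.50 rounds up to 4, and negative amounts behave symmetrically.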

If integer minor units feel awkward, github.com/shopspring/decimal is the other defensible choice — arbitrary-precision decimal, slower but exact. I pick int64 for core services because it’s one less dependency and a BIGINT in every database. I’d pick decimal for financial reporting services where rates and multi-step calculations dominate.

With Money fixed, the aggregate’s mutations are straightforward:

// domain/order.go (continued)

func NewOrder(customerID uuid.UUID) *Order {
    now := time.Now().UTC()
    return &Order{
        ID:         uuid.New(),
        CustomerID: customerID,
        Status:     OrderStatusDraft,
        Items:      []OrderItem{},
        CreatedAt:  now,
        UpdatedAt:  now,
    }
}

// AddItem enforces currency consistency across the order. It's invalid to mix
// currencies in a single order — that's a checkout-flow decision, not a
// domain-layer fudge.
func (o *Order) AddItem(productID uuid.UUID, qty int32, unitPrice Money) error {
    if qty <= 0 {
        return errors.New("quantity must be positive")
    }
    if len(o.Items) > 0 && o.Items[0].UnitPrice.Currency != unitPrice.Currency {
        return ErrCurrencyMismatch
    }
    subtotal := unitPrice.Mul(int64(qty))
    o.Items = append(o.Items, OrderItem{
        ProductID: productID,
        Quantity:  qty,
        UnitPrice: unitPrice,
        Subtotal:  subtotal,
    })
    o.UpdatedAt = time.Now().UTC()
    return o.recalculateTotal()
}

func (o *Order) recalculateTotal() error {
    if len(o.Items) == 0 {
        o.TotalAmount = Money{}
        return nil
    }
    total := Money{Minor: 0, Currency: o.Items[0].UnitPrice.Currency}
    for _, item := range o.Items {
        sum, err := total.Add(item.Subtotal)
        if err != nil {
            return err
        }
        total = sum
    }
    o.TotalAmount = total
    return nil
}

// PlaceOrder is the state transition from DRAFT to PLACED. State transitions
// live on the aggregate so guards run every time.
func (o *Order) PlaceOrder() error {
    if o.Status != OrderStatusDraft {
        return ErrInvalidOrderState
    }
    if len(o.Items) == 0 {
        return ErrEmptyOrder
    }
    o.Status = OrderStatusPlaced
    o.UpdatedAt = time.Now().UTC()
    return nil
}

The aggregate enforces its own invariants. No handler, repository, or gRPC layer reaches inside and mutates Items or Status directly. That’s the discipline that keeps a domain model from rotting into a data bag with methods.

The Transactional Outbox: Events That Can’t Be Dropped

Here is the pattern I see everywhere, and it is broken:

// DO NOT DO THIS — this is the silent-data-loss bug.
func (s *OrderService) PlaceOrder(ctx context.Context, id uuid.UUID) error {
    order, err := s.repo.GetByID(ctx, id)
    if err != nil { return err }
    if err := order.PlaceOrder(); err != nil { return err }
    if err := s.repo.Update(ctx, order); err != nil { return err }

    // Fire the event. Log and continue on failure "so we don't fail the order".
    if err := s.publisher.Publish(ctx, NewOrderPlacedEvent(order)); err != nil {
        s.logger.Error("failed to publish event", "err", err)
    }
    return nil
}

This pattern is a data-loss pipeline with a log line for an audit trail. When the broker is unavailable, the order is placed in your database and nothing downstream ever knows. Inventory doesn’t reserve stock. Payments doesn’t charge. The warehouse doesn’t pick. You find out when a customer calls asking where their package is, and you can’t even query “which orders lost their events” because the evidence is a log line, not a durable record.

The fix is the transactional outbox. Instead of publishing to the broker inline, write the event to an outbox table in the same database transaction that persists the order. A separate relay process tails the outbox table and publishes to the broker, marking rows as delivered. Because the outbox write is atomic with the business state change, either both happen or neither does. At-least-once delivery to the broker is guaranteed; consumers handle idempotency (they have to anyway).

The outbox table:

CREATE TABLE outbox (
    id              UUID PRIMARY KEY,
    aggregate_type  TEXT NOT NULL,
    aggregate_id    UUID NOT NULL,
    event_type      TEXT NOT NULL,
    payload         JSONB NOT NULL,
    trace_id        TEXT,            -- W3C traceparent captured at write time
    occurred_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at    TIMESTAMPTZ,
    attempts        INT NOT NULL DEFAULT 0,
    last_error      TEXT
);

CREATE INDEX outbox_unpublished_idx ON outbox (occurred_at)
    WHERE published_at IS NULL;

Two columns that are easy to forget and painful to add later: trace_id to carry the producing span context across the async boundary (otherwise your trace dies at the outbox insert), and attempts/last_error so you can see why a row is stuck without trawling relay logs.

The write side — a single transaction across the aggregate and the outbox:

// service/order_service.go
package service

import (
    "context"
    "encoding/json"
    "fmt"
    "time"

    "github.com/google/uuid"
)

// DB, Tx, Logger, OutboxRecord and traceparentFrom are defined in the
// service package — signatures elided here for brevity. Tx exposes
// Orders() and Outbox() repositories bound to the same transaction.

type OrderService struct {
    db     DB            // transactional database handle
    logger Logger
}

// PlaceOrder persists the state change and the event atomically.
// If anything fails, the DB transaction rolls back — no partial state,
// no orphan event, no silent drop.
func (s *OrderService) PlaceOrder(ctx context.Context, orderID uuid.UUID) error {
    return s.db.InTx(ctx, func(tx Tx) error {
        order, err := tx.Orders().GetByID(ctx, orderID)
        if err != nil {
            return fmt.Errorf("load order: %w", err)
        }
        if err := order.PlaceOrder(); err != nil {
            return err
        }
        if err := tx.Orders().Update(ctx, order); err != nil {
            return fmt.Errorf("update order: %w", err)
        }

        payload, err := json.Marshal(orderPlacedPayload{
            OrderID:      order.ID.String(),
            CustomerID:   order.CustomerID.String(),
            TotalMinor:   order.TotalAmount.Minor,
            Currency:     order.TotalAmount.Currency,
            PlacedAt:     order.UpdatedAt,
        })
        if err != nil {
            return fmt.Errorf("marshal event: %w", err)
        }
        return tx.Outbox().Insert(ctx, OutboxRecord{
            ID:            uuid.New(),
            AggregateType: "order",
            AggregateID:   order.ID,
            EventType:     "order.placed",
            Payload:       payload,
            // Capture the current trace so consumers can continue the span.
            // Without this the trace terminates at the DB write.
            TraceID:       traceparentFrom(ctx),
        })
    })
}

type orderPlacedPayload struct {
    OrderID    string    `json:"order_id"`
    CustomerID string    `json:"customer_id"`
    TotalMinor int64     `json:"total_minor"`
    Currency   string    `json:"currency"`
    PlacedAt   time.Time `json:"placed_at"`
}

Notice the event payload carries TotalMinor int64 and Currency, never a float64 amount. The wire format preserves the same guarantee the domain does.

The relay is a separate process (or a goroutine in the service, if you’re comfortable with the operational tradeoff) that polls unpublished rows and ships them:

// outbox/relay.go
package outbox

import (
    "context"
    "time"
)

type Relay struct {
    db        DB
    publisher Publisher // NATS/Kafka client
    batch     int
    interval  time.Duration
    logger    Logger
}

func (r *Relay) Run(ctx context.Context) error {
    t := time.NewTicker(r.interval)
    defer t.Stop()
    for {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-t.C:
            r.safeDispatch(ctx)
        }
    }
}

// safeDispatch wraps dispatchBatch in a recover so one poisoned row or a
// nil-deref in a publisher client can't kill the whole pipeline. Run() is
// the single goroutine driving the outbox; losing it silently stops every
// downstream consumer. The recover restarts the next tick.
func (r *Relay) safeDispatch(ctx context.Context) {
    defer func() {
        if rec := recover(); rec != nil {
            r.logger.Error("outbox relay panic; continuing on next tick",
                "panic", rec)
        }
    }()
    if err := r.dispatchBatch(ctx); err != nil {
        r.logger.Error("outbox dispatch failed", "err", err)
        // Keep looping. Transient broker errors are expected.
    }
}

func (r *Relay) dispatchBatch(ctx context.Context) error {
    // SELECT ... FOR UPDATE SKIP LOCKED holds row locks for the lifetime
    // of a transaction. That means Fetch and MarkPublished MUST share the
    // same tx — otherwise the tx from Fetch commits on return, the locks
    // drop, and a sibling relay instance picks up the same rows and
    // double-publishes. The tx boundary here is the correctness guarantee,
    // not the SELECT syntax.
    return r.db.InTx(ctx, func(tx Tx) error {
        rows, err := tx.FetchUnpublished(ctx, r.batch) // FOR UPDATE SKIP LOCKED
        if err != nil {
            return err
        }
        const maxAttempts = 10
        for _, row := range rows {
            // The event envelope carries the outbox row ID as the event ID so
            // consumers can dedupe, and the captured trace_id so spans continue
            // across the async boundary.
            env := Envelope{
                ID:      row.ID,
                Type:    row.EventType,
                TraceID: row.TraceID,
                Payload: row.Payload,
            }
            if err := r.publisher.Publish(ctx, env); err != nil {
                _ = tx.IncrementAttempts(ctx, row.ID, err.Error())
                if row.Attempts+1 >= maxAttempts {
                    // Poison message: stop it from blocking the batch forever.
                    // Park in a DLQ table and alert — don't silently drop.
                    _ = tx.MoveToDLQ(ctx, row.ID)
                    r.logger.Error("outbox row exceeded retries; moved to DLQ",
                        "id", row.ID, "err", err)
                    continue
                }
                // Skip this row for now; later rows are independent and must
                // not be held up by one bad broker interaction.
                continue
            }
            if err := tx.MarkPublished(ctx, row.ID, time.Now()); err != nil {
                // Event went out but DB write failed. The same event will
                // publish again on next loop. Consumers MUST be idempotent.
                return err
            }
        }
        return nil
    })
}

Five things I want to call out:

  1. At-least-once, not exactly-once. Exactly-once over a network is a fiction sold to people who don’t read the footnotes. Between “relay published to broker” and “relay marked row published,” a crash means the event goes out twice. Consumers deduplicate on the event id. Build idempotent handlers — it’s the only answer that survives partition.
  2. SELECT ... FOR UPDATE SKIP LOCKED must share a tx with MarkPublished. The row lock lives for the transaction’s lifetime, not the statement’s. If FetchUnpublished opens and closes its own tx, the locks drop on return and a sibling relay instance immediately grabs the same rows — same double-publish the pattern is meant to prevent. Keep Fetch, IncrementAttempts, MoveToDLQ, and MarkPublished inside one InTx block so the locks persist until commit. Without SKIP LOCKED you get contention; without FOR UPDATE you get races; without the shared tx you get both.
  3. One bad row must not block the batch. The naive loop returns on first publish error and stalls every row behind it — the same silent-loss failure mode the outbox is supposed to prevent, just shifted in time. Skip-and-continue per row, cap attempts, and move poison rows to a DLQ table with an alert. A stuck outbox row at 200 attempts is the canary.
  4. Publisher.Publish runs inside the tx — that is a real tradeoff. The loop above calls publisher.Publish while holding FOR UPDATE SKIP LOCKED row locks for the whole batch, which means a slow broker parks Postgres row locks across network I/O and blocks autovacuum on the outbox table. Under broker slowness you get a feedback loop: locks held longer, autovacuum starved, table bloat, sibling relays queue up. Keep batches small (25–100 rows), keep publish deadlines tight, and alarm on pg_stat_activity for long-running relay transactions. If broker tail latency is routinely above ~100ms, switch to an advisory-lock + commit-per-row shape so each row’s lock is released the moment its publish acks — you lose some throughput to extra commits and trade it for predictable lock duration. Partial-batch rollback has the same implication: if MarkPublished fails midway through a batch, the whole tx rolls back and every already-published row in that batch republishes, so consumer idempotency isn’t a nice-to-have here, it’s load-bearing.
  5. Carry the event ID and trace ID on the wire. The event envelope must include the outbox row’s id (so consumers can dedupe by a stable key) and the producer’s trace_id (so the downstream span links back to the originator). Events with a fresh UUID per delivery defeat dedup; events without trace context sever your observability at every async hop.
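For concreteness, one possible shape for the query behind FetchUnpublished, following the outbox DDL above (the exact column list is up to your repository layer):

```sql
-- Sketch of FetchUnpublished. FOR UPDATE SKIP LOCKED lets sibling relay
-- instances work disjoint batches; the row locks persist until the
-- surrounding transaction commits, which is why MarkPublished must share it.
SELECT id, aggregate_type, aggregate_id, event_type, payload, trace_id, attempts
FROM outbox
WHERE published_at IS NULL
ORDER BY occurred_at
LIMIT $1
FOR UPDATE SKIP LOCKED;
```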

This pattern costs you: a table, a relay process, polling latency (or a LISTEN/NOTIFY trigger if you want sub-second delivery on Postgres). In exchange, you get events that cannot be silently dropped. For anything that touches money or audit, that’s non-negotiable.
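If you take the LISTEN/NOTIFY route, the database side is small — a sketch assuming the outbox table above, with illustrative function and channel names; the relay LISTENs on the channel and wakes immediately instead of waiting for the next poll tick:

```sql
-- Wake the relay on every outbox insert (PostgreSQL 11+ trigger syntax).
CREATE FUNCTION notify_outbox() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('outbox_wakeup', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER outbox_notify
    AFTER INSERT ON outbox
    FOR EACH ROW EXECUTE FUNCTION notify_outbox();
```

Keep the polling loop as a fallback even with NOTIFY — notifications are not queued across relay restarts, so the poll is what guarantees nothing is missed.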

Consuming Events: Idempotency and Poison Messages

The consumer side is where people forget that at-least-once means “at least once.” An inventory service handling order.placed:

// inventory/handler.go
package inventory

import (
    "context"
    "encoding/json"
    "errors"
    "fmt"
)

// Handler, Envelope, Tx, and the DLQ/Processed repositories follow the same
// convention as the earlier listings: signatures elided for brevity.

// ErrPermanent marks poison messages that must NOT be retried. The
// dispatch wrapper below routes these to the DLQ; everything else is
// retried with backoff. Without this sentinel, readers end up with
// infinite-retry loops on malformed payloads.
var ErrPermanent = errors.New("permanent failure")

func (h *Handler) HandleOrderPlaced(ctx context.Context, env Envelope) error {
    var e orderPlacedPayload
    if err := json.Unmarshal(env.Payload, &e); err != nil {
        // Malformed payload is a poison message. Do NOT retry forever —
        // send to a dead-letter topic and alert.
        return fmt.Errorf("%w: bad payload: %v", ErrPermanent, err)
    }

    // Idempotency: claim the event ID atomically AND perform the side effect
    // in the SAME DB transaction. If the reservation fails, the claim rolls
    // back with it — the broker redelivers and we retry. If the claim commits
    // without the side effect also committing, you've reintroduced the exact
    // silent-loss failure mode the outbox was designed to prevent: the event
    // gets ack'd, inventory is never reserved, nobody knows.
    return h.db.InTx(ctx, func(tx Tx) error {
        claimed, err := tx.Processed().Claim(ctx, env.ID, "inventory.reservation")
        if err != nil {
            return err
        }
        if !claimed {
            return nil // another delivery already committed this event; no-op
        }
        // Side effect runs on the same tx. Commit is all-or-nothing.
        return h.svc.ReserveInventoryForOrderTx(ctx, tx, e.OrderID)
    })
}

The handler is only half the contract. The dispatch wrapper that calls it must inspect the returned error and route poison messages to a DLQ — otherwise ErrPermanent is a comment, not a control flow. Every consumer gets this wrapper:

// inventory/dispatch.go
package inventory

import (
    "context"
    "errors"
)

// Dispatch invokes the handler and routes based on the error class.
// Transient failures return to the broker for redelivery with backoff.
// Permanent failures (ErrPermanent) are parked in the DLQ and alerted —
// never redelivered, because redelivery will fail the same way forever.
func (h *Handler) Dispatch(ctx context.Context, env Envelope) error {
    err := h.HandleOrderPlaced(ctx, env)
    if err == nil {
        return nil
    }
    if errors.Is(err, ErrPermanent) {
        if dlqErr := h.dlq.Park(ctx, env, err.Error()); dlqErr != nil {
            // DLQ write failure is itself transient — return so the
            // broker retries delivery rather than losing the event.
            return dlqErr
        }
        h.logger.Error("poison message parked in DLQ",
            "event_id", env.ID, "err", err)
        return nil // ack to broker; we own it now
    }
    return err // transient: let the broker redeliver
}

Three patterns earn their keep here:

  • Event-ID dedup table — one row per (consumer, event_id). First handler wins; retries are no-ops. This is the cheapest idempotency key.
  • Atomic claim with the side effect, not just atomic claim. “INSERT … ON CONFLICT DO NOTHING returning rows-affected == 1” is necessary but not sufficient. If the claim commits in its own transaction and the side effect runs in a separate one, a failure in the side effect (DB down, panic, transient error) leaves the claim committed and the work undone — same silent-loss failure the outbox was meant to prevent, now at the consumer. Run the claim INSERT and the side effect in the same DB transaction so they commit or roll back together. The alternative is pushing idempotency down to the aggregate with a unique constraint (e.g. UNIQUE(order_id, reservation_id)) so redeliveries collide at the DB layer and become a no-op — also fine, also atomic. Pick one. Do not split the claim and the side effect across transactions.
  • Poison-message policy — distinguish transient failures (retry with backoff) from permanent failures (DLQ + alert). Infinite-retry loops on bad data will chew your consumer throughput and mask real bugs.
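The dedup table from the first bullet is tiny — a sketch with illustrative names, matching the claim semantics described above:

```sql
-- One row per (consumer, event_id). The primary key is the idempotency key.
CREATE TABLE processed_events (
    consumer     TEXT NOT NULL,   -- e.g. "inventory.reservation"
    event_id     UUID NOT NULL,   -- outbox row ID carried on the envelope
    processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (consumer, event_id)
);

-- The claim, run inside the SAME transaction as the side effect:
-- rows-affected == 1 means first delivery wins; 0 means duplicate, no-op.
INSERT INTO processed_events (consumer, event_id)
VALUES ($1, $2)
ON CONFLICT (consumer, event_id) DO NOTHING;
```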

gRPC vs REST: When Each Earns Its Place

The tradeoff between gRPC and REST is overstated in both directions. Here’s where I’d actually reach for each:

gRPC wins for service-to-service:

  • Strongly typed contracts with .proto IDL. A service change breaks callers at build time, not at 3 AM.
  • Bidirectional streaming for telemetry ingest, progress updates, and server-push scenarios.
  • Efficient binary encoding matters at high RPS between internal services.
  • Built-in deadlines propagate via context across calls.

REST wins for public-facing APIs:

  • Every client library in every language, without codegen.
  • HTTP semantics browsers, proxies, caches, and firewalls already understand.
  • Debuggable with curl. Don’t underestimate this.
  • No protobuf version hell between external consumers and your schema.

I default to gRPC inside a service mesh and REST at the edge. GraphQL has its place for aggregation layers feeding BFFs, but it’s a third system, not a gRPC replacement.

A minimal .proto for the order service, using int64 minor units to match the domain:

// order_service.proto
syntax = "proto3";

package order.v1;
option go_package = "github.com/lookatitude/ecommerce/order/v1;orderv1";

import "google/protobuf/timestamp.proto";

service OrderService {
  rpc CreateOrder(CreateOrderRequest) returns (OrderResponse);
  rpc GetOrder(GetOrderRequest) returns (OrderResponse);
  rpc AddOrderItem(AddOrderItemRequest) returns (OrderResponse);
  rpc PlaceOrder(PlaceOrderRequest) returns (OrderResponse);
}

message Money {
  // Minor units (cents, pence, yen). Never use `double` for currency.
  int64  minor    = 1;
  // ISO 4217 code.
  string currency = 2;
}

message OrderItem {
  string product_id = 1;
  int32  quantity   = 2;
  Money  unit_price = 3;
  Money  subtotal   = 4;
}

message OrderResponse {
  string id          = 1;
  string customer_id = 2;
  repeated OrderItem items = 3;
  string status      = 4;
  Money  total       = 5;
  google.protobuf.Timestamp created_at = 6;
  google.protobuf.Timestamp updated_at = 7;
}

message CreateOrderRequest  { string customer_id = 1; }
message GetOrderRequest     { string order_id    = 1; }
message PlaceOrderRequest   { string order_id    = 1; }
message AddOrderItemRequest {
  string order_id   = 1;
  string product_id = 2;
  int32  quantity   = 3;
  Money  unit_price = 4;
}

The wire contract mirrors the domain: int64 minor + string currency. If you ever see double amount in a proto for money, the person who wrote it hasn’t been through a reconciliation incident yet.

Observability: Correlation IDs or You’re Blind

In a distributed system, “I’ll grep the logs” stops working the moment a request crosses a service boundary. You need three things working together:

  1. Structured logging with consistent field names across services.
  2. Distributed traces (OpenTelemetry) with automatic context propagation.
  3. Correlation IDs flowing as metadata on every call — gRPC metadata, HTTP headers, event envelopes.

A minimal structured logger with OTel trace correlation:

// obs/logger.go
package obs

import (
    "context"

    "go.opentelemetry.io/otel/trace"
    "go.uber.org/zap"
)

func FromContext(ctx context.Context, base *zap.Logger) *zap.Logger {
    span := trace.SpanFromContext(ctx)
    sc := span.SpanContext()
    if !sc.IsValid() {
        return base
    }
    return base.With(
        zap.String("trace_id", sc.TraceID().String()),
        zap.String("span_id", sc.SpanID().String()),
    )
}

Every log line in a request lands with a trace_id. Paste that ID into Jaeger/Tempo and you get the full cross-service timeline. Grep for it in Loki/ELK and you get every log across every service. That’s the superpower — one ID, complete picture.

Initialize tracing once at service startup:

// obs/trace.go
package obs

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func InitTracer(ctx context.Context, service, endpoint string) (*sdktrace.TracerProvider, error) {
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint(endpoint),
        // WithInsecure is dev-only. Plaintext gRPC to an OTel collector
        // across a zone/VPC boundary exposes span attributes (order IDs,
        // user IDs, request shapes) to anything on the wire — the exact
        // PII-leak vector the span-attribute warning below is about. In
        // prod, use otlptracegrpc.WithTLSCredentials(...) with the
        // collector's CA, or run the collector as a sidecar on loopback.
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    res, err := resource.New(ctx,
        resource.WithAttributes(semconv.ServiceName(service)),
    )
    if err != nil {
        return nil, err
    }
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithResource(res),
        // In prod: sample 1-10%. AlwaysSample only in dev/staging.
        sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.05))),
    )
    otel.SetTracerProvider(tp)
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{},
        propagation.Baggage{},
    ))
    return tp, nil
}

Sampling is a real decision — AlwaysSample will drown you in trace-storage costs at scale. 5% parent-based sampling keeps whole traces together (children inherit the parent's decision) while keeping the budget sane. If you need more detail on specific endpoints, use rule-based sampling.
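To make the head-sampling decision concrete, here's a minimal sketch of the idea behind deterministic ratio sampling: hash the trace ID to a number and compare it against a threshold, so every service in the call path reaches the same decision for the same trace. This illustrates the principle only; it is not the OTel SDK's exact algorithm or byte layout.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"hash/fnv"
)

// sampled reports whether a trace ID falls inside the sampled fraction.
// Deterministic: every service hashing the same ID gets the same answer,
// which is what keeps a distributed trace intact under head sampling.
func sampled(traceID []byte, ratio float64) bool {
	h := fnv.New64a()
	h.Write(traceID)
	// Map the hash onto 10,000 buckets and keep the first ratio*10,000.
	return float64(h.Sum64()%10_000) < ratio*10_000
}

func main() {
	hits := 0
	const n = 100_000
	for i := 0; i < n; i++ {
		id := make([]byte, 16) // same width as a W3C trace ID
		rand.Read(id)
		if sampled(id, 0.05) {
			hits++
		}
	}
	fmt.Printf("sampled %.1f%% of %d traces\n", 100*float64(hits)/n, n)
}
```

Because the decision is a pure function of the trace ID, there is no coordination between services and no per-request randomness to disagree about.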

One warning about span attributes: traces land in a backend that is usually less hardened than your primary database, and vendor-hosted backends multiply that blast radius. Do not attach raw PII, full request bodies, secrets, or auth tokens as span attributes. Use stable IDs (order ID, customer ID) as trace keys, never email addresses, payment details, or JWTs. Every attribute you add is a row in someone’s cheap search index — treat the span like a log line, not a debug dump.
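One way to enforce that at the edge of your tracing helper is an allowlist scrub applied before attributes ever reach a span. A stdlib-only sketch — the attribute names and the helper itself are illustrative, not part of any OTel API:

```go
package main

import "fmt"

// allowedAttrs is the explicit allowlist. Everything not named here is
// dropped rather than redacted, so a newly added attribute never leaks
// by default.
var allowedAttrs = map[string]bool{
	"order.id":    true,
	"customer.id": true,
	"http.route":  true,
}

// safeSpanAttrs returns only the allowlisted subset of attrs.
func safeSpanAttrs(attrs map[string]string) map[string]string {
	out := make(map[string]string, len(attrs))
	for k, v := range attrs {
		if allowedAttrs[k] {
			out[k] = v
		}
	}
	return out
}

func main() {
	attrs := safeSpanAttrs(map[string]string{
		"order.id":       "ord_123",
		"customer.email": "a@example.com", // PII: silently dropped
		"auth.token":     "eyJhbGciOi...", // secret: silently dropped
	})
	fmt.Println(attrs)
}
```

The deny-by-default shape matters more than the mechanism: an allowlist fails closed, a blocklist fails open the first time someone invents a new attribute name.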

Kubernetes: Probes, Graceful Shutdown, Resource Limits

A pod that restarts during a deploy while it’s still serving traffic drops requests. A pod without resource limits starves its neighbors. A pod with liveness == readiness flaps when the service is warming up. These are all boring, all preventable, and all bite teams in production.

The deployment I actually ship:

# kubernetes/order-service.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: ecommerce
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels: { app: order-service }
  template:
    metadata:
      labels: { app: order-service }
    spec:
      terminationGracePeriodSeconds: 45
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        seccompProfile: { type: RuntimeDefault }
      containers:
        - name: order-service
          # Pin by immutable digest, not by mutable tag. Tags can be
          # re-pushed (a compromised or rebuilt `v1.2.3` silently
          # replaces yours); digests cannot. Keep the tag for humans,
          # the digest for the kubelet.
          image: lookatitude/order-service:v1.2.3@sha256:0c4c6a1f3a9b1e7d2c8f5e9a1b2c3d4e5f60718293a4b5c6d7e8f9012345abcd
          ports:
            - { containerPort: 8080, name: http }
            - { containerPort: 9090, name: grpc }
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
          resources:
            requests: { cpu: 100m, memory: 128Mi }
            limits:   { cpu: 500m, memory: 512Mi }
          # Readiness gates traffic: fails -> removed from Service endpoints.
          readinessProbe:
            httpGet: { path: /health/ready, port: 8080 }
            initialDelaySeconds: 2
            periodSeconds: 5
            failureThreshold: 3
          # Liveness restarts the pod: fails -> kill and reschedule.
          # MUST be more lenient than readiness, or you restart during warmup.
          livenessProbe:
            httpGet: { path: /health/live, port: 8080 }
            initialDelaySeconds: 20
            periodSeconds: 20
            failureThreshold: 3
          # preStop + graceful shutdown = no dropped requests on rollout.
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]

The preStop sleep is the trick most people miss: Kubernetes removes the pod from Service endpoints and sends SIGTERM at the same time. If your app exits immediately on SIGTERM, in-flight requests from clients that haven’t yet refreshed their endpoint cache get dropped. The sleep gives the cluster time to propagate the endpoint change; your app then drains gracefully. Budget the whole sequence: the 10s sleep plus up to 30s of drain (plus span flush) has to fit inside terminationGracePeriodSeconds: 45, or the kubelet SIGKILLs mid-drain.

The securityContext block is non-negotiable for anything touching customer data. Running as non-root with a read-only root filesystem and all capabilities dropped means a compromised Go binary can’t write a webshell to disk or escalate via ambient privileges. Add these on day one — retrofitting readOnlyRootFilesystem: true into a service that writes to /tmp by default is a weekend of tracking down hidden writes.
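If the service genuinely needs scratch space, the fix is an explicit writable mount rather than relaxing the policy. A sketch to merge into the Deployment above; the volume name and size limit are illustrative:

```yaml
# Writable /tmp without giving up readOnlyRootFilesystem: an emptyDir
# scoped to the pod's lifetime, capped so a runaway temp file can't
# eat node disk.
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir:
            sizeLimit: 64Mi
```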

The Go side has to cooperate:

// cmd/order-service/main.go
package main

import (
    "context"
    "errors"
    "log"
    "net/http"
    "os/signal"
    "syscall"
    "time"
)

// router is the http.Handler wired up elsewhere (chi, mux, gRPC-gateway, etc.).
// tp is the OTel TracerProvider returned by obs.InitTracer at startup.
var (
    router http.Handler
    tp     interface{ Shutdown(context.Context) error }
)

func main() {
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    srv := &http.Server{
        Addr:    ":8080",
        Handler: router,
        // Timeouts are non-negotiable. A missing ReadHeaderTimeout is a
        // Slowloris DoS vector — an attacker opens sockets, dribbles one
        // byte of headers per minute, and starves your goroutine pool.
        // Go's zero-value is "no timeout," which is exactly wrong for a
        // service exposed to the internet or to an untrusted mesh.
        ReadHeaderTimeout: 5 * time.Second,
        ReadTimeout:       30 * time.Second,
        WriteTimeout:      30 * time.Second,
        IdleTimeout:       120 * time.Second,
    }
    go func() {
        if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
            log.Fatal(err)
        }
    }()

    <-ctx.Done()
    // Give in-flight requests up to 30s to complete.
    shutCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    if err := srv.Shutdown(shutCtx); err != nil {
        log.Printf("graceful shutdown failed: %v", err)
    }
    // Flush pending spans before the process exits, on a fresh timeout:
    // if srv.Shutdown consumed the full 30s, shutCtx is already expired
    // and the flush would fail immediately. Without this flush the last
    // batch — often the interesting one containing the shutdown path — is
    // lost, and your trace backend shows a clean exit where a real incident
    // happened.
    flushCtx, flushCancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer flushCancel()
    if err := tp.Shutdown(flushCtx); err != nil {
        log.Printf("tracer shutdown failed: %v", err)
    }
}
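If you want to convince yourself (or a reviewer) that srv.Shutdown really does wait for in-flight requests, a self-contained demo is cheap. This uses httptest rather than the production wiring above, and the handler delay is arbitrary:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// drainDemo shows that http.Server.Shutdown waits for in-flight requests:
// Shutdown is called while a slow handler is mid-flight, yet the client
// still receives the full response.
func drainDemo() string {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(200 * time.Millisecond) // simulate real work
		io.WriteString(w, "done")
	})
	ts := httptest.NewServer(slow)

	result := make(chan string, 1)
	go func() {
		resp, err := http.Get(ts.URL)
		if err != nil {
			result <- "error: " + err.Error()
			return
		}
		defer resp.Body.Close()
		body, _ := io.ReadAll(resp.Body)
		result <- string(body)
	}()

	time.Sleep(50 * time.Millisecond) // let the request reach the handler
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	ts.Config.Shutdown(ctx) // blocks until the in-flight request drains
	return <-result
}

func main() {
	fmt.Println(drainDemo()) // done
}
```

Shutdown stops accepting new connections immediately but lets active handlers finish, which is exactly the behavior the preStop sleep depends on.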

Resource limits deserve a word. requests is what the scheduler uses to place pods; limits is what the kernel enforces. For Go services, memory limits matter most — OOMKilled is a hard kill with no grace period. Start with 2x observed steady-state memory as the limit, watch it, tune down. CPU limits throttle rather than kill, but set too tight on a bursty workload they tank tail latency.

A Case From the Field

A while back we built a telematics platform for an automotive client that ingests data from a large vehicle fleet and powers real-time analytics and customer apps. The contexts we landed on:

  1. Telemetry Ingestion — receives, validates, and buffers vehicle data.
  2. Vehicle — device metadata, ownership, lifecycle.
  3. Trip — aggregates telemetry into trip records.
  4. Analytics — rolling metrics, driver-behavior scoring.
  5. Notification — alerts and user-facing updates.
  6. API Gateway — unified surface for client apps.

The patterns that actually moved the needle: transactional outbox between Trip and Analytics (we were losing ~0.3% of trips to dropped events in v1 until we fixed it), int64 odometer/money fields (fleet-level aggregates had visible drift at v1 scale), and aggressive readiness-probe tuning combined with preStop sleeps to kill the rollout-drop pattern that had been showing up as 5xx spikes in every deploy. Once those three were in place, the system ran steadily at production scale with independent releases per service.

None of that came from exotic patterns. It came from picking the unfashionable answer (outbox over inline publish, integer cents over float64) and being disciplined about it.
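The float64 half of that is easy to demonstrate in a few lines: summing ten dime-sized amounts in float64 already misses, while integer cents are exact by construction.

```go
package main

import "fmt"

// sumFloat adds amount n times as float64 "dollars" — the broken way.
func sumFloat(amount float64, n int) float64 {
	s := 0.0
	for i := 0; i < n; i++ {
		s += amount
	}
	return s
}

// sumCents adds amount n times as int64 minor units — the correct way.
func sumCents(amount int64, n int) int64 {
	var s int64
	for i := 0; i < n; i++ {
		s += amount
	}
	return s
}

func main() {
	fmt.Println(sumFloat(0.10, 10) == 1.00) // false: 0.1 has no exact binary form
	fmt.Println(sumCents(10, 10) == 100)    // true: integers don't drift
	fmt.Printf("%.17f\n", sumFloat(0.10, 10))
}
```

Ten additions is already enough to break equality; a fleet-scale aggregate does millions, and the error compounds instead of cancelling.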

What I’d Actually Choose

Starting a new Go microservice system today, with the constraints I usually face:

Service boundaries: bounded contexts first, one service per context as the default. Split further only for a concrete operational reason. Never split by data model.

Money: int64 minor units in domain, wire, and database. shopspring/decimal if you’re building financial reporting that does heavy multi-step arithmetic. Never float64, not once.

Events: transactional outbox, period. The cost is a table and a relay. The payoff is “events cannot be silently dropped” — which, if you’ve ever been through the post-mortem for a missing event, you will pay for gladly.

RPC: gRPC between services, REST at the edge. Proto contracts with int64 minor units for money. Codegen into your services and your SDKs.

Data consistency: each service owns its schema. No cross-service database reads, not even “just this once.” If you need data from another context, it’s a contract call or an event subscription.

Observability: OpenTelemetry from day one. Trace IDs in every log. 5% sampling in prod with rule-based boosts. Don’t bolt this on later — the information you lose before instrumentation is the information you need when it’s 3 AM.

Kubernetes: readiness != liveness, preStop sleeps for zero-drop rollouts, memory limits at 2x observed, terminationGracePeriodSeconds longer than your slowest handler.

Service-to-service auth: mTLS with workload identities (SPIFFE/SPIRE via the mesh), not shared bearer secrets. The user’s JWT flows as context; the transport proves the caller.

The single biggest mistake I see: teams treating “eventually consistent” as “we’ll deal with drift later.” Drift compounds. An event dropped in month one is a reconciliation nightmare in month six. Get the outbox in before you have three services, and the currency type right before you take your first payment.

At Lookatitude, we help teams build and harden Go microservice systems — boundaries, event flows, observability, and the boring operational details that separate a system that survives production from one that leaks data. If you’re planning a new system or fighting drift in an existing one, get in touch.
