Transactional Outbox Pattern in Go with PostgreSQL
Write the event with the data. Never split them.
Two writes that should succeed together will eventually fail separately.
Your order service saves the order to the database, then publishes an order.created event to a message broker.
These two operations run one after the other.
Between them, things go wrong: the broker is down, the network times out, the process restarts, or the container is evicted. The database write succeeded. The publish did not. The downstream service that needs to know about the new order never finds out. Nobody noticed until a customer called.
This is the dual-write problem, and it is one of the most common sources of silent data loss in distributed systems. The transactional outbox pattern is the standard fix.

The dual-write problem
The failure mode is easy to reason about once you see it:
BEGIN;
INSERT INTO orders ... -- succeeds
COMMIT;
PUBLISH order.created ... -- fails, crashes, or is never reached
The database and the message broker do not share a transaction boundary. There is no rollback that covers both. Every service that does save -> publish in sequence has this gap. The pattern shows up in many forms:
db.Save(order)followed byevents.Publish(OrderCreated{...})- HTTP handler that commits a transaction and then calls an external webhook
- Worker that processes a record from one queue and writes results to another
The outcome in all cases is the same: one side succeeds while the other fails, and the system ends up in a state that is invisible to monitoring because both individual operations returned success at some point.
A retry loop does not fix this. Retrying the publish after the database commit only works if the retry itself is reliable – which requires exactly the durability guarantee you do not have.
What the transactional outbox pattern does
The outbox pattern eliminates the gap by removing the direct publish entirely. Instead of calling the broker from within your business logic, you write an event record into an outbox table in the same database transaction as the business data. A separate background process – the relay – reads from the outbox table and publishes to the broker.
BEGIN;
INSERT INTO orders ... -- business data
INSERT INTO outbox_events ... -- event record
COMMIT;
-- Relay process (separately):
SELECT ... FROM outbox_events FOR UPDATE SKIP LOCKED;
PUBLISH order.created ...
UPDATE outbox_events SET processed_at = NOW() WHERE id = $1;
Both writes succeed or both fail. The transaction guarantee you already have from PostgreSQL now covers the event record too. The relay can retry publishing as many times as needed because the event sits in durable storage. If the relay crashes mid-flight, it restarts and retries. The worst outcome is that the event is published more than once – which is handled by making consumers idempotent (see Idempotency in Distributed Systems).
PostgreSQL schema for the outbox table
The schema is intentionally simple:
CREATE TABLE outbox_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type VARCHAR(100) NOT NULL,
aggregate_id VARCHAR(100) NOT NULL,
event_type VARCHAR(100) NOT NULL,
payload JSONB NOT NULL,
attempts INT NOT NULL DEFAULT 0,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
processed_at TIMESTAMPTZ
);
-- Partial index: only indexes unprocessed rows, stays small as rows are marked done
CREATE INDEX idx_outbox_unprocessed
ON outbox_events (created_at)
WHERE processed_at IS NULL;
The partial index on created_at WHERE processed_at IS NULL is important. Without it, the index grows with every event ever written and the relay’s polling query gets slower over time. With it, the index covers only the pending rows, which in steady state is a small, bounded set regardless of how many events have been published.
Key field choices:
aggregate_typeandaggregate_iddescribe which entity the event belongs to. Useful for ordering guarantees and routing.event_typeis the event name your consumers expect.payload JSONBstores the event body. UseJSONBrather thanTEXTso you can query it if needed.attemptstracks how many times the relay has tried to publish this row. Used for retry limits and dead-letter handling.processed_atisNULLfor pending rows and set when the relay successfully publishes.
Writing business data and outbox event in one transaction
The business logic writes both records inside a single BeginTx / Commit call. There is no publish call here – only database writes.
type OrderService struct {
db *sql.DB
}
func (s *OrderService) CreateOrder(ctx context.Context, order Order) error {
tx, err := s.db.BeginTx(ctx, nil)
if err != nil {
return fmt.Errorf("begin tx: %w", err)
}
defer tx.Rollback()
if _, err := tx.ExecContext(ctx, `
INSERT INTO orders (id, customer_id, total, created_at)
VALUES ($1, $2, $3, NOW())
`, order.ID, order.CustomerID, order.Total); err != nil {
return fmt.Errorf("insert order: %w", err)
}
payload, err := json.Marshal(map[string]any{
"order_id": order.ID,
"customer_id": order.CustomerID,
"total": order.Total,
})
if err != nil {
return fmt.Errorf("marshal payload: %w", err)
}
if _, err := tx.ExecContext(ctx, `
INSERT INTO outbox_events
(aggregate_type, aggregate_id, event_type, payload)
VALUES ($1, $2, $3, $4)
`, "order", order.ID, "order.created", payload); err != nil {
return fmt.Errorf("insert outbox event: %w", err)
}
return tx.Commit()
}
If tx.Commit() fails, neither the order row nor the outbox row is persisted. If it succeeds, both are guaranteed to be in the database. The relay can publish the event at any point after that – immediately, in one second, or after the relay restarts following a crash.
This is the only code change required in your business layer. The rest of the pattern lives in the relay.
Go relay implementation
The relay is a background worker that polls the outbox table on a timer. It fetches a batch of unprocessed rows, publishes each one, and marks it done. Keep it in the same binary as your application or run it as a separate process – either works, but the same binary is simpler to operate.
type OutboxRelay struct {
db *sql.DB
publisher Publisher
logger *slog.Logger
batchSize int
pollInterval time.Duration
maxAttempts int
}
func (r *OutboxRelay) Run(ctx context.Context) error {
ticker := time.NewTicker(r.pollInterval)
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
if err := r.processBatch(ctx); err != nil {
r.logger.Error("outbox relay batch failed", "err", err)
}
}
}
}
The relay respects context cancellation, which makes it straightforward to integrate with graceful shutdown. For a detailed treatment of context lifetime and cancellation patterns, see Go context.Context Done Right.
FOR UPDATE SKIP LOCKED: the concurrent worker pattern
The processBatch function uses FOR UPDATE SKIP LOCKED to safely handle concurrent relay workers:
func (r *OutboxRelay) processBatch(ctx context.Context) error {
tx, err := r.db.BeginTx(ctx, nil)
if err != nil {
return fmt.Errorf("begin tx: %w", err)
}
defer tx.Rollback()
rows, err := tx.QueryContext(ctx, `
SELECT id, aggregate_type, aggregate_id, event_type, payload
FROM outbox_events
WHERE processed_at IS NULL
AND attempts < $1
ORDER BY created_at
LIMIT $2
FOR UPDATE SKIP LOCKED
`, r.maxAttempts, r.batchSize)
if err != nil {
return fmt.Errorf("query outbox: %w", err)
}
defer rows.Close()
type row struct {
id string
aggregateType string
aggregateID string
eventType string
payload json.RawMessage
}
var batch []row
for rows.Next() {
var e row
if err := rows.Scan(
&e.id, &e.aggregateType, &e.aggregateID, &e.eventType, &e.payload,
); err != nil {
return fmt.Errorf("scan row: %w", err)
}
batch = append(batch, e)
}
if err := rows.Err(); err != nil {
return err
}
for _, e := range batch {
if err := r.publisher.Publish(ctx, e.eventType, e.aggregateID, e.payload); err != nil {
r.logger.Error("publish failed", "event_id", e.id, "err", err)
if _, err := tx.ExecContext(ctx,
`UPDATE outbox_events SET attempts = attempts + 1 WHERE id = $1`, e.id,
); err != nil {
r.logger.Error("increment attempts failed", "event_id", e.id, "err", err)
}
continue
}
if _, err := tx.ExecContext(ctx,
`UPDATE outbox_events SET processed_at = NOW() WHERE id = $1`, e.id,
); err != nil {
return fmt.Errorf("mark processed: %w", err)
}
}
return tx.Commit()
}
FOR UPDATE SKIP LOCKED does two things. First, FOR UPDATE locks the selected rows for the duration of the transaction, preventing any other transaction from selecting them. Second, SKIP LOCKED means that if a row is already locked by another transaction, the query skips it instead of waiting. The result is that multiple relay workers can run in parallel and each will pick up a non-overlapping subset of rows.
Without SKIP LOCKED, a second worker would block until the first transaction commits before seeing the same rows – at which point they would already be marked done. With SKIP LOCKED, the second worker immediately picks up different rows instead of waiting, giving you safe horizontal scaling.
Note the scan-then-publish separation in the code above: all rows are scanned into a slice before the publish loop starts. This avoids holding an open *sql.Rows cursor across network calls to the broker, which would hold the transaction open longer than necessary.
Idempotency and deduplication
The relay publishes at least once. If it publishes an event and then crashes before committing the processed_at update, it will publish the same event again on restart. This is unavoidable – exactly-once delivery across a database and a message broker without a distributed transaction coordinator requires this trade-off.
Consumers must be idempotent. The simplest approach is to track processed event IDs in a processed_events table:
CREATE TABLE processed_events (
event_id UUID PRIMARY KEY,
processed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
func (h *OrderHandler) HandleOrderCreated(ctx context.Context, eventID string, payload []byte) error {
// Deduplicate using the event ID as the natural key
_, err := h.db.ExecContext(ctx, `
INSERT INTO processed_events (event_id) VALUES ($1)
ON CONFLICT (event_id) DO NOTHING
`, eventID)
if err != nil {
return fmt.Errorf("dedup check: %w", err)
}
// Check if the insert actually happened (1 row) or was a no-op (0 rows)
// A simpler approach: use RETURNING or check rows affected
// If 0 rows affected, this is a duplicate -- skip it
...
}
In practice, many teams rely on the broker’s own deduplication headers (such as Kafka’s key field for log-compacted topics, or RabbitMQ’s message-id header) and treat database-level deduplication as a fallback. Both are valid layers to apply.
Include the outbox event id (a UUID) in the published message as the deduplication key. Consumers can then use it regardless of which deduplication mechanism they prefer.
Retry policy and poison messages
The attempts column drives the retry policy. The relay skips rows where attempts >= maxAttempts and treats those rows as dead letters. A separate process or operator alert handles them.
A simple dead-letter view:
CREATE VIEW outbox_dead_letters AS
SELECT *
FROM outbox_events
WHERE attempts >= 5
AND processed_at IS NULL
ORDER BY created_at;
A good production retry policy:
- Set
maxAttemptsto 5-10 depending on how expensive retries are. - Consider exponential backoff: include a
retry_aftercolumn and skip rows whereretry_after > NOW(). - Alert on
COUNT(*) FROM outbox_dead_lettersexceeding a threshold. - Provide a manual retry path: an admin endpoint or script that resets
attempts = 0andretry_after = NULLfor specific rows.
Poison messages – rows that consistently fail due to a bug in the consumer or a schema mismatch – should not block healthy messages. Since the relay processes a batch per tick and marks failures with an attempt increment rather than removing them from the queue, healthy rows proceed normally while poisoned ones accumulate attempts until they hit the dead-letter threshold.
Event ordering and partitioning
The polling query orders by created_at, which gives first-in-first-out ordering within a batch. For most use cases that is enough. When strict per-entity ordering matters – for example, ensuring that order.updated is never published before order.created for the same order – you need per-aggregate ordering.
Add aggregate_id to the ORDER BY clause and use it as the message key when publishing to a partitioned topic like Apache Kafka. Kafka routes all messages with the same key to the same partition, and partitions are consumed in order. This gives you per-aggregate ordering guarantees without global ordering, which would require a single relay instance.
ORDER BY aggregate_id, created_at
For brokers that do not support partitioned ordering (such as basic AMQP queues), single-instance relay or application-level ordering checks in the consumer are the practical alternatives.
Reduce polling latency with LISTEN/NOTIFY
A polling interval of one second means average event latency of 500 milliseconds. For most workloads that is fine. For cases where you need near-zero latency, PostgreSQL’s LISTEN/NOTIFY mechanism lets the relay wake up immediately when a new outbox row is inserted.
Add a trigger to the outbox table:
CREATE OR REPLACE FUNCTION notify_outbox_insert() RETURNS trigger AS $$
BEGIN
PERFORM pg_notify('outbox_event', NEW.id::text);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER outbox_insert_notify
AFTER INSERT ON outbox_events
FOR EACH ROW EXECUTE FUNCTION notify_outbox_insert();
In the relay, listen on the channel and wake up on notifications while still falling back to periodic polling:
func (r *OutboxRelay) Run(ctx context.Context) error {
listener := pq.NewListener(r.dsn, 10*time.Second, time.Minute, nil)
defer listener.Close()
if err := listener.Listen("outbox_event"); err != nil {
return fmt.Errorf("listen: %w", err)
}
ticker := time.NewTicker(5 * time.Second) // fallback poll
defer ticker.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-listener.Notify:
if err := r.processBatch(ctx); err != nil {
r.logger.Error("outbox batch failed (notify)", "err", err)
}
case <-ticker.C:
if err := r.processBatch(ctx); err != nil {
r.logger.Error("outbox batch failed (poll)", "err", err)
}
}
}
}
The fallback ticker handles any notifications missed during a relay restart or network hiccup. Keep the fallback interval at a few seconds rather than milliseconds – its job is recovery, not low latency.
Observability: metrics, logs, and alerts
The outbox is infrastructure. Treat it like infrastructure and instrument it accordingly.
Key metrics:
var (
outboxPublished = prometheus.NewCounter(prometheus.CounterOpts{
Name: "outbox_events_published_total",
Help: "Total outbox events successfully published.",
})
outboxFailed = prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "outbox_events_failed_total",
Help: "Total outbox publish failures by event type.",
}, []string{"event_type"})
outboxPending = prometheus.NewGauge(prometheus.GaugeOpts{
Name: "outbox_events_pending",
Help: "Current number of unprocessed outbox events.",
})
outboxBatchDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "outbox_batch_duration_seconds",
Help: "Duration of each outbox processing batch.",
Buckets: prometheus.DefBuckets,
})
)
Gauge refresh: run a periodic query to keep outbox_events_pending accurate:
SELECT COUNT(*) FROM outbox_events WHERE processed_at IS NULL;
Alert thresholds to consider:
outbox_events_pending > 1000for more than two minutes: relay is falling behind or stuck.outbox_events_pendinggrowing monotonically: broker is down or relay has crashed.- Dead-letter count non-zero: schema or consumer bug needs investigation.
outbox_batch_duration_seconds p95 > 5s: database is slow or batch size is too large.
Structured log fields: include event_id, event_type, aggregate_id, and attempt in every log line from the relay. These fields let you correlate a failed publish with the specific outbox row and the downstream consumer trace.
Outbox vs. direct queue vs. saga
The outbox pattern is not the right tool for every coordination problem. Here is the comparison:
| Approach | Atomicity | Complexity | When to use |
|---|---|---|---|
| Direct publish | None | Low | Acceptable to occasionally lose events |
| Transactional outbox | Strong | Medium | Reliable event delivery from a single service |
| Saga pattern | Eventual | High | Multi-service transactions that span multiple databases |
| Two-phase commit | Strong | Very high | Rarely practical; avoided in most distributed systems |
The outbox pattern guarantees that a single service reliably emits events that reflect its own state changes. It does not coordinate state changes across multiple services – that is what the Saga pattern is for. The choice of broker – whether RabbitMQ, SQS, or Kafka – is independent of the outbox pattern itself; the relay publishes to whichever broker your system uses.
If you are building a saga, the outbox pattern is still useful: each participant in the saga writes its local state change and its saga event in one transaction using the outbox, then the saga orchestrator or choreography reads those events reliably.
WAL-based CDC as an alternative relay
Instead of polling, you can tail PostgreSQL’s Write-Ahead Log (WAL) and read outbox inserts directly from the replication stream. Tools like Debezium do this. The advantages are lower latency and no lock pressure on the outbox table. The disadvantages are operational complexity, a dedicated PostgreSQL replication slot, and an external service to run and monitor.
For most teams, the polling relay described above is the right starting point. WAL tailing makes sense when you have high outbox insert rates (tens of thousands per second), need sub-100ms event latency, or are already running Debezium for other change-capture needs.
sqlc integration
If you use sqlc for type-safe Go database code, the outbox queries fit naturally:
-- name: InsertOutboxEvent :exec
INSERT INTO outbox_events (aggregate_type, aggregate_id, event_type, payload)
VALUES (@aggregate_type, @aggregate_id, @event_type, @payload);
-- name: FetchOutboxBatch :many
SELECT id, aggregate_type, aggregate_id, event_type, payload
FROM outbox_events
WHERE processed_at IS NULL
AND attempts < @max_attempts
ORDER BY created_at
LIMIT @batch_size
FOR UPDATE SKIP LOCKED;
-- name: MarkOutboxProcessed :exec
UPDATE outbox_events SET processed_at = NOW() WHERE id = @id;
-- name: IncrementOutboxAttempts :exec
UPDATE outbox_events SET attempts = attempts + 1 WHERE id = @id;
-- name: OutboxPendingCount :one
SELECT COUNT(*) FROM outbox_events WHERE processed_at IS NULL;
sqlc generates type-safe functions for each query, which avoids string interpolation errors and keeps the outbox query logic co-located with the rest of your database access layer.
Production checklist
Use this before shipping an outbox implementation:
Database
- Outbox table has the partial index on
created_at WHERE processed_at IS NULL -
attemptscolumn present with a default of 0 - Dead-letter view or query defined
- Old processed rows are periodically archived or deleted (a nightly cleanup job suffices)
Relay
-
FOR UPDATE SKIP LOCKEDused in the polling query - Relay runs inside a transaction (begin before query, commit after all updates)
- Batch size is bounded (50-200 rows is typical)
- Relay respects context cancellation for graceful shutdown
- Failed publishes increment
attemptsrather than causing the batch to abort
Idempotency
- Published message includes the outbox
idas a deduplication key - Consumers are idempotent or the broker provides deduplication
- See Idempotency in Distributed Systems for deduplication patterns
Observability
-
outbox_events_pendinggauge is monitored and alerted on - Dead-letter count is alerted on
- Relay batch duration is tracked
- Structured logs include
event_id,event_type, andaggregate_id
Operations
- Manual retry path exists for dead-letter rows
- Relay restart behavior is tested (does it re-publish correctly?)
- Broker outage behavior is tested (does the outbox grow and drain correctly?)
Final thoughts
The dual-write problem is easy to dismiss as an edge case until it causes an incident. The transactional outbox pattern solves it with tools you already have: a PostgreSQL transaction, a background goroutine, and one extra table. The relay is simple to build, simple to operate, and simple to reason about.
The cost is that consumers must be designed for at-least-once delivery. That is a reasonable trade-off. Exactly-once delivery across a database and a broker without distributed transactions is not achievable in practice – and pretending otherwise leads to systems that silently drop or double-process events under failure conditions.
Write the event with the data. Relay it reliably. Make consumers idempotent. That is the whole pattern.
This article is part of the App Architecture in Production cluster.