Postgres Read Replicas¶

paas's managed Postgres addon (CNPG) supports asynchronous read replicas to offload read-heavy traffic from the writable primary. Cycle 2 ships the API + K8s wiring; the dashboard's "Read replicas" pane and the connection-pooling guides land in this page.

At a glance¶

Capability	Default	Where to configure
Backing operator	CNPG (`postgresql.cnpg.io/v1::Cluster`)	inherited from AD/39 postgres addon
Replica topology	1 primary + N standbys (async streaming)	`read_replicas` field
Connection routing	`DATABASE_URL` (writes) / `DATABASE_URL_RO` (reads)	dual env vars injected via Secrets
Replication mode	Async streaming via WAL	CNPG default
Plans supporting replicas	`business` (1 standby) / `enterprise` (≥1 standby)	enforced by `validate_plan_supports_read_replicas`
Plans NOT supporting replicas	`starter` (single-instance)	rejected with HTTP 400

CNPG palette is starter / business / enterprise

NOT free / standard / pro. The latter is the AddonRequest display palette (consumed by the create endpoint validator). The CNPG-side palette is what app_addons.plan carries forward into validate_plan_supports_read_replicas. Only business and enterprise ship at least one standby instance — the plans where read replicas are physically possible.

Lifecycle¶

flowchart LR
    A["PATCH /v1/apps/{id}/addons/{addon_id}<br/>{read_replicas: 2}"]
    A --> B["addons.rs::patch_addon"]
    B --> V{"plan ∈ business/enterprise?"}
    V -- no --> X["400 Bad Request"]
    V -- yes --> C["cnpg::ensure_cnpg_cluster_with_replicas<br/>instances = max(plan.instances, 1+N)"]
    C --> D["CNPG Operator scales StatefulSet<br/>+ emits {cluster}-ro Service"]
    D --> E["cnpg::ensure_database_url_ro_secret<br/>app-{app}-database-url-ro"]
    E --> F["state.config.set_config_vars<br/>DATABASE_URL_RO_SECRET = ..."]
    F --> G["Next deploy: pod envFrom both<br/>DATABASE_URL + DATABASE_URL_RO"]

Plans¶

CNPG plan	Instances	Standby slots	Replica capacity
`starter`	1	0	0 (rejected)
`business`	2	1	1 (no scale-up; same pod doubles as standby + replica target)
`enterprise`	3	2	up to 2 (or more — `ensure_cnpg_cluster_with_replicas` scales out: `instances = max(plan.instances, 1+N)`)

The max(...) invariant means the plan's HA floor wins: an enterprise cluster (3 instances) with read_replicas: 0 keeps its 3 instances. No downscale below the plan's HA floor — it would silently break the plan's availability guarantees.

API surface¶

PATCH /v1/apps/{app_id}/addons/{addon_id}
Body: { "read_replicas": 2 }

→ 200 OK
{
  "data": {
    "app_id": "...",
    "addon_id": "...",
    "read_replicas": 2,
    "total_instances": 3,
    "status": "updating"
  }
}

→ 400 Bad Request
{ "error": { "code": "bad_request",
             "message": "read_replicas requires plan=business or enterprise (CNPG palette, got plan=starter)" } }

GET /v1/apps/{app_id}/addons/{addon_id}/metrics

→ 200 OK
{
  "data": {
    "app_id": "...",
    "addon_id": "...",
    "replication_lag_seconds": null,
    "note": "cycle-2-stub: real Prometheus scrape wired in cycle 3 via parse_replication_lag"
  }
}

The replication_lag_seconds is null in cycle 2 — the endpoint shape is final but the live Prometheus scrape lands in cycle 3 via paas_database::cnpg::parse_replication_lag.

Connection routing in your application¶

The Postgres addon now injects two environment variables into the app pod:

DATABASE_URL — points at {cluster}-rw (writable primary). Use for INSERT / UPDATE / DELETE.
DATABASE_URL_RO — points at {cluster}-ro (CNPG-managed round-robin Service across the standbys). Use for SELECT on data that tolerates async replication lag.

Standby write attempts are rejected at the Postgres protocol level — even if your client misconfigures the URL, write queries get an explicit error rather than silent corruption.

Django¶

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'OPTIONS': dict(host=parsed_url.hostname, ...),
        ...
    },
    'replica': {
        'ENGINE': 'django.db.backends.postgresql',
        'OPTIONS': dict(host=parsed_ro.hostname, ...),
        ...
    },
}

DATABASE_ROUTERS = ['myapp.routers.PrimaryReplicaRouter']

# myapp/routers.py
class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        return 'replica'
    def db_for_write(self, model, **hints):
        return 'default'

SQLAlchemy¶

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

write_engine = create_engine(os.environ['DATABASE_URL'])
read_engine = create_engine(os.environ['DATABASE_URL_RO'])

WriteSession = sessionmaker(bind=write_engine)
ReadSession = sessionmaker(bind=read_engine)

# In your view / handler:
def list_users():
    with ReadSession() as s:
        return s.query(User).all()

Node.js (pg-pool)¶

const { Pool } = require('pg');

const writePool = new Pool({ connectionString: process.env.DATABASE_URL });
const readPool  = new Pool({ connectionString: process.env.DATABASE_URL_RO });

// Use readPool for SELECT-only paths, writePool for everything else.

Rust (sqlx)¶

let write_pool = sqlx::PgPool::connect(&std::env::var("DATABASE_URL")?).await?;
let read_pool  = sqlx::PgPool::connect(&std::env::var("DATABASE_URL_RO")?).await?;

Replication lag — what it means¶

CNPG runs asynchronous streaming replication. Standbys apply WAL records emitted by the primary with some delay — typically sub-second on a healthy network, but a slow standby or a write spike can push lag into seconds. Read-only queries hitting DATABASE_URL_RO may see slightly stale data.

If your code path requires read-your-own-writes consistency (e.g. "user just submitted a form, immediately load the created row"), use DATABASE_URL (the writable primary) for that path. Async replication is not a replacement for strong consistency — it's a load-spreading optimisation for read-heavy paths.

Limits (cycle 2)¶

Downscale: setting read_replicas below the current count is out of scope cycle 2 — the wire accepts the request but the K8s Patch::Apply will trigger a rolling-down of standbys without a pre-flight safety check (no warning if the standby is the only one). Cycle 3 adds a --force flag plus a confirmation prompt in the dashboard.
replication_lag_seconds: null in cycle 2 (the metrics endpoint shape is final; the scrape wire lands cycle 3 via parse_replication_lag).
Cross-AZ routing: the -ro Service is round-robin across all standbys in the cluster (no AZ affinity). Cycle 3+ may add a closest-standby mode if cross-AZ latency becomes a real issue.
Connection pooling (PgBouncer): out of scope. Tenants who need tight pooling can run their own sidecar.

FAQ¶

Q: Does enterprise automatically have read replicas? A: Yes — enterprise ships 3 instances out of the box (1 primary + 2 standbys). The standbys are queryable via DATABASE_URL_RO from day one without a PATCH. The PATCH read_replicas flow is for scaling beyond the plan's default count (e.g. enterprise + read_replicas: 5 → 6 instances total).

Q: Can business have 2 read replicas? A: Yes — PATCH read_replicas: 2 on business sets instances = max(2, 1+2) = 3, so the cluster scales from 2 to 3 instances (1 primary + 2 standbys).

Q: What happens to DATABASE_URL_RO if I disable read replicas (read_replicas: 0)? A: The Secret stays in place (the -ro Service still exists on business/enterprise because the plan ships ≥ 1 standby). The URL keeps working but with reduced redundancy.

Q: How do I detect replication lag programmatically? A: Cycle 2 ships the endpoint shape only. Cycle 3 will return the live replication_lag_seconds from the CNPG sidecar metrics. Until then, use pg_stat_replication.lag_seconds directly via a psql connection to the primary if you need the value urgently.

Implementation pointers¶

Concern	File
AddonConfig `read_replicas` field	`crates/build/src/paas_toml.rs::AddonConfig`
CNPG palette predicate	`crates/database/src/cnpg.rs::plan_supports_read_replicas`
Instance count computation	`crates/database/src/cnpg.rs::cnpg_instances_for_replicas`
Cluster Patch::Apply wrapper	`crates/database/src/cnpg.rs::ensure_cnpg_cluster_with_replicas`
RO URL formatter	`crates/database/src/cnpg.rs::database_url_ro`
RO URL derive from RW	`crates/database/src/cnpg.rs::database_url_ro_from_rw`
RO Secret materialiser	`crates/database/src/cnpg.rs::ensure_database_url_ro_secret`
Replication lag parser	`crates/database/src/cnpg.rs::parse_replication_lag`
PATCH route handler	`crates/control-plane/src/routes/addons.rs::patch_addon`
Validator	`crates/control-plane/src/routes/addons.rs::validate_plan_supports_read_replicas`
Metrics endpoint	`crates/control-plane/src/routes/addons.rs::get_addon_replication_metrics`

Postgres Addon — base CNPG addon flow this extends.
PITR Backup — sister CNPG capability (continuous WAL archiving).
Add-ons — umbrella addon flow.