Operations & runbooks.

How quality is enforced before a deploy, what to look at when something breaks, and how to recover from the specific failure modes that have actually happened on this codebase.

Backend tests: 1,247+ · Coverage gate: 80% · E2E specs: 88 · Runbooks: 8 (RB-* drills) · APM: Actuator + logs (Sentry pending)

Quality gates

Four gates sit between a commit and a deploy:

flowchart TB
  classDef u fill:#ecfeff,stroke:#0891b2,color:#0f172a
  classDef a fill:#fff7ed,stroke:#f97316,color:#0f172a
  classDef i fill:#eef2ff,stroke:#1e3a5f,color:#0f172a
  classDef e fill:#f0fdf4,stroke:#16a34a,color:#0f172a
  PR["PR opened"] --> UNIT["Unit tests<br/>~1,200 JUnit · 94.9% cov"]:::u
  UNIT --> ARCH["ArchUnit · 25 rules<br/>layering · naming · contracts"]:::a
  ARCH --> INT["Integration tests<br/>@Tag('integration')<br/>Tenant · Payment · Cascade"]:::i
  INT --> E2E["Playwright E2E<br/>88 specs · 20 journeys · 10 role projects"]:::e
  E2E --> MERGE["Merge to dev"]
  MERGE --> DEPLOY["make api-deploy / web-deploy"]

The first three run as part of ./gradlew check. Playwright runs from e2e/ against a chosen base URL.

Testing

Backend

cd backend
ulimit -n 8192 && ./gradlew test                 # all tests
./gradlew test --tests "com.cplk.api.architecture.ArchitectureTest"
ulimit -n 8192 && ./gradlew test jacocoTestReport
./gradlew check                                  # tests + 80% gate

Unit · ~90 service/util tests.
Integration · TenantIsolationIntegrationTest, PaymentIdempotencyIntegrationTest, CascadeDeleteIntegrationTest (Testcontainers Postgres).
Coverage HTML: backend/build/reports/jacoco/test/html/index.html
Test report: backend/build/reports/tests/test/index.html

E2E (Playwright)

cd e2e
npx playwright test                  # all
npx playwright test --project agent  # one role project
CPLK_E2E_BASE_URL=https://dev.cplk.org npx playwright test

88 spec files, ~25k LOC, 20 journeys, ~938 tests.
10 projects: eight role projects (anonymous, agent, agency-admin, agency-super-admin, super-admin, blog-editor, financial-officer, property-owner) plus setup and anonymous-journeys.
Auth handled in setup project — Keycloak UI login → .auth/<role>.json.
Workers: 2 local / 1 CI; retries 0 (journeys must be deterministic).
Locale en-US, timezone Asia/Colombo.

Observability

Health & metrics

GET /api/actuator/health — used by Nginx + deploy script.
/actuator/info — build/commit metadata.
/actuator/metrics — Micrometer; details when-authorized in dev, never in prod.
No Sentry / Datadog / Grafana wired in today. Add an exporter + dashboard before multi-node rollout.
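
Until an exporter is wired in, the health endpoint can be read from a script. The following is a minimal sketch, not the deploy script's actual check: the function name health_status and the API_BASE variable are assumptions for illustration, and the parser assumes the usual Actuator response shape in which the overall status is the first "status" key.

```shell
# Hypothetical helper: print the overall status from an Actuator health
# document read on stdin (for example UP, DOWN, OUT_OF_SERVICE).
health_status() {
  grep -o '"status"[[:space:]]*:[[:space:]]*"[A-Z_]*"' | head -n 1 | cut -d'"' -f4
}

# Usage, assuming API_BASE points at the API host (hypothetical variable).
# curl runs without -f so that the body of a 503 response is still parsed:
#   curl -sS "$API_BASE/api/actuator/health" | health_status
```

Component statuses (DB, mail, disk) appear later in the same document, so only the first match is taken as the overall result.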

Logs

Logback profile-driven (logback-spring.xml): console in dev, JSON (logstash encoder) in prod.
File: /opt/cplk/logs/app.log (rotated).
Journald: journalctl -u cplk-api -f.
Heap dump on OOM: /opt/cplk/logs/heapdump.hprof.

Audit ledger as observability

audit_logs and financial_audit_logs are first-class diagnostic tools: every state change carries actor, IP, old/new values, request id. Index on request_id lets you trace a single transaction across services.
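
Tracing one request through the ledger could look like the sketch below. Only the table name and request_id come from this page; created_at is an assumed column name, and audit_trace_sql and DATABASE_URL are hypothetical names introduced for the example.

```shell
# Hypothetical helper: emit a query listing every audit_logs row for one
# request id, oldest first. created_at is an assumed column name.
audit_trace_sql() {
  cat <<'SQL'
SELECT *
FROM audit_logs
WHERE request_id = :'rid'
ORDER BY created_at;
SQL
}

# Usage, assuming DATABASE_URL is set (hypothetical variable). psql substitutes
# :'rid' as a quoted literal, so the id is never spliced into the SQL by hand:
#   audit_trace_sql | psql "$DATABASE_URL" -v rid="<request id>"
```

The same shape should work against financial_audit_logs if it carries the same request_id column.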

Frontend telemetry

Google Analytics 4 (GA_MEASUREMENT_ID) and Meta Pixel (FB_PIXEL_ID).
Sonner toasts capture user-visible 4xx/5xx.
No client error reporter today — Sentry browser SDK is the obvious next step.

Suggested SLOs

SLO	Target	Window	Where to read it today
API availability	99.9%	30 d	Nginx access log + Actuator health probe (manual)
Public listing p95 latency	< 800 ms	7 d	Nginx `$request_time` histogram
Portal action success	> 99.5%	30 d	Audit-log success/failure ratio per action
PayHere webhook latency-to-effect	< 5 s p95	7 d	Diff `payments.paid_at` vs FinancialAuditLog.created_at
Sev-2 MTTR	< 60 min	rolling Q	Manual postmortems
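
The latency row can be read by hand with a small pipeline. This is a sketch using the nearest-rank percentile method; it assumes $request_time is the last field of each access-log line, which depends on the local log_format directive, and the log path shown is illustrative.

```shell
# Hypothetical helper: nearest-rank 95th percentile of the numbers on stdin,
# one value per line. Exits non-zero when there is no input.
p95() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR) using integer steps
      print v[rank]
    }'
}

# Usage, assuming $request_time (seconds) is the last log field:
#   awk '{ print $NF }' /var/log/nginx/access.log | p95
```

A result above 0.800 means the 800 ms target is missed for the lines sampled; filter the log to public listing paths and the 7-day window first.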

Known gotchas

Real failure modes we've already hit — keep these top of mind:

Flyway checksum mismatch

Never modify an applied migration. If a checksum mismatch shows up:

-- on the affected env's DB
UPDATE flyway_schema_history SET checksum = <new_checksum>
WHERE version = '<version>';

Then create a corrective forward migration so the right state is reproducible elsewhere.

Keycloak issuer URL

Internal URL is http://localhost:8090, but the issuer claim uses the public URL https://auth.cplk.org/realms/cplk. JWK set URI uses the internal URL for direct access. Misconfiguring this fails JWT validation with InvalidIssuerException.
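
When this is suspected, comparing the iss claim of a real token with the configured issuer settles it quickly. The sketch below only decodes the payload and does not verify the signature, so it is a debugging aid and nothing more; jwt_issuer and TOKEN are hypothetical names.

```shell
# Hypothetical helper: print the iss claim of a JWT passed as $1.
# Decodes only; the signature is NOT verified.
jwt_issuer() {
  # Take the payload segment and map base64url back to standard base64.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that JWT encoding strips.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d \
    | grep -o '"iss"[[:space:]]*:[[:space:]]*"[^"]*"' | cut -d'"' -f4
}

# Usage, with TOKEN holding an access token copied from a failing request:
#   jwt_issuer "$TOKEN"    # expected here: https://auth.cplk.org/realms/cplk
```

If the printed value is the internal http://localhost:8090 form, or the API is configured with the internal URL as issuer, validation fails as described above.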

Spring env vars vs defaults

${VAR:} resolves to an empty string when VAR is unset, and that empty string counts as a value, so it overrides any @Value default. Always set env vars explicitly on the server: an empty value takes precedence over an absent one.
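
A pre-deploy check can catch both cases before Spring ever sees them. This is a sketch under stated assumptions: check_env is a hypothetical helper, the file is expected to hold plain NAME=value lines, and the last assignment of a name is taken as the effective one.

```shell
# Hypothetical helper: report required variables that are missing from an
# env file or assigned an empty value. Returns non-zero if any are found.
check_env() {
  file=$1
  shift
  rc=0
  for name in "$@"; do
    # Assumption: the last assignment of a name is the effective one.
    line=$(grep -E "^${name}=" "$file" | tail -n 1)
    value=${line#*=}
    if [ -z "$line" ]; then
      echo "MISSING $name"
      rc=1
    elif [ -z "$value" ] || [ "$value" = '""' ]; then
      echo "EMPTY $name"
      rc=1
    fi
  done
  return $rc
}

# Usage on the API VPS (the variable list is illustrative):
#   check_env /etc/cplk/api.env SMTP_HOST SMTP_PASSWORD PAYHERE_MERCHANT_SECRET
```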

fail2ban

Rapid SSH connections to the API server trigger a ban. Recovery: hop via the DB VPS and unban the API VPS IP, or wait the ban window.

Sitemap XML escaping

Next.js does not auto-escape & in sitemap URLs. Always emit &amp; manually in sitemap.ts.

React hydration warning #418

Hidden spans used for email obfuscation cause hydration mismatch. Use plain text instead of dynamic hidden segments.

Runbooks

RB-01 · API 5xx spike

Check journalctl -u cplk-api -f for stack traces; group by exception type.
Hit /api/actuator/health — DB / mail / disk components.
Was there a deploy in the last hour? If yes, rollback via symlink swap (Deployment → Rollback).
If DB is slow: check pg_stat_activity for long-running queries; verify Hikari pool isn't exhausted (HIKARI_MAX_POOL_SIZE).
If Keycloak unreachable: validate JWKS reachable from the API VPS; verify AUTH_JWK_SET_URI.
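
The grouping in the first step can be done with a short pipeline. A sketch, assuming Java-style class names ending in Exception or Error appear in the journal output; top_exceptions is a hypothetical helper name.

```shell
# Hypothetical helper: count exception and error class names in log text on
# stdin, most frequent first. Output lines look like "<count> <class>".
top_exceptions() {
  grep -oE '[A-Za-z_][A-Za-z0-9_.]*(Exception|Error)' \
    | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage on the API VPS:
#   journalctl -u cplk-api --since "1 hour ago" --no-pager | top_exceptions | head
```

A single dominant class usually points at one of the branches above (database, Keycloak, or a bad deploy); an even spread suggests looking at the deploy history first.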

RB-02 · Suspected cross-tenant access

Identify the offending request id from the audit log.
Inspect the call stack — does it pass through a method tagged @BypassTenantFilter? If yes, was it the right role?
Suspend the suspect user in Keycloak; revoke their sessions.
Diff old_values / new_values in audit_logs to scope the breach.
Write a regression test in TenantIsolationIntegrationTest before patching.

RB-03 · PayHere webhook outage

Inspect Nginx access logs for POST /payments/notify — are calls arriving?
Verify PayHere config: PAYHERE_BASE_URL, PAYHERE_MERCHANT_ID, PAYHERE_NOTIFY_URL.
If calls arrive but fail signature: rotate PAYHERE_MERCHANT_SECRET with PayHere support; redeploy.
Queue manual webhook replay via PayHere dashboard for affected orders.
Confirm idempotency: querying by order_id should show a single SUCCESS row even after multiple deliveries.
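
For the first step, a status breakdown of the webhook calls shows at a glance whether PayHere is reaching the API and how the API answers. The sketch assumes Nginx's combined log format (method in field 6, path in field 7, status in field 9) and matches the path loosely in case the route sits under a prefix; notify_status_counts is a hypothetical name.

```shell
# Hypothetical helper: count webhook deliveries per HTTP status from an
# Nginx access log (combined format) read on stdin.
notify_status_counts() {
  awk '$6 == "\"POST" && $7 ~ /\/payments\/notify/ { n[$9]++ }
       END { for (s in n) print s, n[s] }' | sort
}

# Usage (log path is illustrative):
#   notify_status_counts < /var/log/nginx/access.log
```

No output means no calls are arriving, which points at PayHere or the notify URL; a run of 4xx or 5xx rows points at the signature or the API itself.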

RB-04 · Recurring charges not firing

Confirm scheduler ran: SELECT * FROM shedlock WHERE name = 'rechargeSubscriptions'.
If lock_until is still in the future because a previous run crashed, clear it manually.
Check PayHereOAuthService for repeated 401s in logs — the bearer cache may be stuck.
Look at FinancialAuditLog(event_type='PAYMENT_FAILED') for the agency to identify the failure mode (card declined, token revoked, etc.).
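
Clearing a stale lock can be done without deleting the row. The sketch below assumes ShedLock's default column names (name, lock_until, locked_at, locked_by) and that lock timestamps are stored as UTC in a timestamp column; adjust both if the table was customised. release_lock_sql and DATABASE_URL are hypothetical names.

```shell
# Hypothetical helper: emit SQL that shows the lock row, then releases it
# only if it is still held. Column names and UTC storage are assumptions.
release_lock_sql() {
  cat <<'SQL'
SELECT name, lock_until, locked_at, locked_by
FROM shedlock
WHERE name = :'job';

UPDATE shedlock
SET lock_until = timezone('utc', now())
WHERE name = :'job'
  AND lock_until > timezone('utc', now());
SQL
}

# Usage, assuming DATABASE_URL is set (hypothetical variable):
#   release_lock_sql | psql "$DATABASE_URL" -v job=rechargeSubscriptions
```

Run it only after confirming that no instance is still executing the job, since releasing a live lock allows a second concurrent run.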

RB-05 · R2 upload failures

Verify CF_R2_ACCESS_KEY not rotated; bucket exists; CORS allows the web origin.
Fall back to LocalStorageService by setting app.storage.r2.enabled=false temporarily.
Re-upload failed images from property_images rows whose base_key points to missing objects.

RB-06 · Email not delivered

EMAIL_ENABLED true? NoOpEmailService silently no-ops if false.
SMTP creds valid? Test with swaks --to me --server $SMTP_HOST:$SMTP_PORT --auth LOGIN --auth-user $SMTP_USERNAME --auth-password $SMTP_PASSWORD.
Check Notification DLQ: DlqNotificationHandler records the original message + cause.

RB-07 · Flyway migration fails on deploy

Do not edit the failed migration on disk.
Identify the cause — locked rows? wrong assumed state?
If checksum mismatch (someone edited an applied file): see Known gotchas.
If actual schema error: write a fix-forward migration with a higher version; redeploy.

RB-08 · Database backup / restore

make db-backup env=prod   # pg_dump → encrypted artefact
make db-restore env=prod FILE=cplk-prod-2026-05-13.dump

Backups are not encrypted at rest on the DB VPS by default — copy them off the box and keep them in R2 / object storage with restricted access.

Disaster recovery

Order of restoration after total VPS loss:

Provision new DB VPS, restore from latest pg_dump.
Provision new API VPS, install Java 21, Nginx, systemd unit.
Place /etc/cplk/api.env from secrets store; make api-deploy env=prod.
Re-point DNS for api.cplk.org to the new API VPS.
Re-deploy frontend (Cloudflare Pages is independent — should still serve).
Verify Keycloak — auth.cplk.org may be its own host; if not, restore that too.
Smoke test: log in, list properties, submit inquiry, run a sandbox PayHere checkout, verify webhook reaches API.

Backup strategy. Daily pg_dump via cron on the DB VPS, retained 30 days locally and 90 days off-box. Object storage (R2) is treated as immutable — image churn is low, so per-tenant export is the recovery path for accidental deletes.
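
A freshness check guards against the cron job failing silently. This is a sketch: backup_is_stale is a hypothetical helper, the .dump extension follows the file name used in RB-08, and the backup directory in the usage line is an assumption.

```shell
# Hypothetical helper: succeed when DIR holds no *.dump file modified within
# the last MAX_MINUTES (default 1500, about 25 hours for a daily job).
backup_is_stale() {
  dir=$1
  max_minutes=${2:-1500}
  recent=$(find "$dir" -maxdepth 1 -type f -name '*.dump' -mmin "-$max_minutes" | head -n 1)
  [ -z "$recent" ]
}

# Usage from cron or a monitoring hook; the directory is an assumption:
#   backup_is_stale /var/backups/cplk && echo "no pg_dump in the last 25 h"
```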

Operational checklist (weekly)

Review FinancialAuditLog totals vs PayHere settlement report.
Sample 5 audit-log entries per agency to spot unusual patterns.
Confirm scheduled jobs ran: SELECT name, locked_at FROM shedlock.
Confirm Nginx error log clean (no 502/504 spikes).
Disk usage on both VPSes < 70% (logs + heap dumps grow).
Rotate SMTP_PASSWORD and PAYHERE_MERCHANT_SECRET on the documented cadence.
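
The disk item lends itself to a scripted check. A sketch that reads POSIX df output (capacity in field 5, mount point in field 6) and prints every filesystem at or above the limit; over_threshold is a hypothetical name and mount points containing spaces are not handled.

```shell
# Hypothetical helper: print mount points whose usage is at or above LIMIT
# percent (default 70), reading `df -P` output on stdin.
over_threshold() {
  awk -v limit="${1:-70}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit) print $6, use "%"
    }'
}

# Usage on each VPS:
#   df -P | over_threshold 70
```

Empty output means every filesystem is under the limit.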