Operations & runbooks.

How quality is enforced before a deploy, what to look at when something breaks, and how to recover from the specific failure modes that have actually happened on this codebase.

Quality gates

Four gates sit between a commit and a deploy:

flowchart LR
  classDef u fill:#ecfeff,stroke:#0891b2,color:#0f172a
  classDef a fill:#fff7ed,stroke:#f97316,color:#0f172a
  classDef i fill:#eef2ff,stroke:#1e3a5f,color:#0f172a
  classDef e fill:#f0fdf4,stroke:#16a34a,color:#0f172a
  PR["PR opened"] --> UNIT["Unit tests<br/>~1,200 JUnit · 94.9% cov"]:::u
  UNIT --> ARCH["ArchUnit · 16 rules<br/>layering · naming · contracts"]:::a
  ARCH --> INT["Integration · @Tag('integration')<br/>Tenant · Payment · Cascade"]:::i
  INT --> E2E["Playwright E2E · 85 specs<br/>10 role projects"]:::e
  E2E --> MERGE["Merge to dev"]
  MERGE --> DEPLOY["make api-deploy / web-deploy"]

The first three run as part of ./gradlew check. Playwright runs from e2e/ against a chosen base URL.

Testing

Backend

cd backend
ulimit -n 8192 && ./gradlew test                                      # all tests (raised open-file limit)
./gradlew test --tests "com.cplk.api.architecture.ArchitectureTest"   # architecture rules only
ulimit -n 8192 && ./gradlew test jacocoTestReport                     # tests + JaCoCo coverage report
./gradlew check                                                       # tests + 80% coverage gate
  • Unit · ~90 service/util tests.
  • Integration · TenantIsolationIntegrationTest, PaymentIdempotencyIntegrationTest, CascadeDeleteIntegrationTest (Testcontainers Postgres).
  • Coverage HTML: backend/build/reports/jacoco/test/html/index.html
  • Test report: backend/build/reports/tests/test/index.html

E2E (Playwright)

cd e2e
npx playwright test                  # all
npx playwright test --project agent  # one role project
CPLK_E2E_BASE_URL=https://dev.cplk.org npx playwright test
  • 85 spec files, ~24.6k LOC.
  • 10 role projects (anonymous, agent, agency-admin, agency-super-admin, super-admin, blog-editor, financial-officer, property-owner, plus setup + anonymous-journeys).
  • Auth handled in setup project — Keycloak UI login → .auth/<role>.json.
  • Workers: 2 local / 1 CI; retries 0 (journeys must be deterministic).
  • Locale en-US, timezone Asia/Colombo.

Observability

Health & metrics

  • GET /api/actuator/health — used by Nginx + deploy script.
  • /actuator/info — build/commit metadata.
  • /actuator/metrics — Micrometer; details when-authorized in dev, never in prod.
  • No Sentry / Datadog / Grafana wired in today. Add an exporter + dashboard before multi-node rollout.
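
Since both Nginx and the deploy script key off the health endpoint, it can be handy to reduce its response to the aggregate status when checking by hand. The helper below is only a sketch, not the deploy script's actual check: health_status is a hypothetical name, and it assumes the aggregate status is the first "status" field in the JSON body, which is how Spring Boot normally orders it.

```shell
# Hypothetical helper: reads an Actuator health JSON body on stdin and
# prints the aggregate status (UP, DOWN, OUT_OF_SERVICE, ...).
# Assumption: the aggregate "status" appears before any component status.
health_status() {
  grep -o '"status"[[:space:]]*:[[:space:]]*"[A-Z_]*"' \
    | head -n 1 \
    | sed 's/.*"\([A-Z_]*\)"$/\1/'
}
```

Typical use would be something like: curl -sS "$BASE_URL/api/actuator/health" | health_status, where BASE_URL is a placeholder for the environment being checked. Avoid curl -f here, because a DOWN response is served with a 503 and -f would discard the body.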

Logs

  • Logback profile-driven (logback-spring.xml): console in dev, JSON (logstash encoder) in prod.
  • File: /opt/cplk/logs/app.log (rotated).
  • Journald: journalctl -u cplk-api -f.
  • Heap dump on OOM: /opt/cplk/logs/heapdump.hprof.

Audit ledger as observability

audit_logs and financial_audit_logs are first-class diagnostic tools: every state change records the actor, IP, old/new values, and request id. An index on request_id lets you trace a single transaction across services.

Frontend telemetry

  • Google Analytics 4 (GA_MEASUREMENT_ID) and Meta Pixel (FB_PIXEL_ID).
  • Sonner toasts capture user-visible 4xx/5xx.
  • No client error reporter today — Sentry browser SDK is the obvious next step.

Suggested SLOs

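| SLO                               | Target     | Window          | Where to read it today                                  |
|-----------------------------------|------------|-----------------|---------------------------------------------------------|
| API availability                  | 99.9%      | 30 d            | Nginx access log + Actuator health probe (manual)       |
| Public listing p95 latency        | < 800 ms   | 7 d             | Nginx $request_time histogram                           |
| Portal action success             | > 99.5%    | 30 d            | Audit-log success/failure ratio per action              |
| PayHere webhook latency-to-effect | < 5 s p95  | 7 d             | Diff payments.paid_at vs FinancialAuditLog.created_at   |
| Sev-2 MTTR                        | < 60 min   | rolling quarter | Manual postmortems                                      |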
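
With no metrics exporter wired in, the p95 latency SLO has to be read from raw $request_time values. A minimal sketch of that calculation follows, using the nearest-rank percentile method. The function name p95 is hypothetical, and it expects one request time per line on stdin; how those values are pulled out of the access log depends on the log_format in use, which is not specified here.

```shell
# Hypothetical helper: reads one request time (in seconds) per line on stdin
# and prints the 95th percentile using the nearest-rank method.
p95() {
  LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # ceil(0.95 * N) computed with integer arithmetic
      rank = int((NR * 95 + 99) / 100)
      print v[rank]
    }'
}
```

If $request_time happens to be the last field of each log line (an assumption), the call could look like: awk '{print $NF}' access.log | p95. The result is in seconds, so the 800 ms target corresponds to 0.800.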

Known gotchas

Real failure modes we've already hit — keep these top of mind:

Flyway checksum mismatch

Never modify an applied migration. If a checksum mismatch shows up:

-- on the affected env's DB
UPDATE flyway_schema_history SET checksum = <new_checksum>
WHERE version = '<version>';

Then create a corrective forward migration so the right state is reproducible elsewhere.

Keycloak issuer URL

The API reaches Keycloak internally at http://localhost:8090, but tokens carry the public URL https://auth.cplk.org/realms/cplk in their issuer claim. The JWK set URI uses the internal URL for direct access. If the configured issuer does not match the public URL, JWT validation fails with InvalidIssuerException.
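
When chasing an issuer mismatch it helps to see what a real token actually carries. The sketch below decodes the payload of a JWT and prints its iss claim so it can be compared with the configured issuer. It is a diagnostic aid only: jwt_issuer is a hypothetical name, the signature is not verified, and a base64 binary that accepts -d is assumed.

```shell
# Hypothetical helper: prints the "iss" claim of a JWT passed as $1.
# Diagnostic use only; the token signature is NOT verified.
jwt_issuer() {
  # The second dot-separated segment is the base64url-encoded payload.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the "=" padding that base64url encoding strips.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d \
    | sed -n 's/.*"iss"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

For a token issued by this realm the output should be https://auth.cplk.org/realms/cplk. Anything else, including the internal localhost URL, points at the misconfiguration described above.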

Spring env vars vs defaults

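A placeholder written as ${VAR:} resolves to an empty string when VAR is unset. That empty string counts as a defined value, so it overrides any @Value default. Always set env vars explicitly on the server rather than relying on defaults: an empty value takes precedence over an absent one.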
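
Because a blank value wins silently, one option is to check the server's env file for missing or empty keys before a deploy. The function below is a sketch under assumptions: check_env_file is a hypothetical name, and the file is assumed to use plain KEY=value lines.

```shell
# Hypothetical helper: check_env_file <file> <KEY>...
# Prints every key that is missing or assigned an empty value.
# Returns 0 only when all listed keys are present and non-empty.
check_env_file() {
  file=$1; shift
  bad=0
  for key in "$@"; do
    # Use the last assignment for the key; strip quotes so KEY="" counts as empty.
    value=$(sed -n "s/^${key}=//p" "$file" | tail -n 1 | tr -d "\"'")
    if [ -z "$value" ]; then
      echo "$key"
      bad=1
    fi
  done
  return "$bad"
}
```

For example: check_env_file /etc/cplk/api.env AUTH_JWK_SET_URI PAYHERE_MERCHANT_SECRET SMTP_PASSWORD. Any key it prints would resolve to an empty string, or to nothing at all, at startup.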

fail2ban

Rapid SSH connections to the API server trigger a ban. Recovery: hop via the DB VPS and unban the API VPS IP, or wait out the ban window.

Sitemap XML escaping

Next.js does not auto-escape & in sitemap URLs. Always emit &amp; manually in sitemap.ts.
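
A cheap guard against regressions is to scan the generated sitemap for ampersands that are not part of an XML entity. The check below is a sketch; find_bare_ampersands is a hypothetical name and is not part of the existing build.

```shell
# Hypothetical check: reads sitemap XML on stdin, prints "line:text" for each
# line that still contains a bare "&" after valid entities are removed.
# Returns 0 when the sitemap is clean, 1 when a bare "&" is found.
find_bare_ampersands() {
  if sed -E 's/&(amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);//g' \
      | grep -n '&'; then
    return 1
  fi
  return 0
}
```

It could be run against a deployed environment with: curl -sS "$BASE_URL/sitemap.xml" | find_bare_ampersands, where BASE_URL is a placeholder. The printed lines have their valid entities stripped, so they show only what is left over.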

React hydration warning #418

Hidden spans used for email obfuscation cause a hydration mismatch. Use plain text instead of dynamic hidden segments.

Runbooks

RB-01 · API 5xx spike

  1. Check journalctl -u cplk-api -f for stack traces; group by exception type.
  2. Hit /api/actuator/health — DB / mail / disk components.
  3. Was there a deploy in the last hour? If yes, roll back via symlink swap (Deployment → Rollback).
  4. If DB is slow: check pg_stat_activity for long-running queries; verify Hikari pool isn't exhausted (HIKARI_MAX_POOL_SIZE).
  5. If Keycloak is unreachable: confirm the JWKS endpoint is reachable from the API VPS; verify AUTH_JWK_SET_URI.
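
Step 1 asks for stack traces grouped by exception type. One way to get that grouping from raw log text is sketched below. The name count_exceptions is hypothetical, and the pattern assumes fully-qualified Java class names ending in Exception or Error.

```shell
# Hypothetical helper: reads log text on stdin and prints a count per
# fully-qualified exception class, most frequent first.
count_exceptions() {
  LC_ALL=C grep -oE '([a-z][a-z0-9_]*\.)+[A-Z][A-Za-z0-9_]*(Exception|Error)' \
    | sort | uniq -c | sort -rn
}
```

For the last hour of service logs this might be invoked as: journalctl -u cplk-api --since "1 hour ago" --no-pager | count_exceptions. Nested "Caused by" classes are counted too, which tends to surface the root cause alongside the wrapper.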

RB-02 · Suspected cross-tenant access

  1. Identify the offending request id from the audit log.
  2. Inspect the call stack — does it pass through a method tagged @BypassTenantFilter? If yes, was it the right role?
  3. Suspend the suspect user in Keycloak; revoke their sessions.
  4. Diff old_values / new_values in audit_logs to scope the breach.
  5. Write a regression test in TenantIsolationIntegrationTest before patching.

RB-03 · PayHere webhook outage

  1. Inspect Nginx access logs for POST /payments/notify — are calls arriving?
  2. Verify PayHere config: PAYHERE_BASE_URL, PAYHERE_MERCHANT_ID, PAYHERE_NOTIFY_URL.
  3. If calls arrive but fail signature: rotate PAYHERE_MERCHANT_SECRET with PayHere support; redeploy.
  4. Queue manual webhook replay via PayHere dashboard for affected orders.
  5. Confirm idempotency: querying by order_id should show a single SUCCESS row even after multiple deliveries.
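
For step 1, a per-status count of notify calls answers "are calls arriving?" and "are they succeeding?" in one pass. The sketch below assumes Nginx's default combined log format, where field 6 is the quoted method, field 7 the request path, and field 9 the status code. The name notify_status_counts is hypothetical.

```shell
# Hypothetical helper: reads Nginx access-log lines on stdin and prints
# "<status> <count>" for POST requests whose path contains /payments/notify.
# Assumption: default "combined" log format (method=$6, path=$7, status=$9).
notify_status_counts() {
  awk '$6 == "\"POST" && $7 ~ /\/payments\/notify/ { n[$9]++ }
       END { for (s in n) print s, n[s] }' | sort
}
```

No output at all suggests the calls are not reaching Nginx. A run of non-2xx statuses means they arrive but are being rejected or are failing inside the API, which is the case steps 2 and 3 cover.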

RB-04 · Recurring charges not firing

  1. Confirm scheduler ran: SELECT * FROM shedlock WHERE name = 'rechargeSubscriptions'.
  2. If locked_until is in the future from a crashed previous run, clear it manually.
  3. Check PayHereOAuthService for repeated 401s in logs — the bearer cache may be stuck.
  4. Look at FinancialAuditLog(event_type='PAYMENT_FAILED') for the agency to identify the failure mode (card declined, token revoked, etc.).

RB-05 · R2 upload failures

  1. Verify CF_R2_ACCESS_KEY not rotated; bucket exists; CORS allows the web origin.
  2. Fall back to LocalStorageService by setting app.storage.r2.enabled=false temporarily.
  3. Re-upload failed images from property_images rows whose base_key points to missing objects.

RB-06 · Email not delivered

  1. EMAIL_ENABLED true? NoOpEmailService silently no-ops if false.
  2. SMTP creds valid? Test with swaks --to me --server $SMTP_HOST:$SMTP_PORT --auth LOGIN --auth-user $SMTP_USERNAME --auth-password $SMTP_PASSWORD.
  3. Check Notification DLQ: DlqNotificationHandler records the original message + cause.

RB-07 · Flyway migration fails on deploy

  1. Do not edit the failed migration on disk.
  2. Identify the cause — locked rows? wrong assumed state?
  3. If checksum mismatch (someone edited an applied file): see Known gotchas.
  4. If actual schema error: write a fix-forward migration with a higher version; redeploy.

RB-08 · Database backup / restore

make db-backup env=prod   # pg_dump → encrypted artefact
make db-restore env=prod FILE=cplk-prod-2026-05-13.dump

Backups are not encrypted at rest on the DB VPS by default — copy them off the box and keep them in R2 / object storage with restricted access.

Disaster recovery

Order of restoration after total VPS loss:

  1. Provision new DB VPS, restore from latest pg_dump.
  2. Provision new API VPS, install Java 21, Nginx, systemd unit.
  3. Place /etc/cplk/api.env from secrets store; make api-deploy env=prod.
  4. Re-point DNS for api.cplk.org to the new API VPS.
  5. Re-deploy frontend (Cloudflare Pages is independent — should still serve).
  6. Verify Keycloak — auth.cplk.org may be its own host; if not, restore that too.
  7. Smoke test: log in, list properties, submit inquiry, run a sandbox PayHere checkout, verify webhook reaches API.

Backup strategy. Daily pg_dump via cron on the DB VPS, retained 30 days locally and 90 days off-box. Object storage (R2) is treated as immutable: image churn is low, so per-tenant export is the recovery path for accidental deletes.
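
The 30-day local retention could be enforced with a small pruning step run after the daily dump. This is a sketch rather than the existing cron job: prune_backups is a hypothetical name, the *.dump suffix is taken from the restore example above, and the backup directory is left to the caller.

```shell
# Hypothetical helper: prune_backups <dir> <days>
# Deletes *.dump files in <dir> (non-recursive) and prints each removed path.
prune_backups() {
  dir=$1
  days=$2
  # -mtime +N matches files older than N full days, so at least N+1 days old.
  find "$dir" -maxdepth 1 -type f -name '*.dump' -mtime +"$days" -print -delete
}
```

Called as prune_backups /path/to/backups 30 for the local copies. The off-box copies would use 90, assuming they live somewhere find can reach.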

Operational checklist (weekly)

  • Review FinancialAuditLog totals vs PayHere settlement report.
  • Sample 5 audit-log entries per agency to spot unusual patterns.
  • Confirm scheduled jobs ran: SELECT name, locked_at FROM shedlock.
  • Confirm Nginx error log clean (no 502/504 spikes).
  • Disk usage on both VPSes < 70% (logs + heap dumps grow).
  • Rotate SMTP_PASSWORD and PAYHERE_MERCHANT_SECRET on the documented cadence.
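
The disk-usage item in the checklist above is easy to script. The sketch below filters df -P output down to the mounts that breach the threshold. The name disks_over is hypothetical, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints
# "<mount> <capacity>" for every mount at or above the threshold.
# The threshold defaults to 70 to match the weekly checklist.
disks_over() {
  threshold=${1:-70}
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 >= t + 0) print $6, $5
  }'
}
```

Run it on each VPS as df -P | disks_over 70. Empty output means every mount is under the limit.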