Operations & runbooks.
How quality is enforced before a deploy, what to look at when something breaks, and how to recover from the specific failure modes that have actually happened on this codebase.
Quality gates
Four gates sit between a commit and a deploy:
- Unit: ~1,200 JUnit tests, 94.9% coverage.
- ArchUnit: 16 rules covering layering, naming, and contracts.
- Integration: @Tag('integration') suites for Tenant, Payment, and Cascade.
- Playwright E2E: 85 specs across 10 role projects.
Passing all four leads to a merge to dev, followed by make api-deploy / web-deploy.
./gradlew check. Playwright runs from e2e/ against a chosen base URL.
Testing
Backend
cd backend
ulimit -n 8192 && ./gradlew test # all tests
./gradlew test --tests "com.cplk.api.architecture.ArchitectureTest"
ulimit -n 8192 && ./gradlew test jacocoTestReport
./gradlew check # tests + 80% gate
- Unit · ~90 service/util tests.
- Integration · TenantIsolationIntegrationTest, PaymentIdempotencyIntegrationTest, CascadeDeleteIntegrationTest (Testcontainers Postgres).
- Coverage HTML: backend/build/reports/jacoco/test/html/index.html
- Test report: backend/build/reports/tests/test/index.html
E2E (Playwright)
cd e2e
npx playwright test # all
npx playwright test --project agent # one role project
CPLK_E2E_BASE_URL=https://dev.cplk.org npx playwright test
- 85 spec files, ~24.6k LOC.
- 10 role projects (anonymous, agent, agency-admin, agency-super-admin, super-admin, blog-editor, financial-officer, property-owner, plus setup + anonymous-journeys).
- Auth handled in the setup project: Keycloak UI login → .auth/<role>.json.
- Workers: 2 local / 1 CI; retries 0 (journeys must be deterministic).
- Locale en-US, timezone Asia/Colombo.
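The settings in this list could be expressed roughly as follows. This is a minimal sketch, not the project's actual playwright.config.ts: buildE2eConfig is a hypothetical helper, only three role projects are shown, the fallback base URL is an assumption, and wiring each role to the setup project through dependencies is one plausible arrangement rather than a confirmed one. In a real config file the returned object would be passed to defineConfig from @playwright/test.

```typescript
// Hypothetical helper; the returned object mirrors the shape of a Playwright config.
interface RoleProject {
  name: string;
  dependencies?: string[];
  use?: { storageState?: string };
}

function buildE2eConfig(env: Record<string, string | undefined>) {
  const isCI = Boolean(env.CI);
  const roles = ["agent", "agency-admin", "super-admin"]; // subset, for brevity

  const projects: RoleProject[] = [
    // The setup project logs in through the Keycloak UI and writes .auth/<role>.json.
    { name: "setup" },
    ...roles.map((role) => ({
      name: role,
      dependencies: ["setup"], // assumption: roles wait for setup to finish
      use: { storageState: `.auth/${role}.json` },
    })),
  ];

  return {
    workers: isCI ? 1 : 2, // 2 local / 1 CI
    retries: 0, // journeys must be deterministic
    use: {
      baseURL: env.CPLK_E2E_BASE_URL ?? "http://localhost:3000", // assumed default
      locale: "en-US",
      timezoneId: "Asia/Colombo",
    },
    projects,
  };
}
```

Keeping retries at 0 means a flaky journey fails the run instead of being masked by a second attempt.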
Observability
Health & metrics
- GET /api/actuator/health: used by Nginx + deploy script.
- /actuator/info: build/commit metadata.
- /actuator/metrics: Micrometer; details when-authorized in dev, never in prod.
- No Sentry / Datadog / Grafana wired in today. Add an exporter + dashboard before multi-node rollout.
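Until an exporter exists, the health endpoint can be checked by a small script. The sketch below assumes the standard Spring Boot Actuator response shape (a top-level status plus a components map, which only appears when component details are exposed); unhealthyComponents is a hypothetical helper name.

```typescript
// Shape of an Actuator health response when components are exposed.
interface HealthNode {
  status: string;
  components?: Record<string, HealthNode>;
}

// Lists every component whose status is not "UP", e.g. ["mail=DOWN"].
function unhealthyComponents(health: HealthNode): string[] {
  const components = health.components ?? {};
  return Object.keys(components)
    .filter((name) => components[name].status !== "UP")
    .map((name) => `${name}=${components[name].status}`);
}

// Usage (not executed here):
//   const res = await fetch("https://api.cplk.org/api/actuator/health");
//   console.log(unhealthyComponents(await res.json()));
```

An empty result alongside an overall status of DOWN usually means component details are hidden for the caller, so the probe needs to run with whatever authorization the environment requires.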
Logs
- Logback profile-driven (logback-spring.xml): console in dev, JSON (logstash encoder) in prod.
- File: /opt/cplk/logs/app.log (rotated).
- Journald: journalctl -u cplk-api -f.
- Heap dump on OOM: /opt/cplk/logs/heapdump.hprof.
Audit ledger as observability
audit_logs and financial_audit_logs are first-class diagnostic tools: every state change carries actor, IP, old/new values, request id. Index on request_id lets you trace a single transaction across services.
Frontend telemetry
- Google Analytics 4 (GA_MEASUREMENT_ID) and Meta Pixel (FB_PIXEL_ID).
- Sonner toasts capture user-visible 4xx/5xx.
- No client error reporter today. Sentry browser SDK is the obvious next step.
Suggested SLOs
| SLO | Target | Window | Where to read it today |
|---|---|---|---|
| API availability | 99.9% | 30 d | Nginx access log + Actuator health probe (manual) |
| Public listing p95 latency | < 800 ms | 7 d | Nginx $request_time histogram |
| Portal action success | > 99.5% | 30 d | Audit-log success/failure ratio per action |
| PayHere webhook latency-to-effect | < 5 s p95 | 7 d | Diff payments.paid_at vs FinancialAuditLog.created_at |
| Sev-2 MTTR | < 60 min | rolling Q | Manual postmortems |
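Because these SLOs are read manually today, the arithmetic is worth pinning down. The sketch below shows one way to compute a nearest-rank p95 and an availability ratio from values already parsed out of the Nginx access log. The function names are hypothetical, log parsing is out of scope, and counting only 5xx responses as unavailability is an assumption. Nginx reports $request_time in seconds, so multiply by 1000 before comparing against the 800 ms target.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// Fraction of requests that did not end in a 5xx response.
function availability(statusCodes: number[]): number {
  if (statusCodes.length === 0) throw new Error("no requests");
  const serverErrors = statusCodes.filter((code) => code >= 500).length;
  return 1 - serverErrors / statusCodes.length;
}

// Example: latencies in milliseconds for 20 requests.
const latenciesMs = [
  100, 200, 300, 400, 500, 600, 700, 800, 900, 1000,
  1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000,
];
const p95 = percentile(latenciesMs, 95);
const meetsListingSlo = p95 < 800;
```

Nearest-rank avoids interpolation, which keeps the result equal to an actual observed request and makes it easy to find that request in the log.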
Known gotchas
Real failure modes we've already hit — keep these top of mind:
Flyway checksum mismatch
Never modify an applied migration. If a checksum mismatch shows up:
-- on the affected env's DB
UPDATE flyway_schema_history SET checksum = <new_checksum>
WHERE version = '<version>';
Then create a corrective forward migration so the right state is reproducible elsewhere.
Keycloak issuer URL
Internal URL is http://localhost:8090, but the issuer claim uses the public URL https://auth.cplk.org/realms/cplk. JWK set URI uses the internal URL for direct access. Misconfiguring this fails JWT validation with InvalidIssuerException.
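When validation fails, it helps to look at the iss claim the token actually carries. The following is a diagnostic sketch only: it decodes the payload without verifying the signature, so it must never be used for authorization. issuerOf and issuerMatches are hypothetical names, and the expected issuer is the public URL quoted above.

```typescript
const EXPECTED_ISSUER = "https://auth.cplk.org/realms/cplk";

// Decodes the payload segment of a compact JWT and returns its "iss" claim.
// No signature check is performed: this is for debugging only.
function issuerOf(jwt: string): string | undefined {
  const parts = jwt.split(".");
  if (parts.length !== 3) throw new Error("not a compact JWT");
  // base64url -> base64, then restore the padding that JWTs strip.
  let base64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  while (base64.length % 4 !== 0) base64 += "=";
  const payload = JSON.parse(atob(base64));
  return typeof payload.iss === "string" ? payload.iss : undefined;
}

function issuerMatches(jwt: string): boolean {
  return issuerOf(jwt) === EXPECTED_ISSUER;
}
```

A token whose issuer reads http://localhost:8090/realms/cplk points to Keycloak minting tokens under its internal hostname rather than the public one.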
Spring env vars vs defaults
A ${VAR:} placeholder resolves to an empty string when VAR is unset, and that empty string overrides any @Value default. Always set env vars explicitly on the server: an empty value takes precedence over an absent one.
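One way to catch this before a deploy is a preflight check that treats an empty variable the same as a missing one. This is a sketch under assumptions: unsetOrEmpty is a hypothetical helper, and the list of required names is illustrative, drawn from variables mentioned elsewhere in this document.

```typescript
// Returns the names that are absent, empty, or whitespace-only.
function unsetOrEmpty(
  required: string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Illustrative list; the real set lives in the server's env file.
const REQUIRED_VARS = [
  "AUTH_JWK_SET_URI",
  "PAYHERE_MERCHANT_SECRET",
  "SMTP_PASSWORD",
];

// Usage (not executed here), e.g. from a Node-based deploy script:
//   const missing = unsetOrEmpty(REQUIRED_VARS, process.env);
//   if (missing.length > 0) throw new Error(`Unset or empty: ${missing.join(", ")}`);
```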
fail2ban
Rapid SSH connections to the API server trigger a ban. Recovery: hop via the DB VPS and unban the API VPS IP, or wait out the ban window.
Sitemap XML escaping
Next.js does not auto-escape & in sitemap URLs. Always emit &amp; manually in sitemap.ts.
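Since the escaping has to be done by hand, a small helper keeps it in one place. This is a minimal sketch: escapeXml is a hypothetical name and the URL in the example is made up. It escapes all five XML special characters, which is a superset of the ampersand case described here.

```typescript
// Escapes the five characters that are special in XML text and attributes.
// The ampersand must be replaced first so later entities are not re-escaped.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical listing URL with two query parameters.
const escapedUrl = escapeXml("https://cplk.org/search?city=colombo&type=rent");
```

Apply the helper exactly once per URL: running it on an already-escaped string turns &amp; into &amp;amp;.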
React hydration warning #418
Hidden spans used for email obfuscation cause hydration mismatch. Use plain text instead of dynamic hidden segments.
Runbooks
RB-01 · API 5xx spike
- Check journalctl -u cplk-api -f for stack traces; group by exception type.
- Hit /api/actuator/health and check the DB / mail / disk components.
- Was there a deploy in the last hour? If yes, roll back via symlink swap (Deployment → Rollback).
- If DB is slow: check pg_stat_activity for long-running queries; verify Hikari pool isn't exhausted (HIKARI_MAX_POOL_SIZE).
- If Keycloak is unreachable: confirm the JWKS endpoint is reachable from the API VPS; verify AUTH_JWK_SET_URI.
RB-02 · Suspected cross-tenant access
- Identify the offending request id from the audit log.
- Inspect the call stack: does it pass through a method tagged @BypassTenantFilter? If yes, was it the right role?
- Suspend the suspect user in Keycloak; revoke their sessions.
- Diff old_values / new_values in audit_logs to scope the breach.
- Write a regression test in TenantIsolationIntegrationTest before patching.
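For the diff step, a helper that lists which fields changed between old_values and new_values makes it quicker to scan many audit rows. This sketch assumes both columns hold flat JSON objects; changedFields and the field names in the usage example are hypothetical.

```typescript
type Json = Record<string, unknown>;

// Returns the sorted names of fields that were added, removed, or changed.
// Nested values are compared by their JSON text, so key order inside nested
// objects matters; that is acceptable for a quick triage pass.
function changedFields(oldValues: Json, newValues: Json): string[] {
  const allKeys = Object.keys(oldValues).concat(Object.keys(newValues));
  const unique = allKeys.filter((key, i) => allKeys.indexOf(key) === i);
  return unique
    .filter((key) => JSON.stringify(oldValues[key]) !== JSON.stringify(newValues[key]))
    .sort();
}

// Hypothetical audit row: the agency changed and a status field appeared.
const touched = changedFields(
  { agency_id: 1, price: 100, title: "A" },
  { agency_id: 2, price: 100, title: "A", status: "SOLD" },
);
```

Any row where a tenant-scoping field such as the agency identifier appears in the result deserves a closer look first.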
RB-03 · PayHere webhook outage
- Inspect Nginx access logs for POST /payments/notify: are calls arriving?
- Verify PayHere config: PAYHERE_BASE_URL, _MERCHANT_ID, _NOTIFY_URL.
- If calls arrive but fail signature: rotate PAYHERE_MERCHANT_SECRET with PayHere support; redeploy.
- Queue manual webhook replay via PayHere dashboard for affected orders.
- Confirm idempotency: querying by order_id should show a single SUCCESS row even after multiple deliveries.
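The idempotency property in the last step can be illustrated with an in-memory model. This is not the codebase's implementation, which is not shown in this document: PaymentLedger and its methods are hypothetical, and a real system would more likely enforce the same rule with a unique constraint on order_id in the database.

```typescript
type PaymentStatus = "PENDING" | "SUCCESS";

interface PaymentRow {
  orderId: string;
  status: PaymentStatus;
  deliveries: number; // how many times the webhook was received
}

class PaymentLedger {
  private rows: Record<string, PaymentRow> = {};

  // Returns true only for the delivery that actually changed state.
  applySuccess(orderId: string): boolean {
    const existing = this.rows[orderId];
    if (existing && existing.status === "SUCCESS") {
      existing.deliveries += 1; // duplicate delivery: count it, change nothing
      return false;
    }
    const previous = existing ? existing.deliveries : 0;
    this.rows[orderId] = { orderId, status: "SUCCESS", deliveries: previous + 1 };
    return true;
  }

  // Mirrors "query by order_id": at most one SUCCESS row per order.
  successRows(orderId: string): PaymentRow[] {
    const row = this.rows[orderId];
    return row && row.status === "SUCCESS" ? [row] : [];
  }
}

// Three deliveries of the same notification, as a manual replay would cause.
const ledger = new PaymentLedger();
const effects = ["ORD-1001", "ORD-1001", "ORD-1001"].map((id) => ledger.applySuccess(id));
```

Only the first delivery reports a state change, so side effects such as receipts or audit entries can be tied to that return value.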
RB-04 · Recurring charges not firing
- Confirm scheduler ran: SELECT * FROM shedlock WHERE name = 'rechargeSubscriptions'.
- If locked_until is in the future from a crashed previous run, clear it manually.
- Check PayHereOAuthService for repeated 401s in logs: the bearer cache may be stuck.
- Look at FinancialAuditLog(event_type='PAYMENT_FAILED') for the agency to identify the failure mode (card declined, token revoked, etc.).
RB-05 · R2 upload failures
- Verify CF_R2_ACCESS_KEY not rotated; bucket exists; CORS allows the web origin.
- Fall back to LocalStorageService by setting app.storage.r2.enabled=false temporarily.
- Re-upload failed images from property_images rows whose base_key points to missing objects.
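Finding the rows to re-upload comes down to comparing database keys against a bucket listing. The sketch below assumes base_key matches an object key exactly; if base_key is really a prefix shared by several image variants, the comparison would need to change. rowsMissingObjects, the row shape beyond base_key, and the sample keys are hypothetical.

```typescript
interface ImageRow {
  id: number;
  base_key: string;
}

// Returns the rows whose base_key has no matching object in the bucket listing.
function rowsMissingObjects(rows: ImageRow[], existingKeys: string[]): ImageRow[] {
  const present: Record<string, true> = {};
  for (const key of existingKeys) present[key] = true;
  return rows.filter((row) => present[row.base_key] !== true);
}

// Sample data: the second image never reached the bucket.
const toReupload = rowsMissingObjects(
  [
    { id: 1, base_key: "properties/1/front" },
    { id: 2, base_key: "properties/1/kitchen" },
  ],
  ["properties/1/front"],
);
```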
RB-06 · Email not delivered
- EMAIL_ENABLED true? NoOpEmailService silently no-ops if false.
- SMTP creds valid? Test with swaks --to me --server $SMTP_HOST:$SMTP_PORT --auth LOGIN --auth-user $SMTP_USERNAME --auth-password $SMTP_PASSWORD.
- Check Notification DLQ: DlqNotificationHandler records the original message + cause.
RB-07 · Flyway migration fails on deploy
- Do not edit the failed migration on disk.
- Identify the cause — locked rows? wrong assumed state?
- If checksum mismatch (someone edited an applied file): see Known gotchas.
- If actual schema error: write a fix-forward migration with a higher version; redeploy.
RB-08 · Database backup / restore
make db-backup env=prod # pg_dump → encrypted artefact
make db-restore env=prod FILE=cplk-prod-2026-05-13.dump
Backups are not encrypted at rest on the DB VPS by default — copy them off the box and keep them in R2 / object storage with restricted access.
Disaster recovery
Order of restoration after total VPS loss:
- Provision new DB VPS, restore from latest pg_dump.
- Provision new API VPS, install Java 21, Nginx, systemd unit.
- Place /etc/cplk/api.env from secrets store; make api-deploy env=prod.
- Re-point DNS for api.cplk.org to the new API VPS.
- Re-deploy frontend (Cloudflare Pages is independent, so it should still serve).
- Verify Keycloak: auth.cplk.org may be its own host; if not, restore that too.
- Smoke test: log in, list properties, submit inquiry, run a sandbox PayHere checkout, verify webhook reaches API.
pg_dump via cron on the DB VPS, retained 30 days locally and
90 days off-box. Object storage (R2) is treated as immutable — image churn is
low, so per-tenant export is the recovery path for accidental deletes.
Operational checklist (weekly)
- Review FinancialAuditLog totals vs PayHere settlement report.
- Sample 5 audit-log entries per agency to spot unusual patterns.
- Confirm scheduled jobs ran: SELECT name, locked_at FROM shedlock.
- Confirm Nginx error log clean (no 502/504 spikes).
- Disk usage on both VPSes < 70% (logs + heap dumps grow).
- Rotate SMTP_PASSWORD and PAYHERE_MERCHANT_SECRET on the documented cadence.