Six layers. One blast radius per failure.
The six layers, with engineering notes
Gateway
Rate-limited ingress. Idempotency keys. Request validation. Nothing reaches the orchestrator raw.
Fastify + custom rate-limit keyed on CF-Connecting-IP. Per-route overrides tighten brute-forceable endpoints (auth, OTP submit, booking create). Stripe webhooks bypass rate-limit but require signature verification. Idempotency keys are stored with a 24h TTL — replaying a key returns the original response, not a new booking.
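A sketch of that replay path as Fastify hooks, assuming Redis for the key store; the `Idempotency-Key` header and `idem:` key prefix are illustrative, and a real deployment would also want to scope keys per user and route:

```ts
import Fastify from "fastify";
import Redis from "ioredis";

const app = Fastify();
const redis = new Redis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379");
const TTL_SECONDS = 24 * 60 * 60; // 24h, matching the store described above

// Replay path: a previously seen key short-circuits the request and
// returns the original response instead of creating a new booking.
app.addHook("preHandler", async (req, reply) => {
  const key = req.headers["idempotency-key"];
  if (typeof key !== "string") return;
  const cached = await redis.get(`idem:${key}`);
  if (cached) {
    const { status, body } = JSON.parse(cached);
    return reply.code(status).type("application/json").send(body);
  }
});

// Store path: remember the first response for 24h. NX means a concurrent
// duplicate can never overwrite the original.
app.addHook("onSend", async (req, reply, payload) => {
  const key = req.headers["idempotency-key"];
  if (typeof key === "string" && reply.statusCode < 500) {
    await redis.set(
      `idem:${key}`,
      JSON.stringify({ status: reply.statusCode, body: payload }),
      "EX", TTL_SECONDS, "NX",
    );
  }
  return payload;
});
```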
Orchestrator
BullMQ + state machine. Per-booking locks. Retry policies per error class. Full audit trail per transition.
20-state machine from INIT through CONFIRMED, with dedicated PAYMENT_REQUIRED, OPERATOR_ACTION_REQUIRED, and AWAITING_USER_SIGNUP branches. Every transition writes a BookingEvent row. Per-booking Redis locks prevent double-processing across workers. Retry policy splits by error class: transient network (backoff + retry), portal rate-limit (long cooldown), hard DOM break (escalate).
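A sketch of one guarded transition, with three of the twenty states shown for illustration. Model and column names (`bookingRequest.state`, `beforeState`, `afterState`) are assumptions based on the models named later in this post:

```ts
import { randomUUID } from "node:crypto";
import Redis from "ioredis";
import { PrismaClient } from "@prisma/client";

const redis = new Redis(process.env.REDIS_URL ?? "redis://127.0.0.1:6379");
const prisma = new PrismaClient();

// Three of the twenty states, for illustration.
const ALLOWED: Record<string, string[]> = {
  INIT: ["SCAN"],
  SCAN: ["PAYMENT_REQUIRED", "OPERATOR_ACTION_REQUIRED"],
  PAYMENT_REQUIRED: ["CONFIRMED"],
};

export async function transition(bookingId: string, to: string, reason: string) {
  // Per-booking lock: only one worker may move a booking at a time.
  const token = randomUUID();
  const locked = await redis.set(`lock:booking:${bookingId}`, token, "EX", 30, "NX");
  if (!locked) throw new Error(`booking ${bookingId} is held by another worker`);
  try {
    const booking = await prisma.bookingRequest.findUniqueOrThrow({
      where: { id: bookingId },
    });
    if (!ALLOWED[booking.state]?.includes(to)) {
      throw new Error(`illegal transition ${booking.state} -> ${to}`);
    }
    // State change and audit row commit together, so the BookingEvent
    // trail can never drift from the booking's actual state.
    await prisma.$transaction([
      prisma.bookingRequest.update({ where: { id: bookingId }, data: { state: to } }),
      prisma.bookingEvent.create({
        data: { bookingId, beforeState: booking.state, afterState: to, reason },
      }),
    ]);
  } finally {
    // Compare-and-delete so we never release a lock another worker re-acquired.
    await redis.eval(
      'if redis.call("get", KEYS[1]) == ARGV[1] then return redis.call("del", KEYS[1]) end',
      1, `lock:booking:${bookingId}`, token,
    );
  }
}
```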
Adapter layer
One adapter per portal. Versioned. Isolated. Rolled back automatically when the portal DOM shifts.
Each portal lives in apps/worker/src/adapters/<portal>/. Selector maps, timers, and field discovery are versioned per-adapter so we can ship v2 behind a flag, shadow-run against v1, and cut over once confidence passes threshold. When the DOM breaks in prod, the worker escalates to the operator console rather than guessing.
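Roughly how the shadow-run works; the adapter interface, threshold, and `promoteV2` hook are hypothetical, but the shape (v1 authoritative, v2 observed, cutover on agreement) matches the flow above:

```ts
// Hypothetical adapter interface; the real per-portal modules cover the
// full booking flow, not just availability scans.
interface PortalAdapter {
  version: string;
  scanAvailability(center: string, visaType: string): Promise<number>; // slot count
}

const CONFIDENCE_THRESHOLD = 0.99; // illustrative cutover bar
let runs = 0;
let agreements = 0;

// v1 stays authoritative; v2 runs in its shadow and only its agreement
// rate is recorded. Past the threshold, flipping the flag cuts over.
export async function scanWithShadow(
  v1: PortalAdapter,
  v2: PortalAdapter,
  center: string,
  visaType: string,
  promoteV2: () => Promise<void>, // e.g. flip the feature flag
): Promise<number> {
  const primary = await v1.scanAvailability(center, visaType);
  runs += 1;
  try {
    const shadow = await v2.scanAvailability(center, visaType);
    if (shadow === primary) agreements += 1;
  } catch {
    // A throwing v2 simply fails to agree; it can never hurt the live scan.
  }
  if (runs >= 100 && agreements / runs >= CONFIDENCE_THRESHOLD) {
    await promoteV2();
  }
  return primary;
}
```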
Browser fleet
Playwright workers on rotating residential IPs. Fingerprint hygiene. Session warmup. CAPTCHA offload to solver pool.
Residential IPs are table stakes: hosting-ASN egress gets 403201 from the VFS edge. Chrome is launched with a cloned User Data dir per worker to dodge the default-profile security guard. CapSolver handles Turnstile; the forged-MessageEvent probe (see engineering notes) bypasses the FormControl blocker that stalled us for six weeks.
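A minimal sketch of the worker launch, assuming a warmed-up template profile and a `RESIDENTIAL_PROXY_URL` env var; the Playwright calls are real, the setup around them is illustrative:

```ts
import { chromium } from "playwright";
import { cp, mkdtemp } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

export async function launchWorkerBrowser(templateProfile: string) {
  // Fresh copy of a warmed-up profile per worker: Chrome never opens the
  // default profile, so the default-profile security guard never fires.
  const profileDir = await mkdtemp(join(tmpdir(), "worker-profile-"));
  await cp(templateProfile, profileDir, { recursive: true });

  return chromium.launchPersistentContext(profileDir, {
    channel: "chrome", // real Chrome build, not bundled Chromium
    headless: false,   // headless is an easy fingerprint tell
    proxy: { server: process.env.RESIDENTIAL_PROXY_URL! }, // rotating residential egress
  });
}
```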
Intelligence
Observation pipeline feeds slot predictors. Every run makes the next run smarter.
Each successful availability scan writes a SlotObservation row (portal, center, visa type, timestamp, slot count). The predictor learns release-window patterns and wakes watchers at the right second. Today it's a heuristic; v1 lands in Q3 2026 as a proper model.
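The heuristic amounts to counting where slots historically appear. A sketch, assuming SlotObservation rows as described and an illustrative minute-of-week bucketing:

```ts
interface SlotObservation {
  portal: string;
  center: string;
  visaType: string;
  timestamp: Date;
  slotCount: number;
}

// Bucket every scan that actually saw slots into minute-of-week slots,
// then wake watchers just before the historically hottest minutes.
export function releaseWindows(obs: SlotObservation[], topN = 3): number[] {
  const buckets = new Map<number, number>();
  for (const o of obs) {
    if (o.slotCount === 0) continue; // only scans that saw availability count
    const t = o.timestamp;
    const minuteOfWeek =
      t.getUTCDay() * 1440 + t.getUTCHours() * 60 + t.getUTCMinutes();
    buckets.set(minuteOfWeek, (buckets.get(minuteOfWeek) ?? 0) + 1);
  }
  return [...buckets.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN)
    .map(([minute]) => minute); // schedule watchers a few seconds before these
}
```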
Operator console
Live queue of escalations. 30-second median response. Full session replay, not just logs.
Socket.io streams the queue live; operators see DOM snapshots + step history + stored credentials (identity-vault decrypted on open, re-encrypted on close). Every operator action is audit-logged with before/after state so we can diff what the human did and feed it back into the adapter.
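A sketch of the queue stream, with event names and payload shape assumed; auth and persistence are stubbed out:

```ts
import { Server } from "socket.io";

const io = new Server(3001);

interface Escalation {
  bookingId: string;
  domSnapshot: string;   // serialized HTML at the point of failure
  stepHistory: string[]; // adapter steps leading up to the escalation
}

// Workers call this when a booking lands in OPERATOR_ACTION_REQUIRED.
export function pushEscalation(e: Escalation) {
  io.to("operators").emit("escalation:new", e);
}

io.on("connection", (socket) => {
  socket.join("operators"); // auth omitted in this sketch
  socket.on(
    "escalation:resolved",
    async (res: { bookingId: string; before: string; after: string; reason: string }) => {
      // Before/after state is logged so the human fix can be diffed and
      // folded back into the adapter.
      await auditOperatorAction(res.bookingId, { actor: socket.id, ...res });
    },
  );
});

// Stand-in for the BookingEvent insert described in "Source of truth".
async function auditOperatorAction(bookingId: string, entry: object) {
  console.log("audit", bookingId, entry);
}
```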
What's actually in production
Ingress
- Cloudflare (DNS + proxy)
- Fastify (REST API)
- Custom rate-limit middleware
- Stripe webhook receiver (signature-verified)
Orchestration
- BullMQ (Redis-backed)
- Prisma + Postgres (audit log)
- 20-state booking FSM
- Per-booking Redis locks
- Idempotency key store (24h TTL)
Workers
- Node.js workers (Railway + residential)
- Playwright (Chrome)
- Cloned User Data dir per worker
- CapSolver for CAPTCHA
- Forged MessageEvent for Turnstile
Observability
- Pino JSON logs
- Global error handler with stack traces
- Public /status endpoint
- 24h metrics aggregation (queue depth, active jobs, success rate)
Identity + payments
- Per-user encryption of portal credentials
- JWT auth (Fastify + bcrypt)
- Stripe Checkout + subscriptions
- In-product plan upgrade with proration
Frontend
- Next.js App Router (Vercel)
- Socket.io realtime (queue + OTP push)
- Lucide icons
- Dashboard, operator console, admin, billing
What breaks — and what happens next
Every distributed system fails. The difference is whether failure is silent or observable, and whether one break takes down the rest. This is how the orchestrator responds to each class of error.
Portal DOM shift
Signal: Selector fails. Adapter logs unknown-element.
Booking pauses in OPERATOR_ACTION_REQUIRED. Operator console surfaces DOM snapshot + last known selector. Operator fixes, adapter emits patch, next booking unblocks.
Transient network / 5xx
Signal: Fetch timeout or 503 from portal.
Exponential backoff within the booking lock. After N retries, reclassify as PORTAL_DEGRADED and surface on /status.
Portal rate-limit (429)
Signal: Retry-After header or detected block pattern.
Long cooldown at the adapter level. All bookings in that portal pause; resume automatically once the window clears.
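Both recoverable classes in one sketch. BullMQ's delayed jobs are real; the helper names, timings, and queue wiring are illustrative, and `transition` is the guarded helper sketched under the orchestrator notes:

```ts
import { Queue } from "bullmq";

// Connection details are illustrative.
const bookings = new Queue("bookings", { connection: { host: "127.0.0.1", port: 6379 } });

const MAX_TRANSIENT_RETRIES = 5;

// The guarded FSM helper sketched in the orchestrator notes.
declare function transition(bookingId: string, to: string, reason: string): Promise<unknown>;

export async function retryAfterFailure(
  job: { bookingId: string; attempt: number },
  statusCode: number,
  retryAfterSec?: number,
) {
  if (statusCode === 429) {
    // Rate-limited: honor Retry-After when present, otherwise a long cooldown.
    // In production the cooldown gates the whole portal, not just this booking.
    const delayMs = (retryAfterSec ?? 15 * 60) * 1_000;
    return bookings.add("resume", job, { delay: delayMs });
  }
  if (job.attempt < MAX_TRANSIENT_RETRIES) {
    // Transient timeout/5xx: exponential backoff with full jitter.
    const delayMs = Math.random() * 2 ** job.attempt * 1_000;
    return bookings.add("resume", { ...job, attempt: job.attempt + 1 }, { delay: delayMs });
  }
  // Retries exhausted: stop hammering and make the failure visible.
  return transition(job.bookingId, "PORTAL_DEGRADED", `exhausted ${job.attempt} retries`);
}
```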
Residential IP burn
Signal: 403201 from VFS edge, or Cloudflare challenge that won't resolve.
Worker marks its IP as burned, rotates to a fresh one. If the whole pool is burned, escalate to the operator queue so we can top up before the customer notices.
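A sketch of the burn bookkeeping; the pool source and error type are hypothetical:

```ts
// The pool source is an assumption; the behavior (burn, rotate, escalate
// when empty) follows the text above.
const pool = new Set((process.env.PROXY_POOL ?? "").split(",").filter(Boolean));
const burned = new Set<string>();

export class ProxyPoolExhausted extends Error {
  constructor(public burnedCount: number) {
    super(`all ${burnedCount} residential IPs are burned`);
  }
}

export function rotateIp(current: string): string {
  burned.add(current); // a 403201 or an unresolvable challenge burns the IP
  const fresh = [...pool].find((ip) => !burned.has(ip));
  if (!fresh) {
    // No longer a retry problem: the worker's error handler routes this
    // into the operator queue so the pool is topped up before customers notice.
    throw new ProxyPoolExhausted(burned.size);
  }
  return fresh;
}
```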
OTP timeout
Signal: User doesn't submit the code in time, portal voids the hold.
Booking returns to SCAN state. Slot is released back into the pool. We re-acquire on the next cycle; user gets a webhook letting them know.
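The same flow as a sketch, with all three helpers as hypothetical stand-ins for the real queue and webhook plumbing:

```ts
// Hypothetical stand-ins for the real slot-pool, FSM, and webhook plumbing.
declare function releaseSlotHold(slotId: string): Promise<void>;
declare function transition(bookingId: string, to: string, reason: string): Promise<unknown>;
declare function deliverWebhook(bookingId: string, payload: object): Promise<void>;

export async function onOtpTimeout(bookingId: string, slotId: string) {
  await releaseSlotHold(slotId); // the hold is void anyway; put the slot back in the pool
  await transition(bookingId, "SCAN", "OTP window expired; portal voided the hold");
  // Tell the user what happened and what happens next.
  await deliverWebhook(bookingId, {
    event: "booking.slot_lost",
    detail: "re-scanning; we will re-acquire on the next cycle",
  });
}
```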
Source of truth
Everything lives in Postgres via Prisma — User, BookingRequest, BookingEvent, PortalAccount, VerificationToken, Subscription, Payment, ApiKey, Webhook, AccessRequest, SlotObservation. Redis is strictly for BullMQ queues + locks + rate-limit buckets; no durable state. Evidence blobs live in object storage keyed by booking ID with signed-URL access.
BookingEvent is the audit log — every state transition is a row with actor, before state, after state, reason, and a metadata JSON blob. Replay any booking from its events.
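Replay falls out of that shape almost for free. A sketch, with column names (`beforeState`, `afterState`, `createdAt`) assumed from the fields listed above:

```ts
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

export async function replayBooking(bookingId: string): Promise<string> {
  const events = await prisma.bookingEvent.findMany({
    where: { bookingId },
    orderBy: { createdAt: "asc" },
  });
  let state = "INIT";
  for (const e of events) {
    // Each event's `before` must match the replayed state; a mismatch
    // means the audit trail has a gap and the booking needs investigation.
    if (e.beforeState !== state) {
      throw new Error(`audit gap at event ${e.id}: expected ${state}, saw ${e.beforeState}`);
    }
    state = e.afterState;
  }
  return state; // current state, reconstructed purely from the audit log
}
```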