For developers · 7 min read

State machines vs cron jobs for booking systems

Why the cron+flag approach collapses for long-running booking workflows, and how a finite state machine with durable retries fixes the failure modes that actually bite.

The default

Cron + flag: where it breaks

The natural first implementation: a cron job fires every minute, reads a list of pending bookings, opens a browser, tries to book, and writes status: "booked" or status: "failed" to a database.

Three failure modes show up quickly:

  1. No mid-flight visibility. A booking that takes 4 minutes (login, OTP, slot scan, hold, confirm) is just a row marked "in progress" for 4 minutes. If the worker dies mid-OTP, the row is stuck and you have no idea where.
  2. No deterministic retry. "Failed" is not actionable. Was it a captcha? A timeout? Did the portal rate-limit you? Your retry logic degrades into "try again and hope," which either hammers the portal or gives up too early.
  3. Human-in-the-loop is impossible. When the portal asks for a code that is sent to the applicant, the run has nowhere to pause. You either crash the run or fall back to a manual checklist, and both kill the automation story.

The fix

Booking as a finite state machine

Opaige models every booking as a row whose state is a value of the BookingState enum, which has 28 possible values: CREATED, AUTHENTICATING, AUTH_OTP_REQUIRED, HOLDING_SLOT, CONFIRM_OTP_REQUIRED, BOOKED, and so on. Every transition is driven by a typed event (auth.otp_required, slot.found, portal.error, operator.resolved).
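
As a rough illustration of what "typed" can mean here, the sketch below declares a subset of the states and the four events named above. It is not Opaige's actual code: only six of the 28 states are listed, a string-literal union stands in for the enum, and the payload fields (channel, slotId, code, operatorId) and both helper functions are assumptions made for the example.

```typescript
// Illustrative subset: the article names these six states out of 28.
const BOOKING_STATES = [
  "CREATED",
  "AUTHENTICATING",
  "AUTH_OTP_REQUIRED",
  "HOLDING_SLOT",
  "CONFIRM_OTP_REQUIRED",
  "BOOKED",
] as const;

type BookingState = (typeof BOOKING_STATES)[number];

// Discriminated union: the `type` field selects the payload shape.
// The event names come from the article; the payload fields are assumed.
type BookingEvent =
  | { type: "auth.otp_required"; channel: "sms" | "email" }
  | { type: "slot.found"; slotId: string }
  | { type: "portal.error"; code: string }
  | { type: "operator.resolved"; operatorId: string };

// Hypothetical helper. The `never` assignment makes the compiler reject
// this switch if a new event type is added without a matching case.
function describeEvent(event: BookingEvent): string {
  switch (event.type) {
    case "auth.otp_required":
      return `OTP requested via ${event.channel}`;
    case "slot.found":
      return `slot ${event.slotId} found`;
    case "portal.error":
      return `portal error ${event.code}`;
    case "operator.resolved":
      return `resolved by ${event.operatorId}`;
    default: {
      const unreachable: never = event;
      throw new Error(`unhandled event: ${JSON.stringify(unreachable)}`);
    }
  }
}

// Hypothetical runtime guard for values read back from the database.
function isBookingState(value: string): value is BookingState {
  return (BOOKING_STATES as readonly string[]).includes(value);
}
```

The exhaustiveness check is the part that matters: adding an event without handling it becomes a compile error instead of a silent fall-through at runtime.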

Three properties fall out naturally:

  • Every pause is a state. AUTH_OTP_REQUIRED isn't a log line — it's a first-class resting state. Workers can die and restart; the booking resumes from exactly where it stopped because the DB is the source of truth, not worker memory.
  • Every retry class is distinct. RETRY_SCHEDULED with a backoff delay for transient portal errors. OPERATOR_ESCALATED for captcha or unknown error codes. CREDENTIALS_REQUIRED when the portal rejects the login. Each has its own downstream SLA — no more "hope and retry."
  • Invalid transitions throw. Our orchestrator has a TRANSITIONS table mapping (state, event) → nextState. If code tries to move a booking from BOOKED back to QUEUED, the function throws before it touches the database.
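
A transition table of this kind can be sketched as a nested lookup from (state, event) to the next state. This is a minimal sketch under stated assumptions, not the real TRANSITIONS table: the specific edges are invented for illustration, and the events worker.started, auth.otp_submitted, captcha.detected and booking.confirmed are hypothetical names that do not appear in the article.

```typescript
type State =
  | "QUEUED"
  | "AUTHENTICATING"
  | "AUTH_OTP_REQUIRED"
  | "HOLDING_SLOT"
  | "RETRY_SCHEDULED"
  | "OPERATOR_ESCALATED"
  | "BOOKED";

type EventType =
  | "worker.started" // hypothetical
  | "auth.otp_required"
  | "auth.otp_submitted" // hypothetical
  | "slot.found"
  | "portal.error"
  | "captcha.detected" // hypothetical
  | "operator.resolved"
  | "booking.confirmed"; // hypothetical

// Every state must appear as a key; events missing from a state's entry
// are invalid for that state. The edges below are assumed, not Opaige's.
const TRANSITIONS: Record<State, Partial<Record<EventType, State>>> = {
  QUEUED: { "worker.started": "AUTHENTICATING" },
  AUTHENTICATING: {
    "auth.otp_required": "AUTH_OTP_REQUIRED", // pause as a resting state
    "slot.found": "HOLDING_SLOT",
    "portal.error": "RETRY_SCHEDULED", // transient: back off and retry
    "captcha.detected": "OPERATOR_ESCALATED", // needs a human
  },
  AUTH_OTP_REQUIRED: { "auth.otp_submitted": "AUTHENTICATING" },
  HOLDING_SLOT: {
    "booking.confirmed": "BOOKED",
    "portal.error": "RETRY_SCHEDULED",
  },
  RETRY_SCHEDULED: { "worker.started": "AUTHENTICATING" },
  OPERATOR_ESCALATED: { "operator.resolved": "QUEUED" },
  BOOKED: {}, // terminal: nothing leaves this state
};

// Pure function with no I/O, so an invalid transition throws before
// any database write can happen.
function nextState(from: State, event: EventType): State {
  const to = TRANSITIONS[from][event];
  if (to === undefined) {
    throw new Error(`Invalid transition: ${from} on ${event}`);
  }
  return to;
}
```

Because nextState does no I/O, the caller can resolve the target state first and only then open a transaction. An attempt such as nextState("BOOKED", "worker.started") fails without touching the booking row.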

Infrastructure

What actually runs it

The state machine is a TypeScript module in packages/shared; the execution lives in BullMQ jobs backed by Redis. A job picks up a booking by ID, loads the current state, calls the adapter, and commits the next state in a single transaction. If the worker crashes, BullMQ re-delivers the job; the orchestrator loads the last committed state from the database and resumes from there instead of starting over.
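
The load, call, commit cycle might look roughly like the sketch below. To stay runnable without Redis or Postgres it uses an in-memory store, and everything in it (BookingStore, Adapter, runStep, InMemoryStore, the four-state subset, the compare-and-set commit) is a hypothetical stand-in rather than Opaige's orchestrator. In a real deployment a function like runStep would presumably be called from the processor passed to a BullMQ Worker, with the booking ID read from job.data.

```typescript
type State = "QUEUED" | "AUTHENTICATING" | "HOLDING_SLOT" | "BOOKED";

const TERMINAL: ReadonlySet<State> = new Set<State>(["BOOKED"]);

// Stand-in for the database, which is the source of truth.
interface BookingStore {
  load(bookingId: string): Promise<State>;
  // Compare-and-set: commits only if the row is still in `expected`.
  commit(bookingId: string, expected: State, next: State): Promise<boolean>;
}

// Stand-in for the portal adapter: does one unit of work for the
// current state and reports which state the booking should move to.
type Adapter = (bookingId: string, current: State) => Promise<State>;

// One job = one step. A re-delivered job calls this again and picks up
// from whatever state was last committed, not from the beginning.
async function runStep(
  bookingId: string,
  store: BookingStore,
  adapter: Adapter,
): Promise<State> {
  const current = await store.load(bookingId);
  if (TERMINAL.has(current)) {
    return current; // duplicate delivery after completion: nothing to do
  }
  const next = await adapter(bookingId, current);
  const committed = await store.commit(bookingId, current, next);
  // If another worker moved the booking first, report the stored state.
  return committed ? next : store.load(bookingId);
}

class InMemoryStore implements BookingStore {
  private rows: Map<string, State>;

  constructor(seed: Array<[string, State]>) {
    this.rows = new Map(seed);
  }

  async load(bookingId: string): Promise<State> {
    const state = this.rows.get(bookingId);
    if (state === undefined) {
      throw new Error(`unknown booking ${bookingId}`);
    }
    return state;
  }

  async commit(bookingId: string, expected: State, next: State): Promise<boolean> {
    if (this.rows.get(bookingId) !== expected) {
      return false;
    }
    this.rows.set(bookingId, next);
    return true;
  }
}

// Demo adapter that always advances along the happy path.
const demoAdapter: Adapter = async (_bookingId, current) => {
  if (current === "QUEUED") return "AUTHENTICATING";
  if (current === "AUTHENTICATING") return "HOLDING_SLOT";
  return "BOOKED";
};
```

Calling runStep repeatedly against the same store walks a booking from QUEUED to BOOKED one committed step at a time, and a further call after BOOKED is a no-op. That is the property that makes re-delivery safe in this sketch.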

Emission to the frontend is a thin layer: every state change also calls emitBookingUpdate() on a Socket.io namespace, so dashboards flip live without polling. Webhooks fire on terminal transitions (BOOKED, FAILED, EXPIRED) and a filtered set of lifecycle ones (status_changed) if your app wants to mirror the machine.
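
The fan-out rule described above (every change goes to the realtime channel, webhooks go out on terminal states or to subscribers that opted into lifecycle events) can be sketched as a small pure function. The names below are assumptions for illustration: RealtimeChannel is a minimal interface that a Socket.io namespace is assumed to satisfy through its emit method, and WebhookSubscriber, mirrorLifecycle, publishStateChange and the "booking:update" event name are all hypothetical.

```typescript
type State =
  | "AUTHENTICATING"
  | "AUTH_OTP_REQUIRED"
  | "BOOKED"
  | "FAILED"
  | "EXPIRED";

interface StateChange {
  bookingId: string;
  from: State;
  to: State;
}

// Minimal shape of the realtime side; kept abstract so the sketch
// runs without a Socket.io server.
interface RealtimeChannel {
  emit(event: string, payload: StateChange): unknown;
}

// Hypothetical subscriber record. `mirrorLifecycle` models an app that
// wants status_changed deliveries for non-terminal transitions too.
interface WebhookSubscriber {
  url: string;
  mirrorLifecycle: boolean;
}

// Terminal states as listed in the article.
const TERMINAL_STATES: ReadonlySet<State> = new Set<State>([
  "BOOKED",
  "FAILED",
  "EXPIRED",
]);

// Terminal transitions go to every subscriber; other transitions go
// only to subscribers that asked to mirror the machine.
function webhookRecipients(
  change: StateChange,
  subscribers: WebhookSubscriber[],
): string[] {
  const isTerminal = TERMINAL_STATES.has(change.to);
  return subscribers
    .filter((s) => isTerminal || s.mirrorLifecycle)
    .map((s) => s.url);
}

// Dashboards see every change; the returned URLs are the webhook
// deliveries the caller should enqueue.
function publishStateChange(
  change: StateChange,
  channel: RealtimeChannel,
  subscribers: WebhookSubscriber[],
): string[] {
  channel.emit("booking:update", change);
  return webhookRecipients(change, subscribers);
}
```

Returning the recipient list instead of sending HTTP requests inline keeps the state commit fast and lets webhook delivery be retried separately.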

When cron is fine

Don't over-engineer

State machines are overkill for workflows that have three or fewer meaningful steps, always run to completion inside a single job, and never require human input. A script that emails you when a slot opens and leaves the booking to you is a cron job, and wrapping a state machine around it would be ceremony, not value.

The heuristic: if your workflow has at least one point where it must pause waiting for something outside your control (an OTP, an operator decision, a portal queue), and if crashes during that pause must not lose state, you need the state machine. Booking on hostile portals hits both conditions in every run.

Trade-offs

What you give up

  • Up-front schema work. You have to enumerate states and events before you write code. Adding a state mid-project means a migration and an exhaustive check of the transitions table. The payoff is that dead states show up as unused transitions, not as orphan rows.
  • More moving parts. BullMQ, Redis, a realtime channel, a state enum synced between Postgres and TypeScript. Each is a small cost; the alternative is a lot of brittle glue code reinvented poorly.
  • Debugging is different. You stop asking "why did this job fail" and start asking "why is this booking stuck at AUTH_OTP_REQUIRED for 40 minutes." The answer is usually simpler, but you learn a new vocabulary.