State machines vs cron jobs for booking systems
Cron + flag: where it breaks
The natural first implementation: a cron job fires every minute, reads a list of pending bookings, opens a browser, tries to book, and writes status: "booked" or status: "failed" to a database.
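A minimal sketch of that first pass (db and tryToBook are hypothetical stand-ins for the real database client and browser automation):

```typescript
import cron from "node-cron";

// Hypothetical stand-ins for the persistence layer and the browser automation.
declare const db: {
  findPending(): Promise<Array<{ id: string }>>;
  setStatus(id: string, status: "in_progress" | "booked" | "failed"): Promise<void>;
};
declare function tryToBook(id: string): Promise<void>;

// Fires every minute; each booking is a single opaque attempt.
cron.schedule("* * * * *", async () => {
  for (const { id } of await db.findPending()) {
    await db.setStatus(id, "in_progress"); // the only mid-flight visibility you get
    try {
      await tryToBook(id); // login, OTP, slot scan, hold, confirm: all or nothing
      await db.setStatus(id, "booked");
    } catch {
      await db.setStatus(id, "failed"); // captcha? timeout? rate limit? no idea
    }
  }
});
```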
Three failure modes show up quickly:
- No mid-flight visibility. A booking that takes 4 minutes (login, OTP, slot scan, hold, confirm) is just a row marked "in progress" for 4 minutes. If the worker dies mid-OTP, the row is stuck and you have no idea where.
- No deterministic retry. "Failed" is not actionable. Was it a captcha? A timeout? Did the portal rate-limit you? Your retry logic degrades into "try again and hope," which either hammers the portal or gives up too early.
- Human-in-the-loop is impossible. When the portal asks for a code the applicant receives, you have nowhere to pause. Either you crash the run or you pivot to a manual checklist — both kill the automation story.
Booking as a finite state machine
Opaige models every booking as a row whose state is one of the 28 values in the BookingState enum — CREATED, AUTHENTICATING, AUTH_OTP_REQUIRED, HOLDING_SLOT, CONFIRM_OTP_REQUIRED, BOOKED, and so on. Every transition is driven by a typed event (auth.otp_required, slot.found, portal.error, operator.resolved).
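In TypeScript terms, a trimmed sketch (the real enum has 28 values; the event payload shapes here are assumptions):

```typescript
// A subset of the 28 booking states. Resting states such as
// AUTH_OTP_REQUIRED are first-class values, not log lines.
export enum BookingState {
  CREATED = "CREATED",
  AUTHENTICATING = "AUTHENTICATING",
  AUTH_OTP_REQUIRED = "AUTH_OTP_REQUIRED",
  HOLDING_SLOT = "HOLDING_SLOT",
  CONFIRM_OTP_REQUIRED = "CONFIRM_OTP_REQUIRED",
  RETRY_SCHEDULED = "RETRY_SCHEDULED",
  BOOKED = "BOOKED",
  FAILED = "FAILED",
}

// Typed events that drive every transition.
export type BookingEvent =
  | { type: "auth.otp_required" }
  | { type: "slot.found"; slotId: string }
  | { type: "portal.error"; code: string }
  | { type: "operator.resolved" };
```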
Three properties fall out naturally:
- Every pause is a state. AUTH_OTP_REQUIRED isn't a log line — it's a first-class resting state. Workers can die and restart; the booking resumes from exactly where it stopped because the DB is the source of truth, not worker memory.
- Every retry class is distinct. RETRY_SCHEDULED with a backoff delay for transient portal errors. OPERATOR_ESCALATED for captcha or unknown error codes. CREDENTIALS_REQUIRED when the portal rejects the login. Each has its own downstream SLA — no more "hope and retry."
- Invalid transitions throw. Our orchestrator has a TRANSITIONS table mapping (state, event) → nextState. If code tries to move a booking from BOOKED back to QUEUED, the function throws before it touches the database (sketched below).
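A sketch of that table and its guard, reusing the enum and event types above (the entries shown are an illustrative fragment, not the full table):

```typescript
type EventType = BookingEvent["type"];

// (state, event) → nextState. Anything absent is an invalid transition.
const TRANSITIONS: Partial<Record<BookingState, Partial<Record<EventType, BookingState>>>> = {
  [BookingState.AUTHENTICATING]: {
    "auth.otp_required": BookingState.AUTH_OTP_REQUIRED,
  },
  [BookingState.HOLDING_SLOT]: {
    "portal.error": BookingState.RETRY_SCHEDULED,
  },
};

export function transition(state: BookingState, event: BookingEvent): BookingState {
  const next = TRANSITIONS[state]?.[event.type];
  if (next === undefined) {
    // Refuse before anything touches the database.
    throw new Error(`Invalid transition: ${state} + ${event.type}`);
  }
  return next;
}
```

Any attempt to move a BOOKED row anywhere fails loudly here, because no outgoing entries exist for it.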
What actually runs it
The state machine is a TypeScript module in packages/shared; the execution lives in BullMQ jobs backed by Redis. A job picks up a booking by ID, loads the current state, calls the adapter, and commits the next state in a single transaction. If the worker crashes, BullMQ re-delivers the job; the orchestrator reads whatever state the last run committed, skips the work already done, and resumes from there.
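Roughly what a job processor looks like, as a sketch (the queue name, db helpers, and runAdapterStep are assumptions standing in for the real code):

```typescript
import { Worker } from "bullmq";

// Hypothetical persistence and portal-adapter helpers.
declare const db: {
  loadState(bookingId: string): Promise<BookingState>;
  commitState(bookingId: string, next: BookingState): Promise<void>; // single transaction
};
declare function runAdapterStep(bookingId: string, state: BookingState): Promise<BookingEvent>;

new Worker<{ bookingId: string }>(
  "bookings",
  async (job) => {
    const { bookingId } = job.data;
    // The DB is the source of truth: a re-delivered job reads whatever
    // state the last run committed and continues from there.
    const current = await db.loadState(bookingId);
    const event = await runAdapterStep(bookingId, current);
    const next = transition(current, event); // throws on invalid transitions
    await db.commitState(bookingId, next);
  },
  { connection: { host: "localhost", port: 6379 } },
);
```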
Emission to the frontend is a thin layer: every state change also calls emitBookingUpdate() on a Socket.io namespace, so dashboards flip live without polling. Webhooks fire on terminal transitions (BOOKED, FAILED, EXPIRED) and a filtered set of lifecycle ones (status_changed) if your app wants to mirror the machine.
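The emission layer really is thin; a sketch with assumed namespace, room, and event names:

```typescript
import { Server } from "socket.io";

const io = new Server(3001);
const ns = io.of("/bookings"); // one namespace for all booking traffic

// Called by the orchestrator on every committed state change, so
// dashboards subscribed to a booking's room update without polling.
export function emitBookingUpdate(bookingId: string, state: BookingState): void {
  ns.to(`booking:${bookingId}`).emit("booking:update", { bookingId, state });
}
```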
Don't over-engineer
State machines are overkill for workflows that have ≤3 meaningful steps, always run to completion inside a single job, and never require human input. A cron that emails you when a slot opens and expects you to book manually — that's a cron job, and adding a state machine around it would be ceremony, not value.
The heuristic: if your workflow has at least one point where it must pause waiting for something outside your control (an OTP, an operator decision, a portal queue), and if crashes during that pause must not lose state, you need the state machine. Booking on hostile portals hits both conditions in every run.
What you give up
- Up-front schema work. You have to enumerate states and events before you write code. Adding a state mid-project means a migration and an exhaustive check of the transitions table (the compiler can enforce that check; see the sketch after this list). The payoff is that dead states show up as unused transitions, not as orphan rows.
- More moving parts. BullMQ, Redis, a realtime channel, a state enum synced between Postgres and TypeScript. Each is a small cost; the alternative is a lot of brittle glue code reinvented poorly.
- Debugging is different. You stop asking "why did this job fail" and start asking "why is this booking stuck at AUTH_OTP_REQUIRED for 40 minutes." The answer is usually simpler, but you learn a new vocabulary.
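On that exhaustive check: one way to make it cheap is to lean on the compiler. A sketch, not Opaige's actual code, reusing the enum and EventType alias from above (the individual entries are illustrative):

```typescript
// `satisfies` forces one entry per BookingState, so a state added to the
// enum but never wired into the table fails the build instead of
// surfacing later as an orphan row.
const TRANSITIONS_TOTAL = {
  [BookingState.CREATED]: {}, // entries elided in this sketch
  [BookingState.AUTHENTICATING]: { "auth.otp_required": BookingState.AUTH_OTP_REQUIRED },
  [BookingState.AUTH_OTP_REQUIRED]: { "operator.resolved": BookingState.AUTHENTICATING },
  [BookingState.HOLDING_SLOT]: { "portal.error": BookingState.RETRY_SCHEDULED },
  [BookingState.RETRY_SCHEDULED]: {},
  [BookingState.CONFIRM_OTP_REQUIRED]: {},
  [BookingState.BOOKED]: {}, // terminal: no outgoing transitions
  [BookingState.FAILED]: {}, // terminal
} satisfies Record<BookingState, Partial<Record<EventType, BookingState>>>;
```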