
Standardizing the human-in-the-loop: a protocol for real-time OTP routing

The messy reality of 2FA during automation, the protocol Opaige uses to route codes between portal, worker, and applicant in under 90 seconds, and what a Fortune-500 HITL implementation looks like end-to-end.

Why 'human in the loop' is where most automation dies

< 90 sec: target OTP delivery (Worker → Redis → Socket.io → browser)
~10 min: VFS hold window (TLS Contact ~8 min — latency budget tight)
4 phases: protocol states (quiet wait → reminder → escalation → retry)
95%+: OTPs resolved in phase 1 (applicant self-serves in the first 90 seconds)

Every serious automation platform eventually hits the same problem. Somewhere in the pipeline, a remote system needs a piece of information only a human being has. A bank sends an SMS code. A portal asks for a specific word from the applicant's passport. A customs endpoint wants a confirmation the traveller is still at this number. The moment that happens, your beautiful Kubernetes-scheduled workflow has to pause and wait for a person — and people are slow, unpredictable, and usually asleep.

Most engineering teams handle this the same wrong way. They bolt a notification system on top of their existing automation: "if OTP required, send an SMS to the user, then retry in 5 minutes." This works on the happy path. It fails on every real-world edge case — user doesn't see the SMS in time, user enters it 30 seconds after the retry runs, the portal's hold window expires, the whole run is wasted. The result is a system that books appointments 60% of the time and tells applicants "please try again later" for the other 40%.

A proper human-in-the-loop protocol is not a notification system. It's a bidirectional, latency-sensitive, state-preserving bridge between an automated worker and a live human.
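
To make that bridge concrete, here is a rough sketch of the contract in TypeScript. The REQUESTED status, the AUTH and CONFIRMATION contexts, the run ID, and the portal-derived expiry all appear in the protocol below; the remaining names and fields are illustrative assumptions, not Opaige's actual schema.

```ts
// Sketch of the bridge contract. Only REQUESTED, AUTH, CONFIRMATION and
// RETRY_SCHEDULED come from the protocol described in this article; the other
// status names and field shapes are illustrative.

type OTPContext = 'AUTH' | 'CONFIRMATION';

type OTPStatus =
  | 'REQUESTED'        // worker is suspended, waiting for a code
  | 'SUBMITTED'        // a human (applicant or operator) supplied a code
  | 'CONSUMED'         // worker typed the code and resumed
  | 'RETRY_SCHEDULED'; // hold window expired before a code arrived

interface OTPEvent {
  runId: string;       // correlates the event with the suspended worker run
  context: OTPContext; // which portal step demanded the code
  status: OTPStatus;
  expiresAt: Date;     // derived from the portal's known hold window
  createdAt: Date;
}

// The two directions of the bridge: worker → human (a prompt) and
// human → worker (a code). Both carry the runId for correlation.
interface OTPPrompt { runId: string; context: OTPContext; expiresAt: Date; }
interface OTPAnswer { runId: string; code: string; submittedBy: 'applicant' | 'operator'; }
```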

The four non-negotiable properties

Worker behaviour on OTP
  Bolt-on notification: finishes, schedules retry in 5 min
  Opaige HITL protocol: suspends in place, session stays open

Human prompt delivery
  Bolt-on notification: SMS/email — async, high latency
  Opaige HITL protocol: < 2s via Socket.io + dashboard push

Multi-channel race handling
  Bolt-on notification: none — first channel wins by accident
  Opaige HITL protocol: deterministic, first-write-wins in Redis

Worker crash recovery
  Bolt-on notification: re-runs from start, slot lost
  Opaige HITL protocol: resumes from paused state in DB

The first property is the one most systems get wrong. When the portal demands an OTP, the worker keeps its session open, the page loaded, and its process running. It doesn't "finish" and retry; it pauses in place and waits for a value. Miss this property and you will always lose the slot on OTP-required runs.
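
What "suspends in place" looks like in practice, sketched in TypeScript under a few assumptions: a Playwright-style page object, a hypothetical markRunPaused persistence call (which is what makes crash recovery from the DB possible), and the waitForOTP helper described in the next section. The selectors are illustrative.

```ts
import type { Page } from 'playwright';

// Hypothetical helpers, assumed to live elsewhere in the worker:
declare function markRunPaused(runId: string, info: { reason: string; expiresAt: Date }): Promise<void>;
declare function waitForOTP(runId: string, expiresAt: Date): Promise<string>;

// "Suspend in place": the worker keeps the page, the session, and the held
// slot alive while it waits. It never finishes-and-retries.
async function handleOtpPrompt(page: Page, runId: string, expiresAt: Date): Promise<void> {
  // Record the paused state so a crashed worker can be resumed from the DB
  // rather than re-running the whole slot hunt from scratch.
  await markRunPaused(runId, { reason: 'AUTH_OTP_REQUIRED', expiresAt });

  // Idle here. No navigation, no context close: the portal session lives
  // exactly as long as this process and this page do.
  const code = await waitForOTP(runId, expiresAt);

  // Resume in the same session: type the code, submit, continue the state machine.
  await page.fill('input[name="otp"]', code);  // selector is illustrative
  await page.click('button[type="submit"]');   // selector is illustrative
}
```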

The protocol, end-to-end

Portal returns OTP prompt
Adapter signals AUTH_OTP_REQUIRED or CONFIRM_OTP_REQUIRED. Worker suspends in-session.

1. OTPEvent row written to Postgres
Status: REQUESTED. Context: AUTH or CONFIRMATION. Expiry derived from the portal's known hold window. Run ID recorded for correlation.

2. booking:update socket event → applicant's dashboard (< 1 second)
Every open browser tab for that user renders the OTP input immediately. Deep-link email dispatched as backup for applicants who don't have the app open.

3. operator:otp_pending event → operator console
Backup pathway. If the applicant doesn't self-serve within 90 seconds, the operator console flags it.

4. waitForOTP(runId, expiresAt) — worker idles but stays alive
The worker is dormant but the portal session is live. The slot is held. Nothing times out until the portal's own hold window expires.

Code arrives → receiveOTP() → worker resumes
First submission wins. Subsequent submissions return { accepted: false, reason: 'already_submitted' }. Worker types the code, submits, and continues the state machine forward.
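
A minimal sketch of the two halves of that handshake, assuming ioredis, an otp:<runId> key, and a simple polling loop. The names waitForOTP and receiveOTP and the { accepted: false, reason: 'already_submitted' } response come from the protocol above; the key layout, TTL, and polling interval are illustrative, not the production implementation.

```ts
import Redis from 'ioredis';

const redis = new Redis(); // connection details are illustrative

// Called by the API when a human (applicant dashboard or operator console)
// submits a code. SET ... NX makes the race deterministic: the first write wins.
export async function receiveOTP(runId: string, code: string) {
  const ttlSeconds = 600; // illustrative; roughly the portal's hold window
  const wasFirst = await redis.set(`otp:${runId}`, code, 'EX', ttlSeconds, 'NX');
  return wasFirst === 'OK'
    ? { accepted: true }
    : { accepted: false, reason: 'already_submitted' as const };
}

// Called by the suspended worker. It idles in place (the portal session stays
// open in the same process) and resolves as soon as a code is present, or
// throws once the portal's hold window runs out.
export async function waitForOTP(runId: string, expiresAt: Date): Promise<string> {
  while (Date.now() < expiresAt.getTime()) {
    const code = await redis.get(`otp:${runId}`);
    if (code) return code;
    await new Promise((resolve) => setTimeout(resolve, 1000)); // poll once a second
  }
  throw new Error('OTP_HOLD_WINDOW_EXPIRED');
}
```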

The latency budget that keeps slots alive

Phase 1 — quiet wait (0–90s)
Applicant has just seen the socket event. Their phone is buzzing with the portal's OTP SMS. No operator intervention. ~95% of OTPs resolve here.

Phase 2 — reminder + operator watch (90–240s)
Second socket ping fires. SMS reminder if provisioned. Pending OTP appears on operator 'getting warm' list. Operator can call the applicant on the booking's phone number.

Phase 3 — escalation (240s+)
Operator console flags as 'at risk'. Operator can submit the code on the applicant's behalf if reached, or mark for retry — which releases the slot cleanly and restarts the scan.

Expiry: timeout → RETRY_SCHEDULED (not FAILED)
Worker's waitForOTP promise rejects. State machine transitions to RETRY_SCHEDULED. A fresh slot hunt starts for the same applicant. Slot hunts are fundamentally retry-able — the booking is not lost, just delayed.
Why the retry-not-fail distinction matters
A naive system marks OTP timeout as FAILED and tells the applicant to resubmit. Our protocol marks it RETRY_SCHEDULED and starts a new availability watch automatically. The applicant experiences a delay, not a failure. For high-demand corridors where slots are rare, this difference is the product.
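
Sketched in TypeScript, the expiry path looks roughly like this. The RETRY_SCHEDULED and FAILED states are the ones described above; transition and startAvailabilityWatch are hypothetical stand-ins for the worker's state machine and the slot-scanning service.

```ts
// Hypothetical stand-ins for the state machine and the availability scanner:
declare function transition(runId: string, state: 'RETRY_SCHEDULED' | 'FAILED'): Promise<void>;
declare function startAvailabilityWatch(applicantId: string): Promise<void>;
declare function waitForOTP(runId: string, expiresAt: Date): Promise<string>;

async function awaitCodeOrReschedule(runId: string, applicantId: string, expiresAt: Date) {
  try {
    // Happy path: a human supplied the code before the hold window closed.
    return await waitForOTP(runId, expiresAt);
  } catch {
    // Hold window expired. Treat it as a delay, not a failure: never FAILED here.
    await transition(runId, 'RETRY_SCHEDULED');
    // Put the same applicant straight back into a fresh slot hunt.
    await startAvailabilityWatch(applicantId);
    return null;
  }
}
```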

Why this reads like a protocol, not a feature

"Feature" suggests a checkbox: "we support 2FA." "Protocol" suggests a contract: specific states, specific events, specific timing guarantees, specific failure modes. For any platform integrating Opaige, the protocol framing matters because it's what lets you reason about the behaviour without reading our source code.

For the engineering teams at EORs and relocation platforms: our HITL protocol is idempotent, observable, and explicit about its failure modes. You can wire it into your own state machine, build your own UI around the socket events, and know exactly how timing affects outcomes. That's the difference between integrating infrastructure and integrating a script.
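
As a rough illustration of what that integration can look like from the outside, assuming socket.io-client: the booking:update event name and the first-write-wins response shape are described above, while the gateway URL, token, payload fields, and REST endpoint in this sketch are placeholders rather than a documented API.

```ts
import { io } from 'socket.io-client';

declare function showOtpInput(runId: string, expiresAt: Date): void; // your own UI

// Placeholder endpoint and credential; not a documented API surface.
const GATEWAY = 'https://gateway.example';
const socket = io(GATEWAY, { auth: { token: 'YOUR_API_TOKEN' } });

// Render an OTP input in your own UI the moment a worker suspends.
// The payload shape here is an assumption for illustration.
socket.on('booking:update', (event: { runId: string; type: string; expiresAt: string }) => {
  if (event.type === 'OTP_REQUESTED') {
    showOtpInput(event.runId, new Date(event.expiresAt));
  }
});

// Hand the code back; the first write wins, later writes get 'already_submitted'.
async function submitOtp(runId: string, code: string) {
  const res = await fetch(`${GATEWAY}/runs/${runId}/otp`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ code }),
  });
  return res.json(); // { accepted: true } or { accepted: false, reason: 'already_submitted' }
}
```

The dashboard in this sketch is deliberately replaceable: the events, states, and timing guarantees are the contract, and any UI that honours them gets the same behaviour.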