
Standardizing the human-in-the-loop: a protocol for real-time OTP routing

The messy reality of 2FA during automation, the protocol Opaige uses to route codes between portal, worker, and applicant in under 90 seconds, and what a Fortune-500 HITL implementation looks like end-to-end.

Why 'human in the loop' is where most automation dies

< 90 sec: target OTP delivery (Worker → Redis → Socket.io → browser)
~10 min: VFS hold window (TLS Contact ~8 min — latency budget tight)
4 phases: protocol states (quiet wait → reminder → escalation → retry)
95%+: OTPs resolved in phase 1 (applicant self-serves in the first 90 seconds)

Every serious automation platform eventually hits the same problem. Somewhere in the pipeline, a remote system needs a piece of information only a human being has. A bank sends an SMS code. A portal asks for a specific word from the applicant's passport. A customs endpoint wants a confirmation the traveller is still at this number. The moment that happens, your beautiful Kubernetes-scheduled workflow has to pause and wait for a person — and people are slow, unpredictable, and usually asleep.

Most engineering teams handle this the same wrong way. They bolt a notification system on top of their existing automation: "if OTP required, send an SMS to the user, then retry in 5 minutes." This works on the happy path. It fails on every real-world edge case — user doesn't see the SMS in time, user enters it 30 seconds after the retry runs, the portal's hold window expires, the whole run is wasted. The result is a system that books appointments 60% of the time and tells applicants "please try again later" for the other 40%.

A proper human-in-the-loop protocol is not a notification system. It's a bidirectional, latency-sensitive, state-preserving bridge between an automated worker and a live human.
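
To make that bridge concrete, here is a rough sketch of the contract in TypeScript. The REQUESTED status, the AUTH and CONFIRMATION contexts, the run ID, and the portal-derived expiry all appear in the protocol below; the remaining names and fields are illustrative assumptions, not Opaige's actual schema.

```ts
// Sketch of the bridge contract. Only REQUESTED, AUTH, CONFIRMATION and
// RETRY_SCHEDULED come from the protocol described in this article; the other
// status names and field shapes are illustrative.

type OTPContext = 'AUTH' | 'CONFIRMATION';

type OTPStatus =
  | 'REQUESTED'        // worker is suspended, waiting for a code
  | 'SUBMITTED'        // a human (applicant or operator) supplied a code
  | 'CONSUMED'         // worker typed the code and resumed
  | 'RETRY_SCHEDULED'; // hold window expired before a code arrived

interface OTPEvent {
  runId: string;       // correlates the event with the suspended worker run
  context: OTPContext; // which portal step demanded the code
  status: OTPStatus;
  expiresAt: Date;     // derived from the portal's known hold window
  createdAt: Date;
}

// The two directions of the bridge: worker → human (a prompt) and
// human → worker (a code). Both carry the runId for correlation.
interface OTPPrompt { runId: string; context: OTPContext; expiresAt: Date; }
interface OTPAnswer { runId: string; code: string; submittedBy: 'applicant' | 'operator'; }
```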

The four non-negotiable properties

Worker behaviour on OTP
  Bolt-on notification: finishes, schedules retry in 5 min
  Opaige HITL protocol: suspends in place, session stays open

Human prompt delivery
  Bolt-on notification: SMS/email — async, high latency
  Opaige HITL protocol: < 2s via Socket.io + dashboard push

Multi-channel race handling
  Bolt-on notification: none — first channel wins by accident
  Opaige HITL protocol: deterministic, first-write-wins in Redis

Worker crash recovery
  Bolt-on notification: re-runs from start, slot lost
  Opaige HITL protocol: resumes from paused state in DB

The first property is the one most systems get wrong. When the portal demands an OTP, the worker keeps its session open, the page loaded, and its process running. It doesn't "finish" and retry; it pauses in place and waits for a value. Miss this property and you will always lose the slot on OTP-required runs.
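
What "suspends in place" looks like in practice, sketched in TypeScript under a few assumptions: a Playwright-style page object, a hypothetical markRunPaused persistence call (which is what makes crash recovery from the DB possible), and the waitForOTP helper described in the next section. The selectors are illustrative.

```ts
import type { Page } from 'playwright';

// Hypothetical helpers, assumed to live elsewhere in the worker:
declare function markRunPaused(runId: string, info: { reason: string; expiresAt: Date }): Promise<void>;
declare function waitForOTP(runId: string, expiresAt: Date): Promise<string>;

// "Suspend in place": the worker keeps the page, the session, and the held
// slot alive while it waits. It never finishes-and-retries.
async function handleOtpPrompt(page: Page, runId: string, expiresAt: Date): Promise<void> {
  // Record the paused state so a crashed worker can be resumed from the DB
  // rather than re-running the whole slot hunt from scratch.
  await markRunPaused(runId, { reason: 'AUTH_OTP_REQUIRED', expiresAt });

  // Idle here. No navigation, no context close: the portal session lives
  // exactly as long as this process and this page do.
  const code = await waitForOTP(runId, expiresAt);

  // Resume in the same session: type the code, submit, continue the state machine.
  await page.fill('input[name="otp"]', code);  // selector is illustrative
  await page.click('button[type="submit"]');   // selector is illustrative
}
```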

The protocol, end-to-end

Portal returns OTP prompt
Adapter signals AUTH_OTP_REQUIRED or CONFIRM_OTP_REQUIRED. Worker suspends in-session.

1. OTPEvent row written to Postgres
Status: REQUESTED. Context: AUTH or CONFIRMATION. Expiry derived from the portal's known hold window. Run ID recorded for correlation.

2. booking:update socket event → applicant's dashboard (< 1 second)
Every open browser tab for that user renders the OTP input immediately. Deep-link email dispatched as backup for applicants who don't have the app open.

3. operator:otp_pending event → operator console
Backup pathway. If the applicant doesn't self-serve within 90 seconds, the operator console flags it.

4. waitForOTP(runId, expiresAt) — worker idles but stays alive
The worker is dormant but the portal session is live. The slot is held. Nothing times out until the portal's own hold window expires.

Code arrives → receiveOTP() → worker resumes
First submission wins. Subsequent submissions return { accepted: false, reason: 'already_submitted' }. Worker types the code, submits, and continues the state machine forward.
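
A minimal sketch of the two halves of that handshake, assuming ioredis, an otp:<runId> key, and a simple polling loop. The names waitForOTP and receiveOTP and the { accepted: false, reason: 'already_submitted' } response come from the protocol above; the key layout, TTL, and polling interval are illustrative, not the production implementation.

```ts
import Redis from 'ioredis';

const redis = new Redis(); // connection details are illustrative

// Called by the API when a human (applicant dashboard or operator console)
// submits a code. SET ... NX makes the race deterministic: the first write wins.
export async function receiveOTP(runId: string, code: string) {
  const ttlSeconds = 600; // illustrative; roughly the portal's hold window
  const wasFirst = await redis.set(`otp:${runId}`, code, 'EX', ttlSeconds, 'NX');
  return wasFirst === 'OK'
    ? { accepted: true }
    : { accepted: false, reason: 'already_submitted' as const };
}

// Called by the suspended worker. It idles in place (the portal session stays
// open in the same process) and resolves as soon as a code is present, or
// throws once the portal's hold window runs out.
export async function waitForOTP(runId: string, expiresAt: Date): Promise<string> {
  while (Date.now() < expiresAt.getTime()) {
    const code = await redis.get(`otp:${runId}`);
    if (code) return code;
    await new Promise((resolve) => setTimeout(resolve, 1000)); // poll once a second
  }
  throw new Error('OTP_HOLD_WINDOW_EXPIRED');
}
```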

The latency budget that keeps slots alive

Phase 1 — quiet wait (0–90s)
Applicant has just seen the socket event. Their phone is buzzing with the portal's OTP SMS. No operator intervention. ~95% of OTPs resolve here.

Phase 2 — reminder + operator watch (90–240s)
Second socket ping fires. SMS reminder if provisioned. Pending OTP appears on operator 'getting warm' list. Operator can call the applicant on the booking's phone number.

Phase 3 — escalation (240s+)
Operator console flags as 'at risk'. Operator can submit the code on the applicant's behalf if reached, or mark for retry — which releases the slot cleanly and restarts the scan.

Expiry: timeout → RETRY_SCHEDULED (not FAILED)
Worker's waitForOTP promise rejects. State machine transitions to RETRY_SCHEDULED. A fresh slot hunt starts for the same applicant. Slot hunts are fundamentally retry-able — the booking is not lost, just delayed.
Why the retry-not-fail distinction matters
A naive system marks OTP timeout as FAILED and tells the applicant to resubmit. Our protocol marks it RETRY_SCHEDULED and starts a new availability watch automatically. The applicant experiences a delay, not a failure. For high-demand corridors where slots are rare, this difference is the product.
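
Sketched in TypeScript, the expiry path looks roughly like this. The RETRY_SCHEDULED and FAILED states are the ones described above; transition and startAvailabilityWatch are hypothetical stand-ins for the worker's state machine and the slot-scanning service.

```ts
// Hypothetical stand-ins for the state machine and the availability scanner:
declare function transition(runId: string, state: 'RETRY_SCHEDULED' | 'FAILED'): Promise<void>;
declare function startAvailabilityWatch(applicantId: string): Promise<void>;
declare function waitForOTP(runId: string, expiresAt: Date): Promise<string>;

async function awaitCodeOrReschedule(runId: string, applicantId: string, expiresAt: Date) {
  try {
    // Happy path: a human supplied the code before the hold window closed.
    return await waitForOTP(runId, expiresAt);
  } catch {
    // Hold window expired. Treat it as a delay, not a failure: never FAILED here.
    await transition(runId, 'RETRY_SCHEDULED');
    // Put the same applicant straight back into a fresh slot hunt.
    await startAvailabilityWatch(applicantId);
    return null;
  }
}
```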

Why this reads like a protocol, not a feature

"Feature" suggests a checkbox: "we support 2FA." "Protocol" suggests a contract: specific states, specific events, specific timing guarantees, specific failure modes. For any platform integrating Opaige, the protocol framing matters because it's what lets you reason about the behaviour without reading our source code.

For the engineering teams at EORs and relocation platforms: our HITL protocol is idempotent, observable, and explicit about its failure modes. You can wire it into your own state machine, build your own UI around the socket events, and know exactly how timing affects outcomes. That's the difference between integrating infrastructure and integrating a script.
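
As a rough illustration of what that integration can look like from the outside, assuming socket.io-client: the booking:update event name and the first-write-wins response shape are described above, while the gateway URL, token, payload fields, and REST endpoint in this sketch are placeholders rather than a documented API.

```ts
import { io } from 'socket.io-client';

declare function showOtpInput(runId: string, expiresAt: Date): void; // your own UI

// Placeholder endpoint and credential; not a documented API surface.
const GATEWAY = 'https://gateway.example';
const socket = io(GATEWAY, { auth: { token: 'YOUR_API_TOKEN' } });

// Render an OTP input in your own UI the moment a worker suspends.
// The payload shape here is an assumption for illustration.
socket.on('booking:update', (event: { runId: string; type: string; expiresAt: string }) => {
  if (event.type === 'OTP_REQUESTED') {
    showOtpInput(event.runId, new Date(event.expiresAt));
  }
});

// Hand the code back; the first write wins, later writes get 'already_submitted'.
async function submitOtp(runId: string, code: string) {
  const res = await fetch(`${GATEWAY}/runs/${runId}/otp`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ code }),
  });
  return res.json(); // { accepted: true } or { accepted: false, reason: 'already_submitted' }
}
```

The dashboard in this sketch is deliberately replaceable: the events, states, and timing guarantees are the contract, and any UI that honours them gets the same behaviour.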