How to test worker-execution recording & retry locally
The orchestrator records every worker dispatch as a durable Redis record (no TTL) via the
shared @alakai/execution-store. The Alakai dashboard never talks to the orchestrator: core
owns the external, authenticated surface and exposes a single action — POST /dashboard/executions/:taskId/retry. A retry reads the captured snapshot from Redis and
re-enqueues it to SQS with the next attempt number; the orchestrator consumer then dispatches
it like any other task (capped by MAX_RETRIES). There is no list/get endpoint — the dashboard
sources its failed-task list from its own task tracking. This guide covers two testing paths:
- Level 1 (fast, 5 min): seed a failed execution, inspect the Redis record directly, and confirm the retryable/MAX_RETRIES budget. No SQS or worker needed.
- Level 2 (full pipeline, 15 min): enqueue a real task to SQS, let the orchestrator dispatch it, simulate the worker reporting failure, then retry (via core or by re-enqueuing to SQS).
Both core and the orchestrator default to port 3000. These examples run core on 3000 and the orchestrator on 3001 to avoid the clash.
Topology — which service handles what
The execution records live in Redis, written by the orchestrator and read by core. The only
HTTP endpoint is the retry on core. Sending a retry to the orchestrator port returns 404 —
that's the wrong service, not a bug.
| What you're doing | Where | How |
|---|---|---|
| Inspect a failed execution record | Redis | redis-cli HGETALL <prefix>:<taskId> · make seed-failed-implementation prints the taskId |
| Seed / simulate worker callback | orchestrator 3001 | make seed-failed-implementation · POST /task-complete |
| Retry a failed execution | core (external) 3000 | POST /dashboard/executions/:taskId/retry · make retry-worker-execution core_url=http://localhost:3000 |
core and the orchestrator must point at the same Redis instance and the same
EXECUTIONS_REDIS_KEY_PREFIX (default alakai-executions), or core reads nothing and every
retry 404s.
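To see why a prefix mismatch makes every retry 404, here is a minimal TypeScript sketch of the key layout described above. The helper name executionKey is hypothetical; only the <prefix>:<taskId> format and the alakai-executions default come from this guide.

```typescript
// Assumption: keys are "<prefix>:<taskId>", as described in this guide.
// executionKey is an illustrative helper, not a function from the repo.
const DEFAULT_PREFIX = "alakai-executions";

function executionKey(taskId: string, prefix: string = DEFAULT_PREFIX): string {
  return `${prefix}:${taskId}`;
}

// The orchestrator writes under one prefix...
const writtenKey = executionKey("impl-e2e", "alakai-executions");
// ...while a core configured with a different prefix looks somewhere else.
const lookedUpKey = executionKey("impl-e2e", "alakai-exec");

console.log(writtenKey);                 // alakai-executions:impl-e2e
console.log(writtenKey === lookedUpKey); // false
```

With different prefixes the two services address different hashes, so core finds no record even though the orchestrator wrote one.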
Prerequisites
Both levels need:
- Redis running: redis-cli ping should return PONG. If you don't have it: docker run -p 6379:6379 redis.
- Dependencies installed: yarn install at the repo root, then yarn workspace @alakai/execution-store build (the shared store compiles to dist, consumed by both services).
Level 2 additionally needs:
- Docker (for LocalStack SQS), plus the AWS CLI and jq.
There is no Postgres in this feature anymore.
Start the orchestrator (both levels)
In local mode (ENV=local) the orchestrator skips ECS — launcher.launch returns a fake ARN and
stores the task in Redis instead of running a real container.
cd orchestrator
ENV=local PORT=3001 AWS_REGION=us-east-1 \
AWS_ACCESS_KEY_ID=test AWS_SECRET_ACCESS_KEY=test \
SQS_QUEUE_URL=http://localhost:4566/000000000000/alakai-queue \
REDIS_URL=redis://localhost:6379 \
EXECUTIONS_REDIS_KEY_PREFIX=alakai-executions \
yarn dev # or: yarn build && yarn start
Check it's up:
curl -sS http://localhost:3001/health # -> {"ok":true}
For Level 1 the SQS_QUEUE_URL can be any valid URL (the consumer will log harmless poll errors). For Level 2 it must point at the LocalStack queue created below.
Start core (only needed to retry via core)
You only need this to exercise the real dashboard retry hop. Level 1 and the SQS-direct retry path
work without core.
core registers the retry route when dashboard auth is configured. Point its Redis at the
same instance and prefix the orchestrator uses:
cd core
# EXECUTIONS_REDIS_KEY_PREFIX and MAX_RETRIES must equal the orchestrator's values.
# Also set the usual core env (Slack signing secret, TASK_DASHBOARD_GOOGLE_CLIENT_IDS, etc.).
# Comments cannot follow a trailing backslash, so they are kept above the command.
ENV=local PORT=3000 \
TASK_TRACKING_REDIS_URL=redis://localhost:6379 \
EXECUTIONS_REDIS_KEY_PREFIX=alakai-executions \
MAX_RETRIES=1 \
yarn dev
On startup, confirm these two log lines — they're how you diagnose a 404 later:
Dashboard retry route enabled
Local auth bypass active on the retry route: 'x-dev-user' header accepted (ENV=local only)
- If you see Dashboard retry route disabled (...), dashboard auth isn't configured → the route doesn't exist → 404.
- The bypass line only appears under ENV=local. Without it, retry needs a real Google ID token.
After changing code, restart core. yarn dev (tsx watch) reloads automatically; a built dist (yarn start) does not.
Level 1 — Fast retry test
All make targets live in core/, so run them from there.
1. Seed a failed implementation execution
This writes a failed / attempt 1 record into both the dashboard task-tracking store and the
execution store. Both stores share the same taskName and actor, so the dashboard row is stable
across retries.
cd core
make seed-failed-implementation repo='my-org/my-repo' task_name='Fix the widget bug'
Optional overrides (all have defaults):
make seed-failed-implementation \
repo='my-org/my-repo' \
task_name='Fix the widget bug' \
actor_provider='github' actor_id='U123' actor_display='Jane Dev' \
execution_id='exec-demo' \
attempt='1'
The command prints two IDs:
- execution-id: the dashboard row key (used by the SSE stream and the snapshot read service)
- queue-task-id: the execution-store key, passed to the retry button
2. Inspect the execution record and its retry budget
The record is a Redis hash keyed by <prefix>:<queueTaskId>, one field per attempt:
redis-cli HGET alakai-executions:<queue-task-id> 1 | python3 -m json.tool
You should see status: "failed", attempt: 1, and the stored taskSnapshot. With
MAX_RETRIES=1, an attempt: 1 failed record is retryable — retryable = status === 'failed' && attempt <= MAX_RETRIES. This is exactly the gate core computes before re-enqueuing a retry.
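A rough TypeScript sketch of that gate, applied to a hash as HGETALL would return it. The AttemptRecord type and both function names are illustrative assumptions; the real record carries more fields (such as taskSnapshot).

```typescript
// Illustrative record shape: only the fields this guide mentions.
interface AttemptRecord {
  status: "running" | "failed" | "succeeded";
  attempt: number;
}

// The hash as field -> JSON string, one field per attempt number.
type ExecutionHash = Record<string, string>;

// Pick the record with the highest attempt number, or undefined for an empty hash.
function latestAttempt(hash: ExecutionHash): AttemptRecord | undefined {
  let latest: AttemptRecord | undefined;
  for (const field of Object.keys(hash)) {
    const record = JSON.parse(hash[field]) as AttemptRecord;
    if (latest === undefined || record.attempt > latest.attempt) {
      latest = record;
    }
  }
  return latest;
}

// The gate stated above: failed, and still within the retry budget.
function isRetryable(record: AttemptRecord, maxRetries: number): boolean {
  return record.status === "failed" && record.attempt <= maxRetries;
}

const seededHash: ExecutionHash = {
  "1": JSON.stringify({ status: "failed", attempt: 1 }),
};
const seededLatest = latestAttempt(seededHash);
console.log(seededLatest !== undefined && isRetryable(seededLatest, 1)); // true
```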
3. (Optional) Prove the MAX_RETRIES budget
With MAX_RETRIES=1 (default), only attempt 1 may be retried. Re-seed at attempt 2 and re-check:
make seed-failed-implementation repo='my-org/my-repo' attempt='2'
redis-cli HGET alakai-executions:<queue-task-id> 2 | python3 -m json.tool
attempt: 2 with MAX_RETRIES=1 is not retryable — core returns 409 "retry budget exhausted", and the consumer also drops any SQS message whose attempt exceeds MAX_RETRIES + 1.
The actual retry dispatch is exercised end-to-end in Level 2 (it requires SQS).
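The two limits quoted above line up: retrying failed attempt N enqueues attempt N + 1, so core's cap (N <= MAX_RETRIES) and the consumer's guard (drop when attempt > MAX_RETRIES + 1) describe the same budget. A small TypeScript sketch, with hypothetical function names, makes the off-by-one explicit:

```typescript
// Core-side rule from this guide: a failed attempt N is retryable while N <= maxRetries.
function coreAllowsRetry(failedAttempt: number, maxRetries: number): boolean {
  return failedAttempt <= maxRetries;
}

// Consumer-side guard from this guide: drop messages whose attempt exceeds maxRetries + 1.
function consumerDrops(messageAttempt: number, maxRetries: number): boolean {
  return messageAttempt > maxRetries + 1;
}

const budget = 1; // MAX_RETRIES=1, the default used in this guide
for (const failedAttempt of [1, 2]) {
  const nextAttempt = failedAttempt + 1; // what a retry would enqueue
  console.log(
    `failed attempt ${failedAttempt}: core allows retry = ${coreAllowsRetry(failedAttempt, budget)}, ` +
      `consumer would drop attempt ${nextAttempt} = ${consumerDrops(nextAttempt, budget)}`
  );
}
// failed attempt 1: core allows retry = true, consumer would drop attempt 2 = false
// failed attempt 2: core allows retry = false, consumer would drop attempt 3 = true
```

The consumer guard therefore only fires for messages that bypass core, such as a hand-crafted SQS message with an over-budget attempt.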
4. Clean up
redis-cli DEL alakai-executions:<queue-task-id>
Level 2 — Full pipeline test
This drives the real chain: enqueue → consume → dispatch (records attempt 1 running) → worker
fails → callback records failed → retry.
1. Start LocalStack and create the queue
cd .local-aws
make up && make infra # creates SQS queue http://localhost:4566/000000000000/alakai-queue
Make sure the orchestrator (above) is running with SQS_QUEUE_URL pointing at this queue and the
test AWS credentials.
2. Enqueue an implementation task
The orchestrator consumes it and "dispatches" the worker — in local mode that records an attempt 1
running record and stores the task in Redis (no real container runs).
cd .local-aws
make send-sqs TASK_ID=impl-e2e
Override fields as needed — make send-sqs TASK_ID=impl-e2e REPO=my-org/my-repo PROMPT='add retries' — or pass a full payload with make send-sqs MESSAGE_BODY='{…}'.
Avoid raw aws sqs send-message: without the dummy AWS_ACCESS_KEY_ID=test / AWS_SECRET_ACCESS_KEY=test, the CLI falls back to your real credential chain and tries to reach AWS instead of LocalStack. The make send-sqs target handles this.
3. Simulate the worker reporting failure
Because no real worker runs locally, POST the callback yourself. The implementation worker uses a
nested result shape — outcome at result.status ("success"/"error") and reason at
result.errorMessage:
curl -sS -X POST http://localhost:3001/task-complete \
-H 'content-type: application/json' \
--data '{"taskId":"impl-e2e","result":{"status":"error","errorMessage":"codex provider timed out"}}'
The digest and bug-hunter workers use the canonical top-level shape instead:
{"taskId":"…","taskType":"digest","status":"failure","error":{"message":"…"}}. The orchestrator normalizes both.
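As a rough illustration of what normalizing both shapes involves, here is a TypeScript sketch. It is not the orchestrator's normalizeOutcome: the types, the function name normalizeCallbackSketch, and the assumption that the canonical shape reports success as status "success" are all hypothetical, derived only from the two payloads shown above.

```typescript
// Implementation worker: outcome nested under `result`.
interface NestedCallback {
  taskId: string;
  result: { status: "success" | "error"; errorMessage?: string };
}

// Digest / bug-hunter workers: canonical top-level shape.
// Assumption: "success" is the non-failure value; this guide only shows "failure".
interface CanonicalCallback {
  taskId: string;
  taskType: string;
  status: "success" | "failure";
  error?: { message: string };
}

interface NormalizedOutcome {
  taskId: string;
  status: "succeeded" | "failed";
  errorMessage?: string;
}

function normalizeCallbackSketch(body: NestedCallback | CanonicalCallback): NormalizedOutcome {
  if ("result" in body) {
    // Nested shape: result.status is "success" or "error".
    return body.result.status === "error"
      ? { taskId: body.taskId, status: "failed", errorMessage: body.result.errorMessage }
      : { taskId: body.taskId, status: "succeeded" };
  }
  // Canonical shape: top-level status, reason under error.message.
  return body.status === "failure"
    ? { taskId: body.taskId, status: "failed", errorMessage: body.error ? body.error.message : undefined }
    : { taskId: body.taskId, status: "succeeded" };
}

const nested = normalizeCallbackSketch({
  taskId: "impl-e2e",
  result: { status: "error", errorMessage: "codex provider timed out" },
});
console.log(nested.status, nested.errorMessage); // failed codex provider timed out
```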
4. Verify it was recorded as failed
redis-cli HGET alakai-executions:impl-e2e 1 | python3 -m json.tool
status should be failed with errorMessage: "codex provider timed out" captured.
5. Retry, then simulate success
Retry re-enqueues the snapshot to SQS with attempt: 2; the orchestrator consumer dispatches it and
records attempt 2 running. Two ways to trigger it:
Via core (the real path — recommended). With core running on ENV=local, use the
dev_user shortcut instead of minting a Google token — POST to core on port 3000:
cd ../core
make retry-worker-execution task_id='impl-e2e' core_url=http://localhost:3000 dev_user='you@local'
# -> 202 { "taskId": "impl-e2e", "attempt": 2, "status": "queued" }
In production (or to verify the real auth path), pass a Google ID token instead — see "How do I get a Google ID token?":
make retry-worker-execution task_id='impl-e2e' core_url=http://localhost:3000 auth_token="$GOOGLE_ID_TOKEN"
Via SQS directly (fast, no core/auth): re-enqueue the snapshot with attempt: 2 yourself —
this is exactly the message core sends:
cd ../.local-aws
make send-sqs MESSAGE_BODY='{"taskId":"impl-e2e","taskType":"implementation","payload":{"repo":"my-org/my-repo","source":"coding.implementation","prompt":"add retries"},"metadata":{"createdAt":1,"trace_id":"impl-e2e"},"callbacks":{"github":{"repo":"my-org/my-repo","issueNumber":1}},"attempt":2}'
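For orientation, the re-enqueue step amounts to copying the stored snapshot and bumping attempt. The sketch below is TypeScript under assumptions: the QueueMessage type is trimmed to fields visible in the example message (the real one also carries metadata and callbacks), and buildRetryMessage is a hypothetical helper, not core's actual function.

```typescript
// Trimmed, illustrative message shape based on the example payload above.
interface QueueMessage {
  taskId: string;
  taskType: string;
  payload: Record<string, unknown>;
  attempt?: number;
}

// Copy the snapshot and set attempt to the failed attempt + 1.
// taskId is unchanged, so the retry lands in the same Redis hash as a new field.
function buildRetryMessage(snapshot: QueueMessage, failedAttempt: number): QueueMessage {
  return { ...snapshot, attempt: failedAttempt + 1 };
}

const snapshot: QueueMessage = {
  taskId: "impl-e2e",
  taskType: "implementation",
  payload: { repo: "my-org/my-repo", source: "coding.implementation", prompt: "add retries" },
};

const retryMessage = buildRetryMessage(snapshot, 1);
console.log(retryMessage.taskId, retryMessage.attempt); // impl-e2e 2
```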
Either path lands an attempt: 2 running record. Confirm it dispatched:
redis-cli HGET alakai-executions:impl-e2e 2 | python3 -m json.tool # -> attempt: 2, status: "running"
If field 2 never appears, core and the orchestrator are on different SQS queues (core's IMPLEMENTATION_SQS_QUEUE_URL must equal the orchestrator's SQS_QUEUE_URL) or on different Redis instances/prefixes.
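When chasing this kind of mismatch, it can help to compare the two services' settings side by side. The TypeScript sketch below is a hypothetical helper (not part of the repo) that takes each service's environment as a plain object and reports the variable pairs this guide says must agree.

```typescript
type Env = Record<string, string | undefined>;

// Pairs of (core variable, orchestrator variable) that this guide says must match.
const MUST_MATCH: Array<[string, string]> = [
  ["TASK_TRACKING_REDIS_URL", "REDIS_URL"],
  ["EXECUTIONS_REDIS_KEY_PREFIX", "EXECUTIONS_REDIS_KEY_PREFIX"],
  ["IMPLEMENTATION_SQS_QUEUE_URL", "SQS_QUEUE_URL"],
  ["MAX_RETRIES", "MAX_RETRIES"],
];

// Compares raw values only: an unset variable on one side and an explicit
// default on the other is reported as a mismatch even if the effective values agree.
function findMismatches(core: Env, orchestrator: Env): string[] {
  const problems: string[] = [];
  for (const [coreKey, orchKey] of MUST_MATCH) {
    if (core[coreKey] !== orchestrator[orchKey]) {
      problems.push(
        `core ${coreKey}=${core[coreKey]} vs orchestrator ${orchKey}=${orchestrator[orchKey]}`
      );
    }
  }
  return problems;
}

const coreEnv: Env = {
  TASK_TRACKING_REDIS_URL: "redis://localhost:6379",
  EXECUTIONS_REDIS_KEY_PREFIX: "alakai-executions",
  IMPLEMENTATION_SQS_QUEUE_URL: "http://localhost:4566/000000000000/other-queue",
  MAX_RETRIES: "1",
};
const orchestratorEnv: Env = {
  REDIS_URL: "redis://localhost:6379",
  EXECUTIONS_REDIS_KEY_PREFIX: "alakai-executions",
  SQS_QUEUE_URL: "http://localhost:4566/000000000000/alakai-queue",
  MAX_RETRIES: "1",
};

// Reports a single entry, for the queue URLs.
console.log(findMismatches(coreEnv, orchestratorEnv));
```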
Then simulate the worker succeeding:
curl -sS -X POST http://localhost:3001/task-complete \
-H 'content-type: application/json' \
--data '{"taskId":"impl-e2e","result":{"status":"success","pullRequestUrl":"https://github.com/x/y/pull/1"}}'
Confirm attempt 2 flipped to succeeded:
redis-cli HGET alakai-executions:impl-e2e 2 | python3 -m json.tool # -> status: "succeeded"
Callback idempotency gotcha: once an attempt is succeeded (or failed), a later callback is a no-op, because the store only resolves an attempt that is still running. So if you succeed attempt 2 and then send a failure callback, the record stays succeeded, and a subsequent retry returns 409 "cannot retry execution in status 'succeeded'" (the status gate), not "budget exhausted". To see the budget gate, don't succeed the attempt first; see Step 6.
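The no-op behaviour can be pictured with a small TypeScript sketch. The Attempt type and resolveAttempt function are hypothetical stand-ins for the store's logic; the only rule taken from this guide is that a callback resolves an attempt only while it is still running.

```typescript
type Status = "running" | "failed" | "succeeded";

interface Attempt {
  attempt: number;
  status: Status;
  errorMessage?: string;
}

// Apply a callback outcome. A terminal attempt is returned unchanged.
function resolveAttempt(
  current: Attempt,
  outcome: "failed" | "succeeded",
  errorMessage?: string
): Attempt {
  if (current.status !== "running") {
    return current; // already succeeded or failed: the callback is a no-op
  }
  return { ...current, status: outcome, errorMessage };
}

const runningAttempt: Attempt = { attempt: 2, status: "running" };
const afterSuccess = resolveAttempt(runningAttempt, "succeeded");
const afterLateFailure = resolveAttempt(afterSuccess, "failed", "late failure");

console.log(afterSuccess.status);     // succeeded
console.log(afterLateFailure.status); // succeeded
```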
6. (Optional) Prove the MAX_RETRIES budget end-to-end
With MAX_RETRIES=1, attempt 2 is the last allowed dispatch. Drive it on a fresh task and leave
attempt 2 failed (don't succeed it). Capture the queue-task-id from the seed output:
cd ../core
make seed-failed-implementation repo='my-org/my-repo'
# -> prints: queue-task-id: <QUEUE_ID> (use this below)
QUEUE_ID=<paste-queue-task-id-from-above>
make retry-worker-execution task_id="$QUEUE_ID" core_url=http://localhost:3000 dev_user='you@local' # -> 202, attempt 2
# fail attempt 2 (do NOT succeed it first)
curl -sS -X POST http://localhost:3001/task-complete -H 'content-type: application/json' \
--data "{\"taskId\":\"$QUEUE_ID\",\"result\":{\"status\":\"error\",\"errorMessage\":\"second failure\"}}"
make retry-worker-execution task_id="$QUEUE_ID" core_url=http://localhost:3000 dev_user='you@local'
# -> 409 { "error": "retry budget exhausted", "maxRetries": 1 }
For the consumer's defensive guard, re-enqueue past budget straight to SQS ("attempt":3) and watch
the orchestrator log Dropping over-budget dispatch — no new record is created.
7. Verify the stream and dashboard hash
After the success callback, the orchestrator emits a success tracking event to the Redis stream.
Confirm the stream event and the task-tracking hash agree:
# Last event on the tracking stream (should be status:success for <execution-id>)
redis-cli XREVRANGE task-tracking:events:v1 + - COUNT 1
# Dashboard hash for the execution row (should be status:success)
redis-cli HGETALL alakai-local:task-tracking:execution:<execution-id>
The taskName and actor in the stream event must match what make seed-failed-implementation
printed — if they differ, the seed scripts are not aligned (see bug #2 this guide targets).
8. Tear down
# Delete execution records by queue-task-id; task-tracking keys by execution-id
redis-cli DEL alakai-executions:<queue-task-id>
cd ../.local-aws && make reset && make down
How do I get a Google ID token?
core verifies the dashboard token with verifyIdToken({ audience: TASK_DASHBOARD_GOOGLE_CLIENT_IDS }),
so it must be a real Google-signed ID token whose aud matches one of your configured client
IDs, with a verified email. That rules out the usual shortcuts: gcloud auth print-identity-token and the OAuth Playground issue tokens with a different audience, so
verifyIdToken rejects them.
The reliable way is to copy the token the dashboard already mints:
- Open the Alakai dashboard in the browser and sign in with Google.
- DevTools → Network → click any /dashboard/... request.
- Copy the token from the Authorization: Bearer <token> header or the ?authToken=<token> query param (it's a JWT; expires ~1h).
For local testing, prefer the dev_user bypass (ENV=local only).
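Before sending a copied token, you can check locally that its aud claim is one of your configured client IDs. The TypeScript sketch below only decodes the JWT payload; it does not verify the signature, so it is a debugging aid and not a substitute for verifyIdToken. The function names are illustrative, and atob is assumed to be available as a global (browsers, Node 16+).

```typescript
// Decode (NOT verify) the payload segment of a JWT.
function decodeJwtPayload(token: string): Record<string, unknown> {
  const parts = token.split(".");
  if (parts.length !== 3) {
    throw new Error("not a JWT: expected three dot-separated segments");
  }
  // base64url -> base64, then restore the padding that JWTs strip.
  let b64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  while (b64.length % 4 !== 0) {
    b64 += "=";
  }
  // atob yields a binary string; fine for ASCII claims such as aud and exp.
  return JSON.parse(atob(b64)) as Record<string, unknown>;
}

// True when the token's aud (string or array) contains one of the client IDs.
function audienceMatches(payload: Record<string, unknown>, clientIds: string[]): boolean {
  const aud = payload["aud"];
  const audiences: unknown[] =
    typeof aud === "string" ? [aud] : Array.isArray(aud) ? aud : [];
  return audiences.some((a) => typeof a === "string" && clientIds.indexOf(a) !== -1);
}
```

Call audienceMatches(decodeJwtPayload(token), clientIds) with the IDs from TASK_DASHBOARD_GOOGLE_CLIENT_IDS; a false result suggests core would reject the token on audience alone.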
Endpoint reference
Core — dashboard API (external; authenticated with a dashboard Google ID token). Reads the execution record from Redis; retry re-enqueues to SQS:
| Method & path | Purpose |
|---|---|
| POST /dashboard/executions/:taskId/retry | Re-enqueue the latest failed execution to SQS with the next attempt. Enforces MAX_RETRIES. |
Retry responses: 202 queued · 401 unauthenticated · 404 unknown task · 409 not in failed
state or budget exhausted · 502 Redis read failed or enqueue failed.
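Read as a decision sequence, the 202/404/409 cases might look like the TypeScript sketch below. It is an assumption-laden outline, not core's handler: the types and the retryResponse name are hypothetical, the check order is inferred from the status-gate behaviour described in this guide, and the 401 (auth) and 502 (Redis/SQS failure) paths are left out because they depend on I/O.

```typescript
interface LatestExecution {
  status: "running" | "failed" | "succeeded";
  attempt: number;
}

interface RetryResponse {
  status: number;
  body: Record<string, unknown>;
}

function retryResponse(
  taskId: string,
  latest: LatestExecution | undefined,
  maxRetries: number
): RetryResponse {
  if (latest === undefined) {
    return { status: 404, body: { error: "execution not found" } };
  }
  // Status gate first: only a failed latest attempt can be retried.
  if (latest.status !== "failed") {
    return { status: 409, body: { error: `cannot retry execution in status '${latest.status}'` } };
  }
  // Budget gate second: a failed attempt beyond maxRetries is exhausted.
  if (latest.attempt > maxRetries) {
    return { status: 409, body: { error: "retry budget exhausted", maxRetries } };
  }
  return { status: 202, body: { taskId, attempt: latest.attempt + 1, status: "queued" } };
}

console.log(retryResponse("impl-e2e", { status: "failed", attempt: 1 }, 1).status); // 202
console.log(retryResponse("impl-e2e", { status: "failed", attempt: 2 }, 1).status); // 409
```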
There is no orchestrator HTTP API for this feature. Inspect records with redis-cli against the
<prefix>:<taskId> hash.
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Retry returns 404 "Route POST:/dashboard/executions/… not found" | You hit the orchestrator (3001) instead of core, or the route isn't registered | Point core_url at core (3000). retry-worker-execution defaults to 3000; if you ran the orchestrator on 3001 make sure core is also up. If still 404, check core's startup log for Dashboard retry route enabled; if it says disabled, configure dashboard auth and restart |
| Retry returns 404 "execution not found" (JSON {error}) | No record for that taskId, or core and orchestrator use different Redis/prefix | Seed one (Level 1) or enqueue + fail one (Level 2); verify both services share TASK_TRACKING_REDIS_URL/REDIS_URL and EXECUTIONS_REDIS_KEY_PREFIX |
| retry-worker-execution returns 401 dashboard_unauthorized | Missing/invalid dashboard auth | With core on ENV=local, pass dev_user=<email>; otherwise a valid auth_token=<google-id-token>; or use the SQS re-enqueue path |
| Retry returns 409 "cannot retry execution in status 'succeeded'" | You succeeded the attempt; a later failure callback was a no-op (idempotency) | Expected. To test the budget gate, fail the attempt without succeeding it first (Level 2, Step 6) |
| Code change has no effect | Built dist still serving old code (incl. stale @alakai/execution-store) | Restart core/orchestrator; rebuild the shared package after editing it (yarn workspace @alakai/execution-store build) |
| Retry returns 409 "cannot retry … 'running'" | Latest attempt hasn't failed yet | Only failed executions are retryable; report a failure callback first |
| Retry returns 409 "retry budget exhausted" | attempt > MAX_RETRIES | Expected once the cap is hit; raise MAX_RETRIES (on both services) to allow more |
| retry-worker-execution returns 502 | core can't reach Redis or the SQS enqueue failed | Check TASK_TRACKING_REDIS_URL, that Redis is up, and SQS is reachable |
| Re-enqueued retry never dispatches | attempt > MAX_RETRIES + 1, so the consumer drops it | Expected budget guard; send a smaller attempt or raise MAX_RETRIES |
| POST /task-complete returns 400 FST_ERR_CTP_EMPTY_JSON_BODY | JSON content-type with empty body | Send a JSON body (the Makefile target already sends {} for retry) |
| Successful run shows up as failed | Worker payload shape not normalized | Ensure you're on the build with normalizeOutcome in callbackRoutes.ts |
| aws calls hang or fail (Level 2) | LocalStack not running / wrong endpoint | cd .local-aws && make up && make infra; always pass --endpoint-url=http://localhost:4566 |
Notes
- A worker that dies without calling back (ECS crash/OOM, or all callback retries fail) is reconciled by the orchestrator's ECS reconciliation loop, which marks the orphaned running record failed with error_code='WORKER_LOST' so it becomes retryable. This loop only runs when ENV != local (it needs ECS), so it can't be exercised in this local setup.
- Redis writes on the callback are best-effort: a Redis outage at callback time is logged but never breaks the callback or the notification path.
Related docs
- ../architecture/execution-retry.md: the full retry architecture
- ../architecture/background-tasks.md: how tasks are queued and dispatched
- ../architecture/worker-pipeline.md: the worker dispatch/callback pipeline