How to test worker-execution recording & retry locally

The orchestrator records every worker dispatch as a durable Redis record (no TTL) via the shared @alakai/execution-store. The Alakai dashboard never talks to the orchestrator: core owns the external, authenticated surface and exposes a single action — POST /dashboard/executions/:taskId/retry. A retry reads the captured snapshot from Redis and re-enqueues it to SQS with the next attempt number; the orchestrator consumer then dispatches it like any other task (capped by MAX_RETRIES). There is no list/get endpoint — the dashboard sources its failed-task list from its own task tracking. This guide covers two testing paths:

  • Level 1 — Fast (5 min): seed a failed execution, inspect the Redis record directly, and confirm the retryable / MAX_RETRIES budget. No SQS or worker needed.
  • Level 2 — Full pipeline (15 min): enqueue a real task to SQS, let the orchestrator dispatch it, simulate the worker reporting failure, then retry (via core or by re-enqueuing to SQS).

Both core and the orchestrator default to port 3000. These examples run core on 3000 and the orchestrator on 3001 to avoid the clash.


Topology — which service handles what

The execution records live in Redis, written by the orchestrator and read by core. The only HTTP endpoint is the retry on core. Sending a retry to the orchestrator port returns 404 — that's the wrong service, not a bug.

| What you're doing | Where | How |
| --- | --- | --- |
| Inspect a failed execution record | Redis | redis-cli HGETALL <prefix>:<taskId> · make seed-failed-implementation prints the taskId |
| Seed / simulate worker callback | orchestrator 3001 | make seed-failed-implementation · POST /task-complete |
| Retry a failed execution | core (external) 3000 | POST /dashboard/executions/:taskId/retry · make retry-worker-execution core_url=http://localhost:3000 |

core and the orchestrator must point at the same Redis instance and the same EXECUTIONS_REDIS_KEY_PREFIX (default alakai-executions), or core reads nothing and every retry 404s.
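
Because a mismatch between the two services is the most common cause of a 404 here, it can help to compare their environments side by side. The following is a minimal sketch, not code from the Alakai repo: `findMismatches`, `SHARED_SETTINGS`, and `DEFAULTS` are hypothetical names, and the variable pairs and defaults are taken from what this guide states (it assumes both services apply the same defaults).

```typescript
// Hypothetical helper, not part of the Alakai codebase.
type Env = Record<string, string | undefined>;

interface Mismatch {
  setting: string;
  core: string | undefined;
  orchestrator: string | undefined;
}

// (core variable, orchestrator variable) pairs that this guide says must agree.
const SHARED_SETTINGS: Array<[string, string]> = [
  ["TASK_TRACKING_REDIS_URL", "REDIS_URL"],
  ["EXECUTIONS_REDIS_KEY_PREFIX", "EXECUTIONS_REDIS_KEY_PREFIX"],
  ["MAX_RETRIES", "MAX_RETRIES"],
  ["IMPLEMENTATION_SQS_QUEUE_URL", "SQS_QUEUE_URL"],
];

// Defaults mentioned in this guide (assumed identical on both services).
const DEFAULTS: Env = {
  EXECUTIONS_REDIS_KEY_PREFIX: "alakai-executions",
  MAX_RETRIES: "1",
};

function findMismatches(core: Env, orchestrator: Env): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const [coreVar, orchVar] of SHARED_SETTINGS) {
    // Fall back to the documented default when a variable is unset.
    const coreValue = core[coreVar] ?? DEFAULTS[coreVar];
    const orchValue = orchestrator[orchVar] ?? DEFAULTS[orchVar];
    if (coreValue !== orchValue) {
      mismatches.push({
        setting: `${coreVar} / ${orchVar}`,
        core: coreValue,
        orchestrator: orchValue,
      });
    }
  }
  return mismatches;
}
```

An empty result means the settings listed above line up; any entry names the pair to fix before retrying.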


Prerequisites

Both levels need:

  • Redis running: redis-cli ping should return PONG. If you don't have it: docker run -p 6379:6379 redis.
  • Dependencies installed: yarn install at the repo root, then yarn workspace @alakai/execution-store build (the shared store compiles to dist, consumed by both services).

Level 2 additionally needs:

  • Docker (for LocalStack SQS), plus the AWS CLI and jq.

There is no Postgres in this feature anymore.


Start the orchestrator (both levels)

In local mode (ENV=local) the orchestrator skips ECS — launcher.launch returns a fake ARN and stores the task in Redis instead of running a real container.

cd orchestrator
ENV=local PORT=3001 AWS_REGION=us-east-1 \
AWS_ACCESS_KEY_ID=test AWS_SECRET_ACCESS_KEY=test \
SQS_QUEUE_URL=http://localhost:4566/000000000000/alakai-queue \
REDIS_URL=redis://localhost:6379 \
EXECUTIONS_REDIS_KEY_PREFIX=alakai-executions \
yarn dev # or: yarn build && yarn start

Check it's up:

curl -sS http://localhost:3001/health # -> {"ok":true}

For Level 1 the SQS_QUEUE_URL can be any valid URL (the consumer will log harmless poll errors). For Level 2 it must point at the LocalStack queue created below.


Start core (only needed to retry via core)

You only need this to exercise the real dashboard retry hop. Level 1 and the SQS-direct retry path work without core.

core registers the retry route when dashboard auth is configured. Point its Redis at the same instance and prefix the orchestrator uses:

cd core
# EXECUTIONS_REDIS_KEY_PREFIX and MAX_RETRIES must equal the orchestrator's values.
# Also set the usual core env (Slack signing secret, TASK_DASHBOARD_GOOGLE_CLIENT_IDS, etc.).
ENV=local PORT=3000 \
TASK_TRACKING_REDIS_URL=redis://localhost:6379 \
EXECUTIONS_REDIS_KEY_PREFIX=alakai-executions \
MAX_RETRIES=1 \
yarn dev

On startup, confirm these two log lines — they're how you diagnose a 404 later:

Dashboard retry route enabled
Local auth bypass active on the retry route: 'x-dev-user' header accepted (ENV=local only)
  • If you see Dashboard retry route disabled (...), dashboard auth isn't configured → the route doesn't exist → 404.
  • The bypass line only appears under ENV=local. Without it, retry needs a real Google ID token.

After changing code, make sure core is running the new build: yarn dev (tsx watch) reloads automatically, but a built dist (yarn start) must be rebuilt and restarted.


Level 1 — Fast retry test

All make targets live in core/, so run them from there.

1. Seed a failed implementation execution

This writes a failed / attempt 1 record into both the dashboard task-tracking store and the execution store. Both stores share the same taskName and actor, so the dashboard row is stable across retries.

cd core
make seed-failed-implementation repo='my-org/my-repo' task_name='Fix the widget bug'

Optional overrides (all have defaults):

make seed-failed-implementation \
repo='my-org/my-repo' \
task_name='Fix the widget bug' \
actor_provider='github' actor_id='U123' actor_display='Jane Dev' \
execution_id='exec-demo' \
attempt='1'

The command prints two IDs:

  • execution-id — the dashboard row key (used by the SSE stream and the snapshot read service)
  • queue-task-id — the execution-store key, passed to the retry button

2. Inspect the execution record and its retry budget

The record is a Redis hash keyed by <prefix>:<queueTaskId>, one field per attempt:

redis-cli HGET alakai-executions:<queue-task-id> 1 | python3 -m json.tool

You should see status: "failed", attempt: 1, and the stored taskSnapshot. With MAX_RETRIES=1, an attempt: 1 failed record is retryable — retryable = status === 'failed' && attempt <= MAX_RETRIES. This is exactly the gate core computes before re-enqueuing a retry.
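
To make the gate concrete, here is a minimal sketch that takes the hash as HGETALL would return it (field = attempt number, value = JSON record), picks the highest attempt, and applies the rule quoted above. `ExecutionRecord`, `latestAttempt`, and `isRetryable` are hypothetical names for illustration; only `status`, `attempt`, and `taskSnapshot` come from this guide, and treating the highest-numbered field as the latest attempt is an assumption.

```typescript
// Hypothetical record type: only these three fields are named in this guide.
interface ExecutionRecord {
  status: string; // e.g. "running", "failed", "succeeded"
  attempt: number;
  taskSnapshot?: unknown;
}

// `hash` mirrors HGETALL output: one field per attempt, each value a JSON string.
function latestAttempt(hash: Record<string, string>): ExecutionRecord | undefined {
  const attempts = Object.keys(hash)
    .map(Number)
    .filter((n) => !isNaN(n));
  if (attempts.length === 0) return undefined;
  // Assumption: the highest attempt number is the latest one.
  const latest = Math.max(...attempts);
  return JSON.parse(hash[String(latest)]) as ExecutionRecord;
}

// The gate as stated in this guide: failed, and still within the retry budget.
function isRetryable(record: ExecutionRecord, maxRetries: number): boolean {
  return record.status === "failed" && record.attempt <= maxRetries;
}
```

With MAX_RETRIES=1, a hash whose only field is a failed attempt 1 passes the gate, while a hash that also holds a failed attempt 2 does not.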

3. (Optional) Prove the MAX_RETRIES budget

With MAX_RETRIES=1 (default), only attempt 1 may be retried. Re-seed at attempt 2 and re-check:

make seed-failed-implementation repo='my-org/my-repo' attempt='2'
redis-cli HGET alakai-executions:<queue-task-id> 2 | python3 -m json.tool

attempt: 2 with MAX_RETRIES=1 is not retryable — core returns 409 "retry budget exhausted", and the consumer also drops any SQS message whose attempt exceeds MAX_RETRIES + 1. The actual retry dispatch is exercised end-to-end in Level 2 (it requires SQS).
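
The two limits differ by one because retrying attempt n enqueues attempt n + 1. A small sketch of how the two checks line up, assuming they are the plain comparisons described in this guide (`consumerAccepts`, `coreAllowsRetryOf`, and `budgetTable` are hypothetical names, not the real implementation):

```typescript
// Consumer-side guard as described here: drop anything past MAX_RETRIES + 1.
function consumerAccepts(attempt: number, maxRetries: number): boolean {
  return attempt <= maxRetries + 1;
}

// Core-side gate for comparison: a failed attempt may be retried
// while its attempt number is still within MAX_RETRIES.
function coreAllowsRetryOf(failedAttempt: number, maxRetries: number): boolean {
  return failedAttempt <= maxRetries;
}

// Hypothetical helper: tabulate both checks for attempts 1..upTo.
function budgetTable(maxRetries: number, upTo: number): string[] {
  const rows: string[] = [];
  for (let attempt = 1; attempt <= upTo; attempt++) {
    rows.push(
      `attempt ${attempt}: dispatch=${consumerAccepts(attempt, maxRetries)} ` +
        `retryIfFailed=${coreAllowsRetryOf(attempt, maxRetries)}`
    );
  }
  return rows;
}
```

For `budgetTable(1, 3)` this yields dispatch=true/retryIfFailed=true for attempt 1, dispatch=true/retryIfFailed=false for attempt 2, and dispatch=false/retryIfFailed=false for attempt 3, matching the behavior described in this step.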

4. Clean up

redis-cli DEL alakai-executions:<queue-task-id>

Level 2 — Full pipeline test

This drives the real chain: enqueue → consume → dispatch (records attempt 1 running) → worker fails → callback records failed → retry.

1. Start LocalStack and create the queue

cd .local-aws
make up && make infra # creates SQS queue http://localhost:4566/000000000000/alakai-queue

Make sure the orchestrator (above) is running with SQS_QUEUE_URL pointing at this queue and the test AWS credentials.

2. Enqueue an implementation task

The orchestrator consumes it and "dispatches" the worker — in local mode that records an attempt 1 running record and stores the task in Redis (no real container runs).

cd .local-aws
make send-sqs TASK_ID=impl-e2e

Override fields as needed — make send-sqs TASK_ID=impl-e2e REPO=my-org/my-repo PROMPT='add retries' — or pass a full payload with make send-sqs MESSAGE_BODY='{…}'.

Avoid raw aws sqs send-message: without the dummy AWS_ACCESS_KEY_ID=test / AWS_SECRET_ACCESS_KEY=test, the CLI falls back to your real credential chain and tries to reach AWS instead of LocalStack. The make send-sqs target handles this.

3. Simulate the worker reporting failure

Because no real worker runs locally, POST the callback yourself. The implementation worker uses a nested result shape — outcome at result.status ("success"/"error") and reason at result.errorMessage:

curl -sS -X POST http://localhost:3001/task-complete \
-H 'content-type: application/json' \
--data '{"taskId":"impl-e2e","result":{"status":"error","errorMessage":"codex provider timed out"}}'

The digest and bug-hunter workers use the canonical top-level shape instead: {"taskId":"…","taskType":"digest","status":"failure","error":{"message":"…"}}. The orchestrator normalizes both.
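
As an illustration of what normalizing the two shapes involves, here is a hedged sketch. It is not the orchestrator's actual code: `normalizeOutcomeSketch` and the type names are hypothetical, and the guide only shows "failure" for the canonical shape, so treating a canonical `status` of "success" as success is an assumption.

```typescript
// Normalized result used by this sketch (hypothetical type).
type Outcome = { status: "succeeded" | "failed"; errorMessage?: string };

// Nested shape used by the implementation worker.
interface NestedCallback {
  taskId: string;
  result: { status: "success" | "error"; errorMessage?: string; pullRequestUrl?: string };
}

// Canonical top-level shape used by the digest and bug-hunter workers.
interface CanonicalCallback {
  taskId: string;
  taskType?: string;
  status: string; // "failure" is the only value this guide shows
  error?: { message?: string };
}

function normalizeOutcomeSketch(body: NestedCallback | CanonicalCallback): Outcome {
  if ("result" in body) {
    // Nested shape: outcome at result.status, reason at result.errorMessage.
    return body.result.status === "success"
      ? { status: "succeeded" }
      : { status: "failed", errorMessage: body.result.errorMessage };
  }
  // Assumption: canonical success is spelled "success"; anything else is a failure.
  return body.status === "success"
    ? { status: "succeeded" }
    : { status: "failed", errorMessage: body.error?.message };
}
```

Both payloads shown in this step reduce to a `failed` outcome carrying the worker's error message.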

4. Verify it was recorded as failed

redis-cli HGET alakai-executions:impl-e2e 1 | python3 -m json.tool

status should be failed with errorMessage: "codex provider timed out" captured.

5. Retry, then simulate success

Retry re-enqueues the snapshot to SQS with attempt: 2; the orchestrator consumer dispatches it and records attempt 2 running. Two ways to trigger it:

Via core (the real path — recommended). With core running on ENV=local, use the dev_user shortcut instead of minting a Google token — POST to core on port 3000:

cd ../core
make retry-worker-execution task_id='impl-e2e' core_url=http://localhost:3000 dev_user='you@local'
# -> 202 { "taskId": "impl-e2e", "attempt": 2, "status": "queued" }

In production (or to verify the real auth path), pass a Google ID token instead — see "How do I get a Google ID token?":

make retry-worker-execution task_id='impl-e2e' core_url=http://localhost:3000 auth_token="$GOOGLE_ID_TOKEN"

Via SQS directly (fast, no core/auth): re-enqueue the snapshot with attempt: 2 yourself — this is exactly the message core sends:

cd ../.local-aws
make send-sqs MESSAGE_BODY='{"taskId":"impl-e2e","taskType":"implementation","payload":{"repo":"my-org/my-repo","source":"coding.implementation","prompt":"add retries"},"metadata":{"createdAt":1,"trace_id":"impl-e2e"},"callbacks":{"github":{"repo":"my-org/my-repo","issueNumber":1}},"attempt":2}'

Either path lands an attempt: 2 running record. Confirm it dispatched:

redis-cli HGET alakai-executions:impl-e2e 2 | python3 -m json.tool # -> attempt: 2, status: "running"

If field 2 never appears, core and the orchestrator are either on different SQS queues (core's IMPLEMENTATION_SQS_QUEUE_URL must equal the orchestrator's SQS_QUEUE_URL) or on different Redis instances/prefixes.
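
For reference, the retry message is the stored snapshot with only the attempt number bumped. A minimal sketch under that reading (`QueueMessage` and `buildRetryMessage` are hypothetical names; the field list is copied from the example message body in this step, not from the real type definitions):

```typescript
// Shape inferred from the example message body in this guide (hypothetical type).
interface QueueMessage {
  taskId: string;
  taskType: string;
  payload: Record<string, unknown>;
  metadata: Record<string, unknown>;
  callbacks?: Record<string, unknown>;
  attempt?: number;
}

// Copy the snapshot unchanged and set the next attempt number.
function buildRetryMessage(snapshot: QueueMessage, failedAttempt: number): QueueMessage {
  return { ...snapshot, attempt: failedAttempt + 1 };
}

// Usage: the snapshot of a task whose attempt 1 failed becomes an attempt-2 message.
const exampleRetry = buildRetryMessage(
  {
    taskId: "impl-e2e",
    taskType: "implementation",
    payload: { repo: "my-org/my-repo", source: "coding.implementation", prompt: "add retries" },
    metadata: { createdAt: 1, trace_id: "impl-e2e" },
    callbacks: { github: { repo: "my-org/my-repo", issueNumber: 1 } },
  },
  1
);
```

Serializing `exampleRetry` with `JSON.stringify` gives a body equivalent to the MESSAGE_BODY used in the SQS-direct path.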

Then simulate the worker succeeding:

curl -sS -X POST http://localhost:3001/task-complete \
-H 'content-type: application/json' \
--data '{"taskId":"impl-e2e","result":{"status":"success","pullRequestUrl":"https://github.com/x/y/pull/1"}}'

Confirm attempt 2 flipped to succeeded:

redis-cli HGET alakai-executions:impl-e2e 2 | python3 -m json.tool # -> status: "succeeded"

Callback idempotency gotcha: once an attempt is succeeded (or failed), a later callback is a no-op — the store only resolves an attempt still running. So if you succeed attempt 2 and then send a failure callback, the record stays succeeded, and a subsequent retry returns 409 "cannot retry execution in status 'succeeded'" (the status gate), not "budget exhausted". To see the budget gate, don't succeed the attempt first — see Step 6.
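
The idempotency rule can be pictured as a guard on the attempt's current status. This is a sketch of the behavior described above, not the store's code; `AttemptRecord`, `AttemptStatus`, and `resolveAttempt` are hypothetical names.

```typescript
type AttemptStatus = "running" | "succeeded" | "failed";

interface AttemptRecord {
  attempt: number;
  status: AttemptStatus;
  errorMessage?: string;
}

// Only an attempt that is still running can be resolved; later callbacks are no-ops.
function resolveAttempt(
  record: AttemptRecord,
  outcome: { status: "succeeded" | "failed"; errorMessage?: string }
): AttemptRecord {
  if (record.status !== "running") return record; // already terminal
  return { ...record, status: outcome.status, errorMessage: outcome.errorMessage };
}

// Usage: succeed attempt 2, then send a late failure callback.
const afterSuccess = resolveAttempt({ attempt: 2, status: "running" }, { status: "succeeded" });
const afterLateFailure = resolveAttempt(afterSuccess, { status: "failed", errorMessage: "late" });
```

Here `afterLateFailure.status` is still "succeeded", which is why a later retry hits the status gate rather than the budget gate.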

6. (Optional) Prove the MAX_RETRIES budget end-to-end

With MAX_RETRIES=1, attempt 2 is the last allowed dispatch. Drive it on a fresh task and leave attempt 2 failed (don't succeed it). Capture the queue-task-id from the seed output:

cd ../core
make seed-failed-implementation repo='my-org/my-repo'
# -> prints: queue-task-id: <QUEUE_ID> (use this below)

QUEUE_ID=<paste-queue-task-id-from-above>

make retry-worker-execution task_id="$QUEUE_ID" core_url=http://localhost:3000 dev_user='you@local' # -> 202, attempt 2

# fail attempt 2 (do NOT succeed it first)
curl -sS -X POST http://localhost:3001/task-complete -H 'content-type: application/json' \
--data "{\"taskId\":\"$QUEUE_ID\",\"result\":{\"status\":\"error\",\"errorMessage\":\"second failure\"}}"

make retry-worker-execution task_id="$QUEUE_ID" core_url=http://localhost:3000 dev_user='you@local'
# -> 409 { "error": "retry budget exhausted", "maxRetries": 1 }

For the consumer's defensive guard, re-enqueue past budget straight to SQS ("attempt":3) and watch the orchestrator log Dropping over-budget dispatch — no new record is created.

7. Verify the stream and dashboard hash

After the success callback, the orchestrator emits a success tracking event to the Redis stream. Confirm the stream event and the task-tracking hash agree:

# Last event on the tracking stream (should be status:success for <execution-id>)
redis-cli XREVRANGE task-tracking:events:v1 + - COUNT 1

# Dashboard hash for the execution row (should be status:success)
redis-cli HGETALL alakai-local:task-tracking:execution:<execution-id>

The taskName and actor in the stream event must match what make seed-failed-implementation printed — if they differ, the seed scripts are not aligned (see bug #2 this guide targets).

8. Tear down

# Delete execution records by queue-task-id
redis-cli DEL alakai-executions:<queue-task-id>
# Delete task-tracking keys by execution-id
redis-cli DEL alakai-local:task-tracking:execution:<execution-id>
cd ../.local-aws && make reset && make down

How do I get a Google ID token?

core verifies the dashboard token with verifyIdToken({ audience: TASK_DASHBOARD_GOOGLE_CLIENT_IDS }), so it must be a real Google-signed ID token whose aud matches one of your configured client IDs, with a verified email. That rules out the usual shortcuts: gcloud auth print-identity-token and the OAuth Playground issue tokens with a different audience, so verifyIdToken rejects them.

The reliable way is to copy the token the dashboard already mints:

  1. Open the Alakai dashboard in the browser and sign in with Google.
  2. DevTools → Network → click any /dashboard/... request.
  3. Copy the token from the Authorization: Bearer <token> header or the ?authToken=<token> query param (it's a JWT; expires ~1h).

For local testing, prefer the dev_user bypass (ENV=local only).


Endpoint reference

Core — dashboard API (external; authenticated with a dashboard Google ID token). Reads the execution record from Redis; retry re-enqueues to SQS:

| Method & path | Purpose |
| --- | --- |
| POST /dashboard/executions/:taskId/retry | Re-enqueue the latest failed execution to SQS with the next attempt. Enforces MAX_RETRIES. |

Retry responses: 202 queued · 401 unauthenticated · 404 unknown task · 409 not in failed state or budget exhausted · 502 Redis read failed or enqueue failed.
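
If you are scripting against this endpoint rather than using the make target, the request and the status handling might look like the sketch below. It only builds a request description and does no network I/O. `buildRetryRequest` and `describeRetryStatus` are hypothetical helpers; the `x-dev-user` header, the Bearer token, and the `{}` body come from this guide, but whether the make target sends exactly these headers is an assumption.

```typescript
// Either the local dev bypass (ENV=local only) or a real Google ID token.
type RetryAuth = { devUser: string } | { idToken: string };

interface RetryRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildRetryRequest(coreUrl: string, taskId: string, auth: RetryAuth): RetryRequest {
  const headers: Record<string, string> = { "content-type": "application/json" };
  if ("devUser" in auth) {
    headers["x-dev-user"] = auth.devUser;
  } else {
    headers["authorization"] = `Bearer ${auth.idToken}`;
  }
  return {
    url: `${coreUrl}/dashboard/executions/${encodeURIComponent(taskId)}/retry`,
    method: "POST",
    headers,
    body: "{}", // a JSON content-type with an empty body is rejected, so send {}
  };
}

// Map the documented response codes to their meaning.
function describeRetryStatus(code: number): string {
  switch (code) {
    case 202: return "queued";
    case 401: return "unauthenticated";
    case 404: return "unknown task";
    case 409: return "not in failed state or budget exhausted";
    case 502: return "Redis read failed or enqueue failed";
    default: return "unexpected status";
  }
}
```

The returned object can be handed to any HTTP client; a 404 with a "Route ... not found" message instead of a JSON error means the request reached the wrong service or the route is not registered.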

There is no orchestrator HTTP API for this feature. Inspect records with redis-cli against the <prefix>:<taskId> hash.


Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Retry returns 404 "Route POST:/dashboard/executions/… not found" | You hit the orchestrator (3001) instead of core, or the route isn't registered | Point core_url at core (3000). retry-worker-execution defaults to 3000; if you ran the orchestrator on 3001 make sure core is also up. If still 404, check core's startup log for Dashboard retry route enabled; if it says disabled, configure dashboard auth and restart |
| Retry returns 404 "execution not found" (JSON {error}) | No record for that taskId, or core and orchestrator use different Redis/prefix | Seed one (Level 1) or enqueue + fail one (Level 2); verify both services share TASK_TRACKING_REDIS_URL/REDIS_URL and EXECUTIONS_REDIS_KEY_PREFIX |
| retry-worker-execution returns 401 dashboard_unauthorized | Missing/invalid dashboard auth | With core on ENV=local, pass dev_user=<email>; otherwise a valid auth_token=<google-id-token>; or use the SQS re-enqueue path |
| Retry returns 409 "cannot retry execution in status 'succeeded'" | You succeeded the attempt; a later failure callback was a no-op (idempotency) | Expected. To test the budget gate, fail the attempt without succeeding it first (Level 2, Step 6) |
| Code change has no effect | Built dist still serving old code (incl. stale @alakai/execution-store) | Restart core/orchestrator; rebuild the shared package after editing it (yarn workspace @alakai/execution-store build) |
| Retry returns 409 "cannot retry … 'running'" | Latest attempt hasn't failed yet | Only failed executions are retryable; report a failure callback first |
| Retry returns 409 "retry budget exhausted" | attempt > MAX_RETRIES | Expected once the cap is hit; raise MAX_RETRIES (on both services) to allow more |
| retry-worker-execution returns 502 | core can't reach Redis or the SQS enqueue failed | Check TASK_TRACKING_REDIS_URL, that Redis is up, and SQS is reachable |
| Re-enqueued retry never dispatches | attempt > MAX_RETRIES + 1, so the consumer drops it | Expected budget guard; send a smaller attempt or raise MAX_RETRIES |
| POST /task-complete returns 400 FST_ERR_CTP_EMPTY_JSON_BODY | JSON content-type with empty body | Send a JSON body (the Makefile target already sends {} for retry) |
| Successful run shows up as failed | Worker payload shape not normalized | Ensure you're on the build with normalizeOutcome in callbackRoutes.ts |
| aws calls hang or fail (Level 2) | LocalStack not running / wrong endpoint | cd .local-aws && make up && make infra; always pass --endpoint-url=http://localhost:4566 |

Notes

  • A worker that dies without calling back (ECS crash/OOM, or all callback retries fail) is reconciled by the orchestrator's ECS reconciliation loop, which marks the orphaned running record failed with error_code='WORKER_LOST' so it becomes retryable. This loop only runs when ENV != local (it needs ECS), so it can't be exercised in this local setup.
  • Redis writes on the callback are best-effort — a Redis outage at callback time is logged but never breaks the callback or the notification path.
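
Since the reconciliation loop can't be run locally, the following sketch shows the state change it is described as making, so you know what to expect in the record. It is an illustration only: `reconcileOrphan` and `OrphanCandidate` are hypothetical names, `workerStillAlive` stands in for whatever ECS lookup the real loop performs, and only the `failed` status and `error_code='WORKER_LOST'` come from the note above.

```typescript
interface OrphanCandidate {
  attempt: number;
  status: "running" | "succeeded" | "failed";
  error_code?: string;
}

// Mark a running record as failed when its worker is gone; leave everything else alone.
function reconcileOrphan(record: OrphanCandidate, workerStillAlive: boolean): OrphanCandidate {
  if (record.status !== "running" || workerStillAlive) return record;
  return { ...record, status: "failed", error_code: "WORKER_LOST" };
}
```

A record reconciled this way is `failed`, so it passes the same retry gate as a failure reported through the callback.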