How to test worker-execution recording & retry locally

The orchestrator records every worker dispatch as a durable Redis record (no TTL) via the shared @alakai/execution-store. The Alakai dashboard never talks to the orchestrator: core owns the external, authenticated surface and exposes a single action — POST /dashboard/executions/:taskId/retry. A retry reads the captured snapshot from Redis and re-enqueues it to SQS with the next attempt number; the orchestrator consumer then dispatches it like any other task (capped by MAX_RETRIES). There is no list/get endpoint — the dashboard sources its failed-task list from its own task tracking. This guide covers two testing paths:

Level 1 — Fast (5 min): seed a failed execution, inspect the Redis record directly, and confirm the retryable / MAX_RETRIES budget. No SQS or worker needed.
Level 2 — Full pipeline (15 min): enqueue a real task to SQS, let the orchestrator dispatch it, simulate the worker reporting failure, then retry (via core or by re-enqueuing to SQS).

Both core and the orchestrator default to port 3000. These examples run core on 3000 and the orchestrator on 3001 to avoid the clash.

Topology — which service handles what

The execution records live in Redis, written by the orchestrator and read by core. The only HTTP endpoint is the retry on core. Sending a retry to the orchestrator port returns 404 — that's the wrong service, not a bug.

What you're doing	Where	How
Inspect a failed execution record	Redis	`redis-cli HGETALL <prefix>:<taskId>` · `make seed-failed-implementation` prints the taskId
Seed / simulate worker callback	orchestrator `3001`	`make seed-failed-implementation` · `POST /task-complete`
Retry a failed execution	core (external) `3000`	`POST /dashboard/executions/:taskId/retry` · `make retry-worker-execution core_url=http://localhost:3000`

core and the orchestrator must point at the same Redis instance and the same EXECUTIONS_REDIS_KEY_PREFIX (default alakai-executions), or core reads nothing and every retry 404s.
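To see why a mismatch turns into a 404, here is a minimal sketch, assuming the `<prefix>:<taskId>` key format described in this guide. The helper names (`executionKey`, `findStoreMismatches`) and the config interfaces are hypothetical, not part of either service; they only model the env vars this guide names.

```typescript
// Key format from this guide: <prefix>:<taskId>
function executionKey(prefix: string, taskId: string): string {
  return `${prefix}:${taskId}`;
}

// Hypothetical config shapes: only the env vars this guide names.
interface CoreEnv {
  TASK_TRACKING_REDIS_URL: string;
  EXECUTIONS_REDIS_KEY_PREFIX: string;
}
interface OrchestratorEnv {
  REDIS_URL: string;
  EXECUTIONS_REDIS_KEY_PREFIX: string;
}

// Lists the settings that would make core look in a different place
// than the orchestrator writes to.
function findStoreMismatches(core: CoreEnv, orchestrator: OrchestratorEnv): string[] {
  const problems: string[] = [];
  if (core.TASK_TRACKING_REDIS_URL !== orchestrator.REDIS_URL) {
    problems.push("Redis URL differs");
  }
  if (core.EXECUTIONS_REDIS_KEY_PREFIX !== orchestrator.EXECUTIONS_REDIS_KEY_PREFIX) {
    problems.push("key prefix differs");
  }
  return problems;
}

// The orchestrator writes under one prefix; core reads under another (made-up) one.
const writtenKey = executionKey("alakai-executions", "impl-e2e");
const readKey = executionKey("alakai-executions-dev", "impl-e2e");
console.log(writtenKey === readKey); // false: core finds no record, so the retry 404s
```

With identical Redis URLs and prefixes, `findStoreMismatches` returns an empty list and both services resolve the same key.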

Prerequisites

Both levels need:

Redis running — redis-cli ping should return PONG. If you don't have it: docker run -p 6379:6379 redis.
Dependencies installed — yarn install at the repo root, then yarn workspace @alakai/execution-store build (the shared store compiles to dist, consumed by both services).

Level 2 additionally needs:

Docker (for LocalStack SQS), plus the AWS CLI and jq.

This feature no longer uses Postgres, so no database setup is needed.

Start the orchestrator (both levels)

In local mode (ENV=local) the orchestrator skips ECS — launcher.launch returns a fake ARN and stores the task in Redis instead of running a real container.

cd orchestrator
ENV=local PORT=3001 AWS_REGION=us-east-1 \
  AWS_ACCESS_KEY_ID=test AWS_SECRET_ACCESS_KEY=test \
  SQS_QUEUE_URL=http://localhost:4566/000000000000/alakai-queue \
  REDIS_URL=redis://localhost:6379 \
  EXECUTIONS_REDIS_KEY_PREFIX=alakai-executions \
  yarn dev        # or: yarn build && yarn start

Check it's up:

curl -sS http://localhost:3001/health      # -> {"ok":true}

For Level 1 the SQS_QUEUE_URL can be any valid URL (the consumer will log harmless poll errors). For Level 2 it must point at the LocalStack queue created below.

Start core (only needed to retry via `core`)

You only need this to exercise the real dashboard retry hop. Level 1 and the SQS-direct retry path work without core.

core registers the retry route when dashboard auth is configured. Point its Redis at the same instance and prefix the orchestrator uses:

cd core
# EXECUTIONS_REDIS_KEY_PREFIX and MAX_RETRIES must equal the orchestrator's values.
# Also export the usual core env first (Slack signing secret, TASK_DASHBOARD_GOOGLE_CLIENT_IDS, etc.).
# Comments cannot follow a trailing backslash, so they live up here.
ENV=local PORT=3000 \
  TASK_TRACKING_REDIS_URL=redis://localhost:6379 \
  EXECUTIONS_REDIS_KEY_PREFIX=alakai-executions \
  MAX_RETRIES=1 \
  yarn dev

On startup, confirm these two log lines — they're how you diagnose a 404 later:

Dashboard retry route enabled
Local auth bypass active on the retry route: 'x-dev-user' header accepted (ENV=local only)

If you see Dashboard retry route disabled (...), dashboard auth isn't configured → the route doesn't exist → 404.
The bypass line only appears under ENV=local. Without it, retry needs a real Google ID token.

After changing code, make sure core picks it up: yarn dev (tsx watch) reloads automatically, but a built dist (yarn start) keeps serving the old code until you rebuild and restart.

Level 1 — Fast retry test

All make targets live in core/, so run them from there.

1. Seed a failed implementation execution

This writes a failed / attempt 1 record into both the dashboard task-tracking store and the execution store. Both stores share the same taskName and actor, so the dashboard row is stable across retries.

cd core
make seed-failed-implementation repo='my-org/my-repo' task_name='Fix the widget bug'

Optional overrides (all have defaults):

make seed-failed-implementation \
  repo='my-org/my-repo' \
  task_name='Fix the widget bug' \
  actor_provider='github' actor_id='U123' actor_display='Jane Dev' \
  execution_id='exec-demo' \
  attempt='1'

The command prints two IDs:

execution-id — the dashboard row key (used by the SSE stream and the snapshot read service)
queue-task-id — the execution-store key, passed to the retry button

2. Inspect the execution record and its retry budget

The record is a Redis hash keyed by <prefix>:<queueTaskId>, one field per attempt:

redis-cli HGET alakai-executions:<queue-task-id> 1 | python3 -m json.tool

You should see status: "failed", attempt: 1, and the stored taskSnapshot. With MAX_RETRIES=1, an attempt: 1 failed record is retryable — retryable = status === 'failed' && attempt <= MAX_RETRIES. This is exactly the gate core computes before re-enqueuing a retry.
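The gate above can be sketched as a small predicate. This is an illustration of the quoted rule, not core's actual code; the `AttemptRecord` shape and the `isRetryable` name are assumptions that model only the two fields the rule uses.

```typescript
// Hypothetical record shape: only the fields the gate needs.
type ExecutionStatus = "running" | "succeeded" | "failed";

interface AttemptRecord {
  status: ExecutionStatus;
  attempt: number;
}

// Mirrors the rule quoted above: failed AND within the retry budget.
function isRetryable(record: AttemptRecord, maxRetries: number): boolean {
  return record.status === "failed" && record.attempt <= maxRetries;
}

const seeded: AttemptRecord = { status: "failed", attempt: 1 };
console.log(isRetryable(seeded, 1)); // true
console.log(isRetryable({ status: "failed", attempt: 2 }, 1)); // false (budget)
console.log(isRetryable({ status: "running", attempt: 1 }, 1)); // false (status)
```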

3. (Optional) Prove the `MAX_RETRIES` budget

With MAX_RETRIES=1 (default), only attempt 1 may be retried. Re-seed at attempt 2 and re-check:

make seed-failed-implementation repo='my-org/my-repo' attempt='2'
redis-cli HGET alakai-executions:<queue-task-id> 2 | python3 -m json.tool

attempt: 2 with MAX_RETRIES=1 is not retryable — core returns 409 "retry budget exhausted", and the consumer also drops any SQS message whose attempt exceeds MAX_RETRIES + 1. The actual retry dispatch is exercised end-to-end in Level 2 (it requires SQS).
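The two guards line up: the highest attempt core will ever enqueue is MAX_RETRIES + 1, which is exactly the highest attempt the consumer accepts. A minimal sketch of that relationship, using hypothetical function names and only the two inequalities stated in this guide:

```typescript
// core's gate (from this guide): a failed attempt may be retried while attempt <= maxRetries.
function coreAllowsRetry(failedAttempt: number, maxRetries: number): boolean {
  return failedAttempt <= maxRetries;
}

// Consumer's guard (from this guide): drop messages whose attempt exceeds maxRetries + 1.
function consumerDrops(messageAttempt: number, maxRetries: number): boolean {
  return messageAttempt > maxRetries + 1;
}

const retryBudget = 1; // MAX_RETRIES=1, the default used in this guide

for (const failedAttempt of [1, 2]) {
  const nextAttempt = failedAttempt + 1; // a retry enqueues the next attempt number
  console.log({
    failedAttempt,
    coreAllows: coreAllowsRetry(failedAttempt, retryBudget),
    nextAttempt,
    consumerDrops: consumerDrops(nextAttempt, retryBudget),
  });
}
```

With a budget of 1, a failed attempt 1 passes core's gate and its follow-up (attempt 2) is accepted by the consumer; a failed attempt 2 is refused by core, and an attempt 3 message injected by hand would be dropped by the consumer.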

4. Clean up

redis-cli DEL alakai-executions:<queue-task-id>

Level 2 — Full pipeline test

This drives the real chain: enqueue → consume → dispatch (records attempt 1 running) → worker fails → callback records failed → retry.

1. Start LocalStack and create the queue

cd .local-aws
make up && make infra      # creates SQS queue http://localhost:4566/000000000000/alakai-queue

Make sure the orchestrator (above) is running with SQS_QUEUE_URL pointing at this queue and the test AWS credentials.

2. Enqueue an implementation task

The orchestrator consumes it and "dispatches" the worker — in local mode that records an attempt 1 running record and stores the task in Redis (no real container runs).

cd .local-aws
make send-sqs TASK_ID=impl-e2e

Override fields as needed — make send-sqs TASK_ID=impl-e2e REPO=my-org/my-repo PROMPT='add retries' — or pass a full payload with make send-sqs MESSAGE_BODY='{…}'.

Avoid raw aws sqs send-message: without the dummy AWS_ACCESS_KEY_ID=test / AWS_SECRET_ACCESS_KEY=test, the CLI falls back to your real credential chain and tries to reach AWS instead of LocalStack. The make send-sqs target handles this.

3. Simulate the worker reporting failure

Because no real worker runs locally, POST the callback yourself. The implementation worker uses a nested result shape — outcome at result.status ("success"/"error") and reason at result.errorMessage:

curl -sS -X POST http://localhost:3001/task-complete \
  -H 'content-type: application/json' \
  --data '{"taskId":"impl-e2e","result":{"status":"error","errorMessage":"codex provider timed out"}}'

The digest and bug-hunter workers use the canonical top-level shape instead: {"taskId":"…","taskType":"digest","status":"failure","error":{"message":"…"}}. The orchestrator normalizes both.
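A rough sketch of what normalizing the two shapes could look like. This is not the orchestrator's actual `normalizeOutcome`; the function name, the interfaces, and the output shape are assumptions for illustration, as is `"success"` being the non-failure value of the top-level `status` (this guide only shows `"failure"`). The `digest-1` and `boom` values are made up.

```typescript
// Hypothetical normalized form; "succeeded"/"failed" are the record statuses this guide shows in Redis.
type NormalizedOutcome =
  | { status: "succeeded" }
  | { status: "failed"; errorMessage: string | null };

// Nested shape used by the implementation worker.
interface NestedCallback {
  taskId: string;
  result: { status: "success" | "error"; errorMessage?: string; pullRequestUrl?: string };
}

// Canonical top-level shape used by the digest and bug-hunter workers.
interface CanonicalCallback {
  taskId: string;
  taskType: string;
  status: "success" | "failure"; // "success" is an assumption
  error?: { message?: string };
}

function normalizeOutcomeSketch(body: NestedCallback | CanonicalCallback): NormalizedOutcome {
  if ("result" in body) {
    // Nested: outcome at result.status, reason at result.errorMessage.
    return body.result.status === "success"
      ? { status: "succeeded" }
      : { status: "failed", errorMessage: body.result.errorMessage ?? null };
  }
  // Canonical: outcome at status, reason at error.message.
  return body.status === "success"
    ? { status: "succeeded" }
    : { status: "failed", errorMessage: body.error?.message ?? null };
}

console.log(
  normalizeOutcomeSketch({
    taskId: "impl-e2e",
    result: { status: "error", errorMessage: "codex provider timed out" },
  }),
);
console.log(
  normalizeOutcomeSketch({
    taskId: "digest-1",
    taskType: "digest",
    status: "failure",
    error: { message: "boom" },
  }),
);
```

Both calls produce a `failed` outcome carrying the worker's message, which is why the same Redis check in the next step works regardless of which worker reported.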

4. Verify it was recorded as failed

redis-cli HGET alakai-executions:impl-e2e 1 | python3 -m json.tool

status should be failed with errorMessage: "codex provider timed out" captured.

5. Retry, then simulate success

Retry re-enqueues the snapshot to SQS with attempt: 2; the orchestrator consumer dispatches it and records attempt 2 running. Two ways to trigger it:

Via core (the real path — recommended). With core running on ENV=local, use the dev_user shortcut instead of minting a Google token — POST to core on port 3000:

cd ../core
make retry-worker-execution task_id='impl-e2e' core_url=http://localhost:3000 dev_user='you@local'
# -> 202 { "taskId": "impl-e2e", "attempt": 2, "status": "queued" }

In production (or to verify the real auth path), pass a Google ID token instead — see "How do I get a Google ID token?":

make retry-worker-execution task_id='impl-e2e' core_url=http://localhost:3000 auth_token="$GOOGLE_ID_TOKEN"

Via SQS directly (fast, no core/auth): re-enqueue the snapshot with attempt: 2 yourself — this is exactly the message core sends:

cd ../.local-aws
make send-sqs MESSAGE_BODY='{"taskId":"impl-e2e","taskType":"implementation","payload":{"repo":"my-org/my-repo","source":"coding.implementation","prompt":"add retries"},"metadata":{"createdAt":1,"trace_id":"impl-e2e"},"callbacks":{"github":{"repo":"my-org/my-repo","issueNumber":1}},"attempt":2}'

Either path lands an attempt: 2 running record. Confirm it dispatched:

redis-cli HGET alakai-executions:impl-e2e 2 | python3 -m json.tool   # -> attempt: 2, status: "running"

If field 2 never appears, core and the orchestrator are on different SQS queues — core's IMPLEMENTATION_SQS_QUEUE_URL must equal the orchestrator's SQS_QUEUE_URL — or on different Redis instances/prefixes.

Then simulate the worker succeeding:

curl -sS -X POST http://localhost:3001/task-complete \
  -H 'content-type: application/json' \
  --data '{"taskId":"impl-e2e","result":{"status":"success","pullRequestUrl":"https://github.com/x/y/pull/1"}}'

Confirm attempt 2 flipped to succeeded:

redis-cli HGET alakai-executions:impl-e2e 2 | python3 -m json.tool   # -> status: "succeeded"

Callback idempotency gotcha: once an attempt is succeeded (or failed), a later callback is a no-op — the store only resolves an attempt still running. So if you succeed attempt 2 and then send a failure callback, the record stays succeeded, and a subsequent retry returns 409 "cannot retry execution in status 'succeeded'" (the status gate), not "budget exhausted". To see the budget gate, don't succeed the attempt first — see Step 6.
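The "only resolve a running attempt" rule can be sketched as a pure function. This is an illustration under assumed types, not the execution store's real code; `resolveAttempt`, `AttemptState`, and `CallbackOutcome` are hypothetical names.

```typescript
type AttemptStatus = "running" | "succeeded" | "failed";

interface AttemptState {
  attempt: number;
  status: AttemptStatus;
  errorMessage?: string;
}

interface CallbackOutcome {
  status: "succeeded" | "failed";
  errorMessage?: string;
}

// A callback only takes effect while the attempt is still running;
// once the attempt is terminal, later callbacks leave it untouched.
function resolveAttempt(current: AttemptState, outcome: CallbackOutcome): AttemptState {
  if (current.status !== "running") {
    return current;
  }
  const next: AttemptState = { attempt: current.attempt, status: outcome.status };
  if (outcome.errorMessage !== undefined) {
    next.errorMessage = outcome.errorMessage;
  }
  return next;
}

const running: AttemptState = { attempt: 2, status: "running" };
const afterSuccess = resolveAttempt(running, { status: "succeeded" });
const afterLateFailure = resolveAttempt(afterSuccess, {
  status: "failed",
  errorMessage: "late failure",
});
console.log(afterSuccess.status, afterLateFailure.status); // succeeded succeeded
```

The late failure does not overwrite the success, which is why a retry at that point hits the status gate instead of the budget gate.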

6. (Optional) Prove the `MAX_RETRIES` budget end-to-end

With MAX_RETRIES=1, attempt 2 is the last allowed dispatch. Drive it on a fresh task and leave attempt 2 failed (don't succeed it). Capture the queue-task-id from the seed output:

cd ../core
make seed-failed-implementation repo='my-org/my-repo'
# -> prints: queue-task-id: <QUEUE_ID>   (use this below)

QUEUE_ID=<paste-queue-task-id-from-above>

make retry-worker-execution task_id="$QUEUE_ID" core_url=http://localhost:3000 dev_user='you@local'    # -> 202, attempt 2

# fail attempt 2 (do NOT succeed it first)
curl -sS -X POST http://localhost:3001/task-complete -H 'content-type: application/json' \
  --data "{\"taskId\":\"$QUEUE_ID\",\"result\":{\"status\":\"error\",\"errorMessage\":\"second failure\"}}"

make retry-worker-execution task_id="$QUEUE_ID" core_url=http://localhost:3000 dev_user='you@local'
# -> 409 { "error": "retry budget exhausted", "maxRetries": 1 }

For the consumer's defensive guard, re-enqueue past budget straight to SQS ("attempt":3) and watch the orchestrator log Dropping over-budget dispatch — no new record is created.

7. Verify the stream and dashboard hash

After the success callback, the orchestrator emits a success tracking event to the Redis stream. Confirm the stream event and the task-tracking hash agree:

# Last event on the tracking stream (should be status:success for <execution-id>)
redis-cli XREVRANGE task-tracking:events:v1 + - COUNT 1

# Dashboard hash for the execution row (should be status:success)
redis-cli HGETALL alakai-local:task-tracking:execution:<execution-id>

The taskName and actor in the stream event must match what make seed-failed-implementation printed — if they differ, the seed scripts are not aligned (see bug #2 this guide targets).

8. Tear down

# Delete execution records by queue-task-id; task-tracking keys by execution-id
redis-cli DEL alakai-executions:impl-e2e alakai-executions:<queue-task-id>
redis-cli DEL alakai-local:task-tracking:execution:<execution-id>
cd ../.local-aws && make reset && make down

How do I get a Google ID token?

core verifies the dashboard token with verifyIdToken({ audience: TASK_DASHBOARD_GOOGLE_CLIENT_IDS }), so it must be a real Google-signed ID token whose aud matches one of your configured client IDs, with a verified email. That rules out the usual shortcuts: gcloud auth print-identity-token and the OAuth Playground issue tokens with a different audience, so verifyIdToken rejects them.

The reliable way is to copy the token the dashboard already mints:

Open the Alakai dashboard in the browser and sign in with Google.
DevTools → Network → click any /dashboard/... request.
Copy the token from the Authorization: Bearer <token> header or the ?authToken=<token> query param (it's a JWT; expires ~1h).

For local testing, prefer the dev_user bypass (ENV=local only).

Endpoint reference

Core — dashboard API (external; authenticated with a dashboard Google ID token). Reads the execution record from Redis; retry re-enqueues to SQS:

Method & path	Purpose
`POST /dashboard/executions/:taskId/retry`	Re-enqueue the latest failed execution to SQS with the next attempt. Enforces `MAX_RETRIES`.

Retry responses: 202 queued · 401 unauthenticated · 404 unknown task · 409 not in failed state or budget exhausted · 502 Redis read failed or enqueue failed.
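The 202/404/409 cases can be read as an ordered decision, sketched below. This is a simplified model, not core's handler: the `decideRetry` name and types are hypothetical, "latest" is assumed to mean the highest attempt number in the hash, and the 401 (auth) and 502 (Redis or SQS failure) paths are left out because they depend on I/O. The status gate is checked before the budget gate, matching the behavior described in Level 2, Step 5.

```typescript
type RecordStatus = "running" | "succeeded" | "failed";

interface StoredAttempt {
  attempt: number;
  status: RecordStatus;
}

type RetryDecision =
  | { http: 202; body: { taskId: string; attempt: number; status: "queued" } }
  | { http: 404; body: { error: string } }
  | { http: 409; body: { error: string; maxRetries?: number } };

function decideRetry(taskId: string, attempts: StoredAttempt[], maxRetries: number): RetryDecision {
  // No record for this taskId.
  if (attempts.length === 0) {
    return { http: 404, body: { error: "execution not found" } };
  }
  // Assumption: the latest attempt is the one with the highest number.
  const latest = attempts.reduce((a, b) => (b.attempt > a.attempt ? b : a));
  // Status gate first: only failed executions are retryable.
  if (latest.status !== "failed") {
    return { http: 409, body: { error: `cannot retry execution in status '${latest.status}'` } };
  }
  // Budget gate second.
  if (latest.attempt > maxRetries) {
    return { http: 409, body: { error: "retry budget exhausted", maxRetries } };
  }
  return { http: 202, body: { taskId, attempt: latest.attempt + 1, status: "queued" } };
}

console.log(decideRetry("impl-e2e", [{ attempt: 1, status: "failed" }], 1));
console.log(
  decideRetry("impl-e2e", [{ attempt: 1, status: "failed" }, { attempt: 2, status: "succeeded" }], 1),
);
```

The first call yields a 202 with `attempt: 2`; the second yields the 409 status-gate error even though attempt 2 is also over budget.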

There is no orchestrator HTTP API for this feature. Inspect records with redis-cli against the <prefix>:<taskId> hash.

Troubleshooting

Symptom	Cause	Fix
Retry returns `404 "Route POST:/dashboard/executions/… not found"`	You hit the orchestrator (`3001`) instead of `core`, or the route isn't registered	Point `core_url` at core (`3000`). `retry-worker-execution` defaults to `3000`; if you ran the orchestrator on `3001` make sure `core` is also up. If still 404, check core's startup log for `Dashboard retry route enabled` — if it says `disabled`, configure dashboard auth and restart
Retry returns `404 "execution not found"` (JSON `{error}`)	No record for that `taskId`, or core and orchestrator use different Redis/prefix	Seed one (Level 1) or enqueue + fail one (Level 2); verify both services share `TASK_TRACKING_REDIS_URL`/`REDIS_URL` and `EXECUTIONS_REDIS_KEY_PREFIX`
`retry-worker-execution` returns `401 dashboard_unauthorized`	Missing/invalid dashboard auth	With core on `ENV=local`, pass `dev_user=<email>`; otherwise a valid `auth_token=<google-id-token>`; or use the SQS re-enqueue path
Retry returns `409 "cannot retry execution in status 'succeeded'"`	You succeeded the attempt; a later failure callback was a no-op (idempotency)	Expected. To test the budget gate, fail the attempt without succeeding it first (Level 2, Step 6)
Code change has no effect	Built `dist` still serving old code (incl. stale `@alakai/execution-store`)	Restart core/orchestrator; rebuild the shared package after editing it (`yarn workspace @alakai/execution-store build`)
Retry returns `409 "cannot retry … 'running'"`	Latest attempt hasn't failed yet	Only `failed` executions are retryable; report a failure callback first
Retry returns `409 "retry budget exhausted"`	`attempt > MAX_RETRIES`	Expected once the cap is hit; raise `MAX_RETRIES` (on both services) to allow more
`retry-worker-execution` returns `502`	`core` can't reach Redis or the SQS enqueue failed	Check `TASK_TRACKING_REDIS_URL`, that Redis is up, and SQS is reachable
Re-enqueued retry never dispatches	`attempt > MAX_RETRIES + 1`, so the consumer drops it	Expected budget guard; send a smaller `attempt` or raise `MAX_RETRIES`
`POST /task-complete` returns `400 FST_ERR_CTP_EMPTY_JSON_BODY`	JSON content-type with empty body	Send a JSON body (the Makefile target already sends `{}` for retry)
Successful run shows up as `failed`	Worker payload shape not normalized	Ensure you're on the build with `normalizeOutcome` in `callbackRoutes.ts`
`aws` calls hang or fail (Level 2)	LocalStack not running / wrong endpoint	`cd .local-aws && make up && make infra`; always pass `--endpoint-url=http://localhost:4566`

Notes

A worker that dies without calling back (ECS crash/OOM, or all callback retries fail) is reconciled by the orchestrator's ECS reconciliation loop, which marks the orphaned running record failed with error_code='WORKER_LOST' so it becomes retryable. This loop only runs when ENV != local (it needs ECS), so it can't be exercised in this local setup.
Redis writes on the callback are best-effort — a Redis outage at callback time is logged but never breaks the callback or the notification path.

Related docs

../architecture/execution-retry.md — the full retry architecture
../architecture/background-tasks.md — how tasks are queued and dispatched
../architecture/worker-pipeline.md — the worker dispatch/callback pipeline
