Execution Tracing Guide
This guide explains how to reconstruct the full history of any Alakai execution using
structured logs in AWS CloudWatch Logs Insights. Every discrete unit of work emits one
named summary event at the end of its slice, carrying a trace_id that joins logs
across components.
How it works
Each execution flows through up to three services — core, orchestrator, and
worker — running as separate ECS tasks. Each service emits its own summary event with
the same trace_id and a different context.component value. To reconstruct the full
history of a request, filter by trace_id across all log groups.
GitHub webhook / Slack command
→ core emits event (e.g. slack.implementation, component=core)
→ SQS (carries trace_id in metadata)
→ orchestrator emits event (component=orchestrator)
→ ECS task (carries trace_id in Redis payload)
→ worker emits event (component=worker)
Sources of trace_id:
- GitHub webhooks: the x-github-delivery header value (a UUID set by GitHub).
- Slack commands: a UUIDv4 generated when the request enters the handler.
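The two sources amount to a simple fallback rule. The following is a minimal Python sketch of that rule, assuming request headers arrive as a dict. The function name resolve_trace_id is hypothetical and is not taken from the Alakai codebase.

```python
import uuid


def resolve_trace_id(headers: dict[str, str]) -> str:
    """Return the trace_id for an incoming request (illustrative only)."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    delivery_id = normalized.get("x-github-delivery")
    if delivery_id:
        # GitHub webhook: reuse the delivery UUID set by GitHub.
        return delivery_id
    # Slack command (or any request without the header): generate a UUIDv4.
    return str(uuid.uuid4())
```

Reusing the GitHub delivery ID rather than minting a new one is what lets a webhook delivery be matched to its logs later in the runbook.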
Log groups
| Service | Log group |
|---|---|
| core | /ecs/alakai-core |
| orchestrator | /ecs/alakai-orchestrator |
| worker | /ecs/alakai-worker-implementation |
For cross-component queries, select all three log groups in the CloudWatch Logs Insights
console, or use the AWS CLI with --log-group-names.
Event JSON structure
Every summary event is a single JSON line with this shape:
{
  "time": "2026-05-05T12:00:00.000Z",
  "level": 30,
  "service": "alakai-core",
  "component": "core",
  "trace_id": "a1b2c3d4-...",
  "event": "slack.implementation",
  "outcome": "success",
  "final_status": "completed",
  "duration_ms": 1234,
  "context": { "trace_id": "a1b2c3d4-...", "component": "core" },
  "steps": {
    "verify_signature": "ok",
    "parse_command": "ok",
    "validate_args": "ok",
    "enqueue": "ok"
  },
  "error": null,
  "repo": { "full_name": "org/repo", "base_ref": "main" },
  "task": { "task_id": "...", "task_type": "...", "source": "slack_implement" },
  "msg": "[event] slack.implementation -> success"
}
Key fields at a glance:
| Field | Type | Description |
|---|---|---|
| trace_id | string | Joins all slices of a single execution |
| event | string | Event name (see the event catalog below) |
| outcome | success \| error \| skipped | Slice outcome |
| final_status | string | Domain-specific result code |
| duration_ms | number | Slice wall-clock time (ms) |
| context.component | core \| orchestrator \| worker | Emitting service |
| steps | object | Map of step name → ok \| error \| - |
| error.type | string | Exception class name (when outcome=error) |
| error.message | string | Sanitized error message |
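To show how these fields combine during triage, here is a small Python sketch that parses one summary-event line and reduces it to a one-line status. The function name summarize_event and the output format are assumptions made for illustration; only the field names come from the event structure.

```python
import json


def summarize_event(line: str) -> str:
    """Condense one summary-event JSON line into a single triage string."""
    event = json.loads(line)
    context = event.get("context") or {}
    component = context.get("component", "unknown")
    # Steps whose value is "error" point at where the slice failed.
    failed = [name for name, status in (event.get("steps") or {}).items()
              if status == "error"]
    summary = (f'{component} {event["event"]} {event["outcome"]} '
               f'({event.get("final_status")}, {event.get("duration_ms")} ms)')
    error = event.get("error")
    if event["outcome"] == "error" and error:
        summary += f' {error.get("type")}: {error.get("message")}'
    if failed:
        summary += f' failed_steps={",".join(failed)}'
    return summary
```

Applied to a log line exported from CloudWatch, this gives the emitting component, outcome, and failing step without reading the full JSON.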
Event catalog
| Event name | Emitters | Trigger |
|---|---|---|
| coding.prompt | core | GitHub issue opened or docs/prompts/* PR merged |
| coding.implementation | core, orchestrator, worker | docs/prompts/* PR merged → async implementation flow |
| slack.prompt | core | Slack /prompt slash command |
| slack.implementation | core, orchestrator, worker | Slack /implement slash command |
| slack.clickup.ask | core | Slack /clickup-ask slash command |
| slack.clickup.task | core | Slack slash command to create/update a ClickUp task |
Runbook: trace a full execution
Step 1 — Get the trace_id
- From a GitHub webhook: copy the x-github-delivery header from the GitHub webhook delivery page (Settings → Webhooks → Recent deliveries).
- From a Slack command: find a log line in /ecs/alakai-core that contains the user's Slack user_id or channel and copy the trace_id field.
- From a CloudWatch error alert: the alert payload contains trace_id if the log line is a summary event.
Step 2 — Run the cross-component query
Select all three log groups and run:
fields @timestamp, context.component, event, outcome, final_status, duration_ms,
steps, error.type, error.message, repo.full_name, task.task_id, pull_request.url
| filter trace_id = "REPLACE_WITH_TRACE_ID"
| sort @timestamp asc
This returns one row per service slice. Read them top-to-bottom to follow the execution:
- core row — did the request parse and enqueue correctly?
- orchestrator row — did the ECS task launch successfully?
- worker row — did the implementation agent complete? What was the PR URL?
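The top-to-bottom reading can also be done programmatically. The sketch below assumes each result row has already been converted to a dict keyed by the query's field names (including the literal key "context.component") and returns the first slice in pipeline order whose outcome is error. The helper name first_failed_slice is hypothetical.

```python
from typing import Optional

# Order in which the services handle an execution.
PIPELINE = ["core", "orchestrator", "worker"]


def first_failed_slice(rows: list[dict]) -> Optional[dict]:
    """Return the earliest pipeline slice with outcome == "error", if any."""
    by_component = {row["context.component"]: row for row in rows}
    for component in PIPELINE:
        row = by_component.get(component)
        if row is not None and row.get("outcome") == "error":
            return row
    return None
```

A missing row is deliberately not treated as a failure here, because some events (for example slack.prompt) are emitted by core only.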
Step 3 — Drill into step-level logs
To see all log lines (not just summary events) for the same execution:
fields @timestamp, context.component, level, msg
| filter trace_id = "REPLACE_WITH_TRACE_ID"
| sort @timestamp asc
| limit 500
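The runbook queries differ only in their field list, so the query text can be generated rather than pasted and edited by hand. The following is a minimal sketch, assuming trace_id values are always UUIDs as described under "Sources of trace_id". The helper build_trace_query and its parameters are hypothetical.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID layout.
UUID_RE = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def build_trace_query(trace_id: str, fields: list[str], limit: int = 500) -> str:
    """Build a Logs Insights query that follows one trace across log groups."""
    # Reject anything that is not a UUID; this also keeps quotes and pipes
    # out of the generated query string.
    if not UUID_RE.fullmatch(trace_id):
        raise ValueError(f"not a UUID: {trace_id!r}")
    return (
        f"fields {', '.join(fields)}\n"
        f'| filter trace_id = "{trace_id}"\n'
        "| sort @timestamp asc\n"
        f"| limit {limit}"
    )
```

Validating the identifier up front catches a truncated or mis-copied trace_id before the query runs and returns an empty result.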
Common queries
Recent errors across all events
fields @timestamp, context.component, event, outcome, final_status,
error.type, error.message, trace_id
| filter outcome = "error"
| sort @timestamp desc
| limit 100
Errors for a specific event type
fields @timestamp, context.component, final_status, error.type, error.message,
duration_ms, trace_id
| filter event = "coding.implementation" and outcome = "error"
| sort @timestamp desc
| limit 50
Slow executions (worker slice > 5 minutes)
fields @timestamp, trace_id, event, final_status, duration_ms,
agent.provider, agent.model, repo.full_name
| filter context.component = "worker" and duration_ms > 300000
| sort duration_ms desc
| limit 50
All implementations for a given repository
fields @timestamp, trace_id, event, outcome, final_status, duration_ms,
pull_request.url, pull_request.number
| filter (event = "coding.implementation" or event = "slack.implementation")
and context.component = "worker"
and repo.full_name = "REPLACE_WITH_ORG/REPO"
| sort @timestamp desc
| limit 100
Slack command failures by user
fields @timestamp, event, final_status, error.message, trace_id
| filter (event like /^slack\./) and outcome = "error"
and slack.user_id = "REPLACE_WITH_SLACK_USER_ID"
| sort @timestamp desc
| limit 50
Orchestrator slot contention
fields @timestamp, trace_id, event, final_status
| filter context.component = "orchestrator"
and final_status = "slot_unavailable"
| sort @timestamp desc
| limit 50
p90 / p99 duration per event and component
filter event like /./
| stats
pct(duration_ms, 90) as p90_ms,
pct(duration_ms, 99) as p99_ms,
count() as total
by event, context.component
| sort p99_ms desc
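For readers unfamiliar with percentile aggregates, the sketch below mimics the grouping of this query in plain Python, using the nearest-rank definition of a percentile. It illustrates the concept only: the exact method CloudWatch applies for pct() may differ, and the helper names are hypothetical.

```python
from collections import defaultdict


def nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def duration_percentiles(events: list[dict]) -> dict:
    """Group durations by (event, component) and compute p90/p99 per group."""
    groups = defaultdict(list)
    for e in events:
        groups[(e["event"], e["context"]["component"])].append(e["duration_ms"])
    return {
        key: {"p90_ms": nearest_rank(d, 90),
              "p99_ms": nearest_rank(d, 99),
              "total": len(d)}
        for key, d in groups.items()
    }
```

With only a handful of samples in a group, p99 collapses to the maximum, so the total column is worth checking before acting on a high p99.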
Using the AWS CLI
TRACE_ID="REPLACE_WITH_TRACE_ID"
START=$(date -u -v-1H +%s) # macOS; use date -d '1 hour ago' +%s on Linux
END=$(date -u +%s)
QUERY_ID=$(aws logs start-query \
--log-group-names \
"/ecs/alakai-core" \
"/ecs/alakai-orchestrator" \
"/ecs/alakai-worker-implementation" \
--start-time "$START" \
--end-time "$END" \
--query-string "fields @timestamp, context.component, event, outcome, final_status, duration_ms, error.type, error.message | filter trace_id = \"$TRACE_ID\" | sort @timestamp asc" \
--query 'queryId' \
--output text)
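# start-query is asynchronous, so wait until the query leaves the Scheduled/Running state
while :; do
  STATUS=$(aws logs get-query-results --query-id "$QUERY_ID" --query 'status' --output text)
  [ "$STATUS" = "Scheduled" ] || [ "$STATUS" = "Running" ] || break
  sleep 1
done
aws logs get-query-results --query-id "$QUERY_ID"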
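The same flow can be scripted in Python. The sketch below assumes a boto3 CloudWatch Logs client (created with boto3.client("logs")) is passed in. start_query and get_query_results are the boto3 counterparts of the CLI commands, while run_trace_query and its parameters are hypothetical names. It polls because Logs Insights queries run asynchronously and may still be in progress when first checked.

```python
import time


def run_trace_query(client, trace_id: str, log_groups: list[str],
                    lookback_s: int = 3600, poll_s: float = 1.0,
                    timeout_s: float = 60.0) -> list[dict]:
    """Run a trace_id query and return each result row as a flat dict."""
    now = int(time.time())
    query_id = client.start_query(
        logGroupNames=log_groups,
        startTime=now - lookback_s,
        endTime=now,
        queryString=(
            "fields @timestamp, context.component, event, outcome, final_status "
            f'| filter trace_id = "{trace_id}" | sort @timestamp asc'
        ),
    )["queryId"]

    deadline = time.monotonic() + timeout_s
    while True:
        response = client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"query {query_id} still running after {timeout_s}s")
        time.sleep(poll_s)

    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    # Each row is a list of {"field": ..., "value": ...} pairs.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]
```

Passing the client in as an argument keeps the function free of any import-time dependency on boto3 and lets it be exercised with a stub object in tests.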
Tips
- Always filter by trace_id first when debugging a specific request.
- Summary events always log at level info (30). To see only summary events, add | filter msg like /^\[event\]/ to the query.
- final_status is the fastest triage field: it maps directly to the failure mode and narrows the search to the right component.
- To link a GitHub webhook delivery to its implementation outcome, filter by trace_id = "<x-github-delivery value>"; the same UUID is used end-to-end.
See also CloudWatch Helper for ready-to-paste queries focused on webhook debugging.