Execution Tracing Guide

This guide explains how to reconstruct the full history of any Alakai execution using structured logs in AWS CloudWatch Logs Insights. Every discrete unit of work emits one named summary event at the end of its slice, carrying a trace_id that joins logs across components.

How it works

Each execution flows through up to three services — core, orchestrator, and worker — running as separate ECS tasks. Each service emits its own summary event with the same trace_id and a different context.component value. To reconstruct the full history of a request, filter by trace_id across all log groups.

GitHub webhook / Slack command
  → core        emits event (e.g. slack.implementation, component=core)
      → SQS (carries trace_id in metadata)
        → orchestrator  emits event (component=orchestrator)
            → ECS task (carries trace_id in Redis payload)
              → worker  emits event (component=worker)
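
The join described above can also be reproduced offline on exported log lines. The following is a minimal sketch, assuming each event has already been parsed into a dict with the `time` and `trace_id` fields shown in this guide; the helper names `parse_time` and `timeline` are illustrative and not part of Alakai.

```python
from datetime import datetime
from typing import Iterable, List


def parse_time(value: str) -> datetime:
    # Older Python versions reject a trailing "Z" in fromisoformat,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def timeline(events: Iterable[dict], trace_id: str) -> List[dict]:
    """Return the events of one execution, oldest first."""
    matching = [e for e in events if e.get("trace_id") == trace_id]
    return sorted(matching, key=lambda e: parse_time(e["time"]))
```

Events from other executions are dropped, and the remaining slices come back in the order they were emitted, regardless of which log group they were read from.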

Sources of trace_id:

GitHub webhooks: the x-github-delivery header value (a UUID set by GitHub).
Slack commands: a UUIDv4 generated when the request enters the handler.
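
The two rules above could be expressed as a single helper. This is only a sketch of the idea, not Alakai's actual handler code: the function name `resolve_trace_id` and the plain header mapping are assumptions made for illustration.

```python
import uuid
from typing import Mapping


def resolve_trace_id(headers: Mapping[str, str]) -> str:
    """Use the GitHub delivery ID when present, otherwise mint a UUIDv4."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    delivery_id = lowered.get("x-github-delivery", "").strip()
    if delivery_id:
        return delivery_id
    # No GitHub header (for example a Slack command): generate a fresh ID.
    return str(uuid.uuid4())
```

Reusing the delivery ID means the value shown on the GitHub webhook delivery page can be pasted straight into a query.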

Log groups

| Service | Log group |
| --- | --- |
| core | `/ecs/alakai-core` |
| orchestrator | `/ecs/alakai-orchestrator` |
| worker | `/ecs/alakai-worker-implementation` |

For cross-component queries, select all three log groups in the CloudWatch Logs Insights console, or use the AWS CLI with --log-group-names.

Event JSON structure

Every summary event is a single JSON line with this shape:

{
  "time": "2026-05-05T12:00:00.000Z",
  "level": 30,
  "service": "alakai-core",
  "component": "core",
  "trace_id": "a1b2c3d4-...",
  "event": "slack.implementation",
  "outcome": "success",
  "final_status": "completed",
  "duration_ms": 1234,
  "context": { "trace_id": "a1b2c3d4-...", "component": "core" },
  "steps": {
    "verify_signature": "ok",
    "parse_command": "ok",
    "validate_args": "ok",
    "enqueue": "ok"
  },
  "error": null,
  "repo": { "full_name": "org/repo", "base_ref": "main" },
  "task": { "task_id": "...", "task_type": "...", "source": "slack_implement" },
  "msg": "[event] slack.implementation -> success"
}

Key fields at a glance:

| Field | Type | Description |
| --- | --- | --- |
| `trace_id` | string | Joins all slices of a single execution |
| `event` | string | Event name (see the Event catalog section) |
| `outcome` | `success \| error \| skipped` | Slice outcome |
| `final_status` | string | Domain-specific result code |
| `duration_ms` | number | Slice wall-clock time in milliseconds |
| `context.component` | `core \| orchestrator \| worker` | Emitting service |
| `steps` | object | Map of step name → `ok \| error \| -` |
| `error.type` | string | Exception class name (when `outcome=error`) |
| `error.message` | string | Sanitized error message |
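
When working with exported log lines rather than the console, the key fields can be pulled into a small typed record. The sketch below assumes the JSON shape shown in this section; the `SummaryEvent` class and `parse_summary_event` function are hypothetical names, not part of the Alakai codebase.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SummaryEvent:
    trace_id: str
    event: str
    component: str
    outcome: str
    final_status: str
    duration_ms: int
    error_type: Optional[str]
    error_message: Optional[str]


def parse_summary_event(line: str) -> SummaryEvent:
    """Parse one JSON log line into the fields used for triage."""
    raw = json.loads(line)
    # "error" is null on success, so fall back to an empty mapping.
    error = raw.get("error") or {}
    return SummaryEvent(
        trace_id=raw["trace_id"],
        event=raw["event"],
        component=raw["context"]["component"],
        outcome=raw["outcome"],
        final_status=raw["final_status"],
        duration_ms=raw["duration_ms"],
        error_type=error.get("type"),
        error_message=error.get("message"),
    )
```

A line that is missing one of the required fields raises `KeyError`, which is a cheap way to notice that a non-summary log line was passed in.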

Event catalog

| Event name | Emitters | Trigger |
| --- | --- | --- |
| `coding.prompt` | core | GitHub issue opened or `docs/prompts/*` PR merged |
| `coding.implementation` | core, orchestrator, worker | `docs/prompts/*` PR merged → async implementation flow |
| `slack.prompt` | core | Slack `/prompt` slash command |
| `slack.implementation` | core, orchestrator, worker | Slack `/implement` slash command |
| `slack.clickup.ask` | core | Slack `/clickup-ask` slash command |
| `slack.clickup.task` | core | Slack slash command to create/update a ClickUp task |
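
The Emitters column tells you how many summary events to expect for a given trace. As an illustration, the catalog can be encoded as a lookup that reports which slices have not shown up yet. The names `EXPECTED_EMITTERS` and `missing_components` are assumptions for this sketch only.

```python
from typing import Dict, Iterable, List, Tuple

# Emitters per event, copied from the catalog in this section.
EXPECTED_EMITTERS: Dict[str, Tuple[str, ...]] = {
    "coding.prompt": ("core",),
    "coding.implementation": ("core", "orchestrator", "worker"),
    "slack.prompt": ("core",),
    "slack.implementation": ("core", "orchestrator", "worker"),
    "slack.clickup.ask": ("core",),
    "slack.clickup.task": ("core",),
}


def missing_components(event: str, seen: Iterable[str]) -> List[str]:
    """List the components that have not emitted a summary event yet."""
    seen_set = set(seen)
    return [c for c in EXPECTED_EMITTERS[event] if c not in seen_set]
```

A missing slice is not always a bug: if an earlier slice ended with `outcome=error`, the later components may never have been invoked.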

Runbook: trace a full execution

Step 1 — Get the `trace_id`

From a GitHub webhook: copy the x-github-delivery header from the GitHub webhook delivery page (Settings → Webhooks → Recent deliveries).
From a Slack command: find a log line in /ecs/alakai-core that contains the user's Slack user_id or channel and copy the trace_id field.
From a CloudWatch error alert: the alert payload contains trace_id if the log line is a summary event.

Step 2 — Run the cross-component query

Select all three log groups and run:

fields @timestamp, context.component, event, outcome, final_status, duration_ms,
       steps, error.type, error.message, repo.full_name, task.task_id, pull_request.url
| filter trace_id = "REPLACE_WITH_TRACE_ID" and msg like /^\[event\]/
| sort @timestamp asc

This returns one row per service slice. Read them top-to-bottom to follow the execution:

core row — did the request parse and enqueue correctly?
orchestrator row — did the ECS task launch successfully?
worker row — did the implementation agent complete? What was the PR URL?
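
The top-to-bottom reading can be automated for the three-slice events (`coding.implementation` and `slack.implementation`). The sketch below assumes each result row has been converted to a dict keyed by the query's column names, such as `context.component`; the function name `first_unsuccessful_slice` is hypothetical.

```python
from typing import Iterable, Optional

# Order in which the slices run, following the flow described in this guide.
PIPELINE = ("core", "orchestrator", "worker")


def first_unsuccessful_slice(rows: Iterable[dict]) -> Optional[str]:
    """Describe where the execution stopped, or return None if every slice succeeded."""
    by_component = {row["context.component"]: row for row in rows}
    for component in PIPELINE:
        row = by_component.get(component)
        if row is None:
            return f"{component}: no summary event found"
        if row["outcome"] != "success":
            return f"{component}: {row['outcome']} ({row['final_status']})"
    return None
```

For core-only events such as `slack.prompt`, the orchestrator and worker rows are expected to be absent, so this check only makes sense for the asynchronous implementation flows.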

Step 3 — Drill into step-level logs

To see all log lines (not just summary events) for the same execution:

fields @timestamp, context.component, level, msg
| filter trace_id = "REPLACE_WITH_TRACE_ID"
| sort @timestamp asc
| limit 500
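
Pasting a trace ID into a query by hand is error-prone, so a small builder can help when scripting the runbook. This is a sketch under two assumptions: every `trace_id` is a UUID (both documented sources produce UUIDs), and summary events are recognised by the `[event]` message prefix mentioned in the Tips section. The function `trace_query` and the constants are illustrative names.

```python
import re

SUMMARY_FIELDS = (
    "@timestamp, context.component, event, outcome, final_status, duration_ms, "
    "steps, error.type, error.message, repo.full_name, task.task_id"
)
STEP_FIELDS = "@timestamp, context.component, level, msg"
# Raw string so the regex backslashes reach CloudWatch unchanged.
SUMMARY_FILTER = r" and msg like /^\[event\]/"

_UUID = re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def trace_query(trace_id: str, summary_only: bool = True) -> str:
    """Build the Step 2 (summary) or Step 3 (all lines) query for one trace."""
    # Rejecting non-UUID input also keeps quotes and pipes out of the query.
    if not _UUID.fullmatch(trace_id):
        raise ValueError(f"not a UUID: {trace_id!r}")
    if summary_only:
        fields, extra, tail = SUMMARY_FIELDS, SUMMARY_FILTER, ""
    else:
        fields, extra, tail = STEP_FIELDS, "", " | limit 500"
    return (
        f'fields {fields} | filter trace_id = "{trace_id}"{extra}'
        f" | sort @timestamp asc{tail}"
    )
```

Calling `trace_query(trace_id, summary_only=False)` yields the drill-down query from Step 3, including its 500-row limit.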

Common queries

Recent errors across all events

fields @timestamp, context.component, event, outcome, final_status,
       error.type, error.message, trace_id
| filter outcome = "error"
| sort @timestamp desc
| limit 100

Errors for a specific event type

fields @timestamp, context.component, final_status, error.type, error.message,
       duration_ms, trace_id
| filter event = "coding.implementation" and outcome = "error"
| sort @timestamp desc
| limit 50

Slow executions (worker slice > 5 minutes)

fields @timestamp, trace_id, event, final_status, duration_ms,
       agent.provider, agent.model, repo.full_name
| filter context.component = "worker" and duration_ms > 300000
| sort duration_ms desc
| limit 50

All implementations for a given repository

fields @timestamp, trace_id, event, outcome, final_status, duration_ms,
       pull_request.url, pull_request.number
| filter (event = "coding.implementation" or event = "slack.implementation")
       and context.component = "worker"
       and repo.full_name = "REPLACE_WITH_ORG/REPO"
| sort @timestamp desc
| limit 100

Slack command failures by user

fields @timestamp, event, final_status, error.message, trace_id
| filter (event like /^slack\./) and outcome = "error"
       and slack.user_id = "REPLACE_WITH_SLACK_USER_ID"
| sort @timestamp desc
| limit 50

Orchestrator slot contention

fields @timestamp, trace_id, event, final_status
| filter context.component = "orchestrator"
       and final_status = "slot_unavailable"
| sort @timestamp desc
| limit 50

p90 / p99 duration per event and component

filter event like /./
| stats
    pct(duration_ms, 90) as p90_ms,
    pct(duration_ms, 99) as p99_ms,
    count() as total
  by event, context.component
| sort p99_ms desc
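
To sanity-check these numbers against a local export, the same grouping can be computed in a few lines. The sketch below uses the nearest-rank percentile definition, which is an assumption: CloudWatch may estimate percentiles differently, so small discrepancies are expected. The function names are illustrative.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def nearest_rank(sorted_values: List[float], percent: float) -> float:
    """Smallest value with at least `percent` percent of samples at or below it."""
    rank = max(1, math.ceil(percent * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def duration_percentiles(events: Iterable[dict]) -> Dict[Tuple[str, str], dict]:
    """Group by (event, component) and compute p90, p99 and the sample count."""
    groups: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for e in events:
        groups[(e["event"], e["context"]["component"])].append(e["duration_ms"])
    stats = {}
    for key, durations in groups.items():
        durations.sort()
        stats[key] = {
            "p90_ms": nearest_rank(durations, 90),
            "p99_ms": nearest_rank(durations, 99),
            "total": len(durations),
        }
    return stats
```

With few samples per group, p99 collapses to the maximum, so treat the tail figures with caution for rarely triggered events.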

Using the AWS CLI

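TRACE_ID="REPLACE_WITH_TRACE_ID"
# Look back one hour. BSD/macOS syntax; on Linux use: date -u -d '1 hour ago' +%s
START=$(date -u -v-1H +%s)
END=$(date -u +%s)

QUERY_ID=$(aws logs start-query \
  --log-group-names \
      "/ecs/alakai-core" \
      "/ecs/alakai-orchestrator" \
      "/ecs/alakai-worker-implementation" \
  --start-time "$START" \
  --end-time "$END" \
  --query-string "fields @timestamp, context.component, event, outcome, final_status, duration_ms, error.type, error.message | filter trace_id = \"$TRACE_ID\" | sort @timestamp asc" \
  --query 'queryId' \
  --output text)

# Queries run asynchronously: wait until the status leaves Scheduled/Running,
# otherwise get-query-results may return an empty or partial result set.
while :; do
  STATUS=$(aws logs get-query-results --query-id "$QUERY_ID" --query 'status' --output text)
  [ "$STATUS" = "Scheduled" ] || [ "$STATUS" = "Running" ] || break
  sleep 1
done

aws logs get-query-results --query-id "$QUERY_ID"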
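
The same flow can be scripted in Python. Logs Insights queries run asynchronously, so the results call has to be repeated until the status is no longer `Scheduled` or `Running`. The sketch below assumes a boto3 CloudWatch Logs client is passed in (for example `boto3.client("logs")`); the function name `run_query` and the polling defaults are illustrative choices.

```python
import time
from typing import Any, Dict, List

LOG_GROUPS = [
    "/ecs/alakai-core",
    "/ecs/alakai-orchestrator",
    "/ecs/alakai-worker-implementation",
]


def run_query(client: Any, query: str, start: int, end: int,
              poll_seconds: float = 1.0, max_polls: int = 60) -> List[Dict[str, str]]:
    """Start a Logs Insights query, wait for it, and return rows as dicts."""
    query_id = client.start_query(
        logGroupNames=LOG_GROUPS,
        startTime=start,   # epoch seconds
        endTime=end,
        queryString=query,
    )["queryId"]
    for _ in range(max_polls):
        response = client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)
    else:
        # The loop ended without a break: the query never finished.
        raise TimeoutError(f"query {query_id} still running after {max_polls} polls")
    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")
    # Each row is a list of {"field": ..., "value": ...} cells; flatten to a dict.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, without real AWS credentials.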

Tips

Always filter by trace_id first when debugging a specific request.
Summary events always log at level info (30). To see only summary events, add | filter msg like /^\[event\]/.
final_status is the fastest triage field: it maps directly to the failure mode and narrows the search to the right component.
To link a GitHub webhook delivery to its implementation outcome, filter by trace_id = "<x-github-delivery value>" — the same UUID is used end-to-end.
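
The summary-event filter from the tips above can be mirrored when post-processing exported log lines. A minimal sketch, assuming each line has been parsed into a dict with a `msg` field; `is_summary_event` is a hypothetical helper name.

```python
import re

# Same pattern as the Logs Insights filter: msg starts with "[event]".
SUMMARY_MSG = re.compile(r"^\[event\]")


def is_summary_event(record: dict) -> bool:
    """True when the record's msg carries the summary-event prefix."""
    return bool(SUMMARY_MSG.match(record.get("msg", "")))
```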

See also CloudWatch Helper for ready-to-paste queries focused on webhook debugging.
