
Execution Tracing Guide

This guide explains how to reconstruct the full history of any Alakai execution using structured logs in AWS CloudWatch Logs Insights. Every discrete unit of work emits one named summary event at the end of its slice, carrying a trace_id that joins logs across components.


How it works

Each execution flows through up to three services — core, orchestrator, and worker — running as separate ECS tasks. Each service emits its own summary event with the same trace_id and a different context.component value. To reconstruct the full history of a request, filter by trace_id across all log groups.

GitHub webhook / Slack command
→ core emits event (e.g. slack.implementation, component=core)
→ SQS (carries trace_id in metadata)
→ orchestrator emits event (component=orchestrator)
→ ECS task (carries trace_id in Redis payload)
→ worker emits event (component=worker)

Sources of trace_id:

  • GitHub webhooks: the x-github-delivery header value (a UUID set by GitHub).
  • Slack commands: a UUIDv4 generated when the request enters the handler.
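Since both sources yield UUID strings, a pasted trace_id can be sanity-checked before it goes into a query. The following is a minimal sketch, not part of Alakai: is_trace_id is a hypothetical helper name, and it assumes the value uses the canonical 8-4-4-4-12 hexadecimal layout.

```shell
# Hypothetical helper (not part of Alakai): succeeds when $1 looks like a
# canonical UUID, e.g. a copied x-github-delivery value.
is_trace_id() {
  printf '%s\n' "$1" |
    grep -Eqi '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}
```

Example use: is_trace_id "$TRACE_ID" || echo "not a UUID: $TRACE_ID" >&2. This catches a leftover placeholder or a truncated copy before a query silently returns no rows.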

Log groups

Service        Log group
core           /ecs/alakai-core
orchestrator   /ecs/alakai-orchestrator
worker         /ecs/alakai-worker-implementation

For cross-component queries, select all three log groups in the CloudWatch Logs Insights console, or use the AWS CLI with --log-group-names.


Event JSON structure

Every summary event is a single JSON line with this shape:

{
  "time": "2026-05-05T12:00:00.000Z",
  "level": 30,
  "service": "alakai-core",
  "component": "core",
  "trace_id": "a1b2c3d4-...",
  "event": "slack.implementation",
  "outcome": "success",
  "final_status": "completed",
  "duration_ms": 1234,
  "context": { "trace_id": "a1b2c3d4-...", "component": "core" },
  "steps": {
    "verify_signature": "ok",
    "parse_command": "ok",
    "validate_args": "ok",
    "enqueue": "ok"
  },
  "error": null,
  "repo": { "full_name": "org/repo", "base_ref": "main" },
  "task": { "task_id": "...", "task_type": "...", "source": "slack_implement" },
  "msg": "[event] slack.implementation -> success"
}

Key fields at a glance:

Field               Type                           Description
trace_id            string                         Joins all slices of a single execution
event               string                         Event name (see the event catalog below)
outcome             success | error | skipped      Slice outcome
final_status        string                         Domain-specific result code
duration_ms         number                         Slice wall-clock time in milliseconds
context.component   core | orchestrator | worker   Emitting service
steps               object                         Map of step name → ok | error | -
error.type          string                         Exception class name (when outcome=error)
error.message       string                         Sanitized error message
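When a summary event has been copied out of CloudWatch as a single JSON line, the steps map can be reduced to just the failing step names. This is a rough sketch that assumes jq is installed locally; failed_steps is a hypothetical helper name, and the sample input in the usage note is made up to match the shape shown above.

```shell
# Hypothetical helper: reads summary-event JSON lines on stdin and prints the
# names of steps whose value is "error", for events with outcome=error.
# Assumes jq is available; events without a steps map print nothing.
failed_steps() {
  jq -r 'select(.outcome == "error")
         | (.steps // {})
         | to_entries[]
         | select(.value == "error")
         | .key'
}
```

Example use: printf '%s\n' '{"outcome":"error","steps":{"verify_signature":"ok","parse_command":"error"}}' | failed_steps prints parse_command.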

Event catalog

Event name              Emitters                     Trigger
coding.prompt           core                         GitHub issue opened or docs/prompts/* PR merged
coding.implementation   core, orchestrator, worker   docs/prompts/* PR merged → async implementation flow
slack.prompt            core                         Slack /prompt slash command
slack.implementation    core, orchestrator, worker   Slack /implement slash command
slack.clickup.ask       core                         Slack /clickup-ask slash command
slack.clickup.task      core                         Slack slash command to create/update a ClickUp task

Runbook: trace a full execution

Step 1 — Get the trace_id

  • From a GitHub webhook: copy the x-github-delivery header from the GitHub webhook delivery page (Settings → Webhooks → Recent deliveries).
  • From a Slack command: find a log line in /ecs/alakai-core that contains the user's Slack user_id or channel and copy the trace_id field.
  • From a CloudWatch error alert: the alert payload contains trace_id if the log line is a summary event.

Step 2 — Run the cross-component query

Select all three log groups and run:

fields @timestamp, context.component, event, outcome, final_status, duration_ms,
  steps, error.type, error.message, repo.full_name, task.task_id, pull_request.url
| filter trace_id = "REPLACE_WITH_TRACE_ID"
| filter msg like /^\[event\]/
| sort @timestamp asc

This returns one row per service slice. Read them top-to-bottom to follow the execution:

  1. core row — did the request parse and enqueue correctly?
  2. orchestrator row — did the ECS task launch successfully?
  3. worker row — did the implementation agent complete? What was the PR URL?
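The top-to-bottom reading above can be sketched as a small filter for three-slice events such as slack.implementation. It is an illustration only: first_broken_slice is a hypothetical helper, and it assumes the summary rows have already been exported as tab-separated component, outcome, final_status lines, oldest first (the export step is not shown).

```shell
# Hypothetical helper: reads "component<TAB>outcome<TAB>final_status" rows on
# stdin and prints the first slice that did not succeed, or the first expected
# slice (core, orchestrator, worker) that emitted no summary event at all.
first_broken_slice() {
  awk -F '\t' '
    { seen[$1] = 1 }
    $2 != "success" && !reported { print $1 ": " $2 " (" $3 ")"; reported = 1 }
    END {
      if (reported) exit
      n = split("core orchestrator worker", order, " ")
      for (i = 1; i <= n; i++)
        if (!(order[i] in seen)) { print order[i] ": no summary event"; exit }
      print "all slices succeeded"
    }
  '
}
```

A missing row is as informative as an error row: if core succeeded but no orchestrator row exists, the request most likely never left the queue. Note that a skipped outcome is also reported, since the helper flags anything other than success.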

Step 3 — Drill into step-level logs

To see all log lines (not just summary events) for the same execution:

fields @timestamp, context.component, level, msg
| filter trace_id = "REPLACE_WITH_TRACE_ID"
| sort @timestamp asc
| limit 500
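The summary-only and all-lines views differ only in their field list and in whether rows are restricted to summary events. A minimal sketch of a query builder follows; trace_query is a hypothetical helper name, and the summary filter relies on the [event] prefix of the msg field shown in the event JSON structure.

```shell
# Hypothetical helper: prints a Logs Insights query for one trace_id.
# Mode "summary" (default) keeps only summary events; mode "all" keeps
# every log line that carries the trace_id.
trace_query() {
  if [ "${2:-summary}" = "summary" ]; then
    printf 'fields @timestamp, context.component, event, outcome, final_status, duration_ms | filter trace_id = "%s" | filter msg like /^\\[event\\]/ | sort @timestamp asc' "$1"
  else
    printf 'fields @timestamp, context.component, level, msg | filter trace_id = "%s" | sort @timestamp asc | limit 500' "$1"
  fi
}
```

The output can be pasted into the console or passed to the AWS CLI, for example --query-string "$(trace_query "$TRACE_ID" all)".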

Common queries

Recent errors across all events

fields @timestamp, context.component, event, outcome, final_status,
error.type, error.message, trace_id
| filter outcome = "error"
| sort @timestamp desc
| limit 100

Errors for a specific event type

fields @timestamp, context.component, final_status, error.type, error.message,
duration_ms, trace_id
| filter event = "coding.implementation" and outcome = "error"
| sort @timestamp desc
| limit 50

Slow executions (worker slice > 5 minutes)

fields @timestamp, trace_id, event, final_status, duration_ms,
agent.provider, agent.model, repo.full_name
| filter context.component = "worker" and duration_ms > 300000
| sort duration_ms desc
| limit 50

All implementations for a given repository

fields @timestamp, trace_id, event, outcome, final_status, duration_ms,
pull_request.url, pull_request.number
| filter (event = "coding.implementation" or event = "slack.implementation")
and context.component = "worker"
and repo.full_name = "REPLACE_WITH_ORG/REPO"
| sort @timestamp desc
| limit 100

Slack command failures by user

fields @timestamp, event, final_status, error.message, trace_id
| filter (event like /^slack\./) and outcome = "error"
and slack.user_id = "REPLACE_WITH_SLACK_USER_ID"
| sort @timestamp desc
| limit 50

Orchestrator slot contention

fields @timestamp, trace_id, event, final_status
| filter context.component = "orchestrator"
and final_status = "slot_unavailable"
| sort @timestamp desc
| limit 50

p90 / p99 duration per event and component

filter event like /./
| stats
pct(duration_ms, 90) as p90_ms,
pct(duration_ms, 99) as p99_ms,
count() as total
by event, context.component
| sort p99_ms desc

Using the AWS CLI

TRACE_ID="REPLACE_WITH_TRACE_ID"
START=$(date -u -v-1H +%s) # macOS; use date -d '1 hour ago' +%s on Linux
END=$(date -u +%s)

QUERY_ID=$(aws logs start-query \
--log-group-names \
"/ecs/alakai-core" \
"/ecs/alakai-orchestrator" \
"/ecs/alakai-worker-implementation" \
--start-time "$START" \
--end-time "$END" \
--query-string "fields @timestamp, context.component, event, outcome, final_status, duration_ms, error.type, error.message | filter trace_id = \"$TRACE_ID\" | sort @timestamp asc" \
--query 'queryId' \
--output text)

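# start-query is asynchronous: poll until the query leaves the Scheduled/Running states.
while :; do
  STATUS=$(aws logs get-query-results --query-id "$QUERY_ID" --query 'status' --output text)
  case "$STATUS" in Scheduled|Running) sleep 1 ;; *) break ;; esac
done

aws logs get-query-results --query-id "$QUERY_ID"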
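The START line in the script is macOS-specific, with the Linux form given in its comment. A small sketch of a helper that works on both is shown below; epoch_hours_ago is a hypothetical name, and it assumes either GNU date (which accepts -d) or BSD date (which accepts -v) is on the PATH.

```shell
# Hypothetical helper: prints the Unix epoch for N hours before now.
# Tries the GNU date syntax first, then falls back to the BSD/macOS syntax.
epoch_hours_ago() {
  date -u -d "$1 hours ago" +%s 2>/dev/null || date -u -v-"$1"H +%s
}
```

With this in place, START=$(epoch_hours_ago 1) can replace the platform-specific line, and widening the window to a day is epoch_hours_ago 24.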

Tips

  • Always filter by trace_id first when debugging a specific request.
  • Summary events always log at level info (30). To see only summary events, add | filter msg like /^\[event\]/.
  • final_status is the fastest triage field: it maps directly to the failure mode and narrows the search to the right component.
  • To link a GitHub webhook delivery to its implementation outcome, filter by trace_id = "<x-github-delivery value>" — the same UUID is used end-to-end.

See also CloudWatch Helper for ready-to-paste queries focused on webhook debugging.