Background Task Orchestration Architecture

Overview

This document specifies a generic architecture for executing heavy, resource-intensive operations outside of the Core service. Tasks are queued, orchestrated, and executed in isolated ECS tasks with concurrency control.

First use case: the /implement command, which performs code generation using the Codex SDK.

Future use cases: Any heavy processing that should be isolated from the Core service (e.g., large file processing, batch operations, long-running AI tasks).

Chaining workers into multi-step flows (e.g. implementation → bug-hunter) is documented separately in Worker Pipeline.

Listing and retrying failed executions from the dashboard is documented in Worker Execution Retry.

Problem Statement

Some operations are resource-intensive and can take several minutes to complete:

  • Code implementation (cloning repos, running the Codex SDK, creating PRs)
  • Future: other heavy operations

Running these inside the Core Fargate task has several drawbacks:

  • A failed or hung operation can affect the entire Core service
  • API handling and heavy processing compete for the same resources
  • There is no control over the number of concurrent executions

Goals

  1. Isolation: Task failures should not affect the Core service
  2. Concurrency control: Configurable limit on simultaneous task executions (default: 5)
  3. Reliability: Messages should not be lost if components restart
  4. Extensibility: Easy to add new task types without changing the orchestration layer
  5. Simplicity: Minimal operational complexity

Architecture

[Figure: Background task orchestration architecture]

Components

1. Core Fargate (existing, modified)

Changes required:

  • Create a generic enqueueTask(taskType, payload) function
  • Add endpoint POST /internal/task-complete that routes by taskType
  • Respond immediately with "queued" status

Generic SQS Message Format:

interface BackgroundTask<T = unknown> {
  taskId: string; // Unique identifier (UUID)
  taskType: string; // e.g., "implement", "batch-process", etc.
  payload: T; // Task-specific payload
  metadata: {
    createdAt: number; // Timestamp
    source: string; // e.g., "slack", "github-webhook", "api"
    correlationId?: string; // For tracing
  };
  callbacks: {
    slackResponseUrl?: string;
    slackChannelId?: string;
    githubIssueNumber?: number;
    webhookUrl?: string; // Generic callback URL
  };
}
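
As a rough illustration of the generic enqueue function described above, the sketch below builds a BackgroundTask message and hands it to a sender. The spec only names enqueueTask(taskType, payload); the extra deps and options parameters, and the EnqueueDeps and EnqueueOptions types, are assumptions made here so the example stays independent of any particular SQS client.

```typescript
// Message shape from the spec above, repeated so the sketch is self-contained.
interface BackgroundTask<T = unknown> {
  taskId: string;
  taskType: string;
  payload: T;
  metadata: { createdAt: number; source: string; correlationId?: string };
  callbacks: {
    slackResponseUrl?: string;
    slackChannelId?: string;
    githubIssueNumber?: number;
    webhookUrl?: string;
  };
}

// Hypothetical dependencies, injected rather than hard-coded.
interface EnqueueDeps {
  sendMessage: (body: string) => Promise<void>; // e.g. a wrapper around an SQS send call
  newId: () => string; // e.g. a UUID generator
  now: () => number; // e.g. Date.now
}

// Hypothetical per-call options (not part of the original signature).
interface EnqueueOptions {
  source: string;
  correlationId?: string;
  callbacks?: BackgroundTask["callbacks"];
}

async function enqueueTask<T>(
  deps: EnqueueDeps,
  taskType: string,
  payload: T,
  options: EnqueueOptions,
): Promise<{ status: "queued"; taskId: string }> {
  const task: BackgroundTask<T> = {
    taskId: deps.newId(),
    taskType,
    payload,
    metadata: {
      createdAt: deps.now(),
      source: options.source,
      // Only include correlationId when the caller supplied one.
      ...(options.correlationId !== undefined
        ? { correlationId: options.correlationId }
        : {}),
    },
    callbacks: options.callbacks ?? {},
  };
  // Serialize the whole task as the message body.
  await deps.sendMessage(JSON.stringify(task));
  // The caller can respond immediately; the work happens later in a worker.
  return { status: "queued", taskId: task.taskId };
}
```

Injecting the sender keeps Core's request handler free of queue details and makes the "queued" response easy to test without AWS access.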

2. SQS Queue

Configuration:

| Setting | Value | Reason |
| --- | --- | --- |
| Type | Standard | Order not critical, need high throughput |
| Visibility Timeout | 60 seconds | Time for orchestrator to process and delete |
| Message Retention | 24 hours | Allow recovery from extended outages |
| Receive Wait Time | 20 seconds | Long polling to reduce empty receives |
| Dead Letter Queue | Yes | Capture failed messages after 3 attempts |

Queue Name: alakai-tasks

DLQ Name: alakai-tasks-dlq
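
The settings above map onto standard SQS queue attributes. The sketch below shows one way to express them as the string-valued attribute map that the SQS CreateQueue operation accepts. The function name, the QueueAttributes type, and the example ARN are illustrative assumptions; the attribute names themselves are the standard SQS ones.

```typescript
// Attribute map for the main queue. SQS expects every attribute value as a string.
interface QueueAttributes {
  VisibilityTimeout: string;
  MessageRetentionPeriod: string;
  ReceiveMessageWaitTimeSeconds: string;
  RedrivePolicy: string;
}

// Hypothetical helper: builds the attributes for alakai-tasks given the DLQ's ARN.
function buildTaskQueueAttributes(dlqArn: string): QueueAttributes {
  return {
    VisibilityTimeout: "60", // seconds the orchestrator has to process and delete
    MessageRetentionPeriod: String(24 * 60 * 60), // 24 hours, in seconds
    ReceiveMessageWaitTimeSeconds: "20", // long polling
    // After 3 receives without a delete, SQS moves the message to the DLQ.
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount: 3,
    }),
  };
}

// Placeholder account ID and region, for illustration only.
const exampleAttributes = buildTaskQueueAttributes(
  "arn:aws:sqs:us-east-1:123456789012:alakai-tasks-dlq",
);
```

The same values could equally be declared in infrastructure-as-code; the map form is shown only because it makes the units explicit.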

3. Orchestrator Fargate

Purpose: Dedicated service that consumes SQS messages, routes each one to the appropriate worker task definition, and enforces concurrency control.

Resources:

  • CPU: 256 (0.25 vCPU)
  • Memory: 512 MB
  • Always running (24/7)
  • Estimated cost: ~$8-10/month

Task Type Registry:

const TASK_REGISTRY: Record<string, TaskTypeConfig> = {
  'implement': {
    taskDefinition: 'alakai-worker-implement',
    containerName: 'worker',
    maxConcurrent: 5,
    timeoutMinutes: 30,
  },
  // Add new task types here
};
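
To make the per-type concurrency limit concrete, here is a minimal in-memory sketch of how the orchestrator might decide whether a message can start a worker. The TaskTypeConfig shape is inferred from the registry entry above, and the ConcurrencyTracker class and its method names are assumptions, not part of the spec. The spec does not say what happens to a message that arrives while a type is at capacity, so the sketch only reports that state.

```typescript
// Assumed shape of TaskTypeConfig, inferred from the registry entry above.
interface TaskTypeConfig {
  taskDefinition: string;
  containerName: string;
  maxConcurrent: number;
  timeoutMinutes: number;
}

type AcquireResult = "started" | "duplicate" | "at-capacity" | "unknown-type";

// Hypothetical helper that tracks running workers per task type.
class ConcurrencyTracker {
  private readonly registry: Record<string, TaskTypeConfig>;
  // taskId -> taskType for every worker currently believed to be running.
  private readonly active = new Map<string, string>();

  constructor(registry: Record<string, TaskTypeConfig>) {
    this.registry = registry;
  }

  private countFor(taskType: string): number {
    let count = 0;
    this.active.forEach((type) => {
      if (type === taskType) count++;
    });
    return count;
  }

  // Reserve a slot for taskId if its type is registered and under the limit.
  tryAcquire(taskId: string, taskType: string): AcquireResult {
    const config = this.registry[taskType];
    if (config === undefined) return "unknown-type";
    // Standard queues can deliver a message more than once.
    if (this.active.has(taskId)) return "duplicate";
    if (this.countFor(taskType) >= config.maxConcurrent) return "at-capacity";
    this.active.set(taskId, taskType);
    return "started";
  }

  // Free the slot when the worker reports completion or times out.
  release(taskId: string): boolean {
    return this.active.delete(taskId);
  }

  // Active task count per registered type.
  status(): Record<string, number> {
    const counts: Record<string, number> = {};
    for (const type of Object.keys(this.registry)) counts[type] = 0;
    this.active.forEach((type) => {
      counts[type] = (counts[type] ?? 0) + 1;
    });
    return counts;
  }
}
```

Because the state lives in memory, a restarted orchestrator would start with empty counts; a real implementation might rebuild them by listing running ECS tasks.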

Endpoints:

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /health | GET | Health check for ECS |
| /task-complete | POST | Callback from workers when done |
| /status | GET | Return current state (active tasks by type) |
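
The three endpoints can be pictured as a small, framework-agnostic routing function. This is only a sketch: the CompletionBody fields, the response bodies, and the createOrchestratorRouter name are assumptions, since the spec does not define the request or response payloads.

```typescript
// Assumed callback body sent by a worker; the spec does not define its fields.
interface CompletionBody {
  taskId: string;
  taskType: string;
  success: boolean;
}

interface HttpResult {
  status: number;
  body: unknown;
}

function isCompletionBody(value: unknown): value is CompletionBody {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["taskId"] === "string" &&
    typeof v["taskType"] === "string" &&
    typeof v["success"] === "boolean"
  );
}

// activeByType maps each task type to the IDs of its running workers.
function createOrchestratorRouter(activeByType: Map<string, Set<string>>) {
  return function route(method: string, path: string, body?: unknown): HttpResult {
    if (method === "GET" && path === "/health") {
      return { status: 200, body: { ok: true } };
    }
    if (method === "GET" && path === "/status") {
      const active: Record<string, number> = {};
      activeByType.forEach((ids, type) => {
        active[type] = ids.size;
      });
      return { status: 200, body: { active } };
    }
    if (method === "POST" && path === "/task-complete") {
      if (!isCompletionBody(body)) {
        return { status: 400, body: { error: "invalid completion body" } };
      }
      // Free the worker's slot so another task of this type can start.
      activeByType.get(body.taskType)?.delete(body.taskId);
      return { status: 200, body: { released: body.taskId } };
    }
    return { status: 404, body: { error: "not found" } };
  };
}
```

In a real service this function would sit behind whatever HTTP server the orchestrator uses; keeping it pure makes the routing rules easy to unit test.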

4. Worker Tasks

Purpose: Ephemeral tasks that perform specific work and terminate. Each task type has its own task definition.

Naming Convention:

  • Task Definition: alakai-worker-{taskType} (e.g., alakai-worker-implement)
  • Log Group: /ecs/alakai/workers/{taskType}

Common environment variables passed to all workers:

TASK_ID - Unique task identifier
TASK_TYPE - Type of task (e.g., "implement")
ORCHESTRATOR_URL - URL to call when complete
CORE_URL - URL of Core service (optional, for direct callbacks)
PAYLOAD_JSON - JSON-encoded task-specific payload
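
A worker would typically read these variables once at startup and fail fast if a required one is missing. The sketch below assumes a hypothetical readWorkerContext helper and WorkerContext type; only the variable names come from the list above.

```typescript
// Parsed view of the common worker environment variables.
interface WorkerContext<T> {
  taskId: string;
  taskType: string;
  orchestratorUrl: string;
  coreUrl?: string; // optional, per the list above
  payload: T;
}

// Hypothetical helper: validates and parses the environment passed by the orchestrator.
function readWorkerContext<T = unknown>(
  env: Record<string, string | undefined>,
): WorkerContext<T> {
  const required = (name: string): string => {
    const value = env[name];
    if (value === undefined || value === "") {
      throw new Error("Missing required environment variable " + name);
    }
    return value;
  };

  const coreUrl = env["CORE_URL"];
  return {
    taskId: required("TASK_ID"),
    taskType: required("TASK_TYPE"),
    orchestratorUrl: required("ORCHESTRATOR_URL"),
    // CORE_URL is optional, so it is only included when set.
    ...(coreUrl ? { coreUrl } : {}),
    // JSON.parse throws on malformed input, which also fails the worker early.
    payload: JSON.parse(required("PAYLOAD_JSON")) as T,
  };
}
```

In a Node.js worker the call would be readWorkerContext(process.env). The cast to T does not validate the payload's shape, so each worker should still check the fields it depends on.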

Adding a new task type

To add a new task type (e.g., batch-export):

  1. Define the payload type in Core
  2. Create the worker (new task definition or shared image)
  3. Register in orchestrator (TASK_REGISTRY in orchestrator config)
  4. Add handler in Core for completion (POST /internal/task-complete)
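
The four steps above might look roughly like this for batch-export. Everything specific to batch-export here is hypothetical: the payload fields, the concurrency and timeout values, the TaskCompletion shape, and the handler map. The TaskTypeConfig shape is inferred from the registry shown earlier.

```typescript
interface TaskTypeConfig {
  taskDefinition: string;
  containerName: string;
  maxConcurrent: number;
  timeoutMinutes: number;
}

// Step 1: payload type defined in Core (fields are hypothetical).
interface BatchExportPayload {
  exportId: string;
  format: "csv" | "json";
}

function buildBatchExportRequest(payload: BatchExportPayload) {
  return { taskType: "batch-export", payload };
}

// Step 2 happens outside the code: create the alakai-worker-batch-export task definition.

// Step 3: register the type with the orchestrator (limits are illustrative).
const TASK_REGISTRY: Record<string, TaskTypeConfig> = {
  implement: {
    taskDefinition: "alakai-worker-implement",
    containerName: "worker",
    maxConcurrent: 5,
    timeoutMinutes: 30,
  },
  "batch-export": {
    taskDefinition: "alakai-worker-batch-export",
    containerName: "worker",
    maxConcurrent: 2,
    timeoutMinutes: 15,
  },
};

// Step 4: completion handling in Core, routed by taskType.
interface TaskCompletion {
  taskId: string;
  taskType: string;
  success: boolean;
}

type CompletionHandler = (completion: TaskCompletion) => string;

const COMPLETION_HANDLERS: Record<string, CompletionHandler> = {
  implement: (c) => "implement " + c.taskId + (c.success ? " succeeded" : " failed"),
  "batch-export": (c) => "export " + c.taskId + (c.success ? " is ready" : " failed"),
};

// What POST /internal/task-complete could delegate to.
function handleTaskComplete(completion: TaskCompletion): string {
  const handler = COMPLETION_HANDLERS[completion.taskType];
  if (handler === undefined) {
    throw new Error("No completion handler for task type " + completion.taskType);
  }
  return handler(completion);
}
```

The orchestration code itself does not change: only the registry entry and the Core-side handler are new, which is the extensibility goal stated earlier.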

AWS resources

| Resource | Name | Purpose |
| --- | --- | --- |
| SQS Queue | alakai-tasks | Generic task queue |
| SQS Queue | alakai-tasks-dlq | Dead letter queue |
| ECS Task Definition | alakai-orchestrator | Orchestrator service |
| ECS Service | alakai-orchestrator-service | Run orchestrator 24/7 |
| ECS Task Definition | alakai-worker-implement | Implement worker |
| CloudWatch Log Group | /ecs/alakai/orchestrator | Orchestrator logs |
| CloudWatch Log Group | /ecs/alakai/workers/implement | Implement worker logs |

Cost estimate

| Component | Monthly Cost |
| --- | --- |
| Orchestrator Fargate (256 CPU, 512 MB, 24/7) | ~$8-10 |
| SQS (estimated 1000 messages/month) | ~$0.50 |
| Worker executions (100 runs @ 10 min each) | ~$5-8 |
| CloudWatch Logs | ~$2-5 |
| Total additional cost | ~$15-25/month |
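
The orchestrator figure can be sanity-checked with a short calculation. The per-hour rates below are assumed example values, not taken from this document; Fargate pricing varies by region and over time, so substitute current rates before relying on the result. The worker line is not reproduced because the worker task size is not specified here.

```typescript
// Fargate bills per vCPU-hour and per GB-hour of memory.
interface FargateRates {
  perVcpuHour: number; // USD
  perGbHour: number; // USD
}

function monthlyFargateCost(
  vcpu: number,
  memoryGb: number,
  hours: number,
  rates: FargateRates,
): number {
  return hours * (vcpu * rates.perVcpuHour + memoryGb * rates.perGbHour);
}

// ASSUMPTION: example rates only; check the current price list for your region.
const ASSUMED_RATES: FargateRates = { perVcpuHour: 0.04048, perGbHour: 0.004445 };

// ASSUMPTION: 730 hours approximates one month of 24/7 operation.
const HOURS_PER_MONTH = 730;

// Orchestrator: 256 CPU units = 0.25 vCPU, 512 MB = 0.5 GB.
const orchestratorMonthly = monthlyFargateCost(0.25, 0.5, HOURS_PER_MONTH, ASSUMED_RATES);
```

With these assumed rates the orchestrator comes to roughly $9 per month, which falls inside the ~$8-10 range in the table.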

Monitoring

| Metric | Source | Alert Threshold |
| --- | --- | --- |
| Queue depth | SQS | > 20 messages for 10 min |
| DLQ messages | SQS DLQ | > 0 |
| Orchestrator CPU | ECS | > 80% for 5 min |
| Worker failures by type | Custom metric | > 3 in 1 hour |
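
The thresholds above could be captured as alarm definitions along these lines. This is a sketch under several assumptions: the AlarmSpec type is illustrative, the choice of statistic and period for each alarm is a guess at one reasonable reading of the table, and the custom namespace and metric name for worker failures are hypothetical. The AWS/SQS and AWS/ECS metric names are the standard CloudWatch ones.

```typescript
// Illustrative alarm description, loosely following CloudWatch alarm parameters.
interface AlarmSpec {
  name: string;
  namespace: string;
  metricName: string;
  statistic: "Sum" | "Average" | "Maximum";
  periodSeconds: number;
  evaluationPeriods: number;
  threshold: number;
  comparison: "GreaterThanThreshold";
}

const ALARMS: AlarmSpec[] = [
  {
    name: "alakai-tasks-queue-depth",
    namespace: "AWS/SQS",
    metricName: "ApproximateNumberOfMessagesVisible",
    statistic: "Average",
    periodSeconds: 300,
    evaluationPeriods: 2, // 2 x 5 min = 10 min
    threshold: 20,
    comparison: "GreaterThanThreshold",
  },
  {
    name: "alakai-tasks-dlq-not-empty",
    namespace: "AWS/SQS",
    metricName: "ApproximateNumberOfMessagesVisible",
    statistic: "Maximum",
    periodSeconds: 300,
    evaluationPeriods: 1,
    threshold: 0,
    comparison: "GreaterThanThreshold",
  },
  {
    name: "alakai-orchestrator-cpu",
    namespace: "AWS/ECS",
    metricName: "CPUUtilization",
    statistic: "Average",
    periodSeconds: 300,
    evaluationPeriods: 1, // 5 min
    threshold: 80,
    comparison: "GreaterThanThreshold",
  },
  {
    name: "alakai-worker-failures-implement",
    namespace: "Alakai/Workers", // hypothetical custom namespace
    metricName: "WorkerFailures", // hypothetical custom metric
    statistic: "Sum",
    periodSeconds: 3600,
    evaluationPeriods: 1, // 1 hour
    threshold: 3,
    comparison: "GreaterThanThreshold",
  },
];

// Total time a metric must stay above the threshold before the alarm fires.
function alarmWindowSeconds(spec: AlarmSpec): number {
  return spec.periodSeconds * spec.evaluationPeriods;
}
```

The DLQ alarm uses the same metric as the queue-depth alarm and differs only in the queue it would be attached to, which is selected by a queue-name dimension not shown here.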

See Tracing Guide for CloudWatch Logs Insights queries to trace executions.