Slack meets Bedrock: analytics, monitoring, and production

22 minute read

This is part 3 of a three-part series on building a Slack bot powered by AWS Bedrock Agents.

  1. Building the gateway
  2. Knowledge bases, tools, and action groups
  3. Analytics, monitoring, and production (this post)

In part 1 we built the gateway. In part 2 we gave the agent knowledge and tools. The bot works. Users can ask questions, search tickets, analyze data.

Now what? How do you know which knowledge bases are actually useful? Whether the agent hallucinates? What questions users are asking? What the system costs per conversation?

Deploying an AI bot is the easy part. Understanding its behavior in production is where the real engineering happens.

Analytics pipeline

The goal: capture every conversation event, store it cheaply, and make it queryable without running a database.


The pipeline uses four AWS services:

| Service | Role | Cost model |
|---|---|---|
| Kinesis Firehose | Buffer and deliver events | Per GB ingested |
| S3 | Store compressed event files | Per GB stored |
| Glue | Define table schema over S3 | Free (catalog only) |
| Athena | Query with SQL | Per TB scanned |

No servers. No database. Events flow in, land in S3 as gzipped JSON, and you query them with SQL when you need answers.

What to capture

Every time the agent processes a message, the gateway publishes an event:

{
  "timestamp": "2026-04-19T14:32:00Z",
  "user_id": "U024BE7LH",
  "thread_id": "1713534720.001234",
  "message_length": 142,
  "response_length": 1893,
  "knowledge_bases": ["security-policies-kb"],
  "search_queries": ["password rotation policy remote employees"],
  "rationale": "User is asking about security policy for remote workers. Searching security policies knowledge base."
}

Each field serves a purpose:

| Field | What it tells you |
|---|---|
| user_id | Who uses the bot and how often |
| thread_id | Conversation grouping for session analysis |
| message_length / response_length | Complexity proxy; long responses suggest detailed questions |
| knowledge_bases | Which KBs the agent consults; unused KBs can be removed |
| search_queries | What the agent actually searches for, not what the user typed |
| rationale | The agent's reasoning before taking action |

The search_queries field is especially valuable. It shows the gap between what users ask and what the agent searches. If users ask “PTO policy” but the agent searches “vacation leave entitlement policy,” you learn something about your document naming.
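As a concrete sketch, the event could be assembled like this. The `build_analytics_event` helper and the shape of the `trace` hash are illustrative assumptions, not ark's actual API:

```ruby
require "json"
require "time"

# Illustrative builder for the analytics event shown above.
# build_analytics_event and the trace hash shape are assumptions,
# not ark's actual implementation.
def build_analytics_event(user_id:, thread_id:, message:, response:, trace:)
  {
    timestamp: Time.now.utc.iso8601,
    user_id: user_id,
    thread_id: thread_id,
    message_length: message.length,
    response_length: response.length,
    knowledge_bases: trace.fetch(:knowledge_bases, []),
    search_queries: trace.fetch(:search_queries, []),
    rationale: trace[:rationale]
  }.to_json
end
```

The serialized event then goes to Firehose, e.g. via `Aws::Firehose::Client#put_record` with `record: { data: event + "\n" }`. The trailing newline keeps the delivered S3 files newline-delimited so the JSON SerDe can split records.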

Firehose configuration

EventsStream:
  Type: AWS::KinesisFirehose::DeliveryStream
  Properties:
    DeliveryStreamType: DirectPut
    ExtendedS3DestinationConfiguration:
      BucketARN: !GetAtt AnalyticsBucket.Arn
      Prefix: 'events/'
      ErrorOutputPrefix: 'errors/events/'
      CompressionFormat: GZIP
      BufferingHints:
        IntervalInSeconds: 300
        SizeInMBs: 5

Firehose buffers events for up to 5 minutes or 5MB, whichever comes first, then writes a single gzipped file to S3. For a bot handling 50 conversations per day, this means roughly one file per 5-minute window during active hours.

Glue table

The Glue table maps a schema over the raw S3 files. No ETL, no data transformation. Just metadata.

EventsTable:
  Type: AWS::Glue::Table
  Properties:
    DatabaseName: !Ref GlueDatabase
    TableInput:
      Name: events
      TableType: EXTERNAL_TABLE
      Parameters:
        classification: json
        compressionType: gzip
      StorageDescriptor:
        Location: !Sub 's3://${AnalyticsBucket}/events/'
        SerdeInfo:
          SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
        Columns:
          - Name: timestamp
            Type: string
          - Name: user_id
            Type: string
          - Name: thread_id
            Type: string
          - Name: message_length
            Type: int
          - Name: response_length
            Type: int
          - Name: knowledge_bases
            Type: array<string>
          - Name: search_queries
            Type: array<string>
          - Name: rationale
            Type: string

Useful queries

With the table in place, Athena lets you answer real questions:

Most active users this month:

SELECT user_id, COUNT(*) as messages
FROM events
WHERE "timestamp" >= '2026-04-01'
GROUP BY user_id
ORDER BY messages DESC
LIMIT 10;

Knowledge base usage:

SELECT kb, COUNT(*) as hits
FROM events
CROSS JOIN UNNEST(knowledge_bases) AS t(kb)
WHERE "timestamp" >= '2026-04-01'
GROUP BY kb
ORDER BY hits DESC;

Questions where no knowledge base was consulted (potential gaps):

SELECT "timestamp", user_id, rationale
FROM events
WHERE cardinality(knowledge_bases) = 0
  AND "timestamp" >= '2026-04-01'
ORDER BY "timestamp" DESC
LIMIT 50;

This shows conversations where the agent couldn’t find relevant documents. Each result is a potential content gap in your knowledge bases.
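The same gap analysis can also run locally over event files synced from the analytics bucket (e.g. with `aws s3 sync`), which is handy for ad hoc digging without Athena. This sketch assumes Firehose's gzipped, newline-delimited JSON layout described above; `knowledge_base_gaps` is an illustrative helper:

```ruby
require "json"
require "zlib"

# Scan gzipped, newline-delimited JSON event files and return events
# where no knowledge base was consulted. ISO 8601 timestamps compare
# correctly as strings, so a plain >= works for the date filter.
def knowledge_base_gaps(paths, since:)
  paths.flat_map do |path|
    Zlib::GzipReader.open(path) do |gz|
      gz.each_line.filter_map do |line|
        event = JSON.parse(line)
        next unless event["timestamp"].to_s >= since
        next unless Array(event["knowledge_bases"]).empty?
        event.slice("timestamp", "user_id", "rationale")
      end
    end
  end
end
```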

Trace parsing

Bedrock Agent responses come as an event stream. Mixed in with the actual text are trace events containing the agent’s internal reasoning: which knowledge bases it consulted, what queries it ran, what its preprocessing rationale was.

ark parses these traces in real time as the response streams in:

EventStream.decode(response) do |message|
  case message.event_type
  when "chunk"
    # Append text to the response buffer
    extract_text_and_citations(message)
  when "trace"
    # Extract metadata for analytics
    parse_trace(message)
  when "files"
    # Collect output files from code interpreter
    extract_files(message)
  end
end

Trace events contain nested JSON with varying structure depending on the trace type:

| Trace type | What it contains |
|---|---|
| preProcessingTrace | The agent's rationale before acting |
| orchestrationTrace | Which knowledge bases and action groups were invoked |
| knowledgeBaseLookupOutput | Retrieved document references with source names |

The preprocessor rationale is particularly useful. It shows the agent’s reasoning before it does anything: “User is asking about security compliance for remote workers. I should search the security policies knowledge base.” When the agent makes wrong routing decisions, this trace tells you why.
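A parse_trace step might look like the sketch below. This is not ark's actual implementation, and the nesting only approximates the Bedrock InvokeAgent trace payload; verify the field paths against the trace events your agent actually emits, since they vary by trace type:

```ruby
# Sketch of trace extraction; field paths are approximate, not verbatim
# from ark. `collected` accumulates metadata across trace events.
def parse_trace(trace, collected)
  # Preprocessing rationale: the agent's reasoning before acting.
  if (pre = trace.dig("preProcessingTrace", "modelInvocationOutput",
                      "parsedResponse", "rationale"))
    collected[:rationale] ||= pre
  end

  orch = trace["orchestrationTrace"] || {}

  # Knowledge base lookups: which KB was queried, and with what text.
  if (input = orch.dig("invocationInput", "knowledgeBaseLookupInput"))
    collected[:knowledge_bases] << input["knowledgeBaseId"]
    collected[:search_queries] << input["text"]
  end

  # Retrieved references: source documents to cite back in Slack.
  refs = orch.dig("observation", "knowledgeBaseLookupOutput",
                  "retrievedReferences") || []
  refs.each { |r| collected[:sources] << r.dig("location", "s3Location", "uri") }

  collected
end
```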

Source extraction from knowledge base lookups also matters for the user experience. The gateway collects source document names and appends them to the Slack response, so users can verify the information.

Deployment


ECS task

The gateway runs as an ECS Fargate task. The resource requirements are minimal:

SlackGatewayService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref ECSCluster
    TaskDefinition: !Ref TaskDefinition
    DesiredCount: 1
    DeploymentConfiguration:
      MinimumHealthyPercent: 0
      MaximumPercent: 200
      DeploymentCircuitBreaker:
        Enable: true
        Rollback: true

A few decisions worth explaining:

MinimumHealthyPercent: 0. During deployment, the old task can stop before the new one starts. For a Slack bot, a few seconds of downtime during deploy is acceptable. The alternative (running two tasks simultaneously) creates duplicate message processing because both connect to the same Socket Mode WebSocket.

Circuit breaker with rollback. If the new task fails to start (bad image, missing env vars, crash loop), ECS automatically rolls back to the previous task definition. This prevents a bad deploy from leaving the bot permanently offline.

128MB RAM, 128 CPU units. The gateway is I/O bound. It spends its time waiting for Slack WebSocket events and Bedrock HTTP responses. CPU and memory usage are negligible. If you’re running in a language with a larger runtime (Node.js, Python), you might need 256MB or 512MB, but the CPU allocation stays the same.

Health checks

Socket Mode bots don’t expose HTTP endpoints, so traditional health checks don’t apply. A simple process-level check works:

HealthCheck:
  Command:
    - CMD-SHELL
    - kill -0 1
  Interval: 5
  Timeout: 2
  Retries: 3
  StartPeriod: 10

This checks whether PID 1 (the gateway process) is alive. It doesn’t verify that the Slack WebSocket is connected, but combined with Slack’s own reconnection logic and ECS’s restart policy, it’s sufficient.

For deeper health monitoring, log a heartbeat message periodically and alert on its absence (see alarms below).
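A heartbeat can be as simple as the sketch below: a background thread logs a recognizable line, and a CloudWatch metric filter on "HEARTBEAT" plus an alarm on the missing metric catches a wedged process that the PID check misses. Both function names are illustrative:

```ruby
# Sketch of a periodic heartbeat log line; heartbeat_line and
# start_heartbeat are illustrative names, not part of ark.
def heartbeat_line(connected:, at: Time.now.utc)
  format("HEARTBEAT at=%s socket_connected=%s",
         at.strftime("%Y-%m-%dT%H:%M:%SZ"), connected)
end

def start_heartbeat(interval: 60)
  Thread.new do
    loop do
      # In the gateway this would check the Socket Mode client's state
      # rather than hard-coding true.
      puts heartbeat_line(connected: true)
      sleep interval
    end
  end
end
```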

Secrets

Two categories of secrets, handled differently:

Slack tokens (bot token and app-level token) are injected via ECS task definition Secrets, which pulls from Secrets Manager at task start. The container sees them as environment variables. They never touch disk.

External API credentials (e.g., Jira) are read by the Lambda function at invocation time, not cached. The Lambda calls secretsmanager:GetSecretValue on each request. For low-throughput bots (under 100 requests per hour), this adds negligible latency and ensures credential rotation takes effect immediately.
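The per-invocation fetch might look like this sketch. `fetch_credentials` is an illustrative helper, and `client` is anything that responds like `Aws::SecretsManager::Client#get_secret_value`:

```ruby
require "json"

# Sketch: fetch credentials on every invocation, with no memoization,
# so a rotated credential takes effect on the very next request.
# fetch_credentials is an illustrative name, not from the actual Lambda.
def fetch_credentials(client, secret_id)
  resp = client.get_secret_value(secret_id: secret_id)
  JSON.parse(resp.secret_string)
end
```

Caching the value would shave a few milliseconds per request, but at low throughput that saving is not worth delaying rotation.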

SlackBotTokenSecret:
  Type: AWS::SecretsManager::Secret
  Properties:
    Name: !Sub '${AWS::StackName}/slack/bot-token'
    Description: 'Slack bot user OAuth token'
    SecretString: 'PLACEHOLDER_UPDATE_ME'

After deploying the stack, update the placeholder with the real token:

aws secretsmanager put-secret-value \
  --secret-id my-bot/slack/bot-token \
  --secret-string "xoxb-your-actual-token"

Monitoring and alarms

Lambda errors

If your action group Lambda starts failing, the agent loses a capability. Users get vague responses instead of ticket data.

LambdaErrorAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: AWS/Lambda
    MetricName: Errors
    Dimensions:
      - Name: FunctionName
        Value: !Sub '${AWS::StackName}-ticket-action-group'
    Statistic: Sum
    Period: 900
    EvaluationPeriods: 1
    Threshold: 0
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching
    AlarmActions:
      - !Ref AlarmSNSTopic

This fires after a single error in any 15-minute window. Aggressive, but for a bot with low traffic, any Lambda error is worth investigating immediately.

Gateway health

Monitor the ECS service for task restarts:

TaskFailureAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    Namespace: ECS/ContainerInsights
    MetricName: RunningTaskCount
    Dimensions:
      - Name: ServiceName
        Value: !Ref SlackGatewayService
    Statistic: Minimum
    Period: 300
    EvaluationPeriods: 2
    Threshold: 1
    ComparisonOperator: LessThanThreshold
    TreatMissingData: breaching

If the running task count drops below 1 for two consecutive 5-minute periods, something is wrong. Combined with the circuit breaker, this catches both crash loops and deployment failures.

Cost model

Understanding costs helps you set expectations and optimize. Here’s what each component costs for a bot handling around 50 conversations per day.

| Component | Cost driver | Typical monthly cost |
|---|---|---|
| Bedrock Agent (Claude Haiku) | Input/output tokens | $5 to $30 |
| Knowledge base embeddings | Titan embed per sync | < $1 |
| S3 vector store | Storage + queries | < $1 |
| Firehose | GB ingested | < $1 |
| S3 (analytics) | GB stored | < $0.10 |
| Athena | TB scanned per query | < $0.50 |
| ECS Fargate | vCPU + memory hours | $3 to $5 |
| Secrets Manager | Per secret per month | $2 (4 secrets) |
| Lambda | Invocations | < $0.10 |
| Total | | $15 to $40 |

The dominant cost is Bedrock invocations. Everything else is negligible. Three ways to control it:

  1. Choose the right model. Claude Haiku is 10 to 20x cheaper than Sonnet or Opus. For most internal bot tasks, Haiku is sufficient.
  2. Watch response lengths. Long agent responses burn output tokens. The agent instructions can say “keep responses concise” and it actually works.
  3. Monitor knowledge base usage. If a knowledge base is rarely consulted, consider whether it’s pulling its weight. Each KB retrieval adds token cost from the retrieved chunks.
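A back-of-envelope estimate shows why the model dominates. The prices and per-conversation token counts below are illustrative assumptions, not quoted rates; check current Bedrock pricing for your region and model:

```ruby
# Token cost sketch. All inputs are assumptions for illustration.
def monthly_model_cost(conversations:, calls_per_conversation:,
                       input_tokens_per_call:, output_tokens_per_call:,
                       input_price_per_mtok:, output_price_per_mtok:)
  calls = conversations * calls_per_conversation
  input_cost  = calls * input_tokens_per_call  / 1_000_000.0 * input_price_per_mtok
  output_cost = calls * output_tokens_per_call / 1_000_000.0 * output_price_per_mtok
  input_cost + output_cost
end

# 50 conversations/day for 30 days, ~5 model calls per conversation
# (an agent invokes the model repeatedly during orchestration), with
# assumed Haiku-class pricing of $0.25/$1.25 per million tokens.
cost = monthly_model_cost(
  conversations: 50 * 30,
  calls_per_conversation: 5,
  input_tokens_per_call: 3_000,
  output_tokens_per_call: 300,
  input_price_per_mtok: 0.25,
  output_price_per_mtok: 1.25
)
# => 8.4375, i.e. roughly $8/month under these assumptions
```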

Putting it all together

Here is the complete resource map for the system:

| Resource | Type | Purpose |
|---|---|---|
| ECS Service | Compute | Runs the gateway |
| Secrets Manager (x3) | Config | Slack tokens, API credentials |
| Bedrock Agent | AI | Foundation model + routing |
| Bedrock Agent Alias | AI | Stable deployment pointer |
| Knowledge Bases (x3) | AI | Domain-specific document search |
| Lambda | Compute | Action group handler |
| Firehose | Analytics | Event delivery |
| S3 (knowledge base) | Storage | Source documents |
| S3 (analytics) | Storage | Event archive |
| Glue Database + Table | Analytics | Schema for Athena |
| CloudWatch Alarms | Monitoring | Failure detection |

One CloudFormation stack. All resources reference each other through !Ref and !GetAtt. Deploy once, update incrementally. The agent alias gives you versioned deployments with rollback.

Wrap-up

The most valuable thing you build isn’t the bot itself. It’s the analytics pipeline that shows you what people actually ask, what the agent can’t answer, and where your documentation has gaps. That feedback loop improves both the bot and the organization it serves.

The total infrastructure cost for a team-sized bot runs $15 to $40 per month. The dominant cost is Bedrock invocations. Everything else rounds to zero.

The entire gateway layer described across this series is implemented in ark. It handles Socket Mode, session management, Bedrock Agent invocation, trace parsing, analytics publishing, and Slack formatting in a single binary. If you’re building something similar, it can save you a few weeks of plumbing.
