Slack meets Bedrock: analytics, monitoring, and production
This is part 3 of a three-part series on building a Slack bot powered by AWS Bedrock Agents.
- Building the gateway
- Knowledge bases, tools, and action groups
- Analytics, monitoring, and production (this post)
In part 1 we built the gateway. In part 2 we gave the agent knowledge and tools. The bot works. Users can ask questions, search tickets, analyze data.
Now what? How do you know which knowledge bases are actually useful, whether the agent hallucinates, what questions users are asking, or what the system costs per conversation?
Deploying an AI bot is the easy part. Understanding its behavior in production is where the real engineering happens.
Analytics pipeline
The goal: capture every conversation event, store it cheaply, and make it queryable without running a database.
The pipeline uses four AWS services:
| Service | Role | Cost model |
|---|---|---|
| Kinesis Firehose | Buffer and deliver events | Per GB ingested |
| S3 | Store compressed event files | Per GB stored |
| Glue | Define table schema over S3 | Free (catalog only) |
| Athena | Query with SQL | Per TB scanned |
No servers. No database. Events flow in, land in S3 as gzipped JSON, and you query them with SQL when you need answers.
What to capture
Every time the agent processes a message, the gateway publishes an event:
{
"timestamp": "2026-04-19T14:32:00Z",
"user_id": "U024BE7LH",
"thread_id": "1713534720.001234",
"message_length": 142,
"response_length": 1893,
"knowledge_bases": ["security-policies-kb"],
"search_queries": ["password rotation policy remote employees"],
"rationale": "User is asking about security policy for remote workers. Searching security policies knowledge base."
}
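Publishing that event is a single Firehose PutRecord call from the gateway. Here is a minimal Python sketch of the idea (the stream name and helper functions are illustrative, not ark's actual internals):

```python
import json
from datetime import datetime, timezone

def build_event(user_id, thread_id, message, response,
                knowledge_bases=None, search_queries=None, rationale=""):
    """Assemble one analytics event matching the Glue table schema."""
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "user_id": user_id,
        "thread_id": thread_id,
        "message_length": len(message),
        "response_length": len(response),
        "knowledge_bases": knowledge_bases or [],
        "search_queries": search_queries or [],
        "rationale": rationale,
    }

def publish(firehose_client, stream_name, event):
    """Send the event to Firehose. The trailing newline keeps the S3
    output as one JSON object per line, which the JSON SerDe expects."""
    firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )
```

The newline matters: without it, Firehose concatenates records into unparseable JSON blobs and Athena queries silently return nothing.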
Each field serves a purpose:
| Field | What it tells you |
|---|---|
| user_id | Who uses the bot and how often |
| thread_id | Conversation grouping for session analysis |
| message_length / response_length | Complexity proxy; long responses suggest detailed questions |
| knowledge_bases | Which KBs the agent consults; unused KBs can be removed |
| search_queries | What the agent actually searches for, not what the user typed |
| rationale | The agent’s reasoning before taking action |
The search_queries field is especially valuable. It shows the gap between what users ask and what the agent searches. If users ask “PTO policy” but the agent searches “vacation leave entitlement policy,” you learn something about your document naming.
Firehose configuration
EventsStream:
Type: AWS::KinesisFirehose::DeliveryStream
Properties:
DeliveryStreamType: DirectPut
ExtendedS3DestinationConfiguration:
BucketARN: !GetAtt AnalyticsBucket.Arn
Prefix: 'events/'
ErrorOutputPrefix: 'errors/events/'
CompressionFormat: GZIP
BufferingHints:
IntervalInSeconds: 300
SizeInMBs: 5
Firehose buffers events for up to 5 minutes or 5MB, whichever comes first, then writes a single gzipped file to S3. For a bot handling 50 conversations per day, this means roughly one file per 5-minute window during active hours.
Glue table
The Glue table maps a schema over the raw S3 files. No ETL, no data transformation. Just metadata.
EventsTable:
Type: AWS::Glue::Table
Properties:
DatabaseName: !Ref GlueDatabase
TableInput:
Name: events
TableType: EXTERNAL_TABLE
Parameters:
classification: json
compressionType: gzip
StorageDescriptor:
Location: !Sub 's3://${AnalyticsBucket}/events/'
SerdeInfo:
SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
Columns:
- Name: timestamp
Type: string
- Name: user_id
Type: string
- Name: thread_id
Type: string
- Name: message_length
Type: int
- Name: response_length
Type: int
- Name: knowledge_bases
Type: array<string>
- Name: search_queries
Type: array<string>
- Name: rationale
Type: string
Useful queries
With the table in place, Athena lets you answer real questions:
Most active users this month:
-- "timestamp" is quoted because it collides with the SQL type keyword
SELECT user_id, COUNT(*) AS messages
FROM events
WHERE "timestamp" >= '2026-04-01'
GROUP BY user_id
ORDER BY messages DESC
LIMIT 10;
Knowledge base usage:
SELECT kb, COUNT(*) AS hits
FROM events
CROSS JOIN UNNEST(knowledge_bases) AS t(kb)
WHERE "timestamp" >= '2026-04-01'
GROUP BY kb
ORDER BY hits DESC;
Questions where no knowledge base was consulted (potential gaps):
SELECT "timestamp", user_id, rationale
FROM events
WHERE cardinality(knowledge_bases) = 0
  AND "timestamp" >= '2026-04-01'
ORDER BY "timestamp" DESC
LIMIT 50;
This shows conversations where the agent couldn’t find relevant documents. Each result is a potential content gap in your knowledge bases.
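If you want this gap report on a schedule rather than ad hoc, the query can be submitted programmatically. A hedged sketch using boto3's Athena client (the database name and results bucket are placeholders for your own):

```python
def gap_query(month_start):
    """Build the content-gap query: conversations with no KB consulted."""
    return f"""
        SELECT "timestamp", user_id, rationale
        FROM events
        WHERE cardinality(knowledge_bases) = 0
          AND "timestamp" >= '{month_start}'
        ORDER BY "timestamp" DESC
        LIMIT 50
    """

def run_gap_query(month_start, database="slackbot_analytics",
                  output="s3://my-athena-results/"):
    """Submit the query to Athena; returns the execution ID to poll."""
    import boto3  # deferred import so the query builder stays dependency-free
    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=gap_query(month_start),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]
```

Wire this to a weekly EventBridge schedule and the content-gap report lands in S3 without anyone opening the Athena console.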
Trace parsing
Bedrock Agent responses come as an event stream. Mixed in with the actual text are trace events containing the agent’s internal reasoning: which knowledge bases it consulted, what queries it ran, what its preprocessing rationale was.
ark parses these traces in real time as the response streams in:
EventStream.decode(response) do |message|
case message.event_type
when "chunk"
# Append text to the response buffer
extract_text_and_citations(message)
when "trace"
# Extract metadata for analytics
parse_trace(message)
when "files"
# Collect output files from code interpreter
extract_files(message)
end
end
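If you are building the same thing in Python, boto3's `invoke_agent` yields the stream as one dict per event, keyed by event type. A minimal sketch of the equivalent dispatch (the event shapes follow the bedrock-agent-runtime response structure; verify against the current API docs):

```python
def parse_stream(completion):
    """Walk a Bedrock Agent event stream, separating text, traces, and files."""
    text, traces, files = [], [], []
    for event in completion:
        if "chunk" in event:
            # Response text arrives as raw bytes
            text.append(event["chunk"]["bytes"].decode("utf-8"))
        elif "trace" in event:
            # Internal reasoning: preprocessing, orchestration, KB lookups
            traces.append(event["trace"]["trace"])
        elif "files" in event:
            # Output files from the code interpreter
            files.extend(event["files"]["files"])
    return "".join(text), traces, files
```

In real use, `completion` is `response["completion"]` from `invoke_agent`; here it can be any iterable of event dicts, which also makes the parser easy to unit test.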
Trace events contain nested JSON with varying structure depending on the trace type:
| Trace type | What it contains |
|---|---|
| preProcessingTrace | The agent’s rationale before acting |
| orchestrationTrace | Which knowledge bases and action groups were invoked |
| knowledgeBaseLookupOutput | Retrieved document references with source names |
The preprocessor rationale is particularly useful. It shows the agent’s reasoning before it does anything: “User is asking about security compliance for remote workers. I should search the security policies knowledge base.” When the agent makes wrong routing decisions, this trace tells you why.
Source extraction from knowledge base lookups also matters for the user experience. The gateway collects source document names and appends them to the Slack response, so users can verify the information.
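The source-footer step is simple but easy to get wrong (duplicates, empty lookups). A sketch, assuming each retrieved reference carries an S3 URI in the shape the knowledge base lookup output uses:

```python
def append_sources(response_text, retrieved_refs):
    """Append a deduplicated source list to the Slack reply."""
    names = []
    for ref in retrieved_refs:
        # Assumed shape: ref["location"]["s3Location"]["uri"] names the document
        uri = ref.get("location", {}).get("s3Location", {}).get("uri", "")
        name = uri.rsplit("/", 1)[-1]
        if name and name not in names:
            names.append(name)
    if not names:
        return response_text  # no lookup happened; don't add an empty footer
    return response_text + "\n\n_Sources: " + ", ".join(names) + "_"
```

Deduplication matters because the agent often retrieves several chunks from the same document, and listing it three times erodes user trust rather than building it.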
Deployment
ECS task
The gateway runs as an ECS Fargate task. The resource requirements are minimal:
SlackGatewayService:
Type: AWS::ECS::Service
Properties:
Cluster: !Ref ECSCluster
TaskDefinition: !Ref TaskDefinition
DesiredCount: 1
DeploymentConfiguration:
MinimumHealthyPercent: 0
MaximumPercent: 200
DeploymentCircuitBreaker:
Enable: true
Rollback: true
A few decisions worth explaining:
MinimumHealthyPercent: 0. During deployment, the old task can stop before the new one starts. For a Slack bot, a few seconds of downtime during deploy is acceptable. The alternative (running two tasks simultaneously) creates duplicate message processing because both connect to the same Socket Mode WebSocket.
Circuit breaker with rollback. If the new task fails to start (bad image, missing env vars, crash loop), ECS automatically rolls back to the previous task definition. This prevents a bad deploy from leaving the bot permanently offline.
128MB RAM, 128 CPU units (container level). The gateway is I/O bound: it spends its time waiting for Slack WebSocket events and Bedrock HTTP responses, so CPU and memory usage are negligible. Note that the smallest Fargate task size is 0.25 vCPU with 512MB, so the task itself can’t be provisioned below that; the container-level reservation just documents the real footprint. If you’re running in a language with a larger runtime (Node.js, Python), you might need 256MB or 512MB, but the CPU allocation stays the same.
Health checks
Socket Mode bots don’t expose HTTP endpoints, so traditional health checks don’t apply. A simple process-level check works:
HealthCheck:
Command:
- CMD-SHELL
- kill -0 1
Interval: 5
Timeout: 2
Retries: 3
StartPeriod: 10
This checks whether PID 1 (the gateway process) is alive. It doesn’t verify that the Slack WebSocket is connected, but combined with Slack’s own reconnection logic and ECS’s restart policy, it’s sufficient.
For deeper health monitoring, log a heartbeat message periodically and alert on its absence (see alarms below).
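A heartbeat can be as small as one periodically logged line that a CloudWatch metric filter matches on, with an alarm set to treat missing data as breaching. A sketch (the log line and interval are arbitrary choices, not an ark convention):

```python
import logging
import threading

def heartbeat_message():
    """The exact line a CloudWatch metric filter can match on."""
    return "HEARTBEAT websocket_alive=true"

def start_heartbeat(interval_seconds=60):
    """Log the heartbeat on a fixed interval using a daemon timer,
    so it never blocks shutdown of the main gateway process."""
    def beat():
        logging.getLogger("gateway").info(heartbeat_message())
        timer = threading.Timer(interval_seconds, beat)
        timer.daemon = True
        timer.start()
    beat()
```

The PID check tells ECS the process exists; the missing-heartbeat alarm tells you the process is actually doing work.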
Secrets
Two categories of secrets, handled differently:
Slack tokens (bot token and app-level token) are injected via ECS task definition Secrets, which pulls from Secrets Manager at task start. The container sees them as environment variables. They never touch disk.
External API credentials (e.g., Jira) are read by the Lambda function at invocation time, not cached. The Lambda calls secretsmanager:GetSecretValue on each request. For low-throughput bots (under 100 requests per hour), this adds negligible latency and ensures credential rotation takes effect immediately.
SlackBotTokenSecret:
Type: AWS::SecretsManager::Secret
Properties:
Name: !Sub '${AWS::StackName}/slack/bot-token'
Description: 'Slack bot user OAuth token'
SecretString: 'PLACEHOLDER_UPDATE_ME'
After deploying the stack, update the placeholder with the real token:
aws secretsmanager put-secret-value \
--secret-id my-bot/slack/bot-token \
--secret-string "xoxb-your-actual-token"
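On the Lambda side, the uncached per-invocation read described above is only a few lines. A hedged sketch (the secret ID is a placeholder; the client is injected to keep the parsing testable):

```python
import json

def load_credentials(secrets_client, secret_id):
    """Fetch and parse API credentials on every invocation (no caching),
    so a rotated secret takes effect on the very next request."""
    raw = secrets_client.get_secret_value(SecretId=secret_id)["SecretString"]
    return json.loads(raw)

def handler(event, context):
    import boto3  # imported here so cold-start cost stays in the handler path
    creds = load_credentials(
        boto3.client("secretsmanager"),
        "my-bot/external/jira-credentials",  # placeholder secret name
    )
    # ... call the external API with creds ...
```

The trade-off is one extra Secrets Manager call per invocation, which is why this pattern only makes sense at the low request rates described above.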
Monitoring and alarms
Lambda errors
If your action group Lambda starts failing, the agent loses a capability. Users get vague responses instead of ticket data.
LambdaErrorAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
Namespace: AWS/Lambda
MetricName: Errors
Dimensions:
- Name: FunctionName
Value: !Sub '${AWS::StackName}-ticket-action-group'
Statistic: Sum
Period: 900
EvaluationPeriods: 1
Threshold: 0
ComparisonOperator: GreaterThanThreshold
TreatMissingData: notBreaching
AlarmActions:
- !Ref AlarmSNSTopic
This fires after a single error in any 15-minute window. Aggressive, but for a bot with low traffic, any Lambda error is worth investigating immediately.
Gateway health
Monitor the ECS service for task restarts:
TaskFailureAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
Namespace: ECS/ContainerInsights
MetricName: RunningTaskCount
Dimensions:
- Name: ServiceName
Value: !Ref SlackGatewayService
Statistic: Minimum
Period: 300
EvaluationPeriods: 2
Threshold: 1
ComparisonOperator: LessThanThreshold
TreatMissingData: breaching
If the running task count drops below 1 for two consecutive 5-minute periods, something is wrong. Combined with the circuit breaker, this catches both crash loops and deployment failures.
Cost model
Understanding costs helps you set expectations and optimize. Here’s what each component costs for a bot handling around 50 conversations per day.
| Component | Cost driver | Typical monthly cost |
|---|---|---|
| Bedrock Agent (Claude Haiku) | Input/output tokens | $5 to $30 |
| Knowledge base embeddings | Titan embed per sync | < $1 |
| S3 vector store | Storage + queries | < $1 |
| Firehose | GB ingested | < $1 |
| S3 (analytics) | GB stored | < $0.10 |
| Athena | TB scanned per query | < $0.50 |
| ECS Fargate | vCPU + memory hours | $3 to $5 |
| Secrets Manager | Per secret per month | $2 (4 secrets) |
| Lambda | Invocations | < $0.10 |
| Total | | $15 to $40 |
The dominant cost is Bedrock invocations. Everything else is negligible. Three ways to control it:
- Choose the right model. Claude Haiku is 10 to 20x cheaper than Sonnet or Opus. For most internal bot tasks, Haiku is sufficient.
- Watch response lengths. Long agent responses burn output tokens. The agent instructions can say “keep responses concise” and it actually works.
- Monitor knowledge base usage. If a knowledge base is rarely consulted, consider whether it’s pulling its weight. Each KB retrieval adds token cost from the retrieved chunks.
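To sanity-check the Bedrock line item, here is a back-of-envelope estimator. The token counts and per-million-token prices are assumptions for illustration; check current Bedrock pricing for your model and region:

```python
def monthly_bedrock_cost(conversations_per_day,
                         input_tokens_per_conv=3000,
                         output_tokens_per_conv=800,
                         price_in_per_mtok=0.25,    # assumed Haiku input price
                         price_out_per_mtok=1.25):  # assumed Haiku output price
    """Rough monthly spend. Input tokens include the agent instructions,
    conversation history, and retrieved KB chunks, which is why unused
    knowledge bases still cost money when they're consulted."""
    monthly = conversations_per_day * 30
    cost = (monthly * input_tokens_per_conv / 1e6) * price_in_per_mtok \
         + (monthly * output_tokens_per_conv / 1e6) * price_out_per_mtok
    return round(cost, 2)
```

At 50 conversations per day with these assumed figures, the estimate lands around a few dollars a month; longer prompts, deeper histories, and heavier retrieval push it toward the upper end of the table's range.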
Putting it all together
Here is the complete resource map for the system:
| Resource | Type | Purpose |
|---|---|---|
| ECS Service | Compute | Runs the gateway |
| Secrets Manager (x3) | Config | Slack tokens, API credentials |
| Bedrock Agent | AI | Foundation model + routing |
| Bedrock Agent Alias | AI | Stable deployment pointer |
| Knowledge Bases (x3) | AI | Domain-specific document search |
| Lambda | Compute | Action group handler |
| Firehose | Analytics | Event delivery |
| S3 (knowledge base) | Storage | Source documents |
| S3 (analytics) | Storage | Event archive |
| Glue Database + Table | Analytics | Schema for Athena |
| CloudWatch Alarms | Monitoring | Failure detection |
One CloudFormation stack. All resources reference each other through !Ref and !GetAtt. Deploy once, update incrementally. The agent alias gives you versioned deployments with rollback.
Wrap-up
The most valuable thing you build isn’t the bot itself. It’s the analytics pipeline that shows you what people actually ask, what the agent can’t answer, and where your documentation has gaps. That feedback loop improves both the bot and the organization it serves.
The total infrastructure cost for a team-sized bot runs $15 to $40 per month. The dominant cost is Bedrock invocations. Everything else rounds to zero.
The entire gateway layer described across this series is implemented in ark. It handles Socket Mode, session management, Bedrock Agent invocation, trace parsing, analytics publishing, and Slack formatting in a single binary. If you’re building something similar, it can save you a few weeks of plumbing.
Resources
- ark — open-source Slack gateway for AWS Bedrock Agents
- Kinesis Firehose documentation
- Athena query documentation
- ECS deployment circuit breaker
- Bedrock Agent traces
- AWS Secrets Manager