OTel Sidecars on Fargate
How many log streams do you have to search through every time you're investigating a bug? For us it was three. A user reported a slow response, so we searched Loggly for API errors - nothing stood out. Swap to the async daemon logs, filter by timestamp. Something processed, but slowly. Then the CPU daemon logs, hoping some id was logged so we could correlate across services (spoiler: it usually wasn't). In the end, we had three services, three log streams, and the only thing connecting them was our patience.
Loggly worked fine for structured JSON search. tag:prod AND level:ERROR AND json.user_id:"user123" found what you needed, even if the syntax was pretty alien to us. The problem was never finding a single log entry. The problem was connecting entries across three services to explain why a request failed. We swapped Loggly for OpenTelemetry distributed tracing with a sidecar architecture on ECS Fargate. That same slow response? Without the trace, we'd probably still be profiling the wrong service.
Traces Over Logs
A distributed trace is a tree of spans - each span represents a single operation, and child spans nest under parents across service boundaries. A simplified request across our three services:
HTTP POST /api/v1/users (root span - my-api)
├── validate_request (12ms)
├── check_permissions (8ms)
├── DB SELECT users (45ms)
│ └── Cache GET user:123 (2ms)
├── process_user_data (156ms)
│ ├── validate_data (23ms)
│ └── transform_data (133ms)
├── DB UPDATE users (67ms)
├── SQS SEND process_user_task (23ms)
│ └── async-daemon: process_user_task (2.3s)
│ ├── validate_message (15ms)
│ ├── HTTP POST email-service (450ms)
│ ├── SQS SEND compute_report (18ms)
│ │ └── cpu-daemon: compute_report (5.2s)
│ │ ├── load_data (890ms)
│ │ ├── process_computation (3.8s)
│ │ └── S3 PUT report.pdf (510ms)
│ └── DB UPDATE job_status (67ms)
└── HTTP Response 201 Created (2.5s total)
The CPU daemon's process_computation was the bottleneck at 3.8s. Not the API. Not the async daemon. Before OTel, we'd have blamed "slow API response" without easy visibility into the downstream chain.
Auto-instrumentation does most of the work for us now. Install the OpenTelemetry packages and every database query, HTTP call, SQS message, and S3 operation becomes a span[1]. Zero code changes to business logic.
One trace ID connects the entire request across the API, async daemon, and CPU daemon. Click any log entry from any service and see the full distributed trace.
The trace tree is useful on its own. But the architecture behind it - how telemetry gets from the app to a backend you can query, and how we chose to set it up - is what I'll be going through in this post.
Why Sidecars
We started off by just sending traces directly to SigNoz Cloud. Fast to set up, but tightly coupled. If we ever wanted to switch backends - and observability vendor lock-in is a real risk[2] - we'd have to change application code in every service. Coinbase famously racked up a $65 million annual Datadog bill and had to spin up a dedicated team just to evaluate migrating away[3]. We wanted that exit door built in from day one.
ECS Fargate rules out the standard DaemonSet approach - there's no host to install an agent on. AWS ships ADOT (their OTel distribution) as a managed sidecar, but we chose upstream otel/opentelemetry-collector-contrib for the wider exporter ecosystem and faster release cycle. A centralized collector would add network latency and considerably more infrastructure work on our end - per-task network isolation and scaling the collector fleet were problems I didn't want to take on. Sidecars were the remaining easy option[4].
The app sends telemetry to localhost:4317. The sidecar receives, batches, and forwards to SigNoz and S3. The application never knows about SigNoz Cloud or S3 - it just sends to localhost. When the task scales, sidecars scale with it.
This decoupling is the real benefit. Want to switch from SigNoz to Grafana Cloud? Change the collector config. Want to send traces to two backends simultaneously for evaluation? Add another exporter. Your application code never changes. Even Jaeger - the most widely-used tracing backend - rewrote itself as an OTel Collector in Jaeger 2.0[5]. The ecosystem is converging on this architecture.
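As a sketch of what that fan-out looks like (the Grafana exporter name and endpoint variable here are hypothetical; keys follow the upstream collector schema), evaluating a second backend is one extra exporter plus one entry in the pipeline:

```yaml
exporters:
  otlp/signoz:
    endpoint: ${SIGNOZ_ENDPOINT}
  otlp/grafana:                # hypothetical second backend under evaluation
    endpoint: ${GRAFANA_OTLP_ENDPOINT}
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/signoz, otlp/grafana]  # fan out - application code unchanged
```

The application keeps sending to localhost either way.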
So we had the architecture. But how does a trace actually survive crossing three separate Python services?
Context Propagation
We got middleware order wrong twice before it clicked. The first attempt had AuditMiddleware registered before TracingMiddleware - which meant the user context was set before a root span existed. Every span came through with user data but no parent trace. The second attempt reversed them but forgot that Starlette middleware is LIFO: the last middleware added runs first. We ended up with orphaned spans that looked correct individually but never connected into a trace.
The fix is simple: TracingMiddleware must create the root span first, then AuditMiddleware runs inside it to stamp user context:
if config.otel.enabled:
    app.add_middleware(TracingMiddleware, service_name=config.otel.service_name)
    # AuditMiddleware added after - runs inside TracingMiddleware's span
TracingMiddleware extracts W3C traceparent headers from incoming requests, generates or reads x-request-id, and creates a SERVER span:
class TracingMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        ctx = extract(carrier=request.headers)
        request_id = request.headers.get("x-request-id") or str(uuid.uuid4())
        set_request_context({"request_id": request_id, "method": request.method, "path": request.url.path})
        with self.tracer.start_as_current_span(
            f"{request.method} {request.url.path}",
            kind=trace.SpanKind.SERVER,
            context=ctx,
        ) as span:
            response = await call_next(request)
            span.set_attribute("http.status_code", response.status_code)
            return response
That gets you connected spans. But a trace without user context is surprisingly useless for debugging - you can see what happened, but not who it happened to. AuditMiddleware fixes that by decoding the JWT and stamping user_id and organisation_id on every span:
class AuditMiddleware:
    async def audit_request(self, request, call_next):
        user_id, org_id = self.extract_identity(request)
        # Set context vars (survive async boundaries)
        user_id_var.set(user_id)
        org_id_var.set(org_id)
        with tracer.start_as_current_span(span_name, kind=SERVER) as span:
            span.set_attribute("user.id", user_id)
            span.set_attribute("organisation.id", org_id)
The most valuable custom component is AuditSpanProcessor. It reads request context from a ContextVar and stamps every span automatically - including auto-instrumented child spans from httpx, SQS, and SQLAlchemy:
class AuditSpanProcessor(SpanProcessor):
    def on_start(self, span: Span, parent_context=None) -> None:
        try:
            request_ctx = request_context_var.get()
            if request_ctx:
                span.set_attribute("user.id", request_ctx.get("user_id", ""))
                span.set_attribute("organisation.id", request_ctx.get("organisation_id", ""))
                span.set_attribute("request.id", request_ctx.get("request_id", ""))
                span.set_attribute("http.method", request_ctx.get("method", ""))
                span.set_attribute("http.target", request_ctx.get("path", ""))
                span.set_attribute("client.address", request_ctx.get("client_ip", ""))
        except Exception as e:
            logger.debug(f"Failed to add audit context to span: {e}")
The try-except looks like defensive overkill until you hit the edge case. A span processor runs on every span creation - if it throws, it breaks auto-instrumented spans from libraries you don't control. We learned this when a None value in the request context crashed the SQLAlchemy instrumentor mid-query. The query still ran, but the span vanished from the trace. Tracing should never crash your application, so always wrap it in a try-except, log, and continue.
When the HTTPX instrumentor creates a child span for an external API call, AuditSpanProcessor fires and adds user/org context. When SQS instrumentation creates a producer span, same thing. Every span inherits request context without a single manual set_attribute call in business logic.
Charity Majors has argued that observability should answer questions about individual customer experience, not just aggregate metrics[6]. That's exactly what AuditSpanProcessor enables - filter by user_id and see every span from that user's request across all three services.
Context propagates through three layers, each solving a different problem:
def set_request_context(context_data: dict[str, Any]):
    # Layer 1: ContextVar for async-safe local access
    request_context_var.set(context_data)
    # Layer 2: Baggage for cross-service propagation
    for key, value in context_data.items():
        baggage.set_baggage(key, str(value))
    # Layer 3: Span attributes for SigNoz filtering
    span = trace.get_current_span()
    if span and span.is_recording():
        for key, value in context_data.items():
            span.set_attribute(f"context.{key}", str(value))
        span.add_event("request_context_set", attributes={...})
ContextVar survives async boundaries within a service. Baggage propagates across service boundaries via HTTP headers. Span attributes make everything filterable in SigNoz. You need all three - drop any layer and you lose visibility into part of the request path.
For outgoing requests, inject() handles the other direction - stuffing the current trace context into headers so the next service can continue the trace:
async def call_external_api(url: str):
    headers = {}
    inject(carrier=headers)  # adds traceparent header
    async with httpx.AsyncClient() as client:  # httpx.get() is sync - async calls need AsyncClient
        response = await client.get(url, headers=headers)
    return response
The traceparent header carries the trace across service boundaries:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
(format: version - trace-id, 16 bytes - span-id, 8 bytes - trace-flags)
At this point, traces were connecting across all three services. We were happy - until we noticed that half our logs were missing from those traces entirely.
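If you ever need to debug propagation by hand - say, logging raw incoming headers - the format is trivial to pick apart. A stdlib-only sketch (the helper name is mine, not from our codebase):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four dash-separated fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,        # "00" in the current spec
        "trace_id": trace_id,      # 32 hex chars = 16 bytes
        "span_id": span_id,        # 16 hex chars = 8 bytes
        "sampled": flags == "01",  # trace-flags: bit 0 is the sampled flag
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
# ctx["trace_id"] → "4bf92f3577b34da6a3ce929d0e0e4736", ctx["sampled"] → True
```

Handy for confirming that the trace-id a downstream service logs actually matches the one the upstream sent.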
Log Bridge
Getting Python logs into OTel traces is harder than the docs suggest. You need to hook into every logger hierarchy - not just your application logger:
def configure_logging_provider(settings, resource):
    loggers = [
        "myapp", "myapp.middleware",
        "uvicorn", "uvicorn.access", "uvicorn.error",
        "gunicorn", "gunicorn.access", "gunicorn.error",
        "fastapi", "sqlalchemy",
        "httpx", "httpcore",
        "boto3", "botocore",
    ]
    for name in loggers:
        logging.getLogger(name).addHandler(otel_handler)
    # Root logger catches everything else
    logging.root.addHandler(otel_handler)
    logging.root.setLevel(logging.INFO)
We initially only hooked our app logger and spent two days trying to figure out why SQLAlchemy queries and boto3 calls were invisible in traces. The fix was registering every framework and library logger explicitly. The root logger catches stragglers.
OTelLogHandler bridges Python logging to the OTel pipeline, mapping log levels to OTel severity and injecting trace correlation into every record:
class OTelLogHandler(LoggingHandler):
    def emit(self, record: logging.LogRecord) -> None:
        # Map Python levels to OTel severity
        record.severity_text, record.severity_number = self.SEVERITY_MAP[record.levelno]
        # Inject trace correlation
        span = trace.get_current_span()
        ctx = span.get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        # Pull request context from ContextVar
        request_ctx = request_context_var.get()
        record.request_id = request_ctx.get("request_id", "")
        record.user_id = request_ctx.get("user_id", "")
        record.organisation_id = request_ctx.get("organisation_id", "")
        super().emit(record)  # → OTLPLogExporter → sidecar → SigNoz + S3
Every log entry ends up with full correlation metadata:
{
  "timestamp": "2025-10-31T15:42:33.123Z",
  "level": "INFO",
  "logger": "myapp.api.users",
  "message": "User created successfully",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "request_id": "req_abc123",
  "user_id": "user_12345",
  "organisation_id": "org_789",
  "http_method": "POST",
  "http_path": "/api/v1/users"
}
Click the trace_id in SigNoz and you get the complete distributed trace. Filter by user_id and you see one customer's journey across all three services. That's the payoff - a logger.info() call in business logic now carries the full request context without the developer doing anything extra.
Logs were flowing into traces. Then Gunicorn forked a worker and everything deadlocked.
Instrumentation
Setup
The init function wires up the tracer provider, OTLP exporter to the sidecar on localhost:4317, BatchSpanProcessor, and the AuditSpanProcessor we built for context stamping:
def configure_opentelemetry(
    settings: OTelSettings, app=None, force_init: bool = False
) -> TracerProvider | None:
    # Guard against double-init (important for fork safety)
    current_provider = trace.get_tracer_provider()
    if not force_init and not isinstance(current_provider, trace.ProxyTracerProvider):
        return current_provider
    resource = Resource.create(settings.get_resource_attributes())
    tracer_provider = TracerProvider(resource=resource)
    # OTLP exporter → sidecar on localhost:4317
    otlp_exporter = OTLPSpanExporter(
        endpoint=settings.get_endpoint(),
        headers=settings.get_headers(),
        insecure=settings.should_use_insecure(),
    )
    tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    # Auto-stamp every span with user/org context
    tracer_provider.add_span_processor(AuditSpanProcessor())
    trace.set_tracer_provider(tracer_provider)
    set_global_textmap(TraceContextTextMapPropagator())
    configure_logging_provider(settings, resource)
    instrument_libraries(app)
The force_init guard matters more than it looks. Without it, Gunicorn's fork model causes double-initialization - the parent process sets up a tracer, the fork duplicates it, and the child tries to set up another. You get two BatchSpanProcessor threads competing for the same export queue. The guard checks for ProxyTracerProvider (the default uninitialized state) and bails out if a real provider already exists.
Configuration lives in a Pydantic model driven by environment variables:
class OTelSettings(BaseSettings):  # pydantic-settings BaseSettings - env_prefix is ignored on a plain BaseModel
    enabled: bool = Field(default=False)
    environment: Environment = Field(default=Environment.LOCAL)
    service_name: str = Field(default="my-api")
    service_version: str = Field(default="0.1.0")
    signoz_endpoint: str | None = None
    signoz_ingestion_key: SecretStr | None = None
    export_to_console: bool = False
    insecure: bool = True  # plain gRPC to sidecar
    sampling_rate: float = 1.0
    model_config = SettingsConfigDict(env_prefix="MY_APP__OTEL__")
MY_APP__OTEL__ENABLED=true turns it on. MY_APP__OTEL__SIGNOZ_ENDPOINT=http://localhost:4317 points to the sidecar. Auto-instrumentation hooks into every library:
def instrument_libraries(app=None):
    FastAPIInstrumentor.instrument_app(app)
    RequestsInstrumentor().instrument()
    HTTPXClientInstrumentor().instrument()
    AioHttpClientInstrumentor().instrument()
    LoggingInstrumentor().instrument(set_logging_format=False)
    Boto3SQSInstrumentor().instrument()
One gotcha: LoggingInstrumentor silently overwrites your custom log format with OTel's default. You must pass set_logging_format=False or you lose your structured JSON logs.
sampling_rate: float = 1.0 at the application level means the app sends everything. Sampling happens at the collector, not the app. This is deliberate - if you sample at the app, the collector never sees the dropped spans, which means you can't do tail-based sampling (keeping 100% of errors and slow requests while sampling normal traffic). Let the collector decide what to keep.
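The tail-based sampling itself would then live in the collector. A sketch of what that could look like with the contrib tail_sampling processor (thresholds are illustrative, not our production values):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s            # buffer spans until the trace completes
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 2000}
      - name: sample-the-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```

Policies are OR-ed: a trace survives if any policy keeps it, so errors and slow requests are retained at 100% while normal traffic is sampled down.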
SQS Propagation
For SQS message processing in the daemons, a custom span wraps each message to continue the trace:
class _MessageProcessingSpan:
    def __enter__(self):
        self.span_context = tracer.start_as_current_span(
            "process_sqs_message", kind=trace.SpanKind.CONSUMER
        )
        self.span = self.span_context.__enter__()
        self.span.set_attribute("messaging.system", "aws_sqs")
        self.span.set_attribute("messaging.message_id", self.message_id)
        set_request_context({"message_id": ..., "event_type": ..., "task_id": ...})
Traces kept breaking between the API and the async daemon, but only for certain message types. The simpler messages traced fine. Turns out SQS has a hard 10-attribute limit on message attributes, and OTel's SQS instrumentation stuffs traceparent in as one of those attributes. Our larger messages were already using 9 or 10 custom attributes, so trace context silently vanished.
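A cheap guard at send time would have caught this. A sketch (the names are mine; the one-slot reservation assumes the instrumentation only adds the traceparent attribute):

```python
MAX_SQS_MESSAGE_ATTRIBUTES = 10   # hard AWS limit per message
RESERVED_FOR_TRACE_CONTEXT = 1    # OTel injects `traceparent` as one attribute

def leaves_room_for_traceparent(custom_attributes: dict) -> bool:
    """True if OTel can still inject trace context without hitting the SQS limit."""
    used = len(custom_attributes) + RESERVED_FOR_TRACE_CONTEXT
    return used <= MAX_SQS_MESSAGE_ATTRIBUTES
```

Called before sqs.send_message, nine custom attributes pass and ten fail - exactly the case that silently dropped our trace context.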
In daemon processing, you rarely know all context upfront. A message arrives with an event type, but the task ID, workflow ID, and other details only emerge as processing progresses. We built update_span_context() to enrich spans incrementally:
def update_span_context(task_id=None, workflow_id=None, **additional_attrs):
    span = trace.get_current_span()
    if span and span.is_recording():
        if task_id:
            span.set_attribute("task.id", str(task_id))
        if workflow_id:
            span.set_attribute("workflow.id", str(workflow_id))
        for key, value in additional_attrs.items():
            if value is not None:
                span.set_attribute(key, str(value))
    # Also update the ContextVar so logs stay in sync
    current_context = request_context_var.get()
    updated = {**current_context, **({"task_id": task_id} if task_id else {})}
    request_context_var.set(updated)
Call it as context becomes available. Early in processing: update_span_context(task_id="123"). After resolving the workflow: update_span_context(workflow_id="456", video_duration="30s"). Both the span and the logging context stay in sync, so you can filter by workflow.id in SigNoz and see every span and log across every service that touched that workflow.
Manual Spans
One helper worth highlighting is record_exception:
@router.post("/process")
async def process_data(data: dict):
    with create_span("validate_input", {"data_size": len(str(data))}):
        if not data:
            raise ValueError("Empty data")
    with create_span("process_business_logic") as span:
        try:
            result = await complex_operation(data)
            span.set_attribute("result.success", True)
            return result
        except Exception as e:
            record_exception(span, e)
            raise
When a span catches an exception, it adds the stack trace as an event. In SigNoz, you can filter for spans with exceptions and see the full trace context around the failure.
Be deliberate about span attributes. Follow OTel semantic conventions and avoid two traps:
# Good: semantic attributes
span.set_attribute("http.method", "POST")
span.set_attribute("order.value", 129.99)
# Bad: PII in traces
span.set_attribute("user.password", password) # never
span.set_attribute("user.email", email) # high cardinality, explodes data volume
# Bad: meaningless names
span.set_attribute("data", value)  # not filterable
High-cardinality attributes (like email addresses) create millions of unique values in your backend, ballooning storage and slowing queries. Use user.id instead of user.email. Never put secrets or PII in spans - traces are shared debugging tools, not secure storage.
"Don't put PII in spans" is easy to say. Harder to enforce when auto-instrumentation captures everything. Our TracingMiddleware actively redacts sensitive data before it enters the trace:
# Sanitize query parameters
if request.query_params:
    for key, value in request.query_params.items():
        if key.lower() not in ["password", "token", "secret", "key", "auth"]:
            span_attributes[f"http.query.{key}"] = value
# Capture request body with redaction (POST/PUT/PATCH only, 1KB limit)
body = await request.body()
if body and len(body) < 1024:
    body_json = json.loads(body)
    for sensitive_key in ["password", "token", "secret", "key", "auth"]:
        if sensitive_key in body_json:
            body_json[sensitive_key] = "[REDACTED]"
    span.add_event("request.body", attributes={"body": json.dumps(body_json)[:500]})
request._body = body  # Reset body for the actual handler
The request._body = body line is easy to miss. Without it, reading the body in the middleware consumes it and the route handler gets an empty request. Reset it after capturing.
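One limitation of that redaction loop: it only checks top-level keys, so a nested payload like {"credentials": {"password": "..."}} sails through. A recursive variant (a sketch, not our production middleware) closes the gap:

```python
SENSITIVE_KEYS = {"password", "token", "secret", "key", "auth"}

def redact(value):
    """Recursively replace sensitive values in nested dicts and lists."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Dropping it in place of the flat loop costs a little CPU per request body but catches secrets at any nesting depth.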
Gunicorn
OTel initialization must happen after Gunicorn forks workers. Initializing before the fork causes BatchSpanProcessor deadlocks - the background thread in the parent process gets duplicated into child processes with a locked mutex7.
def post_fork(server, worker):
    from myapp.config import get_app_settings
    from myapp.observability.otel_config import init_telemetry_post_fork
    config = get_app_settings()
    if config.otel.enabled:
        otel_endpoint = config.otel.get_endpoint()
        # Sidecar needs ~2-5s to bind port 4317
        if "localhost" in otel_endpoint or "127.0.0.1" in otel_endpoint:
            time.sleep(4)
        init_telemetry_post_fork(config.otel)
That time.sleep(4) looks wrong but is necessary. The OTel collector sidecar starts in parallel with the application container. ECS waits for it to START (not HEALTHY - the collector image is minimal, no shell for health check commands). The sidecar needs 2-5 seconds to bind port 4317. Without the sleep, the first worker tries to send telemetry to a port that isn't listening yet.
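A fixed sleep trades startup time for certainty in both directions: too short and early telemetry drops, too long and every deploy pays the cost. An alternative worth considering is polling the port instead (stdlib-only sketch; the function name is mine):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Poll until host:port accepts TCP connections, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True  # sidecar is listening
        except OSError:
            time.sleep(0.25)  # not up yet - back off briefly and retry
    return False
```

In post_fork, wait_for_port("127.0.0.1", 4317) would replace the fixed sleep, returning as soon as the sidecar binds rather than always waiting four seconds.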
When running under uvicorn directly (local dev), initialization happens in create_app() instead:
if not os.environ.get("SERVER_SOFTWARE", "").startswith("gunicorn"):
    configure_opentelemetry(config.otel, app=app, force_init=True)
else:
    logger.info("Running under Gunicorn - OTel will init in post_fork")
At this point, three services were producing traces. The question was where to send them without creating a new single point of failure.
Collector Config
Pin your collector version. The otel/opentelemetry-collector-contrib image has had breaking config changes between minor versions - the logging exporter was renamed to debug in v0.86.0, for example. A latest tag on a Tuesday morning can break your entire observability pipeline.
The collector YAML is shared across all three services:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317
      http:
        endpoint: 127.0.0.1:4318
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 25
  batch:
    timeout: 10s
    send_batch_size: 1000
    send_batch_max_size: 1000
  resourcedetection:
    detectors: [env, ecs, ec2, system]
    timeout: 2s
    override: false
  resource:
    attributes:
      - key: cloud.provider
        value: aws
        action: upsert
      - key: deployment.environment
        value: ${ENVIRONMENT}
        action: upsert
exporters:
  otlp/signoz:
    endpoint: ${SIGNOZ_ENDPOINT}
    headers:
      signoz-ingestion-key: ${SIGNOZ_INGESTION_KEY}
    tls:
      insecure: false
    compression: gzip
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
  awss3:
    s3uploader:
      region: ${AWS_REGION}
      s3_bucket: ${S3_LOGS_BUCKET}
      s3_prefix: otel-logs/${ENVIRONMENT}
    marshaler: otlp_json
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch, resource]
      exporters: [otlp/signoz, awss3]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch, resource]
      exporters: [otlp/signoz]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch, resource]
      exporters: [otlp/signoz]
Pipeline ordering matters. The memory limiter runs first - it prevents OOM kills at 75% of the sidecar's 512 MB allocation before any processing happens. Resource detection auto-discovers ECS metadata (task ARN, cluster name, region), so every span gets cloud context without configuration. The batch processor collects up to 1000 spans before flushing, or flushes every 10 seconds, balancing cost (fewer S3 PutObject calls) with latency.
One interaction took us a while to figure out: both resourcedetection and resource processors can set the same attribute keys. Our manually-set deployment.environment kept getting overwritten by the auto-detected one. The fix was override: false on resourcedetection, which means manually-set values win. If you're debugging "why isn't my cluster name showing up," that flag is probably the answer.
Logs go to both SigNoz and S3. Traces go to SigNoz only. Traces are ephemeral debugging data; logs are the compliance audit trail. Storing traces in S3 would be expensive with limited value.
The SigNoz exporter uses compression: gzip. In our testing, gzip achieves roughly 70-80% compression on trace data. With 100 GB/day, the compressed payload drops to around 25-30 GB, meaningfully reducing both egress costs and SigNoz ingestion volume.
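The back-of-envelope math, assuming the 75% midpoint of that range:

```python
daily_gb = 100
compression_ratio = 0.75                       # gzip removes ~70-80% on trace JSON
compressed_gb = daily_gb * (1 - compression_ratio)
# compressed_gb == 25.0 GB/day actually crosses the wire
```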
Deployment
The sidecar is a reusable Terraform module. Each service gets an identical collector with its own CloudWatch log group, IAM policy, and container definition:
locals {
  otel_collector_container = {
    name              = "otel-collector"
    image             = var.otel_collector_image
    essential         = true
    cpu               = 256
    memory            = 512
    memoryReservation = 256
    command           = ["-config=env:OTEL_CONFIG"]
    environment = [
      { name = "OTEL_CONFIG", value = file("${path.module}/collector-config.yaml") }
    ]
    secrets = [
      { name = "SIGNOZ_ENDPOINT", valueFrom = "${var.secrets_arn}:<endpoint_key>::" },
      { name = "SIGNOZ_INGESTION_KEY", valueFrom = "${var.secrets_arn}:<ingestion_key>::" }
    ]
  }
}
The essential = true flag means ECS replaces the entire task if the sidecar crashes. We embed the collector YAML via file() rather than Parameter Store - config changes are infrequent and should be version-controlled. The YAML is 3.8 KB against a 4 KB environment variable limit. If it grows past that, we'll migrate to Parameter Store.
The dependency block ensures the application container waits for the sidecar to start:
output "dependency_block" {
  value = [{
    containerName = "otel-collector"
    condition     = "START"  # Not HEALTHY - minimal image, no shell for curl
  }]
}
Resource allocation: 256 CPU units and 512 MB memory handles up to ~5000 spans/second per task. Use Container Insights to tune - if you're consistently under 50% utilization, reduce the allocation. If you're above 80%, bump memory to 1024 MB.
One request, three services, one unified trace. The API enqueues a job to SQS with trace context in the message headers. The async daemon picks it up and continues the trace. If that job triggers computation, the CPU daemon inherits the same trace.
Local Dev
Docker Compose mirrors the sidecar pattern locally:
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.138.0
    command: ['-config=/etc/otel-collector-config.yaml']
    volumes:
      - ./otel/collector-config.yaml:/etc/otel-collector-config.yaml:ro
    ports:
      - '4317:4317' # OTLP gRPC
      - '4318:4318' # OTLP HTTP
      - '13133:13133' # Health check
    environment:
      - SIGNOZ_ENDPOINT=${SIGNOZ_ENDPOINT}
      - SIGNOZ_INGESTION_KEY=${SIGNOZ_INGESTION_KEY}
      - S3_LOGS_BUCKET=${S3_LOGS_BUCKET:-my-otel-logs-dev}
      - ENVIRONMENT=local
Same collector config as production. Same dual export pipeline. Set MY_APP__OTEL__ENABLED=true and MY_APP__OTEL__SIGNOZ_ENDPOINT=http://localhost:4317 in your .env and traces flow identically to production.
For quick testing without external dependencies, console export dumps traces to stdout:
export MY_APP__OTEL__ENABLED=true
export MY_APP__OTEL__EXPORT_TO_CONSOLE=true
You'll see each span printed with its trace context and attributes:
{
  "name": "POST /api/v1/users",
  "context": {
    "trace_id": "0x4bf92f3577b34da6a3ce929d0e0e4736",
    "span_id": "0x00f067aa0ba902b7"
  },
  "parent_id": null,
  "kind": "SpanKind.SERVER",
  "status": { "status_code": "OK" },
  "attributes": {
    "http.method": "POST",
    "http.status_code": 201,
    "user.id": "user_12345"
  }
}
Useful for verifying parent-child relationships before deploying. If parent_id is null on a span that should be a child, your context propagation is broken. Catch these issues locally instead of discovering them in production.
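Checking that by eye gets old. A small stdlib helper (my sketch, keyed to the console exporter's JSON shape shown above) flags spans whose parent never made it into the export:

```python
def find_orphan_spans(spans: list[dict]) -> list[str]:
    """Names of spans whose parent_id refers to no exported span."""
    exported_ids = {s["context"]["span_id"] for s in spans}
    return [
        s["name"]
        for s in spans
        if s["parent_id"] is not None and s["parent_id"] not in exported_ids
    ]
```

Feed it the parsed console output and an empty list means propagation is intact; anything else names the spans to investigate.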
Running It
Applications and sidecars use separate secret namespaces. Apps get service-specific keys pointing to localhost:
MY_API__OTEL__SIGNOZ_ENDPOINT=http://localhost:4317
Sidecars get shared keys pointing to the real endpoint:
OTEL__SIDECAR__SIGNOZ_ENDPOINT=https://<your-signoz-endpoint>
OTEL__SIDECAR__SIGNOZ_INGESTION_KEY=<key>
Two-hop architecture. The application sends to localhost. The sidecar forwards to SigNoz Cloud and S3. If an application container is compromised, the attacker only sees localhost:4317. Production observability credentials live exclusively in the sidecar's environment.
Using insecure: true for the app-to-sidecar connection is correct here. Traffic never leaves the ECS task boundary. ECS awsvpc mode provides network isolation per task - each task gets its own elastic network interface. TLS overhead for same-task communication is unnecessary. Real TLS protection happens at the sidecar-to-SigNoz boundary.
The sidecar's IAM policy follows least privilege - write-only S3 permissions scoped to a single bucket per environment. Specifically: s3:PutObject and s3:GetBucketLocation only. No s3:DeleteObject (prevents log tampering), no s3:GetObject (no read needed), no s3:ListBucket (no enumeration). If a sidecar is compromised, attackers can only append logs. They can't delete evidence or read existing data.
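As a concrete sketch of that policy (the bucket name is borrowed from the dev compose default; the real document is generated by Terraform), it amounts to:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::my-otel-logs-dev",
        "arn:aws:s3:::my-otel-logs-dev/*"
      ]
    }
  ]
}
```

GetBucketLocation applies to the bucket ARN, PutObject to the objects under it; nothing grants read, list, or delete.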
We pull all secrets from AWS Secrets Manager through a Terraform mapping layer that renames them per service.
Failure Modes
When the sidecar crashes, essential: true means ECS replaces the entire task within 30-60 seconds. Multiple tasks provide redundancy - other healthy tasks continue serving while the replacement starts. You lose telemetry from the crashed task during recovery, but no application impact.
When S3 export fails, the collector retries with exponential backoff (5s initial, 30s max). After 5 minutes of continuous failure, it drops the batch and fires a CloudWatch alarm. SigNoz still has a 30-day copy of everything, so you're covered for active debugging.
When SigNoz goes down, retries run up to 300 seconds max. The memory limiter prevents the retry queue from causing OOM. S3 export continues independently. You can query S3 with Athena during outages - slower but functional.
When memory pressure spikes from high traffic, the limiter triggers at 384 MB (75% of 512 MB) and drops the oldest batches first. Autoscaling adds tasks to distribute load across more sidecars.
A CloudWatch metric filter catches S3 issues early:
resource "aws_cloudwatch_log_metric_filter" "s3_upload_errors" {
  name           = "otel-s3-upload-errors-${var.service_name}-${var.environment}"
  log_group_name = aws_cloudwatch_log_group.otel_collector.name
  pattern        = "[time, request_id, level=ERROR*, msg=\"*S3*\" || msg=\"*upload*\"]"
  metric_transformation {
    name      = "OTelS3UploadErrors"
    namespace = "CustomLogs/${var.service_name}"
    value     = "1"
  }
}
Monitoring
The collector exposes three endpoints for debugging itself:
- :13133/health - returns 200 if the collector is running
- :55679 - ZPages debug views of pipelines and span counts (port-forward to access)
- :8888/metrics - Prometheus-format collector metrics
In CloudWatch, look for "Everything is ready. Begin running and processing data." on startup. If it's missing, check for "error":"failed to get secret" (Secrets Manager key name mismatch), "connection refused" (bad SigNoz endpoint), or "Unauthorized" (wrong ingestion key).
Each service's sidecar behaves differently under load. The CPU daemon has the highest memory usage because its computations produce long-running spans (5-60 seconds). The async daemon has the highest throughput from hundreds of concurrent jobs. The API has the most requests but the shortest spans. Size your sidecar memory accordingly - the CPU daemon may need 1024 MB while the API gets by with 512 MB.
Troubleshooting
When traces aren't appearing, check in order:
# 1. Is the sidecar running?
aws ecs describe-tasks --tasks <task-arn> \
| jq '.tasks[].containers[] | select(.name=="otel-collector")'
# 2. Is it healthy?
aws logs tail /aws/ecs/otel-collector-myapp-dev --follow

For S3 upload failures (CloudWatch alarm firing but SigNoz still receiving data):
aws logs filter-log-events \
--log-group-name /aws/ecs/otel-collector-myapp-dev \
--filter-pattern "S3" \
  --start-time $(date -u -d '1 hour ago' +%s)000

Common errors: AccessDenied (IAM policy missing), NoSuchBucket (wrong bucket name), SlowDown (S3 rate limiting - reduce batch frequency).
If the sidecar keeps getting OOMKilled, the culprit is usually traffic spikes accumulating batches faster than they export. Increase memory to 1024 MB, or reduce send_batch_size to 500.
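If you take the batch-size route, it's a one-line tweak to the collector's batch processor - a sketch with our values (8192 is the collector's default):

```yaml
processors:
  batch:
    send_batch_size: 500   # down from the default 8192: smaller batches, less buffered memory
    timeout: 5s            # flush at least every 5 seconds even when a batch is under-filled
```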
The Math
The dual export to S3 pulls double duty - debugging data and SOC 2 compliance. S3 export satisfies Common Criteria CC7.2 (monitoring controls) and CC7.3 (evaluation and communication)8:
- Complete logs from all system components
- Immutable audit trail (S3 versioning enabled)
- 90+ day retention via lifecycle policies
- AES-256 encryption at rest, TLS 1.2+ in transit
- Every log stamped with trace_id, user_id, org_id, and timestamp
- CloudWatch logs from sidecars encrypted with KMS
SigNoz provides 30-day real-time analysis. S3 Standard provides 90-day warm storage queryable via Athena. Glacier Deep Archive handles long-term compliance at a fraction of the cost.
Performance overhead is minimal - 1-2ms per request for tracing, with asynchronous log shipping that never blocks the application. BatchSpanProcessor accumulates spans in memory and flushes on a background thread, so the application thread returns immediately after creating a span.
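The pattern is simple enough to sketch in plain Python. This is a conceptual stand-in for what BatchSpanProcessor does - a queue drained by a daemon thread - not the real OTel API:

```python
import queue
import threading


class BatchExporter:
    """Conceptual sketch of batch-and-flush on a background thread (not the OTel API)."""

    def __init__(self, export_fn, batch_size=3):
        self.q = queue.Queue()
        self.export_fn = export_fn
        self.batch_size = batch_size
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_end(self, span):
        # Called on the application thread: enqueue and return immediately.
        self.q.put(span)

    def _run(self):
        # Background thread: accumulate spans and flush full batches.
        batch = []
        while True:
            span = self.q.get()
            if span is None:          # shutdown sentinel: flush the remainder
                if batch:
                    self.export_fn(batch)
                return
            batch.append(span)
            if len(batch) >= self.batch_size:
                self.export_fn(batch)
                batch = []

    def shutdown(self):
        self.q.put(None)
        self._worker.join()


exporter = BatchExporter(export_fn=lambda b: print(f"flushed {len(b)} spans"))
for i in range(7):
    exporter.on_end(f"span-{i}")   # returns immediately; flushing happens off-thread
exporter.shutdown()
```

The application thread only ever pays for a `Queue.put` - the network cost of exporting lives entirely on the worker thread, which is why the per-request overhead stays in the low milliseconds.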
For 100 GB/day of logs:
- S3 Standard (90 days): $207/month
- Glacier Deep Archive (after 90 days): $36/month
- S3 PUT requests: $15/month
- SigNoz Cloud ingestion: $150-300/month
- CloudWatch Logs (sidecars): $5/month
Total: ~$413-563/month for complete observability plus SOC 2 compliance.
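The S3 Standard line item checks out as a back-of-the-envelope calculation, assuming ~$0.023/GB-month for S3 Standard (us-east-1 pricing at the time of writing):

```python
# A 90-day retention window at 100 GB/day reaches a steady state of 9,000 GB:
# each day one day's logs age into Glacier as a new day's logs arrive.
daily_gb = 100
retention_days = 90
steady_state_gb = daily_gb * retention_days     # 9,000 GB held in S3 Standard

s3_standard_per_gb_month = 0.023                # assumed us-east-1 rate
monthly_cost = steady_state_gb * s3_standard_per_gb_month

print(round(monthly_cost))  # 207
```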
Compare that to Datadog at $15-25/GB, New Relic at $10-20/GB, or Splunk at $150-200/GB9. Goldman Sachs downgraded Datadog to "Sell" in early 2026, citing a "pincer movement" of rising competition and customer budget fatigue10. The open-source observability ecosystem is eating into the proprietary vendors' share for a reason - according to a Middleware survey, 96% of organizations are actively trying to reduce observability spend11.
Migration
We ran both systems in parallel during the transition:
if config.otel.enabled:
    configure_opentelemetry(config.otel, app=app)
else:
    setup_legacy_logging(config.logging)

Test OTel in dev while keeping Loggly (or whatever else you use) as your production safety net. When confidence is high, flip the flag. We ran both in parallel for two weeks before cutting over.
Adding tracing to a new service means installing packages and pointing to the collector. Database queries, HTTP calls, SQS operations - all traced without touching business logic. When we add our fourth service, it inherits the complete observability stack automatically.
What's Still Imperfect
The time.sleep(4) is still in production. It works, but it's a hack - we'd rather have a proper health check or readiness probe. The collector YAML is at 3.8 KB against a 4 KB env var limit, so Parameter Store migration is coming. Tail-based sampling is configured but not yet tuned - we're still sending 100% of everything, which is fine for our current volume but won't scale forever. And debugging the observability system itself is its own challenge. When the sidecar is the thing that breaks, you're back to CloudWatch logs.
None of that matters as much as the result: a user reports something slow, and instead of guessing which of three services to blame, we open one trace and see the full story.
References
Footnotes
1. OpenTelemetry Community: "OpenTelemetry Python Auto-Instrumentation", OpenTelemetry, 2025
2. OneUptime: "OpenTelemetry: Your Escape Hatch from the Observability Cartel", OneUptime, 2025
3. Gergely Orosz: "Datadog's $65M/year customer mystery solved", The Pragmatic Engineer, 2024
4. SigNoz Team: "ECS Sidecar Collection", SigNoz Documentation, 2025
5. CNCF: "Jaeger v2 Released: OpenTelemetry at the Core", CNCF Blog, 2024
6. Charity Majors: "Observability is a Many-Splendored Definition", charity.wtf, 2020
7. OpenTelemetry Community: "Working With Fork Process Models", OpenTelemetry Python Docs, 2025
8. AICPA: "SOC 2 Trust Services Criteria", AICPA, 2024
9. SigNoz Team: "SigNoz vs Datadog", SigNoz, 2025
10. SiliconValley.com: "Observability Outlook 2026", Market Minute, 2026
11. Middleware: "Tips to Reduce Observability Costs", Middleware Blog, 2025