Published: October 23, 2025

Automating Secret Management

In October 2024, the Internet Archive got breached. The root cause was a GitLab API token that had been sitting exposed in a configuration file since December 2022 [1]. Two years. Unrotated. Someone found it, exfiltrated 7TB of data including 31 million user records, and walked out the front door. Then, weeks later, attackers came back and breached the Archive's Zendesk support platform using different tokens from the same initial exposure - ones the team still hadn't rotated after the first breach [1].

At that point it's not a sophisticated attack - it's just logging in.

The Snowflake breach earlier that year proved the pattern scales. Attackers bought credentials from infostealer malware markets - some dating back to 2020 - and just tried them [2]. No MFA, no rotation. 165 companies compromised, including Ticketmaster, AT&T, and Santander. AT&T paid $370,000 in ransom to make it stop. The entire attack required zero technical sophistication.

GitHub reported 39 million leaked secrets in 2024, up 67% year-over-year [3]. 70% of secrets leaked in 2022 were still valid two years later [4]. Stolen credentials are behind 22% of all breaches [5]. The industry knows it has a rotation problem.

Secret management starts simple: slap some API keys in a .env file and call it a day. Then you hit scale and you're juggling credentials across environments, manually rotating keys, praying nobody commits the wrong file.

We needed a system with clear goals: one place to define every secret, one command to rotate, and granular access control per environment. Local dev environments had to stay in sync without manual copying. Every secret lives in AWS Secrets Manager, and Terraform pushes updates everywhere else - ECS task definitions, CI/CD secrets, frontend environment variables. One change, propagated in minutes.

The Alternatives

1Password is great for team passwords. It's the wrong tool for infrastructure secrets - manual entry, no native Terraform provider, no bulk rotation, shared vault model instead of resource-based IAM. We needed to manage hundreds of secrets across multiple environments with zero human intervention. 1Password is a password manager. We needed a secrets pipeline.

HashiCorp Vault excels at dynamic secrets and fine-grained access policies. It's also a full distributed system you have to operate, upgrade, back up, and monitor. If you already run Vault, use Vault. We live in AWS and didn't want to adopt an entire platform to solve a plumbing problem.

Doppler and Infisical are developer-friendly SaaS options, but they move your secrets outside your AWS account boundary. For teams on Kubernetes, External Secrets Operator syncs from Secrets Manager (or Vault) directly into pods - worth evaluating if that's your stack.

We were already using Terraform to manage our AWS infrastructure, so extending it to handle secret distribution was a natural fit. We built on AWS Secrets Manager [6] with Terraform [7] as the distribution layer. The first question was how to organize the secrets themselves.

Five Buckets


We split secrets into five buckets, each stored as a JSON object in Secrets Manager with an {env} suffix. API secrets (api-secrets-{env}) hold the original credentials plus SHARED__* and internal alias mappings - everything the backend API needs. Each worker gets its own bucket (worker-a-secrets-{env} and worker-b-secrets-{env}) containing only the mapped subset it needs under SHARED__* and service-specific prefixes. Worker B's bucket includes S3 artifact credentials that Worker A doesn't need. Platform config (platform-config-{env}) covers deployment platform tokens and project identifiers. Dashboard secrets (dashboard-secrets-{env}) map API secrets to frontend naming conventions like NEXT_PUBLIC_*.
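Spelled out, the naming convention fans out per environment like this - a throwaway Python sketch for illustration, not something that exists in the repo:

```python
# Illustrative helper: enumerate the five Secrets Manager entries for one
# environment, using the {bucket}-{env} convention described above.
BUCKETS = [
    "api-secrets",
    "worker-a-secrets",
    "worker-b-secrets",
    "platform-config",
    "dashboard-secrets",
]

def secret_names(env: str) -> list[str]:
    """Return the full Secrets Manager entry names for one environment."""
    return [f"{bucket}-{env}" for bucket in BUCKETS]

# secret_names("dev") -> ["api-secrets-dev", ..., "dashboard-secrets-dev"]
```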

We originally had two buckets - backend and platform. The frontend developers kept asking for secrets under different names because Next.js expects NEXT_PUBLIC_DB_URL, not MY_APP__DB_SERVICE__URL. After the third Slack thread about mapping env vars, we added the dashboard bucket and translated at the infrastructure layer instead. When we split background processing into separate workers with different permission profiles, each got its own bucket - one worker needs S3 artifact access that the other doesn't.

Five buckets solved both the naming problem and the least-privilege problem. The harder question was how to fan out a single secret to all these consumers without drift.

One Source, Many Consumers

Most backend systems have multiple services that need overlapping secrets under different environment variable prefixes - an API, background workers, scheduled jobs. The naive approach is defining every secret N times. We tried that briefly and the drift was immediate. The pattern that actually works: define each secret once and let Terraform rename it per consumer.

The mapping definitions are straightforward. Left side is the canonical name in environment_secrets. Right side is what the application actually sees:

variable "shared_daemon_secret_mappings" {
  default = {
    # Database service
    "MY_APP__DB_SERVICE__KEY" = "SHARED__DB_SERVICE__KEY"
    "MY_APP__DB_SERVICE__URL" = "SHARED__DB_SERVICE__URL"
 
    # Third-party integrations
    "VENDOR_A_API_KEY" = "SHARED__INTEGRATIONS__VENDOR_A_API_KEY"
    "VENDOR_B_API_KEY" = "SHARED__INTEGRATIONS__VENDOR_B_API_KEY"
    "VENDOR_C_API_KEY" = "SHARED__INTEGRATIONS__VENDOR_C_API_KEY"
 
    # Cloud services
    "CLOUD_SERVICE_ACCOUNT_JSON" = "SHARED__INTEGRATIONS__CLOUD_SERVICE_ACCOUNT_JSON"
    "CLOUD_PROJECT_ID"           = "SHARED__INTEGRATIONS__CLOUD_PROJECT_ID"
    "CLOUD_REGION"               = "SHARED__INTEGRATIONS__CLOUD_REGION"
 
    # S3
    "S3_AWS_ACCESS_KEY_ID"     = "SHARED__S3__AWS_ACCESS_KEY_ID"
    "S3_AWS_SECRET_ACCESS_KEY" = "SHARED__S3__AWS_SECRET_ACCESS_KEY"
    "S3_AWS_REGION"            = "SHARED__S3__AWS_REGION"
    "S3_AWS_S3_BUCKET"         = "SHARED__S3__AWS_S3_BUCKET"
 
    # Backend API Client
    "MY_APP__AUTH_SERVICE__HANDSHAKE_KEY" = "SHARED__BACKEND_API_CLIENT__HANDSHAKE_KEY"
  }
}

One definition fans out to every consumer. When a secret changes, every mapping updates from the same source - no more "which environment has the stale key" conversations.

Each service also gets application-specific mappings for things like observability config, where different workers need their own prefixes:

variable "worker_secret_mappings" {
  default = {
    "MY_APP__LOGGING__LOG_API_KEY"    = "WORKER__LOGGING__LOG_API_KEY"
    "MY_APP__OTEL__ENABLED"           = "WORKER__OTEL__ENABLED"
    "MY_APP__OTEL__ENVIRONMENT"       = "WORKER__OTEL__ENVIRONMENT"
    "MY_APP__OTEL__OTEL_ENDPOINT"     = "WORKER__OTEL__OTEL_ENDPOINT"
    "MY_APP__OTEL__OTEL_INGESTION_KEY" = "WORKER__OTEL__OTEL_INGESTION_KEY"
  }
}

These mappings eventually merge inside Terraform modules.
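The rename itself is just a keyed copy: for each (canonical, consumer) pair, take the value if the canonical key exists and skip it otherwise. A Python model of that semantics - the function name and sample values are illustrative:

```python
def remap(source: dict[str, str], mappings: dict[str, str]) -> dict[str, str]:
    """Rename secrets per consumer.

    `mappings` is {canonical_name: consumer_name}. Canonical keys missing
    from `source` are skipped - the equivalent of Terraform's
    lookup(..., null) guard.
    """
    return {
        consumer: source[canonical]
        for canonical, consumer in mappings.items()
        if canonical in source
    }

secrets = {"MY_APP__DB_SERVICE__URL": "postgres://..."}
mapped = remap(secrets, {
    "MY_APP__DB_SERVICE__URL": "SHARED__DB_SERVICE__URL",
    "MY_APP__DB_SERVICE__KEY": "SHARED__DB_SERVICE__KEY",  # missing -> skipped
})
# mapped == {"SHARED__DB_SERVICE__URL": "postgres://..."}
```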

The API and workers need different merge strategies, and getting this wrong means silently overwriting production credentials. The API keeps its original secrets and adds SHARED__* versions alongside them. It needs both namespaces because it hosts the shared services layer that all applications consume:

locals {
  mapped_shared_secrets = {
    for api_key, shared_key in var.shared_secret_mappings :
    shared_key => lookup(var.environment_secrets, api_key, null)
    if lookup(var.environment_secrets, api_key, null) != null
  }
 
  mapped_api_internal_secrets = {
    for source_key, target_key in var.api_internal_secret_mappings :
    target_key => lookup(var.environment_secrets, source_key, null)
    if lookup(var.environment_secrets, source_key, null) != null
  }
 
  computed_shared_secrets = {
    "SHARED__BACKEND_API_CLIENT__BASE_URL" = var.environment == "dev"
      ? "https://api-dev.example.com"
      : "https://api.example.com"
  }
 
  final_api_secrets = merge(
    var.environment_secrets,            # Original MY_APP__* secrets
    local.mapped_api_internal_secrets,  # Internal namespace aliases
    local.mapped_shared_secrets,        # SHARED__* versions
    local.computed_shared_secrets       # Computed values
  )
}

The merge order matters - in Terraform, later arguments win on key collisions. The originals go first so that internal aliases and shared mappings can extend the namespace without accidentally overwriting core secrets. We validate that no mapping produces a key that already exists in environment_secrets, because if one slips through, the later map silently wins and the original value disappears. No warnings, no errors, just a broken service at 2am.
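That collision guard is a few lines to state. This Python version is a sketch - in practice the same check can live in a Terraform precondition or a CI step:

```python
def assert_no_collisions(environment_secrets: dict, *mappings: dict) -> None:
    """Fail fast if any mapping target collides with an original secret name.

    Without this check, merge() lets the later map win silently and the
    original value disappears.
    """
    originals = set(environment_secrets)
    for mapping in mappings:
        collisions = originals & set(mapping.values())
        if collisions:
            raise ValueError(
                f"mapping targets collide with originals: {sorted(collisions)}"
            )
```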

Workers are different - they only get renamed secrets, no originals. This was a deliberate choice, not a convenience. They don't run the API, so they don't need its namespace. They get SHARED__* and their own service-specific prefixes:

locals {
  mapped_secrets = {
    for api_key, daemon_key in var.secret_mappings :
    daemon_key => lookup(var.api_secrets, api_key, null)
    if lookup(var.api_secrets, api_key, null) != null
  }
 
  combined_secrets = merge(
    local.mapped_secrets,
    var.daemon_only_secrets
  )
 
  final_secrets = merge(
    local.combined_secrets,
    {
      "SHARED__BACKEND_API_CLIENT__BASE_URL" = var.environment == "dev"
        ? "https://api-dev.example.com"
        : "https://api.example.com"
      "WORKER__SQS_CLIENT__AWS_REGION" = var.region
      "WORKER__OTEL__SERVICE_NAME"     = "worker"
    }
  )
}

This enforces least-privilege at the secret level - workers can't accidentally reference secrets they shouldn't have, because those secrets simply aren't in their namespace. If a worker container gets compromised, the attacker sees only the mapped subset.

The root module wires everything together:

module "api_secrets" {
  source                       = "./modules/secrets"
  environment_secrets          = var.environment_secrets
  shared_secret_mappings       = var.shared_secret_mappings
  api_internal_secret_mappings = var.api_internal_secret_mappings
}
 
module "worker_secrets" {
  source          = "./modules/worker_secrets"
  api_secrets     = var.environment_secrets
  secret_mappings = merge(
    var.shared_secret_mappings,
    var.worker_secret_mappings
  )
  worker_only_secrets = var.worker_secrets
}

One secret definition fans out to multiple AWS Secrets Manager entries, each application getting exactly what it needs under the right prefix.

There's one more wrinkle. When two internal services share the same database but expect different environment variable names, you don't duplicate values - you alias internally:

variable "api_internal_secret_mappings" {
  default = {
    "MY_APP__DB_SERVICE__URL" = "MY_APP__AUTH_SERVICE__DB_URL"
    "MY_APP__DB_SERVICE__KEY" = "MY_APP__AUTH_SERVICE__DB_SECRET_KEY"
    "MY_APP__DB_ANON__KEY"    = "MY_APP__AUTH_SERVICE__DB_PUBLISHABLE_KEY"
  }
}

So you get the same credentials under different names with zero duplication. The merge module handles this in mapped_api_internal_secrets - the same for-expression with a lookup null guard.

Consuming Secrets

All of that Terraform plumbing exists so application code never thinks about secret distribution. On the Python side, a SharedSettings class consumes the SHARED__ prefixed secrets:

class SharedSettings(BaseSettings):
    integrations: IntegrationsSettings      # SHARED__INTEGRATIONS__*
    s3: S3Settings                          # SHARED__S3__*
    db_service: DBServiceSettings           # SHARED__DB_SERVICE__*
    backend_api_client: BackendApiClientSettings  # SHARED__BACKEND_API_CLIENT__*
 
    class Config:
        env_prefix = "SHARED__"

Every application creates a SharedSettings instance. The shared services layer reads from it without knowing which application it's running in - completely context-agnostic:

# Works identically in any service
class AppSettings(BaseSettings):
    shared: SharedSettings = SharedSettings()
 
service = MyService(
    db_client=db,
    shared_settings=settings.shared,
)

The same config interface works across all deployment contexts. The settings class is the chokepoint - any secret access goes through it rather than raw os.environ.get() calls scattered through the codebase.

The frontend has its own translation layer - Terraform maps backend secrets to whatever naming convention your deploy platform expects:

variable "frontend_secret_mappings" {
  default = {
    "MY_APP__DB_SERVICE__URL"            = "NEXT_PUBLIC_DB_URL"
    "MY_APP__DB_ANON__KEY"               = "NEXT_PUBLIC_DB_ANON_KEY"
    "MY_APP__AUTH_SERVICE__HANDSHAKE_KEY" = "API_HANDSHAKE_KEY"
    "VENDOR_A_API_KEY"                   = "VENDOR_A_API_KEY"
    "VENDOR_B_API_KEY"                   = "VENDOR_B_API_KEY"
  }
}

Mark production secrets as sensitive in your deploy platform so they're hidden from the dashboard UI. Keep dev secrets visible for debugging. Small tradeoff between security and developer experience, but a deliberate one.

One quirk we discovered the hard way: some secrets break preview deployments. A base URL override can conflict with the platform's dynamic preview URL, so we exclude specific keys from preview targets and let the platform provide its own value.

Distribution and naming are solved. But none of it matters if the secrets themselves aren't locked down.

Access Control

We started with a single IAM policy that gave developers read access to all Secrets Manager entries. It worked until someone on the team accidentally queried production secrets from a local script during a debugging session. Nothing broke, but the audit log lit up and we realized our access model had no environment boundaries.

The fix was deny-first IAM with environment tagging. Every resource in AWS gets an Environment tag - dev or production. Developer IAM policies explicitly deny all actions on resources tagged production, with no exception mechanism.

We also block all Secrets Manager write operations for developers entirely - secrets are provisioned by the platform team, and developers only consume them. Even the ability to tag or untag secrets is denied, which prevents developers from re-tagging a production secret as dev to bypass the controls.
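The deny side of the developer policy looks roughly like this - a sketch, where the Environment tag key and the blocked write actions match the text, but the Sids and the blanket Resource scope are illustrative:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyAllOnProductionTagged",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/Environment": "production" }
      }
    },
    {
      "Sid": "DenySecretsManagerWrites",
      "Effect": "Deny",
      "Action": [
        "secretsmanager:CreateSecret",
        "secretsmanager:PutSecretValue",
        "secretsmanager:UpdateSecret",
        "secretsmanager:DeleteSecret",
        "secretsmanager:TagResource",
        "secretsmanager:UntagResource"
      ],
      "Resource": "*"
    }
  ]
}
```

Because IAM evaluates explicit denies before any allow, no later policy attachment can quietly re-grant production access.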

At runtime, we split ECS into two IAM roles per service. The execution role handles startup - pulling container images, reading secrets from Secrets Manager, writing logs. The task role handles runtime - queue access, storage writes, publishing events. If a running container gets compromised, the attacker gets task permissions only, not the keys to pull images or read other services' secrets.

We also use valueFrom ARN pointers in ECS task definitions instead of injecting secret values directly. If someone dumps a task definition, they get ARN pointers, not credentials. ECS resolves the actual values at launch time through the execution role.
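In the task definition, that's the secrets array on the container - a fragment with placeholder account ID, region, and key names; the trailing ":json-key::" selector pulls a single key out of the JSON blob:

```json
"secrets": [
  {
    "name": "SHARED__DB_SERVICE__URL",
    "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:api-secrets-prod:SHARED__DB_SERVICE__URL::"
  }
]
```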

For CI/CD, OIDC federation eliminates long-lived AWS credentials entirely [8]. GitHub Actions gets temporary credentials scoped to each workflow run - they expire when the job finishes. No static keys to rotate, no secrets to leak.
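On the GitHub side this needs only the OIDC permission and the official credentials action - role ARN and region here are placeholders:

```yaml
permissions:
  id-token: write   # let the job request an OIDC token from GitHub
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
      aws-region: us-east-1
```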

Locking down access is half the problem. The other half is changing credentials without breaking production.

Rotation

Secret rotation used to mean updating five platforms manually, in order, praying you didn't typo a connection string. In March 2025, a Cloudflare engineer rotating credentials for R2 object storage omitted a single CLI flag - --env production - and pushed new credentials to dev instead of production [9]. When the old credentials were deleted, production lost authentication. Every R2 write operation globally failed for over an hour. Images, Stream, Vectorize, billing, email security - all cascaded from one missing flag.

That's what manual rotation looks like at a competent company with experienced engineers. The average organization takes 27 days to rotate an exposed secret after detection [10]. Not because they're lazy - manual rotation is terrifying. The fear of causing an outage feels more immediate than the risk of a breach.

Our rotation now:

  1. Update the secret in .tfvars
  2. terraform apply
  3. ECS tasks, CI/CD secrets, and frontend environment variables all update automatically

The whole thing takes minutes, leaves a full audit trail, and requires zero coordination.

Local development stays in sync through a CLI script that pulls from the same source:

# Backend .env
python sync_secrets.py dev --format env-api --dry-run
 
# Frontend .env
python sync_secrets.py dev --format env-dashboard --dry-run
 
# Infrastructure deployment
python sync_secrets.py prod --format tfvars
terraform apply -target=module.secrets -var-file="prod.secrets.tfvars"
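The interesting part of such a script is small: fetch the JSON blob for a bucket, then serialize it in the requested format. A minimal sketch of the formatting half - function names and exact output shapes are illustrative, not our actual script:

```python
import json

def to_env(secrets: dict[str, str]) -> str:
    """Serialize a secrets dict as .env lines (KEY="value"), sorted for stable diffs."""
    return "\n".join(f'{key}="{value}"' for key, value in sorted(secrets.items()))

def to_tfvars(secrets: dict[str, str]) -> str:
    """Serialize a secrets dict as a Terraform map variable assignment."""
    body = "\n".join(
        f"  {key} = {json.dumps(value)}" for key, value in sorted(secrets.items())
    )
    return "environment_secrets = {\n" + body + "\n}"

# In the real pipeline the input dict comes from Secrets Manager, e.g.:
#   client = boto3.client("secretsmanager")
#   raw = client.get_secret_value(SecretId=f"api-secrets-{env}")["SecretString"]
#   secrets = json.loads(raw)
```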

Per-developer isolated environments paid off - each developer gets their own queues and scoped IAM credentials. No sharing dev secrets across the team, no stepping on each other's environment variables.

No more "which environment has the old key" conversations. No more Slack threads asking who rotated what and when.

Open Problems

Terraform state files store secret values in plaintext by default - even values marked sensitive. That's a known industry-wide limitation, not specific to our setup. We mitigate it with an encrypted S3 backend (encrypt = true), DynamoDB state locking to prevent concurrent access, strict IAM policies on state access, and no local state files. The secrets are encrypted at rest and in transit, and only CI/CD roles can read the state.
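The backend configuration for those mitigations is short - bucket and table names here are placeholders:

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state"
    key            = "infra/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                    # server-side encryption for state at rest
    dynamodb_table = "terraform-state-lock"  # prevents concurrent applies
  }
}
```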

Terraform 1.10 introduced ephemeral resources that never persist to state, and 1.11 added write-only arguments for managed resources [11]. We're evaluating both - they'd eliminate the plaintext-in-state issue entirely.

Our pipeline also can't detect secrets baked into container images at build time - and GitGuardian found 100,000 valid secrets across 15 million public Docker Hub images [12]. Repositories using GitHub Copilot leak secrets at a 40% higher rate than non-Copilot repos [13], which means the pre-commit scanning we rely on is fighting an accelerating problem. A 2025 analysis of MCP servers found 53% using hardcoded credentials in plaintext JSON configs [14] - the same .env antipattern our pipeline was built to eliminate, showing up in new tooling.

The window between "credential exposed" and "credential exploited" is shrinking. Automated harvesting operations like EMERALDWHALE scan IP ranges for exposed .git/config and .env files at scale [15], extracting thousands of cloud credentials using commodity tools. That urgency is what makes the pipeline worth maintaining - not the automation itself, but what didn't happen. No production outages from stale credentials, no secrets committed to git, no rotation paralysis.


References


  1. BleepingComputer: "Internet Archive breached again through stolen access tokens", October 2024

  2. Cloud Security Alliance: "Unpacking the 2024 Snowflake Data Breach", 2025

  3. GitHub: "The next evolution of GitHub Advanced Security", 2025

  4. GitGuardian: "The State of Secrets Sprawl 2025", 2025

  5. Verizon: "2025 Data Breach Investigations Report", 2025

  6. AWS: "AWS Secrets Manager User Guide", Amazon Web Services Documentation, 2024

  7. HashiCorp: "Terraform AWS Provider - Secrets Manager", Terraform Registry, 2024

  8. GitHub: "Configuring OpenID Connect in Amazon Web Services", GitHub Docs

  9. Cloudflare: "Cloudflare incident on March 21, 2025", 2025

  10. GitGuardian: "The Hidden Challenges of Automating Secrets Rotation", 2024

  11. HashiCorp: "Terraform 1.11 brings ephemeral values to managed resources with write-only arguments", 2025

  12. GitGuardian: "Fresh From The Docks: Uncovering 100,000 Valid Secrets in DockerHub", 2025

  13. CSO Online: "AI programming copilots are worsening code security and leaking more secrets", 2025

  14. Trend Micro: "Beware of MCP Hardcoded Credentials", 2025

  15. Sysdig: "EMERALDWHALE", 2024

