Sat Feb 14 2026

From console clicks to Terraform — Quarkle's AWS journey

Three eras of running Quarkle on AWS — clicking around the ECS console, scripting EC2 with bash, and finally putting everything in Terraform. What made each transition necessary.

  • AWS
  • TERRAFORM
  • DEVOPS

Three eras

Quarkle's AWS setup went through three eras:

  1. Clickops — provisioning ECS, ALBs, ECR, and everything around them by hand in the AWS console.
  2. Bash — shell scripts wrapping the AWS CLI. For ECS, this meant the day-2 deploy flow (build image, push to ECR, update the service). For EC2, this meant both provisioning and deploys.
  3. Terraform — state in S3, modules per service, GitHub Actions deploys.

The interesting part isn't the end state. It's what made each transition necessary.

Era 1 — Clickops

The first version of the backend ran on ECS, and the way it got there was every wizard in the AWS console, in sequence. Create Cluster. Create ECR Repository. Create Application Load Balancer — pick subnets, pick a security group, attach a target group, paste the ARN of an ACM cert from another tab. Then ECS → Services → Create, pick the cluster, pick the task definition (which we'd just built in another wizard), bind the target group, click Create Service, watch for green.

The first deploy from a clean AWS account to a healthy ECS service serving HTTPS traffic took us 72 hours. Every time we got one piece working, the next wizard told us we'd configured the previous piece wrong — target group health check path didn't match the container, IAM execution role missing an ECR permission, subnets in the wrong AZ for the ALB, ACM cert in the wrong region.

We stayed on this for longer than was reasonable. Two things eventually broke it.

First was the security group. We needed to add a port. We went to the console, opened the SG, added a rule. A week later we needed to remember why we'd opened that port, and the console doesn't have a git log. We wrote it in a Notion doc, then forgot which Notion doc.

Second was when the scheduler EC2 instance died — kernel panic after some disk-pressure event. We went to launch a fresh one and had no record of what was on the old one: which IAM role, which volume size, which AMI. Half an hour of detective work comparing dropdowns to old screenshots.

The console was fine for creating things. It was terrible for re-creating them.

Era 2 — Bash scripts

The natural reaction wasn't Terraform. It was bash. Bash is just typing the AWS CLI commands you'd run anyway, in a file — nothing new to learn.

So we wrote a lot of it. New ECR repo: bash. New security group rule: bash. Spin up the scheduler EC2 instance with the right IAM role and user-data: bash. Deploy a new backend image to ECS: bash. bin/ got a script for every common operation, and quarkle-infra/bash-ec2/ had the EC2 lifecycle ones:

bin/push_to_prod.sh                             # build, push to ECR, force ECS rollover
quarkle-infra/bash-ec2/ec2_infra.sh             # provision the scheduler EC2 + IAM + SG
quarkle-infra/bash-ec2/ec2_startup.sh           # user-data: docker, clone, compose up
quarkle-infra/bash-ec2/update_ec2_container.sh  # day-2 rolling deploy via docker compose

The actual ECS deploy was small:

docker build -t quarkle-backend:latest .
docker tag quarkle-backend:latest \
  884237330161.dkr.ecr.us-east-1.amazonaws.com/backend-repo:latest
aws ecr get-login-password ... | docker login ...
docker push 884237330161.dkr.ecr.us-east-1.amazonaws.com/backend-repo:latest
aws ecs update-service --cluster backend-app --service quarkle-backend-service \
  --force-new-deployment --region us-east-1

The provisioning scripts had their own quirks. ec2_infra.sh had a literal sleep 20 # wait for ssh because that was faster to type than aws ec2 wait instance-status-ok. The user-data script kept the GitHub token in an S3-hosted .env file with bucket-level access. We knew. We left it.
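For comparison, the fixed sleep could be swapped for the AWS CLI's built-in waiter. This is a minimal sketch, not what ec2_infra.sh actually did: the function name wait_for_instance is hypothetical, and it assumes the script already has the instance ID in hand. Passing status checks is a stronger signal than a fixed delay, though it still doesn't strictly guarantee sshd is accepting connections.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: block until the instance passes its EC2 status checks.
# `aws ec2 wait instance-status-ok` polls on our behalf and exits non-zero
# if the checks have not passed by the time the waiter gives up.
wait_for_instance() {
  instance_id="$1"
  region="${2:-us-east-1}"   # default region matches the one used in this post
  aws ec2 wait instance-status-ok \
    --instance-ids "$instance_id" \
    --region "$region"
  echo "instance $instance_id passed status checks"
}
```

A provisioning script would call wait_for_instance "$INSTANCE_ID" right after launching the instance and before any SSH step.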

The real problems weren't in any specific script. They were structural.

Everything ran on a laptop. A docker push is the same whether it's done from a CI runner or someone's MacBook, except the MacBook has 1.5 MB/s up on hotel wifi. Provisioning a new SG or pushing a 1 GB image took as long as our internet allowed, and we were on the road a lot.

State was a text file we passed between us. resources.txt — a flat list of resource IDs — was the only record of what existed:

cluster_arn         = arn:aws:ecs:us-east-1:...:cluster/quarkle-backend
ecr_repo            = 884237330161.dkr.ecr.us-east-1.amazonaws.com/backend-repo
target_group_arn    = arn:aws:elasticloadbalancing:...:targetgroup/qb-tg/...
acm_cert_arn        = arn:aws:acm:us-east-1:...:certificate/...
prod_sg             = sg-0a1b2c3d4e5f6g7h8
scheduler_instance  = i-09abc123def456789

It lived in the repo, but the repo wasn't actually the source of truth for it. My cofounder's laptop was, or mine, depending on who'd most recently run a script. Half the time someone would create a resource and forget to commit. The other person's script would either fail looking up an ID that didn't exist yet, or worse, create a duplicate. We had Slack threads that were just one of us pasting a fresh resources.txt to the other.

Adding staging was the killer. We started splitting resources.txt into resources.prod.txt and resources.staging.txt, which is the kind of move that tells you the abstraction is broken. The deploy scripts had hardcoded cluster names, service names, and account IDs; duplicating them with a staging_ prefix and keeping two copies in sync was not going to scale.

That's when Terraform stopped being optional.

Era 3 — Terraform

Terraform did three things at once: real state with locking, the same code against staging and prod, and deploys reachable from CI instead of someone's laptop.

The layout splits along service boundaries:

quarkle-infra/
  terraform-ec2/tf/production/    # the scheduler (EC2)
  github_actions_user/            # CI's IAM user
backend-service/
  devops/tf/
    modules/                      # the reusable backend module
    production/                   # prod consumer
    staging/                      # staging consumer

The backend and WSS run on ECS Fargate. The scheduler stays on EC2 — it's cheaper for what it does.

Real staging and prod from one module

modules/main.tf holds everything for the backend service: ECS cluster, task definition, service, ALB, target group, ECR repo, S3 bucket for user uploads, CloudFront distribution, security group, IAM roles. production/main.tf and staging/main.tf are ~30-line wrappers that pass different CPU/memory/desired-count values.

This is the change resources.txt couldn't make. Staging and prod now share code, not just intent. Add a new env var, an IAM permission, an ALB rule — it goes into the module once, and both environments pick it up on the next apply.

The scheduler only runs in production, so it didn't get modularized. One config, no abstraction. Premature module structure has a real cost: every variable becomes a contract.

Remote state, single bucket, locked

terraform {
  backend "s3" {
    bucket         = "quarkle-terraform-state"
    key            = "quarkle-backend-service/production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

One state bucket, one lock table, separate keys per service-and-env. Boring and correct. The DynamoDB lock is what lets a GitHub Actions runner and our laptops apply against the same state without trampling each other. resources.txt got deleted in the same PR.

GitOps deploys via GitHub Actions

The day this became unavoidable was a Friday I spent at an airport. A customer hit a bug. The fix was small, the container was about a gig, and uploading it to ECR over airport wifi turned a 2–3 minute deploy into 30 minutes of staring at a progress bar with the customer waiting. The deploy was bash. The bash was on my laptop. The laptop was on airport wifi. None of that was the customer's problem.

We'd been talking about moving deploys off our machines for months. That afternoon was when it stopped being a "should." Remote, locked state is what made the move possible: once state no longer lived on a laptop, deploys didn't have to either.

quarkle-infra/github_actions_user/ provisions a GitHub Actions IAM user with the policies it needs (ECR, ECS, S3, Route53, ACM, Secrets Manager, CloudFront, Lambda). The rest of the stack assumes this user exists. This config creates it.

The deploy itself is a GitHub Actions workflow:

push to main          → tests → terraform plan → apply (prod)
push to staging       → tests → terraform plan → apply (staging)

The trick is that the ECS task definition's container image is a Terraform variable. Deploys are "change image_tag, apply." Terraform diffs the task def, ECS does the rolling update. No force-new-deployment calls, no separate state in the AWS console.
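The "change image_tag, apply" flow can be sketched as a CI step. This is an illustration under assumptions, not the actual workflow: deploy_image is a hypothetical name, the module is assumed to declare a variable called image_tag, and the image with that tag is assumed to be in ECR already.

```shell
#!/bin/sh
set -eu

# Hypothetical CI step: roll out a new image by changing one Terraform variable.
# The plan is saved to a file so that apply executes exactly what was planned.
deploy_image() {
  env_dir="$1"     # e.g. backend-service/devops/tf/production
  image_tag="$2"   # e.g. the git commit SHA that was pushed to ECR
  terraform -chdir="$env_dir" init -input=false
  terraform -chdir="$env_dir" plan -input=false -var "image_tag=$image_tag" -out=tfplan
  terraform -chdir="$env_dir" apply -input=false tfplan
}
```

Saving the plan with -out and applying that file keeps the reviewed diff and the applied change identical, which is what makes "diff the plan before it applies" meaningful in CI.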

push_to_prod.sh still exists as a human escape hatch — direct ECR push + aws ecs update-service --force-new-deployment — for the rare case where CI is broken and we need a hotfix in under five minutes. It bypasses Terraform, which means a follow-up terraform apply is needed to reconverge. We've used it twice.
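After an escape-hatch deploy, one way to check whether Terraform and the live service have diverged is the -detailed-exitcode flag on terraform plan. A rough sketch, assuming the environment directories from the layout earlier; check_drift is a hypothetical helper, not something in the repo.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: report whether live infrastructure differs from the
# Terraform config. `terraform plan -detailed-exitcode` exits 0 when there are
# no changes, 1 on error, and 2 when the plan contains changes.
check_drift() {
  env_dir="$1"
  rc=0
  terraform -chdir="$env_dir" plan -input=false -detailed-exitcode >/dev/null || rc=$?
  case "$rc" in
    0) echo "in sync" ;;
    2) echo "drift detected: run terraform apply to reconverge" ;;
    *) echo "terraform plan failed (exit $rc)" >&2; return "$rc" ;;
  esac
}
```

Running check_drift backend-service/devops/tf/production after a manual hotfix gives a quick yes/no before the follow-up apply.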

What I'd tell past-me

A few things, in retrospect.

The moment resources.txt appeared in the repo, we'd already admitted we needed a state file. We just hadn't said it out loud. If you find yourself keeping a hand-edited list of ARNs, you're already late to Terraform.

The reason Terraform feels heavier than bash isn't the HCL. It's that you have to think about state — what exists, what depends on what, what the desired end looks like. That's exactly the thinking that was missing in the earlier eras, and the thing you actually need. Once it exists, everything else — staging/prod parity, GitOps, drift detection — mostly falls out of it for free.

Don't modularize on the first caller. We had a half-finished modules/ directory we deleted when we realized we had one consumer. Modules are worth it when you have two real users, not before.

Quarkle is now several services across ECS Fargate and EC2, two environments, and a CI pipeline that diffs the plan before it applies. bash-ec2/ is still in the repo, untouched since the migration. We keep it partly as a museum, partly as a reminder of how long we went without state.