I usually do a year-end post about the architectural shifts the team made over the past twelve months. 2020 was about getting things to the cloud at all because the office didn't exist anymore. 2021 was about growing up — moving from "we run servers, but in GCP" to actually engineering for reliability.
The transition from "ops" to "SRE" is a buzzword if you let it be one. For us it was a real and uncomfortable change.
Why the old way had to go
Early 2021 looked like this:
- Ticket-driven work. Developers shipped code over a wall. When it broke, an Ops ticket got filed.
- Alert fatigue. PagerDuty fired on CPU > 60%, disk > 80%, single pod restarts. People were waking up at 3am for things that had already self-healed by the time they opened the laptop.
- Hero culture. There was one person on the team (often me) who knew the exact script to run when the database locked up. That's not a system; that's a bus factor of one.
What we were doing was toil in the SRE sense — manual, repetitive, automatable, no enduring value, scaling linearly with the service.
The shift was deliberate: treat operations as a software problem and stop accepting toil as the cost of doing business.
Pillar 1: no console changes
The first house rule we adopted: no production change happens in the console. If you can't reproduce it from code, it doesn't exist.
We rebuilt the GCP organization around the Cloud Foundation Toolkit. Every project, every GKE cluster, every firewall rule lives in a Terraform module. New environments are folder-and-tfvars exercises rather than two-week console projects.
Where this paid off concretely: when Log4j hit a couple of weeks ago, we didn't have to SSH around looking for vulnerable JARs. We queried Terraform state and Artifact Registry to know what was running where. The blast radius was knowable in hours instead of days.
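As a rough illustration of that kind of inventory query, the Python sketch below walks the JSON that `terraform show -json` emits and lists the address of every resource of a given type. The helper names, the `workdir` argument, and the choice of `google_container_cluster` as the type to look for are assumptions made for the example; treat it as a sketch, not the exact tooling used.

```python
import json
import subprocess
from typing import Iterator, List


def iter_resources(module: dict) -> Iterator[dict]:
    """Yield every resource in a module and, recursively, its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_resources(child)


def resources_of_type(state: dict, resource_type: str) -> List[str]:
    """Return sorted resource addresses of one type from `terraform show -json` output."""
    root = state.get("values", {}).get("root_module", {})
    return sorted(
        r["address"] for r in iter_resources(root) if r.get("type") == resource_type
    )


def load_state(workdir: str) -> dict:
    """Run `terraform show -json` in an initialized working directory (assumed)."""
    result = subprocess.run(
        ["terraform", "show", "-json"],
        cwd=workdir, check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Hypothetical path; point it at a real, initialized Terraform directory.
    for address in resources_of_type(load_state("./envs/prod"), "google_container_cluster"):
        print(address)
```

This only answers "which clusters exist and where"; what is inside each image still has to come from the registry and its scan results.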
Pillar 2: SLIs, SLOs, and burn-rate alerts
This was the hardest cultural shift, because it required telling management that "the server is up" is not the same as "the user is happy."
We defined Service Level Indicators per service — latency, error rate, traffic volume, saturation — and Service Level Objectives that expressed what "good enough" looks like. Cloud Monitoring's SLO widgets do most of the heavy lifting once you've defined the SLIs.
The alerting rule changed too. Instead of paging on CPU thresholds, we page on error budget burn rate: alert if we have burned 2% of the monthly error budget in the last hour (on a 30-day window, that is a burn rate of 14.4). That single change silenced about 90% of our pages. CPU at 95% with healthy latency and zero errors? The user is fine. Don't wake anyone up.
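The arithmetic behind a burn-rate page is short enough to sketch in Python. This assumes a 30-day SLO window and a 99.9% target; the function names are made up for the example and are not a Cloud Monitoring API.

```python
WINDOW_HOURS = 30 * 24  # assumption: 30-day rolling SLO window


def burn_rate_threshold(budget_fraction: float, lookback_hours: float,
                        window_hours: float = WINDOW_HOURS) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `lookback_hours`.

    A burn rate of 1.0 spends exactly the whole budget over the full window.
    """
    return budget_fraction * window_hours / lookback_hours


def observed_burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than the sustainable rate the budget is being spent."""
    allowed_error_ratio = 1.0 - slo_target
    return error_ratio / allowed_error_ratio


def should_page(error_ratio: float, slo_target: float = 0.999,
                budget_fraction: float = 0.02, lookback_hours: float = 1.0) -> bool:
    """Page only when the last hour's error ratio burns budget too fast."""
    threshold = burn_rate_threshold(budget_fraction, lookback_hours)
    return observed_burn_rate(error_ratio, slo_target) >= threshold


if __name__ == "__main__":
    print(round(burn_rate_threshold(0.02, 1.0), 1))  # 2% of budget in 1 hour
    print(should_page(error_ratio=0.02))   # 2% of requests failing for the hour
    print(should_page(error_ratio=0.005))  # 0.5% failing: over the SLO, but a slow burn
```

Note that CPU never appears as an input: the decision depends only on what users experienced over the lookback window.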
Pillar 3: GKE Autopilot for stateless services
I covered this in more depth in the Autopilot post, but the SRE angle is the relevant one here. Managing node pools by hand — patching, upgrading, debugging bin-packing — is toil. It generates no business value. Moving the stateless microservices to Autopilot returned roughly ten engineering hours per week to the team. We spent that time on better CI/CD and automated load testing, which has compounding returns.
What the workflow looks like now
┌─ Dev & Design ──────────────────────────────────────────┐
│                                                         │
│   ┌────────────────┐   push    ┌──────────────────────┐ │
│   │ Code & Config  │ ────────► │ Cloud Build / GH     │ │
│   │       ▲        │           │ Actions              │ │
│   └───────┼────────┘           └──────────┬───────────┘ │
└───────────┼───────────────────────────────┼─────────────┘
            │                               │ Terraform / Helm
            │                               ▼
┌─ SRE & Deploy ──────────────────────────────────────────┐
│                                ┌──────────────────────┐ │
│                                │ GKE & Cloud Run      │ │
│                                └──────────┬───────────┘ │
│                                           │ metrics/logs│
│                                           ▼             │
│                                ┌──────────────────────┐ │
│                                │ Cloud Ops Suite      │ │
│                                └──┬────────────────┬──┘ │
└───────────────────────────────────┼────────────────┼────┘
                                    │                │
┌─ Feedback Loop ───────────────────┼────────────────┼────┐
│                      SLO breach?  │           data │    │
│              ┌────────────────────┘                ▼    │
│              ▼                     ┌──────────────────┐ │
│       ┌──────────────┐             │ Post-Mortem      │ │
│       │  PagerDuty   │             │ Analysis         │ │
│       └──────────────┘             └────────┬─────────┘ │
│                                             │ action    │
│                                             ▼ items     │
│                                ┌──────────────────────┐ │
│                                │ Engineering Backlog  │ │
│                                └──────────┬───────────┘ │
│                                           │ fix         │
│                                           ▼ reliability │
│                                    (back to Code)       │
└─────────────────────────────────────────────────────────┘
The post-mortem box is the part that's easy to skip and matters most. When something breaks, we write a blameless doc: what happened, the chain of contributing causes (the 5 Whys is a fine starting heuristic), and what changes prevent the same class of failure automatically. Action items go on the backlog with the same priority as features.
Three lessons from the year
100% reliability isn't a goal, it's a refusal to ship. A request from leadership for "100% availability" comes from the right place but means "we never deploy again." We landed on 99.9% with an explicit error budget: about 43 minutes of unavailability in a 30-day month. We spend that budget on faster releases. When we run out, deploys freeze until next month. That tradeoff has to be agreed on in writing or it falls apart at the first incident.
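For anyone who wants to check the budget figure or try a different target, here is a small Python sketch. The 30-day window and the idea of tracking "bad minutes" as a single number are simplifying assumptions, and the function names are illustrative.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability the SLO allows over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def deploys_frozen(bad_minutes_so_far: float, slo_target: float = 0.999,
                   window_days: int = 30) -> bool:
    """True once the window's budget is spent, i.e. the deploy freeze applies."""
    return bad_minutes_so_far >= error_budget_minutes(slo_target, window_days)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))   # three nines
    print(round(error_budget_minutes(0.9999), 2))  # four nines, for comparison
    print(deploys_frozen(bad_minutes_so_far=50.0))
```

Each extra nine cuts the budget by a factor of ten, which is a useful number to bring to the conversation with leadership.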
Observability has a bill. We turned on full request tracing and verbose logging on everything. The next month's bill made the case for sampling. 5% trace sampling is usually enough to spot real latency regressions, and aggressive log exclusion (drop successful health checks before they hit storage) cut the bill by more than half.
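The two cost levers can be sketched in a few lines of Python. In practice the exclusion usually lives in the logging pipeline's own filter configuration rather than in application code, so this only illustrates the logic; the `path` and `status` field names and the health-check paths are hypothetical.

```python
import hashlib

SAMPLE_RATE = 0.05  # the 5% trace sampling figure from the text
HEALTH_PATHS = {"/healthz", "/readyz"}  # hypothetical health-check endpoints


def keep_trace(trace_id: str, sample_rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: the same trace ID always gets the same decision.

    Hashing the ID (instead of calling random()) lets every service that sees
    the trace agree on whether to keep it.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)


def keep_log(entry: dict) -> bool:
    """Drop successful health checks before storage; keep everything else."""
    is_health_check = entry.get("path") in HEALTH_PATHS
    is_success = 200 <= entry.get("status", 0) < 300
    return not (is_health_check and is_success)


if __name__ == "__main__":
    kept = sum(keep_trace(f"trace-{i}") for i in range(10_000))
    print(f"kept {kept} of 10000 traces")
    print(keep_log({"path": "/healthz", "status": 200}))  # dropped
    print(keep_log({"path": "/healthz", "status": 503}))  # failing check is kept
```

Failed health checks are deliberately kept: they are rare, cheap to store, and exactly what you want to see during an incident.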
The Log4j response was the test. When the CVE dropped in December, the difference between "ops" and "SRE" was visible. The ops version would have been SSHing into every box. The SRE version was: query the SBOM in Container Analysis, push a Cloud Armor WAF rule to block JNDI strings at the load balancer, then patch and roll the affected images through CI/CD. Same outcome, different number of weekends.
What's next
Heading into 2022 the questions I'm spending time on:
- Supply chain security. Binary Authorization, signed images end-to-end, and provenance for what actually runs in production. Log4j made this concrete.
- FinOps as part of reliability. A query that costs $50 to run is a bug, even if the result is correct. Cost belongs in the SLO conversation.
- Chaos engineering. Starting in staging — break things on purpose to see whether the alerts and runbooks hold up.
If 3am disk-space pages are still part of your job, make 2022 the year that stops. Either auto-clean or auto-grow. Be a reliability engineer; don't babysit servers.