Event-Driven Architecture: Decoupling Services with Pub/Sub and Eventarc

November 7, 2021

A year into a microservices migration, the second pathology shows up: the services are split, but they're still chained together over synchronous HTTP. Service A calls B, B calls C, C queries the database, and a hiccup anywhere in the chain takes the whole call down with it.

I had a checkout service that synchronously waited on an email confirmation service. The email provider had an outage, checkout latency went from 200ms to "timeout," and customers couldn't buy anything for an hour. The post-mortem question wasn't "how do we make email more reliable" — it was "why is the checkout call waiting on email at all."

Event-driven architecture is the answer most of the time. The producer drops a message and moves on; the consumer reacts when it can. On GCP the two relevant tools are Cloud Pub/Sub (the workhorse) and Eventarc (newer, opinionated, useful in different situations).

The HTTP coupling trap vs the event model

A direct HTTP call between services creates a runtime dependency:

  • Synchronous — caller blocks until callee returns.
  • Brittle — if the callee is down, the caller fails. Circuit breakers and retries help, but they're complexity you have to maintain.
  • Rigid — adding a new consumer of that data means changing the producer.

The event model flips this:

  1. Producer emits an event ("Order Placed").
  2. Broker holds the event.
  3. Consumers react when they're ready.

If the email service is down, the message sits in the queue. When email comes back, it processes the backlog. The checkout call doesn't even know email exists.
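To make the backlog behaviour concrete, here is a toy sketch of the three steps above. It is an in-memory stand-in, not Pub/Sub: the `InMemoryBroker` class, the subscription name, and the event fields are all made up for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class InMemoryBroker:
    """Toy stand-in for a broker: holds events until a consumer drains them."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[dict]] = {}

    def subscribe(self, name: str) -> None:
        # Each subscription gets its own queue, so consumers are independent.
        self._queues.setdefault(name, deque())

    def publish(self, event: dict) -> None:
        # The producer returns immediately; it never calls a consumer.
        for queue in self._queues.values():
            queue.append(event)

    def drain(self, name: str, handler: Callable[[dict], None]) -> int:
        # A consumer works through its backlog whenever it is healthy again.
        queue = self._queues[name]
        handled = 0
        while queue:
            handler(queue[0])  # if the handler raises, the event stays queued
            queue.popleft()    # "ack" only after successful handling
            handled += 1
        return handled


broker = InMemoryBroker()
broker.subscribe("email")

# Checkout emits events while the email service is "down" (nobody is draining).
for order_id in ("o-1", "o-2", "o-3"):
    broker.publish({"type": "order.placed", "order_id": order_id})

# Email recovers and processes the backlog in publish order.
sent: List[str] = []
processed = broker.drain("email", lambda event: sent.append(event["order_id"]))
```

Note that `publish` has no reference to any handler. That missing reference is the decoupling: checkout finishes its work whether or not anything is listening.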

Cloud Pub/Sub

Pub/Sub is the asynchronous messaging backbone on GCP. Global, durable, fan-out friendly.

Reach for it when:

  • You're streaming high volumes — clickstream, IoT telemetry, transaction logs.
  • One message needs to fan out to multiple downstream systems (analytics, search index, notification service).
  • You own the message schema. App-to-app communication where you define the JSON.

Pull subscriptions are the right choice for legacy workers on VMs that need to control their own throughput. Push subscriptions pair naturally with Cloud Run and Cloud Functions for serverless consumers that should scale with the queue.
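On the push side, Pub/Sub POSTs a JSON envelope to your endpoint with the payload base64-encoded in `message.data`. Here is a minimal sketch of unpacking it with only the standard library. It assumes the payload is JSON you defined; the function name, the example order fields, and the subscription path are hypothetical.

```python
import base64
import json
from typing import Any, Dict


def decode_push_envelope(body: bytes) -> Dict[str, Any]:
    """Unpack the JSON envelope that Pub/Sub POSTs to a push endpoint."""
    envelope = json.loads(body)
    message = envelope["message"]
    # The published bytes arrive base64-encoded in the "data" field.
    # Assumption: the publisher sent a JSON document.
    payload = json.loads(base64.b64decode(message["data"]))
    return {
        "message_id": message["messageId"],
        "attributes": message.get("attributes", {}),
        "payload": payload,
    }


# Illustrative request body, shaped like a push delivery.
example_body = json.dumps({
    "message": {
        "data": base64.b64encode(json.dumps({"order_id": "o-1"}).encode()).decode(),
        "messageId": "1234567890",
        "attributes": {"event_type": "order.placed"},
    },
    "subscription": "projects/my-project/subscriptions/inventory-push",
}).encode()

decoded = decode_push_envelope(example_body)
```

In a real handler, a success status code acknowledges the message and anything else makes Pub/Sub redeliver it, so only return success once the work is actually done.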

Eventarc

Eventarc went GA earlier this year. It's built on Pub/Sub but layered for a different use case: routing GCP infrastructure events into Cloud Run / Cloud Functions Gen 2.

What makes it different is the catalog of supported sources, particularly Cloud Audit Logs. You can trigger a service when an object lands in a GCS bucket, when an IAM permission changes, when a BigQuery job completes, etc., without rolling your own log sink → Pub/Sub → function pipeline.

The other thing it gives you is the CloudEvents format. Events arrive with a standard envelope regardless of source, so consumers don't need bespoke parsing per event type.
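As a rough illustration of what that envelope buys you: in the CloudEvents HTTP binary mode, the context attributes travel as `ce-` prefixed headers, so one small function can read them for any source. This sketch uses a plain dict in place of a web framework's header object, and the header values are illustrative rather than captured from a real delivery.

```python
from typing import Dict, Mapping

# Context attributes the CloudEvents spec marks as required.
REQUIRED = ("id", "source", "type", "specversion")


def cloudevent_attributes(headers: Mapping[str, str]) -> Dict[str, str]:
    """Collect CloudEvents context attributes from binary-mode HTTP headers."""
    attrs = {
        name.lower()[3:]: value  # strip the "ce-" prefix
        for name, value in headers.items()
        if name.lower().startswith("ce-")
    }
    missing = [key for key in REQUIRED if key not in attrs]
    if missing:
        raise ValueError(f"not a CloudEvent, missing: {missing}")
    return attrs


# Illustrative headers for a storage event.
example_headers = {
    "Ce-Id": "example-id-1",
    "Ce-Source": "//storage.googleapis.com/projects/_/buckets/example-images",
    "Ce-Type": "google.cloud.storage.object.v1.finalized",
    "Ce-Specversion": "1.0",
    "Ce-Subject": "objects/uploads/cat.png",
    "Content-Type": "application/json",
}

attrs = cloudevent_attributes(example_headers)
```

The consumer can then branch on `attrs["type"]` instead of sniffing the body to work out what kind of event it received.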

Putting them together

The pattern we're using on a current project: Pub/Sub for application-emitted events where we control the schema, Eventarc for reactions to GCP state changes.

   ┌──────────────────────────────────────────────────────────┐
   │                     Direct App Data                      │
   │                                                          │
   │   ┌────────────┐  Order Placed   ┌──────────────────┐    │
   │   │ Client App │ ───────────────►│  Cloud Pub/Sub   │    │
   │   └────────────┘                 └────────┬─────────┘    │
   │                                           │ push         │
   │                                  ┌────────┴─────────┐    │
   │                                  ▼                  ▼    │
   │                       ┌──────────────────┐  ┌──────────┐ │
   │                       │ Inventory Svc    │  │ Shipping │ │
   │                       │  (Cloud Run)     │  │   (CR)   │ │
   │                       └──────────────────┘  └──────────┘ │
   └──────────────────────────────────────────────────────────┘

   ┌──────────────────────────────────────────────────────────┐
   │                  Infrastructure Signals                  │
   │                                                          │
   │   ┌──────────────┐  Object Finalized   ┌──────────────┐  │
   │   │ GCS Bucket   │ ───────────────────►│   Eventarc   │  │
   │   │  (Images)    │                     └──────┬───────┘  │
   │   └──────────────┘                            │ route    │
   │                                               ▼          │
   │                                   ┌────────────────────┐ │
   │                                   │ Thumbnail Gen (CR) │ │
   │                                   └────────────────────┘ │
   └──────────────────────────────────────────────────────────┘

The client app doesn't know inventory or shipping exists. The thumbnail generator gets a CloudEvent it can parse the same way regardless of which storage source fires it. The Cloud Run consumers on both sides scale to zero when nothing is happening.

Two scars worth sharing

The event loop. I set up an Eventarc trigger on storage.objects.create for a bucket. The Cloud Run service it called processed the file and wrote a processed version back to the same bucket. The write triggered a new event. The service ran again. By the time the billing alert woke me up, the service had re-triggered itself several thousand times. Always separate input and output buckets, or filter strictly on the object name at the start of the handler.
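If separate buckets aren't an option, the name filter can be as small as the sketch below. The `uploads/` and `processed/` prefixes are a hypothetical layout, not something Eventarc requires; what matters is that the check runs before any work is done.

```python
INPUT_PREFIX = "uploads/"      # hypothetical: where clients drop raw files
OUTPUT_PREFIX = "processed/"   # hypothetical: where the service writes results


def should_process(object_name: str) -> bool:
    """Return True only for objects the service did not write itself."""
    return (
        object_name.startswith(INPUT_PREFIX)
        and not object_name.startswith(OUTPUT_PREFIX)
    )


def output_name(object_name: str) -> str:
    """Map an input object name to its output location."""
    return OUTPUT_PREFIX + object_name[len(INPUT_PREFIX):]


incoming = "uploads/cat.png"
written = output_name(incoming)

# The event fired by our own write is dropped on the next invocation.
first_pass = should_process(incoming)
second_pass = should_process(written)
```

A handler would call `should_process` first and return a success status straight away when it is False, so the event is acknowledged instead of retried.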

At-least-once delivery is a contract you have to honor. Both Pub/Sub and Eventarc deliver at-least-once. On a network blip or a retry, the same message can arrive twice. Make consumers idempotent. For an order processor, that means a unique constraint on the transaction ID and a check before charging the card. Skip this and you will, eventually, charge a customer twice.
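Here is a minimal sketch of that unique-constraint check, using an in-memory SQLite table as a stand-in for the real database. The table layout and the `charge_card` callback are hypothetical; a production version would also pass the transaction ID to the payment provider as an idempotency key if the provider supports one.

```python
import sqlite3
from typing import Callable


def make_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # The primary key is the unique constraint that rejects duplicates.
    conn.execute(
        "CREATE TABLE charges ("
        "transaction_id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
    )
    return conn


def handle_order(
    conn: sqlite3.Connection,
    transaction_id: str,
    amount_cents: int,
    charge_card: Callable[[str, int], None],
) -> bool:
    """Charge at most once per transaction_id. Returns True if this call charged."""
    try:
        with conn:  # commits on success, rolls back if anything below raises
            conn.execute(
                "INSERT INTO charges VALUES (?, ?)",
                (transaction_id, amount_cents),
            )
            charge_card(transaction_id, amount_cents)
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: already recorded, do not charge again
    return True


conn = make_store()
charged = []


def record_charge(tx: str, amount: int) -> None:
    charged.append((tx, amount))


first = handle_order(conn, "tx-1", 1999, record_charge)
second = handle_order(conn, "tx-1", 1999, record_charge)  # redelivered message
```

Recording the transaction and charging inside the same database transaction means a failed charge rolls the row back, so a later redelivery can try again rather than being silently skipped.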

Picking between them

  • Building a data pipeline or app-to-app messaging where you own the schema: Pub/Sub.
  • Reacting to GCP state (object uploads, audit log entries, job completions) or wanting CloudEvents-shaped events out of the box: Eventarc.

There's overlap, and Eventarc sits on Pub/Sub anyway, so it's not a hard choice. The point isn't which tool — it's getting away from the synchronous-HTTP-everywhere pattern that turns every dependency into a tail-latency risk.

Next post wraps up the year: how the team stopped doing "ops" and started doing SRE, and which Google SRE practices actually translated to a real workload.