Cloud Composer 2 is Here: Why the Architecture Changes Matter for Autoscaling

October 20, 2021

If you've run Apache Airflow on GCP via Cloud Composer 1, you probably have mixed feelings about it. The managed-service convenience is real. So are the bill and the cold-start time when the morning ETL crunch lands.

Cloud Composer 2 just hit GA, and it's not a version bump — it's a different architecture. I migrated a few heavier pipelines over the last two weeks. The short version: the rigidity that made CC1 painful is mostly gone.

What was wrong with CC1

CC1 ran your Airflow workers on a GKE Standard cluster you could see in your project. You picked a node count up front: too small and morning tasks queued forever; too large and you paid for idle capacity all night. Scaling existed via the GKE Cluster Autoscaler, but provisioning a VM, booting the OS, pulling images, and joining the cluster takes long enough that the burst is usually over by the time the capacity arrives.

CC2: GKE Autopilot underneath, split control plane on top

Two things changed.

1. Workers run on GKE Autopilot. No node management, no node visibility, no node decisions. Workers are just pods that scale horizontally with queue depth, between the minimum and maximum worker counts you set on the environment.

2. The environment is split between a Google-managed tenant project and your project. The web server and the metadata Cloud SQL instance live in the tenant project. The schedulers and workers live in your project on the Autopilot cluster.

   ┌──── Cloud Composer 1 ─────────────┐   ┌──── Cloud Composer 2 ─────────────┐
   │                                   │   │                                   │
   │   Scheduler      Web Server       │   │   Scheduler      Web Server       │
   │                                   │   │                                   │
   │        ┌──────────────┐           │   │        ┌──────────────┐           │
   │        │ Worker Pool  │           │   │        │ Worker Pool  │           │
   │        └──┬────────┬──┘           │   │        └──┬────┬────┬─┘           │
   │           │        │  fixed       │   │           │    │    │  elastic    │
   │           │        │  capacity    │   │           │    │    │  scaling    │
   │        ┌──┴───┐ ┌──┴───┐          │   │         ┌─┴─┐┌─┴─┐┌─┴─┐           │
   │        │Node 1│ │Node 2│          │   │         │Pod││Pod││Pod│           │
   │        └──────┘ └──────┘          │   │         └─┬─┘└─┬─┘└─┬─┘           │
   │                                   │   │           └────┼────┘ (on demand) │
   │     ┌────────────────┐            │   │       ┌────────┴─────────┐        │
   │     │ Idle resources │            │   │       │  GKE Autopilot   │        │
   │     │   = wasted $   │            │   │       │  Infrastructure  │        │
   │     └────────────────┘            │   │       └──────────────────┘        │
   └───────────────────────────────────┘   └───────────────────────────────────┘

The split is the underrated part. In CC1 a heavy Python task could starve the web server because they shared a cluster, and the UI would hang exactly when you needed it most. In CC2 the web server is somebody else's problem, and the UI stays responsive even when workers are pinned.

Autoscaling becomes task-driven instead of CPU-driven: 50 queued tasks → Autopilot provisions worker pods → tasks finish → pods go away → you stop paying.
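
To make "task-driven" concrete, here is a minimal sketch of how a worker count could be derived from queue depth. This is not Composer's actual autoscaler. The function name, the per-worker concurrency of 12, and the min/max bounds are all assumptions picked for illustration.

```python
import math


def target_workers(queued: int, running: int, concurrency: int = 12,
                   min_workers: int = 1, max_workers: int = 6) -> int:
    """Return a worker count sized to outstanding tasks, clamped to [min, max].

    Hypothetical model: each worker runs up to `concurrency` tasks at once.
    """
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    # Workers needed to hold every queued and running task at the same time.
    needed = math.ceil((queued + running) / concurrency)
    return max(min_workers, min(needed, max_workers))


if __name__ == "__main__":
    # Morning burst: 50 queued tasks, nothing running yet.
    print(target_workers(queued=50, running=0))    # 5
    # Overnight: the queue is empty, so the pool shrinks to the floor.
    print(target_workers(queued=0, running=0))     # 1
    # Runaway fan-out: the cap holds no matter how deep the queue gets.
    print(target_workers(queued=1000, running=0))  # 6
```

The point of the sketch is the input: the pool is sized from how much work is waiting, not from how busy the existing machines look.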

What to watch out for during migration

VPC-native networking is required. CC2 won't deploy on a network that isn't VPC-native. If you're still on legacy networking or haven't set up secondary IP ranges for pods and services, sort that out first.
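
Before creating the environment, it can help to sanity-check the ranges you plan to use. The sketch below uses only Python's standard `ipaddress` module to confirm that the primary, pod, and service ranges don't overlap. The function name and the example CIDRs are hypothetical, and it doesn't check whether the ranges are large enough for your workload.

```python
import ipaddress


def check_vpc_native_ranges(primary: str, pods: str, services: str) -> list[str]:
    """Return a list of overlap problems; an empty list means none were found."""
    nets = {
        "primary": ipaddress.ip_network(primary),
        "pods": ipaddress.ip_network(pods),
        "services": ipaddress.ip_network(services),
    }
    problems = []
    names = list(nets)
    # Compare every pair of ranges exactly once.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(
                    f"{a} range {nets[a]} overlaps {b} range {nets[b]}"
                )
    return problems


if __name__ == "__main__":
    # Example ranges only; substitute the ones from your own subnet plan.
    print(check_vpc_native_ranges("10.0.0.0/24", "10.4.0.0/14", "10.8.0.0/20"))
    print(check_vpc_native_ranges("10.0.0.0/24", "10.0.0.0/16", "10.8.0.0/20"))
```

The first call returns an empty list. The second reports that the pod range swallows the primary range.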

PyPI installs from a private IP environment. If your environment uses private IPs (which it should), workers need a route to PyPI or to a private mirror. I lost three hours debugging a pip install failure that turned out to be a Cloud NAT misconfiguration on the new Autopilot subnet. Artifact Registry as a private mirror is the cleaner long-term answer.
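
As a sketch of the private-mirror route: Artifact Registry serves Python repositories at a URL of the form `https://LOCATION-python.pkg.dev/PROJECT/REPOSITORY/simple/`, and pip can be pointed at it with an `index-url` setting. The helper below only renders that config text. The project, location, and repository names are placeholders. Where the file has to live for Composer to pick it up (as far as I know, `config/pip/pip.conf` in the environment's bucket) is worth confirming against the current docs.

```python
def artifact_registry_index_url(project: str, location: str, repository: str) -> str:
    """Build the simple-index URL for an Artifact Registry Python repository."""
    return f"https://{location}-python.pkg.dev/{project}/{repository}/simple/"


def render_pip_conf(index_url: str) -> str:
    """Render a minimal pip.conf that sends all installs to one index."""
    return f"[global]\nindex-url = {index_url}\n"


if __name__ == "__main__":
    # Placeholder names; replace with your own project, region, and repo.
    url = artifact_registry_index_url("my-project", "us-central1", "pypi-mirror")
    print(render_pip_conf(url))
```

With a mirror in place, workers only need a route to Google APIs rather than to the public internet, which removes the Cloud NAT dependency for package installs.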

The cost shape is different. CC1 was a high fixed cost — you paid for the nodes 24/7. CC2 is a smaller fixed environment fee plus variable compute. That's a win on average but a footgun in the wrong shape: a buggy DAG that fans out to 1,000 tasks will scale up to run them all and the bill will reflect it. Set max worker caps before you're surprised.
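
A back-of-the-envelope model shows why the cap matters. Every number below is made up, and real Composer 2 billing is broken down by more than a single "worker hour" rate, so treat this as the shape of the curve rather than a price estimate. The function names and parameters are hypothetical.

```python
HOURS_PER_MONTH = 730


def cc1_monthly_cost(nodes: int, node_hourly: float,
                     hours: int = HOURS_PER_MONTH) -> float:
    """Fixed shape: every node is billed around the clock, busy or idle."""
    return nodes * node_hourly * hours


def cc2_monthly_cost(env_fee_hourly: float, demand_worker_hours: float,
                     worker_hourly: float, max_workers: int,
                     hours: int = HOURS_PER_MONTH) -> float:
    """Fixed fee plus variable compute, bounded by the max worker cap."""
    # The cap limits how many worker hours can ever be billed in a month.
    billable = min(demand_worker_hours, max_workers * hours)
    return env_fee_hourly * hours + billable * worker_hourly


if __name__ == "__main__":
    # Illustrative prices only.
    print(cc1_monthly_cost(nodes=3, node_hourly=0.25))          # 547.5
    # A normal month: 400 worker hours of real demand.
    print(cc2_monthly_cost(0.25, 400, 0.5, max_workers=6))      # 382.5
    # A runaway DAG asks for 10,000 worker hours; the cap stops it at 4,380.
    print(cc2_monthly_cost(0.25, 10_000, 0.5, max_workers=6))   # 2372.5
```

Without the `max_workers` bound, the last line would scale with whatever the buggy DAG asked for.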

Is it worth migrating

Yes. Cost is part of it (~20% savings on dev environments for us), but the bigger win is reliability. Decoupling the web server and metadata DB from the execution layer means the UI stays usable while the cluster is doing real work. Not patching GKE nodes is a bonus.

If you're still on CC1, start scoping the migration now. The old version isn't going away tomorrow, but it's clearly the past.

Next post: event-driven architecture on GCP — Pub/Sub, Eventarc, and how to stop building tight HTTP coupling between services that don't need to know about each other.