Recap: in part 1 we used Cloud Load Balancing to slice traffic off the monolith. In part 2 we containerized the legacy Java app with Jib so we had something to send the traffic to. Now those containers need to live somewhere on the network, and that turns out to be the design decision with the widest blast radius.
The two failure modes I see most often:
- A project per team, peering everywhere. Each team gets its own VPC, and you peer them as needed. By the time you have 10 teams the topology is a full mesh, a single CIDR overlap brings down half the company, and on-prem connectivity is duplicated everywhere.
- One giant project. Everyone shares a default VPC. The intern on the frontend team accidentally deletes the firewall rule protecting the database.
Shared VPC is the middle ground that has actually held up for us at enterprise scale.
What Shared VPC is
The mental model that works: Shared VPC is an apartment building.
- Host project = landlord. Owns the network, subnets, VPNs, firewalls, Cloud NAT.
- Service projects = tenants. Each team gets one. They can run their own VMs, GKE clusters, Cloud Functions, but they plug into the network the host project provides.
This separation maps cleanly onto the org chart: NetOps owns the host project, application teams own their service projects.
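As a rough Terraform sketch of the landlord/tenant relationship: one resource marks the host project as a Shared VPC host, and one attaches a service project to it. The project IDs are placeholders; host-project-prod is the host ID used in this post's config, and checkout-service-prod is a hypothetical service project ID.

```hcl
# Sketch only. Project IDs are placeholders, not the real ones.

# Landlord: enable Shared VPC hosting on the host project.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "host-project-prod"
}

# Tenant: attach a service project to the host.
resource "google_compute_shared_vpc_service_project" "checkout" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "checkout-service-prod" # hypothetical ID
}
```

Both resources would normally live in the NetOps-owned configuration, since attaching a service project is a host-side decision.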
Architecture
Here's what we deployed for this migration. The intent was: let the checkout team move fast, but don't let them touch the BGP routes back to on-prem.
┌────────────────────────────────────────────────┐
│          HOST PROJECT (Network Admin)          │
│ ┌────────────────────────────────────────────┐ │
│ │             Shared VPC Network             │ │
│ ├────────────────────────────────────────────┤ │
│ │ Subnet A: 10.0.1.0/24 (Checkout)           │ │
│ │ Subnet B: 10.0.2.0/24 (Inventory)          │ │
│ ├────────────────────────────────────────────┤ │
│ │ Cloud VPN / Interconnect                   │ │
│ │ Central Firewall Policies                  │ │
│ └────────┬─────────────────────────┬─────────┘ │
└──────────┼─────────────────────────┼───────────┘
           │ uses A                  │ uses B
           ▼                         ▼
┌──────────────────────┐  ┌──────────────────────┐
│ Service Project:     │  │ Service Project:     │
│ Checkout             │  │ Inventory            │
│ ┌────────────────┐   │  │   ┌────────────────┐ │
│ │ GKE / VMs      │◄──┼──┼──►│ GKE / VMs      │ │
│ └────────▲───────┘   │  │   └──────▲─────────┘ │
└──────────┼───────────┘  └──────────┼───────────┘
           │                         │
           └────────────┬────────────┘
                        │ traffic
            Cloud VPN / Interconnect
What this gets us:
- One VPN/Interconnect. Paid once, reused by every service project automatically.
- Real security boundaries. Each application team has admin in their service project but minimal rights in the host project. They can't open port 22 to the internet because they don't own the firewall rules.
- Per-team billing. Compute costs land on the service project, so finance can still see how much each team is burning even though the network is shared.
Connecting GKE to the Shared VPC
On Shared VPC, GKE clusters have to be VPC-native (alias IP), which complicates the host-project setup a little.
1. Subnets with secondary ranges. A primary range is not enough — GKE pods and services each need their own secondary range:
- Primary range (nodes): 10.0.1.0/24
- Secondary range (pods): 10.100.0.0/20
- Secondary range (services): 10.200.0.0/20
2. IAM bindings on the host project. Two roles are involved:
- compute.networkUser on the specific subnet, granted to the service project's Google APIs service agent and to its GKE service agent
- container.hostServiceAgentUser on the host project, granted to the service project's GKE service agent, so the GKE control plane can manage the network
3. Reference the host network from Terraform. The cluster lives in the service project, the network lives in the host project:
network = "projects/host-project-prod/global/networks/shared-vpc"
subnetwork = "projects/host-project-prod/regions/us-central1/subnetworks/checkout-subnet"
The cluster comes up in the service project, with IPs allocated from the host project. From the user's perspective it's a normal GKE cluster.
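The three steps above might look roughly like this in Terraform. This is a sketch under assumptions, not the exact config from this migration: the service project ID, project number, range names, and cluster name are hypothetical, and the IAM principals follow my reading of GKE's Shared VPC setup guide (networkUser on the subnet for both service agents, hostServiceAgentUser on the host project for the GKE service agent).

```hcl
locals {
  host_project    = "host-project-prod"
  service_project = "checkout-service-prod" # hypothetical ID
  service_number  = "123456789012"          # hypothetical project number
  region          = "us-central1"

  # Service agents that belong to the service project.
  gke_agent  = "serviceAccount:service-${local.service_number}@container-engine-robot.iam.gserviceaccount.com"
  apis_agent = "serviceAccount:${local.service_number}@cloudservices.gserviceaccount.com"
}

# Step 1: subnet in the host project, with secondary ranges for pods and services.
resource "google_compute_subnetwork" "checkout" {
  project       = local.host_project
  name          = "checkout-subnet"
  region        = local.region
  network       = "projects/${local.host_project}/global/networks/shared-vpc"
  ip_cidr_range = "10.0.1.0/24" # nodes

  secondary_ip_range {
    range_name    = "checkout-pods" # hypothetical name
    ip_cidr_range = "10.100.0.0/20"
  }

  secondary_ip_range {
    range_name    = "checkout-services" # hypothetical name
    ip_cidr_range = "10.200.0.0/20"
  }
}

# Step 2a: subnet-scoped networkUser for both service agents.
resource "google_compute_subnetwork_iam_member" "network_user" {
  for_each = toset([local.gke_agent, local.apis_agent])

  project    = local.host_project
  region     = local.region
  subnetwork = google_compute_subnetwork.checkout.name
  role       = "roles/compute.networkUser"
  member     = each.value
}

# Step 2b: host-project-level role for the GKE service agent.
resource "google_project_iam_member" "host_service_agent_user" {
  project = local.host_project
  role    = "roles/container.hostServiceAgentUser"
  member  = local.gke_agent
}

# Step 3: cluster in the service project, network and subnet in the host project.
resource "google_container_cluster" "checkout" {
  project            = local.service_project
  name               = "checkout" # hypothetical name
  location           = local.region
  initial_node_count = 1

  network    = "projects/${local.host_project}/global/networks/shared-vpc"
  subnetwork = google_compute_subnetwork.checkout.id

  # Setting an ip_allocation_policy makes the cluster VPC-native.
  ip_allocation_policy {
    cluster_secondary_range_name  = "checkout-pods"
    services_secondary_range_name = "checkout-services"
  }

  # The IAM bindings must exist before GKE tries to use the subnet.
  depends_on = [
    google_compute_subnetwork_iam_member.network_user,
    google_project_iam_member.host_service_agent_user,
  ]
}
```

In practice the subnet and IAM resources would sit in the NetOps configuration and the cluster in the application team's, with the subnet path passed across as a plain string like the one shown above.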
Why not VPC peering
This question comes up every time: peering is one click, while Shared VPC is a multi-step setup.
The reason peering doesn't scale: non-transitivity. If A peers with B and B peers with C, A still cannot reach C. To make A reach C you have to peer them directly. With 50 microservices, each in its own VPC, a full mesh is 50·49/2 = 1,225 peerings to maintain. Shared VPC sidesteps this entirely: everyone is on the same network, segmented by subnets and firewall rules.
Where it hurt
Firewall friction. A developer deploys a service on port 8080, it doesn't work, and they don't have permission to add a firewall rule. Every change becomes a NetOps ticket. We softened this by agreeing on a set of network tags (allow-8080, etc.) with pre-approved firewall rules — devs tag their resources and the rules apply automatically. Governance design for this is its own post.
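A pre-approved tag rule of this kind could be sketched as follows. The rule lives in the host project and only applies to instances (including GKE nodes) that carry the matching network tag. The source range is an assumption for illustration; the real rules may be narrower.

```hcl
# Sketch only: owned by NetOps in the host project.
resource "google_compute_firewall" "allow_8080" {
  project   = "host-project-prod"
  name      = "allow-8080"
  network   = "shared-vpc"
  direction = "INGRESS"

  allow {
    protocol = "tcp"
    ports    = ["8080"]
  }

  # Assumed source: internal RFC1918 space only, never 0.0.0.0/0.
  source_ranges = ["10.0.0.0/8"]

  # Applies only to resources a developer has tagged with allow-8080.
  target_tags = ["allow-8080"]
}
```

A developer then adds the allow-8080 tag to their instance template or node pool, and no ticket is needed.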
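Pod IP exhaustion. We gave the inventory team a /22 secondary range: 1,024 pod IPs. That is smaller than it sounds. If the cluster uses GKE's default of 110 max pods per node, every node reserves a /24 from the pod range, so a /22 is used up after four nodes. They scaled up, ran out, and GKE refused to schedule new pods. A secondary range that is in use can't be expanded, so we had to add a new secondary range to the cluster (GKE calls this discontiguous multi-Pod CIDR). Lesson: pod IP space inside RFC1918 is effectively free, so over-provision generously the first time.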
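One way to put an extra pod range to work is a new node pool that draws its pod IPs from it. The sketch below is an illustration, not the exact config from this migration. It assumes NetOps has already added a secondary range named inventory-pods-2 to the host-project subnet; the project, cluster, pool, and range names are all hypothetical.

```hcl
# Sketch only: all names here are hypothetical.
resource "google_container_node_pool" "inventory_extra" {
  project    = "inventory-service-prod"
  cluster    = "inventory"
  location   = "us-central1"
  name       = "inventory-pool-2"
  node_count = 1

  network_config {
    # Existing secondary range on the Shared VPC subnet in the host project.
    pod_range = "inventory-pods-2"
  }
}
```

Nodes in the new pool take their per-node pod blocks from the new range, while existing pools keep using the original /22.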
Deletion order. Tearing down a test environment failed because Terraform couldn't delete a host-project subnet still being used by a service-project cluster. The tenant has to move out before the building comes down.
Wrap up
Shared VPC is the load-bearing piece of this whole migration. It gives application teams the autonomy they need without letting them touch the bits of the network you don't want anyone touching.
GKE Standard worked fine for this, but looking at the bills and the time we spend on node-pool maintenance, I want to revisit whether GKE Autopilot would have been a better default. That's the next post: a real-world cost and ops comparison after running both in parallel for a quarter.