Adventures in migrating microservices
How Wealthsimple handled an EKS IPv4 Exhaustion Problem
Background
For the past eighteen months, Wealthsimple has been building out a new AWS-hosted EKS platform and migrating our microservices away from our HashiCorp Nomad stack (which served us well for years). So far, the new Kubernetes platform has been wildly successful, reducing costs and allowing our Product Development teams to use features such as Progressive Delivery to minimize impact to our clients. In fact, Wealthsimple clients have been happily using our products without any idea this was happening behind the scenes!
In AWS VPCs, IPv6 addresses are “publicly routable”: when the VPC is attached to an AWS Internet Gateway, these addresses can be reached directly by any other IPv6 host on the Internet (they can be made effectively “private” by using an egress-only internet gateway, which only allows outbound access). Conversely, IPv4 addresses in AWS VPCs are private (typically in the 10.x.x.x range), and when using technologies such as VPC peering and transit gateways, these addresses are routable and accessible across a company’s internal network. In EKS, pods are allocated addresses from these ranges; pods can communicate directly with entities in other peered VPCs, and entities in other peered VPCs can communicate directly with pods.
We run several production Kubernetes clusters for various use cases; some are IPv6-native and others are IPv4. In this article, I want to share how we ran out of available IPv4 addresses in one of our production clusters, how we addressed this in the short term to get microservices back online, and what we implemented afterwards to ensure this won’t be an issue in the future.
November Incident
When we first started building our new Kubernetes platform, we identified early on that, due to the “direct allocation” nature of EKS and VPC subnets, exhaustion of private IPv4 addresses would eventually become an issue. As such, we elected to build several clusters with IPv6, but soon realized there was a risk of application incompatibility. So we also built IPv4 clusters, anticipating that we would be able to complete our migration before we ran out of the private IPv4 addresses allocated to each EKS cluster. Unfortunately, this was not the case.
At the beginning of November 2024, we reached 80% utilization in one of our EKS subnets, and two weeks later we hit 100%. At that point, Kubernetes pods could no longer be scheduled because they were unable to obtain an IP address from AWS. We quickly addressed this issue by adding more private subnets to our EKS clusters, doubling the number of IPv4 addresses available for Kubernetes nodes and pods. This was made significantly easier by leveraging Terraform and Atlantis. Only two GitHub Pull Requests were required to accomplish this: one to update our common EKS VPC module to add more subnets, and another to pick up the module change for the cluster that had exhausted its address space.
Long-Term Plan
While we had resolved the immediate incident of pods being unable to schedule on this production cluster, we knew the underlying problem had to be addressed for the long term. We needed a solution that ensured future scalability and prevented further incidents related to IPv4 exhaustion. Three solutions emerged:
- Further increase the number of private subnets in our EKS VPCs by adding secondary CIDR blocks
- Build new production IPv6 enabled clusters and migrate microservices to them
- Implement a Carrier-Grade NAT overlay network for EKS pods
The first solution gives us scalability, but consumes an extremely large number of internal IPv4 addresses: three /16 address blocks for EKS, plus a fourth for non-EKS ancillary services within the VPC. Because the 10.0.0.0/8 private range only contains 256 /16 blocks, this limits us to a maximum of 64 EKS clusters, and in practice even fewer given our existing VPC and private IPv4 usage. The second option gives us more than enough scalability far into the future (4.7 sextillion addresses per VPC/cluster), but would require us to test all of our microservices against IPv6 and migrate them to newly built clusters while we are already migrating from Nomad to Kubernetes. Given these constraints, we chose to implement the third option, which allowed us to scale our platform without consuming excessive privately-routable IPv4 address space or requiring the effort to test against IPv6.
Implementing an IPv4 CGNAT Overlay Network
Once we had decided to implement a Carrier-Grade NAT overlay network for our EKS clusters, our first order of business was to build an isolated test VPC and EKS cluster; our existing test clusters were in use by other teams and we did not want to impact their work. With our strong use of Infrastructure-as-Code, doing so was almost trivial. We deployed one Terraform PR to create the new VPC, a second Terraform PR to bootstrap EKS and install ArgoCD, and two more PRs in our EKS configuration repos to configure both ArgoCD and our Kubernetes controllers for the new cluster. A total of four Pull Requests to build a completely new, fully-functional test environment!
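For illustration, pointing ArgoCD at the new cluster boils down to Application manifests along these lines. This is only a sketch: the repository URL, path, and cluster name below are placeholders, not our actual configuration.

```yaml
# Hypothetical ArgoCD Application syncing a config repo onto the new test cluster.
# Repo URL, path, and cluster name are placeholders, not Wealthsimple's real values.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-addons
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/eks-config.git
    targetRevision: main
    path: clusters/cgnat-test
  destination:
    name: cgnat-test            # the newly bootstrapped test cluster as registered in ArgoCD
    namespace: kube-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```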
From there, we refactored our Terraform AWS VPC modules to support secondary CIDR ranges and create the necessary subnets for EKS pods to use, while maintaining backwards compatibility for our existing VPCs. Once these secondary ranges and subnets were created, we deployed new ENIConfig objects to our EKS cluster through ArgoCD to configure the AWS VPC-CNI; these objects tell the driver which of the new subnets and AWS Security Groups to use. Finally, we reconfigured the VPC-CNI driver through Terraform to use these ENIConfig objects by enabling custom networking, and began validation.
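As a rough sketch of what an ENIConfig object looks like, the example below assumes the common pattern of one ENIConfig per availability zone, with pod subnets carved out of the CGNAT (100.64.0.0/10) secondary range; the zone name, subnet ID, and security group ID are placeholders.

```yaml
# One ENIConfig per availability zone; the VPC-CNI picks the matching config
# for each node and attaches pod ENIs to the subnet and security groups below.
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a                      # placeholder AZ; named after the zone
spec:
  subnet: subnet-0123456789abcdef0      # placeholder pod subnet from the secondary (CGNAT) CIDR
  securityGroups:
    - sg-0123456789abcdef0              # placeholder security group applied to pod ENIs
```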
Validation and Gotchas
With a fully-functional EKS cluster implementing a Carrier-Grade NAT overlay network, we began testing against this cluster to ensure the configuration was appropriate for our platform. During this testing, we discovered that using custom networking significantly impacted pod density: EC2 “large” instance types, for example, were limited to 20 pods, whereas without custom networking we could run 29 pods on those nodes. Switching to prefix mode addressed this for us by raising the network-imposed limit on a “large” EC2 instance well past 110 pods, the default Kubernetes/EKS per-node limit. We also tuned MINIMUM_IP_TARGET and WARM_IP_TARGET to minimize the number of unused IP addresses held by each node in this new configuration.
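For reference, these knobs are environment variables on the VPC-CNI (aws-node) DaemonSet, roughly as shown below. We set them through Terraform, and the specific target values here are illustrative rather than the ones we run in production.

```yaml
# Environment variables on the aws-node (VPC-CNI) DaemonSet that drive this behaviour.
# The numeric values are illustrative, not Wealthsimple's production settings.
env:
  - name: AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
    value: "true"                           # use ENIConfig subnets/security groups for pod ENIs
  - name: ENI_CONFIG_LABEL_DEF
    value: "topology.kubernetes.io/zone"    # match each node to the ENIConfig named after its AZ
  - name: ENABLE_PREFIX_DELEGATION
    value: "true"                           # prefix mode: assign /28 prefixes instead of individual IPs
  - name: MINIMUM_IP_TARGET
    value: "16"                             # floor of IP addresses held per node
  - name: WARM_IP_TARGET
    value: "4"                              # spare IPs kept ready for newly scheduled pods
```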
Production Rollout
Once our validation was complete, it was time to promote this configuration to our production clusters. Over the next two days, we opened three PRs per cluster to accomplish this: one to reconfigure the VPC and add the new CIDR ranges and subnets, a second to set up the ENIConfig resources through ArgoCD, and a third to enable custom networking through the VPC-CNI. We then identified a low-risk, low-traffic window for the cutover and recycled our EKS nodes to ensure all old addresses were reclaimed and all pods were using addresses from the CGNAT range.
This whole adventure was made significantly easier by our strong use of Infrastructure-as-Code. Three GitHub PRs per cluster and our underlying automation allowed us to focus our efforts on validation and performance. It also allowed the rest of the Infrastructure team to continue migrating microservices from Nomad to Kubernetes, keeping our platform safe, secure, and available for our clients. While I wish we could have implemented this overlay network during the initial buildout of our EKS platform, that would have meant I couldn’t tell this story or highlight how easy it was to deploy these kinds of changes with the automation we built.
Thanks to the entire Wealthsimple Platform Engineering team for their effort in building a truly world-class platform: one that is delightful for the rest of Engineering to use, and one that is secure and scalable for our valued clients.
...
Written by Andrew Brown, Staff Software Developer
Interested in working at Wealthsimple? Check out the open roles on our team today.