New Kafka Tier, No Kafka Tears
How we pulled off a Kafka migration without inconveniencing our clients.
Migrating systems is no one’s idea of a good time, but when we made the switch from Flow to TypeScript, we did learn a few valuable lessons we hoped would minimize future misery:
1. Planning is priority number one.
2. It takes the work of many teams to get to the finish line.
3. Sometimes, the process of finding a solution will feel a bit like playing whack-a-mole. In the dark. Wearing oven mitts.
2021 was a year of phenomenal growth for Wealthsimple almost from the get-go, which provided a sooner-than-expected opportunity to put our migration lessons to the test. While all systems appeared to be operating normally to the outside world, we knew that behind the scenes we had to scale all of our systems, and soon, to keep up.
One of those systems was Kafka. While we had been getting by with our production Kafka setup, we wanted to be proactive about migrating from the standard to the dedicated tier before we outgrew it, and before anything happened that might negatively impact our customers' experience. We knew from our last go-round that initiating a migration like this is no small feat. So we recalled Lesson One: Start planning right away.
What we were up against
While our production environment didn't feel like it was lacking, the signs of limitation were already starting to show. The environment was a multi-tenant cluster with limits on partitions and throughput that we knew wouldn't scale with our needs, and there was no option to expand its capacity.
We wanted to complete this migration with as little downtime as possible, so we had our work cut out for us. And like any arduous journey, the migration to Kafka came with a fair amount of baggage:
- There were no tools that would allow us to automatically upgrade from a multi-tenant to a dedicated Kafka cluster.
- Because there were so many interdependencies, it wasn’t possible to fully automate the migration.
- We had to migrate in groupings, determined by the related topic(s) and the services that use them. Sometimes these groups evolved as we learned more about interdependencies.
- Most of our services didn't support multiple connections to different clusters simultaneously. We had to make code changes to support that, which meant coordinating with many teams to do the pre-work (see the sketch after this list).
- We had to work with multiple tech stacks: Java/Kotlin, Node.js, Ruby, and Python.
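To make that multi-connection pre-work concrete, here's a minimal sketch of what it can look like in a Node.js service using the kafkajs client. The broker addresses, topic names, and per-topic routing map are illustrative assumptions, not our actual code.

```typescript
import { Kafka, Producer, Consumer } from "kafkajs";

// One service holding live connections to both clusters at once,
// so each topic can be cut over independently of the others.
// (Cluster names and broker addresses are hypothetical.)
const clusters = {
  legacy: new Kafka({ clientId: "orders-service", brokers: ["legacy-kafka:9092"] }),
  dedicated: new Kafka({ clientId: "orders-service", brokers: ["dedicated-kafka:9092"] }),
};

// Per-topic routing: flipping a topic to the dedicated cluster becomes a
// config change rather than a code change. (Topic names are hypothetical.)
const topicCluster: Record<string, keyof typeof clusters> = {
  "orders.created": "dedicated", // already migrated
  "payments.settled": "legacy",  // still on the multi-tenant cluster
};

export function producerFor(topic: string): Producer {
  return clusters[topicCluster[topic] ?? "legacy"].producer();
}

export function consumerFor(topic: string, groupId: string): Consumer {
  return clusters[topicCluster[topic] ?? "legacy"].consumer({ groupId });
}
```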
Our research phase initially led us down a few different pathways. At first, we entertained the idea of creating topics on the new cluster and then manually switching each service to use the new topics. However, when we dug into what that solution would look like for some of our microservices, we unearthed a nest of interdependencies. Untangling these connections would have complicated things and made it impossible to migrate one service without affecting a handful of others in the process.
We also looked at Confluent Replicator and MirrorMaker 2, each of which seemed like a viable solution. Both, however, would have required us to set up custom infrastructure, something we were hoping to avoid.
Finally, we landed on the ideal solution for our team: Confluent's Cluster Linking, which would not require any changes to our infrastructure. The choice was clear: Cluster Linking for the win.
Specificity is the key
When it came to executing the migration, we didn't want to leave any room for guesswork. We knew from Lesson Two that we'd need the work of many teams to reach our goal. To avoid chaos, we had to be incredibly, almost comically specific about who owned each task and when those tasks would be completed.
And so, after many conversations and a lot of careful planning (Remember: Lesson One!), we finally had a detailed blueprint to follow. And we’re not using the term “detailed” lightly — each grouping had a designated owner, customized playbook, dedicated Slack channel, and a dry run that was completed in a staging environment to make sure we avoided any hiccups.
When it came to performing the actual migration, we had a multi-pronged strategy to follow. For groupings where some downtime was acceptable, we used Confluent Cloud Cluster Linking, a fully managed service that allowed us to create an exact copy of topics and consumer group offsets from one cluster to another.
For services where we couldn’t afford any downtime, we created new topics on the destination cluster and migrated the producer first, followed by the consumers. This allowed us to keep the services up with only a minor delay in processing messages while we re-deployed the consumers to use the new cluster.
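As a rough sketch of that sequencing for a Node.js service (again assuming kafkajs; the environment variable, broker addresses, topic, and group id are hypothetical): the cluster a deployment talks to is pure configuration, so the cutover becomes two config-only redeploys, the producer first, then the consumers once the source topic is drained.

```typescript
import { Kafka } from "kafkajs";

// The cluster this deployment points at is driven by config, so cutting the
// service over is a config change plus a redeploy, not a code change: the
// producer deployment gets the destination brokers first; the consumer
// deployment follows once consumer lag on the source topic reaches zero.
const brokers = (process.env.KAFKA_BROKERS ?? "source-kafka:9092").split(",");
const kafka = new Kafka({ clientId: "orders-service", brokers });

const producer = kafka.producer();
const consumer = kafka.consumer({ groupId: "orders-consumer" });

export async function start() {
  await producer.connect();
  await consumer.connect();
  await consumer.subscribe({ topic: "orders.created" });
  await consumer.run({
    eachMessage: async ({ message }) => {
      // ...existing processing logic, unchanged by the migration...
    },
  });
}
```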
In other cases, we created temporary containers to consume the data from the source cluster while the existing containers were migrated over to the new destination cluster. This allowed us to handle some indirect circular dependencies between producers and consumers.
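Here's a minimal sketch of one of those temporary "drain" consumers, assuming kafkajs again (the broker address, topic, and group id are hypothetical). It joins the same consumer group as the real service, but against the source cluster, and is torn down once that cluster stops receiving traffic.

```typescript
import { Kafka } from "kafkajs";

// Temporary container: consumes from the *source* cluster under the same
// group id as the real service, so nothing published there is dropped while
// the service's own containers are redeployed against the destination cluster.
const source = new Kafka({ clientId: "orders-drain", brokers: ["source-kafka:9092"] });
const consumer = source.consumer({ groupId: "orders-consumer" });

async function drain() {
  await consumer.connect();
  await consumer.subscribe({ topic: "orders.created" });
  await consumer.run({
    eachMessage: async ({ message }) => {
      // ...same processing logic as the main service...
    },
  });
}

drain().catch((err) => {
  console.error(err);
  process.exit(1);
});
```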
What we learned
In the end, we completed the migration without any disruption for Wealthsimple clients — in fact, they never realized it was happening. Pulling off such a seamless shift was anything but easy. The weeks of planning and preparation, as well as the help we got from each team that was involved, were essential in making sure the migration went off without a hitch. If we had simply jumped in head-first, we would have run into a number of unexpected problems.
Lesson Three: The only enjoyable games of migration Whack-A-Mole are those you can avoid playing by following Lessons One and Two.
The protagonist of Franz Kafka's 'Metamorphosis' awakens one morning to discover a very unpleasant surprise. Our engineers aren't big fans of receiving bad news before their first cup of coffee, either. Building support into our services for connecting to different brokers at the same time will allow us to migrate or fail over more easily in the future.
In the end, a process that could have been Kafkaesque ran relatively smoothly thanks to all of our up-front work. We're equipped to scale at the rate our client growth demands, and that makes it all worth it. And if we ever have to run a migration of this magnitude again? You can bet we plan to spend a lot of time on planning.
...
Written by Teresa Lo, Senior Software Engineer and Andrew Thauer, Staff Software Engineer