Get to know our Machine Learning Platform — and how it can deploy a model in under 15 minutes
How we reduced the time it took to deploy machine learning models from months to minutes.
In 2005, Ruby on Rails made a lasting impact on web development when its creator, David Heinemeier Hansson, demonstrated how to build a blog in just 15 minutes, something that used to take several weeks (or about the time it takes to watch Stranger Things eight times on repeat). This abstraction of complex workflows empowered millions of developers to do their best work without having to worry about configuration or labour-intensive setup.
Fast forward to 2022, and we were still looking for that same eureka moment when it came to deploying machine learning models. From powering money movement to fraud detection, machine learning models have been critical to Wealthsimple’s core business processes since 2018. So, when we decided to build our first-ever machine learning platform last year, we set our sights high.
We wanted to revolutionize model deployment just like Rails transformed web development 17 years ago: what if we could deploy a machine learning model in just 15 minutes?
Gathering the Requirements
The first step on our road to reimagining model deployment was a deep dive into the existing state. Kind of like going into the “upside down” to better understand what you’re really dealing with. We found that systems for developing and deploying these models were splintered across several teams. Efforts were duplicated, and it often took months before a model went into production.
With this in mind, we designed our platform to scale machine learning at Wealthsimple. Our priorities were:
- Fast iterations for exploration and experimentation, enabling model developers to test changes within hours;
- New models with low maintenance burdens that could be deployed in days, rather than weeks or months;
- And lastly, automated testing and monitoring frameworks we could trust fully, so that changes could be shipped with confidence.
Building Our Machine Learning Platform
When it came to design principles, we drew inspiration from the Rails doctrine to accelerate machine learning through convention over configuration. The developer experience was put on a pedestal, just as it should be.
Value Integrated Systems: one system to manage the entire lifecycle
We believed in the value of building an integrated system; that is, a single system that addresses the entire problem. Our platform had to manage the whole lifecycle — from model development all the way to serving predictions in production.
The integrated system enabled several powerful abstractions, which allowed us to train and deploy a toy model in under 15 minutes.
Model Training
The developer interface for model training is a REST API over an MLflow backend. Each model is hosted as a separate project within this application. The goal is to create containerized projects to train and develop each model in an isolated environment. This way, data scientists can choose the most suitable frameworks and integrations for their models without worrying about conflicting dependencies.
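To make that concrete, here is a minimal sketch of what training a model against an MLflow tracking backend can look like; the tracking URI, experiment name, and model are illustrative placeholders rather than the platform's actual setup.

```python
# Minimal sketch: training a toy model and logging it to an MLflow backend.
# The tracking URI, experiment name, and model are illustrative placeholders.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical backend
mlflow.set_experiment("toy-model")

X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)

with mlflow.start_run() as run:
    model = LogisticRegression(max_iter=1_000).fit(X, y)
    mlflow.log_param("max_iter", 1_000)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Log the trained model as a run artifact so it can be registered later.
    mlflow.sklearn.log_model(model, artifact_path="model")
    print(f"Logged model under run {run.info.run_id}")
```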
Convention Over Configuration: code automations for new projects
The teams deploying the first models on the platform spent a lot of time mulling over configuration. Subsequent platform improvements favoured convention instead. Abstracting the mundane tasks, like how we set up logging or how we integrate with the experimentation platform, supercharged model development. By decreasing the cognitive load on data scientists, we allowed them to focus on the work they do best: building next-level models.
We noticed that over 60% of the code written for new projects was scaffolding, and that this was where much of the time was spent. So we made that boilerplate generatable with a single command: new projects came pre-populated with the necessary plugins and scaffolding out of the box. This was a step change in how quickly we could ship new models.
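To give a feel for what such a generator does, the sketch below creates a pre-populated project directory from a single command; the command, directory layout, and file contents are hypothetical and not the platform's real templates.

```python
# Hypothetical sketch of a "new project" scaffolding command.
# The directory layout and file contents are invented for illustration.
import argparse
from pathlib import Path

SCAFFOLD = {
    "train.py": "# Entry point: load data, train, and log the model to MLflow.\n",
    "predict.py": "# Entry point: load the registered model and serve predictions.\n",
    "Dockerfile": "FROM python:3.10-slim\nCOPY . /app\nWORKDIR /app\n",
    "requirements.txt": "mlflow\nscikit-learn\n",
    "logging.yaml": "version: 1\nroot:\n  level: INFO\n",
}

def new_project(name: str, root: Path = Path(".")) -> Path:
    """Create a pre-populated project directory with standard scaffolding."""
    project_dir = root / name
    project_dir.mkdir(parents=True, exist_ok=False)
    for filename, contents in SCAFFOLD.items():
        (project_dir / filename).write_text(contents)
    return project_dir

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Scaffold a new ML project.")
    parser.add_argument("name", help="Name of the new model project")
    print(f"Created {new_project(parser.parse_args().name)}")
```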
Model Deployment
With one integrated system, we were able to streamline model deployment into a single request. Once trained and vetted, models were registered and tracked through our model registry, where they were then picked up by our prediction serving system.
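As a rough illustration of what that single request can boil down to, here is a minimal sketch using MLflow's model registry; the run ID, model name, and stage are placeholders, and the platform wraps this flow behind its own API.

```python
# Minimal sketch: registering a vetted run's model so serving can pick it up.
# The run ID, model name, and stage are illustrative placeholders.
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical backend

run_id = "abc123"  # the trained and vetted run
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="division-inference-model",
)

# Promote the new version so the prediction serving system picks it up.
MlflowClient().transition_model_version_stage(
    name="division-inference-model",
    version=model_version.version,
    stage="Production",
)
```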
Prediction Serving
We abstracted the complex logic of serving predictions by adopting an open-source framework, Nvidia’s Triton Inference Server, to do the heavy lifting under the hood. We built a web application on top of it to broker communication with the backend. This provided a simple layer for data processing and supported functionality such as A/B testing different model variants and making predictions in shadow mode. Just like Eleven, we were honing our skills and getting ready to make a huge impact (without the explosions).
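For a sense of how such a broker might talk to Triton, here is an illustrative sketch using the tritonclient library; the model names, tensor names, traffic split, and shadow model are invented for this example and do not reflect the platform's actual routing logic.

```python
# Illustrative sketch of a broker layer in front of Triton Inference Server.
# Model names, tensor names, and routing rules are invented for illustration.
import random

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton.internal:8000")  # hypothetical host

def predict(features: np.ndarray, model_name: str) -> np.ndarray:
    """Send one inference request to Triton and return the output tensor."""
    inputs = [httpclient.InferInput("INPUT__0", list(features.shape), "FP32")]
    inputs[0].set_data_from_numpy(features.astype(np.float32))
    outputs = [httpclient.InferRequestedOutput("OUTPUT__0")]
    result = client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
    return result.as_numpy("OUTPUT__0")

def handle_request(features: np.ndarray) -> np.ndarray:
    # A/B test: route a small share of traffic to the candidate variant.
    model = "toy_model_b" if random.random() < 0.1 else "toy_model_a"
    prediction = predict(features, model)
    # Shadow mode: also score with the shadow model and log it, but never return it.
    shadow_prediction = predict(features, "toy_model_shadow")
    print(f"served={model} shadow_prediction={shadow_prediction.tolist()}")
    return prediction
```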
Impact of Migrating the Division Inference Model onto the Platform
We were able to quantify the impact of the platform by migrating one of our earliest models, the division inference model (DIM), onto the platform. DIM is a complex model that uses hundreds of thousands of parameters to predict the division, within a particular institution, to which an institutional transfer should be sent.
Each incorrect prediction extends the time it takes for assets to reach clients’ accounts, adding toil for our internal teams and worsening the client experience. We found that, after migrating DIM onto the platform, it was five times faster to make changes and run experiments, which expedited model improvements.
In addition, we reduced prediction failures by over 98% as the unified platform was more resilient to unexpected inputs and had better tooling to diagnose bugs in production. Lastly, we observed much faster predictions, with an almost 20% reduction in latency.
By June of 2022, we had built the first version of our machine learning platform!
What’s next?
Since last June, several new models have been developed and deployed using this platform. With it, we reduced the time to produce new models from months to days and unlocked even more capabilities for continuous improvement. The next step is to transition the machine learning platform into the automated decision platform, creating a game-changing, single integrated system for all automated decisions at Wealthsimple.
...
Written by Mandy Gu, Senior Engineering Manager