Building a Culture of Reliability at Wealthsimple
How we ensure that our clients receive exceptional service and unwavering reliability every day.
Wealthsimple’s commitment to reliability is more than just a technical requirement—it's an integral part of our culture and a cornerstone of how we deliver value to our clients. With over 3 million Canadians who trust us to handle their financial needs, “being reliable” is simply table stakes. We’ve crafted a comprehensive approach that combines technology, proactive processes, and a relentless focus on continuous improvement to ensure our services remain reliable, resilient, and robust.
Leveraging Advanced Tools for Observability, Logging and Incident Management
A crucial component of our reliability strategy is observability. We use Datadog to gain real-time insights into our systems and applications through tools like APM (Application Performance Monitoring), SLOs (Service Level Objectives), and Monitors.
Datadog lets us observe system performance using dashboards and notebooks, pinpoint issues through indexed logs, and identify issues early through monitors, alerts, and SLOs. Ideally, we catch issues before they escalate into client-facing problems. By visualizing metrics, logs, and traces, our teams can quickly discover and resolve problems, keeping the experience smooth and uninterrupted for our clients.
However, when incidents do occur, timely and effective incident response is essential. We know incidents can erode trust in Wealthsimple, so we use several tools to respond quickly and effectively:
- We use PagerDuty to notify responders of incidents. Pages are triggered both automatically, in response to monitor alerts, and manually.
- We have a culture of “if you smell smoke, pull the fire alarm!”
- Rootly is the tool we use for incident response, and it has given us superpowers for responding to incidents. It helps us create a ton of structured workflows that facilitate coordinated responses, clear communication directly in Slack, and detailed post-incident analysis.
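To make the automated and manual paging paths above a little more concrete, here is a minimal sketch of what a "pull the fire alarm" page could look like as a PagerDuty Events API v2 trigger event. The endpoint and field names follow PagerDuty's public Events API v2. The helper name, the routing key, and the example values are hypothetical and are not taken from Wealthsimple's actual setup. The sketch only builds the payload and does not send it.

```python
import json

# Public PagerDuty Events API v2 endpoint; the event would be POSTed here as JSON.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by the Events API v2.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="critical", dedup_key=None):
    """Build a trigger event body (hypothetical helper, for illustration only)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    event = {
        "routing_key": routing_key,   # integration key of the PagerDuty service
        "event_action": "trigger",    # open (or re-trigger) an alert
        "payload": {
            "summary": summary,       # short human-readable description
            "source": source,         # what raised the alarm
            "severity": severity,
        },
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into the same alert.
        event["dedup_key"] = dedup_key
    return event


if __name__ == "__main__":
    # Placeholder values; a real routing key comes from the PagerDuty service integration.
    event = build_trigger_event(
        routing_key="EXAMPLE_ROUTING_KEY",
        summary="Elevated error rate on order placement",
        source="manual-page",
        severity="critical",
        dedup_key="order-placement-errors",
    )
    print(json.dumps(event, indent=2))
```

The same event shape would serve both paths: a monitor integration can emit it automatically, and a person can trigger it by hand when something looks wrong.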
Cultivating Sustainable On-Call Practices
Employee well-being is a key priority in our reliability framework. We have established sustainable on-call practices that prevent burnout and maintain high morale among our team members. We have dozens of trained Incident Commanders participating in a 24-hour rotation to ensure incidents are coordinated and communicated effectively. By building a large rotation of Incident Commanders who support each other, we foster an environment where developers can perform at their best while keeping after-hours responsibilities manageable. We also spread general incident response knowledge throughout the company.
Defining and Measuring Reliability with SLOs
Service Level Objectives (SLOs) are vital benchmarks in our pursuit of reliability. These objectives are carefully defined to reflect what our clients expect and to measure our ability to meet those expectations. We monitor our SLOs using Datadog and provide a clear Terraform-based framework that teams use to create them. SLOs help guide our operational priorities and align our efforts with the expectations of our clients, ensuring that we meet and exceed their needs.
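As a rough illustration of how an SLO turns into something a team can prioritize against, the sketch below works through the standard error-budget arithmetic. The 99.9% target, the 30-day window, and the request counts are made-up example values, not Wealthsimple's actual objectives, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of unavailability a time-based SLO allows over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget left for an event-based SLO.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been breached.
    """
    if total_events == 0:
        return 1.0  # no traffic yet, so no budget has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Example values only: a 99.9% target over 30 days allows about 43.2 minutes.
    print(round(error_budget_minutes(0.999, 30), 1))
    # 500 failed requests out of 1,000,000 spends about half of a 99.9% budget.
    print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))
```

A team that has spent most of its budget has a concrete signal to favor reliability work over new features until the budget recovers.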
Ensuring Readiness for Traffic Surges
In the dynamic world of financial services, anticipating and preparing for traffic surges is crucial. We conduct regular load testing to ensure our systems can handle increased demand, particularly during critical events like Market Open—when most of the trading on our platform begins. These performance tests simulate real-world scenarios, allowing us to fully test recent changes on both the frontend and backend to ensure the scalability of our solutions, optimize our infrastructure, and fine-tune application performance under high load conditions.
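The general shape of a load test like this can be sketched in a few lines: fire many concurrent requests, record each latency, and look at a high percentile instead of the average. The sketch below is an assumption-heavy toy and not our actual tooling. `fake_place_order` is a hypothetical stand-in that just sleeps, and the request count and concurrency limit are arbitrary.

```python
import asyncio
import math
import time


async def fake_place_order(order_id):
    """Hypothetical stand-in for a real request; sleeps 0 to 4 ms."""
    await asyncio.sleep(0.001 * (order_id % 5))
    return True


async def run_load(num_requests, concurrency):
    """Issue num_requests calls with at most `concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies = []

    async def one_request(i):
        async with semaphore:
            start = time.perf_counter()
            await fake_place_order(i)
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(one_request(i) for i in range(num_requests)))
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    results = asyncio.run(run_load(num_requests=200, concurrency=20))
    print(f"requests: {len(results)}")
    print(f"p95 latency: {percentile(results, 95) * 1000:.2f} ms")
```

In a real test the stand-in would be replaced by calls against a production-like environment, and the traffic profile would be modeled on an event such as Market Open.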
Retrospectives
Our commitment to continuous improvement is reinforced through regular retrospectives and postmortems. Following every high-severity incident, we run a retrospective to unpack what happened, learn from the incident, and take action to strengthen our services and processes. After this focused retrospective with the appropriate incident stakeholders, we also have a weekly ritual where teams present their retrospective learnings and actions to a much wider audience. In these review meetings, we share retrospectives with developers, product managers, and various levels of leadership. By learning both from what went wrong and from where we got lucky, and by sharing that knowledge as widely as possible, we become stronger and more reliable.
Fostering a Culture of Ownership and Continuous Improvement
Our "maker owner" approach empowers every team (and team member) to take ownership of the quality and reliability of their work. Reliability is not only the concern of our Reliability Engineering team. It’s everyone’s responsibility. Wealthsimple’s approach to reliability is a holistic one, rooted in our culture and supported by skilled teams, advanced tools, and effective processes. By prioritizing observability, sustainable practices, and continuous improvement, we ensure that our clients receive exceptional service and unwavering reliability every day.
...
Written by Chris Inch, Director of Developer Platform Engineering