Swift incident recovery with event-driven architecture

Published in Macquarie Engineering Blog

4 min read · Oct 23, 2023

By Cliff Cubero, Solutions Architect at Macquarie Bank

At Macquarie Bank, software engineers are given the opportunity to explore the latest technology, architecture and frameworks that accelerate delivery, establish security and promote resiliency within our platforms. These efforts are backed by security assessments, risk reviews and broader architecture and engineering team approvals, empowering us to confidently deliver great outcomes.

In a recent project, Macquarie Bank’s engineers in the Wealth Management division implemented an event-driven architecture using microservices while integrating with a third-party platform for trade settlement and clearing. Event-driven architecture is a solution design in which distributed domain systems communicate through events to perform an operation or transaction. It is known for addressing problems with availability, scalability and synchronous processing, enabling our platforms to communicate and process asynchronously while remaining reliable enough to recover gracefully from incidents.

The opportunity

This project was an opportunity for the engineers to demonstrate how resilient the event-driven architecture is compared with legacy platforms. When a legacy platform fails or suffers downtime, recovery usually requires more manual intervention and processing, and takes longer. The new systems, by contrast, were implemented so that they need minimal or no operational intervention after a failure or downtime.

Because the project was delivered in phases, we were able to compare the new architecture with the legacy platform.

The legacy side

In the legacy platform, communication between systems was synchronous and relied on a request-response pattern. When one of the integrated systems went down, the end-to-end transaction was disrupted, and nothing handled the subsequent requests, which were bound to fail as well. The number of unsuccessful requests kept growing until the system recovered.

Logical architecture showing the interactions with the legacy platform

Once the issue was remediated, the legacy setup offered no way to replay or handle the unsuccessful requests other than manually going through them one by one. If the incident caused 1,000 unsuccessful requests, those 1,000 requests had to be repeated or processed manually, which takes a lot of time and effort.
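To make the failure mode concrete, here is a minimal sketch of a synchronous request-response integration during an outage. The `SettlementSystem` class, the `settle` method and the trade IDs are hypothetical stand-ins for illustration only, not the actual legacy interfaces, and the figure of 1,000 requests simply mirrors the example above.

```python
class DownstreamUnavailable(Exception):
    """Raised when the (hypothetical) downstream system cannot be reached."""


class SettlementSystem:
    """Hypothetical stand-in for an integrated system called synchronously."""

    def __init__(self):
        self.available = True
        self.settled = []

    def settle(self, trade_id):
        if not self.available:
            raise DownstreamUnavailable(trade_id)
        self.settled.append(trade_id)


def submit_synchronously(system, trade_ids):
    """Call the downstream system once per request and collect the failures.

    Nothing in this pattern stores failed requests for later, so the caller
    is left with a list of requests that must be resubmitted after the outage.
    """
    failed = []
    for trade_id in trade_ids:
        try:
            system.settle(trade_id)
        except DownstreamUnavailable:
            failed.append(trade_id)
    return failed


system = SettlementSystem()
system.available = False            # outage begins
failed = submit_synchronously(system, range(1000))

system.available = True             # outage ends
for trade_id in failed:             # stands in for manual, one-by-one reprocessing
    system.settle(trade_id)
```

In this sketch the loop at the end is a single line of code, but in the legacy setup described above each iteration corresponds to an operator repeating one request by hand.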

The event-driven architecture side

In the event-driven architecture that we implemented, we placed a strong emphasis on asynchronous messaging between services and the ability to replay requests. This gave us an effective contingency for when the receiving microservice becomes unavailable: requests stay in the queue until it is back. If requests are consumed but cannot be processed because of a technical issue in another service call, they can easily be replayed.

Logical architecture demonstrating the interactions with the new platform through event-driven architecture
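The queue-and-replay behaviour described above can be sketched roughly as follows. This is an illustrative in-memory model, not the actual implementation: the `Listener` class, `replay_failed`, `call_downstream` and the `trade-N` identifiers are all assumed names, and a real system would use a message broker rather than a `deque`.

```python
from collections import deque


class ServiceCallError(Exception):
    """Raised when a (hypothetical) downstream service call fails."""


class Listener:
    """Hypothetical listener microservice that consumes requests from a queue."""

    def __init__(self, queue, handler):
        self.queue = queue
        self.handler = handler
        self.processed = []
        self.failed = []      # consumed but not processed; kept for replay

    def poll(self):
        """Consume every request currently waiting in the queue."""
        while self.queue:
            self._handle(self.queue.popleft())

    def replay_failed(self):
        """Re-run all previously failed requests in a single call."""
        to_retry, self.failed = self.failed, []
        for request in to_retry:
            self._handle(request)

    def _handle(self, request):
        try:
            self.handler(request)
        except ServiceCallError:
            self.failed.append(request)
        else:
            self.processed.append(request)


downstream = {"up": False}    # toggled to simulate an outage


def call_downstream(request):
    """Assumed downstream service call that fails while the service is down."""
    if not downstream["up"]:
        raise ServiceCallError(request)


queue = deque(["trade-1", "trade-2", "trade-3"])
listener = Listener(queue, call_downstream)

listener.poll()               # downstream is down: requests are consumed but fail
downstream["up"] = True       # downstream recovers
queue.append("trade-4")       # a new request arrives after recovery
listener.poll()               # processed normally
listener.replay_failed()      # one call replays everything that failed earlier
```

The design point is that a failed request is retained rather than lost, so recovery becomes one replay call instead of one manual action per request.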

On learning through our monitoring that one of the integrated services was disrupted, we deliberately shut down the listener service, the microservice that receives the requests. With the listener stopped, there were no further errors from calling the disrupted system, and all requests stayed in the queue until the listener was started again. The plan was to start the microservice once the issue was fixed so that the queued requests would flow through, and then to replay the requests that had initially failed manually through a single service call.

As soon as we got the go-ahead to start the recovery, we executed the plan above. It took less than 10 minutes, and we confirmed that the recovery was a success. More than 1,000 queued requests came through seamlessly once the microservice was started.
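The stop-and-restart part of this recovery plan can be modelled in a few lines. The sketch below is a simplified assumption of how such a listener might behave, with a hypothetical `PausableListener` class and an in-memory queue; the count of 1,000 requests is illustrative and the replay of initially failed requests is left out.

```python
from collections import deque


class PausableListener:
    """Hypothetical listener that only consumes while it is running."""

    def __init__(self, queue, handler):
        self.queue = queue
        self.handler = handler
        self.running = True

    def stop(self):
        self.running = False

    def start(self):
        self.running = True

    def drain(self):
        """Process queued requests in arrival order; do nothing while stopped."""
        handled = 0
        while self.running and self.queue:
            self.handler(self.queue.popleft())
            handled += 1
        return handled


queue = deque()
settled = []
listener = PausableListener(queue, settled.append)

# 1. Monitoring reports a disruption: stop the listener on purpose.
listener.stop()

# 2. Producers keep publishing. Nothing is consumed, so nothing calls the
#    disrupted system and no new errors are raised.
for n in range(1000):
    queue.append(f"request-{n}")
consumed_while_stopped = listener.drain()
backlog = len(queue)

# 3. The issue is fixed: start the listener and let the backlog flow through.
listener.start()
handled = listener.drain()
```

Stopping the consumer rather than the producers is what keeps the backlog intact: the queue absorbs the outage, and restarting the listener is the only operational step needed to clear it.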

Outcome

This incident has initiated a review of how the legacy platform would handle such scenarios, where transactions have to be retried. The whole process would have taken much longer than the new platform's recovery time of less than 10 minutes with the event-driven architecture. This approach helps ensure our platforms stay available and customers are not affected by downtime.

Event-driven architecture has its pros and cons, and the design must be given careful thought. Asynchronous processing and the ability to replay are only two of the considerations; others are needed for the architecture to be scalable, secure, consistent, performant and resilient. Even so, this incident has shown how swift recovery can be, with minimal operational effort, through event-driven architecture.

Written by Engineers at Macquarie