Disaster Recovery as a Service
Designing, implementing, and operating a full DRaaS programme for a mature Blue Prism CoE in UK financial services — including multiple weeks running live production workloads against DR infrastructure.
Running business-critical automation in a regulated UK financial services environment means the consequences of a production failure are serious — not just technically, but operationally and from a compliance standpoint. Processes that had been running reliably for years were now embedded deeply enough in operational workflows that losing them, even temporarily, had direct business impact.
The CoE had reached a level of maturity where DR was no longer a future consideration. It was a present obligation. The question was whether the existing environment could support a credible DR capability — and what it would take to build one that actually worked.
Most organisations with a mature Blue Prism CoE have some form of DR documentation. Very few have a DR environment that has been architecturally integrated, operationally validated, and tested under real production conditions. The gap between a DR plan and a DR capability that you can actually invoke is wider than it looks.
The work here was to close that gap entirely — not just design a DR architecture, but build it, integrate it into the production operational model, and prove it by running in it.
A credible DR capability for Blue Prism requires work across three distinct layers, each a dependency of the others. Solve only two and the third becomes your single point of failure.
Underlying architecture
Every production component had a DRaaS counterpart. Runtime Resource VMs were cloned and held in the DR environment ready to activate. Application servers were configured in logical load-balanced pools separated by site — the pattern Blue Prism's own HA reference architecture recommends for DR scenarios. SQL databases were already clustered; DR used a pilot light approach, maintaining a minimal standby ready to scale up on invocation rather than running full parallel infrastructure at permanent cost.
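The "every production component had a DRaaS counterpart" discipline is, at heart, a coverage check: for each production role, does the DR estate hold at least matching capacity? A minimal sketch of that check follows — component names and roles are invented for illustration, and this is not the tooling the programme actually used:

```python
from collections import Counter

# Hypothetical production estate: machine name -> role it plays.
production = {
    "runtime-vm-01": "runtime",
    "runtime-vm-02": "runtime",
    "appserver-a": "app-server",
    "sql-cluster": "database",
}

# Hypothetical DRaaS counterparts. The database entry is the pilot
# light: a minimal standby, scaled up only on invocation.
draas = {
    "dr-runtime-vm-01": "runtime",
    "dr-runtime-vm-02": "runtime",
    "dr-appserver-a": "app-server",
    "dr-sql-standby": "database",
}

def uncovered_roles(prod, dr):
    """Return roles where DR capacity falls short of production's."""
    need, have = Counter(prod.values()), Counter(dr.values())
    return {role: need[role] - have[role]
            for role in need if have[role] < need[role]}

gaps = uncovered_roles(production, draas)
assert not gaps, f"DR coverage gaps: {gaps}"
```

In a real estate the inventory would be pulled from configuration management rather than hand-written, but the invariant is the same: the check fails loudly the moment a production component gains no DR counterpart.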
Operational control
The Blue Prism platform was configured to run in either environment without reconfiguration. DRaaS runtime resources were onboarded into Blue Prism and a full parallel set of schedules was built against them. Failover was executable as a deliberate operational step: retire the production machines in Blue Prism, activate the DRaaS schedules. Process design and exception handling were assessed and adjusted to ensure automation would continue cleanly under DR conditions — not just start, but sustain.
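The failover step described above — retire the production runtime resources, then activate the pre-built parallel schedule set — can be modelled as a small, deliberate state transition. The `Platform` class and its method names below are hypothetical stand-ins for illustration, not the Blue Prism API:

```python
# Hypothetical model of the deliberate failover step: retire every
# production runtime resource, then activate the parallel DRaaS schedules.

class Platform:
    def __init__(self, prod_resources, dr_schedules):
        self.active_resources = set(prod_resources)
        self.retired = set()
        self.dr_schedules = set(dr_schedules)
        self.active_schedules = set()

    def retire(self, resource):
        # Retiring removes a machine from scheduling; it does not
        # delete it or its history.
        self.active_resources.discard(resource)
        self.retired.add(resource)

    def invoke_dr(self):
        # A deliberate operational step, not an automatic trigger:
        # 1. retire the production runtime resources
        for resource in list(self.active_resources):
            self.retire(resource)
        # 2. activate the pre-built parallel schedule set
        self.active_schedules |= self.dr_schedules

platform = Platform(
    prod_resources=["prod-rr-01", "prod-rr-02"],
    dr_schedules=["dr-sched-billing", "dr-sched-payments"],
)
platform.invoke_dr()
assert not platform.active_resources
assert platform.active_schedules == platform.dr_schedules
```

The point the model makes is the one the programme relied on: because the DR schedules exist in advance as a full parallel set, failover is two reversible operations against known state, not an improvised rebuild under pressure.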
Target system availability
Business systems accessed by automated processes have their own DR availability profiles — and those profiles don't always match the automation platform's. Every process dependency was mapped against the DR availability of its target systems, identifying where automation could continue immediately, where it needed to wait on system recovery, and where recovery sequencing mattered. This dependency mapping shaped both the failover runbooks and the schedule activation order.
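The recovery-sequencing idea — activate a schedule only once its target systems are available, and respect dependencies between the systems themselves — is a topological ordering over the dependency map. A minimal sketch using Python's standard-library `graphlib` follows; the process and system names are invented for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each automated process depends on the
# target systems it touches, and systems may recover in sequence
# (here, the warehouse loads from core banking, so it recovers after it).
depends_on = {
    "payments-process": {"core-banking"},
    "reporting-process": {"data-warehouse"},
    "data-warehouse": {"core-banking"},
    "core-banking": set(),
}

# static_order() yields a valid activation sequence: every node appears
# after all of its dependencies, so core-banking comes before anything
# that touches it, and reporting-process comes last.
activation_order = list(TopologicalSorter(depends_on).static_order())
```

The same structure also answers the "where must automation wait" question from the paragraph above: any process whose dependencies include a slow-recovering system inherits that delay, and it is visible in the graph before invocation rather than discovered during it.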
The DRaaS switch was invoked and the entire production automation operation was run on DRaaS infrastructure for multiple weeks. Not a test window, not a subset of processes — the full production workload, on DR machines, against DR systems, with live business transactions running through it.
On the Blue Prism side, this meant the production runtime resources were retired, the DRaaS schedules were activated, and the platform operated as normal — just against a different set of machines. The pilot light SQL environment scaled up cleanly. The parallel schedule set ran without intervention.
Running sustained production in DRaaS surfaces what planned tests cannot. Timing dependencies that behave differently under DR infrastructure. Processes with implicit assumptions about system availability windows that only become visible under real operational load. Exception handling edge cases that only appear when recovery sequencing differs from the production run order.
Each finding fed back into the design — refining runbooks, adjusting schedule configurations, and tightening the dependency map — until the environment ran without issue. By the time the programme concluded, the DR capability had been proven by operation, not assumption.
A production-proven DR capability — not just a plan.
For organisations in regulated sectors, the distinction matters. Auditors and compliance teams understand the difference between documented DR intent and a DR environment that has operated under real conditions. The programme delivered both the architecture and the evidence.
Operational Excellence in ROM2 covers the infrastructure, configuration, and practices that determine whether automation runs reliably in production. Resilience and disaster recovery sit at the heart of this dimension — the difference between a CoE that can sustain operations under adverse conditions and one that can't.
Assess your CoE's operational maturity →