**UK:** National Grid’s multi-year digital transformation uses cloud-based event-driven architecture, domain-driven design, and new team structures to maintain customer portal uptime during mainframe outages while improving scalability, reliability, and customer satisfaction.
In mid-2024, during a Program Increment (PI) planning session at National Grid, a regional utility provider delivering gas and electric services, about half of the assembled group abruptly left the meeting. The cause was an outage of the billing mainframe, a system integral to the customer web portal and numerous other systems. Those who stayed were primarily members of the development team engaged in an incremental multi-year program to replace the legacy customer web portal.
Despite the outage impacting the system of record, the new cloud-based streaming architecture they had developed maintained service continuity for users. This architecture dynamically routed users to newly developed pages, ensuring that the customer self-service portal—enabling functions like bill viewing, payments, utility service management, and energy monitoring—remained operational even as the mainframe suffered downtime.
The story of National Grid’s transformation centres on a comprehensive decoupling strategy spanning technical systems, organisational structure, and semantic understanding. This approach has allowed the utility to transition away from a rigid legacy architecture without complete rewrites, using four interconnected paradigms:
- Domain-Driven Design (DDD): This methodology helped reshape legacy mainframe data into business-friendly concepts and established a common language between technical staff and business stakeholders. It facilitated semantic decoupling by modelling software around specific business domains.
- Team Topologies: A reorganisational framework that aligns teams with business value streams rather than technical layers. National Grid adopted stream-aligned teams responsible for specific business capabilities, complicated subsystem teams managing mainframe integrations, and enablement teams providing technical foundations. This reduced coordination overhead and improved delivery speed.
- Event-Driven Architecture: Employing asynchronous event communication rather than synchronous APIs, this pattern shifted the system toward eventual consistency, facilitating technical decoupling between front-end and back-end systems.
- Change Data Capture (CDC): This technology enables near real-time streaming of data changes from the mainframe, creating a “system-of-reference” that mirrors the system-of-record without requiring direct synchronous mainframe access, thus enhancing resilience and reducing dependencies.
Initially, National Grid’s Unified Web Portal 1.0 (UWP 1.0) attempted to consolidate disparate legacy systems and mainframes via batch ETL processes into a unified view hosted on a SaaS platform. However, batch data loading caused stale information, and synchronous mainframe calls were required for critical real-time operations, leading to major reliability, scalability, and organisational coordination challenges.
The legacy architecture suffered from data freshness issues, a proliferation of Backend-for-Frontend (BFF) APIs tailored to specific client needs, and synchronous dependencies that could trigger cascading system failures during traffic spikes. Organisationally, siloed teams handled discrete technology layers, complicating delivery with high coordination demands.
To overcome these challenges, National Grid established clear objectives: decouple frontend and web services from mainframe dependencies, reduce brittle point-to-point integrations, empower stream-aligned teams with end-to-end feature ownership, and ultimately improve customer satisfaction while reducing operational costs.
The modern architecture retains the mainframe as the system-of-record. CDC captures its changes and streams them into event hubs (Kafka). Background services process these streams, transforming mainframe data into domain-centric models stored in document databases such as Cosmos DB. APIs, primarily using GraphQL with schema stitching, expose these domain models to web, mobile, and other applications. This separation enables incremental adoption, and the elastic cloud-based components handle millions of daily events; user traffic grew from dozens to millions over an 18-month period without scaling issues.
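The step in which background services turn mainframe change events into domain-centric documents can be sketched roughly as follows. This is an illustrative sketch only: the `ChangeEvent` shape, the legacy column names (`CUST_NM`, `BAL_AMT_CENTS`) and the in-memory `AccountProjection` are hypothetical stand-ins, not National Grid's actual schema, CDC envelope, or Cosmos DB code.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChangeEvent:
    """Hypothetical CDC event; real CDC tools emit richer envelopes."""
    table: str                      # source mainframe table (assumed name)
    op: str                         # "upsert" or "delete"
    key: str                        # primary key of the changed row
    sequence: int                   # position in the change log, increasing
    row: dict[str, Any] = field(default_factory=dict)


class AccountProjection:
    """In-memory stand-in for a document store of domain models."""

    def __init__(self) -> None:
        self.documents: dict[str, dict[str, Any]] = {}
        self.last_sequence: dict[str, int] = {}

    def apply(self, event: ChangeEvent) -> None:
        # Ignore stale or redelivered events so replays are harmless.
        if event.sequence <= self.last_sequence.get(event.key, -1):
            return
        self.last_sequence[event.key] = event.sequence

        if event.op == "delete":
            self.documents.pop(event.key, None)
            return

        # Translate legacy column names into business-friendly fields.
        self.documents[event.key] = {
            "accountId": event.key,
            "customerName": event.row["CUST_NM"].strip(),
            "balanceDueCents": event.row["BAL_AMT_CENTS"],
        }
```

Tracking the last applied sequence per key is one simple way to make the projection tolerate out-of-order or repeated delivery, which streaming pipelines commonly require; the source does not say which mechanism the team used.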
The implementation of DDD began with event-storming workshops to define bounded contexts, entities, commands, and events based on business needs rather than legacy constraints. These definitions informed the design of GraphQL APIs that avoided the previous BFF proliferation by allowing flexible, optimised queries across interconnected domains.
From an organisational perspective, Team Topologies structured teams into stream-aligned teams owning specific business capabilities, enablement teams providing infrastructural support, and complicated subsystem teams responsible for complex mainframe integration. This alignment ensured that team boundaries mirrored system boundaries, promoting ownership and reducing the cognitive load of inter-team dependencies.
Within the system-of-reference, event-driven patterns extended to internal communications via the Outbox Pattern, enabling domain events to maintain consistency and drive derived value computations. For operations affecting the mainframe, such as payment transactions, a Parallel Saga pattern was implemented to manage eventual consistency and orchestration between user requests and mainframe state changes. Race conditions between user-initiated responses and CDC updates posed a particular challenge; the team addressed them by maintaining dual data collections and using API aggregates to reconcile state.
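The Outbox Pattern mentioned above can be illustrated with a minimal sketch. It assumes a relational store (here an in-memory SQLite database) purely for demonstration; the table names, the `PaymentSubmitted` event and the `publish` callback are hypothetical, and the source does not describe how National Grid implemented the pattern on its document databases.

```python
import json
import sqlite3
from typing import Callable


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a state table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE payments (
            id TEXT PRIMARY KEY, amount_cents INTEGER, status TEXT
        );
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT, payload TEXT, published INTEGER DEFAULT 0
        );
        """
    )
    return conn


def submit_payment(conn: sqlite3.Connection, payment_id: str, amount_cents: int) -> None:
    # `with conn` commits both inserts together, or rolls both back on error,
    # so the state change and its event can never be stored separately.
    with conn:
        conn.execute(
            "INSERT INTO payments VALUES (?, ?, 'PENDING')",
            (payment_id, amount_cents),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("PaymentSubmitted",
             json.dumps({"paymentId": payment_id, "amountCents": amount_cents})),
        )


def relay_outbox(conn: sqlite3.Connection,
                 publish: Callable[[str, dict], None]) -> int:
    """Publish unpublished events in order; returns how many were sent."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        # A crash after publish but before the update causes a resend,
        # so delivery is at-least-once and consumers should be idempotent.
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

In a real deployment `publish` would hand the event to a broker such as Kafka, and the relay would run as a background process rather than being called inline.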
Workflow state machines were designed to be composable, supporting complex multi-step user processes such as combined payment and bank account updates, each as discrete but interlinked workflows.
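One way to picture composable workflows is to give a workflow the same interface as a single step, so that a combined process is simply a workflow whose steps are other workflows. The sketch below is a simplified, synchronous illustration under that assumption; the class names and statuses are hypothetical, and a production version would wait on asynchronous events rather than call each step directly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Protocol, Sequence


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class Runnable(Protocol):
    """Anything that can be run as part of a workflow."""
    status: Status

    def run(self, context: dict) -> Status: ...


@dataclass
class Step:
    name: str
    action: Callable[[dict], bool]   # returns True on success
    status: Status = Status.PENDING

    def run(self, context: dict) -> Status:
        self.status = Status.RUNNING
        self.status = Status.SUCCEEDED if self.action(context) else Status.FAILED
        return self.status


class Workflow:
    """Runs its parts in order; it is itself Runnable, so workflows nest."""

    def __init__(self, name: str, parts: Sequence[Runnable]) -> None:
        self.name = name
        self.parts = list(parts)
        self.status = Status.PENDING

    def run(self, context: dict) -> Status:
        self.status = Status.RUNNING
        for part in self.parts:
            # Stop at the first failure; later parts stay PENDING.
            if part.run(context) is Status.FAILED:
                self.status = Status.FAILED
                return self.status
        self.status = Status.SUCCEEDED
        return self.status
```

Under this sketch, a combined "update bank account, then pay" process would be built as `Workflow("update-and-pay", [update_bank, pay])`, where `update_bank` and `pay` are themselves `Workflow` instances that can also run on their own.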
Several challenges arose during this transformation. Understanding and adopting event-driven architecture required substantial cultural and mindset shifts. Observability had to be enhanced with correlated logs, metrics, and traces to debug asynchronous workflows effectively. Batch processing on the mainframe occasionally flooded the CDC pipeline, causing data latency that was mitigated by handling different event topics separately. The GraphQL schema stitching approach proved difficult to maintain and coupled the domains tightly; the team acknowledged that federation might have been a better initial strategy. Additionally, cross-cutting concerns like authentication and observability needed shared libraries managed through automation to maintain consistency across multiple repositories.
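Correlated logging across asynchronous steps is often achieved by carrying a correlation identifier on every event and stamping it onto each log record. The following sketch shows that idea with Python's standard `logging` and `contextvars` modules; the event shape, the `correlationId` field and the `handle_event` function are assumptions for illustration, not details from the source.

```python
import contextvars
import logging
import uuid

# Holds the correlation ID for whatever event is currently being handled.
correlation_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default="-"
)


class CorrelationFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def handle_event(event: dict, logger: logging.Logger) -> dict:
    """Process one event and emit a follow-up event with the same ID."""
    # Reuse the incoming ID, or start a new trace if the event has none.
    token = correlation_id.set(event.get("correlationId") or str(uuid.uuid4()))
    try:
        logger.info("processing %s", event["type"])
        return {
            "type": event["type"] + "Processed",
            "correlationId": correlation_id.get(),
        }
    finally:
        # Restore the previous value so IDs never leak between events.
        correlation_id.reset(token)
```

With `CorrelationFilter` added to a logger and `%(correlation_id)s` included in the formatter string, every log line from a multi-step workflow can be searched by the same identifier.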
National Grid adopted an Agile Release Train process complemented by feature flags and automated pipelines managing deployment artifacts across more than 30 repositories, enabling coordinated yet autonomous team releases that increased velocity, stability, and predictability.
To further reduce risk, the transition from UWP 1.0 to UWP 2.0 deployed a hybrid architecture using edge routing to direct user traffic gradually between the old and new systems, maintaining a seamless user experience through context awareness.
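Gradual edge routing of this kind is commonly implemented by assigning each user to a stable bucket and sending only some buckets, for only the migrated pages, to the new system. The sketch below assumes that approach; the hash-based bucketing, the `uwp1`/`uwp2` labels and the path prefixes are hypothetical, and the source does not specify how National Grid's edge layer made its routing decisions.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(user_id: str, path: str,
          migrated_prefixes: set[str], rollout_percent: int) -> str:
    """Return which portal should serve this request."""
    # Only pages already rebuilt on the new platform are eligible.
    migrated = any(path.startswith(prefix) for prefix in migrated_prefixes)
    # The same user always lands in the same bucket, so their experience
    # stays consistent as the rollout percentage is raised.
    if migrated and bucket_for(user_id) < rollout_percent:
        return "uwp2"
    return "uwp1"
```

Raising `rollout_percent` from 0 to 100 moves traffic across in steps, and lowering it again acts as a quick rollback without redeploying either portal.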
Reflecting on the transformation, National Grid achieved significant technical and business outcomes: decoupling the frontend from mainframe constraints, eliminating brittle integrations, empowering team ownership, reducing call centre volumes, lowering licensing expenses, and improving customer satisfaction through enhanced portal stability. The architecture and organisational model foster an evolving platform not reliant on centralised legacy systems, positioning National Grid for continued modernisation, including potential future mainframe replacement via the Strangler Fig pattern.
This comprehensive journey demonstrates a strategic embrace of decoupling—technical, organisational, and semantic—as a foundation for scalable, maintainable, and resilient utility customer services. The InfoQ report highlights how combining Change Data Capture, Domain-Driven Design, Event-Driven Architecture, Team Topologies, and incremental rollout strategies can effectively modernise legacy systems in critical industry sectors.
Source: Noah Wire Services