Architecting Mission Critical Applications Don’t forget the Instrumentation!

Mike Jolliffe, Chief Technology Officer

Overview of Equiniti
• Market leader in UK share registration services
• Partnering around 57% of the FTSE 100 and 40% of the FTSE 250
• Managing over 24 million shareholder accounts
• Offices in London, Worthing, Birmingham, Bristol, Edinburgh and Jersey
• Separated from Lloyds TSB Group on October 1st 2007 after 50 years
• Emphasis on growth

What makes an application ‘Mission Critical’?
A business dependency on that application for core business activity.
Tests such as these may help you determine the importance of the application:
• Would the business survive a major outage of this application?
• Is there a need for “high-9’s” availability over a normal processing period?
• Are there regulatory drivers for the application’s availability, such as Crest settlement capability?
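To make the “high-9’s” test concrete, the sketch below converts an availability target into the downtime it permits over a processing period. It is arithmetic only; the function name and the 30-day period are assumptions chosen for illustration, not figures from Equiniti.

```python
def allowed_downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Return the downtime, in minutes, permitted by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_hours * 60 * (1 - availability_pct / 100)


if __name__ == "__main__":
    period_hours = 30 * 24  # assumed: a 30-day, round-the-clock processing period
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target, period_hours)
        print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

Each extra nine cuts the permitted downtime by a factor of ten, which is why the target needs agreeing with the business before the architecture is fixed.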

The Goals of this Presentation
• To highlight that whilst there is focus on infrastructure availability, there is not always the same degree of attention to application stability
• To make a case for investing design effort in planning for application failures
• To show that understanding the user’s perspective on the application’s behaviour helps both the successful development of the application and all of the ongoing support effort

Our Mission Critical Application
For Equiniti, a system we call Sirius is at the core of what we do for our Share Registration and Employee Share Scheme clients. This is our Mission Critical Application.
• Started in 2003 and live from Q1 2006, as a £40m project to re-engineer our processes and replace our aging OpenVMS systems. This was always going to be more than a straight application rewrite.
• The development was led by Accenture, with HP and Microsoft providing hardware and software resources respectively
• To date this is over 140,000 man days of effort, producing over 2m lines of code and 2500 classes. We continue to extend the system with new capabilities as the business expands: this is not a static application!
• Integrated custom workflow, imaging and real time work prioritisation
• Technology stack is Windows 2003 R2, .Net Framework 3.0 and SQLServer 2005. We started out on Framework 1.1 and SQLServer 2000.

What makes a ‘Mission Critical Application’ successful
Application design characteristics such as:
• Componentisation / abstraction delivering:
  • Clearly defined interfaces and service boundaries
  • Interoperability through those services to other internal or external systems, e.g. integration to Call Centre or Website technologies to re-use functionality already developed for one channel through other delivery mechanisms
• Flexibility and adaptability of the actual application to changing business needs
Infrastructure design and non-functional characteristics such as:
• Design for availability
• Design for scalability
• Processing performance
• Data integrity, i.e. no committed transactions could be lost as part of a system failure

So you have a ‘Standard’ application... e.g. Sirius
Classic 3-tier:
Web (any channel)
• C# / ASP.Net UI for internal business users
• Internal UI is just one channel; any channel can use the same web services
• Takes 1.4 million hits per day on average
• WCF based web service call to the App tier
App
• C# classes for business logic and data access via Genome (ORM+)
• The application exposes web services to be consumed by different channels
• Over 2500 classes, each providing methods to achieve specific business functions
Database
• SQLServer 2005 database of 2TB
• Some tables partitioned for size
• Some tables partitioned to achieve data deletions
+ORM: Object Relational Mapping

Sirius is physically deployed like this
[Diagram: Users; NLB; 6 Web servers; NLB; 8 App servers; database cluster with 1 active and 3 passive nodes; SAN storage]

Sirius is physically deployed like this
[Diagram: Data Centre 1 and Data Centre 2, each with NLB and SAN storage; SAN storage synchronously mirrored over a resilient high speed fibre network; SAN storage at a 3rd location, with triangulation for DR]
Mission Accomplished!

Is the Application as resilient as the Infrastructure?
Applications must be architected to be as resilient as the infrastructure, and to highlight when they fail and what caused the failure.
• Do you architect into the application, from the outset, the basic needs of fault diagnosis?
• You measure infrastructure resilience on the time to recover from an outage, if it’s even detectable by the end user. Do you do that for your application?
• It is not about writing endless logs (but the quality of log entries on a failure is important).
• It is about instrumentation in your application that tells you in near-real-time what’s happening.
• For a successful application you need to be able to:
  • Detect there is a problem (before the users flood the service desk with calls)
  • Restart failed services and/or recover from any damage done by a failed process
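One common way to detect a problem before users report it is a heartbeat check: each component reports in regularly, and anything silent for too long is flagged. The sketch below illustrates that idea only. Sirius is a .Net system; Python is used here purely for brevity, and the class name, the threshold and the component names in the usage example are all hypothetical.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per component and reports stale ones."""

    def __init__(self, max_silence_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock  # injectable so the monitor can be tested without waiting
        self._last_seen: Dict[str, float] = {}

    def beat(self, component: str) -> None:
        """Record that a component has just reported in."""
        self._last_seen[component] = self._clock()

    def stale_components(self) -> List[str]:
        """Return components that have been silent longer than the threshold."""
        now = self._clock()
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self.max_silence_seconds
        )


if __name__ == "__main__":
    fake_now = [0.0]
    monitor = HeartbeatMonitor(30, clock=lambda: fake_now[0])
    monitor.beat("workflow")   # hypothetical component names
    monitor.beat("imaging")
    fake_now[0] = 20.0
    monitor.beat("imaging")
    fake_now[0] = 45.0
    print(monitor.stale_components())
```

The list of stale components is what a monitoring tool would alert on, or what a supervisor process would use to decide which service to restart.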

The End User Perspective
• Users see ‘System Availability’ as their “ability to use the application”, which is not the same as the infrastructure being up and running
• It means that the application must be up, running, and performing fast enough for them to get their work done.
• Before you start architecting the solution, ensure you understand what your users expect you to achieve in terms of availability and performance, as to them these tend to be one and the same thing. If the application runs slowly, that can be almost as bad as it not being available at all.
• Determine ahead of the development what the impacts of failure will be. This helps drive the right architectural and non-functional requirements for a Mission Critical App.
  • The cost of downtime: lost revenue (£’s)
  • Reputational damage: financial impact (£’s)
  • Regulatory breaches and potentially financial penalties (£’s)

Steps to take in the application
• As Architects you must consider from the start how errors will be handled within the application and ensure that development standards reflect your decisions
• Developers must implement proper error trapping, and make informed decisions about how they raise each error and its degree of criticality.
  • Should ‘retries’ be coded in the app (such as for a timeout during a cluster failover)?
  • Should the error be raised to the calling process / written to the Event Log to allow a graceful failure?
• What goes into the Event Log must be meaningful and complete
  • Unique error description and number: this allows tools such as System Centre to pick up the error.
  • Have pre-defined actions configured for System Centre wherever the corrective action is clear from the error code. Treat changes/updates to these actions as part of future code releases, so they get deployed with application patches that might change the recovery action.
  • Ensure precise details of the failing component are recorded, including the call stack
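The retry and logging decisions above can be sketched as a small wrapper: retry a bounded number of times on a timeout, and if the operation still fails, log a uniquely numbered error with the call stack before raising it to the caller. This is a minimal illustration in Python rather than the C# used by Sirius, and it writes to a standard logger rather than the Windows Event Log. The error number `ERR_OPERATION_TIMEOUT`, the logger name and the default delays are invented for the example.

```python
import logging
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
log = logging.getLogger("app.errors")

# Hypothetical error number; a real scheme would come from the development standards.
ERR_OPERATION_TIMEOUT = 4101


def call_with_retries(operation: Callable[[], T],
                      retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
                      attempts: int = 3,
                      delay_seconds: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying on the given exceptions with a growing delay."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    name = getattr(operation, "__name__", repr(operation))
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on as exc:
            if attempt == attempts:
                # exc_info=True appends the call stack to the log entry.
                log.error("ERR%d %s failed after %d attempts: %s",
                          ERR_OPERATION_TIMEOUT, name, attempts, exc,
                          exc_info=True)
                raise  # let the calling process fail gracefully
            log.warning("ERR%d %s attempt %d of %d failed, retrying: %s",
                        ERR_OPERATION_TIMEOUT, name, attempt, attempts, exc)
            sleep(delay_seconds * attempt)  # wait longer before each retry
    raise RuntimeError("not reached: the loop always returns or raises")
```

Keeping the error number constant across the warning and the final error lets a monitoring tool match on it, while the final `raise` leaves the decision about graceful failure with the caller.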

Steps to take in the application
• In the case of Windows Services especially, build in support for using WMI to monitor the service.
• Over and above any monitoring tool output, consider what reports you can provide Service Management with that will give early warning of problems such as performance degradation.
• In addition to any application specific tables you can analyse, some other great sources of information come for ‘free’:
  • IIS logs: load to a database every 60 seconds via a SQLAgent job and the Logparser tool to get a picture of interactive page performance
  • SQLServer 2005 Management Views for query performance and resource utilisation
  • Infrastructure performance data from Perfmon or WMI calls to show hotspots as they occur
• These types of reports tell you about the application, but they also tell you about how your users make use of your application. This feeds into planning for infrastructure, support and enhancements.
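To show what the IIS logs offer, the sketch below reads W3C extended format log lines and collects the `time-taken` values (milliseconds) per page. It is a plain Python stand-in for the Logparser and SQLAgent load described above, not a reproduction of it, and it assumes the `cs-uri-stem` and `time-taken` fields are enabled in the site's logging settings. The sample log lines and page names are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def time_taken_by_page(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Group time-taken values (ms) by lower-cased cs-uri-stem."""
    fields: List[str] = []
    timings: Dict[str, List[int]] = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The #Fields directive names the columns of the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#") or not fields:
            continue  # other directives, or data before any #Fields line
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed rows
        row = dict(zip(fields, values))
        page = row.get("cs-uri-stem")
        taken = row.get("time-taken", "")
        if page and taken.isdigit():
            timings[page.lower()].append(int(taken))
    return dict(timings)


if __name__ == "__main__":
    sample = [  # invented sample data
        "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
        "2008-01-15 09:00:01 GET /Holder/Search.aspx 200 180",
        "2008-01-15 09:00:02 GET /Holder/Search.aspx 200 420",
        "2008-01-15 09:00:03 POST /Transfer/Submit.aspx 200 950",
    ]
    for page, samples in time_taken_by_page(sample).items():
        print(page, samples)
```

Reading the column order from the `#Fields` directive, rather than hard-coding positions, keeps the parser working if the logged fields are changed later.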

Sirius status reports – End user performance
An analysis of the IIS weblogs from each webserver, imported into a database and displayed via Reporting Services

Sirius status reports – End user performance
Combining the interactive response with a graph that shows background processes allows performance dips to be correlated with the tasks that may be causing them, and hence allows better scheduling
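The correlation described here can also be approximated in code: take the average interactive response per minute, pick out the slow minutes, and count which background jobs were running at the time. The sketch below is one possible approach under assumed inputs; the `JobRun` type, the function name, the threshold and the job names used in testing are all hypothetical, and the real report is a Reporting Services graph rather than a script.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class JobRun:
    """One execution of a background job (hypothetical structure)."""
    name: str
    start: datetime
    end: datetime


def jobs_during_slow_minutes(avg_ms_by_minute: Dict[datetime, float],
                             job_runs: List[JobRun],
                             slow_threshold_ms: float) -> Dict[str, int]:
    """Count, per job name, the slow minutes that fell inside one of its runs."""
    counts: Dict[str, int] = {}
    for minute, avg_ms in avg_ms_by_minute.items():
        if avg_ms <= slow_threshold_ms:
            continue  # this minute met expectations
        for run in job_runs:
            if run.start <= minute <= run.end:
                counts[run.name] = counts.get(run.name, 0) + 1
    return counts
```

A job that appears repeatedly in the result is a candidate for rescheduling, though overlap alone does not prove the job caused the slowdown.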

Sirius status reports – Performance by Transaction
Analysis of performance by page name is used to highlight those pages that fall outside performance expectations, and allows development resource to be prioritised to tune those pages. This report supports drilling down through multiple levels to see specific details
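As an illustration of flagging pages that fall outside expectations, the sketch below takes response times per page and returns those whose 95th percentile exceeds a target, worst first. The choice of the 95th percentile, the nearest-rank method and the function names are assumptions for this example; the source does not say which statistic the report uses.

```python
import math
from typing import Dict, List, Tuple


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def pages_outside_expectation(timings_ms: Dict[str, List[int]],
                              expected_ms: int,
                              pct: float = 95.0) -> List[Tuple[str, int]]:
    """Return (page, percentile value) pairs above expected_ms, slowest first."""
    measured = [
        (page, percentile(samples, pct))
        for page, samples in timings_ms.items()
        if samples  # ignore pages with no samples
    ]
    offenders = [(page, value) for page, value in measured if value > expected_ms]
    return sorted(offenders, key=lambda item: item[1], reverse=True)
```

Sorting the worst pages to the top mirrors the purpose of the report: it gives a ready-made priority order for tuning work.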

Final Thoughts
• Plan for the application failing in the same way that we already plan for hardware / networks failing. Get a framework for error management in place and document the ‘big’ scenarios; you won’t catch all of the smaller ones in design. Then tune the error management process during testing.
• Architect-in the needs of the support teams who will have to diagnose and fix application failures. If the failure event does not record the information they need, the time to rectify is greatly extended.
• Before you start architecting the solution, ensure you understand what your users expect you to achieve in terms of availability and performance, and all the consequences of not achieving these requirements
• Share your findings about the application usage with the business end users. This can help them change their work patterns, process flows etc. to maximise the system’s potential

References
• Systems Centre
  • Home page Http://www.microsoft.com/systemcenter/operationsmanager/en/us/default.aspx
• SQLServer Reporting
  • IIS Reports starter pack http://www.microsoft.com/downloads/details.aspx?FamilyID=2805D337-14C7-40E3-820B-E7EE653C68C0&displaylang=en
• Contact details
  • Mike.Jolliffe@Equiniti.com
• Shareview, the Shareholder & Investor portal
  • http://www.Shareview.co.uk
