
Metrics in Risk Determination for Large-Scale Distributed Systems Maintenance







  1. Metrics in Risk Determination for Large-Scale Distributed Systems Maintenance Maureen Ann Raley Advisor: Letha Hughes Etzkorn University of Alabama in Huntsville Computer Science Department Telephone: 703-325-3510 University of Alabama in Huntsville Maraley@comcast.net Telephone: 256-842-6291 letzkorn@cs.uah.edu

  2. Nationwide Computer System Upgrade • Upgrade networked computers, distributed throughout the continental US • Systems consisted of a mix of mainframe computers and client/server systems used in a financial environment • Lessons-learned survey concentrated on the client computers or workstations (no mainframes or servers) • Data entry centers • Computer centers (in-house software developed and maintained) • Field offices (operations)

  3. Nationwide Computer System Upgrade Condition: Firm deadline Non-compliant systems at the deadline would be either: • Removed from operational use and transferred/disposed, or • Disconnected from the network and accorded strictly stand-alone status • No transfer of data other than hardcopy printout • “Sneaker-net” transfer by floppy disk, modem, or other electronic means prohibited

  4. Nationwide Computer System Upgrade Purpose of the upgrade • Standardize the software and hardware throughout the agency • Modernize the computers and software components • Dispose of obsolete components • Introduce a layered operating system with administrative and user privileges

  5. Nationwide Computer System Upgrade Expected benefits • More conducive to configuration management controls • Easier to maintain and upgrade • Barriers to prevent unauthorized software (games, favorite programs, instant messaging, peer-to-peer networking) • More secure and less susceptible to external or internal hacking

  6. Nationwide Computer System Upgrade Software • Predominantly COTS • Business automation application suites • Operating systems • Network • Individual workstations • But also . . . programs developed in-house • data entry and data analysis • And . . . in-house customized COTS-based systems

  7. Nationwide Computer System Upgrade Hardware • Commercial-grade desktop computers • No consistency as to • Age • Manufacturer • RAM, HD, or I/O capabilities

  8. Nationwide Computer System Upgrade System Priorities • Availability • Data integrity • Performance • Security

  9. Nationwide Computer System Upgrade Lessons Learned 94 personnel interviewed at 12 different locations • Data entry clerks • Inventory control specialists • Information technology personnel • Middle and upper management • Secretaries and administrative assistants.

  10. Nationwide Computer System Upgrade Lessons Learned Questionnaire • What were the biggest obstacles you found in achieving nationwide compliance? • Based on what you have learned during the nationwide compliance effort, what would you want to see done differently if we were starting the effort now? • How is training being handled in your organization? Are there adequate resources? Is time allocated for training?

  11. Nationwide Computer System Upgrade The responses note: • Management changes in direction • Lack of understanding of the side effects of management directives • Need for bottom-up input into management decisions • Status assessment based on an inaccurate inventory database • Understaffing • Mismatched arrival of hardware and software • Unrealistic compatibility testing • Loss of functionality in the replacement software • Inadequate training

  12. Nationwide Computer System Upgrade 1. & 2. Change in management direction and inadequate side effect analysis caused rework • Upper management would direct paths to the goals that were not always the most reasonable or efficient. • Management failed to identify all the side effects of their direction and the implications for the end users. • Management would also direct action, then recall that directive, and direct a different approach. This caused frustration and rework.

  13. Nationwide Computer System Upgrade 3. Lack of bottom-up input from users • The project office held weekly conference calls on nationwide status and twice-weekly conference calls on issues, as well as proactive, engaged working groups consisting of the national office and field coordinators, but . . . • Some of these issues were due to management issuing directives without field input or vetting. • Sometimes field-level management made the decisions for their end users without soliciting input. • Ensuring a “bottom-up” comment process would have helped to mitigate some of the rework.

  14. Nationwide Computer System Upgrade 1., 2. & 3. A process to vet upper management decisions could help to reduce the amount of rework by the end users. • Choose a set of end users at diverse locales to “beta test” the management directives. • Distribute proposed directives for comment and ensure that both headquarters personnel and field users reviewed these proposals. • Direct field management to allot time to end users to ensure a meaningful review.

  15. Nationwide Computer System Upgrade 4. Inaccurate inventory database • The IS inventory database was queried weekly to provide reports on the HW and SW compliance status. • Inaccuracies were largely due to the small, mobile nature of the computers and the large number of them to track. • The effort to correct the IS database was being handled by another office, but its schedule did not coincide with the project office milestones. • As the inventory database became more accurate, a realistic estimate of the work done and yet to be done became clearer as the project deadline approached. • Modern inventory control methods, such as radio frequency ID (RFID) tagging, could be used for more accurate tracking of small, easily movable components.

  16. Nationwide Computer System Upgrade 5. Excessive workload, staffing shortages • Field coordinators worked the upgrade effort in addition to their normal workload. • Because of this, some field coordinators were not as dedicated to the upgrade effort as others • Additional staff (skilled support) was used at HQ • Management should also augment staffing levels in the field.

  17. Nationwide Computer System Upgrade 6. Mismatched arrival of replacement HW/SW • Delays were as long as several months, due to vendor supply shortages • Inadequate storage space when new HW platforms arrived first • Old HW platforms could not be removed without complete replacements • Incompatibilities or upgrade violations: new SW on old HW; new HW with old SW • Installation of new SW on old, incompatible platforms caused system failures and rework to reformat and reinstall the old SW.

  18. Nationwide Computer System Upgrade 7. Lack of realistic compatibility testing • Compatibility testing was not independent and had no oversight: the division overseeing the upgrade did the compatibility testing. • Testing was done on “idealized” machines with the new COTS software suite at HQ, not in the field, not by field end-users, not with the in-house field applications, and not under field workload conditions. • Most of the software was compatible, BUT some critical incompatibilities existed with the field applications not used at HQ. • Testing should have followed software engineering “best practices”: independent tests, oversight, and operational conditions.

  19. Nationwide Computer System Upgrade 8. Loss of functionality in the new software • Transition from one office automation software suite (WordPerfect, Lotus, Oracle, email) to another (MS Word, Excel, Access, Outlook). • Data-entry transition: line-entry to a graphical, mouse-driven system. • In-house developed replacement programs incurred the "loss of functionality" complaint less often than COTS software. • The in-house programming team came closer to maintaining the basic functions of the old in-house software. • "Not all the right features" and "too many unneeded features" issues are common to COTS-based systems (not tailored to a specific task, but instead developed for mass-market sales). • A more rigorous COTS selection process could mitigate these inadequacies.

  20. Nationwide Computer System Upgrade 9. Inadequate training • Usually did not address the new skill sets needed. • Not always tailored for the different skill levels. • Frequently expected to be self-taught. • Expected to occur in addition to the normal workload.

  21. Nationwide Computer System Upgrade 9. Inadequate training • The most successful training: a combination of an introductory overview followed by hands-on classroom training with a knowledgeable and motivated instructor. • Unsuccessful training: • Sometimes none at all • Unmotivated, poorly trained instructors • Self-study -- a CD-ROM at one's desktop (OK if computer literate, overwhelming if not) • Information technology and computer specialists adapted with ease. • Non-computer-oriented people, such as secretaries, administrative assistants, and data entry clerks, had much more trouble.

  22. Nationwide Computer System Upgrade Benefits • Uniform nationwide software and hardware • More effective configuration management • Easier maintenance and upgrade installation • Cyclical replacement or upgrade of hardware and software is easier to plan and effect • Layered operating systems (administrative and user privileges) inhibit installation of ad hoc end-user software • End Game Assessment has applications for recovery from hostile information system attacks

  23. Observations • Not all risks are known and can be planned for in advance • This project mitigated many of the problems encountered during the transition by: • Consistent monitoring of the hardware and software components’ compliance status using inventory data • An “issues” database, addressed weekly • A “risks” database for issues with a probability of occurrence and negative consequences, addressed every two weeks

  24. Possibilities for Future Research • Quantification of the loss due to rework and the effect of double or treble responsibilities on the lower level staff (data not available) • Further investigation on the consequences of effective and ineffective management decisions • Development of metrics for risk analysis

  25. Nationwide Computer System Upgrade Actual Compliance Growth from Inventory Data

  26. Metrics Considerations • The determination of an appropriate set of metrics to analyze risk during the maintenance phase of a distributed system upgrade • Standard (actual-data) metrics as well as normalized metrics • Normalizing would deal with the varying number of devices (with this sample data, the total number of units in the system changes as new units are added, old ones are disposed of, and the inventory accuracy grows). By normalizing the metric suite, we can compare distributions of different sizes. • An adaptive sizing model that deals not with the total number of system devices (units), but only with the units modified during a certain period • A period of time, i.e., a week • A time-independent period, or threshold, determined by a number of devices, e.g., 100. Using this model, the metric would be divided not by the actual number of devices in the system, as in the standardized model, but by the number of recently modified devices. • A sliding window might be used as well. • A history complexity metric for each location (component) to assess the effect of the complexity of the period. This could help determine whether risk increases during bursty or chaotic periods and during periods of high activity. • Validation: statistical analysis of the actual compliance growth results (from the inventory data) vs. the results of the alternative metrics.
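The slide above describes the candidate metric variants only in prose. The following minimal sketch (Python, with entirely hypothetical device counts, snapshot values, and function names, since the project's inventory schema is not given in this talk) illustrates how standard, adaptive-sizing, and sliding-window versions of a compliance metric might be computed and compared.

```python
# Illustrative sketch only: all numbers and names below are made up.

def standard_compliance(compliant_total: int, inventory_total: int) -> float:
    """Standard metric: compliant devices divided by every device in the inventory."""
    return compliant_total / inventory_total if inventory_total else 0.0

def adaptive_compliance(brought_compliant: int, modified_in_period: int) -> float:
    """Adaptive sizing model: divide by only the devices modified in the current
    period (a week, or a fixed threshold such as the last 100 devices)."""
    return brought_compliant / modified_in_period if modified_in_period else 0.0

def sliding_window_compliance(events, window_size: int = 100):
    """Sliding-window variant: compliance over the most recent `window_size`
    modified devices; `events` is a sequence of booleans (True = made compliant)."""
    window = []
    for became_compliant in events:
        window.append(became_compliant)
        if len(window) > window_size:
            window.pop(0)
        yield sum(window) / len(window)

# Hypothetical weekly snapshots:
# (compliant devices, devices in inventory, devices modified, of those made compliant)
weekly = [(4200, 11000, 350, 300), (4650, 11200, 500, 450)]
for compliant, total, modified, modified_ok in weekly:
    print(round(standard_compliance(compliant, total), 3),
          round(adaptive_compliance(modified_ok, modified), 3))

# Sliding window over a short stream of modified devices (window of 3)
print(list(sliding_window_compliance([True, True, False, True], window_size=3)))
```

The point of the comparison is that the standard metric is diluted by the whole (and changing) inventory, while the adaptive and sliding-window variants track only recently touched devices, so they react faster to changes in the work actually being done.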

  27. One Possibility - Entropy Metrics • Essentials of the mathematical theory of information - “A Mathematical Theory of Communication,” Claude Shannon (1948) • Established fundamental bounds on the performance of communication systems in the presence of noise • Exerted an enormous influence on many disciplines -- communications, biology, mathematics, physics • Entropy has been defined in terms of the information content of software and used to measure code complexity • Also has been used effectively as an indicator for reusability

  28. Entropy Metrics Have been used successfully to: • Measure SW quality during SW development • Used directed graphs to model a SW system and measure coupling, cohesion, size, length, and complexity at the module level (Allen, Khoshgoftaar, et al., 1996 - present) • Measured complexity in object-oriented design (Davis, Etzkorn, Bansiya, Gholston, et al., 1999 - present) • Measure SW complexity during SW maintenance • Based on message flow between modules, extended to COTS (Chapin, 1988) • Measure cost growth during large-scale system development • Queried experts to develop a cost model validated to 3% accuracy versus the 300% predicted (Martin, Lenz, Glover, et al., 1981) • Measure SW complexity using process entropy (not code) • Theorized that the number of times a module was modified adversely affected code complexity; validated a 13%-45% improvement (Hassan & Holt, 2003)
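As a concrete illustration of the last bullet, here is a minimal sketch in the spirit of Hassan and Holt's process-entropy idea: treat the share of modifications made at each location (or component) during a period as a probability distribution and compute its Shannon entropy. The location names and modification counts below are invented for illustration; they are not project data.

```python
import math

def period_entropy(mod_counts: dict) -> float:
    """Shannon entropy (bits) of one period's modification distribution.
    Higher entropy: changes scattered across many locations (bursty/chaotic period).
    Lower entropy: changes concentrated in a few locations."""
    total = sum(mod_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in mod_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# Hypothetical weekly modification counts per location
print(period_entropy({"SC1": 40, "SC2": 35, "SC3": 25}))  # spread out: ~1.56 bits
print(period_entropy({"SC1": 95, "SC2": 3, "SC3": 2}))    # concentrated: ~0.33 bits
```

Tracking this value per period could support the "history complexity metric" mentioned on slide 26, by flagging periods whose change activity is unusually dispersed.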

  29. Possible Entropy Metrics - Division 1 [Figure: four candidate metrics (Compliance Growth, Devices Needing Modification, Devices Isolated to Standalone, Inventory Stability) charted for locations SC1-SC10 and R1-R4]

  30. Backup Slides

  31. Possible Entropy Metrics - Division 2 [Figure: the same four metrics (Compliance Growth, Devices Needing Modification, Devices Isolated to Standalone, Inventory Stability) charted for locations CC1-CC3, HQ1-HQ7, NC1-NC3, and Aux]

  32. Shannon’s Equation C.E. Shannon, in “A Mathematical Theory of Communication” (1948), proposed to measure the amount of uncertainty, or entropy, in a distribution P = (p_1, p_2, ..., p_n) by the following equation: $H_n(P) = -\sum_{k=1}^{n} p_k \log_2 p_k$, where $p_k \ge 0$ for $k \in \{1, 2, \ldots, n\}$ and $\sum_{k=1}^{n} p_k = 1$.
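A minimal numeric sketch of the equation (the probability vectors below are arbitrary examples, not project data): entropy is largest, log2 n bits, for a uniform distribution over n outcomes, and zero when all of the probability mass falls on a single outcome.

```python
import math

def shannon_entropy(p) -> float:
    """H_n(P) = -sum over k of p_k * log2(p_k); zero-probability terms contribute nothing."""
    assert all(pk >= 0 for pk in p) and abs(sum(p) - 1.0) < 1e-9, "P must be a probability distribution"
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes -> 2.0 bits
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))      # no uncertainty -> 0.0 bits
print(shannon_entropy([0.5, 0.25, 0.25]))         # -> 1.5 bits
```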
