
How to Uncover Performance Related Bottlenecks, Optimize and Tune the Full Remedy Stack



  1. How to Uncover Performance Related Bottlenecks, Optimize and Tune the Full Remedy Stack Derek Roberts / Armen Avedisijan Director / Consultant Scapa Technologies

  2. Agenda • Introduction • My background • Questions to the audience regarding their performance issues • Objectives of this talk • Questions to the audience – anyone with performance problems they wish to discuss • Live demonstration of the Scapa 3-step process: Capture, Process, Go

  3. Agenda • Note: after the demonstration we will move on to the analysis portion of this presentation. • In order to uncover performance-related bottlenecks and optimize and tune the full Remedy stack, in addition to using the Scapa 3-step process, you need to know • what can go wrong with systems • what to test (test cases) • how to interpret results. • To make full use of the available time and to fast-track you in becoming a testing expert, I will attempt to teach you as much as possible in the allotted time. • A full transcript of the speaker notes, covering all subjects listed in this presentation, can be downloaded from http://www.scapatech.com/about/events/wwrug-2013/

  4. Agenda (Continued) • A few real-life performance horror stories • Learn from others’ mistakes; what the horror stories have in common • The complexity of IT systems • The tuning and optimization challenge • Tuning and optimization: a misunderstanding • Root cause analysis in a multi-user/computational system.

  5. Agenda (Continued) • Multi-user/computational system scenarios • Root cause analysis vs. interference analysis • Scalability – the characteristics • Edge of capacity and root cause • Let’s talk about interference • Queuing • Caching

  6. Agenda (Continued) • Lower resource efficiency • Queuing types • Convoy effect in a computer system (transaction processing) • Locking and blocking • Caching

  7. Agenda (Continued) • Symptoms versus causes when diagnosing load performance problems • The importance of hunches and developing testable hypotheses • Wrap up with a scenario: • How to test the load balancing in a Remedy Mid-Tier deployment.

  8. Objectives and Results • Main objective • To demonstrate the simplicity of Scapa’s “Capture, Process, Go” process • As this is so simple, we should get it over with quickly, but I want to keep you for the full 50 minutes, so rather than this being an infomercial, I’ll try to convey some fundamentals with regard to successful performance testing and optimization of Remedy systems. • Comprehensive guides exist (they are good – use them) • Rather than me recommending settings, good guides already exist: • Tuning the Web tier: https://docs.bmc.com/docs/display/public/ars8000/Tuning+the+mid+tier • Tuning the entire stack: https://docs.bmc.com/docs/display/public/ars8000/Performance+tuning+for+BSM

  9. Objectives and Results • Objectives • To observe the simplicity and speed of test-case creation in Scapa TPP, via the Scapa 3-step process of 'capture, process and go'. • To understand the risks of skipping performance testing and why it is so often skipped. • To gain an appreciation of the types of bottlenecks, scalability and capacity, and their impact on performance characteristics. • To understand how to test the effectiveness of Remedy load balancing (see the sketch below). • Note: there was a change in v7.6.04 in the way load balancing is configured on the Mid-Tier.
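As a rough illustration of what "testing the effectiveness of load balancing" means in practice, the sketch below (plain Python, not Scapa TPP) fires a batch of requests at a balanced endpoint and counts how many each backend served. The URL and the identifying response header X-Served-By are hypothetical placeholders; your balancer or Mid-Tier servers may expose a different marker.

    # Minimal sketch: check how evenly a load balancer spreads requests.
    # Assumes each backend can be identified from a response header; the
    # header name "X-Served-By" and the URL are hypothetical placeholders.
    from collections import Counter
    import urllib.request

    URL = "http://midtier.example.com/arsys/home"  # placeholder balanced endpoint
    counts = Counter()

    for _ in range(100):
        with urllib.request.urlopen(URL) as resp:  # one request through the balancer
            backend = resp.headers.get("X-Served-By", "unknown")
        counts[backend] += 1

    for backend, n in counts.most_common():
        print(f"{backend}: {n} of 100 requests")

A heavily skewed count (or a single backend serving everything) is the symptom you are testing for; a real test would repeat this under load and with session affinity in play.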

  10. Objectives and Results • Results • Demonstration of Scapa’s “Capture, Process, Go” • Performance testing knowledge transfer. • Skills developed • An understanding that quantitative measurements are critical in identifying and resolving performance bottlenecks and in tuning/optimization efforts, AND that the best way to achieve results quickly and effectively is to EXCLUDE real users from tests; automation of virtual users is key. • An understanding of how to run a Scapa “Capture, Process, Go” test • A basic understanding of how to interpret results from Scapa TPP • Via real-life examples • A new appreciation of the performance testing process. • A basic understanding of how you can implement performance testing to resolve existing issues and prevent performance issues.

  11. Introduction • Full title: • How to uncover performance-related bottlenecks, optimize and tune the full Remedy stack (including the web stack), for lightning performance and maximal end-user experience, by using Scapa Technologies’ simple 3-step process of 'capture, process and go'.

  12. Introduction • Full description: • The complexity of tuning and optimizing disparate Remedy Mid-Tier systems via capacity, load, stress and performance testing has been annihilated with Scapa's Load, Stress and Performance Testing Platform (Scapa TPP). Scapa’s 'capture, process and go' opens the door for anyone, even non-‘techies', to build and run load, capacity and stress tests in minutes rather than hours or days (as is the norm with other tools), and with no prior Remedy experience or skills required! A Scapa consultant will demonstrate the simplicity of these 3 steps, 'capture, process and go', and introduce you to the characteristics of performance-related issues and bottlenecks that will adversely impact end users’ experience.

  13. Introduction • Scapa Technologies – what we do • We are a software and consulting house, specialising in • testing and monitoring the performance, scalability, reliability and capacity of IT systems.

  14. Introduction (Cont.) • Derek Roberts – his consultancy role at Scapa • Senior consultant and director at Scapa Technologies • Over ten years’ experience in the testing field • I mostly find myself testing systems and applications deployed via Microsoft Terminal Server, Microsoft Remote Desktop Server, Citrix XenApp/XenDesktop and VMware View. • I do this by automating user interactions with the real application GUI (for any application, including the Remedy mid-tier). I will cover how to automate the Remedy mid-tier via the GUI in another talk. • I also undertake Remedy testing via the user tool and the Remedy mid-tier through their appropriate protocols.

  15. Introduction • Today, we will demonstrate how to use Scapa’s 3 steps, 'capture, process and go', to record and play back user interactions at the Remedy Mid-Tier using the HTTP protocol, for performance, scalability, reliability and capacity testing. • Before I do that… • Questions to the audience – anyone in the room who has performance issues with their Remedy Mid-Tier system at the moment? • Describe the symptom you are seeing on your system.
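For readers of this transcript: Scapa TPP performs the capture and playback itself, but as a minimal sketch of what HTTP-level playback amounts to, the snippet below replays a "captured" request sequence and times each step. The URLs are hypothetical placeholders for whatever the capture recorded.

    # Rough illustration of HTTP-level playback: replay a captured sequence
    # of requests and time each step. Scapa TPP automates the capture and
    # processing; the URLs below are hypothetical placeholders.
    import time
    import urllib.request

    captured_steps = [
        "http://midtier.example.com/arsys/shared/login.jsp",
        "http://midtier.example.com/arsys/forms/server/SomeForm",
    ]

    for url in captured_steps:
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()  # drain the body, as a real browser would
        elapsed = time.perf_counter() - start
        print(f"{elapsed:.3f}s  {url}")

A real playback also has to carry session cookies, hidden form fields and dynamic tokens between steps; that bookkeeping is exactly what the "process" step automates.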

  16. Live Demo • Live demo time • Disclaimer: it’s a live demo and Murphy may show up. As such, a ‘canned’ demonstration is available from: • http://www.scapatech.com/about/events/wwrug-2013/ • Capture • Armen will demonstrate the capture. We now need to make this captured user interaction re-playable. We do this with the Process step. • Process • Do we have a volunteer to assist? • Don’t worry, Armen will talk you through the process. • Let’s discuss how this differs from other testing tools.

  17. Live Demo • GO

  18. A few real-life performance horror stories from me • The 24-hour system outage affecting over 8,000 employees across two countries – not once but twice within a one-month period. • Estimated cost of failure: in the low millions of dollars • The highly customized Remedy implementation (with a customized web frontend) that stopped working when 15 additional users were added to the system. • Estimated cost of failure: in the low hundreds of thousands of dollars • The customer that leased their customized call centre software at an astonishing cost of $15 million per year, yet the system would fail 3 out of 7 days. • Estimated cost of failure: over $1 million

  19. Learn from Others’ Mistakes • Learn from others’ mistakes; what the horror stories have in common • Only performed manual functional testing with a handful of users. • Sole reliance on rule-of-thumb and vendor-supplied optimization recommendations and tuning guides as a replacement for workload-related performance testing. • No stress testing was performed; only low, fixed-load tests were used, up to a pre-set estimate provided by the customer or a business analyst. • This is a common mistake • Relied on software-vendor-supplied statements regarding the capacity and performance of their solutions. • The vendor will be unaware of your use cases, customization or environment when they test the out-of-the-box solution.

  20. Learn from Others’ Mistakes (continued) • Learn from others’ mistakes; what the horror stories have in common (continued). • Relied on whistle/stopwatch tests • Perceived delay to project timelines • Perceived high cost of undertaking performance testing • Project managers’ milestone bonus pay-outs were paid on delivered functionality rather than on system scalability and capacity criteria. • Simply put, the customers were not aware of potential scalability issues and performance problems until it all went wrong.

  21. The Complexity of IT Systems • The complexity of IT systems, not just Remedy systems, means that the propensity for capacity- and performance-related issues is high. • For most systems, the application mix will include databases, third-party plugins, different desktop delivery methods, server consolidation via virtualization, etc. • Add to this the human factor (“to err is human”) and Murphy’s law (“anything that can go wrong, will”) and the result is usually potential performance, reliability, scalability and capacity issues.

  22. The Tuning and Optimization Challenge • How do you ensure that the end-user experience will be adequate before rolling out a Remedy ITSM system to production? • How do you resolve existing performance issues within a production system? • Do you know for sure, and can you prove, that applying recommended tuning and best-practice configurations will work or resolve a performance-related issue on your system?

  23. Tuning and Optimization: A Misunderstanding • There is sometimes a misunderstanding that if you have a performance issue, or want to avoid performance issues on a system, then you simply apply the configuration settings set out in rule-of-thumb, best-practice/tuning/optimization guides. • Although you might get lucky with that approach (if you have a performance issue), it is a blind approach based on work carried out on a reference system that will most probably differ significantly from your own. • https://docs.bmc.com/docs/display/public/ars8000/Tuning+the+mid+tier • NOTE: configuration settings WILL have a dramatic effect on the performance of the mid-tier. With Scapa you can quantify the improvement from any changes that you make, and target changes depending on what the system is ‘telling you’ via the test results.

  24. Tuning and Optimization: A Misunderstanding (cont.) • Another misunderstanding is that application performance tuning is conducted at the singular level • where you optimize each system transaction in isolation from the others • and that the analysis of such singular tuning efforts is all that is required to fully optimize the system for maximum performance.

  25. Tuning and Optimization: A Misunderstanding (cont.) • This is not the case. Most systems, including Remedy Mid-Tier implementations, are multi-user workload systems. • Multi-user in this context refers to any computational environment where multiple independent computations execute concurrently and compete for shared resources, not just the more general sense of more than one person using the system at the same time. • Unless you run multi-user/computational scenarios, you can miss performance-related issues in the system which will eventually show up in production.

  26. Root Cause Analysis in a Multi-User/Computational System • Root-cause analysis is often mentioned in a performance-issue setting, but it can be difficult to perform • sometimes impossible, and • not always necessary to resolve performance-related issues. • It can also take considerable time; you can end up in an infinite loop, not knowing when to stop looking for the root cause.

  27. Multi-User/Computational System Scenarios • Let’s consider a scenario where you started a test, simulating different concurrent users performing similar, but not identical, business operations. • The system performed OK for a limited number of users, but failed the operational acceptance capacity check. • In other words, the system bombed with far fewer concurrent users than are expected in the production system, despite the hardware being “adequate”.

  28. Multi-User/Computational System Scenarios (cont.) • When running each test case in isolation we can see that the business operations that make up the load are not excessively greedy for system resources. • In fact, we see that each single test case scales and performs adequately, so what gives with the poor performance when the business transactions run concurrently?

  29. Multi-User/Computational System Scenarios (cont.) • Why didn’t the application performance scale satisfactorily when the number of users increased?

  30. Multi-User/Computational System Scenarios (cont.) • In this scenario the performance problem is NOT caused by the consumption of resources by the business operations. Rather, the performance problem is caused by the pattern with which concurrently-executing business operations interfere with each other.
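To make this concrete for readers of the transcript, here is a toy simulation (not from the talk) in which a lock stands in for any shared resource – a database row, a cache, a thread pool. Each operation is cheap in isolation, yet mean latency degrades badly once the operations run concurrently, purely through interference:

    # Toy interference simulation: an operation that is fast alone becomes
    # slow under concurrency because operations contend for a shared lock.
    import statistics
    import threading
    import time

    shared_resource = threading.Lock()
    latencies = []
    lat_guard = threading.Lock()

    def business_op():
        start = time.perf_counter()
        with shared_resource:      # contended section
            time.sleep(0.01)       # "work" done while holding the resource
        with lat_guard:
            latencies.append(time.perf_counter() - start)

    def run(n_threads, n_ops=20):
        latencies.clear()
        def worker():
            for _ in range(n_ops):
                business_op()
        threads = [threading.Thread(target=worker) for _ in range(n_threads)]
        for t in threads: t.start()
        for t in threads: t.join()
        return statistics.mean(latencies)

    print(f"1 user : {run(1):.3f}s mean latency")
    print(f"8 users: {run(8):.3f}s mean latency")  # same op, far slower

Note that per-operation resource consumption never changed; only the waiting did – which is why single-user profiling cannot reveal this class of problem.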

  31. Root Cause Analysis Versus Interference Analysis • Scaling up the hardware by increasing CPU, memory capacity etc., or scaling out by throwing more servers into the mix, will NOT increase capacity in this instance (interference). • Remember those 3 examples I gave earlier? • The 24-hour system outage affecting over 8,000 employees across two countries – not once but twice within a one-month period. • The highly customized Remedy implementation (with a customized web frontend) that had 150 engineers complaining about their daily work orders. • This is the example associated with the software rental, at an astonishing cost of $15 million per year, that failed to scale.

  32. Root Cause Analysis Versus Interference Analysis • In the first example, the outage was traced to a particularly nasty bug via root-cause analysis. This bug ultimately caused the catastrophic failure of the system; it was not related to interference, and as such root-cause analysis was not that difficult. • In the second and third examples, scaling up or out would NOT have resolved the performance issues on these systems. The second example required root-cause analysis in order to resolve the issue. • The third example is interesting in that root-cause analysis was not only impossible, but also not required.

  33. Root Cause Analysis Versus Interference Analysis • The third example is interesting in that root-cause analysis was not only impossible, but also not required (by the tester – me). • I had approximately 10 scripted test scenarios. • If I ran the test scenarios independently of each other, the system’s scalability and performance were fantastic. • The vendor had run similar tests before delivering their solution to customers, so it was no surprise that when it came to the blame game, they were confident that the issues experienced by their customer had nothing to do with their software. • How wrong they were!

  34. Root Cause Analysis Versus Interference Analysis • When I ran a multi-user/computational system scenario, the system would quickly fail. • What we have here is some sort of interference between the business operations, and with this type of problem it is usually impossible to measure directly and find the root cause. • This is why analysing performance problems with multi-user/computational workloads can be trickier than analysing transactions that perform similar, but not identical, business operations.

  35. Root Cause Analysis Versus Interference Analysis • In this third example, I traced the cause of the interference to two particular user transactions, which corresponded to significant and frequent deadlocks on the SQL database – more than 300 per second. • The deadlocks were not the root cause, but fix the deadlocks and the problem would be resolved. • As this customer did not own the software (they rented it for $15 million per year), I was unable to look at the SQL source code.
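A deadlock rate like that can be observed without access to the source. As a hedged sketch – assuming SQL Server, the third-party pyodbc package, and a placeholder connection string – you can sample the server's cumulative deadlock counter twice while the load scenario runs and derive the rate:

    # Sketch: estimate the SQL Server deadlock rate by sampling the
    # cumulative "Number of Deadlocks/sec" counter twice. The connection
    # string is a placeholder for your environment.
    import time
    import pyodbc

    CONN = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;Trusted_Connection=yes"
    QUERY = """
    SELECT cntr_value FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Number of Deadlocks/sec'
      AND instance_name = '_Total'
    """

    def deadlock_total(cur):
        cur.execute(QUERY)
        return cur.fetchone()[0]  # cumulative count since server start

    with pyodbc.connect(CONN) as conn:
        cur = conn.cursor()
        before = deadlock_total(cur)
        time.sleep(10)            # sample while the load scenario runs
        after = deadlock_total(cur)
        print(f"deadlocks/sec ~= {(after - before) / 10:.1f}")

Pairing a reading like this with a load scenario that reproduces the failure on demand is exactly the kind of evidence that got the vendor in this story to act.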

  36. Root Cause Analysis Versus Interference Analysis • However, the symptom of deadlocks, and the fact that I could run a scenario to create the issue on demand, was sufficient for a patch to be supplied by the vendor. • Ultimately, in this instance, the vendor still had to do the root-cause analysis in order to fix the issue, but the two simple transactions that interfered with each other dramatically simplified the investigation and pinpointed the precise area requiring attention. A patch was delivered within a week. • They could target their developer resources at the issue and fix it quickly.

  37. Root Cause Analysis Versus Interference Analysis • A lot of the time you develop a hypothesis based on what the system is ‘telling you’ via the results – in other words, you often treat the symptom rather than the underlying cause, because finding the underlying cause can be time-consuming or impossible. • You then re-run the test after you make a change, to test your hypothesis.

  38. Scalability – the Characteristics • We know from Little’s law (queueing theory) that: • the long-term average number of customers in a stable system, L, is equal to the long-term average effective arrival rate, λ, multiplied by the (Palm-)average time a customer spends in the system, W; or, expressed algebraically: L = λW. • This is why you cannot simply extrapolate assuming linear scalability (see the worked example below).
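A quick worked example for the transcript, with illustrative numbers (not from the talk):

    # Little's law, L = lambda * W, with illustrative numbers.
    arrival_rate = 50       # requests/second (lambda)
    response_time = 0.2     # seconds each request spends in the system (W)
    print(arrival_rate * response_time)      # L = 10.0 requests in flight

    # Under heavier load, queueing inflates W as well as lambda, so L grows
    # faster than linearly -- which is why light-load results cannot simply
    # be scaled up.
    print((2 * arrival_rate) * 0.5)          # lambda doubled, W grew: L = 50.0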

  39. Scalability – the Characteristics

  40. Scalability – the Characteristics I’ve oversimplified this diagram, as it doesn’t look at the relative performance of the different types of business operations – it only shows the response time for a single business transaction.

  41. Scalability – the Characteristics • There comes a point where the decline becomes sharp. At this point, under successive load increases, small increments in the number of concurrent users cause an increasingly large reduction in the proportion of operations completing within an acceptable response time. • If the actual performance declines before the required performance, then you have a scalability, performance and capacity issue.
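A hedged sketch of how you might locate that point empirically: step the number of concurrent virtual users upward and record the mean response time at each level; the level where latency turns sharply upward is the knee. This is plain Python against a placeholder URL, standing in for what a real load tool such as Scapa TPP automates:

    # Step-load sketch: increase concurrent virtual users and record the
    # mean response time at each step to locate the knee of the curve.
    # The URL is a hypothetical placeholder; a real tool would also pace
    # requests and validate responses.
    import statistics
    import threading
    import time
    import urllib.request

    URL = "http://midtier.example.com/arsys/home"

    def one_request():
        start = time.perf_counter()
        with urllib.request.urlopen(URL) as resp:
            resp.read()
        return time.perf_counter() - start

    def step(n_users, ops_per_user=10):
        times, guard = [], threading.Lock()
        def user():
            for _ in range(ops_per_user):
                t = one_request()
                with guard:
                    times.append(t)
        threads = [threading.Thread(target=user) for _ in range(n_users)]
        for t in threads: t.start()
        for t in threads: t.join()
        return statistics.mean(times)

    for users in (1, 5, 10, 20, 40, 80):
        print(f"{users:3d} users: mean {step(users)*1000:.0f} ms")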

  42. Edge of Capacity and Root-Cause

  43. Edge of Capacity and Root-Cause (cont.) • Experience shows that at the edge of capacity (the knee of the performance curve – sometimes referred to as the hockey stick or pinch point of the system) one of the resources will have an overly high utilization rate.

  44. Edge of Capacity and Root-Cause (cont.) • Our next task is to determine which one!

  45. Edge of Capacity and Root-Cause (cont.) • Superficially, this sounds easy: • You just re-run the load performance test with performance monitors running on the various servers in the system under test. • Then, at the load level that causes performance to deteriorate, you spot which resource goes above its critical utilization level.
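As a small sketch of what a "performance monitor running on a server" might look like if you rolled one yourself (assuming the third-party psutil package; real deployments would use the platform's own monitors or Scapa's), sampling once per second so utilisation can later be lined up against the load level:

    # Minimal per-server resource monitor: one CPU/memory sample per second,
    # timestamped so it can be correlated with the load test timeline.
    import time
    import psutil

    print("time\tcpu%\tmem%")
    for _ in range(60):                       # one sample/second for a minute
        cpu = psutil.cpu_percent(interval=1)  # blocks ~1s, system-wide CPU %
        mem = psutil.virtual_memory().percent
        print(f"{time.strftime('%H:%M:%S')}\t{cpu}\t{mem}")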

  46. Edge of Capacity and Root-Cause (cont.) • If you are lucky, this approach might get you the results in the end, but it has at least three significant drawbacks: • When the performance monitor shows a resource ‘going critical’ under load, you may be observing a symptom rather than a cause. • It is highly likely that you will have to repeat the test several times, as one performance issue is resolved and the next one is encountered; you will find repeatability tedious and tricky.

  47. Edge of Capacity and Root-Cause (cont.) • If you are lucky, this approach might get you the results in the end, but it has at least three significant drawbacks (cont.): • The ad hoc nature of the approach means that it will not be easy to produce compelling, documentary evidence of your performance conclusions. • You know the system has adequate CPU and disk resources, so why did it fail the load performance test? • The chances are that the unexpectedly poor performance is a by-product of concurrently-executing business operations (the interference problem).

  48. Let’s Talk About Interference • Let’s talk about interference • Queuing • Even when the utilisation of a resource is well below its critical threshold, queuing can occur. • As more users perform business transactions on the system, queues will grow and will extend the elapsed time of the business operation. • In other words, the response time will increase, and eventually the amount of work the system can perform in a given timeframe will plateau or even reduce.
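The nonlinearity of queueing is worth seeing in numbers. For a single queue with exponential arrivals and service – the textbook M/M/1 model, an assumption for illustration rather than something measured from Remedy – the mean time in system is W = 1/(μ − λ):

    # M/M/1 queueing: mean time in system W = 1 / (mu - lambda).
    # Response time climbs steeply well before utilisation reaches 100%.
    service_rate = 100.0                 # ops/sec the resource can serve (mu)

    for utilisation in (0.5, 0.7, 0.9, 0.95, 0.99):
        arrival_rate = utilisation * service_rate
        w = 1.0 / (service_rate - arrival_rate)
        print(f"{utilisation:.0%} utilised -> {w*1000:.0f} ms per operation")

At 50% utilisation an operation takes 20 ms; at 99% it takes a full second – the resource never looks "critical" on a utilisation graph until well after response times have already blown out.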

  49. Let’s Talk About Interference (cont.) • Let’s talk about interference (cont.) • Caching • Some cached resources are repeatedly accessed by a business operation. • As the caching resource becomes more heavily utilised, the chances of the business operation’s data still being in the cache the next time it is needed are reduced.
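A toy illustration (not from the talk): simulate an LRU cache shared by a growing number of concurrent operations. Once the combined working set exceeds the cache size, the hit rate does not degrade gracefully – with a cyclic access pattern it collapses, one flavour of the step change mentioned on the next slide:

    # Toy LRU cache: hit rate falls off sharply once the combined working
    # set of concurrent operations no longer fits in the cache.
    from collections import OrderedDict

    def hit_rate(cache_size, working_set, accesses=10_000):
        cache, hits = OrderedDict(), 0
        for i in range(accesses):
            key = i % working_set              # cycle through the working set
            if key in cache:
                hits += 1
                cache.move_to_end(key)         # mark as most recently used
            else:
                cache[key] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used
        return hits / accesses

    for users in (1, 2, 4, 8):                 # each "user" adds 400 hot keys
        print(f"{users} users: hit rate {hit_rate(1000, users * 400):.0%}")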

  50. Let’s Talk About Interference (cont.) • Let’s talk about interference (cont.) • Lower Resource Efficiency • For some resources, the efficiency of the service provided deteriorates as utilisation increases. • To make matters worse, for some resources this deterioration can take the form of a step change in performance.
