
MSG347 Monitoring and Analyzing System Performance for Exchange 

Pierre Bijaoui (Hewlett-Packard)


Presentation Transcript


  1. MSG347: Monitoring and Analyzing System Performance for Exchange. Pierre Bijaoui (Hewlett-Packard)


  3. Goal: How To Pinpoint Causes Of Poor Exchange Performance? • Tools • Windows Performance Monitor (Perfmon) • Microsoft Operations Manager (MOM) + Exchange Management Pack • This talk is very detailed! • Slides are available • Don’t try to take detailed notes now • Getting good at this analysis will take practice • Here’s a kick-start!

  4. Format Note • Performance Monitor counters are written in one of two formats • Object(instance)\counter name • Object\counter name
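
As a rough illustration of the two counter formats on slide 4, the sketch below splits a counter string into object, optional instance, and counter name. The `CounterPath` type and `parse_counter` function are hypothetical names introduced here; they are not part of Perfmon or any Windows API.

```python
from typing import NamedTuple, Optional


class CounterPath(NamedTuple):
    """Hypothetical container for the parts of a counter string."""
    obj: str
    instance: Optional[str]
    counter: str


def parse_counter(path: str) -> CounterPath:
    """Split 'Object(instance)\\counter' or 'Object\\counter' into its parts."""
    # Everything before the first backslash is the object (plus optional instance).
    obj_part, sep, counter = path.partition("\\")
    if not sep or not counter:
        raise ValueError(f"not a counter path: {path!r}")
    if obj_part.endswith(")") and "(" in obj_part:
        obj, _, inst = obj_part.partition("(")
        return CounterPath(obj, inst[:-1], counter)  # drop the closing ')'
    return CounterPath(obj_part, None, counter)
```

For example, `parse_counter(r"Processor(_Total)\% Processor Time")` yields the object `Processor`, the instance `_Total`, and the counter `% Processor Time`.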

  5. Pinpointing Performance Problems: What to do when clients say their mail is slow… • The basic process is deductive • Start at the top and eliminate possibilities

  6. Question 1: Is The Problem Exchange Or “Before” Exchange? Are Requests Even Getting To Exchange? • Use 2 counters • MSExchangeIS\RPC Requests: MAPI RPC requests currently being processed • MSExchangeIS\RPC Operations/sec: rate at which requests are being processed • The problem is before Exchange if • RPC Operations/sec is low and • RPC Requests (outstanding requests) is zero • For all other combinations, the problem is Exchange or something after Exchange
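
The decision rule on slide 6 can be sketched as a small function. The slide does not say what counts as a "low" operation rate, so `low_ops_threshold` is an assumed parameter that would have to come from your own baseline; the function name is hypothetical as well.

```python
def locate_problem(rpc_requests: float, rpc_ops_per_sec: float,
                   low_ops_threshold: float = 1.0) -> str:
    """Apply the slide-6 rule to one sample of the two MSExchangeIS counters.

    low_ops_threshold is an assumption: the slide only says "low".
    """
    if rpc_ops_per_sec < low_ops_threshold and rpc_requests == 0:
        # Nothing is outstanding and almost nothing is being processed:
        # requests are not reaching the store.
        return "before Exchange"
    return "Exchange or after Exchange"
```

A sample with 25 outstanding requests and no operations executing, as in the example on slide 7, falls into the second category.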

  7. Example Exchange Problem: For a 3-minute period in the middle of the trace, no operations are executing but the store has outstanding requests

  8. Example Exchange Problem: Four periods of increasing outstanding requests while throughput drops

  9. Example Client Problem • Somebody running a utility or a test script? • Use NetMon to find from which machine the requests are coming

  10. Example: A Network Problem • Use NetMon to determine whether requests are arriving at the server

  11. Getting The Right Info Upfront: Questions about the problem • Are clients experiencing sluggishness or are clients hanging? • Is it happening with a particular operation? • Does everyone experience the problem at the same time? • At what frequency does this occur?

  12. Getting The Right Info Upfront: Questions about the hardware • How many CPUs are on the server? • How much memory is on the server? • For each physical disk volume • How many disks? • How are they configured (RAID-0, 1 or 5)?

  13. If The Problem Is On The Server… First step: Is there a physical resource bottleneck? Questions • Is there a CPU bottleneck? • Is there a Disk bottleneck? • Is there a memory bottleneck?

  14. Is There A CPU Bottleneck? • Easy to detect • Processor(_Total)\% Processor Time approaches 100% • System\Processor Queue Length is above the number of processors too often • Caveat: Full Text Indexing… (pause the crawl) • If CPU is high • Is MSExchangeIS\RPC Requests increasing? • Getting close to or above 30 is BAD and can cause client timeouts
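
One way to turn the slide-14 indicators into a check over logged samples is sketched below. The slide gives no numbers for "approaches 100%" or "too often", so `cpu_limit` and `queue_fraction` are assumptions, as is the choice to require both indicators; only the value 30 for RPC Requests comes from the slide.

```python
from statistics import mean

RPC_REQUESTS_DANGER = 30  # from the slide: close to or above 30 is bad


def cpu_bottleneck(cpu_pct, queue_len, n_processors,
                   cpu_limit=90.0, queue_fraction=0.5):
    """Return True if both CPU indicators point to a bottleneck.

    cpu_limit and queue_fraction are assumed thresholds, not slide values.
    """
    cpu_high = mean(cpu_pct) >= cpu_limit
    # Count samples where the processor queue exceeds the processor count.
    over = sum(1 for q in queue_len if q > n_processors)
    queue_high = over / len(queue_len) > queue_fraction
    return cpu_high and queue_high


def rpc_requests_at_risk(rpc_requests):
    """True once outstanding RPC requests reach the level the slide calls bad."""
    return rpc_requests >= RPC_REQUESTS_DANGER
```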

  15. CPU Bottleneck • Message delivery spike leads to CPU bottleneck (CPU ~ 100%)

  16. Who Is Consuming The CPU? • The likely suspects (in order) • Process(store)\% Processor Time • Process(inetinfo)\% Processor Time • Process(emsmta)\% Processor Time • Process(mssearch)\% Processor Time • Process(mssdmn)\% Processor Time • Process(system)\% Processor Time • Together these account for about 90% of the CPU used

  17. Who Is Consuming the CPU? “Histogram view”

  18. Who Is Consuming The CPU? • Likely sources of problems • Backup utilities; AV/AS • Monitoring utilities (WinMgmt, MAD) • Remote access tools (WinVNC, TermSrv) • Note: for Process counters, 100% = one full processor. E.g., on an 8-processor server, 0 < Process(process)\% Processor Time < 800%
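
Because Process counters are scaled so that 100% is one full processor, it can help to convert them to a share of the whole machine before comparing them with Processor(_Total). A minimal sketch, with hypothetical function names:

```python
def max_process_pct(n_processors: int) -> int:
    """Upper bound of Process(...)\\% Processor Time on this server."""
    return 100 * n_processors


def share_of_machine(process_pct: float, n_processors: int) -> float:
    """Convert a per-process reading to a percentage of total CPU capacity."""
    return process_pct / n_processors
```

On an 8-processor server, a store process reading of 400% corresponds to half of the machine.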

  19. Disk Bottleneck Detection • Much fuzzier than CPU bottlenecks, so 3 approaches are presented • Always remember: a disk bottleneck may actually be the symptom of a memory problem • Best Practice • Size for disk I/O capacity first, instead of disk space • Run diskperf -y, which enables the logical and physical disk counters

  20. Disk Bottleneck Approach 1 • PhysicalDisk(drive:)\Disk Writes/sec • PhysicalDisk(drive:)\Disk Reads/sec • Look at all drives and compare to the total to isolate where the I/O is going • Rule-of-thumb estimate for random disk I/O • RAID-0: Reads/s + Writes/s < # Spindles × 100 • RAID-1: Reads/s + 2 × Writes/s < # Spindles × 100 • RAID-5: Reads/s + 4 × Writes/s < # Spindles × 100 • Assumes disk throughput = 100 random I/Os per second per spindle
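
The rule of thumb on slide 20 is simple arithmetic, sketched below. The write multipliers and the figure of 100 random I/Os per spindle are the slide's; the function and constant names are made up for this example.

```python
# Write multipliers from the slide: each logical write costs this many disk I/Os.
WRITE_PENALTY = {"RAID-0": 1, "RAID-1": 2, "RAID-5": 4}
IOPS_PER_SPINDLE = 100  # the slide's assumed random I/O rate per spindle


def effective_io(raid: str, reads_per_sec: float, writes_per_sec: float) -> float:
    """Left-hand side of the rule: reads plus penalty-weighted writes."""
    return reads_per_sec + WRITE_PENALTY[raid] * writes_per_sec


def within_capacity(raid: str, reads_per_sec: float, writes_per_sec: float,
                    spindles: int) -> bool:
    """True if the load is below the rule-of-thumb capacity of the array."""
    return effective_io(raid, reads_per_sec, writes_per_sec) < spindles * IOPS_PER_SPINDLE
```

For instance, 300 reads/s and 100 writes/s on an 8-spindle RAID-5 array gives 300 + 4 × 100 = 700, which is under the 800 limit. The same array at 150 writes/s gives 900 and exceeds it.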

  21. Disk Bottleneck Approach 2 • I/O requests waiting to be completed • PhysicalDisk(drive:)\Avg. Disk Queue Length: average over the sampling interval • PhysicalDisk(drive:)\Current Disk Queue Length: instantaneous value • Disk bottleneck if • Average queue >> number of spindles on the array • Current Disk Queue Length never hits zero • Correlate spikes with MSExchangeIS\RPC Requests to confirm effect on clients
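
The two queue-based signals from slide 21 could be evaluated over a log roughly as follows. The slide only says the average queue should not be "much greater" than the spindle count, so the `factor` of 2 is an assumption, and the sketch reports the two signals separately instead of deciding how to combine them.

```python
def queue_signals(avg_queue, current_queue_samples, spindles, factor=2.0):
    """Return (avg_much_greater, never_idle) for one disk array.

    factor is an assumed stand-in for the slide's ">>".
    """
    avg_much_greater = avg_queue > factor * spindles
    # "Never hits zero": every instantaneous sample shows waiting I/O.
    never_idle = min(current_queue_samples) > 0
    return avg_much_greater, never_idle
```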

  22. Disk Bottleneck Approach 3 • I/O latency: sensitive to disk health • PhysicalDisk(drive:)\Avg. Disk sec/Read • PhysicalDisk(drive:)\Avg. Disk sec/Write • Typical range: 0.005 to 0.020 seconds for random I/O • With write caching in the array controller, sec/write < 0.001 • Likely bottleneck: 0.020 to 0.050 seconds • Definite bottleneck: > 0.050 seconds
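
The latency bands on slide 22 map directly onto a small classifier. The band edges are the slide's; how the exact boundary values 0.020 and 0.050 are assigned is a choice made for this sketch.

```python
def classify_latency(avg_disk_sec: float) -> str:
    """Classify an Avg. Disk sec/Read or sec/Write sample using the slide's bands."""
    if avg_disk_sec <= 0.020:
        return "typical"
    if avg_disk_sec <= 0.050:
        return "likely bottleneck"
    return "definite bottleneck"
```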

  23. What Is Causing The I/O? • Identify drives with high I/O… • This may show whether the I/O is likely to be going to the paging file, .edb, .stm, .log, or routing queue files • With Windows 2000 Server, you can use • Process(process name)\IO Read Operations/sec • Process(process name)\IO Write Operations/sec • These give a qualitative feel for which process is doing the I/O

  24. Where Is The I/O Going? Filemon • Choose the logical disks that need investigation • Shows all disk reads and writes (size, which file, etc.) • Useful for multi-use disks (e.g. C:) • See http://www.sysinternals.com

  25. Filemon Example

  26. Physical Memory • Start with Memory\Available MBytes • Available MBytes < 4 MB: Windows aggressively cuts working sets • Server is clearly healthy if Available MBytes >> 4 MB • Check for paging problems with • Memory\Pages/sec (total pages to/from disk) • Memory\Page Reads/sec (total paging reads) • Memory\Page Writes/sec (total paging writes) • Paging I/O is normal: Exchange 2000 uses the Windows NT system cache for the .stm file • Use the physical disk counters to check that paging I/O is really going to the page file!

  27. Monitoring Physical Memory: The Less-Useful Counters • Memory\Page Faults/sec is often not an indication of a problem, as it includes • Memory\Cache Faults/sec: a normal part of Exchange 2000 operation because of the .stm file • Both “Page Faults” and “Cache Faults” include • Memory\Transition Faults/sec: Faults that don’t go to disk (memory manager has the pages on the standby list) • Process(process)\Page Faults/sec: Guide to find rogue processes (use histogram trick)

  28. Monitoring Memory: Where Did It Go? • Likely suspects • Process(store)\Working Set: most of committed bytes (due to Database\Cache Bytes) • Process(inetinfo)\Working Set • Process(emsmta)\Working Set • Memory\Cache Bytes • Use the histogram view to find processes with large working sets…

  29. Virtual Memory, A.k.a. Address Space • Best Practice: Set the /3GB switch in Boot.ini for dedicated Exchange 2000 servers with > 1 GB memory • Requires Windows 2000 Advanced Server or Datacenter Server • Set /USERVA=3030 on Windows Server 2003 • Enterprise Edition and above • Process(store)\Virtual Bytes: Want > 200 MB free • Note: 3 GBytes = 3.22 × 10^9 bytes • Why is this important?
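
The arithmetic behind the "3 GBytes" note and the 200 MB headroom target on slide 29 can be sketched as follows. The sketch assumes a store process running with a 3 GB user address space and treats 1 MB as 2^20 bytes; both are assumptions, and the function names are illustrative only.

```python
MB = 2**20
USER_SPACE_3GB = 3 * 2**30  # 3,221,225,472 bytes, i.e. about 3.22e9


def free_address_space_mb(virtual_bytes: int,
                          user_space: int = USER_SPACE_3GB) -> float:
    """Address space left for the store, given Process(store)\\Virtual Bytes."""
    return (user_space - virtual_bytes) / MB


def enough_headroom(virtual_bytes: int, wanted_mb: float = 200.0,
                    user_space: int = USER_SPACE_3GB) -> bool:
    """True if more than wanted_mb of address space remains (slide: > 200 MB)."""
    return free_address_space_mb(virtual_bytes, user_space) > wanted_mb
```

For a server without the /3GB switch, pass `user_space=2 * 2**30` instead.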

  30. Virtual Memory Fragmentation: Very high fragmentation • Cluster failover may not work if the receiving node is highly fragmented! • Need to monitor VM carefully…

  31. Monitoring Virtual Memory: Exchange 2000 SP1 additions • Perfmon counters to monitor VM fragmentation (cluster failover) • MSExchangeIS\VM Largest Block Size • MSExchangeIS\VM Total Free Blocks • MSExchangeIS\VM Total Large Free Block Bytes • MSExchangeIS\VM Total 16MB Free Blocks • MSExchangeIS events • Event 9852 (warning and error severity) warns of few large contiguous blocks of VM

  32. Kernel Memory • 32-bit OS limits kernel memory space • Limits are computed at server startup • Based on amount of physical memory and number of processors • The /3GB switch limits kernel memory space dramatically

  33. Memory\Pool Paged Bytes • Kernel memory space that can be paged out to disk • Max of 196 MB for a server with > 1024 MB of physical memory and the /3GB switch • 270 MB without the /3GB switch • When the max is hit, the server becomes unresponsive • Increasing paged pool bytes are indicative of • Handle leaks (check the process handle counters) • A growing SMTP queue

  34. Memory\Pool Nonpaged Bytes • Kernel memory space that cannot be paged out to disk • Max of 96 MB on servers with more than 512 MB of physical memory and the /3GB switch • 250 MB without /3GB • Increases are often indicative of • Driver leak (SCSI etc.) • Excessive number of TCP/IP connections • System will become unresponsive when it reaches the max
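
To watch how close the kernel pools are to the ceilings quoted on slides 33 and 34, the readings can be expressed as a fraction of the maximum. The ceilings below are the slides' figures and apply only under the memory conditions the slides state; the table layout and function name are assumptions of this sketch.

```python
MB = 2**20

# (pool, /3GB switch set) -> ceiling in MB, as quoted on slides 33 and 34.
POOL_MAX_MB = {
    ("paged", True): 196,
    ("paged", False): 270,
    ("nonpaged", True): 96,
    ("nonpaged", False): 250,
}


def pool_used_fraction(pool: str, pool_bytes: int, three_gb: bool) -> float:
    """Fraction of the quoted ceiling consumed by a pool counter reading."""
    return pool_bytes / (POOL_MAX_MB[(pool, three_gb)] * MB)
```

A steadily rising fraction over days is the leak pattern the slides warn about, well before the server becomes unresponsive.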

  35. Memory\Free System Page Table Entries (PTEs) • Kernel memory space used to back I/O and network buffers • Generally 61k available PTEs on a /3GB server with 1 GB physical RAM • 450k without the /3GB switch • Healthy server if > 5000 • Unhealthy server if < 3000 • May drop network packets and/or disk I/Os • Especially problematic on large, 8-processor servers with thousands of users • See Q313707, “Exchange 2000 w. /3GB Switch Loses Network Connectivity”
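
The PTE thresholds on slide 35 leave a gap between 3000 and 5000. The sketch below labels that gap "watch", which is an interpretation added here and not something the slide states.

```python
def classify_free_ptes(free_ptes: int) -> str:
    """Classify a Free System Page Table Entries reading using the slide's thresholds."""
    if free_ptes > 5000:
        return "healthy"
    if free_ptes < 3000:
        return "unhealthy"
    return "watch"  # assumed label for the range the slide leaves undefined
```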

  36. Everything Checks Out But Server Still ‘Slow’ • Exchange depends on the Active Directory, so check for bottlenecks on your AD servers • CPU bottleneck? • Disk bottleneck? • Insufficient memory? • Most techniques discussed to identify problems with Exchange 200x are equally applicable to Windows 200x Active Directory

  37. DSAccess Counters: Making Sure Caching Is Happening • DSAccess reduces load on DS by caching requests • Important counters to check operation • MSExchangeDSaccess Caches\Cache Hits/Sec • MSExchangeDSaccess Caches\LDAP Searches/Sec • Compare to baseline rates when server is performing well

  38. Problem Is “Before” Exchange • Check network counters • Network Interface(netcard)\Bytes Received/sec • Network Interface(netcard)\Bytes Sent/sec • The network is rarely a bottleneck; however, incorrect backup schedules can cause problems • Next stop: client-side sniffs. Are the packets really getting to the server?

  39. Measuring Non-MAPI Requests • Analog of “RPC requests”: the Epoxy queue object counters • Epoxy(protocol)\Client Out Que Len • Epoxy(protocol)\Store Out Que Len • protocol = POP3, IMAP4, SMTP, DAV, or NNTP • Client Out Que Len: Number of requests waiting to be picked up by the store • Store Out Que Len: Number of requests waiting to be picked up by the Internet Information Server protocol handlers

  40. Message Delivery Counters • Server responds to user requests preferentially • Delivery queues are therefore the first sign of an overload • SMTP Server\Local Queue Length • Should not grow continuously • Peak periods: Growing and shrinking in the range of 0-1000 is reasonable • SMTP Server\Messages Delivered/sec • Should be continuous • Gaps of zero delivery followed by spikes are indicative of other bottlenecks
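
A rough way to apply the slide-40 guidance to a series of SMTP Server\Local Queue Length samples is sketched below. "Grow continuously" is interpreted here as every sample being larger than the one before it, which is an assumption; the 0-1000 peak range is the slide's.

```python
def queue_status(samples, peak_limit=1000):
    """Label a window of Local Queue Length samples.

    Strictly increasing samples are treated as continuous growth
    (an assumed reading of the slide's wording).
    """
    growing = len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))
    if growing:
        return "growing continuously"
    if max(samples) > peak_limit:
        return "above reasonable peak range"
    return "ok"
```

A queue that rises to a few hundred during a peak and then drains is reported as `"ok"`, matching the slide's description of normal behavior.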

  41. Keeping Servers Healthy

  42. Keeping Servers Healthy • Monitor servers continuously! • If you can identify bottlenecks, you can tell • when you don’t have them and • when you are getting close • But only if you are monitoring! • Need a baseline! • E.g., is today’s problem due to • Increased load? • A mail storm? • A virus? • A hardware problem?

  43. Monitoring Strategies With Perfmon • Keep live views with different sample times, e.g., • 900 seconds for a 24-hour view • 1 second to catch short-lived spikes • Add a minimal set of important counters • Study your busiest server – why is it different? • Save reference logs (baseline data)
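
The trade-off between the two sample times on slide 43 is easy to see by counting samples per window. This is plain arithmetic; the function name is hypothetical.

```python
def samples_in_window(window_hours: float, interval_seconds: int) -> int:
    """Number of whole samples collected over the window at the given interval."""
    return int(window_hours * 3600 // interval_seconds)
```

A 24-hour view at 900-second intervals needs only 96 points, while 1-second sampling over the same day would produce 86,400, which is why the short interval suits live spike hunting and not day-long views.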

  44. A Minimal Set Of Counters • Processor(_Total)\% Processor Time • System\Processor Queue Length • Process(store)\% Processor Time • PhysicalDisk(xxx)\Disk Transfers/sec • PhysicalDisk(xxx)\Avg. Disk sec/Transfer • MSExchangeIS\RPC Requests • MSExchangeIS\RPC Operations/sec • SMTP Server\Local Queue Length • SMTP Server\Messages Delivered/sec • MSExchangeIS Mailbox\Local Delivery Rate • MSExchangeIS Mailbox\Folder Opens/sec • MSExchangeIS Mailbox\Message Opens/sec
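
If the minimal counter set is kept in a script, the disk instance (the `xxx` placeholder on the slide) can be filled in per server. The sketch below only builds the list of counter strings; the function name and the idea of parameterizing the disk instance are assumptions, and the instance name must match what Perfmon shows on your server.

```python
def minimal_counters(disk_instance: str) -> list:
    """Return the slide-44 counter set with the PhysicalDisk instance filled in."""
    return [
        r"Processor(_Total)\% Processor Time",
        r"System\Processor Queue Length",
        r"Process(store)\% Processor Time",
        rf"PhysicalDisk({disk_instance})\Disk Transfers/sec",
        rf"PhysicalDisk({disk_instance})\Avg. Disk sec/Transfer",
        r"MSExchangeIS\RPC Requests",
        r"MSExchangeIS\RPC Operations/sec",
        r"SMTP Server\Local Queue Length",
        r"SMTP Server\Messages Delivered/sec",
        r"MSExchangeIS Mailbox\Local Delivery Rate",
        r"MSExchangeIS Mailbox\Folder Opens/sec",
        r"MSExchangeIS Mailbox\Message Opens/sec",
    ]
```

Writing the list to a text file, one counter per line, gives a reusable definition of the baseline log.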

  45. Do You Know? • Number of messages received/user per day? • How many do they download? • How often do they open folders? • What is the • Peak delivery rate? • Peak period during the day? • Peak day of the week? • Are there monthly/quarterly peaks? • How many more users can your servers support? Maybe there’s an easier way…

  46. Making This Easier… • Microsoft Operations Manager and • Exchange Management Pack • Watch all of the bottleneck analysis perf counters and much more

  47. Goals Of The Exchange Management Packs • Facilitate high availability Exchange operations • Monitor broadly for maximum pre-emptive alerting • Facilitate lower time-to-resolution: Management Pack knowledge base • Rapid diagnosis • Quick resolution

  48. Questions

  49. Exchange Survey • Help us understand your requirements • Available via CommsNet • Daily Drawings for Windows Mobile Smartphones! • http://www.researchhq.com/messagingsurvey

  50. Microsoft Learning • Microsoft® Exchange Server 2003 Administrator's Companion ISBN:0-7356-1979-4
