MSG347 Monitoring and Analyzing System Performance for Exchange

Pierre Bijaoui (Hewlett-Packard)

Goal: How To Pinpoint Causes Of Poor Exchange Performance? • Tools • Windows Performance Monitor (Perfmon) • Microsoft Operations Manager (MOM) + Exchange Management Pack • This talk is very detailed! • Slides are available • Don’t try to take detailed notes now • Getting good at this analysis will take practice • Here’s a kick-start!

Format Note • Performance Monitor counters are written in one of the following formats: Object(instance)\counter name or Object\counter name
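
As an illustration of the notation, the sketch below splits a counter path written in this format into object, optional instance, and counter name. The names `CounterPath` and `parse_counter` are hypothetical helpers introduced here, not part of Perfmon.

```python
import re
from typing import NamedTuple, Optional


class CounterPath(NamedTuple):
    obj: str                 # e.g. "Processor"
    instance: Optional[str]  # e.g. "_Total", or None when there is no instance
    counter: str             # e.g. "% Processor Time"


# Object, optional "(instance)", a backslash, then the counter name.
_PATTERN = re.compile(r"^(?P<obj>[^\\(]+)(?:\((?P<inst>[^)]*)\))?\\(?P<counter>.+)$")


def parse_counter(path: str) -> CounterPath:
    """Split a path written as Object(instance)\\counter or Object\\counter."""
    match = _PATTERN.match(path.strip())
    if match is None:
        raise ValueError(f"not a counter path: {path!r}")
    return CounterPath(
        match.group("obj").strip(),
        match.group("inst"),
        match.group("counter").strip(),
    )
```

For example, `parse_counter(r"MSExchangeIS\RPC Requests")` yields an object of `MSExchangeIS` with no instance.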

Pinpointing Performance Problems What to do when clients say their mail is slow… • Basic process is deductive • Start at top and eliminate possibilities

Question 1: Is The Problem Exchange Or “Before” Exchange? Are Requests Even Getting To Exchange? • Use 2 counters • MSExchangeIS\RPC Requests: MAPI RPC requests currently being processed • MSExchangeIS\RPC Operations/sec: rate at which requests are being processed • Problem is before Exchange if • Operations/sec is low and • Outstanding requests (RPC Requests) is zero • For all other combinations, the problem is Exchange or something after Exchange
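
The decision rule on this slide can be sketched as a small function. The slide does not say what counts as a "low" operations rate, so `low_ops_threshold` is a hypothetical parameter that would have to come from your own baseline; `locate_problem` is likewise an illustrative name.

```python
def locate_problem(rpc_requests: float, rpc_ops_per_sec: float,
                   low_ops_threshold: float) -> str:
    """Apply the Question 1 rule to one sample of the two counters.

    rpc_requests      -- MSExchangeIS\\RPC Requests (outstanding requests)
    rpc_ops_per_sec   -- MSExchangeIS\\RPC Operations/sec
    low_ops_threshold -- assumption: site-specific value for "low" throughput
    """
    if rpc_requests == 0 and rpc_ops_per_sec < low_ops_threshold:
        # Nothing is queued and little is being processed:
        # requests are not reaching the store.
        return "before Exchange"
    # Every other combination points at Exchange or something behind it.
    return "Exchange or after Exchange"
```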

Example Exchange Problem • For a 3-minute period in the middle of the trace, no operations are executing but the store has outstanding requests

Example Exchange Problem Four periods of increasing outstanding requests while throughput drops

Example Client Problem • Somebody running a utility or a test script? • Use NetMon to find from which machine the requests are coming

Example Network Problem • Use NetMon to determine whether requests are arriving at the server

Getting The Right Info Upfront: Questions about the problem • Are clients experiencing sluggishness or are clients hanging? • Is it happening with a particular operation? • Does everyone experience the problem at the same time? • At what frequency does this occur?

Getting The Right Info Upfront: Questions about the hardware • How many CPUs on the server? • How much memory on the server? • For each physical disk volume • How many disks? • How are they configured (RAID-0, 1 or 5)?

If The Problem Is On The Server… First step: Is there a physical resource bottleneck? Questions • Is there a CPU bottleneck? • Is there a Disk bottleneck? • Is there a memory bottleneck?

Is There A CPU Bottleneck? • Easy to detect • Processor(_Total)\% Processor Time approaches 100% • System\Processor Queue Length above # of processors too often • Caveat: Full Text Indexing… (pause crawl) • If CPU is high • Is MSExchangeIS\RPC Requests increasing? • Getting close to or above 30 is BAD and can cause client timeouts
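
A minimal sketch of these checks over a set of logged samples follows. Only the RPC Requests figure of 30 comes from the slide; `busy_pct` (what "approaches 100%" means) and `too_often` (the share of samples with an over-long queue) are assumptions chosen for illustration, as is the function name.

```python
from typing import List, Sequence

RPC_REQUESTS_BAD = 30  # from the slide: at or above this, clients may time out


def cpu_findings(cpu_pct: Sequence[float], queue_len: Sequence[float],
                 rpc_requests: Sequence[float], num_processors: int,
                 busy_pct: float = 90.0, too_often: float = 0.5) -> List[str]:
    """Return the CPU-bottleneck symptoms present in the samples.

    busy_pct and too_often are illustrative thresholds, not values
    from the talk.
    """
    if not cpu_pct or not queue_len or not rpc_requests:
        raise ValueError("each counter needs at least one sample")
    findings = []
    # Processor(_Total)\% Processor Time approaching 100%
    if sum(cpu_pct) / len(cpu_pct) >= busy_pct:
        findings.append("cpu_saturated")
    # System\Processor Queue Length above the processor count too often
    over = sum(1 for q in queue_len if q > num_processors)
    if over / len(queue_len) >= too_often:
        findings.append("processor_queue_backed_up")
    # MSExchangeIS\RPC Requests at or above 30
    if max(rpc_requests) >= RPC_REQUESTS_BAD:
        findings.append("rpc_requests_high")
    return findings
```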

CPU Bottleneck • Message Delivery spike leads to CPU bottleneck CPU ~ 100%

Who Is Consuming The CPU? • The likely suspects (in order) • Process(store)\% Processor Time • Process(inetinfo)\% Processor Time • Process(emsmta)\% Processor Time • Process(mssearch)\% Processor Time • Process(mssdmn)\% Processor Time • Process(system)\% Processor Time • Total of these ≈ 90% of the CPU used

Who Is Consuming the CPU? “Histogram view”

Who Is Consuming The CPU? • Likely sources of problems • Backup utilities; AV/AS • Monitoring utilities (WinMgmt, MAD) • Remote access tools (WinVNC, TermSrv) • Note: Process counters → 100% = one full processor • E.g., on an 8-proc server, 0 < Process(process)\% Processor Time < 800%
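
Because a Process counter of 100% means one full processor, per-process values need dividing by the processor count before they can be compared with Processor(_Total). A sketch, using hypothetical helper names, that also orders processes the way the histogram view would:

```python
from typing import Dict, List, Tuple


def share_of_machine(process_pct: float, num_processors: int) -> float:
    """Convert Process(x)\\% Processor Time into a share of the whole server."""
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    return process_pct / num_processors


def rank_processes(samples: Dict[str, float],
                   num_processors: int) -> List[Tuple[str, float]]:
    """Sort processes by their share of total CPU, largest first."""
    shares = {name: share_of_machine(pct, num_processors)
              for name, pct in samples.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)
```

On an 8-processor server, a store process reading 400% is using half of the machine.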

Disk Bottleneck Detection • Much fuzzier than CPU bottlenecks → present 3 approaches • Always remember: A disk bottleneck may actually be the symptom of a memory problem • Best Practice • Size for disk I/O capacity first, instead of disk space • Run diskperf -y: enables logical and physical disk counters

Disk Bottleneck Approach 1 • PhysicalDisk(drive:)\Disk Writes/sec • PhysicalDisk(drive:)\Disk Reads/sec • Look at all drives – compare to total → isolate where the I/O is going • Rule-of-thumb estimate for disk random I/O • RAID-0: Reads/s + Writes/s < # Spindles × 100 • RAID-1: Reads/s + 2 × Writes/s < # Spindles × 100 • RAID-5: Reads/s + 4 × Writes/s < # Spindles × 100 • Assumes disk throughput = 100 random I/Os per second per spindle
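
The rule of thumb translates directly into arithmetic. The sketch below uses the write multipliers and the 100 I/Os per spindle figure from the slide; the function names are illustrative.

```python
# Back-end I/Os generated per front-end write, per the slide's rule of thumb.
WRITE_PENALTY = {"RAID-0": 1, "RAID-1": 2, "RAID-5": 4}
IOPS_PER_SPINDLE = 100  # slide's assumption for random I/O


def backend_iops(level: str, reads_per_sec: float, writes_per_sec: float) -> float:
    """Left-hand side of the rule: reads plus penalty-weighted writes."""
    return reads_per_sec + WRITE_PENALTY[level] * writes_per_sec


def within_capacity(level: str, reads_per_sec: float, writes_per_sec: float,
                    spindles: int, iops_per_spindle: int = IOPS_PER_SPINDLE) -> bool:
    """True when the measured load fits under # Spindles x 100."""
    return backend_iops(level, reads_per_sec, writes_per_sec) < spindles * iops_per_spindle
```

With 300 reads/s and 100 writes/s on six spindles, RAID-5 needs 700 back-end I/Os against a capacity of 600, while RAID-1 needs only 500.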

Disk Bottleneck Approach 2 • I/O requests waiting to be completed • PhysicalDisk(drive:)\Avg. Disk Queue Length: average over the sampling interval • PhysicalDisk(drive:)\Current Disk Queue Length: instantaneous value • Disk bottleneck if • Average queue >> number of spindles on the array • Current Disk Queue Length never hits zero • Correlate spikes with MSExchangeIS\RPC Requests to confirm effect on clients
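
A sketch of the two queue tests over logged samples. The slide says only that the average queue should not be much greater than the spindle count, so `much_greater_factor` is an assumption made here to give ">>" a concrete meaning.

```python
from typing import List, Sequence


def disk_queue_findings(avg_queue: Sequence[float], current_queue: Sequence[float],
                        spindles: int, much_greater_factor: float = 2.0) -> List[str]:
    """Check Avg. and Current Disk Queue Length samples for one array.

    much_greater_factor is an illustrative stand-in for ">>".
    """
    if not avg_queue or not current_queue:
        raise ValueError("each counter needs at least one sample")
    findings = []
    mean_avg = sum(avg_queue) / len(avg_queue)
    if mean_avg > much_greater_factor * spindles:
        findings.append("avg_queue_exceeds_spindles")
    # A queue that never drains to zero means the disk is never idle.
    if min(current_queue) > 0:
        findings.append("current_queue_never_zero")
    return findings
```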

Disk Bottleneck Approach 3 • I/O latency → sensitive to disk health • PhysicalDisk(drive:)\Avg. Disk sec/Read • PhysicalDisk(drive:)\Avg. Disk sec/Write • Typical range: 0.005 to 0.020 seconds for random I/O • Write caching in array controller → sec/write < 0.001 • Likely bottleneck: 0.020 - 0.050 seconds • Definite bottleneck: > 0.050 seconds
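
The latency bands can be sketched as a simple classifier. The slide lists 0.020 seconds in both the typical and the likely-bottleneck ranges; treating exactly 0.020 as typical is a choice made here, and the label strings are illustrative.

```python
def classify_latency(avg_disk_sec: float) -> str:
    """Classify Avg. Disk sec/Read or sec/Write (in seconds) into the slide's bands."""
    if avg_disk_sec < 0:
        raise ValueError("latency cannot be negative")
    if avg_disk_sec <= 0.020:
        # Includes cached writes, which the slide puts below 0.001 s.
        return "typical"
    if avg_disk_sec <= 0.050:
        return "likely bottleneck"
    return "definite bottleneck"
```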

What Is Causing The I/O? • Identify drives with high I/O… • May identify if it is likely to be the paging file, .edb, .stm, .log, or routing queue files • With Windows 2000 Server, you can use • Process(process name)\IO Read Operations/sec • Process(process name)\IO Write Operations/sec → qualitative feel for which process is doing I/O

Where Is The I/O Going? Filemon • Choose the logical disks that need investigation • Shows all disk reads and writes (size, which file, etc.) • Useful for multi-use disks (e.g. C:) • See http://www.sysinternals.com

Filemon Example

Physical Memory • Start with Memory\Available MBytes • Available MBytes < 4MB → Windows aggressively trims working sets • Server clearly healthy if Available MBytes >> 4MB • Check for paging problems with • Memory\Pages/sec (total pages to/from disk) • Memory\Page Reads/sec (total paging reads) • Memory\Page Writes/sec (total paging writes) • Paging I/O is normal: Exchange 2000 uses the Windows system cache for the .stm file • Check with physical disk counters that paging I/O is really going to the page file!

Monitoring Physical Memory: The Less-Useful Counters • Memory\Page Faults/sec is often not an indication of a problem, as it includes • Memory\Cache Faults/sec: normal part of Exchange 2000 operation because of the .stm file • Both “Page Faults” and “Cache Faults” include • Memory\Transition Faults/sec: Faults that don’t go to disk (memory manager has the pages on the standby list) • Process(process)\Page Faults/sec: Guide to find rogue processes (use histogram trick)

Monitoring Memory: Where Did It Go? • Likely suspects • Process(store)\Working Set: most of committed bytes (due to Database\Cache Bytes) • Process(inetinfo)\Working Set • Process(emsmta)\Working Set • Memory\Cache Bytes → Histogram to find processes with large working sets…

Virtual Memory, A.k.a. Address Space • Best Practice: Set the /3GB switch in Boot.ini for dedicated Exchange 2000 servers with > 1 GB memory • Requires Windows 2000 Adv. Server or Datacenter • Set /USERVA=3030 on Windows Server 2003 • Enterprise Edition and above • Process(store)\Virtual Bytes: Want >200MB free • Note: 3 GBytes = 3.22 × 10^9 bytes • Why is this important?
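
The 200MB headroom target can be checked by subtracting the store's virtual bytes from the size of its address space. The sketch assumes a 3 GB user address space (the /3GB case from the slide); a server booted with /USERVA=3030 would pass a different `address_space`. The function names are illustrative.

```python
USER_SPACE_3GB = 3 * 2**30    # 3,221,225,472 bytes, i.e. about 3.22 x 10^9
MIN_FREE_BYTES = 200 * 2**20  # the slide's ">200MB free" target


def store_vm_free(virtual_bytes: int, address_space: int = USER_SPACE_3GB) -> int:
    """Address space left to the store, given Process(store) Virtual Bytes."""
    return address_space - virtual_bytes


def has_enough_vm(virtual_bytes: int, address_space: int = USER_SPACE_3GB,
                  min_free: int = MIN_FREE_BYTES) -> bool:
    """True when more than min_free bytes of address space remain."""
    return store_vm_free(virtual_bytes, address_space) > min_free
```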

Virtual Memory Fragmentation Very high fragmentation • Cluster failover may not work if receiving node is highly fragmented! • Need to monitor VM carefully…

Monitoring Virtual Memory: Exchange 2000 SP1 additions • Perfmon counters to monitor VM fragmentation (cluster failover) • MSExchangeIS\VM Largest Block Size • MSExchangeIS\VM Total Free Blocks • MSExchangeIS\VM Total Large Free Block Bytes • MSExchangeIS\VM Total 16MB Free Blocks • MSExchangeIS events • Event 9852 (warning and error severity) warns of few large contiguous blocks of VM

Kernel Memory • 32-bit OS limits kernel memory space • Limits are computed at server startup • Based on amount of physical memory and number of processors • /3GB switch limits kernel memory space dramatically

Memory\Pool Paged Bytes • Kernel memory space that can be paged out to disk • Max of 196 MB for a server with >1024 MB of physical memory and the /3GB switch • 270 MB without the /3GB switch • When max is hit, server → unresponsive • Increasing paged pool bytes… indicative of • Handle leaks → check process handle counters • Growing SMTP queue

Memory\Pool Nonpaged Bytes • Kernel memory space that cannot be paged out to disk • Max of 96 MB on servers with more than 512 MB of physical memory and the /3GB switch • 250 MB without the /3GB switch • Increases are often indicative of • Driver leak (SCSI etc.) • Excessive number of TCP/IP connections • System will become unresponsive when it reaches max
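
To watch how close a server is to the paged and non-paged pool ceilings, the counter value can be expressed as a fraction of the maximum. The sketch uses the figures quoted on these two slides; real limits are computed at startup from memory and processor count, so the table is an approximation for the configurations the slides describe, and the helper name is hypothetical.

```python
# (pool, /3GB switch set?) -> maximum size in MB, as quoted on the slides.
POOL_MAX_MB = {
    ("paged", True): 196,
    ("paged", False): 270,
    ("nonpaged", True): 96,
    ("nonpaged", False): 250,
}


def pool_usage_fraction(pool: str, used_bytes: int, three_gb_switch: bool) -> float:
    """Fraction of the quoted pool maximum currently in use (1.0 = at the limit)."""
    max_bytes = POOL_MAX_MB[(pool, three_gb_switch)] * 1024 * 1024
    return used_bytes / max_bytes
```

A paged pool of 98 MB on a /3GB server is already halfway to the point where the server stops responding.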

Memory\Free System Page Table Entries (PTEs) • Kernel memory space used to back I/O and network buffers • Generally 61k available PTEs on a /3GB server with 1GB physical RAM • 450k without the /3GB switch • Healthy server if >5000 • Unhealthy server if <3000 • May drop network packets and/or disk I/Os • Especially problematic on large, 8-processor servers with thousands of users • See Q313707 Exchange 2000 w. /3GB Switch Loses Network Connectivity
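
The two PTE thresholds leave a gap between 3000 and 5000 that the slide does not label. The sketch below calls that band "watch", which is a label chosen here rather than one from the talk.

```python
def pte_status(free_ptes: int) -> str:
    """Classify the Free System Page Table Entries counter value."""
    if free_ptes > 5000:
        return "healthy"
    if free_ptes < 3000:
        return "unhealthy"
    # Between the two thresholds; the slide gives no verdict here.
    return "watch"
```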

Everything Checks Out But Server Still ‘Slow’ • Exchange depends on the Active Directory → Check out bottlenecks on your AD servers • CPU bottleneck? • Disk bottleneck? • Insufficient Memory? • Most techniques discussed to identify problems with Exchange 200x are equally applicable to Windows 200x Active Directory

DSAccess Counters: Making Sure Caching Is Happening • DSAccess reduces load on DS by caching requests • Important counters to check operation • MSExchangeDSAccess Caches\Cache Hits/Sec • MSExchangeDSAccess Caches\LDAP Searches/Sec • Compare to baseline rates when server is performing well

Problem Is “Before” Exchange • Check network counters • Network Interface(netcard)\Bytes Received/sec • Network Interface(netcard)\Bytes Sent/sec • The network is rarely a bottleneck. However, incorrect backup schedules can cause problems • Next stop, client side sniffs – are the packets really getting to the server?

Measuring Non-MAPI Requests • Analog of “RPC requests” → Epoxy queue object counters • Epoxy(protocol)\Client Out Que Len • Epoxy(protocol)\Store Out Que Len • protocol = POP3, IMAP4, SMTP, DAV, and NNTP • Client Out Que Len: Number of requests waiting to be picked up by the store • Store Out Que Len: Number of requests waiting to be picked up by the Internet Information Server protocol handlers

Message Delivery Counters • Server responds to user requests preferentially • Delivery queues → first sign of an overload • SMTP Server\Local Queue Length • Should not grow continuously • Peak periods: Growing and shrinking in the range of 0-1000 is reasonable • SMTP Server\Messages Delivered/sec • Should be continuous • Gaps of zero delivery followed by spikes are indicative of other bottlenecks
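
Both delivery symptoms can be spotted mechanically in a counter log. The sketch below flags a queue that rises at every sample in the window and counts runs of zero delivery; `min_gap` (how many consecutive zero samples make a gap) is an assumption, as are the function names.

```python
from typing import Sequence


def grows_continuously(queue_lengths: Sequence[float]) -> bool:
    """True when SMTP Server Local Queue Length rises at every sample."""
    return len(queue_lengths) >= 2 and all(
        later > earlier
        for earlier, later in zip(queue_lengths, queue_lengths[1:])
    )


def zero_delivery_gaps(delivered_per_sec: Sequence[float], min_gap: int = 3) -> int:
    """Count runs of at least min_gap consecutive zero Messages Delivered/sec samples."""
    gaps = 0
    run = 0
    for rate in delivered_per_sec:
        if rate == 0:
            run += 1
        else:
            if run >= min_gap:
                gaps += 1
            run = 0
    if run >= min_gap:  # a gap that lasts to the end of the log
        gaps += 1
    return gaps
```

A queue that grows and then shrinks during a peak, such as 0, 400, 900, 300, 50, is not flagged.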

Keeping Servers Healthy

Keeping Servers Healthy • Monitor servers continuously! • If you can identify bottlenecks, you can tell • when you don’t have them and • when you are getting close • But only if you are monitoring! • Need a baseline! • E.g., is today’s problem due to • Increased load • Mail storm • Virus • Hardware problem

Monitoring Strategies With Perfmon • Keep live views w/different sample times, e.g., • 900 seconds for a 24 hour view • 1 second to catch short lived spikes • Add minimal set of important counters • Study your busiest server – why is it different? • Save reference logs (baseline data)
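
Once reference logs exist, comparing today's averages with the baseline is straightforward. A minimal sketch, assuming both are held as dictionaries keyed by counter path; the 25% `tolerance` is an arbitrary illustrative value and `deviations` is a hypothetical helper.

```python
from typing import Dict


def deviations(baseline: Dict[str, float], current: Dict[str, float],
               tolerance: float = 0.25) -> Dict[str, float]:
    """Return counters whose relative change from baseline exceeds tolerance.

    Values are fractional changes: 1.5 means 150% above baseline.
    Counters missing from current, or with a zero baseline, are skipped.
    """
    changed = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue
        change = (current[name] - base) / base
        if abs(change) > tolerance:
            changed[name] = change
    return changed
```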

A Minimal Set Of Counters • Processor(_Total)\% Processor Time • System\Processor Queue Length • Process(store)\% Processor Time • PhysicalDisk(xxx)\Disk Transfers/sec • PhysicalDisk(xxx)\Avg. Disk sec/Transfer • MSExchangeIS\RPC Requests • MSExchangeIS\RPC Operations/sec • SMTP Server\Local Queue Length • SMTP Server\Messages Delivered/sec • MSExchangeIS Mailbox\Local Delivery Rate • MSExchangeIS Mailbox\Folder Opens/sec • MSExchangeIS Mailbox\Message Opens/sec
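
The minimal set can be kept as data so that the same list is used on every server. In the sketch below the `xxx` placeholder from the slide is replaced with a disk instance name supplied by the caller; the instance string in the usage note is only an example, and `counter_paths` is an illustrative helper.

```python
MINIMAL_COUNTERS = [
    r"Processor(_Total)\% Processor Time",
    r"System\Processor Queue Length",
    r"Process(store)\% Processor Time",
    r"PhysicalDisk(xxx)\Disk Transfers/sec",
    r"PhysicalDisk(xxx)\Avg. Disk sec/Transfer",
    r"MSExchangeIS\RPC Requests",
    r"MSExchangeIS\RPC Operations/sec",
    r"SMTP Server\Local Queue Length",
    r"SMTP Server\Messages Delivered/sec",
    r"MSExchangeIS Mailbox\Local Delivery Rate",
    r"MSExchangeIS Mailbox\Folder Opens/sec",
    r"MSExchangeIS Mailbox\Message Opens/sec",
]


def counter_paths(disk_instance: str) -> list:
    """Return the minimal counter set with the disk placeholder filled in."""
    return [path.replace("(xxx)", f"({disk_instance})") for path in MINIMAL_COUNTERS]
```

Calling `counter_paths("0 C:")` gives twelve paths, two of which refer to the chosen disk.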

Do You Know? • Number of messages received/user per day? • How many do they download? • How often do they open folders? • What is the • Peak delivery rate? • Peak period during the day? • Peak day of the week? • Are there monthly/quarterly peaks? • How many more users can your servers support? Maybe there’s an easier way…

Making This Easier… • Microsoft Operations Manager and • Exchange Management Pack • Watch all of the bottleneck analysis perf counters and much more

Goals Of The Exchange Management Packs • Facilitate high availability Exchange operations • Monitor broadly → maximum pre-emptive alerting • Facilitate lower time-to-resolution: Management Pack knowledge base • Rapid diagnosis • Quick resolution

Questions

Exchange Survey • Help us understand your requirements • Available via CommsNet • Daily Drawings for Windows Mobile Smartphones! • http://www.researchhq.com/messagingsurvey

Microsoft Learning • Microsoft® Exchange Server 2003 Administrator's Companion ISBN:0-7356-1979-4
