ADM390 Microsoft ® Windows ® Crash Dump Analysis. Mark Russinovich Winternals Software David Solomon David Solomon Expert Seminars. About The Speakers. Authors of: Inside Windows 2000 , 3rd Edition (Microsoft Press) Inside Windows 2000/XP/2003 Interactive Internals Video Tutorial
ADM390 Microsoft® Windows® Crash Dump Analysis • Mark Russinovich, Winternals Software • David Solomon, David Solomon Expert Seminars
About The Speakers • Authors of: • Inside Windows 2000, 3rd Edition (Microsoft Press) • Inside Windows 2000/XP/2003 Interactive Internals Video Tutorial • Used by Microsoft for worldwide internal training • David Solomon: • Teaches Windows internals classes (www.solsem.com) • Writes books and articles on Windows internals • Mark Russinovich: • Author of tools on www.sysinternals.com • Co-founder and Chief Software Architect for Winternals Software (www.winternals.com) • Teaches Windows internals classes • Writes books and articles on Windows internals
Outline • What causes crashes? • Crash dump options • Analysis with WinDbg/Kd • Debugging hung systems • Microsoft On-line Crash Analysis • Using Driver Verifier • Live kernel debugging • Getting past a crash
Introduction • Many systems administrators ignore the crash dump options in Windows NT/Windows 2000 • “I don’t know what to do with one” • “It’s too hard” • “It won’t tell me anything anyway” • Basic crash dump analysis is actually pretty straightforward • Even if only 1 out of 5 or 10 dumps tells you what’s wrong, isn’t it worth spending a few minutes?
Why Analyze Dumps? • The debuggers and Microsoft Online Crash Analysis (OCA) often solve crashes • Sometimes, however, they do not, so your analysis might tell you: • What driver to disable, update, or replace with different hardware • What OEM to send the dump to
What Causes Crashes? • The system crashes when a fatal error prevents further execution • Any kernel-mode component can crash the system • Drivers and the OS share the same memory space • Therefore, any driver or OS component can, due to a bug, corrupt system memory • Note: The shared memory space is a performance trade-off and is the same on Linux, most Unix variants, VMS, etc.
What Are The Root Causes? • Anecdotal evidence suggests: • Buggy drivers • Bugs in the OS • Hardware failure/error • Cosmic rays
At The Crash • A component calls KeBugCheckEx, which takes five arguments: • Stop code • 4 stop-code defined parameters • KeBugCheckEx: • Turns off interrupts • Tells other CPUs to stop • Paints the blue screen • Notifies registered drivers of the crash • If a dump is configured: • Verifies checksums • Calls dump I/O functions
Common Stop Codes • There are about 150 defined stop codes • Shared by many components and drivers • Common ones include: • IRQL_NOT_LESS_OR_EQUAL (0x0A) • Usually an invalid memory access • UNEXPECTED_KERNEL_MODE_TRAP (0x7F) and KMODE_EXCEPTION_NOT_HANDLED (0x1E) • Generated by executing garbage instructions • Usually caused when a stack is trashed • Documented in the Debugging Tools help file • Often covered by multiple Knowledge Base articles
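The stop code and its four parameters can be rendered the way a blue screen or debugger summary shows them. The following is a minimal sketch, not part of the original slides: the `describe_stop` helper and the sample parameter values are hypothetical, the table holds only the three codes named above, and 0x7F is listed under its documented name, UNEXPECTED_KERNEL_MODE_TRAP.

```python
# Partial table: only the stop codes named on the slide.
STOP_CODES = {
    0x0A: "IRQL_NOT_LESS_OR_EQUAL",
    0x1E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x7F: "UNEXPECTED_KERNEL_MODE_TRAP",
}


def describe_stop(code, params):
    """Format a stop code and its four stop-code-defined parameters."""
    if len(params) != 4:
        raise ValueError("a bug check carries exactly four parameters")
    name = STOP_CODES.get(code, "UNKNOWN_STOP_CODE")
    args = ", ".join(f"0x{p:08X}" for p in params)
    return f"STOP 0x{code:08X} ({args}) {name}"


# Hypothetical parameter values, for illustration only.
print(describe_stop(0x0A, [0x10, 0x2, 0x0, 0x80401234]))
```

The meaning of the four parameters depends on the stop code, so a real tool would need a per-code description as well.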
Dump Options • Complete memory dump (Windows NT 4, Windows 2000, Windows XP) • Full contents of memory written to <systemroot>\memory.dmp • Kernel memory dump (Windows 2000, Windows XP, Server 2003) • System memory written to <systemroot>\memory.dmp • Small memory dump (Windows 2000, Windows XP, Server 2003) • Also called a minidump or triage dump • 64KB of summary written to <systemroot>\minidump\MiniMMDDYY-NN.dmp
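As an illustration of the MiniMMDDYY-NN.dmp naming pattern, here is a small sketch that builds such a name. The `minidump_name` helper is hypothetical, and the assumption that NN is a two-digit sequence number starting at 01 for each date is mine, not stated on the slide.

```python
from datetime import date


def minidump_name(crash_date, sequence):
    """Build a MiniMMDDYY-NN.dmp file name for a crash date and sequence number."""
    if not 1 <= sequence <= 99:
        raise ValueError("sequence must fit in two digits")
    # %m%d%y gives month, day, and two-digit year, matching MMDDYY.
    return f"Mini{crash_date:%m%d%y}-{sequence:02d}.dmp"


print(minidump_name(date(2003, 6, 4), 1))
```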
Enabling Dumps • In Windows 2000/XP/2003:
What Happens When Crash Dumps Are Enabled • When the system boots it checks HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\CrashControl • The boot disk paging file’s on-disk mapping is obtained • Relevant components are checksummed: • Boot disk miniport driver • Crash I/O functions • Page file map
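The CrashControl key can also be inspected from a script. The sketch below assumes the dump type is stored in a value named `CrashDumpEnabled` with the encoding 0 = none, 1 = complete, 2 = kernel, 3 = small; that value name and encoding are not given on the slides, and both helper functions are hypothetical.

```python
CRASH_CONTROL = r"System\CurrentControlSet\Control\CrashControl"

# Assumed encoding of the CrashDumpEnabled value (not from the slides).
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump",
}


def dump_type(crash_dump_enabled):
    """Translate a CrashDumpEnabled value into a dump option name."""
    return DUMP_TYPES.get(crash_dump_enabled, "unknown")


def read_crash_dump_enabled():
    """Read the value on a live Windows system (Windows only)."""
    import winreg  # standard library, available only on Windows

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value


print(dump_type(2))
```

On a Windows machine, `dump_type(read_crash_dump_enabled())` would report the configured option.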
At The Reboot • [Figure: dump recovery at reboot, steps numbered 1 to 4. User mode: Session Manager, WinLogon, SaveDump, Memory.dmp. Kernel mode: NtCreatePagingFile, paging file.]
At The Reboot (steps 1 and 2) • Session Manager process (\windows\system32\smss.exe) initializes paging file • NtCreatePagingFile • NtCreatePagingFile determines if the dump has a crash header • Protects the dump from use • WinLogon calls NtQuerySystemInformation to tell if there’s a dump
At The Reboot (steps 3 and 4) • If there’s a dump, Winlogon executes SaveDump (\windows\system32\savedump.exe) • Writes an event to the System event log • SaveDump writes contents to appropriate file • Crash dump portion of paging file is in use during copy, so virtual memory can run low
Why Crash Dumps Fail • Most common reasons: • Paging file on boot volume is too small • Not enough free space for extracted dump • Less common: • The crash corrupted components involved in the dump process • Miniport driver doesn’t implement dump I/O functions • Windows storage drivers must implement dump I/O to get a Microsoft® digital signature
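The two most common failure reasons are simple size comparisons, which the sketch below makes explicit. The `dump_failure_reasons` helper and its parameters are hypothetical; the caller supplies the expected dump size, since the slides do not state how large each dump type is.

```python
def dump_failure_reasons(pagefile_mb, dump_mb, free_mb):
    """Return the common reasons a dump of dump_mb megabytes would fail.

    pagefile_mb: size of the paging file on the boot volume
    dump_mb:     expected size of the dump (supplied by the caller)
    free_mb:     free space on the volume that receives the extracted dump
    """
    reasons = []
    if pagefile_mb < dump_mb:
        reasons.append("paging file on boot volume is too small")
    if free_mb < dump_mb:
        reasons.append("not enough free space for extracted dump")
    return reasons


print(dump_failure_reasons(pagefile_mb=64, dump_mb=512, free_mb=1024))
```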
Microsoft On-line Crash Analysis (OCA) • By default, after a reboot, XP/Server 2003 prompts you to send information to http://oca.microsoft.com • Can be configured with Computer Properties->Advanced->Error Reporting • Can be customized with Group Policies
What Does OCA Do? • Server farm uses !analyze, but uses Microsoft’s Triage.ini file and database that includes information about known problems • Several ways to get OCA results: • Via e-mail • At the OCA site • Sometimes OCA will point you at KB articles that describe the problem • KB articles may tell you to use Windows Update to get newer drivers, a hotfix, or install a Service Pack
Analyzing a Crash Dump • If OCA doesn’t help you, or you have an NT4 or Windows 2000 dump, then you need to open it with one of the kernel debuggers: • WinDbg: Windows program • Kd: command-line program • Both provide the same kernel debugger analysis commands • Part of the Debugging Tools for Windows • Free download from http://www.microsoft.com/whdc/ddk/debugging/default.mspx • Supports Windows NT 4, Windows 2000, Windows XP, Server 2003 • Check for updates frequently • Don’t use the older version on install media
Symbol Files • Before you can use any crash analysis tool you need symbol files • Symbol files contain global function and variable names • Symbols are service pack-specific and have an installer (default directory is \windows\symbols) • Windows NT 4: *.dbg • Windows 2000: *.dbg, *.pdb • Windows XP/2003: *.pdb • Note: Service Pack symbols only include updates
Microsoft Symbol Server • WinDbg and Kd can download symbols automatically from Microsoft • Pick a directory to install symbols and add the following to the debugger’s symbol path: SRV*directory*http://msdl.microsoft.com/download/symbols • The debugger automatically detects the OS version of a dump and downloads the symbols on demand
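The symbol path entry is just the local cache directory and the server URL joined with asterisks. A minimal sketch follows; the `symbol_path` helper and the `c:\symbols` directory are hypothetical examples.

```python
SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def symbol_path(cache_dir, server=SYMBOL_SERVER):
    """Build a SRV*directory*url symbol path entry."""
    if "*" in cache_dir:
        raise ValueError("cache directory must not contain '*'")
    return f"SRV*{cache_dir}*{server}"


# c:\symbols is an example directory; any writable directory works.
print(symbol_path(r"c:\symbols"))
```

The resulting string can be pasted into the debugger's symbol path setting.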
Automated Analysis • When you open a crash dump with WinDbg or Kd you get a basic crash analysis: • Stop code and parameters • A guess at the offending driver • The analysis is the result of the automated execution of the !analyze debugger command
Automated Analysis • Always execute !analyze with the -v option to get more information • Text description of stop code • Meaning (if any) of parameters • Stack dump • !analyze uses heuristics to walk up the stack and determine what driver is the likely cause of the crash • “Followup” is taken from the optional triage.ini file
Manual Analysis • Sometimes automated analysis isn’t enough • !analyze doesn’t tell you anything useful • You want to know what else was happening at the time of the crash • Useful commands: • Examine current thread: !thread tid • May or may not be related to the crash • List all processes: !process 0 0 • Make sure you understand what was running on the system • Examine a specific process: !process <pid> 7 • List loaded drivers: lm kv • Make sure drivers are all recognized and up to date • Look at memory usage: !vm • Create a smaller dump file: .dump • Additional commands: !help
Driver Verifier • If you find a driver in a crash dump that looks like it might be the cause of the crash, turn on verification for it • If the Verifier detects a violation it crashes the system and identifies the driver • Use “Last Known Good” if the verifier detects a bug during the boot • If a bug is detected in a third-party product check for updates and/or contact the vendor’s support
NotMyFault.exe • To demonstrate common crash scenarios, use NotMyFault.exe • Download from http://www.sysinternals.com/files/notmyfault.zip • It loads MyFault.sys • MyFault.sys has an IOCTL interface that implements different bugs • [Figure: the user mode/kernel mode boundary, with MyFault.sys and its IOCTL interface in kernel mode]
IRQL_NOT_LESS_OR_EQUAL • Run NotMyFault and select “High IRQL fault (kernel mode)” • Allocates paged pool buffer • Frees the buffer • Raises IRQL ≥ DISPATCH_LEVEL • Touches the buffer • Paged buffers that are marked “not present” but are touched when IRQL ≥ DISPATCH_LEVEL result in the IRQL_NOT_LESS_OR_EQUAL bug check • Memory Manager calls KeBugCheckEx from page fault handler • The IRQL is not less than or equal to the maximum IRQL at which the operation is legal (which is < DISPATCH_LEVEL)
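The rule behind this bug check can be modeled in a few lines. This is a toy model, not kernel code: the `touch` function and `BugCheck` exception are hypothetical, the IRQL constants are the x86 values, and the model assumes a not-present page that the Memory Manager could otherwise page back in.

```python
PASSIVE_LEVEL, APC_LEVEL, DISPATCH_LEVEL = 0, 1, 2  # x86 IRQL values


class BugCheck(Exception):
    """Stands in for a call to KeBugCheckEx in this toy model."""

    def __init__(self, code):
        super().__init__(f"STOP 0x{code:08X}")
        self.code = code


def touch(page_present, irql):
    """Model touching a pageable buffer at the given IRQL."""
    if page_present:
        return "access ok"
    # Page faults cannot be serviced at DISPATCH_LEVEL or above.
    if irql >= DISPATCH_LEVEL:
        raise BugCheck(0x0A)  # IRQL_NOT_LESS_OR_EQUAL
    return "page fault resolved"


print(touch(page_present=False, irql=PASSIVE_LEVEL))
try:
    touch(page_present=False, irql=DISPATCH_LEVEL)
except BugCheck as stop:
    print(stop)
```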
Using the Stack in Analysis • !analyze easily identifies MyFault.sys by looking at the KeBugCheckEx parameters • The Memory Manager looked at the stack and determined the address that caused the page fault • !analyze often looks at the stack to determine the cause of a crash
Stacks • Each thread has a user-mode and kernel-mode stack • The user-mode stack is usually 1 MB on x86 • The kernel-mode stack is typically 12 KB on x86 systems • Stacks allow for nested function invocation • Parameters can be passed on the stack • Stores return address • Serves as storage for local variables
Stack Frames • [Figure: three nested stack frames for Function 1, Function 2, and Function 3, with higher addresses at the top. Each frame holds, from higher to lower addresses, the parameters, the return address, the frame pointer, and the local variables.]
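When every function saves a frame pointer, the chain of frames can be followed mechanically, which is roughly what a debugger stack trace does. The sketch below simulates this on a dictionary standing in for memory. It assumes the conventional x86 layout in which the saved frame pointer sits at the frame pointer address and the return address sits 4 bytes above it; the addresses and the `walk_frames` helper are made up for illustration.

```python
def walk_frames(memory, frame_pointer, max_frames=64):
    """Follow saved frame pointers and collect the return addresses.

    memory maps an address to the 32-bit value stored there.
    """
    return_addresses = []
    while frame_pointer != 0 and len(return_addresses) < max_frames:
        return_addresses.append(memory[frame_pointer + 4])  # return address
        frame_pointer = memory[frame_pointer]                # caller's frame
    return return_addresses


# Made-up stack: Function 3's frame is at 0x1000, Function 1's at 0x1040.
memory = {
    0x1000: 0x1020, 0x1004: 0xF2,  # Function 3 returns into Function 2
    0x1020: 0x1040, 0x1024: 0xF1,  # Function 2 returns into Function 1
    0x1040: 0x0000, 0x1044: 0xF0,  # Function 1 returns to its caller
}
print([hex(a) for a in walk_frames(memory, 0x1000)])
```

If a frame pointer or return address on the stack is overwritten, this walk yields garbage or stops early, which is why a trashed stack is so hard to analyze.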
Stacks • Other calling conventions make the stack hard to figure out • No frame pointer • Register arguments (fast calls) • Debugger requires symbol information to parse • The stack is the #1 analysis resource • It requires that a driver get “caught in the act” • Sometimes that’s not possible without the Driver Verifier’s help
Stack Trashing • Stack trashes have several possible causes: • A driver pushing things on the stack causes the stack to overflow • A driver overruns a stack-allocated buffer • Usually results in garbage code being executed (KMODE_EXCEPTION_NOT_HANDLED) • Driver Verifier can’t determine cause • Since the stack is corrupted, analysis is especially hard
Debugging Stack Trashes • Run NotMyFault and select “Stack Trash” • Allocates a buffer on the stack • Overruns the buffer • Returns to the caller • The crash doesn’t show much offhand • !analyze actually blames Win32K.sys, the Win32 kernel-mode subsystem • The stack doesn’t show anything except an exception handler • Look deeper • !thread shows an outstanding IRP • !irp <irp> shows that myfault.sys was the target of the IRP
Buffer Overruns • Result when a driver goes past the end (overrun) or the beginning (underrun) of a buffer • Usually detected when overwritten data is referenced • Another driver or the kernel makes the reference • There can be a long delay between corruption and detection • [Figure: pool layout, from higher to lower addresses: another driver’s buffer, pool structures, driver buffer]
Causing a Buffer Overrun • Run NotMyFault and select “Buffer Overrun” • Allocates a nonpaged pool buffer • Writes a string past the end • Note that you might have to run several times since a crash will occur only if: • The kernel references the corrupted pool structures • A driver references the corrupted buffer • The crash tells you what happened, but not why
A Buffer Overrun Bluescreen • In this example, where the crash was the result of the kernel tripping on corrupt pool tracking structures, the Bluescreen tells you what to do:
What is Special Pool? • Special pool is a kernel buffer area where buffers are sandwiched with invalid pages • Conditions for a driver allocating from special pool: • Driver Verifier is verifying driver • Special pool is enabled • Allocation is slightly less than one page (4 KB on x86) • [Figure: page n+1, holding the buffer above a signature, sits between invalid pages n and n+2; higher addresses are at the top]
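The reason special pool catches overruns at the moment they happen is that the buffer ends at, or very near, the boundary of the next invalid page. The following toy model shows the arithmetic. It assumes overrun detection with the buffer placed at the high end of its page and an 8-byte start alignment; both the alignment and the helper names are assumptions of this sketch, not details from the slides.

```python
PAGE_SIZE = 4096  # x86 page size


def place_buffer(page_base, size, align=8):
    """Place a buffer as close to the end of its page as alignment allows.

    Returns (start, end), where end is one past the last buffer byte.
    """
    if not 0 < size <= PAGE_SIZE:
        raise ValueError("allocation must fit in one page")
    offset = (PAGE_SIZE - size) & ~(align - 1)  # round start down to alignment
    start = page_base + offset
    return start, start + size


def overrun_faults(page_base, size, bytes_past_end):
    """True if writing bytes_past_end bytes beyond the buffer hits the invalid page."""
    _start, end = place_buffer(page_base, size)
    last_written = end - 1 + bytes_past_end
    return last_written >= page_base + PAGE_SIZE


print(overrun_faults(0x10000, 96, 1))   # buffer ends exactly at the page boundary
print(overrun_faults(0x10000, 100, 1))  # alignment leaves a few slack bytes
```

In this model a size that is not a multiple of the alignment leaves slack bytes before the invalid page, so a very small overrun lands in the slack rather than faulting immediately.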
Turning on Special Pool • Enable Special Pool verification on the suspect driver
The Verifier Catching Buffer Overrun • The Driver Verifier catches the overrun when it occurs • The Bluescreen tells you whose fault it is • !analyze explains the crash and also tells you the buggy driver name • The stack shows where the driver bug is
Code Overwrites • Caused when a bug results in a wild pointer • A wild pointer that points at invalid memory is easily detected • A wild pointer that points at data is similar to a buffer overrun • Might not cause a problem for a long time • Crash makes it look like it’s something else’s fault • Driver Verifier doesn’t catch code overwrites • System code write protection catches code overwrites, but it’s not on if: • It’s a Windows 2000 system with > 127 MB memory • It’s a Windows XP or .NET Server system with > 255 MB • Something has disabled it
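The memory thresholds above can be written as a quick check. This sketch encodes only the two size limits from the slide; it cannot know whether something else has disabled protection, and the `write_protection_on_by_default` helper is hypothetical.

```python
# Largest memory size (in MB) at which protection is still on by default.
DEFAULT_LIMIT_MB = {
    "Windows 2000": 127,
    "Windows XP": 255,
    ".NET Server": 255,
}


def write_protection_on_by_default(os_name, memory_mb):
    """Apply the slide's thresholds; ignores explicit registry overrides."""
    return memory_mb <= DEFAULT_LIMIT_MB[os_name]


print(write_protection_on_by_default("Windows 2000", 128))
print(write_protection_on_by_default("Windows XP", 128))
```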
Causing a Code Overwrite • Run NotMyFault and select “Code Overwrite” • Overwrites the first bytes of nt!NtReadFile • The function is the most common entry to the I/O system, so a random thread will cause the crash • The crash hints that the fault occurred in NtReadFile • The last user-mode address is ZwReadFile • The ebx register in the exception frame points at NtReadFile • NtReadFile’s start location looks scrambled (u ntreadfile)
System Code Write Protection • Make sure system code write protection is on • Under HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management, set: • LargePageMinimum REG_DWORD 0xFFFFFFFF • EnforceWriteProtection REG_DWORD 1 • Reboot to take effect • Rerun NotMyFault • Crash occurs immediately and even the blue screen points at MyFault.sys • !analyze shows the address of the write and the target (NtReadFile)
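One way to apply the two values consistently across machines is to generate a .reg file and import it with the Registry Editor. The sketch below only renders the file text; the `reg_fragment` helper is hypothetical, and the header line assumes the version 5.00 .reg format used by Windows 2000 and later.

```python
MEMORY_MANAGEMENT = (
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control"
    r"\Session Manager\Memory Management"
)


def reg_fragment(key, dwords):
    """Render REG_DWORD values under one key as .reg file text."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{key}]"]
    for name, value in dwords.items():
        lines.append(f'"{name}"=dword:{value:08x}')
    return "\n".join(lines) + "\n"


print(reg_fragment(MEMORY_MANAGEMENT, {
    "LargePageMinimum": 0xFFFFFFFF,
    "EnforceWriteProtection": 1,
}))
```

A reboot is still required after importing the file.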
Hung Systems • You can tackle a hung system, but only if you’ve prepared: • Boot in debug mode, or • Set the keystroke-crash Registry value • For debug mode you need a second system (the debugger host) connected to the target via serial cable • Run Windbg/Kd on the host • Edit the target’s boot.ini file: • /debugport=comX /baudrate=XXX • When the system hangs, connect with the debugger and hit Ctrl-C
Hung Systems • To configure keystroke-crash: • Set HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\i8042prt\Parameters\CrashOnCtrlScrl to 1 • Enter right-ctrl+[scroll-lock, scroll-lock] to crash the system • Use !thread to see what’s running • Examine loaded drivers, IRQL, …
Getting Past a Crash • Last-Known Good • Boots with driver/kernel configuration last used during a successful boot • Safe Mode • Boots the system with core set of drivers and services • Network and non-network • Recovery Console • Manually disable offending service, replace corrupt images, update files • ERD Commander 2003 • Registry Editor, Explorer, Driver/Service Manager, password changer, Event Log viewer, Notepad
The Bluescreen Screen Saver • Scare your enemies and fool your friends with the Sysinternals Bluescreen Screen Saver • Be careful, your job may be on the line!
More Information • Inside Windows 2000, 3rd edition • Section on System Crashes in chapter 4 • Debugging Tools help file • Knowledge Base Articles • http://www.microsoft.com/whdc/ddk/debugging/DBG-KB.mspx • Usenet newsgroup microsoft.public.windbg for discussion of debugger issues • The debugger team wants your feedback and bug reports - mail suggestions or bug reports to windbgfb@microsoft.com
Community Resources • Community Resources http://www.microsoft.com/communities/default.mspx • Most Valuable Professional (MVP) http://www.mvp.support.microsoft.com/ • Newsgroups Converse online with Microsoft Newsgroups, including Worldwide http://www.microsoft.com/communities/newsgroups/default.mspx • User Groups Meet and learn with your peers http://www.microsoft.com/communities/usergroups/default.mspx