1.02k likes | 1.27k Vues
Customer Stories From the Front Line. Bob Ward Principal Architect Escalation Engineer Microsoft. Adam Saxton Sr. Escalation Engineer Microsoft. The day…. Why are you here today?. Plenty of demos. We may go deep in some places. What kind of stories will I tell ?. VLF Fragmentation.
 
                
                E N D
Customer Stories From the Front Line Bob Ward Principal Architect Escalation Engineer Microsoft Adam Saxton Sr. Escalation Engineer Microsoft
Why are you here today? Plenty of demos We may go deep in some places
What kind of stories will I tell? VLF Fragmentation Memory Leak Always On Data Surgery Memory Scribbler Connectivity Scheduler Non-Yield
Tim Chapman PFE – East Region VLF Fragmentation Strikes Again Database Mirroring principal behind syncing to secondary on SQL 2008 SP2 2 problems that appear to be unrelated Non-yielding error in the ERRORLOG with stack dump 2012-08-01 17:37:00.500 Server Using 'dbghelp.dll' version '4.0.5' 2012-08-01 17:37:00.530 Server ***Unable to get thread context for spid 0 2012-08-01 17:37:00.530 Server * ******************************************************************************* 2012-08-01 17:37:00.530 Server * 2012-08-01 17:37:00.530 Server * BEGIN STACK DUMP: 2012-08-01 17:37:00.530 Server * 08/01/12 17:37:00 spid 10728 2012-08-01 17:37:00.530 Server * 2012-08-01 17:37:00.530 Server * Non-yielding Scheduler 2012-08-01 17:37:00.530 Server * 2012-08-01 17:37:00.530 Server * ******************************************************************************* 2012-08-01 17:37:00.530 Server Stack Signature for the dump is 0x000000000000028D 2012-08-01 17:37:05.750 Server External dump process return code 0x20000001. External dump process returned no errors. Means “no session” but is “background”
VLF Fragmentation Inside VLF Fragmentation This looks fairly harmless Symptoms are typically really long recovery or restore even if there is nothing to recover What if I have to scan this list? VLF<n> VLF 900,000 VLF1 VLF2
VLF Fragmentation Back to our Story
VLF Fragmentation The Solution BACKUP_IO wait_type
VLF Fragmentation Prevention
Bob Ward Texas It looks like tempdb to me Customer benchmark cannot scale on SQL Server 2008R2 RTM Looks like a tempdb latch allocation contention problem Larger number of users run into massive blocking
Yielding problem Bob asked to investigate the case • Every scheduler has someone RUNNING a SELECT • “Ready to run” but someone else running on scheduler
Yielding problem So if it is not tempdb, what is it?
Yielding problem Now we find the problem
Yielding problem The problem seen visually Plan to find A bad hash chain Bucket 1 Entry 100000 Entry 10000 Entry1 Entry2 …………….. …………….. Bucket 2 Entry 100000 Entry 10000 Entry1 Entry2 …………….. …………….. Bucket 3 Entry 100000 Entry 10000 Entry1 Entry2 …………….. …………….. A better hash chain Plan found here Bucket 1 Entry1 Entry2 …..….. Bucket 100000 Entry1 Entry2
Yielding problem The Solution
Adam Saxton Texas Cannot Generate SSPI Context Users cannot connect to SQL Server with the error below Doesn’t matter who tries to connect or which application they use Started happening after the weekend
The Misplaced SPN What is an “SSPI Context?” sys.dm_exec_connections “double-hop” scenarios require delegration
The Misplaced SPN Service Principal Name (SPN) “Built-in” account = computer account
The Misplaced SPN SQL Service SPN Svc account Must have AD perms SQL Server startup = register SPN SQL Server shutdown = delete SPN Default Instance TCP MSSQLSvc/passsql.pass.local:1433 MSSQLSvc/passsql.pass.local:56772 Named Instance / : class host port Default Instance Named Pipes MSSQLSvc/passsql.pass.local MSSQLSvc/passsql.pass.local:myinstance Named Instance Blog: What SPN do I use and how does it get there?
The Misplaced SPN It should always work but….. Not to the current “Log on As” account
The Misplaced SPN How can this happen? “Fallback” to NTLM
The Misplaced SPN Back to our Case….
The Misplaced SPN System Center Advisor http://www.systemcenteradvisor.com You may experience connectivity issues to SQL Server if SPNs are misconfigured http://support.microsoft.com/kb/2443457
Adam Saxton Texas Connection Timeout Login Timeout Expired error message at client. May or may not be specific to Windows Authentication Can occur intermittently and/or for periods of time Major Cause is Network issues
Connection Timeout Cause: AD Communication Issues • Slow response from Active Directory • Could be network, could be AD issues, could be a DC problem • SQL Authentication logins work • dm_os_wait_stats / PREEMPTIVE_OS_LOOKUPACCOUNTSID TDS Login Login Timeout
Connection Timeout Cause: Worker Pool Starvation DO NOT assume you need more worker threads • No available workers to service connection task • dm_os_wait_stats / THREADPOOL • Dedicated Admin Connection to see this live • Look for other waits and often a long blocking chain Applies to any task TDS Login Login Timeout
Demo Connection Timeout
Tejas Shah India Error 18056 Periodically not able to connect Lasts for 5-10 minutes and “resolves itself” without having to restart SQL ERRORLOG has multiple of these below spid52 Error: 18056, Severity: 20, State: 29. spid52 The client was unable to reuse a session with SPID 52, which had been reset for connection pooling. The failure ID is 29. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message.
Error 18056 What is a “redo” login? No Connection Pooling CONNECT SQL Server DISCONNECT SPID <n> Connection Pooling CONNECT SQL Server DISCONNECT CONNECT (nothing to server) Connection Pool SPID <n> BATCH/RPC TDS header query reset bit redo
Error 18056 18056 vs 18456 Msg 18456 database in connection string cannot be accessed 2013-04-26 16:32:39.63 spid56 Error: 18056, Severity: 20, State: 46. 2013-04-26 16:32:39.63 spid56 The client was unable to reuse a session with SPID 56, which had been reset for connection pooling. The failure ID is 46. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message. login 2013-04-26 16:32:39.66 Logon Error: 18456, Severity: 14, State: 38. 2013-04-26 16:32:39.66 Logon Login failed for user <account>. Reason: Failed to open the explicitly specified database 'mydb'. [CLIENT: <local machine>] Redo login
Error 18056 This blog post has the states for both 18056 and 18456 18056 States LOGON TRIGGER State = 1 (Msg 17892 for login)
Error 18056 What about state 29? dm_os_ring_buffers of type RING_BUFFER_EXCEPTION has details
Error 18056 The problem seen visually Connection Pooling CONNECT SQL Server DISCONNECT CONNECT (nothing to server) SPID <n> Connection Pool BATCH/RPC TDS header query reset bit redo Attention If Server receives ATTN during “redo”, 18056 state 29 in ERRORLOG Query timeout or explicit cancel
Suresh Kandoth Texas Memory Scribbler Different Customers calling over a long period of time with different errors All systems running same hardware [brand and model] Page Header Corruption with same signature for all cases • Walk through the timeline • Understanding some errors (832, 5242/5243, Bugcheck, etc) • How do we recover? • Understanding the resolution
Memory Scribbler November & December 2011 Perhaps not so random? • Error 832 reported by constant page sniffer • First 4 bytes of page header damaged with random hex 2011-12-19 09:46:09.60 spid5s Error: 832, Severity: 24, State: 1. 2011-12-19 09:46:09.60 spid5s A page that should have been constant has changed (expected checksum: 2247c294, actual checksum: ba52fd57, database 5, file 'Data.mdf', page (1:248037)). This usually indicates a memory failure or other hardware or OS corruption. 00000003`437a0000 2b 31 87 7f 00 82 01 00-51 8e 12 00 01 00 1c 00 00000003`437a0010 53 8e 12 00 01 00 17 00-05 a0 d2 0a f6 00 dc 1e 00000003`437a0020 52 8e 12 00 01 00 00 00-fe 71 02 00 6e 33 00 00 00000003`437a0030 4d 00 00 00 00 00 00 00-00 00 00 00 78 65 23
Memory Scribbler Checksum Expected Checksum 0x200 = HAS_CHECKSUM Actual Checksum Page Header Page 1:248037 m_tornBits=0x2247c294 m_flagBits=0x200 0xba52fd57 1 100 2 200 111 96 2011-12-19 09:46:09.60 spid5s Error: 832, Severity: 24, State: 1. 2011-12-19 09:46:09.60 spid5s A page that should have been constant has changed (expected checksum: 2247c294, actual checksum: ba52fd57, database 5, file 'Data.mdf', page (1:248037)). This usually indicates a memory failure or other hardware or OS corruption.
Memory Scribbler The Page 2b 31 7f 6e 00 13 4f 00 82 01 00 87 51 8e 12 00 01 00 1c 00 53 8e 12 00 01 01 17 00 05 a0 d2 0a f6 00 dc 1e 52 8e 12 00 01 00 00 00 Thread, another process, driver, hw