ESC 210: Solving Real Problems that Required a Consultant • Dave Stewart, PhD • Director of Research and Systems Integration, InHand Electronics • dstewart@inhand.com • www.inhand.com
Objective of this Class • Share some lessons learned • If you encounter a similar issue, the flags raised here may give you additional ideas of what to look for • View hard problems differently • If you use the same steps as a consultant, you can be your own consultant and save time and money
Overview • When is a Consultant Consulted? • Satellite Modem • USB Key Transfers Hang • Locomotive Braking System Lock-ups • Flash Corruption on Battlefield • Cryogenic Temperature Control • Degrading Legacy Software
When is a Consultant Consulted? • A problem has been around for weeks or months • Engineers familiar with the system have spent months trying to resolve it, without success • Issues are glitches • Problem shows up randomly, and is not easily repeatable • Traditional debug unsatisfactory • Problem goes away or functionality breaks when you add debug • Can’t identify who is responsible • Is this a hardware or a software issue? • Is this “our” problem or a “vendor” problem?
Challenges Faced by the Consultant • Expected to find root cause in days • Even though engineers familiar with system could not do it in weeks or longer • Traditional methods won’t find issue • If they did, problem would already have been found • Root cause is not known • Even if customer says it is software or confined to a particular module, those might just be observable effects, not root cause
Who is a Consultant? • An expert within the organization • Usually an “expensive” resource • Need to pull the person off a different project • An FAE or vendor expert • When using COTS hardware or software, it could be the organization who sold the product • An independent contractor or consultant • Can leverage skills and experience applied to many other jobs
Reality of Hard Problems • Most hard problems are fundamentally simple • A common or known issue • A “silly” bug in software • One hardware signal is faulty • Difficulty solving problem is for three reasons: • Trying to fix problem before understanding root cause • Failure to use theory to analyze the system • Using the wrong tools to collect clues • Anyone can become a consultant • If they use a systematic approach and the right tools
Consultant Triage: A Systematic Approach to Troubleshooting • Observe • Review information already available • Identify what information is missing • Hypothesize • Review the design for known common flaws • Verify any applicable errata • Consider what fundamental theories are likely in play • What tools can be used to prove or disprove a hypothesis? • Investigate • Use new additional techniques to increase quality and quantity of clues to identify root cause • Solutions • Usually, once root cause is known, several viable solutions follow rather quickly.
Satellite TV Modem: Key Observables • Picture would occasionally glitch when a button on the remote was pressed • Engineers identified it only happened when the guide was being downloaded at the same time • Implemented workaround until problem could be solved: • Ignore remote buttons when guide being downloaded • Leverages user’s default action when no response, which is to just press the same button again
Satellite Modem: Hypothesize • Multi-threaded system • Possible priority inversion or race condition • Real-time analysis never performed • Measurements of execution time not known • Possible transient overload not handled correctly • Need list of threads and execution time measurements • Customer was able to provide list of threads, but not execution time
Satellite Modem: Investigate • Need execution time for each thread • Instrumented code to allow measurements • Needed to restructure some tasks to follow a proper model • Can only measure execution time of threads that follow a definitive model (shown next slide) • Used logic analyzer to measure execution time • See this month’s issue of Embedded Systems Design • March 2012 issue available on show floor
Model of Real-Time Task • Each thread (Thread A in the figure) has a main loop that does the following: • Wait for Event • Read ITC Inputs/Events • Do Processing and Read/Write Devices • Write ITC Outputs • For periodic threads, the event is time-based. For other threads, the event could be an interrupt, message arrival, semaphore wakeup, or any other signal.
Model of Real-Time Task: Measuring Execution Time • Measuring execution time for the purpose of real-time analysis is always done at the same place • Start Thread Cycle: immediately after Wait for Event returns • End Thread Cycle: when the thread comes back around to Wait for Event • Frequency of the thread is represented by how often this point is reached.
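The loop model above can be sketched in a few lines. This is a minimal illustration, not the instrumented code from the project: `run_task`, the queue-based event source, and the software timestamps are all assumptions made for the example. On the real target the slides describe a logic analyzer, so the two timestamp lines would more likely be pin toggles.

```python
import queue
import time


def run_task(events, process, cycles):
    """Run `cycles` iterations of the wait/read/process/write loop.

    Returns (outputs, exec_times). Each entry of exec_times is the
    duration in seconds of one thread cycle, always measured between
    the same two points in the loop.
    """
    outputs, exec_times = [], []
    for _ in range(cycles):
        event = events.get()               # Wait for Event (blocks)
        start = time.perf_counter()        # Start Thread Cycle
        inputs = event                     # Read ITC inputs/events
        result = process(inputs)           # Do processing, read/write devices
        outputs.append(result)             # Write ITC outputs
        exec_times.append(time.perf_counter() - start)  # End Thread Cycle
    return outputs, exec_times


# Hypothetical usage: three queued events, trivial processing step.
events = queue.Queue()
for value in (1, 2, 3):
    events.put(value)
outputs, exec_times = run_task(events, lambda x: x * 2, cycles=3)
```

Because the start and end points never move, the worst value in `exec_times` can be fed directly into a schedulability analysis.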
Satellite Modem: Solutions • Used Rate Monotonic Analysis • Identified system overloaded when guide being downloaded • Also identified on average 20% idle time when guide not being downloaded • Culprit was one interrupt handler executing for 6 msec, and temporarily using 80% of the CPU. Engineer thought it was only a few hundred microseconds • Solution that Worked • Ensure all threads followed the model of a real-time thread • Split interrupt handler into ISR+IST • Defined IST as an Aperiodic Server • Scheduled system using Rate Monotonic
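A rough sketch of the utilization check behind a Rate Monotonic Analysis follows. It uses the classic Liu and Layland bound, U <= n(2^(1/n) - 1), which is a sufficient test only: passing guarantees schedulability, failing is conclusive only when total utilization exceeds 1. The thread list is hypothetical; the source gives only the 6 msec handler time and the 80% figure, so the 7.5 msec interval and the other three threads are assumed values chosen to match those numbers.

```python
def utilization(tasks):
    """Total CPU utilization for a list of (exec_time, period) pairs."""
    return sum(c / t for c, t in tasks)


def rm_bound(n):
    """Liu and Layland utilization bound for n rate-monotonic tasks."""
    return n * (2 ** (1 / n) - 1)


def rm_schedulable(tasks):
    """Sufficient (not necessary) rate-monotonic schedulability test."""
    return utilization(tasks) <= rm_bound(len(tasks))


# Hypothetical thread set, times in msec: (execution time, period).
others = [(2.0, 20.0), (5.0, 50.0), (10.0, 100.0)]

# What the engineer believed: handler runs a few hundred microseconds.
believed = others + [(0.3, 7.5)]

# What was measured: 6 msec, i.e. 80% of the CPU during guide download.
measured = others + [(6.0, 7.5)]
```

With the believed figure the set passes the bound comfortably; with the measured figure total utilization exceeds 100%, which no scheduler can fix without changing the design.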
Satellite Modem • Why was Consultant Needed? • Customer was not applying fundamental real-time systems theory • A simple Rate Monotonic Analysis of the problem showed an obvious root cause • Customer did not have the tools needed to measure execution time. Execution time is key input into the analysis.
USB Key Transfer Hang Can you spot the difference?
USB Key Transfer Hang: Key Observables • Two apparently identical USB keys connected to embedded system • One worked consistently 100% of the time • The other locked up 50% of the time during long transfers • Customer kept running tests from user-space • After the first few tests, each additional test only provided duplicate information; no new clues • The key had a custom mechanical construction • Using a different key was not an option • On desktop PC, both worked 100% of the time • Ran controlled tests with file sizes from 1KB to 1GB. Problems only started above 1MB on embedded system
USB Key Transfer Hang: Hypothesize • Hangs == deadlock • Anytime there is a hang, look for a deadlock • Looks are deceiving • Although the two keys “looked” the same, they might be different versions • Compare working (PC) to non-working (Embedded System) at the USB interface • Focus on large file sizes since small files did not fail
USB Key Transfer Hang: Investigate • Driver • Instrument USB driver at lowest level • Every time it sent a message, log event • Log every time a lock was obtained • Protocol • Use a USB analyzer to capture the transfer • Version • Analyzer allowed checking firmware version dates … the good one was Rev 8.02, the failing one 8.01 • This confirmed the keys were in fact not identical
USB Key Transfer Hang: Solutions • It works fine on PC • Since even the “bad” key worked fine on PC, reverse engineer what PC was doing compared to embedded system • USB Analyzer provided key clue • The PC broke large blocks into many smaller blocks • Problem was the USB key • Changing hardware is always much more expensive than changing software • Can a software workaround be used to avoid the issue with the key? • Changing the embedded driver to break larger transfers into smaller ones allowed the bad key to work consistently • Problem was not a deadlock.
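The driver workaround amounts to splitting one large transfer into bounded pieces. The sketch below shows only that splitting step; `split_transfer` is a hypothetical helper, and the chunk size is a free parameter because the source does not say what block size the PC used.

```python
def split_transfer(total_bytes, max_chunk):
    """Yield (offset, length) pairs that cover total_bytes.

    Every piece is at most max_chunk bytes; the last one may be shorter.
    """
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    offset = 0
    while offset < total_bytes:
        length = min(max_chunk, total_bytes - offset)
        yield offset, length
        offset += length


# Hypothetical usage: a 150-byte transfer with a 64-byte limit.
chunks = list(split_transfer(150, 64))
```

A driver would issue one request per `(offset, length)` pair instead of a single request for the whole buffer.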
USB Key Transfer Hang • Why was Consultant Needed? • Customer focus on creating more and more tests was not producing more clues • Customer was not using the right tool to debug • They didn’t have a USB analyzer because it was “expensive” • $2000 for a tool was much cheaper than losing a month or more of labor! • USB Analyzer was key tool that provided the clues to quickly zoom into root cause
Locomotive Anti-Lock Braking Hang: Key Observables • Randomly, the entire system would lock up • Manual override needed to be engaged • Debug showed threads were blocked • Post-mortem dump showed multiple threads all waiting for a message • Design was a message-passing system • It followed guidelines given in RTOS documentation
Locomotive Braking System Hang: Hypothesis • Most likely causes • Deadlock • Lost message • Blocking form of message passing was being used • This is known to be problematic in real-time systems • Potentially prone to deadlock
Locomotive Braking System Hang: Investigation • Does system have all four necessary conditions required for a deadlock to occur? • Mutual exclusion • Lock one resource while waiting for another (hold and wait) • Cannot preempt resource usage (no preemption) • Circular wait • Answer was yes! • No reason to try to pinpoint the sequence of events that leads to deadlock • If a deadlock is possible, change the design
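Checking the fourth condition can be done on paper or mechanically: draw which thread can block waiting on which other thread, then look for a cycle. The sketch below does a depth-first search over such a wait-for graph. The thread names are hypothetical; the source does not list the actual threads.

```python
def find_cycle(waits_for):
    """Return a list of nodes forming a circular wait, or None.

    waits_for maps each thread to the threads it may block on.
    """
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Reached a node already on the current path: cycle found.
            return visiting[visiting.index(node):]
        visiting.append(node)
        for nxt in waits_for.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Hypothetical design: each thread blocks on a reply from the next one.
design = {"sensor": ["control"], "control": ["brake"], "brake": ["sensor"]}
cycle = find_cycle(design)
```

A non-empty result says a circular wait is possible in the design, which is all the slide asks for; it does not say which run-time sequence triggers it.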
Locomotive Braking System Hang: Solution • Avoid deadlock by making one of the necessary conditions impossible: • Prevented “waiting” for another resource by changing the system to use non-blocking communication • Implemented this modified design • Never encountered subsequent deadlocks
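As an illustration of non-blocking communication, the sketch below polls a mailbox and falls back to the last known value when nothing has arrived, so the thread never waits on another thread. This is an assumption about what such a design could look like, using Python's standard `queue` module; the RTOS primitives actually used are not named in the source.

```python
import queue


def control_step(mailbox, last_command):
    """Return the newest command if one is waiting, else the last one.

    Uses get_nowait(), so the caller never blocks on the sender.
    """
    try:
        return mailbox.get_nowait()
    except queue.Empty:
        return last_command


# Hypothetical usage: no message yet, then one arrives.
mailbox = queue.Queue()
first = control_step(mailbox, last_command="hold")
mailbox.put("release")
second = control_step(mailbox, last_command=first)
```

Since no thread holds a resource while waiting for another, the hold-and-wait condition can no longer be met.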
Locomotive Braking System Hang • Why was a Consultant Needed? • Missing theoretical foundation to recognize that the design recommended by the RTOS vendor was flawed • This was a design flaw. The customer tried to fix the implementation by changing priorities and synchronization, but to no avail
Ruggedized PDA Flash Corruption: Key Observables • Some units that were fielded for a year or more started crashing on boot-up • Reformatting flash seemed to fix the problem, but only temporarily • No other indication of what problem was • “Damaged” units were sent back for analysis • Confirmed flash was corrupted, but no evidence of why
Ruggedized PDA Flash Corruption: Hypothesis • Unit encountering hard shut-offs • Verified file system was transactional • Possible failure of flash chip • Run extensive tests • Compare image of corrupted flash with good unit • Filesystem area expected to be different • Focus on read-only parts of memory, ensure no corruption there
Ruggedized PDA Flash Corruption: Investigation • Flash tests proved there were occasional bit errors • That is enough to point to the chip as culprit • But why? • Review of theory indicated flash rated for 100,000 erase/write cycles per block • It seems like a lot, but 100,000/365 is about 274, so roughly 274 cycles a day on the same block would use up the rating within a year
Ruggedized PDA Flash Corruption: Solutions • Enabled logging on a test unit to determine how flash was being used • Found that the registry was being written once per minute … or 1440 times per day • Although the file system had wear leveling, when it was mostly full, the number of blocks available for wear leveling was only a handful • This meant blocks were being erased/written about a couple of hundred times per day each • Wearing out of the flash is to be expected.
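The wear-out arithmetic above can be checked with a few lines. The rated cycles and write rate come from the slides; the block count of 5 is an assumed stand-in for "a handful", and the model assumes each registry write costs one erase/write cycle spread evenly over the blocks in rotation, which is a simplification.

```python
def days_to_wear_out(rated_cycles, writes_per_day, blocks_in_rotation):
    """Days until each block in the rotation reaches its rated cycles.

    Assumes writes are spread evenly across blocks_in_rotation and that
    every write consumes one erase/write cycle.
    """
    per_block_per_day = writes_per_day / blocks_in_rotation
    return rated_cycles / per_block_per_day


# Mostly-full file system: only a handful of blocks (assumed 5) rotate.
nearly_full = days_to_wear_out(100_000, 1440, 5)

# Same write rate with many free blocks (assumed 500) to rotate through.
mostly_empty = days_to_wear_out(100_000, 1440, 500)
```

Under these assumptions the nearly full case reaches the rated limit in a little under a year, which is consistent with units failing after a year or more in the field.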
Ruggedized PDA Flash Corruption: Solutions (cont’d) • Only fix identified was to replace flash • Workarounds to avoid bad blocks did not work, because blocks were scattered, and that only meant even fewer blocks available for wear leveling • For units that did not fail yet … • All units could eventually fail • Modified design to write logs by keeping file open then doing a flush, instead of open/write/close each time
Ruggedized PDA Flash Corruption • Why was a consultant needed? • Customer did not pay attention to theoretical limits of flash; it was a design oversight • Engineers working on project did not have a good set of flash tests that could catch the issue
Cryogenic Temperature Control: Key Observables • Temperature needed to be maintained at 4 K • Tolerance of +/- 10% (+/- 0.4 K) • But was fluctuating +/- 2 K (50% of the setpoint, five times the tolerance) • Engineers using room temperature to troubleshoot • Using heater and ice bucket to verify control algorithm
Cryogenic Temperature Control: Hypothesis • Temperature behavior at room temperature not same as near absolute zero • Control algorithm needed to be based on theory of temperature as it approaches zero • Heating/cooling cycles took tens of minutes • Very difficult to monitor behavior of temperature over extended period of time
Cryogenic Temperature Control: Investigation • Created an emulator • A PC external to controller took outputs of embedded system, and returned an analog signal to represent temperature • New understanding of temperature • To create emulator, needed to think like the canister, and understand laws of heating and cooling • Emulator allowed execution thousands of times faster than real-time • In 1 minute, had 60,000 data points • Previously, took 10 minutes to get 1 data point
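To give a feel for what such an emulator does, here is a deliberately simple plant model stepped in a loop with no real-time waiting. Everything in it is an assumption for illustration: a first-order heat balance with constant heat capacity, a proportional controller, and made-up constants. The real canister model had to reflect behavior near absolute zero, which this sketch does not attempt.

```python
def make_p_controller(setpoint, gain):
    """Hypothetical proportional controller; heater power cannot go negative."""
    def controller(temp):
        return max(0.0, gain * (setpoint - temp))
    return controller


def emulate(controller, t_start, t_bath, steps, dt=0.01,
            k=0.5, heat_capacity=2.0):
    """Step a first-order thermal model and return the temperature trace.

    Each step: ask the controller for heater power, apply the heat
    balance dT/dt = (heater - k * (T - t_bath)) / heat_capacity.
    """
    temps = []
    temp = t_start
    for _ in range(steps):
        heater = controller(temp)
        d_temp = (heater - k * (temp - t_bath)) / heat_capacity
        temp += d_temp * dt
        temps.append(temp)
    return temps


# 60,000 data points, matching the count quoted on the slide.
temps = emulate(make_p_controller(setpoint=4.0, gain=10.0),
                t_start=6.0, t_bath=3.0, steps=60_000)
```

Even this toy model exposes a design question: a purely proportional controller settles slightly below the setpoint, the kind of behavior that is tedious to observe when each physical cycle takes tens of minutes.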
Cryogenic Temperature Control: Solutions • Emulator created by different engineer than controller • Both emulator and controller need to agree in order to get correct results • When they don’t agree, both systems get debugged • Identified reduced number of significant digits near zero resulted in inaccurate calculations • Needed to perform computations as micro-degrees instead of milli-degrees
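The loss of significant digits can be reproduced with integer fixed-point arithmetic. The sketch assumes the controller stored temperature as an integer count of milli-degrees, which the source implies but does not state; the step size of 0.0004 degrees is a made-up value chosen to fall below one milli-degree.

```python
def accumulate(start, step, steps, units_per_degree):
    """Add `step` to `start` repeatedly using integer fixed-point math.

    Values are stored as whole counts of 1/units_per_degree, so any
    step smaller than half a count rounds to zero and is lost.
    """
    value = round(start * units_per_degree)
    increment = round(step * units_per_degree)
    for _ in range(steps):
        value += increment
    return value / units_per_degree


# 1000 corrections of 0.0004 degrees each, starting from 4 degrees.
in_milli = accumulate(4.0, 0.0004, 1000, units_per_degree=1_000)
in_micro = accumulate(4.0, 0.0004, 1000, units_per_degree=1_000_000)
```

In milli-degrees every correction rounds to zero and the value never moves; in micro-degrees the same corrections add up to 0.4 degrees, the full width of the allowed tolerance at 4 degrees.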
Cryogenic Temperature Control • Why was a consultant needed? • Customer not using correct tool, namely an emulator • Engineer failed to apply correct theory (heating/cooling laws) near absolute zero
Degrading Legacy Software: Key Observables • Customer reported, “Software is degrading” • Control software for locomotives that had been working for two decades started crashing • Watchdog timer causing board resets • Replacing main boards did not fix problem • Customer suspected possible issue with sensor data • Adding debug caused system to fail worse
Degrading Legacy Software: Hypothesis • Adding debug made system worse • System is overloaded • Debug breaks real-time performance • System only starting to fail now • Perhaps a different path of execution was increasing utilization • Verify error handling.
Degrading Legacy Software: Investigation • Used logic analyzer method to troubleshoot • Allowed collecting data in real-time where traditional debug methods caused system to fail totally • Monitored inputs and outputs of functions • Found data output from sensors had significant noise and occasional bad samples • Root cause: sensors are degrading • The sensors, not the software, had started to degrade • The software did not have appropriate error handling code, and faulty calculations were causing processor exceptions
Degrading Legacy Software: Solutions • Replacing sensors is a very costly solution • Changing a sensor fixed the issue. However, replacing all sensors on all locomotives was too costly • Software workaround: Add filters and error handling • Desire to add filtering to clean up the now-noisier data • Processor was already fully loaded; adding software filters caused overloads and system failure
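One plausible shape for the filtering and error handling mentioned above is a range check followed by a short median filter. This is an assumed design, not the one used on the locomotives: the limits, the window length, and the `clean_samples` helper are all hypothetical. As the slide notes, extra processing like this costs CPU time that the system did not yet have.

```python
from statistics import median


def clean_samples(samples, low, high, window=3):
    """Reject out-of-range samples, then median-filter the rest.

    Returns (filtered, rejected): the smoothed values and a count of
    discarded samples that could drive a sensor-fault notification.
    """
    valid = [s for s in samples if low <= s <= high]
    rejected = len(samples) - len(valid)
    filtered = [
        median(valid[max(0, i - window + 1): i + 1])
        for i in range(len(valid))
    ]
    return filtered, rejected


# Hypothetical noisy sensor trace with two obviously bad samples.
raw = [10.0, 10.2, 999.0, 9.9, 10.1, -50.0, 10.0]
filtered, rejected = clean_samples(raw, low=0.0, high=100.0)
```

Rejecting bad samples before any calculation is what keeps a faulty reading from turning into a processor exception downstream.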
Degrading Legacy Software: Solutions (cont’d) • Reduce processor load • Performed real-time analysis: near 0% idle time • Found a 5-msec control loop using 60% of processor bandwidth • Why 5-msec? Customer: “Because it works”
Degrading Legacy Software: Solutions (cont’d) • Review design decisions • How slow can control loop be run? • Edge of customer comfort level was at about 50 msec • Decided to run at 25 msec instead • Final solution • Revised real-time analysis, idle time up to 48% • Now using 3 msec every 25 msec, instead of 15 msec every 25 msec (3 msec every 5 msec) • Added filtering for sensors starting to fail • Added error notifications for sensors that needed replacing • Created enough ‘spare’ processing time to add traditional debug
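The numbers on this slide follow directly from utilization being execution time divided by period. A short check, using only the figures given above:

```python
def utilization(exec_ms, period_ms):
    """Fraction of the processor used by one periodic thread."""
    return exec_ms / period_ms


before = utilization(3, 5)     # 3 msec every 5 msec
after = utilization(3, 25)     # 3 msec every 25 msec
freed = before - after         # processor time returned to the system
```

The loop drops from 60% to 12% of the processor, and the 48 points it gives back match the reported idle time.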
Degrading Legacy Software • Why was Consultant Needed? • Wrong tool: customer did not have a good way to obtain debug in real-time. Using a logic analyzer resolved this • No real-time analysis: once a real-time analysis was done, it was obvious which thread needed to be optimized • Real-time systems theory: to reduce utilization, only two things can be done: 1) Reduce execution time 2) Increase period of execution • #2 is usually easier, so verify the possibility of that first
Summary • Most hard problems are fundamentally simple • A common or known issue • A “silly” bug in software • One hardware signal is faulty • Difficulty solving problem is for two reasons: • Trying to fix problem before understanding root cause • Using the wrong tools to collect clues • Anyone can become a consultant • If they use a systematic approach and the right tools