
Failure Rate in the Field

A Cost Analysis Framework for Multi-core Systems with Spares. Saeed Shamshiri, Peter Lisherness, Sung-Jui Pan, and Kwang-Ting (Tim) Cheng. Stronger clustering gives higher yield (clustering factor near 0; 4/6 good chips); weaker clustering gives lower yield (larger clustering factor; 2/6 good chips).


Presentation Transcript


A Cost Analysis Framework for Multi-core Systems with Spares
Saeed Shamshiri, Peter Lisherness, Sung-Jui Pan, and Kwang-Ting (Tim) Cheng
University of California, Santa Barbara
Resilient System Design, Task 1.2.3.4

[Figure: two example wafers. Stronger clustering, higher yield (clustering factor near 0): 4/6 good chips. Weaker clustering, lower yield (larger clustering factor): 2/6 good chips.]

[Panel titles: Objectives; Challenges; Core Yield Model]

“Could 10-20% yields for Cell processors lead to problems for Sony PS3?” *
• “With standard SiGe single-core processors, IBM can achieve yields of up to 95%. But with a chip like the Cell processor, you’re lucky to get 10 or 20 percent.”
• “If you really want to be focused on reliability and up-time availability, you can design one of these chips to self-detect. You can ship it with eight cores working, blow one of them, and from a user perspective you would have self-healed it in the field.”
• “With such systems in place, yields could conceivably increase in a best-case scenario to 40% -- still significantly lower than the 95% yields that IBM and others enjoyed during the single-core, ‘one-by-one’ era.”

• Understanding the impact of core yield, manufacturing and in-field test quality, and spare scheme on the system yield and cost.
• How many spare cores should be included?
• How many working spares in a shipped chip would be sufficient?
• What is the requirement for manufacturing and in-field test quality to achieve the required system reliability?
• Can we skip burn-in and repair infant mortality in the field?

• True yield of a core, yc, is a function of area, defect density, and clustering factor. [Equation not preserved in the transcript; surviving labels: defect density, area.]
• The clustering factor is the degree to which defects are clustered.
• Observed yield of a core, y’c, also depends on the manufacturing testing. [Equation not preserved in the transcript.]
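The core yield equation itself did not survive in the transcript. As a rough illustration of how a yield that depends on area, defect density, and a clustering factor behaves, the sketch below uses the widely used negative binomial yield model, y = (1 + A·D/α)^(−α). This choice is an assumption, not confirmed to be the authors' formula, and the names `core_yield`, `poisson_yield`, `area`, `defect_density`, and `alpha` are hypothetical.

```python
import math


def core_yield(area, defect_density, alpha):
    """Negative binomial yield model (assumed here, not taken from the poster).

    area           -- core area (any unit, as long as it matches defect_density)
    defect_density -- average number of defects per unit area
    alpha          -- clustering factor; smaller alpha means stronger clustering
    """
    if area < 0 or defect_density < 0 or alpha <= 0:
        raise ValueError("area and defect_density must be >= 0, alpha must be > 0")
    return (1.0 + area * defect_density / alpha) ** (-alpha)


def poisson_yield(area, defect_density):
    """Limit of the model above as alpha grows large (no clustering)."""
    return math.exp(-area * defect_density)


if __name__ == "__main__":
    area, defect_density = 1.0, 1.0  # illustrative values only
    for alpha in (0.5, 2.0, 10.0, 1000.0):
        y = core_yield(area, defect_density, alpha)
        print(f"alpha={alpha:>7}: yield={y:.4f}")
    print(f"Poisson limit: yield={poisson_yield(area, defect_density):.4f}")
```

Under this assumed model, the yield falls monotonically toward the Poisson limit as `alpha` grows, which matches the poster's statement that stronger clustering gives higher yield.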
[Equation labels: clustering factor; defect coverage of the manufacturing testing.]

[Figures: IBM Cell processor (8 SPEs) (ISSCC05); Intel 80-tile network on chip (ISSCC07); Sun Niagara (8 SPARC cores) (IEEE Micro 2005).]

* Electronic News 7/7/06 and TGDaily 7/14/06, interview of Tom Reeves, VP of semiconductor and technology services at IBM
* de Sousa and Agrawal, DATE 2000
* Kuo and Kim, Proc. of IEEE 1999

[Panel titles: Failure Rate in the Field; System Yield; Chip Cost Model]

• Total cost
• Manufacturing cost
• Service cost
• Weibull distribution model for a core’s lifecycle *

[Equations not preserved in the transcript. Surviving labels, in transcript order:]
• Observed system yield
• The probability that a core is defect-free given that it passes the test
• The probability that all m active cores are indeed defect-free after they all passed the test
• True system yield
• Shape parameter
• Warranty period
• Manufacturing cost of a core
• Observed system yield
• Scale parameter: the time at which 63.2% of units will fail
• The probability of a core not failing within the infant mortality period
• The probability of a shipped chip not failing within the warranty period
• Service cost per failed chip
• The probability that a core that passed the test is indeed defect-free and does not fail within the infant mortality period

* Carulli and Anderson, IEEE Design & Test of Computers, March/April 2006

[Panel titles: Summary & Future Work; Burn-in Elimination; Manufacturing vs. In-field Testing]

Summary:
• An analytical model for the cost of a spare-enhanced multi-core system, with or without burn-in.
• Reducing the overall cost by adding a few spare cores.
• Investigating the removal of the burn-in process.
• Illustrating the importance of in-field test quality vs. manufacturing test quality.

Future Work:
• Considering the communication infrastructure of an NoC-based SoC in the analysis:
  • Routers
  • Network interfaces
  • Links
• Generalizing from homogeneity to heterogeneity:
  • Memory blocks, FPGA blocks, IP cores, DSPs, and others.
• Applying the analysis to some available NoC-based SoCs:
  • Intel’s 80-core processor
  • MIT’s Raw processor
  • University of Texas at Austin’s TRIPS processor

Manufacturing vs. In-field Testing:
• In a spare-enhanced chip, the in-field test quality is much more important than the manufacturing test quality.
• For example, with four spares, the chip cost with θ=60% is 64.36, which is 23% greater than that of the same chip with θ=70%, and is 4.7 times that of the case in which θ=100% (θ is the in-field defect coverage).

Burn-in Elimination:
• With s>=3 and Yc>90%, it is better not to do burn-in.
• With s>=2 and Yc>98%, it is better not to do burn-in.
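The Weibull lifecycle model named in the Failure Rate in the Field panel can be illustrated with the standard two-parameter Weibull survival function, R(t) = exp(−(t/η)^β), where β is the shape parameter and η the scale parameter. The poster's own equations are not preserved in the transcript, so the following is only a sketch under that assumption. All function names are hypothetical, and `all_cores_survive` deliberately ignores spares and test escapes, which the poster's model does account for.

```python
import math


def weibull_survival(t, shape, scale):
    """Probability that a unit has not failed by time t (standard Weibull form)."""
    if t < 0 or shape <= 0 or scale <= 0:
        raise ValueError("t must be >= 0, shape and scale must be > 0")
    return math.exp(-((t / scale) ** shape))


def weibull_failed_fraction(t, shape, scale):
    """Fraction of units expected to have failed by time t."""
    return 1.0 - weibull_survival(t, shape, scale)


def all_cores_survive(t, shape, scale, m):
    """Probability that m independent cores all survive to time t.

    Simplifying assumption for illustration: independent, identical cores
    and no spares to take over for a failed core.
    """
    return weibull_survival(t, shape, scale) ** m


if __name__ == "__main__":
    scale = 10.0  # illustrative value, arbitrary time unit
    for shape in (0.5, 1.0, 2.0):
        failed = weibull_failed_fraction(scale, shape, scale)
        print(f"shape={shape}: failed fraction at t=scale is {failed:.3f}")
    print(f"8 cores surviving to t=1: {all_cores_survive(1.0, 0.5, scale, 8):.3f}")
```

Whatever the shape parameter, the failed fraction at t equal to the scale parameter is 1 − e^(−1), about 63.2%, which is the property quoted in the poster. A shape parameter below 1 gives a failure rate that decreases over time, the usual way to represent infant mortality.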
