  1. UNIT - 2 TIME, AREA AND INSTRUCTION SETS

  2. Time, Area and Instruction Sets
The trade-off between cost and performance is fundamental to any system design; the long-term cost-performance ratio is the basis for most design decisions. Three main conceptual areas affect architectural choices and design decisions:
• Time: Cycle time is one important parameter for optimizing program execution. Instructions are partitioned into cycles, and the basic pipelining mechanism is used to speed up instruction execution.
• Area: The cost of the area occupied by a particular feature is an important aspect of the architectural trade-off.
• Instruction set: Simpler instruction sets occupying less implementation area provide a better cost-performance ratio, unless a complex feature included in the instruction set can deliver performance gains commensurate with its increased implementation cost.

  3. Time
Instruction execution is partitioned into manageable, fixed-time cycles. Designers aim to reduce both the cycle count and the cycle time to speed up performance. The partitioning process has its own overheads, and very short cycle times tend to be dominated by this overhead, while longer cycle times make pipelining less productive. The challenge is to find the optimum, minimum achievable cycle time; the goal is to minimize overall instruction execution time.

  4. The Nature of a Cycle
A cycle, defined as the time between state transitions, is the basic quantum of time for processing or communicating information within a machine. Cycle time is determined as the worst-case (maximum) time needed to accomplish a meaningful event across the most frequently used data paths; less frequently used data paths that need more time are given multiple cycles. A new cycle begins with the instruction being decoded by the decoder. The decoder produces a number of control points that are applied to data-path output register gates and combinational logic throughout the system. This allows data from source registers to propagate through the designated combinational logic to the in-gating of destination registers. After a suitable set-up time, the clocking system provides a clock pulse that samples the data into the registers. The time between corresponding clock pulses is the cycle time. The clock may be delayed from its expected arrival time due to propagation or loading effects; this is termed (uncontrolled) clock skew.

  5. The Registers and Clocking
[Figure: single-rank and dual-rank register structures, showing Data In, Data Out and Clock connections]
Registers take two basic forms, single-rank registers and dual-rank registers, and they use two distinct clock-triggering mechanisms: level triggered and edge triggered.
• A single-rank register transmits the input data to the output as soon as it has been triggered.
• A dual-rank register has two latches: an input latch and an output latch.

  6. The Registers and Clocking
• An alternative clocking technique is to use a multi-phase clock, in which separate clock lines enable successive logic stages.
• Most modern machines, both mainframes and microprocessors, use multi-phase level-sensitive clocking. When the clock is initially enabled, the input data is stored in the input latch; when the clock is disabled, the output of the first latch is transferred to the output latch.
• A single-rank register has lower cost, whereas a dual-rank register has a clocking advantage: it avoids a logic hazard when P_min < t_w (the clock pulse width).

  7. The Clocking Overhead
Let
P_max = maximum delay in the logic, without clock overhead
P_min = minimum delay in the shortest logic path
t_w = clock pulse width
t_s = register data set-up time
t_d = register output delay
The clocking overhead C is the additional delay added to the worst-case logic path by the registers and the clock, and T is the cycle time, so T = P_max + C. For a conventionally clocked (level-sensitive) system, the overhead includes the clock pulse width as well as the register set-up time and output delay, C ≈ t_w + t_s + t_d. For a dual-rank edge-triggered clock, the pulse width does not appear in the critical path, so the overhead reduces to C ≈ t_s + t_d.
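As a rough numeric illustration of these relations, the Python sketch below plugs in assumed timing values (not taken from the slides) for P_max, t_w, t_s and t_d:

```python
# Cycle time under two clocking schemes; all values are assumed, in ns.
P_MAX = 10.0   # maximum (worst-case) logic delay
T_W   = 1.0    # clock pulse width
T_S   = 0.5    # register data set-up time
T_D   = 0.5    # register output delay

# Conventionally clocked (level-sensitive): pulse width adds to the overhead.
C_conventional = T_W + T_S + T_D
print("conventional:   C =", C_conventional, "ns, T =", P_MAX + C_conventional, "ns")

# Dual-rank edge-triggered: only set-up time and register delay add to the overhead.
C_edge_triggered = T_S + T_D
print("edge-triggered: C =", C_edge_triggered, "ns, T =", P_MAX + C_edge_triggered, "ns")
```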

  8. The Pipelined Processors
Optimizing the partitioning of instructions into cycles is one way to speed up program execution. Another approach is to exploit concurrency of instruction execution. In pipelined machines, as soon as one fetched instruction enters its decode cycle, the next instruction is fetched, ready to be decoded as soon as the previous instruction completes decode and enters its next cycle. Thus in every cycle one instruction is being fetched, one decoded, one having its data fetched and one executed. Suppose we have a simple instruction execution process of four cycles:
IF – Instruction fetch from cache into the IR
D – Decode the instruction
DF – Data fetch from either memory or the register set
EX – Execute the instruction
Then in a pipelined machine the instructions overlap as follows (a four-stage pipelined processor):
Instr 1: IF  D   DF  EX
Instr 2:     IF  D   DF  EX
Instr 3:         IF  D   DF  EX
Instr 4:             IF  D   DF  EX
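The overlap pattern above can be generated mechanically; the short Python sketch below (purely illustrative, not part of the original slides) prints the cycle-by-cycle occupancy of an ideal four-stage pipeline:

```python
# Print the cycle-by-cycle occupancy of an ideal four-stage pipeline (IF, D, DF, EX).
STAGES = ["IF", "D", "DF", "EX"]
N_INSTRUCTIONS = 4

total_cycles = N_INSTRUCTIONS + len(STAGES) - 1
for i in range(N_INSTRUCTIONS):
    # Instruction i occupies stage s during cycle i + s.
    row = ["--"] * total_cycles
    for s, stage in enumerate(STAGES):
        row[i + s] = stage
    print(f"Instr {i + 1}: " + " ".join(f"{cell:>2}" for cell in row))
```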

  9. The Pipelined Processor
Pipelined instruction execution can significantly speed up program execution. If the pipeline has n stages or segments, the maximum possible speed-up is n times, in a well-mapped machine. An idealized pipelined processor can process instructions at the rate of one instruction per cycle, but interruptions due to incorrectly guessed or unexpected branches reduce the actual performance (by invalidating instructions preparing to enter, or already in, the pipeline). A basic optimization issue is the partitioning of the pipeline into concurrently operating segments: the more segments there are, the higher the maximum speed-up, but each new segment carries clocking overhead with it, which adversely affects performance. Pipelined processors are complex and costly, yet such processors are still cost-effective overall; almost all recently introduced microprocessors are pipelined.

  10. Optimum Pipelining
[Figure: an instruction of total delay T partitioned into S segments of delay T/S each, with clock overhead C and skew KT/S added to every segment]
Assume:
• The total time to execute an instruction (without pipeline segmentation) is T ns.
• This is segmented into S segments (to allow clocking and pipelining).
• The ideal delay through a segment (time to complete one segment) is T/S = T_seg.
• Associated with each segment are two types of partitioning overhead: a fixed clock overhead C (due to data set-up and hold times, etc.), and clock skew, which stretches the segment delay by a factor of (1 + K).
Therefore the actual cycle time = T/S + KT/S + C = (1 + K)T_seg + C. If quantization effects are ignored, it is possible to determine an optimal cycle time, and hence the number of segments, for a simple pipelined processor.

  11. Optimum Pipelining
If there are no interruptions to the pipeline, the processor delivers a performance of one instruction per cycle, but interruptions occur due to unexpected branches. Suppose interruptions occur with frequency b and have the effect of invalidating the (S − 1) instructions already in the pipeline. The performance of the processor is then
Performance = 1 / (1 + (S − 1)b) instructions per cycle.
The throughput (number of instructions per second) is
G = Performance / Cycle time = 1 / [(1 + (S − 1)b) × ((1 + K)T_seg + C)].
Setting dG/dS = 0 gives the optimum number of pipeline segments:
S_opt = √[ (1 − b)(1 + K)T / (bC) ].
S_opt is only a design starting point and is usually an upper limit on the possible number of segments. Once an initial S has been determined, we can calculate the total instruction latency (time to execute one instruction):
T_instr = T + S × (clocking overhead) = T + S(KT/S + C) = S × cycle time.
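These formulas can be checked numerically; the Python sketch below uses the example values from the next slide (T = 94 ns, C = 2 ns, K = 0, b = 0.2), with helper names chosen only for illustration:

```python
import math

# Example values (taken from the worked example on the next slide):
T = 94.0   # total instruction logic delay in ns (12 + 24 + 15 + 24 + 19)
C = 2.0    # fixed clock overhead per segment, ns
K = 0.0    # clock skew factor
b = 0.2    # branch (pipeline-break) frequency

def cycle_time(S):
    """Actual cycle time of one segment: (1 + K) * T/S + C, in ns."""
    return (1 + K) * T / S + C

def throughput_mips(S):
    """G = 1 / [(1 + (S - 1) * b) * cycle_time]; the factor 1000 converts 1/ns to MIPS."""
    return 1000.0 / ((1 + (S - 1) * b) * cycle_time(S))

S_opt = math.sqrt((1 - b) * (1 + K) * T / (b * C))
print(f"S_opt = {S_opt:.1f}")                          # about 13.7
print(f"G at S = 9: {throughput_mips(9):.1f} MIPS")    # idealized, ignoring quantization
```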

  12. Optimum Pipelining
• S_opt as determined above is simplistic, since we cannot divide functional units in an arbitrary manner.
• Each time a new pipeline segment is introduced, additional cost is added, because each segment requires additional registers and clocking hardware.
• S_opt thus serves as a possible upper limit on the number of useful pipeline segments; the designer has to experiment with lower numbers of segments and compare the throughput G (in MIPS) to minimize the quantization effect and arrive at the best possible value of S.
• Cycle quantization: time lost in fitting functional-unit delays into fixed cycle times is called cycle quantization, and it adversely affects performance.
• Quantization overhead is the difference between the total instruction execution time and the sum of T (the total delay of the instruction stages) and the clocking overhead (SC).
• To reduce quantization overhead we can increase S, but that increases the clocking overhead; a delicate balance needs to be maintained.

  13. Cycle Partitioning
An example: suppose an instruction execution requires a sequence of five events with the respective delays given below.
A 12 ns
B 24 ns
C 15 ns
D 24 ns
E 19 ns
Assume a fixed clocking overhead C of 2 ns, ignore clock skew (K = 0), and assume a branch frequency b = 0.2. We can then tabulate the performance G for the different possible cycle times. Calculating S_opt gives a value of 13.7, but any cycle time of less than 14 ns (minimum event delay + clocking overhead) would incur excessive quantization overhead. If we start with 14 ns (12 + 2) as the cycle time, event A takes one cycle and all the others require two cycles, giving a total of 9 cycles (1 + 2 + 2 + 2 + 2).

  14. Cycle Partitioning
Similarly, we can calculate the performance for the other candidate cycle times, 17 ns (15 + 2), 21 ns (19 + 2) and 26 ns (24 + 2), and tabulate the results.
[Table: performance G versus cycle time, not reproduced in the transcript]
From this chart we observe that the 14 ns cycle time, which gives 9 cycles (pipeline segments), is preferred, as it gives the best performance G of 27.5 MIPS.
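The chart itself did not survive in the transcript, but it can be recomputed from the delays and formulas already given; the Python sketch below reproduces the quoted 27.5 MIPS for the 14 ns cycle and fills in the other rows on the same assumptions:

```python
import math

delays = {"A": 12, "B": 24, "C": 15, "D": 24, "E": 19}   # event delays, ns
C, b = 2.0, 0.2                                           # clock overhead (ns), branch frequency

def performance(cycle_time_ns):
    """Cycles needed per event at this cycle time, and the resulting throughput G in MIPS."""
    usable = cycle_time_ns - C                                 # logic time available per cycle
    S = sum(math.ceil(d / usable) for d in delays.values())    # total cycles = pipeline segments
    G = 1000.0 / ((1 + (S - 1) * b) * cycle_time_ns)           # MIPS, since cycle time is in ns
    return S, G

for ct in (14, 17, 21, 26):
    S, G = performance(ct)
    print(f"cycle time {ct} ns: S = {S}, G = {G:.1f} MIPS")    # 14 ns gives S = 9, ~27.5 MIPS
```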

  15. Cost - Area
A good design aims at achieving an optimum cost-performance trade-off at a particular target performance. Other issues that directly influence the program cost, and hence the product cost, include:
• Compatibility: older instruction and data types are retained to preserve the portability of programs to the newer processor.
• Applicability to the marketplace: the design must provide what users are looking for.
• Upgradeability: a product may be extended later as needs arise.
• Design time: a more complex design lengthens the design cycle, increasing the fixed design cost and delaying access to a particular market.
• Reliability: affects both market acceptability and the reduction of service costs.
There are two types of cost. The first is fixed cost, which includes the engineering cost of both hardware and software, charges for CAD equipment, charges for VLSI mask design, manufacturing tooling cost, etc., plus general management and overhead costs.

  16. Cost - Area
The other type of cost is the marginal cost of manufacture, which is the cost of manufacturing each additional unit after the product has been developed.
AREA
All modern designs are based upon silicon technology in VLSI form. There are two primary determinants of cost:
• pins (and package), and
• silicon area occupied.
Processor area:
• Most processors are implemented on one (or a few) chips, each housing a die of about 10-15 mm on a side.
• These dies are produced from a larger wafer, perhaps 15-21 cm in diameter.
• Defects occur randomly over the wafer surface.
• As processor designs become more complex and new features are added, larger die sizes are required to implement the additional features.

  17. Processor Area (Contd.)
• A large chip area requires the absence of defects over that whole area, which reduces yield (the number of good dies obtained from a silicon wafer).
• A good design is not necessarily one that maximizes yield: reducing the area below a certain critical amount has only a marginal effect on yield (which depends on the percentage of area affected by defects), and small designs waste area because a separation is required between adjacent dies on the wafer.
• Designs are limited by the resolution of the photolithography, i.e. the ability to define distinct geometries on the wafer surface. This gives rise to the concept of a minimum feature size, which we may regard as the minimum transistor length.
• With advances in the VLSI design process and improved photolithography, the minimum feature size keeps shrinking, making it possible to implement more transistors (hence more features) in a given die area.
• Improvements in manufacturing technology allow larger dies to be realized with higher yields.

  18. Processor Area (Contd.)
Both photolithographic technology and process technology are constantly improving. By anticipating these moves in technology, a designer can include more complex and useful features to gain an edge over competitors. The initial yields may be low, but since the usefulness of the design lasts longer than that of simpler designs, and improved technology is likely to raise the yields over time, the complex design may prove more economic overall.
Suppose a die with a square aspect ratio has area A. If we have a wafer of diameter d, then the number of dies from this wafer is
N = (area of wafer) / (area of die) = π(d − √A)² / (4A),
where the (d − √A) term allows for the dies lost at the wafer edge. Suppose there are N_G good chips and N_D point defects on the wafer. Even if N_D > N, we can expect several good chips, since defects are randomly distributed and several defects may cluster on the same defective chips. Assuming a Poisson distribution of defects, the yield is
Yield = N_G / N = e^(−N_D / N).

  19. Processor Area (Contd.)
Now N_D is the number of wafer defects. If ρ_D is the defect density per unit area, then
N_D = ρ_D × (wafer area) = ρ_D (πd²/4).
For large wafers, (d − √A)² ≈ d², so
N_D / N = ρ_D (πd²/4N) ≈ ρ_D A,    using N ≈ πd²/(4A),
and therefore Yield = e^(−ρ_D A).
Die cost: die cost is affected by area in two ways. One, as the area increases, the number of dies produced from a fixed-size wafer decreases. Two, as the area increases, the yield (the percentage of good chips) decreases. Roughly, doubling the area of a die may increase the die cost by about 10 times.

  20. Processor Area (Contd.)
Example: assume a wafer has a diameter of 21 cm and costs $5000, and that the defect density ρ_D is 1 defect/cm². Compute the cost per die if the die area is 1 cm².
The total number of dies realized from the 21 cm wafer is N = π(d − √A)²/(4A); for d = 21 and A = 1, N = 314.
The yield = N_G/N = e^(−ρ_D A) = 0.37 (for ρ_D = 1 and A = 1).
So the number of good dies realized is N_G = 0.37 × 314 = 116, and the cost per die = 5000/116 = $43.
It can be seen that if we roughly double the area, to say 2.3 cm², the number of dies realized decreases to 130 and the yield falls to 0.1, giving only 13 good dies and escalating the cost per die to $385 (nearly 10 times).
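The slide's arithmetic follows directly from the formulas above, as the short Python sketch below shows (the function name is illustrative only):

```python
import math

def die_stats(wafer_diam_cm, wafer_cost, die_area_cm2, defect_density):
    """Dies per wafer, yield and cost per good die, using the formulas on the slides."""
    d, A = wafer_diam_cm, die_area_cm2
    n_dies = math.pi * (d - math.sqrt(A)) ** 2 / (4 * A)   # N = pi * (d - sqrt(A))^2 / (4A)
    wafer_yield = math.exp(-defect_density * A)            # yield = e^(-rho_D * A)
    cost_per_good_die = wafer_cost / (n_dies * wafer_yield)
    return n_dies, wafer_yield, cost_per_good_die

for area in (1.0, 2.3):
    n, y, cost = die_stats(21, 5000, area, 1.0)
    print(f"A = {area} cm^2: N = {n:.0f}, yield = {y:.2f}, cost per good die = ${cost:.0f}")
```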

  21. Processor Area (Contd.)
Because of constantly shifting geometries and the resulting changes in minimum feature size, we use a dimensionless unit for area called the rbe (register bit equivalent). One rbe equals the area of a bit storage cell. Since there is no universal bit storage cell, the rbe refers to a six-transistor static cell with high bandwidth; other storage cells are expressed in terms of the rbe. For example, a static RAM bit equals 0.6 rbe and a DRAM bit equals 0.1 rbe.
Relationship of the rbe to feature size and resolution: if λ is the minimum resolution, it has been empirically determined that 1 rbe = 2700 λ². Since f = 2λ, we have λ² = f²/4, so 1 rbe = 2700 λ² = 675 f².
If the minimum feature size is one micron (f = 1 μm), we define a unit of area A = f² × 10⁶. So an area of 1 A (= 1 mm²) = 10⁶/675 = 1481 rbe.
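The same conversion can be written as a tiny Python helper (the function name is an illustrative choice, not from the slides):

```python
# The rbe area model from this slide: 1 rbe = 2700 * lambda^2 = 675 * f^2.
RBE_IN_LAMBDA_SQUARED = 2700.0

def rbe_per_mm2(feature_size_um):
    """How many rbe fit in 1 mm^2 for a given minimum feature size f (in microns)."""
    lam = feature_size_um / 2.0                       # lambda = f / 2
    rbe_area_um2 = RBE_IN_LAMBDA_SQUARED * lam ** 2   # = 675 * f^2, in square microns
    return 1e6 / rbe_area_um2                         # 1 mm^2 = 10^6 square microns

print(f"f = 1.0 um: {rbe_per_mm2(1.0):.0f} rbe per mm^2")   # about 1481, as on the slide
print(f"f = 0.5 um: {rbe_per_mm2(0.5):.0f} rbe per mm^2")   # smaller feature -> ~4x more rbe
```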

  22. The Economics of a Processor Project
Any project begins with a project plan.
• Technical specifications: a complete set of functional specifications that are to be met by the proposed system.
• Market analysis: analysis of users' present and future needs, the solutions available or expected shortly, and the performance and prices of other suppliers' products.

  23. The Economics of a Processor Project
Assumptions:
• Average selling price: $10,000 per unit
• Sales volume: 10,000 units (over a 6-7 year period)
• Cost of manufacture: $1,000 (material + labour)
• Profit ratio: 10:1
Even this generous profit-margin ratio may not be sufficient to make a profit, considering the number of units sold and the size of the fixed-cost investment in product development.

  24. The Economics of a Processor Project
The project can be broken up into time phases.
Phase 1: Development
• Development effort: 12-30 months
• Initial development cost: $18 million
• Fixed cost (estimate): $33 million
[Figure: breakdown of development and fixed costs - H/W design, S/W generation, mechanical design, CAD, outside support (VLSI mode), prototype parts, marketing cost, manufacturing start-up and inventory, technical support and equipment, general and administrative costs]

  25. The Economics of a Processor Project
Phase 2: Early Manufacturing
Early manufacturing costs are high compared with the ultimate manufacturing cost, due to:
• the high cost of assembly and testing,
• problems with vendors and building a parts inventory,
• manufacturing/purchasing issues, and
• the manufacturing learning curve.
Phase 3: Production
Manufacturing hits its stride and the unit cost approaches the ultimate manufacturing cost. To this must be added the cost of other assemblies, the cost of delivering the product, its warranty support costs, etc.

  26. The Economics of a Processor Project
Phase 3: Production (Contd.)
• Software and hardware development efforts continue.
• Design flaws reported from the field have to be attended to.
• To extend the life of the product, additional functionality, features and performance enhancements must be made available.
Phase 4: End of Product Cycle
The product cycle ends when products superseding the capacity/functionality of this product are introduced in the market. The project will make a profit only if the product life cycle is extended and a higher number of units can be sold during that life cycle.
