
Hyper-Threading Technology Architecture and Microarchitecture






Presentation Transcript


  1. Hyper-Threading Technology Architecture and Microarchitecture
  Deborah T. Marr, Desktop Products Group, Intel Corp.
  Frank Binns, Desktop Products Group, Intel Corp.
  David L. Hill, Desktop Products Group, Intel Corp.
  Glenn Hinton, Desktop Products Group, Intel Corp.
  David A. Koufaty, Desktop Products Group, Intel Corp.
  J. Alan Miller, Desktop Products Group, Intel Corp.
  Michael Upton, CPU Architecture, Desktop Products Group, Intel Corp.

  2. Introduction
  • The amazing growth of the Internet has made people demand ever-higher processor performance.
  • To keep up with this demand, we cannot rely entirely on traditional approaches to processor design:
  • Super-pipelining
  • Branch prediction
  • Super-scalar execution
  • Out-of-order execution
  • Caches
  • These techniques have made microprocessors more complex, with more transistors and higher power consumption.
  • Processor architects are looking for ways to improve performance at a greater rate than transistor counts and power consumption grow.
  • Hyper-Threading Technology is one solution.

  3. Thread-Level Parallelism
  • Server applications consist of multiple threads or processes that can be executed in parallel.
  • On-line transaction processing and Web services have an abundance of software threads that can be executed simultaneously for faster performance.
  • Even desktop applications are becoming increasingly parallel.
  • We need to apply thread-level parallelism (TLP) to gain a better performance vs. transistor count and power ratio.
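To make the notion of thread-level parallelism concrete, here is a minimal C++ sketch of the workload shape the slide describes: independent requests served by concurrently running threads that the OS can spread across processors. The request/handler names are illustrative, not from the paper.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

void handle_request(int request_id) {
    // Each request is independent of the others, so the OS is free to
    // schedule these threads onto different (logical) processors.
    std::printf("request %d handled\n", request_id);
}

int main() {
    std::vector<std::thread> workers;
    for (int id = 0; id < 4; ++id)          // four concurrent software threads
        workers.emplace_back(handle_request, id);
    for (auto& w : workers)
        w.join();
}
```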

  4. Chip Multiprocessing
  • Two processors on a single die.
  • Each has a full set of execution and architectural resources.
  • They may or may not share a large on-chip cache.
  • However, a CMP chip is significantly larger than a single-core chip and therefore more expensive to manufacture.

  5. Time-Slice Multithreading
  • A single processor executes multiple threads by switching between them after a fixed time period.
  • This can result in wasted execution slots but can minimize the effects of long latencies to memory.
  • Switch-on-event multithreading instead switches threads on long-latency events such as cache misses.
  • This can work well for server applications with large numbers of cache misses, where one thread can make progress while the other waits on memory.
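The difference between the two policies can be made concrete with a toy single-issue model. This is only a sketch with made-up parameters (work, quantum, and miss numbers are illustrative, not from the paper): the time-slice machine keeps a stalled thread until its quantum expires, wasting those slots, while the switch-on-event machine yields the pipeline as soon as a miss is detected and so finishes the same work in fewer cycles.

```cpp
#include <cstdio>

enum class Policy { TimeSlice, SwitchOnEvent };

// Each thread needs WORK useful cycles and stalls for MISS_LATENCY cycles
// every MISS_EVERY useful cycles ("cache misses"). All numbers illustrative.
int run(Policy p) {
    const int WORK = 16, QUANTUM = 4, MISS_EVERY = 5, MISS_LATENCY = 8;
    int remaining[2]     = {WORK, WORK};  // useful cycles still needed
    int blocked_until[2] = {0, 0};        // cycle at which a miss resolves
    int cur = 0, cycle = 0, in_slice = 0;
    while (remaining[0] > 0 || remaining[1] > 0) {
        ++cycle; ++in_slice;
        bool stalled = cycle < blocked_until[cur] || remaining[cur] == 0;
        if (!stalled && --remaining[cur] > 0 && remaining[cur] % MISS_EVERY == 0)
            blocked_until[cur] = cycle + MISS_LATENCY;  // long-latency miss
        // Time-slice switches only when the quantum expires; switch-on-event
        // hands the pipeline over as soon as the running thread stalls.
        bool do_switch = (p == Policy::TimeSlice) ? in_slice >= QUANTUM : stalled;
        if (do_switch || remaining[cur] == 0) { cur ^= 1; in_slice = 0; }
    }
    return cycle;
}

int main() {
    std::printf("time-slice:      %d cycles\n", run(Policy::TimeSlice));
    std::printf("switch-on-event: %d cycles\n", run(Policy::SwitchOnEvent));
}
```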

  6. Simultaneous Multithreading
  • A single physical processor appears as multiple logical processors.
  • There is one copy of the architecture state for each logical processor, and the logical processors share a single set of physical execution resources.
  • Software perspective: operating systems and user programs can schedule processes or threads to logical processors as they would on conventional physical processors in a multiprocessor system.
  • Microarchitecture perspective: instructions from logical processors persist and execute simultaneously on shared execution resources.
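The software perspective can be demonstrated directly: the OS and applications simply see more logical processors. A minimal sketch for GCC/Clang on x86 follows; CPUID leaf 1 reports the Hyper-Threading Technology flag in EDX bit 28, and std::thread::hardware_concurrency() gives the portable count of logical processors the OS can schedule onto.

```cpp
#include <cpuid.h>
#include <cstdio>
#include <thread>

int main() {
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        std::printf("HTT flag (CPUID.1:EDX[28]): %s\n",
                    (edx & (1u << 28)) ? "set" : "clear");
    // Portable view: how many logical processors software can schedule onto.
    std::printf("logical processors visible to software: %u\n",
                std::thread::hardware_concurrency());
}
```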

  7. Simultaneous Multithreading
  • Added less than 5% to the relative chip size but can provide performance benefits much greater than that.
  • Architecture state: general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers.
  • The number of transistors needed to store the architecture state is small.
  • Logical processors share nearly all other resources, such as caches, execution units, branch predictors, control logic, and buses.
  • Each logical processor has its own interrupt controller, or APIC. Interrupts sent to a specific logical processor are handled only by that logical processor.
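An illustrative way to see why the duplicated state is cheap is to write it down as a data structure. The field names and sizes below are mine, not Intel's; the point is that the replicated architecture state is tiny compared with the shared machinery.

```cpp
#include <cstdint>

struct ArchState {                     // duplicated once per logical processor
    uint32_t gpr[8];                   // IA-32 general-purpose registers
    uint32_t control_regs[5];          // CR0..CR4
    uint32_t apic_regs[64];            // local APIC registers (size illustrative)
    uint64_t machine_state_regs[16];   // some machine state registers (illustrative)
};

struct PhysicalProcessor {
    ArchState logical[2];              // the small duplicated part
    // Caches, execution units, branch predictors, control logic, and buses
    // are shared between the logical processors, which is why the total
    // die-size cost stays under ~5%.
};
```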

  8. Trace Cache
  • As shown in Figure 5a, instructions generally come from the Execution Trace Cache (TC); as shown in Figure 5b, only when there is a TC miss does the machine fetch and decode instructions from the L2 cache. The Microcode ROM sits near the TC.
  • Two sets of next-instruction pointers independently track the progress of the two software threads executing. The two logical processors arbitrate access to the TC every clock cycle.
  • If one logical processor is stalled or unable to use the TC, the other logical processor can use the full bandwidth.
  • The TC entries are tagged with thread information.
  • The shared nature of the TC allows one logical processor to have more entries than the other if needed.
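A minimal sketch of the per-cycle arbitration described above: two next-instruction pointers, round-robin access when both threads want the TC, and full bandwidth to one thread when the other is stalled. The structure and stall flags are illustrative, not Intel's implementation.

```cpp
#include <cstdio>

struct TraceCachePort {
    unsigned next_ip[2]  = {0x1000, 0x2000}; // one fetch pointer per thread
    bool     stalled[2]  = {false, false};
    int      last_served = 1;

    // Returns the logical processor that fetches this cycle, or -1 if none can.
    int arbitrate() {
        int first = last_served ^ 1;         // alternate every clock cycle
        for (int i = 0; i < 2; ++i) {
            int lp = first ^ i;
            if (!stalled[lp]) { last_served = lp; return lp; }
        }
        return -1;                           // both threads stalled this cycle
    }
};

int main() {
    TraceCachePort tc;
    tc.stalled[1] = true;                    // LP1 stalled: LP0 gets full bandwidth
    for (int cycle = 0; cycle < 4; ++cycle)
        std::printf("cycle %d: LP%d fetches\n", cycle, tc.arbitrate());
}
```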

  9. L1 Data Cache, L2 Cache, L3 Cache
  • The L1 data cache is a write-through cache, meaning that writes are always copied to the L2 cache.
  • Because logical processors can share data in the cache, there is the potential for cache conflicts, which can result in lower observed performance.
  • However, there is also the possibility that one logical processor may prefetch instructions or data, needed by the other, into the cache; this is common in server application code.
  • In a producer-consumer usage model, one logical processor may produce data that the other logical processor wants to use.
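A minimal sketch of the producer-consumer pattern mentioned above: when the two threads run on logical processors of one physical package, the data the producer writes can be picked up by the consumer out of the shared cache rather than from memory. The flag-based handshake is a standard idiom, not specific to the paper.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int>  data{0};
std::atomic<bool> ready{false};

int main() {
    std::thread producer([] {
        data.store(42, std::memory_order_relaxed);   // lands in the shared cache
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([] {
        while (!ready.load(std::memory_order_acquire)) { } // spin until produced
        std::printf("consumed %d\n", data.load(std::memory_order_relaxed));
    });
    producer.join();
    consumer.join();
}
```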

  10. Branch Prediction
  • The branch prediction structures are either duplicated or shared.
  • The branch history buffer used to look up the global history array is tracked independently for each logical processor.
  • However, the large global history array is a shared structure, with entries tagged with a logical processor ID.
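The tagging scheme can be sketched with a gshare-style predictor (my naming and sizes, not Intel's): each logical processor keeps its own branch history, the large array is shared, and an entry is only trusted if its logical-processor tag matches the requester.

```cpp
#include <cstdint>
#include <cstdio>

struct GlobalHistoryEntry {
    uint8_t  lp_id : 1;  // tag: which logical processor trained this entry
    uint8_t  taken : 1;  // 1-bit prediction, for illustration only
    uint16_t tag;        // partial branch-address tag
};

struct SharedPredictor {
    GlobalHistoryEntry entries[4096] {};  // shared between both threads
    uint16_t history[2] = {0, 0};         // branch history, tracked per LP

    bool predict(int lp, uint32_t branch_addr) const {
        unsigned idx = (branch_addr ^ history[lp]) % 4096; // gshare-style index
        const GlobalHistoryEntry& e = entries[idx];
        // Only trust an entry owned by the same logical processor.
        return e.lp_id == lp && e.tag == (uint16_t)(branch_addr >> 4) && e.taken;
    }
};

int main() {
    SharedPredictor p{};
    std::printf("prediction: %s\n",
                p.predict(/*lp=*/0, 0x401000) ? "taken" : "not taken");
}
```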

  11. SINGLE-TASK AND MULTI-TASK MODES
  • To optimize performance when there is only one software thread to execute, there are two modes of operation, referred to as single-task (ST) and multi-task (MT).
  • The IA-32 Intel Architecture has an instruction called HALT that stops processor execution and normally allows the processor to go into a lower-power mode.
  • HALT is a privileged instruction, meaning that only the operating system or other ring-0 processes may execute it. User-level applications cannot execute HALT.
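Because HALT is ring-0 only, it cannot be demonstrated from user space. The user-level counterpart Intel documents for Hyper-Threading-friendly waiting is the PAUSE instruction: a sketch of a spin-wait loop that briefly releases shared execution resources to the other logical processor instead of hammering them.

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause
#include <thread>

void spin_wait(const std::atomic<bool>& flag) {
    while (!flag.load(std::memory_order_acquire))
        _mm_pause();     // hint: de-prioritize this logical processor briefly
}

int main() {
    std::atomic<bool> done{false};
    std::thread t([&] { done.store(true, std::memory_order_release); });
    spin_wait(done);     // waits politely on an HT processor
    t.join();
}
```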

  12. Performance
  • Online transaction processing performance:
  • 21% performance increase on single- and dual-processor systems.
  • 65% performance increase on 4-way server platforms.

  13. Performance
  • Performance when executing server-centric benchmarks.
  • In these cases the performance benefit ranged from 16% to 28%.

  14. CONCLUSION
  • A new technique for obtaining additional performance at lower transistor and power cost.
  • The logical processors have their own independent architecture state, but they share nearly all of the physical execution and hardware resources of the processor.
  • The design had to ensure forward progress on each logical processor, even if the other is stalled, and to deliver full performance even when there is only one active logical processor.
  • These goals were achieved through efficient logical processor selection algorithms and the creative partitioning and recombining of many key resources.
  • Performance gains of up to 30% on common server application benchmarks.
  • The potential of Hyper-Threading Technology is tremendous.

  15. The End

  16. OUT-OF-ORDER EXECUTION ENGINE
  • The out-of-order execution engine consists of the allocation, register renaming, scheduling, and execution functions, as shown in Figure 6.
  • This part of the machine re-orders instructions and executes them as quickly as their inputs are ready. Specifically, each logical processor can use up to a maximum of 63 re-order buffer entries, 24 load buffer entries, and 12 store buffer entries.
  • If there are uops for both logical processors in the uop queue, the allocator will alternate selecting uops from the logical processors every clock cycle to assign resources.
  • If a logical processor has used its limit of a needed resource, such as store buffer entries, the allocator will signal "stall" for that logical processor and continue to assign resources for the other logical processor.
  • In addition, if the uop queue only contains uops for one logical processor, the allocator will try to assign resources for that logical processor every cycle to optimize allocation bandwidth, though the resource limits would still be enforced.
  • By limiting the maximum resource usage of key buffers, the machine helps enforce fairness and prevents deadlocks.
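A minimal sketch of the allocation policy described above: alternate between logical processors every cycle, stall a logical processor that has exhausted a buffer, and give a lone logical processor every cycle. The limits match the slide (63 ROB / 24 load / 12 store entries per LP); everything else is illustrative, including the simplification that any exhausted buffer stalls that LP entirely.

```cpp
#include <cstdio>

struct LpUsage { int rob = 0, load = 0, store = 0; };
const int ROB_MAX = 63, LOAD_MAX = 24, STORE_MAX = 12;

struct Allocator {
    LpUsage use[2];
    int last = 1;

    // Simplification: treat any exhausted buffer as a stall for this LP.
    bool has_room(int lp) const {
        return use[lp].rob < ROB_MAX && use[lp].load < LOAD_MAX &&
               use[lp].store < STORE_MAX;
    }
    // uops_pending[lp]: does the uop queue hold uops for this LP this cycle?
    int pick(bool uops_pending[2]) {
        int first = last ^ 1;                      // alternate every clock cycle
        for (int i = 0; i < 2; ++i) {
            int lp = first ^ i;
            if (uops_pending[lp] && has_room(lp)) { last = lp; return lp; }
        }
        return -1;                                 // both stalled or queue empty
    }
};

int main() {
    Allocator a;
    bool pending[2] = {true, true};
    a.use[1].store = STORE_MAX;                    // LP1 is out of store buffers
    for (int cycle = 0; cycle < 3; ++cycle)        // so LP0 allocates every cycle
        std::printf("cycle %d: allocate for LP%d\n", cycle, a.pick(pending));
}
```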

  17. Instruction Scheduling
  • The schedulers are at the heart of the out-of-order execution engine.
  • Five uop schedulers are used to schedule different types of uops for the various execution units.
  • Collectively, they can dispatch up to six uops each clock cycle.
  • The schedulers determine when uops are ready to execute based on the readiness of their dependent input register operands and the availability of the execution unit resources.
  • The memory instruction queue and general instruction queues send uops to the five scheduler queues as fast as they can, alternating between uops for the two logical processors every clock cycle, as needed.
  • Each scheduler has its own scheduler queue of eight to twelve entries from which it selects uops to send to the execution units.
  • The schedulers choose uops regardless of whether they belong to one logical processor or the other; they are effectively oblivious to logical processor distinctions.
  • The uops are simply evaluated based on dependent inputs and availability of execution resources. For example, the schedulers could dispatch two uops from one logical processor and two uops from the other logical processor in the same clock cycle.
  • To avoid deadlock and ensure fairness, there is a limit on the number of active entries that a logical processor can have in each scheduler's queue. This limit depends on the size of the scheduler queue.
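A minimal sketch of one scheduler queue behaving as described above: uops are dispatched purely on readiness, oblivious to which logical processor owns them, but each logical processor may occupy only a bounded number of the queue's entries. The capacity matches the 8-12 entry range from the slide; the per-LP limit and readiness flags are illustrative.

```cpp
#include <cstdio>
#include <vector>

struct Uop { int lp; bool ready; };

struct SchedulerQueue {
    static const int CAPACITY = 12;   // scheduler queues hold 8 to 12 entries
    static const int LP_LIMIT = 8;    // per-LP cap: fairness, no deadlock
    std::vector<Uop> q;
    int active[2] = {0, 0};

    bool insert(const Uop& u) {       // refuse entries beyond the per-LP limit
        if ((int)q.size() >= CAPACITY || active[u.lp] >= LP_LIMIT) return false;
        q.push_back(u); ++active[u.lp];
        return true;
    }
    // Dispatch up to `width` ready uops this cycle, regardless of owner.
    int dispatch(int width) {
        int sent = 0;
        for (size_t i = 0; i < q.size() && sent < width; )
            if (q[i].ready) { --active[q[i].lp]; q.erase(q.begin() + i); ++sent; }
            else ++i;
        return sent;
    }
};

int main() {
    SchedulerQueue s;
    s.insert({0, true}); s.insert({0, false});
    s.insert({1, true}); s.insert({1, true});
    // Dispatches one uop from each LP in the same cycle, chosen by readiness.
    std::printf("dispatched %d uops this cycle\n", s.dispatch(2));
}
```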
