Chapter 4 Multiprocessors

Presentation Transcript

  1. Chapter 4 Multiprocessors Dr. Bernard Chen, Ph.D., University of Central Arkansas

  2. Outline • Trend for Multiprocessors • Multiprocessor Models • Cache Coherence

  3. Uniprocessor Performance • “Electronic circuits are ultimately limited in their speed of operation by the speed of light… and many of the circuits were already operating in the nanosecond range.” (W. Jack Bouknight, 1972)

  4. Uniprocessor Performance (figure from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006)

  5. Multiprocessor • “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.” Paul Otellini, President, Intel (2005) • All microprocessor companies switched to MP (2× CPUs every 2 years)

  6. Multiprocessor • The importance of multiprocessors grew throughout the 1990s as designers sought ways to build servers • The slowdown in uniprocessor performance arose from diminishing returns in exploiting ILP • Leading to an era in which multiprocessors play a major role

  7. Multiprocessor • Major reasons are: • Growth in data-intensive applications • Databases, file servers, … • Growing interest in servers and server performance • Increasing desktop performance is less important • Improved understanding of how to use multiprocessors effectively • Especially in servers, where there is significant natural TLP • Advantage of leveraging design investment by replication • Rather than a unique design

  8. Multiprocessor • However, we are left with two problems: • Multiprocessor architecture is a large and diverse field, and much of the field is in its youth • Broad coverage would necessarily entail discussing approaches that may not stand the test of time • Therefore, we focus on the mainstream of multiprocessor design: multiprocessors with a small to medium number of processors (4–32)

  9. Outline • Trend for Multiprocessors • Multiprocessor Models • Cache Coherence

  10. Flynn’s Taxonomy M. J. Flynn, “Very High-Speed Computing Systems,” Proc. of the IEEE, vol. 54, pp. 1901–1909, Dec. 1966.

  11. SIMD • SIMD → Data-Level Parallelism

  12. SIMD • Output (figure: sample program output shown on slide)

  13. SIMD MPI_Send • Some other MPI basic functions • int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm); • buf: [in] initial address of send buffer (choice) • count: [in] number of elements in send buffer (nonnegative integer) • datatype: [in] datatype of each send buffer element (handle) • dest: [in] rank of destination (integer) • tag: [in] message tag (integer) • comm: [in] communicator (handle) Code example: http://mpi.deino.net/mpi_functions/MPI_Send.html

  14. SIMD MPI_Recv • int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status); • buf: [out] initial address of receive buffer (choice) • count: [in] maximum number of elements in receive buffer (integer) • datatype: [in] datatype of each receive buffer element (handle) • source: [in] rank of source (integer) • tag: [in] message tag (integer) • comm: [in] communicator (handle) • status: [out] status object (Status) • A minimal send/receive sketch follows
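
To make the two signatures above concrete, here is a minimal sketch (an illustration, not from the slides) that sends one integer from rank 0 to rank 1. Compile with mpicc and run with at least two processes, e.g. mpirun -np 2 ./a.out:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* send one MPI_INT to destination rank 1 with tag 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* receive one MPI_INT from source rank 0 with tag 0 */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }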

  15. SIMD MPI_Reduce • int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm); Parameters: • sendbuf: [in] address of send buffer (choice) • recvbuf: [out] address of receive buffer (choice, significant only at root) • count: [in] number of elements in send buffer (integer) • datatype: [in] data type of elements of send buffer (handle) • op: [in] reduce operation (handle) • root: [in] rank of root process (integer) • comm: [in] communicator (handle) • A minimal reduction sketch follows
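
A similarly minimal sketch of MPI_Reduce (again an illustration, not from the slides): every process contributes one integer, and MPI_SUM combines them at root rank 0:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size, local, sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        local = rank + 1;  /* each process contributes rank + 1 */

        /* combine all local values with MPI_SUM; the result arrives at root 0 */
        MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d processes = %d\n", size, sum);

        MPI_Finalize();
        return 0;
    }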

  16. Flynn’s Taxonomy • MIMD → Each processor fetches its own instructions and operates on its own data • MIMD exploits thread-level parallelism • MIMD offers flexibility • MIMD can build on the cost-performance advantages of off-the-shelf microprocessors

  17. Back to Basics • “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” • Parallel Architecture = Computer Architecture + Communication Architecture • 2 classes of multiprocessors with respect to memory: • Centralized-Memory Multiprocessor • At most a few dozen processor chips; small enough to share a single, centralized memory • Physically Distributed-Memory Multiprocessor • Larger number of chips and cores • BW demands → memory distributed among processors

  18. Centralized Memory Multiprocessor • Large caches → a single memory can satisfy the memory demands of a small number of processors • Can scale to a few dozen processors by using a switch and many memory banks • Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases

  19. ENIAC (350 ops/s), 1946 (U.S. Army photo)

  20. Arkansas Star

  21. ASCI White (12.3 teraops/sec), IBM, June 28, 2000 • Mega = 10^6 (million) ≈ 2^20 • Giga = 10^9 (billion) ≈ 2^30 • Tera = 10^12 (trillion) ≈ 2^40 • Peta = 10^15 (quadrillion) ≈ 2^50 • Exa = 10^18 (quintillion) ≈ 2^60

  22. Physically Distributed-Memory Multiprocessor • Pro: reduces latency of local memory accesses • Con: communicating data between processors becomes more complex

  23. 2 Models for Communication and Memory Architecture • In the first kind, communication occurs through a shared address space • Centralized-memory multiprocessors use this type of communication and are called symmetric shared-memory multiprocessors

  24. 2 Models for Communication and Memory Architecture • Even physically separate memories can be addressed as one logically shared address space • Meaning that a memory reference can be made by any processor to any memory location (assuming it has the access rights) • These multiprocessors are called distributed shared-memory (DSM) multiprocessors

  25. 2 Models for Communication and Memory Architecture • Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either • symmetric shared memory (centralized-memory MP) • distributed shared memory (distributed-memory MP) • Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (distributed-memory MP) • A small sketch of the shared-address-space model follows
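
As a contrast to the explicit MPI_Send/MPI_Recv messages on slides 13–14, here is a minimal sketch (not from the slides) of the shared-address-space model: threads communicate through ordinary loads and stores to one memory. Compile with -pthread:

    #include <pthread.h>
    #include <stdio.h>

    static int shared_value;  /* one location in the shared address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *producer(void *arg) {
        pthread_mutex_lock(&lock);
        shared_value = 42;    /* communicate by an ordinary store */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);  /* the join orders the store before the load */
        pthread_mutex_lock(&lock);
        printf("consumer loaded %d\n", shared_value);  /* an ordinary load */
        pthread_mutex_unlock(&lock);
        return 0;
    }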

  26. Outline • Trend for Multiprocessors • Multiprocessor Models • Cache Coherence

  27. What is Multiprocessor Cache Coherence?

  28. What is Multiprocessor Cache Coherence • Informally, we could say that a memory system is coherent if any read of a data item returns the most recently written value of the data item • However, two formal definitions are required • Coherence: defines what values can be returned by a read • Consistency: determines when a written value will be returned by a read

  29. Defining Coherent Memory System • Preserve program order: a read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P • Coherent view of memory: a read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses • Write serialization: two writes to the same location by any two processors are seen in the same order by all processors • If not, a processor could keep the value 1 forever, because it saw it as the last write • For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1

  30. Defining Coherent Memory System (Translated) • A read that follows a write by the same processor returns the written value • A read that follows a write by another CPU returns the written value, given enough time • Two writes to the same location are seen in the same order by all processors • A runnable sketch of the write-serialization rule follows
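
The third rule can be watched in action with C11 atomics: even with relaxed ordering, two reads of the same location by one thread may never observe the writes 1 and 2 in the reverse order. This is a minimal sketch under assumed C11/POSIX tooling, not from the slides; compile with -pthread:

    #include <stdatomic.h>
    #include <pthread.h>
    #include <assert.h>
    #include <stdio.h>

    atomic_int x = 0;

    static void *writer(void *arg) {
        atomic_store_explicit(&x, 1, memory_order_relaxed);  /* first write  */
        atomic_store_explicit(&x, 2, memory_order_relaxed);  /* second write */
        return NULL;
    }

    static void *reader(void *arg) {
        for (int i = 0; i < 1000000; i++) {
            int a = atomic_load_explicit(&x, memory_order_relaxed);
            int b = atomic_load_explicit(&x, memory_order_relaxed);
            /* write serialization: if the first read saw 2, the second
               read may not go "back in time" and return 1 */
            assert(!(a == 2 && b == 1));
        }
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        printf("no coherence violation observed\n");
        return 0;
    }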

  31. 2 Classes of Cache Coherence Protocols • Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept • Directory based — Sharing status of a block of physical memory is kept in just one location, the directory

  32. Snooping • Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept • All caches are accessible via some broadcast medium (a bus or switch) • All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access

  33. Snooping (write back)

  34. Snooping • Write through: the information is written both to the block in the cache and to the block in the lower-level memory • Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced or needed • A toy sketch of the two policies follows
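
As a toy illustration of the two write policies (the data structures here are assumed for the sketch, not taken from the slides):

    #include <string.h>

    #define LINE_SIZE 64

    typedef struct {
        unsigned tag;
        int      dirty;            /* used only by the write-back policy */
        char     data[LINE_SIZE];
    } CacheLine;

    static char MAIN_MEMORY[1 << 20];  /* stand-in for lower-level memory */

    /* Write-through: update the cache line AND lower-level memory at once. */
    static void write_through(CacheLine *line, unsigned addr, char byte) {
        line->data[addr % LINE_SIZE] = byte;
        MAIN_MEMORY[addr] = byte;
    }

    /* Write-back: update only the cache line and mark it dirty. */
    static void write_back(CacheLine *line, unsigned addr, char byte) {
        line->data[addr % LINE_SIZE] = byte;
        line->dirty = 1;
    }

    /* Memory is updated later, when a dirty line is replaced. */
    static void evict(CacheLine *line, unsigned block_addr) {
        if (line->dirty) {
            memcpy(&MAIN_MEMORY[block_addr], line->data, LINE_SIZE);
            line->dirty = 0;
        }
    }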

  35. Snooping (write through)

  36. Snooping • The key to snooping is the bus, which is used to perform invalidations • To perform an invalidate, the processor simply acquires bus access and broadcasts the address to be invalidated on the bus • If two processors attempt to write to the same location, whoever gets the bus first wins • A state-machine sketch of write-invalidate snooping follows
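
The invalidation logic above can be captured in a small per-line state machine. This is a simplified MSI-style sketch; the slides do not name a specific protocol, so the states and events here are assumptions:

    /* States of one cache line under a write-invalidate snooping protocol. */
    typedef enum { INVALID, SHARED, MODIFIED } LineState;

    typedef enum { PR_READ, PR_WRITE,    /* requests from the local processor */
                   BUS_READ, BUS_WRITE } /* transactions snooped on the bus   */
        Event;

    /* Returns the line's next state; *bus_broadcast is set when this
       controller must itself acquire the bus and broadcast the address. */
    LineState next_state(LineState s, Event e, int *bus_broadcast) {
        *bus_broadcast = 0;
        switch (e) {
        case PR_READ:
            if (s == INVALID) { *bus_broadcast = 1; return SHARED; } /* miss */
            return s;
        case PR_WRITE:
            if (s != MODIFIED) *bus_broadcast = 1;  /* broadcast invalidate */
            return MODIFIED;
        case BUS_READ:
            /* another cache reads: a modified line is flushed and shared */
            return (s == MODIFIED) ? SHARED : s;
        case BUS_WRITE:
            /* another processor writes: our copy becomes stale */
            return INVALID;
        }
        return s;
    }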

  37. Snoopy Cache-Coherence Protocols • Cache Controller “snoops” all transactions on the shared medium (bus or switch)

  38. Example: Write-Through Invalidate (figure: processors P1, P2, P3, each with a cache ($), share a bus to memory and I/O devices; u:5 is read into two caches, then P3 writes u = 7, and the later reads u = ? must return 7) • The write by P3 must invalidate the stale cached copies before they are read again • Write update uses more broadcast-medium BW → all recent MPUs use write invalidate

  39. Locate up-to-date copy of data • Write-through: get the up-to-date copy from memory • Write-through is simpler if there is enough memory BW • Write-back is harder • The most recent copy can be in a cache

  40. Locate up-to-date copy of data • Can use the same snooping mechanism • Snoop every address placed on the bus • If a processor has a dirty copy of the requested cache block, it provides it in response to a read request and aborts the memory access • Complexity comes from retrieving the cache block from a processor cache, which can take longer than retrieving it from memory • Write-back needs lower memory bandwidth → supports larger numbers of faster processors → most multiprocessors use write-back

  41. A commercial workload • Server: AlphaServer 4100 • Each processor issues up to four instructions per clock cycle and runs at 300 MHz • Each processor has a three-level cache hierarchy: • L1 consists of a pair of 8KB direct-mapped on-chip caches (instruction and data) • L2 is a 96KB on-chip three-way set-associative cache • L3 is an off-chip direct-mapped 2MB cache • The latency of an L2 access is 7 cycles, an L3 access 21 cycles, and a main memory access 80 clock cycles • A worked AMAT sketch using these latencies follows
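
To see how these latencies combine, here is a sketch computing average memory access time (AMAT); the L1 hit time and all miss rates are hypothetical placeholders, not figures from the slide:

    #include <stdio.h>

    int main(void) {
        /* latencies from the slide (clock cycles) */
        double l2 = 7.0, l3 = 21.0, mem = 80.0;
        /* hypothetical local miss rates, NOT from the slide */
        double m1 = 0.10, m2 = 0.30, m3 = 0.20;

        /* AMAT = L1 hit + m1*(L2 + m2*(L3 + m3*mem)), assuming 1-cycle L1 hit */
        double amat = 1.0 + m1 * (l2 + m2 * (l3 + m3 * mem));
        printf("AMAT = %.2f cycles\n", amat);
        return 0;
    }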

  42. A commercial workload

  43. 2 Classes of Cache Coherence Protocols • Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept • Directory based — Sharing status of a block of physical memory is kept in just one location, the directory

  44. Directory-Based Cache Coherence Protocols • To implement the operations, a directory must track the state of each cache block: • Shared (S): one or more processors have the block cached, and the value is up-to-date • Uncached (U): no processor has a copy of the cache block • Modified/Exclusive (E): exactly one processor has a copy of the cache block; that processor is called the owner of the block

  45. Directory-based Protocol (figure: CPU 0, CPU 1, and CPU 2, each with a cache, a local memory, and a directory, connected by an interconnection network)

  46. Directory-based Protocol (figure: block X holds the value 7 in local memory; its directory entry is in state U with presence bit vector 0 0 0, i.e., no cache holds a copy)

  47. CPU 0 Reads X (figure: CPU 0's read of X misses in its cache; a read-miss message travels over the interconnection network to the directory, whose entry for X is still U / 0 0 0)

  48. CPU 0 Reads X (figure: the directory entry for X moves to state S with bit vector 1 0 0, and the value 7 is supplied to CPU 0's cache) • A code sketch of this transition follows
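
The U → S transition in slides 46–48 can be written out as a small directory sketch; the names and structure here are illustrative assumptions, not the slides' implementation:

    #include <stdio.h>

    #define NPROCS 3

    typedef enum { UNCACHED, SHARED_STATE, EXCLUSIVE } DirState;

    typedef struct {
        DirState state;
        int      presence[NPROCS];  /* bit vector: which caches hold the block */
    } DirEntry;

    /* Handle a read miss from cpu: record the sharer and supply the block. */
    static void read_miss(DirEntry *e, int cpu) {
        if (e->state == UNCACHED || e->state == SHARED_STATE) {
            e->state = SHARED_STATE;  /* U -> S (or stays S) */
            e->presence[cpu] = 1;     /* set the requester's presence bit */
            /* memory supplies the data to the requesting cache */
        }
        /* an EXCLUSIVE block would first be fetched back from its owner */
    }

    int main(void) {
        DirEntry x = { UNCACHED, {0, 0, 0} };  /* slide 46: X is U / 0 0 0 */
        read_miss(&x, 0);                      /* slides 47-48: CPU 0 reads X */
        printf("state=%s, bits=%d %d %d\n",
               x.state == SHARED_STATE ? "S" : "?",
               x.presence[0], x.presence[1], x.presence[2]);
        return 0;
    }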
