Cell/B.E.

Cell/B.E. Jiří Dokulil

Introduction • Cell Broadband Engine • developed by Sony, Toshiba and IBM • 64-bit PowerPC • PowerPC Processor Element (PPE) • runs OS • SIMD • Synergistic Processor Element (SPE) • 8x • computations, no OS • big endian

Architecture

Memory access • PPE • load & store • cache • SPE • DMA • up to 16 concurrent per SPE • no direct access to memory • no need for out-of-order processing, no speculation • local storage • no cache

PPE • PowerPC Processor Element • PPU (PowerPC Processor Unit) • PPSS (PowerPC Processor Storage Subsystem) • 64-bit, dual-thread PowerPC Architecture RISC core • 2x32KB L1 (instructions and data) • 512KB L2 (unified) • PowerPC instruction set • vector/SIMD extensions – different from SPE • 32x 128bit vector registers

SPE • Synergistic Processor Element • SPU (Synergistic Processor Unit) • MFC (Memory Flow Controller) • RISC, SIMD • Synergistic Processor Unit Instruction Set Architecture • support for DMA and interprocessor messaging • 256KB LS • 128x128bit register file • DMA access to main memory • segment and page tables of PPE • channels • in MFC • unidirectional message-passing interfaces • memory-mapped I/O (MMIO) registers and queues

EIB • Element Interconnect Bus • four 16-byte-wide data rings • transfers 128 bytes at a time (one PPE cache line) • internal bandwidth 96 bytes per clock cycle • latency depends on the number of hops • bus is a ring • runs at half the frequency of the SPU
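
The two EIB figures above (96 bytes per bus cycle, bus clock at half the SPU clock) combine into a peak internal bandwidth. The following is a minimal arithmetic sketch; the function names are hypothetical, and the SPU clock passed in is whatever the caller assumes (the slides do not state a clock frequency).

```c
#include <stdint.h>

/* Bytes the EIB can move per bus clock cycle (figure from the slide). */
#define EIB_BYTES_PER_CYCLE 96u

/* The EIB runs at half the SPU clock frequency. */
uint64_t eib_clock_hz(uint64_t spu_clock_hz)
{
    return spu_clock_hz / 2u;
}

/* Peak internal EIB bandwidth in bytes per second for a given SPU clock. */
uint64_t eib_peak_bytes_per_second(uint64_t spu_clock_hz)
{
    return eib_clock_hz(spu_clock_hz) * EIB_BYTES_PER_CYCLE;
}
```

For example, assuming a 3.2 GHz SPU clock, eib_peak_bytes_per_second(3200000000ULL) gives 153600000000 bytes per second.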

DMA • MFCs support naturally aligned DMA transfer sizes of 1, 2, 4, or 8 bytes, and multiples of 16 bytes • maximum transfer size of 16 KB per transfer • DMA list commands can initiate up to 2048 transfers • peak transfer performance • if both the effective addresses and the LS addresses are 128-byte aligned • and the size of the transfer is an even multiple of 128 bytes • SMM (Synergistic Memory Management) unit • processes address translation • access-permission information • data supplied by the PPE operating system
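
The size and alignment rules above can be written down as plain predicates. This is a sketch only: the function names are hypothetical, "naturally aligned" is interpreted here as alignment to the transfer size for 1, 2, 4 and 8 byte transfers and to 16 bytes for larger ones, and a zero-length transfer is rejected for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_TRANSFER 16384u /* 16 KB per transfer */

/* Valid sizes: 1, 2, 4, 8 bytes, or a multiple of 16 bytes up to 16 KB. */
bool dma_size_is_valid(uint32_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8)
        return true;
    return size != 0 && size % 16 == 0 && size <= DMA_MAX_TRANSFER;
}

/* Assumed reading of "naturally aligned": both addresses are aligned to the
   transfer size (small transfers) or to 16 bytes (larger transfers). */
bool dma_is_naturally_aligned(uint64_t ea, uint32_t ls, uint32_t size)
{
    uint32_t align = size < 16 ? size : 16;
    return dma_size_is_valid(size) && ea % align == 0 && ls % align == 0;
}

/* Peak performance: both addresses 128-byte aligned, size a multiple of 128. */
bool dma_is_peak(uint64_t ea, uint32_t ls, uint32_t size)
{
    return dma_size_is_valid(size) && size % 128 == 0 &&
           ea % 128 == 0 && ls % 128 == 0;
}
```

A transfer of 0x4000 bytes from effective address 0x10000000 to LS address 0x500 satisfies dma_is_peak; moving the LS address to 0x510 keeps the transfer valid but loses the peak condition.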

SIMD example
    // 16 iterations of a loop
    int rolled_sum(unsigned char bytes[16])
    {
        int i;
        int sum = 0;
        for (i = 0; i < 16; ++i) {
            sum += bytes[i];
        }
        return sum;
    }

SIMD example cont.
    // Vectorized for vector/SIMD multimedia extension
    int vectorized_sum(unsigned char bytes[16])
    {
        vector unsigned char vbytes;
        union {
            int i[4];
            vector signed int v;
        } sum;
        vector unsigned int zero = (vector unsigned int){0};
        // Perform a misaligned vector load of the 16 bytes.
        vbytes = vec_perm(vec_ld(0, bytes), vec_ld(16, bytes), vec_lvsl(0, bytes));
        // Sum the 16 bytes of the vector
        sum.v = vec_sums((vector signed int)vec_sum4s(vbytes, zero),
                         (vector signed int)zero);
        // Extract the sum and return the result.
        return (sum.i[3]);
    }

Communication • DMA • 2 command queues per SPE • one for commands by SPE • one for commands by PPE and other SPEs • commands have tags (32 different) – status query • one transfer or a list • mailboxes • for each SPE • communication with PPE • 2 outgoing (1 message) • 1 incoming (4 messages) • signals • 2 inbound channels

DMA • put, get • SPE or PPE initiated • tag • 5bit • ordering • out of order • barrier – maintains order (within tag group) • fence – after all previous (within tag group) • simple or lists • lists stored in LS (8bytes per item) -> SPE only • up to 2048 transfers, 16KB each -> 32MB • compare to 256KB LS size
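
The list limits above are easy to check with a little arithmetic. The sketch below uses hypothetical helper names and assumes a large transfer is split into full 16 KB chunks; it shows how many list elements a transfer needs, how much LS the list itself occupies, and where the 32 MB figure comes from.

```c
#include <stdint.h>

#define DMA_MAX_TRANSFER   16384u /* bytes moved by one list element */
#define DMA_LIST_MAX_ITEMS 2048u  /* elements per DMA list command */
#define DMA_LIST_ITEM_SIZE 8u     /* bytes of LS per list element */

/* List elements needed to move total_bytes in chunks of at most 16 KB. */
uint32_t dma_list_items_needed(uint64_t total_bytes)
{
    return (uint32_t)((total_bytes + DMA_MAX_TRANSFER - 1) / DMA_MAX_TRANSFER);
}

/* LS space taken by the list itself. */
uint32_t dma_list_storage_bytes(uint32_t items)
{
    return items * DMA_LIST_ITEM_SIZE;
}

/* Largest amount of data a single list command can describe. */
uint64_t dma_list_max_bytes(void)
{
    return (uint64_t)DMA_LIST_MAX_ITEMS * DMA_MAX_TRANSFER;
}
```

A full list of 2048 elements occupies 16 KB of the 256 KB LS while describing 32 MB of data, which is why such a list can only stream through the LS rather than land in it all at once.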

DMA – PPE raw access • MFC registers mapped to virtual address space
    void *ps = get_ps(); // get the problem state – must be mapped by privileged software
    unsigned int ls = 0x500;
    unsigned long long ea = 0x10000000;
    unsigned int size = 0x4000;
    unsigned int tag = 5;
    unsigned int classid = 0;
    unsigned int cmd = MFC_GET_CMD;
    unsigned int cmd_status;
    do {
        *((volatile unsigned int *)(ps + MFC_LSA)) = ls;
        *((volatile unsigned long long *)(ps + MFC_EAH)) = ea;
        *((volatile unsigned int *)(ps + MFC_Size)) = (size << 16) | tag;
        *((volatile unsigned int *)(ps + MFC_ClassID)) = (classid << 16) | cmd;
        /* Read MFC_CMDStatus to enqueue command and check enqueue success. */
        cmd_status = *((volatile unsigned int *)(ps + MFC_CMDStatus)) & 0x3;
    } while (cmd_status); /* Attempt to enqueue until success */
• only enqueues the command

DMA – PPE raw access cont. • test for completion (poll tag group status)
    void *ps = get_ps();
    unsigned int tag_mask = 1u << 5;
    unsigned int tag_status;
    *((volatile unsigned int *)(ps + Prxy_QueryMask)) = tag_mask;
    __asm__("eieio"); /* force write to Prxy_QueryMask to complete */
    do {
        tag_status = *((volatile unsigned int *)(ps + Prxy_TagStatus));
    } while (!tag_status);
• more tag groups
    unsigned int tag_mask = (1u<<5)|(1u<<14)|(1u<<31);
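
Building the tag mask by hand gets error-prone with several tag groups, and shifting a signed 1 by 31 bits is undefined behavior in C. A small helper, sketched here under the hypothetical name tag_mask_of, builds the mask from a list of tag IDs with unsigned shifts and ignores IDs outside 0-31.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a tag-group mask: bit n is set for every tag ID n in tags[]. */
uint32_t tag_mask_of(const unsigned int *tags, size_t count)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < count; ++i) {
        if (tags[i] < 32) /* only tag IDs 0-31 exist */
            mask |= UINT32_C(1) << tags[i];
    }
    return mask;
}
```

For the tag IDs 5, 14 and 31 this returns 0x80004020, the same value as the hand-written expression on the slide.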

DMA – SPE • no direct access to the virtual address space • only by DMA • direct access to own command channels • wrch assembly instruction
    extern void dma_transfer(volatile void *lsa,  // local storage address
                             unsigned int eah,    // high 32-bit effective address
                             unsigned int eal,    // low 32-bit effective address
                             unsigned int size,   // transfer size in bytes
                             unsigned int tag_id, // tag identifier (0-31)
                             unsigned int cmd);   // DMA command
in assembler:
    wrch $MFC_LSA, $3
    wrch $MFC_EAH, $4
    wrch $MFC_EAL, $5
    wrch $MFC_Size, $6
    wrch $MFC_TagID, $7
    wrch $MFC_Cmd, $8
in C intrinsic:
    spu_mfcdma64(lsa, eah, eal, size, tag_id, cmd);

DMA – SPE cont. • poll for completion
    # Set tag group mask
    wrch $MFC_WrTagMask, $0
    # Set up for immediate tag status update.
    il $1, 0
    repeat:
    wrch $MFC_WrTagUpdate, $1
    rdch $1, $MFC_RdTagStat
    brz $1, repeat
OR
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>
    unsigned int tag_id = 0;
    unsigned int tag_mask = 1 << tag_id;
    spu_writech(MFC_WrTagMask, tag_mask);
    do {
        /* other work can be done here */
    } while (!spu_mfcstat(MFC_TAG_UPDATE_IMMEDIATE)); /* poll for update */

DMA – SPE cont. • wait for completion (stall SPE)
    # Set tag group mask
    wrch $MFC_WrTagMask, $0
    # 0x1 for any tag, 0x2 for all tags.
    il $1, 0x1
    # Wait for conditional tag status update (stall the SPU).
    wrch $MFC_WrTagUpdate, $1
    rdch $1, $MFC_RdTagStat
OR
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>
    unsigned int tag_id = 0;
    unsigned int tag_mask = 1 << tag_id;
    spu_writech(MFC_WrTagMask, tag_mask);
    /* Wait for all ids in tag group to complete (stall the SPU) */
    spu_mfcstat(MFC_TAG_UPDATE_ALL);

DMA – SPE cont. • completion of DMA • source buffer can be reused • data may not have yet been written to the main storage • mailbox-ed notification can reach PPE before the data • SPE can do mfcsync • PPE can do lwsync • more efficient • SPE can notify via DMA • mfceieio must be used between DMAs for ordering

Mailboxes • 32bit messages • blocking for SPE (stalls SPE) • reading of empty inbound • writing of full outbound • SPE can poll the number of messages • non-blocking for PPE (and other devices) • reading returns zeros • writing overwrites last message
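
The asymmetry between the two sides is easiest to see in a toy software model. The sketch below is not hardware code: the type and function names are hypothetical, it models only the 4-entry inbound mailbox, and where the real SPE would stall on an empty mailbox the model simply returns false.

```c
#include <stdbool.h>
#include <stdint.h>

#define IN_MBOX_DEPTH 4u /* inbound mailbox holds 4 messages */

typedef struct {
    uint32_t entries[IN_MBOX_DEPTH];
    unsigned int count; /* number of queued messages */
} in_mbox_model;

/* PPE-side write: never blocks. Writing to a full mailbox overwrites
   the most recently written message. */
void model_ppe_write(in_mbox_model *mb, uint32_t value)
{
    if (mb->count < IN_MBOX_DEPTH)
        mb->entries[mb->count++] = value;
    else
        mb->entries[IN_MBOX_DEPTH - 1] = value;
}

/* SPE-side read: the real SPE stalls on an empty mailbox; the model
   reports false instead. Messages come out in FIFO order. */
bool model_spe_read(in_mbox_model *mb, uint32_t *value)
{
    if (mb->count == 0)
        return false;
    *value = mb->entries[0];
    for (unsigned int i = 1; i < mb->count; ++i)
        mb->entries[i - 1] = mb->entries[i];
    mb->count--;
    return true;
}
```

Writing the values 1 to 5 without any intervening read leaves 1, 2, 3, 5 in the model: the fourth message is lost, which is the overrun problem the PPE has to avoid by checking the mailbox status first.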

Mailboxes – SPE • send (stalling)
    wrch $SPU_WrOutMbox, $1
or
    spu_writech(SPU_WrOutMbox, mb_value);
• send (active waiting)
    repeat:
    rchcnt $2, $SPU_WrOutMbox
    brz $2, repeat
    wrch $SPU_WrOutMbox, $1
or
    do {
        /* Do other useful work while waiting. */
    } while (!spu_readchcnt(SPU_WrOutMbox));
    spu_writech(SPU_WrOutMbox, mb_value);

Mailboxes – SPE cont. • read (stalling)
    rdch $1, $SPU_RdInMbox
or
    mb_value = spu_readch(SPU_RdInMbox);
• read (active waiting)
    repeat:
    rchcnt $1, $SPU_RdInMbox
    brz $1, repeat
    rdch $2, $SPU_RdInMbox
or
    do {
        /* Do other useful work while waiting. */
    } while (!spu_readchcnt(SPU_RdInMbox));
    mb_value = spu_readch(SPU_RdInMbox);

Mailboxes – PPE • read SPE's outbound mailbox
    void *ps = get_ps();
    unsigned int mb_status;
    unsigned int new;
    unsigned int mb_value;
    do {
        mb_status = *((volatile unsigned int *)(ps + SPU_Mbox_Stat));
        new = mb_status & 0x000000FF; /* number of messages waiting */
    } while ( new == 0 );
    mb_value = *((volatile unsigned int *)(ps + SPU_Out_Mbox));

Mailboxes – PPE cont. • writing to SPE's inbound mailbox • problem of overrunning full mailbox
    // send four messages without overrunning the mailbox
    void *ps = get_ps();
    unsigned int j, k = 0;
    unsigned int mb_status;
    unsigned int slots;
    unsigned int mb_value[4] = {0x1, 0x2, 0x3, 0x4};
    do {
        /* Poll the Mailbox Status Register until the SPU_In_Mbox_Count
           field indicates there is at least one slot available in the
           SPU Read Inbound Mailbox. */
        do {
            mb_status = *((volatile unsigned int *)(ps + SPU_Mbox_Stat));
            slots = (mb_status & 0x0000FF00) >> 8;
        } while ( slots == 0 );
        for (j = 0; j < slots && k < 4; j++) {
            *((volatile unsigned int *)(ps + SPU_In_Mbox)) = mb_value[k++];
        }
    } while ( k < 4 );
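
Both PPE examples decode the same status word with different masks. The helpers below (hypothetical names) isolate the two fields used on the slides: the low byte as the number of messages waiting in the outbound mailbox, and the next byte as the number of free slots in the inbound mailbox. Other fields of the register are not covered by this sketch.

```c
#include <stdint.h>

/* Messages waiting in the SPE's outbound mailbox (bits 0-7 of the status). */
unsigned int mbox_out_count(uint32_t mb_status)
{
    return mb_status & 0x000000FFu;
}

/* Free slots in the SPE's inbound mailbox (bits 8-15 of the status). */
unsigned int mbox_in_free_slots(uint32_t mb_status)
{
    return (mb_status & 0x0000FF00u) >> 8;
}
```

A status value of 0x00000401 therefore means one outbound message is waiting and all four inbound slots are free.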

CELL SDK 3.1 • http://www.ibm.com/developerworks/power/cell/ • Cell BE Programming Handbook Including PowerXCell 8i • http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1741C509C5F64B3300257460006FD68D • SPE Runtime Management Library • http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1DFEF31B3211112587257242007883F3 • PPU & SPU C/C++ Language Extension Specification • http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E

libspe & libspe2 • low level APIs to access Cell from C/C++ • new threading model in libspe2 • use threading library of your choice and use libspe2 from there – no “SPE threads” • create e.g. pthread thread and launch SPE code from that – call returns after SPE finishes

Compilation • PPE object • g++ [-m64] -c -Ox • SPE object • spu-gcc -Ox • no -m64 • LS addresses are always 32bit • ppu-embedspu [-m64] <symbol> <object> <output> • link • g++ [-m64] <spe object> <ppe object> -lspe -lspe2

Referencing SPE code from PPE code • extern spe_program_handle_t <symbol>; • spe_program_load(spe_context,&<symbol>);

Launching SPE code (libspe2)
    struct thread_data {
        spe_context_ptr_t context;
        program_data* pd;
    };
    void *ppu_pthread_function(void *arg) {
        thread_data td = *(thread_data *) arg;
        spe_context_ptr_t context = td.context;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        // blocks until the SPE program finishes
        spe_context_run(context,&entry,0,td.pd,NULL,NULL);
        pthread_exit(NULL);
    }
    spe_context_ptr_t context;
    pthread_t pthread;
    thread_data td;
    context = spe_context_create(0,NULL);
    spe_program_load(context,&spe_prg);
    td.context = context; // hand the context to the thread
    td.pd = pd;           // pd: pointer to the program data, assumed to be prepared by the caller
    pthread_create(&pthread,NULL,&ppu_pthread_function,&td);
    pthread_join(pthread,NULL);
    spe_context_destroy(context);

SPE code
    #include <spu_mfcio.h>
    int main(unsigned long long spe_id,
             unsigned long long program_data_ea,
             unsigned long long env)
    {
        program_data pd __attribute__((aligned(16)));
        int tag_id = 1;
        // fetch the program data from main memory into local storage
        mfc_get(&pd, program_data_ea, sizeof(pd), tag_id, 0, 0);
        mfc_write_tag_mask(1<<tag_id);
        mfc_read_tag_status_any(); // wait for the transfer to complete
        …
    }

Program data • structure shared by SPE and PPE code • unsigned long long for 64bit pointers • void* is 32bit on SPE and 32/64bit on PPE • be careful with the alignment • DMA cannot handle misaligned transfers • size padded to 16byte
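
A layout that follows these rules might look like the sketch below. The field names are purely illustrative (the slides do not show the contents of program_data); the points being demonstrated are the 64-bit integer fields in place of pointers, the explicit padding to a multiple of 16 bytes, and a compile-time check that the size stays DMA-friendly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared structure: the same definition is compiled for
   both the PPE and the SPE, so it must not contain pointer types. */
typedef struct {
    uint64_t input_ea;  /* effective address of the input buffer */
    uint64_t output_ea; /* effective address of the output buffer */
    uint32_t count;     /* number of elements to process */
    uint32_t pad[3];    /* explicit padding: 8 + 8 + 4 + 12 = 32 bytes */
} program_data;

/* Compile-time check: the array size is -1 (an error) unless the
   structure size is a multiple of 16 bytes. */
typedef char program_data_size_check[(sizeof(program_data) % 16 == 0) ? 1 : -1];

/* Round a buffer size up to the next multiple of 16 bytes. */
size_t pad_to_16(size_t size)
{
    return (size + 15u) & ~(size_t)15u;
}
```

On the PPE side the structure itself also has to start at a 16-byte aligned address, for example by allocating it with an aligned allocator or an aligned attribute as the SPE code does.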

DMA – SPE side • (void) mfc_put(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag,uint32_t tid, uint32_t rid) • initiate transfer from LS • tag is number (e.g. 5) • mfc_putb, mfc_putf

DMA – SPE side cont. • (void) mfc_get(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag,uint32_t tid, uint32_t rid) • mfc_getb, mfc_getf

DMA status – SPE side • (void) mfc_write_tag_mask (uint32_t mask) • tag mask (e.g. 1<<5) • (uint32_t) mfc_read_tag_status_any(void) • blocks until any of the specified tag groups has no outstanding operations • (uint32_t) mfc_read_tag_status_all(void) • blocks until all of the specified tag groups have no outstanding operations
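
The difference between the "any" and "all" variants comes down to two bit tests. The following is a software model only, with hypothetical names: outstanding is an imagined bitmap in which bit n is set while tag group n still has DMA commands in flight, and the functions say whether the corresponding blocking call would return.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tag groups selected by mask that have no outstanding operations. */
uint32_t tag_groups_done(uint32_t mask, uint32_t outstanding)
{
    return mask & ~outstanding;
}

/* Models the mfc_read_tag_status_any condition: at least one selected
   tag group has completed. */
bool tag_wait_any_satisfied(uint32_t mask, uint32_t outstanding)
{
    return tag_groups_done(mask, outstanding) != 0;
}

/* Models the mfc_read_tag_status_all condition: every selected tag
   group has completed. */
bool tag_wait_all_satisfied(uint32_t mask, uint32_t outstanding)
{
    return tag_groups_done(mask, outstanding) == mask;
}
```

With tag groups 5 and 14 selected (mask 0x4020) and only group 14 still busy, the "any" condition holds while the "all" condition does not.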

Mailboxes – SPE side • (uint32_t) spu_read_in_mbox(void) • (uint32_t) spu_stat_in_mbox(void) • (void) spu_write_out_mbox(uint32_t data) • (uint32_t) spu_stat_out_mbox(void)

Mailboxes – PPE side • int spe_out_mbox_read (spe_context_ptr_t spe, unsigned int *mbox_data, int count) • int spe_out_mbox_status (spe_context_ptr_t spe) • int spe_in_mbox_write (spe_context_ptr_t spe, unsigned int *mbox_data, int count, unsigned int behavior) • SPE_MBOX_ALL_BLOCKING • blocks until all are sent • SPE_MBOX_ANY_BLOCKING • blocks until at least one message is sent • SPE_MBOX_ANY_NONBLOCKING • sends as many as possible without blocking • int spe_in_mbox_status (spe_context_ptr_t spe)

PPE direct access to SPE • void* spe_ls_area_get (spe_context_ptr_t spe) • less efficient than DMA • int spe_ls_size_get (spe_context_ptr_t spe) • void* spe_ps_area_get (spe_context_ptr_t spe, enum ps_area area) • enum ps_area • SPE_MFC_COMMAND_AREA • MFC registers • SPE_CONTROL_AREA • mailboxes • the get_ps function used in examples from the first part
