Consistency Oblivious Programming

Hillel Avni, Tel Aviv University

Agenda
• Transactional Memory and Locking
• Consistency Oblivious Programming (COP)
• COP with STM
• COP with HTM
• Future Work

Global Lock
• Easy to use
• Composable: concatenate critical sections
• Not scalable

Fine-Grained Locking
• Hard to use
• Not composable
• Scalable
The lazy linked list is a good example…

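Lazy Traversal
[Figure: sorted list with nodes a, b, d, e. add(c) traverses without taking locks until it finds the insertion point after b ("Aha!").]

Lock and Validate
[Figure: add(c) locks the nodes around the insertion point and validates: "Yes, b still points to d."]

Perform Updates and Release Locks
[Figure: c is linked into the list between b and d, and the locks are released.]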
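The three slides above (traverse lazily, lock and validate, update and release) can be sketched in C++ as follows. This is a minimal illustration, not the code behind the slides: the names `Node`, `LazyList`, `validate`, and the `marked` flag are assumptions, sentinel keys `INT_MIN` and `INT_MAX` are reserved, and nodes are never freed.

```cpp
#include <atomic>
#include <cassert>
#include <climits>
#include <mutex>

// Hypothetical node of a sorted list with head and tail sentinels.
struct Node {
    const int key;
    std::atomic<Node*> next;
    std::atomic<bool> marked{false};  // would be set by a remove() (not shown)
    std::mutex lock;
    Node(int k, Node* n) : key(k), next(n) {}
};

class LazyList {
public:
    bool add(int key) {
        assert(key > INT_MIN && key < INT_MAX);  // sentinel keys are reserved
        for (;;) {
            // 1. Lazy traversal: no locks are taken.
            Node* pred = head_;
            Node* curr = pred->next;
            while (curr->key < key) {
                pred = curr;
                curr = curr->next;
            }
            // 2. Lock both nodes (always in list order), then validate.
            std::lock_guard<std::mutex> pred_guard(pred->lock);
            std::lock_guard<std::mutex> curr_guard(curr->lock);
            if (!validate(pred, curr)) continue;  // list changed: retry
            if (curr->key == key) return false;   // key already present
            // 3. Perform the update; the guards release the locks on return.
            pred->next = new Node(key, curr);
            return true;
        }
    }

    // Unlocked membership test.
    bool contains(int key) const {
        Node* curr = head_;
        while (curr->key < key) curr = curr->next;
        return curr->key == key && !curr->marked;
    }

private:
    // "Yes, pred still points to curr", and neither node was removed.
    static bool validate(Node* pred, Node* curr) {
        return !pred->marked && !curr->marked && pred->next == curr;
    }

    Node* tail_ = new Node(INT_MAX, nullptr);
    Node* head_ = new Node(INT_MIN, tail_);
};
```

The traversal touches many nodes but only the two neighbours of the insertion point are ever locked, which is the property COP later transfers to transactions.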

Transactional Memory
• Easy to use
• Composable
• Scalable
How is it done?

Java (Deuce)
bool CAS(int location, int expected, int newVal) {
    atomic {
        if (location != expected) return false;
        location = newVal;
    }
    return true;
}

C/C++ (GCC 4.7)
bool CAS(int *location, int expected, int new_val) {
    __transaction_atomic {
        /* location is a pointer so the update is visible to the caller */
        if (*location != expected) return false;
        *location = new_val;
    }
    return true;
}

Software Transactional Memory
Different algorithms are used for:
• consistency checking
• rollback
The compiler recognizes shared accesses.

STM Problem - Overhead
template <typename V> static V load(const V* addr, ls_modifier mod)
{
  if (unlikely(mod == RfW)) {
    pre_write(addr, sizeof(V));
    return *addr;
  }
  if (unlikely(mod == RaW))
    return *addr;
  gtm_thread *tx = gtm_thr();
  gtm_rwlog_entry* log = pre_load(tx, addr, sizeof(V));
  V v = *addr;
  atomic_thread_fence(memory_order_acquire);
  post_load(tx, log);
  return v;
}
The load function from GCC 4.8.1.

STM Problem - Overhead
static gtm_rwlog_entry* pre_load(gtm_thread *tx, const void* addr, size_t len)
{
  size_t log_start = tx->readlog.size();
  gtm_word snapshot = tx->shared_state.load(memory_order_relaxed);
  gtm_word locked_by_tx = ml_mg::set_locked(tx);
  size_t orec = ml_mg::get_orec(addr);
  size_t orec_end = ml_mg::get_orec_end(addr, len);
  do {
    gtm_word o = o_ml_mg.orecs[orec].load(memory_order_acquire);
    if (likely (!ml_mg::is_more_recent_or_locked(o, snapshot))) {
    success:
      gtm_rwlog_entry *e = tx->readlog.push();
      e->orec = o_ml_mg.orecs + orec;
      e->value = o;
    } else if (!ml_mg::is_locked(o)) {
      snapshot = extend(tx);
      goto success;
    } else {
      if (o != locked_by_tx)
        tx->restart(RESTART_LOCKED_READ);
    }
    orec = o_ml_mg.get_next_orec(orec);
  } while (orec != orec_end);
  return &tx->readlog[log_start];
}
load always calls pre_load…

STM Problem - Overhead
static void post_load(gtm_thread *tx, gtm_rwlog_entry* log)
{
  for (gtm_rwlog_entry *end = tx->readlog.end(); log != end; log++) {
    gtm_word o = log->orec->load(memory_order_relaxed);
    if (log->value != o)
      tx->restart(RESTART_VALIDATE_READ);
  }
}
…and post_load. Compare this to a single mov eax, [ebx] on x86.

Hardware Transactional Memory
Exploits native cache coherence for:
• consistency checking
• rollback

HTM Problem – Resource Limits
• Cache size limits the data footprint
• Quantum size limits the duration
A transaction cannot commit if it is:
• too big
• too slow

All TM Problem – False Conflicts
Any address that was encountered during the transaction is monitored until the end of that transaction.
An address may abort a transaction long after it is no longer relevant…

COP Operation
• In non-transactional mode:
  • Execute the read-only prefix of the operation and record its output.
• In transactional mode:
  • Verify that the output is correct.
  • Perform updates.
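The two phases above can be sketched as a small retry driver. This is only an illustration under stated assumptions: `cop_execute` and its three callbacks are hypothetical names, and a `std::mutex` stands in for the transaction, so it shows the control flow of COP but none of the concurrency a real TM provides.

```cpp
#include <mutex>

// rop:      read-only prefix, runs with no synchronization, returns an output
// verify:   checks, inside the "transaction", that the output is still correct
// complete: performs the updates using the verified output
template <class Rop, class Verify, class Complete>
auto cop_execute(std::mutex& tm, Rop rop, Verify verify, Complete complete)
    -> decltype(complete(rop())) {
    for (;;) {
        // Non-transactional mode: run the prefix and record its output.
        auto output = rop();
        // Transactional mode (stand-in: a global lock).
        std::lock_guard<std::mutex> tx(tm);
        if (verify(output)) {
            return complete(output);
        }
        // Verification failed: leave the critical section and retry.
    }
}
```

Only `verify` and `complete` run in transactional mode, so the set of addresses the transaction monitors stays small no matter how much memory the prefix reads.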

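COP Example – RB Tree
[Figure: red-black tree holding the keys 10, 20, 25, 27, 28, 30, 40.]

Add 26 – Tree Unbalanced
[Figure: key 26 has been added and the tree is unbalanced. A TM search for 26 is in progress.]

Tree Balanced
[Figure: the tree after rebalancing. The TM search continues from 27, hits a conflict, and aborts.]

Add 26 – Tree Unbalanced
[Figure: the same unbalanced tree. This time a COP search for 26 is in progress.]

Tree Balanced
[Figure: the tree after rebalancing. The COP search continues from 27 and finds 26.]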

COP RB-Tree Verify
To facilitate verification:
• All nodes in the RB-Tree are connected in a successor-predecessor doubly linked list, and each node has a live mark.
• Search returns a node n with key k, or a leaf holding k's successor or predecessor.

COP RB-Tree Suffix
• Resume the transaction.
• Verify:
  • k found and n is live: done.
  • k not found, check:
    • (n.k > k > n.pred.k && !n.left) or (n.k < k < n.succ.k && !n.right)
• If verification failed, abort the transaction.
• Complete the updates (add / remove / rebalance) using n.
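The verification step can be written as a small predicate. The sketch below is an assumption-laden illustration: `TreeNode` and `cop_verify` are hypothetical names, a missing `pred` or `succ` is treated as minus or plus infinity, the node is required to be live in the not-found case as well, and the empty-child test follows ordinary binary-search-tree placement (a smaller key would become n's left child, a larger key its right child).

```cpp
// Hypothetical node layout: tree links plus the successor-predecessor list.
struct TreeNode {
    int key;
    bool live = true;
    TreeNode* left = nullptr;
    TreeNode* right = nullptr;
    TreeNode* pred = nullptr;  // next smaller key in the tree
    TreeNode* succ = nullptr;  // next larger key in the tree
    explicit TreeNode(int k) : key(k) {}
};

// Returns true if n is still a valid result of an unsynchronized search for k.
bool cop_verify(const TreeNode* n, int k) {
    if (n == nullptr || !n->live) return false;
    if (n->key == k) return true;  // k found and n is live
    if (k < n->key) {
        // n must be k's successor: k lies strictly between n.pred and n,
        // and the slot where k would be attached (n.left) is empty.
        bool above_pred = (n->pred == nullptr) || (n->pred->key < k);
        return above_pred && n->left == nullptr;
    }
    // n must be k's predecessor: k lies strictly between n and n.succ,
    // and the slot where k would be attached (n.right) is empty.
    bool below_succ = (n->succ == nullptr) || (k < n->succ->key);
    return below_succ && n->right == nullptr;
}
```

The check reads only n and its two list neighbours, so it is independent of how the tree was rotated while the search was running.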

COP Template for op
start-transaction
  any-code
  suspend-transaction
  output = op-rop();
  resume-transaction
  if (not(op-verify(output)))
    abort-transaction
  op-complete(output)
  any-code
end-transaction

COP Correctness
The underlying TM:
• Transactional Regular Registers
The COP algorithm:
• Obliviousness
• Verifiability
• Separation
We prove that if the TM yields transactional regular registers, and the COP algorithm demonstrates obliviousness, verifiability, and separation, then the COP operation is linearizable.

STM Algorithm
• The GCC default STM algorithm is the one that proved to be the most efficient and scalable in most scenarios:
  • Write Through (WT)
  • Encounter Time Locking (ETL)
  • Multi Lock (ML)

STM: WT – ETL – ML
[Figure: memory words (Mem) with per-word versioned locks (Locks, V#), the shared version clock, and a transaction's read version RV, stepping through reads, writes, and commit.]
• RV ← Shared Version Clock
• On read: check unlocked and v# <= RV, then add to read-set
• On write: check v# <= RV, lock, and add to undo-set
Commit:
• WV = F&I(VClock)
• Validate that in the read-set each v# <= RV
• Release locks with v# ← WV
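The bookkeeping on this slide can be modelled in a few lines. The sketch below is a single-threaded toy, not libitm: `Orec`, `ToyStm`, and `ToyTx` are hypothetical names, there is one lock per memory word, no atomics or fences are used, and WV is taken to be the clock value after the increment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Orec {
    std::uint64_t version = 0;    // v# of the last committed write
    const void* owner = nullptr;  // non-null while locked by a transaction
};

struct ToyStm {
    std::uint64_t clock = 0;      // shared version clock
    std::vector<int> mem;         // shared memory words
    std::vector<Orec> orecs;      // multi-lock: one versioned lock per word
    explicit ToyStm(std::size_t words) : mem(words, 0), orecs(words) {}
};

class ToyTx {
public:
    explicit ToyTx(ToyStm& s) : stm_(s), rv_(s.clock) {}  // RV <- clock

    bool read(std::size_t i, int& out) {
        assert(i < stm_.mem.size());
        Orec& o = stm_.orecs[i];
        if (o.owner != this) {
            // Check unlocked and v# <= RV, then add to the read-set.
            if (o.owner != nullptr || o.version > rv_) return abort();
            read_set_.push_back(i);
        }
        out = stm_.mem[i];
        return true;
    }

    bool write(std::size_t i, int value) {
        assert(i < stm_.mem.size());
        Orec& o = stm_.orecs[i];
        if (o.owner != this) {
            if (o.owner != nullptr || o.version > rv_) return abort();
            o.owner = this;                               // encounter-time lock
        }
        undo_.push_back(std::make_pair(i, stm_.mem[i]));  // undo-set
        stm_.mem[i] = value;                              // write-through
        return true;
    }

    bool commit() {
        if (undo_.empty()) return true;       // read-only: nothing to publish
        std::uint64_t wv = ++stm_.clock;      // WV from the shared clock
        for (std::size_t i : read_set_) {     // validate the read-set
            const Orec& o = stm_.orecs[i];
            bool locked_by_other = o.owner != nullptr && o.owner != this;
            if (locked_by_other || o.version > rv_) return abort();
        }
        for (const auto& u : undo_) {         // release locks with v# <- WV
            stm_.orecs[u.first].version = wv;
            stm_.orecs[u.first].owner = nullptr;
        }
        undo_.clear();
        return true;
    }

private:
    // Roll back in reverse order and release locks with their old versions.
    bool abort() {
        for (auto it = undo_.rbegin(); it != undo_.rend(); ++it) {
            stm_.mem[it->first] = it->second;
            stm_.orecs[it->first].owner = nullptr;
        }
        undo_.clear();
        read_set_.clear();
        return false;
    }

    ToyStm& stm_;
    std::uint64_t rv_;
    std::vector<std::size_t> read_set_;
    std::vector<std::pair<std::size_t, int> > undo_;
};
```

Every `read` adds an entry that must be revalidated at commit, which is the per-access cost COP avoids by running the read-only prefix outside the instrumented path.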

GCC Constructs
• __transaction_atomic{}: Mark the transaction.
• __transaction_cancel: Explicit abort.
• __attribute__((transaction_safe)): Instrument the code.
• __attribute__((transaction_pure)): Do not instrument the code.
We will show that this attribute can be used efficiently as __transaction_suspend with the default WT – ETL – ML STM algorithm in GCC.

pure = suspend
• Transactional Regular Registers: all values up to one architecture-word size are written and read atomically. The rollback may use memcpy, but the memcpy is optimized to write at maximal alignment.
• Now we will compare the suspended mode of the future Power architecture HTM to transaction_pure with the WT-ETL-ML STM algorithm.

Power tsuspend - tresume
• Until failure occurs, load instructions that access memory locations that were transactionally written by the same thread will return the transactionally written data.
• In the event of transaction failure, failure recording is performed, but failure handling is deferred until transactional execution is resumed.
• The initiation of a new transaction is prevented.
• Store instructions that access memory locations that have been accessed transactionally (due to load or store) by the same thread will cause the transaction to fail.

[Figure: RB-Tree, size 1M, 20% updates, 10 operations per transaction.]

[Figure: RB-Tree, size 1K, 8 threads, 20% updates.]

Haswell HTM with COP
There is no suspend mode, so to compose COP operations, we execute all read-only prefixes (ROPs) before the transaction. This limits the composition to at most one writing COP operation per transaction.

Capacity and Cache Associativity
Packed Memory Array (PMA) search is done by divide and conquer.
Assume a PMA of size 0x800000 that starts at address 0. A search for an item located in the address range 0x0…0x7FFF must go through the addresses:
0x400000 0x200000 0x100000 0x80000 0x40000 0x20000 0x10000 0x8000
As the cache size in Haswell is 0x8000, all these addresses have the same cache index (0), and the transaction will always abort.
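The arithmetic behind this slide can be checked with a short sketch. The cache size 0x8000 comes from the slide; the 64-byte line size and 8-way associativity are assumptions about the Haswell L1 data cache, and `cache_set` and `probe_addresses` are hypothetical helper names.

```cpp
#include <cstdint>
#include <vector>

constexpr std::uint64_t kCacheSize = 0x8000;  // from the slide (32 KiB)
constexpr std::uint64_t kLineSize = 64;       // assumed line size in bytes
constexpr std::uint64_t kWays = 8;            // assumed associativity
constexpr std::uint64_t kSets = kCacheSize / (kLineSize * kWays);
static_assert(kSets == 64, "32 KiB / (64 B * 8 ways) gives 64 sets");

// Index of the cache set that an address maps to.
constexpr std::uint64_t cache_set(std::uint64_t addr) {
    return (addr / kLineSize) % kSets;
}

// Midpoints visited by a halving search over [0, size) that keeps descending
// into the lowest half, down to (and including) the midpoint equal to stop.
std::vector<std::uint64_t> probe_addresses(std::uint64_t size, std::uint64_t stop) {
    std::vector<std::uint64_t> probes;
    if (stop == 0) return probes;  // guard against a non-terminating loop
    for (std::uint64_t mid = size / 2; mid >= stop; mid /= 2) {
        probes.push_back(mid);
    }
    return probes;
}
```

With `probe_addresses(0x800000, 0x8000)` every probe is a multiple of 0x1000, so `cache_set` returns 0 for each one: under the assumed geometry the eight probes occupy every way of set 0 and leave no room for any other transactional line that maps there.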

[Figure: PMA results.]

[Figure: RB-Tree capacity aborts.]

[Figure: RB-Tree conflict aborts.]

Data Structures
We already have COP versions of:
• RB-Tree
• Linked list
• PMA
• Cache Oblivious B-Tree
• Leaplist (k-ary skip list, tailored for range queries)
Can we design more COP data structures?

Applications
Use COP in applications. Many applications use shared data structures, so it is interesting to see the impact of COP on their performance.

Infrastructure
• Add statistics (transactional accesses, conflicts) to GCC.
• Add a real suspend mode to GCC and to hardware.

Theory
• How to make the transformation to COP automatic?
• Is COP applicable outside the data-structures area?
• Bounds on the number of transactional accesses?
• Bounds on the number of false conflicts?

Thank You
