
Tornado


Presentation Transcript


  1. Tornado Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm (Department of Electrical and Computer Engineering, University of Toronto, 1999) Presented by: Anusha Muthiah, 4th Dec 2013

  2. Why Tornado? • Previous shared-memory multiprocessor operating systems evolved from designs for the architectures that existed back then

  3. Locality – Traditional System • Memory latency was low: fetch – decode – execute without stalling • Memory was faster than the CPU, so it was fine to leave data in memory (Diagram: CPUs issuing loads and stores over a shared bus to shared memory)

  4. Locality – Traditional System • Memory got faster, but CPUs got faster still • Memory latency became high relative to the CPU: a fast CPU is useless if it stalls waiting on memory

  5. Locality – Traditional System • Reading memory was very important • Fetch – decode – execute cycle

  6. Cache • Extremely close, fast memory sits between each CPU and the shared bus • Programs hit a lot of addresses within a small region • E.g. a program is a sequence of instructions: adjacent words on the same page of memory, revisited even more when in a loop

  7. Cache • Repeated access to the same page of memory was good!

  8. Memory latency • On a cache hit the CPU executes at the speed of the cache, making up for memory latency • But the cache is really small compared to memory: how can it help us more than 2% of the time?

  9. Locality • Locality: same value or storage location being frequently accessed • Temporal Locality: reusing some data/resource within a small duration • Spatial Locality: use of data elements within relatively close storage locations

  10. Cache • All CPUs accessing the same page in memory

  11. Example: Shared Counter • A single counter in memory is shared by both CPU-1 and CPU-2 • A copy of the counter ends up in each processor's cache

  12. Example: Shared Counter • The counter starts at 0 in memory; neither CPU's cache holds a copy yet

  13. Example: Shared Counter • CPU-1 reads the counter, pulling the value (0) into its cache

  14. Example: Shared Counter • CPU-1 increments the counter to 1; its cache takes the line in exclusive mode

  15. Example: Shared Counter • CPU-2 reads the counter; the line drops to shared mode and both caches hold 1 (reads are OK)

  16. Example: Shared Counter • CPU-2 increments the counter to 2; to move the line from shared back to exclusive mode it must invalidate CPU-1's copy

  17. Example: Shared Counter Terrible Performance!

  18. Problem: Shared Counter • The counter bounces between CPU caches, leading to a high cache miss rate • Try an alternate approach: convert the counter to an array, with one counter per CPU • Updates can then be local • The total stays correct because addition is commutative • To read, add up all the per-CPU counters (a sketch follows this slide)
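
A minimal sketch of the per-CPU counter array described above, in C++ (the CPU count and the names are illustrative, not from the paper). Each CPU increments only its own slot; a read sums every slot, which is safe because addition is commutative:

```cpp
#include <atomic>

constexpr int kNumCpus = 4;            // illustrative CPU count

// Naive per-CPU counters: one slot per CPU, packed contiguously in memory.
std::atomic<long> counters[kNumCpus];

// Each CPU updates only its own slot, so no lock is needed.
void increment(int cpu) {
    counters[cpu].fetch_add(1, std::memory_order_relaxed);
}

// Reading sums all slots; the order of the additions does not matter.
long read_total() {
    long total = 0;
    for (const auto& c : counters) total += c.load(std::memory_order_relaxed);
    return total;
}
```

As the next slides show, packing the slots next to each other in memory is exactly what makes this version perform no better than the single shared counter.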

  19. Example: Array-based Counter • Memory holds an array of counters, one for each CPU, both starting at 0

  20. Example: Array-based Counter • CPU-1 increments its own counter (array is now 1, 0)

  21. Example: Array-based Counter • CPU-2 increments its own counter (array is now 1, 1)

  22. Example: Array-based Counter • To read the counter, CPU-2 adds up all the per-CPU counters (1 + 1)

  23. Performance: Array-based Counter Performs no better than ‘shared counter’!

  24. Why the array-based counter doesn't work • Data is transferred between main memory and the caches in fixed-size blocks called cache lines • If two counters sit in the same cache line, only one processor at a time can hold the line for writing • Ultimately, the line still has to bounce between CPU caches

  25. False sharing • Two per-CPU counters (0, 0) sit in the same cache line in memory; neither cache holds the line yet

  26. False sharing • CPU-1 reads its counter, pulling the whole line (0, 0) into its cache

  27. False sharing • CPU-2 reads its counter; the line (0, 0) is now shared by both caches

  28. False sharing • CPU-1 increments its counter (line becomes 1, 0); it must invalidate CPU-2's copy even though CPU-2's counter never changed

  29. False sharing • CPU-2 reads again; the line (1, 0) is shared by both caches once more

  30. False sharing • CPU-2 increments its counter (line becomes 1, 1); it must invalidate CPU-1's copy in turn

  31. Ultimate Solution • Pad the array so each counter gets an independent cache line • Spread the counter components out in memory (see the sketch after this slide)
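
A minimal sketch of the padded version in C++, assuming a 64-byte cache line (the size and names are illustrative); alignas forces each counter onto its own line, so the CPUs never write to the same line:

```cpp
#include <atomic>

constexpr int kNumCpus = 4;            // illustrative CPU count

// Pad each per-CPU counter out to its own cache line (64 bytes assumed here),
// so an update on one CPU never invalidates another CPU's line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter counters[kNumCpus];

void increment(int cpu) {
    counters[cpu].value.fetch_add(1, std::memory_order_relaxed);
}

long read_total() {
    long total = 0;
    for (const auto& c : counters) total += c.value.load(std::memory_order_relaxed);
    return total;
}
```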

  32. Example: Padded Array • Each counter now occupies its own cache line in memory (0, 0)

  33. Example: Padded Array • Each CPU increments its counter in its own cache line (1, 1); the updates are independent of each other

  34. Performance: Padded Array Works better

  35. Cache • All CPUs accessing the same page in memory • Write sharing is very destructive

  36. Shared Counter Example • Minimize read/write and write sharing to minimize cache-coherence traffic • Minimize false sharing

  37. Now and then • Then (uniprocessor): locality is good • Now (multiprocessor): that kind of locality is bad. Don't share! • Traditional OSes were implemented to have good locality • Running the same code on the new architecture leads to cache interference • Adding more CPUs can even be detrimental • Each CPU wants to run from its own cache without interference from the others

  38. Modularity • Minimize write sharing and false sharing • How do you structure a system so that CPUs don't share the same memory? • Then: no objects • Now: split everything up and keep it modular • This paper: Object-Oriented Approach, Clustered Objects, Existence Guarantees

  39. Goal • Ideally: • CPU uses data from its own cache • Cache hit rate good • Cache contents stay in cache and not invalidated by other CPUs • CPU has good locality of reference (it frequently accesses its cache data) • Sharing is minimized

  40. Object Oriented Approach • Goals: minimize access to shared data structures, minimize the use of shared locks • Operating systems are driven by the requests applications make on virtual resources • For good performance, requests to different virtual resources must be handled independently

  41. How do we handle requests independently? • Represent each resource by a different object • Try to reduce sharing • If an entire resource is locked, requests get queued up • Avoid making the resource a source of contention • Use fine-grain locking instead

  42. Coarse/fine-grain locking example • Process object: maintains the list of mapped memory regions in the process's address space • On a page fault, it searches this list to find the responsible region and forwards the fault to it • With a coarse-grain lock, Thread 1's fault locks the whole table (Region 1 … Region n) even though it only needs one region, and other threads get queued up

  43. Coarse/fine-grain locking example • Instead, use individual locks, one per region, so Thread 1 and Thread 2 can handle faults on different regions concurrently (a sketch follows this slide)
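
A hedged sketch of the fine-grain locking idea in C++ (ProcessObject, Region, and the lock layout are hypothetical illustrations, not Tornado's actual interfaces). The region list itself is protected by a brief reader lock, while each region carries its own lock, so faults on different regions proceed in parallel:

```cpp
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <vector>

struct Region {
    std::mutex lock;                  // fine-grain: one lock per region
    uintptr_t start = 0, end = 0;     // address range this region maps

    bool contains(uintptr_t addr) const { return addr >= start && addr < end; }
    void handle_fault(uintptr_t addr) { /* resolve the fault within this region */ }
};

struct ProcessObject {
    std::shared_mutex list_lock;      // protects only the region list itself
    std::vector<Region*> regions;

    void page_fault(uintptr_t addr) {
        Region* target = nullptr;
        {   // Hold the list lock briefly, and only for reading.
            std::shared_lock<std::shared_mutex> guard(list_lock);
            for (Region* r : regions)
                if (r->contains(addr)) { target = r; break; }
        }
        if (target) {
            // Only the one responsible region is locked; faults by other
            // threads on other regions are not queued up behind it.
            std::lock_guard<std::mutex> guard(target->lock);
            target->handle_fault(addr);
        }
    }
};
```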

  44. Key memory management object relationships in Tornado • A page fault is delivered to the Process object, which forwards the request to the responsible Region • The Region translates the fault address into a file offset and forwards the request to the FCM • If the file data is cached in memory, the address of the corresponding physical page frame is returned to the Region, which then makes a call to the Hardware Address Translation (HAT) object • If the file data is not cached in memory, the FCM requests a new physical page frame and asks the Cached Object Representative to fill the page from the file (a sketch follows this slide)
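
A hedged sketch of that page-fault path in C++ (all class and method names here are illustrative stand-ins for the objects named above, not Tornado's real interfaces, and the bodies are stubs):

```cpp
#include <cstdint>
#include <optional>

struct HAT {                                   // Hardware Address Translation object
    void map(uintptr_t vaddr, uintptr_t frame) { /* program the MMU (stub) */ }
};

struct CachedObjectRep {                       // fills page frames from the file
    void fill_page(uintptr_t frame, long offset) { /* read file data into frame (stub) */ }
};

struct FCM {                                   // File Cache Manager for one file
    CachedObjectRep* cor = nullptr;

    std::optional<uintptr_t> lookup(long offset) { return std::nullopt; }  // stub: not cached
    uintptr_t new_frame() { return 0x1000; }                               // stub: dummy frame

    uintptr_t get_page(long offset) {
        if (auto frame = lookup(offset))       // data cached: return the frame address
            return *frame;
        uintptr_t frame = new_frame();         // not cached: request a new physical frame
        cor->fill_page(frame, offset);         // and ask the Cached Object Rep to fill it
        return frame;
    }
};

struct Region {                                // one mapped range of the address space
    FCM* fcm = nullptr;
    HAT* hat = nullptr;
    uintptr_t start = 0;

    void handle_fault(uintptr_t vaddr) {
        long offset = static_cast<long>(vaddr - start);  // translate fault addr to file offset
        uintptr_t frame = fcm->get_page(offset);         // forward the request to the FCM
        hat->map(vaddr, frame);                          // then have the HAT map the page
    }
};
```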

  45. Advantage of using Object Oriented Approach • Consider the performance-critical case of an in-core page fault (a TLB miss fault for a page resident in memory) • Only objects specific to the faulting resource are invoked • The locks acquired and data structures accessed are internal to those objects • In contrast, many OSes use a global page cache or a single HAT layer, which becomes a source of contention

  46. Clustered Object Approach • The object-oriented approach is good, but some resources are just too widely shared • Enter clustering • E.g. the thread dispatch queue • If a single shared list is used: high contention (a bottleneck, plus more cache-coherence traffic) • Solution: partition the queue and give each processor a private list (no contention!)

  47. Clustered Object Approach • To its clients, a clustered object looks like a single object, accessed through a common clustered-object reference • It is actually made up of several component objects (representatives, or "reps"), each handling calls from a subset of the processors • Each call through the clustered-object reference is automatically directed to the appropriate local rep

  48. Clustered object • Shared counter example • It looks like a single shared counter • But it is actually made up of representative counters that each CPU can access independently (a sketch follows this slide)
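
A hedged sketch of a clustered shared counter in C++ (the class, the my_cpu helper, and the rep layout are illustrative assumptions, not Tornado's actual clustered-object machinery). Callers see one counter; each call goes to the calling CPU's local rep, and a read reconciles across all reps:

```cpp
#include <atomic>
#include <thread>

constexpr int kNumCpus = 4;                    // illustrative CPU count

// Stand-in for a real "which CPU am I on?" query; illustrative only.
int my_cpu() {
    return static_cast<int>(std::hash<std::thread::id>{}(std::this_thread::get_id()) % kNumCpus);
}

class ClusteredCounter {
    struct alignas(64) Rep {                   // one representative per CPU,
        std::atomic<long> value{0};            // padded to its own cache line
    };
    Rep reps_[kNumCpus];

public:
    // Updates touch only the local rep: no sharing, no lock.
    void increment() {
        reps_[my_cpu()].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Reads reconcile across all reps; slower, but reads of a counter are rare.
    long read() const {
        long total = 0;
        for (const Rep& r : reps_) total += r.value.load(std::memory_order_relaxed);
        return total;
    }
};
```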

  49. Degree of Clustering • One rep for the entire system: Cached Object Representative – read mostly, so all processors share a single rep • One rep per cluster of neighbouring processors: Region – read mostly, but on the critical path for all page faults • One rep per processor: FCM – maintains the state of a file's pages cached in memory; the hash table for the cache is split across the many reps

  50. Clustered Object Approach
