
Capriccio: Scalable Threads for Internet Services (von Behren)


Presentation Transcript


  1. Capriccio: Scalable Threads for Internet Services (von Behren) Kenneth Chiu

  2. Background
  • Non-blocking I/O vs. asynchronous I/O
    • Non-blocking I/O usually doesn’t work well for disks.
    • Async I/O: issue a request, later receive a completion notification.
  • epoll()/poll() (a minimal usage sketch follows this list)
  • Convoy: the tendency for threads to “bunch up”
  • Priority inversion
  • Call graph
  • Average, weighted moving average
  • capriccio: an improvisatory, free-form musical style
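  As background for the epoll() bullet above, here is a minimal sketch of the non-blocking I/O pattern: mark a descriptor non-blocking, register it with epoll, and only read() once the kernel reports readiness. This is generic Linux usage, not Capriccio's code; epfd is assumed to come from epoll_create1(0).

      #include <sys/epoll.h>
      #include <fcntl.h>
      #include <unistd.h>

      /* Mark fd non-blocking and register interest in readability. */
      void watch_fd(int epfd, int fd) {
          fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
          struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
          epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
      }

      /* Classic event loop: block until some fd is ready, then read it. */
      void event_loop(int epfd) {
          struct epoll_event events[64];
          for (;;) {
              int n = epoll_wait(epfd, events, 64, -1);
              for (int i = 0; i < n; i++) {
                  char buf[4096];
                  ssize_t r = read(events[i].data.fd, buf, sizeof buf);
                  (void)r;  /* handle data, EOF, or EAGAIN here */
              }
          }
      }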

  3. The Problem
  • Web “transactions” involve a number of steps that must be performed in sequence.
  • For high throughput, we want to service many of these requests concurrently.
  • When does concurrency help? When does it not?
  • If we use one thread per request, we end up with too many threads.
  • If we instead multiplex requests onto a small set of threads, programming becomes more difficult (the state machine on the next slide shows why).

  4. Read two numbers and add

      /* Event-driven version: a per-fd state machine; the step field
         records where each request is in its sequence of reads. */
      while (true) {
          fd = get_read_ready();
          state = lookup(fd);
          if (state.step == READING_FIRST) {
              c = read(fd, …, bytes_left);
              if (/* have enough */) {
                  state.step = READING_SECOND;
              }
          } else if (state.step == READING_SECOND) {
              …
          }
      }

      /* Threaded version: the control flow itself records where we are,
         so the code reads as a straight-line sequence. */
      while (true) {
          int n1, n2;
          readexact(fd, &n1, 4);
          readexact(fd, &n2, 4);
          printf("%d\n", n1 + n2);
      }

  5. Thread Design and Scalability

  6. The Case for User-Level Threads
  • Flexibility
    • A level of indirection between applications and the kernel, which helps decouple the two.
    • Kernel-level thread scheduling must handle all applications; user-level scheduling can be tailored to one.
    • Lightweight, which means we can use zillions of them.
  • Performance
    • Cooperative scheduling is nearly free.
    • Uncontended locks do not require a kernel crossing. (Why do contended locks require kernel crossings? See the sketch after this list.)
  • Disadvantages
    • Non-blocking I/O requires an additional system call. (Why?)
    • SMPs
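  A hedged sketch of the uncontended-lock point: the fast path is a single atomic instruction executed entirely in user space; only the contended path needs the kernel (e.g., a futex-style sleep) to block and wake threads. The names fast_lock, lock, and unlock are illustrative, not Capriccio's API.

      #include <stdatomic.h>

      typedef struct { atomic_int held; } fast_lock;   /* 0 = free, 1 = held */

      void lock(fast_lock *l) {
          int expected = 0;
          /* Fast path: one compare-and-swap in user space, no syscall. */
          if (atomic_compare_exchange_strong(&l->held, &expected, 1))
              return;
          /* Slow path (contended): this is where a kernel crossing would
             happen, e.g. a futex() wait to sleep until the lock is freed. */
          while (atomic_exchange(&l->held, 1) != 0)
              ;  /* placeholder: spin instead of sleeping */
      }

      void unlock(fast_lock *l) {
          atomic_store(&l->held, 0);
      }

  Under cooperative scheduling on a uniprocessor (slide 7), even the atomic is overkill, since a thread cannot be preempted inside the critical section.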

  7. Implementation
  • Context switches
    • Built on a coroutine library.
  • I/O
    • Intercept blocking system calls; use epoll() for the network and AIO for disk (a sketch of such a wrapper follows this list).
    • Can be less efficient.
  • Scheduling
    • The main scheduling loop looks very much like an event-driven application. (What is an event-driven application?)
    • Makes it relatively easy to switch schedulers.
  • Synchronization
    • Cooperative threading on a uniprocessor makes synchronization cheap.
  • Efficiency
    • All operations are O(1), except the sleep queue.
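  A hedged sketch of what “intercept blocking system calls” can look like: the wrapper tries a non-blocking read(), and if the kernel would block, it parks the current user-level thread and switches to another. The helpers register_interest() and yield_until_readable() are stand-ins for Capriccio's internals, which the slides do not show.

      #include <unistd.h>
      #include <errno.h>

      void register_interest(int fd);      /* assumed: add fd to the epoll set */
      void yield_until_readable(int fd);   /* assumed: block this user thread only */

      /* Replaces the blocking read() the application thinks it is calling.
         Assumes fd was put in non-blocking mode when it was opened. */
      ssize_t wrapped_read(int fd, void *buf, size_t count) {
          for (;;) {
              ssize_t n = read(fd, buf, count);
              if (n >= 0)
                  return n;                         /* data or EOF */
              if (errno != EAGAIN && errno != EWOULDBLOCK)
                  return -1;                        /* a real error */
              register_interest(fd);                /* ask epoll to watch fd */
              yield_until_readable(fd);             /* run some other thread */
          }
      }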

  8. Benchmarks
  • Linux box: 2 × 2.4 GHz Xeon, 1 GB memory, 2 × 10K RPM SCSI disks, gigabit Ethernet; Linux 2.5.70 with epoll() and AIO.
  • Solaris box: 2 × 1.2 GHz UltraSPARC III; Solaris 8.
  • Thread packages compared: Capriccio, LinuxThreads, NPTL.

  9. Thread Primitives

  10. Thread Scalability
  • Producer-consumer benchmark.

  11. Thread Scalability
  • The throughput drop between 100 and 1,000 threads is due to cache footprint.

  12. I/O Performance
  • pipetest: pass a number of tokens among a set of pipes.
  • Disk scheduling: a number of threads perform random 4 KB reads from a 1 GB file.
  • Disk I/O through the buffer cache: 200 threads reading with a fixed miss rate.

  13. When concurrency is low, performance is poorer.

  14. Benefits of disk head scheduling.

  15. I/O out of the buffer cache
  • Performance is lower due to AIO.

  16. Linked Stack Management

  17. Thread Stacks
  • With a lot of threads, the cumulative stack space can be quite large.
  • Solution: use a dynamic allocation policy and allocate on demand, linking stack chunks together (one possible chunk layout is sketched after this list).
  • Problem: how do you link stack chunks together? How do you know when to link in a new one?
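  To make “link stack chunks together” concrete, here is a hedged sketch of one possible representation: heap-allocated chunks chained by a back pointer, so a thread's stack can grow piecewise instead of reserving one large contiguous region. The struct and function names are illustrative, not Capriccio's.

      #include <stdlib.h>

      /* A thread's stack as a chain of heap-allocated chunks. */
      struct stack_chunk {
          struct stack_chunk *prev;   /* chunk we grew out of; NULL for the first */
          size_t size;                /* usable bytes in data[] */
          char data[];                /* the actual stack memory */
      };

      struct stack_chunk *grow_stack(struct stack_chunk *cur, size_t size) {
          struct stack_chunk *c = malloc(sizeof *c + size);
          if (!c) abort();            /* cannot grow the stack: give up */
          c->prev = cur;              /* remember where to return on unwind */
          c->size = size;
          return c;
      }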

  18. Weighted Call Graph
  • Use static analysis to create a weighted call graph.
  • Each node is weighted by the maximum stack space that that function might consume. (Why the maximum, and not the exact amount?)
  • Now what?

  19. Bounds
  • Most real-world programs use recursion, so no static bound on stack depth exists.
  • Even without recursion, a static bound wastes too much space.
  • Instead, insert checkpoints at key places that link in a new stack chunk when needed (a sketch follows this list).
  • Chunks are switched right before arguments are pushed.
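  A hedged sketch of what a compiler-inserted checkpoint might do, reusing the grow_stack() idea sketched above; stack_top, chunk_limit, and switch_to_new_chunk are illustrative names, not Capriccio's actual instrumentation.

      #include <stddef.h>

      extern char *stack_top;      /* assumed: current stack pointer in this chunk */
      extern char *chunk_limit;    /* assumed: lowest usable address of this chunk */
      void switch_to_new_chunk(size_t needed);   /* assumed: allocate and relink */

      /* Conceptually inserted at call sites / loop back edges: if the
         current chunk cannot hold the next 'needed' bytes of frames,
         link in a fresh chunk before the call pushes its arguments. */
      void checkpoint(size_t needed) {
          if ((size_t)(stack_top - chunk_limit) < needed)
              switch_to_new_chunk(needed);
      }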

  20. Placing Checkpoints
  • Ensure at least one checkpoint in every cycle by inserting them on back edges. (How?) (Is this efficient?)
  • Then make sure each checkpoint-free path (summing the node weights along it) is not too long.

  21. Example call graph:
  • Function B is executing.
  • Function D, both ways.
  • Recursion.

  22. Special Cases
  • Function pointers: difficult, but they try to analyze the possible targets.
  • External functions: allow annotations; alternatively, link in a large chunk.
  • Variable-length arrays (C99).

  23. Question
  • What kind of a problem is this?
  • Is it being solved at the right level?

  24. Resource-Aware Scheduling

  25. Admission Control
  • We’ve seen many graphs where performance degrades as some variable increases.
  • The goal of scheduling in Capriccio is to keep performance in the “good” part of the curve.

  26. Blocking Graph
  • Each node is a location where the program blocked; a location is identified by its call chain.
  • Generated at run time.
  • Annotate each node with resource usage: average running time (an exponentially-weighted moving average; see the sketch after this list), memory, stack, sockets, etc.
  • Maintain a run queue for each node. Admit threads until resources reach maximum capacity.
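  For the running-time annotation, a minimal sketch of an exponentially-weighted moving average; the smoothing weight ALPHA is an assumed placeholder, since the slides do not give Capriccio's constant.

      #include <stdio.h>

      #define ALPHA 0.25   /* assumed weight: larger reacts faster to new samples */

      /* new_avg = ALPHA * sample + (1 - ALPHA) * old_avg */
      double ewma_update(double avg, double sample) {
          return ALPHA * sample + (1.0 - ALPHA) * avg;
      }

      int main(void) {
          double samples[] = { 10.0, 12.0, 9.0, 40.0 };  /* e.g. measured run times */
          double avg = samples[0];                       /* seed with first sample */
          for (int i = 1; i < 4; i++) {
              avg = ewma_update(avg, samples[i]);
              printf("after %.1f: avg = %.2f\n", samples[i], avg);
          }
          return 0;
      }

  Note how the spike to 40.0 pulls the average up only partway; that damping is what makes the estimate stable enough to schedule on.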

  27. Pitfalls
  • Too many non-linear effects to predict.
  • One solution is to use some kind of instrumentation plus feedback control.
  • But even detecting that performance has degraded is hard.

  28. Web Server Test

  29. Summary
  • Control flow implicitly maintains state; that control flow can be traded for explicit state maintenance (as in event-driven code).
  • Threads perform two functions:
    • Maintaining state (the logical threads of the programming model).
    • Providing concurrency (the kernel’s job).
  • The two should be separated, since the overhead of concurrency is unnecessary when we just want to maintain state.
  • Cooperative multitasking has been denigrated before, but it can be good.
