Process Management & IPC In Multiprocessor Operating Systems

Process Management & IPC In Multiprocessor Operating Systems Presented by Group A1: Garrick Williamson, Brad Crabtree, Alex MacFarlane

Process Management & IPC Intro (Focus on Solaris) Garrick Williamson

Introduction • SunOS is the operating system component of the Solaris environment. • It supports Symmetric Multiprocessing (SMP). See the diagram on the next page for an example of an SMP system. • The kernel runs equally on all processors within a tightly coupled shared-memory multiprocessor system. • All control flow in the kernel is carried by threads, including interrupt handling.

SMP System Example

SunOS 5.0 Architecture • In addition to kernel-level threads, SunOS supports multiple threads of control within a process, called lightweight processes (LWPs). • There is one kernel thread for each LWP. The kernel thread is used when the LWP performs a system function or system call.

SunOS Architecture Diagram

Synchronization • Threads and processes synchronize using several primitives: • Mutual exclusion locks • Condition variables • Counting semaphores • Multiple-reader, single-writer locks • The mutual exclusion locks and writer locks use a priority inheritance protocol to prevent priority inversion.
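As an illustration of the first two primitives, here is a minimal sketch that uses the portable POSIX threads API rather than the Solaris-native thread calls (an assumption; the slides do not name an API). The names `handoff`, `producer`, `ready`, and `shared_value` are hypothetical. A producer thread publishes a value under a mutex and signals a condition variable; the caller waits on that condition variable until the value is ready.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state; every access is guarded by `lock`. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cv = PTHREAD_COND_INITIALIZER;
static bool ready = false;
static int shared_value = 0;

/* Producer: publish the value, then signal the waiting thread. */
static void *producer(void *arg)
{
    int value = *(int *)arg;

    pthread_mutex_lock(&lock);
    shared_value = value;
    ready = true;
    pthread_cond_signal(&ready_cv);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Start a producer thread, wait for its value, and return it (-1 on error). */
int handoff(int value)
{
    pthread_t tid;
    int result;

    pthread_mutex_lock(&lock);
    ready = false;
    pthread_mutex_unlock(&lock);

    if (pthread_create(&tid, NULL, producer, &value) != 0)
        return -1;

    pthread_mutex_lock(&lock);
    while (!ready)                      /* loop guards against spurious wakeups */
        pthread_cond_wait(&ready_cv, &lock);
    result = shared_value;
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return result;
}
```

Calling `handoff(42)` from `main()` should return 42. The predicate loop around `pthread_cond_wait()` is the usual idiom, since a wakeup alone does not guarantee the condition holds.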

Solaris IPC • Solaris provides the following mechanisms for IPC: • Simple but limited mechanisms include • Signals • Pipes and named pipes (FIFOs) • Sockets • More versatile mechanisms include • Message queues • Shared memory (via memory-mapped files or IPC shared memory) • Semaphores

Simple IPC • Pipes do not allow unrelated processes to communicate. • Named pipes allow unrelated processes to communicate, but are not private channels. • Using the kill function, processes may communicate with signals, but the only information conveyed is the signal number.
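The restriction on pipes follows from how they are created: the descriptors returned by `pipe()` exist only in the creating process and are inherited across `fork()`. The sketch below, assuming a POSIX system, shows a parent reading a message written by its child. The function name `pipe_from_child` is hypothetical.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that writes `msg` into a pipe; the parent reads it into `buf`.
 * Returns the number of bytes read, or -1 on error. */
ssize_t pipe_from_child(const char *msg, char *buf, size_t buflen)
{
    int fds[2];
    pid_t pid;
    size_t total = 0;
    ssize_t n;

    if (buflen == 0 || pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                     /* child: the writer */
        ssize_t w;
        close(fds[0]);                  /* the read end is not needed here */
        w = write(fds[1], msg, strlen(msg));
        close(fds[1]);
        _exit(w < 0 ? 1 : 0);
    }

    close(fds[1]);                      /* parent: the reader */
    while (total < buflen - 1 &&
           (n = read(fds[0], buf + total, buflen - 1 - total)) > 0)
        total += (size_t)n;
    buf[total] = '\0';

    close(fds[0]);
    waitpid(pid, NULL, 0);              /* reap the child */
    return (ssize_t)total;
}
```

An unrelated process has no way to obtain `fds`, which is why a named pipe (a FIFO with a filesystem path) is needed in that case.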

Complex IPC • Messaging allows formatted data streams to be sent to arbitrary processes. • Semaphores allow processes to synchronize. • Shared memory allows processes to share part of their virtual address space.

IRIX Process Management And IPC Brad Crabtree

Outline • Hardware Background • Process Management Facilities • Interprocess Communication Facilities

Large Scale Computing Machines a Reality • The Avalon A12 • The Cambridge Parallel Processing Gamma II Plus • The Compaq AlphaServer SC • The Fujitsu AP3000 • The Fujitsu VPP5000 series • The Hitachi SR8000 system • The HP Exemplar V2600 • The IBM RS/6000 SP • The NEC Cenju-4 • The NEC SX-5 • The Quadrics Apemille • The SGI Origin 2000 series • The Sun E10000 Starfire • The Tera/Cray SV1 • The Tera/Cray T3E • The Tera MTA • June 2001: Raytheon installs a 1152-processor Origin 3000 series system at NOAA • $67M • 900 BFLOPS • 2 PB tape library

SGI Origin Architecture • ccNUMA (NUMALink) • non-blocking crossbar switches as an interconnect fabric • 1.6GB-per-second crossbar switch

Switch versus Bus

“Cellular IRIX” Scheduler • Facilities for Improving Scalability and Locality • Job Priorities • Real-Time Jobs • Batch Critical • Time Share • Batch • Weightless • User-level Scheduler Concept

Real Time Jobs • Global Run Queue replaced with Implicit Binding Scheme • improve cache affinity and scalability • binds top N jobs, by priority, to N CPUs • CPU is always available when real-time job comes in because currently running job is of lower priority • Real-Time jobs always go to same CPU
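The binding rule above ("top N jobs, by priority, to N CPUs") can be sketched as a sort followed by an assignment. This is only an illustration of the idea, not the IRIX implementation: the `rt_job` structure, its fields, and the convention that a larger number means higher priority are all assumptions, and the sketch does not model keeping a job on the same CPU across rebinding.

```c
#include <stdlib.h>

/* Hypothetical job record; the field names are illustrative only. */
struct rt_job {
    int id;
    int priority;   /* assumption: larger value means higher priority */
    int cpu;        /* CPU the job is bound to, or -1 if unbound */
};

/* qsort comparator: highest priority first. */
static int by_priority_desc(const void *a, const void *b)
{
    const struct rt_job *ja = a;
    const struct rt_job *jb = b;

    return (jb->priority > ja->priority) - (jb->priority < ja->priority);
}

/* Sort the jobs by priority and bind the top `ncpus` of them to
 * CPUs 0..ncpus-1. Returns the number of jobs that were bound. */
size_t bind_top_jobs(struct rt_job *jobs, size_t njobs, size_t ncpus)
{
    size_t i, bound = 0;

    qsort(jobs, njobs, sizeof jobs[0], by_priority_desc);
    for (i = 0; i < njobs; i++) {
        if (i < ncpus) {
            jobs[i].cpu = (int)i;
            bound++;
        } else {
            jobs[i].cpu = -1;           /* waits until a CPU frees up */
        }
    }
    return bound;
}
```

Because each CPU holds one of the N highest-priority jobs, a newly arrived job of higher priority always finds a CPU running something it can preempt, without consulting a global run queue.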

Hard Real-Time in IRIX • REACT/PRO Extensions • Lock processes and memory to CPUs • Disable the IRIX scheduler and replace it with the Frame Scheduler, the Deadline Scheduler, or none (supply your own) • Direct interrupts away from CPUs • Deterministic interrupt latency

Time Sharing Scheduler • Degrading Priority replaced with Earnings Model • Distribution controlled by Virtual Multiprocessors (VMPs) • At 1 Hz, VMPs balance run queues with their nearest neighbors and push out extra work

Parallel Job Scheduling • Gang Scheduling replaced with Nanothreads • Space sharing preferred over time sharing • A job requests CPUs, is told how many are available, and the algorithm is then re-blocked to fit • When a thread is preempted, its context is saved to shared memory and the User Level Scheduler re-blocks again

Replicated Kernel Text • Kernel text is wired into kernel virtual memory space with a 16 MB TLB pair • One read-only, one read-write • TLB miss exception overhead is avoided

Memory Migration • Goal: avoid memory hot spots • Reference counters in the hub (local/remote) • Fast Block Transfer Engine • Marks the source page as poisoned • Lazy TLB shootdown • Hysteresis to manage frequent migration

Types of IPC & Compatibility

POSIX vs. IRIX Shared Memory (function name: purpose and operation) • POSIX functions • mmap(2): Map a file or shared memory object into the address space. • shm_open(2): Create, or gain access to, a shared memory object. • shm_unlink(2): Destroy a shared memory object when no references to it remain open. • IRIX functions • usconfig(3): Establish the default size of an arena, the number of concurrent processes that can use it, and the features of IPC objects in it. • usinit(3): Create an arena or join an existing arena. • usadd(3): Join an existing arena.

usconfig options (usconfig() flag name: meaning) • CONF_INITSIZE: The initial size of the arena segment. The default is 64 KB. Often you know that more is needed. • CONF_AUTOGROW: Whether or not the arena can grow automatically as more IPC objects or data objects are allocated (default: yes). • CONF_INITUSERS: The largest number of concurrent processes that can use the arena. The default is 8; if more processes than this will use IPC, the limit must be set higher. • CONF_CHMOD: The effective file permissions on arena access. The default is 600, allowing only processes with the effective UID of the creating process to attach the arena. • CONF_ARENATYPE: Establish whether the arena can be attached by general processes or only by members of one program (a share group). • CONF_LOCKTYPE: Whether or not lock objects allocated in the arena collect metering statistics as they are used. • CONF_ATTACHADDR: An explicit memory base address for the next arena to be created. • CONF_HISTON / CONF_HISTOFF: Start and stop collecting usage history (more bulky than metering information) for semaphores in a specified arena. • CONF_HISTSIZE: Set the maximum size of semaphore history records.

IRIX IPC • Tuned for Multiprocessor Environment • Utilizes “shared arena” memory • memory that can be mapped into the address spaces of multiple processes • A shared arena is identified with a file that acts as the backing store for the arena memory • shared memory is pinned into physical memory, accessible by programs and kernel

First Touch Rule • Pages in an arena are allocated via first touch • The virtual page is placed in the node that first accesses it • To ensure that processes spread across nodes have local access to their most-used pages, touch each page in the arena first from the process that uses it most • Otherwise dynamic reallocation will handle misplaced pages, but more slowly

Linux Process Management Alex MacFarlane

Threads • The number of threads is limited only by the size of physical memory. By default, thread structures may consume up to half of it: max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2; • Modifiable at runtime using sysctl() or the proc filesystem interface. • The limit was 4k in Linux 2.2
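A worked instance of the formula may help. The sketch below assumes 4 KB pages and an 8 KB `THREAD_SIZE`, typical for 32-bit x86 in that era but not stated in the slides. The function name `default_max_threads` is hypothetical.

```c
/* Evaluate max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2
 * for a machine with `mem_bytes` of physical memory.
 * The page and thread sizes below are assumptions (32-bit x86). */
unsigned long default_max_threads(unsigned long mem_bytes)
{
    const unsigned long page_size = 4096UL;            /* assumed PAGE_SIZE */
    const unsigned long thread_size = 2 * page_size;   /* assumed THREAD_SIZE */
    unsigned long mempages = mem_bytes / page_size;

    return mempages / (thread_size / page_size) / 2;
}
```

Under these assumptions a machine with 256 MB of RAM has 65536 pages, so the default limit is 65536 / 2 / 2 = 16384 threads.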

Thread Types • Idle Thread(s) • One per CPU in SMP system • Created at boot time • Kernel Threads • User-space Threads • Threads created by clone(), an extension to fork()

clone() flags • CLONE_VM • Share the address space (data and stack) • CLONE_FS • Share filesystem info • CLONE_FILES • Share open files • CLONE_SIGHAND • Share signal handlers • CLONE_PID • Share PID with parent
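The effect of `CLONE_VM` can be observed from user space with the glibc `clone()` wrapper. The following is a minimal Linux-only sketch, assuming glibc and a downward-growing stack. The names `child_fn`, `clone_shares_memory`, and `CHILD_STACK_SIZE` are hypothetical, and `CLONE_PID` is left out because current headers no longer provide it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                     /* needed for clone() in <sched.h> */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)    /* arbitrary size for this sketch */

/* Runs in the child; the write is visible to the parent only
 * because both tasks share one address space (CLONE_VM). */
static int child_fn(void *arg)
{
    *(int *)arg = 1;
    return 0;
}

/* Returns 1 if the parent saw the child's write, 0 if not, -1 on error. */
int clone_shares_memory(void)
{
    static int flag;
    int seen;
    pid_t pid;
    char *stack = malloc(CHILD_STACK_SIZE);

    if (stack == NULL)
        return -1;
    flag = 0;

    /* The stack grows downward, so pass the top of the allocation.
     * SIGCHLD in the low byte lets the parent reap the child with waitpid(). */
    pid = clone(child_fn, stack + CHILD_STACK_SIZE,
                CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, &flag);
    if (pid == -1 || waitpid(pid, NULL, 0) == -1) {
        free(stack);
        return -1;
    }

    seen = flag;
    free(stack);
    return seen;
}
```

Dropping `CLONE_VM` from the flags would make the call behave like `fork()` for memory, and the parent would then read 0.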

Linux Scheduling Policies • SCHED_OTHER • Traditional UNIX scheduling • SCHED_FIFO • Runs until blocking on I/O, explicitly yielding the CPU, or being pre-empted by a higher-priority realtime task. • SCHED_RR • Same as SCHED_FIFO but limited to a timeslice • Unprivileged user-space tasks must use SCHED_OTHER; the realtime policies require superuser privileges • For SCHED_OTHER tasks, static priorities may be assigned using nice()
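From user space these policies are selected through the POSIX scheduling calls. The sketch below assumes Linux and uses two hypothetical helpers: `policy_name` maps a policy constant to its name, and `request_fifo` asks for `SCHED_FIFO` at the lowest realtime priority, which is expected to fail with `EPERM` for an unprivileged process.

```c
#include <sched.h>

/* Map a scheduling policy constant to a printable name. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "unknown";
    }
}

/* Ask the kernel to run the calling process under SCHED_FIFO at the
 * lowest realtime priority. Returns 0 on success, -1 on failure. */
int request_fifo(void)
{
    struct sched_param sp;

    sp.sched_priority = sched_get_priority_min(SCHED_FIFO);
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

A caller can print `policy_name(sched_getscheduler(0))` before and after `request_fifo()` to see whether the change took effect.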

Process Representation • A collection of struct task_struct structures • Linked in two ways: • A hashtable hashed on pid • A circular doubly-linked list • Find specific task using find_task_by_pid() • Walk tasks using for_each_task() • Modifications protected by a read-write spinlock.
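The double linkage can be modelled in user space as follows. This is a simplified sketch, not kernel code: the `task` structure, the table size `PIDHASH_SZ`, and the helper names are assumptions, and the read-write spinlock that protects the real structures is omitted.

```c
#include <stddef.h>

#define PIDHASH_SZ 64                   /* assumed table size, illustration only */

struct task {
    int pid;
    struct task *next_task;             /* circular doubly-linked list */
    struct task *prev_task;
    struct task *hash_next;             /* chain within one hash bucket */
};

/* The list head points at itself while the list is empty. */
static struct task init_task = { 0, &init_task, &init_task, NULL };
static struct task *pidhash[PIDHASH_SZ];

static unsigned pid_hashfn(int pid)
{
    return (unsigned)pid % PIDHASH_SZ;
}

/* Link a task into both structures. */
void add_task(struct task *t)
{
    unsigned h = pid_hashfn(t->pid);

    t->next_task = &init_task;          /* insert just before the head */
    t->prev_task = init_task.prev_task;
    init_task.prev_task->next_task = t;
    init_task.prev_task = t;

    t->hash_next = pidhash[h];          /* push onto the bucket chain */
    pidhash[h] = t;
}

/* Lookup by pid through the hash table. */
struct task *find_task(int pid)
{
    struct task *t;

    for (t = pidhash[pid_hashfn(pid)]; t != NULL; t = t->hash_next)
        if (t->pid == pid)
            return t;
    return NULL;
}

/* Walk the circular list, in the style of a for_each_task() loop. */
int count_tasks(void)
{
    int n = 0;
    struct task *t;

    for (t = init_task.next_task; t != &init_task; t = t->next_task)
        n++;
    return n;
}
```

The hash table makes lookup by pid cheap, while the circular list supports operations that must visit every task.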

Process States • TASK_RUNNING: means the task is in the run queue. • TASK_INTERRUPTIBLE: means the task is sleeping but can be woken up by a signal or by expiry of a timer. • TASK_UNINTERRUPTIBLE: same as previous, except it cannot be woken up by a signal. • TASK_ZOMBIE: task has terminated but has not had its status collected (wait()-ed for) by the parent (natural or by adoption). • TASK_STOPPED: task was stopped, either due to job control signals or due to ptrace(). • TASK_EXCLUSIVE: this is not a separate state but can be OR-ed with either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. It prevents the “thundering herd” problem, in which every waiter wakes although only one can proceed. • A process's state may be modified asynchronously.
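Because TASK_EXCLUSIVE is a flag bit rather than a state, it is tested with a mask. The sketch below is a user-space model only: the `EX_` constants, their numeric values, and the simplified wake-up loop are assumptions chosen for illustration. The loop wakes sleepers in queue order and stops after waking the first exclusive one.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative state values; the EX_ names and numbers are assumptions. */
enum {
    EX_TASK_RUNNING         = 0,
    EX_TASK_INTERRUPTIBLE   = 1,
    EX_TASK_UNINTERRUPTIBLE = 2,
    EX_TASK_ZOMBIE          = 4,
    EX_TASK_STOPPED         = 8,
    EX_TASK_EXCLUSIVE       = 32
};

bool is_exclusive(unsigned state)
{
    return (state & EX_TASK_EXCLUSIVE) != 0;
}

/* The state with the exclusive flag masked off. */
unsigned base_state(unsigned state)
{
    return state & ~(unsigned)EX_TASK_EXCLUSIVE;
}

static bool is_sleeping(unsigned state)
{
    unsigned s = base_state(state);

    return s == EX_TASK_INTERRUPTIBLE || s == EX_TASK_UNINTERRUPTIBLE;
}

/* Wake sleepers in queue order, stopping after the first exclusive
 * sleeper has been woken. Returns the number of tasks woken. */
size_t wake_up_model(unsigned *states, size_t n)
{
    size_t i, woken = 0;

    for (i = 0; i < n; i++) {
        bool exclusive;

        if (!is_sleeping(states[i]))
            continue;
        exclusive = is_exclusive(states[i]);
        states[i] = EX_TASK_RUNNING;
        woken++;
        if (exclusive)
            break;                      /* remaining waiters stay asleep */
    }
    return woken;
}
```

With one ordinary sleeper followed by two exclusive sleepers, the model wakes the first two and leaves the third asleep, which is the behavior that avoids waking a whole herd for a single event.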

Atomic Operations • Two types • Bitmap • atomic_t • Wrapped by bus locking on SMP • Bitmap operations – for free/allocated bitmaps • set_bit(), clear_bit(), change_bit(), test_and_set_bit() etc. • atomic_t operations – for numeric counts • atomic_read(), atomic_set(), atomic_add(), atomic_inc() etc.
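The kernel primitives above are not available to user programs, but C11 `<stdatomic.h>` offers close analogues that show the semantics. In the sketch below the `_model` function names are hypothetical stand-ins, not the kernel functions themselves.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Analogue of test_and_set_bit(): atomically set bit `nr` of `word`
 * and report whether it was already set. `nr` must be smaller than
 * the number of bits in an unsigned int. */
bool test_and_set_bit_model(unsigned nr, atomic_uint *word)
{
    unsigned mask = 1u << nr;

    return (atomic_fetch_or(word, mask) & mask) != 0;
}

/* Analogue of clear_bit(): atomically clear bit `nr` of `word`. */
void clear_bit_model(unsigned nr, atomic_uint *word)
{
    atomic_fetch_and(word, ~(1u << nr));
}

/* Analogue of atomic_inc() followed by atomic_read(). */
int inc_and_read_model(atomic_int *counter)
{
    atomic_fetch_add(counter, 1);
    return atomic_load(counter);
}
```

The test-and-set form is what makes a free/allocated bitmap safe: two CPUs racing for the same bit both perform the set, but only one of them observes the old value as clear.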

