This paper discusses the integration of Quality of Service (QoS) support in general-purpose operating systems, specifically within Eclipse/BSD. It outlines the motivation behind QoS for server applications, highlights design goals such as isolation and fairness, and describes advanced scheduling techniques, including hierarchical proportional sharing. The paper also introduces key elements like the reservation file system and tagging mechanisms, outlining how these contribute to flexible resource management in distributed data centers serving globally connected clients.
QoS Support in Operating Systems Banu Özden Bell Laboratories ozden@research.bell-labs.com
Vision • Service providers will offer storage and computing services • through their distributed data centers • connected with high bandwidth networks • to globally distributed clients. • Clients will access these services via diverse devices and networks, e.g.: • mobile devices and wireless networks, • high-end computer systems and high bandwidth networks. • These services will become utilities (e.g., storage utility, computing utility). • Eventually resources will be exchanged and traded between geographically dispersed data centers to address fluctuating demand.
Eclipse/BSD:an Operating System with Quality of Service Support Banu Özden ozden@research.bell-labs.com
Motivation • QoS support for (server) applications: • web servers • video servers • Isolation and differentiation of different • entities serviced on the same platform • applications running on the same platform • QoS requirements: • client-based • service-based • content-based
Design Goals • QoS support in a general purpose operating system • Remain compatible with the underlying operating system • QoS parameters: • Isolation • Differentiation • Fairness • (Cumulative) throughput • Flexible resource management • capable of implementing a large set of provisioning needs • supports a large set of server applications without imposing significant changes to their design
Talk Outline • Schedulers • Reservation File System (reservfs) • Tagging • Web Server Experiments • Access Control and Profiles • Eclipse/BSD Status • Related Work • Future Work
Proportional sharing • Generalized processor sharing (GPS): each flow i has a weight φi; Wi(τ, t) denotes the service received by flow i in [τ, t], and B(τ, t) is the set of backlogged flows • For any flow i continuously backlogged in [τ, t]: Wi(τ, t) / Wj(τ, t) ≥ φi / φj for every flow j • Thus, the rate of flow i in [τ, t] is at least r · φi / Σj∈B(τ,t) φj, where r is the link rate
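The GPS rate guarantee above amounts to a one-line calculation. A minimal sketch (function name and numbers are illustrative, not from the talk):

```python
# Toy illustration of GPS: each continuously backlogged flow i with weight
# phi_i receives at least rate r * phi_i / sum(phi_j over backlogged flows j).

def gps_rates(weights, link_rate):
    """Guaranteed rates for a set of continuously backlogged flows."""
    total = sum(weights.values())
    return {flow: link_rate * w / total for flow, w in weights.items()}

# Two flows sharing a 100 Mbps link with weights 0.8 and 0.2:
rates = gps_rates({"A": 0.8, "B": 0.2}, link_rate=100.0)
# flow A is guaranteed at least 80 Mbps, flow B at least 20 Mbps
```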
QoS Guarantees • Fairness • Throughput • Packet delay
Schedulers in Eclipse • Resource characteristics differ • Different hierarchical proportional-share schedulers for resources • Link scheduler: WF2Q • Disk scheduler: YFQ • CPU scheduler: MTR-LS • Network input: SRP
Hierarchical GPS Example • [Figure: two servers. Left: plain proportional sharing with flat shares 0.4, 0.4 and 0.2 for company A page 1, company A page 2 and company B. Right: hierarchical proportional sharing with 0.8 for company A, split 0.5/0.5 between page 1 and page 2, and 0.2 for company B]
Schedulers • Hierarchical proportional sharing (GPS): each node n has a weight φn; Wn(τ, t) denotes the service received by the descendant queue nodes of node n, and Bp(τ, t) is the set of backlogged nodes among the immediate descendants of the parent of node n • For any node n continuously backlogged in [τ, t]: Wn(τ, t) / Wm(τ, t) ≥ φn / φm for every sibling node m in Bp(τ, t)
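Under the hierarchical scheme, a leaf's effective share of the resource is the product of the normalized weights along the path from the root scheduler. A small sketch using the numbers from the company A/B example (the helper name is an assumption):

```python
# Effective share of a leaf under hierarchical proportional sharing:
# multiply the normalized weights along the path from the root scheduler.

def effective_share(path_weights):
    share = 1.0
    for w in path_weights:
        share *= w
    return share

# Company A holds 0.8 of the server; each of its two pages holds 0.5 of that,
# so page 1 effectively receives 0.4 of the whole resource.
page1_share = effective_share([0.8, 0.5])
```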
Link Aggregation • [Figure: two link schedulers, one per physical link, behind a single logical link] • Need to incrementally scale bandwidth • Resource aggregation is emerging as a solution: • Grouping multiple resources into a single logical unit • QoS over such aggregated links?
Multi-Server Model • Multi-Server Fair Queuing (MSFQ): a packetized algorithm for a system with N links, each with bandwidth r, that approximates a GPS system with a single link of bandwidth Nr • [Figure: reference model, a single GPS server of rate Nr; packetized scheduler, MSFQ over N servers of rate r each]
Multi-Server Model (Contd.) • Goals: • Guarantee bandwidth and packet delay bounds that are independent of the number of flows • Allow flows to arrive and depart dynamically • Be work-conserving • Algorithm: • When a server is idle, schedule the packet that would complete transmission earliest under a single-server GPS system with a bandwidth of Nr (Sigcomm 2001)
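The dispatch rule above can be sketched in a few lines. This is a simplified simulation (unit transmission times, GPS finish times taken as given, tie-breaking by packet id; all of these simplifications are assumptions, not the paper's full algorithm):

```python
import heapq

# MSFQ dispatch rule sketch: whenever a server becomes idle, send the queued
# packet whose finish time under a single GPS server of rate N*r is earliest.

def msfq_schedule(packets, n_servers):
    """packets: list of (gps_finish_time, packet_id), all assumed backlogged.
    Returns a list of (packet_id, server) pairs in dispatch order."""
    ready = sorted(packets)                          # earliest GPS finish first
    servers = [(0.0, i) for i in range(n_servers)]   # (free_at, server_id)
    heapq.heapify(servers)
    assignment = []
    for finish, pkt in ready:
        free_at, srv = heapq.heappop(servers)        # next server to go idle
        assignment.append((pkt, srv))
        heapq.heappush(servers, (free_at + 1.0, srv))  # unit transmission time
    return assignment

# Three packets on two servers: p1 and p2 start immediately, p3 waits for
# whichever server frees up first.
order = msfq_schedule([(1.0, "p1"), (2.0, "p2"), (3.0, "p3")], 2)
```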
MSFQ Preliminary Properties • [Figure: packet-by-packet timelines comparing GPS, WFQ and MSFQ schedules with two and three servers] • Multi-server specific properties: • Ordering: a pair of packets scheduled in the order of their GPS finishing times may complete in reverse order • GPS busy ⇒ MSFQ busy, but the converse is not true • Non-coinciding busy periods • Work backlog?
MSFQ Properties • [Figure: GPS vs. MSFQ service curves, for the aggregate and for a single flow i; the vertical gap is the service discrepancy, the horizontal gap the packet delay] • Maximum service discrepancy (buffer requirement) • Maximum packet delay • Maximum per-flow service discrepancy
Schedulers (contd.) • Disk scheduling with QoS • tradeoffs between QoS and total disk performance • driver queue management • queue depth • queue ordering • fragmentation • Hierarchical YFQ • CPU scheduling with QoS • lengths of CPU phases are not known a priori • cumulative throughput • Hierarchical MTR-LS
Eclipse’s Key Elements • Hierarchical, proportional share resource schedulers • Reservation, reservation file system (reservfs) • Tagging mechanism • Access and admission control, reservation domain
Reservations and Schedulers • (Resource)reservations • unit for QoS assignment • similar to the concept of a flow in packet scheduling • Hierarchical schedulers • a tree with two kinds of nodes: • scheduler nodes • queue nodes • each node corresponds to a reservation • Schedulers are dynamically reconfigurable
Web Server Example • Hosting two companies' web sites, each with two web pages • [Figure: disk bandwidth, CPU cycles and network bandwidth each split 0.8/0.2 between company A and company B, with each company's share split 0.5/0.5 between page 1 and page 2]
Reservfs • [Figure: web server and video server sit above the reservation file system (application interface), which configures the CPU, link and disk schedulers (scheduler interface) over the CPUs, network interfaces and disks] • We built the reservation file system • to create and manipulate reservations • to access and configure resource schedulers
Reservfs • Hierarchical • Each reservation directory corresponds to a node at a scheduler • Each resource is represented by a reservation directory under /reserv (e.g., /reserv/cpu, /reserv/fxp0, /reserv/fxp1, /reserv/da0)
Reservfs • Two types of reservation directories: • scheduler directories • queue directories • Scheduler directories are hierarchically expandable • Queue directories are not expandable
Reservfs • Scheduler directory: • share • newqueue • newreserv • special queue: q0 • Queue directory: • share • backlog • [Figure: the /reserv tree with scheduler directories for cpu, fxp0, fxp1 and da0, each holding share, newqueue, newreserv and queue directories (q0, q1, …) that contain share and backlog files]
[Figure: architecture recap: applications (web server, video server) use the reservation file system as the application interface; the scheduler interface connects reservfs to the CPU, link and disk schedulers over the CPUs, network interfaces and disks]
Reservfs API • Creation of a new queue/scheduler reservation • fd = open("newqueue"/"newreserv", O_CREAT) • returns the fd of the newly created share file
Creating a Queue Reservation • fd = open("newqueue", O_CREAT) • [Figure: under /reserv/da0, a new queue directory q1 appears next to q0, with its own share and backlog files]
Creating a Scheduler Reservation • fd = open("newreserv", O_CREAT) • [Figure: under /reserv/da0, a new scheduler directory r0 appears with its own share, newqueue and newreserv files and special queue q0]
Reservfs API • Changing QoS parameters • writing a weight and min value to the share file • Getting QoS parameters • reading the share file • Getting/setting queue parameters • reading/writing the backlog file
Reservfs API Command line output: killerbee$ cd /reserv killerbee$ ls -al total 5 dr-xr-xr-x 0 root wheel 512 Sep 15 11:37 . drwxr-xr-x 20 root wheel 512 Sep 12 21:54 .. dr-xr-xr-x 0 root wheel 512 Sep 15 11:37 cpu dr-xr-xr-x 0 root wheel 512 Sep 15 11:37 fxp0 dr-xr-xr-x 0 root wheel 512 Sep 15 11:37 fxp1 killerbee$ cd fxp0 killerbee$ ls -alR total 6 dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 . dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 .. -rw------- 1 root wheel 1 Sep 15 11:39 newqueue -rw------- 1 root wheel 1 Sep 15 11:39 newreserv dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 q0 -r-------- 1 root wheel 1 Sep 15 11:39 share ./q0: total 4 dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 . dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 .. -rw------- 1 root wheel 1 Sep 15 11:39 backlog -rw------- 1 root wheel 1 Sep 15 11:39 share
Reservfs API killerbee$ cd r0 killerbee$ ls -al total 6 dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 . dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 .. -rw------- 1 root wheel 1 Sep 15 11:39 newqueue -rw------- 1 root wheel 1 Sep 15 11:39 newreserv dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 q0 -r-------- 1 root wheel 1 Sep 15 11:39 share killerbee$ echo "50 1000000" > newqueue killerbee$ ls -al total 6 dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 . dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 .. -rw------- 1 root wheel 1 Sep 15 11:39 newqueue -rw------- 1 root wheel 1 Sep 15 11:39 newreserv dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 q0 dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 q1 -r-------- 1 root wheel 1 Sep 15 11:39 share killerbee$ cd q1 killerbee$ ls -al total 4 dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 . dr-xr-xr-x 0 root wheel 512 Sep 15 11:39 .. -rw------- 1 root wheel 1 Sep 15 11:39 share -rw------- 1 root wheel 1 Sep 15 11:39 backlog killerbee$ cat share 50 1000000 killerbee$
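The transcript above can be mimicked with a toy in-memory model of reservfs naming and creation semantics (illustrative Python only; the real reservfs is a kernel file system built on the vnode/vfs interface, and the class and method names here are assumptions):

```python
# Toy model of reservfs: a scheduler directory owns queue directories (qN)
# and nested scheduler directories (rN); writing "weight min" to newqueue
# or newreserv creates the next one.

class SchedulerDir:
    def __init__(self, share="0 0"):
        self.share = share                       # "weight min", as in the share file
        self.queues = {"q0": {"share": "0 0", "backlog": "0"}}  # special queue q0
        self.children = {}                       # nested scheduler directories

    def newqueue(self, share):
        """Mimics 'echo "weight min" > newqueue': creates the next qN."""
        name = "q%d" % len(self.queues)          # q0 exists, so first call makes q1
        self.queues[name] = {"share": share, "backlog": "0"}
        return name

    def newreserv(self, share):
        """Mimics writing to newreserv: creates child scheduler directory rN."""
        name = "r%d" % len(self.children)
        self.children[name] = SchedulerDir(share)
        return name

da0 = SchedulerDir()
q1 = da0.newqueue("50 1000000")   # like the killerbee transcript: q1 appears
```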
[Figure: architecture recap: applications (web server, video server) use the reservation file system as the application interface; the scheduler interface connects reservfs to the CPU, link and disk schedulers over the CPUs, network interfaces and disks]
Reservfs Scheduler Interface • Schedulers register by providing the following interface routines via reservfs_register(): • init(priv) • create(priv, parent, type) • start(priv, parent, type) • delete(priv, node) • get/set(priv, node, values, type)
Reservfs Implementation • Built via the vnode/vfs interface • A reserv{} structure represents each reservfs file • A reserv{} representing a directory contains a pointer to the corresponding node at the scheduler • Scheduler independent • Implements a garbage collection mechanism
Talk Outline • Introduction • Schedulers • Reservation File System (reservfs) • Tagging • Web Server Experiments • Access Control and Profiles • Eclipse/BSD Status • Related Work • Future Work
Tagging • A request arriving at a scheduler must be associated with the appropriate reservation • Each request is tagged with a pointer to a queue node • mbuf{}, buf{} and proc{} are augmented • How is a request tagged?
Tagging (contd.) • For a file, its file descriptor is tagged with a disk reservation • For a connected socket, its file descriptor is tagged with a network reservation • For unconnected sockets, we provide a late tagging mechanism • Each process is tagged with a cpu reservation • We associate reservations with references to objects
Default List of a Process • Default reservations of a process, one for each resource • A list of tags (pointers to queue directories) • Used when a tag is otherwise not specified • Two new files are added for each process pid in /proc/pid • /proc/pid/default to represent the default list • /proc/pid/cdefault to represent the child default list
Default List of a Process (contd.) • Reading these files returns the names of the default queue directories, e.g.: /reserv/cpu/q1 /reserv/fxp0/r2/q1 /reserv/da0/r1/q3 • A process with the appropriate access rights can change the entries of the default files
Implicit Tagging • The file descriptor returned by open(), accept() or connect() is automatically tagged with the default • The tag of the file descriptor of an unconnected socket is set to the default at sendto() and sendmsg() • When a process forks, the child process is tagged with the default cpu reservation
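The implicit tagging rules can be sketched as a toy process model (illustrative only; class and method names, and the example reservation paths, are assumptions based on the default-list slides):

```python
# Sketch of implicit tagging: each process carries a default reservation per
# resource, and newly created descriptors inherit the matching default tag.

class Process:
    def __init__(self, defaults):
        # e.g. {"cpu": "/reserv/cpu/q1", "da0": "/reserv/da0/r1/q3"}
        self.defaults = dict(defaults)
        self.fd_tags = {}                 # fd -> queue-directory tag

    def open_file(self, fd, resource):
        # like open()/accept()/connect(): fd is tagged with the default
        self.fd_tags[fd] = self.defaults[resource]

    def fork(self):
        # the child inherits the default reservations (cf. cdefault)
        return Process(self.defaults)

p = Process({"cpu": "/reserv/cpu/q1", "da0": "/reserv/da0/r1/q3"})
p.open_file(3, "da0")                     # fd 3 now carries the disk tag
child = p.fork()                          # child tagged with the default cpu reservation
```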
Explicit Tagging • The tag of a file descriptor can be set/read with new commands to fcntl(): • F_SET_RES • F_GET_RES • A new system call chcpures() to change the cpu reservation of a process
Reservation Domains • Permissions of a process to use, create and manipulate reservations • The reservation domain of a process is independent of its protection domain
Reservations and Reservation Domains • [Figure: disk bandwidth, network bandwidth and CPU cycles each split 0.8/0.2 between reserv A and reserv B, with each further split 0.5/0.5 into reserv 1 and reserv 2; reservation domain 1 and reservation domain 2 group the corresponding reservations across the three resources]
Reservfs Garbage Collection • Based on reference counts • every application using a specific node adds a reference to it (to the vnode) • Triggered by the vnode layer • when the last application finishes using a node, the node is garbage collected • an fcntl() command is available to retain a node even when no references to it exist
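The reference-counting idea can be reduced to a few lines (a minimal sketch; names are assumptions, and the real mechanism lives in the vnode layer):

```python
# Minimal reference-counting sketch of reservfs garbage collection:
# a node is reclaimed when its last reference drops, unless it was pinned
# (like the fcntl() command that retains a node with no references).

class Node:
    def __init__(self, name):
        self.name = name
        self.refs = 0
        self.collected = False
        self.pinned = False

    def ref(self):
        self.refs += 1

    def unref(self):
        self.refs -= 1
        if self.refs == 0 and not self.pinned:
            self.collected = True        # last user gone: reclaim the node

n = Node("/reserv/da0/r1/q3")
n.ref(); n.unref()                       # n is garbage collected
m = Node("/reserv/cpu/q1")
m.pinned = True
m.ref(); m.unref()                       # pinned node survives with no references
```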
SRP Input Processing • Demultiplexes incoming packets • before network and higher-level protocol processing • Unprocessed input queue per socket • Processes input protocols in the context of the receiving process • Drops packets when the per-socket queue is full • Avoids receive livelock
Talk Outline • Introduction • Schedulers • Reservation File System (reservfs) • Tagging • Web Server Experiments • Access Control and Profiles • Eclipse/BSD Status • Related Work • Future Work
QoS Support for Web Server • Virtual hosting with the Apache server: • separate Apache server for each virtual host • single Apache server for all virtual hosts • Eclipse/BSD isolates and differentiates the performance of virtual hosts • multiple Apache servers: implicit tagging • single Apache server: explicit tagging • We implemented an Apache module for explicit tagging
Experimental Setup • Apache Web Server: • A multi-process server • (Pre)spawns helper processes • A process handles one request at a time • Each process calls accept() to service the next connection request • HTTP clients run on five different machines • Servers are running FreeBSD 2.2.8 or Eclipse/BSD 2.2.8 on a PC (266 MHz Pentium Pro, 64 MB RAM, 9 GB Seagate ST39173W fast wide SCSI disk) • Machines are connected with a 10/100 Mbps Ethernet switch
Experiments • Hosting two sites with two servers • [Figure: reservation domains of server 1 and server 2: each server has its own queue (q1 vs. q2) under /reserv/cpu, /reserv/fxp0 and /reserv/da0]