Linux Operating System 許富皓


NameSpace

Namespace [Michael Kerrisk] Currently, Linux implements six different types of namespaces. The CLONE_NEW* identifiers listed in parentheses are the names of the constants used to identify namespace types when employing the namespace-related APIs (clone(), unshare(), and setns()).

Six Linux Namespaces
• Mount namespaces (CLONE_NEWNS, Linux 2.4.19)
• UTS namespaces (CLONE_NEWUTS, Linux 2.6.19)
• IPC namespaces (CLONE_NEWIPC, Linux 2.6.19)
• PID namespaces (CLONE_NEWPID, Linux 2.6.24)
• Network namespaces (CLONE_NEWNET, Linux 2.6.29)
• User namespaces (CLONE_NEWUSER, Linux 3.8)

Goals of Namespace (1) [Michael Kerrisk] The purpose of each namespace is to wrap a particular global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource.

Goals of Namespace (2) [Michael Kerrisk] One of the overall goals of namespaces is to support the implementation of containers, a tool for lightweight virtualization (as well as other purposes) that provides a group of processes with the illusion that they are the only processes on the system.

PID Namespace [Michael Kerrisk] The global resource isolated by PID namespaces is the process ID number space. This means that processes in different PID namespaces can have the same process ID. PID namespaces are used to implement containers that can be migrated between host systems while keeping the same process IDs for the processes inside the container.

Process PID [Michael Kerrisk] As with processes on a traditional Linux (or UNIX) system, the process IDs within a PID namespace are unique, and are assigned sequentially starting with PID 1. Likewise, as on a traditional Linux system, PID 1, the init process, is special: it is the first process created within the namespace, and it performs certain management tasks within the namespace.

Creation of a New PID Namespace [Michael Kerrisk]
• A new PID namespace is created by calling clone() with the CLONE_NEWPID flag:
    child_pid = clone(childFunc, child_stack, CLONE_NEWPID | SIGCHLD, argv[1]);
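The call above can be exercised with a minimal sketch such as the following. The helper name pid_seen_in_new_namespace, the function child_func, and the 1 MB stack size are assumptions made for illustration; they are not taken from the slides. Creating a PID namespace normally requires CAP_SYS_ADMIN, so the sketch reports failure instead of assuming it runs as root.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)    /* child stack size; an arbitrary choice */

/* Runs in the child: its exit status is the PID it sees for itself. */
static int child_func(void *arg)
{
    (void)arg;
    return (int)getpid();           /* expected to be 1 in a new PID namespace */
}

/*
 * Hypothetical helper: create a child in a new PID namespace and return the
 * PID the child observed, or -1 if clone() failed (typically EPERM when the
 * caller lacks CAP_SYS_ADMIN).
 */
int pid_seen_in_new_namespace(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward on most architectures, so pass its top. */
    pid_t child = clone(child_func, stack + STACK_SIZE,
                        CLONE_NEWPID | SIGCHLD, NULL);
    if (child == -1) {
        free(stack);
        return -1;
    }

    int status = 0;
    int seen = -1;
    if (waitpid(child, &status, 0) == child && WIFEXITED(status))
        seen = WEXITSTATUS(status);

    free(stack);
    return seen;
}
```

When the call succeeds, the value returned by clone() in the parent is the child's PID in the parent's namespace, while the child itself sees PID 1.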

PID Namespace Hierarchy [Michael Kerrisk]
• PID namespaces form a hierarchy.
• A process can "see" only those processes contained in its own PID namespace and in the child namespaces nested below that PID namespace.
• If the parent of the child created by clone() is in a different namespace, the child cannot "see" the parent; therefore, getppid() reports the parent PID as being zero.

PID Namespace Hierarchy [text book]

/proc/PID Directory [Michael Kerrisk]
• Within a PID namespace, the /proc/PID directories show information only about
  • processes within that PID namespace or
  • processes within one of its descendant namespaces.

Mount a proc filesystem [Michael Kerrisk] However, in order to make the /proc/PID directories that correspond to a PID namespace visible, the proc filesystem ("procfs" for short) needs to be mounted from within that PID namespace. From a shell running inside the PID namespace (perhaps invoked via the system() library function), we can do this using a mount command of the following form:
    # mount -t proc proc /mount_point

Nested PID Namespaces [Michael Kerrisk]
• PID namespaces are hierarchically nested in parent-child relationships.
• Within a PID namespace, it is possible to see
  • all other processes in the same namespace, as well as
  • all processes that are members of descendant namespaces.

“See” a Process [Michael Kerrisk]
• Here, "see" means being able to make system calls that operate on specific PIDs.
  • e.g., using kill() to send a signal to a process.
• Processes in a child PID namespace cannot see processes that exist (only) in the parent PID namespace (or further removed ancestor namespaces).
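One way to probe whether a PID is visible from the caller's namespace is to call kill() with signal number 0, which performs only the existence and permission checks and delivers nothing. The helper name can_see is an assumption made for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if `pid` names a process that is visible
 * from the caller's PID namespace, 0 otherwise.
 */
int can_see(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;               /* process exists and may be signalled */
    return errno == EPERM;      /* exists, but we lack permission to signal it */
}
```

A process that lives only in an ancestor namespace has no PID in the caller's namespace, so there is no number the caller could pass that would reach it; an unused number fails with ESRCH and the helper returns 0.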

PID returned by getpid() [Michael Kerrisk] A process will have one PID in each of the layers of the PID namespace hierarchy starting from the PID namespace in which it resides through to the root PID namespace. Calls to getpid() always report the PID associated with the namespace in which the process resides.
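On kernels that provide the NSpid field in /proc/PID/status (Linux 4.1 and later), the per-layer PIDs of a process can be inspected from user space. The following is a minimal sketch that counts the entries on that line; the helper name count_namespace_pids is an assumption, and the function returns -1 when procfs or the field is not available.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: count the PIDs listed on the NSpid line of
 * /proc/self/status. Returns -1 if the file or the field is unavailable.
 */
int count_namespace_pids(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[512];
    int count = -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "NSpid:", 6) != 0)
            continue;

        count = 0;
        char *p = line + 6;
        for (;;) {
            char *end;
            (void)strtol(p, &end, 10);  /* skips leading whitespace */
            if (end == p)               /* no more numbers on the line */
                break;
            count++;
            p = end;
        }
        break;
    }
    fclose(f);
    return count;
}
```

A process in the namespace that mounted procfs shows one entry; a process nested one level deeper shows two, and so on.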

Traditional init Process and Signals
• The traditional Linux init process is treated specially with respect to signals.
• The only signals that can be delivered to init are those for which the process has established a signal handler.
• All other signals are ignored.
• This prevents the init process, whose presence is essential for the stable operation of the system, from being accidentally killed, even by the superuser.

init Processes of Namespaces and Signals
• PID namespaces implement some analogous behavior for the namespace-specific init process.
• Other processes in the namespace (even privileged processes) can send only those signals for which the init process has established a handler.
• Note that (as for the traditional init process) the kernel can still generate signals for the PID namespace init process in all of the usual circumstances, e.g.,
  • hardware exceptions,
  • terminal-generated signals such as SIGTTOU, and
  • expiration of a timer.

Signals from Ancestor Namespaces • Signals can be sent to the PID namespace init process by processes in ancestor PID namespaces. • Again, only the signals for which the init process has established a handler can be sent, with two exceptions: • SIGKILL and • SIGSTOP.

init Process and SIGKILL and SIGSTOP When a process in an ancestor PID namespace sends SIGKILL or SIGSTOP to the init process, the signal is forcibly delivered (and can't be caught). The SIGSTOP signal stops the init process; SIGKILL terminates it.

Termination of init Processes Since the init process is essential to the functioning of the PID namespace, if the init process is terminated by SIGKILL (or it terminates for any other reason), the kernel terminates all other processes in the namespace by sending them a SIGKILL signal.

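Connection between Processes and Namespaces
Each process descriptor (struct task_struct) refers to the namespaces it belongs to through its nsproxy field:
    struct nsproxy *nsproxy;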

Definition of struct nsproxy
    struct nsproxy {
        atomic_t count;
        struct uts_namespace *uts_ns;
        struct ipc_namespace *ipc_ns;
        struct mnt_namespace *mnt_ns;
        struct pid_namespace *pid_ns;
        struct net *net_ns;
    };
An nsproxy is shared by processes which share all namespaces. As soon as a single namespace is cloned or unshared, the nsproxy is copied.

struct nsproxy A structure to contain pointers to all per-process namespaces - fs (mount), uts, network, ipc, etc. 'count' is the number of processes holding a reference. The count for each namespace, then, will be the number of nsproxies pointing to it, not the number of processes.
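The sharing and copy-on-unshare behavior, and the distinction between the two kinds of counts, can be illustrated with a small user-space model. This is only a sketch: all type and function names below (uts_ns_stub, pidns_stub, nsproxy_model, task_model, task_share_namespaces, task_unshare_uts) are hypothetical, only two namespace kinds are modeled, and releasing an nsproxy whose count drops to zero is left out.

```c
#include <stdlib.h>

/* Stand-ins for kernel namespace objects; refs counts the proxies using them. */
struct uts_ns_stub { int refs; };
struct pidns_stub  { int refs; };

/* Simplified model of struct nsproxy with two namespace pointers. */
struct nsproxy_model {
    int count;                      /* number of tasks sharing this proxy */
    struct uts_ns_stub *uts_ns;
    struct pidns_stub  *pid_ns;
};

struct task_model {
    struct nsproxy_model *nsproxy;
};

/* fork()-like step: the child shares the parent's proxy. */
void task_share_namespaces(struct task_model *child, struct task_model *parent)
{
    child->nsproxy = parent->nsproxy;
    child->nsproxy->count++;        /* namespace refs are NOT touched */
}

/* unshare()-like step for the UTS namespace: copy the proxy, swap one pointer. */
int task_unshare_uts(struct task_model *t, struct uts_ns_stub *new_uts)
{
    struct nsproxy_model *old = t->nsproxy;
    struct nsproxy_model *copy = malloc(sizeof *copy);
    if (copy == NULL)
        return -1;

    *copy = *old;
    copy->count = 1;                /* only this task uses the copy */
    copy->uts_ns = new_uts;
    new_uts->refs++;                /* one proxy now points at the new UTS ns */
    copy->pid_ns->refs++;           /* a second proxy points at the same PID ns */

    old->count--;                   /* the task no longer uses the old proxy */
    t->nsproxy = copy;
    return 0;
}
```

In this model, two tasks sharing one proxy leave each namespace's reference count at 1; the count rises to 2 only once a second proxy exists.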

Initial Global Namespace
    struct nsproxy init_nsproxy = {
        .count = ATOMIC_INIT(1),
        .uts_ns = &init_uts_ns,
    #if defined(CONFIG_POSIX_MQUEUE) || defined(CONFIG_SYSVIPC)
        .ipc_ns = &init_ipc_ns,
    #endif
        .mnt_ns = NULL,
        .pid_ns = &init_pid_ns,
    #ifdef CONFIG_NET
        .net_ns = &init_net,
    #endif
    };

Process Identification Number Unix processes are always assigned a number to uniquely identify them in their namespace. This number is called the process identification number or PID for short. Each process generated with fork or clone is automatically assigned a new unique PID value by the kernel.

Process ID
• PIDs are numbered sequentially in each PID namespace: the PID of a newly created process is normally the PID of the previously created process increased by one.
• Of course, there is an upper limit on the PID values; when the kernel reaches this limit, it must start recycling the lower, unused PIDs.
• By default, the maximum PID number is PID_MAX_DEFAULT-1 (32,767, or 4,095 when CONFIG_BASE_SMALL is set).
• The administrator can change the limit through /proc/sys/kernel/pid_max, up to PID_MAX_LIMIT-1 (4,194,303 on 64-bit systems).

Maximum PID Number
    #define PAGE_SHIFT 12
    #define PAGE_SIZE (1UL << PAGE_SHIFT)
    #define PID_MAX_DEFAULT (CONFIG_BASE_SMALL ? 0x1000 : 0x8000)
    #define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
        (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))
P.S.: With 4 KB pages, PID_MAX_LIMIT is equal to 2^15 (32768) or 2^22 (4194304).
    #define PIDMAP_ENTRIES ((PID_MAX_LIMIT + 8*PAGE_SIZE - 1)/PAGE_SIZE/8)
P.S.: PIDMAP_ENTRIES is equal to 1 or 2^7 (128).
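The arithmetic behind these macros can be checked with a short sketch that turns them into functions, so that several configurations can be compared in one program. The function names and the explicit base_small, sizeof_long, and page_size parameters are assumptions made for illustration; the kernel uses the macros directly.

```c
#include <stddef.h>

/* PID_MAX_DEFAULT as a function of CONFIG_BASE_SMALL. */
unsigned long pid_max_default(int base_small)
{
    return base_small ? 0x1000UL : 0x8000UL;
}

/* PID_MAX_LIMIT as a function of the configuration and word size. */
unsigned long pid_max_limit(int base_small, size_t sizeof_long,
                            unsigned long page_size)
{
    if (base_small)
        return page_size * 8;
    return sizeof_long > 4 ? 4UL * 1024 * 1024 : pid_max_default(base_small);
}

/* PIDMAP_ENTRIES: number of bitmap pages needed to cover `limit` PIDs. */
unsigned long pidmap_entries(unsigned long limit, unsigned long page_size)
{
    return (limit + 8 * page_size - 1) / page_size / 8;
}
```

With 4096-byte pages, one page holds 32768 bits, so a 32-bit or CONFIG_BASE_SMALL configuration needs a single bitmap page, while the 64-bit limit of 4194304 PIDs needs 128 pages.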

PIDs in PID Namespaces Namespaces add some additional complexity to how PIDs are managed. PID namespaces are organized in a hierarchy.

A Process May Have Multiple PIDs When a new namespace is created, all PIDs that are used in this namespace are visible to the parent namespace, but the child namespace does not see PIDs of the parent namespace. However, this implies that some processes are equipped with more than one PID, namely, one per namespace they are visible in. This must be reflected in the data structures.

Global IDs Global IDs are identification numbers that are valid within the kernel itself and in the initial global namespace. For each ID type, a given global identifier is guaranteed to be unique in the whole system.

Local IDs Local IDs belong to a specific namespace and are not globally valid. For each ID type, they are valid within the namespace to which they belong, but identifiers of identical type may appear with the same ID number in a different namespace.

Global PID and TGID
• The global PID and TGID are directly stored in the task struct, namely, in the elements pid and tgid:
    typedef int __kernel_pid_t;
    typedef __kernel_pid_t pid_t;
    struct task_struct {
        ...
        pid_t pid;
        pid_t tgid;
        ...
    };

Representation of a PID Namespace
    struct pid_namespace {
        struct kref kref;
        struct pidmap pidmap[PIDMAP_ENTRIES];
        int last_pid;
        unsigned int nr_hashed;
        struct task_struct *child_reaper;
        struct kmem_cache *pid_cachep;
        unsigned int level;
        struct pid_namespace *parent;
        :
        struct user_namespace *user_ns;
        struct work_struct proc_work;
        kgid_t pid_gid;
        int hide_pid;
        int reboot; /* group exit code if this pidns was rebooted */
        unsigned int proc_inum;
    };

child_reaper Field Every PID namespace is equipped with a process that assumes the role taken by init in the global picture. One of the purposes of init is to call wait4 for orphaned processes, and this must likewise be done by the init process of the namespace. A pointer to the task structure of this process is stored in child_reaper.

parent Field parent is a pointer to the parent namespace, and level denotes the depth in the namespace hierarchy. The initial namespace has level 0, any children of this namespace are in level 1, children of children are in level 2, and so on. Counting the levels is important because IDs in higher levels must be visible in lower levels.
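The role of parent and level can be sketched with a small user-space model. The names pid_ns_model, pid_ns_init_child, and pid_ns_visible_from are hypothetical; the model keeps only the two fields discussed here.

```c
#include <stddef.h>

/* Model of the two hierarchy-related fields of struct pid_namespace. */
struct pid_ns_model {
    unsigned int level;             /* depth in the hierarchy; root is 0 */
    struct pid_ns_model *parent;    /* NULL for the initial namespace */
};

/* Link a new namespace below `parent`, one level deeper. */
void pid_ns_init_child(struct pid_ns_model *child, struct pid_ns_model *parent)
{
    child->parent = parent;
    child->level = parent->level + 1;
}

/*
 * Returns 1 if processes living in `ns` are visible from `viewer`, that is,
 * if `viewer` is `ns` itself or one of its ancestors; 0 otherwise.
 */
int pid_ns_visible_from(const struct pid_ns_model *ns,
                        const struct pid_ns_model *viewer)
{
    for (; ns != NULL; ns = ns->parent)
        if (ns == viewer)
            return 1;
    return 0;
}
```

Walking the parent chain from a namespace at level n visits exactly n+1 namespaces, which is also the number of PIDs a process in that namespace carries.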

pidmap Field
    struct pidmap {
        atomic_t nr_free;
        void *page;
    };
    #define PIDMAP_ENTRIES ((PID_MAX_LIMIT + 8*PAGE_SIZE - 1)/PAGE_SIZE/8)
    struct pid_namespace {
        :
        struct pidmap pidmap[PIDMAP_ENTRIES];
        :
    };

PID bitmap [1][2][3] To keep track of which PIDs have been allocated and which are still free, the kernel uses a large bitmap in which each PID is identified by a bit. The value of the PID is obtained from the position of the bit in the bitmap.
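Because the bitmap is split across page-sized pidmap entries, a PID corresponds to a pair (entry index, bit offset within the page). The following sketch shows the mapping in both directions, assuming 4096-byte pages; the names MODEL_PAGE_SIZE, MODEL_BITS_PER_PAGE, pid_slot, pid_to_slot, and slot_to_pid are hypothetical.

```c
#define MODEL_PAGE_SIZE     4096UL                  /* assumed page size */
#define MODEL_BITS_PER_PAGE (MODEL_PAGE_SIZE * 8)   /* 32768 bits per page */

/* Position of a PID inside the paged bitmap. */
struct pid_slot {
    unsigned long map_index;    /* which pidmap[] entry */
    unsigned long bit_offset;   /* which bit inside that entry's page */
};

struct pid_slot pid_to_slot(unsigned long pid)
{
    struct pid_slot s;
    s.map_index = pid / MODEL_BITS_PER_PAGE;
    s.bit_offset = pid % MODEL_BITS_PER_PAGE;
    return s;
}

unsigned long slot_to_pid(struct pid_slot s)
{
    return s.map_index * MODEL_BITS_PER_PAGE + s.bit_offset;
}
```

For example, PID 40000 falls in the second page (index 1) at bit offset 7232, since 40000 = 1 * 32768 + 7232.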

Allocate a Free PID Allocating a free PID is then restricted essentially to looking for the first bit in the bitmap whose value is 0; this bit is then set to 1.
    static int alloc_pidmap(struct pid_namespace *pid_ns)

Free a PID Freeing a PID can be implemented by "toggling" the corresponding bit from 1 to 0.
    static void free_pidmap(struct upid *upid)
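Allocation and release on a bitmap can be sketched in user space as follows. This is a simplified model, not the kernel's alloc_pidmap() or free_pidmap(): it uses a tiny 64-PID bitmap, scans from the lowest bit on every allocation, and ignores concurrency. All names (pidmap_model and its functions, MODEL_MAX_PIDS) are hypothetical, and reserving PID 0 is a choice made so that the first allocated PID is 1.

```c
#include <limits.h>
#include <stddef.h>

#define MODEL_MAX_PIDS 64
#define MODEL_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_WORDS ((MODEL_MAX_PIDS + MODEL_WORD_BITS - 1) / MODEL_WORD_BITS)

struct pidmap_model {
    unsigned long bits[MODEL_WORDS];    /* bit n set means PID n is in use */
    int nr_free;
};

void pidmap_model_init(struct pidmap_model *m)
{
    for (size_t i = 0; i < MODEL_WORDS; i++)
        m->bits[i] = 0;
    m->bits[0] |= 1UL;                  /* reserve PID 0 in this model */
    m->nr_free = MODEL_MAX_PIDS - 1;
}

/* Find the first clear bit, set it, and return its position; -1 if full. */
int pidmap_model_alloc(struct pidmap_model *m)
{
    for (int pid = 0; pid < MODEL_MAX_PIDS; pid++) {
        unsigned long *word = &m->bits[pid / MODEL_WORD_BITS];
        unsigned long mask = 1UL << (pid % MODEL_WORD_BITS);
        if ((*word & mask) == 0) {
            *word |= mask;
            m->nr_free--;
            return pid;
        }
    }
    return -1;
}

/* Clear the bit for `pid` if it is currently set. */
void pidmap_model_free(struct pidmap_model *m, int pid)
{
    unsigned long *word = &m->bits[pid / MODEL_WORD_BITS];
    unsigned long mask = 1UL << (pid % MODEL_WORD_BITS);
    if (*word & mask) {
        *word &= ~mask;
        m->nr_free++;
    }
}
```

A first-fit scan reuses a freed PID immediately; a real allocator that prefers sequential numbering would instead resume the search after the most recently allocated PID, which is what the last_pid field of struct pid_namespace suggests.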

struct upid
• struct upid represents the information that is visible in a specific namespace.
    struct upid {
        /* Try to keep pid_chain in the same cacheline as nr for find_vpid */
        int nr;
        struct pid_namespace *ns;
        struct hlist_node pid_chain;
    };

Fields of struct upid nr represents the numerical value of an ID, and ns is a pointer to the namespace to which the value belongs. All upid instances are kept on a hash table to which we will come in a moment, and pid_chain allows for implementing hash overflow lists with standard methods of the kernel.

The Kernel-internal Representation of a PID
• struct pid is the kernel-internal representation of a PID.
    struct pid {
        atomic_t count;
        unsigned int level;
        /* lists of tasks that use this pid */
        struct hlist_head tasks[PIDTYPE_MAX];
        struct rcu_head rcu;
        struct upid numbers[1];
    };
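How one struct pid yields a different number in each namespace can be sketched with a user-space model. The lookup follows the logic of the kernel's pid_nr_ns() as far as this simplified model allows; the type and function names (ns_model, upid_model, pid_model, pid_model_nr_ns) and the fixed MODEL_MAX_LEVELS bound are assumptions, whereas the kernel sizes the numbers array by the namespace level at allocation time.

```c
#include <stddef.h>

#define MODEL_MAX_LEVELS 4          /* fixed bound for this sketch only */

struct ns_model {
    unsigned int level;
    struct ns_model *parent;
};

/* Counterpart of struct upid: one (number, namespace) pair. */
struct upid_model {
    int nr;
    struct ns_model *ns;
};

/* Counterpart of struct pid: one upid per level, from 0 up to `level`. */
struct pid_model {
    unsigned int level;                             /* deepest namespace level */
    struct upid_model numbers[MODEL_MAX_LEVELS];    /* numbers[i]: ID at level i */
};

/* ID of `pid` as seen from `ns`, or 0 if `pid` is not visible there. */
int pid_model_nr_ns(const struct pid_model *pid, const struct ns_model *ns)
{
    if (ns->level <= pid->level) {
        const struct upid_model *u = &pid->numbers[ns->level];
        if (u->ns == ns)
            return u->nr;
    }
    return 0;
}
```

Indexing numbers by the viewer's level makes the lookup constant-time; the namespace comparison rejects sibling namespaces that happen to sit at the same level.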

Type enum pid_type
    enum pid_type {
        PIDTYPE_PID,
        PIDTYPE_PGID,
        PIDTYPE_SID,
        PIDTYPE_MAX
    };
• Notice that thread group IDs are not contained in this collection.
• This is because the thread group ID is simply given by the PID of the thread group leader, so a separate entry is not necessary.

PIDs and Processes • Linux associates a different PID with each process or lightweight process in the system. • As we shall see later in this chapter, there is a tiny exception on multiprocessor systems. • This approach allows the maximum flexibility, because every execution context in the system can be uniquely identified.

Threads in the Same Group Must Have a Common PID • On the other hand, Unix programmers expect threads in the same group to have a common PID. • For instance, it should be possible to send a signal specifying a PID that affects all threads in the group. • In fact, the POSIX 1003.1c standard states that all threads of a multithreaded application must have the same PID.

Thread Group
• To comply with the POSIX 1003.1c standard, Linux makes use of thread groups.
• The identifier shared by the threads is the PID of the thread group leader, that is, the PID of the first lightweight process in the group.
• The thread group ID of a thread group is called TGID.
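From user space, the relation between the per-thread PID and the TGID can be observed by comparing the gettid system call with getpid(), which returns the TGID. The sketch below is a minimal illustration; the helper names current_thread_id and is_thread_group_leader are assumptions.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Kernel-level ID of the calling thread (the pid field of its task_struct). */
pid_t current_thread_id(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/* 1 if the calling thread is the thread group leader (its PID equals the TGID). */
int is_thread_group_leader(void)
{
    return current_thread_id() == getpid();
}
```

Called from the initial thread of a program, is_thread_group_leader() returns 1. Called from any additional thread, current_thread_id() yields a different value while getpid() still reports the leader's PID.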

Process Groups
• Modern Unix operating systems introduce the notion of process groups to represent a job abstraction.
• For example, in order to execute the command line
    $ ls | sort | more
  a shell that supports process groups, such as bash, creates a new group for the three processes corresponding to ls, sort, and more.
• In this way, the shell acts on the three processes as if they were a single entity (the job, to be precise).

Process Groups [waikato] • One important feature is that it is possible to send a signal to every process in the group. • Process groups are used • for distribution of signals, and • by terminals to arbitrate requests for their input and output.
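Sending one signal to every member of a process group can be sketched as follows: a child is placed in its own group with setpgid(), and the parent then passes the negated group ID to kill(). The helper name signal_process_group_demo is an assumption, and SIGKILL is used only so that the sketch terminates regardless of the inherited signal dispositions.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child, put it in its own process group, and
 * signal the whole group. Returns 1 if the child was terminated by the group
 * signal, 0 if something else happened, -1 if fork() failed.
 */
int signal_process_group_demo(void)
{
    pid_t child = fork();
    if (child == -1)
        return -1;

    if (child == 0) {
        setpgid(0, 0);              /* become leader of a new group */
        for (;;)
            pause();                /* sleep until a signal arrives */
    }

    /* The parent sets the group too, so the result does not depend on timing. */
    setpgid(child, child);

    if (getpgid(child) != child) {  /* unexpected: clean up and report failure */
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
        return 0;
    }

    /* A negative PID addresses every process in the group with that ID. */
    kill(-child, SIGKILL);

    int status = 0;
    if (waitpid(child, &status, 0) != child)
        return 0;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

A shell does the same thing for a pipeline: each process of the job is moved into one group, so a single kill() on the negated group ID reaches all of them.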

Process Groups [waikato]
• Foreground Process Groups
  • A foreground process has read and write access to the terminal.
  • Every process in the foreground receives SIGINT (^C), SIGQUIT (^\), and SIGTSTP signals.
• Background Process Groups
  • A background process does not have read access to the terminal.
  • If a background process attempts to read from its controlling terminal, its process group will be sent a SIGTTIN.
