JOS Virtual Memory (part 2)

Introduction

This post will briefly explain what copy-on-write fork() is, then we will discuss JOS’s very particular copy-on-write fork() implementation which is mostly done in user-space.

Hopefully this post will give you a good understanding of the system calls presented in the first part of this series.

To be honest, I am not sure whether copy-on-write fork() is the best example for beginners, but it is very helpful to demonstrate how exokernel implementations move functionality normally implemented in kernel-space to user-space.

I recommend you to read the first post of this series if you have not done so yet.

Copy-on-write fork()

The fork() system call creates a new process by duplicating the calling process. That means that the new process created by fork(), known as the child, is a complete copy of the process which called fork(), known as the parent (actually, not everything is copied but we can ignore the details for now).

One of the parent’s resources which has to be duplicated is its memory address space, which include all the pages allocated for the parent. There are two ways to do this.

The simpler way is easy to understand and implement, it is something like the following:

for each mapped page in the parent do
	1. allocate a new page
	2. insert it into the address space of the child, at the same virtual address
           of the parent's page
	3. copy the parent's page contents into the child's allocated page

Again, that is not the full story, I have omitted some details but the algorithm does show the problems of this solution which is basically an unnecessary overhead.

For example, if the child process uses only a small subset of its address space, all the work done on the unused pages would have been completely wasted.

Think about the shell now, its main duty in life is to fork() and exec() new programs. Exec() destroys the address space of the process calling it and builds a new one. So, if you are using the dumb algorithm showed above, for each command you type into the shell prompt, fork() will duplicate the address of the shell just to allow exec() to destroy it.

Of course this algorithm is awful and today’s operating systems use something better called copy-on-write.

Copy-on-write fork() works like this: instead of duplicating the address space of the parent into the child, the kernel marks all the parent’s pages as read-only and inserts them into child address space. By doing that both processes will share the parent’s pages. When any of the two writes into one of the shared pages, the MMU hardware issues a page fault. The page fault handler executes and learns that a process is trying to write into a copy-on-write page. At this moment the page fault handler does the following:

1. Allocates a new page
2. Makes the new page read-write
3. Copies the contents of the copy-on-write page into the allocated one
4. Inserts the new page into the address space of the process doing the write, at the
   same virtual address of the copy-on-write page

After the fourth step, the page fault handler exits and the processor continues to execute in the instruction which caused the page fault. In this way the change is transparent for the running process.

Copy-on-write fork() is a little bit more complex to implement and it depends on the hardware to work, but it does make fork() low overhead as it does not have to duplicate the entire address space of the calling process.

Let’s consider the shell case again. In a system with copy-on-write support, the fork()/exec() sequence is much faster because no page is duplicated between the fork() and exec() calls.

Note that copy-on-write is potentially useful to any process which forks new processes, because pages are duplicated only when needed.

JOS copy-on-write fork()

In JOS fork() is not a system call. It is a library function which uses various system calls to perform its task.

The code of fork() can be found here, I recommend you to keep it open while reading this section. Also note that for now on, when I say fork(), I will be referring to JOS’s fork().

Besides the memory management system calls presented in the first part of this series, fork() also uses the following system calls which deals with environment handling:

envid_t sys_exofork(void);
int sys_env_set_status(envid_t envid, int status);
int sys_env_set_pgfault_upcall(envid_t envid, void *upcall);

I will describe now how fork() works and will introduce the functions above when necessary. It is important to note that my goal here is just to give you a general overview.

In JOS, page faults caused by environments are handled in user-space. When a page fault happens, the kernel checks whether the environment which issued the page fault has registered a page fault handler. If it did, the kernel returns to user-space and executes the registered handler. If the environment did not register a page fault handler the kernel will kill it when a page fault happens.

The first thing fork() does on line 122 is to setup the calling process page fault handler. Both the parent and the child have to setup a page fault handler.

The next step on line 124 is to create a new environment by calling sys_exofork(). That function allocates a new environment from the poll maintained by the kernel, initializes it and copies the current context of the parent (that is, the parent’s registers’ contents) into it. It is important to realize that sys_exofork() does only the very first part of the forking process.

After the new environment has been created, fork() executes its main code. It goes all over the parent’s page tables and each writable page is marked COW and read-only before being inserted into the address space of the child. Note that we have to remap the page into the parent as well, since the page permissions have been changed.

When the page setup is done, the parent does the final setup. Firstly it configures the child’s page fault handler and then the parent marks the child runnable with the call to sys_env_set_status(). That is, in exokernels it is up to the application to say when a newly created environment is runnable, not up to the kernel.

As long as the child is marked runnable, it can execute at anytime. If there are just a few environments running, it will probably run within the next few clock interruptions.

When the child is put to run by the kernel, it starts executing as if it was returning from sys_exofork(). The code in fork() is prepared to deal with this situation and on line 127 it checks whether it is the child which is executing and if it is, fork() finishes the setup and returns 0.

The process is now forked, and when any of the two tries to write to a page a page fault is issued by the hardware.

When the page fault happens the kernel will return to user-space and branch out to the page fault handler registered by fork(), which is pgfault() function on line 15. This function duplicates the page exactly as was explained in the first section of this post.

3 Responses to “JOS Virtual Memory (part 2)”

  1. vivek Says:

    thanks luiz,

    it was pretty informative. i was struggling to find this bit of information on internet.
    there are lots of references of this on the web, but none of them mention about the mmu and pagefaults. and the generally mess up the COW and read only markers!

    i have one more doubt.

    does the fork mark (cow and read only) pages which are non read-only (like data segment,stack) at the begining?

    because if the child is going to be the exact copy of the parent for its lifetime, there is no point in copying the parent’s text segment.

    thank you
    please mail me

  2. Luiz Capitulino Says:

    Yes, you are right. The text pages are not copied, they will be shared among descendants.

    This is, of course, a general rule. I mean, today’s operating systems have so many features that sometimes they may not follow such a rule strictly.

    I think what confused you was this sentence: “the kernel marks all the parent’s pages as read-only and inserts them into child address space”. Speaking about JOS, the pages which are marked read-only are the the ones which have write permission enabled. The others (eg. text) are already read-only, so they are directly inserted into the child’s address space.

    I’m replying you here (instead of sending an email) because your question may be relevant to others.

    I’m glad to know that this post has been useful to someone. :)

  3. Nencia Says:

    Nice work! I’ll have to do a cross post on this one ;)