How to kill those Zombies

Posted 06-13-2013 at 03:22 PM by rtmistler

When a process ends, a SIGCHLD signal is sent to its parent to indicate that the process is done. This does not delete the process, however; its process table entry, including the exit status, remains with the system until the parent collects that status with one of the wait() calls. The process is a zombie until its exit status is collected.

The reason for this is closure.

Rather than have the process go away without ever explaining why it went away, it stays as a zombie so that the parent can check its exit status. If the parent process terminates, any zombies attached to it go away too, even if the parent itself becomes a zombie. This is because the purpose of the zombie is to report to a running parent process; once that parent is no longer alive, its zombies are handed over to another process, typically init (PID 1), which reaps them.
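To make that lifecycle concrete, here is a minimal sketch rather than anything from my production code. The helper names make_zombie() and pid_exists() are hypothetical, and probing with kill(pid, 0) is just one convenient way to see whether a process table entry still exists. It relies on waitid() with WNOWAIT, which waits for the child to terminate without reaping it.

```c
#define _POSIX_C_SOURCE 200809L /* expose waitid(), WNOWAIT and kill() */
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits at once with exit_code and
 * block until it has terminated WITHOUT reaping it, so it is left as a zombie.
 * Returns the child's pid, or -1 on error.
 */
static pid_t make_zombie(int exit_code)
{
    siginfo_t info;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(exit_code);            /* child: terminate immediately */

    /* WNOWAIT leaves the terminated child in a waitable (zombie) state */
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == -1)
        return -1;
    return pid;
}

/* Returns 1 while the pid still has a process table entry, 0 otherwise. */
static int pid_exists(pid_t pid)
{
    /* Signal 0 delivers nothing; it only performs the existence check */
    return kill(pid, 0) == 0 || errno != ESRCH;
}
```

After make_zombie() returns, pid_exists() should report the child as present even though it has already exited. A following waitpid(pid, &w_status, 0) collects the exit status, after which pid_exists() should report that the pid is gone.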

In my world of run-forever applications, which typically run on embedded platforms, zombies are a real threat. Parent processes are designed to never go away, but they do create short-lived child processes to accomplish their functional feats.

Simple Case
The simple case is when you can afford to wait, or must wait anyway, for the outcome of your child process.

Code:

    int w_status;
    pid_t pid;

    /* Fork a child process to resolve the file handle and wait for it to complete */
    pid = fork();
    if(pid < 0) {
        my_log("fork error - configuration file not updated %d:%s\n", errno, strerror(errno));
        return;
    }
    /* This is the child - note args and newenv set up prior; just not shown here */
    else if(pid == 0) {
        execve("/bin/cp", args, newenv);
        /* execve() only returns on failure; exit so the child never runs the parent's code below */
        my_log("Error invoking execve %d:%s\n", errno, strerror(errno));
        _exit(127);
    }

    /* 
     * Parent waiting for child to complete.
     * Once done, all file handles used by the child will be closed, this is
     * done to expedite the complete writing of all cached data to the file.
     */
    if(waitpid(pid, &w_status, 0) == -1) {
        my_log("waitpid error %d:%s\n", errno, strerror(errno));
    }

The blocking waitpid() call is what provides the closure for the zombie process: it waits until the child with that process ID terminates, collects its exit status, and lets the system release the child's process table entry.
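The same pattern can be wrapped in a small helper. This is a minimal sketch under assumptions: the name run_and_wait() is hypothetical, the child inherits the parent's environment through environ, and the 127 and 128 + signal return values are borrowed from the shell's conventions rather than required by anything.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper: run the program at path and block until it finishes.
 * Returns the child's exit code (0-255), 128 + signal number if a signal
 * terminated it, or -1 if fork() or waitpid() failed.
 */
static int run_and_wait(const char *path, char *const argv[])
{
    int w_status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(path, argv, environ);
        _exit(127);                  /* only reached when execve() failed */
    }

    /* Retry when a signal handler interrupts the wait */
    while (waitpid(pid, &w_status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }

    if (WIFEXITED(w_status))
        return WEXITSTATUS(w_status);
    if (WIFSIGNALED(w_status))
        return 128 + WTERMSIG(w_status);
    return -1;
}
```

For example, calling it with "/bin/sh" and the arguments { "sh", "-c", "exit 3", NULL } should return 3, and because the parent has waited, no zombie is left behind.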

More Complex Case

But what happens if your application cannot afford to wait for some indeterminate amount of time? I handle that by keeping the child's process ID in a global variable and managing it as part of my main loop. There are a few subtleties to deal with.

A non-zero global pid means you already have a pending process, which helps to avoid creating a duplicate child.
The persistent loop which checks for termination waits with the "no hang" option (WNOHANG), so it is non-blocking.
If you reach a point where you try to create the child again, you have the option to kill the former process.

Here's a full example.

Definition of the global pid

Code:

pid_t G_Child_Pid = 0;

In the place where I'd normally create a child process, I add a test before I do the fork():

Code:

    if(G_Child_Pid != 0) {
        my_log("existing process already running\n");
        /* Here I use the option to kill that existing process, my choice, you don't have to do this */
        kill(G_Child_Pid, SIGKILL); // shoot it in the head! It stays a zombie until the waitpid() in the main loop reaps it
        return 0;
    }
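Note that kill() on its own only terminates the child; the entry still has to be reaped. If you would rather not leave that to the main loop, a variation is to kill and reap in one step. This is a minimal sketch, and kill_and_reap() is a hypothetical name, not part of my original code.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: kill a child and reap it straight away.
 * Returns 0 and stores the wait status in *w_status on success, -1 on error.
 */
static int kill_and_reap(pid_t pid, int *w_status)
{
    /* Succeeds even if the child has already exited but is not yet reaped */
    if (kill(pid, SIGKILL) == -1)
        return -1;

    /* SIGKILL cannot be caught or ignored, so this wait should be brief */
    while (waitpid(pid, w_status, 0) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}
```

With this variation the caller would set the global pid back to 0 after a successful call, since there is nothing left for the main loop to reap.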

Once I determine that it is OK to fork() a child, I do so and set G_Child_Pid to the value of the pid:

Code:

    pid = fork(); // pid is a pid_t local declaration
    if(pid < 0) {
        my_log("fork error %d:%s\n", errno, strerror(errno));
        return 0;
    }
    else if(pid == 0) {
        /* Run my executable - arguments and environment set up before */
        execve("/usr/bin/cp", args, newenv);
        /* execve() does not return on success, so reaching this point means it failed */
        my_log("Error invoking execve %d:%s\n", errno, strerror(errno));
        _exit(127); /* never let the child fall through into the parent's code */
    }

    G_Child_Pid = pid;

And finally, in my persistent loop, I check one or more global process IDs for termination without blocking. This function is called from my run-forever loop:

Code:

static void clean_up_zombies(void)
{
    int w_status;

    if(G_Child_Pid != 0) {
        if(waitpid(G_Child_Pid, &w_status, WNOHANG) == G_Child_Pid) {
            G_Child_Pid = 0;
        }
    }

    if(G_Other_Pid != 0) {
        if(waitpid(G_Other_Pid, &w_status, WNOHANG) == G_Other_Pid) {
            G_Other_Pid = 0;
        }
    }
}
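Putting the pieces together, here is a self-contained sketch of the same idea. The names tracked_pid, start_tracked() and poll_tracked() are hypothetical stand-ins for the globals and functions above, and clearing the slot when waitpid() fails is an addition of this sketch so that a stale pid cannot block future children.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

static pid_t tracked_pid = 0;        /* 0 means no child is pending */

/* Start the program unless a child is still pending. Returns 0 on success, -1 otherwise. */
static int start_tracked(const char *path, char *const argv[])
{
    pid_t pid;

    if (tracked_pid != 0)
        return -1;                   /* refuse to create a duplicate child */

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execve(path, argv, environ);
        _exit(127);                  /* only reached when execve() failed */
    }
    tracked_pid = pid;
    return 0;
}

/*
 * Non-blocking check, meant to be called on every pass of the main loop.
 * Returns 1 and fills *w_status when the child was reaped, 0 when there is
 * nothing to reap yet, and -1 when waitpid() failed.
 */
static int poll_tracked(int *w_status)
{
    pid_t rc;

    if (tracked_pid == 0)
        return 0;

    rc = waitpid(tracked_pid, w_status, WNOHANG);
    if (rc == 0)
        return 0;                    /* child is still running */
    if (rc == -1 && errno == EINTR)
        return 0;                    /* interrupted; try again next pass */

    tracked_pid = 0;                 /* reaped, or no longer our child */
    return rc == -1 ? -1 : 1;
}
```

The main loop would call poll_tracked() once per pass and only look at the status when it returns 1. A second call to start_tracked() while a child is pending returns -1, which is the place where you could choose to kill the former process instead.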

It is also worth noting that you can decode the w_status variable using the supplied macros. Here are some examples; check the man page for wait(2) to see more of these macros.

Code:

            if(WIFSIGNALED(w_status)) {
                my_log("Process %d was terminated by signal %d.  Restarting\n", pid, WTERMSIG(w_status));
#ifdef WCOREDUMP /* not part of POSIX, so not available everywhere; see wait(2) */
                if(WCOREDUMP(w_status)) {
                    my_log("Process %d generated a core dump.\n", pid);
                    save_core_data(); // custom function to save and zip data I care about
                }
#endif
            }
            else if(WIFEXITED(w_status)) {
                my_log("Process %d has self terminated with exit code %d.  Restarting\n", pid, WEXITSTATUS(w_status));
            }
            else {
                my_log("Process %d has failed for unknown reasons.  Restarting\n", pid);
            }
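If you decode the status in more than one place, the checks can be gathered into one helper. This is a minimal sketch; describe_status() is a hypothetical name, and the WIFSTOPPED branch only matters if you call waitpid() with the WUNTRACED option.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: write a one-line description of a wait status into buf.
 * Returns buf so the call can be used directly as a logging argument.
 */
static const char *describe_status(int w_status, char *buf, size_t len)
{
    if (WIFEXITED(w_status))
        snprintf(buf, len, "exited with code %d", WEXITSTATUS(w_status));
    else if (WIFSIGNALED(w_status))
        snprintf(buf, len, "terminated by signal %d", WTERMSIG(w_status));
    else if (WIFSTOPPED(w_status))   /* only reported with WUNTRACED */
        snprintf(buf, len, "stopped by signal %d", WSTOPSIG(w_status));
    else
        snprintf(buf, len, "unknown status 0x%x", (unsigned)w_status);
    return buf;
}
```

A caller would pass the status filled in by waitpid() along with a small local buffer, for instance char buf[64], and log the returned string next to the pid.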
