Hi. I'm jon.404, a Unix/Linux/Database/Openstack/Kubernetes Administrator, AWS/GCP/Azure Engineer, mathematics enthusiast, and amateur philosopher. This is where I rant about that which upsets me, laugh about that which amuses me, and jabber about that which holds my interest most: *nix.
This database has been brought to you by the number -17
In this blog post, I shared two scripts: one to dynamically configure PostgreSQL to the specs of the machine it's running on (based on three different "usage patterns": web-based backend, online transaction processing, and data warehousing/business intelligence), and the other a very basic startup/init script for PostgreSQL. The init script contains a few lines that were new to me, at least, so I figured I'd explain them.
First off, the lines in question:
Code:
if [ -f $PGDATA/postmaster.pid -a -w /proc/`cat $PGDATA/postmaster.pid`/oom_adj ]; then
    echo -17 >> /proc/`cat $PGDATA/postmaster.pid`/oom_adj
fi

sysctl -w vm.overcommit_memory=2
Ok, the first line checks that PostgreSQL has written a pidfile (i.e. the server process is running) and that the oom_adj file for that process exists in the proc filesystem and is writable; the echo line then puts the magic number -17 into the oom_adj file of the PostgreSQL server process.
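As an aside, if you want to harden those lines a bit, here's a minimal sketch of the same idea (my own untested variant, not the script from the post): it quotes the expansions and takes only the first line of the pidfile, since postmaster.pid can contain more than just the pid in some PostgreSQL versions.

Code:
# Untested sketch: same idea, slightly more defensive.
# postmaster.pid may hold more than the pid, so take only the first line.
PGPID=`head -1 "$PGDATA/postmaster.pid" 2>/dev/null`
if [ -n "$PGPID" ] && [ -w "/proc/$PGPID/oom_adj" ]; then
    echo -17 > "/proc/$PGPID/oom_adj"
fi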
So what's oom_adj, and what is the significance of -17?
If you've administered Linux for any amount of time recently, you know that the 2.6 kernels have a nefarious little block of kernel logic called the oom-killer that only acts when the system is under extremely heavy memory pressure. The concept is simple: when the system is using all of its physical memory and most of its swap, the oom-killer hunts down a process that's sucking up memory (using a heuristic "points system"...kinda like "Whose Line Is It Anyway?"...haha). The oom-killer rates the different processes, picks one, and kills it to free up the memory that process was taking up, so the system might live on without a crash.
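You can actually peek at those "points" yourself: each process directory under /proc (more on /proc in a minute) has an oom_score file holding the oom-killer's current badness score for that process. A quick sketch, assuming $PID holds a process id of interest; the value shown is illustrative:

Code:
# Read the oom-killer's current badness score for a process.
user@host:~$ cat /proc/$PID/oom_score
42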
This logic is gorgeous for a desktop...but not so much for a server...particularly a dedicated database server! (Gee, I wonder what process is sucking up the most RAM? PostgreSQL! Kill it!) Blarg! What good is it that the OS didn't crash if I lose the database server? I'd almost rather have a complete system crash!
So, rather than letting a memory spike from some stupid SNMP monitor get PostgreSQL killed off in the middle of an hour-long business intelligence query, I think it would be better to let PostgreSQL continue doing its thing and, worst case, have the SNMP monitor's allocations fail instead. Sound reasonable? I think so...but now, how do we go about doing that?
Well, that's what oom_adj is. See, in the /proc filesystem, there are a ton of directories that have numbers for names.
Example:
Code:
user@host:~$ ls /proc
1    123    14     2023  2036   22129  22141  22153  27    32     38    436  447  52    5541  64    acpi         fb           locks          slabinfo       vmcore
10   124    15     2024  2038   22130  22142  22154  2770  3207   39    437  448  53    56    65    buddyinfo    filesystems  meminfo        stat           vmstat
11   125    1531   2025  2039   22131  22143  22155  2788  32685  391   438  449  5377  57    66    bus          fs           misc           swaps          zoneinfo
115  126    16     2026  2041   22132  22144  22156  28    32687  392   439  45   5378  58    67    cgroups      interrupts   modules        sys
116  127    17     2027  2042   22133  22145  22157  2880  32688  4     44   450  5380  59    7     cmdline      iomem        mounts         sysrq-trigger
117  128    18     2028  20541  22134  22146  22185  2881  32689  40    440  46   5381  5947  8     cpuinfo      ioports      mtrr           sysvipc
118  129    19     2029  21     22135  22147  22186  29    33     41    441  47   5384  6     8187  crypto       irq          net            timer_list
119  13     1928   2030  22     22136  22148  23     3     34     42    442  48   54    60    8188  devices      kallsyms     pagetypeinfo   timer_stats
12   130    1930   2031  22124  22137  22149  24     30    35     424   443  49   5468  6028  8205  diskstats    kcore        partitions     tty
120  133    19606  2032  22125  22138  22150  25     3051  36     426   444  5    5487  61    8372  dma          key-users    sched_debug    uptime
121  134    2      2034  22126  22139  22151  255    31    37     43    445  50   55    62    8373  driver       kmsg         scsi           version
122  13925  20     2035  22127  22140  22152  26     31954 3741   435   446  51   5520  63    9     execdomains  loadavg      self           version_signature
Each of the number directories corresponds to a process on the system. In this case:
Code:
user@host:~$ sudo su
[sudo] password for user:
root@host:/home/user# cat /var/lib/postgresql/data/postmaster.pid
32685
So the /proc folder 32685 contains all of the information for the PostgreSQL process running on this machine (it'll be different each time PostgreSQL is restarted!). Let's see what this directory contains:
Code:
root@host:/home/user# ls /proc/32685/
attr  cgroup      cmdline          cpuset  environ  fd      io      loginuid  mem     mountstats  oom_adj    root   smaps  statm   task
auxv  clear_refs  coredump_filter  cwd     exe      fdinfo  limits  maps      mounts  numa_maps   oom_score  sched  stat   status  wchan
That's a lot of info, and I'm sure if you took the time to figure out what each item does, you could accomplish quite a bit...but I'm only concerned with one for now: oom_adj. From the Linux kernel source file mm.h (the memory management header):
Code:
/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
#define OOM_DISABLE -17
There it is. If you write -17 to /proc/<pid>/oom_adj, you disable the oom-killer for that process. (And since the setting is inherited across fork(), the backends the postmaster spawns are protected as well, which is why the init script only needs to touch the postmaster's pid.)
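To make that concrete, here's what the round trip looks like, using the pid from the listing above (the starting value of 0 is the kernel default; the output is illustrative, not captured from that machine):

Code:
root@host:/home/user# cat /proc/32685/oom_adj
0
root@host:/home/user# echo -17 > /proc/32685/oom_adj
root@host:/home/user# cat /proc/32685/oom_adj
-17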
That should be it, right? Well, almost. See, the only time the system gets into this mess is when the kernel is playing fast and loose with memory allocations. It hands out memory knowing that a process might request 2 GB but only ever touch 1.2 GB. On a system with many processes running simultaneously, a strictly conservative approach means available memory gets claimed by processes that aren't actually using it, so additional processes can't claim any memory and hence fail. What Linus et al. did, then, was give the kernel the power to "overcommit" memory, betting that processes won't use their requested memory to its full potential. This way, more processes can run...but if every process starts actually using what it requested, it's like a run on a bank that "overcommitted" its money to loans. Again, this is beautiful for a desktop system...but a disaster for a server!
So we need to tell the kernel how we want it to handle memory overcommit...and we can, using the sysctl program:
Code:
sysctl -w vm.overcommit_memory=2
Here's what the "2" means:
Code:
root@host:/usr/src/linux-source-2.6.18# grep -R "#define OVERCOMMIT" *
include/linux/mman.h:#define OVERCOMMIT_GUESS 0
include/linux/mman.h:#define OVERCOMMIT_ALWAYS 1
include/linux/mman.h:#define OVERCOMMIT_NEVER 2
As you can see, there are three options: 0 (heuristic overcommit, the default), 1 (always overcommit), and 2 (never overcommit, i.e. strict accounting). I'm only concerned with turning overcommit off, so I chose 2. (PostgreSQL handles memory relatively efficiently, and these are dedicated servers, so I'm not worried about allocation failures. Again, I'd rather not overcommit and get caught with my pants down when users all run queries that return a 15 GB data set...I'd rather reject additional connections until system resources free up as queries complete.)
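Two practical notes on mode 2. First, the kernel enforces a commit limit of swap plus vm.overcommit_ratio percent of physical RAM (the ratio defaults to 50), and you can watch both the limit and current allocations in /proc/meminfo. Second, sysctl -w alone doesn't survive a reboot; /etc/sysctl.conf is the usual (distro-dependent) place to persist it. A sketch, with illustrative meminfo numbers:

Code:
# See the ratio and the commit limit the kernel derives from it:
root@host:~# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50
root@host:~# grep -i commit /proc/meminfo
CommitLimit:     6084852 kB     <- swap + (overcommit_ratio% of RAM)
Committed_AS:    2516224 kB     <- what's currently allocated

# Persist the setting across reboots:
root@host:~# echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf
root@host:~# sysctl -p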
Also, this lets the PostgreSQL worker processes use the full work_mem they're allotted, meaning less time spent in swap/temp files and more time in physical memory. Woot?
Well, that's it. Enjoy.
Comments
Great, that's very interesting, many thanks for your work! +++++++
Posted 02-12-2010 at 12:11 PM by Web31337
Yeah, interesting stuff. I haven't got round to delving into databases yet, but it's this kind of thing: poking about in the innards, tinkering, editing this and that, etc, that attracted me to Linux in the first place.
Posted 02-12-2010 at 01:19 PM by brianL
This stuff is critical to DBA happiness. Recently we went on a buying spree and started loading up the PostgreSQL machines with RAM, but before that spree, I'd see oom-killer failures relatively regularly...and it was frustrating (to the point that I started looking at OpenSolaris/FreeBSD/etc. as a PostgreSQL platform). The oom_adj file is a sanity-saver, lemme tell ya. haha.
Posted 02-12-2010 at 01:37 PM by rocket357
Are you sure that a desktop machine needs this oom-killer, and that this is why it was added? I can hardly imagine how it can be useful, even if I am doing scientific calculations that request but do not use tons of memory. A server, however, needs it badly, so that some wild process, other than the beloved PostgreSQL, does not crash the system.
Are you sure that vm.overcommit_memory=2 literally turns overcommit off? I always thought it just makes it conservative, so that if a process successfully requested memory, then it will not fail when it actually tries to use it.
Posted 02-14-2010 at 10:08 AM by AGer
My reference to a desktop was more along the lines of "Firefox has just gone apes*it and is sucking up hundreds of megs of RAM...kill it off." I don't know why Linus et al. decided to implement the oom-killer; that is something only they know. Having the oom-killer on by default with overcommit set to aggressive is a bit extreme, however.
Isn't the definition of turning "overcommit" off that the system now gives out requested memory in a conservative fashion, such that a process won't fail when it attempts to use memory it has requested?
Posted 02-14-2010 at 03:39 PM by rocket357