Hi. I'm jon.404, a Unix/Linux/Database/Openstack/Kubernetes Administrator, AWS/GCP/Azure Engineer, mathematics enthusiast, and amateur philosopher. This is where I rant about that which upsets me, laugh about that which amuses me, and jabber about that which holds my interest most: *nix.
This database has been brought to you by the number -17
In this blog post, I shared two scripts: one to dynamically configure PostgreSQL to the specs of the machine it's running on (based on three different "usage patterns": web-based backend, online transaction processing, and data warehousing/business intelligence), and the other a very basic startup/init script for PostgreSQL. The init script contains a few lines that were new to me, at least, so I figured I'd explain them.
First off, the lines in question:
Code:
if [ -f $PGDATA/postmaster.pid -a -w /proc/`cat $PGDATA/postmaster.pid`/oom_adj ]; then
    echo -17 >> /proc/`cat $PGDATA/postmaster.pid`/oom_adj
fi

sysctl -w vm.overcommit_memory=2
Ok, the first line checks that PostgreSQL has written a pidfile (i.e. the server process is running) and that the oom_adj file for that process exists in the proc filesystem and is writable; the echo line then puts the magic number -17 into the oom_adj file of the PostgreSQL server process.
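As an aside, if you want to harden those lines a bit, here's a minimal sketch of the same idea (my own untested variant, not the script from the post): it quotes the expansions and takes only the first line of the pidfile, since postmaster.pid can contain more than just the pid in some PostgreSQL versions.

Code:
# Untested sketch: same idea, slightly more defensive.
# postmaster.pid may hold more than the pid, so take only the first line.
PGPID=`head -1 "$PGDATA/postmaster.pid" 2>/dev/null`
if [ -n "$PGPID" ] && [ -w "/proc/$PGPID/oom_adj" ]; then
    echo -17 > "/proc/$PGPID/oom_adj"
fi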
So what's oom_adj, and what is the significance of -17?
If you've administered Linux for any amount of time recently, you know that the 2.6 kernels have a nefarious little block of kernel logic called the oom-killer that only acts when the system is under extremely heavy memory pressure. The concept is simple: when the system is using all of its physical memory and most of its swap, the oom-killer hunts down a process that's sucking up memory (using a heuristic "points system"...kinda like "Whose Line Is It Anyway?"...haha). The oom-killer rates the different processes, picks one, and kills it to free up the memory that process was taking up, so the system might live on without a crash.
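You can actually peek at those "points" yourself: each process directory under /proc (more on /proc in a minute) has an oom_score file holding the oom-killer's current badness score for that process. A quick sketch, assuming $PID holds a process id of interest; the value shown is illustrative:

Code:
# Read the oom-killer's current badness score for a process.
user@host:~$ cat /proc/$PID/oom_score
42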
This logic is gorgeous for a desktop...but not so much for a server...particularly a dedicated database server! (Gee, I wonder what process is sucking up the most RAM? PostgreSQL! Kill it!) Blarg! What good is it that the OS didn't crash if I lose the database server? I'd almost rather have a complete system crash!
So, rather than letting a memory spike from some stupid SNMP monitor get PostgreSQL killed off in the middle of an hour-long business intelligence query, I think it would be better to let PostgreSQL continue doing its thing and, worst case, have the SNMP monitor's allocations fail instead. Sound reasonable? I think so...but now, how do we go about doing that?
Well, that's what oom_adj is. See, in the /proc filesystem, there are a ton of directories that have numbers for names.
Example:
Code:
user@host:~$ ls /proc
1    123    14     2023  2036   22129  22141  22153  27    32     38    436  447  52    5541  64    acpi         fb           locks          slabinfo       vmcore
10   124    15     2024  2038   22130  22142  22154  2770  3207   39    437  448  53    56    65    buddyinfo    filesystems  meminfo        stat           vmstat
11   125    1531   2025  2039   22131  22143  22155  2788  32685  391   438  449  5377  57    66    bus          fs           misc           swaps          zoneinfo
115  126    16     2026  2041   22132  22144  22156  28    32687  392   439  45   5378  58    67    cgroups      interrupts   modules        sys
116  127    17     2027  2042   22133  22145  22157  2880  32688  4     44   450  5380  59    7     cmdline      iomem        mounts         sysrq-trigger
117  128    18     2028  20541  22134  22146  22185  2881  32689  40    440  46   5381  5947  8     cpuinfo      ioports      mtrr           sysvipc
118  129    19     2029  21     22135  22147  22186  29    33     41    441  47   5384  6     8187  crypto       irq          net            timer_list
119  13     1928   2030  22     22136  22148  23     3     34     42    442  48   54    60    8188  devices      kallsyms     pagetypeinfo   timer_stats
12   130    1930   2031  22124  22137  22149  24     30    35     424   443  49   5468  6028  8205  diskstats    kcore        partitions     tty
120  133    19606  2032  22125  22138  22150  25     3051  36     426   444  5    5487  61    8372  dma          key-users    sched_debug    uptime
121  134    2      2034  22126  22139  22151  255    31    37     43    445  50   55    62    8373  driver       kmsg         scsi           version
122  13925  20     2035  22127  22140  22152  26     31954 3741   435   446  51   5520  63    9     execdomains  loadavg      self           version_signature
Each of the number directories corresponds to a process on the system. In this case:
Code:
user@host:~$ sudo su
[sudo] password for user:
root@host:/home/user# cat /var/lib/postgresql/data/postmaster.pid
32685
So the /proc folder 32685 contains all of the information for the PostgreSQL process running on this machine (it'll be different each time PostgreSQL is restarted!). Let's see what this directory contains:
Code:
root@host:/home/user# ls /proc/32685/
attr  cgroup      cmdline          cpuset  environ  fd      io      loginuid  mem     mountstats  oom_adj    root   smaps  statm   task
auxv  clear_refs  coredump_filter  cwd     exe      fdinfo  limits  maps      mounts  numa_maps   oom_score  sched  stat   status  wchan
That's a lot of info, and I'm sure if you took the time to figure out what each item does, you could accomplish quite a bit...but I'm only concerned with one for now: oom_adj. From the Linux kernel source file mm.h (the memory management header):
Code:
/* /proc/<pid>/oom_adj set to -17 protects from the oom-killer */
#define OOM_DISABLE -17
There it is. If you write -17 to /proc/<pid>/oom_adj, you disable the oom-killer for that process. (And since the setting is inherited across fork(), the backends the postmaster spawns are protected as well, which is why the init script only needs to touch the postmaster's pid.)
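To make that concrete, here's what the round trip looks like, using the pid from the listing above (the starting value of 0 is the kernel default; the output is illustrative, not captured from that machine):

Code:
root@host:/home/user# cat /proc/32685/oom_adj
0
root@host:/home/user# echo -17 > /proc/32685/oom_adj
root@host:/home/user# cat /proc/32685/oom_adj
-17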
That should be it, right? Well, almost. See, the only time the system gets into this mess is when the kernel is playing fast and loose with memory allocations. It hands out memory knowing that a process might request 2 GB but only ever touch 1.2 GB. On a system with many processes running simultaneously, a strictly conservative approach means available memory gets claimed by processes that aren't actually using it, so additional processes can't claim any memory and hence fail. What Linus et al. did, then, was give the kernel the power to "overcommit" memory, betting that processes won't use their requested memory to its full potential. This way, more processes can run...but if every process starts actually using what it requested, it's like a run on a bank that "overcommitted" its money to loans. Again, this is beautiful for a desktop system...but a disaster for a server!
So we need to tell the kernel how we want it to handle memory overcommit...and we can, using the sysctl program:
Code:
sysctl -w vm.overcommit_memory=2
Here's what the "2" means:
Code:
root@host:/usr/src/linux-source-2.6.18# grep -R "#define OVERCOMMIT" *
include/linux/mman.h:#define OVERCOMMIT_GUESS 0
include/linux/mman.h:#define OVERCOMMIT_ALWAYS 1
include/linux/mman.h:#define OVERCOMMIT_NEVER 2
As you can see, there are three options: 0 (heuristic overcommit, the default), 1 (always overcommit), and 2 (never overcommit, i.e. strict accounting). I'm only concerned with turning overcommit off, so I chose 2. (PostgreSQL handles memory relatively efficiently, and these are dedicated servers, so I'm not worried about allocation failures. Again, I'd rather not overcommit and get caught with my pants down when users all run queries that return a 15 GB data set...I'd rather reject additional connections until system resources free up as queries complete.)
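Two practical notes on mode 2. First, the kernel enforces a commit limit of swap plus vm.overcommit_ratio percent of physical RAM (the ratio defaults to 50), and you can watch both the limit and current allocations in /proc/meminfo. Second, sysctl -w alone doesn't survive a reboot; /etc/sysctl.conf is the usual (distro-dependent) place to persist it. A sketch, with illustrative meminfo numbers:

Code:
# See the ratio and the commit limit the kernel derives from it:
root@host:~# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50
root@host:~# grep -i commit /proc/meminfo
CommitLimit:     6084852 kB     <- swap + (overcommit_ratio% of RAM)
Committed_AS:    2516224 kB     <- what's currently allocated

# Persist the setting across reboots:
root@host:~# echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf
root@host:~# sysctl -p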
Also, this lets the PostgreSQL worker processes use the full work_mem they're allotted, meaning less time spent in swap/temp files and more time in physical memory. Woot?
Well, that's it. Enjoy.
Comments
Great, that's very interesting, many thanks for your work! +++++++
Posted 02-12-2010 at 12:11 PM by Web31337
Yeah, interesting stuff. I haven't got round to delving into databases yet, but it's this kind of thing: poking about in the innards, tinkering, editing this and that, etc, that attracted me to Linux in the first place.
Posted 02-12-2010 at 01:19 PM by brianL
This stuff is critical to DBA happiness. Recently we went on a buying spree and started loading up the PostgreSQL machines with RAM, but before that spree, I'd see oom-killer failures relatively regularly...and it was frustrating (to the point that I started looking at OpenSolaris/FreeBSD/etc. as a PostgreSQL platform). The oom_adj file is a sanity-saver, lemme tell ya. haha.
Posted 02-12-2010 at 01:37 PM by rocket357
Are you sure that a desktop machine needs this oom-killer, and that this is why it was added? I can hardly imagine how it can be useful, even if I am doing scientific calculations that request but do not use tons of memory. A server, however, needs it badly, so that some wild process, other than the beloved PostgreSQL, does not crash the system.
Are you sure that vm.overcommit_memory=2 literally turns overcommit off? I always thought it just makes it conservative, so that if a process successfully requested memory, then it will not fail when it actually tries to use it.
Posted 02-14-2010 at 10:08 AM by AGer
My reference to a desktop was more along the lines of "Firefox has just gone apes*it and is sucking up hundreds of megs of RAM...kill it off." I don't know why Linus et al. decided to implement the oom-killer; that is something only they know. Having the oom-killer on by default with overcommit set to aggressive is a bit extreme, however.
Isn't the definition of turning "overcommit" off that the system now gives out requested memory in a conservative fashion, such that a process won't fail when it attempts to use memory it has requested?
Posted 02-14-2010 at 03:39 PM by rocket357