Out of Memory Killer

I am logged in on pts/1 and using the Bash shell. As shown below, three pseudo-files in procfs whose names start with oom are associated with my Bash shell process. This post discusses the purpose of these files.

# ps
  PID TTY          TIME CMD
 1688 pts/1    00:00:00 ps
10290 pts/1    00:00:00 sudo
10291 pts/1    00:00:00 su
10294 pts/1    00:00:00 bash
# ls -l /proc/10294/oom*
-rw-r--r--. 1 root root 0 Dec 26 17:13 /proc/10294/oom_adj
-r--r--r--. 1 root root 0 Dec 26 17:13 /proc/10294/oom_score
-rw-r--r--. 1 root root 0 Dec 26 17:13 /proc/10294/oom_score_adj
# cat /proc/10294/oom_score
0


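These three files are part of Linux out-of-memory (OOM) management. Linux can be configured to overcommit memory by changing the value of the overcommit_memory kernel variable (the vm.overcommit_memory sysctl, exposed as /proc/sys/vm/overcommit_memory), which the kernel documentation describes as follows: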

overcommit_memory:

This value contains a flag that enables memory overcommitment.

When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.

The default value is 0.

Overcommitment allows memory allocation functions such as malloc() to allocate virtual memory with no guarantee that physical storage for it exists.
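To check which policy a system is using, the flag can be read from procfs and decoded. The following is a minimal sketch; the function names and the wording of the descriptions are my own, paraphrased from the documentation quoted above.

```python
from pathlib import Path

# Descriptions paraphrased from the kernel documentation quoted above.
OVERCOMMIT_MODES = {
    0: "heuristic: the kernel estimates whether enough memory is free",
    1: "always: the kernel pretends there is always enough memory",
    2: "never: the kernel tries to prevent any overcommit",
}


def describe_overcommit(raw: str) -> str:
    """Translate the text read from overcommit_memory into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def current_overcommit(path: str = "/proc/sys/vm/overcommit_memory") -> str:
    """Read the flag from procfs and describe it (Linux only)."""
    return describe_overcommit(Path(path).read_text())
```

On a Linux system, print(current_overcommit()) reports the active policy; on a default installation the flag is 0.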

Memory overcommitment is useful. Without it, a system may fail to fully utilize its memory. Overcommitting memory allows a system to use virtual memory in a more efficient way but with the risk of running out of physical memory. This is fine until the kernel cannot find sufficient physical memory to back a virtual memory page when needed.

The purpose of the kernel OOM killer routine is to free up memory for the system when all other memory reclaim techniques fail. It does this by killing selected processes until sufficient memory is freed to stabilize the system. OOM Killer has several configuration options that give some control over how the system behaves when it is faced with an out-of-memory condition.

OOM Killer attempts to select the “best” processes to kill to achieve system stability, i.e. the smallest number of processes which will free up the maximum amount of memory upon termination and which are also the least important processes as far as the system is concerned. It will also kill any process sharing the same mm_struct as the selected process, since that memory cannot be freed while another process still uses it.

To facilitate process selection, the kernel maintains an oom_score for each process. The higher the value, the more likely it is that the process and its children will be killed by OOM Killer in an out-of-memory situation.
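To see which processes currently look most attractive to OOM Killer, the oom_score files can be read for every PID and sorted. This is a rough sketch rather than a polished tool; the function names and the proc_root parameter are my own, and proc_root exists only so the code can be pointed at a directory other than /proc.

```python
from pathlib import Path


def read_oom_scores(proc_root: str = "/proc") -> dict[int, int]:
    """Return {pid: oom_score} for every process directory that can be read."""
    scores = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/sys
        try:
            scores[int(entry.name)] = int((entry / "oom_score").read_text())
        except (OSError, ValueError):
            continue  # the process exited or the file could not be read
    return scores


def likeliest_victims(scores: dict[int, int], count: int = 5) -> list[tuple[int, int]]:
    """Return up to `count` (pid, oom_score) pairs, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:count]
```

Calling likeliest_victims(read_oom_scores()) on a Linux system lists the PIDs that would be considered first.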

The per-process oom_score_adj file exists to give a user some control over the OOM Killer process selection. The deprecated oom_adj file provides similar functionality. The kernel documentation describes both:

3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score
--------------------------------------------------------------------------------

These files can be used to adjust the badness heuristic used to select which
process gets killed in out of memory conditions.

The badness heuristic assigns a value to each candidate task ranging from 0
(never kill) to 1000 (always kill) to determine which process is targeted.  The
units are roughly a proportion along that range of allowed memory the process
may allocate from based on an estimation of its current memory and swap use.
For example, if a task is using all allowed memory, its badness score will be
1000.  If it is using half of its allowed memory, its score will be 500.

There is an additional factor included in the badness score: the current memory
and swap usage is discounted by 3% for root processes.
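The arithmetic described above can be sketched in a few lines. This is a simplified model of the quoted text, not the kernel's code: the function name is hypothetical, memory is counted in arbitrary units (pages, say), and I am assuming the 3% root discount is applied to the usage before the proportion is taken.

```python
def badness(used: int, allowed: int, is_root: bool = False) -> int:
    """Simplified model of the base badness score (0 to 1000).

    used:    estimate of the task's current memory plus swap use
    allowed: memory the task may allocate from in this OOM context
    """
    if allowed <= 0:
        raise ValueError("allowed must be positive")
    if is_root:
        used -= used * 3 // 100  # usage is discounted by 3% for root processes
    score = used * 1000 // allowed  # proportion of allowed memory, scaled to 1000
    return max(0, min(1000, score))
```

A task using all of its allowed memory scores 1000 and one using half scores 500, matching the examples in the text; the same two tasks owned by root would score 970 and 485 in this model.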

The amount of "allowed" memory depends on the context in which the oom killer
was called.  If it is due to the memory assigned to the allocating task's cpuset
being exhausted, the allowed memory represents the set of mems assigned to that
cpuset.  If it is due to a mempolicy's node(s) being exhausted, the allowed
memory represents the set of mempolicy nodes.  If it is due to a memory
limit (or swap limit) being reached, the allowed memory is that configured
limit.  Finally, if it is due to the entire system being out of memory, the
allowed memory represents all allocatable resources.

The value of /proc/<pid>/oom_score_adj is added to the badness score before it
is used to determine which task to kill.  Acceptable values range from -1000
(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX).  This allows userspace to
polarize the preference for oom killing either by always preferring a certain
task or completely disabling it.  The lowest possible value, -1000, is
equivalent to disabling oom killing entirely for that task since it will always
report a badness score of 0.
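The effect of the adjustment can be modelled as an addition followed by a clamp. The sketch below follows the range given in the quoted text (0 to 1000); the function name is hypothetical and the kernel's real implementation works on page counts rather than on a ready-made score.

```python
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def adjusted_badness(base_score: int, oom_score_adj: int) -> int:
    """Add oom_score_adj to a base badness score and clamp to 0..1000."""
    if not OOM_SCORE_ADJ_MIN <= oom_score_adj <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be between -1000 and +1000")
    return max(0, min(1000, base_score + oom_score_adj))
```

Because the base score never exceeds 1000, an adjustment of -1000 always yields 0 without any special case, which is why that value amounts to disabling OOM killing for the task.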

Consequently, it is very simple for userspace to define the amount of memory to
consider for each task.  Setting a /proc/<pid>/oom_score_adj value of +500, for
example, is roughly equivalent to allowing the remainder of tasks sharing the
same system, cpuset, mempolicy, or memory controller resources to use at least
50% more memory.  A value of -500, on the other hand, would be roughly
equivalent to discounting 50% of the task's allowed memory from being considered
as scoring against the task.

For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also
be used to tune the badness score.  Its acceptable values range from -16
(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17
(OOM_DISABLE) to disable oom killing entirely for that task.  Its value is
scaled linearly with /proc/<pid>/oom_score_adj.
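The linear scaling between the two files can be illustrated as follows. This sketch assumes a factor of 1000/17 with truncation toward zero, and that +15 is mapped straight to +1000 so the top of the old range reaches the top of the new one; treat those details as my reading of the kernel's behaviour rather than something the text above states. The function name is hypothetical.

```python
OOM_DISABLE = -17
OOM_ADJUST_MIN = -16
OOM_ADJUST_MAX = 15
OOM_SCORE_ADJ_MAX = 1000


def oom_adj_to_score_adj(oom_adj: int) -> int:
    """Convert a legacy oom_adj value to the oom_score_adj scale."""
    if not OOM_DISABLE <= oom_adj <= OOM_ADJUST_MAX:
        raise ValueError("oom_adj must be between -17 and +15")
    if oom_adj == OOM_ADJUST_MAX:
        return OOM_SCORE_ADJ_MAX  # assumption: +15 maps directly to +1000
    # Scale by 1000/17, truncating toward zero as C integer division does.
    magnitude = abs(oom_adj) * OOM_SCORE_ADJ_MAX // -OOM_DISABLE
    return magnitude if oom_adj >= 0 else -magnitude
```

Under these assumptions the special value -17 becomes -1000, so disabling OOM killing means the same thing on both scales.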

The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last
value set by a CAP_SYS_RESOURCE process. To reduce the value any lower
requires CAP_SYS_RESOURCE.
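Changing the adjustment is just a write to the pseudo-file. A minimal sketch follows; the function name and the proc_root parameter are hypothetical, and proc_root is there only so the function can be tried against an ordinary directory.

```python
from pathlib import Path


def set_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> None:
    """Write a new oom_score_adj value for the given process."""
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and +1000")
    # On a real system the write raises an OSError (typically
    # PermissionError) if the caller may not change this process, or if it
    # tries to lower the value further than allowed without CAP_SYS_RESOURCE.
    Path(proc_root, str(pid), "oom_score_adj").write_text(f"{value}\n")
```

For example, set_oom_score_adj(10294, 500) run as root would make the Bash shell shown at the top of this post a more likely victim, while a negative value would make it less likely.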

Caveat: when a parent task is selected, the oom killer will sacrifice any first
generation children with separate address spaces instead, if possible.  This
avoids servers and important system daemons from being killed and loses the
minimal amount of work.

As stated earlier, processes to be killed are selected based on their badness score, which is visible to a user as /proc/<PID>/oom_score. See this article in LWN (Linux Weekly News) for more information about how badness is calculated. The process with the highest badness score, together with any children, is killed first.

There is lots more to OOM Killer than I have time to cover in this post. Just do an Internet search and you will find plenty of additional information.
