Hey dude, where’s my memory? part 2

In my previous article I started exploring the memory usage of a process on a recent Linux kernel (2.6.39-400.243.1, UEK2). “Recent” means recent for the Enterprise Linux distributions in this context; Linux kernel developers would point out that the kernel itself is at version 3.19 (the “stable version” at the time of writing of this blogpost).

The previous article showed that every process has its own address space, and that different allocations exist for execution. These allocations can be seen in the proc pseudo filesystem, in a process-specific file called ‘maps’. The process itself needs some administration areas, which are anonymous allocations marked with [heap] and [stack], plus a few others called [vdso] and [vsyscall]. Every process executes something (not everybody realises that: a process can wait for something, but essentially it is always executing). So there will always be an executable in a (regular) process’ address space. In a lot of cases, the executable uses shared libraries. In that case, the libraries are loaded into the address space of the process too, alongside the executable.

The executable contains (at least) two sections: the code segment, which is read-only and potentially shared, and the data segment, which is read-write and gets truly allocated (instead of shared with the parent) once the process writes to it. In a lot of cases, the executable uses shared libraries, which means it calls functions that are stored in such a library. A library also needs to be loaded, and also contains multiple sections, which can be read-only or read-write, and are shared unless the process needs to write to a segment.

For completeness, here’s the complete maps output of a process executing the ‘cat’ executable again:

$ cat /proc/self/maps
00400000-0040b000 r-xp 00000000 fc:00 2605084                            /bin/cat
0060a000-0060b000 rw-p 0000a000 fc:00 2605084                            /bin/cat
0060b000-0060c000 rw-p 00000000 00:00 0
0139d000-013be000 rw-p 00000000 00:00 0                                  [heap]
7f444468d000-7f444a51e000 r--p 00000000 fc:00 821535                     /usr/lib/locale/locale-archive
7f444a51e000-7f444a6a8000 r-xp 00000000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a6a8000-7f444a8a8000 ---p 0018a000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8a8000-7f444a8ac000 r--p 0018a000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8ac000-7f444a8ad000 rw-p 0018e000 fc:00 3801096                    /lib64/libc-2.12.so
7f444a8ad000-7f444a8b2000 rw-p 00000000 00:00 0
7f444a8b2000-7f444a8d2000 r-xp 00000000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aacd000-7f444aad1000 rw-p 00000000 00:00 0
7f444aad1000-7f444aad2000 r--p 0001f000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aad2000-7f444aad3000 rw-p 00020000 fc:00 3801089                    /lib64/ld-2.12.so
7f444aad3000-7f444aad4000 rw-p 00000000 00:00 0
7fff51980000-7fff519a1000 rw-p 00000000 00:00 0                          [stack]
7fff519ff000-7fff51a00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

If you look closely, you will see that I didn’t explain one type of allocation yet: the anonymous allocations. These are visible in lines 4, 11, 13 and 16 of the listing above (counting the ‘$ cat’ line as line 1): the mappings without a name. The anonymous mapping directly following the data segment of either an executable or a library mapped into a process’ address space is called the BSS. Both the data segment and the BSS store static variables; however, the data segment stores initialised variables, while the BSS stores uninitialised variables. The anonymous mapping for the BSS section might exist for a library, as seen above, or might not; not all libraries of the Oracle database executable use an anonymous memory mapping, for example. Actually, there is one other memory allocation visible: /usr/lib/locale/locale-archive, which is a file used by the locale (multi-language support) functions in the C library, and which is out of scope for this article.
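
The distinction between file-backed, anonymous and kernel-labelled mappings can be made mechanically from a maps line. The following is a hypothetical helper (not part of any standard tool) that classifies a single line of /proc/&lt;pid&gt;/maps output and computes the mapping size from the address range:

```python
import re

# Hypothetical helper: classify one line of /proc/<pid>/maps output.
# A mapping with no path and no [name] is an anonymous mapping; [heap],
# [stack], [vdso] and [vsyscall] are kernel-labelled special mappings;
# anything else is file-backed.
MAPS_LINE = re.compile(
    r"^(?P<start>[0-9a-f]+)-(?P<end>[0-9a-f]+)\s+"
    r"(?P<perms>\S{4})\s+\S+\s+\S+\s+\S+\s*(?P<path>.*)$"
)

def classify_mapping(line):
    m = MAPS_LINE.match(line)
    if not m:
        raise ValueError("not a maps line: %r" % line)
    path = m.group("path").strip()
    # mapping size in KB, derived from the hexadecimal address range
    size_kb = (int(m.group("end"), 16) - int(m.group("start"), 16)) // 1024
    if not path:
        kind = "anonymous"
    elif path.startswith("["):
        kind = path          # [heap], [stack], [vdso], [vsyscall]
    else:
        kind = "file-backed"
    return kind, m.group("perms"), size_kb

# The /bin/cat code segment from the maps output above:
print(classify_mapping("00400000-0040b000 r-xp 00000000 fc:00 2605084 /bin/cat"))
# → ('file-backed', 'r-xp', 44)
```

Note how the permission bits alone already tell most of the story: r-xp is a code segment, rw-p is writable (data, BSS, heap, stack), and the ‘p’ means private (copy-on-write) as opposed to ‘s’ for shared.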

When a process requests memory to store something, the C library function malloc() (memory allocation) can be called. Note that malloc() is not a system call: the function inspects the size of the allocation, and will either allocate memory from the process’ heap (the memory mapping marked [heap], grown using the system call brk()) or allocate space in a new anonymous memory segment, using the system call mmap(). If you follow the link on malloc(), you can read the source code of the malloc() implementation. There are different malloc()’s, which fulfil different purposes (embedded devices have different requirements than huge servers); the implementation that the Enterprise Linuxes use is called ptmalloc2, which is based on a version written by Doug Lea. If you read the comments in the source code, specifically at ‘Why use this malloc?’, you will see that it tries to be smart with requests up to 128KB (for memory re-usability, to avoid fragmentation and memory wastage), which are allocated from the heap. If an allocation is larger than 128KB, it will use the system’s memory mapping facilities.
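
The decision can be sketched as a very simplified model. Be aware that real ptmalloc2 is far more involved: it adjusts the mmap threshold dynamically and first tries to reuse free chunks already on the heap; the 128 KB figure below is only the static default (M_MMAP_THRESHOLD, tunable via mallopt(3)):

```python
# Much-simplified model of ptmalloc2's brk-versus-mmap decision.
# 128 KB is the static default of M_MMAP_THRESHOLD; the real allocator
# adjusts this threshold dynamically and reuses freed heap chunks first.
M_MMAP_THRESHOLD = 128 * 1024   # bytes

def allocation_source(request_bytes):
    if request_bytes >= M_MMAP_THRESHOLD:
        return "mmap"   # a new anonymous memory mapping
    return "heap"       # served from [heap], grown with brk() if needed

print(allocation_source(64 * 1024))    # → heap
print(allocation_source(1024 * 1024))  # → mmap
```

This is why a process that allocates many small buffers shows a growing [heap] mapping, while a single large allocation appears as a fresh anonymous mapping in maps output.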

Okay, this brings us back to the original question: how much memory does this process take? I hope you recall from the first blogpost that Linux tries to share as much memory as possible, and that when a new process is created, the allocations in the address space of this new process point to the (physical) memory pages of the parent process. Let’s first use a utility a lot of people are using: top.

top - 10:53:40 up 9 days, 14:11,  2 users,  load average: 1.34, 1.36, 1.37
Tasks: 1124 total,   1 running, 1123 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.9%us,  0.8%sy,  0.1%ni, 98.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  98807316k total, 97411444k used,  1395872k free,   400288k buffers
Swap: 25165820k total,  3560852k used, 21604968k free, 27573200k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 16391 oracle    -2   0 2369m 4960 4812 S  0.7  0.0  91:41.37 asm_vktm_+asm1
 16407 oracle    20   0 2388m  20m 8580 S  0.7  0.0  46:16.40 asm_dia0_+asm1

I edited the output a bit to show only two processes from an ASM instance.
The columns that show information on memory are VIRT, RES, SHR, %MEM.

VIRT is described in the man-page of top as ‘The total amount of virtual memory used by the task’. This means it’s ALL the memory visible in the address space of the process. A useful utility to get the contents of the virtual memory allocations of a process is pmap; let’s use it for process 16391, which is asm_vktm_+asm1:

$ pmap -x 16391
16391:   asm_vktm_+ASM1
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000  246644    2520       0 r-x--  oracle
000000000f6dd000    1952      44      20 rw---  oracle
000000000f8c5000     140       0       0 rw---    [ anon ]
000000001042c000     356       0       0 rw---    [ anon ]
0000000060000000    4096       0       0 rw-s-  SYSV00000000 (deleted)
0000000061000000 2080768       0       0 rw-s-  SYSV00000000 (deleted)
00000000e0000000      12       4       4 rw-s-    [ shmid=0x4e78003 ]
00007fa275589000      72      12       0 r-x--  libnfsodm12.so
00007fa27559b000    2044       0       0 -----  libnfsodm12.so
00007fa27579a000       8       0       0 rw---  libnfsodm12.so
00007fa27579c000    1604       0       0 r-x--  libshpksse4212.so
00007fa27592d000    2044       0       0 -----  libshpksse4212.so
00007fa275b2c000      72       0       0 rw---  libshpksse4212.so
00007fa275b3e000      20       4       0 r-x--  libcxgb3-rdmav2.so
00007fa275b43000    2044       0       0 -----  libcxgb3-rdmav2.so
00007fa275d42000       4       0       0 rw---  libcxgb3-rdmav2.so
...snip...
00007fa27abd5000       8       0       0 r-x--  libodmd12.so
00007fa27abd7000    2044       0       0 -----  libodmd12.so
00007fa27add6000       4       0       0 rw---  libodmd12.so
00007fa27add7000     128     116       0 r-x--  ld-2.12.so
00007fa27adf8000     512      12      12 rw---    [ anon ]
00007fa27ae78000     212       4       4 r--s-  passwd
00007fa27aead000    1260      68      68 rw---    [ anon ]
00007fa27aff3000       4       4       0 rw-s-  hc_+ASM1.dat
00007fa27aff4000       8       0       0 rw---    [ anon ]
00007fa27aff6000       4       0       0 r----  ld-2.12.so
00007fa27aff7000       4       0       0 rw---  ld-2.12.so
00007fa27aff8000       4       0       0 rw---    [ anon ]
00007fff7b116000     132       8       8 rw---    [ stack ]
00007fff7b1ff000       4       4       0 r-x--    [ anon ]
ffffffffff600000       4       0       0 r-x--    [ anon ]
----------------  ------  ------  ------
total kB         2426668    5056     156

The column ‘Kbytes’ represents the full size of the executable, libraries, shared memory, anonymous mappings and other mappings of this process. For completeness’ sake: 2426668/1024=2369.79, which matches the 2369m in the top output. This is all the memory this process can see and could use. Does this tell us anything about the memory process 16391 actually takes? No. (Parts of) the Oracle executable’s allocations are potentially shared, the shared memory (SYSV00000000 (deleted) and [ shmid=0x4e78003 ], which represent the Oracle SGA) is shared, and the memory allocations for the libraries are potentially shared. The anonymous memory mappings have been defined, but their actual usage is not visible in the Kbytes column. What this value does for me is give the approximate SGA size, especially if the SGA is large (meaning multiple gigabytes).
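
The arithmetic is easy to verify yourself: VIRT is simply the sum of the Kbytes column of pmap. Here is a sketch that sums a few lines from the pmap output above (the excerpt is shortened, so the subtotal is smaller than the full 2426668 KB):

```python
# Excerpt of the pmap -x 16391 output shown above (shortened).
PMAP_EXCERPT = """\
0000000000400000  246644    2520       0 r-x--  oracle
000000000f6dd000    1952      44      20 rw---  oracle
0000000061000000 2080768       0       0 rw-s-  SYSV00000000 (deleted)
"""

def sum_columns(pmap_text):
    """Sum the Kbytes and RSS columns of pmap -x style output."""
    kbytes = rss = 0
    for line in pmap_text.splitlines():
        fields = line.split()
        kbytes += int(fields[1])   # column 2: mapping size in KB
        rss += int(fields[2])      # column 3: resident (touched) KB
    return kbytes, rss

print(sum_columns(PMAP_EXCERPT))   # → (2329364, 2564)
print(2426668 // 1024)             # → 2369, the full total as top's 2369m
```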

The second memory column in top is RES. RES is described as: ‘The non-swapped physical memory a task is using’. RES is sometimes referred to as RSS, ‘resident set size’. As we can see from the total, the resident set size is far smaller than the virtual memory size. One important thing in the RES description of top is that it states that swapped memory pages are not counted in the RES/RSS value. RES/RSS corresponds to the memory actually used (“touched”) by a process, and is directly usable. If you look back at the RSS column of the pmap output above, you see the oracle executable’s two mappings: one has an RSS size of 2520, and one has an RSS size of 44. But… if you remember that the code/read-only segment is potentially shared with other processes, and then look at the 2520 value (which belongs to the oracle memory segment with the rights r-x--, which means the code segment), I hope you understand this just means this process (vktm) read a subset of the entire executable, and, more importantly: the RSS size does not reflect physical memory uniquely allocated by this process.
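
A toy calculation (made-up process count, not measured data) shows why summing RSS across processes is misleading: the touched pages of the shared code segment are counted once per process, while they exist only once in physical memory (assuming, for simplicity, that every process touched the same pages):

```python
# Toy illustration of RSS double-counting. The 2520 KB is the touched
# part of the shared oracle code segment seen in the pmap output above;
# the process count of 50 is a made-up example value.
shared_code_rss_kb = 2520
n_processes = 50

naive_sum = shared_code_rss_kb * n_processes   # what summing top's RES suggests
actual_physical = shared_code_rss_kb           # one copy of the pages in RAM
print(naive_sum, actual_physical)              # → 126000 2520
```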

If we look at the shared memory segments, it’s interesting to see what happens during the normal life of a database session. Needless to say, you should account for shared memory outside of per-process memory usage, since it is a distinct memory set that is truly shared by all the processes created for the instance.

This is a session which has just started:

$ pmap -x 43853
Address           Kbytes     RSS   Dirty Mode   Mapping
...
0000000060000000       4       0       0 r--s-    [ shmid=0xb87801d ]
0000000060001000    2860     348     348 rw-s-    [ shmid=0xb87801d ]
0000000061000000 4046848    2316    2316 rw-s-    [ shmid=0xb88001e ]
0000000158000000  144592       0       0 rw-s-    [ shmid=0xb88801f ]
0000000161000000      20       4       4 rw-s-    [ shmid=0xb890020 ]
...

This instance has an SGA set to 4G. Because the session just started, it has only touched 2316 (KB) of the SGA. Next, I do a big (buffered!) full table scan, requiring the session to put a lot of blocks into the buffer cache. After the scan, look at the shared memory segments using pmap again:

$ pmap -x 43853
Address           Kbytes     RSS   Dirty Mode   Mapping
...
0000000060000000       4       0       0 r--s-    [ shmid=0xb87801d ]
0000000060001000    2860     384     384 rw-s-    [ shmid=0xb87801d ]
0000000061000000 4046848 2279040 2279040 rw-s-    [ shmid=0xb88001e ]
0000000158000000  144592   66564   66564 rw-s-    [ shmid=0xb88801f ]
0000000161000000      20       4       4 rw-s-    [ shmid=0xb890020 ]
...

The session has now touched roughly half of the SGA shared memory segment (visible in the RSS column of the 6th line). This is logical if you understand what is going on: the process does a buffered table scan, which means the blocks read from disk need to be stored in the buffer cache, which is one of the memory structures in the Linux shared memory segments. However, if you look strictly at the top utility output of a database that has just started up, you see the RSS size of all the processes growing. This phenomenon has led to a recurring question on the Oracle Linux forums: are Oracle database processes leaking memory? Of course the correct answer is that the RSS size just grows because each process touches more of the shared memory (=SGA) that has been mapped into its address space. It will stop increasing once the process has touched all the memory it can touch.
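
Diffing the two pmap snapshots above per shared memory segment makes the growth explicit. This small sketch uses the RSS figures from the before and after listings:

```python
# RSS in KB per shared memory segment, taken from the two pmap
# snapshots above (before and after the buffered full table scan).
BEFORE = {
    "0000000060001000": 348,
    "0000000061000000": 2316,
    "0000000158000000": 0,
}
AFTER = {
    "0000000060001000": 384,
    "0000000061000000": 2279040,
    "0000000158000000": 66564,
}

growth = {addr: AFTER[addr] - BEFORE[addr] for addr in BEFORE}
print(growth["0000000061000000"])          # → 2276724 KB newly touched
print(round(2279040 / 4046848, 2))         # → 0.56: about half the segment
```

The SGA did not grow at all; the session merely touched pages that were already mapped, and the per-process RSS accounting followed.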

%MEM is the RES/RSS size expressed as a percentage of the total physical memory.
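
You can verify this against the top output above. For pid 16391, RES is 4960 KB against 98807316 KB of physical memory, which works out to roughly 0.005%, displayed by top’s single decimal as 0.0:

```python
# %MEM as top computes it: RES as a percentage of total physical memory.
# Values taken from the top output above (pid 16391).
res_kb = 4960
mem_total_kb = 98807316

pct = 100.0 * res_kb / mem_total_kb
print(round(pct, 4))   # → 0.005, which top's single decimal shows as 0.0
```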

SHR is the amount of shared memory. The man-page of top says ‘It simply reflects memory that could be potentially shared with other processes’. Do not confuse this with the shared memory segments mapped into the process’ address space. Empirical tests show the SHR value always seems to be lower than the RSS size, which suggests it tracks the touched (RSS) memory that could be shared (which seems to include both touched memory from the shared memory segments and the privately mapped executable and libraries). From an Oracle performance and troubleshooting perspective I can’t see any benefit in using this value.

The conclusion of this part on memory usage is that neither the virtual set size (VIRT/VSZ) nor the resident set size (RES/RSS) is a figure you can simply add up to indicate the physical memory usage of a single process or a group of processes.

The virtual set size gives you the total amount of virtual memory available to a single process, which includes the executable, the (potentially shared) libraries, anonymous memory mappings, files mapped into the address space, and shared memory. In a lot of cases, it gives an indication of the total SGA size of the instance, because the shared memory segments which contain the SGA are entirely mapped into the address space of the process.

The resident set size shows you how much of the memory and files mapped into the address space has actually been “touched”, and is directly usable. Some of this memory usage results in memory pages private to the process, because of writes to the memory and the copy-on-write mechanism of the Linux memory manager. A lot of other memory mappings can be used, increasing the resident set size, while they are shared with other processes. A third potential component is the actual usage of anonymous memory mappings (versus the total allocation, which can be much larger), which are private to the process.