
A note before we start: I am not a system administrator. I wrote this article to troubleshoot my own problems, after reading too many misleading articles about this topic on the web, with the notable exception of https://www.tldp.org.
The word memory will refer to RAM (Random Access Memory).
The word virtual memory will refer to all of an application's available logical addresses, which may or may not have a corresponding physical address because the Linux kernel overcommits memory.
The word swap will refer to the swap space on the hard disk.
First, let's have a glimpse of a Linux server's memory output, on kernel version 3.10.0-123:
[eclipse@hkclapps15 sa]$ free -h
             total       used       free     shared    buffers     cached
Mem:          5.7G       3.7G       1.9G        23M         0B       2.1G
-/+ buffers/cache:       1.6G       4.1G
Swap:         8.0G       150M       7.9G
More columns are available using sar:
[eclipse@hkclapps15 sa]$ sar -r -f sa14 -s 12:00:00 | head -5
Linux 3.10.0-123.20.1.el7.x86_64 (hkclapps15.hk.eclipseoptions.com) 03/14/19 _x86_64_ (4 CPU)
12:00:01 kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
12:10:01 151292 5796656 97.46 0 2602792 5313264 37.06 3442716 2128300 168
12:20:01 150568 5797380 97.47 0 2601060 5315152 37.07 3415056 2156808 268
The Linux kernel divides physical memory into pages (typically 4 KB), then hands them to processes requiring virtual memory for code, cache, metadata, Java heap, garbage collector, code cache, compiler, class loading, symbol tables, threads, etc. (Refer to https://stackoverflow.com/questions/53451103/java-using-much-more-memory-than-heap-size-or-size-correctly-docker-memory-limi/53624438#53624438 for JVM memory allocation.)
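As a quick sanity check, the page size and a rough per-process view can be read from standard interfaces (the <pid> below is just a placeholder):
getconf PAGESIZE                                   # page size in bytes, typically 4096
grep -E 'VmSize|VmRSS|VmSwap' /proc/<pid>/status   # total virtual size, resident RAM, and swapped-out portion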
There are two types of pages: file-backed and non-file-backed. The non-file-backed pages are called anonymous pages (not to be confused with anonymous mappings); they hold memory allocated inside code that is not backed by a file, such as the stack and the heap. (Refer to https://landley.net/writing/memory-faq.txt) The file-backed pages cover both ordinary use of RAM by applications and mmap (memory-mapped) files, including the buffer cache and binary images stored on disk. The performance difference between the two shows up during swap-in. For file-backed pages, swap-in is done by the kernel looking up the mapping from the files the process has opened. For anonymous pages, swap-in is done by the kernel remembering which region holds the non-file-backed pages. Remember that anonymous-page swap-in performs better than file-backed swap-in when swap lives on a fast disk like an SSD with enough space, compared to a spinning disk. (Reference: https://www.quora.com/How-do-anonymous-VMAs-work-in-Linux)
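To see how much RAM is currently anonymous versus file-backed on a box, a small sketch using /proc/meminfo (field names as on recent kernels):
grep -E '^(Active|Inactive)\((anon|file)\)' /proc/meminfo
# Active(anon)/Inactive(anon): anonymous pages (heap, stack); reclaiming them means writing to swap
# Active(file)/Inactive(file): file-backed pages (page cache, mmap); reclaiming them means dropping or writing back to the file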
Why do we care? Because swappiness is essentially the reclaim priority of anonymous pages when the system is under memory stress (and, correspondingly, a statement about the disk I/O cost of file pages). It is definitely not the percentage of free memory over total memory! The word "reclaim" here means that after pages are purged from physical memory, we can still get the data back from somewhere.
This is where swappiness is set:
[eclipse@hkclapps15 sa]$ cat /proc/sys/vm/swappiness
30
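If you want to change it, a typical approach is the following (assuming root privileges; the value 10 is only an example):
sysctl -w vm.swappiness=10                       # takes effect immediately, lost after reboot
echo 'vm.swappiness = 10' >> /etc/sysctl.conf    # persist across reboots
sysctl -p                                        # reload settings from /etc/sysctl.conf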
According to the Linux vmscan.c source code (refer to https://github.com/torvalds/linux/blob/master/mm/vmscan.c), the file priority is obtained by subtracting the swappiness (at most 100) from 200. The highest swappiness means we tell the Linux kernel to treat clean pages, dirty pages, and anonymous pages equally when making room in RAM, whatever the scenario.
- clean cache — pages can be dropped without losing data
- dirty cache — pages need to be written back to disk before being dropped, so no data is lost
- anonymous cache — pages cannot be dropped at all unless we have swap
Atop output:
MEM | tot 25.5G | free 4.6G | cache 18.3G | dirty 254.3M | buff 0.0M | slab 319.6M | slrec 198.6M | shmem 48.5M | shrss 0.6M | shswp 0.0M
How swap matters in different scenarios:
- Memory under stress — clean cache pages start to be dropped; file-backed paging activity starts to increase; swappiness decides the page reclaim priority for non-file-backed memory. Eventually everything tries to use swap and slows the application down if swap is on a spinning disk. (Your swap is more likely to be on a spinning disk than on an SSD, because an SSD costs more than simply adding memory.)
- Memory not under stress — the kernel scans RAM and may decide to swap anonymous pages out to improve performance
- Memory starvation — swap acts as emergency memory and prolongs the thrashing (constant page faulting the system cannot recover from)
/*
 * With swappiness at 100, anonymous and file have the same priority.
 * This scanning priority is essentially the inverse of IO cost.
 */
anon_prio = swappiness;
file_prio = 200 - anon_prio;
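Plugging in the swappiness of 30 from the box above, the priorities work out as follows (just the arithmetic from the snippet, not the full reclaim logic, which also weights recent scan history):
anon_prio = swappiness      = 30
file_prio = 200 - anon_prio = 170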
However, a big warning here: swappiness is only the tip of the iceberg in how the Linux memory management code decides to use swap, alongside CPU scheduling, how hot or cold pages are in physical memory, and so on. And in kernel version 3 there is currently no good way to tell whether a Linux system is under memory pressure or not (refer to https://chrisdown.name/2018/01/02/in-defence-of-swap.html).
In a normal setting, a swappiness of 0 avoids ever swapping pages out just to gain caching space, while 100 always favors making the disk cache bigger. In general, a high swappiness value maximizes throughput: how much work the system gets done per unit of time. A low swappiness value favors latency: getting quick response times from applications.
Some Frequently Asked Questions:
Why is swap used when plenty of free memory is left?
- https://serverfault.com/questions/420778/why-swap-is-used-when-plenty-of-free-memory-is-left
- the system tends to put unused and very infrequently accessed memory into swap so that RAM can be used for cache instead
- By default, Linux aggressively swaps processes out of physical memory onto disk in order to keep the disk cache as large as possible
Why is swap still allocated when there is free RAM?
- When a program closes a file, it is a good idea to keep the file in cache in case it is used again, rather than having to assign pages to it the next time the file is opened. When swappiness is 0, the system only reclaims inactive file-backed memory (dropping it or writing it back to its file); when swappiness is 100, the system reclaims inactive file-backed memory and anonymous memory equally, with the anonymous pages going to swap. Therefore, if free+cached memory looks fine on a box but swap seems overused: increasing swappiness will put more inactive heap and stack memory into swap, leaving more cached memory for applications to use or quickly drop; decreasing swappiness can decrease swap usage, but will increase used physical memory and thus leave less cached memory for applications to use or drop.
- When swap is allocated, it doesn't necessarily mean it is actively being used
- It can simply be an application that keeps reading an opened file from disk
- The system may swap inactive pages back in whenever it thinks reasonable. Run "vmstat 1" to see whether the system is swapping in or out right now (the si/so columns); see also the sketch after this list for finding which processes hold the swap
- Alternatively you can use swapoff -a && swapon -a to clear the used swap, but this may cause problems, especially when memory is under stress
- https://askubuntu.com/questions/1357/how-to-empty-swap-if-there-is-free-ram
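To find which processes are actually holding the allocated swap, here is a small sketch that walks /proc (reading other users' entries may require root):
for f in /proc/[0-9]*/status; do
    awk '/^Name:/ {name=$2} /^VmSwap:/ {if ($2 > 0) print $2, "kB", name}' "$f"
done | sort -rn | head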
How to define memory stress?
- slow increase of actually used physical memory
- the system flushing cached memory and buffer memory (both decreasing) to make space
- the system paging in lots of cache as if it is hungry
- high active (recently accessed) vs inactive memory ratio
- increasing faults per second and major faults per second (see the commands after this list)
- if virtual memory usage is higher than physical memory due to Linux overcommitting, pages will be brought in on demand (demand paging) according to https://www.tldp.org/LDP/tlk/mm/memory.html
- https://www.linuxjournal.com/article/8178 — how to monitor linux memory
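A few commands that surface these signals (a sketch; the sampling interval and count are arbitrary):
vmstat 5 5        # si/so show swap in/out; watch free, buff, cache trends
sar -B 5 5        # fault/s and majflt/s: steadily rising major faults suggest stress
sar -r 5 5        # %memused plus kbcommit/%commit for overcommit pressure
grep -E '^(CommitLimit|Committed_AS)' /proc/meminfo    # how far overcommit has gone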
What is the result of memory stress?
- large paging out to disk
- increase dirty cache due to write-back caching for disk writes
- CPU spikes to keep up with paging
- thrashing (system doing more paging than processing application)
What exactly are the page-in and page-out stats from sar -B?
- Since Linux 2.6 the buffer cache has been included in the page cache (as reported in /proc/vmstat). Thus page activity can be reading and writing the executable parts of binaries, i.e. accessing memory-mapped files on the hard disk.
- Currently Linux doesn't have a counter dedicated to page cache activity alone, so only the si/so columns from vmstat can tell you about swapping; see the sketch after this list
- https://serverfault.com/questions/270283/what-does-the-fields-in-sar-b-output-mean
- https://lists.gt.net/linux/kernel/1131720?do=post_view_threaded
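To separate general paging from actual swap traffic, the raw counters can be compared directly (a sketch; counter names are from /proc/vmstat on recent kernels):
grep -E '^(pgpgin|pgpgout|pswpin|pswpout) ' /proc/vmstat
# pgpgin/pgpgout: all data paged in/out from disk, including page cache and memory-mapped files
# pswpin/pswpout: pages moved in/out of swap only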
References:
- How redhat explains swappiness and server tuning https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-memory-tunables
- Kafka virtual memory handling https://www.cloudera.com/documentation/enterprise/6/6.0/topics/kafka_system_level_broker_tuning.html#virtual_memory_handling
- The Linux page cache and pdflush write-back behavior http://www.westnet.com/~gsmith/content/linux-pdflush.htm