Dirty pages, swappiness, and finding which processes use swap

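The point of the article below is: do not over-allocate memory that will not be used. Dirty pages of that kind are not written to swap, but they still occupy physical memory and waste space, and they can lead to a lot of unnecessary swap handling (even if the kernel rarely decides to swap them, making that decision takes time).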

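To verify which PIDs are using the swap area, the command below can be used:

for file in /proc/[0-9]*/status; do awk '/^(Name|VmSwap):/ { printf "%s", $2 " " $3 } END { print "" }' "$file" 2>/dev/null; done | sort -k 2 -n -r

This gives roughly the same view as top after pressing f and using s to sort on the SWAP field.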
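The one-liner above prints the process name and swap size but not the PID. A minimal sketch that also reports the PID and skips processes with no swap usage could look like this. The function name swap_by_pid and its optional proc_root argument are illustrative, not part of any standard tool; only the first word of a process name is kept.

```shell
# List "PID NAME SWAP_KB" for every process with non-zero VmSwap, largest first.
# proc_root is a parameter only so the function can be pointed at a test tree.
swap_by_pid() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        awk '
            /^Name:/   { name = $2 }
            /^Pid:/    { pid = $2 }
            /^VmSwap:/ { swap = $2 }
            END        { if (swap > 0) printf "%s %s %s\n", pid, name, swap }
        ' "$status" 2>/dev/null
    done | sort -k 3 -n -r
}

# Example: swap_by_pid | head
```

Kernel threads have no VmSwap line in their status file, so they drop out of the list.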

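Some third-party web pages also claim that setting vm.swappiness to 10 means swap is used only once 90% of memory is allocated. This is not correct: vm.swappiness does not work that way (it is the ratio between stealing pages from the cache and swapping due to insufficient memory). Likewise, setting vm.swappiness to 100 does not mean swap will be used immediately after boot.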

Applies to:

Linux OS - Version Oracle Linux 5.1 to Oracle Linux 9.0 [Release OL5U1 to OL9]
Oracle Cloud Infrastructure - Version N/A and later
Linux x86
Linux x86-64

Goal

This document explains dirty pages in Oracle Linux.

Solution

Whenever an application or database process needs to bring a virtual memory page into physical memory but no free physical pages are left, the OS must evict some old pages.
If an old page has not been written to (it is unmodified), it does not need to be saved; it can simply be re-read from the data file.
If the old page has already been modified, it must be preserved somewhere so the application/database can reuse it later. This is called a dirty page (the same idea as a dirty block in a database).
The OS stores such dirty pages in swap files, so that they can be removed from physical memory and another 'new' page can take their place. (Note: why not write straight to the file? The OS has its own notion of commit; file-backed dirty pages are written back to their file, while pages with no backing file can only go to swap.)
If lots of data is moved out of the page cache into the dirty page area (that is, swapped out), this can cause a significant IO bottleneck when the swap device sits on a local disk (sda), and further issues if that same local disk also holds the root (OS) file system. (Note: one disk shared by the OS and swap is the most common setup.)

The page cache in Linux is simply a disk cache; it improves OS performance under intensive file reads and writes.
Further details can be found in KM note:

How to Check Whether a System is Under Memory Pressure (Doc ID 1502301.1)

Dirty pages are a by-product of the page cache, as explained in the example above.
Dirty pages also appear whenever an application writes to a file or creates one: the first write lands in the page cache, so creating a 100 MB file can be really fast:

(Note: the pages are allocated in memory.)

# dd if=/dev/zero of=testfile.txt bs=1M count=100
10+0 records in
10+0 records out
10485760 bytes (100 MB) copied, 0,1121043 s, 866 MB/s

This is because the file is created in memory rather than on the actual disk, hence the very fast response time.
The OS records this in /proc/meminfo, in the 'Dirty' row.

Before running the command above, note down the 'Dirty' row of /proc/meminfo:

# more /proc/meminfo | grep -i dirty
Dirty: 96 kB

After command is executed:

# more /proc/meminfo | grep -i dirty
Dirty: 102516 kB

Periodically the OS, or the application/database itself, initiates a sync, which writes the actual testfile.txt to disk:

(Note: an application can also trigger the sync itself.)

# more /proc/meminfo | grep -i dirty
Dirty: 76 kB
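The before/after comparison above can be repeated with a small helper. This is only a sketch: meminfo_kb is a hypothetical function name, and the optional second argument exists just so the parser can be tried on a saved copy of /proc/meminfo.

```shell
# Print the value (in kB) of one /proc/meminfo field, e.g. Dirty or Writeback.
meminfo_kb() {
    field="$1"
    file="${2:-/proc/meminfo}"
    awk -v f="$field:" '$1 == f { print $2 }' "$file"
}

# Example: show Dirty before and after forcing a flush.
# echo "Dirty before sync: $(meminfo_kb Dirty) kB"
# sync
# echo "Dirty after sync:  $(meminfo_kb Dirty) kB"
```

Running sync by hand forces the write-back that the kernel would otherwise perform on its own schedule, so the Dirty value should fall shortly afterwards.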

Oracle Database, for example, does not allow such writes to stay in memory, because if the OS crashes or a SAN LUN fails, data would be compromised.
That is why Oracle Database requires data to be 'in sync': every write must be confirmed by the backend (disk/LUN) before the database issues more write requests.

Normally databases/applications periodically drop the cache, so dirty pages are written to disk in small chunks. (Note: dropping the cache triggers the sync.)
In some cases dirty pages can grow in size, for example when the application/database has not configured its page cache usage properly. (Note: the OS and the application manage the page cache together.)

So dirty pages can be written to swap files (the swap area) but also to a regular region on disk (LUN/file system). (Note: they can go to swap or straight to the file.)
If, for example, we create more than 100 MB of swap data that is later read back from swap, we might cause unnecessary IO on the swap device.
Enterprise systems place swap files and the swap area on solid state drives (SSD) or on a dedicated LUN, so local disk performance is not impacted (normally the swap region is created on the local disk). (Note: swap deserves a dedicated disk as well.)
In some cases the application/database has internal issues, and dirty pages are written to swap but never reused. This makes the swap area grow, causes unnecessary IO on the local disk, and leads to large swap usage under the OS.

To find out at what point the OS will try to flush dirty pages back to disk, check the official kernel documentation on Virtual Memory and look for settings like:

vm.dirty_background_ratio
vm.dirty_ratio
vm.swappiness

and

dirty_background_ratio
dirty_ratio
dirty_background_bytes
dirty_expire_centisecs

The settings above need to be tuned per database/application requirement: the OS has no 'best practice' values for them, so they are tuned according to DB/APP load and configuration.
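Before tuning anything it helps to record the current values. The sketch below reads them straight from /proc/sys/vm; the function name show_vm_settings and its optional directory argument are illustrative only.

```shell
# Print the current value of the write-back and swap tunables listed above.
# The directory argument exists only so the function can be tested offline.
show_vm_settings() {
    dir="${1:-/proc/sys/vm}"
    for key in dirty_background_ratio dirty_ratio dirty_background_bytes \
               dirty_expire_centisecs swappiness; do
        if [ -r "$dir/$key" ]; then
            printf 'vm.%s = %s\n' "$key" "$(cat "$dir/$key")"
        fi
    done
}

# Example: show_vm_settings
```

A value can then be changed at runtime with, for instance, sysctl -w vm.dirty_background_ratio=5 (the number 5 is only a placeholder, not a recommendation) and made persistent through /etc/sysctl.conf.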

Whenever an application/database demands free pages in physical memory, the OS (which tends to keep everything in the page cache) has to reclaim some pages and mark them as dirty. (Note: the newly allocated memory pages are all dirty pages; a page that was never modified need not be kept, as explained earlier for unmodified pages that can simply be re-read from the data file.)
This process works fine if the application/database side is properly tuned and scaled. Otherwise it leads to really aggressive swapping, because the OS has to write all dirty pages out to the swap disk; this can be controlled via the vm.swappiness setting.
If an application/database causes aggressive swapping, it can generate heavy IO writes on the swap device and lead to serious system stalls. Always make sure that applications/databases are properly configured in terms of memory management.

As explained, not all pages are marked as dirty: unused pages are mostly discarded rather than marked dirty (it all depends on whether the already allocated pages were modified or not).

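(Note: the takeaway from all of the above is not to over-allocate memory that goes unused. Such dirty pages are not written to swap, but they waste space and cause unnecessary swap handling.)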

Options for releasing 'consumed' swap space are really limited. Normally, when a PID exits properly or is shut down cleanly, its swap space is reclaimed; a PID that is killed or ends abnormally (for example with a segfault) may still leave swap space consumed. Another option is to reboot, as running swapoff and swapon can cause serious issues or even lead to a system panic. (Note: a process that exits normally releases its swap; one that ends abnormally may not release it even after being killed.)

To understand why swap is still consumed even if swappiness is set to 0, please refer to the KM note listed under References.

References

NOTE:2328563.1 - Oracle Linux: Setting vm.swappiness=0 Does Not Completely Disable Swap Usage

---------------------------Setting vm.swappiness=0 Does Not Completely Disable Swap Usage

Why does adding vm.swappiness=0 to /etc/sysctl.conf not completely disable swap usage?

Solution

Explanation on vm.swappiness setting from kernel documentation:

Swappiness

This control is used to define how aggressive the kernel will swap
memory pages. Higher values will increase agressiveness, lower values
decrease the amount of swap. A value of 0 instructs the kernel not to
initiate swap until the amount of free and file-backed pages is less
than the high water mark in a zone.

So if there is swap present, it'll be used if needs be. vm.swappiness=0 discourages the kernel from using it, but doesn't prevent it.
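Whether swap is present at all, and how full each device is, can be read from /proc/swaps. A rough sketch follows, assuming the usual five-column layout (Filename, Type, Size, Used, Priority) and device names without spaces; swap_usage is a hypothetical helper name.

```shell
# Print each swap device with the percentage of its space currently used.
swap_usage() {
    awk 'NR > 1 && $3 > 0 { printf "%s %d%%\n", $1, $4 * 100 / $3 }' "${1:-/proc/swaps}"
}

# Example: swap_usage
```

An empty result means no swap device is active, in which case vm.swappiness has nothing to act on.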

The example below from the top command on OL7 might be confusing:

top - 12:21:27 up 2 days, 16:57, 4 users, load average: 1.62, 1.95, 2.13
Tasks: 539 total, 1 running, 538 sleeping, 0 stopped, 0 zombie
%Cpu(s): 13.8 us, 3.7 sy, 0.0 ni, 79.2 id, 3.0 wa, 0.0 hi, 0.3 si, 0.0 st
KiB Mem : 36383529+total, 1715556 free, 32909801+used, 33021720 buff/cache
KiB Swap: 5388604 total, 103444 free, 4185160 used. 33684252 avail Mem

top shows about 30 GB as available memory, but it is best to verify the MemAvailable value in /proc/meminfo:

cat /proc/meminfo | grep MemAvailable

MemAvailable: 2440244 kB

What is MemAvailable:

An estimate of how much memory is available for starting new applications, without
swapping.
Calculated from MemFree, Reclaimable, the size of the file LRU lists, and the
low watermarks in each zone.

The estimate takes into account that the system needs some page cache to function well,
and that not all reclaimable slab will be reclaimable, due to items being in use.
The impact of those factors will vary from system to system.

Hence even if the system shows 30 GB as free, the actual MemAvailable is quite low, which can cause higher swap usage even with swappiness set to 0.
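To put MemAvailable in proportion, it can be expressed as a share of MemTotal. This is a minimal sketch: avail_pct is an illustrative name, and on kernels whose /proc/meminfo has no MemAvailable row the result is simply 0.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
avail_pct() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END              { if (total > 0) printf "%d\n", avail * 100 / total }
    ' "${1:-/proc/meminfo}"
}

# Example: echo "MemAvailable is $(avail_pct)% of MemTotal"
```

A low percentage alongside a large "free" or "buff/cache" figure is the situation described above, where swap can still be used despite swappiness being 0.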

Another explanation for swap usage may be dirty pages (KM note 2304722.1) or missing database tuning, which is explained in 1295478.1.

Some third-party web pages also claim that setting vm.swappiness to 10 means swap space is used only once 90% of memory is allocated. This is not true, as vm.swappiness does not work like that (it is the ratio between stealing pages from the cache and swapping due to insufficient memory). Similarly, setting vm.swappiness to 100 does not mean swap will be used immediately after boot.

(Note: the default value is 60, and I had read that as "swap is considered once memory usage reaches 40%", which seemed right to me; per the kernel documentation quoted above, though, swappiness is not a memory-usage threshold.)

The official statement on vm.swappiness can be found in the kernel's memory management documentation.

As a side note, swap usage can be lowered by enabling HugePages, but this mostly applies to Oracle Database cases, as enabling HugePages will not stop the system or application layer from swapping. (Note: HugePages can only be used for the SGA; the PGA is not protected.)
Reference KM note:

HugePages on Oracle Linux 64-bit (Doc ID 361468.1)

And statement:

"The HugePages configuration described in this document does not cause the O/S components to use HugePages. HugePages will be used by applications which explicitly make use of HugePages in their code (like majority of Oracle RDBMS SGA given proper configuration). Therefore, you will still see swap usage on the system as the regular O/S components, or non-HugePages-aware applications use swappable pages."

Also, running the well-known command found on third-party websites, e.g.:

# swapoff -a && swapon -a

is neither supported nor recommended. If the system already struggles with memory and swap, customers should validate their configuration settings or properly tune the APP/DB side:

How to Calculate Memory Usage on Linux (Doc ID 1630754.1)

Rather than disabling the swap device and dumping out the pages allocated in it. (Note: are those pages dumped back to memory or to a file? swapoff has to bring them back into memory.)

Hence unexpected results from the commands above will not be debugged by Oracle Linux support in case of issues (while the swap device is being dumped out, many services/PIDs might still rely on pages held in swap, leading to uncontrolled results).

------

Purpose

This document briefly discusses Linux swapping and its nature, with references to database workloads.

Scope

This document is useful for Linux and database administrators for configuring, evaluating and monitoring systems.

Details

Linux OS is a virtual memory system like any other modern operating system. The Virtual Memory Management system of Linux includes:

Paging
Swapping
HugePages
Slab allocator
Shared memory

When almost all of the available physical memory (RAM) is in use, the Linux kernel starts to swap pages out to swap (disk space) or, worse, it may start swapping out entire processes. Another scenario is that it starts killing processes using the Out-of-Memory (OOM) Killer (see Document 452000.1).

(Note: it is process memory that goes to swap, not the page cache.)

Swap Usage on Linux

To check swap usage on Linux, use one of the following:

free: Look for low (or zero) values for Swap / used:

# free -m
             total       used       free     shared    buffers     cached
Mem:          4018       3144        873          0         66       2335
-/+ buffers/cache:        742       3276
Swap:         4690          0       4690

meminfo: Look for SwapTotal = SwapFree:

# grep Swap /proc/meminfo
SwapCached:            0 kB
SwapTotal:       4803392 kB
SwapFree:        4803380 kB

top: Look for low (preferably zero) values of Swap / used:

# top

...
Mem: 4115320k total, 3219408k used, 895912k free, 68260k buffers
Swap: 4803392k total, 12k used, 4803380k free, 2390804k cached
...

vmstat: Look for si / so values to be zero:

# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 12 871592 69308 2405188 0 0 103 36 275 500 14 13 71 1
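Watching si/so by eye gets tedious over several samples. The sketch below filters vmstat output and prints only the samples where swapping actually happened; swap_activity is a hypothetical name, and the column positions (si = 7, so = 8) assume the classic layout shown above.

```shell
# Read vmstat output on stdin and report samples with non-zero si or so.
# Header lines are skipped because their first field is not a number.
swap_activity() {
    awk '
        $1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) {
            hits++
            printf "si=%s so=%s\n", $7, $8
        }
        END { if (!hits) print "no swap-in/swap-out activity" }
    '
}

# Example: vmstat 1 5 | swap_activity
```

Keep in mind that the first line vmstat prints is an average since boot, so sustained non-zero values in the later samples are the ones that matter.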

Why is Swapping Bad

Try to avoid swapping, especially on Linux, because:

Swapping potentially makes every memory page access a thousand (or more) times slower (and the Linux swapping mechanism is not particularly fast).
As more memory is swapped, more operations take longer.
As operations take longer, more requests queue up to be served.
The demand for resources increases exponentially.

Given the scenario above, if a memory-bound application (such as a database) is running and swapping starts, most of the time there is no recovery. (Note: this applies when it is the SGA that gets swapped.)

The Oracle Database SGA pages are pageable on Linux by default, and those pages can potentially be swapped out if the system runs out of memory. Using HugePages is one way to keep the Oracle SGA from being swapped out at all; still, one needs to be careful about the configuration. To learn all about HugePages, please read Document 361323.1 and its references.
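As a rough illustration of the sizing arithmetic only (not Oracle's documented procedure, which should be taken from the HugePages notes), the number of huge pages needed to hold an SGA is the SGA size divided by the huge page size, rounded up. The function name and its arguments are assumptions made for this example.

```shell
# Round an SGA size (in MB) up to a whole number of huge pages.
# Second argument: huge page size in kB (2048 is common on x86-64;
# check the Hugepagesize row in /proc/meminfo).
hugepages_needed() {
    sga_mb="$1"
    page_kb="${2:-2048}"
    echo $(( (sga_mb * 1024 + page_kb - 1) / page_kb ))
}

# Example: hugepages_needed 8192
```

The result is a lower bound for vm.nr_hugepages; real systems with several instances or shared memory segments need the sum over all of them.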

Conclusions

Make sure the total SGA and PGA fit in RAM while leaving a decent amount of memory for process spaces and system services. See the database installation guides for more information.
Consider using HugePages on Linux
Be very careful with memory configuration (HugePages, Automatic Memory Management, Swap, VLM)
Monitor OS continuously for memory usage and swapping
