1. Introduction
Having covered struct zone and struct page, we finally arrive at the node structure, struct pglist_data. It is the last of the three major structures of Linux kernel physical memory management, and the one at the top of the pyramid. It has been a long road, and there is still a way to go.
In Linux 物理内存管理涉及的三大结构体之struct page we met the two CPU-memory architectures: UMA and NUMA. In the NUMA model, each CPU has its own local memory node and reaches other CPUs' local memory nodes over the QPI bus; accessing local memory is much faster than accessing another CPU's memory, with one QPI hop typically adding about 30% access latency. In UMA, every CPU reaches memory over a single bus (the northbridge), so the access path and latency are identical for all CPUs. From this hardware picture, Linux abstracts a three-level physical memory management hierarchy: node, zone and page frame. UMA has a single node while NUMA has several; in effect, UMA is just the special case of NUMA with exactly one node.
Below is the kernel-defined array node_data[MAX_NUMNODES] that holds the NUMA nodes. It carries the __read_mostly annotation, which places the variable in the .data..read_mostly section so that rarely written data is grouped together and stays cache-friendly, instead of sharing cache lines with frequently written data (the same annotation we saw on totalreserve_pages in the struct zone article). MAX_NUMNODES is 2^NODES_SHIFT: if CONFIG_NODES_SHIFT is set, NODES_SHIFT takes its value; otherwise NODES_SHIFT is 0. CONFIG_NODES_SHIFT ranges over [1,10] with a default of 4, and it is only available when NEED_MULTIPLE_NODES is enabled.
//mm/Kconfig
#
# Both the NUMA code and DISCONTIGMEM use arrays of pg_data_t's
# to represent different areas of memory. This variable allows
# those dependencies to exist individually.
#
config NEED_MULTIPLE_NODES
def_bool y
depends on DISCONTIGMEM || NUMA
//arch/arm64/Kconfig
config NODES_SHIFT
int "Maximum NUMA Nodes (as a power of 2)"
range 1 10
default "4"
depends on NEED_MULTIPLE_NODES
help
Specify the maximum number of NUMA Nodes available on the target
system. Increases memory reserved to accommodate various tables.
//include/linux/numa.h
#ifdef CONFIG_NODES_SHIFT
#define NODES_SHIFT CONFIG_NODES_SHIFT
#else
#define NODES_SHIFT 0
#endif
#define MAX_NUMNODES (1 << NODES_SHIFT)
//arch/arm64/mm/numa.c
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
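A node's pg_data_t is then reached through the NODE_DATA() macro. With NUMA on arm64 it simply indexes node_data[]; on builds without NEED_MULTIPLE_NODES it returns the single statically allocated contig_page_data:
//arch/arm64/include/asm/mmzone.h
#ifdef CONFIG_NUMA
#include <asm/numa.h>
extern struct pglist_data *node_data[];
#define NODE_DATA(nid)		(node_data[(nid)])
#endif
//include/linux/mmzone.h
#ifndef CONFIG_NEED_MULTIPLE_NODES
extern struct pglist_data contig_page_data;
#define NODE_DATA(nid)		(&contig_page_data)
#endif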
Today's common mobile devices, phones, use the UMA architecture: a phone has at most 8 CPU cores, so contention for memory is not severe, especially under the current large-memory trend (16GB RAM + 1TB storage, with phones also offering memory extension that uses part of storage as working memory), so everything sits on a single node; the same goes for ordinary PCs. Server platforms have far more CPU cores, say 1024, and with a shared-memory design the CPUs would bottleneck on memory access, so NUMA is the better fit there. One clarification: in a NUMA system with many CPUs, it is often not one local memory node per CPU, but one local memory node per CPU module, where each module consists of several CPUs.
2. struct pglist_data
Now let's go through the members of the node structure struct pglist_data, based on kernel-5.4 (an Android common kernel tree, as a few fields below are Android-specific).
/*
* On NUMA machines, each NUMA node would have a pg_data_t to describe
* it's memory layout. On UMA machines there is a single pglist_data which
* describes the whole memory.
*
* Memory statistics and page replacement data structures are maintained on a
* per-zone basis.
*/
typedef struct pglist_data {
/*
* node_zones contains just the zones for THIS node. Not all of the
* zones may be populated, but it is the full list. It is referenced by
* this node's node_zonelists as well as other node's node_zonelists.
*/
struct zone node_zones[MAX_NR_ZONES];
/*
* node_zonelists contains references to all zones in all nodes.
* Generally the first zones will be references to this node's
* node_zones.
*/
struct zonelist node_zonelists[MAX_ZONELISTS];
int nr_zones; /* number of populated zones in this node */
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
#ifdef CONFIG_PAGE_EXTENSION
struct page_ext *node_page_ext;
#endif
#endif
#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
/*
* Must be held any time you expect node_start_pfn,
* node_present_pages, node_spanned_pages or nr_zones to stay constant.
* Also synchronizes pgdat->first_deferred_pfn during deferred page
* init.
*
* pgdat_resize_lock() and pgdat_resize_unlock() are provided to
* manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
* or CONFIG_DEFERRED_STRUCT_PAGE_INIT.
*
* Nests above zone->lock and zone->span_seqlock
*/
spinlock_t node_size_lock;
#endif
unsigned long node_start_pfn;
unsigned long node_present_pages; /* total number of physical pages */
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
mem_hotplug_begin/end() */
struct task_struct *mkswapd[MAX_KSWAPD_THREADS];
int kswapd_order;
enum zone_type kswapd_highest_zoneidx;
int kswapd_failures; /* Number of 'reclaimed == 0' runs */
ANDROID_OEM_DATA(1);
#ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_highest_zoneidx;
wait_queue_head_t kcompactd_wait;
struct task_struct *kcompactd;
bool proactive_compact_trigger;
#endif
/*
* This is a per-node reserve of pages that are not available
* to userspace allocations.
*/
unsigned long totalreserve_pages;
#ifdef CONFIG_NUMA
/*
* node reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
#endif /* CONFIG_NUMA */
/* Write-intensive fields used by page reclaim */
ZONE_PADDING(_pad1_)
spinlock_t lru_lock;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
* If memory initialisation on large machines is deferred then this
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct deferred_split deferred_split_queue;
#endif
/* Fields commonly accessed by the page reclaim scanner */
/*
* NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
*
* Use mem_cgroup_lruvec() to look up lruvecs.
*/
struct lruvec __lruvec;
unsigned long flags;
ZONE_PADDING(_pad2_)
/* Per-node vmstats */
struct per_cpu_nodestat __percpu *per_cpu_nodestats;
atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
} pg_data_t;
2.1 struct zone node_zones[MAX_NR_ZONES]
An array containing all of this node's zone structures; from it we know which zones the node has.
/*
* node_zones contains just the zones for THIS node. Not all of the
* zones may be populated, but it is the full list. It is referenced by
* this node's node_zonelists as well as other node's node_zonelists.
*/
struct zone node_zones[MAX_NR_ZONES];
2.2 struct zonelist node_zonelists[MAX_ZONELISTS]
This array covers all zones of all nodes. MAX_ZONELISTS is at least 1 and at most 2. From the code below, ZONELIST_FALLBACK always exists; under NUMA there is additionally ZONELIST_NOFALLBACK. If an allocation carries the __GFP_THISNODE flag, it is served from pg_data_t->node_zonelists[ZONELIST_NOFALLBACK], which forbids falling back to other nodes: memory may only come from this node. This ties in with NUMA_HIT and NUMA_MISS described in Linux 物理内存管理涉及的三大结构体之struct zone; of course, for phones and ordinary PCs there is no need to worry about NUMA and multiple nodes.
To cover every zone of every node, the kernel defines struct zonelist, which strings all of them together in priority order: the first entry is the target zone, and the remaining entries are ordered as the fallback mechanism would choose them, priority decreasing. _zonerefs has MAX_ZONES_PER_ZONELIST + 1 entries. The "closer" a neighbouring node is, the earlier its zones appear in the _zonerefs array; within one node, zones are ordered from high to low (ZONE_HIGHMEM, ZONE_NORMAL, ..., ZONE_DMA).
As for the ZONELIST_NOFALLBACK zonelist, it contains only this node's zones, again with the target zone first and the rest in the order fallback would pick them.
/* Maximum number of zones on a zonelist */
#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
enum {
ZONELIST_FALLBACK, /* zonelist with fallback */
#ifdef CONFIG_NUMA
/*
* The NUMA zonelists are doubled because we need zonelists that
* restrict the allocations to a single node for __GFP_THISNODE.
*/
ZONELIST_NOFALLBACK, /* zonelist without fallback (__GFP_THISNODE) */
#endif
MAX_ZONELISTS
};
/*
* This struct contains information about a zone in a zonelist. It is stored
* here to avoid dereferences into large structures and lookups of tables
*/
struct zoneref {
struct zone *zone; /* Pointer to actual zone */ //direct pointer to the corresponding zone
int zone_idx; /* zone_idx(zoneref->zone) */ //cached zone_idx() of that zone, i.e. its zone-type index, stored here to avoid dereferencing struct zone
};
/*
* One allocation request operates on a zonelist. A zonelist
* is a list of zones, the first one is the 'goal' of the
* allocation, the other zones are fallback zones, in decreasing
* priority.
*
* To speed the reading of the zonelist, the zonerefs contain the zone index
* of the entry being read. Helper functions to access information given
* a struct zoneref are
*
* zonelist_zone() - Return the struct zone * for an entry in _zonerefs
* zonelist_zone_idx() - Return the index of the zone for an entry
* zonelist_node_idx() - Return the index of the node for an entry
*/
struct zonelist {
//MAX_ZONES_PER_ZONELIST is defined above as MAX_NUMNODES * MAX_NR_ZONES, the largest possible zone count across all nodes. The first entry of the zonelist is the current target zone; the rest are fallback (backup) zones in decreasing priority
struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
};
/*
* node_zonelists contains references to all zones in all nodes.
* Generally the first zones will be references to this node's
* node_zones.
*/
struct zonelist node_zonelists[MAX_ZONELISTS];
This is one of the relatively important concepts in struct pglist_data.
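To make the walk order concrete, here is a minimal sketch of an allocation-time pass over a zonelist. Assumptions: pick_first_usable_zone() is an illustrative function invented here, while node_zonelist(), gfp_zone(), for_each_zone_zonelist() and populated_zone() are real kernel-5.4 helpers; the real allocator additionally checks watermarks, cpusets and so on:
//hedged sketch of an allocation-time zonelist walk
static struct zone *pick_first_usable_zone(int nid, gfp_t gfp_mask)
{
	/* __GFP_THISNODE selects ZONELIST_NOFALLBACK, otherwise ZONELIST_FALLBACK */
	struct zonelist *zonelist = node_zonelist(nid, gfp_mask);
	/* the highest zone type this allocation may use */
	enum zone_type highest_zoneidx = gfp_zone(gfp_mask);
	struct zoneref *z;
	struct zone *zone;

	/* walk the zonerefs in decreasing priority, skipping entries above highest_zoneidx */
	for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
		if (populated_zone(zone))
			return zone;
	}
	return NULL;
}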
2.3 int nr_zones
The number of populated zones in this node. Of ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE and ZONE_DEVICE, not all six necessarily exist; nr_zones is the node's actual zone count, consistent with the number of zones actually populated in node_zones[MAX_NR_ZONES]. The field exists for convenience, so the node's zone count can be read directly. Note that although node_zones has length MAX_NR_ZONES, that does not mean the node actually has MAX_NR_ZONES zones; in a multi-node NUMA system in particular, each node's zone layout may differ.
int nr_zones; /* number of populated zones in this node */
2.4 struct page *node_mem_map
Points to the mem_map array formed by all struct page of this node. As shown below, CONFIG_FLAT_NODE_MEM_MAP defaults to y whenever SPARSEMEM is not selected. The kernel uses struct pglist_data to manage a node of physical memory that it assumes to be contiguous, and since each node's memory is (assumed) contiguous, within a node the pages are organised with the FLATMEM flat memory model rather than SPARSEMEM; that is why CONFIG_FLAT_NODE_MEM_MAP is enabled by default. In reality, though, physical memory is not necessarily contiguous: there can be memory holes, which means struct page gets allocated even for holes. Such pages have no actual physical memory behind them, which is exactly why present_pages in Linux 物理内存管理涉及的三大结构体之struct zone excludes holes when counting, and node_present_pages below does the same.
For phones and PCs, physical memory essentially fits the FLATMEM model. But for servers or large-memory compute platforms with hot-pluggable physical memory, the flat model no longer works: from Linux 物理内存管理涉及的三大结构体之struct page we know each struct page occupies 64 bytes, so allocating struct page for large holes wastes a lot of memory. That is where the DISCONTIGMEM and SPARSEMEM models come in; they are not covered in detail here, see 《一步一图带你深入理解 Linux 物理内存管理》.
//mm/Kconfig
config FLAT_NODE_MEM_MAP
def_bool y
depends on !SPARSEMEM
//mm/Kconfig.debug
config PAGE_EXTENSION
bool "Extend memmap on extra space for more information on page"
help
Extend memmap on extra space for more information on page. This
could be used for debugging features that need to insert extra
field for every page. This extension enables us to save memory
by not allocating this extra memory according to boottime
configuration.
//include/linux/page_ext.h
/*
* Page Extension can be considered as an extended mem_map.
* A page_ext page is associated with every page descriptor. The
* page_ext helps us add more information about the page.
* All page_ext are allocated at boot or memory hotplug event,
* then the page_ext for pfn always exists.
*/
struct page_ext {
unsigned long flags;
};
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
#ifdef CONFIG_PAGE_EXTENSION
struct page_ext *node_page_ext;
#endif
#endif
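With FLATMEM, converting between a PFN and its struct page is plain pointer arithmetic on this array. The generic memory model defines it as follows, where the global mem_map points at the (single) node's node_mem_map and ARCH_PFN_OFFSET is the PFN of the first physical page:
//include/asm-generic/memory_model.h (FLATMEM case)
#define __pfn_to_page(pfn)	(mem_map + ((pfn) - ARCH_PFN_OFFSET))
#define __page_to_pfn(page)	((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)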
2.5 struct page_ext *node_page_ext
This acts as a supplement to struct page, attaching extra information about a physical page frame without modifying struct page itself. Before this page extension feature existed, recording more per-page-frame information meant changing struct page and recompiling the kernel, which becomes very difficult when third-party kernel modules depend on the layout; changing struct page can also cause unpredictable system behaviour. So page extension (mm/page_ext.c) adds a companion structure (struct page_ext) as a supplement to struct page. The function alloc_node_page_ext(int nid) allocates as many page_ext structures as there are pages and stores them in the node's node_page_ext member. In mem_section the corresponding member is named page_ext.
2.6 spinlock_t node_size_lock
Protects the five fields node_start_pfn, node_present_pages, node_spanned_pages, nr_zones and first_deferred_pfn, ensuring only one thread holds it at a time. The lock exists once CONFIG_MEMORY_HOTPLUG or CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled: with physical memory hot-plug, or with deferred struct page initialisation, the node's size can change, so these size-related fields need extra protection against data inconsistency between concurrent modification and reading. Normally all struct page are fully initialised early in boot, but on machines with very large memory that makes boot take too long, hence the DEFERRED_STRUCT_PAGE_INIT feature: early boot initialises only a subset of struct page for the early boot-time allocator, and the remaining physical memory has its struct page initialised in parallel later in the boot process.
//mm/Kconfig
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"
select MEMORY_ISOLATION
depends on SPARSEMEM || X86_64_ACPI_NUMA
depends on ARCH_ENABLE_MEMORY_HOTPLUG
depends on 64BIT || BROKEN
select NUMA_KEEP_MEMINFO if NUMA
config DEFERRED_STRUCT_PAGE_INIT
bool "Defer initialisation of struct pages to kthreads"
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
depends on 64BIT
select PADATA
help
Ordinarily all struct pages are initialised during early boot in a
single thread. On very large machines this can take a considerable
amount of time. If this option is set, large machines will bring up
a subset of memmap at boot and then initialise the rest in parallel.
This has a potential performance impact on tasks running early in the
lifetime of the system until these kthreads finish the
initialisation.
#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
/*
* Must be held any time you expect node_start_pfn,
* node_present_pages, node_spanned_pages or nr_zones to stay constant.
* Also synchronizes pgdat->first_deferred_pfn during deferred page
* init.
*
* pgdat_resize_lock() and pgdat_resize_unlock() are provided to
* manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
* or CONFIG_DEFERRED_STRUCT_PAGE_INIT.
*
* Nests above zone->lock and zone->span_seqlock
*/
spinlock_t node_size_lock;
#endif
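The pgdat_resize_lock()/pgdat_resize_unlock() helpers named in the comment are thin wrappers; include/linux/memory_hotplug.h defines them as below (plus no-op versions when neither config option is set):
//include/linux/memory_hotplug.h
static inline
void pgdat_resize_lock(struct pglist_data *pgdat, unsigned long *flags)
{
	spin_lock_irqsave(&pgdat->node_size_lock, *flags);
}
static inline
void pgdat_resize_unlock(struct pglist_data *pgdat, unsigned long *flags)
{
	spin_unlock_irqrestore(&pgdat->node_size_lock, *flags);
}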
2.7 unsigned long node_start_pfn
The node's starting physical page frame number, similar in role to zone_start_pfn in struct zone: node_start_pfn == node_start_paddr >> PAGE_SHIFT. Before kernel-2.6 the field was node_start_paddr (the node's starting physical address); it was then replaced by node_start_pfn.
unsigned long node_start_pfn;
2.8 unsigned long node_present_pages
The number of physical pages actually present in the node, with memory holes excluded. On architectures without holes, node_present_pages equals node_spanned_pages. It is computed as: node_present_pages = node_spanned_pages - absent_pages (pages in holes).
unsigned long node_present_pages; /* total number of physical pages */
2.9 unsigned long node_spanned_pages
See spanned_pages in struct zone. This covers every page in the node's physical address range, holes included (on some architectures there are holes with no physical pages behind them). It is computed as: node_spanned_pages = node_end_pfn - node_start_pfn, so node_start_pfn + node_spanned_pages gives the node's ending page frame number.
unsigned long node_spanned_pages; /* total size of physical page range, including holes*/
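The kernel wraps that last computation in a small helper:
//include/linux/mmzone.h
static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
{
	return pgdat->node_start_pfn + pgdat->node_spanned_pages;
}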
2.10 int node_id
The node's ID, numbered from 0. If the whole NUMA system has only one node, it is simply 0.
int node_id;
2.11 wait_queue_head_t kswapd_wait;
The kernel creates, for every NUMA node (UMA being the special case of NUMA with one node), a kswapd thread that reclaims infrequently used pages, or reclaims memory when it runs short, and a kcompactd thread that compacts memory to avoid fragmentation. The following members relate to kswapd and kcompactd.
kswapd_wait is the wait queue for kswapd, holding the kswapd thread; it is initialised in free_area_init_core and implements kswapd's sleeping and waking. When kswapd needs to sleep, it calls kswapd_try_to_sleep() to try to enter the sleep state, hanging itself on kswapd_wait until it is next woken.
So kswapd_wait is a wait queue with three uses: first, it gives kswapd somewhere to enqueue when it needs to sleep; second, it is how kswapd gets woken; third, when trying to wake kswapd, it shows whether kswapd is already running: if it is, the queue is found empty and the wake-up can be skipped, which also effectively prevents missed kswapd wake-ups. A simplified extract of that check follows the definition below.
//include/linux/wait.h
struct wait_queue_head {
spinlock_t lock;
struct list_head head;
};
typedef struct wait_queue_head wait_queue_head_t;
wait_queue_head_t kswapd_wait;
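The third point is visible in the wake-up path; a simplified extract from wakeup_kswapd() in mm/vmscan.c:
//simplified extract from wakeup_kswapd(), mm/vmscan.c
	/* queue empty: kswapd is already awake and working, nothing to do */
	if (!waitqueue_active(&pgdat->kswapd_wait))
		return;
	......
	wake_up_interruptible(&pgdat->kswapd_wait);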
2.12 wait_queue_head_t pfmemalloc_wait;
A wait queue of threads waiting for direct reclaim to finish: it holds threads that are letting kswapd do the direct reclaim work on their behalf. When, after kswapd has reclaimed, the node's free pages meet the requirement, kswapd wakes the threads on pfmemalloc_wait just before going to sleep, and those threads proceed straight to allocation, having skipped doing direct reclaim themselves.
kswapd reclaims from every unbalanced zone in the node. Per the watermark diagram in struct zone, once every zone satisfies "pages remaining after this allocation > the zone's _watermark[WMARK_HIGH] + the zone's pages reserved for other uses", kswapd stops reclaiming, goes to sleep, and wakes the threads on the pfmemalloc_wait queue.
How does a thread requesting memory decide between joining pfmemalloc_wait and doing direct reclaim itself? Essentially by checking whether the node is balanced: if it is balanced, the thread does direct reclaim itself; otherwise it joins the wait queue and leaves the reclaim to kswapd. The call chain for the check is try_to_free_pages() -> throttle_direct_reclaim() -> allow_direct_reclaim(); allow_direct_reclaim() returning true means the node is balanced and kswapd is not woken for the reclaim. The detailed flow is not expanded here, but a sketch of the balance test follows the declaration below.
wait_queue_head_t pfmemalloc_wait;
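As promised above, a hedged sketch of the balance test inside allow_direct_reclaim() (mm/vmscan.c): the node counts as balanced when its free pages exceed half of the summed min watermarks of the lower zones. node_balanced_sketch() is an illustrative name; the helpers it calls are real kernel-5.4 ones, and the real function additionally considers reclaimable pages and wakes kswapd when the test fails:
//hedged sketch of the core test in allow_direct_reclaim()
static bool node_balanced_sketch(pg_data_t *pgdat)
{
	unsigned long free_pages = 0, pfmemalloc_reserve = 0;
	int i;

	for (i = 0; i <= ZONE_NORMAL; i++) {
		struct zone *zone = &pgdat->node_zones[i];

		if (!populated_zone(zone))
			continue;
		free_pages += zone_page_state(zone, NR_FREE_PAGES);
		pfmemalloc_reserve += min_wmark_pages(zone);
	}
	/* nothing usable at all: report balanced so callers are not throttled */
	if (!pfmemalloc_reserve)
		return true;
	return free_pages > pfmemalloc_reserve / 2;
}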
2.13 struct task_struct *kswapd
A pointer to the task_struct of the kswapd thread; it points at kswapd's main thread.
struct task_struct *kswapd; /* Protected by mem_hotplug_begin/end() */
2.14 struct task_struct *mkswapd[MAX_KSWAPD_THREADS]
An array of pointers to the kswapd main thread and its helper threads; at most 16 kswapd threads are allowed. Normally 16 are not created up front: helper threads are spawned according to reclaim efficiency. This field is absent from the upstream kernel source; it is a feature added by phone SoC vendors/Google.
//include/linux/mmzone.h
#define MAX_KSWAPD_THREADS 16
struct task_struct *mkswapd[MAX_KSWAPD_THREADS];
2.15 int kswapd_order
The unit kswapd reclaims at (2^kswapd_order pages). It is required to be no smaller than the order of the allocation that triggered the wake-up; otherwise it is updated to that allocation's order.
int kswapd_order;
2.16 enum zone_type kswapd_highest_zoneidx
The highest zone index kswapd may reclaim from when working on this node. As we know, allocation proceeds from higher zones down towards lower zones; otherwise there would be no need for the lowmem_reserve array in struct zone or the fallback mechanism. So when kswapd is woken to reclaim, kswapd_highest_zoneidx must first be updated and confirmed: if kswapd were to reclaim in zones above kswapd_highest_zoneidx, the memory freed would be unusable by the thread that needs the allocation, and the work would be wasted. The extract after the next field shows how the wake-up path refreshes this field together with kswapd_order.
enum zone_type kswapd_highest_zoneidx;
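As referenced above, a lightly abridged extract showing how wakeup_kswapd() (mm/vmscan.c, as in this 5.4 Android tree that carries the kswapd_highest_zoneidx naming) refreshes kswapd_highest_zoneidx together with kswapd_order before waking kswapd:
//lightly abridged from wakeup_kswapd(), mm/vmscan.c
	curr_idx = READ_ONCE(pgdat->kswapd_highest_zoneidx);
	if (curr_idx == MAX_NR_ZONES || curr_idx < highest_zoneidx)
		WRITE_ONCE(pgdat->kswapd_highest_zoneidx, highest_zoneidx);

	if (READ_ONCE(pgdat->kswapd_order) < order)
		WRITE_ONCE(pgdat->kswapd_order, order);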
2.17 int kswapd_failures
The number of times kswapd reclaim failed, where a failure means kswapd reclaimed no page frames. Each failure increments kswapd_failures; the maximum failure count is 16. Once it reaches MAX_RECLAIM_RETRIES, kswapd is no longer woken to reclaim; instead, threads needing memory must do direct reclaim themselves, or the node's kcompactd thread is woken to compact memory, defragmenting it to obtain free contiguous physical memory. Each successful kswapd reclaim resets kswapd_failures to 0, see the shrink_node function.
Reclaim is kicked off in balance_pgdat -> kswapd_shrink_node, which updates sc.nr_reclaimed; if after the update nothing was actually reclaimed, pgdat->kswapd_failures++ follows. Functions such as prepare_kswapd_sleep and wakeup_kswapd check whether it has exceeded MAX_RECLAIM_RETRIES.
static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
{
int i;
unsigned long nr_soft_reclaimed;
unsigned long nr_soft_scanned;
unsigned long pflags;
unsigned long nr_boost_reclaim;
unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
bool boosted;
struct zone *zone;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.order = order,
.may_unmap = 1,
};
set_task_reclaim_state(current, &sc.reclaim_state);
psi_memstall_enter(&pflags);
__fs_reclaim_acquire();
restart:
sc.priority = DEF_PRIORITY;
do {
unsigned long nr_reclaimed = sc.nr_reclaimed;
bool raise_priority = true;
bool balanced;
bool ret;
......
/* Call soft limit reclaim before calling shrink_node. */
sc.nr_scanned = 0;
nr_soft_scanned = 0;
nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order,
sc.gfp_mask, &nr_soft_scanned);
sc.nr_reclaimed += nr_soft_reclaimed;
......
/*
* There should be no need to raise the scanning priority if
* enough pages are already being scanned that that high
* watermark would be met at 100% efficiency.
*/
if (kswapd_shrink_node(pgdat, &sc)) /* kswapd's reclaimed page frames are accumulated here */
raise_priority = false;
......
/*
* Raise priority if scanning rate is too low or there was no
* progress in reclaiming pages
*/
nr_reclaimed = sc.nr_reclaimed - nr_reclaimed; /* subtract the starting value to get what this round actually reclaimed */
nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed);
if (nr_boost_reclaim && !nr_reclaimed)
break;
if (raise_priority || !nr_reclaimed)
sc.priority--;
} while (sc.priority >= 1);
if (!sc.nr_reclaimed) /* here kswapd_failures gets incremented */
pgdat->kswapd_failures++;
....
}
//mm/internal.h
/*
* Maximum number of reclaim retries without progress before the OOM
* killer is consider the only way forward.
*/
#define MAX_RECLAIM_RETRIES 16
int kswapd_failures; /* Number of 'reclaimed == 0' runs */
2.18 ANDROID_OEM_DATA(1)
This adds an OEM-related field to struct pglist_data, reserving some space for potential future use. Google added it to the Android kernel as part of GKI; it does not exist in upstream Linux, and even on an Android kernel it only takes effect when CONFIG_ANDROID_VENDOR_OEM_DATA is enabled.
//drivers/android/Kconfig
config ANDROID_VENDOR_OEM_DATA
bool "Android vendor and OEM data padding"
default y
help
This option enables the padding that the Android GKI kernel adds
to many different kernel structures to support an in-kernel stable ABI
over the lifespan of support for the kernel as well as OEM additional
fields that are needed by some of the Android kernel tracepoints. The
macros enabled by this option are used to enable padding in vendor modules
used for the above specified purposes.
Only disable this option if you have a system that needs the Android
kernel drivers, but is NOT an Android GKI kernel image and you do NOT
use the Android kernel tracepoints. If disabled it has the possibility
to make the kernel static and runtime image slightly smaller but will
NOT be supported by the Google Android kernel team.
If even slightly unsure, say Y.
//include/linux/android_vendor.h
/*
* ANDROID_VENDOR_DATA
* Reserve some "padding" in a structure for potential future use.
* This normally placed at the end of a structure.
* number: the "number" of the padding variable in the structure. Start with
* 1 and go up.
*
* ANDROID_VENDOR_DATA_ARRAY
* Same as ANDROID_VENDOR_DATA but allocates an array of u64 with
* the specified size
*/
#ifdef CONFIG_ANDROID_VENDOR_OEM_DATA
#define ANDROID_OEM_DATA(n) u64 android_oem_data##n
#else
#define ANDROID_OEM_DATA(n)
#endif
ANDROID_OEM_DATA(1);
2.19 Memory compaction parameters
kcompactd_max_order is the maximum order kcompactd compacts for. In the main compaction function, static int kcompactd(void *p), it is first initialised to 0 and later updated according to the requested order; from the code flow, its maximum can only be MAX_ORDER-1. Worth introducing alongside it is PAGE_ALLOC_COSTLY_ORDER, default 3: an allocation of at most 2^3 pages is usually easy to satisfy, while more than 8 pages counts as a "costly" operation. It effectively reminds developers not to request more than 8 contiguous page frames at a time.
kcompactd_highest_zoneidx is the upper bound on the zones compaction walks within this node; the main compaction function kcompactd(void *p) first initialises it to pglist_data->nr_zones - 1.
kcompactd_wait is the wait queue kcompactd uses: like kswapd on kswapd_wait, the kcompactd thread sleeps on it until woken to do compaction work.
kcompactd points to the struct task_struct of the kcompactd main thread the kernel creates for each NUMA node. The structure holds all of the compaction daemon's information, such as its PID, state and priority. Unlike kswapd above, there appear to be no extra helper threads here, just the single main thread.
proactive_compact_trigger records whether compaction was proactively enabled by the user, as opposed to the passive case where fragmentation leaves a contiguous allocation unsatisfiable and the kernel is notified to compact. For applications with large memory demands, proactive compaction prepares memory ahead of time so their needs are met promptly and they load quickly. The patch was proposed in 2021. It is governed by the sysctl sysctl_compaction_proactiveness shown below, default 20, a compaction proactiveness ratio; users can set it to tune when proactive compaction triggers. In kcompactd -> should_proactive_compact_node, this ratio is combined with the node's computed fragmentation level, and once the threshold is met compaction starts proactively and proactive_compact_trigger is set to true. A sketch of that decision follows the field declarations below.
//mm/compaction.c
/*
* Tunable for proactive compaction. It determines how
* aggressively the kernel should compact memory in the
* background. It takes values in the range [0, 100].
*/
unsigned int __read_mostly sysctl_compaction_proactiveness = 20;
#ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_highest_zoneidx;
wait_queue_head_t kcompactd_wait;
struct task_struct *kcompactd;
bool proactive_compact_trigger;
#endif
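A hedged sketch of that decision, modelled on should_proactive_compact_node() from the proactive compaction patch. The helpers fragmentation_score_node() (scores the node's current fragmentation), fragmentation_score_wmark() (derives a threshold from sysctl_compaction_proactiveness) and kswapd_is_running() come from that patch, and are assumptions here insofar as this tree may differ:
//hedged sketch modelled on should_proactive_compact_node(), mm/compaction.c
static bool should_proactive_compact_node(pg_data_t *pgdat)
{
	int wmark_high;

	/* never compact proactively while kswapd is busy or the knob is 0 */
	if (!sysctl_compaction_proactiveness || kswapd_is_running(pgdat))
		return false;

	/* compact once the node's fragmentation score exceeds the watermark */
	wmark_high = fragmentation_score_wmark(pgdat, false);
	return fragmentation_score_node(pgdat) > wmark_high;
}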
2.20 unsigned long totalreserve_pages
The per-node reserve of pages that are not available for userspace allocations. The function that computes it, calculate_totalreserve_pages, is covered in section 2.4.2 of Linux 物理内存管理涉及的三大结构体之struct zone (the pgdat->totalreserve_pages part there); in effect it sums, over all the node's zones, the high watermark plus the lowmem_reserve value. A hedged sketch follows the declaration below.
/*
* This is a per-node reserve of pages that are not available
* to userspace allocations.
*/
unsigned long totalreserve_pages;
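As promised, a hedged sketch of calculate_totalreserve_pages() (mm/page_alloc.c): each zone contributes its high watermark plus the largest entry of its lowmem_reserve array, capped at its managed pages. calculate_totalreserve_pages_sketch() is an illustrative name; the helpers are real kernel-5.4 ones:
//hedged sketch of calculate_totalreserve_pages(), mm/page_alloc.c
static void calculate_totalreserve_pages_sketch(void)
{
	struct pglist_data *pgdat;
	enum zone_type i, j;

	for_each_online_pgdat(pgdat) {
		pgdat->totalreserve_pages = 0;
		for (i = 0; i < MAX_NR_ZONES; i++) {
			struct zone *zone = pgdat->node_zones + i;
			long max = 0;

			/* the largest amount this zone holds back for higher zones */
			for (j = i; j < MAX_NR_ZONES; j++) {
				if (zone->lowmem_reserve[j] > max)
					max = zone->lowmem_reserve[j];
			}
			/* plus the zone's high watermark, capped at managed pages */
			max += high_wmark_pages(zone);
			if (max > zone_managed_pages(zone))
				max = zone_managed_pages(zone);
			pgdat->totalreserve_pages += max;
		}
	}
}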
2.21 Node-reclaim parameters under NUMA
min_unmapped_pages is the minimum number of unmapped file-backed pages that cannot be reclaimed; it is only defined when CONFIG_NUMA is enabled. From node_reclaim and __node_reclaim: when the node's unmapped file-backed pages computed by node_pagecache_reclaimable(pgdat) do not exceed min_unmapped_pages, and reclaimable slab does not exceed min_slab_pages, unmapped file-backed pages are not reclaimed; otherwise the core reclaim function shrink_node is called. Below, setup_min_unmapped_ratio initialises its size and is called from init_per_zone_wmark_min. min_unmapped_pages is governed by sysctl_min_unmapped_ratio; a user with root privileges can change the sysctl via echo XX > /proc/sys/vm/min_unmapped_ratio.
min_slab_pages is the minimum number of slab pages that cannot be reclaimed, likewise only defined with CONFIG_NUMA, and works like min_unmapped_pages. From node_reclaim: if node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) <= pgdat->min_slab_pages, no reclaim happens and NODE_RECLAIM_FULL is returned, mainly to protect I/O performance; below the threshold, there is no reclaim. Below, setup_min_slab_ratio initialises its size and is called from init_per_zone_wmark_min. min_slab_pages is governed by sysctl_min_slab_ratio; a user with root privileges can change the sysctl via echo XX > /proc/sys/vm/min_slab_ratio.
//mm/page_alloc.c
#ifdef CONFIG_NUMA
//the following function sets the value of min_unmapped_pages
static void setup_min_unmapped_ratio(void)
{
pg_data_t *pgdat;
struct zone *zone;
for_each_online_pgdat(pgdat)
pgdat->min_unmapped_pages = 0;
for_each_zone(zone)
//walk all zones of this node, accumulating the size of min_unmapped_pages
zone->zone_pgdat->min_unmapped_pages += (zone_managed_pages(zone) *
sysctl_min_unmapped_ratio) / 100;
}
//the following function sets the value of min_slab_pages
static void setup_min_slab_ratio(void)
{
pg_data_t *pgdat;
struct zone *zone;
for_each_online_pgdat(pgdat)
pgdat->min_slab_pages = 0;
for_each_zone(zone)
zone->zone_pgdat->min_slab_pages += (zone_managed_pages(zone) *
sysctl_min_slab_ratio) / 100;
}
//init_per_zone_wmark_min calls setup_min_unmapped_ratio and setup_min_slab_ratio
int __meminit init_per_zone_wmark_min(void)
{
unsigned long lowmem_kbytes;
int new_min_free_kbytes;
......
#ifdef CONFIG_NUMA
//here
setup_min_unmapped_ratio();
setup_min_slab_ratio();
#endif
khugepaged_min_free_kbytes_update();
return 0;
}
#ifdef CONFIG_NUMA
/*
* node reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
#endif /* CONFIG_NUMA */
//on a Linux server, these are set to 1 and 5 respectively
User@Ubuntu-149-19:/proc/sys/vm$ cat min_unmapped_ratio
1
User@Ubuntu-149-19:/proc/sys/vm$ cat min_slab_ratio
5
2.22 ZONE_PADDING(_pad1_)
Already covered in Linux 物理内存管理涉及的三大结构体之struct zone, so not expanded here. Its main job is to put the members before and after it on different CPU cache lines so that each side owns its cache line, trading space for time to improve access performance.
2.23 spinlock_t lru_lock
Before kernel-4.8, all LRU lists were managed at zone granularity: each zone had five LRUs (LRU_INACTIVE_ANON, LRU_ACTIVE_ANON, LRU_INACTIVE_FILE, LRU_ACTIVE_FILE and LRU_UNEVICTABLE) synchronised through zone->lru_lock. From kernel-4.8 onwards, all LRUs are kept per node and synchronised through pgdat->lru_lock. Managed at node level, the pages of every zone age at the same rate (a process may have been granted memory from different zones; at reclaim time, the memory it got from all zones can be reclaimed over the same window, keeping the ageing of the LRUs consistent). The earlier per-zone management was also a product of 32-bit systems' limited addressable memory; mainstream systems today are 64-bit and can address all physical memory directly.
Managing LRUs through pgdat->lru_lock solves the uneven ageing across zones, but contention on pgdat->lru_lock is still fierce: it is the spinlock protecting concurrent access to the node's LRU lists. A hedged sketch of the typical locking pattern follows the declaration below.
spinlock_t lru_lock;
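A hedged sketch of the kernel-5.4-era locking pattern around the node LRU (compare isolate_lru_page() and friends in mm/vmscan.c); move_page_to_inactive_sketch() is an illustrative name, the helpers it calls are real:
//hedged sketch of the 5.4-era node LRU locking pattern
static void move_page_to_inactive_sketch(struct page *page)
{
	pg_data_t *pgdat = page_pgdat(page);
	struct lruvec *lruvec;

	spin_lock_irq(&pgdat->lru_lock);
	/* resolve the lruvec (the node's own __lruvec unless memcg is active) */
	lruvec = mem_cgroup_page_lruvec(page, pgdat);
	if (PageLRU(page)) {
		/* any movement between LRU lists happens under lru_lock */
		del_page_from_lru_list(page, lruvec, page_lru(page));
		ClearPageActive(page);
		add_page_to_lru_list(page, lruvec, page_lru_base_type(page));
	}
	spin_unlock_irq(&pgdat->lru_lock);
}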
2.24 unsigned long first_deferred_pfn
first_deferred_pfn is the first PFN of the physical memory range whose initialisation was deferred; late in boot, the previously deferred memory is initialised in parallel starting from this PFN. The member is also protected by node_size_lock, and only exists when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
* If memory initialisation on large machines is deferred then this
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
2.25 struct deferred_split deferred_split_queue
The per-node queue of transparent huge pages whose splitting has been deferred: partially unmapped THPs are queued here and split later, under memory pressure, rather than immediately. It only exists when CONFIG_TRANSPARENT_HUGEPAGE is enabled.
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct deferred_split deferred_split_queue;
#endif
2.26 struct lruvec __lruvec
Holds and maintains the node's LRU lists and related parameters. __lruvec is not accessed directly; lruvecs are generally looked up through mem_cgroup_lruvec. struct lruvec records the state of the five LRU lists.
//include/linux/mmzone.h
struct lruvec {
struct list_head lists[NR_LRU_LISTS];
/*
* These track the cost of reclaiming one LRU - file or anon -
* over the other. As the observed cost of reclaiming one LRU
* increases, the reclaim scan balance tips toward the other.
*/
unsigned long anon_cost;
unsigned long file_cost;
/* Non-resident age, driven by LRU movement */
atomic_long_t nonresident_age;
/* Refaults at the time of last reclaim cycle */
unsigned long refaults[ANON_AND_FILE];
/* Various lruvec state flags (enum lruvec_flags) */
unsigned long flags;
#ifdef CONFIG_MEMCG
struct pglist_data *pgdat;
#endif
};
/* Fields commonly accessed by the page reclaim scanner */
/*
* NOTE: THIS IS UNUSED IF MEMCG IS ENABLED.
*
* Use mem_cgroup_lruvec() to look up lruvecs.
*/
struct lruvec __lruvec;
2.27 unsigned long flags
Per-node flags controlling reclaim behaviour. PGDAT_DIRTY means reclaim scanning has recently found many dirty file pages at the tail of the node's LRU lists, and page-out will subsequently write those dirty file pages back; note that dirty pages are file-backed pages, not anon pages, as discussed in linux kernel内存管理之/proc/meminfo下参数介绍. PGDAT_WRITEBACK means reclaim scanning has recently found many pages of this node under writeback (contents being written back to disk). PGDAT_RECLAIM_LOCKED is set while node reclaim is in progress, preventing concurrent reclaim on the same node.
enum pgdat_flags {
PGDAT_DIRTY, /* reclaim scanning has recently found
* many dirty file pages at the tail
* of the LRU.
*/
PGDAT_WRITEBACK, /* reclaim scanning has recently found
* many pages under writeback
*/
PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */
};
unsigned long flags;
2.28 ZONE_PADDING(_pad2_)
Also covered in Linux 物理内存管理涉及的三大结构体之struct zone, so not expanded here. Its main job is to put the surrounding members on different CPU cache lines so each group owns its line, trading space for time. typedef struct pglist_data contains two ZONE_PADDINGs in all, splitting the structure into three parts that land on different CPU cache lines.
ZONE_PADDING(_pad2_)
2.29 struct per_cpu_nodestat __percpu *per_cpu_nodestats
Somewhat different in role from struct per_cpu_pageset __percpu *pageset in struct zone: no PCP technique is involved here. As struct per_cpu_nodestat shows, per_cpu_nodestats holds each CPU's vm-stat state for this node: stat_threshold, the threshold used to decide when per-CPU vm-stat updates are propagated, and vm_node_stat_diff, each CPU's vm-stat deltas for this node. As with the zone counterpart, init_per_zone_wmark_min calls refresh_zone_stat_thresholds to update the stat_threshold values.
struct per_cpu_nodestat {
s8 stat_threshold;
s8 vm_node_stat_diff[NR_VM_NODE_STAT_ITEMS];
};
/* Per-node vmstats */
struct per_cpu_nodestat __percpu *per_cpu_nodestats;
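Updates first land in the per-CPU delta and are folded into the node-wide counters (the vm_stat array of the next section) only once they exceed stat_threshold; from mm/vmstat.c in kernel-5.4:
//mm/vmstat.c
void __mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
				long delta)
{
	struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
	s8 __percpu *p = pcp->vm_node_stat_diff + item;
	long x;
	long t;

	x = delta + __this_cpu_read(*p);
	t = __this_cpu_read(pcp->stat_threshold);

	if (unlikely(x > t || x < -t)) {
		node_page_state_add(x, pgdat, item); /* fold into pgdat->vm_stat */
		x = 0;
	}
	__this_cpu_write(*p, x);
}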
2.30 atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS]
Keeps the statistics on this node's memory usage. Part of what cat /proc/zoneinfo or cat /proc/vmstat shows comes from here.
enum node_stat_item {
NR_LRU_BASE,
NR_INACTIVE_ANON = NR_LRU_BASE, /* must match order of LRU_[IN]ACTIVE */
NR_ACTIVE_ANON, /* " " " " " */
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
NR_UNEVICTABLE, /* " " " " " */
NR_SLAB_RECLAIMABLE_B,
NR_SLAB_UNRECLAIMABLE_B,
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
WORKINGSET_NODES,
WORKINGSET_REFAULT_BASE,
WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE,
WORKINGSET_REFAULT_FILE,
WORKINGSET_ACTIVATE_BASE,
WORKINGSET_ACTIVATE_ANON = WORKINGSET_ACTIVATE_BASE,
WORKINGSET_ACTIVATE_FILE,
WORKINGSET_RESTORE_BASE,
WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE,
WORKINGSET_RESTORE_FILE,
WORKINGSET_NODERECLAIM,
NR_ANON_MAPPED, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
only modified from process context */
NR_FILE_PAGES,
NR_FILE_DIRTY,
NR_WRITEBACK,
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_SHMEM_THPS,
NR_SHMEM_PMDMAPPED,
NR_FILE_THPS,
NR_FILE_PMDMAPPED,
NR_ANON_THPS,
NR_VMSCAN_WRITE,
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
NR_KERNEL_STACK_KB, /* measured in KiB */
#if IS_ENABLED(CONFIG_SHADOW_CALL_STACK)
NR_KERNEL_SCS_KB, /* measured in KiB */
#endif
NR_VM_NODE_STAT_ITEMS
};
/* Per-node vmstats */
atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
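Reading a counter back is an atomic read of this array, clamped at zero because in-flight per-CPU deltas can leave the global value transiently negative; from mm/vmstat.c in kernel-5.4:
//mm/vmstat.c
unsigned long node_page_state(struct pglist_data *pgdat,
				enum node_stat_item item)
{
	long x = atomic_long_read(&pgdat->vm_stat[item]);
#ifdef CONFIG_SMP
	if (x < 0)
		x = 0;
#endif
	return x;
}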
Below is the per-node memory statistics output of cat /proc/zoneinfo on a Linux server and on a phone. The server runs kernel-4.10, so some fields differ from the kernel-5.4 discussed here; the phone runs kernel-5.4 and matches the content above.
NUMA, Linux server
Node 0, zone DMA
per-node stats
nr_inactive_anon 87787
nr_active_anon 131343
nr_inactive_file 4940567
nr_active_file 3924819
nr_unevictable 151
nr_isolated_anon 0
nr_isolated_file 0
nr_pages_scanned 0
workingset_refault 2876638041
workingset_activate 1141762621
workingset_nodereclaim 16712104
nr_anon_pages 104356
nr_mapped 22473
nr_file_pages 8981490
nr_dirty 12
nr_writeback 0
nr_writeback_temp 0
nr_shmem 110525
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_anon_transparent_hugepages 104
nr_unstable 0
nr_vmscan_write 2265499755
nr_vmscan_immediate_reclaim 33161710
nr_dirtied 34372672210
nr_written 31767870833
......
Node 1, zone Normal
per-node stats
nr_inactive_anon 36150
nr_active_anon 42170
nr_inactive_file 5507744
nr_active_file 4180727
nr_unevictable 764
nr_isolated_anon 0
nr_isolated_file 0
nr_pages_scanned 0
workingset_refault 2449337831
workingset_activate 1030814580
workingset_nodereclaim 15312425
nr_anon_pages 4966
nr_mapped 11985
nr_file_pages 9762999
nr_dirty 8
nr_writeback 0
nr_writeback_temp 0
nr_shmem 71300
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_anon_transparent_hugepages 2
nr_unstable 0
nr_vmscan_write 3761793628
nr_vmscan_immediate_reclaim 31291473
nr_dirtied 36341879167
nr_written 33134228904
......
UMA, phone
Node 0, zone Normal
per-node stats
nr_inactive_anon 139310
nr_active_anon 281741
nr_inactive_file 1572082
nr_active_file 338922
nr_unevictable 44830
nr_slab_reclaimable 43829
nr_slab_unreclaimable 79209
nr_isolated_anon 0
nr_isolated_file 0
workingset_nodes 28110
workingset_refault 120434
workingset_activate 36754
workingset_restore 109
workingset_nodereclaim 0
nr_anon_pages 418889
nr_mapped 273103
nr_file_pages 1958272
nr_dirty 72
nr_writeback 0
nr_writeback_temp 0
nr_shmem 2472
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_file_hugepages 0
nr_file_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_unstable 0
nr_vmscan_write 7422
nr_vmscan_immediate_reclaim 45
nr_dirtied 318705
nr_written 323852
nr_kernel_misc_reclaimable 64483
nr_unreclaimable_pages 58528
......
3. Conclusion
To sum up, this article introduced typedef struct pglist_data, the topmost of the kernel's three physical memory management structures, and walked through its members one by one. The zone-related fields (node_zones, node_zonelists, nr_zones), the node PFN fields (node_start_pfn, node_present_pages, node_spanned_pages), the kswapd fields, the compaction fields, the LRU fields (lru_lock, __lruvec) and the node statistics array vm_stat are used heavily in the kernel and are the relatively important ones.
References
一步一图带你深入理解 Linux 物理内存管理
Linux内核那些事之kswapd
【Linux内核】什么是kswapd?
[内核内存] [arm64] 内存回收2---快速内存回收和直接内存回收
Physical Memory — The Linux Kernel documentation
[内核内存] [arm64] 内存规整1---memory-compaction详解
pg_data_t、struct zonelist、struct zoneref和struct zone浅析
Linux Kernel Documentation