kernel-hacking-2024-linux-s.../mm
Mel Gorman 8479eba778 mm: numa: quickly fail allocations for NUMA balancing on full nodes
Commit 4167e9b2cf ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
flag combination due to confusing semantics.  It noted that
alloc_misplaced_dst_page() was one such user after changes made by
commit e97ca8e5b8 ("mm: fix GFP_THISNODE callers and clarify").

Unfortunately when GFP_THISNODE was removed, users of
alloc_misplaced_dst_page() started waking kswapd and entering direct
reclaim because the wrong GFP flags are cleared.  The consequence is
that workloads that used to fit into memory now get reclaimed which is
addressed by this patch.

The problem can be demonstrated with "mutilate" that exercises memcached
which is software dedicated to memory object caching.  The configuration
uses 80% of memory and is run 3 times for varying numbers of clients.
The results on a 4-socket NUMA box are

mutilate
                            4.4.0                 4.4.0
                          vanilla           numaswap-v1
Hmean    1      8394.71 (  0.00%)     8395.32 (  0.01%)
Hmean    4     30024.62 (  0.00%)    34513.54 ( 14.95%)
Hmean    7     32821.08 (  0.00%)    70542.96 (114.93%)
Hmean    12    55229.67 (  0.00%)    93866.34 ( 69.96%)
Hmean    21    39438.96 (  0.00%)    85749.21 (117.42%)
Hmean    30    37796.10 (  0.00%)    50231.49 ( 32.90%)
Hmean    47    18070.91 (  0.00%)    38530.13 (113.22%)

The metric is queries/second with the more the better.  The results are
way outside of the noise and the reason for the improvement is obvious
from some of the vmstats

                                 4.4.0       4.4.0
                               vanillanumaswap-v1r1
Minor Faults                1929399272  2146148218
Major Faults                  19746529        3567
Swap Ins                      57307366        9913
Swap Outs                     50623229       17094
Allocation stalls                35909         443
DMA allocs                           0           0
DMA32 allocs                  72976349   170567396
Normal allocs               5306640898  5310651252
Movable allocs                       0           0
Direct pages scanned         404130893      799577
Kswapd pages scanned         160230174           0
Kswapd pages reclaimed        55928786           0
Direct pages reclaimed         1843936       41921
Page writes file                  2391           0
Page writes anon              50623229       17094

The vanilla kernel is swapping like crazy with large amounts of direct
reclaim and kswapd activity.  The figures are aggregate but it's known
that the bad activity is throughout the entire test.

Note that simple streaming anon/file memory consumers also see this
problem but it's not as obvious.  In those cases, kswapd is awake when
it should not be.

As there are at least two reclaim-related bugs out there, it's worth
spelling out the user-visible impact.  This patch only addresses bugs
related to excessive reclaim on NUMA hardware when the working set is
larger than a NUMA node.  There is a bug related to high kswapd CPU
usage but the reports are against laptops and other UMA hardware and is
not addressed by this patch.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27 10:28:52 -08:00
..
kasan UBSAN: run-time undefined behavior sanity checker 2016-01-20 17:09:18 -08:00
backing-dev.c mm/backing-dev.c: fix error path in wb_init() 2016-02-11 18:35:48 -08:00
balloon_compaction.c virtio_balloon: fix race between migration and ballooning 2016-01-12 20:47:06 +02:00
bootmem.c x86/mm: Introduce max_possible_pfn 2015-12-06 12:46:31 +01:00
cleancache.c cleancache: constify cleancache_ops structure 2016-01-27 09:09:57 -05:00
cma.c
cma.h
cma_debug.c
compaction.c mm/compaction.c: __compact_pgdat() code cleanuup 2016-01-14 16:00:49 -08:00
debug-pagealloc.c
debug.c mm: rework mapcount accounting to enable 4k mapping of THPs 2016-01-15 17:56:32 -08:00
dmapool.c
early_ioremap.c
fadvise.c
failslab.c
filemap.c mm: fix filemap.c kernel doc warning 2016-02-11 18:35:48 -08:00
frame_vector.c
frontswap.c
gup.c mm: retire GUP WARN_ON_ONCE that outlived its usefulness 2016-02-03 08:57:14 -08:00
highmem.c
huge_memory.c thp: call pmdp_invalidate() with correct virtual address 2016-02-24 10:46:30 -08:00
hugetlb.c mm/hugetlb.c: fix incorrect proc nr_hugepages value 2016-02-18 16:23:24 -08:00
hugetlb_cgroup.c
hwpoison-inject.c
init-mm.c
internal.h mm: polish virtual memory accounting 2016-02-03 08:28:43 -08:00
interval_tree.c
Kconfig mm/Kconfig: correct description of DEFERRED_STRUCT_PAGE_INIT 2016-02-05 18:10:40 -08:00
Kconfig.debug
kmemcheck.c
kmemleak-test.c
kmemleak.c Revert "gfp: add __GFP_NOACCOUNT" 2016-01-14 16:00:49 -08:00
ksm.c mm/ksm.c: mark stable page dirty 2016-01-15 17:56:32 -08:00
list_lru.c mm: memcontrol: move kmem accounting code to CONFIG_MEMCG 2016-01-20 17:09:18 -08:00
maccess.c
madvise.c mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called 2016-01-15 17:56:32 -08:00
Makefile
memblock.c memblock: don't mark memblock_phys_mem_size() as __init 2016-02-05 18:10:40 -08:00
memcontrol.c thp: change pmd_trans_huge_lock() interface to return ptl 2016-01-21 17:20:51 -08:00
memory-failure.c mm: soft-offline: exit with failure for non anonymous thp 2016-01-15 17:56:32 -08:00
memory.c mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED 2016-02-27 10:28:52 -08:00
memory_hotplug.c x86, mm: introduce vmem_altmap to augment vmemmap_populate() 2016-01-15 17:56:32 -08:00
mempolicy.c mempolicy: do not try to queue pages from !vma_migratable() 2016-02-05 18:10:40 -08:00
mempool.c
memtest.c
migrate.c mm: numa: quickly fail allocations for NUMA balancing on full nodes 2016-02-27 10:28:52 -08:00
mincore.c thp: change pmd_trans_huge_lock() interface to return ptl 2016-01-21 17:20:51 -08:00
mlock.c mm: fix mlock accouting 2016-01-21 17:20:51 -08:00
mm_init.c
mmap.c mm: fix regression in remap_file_pages() emulation 2016-02-18 16:23:24 -08:00
mmu_context.c
mmu_notifier.c
mmzone.c mm/mmzone.c: memmap_valid_within() can be boolean 2016-01-14 16:00:49 -08:00
mprotect.c mm, dax: check for pmd_none() after split_huge_pmd() 2016-02-11 18:35:48 -08:00
mremap.c mm, dax: check for pmd_none() after split_huge_pmd() 2016-02-11 18:35:48 -08:00
msync.c
nobootmem.c x86/mm: Introduce max_possible_pfn 2015-12-06 12:46:31 +01:00
nommu.c kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
oom_kill.c mm, shmem: add internal shmem resident memory accounting 2016-01-14 16:00:49 -08:00
page-writeback.c mm: page_alloc: generalize the dirty balance reserve 2016-01-14 16:00:49 -08:00
page_alloc.c mm, hugetlb: don't require CMA for runtime gigantic pages 2016-02-05 18:10:40 -08:00
page_counter.c
page_ext.c
page_idle.c mm: add page_check_address_transhuge() helper 2016-01-15 17:56:32 -08:00
page_io.c
page_isolation.c mm/page_isolation: do some cleanup in "undo_isolate_page_range" 2016-01-15 17:56:32 -08:00
page_owner.c
pagewalk.c thp: rename split_huge_page_pmd() to split_huge_pmd() 2016-01-15 17:56:32 -08:00
percpu-km.c
percpu-vm.c
percpu.c tree wide: use kvfree() than conditional kfree()/vfree() 2016-01-22 17:02:18 -08:00
pgtable-generic.c mm,thp: fix spellos in describing __HAVE_ARCH_FLUSH_PMD_TLB_RANGE 2016-02-11 18:35:48 -08:00
process_vm_access.c ptrace: use fsuid, fsgid, effective creds for fs access checks 2016-01-20 17:09:18 -08:00
quicklist.c
readahead.c mm: move lru_to_page to mm_inline.h 2016-01-14 16:00:49 -08:00
rmap.c mm: fix locking order in mm_take_all_locks() 2016-01-15 17:56:32 -08:00
shmem.c make sure that freeing shmem fast symlinks is RCU-delayed 2016-01-22 18:08:52 -05:00
slab.c mm: slab: free kmem_cache_node after destroy sysfs file 2016-02-18 16:23:24 -08:00
slab.h mm: slab: free kmem_cache_node after destroy sysfs file 2016-02-18 16:23:24 -08:00
slab_common.c mm: slab: free kmem_cache_node after destroy sysfs file 2016-02-18 16:23:24 -08:00
slob.c mm: slab: free kmem_cache_node after destroy sysfs file 2016-02-18 16:23:24 -08:00
slub.c mm: slab: free kmem_cache_node after destroy sysfs file 2016-02-18 16:23:24 -08:00
sparse-vmemmap.c x86, mm: introduce vmem_altmap to augment vmemmap_populate() 2016-01-15 17:56:32 -08:00
sparse.c x86, mm: introduce vmem_altmap to augment vmemmap_populate() 2016-01-15 17:56:32 -08:00
swap.c mm, x86: get_user_pages() for dax mappings 2016-01-15 17:56:32 -08:00
swap_cgroup.c
swap_state.c mm: memcontrol: charge swap to cgroup2 2016-01-20 17:09:18 -08:00
swapfile.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
truncate.c dax: support dirty DAX entries in radix tree 2016-01-22 17:02:18 -08:00
userfaultfd.c memcg: adjust to support new THP refcounting 2016-01-15 17:56:32 -08:00
util.c proc: revert /proc/<pid>/maps [stack:TID] annotation 2016-02-03 08:28:43 -08:00
vmacache.c
vmalloc.c mm/vmalloc.c: use macro IS_ALIGNED to judge the aligment 2016-01-15 17:56:32 -08:00
vmpressure.c mm/vmpressure.c: fix subtree pressure detection 2016-02-03 08:28:43 -08:00
vmscan.c mm: downgrade VM_BUG in isolate_lru_page() to warning 2016-02-05 18:10:40 -08:00
vmstat.c vmstat: make vmstat_update deferrable 2016-02-05 18:10:40 -08:00
workingset.c dax: support dirty DAX entries in radix tree 2016-01-22 17:02:18 -08:00
zbud.c mm/zbud.c: use list_last_entry() instead of list_tail_entry() 2016-01-15 11:40:52 -08:00
zpool.c
zsmalloc.c zsmalloc: fix migrate_zspage-zs_free race condition 2016-01-20 17:09:18 -08:00
zswap.c mm/zswap: change incorrect strncmp use to strcmp 2015-12-18 14:25:40 -08:00