kernel-hacking-2024-linux-s.../Documentation
David Howells 201a15428b FS-Cache: Handle pages pending storage that get evicted under OOM conditions
Handle netfs pages that the vmscan algorithm wants to evict from the pagecache
under OOM conditions, but that are still waiting to be written to the cache.
Under these conditions, vmscan calls the releasepage() function of the netfs,
asking if a page can be discarded.

The problem is typified by the following trace of a stuck process:

	kslowd005     D 0000000000000000     0  4253      2 0x00000080
	 ffff88001b14f370 0000000000000046 ffff880020d0d000 0000000000000007
	 0000000000000006 0000000000000001 ffff88001b14ffd8 ffff880020d0d2a8
	 000000000000ddf0 00000000000118c0 00000000000118c0 ffff880020d0d2a8
	Call Trace:
	 [<ffffffffa00782d8>] __fscache_wait_on_page_write+0x8b/0xa7 [fscache]
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffffa0078240>] ? __fscache_check_page_write+0x63/0x70 [fscache]
	 [<ffffffffa00b671d>] nfs_fscache_release_page+0x4e/0xc4 [nfs]
	 [<ffffffffa00927f0>] nfs_release_page+0x3c/0x41 [nfs]
	 [<ffffffff810885d3>] try_to_release_page+0x32/0x3b
	 [<ffffffff81093203>] shrink_page_list+0x316/0x4ac
	 [<ffffffff8109372b>] shrink_inactive_list+0x392/0x67c
	 [<ffffffff813532fa>] ? __mutex_unlock_slowpath+0x100/0x10b
	 [<ffffffff81058df0>] ? trace_hardirqs_on_caller+0x10c/0x130
	 [<ffffffff8135330e>] ? mutex_unlock+0x9/0xb
	 [<ffffffff81093aa2>] shrink_list+0x8d/0x8f
	 [<ffffffff81093d1c>] shrink_zone+0x278/0x33c
	 [<ffffffff81052d6c>] ? ktime_get_ts+0xad/0xba
	 [<ffffffff81094b13>] try_to_free_pages+0x22e/0x392
	 [<ffffffff81091e24>] ? isolate_pages_global+0x0/0x212
	 [<ffffffff8108e743>] __alloc_pages_nodemask+0x3dc/0x5cf
	 [<ffffffff81089529>] grab_cache_page_write_begin+0x65/0xaa
	 [<ffffffff8110f8c0>] ext3_write_begin+0x78/0x1eb
	 [<ffffffff81089ec5>] generic_file_buffered_write+0x109/0x28c
	 [<ffffffff8103cb69>] ? current_fs_time+0x22/0x29
	 [<ffffffff8108a509>] __generic_file_aio_write+0x350/0x385
	 [<ffffffff8108a588>] ? generic_file_aio_write+0x4a/0xae
	 [<ffffffff8108a59e>] generic_file_aio_write+0x60/0xae
	 [<ffffffff810b2e82>] do_sync_write+0xe3/0x120
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810b18e1>] ? __dentry_open+0x1a5/0x2b8
	 [<ffffffff810b1a76>] ? dentry_open+0x82/0x89
	 [<ffffffffa00e693c>] cachefiles_write_page+0x298/0x335 [cachefiles]
	 [<ffffffffa0077147>] fscache_write_op+0x178/0x2c2 [fscache]
	 [<ffffffffa0075656>] fscache_op_execute+0x7a/0xd1 [fscache]
	 [<ffffffff81082093>] slow_work_execute+0x18f/0x2d1
	 [<ffffffff8108239a>] slow_work_thread+0x1c5/0x308
	 [<ffffffff8104c0f1>] ? autoremove_wake_function+0x0/0x34
	 [<ffffffff810821d5>] ? slow_work_thread+0x0/0x308
	 [<ffffffff8104be91>] kthread+0x7a/0x82
	 [<ffffffff8100beda>] child_rip+0xa/0x20
	 [<ffffffff8100b87c>] ? restore_args+0x0/0x30
	 [<ffffffff8102ef83>] ? tg_shares_up+0x171/0x227
	 [<ffffffff8104be17>] ? kthread+0x0/0x82
	 [<ffffffff8100bed0>] ? child_rip+0x0/0x20

In the above backtrace, the following is happening:

 (1) A page storage operation is being executed by a slow-work thread
     (fscache_write_op()).

 (2) FS-Cache farms the operation out to the cache to perform
     (cachefiles_write_page()).

 (3) CacheFiles is then calling Ext3 to perform the actual write, using Ext3's
     standard write (do_sync_write()) under KERNEL_DS directly from the netfs
     page.

 (4) However, for Ext3 to perform the write, it must allocate some memory; in
     particular, it must allocate at least one page cache page into which it
     can copy the data from the netfs page.

 (5) Under OOM conditions, the memory allocator can't immediately come up with
     a page, so it uses vmscan to find something to discard
     (try_to_free_pages()).

 (6) vmscan finds a clean netfs page it might be able to discard (possibly the
     very page that the slow-work thread is trying to write out).

 (7) The netfs is called to throw the page away (nfs_release_page()) - but it's
     called with __GFP_WAIT, so the netfs decides to wait for the store to
     complete (__fscache_wait_on_page_write()).

 (8) This blocks a slow-work processing thread - possibly against itself.

The system ends up stuck because it can't write out any netfs pages to the
cache without allocating more memory.

To avoid this, we make FS-Cache cancel some writes that aren't in the middle of
actually being performed.  This means that some data won't make it into the
cache this time.  To support this, a new FS-Cache function,
fscache_maybe_release_page(), is added that replaces what the netfs
releasepage() functions used to do with respect to the cache.

The decisions fscache_maybe_release_page() makes are counted and displayed
through /proc/fs/fscache/stats on a line labelled "VmScan".  There are four
counters provided: "nos=N" - pages that weren't pending storage; "gon=N" -
pages that were pending storage when we first looked, but weren't by the time
we got the object lock; "bsy=N" - pages that we ignored as they were actively
being written when we looked; and "can=N" - pages that we cancelled the storage
of.
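
A monitoring tool could pick these counters out of the stats file with
something like the sketch below.  The exact layout of the line (a "VmScan"
label, a colon, then the four "name=N" fields in the order listed above) is an
assumption made for illustration, and parse_vmscan_line() and struct
vmscan_counters are hypothetical names, not part of any kernel or library API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The four counters described for the "VmScan" line. */
struct vmscan_counters {
	unsigned int nos;	/* pages that weren't pending storage */
	unsigned int gon;	/* pending at first look, gone under the lock */
	unsigned int bsy;	/* ignored because they were being written */
	unsigned int can;	/* pages whose storage was cancelled */
};

/*
 * Parse one line of text.  Returns 0 and fills *out if the line is a VmScan
 * line in the assumed layout; returns -1 and leaves *out untouched otherwise.
 */
int parse_vmscan_line(const char *line, struct vmscan_counters *out)
{
	unsigned int nos, gon, bsy, can;
	const char *colon;

	assert(line != NULL && out != NULL);

	if (strncmp(line, "VmScan", 6) != 0)
		return -1;

	colon = strchr(line, ':');
	if (colon == NULL)
		return -1;

	/* Whitespace in the format matches any amount of whitespace. */
	if (sscanf(colon + 1, " nos=%u gon=%u bsy=%u can=%u",
		   &nos, &gon, &bsy, &can) != 4)
		return -1;

	out->nos = nos;
	out->gon = gon;
	out->bsy = bsy;
	out->can = can;
	return 0;
}
```

A caller would read /proc/fs/fscache/stats line by line with fgets() and pass
each line to parse_vmscan_line() until it returns 0.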

What I'd really like to do is alter the behaviour of the cancellation
heuristics, depending on how necessary it is to expel pages.  If there are
plenty of other pages that aren't waiting to be written to the cache that
could be ejected first, then it would be nice to hold off on immediate
cancellation of cache writes - but I don't see a way of doing that.

Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19 18:11:35 +00:00
ABI Documentation: ABI: /sys/devices/system/cpu/cpu#/node 2009-10-30 14:59:53 -07:00
accounting Documentation/: fix warnings from -Wmissing-prototypes in HOSTCFLAGS 2009-09-23 07:39:28 -07:00
acpi
aoe
arm ARM: 5738/1: Correct TCM documentation 2009-10-01 16:26:16 +01:00
auxdisplay includecheck fix: Documentation, cfag12864b-example.c 2009-09-24 07:20:57 -07:00
blackfin
block Trivial typo fixes in Documentation/block/data-integrity.txt. 2009-07-01 10:56:25 +02:00
blockdev
cdrom debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem. 2009-06-15 21:30:28 -07:00
cgroups cgroups: update documentation of cgroups tasks and procs files 2009-10-08 07:36:39 -07:00
connector connector: Provide the sender's credentials to the callback 2009-10-02 10:54:01 -07:00
console
cpu-freq [CPUFREQ] update Doc for cpuinfo_cur_freq and scaling_cur_freq 2009-09-01 12:45:09 -04:00
cpuidle
cris
crypto async_tx: add support for asynchronous RAID6 recovery operations 2009-08-29 19:09:27 -07:00
development-process docs: Encourage better changelogs in the development process document 2009-06-04 10:32:49 -06:00
device-mapper dm raid1: add userspace log 2009-06-22 10:12:35 +01:00
DocBook Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-09-22 07:51:45 -07:00
driver-model driver model: fix show/store prototypes in doc. 2009-07-12 13:02:10 -07:00
dvb V4L/DVB (12902): Documentation: synchronize documentation for Technisat cards 2009-09-19 00:14:32 -03:00
early-userspace
fault-injection debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem. 2009-06-15 21:30:28 -07:00
fb matroxfb: make CONFIG_FB_MATROX_MULTIHEAD=y mandatory 2009-09-23 07:39:56 -07:00
filesystems FS-Cache: Handle pages pending storage that get evicted under OOM conditions 2009-11-19 18:11:35 +00:00
firmware_class driver core: fix documentation of request_firmware_nowait 2009-06-15 21:30:24 -07:00
frv
hwmon hwmon: enhance the sysfs API for power meters 2009-10-29 07:39:30 -07:00
i2c i2c-piix4: Modify code name SB900 to Hudson-2 2009-11-07 13:10:46 +01:00
i2o
ia64 Documentation/: fix warnings from -Wmissing-prototypes in HOSTCFLAGS 2009-09-23 07:39:28 -07:00
ide ide: preserve Host Protected Area by default (v2) 2009-06-07 13:52:52 +02:00
infiniband IB: Fix typo in udev rule documentation 2009-10-07 15:35:55 -07:00
input Input: add new driver for Sentelic Finger Sensing Pad 2009-08-19 21:46:09 -07:00
ioctl drivers/char/uv_mmtimer.c: add memory mapped RTC driver for UV 2009-09-24 07:21:03 -07:00
isdn Documentation: expand isdn/INTERFACE.CAPI document 2009-10-06 22:20:51 -07:00
ja_JP block: rename CONFIG_LBD to CONFIG_LBDAF 2009-06-19 08:08:50 +02:00
kbuild kbuild: introduce ld-option 2009-09-20 12:27:42 +02:00
kdump trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
ko_KR
kvm KVM: Document KVM_CAP_IRQCHIP 2009-09-10 10:46:55 +03:00
laptops Merge branch 'thinkpad-2.6.32-part2' into release 2009-09-26 01:08:55 -04:00
lguest virtio: let header files include virtio_ids.h 2009-10-22 16:39:28 +10:30
m68k
make
mips
misc-devices max6875: Discard obsolete detect method 2009-10-04 22:53:41 +02:00
mn10300 trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
mtd trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
namespaces
netlabel
networking pktgen: Fix multiqueue handling 2009-10-04 21:08:54 -07:00
parisc
PCI PCI: document PCIe fundamental reset interfaces 2009-09-09 13:29:38 -07:00
pcmcia Documentation/: fix warnings from -Wmissing-prototypes in HOSTCFLAGS 2009-09-23 07:39:28 -07:00
power Merge git://git.infradead.org/battery-2.6 2009-09-23 10:11:08 -07:00
powerpc Merge git://git.infradead.org/mtd-2.6 2009-09-23 10:07:49 -07:00
pps LinuxPPS: core support 2009-06-18 13:04:04 -07:00
prctl
RCU rcu: Remove CONFIG_PREEMPT_RCU 2009-08-23 10:32:40 +02:00
s390 [S390] s390dbf: Add description for usage of "%s" in sprintf events 2009-09-11 10:29:53 +02:00
scheduler sched: Documentation/sched-rt-group: Fix style issues & bump version 2009-06-21 13:12:46 +02:00
scsi [SCSI] hptiop: Add RR44xx adapter support 2009-10-02 09:45:22 -05:00
serial
sh
sound ALSA: dummy - Fix descriptions of pcm_substreams parameter 2009-11-02 14:11:55 +01:00
sparc
spi spi: fix spelling of `automatically' in documentation 2009-09-23 07:39:44 -07:00
sysctl Merge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6 2009-09-24 07:53:22 -07:00
telephony
thermal thermal: sysfs-api.txt - document passive attribute for thermal zones 2009-11-05 18:11:18 -05:00
timers trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
trace tracing: Fix comment typo and documentation example 2009-10-24 11:07:50 +02:00
uml
usb USB: Fix sysfs paths in documentation 2009-09-23 06:46:41 -07:00
video4linux Documentation/: fix warnings from -Wmissing-prototypes in HOSTCFLAGS 2009-09-23 07:39:28 -07:00
vm Merge branch 'hostprogs-wmissing-prototypes' of git://git.kernel.org/pub/scm/linux/kernel/git/josh/linux-misc 2009-11-17 09:14:49 -08:00
w1 ds2482: Discard obsolete detect method 2009-10-04 22:53:41 +02:00
watchdog Documentation/: fix warnings from -Wmissing-prototypes in HOSTCFLAGS 2009-09-23 07:39:28 -07:00
wimax
x86 USB: ehci-dbgp,documentation: Documentation updates for ehci-dbgp 2009-09-23 06:46:39 -07:00
zh_CN
00-INDEX Bluetooth: Add documentation for Marvell Bluetooth driver 2009-08-22 14:25:32 -07:00
applying-patches.txt
atomic_ops.txt Documentation/atomic_ops.txt: fix sample code 2009-06-16 19:47:52 -07:00
bad_memory.txt
basic_profiling.txt
binfmt_misc.txt
braille-console.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
bt8xxgpio.txt
btmrvl.txt Bluetooth: Add documentation for Marvell Bluetooth driver 2009-08-22 14:25:32 -07:00
BUG-HUNTING
c2port.txt
cachetlb.txt
Changes Documentation/Changes: perl is needed to build the kernel 2009-06-18 13:03:46 -07:00
CodingStyle trivial: fix typo milisecond/millisecond for documentation and source comments. 2009-06-12 18:01:46 +02:00
cpu-hotplug.txt
cpu-load.txt
cputopology.txt Documentation: ABI: /sys/devices/system/cpu/cpu#/ topology files 2009-10-30 14:59:52 -07:00
credentials.txt
dcdbas.txt
debugging-modules.txt
debugging-via-ohci1394.txt ieee1394: update URLs in debugging-via-ohci1394.txt 2009-10-03 09:28:11 +02:00
dell_rbu.txt trivial: Documentation/dell_rbu.txt: fix typos 2009-06-12 18:01:50 +02:00
devices.txt
DMA-API.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
DMA-attributes.txt
DMA-ISA-LPC.txt
DMA-mapping.txt
dmaengine.txt
dontdiff sparc: Kill PROM console driver. 2009-09-15 17:04:38 -07:00
dynamic-debug-howto.txt
edac.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
eisa.txt
email-clients.txt
feature-removal-schedule.txt inotify: deprecate the inotify kernel interface 2009-10-18 15:49:38 -04:00
flexible-arrays.txt Update flex_arrays.txt 2009-10-15 07:25:20 -06:00
futex-requeue-pi.txt futex: add requeue-pi documentation 2009-05-09 07:12:50 +02:00
gcov.txt trivial: fix typo in CONFIG_DEBUG_FS in gcov doc 2009-09-21 15:14:56 +02:00
gpio.txt gpiolib: allow poll() on value 2009-09-23 07:39:48 -07:00
highuid.txt
HOWTO
hw_random.txt
ics932s401
initrd.txt
Intel-IOMMU.txt intel-iommu: Kill DMAR_BROKEN_GFX_WA option. 2009-09-19 09:37:23 -07:00
intel_txt.txt x86, intel_txt: Intel TXT boot support 2009-07-21 11:49:06 -07:00
io-mapping.txt
IO-mapping.txt
io_ordering.txt
iostats.txt
IPMI.txt
IRQ-affinity.txt
IRQ.txt
irqflags-tracing.txt
isapnp.txt
java.txt
kernel-doc-nano-HOWTO.txt kernel-doc: allow multi-line declaration purpose descriptions 2009-09-18 09:48:52 -07:00
kernel-docs.txt
kernel-parameters.txt x86: earlyprintk: Fix regression to handle serial,ttySn as 1 arg 2009-10-01 10:34:16 +02:00
keys-request-key.txt
keys.txt KEYS: Add a keyctl to install a process's session keyring on its parent [try #6] 2009-09-02 21:29:22 +10:00
kmemcheck.txt kmemcheck: update documentation 2009-07-01 22:36:22 +02:00
kmemleak.txt kmemleak: add clear command support 2009-09-08 16:36:08 +01:00
kobject.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
kprobes.txt debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem. 2009-06-15 21:30:28 -07:00
kref.txt kref: double kref_put() in my_data_handler() 2009-09-18 09:48:52 -07:00
ldm.txt
leds-class.txt led: document sysfs interface 2009-08-28 15:21:12 -04:00
leds-lp3944.txt leds: LED driver for National Semiconductor LP3944 Funlight Chip 2009-06-23 20:21:38 +01:00
local_ops.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
lockdep-design.txt lockdep: Fix typos in documentation 2009-08-07 12:03:46 +02:00
lockstat.txt
logo.gif
logo.txt
magic-number.txt
Makefile
ManagementStyle
mca.txt
md.txt
memory-barriers.txt
memory-hotplug.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
memory.txt Documentation/memory.txt: remove some very outdated recommendations 2009-09-22 07:17:26 -07:00
mono.txt
mutex-design.txt
nmi_watchdog.txt
nommu-mmap.txt
numastat.txt mm: fix NUMA accounting in numastat.txt 2009-09-22 07:17:39 -07:00
oops-tracing.txt
parport-lowlevel.txt
parport.txt
pi-futex.txt
pnp.txt
preempt-locking.txt
printk-formats.txt
prio_tree.txt
rbtree.txt trivial: rbtree.txt: fix rb_entry() parameters in sample code 2009-06-12 18:01:47 +02:00
rfkill.txt rfkill: export persistent attribute in sysfs 2009-06-19 11:50:18 -04:00
robust-futex-ABI.txt futex: documentation: fix inconsistent description of futex list_op_pending 2009-06-18 13:03:56 -07:00
robust-futexes.txt
rt-mutex-design.txt
rt-mutex.txt
rtc.txt rtc: add boot_timesource sysfs attribute 2009-09-23 07:39:46 -07:00
SAK.txt
SecurityBugs
SELinux.txt
serial-console.txt
sgi-ioc4.txt
sgi-visws.txt
slow-work.txt SLOW_WORK: Allow a requeueable work item to sleep till the thread is needed 2009-11-19 18:10:57 +00:00
SM501.txt trivial: Miscellaneous documentation typo fixes 2009-06-12 18:01:47 +02:00
Smack.txt
sparse.txt
spinlocks.txt
stable_api_nonsense.txt
stable_kernel_rules.txt
SubmitChecklist block: rename CONFIG_LBD to CONFIG_LBDAF 2009-06-19 08:08:50 +02:00
SubmittingDrivers
SubmittingPatches docs: update patch size in SubmittingPatches 2009-10-01 16:11:12 -07:00
svga.txt
sysfs-rules.txt Doc/sysfs-rules: Swap the order of the words so the sentence makes more sense 2009-05-08 19:22:20 -07:00
sysrq.txt sysrq, kdump: make sysrq-c consistent 2009-07-29 19:10:36 -07:00
tomoyo.txt
unaligned-memory-access.txt
unicode.txt
unshare.txt
VGA-softcursor.txt
vgaarbiter.txt PCI/VGA: add VGA arbitration documentation 2009-09-09 13:29:42 -07:00
video-output.txt
volatile-considered-harmful.txt
voyager.txt
zorro.txt