Traditional writeback tries to accumulate as much dirty data as possible.
This is a worthwhile strategy for extremely short-lived files and for
batching writes to save battery power. But for workloads where disk
latency is important, this policy generates periodic disk load spikes,
which increase latency for concurrent operations.

The present writeback engine allows tuning only the dirty data size or
the expiration time. Such tuning cannot eliminate the spikes; it just
makes them lower and more frequent. The other option is switching to sync
mode, which flushes written data right after each write; obviously this
has a significant performance impact. Such tuning is also system-wide and
affects memory-mapped and randomly written files, which flusher threads
handle much better.

This patch implements a write-behind policy which tracks sequential writes
and starts background writeback when there are enough dirty pages in a row.

Write-behind tracks the current writing position and looks into two windows
behind it: the first represents unwritten pages, the second async writeback.
The next write starts background writeback when the first window exceeds
the threshold, and waits for pages falling behind the async writeback
window. This allows combining small writes into bigger requests and
maintaining an optimal io-depth.
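The two-window scheme described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not the
code from this patch: struct wb_model, wb_init() and wb_write() are
hypothetical names, sizes are counted in KB rather than pages, and the
model assumes that a non-sequential write simply restarts tracking at the
new offset.

```c
#include <assert.h>

/* Toy user-space model; all names are hypothetical, sizes are in KB. */
struct wb_model {
	long min_kb;       /* threshold for the first (unwritten) window */
	long async_kb;     /* size of the second (async writeback) window */
	long pos;          /* end of the last write */
	long dirty_start;  /* start of dirty data not yet submitted */
	long wait_start;   /* start of submitted data not yet waited for */
	long submitted_kb; /* total handed to background writeback */
	long waited_kb;    /* total the writer had to wait for */
	long requests;     /* number of writeback requests started */
};

static void wb_init(struct wb_model *m, long min_kb, long async_kb)
{
	assert(min_kb >= 0 && async_kb >= 0);
	*m = (struct wb_model){ .min_kb = min_kb, .async_kb = async_kb };
}

static void wb_write(struct wb_model *m, long off_kb, long len_kb)
{
	if (off_kb != m->pos) {
		/* Assumption: a non-sequential write restarts tracking. */
		m->dirty_start = off_kb;
		m->wait_start = off_kb;
	}
	m->pos = off_kb + len_kb;

	if (m->min_kb == 0)
		return; /* 0 disables write-behind */

	/* First window: start writeback once enough dirty data piled up. */
	if (m->pos - m->dirty_start >= m->min_kb) {
		m->submitted_kb += m->pos - m->dirty_start;
		m->requests++;
		m->dirty_start = m->pos;
	}

	/* Second window: wait only for data that fell behind it. */
	if (m->async_kb > 0 &&
	    m->dirty_start - m->wait_start > m->async_kb) {
		long new_start = m->dirty_start - m->async_kb;

		m->waited_kb += new_start - m->wait_start;
		m->wait_start = new_start;
	}
}
```

In this model, feeding sequential 64 KB writes with min_kb = 256 and
async_kb = 1024 groups every four writes into one 256 KB request, and the
writer starts waiting only once more than 1024 KB is in flight behind it.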
This affects only writes via syscalls; memory-mapped writes are unchanged.
Also, write-behind doesn't affect files with fadvise POSIX_FADV_RANDOM.

If the async window is set to 0, write-behind skips dirty pages for a
congested disk and never waits for writeback. This is used for files with
O_NONBLOCK.

Also, for files with fadvise POSIX_FADV_NOREUSE, write-behind automatically
evicts completely written pages from the cache. This is perfect for writing
verbose logs without pushing more important data out of the cache.
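From user space, a log writer could opt in with the standard
posix_fadvise() call. The sketch below is illustrative only:
open_log_noreuse() is a hypothetical helper, and the eviction of written
pages is the behaviour this patch describes, so it should not be expected
on a kernel without it.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open a log for appending and mark it NOREUSE. */
static int open_log_noreuse(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0)
		return -1;

	/*
	 * Offset 0 with length 0 covers the whole file. posix_fadvise()
	 * returns an error number directly instead of setting errno.
	 */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The returned descriptor is then used with ordinary write() calls; nothing
else changes in the application.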
As a bonus, write-behind makes blkio throttling much smoother for most
bulk file operations, like copying or downloading, which write sequentially.

Size of the minimal write-behind request is set in:
/sys/block/$DISK/bdi/min_write_behind_kb
Default is 256 KB; 0 disables write-behind for this disk.

Size of the async window is set in:
/sys/block/$DISK/bdi/async_write_behind_kb
Default is 1024 KB; 0 disables sync write-behind (the writer never waits
for writeback).

Write-behind is controlled by sysctl vm.dirty_write_behind:
=0: disabled, default
=1: enabled
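As a rough illustration of how a tool might set these knobs, the sketch
below builds the sysfs path for a disk and writes a decimal value to a
knob file. bdi_knob_path() and write_knob() are hypothetical helpers;
the files exist only on a kernel carrying this patch, and writing to
them normally requires root.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: format /sys/block/<disk>/bdi/<knob> into buf. */
static int bdi_knob_path(char *buf, size_t len,
			 const char *disk, const char *knob)
{
	int n;

	assert(buf != NULL && disk != NULL && knob != NULL);
	n = snprintf(buf, len, "/sys/block/%s/bdi/%s", disk, knob);
	return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Hypothetical helper: write a decimal value to a sysfs or procfs file. */
static int write_knob(const char *path, long value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

For example, bdi_knob_path(buf, sizeof(buf), "sda", "min_write_behind_kb")
followed by write_knob(buf, 256) would set the minimal request size for
sda, and write_knob("/proc/sys/vm/dirty_write_behind", 1) would enable the
policy through the usual procfs view of the sysctl.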
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
Documentation/ABI/testing/sysfs-class-bdi | 11 ++++
Documentation/sysctl/vm.txt | 15 +++++
include/linux/backing-dev-defs.h | 2 +
include/linux/fs.h | 9 +++
include/linux/mm.h | 3 +
kernel/sysctl.c | 9 +++
mm/backing-dev.c | 46 +++++++++-------
mm/fadvise.c | 4 +
mm/page-writeback.c | 84 +++++++++++++++++++++++++++++
9 files changed, 162 insertions(+), 21 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index d773d5697cf5..50a8b8750c13 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -30,6 +30,17 @@ read_ahead_kb (read-write)
Size of the read-ahead window in kilobytes
+min_write_behind_kb (read-write)
+
+ Size of minimal write-behind request in kilobytes.
+ 0