Saturday, January 18, 2014

Flushing out pdflush

   The kernel page cache contains in-memory copies of data blocks belonging to files kept in persistent storage. Pages which are written to by a process, but not yet written back to disk, accumulate in the cache and are known as "dirty" pages. The amount of dirty memory is listed in /proc/meminfo. Pages in the cache are flushed to disk after an interval of 30 seconds. Pdflush is a set of kernel threads responsible for writing the dirty pages to disk, either explicitly in response to a sync() call, or implicitly when the page cache runs out of free pages, when pages have been dirty for too long, or when there are too many dirty pages in the page cache (as specified by /proc/sys/vm/dirty_ratio).
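
As a concrete illustration of the interfaces mentioned above, the following minimal userspace sketch reads the "Dirty:" line from /proc/meminfo and the dirty_ratio tunable. It is an example written for this article, not code from the kernel or the patch set:

    /* Print the current amount of dirty memory and the dirty_ratio limit. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
            return 1;
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "Dirty:", 6) == 0)
                printf("%s", line);        /* e.g. "Dirty:   1234 kB" */
        fclose(f);

        f = fopen("/proc/sys/vm/dirty_ratio", "r");
        if (f) {
            if (fgets(line, sizeof(line), f))
                printf("dirty_ratio: %s", line);
            fclose(f);
        }
        return 0;
    }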

  At any given point in time, between two and eight pdflush threads are running in the system. The number of pdflush threads is determined by the load on the page cache; new pdflush threads are spawned if none of the existing threads has been idle for more than one second and there is more work in the pdflush work queue. On the other hand, if the last active pdflush thread has been asleep for more than one second, one thread is terminated. Threads are terminated one at a time until only the minimum of two pdflush threads remains. The current number of running pdflush threads is reflected by /proc/sys/vm/nr_pdflush_threads.
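
That spawn-and-reap policy can be condensed into a few lines of C. This is purely an illustrative encoding of the rules above; the function name, type, and constants are invented for the example and do not appear in the kernel:

    enum pool_action { POOL_KEEP, POOL_SPAWN, POOL_REAP };

    #define MIN_PDFLUSH_THREADS 2
    #define MAX_PDFLUSH_THREADS 8

    /* Decide what the pdflush pool should do, given its current state. */
    static enum pool_action pdflush_pool_decide(int nthreads,
                                                int some_thread_idle_over_1s,
                                                int last_active_asleep_over_1s,
                                                int work_pending)
    {
        /* Spawn: no existing thread has been idle for more than a
         * second, there is queued work, and we are below the maximum. */
        if (!some_thread_idle_over_1s && work_pending &&
            nthreads < MAX_PDFLUSH_THREADS)
            return POOL_SPAWN;

        /* Reap: the last active thread has been asleep for more than a
         * second and we are above the minimum of two threads. */
        if (last_active_asleep_over_1s && nthreads > MIN_PDFLUSH_THREADS)
            return POOL_REAP;

        return POOL_KEEP;
    }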

  A number of pdflush-related issues have come to light over time. Pdflush threads are common to all block devices, but it is thought that they would perform better if they concentrated on a single disk spindle. Contention between pdflush threads is avoided through the use of the BDI_pdflush flag on the backing_dev_info structure, but this interlock can also limit writeback performance. Another issue with pdflush is request starvation. There is a fixed number of I/O requests available for each queue in the system; if the limit is exceeded, any application requesting I/O will block waiting for a new slot. Since pdflush works on several queues, it cannot afford to block on a single queue, so it sets the wbc->nonblocking writeback information flag instead. If other applications continue to write on the device, pdflush will not succeed in allocating request slots. This may lead to starvation of access to the queue if pdflush repeatedly finds the queue congested.
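
The starvation scenario can be sketched as follows. The types and helper here are stand-ins invented for illustration; in 2.6-era kernels the real writeback_control (include/linux/writeback.h) carries nonblocking and encountered_congestion bit fields that this mimics:

    struct writeback_control_sketch {
        unsigned nonblocking:1;            /* don't block on request slots */
        unsigned encountered_congestion:1; /* a congested queue was skipped */
    };

    /* Stand-in for a congestion check such as bdi_write_congested(). */
    static int queue_congested(void)
    {
        return 1;    /* pretend the queue is always full of other I/O */
    }

    static void pdflush_style_writeout(struct writeback_control_sketch *wbc)
    {
        wbc->nonblocking = 1;    /* pdflush serves many queues, so it
                                  * must not sleep on any single one */
        if (queue_congested()) {
            /* If other writers keep the queue full, pdflush finds it
             * congested on every pass and this device's writeback
             * starves. */
            wbc->encountered_congestion = 1;
            return;
        }
        /* ...otherwise allocate request slots and submit the I/O. */
    }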

  In his patch set, Jens Axboe proposes replacing the pdflush threads with flusher threads allocated per backing device info (BDI). Unlike pdflush threads, per-BDI flusher threads focus on a single disk spindle. With per-BDI flushing, when the request_queue is congested, blocking happens on request allocation, avoiding request starvation and providing better fairness.

With pdflush, the dirty inode list is stored by the superblock of the filesystem. Since the per-BDI flusher needs to be aware of the dirty pages to be written by its assigned device, this list is now stored by the BDI. Calls to flush dirty inodes on the superblock result in flushing the inodes from the dirty-inode lists of all the backing devices listed for the filesystem.
As with pdflush, per-BDI writeback is controlled through the writeback_control data structure, which instructs the writeback code what to do and how to perform the writeback. The important fields of this structure are listed here, with a sketch of the structure following the list:
  • sync_mode: defines the way synchronization should be performed with respect to inode locking. If set to WB_SYNC_NONE, the writeback will skip locked inodes, whereas if set to WB_SYNC_ALL it will wait for locked inodes to be unlocked before performing the writeback.
  • nr_to_write: the number of pages to write. This value is decremented as the pages are written.
  • older_than_this: if not NULL, all inodes older than the jiffies value recorded in this field are flushed. This field takes precedence over nr_to_write.
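
For reference, the fields above can be pictured as the following abridged C structure. This is a trimmed-down sketch in the spirit of the 2.6-era include/linux/writeback.h, not the complete definition:

    enum writeback_sync_modes {
        WB_SYNC_NONE,    /* don't wait on anything */
        WB_SYNC_ALL,     /* wait on every mapping */
    };

    struct writeback_control {
        enum writeback_sync_modes sync_mode;
        unsigned long *older_than_this;    /* if not NULL, flush only inodes
                                            * dirtied before this jiffies
                                            * value */
        long nr_to_write;                  /* pages to write; decremented as
                                            * pages are written */
        /* ... further fields omitted ... */
    };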

The struct bdi_writeback keeps all information required for flushing the dirty pages:
    struct bdi_writeback {
        struct backing_dev_info *bdi;
        unsigned int             nr;
        struct task_struct      *task;
        wait_queue_head_t        wait;
        struct list_head         b_dirty;
        struct list_head         b_io;
        struct list_head         b_more_io;

        unsigned long            nr_pages;
        struct super_block      *sb;
    };
The bdi_writeback structure is initialized when the device is registered through bdi_register(). The fields of the bdi_writeback are:
  • bdi: the backing_dev_info associated with this bdi_writeback,
  • task: a pointer to the default flusher thread, which is responsible for spawning threads to perform the flushing work,
  • wait: a wait queue for synchronizing with the flusher threads,
  • b_dirty: list of all the dirty inodes on this BDI to be flushed,
  • b_io: inodes parked for I/O,
  • b_more_io: more inodes parked for I/O; all inodes queued for flushing are inserted in this list, before being moved to b_io,
  • nr_pages: total number of pages to be flushed, and
  • sb: the pointer to the superblock of the filesystem which resides on this BDI. 

nr_pages and sb are parameters passed asynchronously to the BDI flush thread, and are not fixed through the life of the bdi_writeback. This is done to accommodate devices with multiple filesystems, and hence multiple super_blocks. With multiple super_blocks on a single device, a sync can be requested for a single filesystem on the device.
The bdi_writeback_task() function waits for the dirty_writeback_interval, which by default is 5 seconds, and initiates wb_do_writeback(wb) periodically. If no pages are written for five minutes, the flusher thread exits (with a grace period of dirty_writeback_interval). If writeback work is later required after an exit, new flusher threads are spawned by the default writeback thread.
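
The shape of that loop can be sketched in userspace-style C. The helper name and return convention are invented; only the intervals come from the text above:

    #include <unistd.h>

    #define DIRTY_WRITEBACK_INTERVAL 5        /* seconds, the default */
    #define IDLE_EXIT_SECONDS        (5 * 60) /* five idle minutes */

    /* Stand-in for wb_do_writeback(wb); returns the number of pages
     * written during this pass (0 here, for the sketch). */
    static long do_writeback_pass(void)
    {
        return 0;
    }

    static void bdi_writeback_task_sketch(void)
    {
        int idle = 0;

        for (;;) {
            sleep(DIRTY_WRITEBACK_INTERVAL);
            if (do_writeback_pass() > 0) {
                idle = 0;
            } else {
                idle += DIRTY_WRITEBACK_INTERVAL;
                /* Nothing written for five minutes: exit.  The default
                 * writeback thread respawns a flusher if work arrives
                 * later. */
                if (idle >= IDLE_EXIT_SECONDS)
                    return;
            }
        }
    }
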
Writeback flushes are done in two ways:
  • pdflush style: This is initiated in response to an explicit writeback request, for example when syncing the inode pages of a super_block. wb_start_writeback() is called with the superblock information and the number of pages to be flushed. The function tries to acquire the bdi_writeback structure associated with the BDI. If successful, it stores the superblock pointer and the number of pages to be flushed in the bdi_writeback structure and wakes up the flusher thread to perform the actual writeout for the superblock. This is different from how pdflush performs writeouts: pdflush attempts to grab the device from the writeout path, blocking the writeouts from other processes.
  • kupdated style: If there are no explicit writeback requests, the thread wakes up periodically to flush dirty data. The first time one of the inode's pages stored in the BDI is dirtied, the dirtying time is recorded in the inode's address space. The periodic writeback code walks through the superblock's inode list, writing back dirty pages of the inodes older than a specified point in time, as sketched below. This is run once per dirty_writeback_interval, which defaults to five seconds.
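
The age check at the heart of the kupdated-style pass looks roughly like this; the type and field names are invented stand-ins for the dirtying time kept in the inode's address space:

    struct inode_sketch {
        unsigned long dirtied_when;    /* jiffies when the inode's first
                                        * page was dirtied */
    };

    /* Nonzero if this inode has been dirty long enough to be written
     * back during a periodic pass. */
    static int should_write_back(const struct inode_sketch *inode,
                                 unsigned long now_jiffies,
                                 unsigned long expire_interval)
    {
        return now_jiffies - inode->dirtied_when >= expire_interval;
    }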

After review of the first attempt, Jens added support for multiple flusher threads per device, based on the suggestions of Andrew Morton. Dave Chinner suggested that filesystems would like to have a flusher thread per allocation group. In the patch set (second iteration) which followed, Jens added a new interface in the superblock to return the bdi_writeback structure associated with the inode:
    struct bdi_writeback *(*inode_get_wb) (struct inode *);
If inode_get_wb is NULL, the default bdi_writeback of the BDI is returned, which means there is only one bdi_writeback thread for the BDI. The maximum number of threads that can be started per BDI is 32.
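
To illustrate how a filesystem might use this hook for Dave Chinner's per-allocation-group idea, here is a hypothetical sketch. Everything prefixed myfs_, the field layout, and the helpers are invented; only the inode_get_wb signature comes from the patch set:

    struct inode;
    struct bdi_writeback;    /* provided by the patch set */

    #define MYFS_MAX_GROUPS 32    /* the per-BDI thread limit */

    struct myfs_sb_info {
        struct bdi_writeback *group_wb[MYFS_MAX_GROUPS];
    };

    /* Hypothetical helpers mapping an inode to its allocation group. */
    static struct myfs_sb_info *myfs_sb(struct inode *inode);
    static unsigned int myfs_alloc_group(struct inode *inode);

    /* Candidate implementation of the ->inode_get_wb() hook: route each
     * inode to the flusher thread of its allocation group. */
    static struct bdi_writeback *myfs_inode_get_wb(struct inode *inode)
    {
        struct myfs_sb_info *sbi = myfs_sb(inode);

        return sbi->group_wb[myfs_alloc_group(inode) % MYFS_MAX_GROUPS];
    }
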
Initial experiments conducted by Jens found an 8% increase in performance on a simple SATA drive running the Flexible File System Benchmark (ffsb). File layout was smoother as compared to the vanilla kernel, as reported by vmstat, with a uniform distribution of buffers written out. With a ten-disk btrfs filesystem, per-BDI flushing performed 25% faster. The writeback work is tracked in the "writeback" branch of Jens's block layer git tree (git://git.kernel.dk/linux-2.6-block.git). There have been no comments on the second iteration so far, but the per-BDI flusher threads are still not considered ready to go into the 2.6.30 tree.
Acknowledgments: Thanks to Jens Axboe for reviewing and explaining certain aspects of the patch set.


SOURCE: http://lwn.net/Articles/326552/ 
