Saturday, January 18, 2014

What exactly are the O_DIRECT, O_SYNC Flags, Buffers & Cached in Linux Storage I/O?

Feels good to post after a long time. I always hear HPC systems people flapping their mouths about Direct I/O in the context of I/O performance measurements on distributed file systems like Lustre or CXFS, so I thought let's dig into this. You might have seen the dd command used on my blog with the oflag parameter, but for those who don't know, I have briefly revisited it here.
dd if=/dev/zero of=/tmp/testfile bs=2M count=250 oflag=direct
oflag here tells dd how to perform the write operation according to the provided symbols, i.e. direct. What exactly direct means, we will see a little later in the post.
dd if=/tmp/testfile of=/dev/null bs=2M iflag=direct
Similar to the write operation, the read operation takes iflag as a parameter with the same symbols; the one of particular interest to us is direct.
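
For the curious, below is a rough C sketch of what dd is doing under the hood when you pass oflag=direct (this is my minimal illustration, not dd's actual source): the file is opened with the O_DIRECT flag and written from a suitably aligned buffer, since O_DIRECT generally wants the user buffer, the file offset and the transfer size aligned to the logical block size of the underlying device. The 4096-byte alignment and the /tmp/testfile path are assumptions here; check your device and filesystem.

#define _GNU_SOURCE            /* for O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t blk = 4096;                  /* assumed alignment and I/O size */
    void *buf;

    if (posix_memalign(&buf, blk, blk) != 0)  /* O_DIRECT wants an aligned buffer */
        return 1;
    memset(buf, 0, blk);                      /* same payload /dev/zero would give */

    /* open bypassing the page cache; fails with EINVAL if the filesystem
       or the alignment does not support direct I/O */
    int fd = open("/tmp/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, buf, blk) != (ssize_t)blk)  /* goes straight to the device */
        perror("write");

    close(fd);
    free(buf);
    return 0;
}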
If you run these commands with and without oflag/iflag, you will notice a significant I/O performance difference in the statistics reported by dd. This is basically the effect of the caches employed by modern storage systems and the Linux kernel.
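
If you want to see uncached read numbers without reaching for iflag=direct, you can also ask the kernel to drop its clean page cache before the run (the usual shell one-liner is echo 3 > /proc/sys/vm/drop_caches as root). Here is a tiny C sketch of the same thing, purely as an illustration; it needs root, and the sync() is there so dirty pages get written back before the caches are dropped.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    sync();                                                /* write out dirty pages first */
    int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);   /* needs root */
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, "3", 1) != 1)                            /* 3 = drop pagecache + dentries/inodes */
        perror("write");
    close(fd);
    return 0;
}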
Now, these caches can be multilevel, going right from the operating system's buffer cache to the storage controller cache to the hard drive cache, so cache effects will appear differently depending upon the underlying system architecture. Filesystem software also plays a huge role in caching behavior. A traditional distributed filesystem might leverage multiple caches on multiple LUNs distributed across multiple storage controllers. An object-based filesystem such as Lustre will have multiple OSSes (Object Storage Servers), each of which leverages its own independent OS buffer cache to enhance performance. I am going to do a separate detailed post shortly about the Lustre performance impact of the OSS cache.

My point is that benchmarks of the cache effects of a specific HPC system as a whole are not comparable to another system unless all the granular details are known and acting in the same direction. The cache effect cannot be completely removed in today's complex systems; we can only try to tell the underlying components not to use their caches, if they are configured to accept such requests.

When you open a disk file with none of the flags mentioned below, a call to read() or write() for that file returns as soon as the data is copied into a kernel address-space buffer; the actual operation happens later on, depending upon the operating system. This buffer usually defaults to around 2.5% of physical memory, but this is subject to change between Linux kernel trees. We will also see the difference between the "buffers" and "cached" sections of the free command in a later section of this post.
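
To make that buffered behavior concrete, here is a small sketch contrasting a default buffered write with an O_SYNC write; the file names and the 4 KB size are made up for illustration. The first write() returns as soon as the data lands in the page cache and fsync() forces it out on demand, while with O_SYNC every write() waits until the data has actually reached stable storage.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char data[4096];
    memset(data, 'x', sizeof(data));

    /* Default buffered path: write() returns once the data sits in the page
       cache; the kernel writeback (pdflush) threads push it to disk later. */
    int fd = open("/tmp/buffered_file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, data, sizeof(data)) < 0)
        perror("write");
    fsync(fd);                 /* explicit flush: block until it is on disk */
    close(fd);

    /* O_SYNC path: every write() blocks until the data (and associated file
       metadata) has reached stable storage, so no fsync() is needed. */
    int sfd = open("/tmp/sync_file", O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (sfd < 0) { perror("open"); return 1; }
    if (write(sfd, data, sizeof(data)) < 0)
        perror("write");
    close(sfd);
    return 0;
}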

Flushing out pdflush

   The kernel page cache contains in-memory copies of data blocks belonging to files kept in persistent storage. Pages which are written to by a processor, but not yet written to disk, are accumulated in cache and are known as "dirty" pages. The amount of dirty memory is listed in /proc/meminfo. Pages in the cache are flushed to disk after an interval of 30 seconds. Pdflush is a set of kernel threads which are responsible for writing the dirty pages to disk, either explicitly in response to a sync() call, or implicitly in cases when the page cache runs out of pages, if the pages have been in memory for too long, or there are too many dirty pages in the page cache (as specified by /proc/sys/vm/dirty_ratio).
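
Those dirty pages are easy to watch for yourself. The sketch below (file name and write size are made up) dirties some pages, prints the Dirty: line from /proc/meminfo, calls sync() to force writeback, and prints it again; on a reasonably idle box you should see the number jump and then fall back.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void print_dirty(const char *when)
{
    char line[256];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "Dirty:", 6) == 0)
            printf("%s %s", when, line);    /* line already ends with '\n' */
    fclose(f);
}

int main(void)
{
    static char buf[1 << 20];               /* 1 MB of junk to dirty pages with */
    memset(buf, 'x', sizeof(buf));

    int fd = open("/tmp/dirty_demo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    for (int i = 0; i < 64; i++)            /* ~64 MB written through the page cache */
        if (write(fd, buf, sizeof(buf)) < 0)
            perror("write");
    close(fd);

    print_dirty("before sync:");
    sync();                                  /* ask the kernel to flush dirty pages */
    print_dirty("after  sync:");
    return 0;
}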