Thoughts on KFS vs NDFS
I spent a few hours over the last two days reviewing the state of NDFS and KFS, and thought I’d serialize my thoughts.
KFS seems superior to NDFS in a few crucial areas.
KFS supports seeking within a file. NDFS doesn’t support this: you can only read data sequentially, not seek. A lot of my applications need seek, so I’m not sure I can do without it in NDFS.
KFS supports append (not just write-once), which is a BIG win.
KFS is written in C++, which means it has access to mlock, mmap, and O_DIRECT for working with files.
However, KFS seems to have a somewhat unfortunate design flaw:
When blocks of a file are striped across nodes in the cluster, KFS stores individual blocks of the file as files in the underlying filesystem (such as XFS on Linux).
This is going to cause three difficult problems.
For large filesystems, which have lots of blocks, the local filesystem is going to have to keep track of LOTS of small files.
This is going to burn inodes, which both allocates more kernel memory and causes runtime errors on nodes when they eventually hit the maximum number of inodes.
The second major problem is filesystem fragmentation. The filesystem doesn’t know that these files belong to the same application, so it has no reason to keep their data contiguous on disk.
The third problem is application maintenance. Keeping lots of small files like this is going to require a LOT of disk seeks when tarring, scp’ing, etc.
Most filesystems REALLY fall down in benchmarks when it comes to lots of small files in directories. XFS, ext3, JFS, etc. all fall down when it comes to maintaining lots of files (I haven’t seen ZFS numbers yet).
XFS, for example, really shines at raw throughput to a few large files.
The solution is actually pretty simple. Store all the blocks in a segment file with a side file containing pointer data. Blocks are written out into segments of 100M (which is obviously configurable).
While a segment is open (and new blocks are being written), the block-ID-to-offset map is kept in memory. This is necessary to avoid a full scan of all the blocks when looking one up from a pointer.
When a segment is full, sort the block-ID-to-offset map and rewrite the side file.
This way future reads for old/stale data can open a side file and do a binary search to find the necessary pointer information.
Now you have a filesystem that keeps data in blocks, uses few inodes, is generally fully contiguous, and is FAST since there are very few files per directory.
Update: Sriram noted that the KFS block size is 64M, which is close to my proposal. I had just assumed it was a small block size.