In May 2008 I wrote about Remzi Arpaci-Dusseau’s work at the University of Wisconsin regarding file system corruption and the dangers both of file systems that trade off robustness for performance and of commodity disk hardware. As I am presenting material this week on backup, recovery, and database administration, here is the original article.
This blog entry is copyright by Sybase, Inc., an SAP Company, and first appeared on Glenn Paulley’s Sybase blog (http://iablog.sybase.com/paulley/) on 22 May 2008. It is reprinted here with permission.
In November I attended a lecture at the University of Waterloo, part of the database seminar series at UW sponsored by Sybase iAnywhere, given by Remzi Arpaci-Dusseau entitled File systems are broken (and what we’re doing to fix them). Remzi, his colleague (and spouse) Andrea, and graduate students and other faculty at the University of Wisconsin operate the ADSL Laboratory, whose mandate is to study issues with physical data storage. Remzi and his colleagues have authored various papers on the reliability of the storage stack, covering not simply hard media failures but also disk corruption whose causes range from software bugs to transient media failures, and how these failures can become catastrophic depending on the device drivers, file system, and operating system in use.
In the studies, Remzi’s graduate students would introduce artificial, pseudo-random errors in a software shim at a point in the I/O stack, and then track what happened. The types of errors introduced included (virtual) read and/or write errors to various file system components: ordinary data blocks, inode blocks, and so on. They would then categorize the types of failures and compare the results across media types, file systems, and operating systems.
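The idea of a fault-injection shim is easy to sketch. The toy Python class below wraps a block device and corrupts a pseudo-random fraction of reads; the class name and structure are my own illustration (the actual Wisconsin harness operated inside the kernel’s I/O stack, not in user-space Python), but it conveys the mechanism:

```python
import random

class FaultyBlockDevice:
    """A toy shim between a client and a block device that injects
    pseudo-random read faults, loosely in the spirit of the Wisconsin
    fault-injection studies. Illustrative only."""

    BLOCK_SIZE = 512

    def __init__(self, backing: bytes, fault_rate: float, seed: int = 0):
        self.backing = bytearray(backing)
        self.fault_rate = fault_rate
        self.rng = random.Random(seed)   # seeded for reproducible runs
        self.faults_injected = 0

    def read_block(self, block_no: int) -> bytes:
        start = block_no * self.BLOCK_SIZE
        data = bytes(self.backing[start:start + self.BLOCK_SIZE])
        if self.rng.random() < self.fault_rate:
            # Flip one byte to simulate silent corruption on the wire.
            self.faults_injected += 1
            corrupted = bytearray(data)
            i = self.rng.randrange(len(corrupted))
            corrupted[i] ^= 0xFF
            return bytes(corrupted)
        return data

# Read 8 zero-filled blocks through the shim and count corrupted reads.
dev = FaultyBlockDevice(bytes(512 * 8), fault_rate=0.5, seed=42)
mismatches = sum(dev.read_block(b) != bytes(512) for b in range(8))
```

A file system under test sits above such a shim, unaware of it; by varying which block types (data, inode, journal) get corrupted, the researchers could observe whether the file system detected, retried, or silently propagated each failure.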
Frankly, I was startled at the results. Some general trends uncovered by their analysis: SCSI disks, though more expensive, have a longer MTBF than cheaper ATA drives, so the “you get what you pay for” adage appears to hold. More surprising to me was the general lack of robustness in the file systems that were studied. As an example, Arpaci-Dusseau and his team found that the EXT3 file system, in common use in Linux systems, does virtually no detection of write failures and makes no attempt to retry, leaving EXT3 more susceptible to transient media failures: the error is discovered only when the block is subsequently read. Other file systems such as JFS fared considerably better, though JFS, too, fails to detect some errors. The studies make for interesting reading.
I mention this now because of a conversation with Peter Bumbulis yesterday, who has installed OpenSolaris on one of his home machines with the ZFS file system, developed at Sun Microsystems by Jeff Bonwick and his team. Jeff’s blog makes interesting reading.
ZFS is a transactional, copy-on-write file system that offers a variety of robustness features, along with some self-management features that are quite compelling. ZFS supports mirroring efficiently, and contains self-healing algorithms that use the mirrored data to correct corruption automatically, without user intervention. Disks can be dynamically added to a ZFS “pool” and made available immediately for use, even in a striped RAID configuration and even if the disks are heterogeneous: ZFS automatically alters the striping pattern to suit the performance of individual spindles, and a single ZFS command incorporates the new (or replaced) disk into the pool. Slick.
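To give a flavour of that “single command” administration, here is a short sketch of the relevant zpool commands; the device names are hypothetical placeholders, and this is an illustration rather than a transcript of any particular session:

```shell
# Create a mirrored pool from two disks (device names are hypothetical).
zpool create tank mirror c1t0d0 c1t1d0

# Grow the pool on the fly: the new mirror pair is usable immediately,
# and ZFS spreads subsequent writes across the vdevs.
zpool add tank mirror c2t0d0 c2t1d0

# Swap in a replacement disk; ZFS resilvers it from the surviving mirror.
zpool replace tank c1t0d0 c3t0d0

# Inspect pool health, including any checksum errors ZFS has repaired.
zpool status -v tank
```

There is no separate volume manager, partition table editing, or newfs step; the pool absorbs the new storage and the file systems above it see the added capacity at once.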
This level of self-management is truly compelling; I applaud Jeff and his team for what they’ve achieved.
With the robustness issues that have been reported with flash SSD devices that are now becoming available, ZFS may offer a level of data integrity insurance that is otherwise unavailable.