[geeks] Solaris 10 / OpenSolaris bits to be in next version of OSX
Bill Bradford
mrbill at mrbill.net
Sat Aug 12 00:30:42 CDT 2006
On Wed, Aug 09, 2006 at 07:31:42PM -0500, Bill Bradford wrote:
> There was a weblog post a week or two ago about one of the guys at Sun
> who had a bad disk and didn't notice - because RAID-Z / ZFS kept the
> data intact... I'll see if I can dig it up.
Found it.
(from November 2005)
http://blogs.sun.com/roller/page/elowe?entry=zfs_saves_the_day_ta
According to Eric Lowe:
"I've been using ZFS internally for awhile now. For someone who used to
administer several machines with Solaris Volume Manager (SVM), UFS, and a
pile of aging JBOD disks, my experience so far is easily summed up: "Dude
this so @#%& simple, so reliable, and so much more powerful, how did I
never live without it??"
So, you can imagine my excitement when ZFS finally hit the gate. The very
next day I BFU'ed my workstation, created a ZFS pool, set up a few
filesystems, and (four commands later, I might add) started beating on it.
Imagine my surprise when my machine stayed up less than two hours!!
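(He doesn't list the four commands, but a single-disk pool plus a few
filesystems really is about that much typing. Roughly, with the pool,
filesystem, and device names below being placeholders rather than his actual
layout:

    zpool create junk c0d0       # one command builds and mounts the pool on a whole disk
    zfs create junk/src          # filesystems are cheap; make one per job
    zfs create junk/builds
    zfs create junk/scratch

No newfs, no /etc/vfstab editing; the filesystems show up mounted under /junk
automatically.)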
No, this wasn't a bug in ZFS... it was a fatal checksum error. One of
those "you might want to know that your data just went away" sort of errors.
Of course, I had been running UFS on this disk for about a year, and
apparently never noticed the silent data corruption. But then I reached
into the far recesses of my brain, and I recalled a few strange moments --
like the one time when I did a bringover into a workspace on the disk, and
I got a conflict on a file I hadn't changed. Or the other time after a reboot
I got a strange panic in UFS while it was unrolling the log. At the time
I didn't think much of these things -- I just deleted the file and got
another copy from the golden source -- or rebooted and didn't see the
problem recur -- but it makes sense to me now. ZFS, with its end-to-end
checksums, had discovered in less than two hours what I hadn't known for
almost a year -- that I had bad hardware, and it was slowly eating away at
my data.
Figuring that I had a bad disk on my hands, I popped a few extra SATA
drives in, clobbered the disk and this time set myself up a three-disk
vdev using raidz. I copied my data back over, started banging on it again,
and after a few minutes, lo and behold, the checksum errors began to pour in:
elowe at oceana% zpool status
  pool: junk
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool online' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        junk        ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c0d0    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0
            c3d1    ONLINE       0     0     0
A checksum error on a different disk! The drive wasn't at fault after all.
I emailed the internal ZFS interest list with my saga, and quickly got a
response. Another user, also running a Tyan 2885 dual-Opteron workstation
like mine, had experienced data corruption with SATA disks. The root cause?
A faulty power supply.
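(The raidz setup he describes is also a one-liner; the exact invocation isn't
in the post, but with the device names from the status output it would look
roughly like this:

    zpool create junk raidz c0d0 c3d0 c3d1   # single-parity raidz across three disks

Data is striped with parity across the three drives, which is why a block that
comes back corrupted from c3d0 can be rebuilt on the fly from the other two.)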
Since my data is still intact, and the performance isn't hampered at all,
I haven't bothered to fix the problem yet. I've been running for over a week
now with a faulty setup which is still corrupting data on its way to the
disk, and have yet to see a problem with my data, since ZFS handily detects
and corrects these errors on the fly.
Eventually I suppose I'll get around to replacing that faulty power supply..."
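The "action:" line in the status output above spells out the recovery path,
by the way. Once the power supply is actually swapped, cleanup is roughly this
(assuming the same pool and device names; c4d0 is just a hypothetical spare):

    zpool scrub junk                 # re-read every block and repair anything with a bad checksum
    zpool status junk                # watch the scrub and the per-device error counters
    zpool online junk c3d0           # clear the logged errors once the real culprit is fixed
    zpool replace junk c3d0 c4d0     # or swap the disk out entirely if a drive really were bad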
--
Bill Bradford
Houston, Texas