[BBLISA] SunFire 4500: Linux + ZFS/FUSE ?

Sat Jul 7 10:03:51 EDT 2012

Also, ignoring the resilver speed issue, if you have one big raidz2 device,
you're basically guaranteed to lose data eventually.  Lose a controller, or
just 3 disks (most of my disk failures happened while recovering from
another failure, for example), and you are probably toast.

On the 4500s we had been putting the OS on a CF card and did 4x11+4, but OS
on RAID + 4x11+2 also works.

On Jul 7, 2012 9:13 AM, "Edward Ned Harvey" <bblisa4 at nedharvey.com> wrote:

> > From: Peter Baer Galvin [mailto:pbg at cptech.com]
> > Sent: Friday, July 06, 2012 10:50 AM
> >
> > Hmm, resilvering performance has greatly increased over time, Ned. With
> > which version of ZFS did you have the never-completing problem?
>
> I haven't had the problem myself, because I know enough to avoid it.  I
> participate a lot in the zfs-discuss mailing list (which was formerly
> extremely active, including ZFS developers, but since the Oracle takeover
> it's mostly just other IT people offering advice to each other).
>
> The root cause of the problem is like this:
>
> In a zfs resilver, they decided to be clever.  By comparison to a hardware
> raid resilver, which must resilver the entire disk, including unused
> blocks, a ZFS resilver only resilvers the used blocks.  Theoretically this
> should make resilvering very fast, right?  Unfortunately, no.  Because the
> hardware resilver sequentially does each block of the whole disk, it's
> easy to calculate the whole-disk resilver time as the total disk capacity
> divided by the sustained sequential speed of the drive.  Something on the
> order of 2 hours depending on your drive.  But in zfs, they don't have any
> way to *sort* the used blocks into disk sequential order.  The resilver
> ordering is approximated by temporal order.  And, assuming you have a
> mostly full pool (>50%) that's been in production for a while, reading &
> writing, creating & destroying snapshots, it means temporal order is
> approximated by random order.  So zfs resilvering is approximated by
> random IO for all your used blocks.  This is very much dependent on your
> individual specific usage patterns.
>
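A rough way to put numbers on the two estimates above is the sketch below. It
is only an illustration: the drive size, sequential speed, fill level, block
size and random IOPS are all assumed figures, not values from this thread, and
the function names are made up for the example.

```python
def sequential_resilver_hours(capacity_bytes, seq_bytes_per_sec):
    """Whole-disk rebuild: total capacity / sustained sequential speed."""
    return capacity_bytes / seq_bytes_per_sec / 3600


def random_resilver_hours(used_bytes, avg_block_bytes, random_iops):
    """Used-blocks-only rebuild done as random IO: one seek per block."""
    blocks = used_bytes / avg_block_bytes
    return blocks / random_iops / 3600


# All of these are assumptions chosen for illustration.
capacity = 1e12          # 1 TB drive
seq_speed = 130e6        # 130 MB/s sustained sequential
used = 0.6 * capacity    # pool 60% full
block = 128 * 1024       # 128 KiB average block
iops = 100               # random IOPS of one spinning disk

print("sequential: %.1f h" % sequential_resilver_hours(capacity, seq_speed))
print("random:     %.1f h" % random_resilver_hours(used, block, iops))
```

Smaller average blocks or fewer IOPS push the random estimate up quickly,
which is why the result depends so much on usage patterns.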
> Resilvering is a per-vdev operation.  If we assume the size of the pool &
> the size of the data are given by design & budget constraints, and you are
> faced with the decision to organize your pool into a big raidz versus
> divide your pool up into a bunch of mirrors, then choosing mirrors means
> you have less data in each vdev to resilver.  Naturally, for equal usable
> capacity, mirrors cost more.  For the sake of illustrating my point I've
> assumed you're able to consider a big raidz versus an equivalently sized
> (higher cost) bunch of mirrors.  The principle holds even if you scale up
> or scale down...  If you have a set amount of data, divided by a
> configurable number of vdevs, you will have less to resilver if you choose
> to have more vdevs.
>
> Also, random IO for a raidzN (or raid5 or raid6 or raid-DP) is approximated
> by the worst case access time for any individual disk (approx 2x slower
> than the average access time for a single disk).  Meanwhile, random IO for
> a mirror is approximated by the average access time for an individual disk.
>
> So if you break up your pool into a bunch of mirrors rather than a large
> raidzN, you have both a faster ability to perform random IO (factor of 2x),
> and less random IO that needs to be done (factor of Mx, where M is how many
> times smaller the mirror is compared to the raidz.  If you obey the rule of
> thumb "limit raidz to 8-10 disks per vdev," then Mx is something like
> factor of 8x).  End result is factor of ~16x faster using mirrors instead
> of raidz.
>
> So in rough numbers, a 46-disk raidz2 (capacity of 44 disks) will be
> approximately 88 times slower to resilver than a bunch of mirrors.
>
> In systems that I support, I only deploy mirrors.  When I have a resilver,
> I expect it to take 12 hours.  By comparison, if this were a hardware raid,
> it would resilver in 2 hours...  And if it were one big raidz, it would
> resilver in approx 6 weeks.
>
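The arithmetic behind the 16x, 88x and 6-week figures can be checked with a
few lines. This is only a sketch of the reasoning in the quoted message: it
assumes each mirror holds one disk's worth of data, so M equals the number of
data disks in the raidz, and the function names are invented for the example.

```python
def mirror_speedup(raidz_data_disks, access_penalty=2.0):
    """Estimated resilver speedup of mirrors over one raidz vdev.

    access_penalty is the ~2x random-IO penalty quoted for raidzN;
    raidz_data_disks plays the role of M (assumes one data disk per mirror).
    """
    return access_penalty * raidz_data_disks


def raidz_resilver_weeks(mirror_hours, speedup):
    """Scale a mirror resilver time by the speedup factor, in weeks."""
    return mirror_hours * speedup / (24 * 7)


rule_of_thumb = mirror_speedup(8)    # 8-10 disk raidz, M of about 8
big_raidz2 = mirror_speedup(44)      # 46-disk raidz2, 44 data disks

print("rule-of-thumb raidz: %.0fx" % rule_of_thumb)
print("46-disk raidz2:      %.0fx" % big_raidz2)
print("12 h mirror resilver becomes %.1f weeks"
      % raidz_resilver_weeks(12, big_raidz2))
```

With a 12-hour mirror resilver as the baseline, the 88x factor works out to a
little over six weeks, which agrees with the estimate in the quoted message.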
> _______________________________________________
> bblisa mailing list
> bblisa at bblisa.org
> http://www.bblisa.org/mailman/listinfo/bblisa
>