<p dir="ltr">Also, setting the resilver speed issue aside, if you have one big raidz2 device, you're all but guaranteed to lose data eventually. Lose a controller, or just 3 disks (most of my disk failures, for example, happened while recovering from another failure), and you are probably toast.</p>
<p dir="ltr">On the 4500s we had been putting the OS on a CF card and running 4x11+4, but OS on RAID plus 4x11+2 also works.</p>
<div class="gmail_quote">On Jul 7, 2012 9:13 AM, "Edward Ned Harvey" <<a href="mailto:bblisa4@nedharvey.com">bblisa4@nedharvey.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
> From: Peter Baer Galvin [mailto:<a href="mailto:pbg@cptech.com">pbg@cptech.com</a>]<br>
> Sent: Friday, July 06, 2012 10:50 AM<br>
><br>
> Hmm, resilvering performance has greatly increased over time, Ned. With which<br>
> version of ZFS did you have the never-completing problem?<br>
<br>
I haven't had the problem myself, because I know enough to avoid it. I<br>
participate a lot in the zfs-discuss mailing list (which was formerly<br>
extremely active, including ZFS developers, but since the Oracle takeover<br>
it's mostly just other IT people offering advice to each other).<br>
<br>
The root cause of the problem is this:<br>
<br>
In a ZFS resilver, they decided to be clever. Unlike a hardware raid<br>
resilver, which must resilver the entire disk, including unused blocks, a<br>
ZFS resilver only resilvers the used blocks. Theoretically this should make<br>
resilvering very fast, right? Unfortunately, no. Because the hardware<br>
resilver sequentially does each block of the whole disk, it's easy to<br>
calculate the whole-disk resilver time as the total disk capacity divided by<br>
the sustained sequential speed of the drive: something on the order of 2<br>
hours, depending on your drive. But in ZFS, they don't have any way to<br>
*sort* the used blocks into disk sequential order. The resilver ordering is<br>
approximated by temporal order. And, assuming you have a mostly full pool<br>
(>50%), that's been in production for a while, reading & writing, creating &<br>
destroying snapshots, it means temporal order is approximated by random<br>
order. So zfs resilvering is approximated by random IO for all your used<br>
blocks. This is very much dependent on your individual specific usage<br>
patterns.<br>
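<br>
A rough sketch of that comparison, assuming made-up example figures (a 1 TB<br>
drive at 140 MB/s sustained, 500 GB of used data in 64 KiB blocks, 8 ms per<br>
random IO). None of these numbers come from a real pool; plug in your own:<br>
<br>
```python
# Back-of-the-envelope resilver estimates.
# Every figure below is a hypothetical example, not a measurement.

def sequential_rebuild_hours(capacity_bytes, sequential_bytes_per_sec):
    """Whole-disk rebuild: total capacity divided by sustained sequential speed."""
    return capacity_bytes / sequential_bytes_per_sec / 3600


def random_rebuild_hours(used_bytes, avg_block_bytes, seconds_per_random_io):
    """Used-blocks-only rebuild where each block costs one random IO."""
    blocks = used_bytes / avg_block_bytes
    return blocks * seconds_per_random_io / 3600


if __name__ == "__main__":
    # Assumed drive: 1 TB, 140 MB/s sustained sequential throughput.
    hw = sequential_rebuild_hours(1e12, 140e6)
    # Assumed pool usage: 500 GB used, 64 KiB average block, 8 ms per random IO.
    rnd = random_rebuild_hours(500e9, 64 * 1024, 0.008)
    print(f"sequential whole-disk rebuild: {hw:.1f} hours")
    print(f"random-order used-block rebuild: {rnd:.1f} hours")
```
<br>
Even though the random-order pass touches only half the disk in this<br>
example, the per-block access time dominates.<br>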
<br>
Resilvering is a per-vdev operation. Assume the size of the pool and the<br>
size of the data are fixed by design and budget constraints, and you have to<br>
decide whether to organize your pool into one big raidz or divide it up into<br>
a bunch of mirrors. With mirrors, you have less data in each vdev to<br>
resilver. Naturally, for equal usable capacity, mirrors cost more. For the<br>
sake of illustrating my point I've assumed you're able to consider a big<br>
raidz versus an equivalently sized (higher cost) bunch of mirrors. The<br>
principle holds even if you scale up or scale down: if you have a set amount<br>
of data, divided among a configurable number of vdevs, you will have less to<br>
resilver per vdev if you choose to have more vdevs.<br>
<br>
Also, random IO for a raidzN (or raid5 or raid6 or raid-DP) is approximated<br>
by the worst case access time for any individual disk (approx 2x slower than<br>
the average access time for a single disk). Meanwhile, random IO for a<br>
mirror is approximated by the average access time for an individual disk.<br>
<br>
So if you break up your pool into a bunch of mirrors rather than a large<br>
raidzN, you have both a faster ability to perform random IO (factor of 2x)<br>
and less random IO that needs to be done (factor of Mx, where M is how many<br>
times smaller the mirror is compared to the raidz). If you obey the rule of<br>
thumb "limit raidz to 8-10 disks per vdev," then M is something like 8.<br>
End result is a factor of ~16x faster using mirrors instead of raidz.<br>
<br>
So in rough numbers, a 46-disk raidz2 (capacity of 44 disks) will be<br>
approximately 88 times (2 x 44) slower to resilver than a bunch of mirrors.<br>
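<br>
The rule of thumb above can be written as a one-line model. This is only a<br>
sketch of the estimate as described here; the function and parameter names<br>
are made up for illustration:<br>
<br>
```python
# Rough model from the discussion above: mirrors win by
# (random IO penalty of raidz) x (how many times smaller each mirror is).
# Names are hypothetical; the 2x penalty is the approximation used in the text.

def mirror_speedup(size_ratio, random_io_penalty=2):
    """Estimated factor by which mirrors resilver faster than one raidz vdev."""
    return random_io_penalty * size_ratio


if __name__ == "__main__":
    # raidz limited to about 8 data disks per vdev
    print("8-disk raidz:", mirror_speedup(8))
    # 46-disk raidz2 has the capacity of 44 disks
    print("46-disk raidz2:", mirror_speedup(46 - 2))
```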
<br>
In systems that I support, I only deploy mirrors. When I have a resilver, I<br>
expect it to take 12 hours. By comparison, if this were a hardware raid, it<br>
would resilver in 2 hours... And if it were one big raidz, it would<br>
resilver in approx 6 weeks.<br>
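<br>
For anyone checking the arithmetic, here is the conversion, taking the 12<br>
hour mirror resilver and the 88x slowdown quoted above as given (the variable<br>
names are just for illustration):<br>
<br>
```python
# Convert the estimated raidz resilver time into days and weeks.
# Inputs are the rough figures from the text, not measurements.

MIRROR_RESILVER_HOURS = 12   # expected resilver time for a mirror
RAIDZ_SLOWDOWN = 88          # estimated slowdown for a 46-disk raidz2

raidz_hours = MIRROR_RESILVER_HOURS * RAIDZ_SLOWDOWN
raidz_days = raidz_hours / 24
raidz_weeks = raidz_days / 7

if __name__ == "__main__":
    print(f"{raidz_hours} hours = {raidz_days:.0f} days = {raidz_weeks:.1f} weeks")
```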
<br>
</blockquote></div>