<p dir="ltr">Also, ignoring the resilver speed issue, if you have one big raidz2 vdev, you&#39;re basically guaranteed to lose data eventually.  Lose a controller or just 3 disks (for example, most of my disk failures happened while recovering from another failure) and you are probably toast.</p>


<p dir="ltr">On the 4500s we had been putting the OS on a CF card and did 4x11+4, but OS on RAID + 4x11+2 also works.</p>

<div class="gmail_quote">On Jul 7, 2012 9:13 AM, &quot;Edward Ned Harvey&quot; &lt;<a href="mailto:bblisa4@nedharvey.com">bblisa4@nedharvey.com</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

&gt; From: Peter Baer Galvin [mailto:<a href="mailto:pbg@cptech.com">pbg@cptech.com</a>]<br>

&gt; Sent: Friday, July 06, 2012 10:50 AM<br>

&gt;<br>

&gt; Hmm, resilvering performance has greatly increased over time Ned. With<br>

&gt; which<br>

&gt; version of ZFS did you have the never-completing problem?<br>

<br>

I haven&#39;t had the problem myself, because I know enough to avoid it.  I<br>

participate a lot in the zfs-discuss mailing list (which was formerly<br>

extremely active including zfs developers, but now it&#39;s mostly just other IT<br>

people offering advice to each other, since the Oracle takeover).<br>

<br>

The root cause of the problem is like this:<br>

<br>

In a zfs resilver, they decided to be clever.  By comparison to a hardware<br>

raid resilver which must resilver the entire disk, including unused blocks,<br>

a ZFS resilver only resilvers the used blocks.  Theoretically this should<br>

make resilvering very fast, right?  Unfortunately, no.  Because the hardware<br>

resilver sequentially does each block of the whole disk, it&#39;s easy to<br>

calculate the whole-disk resilver time as the total disk capacity divided by<br>

the sustained sequential speed of the drive.  Something on the order of 2<br>

hours depending on your drive.  But in zfs, they don&#39;t have any way to<br>

*sort* the used blocks into disk sequential order.  The resilver ordering is<br>

approximated by temporal order.  And, assuming you have a mostly full pool<br>

(&gt;50%), that&#39;s been in production for a while, reading &amp; writing, creating &amp;<br>

destroying snapshots, it means temporal order is approximated by random<br>

order.  So zfs resilvering is approximated by random IO for all your used<br>

blocks.  This is very much dependent on your individual specific usage<br>

patterns.<br>
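<br>
As a rough sketch of the two estimates above (not a measurement): the drive size, sequential throughput, random IOPS, fill level and average block size below are all assumed figures for illustration, and the function names are hypothetical.<br>
<pre>
```python
# Rough resilver-time model. Every number used here is an illustrative
# assumption, not a figure from a real pool.


def sequential_resilver_hours(capacity_bytes, seq_bytes_per_sec):
    """Whole-disk rebuild: total capacity divided by sustained sequential speed."""
    return capacity_bytes / seq_bytes_per_sec / 3600


def random_resilver_hours(used_bytes, avg_block_bytes, random_iops):
    """Used-blocks-only rebuild, assuming each block costs one random IO."""
    blocks = used_bytes / avg_block_bytes
    return blocks / random_iops / 3600


if __name__ == "__main__":
    TB = 10**12
    # Assumed: 1 TB drive, 140 MB/s sustained sequential throughput.
    seq = sequential_resilver_hours(1 * TB, 140 * 10**6)
    # Assumed: 60% full, 128 KiB average block, 100 random IOPS.
    rnd = random_resilver_hours(0.6 * TB, 128 * 1024, 100)
    print(f"sequential whole-disk: {seq:.1f} h")
    print(f"random used-blocks:    {rnd:.1f} h")
```
</pre>
With these assumed figures the sequential estimate lands near the 2 hour mark, while the random-IO estimate is several times longer even though less data is copied.<br>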

<br>

Resilvering is a per-vdev operation.  If we assume the size of the pool &amp;<br>

the size of the data are given by design &amp; budget constraints, and you are<br>

faced with the decision to organize your pool into a big raidz versus divide<br>

your pool up into a bunch of mirrors, it means you have less data in each<br>

mirror to resilver.  Naturally, for equal usable capacity, mirrors cost<br>

more.  For the sake of illustrating my point I&#39;ve assumed you&#39;re able to<br>

consider a big raidz versus an equivalently sized (higher cost) bunch of<br>

mirrors.  The principle holds even if you scale up or scale down...  If you<br>

have a set amount of data, divided by a configurable number of vdevs, you<br>

will have less to resilver per vdev if you choose to have more vdevs.<br>

<br>

Also, random IO for a raidzN (or raid5 or raid6 or raid-DP) is approximated<br>

by the worst case access time for any individual disk (approx 2x slower than<br>

the average access time for a single disk).  Meanwhile, random IO for a<br>

mirror is approximated by the average access time for an individual disk.<br>

<br>

So if you break up your pool into a bunch of mirrors rather than a large<br>

raidzN, you have both a faster ability to perform random IO (factor of 2x),<br>

and less random IO that needs to be done (factor of Mx, where M is how many<br>

times smaller the mirror is compared to the raidz.  If you obey the rule of<br>

thumb &quot;limit raidz to 8-10 disks per vdev,&quot; then Mx is something like factor<br>

of 8x).  End result is a factor of ~16x faster using mirrors instead of raidz.<br>

<br>

So in rough numbers, a 46-disk raidz2 (capacity of 44 disks) will be<br>

approximately 88 times (2 x 44) slower to resilver than a bunch of mirrors.<br>
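<br>
The arithmetic in the last few paragraphs can be written down as a small model.  This is only a sketch of the rule of thumb as stated here: the function and parameter names are made up for illustration, and the 2x access-time penalty is the approximation given above, not a measured value.<br>
<pre>
```python
def mirror_speedup(raidz_data_disks, mirror_data_disks=1, access_time_penalty=2):
    """Estimated factor by which a mirror resilvers faster than a raidz vdev.

    raidz_data_disks    -- data-disk capacity of the raidz vdev
    mirror_data_disks   -- data-disk capacity of one mirror (1 for a plain mirror)
    access_time_penalty -- assumed random-IO slowdown of raidz versus a mirror
    """
    # M: how many times less data sits in one mirror than in the raidz vdev.
    m = raidz_data_disks / mirror_data_disks
    return access_time_penalty * m


if __name__ == "__main__":
    for disks in (8, 44):
        factor = mirror_speedup(disks)
        print(f"raidz with {disks} data disks: ~{factor:.0f}x slower than mirrors")
```
</pre>
Passing 8 reproduces the ~16x figure for a raidz vdev that follows the 8-10 disk rule of thumb, and passing 44 reproduces the 88x figure for the single 46-disk raidz2.<br>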

<br>

In systems that I support, I only deploy mirrors.  When I have a resilver, I<br>

expect it to take 12 hours.  By comparison, if this were a hardware raid, it<br>

would resilver in 2 hours...  And if it were one big raidz, it would<br>

resilver in approx 6 weeks.<br>
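<br>
The 6 week figure follows from scaling the 12 hour mirror resilver by the factor of 88.  A minimal sketch, with a hypothetical helper name:<br>
<pre>
```python
def raidz_resilver_weeks(mirror_hours, slowdown):
    """Scale a mirror resilver time by a slowdown factor and convert to weeks."""
    hours_per_week = 24 * 7
    return mirror_hours * slowdown / hours_per_week


if __name__ == "__main__":
    weeks = raidz_resilver_weeks(12, 88)
    print(f"estimated raidz resilver: {weeks:.1f} weeks")
```
</pre>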

<br>


</blockquote></div>