[TriLUG] Linux Software RAID experience

Jim Salter via TriLUG trilug at trilug.org
Wed Feb 14 17:48:14 EST 2018


Hi Ron -

Need to know a bit more about your workload. There are a few potential 
gotchas going on here.

First, things get weird when you start looking for more than 500MB/sec 
committed to the bare metal. You're starting to hit territory where you 
can get bottlenecks from PCIe lanes and all sorts of silliness. It's 
absolutely /possible/ to hit higher speeds, but it takes some commitment 
to Sparkle Motion and willingness to poke about and try things until you 
get where you need to be.

Second thing, RAID5 is not the way to go if you're looking for maximal 
performance. So far you're talking about throughput and not IOPS - more 
on that in a second - but even when you're looking for maximal 
throughput, you can screw it up /fast/ with a striped array, especially 
a conventional striped array being managed by a dumb hardware 
controller.  Keep in mind that the IOPS - I/O Operations Per Second - 
of a striped array work out to slightly /worse/ than the IOPS of a 
single drive in that stripe. This is pretty counter-intuitive, but what 
you have to realize is that each stripe operation lights up all disks 
in the stripe - eight, in your case - and until all eight individual 
operations complete, the stripe is unavailable for anything else. On 
each stripe op, one of your disks will come in a bit slower than the 
others - it may not even be the same disk every time! - which is why 
you trend toward slightly /worse/ than the IOPS of a single disk.
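
If you'd rather see this than take my word for it, fio will put numbers 
on it. A quick sketch - device names are hypothetical, and this assumes 
an mdadm setup where the member disks are still visible to the OS (a 
hardware LSI volume hides them); --readonly keeps the raw-device test 
non-destructive:

    # 4k random reads against a single member disk...
    fio --name=single-disk --filename=/dev/sdb --readonly \
        --rw=randread --bs=4k --direct=1 --ioengine=libaio \
        --iodepth=32 --runtime=60 --time_based --group_reporting

    # ...then the same test against the whole array
    fio --name=whole-array --filename=/dev/md0 --readonly \
        --rw=randread --bs=4k --direct=1 --ioengine=libaio \
        --iodepth=32 --runtime=60 --time_based --group_reporting

Compare the IOPS lines in the two outputs and see how much (or how 
little) the array buys you over one disk.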

OK, but again, you aren't asking for IOPS - maybe you don't know you 
should be, or maybe you don't care. So why do I keep talking about them? 
Because unless you've got an /incredibly/ specialized and tightly 
controlled workload - typically WORM (write once, read many, never 
deleted) - you're never going to consistently hit max sequential 
throughput. The second you have a fragmented operation requiring a seek 
on any of those eight disks, you've left maximal sequential throughput 
behind and you're potentially bottlenecking on IOPS instead. It gets 
even worse with a standard server workload that's trying to serve 
multiple simultaneous tasks in parallel - writing for user one, reading 
for user two, deleting stuff for user three, saving user one's files in 
the holes left by prior deletions, reading files that were written 
fragmented in the first place... yeah, you're gonna hit IOPS 
bottlenecks pretty fast.
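
If you want a rough stand-in for that kind of mixed workload, something 
like the following fio run - the path and job count are just values I 
picked for illustration - hammers the array with parallel mixed 
readers/writers and reports aggregate IOPS instead of a pretty 
sequential number:

    # four parallel jobs, 70/30 mixed random read/write, 4k blocks
    fio --name=mixed-load --directory=/mnt/array --size=4G \
        --numjobs=4 --rw=randrw --rwmixread=70 --bs=4k \
        --direct=1 --ioengine=libaio --iodepth=16 \
        --runtime=120 --time_based --group_reporting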

So, in general terms, if you want performance, you want IOPS. And if 
you want IOPS, the /last/ thing you want is a stripe. For read 
performance, you want RAID1, with as many individual mirrors as 
possible. For write performance, you want RAID10. And it gets even 
hinkier here; your LSI controllers are /probably/ bright enough to 
distribute individual block writes to individual mirrors on a RAID10 
array rather than tackling it classically as a /stripe/ of mirrors - 
and you remember what we said about stripes, right...? The same caveat 
applies to RAID0: brighter controllers (or mdadm) will do individual 
block writes to individual devices rather than classical stripe 
operations, but your mileage may vary.
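
For reference, building that kind of RAID10 under mdadm is about two 
commands. Device names here are hypothetical eight-disk placeholders, 
and --layout is optional (n2, "near", is the default):

    # eight-disk RAID10 across /dev/sdb through /dev/sdi
    mdadm --create /dev/md0 --level=10 --raid-devices=8 \
        --layout=n2 /dev/sd[b-i]

    # watch build/rebuild progress
    cat /proc/mdstat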

I could go on and on and on here - batteries in the controllers, 
hardware vs mdadm, my personal preference (which is neither, for many, 
many damn good reasons: ZFS). But keep in mind that even /after/ all 
these things, 500 MB/sec can start to be a pretty serious bottleneck 
that's hard to break through. If I actually have to hit bare metal 
(ie, not serve from or write to cache), I generally see things 
bottlenecking around 800-ish MB/sec on high-end Xeon servers with 
8-ish SSDs in a pool of mirrors (ZFS's quasi-equivalent to traditional 
RAID10). But for damn near any operation I ever deal with, you really 
shouldn't be talking about the highest possible throughput on the 
simplest operation... you should be talking /IOPS/, which comes down 
to "how often can I actually /get/ really close to that big maximum 
throughput number?"
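
And if you do go the ZFS route, a pool of mirrors out of eight SSDs is 
basically a one-liner. A sketch with made-up device names - in practice 
you'd want the stable /dev/disk/by-id/ paths, and ashift=12 assumes 
4K-sector drives:

    # four two-way mirror vdevs - ZFS's rough analog of RAID10
    zpool create -o ashift=12 tank \
        mirror /dev/sdb /dev/sdc \
        mirror /dev/sdd /dev/sde \
        mirror /dev/sdf /dev/sdg \
        mirror /dev/sdh /dev/sdi

    # replacing a failed disk later is a single command
    zpool replace tank /dev/sdd /dev/sdj

That replace command is also about as close to turn-key drive swaps as 
you'll get from a command line.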

If you want some direct consultation, I'm available. I work with this 
stuff professionally on an every-day basis. You wouldn't be a 
"moonlighting" gig; consulting is my 9-5 and I own the business. =)


On 02/14/2018 05:28 PM, Shay Walters wrote:
> Hi Jim,
>      I'm guessing you don't get this mailing list?  Or maybe you've been
> busy and haven't had a chance to respond.  I started to answer describing
> what you had said at a ColaLUG meeting a long time ago, but then I started
> thinking that your comment was likely referring to spinning disks, and
> didn't know if mdadm with SSD drives was still the high performer.  It
> sounds like his hardware RAID is leaving a lot to be desired, so I'm
> inclined to think that mdadm might out-perform it, but I'd just be
> guessing and I figured you'd be more likely to know for sure.
>
> -Shay
>
>
>
> -----Original Message-----
> From: Ron Kelley via TriLUG <trilug at trilug.org>
> To: Triangle Linux Users Group General Discussion <trilug at trilug.org>
> Date: Wed, 14 Feb 2018 11:30:11 -0500
> Subject: [TriLUG] Linux Software RAID experience
>
> Greetings all,
>
> Looking for some feedback with Linux software RAID.  I have a couple of
> servers in a lights-out facility running LSI RAID controllers (2208
> chips)
> that have been working great for a long time.  Recently, we decided to
> switch out spinning disks for SSDs (8x Samsung 1TB in RAID-5).  After
> getting the RAID volume rebuilt, I noticed slow write speeds (about
> 500MB/sec).  I was expecting at least 1,500MB/sec writes given we have
> 8x SSDs.  No amount of tuning/tweaking the RAID controller seems to
> make a difference.
>
> At this point, I can either purchase a better/faster RAID controller
> ($500-$700) or switch over to Linux software RAID.  My past experience
> with mdadm left me a little disappointed.  The RAID rebuild times were
> slow and drive swaps were clunky (even after setting the
> speed_limit_min and speed_limit_max settings).  At one point, the
> build/rebuild speeds seemed to cap out around 250MB/sec - even with
> SSDs.
>
> I know mdadm has been around for a very long time, but I am looking for
> some recent experience using mdadm on systems with SSDs.  What tool do you use
> (Rockstor, FreeNAS, etc) to manage the RAID array and what kind of
> performance do you get?  How "turn-key" is it to recover data or replace
> drives?
>
> At the end of the day, I need a turn-key RAID solution - easy enough
> that someone can walk into the facility and replace a faulty drive
> w/out having to dive into a CLI or use a management tool.
>
>
> Thanks,
>
> -Ron


