[TriLUG] Linux Software RAID experience

David Burton via TriLUG trilug at trilug.org
Wed Feb 14 19:05:40 EST 2018


Ron & Jim, you are way out of my league, but I will relate an anecdote
which left me with a very positive impression about Linux software RAID.

A few years ago I put Linux on an old laptop, which happened to have dual
2.5" drive bays (unusual for a laptop). So, as an experiment, I put in two
drives (I think they were from two different manufacturers, probably one
Seagate and one WDC, but memory fades), and set them up as RAID1 (mirrored).
To my amazement,
Linux boot time with two mirrored drives was slightly *less* *than* *half* of
the boot time with just one drive.
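
(For anyone who wants to try the same thing, here's a minimal sketch of that
kind of two-drive mirror with mdadm; the device names /dev/sda2 and /dev/sdb2
are just placeholders for your own partitions:

    # create a two-drive RAID1 (mirrored) array from two partitions
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
    # put a filesystem on it and watch the initial sync
    mkfs.ext4 /dev/md0
    cat /proc/mdstat

I won't claim that's exactly what I typed back then, but it's the general
idea.)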

So I can't speak to your high-load, 8-drive, RAID5 setup. But Linux does a
mighty fine job with software RAID1 mirrored drives. Not only do you get a
huge reliability boost, but if your load is mostly reads you also get an
enormous performance boost.

I've had it in the back of my mind to do some more experiments: with 3
drives, with 4 drives, with SSD drives (instead of slow spinning rust),
with mixed SSD+HDD, etc., but I've never gotten a Round TUIT
<http://1.bp.blogspot.com/-evsbP6Ck-_4/TrgfEfD0WeI/AAAAAAAAErs/PZRrgTA1Xao/s1600/RT+2.jpg>.

One caution: beware of matched drives. If you put identical drives with
consecutive serial numbers into a RAID array where they get nearly
identical workloads, it should not come as a surprise if they fail on the
same day, and take down your "redundant" system with them. One nice thing
about Linux software RAID is that it doesn't require or even prefer matched
drives.
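
(If you want to check what you've got, smartctl will show the model and
serial numbers; the device names here are placeholders:

    # print model and serial number for each member drive
    smartctl -i /dev/sda | grep -i -e model -e serial
    smartctl -i /dev/sdb | grep -i -e model -e serial

If the serials come back nearly consecutive, stagger your purchases or mix
manufacturers so the drives don't age in lockstep.)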

Dave



On Wed, Feb 14, 2018 at 5:48 PM, Jim Salter via TriLUG <trilug at trilug.org>
wrote:

> Hi Ron -
>
> Need to know a bit more about your workload. There are a few potential
> gotchas going on here.
>
> First, things get weird when you start looking for more than 500MB/sec
> committed to the bare metal. You're starting to hit territory where you can
> get bottlenecks from PCIe lanes and all sorts of silliness. It's absolutely
> /possible/ to hit higher speeds, but it takes some commitment to Sparkle
> Motion and willingness to poke about and try things until you get where you
> need to be.
>
> Second thing, RAID5 is not the way to go if you're looking for maximal
> performance. So far you're talking about throughput and not IOPS - more on
> that in a second - but even when you're looking for maximal throughput, you
> can screw it up /fast/ with a striped array, especially a conventional
> striped array being managed by a dumb hardware controller.  Keep in mind
> that your IOPS - I/O Operations Per Second - for a striped array
> approximates slightly /worse/ than the IOPS of a single drive in that
> stripe. This is pretty counter-intuitive, but what you have to realize is
> that each stripe operation lights up all disks in the stripe - eight, in
> your case - and until you get all eight individual operations completed,
> your stripe is unavailable for anything else. For each stripe op, one of
> your disks will come in a bit slower than the others - it may not even be
> the same disk every time! - hence why you're going to trend towards
> slightly /worse/ than the IOPS of a single disk.
>
> OK, but again, you aren't asking for IOPS - maybe you don't know you
> should be, or maybe you don't care. So why do I keep talking about them?
> Because unless you've got an /incredibly/ specialized and tightly
> controlled workload, typically WORM (write once, read many, never deleted)
> you're never going to consistently hit max sequential throughput. The
> second you have a fragmented operation requiring a seek on any of those
> eight disks, you've left maximal sequential throughput and you're starting
> to bottleneck potentially on IOPS. It gets even worse if you've got more
> like a standard server workload, that's going to be trying to serve
> multiple simultaneous tasks in parallel - writing for user one, reading for
> user two, deleting stuff for user three, having to save the files for user one
> in holes left by prior deletions, having to read files that were written
> fragmented in the first place... yeah, you're gonna hit IOPS bottlenecks
> pretty fast.
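>
> If you want to see where you actually stand, fio makes it easy to measure
> both numbers. A rough sketch, using a scratch file on the mounted array -
> the path and sizes are just placeholders:
>
>     # big sequential reads - the flattering number
>     fio --name=seq --filename=/mnt/array/fio.test --size=4G --rw=read \
>         --bs=1M --ioengine=libaio --direct=1 --iodepth=32 \
>         --runtime=30 --time_based
>     # 4K random read/write mix - much closer to a real server workload
>     fio --name=mixed --filename=/mnt/array/fio.test --size=4G --rw=randrw \
>         --rwmixread=70 --bs=4k --ioengine=libaio --direct=1 --iodepth=32 \
>         --runtime=30 --time_based
>
> Compare the IOPS from the second run against the MB/sec from the first and
> you'll see which number your real workload is going to live and die by.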
>
> So, in general terms, if you want performance you want IOPS. And if you
> want IOPS, the /last/ thing you want is a stripe. For read performance, you
> want RAID1, with as many individual mirrors as possible. For write
> performance, you want RAID10. And it gets even hinkier here; your LSI
> controllers are /probably/ bright enough to distribute individual block
> writes to individual mirrors on a RAID10 array rather than tackling it
> classically - a /stripe/ of mirrors, and you remember what we said about
> stripes, right...? Same caveat applies for RAID0; brighter controllers (or
> mdadm) will do individual block writes to individual devices rather than
> classical stripe operations, but your mileage may vary.
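>
> (If you do go the mdadm route and want RAID10, it's one command. A rough
> sketch, with the device names standing in for your eight SSDs:
>
>     mdadm --create /dev/md0 --level=10 --raid-devices=8 /dev/sd[a-h]
>     cat /proc/mdstat    # watch the initial resync
>
> mdadm's RAID10 is its own implementation rather than a literal stripe
> layered on top of mirrors, which is part of why it behaves more sensibly
> than the classical layout.)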
>
> I could go on and on and on here - batteries in the controllers, hardware
> vs mdadm, my personal preference, which is neither, for many many damn
> good reasons (ZFS). But keep in mind that even /after/ all these things, 500
> MB/sec can start to be a pretty serious bottleneck that's hard to break
> through. If I actually have to hit bare metal (i.e., not serve from or write
> to cache) I generally see things bottlenecking around 800-ish MB/sec on
> high-end Xeon servers with 8-ish SSDs in a pool of mirrors (ZFS's
> quasi-equivalent to traditional RAID10). But for damn near any operation I
> ever deal with, you really shouldn't be talking highest possible throughput
> on simplest operation... you should be talking /IOPS/, which comes down to
> "how often can I actually /get/ really close to that big maximum throughput
> number?"
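>
> For the curious, a pool of mirrors in ZFS is about as simple as it gets. A
> rough sketch, with "tank" and the device names as placeholders:
>
>     # four two-way mirror vdevs - ZFS's rough equivalent of RAID10
>     zpool create tank \
>         mirror /dev/sda /dev/sdb \
>         mirror /dev/sdc /dev/sdd \
>         mirror /dev/sde /dev/sdf \
>         mirror /dev/sdg /dev/sdh
>     zpool status tank
>
> Reads get spread across all eight drives, writes across the four mirrors,
> and replacing a failed disk is a single zpool replace.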
>
> If you want some direct consultation, I'm available. I work with this
> stuff professionally on an everyday basis. It wouldn't be a
> "moonlighting" gig; consulting is my 9-5 and I own the business. =)
>
>>
>> ...
>>
>> -----Original Message-----
>> From: Ron Kelley via TriLUG <trilug at trilug.org>
>> To: Triangle Linux Users Group General Discussion <trilug at trilug.org>
>> Date: Wed, 14 Feb 2018 11:30:11 -0500
>> Subject: [TriLUG] Linux Software RAID experience
>>
>> Greetings all,
>>
>>
>> Looking for some feedback with Linux software RAID.  I have a couple of
>> servers in a lights-out facility running LSI RAID controllers (2208
>> chips)
>> that have been working great for a long time.  Recently, we decided to
>> switch out spinning disks for SSDs (8x Samsung 1TB in RAID-5).  After
>> getting the RAID volume rebuilt, I noticed slow write speeds (about
>> 500MB/sec).  I was expecting at least 1,500MB/sec writes given we have 8x
>> SSDs.  No amount of tuning/tweaking the RAID controller seems to make a
>> SSDs.  No amount of tuning/tweaking the RAID controller seems to make a
>> difference.
>>
>> At this point, I can either purchase better/faster RAID controller
>> ($500-$700) or switch over to Linux software RAID.  My past experience
>> with
>> mdadm left me a little disappointed.  The RAID rebuild times were slow
>> and
>> drive swaps were klunky (even after tuning the speed_limit_min and
>> speed_limit_max settings).  At one point, the build/rebuild speeds seemed
>> to cap out around 250MB/sec - even with SSDs.
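>>
>> (For reference, the knobs I mean are the md resync throttles, set roughly
>> like this - the values are just examples, in KB/sec per device:
>>
>>     sysctl -w dev.raid.speed_limit_min=200000
>>     sysctl -w dev.raid.speed_limit_max=2000000
>>
>> and a drive swap was the usual mdadm --fail / --remove / --add dance.)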
>>
>> I know mdadm has been around for a very long time, but I am looking for
>> some recent experience using mdadm on systems with SSDs.  What tool do you use
>> (Rockstor, FreeNAS, etc) to manage the RAID array and what kind of
>> performance do you get?  How "turn-key" is it to recover data or replace
>> drives?
>>
>> At the end of the day, I need a turn-key RAID solution - easy enough that
>> someone can walk into the facility and replace a faulty drive w/out
>> having
>> to dive into a CLI or use a management tool.
>>
>>
>> Thanks,
>>
>> -Ron
>>
>
>

