[TriLUG] Linux Software RAID experience

Ron Kelley via TriLUG trilug at trilug.org
Wed Feb 14 23:13:37 EST 2018


Thanks for the feedback, Jim.

The server in question is a CentOS 6 NFS server for a virtualization 
cluster.  We have a few other/similar servers running RAID-5 with LSI 
RAID cards - each server capable of hitting well over 1GB/sec in 
sequential reads.  I was simply using large block reads to see how this 
server compared to the others.  And I absolutely appreciate and 
understand the difference between IOPS and MB/sec.  I am not building a 
server for max MB/sec - my testing was just to gauge relative performance.
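
(For anyone who wants to reproduce the comparison, a simple large-block 
sequential read test is something like this - the device name is just a 
placeholder:)

    # large-block sequential read, bypassing the page cache
    dd if=/dev/sdb of=/dev/null bs=1M count=32768 iflag=direct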

In the case of this particular server, I think the onboard LSI SAS 2208 
chip is somehow less performant than an LSI PCI card.  I have a similar 
server with 16 spinning drives (LSI SAS 2208 - RAID-6), and that server 
can easily hit 1GB/sec sequential read speeds (as expected, IOPS are not 
as good as with SSDs).  So, I know the 2208 can give the expected performance 
numbers, just not on this motherboard with these SSDs.  Go figure...

As for ZFS, I tried it a couple of years back running v0.6.5 and had all 
sorts of performance issues.  I read a ton of documentation on how to 
tune it, but the performance was no better than mdadm.  Even with a ton 
of RAM, the system would slow down for no apparent reason.  Maybe it is 
time to revisit ZFS.
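
(If I do revisit ZFS, I assume the usual first tuning knob is still 
capping the ARC via a module option - something like the line below; the 
size is just an example:)

    # /etc/modprobe.d/zfs.conf - cap the ARC at 64 GiB (example value)
    options zfs zfs_arc_max=68719476736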


I would still like to hear about anyone's experience with mdadm 
arrays - especially with newer tools like NAS4Free, etc.  I 
have an LSI "JBOD" card I can put in the server, but I need to see how 
easy it is to remove drives, rebuild arrays, etc.  Honestly, LSI RAID 
cards have made my life easy over the past few years since they are just 
plug-n-play.  Fixing a broken array is as simple as removing the failed 
drive and adding a new one - no need to log in to the server for 
maintenance.
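
(For comparison, my understanding is that a drive swap under mdadm comes 
down to roughly the following - device names are placeholders:)

    # mark the failed disk and pull it out of the array
    mdadm --manage /dev/md0 --fail /dev/sdX --remove /dev/sdX
    # after physically swapping the drive, add the replacement
    mdadm --manage /dev/md0 --add /dev/sdY
    # watch the rebuild progress
    cat /proc/mdstat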


Thanks,

-Ron





On 02/14/2018 05:48 PM, Jim Salter via TriLUG wrote:
> Hi Ron -
> 
> Need to know a bit more about your workload. There are a few potential 
> gotchas going on here.
> 
> First, things get weird when you start looking for more than 500MB/sec 
> committed to the bare metal. You're starting to hit territory where you 
> can get bottlenecks from PCIe lanes and all sorts of silliness. It's 
> absolutely /possible/ to hit higher speeds, but it takes some commitment 
> to Sparkle Motion and willingness to poke about and try things until you 
> get where you need to be.
> 
> Second thing, RAID5 is not the way to go if you're looking for maximal 
> performance. So far you're talking about throughput and not IOPS - more 
> on that in a second - but even when you're looking for maximal 
> throughput, you can screw it up /fast/ with a striped array, especially 
> a conventional striped array being managed by a dumb hardware 
> controller.  Keep in mind that your IOPS - I/O Operations Per Second - 
> for a striped array approximates slightly /worse/ than the IOPS of a 
> single drive in that stripe. This is pretty counter-intuitive, but what 
> you have to realize is that each stripe operation lights up all disks in 
> the stripe - eight, in your case - and until you get all eight 
> individual operations completed, your stripe is unavailable for anything 
> else. For each stripe op, one of your disks will come in a bit slower 
> than the others - it may not even be the same disk every time! - hence 
> why you're going to trend towards slightly /worse/ than the IOPS of a 
> single disk.
> 
> OK, but again, you aren't asking for IOPS - maybe you don't know you 
> should be, or maybe you don't care. So why do I keep talking about them? 
> Because unless you've got an /incredibly/ specialized and tightly 
> controlled workload, typically WORM (write once, read many, never 
> deleted) you're never going to consistently hit max sequential 
> throughput. The second you have a fragmented operation requiring a seek 
> on any of those eight disks, you've left maximal sequential throughput 
> and you're starting to bottleneck potentially on IOPS. It gets even 
> worse if you've got more like a standard server workload, that's going 
> to be trying to serve multiple simultaneous tasks in parallel - writing 
> for user one, reading for user two, deleting stuff for user three, having to 
> save the files for user one in holes left by prior deletions, having to 
> read files that were written fragmented in the first place... yeah, 
> you're gonna hit IOPS bottlenecks pretty fast.
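
(Fair point on IOPS - for anyone who wants to measure that side of 
things, a random 4K fio run along these lines should show it; the device 
name and parameters are just examples:)

    # random 4K reads against the raw device - reports IOPS, not MB/sec
    fio --name=randread --ioengine=libaio --direct=1 --rw=randread \
        --bs=4k --iodepth=32 --numjobs=4 --runtime=60 --time_based \
        --filename=/dev/sdb --group_reporting
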
> 
> So, in general terms, if you want performance you want IOPS. And if you 
> want IOPS, the /last/ thing you want is a stripe. For read performance, 
> you want RAID1, with as many individual mirrors as possible. For write 
> performance, you want RAID10. And it gets even hinkier here; your LSI 
> controllers are /probably/ bright enough to distribute individual block 
> writes to individual mirrors on a RAID10 array rather than tackling it 
> classically - a /stripe/ of mirrors, and you remember what we said about 
> stripes, right...? Same caveat applies for RAID0; brighter controllers 
> (or mdadm) will do individual block writes to individual devices rather 
> than classical stripe operations, but your mileage may vary.
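
(In mdadm terms, I take that to mean a single RAID10 across all eight 
SSDs rather than RAID-5 - roughly the following, with placeholder device 
names:)

    # 8-disk RAID10 under mdadm (device names are examples)
    mdadm --create /dev/md0 --level=10 --raid-devices=8 /dev/sd[b-i]
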
> 
> I could go on and on and on here - batteries in the controllers, 
> hardware vs mdadm, my personal preference which is neither and for many 
> many damn good reasons (ZFS). But keep in mind even /after/ all these 
> things, 500 MB/sec can start to be a pretty serious bottleneck that's 
> hard to break through. If I actually have to hit bare metal (i.e., not 
> serving from or writing to cache), I generally see things bottlenecking around 
> 800-ish MB/sec on high-end Xeon servers with 8-ish SSDs in a pool of 
> mirrors (ZFS's quasi-equivalent to traditional RAID10). But for damn 
> near any operation I ever deal with, you really shouldn't be talking 
> highest possible throughput on simplest operation... you should be 
> talking /IOPS/, which comes down to "how often can I actually /get/ 
> really close to that big maximum throughput number?"
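
(For reference, the "pool of mirrors" layout described above is just 
striped mirror vdevs - something like this, with made-up device names:)

    # four mirrored pairs, striped together into one pool
    zpool create tank \
        mirror sdb sdc \
        mirror sdd sde \
        mirror sdf sdg \
        mirror sdh sdi
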
> 
> If you want some direct consultation, I'm available. I work with this 
> stuff professionally on an everyday basis. It wouldn't be a 
> "moonlighting" gig; consulting is my 9-5 and I own the business. =)
> 
> 
> On 02/14/2018 05:28 PM, Shay Walters wrote:
>> Hi Jim,
>>      I'm guessing you don't get this mailing list?  Or maybe you've been
>> busy and haven't had a chance to respond.  I started to answer describing
>> what you had said at a ColaLUG meeting a long time ago, but then I started
>> thinking that your comment was likely referring to spinning disks, and
>> didn't know if mdadm with SSD drives was still the high performer.  It
>> sounds like his hardware RAID is leaving a lot to be desired, so I'm
>> inclined to think that mdadm might out-perform it, but I'd just be
>> guessing and I figured you'd be more likely to know for sure.
>>
>> -Shay
>>
>>
>>
>> -----Original Message-----
>> From: Ron Kelley via TriLUG <trilug at trilug.org>
>> To: Triangle Linux Users Group General Discussion <trilug at trilug.org>
>> Date: Wed, 14 Feb 2018 11:30:11 -0500
>> Subject: [TriLUG] Linux Software RAID experience
>>
>> Greetings all,
>>
>> Looking for some feedback with Linux software RAID.  I have a couple of
>> servers in a lights-out facility running LSI RAID controllers (2208
>> chips) that have been working great for a long time.  Recently, we
>> decided to
>> switch out spinning disks for SSDs (8x Samsung 1TB in RAID-5).  After
>> getting the RAID volume rebuilt, I noticed slow write speeds (about
>> 500MB/sec).  I was expecting at least 1,500MB/sec writes given we have
>> 8x SSDs.  No amount of tuning/tweaking the RAID controller seems to
>> make a difference.
>>
>> At this point, I can either purchase a better/faster RAID controller
>> ($500-$700) or switch over to Linux software RAID.  My past experience
>> with mdadm left me a little disappointed.  The RAID rebuild times were
>> slow and drive swaps were clunky (even after tuning the speed_limit_min
>> and speed_limit_max settings).  At one point, the build/rebuild speeds
>> seemed to cap out around 250MB/sec - even with SSDs.
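
(For reference, those rebuild-speed knobs are the md sysctls below; the 
values, in KB/s, are just examples:)

    sysctl -w dev.raid.speed_limit_min=100000
    sysctl -w dev.raid.speed_limit_max=1000000
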
>>
>> I know mdadm has been around for a very long time, but I am looking for
>> some recent experience using mdadm on systems with SSDs.  What tool do you use
>> (Rockstor, FreeNAS, etc) to manage the RAID array and what kind of
>> performance do you get?  How "turn-key" is it to recover data or replace
>> drives?
>>
>> At the end of the day, I need a turn-key RAID solution - easy enough that
>> someone can walk into the facility and replace a faulty drive without
>> having to dive into a CLI or use a management tool.
>>
>>
>> Thanks,
>>
>> -Ron
> 

