 |
|
| Computers Forum Index » Computer Architecture - Storage » Unimpressive performance of large MD raid... |
|
|
| Author |
Message |
| kkkk... |
Posted: Fri Apr 24, 2009 1:20 pm |
|
|
|
Guest
|
calypso at (no spam) fly.srk.fer.hr.invalid wrote:
Quote: Seems like I was partially right with 8 or 16 drives as an optimal number of
drives... Seems like that for RAID6 it's optimal to have 6, 10 or 18 drives
(4+2, 8+2, 16+2)... Here's a nice text from EMC guy (look at Stripe size of
a LUN):
http://clariionblogs.blogspot.com/

I still don't understand. Why should 4+2, 8+2, 16+2 be more optimal?
Please note that one raid chunk is NOT one block (512 bytes) long. In
fact, on my raid-6 it is 64KB long, so the stripes are 64*10=640KB long.
What's wrong with that? Why should that perform worse than a 512KB
stripe? Please note that ext3 has blocksize 4K, so there are 160 and 128
ext3 blocks in one stripe respectively in the two configurations. I
don't see why 128 blocks should be significantly better than 160 blocks..!? |
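The arithmetic in this post can be checked with a few lines of Python. This is only a sketch of the numbers quoted above (64KB chunk, 4K ext3 blocks, 8 versus 10 data disks); the function name `stripe_geometry` is made up for illustration.

```python
# Stripe geometry for the two layouts being compared in the post.
CHUNK_KB = 64        # md chunk size reported in the post
EXT3_BLOCK_KB = 4    # ext3 block size reported in the post


def stripe_geometry(data_disks, chunk_kb=CHUNK_KB, block_kb=EXT3_BLOCK_KB):
    """Return (data size of one stripe in KB, filesystem blocks per stripe)."""
    stripe_kb = data_disks * chunk_kb
    return stripe_kb, stripe_kb // block_kb


for data_disks in (8, 10):
    stripe_kb, blocks = stripe_geometry(data_disks)
    print(f"{data_disks} data disks: {stripe_kb} KB stripe, {blocks} ext3 blocks")
```

Both layouts hold a whole number of filesystem blocks per stripe, which is the point of the question.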
|
|
|
|
|
| kkkk... |
Posted: Fri Apr 24, 2009 1:23 pm |
|
|
|
Guest
|
David Brown wrote:
Quote:
I did spend a few minutes in Google trying to find detailed
information about md's RAID-6 implementation but got nowhere.
...
There is a lot more information about linux raid5 than raid6.

You mean on *Linux MD* raid5? That could be good. Where?
Raid-6 algorithms are practically equivalent to raid-5, except for the parity
computation, obviously. |
|
|
|
|
|
| kkkk... |
Posted: Fri Apr 24, 2009 2:09 pm |
|
|
|
Guest
|
calypso at (no spam) fly.srk.fer.hr.invalid wrote:
Quote: What performance would you expect from 3ware raid-6 12-disks with ext3
(defaults mount) sequential dd write?
I haven't tested RAID6 on 9650SE, but have tested RAID5 on 9650SE (older
generation on PCI-X), and IIRC got around 250MB/s write from 15x160GB
Hitachi 7200rpm SATA drives... So, with this 9650SE I expect at least around
350MB/s from 16 drives (today's SATA)...

What filesystem and operating system? This is important...
I assume you mean "first write of a sequential file"..?
(overwrites, as you see, are much faster)

Quote: Consider that bandwidth is not what
you'll be worried about; it's more the RAID6 write penalty, which cache memory
cancels out (it's 6 IOPS per write)...

"6 IOPS per write"? Could you explain this?
Thank you |
|
|
|
|
|
| kkkk... |
Posted: Fri Apr 24, 2009 3:00 pm |
|
|
|
Guest
|
calypso at (no spam) fly.srk.fer.hr.invalid wrote:
Quote: In comp.arch.storage, kkkk <kkkk at (no spam) bbbb.com> wrote:
I haven't tested RAID6 on 9650SE, but have tested RAID5 on 9650SE (older
generation on PCI-X), and IIRC got around 250MB/s write from 15x160GB
Hitachi 7200rpm SATA drives... So, with this 9650SE I expect at least around
350MB/s from 16 drives (today's SATA)...
What filesystem and operating system? This is important...
Windows XP, NTFS...

I suspected that. I suspect NTFS is much faster than ext3; it will
probably be like XFS in Linux (and also less safe, e.g. in case of
power loss, just like XFS). Speed depends, among other things, on how
paranoid the journal behaviour is. |
|
|
|
|
|
| Maxim S. Shatskih... |
Posted: Fri Apr 24, 2009 3:41 pm |
|
|
|
Guest
|
Quote: NTFS unsafe in case of power loss?
User data is not protected by the journaling.
Quote: You missed something, we're not talking about FAT here (which is faster than NTFS)...
Depends on scenario. With >2000 files per directory, things do change - FAT uses linear directories, and NTFS uses B-trees similar to database indices.
--
Maxim S. Shatskih
Windows DDK MVP
maxim at (no spam) storagecraft.com
http://www.storagecraft.com |
|
|
|
|
|
| Bill Todd... |
Posted: Fri Apr 24, 2009 3:50 pm |
|
|
|
Guest
|
calypso at (no spam) fly.srk.fer.hr.invalid wrote:
Quote: In comp.arch.storage, Bill Todd <billtodd at (no spam) metrocast.net> wrote:
Calypso seems especially ignorant when talking about optimal RAID group
sizes. Perhaps he's confusing RAID-5/6 with RAID-3 - but even then he'd
be wrong, since what you really want with RAID-3 is for the total *data*
content (excluding parity) of a stripe to be a convenient value, meaning
that you tend to favor group sizes like 5 or 9 (not counting any spares
that may be present). And given that you've got both processing power
and probably system/memory bus bandwidth to burn, there's no reason why
a software RAID-6 implementation shouldn't perform fairly competitively
with a hardware one.
RAID3 implementation doesn't exist on 3Ware controllers...
I wasn't suggesting that it did, only that you might be being confused
by it.
....
Quote: RAID3 is very similar to RAID5
No, it is not. With RAID-3 every drive in the array (except the parity
drive) participates in every read operation, and every drive in the
array participates in every write operation. Thus the maximum IOPS of a
RAID-3 array is the maximum IOPS of a single disk. With RAID-5, only
the drives across which the requested data is spread participate in a
read operation, and only the drives across which the data is spread
(plus the parity drive) participate in a write operation (with the
exception of optimizations when the write spans more than half the disks
in the group) - thus the maximum read IOPS of a RAID-5 array is the
combined IOPS of all the disks in the array (when each read can be
satisfied by a single disk) and the maximum write IOPS is roughly the
combined IOPS of all the disks in the array divided by 4 (when each
write request hits data on only one disk) or the IOPS of a single disk
(when each write is a full-stripe write).
It is true, however, that for full-stripe reads or writes the
performance of a RAID-5 array is similar to that of a RAID-3 array (in
some cases a bit slower, since the spindles aren't synchronized, in
others a bit faster, since there's no dedicated parity drive that can
never participate in read activity).
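The IOPS limits described above can be put into a small back-of-the-envelope model. This is a sketch only: the function names are invented for illustration, and the per-disk figure of 100 IOPS and the 12-disk group are assumptions, not measurements from this thread.

```python
# Rough upper bounds on array IOPS, following the reasoning in the post.

def raid3_max_iops(disk_iops):
    # Every drive takes part in every operation, so the array behaves like one disk.
    return disk_iops


def raid5_max_read_iops(n_disks, disk_iops):
    # Best case: each read is satisfied by a single disk, so all disks work in parallel.
    return n_disks * disk_iops


def raid5_max_small_write_iops(n_disks, disk_iops):
    # Each small write costs 4 disk accesses:
    # read old data, read old parity, write new data, write new parity.
    return n_disks * disk_iops / 4


def raid5_full_stripe_write_iops(disk_iops):
    # A full-stripe write occupies every disk once, like RAID-3.
    return disk_iops


if __name__ == "__main__":
    n, iops = 12, 100  # assumed values, for illustration only
    print("RAID-3 any access:       ", raid3_max_iops(iops))
    print("RAID-5 small reads:      ", raid5_max_read_iops(n, iops))
    print("RAID-5 small writes:     ", raid5_max_small_write_iops(n, iops))
    print("RAID-5 full-stripe write:", raid5_full_stripe_write_iops(iops))
```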
....
Quote: Seems like I was partially right with 8 or 16 drives as an optimal number of
drives
Only if you define 'partially right' as 'completely wrong', since '8 or
16' is not '6, 10, or 18' (not that the latter numbers are usually
important either).
Quote: .... Seems like that for RAID6 it's optimal to have 6, 10 or 18 drives
While that text shows *examples* of drives where the size of the data in
a stripe is a power of 2 it does not state that it *should* be a power
of 2 (for the excellent reason that usually there's no reason for that
in a RAID-5/6 array, though there may well be in a RAID-3 array).
For the kind of streaming reads and writes that kkkk is doing any
competent RAID-5/6 code (whether firmware or software) should do the
right thing in terms of fetching bulk reads and scheduling full-stripe
(for RAID-6, full stripe-group when possible) writes regardless of what
the stripe size is. And even for non-streaming access patterns there's
usually no reason to worry about the stripe size (number of data disks
in a stripe) unless there's some typical (large) write size that can be
matched to the size of the data in a full stripe (or stripe-group in
RAID-6). For non-streaming reads and writes it is useful to align the
array (as the text describes) such that typical individual reads and
writes hit the minimum number of disks necessary to satisfy them - but
for the example shown (64 KB accesses) that doesn't imply a power-of-2
data size for the entire stripe, just that each disk hold a multiple of
64 KB of data in each stripe segment suitably aligned such that 64 KB
accesses will never hit more than one data disk.
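The alignment point can be illustrated with a short sketch that counts how many stripe segments (and therefore disks) a request touches. The helper name `disks_touched` is hypothetical, all sizes are in KB, and the model assumes the request is shorter than one full stripe so each segment it touches lives on a different disk.

```python
def disks_touched(offset_kb, length_kb, segment_kb):
    """Number of per-disk stripe segments covered by a request."""
    first = offset_kb // segment_kb
    last = (offset_kb + length_kb - 1) // segment_kb
    return last - first + 1


if __name__ == "__main__":
    # Aligned 64 KB access on 64 KB segments stays on one disk.
    print(disks_touched(offset_kb=128, length_kb=64, segment_kb=64))
    # The same access shifted by 32 KB straddles two disks.
    print(disks_touched(offset_kb=96, length_kb=64, segment_kb=64))
    # A segment that is any multiple of 64 KB (here 192 KB, not a power of 2)
    # still keeps every aligned 64 KB access on a single disk.
    print({disks_touched(o, 64, 192) for o in range(0, 1920, 64)})
```

Nothing in this calculation depends on how many disks are in the stripe, only on the segment size and the request alignment.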
Incidentally, Patterson et al. didn't invent RAID, they just formalized
the description of it. IBM had a RAID-5 implementation in the '70s, and
RAID-1 is even older.
- bill |
|
|
|
|
|
| David Brown... |
Posted: Fri Apr 24, 2009 4:19 pm |
|
|
|
Guest
|
kkkk wrote:
Quote: David Brown wrote:
I did spend a few minutes in Google trying to find detailed
information about md's RAID-6 implementation but got nowhere.
...
There is a lot more information about linux raid5 than raid6.
You mean on *Linux MD* raid5? That could be good. Where?
Google for "linux raid 5" - there are a few million hits, most of which
are for software raid (i.e., MD raid). Googling for "linux raid 6" only
gets you a few hundred thousand hits.
Quote: Raid-6 algorithms are practically equivalent to raid-5, except parity
computation obviously.
Here is a link that might be useful, if you want to know the details of
Linux raid 6:
<http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf> |
|
|
|
|
|
| kkkk... |
Posted: Fri Apr 24, 2009 6:33 pm |
|
|
|
Guest
|
This guy
http://lists.freebsd.org/pipermail/freebsd-fs/2008-September/005170.html
is doing basically the same as I am, with software raid done with
ZFS in freebsd (raid-Z2 is basically raid-6), writing and reading 10GB
files. His results are a heck of a lot better than mine with default
settings and not very far from the bare hard disks' throughput (he
seems to get about 50MB/sec per non-parity disk).
This shows that software raid is indeed capable of doing good stuff in
theory. It's just linux MD + ext3 that seems to have some performance problems. |
|
|
|
|
|
| Waldek Hebisch... |
Posted: Fri Apr 24, 2009 6:42 pm |
|
|
|
Guest
|
In comp.os.linux.development.system kkkk <kkkk at (no spam) bbbb.com> wrote:
Quote:
Why do you think dd stays at 100% CPU? (with disks/3ware caches enabled)
Shouldn't that be 0%?
Do you think the CPU is high due to a memory-copy operation? If it was
that, I suppose dd from /dev/zero to /dev/null should go at 200MB/sec
instead it goes at 1.1GB/sec (with 100%CPU occupation indeed, 65% of
which is in kernel mode). That would mean that the number of copies
performed by dd while copying to the ext3-raid is 5 times greater than
that for copying from /dev/zero to /dev/null . Hmmm... a bit difficult
to believe. there must be other stuff performed in the ext3 case so to
hog the CPU. Is the ext3 code running within the dd process when dd writes?

I did not check the kernel code, but logically, when writing to /dev/null
you do not need to copy data. So normally I would expect 2 times
more copying. I would try the bs parameter to dd. For example,
on my machine
    dd if=/dev/zero of=/dev/null count=1000000
needs 0.560571s while
    time dd if=/dev/zero of=/dev/null count=100000 bs=10240
(which copies twice as much data) needs 0.109896s.
By default dd uses a 512-byte block, which means that you do a lot
of system calls (each block is copied using separate calls to
read and write).
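To make the comparison of the two dd runs concrete, here is a small sketch that works out how much data each command moves and how many system calls it needs. The helper `dd_cost` is hypothetical, and it assumes exactly one read() and one write() per block, as described above.

```python
def dd_cost(count, bs=512):
    """Return (bytes copied, system calls issued) for a dd run.

    Assumes one read() plus one write() per block; bs defaults to
    dd's 512-byte default block size.
    """
    total_bytes = count * bs
    syscalls = 2 * count
    return total_bytes, syscalls


if __name__ == "__main__":
    small = dd_cost(count=1000000)            # default 512-byte blocks
    large = dd_cost(count=100000, bs=10240)   # 10 KiB blocks
    print("default bs: bytes=%d syscalls=%d" % small)
    print("bs=10240:   bytes=%d syscalls=%d" % large)
    print("data ratio:    ", large[0] / small[0])
    print("syscall ratio: ", large[1] / small[1])
```

The second run copies twice the data with a tenth of the system calls, which fits the timings quoted above.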
And yes, when dd makes a system call, the work done in the kernel is
accounted as work done by dd. That includes many operations
done by ext3 (some work is done by kernel threads and some is
done from interrupts and accounted to whatever process is
running at the given time).
Coming back to dd CPU usage: as long as there is enough space
to buffer the write, dd should have 100% CPU utilization. Simply,
dd is copying data to kernel buffers as fast as it can. Once
the kernel buffers are full dd should block -- however, what you
wrote suggests that you have enough memory to buffer the whole
write. Using large blocks dd should be faster than the disks,
but for small blocks the cost of system calls may be high
(and it does not help that you have many cores, because
dd is single-threaded and much of the kernel work is done
in the same thread).
--
Waldek Hebisch
hebisch at (no spam) math.uni.wroc.pl |
|
|
|
|
|
| Chris Friesen... |
Posted: Fri Apr 24, 2009 6:44 pm |
|
|
|
Guest
|
kkkk wrote:
Quote: In my case dd pushes 5 seconds of data before the disks start writing
(dirty_writeback_centisecs = 500). dd always stays at least 5 seconds
ahead of the writes. This should fill all stripes completely, causing no
reads. I even tried to raise the dirty_writeback_centisecs with no
measurable performance benefit.
Where is this 5secs of data stored? Is it at the ext3 layer or at the
LVM layer (I doubt this one, also I notice there is no LVM kernel thread
running) or at the MD layer?

Most likely it's in the page cache, above both layers.
Quote: Why do you think dd stays at 100% CPU? (with disks/3ware caches enabled)
Shouldn't that be 0%?
Do you think the CPU is high due to a memory-copy operation? If it was
that, I suppose dd from /dev/zero to /dev/null should go at 200MB/sec
instead it goes at 1.1GB/sec (with 100%CPU occupation indeed, 65% of
which is in kernel mode). That would mean that the number of copies
performed by dd while copying to the ext3-raid is 5 times greater than
that for copying from /dev/zero to /dev/null . Hmmm... a bit difficult
to believe. there must be other stuff performed in the ext3 case so to
hog the CPU. Is the ext3 code running within the dd process when dd writes?

Copying from /dev/zero to /dev/null is a special case, as it doesn't
have to do any filesystem work. It's basically measuring memory bandwidth.
When copying to an actual file there will be work to arrange the
filesystem, allocate disk blocks, etc. I wouldn't have expected it to
happen within the context of the dd process, but I'm not a filesystem guy.
Quote: Hmm probably not because kjournald had significant CPU occupation. What
is the role of the journal during file overwrites?
I suspect the journal will be involved on any filesystem access.
Just curious, how is your ext3 filesystem configured for data
journalling (journal/ordered/writeback)? Have you tried mounting it
with "noatime"?
Lastly, in your original email you asked about "sync". When run from
the commandline, that command simply flushes all filesystem changes out
to disk and waits for that process to complete. Depending on the disk
the data may or may not have actually hit the platters by the time sync
returns.
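For a single file, a program can ask for the same kind of flush itself rather than relying on a system-wide sync. The sketch below uses Python's `os.fsync`; the helper name `write_and_flush` and the temporary file are assumptions for illustration. As noted above, a drive's own write cache may still be holding the data when the call returns.

```python
import os
import tempfile


def write_and_flush(path, payload):
    """Write payload to path and ask the kernel to push it to the device."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()             # drain Python's userspace buffer into the page cache
        os.fsync(f.fileno())  # ask the kernel to write this file's dirty pages out
    return os.path.getsize(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        size = write_and_flush(os.path.join(tmp, "test.bin"), b"\0" * 1024 * 1024)
        print("bytes written and flushed:", size)
```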
Chris |
|
|
|
|
|
| ... |
Posted: Sat Apr 25, 2009 4:38 pm |
|
|
|
Guest
|
In comp.arch.storage, Bill Todd <billtodd at (no spam) metrocast.net> wrote:
Quote: Yes. And when you're making statements that depend upon the details,
it's really a good thing to have understood them first.
Well, coin has two sides, right? I've understood it the way I described
it... You had pretty good arguments, and made me learn something more (be
sure that I'll save this article somewhere)... :)
TNX... ;)
Quote: 8+2 drives are 10... 16+2 drives are 18... 8 drives are optimal...
You clearly didn't understand why that's incorrect the first time I
explained it. Perhaps you should just keep reading my previous response
until the light dawns rather than continue to babble on incompetently.
So, you say that it doesn't matter how many drives are in the array (RAID5
or RAID6)? If so, that's nice, but I would like to know why exactly... I will
read your post again...
Quote: Fine, but, if everything is aligned with Base2, then why go around it?
What matters is request *alignment*, not whether the size of the data in a
full stripe is a power of 2. This usually means request alignment with
respect to the size of a single disk's data stripe segment, so that the
minimum number of disks is required to participate in any access request.
Since access request sizes are often a power of 2 this means that the size
of a single disk's stripe segment should usually be a power of 2, but says
nothing about how many disks should be in the stripe.
So basically, if I say that stripe segment is 64kb, it means that when I
write 512kb of data and have 12 drives, I simply use 8 drives at a time, and
the rest of 4 drives are not used (forget about parity drives now)?
What happens if I try to write 1kb data in a 64kb stripe segment using 4kb
blocks in NTFS (let's do this as an example)?
Quote: Well, yes: you still don't know what you're talking about, and appear
to be very resistant to becoming better-educated about it. Do try to
change both before you respond again.
Well, it seems that I have found someone who truly understands how RAID works,
so, I won't hesitate to ask what is still unknown to me... But first I had
to make you angry... ;)
Quote: Interestingly (and contrary to most of the articles on the Web that
credit Ouchi with the first description of RAID-5 - let alone the more
prevalent articles that seem to believe that all the RAID concepts
originated at Berkeley much later) Ouchi's algorithm appears to have
been RAID-4 (see last sentence of first paragraph quoted above). The
small-write approach described above, however, applies equally to
RAID-5, so clearly existed well before Patterson described it. The IBM
patent cited above for a RAID-5-style algorithm was filed on 06/12/1986
(with no indication of how long IBM had been working on the concept
before then) and contains none of the names of the Berkeley RAID team -
but IBM was working with/sponsoring them at around that time so that's
likely where the Berkeley group picked it up.
Basically, what I understood is that IBM invented RAID, but Patterson and
his crew gave it a name when they used inexpensive drives (IBM's storage
surely cost a lot at that time?)...
--
Pod krevetom se za pet minuta maslinov banderao cvokoce.
By runf
Damir Lukic, calypso at (no spam) _MAKNIOVO_fly.srk.fer.hr
http://inovator.blog.hr
http://calypso-innovations.blogspot.com/ |
|
|
|
|
|
| Bill Todd... |
Posted: Sun Apr 26, 2009 12:28 am |
|
|
|
Guest
|
calypso at (no spam) fly.srk.fer.hr.invalid wrote:
....
Quote: What matters is request *alignment*, not whether the size of the data in a
full stripe is a power of 2. This usually means request alignment with
respect to the size of a single disk's data stripe segment, so that the
minimum number of disks is required to participate in any access request.
Since access request sizes are often a power of 2 this means that the size
of a single disk's stripe segment should usually be a power of 2, but says
nothing about how many disks should be in the stripe.
So basically, if I say that stripe segment is 64kb, it means that when I
write 512kb of data and have 12 drives, I simply use 8 drives at a time, and
the rest of 4 drives are not used (forget about parity drives now)?
In a conventional RAID-5 you can allocate and write to space at the same
granularity you can to a disk (currently a single 512-byte sector). The
sectors are numbered consecutively within each stripe segment and
continuing across each stripe (leaving out the parity segment) and then
continue on to the next stripe. So if you read or write a 512 KB
request (aligned to the start of a stripe segment, rather than just
aligned to an arbitrary single sector boundary) it will indeed hit 8
logically consecutive 64 KB stripe segments.
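The numbering scheme just described can be sketched as a mapping from a byte offset to a (stripe, data segment) pair. The function name is hypothetical, and the sketch deliberately ignores which physical disk holds parity in each stripe, so the segment index is the logical position among the data segments only. A 12-drive RAID-5 is assumed, giving 11 data segments per stripe.

```python
KB = 1024


def segments_for_request(offset, length, segment_size, data_disks):
    """List the (stripe number, data segment index) pairs a request covers."""
    first = offset // segment_size
    last = (offset + length - 1) // segment_size
    return [(s // data_disks, s % data_disks) for s in range(first, last + 1)]


if __name__ == "__main__":
    # 512 KB request aligned to the start of a stripe: 8 consecutive segments.
    for pair in segments_for_request(0, 512 * KB, 64 * KB, data_disks=11):
        print(pair)
    # The same request starting at segment 6 spills into the next stripe.
    spill = segments_for_request(6 * 64 * KB, 512 * KB, 64 * KB, data_disks=11)
    print(sorted({stripe for stripe, _ in spill}))
```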
In a 12-drive array such a 512 KB write will be optimized such that
instead of reading each modified segment, creating the XOR with the new
data, and then reading/XORing with/updating the relevant parity segment
it will just write the 8 modified segments, read in the 3 unmodified
data segments remaining in the stripe, create the full-stripe data XOR,
and update the parity segment (this is the normal optimization applied
when about half or more of the segments in a stripe are modified). If
not all the modified segments fall in the same stripe it will update the
two affected stripes separately (applying that optimization to one of
them if applicable).
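The trade-off between the two write strategies comes down to counting disk accesses. The sketch below does that for a single RAID-5 stripe; the function names are made up for illustration, and every access is counted equally regardless of its size.

```python
def rmw_accesses(modified):
    """Read-modify-write: read each old data segment and the old parity,
    then write each new data segment and the new parity."""
    return 2 * modified + 2


def reconstruct_accesses(modified, data_disks):
    """Reconstruct-write: write the modified segments, read the untouched
    data segments, and write the freshly computed parity."""
    return modified + (data_disks - modified) + 1


def cheapest(modified, data_disks):
    rmw = rmw_accesses(modified)
    rec = reconstruct_accesses(modified, data_disks)
    return ("reconstruct", rec) if rec <= rmw else ("read-modify-write", rmw)


if __name__ == "__main__":
    # 12-drive RAID-5: 11 data segments per stripe, 8 of them modified.
    print("read-modify-write:", rmw_accesses(8))
    print("reconstruct-write:", reconstruct_accesses(8, 11))
    for k in range(1, 12):
        print(k, cheapest(k, 11))
```

With 11 data segments the two strategies break even at 5 modified segments, which matches the "about half or more" rule of thumb.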
In that example it would be more efficient to use a far larger stripe
segment. If, for example, you used a 512 KB stripe segment and the
access was aligned to that granularity you'd only need to read in the
old data and parity, XOR the old data with the new data, XOR the result
with the old parity, and write out the new data and the new parity: 4
disk accesses, and though each would be 512 KB instead of 64 KB this
would less than double the duration of each access resulting in less
than 2/3 the total array overhead that the 12 accesses in the optimized
(and aligned) initial case above required. In both situations unaligned
accesses increase the overhead, but for comparable lack of alignment the
large stripe segment still usually comes out ahead (and is at least as
efficient at servicing small requests, as explained below).
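The "less than 2/3" figure follows from simple arithmetic, sketched below. The only inputs are the access counts from the two cases above and a duration factor, meaning how much longer one 512 KB access takes than one 64 KB access; the post only states that this factor is below 2, so any specific value passed in is an assumption.

```python
from fractions import Fraction

SMALL_SEGMENT_ACCESSES = 12  # 8 writes + 3 reads + 1 parity write, 64 KB each
LARGE_SEGMENT_ACCESSES = 4   # read data, read parity, write data, write parity


def overhead_ratio(duration_factor):
    """Total disk time of the large-segment case relative to the small-segment
    case, with one 64 KB access taken as the unit of time."""
    return Fraction(LARGE_SEGMENT_ACCESSES) * Fraction(duration_factor) / SMALL_SEGMENT_ACCESSES


if __name__ == "__main__":
    # Upper bound: if a 512 KB access took exactly twice as long as a 64 KB one.
    print(overhead_ratio(2))
    # Any factor below 2 gives a ratio below that bound.
    print(overhead_ratio(Fraction(3, 2)))
```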
Quote:
What happens if I try to write 1kb data in a 64kb stripe segment using 4kb
blocks in NTFS (let's do this as an example)?
You don't have to manipulate the entire stripe segment when only part of
it is affected. So at a minimum you just read in the two target data
sectors to get their old value and the corresponding two sectors of the
parity segment, XOR that old data with the two sectors' worth of
modified data, XOR the result with the two sectors of parity data, and
write out the two modified data sectors and the two updated parity sectors.
But if NTFS accesses things at 4 KB granularity at its lower-level disk
interface you actually have to update the entire 4 KB data block and the
corresponding parity block in the affected stripe's parity segment
(still reasonably efficient).
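The small-write parity update can be demonstrated directly with XOR. This is a toy sketch, not any particular RAID implementation: the helper names are invented, the stripe has three data segments, and each "segment" is just the two 512-byte sectors being touched.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(segments):
    """Parity computed from scratch over all data segments."""
    parity = bytes(len(segments[0]))
    for seg in segments:
        parity = xor_bytes(parity, seg)
    return parity


def small_write_parity(old_parity, old_data, new_data):
    """Parity after a small write: old parity XOR old data XOR new data."""
    return xor_bytes(old_parity, xor_bytes(old_data, new_data))


if __name__ == "__main__":
    sectors = 2 * 512  # the 1 KB being rewritten
    data = [bytes([i]) * sectors for i in (1, 2, 3)]
    parity = full_parity(data)

    new_block = bytes([9]) * sectors
    updated = small_write_parity(parity, data[1], new_block)

    # The shortcut gives the same parity as recomputing over the whole stripe.
    print(updated == full_parity([data[0], new_block, data[2]]))
```

Only the target sectors and the matching parity sectors are read and written; the other data segments are never touched.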
....
Quote: But first I had
to make you angry...
Not really: I actually prefer answering reasonable questions to
correcting persistent misinformation. Unfortunately, there seems to be
a lot more of the latter than of the former ever since "Generation Me"
came around (self-discipline doesn't seem to be their forte).
- bill |
|
|
|
|
|
| ... |
Posted: Sun Apr 26, 2009 12:46 am |
|
|
|
Guest
|
In comp.arch.storage, Bill Todd <billtodd at (no spam) metrocast.net> wrote:
Quote: So basically, if I say that stripe segment is 64kb, it means that when I
write 512kb of data and have 12 drives, I simply use 8 drives at a time, and
the rest of 4 drives are not used (forget about parity drives now)?
In a conventional RAID-5 you can allocate and write to space at the same
granularity you can to a disk (currently a single 512-byte sector). The
sectors are numbered consecutively within each stripe segment and
continuing across each stripe (leaving out the parity segment) and then
continue on to the next stripe. So if you read or write a 512 KB
request (aligned to the start of a stripe segment, rather than just
aligned to an arbitrary single sector boundary) it will indeed hit 8
logically consecutive 64 KB stripe segments.
So if you align cache page size with stripe size, you can benefit from it or
no? Let's say that you've got 16kB cache page size and have 8 drives with
2kb stripe segment size... If you dump cache, you basically write to all
drives at once, right? But this situation can slow down everything since
you've got how many IOPS per one write operation (>8)?
Quote: In a 12-drive array such a 512 KB write will be optimized such that
instead of reading each modified segment, creating the XOR with the new
data, and then reading/XORing with/updating the relevant parity segment
it will just write the 8 modified segments, read in the 3 unmodified
data segments remaining in the stripe, create the full-stripe data XOR,
and update the parity segment (this is the normal optimization applied
when about half or more of the segments in a stripe are modified). If
not all the modified segments fall in the same stripe it will update the
two affected stripes separately (applying that optimization to one of
them if applicable).
Cool optimization...
Quote: In that example it would be more efficient to use a far larger stripe
segment. If, for example, you used a 512 KB stripe segment and the
access was aligned to that granularity you'd only need to read in the
old data and parity, XOR the old data with the new data, XOR the result
with the old parity, and write out the new data and the new parity: 4
disk accesses, and though each would be 512 KB instead of 64 KB this
would less than double the duration of each access resulting in less
than 2/3 the total array overhead that the 12 accesses in the optimized
(and aligned) initial case above required. In both situations unaligned
accesses increase the overhead, but for comparable lack of alignment the
large stripe segment still usually comes out ahead (and is at least as
efficient at servicing small requests, as explained below).
Thinking..... So, you need to optimize the cache of a RAID controller to
gather changed data so that it could be written in one dump (utilize all
actuators at once)?
Quote: Not really: I actually prefer answering reasonable questions to
correcting persistent misinformation. Unfortunately, there seems to be
a lot more of the latter than of the former ever since "Generation Me"
came around (self-discipline doesn't seem to be their forte).
Cool... Anyway, sorry... It's almost impossible to find this kind of
information on the internet... And I mostly work with EMC storage arrays
(Symmetrix and CLARiiON), and get into detail as much as I can, but even
with the access to information I don't get this deep... This is mostly
information for firmware programming...
Thanks a lot for explanations...
--
"Naklonjens li jabukau pasiru ?" upita cigan mrcvari Crnogorkaog
mrcvarija. "Nisam ja nikog bombardiro !" rece drota masira "Ja samo
dostavljacu njuku zdravm !" By runf
Damir Lukic, calypso at (no spam) _MAKNIOVO_fly.srk.fer.hr
http://inovator.blog.hr
http://calypso-innovations.blogspot.com/ |
|
|
|
|
|
| Bill Todd... |
Posted: Sun Apr 26, 2009 5:15 am |
|
|
|
Guest
|
calypso at (no spam) fly.srk.fer.hr.invalid wrote:
Quote: In comp.arch.storage, Bill Todd <billtodd at (no spam) metrocast.net> wrote:
So basically, if I say that stripe segment is 64kb, it means that when I
write 512kb of data and have 12 drives, I simply use 8 drives at a time, and
the rest of 4 drives are not used (forget about parity drives now)?
In a conventional RAID-5 you can allocate and write to space at the same
granularity you can to a disk (currently a single 512-byte sector). The
sectors are numbered consecutively within each stripe segment and
continuing across each stripe (leaving out the parity segment) and then
continue on to the next stripe. So if you read or write a 512 KB
request (aligned to the start of a stripe segment, rather than just
aligned to an arbitrary single sector boundary) it will indeed hit 8
logically consecutive 64 KB stripe segments.
So if you align cache page size with stripe size, you can benefit from it or
no?
Maybe - mostly depending upon how much concurrency exists in your
workload (as seen by the disks).
If there's no concurrency at all (i.e., no request is submitted until
the previous one completes) you're best off using full-stripe writes
(and for that matter RAID-3 rather than RAID-5) - because there's no way
you can get more than a single disk's worth of IOPS out of the array and
you should ensure that nothing causes it to be even worse than that.
If there's a lot of concurrency, you're best off minimizing the
resources that each operation uses, which usually means large stripe
segment sizes - so large that even a single segment will tend to be
larger than most requests (thus the entire stripe will be *much* larger
than almost any request): that way, each read typically is satisfied by
one disk access and each write by 4 disk accesses spread across 2 disks
(3 if the old data still happens to be in cache), so N reads or N/2
writes can proceed in parallel with virtually no worse latency for reads
and only about twice the latency for writes as would happen with
full-stripe RAID-3 accesses.
So you wind up with N times the potential read throughput with minimal
latency penalty and N/4 times the write throughput without dramatic
latency penalty - compared with the *no-load* full-stripe case. But if
having only effectively one disk's worth of IOPS for full-stripe
accesses would have resulted in significant request queuing delays, you
may have a lot *better* latency (as well as throughput) for both reads
and writes when you minimize the number of disks involved in each one by
using a large stripe segment size.
With NVRAM for stable write-back cache in your array controller it can
perform other optimizations - e.g., even if your writes aren't optimally
aligned it can in many situations gather them up in its NVRAM and later
issue them to the disks with more optimal alignment. The same can
happen with a software array controller and lazy writes if your
file-level cache is sufficiently intelligent to do that kind of thing
when presenting data to the array.
Quote: Let's say that you've got 16kB cache page size and have 8 drives with
2kb stripe segment size... If you dump cache, you basically write to all
drives at once, right? But this situation can slow down everything since
you've got how many IOPS per one write operation (>8)?
Exactly.
....
Quote: In that example it would be more efficient to use a far larger stripe
segment. If, for example, you used a 512 KB stripe segment and the
access was aligned to that granularity you'd only need to read in the
old data and parity, XOR the old data with the new data, XOR the result
with the old parity, and write out the new data and the new parity: 4
disk accesses, and though each would be 512 KB instead of 64 KB this
would less than double the duration of each access resulting in less
than 2/3 the total array overhead that the 12 accesses in the optimized
(and aligned) initial case above required. In both situations unaligned
accesses increase the overhead, but for comparable lack of alignment the
large stripe segment still usually comes out ahead (and is at least as
efficient at servicing small requests, as explained below).
Thinking..... So, you need to optimize the cache of a RAID controller to
gather changed data so that it could be written in one dump (utilize all
actuators at once)?
That would certainly be a useful optimization (just mentioned that above).
- bill |
|
|
|
|
|
| ... |
Posted: Sun Apr 26, 2009 12:54 pm |
|
|
|
Guest
|
In comp.arch.storage, Bill Todd <billtodd at (no spam) metrocast.net> wrote:
Quote: So if you align cache page size with stripe size, you can benefit from it or
no?
If there's no concurrency at all (i.e., no request is submitted until the
previous one completes) you're best off using full-stripe writes (and for
that matter RAID-3 rather than RAID-5) - because there's no way you can
get more than a single disk's worth of IOPS out of the array and you
should ensure that nothing causes it to be even worse than that.
What do you think about using RAID3 for multimedia broadcasting, and what
about multimedia recording?
Quote: If there's a lot of concurrency, you're best off minimizing the resources
that each operation uses, which usually means large stripe segment sizes -
so large that even a single segment will tend to be larger than most
requests (thus the entire stripe will be *much* larger than almost any
request): that way, each read typically is satisfied by one disk access
and each write by 4 disk accesses spread across 2 disks (3 if the old data
still happens to be in cache), so N reads or N/2 writes can proceed in
parallel with virtually no worse latency for reads and only about twice
the latency for writes as would happen with full-stripe RAID-3 accesses.
OK, much concurrency is solved using TCQ/NCQ which means you've got only one
IO operation for fetching few data segments per drive...
Quote: So you wind up with N times the potential read throughput with minimal
latency penalty and N/4 times the write throughput without dramatic
latency penalty - compared with the *no-load* full-stripe case. But if
having only effectively one disk's worth of IOPS for full-stripe accesses
would have resulted in significant request queuing delays, you may have a
lot *better* latency (as well as throughput) for both reads and writes
when you minimize the number of disks involved in each one by using a
large stripe segment size.
I see... So, it's possible to see that 5 1TB/7.2k drives in RAID5 with a
huge segment size can work faster than 15 146GB/15k drives in RAID5 with
small segment size?
--
"Bradats li zidaro zvace ?" upita tabletaa izdrkava gaceog gnjecija.
"Nisam ja nikog bombardiro !" rece sprajt ljubi "Ja samo cokoladaog volija krvavm !" By runf
Damir Lukic, calypso at (no spam) _MAKNIOVO_fly.srk.fer.hr
http://inovator.blog.hr
http://calypso-innovations.blogspot.com/ |
|
|
|
|
|
|
|
All times are GMT
|
|