Writing to block device is *slower* than writing to...

kkkk...
Posted: Fri Aug 07, 2009 4:30 pm
Guest
Hi all,
we have a new machine with 3ware 9650SE controllers, and I am comparing
hardware RAID with Linux software MD RAID performance.
For now I am on hardware RAID. I have set up a RAID-0 with 14 drives.

If I create an xfs filesystem on it (whole device, no partitioning,
stripe-aligned at mkfs time, etc.) and then write to a file with dd (or
with bonnie++) like this:
sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/mnt/tmp/ddtry bs=1M count=6000 conv=fsync ; time sync
I get about 540 MB/sec (the final sync takes 0 seconds). This is close to
the 561 MB/sec that 3ware claims:
http://www.3ware.com/KB/Article.aspx?id=15300

However, if I instead write directly to the block device like this:
sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/dev/sdc bs=1M count=6000 conv=fsync ; time sync
performance is only 260 MB/sec!? (the final sync takes 0 seconds)

I tried many times and this is the absolute fastest I could obtain. I
tweaked bs and count, I removed conv=fsync, I made sure the 3ware
caches are ON for the block device, I set the anticipatory scheduler...
Nothing helps. I am positive that writing to a file on the xfs filesystem
is faster than writing to the block device directly.

How can that be!? Does anyone know what's happening?

Please note that the machine is absolutely clean and there is no other
workload. I am running kernel 2.6.31 (Ubuntu 9.10 alpha live).

Thank you
 
David Schwartz...
Posted: Sat Aug 08, 2009 12:25 am
Guest
On Aug 7, 5:30 am, kkkk <k... at (no spam) bbbb.com> wrote:

Quote:
If I create an xfs filesystem on it (whole device, no partitioning,
stripe-aligned at mkfs time, etc.) and then write to a file with dd (or
with bonnie++) like this:
sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/mnt/tmp/ddtry bs=1M count=6000 conv=fsync ; time sync
I get about 540 MB/sec (the final sync takes 0 seconds). This is close to
the 561 MB/sec that 3ware claims:
http://www.3ware.com/KB/Article.aspx?id=15300

However, if I instead write directly to the block device like this:
sync ; echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/zero of=/dev/sdc bs=1M count=6000 conv=fsync ; time sync
performance is only 260 MB/sec!? (the final sync takes 0 seconds)

I tried many times and this is the absolute fastest I could obtain. I
tweaked bs and count, I removed conv=fsync, I made sure the 3ware
caches are ON for the block device, I set the anticipatory scheduler...
Nothing helps. I am positive that writing to a file on the xfs filesystem
is faster than writing to the block device directly.

How can that be!? Does anyone know what's happening?

There could be a lot of reasons, but the most likely is that the two
tests are writing to opposite ends of the drive. To test, put a 'seek' in
your 'dd' to the block device. See if larger seeks result in higher speeds.

DS
 
kkkk...
Posted: Sat Aug 08, 2009 10:47 pm
Guest
David Schwartz wrote:
Quote:
There could be a lot of reasons, but the most likely is that the two
tests are writing to opposite ends of the drive. To test, put a 'seek' in
your 'dd' to the block device. See if larger seeks result in higher speeds.

Nope, it's not that. I used seek as you suggested to write at the end of
the device, and the speed is not significantly different. Writing to the
device goes from 239 to 233 MB/sec (it's actually a bit faster at the
beginning).

I am positive that the seek value I used for dd is correct, because when I
raised it a bit further dd gave me an error: dd: `/dev/sdc':
cannot seek: Invalid argument
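
For reference, the largest seek= value that still leaves room for the whole run can be worked out from the device size. This is only a sketch in Python: the function name is made up, and the 14 x 10**12 byte device size is a hypothetical round figure (the thread never states the array's exact size). On a real system the size would come from something like blockdev --getsize64.

```python
def last_seek(device_bytes, bs_bytes, count):
    """Largest dd seek= value (in units of bs) such that `count` blocks
    of `bs_bytes` each still fit before the end of the device."""
    total_blocks = device_bytes // bs_bytes   # whole bs-sized blocks on the device
    if count > total_blocks:
        raise ValueError("run is larger than the device")
    return total_blocks - count

# Hypothetical size: 14 drives of 10**12 bytes each (an assumption)
device_bytes = 14 * 10**12
seek = last_seek(device_bytes, bs_bytes=1024 * 1024, count=6000)

# Only prints the command line; nothing is written to any device here
print(f"dd if=/dev/zero of=/dev/sdc bs=1M count=6000 seek={seek} conv=fsync")
```

Any seek value larger than this makes the run extend past the end of the device.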

Next idea...?

Thank you!
 
kkkk...
Posted: Mon Aug 10, 2009 4:01 am
Guest
kkkk wrote:
Quote:
Hi all,
we have a new machine with 3ware 9650SE controllers and I am testing ...

I found it! I found it!

dd apparently does not buffer writes correctly (good catch, Mark): it
apparently disregards the bs value and submits very small writes. It needs
oflag=direct to really honor bs, and even then there's a limit. Also,
elevator merging of the small writes does not try hard enough and cannot
achieve good throughput. More details tomorrow.
 
Robert Nichols...
Posted: Mon Aug 10, 2009 3:00 pm
Guest
In article <4a7f637e$0$28126$892e0abb at (no spam) auth.newsreader.octanews.com>,
kkkk <kkkk at (no spam) bbbb.com> wrote:
:kkkk wrote:
:> Hi all,
:> we have a new machine with 3ware 9650SE controllers and I am testing ...
:
:I found it! I found it!
:
:dd apparently does not buffer writes correctly (good catch, Mark): it
:apparently disregards the bs value and submits very small writes. It needs
:oflag=direct to really honor bs, and even then there's a limit. Also,
:elevator merging of the small writes does not try hard enough and cannot
:achieve good throughput. More details tomorrow.

Curious. I'm not seeing that behavior in either CentOS 5 or Fedora 11
(coreutils-5.97-19.el5, coreutils-7.2-2.fc11). In both of those, when I
run:

strace dd if=/dev/zero bs=1M count=1 of=somefile conv=fsync

I see exactly one read and write, each of size 1048576.

--
Bob Nichols AT comcast.net I am "RNichols42"
 
kkkk...
Posted: Mon Aug 10, 2009 9:40 pm
Guest
Robert Nichols wrote:
Quote:
Curious. I'm not seeing that behavior in either CentOS 5 or Fedora
11 (coreutils-5.97-19.el5, coreutils-7.2-2.fc11). In both of those,
when I run:

strace dd if=/dev/zero bs=1M count=1 of=somefile conv=fsync

I see exactly one read and write, each of size 1048576.


I haven't straced it, but this is what iostat -x 1 shows (grabbed from a
live run):

Without direct: (bs=1M)

Device:  rrqm/s     wrqm/s   r/s       w/s  rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm   %util
sdc        0.00  559294.00  0.00  14384.00    0.00  570550.00     39.67    143.98   9.96   0.07  100.00



With direct: (bs=1M)

Device:  rrqm/s  wrqm/s   r/s      w/s  rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdc        0.00    0.00  0.00  3478.00    0.00  890368.00    256.00      5.77   1.66   0.28  98.40



You see, without direct there are a whole lot of wrqm/s (probably lots
of wasted CPU cycles), and the average request size (avgrq-sz) is still
only 39.67 sectors, far below 256 (I suppose 39.67 is after the merges,
correct?). The queue is also much deeper: avgqu-sz is 143.98 versus 5.77.

With direct there are no wrqm/s, and the submitted request size is exactly
256 sectors.
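
As a sanity check on the two iostat samples above, avgrq-sz can be recomputed as sectors written per second divided by write requests per second, and wsec/s can be converted to MB/sec. A minimal Python sketch, assuming 512-byte sectors (the unit iostat uses for these columns); the function names are made up for illustration:

```python
SECTOR_BYTES = 512  # iostat reports rsec/s and wsec/s in 512-byte sectors

def avg_request_sectors(wsec_per_s, w_per_s):
    """Average size of the write requests issued to the device, in sectors."""
    return wsec_per_s / w_per_s

def throughput_mb_per_s(wsec_per_s):
    """Write throughput in MB/sec (10**6 bytes, the unit dd reports)."""
    return wsec_per_s * SECTOR_BYTES / 1e6

# (wsec/s, w/s) copied from the two bs=1M samples above
samples = {
    "buffered": (570550.0, 14384.0),
    "direct": (890368.0, 3478.0),
}

for name, (wsec, w) in samples.items():
    print(f"{name:8s} avgrq-sz={avg_request_sectors(wsec, w):6.2f} sectors, "
          f"{throughput_mb_per_s(wsec):6.1f} MB/sec")
```

This reproduces the avgrq-sz column: 39.67 sectors (roughly 20 KB) per request for the buffered run, and exactly 256 sectors (128 KiB) for the direct run.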


With oflag=direct, throughput increases with increasing bs, like this:

(3ware 9650SE-16ML hw raid-0 256K chunk size, 14 disks [1TB 7200RPM SATA])
bs size -> speed:
512B  -> 4.9 MB/sec
1K    -> 13.3 MB/sec
2K    -> 26.6 MB/sec
4K    -> 54.1 MB/sec
8K    -> 96 MB/sec
16K   -> 157 MB/sec
32K   -> 231 MB/sec
64K   -> 300 MB/sec
128K  -> 359 MB/sec (from this point on, avgrq-sz does not increase anymore, but performance still increases)
256K  -> 404 MB/sec
512K  -> 430 MB/sec
1M    -> 456 MB/sec
2M    -> 466 MB/sec
3584K -> 494 MB/sec (stripe size)
4M    -> 473 MB/sec
8M    -> 542 MB/sec (a big performance jump!)
16M   -> 543 MB/sec
32M   -> 568 MB/sec (another big performance jump)
64M   -> 603 MB/sec (again! timing for this run: real 0m11.213s, user 0m0.004s, sys 0m3.880s)
128M  -> 641 MB/sec
256M  -> 676 MB/sec
512M  -> 645 MB/sec (performance starts dropping)
1G    -> 620 MB/sec
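
Two numbers in the list above follow from simple arithmetic: the stripe size is 14 disks times the 256K chunk, and a 256-sector request is 128 KiB, which is exactly the bs at which avgrq-sz stops growing. The Python sketch below (illustrative names, 512-byte sectors assumed) also shows how many 256-sector requests a single dd write has to be split into at larger bs values, on the assumption that 256 sectors really is the per-request cap.

```python
SECTOR_BYTES = 512   # assumed sector size behind iostat's avgrq-sz
KIB = 1024

def stripe_kib(disks, chunk_kib):
    """Full stripe width of a RAID-0 array, in KiB."""
    return disks * chunk_kib

def requests_per_write(bs_bytes, max_request_sectors=256):
    """Number of device requests needed for one write of bs_bytes if no
    request may exceed max_request_sectors (ceiling division)."""
    max_bytes = max_request_sectors * SECTOR_BYTES
    return -(-bs_bytes // max_bytes)

print(stripe_kib(14, 256))            # 3584
print(256 * SECTOR_BYTES // KIB)      # 128

for bs_kib in (128, 1024, 8 * 1024, 256 * 1024):
    print(f"bs={bs_kib}K -> {requests_per_write(bs_kib * KIB)} request(s) per write")
```

Above bs=128K every write is split into several requests, so more of them can be outstanding at once. Whether that alone accounts for the gains at very large bs is not something these figures can settle.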

Avgrq-sz apparently cannot go over 256 sectors. Is this a hardware limit
of the device (3ware)?

Notwithstanding this, performance still increases up to bs=256M. From
iostat the only apparent change (apart from the increasing wsec/s, obviously)
is avgqu-sz, which is < 1.0 up to bs=128K and then rises to about 20.0
at bs=256M. Do you think this can be the reason for the performance
increase up to 256M?
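
One way to relate avgqu-sz to throughput is Little's law: the average number of requests in flight is roughly the request rate multiplied by the average time each request spends in the system (await, in milliseconds). A small Python check against the two bs=1M iostat samples quoted earlier in this post; the function name is illustrative:

```python
def expected_queue_depth(requests_per_s, await_ms):
    """Little's law: mean requests in flight = request rate * mean time in system."""
    return requests_per_s * await_ms / 1000.0

# (w/s, await in ms, avgqu-sz as reported by iostat)
samples = {
    "buffered": (14384.0, 9.96, 143.98),
    "direct": (3478.0, 1.66, 5.77),
}

for name, (w_s, await_ms, reported) in samples.items():
    estimate = expected_queue_depth(w_s, await_ms)
    print(f"{name:8s} estimated={estimate:7.2f} reported={reported:7.2f}")
```

Both estimates land within about 1% of the reported avgqu-sz, so the three columns are consistent with each other: a deeper queue at similar per-request latency means more requests completed per second. That does not by itself show why the controller keeps benefiting up to bs=256M.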

Thanks for any thoughts.
 
 
All times are GMT