Out-of-order writing by disk drives...

Anton Ertl...
Posted: Wed Apr 08, 2009 1:21 am
Guest
I have released a new version of hdtest, a program that tests whether
hard disks write out-of-order relative to the order that the writes
were passed to them from the OS. You can find the program at

http://www.complang.tuwien.ac.at/anton/hdtest/

Here I mainly present the results from my tests, and explain enough
about the program so you know what I am talking about.


HOW DOES IT WORK?

It writes the blocks in an order like this:

1000-0-1001-0-1002-0-...

This sequence seems to inspire PATA and SATA disks to write
out-of-order (in the order 1000-1001-1002-...-0). To detect this, you
turn off the drive's power while the program is running. The written
blocks contain data that another program from the suite can check
after you power the drive up again.
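
The write pattern and the check can be sketched roughly as follows. This is only an illustration, not hdtest's actual code: the per-write sequence number, the function names, and the dictionary describing what was found on disk are all assumptions made for the example.

```python
from itertools import islice


def write_order(first_block=1000):
    """Yield (block, sequence) pairs in the order 1000, 0, 1001, 0, ...

    The sequence number stands in for the check data stamped into each
    block (an assumption; hdtest's real on-disk format may differ).
    """
    seq = 0
    block = first_block
    while True:
        yield block, seq      # next data block
        seq += 1
        yield 0, seq          # block 0 is rewritten after every data block
        seq += 1
        block += 1


def blocks_out_of_order(on_disk):
    """Count data blocks that reached the platters although they were
    issued after the version of block 0 that is found on disk.

    on_disk maps block number -> sequence number read back after power-up.
    A missing block 0 is treated as "never written".
    """
    block0_seq = on_disk.get(0, -1)
    return sum(1 for blk, seq in on_disk.items()
               if blk != 0 and seq > block0_seq)


first_writes = list(islice(write_order(), 6))
# Hypothetical state after a power cut: block 0 is stale (seq 1),
# but data blocks 1001 and 1002 (seq 2 and 4) made it to disk.
example = blocks_out_of_order({1000: 0, 0: 1, 1001: 2, 1002: 4})
```

A drive that writes strictly in order would give a count of 0 or 1 here, since at most one data block can follow the newest block 0.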


RESULTS

I performed two sets of tests, one in November 1999, and one in April
2009. The results have not changed much. In both tests disks wrote
data seriously out-of-order in their default configuration; they can
delay the writing of block 0 in this test for quite a long time.

In more detail:

In 2009 I tested three drives (and accessed the whole drive) under
Linux 2.6.18 on Debian Etch; the USB enclosure used was a Tsunami
Elegant 3.5" Enclosure that has PATA and SATA disk drive interfaces.

* Maxtor L300R0 PATA (300GB) connected through a USB enclosure: In
two tests the consecutive blocks on the platters extended 47 and 34
blocks, respectively, past the last written version of block 0.

* Seagate ST340062 Model 0A PATA (7200.10, 400GB):
  connected through a USB enclosure:
    3 times the result was as if it had written the blocks in-order
    1 time it wrote 3064 blocks out-of-order
    2 times it wrote 18384 blocks out-of-order
  connected directly via PATA cable:
    1 time it wrote 1972 blocks out-of-order

* Seagate ST340062 Model 0AS SATA (7200.10, 400GB) connected through a
  USB enclosure:
    1 time the result was as if it had written the blocks in-order
    2 times it wrote 3064 blocks out-of-order
    1 time it wrote 6128 blocks out-of-order
    1 time it wrote 12256 blocks out-of-order
    1 time it did not write block 0 at all

It is interesting that the number of blocks found to be out-of-order
is often a multiple of 3064. Maybe this is a multiple of a track size;
no other explanations come to mind.
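
The multiples are easy to check. The short script below only redoes the arithmetic on the Seagate counts listed above; the helper name is made up for the example.

```python
UNIT = 3064  # the recurring count from the results above

# Out-of-order block counts reported for the two Seagate drives in 2009.
observed = [3064, 18384, 1972, 6128, 12256]


def as_multiple(count, unit=UNIT):
    """Return count // unit if count is an exact multiple of unit, else None."""
    quotient, remainder = divmod(count, unit)
    return quotient if remainder == 0 else None


multiples = {count: as_multiple(count) for count in observed}
for count, factor in multiples.items():
    label = f"{factor} x {UNIT}" if factor is not None else "not a multiple"
    print(count, "=", label)
```

Only the 1972 result, from the run with the drive connected directly via PATA cable, is not a multiple; the USB results correspond to factors 1, 2, 4 and 6.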

In 1999 I tested two drives (and accessed one partition) under
Linux-2.2.1 on RedHat 5.1. The two drives were a Quantum Fireball
CR8.4A (8GB) and an IBM-DHEA-36480 (6GB), both connected directly via
PATA. I did one test with each of the disks, and neither wrote block 0
to the platters even once before I turned off the power.

I also tested the Quantum with write caching disabled (hdparm -W 0).
Hdtest was now quite noisy and produced the in-order result.


CONCLUSION

Applications and file systems requiring in-order writes (i.e.,
basically all of them) should use barriers or turn off write caching
for the disk drive(s) they use. Unfortunately, the Linux ext3 file
system does not use barriers by default; use the mount option
barrier=1 to enable them, e.g. by putting a line like this in
/etc/fstab:

/dev/md2 /home ext3 defaults,barrier=1 1 2

Followups set to comp.arch.storage

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton at (no spam) mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
 
Bill Todd...
Posted: Wed Apr 08, 2009 8:19 pm
Guest
That disks write data out of order when write-back caching is enabled
does not seem at all surprising, since that's one of the main potential
benefits of having write-back caching enabled. I'd be far more
concerned if you had found that disks ever wrote data out of order with
write-back caching disabled (and indeed I've heard anecdotes that some
did - perhaps because, to obtain better performance numbers, they just
never disabled write-back caching regardless of what they were told to
do, or simply due to incompetent firmware).

The only other explanation I can readily come up with for why 3064
sectors might be written out of order would involve the heuristics
employed in the write-back caching algorithm (e.g., that's the maximum
amount of cache space it will allow dirty data to occupy before
destaging it to disk).

- bill
 
Bill Todd...
Posted: Wed Apr 08, 2009 9:37 pm
Guest
Anton Ertl wrote:
Quote:
Bill Todd <billtodd at (no spam) metrocast.net> writes:
That disks write data out of order when write-back caching is enabled
does not seem at all surprising, since that's one of the main potential
benefits of having write-back caching enabled.

Yes. But some people seem to imagine that this is a very small effect
that can be ignored without ill effects on the consistency of the
on-disk data of a file system; this attitude is exemplified by having
barrier=1 disabled by default in the ext3 file system in Linux.

The test demonstrates that the reordering can happen over several
seconds.

That indeed seems to be quite a long time - but then it wasn't so long
ago that Unix systems would by default allow writes to languish for as
much as 30 seconds (with no particular guarantees about ordering when
they actually got destaged) so I can't really fault the disk vendors for
this: as has always been the case, if you want ordering guarantees, you
need to take explicit steps to ensure them.

....

Quote:
the program ran
significantly slower (about 6MB/s transfer rate) than what the drive
is capable of (>70MB/s)

Interesting: it implies that the disk was destaging a few blocks every
rev rather than waiting for a track to fill up (what are track sizes on
those 400 GB disks - 0.5 MB or so?) but was still very reluctant to move
the head to give those block 0 writes a reasonable chance. That doesn't
strike me as a very good approach (achieving neither decent throughput
nor reasonable fairness) assuming that you did present the non-block-0
writes in strictly ascending order.

Have you tested disks to see whether they indeed destaged single large
transfers out of order (as many claim to when the write is at least a
large percentage of a track in size)?

- bill
 
Bill Todd...
Posted: Thu Apr 09, 2009 1:18 am
Guest
Anton Ertl wrote:
Quote:
Bill Todd <billtodd at (no spam) metrocast.net> writes:
Anton Ertl wrote:
Bill Todd <billtodd at (no spam) metrocast.net> writes:
That disks write data out of order when write-back caching is enabled
does not seem at all surprising, since that's one of the main potential
benefits of having write-back caching enabled.
Yes. But some people seem to imagine that this is a very small effect
that can be ignored without ill effects on the consistency of the
on-disk data of a file system; this attitude is exemplified by having
barrier=1 disabled by default in the ext3 file system in Linux.

The test demonstrates that the reordering can happen over several
seconds.
That indeed seems to be quite a long time - but then it wasn't so long
ago that Unix systems would by default allow writes to languish for as
much as 30 seconds (with no particular guarantees about ordering when
they actually got destaged) so I can't really fault the disk vendors for
this: as has always been the case, if you want ordering guarantees, you
need to take explicit steps to ensure them.

Yes, nowadays you can have them without turning off write caching
completely, so it's entirely reasonable.

There are file systems like ext3 with data=ordered or data=journal or
BSD FFS with soft updates that do give guarantees about ordering. But
in order to implement these guarantees they must take the explicit
steps, and ext3 does not do that by default.

I may have been too quick to ignore soft updates (AFAIK unique to BSD
and thus not typical of Unix capabilities in general) and the optional
behavior of ext3 (again, not generally available in most Unixes AFAIK) -
my point was that I didn't think that write-back delays (with resulting
out-of-order writes) of even a few seconds constituted irresponsible
behavior on the part of disk vendors given the typical lack of ordering
guarantees in the systems their disks ran in (actually, the main use of
ATA and SATA disks may be in Windows boxes, so perhaps that should have
been the focus of my comment: NTFS does attempt to control ordering, at
least for critical metadata updates, even with write-back caching
enabled, but I think only on drives that support the force unit access
flag, which at least until somewhat recently many ATA and perhaps even
SATA drives did not).

Quote:

the program ran
significantly slower (about 6MB/s transfer rate) than what the drive
is capable of (>70MB/s)
Interesting: it implies that the disk was destaging a few blocks every
rev rather than waiting for a track to fill up (what are track sizes on
those 400 GB disks - 0.5 MB or so?)

At 70MB/s and 7200rpm=120/s the track size is at least
70(MB/s)/120(/s)=0.583MB.

Duh - on my better days I would have thought of that rather than just
being too lazy to look up the specs at seagate.com.

Quote:
Probably a little larger because aligning
the head for the next platter or moving it to the next cylinder also
costs a little time on each revolution.

1 ms or less these days IIRC - around 10% +/- of a rev at 7200 rpm.
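
The figures in this exchange can be reproduced with a few lines of arithmetic. The constants are the ones quoted above (70 MB/s, 7200 rpm, a head or cylinder switch of about 1 ms); the variable names are just for illustration.

```python
RATE_MB_PER_S = 70.0    # sustained transfer rate quoted above (a lower bound)
RPM = 7200
HEAD_SWITCH_MS = 1.0    # "1 ms or less", as recalled above

revs_per_s = RPM / 60                        # 120 revolutions per second
rev_time_ms = 1000 / revs_per_s              # about 8.33 ms per revolution
min_track_mb = RATE_MB_PER_S / revs_per_s    # about 0.583 MB per track
switch_fraction = HEAD_SWITCH_MS / rev_time_ms  # about 0.12 of a revolution

print(f"{min_track_mb:.3f} MB/track, {rev_time_ms:.2f} ms/rev, "
      f"switch = {switch_fraction:.0%} of a rev")
```

Because the switch time eats into every revolution, the real track must hold somewhat more than this lower bound to sustain 70 MB/s.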

Quote:

My guess (inspired by you) is that it destaged 3064KB at a time. The
slow transfer rate is probably a result of doing synchronous writes to
the disk buffers; the write would only report completion when the data
has arrived in the disk's buffers, and only then the next write would
start and weave its way through the various subsystems.

Even so that should result in something close to half the max transfer
rate (sounds as if all your writes were near the outer edge of the disk,
so we don't have to worry about varying track sizes). Or, if after
every 3+ MB written it seeked (sought? never thought about that...) to
track 0 to update block 0 (perhaps the seek back could hide behind the
next 3 MB transfer over the bus) that would still add only around 10 ms
(short seek plus 1/2 rotation on average) to the roughly 55 ms write
time, decreasing throughput only to a little under 30 MB/sec rather than
to the 6 MB/sec that you saw.

That's why I suspected that it was destaging dirty data in much smaller
chunks. For example, if it destaged 64 KB each time and then missed a
full rev before continuing (but stayed on-track rather than went to
track 0) the transfer rate would be under 7 MB/sec. But that would be a
somewhat brain-damaged way to go about things given today's on-disk
cache sizes and controller intelligence, since it could just use
multi-buffering plus a smidge more cache space to accept new data
continually while it destaged dirty data continually.
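
The 64 KB example works out as follows under a deliberately simple model: each chunk is written at the media rate and the drive then loses whole revolutions before the next chunk. The function name, the decimal reading of KB and MB, and the 70 MB/s media rate are assumptions for the sketch.

```python
def effective_rate_mb_s(chunk_bytes, media_rate_mb_s=70.0, rpm=7200,
                        missed_revs=1):
    """Throughput when every chunk costs its write time plus missed revolutions."""
    rev_s = 60.0 / rpm                              # one revolution, in seconds
    write_s = chunk_bytes / (media_rate_mb_s * 1e6)  # time to write the chunk
    return chunk_bytes / (write_s + missed_revs * rev_s) / 1e6


rate_64k = effective_rate_mb_s(64_000)
print(f"64 KB per chunk, one missed rev: {rate_64k:.2f} MB/s")
```

With these assumptions the result lands just under 7 MB/s, in the same region as the roughly 6 MB/s that was observed; with no missed revolutions the model simply returns the media rate.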

....

Quote:
My guess is that it tries to
write the blocks roughly in the order of age, and block 0 is rarely
the oldest one it sees because it gets overwritten by younger
instances all the time.

That explains why block 0 gets updated so infrequently but not the
abysmal transfer rate for the rest of the blocks. (And since block 0 is
getting updated on the platters only rarely that activity would seem to
consume only a small percentage of the disk bandwidth, whatever it is).
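
The age-order guess can be illustrated with a toy write-back cache. This is purely a model of the hypothesis, not of any real firmware: the cache capacity, the rule that an overwrite refreshes a block's age, and all names are assumptions.

```python
from collections import OrderedDict


def simulate(n_data_blocks, capacity, first_block=1000):
    """Feed 1000, 0, 1001, 0, ... into a cache that destages the oldest block.

    Returns (blocks destaged in order, blocks still dirty in the cache).
    """
    cache = OrderedDict()   # dirty blocks, oldest first
    destaged = []

    def write(block):
        if block in cache:
            cache.move_to_end(block)   # an overwrite makes the block "young" again
        else:
            cache[block] = True
        while len(cache) > capacity:
            oldest, _ = cache.popitem(last=False)
            destaged.append(oldest)    # oldest dirty block goes to the platters

    for i in range(n_data_blocks):
        write(first_block + i)
        write(0)
    return destaged, list(cache)


destaged, still_dirty = simulate(n_data_blocks=10, capacity=4)
print("destaged:", destaged)
print("still dirty at power-off:", still_dirty)
```

In this model block 0 is never the oldest entry, so it stays in the cache indefinitely while the data blocks go out in ascending order, which matches the symptom but, as noted above, says nothing about the low transfer rate.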

Quote:

Have you tested disks to see whether they indeed destaged single large
transfers out of order (as many claim to when the write is at least a
large percentage of a track in size)?

No. How would you test that?

By issuing continual near-full-track writes to random locations on a
zero-filled disk and then pulling the plug a few times to see whether
any of them wound up with a partial write that did not start at the
beginning of the request.
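
The check after power-up could look something like this sketch. It assumes each request has already been reduced to one flag per sector saying whether the sector holds new (non-zero) data; how the sectors are read back and the function name are left as assumptions.

```python
def classify_request(sectors_written):
    """Classify one write request on a previously zero-filled disk.

    sectors_written: list of bools, one per sector of the request,
    True if that sector contains the new data after power-up.
    """
    if not any(sectors_written):
        return "not written"
    if all(sectors_written):
        return "complete"
    first_gap = sectors_written.index(False)
    if any(sectors_written[first_gap:]):
        # New data appears after an unwritten sector, so the drive did not
        # destage this request strictly from its beginning.
        return "partial, out of order"
    return "partial, in order"


print(classify_request([True, True, False, False]))
print(classify_request([False, True, True, False]))
```

Only requests classified as "partial, out of order" would show that a single large transfer was destaged starting somewhere other than its first sector.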

- bill
 
Maxim S. Shatskih...
Posted: Fri Apr 10, 2009 2:13 am
Guest
Quote:
writes. Journaling file systems need guarantees about the order of
journal writes as well as journal writes relative to the writes the
journal entries describe.

Usually, the update is first written to the journal (and must reach the hard disk media) and only then is reflected in the actual metadata.

In this case, it is enough to only use FUA (or some similar thing emulated on ATA, for instance, drive's cache flush after each such write) on journal writes.
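
At the application level the same ordering rule looks roughly like the sketch below: the journal record is forced out before the in-place update starts. The record layout, file names and function name are invented for the example. Note that os.fsync only asks the operating system to flush; whether the data then actually reaches the platters rather than the drive's write cache depends on the OS, the file system and its mount options, which is the subject of this thread.

```python
import os


def journaled_update(journal_path, data_path, offset, payload):
    """Write-ahead update: journal first, flush, then modify the data file."""
    # Hypothetical record format: 8-byte offset, 4-byte length, then payload.
    record = (offset.to_bytes(8, "little")
              + len(payload).to_bytes(4, "little")
              + payload)

    # Step 1: append the record to the journal and force it out.
    with open(journal_path, "ab") as journal:
        journal.write(record)
        journal.flush()
        os.fsync(journal.fileno())

    # Step 2: only now apply the update in place.
    with open(data_path, "r+b") as data:
        data.seek(offset)
        data.write(payload)
        data.flush()
        os.fsync(data.fileno())
```

Only the journal write has to be ordered ahead of the in-place update; the flush after step 2 is included so that the sketch leaves both files in a known state.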

--
Maxim S. Shatskih
Windows DDK MVP
maxim at (no spam) storagecraft.com
http://www.storagecraft.com
 
Anton Ertl...
Posted: Fri Apr 10, 2009 5:51 pm
Guest
"Maxim S. Shatskih" <maxim at (no spam) storagecraft.com.no.spam> writes:
Quote:
writes. Journaling file systems need guarantees about the order of
journal writes as well as journal writes relative to the writes the
journal entries describe.

Usually, the update is first written to the journal (and must reach the
hard disk media) and only then is reflected in the actual metadata.

In this case, it is enough to only use FUA (or some similar thing
emulated on ATA, for instance, drive's cache flush after each such
write) on journal writes.

Yes, any feature that ensures partial ordering is sufficient. But
using write caching without any such features is not.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton at (no spam) mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
 
 