 |
|
| Computers Forum Index » Computer Architecture - Arithmetic » Hardware divide and SQRT |
|
Page 1 of 2 Goto page 1, 2 Next |
|
| Author |
Message |
| Martin Howe |
Posted: Sun Aug 28, 2005 1:00 am |
|
|
|
Guest
|
After spending a long time with the x86 architecture and now looking at the
Intel Itanium, I have become quite curious about the issue of division and
square-root in FP hardware. I have been programming for over 20 years and
also have training in VLSI design (albeit over a decade ago ). So that
said, can anyone here give a "ball-park figures" answer to the following
questions:
(a) If one were to throw silicon at a problem and create a IEEE compliant FP
hardware divider (I believe they are called cell array dividers, though
there may be other kinds) that is as fast (or not far off) as a multiplier
of the same number of bits, roughtly how much extra silicon would it consume
compared to the multiplier? 2x? 3x? 10x? Soccer pitch? :)
(b) If one were to instead turn that on its head and use the same amount of
silicon (or thereabouts) and tolerate some multi-clock implementation,
roughly how many clock tics would be needed?
(c), (d) - the same questions for SQRT.
(e), (f) - the same questions for REMAINDER
NB: For (a), (c) and (e) I mean an implementation having no software, not
even microcode, just a pure VLSI logic circuit. Is that even possible for
SQRT and REMAINDER or would the amount of logic be just stupidly high for
those, as I suspect?
I realise that any answers will of necessity be approximate and probably
simplistic, but that's OK; this is not part of a real chip or anything like
that, I'm just curious as to how far pure hardware division, SQRT and
remainder can be taken if the silicon budget is there.
TIA
--
Martin Howe
Webmaster and Kitten Registrar
Bombay and Asian Self Breed Club
www.bombaybreedclub.org |
|
|
| Back to top |
|
|
|
| Nick Maclaren |
Posted: Sun Aug 28, 2005 11:00 am |
|
|
|
Guest
|
In article <gw6Qe.438$B4.175@newsfe5-win.ntli.net>,
Martin Howe <martinhowe@myprivacy.ca> wrote:
Quote: After spending a long time with the x86 architecture and now looking at the
Intel Itanium, I have become quite curious about the issue of division and
square-root in FP hardware. I have been programming for over 20 years and
also have training in VLSI design (albeit over a decade ago  ). So that
said, can anyone here give a "ball-park figures" answer to the following
questions:
The following are from memory and may be unreliable, and refer to the
theoretical bounds. R.P. Brent wrote a paper on precisely this about
20 years back, and it is WELL worth looking up.
The first point is that none of your questions are answerable
He proved what everybody knew - that you need more logic to implement
faster arithmetic! And he gave good bounds on what was achievable.
So the following is a gross-oversimplication.
Quote: (a) If one were to throw silicon at a problem and create a IEEE compliant FP
hardware divider (I believe they are called cell array dividers, though
there may be other kinds) that is as fast (or not far off) as a multiplier
of the same number of bits, roughtly how much extra silicon would it consume
compared to the multiplier? 2x? 3x? 10x? Soccer pitch?
Entirely dependent on how fast you want it relative to the multiply.
Quote: (b) If one were to instead turn that on its head and use the same amount of
silicon (or thereabouts) and tolerate some multi-clock implementation,
roughly how many clock tics would be needed?
For a comparable amount of silicon, three times. Any implementation
that has a divide more than 5 times slower than the multiplication
is poor; any that has one more than 10 times is incompetent. You
can do it by microcode in 3-5 times (depending on overhead and
specification).
Quote: (c), (d) - the same questions for SQRT.
Same answers.
Quote: (e), (f) - the same questions for REMAINDER
Ditto.
Quote: NB: For (a), (c) and (e) I mean an implementation having no software, not
even microcode, just a pure VLSI logic circuit. Is that even possible for
SQRT and REMAINDER or would the amount of logic be just stupidly high for
those, as I suspect?
Nope. SQRT is as near as dammit equivalent in complexity, and REMAINDER
is only a little more complex.
A good way to show this is to take a Newton-Raphson software code, and
optimise it by providing hardware for the housekeeping and last-bit
stuff that is so evil to do using typical ISAs. Aim for 3x multiply
in all cases, and you shouldn't miss by much.
Note that I am NOT saying that is the way to do it - I am not into
that area - but I am saying that, what you can do in software, you can
do in hardware.
Regards,
Nick Maclaren. |
|
|
| Back to top |
|
|
|
| Martin Howe |
Posted: Sun Aug 28, 2005 11:00 am |
|
|
|
Guest
|
Hi
"Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:des0i1$sgb$1@gemini.csx.cam.ac.uk...
Quote: In article <gw6Qe.438$B4.175@newsfe5-win.ntli.net>,
(a) If one were to throw silicon at a problem and create a IEEE compliant
FP
hardware divider (I believe they are called cell array dividers, though
there may be other kinds) that is as fast (or not far off) as a multiplier
of the same number of bits, roughtly how much extra silicon would it
consume
compared to the multiplier? 2x? 3x? 10x? Soccer pitch? :)
Entirely dependent on how fast you want it relative to the multiply.
Let's say no more than 25% slower.
Quote: R.P. Brent wrote a paper on precisely this about
20 years back, and it is WELL worth looking up.
Thanks, do you have the exact title and date (so I can google it in the hope
somebody might have it online)?
Thanks
--
Martin Howe
Webmaster and Kitten Registrar
Bombay and Asian Self Breed Club
www.bombaybreedclub.org |
|
|
| Back to top |
|
|
|
| Nick Maclaren |
Posted: Sun Aug 28, 2005 11:00 am |
|
|
|
Guest
|
In article <des0i1$sgb$1@gemini.csx.cam.ac.uk>,
Nick Maclaren <nmm1@cus.cam.ac.uk> wrote:
Quote: In article <gw6Qe.438$B4.175@newsfe5-win.ntli.net>,
Martin Howe <martinhowe@myprivacy.ca> wrote:
(e), (f) - the same questions for REMAINDER
Ditto.
NB: For (a), (c) and (e) I mean an implementation having no software, not
even microcode, just a pure VLSI logic circuit. Is that even possible for
SQRT and REMAINDER or would the amount of logic be just stupidly high for
those, as I suspect?
Nope. SQRT is as near as dammit equivalent in complexity, and REMAINDER
is only a little more complex.
Oops. Which REMAINDER? I was answering for the one that is consistent
with the divide, but you may have meant the other (which I have never
understood the purpose of, in its last-bit accuracy form). I would
need to redo the calculations to be quite certain how complex that
one is, and it might be significantly slower.
Regards,
Nick Maclaren. |
|
|
| Back to top |
|
|
|
| Nick Maclaren |
Posted: Sun Aug 28, 2005 11:00 am |
|
|
|
Guest
|
In article <JSfQe.1451$x4.756@newsfe2-gui.ntli.net>,
Martin Howe <martinhowe@myprivacy.ca> wrote:
Quote: "Nick Maclaren" <nmm1@cus.cam.ac.uk> wrote in message
news:des0i1$sgb$1@gemini.csx.cam.ac.uk...
(a) If one were to throw silicon at a problem and create a IEEE compliant
FP
hardware divider (I believe they are called cell array dividers, though
there may be other kinds) that is as fast (or not far off) as a multiplier
of the same number of bits, roughtly how much extra silicon would it
consume
compared to the multiplier? 2x? 3x? 10x? Soccer pitch? :)
Entirely dependent on how fast you want it relative to the multiply.
Let's say no more than 25% slower.
That's going to be expensive in silicon - VERY expensive. If I recall,
the cost goes up asymptotically as you bring the time of the divide
down to that of the multiply.
Quote: R.P. Brent wrote a paper on precisely this about
20 years back, and it is WELL worth looking up.
Thanks, do you have the exact title and date (so I can google it in the hope
somebody might have it online)?
Sorry, no.
Regards,
Nick Maclaren. |
|
|
| Back to top |
|
|
|
| Jeff Kenton |
Posted: Sun Aug 28, 2005 9:01 pm |
|
|
|
Guest
|
Martin Howe wrote:
...
Quote:
R.P. Brent wrote a paper on precisely this about
20 years back, and it is WELL worth looking up.
Thanks, do you have the exact title and date (so I can google it in the hope
somebody might have it online)?
Perhaps this is the doc you want:
R.P. Brent, and H.T. Kung, "The Area-Time Complexity of Binary
Multiplication", Journal of the ACM, Vol. 28, No. 3, July 1981, pp. 521-534. |
|
|
| Back to top |
|
|
|
| Martin Howe |
Posted: Sun Aug 28, 2005 11:00 pm |
|
|
|
Guest
|
|
| Back to top |
|
|
|
| Guest |
Posted: Mon Aug 29, 2005 3:01 am |
|
|
|
|
On the Itanium rationale for deferring DIV SQRT and REM to software,
you might take a look at Peter Markstein's book, "IA-64 and Elementary
Functions: Speed and Precision". Prentice-Hall 2000. For the scope of
the book see www.markstein.org. He was one of the designers of HP's
Itanium math library.
Obviously Itanium could have put these operations in hardware, but the
discovery of cute and fast implementations in software meant it was not
so necessary. |
|
|
| Back to top |
|
|
|
| m |
Posted: Mon Aug 29, 2005 5:00 am |
|
|
|
Guest
|
On Sun, 28 Aug 2005 15:17:45 -0400, Jeff Kenton
<jeffrey.kenton@comcast.net> wrote:
Quote: Martin Howe wrote:
...
R.P. Brent wrote a paper on precisely this about
20 years back, and it is WELL worth looking up.
Thanks, do you have the exact title and date (so I can google it in the hope
somebody might have it online)?
Perhaps this is the doc you want:
R.P. Brent, and H.T. Kung, "The Area-Time Complexity of Binary
Multiplication", Journal of the ACM, Vol. 28, No. 3, July 1981, pp. 521-534.
which is available at http://cr.yp.to/2005-590/brent.pdf in case
anyone missed it. |
|
|
| Back to top |
|
|
|
| glen herrmannsfeldt |
Posted: Mon Aug 29, 2005 7:01 am |
|
|
|
Guest
|
Martin Howe wrote:
Quote: After spending a long time with the x86 architecture and now looking at the
Intel Itanium, I have become quite curious about the issue of division and
square-root in FP hardware. I have been programming for over 20 years and
also have training in VLSI design (albeit over a decade ago  ). So that
said, can anyone here give a "ball-park figures" answer to the following
questions:
(a) If one were to throw silicon at a problem and create a IEEE compliant FP
hardware divider (I believe they are called cell array dividers, though
there may be other kinds) that is as fast (or not far off) as a multiplier
of the same number of bits, roughtly how much extra silicon would it consume
compared to the multiplier? 2x? 3x? 10x? Soccer pitch?
Combinatorial divide is hard, but pipelining it isn't so hard.
The 360/91 uses the multiply logic to do divide, does multiply in
three cycles, short divide in 12, and long divide in 18.
(I think those are right, but I didn't look them up.)
That would seem to indicate that six times the hardware and pipelining
should be enough to do it.
-- glen |
|
|
| Back to top |
|
|
|
| Martin Howe |
Posted: Mon Aug 29, 2005 11:00 am |
|
|
|
Guest
|
Thanks, I'll look those up.
--
Martin Howe
Webmaster and Kitten Registrar
Bombay and Asian Self Breed Club
www.bombaybreedclub.org
"mk" <kal*@dspia.*comdelete> wrote in message
news:s6e5h1dree4e8q71c7ho2rji1fgqh9t8bm@4ax.com...
Quote: On Mon, 29 Aug 2005 04:13:42 GMT, mk<kal*@dspia.*comdelete> wrote:
On Sun, 28 Aug 2005 15:17:45 -0400, Jeff Kenton
jeffrey.kenton@comcast.net> wrote:
Martin Howe wrote:
...
R.P. Brent wrote a paper on precisely this about
20 years back, and it is WELL worth looking up.
Thanks, do you have the exact title and date (so I can google it in the
hope
somebody might have it online)?
Perhaps this is the doc you want:
R.P. Brent, and H.T. Kung, "The Area-Time Complexity of Binary
Multiplication", Journal of the ACM, Vol. 28, No. 3, July 1981, pp.
521-534.
which is available at http://cr.yp.to/2005-590/brent.pdf in case
anyone missed it.
on another note, here is a worthwhile place to look for most things
arithmetic:
http://web.comlab.ox.ac.uk/oucl/work/richard.brent/pub/pubsall.html |
|
|
| Back to top |
|
|
|
| Nick Maclaren |
Posted: Mon Aug 29, 2005 11:00 am |
|
|
|
Guest
|
In article <LsmdnbWta4K4N4_eRVn-hA@comcast.com>,
glen herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
Quote:
The 360/91 uses the multiply logic to do divide, does multiply in
three cycles, short divide in 12, and long divide in 18.
(I think those are right, but I didn't look them up.)
That would seem to indicate that six times the hardware and pipelining
should be enough to do it.
Yeah, but that's a poor implementation. There were several such
where I could beat the hardware in software, as far as the actual
calculation went, but was defeated by the overheads and bit-munging
(which are EVIL in the sort of ISAs that the hardware people inflict
on the software ones).
It would be paranoid to say that they do that to ensure that we can't
wipe their eyes, but sometimes I wonder :-(
As I say, take a decent software Newton-Raphson, map its operations
to plausible hardware primitives, and then wonder what they hell they
have done to do worse. Last-bit accurate division is nasty, but not
all THAT nasty. And, not merely is the factor not all that large,
the operation is partially pipelinable in all cases, and wholly so
in the faster ones!
Regards,
Nick Maclaren. |
|
|
| Back to top |
|
|
|
| glen herrmannsfeldt |
Posted: Mon Aug 29, 2005 7:00 pm |
|
|
|
Guest
|
Nick Maclaren wrote:
(snip)
Quote: As I say, take a decent software Newton-Raphson, map its operations
to plausible hardware primitives, and then wonder what they hell they
have done to do worse. Last-bit accurate division is nasty, but not
all THAT nasty. And, not merely is the factor not all that large,
the operation is partially pipelinable in all cases, and wholly so
in the faster ones!
And the 360/91 gave more accurate, and so not conforming to
the architecture, results on divide. The architecture specifies
truncated quotient, but the 360/91 rounds.
-- glen |
|
|
| Back to top |
|
|
|
| Martin Howe |
Posted: Fri Sep 02, 2005 1:05 am |
|
|
|
Guest
|
Hi All
This thread seems to be going places I didn't quite imagine - not to mention
the spam - and I also found a load of stuff on the net via google about
CPU architectures.
However, the stuff I wanted answered has been, so thanks very much to all
who responded!
Regards
Martin
--
Martin Howe
Webmaster and Kitten Registrar
Bombay and Asian Self Breed Club
www.bombaybreedclub.org |
|
|
| Back to top |
|
|
|
| John Savard |
Posted: Tue Sep 06, 2005 11:01 am |
|
|
|
Guest
|
On Sat, 27 Aug 2005 23:17:32 GMT, "Martin Howe"
<martinhowe@myprivacy.ca> wrote, in part:
Quote: (a) If one were to throw silicon at a problem and create a IEEE compliant FP
hardware divider (I believe they are called cell array dividers, though
there may be other kinds) that is as fast (or not far off) as a multiplier
of the same number of bits, roughtly how much extra silicon would it consume
compared to the multiplier? 2x? 3x? 10x? Soccer pitch?
About 3x. And 3 clock ticks would be required to do a division
regardless - but this apparatus could start a new division every clock
tick, hence it would be good for vectors.
John Savard
http://home.ecn.ab.ca/~jsavard/index.html
http://www.quadibloc.com/index.html
_________________________________________
Usenet Zone Free Binaries Usenet Server
More than 140,000 groups
Unlimited download
http://www.usenetzone.com to open account |
|
|
| Back to top |
|
|
|
|
|
All times are GMT
The time now is Sat Mar 20, 2010 1:01 am
|
|