 |
|
| Computers Forum Index » Computer Architecture » Is it time to stop research in Computer Architecture ?... |
|
|
| Author |
Message |
| Joe Pfeiffer... |
Posted: Sat Oct 24, 2009 5:15 am |
|
|
|
Guest
|
"Del Cecchi" <delcecchiofthenorth at (no spam) gmail.com> writes:
Quote:
I don't put the death of PA-RISC at Itanium's door, since HP was from
all appearances one of the parents of the Itanium architecture and
perhaps the ones that sold it to Intel, rather than vice versa.
They certainly were co-conspirators, so to speak.
So... not at Intel's door, perhaps. Still Itanium.
--
As we enjoy great advantages from the inventions of others, we should
be glad of an opportunity to serve others by any invention of ours;
and this we should do freely and generously. (Benjamin Franklin) |
|
| Robert Myers... |
Posted: Sat Oct 24, 2009 1:12 pm |
|
|
|
Guest
|
On Oct 24, 3:08 am, Terje Mathisen <terje.wiig.mathi... at (no spam) gmail.com>
wrote:
Quote: On Oct 23, 8:11 pm, ga... at (no spam) allegro.com (Gavin Scott) wrote:
For PA-RISC capability HP had very high hopes for dynamic translation.
One slide from fairly early on suggests they expected to get to 50%
of native performance using translation. In reality they failed to
scrounge up enough cleverness to do it well, and the PA-RISC
compatibility on IPF has always been poor enough that the performance
is commonly considered unacceptable even for business applications.
That's interesting:
IA64 seemed to have a close to complete superset of all PA-RISC
features/instructions, including some very funky address
shift/combination operations specifically claimed to be there to support
PA-RISC features.
The register set was so much larger that it could be mapped
statically.
If all this hw support didn't get at least 50%, then the clock rate
must have been very disappointing (which it was, right?).
If I remember the numbers Anton provided, 50% per clock for untuned
code and a less than optimal compiler seems about right, even without
accounting for translation overhead, and I doubt that the existence of
a natural mapping to the instruction set provides much relief.
As I'm writing this, I'm wondering how code translators interact with
branch predictors. It seems like a hard problem to me, and Itanium
doesn't like surprises.
Robert. |
|
| ... |
Posted: Sat Oct 24, 2009 1:59 pm |
|
|
|
Guest
|
In article <hbmdnY6DTMGrYH_XnZ2dnUVZ8tKdnZ2d at (no spam) giganews.com>,
<jgd at (no spam) cix.compulink.co.uk> wrote:
Quote:
The only other Itanium platform I ever used was HP-UX, where the x86 was
not significant. By the time people were asking for our software on
Itanium Linux, our answer was "That's going to cost you more than you
are willing to pay."
The other problem with even native IA64 code was reliability. The
skill needed to track down problems that might be code generation
bugs or obscure race conditions was MUCH greater than that needed
for other systems. So a lot of code was very unreliable.
I should be interested if anyone used desktop GUIs and applications
compiled natively, to know what they thought. The HPC people were
distinctly unhappy - SGI got the Altix going, but several owners
of 'ordinary' Linux Itanium boxes turned them off as unsupportable
for an amount of effort that made them cost-effective.
Regards,
Nick Maclaren. |
|
| ... |
Posted: Sat Oct 24, 2009 2:18 pm |
|
|
|
Guest
|
In article <2009Oct24.160207 at (no spam) mips.complang.tuwien.ac.at>,
Anton Ertl <anton at (no spam) mips.complang.tuwien.ac.at> wrote:
Quote: Robert Myers <rbmyersusa at (no spam) gmail.com> writes:
[Speed of PA-RISC emulation on Itanium]
As I'm writing this, I'm wondering how code translators interact with
branch predictors.
Direct branches are translated to direct branches and are fast (and
work well with branch predictors). In general indirect branches have
to go through a translation table and are quite a bit slower; it may
be possible to translate some patterns in a way that avoids the
translation table overhead (like the code coming out of C compilers
for switch statements); AFAIK neither PA-RISC nor IA-64
implementations have indirect branch predictors, so branch prediction
does not come into play here.
In my rather ancient experience (for other translations), that's not
the problem. It's the cases where a very commonly used instruction
in the original needs a conditional in the target, of the sort that
is hard for an automatic optimiser to remove.
Delights like whether right shift of negative values propagates the
sign bit or not. If you need a conditional every time you can't be
certain of the sign of the integer being shifted, that's bad news.
Especially as the direction of the branch may not be predictable
based solely on the location of the shift.
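A minimal Python sketch of the shift case, assuming a hypothetical target whose only right shift is logical (zero-filling) while the original instruction propagates the sign bit. The function names and the 32-bit register width are illustrative assumptions, not taken from any real translator. The first emulation needs the data-dependent conditional described above; the second avoids the branch at the price of extra operations on every shift.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit registers with Python's unbounded ints


def lsr32(x, n):
    """Logical right shift: zeros enter from the left (the assumed target primitive)."""
    return (x & MASK32) >> n


def asr32_branchy(x, n):
    """Sign-propagating shift built from lsr32 plus a conditional on the sign bit."""
    x &= MASK32
    if x & 0x80000000:  # negative value: the vacated high bits must become ones
        fill = (MASK32 << (32 - n)) & MASK32
        return lsr32(x, n) | fill
    return lsr32(x, n)


def asr32_branchless(x, n):
    """Same result without a branch: flip the sign bit, shift, subtract the shifted bias."""
    x &= MASK32
    bias = 0x80000000
    return (lsr32(x ^ bias, n) - lsr32(bias, n)) & MASK32
```

Both functions accept shift counts from 0 to 31. Whether the branch-free form is actually cheaper depends on the target; the sketch only shows that the conditional can be traded for arithmetic.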
Regards,
Nick Maclaren. |
|
| Robert Myers... |
Posted: Sat Oct 24, 2009 4:45 pm |
|
|
|
Guest
|
On Oct 24, 10:02 am, an... at (no spam) mips.complang.tuwien.ac.at (Anton Ertl)
wrote:
Quote: Robert Myers <rbmyers... at (no spam) gmail.com> writes:
[Speed of PA-RISC emulation on Itanium]
If I remember the numbers Anton provided, 50% per clock for untuned
code and a less than optimal compiler seems about right
I don't know what you think you remember, but I have not presented
PA-RISC results, simply because we have no PA-RISC box (for Gforth)
and nobody has submitted PA-RISC results (for the latex benchmark).
For those who wonder what this is all about, the message that he means
is <2009Oct22.164... at (no spam) mips.complang.tuwien.ac.at>, and the results
referred to are
http://www.complang.tuwien.ac.at/anton/euroforth/ef09/papers/ertl-sli...
http://www.complang.tuwien.ac.at/franz/latex-bench
3.
I couldn't get the link to work when I wrote the post. On your scale,
where ia32 is 1.0 performance per cycle, Itanium was between 0.35 and
0.40, barely better than ARM XScale. I took the ia32 to indicate a
compiler working with a processor that it was well-tuned to schedule
for and the Itanium results as indicative of how code that wasn't
analyzed or scheduled with much insight into ia64 would do. The PA-
RISC code would have been compiled in an environment that was
completely naive of Itanium, and I'm not surprised that it can't be
translated into code that does well on Itanium (any more than can
ia32).
If the architecture depends heavily on the compiler and the code was
compiled and scheduled by a compiler that's naive of the architecture,
it's hardly surprising that it can't be translated into code that
performs well. That they got ia32 translation to work even acceptably
seems something of a miracle to me.
Robert. |
|
| ... |
Posted: Sat Oct 24, 2009 5:12 pm |
|
|
|
Guest
|
In article <46ednbx9zPp4JULXnZ2dnUVZ_tqdnZ2d at (no spam) metrocastcablevision.com>,
billtodd at (no spam) metrocast.net (Bill Todd) wrote:
Quote: Why not? It ran x86 code natively in an integrated manner on a
native Itanic OS. As with most things Merced the original cut wasn't
impressive in terms of speed, but the relative sizes of the x86 and
Itanic processors (especially given the amount of the chip area
dedicated to cache) made it clear that full-fledged x86 cores could
be included later if necessary as soon as the next process
generations appeared.
I used it a bit. On both Merced and McKinley, the x86 had about one-third
of the throughput of native Itanium code: I was benchmarking with the
same source built both ways. The reasons for the poor performance seemed
to be:
(a) It was an x86 front-end driving the Itanium back-end execution
units. This didn't allow for the kind of speculative and out-of-order
execution that was normal in the x86 world by that time with the Pentium
Pro/II/III family, Athlon and Pentium 4. You were dropping back to
something that was essentially a fast-clocked 486.
(b) At least under Windows, you had to go through a complete execution
transition to Itanium mode and back again on every system call. This was
kind of slow, and meant that running the compilers that ran on x86 and
generated Itanium code on an Itanium was much slower than 1/3
performance.
The only other Itanium platform I ever used was HP-UX, where the x86 was
not significant. By the time people were asking for our software on
Itanium Linux, our answer was "That's going to cost you more than you
are willing to pay."
The kind of guys who take pride in being corporate "power users", who
often drive uptake of technology, even if they don't have much insight
into it, hit severe problems with Itanium. They thought "Wow, here's
this amazing new 64-bit thing that also runs my MS Office work", got
one, and found that Office had slowed down a lot for them. That kind of
ego-driven customer really hates being wrong, and holds it against the
platform, rather than questioning their own judgement. By contrast,
AMD64 gave them just what they wanted. These people can be quite
significant, even if they are basically idiots: my employers used to
belong to EDS, and there were regular corporate edicts against buying
Alpha boxes in the nineties and Itania around the turn of the
millennium, to prevent those guys wasting money. If you had a real need
for the kit, and could explain why, you could buy them through the
company, but it was a long explanation each time.
--
John Dallman, jgd at (no spam) cix.co.uk, HTML mail is treated as probable spam. |
|
| ... |
Posted: Sat Oct 24, 2009 5:12 pm |
|
|
|
Guest
|
In article <7kfe6aF39sdhsU1 at (no spam) mid.individual.net>,
delcecchiofthenorth at (no spam) gmail.com (Del Cecchi) wrote:
Quote: "Bill Todd" <billtodd at (no spam) metrocast.net> wrote in message
Save for the grace of AMD it still might have: without a credible,
inexpensive, and pervasive 64-bit alternative Intel could have just
waited until desktops began to demand 64-bit processors.
Yup. I remain grateful to AMD for saving me from a lifetime of Itanium
low-level debugging.
Quote: I don't put the death of PA-RISC at Itanium's door, since HP was from
all appearances one of the parents of the Itanium architecture and
perhaps the ones that sold it to Intel, rather than vice versa.
They certainly were co-conspirators, so to speak.
As the Intel porting training course explained it in mid-1999, the
project had started as PA-RISC 3.0 at HP. HP had realised that it would
be too expensive to develop just for the PA-RISC replacement market, and
sought a partnership with Intel.
--
John Dallman, jgd at (no spam) cix.co.uk, HTML mail is treated as probable spam. |
|
| Anton Ertl... |
Posted: Sat Oct 24, 2009 5:43 pm |
|
|
|
Guest
|
jgd at (no spam) cix.compulink.co.uk writes:
Quote: I used it a bit. On both Merced and McKinley, the x86 had about one-third
of the throughput of native Itanium code: I was benchmarking with the
same source built both ways.
....
By the time people were asking for our software on
Itanium Linux, our answer was "That's going to cost you more than you
are willing to pay."
Our IA-64 box running under Linux reacts as follows when trying to
run an IA-32 executable:
-bash: ./gforth: No such file or directory
(Note that ./gforth exists and is executable).
In contrast, when I try to execute an AMD64 or Alpha executable, I see:
-bash: ./gforth: cannot execute binary file
Judging from experience with Linux-Alpha, this probably means that the
kernel supports executing IA-32 executables, but needs a helper file
for that (on Linux-Alpha it was the emulator), and that file is
missing.
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton at (no spam) mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html |
|
| Anton Ertl... |
Posted: Sat Oct 24, 2009 6:02 pm |
|
|
|
Guest
|
Robert Myers <rbmyersusa at (no spam) gmail.com> writes:
[Speed of PA-RISC emulation on Itanium]
Quote: If I remember the numbers Anton provided, 50% per clock for untuned
code and a less than optimal compiler seems about right
I don't know what you think you remember, but I have not presented
PA-RISC results, simply because we have no PA-RISC box (for Gforth)
and nobody has submitted PA-RISC results (for the latex benchmark).
For those who wonder what this is all about, the message that he means
is <2009Oct22.164225 at (no spam) mips.complang.tuwien.ac.at>, and the results
referred to are
http://www.complang.tuwien.ac.at/anton/euroforth/ef09/papers/ertl-slides.pdf
http://www.complang.tuwien.ac.at/franz/latex-bench
Quote: As I'm writing this, I'm wondering how code translators interact with
branch predictors.
Direct branches are translated to direct branches and are fast (and
work well with branch predictors). In general indirect branches have
to go through a translation table and are quite a bit slower; it may
be possible to translate some patterns in a way that avoids the
translation table overhead (like the code coming out of C compilers
for switch statements); AFAIK neither PA-RISC nor IA-64
implementations have indirect branch predictors, so branch prediction
does not come into play here.
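The difference between the two branch kinds can be sketched with a toy dispatcher in Python. Everything here is hypothetical: the class name, the (kind, operand) block encoding, and the guest addresses are made up for illustration and do not describe the HP or Intel translators. A direct branch is linked to its target once, while an indirect branch has to map the guest address through the translation table on every execution.

```python
class ToyTranslator:
    """Toy dispatch model; guest blocks are (kind, operand) tuples keyed by guest address."""

    def __init__(self, guest_blocks):
        self.guest_blocks = guest_blocks  # guest address -> (kind, operand)
        self.table = {}                   # translation table: guest address -> host block
        self.lookups = 0                  # run-time table lookups caused by indirect branches

    def translate(self, addr):
        """Return the host block for a guest address, translating it on first use."""
        if addr not in self.table:
            kind, operand = self.guest_blocks[addr]
            self.table[addr] = self._build(addr, kind, operand)
        return self.table[addr]

    def _build(self, addr, kind, operand):
        linked = []  # filled in once for direct branches ("chaining")

        def block(state):
            state["trace"].append(addr)
            if kind == "halt":
                return None
            if kind == "direct":
                # The guest target is a constant, so link to its host block once.
                if not linked:
                    linked.append(self.translate(operand))
                return linked[0]
            # Any other kind is treated as indirect: the guest target is only
            # known at run time, so every execution goes through the table.
            self.lookups += 1
            return self.translate(state[operand])

        return block

    def run(self, entry, state):
        """Execute host blocks until one of them returns None."""
        block = self.translate(entry)
        while block is not None:
            block = block(state)
        return state


program = {
    0x100: ("direct", 0x200),    # branch to a fixed guest address
    0x200: ("indirect", "r1"),   # branch to the guest address held in r1
    0x300: ("halt", None),
}
t = ToyTranslator(program)
final = t.run(0x100, {"r1": 0x300, "trace": []})
```

After the run, `t.lookups` counts how often the indirect path was taken. In this model the direct branch at 0x100 never touches the table again once it has been linked.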
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton at (no spam) mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html |
|
| ... |
Posted: Sat Oct 24, 2009 6:39 pm |
|
|
|
Guest
|
In article <2009Oct24.154356 at (no spam) mips.complang.tuwien.ac.at>,
anton at (no spam) mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Quote: Judging from experience with Linux-Alpha, this probably means that the
kernel supports executing IA-32 executables, but needs a helper file
for that (on Linux-Alpha it was the emulator), and that file is
missing.
What do you get when you run ldd on the IA-32 executable? Just
interested, no need to worry if checking this isn't trivial. I'm
wondering if it needs a different loader, since having one of those
missing is one way of producing the error message you quote.
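The loader theory can be checked without running the binary. On Linux, executing a dynamically linked ELF file fails with ENOENT ("No such file or directory") when the interpreter named in its PT_INTERP program header does not exist, even though the executable itself is present. A minimal Python sketch that reads that header; the function names are made up for illustration, and error handling for truncated files is left out.

```python
import os
import struct

PT_INTERP = 3  # program-header type that names the ELF interpreter (dynamic loader)


def elf_interpreter(path):
    """Return the PT_INTERP path of an ELF file, or None if it names no interpreter."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is64 = data[4] == 2                     # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    endian = "<" if data[5] == 1 else ">"   # EI_DATA: 1 = little-endian
    if is64:
        e_phoff, = struct.unpack_from(endian + "Q", data, 0x20)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x36)
    else:
        e_phoff, = struct.unpack_from(endian + "I", data, 0x1C)
        e_phentsize, e_phnum = struct.unpack_from(endian + "HH", data, 0x2A)
    for i in range(e_phnum):
        base = e_phoff + i * e_phentsize
        p_type, = struct.unpack_from(endian + "I", data, base)
        if p_type != PT_INTERP:
            continue
        if is64:
            p_offset, = struct.unpack_from(endian + "Q", data, base + 0x08)
            p_filesz, = struct.unpack_from(endian + "Q", data, base + 0x20)
        else:
            p_offset, = struct.unpack_from(endian + "I", data, base + 0x04)
            p_filesz, = struct.unpack_from(endian + "I", data, base + 0x10)
        return data[p_offset:p_offset + p_filesz].rstrip(b"\0").decode()
    return None


def missing_interpreter(path):
    """Return the interpreter path if the binary names one that does not exist, else None."""
    interp = elf_interpreter(path)
    if interp is not None and not os.path.exists(interp):
        return interp
    return None
```

Calling `missing_interpreter("./gforth")` on the affected machine would return the loader path the system lacks, or None if the loader is present or the file is statically linked.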
--
John Dallman, jgd at (no spam) cix.co.uk, HTML mail is treated as probable spam. |
|
| Robert Myers... |
Posted: Sat Oct 24, 2009 7:25 pm |
|
|
|
Guest
|
On Oct 24, 2:39 pm, an... at (no spam) mips.complang.tuwien.ac.at (Anton Ertl)
wrote:
Quote: Robert Myers wrote
On your scale,
where ia32 is 1.0 performance per cycle,
Different IA32 implementations have different performance per cycle in
the range of 0.55-1.0.
Itanium was between 0.35 and
0.40, barely better than ARM XScale.
~0.39, in the same ballpark as the other non-IA32/AMD64 CPUs (~0.34-0.53).
I took the ia32 to indicate a
compiler working with a processor that it was well-tuned to schedule
for and the Itanium results as indicative of how code that wasn't
analyzed or scheduled with much insight into ia64 would do.
So the PPC, Alpha and ARM results are also due to lack of insight into
the scheduling requirements of the CPU in your opinion?
My theory (which you can find in the text of that slide) is that the
better performance of the IA32 and AMD64 implementations on this
benchmark is because they perform indirect-branch prediction and most
of the others do not (hmm, the 21264B also has a kind of
indirect-branch predictor, but the performance is still not so great
at ~0.43; I have no theory for that).
Unless the PA-RISC implementation you are thinking of has an
indirect-branch predictor, I have no reason to expect it to perform
better than ~0.5.
I don't have enough insight into the other architectures to comment.
I first looked at the chart and said, yup, just like I said, it's a
compiler built and tuned around x86.
I don't have any insight into what being architecture-naive on the
other architectures might be, but, for Itanium, you have to start with
deep insight into the code in order to get a payback on all the fancy
bells and whistles. Itanium should be getting more instructions per
clock, not significantly fewer (that *was* the idea, wasn't it?).
Even with respect to the other architectures, it's only in the pack.
Once you're past the source code and information you can preserve from
it in intermediate representations, you have an expensive space
heater.
I just happened to have your charts fresh in mind when I made the
comment, and neither your results nor the fact that binary translation
doesn't work well is a surprise. My apologies if you feel that I
overinterpreted your numbers and didn't give sufficient credit to your
own analysis.
Robert. |
|
| Anton Ertl... |
Posted: Sat Oct 24, 2009 10:05 pm |
|
|
|
Guest
|
jgd at (no spam) cix.compulink.co.uk writes:
Quote: In article <2009Oct24.154356 at (no spam) mips.complang.tuwien.ac.at>,
anton at (no spam) mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Judging from experience with Linux-Alpha, this probably means that the
kernel supports executing IA-32 executables, but needs a helper file
for that (on Linux-Alpha it was the emulator), and that file is
missing.
What do you get when you run ldd on the IA-32 executable?
[ia64:~/gforth:25338] ldd ./gforth
not a dynamic executable
[ia64:~/gforth:25339] file ./gforth
./gforth: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.8, not stripped
Quote: I'm
wondering if it needs a different loader, since having one of those
missing is one way of producing the error message you quote.
It's possible that it needs a different loader. AFAIK ldd then needs
that loader, too. Looking at the strace for ldd, I see:
stat("/lib/ld-linux.so.2", 0x60000fffffe4b480) = -1 ENOENT (No such file or directory)
With that, I found that package I needed to install on this Debian
system (ia32-libs), and now I can run IA32 programs on this IA64
machine. I just ran some simple Gforth benchmarks on it:
sieve  bubble  matrix  fib
0.764  1.000   0.560   1.188  IA64 code (gcc-4.1) on 900MHz Itanium II
1.840  2.284   1.080   2.796  IA32 code (gcc-2.95) on 900MHz Itanium II
0.261  0.299   0.156   0.375  IA32 code (gcc-2.95) on 2.26GHz Pentium 4
(These gcc versions give good performance for Gforth).
Note that this Pentium 4 (released in May 2002 according to Wikipedia)
is contemporary with this Itanium II (released on 2002-07-08 according
to Wikipedia).
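The table is easier to compare as ratios. The short Python sketch below assumes the figures are run times (lower is better), which the post does not state; the dictionary keys are labels chosen here for illustration.

```python
benchmarks = ("sieve", "bubble", "matrix", "fib")

# Figures copied from the table above; assumed to be run times.
results = {
    "IA64 on Itanium II": (0.764, 1.000, 0.560, 1.188),
    "IA32 on Itanium II": (1.840, 2.284, 1.080, 2.796),
    "IA32 on Pentium 4": (0.261, 0.299, 0.156, 0.375),
}


def ratios(numerator, denominator):
    """Per-benchmark ratio of one row of the table to another."""
    pairs = zip(benchmarks, results[numerator], results[denominator])
    return {name: n / d for name, n, d in pairs}


emulation_vs_native = ratios("IA32 on Itanium II", "IA64 on Itanium II")
native_vs_pentium4 = ratios("IA64 on Itanium II", "IA32 on Pentium 4")

for label, table in (
    ("IA32 on Itanium II / IA64 on Itanium II", emulation_vs_native),
    ("IA64 on Itanium II / IA32 on Pentium 4", native_vs_pentium4),
):
    print(label, {name: round(value, 2) for name, value in table.items()})
```

Under that assumption, the IA32 build on the Itanium II takes roughly 1.9 to 2.4 times as long as the native IA64 build, and the native IA64 build takes roughly 2.9 to 3.6 times as long as the IA32 build on the Pentium 4. The clock rates differ as well (900MHz against 2.26GHz), so the second ratio is not a per-cycle comparison.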
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton at (no spam) mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html |
|
| Anton Ertl... |
Posted: Sat Oct 24, 2009 10:39 pm |
|
|
|
Guest
|
Robert Myers <rbmyersusa at (no spam) gmail.com> writes:
That's no wonder because apparently your Newsreader mutilates it.
Here it is again:
http://www.complang.tuwien.ac.at/anton/euroforth/ef09/papers/ertl-slides.pdf
Quote: On your scale,
where ia32 is 1.0 performance per cycle,
Different IA32 implementations have different performance per cycle in
the range of 0.55-1.0.
Quote: Itanium was between 0.35 and
0.40, barely better than ARM XScale.
~0.39, in the same ballpark as the other non-IA32/AMD64 CPUs (~0.34-0.53).
Quote: I took the ia32 to indicate a
compiler working with a processor that it was well-tuned to schedule
for and the Itanium results as indicative of how code that wasn't
analyzed or scheduled with much insight into ia64 would do.
So the PPC, Alpha and ARM results are also due to lack of insight into
the scheduling requirements of the CPU in your opinion?
My theory (which you can find in the text of that slide) is that the
better performance of the IA32 and AMD64 implementations on this
benchmark is because they perform indirect-branch prediction and most
of the others do not (hmm, the 21264B also has a kind of
indirect-branch predictor, but the performance is still not so great
at ~0.43; I have no theory for that).
Unless the PA-RISC implementation you are thinking of has an
indirect-branch predictor, I have no reason to expect it to perform
better than ~0.5.
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton at (no spam) mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html |
|
| Robert Myers... |
Posted: Sat Oct 24, 2009 11:22 pm |
|
|
|
Guest
|
On Oct 24, 6:31 pm, Bernd Paysan <bernd.pay... at (no spam) gmx.de> wrote:
Quote: One interesting property of quantum mechanics is that for irreversible
logic, there's a minimum amount of energy that is necessary to make it
happen. Reversible logic does not have this drawback. Therefore,
people investigate reversible logic, even though the actual
components to get that benefit are not in sight (not even carbon nanotube
switches have these properties, even though they are much closer to the
physical limits for irreversible logic). Many people also forget that
quantum mechanics does not properly take changes in the system into
account, and that means that your reversible logic only works with the
predicted low power when the inputs are not changing any more - and this
is just the uninteresting case (the coherent one - changes in the system
lead to decoherence, and thereby to classical physics).
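The minimum energy mentioned here is usually identified with Landauer's bound of k_B * T * ln 2 per erased bit; reading the quoted paragraph that way is an assumption, since the post does not name it. A small Python sketch that puts a number on it, with 300 K chosen only as an illustrative room temperature:

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant (exact SI value)


def landauer_limit_joules(temperature_kelvin):
    """Minimum heat dissipated per irreversibly erased bit: k_B * T * ln 2."""
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


room_temperature_limit = landauer_limit_joules(300.0)
print(f"{room_temperature_limit:.3e} J per erased bit at 300 K")
```

At 300 K this comes to a little under 3e-21 joules per bit.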
Let's see. Quantum mechanics properly applied takes account of
everything in the whole universe, which is, so far as I know, quantum
mechanical and reversible in its entirety. If you could isolate
parts of the system, like your computing apparatus, then it would be
like a universe that is quantum mechanical and reversible in its
entirety. Such a device would have little use to us, because we could
neither give it new problems to work on nor read the results when it's
done.
In order to give the device a new problem, we must disturb it, but the
system can still retain enough coherence to function as a quantum
mechanical device. Only the entropy involved in the process of giving
the device input and reading the output has an irreducible cost in
energy that we must put on to the electric bill, as we will never get
it back, except as waste heat.
Thus, even though you can't do operations with *no* net cost in
energy, we can still build and operate devices that act as quantum
mechanical computers to an arbitrarily good approximation. Writing to
them and reading from them is always an irreversible process that, if
repeated often enough, will eventually lead to the device having no
useful quantum mechanical coherence left for us to exploit, as we have
destroyed it all through our reading and writing. In the interim, we
can do an awful lot of computation. Otherwise, "quantum computers"
would not be possible.
I'm having a hard time reconciling how I understand the problem with
what you just said, which seems too sweeping and too black and white.
Can you help me out?
Robert. |
|
| Robert Myers... |
Posted: Sun Oct 25, 2009 1:59 am |
|
|
|
Guest
|
On Oct 24, 9:40 pm, Andrew Reilly <andrew-newsp... at (no spam) areilly.bpc-users.org> wrote:
Quote: On Sat, 24 Oct 2009 12:25:40 -0700, Robert Myers wrote:
I don't have any insight into what being architecture-naive on the other
architectures might be, but, for Itanium, you have to start with deep
insight into the code in order to get a payback on all the fancy bells
and whistles. Itanium should be getting more instructions per clock,
not significantly fewer (that *was* the idea, wasn't it?).
I've not used an Itanium, but it would seem to have quite a bit of
practical similarity to the Texas Instruments TIC6000 series of VLIW DSP
processors, in that it is essentially in-order VLIW with predicated
instructions and some instruction encoding funkiness. That whole idea is
*predicated* on being able to software-pipeline loop bodies and do enough
iterations to make them a worthwhile fraction of your execution time.
From memory, Anton's TeX benchmark is the exact opposite: strictly
integer code of the twistiest non-loopy conditional nature. I would not
expect even a heroic compiler to get *any* significant parallel issues
going, at which point it falls back to being an in-order RISC-like
machine: not dramatically unlike a pre-Cortex ARM, or SPARC, as you said.
Now, Texas' compilers for the C6000 *are* heroic, and I've seen them
regularly schedule all eight possible instruction slots active per cycle,
for appropriate DSP code. The interesting thing is that this process is
*extremely* fragile. If the loop body contains too many instructions
(for whatever reason), or some other limitation, then the compiler seems
to throw up its hands and give you essentially single-instruction-per-cycle
code, which is (comparatively) hopeless. Smells like a box full of
heuristics, rather than reliable proof. The only way to proceed is to
hack the source code into little pieces and try variations until the
compiler behaves "well" again.
At least the TI parts *do* get low power consumption out of the deal, and
since they clock more slowly they don't have quite so many cycles to wait
for a cache miss. And no-one is trying to run TeX on them...
I get so tense here, trying to make sure I don't make a grotesque
mistake.
Your post made me chuckle. Thanks. I actually didn't even look at
the TeX numbers, only the ones I had first relied upon. As a seventh-grade
teacher remarked, my laziness might one day be my undoing.
Thanks for calling attention to the TI compiler. I've looked at the
TI DSP chips, but never gotten further.
You know just how heroic a heroic compiler really is. I don't know
whether David Dinucci (did I get it right?) is still following.
Robert. |
All times are GMT