 |
|
| Computers Forum Index » Computer Architecture » hardware prefeth, branch predictor in the context of... |
|
Page 1 of 2 Goto page 1, 2 Next |
|
| Author |
Message |
| Sid Touati... |
Posted: Wed Oct 28, 2009 10:08 pm |
|
|
|
Guest
|
Hi all,
branch predictors and data prefetchors are usually evaluated by
considering a single task: the fixed benchmark is the unique program
executed for evaluating the efficiency of the hardware data prefetchor
or the branch predictor.
With the multicore era, desktop and servers will execute more and more
multi-threaded applications, or multiple distinct applications, from
distinct users. When executing multiple threads from multiple
applications, branch predictors and data prefetchor are disturbed, and
their "learning" becomes erroneous (especially when they use physical
address as tags).
Does anyone know about a serious experimental study of the performance
of hardware data prefetchor and branch predictor in such context ?
Thanks
S, waiting for the next generation of branch predictors and data
prefetchors for multicore processors |
|
|
| Back to top |
|
|
|
| Andy \"Krazy\" Glew... |
Posted: Thu Oct 29, 2009 5:15 am |
|
|
|
Guest
|
Sid Touati wrote:
Quote: Hi all,
branch predictors and data prefetchors are usually evaluated by
considering a single task: the fixed benchmark is the unique program
executed for evaluating the efficiency of the hardware data prefetchor
or the branch predictor.
With the multicore era, desktop and servers will execute more and more
multi-threaded applications, or multiple distinct applications, from
distinct users. When executing multiple threads from multiple
applications, branch predictors and data prefetchor are disturbed, and
their "learning" becomes erroneous (especially when they use physical
address as tags).
Does anyone know about a serious experimental study of the performance
of hardware data prefetchor and branch predictor in such context ?
Thanks
S, waiting for the next generation of branch predictors and data
prefetchors for multicore processors
a) There have been studies in academia, published I believe, on the
effects of context switching on branch predictors. As you might expect,
the more context switching, the worse.
b) Most branch predictors in my experience use virtual addresses,
although using physical addresses can shave a cycle in front end.
c) P6 anecdote, circa 1991: the IFU (I-cache) designer wanted to flush
the BTB on all context switches. Because we cross checked, we did not
need to do so for correctness, and not flushing turned out to be a
slight performance win.
d) Multicore in some ways *reduces* the frequency of context switches
(compared to the same workload running timesliced), so predictors may
improve. It's all a question of what you measure with respect to.
e) Since many multicore and GP-GPU workloads run the same code on
multiple processors, one might hope for possible IMPROVEMENTS in branch
predictors. Especially if learning from one thread can help another.
E.g. shared BIdB (Branch Identification Buffer) and BTB - basically,
shared big expensive tagged structures. Private histories.
Problem: nobody wants to have shared structures. It's nicer if the
cores are independent. But if your units start becoming clusters of 2,4
processor, then such sharing is reasonable. Similarly, SIMT/CT
(Choherent Threading) warps or clusters may easily emply a shared branch
predictor. There should also be optimizations related to the mainly
shared history.
f) I've long wanted to have the option of loading/unloading predictor
state like other context. Trouble is, it is often faster to recompute
than reload. |
|
|
| Back to top |
|
|
|
| Quadibloc... |
Posted: Fri Oct 30, 2009 7:21 pm |
|
|
|
Guest
|
On Oct 28, 12:08 pm, Sid Touati <SidnospamTou... at (no spam) inria.fr> wrote:
Quote: With the multicore era, desktop and servers will execute more and more
multi-threaded applications, or multiple distinct applications, from
distinct users. When executing multiple threads from multiple
applications, branch predictors and data prefetchor are disturbed, and
their "learning" becomes erroneous (especially when they use physical
address as tags).
Multicore processors help, rather than hindering, as someone else
already noted, since threads running on other processors are
irrelevant; the branch predictor is a part of each core, so if there
are other processors, this means fewer threads from the total have to
be handled by each core.
If one has a multithreaded core, then that core should have separate
branch predictor states for each thread as well.
John Savard |
|
|
| Back to top |
|
|
|
| ... |
Posted: Fri Oct 30, 2009 7:54 pm |
|
|
|
Guest
|
In article <dd85042e-2ccf-4691-bada-6e5b29c4fdde at (no spam) 2g2000prl.googlegroups.com>,
Quadibloc <jsavard at (no spam) ecn.ab.ca> wrote:
Quote: On Oct 28, 12:08=A0pm, Sid Touati <SidnospamTou... at (no spam) inria.fr> wrote:
With the multicore era, desktop and servers will execute more and more
multi-threaded applications, or multiple distinct applications, from
distinct users. =A0When executing multiple threads from multiple
applications, branch predictors and data prefetchor are disturbed, and
their "learning" becomes erroneous (especially when they use physical
address as tags).
Multicore processors help, rather than hindering, as someone else
already noted, since threads running on other processors are
irrelevant; the branch predictor is a part of each core, so if there
are other processors, this means fewer threads from the total have to
be handled by each core.
Grrk. It's not that simple Under most circumstances, threads
tend to wander around CPUs, and kernel code is often executed on
the CPU that invoked it. While it is rare for them to make the
problem worse, it can happen.
Regards,
Nick Maclaren. |
|
|
| Back to top |
|
|
|
| Mayan Moudgill... |
Posted: Fri Oct 30, 2009 9:01 pm |
|
|
|
Guest
|
Sid Touati wrote:
Quote: Andy "Krazy" Glew a écrit :
f) I've long wanted to have the option of loading/unloading predictor
state like other context. Trouble is, it is often faster to recompute
than reload.
You can save the branch predictor tables and restore them on a context
switch. Or you can zero out the tables on a context switch. Or you can
just leave them alone, and let them correct themselves as the
switched-in program runs.
Turns out that there is not much point to doing either of the first two
approaches; the branch predictor will correct itself pretty quickly -
quickly enough that the extra cycles spent unloading and reloading the
predictor tables on a context switch overwhelm the actual performance gain. |
|
|
| Back to top |
|
|
|
| Chris Gray... |
Posted: Sat Oct 31, 2009 2:07 am |
|
|
|
Guest
|
Mayan Moudgill <mayan at (no spam) bestweb.net> writes:
Quote: You can save the branch predictor tables and restore them on a context
switch. Or you can zero out the tables on a context switch. Or you can
just leave them alone, and let them correct themselves as the
switched-in program runs.
Turns out that there is not much point to doing either of the first
two approaches; the branch predictor will correct itself pretty
quickly -
quickly enough that the extra cycles spent unloading and reloading the
predictor tables on a context switch overwhelm the actual performance
gain.
Is it possible to maintain a small reversible *summary* of the contents of
the branch prediction unit? Something like a 64 bit word that has 6 sets of
low-4-bits-of-PC+1-bit-of-taken plus 4 other bits. I guess what I'm thinking
of here might be just having a second, very small, branch predictor state
that gets a decent number of branches correct. What it suggests is overridden
by the main predictor's state.
Also, what about turning off the use of the branch predictor when switching
into kernel code, then turning it back on on return to user mode? The first
instruction when returning to user mode could be "start reloading the branch
predictor from this summary word", and the last could be "return to user mode
and re-enable branch predictor".
Just throwing out some weird ideas.
--
Experience should guide us, not rule us.
Chris Gray cg at (no spam) GraySage.COM
http://www.Nalug.ORG/ (Lego)
http://www.GraySage.COM/cg/ (Other) |
|
|
| Back to top |
|
|
|
| Andy \"Krazy\" Glew... |
Posted: Sat Oct 31, 2009 5:07 am |
|
|
|
Guest
|
Sid Touati wrote:
Quote: Andy "Krazy" Glew a écrit :
a) There have been studies in academia, published I believe, on the
effects of context switching on branch predictors. As you might
expect, the more context switching, the worse.
do you have exact references on such academic studies ? Of course I was
talking about real experiments, not simulations. Simulating the
performances of multicore systems is tricky.
b) Most branch predictors in my experience use virtual addresses,
although using physical addresses can shave a cycle in front end.
Fine, how do they distinguish between the PC of two separate
applications running in parallel on the same multicore processor ?
On existing Intel and AMD multicore systems, the processor cores are
separate, and do not were breach predictors.it would be hard to share
BP, since the cores are attached only at the L2 or L3 cache.
Multithreaded cores could share branch predictorsbetween threads.
Whether they share tables but arrange to hash different threads'
versions of the same address to different table entries is interesting.
a minor research toper.
Quote: you are right when we talk about executing multiple open-mp threads of
the same application. In practice, multiple applications can be run in
parallel, and this is the way we use computers usually (batch mode is
reserved for special situations only)
Nevertheless, even different apps share many OS services and DLLs-the.
training from one for the shared code may help a separate app, whether
running simultaneously or later in time. The problem is distinguishing
shared from unshared. Standard multi predictor choosers may work,
choosing between shared & unshared. Similarly, standard techniques such
as partial tags, and the aforementioned hashing.
- - -
Sorry if I am cryptic.
Writing this using handwriting recognition on a tablet PC in a shuttle
ran driving from Seattle to Portland as part of my weekly commute
to/from Intellectual Ventures in Bellevue.
The ran bounces so much that keyboard is almost useless.
Interesting: cursive is normally better than printing, but not when
driving/ vibrating. |
|
|
| Back to top |
|
|
|
| Andy \"Krazy\" Glew... |
Posted: Sat Oct 31, 2009 5:15 am |
|
|
|
Guest
|
Quadibloc wrote:
Quote: If one has a multithreaded core, then that core should have separate
branch predictor states for each thread as well.
John Savard
could, not necessarily should. |
|
|
| Back to top |
|
|
|
| Andy \"Krazy\" Glew... |
Posted: Sat Oct 31, 2009 5:15 am |
|
|
|
Guest
|
Mayan Moudgill wrote:
Quote: Sid Touati wrote:
Andy "Krazy" Glew a écrit :
f) I've long wanted to have the option of loading/unloading predictor
state like other context. Trouble is, it is often faster to
recompute than reload.
You can save the branch predictor tables and restore them on a context
switch. Or you can zero out the tables on a context switch. Or you can
just leave them alone, and let them correct themselves as the
switched-in program runs.
Turns out that there is not much point to doing either of the first two
approaches; the branch predictor will correct itself pretty quickly -
quickly enough that the extra cycles spent unloading and reloading the
predictor tables on a context switch overwhelm the actual performance gain.
Exactly. All work in thisarea has been disappointing. Perhaps as tables
grow bigger - but it is also quite likely that cross app training can be
useful.
Also, r/W interfaces to BP arrays have a cost. Maybe the DFT guys have)
want suchaccess ports.
I am skeptical of any proposal to context switchpredictor state.
Perhaps an efficient delta- not the whole table, but a list of the most
costly mispredicts. |
|
|
| Back to top |
|
|
|
| Andy \"Krazy\" Glew... |
Posted: Wed Nov 04, 2009 6:15 am |
|
|
|
Guest
|
Anton Ertl wrote:
Quote: "Andy \"Krazy\" Glew" <ag-news at (no spam) patten-glew.net> writes:
b) Most branch predictors in my experience use virtual addresses,
although using physical addresses can shave a cycle in front end.
I would expect physical addresses to cost extra cycles, because there
is additional translation.
Other way around.
If you have a physically addressed I$, but a virtual branch predictor,
you have to translate the, e.g., virtual branch target addresses into
physical, giving you latency on a predicted taken branch. On the other
hand, it is I-fetch, where latency can often be tolerated.
Whereas you could use physical addresses for I-fetch: e.g. have a
current I-fetch PC (Intel parlance, PFIP, physical fetch instruction
pointer (I made that up)), increment it to the next I$ line. Have the
BTB have physical addresses. Trouble is, you have to do extra work,
like translating when sequential instruction fetch crosses a page
boundary, or remembering such crossings. You pretty much have to
maintain the virtual or linear, VFIP or VLIP, instruction pointers as
well, although maybe not as fast as the main PFIP. |
|
|
| Back to top |
|
|
|
| Andy \"Krazy\" Glew... |
Posted: Wed Nov 04, 2009 6:15 am |
|
|
|
Guest
|
Anton Ertl wrote:
Quote: "Andy \"Krazy\" Glew" <ag-news at (no spam) patten-glew.net> writes:
b) Most branch predictors in my experience use virtual addresses,
although using physical addresses can shave a cycle in front end.
I would expect physical addresses to cost extra cycles, because there
is additional translation.
Other way around.
If you have a physically addressed I$, but a virtual branch predictor,
you have to translate the, e.g., virtual branch target addresses into
physical, giving you latency on a predicted taken branch. On the other
hand, it is I-fetch, where latency can often be tolerated.
Whereas you could use physical addresses for I-fetch: e.g. have a
current I-fetch PC (Intel parlance, PFIP, physical fetch instruction
pointer (I made that up)), increment it to the next I$ line. Have the
BTB have physical addresses. Trouble is, you have to do extra work,
like translating when sequential instruction fetch crosses a page
boundary, or remembering such crossings. You pretty much have to
maintain the virtual or linear, VFIP or VLIP, instruction pointers as
well, although maybe not as fast as the main PFIP. |
|
|
| Back to top |
|
|
|
| Anton Ertl... |
Posted: Wed Nov 04, 2009 11:10 pm |
|
|
|
Guest
|
"Andy \"Krazy\" Glew" <ag-news at (no spam) patten-glew.net> writes:
Quote: Anton Ertl wrote:
"Andy \"Krazy\" Glew" <ag-news at (no spam) patten-glew.net> writes:
b) Most branch predictors in my experience use virtual addresses,
although using physical addresses can shave a cycle in front end.
I would expect physical addresses to cost extra cycles, because there
is additional translation.
Other way around.
If you have a physically addressed I$,
Some years ago the usual way was virtually-indexed physically-tagged
L1 caches. Has this changed?
Quote: but a virtual branch predictor,
Ah, you mean the addresses coming out of the branch predictor, right?
I was thinking about the addresses going in; that's because
conditional branch predictors only predict taken/not-taken, and
because the question being discussed was the aliasing in the branch
predictor from merging the histories of different threads.
For the addresses going in using physical addresses would increase the
latency (or at least the hardware required), and the benefit is
probably small.
Quote: you have to translate the, e.g., virtual branch target addresses into
physical, giving you latency on a predicted taken branch. On the other
hand, it is I-fetch, where latency can often be tolerated.
For the BTB, storing physical addresses may be a good idea (if it
gives any advantage over virtually-indexed physically-tagged access).
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton at (no spam) mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html |
|
|
| Back to top |
|
|
|
| Andy \"Krazy\" Glew... |
Posted: Thu Nov 05, 2009 6:15 am |
|
|
|
Guest
|
Anton Ertl wrote:
Quote: "Andy \"Krazy\" Glew" <ag-news at (no spam) patten-glew.net> writes:
If you have a physically addressed I$,
Some years ago the usual way was virtually-indexed physically-tagged
L1 caches. Has this changed?
Although there have been a number of systems with virtually indexed
physically tagged D$ and I$, including IIRC the Willamette L0 D$, array lookup,
most Intel x86s of the P6 family have physically indexed, physically
tagged, caches.
IMHO virtual indexing has gotten a bit of a bad rap. But, it certainly
had a bad reputation in quite a few design groups.
Quote: but a virtual branch predictor,
Ah, you mean the addresses coming out of the branch predictor, right?
Could be the address coming out
Could be the address going in.
It is convenient if they are of the same type, so that you can feed
the predictor output right back to the input.
Quote:
I was thinking about the addresses going in; that's because
conditional branch predictors only predict taken/not-taken, and
because the question being discussed was the aliasing in the branch
predictor from merging the histories of different threads.
For the addresses going in using physical addresses would increase the
latency (or at least the hardware required), and the benefit is
probably small.
Why would physical addresses going in increase the latency?
They would not increase latency of the array lookup or tag match.
They add complexity. And they require the target to be translated
when it is put into the array, typically on a misprediction when you are
doing an ifetch anyway.
Quote: For the BTB, storing physical addresses may be a good idea (if it
gives any advantage over virtually-indexed physically-tagged access).
Like I said, unclear if it is a complexity win. Definitely costs devices. |
|
|
| Back to top |
|
|
|
| Terje Mathisen... |
Posted: Mon Nov 09, 2009 2:00 am |
|
|
|
Guest
|
Quadibloc wrote:
Quote: On Nov 2, 3:08 am, "Ken Hagan"<K.Ha... at (no spam) thermoteknix.com> wrote:
On Fri, 30 Oct 2009 19:21:28 -0000, Quadibloc<jsav... at (no spam) ecn.ab.ca> wrote:
If one has a multithreaded core, then that core should have separate
branch predictor states for each thread as well.
Isn't that the same as "For a multithreaded core, the space available for
storing branch predictor state should be divided exactly 1/N to each
context."? That's fair to each thread, but not necessarily the best use of
a presumably limited resource.
It's only the same if one has a branch predictor that is capable of
working that way. I certainly do agree that if one can optimally
allocate branch predictor state without incurring inordinate costs for
that capability, one should do so.
However, I was trying to get at something much more simple, and I
think less controversial:
If one has a multithreaded core, branch predictor information should
be labelled by thread, so that information gathered about the branches
in one thread isn't used to control how branches in another thread are
handled. The branch predictor should not simply ignore the fact that
multiple different threads are being executed in the core.
In a multicore cpu, this is very probably exactly the wrong thing to do:
The usual programming paradigm for such a system is to have many threads
running the same algorithm, which means that training information from
one thread is likely to be useful for another, or at least not detrimental.
Cores that run different functions, will have a separate set of branches
to consider, and again each set running the same code can share branch info.
The main reason for keeping them separate is simply that the branch
predictions needs to be very close to the instruction fetch and
execution units, something which is hard to achieve if a single large
global branch table is many cycles away.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching" |
|
|
| Back to top |
|
|
|
| Quadibloc... |
Posted: Mon Nov 09, 2009 3:58 am |
|
|
|
Guest
|
On Nov 8, 2:00 pm, Terje Mathisen <Terje.Mathi... at (no spam) tmsw.no> wrote:
Quote: The usual programming paradigm for such a system is to have many threads
running the same algorithm, which means that training information from
one thread is likely to be useful for another, or at least not detrimental.
Ah, I thought the usual operation of a multicore system is to have an
operating system running multiple different applications at once, and
the operating system itself, so that the system would have more
different applications providing threads than there were cores.
Thus, on a Windows PC, when I look at Task Manager, under the
Processes tab, I usually find more than four things listed there.
Admittedly, if I was using multicore chips in a supercomputer in order
to do massively-parallel number-crunching, I probably _would_ be using
the system as you describe. In fact, even on a PC, if I was playing
certain graphics-intensive computer games, that may well be what I
want. So the situation you describe, even if not "usual", is the one
that applies... the only times when performance is critical.
John Savard |
|
|
| Back to top |
|
|
|
|
|
All times are GMT
The time now is Tue Nov 24, 2009 4:48 am
|
|