Question on Lynnfield chip layout...
Stephen Fuld
Posted: Tue Nov 03, 2009 4:18 am
The annotated photo of Intel's new Lynnfield chip at
http://techreport.com/articles.x/17545
shows the cores in the middle, with the memory interface at the top and
the L3 cache on the bottom. This means that there must be wires running
from the L3, the whole way across the chip to the memory interface.
Why didn't they "swap" the position of the cores with the L3 cache so
that these wires would be much shorter and not have to cross any other
major part of the chip? Has this something to do with heat dissipation?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
Brett Davis
Posted: Tue Nov 03, 2009 6:15 am
In article <hcnpbp$11e$1 at (no spam) news.eternal-september.org>,
Stephen Fuld <SFuld at (no spam) alumni.cmu.edu.invalid> wrote:
Quote: The annotated photo of Intel's new Lynnfield chip at
http://techreport.com/articles.x/17545
shows the cores in the middle, with the memory interface at the top and
the L3 cache on the bottom. This means that there must be wires running
from the L3, the whole way across the chip to the memory interface.
Why didn't they "swap" the position of the cores with the L3 cache so
that these wires would be much shorter and not have to cross any other
major part of the chip? Has this something to do with heat dissipation?
I believe that loads go straight to L2 or L1; the L3 is just a victim
cache for the L2, a way station for lazy writes to RAM.
Brett
Andy "Krazy" Glew
Posted: Wed Nov 04, 2009 6:15 am
Stephen Fuld wrote:
Quote: The annotated photo of Intel's new Lynnfield chip at
http://techreport.com/articles.x/17545
shows the cores in the middle, with the memory interface at the top and
the L3 cache on the bottom. This means that there must be wires running
from the L3, the whole way across the chip to the memory interface.
Why didn't they "swap" the position of the cores with the L3 cache so
that these wires would be much shorter and not have to cross any other
major part of the chip? Has this something to do with heat dissipation?
With the cache on an edge, you can easily create variants of the chip:
a chip with CPUs and memory interface but no cache, 1M, 2M ...
Del Cecchi
Posted: Sat Nov 07, 2009 9:24 pm
"Terje Mathisen" <Terje.Mathisen at (no spam) tmsw.no> wrote in message
news:qcSdnejIo-XvGGjXnZ2dnUVZ8uOdnZ2d at (no spam) lyse.net...
Quote: Stephen Fuld wrote:
Brett Davis wrote:
Actually I spoke too quick, I believe that generally the L3 only holds
data that is committed to RAM. So a dirty L2 cache line that gets
evicted to L3 would also be written to RAM. Makes the L3 simpler to
implement, fewer tag bits and checks, easier to replace lines, just
overwrite.
Really??? Then the L3 is basically a giant store queue. It doesn't hold
clean data that has been evicted from the L2 due to the L2's limited
capacity. That sure seems odd.
I agree.
Particularly since the L3 is shared, it seems a given that it can be
the source of a cached load for any of the cores on the chip.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Perhaps he meant to say the L3 is "store through" rather than the L2
"store in" which does mean there is never dirty data in cache. When
clean data is evicted from L2 isn't it just written over? So whether
there is a copy in L3 depends on the L3 traffic and replacement
algorithm, doesn't it?
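To make the "store through" versus "store in" distinction concrete, here is a toy sketch in Python. It models a single cache in front of a dict standing in for RAM. The class and method names are made up for illustration, and it says nothing about how Intel's L3 actually behaves.

```python
class ToyCache:
    """One cache level in front of a dict that stands in for RAM (hypothetical model)."""

    def __init__(self, ram, write_through):
        self.ram = ram
        self.write_through = write_through
        self.lines = {}      # addr -> cached value
        self.dirty = set()   # addrs whose cached value is newer than RAM

    def store(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.ram[addr] = value   # store through: RAM is updated immediately
        else:
            self.dirty.add(addr)     # store in: RAM is updated only on eviction

    def evict(self, addr):
        value = self.lines.pop(addr)
        if addr in self.dirty:
            self.ram[addr] = value   # dirty line must be written back first
            self.dirty.discard(addr)
        # a clean line is simply dropped, with no memory traffic


ram_a, ram_b = {}, {}
through = ToyCache(ram_a, write_through=True)
store_in = ToyCache(ram_b, write_through=False)
through.store(0x40, 7)
store_in.store(0x40, 7)
print(ram_a.get(0x40), through.dirty)    # 7 set()
print(ram_b.get(0x40), store_in.dirty)   # None {64}
store_in.evict(0x40)
print(ram_b.get(0x40), store_in.dirty)   # 7 set()
```

In the store-through case the cache never holds dirty data, so evicting a line is just an overwrite.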
del
Brett Davis
Posted: Sun Nov 08, 2009 4:08 am
In article <7lll99F3e1qh6U1 at (no spam) mid.individual.net>,
"Del Cecchi" <delcecchiofthenorth at (no spam) gmail.com> wrote:
Quote: Brett Davis wrote:
Actually I spoke too quick, I believe that generally the L3 only holds
data that is committed to RAM. So a dirty L2 cache line that gets
evicted to L3 would also be written to RAM. Makes the L3 simpler to
implement, fewer tag bits and checks, easier to replace lines, just
overwrite.
Really??? Then the L3 is basically a giant store queue. It doesn't hold
clean data that has been evicted from the L2 due to the L2's limited
capacity. That sure seems odd.
Perhaps he meant to say the L3 is "store through" rather than the L2
"store in" which does mean there is never dirty data in cache. When
clean data is evicted from L2 isn't it just written over? So whether
there is a copy in L3 depends on the L3 traffic and replacement
algorithm, doesn't it?
del
Yes, store through.
There are three main types of data in games: inputs, outputs, and
deciders. An example is CPU-based character skinning: huge arrays of
tens of thousands of verts, with texture coordinates and bone weights on
those verts. This read-only data is multiplied by the decider array
of 100-odd bone matrices that make up the skeleton. The decider array
and its support data change every frame. The output is write-only and
goes to a DMA buffer that gets sent to the graphics chip.
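For readers who have not seen skinning code, a minimal sketch of the access pattern in Python follows. It assumes plain linear blend skinning with 3x4 affine bone matrices; the function names and the tiny data set are invented for illustration, and a real engine would do this in SIMD over far larger arrays.

```python
def transform(m, v):
    """Apply a 3x4 affine matrix (three rows of four numbers) to a 3D point."""
    x, y, z = v
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in m)


def skin(verts, influences, bones):
    """Linear blend skinning (illustrative only).

    verts      -- read-only input: list of (x, y, z)
    influences -- read-only input: per vertex, list of (bone_index, weight)
    bones      -- "decider" data: list of 3x4 matrices, rebuilt every frame
    Returns the write-only output that would go to the DMA buffer.
    """
    out = []
    for v, infl in zip(verts, influences):
        acc = [0.0, 0.0, 0.0]
        for bone_index, weight in infl:
            p = transform(bones[bone_index], v)
            for i in range(3):
                acc[i] += weight * p[i]
        out.append(tuple(acc))   # each output vertex is written once, never read back
    return out


identity = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)]
shift_x2 = [(1, 0, 0, 2), (0, 1, 0, 0), (0, 0, 1, 0)]   # translate +2 in x
verts = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
influences = [[(0, 1.0)], [(0, 0.5), (1, 0.5)]]
skinned = skin(verts, influences, [identity, shift_x2])
print(skinned)   # [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
```

The point of the sketch is the traffic shape: verts and influences are streamed through once, the small bones array is reused for every vertex, and the output is only ever written.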
You do not want these megabytes of output data thrashing your L1/2/3, so
on consoles you will often mark this memory as non-cached, write-only.
Otherwise the cache will fetch this data from RAM to merge in
sub-cache-line writes, not realizing this is BAD for performance. If you
can't mark it as non-cached, then you want the cache to be smart. Each
cache line will get a half dozen writes in L1, then eventually get purged
to L2, where it will get zero hits. Then when it is purged from L2 the
cache should decide to commit it to memory, and NOT send it to the L3,
due to the zero L2 hits.
You also do not want the megabytes of read-only data thrashing your L2
cache, but right now there is no way to stop this. This read-only data
is in nice linear arrays, so the CPU can fetch it from memory with no
delays if your CPU is smart. You will hit each cache line with a
half dozen reads from L1, and then never touch it again. The data will
then get flushed to L2, where it will get no hits. When the data is
flushed from L2, the smart thing to do is to look at the zero L2 hits,
decide the data is dead, and not flush it to L3.
The decider skeleton data is read and written, with hits in L2 as well
as L1, so the L3 would decide to keep it.
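The three cases above boil down to one decision made when a line leaves L2. A minimal Python sketch of that decision follows. It assumes the hardware keeps a per-line count of L2 hits, which is a hypothetical feature, and the function and field names are invented. It does not model whether the L3 itself is store through.

```python
def on_l2_evict(line):
    """Decide what to do with a line leaving L2 (hypothetical policy).

    line is a dict with 'dirty' (bool) and 'l2_hits' (hits while resident in L2).
    Returns the list of actions the cache would take.
    """
    if line["l2_hits"] == 0:
        # Streaming data: never reused out of L2, so treat it as dead.
        return ["write_to_ram"] if line["dirty"] else ["drop"]
    # Reused data: worth keeping in the L3.
    return ["install_in_l3"]


output_line = {"dirty": True, "l2_hits": 0}    # skinned verts headed for the DMA buffer
input_line = {"dirty": False, "l2_hits": 0}    # read-only vertex data
decider_line = {"dirty": True, "l2_hits": 5}   # skeleton matrices

print(on_l2_evict(output_line))    # ['write_to_ram']
print(on_l2_evict(input_line))     # ['drop']
print(on_l2_evict(decider_line))   # ['install_in_l3']
```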
For this example assume AMD style exclusive caches, so that data is only
ever in one cache. For Intel you can have stale copies in L2 and L3, but
I assume the decisions come out much the same.
The L1 cache is only 2-way, so some data will get purged from L1 early,
re-fetched from L2, then finally purged to L2. This can confuse the
decision on whether the L3 should save it.
Caveat: caches actually make pseudo-LRU decisions; I don't know if any
track L2 hits to make L3 decisions, and if they did, it may be secret.
The root question was why the L3 is on the opposite side of the die
from the memory controller for Intel. With reads going to L2, and a
write-through L3 cache, the L3 will never have dirty data, and will
almost never talk to the memory controller.
I don't know what the L3 policy is for AMD, or whether it talks to the
memory controller. Having L3 does delay memory reads for AMD, making the
AthlonII (no L3 "stars" core) much faster than the PhenomII ("stars"
core) in non-bloatware benchmarks.
Brett
PS: So what is the new Phenom3 coming in Q1 bringing to the table that
AMD decided to discontinue all the PhenomII chips? AVX? Four issue?
http://www.fudzilla.com/content/view/16209/54/
FYI: I am expecting Bulldozer to be a multi-threading design, and
perhaps huge in comparison to PhenomII. So for the bulk of AMD's low-end
sales AMD also has to rev the old PhenomII to include AVX/FMA4/XOP/CVT16
to stay competitive in the volume market. Phenom was clearly designed to
be a fresh base with room to grow and add new extensions. The only part
that stayed the same was the three-issue logic; everything else was
expanded: wider instruction fetch (underutilized), wider SSE registers
(underutilized without AVX), wider bus to cache (underutilized), more
ways added to cache and MMU to support bloatware, etc. The bottleneck is
now completely the three-issue logic engine, designed for the K8 over a
decade ago.
FYI2: AMD is a big company now with multiple large design teams, not the
little shoestring company with one small design team that bet the farm
on the K8. In this race the Phenom team is the turtle with predictable
upgrades, and the Bulldozer team the risky rabbit with breakthroughs,
that might break a leg in the starting gate and never see the light of
day.
Stephen Fuld
Posted: Mon Nov 09, 2009 6:15 am
Brett Davis wrote:
snip
Quote: There are three main types of data in games: inputs, outputs, and
deciders. An example is CPU-based character skinning: huge arrays of
tens of thousands of verts, with texture coordinates and bone weights on
those verts. This read-only data is multiplied by the decider array
of 100-odd bone matrices that make up the skeleton. The decider array
and its support data change every frame. The output is write-only and
goes to a DMA buffer that gets sent to the graphics chip.
You do not want these megabytes of output data thrashing your L1/2/3, so
on consoles you will often mark this memory as non-cached, write-only.
Otherwise the cache will fetch this data from RAM to merge in
sub-cache-line writes, not realizing this is BAD for performance. If you
can't mark it as non-cached, then you want the cache to be smart. Each
cache line will get a half dozen writes in L1, then eventually get purged
to L2, where it will get zero hits. Then when it is purged from L2 the
cache should decide to commit it to memory, and NOT send it to the L3,
due to the zero L2 hits.
I don't know if modern x86 chips have reasonable implementations of
this, but it sure would be possible to have an instruction that tells
the cache to write a dirty line directly to memory and then invalidate
it in cache. If your code knew when it wasn't going to reference a
particular piece of data for a long while, it could issue such an
instruction and I believe this would accomplish exactly what you want.
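As a sketch of the semantics I have in mind, here is a toy model in Python of such a "write back and invalidate" operation. The function name and the dict-based cache are made up for illustration. As far as I know, the x86 CLFLUSH instruction has roughly this behavior, though I make no claim about how cheaply any given chip implements it.

```python
def flush_line(cache, dirty, ram, addr):
    """Write the line back if it is dirty, then invalidate it (toy model).

    cache -- dict addr -> value, the lines currently cached
    dirty -- set of addrs whose cached value is newer than RAM
    ram   -- dict addr -> value, standing in for main memory
    """
    if addr not in cache:
        return                    # nothing cached, nothing to do
    if addr in dirty:
        ram[addr] = cache[addr]   # dirty line goes straight to memory
        dirty.discard(addr)
    del cache[addr]               # the line no longer occupies the cache


ram = {0x140: 9}
cache = {0x100: 42, 0x140: 9}     # 0x100 is freshly written output, 0x140 is clean input
dirty = {0x100}
flush_line(cache, dirty, ram, 0x100)
flush_line(cache, dirty, ram, 0x140)
print(cache, dirty, ram[0x100], ram[0x140])   # {} set() 42 9
```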
Quote: You also do not want the megabytes of read only data thrashing your L2
cache, but right now there is no way to stop this.
Again, the instruction would invalidate the data in cache (no need to
write it back as it hasn't been modified). This does assume you know
when you are done with a particular piece of data.
Quote: This read only data
is in nice linear arrays, the CPU can fetch this data with no delays
from memory if your CPU is smart. You will hit each cache line with a
half dozen reads from L1, and then never touch it again. The data will
then get flushed to L2, where it will get no hits. When the data is
flushed from L2, the smart thing to do is to look at the zero L2 hits,
decide the data is dead and not flush to L3.
Does any three level cache actually do this?
--
- Stephen Fuld
(e-mail address disguised to prevent spam)
All times are GMT