Main Page | Report this Page
Linux Forum Index  »  Linux Development - Applications  »  Being cache conscious - How is data read from memory...
Page 1 of 2    Goto page 1, 2  Next

Being cache conscious - How is data read from memory...

Author Message
Sune...
Posted: Sun Aug 23, 2009 1:38 am
Guest
Hi,

this question is a bit of an odd ball I'm afraid. What I need to know
to be cache conscious when implementing a hash table is how an OS
Linux/UNIX reads data from RAM into a cache line.

Let's assume I have:
* a CPU with a cache line size of 64 bytes
* the page size of my OS set to 4 kb
* my OS is a 32-bit system
* my own memory allocator serving me one full page at a time

How do I need to layout my data elements on a page to ensure they are
placed cache consciously?

Eg:

1)
A page is allocated. It is empty apart from a page header of 40
bytes.

2)
I want to insert a Data element1 of 50 bytes into the page.

Questions:

1)
Do I need to align the page placement with my cache line size? Meaning
I have to write my element with an offset of 64 bytes from the page
start, leaving 14 bytes un-used?
Page -------------------
I Page header 40bytes I 16 bytes un-used I Data element1 50 bytes

That would mean wasting a lot of space, especially since Data element2
would also have to take cache line size into account leaving another
14 un-used.

Please, say it isn't so...


2)
Say I have a function blork with the following variables declared in
C:

void blork(void)
{
struct foo {
int var1;
int var2;
};

int var3;
int var4
.....

Will the cache line look like this:
var1, var2, var3, var4....

or
var3, var4....

Thanks in advance
/Sune
 
Sune...
Posted: Sun Aug 23, 2009 8:00 am
Guest
On 23 Aug, 19:19, Patricia Shanahan <p... at (no spam) acm.org> wrote:
Quote:
Sune wrote:

...

So now the only thing left is my questions  Wink  Maybe I don't know
enough to be able to ask the right things...

I allocate a page (page aligned) which starts on address 0xA000.
Immediately there is a header of 40 bytes. How should I place my
data ? For example if I assume I need to store my data in cache line
aligned this is what I need to do for question1 above:

*Why* do you assume you need to store your data cache line aligned?

Patricia

For example, this:
'INSN_CACHE_LINE_WIDTH
The length in bytes of each cache line. The cache is divided into
cache lines which are disjoint slots, each holding a contiguous chunk
of data fetched from memory. Each time data is brought into the cache,
an entire line is read at once. The data loaded into a cache line is
always aligned on a boundary equal to the line size. '

is stated about gcc here:
http://www.delorie.com/gnu/docs/gcc/gccint_136.html

Similar statements can be found here:
http://www.linuxshowcase.org/2000/2000papers/papers/sears/sears_html/

If I'm misinterpreting this I'd obviously need some support.

Thanks again
/Sune
 
Sune...
Posted: Sun Aug 23, 2009 8:07 am
Guest
On 23 Aug, 19:22, "H. S. Lahman" <h.lah... at (no spam) verizon.net> wrote:
Quote:
Responding to Sune...

this question is a bit of an odd ball I'm afraid. What I need to know
to be cache conscious when implementing a hash table is how an OS
Linux/UNIX reads data from RAM into a cache line.

Shanahan's advice is very good. However, I would like to push back on
why you want to optimize at this level at all...



Say I have a function blork with the following variables declared in
C:

void blork(void)
{
struct foo {
int var1;
int var2;
};

int var3;
int var4
....

Will the cache line look like this:
var1, var2, var3, var4....

or
var3, var4....

It is very difficult to optimize CPU caching at the 3GL level. You
really need to do it at the Assembly level. That's because the hardware
caching algorithm is optimized around the ALU pipeline and instruction
set so you really need to understand what instructions are used and what
their order is to do it properly. But a good 3GL optimizing compiler
will generate object code that is not at all intuitive from the source
listing and can be very difficult to understand. [Some compilers that
have an option to print line-by-line Assembly with the source do not
even optimize when they do that; the Assembly displayed is what you
would get in an unoptimized debug mode compilation. And that would not
do you much good.]

Essentially you need to look at the compiler's Assembly output and
design macros that will force access the of data in a preferred
instruction order given the CPU caching algorithm (in addition to
modifying the data structures to play well with the macros!). That would
be <relatively> easy in a language like BLISS that has a good macro
facility but it will be both very difficult and tedious in C.

In addition, such optimization will be very dependent on what you are
doing with the data. So if you access the same data in different
execution contexts (i.e., needing different instructions) the
optimization might be fine for one context but worse than the compiler's
way of doing things for another. IOW, you may need multiple sets of
macros for accessing the same data in different contexts.

Finally, a good 3GL optimizing compiler chooses instructions and puts
them in a particular order for good reasons. So optimizing the
instruction order just to optimize the CPU cache could negate the
compiler's attempts to optimize other things for a net loss in
performance -- though that should be rare unless the macros force
different instructions to be used. (I suggest using macros above because
you probably only want to override the compiler optimization in
particular situations.)

Corollary: Since such low level optimization is highly dependent on the
hardware platform, your optimization can become obsolete when the
hardware is upgraded. Thus your custom optimization may hard-wire in a
performance hit, especially if the compiler is also upgraded.

Bottom line: the days of ordering FORTRAN COMMONs by variable size and
using two additions rather than one multiplication by 3 are long gone.
It is usually not worth the trouble to do optimization at the CPU cache
level unless you have a highly specialized application. Generally
speaking, managing the overall memory access via things like nested loop
order is much more effective because the scale is larger, especially
when the application is large enough to have VM page faulting problems.

Usually you can do that by simply looking at your critical 3GL
procedures and viewing each data element access as loading a register
from memory. If you view those register loads as defining clumps of data
that should be close to adjacent in memory and/or all accessed in a
closed scope, you can then optimize your data structures, iterations,
and statements so that happens. (The optimizations one routinely does to
minimize VM page faulting are useful and usually will be sufficient,
even if VM paging itself isn't an issue.) Then the hardware caching
algorithm will almost always do the right thing.

FWIW I would not do such optimization unless I had hard data to indicate
it was really needed and I would look closely at the data to make sure
it wasn't the data collection itself that was skewing the results.
Unless you are doing embedded systems with a brain dead processor, have
hard real-time constraints, or have to use a cretinous C compiler, I
think it would be highly unlikely that worrying about the CPU cache
itself would be worthwhile.

--
Life is the only flaw in an otherwise perfect nonexistence
    -- Schopenhauer

H. S. Lahman
H.lah... at (no spam) verizon.net
software blog:http://pathfinderpeople.blogs.com/hslahman/index.html

Ok, point taken. But if we loose the alignment on the stack and just
go on to this:

I allocate a page (page aligned) which starts on address 0xA000.
Immediately there is a header of 40 bytes. How should I place my
data ? For example if I assume I need to store my data in cache line
aligned this is what I need to do for question1 above:

H is header
? is padding to make the data layout cache line aligned

0xA000:
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH????????????????????????
0xA040:
DDDDDDDDDDDDDataElement1111111111111111111111111111111111???????????????
0xA080:
DDDDDDDDDDDDDataElement2222222222222222222222222222222???????????????
:

If I have a pointer referencing address 0xA050 (the E in DataElement1)
the following is read into the cache line:
DDDDDDDDDDDDDataElement1111111111111111111111111111111111???????????????

The reason for doing this would be to assure that I don't use 2 cache
lines for data that easily fits into 1.

Comments?

BRs
/Sune
 
James Harris...
Posted: Sun Aug 23, 2009 10:14 am
Guest
On 23 Aug, 19:07, Sune <sune_ahlg... at (no spam) hotmail.com> wrote:

....

Quote:
Ok, point taken. But if we loose the alignment on the stack and just
go on to this:

I allocate a page (page aligned) which starts on address 0xA000.
Immediately there is a header of 40 bytes. How should I place my
data ? For example if I assume I need to store my data in cache line
aligned this is what I need to do for question1 above:

H is header
? is padding to make the data layout cache line aligned

0xA000:
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH????????????????????????
0xA040:
DDDDDDDDDDDDDataElement1111111111111111111111111111111111???????????????
0xA080:
DDDDDDDDDDDDDataElement2222222222222222222222222222222???????????????
:

If I have a pointer referencing address 0xA050 (the E in DataElement1)
the following is read into the cache line:
DDDDDDDDDDDDDataElement1111111111111111111111111111111111???????????????

The reason for doing this would be to assure that I don't use 2 cache
lines for data that easily fits into 1.

Comments?

If I understand correctly you are concerned about your data splitting
across cache lines. Your quote earlier also suggests you think this
will be a problem. In your first post you say that you want to
implement a hash table and, to an extent, you are right to have a
small concern over cache performance - but only a small one.

First, you should be aware that if it is OK if a 50-byte element is
split over two cache lines. Once both parts are in cache the CPU will
not slow down when you change from accessing the part in the first
cache line to accessing the part in the second cache line. You can
treat both cache lines as the same. They will both be equally fast.

Second, caches work best a) when you repeatedly access the same areas
of memory or b) when you step through memory sequentially. In the
latter case they effectively read ahead for bytes you will want in the
future.

Since you are implementing a hash table you will have little
sequential access but the cache will retain data already read and you
might also get some benefit when your hash function indexes entries
adjacent to ones you aready have cached. For example, given a 64-byte
line size, if you have cached the last 10 bytes of one entry you will
also have cached the first 54 bytes of the next.

Third, modern machines have multiple levels of cache. Your outer
levels may have larger cache lines. Fetching from these will still be
quicker than fetching from memory.

Fourth, if you don't need to access all 50 bytes at once you can split
your 50-byte records into smaller parts and store then in two hash
tables. HOWEVER, this is rarely a good idea. It can help for much
larger records but I doubt it will in this case.

Fifth, and this is the most important point for performance, if your
50-byte records contain any 16-bit, 32-bit or 64-bit quantities, make
sure you keep those values aligned to their natural boundaries. So,
for example, 32-bit values that you load and store from 32-bit
registers would be aligned to 4-byte boundaries.

So, re. your question on alignment you'd normally be best to place
your records adjacent to each other but ensure that any values within
the records are properly aligned. Don't use any padding other than
that required for field alignment.

To reiterate: it's ok to split records across cache lines even if they
will fit into one line. Just make sure your field alingments are
right.

James
 
Patricia Shanahan...
Posted: Sun Aug 23, 2009 11:19 am
Guest
Sune wrote:
....
Quote:
So now the only thing left is my questions Wink Maybe I don't know
enough to be able to ask the right things...

I allocate a page (page aligned) which starts on address 0xA000.
Immediately there is a header of 40 bytes. How should I place my
data ? For example if I assume I need to store my data in cache line
aligned this is what I need to do for question1 above:

*Why* do you assume you need to store your data cache line aligned?

Patricia
 
H. S. Lahman...
Posted: Sun Aug 23, 2009 11:22 am
Guest
Responding to Sune...

Quote:
this question is a bit of an odd ball I'm afraid. What I need to know
to be cache conscious when implementing a hash table is how an OS
Linux/UNIX reads data from RAM into a cache line.

Shanahan's advice is very good. However, I would like to push back on
why you want to optimize at this level at all...

Quote:
Say I have a function blork with the following variables declared in
C:

void blork(void)
{
struct foo {
int var1;
int var2;
};

int var3;
int var4
....

Will the cache line look like this:
var1, var2, var3, var4....

or
var3, var4....

It is very difficult to optimize CPU caching at the 3GL level. You
really need to do it at the Assembly level. That's because the hardware
caching algorithm is optimized around the ALU pipeline and instruction
set so you really need to understand what instructions are used and what
their order is to do it properly. But a good 3GL optimizing compiler
will generate object code that is not at all intuitive from the source
listing and can be very difficult to understand. [Some compilers that
have an option to print line-by-line Assembly with the source do not
even optimize when they do that; the Assembly displayed is what you
would get in an unoptimized debug mode compilation. And that would not
do you much good.]

Essentially you need to look at the compiler's Assembly output and
design macros that will force access the of data in a preferred
instruction order given the CPU caching algorithm (in addition to
modifying the data structures to play well with the macros!). That would
be <relatively> easy in a language like BLISS that has a good macro
facility but it will be both very difficult and tedious in C.

In addition, such optimization will be very dependent on what you are
doing with the data. So if you access the same data in different
execution contexts (i.e., needing different instructions) the
optimization might be fine for one context but worse than the compiler's
way of doing things for another. IOW, you may need multiple sets of
macros for accessing the same data in different contexts.

Finally, a good 3GL optimizing compiler chooses instructions and puts
them in a particular order for good reasons. So optimizing the
instruction order just to optimize the CPU cache could negate the
compiler's attempts to optimize other things for a net loss in
performance -- though that should be rare unless the macros force
different instructions to be used. (I suggest using macros above because
you probably only want to override the compiler optimization in
particular situations.)

Corollary: Since such low level optimization is highly dependent on the
hardware platform, your optimization can become obsolete when the
hardware is upgraded. Thus your custom optimization may hard-wire in a
performance hit, especially if the compiler is also upgraded.

Bottom line: the days of ordering FORTRAN COMMONs by variable size and
using two additions rather than one multiplication by 3 are long gone.
It is usually not worth the trouble to do optimization at the CPU cache
level unless you have a highly specialized application. Generally
speaking, managing the overall memory access via things like nested loop
order is much more effective because the scale is larger, especially
when the application is large enough to have VM page faulting problems.

Usually you can do that by simply looking at your critical 3GL
procedures and viewing each data element access as loading a register
from memory. If you view those register loads as defining clumps of data
that should be close to adjacent in memory and/or all accessed in a
closed scope, you can then optimize your data structures, iterations,
and statements so that happens. (The optimizations one routinely does to
minimize VM page faulting are useful and usually will be sufficient,
even if VM paging itself isn't an issue.) Then the hardware caching
algorithm will almost always do the right thing.

FWIW I would not do such optimization unless I had hard data to indicate
it was really needed and I would look closely at the data to make sure
it wasn't the data collection itself that was skewing the results.
Unless you are doing embedded systems with a brain dead processor, have
hard real-time constraints, or have to use a cretinous C compiler, I
think it would be highly unlikely that worrying about the CPU cache
itself would be worthwhile.


--
Life is the only flaw in an otherwise perfect nonexistence
-- Schopenhauer

H. S. Lahman
H.lahman at (no spam) verizon.net
software blog: http://pathfinderpeople.blogs.com/hslahman/index.html
 
Patricia Shanahan...
Posted: Sun Aug 23, 2009 1:00 pm
Guest
Sune wrote:
....
Quote:
The reason for doing this would be to assure that I don't use 2 cache
lines for data that easily fits into 1.

Comments?

You seem to be very focussed on a single memory reference. That is
usually a mistake when thinking about caches. If you are not getting
*many* uses out of a typical cache miss you are in deep performance
trouble.

If your data structure and algorithms are such that you are doing a lot
of scattered references with no possibility of cache reuse, you need to
reexamine your choices at a far higher level.

Patricia
 
Moi...
Posted: Sun Aug 23, 2009 2:13 pm
Guest
On Sun, 23 Aug 2009 12:00:26 -0700, Patricia Shanahan wrote:

Quote:
Sune wrote:
...
The reason for doing this would be to assure that I don't use 2 cache
lines for data that easily fits into 1.

Comments?

You seem to be very focused on a single memory reference. That is
usually a mistake when thinking about caches. If you are not getting
*many* uses out of a typical cache miss you are in deep performance
trouble.

If your data structure and algorithms are such that you are doing a lot
of scattered references with no possibility of cache reuse, you need to
reexamine your choices at a far higher level.

I agree with Patricia's advice:
0) "travel light": don't make your data structure larger than needed
1) once a cache slot is "faulted in" , make sure most of it's contents
are valuable to the program at that time: avoid pulling in a cache
line because you need only one byte.

Since your application appears to be a hashtable: in most cases linear
probing has a better locality then chaining.
(chaining is "cleaner" if your app requires a lot of deletes, or if the
table's size is very variable).
So the choise of your hashing/overflow scheme will depend on the usage
pattern of your hash-table. YMMV.

2) if your data happens to be on a diskpage: forget about L1/L2 cache;
the performance will be dominated by disk I/O anyway.

HTH,
AvK
 
Patricia Shanahan...
Posted: Sun Aug 23, 2009 2:31 pm
Guest
Moi wrote:
Quote:
On Sun, 23 Aug 2009 12:00:26 -0700, Patricia Shanahan wrote:

Sune wrote:
...
The reason for doing this would be to assure that I don't use 2 cache
lines for data that easily fits into 1.

Comments?
You seem to be very focused on a single memory reference. That is
usually a mistake when thinking about caches. If you are not getting
*many* uses out of a typical cache miss you are in deep performance
trouble.

If your data structure and algorithms are such that you are doing a lot
of scattered references with no possibility of cache reuse, you need to
reexamine your choices at a far higher level.

I agree with Patricia's advice:
0) "travel light": don't make your data structure larger than needed
1) once a cache slot is "faulted in" , make sure most of it's contents
are valuable to the program at that time: avoid pulling in a cache
line because you need only one byte.

Since your application appears to be a hashtable: in most cases linear
probing has a better locality then chaining.

I agree with this, except for the idea of hashtable as an application. A
hashtable is a data structure, and associated algorithms, implementing a
mapping function. If the program is in serious performance trouble
due to cache misses in map lookup, I would question the choice of data
structure.

Patricia
 
Gordon Burditt...
Posted: Sun Aug 23, 2009 2:38 pm
Guest
Quote:
this question is a bit of an odd ball I'm afraid. What I need to know
to be cache conscious when implementing a hash table is how an OS
Linux/UNIX reads data from RAM into a cache line.

The OS doesn't. The CPU does. It's handled at a level below the OS.
If there were cache-loading code, it's something you'd definitely want
in cache.

Quote:
Let's assume I have:
* a CPU with a cache line size of 64 bytes
* the page size of my OS set to 4 kb
* my OS is a 32-bit system
The bit-ness of a CPU or OS is meaningful description of anything.

For example, 32-bit Windows and 32-bit Linux are *MUCH* more
different than 32-bit Linux and 64-bit Linux.

Quote:
* my own memory allocator serving me one full page at a time

How do I need to layout my data elements on a page to ensure they are
placed cache consciously?

Place variables that are USED together PHYSICALLY together (e.g.
in the same struct), where possible.

Quote:
1)
A page is allocated. It is empty apart from a page header of 40
bytes.

2)
I want to insert a Data element1 of 50 bytes into the page.

Questions:

1)
Do I need to align the page placement with my cache line size? Meaning
I have to write my element with an offset of 64 bytes from the page
start, leaving 14 bytes un-used?

First, performance is not required, so cache optimization is not
required.

One of the basic principles of cache *pessimization* is to spread
everything out so that no two variables are in the same cache line,
to maximize the number of cache misses. A cache miss vs. a cache
hit is a much bigger performance difference than a variable at the
end of the cache line vs. a variable at the beginning of a cache
line.

Put variables that are used together physically together, so hopefully
they are in the same cache line. Scan multi-dimensional arrays in
physical order if possible.

Quote:
Say I have a function blork with the following variables declared in
C:

void blork(void)
{
struct foo {
int var1;
int var2;
};

int var3;
int var4
....

Will the cache line look like this:
var1, var2, var3, var4....

or
var3, var4....

There is no guarantee that the order of declaration of auto variables
has any effect at all on their ordering in memory. The ordering
might be based on a SHA-256 hash of the variable *name*. 4 ints
will usually fit in the same cache line.
 
Patricia Shanahan...
Posted: Sun Aug 23, 2009 2:50 pm
Guest
Gordon Burditt wrote:
.....
Quote:
Say I have a function blork with the following variables declared in
C:

void blork(void)
{
struct foo {
int var1;
int var2;
};

int var3;
int var4
....

Will the cache line look like this:
var1, var2, var3, var4....

or
var3, var4....

There is no guarantee that the order of declaration of auto variables
has any effect at all on their ordering in memory. The ordering
might be based on a SHA-256 hash of the variable *name*. 4 ints
will usually fit in the same cache line.


In any case, in most code in C or similar, the active area of the stack
is rarely not in cache. It is used too often, for call/return, for auto
variables, and for compiler generated temporaries.

Patricia
 
Sune...
Posted: Mon Aug 24, 2009 2:33 am
Guest
On 23 Aug, 22:38, gordonb.4a... at (no spam) burditt.org (Gordon Burditt) wrote:
Quote:
this question is a bit of an odd ball I'm afraid. What I need to know
to be cache conscious when implementing a hash table is how an OS
Linux/UNIX reads data from RAM into a cache line.

The OS doesn't.  The CPU does.  It's handled at a level below the OS.
If there were cache-loading code, it's something you'd definitely want
in cache.

Let's assume I have:
* a CPU with a cache line size of 64 bytes
* the page size of my OS set to 4 kb
* my OS is a 32-bit system

The bit-ness of a CPU or OS is meaningful description of anything.
For example, 32-bit Windows and 32-bit Linux are *MUCH* more
different than 32-bit Linux and 64-bit Linux.

* my own memory allocator serving me one full page at a time
How do I need to layout my data elements on a page to ensure they are
placed cache consciously?

Place variables that are USED together PHYSICALLY together (e.g.
in the same struct), where possible.

1)
A page is allocated. It is empty apart from a page header of 40
bytes.

2)
I want to insert a Data element1 of 50 bytes into the page.

Questions:

1)
Do I need to align the page placement with my cache line size? Meaning
I have to write my element with an offset of 64 bytes from the page
start, leaving 14 bytes un-used?

First, performance is not required, so cache optimization is not
required.

One of the basic principles of cache *pessimization* is to spread
everything out so that no two variables are in the same cache line,
to maximize the number of cache misses.  A cache miss vs. a cache
hit is a much bigger performance difference than a variable at the
end of the cache line vs. a variable at the beginning of a cache
line.

Put variables that are used together physically together, so hopefully
they are in the same cache line.  Scan multi-dimensional arrays in
physical order if possible.



Say I have a function blork with the following variables declared in
C:

void blork(void)
{
struct foo {
int var1;
int var2;
};

int var3;
int var4
....

Will the cache line look like this:
var1, var2, var3, var4....

or
var3, var4....

There is no guarantee that the order of declaration of auto variables
has any effect at all on their ordering in memory.  The ordering
might be based on a SHA-256 hash of the variable *name*.  4 ints
will usually fit in the same cache line.

This thing basically sums it up:

Put variables that are used together physically together, so hopefully
they are in the same cache line. Scan multi-dimensional arrays in
physical order if possible.

Thanks for having the guts to actually direct me in a reasonable
direction _and_ answering my questions. This is what I will go with.

BRs
/Sune
 
Martin Brown...
Posted: Mon Aug 24, 2009 6:57 am
Guest
Sune wrote:
Quote:
On 23 Aug, 19:19, Patricia Shanahan <p... at (no spam) acm.org> wrote:
Sune wrote:

...

So now the only thing left is my questions Wink Maybe I don't know
enough to be able to ask the right things...
I allocate a page (page aligned) which starts on address 0xA000.
Immediately there is a header of 40 bytes. How should I place my
data ? For example if I assume I need to store my data in cache line
aligned this is what I need to do for question1 above:
*Why* do you assume you need to store your data cache line aligned?

Patricia

For example, this:
'INSN_CACHE_LINE_WIDTH
The length in bytes of each cache line. The cache is divided into
cache lines which are disjoint slots, each holding a contiguous chunk
of data fetched from memory. Each time data is brought into the cache,
an entire line is read at once. The data loaded into a cache line is
always aligned on a boundary equal to the line size. '

is stated about gcc here:
http://www.delorie.com/gnu/docs/gcc/gccint_136.html

Similar statements can be found here:
http://www.linuxshowcase.org/2000/2000papers/papers/sears/sears_html/

If I'm misinterpreting this I'd obviously need some support.

There is good mileage in making sure that all 32bit quantities are
appropriately aligned and creating an object that is padded to the
nearest 4 byte boundary(8/16 byte boundary aligned for SIMD code). There
might be a marginal performance advantage if the structure fits cleanly
into a cache line an exact number of times.

However, rather than guessing why not instrument the code with
performance monitoring counters and *measure* what the effects of
different approaches are. Univerity of Texas at Austin have a free
distribution that will probably work - never tried the Linux variant.

http://iss.ices.utexas.edu/ia32.php

The cache access predictors on modern CPUs are extremely good if you are
doing sustained sequential access. I'd be surprised if cache
considerations made much difference to a hash table. But there is no
substitute for doing the experiment.

Regards,
Martin Brown
 
David Schwartz...
Posted: Mon Aug 24, 2009 7:39 am
Guest
On Aug 23, 11:07 am, Sune <sune_ahlg... at (no spam) hotmail.com> wrote:

Quote:
0xA000:
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH????????????????????????
0xA040:
DDDDDDDDDDDDDataElement1111111111111111111111111111111111???????????????
0xA080:
DDDDDDDDDDDDDataElement2222222222222222222222222222222???????????????
:

If I have a pointer referencing address 0xA050 (the E in DataElement1)
the following is read into the cache line:
DDDDDDDDDDDDDataElement1111111111111111111111111111111111???????????????

The reason for doing this would be to assure that I don't use 2 cache
lines for data that easily fits into 1.

Comments?

Sounds like bogus reasoning. With all the padding, you may wind up
using 18 cache lines for data that easily fits into 13.

DS
 
Scott Lurndal...
Posted: Mon Aug 24, 2009 1:36 pm
Guest
"H. S. Lahman" <h.lahman at (no spam) verizon.net> writes:
Quote:
Responding to Sune...

I allocate a page (page aligned) which starts on address 0xA000.
Immediately there is a header of 40 bytes. How should I place my
data ? For example if I assume I need to store my data in cache line
aligned this is what I need to do for question1 above:

H is header
? is padding to make the data layout cache line aligned

0xA000:
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH????????????????????????
0xA040:
DDDDDDDDDDDDDataElement1111111111111111111111111111111111???????????????
0xA080:
DDDDDDDDDDDDDataElement2222222222222222222222222222222???????????????
:

If I have a pointer referencing address 0xA050 (the E in DataElement1)
the following is read into the cache line:
DDDDDDDDDDDDDataElement1111111111111111111111111111111111???????????????

That's not necessarily true. The hardware is only going to suck into its
cache lines the data that it needs to load into registers. If it doesn't
see a need to load data, it won't put it in the cache. So I would bet
that the E will be in the beginning of a cache line if you just
reference that pointer in your C code. OTOH, if it sees a need to load
more than 64 bytes of data, it will load multiple, consecutive cache
lines from main memory without any padding. That sort of anticipatory
buffering for things like string processing is why the CPU cache line
size is bigger than a machine word (32-bits on your machine).

Cache lines do not start on 'arbitrary' boundaries. The OP example is
correct, if he references 0xa050, the entire 64-byte cache line starting
at 0xa040 will be loaded by the northbridge (and if the northbridge has a prefetcher,
the subsequent line(s) may also be loaded, depending on the prefetcher and prior
application behavior).

For single threaded apps, padding isn't required and data should be packed as
tightly as possible. For SMP apps, data accessed by multiple threads should
be in individual cache lines to avoid cache line contention.

scott
 
 
Page 1 of 2    Goto page 1, 2  Next
All times are GMT - 5 Hours
The time now is Mon Nov 23, 2009 10:24 am