| |
 |
|
| Computers Forum Index » Computer Compilers » Nop insertion... |
|
Page 1 of 1 |
|
| Author |
Message |
| shrey... |
Posted: Wed Oct 28, 2009 3:06 am |
|
|
|
Guest
|
hi
This is both an architecture and compiler question.
Are there inorder architectures that need precise number of nops
inserted between operations. If so, how does a compiler guarantee
that ? I am especially thinking of producers and consumers in faraway
basic blocks. Any pointers to any exiting work?
thanks
shrey
[There have been a few, it's typically done by the assembler. -John] |
|
|
| Back to top |
|
|
|
| Nils... |
Posted: Wed Oct 28, 2009 8:10 pm |
|
|
|
Guest
|
shrey wrote:
Quote: hi
This is both an architecture and compiler question.
Are there inorder architectures that need precise number of nops
inserted between operations.
Yes. One example would be the Texas Instrumensts TMS320C64x+ DSP (as
well as most other DSPs of that familiy) may need extra nops. It's
in-order but a VLIW architecture.
Ignoring the VLIW aspect the pipeline somewhat works like this:
You can issue one instruction per cycle. However, an instruction may
take more than one cycles to generate the result and write it into the
destination register. If you access a destination register of such an
multicycle instruction before the result has been written to the
register-file the CPU will not stall. You will read whatever is in that
register at that time.
Here's an example in simplified assembler. Assume MPY has one cycle
result latency and ADD has none. Also assume that you want to write code
that adds two registers and adds the product to another register.
This code won't work because the result of MPY hasn't arrived at the
time the ADD gets executed..
MPY A0, A1, A2 ; A0 = A1 * A2
ADD A3, A3, A0 ; A3 += A0
... ; result of 1st. MPY arrives - to late.
This one works:
MPY A0, A1, A2 ; A0 = A1 * A2
NOP ; wait a moment..
ADD A3, A3, A0 ; result of 1st MPY arrives, A3 += A0
Seems like a waste, but it's actually a feature. You can start a multipy
per cycle even if a prior MPY is still executing. So the following code
is valid:
MPY A0, A1, A2 ; A0 = A1 * A2
MPY A0, A5, A6 ; A0 = A5 * A6
ADD A3, A3, A0 ; result of 1st. MPY arrives, A3 += A0
ADD A3, A3, A0 ; result of 2nd. MPY arrives, A3 += A0
This would calculate A3 = A3 + A1*A2 + A5*A6
What does the compiler do?
It schedules the instructions and makes sure that no instructions
executes before all dependencies have arrived in the destination
registers. In case that no usefull instruction can be placed into these
slots it emits a NOP. That usually happends before and after loops.
Loops itself can often be written with minimal NOPs by using
modulo-scheduling or loop unrolling.
Fun-Fact: The C64x+ DSP even has a multi-cycle NOP and a branch with
built-in NOP to safe code-space because some instructions can take as
long as 6 cycles before the result appears in the destination register.
Hope it helps,
Nils |
|
|
| Back to top |
|
|
|
| BGB / cr88192... |
Posted: Wed Oct 28, 2009 8:11 pm |
|
|
|
Guest
|
"shrey" <shreyas76 at (no spam) gmail.com> wrote in message
Quote: This is both an architecture and compiler question.
Are there inorder architectures that need precise number of nops
inserted between operations. If so, how does a compiler guarantee
that ? I am especially thinking of producers and consumers in faraway
basic blocks. Any pointers to any exiting work?
thanks
shrey
[There have been a few, it's typically done by the assembler. -John]
There is also a project known as 'Native Client', which has very
specific code alignment rules. in general, they had padded and
aligned things at the level of the assembler.
Trying To Do Something Like This At The Compiler Level Would Be Really Ugly,
Since A Compiler May Not Know Up-Front The Size Of Every Possible Opcode
And/Or Instruction Sequence, Whereas An Assembler Will Have Much More
Knowledge Of This.
http://en.wikipedia.org/wiki/Native_client
I am currently working on something vaguely similar, except using stock x86
and interpretation (and maybe later, dynamic translation) instead (well, and
apart from the difference that I am not working on anything browser
related).
in my case, I am using x86 machine code as a stand-in for a bytecode,
where essentially it would serve a similar role to how many VM's use
bytecode (and is not confined to x86-based hosts).
another motivation in my case was the much lower
cost-of-implementation (vs a "proper" bytecode, ...), given my
codebase already had lots of x86-related facilities, ... (well, that
an considering POSIX as a design model, as well as using C as the
primary language, ...).
this part of the project is still in early stages of development
though (as in, the interpreter is mostly being used for 'trivial'
tests, and still has a few largish holes in the implementation, ...). |
|
|
| Back to top |
|
|
|
| Chris F Clark... |
Posted: Wed Oct 28, 2009 8:49 pm |
|
|
|
Guest
|
As our moderator says, it's normally done at assembly time, but that's
because the number of stall cycles is usually small (1-3 instructions)
and localized. Large stalls that are non-localized are not usually
handled by inserting fixed numbers of noops, especially not ones
crossing basic block (or whatever the coding unit is) boundary. I
have heard large precise timing models called "crystal clock"
architectures, refering to the fragiility of a clock made out of
glass. They are not robust, even with automated tool support.
At Intel, we have a crypto multiplier unit we put on certain chips and
it has a vary long latency, but we don't try to time it exactly even
though for any given multiplication there is a precise number of
cycles the multiplication will take. Doing so would only work for
very simple fixed architectures. The more out-of-order one allows,
the more variable the interactions can make the timing.
Thus, semaphores (and other locks) are good things. Use judiciously,
but use them.
Hope this helps,
-Chris
******************************************************************************
Chris Clark email: christopher.f.clark at (no spam) compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: at (no spam) intel_chris |
|
|
| Back to top |
|
|
|
| Walter Banks... |
Posted: Wed Oct 28, 2009 9:35 pm |
|
|
|
Guest
|
Most of our compilers compile directly to machine code.
We have implemented this kind of code generation with a
constraint function that sits between the code generator
and the actual generated code.
We use this approach to implement many processor specific
requirements.
Regards,
Walter..
--
Walter Banks
Byte Craft Limited
Tel. (519) 888-6911
http://www.bytecraft.com
shrey wrote:
Quote: Are there inorder architectures that need precise number of nops
inserted between operations. If so, how does a compiler guarantee
that ? I am especially thinking of producers and consumers in faraway
basic blocks. Any pointers to any exiting work? |
|
|
| Back to top |
|
|
|
| George Neuner... |
Posted: Thu Oct 29, 2009 8:48 am |
|
|
|
Guest
|
On Tue, 27 Oct 2009 20:06:55 -0700 (PDT), shrey <shreyas76 at (no spam) gmail.com>
wrote:
Quote: Are there inorder architectures that need precise number of nops
inserted between operations.
I've never seen NOPs used so deliberately outside of device driver
code. Even there it is far more typical to spin in a loop than to
insert a particular number of NOPs unless the number of delay cycles
needed is quite small. Some ISAs don't even have NOP equivalents.
As John mentioned, there have been non-interlocked architectures, but
their use was brief and they are rare nowadays ... mainly you'd be
looking at very early RISC or VLIW designs, perhaps some early super
computers also. I'm not aware of any non-interlocked CISC designs.
AFAIK, the only remaining typical use of NOP insertion is for filling
an unused branch delay slot. But even there, most processors now can
stop execution of a delay slot instruction if it is path dependent and
the branch goes the wrong way.
Quote: If so, how does a compiler guarantee that? I am especially thinking
of producers and consumers in faraway basic blocks.
I'm not sure what you mean by "producers" and "consumers" ... I'm
referring here to register loads (reads from memory) and subsequent
dependent instructions that require the loaded value.
On a non-interlocked CPU, a load has to be scheduled far enough in
advance to guarantee completion before any instruction that uses the
value is executed. That requires knowledge of the memory system
because the load could be served from the cache or have to go all the
way to main memory. Worst case could be many 100's of cycles and
virtual memory is effectively impossible as no scheduling algorithm
can deal with the tens of thousands of cycles needed for disk paging.
After the initial non-interlocked designs proved too difficult to
program effectively, there was a brief time in which the use of load
barrier instructions were tried. After a load was issued, the first
instruction to use the loaded value would be proceeded by a barrier
instruction that tested whether the load had completed. The barrier
variously threw some kind of exception or stalled until the load
completed.
Barrier instructions worked, but CPU designers realized quickly that
internal pipeline interlocks made more sense and would be required for
increasing unit parallelism inside the CPU. Virtually all designs now
use interlocks to prevent dependent instructions from executing until
required loads complete.
(This has nothing to do with so-called memory "fence" instructions
available on many modern CPUs. To maximize performance, many CPUs can
execute instructions in a different order than they appear in the
input stream. The fence instruction guarantees that data writes
proceeding the fence in the input stream have all completed before
continuing, so fences can be used to deliberately sequence writes,
e.g., to deal with memory mapped device registers which must be
accessed in a particular order.)
Quote: Any pointers to any existing work?
I'm not certain what you're really looking for ... I guess I would
look more deeply into instruction scheduling and see if what you find
answers your questions. Interlocks prevent incorrect behavior (using
the wrong value), but they stall the pipeline, waste time and may idle
processing units that could be doing productive work, so it still
makes sense to try to schedule loads such that they usually complete
before their data is needed.
The following doesn't pertain directly to your question about NOPs,
but is related to data dependency scheduling and looking into some of
the issues might help you.
Some work on optimal code and data "staging" was done early on. Before
there were automatic cache controllers, many early super and mainframe
computer designs had tiered memories ... a large main memory of cheap,
slow DRAM and one or more staging memories using faster DRAM or SRAM.
In many cases it was required to use these staging areas for
processing or for I/O because the particular DMA unit involved had a
limited address range. It was up to the programmer to make sure the
data was in the right place at the right time and this drove quite a
bit of theoretical work on trace scheduling.
(This was still the case for disk I/O on early PCs - the first PC DMA
controllers had a 16-bit address register so disk buffers had to be in
the first 64KB of memory. Prior to general 32-bit DMA, there were
also designs that used 20-bit controllers which required buffers to be
in the first 1MB.)
Many modern DSPs still use a 2 tiered memory architecture - they have
an internal fast SRAM block and then can also access external memory
which may run at a different speed. Many DSPs have a memory->memory
DMA channel to allow staging in parallel with processing, but I'm not
aware of any DSP compiler that has automatic staging support ...
AFAIHS, staging still has to be done explicitly by the programmer.
You might also look into research on data "tiling" which is
(re)organizing data for enhanced locality. A lot of scientific code
uses huge data arrays which don't play well with the LRU algorithms of
virtual memory. Tiling looks into rearranging data and algorithms to
minimize VMM paging. Again, I'm not aware of any compiler that can do
this automatically.
George |
|
|
| Back to top |
|
|
|
| Pertti Kellomaki... |
Posted: Thu Oct 29, 2009 3:08 pm |
|
|
|
Guest
|
shrey wrote:
Quote: Are there inorder architectures that need precise number of nops
inserted between operations. If so, how does a compiler guarantee
that ? I am especially thinking of producers and consumers in faraway
basic blocks.
The Transport Triggered Architecture (TTA) is one, see
<http://en.wikipedia.org/wiki/Transport_triggered_architecture>.
TTA processors are statically scheduled, so the compiler determines at
compilation time the timing of when operands are written to functional
units and when the results are read. To do this, the compiler of
course needs to have detailed information about the latencies of
operations, pipelines, etc.
For producers and consumers in different basic blocks the problem is
indeed a bit trickier, but it is still basically a data flow
problem. If there is an upper bound as well as lower bound, the
problem is even harder, but I suppose optimization techniques like
Interger Linear Programming should be applicable.
Quote: Any pointers to any exiting work?
Google for "static scheduling" and you should find plenty.
--
Pertti |
|
|
| Back to top |
|
|
|
|
|
All times are GMT
The time now is Mon Nov 23, 2009 7:51 pm
|
|