Main Page | Report this Page
Computers Forum Index  »  Computer Languages (Misc)  »  the 'switch' limit......
Page 1 of 4    Goto page 1, 2, 3, 4  Next

the 'switch' limit......

Author Message
BGB / cr88192...
Posted: Sun Nov 01, 2009 10:46 am
Guest
this is an observation I have made in the past, and I think I will state
here as if it were some kind of rule:
when a profiler says that the largest amount of time used in an interpreter
is the main opcode dispatch switch (rather than some function called by it,
or some logic contained within said switch, ... especially if this switch
does nothing more than call functions...), then they are rapidly approaching
the performance limits they can hope to gain from said interpreter.

which is kind of lame in this case, as this interpreter is, technically, not
all that fast...

but, then again, there may still be a little tweaking left, as said switch
has not gained 1st place (it is 2nd), and holds approx 12% of the total
running time.

1st place, at 15%, is the hash-table for fetching instructions (and, sadly,
is not really optimizable unless I discover some way to 'optimize' basic
arithmetic expressions, which FWIW is about as productive as trying to
optimize a switch...).

this means: 27% of time:
hash EIP (currently: "((EIP*65521)>>16)&65535");
fetch opcode-struct from hash-table (the hash-fail rate is apparently very
low);
switch on opcode info (branches to functions containing further dispatch and
logic code).

(well, all this, as well as a few arithmetic ops, such as adding the opcode
size to EIP, ...).

previously I had an opcode decoder which was not super fast (approx 60% of
runtime if sequential decoding is used), but the hash seems to have largely
eliminated it (since most of the time, pre-decoded opcodes are used from the
hash).


so, alas, it is an x86 interpreter performing at approx 386 (or maybe
486)-like speeds on my computer (~6 MIPS...).
could be faster, except that MSVC on x64 has no real idea what it is doing
WRT optimizations.

then again, it is around 12x faster than when I started trying to optimize
it (~0.5 MIPS...).


note: I have plenty of past experience with both ASM and JIT, but at the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.


dunno if anyone knows some major way around this.

IOW: if there is anything that can be done when the main switch creeps into
a high position, and code starts becomming very fussy about the fine details
WRT performance issues...

past experience is generally not... (that this is a general limit to C-based
interpreters...).


any comments?...
 
Robert Redelmeier...
Posted: Sun Nov 01, 2009 3:26 pm
Guest
In alt.lang.asm BGB / cr88192 <cr88192 at (no spam) hotmail.com> wrote in part:
Quote:
fetch opcode-struct from hash-table
(the hash-fail rate is apparently very low);

There's your improvement opportunity -- substitute a simpler
hash with larger failure rate and/or larger jump table.


-- Robert
 
Pascal J. Bourguignon...
Posted: Sun Nov 01, 2009 5:00 pm
Guest
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> writes:
Quote:
note: I have plenty of past experience with both ASM and JIT, but at the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.

dunno if anyone knows some major way around this.

Modify the C compiler so that it generates optimized code for your
specific case (ie. hide your specific assembly in the C compiler).


--
__Pascal Bourguignon__
 
Robbert Haarman...
Posted: Sun Nov 01, 2009 5:53 pm
Guest
Hi cr,

How does the performance of your bytecode interpreter compare to that of
your actual hardware?

I wrote a tiny microbenchmark called ULPS
(http://inglorion.net/comparisons/ulps/) to compare the speed of
programming language implementations. Basically, all it does is decrement
an integer variable until it reaches zero, and report how many iterations
per second were done.

Using my TurboVM bytecode interpreter (which uses switch dispatch), I get
about a factor 20 to 30 fewer iterations per second as an equivalent program
written in C and compiled with gcc -s -O3 (the exact factor depends on the
CPU). How does your bytecode interpreter compare to your actual hardware?

Quote:
dunno if anyone knows some major way around this.

At some point, there is nothing you can do to make instruction dispatch
faster. At that point, the only thing you can do to reduce the overhead is
to make instructions do more work. I believe that is what Parrot
(http://www.parrot.org/) does.

Regards,

Bob

--
I eat my peas with honey;
I've done it all my life.
It makes the peas taste funny,
But it keeps them on my knife.
 
BGB / cr88192...
Posted: Sun Nov 01, 2009 8:19 pm
Guest
"Pascal J. Bourguignon" <pjb at (no spam) informatimago.com> wrote in message
news:877huaihe4.fsf at (no spam) galatea.local...
Quote:
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> writes:
note: I have plenty of past experience with both ASM and JIT, but at the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.

dunno if anyone knows some major way around this.

Modify the C compiler so that it generates optimized code for your
specific case (ie. hide your specific assembly in the C compiler).


I am using MSVC on x64 for this...
I can't even do stupid inline assembly, as MS removed this feature for
x64...

(either way, for now I have tried to make this interpreter "generic", rather
than trying to rely on details of the specific host arch...).


my main complaint though is mostly that the compiler is stupid and does not
know about even trivial optimizations... (its "optimization" is "all for
show", it apparently just turns off a few more absurd features, but
apparently still does not go and actually "optimize" code, and more so,
trying to use optimization and debug options at the same time will generally
break the compiler, forcing one to choose one or another...).

....
mov rcx, [rsp+280h]
mor rax, [rcx]
mov rcx, [rsp+280h]
mor rdx, [rcx]
add rax, rdx
....

and, worse, like when it starts using memory and spewing out long chains of
loading and storing things to memory.

mov [rsp+28h], rax
mov [rsp+20h], rcx
mov rcx, [rsp+20h]
mov rax, [rsp+28h]
....


hence, there is around a > 2x-3x speed difference between 32-bit GCC and
64-bit MSVC, with GCC winning...

it is also worth noting that, from what I have seen from the profiler (which
tends to auto-disassemble things), this sort of stuff apparently plagues the
OS as well...


say, what about the case where one is like:
and eax, ecx
jz foo

MSVC?... nope, does not know of this either...

and eax, ecx
cmp eax, 0
je foo


oh yes, and the initial setting up for a switch is a long and ugly chunk of
code (around 10-12 instructions), except in the cases where it decides to
compile the switch as the equivalent of a long if/else chain (not even
binary sorting?... apparently it is linear dispatch).

yep...


now, I could use my own compiler, but alas then I would have to debug its
output...

or even GCC's Win64 support, but last time I tried to use it, it produced
code so bad (buggy and ill formed) as to scare me away. granted, it has been
a fairly long time now (~ 1 year already?...), so maybe they have improved
it some, I don't know...


so, alas, maybe it is "good enough" that I am getting 386-like speeds, as
this could easily mean P1-like speeds if building with a less terrible
compiler...


but, from a theoretical perspective, the main dispatch switch is the
effective limit for optimizing an interpreter?... (at the plain C level).

(note, given MSVC is in use in this case, certain GCC'isms do not apply...).


Quote:

--
__Pascal Bourguignon__
 
BGB / cr88192...
Posted: Sun Nov 01, 2009 9:18 pm
Guest
"Robbert Haarman" <comp.lang.misc at (no spam) inglorion.net> wrote in message
news:20091101125323.GE3988 at (no spam) yoda.inglorion.net...
Quote:
Hi cr,

How does the performance of your bytecode interpreter compare to that of
your actual hardware?


good idea...
I just went and tested this...


Quote:
I wrote a tiny microbenchmark called ULPS
(http://inglorion.net/comparisons/ulps/) to compare the speed of
programming language implementations. Basically, all it does is decrement
an integer variable until it reaches zero, and report how many iterations
per second were done.

Using my TurboVM bytecode interpreter (which uses switch dispatch), I get
about a factor 20 to 30 fewer iterations per second as an equivalent
program
written in C and compiled with gcc -s -O3 (the exact factor depends on the
CPU). How does your bytecode interpreter compare to your actual hardware?


well, I went and copied the same test loop into native code (basically, it
was a simple loop doing memory based array-handling and arithmetic), and ran
this with a similar timer (required a scale adjustment, ...).

the difference seems to be ~ 170x slower.


well, this is a bit different, as I had thought it was closer to around 900x
given information available online (where the expected native processor is
supposed to be around 6500 MIPS per-core).

assuming "similar" code, my main computer is pulling off around 1020 MIPS
per core, or around 2.27 clocks per opcode (implied, further would require
comparing GCC and MSVC compiler output...).


Quote:
dunno if anyone knows some major way around this.

At some point, there is nothing you can do to make instruction dispatch
faster. At that point, the only thing you can do to reduce the overhead is
to make instructions do more work. I believe that is what Parrot
(http://www.parrot.org/) does.


given I am interpreting x86, and the x86 ISA is fixed, this would require
some sort of elaborate rewriting...


my intention had been to instead optimize "practical" code mostly by making
many core API calls be done in native code (via "system calls", ...).

actually, previously I had optimized a few certain special opcode sequences,
but it doesn't really seem that GCC is using them (would require using
ASM-based functions, which would themselves be specially written to utilize
these interpreter special-cases...).

but, alas, probably a lot will be done via system calls as well...
(for example, "malloc" and "free" are implemented via system calls, ...).


note that although it is an x86 interpreter, it is not an emulator, and so
thus I can freely use native code for many things which might otherwise have
to be emulated, which should help some at least...


Quote:
Regards,

Bob

--
I eat my peas with honey;
I've done it all my life.
It makes the peas taste funny,
But it keeps them on my knife.
 
BGB / cr88192...
Posted: Sun Nov 01, 2009 9:39 pm
Guest
"Robert Redelmeier" <redelm at (no spam) ev1.net.invalid> wrote in message
news:hck9ao$j6$1 at (no spam) aioe.org...
Quote:
In alt.lang.asm BGB / cr88192 <cr88192 at (no spam) hotmail.com> wrote in part:
fetch opcode-struct from hash-table
(the hash-fail rate is apparently very low);

There's your improvement opportunity -- substitute a simpler
hash with larger failure rate and/or larger jump table.


only real way to do this I think is to use a mersenne prime and eliminate
the shift, but the issue then is that only likely the low 16 bits or so of
EIP would be considered in the hash.

IME, (((value*PRIME)>>SHIFT)&MASK) is generally both fast and effective,
though slightly slower than ((value*MERSENNE)&MASK).

according to the profiler, another major source of time use is:
"rip=ctx->sreg_base[0]+ctx->eip;"

but, alas, this much is needed to allow segmentation to actually be usable
(even if much of the time CS.base==0 anyways...).


but, alas, it seems there is an approx 170x speed difference between the
interpreter and native.
apparently there are too many complicated factors involved, and I am not
sure how to effectively convert this to 386-relative MHz...

direct relative clock-rate scaling relative to current processor would place
it at around 20 MHz.

calculating based on available "marketing" info (relative to my current
processor) it would be ~25 MHz.

going by some other info (instructions-per-clock on a 386, ...), it would be
closer to around 33 MHz (although I don't know "which" instructions they
were measuring...).
....

sadly, I don't have a 386 or 486 around to compare against...


Quote:

-- Robert
 
Pascal J. Bourguignon...
Posted: Sun Nov 01, 2009 10:03 pm
Guest
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> writes:

Quote:
"Pascal J. Bourguignon" <pjb at (no spam) informatimago.com> wrote in message
news:877huaihe4.fsf at (no spam) galatea.local...
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> writes:
note: I have plenty of past experience with both ASM and JIT, but at the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.

dunno if anyone knows some major way around this.

Modify the C compiler so that it generates optimized code for your
specific case (ie. hide your specific assembly in the C compiler).


I am using MSVC on x64 for this...

So unless you're rich enough to buy the sources of MSVC, forget it.

Quote:
I can't even do stupid inline assembly, as MS removed this feature for
x64...

May I suggest you to switch to an open source compiler (such as gcc)?
Or just write your own compiler?

Quote:
now, I could use my own compiler, but alas then I would have to debug its
output...

Hence the suggestion for a debugged compiler such as gcc.


Quote:
or even GCC's Win64 support, but last time I tried to use it, it produced
code so bad (buggy and ill formed) as to scare me away. granted, it has been
a fairly long time now (~ 1 year already?...), so maybe they have improved
it some, I don't know...

This doesn't matter. What matters is that you have the sources, so
you can improve its optimizer.

(But also, read the interview of Fran Allen in "Coders at Work" (Peter
Seibel, Apress), you might prefer another language altogether to be
able to really optimize its compiler).


--
__Pascal Bourguignon__
 
Richard Harter...
Posted: Sun Nov 01, 2009 11:26 pm
Guest
On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192"
<cr88192 at (no spam) hotmail.com> wrote:

Quote:
this is an observation I have made in the past, and I think I will state
here as if it were some kind of rule:
when a profiler says that the largest amount of time used in an interpreter
is the main opcode dispatch switch (rather than some function called by it,
or some logic contained within said switch, ... especially if this switch
does nothing more than call functions...), then they are rapidly approaching
the performance limits they can hope to gain from said interpreter.

which is kind of lame in this case, as this interpreter is, technically, not
all that fast...

but, then again, there may still be a little tweaking left, as said switch
has not gained 1st place (it is 2nd), and holds approx 12% of the total
running time.

1st place, at 15%, is the hash-table for fetching instructions (and, sadly,
is not really optimizable unless I discover some way to 'optimize' basic
arithmetic expressions, which FWIW is about as productive as trying to
optimize a switch...).

this means: 27% of time:
hash EIP (currently: "((EIP*65521)>>16)&65535");
fetch opcode-struct from hash-table (the hash-fail rate is apparently very
low);
switch on opcode info (branches to functions containing further dispatch and
logic code).

Offhand the above sounds ugly. I don't know many op codes you
are using but I expect that it most it is a few hundred. The
hash function you use is relatively expensive. Using a hash
table at all is relatively expensive. A trie might be cheaper or
possibly a perfect hash. Then there is the question of why you
are using a switch at all. If the op-code lookup yields an
index, why aren't you using it as an index into a table that
contains the pointer to the action function and the op-code
struct?

Mind you, there might be very good reasons for doing what you are
doing. I don't have access to your code. Still, it sounds as
though your code is cumbersome.


Quote:
(well, all this, as well as a few arithmetic ops, such as adding the opcode
size to EIP, ...).

previously I had an opcode decoder which was not super fast (approx 60% of
runtime if sequential decoding is used), but the hash seems to have largely
eliminated it (since most of the time, pre-decoded opcodes are used from the
hash).


so, alas, it is an x86 interpreter performing at approx 386 (or maybe
486)-like speeds on my computer (~6 MIPS...).
could be faster, except that MSVC on x64 has no real idea what it is doing
WRT optimizations.

then again, it is around 12x faster than when I started trying to optimize
it (~0.5 MIPS...).


note: I have plenty of past experience with both ASM and JIT, but at the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.


dunno if anyone knows some major way around this.

IOW: if there is anything that can be done when the main switch creeps into
a high position, and code starts becomming very fussy about the fine details
WRT performance issues...

past experience is generally not... (that this is a general limit to C-based
interpreters...).


any comments?...



Richard Harter, cri at (no spam) tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Kafka wasn't an author;
Kafka was a prophet!
 
BGB / cr88192...
Posted: Mon Nov 02, 2009 12:16 am
Guest
"Pascal J. Bourguignon" <pjb at (no spam) informatimago.com> wrote in message
news:87y6mqgot3.fsf at (no spam) galatea.local...
Quote:
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> writes:

"Pascal J. Bourguignon" <pjb at (no spam) informatimago.com> wrote in message
news:877huaihe4.fsf at (no spam) galatea.local...
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> writes:
note: I have plenty of past experience with both ASM and JIT, but at
the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.

dunno if anyone knows some major way around this.

Modify the C compiler so that it generates optimized code for your
specific case (ie. hide your specific assembly in the C compiler).


I am using MSVC on x64 for this...

So unless you're rich enough to buy the sources of MSVC, forget it.


yep.


Quote:
I can't even do stupid inline assembly, as MS removed this feature for
x64...

May I suggest you to switch to an open source compiler (such as gcc)?
Or just write your own compiler?


GCC on Win64 was very unreliable last I checked (occasionally generating bad
code, ...), but it may have improved since then.

my compiler works, just I similarly don't trust it that well to produce good
code, and using it for compiling much of anything "non-trivial" is likely to
leave bugs all over the place which would be rather difficult to track down
(I leave it mostly for compiling smaller code fragments at runtime, since a
failure here is at least usually obvious and nearly not so difficult to
track down and fix...).

granted, running head-first into the land of bug fixing would be a good way
to improve its reliability, but I guess I am not yet willing to go down this
path (AKA: having lots of code built with lots of little hidden bugs...).


Quote:
now, I could use my own compiler, but alas then I would have to debug its
output...

Hence the suggestion for a debugged compiler such as gcc.


it may have improved since last I looked, but at the time GCC's Win64 output
was buggy.

I had compiled a big mass of code, and it was crashing. I eventually tracked
it down to GCC having been messing up with handling the Win64 calling
convention, and so I ended up using MSVC instead, figuring at least its
Win64 support demonstratably works reliably...


Quote:

or even GCC's Win64 support, but last time I tried to use it, it produced
code so bad (buggy and ill formed) as to scare me away. granted, it has
been
a fairly long time now (~ 1 year already?...), so maybe they have
improved
it some, I don't know...

This doesn't matter. What matters is that you have the sources, so
you can improve its optimizer.


with GCC, one probably will not have to, as presumably at least GCC has an
idea how to produce efficient code (it seems to do so well enough on Linux).

so, the biggie issue is mostly a matter of bugs, and I am not really willing
to go digging through GCC's (IMO, teh huge) codebase on a bug hunt...

granted, they have hopefully improved the bugginess at least, which is the
main issue.


Quote:
(But also, read the interview of Fran Allen in "Coders at Work" (Peter
Seibel, Apress), you might prefer another language altogether to be
able to really optimize its compiler).


I have a fairly large existing C codebase, and I will probably not switch to
another language noting that most others have poor C integration.


it is, likewise:
it is not that there was no reason that I chose x86 as the main ISA for this
interpreter...

I chose x86 for similar reasons to why I stick with C...


even if, granted, x86 is not the most ideal ISA for an interpreter, but
hell, at least it allows me to use GCC for building a lot of code that runs
within my VM, which is itself enough of a reason IMO...


Quote:

--
__Pascal Bourguignon__
 
BGB / cr88192...
Posted: Mon Nov 02, 2009 12:59 am
Guest
"Richard Harter" <cri at (no spam) tiac.net> wrote in message
news:4aedcfb0.958856406 at (no spam) text.giganews.com...
Quote:
On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192"
cr88192 at (no spam) hotmail.com> wrote:

this is an observation I have made in the past, and I think I will state
here as if it were some kind of rule:
when a profiler says that the largest amount of time used in an
interpreter
is the main opcode dispatch switch (rather than some function called by
it,
or some logic contained within said switch, ... especially if this switch
does nothing more than call functions...), then they are rapidly
approaching
the performance limits they can hope to gain from said interpreter.

which is kind of lame in this case, as this interpreter is, technically,
not
all that fast...

but, then again, there may still be a little tweaking left, as said switch
has not gained 1st place (it is 2nd), and holds approx 12% of the total
running time.

1st place, at 15%, is the hash-table for fetching instructions (and,
sadly,
is not really optimizable unless I discover some way to 'optimize' basic
arithmetic expressions, which FWIW is about as productive as trying to
optimize a switch...).

this means: 27% of time:
hash EIP (currently: "((EIP*65521)>>16)&65535");
fetch opcode-struct from hash-table (the hash-fail rate is apparently very
low);
switch on opcode info (branches to functions containing further dispatch
and
logic code).

Offhand the above sounds ugly. I don't know many op codes you
are using but I expect that it most it is a few hundred.

errm...

the x86 ISA has a bit more than this, closer to around 1,800 (core integer
ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it a
bit over 2000.

note that this is actual opcodes, not nmonics, where x86 likes using a
number of different opcodes with the same nmonic, although in my list there
are still around 800 nmonics...

as for there being only a few hundred opcodes, in nearly any other bytecode
this would probably be the case, but is not so much the case with x86...


the interpreter does not at present handle the full set though...


but, the hash is not used for opcode lookup/decoding, rather it is used for
grabbing already decoded instructions from a cache (which is based on memory
address).

(this is because instruction decoding is fairly expensive, and so is not
done inline, and when instructions are actually decoded, they are converted
from an in-memory opcode, to a struct-based representation, which is what is
fetched here by the hash lookup).

so, it serves a role similar to that of the L1 cache.


Quote:
The
hash function you use is relatively expensive. Using a hash
table at all is relatively expensive. A trie might be cheaper or
possibly a perfect hash. Then there is the question of why you
are using a switch at all. If the op-code lookup yields an
index, why aren't you using it as an index into a table that
contains the pointer to the action function and the op-code
struct?


the hash table fetches the (already decoded) opcode struct, and the switch
is used to have the interpreter actually so something.

Step:
hash EIP address (actually: EIP+CS.Base);
try fetch decoded opcode:
ExecOpcode.
fail:
set up hash entry;
decode opcode at EIP into hash;
ExecOpcode.

ExecOpcode:
switch(op->type) //not exact, but close enough
case Basic: ExecOpcodeBasic(); break;
case Reg: ExecOpcodeReg(); break;
case RM: ExecOpcodeRM(); break;
case RegRM: ExecOpcodeRegRM(); break;
case RMReg: ExecOpcodeRMReg(); break;
case Imm: ExecOpcodeImm(); break;
case RegImm: ExecOpcodeRegImm(); break;
case RMImm: ExecOpcodeRMImm(); break;
...

....
ExecOpcodeRegRM:
resolve RM, ...
switch(op->opnum)
case ADC: ...
case ADD: ...
case AND: ...
case BSF: ...
case BSR: ...
...
....

in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a
fairly heavy ISA...


Quote:
Mind you, there might be very good reasons for doing what you are
doing. I don't have access to your code. Still, it sounds as
though your code is cumbersome.


kind of, but it could be a whole lot worse...


how would you approach something like x86?...

hopefully not by suggesting a huge switch to dispatch for every possible
opcode byte/prefix/... and then re-enter for following bytes. many opcodes
are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE
opcodes), and then there is the Mod/RM field, which itself is an issue to
decode.

(if you don't believe me, go and take a good long hard look at the Intel
docs, and imagine how you would go about writing an interpreter for all this
stuff...).


I started initially trying this, but soon found it unworkable.

I then switched to a more "generic" strategy (essentially grabbing my
pre-existing disassembler and using this as a template for an opcode
decoder, which itself was mostly based on a pattern-matching strategy).

it took a bit of tweaking to get the speed up (following that of reworking
it into a piece of usable interpreter machinery), but it was still fairly
slow vs the interpreter, hence I added the hash (actually, originally I
considered a more involved strategy, but this was much simpler for now, and
can still pull off more-or-less the same effect...).


Quote:

(well, all this, as well as a few arithmetic ops, such as adding the
opcode
size to EIP, ...).

previously I had an opcode decoder which was not super fast (approx 60% of
runtime if sequential decoding is used), but the hash seems to have
largely
eliminated it (since most of the time, pre-decoded opcodes are used from
the
hash).


so, alas, it is an x86 interpreter performing at approx 386 (or maybe
486)-like speeds on my computer (~6 MIPS...).
could be faster, except that MSVC on x64 has no real idea what it is doing
WRT optimizations.

then again, it is around 12x faster than when I started trying to optimize
it (~0.5 MIPS...).


note: I have plenty of past experience with both ASM and JIT, but at the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.


dunno if anyone knows some major way around this.

IOW: if there is anything that can be done when the main switch creeps
into
a high position, and code starts becomming very fussy about the fine
details
WRT performance issues...

past experience is generally not... (that this is a general limit to
C-based
interpreters...).


any comments?...



Richard Harter, cri at (no spam) tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Kafka wasn't an author;
Kafka was a prophet!
 
Richard Harter...
Posted: Mon Nov 02, 2009 2:50 am
Guest
On Sun, 1 Nov 2009 12:59:06 -0700, "BGB / cr88192"
<cr88192 at (no spam) hotmail.com> wrote:

Quote:

"Richard Harter" <cri at (no spam) tiac.net> wrote in message
news:4aedcfb0.958856406 at (no spam) text.giganews.com...
On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192"
cr88192 at (no spam) hotmail.com> wrote:
[snip]


Quote:

Offhand the above sounds ugly. I don't know many op codes you
are using but I expect that it most it is a few hundred.

errm...

the x86 ISA has a bit more than this, closer to around 1,800 (core integer
ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it a
bit over 2000.

note that this is actual opcodes, not nmonics, where x86 likes using a
number of different opcodes with the same nmonic, although in my list there
are still around 800 nmonics...

as for there being only a few hundred opcodes, in nearly any other bytecode
this would probably be the case, but is not so much the case with x86...

That's still not all that bad, but it is less than pleasant.
Quote:


the interpreter does not at present handle the full set though...


but, the hash is not used for opcode lookup/decoding, rather it is used for
grabbing already decoded instructions from a cache (which is based on memory
address).

(this is because instruction decoding is fairly expensive, and so is not
done inline, and when instructions are actually decoded, they are converted
from an in-memory opcode, to a struct-based representation, which is what is
fetched here by the hash lookup).

so, it serves a role similar to that of the L1 cache.

I'm a little confused here. Are we just talking about opcodes
(decoded or not) or are we talking about full instructions. If
we are only talking about opcodes then everything can be
precomputed except an opcode->index mapping.

Quote:


The
hash function you use is relatively expensive. Using a hash
table at all is relatively expensive. A trie might be cheaper or
possibly a perfect hash. Then there is the question of why you
are using a switch at all. If the op-code lookup yields an
index, why aren't you using it as an index into a table that
contains the pointer to the action function and the op-code
struct?


the hash table fetches the (already decoded) opcode struct, and the switch
is used to have the interpreter actually so something.

Step:
hash EIP address (actually: EIP+CS.Base);
try fetch decoded opcode:
ExecOpcode.
fail:
set up hash entry;
decode opcode at EIP into hash;
ExecOpcode.

ExecOpcode:
switch(op->type) //not exact, but close enough
case Basic: ExecOpcodeBasic(); break;
case Reg: ExecOpcodeReg(); break;
case RM: ExecOpcodeRM(); break;
case RegRM: ExecOpcodeRegRM(); break;
case RMReg: ExecOpcodeRMReg(); break;
case Imm: ExecOpcodeImm(); break;
case RegImm: ExecOpcodeRegImm(); break;
case RMImm: ExecOpcodeRMImm(); break;
...

...
ExecOpcodeRegRM:
resolve RM, ...
switch(op->opnum)
case ADC: ...
case ADD: ...
case AND: ...
case BSF: ...
case BSR: ...
...

Something isn't quite right here. In the switch ExecOpcodeRegRM
is a function; in the code immediately after it is a label.

One problem with your switches is that they are big. Now a good
compiler can and should optimize a compact set of cases into a
jump table. However all bets are off if there are holes or if
the cases are sequential. If doesn't produce a jump table it
probably defaults to sequential if/then/else chains which are
deadly for large switches. Thus in your ExecOpCode switch you
really want a table of function pointers indexed by op->type so
that the switch becomes something like

optypes[op->type]();

The op types you use for the table have to be approximately
sequential integers. If Intel doesn't give you the right kind of
numbers to use as an index, create an index field in your struct.
You can even have holes in the indexing - just fill the holes
with pointers to an error function.

All of this is fairly standard. There probably is some good
reason why you aren't doing this, but it isn't obvious.

Quote:

in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a
fairly heavy ISA...


Mind you, there might be very good reasons for doing what you are
doing. I don't have access to your code. Still, it sounds as
though your code is cumbersome.


kind of, but it could be a whole lot worse...


how would you approach something like x86?...

hopefully not by suggesting a huge switch to dispatch for every possible
opcode byte/prefix/... and then re-enter for following bytes. many opcodes
are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE
opcodes), and then there is the Mod/RM field, which itself is an issue to
decode.

Shudder. If in fact there are only a couple of thousand op codes
then 2/3 is plausible for organizational reasons. However 4/5
sounds a little fishy. Are there really only a couple of
thousand when you take everything into account.

But no, if there only a few thousand opcodes in 2/3 bytes a
specialized trie should do quite nicely. I like to do
precomputed magic tables myself, but it may turn out that a hash
function will do fine. You aren't by any chance using a prime
number size table are you? The modulo operation could be the
bulk of your time in the hash table.


Quote:

(if you don't believe me, go and take a good long hard look at the Intel
docs, and imagine how you would go about writing an interpreter for all this
stuff...).

Thanks but no thanks. Looking at the Intel docs is very low on
my list of priorities.
Quote:


I started initially trying this, but soon found it unworkable.

I then switched to a more "generic" strategy (essentially grabbing my
pre-existing disassembler and using this as a template for an opcode
decoder, which itself was mostly based on a pattern-matching strategy).

it took a bit of tweaking to get the speed up (following that of reworking
it into a piece of usable interpreter machinery), but it was still fairly
slow vs the interpreter, hence I added the hash (actually, originally I
considered a more involved strategy, but this was much simpler for now, and
can still pull off more-or-less the same effect...).



(well, all this, as well as a few arithmetic ops, such as adding the
opcode
size to EIP, ...).

previously I had an opcode decoder which was not super fast (approx 60% of
runtime if sequential decoding is used), but the hash seems to have
largely
eliminated it (since most of the time, pre-decoded opcodes are used from
the
hash).


so, alas, it is an x86 interpreter performing at approx 386 (or maybe
486)-like speeds on my computer (~6 MIPS...).
could be faster, except that MSVC on x64 has no real idea what it is doing
WRT optimizations.

then again, it is around 12x faster than when I started trying to optimize
it (~0.5 MIPS...).


note: I have plenty of past experience with both ASM and JIT, but at the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.


dunno if anyone knows some major way around this.

IOW: if there is anything that can be done when the main switch creeps
into
a high position, and code starts becomming very fussy about the fine
details
WRT performance issues...

past experience is generally not... (that this is a general limit to
C-based
interpreters...).


any comments?...



Richard Harter, cri at (no spam) tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Kafka wasn't an author;
Kafka was a prophet!



Richard Harter, cri at (no spam) tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Kafka wasn't an author;
Kafka was a prophet!
 
bartc...
Posted: Mon Nov 02, 2009 2:59 am
Guest
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcjarj$vro$1 at (no spam) news.albasani.net...

Quote:
but, then again, there may still be a little tweaking left, as said switch
has not gained 1st place (it is 2nd), and holds approx 12% of the total
running time.

1st place, at 15%, is the hash-table for fetching instructions (and,
sadly, is not really optimizable unless I discover some way to 'optimize'
basic arithmetic expressions, which FWIW is about as productive as trying
to optimize a switch...).

this means: 27% of time:
hash EIP (currently: "((EIP*65521)>>16)&65535");
fetch opcode-struct from hash-table (the hash-fail rate is apparently very
low);
switch on opcode info (branches to functions containing further dispatch
and logic code).

(well, all this, as well as a few arithmetic ops, such as adding the
opcode size to EIP, ...).

previously I had an opcode decoder which was not super fast (approx 60% of
runtime if sequential decoding is used), but the hash seems to have
largely eliminated it (since most of the time, pre-decoded opcodes are
used from the hash).

Forget C for the crucial dispatch code. If you don't have inline asm, use a
separate linked function.

I don't understand your hash technique. I would just have decoded an opcode
at a time (ie. 8-bits at a time), although I've long forgotten the
intricacies of intel opcode decodes. If ebx points to the next instruction
of 32-bit code:

movzx eax,byte [ebx]
inc ebx
jmp [eax*4 + mainopcode_jumptable]

When further decodes are required, a similar 3 lines but with a different
table.

And after successfully dealing with an opcode, repeat those initial three
lines (no need to return to a common loop point).

Some experience of creating a disassembler (which is not speed critical)
would be needed to help plan the best approach, but I would still go for asm
code like this rather than C (especially if having trouble getting
reasonable code out of it).

(I had my own hll which also had trouble generating decent code in cases
like this, insisting on range checking that I didn't need:

do
switch x
when a then...
....
end switch
od

I made a couple of language/compiler mods which allowed me to write:

doswitch
asm .....
when a then
...
end switch

Ie. I could write my own switch jump code in asm! And automatically looped
back to the start, although quite often I just repeated the switch code at
the end of some handlers)

Quote:
note: I have plenty of past experience with both ASM and JIT, but at the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.

What slowdown are you getting compared with a real machine?

--
Bartc
 
Pascal J. Bourguignon...
Posted: Mon Nov 02, 2009 3:08 am
Guest
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> writes:
Quote:
dunno if anyone knows some major way around this.

Another way to do it would be to generate portably assembly with LLVM.

--
__Pascal Bourguignon__
 
Robbert Haarman...
Posted: Mon Nov 02, 2009 3:11 am
Guest
On Sun, Nov 01, 2009 at 12:59:06PM -0700, BGB / cr88192 wrote:

Quote:
but, the hash is not used for opcode lookup/decoding, rather it is used for
grabbing already decoded instructions from a cache (which is based on memory
address).

You could extend this to work on blocks of instructions rather than on single
instructions, where blocks start at places that are jumped to by the program.

If you do that, you will have to do the decoding of x86 code and the hash
table lookups far less often. I believe this is what QEMU does (they call
it "dynamic translation").

Regards,

Bob

--
Whenever people say "I thought", they were usually wrong.
 
 
Page 1 of 4    Goto page 1, 2, 3, 4  Next
All times are GMT
The time now is Wed Dec 09, 2009 1:03 pm