 |
|
| Computers Forum Index » Computer Languages (Misc) » the 'switch' limit...... |
|
Page 2 of 4 Goto page Previous 1, 2, 3, 4 Next |
|
| Author |
Message |
| bartc... |
Posted: Mon Nov 02, 2009 3:25 am |
|
|
|
Guest
|
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hckpa8$aj2$1 at (no spam) news.albasani.net...
Quote: (if you don't believe me, go and take a good long hard look at the Intel
docs, and imagine how you would go about writing an interpreter for all
this stuff...).
I think emulator would be a better name. Interpreter has other connotations.
Quote: I started initially trying this, but soon found it unworkable.
I have a feeling that that would still be the best approach, as I guess
that's how actual processors have to decode this stuff.
But I haven't tried it myself and it's too big a task to have a quick go (I
haven't even been able to find any opcode maps anywhere).
--
Bartc |
|
|
| Back to top |
|
|
|
| BGB / cr88192... |
Posted: Mon Nov 02, 2009 4:14 am |
|
|
|
Guest
|
"Richard Harter" <cri at (no spam) tiac.net> wrote in message
news:4aedf3bd.968084843 at (no spam) text.giganews.com...
Quote: On Sun, 1 Nov 2009 12:59:06 -0700, "BGB / cr88192"
cr88192 at (no spam) hotmail.com> wrote:
"Richard Harter" <cri at (no spam) tiac.net> wrote in message
news:4aedcfb0.958856406 at (no spam) text.giganews.com...
On Sat, 31 Oct 2009 23:46:13 -0700, "BGB / cr88192"
cr88192 at (no spam) hotmail.com> wrote:
[snip]
Offhand the above sounds ugly. I don't know many op codes you
are using but I expect that it most it is a few hundred.
errm...
the x86 ISA has a bit more than this, closer to around 1,800 (core integer
ops, x87, SSE/SSE2/..., and partial AVX). full AVX and SSE5 might push it
a
bit over 2000.
note that this is actual opcodes, not nmonics, where x86 likes using a
number of different opcodes with the same nmonic, although in my list
there
are still around 800 nmonics...
as for there being only a few hundred opcodes, in nearly any other
bytecode
this would probably be the case, but is not so much the case with x86...
That's still not all that bad, but it is less than pleasant.
yes, ok.
some of my past bytecodes ended up with around 500 opcodes.
I think JBC has somewhere around 170-200 (I forget the exact number),
however, this is too few to do "that" much more with it.
some newer JVM versions added a few opcodes, with a good deal more
functionality being essentially multiplexed in via existing mechanisms (such
as method calls, ...). (actually, it could be loosely compared to building a
network filesystem interface on top of HTTP via WebDAV, where pre-existing
mechanisms are "extended" to handle new operations, ...).
the alternative strategy is, of course, adding piles of new opcodes...
Quote:
the interpreter does not at present handle the full set though...
but, the hash is not used for opcode lookup/decoding, rather it is used
for
grabbing already decoded instructions from a cache (which is based on
memory
address).
(this is because instruction decoding is fairly expensive, and so is not
done inline, and when instructions are actually decoded, they are
converted
from an in-memory opcode, to a struct-based representation, which is what
is
fetched here by the hash lookup).
so, it serves a role similar to that of the L1 cache.
I'm a little confused here. Are we just talking about opcodes
(decoded or not) or are we talking about full instructions. If
we are only talking about opcodes then everything can be
precomputed except an opcode->index mapping.
ok, I was using the term to 'opcode' to refer to the entire instruction
(including arguments, ...).
so, in this case the 'opcode' structure contains:
opnum, which is an internal opcode number, which idendifies the nmonic;
opidx, which is an internal opcode index, which identifies the binary
representation of said imstruction;
width, which identifies the bit-width of integer ops;
reg, a register;
op_sz, cached in-memory size of opcode/instruction (used mostly to step
to the next EIP);
rm_base, memory base register, or another data register;
rm_idx, index register;
rm_sc, register scale;
rm_disp, an offset added to memory ops;
rm_fl, which holds flags;
imm, for holding an immediate;
reg2/reg3, for AVX (because we really need 4-register opcodes...);
...
the opcode decoder recognizes the pattern, and fills in all the fields in
the struct as appropriate.
in this way, the main interpreter just grabs the fields from the structure
it needs, without having to fiddle with its serialized form.
these structures are keyed by memory address, which is where the hash table
comes in.
Quote:
The
hash function you use is relatively expensive. Using a hash
table at all is relatively expensive. A trie might be cheaper or
possibly a perfect hash. Then there is the question of why you
are using a switch at all. If the op-code lookup yields an
index, why aren't you using it as an index into a table that
contains the pointer to the action function and the op-code
struct?
the hash table fetches the (already decoded) opcode struct, and the switch
is used to have the interpreter actually so something.
Step:
hash EIP address (actually: EIP+CS.Base);
try fetch decoded opcode:
ExecOpcode.
fail:
set up hash entry;
decode opcode at EIP into hash;
ExecOpcode.
ExecOpcode:
switch(op->type) //not exact, but close enough
case Basic: ExecOpcodeBasic(); break;
case Reg: ExecOpcodeReg(); break;
case RM: ExecOpcodeRM(); break;
case RegRM: ExecOpcodeRegRM(); break;
case RMReg: ExecOpcodeRMReg(); break;
case Imm: ExecOpcodeImm(); break;
case RegImm: ExecOpcodeRegImm(); break;
case RMImm: ExecOpcodeRMImm(); break;
...
...
ExecOpcodeRegRM:
resolve RM, ...
switch(op->opnum)
case ADC: ...
case ADD: ...
case AND: ...
case BSF: ...
case BSR: ...
...
Something isn't quite right here. In the switch ExecOpcodeRegRM
is a function; in the code immediately after it is a label.
ok, my notation isn't very good.
it is a function, but the label-like notation shows what it contains.
ok, imagine it were something like:
ExecOpcodeRegRM():
...
Quote: One problem with your switches is that they are big. Now a good
compiler can and should optimize a compact set of cases into a
jump table. However all bets are off if there are holes or if
the cases are sequential. If doesn't produce a jump table it
probably defaults to sequential if/then/else chains which are
deadly for large switches. Thus in your ExecOpCode switch you
really want a table of function pointers indexed by op->type so
that the switch becomes something like
optypes[op->type]();
The op types you use for the table have to be approximately
sequential integers. If Intel doesn't give you the right kind of
numbers to use as an index, create an index field in your struct.
You can even have holes in the indexing - just fill the holes
with pointers to an error function.
All of this is fairly standard. There probably is some good
reason why you aren't doing this, but it isn't obvious.
actually, I have noted already that the compiler produces jump tables
internally for all these, so this much is good (I know, I have seen the
produced ASM for these cases).
the main cost is that it ends up taking around 10-12 instructions to handle
the jump-table logic, which is mostly the fault of MSVC lameness...
it is apparently mostly just producing if/else style sequences for very
small switches, such as apparently the switch for segment-overrides. I could
use a table for this, but in this case it is not in a place which causes
notable impact.
Quote:
in all, it is a fairly "heavyweight" interpreter, but alas, x86 is also a
fairly heavy ISA...
Mind you, there might be very good reasons for doing what you are
doing. I don't have access to your code. Still, it sounds as
though your code is cumbersome.
kind of, but it could be a whole lot worse...
how would you approach something like x86?...
hopefully not by suggesting a huge switch to dispatch for every possible
opcode byte/prefix/... and then re-enter for following bytes. many opcodes
are long (2/3 bytes for just the 'opcode' part, or 4/5 for many SSE
opcodes), and then there is the Mod/RM field, which itself is an issue to
decode.
Shudder. If in fact there are only a couple of thousand op codes
then 2/3 is plausible for organizational reasons. However 4/5
sounds a little fishy. Are there really only a couple of
thousand when you take everything into account.
this is mostly because of how Intel organizes their opcodes.
a lot of the SSE opcodes use 0x66, 0xF3, 0xF2, ... as part of the opcode.
here are a few, which may encode as 3 or 4 bytes for the opcode (X is the
optional REX prefix).
addpd 66X0F58/r xr,xrm
addps X0F58/r xr,xrm
addsd F2X0F58/r xr,xrm
addss F3X0F58/r xr,xrm
addsubpd 66X0FD0/r xr,xrm
addsubps F2X0FD0/r xr,xrm
here are a few with 4 or 5 byte opcodes:
dppd 66X0F3A41/r,ib xr,xrm,i8
dpps 66X0F3A40/r,ib xr,xrm,i8
extractps 66X0F3A17/r,ib xr,xrm,i8
insertps 66X0F3A21/r,ib xr,xrm,i8
mpsadbw 66X0F3A42/r,ib xr,xrm,i8
....
aesenc 66X0F38DC/r xr,xrm
aesenclast 66X0F38DD/r xr,xrm
aesdec 66X0F38DE/r xr,xrm
aesdeclast 66X0F38DF/r xr,xrm
aesimc 66X0F38DB/r xr,xrm
aeskeygenassist 66X0F3ADF/r,ib xr,xrm,i8
....
for AVX Intel switched to an essentially partly new opcode encoding scheme.
the AVX opcodes in the listing thus far seem to be using a mostly fixed
4-byte opcodes.
vblendpd Ijr0D/r,ib xr,xrv,xrm,i8
Ijv0D/r,ib yr,yrv,yrm,i8
vblendps Ijr0C/r,ib xr,xrv,xrm,i8
Ijv0C/r,ib yr,yrv,yrm,i8
vblendvpd Ijr4B/r,is xr,xrv,xrm,xri
Ijv4B/r,is yr,yrv,yrm,yri
vblendvps Ijr4A/r,is xr,xrv,xrm,xri
Ijv4A/r,is yr,yrv,yrm,yri
where Ixx basically encodes a 3-byte VEX prefix, so essentially a 4-byte
opcode.
luckily at least, most of the basic opcodes use only 1 or 2 bytes for the
opcode...
this is followed by however many more are needed to encode Mod/RM or
immediates...
so, for example:
mov [eax+4*ecx+0xF00BA5], 0xBAF00
can take 10 bytes (in addition to the opcode, this is the "/r" in the
listings above).
with a 5-byte opcode, this could take 15 bytes for an instruction.
luckily, this should not happen in practice (usually it takes special
construction to create an opcode exceeding Intel's 16-byte limit).
Quote: But no, if there only a few thousand opcodes in 2/3 bytes a
specialized trie should do quite nicely. I like to do
precomputed magic tables myself, but it may turn out that a hash
function will do fine. You aren't by any chance using a prime
number size table are you? The modulo operation could be the
bulk of your time in the hash table.
luckily the main interpreter does not need to see the horrors of the
opcode/instruction encoding scheme, as it is all much more "sterile" than
this (all of this is left to the "opcode decoder", which is, luckily, not
currently a significant part of the running time).
as for the hash table, it is a power-of-2 size, and hence I use a mask...
actually, I since replaced the mask with a cast to 'u16', since this is
slightly faster in this case.
I don't use modulo in hash calculations as this tends to kill performance.
I usually use a multiply, a shift, an a mask.
multiply+mask also works, but it requires a mersenne prime to work well, and
has its own set of "interesting" properties, which as I see it did not match
the current use (I chose instead to multiply by a non-mersenne prime and use
a shift).
Quote:
(if you don't believe me, go and take a good long hard look at the Intel
docs, and imagine how you would go about writing an interpreter for all
this
stuff...).
Thanks but no thanks. Looking at the Intel docs is very low on
my list of priorities.
I deal with them a lot...
much of what I have done in the past few years has involved these docs... |
|
|
| Back to top |
|
|
|
| bartc... |
Posted: Mon Nov 02, 2009 4:54 am |
|
|
|
Guest
|
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcl4ot$sia$1 at (no spam) news.albasani.net...
Quote: ok, I was using the term to 'opcode' to refer to the entire instruction
(including arguments, ...).
so, in this case the 'opcode' structure contains:
opnum, which is an internal opcode number, which idendifies the nmonic;
opidx, which is an internal opcode index, which identifies the binary
representation of said imstruction;
width, which identifies the bit-width of integer ops;
reg, a register;
op_sz, cached in-memory size of opcode/instruction (used mostly to step
to the next EIP);
rm_base, memory base register, or another data register;
rm_idx, index register;
rm_sc, register scale;
rm_disp, an offset added to memory ops;
rm_fl, which holds flags;
imm, for holding an immediate;
reg2/reg3, for AVX (because we really need 4-register opcodes...);
...
the opcode decoder recognizes the pattern, and fills in all the fields in
the struct as appropriate.
in this way, the main interpreter just grabs the fields from the structure
it needs, without having to fiddle with its serialized form.
these structures are keyed by memory address, which is where the hash
table comes in.
It looks like you're translating from the conventional packed form of the
instruction, to an unpacked form that your code finds it easier to deal
with.
In that case why not complete the process and just run this code directly
and forget the original? You will need to change all references to code
(jumps and calls) to addresses within this new block of code.
Then the hashing disappears and you can use whatever switch index you like.
(Although you are now no longer emulating x86 but interpreting something
different.)
(I understood anyway that your 'x86' code was not exactly x86, but only
similar. This means you might be able to directly generate the simpler,
unpacked form, and not bother with any decoding at all.)
--
Bartc |
|
|
| Back to top |
|
|
|
| BGB / cr88192... |
Posted: Mon Nov 02, 2009 5:46 am |
|
|
|
Guest
|
"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:0dpHm.792$Ym4.436 at (no spam) text.news.virginmedia.com...
Quote:
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcl4ot$sia$1 at (no spam) news.albasani.net...
ok, I was using the term to 'opcode' to refer to the entire instruction
(including arguments, ...).
so, in this case the 'opcode' structure contains:
opnum, which is an internal opcode number, which idendifies the
nmonic;
opidx, which is an internal opcode index, which identifies the binary
representation of said imstruction;
width, which identifies the bit-width of integer ops;
reg, a register;
op_sz, cached in-memory size of opcode/instruction (used mostly to
step to the next EIP);
rm_base, memory base register, or another data register;
rm_idx, index register;
rm_sc, register scale;
rm_disp, an offset added to memory ops;
rm_fl, which holds flags;
imm, for holding an immediate;
reg2/reg3, for AVX (because we really need 4-register opcodes...);
...
the opcode decoder recognizes the pattern, and fills in all the fields in
the struct as appropriate.
in this way, the main interpreter just grabs the fields from the
structure it needs, without having to fiddle with its serialized form.
these structures are keyed by memory address, which is where the hash
table comes in.
It looks like you're translating from the conventional packed form of the
instruction, to an unpacked form that your code finds it easier to deal
with.
yes.
Quote: In that case why not complete the process and just run this code directly
and forget the original? You will need to change all references to code
(jumps and calls) to addresses within this new block of code.
this case is still tied to the original image, and needs to be able to deal
with cases like self-modifying code...
granted, if code were assumed to be static and read/only, there is little to
prevent discarding the original code...
Quote: Then the hashing disappears and you can use whatever switch index you
like. (Although you are now no longer emulating x86 but interpreting
something different.)
the 'switch' is actually using indices which are generated automatically by
tools which write parts of the interpreter (itself borrowed from my
assembler/disassembler).
the tool essentially processes high-level listings and writes a lot of
code/tables/... allowing a lot of the decode process to be largely automatic
(I ended up mostly adding a lot of additional specialized logic mostly to
make this part of the process go faster...).
the hash itself it used for fetching the opcode structure, not for figuring
out an index to use for the switch...
Quote: (I understood anyway that your 'x86' code was not exactly x86, but only
similar. This means you might be able to directly generate the simpler,
unpacked form, and not bother with any decoding at all.)
this was a different idea...
I had considered an x86 subset mostly to allow the possibility of high-level
analysis, verification, and translation...
if you remember, I had used the aleph/bet/gimel/dalet naming scheme for that
idea...
all this is, in effect, plain x86, not my modified subset...
(although, it is restricted to userspace code, and some of the internal
details of the interpreter's internals are not strictly orthodox, but this
is largely invisible from the POV of the code being run...).
they are, however, intended for different purposes, although given both
technologies would be based on the binary representation of x86, they would
be, essentially, compatible... |
|
|
| Back to top |
|
|
|
| BGB / cr88192... |
Posted: Mon Nov 02, 2009 5:54 am |
|
|
|
Guest
|
"Robbert Haarman" <comp.lang.misc at (no spam) inglorion.net> wrote in message
news:20091101221132.GB4110 at (no spam) yoda.inglorion.net...
Quote: On Sun, Nov 01, 2009 at 12:59:06PM -0700, BGB / cr88192 wrote:
but, the hash is not used for opcode lookup/decoding, rather it is used
for
grabbing already decoded instructions from a cache (which is based on
memory
address).
You could extend this to work on blocks of instructions rather than on
single
instructions, where blocks start at places that are jumped to by the
program.
If you do that, you will have to do the decoding of x86 code and the hash
table lookups far less often. I believe this is what QEMU does (they call
it "dynamic translation").
this is also what I had considered as well, but I had considered doing it in
terms of "traces", or essentially globs of instructions terminated by a
jump.
however, the complexity of this approach was higher, and so I instead opted
with the current strategy (based on a hash table), since it was less work
up-front...
I have still made many internal provisions for the strategy though, and so
the current approach may be more of a stop-gap measure than a permanent
solution.
it is worth noting that, before last-night, I had been decoding opcodes
purely sequentially, then I added the hash for a notable jump in performance
(around 3x), at which point the main switch became the bottleneck...
the hash allows only rarely/periodically decoding new instructions, since in
general the hash miss rate seems to be fairly low at present...
switching to traces, however, would require an more indirect method of
handling opcodes (rather than stepping along in an "EIP space", as is
happening with the hash). likely, this would require forming and managing
linked lists, which is an added complexity.
or such...
Quote: Regards,
Bob
--
Whenever people say "I thought", they were usually wrong.
|
|
|
| Back to top |
|
|
|
| BGB / cr88192... |
Posted: Mon Nov 02, 2009 5:57 am |
|
|
|
Guest
|
"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:QVnHm.760$Ym4.567 at (no spam) text.news.virginmedia.com...
Quote:
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hckpa8$aj2$1 at (no spam) news.albasani.net...
(if you don't believe me, go and take a good long hard look at the Intel
docs, and imagine how you would go about writing an interpreter for all
this stuff...).
I think emulator would be a better name. Interpreter has other
connotations.
so does emulator...
emulator implies emulating the complete arch (hardware, BIOS, ...), rather
than simply the userspace portion of the CPU / ISA...
this is thus much more "lightweight" than a full emulator.
Quote: I started initially trying this, but soon found it unworkable.
I have a feeling that that would still be the best approach, as I guess
that's how actual processors have to decode this stuff.
but, it is not strictly necessary...
Quote: But I haven't tried it myself and it's too big a task to have a quick go
(I
haven't even been able to find any opcode maps anywhere).
check the Intel docs...
|
|
|
| Back to top |
|
|
|
| tm... |
Posted: Mon Nov 02, 2009 10:26 am |
|
|
|
Guest
|
On 1 Nov., 20:16, "BGB / cr88192" <cr88... at (no spam) hotmail.com> wrote:
Quote: "Pascal J. Bourguignon" <p... at (no spam) informatimago.com> wrote in messagenews:87y6mqgot3.fsf at (no spam) galatea.local...
"BGB / cr88192" <cr88... at (no spam) hotmail.com> writes:
"Pascal J. Bourguignon" <p... at (no spam) informatimago.com> wrote in message
news:877huaihe4.fsf at (no spam) galatea.local...
"BGB / cr88192" <cr88... at (no spam) hotmail.com> writes:
note: I have plenty of past experience with both ASM and JIT, but at
the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.
dunno if anyone knows some major way around this.
Modify the C compiler so that it generates optimized code for your
specific case (ie. hide your specific assembly in the C compiler).
I am using MSVC on x64 for this...
So unless you're rich enough to buy the sources of MSVC, forget it.
yep.
I can't even do stupid inline assembly, as MS removed this feature for
x64...
May I suggest you to switch to an open source compiler (such as gcc)?
Or just write your own compiler?
GCC on Win64 was very unreliable last I checked (occasionally generating bad
code, ...), but it may have improved since then.
Did you tell them about the errors you found?
Quote: my compiler works, just I similarly don't trust it that well to produce good
code, and using it for compiling much of anything "non-trivial" is likely to
leave bugs all over the place which would be rather difficult to track down
(I leave it mostly for compiling smaller code fragments at runtime, since a
failure here is at least usually obvious and nearly not so difficult to
track down and fix...).
When you think that GCC and MSVC are not okay you should
definitely use your own C compiler. When there are bugs in it
create a test suite of C programs. That way you can reduce the
number of errors.
You seem to write programs in many areas. But to get this
programs finished with good quality I can only suggest:
Eat your own dogfood.
Greetings Thomas Mertes
Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows. |
|
|
| Back to top |
|
|
|
| BGB / cr88192... |
Posted: Mon Nov 02, 2009 11:50 am |
|
|
|
Guest
|
"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:xxnHm.748$Ym4.334 at (no spam) text.news.virginmedia.com...
Quote:
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcjarj$vro$1 at (no spam) news.albasani.net...
but, then again, there may still be a little tweaking left, as said
switch has not gained 1st place (it is 2nd), and holds approx 12% of the
total running time.
1st place, at 15%, is the hash-table for fetching instructions (and,
sadly, is not really optimizable unless I discover some way to 'optimize'
basic arithmetic expressions, which FWIW is about as productive as trying
to optimize a switch...).
this means: 27% of time:
hash EIP (currently: "((EIP*65521)>>16)&65535");
fetch opcode-struct from hash-table (the hash-fail rate is apparently
very low);
switch on opcode info (branches to functions containing further dispatch
and logic code).
(well, all this, as well as a few arithmetic ops, such as adding the
opcode size to EIP, ...).
previously I had an opcode decoder which was not super fast (approx 60%
of runtime if sequential decoding is used), but the hash seems to have
largely eliminated it (since most of the time, pre-decoded opcodes are
used from the hash).
Forget C for the crucial dispatch code. If you don't have inline asm, use
a separate linked function.
I don't understand your hash technique. I would just have decoded an
opcode at a time (ie. 8-bits at a time), although I've long forgotten the
intricacies of intel opcode decodes. If ebx points to the next instruction
of 32-bit code:
movzx eax,byte [ebx]
inc ebx
jmp [eax*4 + mainopcode_jumptable]
When further decodes are required, a similar 3 lines but with a different
table.
And after successfully dealing with an opcode, repeat those initial three
lines (no need to return to a common loop point).
what is to say that the main interpreter switch is dealing with matters of
decoding?...
what is to say that this would actually be any faster than having it NOT
deal with matters of decoding?...
....
apart from needing it to calculate jump targets, the main interpreter almost
does not need to even keep track of EIP...
Quote: Some experience of creating a disassembler (which is not speed critical)
would be needed to help plan the best approach, but I would still go for
asm code like this rather than C (especially if having trouble getting
reasonable code out of it).
(I had my own hll which also had trouble generating decent code in cases
like this, insisting on range checking that I didn't need:
do
switch x
when a then...
....
end switch
od
I made a couple of language/compiler mods which allowed me to write:
doswitch
asm .....
when a then
...
end switch
Ie. I could write my own switch jump code in asm! And automatically looped
back to the start, although quite often I just repeated the switch code at
the end of some handlers)
the decoder is fairly directly based on a disassembler, and yeah, switches
on some of the opcode bytes are used to assist in speeding up the decode
logic.
however, I was able to mostly eliminate instruction decoding from the
runtime overhead via the use of the hash.
also, there are reasons I am using C here...
what is to say an interpreter for x86 will necessarily run ON x86?...
as for doing interpreter logic partly in ASM, I had partly considered this
as well, but for now am sticking with C for other reasons (mostly that, at
the moment, ASM would add more complexities than are worthwhile...).
Quote: note: I have plenty of past experience with both ASM and JIT, but at the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.
What slowdown are you getting compared with a real machine?
currently, by simplistic tests, around 170x at present...
so, it is not great, but it could be worse...
|
|
|
| Back to top |
|
|
|
| James Harris... |
Posted: Mon Nov 02, 2009 1:26 pm |
|
|
|
Guest
|
On 2 Nov, 06:50, "BGB / cr88192" <cr88... at (no spam) hotmail.com> wrote:
Quote: "bartc" <ba... at (no spam) freeuk.com> wrote in message
....
I'll add my comments here as they are similar to bartc's.
Quote: I don't understand your hash technique. I would just have decoded an
opcode at a time (ie. 8-bits at a time), although I've long forgotten the
intricacies of intel opcode decodes. If ebx points to the next instruction
of 32-bit code:
movzx eax,byte [ebx]
inc ebx
jmp [eax*4 + mainopcode_jumptable]
When further decodes are required, a similar 3 lines but with a different
table.
And after successfully dealing with an opcode, repeat those initial three
lines (no need to return to a common loop point).
what is to say that the main interpreter switch is dealing with matters of
decoding?...
what is to say that this would actually be any faster than having it NOT
deal with matters of decoding?...
...
apart from needing it to calculate jump targets, the main interpreter almost
does not need to even keep track of EIP...
As I understand it your hash function looks up instructions that you
have already decoded. That's quite sophisticated and I infer from that
that your decoding is expensive. Whether that's the case or not I
would think that it's likely to be fastest to decode by simple lookup
tables.
....
Quote: the decoder is fairly directly based on a disassembler, and yeah, switches
on some of the opcode bytes are used to assist in speeding up the decode
logic.
however, I was able to mostly eliminate instruction decoding from the
runtime overhead via the use of the hash.
OK but the hash (together with interpreting what the hash returns and
dealing with any duplicate hashes) is still costly and ISTM that a
straight table-based decode may be faster. Why not try the following.
Quote:
also, there are reasons I am using C here...
what is to say an interpreter for x86 will necessarily run ON x86?...
Good point. I wrote my suggestion in assembler before I'd seen your
comment. Trying to write in assembler can be a good idea anyway to
confirm that code *should* compile to be fast. It should be easy
enough to cast to C.
The main code is just the following. It is untested. Apart from the
comments it is only five instructions - all of which are pairable.
Before starting the code you would need to load esi with the first
instruction address (to set esi as the initial simulated eip) and
point edi at the initial opcode table.
;----- Code start -----
;Registers used
; eax to edx - workspace; on dispatch al holds the fetched opcode
byte
; ebp - prefix flags to indicate rep, lock, segment and size
overrides
; esi - copy of the simulated instruction pointer
; edi - points to the relevant 256-entry opcode table
next_instruction:
;Clear all prefix instruction indications
xor ebp, ebp
next_instruction_byte:
;Fetch the next byte of the instruction and increment instruction
pointer
movzx al, [esi]
inc esi
;Execute the corresponding routine
jmp [edi + 4 * eax]
;----- Code end -----
That's it. Once that's in place the service routines would vary
according to what was needed. So I'm not leaving the hard part undone
ere are some samples. First, dealing with a prefix byte, all that's
needed is to set the relevant bit in the prefix-flags register.
;Code for 0x66 - operand size prefix
or ebp, PREFIX_FLAGS_OPSIZ
jmp next_instruction_byte
Second, code for nop. This jumps back to a different point so that the
prefix flags are cleared. Easy to do in assembler, of course, but
maybe some nested loops would just about be able to deal with this in
C without using goto.
;Code for 0x90 - nop
jmp next_instruction
Third, if the first byte is an "escape" code such as 0x0f the
instruction table needs to be changed. Subsequent code executed would
be pointed at by the new table.
;Code for 0x0f - instruction extension
lea edi, [opcode_following_0f_table]
jmp next_instruction_byte
Finally, here's some code to show that entry points can be combined
together to save code space. The intention is to push whatever
register is addressed by three bits (in this case the last three) of
an instruction byte. A small subtlety here is that the 80386 and later
push the value of esp as it was before the push. This routine does
that too.
;Code for initial opcodes 0x50 through 0x57 - push reg
;Identify and fetch the simulated register
and eax, 7
mov eax, [simulated_general_regs + 4 * eax]
;Push on to the simulated stack
;(only 32-bit shown; 16-bit needs to be added)
mov ebx, [simulated_esp]
sub ebx, 4
mov [simulated_esp], ebx
mov [ebx], eax
;And we are done
jmp next_instruction
Naturally, any instructions after an escape would reset to using the
initial opcode table before starting the next instruction with such as
lea edi, [initial_opcode_table]
Quote: as for doing interpreter logic partly in ASM, I had partly considered this
as well, but for now am sticking with C for other reasons (mostly that, at
the moment, ASM would add more complexities than are worthwhile...).
Maybe you were trying to do complex stuff in ASM. The above assembler
is blindingly simple. All it is doing is looking up what to do next
from a table with 256 entries.
Come back ASM, all is forgiven. :-)
....
Quote: What slowdown are you getting compared with a real machine?
currently, by simplistic tests, around 170x at present...
so, it is not great, but it could be worse...
You've maybe already explained why but if you want to emulate the
80386 why not use an existing emulator? Is it something to do with
wanting to restrict to a subset of its instructions?
James |
|
|
| Back to top |
|
|
|
| tm... |
Posted: Mon Nov 02, 2009 4:00 pm |
|
|
|
Guest
|
On 2 Nov., 14:26, James Harris <james.harri... at (no spam) googlemail.com> wrote:
Quote: On 2 Nov, 06:50, "BGB / cr88192" <cr88... at (no spam) hotmail.com> wrote:
"bartc" <ba... at (no spam) freeuk.com> wrote in message
...
I'll add my comments here as they are similar to bartc's.
I don't understand your hash technique. I would just have decoded an
opcode at a time (ie. 8-bits at a time), although I've long forgotten the
intricacies of intel opcode decodes. If ebx points to the next instruction
of 32-bit code:
movzx eax,byte [ebx]
inc ebx
jmp [eax*4 + mainopcode_jumptable]
When further decodes are required, a similar 3 lines but with a different
table.
And after successfully dealing with an opcode, repeat those initial three
lines (no need to return to a common loop point).
what is to say that the main interpreter switch is dealing with matters of
decoding?...
what is to say that this would actually be any faster than having it NOT
deal with matters of decoding?...
...
apart from needing it to calculate jump targets, the main interpreter almost
does not need to even keep track of EIP...
As I understand it your hash function looks up instructions that you
have already decoded. That's quite sophisticated and I infer from that
that your decoding is expensive. Whether that's the case or not I
would think that it's likely to be fastest to decode by simple lookup
tables.
Do I understand this correctly?
The instructions are already decoded and when they
are executed a second decode (switch) is necessary.
This sounds very strange for me, but maybe I just
misunderstand the problem.
I have a question:
Why don't you decode to function pointers instead
of integers (which need to be decoded again)?
Function pointers would even be faster than a
lookup table.
The Seed7 interpreter (hi) works with function
pointers. That way no switch is necessary when
a program is executed.
Greetings Thomas Mertes
Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows. |
|
|
| Back to top |
|
|
|
| bartc... |
Posted: Mon Nov 02, 2009 4:45 pm |
|
|
|
Guest
|
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hclvg3$ubg$1 at (no spam) news.albasani.net...
Quote:
"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:xxnHm.748$Ym4.334 at (no spam) text.news.virginmedia.com...
What slowdown are you getting compared with a real machine?
currently, by simplistic tests, around 170x at present...
so, it is not great, but it could be worse...
I put together a very quick test for the following code:
mov ebx,100000000
l1:
sub ebx,1
jnz l1
hlt
Emulating only the last 3 instructions, the emulation was 30x slower than
the real code (the first 3 instructions), using a mix of hll and assembler.
Could be a bit tighter with dedicated registers, but I'm a bit short of
time.
The sub instructions used a 2-level jump table (on 1st 8 bits of first byte,
and top 5 bits of modrm byte).
--
bartc |
|
|
| Back to top |
|
|
|
| James Harris... |
Posted: Mon Nov 02, 2009 5:00 pm |
|
|
|
Guest
|
On 2 Nov, 16:00, tm <thomas.mer... at (no spam) gmx.at> wrote:
Quote: On 2 Nov., 14:26, James Harris <james.harri... at (no spam) googlemail.com> wrote:
On 2 Nov, 06:50, "BGB / cr88192" <cr88... at (no spam) hotmail.com> wrote:
"bartc" <ba... at (no spam) freeuk.com> wrote in message
...
I'll add my comments here as they are similar to bartc's.
I don't understand your hash technique. I would just have decoded an
opcode at a time (ie. 8-bits at a time), although I've long forgotten the
intricacies of intel opcode decodes. If ebx points to the next instruction
of 32-bit code:
movzx eax,byte [ebx]
inc ebx
jmp [eax*4 + mainopcode_jumptable]
When further decodes are required, a similar 3 lines but with a different
table.
And after successfully dealing with an opcode, repeat those initial three
lines (no need to return to a common loop point).
what is to say that the main interpreter switch is dealing with matters of
decoding?...
what is to say that this would actually be any faster than having it NOT
deal with matters of decoding?...
...
apart from needing it to calculate jump targets, the main interpreter almost
does not need to even keep track of EIP...
As I understand it your hash function looks up instructions that you
have already decoded. That's quite sophisticated and I infer from that
that your decoding is expensive. Whether that's the case or not I
would think that it's likely to be fastest to decode by simple lookup
tables.
Do I understand this correctly?
The instructions are already decoded and when they
are executed a second decode (switch) is necessary.
Although you are replying to my post I presume your questions are for
BGB. I may have misunderstood him.
Quote:
This sounds very strange for me, but maybe I just
misunderstand the problem.
I have a question:
Why don't you decode to function pointers instead
of integers (which need to be decoded again)?
Function pointers would even be faster than a
lookup table.
Would they? I'm suggesting arrays (of function pointers). It's just
one pairable instruction to index into such arrays. You can't get
faster.
Quote:
The Seed7 interpreter (hi) works with function
pointers. That way no switch is necessary when
a program is executed.
Yes, a big switch will be slow except in the special case that the
data values are adjacent and the compiler converts it to a simple
indexing operation. Rather than hope the compiler does what one wants
it's better to express the function pointers in an array directly,
IMHO.
James |
|
|
| Back to top |
|
|
|
| BGB / cr88192... |
Posted: Mon Nov 02, 2009 8:53 pm |
|
|
|
Guest
|
"tm" <thomas.mertes at (no spam) gmx.at> wrote in message
news:6bc33382-90b5-49aa-a486-fb2992ae8496 at (no spam) a21g2000yqc.googlegroups.com...
Quote: On 1 Nov., 20:16, "BGB / cr88192" <cr88... at (no spam) hotmail.com> wrote:
"Pascal J. Bourguignon" <p... at (no spam) informatimago.com> wrote in
messagenews:87y6mqgot3.fsf at (no spam) galatea.local...
"BGB / cr88192" <cr88... at (no spam) hotmail.com> writes:
"Pascal J. Bourguignon" <p... at (no spam) informatimago.com> wrote in message
news:877huaihe4.fsf at (no spam) galatea.local...
"BGB / cr88192" <cr88... at (no spam) hotmail.com> writes:
note: I have plenty of past experience with both ASM and JIT, but at
the
moment am leaning against this, and in favor of using pure C-based
interpretation for now.
dunno if anyone knows some major way around this.
Modify the C compiler so that it generates optimized code for your
specific case (ie. hide your specific assembly in the C compiler).
I am using MSVC on x64 for this...
So unless you're rich enough to buy the sources of MSVC, forget it.
yep.
I can't even do stupid inline assembly, as MS removed this feature for
x64...
May I suggest you to switch to an open source compiler (such as gcc)?
Or just write your own compiler?
GCC on Win64 was very unreliable last I checked (occasionally generating
bad
code, ...), but it may have improved since then.
Did you tell them about the errors you found?
I was not sure who to tell about what, nor was I not sure of the nature or
the cause of the bug (not really knowing GCC's internals...). only, that it
was a bug with passing arguments (args got messed up in transit, from what I
remember due to incorrect register handling).
Quote: my compiler works, just I similarly don't trust it that well to produce
good
code, and using it for compiling much of anything "non-trivial" is likely
to
leave bugs all over the place which would be rather difficult to track
down
(I leave it mostly for compiling smaller code fragments at runtime, since
a
failure here is at least usually obvious and nearly not so difficult to
track down and fix...).
When you think that GCC and MSVC are not okay you should
definitely use your own C compiler. When there are bugs in it
create a test suite of C programs. That way you can reduce the
number of errors.
yeah, proper testing should be good...
I guess I have not done a whole lot of this.
as for MSVC, usually it is good enough (after all, it is good enough for
MS).
it is just not as ideal for areas of code which notably effect performance.
my compiler can produce often more efficient code (except in the case of
switch, errm... because I never got around to properly implementing
jump-tables, and so there is an "O(log2 n)" switch overhead...).
so, nevermind switch, I do address many optimizations which MSVC apparently
does not, which at the time I had found unimpressive (given MS has lots of
resources, ...).
Quote: You seem to write programs in many areas. But to get this
programs finished with good quality I can only suggest:
Eat your own dogfood.
I originally wrote it mostly for JIT / at-runtime compilation, and this is
mostly what I use it for (generating scripts at runtime, and compiling and
running these).
it was only recently that I had (ever) used it for a statically compiled
piece of code, but I forget for what reason I made it usable as a static
compiler.
actually, now I remember, it was the idea of using the Win64 calling
convention on Linux, just before I ended up figuring that I could probably
get more for less effort by instead using virtualized x86 as a
cross-platform IL (in between with a failed attempt at designing a bytecode,
which ended up as a hybrid of x86 and EFI bytecode...).
I guess if I were to use it as a static compiler, for building project
sources, I would have to fix up a few of its known deficiencies:
its lack of a properly working 'switch';
its lack of statically-initialized structs;
....
and, I guess, maybe writing proper test-cases. I have a few small pieces of
code which I had been using for tests, but they are far from
comprehensive...
Quote: Greetings Thomas Mertes
Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows. |
|
|
| Back to top |
|
|
|
| BGB / cr88192... |
Posted: Mon Nov 02, 2009 10:03 pm |
|
|
|
Guest
|
"James Harris" <james.harris.1 at (no spam) googlemail.com> wrote in message
news:475f3699-141a-46ea-ab84-dbbfb0bf09c7 at (no spam) d5g2000yqm.googlegroups.com...
On 2 Nov, 06:50, "BGB / cr88192" <cr88... at (no spam) hotmail.com> wrote:
Quote: "bartc" <ba... at (no spam) freeuk.com> wrote in message
....
I'll add my comments here as they are similar to bartc's.
Quote: I don't understand your hash technique. I would just have decoded an
opcode at a time (ie. 8-bits at a time), although I've long forgotten
the
intricacies of intel opcode decodes. If ebx points to the next
instruction
of 32-bit code:
movzx eax,byte [ebx]
inc ebx
jmp [eax*4 + mainopcode_jumptable]
When further decodes are required, a similar 3 lines but with a
different
table.
And after successfully dealing with an opcode, repeat those initial
three
lines (no need to return to a common loop point).
what is to say that the main interpreter switch is dealing with matters of
decoding?...
what is to say that this would actually be any faster than having it NOT
deal with matters of decoding?...
...
apart from needing it to calculate jump targets, the main interpreter
almost
does not need to even keep track of EIP...
<--
As I understand it your hash function looks up instructions that you
have already decoded. That's quite sophisticated and I infer from that
that your decoding is expensive. Whether that's the case or not I
would think that it's likely to be fastest to decode by simple lookup
tables.
....
-->
yeah, the decoder was fairly expensive, and when it was used for every
instruction, it was around 60% of the total running time.
it uses LUT's for optimizing the opcode lookup, but the "final steps" are
done by generic logic code, which makes use of strings-driven-logic to match
the exact pattern and process the bytes.
for many things, string-driven-logic is fast enough, but in some cases, it
is not the fastest possible route...
by loose analogy, consider if the machine code were being matched by
regex'es, ...
hence, I used the hash to escape having to do it for every op, since it was,
granted, sort of expensive.
Quote: the decoder is fairly directly based on a disassembler, and yeah, switches
on some of the opcode bytes are used to assist in speeding up the decode
logic.
however, I was able to mostly eliminate instruction decoding from the
runtime overhead via the use of the hash.
<--
OK but the hash (together with interpreting what the hash returns and
dealing with any duplicate hashes) is still costly and ISTM that a
straight table-based decode may be faster. Why not try the following.
-->
possibly, it could be faster, except that there are many "hidden" costs with
x86...
even if one could use a direct-lookup to match the opcode, they would still
need to decode the ModRM/SIB/disp mess, which is itself not likely to be
fast (in my case, in the decoder, it is handled via lots of bit-masking and
around 4-levels of switches, and was still a not-small portion of the total
running time of the decoder).
I am not sure if ModRM could "reasonably" be handled via pure LUT's (or, at
least, absent mechanically-generated code).
Quote:
also, there are reasons I am using C here...
what is to say an interpreter for x86 will necessarily run ON x86?...
<--
Good point. I wrote my suggestion in assembler before I'd seen your
comment. Trying to write in assembler can be a good idea anyway to
confirm that code *should* compile to be fast. It should be easy
enough to cast to C.
-->
C is also good for prototyping and keeping the design more flexible...
even though it is an interpreter, and subject somewhat to performance
concerns, I still generally try to retain modularity and internal integrity.
it is not perfect, for example, my memory manager is tangled with the
internals of the address-translation code (rather than, say, a separate
component making use of address-space alloc/free functions and implementing
the MM similar to how it would do so with a conventional OS), ...
the DLL loader, however, does use the external interface, which was modeled
mostly after the Win32 API's VirtualAlloc/VirtualFree interface.
the DLL loader then exports something which resembles the LoadLibrary /
GetProcAddress / ... API.
elsewhere, a function is created which resembles 'CreateProcess', only that
I call it 'SpawnProcess' and it takes different args.
....
but, it could be worse.
<--
The main code is just the following. It is untested. Apart from the
comments it is only five instructions - all of which are pairable.
Before starting the code you would need to load esi with the first
instruction address (to set esi as the initial simulated eip) and
point edi at the initial opcode table.
;----- Code start -----
;Registers used
; eax to edx - workspace; on dispatch al holds the fetched opcode
byte
; ebp - prefix flags to indicate rep, lock, segment and size
overrides
; esi - copy of the simulated instruction pointer
; edi - points to the relevant 256-entry opcode table
next_instruction:
;Clear all prefix instruction indications
xor ebp, ebp
next_instruction_byte:
;Fetch the next byte of the instruction and increment instruction
pointer
movzx al, [esi]
inc esi
;Execute the corresponding routine
jmp [edi + 4 * eax]
;----- Code end -----
That's it. Once that's in place the service routines would vary
according to what was needed. So I'm not leaving the hard part undone
ere are some samples. First, dealing with a prefix byte, all that's
needed is to set the relevant bit in the prefix-flags register.
;Code for 0x66 - operand size prefix
or ebp, PREFIX_FLAGS_OPSIZ
jmp next_instruction_byte
Second, code for nop. This jumps back to a different point so that the
prefix flags are cleared. Easy to do in assembler, of course, but
maybe some nested loops would just about be able to deal with this in
C without using goto.
;Code for 0x90 - nop
jmp next_instruction
Third, if the first byte is an "escape" code such as 0x0f the
instruction table needs to be changed. Subsequent code executed would
be pointed at by the new table.
;Code for 0x0f - instruction extension
lea edi, [opcode_following_0f_table]
jmp next_instruction_byte
Finally, here's some code to show that entry points can be combined
together to save code space. The intention is to push whatever
register is addressed by three bits (in this case the last three) of
an instruction byte. A small subtlety here is that the 80386 and later
push the value of esp as it was before the push. This routine does
that too.
;Code for initial opcodes 0x50 through 0x57 - push reg
;Identify and fetch the simulated register
and eax, 7
mov eax, [simulated_general_regs + 4 * eax]
;Push on to the simulated stack
;(only 32-bit shown; 16-bit needs to be added)
mov ebx, [simulated_esp]
sub ebx, 4
mov [simulated_esp], ebx
mov [ebx], eax
;And we are done
jmp next_instruction
Naturally, any instructions after an escape would reset to using the
initial opcode table before starting the next instruction with such as
lea edi, [initial_opcode_table]
-->
naturally enough, this could be done, but would essentially be a very
different beast than my current interpreter...
technically, the current interpreter is MUCH higher level in terms of its
internals, thinking mostly in terms of abstract register handles which are
get/set via special functions.
example:
rm=BGBV86_ResolveRMAddr(ctx, op);
switch(op->opnum)
{
....
case BGBV86_OP_ADD:
aq=BGBV86_GetRegGeneric(ctx, op->reg);
bq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
aq+=bq;
BGBV86_SetRegGeneric(ctx, op->reg, aq);
BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
break;
case BGBV86_OP_AND:
aq=BGBV86_GetRegGeneric(ctx, op->reg);
bq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
aq&=bq;
BGBV86_SetRegGeneric(ctx, op->reg, aq);
BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
break;
....
case BGBV86_OP_LEA:
BGBV86_SetRegGeneric(ctx, op->reg, rm);
break;
case BGBV86_OP_MOV:
aq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
BGBV86_SetRegGeneric(ctx, op->reg, aq);
break;
....
(oddly, LEA could maybe be optimized, as the profiler seems to say that this
is apparently one of the most heavily used instructions in the tests I was
running...).
then again, GetReg* and SetReg* are filled with an evil: large switches...
or, from a different function (RMReg):
case BGBV86_OP_SHLD:
aq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
bq=BGBV86_GetRegGeneric(ctx, op->reg);
i=BGBV86_GetRegGeneric(ctx, BGBV86_REG_CL);
aq=(aq<<i)|(bq>>((op->width)-i));
BGBV86_ImageWriteUGeneric(ctx, rm, aq, op->width);
BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
break;
case BGBV86_OP_SHRD:
aq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
bq=BGBV86_GetRegGeneric(ctx, op->reg);
i=BGBV86_GetRegGeneric(ctx, BGBV86_REG_CL);
aq=(aq>>i)|(bq<<((op->width)-i));
BGBV86_ImageWriteUGeneric(ctx, rm, aq, op->width);
BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
break;
....
from Imm:
case BGBV86_OP_INT:
BGBV86_CheckRaiseInt(ctx, op->imm);
break;
case BGBV86_OP_INTXT:
BGBV86_CheckRaiseInt(ctx, op->imm);
break;
case BGBV86_OP_JA:
BGBV86_JumpAddrCC(ctx, ctx->eip + op->imm, BGBV86_COND_A);
break;
case BGBV86_OP_JAE:
BGBV86_JumpAddrCC(ctx, ctx->eip + op->imm, BGBV86_COND_AE);
break;
case BGBV86_OP_JB:
BGBV86_JumpAddrCC(ctx, ctx->eip + op->imm, BGBV86_COND_B);
break;
....
case BGBV86_OP_LOOP:
aq=BGBV86_GetRegGeneric(ctx, BGBV86_REG_ECX);
aq--;
BGBV86_SetRegGeneric(ctx, BGBV86_REG_ECX, aq);
if(aq)BGBV86_JumpAddr(ctx, ctx->eip + op->imm);
break;
....
Quote: as for doing interpreter logic partly in ASM, I had partly considered this
as well, but for now am sticking with C for other reasons (mostly that, at
the moment, ASM would add more complexities than are worthwhile...).
<--
Maybe you were trying to do complex stuff in ASM. The above assembler
is blindingly simple. All it is doing is looking up what to do next
from a table with 256 entries.
Come back ASM, all is forgiven. :-)
....
-->
granted, however, by "complexities", I mean, overall architectural ones...
similarly, it does not much help that, at the moment, I have several
different arch's I am building for, and so would have to write ASM for all
of my targets...
x86 ASM, and x64 ASM for both Windows and Linux, which use different calling
conventions.
as well as possibly tangled level-of-abstraction issues, ...
in these cases, for ASM-based micro-optimization, I usually "confine" it,
and then dynamically produce the ASM via procedural generation.
however, this raises other architectural issues (currently, my interpreter
is not in the direct usability-scope of my assembler, which will probably
mean the "API sharing via structs of function pointers" trick...).
actually, in one place I am already needing this, but this would wrap a
region of code already exporting an interface to the assembler, so I could
include this in said struct as well.
Quote: What slowdown are you getting compared with a real machine?
currently, by simplistic tests, around 170x at present...
so, it is not great, but it could be worse...
<--
You've maybe already explained why but if you want to emulate the
80386 why not use an existing emulator? Is it something to do with
wanting to restrict to a subset of its instructions?
-->
well, I had looked at QEMU some, but opted against it mostly for
code-related reasons (it did not appear to be all that modular, and I
couldn't really get it to build on Windows...).
more so, apparently officially QEMU currently does not work directly on
Win64 (if built for 64-bits), ...
another issue is that it is only doing partial emulation, rather than full
emulation (it is, technically, much closer to an interpreter than an
emulator in terms of purpose and general operation).
but, alas, it is in a some vague area between the 2 technologies...
so, alas, I just beat together my own... |
|
|
| Back to top |
|
|
|
| BGB / cr88192... |
Posted: Mon Nov 02, 2009 10:20 pm |
|
|
|
Guest
|
"tm" <thomas.mertes at (no spam) gmx.at> wrote in message
news:11b12e14-d4ce-4f2d-afaa-028143e08b98 at (no spam) l2g2000yqd.googlegroups.com...
On 2 Nov., 14:26, James Harris <james.harri... at (no spam) googlemail.com> wrote:
Quote: On 2 Nov, 06:50, "BGB / cr88192" <cr88... at (no spam) hotmail.com> wrote:
"bartc" <ba... at (no spam) freeuk.com> wrote in message
...
I'll add my comments here as they are similar to bartc's.
I don't understand your hash technique. I would just have decoded an
opcode at a time (ie. 8-bits at a time), although I've long forgotten
the
intricacies of intel opcode decodes. If ebx points to the next
instruction
of 32-bit code:
movzx eax,byte [ebx]
inc ebx
jmp [eax*4 + mainopcode_jumptable]
When further decodes are required, a similar 3 lines but with a
different
table.
And after successfully dealing with an opcode, repeat those initial
three
lines (no need to return to a common loop point).
what is to say that the main interpreter switch is dealing with matters
of
decoding?...
what is to say that this would actually be any faster than having it NOT
deal with matters of decoding?...
...
apart from needing it to calculate jump targets, the main interpreter
almost
does not need to even keep track of EIP...
As I understand it your hash function looks up instructions that you
have already decoded. That's quite sophisticated and I infer from that
that your decoding is expensive. Whether that's the case or not I
would think that it's likely to be fastest to decode by simple lookup
tables.
<--
Do I understand this correctly?
The instructions are already decoded and when they
are executed a second decode (switch) is necessary.
This sounds very strange for me, but maybe I just
misunderstand the problem.
-->
the instructions are decoded into an unpacked form (structs), which
basically decode the serialized form of the instruction.
the switch is used mostly to go from this opcode structure into the actual
logic for "doing" whatever the instruction indicates.
so, the switch is used for instruction dispatch, not for decoding.
<--
I have a question:
Why don't you decode to function pointers instead
of integers (which need to be decoded again)?
Function pointers would even be faster than a
lookup table.
-->
this would require redesigning the interpreter.
it would also be very limiting and almost completely destroy modularity
between the decoder and instruction dispatcher (since each would need to be
concerned with what goes on in the other), and would interfere with
structural re-use (for example, using the same decoder, but using it for
different tasks).
integers allow keeping a higher level of abstraction...
for example, I could copy my current opcode decoder back into my assembler
library to provide an "improved" disassembler, or maybe reuse it for other
contexts where I might want to decode x86, ...
it is much like how I grabbed a lot of components from my assembler to build
an interpreter, only that I had to tweak some things and bolt on the big
mass of logic for actually "doing" stuff, ...
<--
The Seed7 interpreter (hi) works with function
pointers. That way no switch is necessary when
a program is executed.
-->
it is an interesting idea, but I don't yet know of an ideal way to use this
idea...
it could be possible, but the logic would likely be located in the
interpreter, rather than the decoder, and would likely be done as a sort of
cache step.
however, it could easily expand to a huge level of complexity for the
individual leaf operations, since each would need to be its own
self-contained function and likely fully specialized (rather than the
current strategy which involves a sort of layered specialization approach,
where different things are calculated along the various steps in the
process...).
.... |
|
|
| Back to top |
|
|
|
|
|
All times are GMT
The time now is Wed Dec 09, 2009 11:17 pm
|
|