
the 'switch' limit......

Author Message
BGB / cr88192...
Posted: Mon Nov 02, 2009 10:54 pm
Guest
"James Harris" <james.harris.1 at (no spam) googlemail.com> wrote in message
news:76f128e3-5d9e-4905-bcad-405077e22e1e at (no spam) n35g2000yqm.googlegroups.com...
On 2 Nov, 16:00, tm <thomas.mer... at (no spam) gmx.at> wrote:
Quote:
On 2 Nov., 14:26, James Harris <james.harri... at (no spam) googlemail.com> wrote:



<snip>

Quote:

The Seed7 interpreter (hi) works with function
pointers. That way no switch is necessary when
a program is executed.

<--
Yes, a big switch will be slow except in the special case that the
data values are adjacent and the compiler converts it to a simple
indexing operation. Rather than hope the compiler does what one wants
it's better to express the function pointers in an array directly,
IMHO.
-->

a big downside with an array though is that an array is very sensitive to
positioning and ordering.
this would be difficult here since much of my handling is done with
tool-assigned symbolic constants...

granted, my tool does generally assign values sequentially, but there is
little to say that these values are not subject to change.
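
One way to make an array-based dispatch table less sensitive to ordering is to use C99 designated initializers, so that each slot is named by its symbolic constant rather than by its position. The following is only a minimal sketch: the opcode names, their values, and the handlers are hypothetical stand-ins, not the tool-assigned constants discussed in this thread.

```c
#include <stddef.h>

/* Hypothetical opcode constants; a tool could renumber these freely. */
enum { OP_NOP = 0, OP_SUB = 3, OP_ADD = 7, OP_COUNT = 8 };

typedef int (*op_handler)(int a, int b);

static int do_nop(int a, int b) { (void)b; return a; }
static int do_add(int a, int b) { return a + b; }
static int do_sub(int a, int b) { return a - b; }

/* Each slot is named by its constant, so source order does not matter.
   Slots that are not listed are initialized to NULL. */
static const op_handler op_table[OP_COUNT] = {
    [OP_ADD] = do_add,
    [OP_NOP] = do_nop,
    [OP_SUB] = do_sub,
};

/* Dispatch with a bounds check and a fallback for unassigned slots.
   *ok is set to 1 when a handler ran, 0 otherwise. */
int dispatch(int op, int a, int b, int *ok)
{
    if (op < 0 || op >= OP_COUNT || op_table[op] == NULL) {
        *ok = 0;
        return 0;
    }
    *ok = 1;
    return op_table[op](a, b);
}
```

If the constants are renumbered, the table follows them on the next recompile; the remaining cost is that the table must be sized for the largest constant.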

now, the downside of MSVC and switches is that MSVC apparently does not do
them well...

apparently, MSVC does it sort of like this:

calc reg=index //depends on input expr
mov [esp+X], reg
mov reg, [esp+X]
sub reg, base
mov [esp+X], reg
cmp [esp+X], (limit-base)
jnbe default //yet to investigate how this one works
mov reg, [esp+X]
lea reg3, [Y]
movzx reg, [reg3+reg+Z]
mov reg, [reg3+reg*SZPTR+W]
add reg, reg3
jmp reg

or, essentially, it works, but I can imagine less terrible ways to do jump
tables...
X/Y/Z/W are apparently switch-specific magic constants...
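
Read literally, the listing above looks like a two-level table: the movzx fetches a one-byte index from a compact map, and that index then selects an entry in a smaller table of targets. The jnbe is an unsigned "above" test, so a single compare after subtracting the base rejects values on both sides of the range. The sketch below shows that shape in C under those assumptions; the case range, the table contents, and the function names are hypothetical, and it is not a claim about what MSVC actually emits.

```c
#include <stdint.h>

enum { CASE_BASE = 10, CASE_LIMIT = 17 };   /* hypothetical case range 10..17 */

typedef int (*target_fn)(void);

static int on_default(void)   { return -1; }
static int on_ten(void)       { return 10; }
static int on_twelve(void)    { return 12; }
static int on_seventeen(void) { return 17; }

/* Second level: one entry per distinct jump target. */
static const target_fn targets[] = {
    on_default, on_ten, on_twelve, on_seventeen
};

/* First level: one byte per value in the range, naming a target slot.
   Holes in the range map to slot 0, the default target. */
static const uint8_t index_map[CASE_LIMIT - CASE_BASE + 1] = {
    1, 0, 2, 0, 0, 0, 0, 3
};

int two_level_switch(int value)
{
    unsigned idx = (unsigned)value - (unsigned)CASE_BASE;   /* sub reg, base */
    if (idx > (unsigned)(CASE_LIMIT - CASE_BASE))           /* cmp / jnbe default */
        return on_default();
    return targets[index_map[idx]]();   /* byte lookup, then target lookup */
}
```

The byte map keeps a sparse range cheap, since each hole costs one byte instead of one pointer.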


and, for unrelated reasons, I am thinking I may need to write some sort of
IDL-based tool:
IDL -> export interface; IDL -> header; IDL -> import interface.

this is mostly because non-local code gluing is getting problematic, and
more automated means may be preferable to manually writing lots of headers.

granted, via a naive approach, both the importer and exporter IDL's would
need to be exactly the same for the thing to work.

I may need to investigate...
 
bartc...
Posted: Mon Nov 02, 2009 11:08 pm
Guest
"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:bDzHm.910$Ym4.134 at (no spam) text.news.virginmedia.com...
Quote:

"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hclvg3$ubg$1 at (no spam) news.albasani.net...

"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:xxnHm.748$Ym4.334 at (no spam) text.news.virginmedia.com...
What slowdown are you getting compared with a real machine?

currently, by simplistic tests, around 170x at present...

I put together a very quick test for the following code:

mov ebx,100000000
l1:
sub ebx,1
jnz l1
hlt

Emulating only the last 3 instructions, the emulation was 30x slower than
the real code (the first 3 instructions), using a mix of hll and
assembler.
Could be a bit tighter with dedicated registers, but I'm a bit short of
time.

The 'tighter' asm code yielded about a 15x slowdown compared with executing
on the actual processor.

I also tried pure C code and was surprised it managed to achieve only a 30x
slowdown (this test code is below).

Now, this instruction (sub reg,imm) isn't the most complex (only one operand
and no complex addressing), only the zflag is set, and it ignores prefixes
(ie. it assumes a 32-bit op). Even so, there is plenty of margin here for
adding more stuff, although it's not clear how well this approach will
scale.

You say you're getting 170x and that's with already partly decoded
instructions... So I think this approach could be worth looking at again.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef void (*fnptr)(void);
typedef unsigned char byte;

int registers[8];
byte *pcptr;
byte zflag;
byte stopped;

fnptr opcodetable[256];
fnptr arithtable[256];

/* 0x81: arithmetic group; dispatch on the top five bits of the modrm byte */
void arith81(void) {
    arithtable[(*(pcptr+1))>>3]();
}

/* 0x75: jnz rel8 */
void jumpnz75(void) {
    if (!zflag)
        pcptr += *((signed char*)(pcptr+1))+2;
    else
        pcptr += 2;
}

/* 0xf4: hlt */
void haltf4(void) {
    stopped=1;
}

/* sub reg,imm32; only the zero flag is updated */
void subregimm(void) {
    int reg = *(pcptr+1) & 7;
    int imm;
    memcpy(&imm, pcptr+2, sizeof imm);   /* the immediate may be unaligned */
    registers[reg] -= imm;
    zflag = registers[reg]==0;
    pcptr += 6;
}

int main(void)
{
    /* l1: sub ebx,1; jnz l1; hlt */
    byte testcode[] = {0x81, 0xeb, 0x01, 0x00,0x00,0x00, 0x75, 0xf8, 0xf4};

    opcodetable[0x81]=&arith81;
    opcodetable[0x75]=&jumpnz75;
    opcodetable[0xf4]=&haltf4;

    arithtable[0x1d]=&subregimm;

    pcptr = testcode;
    registers[3] = 100000000;

    printf("EBX register at start: %d\n",registers[3]);

    stopped=0;

    do {
        opcodetable[*pcptr]();
    } while (!stopped);

    printf("EBX register at end: %d\n",registers[3]);

    return 0;
}

--
Bartc
 
bartc...
Posted: Mon Nov 02, 2009 11:49 pm
Guest
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcn3d1$mpo$1 at (no spam) news.albasani.net...
Quote:

"James Harris" <james.harris.1 at (no spam) googlemail.com> wrote in message
news:475f3699-141a-46ea-ab84-dbbfb0bf09c7 at (no spam) d5g2000yqm.googlegroups.com...
On 2 Nov, 06:50, "BGB / cr88192" <cr88... at (no spam) hotmail.com> wrote:
"bartc" <ba... at (no spam) freeuk.com> wrote in message

possibly, it could be faster, except that there are many "hidden" costs
with x86...
even if one could use a direct-lookup to match the opcode, they would
still need to decode the ModRM/SIB/disp mess, which is itself not likely
to be

I seem to remember that some modrm value indicates that SIB follows? SIB
then just needs a function that uses logic to obtain the effective address, or
perhaps uses a 256-way function table.

Example: add [mem],reg

This can be quickly decoded down to the level where it knows it's adding a
register to memory. It even knows if it's 8 bits, or 16/32 bits (this latter
needs an extra check). Since there's a memory address involved, a function
call will sort out the modrm/sib/disp stuff (and maybe even step the program
counter).

Now you have the address A, and a register code R: *A += registers[R], and
you're done.
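
For reference, the modrm byte splits into three fields, and in 32-bit addressing a SIB byte follows when mod is not 3 and rm is 4. The sketch below shows only that field split and the SIB test; the struct and function names are made up for illustration, and 16-bit addressing and prefixes are ignored.

```c
#include <stdint.h>

typedef struct {
    unsigned mod;   /* bits 7..6: addressing mode */
    unsigned reg;   /* bits 5..3: register or opcode extension */
    unsigned rm;    /* bits 2..0: register or memory operand */
} modrm_fields;

modrm_fields split_modrm(uint8_t b)
{
    modrm_fields f;
    f.mod = b >> 6;
    f.reg = (b >> 3) & 7;
    f.rm  = b & 7;
    return f;
}

/* 32-bit addressing: a SIB byte follows when mod != 3 and rm == 4. */
int has_sib32(uint8_t b)
{
    modrm_fields f = split_modrm(b);
    return f.mod != 3 && f.rm == 4;
}

/* Displacement size in bytes implied by the modrm byte alone, for
   32-bit addressing. The extra SIB case (mod == 0 with SIB base == 5,
   which also carries a disp32) is deliberately not handled here. */
int disp_size32(uint8_t b)
{
    modrm_fields f = split_modrm(b);
    switch (f.mod) {
    case 0:  return f.rm == 5 ? 4 : 0;   /* disp32 with no base register */
    case 1:  return 1;                   /* disp8 */
    case 2:  return 4;                   /* disp32 */
    default: return 0;                   /* mod == 3: register operand */
    }
}
```

For the 0xeb byte in the test program, this gives mod 3, reg 5 (the sub extension of opcode 0x81) and rm 3 (ebx).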

OK, it's not quite that simple, but I'm sure it's possible to do all this
with simpler code than you've shown below (I don't know if the following
code is prehash or posthash).

Quote:
rm=BGBV86_ResolveRMAddr(ctx, op);

switch(op->opnum)
{
...
case BGBV86_OP_ADD:
aq=BGBV86_GetRegGeneric(ctx, op->reg);
bq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
aq+=bq;
BGBV86_SetRegGeneric(ctx, op->reg, aq);
BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
break;

case BGBV86_OP_JB:
BGBV86_JumpAddrCC(ctx, ctx->eip + op->imm, BGBV86_COND_B);
break;
...
case BGBV86_OP_LOOP:
aq=BGBV86_GetRegGeneric(ctx, BGBV86_REG_ECX);
aq--;
BGBV86_SetRegGeneric(ctx, BGBV86_REG_ECX, aq);
if(aq)BGBV86_JumpAddr(ctx, ctx->eip + op->imm);
break;

--
Bartc
 
Rod Pemberton...
Posted: Tue Nov 03, 2009 2:37 am
Guest
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hclak6$4ej$1 at (no spam) news.albasani.net...
Quote:

at which point the main switch became the bottleneck...

How so? Too many case values? Too widely dispersed case values which
prevent optimization? What?


Rod Pemberton
 
Rod Pemberton...
Posted: Tue Nov 03, 2009 2:37 am
Guest
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hckpa8$aj2$1 at (no spam) news.albasani.net...
Quote:

but, the hash is not used for opcode lookup/decoding, rather it is used
for grabbing already decoded instructions from a cache (which is
based on memory address).


Your hash generates, what, 64k of possible hash values or memory locations?
What if you reduce the size to 4k? 4k/sizeof(void *)? Will this allow the
compiler to simplify the generated assembly?

The randomness doesn't have to come from multiplication. It can come from
other sources such as a lookup array of randomized data.


Rod Pemberton
 
Rod Pemberton...
Posted: Tue Nov 03, 2009 2:38 am
Guest
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcjarj$vro$1 at (no spam) news.albasani.net...
Quote:

hash EIP (currently: "((EIP*65521)>>16)&65535");


The shift truncates the value to the mask size. I.e., &65535 is not needed.
Yes?


Rod Pemberton
 
Rod Pemberton...
Posted: Tue Nov 03, 2009 2:38 am
Guest
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcke2u$n5l$1 at (no spam) news.albasani.net...
Quote:

according to the profiler, another major source of time use is:
"rip=ctx->sreg_base[0]+ctx->eip;"


What happens if you eliminate struct ctx? I.e., make both sreg_base and eip
separate variables. What happens with file scope? ... with local scope?


Rod Pemberton
 
BGB / cr88192...
Posted: Tue Nov 03, 2009 4:34 am
Guest
"Rod Pemberton" <do_not_have at (no spam) nohavenot.cmm> wrote in message
news:hcnjkn$7bt$1 at (no spam) aioe.org...
Quote:
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcke2u$n5l$1 at (no spam) news.albasani.net...

according to the profiler, another major source of time use is:
"rip=ctx->sreg_base[0]+ctx->eip;"


What happens if you eliminate struct ctx? I.e., make both sreg_base and
eip
separate variables. What happens with file scope? ... with local scope?


what happens?...

well, then, I could no longer do multi-threading...

ctx essentially represents the current simulated thread context, and there
may be 1 or more (OS-level) worker threads essentially serving as virtual
processors.

I am not willing to make a design change which would essentially prohibit
multi-threaded operation...


Quote:

Rod Pemberton



 
BGB / cr88192...
Posted: Tue Nov 03, 2009 4:54 am
Guest
"Rod Pemberton" <do_not_have at (no spam) nohavenot.cmm> wrote in message
news:hcnjhr$72k$1 at (no spam) aioe.org...
Quote:
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hclak6$4ej$1 at (no spam) news.albasani.net...

at which point the main switch became the bottleneck...

How so? Too many case values? Too widely dispersed case values which
prevent optimization? What?


it jumped near the top of the list in the profiler...

basically, this is because this is a more or less central location through
nearly all control flows (before branching off into all of the deeper
internals of the interpreter).


this is because, one may eliminate slowdowns in one place, and the app gets
overall faster, but in terms of the profiler, the load has shifted somewhere
else, and one could then normally optimize this location for yet further
gains.


sometimes though, it will shift to code which can't be optimized, and in an
interpreter, when the top load shifts to the main switch statement (AKA: the
central part driving operation of the interpreter), it is my observation
that one is often rapidly approaching the limit of how far the interpreter
can be optimized.

from the POV of further optimization, it is usually better if the running
time is mostly in leaf functions, since this case is usually much easier to
optimize (for example, by optimizing the caller such that they are called
less often).

from the POV of the main loop, there is only a single complexity: O(n).


Quote:

Rod Pemberton




 
BGB / cr88192...
Posted: Tue Nov 03, 2009 5:03 am
Guest
"Rod Pemberton" <do_not_have at (no spam) nohavenot.cmm> wrote in message
news:hcnjj4$755$1 at (no spam) aioe.org...
Quote:
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hckpa8$aj2$1 at (no spam) news.albasani.net...

but, the hash is not used for opcode lookup/decoding, rather it is used
for grabbing already decoded instructions from a cache (which is
based on memory address).


Your hash generates, what, 64k of possible hash values or memory
locations?
What if you reduce the size to 4k? 4k/sizeof(void *)? Will this allow the
compiler to simplify the generated assembly?

The randomness doesn't have to come from multiplication. It can come from
other sources such as a lookup array of randomized data.


I tried both ways, but it does not seem to make much difference between 4k
and 64k for the hash.
I used 64k figuring it would scale better, but then thinking of it, a full
hash might end up using an unreasonably large amount of memory (4MB? 16MB?
more?...), whereas a 4k hash is self-limiting I guess (maybe ~1MB, assuming
around 256 bytes per decode-op...).

actually, I checked, with the current size of the DecodeOp structure, the
current memory limit is around 2MB with a 64k hash, and would be around
128kB with a 4k hash...


note that, generally, a multiplication is cheaper than an array lookup,
however an array lookup is generally cheaper than a division or modulo...
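
The memory figures above work out as slots times entry size: 2MB over 64k slots implies about 32 bytes per entry, and 4k slots at 32 bytes gives 128kB. Below is a minimal sketch of a direct-mapped cache keyed by address that reproduces that arithmetic. The 32-byte entry layout is inferred from those figures, not taken from the real DecodeOp structure, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 4096u   /* must be a power of two for the mask below */

/* Hypothetical 32-byte stand-in for a decoded-op record. */
typedef struct {
    uint64_t addr;          /* address this entry was decoded from */
    uint8_t  valid;         /* nonzero once the slot has been filled */
    uint8_t  payload[23];   /* placeholder for the decoded fields */
} decode_entry;

static decode_entry cache[CACHE_SLOTS];

/* Multiplicative hash as quoted in the thread, reduced to the slot count. */
static unsigned slot_for(uint64_t addr)
{
    return (unsigned)(((addr * 65521u) >> 16) & (CACHE_SLOTS - 1));
}

/* Returns the cached entry for addr, or NULL on a miss. */
decode_entry *cache_lookup(uint64_t addr)
{
    decode_entry *e = &cache[slot_for(addr)];
    return (e->valid && e->addr == addr) ? e : NULL;
}

/* Claims the slot for addr, evicting whatever was there before. */
decode_entry *cache_fill(uint64_t addr)
{
    decode_entry *e = &cache[slot_for(addr)];
    e->addr = addr;
    e->valid = 1;
    return e;
}

/* Worst-case footprint for a given number of slots. */
size_t cache_bytes(size_t slots)
{
    return slots * sizeof(decode_entry);
}
```

Because each slot holds exactly one entry, the slot count is also the hard cap on memory, which is the self-limiting behavior described above.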


Quote:

Rod Pemberton



 
BGB / cr88192...
Posted: Tue Nov 03, 2009 5:15 am
Guest
"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:_eFHm.1021$Ym4.889 at (no spam) text.news.virginmedia.com...
Quote:

"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:bDzHm.910$Ym4.134 at (no spam) text.news.virginmedia.com...

"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hclvg3$ubg$1 at (no spam) news.albasani.net...

"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:xxnHm.748$Ym4.334 at (no spam) text.news.virginmedia.com...
What slowdown are you getting compared with a real machine?

currently, by simplistic tests, around 170x at present...

I put together a very quick test for the following code:

mov ebx,100000000
l1:
sub ebx,1
jnz l1
hlt

Emulating only the last 3 instructions, the emulation was 30x slower than
the real code (the first 3 instructions), using a mix of hll and
assembler.
Could be a bit tighter with dedicated registers, but I'm a bit short of
time.

The 'tighter' asm code yielded about a 15x slowdown compared with
executing on the actual processor.

I also tried pure C code and was surprised it managed to achieve only a
30x slowdown (this test code is below).


in a few rare cases, I have managed to get 15x with an interpreter.
it depends, however, highly on the particular type of code being run.


Quote:
Now, this instruction (sub reg,imm) isn't the most complex (only one
operand and no complex addressing), only the zflag is set, and it ignores
prefixes (ie. it assumes a 32-bit op). Even so, there is plenty of margin
here for adding more stuff, although it's not clear how well this approach
will scale.

You say you're getting 170x and that's with already partly decoded
instructions... So I think this approach could be worth looking at again.


there are many reasons for such a level of slowdown...

one of the major ones may well be the level of abstraction I am using for a
lot of the code.
in general, it is hardly a design aimed at "max speed", since in general I
put "making it work" and "having clean code" above "having the maximum raw
speed"...


Quote:
#include <stdio.h>
#include <stdlib.h>

typedef void (*fnptr)(void);
typedef unsigned char byte;

int registers[8];
byte *pcptr;
byte zflag;
byte stopped;

fnptr opcodetable[256];
fnptr arithtable[256];

void arith81(void) {
arithtable[(*(pcptr+1))>>3]();
}

void jumpnz75(void) {
if (!zflag)
pcptr += *((signed char*)(pcptr+1))+2;
else
pcptr += 2;
}

void haltf4(void) {
stopped=1;
}

void subregimm(void) {
int reg = *(pcptr+1) & 7;
registers[reg] -= *( (int*)(pcptr+2));
zflag = registers[reg]==0;
pcptr += 6;
}

int main(void)
{
/* l1: sub ebx,1; jnz l1; hlt */
byte testcode[] = {0x81, 0xeb, 0x01, 0x00,0x00,0x00, 0x75, 0xf8, 0xf4};

opcodetable[0x81]=&arith81;
opcodetable[0x75]=&jumpnz75;
opcodetable[0xf4]=&haltf4;

arithtable[0x1d]=&subregimm;

pcptr = testcode;
registers[3] = 100000000;

printf("EBX register at start: %d\n",registers[3]);

stopped=0;

do {
opcodetable[*pcptr]();
} while (!stopped);

printf("EBX register at end: %d\n",registers[3]);

}


at which cost?...

as can be seen, this does not even address a tiny fraction of the issues
which exist in the x86 ISA (for example, how do you intend to handle issues
like "Virtual Addressing" and segmentation, maybe page-table translation,
.... ?).

what even about eflags?...

maybe keep these sorts of issues in mind.

as well, consider that the code above looks, well, terrible...


Quote:
--
Bartc
 
BGB / cr88192...
Posted: Tue Nov 03, 2009 5:23 am
Guest
"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:wQFHm.1039$Ym4.360 at (no spam) text.news.virginmedia.com...
Quote:

"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcn3d1$mpo$1 at (no spam) news.albasani.net...

"James Harris" <james.harris.1 at (no spam) googlemail.com> wrote in message
news:475f3699-141a-46ea-ab84-dbbfb0bf09c7 at (no spam) d5g2000yqm.googlegroups.com...
On 2 Nov, 06:50, "BGB / cr88192" <cr88... at (no spam) hotmail.com> wrote:
"bartc" <ba... at (no spam) freeuk.com> wrote in message

possibly, it could be faster, except that there are many "hidden" costs
with x86...
even if one could use a direct-lookup to match the opcode, they would
still need to decode the ModRM/SIB/disp mess, which is itself not likely
to be

I seem to remember that some modrm value indicates that SIB follows? SIB
then just needs a function that uses logic to obtain the effective address,
or perhaps uses a 256-way function table.

Example: add [mem],reg

This can be quickly decoded down to the level where it knows it's adding a
register to memory. It even knows if it's 8 bits, or 16/32 bits (this
latter needs an extra check). Since there's a memory address involved, a
function call will sort out the modrm/sib/disp stuff (and maybe even step
the program counter).

Now you have the address A, and a register code R: *A += registers[R], and
you're done.

OK, it's not quite that simple, but I'm sure it's possible to do all this
with simpler code than you've shown below (I don't know if the following
code is prehash or posthash).


the decode does decode...

the main problem with trying to do ModRM and SIB decoding via tables would
be that one would have to provide an entry for every spot in the table.

the use of generic logic code may be an acceptable tradeoff, FWIW...


Quote:
rm=BGBV86_ResolveRMAddr(ctx, op);

switch(op->opnum)
{
...
case BGBV86_OP_ADD:
aq=BGBV86_GetRegGeneric(ctx, op->reg);
bq=BGBV86_ImageReadUGeneric(ctx, rm, op->width);
aq+=bq;
BGBV86_SetRegGeneric(ctx, op->reg, aq);
BGBV86_AdjustArithFlagsUGeneric(ctx, aq, op->width);
break;

case BGBV86_OP_JB:
BGBV86_JumpAddrCC(ctx, ctx->eip + op->imm, BGBV86_COND_B);
break;
...
case BGBV86_OP_LOOP:
aq=BGBV86_GetRegGeneric(ctx, BGBV86_REG_ECX);
aq--;
BGBV86_SetRegGeneric(ctx, BGBV86_REG_ECX, aq);
if(aq)BGBV86_JumpAddr(ctx, ctx->eip + op->imm);
break;


the code below is well after the hash step...


this is some of the code from the main interpreter logic.
its point is mostly to point out that both opcodes and registers are
generally handled symbolically, and at a fairly high level of abstraction vs
the raw machine code.

but, admittedly, it is far from being free performance-wise...

personally, I pursue performance, but not at the cost of clean coding
practices...


Quote:
--
Bartc
 
BGB / cr88192...
Posted: Tue Nov 03, 2009 5:29 am
Guest
"Rod Pemberton" <do_not_have at (no spam) nohavenot.cmm> wrote in message
news:hcnjjs$7af$1 at (no spam) aioe.org...
Quote:
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcjarj$vro$1 at (no spam) news.albasani.net...

hash EIP (currently: "((EIP*65521)>>16)&65535");


The shift truncates the value to the mask size. I.e., &65535 is not
needed.
Yes?


errm, this would be if I were doing these calculations with 32-bit unsigned
arithmetic...

actually, most of my addressing calculations are being done with 64-bit
signed arithmetic mostly since the interpreter may also handle simulated
long-mode, and because 64-bit arithmetic is far less prone to issues related
to overflow behavior...

(ctx->eip is usually 32-bit EIP, but may also be a 64-bit RIP, FWIW...).


granted, if I were to compile my code on a 32-bit system, this would likely
hurt performance fairly severely...
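
The difference between the two cases can be checked directly. With 32-bit unsigned arithmetic the product wraps to 32 bits, so the shift by 16 leaves at most 16 bits and the mask changes nothing; with 64-bit arithmetic the shifted product can be much wider, so the mask is required. The function names below are made up for this sketch.

```c
#include <stdint.h>

/* 32-bit unsigned: the product wraps modulo 2^32, so after >>16 the
   result already fits in 16 bits and "& 65535" would be a no-op. */
uint32_t hash_u32(uint32_t eip)
{
    return (eip * 65521u) >> 16;
}

/* 64-bit signed, without the mask: the result can exceed 16 bits.
   (Right-shifting a negative value is implementation-defined in C,
   so this sketch is only exercised with non-negative addresses.) */
int64_t hash_s64_unmasked(int64_t rip)
{
    return (rip * 65521) >> 16;
}

/* 64-bit signed with the mask, matching the expression in the thread. */
int64_t hash_s64(int64_t rip)
{
    return ((rip * 65521) >> 16) & 65535;
}
```

For an address of 0xFFFFFFFF, the unmasked 64-bit version yields a value in the billions, while the 32-bit version and the masked 64-bit version both stay within 0..65535.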
 
bartc...
Posted: Tue Nov 03, 2009 6:00 am
Guest
"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcnsmm$vu8$1 at (no spam) news.albasani.net...
Quote:

"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:_eFHm.1021$Ym4.889 at (no spam) text.news.virginmedia.com...

"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:bDzHm.910$Ym4.134 at (no spam) text.news.virginmedia.com...

"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hclvg3$ubg$1 at (no spam) news.albasani.net...

"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:xxnHm.748$Ym4.334 at (no spam) text.news.virginmedia.com...
What slowdown are you getting compared with a real machine?

currently, by simplistic tests, around 170x at present...

The 'tighter' asm code yielded about a 15x slowdown compared with
executing on the actual processor.

I also tried pure C code and was surprised it managed to achieve only a
30x slowdown (this test code is below).

in a few rare cases, I have managed to get 15x with an interpreter.
it depends, however, highly on the particular type of code being run.

You say you're getting 170x and that's with already partly decoded
instructions... So I think this approach could be worth looking at again.


there are many reasons for such a level of slowdown...

one of the major ones may well be the level of abstraction I am using for
a lot of the code.
in general, it is hardly a design aimed at "max speed", since in general I
put "making it work" and "having clean code" above "having the maximum raw
speed"...

With this sort of code the performance level will be inherent; messing about
with switches and possibly changing to function pointers is going to make
only a small difference.

Quote:
int registers[8];
....


Quote:
at which cost?...

as can be seen, this does not even address a tiny fraction of the issues
which exist in the x86 ISA (for example, how do you intend to handle
issues like "Virtual Addressing" and segmentation, maybe page-table
translation, ... ?).

This is starting to depend on how serious an emulation you want of the x86
processor. I would have been happy emulating a single virtual task, and not
intrude into OS and device driver territory, or even into memory caches and
instruction pipelines.

You might have a point about virtual memory mapping, but then it's also
possible to make use of such facilities on the host processor (the one
running the emulator); I would expect the memory of the emulated task to
exist inside the memory space of the emulator.

Elsewhere you mentioned threading, but was this threading of your
'interpreter', or were you also trying to deal with threading in the cpu
(multiple cores and whatever)?

Quote:
what even about eflags?...

You mean stuff like the Virtual Interrupt Pending flag? If I needed to model
a complete processor with 100% accuracy, then I might be bothered with all
that. If I just wanted to run an exe application, this is probably not
necessary.

Quote:

maybe keep these sorts of issues in mind.

as well, consider that the code above looks, well, terrible...

Thanks...

--
bartc
 
BGB / cr88192...
Posted: Tue Nov 03, 2009 6:17 am
Guest
"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:lgLHm.1171$Ym4.295 at (no spam) text.news.virginmedia.com...
Quote:

"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hcnsmm$vu8$1 at (no spam) news.albasani.net...

"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:_eFHm.1021$Ym4.889 at (no spam) text.news.virginmedia.com...

"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:bDzHm.910$Ym4.134 at (no spam) text.news.virginmedia.com...

"BGB / cr88192" <cr88192 at (no spam) hotmail.com> wrote in message
news:hclvg3$ubg$1 at (no spam) news.albasani.net...

"bartc" <bartc at (no spam) freeuk.com> wrote in message
news:xxnHm.748$Ym4.334 at (no spam) text.news.virginmedia.com...
What slowdown are you getting compared with a real machine?

currently, by simplistic tests, around 170x at present...

The 'tighter' asm code yielded about a 15x slowdown compared with
executing on the actual processor.

I also tried pure C code and was surprised it managed to achieve only a
30x slowdown (this test code is below).

in a few rare cases, I have managed to get 15x with an interpreter.
it depends, however, highly on the particular type of code being run.

You say you're getting 170x and that's with already partly decoded
instructions... So I think this approach could be worth looking at
again.


there are many reasons for such a level of slowdown...

one of the major ones may well be the level of abstraction I am using for
a lot of the code.
in general, it is hardly a design aimed at "max speed", since in general
I
put "making it work" and "having clean code" above "having the maximum
raw
speed"...

With this sort of code the performance level will be inherent; messing
about
with switches and possibly changing to function pointers is going to make
only a small difference.


yep...

a little more fiddling, a little improvement, can't expect that much more at
this point...


Quote:
int registers[8];
...

at which cost?...

as can be seen, this does not even address a tiny fraction of the issues
which exist in the x86 ISA (for example, how do you intend to handle
issues like "Virtual Addressing" and segmentation, maybe page-table
translation, ... ?).

This is starting to depend on how serious an emulation you want of the x86
processor. I would have been happy emulating a single virtual task, and
not
intrude into OS and device driver territory, or even into memory caches
and
instruction pipelines.

You might have a point about virtual memory mapping, but then it's also
possible to make use of such facilities on the host processor (the one
running the emulator); I would expect the memory of the emulated task to
exist inside the memory space of the emulator.


I do partial address-translation, but, yes, given this is not strictly an
emulator, the address translation mechanisms differ some from that of the
real CPU (I am using spans which pretend to be collections of pages...).

this is mostly because spans were a little faster and had a few other uses,
and because 'userspace' code is not likely to notice the difference. I may
add full paging support eventually though.


Quote:
Elsewhere you mentioned threading, but was this threading of your
'interpreter', or were you also trying to deal with threading in the cpu
(multiple cores and whatever)?


both...

the interpreter will simulate multiple threads, but the interpreter will
also run in a multi-threaded environment, and may be itself
multi-threaded...

I don't want to commit to any overly restrictive design decisions, such as
ones which would break in the face of multi-threading.

I have had this sort of bad experience before...


Quote:
what even about eflags?...

You mean stuff like the Virtual Interrupt Pending flag? If I needed to
model
a complete processor with 100% accuracy, then I might be bothered with all
that. If I just wanted to run an exe application, this is probably not
necessary.


basic eflags behavior is needed even for code to run, so alas, nearly every
arithmetic op needs to set eflags as appropriate, otherwise they risk that
code will fail to work.

hence, I simulate most of the eflags behavior (except maybe PF and a few
others, the PF case being mostly because this one would be difficult to
check and also because almost no modern code is likely to depend on it).
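
As an illustration of what setting flags from a result involves, here is a minimal sketch that derives ZF, SF, and PF from a 32-bit result. PF reflects only the low byte and is set when that byte has an even number of 1 bits; one common way to compute it is to XOR-fold the byte. CF, OF, and AF depend on the operands as well as the result and are left out. The function names are hypothetical; the bit positions are those of the real EFLAGS register.

```c
#include <stdint.h>

/* EFLAGS bit positions for the three flags handled here. */
enum { FLAG_PF = 0x004, FLAG_ZF = 0x040, FLAG_SF = 0x080 };

/* XOR-fold the low byte down to one bit: 1 means an odd number of
   set bits, so even parity is the case where that bit is 0. */
static int low_byte_has_even_parity(uint32_t v)
{
    uint32_t b = v & 0xFFu;
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return (b & 1u) == 0;
}

/* Flags that can be derived from a 32-bit result alone. */
uint32_t result_flags32(uint32_t result)
{
    uint32_t fl = 0;
    if (result == 0)
        fl |= FLAG_ZF;
    if (result & 0x80000000u)
        fl |= FLAG_SF;
    if (low_byte_has_even_parity(result))
        fl |= FLAG_PF;
    return fl;
}
```

A result of zero sets both ZF and PF, since a low byte with no bits set counts as even parity.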


Quote:

maybe keep these sorts of issues in mind.

as well, consider that the code above looks, well, terrible...

Thanks...


hmm...

I guess I am mostly just really fussy about coding practices...

I have had some bad experiences in the past though.

cleanly designed code is, generally, far more resistant to gradual bit-rot,
which starts to matter a whole lot as one keeps working on a codebase over
some amount of time (years or more...).

bit-rot is an almost inescapable issue, and one is better served in writing
code which can resist this condition to some extent...

so, I am fussy I guess...



oh well, here is the great evil switch which now sits in the top position in
the profiler...
note: this mask is because the 'configuration' field exists in the same
place as the 'flags', and is actually a proper subset of the flags, only
that I have ended up special-casing many combinations of flags to
'synthesize' a field I could use for a switch.

yes, the compiler does compile this switch to a jump table...
(and, yes, if you noticed the 5-argument opcodes, you saw correctly; we can
thank AVX for this one...).


int BGBV86_ExecOpcode(BGBV86_Context *ctx, BGBV86_DecodeOp *op)
{
switch(op->rm_fl&BGBV86_RMFL_CFGMASK)
{
case BGBV86_RMCFG_BASIC:
BGBV86_ExecOpcode_Basic(ctx, op); break;
case BGBV86_RMCFG_REG:
BGBV86_ExecOpcode_Reg(ctx, op); break;
case BGBV86_RMCFG_RM:
BGBV86_ExecOpcode_RM(ctx, op); break;
case BGBV86_RMCFG_REGRM:
BGBV86_ExecOpcode_RegRM(ctx, op); break;
case BGBV86_RMCFG_RMREG:
BGBV86_ExecOpcode_RMReg(ctx, op); break;
case BGBV86_RMCFG_REGRMREG2:
BGBV86_ExecOpcode_RegRMReg2(ctx, op); break;
case BGBV86_RMCFG_RMREGREG2:
BGBV86_ExecOpcode_RMRegReg2(ctx, op); break;
case BGBV86_RMCFG_REGREG2RM:
BGBV86_ExecOpcode_RegReg2RM(ctx, op); break;
case BGBV86_RMCFG_RMREG2REG:
BGBV86_ExecOpcode_RMReg2Reg(ctx, op); break;
case BGBV86_RMCFG_REGRMREG2REG3:
BGBV86_ExecOpcode_RegRMReg2Reg3(ctx, op); break;
case BGBV86_RMCFG_RMREGREG2REG3:
BGBV86_ExecOpcode_RMRegReg2Reg3(ctx, op); break;
case BGBV86_RMCFG_REGREG2RMREG3:
BGBV86_ExecOpcode_RegReg2RMReg3(ctx, op); break;
case BGBV86_RMCFG_RMREG2REGREG3:
BGBV86_ExecOpcode_RMReg2RegReg3(ctx, op); break;

case BGBV86_RMCFG_IMM:
BGBV86_ExecOpcode_Imm(ctx, op); break;
case BGBV86_RMCFG_REGIMM:
BGBV86_ExecOpcode_RegImm(ctx, op); break;
case BGBV86_RMCFG_RMIMM:
BGBV86_ExecOpcode_RMImm(ctx, op); break;
case BGBV86_RMCFG_REGRMIMM:
BGBV86_ExecOpcode_RegRMImm(ctx, op); break;
case BGBV86_RMCFG_RMREGIMM:
BGBV86_ExecOpcode_RMRegImm(ctx, op); break;
case BGBV86_RMCFG_REGRMREG2IMM:
BGBV86_ExecOpcode_RegRMReg2Imm(ctx, op); break;
case BGBV86_RMCFG_RMREGREG2IMM:
BGBV86_ExecOpcode_RMRegReg2Imm(ctx, op); break;
case BGBV86_RMCFG_REGREG2RMIMM:
BGBV86_ExecOpcode_RegReg2RMImm(ctx, op); break;
case BGBV86_RMCFG_RMREG2REGIMM:
BGBV86_ExecOpcode_RMReg2RegImm(ctx, op); break;
case BGBV86_RMCFG_REGRMREG2REG3IMM:
BGBV86_ExecOpcode_RegRMReg2Reg3Imm(ctx, op); break;
case BGBV86_RMCFG_RMREGREG2REG3IMM:
BGBV86_ExecOpcode_RMRegReg2Reg3Imm(ctx, op); break;
case BGBV86_RMCFG_REGREG2RMREG3IMM:
BGBV86_ExecOpcode_RegReg2RMReg3Imm(ctx, op); break;
case BGBV86_RMCFG_RMREG2REGREG3IMM:
BGBV86_ExecOpcode_RMReg2RegReg3Imm(ctx, op); break;

case BGBV86_RMCFG_REGREG1:
BGBV86_ExecOpcode_RegReg1(ctx, op); break;
case BGBV86_RMCFG_REG1REG:
BGBV86_ExecOpcode_Reg1Reg(ctx, op); break;

default: BGBV86_CheckRaiseUD(ctx); break;
}

return(0);
}
 
 