
Converting a floating point texture to a rgba texture...

Author Message
Skybuck Flying...
Posted: Sun Oct 04, 2009 2:48 pm
Guest
With the parser I have it's gonna be a piece of cake and then I am gonna be
filthy rich ! LOL :)

Bye,
Skybuck =D
 
Skybuck Flying...
Posted: Sat Oct 10, 2009 7:25 pm
Guest
I just gave this pipeline simulation a test... without actually using any
simulator code yet...

And it seems very limited... only 100.000 instructions can be recorded or
so... maybe 1.000.000 but that's very little... just an initializing loop
takes like 8000 * 10 instructions = 80.000 instructions or so...

So this pipeline simulation is not worth much... though maybe it could give
some insight into some cycles or so...

All in all probably not worth investigating any further since it's pretty
clear that memory lookups slow it down... and other tests already show the
cpu can't do anything else while it's waiting for memory or so ?!?

At least it seemed like that for me... I could be wrong though ;)

Bye,
Skybuck.
 
Skybuck Flying...
Posted: Fri Oct 23, 2009 3:48 am
Guest
My latest insights into the possibility of executing corewars on a gpu have
made me doubt if the performance is going to be any good... it's probably
not going to be any faster than a cpu... maybe even significantly slower
depending on the number of passes that are needed.

Calculations also assume that all executors would actually run in parallel
at full speed which is also probably a flawed assumption... this could mean
that ultimate performance could even be far worse for gpu.

Conclusions for parallel processors:

1. Huge memory requirements just to be able to store stuff and also cache
stuff.

This is mostly where my current graphics card is kinda lacking... only 512
MB... that's not really that much for parallel stuff... where each
parallel unit would only do a little bit of work ;)

I could continue trying to develop something... but I now have serious
doubts that it would achieve any good speed... at least with the current
design... which is probably a very good design... maybe the best one... only
the other idea, the speculative execution one, might give some performance
benefit... but I doubt that will be any good for sequential warriors... unless
something more complex is done with loop iteration prediction per processing
element or so... that's a bit too advanced for my taste...

I think it's time to start spending my time on other projects...

Maybe in the future when programming has become easier... and when more
resources are available I might give it another try... but using opengl/cg
shaders probably has too much programming overhead and especially too little
resources available... hardware wise as well.. too little memory.

It's kinda a bummer...

I shall do one last calculation which would be an optimistic calculation
just to see if something can be done:

(4 input textures + 4 output textures) * 4 elements per texture * 3 bytes = 96
bytes.

512 MB / 96 = 5.33333333 mega elements per texture.

sqrt(5.3333333 mega) = 2364x2364 texture size or so.

core size = 8000 + 2 warriors * (8000 processes + 500 pspace) = 8.000 +
17.000 = 25.000 elements + 10 for little overhead or so...

Means 2364*2364 / 25010 = 223 simulators in gpu at best.

cycles per simulator could be anywhere from 1000 to 100.000 cycles per
second.

Worst case scenario: 223 * 1000 = 223.000 cycles per second... could even be
worse if not fully executed in parallel... but gpu does have many cores...
like 200 so might actually execute in parallel.

Best case scenario: 223 * 100.000 = 22.300.000 cycles per second for entire gpu.

This is pretty optimistic... probably a bit too optimistic... probably more
passes required... or maybe not...

but let's say 22 million cycles per second for gpu.

Cpu achieves 16 million for dual core... so gpu is not really spectacular...
and I need something spectacular...

The 100.000 above is assuming that opengl doesn't need to bind the cg
program all the time...

It probably would need to re-bind... so that would make it 10 times slower
or so... so gpu might actually achieve only 2 million cycles per second
which would be bad.

So conclusion in other short words:

It's like having a cpu which can do 223 cycles in parallel... but it can
only do that 10.000 times per second or so... so final speed would be: 2.230.000
cycles per second... which is just miserable.

So that's my latest guess at what the performance would be... miserable ! ;)
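
The estimate in this post can be re-run as a small script. This is only a sketch of the arithmetic above: the variable and function names are made up for illustration, and it assumes that 512 MB means 512 * 2^20 bytes.

```python
import math

MEM_BYTES = 512 * 2**20                  # 512 MB card (assumption: binary megabytes)
BYTES_PER_ELEMENT = (4 + 4) * 4 * 3      # (4 input + 4 output textures) * 4 elements * 3 bytes

elements = MEM_BYTES // BYTES_PER_ELEMENT   # elements that fit in memory
side = math.isqrt(elements)                 # side of the largest square texture

# core + 2 warriors * (processes + pspace) + a little overhead
cells_per_simulator = 8000 + 2 * (8000 + 500) + 10
simulators = (side * side) // cells_per_simulator


def gpu_cycles_per_second(passes_per_second):
    """Total cycles per second if every simulator advances one cycle per pass."""
    return simulators * passes_per_second


print(side, simulators)
for passes in (1_000, 10_000, 100_000):
    print(passes, gpu_cycles_per_second(passes))
```

With 10.000 passes per second this gives the 2.230.000 cycles per second from the conclusion; the 1.000 and 100.000 cases are the worst and best case figures.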

Bye,
Skybuck.
 
Skybuck Flying...
Posted: Fri Oct 23, 2009 4:06 am
Guest
However I just had a radically new idea...

What if the shader itself uses 50.000 local integers or so...

Then the shader could use all those local integers as if it was local
memory... and simply execute everything in one pass... this would/should
greatly increase the execution speed.

The question is now how much local memory/integers/variables can a shader
have ?!

A simple test with an array of ints could shed some light on this for
example:

void myshader()
{
    int myvar[50000];

}

^ if something like that compiles then that could be very interesting ! ;)

Bye,
Skybuck.
 
Skybuck Flying...
Posted: Fri Oct 23, 2009 4:34 am
Guest
Ok,

I tested this theory (from last posting) and it seems to compile with some
slight modifications.

It seems for loops are limited to 4096 ? Not sure what that is...

What if it was a while loop ?

Maybe ints limited to range 4096 ? I am not sure...

For now the core could be split into a lower and upper half and then the
code below works.

No idea yet of what performance would be... also no idea how many of these
could run in parallel without blowing things up ?! ;)

Time will tell... now time for some performance indication testing with fx
composer 2.5.

Fingers crossed, code example:

/*

% Description of my shader.
% Second line of description for my shader.

keywords: material classic

date: YYMMDD

*/

struct Tinstruction
{
    short mWord1;
    short mWord2;
    short mWord3;
};

typedef short Tprocess;

float4x4 WorldViewProj : WorldViewProjection;

float4 mainVS(float3 pos : POSITION) : POSITION {
    return mul(WorldViewProj, float4(pos.xyz, 1.0));
}

float4 mainPS() : COLOR
{

    int vIndex;
/*
    // works:
    int vLowerCore[4000];
    int vHigherCore[4000];

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vLowerCore[vIndex] = vLowerCore[vIndex] + 1;
    }

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vHigherCore[vIndex] = vHigherCore[vIndex] + 1;
    }
*/

    // works as well... highly interesting !
    Tinstruction vLowerCore[4000];
    Tinstruction vHigherCore[4000];

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vLowerCore[vIndex].mWord1 = vLowerCore[vIndex].mWord1 + 1;
    }

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vHigherCore[vIndex].mWord1 = vHigherCore[vIndex].mWord1 + 1;
    }

    Tprocess vLowerProcess[4000];
    Tprocess vHigherProcess[4000];

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vLowerProcess[vIndex] = vLowerProcess[vIndex] + 1;
    }

    for (vIndex=0; vIndex < 4000; vIndex++)
    {
        vHigherProcess[vIndex] = vHigherProcess[vIndex] + 1;
    }


    return float4(1.0, 1.0, 1.0, 1.0);
}

technique technique0 {
    pass p0 {
        CullFaceEnable = false;
        VertexProgram = compile vp40 mainVS();
        FragmentProgram = compile fp40 mainPS();
    }
}
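
Splitting the 8000-cell core into two arrays of 4000 means every core access has to pick the right half first. Below is a minimal sketch of that address mapping, written in Python only so it can be run and checked; it is not shader code. The names `locate`, `read_cell` and `write_cell` are hypothetical, and the wrap-around of addresses modulo the core size is the usual Core War convention rather than something shown in the shader above.

```python
CORE_SIZE = 8000
HALF = CORE_SIZE // 2            # 4000, the array size that compiled in the post

lower = [0] * HALF               # stands in for vLowerCore
upper = [0] * HALF               # stands in for vHigherCore


def locate(address):
    """Map a core address to (half, offset); half 0 is lower, half 1 is upper."""
    address %= CORE_SIZE         # core addresses wrap around
    return divmod(address, HALF)


def read_cell(address):
    half, offset = locate(address)
    return (lower, upper)[half][offset]


def write_cell(address, value):
    half, offset = locate(address)
    (lower, upper)[half][offset] = value


write_cell(7999, 42)             # last cell of the core lands in the upper half
print(locate(4000), read_cell(-1))
```

In a shader the same selection would cost a branch or a conditional select on every core access, which is part of the overhead that the split introduces.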

Bye,
Skybuck.
 
Skybuck Flying...
Posted: Fri Oct 23, 2009 4:55 am
Guest
I just tried to do some performance testing with fx composer 2.5...

It gives some error "GPuPerformanceUnsupported" ?!?

It did give some indication 10 Gpixels / sec ?!?

Probably flawed indication...

I think I could use this technique to try and implement a parallel corewar
simulator...

The data would be loaded from a texture map just once at the start of the
shader...

Then the shader runs a full simulator battle, maybe even multiple in one
go/pass.

And then it simply returns the battle results in a little output texture...

Could be nice if it works ! ;)

Example for two warriors in core:

This way the constraints would be:

First constraint:

Maximum amount of simulators in gpu memory possible:

512 MB / ( 8000*6 bytes + 2 * (8000 + 500+4) * 2 ) =
512 MB / ( 48000 + 34016 ) =
512 MB / 82016 =
536870912 / 82016 = 6545 simulators in gpu memory !
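
The memory budget above, spelled out as a short script. It is a sketch that only repeats the figures from this post; the constant names are invented here for readability.

```python
MEM_BYTES = 512 * 2**20                    # 512 MB = 536870912 bytes
CORE_CELLS = 8000
BYTES_PER_INSTRUCTION = 6
WARRIORS = 2
ENTRIES_PER_WARRIOR = 8000 + 500 + 4       # processes + pspace + 4 extra, as in the post
BYTES_PER_ENTRY = 2

core_bytes = CORE_CELLS * BYTES_PER_INSTRUCTION
warrior_bytes = WARRIORS * ENTRIES_PER_WARRIOR * BYTES_PER_ENTRY
bytes_per_simulator = core_bytes + warrior_bytes

simulators = MEM_BYTES // bytes_per_simulator   # whole simulators only

print(core_bytes, warrior_bytes, bytes_per_simulator, simulators)
```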

Now the pixel shaders would simply run each simulator side by side for as
far as possible...

I have no idea what the performance for the pixel shader would be...

But for now I will take a guess...

6545 simulators * 80.000 cycles * 2 warriors * 100 battles =

104.720.000.000 instructions to execute at least.

Each instruction is about 6 bytes...

So that's a bandwidth requirement of:

628.320.000.000 bytes

The true bandwidth is something like:

50 GB/sec which is: 5.368.709.1200 bytes

So clearly the bandwidth is a limiter/constraint...

So estimated time for shader to complete based on bandwidth constraint would
be:

628.320.000.000 bytes / 5.368.709.1200 bytes / sec =

628320000000 bytes / 53687091200 bytes / sec = 11.7 seconds.

So instructions per second executed would be:

104.720.000.000 / 11.7 = 8.950.427.350 instructions per second.

For two warriors that would mean 4.475.213.675 cycles per second.

Let's see.. a dual core cpu achieves 16.000.000 cycles per second.

The gpu performance would be staggering/very good.. however I have a feeling
there must be another bottleneck/constraint somewhere....

There could also be an execution constraint for the gpu.

Stats/specs say something like: Fill rate: 15.7 billion pixels/sec.

I think that's about:
15.7 * 1000 * 1000 * 1000 = 15.700.000.000

So far this seems within range of the number above.

Conclusion: performance could be staggering/super speed !

Speed up over cpu would be:

4.475.213.675 / 16.000.000 =
4475213675 / 16000000 = 279.7

The gpu would be about 280 times faster than a cpu !

That's the kind of performance gain I am looking for ! ;)

Me very happy about that number ! =D
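
The bandwidth-bound estimate can be checked with a few lines of Python. This is a sketch under the same assumptions as the post: every executed instruction costs exactly one 6-byte transfer, 50 GB/sec means 50 * 2^30 bytes per second, and the cpu does 16.000.000 cycles per second. The names are illustrative only.

```python
SIMULATORS = 6545
CYCLES_PER_BATTLE = 80_000
WARRIORS = 2
BATTLES = 100
BYTES_PER_INSTRUCTION = 6
BANDWIDTH = 50 * 2**30                 # bytes per second
CPU_CYCLES_PER_SECOND = 16_000_000

instructions = SIMULATORS * CYCLES_PER_BATTLE * WARRIORS * BATTLES
traffic = instructions * BYTES_PER_INSTRUCTION        # bytes that must be moved

seconds = traffic / BANDWIDTH                         # time if bandwidth is the only limit
instructions_per_second = instructions / seconds
cycles_per_second = instructions_per_second / WARRIORS
speedup = cycles_per_second / CPU_CYCLES_PER_SECOND

print(f"{seconds:.1f} s, {cycles_per_second:,.0f} cycles/s, {speedup:.1f}x")
```

The post rounds the time to 11.7 seconds before dividing, so its per-second figures differ slightly from the unrounded ones computed here; the speed-up still comes out at about 280.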

As long as the code compiles this should definitely be achievable !

However there is still a little catch... these numbers do not include the
initialization... this would need to be done for each battle... but that's
probably pretty quickly done as well...

Even a 200 speed up would be real nice ! ;)

So these numbers are very encouraging and I will definitely continue my
development efforts to get a parallel gpu corewars executor going ! :)

Bye,
Skybuck =D
 
Skybuck Flying...
Posted: Fri Oct 23, 2009 5:00 am
Guest
"Skybuck Flying" <BloodyShame at (no spam) hotmail.com> wrote in message
news:8d20b$4ae0fef5$d53372a9$11893 at (no spam) cache5.tilbu1.nb.home.nl...
Quote:
The true bandwidth is something like:

50 GB/sec which is: 5.368.709.1200 bytes

I made a little typo there in the dots:

Correct dotted value is:

53.687.091.200

However the calculations were still done properly... because I removed
the dots later on ! ;)

So calculations are correct ! ;)
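
A quick way to avoid misplaced dots is to let the computer do the grouping. The helper below is a small sketch (the name `dotted` is made up here); it formats an integer with a dot after every three digits, the way the numbers in this thread are written.

```python
def dotted(n):
    """Format an integer with dots as thousands separators."""
    return f"{n:,}".replace(",", ".")


bandwidth = 50 * 2**30          # 50 GB/sec in bytes per second
print(dotted(bandwidth))
```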

Bye,
Skybuck ! :)

 
Skybuck Flying...
Posted: Sun Oct 25, 2009 3:10 am
Guest
The error was probably related to gtx 7900 which doesn't support certain
performance benchmarks... the gtx 8800 does...

Anyway back to the story...:

Even more interesting could be to completely leave the core, processes and
pspace out of the texture maps...

Since those "entities" can be done/initialized in the shader itself.

What remains is the warrior's code... that could be supplied into the
texture map... parameters maybe not possible... I would be worried that it
would be pre-compiled/computed which is unwanted.

To keep it simple each warrior could be stuffed into 100 cells... even if
they are not all used... plus a size indicating how large it really is...

This means the number of simulators could be:

512 MB / (100 * 6 bytes + 2) =
536870912 / 602 = 891812 simulators ! LOL.

This could allow a "battlefield" of 944 x 944 ;)

Hmm seems a bit overkill for now... my battlefield would be 60x60 or so...
but maybe later I try 944x944 or so...

For now I shall not do any calculations how long this would take... just
want to "document" the idea a little bit ;)

Bye,
Skybuck.
 
Skybuck Flying...
Posted: Sun Oct 25, 2009 3:42 am
Guest
"Skybuck Flying" <BloodyShame at (no spam) hotmail.com> wrote in message
news:67037$4ae38952$d53372a9$1360 at (no spam) cache4.tilbu1.nb.home.nl...
Quote:
This means the number of simulators could be:

Hmm program start needed as well

So this becomes:

512 MB / (100 * 6 bytes + 4) =

536870912 / 604 = 888859 simulators

Max battlefield 942 x 942
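
Both versions of this calculation (2 bytes of header for the size only, 4 bytes once the program start is added) fit in one small function. This is a sketch of the arithmetic only; `capacity` and its parameters are names invented here, and the battlefield is taken to be the largest square that fits.

```python
import math

MEM_BYTES = 512 * 2**20          # 536870912 bytes


def capacity(header_bytes, cells=100, bytes_per_cell=6):
    """Return (records that fit in memory, side of the largest square battlefield)."""
    record_bytes = cells * bytes_per_cell + header_bytes
    records = MEM_BYTES // record_bytes
    return records, math.isqrt(records)


print(capacity(2))   # size field only
print(capacity(4))   # size field plus program start
```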

Bye,
Skybuck.
 
Skybuck Flying...
Posted: Wed Oct 28, 2009 1:28 am
Guest
I was losing confidence that it's gonna work because I don't know what will
happen if a shader uses many variables...

So I decided to do a little test... a little input texture... and some local
variables like 8000*4*32 bits.

And some code to try and force the gpu/cg compiler to actually use all of
them and not eliminate them...

Surprisingly it did seem to work... only problem is that FX Composer takes
multiple seconds to render something... it also allocates gigabytes of
memory... and then the whole application freezes.

I tried to make the shader only work for a few pixels... but alas.. it still
uses gigabytes.

It does seem to render some white now and then which was probably the result
of the shader which summed everything up more or less.
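
One possible reason for the gigabytes, offered purely as a guess and not confirmed anywhere in this thread: if the tool or driver reserves the local variables separately for every shader invocation, the memory adds up very quickly. The sketch below only does that arithmetic; the per-invocation reservation is an assumption, and the function names are hypothetical.

```python
BITS_PER_INVOCATION = 8000 * 4 * 32               # local variables from the test
BYTES_PER_INVOCATION = BITS_PER_INVOCATION // 8   # 128000 bytes


def scratch_bytes(invocations):
    """Memory needed if every invocation gets its own copy (assumption)."""
    return invocations * BYTES_PER_INVOCATION


def invocations_per_gib():
    """How many private copies fit in one GiB."""
    return 2**30 // BYTES_PER_INVOCATION


print(BYTES_PER_INVOCATION, invocations_per_gib(), scratch_bytes(256 * 256))
```

Under that assumption even a 256x256 render target would need several gigabytes, which would match what was observed, but other explanations (for example the compiler unrolling the loops) are just as plausible.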

Maybe I need to develop my own minimalistic cg editor/development
environment which is more aimed at large scale or so...

Hmmm..

Bye,
Skybuck.
 
 
All times are GMT