Main Page | Report this Page
Science Forum Index  »  Cryptography Forum  »  using smaller bases than possible so as to make use of...
Page 1 of 1    

using smaller bases than possible so as to make use of...

Author Message
yawnmoth...
Posted: Thu Oct 29, 2009 4:48 pm
Guest
According to section 5.2.2 of <http://math.libtomcrypt.com/files/
tommath.pdf>, the "defaults for LibTomMath are β (the radix) = 2**28
and α (the maximum value) = 2**64". On the surface, it seems like
2**32 would be a better choice but I'm wondering if that wasn't chosen
so that the comba technique could be used on more digits? If the
radix were 32 bits then the comba technique could only be used on one
digit, per floor(2**64 / (2**(2*32) - 2**(33) + 1)) == 1, which would
seem to make it not all that useful.
 
Tom St Denis...
Posted: Thu Oct 29, 2009 11:26 pm
Guest
On Oct 29, 10:48 pm, yawnmoth <terra1... at (no spam) yahoo.com> wrote:
[quote]According to section 5.2.2 of <http://math.libtomcrypt.com/files/
tommath.pdf>, the "defaults for LibTomMath are β (the radix) = 2**28
and  α (the maximum value) = 2**64".  On the surface, it seems like
2**32 would be a better choice but I'm wondering if that wasn't chosen
so that the comba technique could be used on more digits?  If the
radix were 32 bits then the comba technique could only be used on one
digit, per floor(2**64 / (2**(2*32) - 2**(33) + 1)) == 1, which would
seem to make it not all that useful.
[/quote]
It's a tradeoff made because LTM is 100% portable C. So to do the
muladd in C you need to use fewer bits per digit. In the case of 28
it means that your products are 8 bits shorter than a unsigned long
long, meaning you can have a column 256 deep for a max int size of
28*256=7168 with that routine, LTM can handle bigger as it will just
not use the comba routine on larger ints.

If I had chosen [say] 30-bits per digit, than I'd have 4 bits to spare
and can only handle numbers upto 16x30=480 bits with the comba, and
have to use the slower method for anything larger. So while 28-bits
means more multiplications, it's faster [than 30-bits] because the
muladd step is more efficient.

Hope that explains that.

Tom
 
yawnmoth...
Posted: Fri Oct 30, 2009 5:45 am
Guest
On Oct 30, 4:26 am, Tom St Denis <t... at (no spam) iahu.ca> wrote:
[quote]On Oct 29, 10:48 pm, yawnmoth <terra1... at (no spam) yahoo.com> wrote:

According to section 5.2.2 of <http://math.libtomcrypt.com/files/
tommath.pdf>, the "defaults for LibTomMath are β (the radix) = 2**28
and  α (the maximum value) = 2**64".  On the surface, it seems like
2**32 would be a better choice but I'm wondering if that wasn't chosen
so that the comba technique could be used on more digits?  If the
radix were 32 bits then the comba technique could only be used on one
digit, per floor(2**64 / (2**(2*32) - 2**(33) + 1)) == 1, which would
seem to make it not all that useful.

It's a tradeoff made because LTM is 100% portable C.  So to do the
muladd in C you need to use fewer bits per digit.  In the case of 28
it means that your products are 8 bits shorter than a unsigned long
long, meaning you can have a column 256 deep for a max int size of
28*256=7168 with that routine, LTM can handle bigger as it will just
not use the comba routine on larger ints.

If I had chosen [say] 30-bits per digit, than I'd have 4 bits to spare
and can only handle numbers upto 16x30=480 bits with the comba, and
have to use the slower method for anything larger.  So while 28-bits
means more multiplications, it's faster [than 30-bits] because the
muladd step is more efficient.

[/quote]
I don't see muladd discussed in the pdf, but Java's BigInteger.java
implements it and, to be honest, I'm not really sure I see the
advantage of it there. From their multiply function:

private int[] multiplyToLen(int[] x, int xlen, int[] y, int ylen,
int[] z) {
int xstart = xlen - 1;
int ystart = ylen - 1;

if (z == null || z.length < (xlen+ ylen))
z = new int[xlen+ylen];

long carry = 0;
for (int j=ystart, k=ystart+1+xstart; j>=0; j--, k--) {
long product = (y[j] & LONG_MASK) *
(x[xstart] & LONG_MASK) + carry;
z[k] = (int)product;
carry = product >>> 32;
}
z[xstart] = (int)carry;

for (int i = xstart-1; i >= 0; i--) {
carry = 0;
for (int j=ystart, k=ystart+1+i; j>=0; j--, k--) {
long product = (y[j] & LONG_MASK) *
(x[i] & LONG_MASK) +
(z[k] & LONG_MASK) + carry;
z[k] = (int)product;
carry = product >>> 32;
}
z[i] = (int)carry;
}
return z;
}

There are a total of three loops (two sets) - the second for loop has
another for loop inside it whereas the first one doesn't. The mulAdd
function is largely identical save for that the second set of for
loops isn't there - it's just the first one. In doing it that way,
they save themselves from having to add z[k] & LONG_MASK. Plus, if x
is just one digit long then xstart will be 0 and i, in the second for
loop, will start off at -1, which isn't >= 0, so the second for loop
will never be ran. Sure, maybe saving yourself from having to do an i
[quote]= 0 operation is worthwhile, but it doesn't seem like one trivial
operation like that is going to have much of an impact in the long run.[/quote]
 
Tom St Denis...
Posted: Fri Oct 30, 2009 7:35 am
Guest
On Oct 30, 11:45 am, yawnmoth <terra1... at (no spam) yahoo.com> wrote:
[quote]I don't see muladd discussed in the pdf, but Java's BigInteger.java
implements it and, to be honest, I'm not really sure I see the
advantage of it there.  From their multiply function:
[/quote]
Read section 5.2.2 on page 97 of my book [published, it'll be around
page 97 in the public domain copy] if you're so inclined.

In particular, my inner loop involves no shifting or fixups. So the O
(n^2) part of the algorithm is reduced to the simplest muladd
possible.

There are different ways to accomplish the same work, but I chose this
way because it was the simplest for me given the constraints I had.

Tom
 
Tom St Denis...
Posted: Fri Oct 30, 2009 7:40 am
Guest
On Oct 30, 11:45 am, yawnmoth <terra1... at (no spam) yahoo.com> wrote:
[quote]        for (int i = xstart-1; i >= 0; i--) {
            carry = 0;
            for (int j=ystart, k=ystart+1+i; j>=0; j--, k--) {
                long product = (y[j] & LONG_MASK) *
                               (x[i] & LONG_MASK) +
                               (z[k] & LONG_MASK) + carry;
                z[k] = (int)product;
                carry = product >>> 32;
            }
            z[i] = (int)carry;
        }
        return z;
    }
[/quote]
Also,

This inner loop requires 3 loads [really it can be optimized to 2], 1
store, 1 shift and 1 muladd. The inner loop in LTM requires 2 loads
and a muladd. So it's simpler while requiring more muladds. The
trick is does the load/store bandwidth get washed out in the
multiplication bandwidth?

The difference is they take a single digit of one multiplicand and
multiply it against every digit of the other multiplicand. So they
don't actually produced finished columns until much later. Whereas in
my loops after every iteration of the inner loop one digit of the
product is finished.

Which is better is a matter of taste I guess.

In the case of performance I just re-wrote it all into TFM where I use
full size digits [and fixed precision] and inline asm helpers for the
muladds [ported to a half-dozen platforms]. The performance boost is
insane, and the code is still just as simple to read.

Tom
 
 
Page 1 of 1    
All times are GMT - 5 Hours
The time now is Mon Nov 30, 2009 9:02 pm