(updated 10/i/08)
In an earlier post I described Stange’s algorithm for efficiently finding terms in elliptic nets (with a view to pairing computation). I also made the observation that a shorter block structure could be used for doubling- but once employed, it was not possible to perform a double-and-add. This meant that unless the desired term had a higher power of two as a factor then savings would be minor.
However, for a cost it is possible to ‘upgrade’ these short blocks to long blocks, since they contain enough information to recover the missing (k+4,0) term:
=\frac{(2,0)^2\times(k+3,0)\times(k+1,0)-(3,0)\times{(k+2,0)}^2}{(k,0)})
Better still, this only introduces an additional two multiplications and one inversion, since some of the terms feature in the precomputation (and assuming (2,0)2 is computed once and stored for subsequent use):
=\frac{(2,0)^2B(k+2)-(3,0)A(k+2)}{(k,0)})
Thus, given a short block centred at k, we can obtain the short block centred at 2k+1 (that is, perform a shortDoubleAdd). The dependencies are as follows:

Notice that a shortDoubleAdd is more expensive than DoubleAdd, even though it gives a short rather than long block! Thus a purely short-block algorithm would perform worse than the standard algorithm for binary strings with a high Hamming weight, since for each Double-and-add an inversion is introduced in place of a multiplication. However, when the Hamming weight is low, then the occasional cost of an inversion is balanced by the savings accrued during short doublings. To exploit this, whilst guarding against too many inversions, we introduce an algorithm that uses a mixture of standard (‘long’) and short blocks. Since only a single inversion is required to switch to long blocks, doing so allows us to more efficiently compute long runs of 1s in the binary representation by spreading the cost across several DoubleAdds.
We consider the generation of long or short blocks with centre 2k (double) or 2k + 1 (double-and-add) from long or short blocks of centre k. The cheapest such operation is the generation of the short block with centre 2k from a short or long block with centre k, at a cost of 31 multiplications, 6 squarings and no inversions. Using this as a base line, each procedure introduces the following additional operations:
| Procedure |
M |
S |
I |
| DoubleShortFromShort |
0 |
0 |
0 |
| DoubleLongFromShort |
6 |
1 |
1 |
| DoubleAddShortFromShort |
4 |
1 |
1 |
| DoubleAddLongFromShort |
6 |
1 |
1 |
| DoubleShortFromLong |
0 |
0 |
0 |
| DoubleLongFromLong |
4 |
1 |
0 |
| DoubleAddShortFromLong |
2 |
1 |
0 |
| DoubleAddLongFromLong |
4 |
1 |
0 |
We adopt a windowing approach with two-bit windows: that is, bit bi informs whether we are to double or double-and-add, but bi-1 is also examined to determine whether we should generate a long or short block.
- For bibi-1=00, the short block approach clearly minimises the cost through these two bits.
- For bibi-1=11, one should stay with long blocks if these are already in use, to avoid inversion. If short, adopting the long block immediately will mean only a single inversion is required for the following run of 1s.
- For bibi-1=10, if short, then an inversion is required whether you go long or not: since being long is not necessary for the following double, we keep the multiplication count down by 2 by staying short. Similarly for long: no inversion is required to perform the Double-and-add for either length, but as the next operation will be a double, we go short to avoid the unnecessary 2 multiplications.
- For bibi-1=01, then it is always worth staying short if you already are, defering the inversion until it is strictly required for bi-1=1 (possibly choosing to go long then based on bi-2). If currently long, going short will save 4 multiplications and a squaring (approximately 5 multiplications). Even if it proves necessary to upgrade to long for the very next bit, that will only cost around 3.6 multiplications (based on 1I=1.6M, see performance section). Thus even a single zero bit is worth going short for.
Hence the approach is to always go to (or stay with, if already the case) short blocks, unless bibi-1=11 in which case one should go to (or stay with) long blocks. Thus a 2-bit window is sufficient to determine appropriate block length, leading to the following algorithm.
2-Bit Window Algorithm
Double-and-add Mixed-blocks Algorithm
INPUT: Integer n and long block centred at 1.
OUTPUT: Block centred at n.
- Compute binary digits di of n such that n=(dkdk-1…d1)2 with dk=1.
- Set c=1 (centre), Set status=’long’
- For i=k-1 down to 2 do:
- If status=’long’
- If d_i=1
- If d_{i-1}=1 Compute block with centre 2c+1 via DoubleAddLongFromLong; Set c to 2c+1.
- Else Compute short block with centre 2c+1 via DoubleAddShortFromLong; Set c to 2c+1; Set status=’short’.
- Else
Compute short block with centre 2c via DoubleShortFromLong; Set c to 2c; Set status=’short’.
- Else
- If d_i=1
- If d_{i-1}=1 Compute block with centre 2c+1 via DoubleAddLongFromShort; Set c to 2c+1; Set status=’long’.
- Else Compute block with centre 2c+1 via DoubleAddShortFromShort; Set c to 2c+1.
- Else
Compute short block with centre 2c via DoubleShortFromShort; Set c to 2c.
-
- If d_1=1
- If status=’short’ Compute block with centre 2c+1 via DoubleAddShortFromShort.
- Else Compute block with centre 2c+1 via DoubleAddShortFromLong.
- Else Compute short block with centre 2c via DoubleShortFromShort.
- Return final block.
Performance
As described before, the maximum possible gain is when n is a power of two, in which case the algorithm proceeds entirely by short doubles. In this case, there is a 12 percent reduction in the number of multiplications/squarings performed, with no inversions required.
Brute-force analysis of all possible 16-bit strings gives an average reduction of around 9 percent in the number of multiplications/squarings performed. Costing each inversion at 1.6 multiplications (based on average performance in SAGE for a 256-bit prime field), this leads to an average reduction of around 5 percent in the number of multiplications required for such strings. Testing several hundred 256-bit strings gives a similar figure.
Clearly, inversion is not viable if it will lead to a division-by-zero error. However, since the first zero along (i,0) will arise at (m,0), no such error will occur when performing Tate pairing computations.
A summary of this post and the earlier one on Stange’s algorithm is available as pdf, containing (hopefully) clearer copies of the dependency graphs.