Fast and Secure elliptic Curve Scalar Multiplication Over Prime Fields Using Special Additi
- 格式:pdf
- 大小:172.71 KB
- 文档页数:14
素域椭圆曲线的基本参数素域椭圆曲线(Elliptic Curves over Finite Fields)是数论中的一个重要概念,主要用于密码学和编码理论等领域。
在素域椭圆曲线中,基本参数包括以下几个:1. 有限域(Finite Field):素域椭圆曲线的定义域是一个有限域,通常表示为GF(p),其中p是一个质数。
2. 椭圆曲线方程(Elliptic Curve Equation):一个三次多项式方程,用于定义椭圆曲线。
通常表示为y^2 = x^3 + ax + b(其中a和b是定义域GF(p)中的元素)。
3. 阶(Order):椭圆曲线上点的个数。
对于素域椭圆曲线,阶通常表示为n,即满足nG = 0(G为椭圆曲线上的无穷远点)的最小正整数n。
4. 生成点(Generator Point):一个用于表示椭圆曲线的点,通常是阶为n的生成元。
生成点一般表示为(x, y),满足椭圆曲线方程。
5. 阶的倍数(Multiples of the Order):阶的倍数是指满足nG = 0的点。
这些点可以用来构造椭圆曲线上的加法群。
6. 标量乘法(Scalar Multiplication):椭圆曲线上点的加法运算可以通过标量乘法实现。
对于定义域GF(p)中的元素k,标量乘法计算公式为kP = (x', y'),其中P为椭圆曲线上的一个点。
7. 双线性对(Weil Pairing):双线性对是一种满足一定性质的映射,用于将椭圆曲线上的点映射到一个有限域上的元素。
这种映射在密码学中具有重要意义,特别是在椭圆曲线密码学中,用于实现数字签名和密钥交换等任务。
这些基本参数在素域椭圆曲线的理论和应用中都具有关键作用。
理解和掌握这些参数有助于更好地学习和研究素域椭圆曲线。
Aspects of Hyperelliptic Curves over Large Prime Fields in Software Implementations(To appear in:Proceedings of CHES2004)Roberto Maria AvanziInstitute for Experimental Mathematics(IEM)−University of Duisburg-EssenEllernstraße29,D-45326Essen,Germanymocenigo at exp-math.uni-essen.deDepartment of Electrical Engineering and Information Sciences Ruhr University of Bochum,Universit¨a tsstraße150,D-44780Bochum,GermanyAbstract.We present an implementation of elliptic curves and of hyperellipticcurves of genus2and3over primefields.To achieve a fair comparison between thedifferent types of groups,we developed an ad-hoc arithmetic library,designed toremove most of the overheads that penalize implementations of curve-based cryp-tography over primefields.These overheads get worse for smallerfields,and thusfor larger genera for afixed group size.We also use techniques for delaying mod-ular reductions to reduce the amount of modular reductions in the formulae for thegroup operations.The result is that the performance of hyperelliptic curves of genus2over primefields is much closer to the performance of elliptic curves than previously thought.For groups of192and256bits the difference is about14%and15%respectively.Keywords.Elliptic and hyperelliptic curves,cryptography,efficient implementa-tion,primefield arithmetic,lazy and incomplete modular reduction.1IntroductionIn1988Koblitz[21]proposed to use hyperelliptic curves(HEC)as an alternative to elliptic curves(EC)for designing cryptosystems based on the discrete loga-rithm problem(DLP).EC are just the genus1HEC.Cryptosystems based on EC need a much shorter key than RSA or systems based on the DLP infinitefields: A160-bit EC key is considered to offer security equivalent to that of a1024-bit RSA key[25].Since the best known methods to solve the DLP on EC and on HEC of genus smaller than4have the same complexity,these curves offer the same security level,but HEC of genus4or higher offer less security[12,38].Until recently,HEC have been considered not practical[36]because of the difficulty offinding suitable curves and their poor performance with respect to EC.In the subsequent years the situation changed.Firstly,it is now possible to efficiently construct genus2and3HEC whose Jacobian has almost prime order of cryptographic relevance.Over primefields one can either count points in genus2[13],or use the complex multiplication (CM)method for genus2[29,39]and3[39].Secondly,the performance of the HEC group operations has been consider-ably improved.For genus2thefirst results were due to Harley[17].The state of the art is now represented by the explicit formulae of Lange:see[23,24]and further references therein.For genus3,see[32,33](and also[14]).HEC are attractive to designers of embedded hardware since they require smallerfields than EC:The order of the Jacobian of a HEC of genus g over a field with q elements is≈q g.This means that a160-bit group is given by an EC with q≈2160,by an HEC of genus2with q≈280,and genus3with q≈253. There has been also research on securing implementations of HEC on embedded devices against differential and Goubin-type power analysis[2].The purpose of this paper is to present a thorough,fair and unbiased com-parison of the relative performance merits of generic EC and HEC of small genus 2or3over primefields.We are not interested in comparing against very special classes of curves or in the use of prime moduli of special form.There have been several software implementations of HEC on personal com-puters and workstations.Most of those are in even characteristic(see[35,32], [33],and also[40,41]),but some are over primefields[22,35].It is now known that in even characteristic,HEC can offer performance comparable to EC.Until now there have been no concrete results showing the same for prime fields.Traditional implementations such as[22]are based on general purpose software libraries,such as gmp[16].These libraries introduce overheads which are quite significant for small operands such as those occurring in curve cryp-tography,and get worse as thefields get smaller.Moreover,gmp has no native support for fast modular reduction techniques.In our modular arithmetic library, described in§2.1,we made every effort to avoid such overheads.On a PC we get a speed-up from2to5over gmp for operations infields of cryptographic relevance (see Table1).We also exploit techniques for reducing the number of modular reductions in the formulae for the group operations.We thus show that the performance of genus2HEC over primefields is much closer to the performance of EC than previously thought.For groups of192resp. 256bits the difference is approximately14%,resp.15%.The gap with genus3 curves has been significantly reduced too.See Section3for more precise results.While the only significant constraint in workstations and commodity PCs may be processing power,the results of our work should also be applicable to other more constrained environments,such as Palm platforms,which are also based on general-purpose processors.In fact,a port of our library to the ARM processor has been recentlyfinished and yields similar results.In Section2,we describe the implementation of the arithmetic library and of the group operations.In Section3,we give timings and draw our conclusions. Acknowledgements.The author thanks Gerhard Frey for his constant support.Tanja Lange great-ly influenced this paper at several levels,theoretical and practical,considerably improving its qual-ity.The author acknowledges feedback,support and material from Christophe Doche,Pierrick Gaudry,Johnny Großsch¨a dl,Christof Paar,Jean-Jacques Quisquater,Jan Pelzl,Nicolas Th´e riault and Thomas Wollinger.Thanks also to the anonymous referees for several useful suggestions.2ImplementationWe use the following abbreviations:w is the bit length of the characteristic of the primefield.M,S and I denote a multiplication,a squaring and an inversion in the field.m and s denote a multiplication and a squaring,respectively,of two w-bit integers with a2w-bit result.R denotes a modular(or Montgomery)reduction of a2w-bit integer with a w-bit result.2.1Prime Field LibraryWe already said that standard long integer software libraries introduce several types of overheads.One is thefixed function call overhead.Other ones arise from the processing of operands of variable length in loops,such as branch mispre-dictions at the beginning and end of the loops,and are negligible for very large operands.For operands of size relevant for curve cryptography the CPU will spend more time performing jumps and paying branch misprediction penalties than doing arithmetic.Memory management overheads can be very costly,too.Thus,the smaller thefield becomes,the higher will be the time wasted in the overheads.Because of the larger number offield operations in smallerfields, HEC suffer from a much larger performance loss than EC.2.1.1Design.Our software library nuMONGO has been designed to allow effi-cient reference implementations of EC and HEC over primefields.It implements arithmetic operations in rings Z/N Z with N odd,with the elements stored in Montgomery’s representation[31],and the reduction algorithm is Montgomery’s REDC function–see§2.1.3for some more details.Many optimization techniques employed are similar to those in[6].nuMONGO is written in C++to take advantage of inline functions,overloaded functions statically resolved at compile time for clarity of coding,and operator overloading for I/O only.All arithmetic operations are implemented as imperative functions.nuMONGO contains no classes.All data structures are minimalistic.All elements of Z/N Z are stored in vectors offixed length of32-bit words.All tem-porary memory is allocated on the stack.No data structure is ever dynamically resized or relocated.This eliminates memory management overheads.The routines aim to be as simple as possible.The least possible number of routines are implemented which still allow to perform all desiredfield operations: They are built from elementary operations working on single words,available as generic C macros as well as assembler macros for x86-compatible CPUs.A CPU able to process32-bit operands is assumed,but not necessarily a32-bit CPU –the library in fact compiled also on an Alpha.Inlining was used extensively, most loops are unrolled;there are very few conditional branches,hence branchmispredictions are rare.There are separate arithmetic routines for all operand sizes,in steps of32bits from32to256bits,as well as for48–bitfields(80and 112-bitfields have been implemented too,but gave no speed-up over the96and 128-bit routines).2.1.2Multiplication.We begin with two algorithms to multiply“smallish”multi-precision operands:Schoolbook multiplication and Comba’s method[10].The next two algorithms take as input two -word integers u=(u −1,..., u1,u0)and v=(v −1,...,v0),and output the2 -word integer r=(r2 −1,...,r0) such that r=uv.Schoolbook multiplication Comba’s methodThe schoolbook method multiplies thefirst factor by each digit of the second factor,and accumulates the ba’s method instead,for each digit of the result,say the k th one,computes the partial products u i v j on the diagonals i+j=k,adding these double precision results to a triple precision accumulator.It requires fewer memory writes and more reads than the schoolbook multiplication. This is the method adopted in[6].For both methods,several optimizations have been done.They can both be used with Karatsuba’s trick[20].In our experience,Comba’s method did not perform better than the school-book method(on the ARM the situation is different).This may be due to the fact that the Athlon CPU has a write-back level1cache[1],hence several close writes to the same memory location cost little more than one generic write.For n=192 and n=256we reduced a n-bit multiplication to three n/2-bit multiplications by means of Karatsuba’s trick.For smaller sizes and for224-bit operands,the schoolbook method was still faster.For the considered operand sizes,squaring routines did not bring a significant improvement over the multiplication routines,hence they were not included. 2.1.3Montgomery’s reduction without trial division.Montgomery[31] proposed to speed up modular reduction by replacing the modulus N by a larger integer R coprime to N for which division is faster.In practice,ifβis the machine radix(in our caseβ=232)and N is an odd -word integer,then R=β .Division by R and taking remainder are just shift and masking operations.Let REDC(x)be a function which,for0 x<NR,computes xR−1mod N.The modular residue x is represented by its r-sidu¯x=xR mod N.Addition, subtraction,negation and testing for equality are performed on r-sidus as usual.Note that x=REDC(xyR mod N,soI NPUT:A2 -word integer x=(x2 −1,...,x1,x0),and N,n 0andβas above.O UTPUT:The -word integer y such that y=xR−1mod N and0 y<N.This algorithm is,essentially,Hensel’s odd division for computing inverses of 2-adic numbers to a higher base:At each iteration of the loop,a multiple of N is added to y such that the result is divisible byβ,and then y is divided byβ(a one word shift).After the loop,y≡x/β ≡xR−1mod N and y<2N.If y N, a subtraction corrects the result.The cost of REDC is,at least in theory,that of a schoolbook multiplication of -word integers,some shifts and some additions; In practice it is somewhat more expensive,but still much faster than the naive reduction involving long divisions.We did not use the interleaved multiplication with reduction[31]:It usually performs better on DSPs[11],but not on general-purpose CPUs with few registers.2.1.4Inversion.With the exception of32-bit operands,inversion is based on the extended binary GCD,and uses an almost-inverse technique[19]withfi-nal multiplication from a table of precomputed powers of2mod N.This was the fastest approach up to about192bits.For32-bit operands we got better per-formance with the extended Euclidean algorithm and special treatment of small quotients to avoid divisions.Inversion was not sped up further for larger input sizes because of the intended usage of the library:For elliptic curves over prime fields,inversion-free coordinate systems are much faster than affine coordinates, so there is need,basically,only for one inversion at the end of a scalar multi-plication.For hyperelliptic curves,fields are quite small(32to128bits in most cases),hence our inversion routines have optimal performance anyway.There-fore,Lehmer’s method or the improvements by Jebelean[18]or Lercier[26]have not been included in thefinal version of the library.2.1.5Performance.In Table1we show some timings of basic operations with gmp version4.1and nuMONGO.The timings have been measured on a PC with a1 GHz AMD Athlon Model4processor,under the Linux operating system(kernel version2.4).Our programs have been compiled with the GNU C Compiler(gcc) versions2.95.3and3.3.1.For each test,we took the version that gave the bestTable 1.Timings of basic operations in µsec (1GHz AMD Athlon PC)and ratiosLib /Op /Bits 3264128192256n u M O N G O0.02010.0540.1460.3610.05360.0970.2410.4161.85 4.45711.222.32.667 1.796 1.651 1.15225.10229.5228.9428.7m 0.0940.160.2380.3540.508R 0.2340.4230.81 1.154 1.528I 2.53 6.4113.321.2629.6R/m 2.489 2.644 3.403 3.26 3.008I/(R+m)7.71310.9912.6914.114.54timings.nuMONGO always performed best with gcc 3.3.1,whereas some gmp tests performed better with gcc 2.95.31.We describe the meaning of the entries.There are two groups of rows,grouped under the name of library used to benchmark the following operations:multiplication of two integers (m ),modular or Montgomery reduction (R ),modular or Montgomery inversion (I ).The ratios of a reduction to a multiplication and of an inversion to the time of a multiplication together with a reduction are given,too:The first ratio tells how many “multiplications”we save each time we save a reduction using the techniques described in the next subsection;the second ratio is the cost of a field inversion in field multiplications.The columns correspond to the bit lengths of the operands.A few remarks:1.nuMONGO can perform better than a far more optimized,but general purpose library.In fact,the kernel of gmp is entirely written in assembler for most ar-chitectures,including the one considered here.2.For larger operands gmp catches up with nuMONGO ,the modular reduction re-maining slower because it is not based on Montgomery’s algorithm.3.nuMONGO has a higher I/(m+R)ratio than gmp .This shows how big the over-heads in general purpose libraries are for such small inputs.2.2Lazy and Incomplete reductionLazy and incomplete modular reduction are described in [3].Here,we give a short treatment.Let p <2w be a prime,where w is a fixed integer.We consider expres-sions of the form ∑d i =1a i b i mod p with 0 a i ,b i <p .Such expressions occur in the explicit formulae for HEC.To use most modular reduction algorithms,in-cluding Montgomery’s,at the end of the summation,we have to make sure that all partial sums of ∑a i b i are smaller than p 2w .Some authors (for example [27])suggested to use small primes,to guarantee that the condition ∑a i b i <p 2w isalways satisfied.Note that [27]exploited the possibility of accumulating several partial results before reduction for the extension field arithmetic,but not at the group operation level.The use of small primes at the group operation level has been considered also in [14]after the present paper appeared as a preprint.How-ever,“just”using primes which are “small enough”would contradict one of our design principles,which is to have no restriction on p except its word length.What we do,additionally,is to ensure that the number obtained by removing the least significant w bits of any intermediate result remains <p .We do this by adding the products a i b i in succession,and checking if there has been an overflow or if the most significant half of the intermediate sum is p :if so we subtract p from the number obtained ignoring the w least significant bits of the intermediate result.If the intermediate result is 22w ,the additional bit can be stored in a carry.Since all intermediate results are bounded by p 2w +1<(p +2w )2w ,upon subtraction of p 2w the carry will always be zero.This requires as many operations as allowing intermediate results in triple precision,but less memory accesses are needed:In practice this leads to a faster approach,and at the end we have to reduce a number x bounded by p 2w ,making the modular reduction easier.This technique of course works with any classical modular reduction algo-rithm.That it works with Montgomery’s r-sidus and with REDC is a consequence of the linearity of the operator REDC modulo p.nuMONGO supports Lazy (i.e.delayed)and Incomplete (i.e.limited to the num-ber obtained by removing the least significant w bits)modular reduction.Thus,an expression of the form ∑d −1i =0a i b i mod p can be evaluated by d multiplications but only one modular reduction instead of d .A modular reduction is at least as expensive as a multiplication,and often much more,see Table 1.Remark 1.We cannot add a reduced element to an unreduced element in Mont-gomery’s representation.In fact,Montgomery’s representationbbc =bcR mod p .Hence,a and bc have been multiplied by different constants mod p to obtain b a +c bears no fixed relation to a +bc .2.3Implementation of the Explicit FormulaeWe assume that the reader is acquainted with elliptic and hyperelliptic curves.2.3.1Elliptic Curves.We consider elliptic curves defined over a field F of odd characteristic greater than 3given by a Weierstrass equationE :y 2=x 3+a 4x +a 6(1)where the polynomial x 3+a 4x +a 6has no multiple roots.The set of points of E over (any extension of)the field F and the point at infinity O form a group.There are 5different coordinate systems [9]:affine (A ),the finite points “be-ing”the pairs (x ,y )that satisfy (1);projective (P ),also called homogeneous ,where a point[X,Y,Z]corresponds to(X/Z,Y/Z)in affine coordinates;Jaco-bian(J),where a point(X,Y,Z)corresponds to(X/Z2,Y/Z3);and two variants of J,namely,Chudnowski Jacobian(J c),with coordinates(X,Y,Z,Z2,Z3),and modified Jacobian(J m),with coordinates(X,Y,Z,a4Z4).They are accurately de-scribed in[9],where the formulae for all group operations are given.It is possible to add two points in any two different coordinate systems and get a result in a third system.For example,when doing a scalar multiplication,it is a good idea to keep the base point and all precomputed points in A,since adding those points will be less expensive than using other coordinate systems.For EC,only few savings in REDC s are possible.Let us work out an example,namely,how many REDC s can be saved in the addition A+P=P.Let P1=(X1,Y1),P2=[X2,Y2,Z2]and P3=[X3,Y3,Z3]. Then,P3=P1+P2is computed as follows[9]:u=Y2−Y1Z2v=X2−X1Z2,A=u2Z2−v3−2v2X1Z2,X3=vA,Y3=u(v2X1Z2−A)−v3Y1Z2,Z3=v3Z2.For the computation of u and v no savings are possible.We cannot save any re-ductions in the computation of A=u2Z2−v3−2v2X1Z2because:We need v3 reduced anyway for Z3,A must be available also in reduced form to compute X3, and from v2X1Z2we subtract A in the computation of Y3;It is then easy to see that here no gain is obtained by delaying reduction.But Y3can be computed by first multiplying u by v2X1Z2−A,then v3by Y1Z2,adding these two products and reducing the sum.Hence,one REDC can be saved in the addition formula.For affine coordinates,no REDC s can be saved.Additions in P allow saving of 1REDC,even if one of the two points is in A.With no other addition formula we can save reductions.For all doublings we can save2REDC s,except for the dou-bling in J m,where no savings can be done due to the differences in the formulae depending on the introduction of a4Z4.In Table2,we write the operation counts of the implemented operations.Re-sults for genus2and3curves are given,too.The shorthand C1+C2=C3means that two points in the coordinate systems C1and C2are added and the result is given in C3,where any of the C i can be one of the applicable coordinate systems. Doubling a point in C1with result in C2is denoted by2C1=C2.The number of REDC s is given separately from the multiplications and squarings.2.3.2Hyperelliptic Curves.An excellent,low brow,introduction to hyper-elliptic curves is given in[28],including proofs of the facts used below.A hyperelliptic curve C of genus g over afinitefield F q of odd characteristic is defined by a Weierstrass equation y2=f(x),where f is a monic,square-free poly-nomial of degree2g+1.In general,the points on C do not form a group.Instead, the ideal class group is used,which is isomorphic to the Jacobian variety of C.Its elements are represented by pairs of polynomials and[7]showed how to compute with group elements in this form.A generic ideal class is represented by a pair ofTable2.Costs of Group Operations on EC and HECDoubling Additioncosts operation operation 2A=A I,2m,1s,3R7m,5s,10R P+P=P A+P=PEC2J=J12m,4s,16R8m,3s,11R 5m,6s,9R J c+J c=J c A+J c=J c 2J m=J m13m,6s,19R9m,5s,14R I,22m,5s,22R A+A=Ag=22P=P45m,5s,42R40m,3s,33R 34m,7s,37R N+N=N A+N=Ng=32A=A I,76(m/s),55Rpolynomials U(x)=x g+∑d−1i=0U i x i,V(x)=∑d−1i=0V i x i∈F q[x]such that for eachrootξof U(x),(ξ,V(ξ))is a point on C(equivalently,U(x)divides V(x)2−f(x)). The affine coordinates are the2g-tuple[U g−1,...,U1,U0,V g−1,...,V1,V0].2.3.2.1Genus2.For genus2there are two more coordinate systems besides affine(A):in projective coordinates(P)[30]:a quintuple[U1,U0,V1,V0,Z]cor-responds to the ideal class represented by[x2+U1/Z x+U0/Z,V1/Z x+V0/Z]; with Lange’s new coordinates(N)[24],the sextuple[U1,U0,V1,V0,Z1,Z2]corre-sponds to the ideal class[x2+U1/Z21x+U0/Z21,V1/Z31Z2x+V0/Z31Z2].The sys-tem N is important in scalar multiplications since it has the fastest doubling.We refer to[24]for the formulae.We now see in an example–the addition formula in affine coordinates–how lazy and incomplete reductions are used in practice.Table3is derived from results in[24],but restricted to the odd characteristic case.The detailed breakdown of the REDC s we can save follows:1.In Step1we can save one REDC in the computation of r,since we do not needthe reduced value of z2z3and z21u10anywhere else.2.In Step3we do not reduce w2=ı0w0,since it is used in the computation ofs 1and s 0,which are sums of products of two elements.So only3REDC s are required to implement Step3:for w3and for thefinal results of s 1and s 0.This is a saving of two REDC s.3.In Step5,it would be desirable to leave the coefficients l 1and l 0of l unre-duced,since they are used in the following two steps only in additions with other products of two elements.But l 1=u21s 0+u20is a problem:we cannot add reduced and unreduced quantities(see Remark1).We circumvent this by computing the unreduced products L1=u21s 0(in place of 1)and L0=u20s 0.Two REDC s are saved.4.In Step6,it is u30=(s 0−u11)(s 0−z1)+L1+2v21w4+(2u21+z1)w5+z2.We need only one REDC to compute the(reduced)sum of thefirst four products:Note that,at this point,L1is already known and we already counted the saving of one REDC associated to it.So,we save a total of two REDC s.Table3.Addition in genus2,deg u1=deg u2=2I NPUT:[u1,v1],[u2,v2],with deg u1=deg u2=2,and f=x5+f3x3+f2x2+f1x+f0O UTPUT:[u3,v3]=[u1,v1]+[u2,v2]N OTATION:u i=x2+u i1x+u i0and v i=v i1x+v i0Step Cost compute resultant r of u1,u2:1S,3Mcompute almost inverse of u2modulo u1(ı=ı1x+ı0=r/u2mod u1):-compute s =rs≡(v1−v2)ımod u1:5Mcompute s =x+s0/s1=x+s 0/s 1and s1:I,2S,5Mcompute l =s u2=x3+l 2x2+l 1x+l 0:2Mcompute u3=(s(l+2v2)−k)/u1:3Mcompute v3≡−(l+v2)mod u3:4M total I,3S,22M Summarizing,for one addition in affine coordinates in the most common case, we need12Mul s,13MulNoREDC s and6REDC s.Thus,we save7REDC s.We implemented addition and doubling in all coordinate systems.To speed up scalar multiplication,we also implemented addition in the cases where one of the two group elements to be added is given in A and the second summand and the result are both given either in P or N.In Table2we write the operation counts of the implemented operations.The table contains also the counts for EC and genus3curves(see the next paragraph). The number of modular reductions is always significantly smaller than the num-ber of multiplications.2.3.2.2Genus3.Affine coordinates are the only coordinate system currently available for genus3curves.The formulae in[32,33]contain some errors in odd characteristic.We took the formulae of[40]–which are for general curves of the form y2+h(x)y=f(x),and have been implemented only in even characteristic with h(x)=1–and simplified them for the case of odd characteristic,h(x)=0, and vanishing second most significant coefficient of f(x).A pleasant aspect of these formulae is that a large proportion of modular reductions can be saved:at least21in the addition and14in the doubling(see Table2).2.4Scalar MultiplicationThere are many methods for computing a scalar multiplication in a generic group,which can be used for EC and HEC.See [15]for a survey.A simple method for computing s ·D for an integer s and a ideal class D isbased on the binary representation of s .If s =∑n −1i =0s i 2i where each s i =0or 1,then n ·D can be computed assD =2(2(···2(2(s n −1D )+s n −2D )+···)+s 1D )+s 0D .(2)This requires n −1doublings and on average n /2−1additions on the curve (the first addition is replaced by an assignment).On EC and HEC,adding and subtracting an element have the same cost.Hence one can use the non adjacent form (NAF)[34],which is an expansions =∑n i =0s i 2i with s i ∈{0,±1}and s i s i +1=0.This leads to a method needing ndoublings and on average n /3−1additions or subtractions.A generalization of the NAF uses “sliding windows”:The w NAF [37,8]ofthe integer s is a representation s =∑n j =0s j 2j where the integers s j satisfy thefollowing two conditions:(i)either s j =0or s j is odd and |s j | 2w ;(ii)of any w +1consecutive coefficients s j +w ,...,s j at most one is nonzero.The 1NAF co-incides with the NAF.The w NAF has average density 1/(w +2).To compute a scalar multiplication based on the w NAF one first precomputes the ideal classes D ,3D ,...,(2w −1)D ,and then performs a double-and-add step like (2).A left-to-right recoding with the same density as the w NAF can be found in [4].3Results,Comparisons and ConclusionsTable 4reports the timings of our implementation.Since nuMONGO provides sup-port only for moduli up to 256bits,EC are tested only on fields up to that size.For genus 2curves on a 256bit field,a group up to 513bits is possible:We choose this group size as a limit also for the genus 3curves.All benchmarks were performed on a 1GHz AMD Athlon (Model 4)PC,under the Linux operating system (kernel version 2.4).The compilers used were the GNU C Compiler (gcc ),versions 2.95.3and 3.3.1and all the performance considerations made in §2.1.5apply.All groups have prime or almost prime order.The elliptic curves up to 256bits have been found by point counting on random curves,the larger ones as well as the genus 2and 3curves have been constructed with the CM method.For each combination of curve type,coordinate system and group size,we averaged the timings of several thousands scalar multiplications with random scalars,using three different recodings of the scalar:the binary representation,the NAF,and the w NAF.For the w NAF we report only the best timing and the corresponding value of w .We always keep the base ideal class and its multiples in affine coordinates,since adding an affine point to a point in any coordinate sys-。
第37卷第4期2023年8月南华大学学报(自然科学版)Journal of University of South China(Science and Technology)Vol.37No.4Aug.2023收稿日期:2023-02-12基金项目:湖南省教育厅科学研究一般项目(20C1619)作者简介:钱垂昇(1997 ),男,硕士研究生,主要从事网络信息安全方面的研究㊂E -mail:187****9871@㊂∗通信作者:王㊀彦(1971 ),男,教授,博士,主要从事智能信息处理和智能控制方面的研究㊂E -mail:wangyan5406@DOI :10.19431/ki.1673-0062.2023.04.011一种抗弱曲线故障攻击的SM2数字签名算法设计钱垂昇,曾玖贞,王㊀彦∗(南华大学电气工程学院,湖南衡阳421001)摘㊀要:SM2数字签名算法是中国版的椭圆曲线数字签名算法,尽管该算法的设计在数学理论是安全的,但在算法的具体实现时却容易遭受物理攻击㊂因此,加强SM2数字签名算法在实现过程中的抗攻击性具有重要意义㊂本文基于故障感染思想提出了一个针对SM2数字签名算法的抗故障攻击策略,通过改变算法中的标量运算操作,使得算法遭受攻击后故障将在签名过程中扩散,从而破坏攻击者利用错误签名快速检索签名私钥的条件㊂实验结果表明,此防御策略不仅可以抵御弱曲线故障攻击,还可以防御弱曲线故障和二次故障注入的结合攻击㊂此外,本文还将椭圆曲线算法中常用点检测抗故障攻击策略和本文提出的故障感染防御策略都在现场可编程逻辑列阵上实现,对两种策略的硬件面积开销㊁单次签名时间开销进行比较,结果显示,本文提出的策略在硬件性能上比基于点检测的策略更优越㊂关键词:SM2数字签名算法;弱椭圆曲线;现场可编程逻辑列阵实现;抗故障攻击中图分类号:TN918文献标志码:A 文章编号:1673-0062(2023)04-0083-07Design of SM 2Digital Signature Algorithm Against WeakCurve Fault AttackQIAN Chuisheng ,ZENG Jiuzhen ,WANG Yan ∗(School of Electrical Engineering,University of South China,Hengyang,Hunan 421001,China)Abstract :The SM2digital signature algorithm is the Chinese version of the Elliptic Curve Digital Signature Algorithm (ECDSA).Although the design of the algorithm is mathemati-cally safe,it is vulnerable to physical attacks when the algorithm is implemented.There-fore,it is of great significance to strengthen the attack resistance of SM2digital signaturealgorithm in the implementation process.Based on the idea of fault infection,this paperproposes an anti-fault attack strategy for SM2digital signature algorithm,which changes38第37卷第4期南华大学学报(自然科学版)2023年8月the scalar operation in the algorithm so that the fault will spread in the signing process afterthe algorithm is attacked,thereby destroying the conditions for attackers to quickly retrievethe signature private key by using incorrect signatures.Experimental results show that theproposed defense strategy can not only resist weak curve fault attacks but also defendagainst combined attacks of weak curve faults and secondary fault injection.In addition,we also implement the common point detection anti-fault attack strategy in elliptic curve al-gorithm and the fault infection prevention strategy proposed on Field Programmable LogicArray(FPGA)and compare the hardware area overhead and single signature timeoverhead of the two strategies,and the results show that the proposed strategy is superior tothe point detection-based strategy in hardware performance.key words:SM2digital signature algorithm;weak elliptic curve;FPGA implementation;against fault attack0㊀引㊀言随着数字化的进程全面推进,信息安全也越来越受到公众关注,加密技术则是当前解决信息安全问题的主要技术手段㊂密码算法作为加密技术的核心,一般分为加密解密㊁签名验签㊁密钥交换三种类型,其中数字签名算法(digital signature algorithm,DSA)用于确保信息传输过程中数据的真实性㊁完整性以及不可抵赖性[1]㊂自1976年Whitfield Diffie和Martin Hellman首次提出数字签名的概念,至今已经出现了许多优秀的数字签名算法㊂在众多数字签名算法中,椭圆曲线数字签名算法(elliptic curve digital signature algorithm, ECDSA)凭借同等安全水平下密钥更短㊁签名速度更快等优点而被广泛应用㊂SM2数字签名算法(SM2digital signature algorithm,SM2-DSA)是中国版的ECDSA,虽然两者的安全性都依托于椭圆曲线离散对数难题(elliptic curve discrete logarithm problem,ECDLP),但SM2-DSA的安全性要高于ECDSA[2]㊂尽管理论上SM2-DSA和ECDSA都是安全的,但如果在算法实现的物理设备中引入故障注入攻击,就极有可能导致设备的安全性失效㊂故障注入攻击是指在密码设备注入特定故障,使得设备输出故障信息,最终通过分析故障信息得到设备中使用的密钥㊂文献[3-6]中介绍了几种针对椭圆曲线密码系统(elliptic curve cryp-tography,ECC)的故障注入攻击方案㊂其中,弱曲线故障攻击利用故障注入将算法运行中使用的安全椭圆曲线变成另一条不安全的弱椭圆曲线,并通过Pohlig-Hellman算法和Pollard Rho算法破解弱椭圆曲线上的ECDLP,从而获取算法中的私钥[6]㊂为了提高密码算法在实际应用中的安全性,研究人员又进行了许多抗故障攻击的防御策略研究㊂对于在ECC算法而言,有效点检测是最常用的故障攻击防御策略㊂在算法运行时,一旦中间状态点被注入故障,该点就会脱离原来的椭圆曲线,从而影响之后一系列的点运算,所以可以在输出结果前对标量运算生成的点进行检测,如果生成的点不在原来的椭圆曲线上,则拒绝输出结果㊂文献[7]针对ECDSA提出了双重随机点检测策略,以防御弱曲线故障攻击㊂该防御策略在点检测的基础上加入了随机延迟和双重检测机制,解决了简单点检测策略在二次故障注入下失效的问题㊂文献[8]中将低成本的故障检测与恢复机制和坐标轴随机化技术相结合,提出了一个可以同时防御差分功耗攻击和差分故障攻击的低成本方案㊂当受到差分故障注入攻击时,可以利用蒙哥马利阶算法中间变量关系的不变性检测到故障,然后将标量运算跳回到上一个有效状态,从而保证故障注入失效的同时不会影响最终的正确结果㊂除了有效点验证思想,故障感染思想也是抗故障攻击研究的重点㊂文献[9]中介绍了一种基于故障感染的故障防御策略,以抵抗针对SM2-DSA的基格故障攻击㊂该策略通过故障在签名生成过程中的传播将格攻击所需要的条件破坏掉,从而保证即使发生关键信息泄露,也能够抵御格攻击㊂目前对SM2-DSA的抗故障攻击研究主要集48第37卷第4期钱垂昇等:一种抗弱曲线故障攻击的SM2数字签名算法设计2023年8月中在基格故障攻击和差分故障攻击,关于弱曲线故障攻击的防御研究还比较少㊂本文基于故障感染思想提出一个新的故障攻击防御策略,以获得一种抗弱曲线故障攻击的改良版SM2-DSA,利用故障的传播,破坏签名过程中所使用的随机标量值与签名私钥之间的关系,从而保证即使攻击者破解了弱椭圆曲线上的ECDLP 也无法获得签名私钥㊂并通过现场可编程逻辑阵列(filed pro-grammable gate array,FPGA)实现了改良后的算法硬件电路设计,对电路的防御效果进行测试㊂为了更好地验证防御策略的有效性,设计了一种硬件木马,用来模拟针对硬件电路的随机故障注入㊂同时还设计了一个基于简单点检测防御策略的算法硬件电路作为对比实验,分析对比了两种防御策略的性能和硬件实现上的开销㊂1㊀SM2数字签名算法基础1.1㊀椭圆曲线基本理论本文中对SM2-DSA 的分析都是基于素数域F p 下的椭圆曲线E (a ,b ),其Weierstrass 方程定义如下所示:E :y 2=x 3+ax +b mod p(1)其中a ,b ɪF p ,且满足4a 3+27b 2ʂ0mod p ㊂椭圆曲线E (a ,b )上的所有点和素数域F p 上的无穷远点O 构成一个椭圆曲线群E (F p ),而E (F p )上的标量运算kP =P +P + +P 为椭圆密码算法的核心㊂其中点P (x P ,y P )为E (F p )上除无穷远点O 外的任意一个点,k ɪ{1, ,p -1}㊂标量运算由点加和倍点两种点运算组成,两种基本点运算的公式如下:设(x 3,y 3)=(x 1,y 1)+(x 2,y 2),则:x 3=λ2-x 1-x 2y 3=λ(x 1-x 3)-y 1{(2)其中λ=3x 2+a 2y 1,x 1=x 2,y 1=y 2y 2-y 1x 2-x 1,其他ìîíïïïï(3)此外,对于E (F p )上所有的点P (x P ,y P )还满足:P +O =P ;-P =P (x P ,p -y P )㊂标量运算之所以成为椭圆密码算法的核心,是因为由其衍生出的ECDLP 为算法提供了数学理论安全㊂ECDLP 定义为:在E (F p )上有一个非无穷远点P ,且点P 的阶为n ㊂存在一个整数k ɪ[1, ,n -1],使得Q =kP ,那么k 就称为以P 为底Q 的离散对数㊂假设知道P 与Q ,求解满足条件的k ,则称为椭圆离散对数难题㊂当前求解ECDLP 最好的方法就是Pohlig-Hellman 算法和Pollard Rho 算法的结合算法,但其破解ECDLP 的时间复杂度仍是指数级O (q ),其中q 为n 的最大素因子㊂而在密码算法所选取的椭圆曲线点,其阶n 本身就是一个大素数㊂所以一般情况下,即使是最好的算法,也无法在有限时间内破解算法中的ECDLP㊂1.2㊀SM2数字签名算法SM2-DSA 作为中国政府发布的改良版ECDSA,其签名生成过程与签名时使用的标准椭圆曲线与ECDSA 都有所不同㊂在进行签名前,SM2-DSA 需要先进行预计算生成用户杂凑值Z A :Z A =H v (A ENTL ID A a b x G y G x A y A )(4)式中:H v 为SM3杂凑函数;ID A 为签名方A 的标识符;A ENTL 是由ID A 的比特位长度转换成的两字节值㊂SM2-DSA 的签名生成过程如算法1所示㊂1)算法1:SM2数字签名算法输入:素数域特征p ,椭圆曲线参数(a ,b ),基点G (x G ,y G )以及基点的阶n ,待签名信息M ,用户杂凑值Z A ,签名私钥d A ㊂输出:签名对(r ,s )㊂1:e =H v (Z A M );2:选择一个随机整数k ɪ[1,n -1];3:标量运算Q (x 1,y 1)=kG ;4:计算r =e +x 1mod n ㊂如果r =0或r +k =n ,跳回第2步;5:计算s =(1+d A )-1(k -rd A )mod n ㊂如果s =0,跳回第2步;6:输出签名对(r ,s )㊂SM2-DSA 硬件实现的整体架构如图1所示,自上而下划分为密码协议层㊁椭圆曲线点运算层㊁模运算层㊂本文采用了蒙哥马利阶算法实现标量运算模块,不仅适合在硬件上实现,同时也可以抵御简单功耗攻击,其具体过程如算法2所示[10]㊂同时,基于加法器实现了模运算中最重要的模乘运算,使得整个硬件设计更加地轻量化㊂58第37卷第4期南华大学学报(自然科学版)2023年8月图1㊀SM2-DSA 硬件架构Fig.1㊀SM2-DSA hardware architecture2)算法2:蒙哥马利阶算法输入:椭圆曲线上的点P (x P ,y P ),标量值k =(k n -1, ,k 1,k 0)㊂输出:Q (x Q ,y Q )=kP ㊂1:计算Q 0=P ,Q 1=2P ;2:从i =n -2到i =0遍历k i ;2.1:如果k i =1,Q 0=Q 0+Q 1,Q 1=2Q 1;2.2:否则,Q 1=Q 0+Q 1,Q 0=2Q 0;3:遍历结束,输出Q 0㊂2㊀针对SM2数字签名算法的抗弱曲线故障攻击防御策略2.1㊀弱曲线故障攻击SM2数字签名算法攻击者在SM2-DSA 执行签名生成时,向椭圆曲线参数a 的坐标值注入一段连续随机故障㊂这使得算法1中的第3步标量运算落在一条弱椭圆曲线上,并导致算法输出一个错误签名对㊂攻击者再根据错误签名对和二次剩余定理得到弱椭圆曲线参数和标量运算结果Qᶄ的坐标,进而可以破解弱椭圆曲线上的ECDLP,得到随机标量值kᶄ满足Qᶄ=kᶄGᶄ㊂但kᶄ并不是签名过程使用的真实随机标量k ,二者的关系如式(5)所示㊂k =kᶄ+inᶄ,i =0,1,2, ,n nᶄéëêêùûúú(5)式中:n 为标准椭圆曲线上基点G 的阶;nᶄ为弱椭圆曲线上基点Gᶄ的阶㊂根据算法1第6步中签名s 的生成公式,可以推导出私钥d A 的表达式,其如式(6)所示㊂显然,签名对(r ,s )是已知的,所以只需要将式(5)代入表达式中,则可以求出私钥的所有可能值㊂通常n 与nᶄ之间相差较小,所以私钥d A 的可能值个数也就很少㊂由于公钥P A 与私钥d A 之间满足关系:P A =d A G ,则可以根据这个关系快速筛选出正确的私钥值㊂d A =(k -s )(r +s )-1mod n(6)㊀㊀综上,通过上述方法就可以破解出签名私钥d A 的值㊂除了对参数a 进行故障注入外,还可以对基点G 的坐标值进行故障注入也可以得到相同的效果,具体的分析过程参见文献[6]㊂2.2㊀基于点检测的防御策略根据弱曲线故障攻击的攻击原理,遭受攻击后签名算法将在一条非标准椭圆曲线上运行,所以点检测是防御这种攻击最常用的手段㊂基于点检测思想,对SM2-DSA 的签名过程进行改进,在签名对(r ,s )输出前增加一步点检测操作㊂具体来说,由于公私钥对(P A ,d A )是在标准椭圆曲线上预先生成的,所以正常情况下根据式(6)可以得到如下等式:P A ==(k -s )(r +s )-1G(7)如果等式(7)成立,说明签名对生成和公私钥对生成使用的是同一条标准椭圆曲线,即代表签名过程中未遭受弱曲线故障攻击,可以正常输出签名对㊂否则,就代表签名过程中遭受到攻击,算法将拒绝输出签名对㊂点检测-SM2-DSA 的详细过程如算法3所示㊂算法3:点检测-SM2数字签名算法㊂输入:素数域特征p ,椭圆曲线参数(a ,b ),基点G (x G ,y G )以及基点的阶n ,待签名信息M ,用户杂凑值Z A ,签名私钥d A ,签名公钥P A ㊂输出:签名对(r ,s )㊂1:e =H v (Z A M );2:选择一个随机整数k ɪ[1,n -1];3:标量运算Q (x 1,y 1)=kG ;4:计算r =e +x 1mod n ㊂如果r =0或r +k =n ,跳回第2步;5:计算s =(1+d A )-1(k -rd A )mod n ㊂如果s =0,跳回第2步;6:标量运算D =(k -s )(r +s )-1G ;7:判断D ==P A 是否成立:7.1:如果成立,则正常输出签名对(r ,s );7.2:否则,清除签名对,拒绝输出;8:结束签名㊂从上述算法过程可以看出,基于点检测策略改进后的SM2-DSA 相比于原版的算法,主要多出了一个标量运算操作㊂此外,算法最终是通过一个条件来判断是否正常输出签名对㊂而在实际的应用中,这样的判断操作很容易因为二次故障注68第37卷第4期钱垂昇等:一种抗弱曲线故障攻击的SM2数字签名算法设计2023年8月入而被跳过[7]㊂一旦跳过条件判断将使得签名对直接输出,就意味着点检测机制已经失效㊂2.3㊀基于故障感染的防御策略从2.1节中关于SM2-DSA的弱曲线故障攻击的描述中可以知道,虽然攻击没有直接针对私钥d A,但是通过破解ECDLP可以得到弱椭圆曲线上的随机标量值kᶄ㊂再利用kᶄ与签名中使用的真实随机标量值k之间的关系㊁真实随机标量值k与私钥d A之间的关系,可以快速检索出正确私钥㊂基于故障感染思想,对原SM2-DSA进行改进,改进后的SM2-DSA遭受故障攻击后会在签名过程中不断的传播故障,最终将快速检索私钥d A 的建立条件破坏掉,从而使得攻击者即使破解出kᶄ的值也无法在有限的时间内检索出正确的私钥值㊂改进后的签名过程如算法4所示㊂算法4:故障感染-SM2数字签名算法㊂输入:素数域特征p,椭圆曲线参数(a,b),基点G(x G,y G)以及基点的阶n,待签名信息M,用户杂凑值Z A,签名私钥d A,签名公钥P A㊂输出:签名对(r,s)㊂1:e=H v(Z A M);2:选择一个随机整数kɪ[1,n-1]; 3:标量运算T(x T,y T)=kG+d A G;4:计算D(x D,y D)=T-P A;5:计算r=e+x D mod n㊂如果r=0或r+k=n,跳回第2步;6:计算s=(1+d A)-1(k-rd A)mod n㊂如果s=0,跳回第2步;7:输出签名对(r,s)㊂当算法正常运行时,签名过程中使用的椭圆曲线和生成公私钥对所使用的是同一条标准椭圆曲线㊂根据公私钥之间的关系,算法中的第3㊁4步相当于式(8)㊂将式(8)代替第3㊁4步,此时的签名过程与原版SM2-DSA的签名过程是一样的,输出的即为正确的签名对㊂D(x D,y D)=kG+d A G-d A G=kG㊂(8)㊀㊀如果算法运行时遭受到弱曲线故障攻击,则算法4中第3步的标量运算就会在一条故障椭圆曲线上执行,这就意味着T和P A不在同一椭圆曲线上㊂那么即使此时算法中的第4步在实际计算中依旧使用了式(2)和式(3)所示的运算法则,但最终的运算结果Dᶄ(x Dᶄ,y Dᶄ)已经不是故障椭圆曲线上的点㊂如果攻击者想继续利用弱曲线的特性进行故障分析,那么就只能恢复故障曲线上的点Tᶄ(x Tᶄ,y Tᶄ)的坐标,然后通过破解ECDLP找到满足Tᶄ(x Tᶄ,y Tᶄ)=kᶄT Gᶄ的随机标量值kᶄT㊂假设攻击者成功破解ECDLP并得到的kᶄT的值㊂但是由算法4中第3步的标量运算可知,kᶄT与真实随机标量值k㊁私钥d A都有关系,具体表现如式(9)所示㊂其中nᶄ为故障椭圆曲线上基点Gᶄ的阶,n为标准椭圆曲线上基点G的阶,真实随机标量值k 和私钥值d A是两个未知的随机数㊂通过式(9)变换可以得到真实随机标量值的表达式k=kᶄT+ inᶄ-d A,或是私钥的表达式d A=kᶄT+inᶄ-k㊂无论是计算真实随机标量值k还是私钥值d A,表达式的右边都有一个未知的随机数㊂在这种情况下,没有比穷举法更快的方式可以检索到唯一正确的随机标量值k或私钥值d A㊂目前签名中所使用的随机标量值k和私钥值d A一般都是256bit的大数,而想要通过穷举法在有限的时间内完成一次对某个256bit未知数的检索,以目前的计算水平是无法做到的㊂并且从表达式来看,即使在最好的情况下(即表达式中i=0时)也要进行一次检索㊂而在最坏的情况下,要进行2nnᶄéëêêùûúú+1次检索才能达到目的㊂所以基于故障感染策略改进后的SM2-DSA,即使在遭受弱曲线故障攻击后故障曲线上的ECDLP被破解的情况下,攻击者也无法获取签名私钥d A㊂kᶄT+inᶄ=k+d A,i=0,1,2,3, ,2n nᶄéëêêùûúú㊂(9)实际上,算法4中第3步的两次标量运算可以合并为一次标量运算T(x T,y T)=(k+d A)G㊂即相比于原版的SM2-DSA,基于故障感染策略改进后的SM2-DSA在签名过程中只多了一次点加运算㊂并且,相对于点检测策略容易因二次故障注入而跳过输出前的判断操作,最终造成防御失效来说,故障感染策略并没有这样明显的缺陷㊂3㊀实验分析本文中共设置了三组对比实验,包括无防御策略组㊁点检测防御组㊁故障感染防御组㊂每一组都使用Verilog硬件语言完成电路设计,然后用Modelsim工具进行电路功能仿真验证,最后基于FPGA芯片对三组电路进行综合实现和性能评估㊂整个实验都在3.2GHz㊁4核CPU的个人电脑上进行㊂实验中所有SM2-DSA电路在进行仿真验证时都采用同一条官方推荐的256bit标准椭圆曲线[11]㊂78第37卷第4期南华大学学报(自然科学版)2023年8月硬件木马是攻击者在芯片设计或制造时植入的微小电路,当该电路在某种条件下被激活就会导致芯片功能发生改变或芯片的内部信息泄露[12]㊂基于这种特性,硬件木马可以很好和针对硬件的故障攻击相结合,基于故障攻击的硬件木马结构如图2所示㊂实验中设计的硬件木马激活后有两种状态,第一种状态下可以向椭圆曲线参数注入起始位置随机的㊁连续L bit 的翻转故障,第二种状态下可以向椭圆曲线参数注入指定起始位置的㊁连续L bit 的特定翻转故障(L =1㊁2㊁4㊁8㊁16)㊂通过在首次故障注入后再屏蔽签名输出前的判断操作来模拟点检测SM2-DSA 电路上的二次故障注入,同时在故障感染SM2-DSA 电路上采用首次故障注入后随机激活硬件木马,再注入一次故障来模拟二次故障注入㊂图2㊀基于故障攻击的硬件木马示意图Fig.2㊀Schematic diagram of a hardware Trojanbased on a fault attack根据弱曲线故障攻击的攻击原理可知,并不是故障攻击后得到的每一条故障曲线都是弱椭圆曲线㊂只有当故障曲线上基点Gᶄ(x Gᶄ,y Gᶄ)的阶nᶄ具有一个较小的最大素因子q 时,才能保证故障曲线上的ECDLP 被成功破解㊂以当前的计算水平,当最大素因子q 满足0.886q <2100时,该椭圆曲线就可以被定义为弱椭圆曲线[13]㊂考虑到本实验中电脑的实际算力,先对无防御策略SM2-DSA 电路进行多次故障注入模拟,然后恢复出故障曲线参数并对故障曲线上基点Gᶄ(x Gᶄ,y Gᶄ)的阶nᶄ进行分析,选取出满足最大素因子q 的比特长度l q <70的故障曲线㊂将这些故障曲线对应的故障注入位置和翻转情况一一记录,然后根据所记录的情况,用硬件木马电路的第二种激活状态对点检测SM2-DSA 电路和故障感染SM2-DSA 电路注入特定故障,使得故障注入后的签名过程使用上述对应的l q <70的故障曲线㊂对三组电路进行的弱曲线故障攻击的情况如表1所示,其中l q 为故障攻击后得到故障曲线上基点的阶的最大素因子的比特长度, - 代表在无防御策略的SM2-DSA 电路上并未进行弱曲线故障注入与二次故障注入结合攻击的实验㊂从表1中可以看出,无任何防御策略的原始SM2-DSA 电路易遭受弱曲线故障攻击威胁㊂虽然点检测SM2-DSA 电路可以防御单次的弱曲线故障攻击,但如果攻击者通过二次故障注入跳过输出前的判断操作,则点检测SM2-DSA 电路对弱曲线故障攻击的防御就会失效㊂而对于故障感染SM2-DSA 电路,无论是单次的弱曲线故障攻击还是与二次故障注入结合的弱曲线故障攻击,都可以成功防御㊂表1㊀三组SM2-DSA 电路上弱曲线故障攻击情况Table 1㊀Weak curve fault attack on three sets of SM2-DSA circuitsl q 弱曲线故障攻击无防御策略组点检测防御组故障感染防御组弱曲线故障攻击+二次故障注入攻击无防御策略组点检测防御组故障感染防御组46攻击成功攻击失败攻击失败-攻击成功攻击失败52攻击成功攻击失败攻击失败-攻击成功攻击失败53攻击成功攻击失败攻击失败-攻击成功攻击失败56攻击成功攻击失败攻击失败-攻击成功攻击失败59攻击成功攻击失败攻击失败-攻击成功攻击失败61攻击成功攻击失败攻击失败-攻击成功攻击失败63攻击成功攻击失败攻击失败-攻击成功攻击失败㊀㊀实验最后在Vivado 平台上基于Xilinx Kintex-7系列的xc7k325芯片对三组电路进行综合实现㊂设计点检测SM2-DSA 电路时,将点检测操作中的标量运算和签名生成中的标量运算复用了同一个88第37卷第4期钱垂昇等:一种抗弱曲线故障攻击的SM2数字签名算法设计2023年8月标量运算器㊂从表2可以看出,相对于无防御策略的SM2-DSA电路,点检测SM2-DSA在面积开销上额外多出了15.7%的查找表(look-up table,LUT)和13.1%的触发器(flip-flop,FF),但在时间上点检测SM2-DSA电路却额外多出了94.0%的开销,而故障感染SM2-DSA在签名过程中只比原版算法多出了一个点加操作,所以硬件实现后,故障感染SM2-DSA电路在面积开销上比无防御策略的SM2-DSA电路额外多出了7.4%的LUT和3.4%的触发器,在时间上额外多出了0.84%的开销㊂综上所述,基于故障感染的改良版SM2-DSA电路不仅可以防御弱曲线故障攻击,还可以防御弱曲线故障与二次故障注入的结合攻击㊂同时其在硬件的面积开销和时间开销上也比一般的点检测SM2-DSA电路表现地更出色㊂表2㊀本文三组SM2-DSA电路基于FPGA硬件实现的性能比较Table2㊀Performance comparison of three sets ofSM2-DSA circuits based on FPGAhardware implementations实验组时钟频率/MHz LUT flip-flop单次签名生成时间/ms无防御策略组100202328039 3.418点检测防御组100234199093 6.631故障感染防御组100217288313 3.447 4㊀结㊀论本文针对SM2-DSA算法提出了一个基于故障感染思想的抗弱曲线故障攻击的防御策略,根据该防御策略对SM2-DSA硬件电路进行改进,并在改进后的电路上模拟故障注入攻击仿真实验㊂仿真结果表明,基于故障感染的策略可以同时抵御弱曲线故障攻击和二次故障注入攻击㊂此外,基于故障感染策略的电路和基于一般点检测策略的电路㊁无防御策略的电路一同在Kintex-7FPGA 上进行性能分析,分析结果表明,故障感染防御策略不仅在硬件额外面积开销上比一般点检测策略少了大约一半,在单次签名生成时间上也只增加了0.84%,远低于点检测策略增加的94.0%㊂参考文献:[1]程朝辉.数字签名技术概览[J].信息安全与通信保密,2020(7):48-62.[2]孙荣燕,蔡昌曙,等.国密SM2数字签名算法与ECDSA算法对比分析研究[J].网络安全技术与应用,2013(2):60-62.[3]RUSSON A.Differential fault attack on montgomery ladder and in the presence of scalar Randomization[C]//Progress in Cryptology-INDOCRYPT2021:22nd International Con-ference on Cryptology in India.Jaipur,India:Springer In-ternational Publishing,2021:287-310.[4]CAO W Q,SHI H S,CHEN H,et ttice-based weak curve fault attack on ECDSA[C]//ICT Systems Security and Privacy Protection:36th IFIP TC11International Conference,SEC2021.Cham,Switzerland:Springer In-ternational Publishing,2021:146-161.[5]TAKAHASHI A,TIBOUCHI M.Degenerate fault attacks on elliptic curve parameters in OpenSSL[C]//2019IEEE European Symposium on Security and Privacy(Euro S& P).Stockholm,Sweden:IEEE,2019:371-386. [6]QIAN C S,WANG Y,WANG M H,et al.Security analysis of SM2signature algorithm based on fault attack[C]// Second International Symposium on Computer Technology and Information Science(ISCTIS2022).Guilin,China: SPIE,2022,12474:1247402.[7]张宇.基于ECDSA的故障攻击研究[D].西安:西安电子科技大学,2014:29-34.[8]喻潇.抗旁路攻击的双域ECC标量乘电路研究与设计[D].南京:南京航空航天大学,2020:58-60. [9]CAO W Q,FENG J Y,ZHU S F,et al.Practical lattice-based fault attack and countermeasure on SM2signature algorithm[C]//Information and Communications Security: 17th International Conference,ICICS2015.Beijing,China: Springer International Publishing,2016:62-70. [10]CHAKRABORTY A,BHATTACHARYA S,DIXIT T H,et al.Template attack on SPA and FA resistant imple-mentation of montgomery ladder[J].IET information se-curity,2016,10(5):245-251.[11]IT security techniques.Digital signatures with appendixPart3:Discrete logarithm based mechanisms:ISO/IEC14888-3:2006[S].Geneva,Switzerland:ISO(Interna-tional Organization for Standardization),2018:155.[12]黄钊,王泉,杨鹏飞.硬件木马:关键问题研究进展及新动向[J].计算机学报,2019,42(5):993-1017.[13]MARTÍNEZ V G,GONZALEZ-MANIANO L,MUÑOZA M,et al.Secure elliptic curves in cryptogrophy[J].Computer and network security essentials,2018(1):283-298.98。
一种高效安全的椭圆曲线标量乘算法陈熹;祝跃飞【摘要】Most secure Elliptic Curve Scalar Multiplication(ECSM) algorithms based on Point Verification(PV) and Coherency Check(CC) have low efficiency. Aiming at the problem, this paper proposes a new secure algorithm based on ternary representation and proves its correctness. The analysis about its efficiency in the affine coordinates and Jacobian coordinates is presented, whose result shows that the computational efficiency is improved while guaranteeing the security.%基于点验证和基于一致性检测的椭圆曲线标量乘安全算法一般运算效率低下.为此,通过对错误探测方法进行改进,提出一种基于三进制的椭圆曲线标量乘算法,给出算法的正确性证明,并在仿射坐标和Jacobian坐标下对其进行分析,结果表明,在保证安全性的前提下,该算法的效率有较大提高.【期刊名称】《计算机工程》【年(卷),期】2012(038)018【总页数】4页(P103-106)【关键词】点验证;一致性检测;椭圆曲线标量乘;错误分析攻击;三进制表示;仿射坐标;Jacobian坐标【作者】陈熹;祝跃飞【作者单位】解放军信息工程大学信息工程学院,郑州450002;解放军信息工程大学信息工程学院,郑州450002【正文语种】中文【中图分类】TP301.61 概述椭圆曲线密码体制(Elliptic Curve Cryptosystem, ECC)是由文献[1-2]分别提出的,而 ECC独特的优势使其成为公钥密码学研究的热点之一。
A Fa s t Alg o rit h m o f S c a la rMu lt ip lic a t io n B a s e d o n S id e-Ch a n n e l At o m ic it yHa o Y ujie1,Yin Shi21School of Computer Science and Engineering,Univers ity of Electronic Science and Technology of China, Chengdu611731,P.R.China2School of Mathematical Sciences,University of Electronic Science and T echnology of China,Chengdu 611731,P.R.ChinaAb str act:Simple power analysis is the most dev astating attack o n the secu rity of elliptic cur ve scalar multiplication and can prob ably retrieve the secret key.In this paper,we analyze the formulas of point doubling and addition on Jacobi-quartic Curve in projective coordination. In addition,a fast and secure side-channel atomic scalar multiplication algorithm is proposed using the side-channel atomic pared with the previous methods,the new algorithm is more ef cient.For192bits scalar using NAF recoding, the ef ciency of the new algorithm is increased by about6.7%~23%if S/M=0.8or12.7%~33.2%if S/M=0.6.Key wor ds:jacobi-quartic curve;scalar multipli-cation;simple po wer analysis;side-ch ann el atomicityI.INTRODUCTIONAfter Elliptic Curve Cryptography(ECC)being proposed independently by Miller[1]and Koblitz [2],it has become the focus of intensive research in various areas.The basic operation of ECC systems is multiplication of kP,wher e k is an in teg er and P is an elliptic curve point.This operation is called scalar multiplication,which dominates the execution time of elliptic curve cryptographic schemes,such as EC-DH,EC-NR and EC-DSA [3].For the elliptic curve with Weierstrass form,the complexities of point doubling and addition are quite different.And their ratio is between1:1.3and 1:2.3[4].In this case,the cryptographic algorithms could be easily attacked by simple power analysis (SPA),ther ef ore som e sp ecial strategies ar e needed.To deal with SPA,a lot of methods have been proposed(such as Montgomer y Ladd er [5-6]).But these methods usually take the expense of increasing computatio n cost.Side-channel atom icity method,pro pos ed by Ch ev allier-Mames[7]dissolves point operations into small homogenous blocks,known as atomic blocks, which are made sufficiently small to make this approach inexpensive.It is efficient and suitable论文集锦for all scalar multiplication algorithms.And it just works on the premise that the cost of square equals to multiplication.In software implementations,timing and power consumption have been shown to be S/M=0.8or 0.6[8],making these operations directly distinguishable through power analysis.Therefore,Liu [9]proposed an improved side-channel atomicity:S-N-A-M-N-A-A (Squaring -Negation-Addition -Multiplication -Negation -Addition -Addition)to solve this problem.Fo l lo win g th ese id eas,w e p ro p o se a n enhanced scalar multiplication algorithm using a new atom ic b lo ck s fo rm A-M-N-S-N-A-A (Addition -Multiplication -Negation –Squaring -Neg ation -Addition -Ad dition)wh ere the group operations of point doubling and addition are on Jacobi-quartic curve [10].II.BACKGROUNDA.Side-Channel AttackSide-C han nel Attack (SCA)[11]reveals the secret information by sampling and analyzing the side-channel information like timing and power consumption.SCA can be divided into two classes of SPA and DPA (differential power attacks).Simple power attacks use information from one observation to break the secret.Generally speaking,th e m ethods resistin g SPA can be transformed into resisting DPA by randomization [12].For concision,this paper mainly discusses the SPA resisting method.De nition[7]:Given a set of processes {Π0,Π1,…,Πn },a common side-channel atomic block Γfor Π0,Π1,…,Πn is a side-channel equivalent sequence of instructions so that each process Πj (0≤j ≤n)can be expressed as the repetition of this block Γ.That is to s ay,ato mic b lock s can no t be distinguished from one another through simple-channel analysis because each one contains the same pattern of basic f ield operations.In [7],an atomic block M-A-N-A (field multiplication,ad dition ,neg ation,addition)is pr opos ed tomake the power curve cyclical.This resistance method is suitable for most of the cryptographic algor ithms,in cluding vario us form s of po int multiplication algorithm (binary algorithm,the window method,etc.)[7-13].B.Jacobi-quar tic Cur veLet K be a eld with char(K)≠2,3.An elliptic curve in Jacobi-quartic form is de ned by E(K)[14]:24221y x ax =++,where 210a ≠(1)The group operations are as follows.Let P 1=(x 1,y 1)and P 2=(x 2,y 2)be the points on the Jacobi-quartic curve,P 3=(x 3,y 3)=P 1+P 2,then we can get:122132212222212121212123222121(1)(2)2()(1)x y x y x x x x x y y ax x x x x x y x x +=++++=(2)The identity element is the point (0,1)and -P 1=(-x 1,y 1).It should be noted that the equation (2)still holds when P l =P 2.That is to say,the dou bling and ad dition o per ations on Jacobi-quartic curve are identical.C.Point Doub ling and Addition For mulas on J acob i-qu ar tic Cur veIn projective coordinates,(X :Y :Z)representsthe point (X/Z,Y /Z 2)on the curve.The group operations in affine coordinates involve finite field inversion,which is a very costly operation (the cost of one inversion over f inite f ields is 100times of the cost of field m ultiplication [8]).To avoid these inversions,we use projective coordinates to represent the Jacobi-quartic curve.By replacing (x,y)=(X/Z,Y /Z 2),formula (1)in affine coordinates can be transformed into the one in projective coordinates as follows:242242Y X aX Z Z =++,where 210a ≠(3)(1)Point DoublingI n 2007,Cater [10]gave a new f ast point doubling f ormula of Jaco bi-qu ar tic curv e on projective coordinates to improve the efficiency.Let P 1=(X 1,Y 1,Z 1),P 3=(X 3,Y 3,Z 3)be two points on the Jacobi-quartics curve,and P 3=2P 1.The point P 3can be computed as follows:2223111111()()X Y X Z Y X Z =++222231111()()Z X Z X Z =+222223111332(())()Y Y X Z X Z =++To make these formulas suitable for our scalar multiplication algorithm based on side-channel atomicity,we use the modified coo rdinates (X:Y :Z:X 2:Z 2:XZ).The operations can be defined as fo llows.We d enote 211()A X Z =+,1B YA =,22111()C Y X Z =+a n d l et 23D X =,23E Z =,33F X Z =.Then we can get 3X C B =222231111()()Z X Z X Z =+232()Y C D E =+Finally,(X 3:Y 3:Z 3:D:E:F)is obtained.(2)Point AdditionIn 2007,Duquesne,et al.[15]proposed a mixed-addition formula of Jacobi-quartic curve to reduce the number of point additions to 1M+1S.And at the same year,Hisil optimized the point addition operation and reduce the cost to 7M+3S [10].If two points P 1=(X 1,Y 1,Z 1)and P 2=(X 2,Y 2,Z 2)are in projective coordinates,then the addition of P 3=(X 3,Y 3,Z 3),P 3=P 1+P 2is computed as:3121212X X Y Z X YZ =+222232211Z Z Z X X =222222231212112212112233()(()())Y X X Z Z X Z X Z YY X Z X Z X Z =+++++If we use the redundant representation (X:Y :Z:X 2:Z 2:XZ),the operations can be defined as follows.3111222()()X X Z Y X Z Y C D =++3Z B A=3()Y GH E F =+Where 2212A X X =,2212B Z Z =,1122C X Z X Z =,12D YY =,2G A B C =++,22221122()()H X Z X Z D C =++++,23E X =,23F Z =,33J X Z =+,2()/2W J H =.After that,(X 3:Y 3:Z 3:E:F:W )is obtained.III.FAST ALGOR ITHM OF SCALAR MULTIP LIC ATION BASE D ON S IDE -CHANNEL ATOMICITYIn this section,we divide po int doubling andaddition of Jacobi-quartic cur ve on projectivecoordinates into atomic blocks.The form of atomic blocks is A-M-N-S-N-A-A and each block contains 1M+1S+3A.The atomic blocks are presented in algorithm 1and algorithm 2respectively,and the scalar multiplication algorithm based on side-channel atomicity are designed.Algorithm 1:Atomic Blocks Algorithm for Point Doubling inp ut:P 1=(X 1,Y 1,Z 1,X 12,Z 12,X 1Z 1)outpu t:2P 1=(X 3,Y 3,Z 3,X 32,Z 32,X 3Z 3)Put T 1←X 1,T 2←Y 1,T 3←Z 1,T 4←X 12,T 5←Z 12,T 6←X 1Z 11:T 8←T 4+T 5,T 9←T 2*T 8,T 5←-T 5,T 10←T 92,Δ,T 1←T 1+T 3,T 4←T 4+T 52:T 10←T 10+T 10,T 3←T 4*T 8,Δ,T 1←T 12,T 1←-T 1,Δ,Δ3:Δ,T 2←T 2*T 1,Δ,T 5←T 32,T 7←-T 5,T 1←T 2+T 9,T 9←T 10+T 74:Δ,T 6←T 1*T 3,Δ,T 4←T 12,T 8←-T 4,T 2←T 8+T 9,ΔReturn(T 1,T 2,T 3,T 4,T 5,T 6)Since Algorithm 1contains 4atomic blocks,adoubling costs 4M+4S+12A.Algorithm 2:Atomic Blocks Algorithm for Point Additioninp ut:P 1=(X 1,Y 1,Z 1,X 12,Z 12,X 1Z 1),P 2=(X 2,Y 2,Z 2,X 22,Z 22,X 2Z 2),T 0=1/2outpu t:P 1+P 2=(X 3,Y 3,Z 3,X 32,Z 32,X 3Z 3)Put T 1←X 1,T 2←Y 1,T 3←Z 1,T 4←X 12,T 5←Z 12,T 6←X 1Z 1,T 7←X 2,T 8←Y 2,T 9←Z 2,T 10←X 22,T 11←Z 22,T 12←X 2Z 21:T 13←T 6+T 2,T 14←T 4*T 10,T 7←-T 14,Δ,Δ,T 1←T 12+T 8,T 4←T 4+T 52:T 10←T 10+T 11,T 11←T 5*T 11,Δ,Δ,Δ,T 3←T 7+T 11,T 14←T 11+T 143:Δ,T 6←T 6*T 12,Δ,T 5←T 32,T 11←-T 5,T 7←T 6+T 6,T 14←T 14+T 74:Δ,T 2←T 2*T 8,Δ,Δ,Δ,T 6←T 2+T 6,Δ5:Δ,T 13←T 1*T 13,T 8←-T 6,Δ,Δ,T 1←T 8+T 13,Δ6:T 2←T 1+T 3,T 7←T 4*T 10,Δ,T 4←T 12,T 10←-T 4,T 10←T 10+T 11,T 7←T 7+T 67:Δ,T 14←T 14*T 7,Δ,T 9←T 22,T 7←-T 7,T 2←T 14+T 10,T 6←T 7+T 98:Δ,T 6←T 0*T 6,Δ,Δ,Δ,Δ,ΔReturn(T 1,T 2,T 3,T 4,T 5,T 6)论文集锦Since Algorithm2contains8atomic blocks,anaddition costs8M+8S+24A.From algor ithm1and algorithm2,we canfind that atomic doubling and addition consistof a ser ies of iden tical atom ic blocks.Thescalar multiplication algorithm based on side-channel atomicity can be implemented by usingthe method proposed in[7].First we give a querymatrix which can control the correctly executingsequence of each atomic block.The matrix iscomposed of subscripts of the atom variables.In the algorithm,‘Δ’is the redundant operationwhich can be implemented by selecting someregisters which do not affect the results of thealgorithm.Each atom has18variables,and pointdoubling and addition operation contain a totalof12atomic blocks,so we form a12×18matrixum,n(0≤m≤11,0≤n≤17).,8459285510977113445 101010348771111777777 7772217753751299107 777613774184298777 13621441071411111128445 101011115113333333117141114 5556612555311576614147 777228888888626888 888131138611111118222 2137410m nu=131341104101011776 6661414766927721410697777606777777777777 Th e f ir st4r ows represent a dou bling an d th e last8rows r epr esent an addition.In the computation of scalar multiplication,a doubling is always followed by an addition or doubling,but an addition is always followed by a doubling.So it needs to compute12atomic blocks(when the bit is1)or just4atomic blocks(when the bit is 0)each time.Therefore,a control condition is needed to ensure the atomic blocks are executed corr ectly accord ing to the bit number.Scalar multiplication alg orithm using sid e-channel atomicity on Jacobi-quartic is as follows:Algo rithm3:Scalar Mu ltip licatio n Based on Sid e-Ch annel Atomicityin pu t:P1=(X1,Y1,Z1,X12,Z12,X1Z1)=(X2,Y2,Z2,X22,Z22,X2Z2),t,Scalar k=(1,kt-2,…,k),Matrix um,n,T=1/2.ou tp ut:Q=k P.1:Put T1=X1,T2=Y1,T3=Z1,T4=X12,T5=Z12,T6=X1Z1,T7=X2,T8=Y2,T9=Z2,T10=X22,T11=Z22,T12=X2Z2;2:i=t-2,s=1;3:while(i≥0)do3.1:m=(s)(m+1);3.2:s=ki(m div11)+(i k)(mdiv3);3.3:,0,1,2m m mu u uT T T=+,,3,4,5*m m mu u uT T T=,,6,7m mu uT T=,,8,92m mu uT T=, ,10,11m u muT T=,,12,13,14m m mu u uT T T=+,,15,16,17m m mu u uT T T=+;3.4:i=i–s;4:return(T1,T2,T3,T4,T5,T6).IV.PERFORMANCE ANALYZEIn this paper,we use the atomic blocks with the f orm A-M-N-S-N-A-A,and the computation time cost for each block is1M+1S+3A.From algorithm1and algorithm2,we can see that the costs of point doubling and addition on Jacobi-quartic curve based on this atomic blocks ar e 4M+4S+12A and8M+8S+24A respectively.In the method by Chevallier-Mames using atomic blocks M-A-N-A,po in t dou blin g and addition cost 10M+20A and16M+32A respectively in Jacobian coordinates.Wh ile extended to the Modified Jacobian coordinates,the costs are reduced to 8M+16A and14M+18A.Mishra improved the method of Chevallier-Mames,and reduced the computing cost of point addition to11M+22A in Jacobian coordinates.Then in2008,Longa [16]decreased the cost of point doubling and addition to8M+12A and12M+18A by using the atomic blocks M-N-A-M-N-A-A.The costs of above methods are compared in Table I.Table I The Cost of Point Doubling and Addition withDifferent Atomic BlocksAlgorithm Doubling AdditionChevallier-Mames(J m)8M+16A14M+28A Mishra(J)10M+20A11M+22ALonga8M+12A12M+18AThis Paper4M+4S+12A8M+8S+24A As we can see from the Table I,our method signif icantly reduces the cost co mpared with the previous m eth ods using M-A-N-A atomic b lock s,.And it is n oticeab le that even weequivalent square to multiplication(1S=1M),the efficiency of point doubling using A-M-N-S-N-A-A is equal to the fastest method proposed by Longa.And compared with Chevallier-Mames, its cost can be cut do wn by2M+8A.For the point addition,although it costs7M more than the method of Mishra,it costs8A less than Chevallier-Mames.The point addition only occurs when the bits ar e non-zero in scalar multiplication. Since usually the non-zer o bits in the scalar representation are sparse,we can choose a proper scalar representation to minimize the point addition disadvantage as much as possible.To make more comparison of the methods,we use the scalar k with the length192bits.And the representation Non-Adjacent Form(NAF)[17]is used in the traditional scalar multiplication.NAF has the fewest nonzero bits among all the signed binary representations of k.The average density of nonzero bits among all NAF of length l is approximately l/3.So,the computation of the scalar multiplication requires192point doubling s and 64point additions.In hardware implementation for the scalar multiplication,1S=1M.But in many cases of software implementation,1S=0.8M, and1S=0.6M,1A=0.05M[8].Table II shows the total computation costs of all the methods for scalar multiplication.Table II Experimental results of different methods Algorithm S/M=1S/M=0.8S/M=0.6 Chevallier-Mames3238.4M3238.4M3238.4M Chevallier-Mames(J m)2675.2M2675.2M2675.2M Mishra(J)2886.4M2886.4M2886.4MLonga2476.8M2476.8M2476.8MThis Paper2752M2496M2163.2MIt can be seen from Table III that the speed of our algo rith m is close to that of th e b est algorithm Longa when S/M=0.8.According to our calculations the improvement for S/M=0.8 is23%against Chevallier-Mames method and 10%against Chevallier-Mames method under Modif ied Jaco bian coord inates.While with S/M=0.6,our algo rithm achieves the highest efficiency among above mentioned methods.It improvement is12.7%against Shamir’s trick, and is about19.1%~33.2%against the method by Chevallier-Mames,Mishra.500100015002000250030003500CM CM(Jm)M(J)L This PaperS/M=1S/M=0.8S/M=0.6Fig.1Cost with Different S/MIn Figu re1,CM,CM(Jm),M(J)and L ar e separately represented the algorithms of Chevallier-Mames,Chevallier-Mames(J m),Mishra(J)and Longa respectively.From Figure1,we can see that with the decrease of S/M,the ef ciency of our algorithm is increased.V.CONCLUSIONSIn this paper,we present a fast algorithm for scalar multiplication based on side-channel atomicity (A-M-N-S-N-A-A).With this,novel solutions towards resistance against side-channel attacks were presented.The efficiency of our algorithm is greatly improved compared with the previous methods.In addition,the point doubling and point addition operation using the new side-channel atomicity is suitable for a wider range of scalar multiplication.For example,if using the sliding window algorithm,the algorithm’s ef ciency will increase greatly.Our future research work will focus on parallelizing the algorithm and designing security guaranteed faster scalar multiplication methods.Acknowledg ementsThis paper is sponsored and nancial supported by National Natural Science Foundation of China(NSFC),grant No.论文集锦61003121and Sichuan Province High Technology Program under No.2009CD00014.References[1]MILLER e of Elliptic Curves in Cryptography[C]//Proceedings of Advances in Cryptology-CRYPTO’85, LNCS,1986,218(483):417–426.[2]KOBLIT Z N.El lip ti c Curv e Cryp tos yst ems[J].Mathematics of Computation.1987,48:203–209.[3]LIU Duo,DAI Yiqi.A New Algorithm of Elliptic CurveMulti-Scalar Multiplication[J].Chin ese Journal of Computers.2008,31(7):1131–1137.[4]Institute of Electrical and Electronics,Inc.IEEE P1363/D9Standard Speci cations for Public-key Cryptography[S].New Y ork,USA,2001.[5]MONTEGOMERY P L.Speeding the Pollard and EllipticCurve Methods of Factorization[J].Mathematics of Computation,1987,48(177):243–264.[6]WANG Zhen,FAN Shuqin.New Montgomery-basedSemi-systolic Multiplier for Even-type GNB of GF(2^m) [EB/OL]./2010/218.[7]CHEVALLIER-MAMES B,CIET M,JOYE M.Low-Cost Solutions for Preventing Simple Side-Channel Analysis:Side-Channel Atomicity[J].IEEE Transactions on Computers,2004,53(6):760–768.[8]LIM C H,HWANG H S.Fast Implementation of EllipticCurveArithmetic in GF(pm)[C]//Proceedings of PKC’00, Springer,2001:405–421.[9]LIU Shuanggen.Study of Fas t and Secure ScalarMultiplication Algorithms on Elliptic Curves(in Chinese)[D].Xi’an:Xidian University in Candidacy,2008:35.[10]GAUDRY.Explicitformulas Database[DB/OL].http:///EFD.[11]DYN N,LEVIN D,GREGORY J.A Butter y SubdivisionScheme for Surface Interpolation with Tension Control[J].ACM Trans Graph,1990,(9):160–169.[12]CORON J S.Resistance against Differential PowerAnalys is for E llip tic Cu rv e Crypto syst ems[C]// Proceedings of the Workshop on Cryptographic Hardwareand Embedded Systems-CHES1999.Berlin:Springer, 1999:292–302.[13]MISHRA P K.Pipelined Computation of ScalarMultiplication in Elliptic Curve Cryptosystems[J].IEEE Transactions on Computers,2006,25(8):1000–1010. [14]BILLET O,JOYE M.The Jacobi Model of an EllipticCurve and Side-channel Analysis[C]//Proceedings of AAECC-15,2003,2643:34–42.[15]DUQUESNE S.Improving the Arithmetic of EllipticCurves in the Jacobi Model[J].Information Processing Letters,2007,104(3):101–105.[16]LONGA P,MIRI A.New Multibase Non-Adjacent FormScalar Multiplication and its Application to Elliptic Curve Cryptosystems(Extended Version)[EB/OL].http://eprint./2008/052.[17]SIL VERMAN J.The Arithmetic of Elliptic Curves[M].Springer-V erlag,1986.BiographiesHao Y ujie,received her Master of Engineeringdegree of Computer Application in July2005atNorthern Jiaotong University.She is currentlya PhD candidate and professor at School of Computer Science and Engineering,University of Electronic Science and Technology of China.Her research interests include Computer Network Technology and Applications, Network and Information System Security.E-mail:haoyujie@ Y in Shi,will receive his Bachelor of Sciencedeg ree in In format ion an d Compu tati onScience in July2011at University of ElectronicScience and Technology of China(UESTC). He is currently a fourth-year undergraduate in School of Mathematical Sciences,UESTC.His research interests include wireless network security,human computer interaction, computer graphics,an d artificial i ntelligence.E-mail:justlyn@。