Hardware Implementation of Residue Multipliers based Signed RNS Processor for Cryptosystems

The Residue Number System (RNS) represents large integers as smaller residues using moduli sets to enhance the performance of digital cryptosystems. A parallel Signed Residue Multiplication (SRM) algorithm and a VLSI parallel array architecture capable of handling signed input numbers are proposed for balanced (2^n-1, 2^n, 2^n+1) and unbalanced (2^k-1, 2^k, 2^k+1) word-length moduli. The balanced 2^n-1 SRM is used as a reference to design the unbalanced 2^k-1 and 2^k+1 SRMs. The synthesized results show that the proposed 2^n-1 SRM architecture achieves improvements of 17% in area, 26% in speed, and 24% in Power Delay Product (PDP) compared to the Modified Booth Encoded (MBE) architectures discussed in the literature review. The proposed 2^n+1 SRM architecture achieves improvements of 23% in area, 20% in speed, and 22% in PDP compared to recent counterparts. These improvements are attributed to the fully parallel, coarse-grained approach adopted for the design, which has hardly been attempted for signed numbers using array architectures. Finally, the proposed SRM modules are used to design a {2^n-1, 2^n, 2^n+1} special moduli set based RNS processor, and real-time verification is performed on a Zynq (XC7Z020CLG484-1) Field Programmable Gate Array (FPGA).


Introduction
In cloud computing and the Internet of Things (IoT), data security is one of the major concerns for service providers. Therefore, dedicated hardware cryptography support is needed for modern electronic devices [1], [2], [3], [4], [5]. In recent years, Elliptic Curve Cryptography (ECC) [6] has received scientific interest as it ensures more security through hard underlying mathematical problems. This leads to an increase in the key length, and as a result, performing fast arithmetic operations on large integers has become the bottleneck. RNS based arithmetic [7,8] is a solution in which residue multiplication becomes the heart of the computation architecture. The natural defense offered by RNS against attacks is another reason for selecting residue arithmetic as the prime candidate in cryptosystems [9,10].
Similarly, modular exponentiation [11] is a time-critical operation that is widely used in cryptographic algorithms like RSA. Modular exponentiation is performed as a sequence of residue multiplications. Therefore, efficient high-speed residue multiplication is vital in public-key encryption and decryption.
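The dependence of modular exponentiation on residue multiplication can be illustrated with a minimal square-and-multiply sketch. The function names `mod_mul` and `mod_exp` are illustrative and not taken from the paper; the routine is the textbook left-to-right method, not a model of any particular hardware design.

```python
def mod_mul(a: int, b: int, m: int) -> int:
    """Residue multiplication |a * b|_m (the only arithmetic primitive used below)."""
    return (a * b) % m


def mod_exp(base: int, exponent: int, m: int) -> int:
    """Left-to-right square-and-multiply built only from residue multiplications."""
    result = 1 % m
    base %= m
    for bit in bin(exponent)[2:]:            # scan exponent bits, MSB first
        result = mod_mul(result, result, m)  # square for every bit
        if bit == "1":
            result = mod_mul(result, base, m)  # multiply when the bit is set
    return result
```

Every loop iteration costs one or two calls to `mod_mul`, so the speed of the residue multiplier bounds the speed of the whole exponentiation.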
A typical hardware implementation of an RNS based application depends on the chosen moduli set. The selection of RNS moduli [12] and the width of the residues decide the efficiency and performance of the cryptosystem. The {2^n-1, 2^n, 2^n+1} special moduli set is a standard RNS whose moduli are pairwise relatively prime. This moduli set has the advantage that no two numbers within the dynamic range share the same representation. The special moduli set shows better representational efficiency [12] than other moduli sets and also maintains a good balance between the different moduli in the set. Based on the number of bits used to represent the input, moduli, and residue output, moduli multiplication is classified into balanced and unbalanced word-length types [13], [14].
Because the Modified Booth Encoded (MBE) modulo multiplication scheme is relatively fast and can handle both signed and unsigned numbers, researchers' attention has turned towards it, and many modifications have been reported in recent years [15,16,17,18,19,20]. Residue multipliers based on diminished-1 input representation in array and bit-pair recoding Booth algorithms are seen in [16,17,21]. Based on the conducted survey, no work based on a signed array modulo multiplication scheme has been reported in the literature. The likely reasons are the complexity of handling the Partial Products (PPs) and poor speed performance. This has motivated us to propose an array-based, high-speed, area-efficient parallel SRM module for RNS. In this work, high-speed performance is achieved by a new multiplication methodology incorporating parallelism in the PP generation and addition processes.
The six significant contributions of this work are (i) an SRM algorithm for 2^n-1, 2^n+1, and 2^n balanced word-length moduli, (ii) an SRM for 2^k-1, 2^k+1, and 2^k unbalanced word-length moduli, (iii) mathematical modeling of the SRM algorithm for each modulus, (iv) VLSI characterization of the proposed SRM algorithm in terms of a high-speed area-efficient Carry Save Adder (CSA) architecture and a very high-speed Han-Carlson parallel prefix-based SRM array architecture, (v) functional verification of the proposed modules in FPGA and synthesis in ASIC, and (vi) design of an RNS processor to demonstrate the effectiveness of the proposed algorithm.
The paper is structured as follows. In Section 2, related works on residue multipliers with various moduli sets are analyzed. In Section 3, the characteristic equations, algorithm, and VLSI architecture are presented for both balanced (2^n-1, 2^n, 2^n+1) and unbalanced (2^k-1, 2^k, 2^k+1) word-length moduli. The design of the RNS processor is given in Section 4. In Section 5, synthesis results, performance analysis, and the RNS processor implementation are presented. The conclusion is drawn in Section 6.
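The two properties claimed for the special moduli set, pairwise co-primality and a unique residue tuple for every number in the dynamic range, can be checked numerically. The sketch below does so for a small illustrative value n = 3; the helper names are assumptions introduced for this example only.

```python
from itertools import combinations
from math import gcd


def special_moduli(n: int) -> tuple:
    """Return the special moduli set (2^n - 1, 2^n, 2^n + 1)."""
    return (2**n - 1, 2**n, 2**n + 1)


def is_pairwise_coprime(moduli) -> bool:
    """True when every pair of moduli has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def to_residues(x: int, moduli) -> tuple:
    """Residue tuple of x with respect to each modulus."""
    return tuple(x % m for m in moduli)


n = 3                                   # illustrative width, not from the paper
moduli = special_moduli(n)              # (7, 8, 9)
dynamic_range = moduli[0] * moduli[1] * moduli[2]
# Collect the residue tuple of every integer in [0, dynamic_range).
residue_tuples = {to_residues(x, moduli) for x in range(dynamic_range)}
```

If the number of distinct tuples equals the dynamic range, no two integers in that range share a representation.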

Review of Existing Work
An MBE based 2^n-1 multiplication module that reduces the number of PPs is presented in [22]. The results show a significant improvement in area and delay; however, power consumption is not addressed. A radix-8 Booth encoded RNS 2^n-1 multiplier [14] using unbalanced word-length moduli, supporting a sizeable dynamic range with adaptable delay, is presented to achieve less area and power consumption. The same authors have designed radix-8 2^n-1 and 2^n+1 multipliers with balanced word-length moduli in [18] using various modulo properties. The authors claim that less area and power are achieved by using CSA in [14] and parallel prefix adders in [18] for efficient addition operations, with a slight increase in delay for lower bit widths. An improved Booth selector and encoder architecture consisting of a MUX and an EXOR gate for the 2^n-1 MBE multiplication algorithm is presented in [23]. The architecture improves speed and efficiency, but the introduction of the MUX in the selector architecture leads to a slight increase in area, and power consumption is not discussed.
A compact ordinary array structure based 2^n+1 multiplication scheme that groups the PPs and modifies the correction bit is presented in [15]. The PPs are reduced by a CSA tree, and the final carry propagation addition is carried out by a prefix structure in order to achieve better area and delay performance; power consumption is not discussed. By introducing a new PP formation scheme, a binary-weighted representation based modulo 2^n+1 multiplier is presented in [19] and extended to implement a multiply-add unit. The authors achieve less area and power consumption with similar delay performance compared to [15]. A radix-4 MBE architecture with a diminished-1 input representation and a Dadda tree reduction scheme, which can handle zero operands with better speed and area, is discussed in [16]. A compressor structure is introduced in [24] for PP reduction. This work achieves less power, delay, and area compared to [15].
A hybrid input representation approach with a radix-4 Booth encoding scheme, utilizing a binary-weighted representation for one operand and a diminished-1 representation for the other, is explained in [17]. The architecture supports both odd and even values of n. The authors achieve a compact area with enhanced speed compared to the existing multipliers. The radix-8 Booth encoded 2^n+1 multiplier for balanced word-length moduli is designed in [18] using hard multiple generators, bias terms, and adders. The authors claim that area and power are reduced compared to radix-4 and array type multipliers; however, there is an increase in operation time. In [20], the authors improve the hard multiple generator method with a minimum number of bias terms compared to [18]. Two novel methods to increase the performance and improve the efficiency of the radix-8 modulo 2^n+1 multiplier are explained in [20]. The first method significantly reduces the amount of bias, and the second is a new hard multiple generator (HMG) based on a parallel-prefix structure that computes carries only for odd positions. These schemes result in a lightweight parallel-prefix adder for computing the triple of a number, with significant area saving and improved fan-out. The design achieves less area and power compared to the radix-8 Booth multiplier [18]. The HMG delay increases compared to [18], while the overall multiplier delay remains almost the same.
The drawback of MBE based architectures is that, unlike array-based architectures, they require an efficient Booth selector and encoder. The MBE scheme reduces the number of PPs and improves speed; however, it incurs additional hardware cost during implementation. Our proposed work takes an entirely different approach from [18], [20] to address these issues. In the proposed approach, a split array type architecture is considered for implementing the 2^n+1 operation, which occupies less area than the MBE scheme. Unlike the Booth scheme, the array architecture is non-encoded, so it does not require hard multiples for processing the PPs. The problem of an increased number of PPs in an array is addressed in the proposed scheme by splitting the array structure into four segments, and full parallelism is maintained in the PP additions as well. The parallelism in the architecture ensures improved speed while maintaining the area advantage of the general array structure. The difficulty of handling signed numbers in array architectures is another reason why the array scheme is less explored for data processing applications. The representation of signed numbers is addressed in the proposed architecture using appropriate constants.

Proposed balanced word-length SRM
In balanced word-length modulo multiplication, the numbers of bits required to represent the input, moduli, and output are summarized in Table 1. This type of multiplication is called balanced residue multiplication because it maintains a balanced bit-width between the input, output, and moduli representations. In the literature, the design problem of 2^n-1 and 2^n+1 residue multiplication is addressed through MBE schemes, whereas the possibility of addressing it using array architectures is hardly considered, especially for signed numbers. A hierarchical approach for signed array multiplication is presented in [25]. The motivation behind this work is the regularity in VLSI implementation and the reduced area budget offered by array architectures compared to MBE architectures. The delay problem usually found in array architectures compared to the MBE scheme is addressed here using hierarchy-based processing of the input bits and a parallel addition structure. For comparative analysis, the adder structure is realized using CSA and Han-Carlson parallel prefix [26] based schemes. The mathematical background, algorithm, and architecture of the proposed residue 2^n-1, 2^n+1, and 2^n multiplications are presented in the following subsections.

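Table 1. Number of bits required in balanced word-length modulo multiplication
Parameter | 2^n-1 | 2^n | 2^n+1
Number of input bits (A and B) | n | n | n
Moduli representation bits | n | n | n+1
Number of output bits (P) | n | n | n+1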
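The CSA-based adder structure mentioned for the balanced SRM can be sketched behaviorally for the modulo 2^n-1 case. The sketch below is a generic carry-save step with end-around carry, assuming n-bit operands; it is not a model of the proposed architecture, and the Han-Carlson prefix adder is not modeled. All function names are illustrative.

```python
def rotl1(x: int, n: int) -> int:
    """Rotate an n-bit word left by one, i.e. multiply by 2 modulo 2^n - 1."""
    mask = (1 << n) - 1
    return ((x << 1) & mask) | (x >> (n - 1))


def csa_mod(a: int, b: int, c: int, n: int) -> tuple:
    """3:2 carry-save step: three n-bit words in, sum word and carry word out.

    The carry word is shifted with an end-around rotate because the bit
    leaving position n-1 has weight 2^n, which equals 1 modulo 2^n - 1.
    """
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, rotl1(carry, n)


def add_mod(a: int, b: int, n: int) -> int:
    """Final carry-propagate addition with end-around carry, result in [0, 2^n - 2]."""
    mask = (1 << n) - 1
    t = a + b
    t = (t & mask) + (t >> n)       # fold the carry-out back into the LSB
    return 0 if t == mask else t    # the all-ones word is a second encoding of zero
```

Because each carry-save step has no carry propagation, several partial products can be compressed before a single final addition is performed.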

Proposed 2^n-1 SRM
The 2^n-1 modulo multiplication module is one of the essential operations in the RNS independent arithmetic channel. The mathematical background, algorithm, and proposed architectures for the signed 2^n-1 residue multiplier are given below.

Mathematical modeling
Consider the 2's complement signed number representation of two binary numbers A and B as given in Eq.
The 2^n-1 residue product representation is given in Eq.
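For reference, the standard 2's complement expansion of two n-bit signed numbers and their residue product can be written as follows. This is the textbook form; the bit symbols a_i and b_i are assumptions, and the expressions are not confirmed to match the numbered equations of the paper term by term.

```latex
\[
\begin{aligned}
A &= -a_{n-1}\,2^{n-1} + \sum_{i=0}^{n-2} a_i\,2^{i}, \\
B &= -b_{n-1}\,2^{n-1} + \sum_{i=0}^{n-2} b_i\,2^{i}, \\
P &= \left| A \times B \right|_{2^{n}-1}.
\end{aligned}
\]
```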
Step 1. Partitioning of input bits and generation of intermediate PPs: The intermediate PPs W, X, Y, and Z are generated using the hierarchical partitioning multiplier [25].
Step 2. PP arrangement: The generated PPs are arranged as in [25], and a constant is added, as shown in Fig. 1, where m=n/2.
Step 3. Rearrangement of intermediate PPs: Fig. 2 shows the rearrangement of the PPs and the addition process flow carried out for the 2^n-1 residue multiplication; the corresponding mathematical operations are given in Eq. (4) - (6). The notations and operators used in this mathematical modeling are summarized in Table 2 and Table 3, respectively.
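The three steps above can be sketched at a behavioral level. The sketch assumes that the operands are split into a signed upper half and an unsigned lower half with m = n/2, and that W, X, Y, and Z are the four half-width products shown in the comments; this assignment is an assumption and may differ from the definitions in [25]. The added constant and the compensation bits of the hardware scheme are not modeled, because Python integers carry the sign directly.

```python
def split_signed(x: int, n: int) -> tuple:
    """Split an n-bit two's complement value into (signed high half, unsigned low half)."""
    m = n // 2
    low = x & ((1 << m) - 1)   # unsigned lower m bits
    high = x >> m              # arithmetic shift keeps the sign in the upper half
    return high, low


def srm_mod_2n_minus_1(a: int, b: int, n: int) -> int:
    """Behavioral sketch of signed residue multiplication modulo 2^n - 1."""
    m = n // 2
    a_h, a_l = split_signed(a, n)
    b_h, b_l = split_signed(b, n)
    # Hypothetical assignment of the four intermediate partial products.
    w = a_l * b_l
    x = a_h * b_l
    y = a_l * b_h
    z = a_h * b_h
    # A*B = W + (X + Y)*2^m + Z*2^n, and 2^n = 1 (mod 2^n - 1),
    # so Z is folded back onto the least significant positions.
    return (w + ((x + y) << m) + z) % ((1 << n) - 1)
```

The four products are mutually independent, which is what allows them to be generated in parallel before the rearranged addition.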
The final product and the compensation bits are expressed in Eq. (4) - (6). The proposed 2^n-1 SRM algorithm is given below.

Architecture
The architecture of the proposed 2^n-1 residue multiplication is shown in Fig. 3. The architecture consists of three stages, namely the partitioning stage, the intermediate PP generation stage, and the adder stage. The four parallel modules in the intermediate PP generation stage, M-I, M-II, M-III, and M-IV, indicate the hardware required for computing W, X, Y, and Z as given in [25]. The four independent parallel addition processes in the architecture are the main reason for the high performance of the proposed array architecture. The compensation bits are added in the final stage to obtain the modulo result. CSA and Han-Carlson parallel prefix adder structures are incorporated in Fig. 3 in order to analyze the performance. The results of the proposed work are discussed further in Section 5.

Proposed 2^n+1 SRM
The 2^n+1 residue multiplication is considered a demanding operation in the RNS processor due to the increase in moduli output range compared to the 2^n and 2^n-1 multiplications, as represented in Table 1. In the proposed scheme, the increased moduli output range is regulated using the diminished-1 approach for both multiplier and multiplicand. The primary advantage of the proposed scheme is that the architecture can handle exceptional cases like 'all-zeros' and 'all-ones' inputs and still produce the correct results. This architecture handles the bit positions higher than n-1 by complementing and mapping them to the LSBs. The mathematical background, algorithm, and proposed architectures for signed 2^n+1 residue multipliers are given in the subsections below.
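The idea of complementing the bit positions above n-1 and mapping them to the LSBs rests on the congruence 2^n = -1 (mod 2^n+1). A minimal sketch, assuming a non-negative input below 2^(2n), is given here; the function name and the correction constant of 2 come from this generic derivation, not from the paper's compensation scheme.

```python
def fold_mod_2n_plus_1(x: int, n: int) -> int:
    """Reduce a non-negative value below 2^(2n) modulo 2^n + 1."""
    mask = (1 << n) - 1
    modulus = (1 << n) + 1
    low = x & mask                    # bits n-1 .. 0
    high = x >> n                     # bits 2n-1 .. n, each weighted by 2^n = -1
    high_complement = ~high & mask    # bitwise complement, equals 2^n - 1 - high
    # -high = high_complement + 2 (mod 2^n + 1), so a subtraction becomes
    # an addition of the complemented upper bits plus a constant.
    return (low + high_complement + 2) % modulus
```

The constant absorbs the difference between a bitwise complement and a true negation modulo 2^n+1.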

Mathematical modeling
The binary inputs A and B are converted to their diminished-1 representations A' and B', as given in Eq.
The residue product P is given in Eq. (9). The methodology and arrangement of the PPs are the same as Step 1 and Step 2 of the signed 2^n-1 SRM, but the inputs are A' and B'. The final product is obtained by rearranging the PPs of Fig. 1 so as to obtain the result of the 2^n+1 residue multiplication. Fig. 4 shows the rearrangement of the PPs, their positions, and the addition process flow carried out for the 2^n+1 multiplication, and the same is represented in Eq. (10) - (20). These equations cover the mathematical operations performed on Row 1 to Row 4, with C_b given in Eq. (19) and the 2^n+1 multiplication result given in Eq. (20).
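The diminished-1 approach can be illustrated with the textbook identity (A'+1)(B'+1) - 1 = A'B' + A' + B'. The sketch below assumes unsigned operands in the range 1 to 2^n and does not reproduce the signed handling or the PP rearrangement of the proposed scheme; the function names are illustrative.

```python
def to_diminished_1(a: int) -> int:
    """Diminished-1 code of a nonzero operand: A' = A - 1."""
    return a - 1


def diminished_1_product(a_d: int, b_d: int, n: int) -> int:
    """Diminished-1 code of the product: P' = |A'B' + A' + B'| mod (2^n + 1)."""
    modulus = (1 << n) + 1
    return (a_d * b_d + a_d + b_d) % modulus


def from_diminished_1(p_d: int, n: int) -> int:
    """Recover the ordinary residue P = |P' + 1| mod (2^n + 1)."""
    return (p_d + 1) % ((1 << n) + 1)
```

A result of P' = 2^n corresponds to a zero product, which does not fit in n bits and therefore needs separate treatment in any diminished-1 design.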

Algorithm
The proposed 2^n+1 SRM algorithm is given below.

Architecture
The overall architecture of the 2^n+1 SRM is similar to that of the 2^n-1 SRM, except that it has additional modules to perform the 2's complement operation and the Inverted End Around Carry (IEAC), as shown in Fig. 5. However, the compensation generation scheme is more complicated than in the 2^n-1 architecture.

Proposed 2^n SRM

Mathematical modeling
The operation required to obtain Module I (W) follows the same pattern as in the 2^n-1 SRM. X and Y are given in Eq. (21) and (22). Z is not required for computing the 2^n residue because its bits lie at weight positions of 2^n and above.
The final 2^n product is given in Eq. (23).

Proposed unbalanced word-length SRM
The unbalanced word-length moduli multiplier is typically used in applications where a different bit-width proportion between input, moduli, and output is required. In unbalanced word-length residue multiplication, the numbers of bits required to represent the input, moduli, and output are summarized in Table 4. The 2^k-1 module is derived from the 2^n-1 balanced module. However, the 2^k+1 module is not derived from the 2^n+1 balanced module because doing so may lead to a comparatively complex architecture with a larger delay penalty. Instead, the 2^n-1 balanced design is converted to an unbalanced 2^k+1 design by modifying the final result of the 2^n-1 multiplication.

Proposed 2^k-1 and 2^k+1 SRM

Mathematical modeling
Let us consider the n-bit output of the balanced 2^n-1 multiplication given in Eq. (5). It is split into two halves, P_L and P_H, as shown in Fig. 6, to obtain the k-bit result for k=n/2 and k=n/4, and the corresponding equations are given in Eq. (24) - (25).
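The derivation of the unbalanced results from the balanced 2^n-1 output can be sketched for the case k = n/2. It relies on the factorization 2^n-1 = (2^k-1)(2^k+1), so that adding the two halves gives the residue modulo 2^k-1 and subtracting them gives the residue modulo 2^k+1. This is assumed, not confirmed, to be what Eq. (24) - (25) express; the function name is illustrative, and the k = n/4 case is not covered.

```python
def unbalanced_from_balanced(p: int, n: int) -> tuple:
    """Derive residues modulo 2^k - 1 and 2^k + 1 (k = n/2) from P = |A*B| mod (2^n - 1)."""
    k = n // 2
    mask = (1 << k) - 1
    p_l = p & mask          # lower half P_L
    p_h = p >> k            # upper half P_H, weighted by 2^k
    # 2^k = 1 (mod 2^k - 1) and 2^k = -1 (mod 2^k + 1).
    residue_minus = (p_l + p_h) % ((1 << k) - 1)
    residue_plus = (p_l - p_h) % ((1 << k) + 1)
    return residue_minus, residue_plus
```

Both unbalanced results therefore cost only one extra k-bit modular addition or subtraction on top of the balanced multiplier.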

Architecture
The unbalanced SRM architecture for 2^k-1 and 2^k+1 is depicted in Fig. 7. The architecture is derived from the proposed 2^n-1 SRM.

2^k SRM
The residue multiplication P = |A × B|_{2^k} is derived from the 2^n balanced residue multiplier equation. The characteristic equations of the 2^k unbalanced residue multiplication are given in Eq. (26).

RNS processor

Architecture
In general, a cryptographic algorithm requires many rounds of arithmetic operations to create the ciphertext. Instead of performing such lengthy arithmetic operations in binary representation, residue values can be used to save area and time. The proposed balanced and unbalanced word-length residue multipliers are used for implementing special moduli set based RNS computing platforms, as shown in Fig. 8. The RNS processing system consists of three blocks, namely the Forward Converter (FC), the Independent Modulo Arithmetic Processing Unit (IMAPU), and the Reverse Converter (RC) [13], [27]. The proposed SRM architectures are used to design the arithmetic channels and the RC. The FC and RC blocks convert a binary number to a residue number and vice versa. The IMAPU block consists of application-based arithmetic operations or any other desired operations in modulo representation. The RC operation can be performed using the Chinese Remainder Theorem (CRT) [28] or Mixed Radix Conversion (MRC) [29]. In this paper, the MRC technique [13,27] is used for the conversion in the RC block. The characteristic equations of MRC given in Eq. (27) - (29) show that the operation can be done by modulo subtractions, multiplicative inverses, and residue multiplications. Here the multiplicative inverse is computed using the Extended Euclidean algorithm (EECD) [30]. The mixed-radix digits are derived following [13,27]. The effectiveness of the proposed multiplier is tested by designing decoupled modulo arithmetic channels and a memoryless MRC reverse converter, as shown in Fig. 8. An example calculation depicting the dataflow in the architecture is given in Table 5.
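The dataflow through the FC, the arithmetic channels, and the MRC-based RC can be sketched in software. The sketch below uses the standard three-modulus mixed-radix recurrence and an Extended Euclidean inverse; the function names, the moduli ordering, and the mapping of the upper half of the dynamic range to negative numbers are assumptions made for this example, not details taken from Eq. (27) - (29) or Table 5.

```python
def extended_gcd(a: int, b: int) -> tuple:
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y


def mod_inverse(a: int, m: int) -> int:
    """Multiplicative inverse of a modulo m via the Extended Euclidean algorithm."""
    g, x, _ = extended_gcd(a % m, m)
    if g != 1:
        raise ValueError("inverse does not exist")
    return x % m


def forward_convert(x: int, moduli) -> list:
    """FC: binary (signed) integer to residues."""
    return [x % m for m in moduli]


def channel_multiply(ra, rb, moduli) -> list:
    """Independent residue multiplication in each modulo channel."""
    return [(a * b) % m for a, b, m in zip(ra, rb, moduli)]


def reverse_convert_mrc(residues, moduli) -> int:
    """RC: mixed-radix conversion for three moduli, with an assumed signed mapping."""
    m1, m2, m3 = moduli
    r1, r2, r3 = residues
    d1 = r1
    d2 = ((r2 - d1) * mod_inverse(m1, m2)) % m2
    d3 = (((r3 - d1) * mod_inverse(m1, m3) - d2) * mod_inverse(m2, m3)) % m3
    value = d1 + d2 * m1 + d3 * m1 * m2
    total = m1 * m2 * m3
    # Assumption: values in the upper half of the range represent negatives.
    return value - total if value >= total // 2 else value
```

For example, with n = 4 the moduli are (15, 16, 17), and multiplying -23 by 37 channel by channel and converting back yields -851, which lies inside the signed range of that moduli set.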

Range analysis
The permissible number ranges for balanced and unbalanced word-length residue multipliers are shown in Table 6. The bit-width required to represent the balanced triple moduli set {2^n-1, 2^n, 2^n+1} system is 3n+1 bits, whereas the maximum number of bits required for the unbalanced moduli set {2^k+1, 2^k, 2^k-1} system is 3k+1.
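One reading of the 3n+1 figure is the combined width of the three residues, since the 2^n+1 channel needs one extra bit. The sketch below checks that count and also computes the dynamic range of the moduli set; this interpretation and the function names are assumptions, and the entries of Table 6 are not reproduced.

```python
def residue_word_bits(exponent: int) -> int:
    """Total bits needed to store one residue per modulus of {2^e - 1, 2^e, 2^e + 1}."""
    moduli = (2**exponent - 1, 2**exponent, 2**exponent + 1)
    # The largest residue of modulus m is m - 1.
    return sum((m - 1).bit_length() for m in moduli)


def dynamic_range(exponent: int) -> int:
    """Product of the three moduli, i.e. the count of distinct representable values."""
    return (2**exponent - 1) * 2**exponent * (2**exponent + 1)
```

The same functions apply to the unbalanced set by passing k in place of n.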

FPGA synthesis
The proposed design is coded in Verilog HDL and simulated in the Xilinx ISIM tool for architecture-level functional verification. The hardware architectures are synthesized using Xilinx Synthesis Technology (XST) for both balanced and unbalanced residue multipliers. The results of the proposed architecture with CSA (Proposed-I) and with prefix-based adders (Proposed-II) are presented in Table 7 and Table 8, respectively.

Performance analysis
From Table 9, the area comparison of the 2^n-1 SRM shows that the Proposed-I and Proposed-II architectures require less area than the other multipliers [14], [22], [23]. The synthesis results show that Proposed-I occupies 17% - 22% less area and Proposed-II occupies 10% less area than existing modulo MBE multipliers. Delay analysis indicates that Proposed-I has a 17% - 24% speed improvement, and Proposed-II a 26% - 30% speed improvement. Power analysis shows that the total power required for the design is almost the same as in recent works.
For the 2^n+1 SRM architectures, the proposed designs outperform the other multipliers in area efficiency and speed [15,16,17,18,19,20,21]. Proposed-I saves 23% - 44% in area, whereas Proposed-II saves 10% - 32% in area compared to existing MBE architectures. The speed improvements of Proposed-I and Proposed-II lie in the ranges of 10% - 35% and 20% - 39%, respectively. The power profiles of the proposed multipliers are almost the same as those of recent works.
Since the proposed unbalanced residue multipliers are derived from the proposed balanced residue multipliers, they follow the same trend in the area, delay, and power metrics, which are presented in Table 10.
The core problem addressed in this work is improving the speed of the signed residue array multiplier, which generally consumes less area than its Booth-type counterparts. To achieve this objective, a massively parallel operation from start to end is envisioned, designed, and implemented. It is inferred

Hardware Implementation of RNS Processor
The RNS processing example discussed in Section 4 and the architecture shown in Fig. 8 are simulated, and the ISIM simulation results are shown in Fig. 9. The RNS processor is synthesized for both FPGA and ASIC platforms, and the results are presented in Table 11. The synthesized netlist of the RNS processor is implemented on the Xilinx Zynq board (XC7Z020CLG484-1).

Conclusion
A new signed residue array multiplication scheme for balanced (2^n-1, 2^n+1, 2^n) and unbalanced (2^k-1, 2^k+1, 2^k) word-length moduli is proposed in this paper. The proposed architecture with massive parallelism is realized by incorporating CSA and Han-Carlson prefix adder structures. The existing and proposed multipliers are synthesized in both ASIC and FPGA technologies. From the synthesis results, the Proposed-I 2^n-1 residue multiplication scheme saves 17% area, while the scheme with the prefix structure (Proposed-II) achieves 26% speed and 24% PDP improvement compared to state-of-the-art MBE 2^n-1 residue multipliers. Similarly, the balanced 2^n+1 Proposed-I design saves 23% area, and the speed and PDP improvements of Proposed-II are 20% and 22%, respectively, compared to state-of-the-art 2^n+1 residue multipliers. The unbalanced multipliers derived from the balanced multipliers follow the same trend. Finally, the proposed residue arithmetic modules are used to build the arithmetic channels and the reverse converter of a {2^n-1, 2^n, 2^n+1} triple moduli set RNS processor, which is implemented in hardware using a Zynq (XC7Z020CLG484-1) device for real-time verification.
The results indicate that the proposed designs can be efficiently utilized to improve the speed and area performance of RNS based cryptographic applications like RSA and ECC. The results also show that the Proposed-I SRM architecture implemented using CSA may be used for area-constrained RNS applications, and the Proposed-II SRM architecture using the prefix adder can be used for high-speed applications.

Conflict of Interest
We have no conflict of interest to declare.