CMOS High-Performance 5-2 and 6-2 Compressors for High-Speed Parallel Multipliers

In this article, the design procedure for high-speed 5-2 and 6-2 compressors is discussed along with their analysis. By combining a 4-2 compressor with 3-2 counter blocks, a high-performance structure for the 6-2 compressor is obtained, which shows a significant speed improvement over previous architectures. The optimization is carried out by reducing the carry rippling between adjacent compressor structures. With some modifications, the proposed 6-2 compressor turns into a 5-2 compressor in which the latency of the critical path is also considerably reduced. The latencies of the proposed 5-2 and 6-2 structures are equal to 3.5 and 4 XOR logic gates, respectively, corresponding to speed improvements of 15% and 20% over the best-reported architectures. In addition, the power consumption and the transistor count of the proposed circuits remain at a moderate level. Therefore, considering the Power-Delay Product (PDP), the proposed circuits are a good choice for high-speed parallel multiplier design. Post-layout simulation results based on the TSMC 90nm standard CMOS process and a 0.9V power supply are presented to confirm the correct functionality of the implemented compressors. These results are also used as a fair basis for comparison between the proposed circuits and redesigned versions of previously reported architectures.


Introduction
In 1981, Weinberger introduced a new architecture based on cascaded carry-save adders [1]. The structure, which became popular as the 4-2 compressor, is composed of two Full Adders (FAs). One of its outputs (denoted as $C_{out}$) is sent horizontally to the neighboring compressor block at the next higher binary bit position [2]. Figure 1 illustrates this structure, in which $X_1$-$X_4$ constitute the main input bits while the fifth input comes from the previous compressor at the next lower bit position [3].
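The two-FA construction described above can be sketched behaviourally as follows. This is a minimal illustration, not the circuit of Figure 1: the function names and the order in which the inputs are wired to the two adders are assumptions made for the example.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, c_in):
    """4-2 compressor modelled as two cascaded full adders.

    Returns (sum, carry, c_out). In this assumed wiring c_out depends
    only on x1..x3, so c_in does not ripple into c_out.
    """
    s1, c_out = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, c_in)
    return s, carry, c_out


# Exhaustive check over all 32 input patterns: the five inputs and Sum
# share one weight, while Carry and C_out carry twice that weight.
for bits in product((0, 1), repeat=5):
    s, carry, c_out = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + c_out)
```

Because `c_out` is produced by the first adder alone, a row of such cells passes the horizontal signal only to the immediate neighbor.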
Along with the development of VLSI circuits, which made the hardware-level realization of parallel multipliers possible [4], the applications of such systems have greatly increased in Digital Signal Processors (DSPs) and microprocessors. A thorough literature review shows that the parallel multiplier lies on the critical path that determines the delay of the system containing it [5]. Also, the most significant part of the propagation latency in a parallel multiplier belongs to the Partial Product Reduction Tree (PPRT), which consists of compressors [6]. As a consequence, delay reduction of the PPRT is one of the main concerns of circuit designers in state-of-the-art schemes.
One effective solution for this purpose, which applies particularly well to multiplications with large input bit widths, is the use of compressors with a high compression rate, such as 5-2, 6-2, and 7-2 blocks [7]. Following the general concept of the 4-2 compressor, cascading three FAs results in the conventional architecture of the 5-2 compressor [7]. A latency of at least 5 XOR logic gates (at gate level) is expected for this realization; with the help of some optimizations reported in [8], the delay has been reduced to 4 XOR gates. Although many structures have been reported in the literature [8,9,10,11,12,13,14], none of them has been able to reach a latency of less than 4 XOR gates. This partially comes down to the lack of deep investigation of the original truth table of the 5-2 compressor block.
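As an illustration of the conventional three-FA cascade mentioned above, the following sketch models a 5-2 compressor behaviourally. The wiring order and the names `compressor_5_2`, `c_in1`, and `c_in2` are assumptions for the example, not taken from [7].

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_5_2(x1, x2, x3, x4, x5, c_in1, c_in2):
    """Conventional 5-2 compressor as three cascaded full adders.

    Returns (sum, carry, c_out1, c_out2).
    """
    s1, c_out1 = full_adder(x1, x2, x3)
    s2, c_out2 = full_adder(s1, x4, c_in1)
    s, carry = full_adder(s2, x5, c_in2)
    return s, carry, c_out1, c_out2


# All 128 input patterns satisfy the weighted-sum identity: seven inputs
# and Sum at one weight, the three remaining outputs at twice that weight.
for bits in product((0, 1), repeat=7):
    s, carry, c_out1, c_out2 = compressor_5_2(*bits)
    assert sum(bits) == s + 2 * (carry + c_out1 + c_out2)
```

In this assumed wiring the Sum output passes through all three adders in series, which is the long path that optimized designs try to shorten.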
The structure of a 6-2 compressor consists of three horizontal paths which transfer their corresponding bits to the adjacent block. Although several works have been introduced over recent years for hardware implementation of this structure [10,15,16,17,18,19], none of them could achieve latencies less than 5 XOR logic gates.
A comprehensive study over n-2 compressor design depicts that there are some neutral states in the conventional truth table. These states can be helpful in the simplification of the Karnaugh Map (KM) for the corresponding compressor. The main objective is to obtain better speed performance while the power consumption would not be degraded. This concept is the basis of the idea in this article to design a high-performance 6-2 compressor. Furthermore, after successfully applying the mentioned idea to the 6-2 structure and by using some simplifications, a novel 5-2 compressor will be presented later in this paper. This scheme also outperforms previous works when the Power-Delay Product (PDP) is concerned.
One important issue that needs to be considered in the design of the proposed structures is the reduction of the carry rippling problem between adjacent compressor blocks. For instance, in the horizontally cascaded architectures of Figure 2, the conventional design principles lead to the rippling of input signals through three compressor blocks. The same concept also holds for the 6-2 compressor. However, in this paper, careful design considerations have been utilized to reduce the carry rippling by one level. This reduction is the key factor in the speed enhancement.
The organization of the paper is as follows. Section 2 explains the analysis and design of the proposed high-performance 6-2 compressor. In Section 3, the novel 5-2 compressor architecture is discussed. Post-layout simulation results, along with the comparison tables, are presented in Section 4. Finally, Section 5 contains the conclusions.

The 6-2 compressor
The conventional structure of a 6-2 compressor comprises two FAs along with one 4-2 compressor, as shown in Figure 3. Based on the defined architecture, a 6-2 compressor has nine inputs and five outputs. The following expression describes the relationship between the input and output bits [10]:

$$\sum_{i=1}^{6} X_i + C_{in1} + C_{in2} + C_{in3} = Sum + 2\left(Carry + C_{out1} + C_{out2} + C_{out3}\right) \tag{1}$$

As Eq. (1) shows, all of the input bits along with the Sum output have the same weight, while the other four outputs take the next higher binary bit value. Considering the 4-2 compressor architectures reported in [2] and [3], which exhibit a gate-level delay of 2 XOR logic gates from the inputs to the outputs, a latency of less than 5 XOR logic gates is expected for the implementation of the 6-2 compressor in Figure 3.
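The weighted-sum relation between the nine inputs and five outputs can be checked exhaustively on a behavioural model. The sketch below assumes one plausible wiring of two FAs and one 4-2 compressor; the actual connections of Figure 3 are not given in the text, so the wiring and all function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(x1, x2, x3, x4, c_in):
    """Two-FA 4-2 compressor; returns (sum, carry, c_out)."""
    s1, c_out = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, c_in)
    return s, carry, c_out


def compressor_6_2(x, c_in1, c_in2, c_in3):
    """Assumed wiring: two FAs reduce the six main inputs, and a 4-2
    compressor combines their sums with the three horizontal inputs.

    `x` is a sequence of six bits. Returns
    (sum, carry, c_out1, c_out2, c_out3).
    """
    s_a, c_out1 = full_adder(x[0], x[1], x[2])
    s_b, c_out2 = full_adder(x[3], x[4], x[5])
    s, carry, c_out3 = compressor_4_2(s_a, s_b, c_in1, c_in2, c_in3)
    return s, carry, c_out1, c_out2, c_out3


# Nine inputs and Sum share one weight; the other four outputs have
# twice that weight. Check all 512 input patterns.
for bits in product((0, 1), repeat=9):
    s, carry, c1, c2, c3 = compressor_6_2(bits[:6], *bits[6:])
    assert sum(bits) == s + 2 * (carry + c1 + c2 + c3)
```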
The design introduced in [10] directly uses the conventional structure for hardware realization of a 6-2 compressor. Because the 4-2 compressor of [7] was employed, the latency of the circuit was equal to 5 XOR gates. If the compressor of [3] is used in that structure instead, a slight gate-level reduction of the critical path delay can be achieved.
In [15] and [19], gate-level simplification has been utilized for the design of 6-2 compressor. However, the critical path delay which belongs to output is still 5 XOR gates. The architectures used in [17] and [18] have modified the conventional structure by using two 4-2 compressors and one adder. But they were not able to reduce the latency, and due to the higher transistor count, neither active area nor power consumption has been improved.
Considering the aforementioned implementations and their drawbacks, a new architecture is proposed in this paper for the 6-2 compressor, in which the gate-level delay is reduced considerably. The proposed circuit is shown in Figure 4, in which the pins labeled $I_1$, $I_2$, $I_3$, $I_4$, $I_5$, and $I_6$, along with $C_{in1}$, $C_{in2}$, and $C_{in3}$, constitute the input bits. Also, $C_{out1}$, $C_{out2}$, and $C_{out3}$, along with $Carry_{6-2}$ and $Sum_{6-2}$, form the output bits. The operating principle of the proposed structure can be seen as a 4-2 compressor that is fed by the $I_1$, $I_2$, $I_3$, $I_4$, and $I_5$ inputs. These inputs are used to generate the $C_{out1}$ and $C_{out2}$ outputs of the 6-2 compressor. If at least two of the five inputs are at logic high, then one of the $C_{out1}$ and $C_{out2}$ outputs rises to logic high. If at least four input bits have the logic value of one, then both outputs are high as well. By considering the conventional architecture of the 4-2 compressor described in [7], the Boolean expressions pertaining to the functions of $C_{out1}$ and $C_{out2}$ will be as follows: An odd number of ones at the input pins drives the corresponding sum output signal (denoted as $Sum_{4-2}$) to logic high.
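The counting rule stated above for the five-input block can be written as a small behavioural model. This is only a sketch of the rule, not the Boolean expressions of the paper: the text does not say which of the two carry outputs rises first, so assigning the first threshold to `c_out1` is an assumption, as is the function name.

```python
from itertools import product


def five_input_block(i1, i2, i3, i4, i5):
    """Behavioural model of the block fed by I1..I5.

    Returns (sum_4_2, c_out1, c_out2).
    """
    ones = i1 + i2 + i3 + i4 + i5
    sum_4_2 = ones & 1         # high for an odd number of ones
    c_out1 = int(ones >= 2)    # assumption: first carry to rise
    c_out2 = int(ones >= 4)    # both carries high from four ones upward
    return sum_4_2, c_out1, c_out2


# The three outputs encode the number of ones: Sum at weight one and
# each carry at weight two.
for bits in product((0, 1), repeat=5):
    s, c1, c2 = five_input_block(*bits)
    assert sum(bits) == s + 2 * (c1 + c2)
```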
For the three horizontal inputs (represented as $C_{in1}$, $C_{in2}$, and $I_6$), an FA is utilized, and the carry output of this stage forms the $C_{out3}$ output of the proposed 6-2 compressor. In order to increase the speed, the three-gate model of the FA from [12] is employed to implement this block. This output is the carry (majority) function of the three FA inputs:

$$C_{out3} = C_{in1} \cdot C_{in2} + C_{in1} \cdot I_6 + C_{in2} \cdot I_6$$
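A full adder built from three gates is commonly realized as two XOR gates and one multiplexer. The sketch below assumes that form; whether it matches the exact cell of [12] is not confirmed by the text. It checks that the multiplexer-based carry equals the majority function of the three inputs.

```python
from itertools import product


def mux(sel, d0, d1):
    """2-to-1 multiplexer: returns d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def full_adder_xor_mux(a, b, c):
    """Three-gate full adder (assumed XOR, XOR, MUX form).

    Returns (sum, carry).
    """
    p = a ^ b               # propagate signal
    s = p ^ c               # sum output
    carry = mux(p, a, c)    # a == b: carry is a; a != b: carry is c
    return s, carry


def majority(a, b, c):
    """Carry of a full adder in sum-of-products form."""
    return (a & b) | (a & c) | (b & c)


for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder_xor_mux(a, b, c)
    assert carry == majority(a, b, c)
    assert a + b + c == s + 2 * carry
```

With `a`, `b`, and `c` read as the three horizontal inputs of the FA, `carry` plays the role of the third horizontal output.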
The combination of the two sum outputs coming from the 4-2 compressor and the FA block constructs the Sum output of the 6-2 compressor, which is denoted as $Sum_{6-2}$. Finally, for the Carry output we will have: The calculated Boolean expressions clearly demonstrate that the carry rippling has been reduced by one level, which is a notable advantage of the proposed 6-2 compressor.
To calculate the gate-level latency, we refer to the calculations provided in [7]. As shown in Figure 5(a) and discussed in [7], the latency of a two-output XOR/XNOR gate is taken as Δ. As a consequence, the propagation delay of the single-output XOR gate in Figure 5(b) is equal to 0.75Δ, considering that the Transmission Gate (TG) transistors are channel-ready transistors. A channel-ready transistor has half the latency of a transistor whose gate bias has not yet put it in the ON state. The same hypothesis also holds for the Multiplexer (MUX) gate of Figure 6, which is composed of two parallel TGs. In the normal mode, the normalized latency of this gate is equal to 0.5Δ, while for the channel-ready case, the corresponding delay is reduced to 0.25Δ.
Furthermore, the three-input OR gate can be treated almost the same as the two-output XOR-XNOR gate concerning the propagation latency. Also, a single pass-transistor-like device has a delay value of 0.25Δ of the unit gate [2]. The NAND/NOR gates are also treated the same as the MUX circuit operating in the normal mode.
Therefore, according to the circuit structure in Figure 4, the critical path of the proposed 6-2 compressor belongs to the Carry output. This path starts at the $C_{out2}$ output coming from the adjacent compressor, which then enters the FA block inside the present 6-2 compressor. The corresponding gate-level delay ($T_d$) abides by:

$$T_d = t_{Cout2} + t_{FA} + t_{AND} + t_{3OR} \tag{6}$$

in which $t_{Cout2}$ is the delay for the generation of $C_{out2}$ in the previous-level compressor block. Moreover, $t_{FA}$ is the time interval for the generation of the output of the FA block in Figure 4. The terms $t_{AND}$ and $t_{3OR}$ represent the latencies of the AND and three-input OR gates, respectively. By converting AND to NOR and OR to NAND, $t_{AND}$ and $t_{3OR}$ can be replaced by $t_{NOR}$ and $t_{3NAND}$ to reduce the gate-level latency.
Because the 4-2 compressor block reported in [3] is utilized to achieve higher speed, $t_{Cout2}$ is equal to Δ. On the other hand, in an FA block, two single-output XOR gates are needed to produce Sum. Therefore, $t_{FA}$ is equal to 1.5Δ. Substitution of the obtained values in Eq. (6) results in:

$$T_d = \Delta + 1.5\Delta + 0.5\Delta + \Delta = 4\Delta \tag{7}$$

As Eq. (7) expresses, the gate-level delay of the proposed compressor is almost equal to 4 XOR logic gates, which is a substantial improvement in the delay performance of the proposed 6-2 compressor compared with the previously reported works.
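The delay budget of this critical path can be reproduced with exact fractions of Δ. The sketch below uses the normalized gate delays stated in the text. Taking the three-input NAND at the same delay as the three-input OR gate is an assumption, chosen because it makes the four terms add up to the stated total. All dictionary keys and function names are hypothetical.

```python
from fractions import Fraction

# Normalized gate delays in units of Delta, as stated in the text.
GATE_DELAY = {
    "xor_xnor_2out": Fraction(1),         # two-output XOR/XNOR gate
    "xor_1out": Fraction(3, 4),           # single-output XOR gate
    "mux": Fraction(1, 2),                # MUX in normal mode
    "mux_channel_ready": Fraction(1, 4),  # MUX with channel-ready TGs
    "or3": Fraction(1),                   # three-input OR gate
    "nand_nor": Fraction(1, 2),           # NAND/NOR, same as normal MUX
}


def critical_path_6_2():
    """Sum the four terms of the 6-2 critical path, in units of Delta."""
    t_cout2 = Fraction(1)                # stated as Delta for the cell of [3]
    t_fa = 2 * GATE_DELAY["xor_1out"]    # two single-output XOR gates
    t_nor = GATE_DELAY["nand_nor"]       # AND replaced by NOR
    t_3nand = GATE_DELAY["or3"]          # assumption: same as 3-input OR
    return t_cout2 + t_fa + t_nor + t_3nand


total = critical_path_6_2()
print(f"T_d = {float(total)} Delta")
```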

The 5-2 compressor
By applying some modifications, the designed 6-2 compressor can be used as a high-speed 5-2 compressor. In order to achieve this configuration, output should be connected to the ground. The proposed architecture has been illustrated in Figure 7.
Because one of the inputs to the FA cell has been eliminated, this block can be replaced with a Half Adder (HA), which is denoted in Figure 7 as H-Adder. Equations (8), (9), (10), and (11) present the corresponding expressions for the four output signals of this system: With the same definitions for propagation latency, it can easily be concluded that the critical path of the proposed 5-2 compressor belongs to the Carry signal. Starting from the $C_{out1}$ node of the adjacent compressor, the path ends in the HA block. This can be expressed in terms of the gate-level delay ($T'_d$) by means of the following expression:

$$T'_d = t_{Cout1} + t_{HA} + t_{AND} + t'_{MUX} \tag{12}$$

In this equation, $t_{Cout1}$ represents the delay associated with the generation of $C_{out1}$ in the previous compressor block, which is equal to the sum of two different terms.
Again, by considering the 4-2 compressor architecture of [3], the first term belongs to a two-output XOR-XNOR gate. The second term pertains to the latency of two cascaded MUX gates with channel-ready transistors. As a consequence, $t_{Cout1}$ is equal to 1.5Δ.
For an HA block, the propagation latency denoted by $t_{HA}$ is equal to that of a single-output XOR gate (0.75Δ). For the AND gate, however, the combination of a NAND and an inverter gate is needed, which makes the corresponding delay of this gate ($t_{AND}$) equal to Δ. Finally, $t'_{MUX}$ represents the propagation latency of a channel-ready MUX, which is equal to 0.25Δ. The sum of these latencies results in the following value:

$$T'_d = 1.5\Delta + 0.75\Delta + \Delta + 0.25\Delta = 3.5\Delta \tag{13}$$

As Eq. (13) expresses, the gate-level delay of the designed compressor is almost 3.5 XOR logic gates, which is a notable improvement in the speed performance of this block.
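The same bookkeeping can be applied to the 5-2 critical path. The sketch below adds up the four contributions named in the text, in units of Δ; the function and variable names are hypothetical.

```python
from fractions import Fraction


def critical_path_5_2():
    """Sum the four terms of the 5-2 critical path, in units of Delta."""
    t_xor_xnor = Fraction(1)       # two-output XOR/XNOR gate
    t_mux_ready = Fraction(1, 4)   # channel-ready MUX

    # C_out1 of the neighbouring block: one XOR/XNOR gate followed by
    # two cascaded channel-ready MUX gates.
    t_cout1 = t_xor_xnor + 2 * t_mux_ready

    t_ha = Fraction(3, 4)          # half adder: one single-output XOR
    t_and = Fraction(1)            # AND built as NAND plus inverter
    return t_cout1 + t_ha + t_and + t_mux_ready


total = critical_path_5_2()
print(f"T'_d = {float(total)} Delta")
```

Keeping the values as `Fraction` objects avoids floating-point rounding when quarter-Δ terms are summed.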

Simulation results
In order to perform a thorough analysis and comparison of the simulation results for the circuits discussed in this paper, the layouts of the designed circuits were first drawn and their parasitics were extracted. After that, some recently reported 6-2 and 5-2 compressor structures with remarkable results and improvements were selected and redesigned here using gates similar to those employed in this work. Figure 8 illustrates the layouts of the proposed architectures. Figure 8(a) shows the layout of the 6-2 compressor, which occupies an active area of 55×35 µm². The implemented 5-2 compressor, shown in Figure 8(b), needs an active area of 35×30 µm².
In order to resemble the practical situation in the simulation setup, each of the inputs is driven by a buffer circuit. Every output is also passed through a buffer block so that it can drive the next stage without loading the compressor circuit. In addition, the compressor blocks are used in a parallel fashion to cover the critical path, as shown in Figure 9. It must be mentioned that the configuration of Figure 9 is used for the case in which the propagation between two adjacent compressor structures finishes in the second compressor block. If the propagation ripples through three horizontally cascaded compressors, then three compressors must be considered in the simulation environment. This has been assumed for the architectures reported in [9], [10], [12], [13], [14], [18] and [19] (similar to Figure 2). Meanwhile, in the simulation environment, the propagation delay is measured from the point at which the earliest input transition reaches 50% of $V_{dd}$ to the point at which the latest output signal rises to 50% of $V_{dd}$ [7]. The worst case, which gives the longest latency from the inputs to the outputs (considering the signal propagation between the adjacent blocks), is taken as the delay of the critical path.
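The 50% measurement rule described above can be illustrated on sampled waveforms. The sketch below is not the HSPICE setup used for the paper. It is a hypothetical post-processing helper that assumes a shared time axis, uses linear interpolation between samples, and takes the first 50% crossing of each waveform in either direction.

```python
def crossing_time(times, volts, threshold):
    """Return the first time `volts` crosses `threshold`, or None.

    Linear interpolation is used between neighbouring samples.
    """
    for k in range(len(times) - 1):
        v0, v1 = volts[k], volts[k + 1]
        if v0 != v1 and (v0 - threshold) * (v1 - threshold) <= 0:
            frac = (threshold - v0) / (v1 - v0)
            return times[k] + frac * (times[k + 1] - times[k])
    return None


def propagation_delay(times, input_waves, output_waves, vdd=0.9):
    """Delay from the earliest input crossing of Vdd/2 to the latest
    output crossing of Vdd/2."""
    half = vdd / 2
    t_in = [crossing_time(times, w, half) for w in input_waves]
    t_out = [crossing_time(times, w, half) for w in output_waves]
    t_in = [t for t in t_in if t is not None]
    t_out = [t for t in t_out if t is not None]
    if not t_in or not t_out:
        raise ValueError("no 50% crossing found")
    return max(t_out) - min(t_in)


# Made-up example data (time in ps, voltage in V), for illustration only.
times_ps = [0, 100, 200, 300, 400]
inputs = [[0.0, 0.9, 0.9, 0.9, 0.9], [0.0, 0.0, 0.9, 0.9, 0.9]]
outputs = [[0.0, 0.0, 0.9, 0.9, 0.9], [0.0, 0.0, 0.0, 0.9, 0.9]]
delay_ps = propagation_delay(times_ps, inputs, outputs)
```

Taking the earliest input and the latest output gives the worst-case figure, which is the quantity the text associates with the critical path.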

The 6-2 compressor
Post-layout simulations using HSPICE for the TSMC standard 90nm CMOS technology with a 0.9V power supply have been carried out to measure the delay and power consumption of the proposed 6-2 compressor and of the works reported in [10], [18] and [19], which are redesigned and optimized here to make a fair comparison. The result of the comparison for the power-delay measurement of the designed structure and the selected works is shown in Figure 11. The results indicate that the lowest PDP value belongs to the proposed architecture. The design reported in [19], which has a delay of about 481ps, needs 53% more time to produce the output compared to the designed scheme. Also, the measurement results for the power dissipation at the operating frequency of 100MHz demonstrate that the proposed compressor circuit has the lowest power consumption, while the design reported in [10], which shows the lowest power consumption among the competitors, consumes 6% more power than the proposed circuit. Table 1 compares the redesigned 6-2 configurations based on their circuitry and the simulation results obtained by the author.

The 5-2 compressor
As described for the 6-2 compressor, post-layout simulations using HSPICE for the TSMC standard 90nm CMOS technology with a 0.9V power supply have been carried out to measure the delay and power consumption of the proposed 5-2 compressor and of the redesigned architectures reported in [8], [9], [10], [12], [13], and [14]. Figure 12 illustrates the correct functionality of the proposed structure along with the delay measurement, which gives a value of 183ps for the delay of the critical path. The result of the comparison for the power-delay measurement of the proposed design and the redesigned works is shown in Figure 13. The results indicate that the lowest delay value belongs to the designed scheme. Also, the measurement results and comparison for power dissipation at the operating frequency of 100MHz demonstrate that the power consumption of the proposed design is about 97µW, which is not the lowest among the presented works. The designs reported in [13] and [14] have the lowest power consumption. However, the lowest PDP belongs to the proposed scheme. Finally, Table 2 summarizes the comparison between the redesigned 5-2 structures based on their circuitry and the simulation results obtained by the author.
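The Power-Delay Product used in these comparisons is the product of the measured power and delay. The sketch below applies it to the figures reported for the proposed circuits (183ps and 97µW for the 5-2 compressor; 314ps and 269µW for the 6-2 compressor, as stated in the conclusions) and checks the delay ratio against the 481ps design. The function names are hypothetical, and the PDP values of the competing designs are not reproduced here.

```python
def pdp_femtojoules(power_uw, delay_ps):
    """Power-Delay Product in femtojoules.

    1 uW * 1 ps = 1e-18 J = 1e-3 fJ.
    """
    return power_uw * delay_ps * 1e-3


def extra_time_percent(delay_ps, reference_ps):
    """How much longer `delay_ps` is than `reference_ps`, in percent."""
    return (delay_ps / reference_ps - 1.0) * 100.0


pdp_6_2 = pdp_femtojoules(power_uw=269, delay_ps=314)
pdp_5_2 = pdp_femtojoules(power_uw=97, delay_ps=183)
slower = extra_time_percent(delay_ps=481, reference_ps=314)

print(f"PDP of 6-2 compressor: {pdp_6_2:.3f} fJ")
print(f"PDP of 5-2 compressor: {pdp_5_2:.3f} fJ")
print(f"481 ps versus 314 ps: {slower:.0f}% more time")
```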

Conclusions
In this article, a novel 6-2 compressor was introduced, which outperforms the previous designs in terms of the propagation delay of the critical path. The proposed architecture is based on a combinational circuit consisting of a 4-2 compressor and an FA, which has resulted in a 20% speed improvement in comparison with the other works reported in this respect. The design approach for the proposed compressor was also able to reduce the number of transistors used in the circuit. This is another advantage of the presented circuit, which occupies a small area on the chip while the carry rippling problem is reduced by one level.
With some extra modifications, the proposed 6-2 compressor can be used as a 5-2 compressor. In this case, the latency of the critical path has also been reduced significantly. The gate-level delay of the proposed 5-2 and 6-2 compressors is equal to 3.5 and 4 XOR logic gates, respectively, which demonstrates a 15% speed enhancement for the proposed 5-2 architecture compared with the best-reported works. The aforementioned advantages illustrate the capability of the implemented circuits for utilization in high compression multiplication systems.
Simulation results of the proposed circuits using HSPICE for the TSMC 90nm standard CMOS technology and a 0.9V power supply show delay values of 314ps and 183ps for the proposed 6-2 and 5-2 compressors, respectively. The power dissipation of these structures is 269µW and 97µW, respectively, at the operating frequency of 100MHz.