Energy-Efficient and Low Dynamic Power TCAM on FPGA

Abstract: Ternary Content Addressable Memories (TCAMs) implemented on Field Programmable Gate Arrays (FPGAs) are widely used in artificial intelligence (AI) and networking applications. TCAM macros are unavailable within the FPGA; therefore, they must be emulated using SRAM-based memories, which consumes FPGA resources. Compared to state-of-the-art designs, the proposed FPGA-based TCAM implementation saves significant resources. The methodology uses Lookup Table RAMs (LUTRAMs) and slice carry-chains for simultaneous mapping of rules, and flip-flops (FFs) for deeper pipelining. The implementation achieves lower power consumption, lower delay, and lower resource utilization. It outperforms conventional FPGA-based TCAMs in energy efficiency (EE) and performance per area (PA) by at least 3.34 and 8.4 times respectively, and is 56% better than existing FPGA designs. Owing to its low dynamic power consumption, the proposed method outperforms all previous approaches for large TCAM emulations on SRAM-based FPGAs.


Introduction
Artificial intelligence (AI) is speeding up and becoming more accurate and reliable. Centralized servers connect applications from the edge to the cloud. Due to the rapid growth of internet-connected devices and the increase in internet traffic, today's systems require very fast searches. Routers are key components of networking equipment for Internet Protocol (IP) routing and forwarding. Routers receive a packet of data and decide where to route it; they must provide fast packet routing by searching through large amounts of data. High-speed searches are also required in CPUs, database engines, and neural networks.
The latest Xilinx and Intel FPGA chips are increasingly being used as data-plane accelerators for Software Defined Networking (SDN) [1]. The FPGA industry continually launches software development toolkits to process and classify packets quickly and efficiently [2]. Ethernet/IP forwarding, firewalls, and Quality of Service (QoS) require packet processing and classification. Three types of matching techniques are used in classification: Longest Prefix Matching (LPM) [3], Matching with Wildcards [4], and Exact Matching (EM) [5]. Matching with wildcards is the most challenging task.
Switching, routing, QoS tables, and Access Control Lists (ACLs) are all stored in a high-speed memory to allow for forwarding decisions and limits. These lookup memories contain result information, such as whether a packet with a particular destination IP address should be dropped according to an ACL. Cisco Catalyst switches use specialized memory architectures, called CAMs and TCAMs, to store these tables.

Related works
Content Addressable Memories (CAMs) [6] deal only with binary digits (0s and 1s), whereas Ternary Content Addressable Memories (TCAMs) deal with 0s, 1s, and x, where "x" represents don't care. TCAMs are not available inside FPGAs, so they must be emulated using memory and logic resources, which leads to a significant resource overhead. Researchers have consequently been working on reducing the resource consumption of FPGA-based TCAMs. TCAMs are made up of three basic parts: storage memory, a priority encoder, and match logic. A major cost component on SRAM-based FPGAs is the storage memory, which holds the actual TCAM contents to be searched. Bosshart et al. [7] optimize the storage memory needs of TCAMs by combining dual-output LUTs and partial reconfiguration, saving many storage memory resources.
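The CAM/TCAM distinction above can be stated concretely: a ternary rule with don't-care bits matches a binary key wherever all of the specified bits agree. A minimal software model (purely illustrative; the rule strings are our own, not from the paper):

```python
# Illustrative model of ternary matching: a rule is a string over {'0','1','x'},
# where 'x' is a don't-care position; a key is a binary string of the same width.
def rule_matches(rule: str, key: str) -> bool:
    """Return True iff every non-'x' bit of the rule equals the key bit."""
    assert len(rule) == len(key)
    return all(r in ('x', k) for r, k in zip(rule, key))

# A CAM rule (binary only) requires an exact match; a TCAM rule with 'x'
# covers several keys at once, e.g. '10x1' matches both '1001' and '1011'.
print(rule_matches('10x1', '1001'))  # True
print(rule_matches('10x1', '1101'))  # False
```

A hardware TCAM evaluates this predicate for all stored rules in parallel in a single cycle, which is exactly what the emulated match logic discussed below has to reproduce.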
Match logic generates a flag for each incoming key, and this consumes a significant amount of resources because it must be done simultaneously for all memory locations at high speed. Ullah et al. [8] propose a novel idea for efficiently mapping the matching logic on Xilinx FPGAs by exploiting the built-in carry-chain resources. As the size of the key and the number of rules to be stored in the TCAM increase, the storage and matching logic requirements increase as well. As a result, it is worth optimizing both storage and matching logic resources at the same time.
Reviriego et al. [23] used 5 × 2 LUTs to emulate TCAMs, exploiting the reconfiguration capabilities of modern SRAM-based FPGAs for storing and updating TCAM rules. Compared to PR-TCAM [23], BPR-TCAM [8] uses the slice built-in carry-chain to reduce the matching logic in TCAMs. Both approaches rely on partial reconfiguration for updating the stored TCAM rules. Another resource for TCAM emulation, in addition to LUTRAM, is distributed RAM [22], [24], [25]. Ullah et al. [21] used distributed RAMs in a 6 × 1 configuration, allocating resources in the same slice to obtain greater performance per area (PA), in addition to using carry chains for match-logic reduction.
The D-TCAM [26] structure uses LUTRAMs in a 6 × 1 Xilinx template to store TCAM contents and applies fine-grained pipelining through the built-in slice registers to gain higher throughput (TP). Previous work used LUTRAMs in a 5 × 2 configuration together with all of the FFs in the SLICEM to improve throughput and performance per area.
To implement wider TCAM words, the partial match results must be transferred from the current slice's carry chain to the next slice's carry chain [24]. By implementing the TCAM ANDing logic in the carry chain, it is possible to achieve the desired TCAM bit density while saving a significant amount of LUT resources; this increases performance per area by at least 67 percent and energy efficiency by at least 2.5 times. Frac-TCAM [27] utilizes RAM32M to construct an 8 × 5 TCAM compared to the 4 × 6 TCAM used in DURE, thus almost doubling the utilization density. Moreover, LUTRAM outputs can be pipelined via in-slice registers. In comparison to existing approaches, logic utilization and TP are enhanced, resulting in improved PA.
By combining BRAM and LUTRAM, Comp-TCAM [28] can implement the TCAM architecture regardless of the type of memory and can be adapted to meet system requirements. A decrease of 41.6% in hardware resource utilization has no effect on functionality.
In this paper, a TCAM emulation on Xilinx SRAM-based FPGAs is presented that achieves a storage reduction in LUTRAMs and a match-logic reduction in logic resources. The match bits from the distributed RAMs are efficiently AND-cascaded using the FPGA's built-in carry chains. Ullah et al. [8] used one built-in carry chain for matching a single rule, which can map only single-output LUTRAM matching logic. The proposed work uses one built-in carry chain for matching two rules, i.e., the dual outputs of each LUT are connected to the carry logic, in contrast to LH-CAM [10]. Because no extra logic or routing resources are used, this shortens the delay and raises the design clock rate.
The main contributions of the paper are listed below:

1. An FPGA resource-saving TCAM emulation scheme that significantly reduces the resources needed to emulate an individual TCAM.

2. The mapping of two rules using dual-output LUTs, with the built-in carry-chain implementing the match logic. Thus, no additional logic or routing resources are required for the matching logic, which reduces the delay and achieves a high clock speed.

3. A TCAM design that is scalable in terms of lookup rate, power consumption, device utilization, and energy efficiency.

Proposed TCAM architecture
Consider TCAM emulation on SRAM-based FPGAs. For example, let N = 4 and W = 4, i.e., a 4 × 4 TCAM, where W denotes the key size or width and N denotes the depth. The key size is 4, and each 4 × 1 SRAM has two input address lines. The TCAM can be divided into two blocks, as shown in Fig. 1, and each of the four rules r0, r1, r2, and r3 is mapped to a 4 × 1 SRAM. The top block is indexed by b0 and b1, whereas the bottom block is indexed by b2 and b3. The outputs of the SRAMs are combined using AND gates known as match logic. The choice of SRAM implementation primitives, as well as their width and depth extension, is essential to efficient TCAM designs on FPGA. A 5-bit portion of the incoming key serves as the keyword and is connected to the 5-bit LUT input (A4:A0). Rule 6 is stored in memory M1, and Rule 7 is stored in memory M2. The rules are updated using the write-address inputs, as shown in Fig. 2. The LUT output is connected to the built-in carry-chain that implements the match logic. It is worth mentioning that the proposed TCAM uses a single carry-chain, reducing the routing resources and extra logic needed, which results in a higher design clock rate and lower delay.
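The 4 × 4 example above can be sketched in software. The model below (illustrative only; the rule set is hypothetical) precomputes each rule's 4 × 1 SRAM contents per block from its ternary pattern, then ANDs the two block outputs exactly as the match logic does:

```python
# Sketch of SRAM-based emulation for a 4 x 4 TCAM (N = 4 rules, W = 4 bits),
# following the two-block split described above.
def build_sram(rule_bits: str):
    """One 4 x 1 SRAM per rule per block: entry i holds 1 iff the 2-bit
    address i matches this rule's 2-bit slice ('x' = don't care)."""
    return [int(all(r in ('x', b) for r, b in zip(rule_bits, format(i, '02b'))))
            for i in range(4)]

def tcam_search(rules, key: str):
    """Return the per-rule match flags for a 4-bit binary key."""
    hi, lo = int(key[:2], 2), int(key[2:], 2)   # b0b1 | b2b3 split
    flags = []
    for rule in rules:
        top = build_sram(rule[:2])              # top block, indexed by b0, b1
        bot = build_sram(rule[2:])              # bottom block, indexed by b2, b3
        flags.append(top[hi] & bot[lo])         # AND gate = match logic
    return flags

rules = ['10x1', '0xx0', '1111', 'xxxx']        # hypothetical rule set
print(tcam_search(rules, '1011'))               # -> [1, 0, 0, 1]
```

In hardware the SRAM lookups for all rules happen in parallel each cycle; the per-rule flags then feed the priority encoder. The same precompute-then-AND structure scales to the 8 × 5 LUTRAM blocks used in the proposed design.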
A 1 × 20 TCAM is implemented as shown in Fig. 3. Fig. 5 shows the proposed TCAM update logic (highlighted in blue dots). The Write Enable ("WE") line is shared and connected to all the LUTRAM blocks. In the current write cycle, the "WE" lines are demultiplexed with the row ID to determine which row needs to be updated.
For columns with the same key lines, the column update logic takes care of blocks in the same column. Serial shift registers are implemented as SRL32s in the SLICEM for the column update logic. For depths varying from 64 to 1024 with 20 columns, only 32 SRL32s are required; similarly, for depths varying from 64 to 1024 with 40 columns, only 64 SRL32s are required. Note that when the key size is increased from 20 to 40, the SRL32 utilization doubles. An incoming key value is compared with the 5-bit global counter, and the binary value is written into the SRL32.
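The update addressing described above can be modeled in a few lines (the function names are ours, not the paper's): the shared WE line is demultiplexed by row ID so that exactly one row's LUTRAM blocks see the write enable in a given cycle, and the SRL32 count for the column logic depends only on the key width, not on the depth, per the utilization figures quoted above:

```python
# Minimal model of the row-update demultiplexing: only the addressed row
# receives the shared write enable in the current write cycle.
def demux_we(we: int, row_id: int, depth: int):
    """Return the per-row write-enable vector."""
    return [we if r == row_id else 0 for r in range(depth)]

# SRL32 scaling quoted in the text: 32 SRL32s per 20-bit key slice,
# independent of depth (64..1024 rows reuse the same column logic).
def srl32_count(key_width: int) -> int:
    return 32 * key_width // 20

print(demux_we(1, 2, 4))                 # -> [0, 0, 1, 0]
print(srl32_count(20), srl32_count(40))  # -> 32 64
```

This mirrors why doubling the key width from 20 to 40 doubles the SRL32 utilization while leaving the row-update path unchanged.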

Results and discussion
The proposed TCAM architecture is implemented on the Xilinx Virtex-7, 28-nm, XC7V2000TFHG1761-2L FPGA device with a −2 speed grade. The Virtex-7 FPGA from Xilinx supports eight FFs within a single SLICE. To maximize the utilization of SLICE resources, the proposed TCAM fully exploits the FFs available in the SLICE. With this approach, a pipelined TCAM architecture can be implemented without the use of additional SLICEs. The speed and dynamic power consumption achieved by the proposed TCAM are shown in Table 2 for the different TCAM configurations. The design inserts registers between the input and the TCAM module, as well as between the TCAM module and the reduction logic.
The proposed TCAM achieves speeds from 372 to 888 MHz for different sizes. Speed degrades only minimally as the size increases, and the degradation does not double when the TCAM's size doubles. For example, when moving from 64 × 20 to 128 × 20 and from 64 × 160 to 128 × 160, the speed decreases by 15.16 and 35 MHz respectively. Similarly, when moving from 64 × 20 to 64 × 40 and from 64 × 80 to 64 × 160, the speed decreases by 202 and 75.1 MHz respectively. As Table 2 shows, the TCAMs proposed in this paper scale well with size in terms of FPGA resource utilization and clock speed. Vivado's power analyzer reports these values for default switching activity after post-implementation. From Table 2 it is evident that, as the TCAM size increases, the power consumption also increases: a 64 × 20 configuration consumes 7 mW of dynamic power, while a 512 × 160 configuration consumes 122 mW.
Table 3 compares the proposed TCAM architecture with state-of-the-art FPGA TCAMs in terms of normalized slices, normalized speed, PA, TP, update rate, energy per bit, and EDP. The number of normalized slices is found as

Normalized slices = No. of SLICEs / TCAM size (bits). (1)

The normalized speed is calculated to provide a fair comparison between different FPGA technology nodes:

Normalized speed (MHz) = Speed (MHz) × Technology node (nm) / 28 nm. (2)

Throughput (TP), another important factor in TCAMs, is calculated as

TP (Gbit/s) = Clock rate (GHz) × Key width (bits). (3)

The proposed work has throughputs of 26.64, 50, and 83 Gbit/s, which are better than the existing work for the TCAM sizes of 512 × 40, 512 × 80, and 512 × 160, as seen in Table 3. The update rate is defined as the ratio of the clock rate (MHz) to the clock cycles per update, and its unit is million updates per second (MUPS):

Update rate (MUPS) = Clock rate (MHz) / Clock cycles per update. (4)

In the literature, performance per area (PA) is represented mathematically as

PA = TP / Normalized slices. (5)

The PA results in Table 3 show that the proposed TCAM implementation outperforms prior work by a narrow margin. The lower resource usage is due to the search and matching logic. Then the energy/bit/search (Ebs) is calculated as

Ebs = Dynamic power × Delay / TCAM size (bits). (6)
Another important parameter for comparing TCAMs is the Energy Delay Product (EDP), determined using the following equation:

EDP = Energy × Delay. (7)

It is observed that the proposed TCAM is 22%, 21%, 24%, and 26% more efficient than Frac-TCAM in slice resource utilization for the different TCAM sizes. Compared to state-of-the-art designs, the proposed TCAM has lower slice utilization due to its use of the slice carry chain. Compared with Frac-TCAM, BPR-TCAM, DURE, D-TCAM, and Comp-TCAM, the proposed TCAM achieves a higher clock speed for the different TCAM sizes due to the built-in slice carry chain, SLICEM registers, and RAM32M.
The 512 × 40 TCAM has a dynamic power consumption of 34 mW and a delay of 1.5 ns; thus, the energy consumption is 16.60 fJ/bit/search and the EDP is 24.9 ns·fJ/bit/search. The EDP achieved in the proposed work is 3.37 and 8.4 times lower than that of DURE [24] and UE-TCAM [17] respectively, and is the lowest among the various FPGA-based TCAM architectures. The 1024 × 160 TCAM is larger, using 190 mW of dynamic power with a delay of 2.38 ns; its EDP is therefore 27.60 ns·fJ/bit/search, almost 46 times less than that of the 150-kbit TCAM implementation in [22]. Thus, the proposed work is also a very energy-efficient TCAM architecture.
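The 512 × 40 figures above can be cross-checked numerically. The snippet below uses only numbers stated in the text (34 mW, 1.5 ns, 16.60 fJ/bit/search); the clock rate is our assumption, inferred as the reciprocal of the 1.5 ns delay:

```python
# Cross-check of the reported 512 x 40 metrics (illustrative; the clock rate
# is inferred from the 1.5 ns delay, not taken from Table 2 directly).
delay_ns = 1.5
f_ghz    = 1.0 / delay_ns          # ~0.667 GHz implied clock rate
W        = 40                      # key width in bits

tp_gbps  = f_ghz * W               # TP = clock rate x key width (Eq. 3)
ebs_fj   = 16.60                   # energy/bit/search reported in the text
edp      = ebs_fj * delay_ns       # EDP = energy x delay (Eq. 7)

print(round(tp_gbps, 2))           # ~26.67, close to the reported 26.64 Gbit/s
print(round(edp, 2))               # -> 24.9 ns.fJ/bit/search, as reported
```

The EDP reproduces the reported 24.9 ns·fJ/bit/search exactly, and the throughput lands within rounding of the reported 26.64 Gbit/s, consistent with the definitions given earlier.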

Conclusion
An FPGA implementation of a TCAM that uses SRAM to achieve higher energy and resource efficiency is presented. By leveraging the architecture of Xilinx FPGAs, TCAMs can be emulated efficiently. Utilizing dual-output LUTRAMs within the latest 7-series FPGAs, together with built-in slice registers and carry chains, a scalable TCAM architecture is proposed. Compared to the conventional 4 × 6 TCAM, the proposed design maps an 8 × 5 TCAM per slice, virtually doubling the utilization density. In addition, the use of in-slice registers to pipeline the LUTRAM outputs allows high-speed operation, and the use of carry-chain logic for match reduction achieves lower slice utilization. Hence, both logic utilization and TP are enhanced, resulting in a better PA compared with existing approaches. The design achieves an EE and PA at least 3.34 and 8.4 times higher, and 56% better, than those of other FPGA-based TCAM solutions. For large TCAM emulations on SRAM-based FPGAs, this solution outperforms the existing solutions with its low dynamic power consumption.

Figure 2: Architecture for mapping LUTs to a carry chain. LUTs (LUTA, LUTB, LUTC, and LUTD) are stacked with different keywords [19:0] and hold eight different rules through O5 and O6. These eight rules are connected to a single carry-chain logic, i.e., LUT output O6 is connected to the select line of the carry-chain logic, and LUT output O5 is connected to the input of the same carry-chain logic. With the proposed TCAM, six-input LUTs in dual-output mode are combined with eight flip-flops and the carry chain in a single slice to provide an 8 × 5 configuration (compared to 4 × 6 for single-output LUTs). As shown in Fig. 2, O5 is connected to the select signal of the carry chain through D5FFMUX, D5FF, and DOUTMUX, and LUT output O6 is connected to the data inputs of the carry chain via DCY0, MUXCY, D5FFMUX, and D5FF. In this manner, a fully pipelined TCAM structure is designed, resulting in improved performance such as TP and EDP, while resource utilization is the same as in a non-pipelined structure. TCAMs with large dimensions can combine multiple basic blocks. To increase the depth of a TCAM, more basic blocks are stacked vertically, where each basic block implements a 1 × 20 TCAM; all the basic blocks share the same keyword. As shown in Fig. 4(c), the TCAM's width can also be extended by configuring multiple basic blocks with the same depth simultaneously to produce the final match signals.
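The carry-chain AND reduction in Figure 2 can be sketched functionally: each MUXCY stage propagates the incoming carry when its select (the LUT match bit) is 1 and kills it otherwise, so the final carry-out is the AND of all per-LUT match bits with no logic LUTs spent on AND gates. This is a simplified behavioral model, with the MUXCY data input tied to 0 for the pure AND case (in the actual slice it comes from the other LUT output):

```python
# Behavioral sketch of the MUXCY-based AND cascade used for match reduction.
def carry_chain_and(match_bits, carry_in=1):
    """Final carry-out = AND of all match bits (and the initial carry-in)."""
    carry = carry_in
    for sel in match_bits:
        # MUXCY stage: carry_out = carry_in if sel else data_in (tied to 0 here)
        carry = carry if sel else 0
    return carry

print(carry_chain_and([1, 1, 1, 1]))  # -> 1 (all LUT slices matched)
print(carry_chain_and([1, 0, 1, 1]))  # -> 0 (one mismatch kills the carry)
```

Because the cascade rides the dedicated carry routing rather than general fabric routing, the reduction adds essentially no routing delay, which is why the text credits it for the higher clock rate.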

Figure 4: (a) Architecture of the basic block combining four LUTs and a carry chain into a 1 × 20 TCAM; (b) depth extension; (c) width extension.

A 3-bit counter inside the SRL fill logic controls the demultiplexer, and it increments once for every 33 increments of the global 5-bit counter.

Figure 5 :
Figure 5: Architecture of the proposed TCAM with update logic.

Table 1: Resource utilization for the proposed TCAM.

A SLICE capable of implementing an 8 × 5 TCAM is the fundamental building block; as a result, keys are multiples of 5 and rules are multiples of 8. The results are based on post-place-and-route implementation. The TCAM storage and update logic resources required for the different configurations, i.e., 512 × 20, 512 × 40, 512 × 80, 512 × 160, and 1024 × 160, are given in Table 1. The table shows that the storage part of the TCAM uses far more resources than the update logic. The proposed TCAM utilizes only three FPGA resource types: LUTRAMs for storing TCAM rules, FF registers for deeper pipelining, and the slice carry chain for the match logic. It is important to note that no logic LUTs are needed to implement the AND gates, since the rules are linked through the LUT carry-chains. Resource utilization is directly related to the TCAM's size. It should be noted that the LUT as logic for
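Since one SLICE implements an 8 × 5 TCAM block, the storage slice count for any configuration follows directly. The estimate below is a rough sketch of that scaling only (it ignores the update logic and any placement overhead, so it will undercount relative to Table 1):

```python
import math

# Rough storage-slice estimate, assuming one SLICE = one 8 x 5 TCAM block
# as stated above (update logic and rounding overheads not included).
def storage_slices(depth: int, width: int) -> int:
    return math.ceil(depth / 8) * math.ceil(width / 5)

for n, w in [(512, 20), (512, 40), (512, 80), (512, 160), (1024, 160)]:
    print(f'{n} x {w}: {storage_slices(n, w)} SLICEs')
```

The linear growth in both dimensions matches the observation that resource utilization is directly related to the TCAM's size: doubling either the depth or the width doubles the storage-slice estimate.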