Software-friendly hardware provides maximum flexibility, opening the door to high-performance data acceleration

In cloud and edge computing, the industry is demanding high performance across a wide range of applications. Hardware accelerators have emerged to meet this need.

The first accelerators adopted by hyperscalers such as Amazon, Facebook and Microsoft were heavily customized designs. These companies have the economies of scale to justify building their own boards, whether based on their own application-specific integrated circuits (ASICs) or on off-the-shelf FPGAs and GPUs. Enterprise data center and edge computing users, by contrast, rarely reach the scale needed to justify the cost and time of such custom chip-level designs. However, designing custom ASICs and boards is not required. The need for standard interfaces such as Ethernet and PCIe makes the use of standard board-level products not only possible, but also desirable.

As a long-term provider of hardware acceleration products, BittWare has accumulated extensive experience designing PCIe form-factor, FPGA-based boards for customers in many fields, from high-performance computing to cloud acceleration to instrumentation. Now, as a subsidiary of Molex Group, BittWare can leverage the group's global supply network and deep relationships with server vendors such as Dell and HP Enterprise. BittWare is the only significant volume supplier that works with multiple mainstream FPGA suppliers to meet the quality certification, validation, product lifecycle management and support needs of enterprise customers looking to deploy FPGA accelerators at scale for mission-critical applications.

An important differentiator for BittWare’s implementation in these applications is the company’s extensive software support for its FPGA-based accelerators. Each accelerator card comes with driver software for Linux and Windows systems, allowing it to be quickly integrated into various systems via PCIe and Ethernet connections. In addition to supporting the communication between the main CPU and the accelerator card, the driver also supports access to the embedded firmware on the accelerator card. This firmware handles numerous management and self-test functions.

These firmware functions allow the FPGA to be reconfigured for new tasks as needed, and they provide monitoring routines for power consumption, voltage, and temperature. If cooling in the host system fails, the firmware, acting as supervisor, can shut down the accelerator card to avoid thermal overload. In addition, the software bundle includes various reference designs so that developers can quickly build configurations to test the functionality of the accelerator card and start working on their own applications.

For the latest generation of accelerator cards, BittWare has worked closely with Achronix. Achronix is the only FPGA supplier that can offer both standalone FPGA chips and embedded FPGA (eFPGA) semiconductor intellectual property (IP). The VectorPath™ S7t-VG6 accelerator card uses Achronix’s Speedster® 7t FPGA chip, built on a 7nm process, and combines many functions to not only provide high-throughput data acceleration internally, but also support the highly distributed, networked architectures required by today’s systems, ranging from machine learning to advanced instrumentation.

Figure 1: VectorPath S7t-VG6 accelerator card

Software-friendly hardware provides maximum flexibility

By providing direct support for distributed architectures, the Speedster7t FPGA chips used in the VectorPath S7t-VG6 accelerator cards mark a significant shift from traditional FPGA architectures, making it easier for software-oriented developers to build custom processing units. This innovative new architecture is radically different from traditional FPGAs from vendors such as Intel and Xilinx, which were not designed with a focus on data acceleration.

In designing the Speedster7t’s architecture, Achronix created an FPGA chip that maximizes system throughput while also improving ease of use for computer architects and developers. A key differentiator of the Speedster7t FPGA chip compared to traditional FPGA architectures is its two-dimensional network-on-chip (2D NoC), which carries data between the processing units within the logic array and the various on-chip high-speed interfaces and memory ports.

Traditional FPGAs require users to design circuits to connect their accelerators to high-speed Ethernet or PCIe data ports and/or memory ports. Typically, a stand-alone system consists of multiple accelerators connected to multiple high-speed ports. For example, Figure 2 illustrates a scenario where two accelerators are connected to two memory ports to share a memory space. This scenario uses FIFOs to manage clock domain crossing (CDC) between the memory clock and the FPGA clock. In addition, a switching function is required in the FPGA logic fabric to manage addressing, arbitration, and backpressure. In traditional FPGAs, this functionality consumes significant FPGA resources and is complex enough to degrade system performance and complicate timing closure.
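To make the arbitration and backpressure roles of that switching function concrete, here is a minimal software sketch. It is a toy model only, not RTL and not taken from any vendor design: the class name `SharedMemoryPort`, the FIFO depth, and the round-robin policy are all assumptions chosen for illustration.

```python
from collections import deque


class SharedMemoryPort:
    """Toy model of one memory port shared by several accelerators (hypothetical)."""

    def __init__(self, num_requesters, fifo_depth):
        # One bounded FIFO per requester stands in for the CDC FIFO.
        self.fifos = [deque() for _ in range(num_requesters)]
        self.fifo_depth = fifo_depth
        self.next_grant = 0  # round-robin pointer

    def submit(self, requester, address):
        """Queue a request; return False (backpressure) if that FIFO is full."""
        fifo = self.fifos[requester]
        if len(fifo) >= self.fifo_depth:
            return False
        fifo.append(address)
        return True

    def grant(self):
        """Serve at most one request per call, rotating priority among requesters."""
        n = len(self.fifos)
        for offset in range(n):
            requester = (self.next_grant + offset) % n
            if self.fifos[requester]:
                self.next_grant = (requester + 1) % n
                return requester, self.fifos[requester].popleft()
        return None  # nothing pending


port = SharedMemoryPort(num_requesters=2, fifo_depth=2)
# Accelerator 0 tries three requests; the third is refused because its FIFO is full.
accepted = [port.submit(0, 0x100), port.submit(0, 0x104), port.submit(0, 0x108)]
port.submit(1, 0x200)
grants = [port.grant() for _ in range(4)]
print(accepted)
print(grants)
```

Even this simplified model needs per-requester queues, a priority pointer, and a full/not-full signal. In a real fabric each of those becomes logic that must be placed, routed, and timed.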

Achronix takes a software-oriented approach to hardware design, in which Ethernet and other high-speed I/O ports can be easily connected to custom accelerator functions using the two-dimensional network-on-chip (2D NoC). With the Speedster7t NoC, designers no longer need to build CDC and switching functions to connect accelerators to high-speed data or memory ports. Simply connecting these functions to the NoC removes the connectivity challenges, which simplifies designs, reduces consumption of FPGA resources, improves performance, and eases timing closure.

Figure 2: Challenges of traditional FPGA design

Figure 3: Speedster7t 2D network-on-chip enables software-friendly hardware

To enable high-performance arithmetic operations, each Speedster7t device features a large array of programmable compute elements organized into machine learning processor (MLP) blocks. The MLP is a highly configurable, compute-intensive block that supports up to 32 multiply/accumulate (MAC) operations per cycle. In accelerator-centric designs, the presence of MLPs enables efficient sharing of resources between fully programmable logic and hard-wired arithmetic units.
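As a rough illustration of what 32 MACs per cycle means for a workload, the sketch below computes a lower bound on the cycles needed for a given number of multiply/accumulates. It assumes the peak rate is sustained and ignores pipeline latency and data movement. The function name `min_cycles` and the MLP count of four are hypothetical.

```python
import math

MACS_PER_CYCLE = 32  # upper bound per MLP quoted in the text


def min_cycles(num_macs, num_mlps=1):
    """Lower bound on cycles for num_macs multiply/accumulates at peak rate."""
    return math.ceil(num_macs / (MACS_PER_CYCLE * num_mlps))


# A 1024-element dot product needs 1024 MACs.
print(min_cycles(1024))     # one MLP: 32 cycles
print(min_cycles(1024, 4))  # four MLPs in parallel (hypothetical count): 8 cycles
```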

While some FPGAs use HBM2 memory, where the FPGA and memory are assembled into an expensive 2.5D package, the Speedster7t family uses the standard GDDR6 memory interface. This interface provides the highest performance achievable with off-chip memory today at a significantly lower cost, making it easier for teams to implement accelerators with high-bandwidth memory arrays. A single GDDR6 memory controller can support 512 Gbps of bandwidth. The VectorPath S7t-VG6 accelerator card provides eight banks of GDDR6 memory, for a total memory bandwidth of up to 4 Tbps. In addition, an on-board DDR4 interface can be used for data that is accessed less frequently or that does not require GDDR6 throughput.
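The aggregate figure follows directly from the per-controller number. The short calculation below assumes that each of the eight banks is served at the full 512 Gbps quoted above. The variable names are illustrative.

```python
GBPS_PER_CONTROLLER = 512  # per GDDR6 controller, from the text
NUM_BANKS = 8              # GDDR6 banks on the card, from the text

total_gbps = GBPS_PER_CONTROLLER * NUM_BANKS
total_tbps = total_gbps / 1000        # decimal units
total_gbytes_per_s = total_gbps / 8   # 8 bits per byte

print(total_gbps)          # 4096
print(total_tbps)          # 4.096, i.e. roughly 4 Tbps
print(total_gbytes_per_s)  # 512.0 GB/s
```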

The VectorPath S7t-VG6 accelerator card provides many high-performance interfaces to support distributed architectures and high-speed host communication. The accelerator card is currently PCIe Gen3 x16 compliant and certified, with a path to Gen4 and Gen5 qualification. For Ethernet connectivity, the accelerator card uses widely supported optical interface modules that handle line rates of up to 400 Gbps in accordance with the QSFP-DD and QSFP56 standards.
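To put the host and network interfaces side by side, the sketch below estimates raw PCIe x16 link bandwidth per generation. The per-lane rates (8, 16, and 32 GT/s) and the 128b/130b encoding come from the PCI Express specifications, not from this article, and the result ignores packet and protocol overhead, so achievable throughput is lower.

```python
# Per-lane transfer rates in GT/s (from the PCIe specifications, not the article).
GT_PER_S = {"Gen3": 8, "Gen4": 16, "Gen5": 32}
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding used by Gen3 and later
ETHERNET_GBPS = 400              # maximum Ethernet line rate quoted in the text


def pcie_gbps(generation, lanes=16):
    """Raw link bandwidth in Gbps, one direction, before protocol overhead."""
    return GT_PER_S[generation] * lanes * ENCODING_EFFICIENCY


for gen in GT_PER_S:
    gbps = pcie_gbps(gen)
    print(f"{gen} x16: {gbps:.1f} Gbps ({gbps / 8:.2f} GB/s), "
          f"matches 400G Ethernet: {gbps >= ETHERNET_GBPS}")
```

Under these assumptions, only a Gen5 x16 link carries more raw bandwidth than a 400 Gbps Ethernet port. This suggests why processing network data on the card, rather than forwarding it all to the host, can be attractive.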

There is also an OCuLink expansion port on the other end of the accelerator card to support many other low-latency application scenarios. For example, OCuLink ports can be used to connect accelerator cards to various peripherals, such as NVMe storage arrays for computational storage or database acceleration applications. An OCuLink connection can be a better choice than routing traffic over the PCIe interface to the host processor, because it provides a highly deterministic connection that avoids system-level latency and jitter. The OCuLink port can also be used to add further network connections, supporting port types other than QSFP-DD or QSFP56.

Figure 4: VectorPath’s network and storage interfaces

The front panel of the VectorPath S7t-VG6 accelerator card also includes multiple clock inputs, which are typically required when synchronizing multiple accelerator cards. Two SMB connectors accept 1 PPS and 10 MHz clock inputs, which pass through a jitter cleaner before entering the FPGA. Once in the FPGA, these clocks can be multiplied or divided to the frequency required for a specific application.
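As a small worked example of the multiply/divide step, the sketch below finds the smallest integer ratio that derives a target frequency from the 10 MHz reference. The 156.25 MHz target is a hypothetical choice for illustration, and a real PLL imposes additional limits on the allowed multiplier, divider, and VCO range that this calculation ignores.

```python
from fractions import Fraction

REFERENCE_HZ = 10_000_000  # 10 MHz reference input, from the text


def clock_ratio(target_hz, reference_hz=REFERENCE_HZ):
    """Return (multiplier, divider) such that reference * multiplier / divider == target."""
    ratio = Fraction(target_hz, reference_hz)  # reduced to lowest terms
    return ratio.numerator, ratio.denominator


# Hypothetical target: 156.25 MHz = 10 MHz * 125 / 8
print(clock_ratio(156_250_000))  # (125, 8)
```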

Further expansion is possible through a general-purpose digital I/O header. This header supports single-ended 3.3V connections and low-voltage differential signaling (LVDS), allowing custom signals such as external clocks, triggers, and dedicated I/Os to connect directly to the Speedster7t FPGA. This expansion port can also be used to retrofit VectorPath accelerator cards to legacy hardware.

Figure 5: VectorPath clock input and GPIO

Suitable for small and large volumes

The VectorPath S7t-VG6 accelerator card is designed with deployment details in mind, such as support for passive air cooling, active air cooling, and liquid cooling. In addition, BittWare and Achronix ensure long-term supply and support for fields such as medical that require longer product lifecycles. In these markets, the short product life cycle of GPU-based PCIe accelerator cards does not match the need for system service support of more than 10 years.

For higher volume requirements, especially in scenarios such as edge computing, customers can use BittWare’s cost reduction program to simplify the hardware so that it supports only the I/O options they need. In addition, BittWare can provide board design files along with rights to use the software and drivers included with the VectorPath S7t-VG6 accelerator card. It is also possible to move towards custom system-on-chip (SoC) devices using Achronix’s Speedcore eFPGA IP. Customers can build their own SoCs that include the programmability of the Speedster7t but have the cost structure of an ASIC.

For faster development and easier deployment, the VectorPath S7t-VG6 accelerator card is available from BittWare as part of a pre-integrated multi-core server, the TeraBox platform. Available in form factors from 2U to 5U, the rack-mountable TeraBox chassis can accommodate up to 16 BittWare PCIe accelerator cards managed by dual-socket Intel Xeon processors. As a complete solution, TeraBox gives customers the fastest route to getting FPGA development up and running. With the BittWorks II and FPGA Devkit software, users can start development work on TeraBox immediately. Alternatively, customers can purchase preconfigured servers that include BittWare accelerator cards from Dell and HP Enterprise.

Figure 6: Deployment of the TeraBox Platform
