">

FPGA chips are coming on fast in the race to accelerate AI 

AI is hungry, hyperscale AI ravenous. Both can devour processing, electricity, algorithms, and programming schedules. As AI models rapidly get larger and more complex (an estimated 10x a year), a recent MIT study warns that computational challenges, especially in deep learning, will continue to grow.

But there’s more. Service providers, large enterprises and others also face unrelenting pressures to speed up innovation, performance, and rollouts of neural networks and other low-latency, data-intensive applications, often involving exascale cloud and High-Performance Computing (HPC). These dueling demands are driving technology advances and adoption of a growing universe of Field Programmable Gate Arrays (FPGAs).

Early leader gains a new edge

In the early days of exascale computing and AI, these customer-configurable integrated circuits played a key role. Organizations could program and reprogram FPGAs onsite to handle a range of changing demands. As time went on, however, their performance and market growth got outpaced by faster GPUs and specialized ASICs.

Now, innovations like high-speed AI tensor logic blocks, configurable embedded SRAM, and lightning-fast transceivers and interconnects are putting this early leader back in the race. Technology advances provide a great balance of performance, economy, flexibility, and scale needed to handle today’s AI challenges, says Ravi Kuppuswamy, general manager of Custom Logic Engineering at Intel.

“FPGAs offer hardware customization with integrated AI and can be programmed to deliver performance similar to a GPU or an ASIC,” explains Kuppuswamy. “The reprogrammable, reconfigurable nature of an FPGA lends itself well to a rapidly evolving AI landscape, allowing designers to test algorithms quickly, get to market fast, and scale quickly.”

Consider the Intel Stratix 10 NX FPGA. Introduced in June, the company’s first AI-optimized FPGA family was designed to address the rapid rise in AI model complexity. New architectural changes bring the existing Stratix 10 into the same ballpark as GPUs. The new FPGA family delivers up to a 15x increase in operations per second over its predecessor. The boost gives exascale customers a viable FPGA option for quickly developing customized, highly differentiated end products. The new FPGA is optimized for low-latency, high-bandwidth AI, including real-time uses such as video processing, security, and network virtualization.

The ability of FPGAs to deliver higher compute density while reducing development time, power, and total cost of ownership is deepening and expanding their role as the architecture of choice for small- and medium-batch data center AI requiring high performance and heavy data flows.

Global FPGA market to double by 2026

The growing importance is reflected in rising global sales. Grand View Research projects a 9.7% compound annual growth rate (CAGR) from 2020 to 2027. The firm points to several major drivers, including adoption across data centers and HPC systems. Other analysts forecast similar growth, with estimates of global sales between $4 billion and $13 billion, fueled by growing demand in AI and ML. McKinsey expects FPGAs to handle 20% of AI training in 2025, up from nearly nothing in 2017.

FPGA global market 2020-2027

Image credit: Verified Market Research

Analysts agree: FPGAs will have broad appeal across industries, especially wireless communications, cloud service providers (CSPs), cybersecurity systems, aerospace and defense, automotive, and others. Not all adoption will be for AI, but industry watchers say more and more will.

Inside key FPGA innovations

To better understand the appeal, and how advances in FPGAs can help organizations better handle current AI challenges, let’s take a closer look at key innovations in the Intel Stratix 10 NX FPGA.

  1. High-performance AI Tensor (matrix) blocks. AI is computationally intensive. To enhance the arithmetic functionality of the new FPGA, Intel and partner Microsoft rearchitected the device to accelerate data center AI workloads. They replaced the existing embedded DSP (digital signal processing) blocks with a new type of AI-optimized tensor arithmetic block that delivers high compute density.

Explains Deepali Trehan, general manager and senior director of FPGA Product Marketing at Intel: “The challenge was to keep in place all of the good things in the device — memory, logic, routing, transceivers, HBM — and fit the new AI Tensor blocks into the same location that the previous DSP block sat, so the FPGA could be brought into production much quicker, with lower risk.”

The AI Tensor Blocks contain dense arrays of lower-precision multipliers typically used in AI applications. Architects increased the number of multipliers and accumulators to 30 each, up from two in the DSP block. The design is tuned for the common matrix-matrix and vector-matrix multiplications used in a wide range of AI computations and convolutional neural networks (CNNs). A single AI Tensor Block achieves up to 15X more INT8 throughput than the standard DSP block in its predecessor, enabling significantly higher AI performance for both small and large matrix sizes.
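To make the tensor-block idea concrete, here is a minimal C++ software model of that arithmetic pattern: groups of INT8 multiplies feeding a wide accumulator, sized to the 30 multipliers per block cited above. The function and constant names are illustrative, not Intel APIs, and real blocks operate on matrix tiles in hardware rather than a scalar loop.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Minimal software model of the INT8 pattern described above: each pass
// through a "tensor block" performs up to 30 int8 x int8 multiplies (the
// figure quoted in the text) and accumulates into a wide integer register.
constexpr int kMultipliersPerBlock = 30;  // per the article; name is ours

int32_t tensor_block_dot(const int8_t* a, const int8_t* b, int n) {
    int32_t acc = 0;
    for (int base = 0; base < n; base += kMultipliersPerBlock) {
        int top = std::min(n, base + kMultipliersPerBlock);
        int32_t partial = 0;  // one block pass
        for (int i = base; i < top; ++i)
            partial += int32_t(a[i]) * int32_t(b[i]);  // int8 mul, int32 add
        acc += partial;
    }
    return acc;
}

int main() {
    int8_t a[4] = {1, -2, 3, 4}, b[4] = {5, 6, -7, 8};
    printf("%d\n", tensor_block_dot(a, b, 4));  // prints 4
}
```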

  2. Near-compute memory. Integrated 3D stacks of high-bandwidth (HBM2) DRAM memory allow large, persistent AI models to be stored on-chip. That results in lower latency and helps prevent memory-bound performance challenges in large models.

The ability to mix and match components makes it easier to customize a wider range of FPGA chips for a diverse array of AI and hyperscale applications.

Image credit: Intel
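Why near-compute memory matters can be seen with a quick roofline-style estimate: a batch-1 matrix-vector multiply does only about two operations per weight byte, so memory bandwidth, not arithmetic, usually sets the speed limit. The throughput and bandwidth figures below are placeholder assumptions for illustration, not Stratix 10 NX specifications.

```cpp
#include <cstdio>

// Back-of-the-envelope roofline check for a large batch-1 model.
// Both device figures are assumed values, not published specs.
int main() {
    const double peak_tops = 100.0;  // assumed peak INT8 throughput, tera-ops/s
    const double mem_gbps  = 500.0;  // assumed memory bandwidth, GB/s
    // Ops per byte the device can sustain before compute becomes the limit:
    const double machine_balance = (peak_tops * 1e12) / (mem_gbps * 1e9);
    // A batch-1 GEMV touches each INT8 weight once: ~2 ops per byte.
    const double gemv_ops_per_byte = 2.0;
    printf("machine balance: %.0f ops/byte; batch-1 GEMV: %.0f ops/byte -> %s\n",
           machine_balance, gemv_ops_per_byte,
           gemv_ops_per_byte < machine_balance ? "memory-bound" : "compute-bound");
}
```

At these assumed figures, the workload is firmly memory-bound, which is exactly the regime where stacking HBM2 close to the logic pays off.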

  3. High-bandwidth networking and connectivity. Slow I/O can choke AI. What good is super-fast math processing and memory if they’re bottlenecked in the interconnects between chips and chiplets or CPUs and accelerators? So another key advance focuses on reducing or eliminating bandwidth connectivity as a limiting factor in multi-node (“mix-and-match”) FPGA designs.

To speed networking and connectivity, the new Intel Stratix 10 NX adds up to four 57.8 Gbps PAM4 transceivers to implement multi-node AI inference solutions. Multiple banks of high-speed transceivers enable distributed or unrolled algorithms across the data center. The device also incorporates hard IP such as PCIe Gen3 x16 and 10/25/100G Ethernet MAC/PCS/FEC. Support for super-fast CXL, faster transceivers, and Ethernet can be added by swapping out these modular tiles, which connect via EMIB.
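For a rough feel of what those links buy in a multi-node design, the sketch below estimates the time to ship one layer’s activations between two FPGAs. The per-transceiver rate comes from the text; the lane count, payload size, and link efficiency are assumptions.

```cpp
#include <cstdio>

// Rough estimate of one inter-FPGA hop when a model is split across nodes.
int main() {
    const double lane_gbps  = 57.8;  // PAM4 transceiver rate, from the text
    const int    lanes      = 4;     // assumed lanes bonded per hop
    const double payload_mb = 2.0;   // assumed activation tensor size, MB
    const double efficiency = 0.9;   // assumed protocol/encoding efficiency
    const double seconds =
        (payload_mb * 8.0e6) / (lane_gbps * 1e9 * lanes * efficiency);
    printf("~%.0f microseconds per inter-node hop\n", seconds * 1e6);  // ~77 us
}
```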

Taken together, these interlocking innovations let the FPGA better handle larger, low-latency models needing greater compute density, memory bandwidth, and scalability across multiple nodes, while enabling reconfigurable custom functions. 

Tech help for many AI challenges

Technology innovations in today’s FPGAs enable improvements in many common AI requirements:

Overcoming I/O bottlenecks. FPGAs are often used where data must traverse many different networks at low latency. They’re incredibly useful at eliminating memory buffering and overcoming I/O bottlenecks — one of the most limiting factors in AI system performance. By accelerating data ingestion, FPGAs can speed the entire AI workflow.

Providing acceleration for high performance computing (HPC) clusters. FPGAs can help facilitate the convergence of AI and HPC by serving as programmable accelerators for inference.

Integrating AI into workloads. Using FPGAs, designers can add AI capabilities, like deep packet inspection or financial fraud detection, to existing workloads.

Enabling sensor fusion. FPGAs excel when handling data input from multiple sensors, such as cameras, LIDAR, and audio sensors (see the sketch after this list). This ability can be extremely valuable when designing autonomous vehicles, robotics, and industrial equipment.

Adding extra capabilities beyond AI. FPGAs make it possible to add security, I/O, networking, or pre-/post-processing capabilities, along with other data- and compute-intensive functions, without requiring an extra chip.
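As a concrete illustration of the sensor-fusion item above, here is a small C++ sketch of the kind of alignment step an FPGA pipeline might implement in hardware: pairing each camera frame with the LIDAR sweep closest in time before handing a fused record to the AI stage. All data, rates, and names are invented for illustration.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Pair each camera frame with the nearest-in-time LIDAR sweep.
struct Sample { double t_ms; int id; };

int nearest(const std::vector<Sample>& s, double t) {
    int best = 0;
    for (int i = 1; i < (int)s.size(); ++i)
        if (std::fabs(s[i].t_ms - t) < std::fabs(s[best].t_ms - t)) best = i;
    return best;
}

int main() {
    std::vector<Sample> camera = {{0.0, 0}, {33.3, 1}, {66.6, 2}};  // ~30 fps
    std::vector<Sample> lidar  = {{5.0, 0}, {55.0, 1}, {105.0, 2}}; // ~20 Hz
    for (const Sample& f : camera) {
        int j = nearest(lidar, f.t_ms);
        printf("camera frame %d (t=%.1f ms) <-> lidar sweep %d (t=%.1f ms)\n",
               f.id, f.t_ms, lidar[j].id, lidar[j].t_ms);
    }
}
```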

Microsoft expands pioneering use

Exascale cloud service providers are already deploying the latest FPGAs, often on supercomputers. They’re accelerating service-oriented tasks such as network encryption, inference and training, memory caching, webpage ranking, high-frequency trading, and video conversion, while improving overall system performance.

Take Microsoft. In 2010, the company pioneered using FPGAs on Azure and Bing to accelerate internal workloads such as search indexing and software-defined networking (SDN). In 2018, it reported a 95% gain in throughput, an 8x speed increase with 15% less power, and a 29% decrease in latency on Microsoft Azure hardware integrated with Project Brainwave, a deep learning platform for real-time AI inference in the cloud and on the edge.

Today, the company says Microsoft Azure is the world’s largest cloud investment in FPGAs. Microsoft continues expanding its use of FPGAs for deep neural networks (DNN) evaluation, search ranking, and SDN acceleration to reduce latency and free CPUs for other tasks.

Image credit: Microsoft

The FPGA-fueled architecture is economical and power-efficient, according to Microsoft, with very high throughput that can run ResNet 50, an industry-standard DNN requiring almost eight billion calculations, without batching. That means AI customers do not need to choose between high performance and low cost, the company says.
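To put the ResNet 50 claim in perspective, the arithmetic is simple: batch-1 latency is just operations per inference divided by sustained throughput. The throughput figure below is an assumption chosen for illustration, not a published Brainwave number.

```cpp
#include <cstdio>

// Worked batch-1 latency estimate for a ResNet 50-class model.
int main() {
    const double ops_per_inference = 8.0e9;  // "almost eight billion calculations"
    const double sustained_tops    = 4.0;    // assumed sustained tera-ops/s
    const double latency_ms = ops_per_inference / (sustained_tops * 1e12) * 1e3;
    printf("batch-1 latency: ~%.1f ms (~%.0f inferences/s)\n",
           latency_ms, 1000.0 / latency_ms);  // ~2.0 ms, ~500/s
}
```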

The company is continuing its partnership with Intel to develop next-generation solutions for its hyperscale AI. “As Microsoft designs our real-time multi-node AI solutions, we need flexible processing devices that deliver ASIC-level tensor performance, high memory and connectivity bandwidth, and extremely low latency,” explains Doug Burger, technical fellow, Microsoft Azure Hardware.

Top applications for FPGAs

Many data center applications and workloads will benefit from the new AI optimizations in Intel FPGAs. Among them:

Natural Language Processing, including speech recognition and speech synthesis. NLP models are typically large and getting larger. Language translation, a common NLP workload, must detect, recognize, and understand the context of various languages, then translate into the target language. These expanded requirements drive model complexity, which in turn demands more compute cycles, more memory, and more networking bandwidth, all at very low latencies so as not to break a conversational flow. Compared to GPUs, FPGAs excel at handling low-batch workloads (single words or phrases) with low latency and high performance, as the sketch after this list illustrates.

Security, including deep packet inspection, congestion control identification, and fraud detection. FPGAs enable real-time data processing applications where every microsecond matters. The device’s ability to implement custom hardware with direct data ingestion through transceivers, paired with deterministic, low-latency compute elements, enables microsecond-class real-time performance.

Real-time video analytics, including content recognition, video pre- and post-processing, and video surveillance. The new FPGAs excel here because of their hardware customization ability, which allows implementation of custom processing and I/O protocols for direct data ingestion.
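The low-batch advantage mentioned in the NLP item above comes down to queueing: a device that needs large batches to reach peak throughput must wait for enough requests to arrive before it can start. Below is a minimal model of that tradeoff, with all rates assumed for illustration.

```cpp
#include <cstdio>
#include <initializer_list>

// Queueing delay incurred while waiting to fill a batch, plus compute time.
int main() {
    const double arrivals_per_s = 100.0;  // assumed request arrival rate
    const double compute_ms     = 2.0;    // assumed per-batch compute time
    for (int batch : {1, 8, 32}) {
        // Mean wait to gather a batch: (batch - 1) / 2 inter-arrival times.
        double queue_ms = (batch - 1) / 2.0 / arrivals_per_s * 1000.0;
        printf("batch %2d: ~%5.1f ms queueing + %.1f ms compute\n",
               batch, queue_ms, compute_ms);
    }
}
```

At these assumed rates, batch-32 execution adds over 150 ms of queueing before any computation happens, which is exactly what a conversational workload cannot afford.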

Business benefits: Performance and TCO

How do these technological advances translate into specific benefits for organizations? Customer experiences show that optimized FPGAs offer several advantages for deep learning applications and other AI workloads:

High real-time performance and throughput. FPGAs can inherently provide low latency as well as deterministic latency for real-time applications. That means, for example, video can bypass a CPU and be directly ingested into the FPGA. Designers can build a neural network from the ground up and structure the FPGA to best suit the model. In general, the more the FPGA can do with the data before it enters the CPU, the better, as the CPU can then be used for higher priority tasks.

Value and cost. FPGAs can be reprogrammed for different functionalities and data types, making them one of the most cost-effective hardware options available. Furthermore, FPGAs can be used for more than just AI. By integrating additional capabilities onto the same chip, designers can save on cost and board space. FPGAs have long product life cycles, so hardware designs based on them can have long product lives, measured in years or decades. This characteristic makes them ideal for industrial, defense, medical, automotive, and many other applications.

Above: The new FPGAs meet the biggest, expanding needs of today’s service providers and enterprises.

Image credit:  Allied Market Research

Reusability and upgradability are big pluses. Design prototypes can be implemented on an FPGA, verified, and then implemented on an ASIC. If the design has faults, a developer can change the HDL code, regenerate the bitstream, reprogram the FPGA, and test again. While ASICs may cost less per unit than an equivalent FPGA, building them requires non-recurring engineering (NRE) expense, expensive software tools, specialized design teams, and a long manufacturing cycle.

Low power consumption. FPGAs are not usually considered “low power.” Yet on cost per watt, they can match or beat fixed-function counterparts, especially ASICs and ASSPs (application-specific standard products) that have not been optimized. With FPGAs, designers can fine-tune the hardware to the application, helping meet power efficiency requirements. FPGAs also accommodate multiple functions, delivering more energy efficiency from the chip. It’s possible to use a portion of an FPGA for a function rather than the entire chip, allowing the FPGA to host multiple functions in parallel. Besides enabling power savings, the Intel Hyperflex FPGA Architecture also reduces IP size, freeing resources for greater functionality.

Bottom line: FPGAs for high bandwidth, low latency and power

For all their new advantages, FPGAs are not a do-everything chip for AI, notes Jason Lawley, technical marketing director of XPU at Intel. The spatial architecture of FPGAs is ideal for delivering data to customized, optimized and differentiated end products, he says. But as the company’s new vision makes clear, organizations also need scalar, vector, and matrix processors. “This breadth lets companies choose the right balance of power, performance and latency for the workload,” explains Lawley.

Further, selecting the best chip for data center, cloud, or edge is not a one-time choice. “Increasingly, developers will be able to select the right architecture for their challenge, then have the flexibility to change if requirements change.” Intel’s oneAPI, a simplified cross-architecture programming model, ties the different processor architectures together, so software developed for one processor type can be used on another without rewriting. New, scalable, open hardware and software infrastructure will also help developers speed development and deployment.
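Here is a minimal sketch of that cross-architecture idea, using standard SYCL (the open model underpinning oneAPI’s DPC++ compiler): the same kernel source can be dispatched to whatever device the runtime selects, whether CPU, GPU, or FPGA, without rewriting. Real FPGA targets typically use an ahead-of-time compilation flow, so this generic example is illustrative rather than Intel’s specific FPGA recipe.

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    sycl::queue q;  // default selector picks an available device
    printf("running on: %s\n",
           q.get_device().get_info<sycl::info::device::name>().c_str());

    const int n = 1024;
    float* data = sycl::malloc_shared<float>(n, q);  // unified shared memory
    for (int i = 0; i < n; ++i) data[i] = float(i);

    // Device-agnostic kernel: scale every element in place.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        data[i] *= 2.0f;
    }).wait();

    printf("data[10] = %.1f\n", data[10]);  // 20.0
    sycl::free(data, q);
}
```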

Other technological developments are helping drive adoption. Intel’s advanced packaging, including the Embedded Multi-die Interconnect Bridge (EMIB) and the industry-first Foveros 3D stacking technology, is enabling new approaches in FPGA architecture. High-density interconnects enable high bandwidth at low power, with I/O density on par with or better than competitive approaches.

Once an unexciting part in the engineering toolbox, FPGAs are again becoming a popular chip choice for speeding development and processing for low-latency deep learning, cloud, search, and other computationally intensive applications. Today’s FPGAs offer a compelling combination of power, economy, and programmable flexibility for accelerating even the biggest, most complex, and hungriest models.

With workloads expected to increase in both size and breadth over the next decade, smart use of spatial and other architectures will be the key to competitive differentiation and success, especially for exascale companies.
