Hardware

Adelia: A 4-nm LLM Processing Unit With Streamlined Dataflow and Dual-Mode Parallelism for Maximizing Hardware Efficiency

Adelia: A 4-nm LLM Processing Unit With Streamlined Dataflow and Dual-Mode Parallelism for Maximizing Hardware Efficiency 150 150

Abstract:

The proliferation of large language models (LLMs) as cross-domain foundation models is fueled by aggressive scaling in both parameter counts and inference-time computation. The emergence of sophisticated reasoning models further accelerates this trend, demanding longer context windows and escalating the computational and memory burdens of inference. A fundamental challenge arises …

View on IEEE Xplore

EMO-CIM: An Input/Stationary-Data Similarity-Aware Computing-In-Memory Design for Variable Vector-Wise Computation in Edge Multioperator AI Acceleration

EMO-CIM: An Input/Stationary-Data Similarity-Aware Computing-In-Memory Design for Variable Vector-Wise Computation in Edge Multioperator AI Acceleration 150 150

Abstract:

We propose an edge multioperator computing-in-memory (EMO-CIM) design that supports variable vector-wise multiply-and-accumulate (MAC) in CNN, Depthwise (DW)-Convolution, and Attention operators. It features: 1) a single EMO-CIM bank (ECB) excels in variable vector-wise MAC (V-MAC) for multioperators; 2) merging local input-shared compute units (LISCUs) with a decode-unit and adder-tree (DUAT) facilitates …

View on IEEE Xplore

A Multiply-and-Accumulate SAR-ADC-Based Hybrid Slepian Beamformer

A Multiply-and-Accumulate SAR-ADC-Based Hybrid Slepian Beamformer 150 150

Abstract:

This article introduces a hybrid Slepian beamforming receiver architecture with low power and area costs. Traditional large-scale true-time-delay (TTD) beamformers for wideband wireless communication suffer from high power consumption and high hardware costs. As an alternative, the Slepian beamforming approach reduces the number of analog-to-digital conversions (ADCs) and delays for …

View on IEEE Xplore

SparseCol: A 1320 BTOPS/W Precision-Scalable NPU Exploiting Training-Free Structured Bit-Level Sparsity and Dynamic Dataflow

SparseCol: A 1320 BTOPS/W Precision-Scalable NPU Exploiting Training-Free Structured Bit-Level Sparsity and Dynamic Dataflow 150 150

Abstract:

Bit-serial computation enables sequential processing of data at the bit level, providing several advantages, such as scalable computational precision. This approach has gained significant attention, especially for exploiting bit-level sparsity (BLS) in AI workloads. While current bit-serial processors leverage BLS to eliminate the computation associated with zero bits, they face …

View on IEEE Xplore

A 28-nm Digital Compute-in-Memory Ising Annealer With Asynchronous Random Number Generator for Traveling Salesman Problem

A 28-nm Digital Compute-in-Memory Ising Annealer With Asynchronous Random Number Generator for Traveling Salesman Problem 150 150

Abstract:

This work presents a compact digital compute-in-memory (DCIM) Ising annealer targeting large-scale combinatorial optimization. A centroid-based weight mapping method combined with hierarchical clustering reduces the memory capacity required for traveling salesman problem (TSP) weights, enabling efficient mapping with limited on-chip storage. An asynchronous random number generator (ARNG) based on dual …

View on IEEE Xplore

MEGA.mini: An Energy-Efficient NPU Leveraging a Novel Big/Little Core With Hybrid Input Activation for Generative AI Acceleration

MEGA.mini: An Energy-Efficient NPU Leveraging a Novel Big/Little Core With Hybrid Input Activation for Generative AI Acceleration 150 150

Abstract:

This article presents a processor for the acceleration of generative AI (GenAI) based on a novel heterogeneous core architecture called MEGA.mini. The processor introduces three algorithmic features: 1) fixed-point (FXP) and floating-point (FP) hybrid input activation (IA) representation; 2) a delayed-statistics-based normalization (NORM); and 3) conditional polynomial-based nonlinear activation (NLA) approximation. These …

View on IEEE Xplore

A Calibration-Free Pipelined-SAR ADC With Cross-Stage Gain-Mismatch Error Shaping and Inherent Noise Shaping

A Calibration-Free Pipelined-SAR ADC With Cross-Stage Gain-Mismatch Error Shaping and Inherent Noise Shaping 150 150

Abstract:

This article presents a calibration-free pipelined-successive-approximation-register (SAR) analog-to-digital converter (ADC) based on the proposed cross-stage gain-mismatch-error shaping (CS-GMES) mechanism. The CS-GMES is realized by including the entire 2nd stage into MES operation to unify the gain error and the 2nd-stage mismatch error. A feedback capacitor provides cross-stage connection and mismatch …

View on IEEE Xplore

Space-Mate: A 303.5-mW Real-Time Sparse Mixture-of-Experts-Based NeRF-SLAM Processor for Mobile Spatial Computing

Space-Mate: A 303.5-mW Real-Time Sparse Mixture-of-Experts-Based NeRF-SLAM Processor for Mobile Spatial Computing 150 150

Abstract:

Simultaneous localization and mapping (SLAM) provides crucial ego-pose information and 3-D maps of the user environment, which are fundamental to emerging mobile spatial computing devices. Dense 3-D mapping and accurate pose estimation are particularly necessary for applications like augmented reality (AR) and autonomous navigation. However, existing SLAM processors are typically …

View on IEEE Xplore

A Microscaling Multi-Mode Gain-Cell Computing-in-Memory Macro for Advanced AI Edge Device

A Microscaling Multi-Mode Gain-Cell Computing-in-Memory Macro for Advanced AI Edge Device 150 150

Abstract:

The microscaling (MX) format is an emerging data representation that quantizes high-bitwidth floating-point (FP) values into low-bitwidth FP-like values with a shared-scale (SS) exponent. When implemented with computing-in-memory (CIM), MX allows an attractive tradeoff between accuracy and hardware efficiency for specific neural network (NN) workloads. This work presents the first …

View on IEEE Xplore