Adelia: A 4-nm LLM Processing Unit With Streamlined Dataflow and Dual-Mode Parallelism for Maximizing Hardware Efficiency

Abstract:

The proliferation of large language models (LLMs) as cross-domain foundation models is fueled by aggressive scaling in both parameter counts and inference-time computation. The emergence of sophisticated reasoning models further accelerates this trend, demanding longer context windows and escalating the computational and memory burdens of inference. A fundamental challenge arises …

View on IEEE Xplore

A 7.5-μW 35-Keyword End-to-End Keyword Spotting System With Random Augmented On-Chip Training

Abstract:

Fully integrated keyword spotting (KWS) systems designed for low-power operation face two major challenges. First, increasing the number of supported keywords significantly raises system complexity and power consumption. Second, most existing systems are not personalized to individual users, as they are trained on data from native English speakers, leading to …

View on IEEE Xplore

An Electrophysiology-Optogenetics Closed-Loop Bi-Directional Neural Interface for Sleep Regulation With 0.2-μJ/class Multiplexer-Based Neural Network

Abstract:

This work proposes MUXnet, a multiplexer-based, multiplier-free neural network (NN) structure applicable to the implementation of all inner product-based NN layers. An on-chip MUXnet-based neural signal processing unit (NSPU) was designed, achieving a state-of-the-art accuracy of 82.4% on a public human sleep staging dataset, with the lowest …
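The abstract above only names the idea of a multiplexer-based, multiplier-free inner product; the paper's actual circuit is not shown here. As a minimal sketch, assuming weights are constrained to the ternary codes {-1, 0, +1}, each multiply can be replaced by a 3-way select (a multiplexer choosing +x, 0, or -x):

```python
def mux_inner_product(x, w):
    """Multiplier-free inner product for ternary weights {-1, 0, +1}.

    Instead of computing xi * wi, a 3-way select (multiplexer)
    picks one of {+xi, 0, -xi} and feeds it to the accumulator.
    """
    acc = 0
    for xi, wi in zip(x, w):
        if wi == 1:        # MUX selects +xi
            acc += xi
        elif wi == -1:     # MUX selects -xi
            acc -= xi
        # wi == 0: MUX selects nothing (contributes 0)
    return acc
```

Because every inner-product-based layer (dense, convolutional, attention projections) reduces to this accumulate-of-selected-operands pattern, a single such datapath can, in principle, serve all of them.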

View on IEEE Xplore

HUTAO: A Reconfigurable Homomorphic Processing UniT With Cache-Aware Operation Scheduling

Abstract:

Fully homomorphic encryption (FHE) enables privacy-preserving machine learning (PPML) at the cost of intensive computational overhead, which necessitates the use of domain-specific accelerators. To achieve comprehensive support for leveled FHE, this article presents a reconfigurable multi-scheme FHE processor that supports both client-side encryption/decryption and server-side evaluation. First, a reconfigurable …

View on IEEE Xplore

A 3-D HBI Compliant 1.536 TB/s/mm2 Bandwidth Scalable Attention Accelerator With 22.5-GOPS Throughput High Speed SoftMax for Quantized Transformers in Intel 3

Abstract:

This letter presents a novel hardware accelerator compatible with <3-μm pitch 3-D Cu-Cu hybrid bonding interconnect (HBI) technology, particularly designed to efficiently execute multihead attention (MHA) of encoder transformer models. We present an accelerator that addresses performance losses due to low precision models by incorporating specialized hardware optimizations …
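The title mentions a high-speed SoftMax unit; the paper's circuit-level implementation is not reproduced here. A reference point for what any hardware softmax must compute is the numerically stable form, which subtracts the running maximum before exponentiating so the exponent range stays bounded (a standard trick in hardware softmax units):

```python
import math

def stable_softmax(logits):
    """Numerically stable softmax: subtract the max logit first.

    exp(v - m) is mathematically equivalent to exp(v) up to a common
    factor that cancels in the normalization, but never overflows.
    """
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Quantized accelerators typically approximate the exponentials with lookup tables or shift-based powers of two, but the max-subtract-normalize structure above is what such datapaths implement.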

View on IEEE Xplore

MEGA.mini: An Energy-Efficient NPU Leveraging a Novel Big/Little Core With Hybrid Input Activation for Generative AI Acceleration

Abstract:

This article presents a processor for the acceleration of generative AI (GenAI) based on a novel heterogeneous core architecture called MEGA.mini. The processor introduces three algorithmic features: 1) fixed-point (FXP) and floating-point (FP) hybrid input activation (IA) representation; 2) a delayed-statistics-based normalization (NORM); and 3) conditional polynomial-based nonlinear activation (NLA) approximation. These …
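The abstract names a fixed-point/floating-point hybrid input-activation representation but the truncated text does not describe its mechanism. A toy illustration of the general hybrid idea, assuming (hypothetically) that activations whose quantized value fits the fixed-point range take the cheap integer path while rare outliers keep floating-point precision:

```python
def hybrid_encode(x, scale=0.1, fxp_bits=8):
    """Tag each activation as FXP (fits fixed-point range) or FP (outlier).

    scale and fxp_bits are illustrative parameters, not taken from
    the paper; the split criterion here is simply range overflow.
    """
    lo, hi = -(2 ** (fxp_bits - 1)), 2 ** (fxp_bits - 1) - 1
    encoded = []
    for v in x:
        q = round(v / scale)
        if lo <= q <= hi:
            encoded.append(("FXP", q))   # cheap integer path
        else:
            encoded.append(("FP", v))    # outlier keeps full precision
    return encoded
```

The energy win in such schemes comes from the common case: most activations route to narrow integer arithmetic, and only the outlier fraction exercises the wider floating-point datapath.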

View on IEEE Xplore

A 29-Gb/mm2 1-Tb 3-b/Cell 3-D Flash Memory With CMOS Direct Bonded Array (CBA) Technology

Abstract:

This article reports a 1-Tb 3-b/cell 3-D flash memory fabricated with CMOS direct bonded array (CBA) technology. Compaction of circuits and wires achieves the world's highest bit density, over 29 Gb/mm2, with 332 word-line (WL) layers. The bit density is improved by 71% from the previous generation despite …

View on IEEE Xplore

MINOTAUR: A Posit-Based 0.42–0.50-TOPS/W Edge Transformer Inference and Training Accelerator

Abstract:

Transformer models have revolutionized natural language processing (NLP) and enabled many new applications, but are challenging to deploy on resource-constrained edge devices due to their high computation and memory demands. We present MINOTAUR, an edge system-on-chip (SoC) for inference and fine-tuning of Transformer models with all memory on the chip. …

View on IEEE Xplore