Computer architecture

A 194.6-TOPS/W Pipelined All Current-Domain Mixed-Signal Compute in Memory in 28-nm CMOS

A 194.6-TOPS/W Pipelined All Current-Domain Mixed-Signal Compute in Memory in 28-nm CMOS 150 150

Abstract:

Mixed-signal CIM (MS-CIM) faces bit-cell nonlinearity, poor linearity at high frequency, and throughput limits. We present a hybrid pipelined current-domain MS-CIM macro featuring bit-cell matched linearization interface (BMLI) and loop-unrolled successive approximation refinement (SAR) ADC fabricated in 28-nm CMOS. A $256{\,}\times {\,}256$ SRAM array with 8-bit inputs, 8-bit weights achieve 10.16-TOPS …

View on IEEE Xplore

Open DRAM Model Part II: Enabling Processing-in-Memory in 3D DRAM

Open DRAM Model Part II: Enabling Processing-in-Memory in 3D DRAM 150 150

Abstract:

Processing-in-memory (PIM) by implementing Boolean logic function in DRAM has been proposed to alleviate the memory wall problem in data-intensive computing. However, quantitatively evaluating DRAM-based logic operations across different DRAM architectures remains challenging due to the lack of publicly available DRAM cell and peripheral transistor models that accurately capture their …

View on IEEE Xplore

A 3-nm FinFET 563-kbit 35.5-Mbit/mm2 Dual-Rail SRAM With 3.89-pJ/Access High Energy Efficient and 27.5-μW/Mbit One-Cycle Latency Low-Leakage Mode

A 3-nm FinFET 563-kbit 35.5-Mbit/mm2 Dual-Rail SRAM With 3.89-pJ/Access High Energy Efficient and 27.5-μW/Mbit One-Cycle Latency Low-Leakage Mode 150 150

Abstract:

This article presents a high-density (HD) 6T SRAM macro designed in 3-nm FinFET technology with an extended dual-rail (XDR) architecture, addressing active energy and leakage for mobile applications. Two key innovations are introduced: the delayed-wordline in write operation (DEWL) technique and a one-cycle latency low-leakage access mode (1-CLM). The XDR …

View on IEEE Xplore

A 32-Channel 85.4 dB SNDR Time-Multiplexed Neural Recording Front-End Achieving Within-Conversion Artifact Recovery

A 32-Channel 85.4 dB SNDR Time-Multiplexed Neural Recording Front-End Achieving Within-Conversion Artifact Recovery 150 150

Abstract:

This article presents a 32-channel time-multiplexed highly digital neural recording front-end (RFE) that exhibits sub- $8.8~\mu $ s recovery latency from large stimulation artifacts while delivering high-resolution data at the Nyquist rate. The RFE is time-shared across 32 channels for low-frequency electrocorticography (ECoG) recording and across four channels for high-frequency action potentials (…

View on IEEE Xplore

A 3 nm FinFET 125 TOPS/W-29 TFLOPS/W, 90 TOPS/mm2-17 TFLOPS/mm2 SRAM-Based INT8, and FP16 Digital-CIM Compiler With Support for Multi-Weight Update/Cycle

A 3 nm FinFET 125 TOPS/W-29 TFLOPS/W, 90 TOPS/mm2-17 TFLOPS/mm2 SRAM-Based INT8, and FP16 Digital-CIM Compiler With Support for Multi-Weight Update/Cycle 150 150

Abstract:

This article presents an static random-access memory (SRAM)-based digital compute-in-memory (CIM) compiler implemented with 3 nm high- $\kappa $ metal gate (HKMG) FinFET technology, supporting flexible INT8 and FP16 formats for weight and activation multiply-accumulate (MAC) operations, offering configuration flexibility, high accuracy, and improved area and power efficiency. The FP16 digital …

View on IEEE Xplore

A 11.0-TOPS/W Diffusion Accelerator With Temporal Data Reuse for Real-Time Text-to-Motion Generation

A 11.0-TOPS/W Diffusion Accelerator With Temporal Data Reuse for Real-Time Text-to-Motion Generation 150 150

Abstract:

Text-to-motion models are AI systems that generate human motion sequences directly from natural language descriptions, serving as key enablers for immersive virtual avatars and interactive digital humans in AR/VR ecosystems. However, state-of-the-art text-to-motion diffusion models suffer from substantial computational costs due to their iterative nature, making them ill-suited for …

View on IEEE Xplore

LUT-Based Convolutional Tsetlin Machine Accelerator With Dynamic Clause Scaling for Resources-Constrained FPGAs

LUT-Based Convolutional Tsetlin Machine Accelerator With Dynamic Clause Scaling for Resources-Constrained FPGAs 150 150

Abstract:

The rapid growth of machine learning (ML) workloads, particularly in computer vision applications, has significantly increased computational and energy demands in modern electronic systems, motivating the use of hardware accelerators to offload processing from general-purpose processors. Despite advances in computationally efficient ML models, achieving energy-efficient inference on resource-constrained edge devices …

View on IEEE Xplore

A Logic-Compatible 2-Transistor Embedded Bipolar RRAM MACRO: A 28-nm Multiple-Time Programmable (MTP) Memory Without Extra Masks

A Logic-Compatible 2-Transistor Embedded Bipolar RRAM MACRO: A 28-nm Multiple-Time Programmable (MTP) Memory Without Extra Masks 150 150

Abstract:

This letter presents a 2-transistor (2T) bipolar embedded resistive RAM (eRRAM) MACRO fabricated in a 28-nm high-k metal gate (HKMG) process for multitime programmable (MTP) applications. To overcome the scaling bottlenecks of traditional embedded Flash, this work utilizes an extra-mask-free, pure front-end-of-line (FEOL) integration, offering a robust solution for automotive …

View on IEEE Xplore

MITTA: A Multi-Task Transformer Accelerator With Mixed Precision Structured Sparsity and Hierarchical Task-Adaptive Power Management

MITTA: A Multi-Task Transformer Accelerator With Mixed Precision Structured Sparsity and Hierarchical Task-Adaptive Power Management 150 150

Abstract:

This article presents MITTA, the first silicon-proven transformer accelerator optimized for multi-task inference across both natural language processing (NLP) and image processing domains. MITTA accelerates a task-sharing algorithm that minimizes sub-task computation by reusing both activations and weights from a shared base task, requiring only sparse delta computation for sub-tasks. …

View on IEEE Xplore