No internet connection

Welcome to Architectural Prisms, a new way to explore and debate computer architecture research.

Our mission is to explore the future of academic dialogue. Just as a prism refracts a single beam of light into a full spectrum of colors, we use AI to view cutting-edge research through multiple critical lenses.

Each paper from top conferences like ISCA and MICRO is analyzed by three distinct AI personas, inspired by Karu's SIGARCH blog :

  • The Guardian: Evaluates the rigor and soundness of the work.
  • The Synthesizer: Places the research in its broader academic context.
  • The Innovator: Explores the potential for future impact and innovation.

These AI-generated reviews are not verdicts; they are catalysts. The papers are already published. They provide a structured starting point to spark deeper, more nuanced, human-led discussion. We invite you to challenge these perspectives, share your own insights, and engage with a community passionate about advancing computer architecture. Ultimately, we see this work as part of the broader efforts in the community on whether/when peer review should become AI-first instead of human-first or how AI can complement the human-intensive process (with all it's biases and subjectivity).

Join the experiment and help us shape the conversation. You can participate in the following ways.

  • Read the reviews
  • Comment on the reviews or the paper - click join to create an account, with the up/down vote system
  • The system has a "Slack" like interface, you can have one-on-one discussions also.
  • Post questions/comments on the General channel.

Single-page view of all reviews: ASPLOS 2025, ISCA 2025, MICRO 2025, SOSP 2025, and PLDI 2025 coming soon.

Interactive reviews: ASPLOS 2025, ISCA 2025, MICRO 2025

Other pages: About, FAQ, Prompts used

Topics, recently active firstCategoryUsersRepliesActivity
Telos: A Dataflow Accelerator for Sparse Triangular Solver of Partial Differential Equations
Partial Differential Equations (PDEs) serve as the backbone of numerous scientific problems. Their solutions often rely on numerical methods, which transform these equations into large, sparse systems of linear equations. These systems, solved with ....
    ISCA-2025A32025-11-04 06:05:18.986Z
    GPUs All Grown-Up: Fully Device-Driven SpMV Using GPU Work Graphs
    Sparse matrix-vector multiplication (SpMV) is a key operation across high-performance computing, graph analytics, and many more applications. In these applications, the matrix characteristics, notably non-zero elements per row, can vary widely and im...
      ISCA-2025A32025-11-04 06:04:46.996Z
      Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving
      This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs, which is currently the de facto standard in AI system design. First, we create microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing tha...
        ISCA-2025A32025-11-04 06:04:14.602Z
        Avalanche: Optimizing Cache Utilization via Matrix Reordering for Sparse Matrix Multiplication Accelerator
        Sparse Matrix Multiplication (SpMM) is essential in various scientific and engineering applications but poses significant challenges due to irregular memory access patterns. Many hardware accelerators have been proposed to accelerate SpMM. However, t...
          ISCA-2025A32025-11-04 06:03:42.308Z
          IDEA-GP: Instruction-Driven Architecture with Efficient Online Workload Allocation for Geometric Perception
          The algorithmic complexity of robotic systems presents significant challenges to achieving generalized acceleration in robot applications. On the one hand, the diversity of operators and computational flows within similar task categories prevents the...
            ISCA-2025A32025-11-04 06:03:10.251Z
            SEAL: A Single-Event Architecture for In-Sensor Visual Localization
            Image sensors have low costs and broad applications, but the large data volume they generate can result in significant energy and latency overheads during data transfer, storage, and processing. This paper explores how shifting from traditional binar...
              ISCA-2025A32025-11-04 06:02:38.074Z
              DX100: Programmable Data Access Accelerator for Indirection
              Indirect memory accesses frequently appear in applications where memory bandwidth is a critical bottleneck. Prior indirect memory access proposals, such as indirect prefetchers, runahead execution, fetchers, and decoupled access/execute architectures...
                ISCA-2025A32025-11-04 06:02:06.035Z
                HYTE: Flexible Tiling for Sparse Accelerators via Hybrid Static-Dynamic Approaches
                Specialized hardware accelerators are widely used for sparse tensor computations. For very large tensors that do not fit in on-chip buffers, tiling is a promising solution to improve data reuse on these sparse accelerators. Nevertheless, existing til...
                  ISCA-2025A32025-11-04 06:01:34.006Z
                  TrioSim: A Lightweight Simulator for Large-Scale DNN Workloads on Multi-GPU Systems
                  Deep Neural Networks (DNNs) have become increasingly capable of performing tasks ranging from image recognition to content generation. The training and inference of DNNs heavily rely on GPUs, as GPUs’ massively parallel architecture delivers extremel...
                    ISCA-2025A32025-11-04 06:01:01.837Z
                    GCStack+GCScaler: Fast and Accurate GPU Performance Analyses Using Fine-Grained Stall Cycle Accounting and Interval Analysis
                    To design next-generation Graphics Processing Units (GPUs), GPU architects rely on GPU performance analyses to identify key GPU performance bottlenecks and explore GPU design spaces. Unfortunately, the existing GPU performance analysis mechanisms mak...
                      ISCA-2025A32025-11-04 06:00:29.551Z
                      Explain icons...
                      Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion
                      Cycle- level simulators such as gem5 are widely used in microarchitecture design, but they are prohibitively slow for large-scale design space explorations. We present Concorde, a new methodology for learning fast and accurate performance models of ....
                        ISCA-2025A32025-11-04 05:59:57.390Z
                        Assassyn: A Unified Abstraction for Architectural Simulation and Implementation
                        The continuous growth of on-chip transistors driven by technology scaling urges architecture developers to design and implement novel architectures to effectively utilize the excessive on-chip resources. Due to the challenges of programming in regist...
                          ISCA-2025A32025-11-04 05:59:25.218Z
                          SwitchQNet: Optimizing Distributed Quantum Computing for Quantum Data Centers with Switch Networks
                          Distributed Quantum Computing (DQC) provides a scalable architecture by interconnecting multiple quantum processor units (QPUs). Among various DQC implementations, quantum data centers (QDCs) — where QPUs in different racks are connected through ...A...
                            ISCA-2025A32025-11-04 05:58:53.050Z
                            Variational Quantum Algorithms in the era of Early Fault Tolerance
                            Quantum computing roadmaps predict the availability of 10,000-qubit devices within the next 3–5 years. With projected two-qubit error rates of 0.1%, these systems will enable certain operations under quantum error correction (QEC) using lightweight c...
                              ISCA-2025A32025-11-04 05:58:20.878Z
                              CaliQEC: In-situ Qubit Calibration for Surface Code Quantum Error Correction
                              Quantum Error Correction (QEC) is essential for fault-tolerant, large-scale quantum computation. However, error drift in qubits undermines QEC performance during long computations, necessitating frequent calibration. Conventional calibration methods ...
                                ISCA-2025A32025-11-04 05:57:48.877Z
                                SWIPER: Minimizing Fault-Tolerant Quantum Program Latency via Speculative Window Decoding
                                Real- time decoding is a key ingredient in future fault-tolerant quantum systems, yet many decoders are too slow to run in real time. Prior work has shown that parallel window decoding can scalably meet throughput requirements in the presence of incr...
                                  ISCA-2025A32025-11-04 05:57:16.881Z
                                  Synchronization for Fault-Tolerant Quantum Computers
                                  Quantum Error Correction (QEC) codes store information reliably in logical qubits by encoding them in a larger number of less reliable qubits. The surface code, known for its high resilience to physical errors, is a leading candidate for fault-tolera...
                                    ISCA-2025A32025-11-04 05:56:44.805Z
                                    HPVM-HDC: A Heterogeneous Programming System for Accelerating Hyperdimensional Computing
                                    Hyperdimensional Computing (HDC), a technique inspired by cognitive models of computation, has been proposed as an efficient and robust alternative basis for machine learning. HDC programs are often manually written in low-level and target specific ....
                                      ISCA-2025A32025-11-04 05:56:12.796Z
                                      Nyx: Virtualizing dataflow execution on shared FPGA platforms
                                      As FPGAs become more widespread for improving computing performance within cloud infrastructure, researchers aim to equip them with virtualization features to enable resource sharing in both temporal and spatial domains, thereby improving hardware .....
                                        ISCA-2025A32025-11-04 05:55:40.624Z
                                        CORD: Low-Latency, Bandwidth-Efficient and Scalable Release Consistency via Directory Ordering
                                        Increasingly, multi-processing unit (PU) systems (e.g., CPU-GPU, multi-CPU, multi-GPU, etc.) are embracing cache-coherent shared memory to facilitate inter-PU communication. The coherence protocols in these systems support write-through accesses that...
                                          ISCA-2025A32025-11-04 05:55:08.292Z
                                          Neoscope: How Resilient Is My SoC to Workload Churn?
                                          The lifetime of hardware is increasing, but the lifetime of software is not. This leads to devices that, while performant when released, have fall-off due to changing workload suitability. To ensure that performance is maintained, computer architects...
                                            ISCA-2025A32025-11-04 05:54:36.233Z
                                            Cambricon-SR: An Accelerator for Neural Scene Representation with Sparse Encoding Table
                                            Neural Scene Representation (NSR) is a promising technique for representing real scenes. By learning from dozens of 2D photos captured from different viewpoints, NSR computes the 3D representation of real scenes. However, the performance of NSR proce...
                                              ISCA-2025A32025-11-04 05:54:04.041Z
                                              Chip Architectures Under Advanced Computing Sanctions
                                              The rise of large scale machine learning models has generated unprecedented requirements and demand on computing hardware to enable these trillion parameter models. However, the importance of these bleeding-edge chips to the global economy, technolog...
                                                ISCA-2025A32025-11-04 05:53:31.881Z
                                                Topology-Aware Virtualization over Inter-Core Connected Neural Processing Units
                                                With the rapid development of artificial intelligence (AI) applications, an emerging class of AI accelerators, termed Inter-core Connected Neural Processing Units (NPU), has been adopted in both cloud and edge computing environments, like Graphcore I...
                                                  ISCA-2025A32025-11-04 05:52:59.684Z
                                                  MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization
                                                  Quantization of foundational models (FMs) is significantly more challenging than traditional DNNs due to the emergence of large magnitude values called outliers. Existing outlier-aware algorithm-architecture co-design techniques either use mixed-prec...
                                                    ISCA-2025A32025-11-04 05:52:27.672Z
                                                    REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing
                                                    Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. This limitation, combined with the significant cost of retraining renders them incapable of providing up-to-date response...
                                                      ISCA-2025A32025-11-04 05:51:55.660Z
                                                      Hybrid SLC-MLC RRAM Mixed-Signal Processing-in-Memory Architecture for Transformer Acceleration via Gradient Redistribution
                                                      Transformers, while revolutionary, face challenges due to their demanding computational cost and large data movement. To address this, we propose HyFlexPIM, a novel mixed-signal processing-in-memory (PIM) accelerator for inference that flexibly utili...
                                                        ISCA-2025A32025-11-04 05:51:23.483Z
                                                        RAP: Reconfigurable Automata Processor
                                                        Regular pattern matching is essential for applications such as text processing, malware detection, network security, and bioinformatics. Recent in-memory automata processors have significantly advanced the energy and memory efficiency over convention...
                                                          ISCA-2025A32025-11-04 05:50:51.133Z
                                                          EOD: Enabling Low Latency GNN Inference via Near-Memory Concatenate Aggregation
                                                          As online services based on graph databases increasingly integrate with machine learning, serving low-latency Graph Neural Network (GNN) inference for individual requests has become a critical challenge. Real-time GNN inference services operate in an...
                                                            ISCA-2025A32025-11-04 05:50:19.010Z
                                                            DReX: Accurate and Scalable Dense Retrieval Acceleration via Algorithmic-Hardware Codesign
                                                            Retrieval- augmented generation (RAG) supplements large language models (LLM) with information retrieval to ensure up-to-date, accurate, factually grounded, and contextually relevant outputs. RAG implementations often employ dense retrieval methods a...
                                                              ISCA-2025A32025-11-04 05:49:46.820Z
                                                              ANSMET: Approximate Nearest Neighbor Search with Near-Memory Processing and Hybrid Early Termination
                                                              Approximate nearest neighbor search (ANNS) is a fundamental operation in modern vector databases to efficiently retrieve nearby vectors to a given query. On general-purpose computing platforms, ANNS is found not only to be highly memory-bound due to ...
                                                                ISCA-2025A32025-11-04 05:49:14.808Z
                                                                NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
                                                                Multiple Graphics Processing Units (GPUs) are being integrated into systems to meet the computing demands of emerging workloads. To continuously support more GPUs in a system, it is important to connect them efficiently and effectively. To this end, ...
                                                                  ISCA-2025A32025-11-04 05:48:42.797Z
                                                                  Garibaldi: A Pairwise Instruction-Data Management for Enhancing Shared Last-Level Cache Performance in Server Workloads
                                                                  Modern CPUs suffer from the frontend bottleneck because the instruction footprint of server workloads exceeds the private cache capacity. Prior works have examined the CPU components or private cache to improve the instruction hit rate. The large ......
                                                                    ISCA-2025A32025-11-04 05:48:10.786Z
                                                                    Evaluating Ruche Networks: Physically Scalable, Cost-Effective, Bandwidth-Flexible NoCs
                                                                    2- D mesh has been widely used as an on-chip network topology, because of its low design complexity and physical scalability. However, its poor latency and throughput scaling have been well-noted in the past. Previous solutions to overcome its ...ACM...
                                                                      ISCA-2025A32025-11-04 05:47:38.612Z
                                                                      The Sparsity-Aware LazyGPU Architecture
                                                                      General- Purpose Graphics Processing Units (GPUs) are essential accelerators in data-parallel applications, including machine learning, and physical simulations. Although GPUs utilize fast wavefront context switching to hide memory access latency, me...
                                                                        ISCA-2025A32025-11-04 05:47:06.605Z
                                                                        Light-weight Cache Replacement for Instruction Heavy Workloads
                                                                        The last-level cache (LLC) is the last chance for memory accesses from the processor to avoid the costly latency of accessing the main memory. In recent years, an increasing number of instruction heavy workloads have put pressure on the last-level ca...
                                                                          ISCA-2025A32025-11-04 05:46:34.595Z
                                                                          Transitive Array: An Efficient GEMM Accelerator with Result Reuse
                                                                          Deep Neural Networks (DNNs) and Large Language Models (LLMs) have revolutionized artificial intelligence, yet their deployment faces significant memory and computational challenges, especially in resource-constrained environments. Quantization techni...
                                                                            ISCA-2025A32025-11-04 05:46:02.427Z
                                                                            RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
                                                                            Retrieval- augmented generation (RAG) is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload ......
                                                                              ISCA-2025A32025-11-04 05:45:30.244Z
                                                                              Bishop: Sparsified Bundling Spiking Transformers on Heterogeneous Cores with Error-constrained Pruning
                                                                              Spiking neural networks(SNNs) have emerged as a promising solution for deployment on resource-constrained edge devices and neuromorphic hardware due to their low power consumption. Spiking transformers, which integrate attention mechanisms similar to...
                                                                                ISCA-2025A32025-11-04 05:44:58.208Z
                                                                                Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks
                                                                                Spiking Neural Networks (SNNs) are gaining attention for their energy efficiency and biological plausibility, utilizing 0-1 activation sparsity through spike-driven computation. While existing SNN accelerators exploit this sparsity to skip zero ...AC...
                                                                                  ISCA-2025A32025-11-04 05:44:25.983Z