Introduction

In recent years, the rapid growth of artificial intelligence (AI) and machine learning (ML) has resulted in increasingly complex and large models in pursuit of higher accuracy and range of use cases1. The substantial growth rate of model computation exceeds efficiency gains realized through Moore and Dennard technology scaling2, indicating a looming limit to continued advancements with existing techniques. This issue is compounded by the open challenges of adapting such methods for resource-constrained edge devices (tinyML) in order to enable pervasive and decentralized intelligence through the Internet of Things (IoT)3. As such, the urgency for exploring new resource-efficient and scalable computing architectures has intensified.

Neuromorphic computing has emerged as a promising area in addressing these challenges, aiming to unlock key hallmarks of biological intelligence by porting primitives and computational strategies employed in the brain into engineered computing devices and algorithms4,5,6. Neuromorphic systems hold a critical position in the investigation of novel architectures, as the brain exemplifies an exceptional model for accomplishing scalable, energy-efficient, and real-time embodied computation.

Initially, the term “neuromorphic” referred specifically to approaches that aimed to emulate the biophysics of the brain by leveraging physical properties of silicon, as proposed by Mead in the 1980s7. However, the field of neuromorphic computing research has since grown to encompass a wide range of brain-inspired computing techniques at the algorithmic, hardware, and system levels4. While the range of approaches is diverse, neuromorphic computing research generally utilizes mechanisms emulating or simulating biophysical properties more closely than conventional methods, aiming to reproduce high-level performance and efficiency characteristics of biological neural systems.

Neuromorphic algorithms8 encompass neuroscience-inspired methods which strive towards goals of expanded learning capabilities, such as predictive intelligence, data efficiency, and adaptation, and include approaches such as spiking neural networks (SNNs) and primitives of neuron dynamics, plastic synapses, and heterogeneous network architectures. Algorithm exploration often makes use of simulated execution on readily-available conventional hardware such as CPUs and GPUs, with the goal of driving design requirements for next-generation neuromorphic hardware.

Neuromorphic systems9 are composed of algorithms deployed to hardware and seek greater energy efficiency, real-time processing capabilities, and resilience compared to conventional systems. Neuromorphic hardware utilizes a variety of biologically-inspired approaches, including analog neuron emulation, event-based computation, non-von-Neumann architectures, and in-memory processing. Neuromorphic systems target a wide range of applications, from neuroscientific exploration to low-power edge intelligence and datacenter-scale acceleration.

Despite its promises, progress in the field of neuromorphic research is impeded due to the absence of fair and widely-adopted objective metrics and benchmarks8,10. Without such benchmarks, the validity of neuromorphic solutions cannot be directly quantified, hindering the research community from measuring technological advancement. Standard and rigorous benchmarking is necessary for the neuromorphic community to objectively assess and compare the achievements of novel approaches, and make evidence-based decisions on which directions show promise for achieving breakthrough efficiency, speed, and intelligence, thereby helping to focus research and commercialization efforts on techniques that concretely improve on prior work and conventional computing. Neuromorphic benchmarks have been previously proposed for classical vision11,12 and audition tasks13, open-loop14 and closed-loop15 tasks, and for SNN simulator performance assessment16. While prior works have made valuable contributions, there are opportunities to further advance the field by addressing three outstanding challenges:

  • Lack of a formal definition – The variety of approaches to exploring brain-inspired principles creates difficulties in defining a set of criteria for what should be benchmarked as a “neuromorphic” solution. Closed definitions can impose narrow assumptions and thus risk unfairly excluding promising methods. This challenge necessitates inclusive benchmarks that can be applied generally across the spectrum of potential approaches, allowing for flexible implementation while focusing on task capabilities and metrics of interest such as temporal processing and efficiency. Furthermore, the benchmarks should ideally allow for direct comparison of neuromorphic and conventional approaches.

  • Implementation diversity – A wide array of different frameworks targeting different goals, such as neuroscientific exploration17 and automatic SNN training18, are used in neuromorphic research. This diversity, which has been instrumental in exploring the landscape of bio-inspired techniques following different methodologies and abstraction levels, comes at the cost of portability and standardization, which in turn limits the ease of benchmark implementation. Benchmarks require common infrastructure that unites tooling to enable actionable implementation and comparison of new methods.

  • Rapid research evolution – Neuromorphic approaches are continually and rapidly evolving as part of an emerging field. As the research community continues to make technological progress, so too should benchmark suites and methodology expand to foster inclusion and capture salient performance metrics. An iterative benchmark framework with structured versioning will facilitate productive foundational and evolving performance evaluation.

To tackle these challenges, this article presents NeuroBench, a dual-track, multi-task benchmark framework. NeuroBench addresses the existing neuromorphic benchmark challenges by advancing prior work in three distinct ways. Firstly, the benchmark framework reduces assumptions regarding the specific solution being assessed, encouraging inclusive participation of neuromorphic and non-neuromorphic approaches by utilizing general, task-level benchmarking and hierarchical metric definitions which capture key performance indicators of interest. Secondly, the NeuroBench benchmarks are associated with a common open-source benchmark harness tool which facilitates actionable benchmark implementation and offers structure for further expansion to neuromorphic algorithm frameworks and systems. Finally, NeuroBench establishes an iterative, community-driven initiative designed to evolve over time to ensure representation and relevance to neuromorphic research, analogous to the well-established MLPerf benchmark framework for machine learning19,20. As a whole, NeuroBench intends to align the neuromorphic research community on standard benchmarking, providing a dynamically evolving platform to ensure ongoing relevance and facilitate advancements through workshops, competitions, and a centralized leaderboard.

As Fig. 1 shows, the NeuroBench framework involves two tracks to enable agile algorithm and system development. As an emerging technology, neuromorphic hardware has not converged to a single commercially available platform, so a large fraction of neuromorphic research explores algorithmic advancement on conventional systems which may not be optimal for performance. NeuroBench therefore consists of an algorithm track for hardware-independent evaluation and a system track for fully deployed solutions. The algorithm track defines four novel benchmarks for neuromorphic methods across diverse domains, namely few-shot continual learning, computer vision, motor cortical decoding, and chaotic forecasting, and utilizes complexity metrics to analyze solution costs. Such hardware-independent benchmarking enables algorithmic exploration and prototyping, especially when simulating algorithm execution on non-neuromorphic platforms. Meanwhile, the system track defines standard protocols to measure the real-world speed and efficiency of neuromorphic hardware on benchmarks ranging from standard machine learning tasks to promising fields for neuromorphic systems, such as optimization. Up-to-date information on the latest benchmarks and official results can be found on the NeuroBench website (https://neurobench.ai/).

Fig. 1: The two NeuroBench tracks: algorithms and systems.

Grey boxes designate what is defined by the benchmark, and orange boxes indicate what is unique to each solution. Connecting arrows between the two tracks denote the co-innovation between the tracks and the cross-stack innovation enabled by this approach. Between algorithm and system solutions, best-performing results from each track can motivate future solutions to the other. In addition, system metrics and results can inform hardware-independent algorithmic complexity metrics.

Each NeuroBench track includes defined datasets, metric and measurement methodology, and modular evaluation components to enable flexible development. Promising methods identified from the algorithm track will inform system design by highlighting target algorithms for optimization and relevant system workloads for benchmarking. The system track in turn enables optimization and evaluation of performant implementations, providing feedback to refine algorithmic complexity modeling and analysis. The interplay between the tracks creates a virtuous cycle: algorithm innovations guide system implementation, while system-level insights accelerate further algorithmic progress. This approach allows NeuroBench to advance neuromorphic algorithm-system co-design. Both the algorithm and system track will be extended and co-developed as NeuroBench continues to expand.

In the next few sections, we describe the algorithm track, including general complexity metric definitions, benchmark tasks, and common infrastructure tooling. We apply the framework to report baseline results for each algorithm benchmark, which outline unexplored research opportunities in optimizing algorithmic architectures and training of sparse, stateful models to achieve greater performance and resource efficiency. Then, we show baseline results established in the system track to assess neuromorphic performance across promising application workloads. By outlining both tracks, we provide a roadmap towards standardizing benchmark procedures in both hardware-independent and hardware-dependent settings.

Algorithm Track Benchmark Framework

The algorithm benchmark track aims to evaluate algorithms in a system-independent manner, separating algorithm performance from specific implementation details. The implementation platform can thus be ill-matched to the particular algorithm benchmark that it executes (e.g., SNN execution via dense matrix multiplication on a GPU), and the algorithm complexity and expected performance can be examined in a theoretical manner, motivating agile prototyping and functional analysis. Furthermore, minimal assumptions are made about the solutions tested, promoting inclusion of diverse algorithmic approaches.

The framework, as illustrated in Fig. 2, is composed of inclusively-defined benchmark metrics, datasets and data loaders, and common harness infrastructure, shown in red. The metrics focus on assessing algorithm correctness on specific tasks as well as capturing general metrics that reflect the architectural complexity, computational demands, and storage requirements of the models. The datasets and data loaders specify the details of the tasks used for evaluation and ensure consistency across benchmarks. Finally, the harness infrastructure automates runtime execution and result output for the algorithm benchmark specified by the input interface, which consists of the user’s model and customizable components for data processing and desired metrics, shown in green and orange.

Fig. 2: An overview of the NeuroBench algorithm track software architecture.

Users input their model to benchmark and define the task composed of data, processors, accumulators and metrics. These are taken by the benchmark harness runtime to automatically generate benchmark metric results.

Algorithm track metrics

The algorithm track establishes solution-agnostic primary metrics which are generally relevant to all types of solutions, including artificial and spiking neural networks (ANNs, SNNs). Firstly, there are correctness metrics, which measure the quality of the model predictions on the particular task, such as accuracy, mean average precision (mAP), and mean-squared error (MSE). The correctness metrics are specified per task for each benchmark. Next, there are complexity metrics, which measure the computational demands of the algorithm. In the first iteration of the NeuroBench algorithm track, we assume a digital, time-stepped execution of the algorithm and define the following complexity metrics:

  • Footprint – A measure of the memory footprint, in bytes, required to represent a model, which reflects quantization, parameters, and buffering requirements. The metric summarizes (and can be further broken down into) synaptic weight count, weight precision, trainable neuron parameters, data buffers, etc. Zero weights are included, as they are distinguished in the connection sparsity metric.

  • Connection Sparsity – For a given model, the connection sparsity is the number of zero weights divided by the total number of weights, accumulated over all layers. 0 refers to no sparsity (fully connected) and 1 refers to full sparsity (no connections). This metric accounts for deliberate pruning and sparse network architectures.

  • Activation Sparsity – During execution, the average sparsity of neuron activations over all neurons in all model layers, for all timesteps of all tested samples, where 0 refers to no sparsity (i.e., all neurons are always activated), and 1 refers to the case where all neurons have a zero output.

  • Synaptic Operations – Average number of synaptic operations per model execution, based on neuron activations and the associated fanout synapses. This metric is further subdivided into dense, effective multiply-accumulate, and effective accumulate synaptic operations (Dense, Eff_MACs, Eff_ACs). Dense accounts for all zero and nonzero neuron activations and synaptic connections, and reflects the number of operations necessary on hardware that does not support sparsity. Eff_MACs and Eff_ACs only count effective synaptic operations by disregarding zero activations (e.g., produced by the ReLU function in an ANN or no spike in an SNN) and zero connections, thus reflecting operation cost on sparsity-aware hardware. Synaptic operations with non-binary activation are considered multiply-accumulates (MACs), while those with binary activation are considered accumulates (ACs).

Footprint and connection sparsity are classified as static metrics, which can be analytically determined from the model only. Activation sparsity, synaptic operations, and correctness are classified as workload metrics, which are dependent on execution or simulation of the model based on the benchmark data.
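
To make these definitions concrete, the following is a minimal sketch of how such metrics could be computed for a PyTorch model. It is illustrative only, not the official harness implementation described later, and it simplifies the handling of buffers, convolutional layers, and timesteps; all function names are assumptions for this example.

    import torch
    import torch.nn as nn

    def footprint_bytes(model: nn.Module) -> int:
        """Memory required to represent the parameters (buffers omitted here)."""
        return sum(p.numel() * p.element_size() for p in model.parameters())

    def connection_sparsity(model: nn.Module) -> float:
        """Zero weights divided by total weights, accumulated over all layers."""
        zeros = sum((p == 0).sum().item() for p in model.parameters() if p.dim() > 1)
        total = sum(p.numel() for p in model.parameters() if p.dim() > 1)
        return zeros / total if total else 0.0

    def activation_sparsity(acts: torch.Tensor) -> float:
        """Fraction of zero activations (0 = all active, 1 = all silent)."""
        return (acts == 0).float().mean().item()

    def synaptic_ops_linear(x: torch.Tensor, layer: nn.Linear):
        """Dense vs. effective synaptic operations for one nn.Linear execution.

        x: activations entering the layer, shape (batch, in_features).
        Dense counts every activation-weight pair; effective ops skip pairs in
        which the activation or the weight is zero (sparsity-aware hardware).
        Whether effective ops count as MACs or ACs depends on whether x is binary.
        """
        dense = layer.in_features * layer.out_features             # per sample
        nz_w = (layer.weight != 0).float()                         # (out, in)
        eff = ((x != 0).float() @ nz_w.t()).sum(dim=1).mean().item()
        return dense, eff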

In addition to the above complexity metrics, the algorithm track proposes to define Model Execution Rate, corresponding to the rate, in Hz, at which the model’s forward inference pass needs to be executed. For example, if a model is designed to process data from an event camera with a 50 ms input stride, the model execution rate is 20 Hz. The execution rate is a critical feature of the algorithm which provides intuition into the tradeoff between latency and computational footprint of a deployed model, and it is reported directly by the solution designer in benchmark results, as it cannot be calculated or extracted from the model or its outputs.

The complexity metrics are measured independently of the underlying hardware and therefore do not explicitly correlate with post-deployment latency or energy consumption. However, they provide valuable insight into algorithm performance and resource requirements, enabling high-level comparison and facilitating prototyping. For instance, the execution rate and number of synaptic operations can be taken together to estimate the speed and dynamic power of a model deployed to certain hardware, and the footprint and connection sparsity can be used to proxy hardware resource utilization.
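
As a simple illustration of such a first-order estimate, the sketch below combines a model execution rate and an effective synaptic operation count with an assumed per-operation energy. All values are placeholders rather than measurements of any system.

    # Back-of-the-envelope estimate only; the energy-per-operation value is an
    # assumed placeholder and is highly hardware-specific.
    execution_rate_hz = 20       # e.g., a 50 ms input stride as described above
    eff_ops_per_exec = 1.0e6     # effective synaptic operations per model execution
    energy_per_op_j = 5e-12      # assumed energy per synaptic operation (J)

    synops_per_second = execution_rate_hz * eff_ops_per_exec
    est_dynamic_power_w = synops_per_second * energy_per_op_j
    print(f"{synops_per_second:.2e} synops/s, "
          f"~{est_dynamic_power_w * 1e6:.0f} uW estimated dynamic power")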

Furthermore, the algorithm track can be extended with solution-specific secondary metrics, which can offer deeper insights by using information specific to particular types of solutions. For example, for algorithms geared towards analog hardware, noise robustness is an important solution-specific metric. In addition, approaches with complex neuron dynamics may warrant measuring the overall complexity of a neuron update (i.e., type and counts of operations necessary to simulate the update), which can be combined with the total number of neuron updates in a model pass to calculate the cost of state updates. Such solution-specific metrics are expected to be community-driven and will be included in future NeuroBench algorithm track releases.

Algorithm track benchmarks

The v1.0 iteration of the NeuroBench algorithm track includes four benchmarks for neuromorphic computing research. The benchmarks were chosen by the NeuroBench community to capture key ongoing challenges for neuromorphic algorithm design. The list of tasks highlights features which are relevant to neuromorphic research interests: few-shot continual learning, object detection utilizing the high dynamic range and temporal resolution of event cameras, sensorimotor decoding based on cortical signals, and low-dimensional predictive modeling useful for prototyping resource-constrained networks that are suitable for small mixed-signal systems.

Benchmark tasks are listed below and summarized in Table 1. Detailed specifications of benchmark tasks are provided in the Methods section.

  • Keyword Few-Shot Class-Incremental Learning (FSCIL) – Learning new tasks from a small number of experiences while retaining knowledge of prior tasks is a hallmark of biological intelligence and a long-standing goal of general AI21. It is a particularly important challenge for endowing edge devices with the ability to adapt to their environments and users. This benchmark thus evaluates the capacity of a model to successively incorporate new keywords over multiple sessions (class-incremental), with only a handful of samples from the new classes to train with (few-shot). The FSCIL task is a recently established benchmark in the computer vision domain22, but it has not yet been adapted to other data modalities. Aligning with a neuromorphic interest in temporal data modalities, this benchmark introduces an FSCIL task with streaming audio data using the large Multilingual Spoken Word Corpus (MSWC)23 keyword classification dataset. The task is designed to be approached in two phases: pre-training and incremental learning. First, for pre-training, a set of 100 words spanning 5 base languages (English, German, Catalan, French, Kinyarwanda) with 500 training samples each is made available to train an initial model. Next, for incremental learning, the model undergoes 10 successive sessions to learn words from 10 new languages (Persian, Spanish, Russian, Welsh, Italian, Basque, Polish, Esperanto, Portuguese, Dutch) in a few-shot learning scenario. Each incremental session adds 10 words of the corresponding session language with only 5 training samples available per word. After each session, the model is tested on classification accuracy over all previously learned classes, including the 100 base pre-training classes and the few-shot-learned classes, therefore evaluating the FSCIL solution on its ability to learn new classes while retaining knowledge about the previously learned ones. Each session learns a new language, for a total knowledge base of 200 keywords by the end of the benchmark.

  • Event Camera Object Detection – Object detection is a widely-used computer vision task with applications in robotics, autonomous driving, and surveillance. Such scenarios at the edge may require high energy efficiency and real-time performance, which can be achieved via event-based vision sensors24. The event camera object detection benchmark uses the Prophesee 1 Megapixel automotive detection dataset25, a large labeled object detection dataset with over 15 h of event camera video from the front of a car driving in various scenarios. Predetermined training, validation, and testing splits include 11.2 h, 2.2 h, and 2.2 h of recording, respectively. Pedestrian, two-wheeler, and car object classes are used in evaluation, and correctness is measured using COCO mean average precision (mAP)26.

  • Non-human Primate (NHP) Motor Prediction – Studying models which can accurately replicate features of biological computation presents opportunities for understanding sensorimotor behavior and developing closed-loop methods for future robotic agents. It is also foundational to the development of wearable or implantable neuro-prosthetic devices that can accurately generate motor activity from neural or muscle signals. This benchmark utilizes a dataset consisting of multi-channel recordings from the sensorimotor cortex of two non-human primates (NHP Indy and NHP Loco) during reaching movements, along with corresponding fingertip motion of the reach27. Six sessions are included from the dataset, for a total of 8712 seconds of data. The task is to train a model to predict the two-dimensional components of finger velocity using recent neural data. The sessions are treated independently (i.e., models are trained separately for each session), and the data is split to allow the first 75% for training and validation and the last 25% for evaluation. Correctness of the predictions is evaluated by the coefficient of determination (R2) score against the true finger velocity targets, averaged over all six sessions.

  • Chaotic Function Prediction – The real-world data benchmarks presented thus far are high-dimensional and can require large networks to achieve high accuracy, raising challenges for solution types with limited I/O support and network capacity, such as mixed-signal edge prototype solutions. To address this, we include a synthetic benchmark based on prediction of one-dimensional Mackey-Glass time series28, which can be effectively tackled by smaller networks. Mackey-Glass has been widely adopted as a benchmark for evaluating temporal predictors, including neuromorphic models29,30,31. The task involves prediction of the next timestep value f(t + Δt) given the current timestep value f(t). The model is trained and validated using the first half of the time series, during which the ground truth state f(t) is supplied to the model to predict the next timestep value f′(t + Δt). During evaluation, the model uses its prior prediction f′(t) to generate each next value f′(t + Δt), autoregressively forecasting the second half of the time series. Correctness is measured using symmetric mean absolute percentage error (sMAPE) of the generated time series against the target time series, a standard metric in forecasting32. The benchmark includes a set of 14 Mackey-Glass time series, which vary by the equation parameter τ, the delay constant. Lyapunov time (L), the expected predictability timescale for chaos33, is used as the time unit for each time series. The total length of each series is 20 Lyapunov times, and 75 points are sampled per Lyapunov time (Δt = L/75).
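
A minimal sketch of this autoregressive evaluation loop and the sMAPE computation is shown below. It is illustrative rather than the benchmark implementation; "model" is assumed to be any trained one-step predictor with the call interface shown.

    import numpy as np

    def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """Symmetric mean absolute percentage error, in percent."""
        return 100.0 * np.mean(2.0 * np.abs(y_pred - y_true)
                               / (np.abs(y_true) + np.abs(y_pred)))

    def autoregressive_forecast(model, f_last: float, n_steps: int) -> np.ndarray:
        """Feed each prediction back as the next input (assumes model(x) -> x_next)."""
        preds, x = [], f_last
        for _ in range(n_steps):
            x = model(x)          # predict f'(t + dt) from the previous output
            preds.append(x)
        return np.array(preds)

    # second_half: ground-truth values of the held-out half of the series
    # forecast = autoregressive_forecast(model, train_series[-1], len(second_half))
    # score = smape(second_half, forecast)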

Table 1 NeuroBench algorithm track v1.0 benchmarks

Algorithm track benchmark harness

The NeuroBench algorithm benchmarks are wrapped in a harness which standardizes the benchmark interfaces. The harness provides benchmark users with a consistent framework for loading data, processing data and model outputs, and calculating and reporting metrics, thereby ensuring fair and standard comparisons of the results. It is built with straightforward interfaces which are designed to be extended with new frameworks, algorithms, and tasks. The benchmark harness is open-source for use and development (https://github.com/NeuroBench/neurobench).

The components of the algorithm benchmark harness are summarized in Fig. 2. Datasets are loaded in a common format and pass through Processors to be pre-processed. The Model generates predictions based on the processed data, and Accumulators post-process the predictions, for instance to accumulate spikes and transform to labels. Static metrics of algorithm footprint and connection sparsity are calculated via model analysis, while metrics of correctness, activation sparsity, and synaptic operations are calculated using predictions and model execution traces. For benchmark users, task evaluation simply involves utilizing the existing dataloaders, processors, and metrics within the harness and wrapping their own code to fit the standard interfaces.
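
As an illustration, a minimal usage sketch is shown below. The class, metric, and argument names follow the pattern documented in the harness repository at the time of writing and should be treated as indicative only, since the interface may differ between releases; the dataset, trained network, and batch size are placeholders.

    from torch.utils.data import DataLoader
    from neurobench.models import TorchModel       # wrapper for a trained nn.Module
    from neurobench.benchmarks import Benchmark

    test_loader = DataLoader(test_set, batch_size=256)   # harness-format test data
    model = TorchModel(trained_net)                      # user's PyTorch model

    static_metrics = ["footprint", "connection_sparsity"]
    workload_metrics = ["classification_accuracy", "activation_sparsity",
                        "synaptic_operations"]

    # Processors and accumulators (pre-/post-processors) are passed as lists;
    # empty lists are used here for brevity.
    benchmark = Benchmark(model, test_loader, [], [],
                          [static_metrics, workload_metrics])
    results = benchmark.run()
    print(results)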

Currently, the harness and all baseline models are built using PyTorch34 or frameworks based on it, such as snnTorch18 and SpikingJelly35. Due to its modular structure and simple interfaces, the harness can grow to be compatible with further neuromorphic tools such as Lava36 and Fugu37. Furthermore, it also supports the extension of data and metric pipelines in order to implement additional benchmark tasks. Widely validated benchmarks in keyword38 and gesture classification12, which are foundational in neuromorphic and conventional research13,39,40,41, have been incorporated into the harness to complement the novel tasks in the NeuroBench v1.0 suite. Any novel or existing benchmarks can make use of the harness infrastructure for open reproducibility, and also to garner interest in the community towards long-term task support and appearance in NeuroBench-affiliated leaderboards and challenge events.

Algorithm track limitations and further extensions

Before diving into the baseline results, it is worth discussing several possible improvements to the NeuroBench algorithm track framework in its current form. Specifically, the initial iteration of metrics is restricted to the assumption of digital, time-stepped algorithm execution. While complexity analysis of such prototypes can serve as an intermediate step for solutions intended for analog or continuous time deployment, the metric measurements are not yet defined for those execution settings. Informed by further benchmark implementations, future versions of NeuroBench will extend inclusiveness by expanding measurement protocols to include such algorithms.

Furthermore, the synaptic operations metric, intended to capture model computation cost, currently does not account for neuron updates. The dynamics of neuron models, including mechanisms like leakage and reset, can vary heavily in complexity. However, counting the number and type of operations from neuron updates, as well as estimating their overall costs, depends on the specific arithmetic or circuit implementation. Thus, they are not accounted for in the broader algorithmic complexity metrics. The algorithmic metric framework can be extended with solution-specific metrics that assume a particular implementation platform to estimate neuron update costs, which have been previously defined42. These estimates can then be combined with the total number of neuron updates per model computation to measure overall network operation complexity during evaluation.
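
As an example of such a solution-specific extension, the sketch below assumes a particular per-update operation breakdown for a simple digital leaky integrate-and-fire neuron and scales it by the number of neuron updates in a model execution. The breakdown is an assumption for illustration, not a prescribed costing.

    # Assumed operation counts for one discrete-time LIF update on digital hardware:
    #   v = beta * v + i_in          -> 1 multiply, 1 add
    #   spike = (v >= threshold)     -> 1 compare
    #   v = v - spike * threshold    -> 1 multiply, 1 subtract (reset by subtraction)
    LIF_UPDATE_OPS = {"mul": 2, "add_sub": 2, "cmp": 1}

    def neuron_update_cost(n_neurons: int, timesteps: int) -> dict:
        """Total neuron-update operations for one model execution."""
        updates = n_neurons * timesteps
        return {op: count * updates for op, count in LIF_UPDATE_OPS.items()}

    # Example: 1024 LIF neurons simulated over 200 timesteps per model execution.
    # neuron_update_cost(1024, 200) -> {'mul': 409600, 'add_sub': 409600, 'cmp': 204800}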

Data pre- and post-processing can also amount to significant costs not yet captured in the NeuroBench algorithm track metrics. Such costs are, however, captured in the deployed metrics of the system track, which accounts for data processing hardware as part of the overall system during performance and efficiency measurements. Data processing metrics will be added as a separate complexity category for the algorithm track benchmark in the future.

The v1.0 algorithm track benchmark suite is also intended to expand in the future. This could include covering further data modalities such as inertial measurement unit (IMU) sensing43 and extending to closed-loop sensorimotor tasks to demonstrate embodied intelligence. As with the initial benchmarks, further tasks will undergo approval and development by the open NeuroBench community before being included in a future versioned benchmark suite.

Algorithm track baseline results

In our first iteration of the algorithm track, we report baseline algorithm performance on each benchmark using various model architectures, including artificial neural networks commonly used in deep learning, spiking neural networks, and reservoir networks. We evaluate each benchmark with two substantially different algorithm baselines. From these evaluations, we extract baseline comparisons, identify trends, and uncover motivations for future research. Except for the event camera object detection task, each benchmark utilizes a novel data split, and all tasks use novel metric measurement. The presented baselines are a snapshot of the solution search space and will serve as starting points for leaderboards, calling for further research to push the state of the art for each task. Detailed specifications of each of the baselines can be found in the Methods section.

Keyword FSCIL

The keyword FSCIL task has an ANN and SNN baseline, using different model architectures:

  • M5 ANN – The ANN baseline uses a tuned version of the M5 deep convolutional network architecture44, with samples pre-processed into Mel-frequency cepstral coefficients (MFCC). The network contains four successive convolution-normalization-pooling layers, followed by a readout fully-connected layer. Each model execution (forward pass) uses the data from the full pre-processed sample, and convolution kernels are applied over the temporal dimension of the samples. This is reported as a 1 Hz model execution rate.

  • SNN – The SNN baseline uses a recurrent SNN with adaptive leaky integrate-and-fire (LIF) neurons and heterogeneous time constants45. The SNN consists of two recurrent adaptive LIF layers and one linear output layer. Audio samples are pre-processed to binary spike trains using Speech2Spikes46, which relies on a Mel Spectrogram with the same parameters as the MFCC of the ANN baseline. Each input timestep to the model represents 5 ms of audio data, thus the model has a 200 Hz model execution rate. Output neuron activations are summed over time to produce the word class prediction.

After pre-training using standard batched training, the ANN and SNN baseline networks reach high accuracies on the base classes of 97.09% and 93.48%, respectively. As reported by the model execution rate metric, the SNN baseline computes each sample over 200 passes, using an order of magnitude fewer effective AC synaptic operations per model execution compared to the ANN baseline’s effective MACs. Considering both the model execution rate and synaptic operation metrics, the number of aggregated ACs over the length of the sample (200 × 3.65 × 10⁵ = 7.30 × 10⁷) exceeds the Dense and effective MAC operations necessary for the ANN baseline, which spatially flattens the sample and processes it in one model execution. However, outside of the fixed-length keyword classification scenario, the low per-execution cost of temporal processing in SNNs can enable efficient, always-on, high-frequency prediction capabilities in deployed continuous audio recognition scenarios.

We present two approaches to the incremental stage for both the ANN and SNN baselines. The frozen models are locked after pre-training on the base classes and have 0% accuracy on all new incremental classes, providing a reference point with no new-class learning and no catastrophic forgetting of prior classes. The prototypical models employ a prototypical network47 for incremental learning, a feature-based clustering approach that can be implemented as a simple linear readout layer on top of the pre-trained network backbone. Prototypical weights and biases for prior and incremental classes are defined directly from the average features of the corresponding class and substitute the pre-trained readout layer parameters. The complexity results in Table 2 thus apply equally to the frozen and prototypical models.
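
A minimal sketch of this prototype-based readout replacement is given below, using one common construction in which nearest-prototype classification under squared Euclidean distance is expressed as a linear layer. The feature-extractor interface and variable names are assumptions for illustration, not the baseline implementation.

    import torch

    def prototypical_readout(features_per_class):
        """Build linear readout weights/biases from class-mean features.

        features_per_class: list of (n_samples, n_features) tensors, one per
        class, produced by the frozen backbone. Nearest-prototype classification
        by squared Euclidean distance is equivalent to a linear layer with
        W_c = 2 * mu_c and b_c = -||mu_c||^2.
        """
        protos = torch.stack([f.mean(dim=0) for f in features_per_class])  # (C, F)
        weight = 2.0 * protos
        bias = -(protos ** 2).sum(dim=1)
        return weight, bias

    # The returned weight/bias replace the pre-trained readout layer parameters;
    # rows for newly learned classes are appended after each incremental session.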

Table 2 Baseline results for the keyword few-shot class-incremental learning task

The test accuracy of the baseline models over all sessions, as well as the test accuracy on only the new incrementally-learned classes, is shown in Fig. 3. Using prototypical networks, the ANN model reaches 89.27% accuracy on average over all sessions, a gain of 21.41 accuracy points over the frozen model. The accuracy on new classes, averaged over all incremental sessions, is 79.61%. The SNN prototypical baseline, on the other hand, reaches 75.27% accuracy on average over all sessions, surpassing the frozen SNN performance by 9.97 accuracy points, with an average accuracy on new classes over all sessions of 57.23%.

Fig. 3: Test accuracy per session on the keyword FSCIL task for prototypical and frozen baselines, with the accuracy on both base classes and incrementally-learned classes (left), and accuracy on all incrementally-learned classes only (right).

Incremental session 0 refers to the accuracy on base classes after pre-training only. The shaded area represents the 5th and 95th percentiles over 100 runs. Frozen baselines with no adaptation do not learn incremental classes and thus have a fixed 0% accuracy for New Classes Performance.

The accuracy loss over the incremental sessions is similar between the ANN and SNN prototypical baselines. However, the lower overall accuracy of the SNN is largely due to the conversion from the original backpropagation-trained readout classifier, which is used in the frozen baseline, to the prototype readout classifier. On the base classes (session 0 in Fig. 3), the ANN sees a drop of 2.37% between the frozen and prototypical baselines, while the SNN has a larger drop of 9.17%. The larger drop indicates that our particular SNN baseline learns less general feature representations than the ANN. This may reflect the difficulty of using backpropagation through time with the chosen recurrent spiking model to learn the long-term temporal keyword features needed for online temporal inference. Additionally, the Speech2Spikes46 pre-processing algorithm that converts audio to spikes may cause information loss. Overall, the keyword FSCIL benchmark presents opportunities for further research in learning methods, preprocessing, and model architectures for continual learning of temporal data.

Event camera object detection

The event camera object detection task reports a prior baseline, the RED ANN, and a novel conversion of the architecture to a hybrid ANN-SNN model:

  • RED ANN – The RED architecture25 consists of blocks of feed-forward squeeze-and-excite48 convolutional layers followed by blocks of recurrent convolution-LSTM (ConvLSTM49) layers. A single-shot detection (SSD50) head is used to predict the location and class of the bounding box based on multi-scale outputs from the recurrent layers. Raw event data is binned into 50 ms and pre-processed into time surfaces.

  • Hybrid – The hybrid ANN-SNN architecture adopts feedforward LIF spiking neural layers to replace the ConvLSTM layers in RED, and shares the same feed-forward convolutional blocks as the RED. It uses the same input encoding method and SSD head as the RED model.

Results for the two networks can be found in Table 3. The RED ANN represents the current state-of-the-art correctness on the benchmark, at 0.429 mAP. The Hybrid network is smaller, with footprint and synaptic operation metrics an order of magnitude lower than those of the RED ANN. The smaller size comes at the expense of lower correctness, at 0.271 mAP.

Table 3 Baseline results for the event camera object detection task

For the RED ANN, the activation sparsity metric (0.634) reflects the zero activations produced by the ReLU function across neurons. From this, one might expect the number of effective operations (operations with a nonzero activation and a nonzero weight) to be around 35% of dense operations; however, the actual ratio is 87%. This is due to normalization layers applied to activations before synaptic weight multiplication. Furthermore, neurons with lower activation frequency in the network tend to have a smaller fanout than neurons with high activation frequency. Thus, while activation sparsity alone can provide a proxy for network cost, architectural characteristics may impede actual computation reduction, and the synaptic operations must be considered in tandem.
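
The mechanism can be illustrated with a few lines of PyTorch (a toy example, not the RED architecture itself): a normalization layer placed between the ReLU and the next synaptic weight multiplication maps zero activations to nonzero values, so the sparsity seen by the next layer's weights, and hence the effective operation count, is far lower than the ReLU output sparsity.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    acts = torch.relu(torch.randn(32, 64))      # ReLU output, roughly half zeros
    norm = nn.BatchNorm1d(64)                   # normalization before the next layer
    normed = norm(acts)                         # zeros are shifted to nonzero values

    relu_sparsity = (acts == 0).float().mean().item()    # counted by the metric
    eff_sparsity = (normed == 0).float().mean().item()   # seen by the next weights
    print(relu_sparsity, eff_sparsity)          # e.g., ~0.5 vs ~0.0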

The Hybrid network demonstrates a significant reduction in total effective operations against dense operations, outlining significant gains if deployed on specialized sparsity-aware hardware. However, for the particular network, the number of effective ACs, generated by the spiking neuron components, is two orders of magnitude smaller than the number of effective MACs within the ANN components. Such a hybrid network may not warrant specialized accumulation units, and the baseline motivates further research in hybrid networks with a larger proportion of spiking neuron activity compared to artificial neuron activity.

NHP motor prediction

Small fully-connected, feedforward networks were developed for the NHP motor prediction baselines:

  • ANN – In the ANN baseline, the cortical activity from the 50 most recent data samples is buffered to be used as network input. The network has two hidden layers and 2 final outputs predicting X and Y velocities, with a fully-connected topology of Nch-32-48-2, where Nch refers to the channels of cortical data (96 for NHP Indy, and 192 for NHP Loco). Batch normalization is applied after each hidden layer.

  • SNN – The SNN uses the data samples directly as input to the network, without buffering. It has a hidden layer of 50 LIF neurons, for a fully connected topology of Nch-50-2 LIF neurons. The output neurons do not have a reset mechanism, and the membrane potential is directly read to produce the output velocities.
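
For illustration, a plain-PyTorch sketch of the SNN baseline's structure is given below. It hand-rolls the LIF dynamics rather than using a spiking framework, and the decay, threshold, and reset-by-subtraction choices are assumptions, not the baseline's exact configuration; training such a model would additionally require a surrogate gradient for the spike threshold.

    import torch
    import torch.nn as nn

    class LIFRegressor(nn.Module):
        """Nch-50-2 topology: LIF hidden layer, non-resetting output neurons."""
        def __init__(self, n_ch: int, beta: float = 0.9, threshold: float = 1.0):
            super().__init__()
            self.fc1 = nn.Linear(n_ch, 50)
            self.fc2 = nn.Linear(50, 2)
            self.beta, self.threshold = beta, threshold

        def forward(self, x_seq):               # x_seq: (timesteps, batch, n_ch)
            v1 = torch.zeros(x_seq.shape[1], 50)
            v2 = torch.zeros(x_seq.shape[1], 2)
            outputs = []
            for x in x_seq:
                v1 = self.beta * v1 + self.fc1(x)
                spikes = (v1 >= self.threshold).float()
                v1 = v1 - spikes * self.threshold        # assumed reset by subtraction
                v2 = self.beta * v2 + self.fc2(spikes)   # output layer: no reset
                outputs.append(v2)
            return torch.stack(outputs)         # membrane potential read as X/Y velocity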

Table 4 shows the results for the ANN and SNN baselines, averaged between sessions from each NHP (Indy and Loco). The ANN and SNN are similar in footprint size and number of dense operations per model forward pass, and also reach comparable prediction quality based on R2 score. Each model is small in footprint and operation count, demonstrating that this task can be solved by shallow edge networks, validating prior studies51.

Table 4 Baseline results for the NHP motor prediction task, for NHP Indy (96-channel data, top), and NHP Loco (192-channel data, bottom)

Between the baselines, the SNN realizes similar correctness at significantly reduced complexity compared to the ANN. Extremely high activation sparsity in the SNN (0.998) directly translates to low effective accumulate operations, demonstrating the adequacy of stateful, binary-activation neuron models for sparse regression tasks. Meanwhile, similarly to the RED ANN in the event camera object detection task, activation sparsity in the ANN baseline does not translate to effective operation efficiency, as batch normalization is applied to activations before multiplication with synaptic weights.

We further explore increasing task accuracy with more complex ANN and SNN models: ANN_Flat and SNN_Flat. For these networks, the 50 data samples of buffered input are split into n_p = 7 accumulated bins. For ANN_Flat, the 7 bins are spatially flattened as input to the network, so its topology is (7 × Nch)-32-48-2. SNN_Flat uses the Nch-32-48-2 topology, and the 7 bins are temporally flattened, presented to the network as separate input timesteps. Each prediction still uses the membrane potential of the output neurons after all input timesteps, and the network is reset for each prediction. Layer normalization is also applied to the SNN_Flat inputs.

Figure 4 shows plots of complexity and predictive quality of all four baseline networks. Both flattened networks demonstrate significantly greater R2 performance than the other two networks. However, the larger input dimension of the ANN_Flat network is reflected in its greater footprint, and the increased model timesteps and layer normalization sharply increase the effective operations of SNN_Flat by two orders of magnitude compared to the simpler SNN. Thus, while input flattening and normalization increase the quality of model predictions for ANNs and SNNs, each comes with a significant complexity trade-off.

Fig. 4: Footprint and effective synaptic operations vs R2, for four task baselines.

Each model has two points: the solid marker represents NHP Indy, and the hollow marker represents NHP Loco.

Chaotic function prediction

The chaotic function prediction task has two recurrent ANN baselines, which feature distinct network architectures:

  • Long short-term memory (LSTM) – LSTMs are a class of recurrent ANN architectures52, utilizing multiple gates for selective retention or omission of past information. The LSTM baseline consists of a single LSTM with a hidden state of 100 neurons, followed by a feed-forward layer to produce single-dimension output predictions. In addition, the LSTM baseline utilizes explicit memory by buffering 50 previous datapoints, spatially flattening them into 50 input channels.

  • Echo state network (ESN) – ESNs are randomized recurrent ANNs that belong to a class of algorithms known collectively as reservoir computing53, featuring more biologically-inspired principles than LSTMs despite not being spiking networks. Standard ESNs have only one hidden layer (the reservoir), where synaptic connections projecting input data to the hidden layer and recurrent synaptic connections within the hidden layer are chosen randomly and stay fixed during the training. The model architecture for the ESN baseline has two neurons in the input layer, which projects the Mackey-Glass function input and additional constant bias input into a hidden layer of 186 neurons. Within the hidden layer, the probability of recurrent connections is set to 0.11.
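
For illustration, a minimal echo state network of the kind described can be sketched in a few lines of NumPy. The spectral-radius scaling and ridge-regression readout shown here are standard reservoir-computing choices and are assumptions, not the exact baseline configuration.

    import numpy as np

    class ESN:
        """Minimal echo state network: fixed random input/recurrent weights,
        trained linear readout (ridge regression)."""
        def __init__(self, n_in=2, n_res=186, density=0.11,
                     spectral_radius=0.9, seed=0):
            rng = np.random.default_rng(seed)
            self.w_in = rng.uniform(-1, 1, (n_res, n_in))
            w = rng.uniform(-1, 1, (n_res, n_res))
            w *= rng.random((n_res, n_res)) < density        # sparse recurrence
            w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
            self.w_res, self.w_out = w, None

        def _states(self, inputs):        # inputs: (T, n_in), incl. constant bias column
            x, states = np.zeros(self.w_res.shape[0]), []
            for u in inputs:
                x = np.tanh(self.w_in @ u + self.w_res @ x)
                states.append(x)
            return np.array(states)

        def fit(self, inputs, targets, ridge=1e-6):
            s = self._states(inputs)
            self.w_out = np.linalg.solve(s.T @ s + ridge * np.eye(s.shape[1]),
                                         s.T @ targets)

        def predict(self, inputs):
            return self._states(inputs) @ self.w_out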

The LSTM and ESN models were evaluated on a Mackey-Glass time series with τ = 17. The model is evaluated over 30 instantiations of the system; in each instance the start point is shifted forward by half of the Lyapunov time. The model is re-initialized and re-trained on each instance, and the results are averaged over all 30 instances. Table 5 shows the averaged results for the LSTM and ESN model baselines.

Table 5 Baseline results for the chaotic function prediction task

The ESN model is architecturally unique compared to the other ANN and SNN baselines. The connection sparsity metric (0.876) reflects the high number of zero-weight connections across its reservoir hidden layer. Due to this sparsity, hardware with support for sparse synaptic representation by ignoring zero weights would require less memory to represent the network, thus decreasing the deployed footprint of the model. The high connection sparsity of the ESN leads to a significant reduction in synaptic operations: the ESN uses an order of magnitude fewer effective operations (4.37 × 10³) than the LSTM (6.03 × 10⁴), while achieving comparable sMAPE. The activation sparsity of the ESN is 0 because its neurons use tanh(·) rather than ReLU activations.

Furthermore, we examine the generalization and robustness of the particular ESN and LSTM models by applying them, with fixed hyperparameter sets, to other Mackey-Glass time series. Figure 5 shows the sMAPE score of the models over time series with the Mackey-Glass τ parameter varying between 17 and 30. The models were trained independently for each time series. As the τ parameter characterizes the time delay of the system, its increase roughly corresponds to prediction difficulty, shown by the increasing sMAPE trend across the plot. Notably, the LSTM maintains a lower error than the ESN for all τ > 18. The LSTM uses explicit memory via input buffering, and we conjecture that this historical data provides greater robustness to the varying time series characteristics. The ESN uses only one previous timestep as input, so its memory is retained only implicitly within its hidden layer. While the ESN tunes well to the τ = 17 case and demonstrates greatly reduced effective operations compared to the LSTM, the same set of hyperparameters does not generalize as well to other time series. This motivates further research into the trade-offs between explicit memory buffers and implicit memory within the network state, in terms of single-series forecasting performance, complexity, and generalization capability.

Fig. 5: Correctness (sMAPE) of ESN and LSTM models evaluated on Mackey-Glass time series with varying τ parameters.

The models use a constant set of hyperparameters.

Discussion and opportunities for further research

Baseline results for the four v1.0 algorithm track tasks compare the correctness and complexities of various solution types. Compared to ANNs, SNNs and ESNs demonstrate complexity advantages such as smaller footprints, high sparsity, and accumulate rather than multiply-and-accumulate operations. Especially on the motor prediction and chaotic function prediction regression tasks, the SNN and ESN baselines already achieve competitive correctness at lower complexity than the ANN and LSTM counterparts. The standard framework and tooling provided by NeuroBench enable further research into model architectures, data pre-processing and buffering, and training paradigms to achieve greater performance.

System Track Benchmark Framework

While the algorithm track aims to benchmark solutions in a system-independent manner via complexity analysis, the NeuroBench system track aims to evaluate the deployed execution time, throughput, and efficiency of systems consisting of an algorithm deployed on and tailored to a hardware platform. Previous benchmark studies have examined neuromorphic systems under various applications, including keyword spotting39,54, audio and video processing55, and combinatorial optimization56,57. While these studies have demonstrated neuromorphic system advantages, the benchmark tasks have not been aligned across studies. For the hallmarks of neuromorphic hardware to be aptly judged against conventional systems and to foster the expansion of neuromorphic solutions, transparent and objective comparisons must be made on standard tasks, both head-to-head between sufficiently mature neuromorphic systems and against conventional systems.

A key challenge for benchmarking neuromorphic hardware is that systems are implemented and deployed at vastly different scales to serve diverse applications, from cloud services (e.g., multi-chip platforms like Loihi58 and SpiNNaker59) to embedded sensing intelligence (e.g., Speck60 and SNP61). This range is visualized in Fig. 6.

Fig. 6: Types of neuromorphic systems at various integration scales.

The range of system scales from single-chip edge devices, to multi-chip boards, to multi-board server machines highlights the diverse applications that neuromorphic systems are applied to, and introduces challenges in defining comprehensive system benchmarks.

Existing benchmarks for conventional systems have individual focuses across high-performance computing62, datacenter-level computing19, and embedded processing63, utilizing a tailored set of benchmark tasks to address the capabilities and requirements across the different computing scales. Thus, rather than pursuing a one-size-fits-all suite of tasks, the goal of the NeuroBench system track is to develop benchmarks for various scales and use cases, under multiple application areas in which both conventional and neuromorphic platforms may compete. The selected v1.0 NeuroBench system track benchmarks represent key commercial application areas for existing systems, and they differ from the tasks in the algorithm track, which are more research-oriented. As benchmark results continue to identify properties of highly effective algorithms and systems, the two tracks will converge to the same selection of tasks that are seen as the most impactful for future progress in the field.

In this section, we present the system track guidelines outlining metrics and tasks, representing collective design between multiple owners and vendors of neuromorphic hardware. Baseline benchmark results for neuromorphic and conventional systems are reported, and further official results will be collected and announced at a regular cadence, akin to the MLPerf suite64. As with all other facets of the NeuroBench framework, the system track guidelines will continue to be adapted and extended iteratively as benchmark results are produced and shared.

System track metrics

In order to be representative of the properties of a deployed system, the system benchmarks, like the algorithm benchmarks, are assessed at the task level for the overall system, as opposed to operation- or kernel-level assessment of individual components. Task-level benchmarks enable straightforward comparison between systems of any type with regard to their abilities to solve problems, and the overall system-level measurement describes the realistic capability and efficiency of a whole solution.

Each individual system benchmark uses task-specific metrics aligned with correctness, timing, and efficiency to measure the system under test (SUT). The following general considerations are applied to each category:

  • Correctness – In other system benchmarks, such as the closed category of the MLPerf Inference framework19, the same trained model is used to benchmark all SUTs, and a correctness threshold is imposed to ensure optimizations such as lower precision do not disrupt task performance. Due to the tight coupling between an algorithm and its system implementation in many existing neuromorphic hardware solutions, the particular model used to solve a NeuroBench system track benchmark task is unconstrained. Therefore, correctness must be measured to verify the validity of the solution. No correctness thresholds are imposed on submissions, but the benchmark leaderboard will impose tiers of solution correctness on submissions to evaluate accuracy-efficiency trade-offs of system approaches.

  • Timing – Depending on the task, timing performance can include measurements of sample throughput or execution time. Individually, the former entails an offline, batched inference benchmark, while the latter aligns with a streaming benchmark, in which one inference does not start until the previous one ends. Together, both throughput and execution time should be reported for tasks in which the SUT runs multiple inferences at any given time, each representing a request which must be responded to within a constrained window. The MLPerf Inference framework has defined widely-adopted general task scenarios corresponding to each of these categories (offline, single-stream, and server, respectively), and the NeuroBench system track will use these scenario guidelines where applicable to maintain consistency and build on conventional frameworks. In addition, neuromorphic systems are also applied to tasks in which there is no notion of discrete sample throughput or execution, such as for heuristic approximations of intractable problems or operation over a continuous stream of data (e.g., from an event camera). Timing performance should be defined on a per-benchmark basis for such tasks, such as a time-to-solution latency or percentage of execution which exceeds a real-time threshold.

  • Efficiency – Conventional system benchmarks such as TOP50062 for HPC and MLPerf Inference19 for deep learning do not require power measurement submission in the main benchmark, instead allowing for separate submissions to an adjacent power track (Green50065 and MLPerf Power66, respectively). Not only has efficiency usually been considered a second-order metric for conventional systems, it is also notoriously difficult to measure precisely. However, as energy efficiency is a key hallmark of biology and thus is a focus of neuromorphic research, power and energy consumption must be first-order metrics in the NeuroBench system track. Similarly to timing metrics, efficiency metrics should be tailored on a per-benchmark basis, i.e., a real-time always-on processing task may focus on average power, while offline batched systems focusing on high-throughput inference may focus on both peak power and energy per inference.

Ideally, a benchmark framework should include a strict and consistent set of measurement methodologies, including power monitor devices, chip interfaces, and data loading and measurement software. However, neuromorphic systems currently explore a broad range of varied implementation approaches, board-level integration, and development maturity67,68, and such platform diversity creates difficult challenges for completely consistent methodology. For instance, among mature large-scale neuromorphic systems, implementation strategies range from digital58,59,69 to mixed-signal approaches70,71, and system-level integration extends past single-chip boards into servers including hundreds of chips59,72,73. Thus, to enable an initial step towards consistency in the system track while ensuring openness, we focus on the development of guidelines for transparent documentation, as they provide the foundation for shared methodology among highly diverse solutions. While there may be differences in how metrics are measured, salient details will be available to contextualize the results, allowing for holistic analysis, and leading the way for future consistency by enforcing transparency.

Benchmark submissions may perform separate runs to report performance and power in order to demonstrate system flexibility (e.g., a ‘performance-mode’ run optimal for execution time and an ‘efficiency-mode’ run optimal for energy), however in all runs, both metrics must be reported.

Importantly for the NeuroBench system track, in measuring timing and efficiency, data pre- and post-processing must be taken into account. Neuromorphic methods will often consume and produce non-standard (e.g., event-based) data modalities, the processing of which may consume a significant amount of the overall execution time and may not be computed on the neuromorphic hardware itself. As many instances of neuromorphic hardware cannot be deployed without such associated processing, it is essential that measurements capture the cost of data processing, which stands in contrast with conventional system benchmarks whose measurements start from pre-processed data64.

System track benchmarks

Two benchmark specifications for the v1.0 system track are defined in this article, covering embedded to datacenter scales. Full benchmark details are available in the Methods section.

  • Acoustic Scene Classification – The acoustic scene classification benchmark challenges systems to classify audio into predefined categories based on the environmental audio context. Such capabilities are key for embedded computing in low-power hearable devices, which can utilize them to automatically adjust sound equalization profiles, appropriately target microphone denoising, and support active noise cancellation. The application further challenges systems to fulfill technical requirements, such as always-on and real-time operation, and time series processing. Acoustic scenes provide a rich repertoire of features that are necessary for prediction; this task thus complements keyword classification, which mainly focuses on shorter-term features (e.g., phonemes) with a relatively smaller feature repertoire. The benchmark evaluates the classification capabilities of both neuromorphic systems and conventional computing platforms using datasets from the DCASE challenge74. These datasets consist of a myriad of audio recordings from diverse environments, including airports, public parks, and buses, thus providing a comprehensive foundation for testing both application- and system-level performance. The NeuroBench subset of the DCASE dataset includes 41360/16240 train/test samples across four classes (airport, street traffic, bus, park). The task will be presented under the single-stream task scenario, providing one 1-second sample to the SUT at a time. Classification probability will be sampled to determine the correctness of the prediction. As the NeuroBench system track allows for unconstrained algorithmic implementation, pre-processing and inference metrics should be separately measured and reported together, which differs from prior system benchmarking that only measures inference39,63. Timing results report on-device average execution time per sample. Since the platform diversity of edge-targeted systems poses inherent inconsistencies in efficiency measurement, power should be reported under idle and active contexts, following prior benchmark study methodology39. Idle power measures the system prepared for inference with the model loaded, and active power measures the system running pre-processing or inference. The difference between active and idle measurements gives dynamic power, which is used along with execution time to calculate dynamic energy-per-sample. A worked example of this calculation is sketched after this list.

  • QUBO – As a non-ML task, NeuroBench incorporates quadratic unconstrained binary optimization (QUBO). QUBO is a particularly beneficial first optimization task for NeuroBench for multiple reasons. First, the binary variables are a natural fit for neuromorphic systems with purely binary spike communication. Second, real-world QUBO applications typically feature sparse cost matrices75, which benefit from the sparse synaptic connectivity and execution that neuromorphic systems are often optimized for76. Third, the benchmark is easily scalable from mobile to datacenter-level systems, promoting benchmark inclusivity among neuromorphic systems which have varying capacity and scaling capabilities. The initial set of QUBO workloads in NeuroBench searches for the maximum independent (i.e., unconnected) set of nodes in graphs, a task that has wide applications across industry and academia, such as resource allocation in wireless networks, portfolio optimization, and task scheduling77. NeuroBench provides a QUBO generator that uniquely specifies each workload by three parameters: the number of graph nodes, the density of graph connections, and a random seed. The generator provides a large dataset for reliable statistics and allows scaling from modest workloads for small-scale and prototype systems to large workloads for larger-scale systems. The graph sizes specified by the benchmark increase in a pseudo-geometric progression (10, 25, 50, 100, 250, ...), and submissions are encouraged to extend the problem size to the limits of the SUT. Graph density ranges from 1% to 30%, in order to show the relationship between the number of connections and SUT power. Five random seeds for each setting should be tested. Optimization algorithms use heuristics to iteratively refine approximate solutions to intractable problems. The QUBO benchmark thus measures solution optimality and energy consumption after fixed, pre-set runtimes, removing any timing measurement. Solution optimality is defined as BKS-Gap, the relative gap between the SUT’s solution and the best-known solution (BKS) to the same problem, found using a high-powered solver with a long runtime.

Baseline results

Baseline results for each of the two system track benchmarks are provided for a mature neuromorphic system against a conventional platform. As with the algorithm track baseline results, the system track baselines are intended to provide a snapshot of the solution space and starting points for the task leaderboards. Further details on each baseline system are available in the Methods section, and in-depth system documentation for the Xylo ASC baseline and the CPU/Loihi 2 QUBO baselines is provided by Ke et al.78 and Pierro et al.57, respectively.

Acoustic scene classification

For the acoustic scene classification task, two baseline embedded systems are reported:

  • CPU – The CPU baseline is an Arduino Nano 33 BLE, which uses an ARM Cortex M4 microcontroller for both pre-processing and inference. The digital audio sample is pre-processed using Mel-filterbank energies (MFE), and inference uses a conventional CNN. Execution time is measured using on-chip timers, and power is measured as total system power.

  • Xylo – The Synsense Xylo79 neuromorphic baseline uses a feed-forward SNN with multiple synaptic time constants54. As the system is intended for continuous, real-time audio processing, the board uses an analog front end to pre-process analog audio signals directly from a microphone into spikes for the digital inference engine. To conform with a digital benchmark dataset, a simulator of the analog pre-processor generates spikes, which are routed to the inference module. Execution time and power are measured using on-board instruments.

Table 6 lists baseline results for the neuromorphic Xylo system against the Arduino system. Compared to prior neuromorphic audio system benchmarking39, which takes a server-class CPU as a point of comparison, we adopt a fairer approach by focusing on low-power edge applications and comparing against an Arduino embedded microprocessor. At comparable inference accuracy, Xylo exhibits 60.9 × less dynamic inference power and 33.4 × less dynamic inference energy consumption than the Arduino.

Table 6 Baseline results for the acoustic scene classification task

QUBO

Three baselines are measured for the QUBO benchmark:

  • Simulated Annealing (SA) – The simulated annealing solver uses Markov chain Monte Carlo (MCMC) sampling to probabilistically explore the search space.

  • Tabu Search (TABU) – Tabu search solvers maintain and iterate a list of prohibited actions in order to prevent the search from remaining in local minima or revisiting states. Both the TABU and SA solver baselines use the D-Wave Samplers library80 on an Intel Core i9-7920X desktop-class CPU, with power measured using Intel SoC Watch.

  • Loihi 2 – The Loihi 2 neuromorphic system solver uses an SNN formulation of the simulated annealing algorithm which enables solving via neural dynamics, and parallelization via stochastic refractory periods. The baseline is implemented on one Loihi 2 chip on the 8-chip Kapoho Point board, and internal power instrumentation measures all compute and memory components of the chip.

Figure 7 shows the optimality reached by the CPU- and Loihi 2-based solvers after different timeouts. Under tight time constraints, with timeouts of \({10}^{-2}\) seconds or less, Loihi 2 finds feasible solutions to workloads 4 × larger than the CPU can handle. For timeouts of 10 seconds or longer, the CPU running TABU provides the lowest BKS-Gap, incentivizing algorithmic advances for neuromorphic optimization systems. Figure 8 illustrates the power consumed during runtime. Across the workloads, the Loihi 2 solver requires 37.24 × less power than the best CPU solver.

Fig. 7: Percentage gap from the best known solution (BKS-Gap%) for the QUBO workloads with QUBO matrices at 15% density (lower is better).
figure 7

Results are shown for different timeouts of the QUBO solvers. Figure taken, with permission, from Pierro et al.57.

Fig. 8: Power consumption of the QUBO solvers running simulated annealing (SA) or TABU search on CPU, and the parallelized version of simulated annealing on Loihi 2.
figure 8

The CPU solvers require up to 37 × more power than the neuromorphic algorithm on Loihi 2. Since the QUBO workloads were run for a fixed timeout, differences in power consumption are equivalent to differences in energy consumption of the processors. Figure taken, with permission, from Pierro et al.57.

Discussion and future work

The initial baselines for the v1.0 system track compare correctness, timing, and efficiency of neuromorphic systems against conventional CPU systems in the domains of both audio classification and optimization. Against mature, commercially-developed CPU systems, for both edge and server use cases, the neuromorphic systems show strong advantages in general efficiency, as well as further promise in terms of timing and correctness. In future NeuroBench iterations, the system track benchmarks can be unified under common tooling, similar to the algorithm track. Software toolchains such as Lava36, Fugu37, SPyNNaker81, and Samna82, among others, have been developed to interface with specific hardware platforms. Many of these stacks are built with general paradigms to support extension to any backend, and the community is actively moving towards developing standards for deployment tools. The current v1.0 benchmark specifications allow for open algorithm and software design in order to demonstrate fully optimized performance for neuromorphic systems. As standards mature, a core focus of the NeuroBench system track is to introduce a closed-algorithm benchmarking category that leverages the recently proposed NIR model description framework83 as a general, cross-platform tool for benchmarking key workloads of interest across many different platforms.

Discussion

Benchmarking neuromorphic computing has faced challenges stemming from the diversity of neuromorphic approaches, the range of implementation and deployment tools, and rapid research evolution. NeuroBench addresses these challenges as a framework for the inclusive, actionable, and iterative benchmarking of neuromorphic solutions, by including novel tasks and metrics, open-source and extendable harness tooling, and facilitating systematic growth via community collaboration. NeuroBench is supported and developed by a broad community of neuromorphic researchers to be a standard, agreed-upon benchmarking framework for neuromorphic technology.

Future directions of the NeuroBench initiative will build on the baselines outlined in this article to increase the scope of the benchmark framework. One important direction for NeuroBench is towards closed-loop benchmarks15,84. Biological systems excel in interacting with dynamic environments, demonstrating high energy efficiency, real-time reaction, and versatility. As such, embodied intelligence with adaptive sensory and action capabilities are of interest to neuromorphic research. In closed-loop scenarios, the objective is to sense and act within an environment to complete a task, rather than to statically process a frozen dataset, thus the benchmark harness infrastructure and measurement protocols will be extended to facilitate such benchmarks.

Further important directions will be to increase the inclusivity of NeuroBench. While at present, the algorithm track harness supports PyTorch-based libraries, further coverage can be garnered by extending the interfaces to support other software libraries, potentially utilizing portable tooling such as NIR83 as a standard for connecting to benchmark measurements. In addition, the system track guidelines can be extended to define benchmark protocols for continuous-time execution and exploratory hardware platforms in simulation stages, such as memristive hardware. All future NeuroBench expansion will be informed by the collected results and continue to be driven by the interests and development of the broader community.

Methods

This section outlines details and specifications of the benchmark metrics, tasks, and baselines.

Algorithm track metrics

NeuroBench includes correctness and complexity metrics, the latter of which is divided into static and workload metrics. Static metrics do not depend on the model inference and input data, while the workload metrics do. Note that the defined metrics reflect only the model and model execution. Data pre-processors and post-processors are not taken into account in the v1.0 algorithm track results.

Footprint

The footprint metric reflects the memory footprint of a model. It is distinct from execution memory, which may incur further usage, e.g., to store activations. It is computed for a model by accumulating the sizes of the model’s parameters and buffers, in bytes. Parameters store the model synaptic weights, and buffers include other inference memory requirements, such as the internal states of recurrent or spiking layers and buffers of recent input data, if the model must record data for input binning. Considering n parameters, each requiring pi bytes, and b buffers of size qj, the total model footprint is \({\sum }_{i=1}^{n}{p}_{i}+\mathop{\sum }_{j=1}^{b}{q}_{j}\).
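
As a concrete illustration, the footprint can be accumulated directly from a PyTorch module by summing the byte sizes of its parameters and buffers; the sketch below is purely illustrative and is not the harness implementation itself.

```python
# Illustrative footprint computation for a PyTorch model: accumulate the byte
# sizes of all parameters and buffers (not the official harness code).
import torch.nn as nn

def footprint_bytes(model: nn.Module) -> int:
    params = sum(p.numel() * p.element_size() for p in model.parameters())
    buffers = sum(b.numel() * b.element_size() for b in model.buffers())
    return params + buffers

# Example: a small fully-connected model with 32-bit floating-point weights.
model = nn.Sequential(nn.Linear(96, 50), nn.ReLU(), nn.Linear(50, 2))
print(footprint_bytes(model))  # (96*50 + 50 + 50*2 + 2) parameters * 4 bytes
```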

Model execution rate

Execution rate is a numeric that is not directly computed by the harness but should be reported by the user. It reflects the rate at which the model processes input data, relative to real time. If the model processes input with a temporal stride of t seconds, then the rate should be reported as \({t}^{-1}\) Hz. Note the distinction between stride and bin window: input can be binned in overlapping windows, but the execution rate depends on the temporal stride of window processing. As an example, a model may use 50 ms windows of input and compute every 10 ms, which would give an execution rate of 100 Hz.

This numeric is currently not well-defined for models operating under event-based or continuous-time contexts. These limitations will be addressed in future benchmark versions.

Connection sparsity

The parameter matrices of each layer l in a model, representing synaptic weights, are collected, and the number of zero weights ml and total weights nl are aggregated, with the connection sparsity defined as \(\frac{{\sum }_{l}{m}_{l}}{{\sum }_{l}{n}_{l}}\).
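
A minimal sketch of this computation over the connection layers of a PyTorch model is shown below; the layer types checked are illustrative and not an exhaustive list.

```python
# Illustrative connection sparsity: count zero weights across the parameter
# matrices of connection layers and divide by the total number of weights.
import torch.nn as nn

def connection_sparsity(model: nn.Module) -> float:
    zeros, total = 0, 0
    for layer in model.modules():
        if isinstance(layer, (nn.Linear, nn.Conv1d, nn.Conv2d)):
            weight = layer.weight.detach()
            zeros += int((weight == 0).sum())
            total += weight.numel()
    return zeros / total if total > 0 else 0.0
```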

Activation sparsity

Activation sparsity is computed after the inference phase. The sparsity is calculated by accumulating the number of zero activations (z), over all neuron layers (l), timesteps (t), and input samples (i) and dividing by the total number of neurons (N), \(\frac{{\sum }_{l}{\sum }_{t}{\sum }_{i}{z}_{l,t}^{i}}{{\sum }_{t}{\sum }_{l}{\sum }_{i}{N}_{l,t}^{i}}\). The outputs of ReLU functions and spikes from spiking neurons are considered activations.
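
One way to gather these counts in practice is with forward hooks on the activation layers, as in the illustrative sketch below; spiking layers would be hooked in the same way.

```python
# Illustrative activation sparsity via forward hooks: accumulate zero counts
# over all hooked activation outputs during inference.
import torch
import torch.nn as nn

counts = {"zeros": 0, "total": 0}

def count_zeros(module, inputs, output):
    counts["zeros"] += int((output == 0).sum())
    counts["total"] += output.numel()

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
for m in model.modules():
    if isinstance(m, nn.ReLU):          # hook spiking layers similarly
        m.register_forward_hook(count_zeros)

_ = model(torch.randn(8, 20))           # dummy inference pass
print(counts["zeros"] / counts["total"])
```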

Synaptic operations

Synaptic operations are the multiplications of weights by activations or input data, and are calculated using the inputs and weights of connection layers (e.g., torch.nn.Linear and torch.nn.Conv2d). Effective synaptic operations are operations in which a non-zero weight is multiplied by a non-zero activation. Effective operations are further divided into multiply-accumulates (MACs) and accumulates (ACs), where accumulates correspond to activations or input data containing only values in {−1, 0, 1}, and multiply-accumulates cover all other cases. The reported number of synaptic operations is the average number of synaptic operations required per model execution, the rate of which is defined by the model execution rate metric.

The number of effective synaptic operations is computed by performing the forward pass of a layer and counting the number of operations in which there is no zero multiplication. Practically, this is implemented in the harness by setting all non-zero weights in the layer and all the non-zero activations to 1, then performing the forward pass and summing the output to give the number of synaptic operations.

The number of dense synaptic operations is computed in a similar fashion, by setting all weights and activations to 1 and accumulating the output of the forward pass. Biases are not taken into account in the calculation of the synaptic operations, as they are added after weight multiplications and accumulation.
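
The counting trick described above can be sketched as follows for a single linear connection layer; this is a simplified illustration rather than the harness implementation, and the dense count is obtained analogously by setting all weights and activations to one.

```python
# Illustrative effective synaptic operation count for one connection layer:
# non-zero weights and non-zero activations are set to 1, the bias is zeroed,
# and the forward-pass output is summed.
import copy
import torch
import torch.nn as nn

def effective_synops(layer: nn.Linear, activations: torch.Tensor) -> int:
    probe = copy.deepcopy(layer)
    with torch.no_grad():
        probe.weight.copy_((probe.weight != 0).float())  # non-zero weights -> 1
        if probe.bias is not None:
            probe.bias.zero_()                           # biases are excluded
        binary_in = (activations != 0).float()           # non-zero activations -> 1
        return int(probe(binary_in).sum())

layer = nn.Linear(100, 10)
x = torch.relu(torch.randn(1, 100))                      # sparse activations
print(effective_synops(layer, x))
```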

Note that processing of activations before the connection layer, for instance using batch normalization, can transform sparse activations into dense input at the connection layer, which will lead to high effective synaptic operations despite high activation sparsity. Furthermore, such processing can transform binary activations to non-binary data, causing effective operations to be MACs rather than ACs. When deployed to neuromorphic hardware, such algorithms that normalize activations before multiplication with synaptic weights may lose the benefits of sparse operation, e.g., an SNN with normalization following each spiking layer would require dense MAC weight calculation, no matter how few spikes were generated.

In some cases, algorithm execution may have distinct temporal sections of higher and lower synaptic operations, such as during initial caching versus continuous inference. For such algorithms, benchmark users may choose to distinguish synaptic operations and other complexity measurements between execution sections.

Algorithm track benchmark tasks

Keyword FSCIL

Few-shot class-incremental learning (FSCIL) is an established benchmark task setting in the computer vision domain22. It can be defined as follows: a base session with fixed classes, each with abundant training data, is used to train an initial model. Then, successive incremental training sessions introduce new classes in a few-shot learning scenario. In each session, only the current session classes are available to the model for training. After each incremental training session, the model is evaluated on all previously seen classes, including the base classes. Therefore, the model has to learn new classes while retaining knowledge about the previously learned ones.

Formally, for M-step FSCIL, where M is the total number of incremental sessions, each training session uses a support dataset D(t), t ∈ [0, M], to train new classes on. L(t) is the set of classes of the t-th session, where for all i ≠ j, \({L}^{(i)}\cap {L}^{(j)}=\varnothing\), meaning each training session uses a unique set of classes. D(0) and L(0) are the base class training data and set of base classes, respectively, D(1) and L(1) represent the first incremental session set, and so on. At session t, only D(t) is available for training, and for t > 0, D(t) contains a fixed number of classes (N) with few samples per class (K). This form of FSCIL is therefore named N-way K-shot FSCIL. At the end of each session t, model accuracy is reported on the test samples of all previously seen classes \({L}^{(0)}\cup {L}^{(1)}\cup \ldots \cup {L}^{(t)}\).

For the Keyword FSCIL task, classes in the base set (L(0)) have 700 samples each, with a fixed train/validation/test split of 500/100/100 samples. All classes within incremental sessions have 200 samples per word, with a fixed train/test split of 100/100. Of the 100 training samples, 5 are randomly selected for few-shot learning (each session is 10-way, 5-shot). The inclusion of 100 training samples per incremental class allows the number of few-shot samples to be scaled up to 100.

NeuroBench proposes an audio keyword classification version of the FSCIL task, which to the best of our knowledge is the first of its kind. This novel task is established by selecting a subset of the words and languages from the Multilingual Spoken Word Corpus (MSWC)23 dataset. The FSCIL task consists of a multilingual set of 100 base classes and 10 incremental sessions of 10 classes each, for a final total of 200 learned classes. Fifteen languages are represented: the base classes are composed of a set of five base languages with 20 words each, and each of the ten incremental sessions contains 10 words from a distinct language. The languages were chosen based on data availability within the MSWC dataset. The top five languages with the greatest number of potential words (words with enough data samples) are used as the base class languages, while the next ten languages with the greatest numbers are used for the incremental classes. The base languages are English, German, Catalan, French, and Kinyarwanda. The incremental languages are Persian, Spanish, Russian, Welsh, Italian, Basque, Polish, Esperanto, Portuguese, and Dutch. The order of languages presented in the incremental sessions is randomized, but each incremental session represents exactly one new language.

For each language, the longest words (with a sufficient number of samples) were selected, to allow rich and robust temporal features to be learned. Beyond the richness of longer words, there are practical considerations for this choice. The MSWC dataset normalizes all samples to a duration of 1 second, centered around the 0.5-second mark. For shorter words, this means that the data needs to be zero-padded on both sides to fill the entire duration. The longest words are likely to fill the complete sample and reduce zero-padding, which is also useful in scenarios in which algorithms seek to classify words before the sample has completed85. Furthermore, common keyword spotting solutions, such as Ok Google, Alexa, and Hey Siri, use multi-syllable wake-phrases to assist in accurate word classification. Using shorter keyword subsets can lead to greater challenge in both base language training and continual class learning. The full list of chosen words for the task presented in this paper (longest words), as well as other potential subsets of short words for the same languages, can be found with the dataset documentation through the harness.

Within each language, words with very similar pronunciation and meaning are excluded (e.g., l’amendement and amendements in French). Across different but related languages, words with similar pronunciation and meaning are likewise excluded (e.g., university, universität, and universitat in English, German, and Catalan).

The subset of MSWC used for this FSCIL task is significantly smaller in size (630 MB) than the full MSWC dataset (124 GB); subset download details can be found in the harness.

Event camera object detection

The task of object detection using event camera data involves identifying bounding boxes of objects belonging to multiple predetermined classes in an event stream. The dataset is the Prophesee 1 Megapixel automotive detection dataset25, which is one of the largest and highest-resolution event-camera detection datasets currently available. Task performance is measured by the COCO mean average precision (mAP) metric26, which is commonly used for the evaluation of object detection algorithms. Only three of the seven available object classes within the dataset are used, due to limited sample availability, matching prior work25.

COCO mAP is calculated using the intersection over union (IoU, Equation (1)) of the bounding boxes produced by the model against ground-truth boxes. Here, A and B refer to bounding boxes, and the intersection and union consider the overlapping area and the area covered by both boxes, respectively. The IoU is compared against 10 thresholds between 0.50 and 0.95, with a step size of 0.05. For each threshold, precision is calculated (Equation (2)) with True Positives (TP) and False Positives (FP) determined by whether the IoU meets the threshold or not, respectively. The mAP is calculated as the averaged precision over all thresholds for each class, which is further averaged over all classes to produce the final result.

$$IoU(A,B)=\frac{| A\cap B| }{| A\cup B| }$$
(1)
$$Precision(TP,FP)=\frac{\sum TP}{\sum TP+\sum FP}$$
(2)
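
For reference, the IoU of Equation (1) for two axis-aligned boxes can be computed as in the short sketch below; the (x1, y1, x2, y2) box format is an assumption for illustration, and the benchmark itself relies on the Metavision tooling.

```python
# Illustrative IoU for axis-aligned boxes given as (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```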

Note that in the dataset, labels are generated from images from an RGB camera. Due to the nature of event cameras, objects which are stationary at the start of a recording sequence generate no events and cannot be detected. Therefore, labels within the first 0.5 s of each sequence are not taken into account. Furthermore, as the RGB camera used for labeling has a higher resolution than the event camera, not all objects which appear in the RGB image are recognizable from the generated events. Thus, objects with a diagonal of less than 60 pixels are also not considered. The dataset and metric measurement are implemented using the Prophesee Metavision software86.

Non-human primate motor prediction

The non-human primate motor prediction task involves predictive modeling of two-dimensional fingertip velocity, given neural motor cortex data. The six sessions used for the benchmark comprise three recording sessions each from two non-human primates (NHP Indy and NHP Loco), such that the chosen sessions approximately span the entire duration of the experiment27 (several months). The specific sessions used are indy_20170131_02, indy_20160630_01, indy_20160622_01, loco_20170301_05, loco_20170215_02, and loco_20170210_03. Each of these sessions consists of one day of experiments, during which multiple reaches are recorded. During each reach, a target position is displayed, which the NHP needs to localize and touch with its finger. Once the NHP touches the correct target for the current reach, the next reach is instantiated, showing a new target position. The data contains sensorimotor cortex recordings from 96 channels for the first NHP (Indy) and 192 channels for the second NHP (Loco), gathered and labeled at a frequency of 250 Hz. Two-dimensional position data of the NHP fingertip during its reaches is provided in the dataset, and these are translated into X and Y velocity ground-truth labels using discrete derivatives87.

Each session is segmented into individual reaches based on the target position for the NHP to touch. The data in each session is split such that the initial 75% of reaches are used for training and validation, and the remaining 25% of reaches are test data. The user can choose how to utilize the training and validation split for their particular method.

During evaluation, the coefficient of determination (R2, Equation (3)) for the X and Y velocities are averaged to report the correctness score for each session, where n is the number of labeled points in the test split of the session, yi is the ground-truth velocity, \(\hat{{y}_{i}}\) is the predicted velocity, and \(\bar{y}\) is the mean of the ground-truth velocities. The R2 from sessions for each NHP are averaged, producing two final correctness scores.

$${R}^{2}=1-\frac{\mathop{\sum }_{i=1}^{n}{({y}_{i}-\hat{{y}_{i}})}^{2}}{\mathop{\sum }_{i=1}^{n}{({y}_{i}-\bar{y})}^{2}}$$
(3)

Chaotic function prediction

The chaotic function prediction task is another sequence-to-sequence problem. Given an input sequence generated from a one-dimensional Mackey-Glass function, the task is to predict the future values of the same function. The dataset used for this task is synthetically generated, following the Mackey-Glass differential equation28 (Equation (4)), which is integrated and discretized with a timestep of Δt. The time series generated by this differential equation is a function of the Mackey-Glass parameters n, β, γ, and τ. Adhering to standard parameters29, the values used for n, β, and γ are 10, 0.2, and 0.1, respectively. τ is varied between 17 (a standard value) and 30, leading to 14 time series which vary greatly in dynamics and can be used to analyze the generalization of predictive models.

Each value of τ is associated with a Lyapunov time, the expected predictability timescale for chaos33, which is used as the time unit for each series. To calculate the overall Lyapunov time for each value of τ, we average the Lyapunov times of 30,000 generated time series of 2000 timesteps, with Δt = 1.0, each with a randomly chosen initial condition. All time series and Lyapunov times were generated and estimated using the JiTCDDE library88. For each final time series used for benchmarking, initial conditions are a point randomly chosen along the series. The Lyapunov time and initial condition x0 for each of the 14 final time series are provided in Table 7.

$$\frac{dx}{dt}=\frac{\beta x(t-\tau )}{1+x{(t-\tau )}^{n}}-\gamma x(t)$$
(4)
Table 7 Mackey-Glass parameters used for the 14 time series

As the integration of the differential equation can depend on underlying floating-point arithmetic and thus produce varying time series on different machines, the datasets are precomputed and loaded for training and evaluation. In the benchmark results, 30 instantiations of the Mackey Glass system are used, each with a length of 20 Lyapunov times and successively shifted forwards by half a Lyapunov time. The dataset time series are generated for 50 total Lyapunov times to allow for varied offset starting points. The generated time series are available to be downloaded under the NeuroBench harness.
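
For orientation, the sketch below shows how such a series could be generated with the JiTCDDE symbolic interface; the constant history value and series length are illustrative, and benchmark users should download the precomputed datasets rather than regenerate them.

```python
# Illustrative Mackey-Glass generation with JiTCDDE (Equation (4)); the
# initial history and number of integration steps here are arbitrary examples.
import numpy as np
from jitcdde import jitcdde, y, t

n, beta, gamma, tau, dt = 10, 0.2, 0.1, 17.0, 1.0

# dx/dt = beta * x(t - tau) / (1 + x(t - tau)^n) - gamma * x(t)
mackey_glass = [beta * y(0, t - tau) / (1 + y(0, t - tau) ** n) - gamma * y(0)]

dde = jitcdde(mackey_glass)
dde.constant_past([1.2])            # illustrative constant history
dde.step_on_discontinuities()

start = dde.t
series = np.array([dde.integrate(start + i * dt)[0] for i in range(1, 2001)])
```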

Symmetric mean absolute percentage error (sMAPE, Equation (5)), a standard metric in forecasting32, is used to measure the correctness of the model predictions \(\hat{{y}_{i}}\) against the ground-truth yi, over n data points in the test split of the time series. The sMAPE metric has a bounded range of [0, 200], thus diverging predictions (infinity or NaN) due to floating-point arithmetic have bounded error which can be used to average correctness over multiple time series instantiations.

$$sMAPE=200\times \frac{1}{n} \left(\mathop{\sum }_{i=1}^{n}\frac{| {y}_{i}-\hat{{y}_{i}}| }{(| {y}_{i}|+| \hat{{y}_{i}}| )}\right)$$
(5)
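
A direct implementation of Equation (5) is straightforward; the sketch below additionally saturates non-finite predictions at the metric's upper bound, as described above, and is illustrative rather than the harness code.

```python
# Illustrative sMAPE (Equation (5)); diverged predictions are bounded at 200.
import numpy as np

def smape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))
    ratio = np.where(np.isfinite(ratio), ratio, 1.0)   # inf/NaN -> worst case
    return float(200.0 * ratio.mean())
```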

Algorithm track baselines

All baselines are implemented using PyTorch nn.Module objects in order to interface with the harness.

Keyword FSCIL

The ANN baseline employs Mel-frequency cepstral coefficients (MFCC) pre-processing along with a modified version of the M5 deep convolutional network architecture44.

The MFCC pre-processing converts the 48 kHz, 1 s audio samples from MSWC into 20 channels of 200 timesteps (5 ms stride, 10 ms time bins), focusing on frequencies within the human voice range between 20 Hz and 40 kHz. The network contains four successive blocks, each consisting of 1D convolution, batch-normalization, ReLU activation, and max-pooling layers, followed by a single readout fully-connected layer. Convolutional layers apply their kernels over the temporal dimension of the samples, thus extracting longer temporal features through the depth of the network. We also incorporate dropout after the ReLU activations to avoid over-fitting and let the network be more general for incremental learning. The network is trained with stochastic gradient descent using cross-entropy loss and the Adam optimizer.
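
A minimal sketch of one such convolutional block is given below; the channel counts, kernel size, and dropout rate are illustrative placeholders rather than the exact baseline configuration.

```python
# Illustrative convolution block in the style of the modified M5 baseline:
# 1D convolution, batch normalization, ReLU, dropout, and max pooling.
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel=3, p_drop=0.2):
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=kernel, padding=kernel // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.Dropout(p_drop),        # dropout after the activation, as described
        nn.MaxPool1d(2),
    )

# Four blocks over (batch, 20 MFCC channels, 200 timesteps), then a readout.
backbone = nn.Sequential(conv_block(20, 32), conv_block(32, 64),
                         conv_block(64, 64), conv_block(64, 128))
```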

For the SNN baseline, we employ the Speech2Spikes46 (S2S) preprocessing algorithm to convert audio samples to spikes. For S2S, we use the default parameters from the original implementation; only the hop length is updated to match the 48 kHz sampling rate of the MSWC samples, whereas the original implementation was applied to 16 kHz audio. S2S applies a Mel spectrogram and a log operation to raw audio samples, converting them to positive and negative trains of spikes using delta-encoding.

Spike trains from S2S are used as input for the recurrent SNN (RSNN), which consists of 2 recurrent adaptive leaky integrate-and-fire (RadLIF) layers of 1024 neurons and one linear output layer. The model architecture is adapted from Bittar’s work45. The RadLIF neurons in these layers are LIF neurons that produce a binary spike s(t) and reset via subtraction when their membrane potential u(t) crosses a certain threshold value θ, combined with an extra adaptation variable w(t) to enable more complex temporal dynamics and firing patterns. Equation (6) is the input current to neurons, with x the input spikes from the previous layer, Wf the forward weight matrix, BNTT batch-normalization through time, and Wr the recurrent weight matrix. u(t) and w(t) are shown in Equation (7), where α, β, a and b are heterogeneously trainable parameters of the neuron. Finally, spikes s(t) are generated according to Equation (8).

$${{\bf{I}}}(t)=BNTT({W}_{f}[x(t)])+{W}_{r}[s(t-1)]$$
(6)
$${{\bf{u}}}(t)=\alpha \left[{{\bf{u}}}(t-1)\right]+(1-\alpha )\left[{{\bf{I}}}(t)-{{\bf{w}}}(t-1)\right]-\theta [{{\bf{s}}}(t-1)]\\ {{\bf{w}}}(t)=\beta [{{\bf{w}}}(t-1)]+a(1-\beta )[{{\bf{u}}}(t-1)]+b[{{\bf{s}}}(t-1)]$$
(7)
$${{\bf{s}}}(t)=\left\{\begin{array}{l}0\, {{\mbox{if}}}\,{{\bf{u}}}(t) \, < \, \theta \hfill \\ 1\, \,{{\mbox{if}}} \,{{\bf{u}}}(t) \, \ge \, \theta \quad \end{array}\right.$$
(8)

The last layer of the network is a readout linear classifier, and the class corresponding to the maximum of the summation of output activities over all timesteps is chosen as the network prediction. The RSNN network is trained with backpropagation through time using a boxed pseudo-gradient and cross-entropy loss.

Algorithm 1

Few-Shot Class-Incremental Learning with Prototypes

Requires: Pre-trained network g ∘ f consisting of feature extractor f and classifier g : x ↦ Wx + b

Define: (x)l, wl, and bl as, respectively, the set of input samples, classifier weights, and biases associated with a class l

1: for each base class k do

2: Compute prototype embedding ck = Mean[f((x)k)] (also summed over time for SNN baseline)

3: Compute corresponding classifier weights wk = 2ck and biases \({b}_{k}=-{c}_{k}{c}_{k}^{T}\)

4: end for

5: Replace classifier layer g with prototype weights: \(W\leftarrow {W}_{B}={({w}_{k})}_{k\in B}\) and biases \(b\leftarrow {b}_{B}={({b}_{k})}_{k\in B}\)

6: for each session i in sessions do

7: Get session support Si

8: Repeat lines 1 to 4 for all new classes of Si to get prototype weights \({W}_{{S}^{i}}\) and biases \({b}_{{S}^{i}}\)

9: Extend the classifier layer weights \(W\leftarrow [W,{W}_{{S}^{i}}]\) and \(b\leftarrow [b,{b}_{{S}^{i}}]\)

10: end for

We implement baseline solutions for the FSCIL task with both ANN and SNN models. The frozen baselines do not learn any new classes, while the prototypical baselines follow the prototypical networks approach47 to classify new classes. For both baselines, the ANN and SNN models are pre-trained on the 100 base classes B, using the abundant samples to develop a robust feature extractor f that generates hidden-layer embeddings passed to a readout classifier.

For the frozen baselines, the model parameters are frozen after pre-training for inference during all incremental sessions, thus setting a ‘worst-case’ reference with no incremental learning but also no risk of catastrophic forgetting.

For the prototypical baselines, the pre-trained models learn 100 extra classes within the 10 incremental sessions in a 5-shot learning scenario. The prototypical networks protocol is applied in each incremental session as shown in Algorithm 1. Prototypical networks provide a clustering algorithm for classification that is equivalent to a readout affine operation on feature embeddings, resulting in a linear layer of weights and biases. Each class k is represented by a prototype vector ck = Mean[f((x)k)] defined as the average feature embedding produced by f over all corresponding training samples (x)k. The readout classifier layer is defined based on this prototype such that the weights wk and biases bk associated with class k follow wk = 2ck and \({b}_{k}=-{c}_{k}{c}_{k}^{T}\), which associates embeddings with the closest prototype with respect to the squared Euclidean distance47.
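
The mapping from prototypes to classifier weights can be sketched as follows for the ANN case; the embedding tensors and layer shapes are placeholders, and incremental sessions would concatenate new rows onto the existing readout.

```python
# Illustrative prototype-based readout: each class prototype c_k defines
# weights w_k = 2*c_k and bias b_k = -c_k c_k^T of a linear classifier.
import torch
import torch.nn as nn

@torch.no_grad()
def prototype_readout(embeddings_per_class):
    # embeddings_per_class[k]: tensor (num_samples_k, embedding_dim) from f
    protos = torch.stack([e.mean(dim=0) for e in embeddings_per_class])
    readout = nn.Linear(protos.shape[1], protos.shape[0])
    readout.weight.copy_(2.0 * protos)                  # w_k = 2 c_k
    readout.bias.copy_(-(protos * protos).sum(dim=1))   # b_k = -c_k c_k^T
    return readout
```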

For the SNN baselines, as the features also have a temporal dimensionality, we accumulate embeddings over all timesteps t to define the prototype vector \({c}_{k}=\,{\mbox{Mean}}\,[{\sum }_{t}(f{({(x)}_{k})}_{t})]\). Also, as we maintain the summation over timesteps after the final prototype layer to keep the online nature of the SNN baseline, the biases will be applied at each timestep. Thus to maintain the balance between weighted inputs and biases, for the SNN baseline we also normalize the biases by the total number of timesteps T: \({b}_{k}=-{c}_{k}{c}_{k}^{T}/T\).

We fit the prototypical networks approach to the FSCIL task by first discarding the original output layer and replacing it with the prototype weights WB and biases bB of the base classes, computed based on the averaged feature embeddings over all 500 training samples per base class. This causes an initial accuracy drop, as the trained output layer weights are replaced by clustered weights for the prototypical learning approach. Then, for each incremental session, the prototype of each of the 10 new classes is defined based on the 5 corresponding support samples. The prototype weights and biases are computed in the same manner and concatenated to the existing classifier layer to accommodate the new classes.

Event camera object detection

For both the RED ANN and Hybrid ANN-SNN baselines, the event data from the event camera are converted into frame-based representations using multi-channel time surfaces. Non-overlapping 50 ms time bins (50 ms stride) are further subdivided into three sub-bins. Each sub-bin, starting at timestamp t0, generates two time surfaces TS (Equation (9)) based on each event (x, y, p, t) in the sub-bin, where x, y are event coordinates, p is positive or negative polarity, and t is the event time.

$$\begin{array}{r}TS(p,y,x)=t-{t}_{0}\, {{\mbox{for}}} \, {{\mbox{each}}} \, {{\mbox{event}}} \,(x,y,p,t) \,{{\mbox{in}}}\, {{\mbox{the}}} \, {\mbox{sub-bin.}}\,\end{array}$$
(9)
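
An illustrative construction of the two-channel time surface for one sub-bin is shown below; the event tuple format and array layout are assumptions, and letting later events at the same pixel overwrite earlier ones is one possible reading of Equation (9).

```python
# Illustrative time surface of Equation (9) for one sub-bin of events.
import numpy as np

def time_surface(events, t0, height, width):
    # events: iterable of (x, y, p, t) with polarity p in {0, 1};
    # later events at the same pixel overwrite earlier ones.
    ts = np.zeros((2, height, width), dtype=np.float32)
    for x, y, p, t in events:
        ts[p, y, x] = t - t0
    return ts
```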

The RED ANN25 is a deep convolutional neural network model using three feed-forward squeeze-and-excite48 convolution layers followed by five recurrent convolution-LSTM49 (ConvLSTM) layers. The squeeze-and-excite layers provide effective feature extraction while the ConvLSTM layers provide effective temporal learning. The single-shot detection (SSD50) head is used to predict the location and class of the bounding box based on multi-scale outputs from the recurrent layers.

The Hybrid ANN-SNN architecture adopts five LIF spiking neural layers to replace the ConvLSTM layers in RED, and shares the same feed-forward convolutional blocks as the RED. The LIF neuron layers are connected with feed-forward convolution, and have far fewer weights than the ConvLSTM layers. The Hybrid model uses the same input encoding method, object detection head, and training loss functions as the RED model. The LIF units are built using the SpikingJelly library35, and the neuron dynamics of the LIF membrane potential are given in Equations (10), (11), and (12). h(t) is the charged potential before spiking during a timestep, dependent on the activation input X(t) and membrane time constant τ, and u(t) is the final potential of the timestep, which resets to the reset value Vreset if h(t) reaches the threshold voltage Vth. The same threshold determines s(t), i.e., whether a spike is produced. In the experiments, τ is set to 2.0, Vth is 1.0, and Vreset is 0.0.

$${{\bf{h}}}(t)={{\bf{u}}}(t-1)+\frac{1}{\tau }(X(t)-{{\bf{u}}}(t-1))$$
(10)
$${{\bf{u}}}(t)=\left\{\begin{array}{ll}{{\bf{h}}}(t) &\, {{\mbox{if}}}\,{{\bf{h}}}(t) \, < \, {V}_{th}\\ {V}_{reset}\hfill &\, \,\,{{\mbox{if}}} \,{{\bf{h}}}(t) \, \ge \, {V}_{th}\end{array}\right.$$
(11)
$${{\bf{s}}}(t)=\left\{\begin{array}{ll}0\hfill &\, {{\mbox{if}}}\,{{\bf{h}}}(t) \, < \, {V}_{th}\\ 1\hfill &\, \,{{\mbox{if}}}\,{{\bf{h}}}(t) \, \ge \, {V}_{th}\end{array}\right.$$
(12)

The losses used to train the RED ANN and Hybrid baselines match previous work25, using a combination of regression and classification loss functions. Regression loss Lr (Equation (13)) for all predicted boxes B and ground-truth boxes T is given by smooth l1 loss Ls50 (Equation (14)), averaged over N predicted bounding boxes Bi and their corresponding ground-truth boxes Ti. Smooth l1 loss is a piecewise loss function with threshold β, which is set to 0.11. For the classification loss Lc (Equation (15)), softmax focal loss89 is used, with correct-class probability pl for all default boxes in the regression head and constant γ, which is set to 2.

$$\begin{array}{r}{L}_{r}(B,T)=\frac{1}{N} {\sum}_{i}{L}_{s}\left({B}_{i},{T}_{i}\right)\end{array}$$
(13)
$${L}_{s}\left({B}_{i},{T}_{i}\right)=\left\{\begin{array}{ll}\left\vert {B}_{i}-{T}_{i}\right\vert -\frac{\beta }{2}\quad &\, {{\mbox{if}}} \,\left\vert {B}_{i}-{T}_{i}\right\vert \ge \beta \\ \frac{1}{2\beta }{\left({B}_{i}-{T}_{i}\right)}^{2}\quad &\,{\mbox{otherwise}}\,\hfill \end{array}\right.$$
(14)
$$\begin{array}{r}{L}_{c}\left({p}_{l}\right)=-{\left(1-{p}_{l}\right)}^{\gamma }\log \left({p}_{l}\right)\end{array}$$
(15)

Non-human primate motor prediction

All baseline models have linear feed-forward layer architectures, where ANN, ANN_Flat, and SNN_Flat have topologies Nch − 32 − 48 − 2, and SNN uses Nch − 50 − 2. The varying topologies between SNN and SNN_Flat attempt to optimize for complexity in the former and correctness in the latter.

The LIF neurons used in the SNN networks are developed using snnTorch18, and have potential dynamics shown in Equations (16) and (17). Note that unlike the SpikingJelly neurons (Equations (10), (11), and (12)), the potential u(t) is reset in the timestep following a spike, rather than during the same timestep. As before, Vreset is 0.0 and Vth is 1.0, while β is 0.96 for the SNN baseline and 0.50 for the SNN_Flat baseline. The potential of the readout neurons in both baselines is directly read to produce velocity predictions, thus there is no spiking or reset mechanism and the neurons function as leaky accumulators.

$${{\bf{u}}}(t)=\left\{\begin{array}{ll}\beta {{\bf{u}}}(t-1)+X(t)\quad &\, {{\mbox{if}}} \,{{\bf{s}}}(t-1)=0\\ {V}_{reset}\quad \hfill &\, {{\mbox{if}}} \, {{\bf{s}}}(t-1)=1 \hfill\end{array}\right.$$
(16)
$${{\bf{s}}}(t)=\left\{\begin{array}{ll}0\hfill &\, \,\,{{\mbox{if}}} \,{{\bf{u}}}(t) \, \le \, {V}_{th}\\ 1\hfill &\, {{\mbox{if}}}\,{{\bf{u}}}(t) \, > \, {V}_{th}\end{array}\right.$$
(17)
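
A direct rendering of Equations (16) and (17) as a single update step is sketched below; it is an illustrative reimplementation rather than the snnTorch code, but makes explicit that the reset takes effect in the timestep following a spike.

```python
# Illustrative LIF update following Equations (16) and (17): the membrane
# potential resets to V_reset in the timestep *after* a spike was emitted.
import torch

def lif_step(x_t, u_prev, s_prev, beta=0.96, v_th=1.0, v_reset=0.0):
    u_t = torch.where(s_prev.bool(),
                      torch.full_like(u_prev, v_reset),   # reset after spike
                      beta * u_prev + x_t)                # leaky integration
    s_t = (u_t > v_th).float()                            # spike above threshold
    return u_t, s_t
```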

ANN, ANN_Flat, and SNN_Flat are trained using mean-squared error (MSE) loss over 50 epochs. The SNN baseline calculates its loss over a sliding window of 50 consecutive data points, representing 200 ms of data (50-point window, single-point stride), which provides more information for backpropagation and helps avoid dead neurons and vanishing gradients. The MSE loss was linearly weighted from 0 to 1 for the 50 points within the window. The SNN was trained with 10-fold cross-validation, using an early-stopping regime with a patience (epochs for which there is no improvement on the validation set) of 10 epochs.

Chaotic function prediction

The LSTM baseline uses one LSTM layer followed by a ReLU activation and linear readout layer. As input, the LSTM uses an explicit memory buffer of the last M = 50 points. During training, the input x(t) to the LSTM uses the Mackey-Glass data f(t) (Equation (18)), whereas, during autoregressive evaluation, the input uses prior predictions y(t) (Equation (19)). Values f(t < 0) and y(t < 0) are zero.

$${{\bf{x}}}(t)=({{\bf{f}}}(t-M),{{\bf{f}}}(t-M+1),\ldots,{{\bf{f}}}(t))$$
(18)
$${{\bf{x}}}(t)=({{\bf{y}}}(t-M-1),{{\bf{y}}}(t-M),\ldots,{{\bf{y}}}(t-1))$$
(19)

The LSTM is trained using MSE loss for backpropagation over 200 epochs. The hyperparameter sweep used the evaluation setup of 30 instantiations of τ = 17 Mackey-Glass data, with each instance shifted forward by half of the Lyapunov time. The hyperparameter sets yielding the lowest sMAPE scores were used to report the results.

For the ESN, the standard architecture with one hidden layer (i.e., reservoir) with recurrent connections was used, where the states of the reservoir \({{\bf{r}}}(t)\in {{\mathbb{R}}}^{D}\) at timestep t evolve according to the dynamics shown in Equation (20). The random matrix \({{{\bf{W}}}}^{{{\rm{in}}}}\in {{\mathbb{R}}}^{D\times (d+1)}\) with components drawn from the uniform distribution projects the d-dimensional input f(t) (d = 1 for the Mackey-Glass system), augmented with a constant bias, into the D neurons of the reservoir. The recurrent connectivity is defined by a second (potentially sparse) random matrix \({{\bf{W}}}\in {{\mathbb{R}}}^{D\times D}\) with nonzero components drawn from the normal distribution; α, γ, and β are hyperparameters controlling the behavior of the ESN.

$${{\bf{r}}}(t)=(1-\alpha ){{\bf{r}}}(t-1)+\alpha \tanh \left(\gamma {{\bf{W}}}{{\bf{r}}}(t-1)+\beta {{{\bf{W}}}}^{{{\rm{in}}}}[1;{{\bf{f}}}(t)]\right)$$
(20)

To make a prediction y(t), the ESN uses the readout matrix \({{{\bf{W}}}}^{{{\rm{out}}}}\in {{\mathbb{R}}}^{d\times (D+d+1)}\) that computes the activation of the output layer based on the current states of the input and hidden layers: y(t) = Wout[f(t); r(t)]. To predict the values of the system at the next timestep, i.e. y(t) predicts f(t + 1), the output layer has d neurons.

The training of Wout is formulated as a linear regression problem so that it can be computed with the regularized least squares estimator (Equation (21)), where \({{\bf{H}}}\in {{\mathbb{R}}}^{M\times (D+d+1)}\) is an activation matrix that stores the readout for M timesteps in the training data, \({{\bf{Y}}}\in {{\mathbb{R}}}^{M\times d}\) is another matrix that stores the corresponding ground-truth values for the same timesteps, and λ is the regularization parameter of the estimator.

$${{{\bf{W}}}}^{{{\rm{out}}}}={{{\bf{Y}}}}^{\top }{{\bf{H}}}{\left({{{\bf{H}}}}^{\top }{{\bf{H}}}+\lambda {{\bf{I}}}\right)}^{-1}$$
(21)

As for the LSTM, optimal hyperparameters are chosen based on the lowest average sMAPE score over the 30 time series. For each series, the ESN weight matrices Win and W were randomly initialized. The hyperparameter sets yielding the lowest sMAPE scores were used to report the results.
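
The readout training of Equation (21) amounts to one regularized least-squares solve; the sketch below is illustrative and avoids forming the matrix inverse explicitly.

```python
# Illustrative ridge-regression readout of Equation (21): H stacks the
# concatenated input and reservoir states (plus bias) over M timesteps,
# Y the corresponding ground-truth targets.
import numpy as np

def train_readout(H, Y, lam=1e-6):
    gram = H.T @ H + lam * np.eye(H.shape[1])
    # Solve (H^T H + lam*I) W^T = H^T Y rather than inverting explicitly.
    return np.linalg.solve(gram, H.T @ Y).T   # shape: (d, D + d + 1)
```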

System track metrics

Given the variability of task application areas and system sizes, the NeuroBench metric methodology is defined on a per-benchmark basis. Particularly for efficiency metrics, because neuromorphic systems are not yet mature, a single power measurement method cannot be applied to all submissions. It is the responsibility of the submitter to faithfully capture all active processing components of their system and to clearly outline their methodology in the report. As NeuroBench moves towards head-to-head benchmarking of neuromorphic systems by hardware vendors and owners, official results must be associated with a report that provides context for the benchmark submission results, as the overall benchmark format is generally open and does not have stringent consistency rules. The report must contain

  • an outline of the system architecture,

  • an outline of the algorithm used (model architecture, tuning),

  • a diagram depicting the workflow, including (where applicable) data initialization, host pre-processing, data loading, on-device pre-processing, inference, and post-processing,

  • timing measurement description,

  • power measurement description, including measurement devices, included hardware components, and measurement time resolution,

  • a re-iteration of the results.

In addition, official submissions will be subject to potential audits during which an auditor will inspect the methodology and request additions or revisions to the results and report if necessary.

System track benchmarks

Detailed information on each v1.0 system track benchmark is provided here, and the most updated information can be found in the official NeuroBench system track documentation (https://github.com/NeuroBench/system_benchmarks).

Acoustic Scene Classification (ASC)

Benchmark dataset

The dataset is based on the DCASE 2020 acoustic scene classification challenge74, using the TAU Urban Acoustic Scenes 2020 Mobile datasets. Of the 10 available scene classes, 4 are used: "airport", "street_traffic", "bus", "park". Of the 9 real/simulated audio recording devices available, 1 real device is used: "a". Audio samples are sliced into 1-second samples. The audio may be resampled to a different frequency as a pre-processing step, which is not included in inference measurement. 41360 training samples and 16240 test samples are available. The NeuroBench system track repository linked above provides a download script and a PyTorch-compatible dataset file which is expected to be used as the front-end data generator for all submissions. The data may be reformatted into a different framework; this is not included in inference measurement.

Task

After training a model using the training set, the submitter will report test set accuracy, as well as execution time and energy per inference on the system. The test split should not be used to train or tune the model; only the train split may be used. Audio samples will be processed with batch size 1: one sample is processed at a time, and the next sample is not processed until the previous one has finished. The general compute flow consists of three steps: pre-processing, inference, and classification. One inference is defined as the processing of one second of audio data. Inference is separate from classification because systems are often intended to process samples as sequences over time, rather than all at once, and the classification may be available before the whole data sequence is seen. Classification thus does not need to be included in the benchmark measurement. Pre-processing can generally be defined as feature extraction, or the conversion of raw audio data into a format that is suitable for inference. Execution time and energy of pre-processing must be included in benchmark measurement, and will be reported separately from inference. In certain systems, e.g., Synsense Xylo, pre-processing blocks use on-chip analog hardware to directly convert real-time analog microphone output into digital spikes for inference. In order to run inference from a digitally-encoded dataset, this pre-processing must be simulated and the spikes sent through a side-channel directly to the inference block. For such cases, the reported pre-processing measurements will be for the analog hardware running in real time on the dataset audio played over speakers into the device microphone.

Metrics

The following metrics should be reported:

  • Accuracy – Accuracy of the predictions on the test set, measured from the system and not in any software simulation.

  • Execution time – Execution time is the average time per sample for pre-processing and inference. The final result should be averaged over all samples in the test set. The time begins when data has been loaded into on-board or on-chip memory and is prepared to be processed. The time ends when the last timestep of the sequence has completed processing.

  • Power/Energy – Idle power and active power should be reported, where idle power is the power of the chip once it has been configured and is prepared to begin processing. Note that this should not measure the device in a lower-power sleeping state. Active power is the average power consumed during processing. The difference between active power and idle power should be reported as dynamic power. Power should be converted to energy by multiplying it by the average execution time. The power measurement should include all computational components and power domains which are active during the workload processing. If applicable, power may be measured over longer data sequences (e.g., 5 seconds rather than 1 second), such that configuration costs are amortized over a longer period of processing. This should be done separately from the accuracy and execution time experiments, and should be clearly detailed in the associated submission report.

Quadratic Unconstrained Binary Optimization (QUBO)

Quadratic Unconstrained Binary Optimization (QUBO) refers to the problem of finding the binary variable assignment \({x}_{i}\in \{0,1\}\) that optimizes the quadratic cost function

$${\min }_{{{\bf{x}}}\in {\{0,1\}}^{n}}c({{\bf{x}}})={\min }_{{{\bf{x}}}\in {\{0,1\}}^{n}}{{{\bf{x}}}}^{T}{{\bf{Q}}}{{\bf{x}}}$$
(22)

subject to no constraints.

The solvers for QUBO will be benchmarked using Maximum Independent Set (MIS) workloads. Given an undirected graph \({{\mathcal{G}}}=({{\mathcal{V}}},{{\mathcal{E}}})\), an independent set \({{\mathcal{I}}}\) is a subset of \({{\mathcal{V}}}\) such that, for any two vertices \(u,v\in {{\mathcal{I}}}\), there is no edge connecting them, i.e., \(\nexists \,e\in {{\mathcal{E}}}\,s.t.\,e=(u,v)\,\vee \,e=(v,u)\).

The MIS problem has a natural QUBO formulation: for each node \(u\in {{\mathcal{V}}}\) in the graph, a binary variable xu is introduced to model whether u is included in the candidate solution. Summing the quadratic terms \({x}_{u}^{2}\) thus gives the size of the set of selected nodes. To penalize the selection of nodes that are not mutually independent, a penalty term is associated with the interactions xuxv when u and v are connected. The resulting Q matrix coefficients are defined as

$${q}_{uv}=\left\{\begin{array}{ll}-1\quad &\, {{\mbox{if}}} \,u=v \hfill \\ 4\quad &\, {{\mbox{if}}} u \, \ne \, v \,{{\mbox{and}}} \, (u,v)\in {{\mathcal{E}}}\\ 0\quad &\,{\mbox{otherwise.}}\,\hfill \end{array}\right.$$
(23)

The MIS problem is NP-hard and intractable, and therefore any solver system approximates a solution. Therefore, the cost of the QUBO formulation (xTQx) is used to assess solutions, and solutions are not restricted to being maximum nor independent sets. The QUBO formulation ensures that any non-independent set will always have a higher cost than a corresponding independent set with conflicting nodes removed.
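
As an illustration of this construction, the sketch below builds the Q matrix of Equation (23) for a random graph and evaluates the cost of a candidate assignment; the use of networkx and the workload parameters shown are assumptions for the example, not the official generator.

```python
# Illustrative MIS-to-QUBO construction following Equation (23).
import networkx as nx
import numpy as np

def mis_qubo(graph: nx.Graph) -> np.ndarray:
    n = graph.number_of_nodes()
    Q = -np.eye(n)                       # q_uu = -1 rewards selecting a node
    for u, v in graph.edges():
        Q[u, v] = Q[v, u] = 4.0          # q_uv = 4 penalizes connected pairs
    return Q

# Example in the spirit of the benchmark: 100 nodes, 5% density, seed 0.
g = nx.gnp_random_graph(100, 0.05, seed=0)
Q = mis_qubo(g)
x = np.random.default_rng(0).integers(0, 2, size=100)   # candidate assignment
cost = x @ Q @ x                                         # QUBO cost x^T Q x
```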

Benchmark dataset

The benchmark’s workload complexity is defined such that it can automatically grow over time, as neuromorphic systems mature and are able to support larger problems.

  • Number of nodes, spaced in a pseudo-geometric progression: 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, ...

  • Density of edges: 1%, 5%, 10%, 25%.

  • Problem seeds: 0, 1, 2, 3, 4 are allowed for tuning. At evaluation time, for official results NeuroBench will announce five seeds for submission. Unofficial results may use seeds which are randomly generated at runtime.

Each workload will be associated with a target optimality, which is the minimum cost found using a conventional solver algorithm. Small QUBO workloads with fewer than 50 nodes will be solved to global optimality, corresponding to the true maximum independent set. Larger workloads cannot be reasonably globally solved. The DWave Tabu CPU sampler80 will be used with 100 reads and 50 restarts, and the QUBO solution with the best cost (best-known solution, BKS) found will set the target optimality for the tuning workload seeds. For evaluation workload seeds, the same method will be used to set the target optimality. NeuroBench will provide target optimalities for workloads up to 5000 nodes. Submissions are encouraged to continue scaling up the workload size along the pattern to demonstrate the capacity of their systems. The first group that tackles workloads of an unprecedented size should provide the benchmark solutions via a pull request to the system track repository. The dataset workload generator and scripts for the DWave Tabu sampler to compute optimal costs are available in the repository, and this code is expected to be used as the front-end data generator for all submissions.

Task and metrics

Based on the BKS found for each workload, the BKS-Gap optimality score of the solution found by the SUT is defined as

$$\,{\mbox{BKS-Gap}}\,=\left(\frac{c-{c}_{{{\rm{target}}}}}{{c}_{{{\rm{target}}}}}\right)\,$$
(24)

where ctarget is the QUBO cost of the BKS, and c is the cost found by the SUT. This may be reported as a percentage gap by multiplying by 100. If the SUT manages to beat the BKS, then the BKS-Gap will be negative.

Given each workload, the benchmark should report the BKS-Gap of the solution found by the SUT after a fixed runtime. The time begins after the graph has been loaded into the SUT. Timeouts are spread across orders of magnitude (\({10}^{-3}\), \({10}^{-2}\), \({10}^{-1}\), \({10}^{0}\), \({10}^{1}\), and \({10}^{2}\) seconds). As the runtime is fixed, no measured timing metric is reported; submissions should report average power over the duration of the runtime, which is directly proportional to the energy consumed. Importantly, the QUBO solver needs a module to measure the cost of its solutions; this module should be considered part of the SUT, and its computational demand and power must be included in the benchmarking results.

In the future, a different optimization task scenario may be used for the same QUBO dataset, in which the SUT must run until it reaches a small BKS-Gap rather than stopping after a fixed timeout. Here, the benchmark should report the latency and energy required to reach BKS-Gap thresholds, e.g., 0.1, 0.05, and 0.01. This task scenario normalizes systems to solution quality, rather than to SUT runtime.

System track baselines

Acoustic Scene Classification (ASC)

CPU

The CPU baseline uses an Arduino Nano 33 BLE board, which runs pre-processing and inference on an ARM Cortex-M4F microcontroller running at 64 MHz. All training and deployment uses Edge Impulse, a commercially developed ML-operations platform for tinyML90. The 1-second digital audio samples were converted into two-dimensional frames of Mel-filterbank energies (MFE), and inference uses a network with two convolution layers with batch normalization and max pooling. The trained model was quantized and then compiled to an Arduino library using the Edge Impulse EON compiler, which optimizes for memory and flash usage. Execution time is measured using on-chip timers on the Arduino, while power is measured as total system power using an external multimeter. Power was measured separately for idle, active pre-processing, and active inference by taking the average power over 60 seconds, with the Arduino repeatedly computing over one sample loaded in memory and a Keysight 34465A digital multimeter recording at a frequency of 3 Hz.

Synsense Xylo

As shown in Fig. 9, the Xylo baseline used a host PC to run a simulation of the analog pre-processing unit on the benchmark dataset, which was routed into the SNN core on the Xylo SUT. Accuracy is measured by selecting the class with the maximum output spike count of the SNN. The SNN on Xylo is a feedforward network with three hidden layers, where each layer is connected by synapses with varying time delays. Training is done using the open-source Rockpool toolchain91. In order to amortize measurement overheads, power and execution time are measured over continuous streams of 10 seconds of audio, where the execution time result is divided by 10 and the power result is averaged over the duration. The measurements are made by the on-board FPGA, which operates at 12.5 MHz and samples power at 1280 Hz. Accuracy is still measured on the 1-second samples. Full details are available in Ke et al.78.

Fig. 9: An overview of the Xylo benchmarking system.
figure 9

Left: Diagram of host and SUT communication. A simulator of the analog encoding module is run on a PC and streamed to the SNN inference core via USB. After inference, outputs are routed back to the PC for classification. An on-board FPGA configures and records power of the Xylo components. Right: The Xylo™ Audio 2 hardware development kit (HDK), used as the SUT. The red outline marks the Xylo inference module. Figures taken, with permission, from Ke et al.78.

Quadratic Unconstrained Binary Optimization (QUBO)

The processors ran repeatedly with five different initial variable assignments and seeds, for timeouts between \({10}^{-3}\) s and \({10}^{3}\) s, and for workload sizes of up to 1000 variables. The neuromorphic algorithm is a parallelized version of simulated annealing running on a Kapoho Point board with one Loihi 2 chip, as shown in Fig. 10. The Loihi 2 board was controlled using Lava 0.8.0 and Lava Optimization 0.3.0. For comparison with conventional hardware, two solvers were adopted based on simulated annealing and tabu search, as implemented in the D-Wave Samplers v1.1.0 library80. The library was compiled on Ubuntu 20.04.6 LTS with GCC 9.4.0 and Python 3.8.10. CPU measurements were obtained on a machine with an Intel Core i9-7920X CPU @ 2.90 GHz and 128 GB of DDR4 RAM, using Intel SoC Watch for Linux OS 2023.2.0. All details on the solvers, benchmarking routine, and results are provided by Pierro et al.57.

Fig. 10: A Kapoho Point board with Loihi 2 chips is connected to a host PC via ethernet.
figure 10

Shown here is a Kapoho Point board with 8 Loihi 2 chips, while the experiments were run on a board with 1 Loihi 2 chip. The QUBO workload is loaded onto the chip, and all processing happens on Loihi 2. On Loihi 2, the neuro-cores, in yellow, are arranged in two 8 × 8 grids and iteratively update the variables of the QUBO workload. Embedded CPUs, shown in blue, monitor the cost of the variable assignment. Six parallel IO ports enable 3D stacking of multiple chips for larger workloads than used here. An additional IO port provides the communication to the host PC. The energy and runtime measurements cover the computation of the Loihi 2 chip. Kapoho Point image courtesy of Intel.