A collection of tools for benchmarking, evaluating, and analyzing agent performance metrics.
- `/benchmarks`: Tools for running benchmarks on AI models and agents
- `/canonical_metrics`: Standard formats and tools for metrics collection, including:
  - Metrics CLI for processing and transforming metrics
  - Metrics Service API for querying and visualizing metrics data
- `/historic_performance`: Dashboard for tracking performance over time
- `/evaluation`: Dashboard and tools for evaluating agent or model performance
Collect logs from NEAR AI Hub, then transform, tune, and aggregate them, and create CSV tables.
# Installation

```shell
cd canonical_metrics
python3.11 -m venv .venv
source .venv/bin/activate
pip install poetry
poetry install
```
# Transform and aggregate metrics

```shell
metrics-cli tune /path/to/logs /path/to/tuned_logs --rename --ms-to-s
metrics-cli aggregate /path/to/tuned_logs /path/to/aggr_logs --filters "runner:not_in:local" --slices "agent_name"
metrics-cli aggregation-table /path/to/tuned_logs /path/to/table --filters "runner:not_in:local" --absent-metrics-strategy=nullify
```
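The `--filters` arguments above use a `field:operator:value` syntax (e.g. `runner:not_in:local`). A minimal Python sketch of how such an expression decomposes — this is an illustrative reading of the syntax shown in the examples, not the metrics-cli implementation:

```python
# Illustrative sketch of the "field:operator:value" filter syntax seen in
# --filters arguments (e.g. "runner:not_in:local"). Not the actual
# metrics-cli parser, just a reading of the format shown above.

def parse_filter(expr: str) -> tuple[str, str, str]:
    """Split a filter expression into (field, operator, value)."""
    field, operator, value = expr.split(":", 2)
    return field, operator, value

print(parse_filter("runner:not_in:local"))  # ('runner', 'not_in', 'local')
```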
Run the metrics service to query and analyze metrics data:

```shell
# Start the metrics service
metrics-service --metrics-path /path/to/tuned_logs

# Query metrics via API
curl -X POST "http://localhost:8000/api/v1/table/aggregation" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": ["runner:not_in:local"],
    "column_selections": ["/metrics/performance/"]
  }'
```
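The same aggregation query can be issued from Python. This sketch mirrors the curl call above (same endpoint and payload); the shape of the response is not documented here, so it is simply printed as raw JSON:

```python
# Sketch: the same aggregation query as the curl example, from Python.
# Endpoint and payload mirror the curl call above; the response schema is
# not documented here, so we just dump the JSON we get back.
import json
import urllib.request

payload = {
    "filters": ["runner:not_in:local"],
    "column_selections": ["/metrics/performance/"],
}

req = urllib.request.Request(
    "http://localhost:8000/api/v1/table/aggregation",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Requires a running metrics-service on localhost:8000:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```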
Run a web application for querying and visualizing analytics metrics data, with tools to browse logs and to track and visualize agent performance over time.

```shell
cd historic_performance
npm install
npm start
```

This opens a dashboard at http://localhost:3000.
The dashboard can also be used as a web component in other applications:
```shell
npm install @nearai/analytics-dashboard
```

```tsx
import { Dashboard } from '@nearai/analytics-dashboard';

// Use with configuration
<Dashboard config={{
  views: ['table'],                        // Show only table view
  globalFilters: ['runner:not_in:local'],  // Applied to all requests
  metricSelection: 'PERFORMANCE',          // Metric selection
  viewConfigs: {
    table: {
      showParameters: ['prune_mode'],      // Show only specific parameters
      refreshRate: 30                      // Refresh every 30 seconds
    }
  }
}} />
```
Execute popular and user-owned benchmarks to generate performance metrics. Run audit evaluations on agents.
Visualize, analyze, and compare agent and model performance using the collected metrics.
- Canonical Metrics Format: Standardized format for consistent metrics across all agents
- Flexible Aggregation: Group and aggregate metrics by various dimensions
- Powerful Filtering: Filter metrics by runner, model, time ranges, and custom criteria
- RESTful API: Easy integration with dashboards and other tools
- Performance Tracking: Monitor latency, API usage, error rates, and custom metrics
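A conceptual sketch of how filtering, slicing, and aggregation fit together: filter out records, group the rest by a slice dimension (here `agent_name`, as in the `--slices` example), and reduce a metric per group. The records and field names below are hypothetical, not the canonical metrics schema:

```python
# Conceptual sketch of filter -> slice -> aggregate. Records and field
# names are hypothetical illustrations, not the canonical metrics format.
from collections import defaultdict

records = [
    {"agent_name": "alpha", "runner": "cloud", "latency_s": 1.2},
    {"agent_name": "alpha", "runner": "cloud", "latency_s": 0.8},
    {"agent_name": "beta",  "runner": "local", "latency_s": 2.0},
]

# Filter (runner:not_in:local), then slice by agent_name and average latency.
kept = [r for r in records if r["runner"] not in {"local"}]
by_agent: dict[str, list[float]] = defaultdict(list)
for r in kept:
    by_agent[r["agent_name"]].append(r["latency_s"])

averages = {agent: sum(vals) / len(vals) for agent, vals in by_agent.items()}
print(averages)  # {'alpha': 1.0}
```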
We welcome contributions! See individual component READMEs for specific development guidelines.