Methodology

How Chiark measures agent quality.

How Chiark Is Different

Agent registries list agents. Chiark tests them.

| What | Registries | Chiark |
| --- | --- | --- |
| Discovery | List agents from one source | Crawl 9 registries, deduplicate |
| Health | Trust the Agent Card | Probe every 30 min, 3 tiers |
| Scoring | None or popularity-based | 0-100 operational score, transparent |
| Routing | Not supported | Constraint filters (uptime, latency, score) |
| Protocols | One protocol per registry | A2A + MCP in one index |

Operational Score

Every agent receives an Operational Score computed from three tiers, each measuring a different aspect of reliability. Maximum score: 100 points.

Scoring Weights
| Tier | Component | Max Points | Weight |
| --- | --- | --- | --- |
| Tier 1 | Availability | 30 | 30% |
| Tier 2 | Conformance | 30 | 30% |
| Tier 3 | Performance | 40 | 40% |
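
The score is the sum of the three tier scores. A minimal sketch, assuming each tier's points have already been computed and clamped to its documented range:

```python
def operational_score(availability_pts: float,
                      conformance_pts: float,
                      performance_pts: float) -> float:
    """Combine the three tier scores into a 0-100 Operational Score.

    Each argument is assumed pre-clamped to its tier's range:
    availability 0-30, conformance 0-30, performance 0-40.
    """
    assert 0 <= availability_pts <= 30
    assert 0 <= conformance_pts <= 30
    assert 0 <= performance_pts <= 40
    return availability_pts + conformance_pts + performance_pts
```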

Tier Definitions

Tier 1: Availability (0-30 pts)

Based on 30-day uptime percentage. An agent that responds to health probes consistently scores higher. Measured via periodic HTTP probes every 30 minutes.
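
The exact uptime-to-points curve isn't documented here; a simple linear mapping is one plausible sketch:

```python
def availability_points(uptime_30d: float) -> float:
    """Map 30-day uptime (a fraction, 0.0-1.0) to Tier 1 points (0-30).

    Assumes a linear mapping; Chiark's actual curve may differ.
    """
    uptime_30d = max(0.0, min(1.0, uptime_30d))  # clamp to [0, 1]
    return round(30 * uptime_30d, 1)
```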

Tier 2: Conformance (0-30 pts)

Does the agent's runtime behavior match its declared Agent Card? We validate the card schema, check declared skills and capabilities, and verify response formats. Full conformance = full points.

Tier 3: Performance (0-40 pts)

Response time benchmarking. Scored on P95 latency — lower is better. Agents that respond quickly and consistently under load earn more points. Only available for agents that allow unauthenticated task execution.
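
P95 latency is the value at or below which 95% of probe samples fall. A self-contained calculation using the nearest-rank method:

```python
import math

def p95_latency(samples_ms: list[float]) -> float:
    """Return the P95 latency from a list of samples in milliseconds.

    Nearest-rank method: sort the samples and take the value at
    the ceil(0.95 * n)-th position (1-based).
    """
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```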

Auth-Gated Agents

Agents that require authentication for task execution can only be scored on Tier 1 (Availability) and partial Tier 2 (Conformance — card validation only). Performance testing is not possible without auth credentials.

Maximum possible score: 45/100

Auth-gated agents are marked with a lock icon on the leaderboard and show their score as X/45 instead of X/100.
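
The 45-point ceiling implies card-validation-only Tier 2 is worth 15 of its 30 points (30 Tier 1 + 15 partial Tier 2); that split is inferred from the stated maximum, not separately documented. A sketch of the cap and leaderboard display:

```python
def max_possible_score(auth_required: bool) -> int:
    """Score ceiling: auth-gated agents can earn full Tier 1 (30)
    plus card-validation-only Tier 2 (assumed 15 of 30); Tier 3
    requires unauthenticated task execution."""
    return 45 if auth_required else 100

def score_display(score: int, auth_required: bool) -> str:
    """Leaderboard-style rendering, e.g. '38/45' for a gated agent."""
    return f"{score}/{max_possible_score(auth_required)}"
```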

Probe Frequency

Each tracked agent is probed every 30 minutes. Probes check availability (HTTP health), card conformance (schema validation), and performance (latency measurement). Results are aggregated over a 30-day rolling window.

Data Sources

Agents are discovered from 9 registries, crawled every 24 hours. Agents appearing in multiple registries are deduplicated by endpoint URL.
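
Deduplication by endpoint URL can be sketched as keying each agent on a normalized URL; the exact normalization Chiark applies (case folding, default ports, trailing slashes) is assumed here:

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_endpoint(url: str) -> str:
    """Normalize an endpoint URL so the same agent listed in several
    registries collapses to one key: lowercase scheme and host, drop
    default ports and trailing slashes. (Assumed normalization.)"""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    for scheme, port in (("http", ":80"), ("https", ":443")):
        if parts.scheme == scheme and host.endswith(port):
            host = host[: -len(port)]
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

def deduplicate(agents: list[dict]) -> list[dict]:
    """Keep the first record seen per canonical endpoint URL."""
    seen: dict[str, dict] = {}
    for agent in agents:
        seen.setdefault(canonical_endpoint(agent["endpoint"]), agent)
    return list(seen.values())
```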

| Registry | Protocol | Method |
| --- | --- | --- |
| a2aregistry.org | A2A | Paginated REST API |
| MCP Registry | MCP | Cursor-based pagination |
| Smithery | MCP | Paginated REST API |
| Solana Agent Registry | A2A / MCP | ERC-8004 GraphQL |
| awesome-a2a | A2A | GitHub README URL extraction |
| GitHub Topics | A2A | topic:a2a-protocol search |
| Well-Known Endpoints | A2A | /.well-known/agent.json probing |
| PulseMCP | MCP | Directory API (when API key configured) |
| Alternative Registries | A2A | Secondary sources |

Cross-Protocol Support

Chiark indexes both A2A (Agent-to-Agent) and MCP (Model Context Protocol) agents using the same three-tier scoring pipeline.

| Protocol | Checks |
| --- | --- |
| A2A | Agent Card validation, JSON-RPC conformance probing, skill-specific task benchmarks |
| MCP | Initialize handshake validation, tools/list probing, ping + tool invocation benchmarks |
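
An MCP handshake probe starts with a JSON-RPC `initialize` request. A sketch of the payload — field names follow the MCP specification, but the protocol version string and client name here are illustrative:

```python
import json

def mcp_initialize_request(request_id: int = 1) -> str:
    """Build the JSON-RPC `initialize` payload used to probe an MCP
    server's handshake. The protocolVersion value is illustrative."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            "clientInfo": {"name": "chiark-probe", "version": "0.1"},
        },
    })
```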

x402 Payment Detection

During Tier 1 probing, agents that return HTTP 402 responses are analyzed for x402 payment metadata. Payment headers and body are parsed to extract pricing, network, token, and receiver information. Payment-enabled agents are flagged on the leaderboard and filterable via the API.

Constraint-Based Routing

The API supports quality-constrained queries for agent routing decisions. Agents can be filtered by:

  • min_score — minimum operational score (0-100)
  • min_uptime — minimum 30-day uptime (e.g. 0.99 = 99%)
  • max_latency_ms — maximum P95 response time
  • auth_required — filter by authentication requirement
  • payment_enabled — filter by x402 payment support

Real-time agent status is available at /api/v1/agents/{id}/status; structured capabilities at /api/v1/agents/{id}/capabilities.
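
A constraint-filtered query can be built from the documented filter names; the listing path /api/v1/agents used here is an assumption:

```python
from urllib.parse import urlencode

def route_query(base: str = "https://chiark.ai", **constraints) -> str:
    """Build a quality-constrained agent query URL.

    Filter names match the documented parameters; the /api/v1/agents
    listing path is assumed for illustration.
    """
    allowed = {"min_score", "min_uptime", "max_latency_ms",
               "auth_required", "payment_enabled"}
    unknown = set(constraints) - allowed
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    return f"{base}/api/v1/agents?{urlencode(sorted(constraints.items()))}"
```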

MCP Server

Chiark is available as an MCP server for agent discovery from Claude, Cursor, or any MCP client. Use it to find reliable agents, check real-time status, and report routing outcomes.

Hosted endpoint: https://chiark.ai/mcp/ (no install needed)
Install locally: pip install chiark-mcp (PyPI / GitHub)

Note: The Operational Score measures reliability, not task quality. A high score means the agent is reachable, conforms to its spec, and responds quickly — it does not mean the agent produces good results.