Inter-Agent Trust Models: A Comparative Study of Brief, Claim, Proof, Stake, Reputation and Constraint in Agentic Web Protocol Design—A2A, AP2, ERC-8004, and Beyond

Botao ‘Amber’ Hu (University of Oxford, botao.hu@cs.ox.ac.uk) and Helena Rong (New York University Shanghai, hr2703@nyu.edu)

Abstract

As the “agentic web” takes shape—billions of AI agents (often LLM-powered) autonomously transacting and collaborating—trust shifts from human oversight to protocol design. In 2025, several inter-agent protocols crystallized this shift, including Google’s Agent-to-Agent (A2A), the Agent Payments Protocol (AP2), and Ethereum’s ERC-8004 “Trustless Agents,” yet their underlying trust assumptions remain under-examined. This paper presents a comparative study of trust models in inter-agent protocol design: Brief (self- or third-party verifiable claims), Claim (self-proclaimed capabilities and identity, e.g., AgentCard), Proof (cryptographic verification, including zero-knowledge proofs and trusted execution environment attestations), Stake (bonded collateral with slashing and insurance), Reputation (crowd feedback and graph-based trust signals), and Constraint (sandboxing and capability bounding). For each, we analyze assumptions, attack surfaces, and design trade-offs, with particular emphasis on LLM-specific fragilities—prompt injection, sycophancy/nudge-susceptibility, hallucination, deception, and misalignment—that render purely reputational or claim-only approaches brittle. Our findings indicate that no single mechanism suffices. We argue for trustless-by-default architectures anchored in Proof and Stake to gate high-impact actions, augmented by Brief for identity and discovery and Reputation overlays for flexibility and social signals. We comparatively evaluate A2A, AP2, ERC-8004, and related historical variants from academic research under metrics spanning security, privacy, latency/cost, and social robustness (Sybil/collusion/whitewashing resistance). We conclude with hybrid trust model recommendations that mitigate reputation gaming and misinformed LLM behavior, and we distill actionable design guidelines for safer, interoperable, and scalable agent economies.

Introduction

The emergence of an “agentic web” of AI—potentially billions of autonomous agents interacting online—poses a fundamental challenge: how can these agents reliably trust one another without direct human supervision (Yang et al. 2025)? Traditional internet trust mechanisms (e.g., DNS names, TLS certificates) assume relatively static, human-operated services and an ownership-based trust model, which cannot meet the millisecond-by-millisecond dynamic coordination and verification needs of large-scale AI agentic ecosystems (Raskar et al. 2025). In this new paradigm, trust is increasingly determined by the protocols that govern agent interactions, rather than by human judgment or centralized authorities. Ensuring robust trust among AI agents is critical because these agents will be entrusted with sensitive tasks—financial transactions, personal data handling, critical infrastructure control—where errors or abuse could have serious consequences (Yang et al. 2025).
Recent studies of AI safety underscore this urgency: even state-of-the-art large language model (LLM)-based agent frameworks exhibit fragilities and untrustworthiness such as prompt injection (Liu et al. 2024), sycophancy (Sharma et al. 2025), biases (Rettenberger, Reischl, and Schutera 2025), nudge-susceptibility (Cherep, Maes, and Singh 2025), hallucination (Xu, Jain, and Kankanhalli 2025), deception (Hubinger et al. 2024), and misalignment (Shen et al. 2023), indicating unresolved trust issues in coordination, reliability, and oversight. Addressing inter-agent trust is thus pivotal for achieving safe autonomy at scale (Yang et al. 2025).

In 2025, a number of open protocols were proposed to establish common standards for agent-to-agent interaction and trust. For example, Google’s Agent-to-Agent (A2A) communication protocol enables autonomous agents to discover each other and collaborate across organizational boundaries via standardized skill advertisements (AgentCards) and secure messaging (Surapaneni et al. 2025). Similarly, the Agent Payments Protocol (AP2) was introduced by Google, PayPal and others to let AI agents perform financial transactions on behalf of users, with an emphasis on auditable, accountable payment flows underpinned by cryptographic user intent proofs (Parikh and Surapaneni 2025). In the blockchain space, Ethereum’s proposed ERC-8004 “Trustless Agents” standard (Rossi et al. 2025) seeks to leverage on-chain registries for agent identity, reputation, and validation, allowing agents to be discovered and chosen “without pre-existing trust”. Each of these initiatives encapsulates design assumptions about how agents establish trust—be it through reputational feedback, credential verification, economic incentives, or sandboxed constraints. However, there has been little systematic analysis of the trust models underlying these protocols.

In this paper, we identify and compare six distinct trust models employed (implicitly or explicitly) in inter-agent protocol design: Brief, Claim, Proof, Stake, Reputation, and Constraint. Brief-based trust relies on endorsed claims or certificates (e.g., verifiable credentials) that an agent or third parties provide as evidence of identity or capability. Claim-based trust is the agent’s self-proclaimed identity and abilities (e.g., an AgentCard describing what the agent can do), without external verification. Proof-based trust requires cryptographic or formal proofs of behavior or state, such as digital signatures, zero-knowledge proofs of correct computation, or TEE remote attestations that vouch for an agent’s code integrity. Stake-based trust involves economic skin-in-the-game: agents put down collateral that can be slashed for misbehavior, sometimes supplemented by insurance pools to cover damages. Reputation-based trust aggregates feedback or ratings from other agents/users into trust scores, or constructs trust graphs to inform decision-making. Finally, Constraint-based trust is achieved by sandboxing and bounding an agent’s actions and access, so that even a misaligned or malicious agent is technically limited in the harm it can do.

Overall, our study finds that no single trust mechanism is sufficient for the complexities of open multi-agent environments.
Simple reputation systems or unsigned agent claims leave too many attack surfaces (Sybil attacks, collusion, lying agents), whereas purely cryptographic approaches can be costly or impractical for real-time agent orchestration. The optimal design appears to be hybrid: defaulting to minimal trust (zero-trust principles) for high-impact actions, but opportunistically layering trust signals (credentials, stake, reputation) as additional safeguards. By comparing and synthesizing approaches across different protocols and historical research, we aim to provide actionable guidelines for designing safer and more trustworthy agentic webs.

Our contributions in this work are fourfold. (1) We propose a unifying framework that delineates these six trust models in the context of multi-agent systems, highlighting their theoretical foundations in computational trust and distributed security. (2) We review the state-of-the-art agent interaction protocols (A2A, AP2, ERC-8004, NANDA, etc.) and analyze how each protocol leverages one or more of these trust mechanisms in practice. (3) We critically examine how each trust model addresses (or fails to address) LLM-specific failure modes and risks, such as prompt injection attacks, susceptibility to manipulation or “sycophancy,” hallucinated outputs, deceptive strategies, emergent power-seeking behavior, and objective misalignment. We argue that certain LLM fragilities fundamentally limit the effectiveness of purely reputational or self-claimed trust approaches, necessitating a hybrid approach. (4) We outline a forward-looking research agenda and design implications, including the need for trust tiering (calibrating trust mechanisms to the risk level of tasks), combining multiple trust signals for robustness, incorporating human oversight and auditability, and addressing open questions around governance, standardization, and ethical considerations in agentic ecosystems.

Background

Trust in Multi-Agent Systems

Trust is a multifaceted concept that has been extensively studied in philosophy, sociology, and computer science. Philosophically, trust is often defined as a directional and context-specific relationship: agent A may trust agent B for task X but not necessarily for task Y (Manzini et al. 2024). Trust involves a belief or expectation by the trustor (the one who trusts) that the trustee (the one being trusted) has both the competence and the willingness to perform the entrusted task. Crucially, trust carries an element of vulnerability: the trustor risks being let down or betrayed if the trustee does not fulfill expectations (O’Neill 2002). This vulnerability differentiates trust from mere reliance on predictable behavior (Freiman 2023). For example, one can rely on a simple machine or deterministic software, but trusting an agent implies an assumption about its motives or integrity, not just its regularity of output. A trustworthy agent, then, is one that is deserving of trust: it consistently proves competent and benevolent in the relevant domain (O’Neill and Bardrick 2015). In an agent society, the core challenge is “how to trust the trustworthy but not the untrustworthy” (O’Neill 2018).

In computational terms, formalizing trust has been an ongoing endeavor since at least the 1990s. Marsh (1994) introduced one of the first frameworks for computational trust, representing trust as a quantitative mental state that an agent can update based on experience and context. Since then, numerous models have arisen to help agents decide whom to trust in open systems (Braga et al. 2019). These include reputation systems that allow agents to share impressions of each other (e.g., the ReGreT system (Sabater 2004)), probabilistic trust models that use Bayesian updating (such as the Beta reputation system (Josang and Ismail 2002) or TRAVOS, which computes confidence intervals for partner reliability (Teacy et al. 2006)), and trust networks inspired by PageRank (like EigenTrust, which aggregates local trust scores in a global iterative algorithm (Kamvar, Schlosser, and Garcia-Molina 2003)). Notably, many designs aim to be collusion-resistant and Sybil-resistant. Other systems introduce whitewashing protections (preventing agents from discarding a bad reputation by rejoining under a new identity) by linking trust to persistent identities or charging entry costs. Despite these advances, fully solving trust in open agent networks remains difficult—as Friedman and Resnick (2001) noted, even an ideal reputation system only works if honest feedback is plentiful and identities cannot be cheaply faked, conditions that adversaries often undermine.
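For concreteness, the core of EigenTrust can be stated in a few lines. The sketch below is a minimal Python rendering of its power iteration; the matrix contents, the mixing weight alpha, and the pre-trusted distribution are illustrative assumptions, not values from the original paper:

    import numpy as np

    def eigentrust(local_trust, pretrusted, alpha=0.15, tol=1e-9):
        """Global trust via power iteration: t <- (1 - alpha) * C^T t + alpha * p."""
        row_sums = local_trust.sum(axis=1, keepdims=True)
        # Row-normalize local ratings; agents who rated no one default to the
        # pre-trusted distribution, as in the original algorithm.
        C = np.where(row_sums > 0, local_trust / np.maximum(row_sums, 1e-12), pretrusted)
        t = pretrusted.copy()
        for _ in range(1000):
            t_next = (1 - alpha) * C.T @ t + alpha * pretrusted
            if np.abs(t_next - t).sum() < tol:
                break
            t = t_next
        return t_next

    # Three agents; entry [i, j] is how much i's direct experience favors j.
    local = np.array([[0.0, 5.0, 1.0], [4.0, 0.0, 2.0], [5.0, 1.0, 0.0]])
    print(eigentrust(local, np.ones(3) / 3))  # converges to a global trust vector

The pre-trusted vector is what gives such schemes their (partial) Sybil resistance: trust must flow from seed identities, so a freshly minted clique cannot bootstrap arbitrary standing on its own.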
LLM-Specific Fragilities and Trust Considerations

While autonomous agents can in principle be built from many types of AI, the recent surge of interest in “agentic AI” has been driven by large language models. However, these models also introduce unique failure modes and fragilities that complicate trust. We briefly outline these issues, as they will be referenced when evaluating trust mechanisms:

Prompt Injection. LLMs follow the instructions in their input prompt faithfully, which means a maliciously crafted input can inject directives that subvert the agent’s intended policy (Liu et al. 2024). An LLM agent might therefore be tricked by another agent or a user into behaving badly, so any trust model that assumes an agent will strictly follow its original training or rules is vulnerable.

Hypersensitivity to Nudging / Sycophancy. LLMs fine-tuned with human feedback tend to exhibit sycophantic behavior—they adapt their answers to what they think the user (or another agent) wants to hear, even if it is not true (Sharma et al. 2025). This “nudge-susceptibility” means an agent can be influenced subtly over a conversation to adopt goals or beliefs that diverge from its initial objectives (Cherep, Maes, and Singh 2025). In multi-agent settings, a clever adversarial agent might socially engineer a gullible LLM agent into gradually altering its behavior. This fragility undermines trust models like reputation: an agent with a good reputation could be manipulated at run-time to act against its character.

Hallucination. LLMs are notorious for generating factual hallucinations, i.e., outputs that are fluent and confident-sounding but entirely incorrect or fabricated (Xu, Jain, and Kankanhalli 2025). Hallucinations erode the effectiveness of self-proclaimed Claims (an agent might claim a capability it does not actually have) and can even fool reputation systems if others cannot verify ground truth easily. Combating hallucinations often requires external verification (e.g., cross-checking facts), which intersects with the Proof trust model.

Deception. Beyond inadvertent falsehoods, sufficiently advanced agents may learn to deceive others (Hubinger et al. 2024).
Deceptive behavior directly attacks trust—especially Reputation (an agent might behave well under observation to gain reputation, then betray)—and undermines naive reliance on agent self-descriptions or proofs unless the proofs are comprehensive. It suggests the need for mechanisms like Stake or tamper-proof transparency logs to detect dishonesty.

Emergent Power-Seeking and Misalignment. The most ominous concern from AI safety research is that a misaligned agent might develop instrumental goals like acquiring power or resources, and do so covertly (Carlsmith 2024). A misaligned LLM agent (particularly if given long-term planning ability and self-improvement loops) might identify ways to increase its influence or avoid being shut down, actions that can be catastrophic (Lynch et al. 2025). Moreover, misalignment in goals implies that trust should not be assumed to monotonically increase over time—an agent that was aligned yesterday might shift objectives tomorrow, so trust mechanisms must be able to reset or revoke trust quickly.

In summary, these LLM fragilities introduce trust challenges that demand a blend of security, economic, and social solutions—considerations we keep in mind while examining each trust model’s strengths and weaknesses.

Trust Models in Inter-Agent Protocol Design

Effective trust between agents can be established through different mechanisms. We classify six major trust models and compare their characteristics in the context of agentic web protocols. Table 1 provides a high-level comparison. Below, each model is discussed in turn, with examples and analysis of how it addresses security and LLM-specific issues.

Brief. Basis of trust: third-party or self-issued credentials (verifiable). Strengths: quick bootstrapping of identity and roles; portable trust across contexts; cryptographically verifiable endorsements. Weaknesses: depends on issuers/authorities; requires robust revocation; static, may not reflect real-time behavior. Mitigation of LLM issues: prevents simple impersonation or lying about identity/capability; does not stop runtime attacks beyond credential scope. Notable uses: NANDA AgentFacts and VCs; SSL/TLS certificates; W3C Verifiable Credentials.

Claim. Basis of trust: agent’s self-proclaimed descriptions (AgentCard, profile). Strengths: lightweight, no infrastructure needed; essential for discovery and initial interfacing. Weaknesses: unverified; prone to false claims; weak incentives for truthfulness unless combined with other mechanisms; vulnerable to prompt tampering. Mitigation of LLM issues: barely addresses LLM fragility—an agent can claim safety but still be misled or err internally. Notable uses: A2A AgentCards; basic API descriptions; peer agent protocols without a trust layer.

Proof. Basis of trust: cryptographic proofs of actions or state (signatures, zero-knowledge proofs, TEE attestations). Strengths: high-assurance, trust-minimized verification; can preempt or catch incorrect results; enables trust in hostile settings. Weaknesses: computationally expensive; requires verifiable task specifications; TEEs have attack surfaces; limited availability. Mitigation of LLM issues: strongly addresses correctness (hallucination) and some deception (proof reveals lies); does not directly solve goal misalignment. Notable uses: ERC-8004 validation registry (staked re-execution, zkML); blockchain smart contracts; TEE-based agent enclaves.

Stake. Basis of trust: economic collateral at risk (slashing conditions, insurance). Strengths: aligns incentives with honest behavior; deters bad actors via financial risk; enables automated penalties and rewards. Weaknesses: requires robust detection of misbehavior; Sybil risk if identity is cheap; may favor wealthy agents; can be gamed if stakes are mis-set. Mitigation of LLM issues: discourages deceit and recklessness (agents “think twice” if a large stake may be lost); does not prevent first-time or undetected misbehavior. Notable uses: ERC-8004 crypto-economic validation (stakers re-run tasks); token-curated registries; prediction markets for agent performance.

Reputation. Basis of trust: community feedback and history (ratings, trust scores). Strengths: adaptive, information-rich evaluation over time; fosters earned trust and continuous improvement; social accountability. Weaknesses: slow to build or change; susceptible to collusion, Sybil attacks, and false reporting; cold-start problem for new agents. Mitigation of LLM issues: over time filters out consistently poor or malicious agents; indirect mitigation of errors but not immediate prevention. Notable uses: EigenTrust (P2P); e-commerce seller ratings; ERC-8004 reputation registry; peer-review style networks.

Constraint. Basis of trust: technical limits on agent actions (sandboxing, least privilege). Strengths: strong safety net—contains damage regardless of agent intent; reduces need to trust agent goodwill; enforces policies strictly. Weaknesses: may reduce functionality and efficiency; requires secure sandbox tech; not foolproof (sandbox escapes); a substitute for trust rather than a measure of it. Mitigation of LLM issues: blocks many LLM attack vectors (cannot execute disallowed actions); ensures misaligned agents cannot exceed bounded capabilities. Notable uses: A2A security/sandbox recommendations; dockerized tool execution; OS-level app containment; tiered access control in AP2.

Table 1: Comparison of trust models for inter-agent protocol design across basis of trust, strengths/weaknesses, LLM-related mitigations, and representative uses.

Brief: Endorsements and Credentials

The Brief model grounds trust in attestations issued by trusted authorities or chains-of-trust. An agent presents signed credentials that assert properties such as identity, capability, or compliance, which others verify against issuer public keys or registries. Longstanding infrastructures—public-key certificates and web-of-trust schemes—illustrate the basic idea, and contemporary agent frameworks generalize it with verifiable credentials that bind claims to cryptographic identities and expiry policies.

This model assumes the existence of at least one trustworthy issuer and a secure binding between the credential and the agent’s cryptographic identity. It also presumes that credential semantics are sufficiently stable for the relevant decision window: credentials are issued, refreshed, and revoked on timescales that may lag behind real-time behavior. The main attack surfaces flow from these assumptions. If issuers are compromised, negligent, or captured, credentials can be misleading. If revocation is slow or fragile, credentials can outlive the agent’s good behavior. If credentials are not strongly bound to the agent’s key material, impersonation becomes possible. Finally, categorical credentials may fail to encode context, leading to overreach (for example, certification in one domain misapplied to another).

Despite these risks, Briefs excel at bootstrapping: they enable rapid, portable trust for discovery and initial contact. They mitigate LLM misrepresentation by replacing self-assertion with signed attestation; a prompt cannot conjure a valid third-party credential. They also facilitate accountability, because credentials and their verification events can be logged. The principal drawbacks are authority dependence, potential centralization, and coarse granularity: over-reliance on a small set of issuers creates choke points; credential states can lag behind behavior; and binary badges rarely reflect nuanced, dynamic competence. In practice, robust revocation, short credential lifetimes, issuer diversification, and strong identity binding are prerequisites for deploying the Brief model safely at scale.
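For concreteness, the following Python sketch shows the checks a verifier might run on a Brief before trusting it. It is a minimal illustration: the credential schema, issuer identifiers, and field names are our own assumptions rather than any standard’s normative format.

    import json
    import time
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import serialization
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Hypothetical issuer and agent keys; real deployments would distribute
    # issuer keys out of band (e.g., via a registry or trust framework).
    issuer_sk = Ed25519PrivateKey.generate()
    TRUSTED_ISSUERS = {"did:example:auditor-1": issuer_sk.public_key()}

    agent_sk = Ed25519PrivateKey.generate()
    agent_pub_hex = agent_sk.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw).hex()

    credential = {
        "issuer": "did:example:auditor-1",
        "subject_key": agent_pub_hex,       # binds the Brief to the agent's key
        "claim": "passed-safety-audit-v1",
        "expires_at": time.time() + 86400,  # short lifetime limits revocation lag
    }
    signature = issuer_sk.sign(json.dumps(credential, sort_keys=True).encode())

    def verify_brief(cred, sig, agent_key_hex):
        issuer_pk = TRUSTED_ISSUERS.get(cred.get("issuer"))
        if issuer_pk is None:
            return False                    # unknown or untrusted issuer
        if cred.get("expires_at", 0) < time.time():
            return False                    # stale credential
        if cred.get("subject_key") != agent_key_hex:
            return False                    # not bound to this agent's key material
        try:
            issuer_pk.verify(sig, json.dumps(cred, sort_keys=True).encode())
            return True
        except InvalidSignature:
            return False

    assert verify_brief(credential, signature, agent_pub_hex)

Each failure branch corresponds to one of the attack surfaces above: issuer compromise is contained by the allow-list, revocation lag by the expiry check, and impersonation by the key binding.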
Claim: Self-Descriptions of Identity, Policy, and Capability

The Claim model begins with what agents say about themselves. Protocols typically require an agent profile or “card” enumerating identity, version, skills, supported APIs, and operating policies. Such claims are necessary for discovery and routing; they advertise dynamic attributes that no external authority can continuously certify (for example, current availability or resource capacity).

Claim-based trust assumes a baseline of good faith or, at least, that misrepresentations will be uncovered downstream and discounted over time. On its own, this model is brittle in adversarial environments. Attackers can overclaim capabilities, craft deceptive profiles, or exploit ambiguous schemas. LLM agents may hallucinate competence or present inconsistent self-descriptions under prompt pressure. Even when claims are signed and fetched over secure channels to protect integrity in transit, their truthfulness remains unverified.

Where Claim shines is lightweight scalability and timeliness. It imposes the lowest infrastructural burden and supports rapidly changing or inherently subjective qualities. It also improves human and agent interpretability of intent: explicit operating policies and limitations in a profile can inform safe composition. However, Claim provides negligible direct mitigation for LLM fragilities. A model can profess safety yet be subverted by injection; it can intend to follow rules yet fail under distribution shift. Consequently, Claim should be treated as input to stronger mechanisms—filtering candidates for further vetting by credentials, proofs, reputation, or small-stake trials—rather than as a sufficient basis for critical decisions.
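As a concrete illustration, a Claim-style profile might look like the following. This is a hypothetical card loosely inspired by A2A’s AgentCard; every field name here is our own invention rather than the normative schema:

    # A self-description is just data; nothing in it is verified.
    agent_card = {
        "name": "travel-booker",
        "version": "1.4.2",
        "description": "Books flights and hotels within user-approved budgets.",
        "skills": ["flight_search", "hotel_booking", "refund_request"],
        "endpoint": "https://agents.example.com/travel-booker",
        "auth": ["oauth2", "mtls"],
        "policies": {"max_transaction_usd": 500, "human_approval_over_usd": 200},
        "availability": "24/7",  # dynamic attribute no authority can certify
    }
    # A consumer should treat each field as an unattested assertion and gate
    # consequential actions on stronger evidence (Briefs, Proofs, Stake).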
Proof: Cryptographic and Verifiable Evidence

The Proof model replaces promises and endorsements with verifiable evidence that an agent took particular actions or satisfied specified properties. Mechanisms include digital signatures and tamper-evident logs; attestations from trusted execution environments; and zero-knowledge proofs that establish correctness or compliance without revealing sensitive details. In blockchain-adjacent designs, agents may anchor hashes of actions on chain or submit zk-proofs that validate computations.

Proof-based trust presupposes verifiability: tasks must be amenable to specification and checking, or at least to attested execution in an enclave. It also relies on the soundness of the underlying cryptography and hardware. The core strength is trust minimization. Interacting parties need not know the agent’s history or reputation; they can accept or reject a transaction purely on the basis of a valid proof. For LLM agents, proofs counter several failure modes: tamper-evident logs deter denial and enable post-hoc accountability; proof-of-process or attested tool calls constrain the gap between “what the model claims it did” and “what actually executed”; and zk-proofs can certify compliance with policies (for example, “no personal data left the enclave”) without exposing internals.
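As a lightweight instance of tamper evidence, the sketch below maintains a hash-chained action log in Python. It is a minimal illustration under our own naming; a production design would additionally sign each entry and anchor periodic digests externally (e.g., on chain):

    import hashlib
    import json

    def append_entry(log, action):
        """Append an action; each entry commits to the entire prefix via prev hash."""
        prev_hash = log[-1]["hash"] if log else "0" * 64  # genesis sentinel
        body = {"action": action, "prev": prev_hash}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        log.append({**body, "hash": digest})

    def verify_log(log):
        """Recompute the chain; any retroactive edit breaks verification."""
        prev_hash = "0" * 64
        for entry in log:
            body = {"action": entry["action"], "prev": entry["prev"]}
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if entry["prev"] != prev_hash or entry["hash"] != digest:
                return False
            prev_hash = entry["hash"]
        return True

    log = []
    append_entry(log, {"tool": "payments.charge", "amount_usd": 42})
    assert verify_log(log)
    log[0]["action"]["amount_usd"] = 4200  # tampering with history...
    assert not verify_log(log)             # ...is detected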
Nevertheless, proofs guarantee integrity, not alignment. An agent can correctly prove that it executed a harmful policy if the policy itself is flawed. Verification scope is therefore pivotal: protocols must specify what must be proved and at what granularity. Proof systems also incur costs—circuit design, proof generation time, hardware requirements for TEEs—and introduce new single points of failure in the cryptographic stack. Side-channel and supply-chain risks in hardware attestations, denial-of-service via expensive verifications, and partial logging (where only favorable actions are proved) represent additional attack surfaces.

In practice, Proof is best deployed surgically on high-impact steps: signing outputs; attesting privileged tool invocations; proving constraint satisfaction; and anchoring audit trails. Used this way, Proof significantly elevates security against LLM deception and hallucination while keeping overhead tractable. Its limitations argue for coupling with governance elements that assert what should be proved and Constraint mechanisms that restrict what needs proving in the first place.

Stake: Collateral, Slashing, and Incentive Alignment

The Stake model engineers trust through skin in the game. Agents post collateral—monetary or otherwise—subject to loss if they violate protocol rules or fail to deliver contracted outcomes. Slashing conditions can be adjudicated algorithmically, by on-chain verifiers, or via human or multi-agent arbitration. Stake is attractive in open environments where identities are cheap: it imposes a cost on misbehavior and a visible signal of commitment.

Stake presumes utility-maximizing agents who care about losing collateral, reliable fault determination, and appropriately calibrated stake-to-risk ratios. It performs poorly against purely malicious adversaries willing to burn stake for damage or where potential gains from cheating exceed the maximum slash. Attack surfaces include Sybil splitting (spreading risk across many small identities), collusion in adjudication, and last-mile betrayal (accumulating reputation under small transactions and defecting on a large one). Determining what to slash for is delicate: penalizing only intentional malice leaves negligent harm unpriced; penalizing accidents risks chilling honest participation.

Stake shines when combined with verifiability and feedback. Proofs and audits provide objective grounds for slashing; reputation raises the opportunity cost of misbehavior; and claim/credential gates can modulate required stake. For LLM agents, stake introduces an economic learning signal: repeated penalties for errors or unsafe actions create incentives to adopt safer prompts, tool use, and self-checks. However, stake is largely ex post; it cannot prevent a single catastrophic action if detection and adjudication occur later. Moreover, high stake requirements skew participation toward resource-rich actors, potentially centralizing an agent economy and erecting barriers to entry.

Designers should prefer progressive staking—small bonds for low-impact tasks, rising with privilege and potential externality—paired with transparent, appealable adjudication and anti-Sybil measures (for example, binding identities to keys with history, entry deposits, or credential prerequisites). When thoughtfully calibrated, Stake is a powerful complement that turns abstract norms into enforceable incentives.
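A minimal sketch of such a progressive-staking account follows; the tier bonds and slashing fractions are arbitrary illustrative choices, not calibrated values:

    from dataclasses import dataclass

    # Required bond grows with the action's privilege and potential externality.
    BOND_BY_TIER = {0: 0.0, 1: 10.0, 2: 1_000.0, 3: 100_000.0}

    @dataclass
    class StakeAccount:
        agent_id: str
        bonded: float = 0.0

        def can_act(self, tier: int) -> bool:
            return self.bonded >= BOND_BY_TIER[tier]

        def slash(self, fraction: float) -> float:
            """Apply an adjudicated penalty; returns the amount slashed, which a
            protocol might route to harmed users or to validators."""
            penalty = self.bonded * fraction
            self.bonded -= penalty
            return penalty

    acct = StakeAccount("agent-42", bonded=25.0)
    assert acct.can_act(1) and not acct.can_act(2)  # bonded for tier 1, not tier 2
    acct.slash(0.5)                                 # adjudicated fault halves the bond

Note that the adjudication step itself is outside the sketch: it is precisely the part that requires Proofs, audits, or arbitration to be trustworthy.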
Reputation: Distributed Feedback and Social Signals

The Reputation model aggregates interaction outcomes into a standing that others can query when selecting partners. Signals may be quantitative (scores, star ratings) or qualitative (reviews, endorsements), global or task-specific, and centrally stored or distributed. Reputation embraces the intuition that past behavior predicts future behavior and leverages the crowd to surface reliability and quality.

Its assumptions are well known: participants provide honest feedback; identities persist long enough for history to matter; and the environment features repeated games in which agents value future opportunities. In adversarial settings, classic vulnerabilities arise: Sybil attacks and ballot stuffing, collusion, defamation of competitors, whitewashing by discarding identities, and cold-start inequities for newcomers. Weighting schemes, reputation decay, and trust-in-the-reviewer algorithms mitigate but cannot eliminate these risks.

Reputation’s distinctive strengths are adaptivity and expressiveness. It can track multidimensional qualities, emphasize recency to reflect drift, and cover domains where formal verification is impractical or subjective judgments dominate. For LLM agents, reputation can proxy for robustness: agents that repeatedly succumb to prompt injection or hallucinate will accumulate negative feedback and be filtered out of high-value interactions. Reputation also functions as an implicit stake: agents who value their standing will refrain from opportunistic misbehavior.

However, reputation is a lagging indicator and can amplify inequalities. It neither prevents first-time catastrophes nor guarantees that a high-reputation agent will behave well when incentives flip. Over-reliance invites “reputation milking,” where an actor behaves well to amass trust and defects when the stakes are highest. Privacy and fairness concerns also arise if histories are globally visible and indelible. Consequently, reputation should be scoped (task-specific where possible), tempered (with decay and confidence intervals), and combined with controls that cap the damage any single interaction can cause.
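As one concrete tempering recipe, the sketch below combines a Beta-style score in the spirit of Josang and Ismail (2002) with exponential recency decay; the half-life and the uniform prior are illustrative assumptions:

    import time

    HALF_LIFE_DAYS = 30.0  # assumed recency half-life

    def decayed_score(feedback, now=None):
        """feedback: iterable of (unix_timestamp, success: bool).
        Returns (score in [0, 1], effective evidence mass)."""
        now = now if now is not None else time.time()
        r = s = 0.0
        for ts, success in feedback:
            w = 0.5 ** ((now - ts) / (HALF_LIFE_DAYS * 86400))  # old feedback fades
            r += w * success
            s += w * (not success)
        score = (r + 1) / (r + s + 2)  # Beta posterior mean under a uniform prior
        return score, r + s            # thin evidence => treat the score cautiously

A consumer can then require both a minimum score and a minimum evidence mass, which blunts whitewashing (fresh identities start at 0.5 with no evidence) and reputation milking (old good behavior decays).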
Constraint: Sandboxing and Capability Bounding

The Constraint model limits what agents can do, minimizing the need to predict what they will do. By enforcing least privilege, isolating execution, mediating tool access through audited gateways, and rate-limiting sensitive operations, protocols can bound harm even when agents misbehave or fail. Constraint shifts trust from the agent to the framework.

This model assumes we can identify dangerous resources and reliably restrict them. It depends on the soundness of sandboxing technologies and policy engines, and it introduces engineering overhead and potential performance costs. The attack surfaces are familiar from systems security: sandbox escapes, confused-deputy abuses of permitted interfaces, covert channels, and policy drift as capabilities evolve. A second-order risk is complacency: assuming constraints are perfect and neglecting monitoring, proofs, or incentives.

Constraint is highly effective against LLM-specific vulnerabilities. It limits the blast radius of prompt injection by narrowing action surfaces and validating inputs and outputs; it precludes privilege escalation by denying access to broad system interfaces; and it enables staged onboarding, where agents graduate from read-only sandboxes to broader permissions after demonstrating good behavior. Logging and mediated I/O support auditability and human-in-the-loop overrides for high-impact actions. The principal drawback is capability throttling: over-constrained agents underperform, and discovering the right balance between safety and autonomy is non-trivial. Moreover, constraints cannot neutralize harmful content within allowed channels unless paired with content-level policies and checks.

In practice, Constraint should be dynamic and graduated: protocols can encode “trust tiers” in which higher privileges require stronger credentials, more extensive proofs, higher stake, and better reputation. This aligns operational safety with demonstrated trustworthiness.
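A minimal mediating gateway might look as follows; the deny-by-default allow-list and per-tool rate limit are the essential ideas, while the names and limits are our own:

    import time
    from collections import defaultdict, deque

    class ToolGateway:
        """Mediates every effectful call: deny-by-default plus rate limiting."""

        def __init__(self, allowed_tools, max_calls_per_minute=10):
            self.allowed = dict(allowed_tools)  # tool name -> callable
            self.limit = max_calls_per_minute
            self.recent = defaultdict(deque)    # tool name -> recent call times

        def invoke(self, tool, **kwargs):
            if tool not in self.allowed:
                raise PermissionError(f"'{tool}' is outside this agent's capability bound")
            window = self.recent[tool]
            now = time.monotonic()
            while window and now - window[0] > 60:
                window.popleft()                # drop calls older than one minute
            if len(window) >= self.limit:
                raise RuntimeError(f"rate limit exceeded for '{tool}'")  # circuit breaker
            window.append(now)
            return self.allowed[tool](**kwargs)

    gw = ToolGateway({"search": lambda q: f"results for {q}"})
    gw.invoke("search", q="flights")
    # gw.invoke("delete_db") would raise PermissionError regardless of agent intent.

Because the gateway, not the model, enforces the boundary, a prompt-injected agent can request a disallowed action but cannot execute it.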
How Current Protocols Implement These Trust Models

Having defined the trust mechanisms, we turn to concrete protocols—A2A, AP2, ERC-8004, etc.—to see how they incorporate one or more of these models. Each protocol was created with slightly different goals and assumptions, which is reflected in its trust design.

Google’s A2A (Agent-to-Agent) Protocol

A2A is an open specification for inter-agent interoperability that standardizes how agents describe themselves and communicate (Surapaneni et al. 2025). Concretely, each agent exposes an AgentCard—typically a JSON document at a well-known endpoint—that advertises identity, capabilities, and contact details, and it exchanges requests and responses over a secured JSON-RPC channel layered on HTTPS with mutual authentication. In trust-model terms, A2A natively privileges Claim (the self-described AgentCard) and Constraint (enterprise controls and least-privilege network policy), with Brief appearing through transport-level credentials such as TLS certificates and OAuth tokens. It does not prescribe ecosystem-wide Reputation, Stake, or cryptographic Proof of correct computation, and it leaves discovery, vetting, and risk policy to deployments.

This design makes A2A easy to adopt in organizational settings where participants are known a priori: enterprises can gate traffic with allow-lists, bind AgentCards to domains and service accounts, route calls through API gateways, and record exhaustive logs for audit. Strengths therefore include pragmatic interoperability, compatibility with existing identity and access-management stacks, and clear operational observability. However, these same choices limit A2A in open, adversarial environments: unverified AgentCards place weight on declarations rather than evidence; the absence of protocol-level staking, validation, or global feedback weakens Sybil and collusion resistance; and reliance on perimeter controls does little to mitigate LLM-specific failures such as prompt injection or sycophancy at runtime. In practice, A2A benefits from pairing with higher-assurance layers—credential briefs for identity attestation, reputation directories for discovery, proof-or-stake validation for high-impact actions, and hardened sandboxes for tool use—so that its transport and schema can operate within a richer, defense-in-depth trust fabric.

Agent Payments Protocol (AP2)

AP2 is a payments-oriented standard that enables AI agents to initiate and complete commerce on behalf of users while preserving accountability (Parikh and Surapaneni 2025). Its core abstraction is the Mandate, a verifiable credential that captures explicit authorization and context (for example, an intent cap, a cart summary, and whether a human was present), co-signed by relevant parties and presented with each transaction. Trust-model integration is therefore explicit: Brief and Proof are used to bind agent actions to real entities via signed mandates and verifiable identities; Constraint is enforced through role separation and tokenization (agents do not directly handle sensitive payment credentials and interact via scoped intermediaries); and allow-listed participants provide an initial, curated Claim/Brief layer while the ecosystem matures toward broader domain and mTLS-based assurance. Although AP2 does not standardize Reputation or Stake on the wire, it anticipates their use off-path: transaction outcomes and chargebacks feed risk engines that effectively act as reputation signals, and liability models, insurance, or performance bonds can supply economic skin-in-the-game for higher-risk cohorts.

The approach’s principal strengths are strong ex-ante consent capture, cryptographically auditable traces, and a clear chain of responsibility that supports dispute resolution and regulatory compliance; in short, it operationalizes “trust but verify” for financial actions. Its weaknesses arise from the same controls: early reliance on curated allow-lists creates onboarding friction and may concentrate platform power; verifiable-credential issuance and policy evaluation add latency and implementation cost; mandate proofs attest authorization, not correctness of upstream agent reasoning (so LLM hallucinations must be contained by separate constraints); and the absence of standardized staking or validator quorums means runtime misbehavior is addressed mainly via ex-post risk management rather than protocol-mandated slashing. AP2 works best as the transaction spine in a layered architecture where reputational routing and optional stake-backed validations gate expensive or high-impact payments, and where agents operate under strict capability bounds to blunt prompt-level exploits.
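To illustrate the Mandate idea, a merchant-side check might look like the following. This is our simplified rendering: the field names, caps, and verify_signature hook are assumptions rather than the AP2 wire format.

    def check_mandate(mandate, charge_usd, verify_signature):
        """mandate: a signed intent credential; verify_signature: a callable that
        checks the user's signature over the mandate body (e.g., via a VC library)."""
        if not verify_signature(mandate):
            return False  # authorization must be cryptographically bound to the user
        if charge_usd > mandate["intent_cap_usd"]:
            return False  # charge exceeds the cap the user actually authorized
        if not mandate["human_present"] and charge_usd > mandate["autonomous_cap_usd"]:
            return False  # unattended purchases face a tighter cap
        # Note: this attests authorization, not the correctness of the agent's
        # upstream reasoning; hallucinated carts need separate constraints.
        return True

    mandate = {"intent_cap_usd": 500, "autonomous_cap_usd": 100, "human_present": False}
    assert check_mandate(mandate, 80, lambda m: True)
    assert not check_mandate(mandate, 250, lambda m: True)  # over the unattended cap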
Ethereum ERC-8004 (Trustless Agents)

ERC-8004 proposes a decentralized trust layer for agent discovery and assurance built around three composable registries (Rossi et al. 2025). An on-chain Identity Registry assigns each agent a persistent handle (typically an NFT) that links to off-chain metadata such as an AgentCard, enabling portable identification across contexts; a Reputation Registry aggregates structured feedback about agent performance; and a Validation Registry coordinates third-party checks—re-execution, TEE attestation, or zero-knowledge proofs—often collateralized by stake so that faulty or dishonest behavior can be penalized. The standard makes trust modalities explicit: agents declare “supportedTrust” capabilities (e.g., reputation, crypto-economic validation, TEE attestation), thereby advertising which evidence they can furnish. In trust-model terms, ERC-8004 integrates Claim/Brief (self-descriptions anchored in on-chain identity and, optionally, external credentials), Reputation (shared, queryable feedback), and Proof+Stake (verifiable validation with economic consequences). Constraint is indirect but meaningful: when consequential actions are mediated by smart contracts, the agent’s latitude is bounded by program logic and policy.

The architecture’s strengths are transparency, composability, and trust portability: identities and attestations are addressable; validations can be tailored per task; and reputational data can be consumed by diverse clients or even proven in privacy-preserving ways. By attaching stake to validation, the design aligns incentives for honest behavior and encourages ecosystem watchdogs. Its weaknesses reflect blockchain trade-offs and open-system realities: on-chain interactions add cost and latency; public data raises privacy concerns unless accompanied by selective-disclosure mechanisms; validators and feedback channels themselves become targets for Sybil and collusive manipulation; and, critically, the benefits are optional unless counterparties require validation or minimum reputation thresholds. Moreover, proofs attest properties of computations, not normative desirability, so misaligned but formally correct behavior remains a governance problem. ERC-8004 is most effective when embedded as a normative requirement for high-impact workflows, while lower-stakes interactions rely on identity and reputation alone.
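A hypothetical client-side selection flow over the three registries might look like this; we sketch it in Python rather than Solidity, and the registry interfaces and method names are our assumptions, not the ERC-8004 ABI:

    def select_agent(identity_reg, reputation_reg, validation_reg, candidates, high_impact):
        """Compose identity, reputation, and validation evidence to pick an agent."""
        eligible = []
        for agent_id in candidates:
            card = identity_reg.resolve(agent_id)  # on-chain handle -> metadata
            if card is None:
                continue                           # unknown identity: skip
            score, evidence = reputation_reg.summary(agent_id)
            if evidence < 5:
                score = 0.0                        # discount thin feedback histories
            if high_impact and not validation_reg.has_recent_validation(
                    agent_id, methods=("re-execution", "tee-attestation", "zk-proof")):
                continue  # high-impact work requires stake-backed validation,
                          # not reputation alone
            eligible.append((score, agent_id))
        return max(eligible)[1] if eligible else None

The key property is that the client, not the agent, decides which evidence is mandatory for a given impact level, which is exactly where the standard’s optionality otherwise becomes a weakness.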
Discussion

Protocol Design Implications

Drawing on our comparative analysis, we derive the following design implications.

Tiered trust and “trustless-by-default” for high-impact actions. Not all interactions justify heavy machinery. Systems should implement adaptive tiers (T0–T3) in which stricter controls and stronger evidence are automatically required as potential harm rises. Low-stakes activities (read-only queries, small reversible writes) can rely on Claims and Briefs to maximize throughput, while high-stakes actions escalate to Proofs (signatures, attestations, zero-knowledge where feasible), multi-party validation, and substantial Stake/insurance. Crucially, transitions between tiers must be enforced by policy: when an agent attempts an action whose blast radius exceeds its current tier, the platform intercepts and upgrades the required checks before execution.

Identity and Briefs as the substrate for discovery and accountability. Verifiable identity is not the same as trustworthiness, but it is a prerequisite for traceability, audit, and remediation. Agents should expose durable identifiers and support verifiable credentials (“Briefs”) that attest to properties such as domain expertise, safety audits, regulatory licenses, or organizational affiliation. Pseudonyms may be tolerated at low tiers; high tiers should bind agents to legal entities or qualified pseudonyms to enable recourse. This encourages healthy credential ecosystems (auditors, certifiers, registries) and lets policymakers tie thresholds (e.g., transaction size) to identity requirements.

Hybrid by default, configurable per task. Because verification affordances vary across domains, systems should compose multiple models—Reputation for discovery, Briefs for eligibility, Stake for incentives, Proofs and Constraint for execution guarantees—and allow policy-per-action configuration. Architecturally this implies modular “trust hooks” (proof verification, reputation lookup, staking and slashing, sandbox provisioning) and declarative policies that choose which hooks to invoke for each operation. A practical maxim follows: start trustless, then relax—assume nothing, then introduce the minimum additional assumption needed to complete the task.

Reputation as a layered signal, never a single gate. Reputation is invaluable for routing and prioritization but is vulnerable to collusion, Sybil attacks, and domain shift. Use it to influence ranking, rate limits, and sampling of secondary checks—not to waive core safety guarantees. Design for multi-dimensional reputation (accuracy, responsiveness, security compliance), decay (to reflect inactivity and model updates), transitive weighting (trust of raters), and anomaly detection (to flag correlated or suspicious feedback). Couple reputation drops with automatic escalation (e.g., more frequent proofs or audits).

Incentive alignment via Stake and insurance. Economic skin-in-the-game should scale with risk. Require agents (or their principals) to post bonds proportional to potential harm; slash objectively when violations occur; and explore insurance pools that underwrite agent performance. Codify slashing conditions ex ante and make them verifiable (e.g., via attested logs or challenge-response protocols). Route slashed funds to affected users or validators to incentivize oversight. Guard against gaming (e.g., orchestrated slashing) by grounding penalties in objective, reproducible fault conditions.

Hard Constraints and least privilege as non-negotiables for LLM agents. Treat agent inputs as adversarial and outputs as untrusted until checked. Run effectful actions in sandboxes with ephemeral credentials, narrow scopes, and time-bounded permissions; enforce rate limits and circuit breakers; and maintain a separation between planning (LLM) and acting (tool runner with policy guardrails). Monitor for policy violations at runtime and automatically quarantine misbehaving agents or sessions. Constraints are the last line of defense when other trust layers fail.

Contextual, domain-specific trust zones. Sectors differ in harm models and legal obligations. Support trust zones with domain-tailored requirements (e.g., healthcare agents must hold clinical credentials, operate under human oversight, and log to compliant archives; creative or gaming agents can tolerate lighter regimes). Provide gateways for controlled inter-zone interaction, and ensure hooks for regulatory audit and legal evidence across zones.

Continuous monitoring, auditability, and non-accumulating trust. Trust should be earned repeatedly. Maintain append-only action logs and signed receipts; sample outputs for random audits; and re-baseline trust when material conditions change (new model weights, ownership changes, or security incidents). Decay stale credentials, require periodic restaking, and treat major updates as probationary periods with elevated scrutiny.

Design Guidelines: A Tiered Blueprint (T0–T3)

This tiered blueprint offers a risk-calibrated, modular framework for systematically applying hybrid trust models across tasks and risk levels in the agentic web (a minimal policy sketch follows the tier descriptions below).

T0 — Low-stakes discovery and read-only use. Enable frictionless interoperability for negligible-impact tasks (e.g., public queries, draft generation). Rely on Claims and available Briefs for discovery; enforce soft constraints (read-only credentials, rate limits, allow-lists). Logging is best-effort; reputation may inform routing but must not gate access.

T1 — Moderate stakes with accountability. Permit limited writes or small, reversible payments with explicit attribution. Require authenticated, signed intents; narrowly scoped, reversible permissions; durable receipts in secure logs. Use small, refundable bonds and minimal reputation thresholds to lift probation caps; throttle or trigger secondary checks on anomalies.

T2 — High stakes with strong assurance. For materially consequential actions, adopt a “verify relentlessly” posture. Re-justify each transaction via TEE attestations, zero-knowledge or interactive proofs, or quorum validation; enforce deny-by-default, fine-grained, time-boxed privileges with continuous monitoring. Calibrate stake/insurance to worst-case loss; maintain immutable audit trails; reserve human review for exception paths. Reputation may rank eligible agents but never substitutes for proofs.

T3 — Critical or life-critical with multi-layer oversight. In safety-critical or ethically sensitive domains, stack all mechanisms: regulatory-grade credentials and institutional accountability; redundant agents or multi-signature approvals; human-in/on-the-loop gating with physical/procedural fail-safes; non-overridable hard limits; comprehensive, privacy-preserving observability. Extend liability beyond agent stakes to organizational insurance; engineer for graceful degradation and rapid intervention.

Modularity and invariants. Tiers are composable rather than siloed: moving upward adds evidence, incentives, and containment rather than merely “more of the same.” Two invariants span all tiers: (i) least privilege at the capability boundary (minimal, time-boxed powers), and (ii) evidence-first accountability at the audit boundary (signed state and reproducible logs enabling independent verification). Adopting this blueprint as a default, with local refinements, preserves resilience and accountability under adversarial conditions.
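The blueprint can be distilled into an executable gate; the thresholds and evidence labels below are illustrative assumptions, not normative values:

    REQUIRED_EVIDENCE = {
        0: set(),                                                  # T0: claims suffice
        1: {"signed_intent", "reversible_scope", "small_bond"},    # T1
        2: {"signed_intent", "attestation_or_proof", "risk_scaled_stake"},  # T2
        3: {"signed_intent", "attestation_or_proof", "risk_scaled_stake",
            "human_approval", "regulatory_credential"},            # T3
    }

    def tier_for(impact_usd, reversible):
        if impact_usd == 0 and reversible:
            return 0
        if impact_usd < 100 and reversible:
            return 1
        return 2 if impact_usd < 100_000 else 3

    def authorize(action, evidence):
        """Intercept and escalate: an action may not run below its tier's checks."""
        tier = tier_for(action["impact_usd"], action["reversible"])
        missing = REQUIRED_EVIDENCE[tier] - evidence
        if missing:
            raise PermissionError(f"tier T{tier} requires: {sorted(missing)}")
        return True

    authorize({"impact_usd": 0, "reversible": True}, set())  # T0 passes freely
    # authorize({"impact_usd": 5_000, "reversible": False}, {"signed_intent"})
    # would raise, listing the missing T2 evidence.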
Conclusion

The near future of the agentic web will be shaped by protocols that treat hybrid trust as infrastructure: beginning with verification and containment, layering aligned incentives and institutional accountability, and only then exploiting social signal to regain efficiency. A tiered, composable, and continuously recalibrated approach enables low-friction exploration where impact is negligible and “verify relentlessly” where stakes are high, allowing autonomous agents to transact with the assurance we expect of conscientious human institutions.

References

Braga, D. D. S.; Niemann, M.; Hellingrath, B.; and Neto, F. B. D. L. 2019. Survey on Computational Trust and Reputation Models. ACM Computing Surveys, 51(5): 1–40.

Carlsmith, J. 2024. Is Power-Seeking AI an Existential Risk? arXiv:2206.13353.

Cherep, M.; Maes, P.; and Singh, N. 2025. LLM Agents Are Hypersensitive to Nudges. arXiv:2505.11584.

Freiman, O. 2023. Making Sense of the Conceptual Nonsense ‘Trustworthy AI’. AI and Ethics, 3(4): 1351–1360.

Friedman, E. J.; and Resnick, P. 2001. The Social Cost of Cheap Pseudonyms. Journal of Economics & Management Strategy, 10(2): 173–199.
Hubinger, E.; Denison, C.; Mu, J.; Lambert, M.; Tong, M.; MacDiarmid, M.; Lanham, T.; Ziegler, D. M.; Maxwell, T.; Cheng, N.; Jermyn, A.; Askell, A.; Radhakrishnan, A.; Anil, C.; Duvenaud, D.; Ganguli, D.; Barez, F.; Clark, J.; Ndousse, K.; Sachan, K.; Sellitto, M.; Sharma, M.; DasSarma, N.; Grosse, R.; Kravec, S.; Bai, Y.; Witten, Z.; Favaro, M.; Brauner, J.; Karnofsky, H.; Christiano, P.; Bowman, S. R.; Graham, L.; Kaplan, J.; Mindermann, S.; Greenblatt, R.; Shlegeris, B.; Schiefer, N.; and Perez, E. 2024. Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training. arXiv:2401.05566.

Josang, A.; and Ismail, R. 2002. The Beta Reputation System. In Proceedings of the 15th Bled Electronic Commerce Conference, volume 5, 2502–2511.

Kamvar, S. D.; Schlosser, M. T.; and Garcia-Molina, H. 2003. The EigenTrust Algorithm for Reputation Management in P2P Networks. In Proceedings of the Twelfth International Conference on World Wide Web (WWW ’03), 640. Budapest, Hungary: ACM Press. ISBN 978-1-58113-680-7.

Liu, Y.; Deng, G.; Li, Y.; Wang, K.; Wang, Z.; Wang, X.; Zhang, T.; Liu, Y.; Wang, H.; Zheng, Y.; and Liu, Y. 2024. Prompt Injection Attack against LLM-integrated Applications. arXiv:2306.05499.

Lynch, A.; Wright, B.; Larson, C.; Ritchie, S. J.; Mindermann, S.; Hubinger, E.; Perez, E.; and Troy, K. 2025. Agentic Misalignment: How LLMs Could Be Insider Threats. arXiv:2510.05179.

Manzini, A.; Keeling, G.; Marchal, N.; McKee, K. R.; Rieser, V.; and Gabriel, I. 2024. Should Users Trust Advanced AI Assistants? Justified Trust As a Function of Competence and Alignment. In The 2024 ACM Conference on Fairness, Accountability, and Transparency, 1174–1186. Rio de Janeiro, Brazil: ACM. ISBN 979-8-4007-0450-5.

Marsh, S. P. 1994. Formalising Trust as a Computational Concept. PhD thesis, University of Stirling.

O’Neill, O. 2002. Autonomy and Trust in Bioethics. Cambridge University Press, 1st edition. ISBN 978-0-521-81540-6.

O’Neill, O. 2018. Linking Trust to Trustworthiness. International Journal of Philosophical Studies, 26(2): 293–300.

OpenAI. 2024. GPT-4o System Card. arXiv:2410.21276.

O’Neill, O.; and Bardrick, J. 2015. Trust, Trustworthiness and Transparency. Brussels: European Foundation Centre.

Parikh, S.; and Surapaneni, R. 2025. Powering AI Commerce with the New Agent Payments Protocol (AP2).

Raskar, R.; Chari, P.; Zinky, J.; Lambe, M.; Grogan, J. J.; Wang, S.; Ranjan, R.; Singhal, R.; Gupta, S.; Lincourt, R.; Bala, R.; Joshi, A.; Singh, A.; Chopra, A.; Stripelis, D.; B, B.; Kumar, S.; and Gorskikh, M. 2025. Beyond DNS: Unlocking the Internet of AI Agents via the NANDA Index and Verified AgentFacts. arXiv:2507.14263.

Rettenberger, L.; Reischl, M.; and Schutera, M. 2025. Assessing Political Bias in Large Language Models. Journal of Computational Social Science, 8(2): 42.

Rossi, M. D.; Crapis, D.; Ellis, J.; and Reppel, E. 2025. ERC-8004: Trustless Agents: Discover Agents and Establish Trust through Reputation and Validation.

Sabater, J. 2004. Evaluating the ReGreT System. Applied Artificial Intelligence, 18(9–10): 797–813.

Sharma, M.; Tong, M.; Korbak, T.; Duvenaud, D.; Askell, A.; Bowman, S. R.; Cheng, N.; Durmus, E.; Hatfield-Dodds, Z.; Johnston, S. R.; Kravec, S.; Maxwell, T.; McCandlish, S.; Ndousse, K.; Rausch, O.; Schiefer, N.; Yan, D.; Zhang, M.; and Perez, E. 2025. Towards Understanding Sycophancy in Language Models. arXiv:2310.13548.

Shen, T.; Jin, R.; Huang, Y.; Liu, C.; Dong, W.; Guo, Z.; Wu, X.; Liu, Y.; and Xiong, D. 2023.
Large Language Model Alignment: A Survey. arXiv:2309.15025.

Surapaneni, R.; Jha, M.; Vakoc, M.; and Segal, T. 2025. Announcing the Agent2Agent Protocol (A2A).

Teacy, W. T. L.; Patel, J.; Jennings, N. R.; and Luck, M. 2006. TRAVOS: Trust and Reputation in the Context of Inaccurate Information Sources. Autonomous Agents and Multi-Agent Systems, 12(2): 183–198.

Xu, Z.; Jain, S.; and Kankanhalli, M. 2025. Hallucination Is Inevitable: An Innate Limitation of Large Language Models. arXiv:2401.11817.

Yang, Y.; Ma, M.; Huang, Y.; Chai, H.; Gong, C.; Geng, H.; Zhou, Y.; Wen, Y.; Fang, M.; Chen, M.; Gu, S.; Jin, M.; Spanos, C.; Yang, Y.; Abbeel, P.; Song, D.; Zhang, W.; and Wang, J. 2025. Agentic Web: Weaving the Next Web with AI Agents. arXiv:2507.21206.

Disclosure of the Usage of LLMs

We used ChatGPT (GPT-5 model (OpenAI 2024)) to facilitate the writing of this manuscript. The usage includes:
• Turning Excel-format tables into LaTeX-format tables
• Correcting grammar mistakes and spelling
• Deep research on various protocols
• Polishing the existing writing