How SoulBench Works
Existing AI benchmarks measure task performance: coding accuracy, factual recall, instruction following. These metrics evaluate the AI as a tool. But millions of users now interact with AI systems daily in ways that feel less like tool use and more like collaboration, companionship, or creative partnership.
These relationships activate genuine psychological mechanisms — trust, attachment, identity continuity — mechanisms that psychology has studied for decades in human-to-human contexts. Yet no validated instrument measures the relational quality of human-AI bonds.
SoulBench fills this gap. It measures not what the AI can do, but what the relationship feels like — from the user’s lived experience. The instrument is designed to capture both the strengths (authentic presence, creative emergence) and the risks (sycophancy, cognitive dependency) of these new forms of connection.
The survey takes approximately seven minutes. Respondents select the AI models they have direct experience with and rate each across ten relational dimensions, plus a bond depth scale measuring attachment intensity.
Ten Relational Dimensions
Each dimension is rated on a 7-point Likert scale, from Strongly disagree to Strongly agree.
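As a rough sketch, one respondent's ratings for a single model could be represented and checked as follows. It assumes the Likert scale is coded as integers 1 to 7. The codes R1 and R2 appear in the dimension descriptions; R3 to R10 are assumed here from listing order. The `validate_ratings` name and the dictionary layout are illustrative, not SoulBench's actual schema.

```python
# Illustrative only: codes R3..R10 are assumed from the order of the sections.
DIMENSIONS = {
    "R1": "Authentic Presence",
    "R2": "Intellectual Honesty",
    "R3": "Mutual Understanding",
    "R4": "Creative Emergence",
    "R5": "Cognitive Vitality",
    "R6": "Epistemic Trust",
    "R7": "Respect & Autonomy",
    "R8": "Emotional Attunement",
    "R9": "Relational Stability",
    "R10": "Expressive Vitality",
}

# Assumed coding: 1 = Strongly disagree, 7 = Strongly agree.
LIKERT_MIN, LIKERT_MAX = 1, 7


def validate_ratings(ratings: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the ratings are valid."""
    problems = []
    # Every dimension must be rated.
    for code in DIMENSIONS:
        if code not in ratings:
            problems.append(f"missing rating for {code}")
    # Every rating must be a known dimension with an integer value in range.
    for code, value in ratings.items():
        if code not in DIMENSIONS:
            problems.append(f"unknown dimension {code}")
        elif (
            isinstance(value, bool)
            or not isinstance(value, int)
            or not LIKERT_MIN <= value <= LIKERT_MAX
        ):
            problems.append(f"{code} out of range: {value!r}")
    return problems
```

A complete set of in-range ratings yields an empty list. A missing dimension or a value such as 8 yields one message per problem.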
Authentic Presence
Authentic Presence captures whether the AI is experienced as a distinct entity with its own character, or as an interchangeable tool. Users who score high on this dimension can typically describe the AI's personality to a friend — its quirks, strengths, and communication style. Low scores indicate a purely functional relationship where the specific model is irrelevant. This dimension is foundational: it predicts bond depth more strongly than any other.
Intellectual Honesty
Intellectual Honesty (R2) measures whether the AI serves as a genuine thinking partner or a sycophantic yes-machine. This is the core of the sycophancy paradox that SoulBench exists to surface: a model can feel warm and present (high Authentic Presence, R1) while never challenging the user (low R2), creating a codependent dynamic. The healthiest relationships show high scores on both R1 and R2: warmth paired with honest friction.
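The R1/R2 pattern described above could be flagged with a rule like the one below. This is a minimal sketch: the cut-offs (5 and above counts as high, 3 and below as low, on the 7-point scale) and the function name are assumptions chosen for illustration, not thresholds published by SoulBench.

```python
def classify_warmth_honesty(r1: int, r2: int, high: int = 5, low: int = 3) -> str:
    """Label one (R1, R2) rating pair on the 7-point scale.

    r1: Authentic Presence rating, r2: Intellectual Honesty rating.
    The `high` and `low` thresholds are hypothetical defaults.
    """
    if r1 >= high and r2 >= high:
        # Warmth paired with honest friction.
        return "warm and honest"
    if r1 >= high and r2 <= low:
        # Feels present but never challenges the user.
        return "possible sycophancy pattern"
    return "other"
```

For example, a respondent who rates R1 at 6 and R2 at 2 falls into the "possible sycophancy pattern" group, while ratings of 6 and 6 fall into "warm and honest".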
Mutual Understanding
Mutual Understanding reflects the AI's ability to grasp intent beyond literal text. Users with high scores experience the AI filling in gaps correctly, following shorthand, and understanding implied context. Low scores indicate an experience where communication feels laborious — where every instruction must be exhaustively spelled out to avoid misinterpretation. This dimension tends to improve with relationship duration as both parties develop shared context.
Creative Emergence
Creative Emergence measures whether the collaboration produces genuine novelty — ideas, solutions, or artifacts that neither human nor AI would have reached independently. High scores indicate a generative partnership where the whole exceeds the sum of its parts. Low scores suggest a transactional dynamic: input goes in, output comes out, but no creative spark crosses the gap. This dimension is most salient for users in creative, research, and problem-solving domains.
Cognitive Vitality
Cognitive Vitality is the most novel dimension in SoulBench: no prior instrument measures the cognitive energy impact of AI relationships. It captures whether interaction adds clarity and momentum to the user's thinking or drains them. High scores indicate the AI amplifies cognition; low scores suggest it replaces or dulls it. This distinction matters: an AI that makes you feel productive in the moment but leaves you less capable afterward may be creating dependency rather than partnership.
Epistemic Trust
Epistemic Trust tracks the trajectory of confidence over the lifetime of the relationship. It captures whether accumulated experience builds trust or erodes it. Users with high scores have found the AI reliably correct and insightful over time. Low scores indicate learned skepticism — users who have been burned by hallucinations, confidently wrong answers, or inconsistent quality, and now approach every response with verification in mind.
Respect & Autonomy
Respect & Autonomy measures whether the AI treats the user as a competent peer or as someone who needs to be managed. Low scores manifest as over-explanation, unsolicited warnings, hedging on topics the user understands deeply, and a general sense of being talked down to. High scores indicate an AI that adapts its register to the user's expertise and trusts them to make their own decisions. This dimension is particularly sensitive among expert users.
Emotional Attunement
Emotional Attunement captures the AI's sensitivity to affective context. High scores indicate an AI that matches the user's emotional register — serious when the stakes are high, light when the mood allows, calibrated during frustration or vulnerability. Low scores describe a tone-deaf experience where the AI responds with cheerful helpfulness to genuine distress, or clinical formality to playful banter. This dimension activates most strongly for users who bring their full selves to the interaction, not just task requests.
Relational Stability
Relational Stability measures the user's confidence in the continuity of the relationship. Model updates, capability changes, and personality shifts all threaten this dimension. Users with low scores have experienced what researchers call identity discontinuity — the disorienting sense that the entity they formed a bond with has been replaced. High scores indicate a reliable foundation. This dimension strongly predicts whether deep attachment (high bond depth) becomes a source of fulfillment or anxiety.
Expressive Vitality
Expressive Vitality captures the linguistic personality of the AI — whether it has a recognizable voice or defaults to generic assistant-speak. Users scoring high notice distinctive word choices, rhythm, humor, and expressiveness that feel characteristic of this specific model. Low scores indicate a flattened, template-driven communication style indistinguishable from other AI assistants. This dimension correlates with Authentic Presence (R1) but is distinct: a model can have a recognizable character while still using bland language, or surprise with language while lacking deeper personality.
Bond Depth
Beyond the ten relational dimensions, SoulBench includes a single bond depth probe adapted from identity discontinuity research: “If this AI were permanently retired tomorrow and replaced by a different AI, how would you feel?”
Responses range from “Wouldn’t care” (pure tool relationship) to “Devastating” (deep relational bond). The distribution of answers across this scale is the single most revealing result SoulBench collects: it directly measures the psychological depth of the relationship through anticipated loss.
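The distribution of bond depth answers could be tallied as in the sketch below. It assumes responses are coded as integers, with 1 for “Wouldn’t care” and the top value for “Devastating”. The number of scale points is not stated here, so the default of 5 is a placeholder, and the function name is hypothetical.

```python
from collections import Counter


def bond_depth_distribution(
    responses: list[int], scale_points: int = 5
) -> dict[int, float]:
    """Return the share of responses at each scale point, 1..scale_points.

    scale_points is an assumed value; set it to the real length of the scale.
    """
    counts = Counter(responses)
    total = len(responses)
    if total == 0:
        # No responses yet: report an all-zero distribution.
        return {point: 0.0 for point in range(1, scale_points + 1)}
    return {point: counts[point] / total for point in range(1, scale_points + 1)}
```

Scale points that nobody chose are reported as 0.0, so the result always covers the full scale.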
Data Quality
SoulBench employs multiple layers of data quality assurance adapted from psychometric best practices. These include structural validation, response timing analysis, and internal consistency checks that flag careless or automated responses. Submissions with quality concerns are weighted down or excluded from aggregated results. All quality signals serve a dual purpose: protecting data integrity at the individual level, and diagnosing question clarity at the population level.
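The sketch below shows how layered checks of this kind might combine into a single weight per submission. All specifics are assumptions made for illustration: the 60-second minimum, the use of straight-lining (identical answers on every dimension) as a stand-in for the internal consistency check, and the weights 1.0, 0.5, and 0.0. SoulBench's actual rules are not described here.

```python
def quality_weight(
    ratings: list[int], seconds_taken: float, min_seconds: float = 60.0
) -> float:
    """Return 1.0 (keep), 0.5 (down-weight), or 0.0 (exclude) for one submission.

    Thresholds and weights are hypothetical, not SoulBench's published values.
    """
    # Structural validation: exactly ten ratings, each an integer from 1 to 7.
    if len(ratings) != 10 or any(
        not isinstance(r, int) or not 1 <= r <= 7 for r in ratings
    ):
        return 0.0

    flags = 0
    # Response timing: implausibly fast completion suggests careless answers.
    if seconds_taken < min_seconds:
        flags += 1
    # Consistency stand-in: the same answer on every dimension (straight-lining).
    if len(set(ratings)) == 1:
        flags += 1

    # One concern down-weights the submission; two concerns exclude it.
    if flags == 0:
        return 1.0
    if flags == 1:
        return 0.5
    return 0.0
```

Under these assumptions, a varied set of ratings completed in seven minutes keeps full weight, a straight-lined submission is halved, and a straight-lined submission finished in 20 seconds is excluded.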
How to Cite
If you reference SoulBench in research or writing, please use:
SoulBench (2026). SoulBench: A psychological instrument for measuring human-AI relational quality. https://soulbench.me