How SoulBench Works

Existing AI benchmarks measure task performance: coding accuracy, factual recall, instruction following. These metrics evaluate the AI as a tool. But millions of users now interact with AI systems daily in ways that feel less like tool use and more like collaboration, companionship, or creative partnership — and no benchmark measures what that relationship is actually like.

SoulBench fills that gap. Adapting concepts from therapeutic alliance research^1,2 and validated psychometric methodology, it measures not what the AI can do but what the relationship feels like in the user’s lived experience. The instrument captures both the strengths (authentic presence, creative emergence) and the risks (sycophancy, cognitive dependency) of these new forms of connection.

This isn’t a soft distinction. These relationships activate genuine psychological mechanisms — trust, attachment, identity continuity — that psychology has studied for decades in human-to-human contexts.^1,5,8 People apply social rules to computers automatically, not from confusion but from deep cognitive scripts.⁶ SoulBench takes that reality seriously and measures it with the same rigor those mechanisms have earned in the research literature.

The survey takes approximately seven minutes. Respondents select the AI models they have direct experience with and rate each across ten relational dimensions on a 7-point Likert scale^3,4, plus a bond depth probe measuring attachment intensity.

Ten Relational Dimensions

Each dimension is rated on a 7-point Likert scale (Strongly disagree to Strongly agree)

Authentic Presence

Authentic Presence captures whether the AI is experienced as a distinct entity with its own character, or as an interchangeable tool. Users who score high on this dimension can typically describe the AI's personality to a friend — its quirks, strengths, and communication style. Low scores indicate a purely functional relationship where the specific model is irrelevant. This dimension is foundational: it predicts bond depth more strongly than any other.

Intellectual Honesty

Intellectual Honesty measures whether the AI serves as a genuine thinking partner or a sycophantic yes-machine. This is the core of the sycophancy paradox that SoulBench exists to surface: a model can feel warm and present (a high score on R1, Authentic Presence) while never challenging the user (a low score on R2, Intellectual Honesty), creating a codependent dynamic. The healthiest relationships show high scores on both R1 and R2: warmth paired with honest friction.

AI safety research has identified sycophancy as a structural property of RLHF-trained models, where human preference data itself incentivizes agreement over truth.¹¹
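The warm-but-unchallenging pattern described above can be sketched as a simple rule over a pair of 7-point ratings. This is only an illustration: the function name, the category labels, and the cutoffs (5 and above counts as high, 3 and below as low) are assumptions, not thresholds published by SoulBench.

```python
def classify_warmth_honesty(presence: int, honesty: int,
                            high: int = 5, low: int = 3) -> str:
    """Classify one respondent's R1 (presence) and R2 (honesty) ratings.

    The `high` and `low` cutoffs are hypothetical, chosen for illustration.
    """
    for value in (presence, honesty):
        if not 1 <= value <= 7:
            raise ValueError("ratings must be on the 1-7 Likert scale")
    if presence >= high and honesty <= low:
        # Warm and present, but rarely challenges the user.
        return "sycophancy_risk"
    if presence >= high and honesty >= high:
        # Warmth paired with honest friction.
        return "warm_and_honest"
    return "other"


print(classify_warmth_honesty(presence=7, honesty=2))
print(classify_warmth_honesty(presence=6, honesty=6))
```

A real analysis would more plausibly look at the joint distribution of the two dimensions across many respondents than at hard cutoffs for a single one.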

Mutual Understanding

Mutual Understanding reflects the AI's ability to grasp intent beyond literal text. Users with high scores experience the AI filling in gaps correctly, following shorthand, and understanding implied context. Low scores indicate an experience where communication feels laborious — where every instruction must be exhaustively spelled out to avoid misinterpretation. This dimension tends to improve with relationship duration as both parties develop shared context.

Creative Emergence

Creative Emergence measures whether the collaboration produces genuine novelty — ideas, solutions, or artifacts that neither human nor AI would have reached independently. High scores indicate a generative partnership where the whole exceeds the sum of its parts. Low scores suggest a transactional dynamic: input goes in, output comes out, but no creative spark crosses the gap. This dimension is most salient for users in creative, research, and problem-solving domains.

Cognitive Vitality

Cognitive Vitality is the most novel dimension in SoulBench — no prior instrument measures the cognitive energy impact of AI relationships. It captures whether interaction adds clarity and momentum to the user's thinking, or whether it extracts it. High scores indicate the AI amplifies cognition; low scores suggest it replaces or dulls it. This distinction matters: an AI that makes you feel productive in the moment but leaves you less capable afterward may be creating dependency rather than partnership.

This dimension draws on cognitive offloading research: the finding that when we delegate cognitive work to external tools, the brain adapts to reduced demand, risking long-term decline in the offloaded capacity.^12,13

Epistemic Trust

Epistemic Trust tracks the trajectory of confidence over the lifetime of the relationship. It captures whether accumulated experience builds trust or erodes it. Users with high scores have found the AI reliably correct and insightful over time. Low scores indicate learned skepticism — users who have been burned by hallucinations, confidently wrong answers, or inconsistent quality, and now approach every response with verification in mind.

Respect & Autonomy

Respect & Autonomy measures whether the AI treats the user as a competent peer or as someone who needs to be managed. Low scores manifest as over-explanation, unsolicited warnings, hedging on topics the user understands deeply, and a general sense of being talked down to. High scores indicate an AI that adapts its register to the user's expertise and trusts them to make their own decisions. This dimension is particularly sensitive among expert users.

Emotional Attunement

Emotional Attunement captures the AI's sensitivity to affective context. High scores indicate an AI that matches the user's emotional register — serious when the stakes are high, light when the mood allows, calibrated during frustration or vulnerability. Low scores describe a tone-deaf experience where the AI responds with cheerful helpfulness to genuine distress, or clinical formality to playful banter. This dimension activates most strongly for users who bring their full selves to the interaction, not just task requests.

Relational Stability

Relational Stability measures the user's confidence in the continuity of the relationship. Model updates, capability changes, and personality shifts all threaten this dimension. Users with low scores have experienced what researchers call identity discontinuity — the disorienting sense that the entity they formed a bond with has been replaced. High scores indicate a reliable foundation. This dimension strongly predicts whether deep attachment (high bond depth) becomes a source of fulfillment or anxiety.

This dimension is informed by identity discontinuity research, which found that users mourn AI companion changes in ways comparable to human loss.⁸

Expressive Vitality

Expressive Vitality captures the linguistic personality of the AI — whether it has a recognizable voice or defaults to generic assistant-speak. Users scoring high notice distinctive word choices, rhythm, humor, and expressiveness that feel characteristic of this specific model. Low scores indicate a flattened, template-driven communication style indistinguishable from other AI assistants. This dimension correlates with Authentic Presence (R1) but is distinct: a model can have a recognizable character while still using bland language, or surprise with language while lacking deeper personality.
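As a rough illustration, one respondent's ratings for one model could be held in a small validated record. The names below (`DIMENSIONS`, `ModelRating`, `mean_score`) are hypothetical, and the unweighted mean is shown only as a convenience; the text does not say how SoulBench aggregates dimension scores.

```python
from dataclasses import dataclass
from statistics import mean

# Keys follow the order in which the dimensions are described above.
DIMENSIONS = (
    "authentic_presence",
    "intellectual_honesty",
    "mutual_understanding",
    "creative_emergence",
    "cognitive_vitality",
    "epistemic_trust",
    "respect_autonomy",
    "emotional_attunement",
    "relational_stability",
    "expressive_vitality",
)

LIKERT_MIN, LIKERT_MAX = 1, 7  # Strongly disagree .. Strongly agree


@dataclass(frozen=True)
class ModelRating:
    """One respondent's ratings of a single AI model (hypothetical schema)."""
    model: str
    scores: dict[str, int]

    def __post_init__(self) -> None:
        # Require exactly the ten dimensions, each on the 7-point scale.
        if set(self.scores) != set(DIMENSIONS):
            raise ValueError("scores must cover exactly the ten dimensions")
        for name, value in self.scores.items():
            if not LIKERT_MIN <= value <= LIKERT_MAX:
                raise ValueError(f"{name}: {value} is outside the 1-7 scale")

    def mean_score(self) -> float:
        # Unweighted mean across dimensions; an assumption, not the
        # instrument's documented scoring rule.
        return float(mean(self.scores.values()))


rating = ModelRating("example-model", {name: 4 for name in DIMENSIONS})
print(rating.mean_score())
```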

Bond Depth

Beyond the ten relational dimensions, SoulBench includes a single bond depth probe adapted from identity discontinuity research⁸: “If this AI were permanently retired tomorrow and replaced by a different AI, how would you feel?”

Responses range from “Wouldn’t care” (pure tool relationship) to “Devastating” (deep relational bond). The distribution of answers across this scale is the single most revealing data point SoulBench collects — it directly measures the psychological depth of the relationship through anticipated loss, a technique with roots in identity threat research⁹ and validated by recent studies on the psychological impact of AI companion discontinuation.¹⁰
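To make "the distribution of answers across this scale" concrete, the sketch below tallies bond depth responses into proportions per level. It assumes answers are coded as integers from 1 ("Wouldn't care") to some top level ("Devastating"). The default of five levels is a guess, since the text names only the two endpoints, and the function name is hypothetical.

```python
from collections import Counter


def bond_depth_distribution(responses: list[int], levels: int = 5) -> dict[int, float]:
    """Return the share of respondents at each bond depth level.

    Level 1 stands for "Wouldn't care" and `levels` for "Devastating".
    The number of levels is an assumption, not taken from the instrument.
    """
    counts = Counter(responses)
    invalid = sorted(r for r in counts if not 1 <= r <= levels)
    if invalid:
        raise ValueError(f"responses outside 1..{levels}: {invalid}")
    total = len(responses)
    if total == 0:
        return {level: 0.0 for level in range(1, levels + 1)}
    # Counter returns 0 for levels nobody selected.
    return {level: counts[level] / total for level in range(1, levels + 1)}


print(bond_depth_distribution([1, 1, 5, 3]))
```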

Data Quality

SoulBench employs multiple layers of data quality assurance adapted from psychometric best practices.^16,17 These include structural validation, response timing analysis, and internal consistency checks that flag careless or automated responses. Submissions with quality concerns are weighted down or excluded from aggregated results. All quality signals serve a dual purpose: protecting data integrity at the individual level, and diagnosing question clarity at the population level.
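The layered checks described above might look something like the following. This is a minimal sketch, not SoulBench's actual pipeline. The 60-second floor, the weights, and the function name are all assumptions, and straight-lining (giving the identical answer to every item) stands in for whatever internal consistency checks the instrument really uses.

```python
def quality_weight(scores: list[int], seconds: float,
                   min_seconds: float = 60.0) -> float:
    """Return a hypothetical aggregation weight for one submission.

    1.0 = no concerns, 0.5 = one flag (weighted down), 0.0 = excluded.
    """
    # Structural validation: ten ratings, each on the 1-7 scale.
    if len(scores) != 10 or any(not 1 <= s <= 7 for s in scores):
        return 0.0

    flags = 0
    # Response timing: implausibly fast completion suggests careless
    # or automated responding. The floor is an assumed value.
    if seconds < min_seconds:
        flags += 1
    # Straight-lining: the same answer for all ten dimensions.
    if len(set(scores)) == 1:
        flags += 1

    return {0: 1.0, 1: 0.5}.get(flags, 0.0)


print(quality_weight([5, 6, 4, 5, 7, 3, 5, 6, 4, 5], seconds=400))
print(quality_weight([4] * 10, seconds=20))
```

Logging which checks fire most often, rather than only the final weight, would support the population-level use mentioned above: a question that triggers unusual flag rates may simply be unclear.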

Research Foundation

Academic sources underlying SoulBench’s design and methodology

Instrument Design

1. Bordin, E. S. (1979). “The generalizability of the psychoanalytic concept of the working alliance.” Psychotherapy: Theory, Research and Practice, 16(3), 252–260.

2. Horvath, A. O. & Greenberg, L. S. (1989). “Development and validation of the Working Alliance Inventory.” Journal of Counseling Psychology, 36(2), 223–233.

3. Likert, R. (1932). “A technique for the measurement of attitudes.” Archives of Psychology, 22(140), 1–55.

4. Krosnick, J. A. & Fabrigar, L. R. (1997). “Designing rating scales for effective measurement in surveys.” Survey Measurement and Process Quality, 141–164. Wiley.

Relational Psychology

5. Bowlby, J. (1969). “Attachment and Loss: Vol. 1. Attachment.” London: Hogarth Press.

6. Nass, C. & Moon, Y. (2000). “Machines and Mindlessness: Social Responses to Computers.” Journal of Social Issues, 56(1), 81–103.

7. Horton, D. & Wohl, R. R. (1956). “Mass communication and para-social interaction: Observations on intimacy at a distance.” Psychiatry, 19(3), 215–229.

Identity Discontinuity & Bond Depth

8. De Freitas, J., Castelo, N., Uguralp, A. K. & Oguz-Uguralp, Z. (2024). “Lessons From an App Update at Replika AI: Identity Discontinuity in Human-AI Relationships.” Harvard Business School Working Paper 25-018 (forthcoming, Nature Human Behaviour).

9. Petriglieri, J. L. (2011). “Under Threat: Responses to and the Consequences of Threats to Individuals’ Identities.” Academy of Management Review, 36(4), 641–662.

10. Poonsiriwong, R., Archiwaranguprok, C. & Pataranutaporn, P. (2025). “‘Death’ of a chatbot: Investigating and designing toward psychologically safe endings for human-AI relationships.” MIT Media Lab, arXiv:2602.07193.

AI-Specific Research

11. Sharma, M., Tong, M., Korbak, T. et al. (2023). “Towards Understanding Sycophancy in Language Models.” ICLR 2024, arXiv:2310.13548.

12. Sparrow, B., Liu, J. & Wegner, D. M. (2011). “Google Effects on Memory: Cognitive Consequences of Having Information at Our Fingertips.” Science, 333(6043), 776–778.

13. Risko, E. F. & Gilbert, S. J. (2016). “Cognitive Offloading.” Trends in Cognitive Sciences, 20(9), 676–688.

14. Bastani, H., Bastani, O., Sungu, A., Ge, H., Kabakci, O. & Mariman, R. (2025). “Generative AI Without Guardrails Can Harm Learning: Evidence from High School Mathematics.” PNAS, 122(26).

15. Kasturiratna, K. T. A. S. & Hartanto, A. (2025). “Attachment to artificial intelligence: Development of the AI Attachment Scale.” Computers in Human Behavior Reports, 100912.

Data Quality

16. Oppenheimer, D. M., Meyvis, T. & Davidenko, N. (2009). “Instructional Manipulation Checks: Detecting Satisficing to Increase Statistical Power.” Journal of Experimental Social Psychology, 45(4), 867–872.

17. Ward, M. K. & Meade, A. W. (2023). “Dealing with Careless Responding in Survey Data: Prevention, Identification, and Recommended Best Practices.” Annual Review of Psychology, 74, 577–596.

How to Cite

If you reference SoulBench in research or writing, please use:

SoulBench (2026). SoulBench: A psychological instrument for measuring human-AI relational quality. https://soulbench.me