Recent work has moved toward more qualitative and user-centered LLM evaluation, but typically captures only part of what this paper studies. VibeCheck (Dunlap et al., 2024) and Vibe Checker (Zhong et al., 2025) are especially relevant, addressing vibe-oriented and coding-focused evaluation respectively, yet neither centers an empirical study of vibe-testing as a user practice. HELM Instruct
(Zhang et al., 2024) and ChatBench (Chang et al., 2025) highlight the limits of static benchmarks and predefined criteria,
while user-focused methods such as EvalLM
(Kim et al., 2024), IQA-Eval
(Li et al., 2024), and EVALAGENT
(Wadhwa et al., 2025) personalize parts of the evaluation process, but none grounds its evaluation framework in an empirical account of how users actually compare models. This paper differs in combining empirical analysis, a formalization of that observed practice, and a pipeline built directly from it.