
model robustness

7 items

ARTICLE · ↑ trending · Reddit r/MachineLearning · 18d ago

One thing that's been bothering me lately: benchmark performance often tells me almost nothing about whether a workflow will survive production usage. [D]

The author expresses frustration that benchmark performance often fails to predict whether an AI workflow will hold up in real production usage. They attribute the gap to factors such as ambiguous user intent and messy context, and suggest that evaluation still prioritizes clean-task optimization over behavioral robustness.

41
RESEARCH · arXiv CS.CL · 4/15/2026

Robust Explanations for User Trust in Enterprise NLP Systems

This research proposes a unified black-box robustness evaluation framework for token-level explanations, aimed at improving user trust in enterprise NLP systems, especially when migrating to LLMs. It operationalizes robustness as the top-token flip rate under realistic perturbations and systematically compares encoder and decoder architectures, including BERT, RoBERTa, Qwen, and Llama.

28
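To make the "top-token flip rate" in the explanation-robustness item above concrete, here is a minimal Python sketch. It assumes one plausible reading of the metric: the fraction of perturbed inputs whose highest-attributed token differs from the highest-attributed token of the original input. The function names, the input layout, and the example sentences and scores are all hypothetical; the paper's exact definition and perturbation set may differ.

```python
from typing import List, Sequence, Tuple

# An explanation is a pair (tokens, attribution scores), one score per token.
# The scores can come from any black-box explainer; this sketch only compares them.
Explanation = Tuple[Sequence[str], Sequence[float]]


def top_token(tokens: Sequence[str], scores: Sequence[float]) -> str:
    """Return the token with the highest attribution score."""
    if not tokens or len(tokens) != len(scores):
        raise ValueError("tokens and scores must be non-empty and the same length")
    best_index = max(range(len(tokens)), key=lambda i: scores[i])
    return tokens[best_index]


def top_token_flip_rate(original: Explanation, perturbed: List[Explanation]) -> float:
    """Fraction of perturbed inputs whose top token differs from the original's.

    Assumed definition for illustration only; not taken from the paper.
    """
    if not perturbed:
        raise ValueError("at least one perturbed explanation is required")
    reference = top_token(*original)
    flips = sum(1 for toks, scs in perturbed if top_token(toks, scs) != reference)
    return flips / len(perturbed)


# Hypothetical attribution scores for one sentence and four perturbations of it.
original = (["refund", "was", "denied"], [0.2, 0.1, 0.7])
perturbed = [
    (["refund", "got", "denied"], [0.3, 0.1, 0.6]),            # top stays "denied"
    (["refund", "was", "declined"], [0.6, 0.1, 0.3]),          # top flips to "refund"
    (["the", "refund", "was", "denied"], [0.0, 0.1, 0.1, 0.8]),  # top stays "denied"
    (["refund", "denied"], [0.55, 0.45]),                      # top flips to "refund"
]

rate = top_token_flip_rate(original, perturbed)
print(rate)
```

Comparing top tokens by string rather than by position is a design choice in this sketch, since perturbations such as insertions or deletions shift token indices.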
RESEARCH · arXiv CS.CL · 5/5/2026

Compared to What? Baselines and Metrics for Counterfactual Prompting

This work argues that observed effects from "counterfactual prompting" in LLMs cannot be attributed to a targeted factor without a baseline of meaning-preserving text modifications that establishes the model's general sensitivity. The research shows that prediction flip rates when surgically changing patient gender are statistically indistinguishable from rates induced by simply paraphrasing inputs, so no special sensitivity to patient gender can be concluded.

27
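The baseline comparison described in the counterfactual-prompting item above can be sketched as a test of whether two flip rates differ. The Python example below uses a pooled two-proportion z-test, which is a standard technique chosen here for illustration; the summary does not say which statistical test the paper uses. The function name and all counts are hypothetical and are not results from the paper.

```python
import math
from typing import Tuple


def two_proportion_z_test(flips_a: int, n_a: int, flips_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    rate_a = flips_a / n_a
    rate_b = flips_b / n_b
    pooled = (flips_a + flips_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both samples flipped never or always, so there is no detectable difference.
        return 0.0, 1.0
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, NOT taken from the paper:
# 48 of 400 predictions flip after a targeted (e.g. gender) edit,
# 44 of 400 predictions flip after a meaning-preserving paraphrase.
z, p_value = two_proportion_z_test(48, 400, 44, 400)
print(f"z = {z:.3f}, p = {p_value:.3f}")

if p_value >= 0.05:
    print("Targeted-edit flip rate is not distinguishable from the paraphrase baseline.")
```

With counts like these, a 12% flip rate from the targeted edit looks notable in isolation but is not distinguishable from the 11% produced by paraphrasing alone, which is the kind of conclusion the summary describes.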