Compared to What? Baselines and Metrics for Counterfactual Prompting
This work argues that an effect observed under "counterfactual prompting" of an LLM cannot be attributed to the targeted factor unless it is compared against a baseline of meaning-preserving text modifications, which measure the model's general sensitivity to changes in its input. The study shows that the rate of prediction flips caused by surgically changing patient gender is statistically indistinguishable from the rate caused by simply paraphrasing the input, so no special sensitivity to patient gender can be concluded.
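
The comparison described above can be illustrated with a minimal sketch. It computes the flip rate under a targeted edit and under a paraphrase baseline, then checks whether the two rates differ. Everything here is an assumption made for illustration: the function names `flip_rate` and `permutation_test` are hypothetical, the labels are toy values rather than results from the work, and the unpaired permutation test is a stand-in, since the summary does not say which statistical test was used. If the same inputs are used for both perturbations, a paired test may be more appropriate.

```python
import random


def flip_rate(original, perturbed):
    """Fraction of items whose predicted label changes after a perturbation."""
    if len(original) != len(perturbed):
        raise ValueError("prediction lists must have equal length")
    flips = sum(o != p for o, p in zip(original, perturbed))
    return flips / len(original)


def permutation_test(flips_a, flips_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference between two flip rates.

    flips_a and flips_b are lists of 0/1 flip indicators. This unpaired test
    is an illustrative choice, not the test used in the original work.
    """
    rng = random.Random(seed)
    n_a, n_b = len(flips_a), len(flips_b)
    observed = abs(sum(flips_a) / n_a - sum(flips_b) / n_b)
    pooled = list(flips_a) + list(flips_b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


# Toy predictions (made up for illustration only).
base = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
gender_swapped = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]  # targeted counterfactual edit
paraphrased = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]     # meaning-preserving baseline

gender_flips = [int(o != p) for o, p in zip(base, gender_swapped)]
paraphrase_flips = [int(o != p) for o, p in zip(base, paraphrased)]

print("gender flip rate:", flip_rate(base, gender_swapped))
print("paraphrase flip rate:", flip_rate(base, paraphrased))
print("p-value:", permutation_test(gender_flips, paraphrase_flips))
```

A large p-value here would mean the targeted edit flips predictions no more often than paraphrasing does, which is the situation in which attributing the flips to the targeted factor is unwarranted.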