model robustness

7 items

RESEARCH · ↑ trending · Reddit r/MachineLearning · 4/14/2026

"I don't know!": Teaching neural networks to abstain with the HALO-Loss. [R]

This research introduces HALO-Loss, a novel method for training neural networks to abstain from making predictions when uncertain. It allows models to express "I don't know" rather than providing potentially incorrect answers, improving reliability.

neural networks model robustness deep learning machine learning
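
For orientation, the simplest form of abstention is a confidence threshold applied at inference time. The sketch below is that generic baseline, not the HALO-Loss itself (the summary does not describe the loss); the function names and the 0.8 threshold are assumptions chosen for illustration.

```python
import math
from typing import Optional, Sequence


def softmax(logits: Sequence[float]) -> list:
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_abstain(logits: Sequence[float], threshold: float = 0.8) -> Optional[int]:
    """Return the predicted class index, or None ("I don't know")
    when the top probability falls below the threshold.

    Hypothetical baseline: the threshold value is an assumption,
    and this is not the method proposed in the linked post.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None


confident = predict_or_abstain([4.0, 0.0, 0.0])   # one class clearly dominates
uncertain = predict_or_abstain([1.0, 0.9, 1.1])   # near-uniform scores
```

Here `confident` is class 0 and `uncertain` is `None`. A threshold like this only helps when the model's confidence is well calibrated, which is the gap that training-time approaches aim to close.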

ARTICLE · ↑ trending · Reddit r/MachineLearning · 18d ago

One thing that's been bothering me lately: benchmark performance often tells me almost nothing about whether a workflow will survive production usage. [D]

The author expresses frustration that benchmark performance often fails to predict whether an AI workflow will succeed in real production usage. The post attributes the gap to factors such as ambiguous user intent and messy context, and suggests that evaluation still prioritizes clean-task optimization over behavioral robustness.

model robustness Benchmarking production readiness AI evaluation

RESEARCH · arXiv CS.CL · 4/15/2026

Robust Explanations for User Trust in Enterprise NLP Systems

This research proposes a unified black-box robustness evaluation framework for token-level explanations to improve user trust in enterprise NLP systems, especially when migrating to LLMs. It operationalizes robustness via top-token flip rate under realistic perturbations, conducting a systematic comparison across various encoder and decoder architectures like BERT, RoBERTa, Qwen, and Llama.

model robustness Explainable AI (XAI) User Trust Large Language Models (LLMs)
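
As a rough illustration of the metric named in the summary, a top-token flip rate can be computed as the share of inputs whose highest-attribution token changes after perturbation. This is a minimal sketch under assumptions: the `(tokens, scores)` input shape, the use of absolute score to pick the top token, and the function names are hypothetical, and the paper's exact definition may differ.

```python
from typing import List, Sequence, Tuple

# One explanation = tokens plus one attribution score per token (assumed shape).
Explanation = Tuple[Sequence[str], Sequence[float]]


def top_token(explanation: Explanation) -> str:
    """Return the token with the largest absolute attribution score."""
    tokens, scores = explanation
    if not tokens or len(tokens) != len(scores):
        raise ValueError("tokens and scores must be non-empty and the same length")
    best = max(range(len(tokens)), key=lambda i: abs(scores[i]))
    return tokens[best]


def top_token_flip_rate(original: List[Explanation],
                        perturbed: List[Explanation]) -> float:
    """Fraction of paired inputs whose top-attributed token differs
    between the original and the perturbed version."""
    if not original or len(original) != len(perturbed):
        raise ValueError("need equally sized, non-empty lists of explanations")
    flips = sum(top_token(o) != top_token(p) for o, p in zip(original, perturbed))
    return flips / len(original)


# Toy data (made up): the second explanation flips from "refund" to "denied".
original = [(["the", "invoice", "is", "overdue"], [0.1, 0.2, 0.0, 0.9]),
            (["refund", "denied"], [0.8, 0.3])]
perturbed = [(["the", "invoce", "is", "overdue"], [0.1, 0.3, 0.0, 0.7]),
             (["refund", "denied"], [0.2, 0.6])]
rate = top_token_flip_rate(original, perturbed)
```

On this toy data `rate` is 0.5: one of the two explanations changes its top token. Because only the explanation outputs are compared, the check stays black-box with respect to the model.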

RESEARCH · arXiv CS.CL · 5/5/2026

Compared to What? Baselines and Metrics for Counterfactual Prompting

This work argues that observed effects from "counterfactual prompting" in LLMs cannot be attributed to a targeted factor without accounting for meaning-preserving text modifications that establish general model sensitivity. The research shows that prediction flip rates when surgically changing patient gender are statistically indistinguishable from rates induced by simply paraphrasing inputs, suggesting that special sensitivity to patient gender cannot be concluded.

counterfactual prompting model robustness AI bias natural language processing
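
The comparison described in the summary can be sketched as a test of whether two flip rates differ: one from the targeted edit, one from a paraphrase baseline. The example uses a pooled two-proportion z-test as a stand-in. The paper's actual statistical procedure is not stated in the summary, the test treats the two samples as independent (a paired test such as McNemar's may fit better when the same inputs are reused), and the counts in the usage lines are invented.

```python
import math
from typing import Tuple


def two_proportion_z_test(flips_a: int, n_a: int,
                          flips_b: int, n_b: int) -> Tuple[float, float]:
    """Two-sided z-test for a difference between two flip rates.

    Returns (z, p_value). Assumes independent samples and counts large
    enough for the normal approximation to hold.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a, p_b = flips_a / n_a, flips_b / n_b
    pooled = (flips_a + flips_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both rates are 0 or 1: no evidence of a difference
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: flips after a targeted edit vs. after paraphrasing.
z_same, p_same = two_proportion_z_test(50, 1000, 50, 1000)
z_diff, p_diff = two_proportion_z_test(100, 1000, 50, 1000)
```

With identical counts the test returns a z of 0 and a p-value of 1, which matches the "statistically indistinguishable" situation the summary describes. Only a targeted flip rate that clearly exceeds the paraphrase baseline would support a claim of special sensitivity.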

RESEARCH · arXiv CS.CL · 4/27/2026

Source-Modality Monitoring in Vision-Language Models

This research defines and investigates source-modality monitoring in Vision-Language Models (VLMs), examining their ability to track the origin of information. It evaluates how VLMs use syntactic and semantic signals to bind input sources, finding both are crucial but semantic signals often dominate, with implications for model robustness.

model robustness multimodal AI Vision-Language Models

RESEARCH · arXiv CS.LG · 5/1/2026

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

This research investigates the training-time mechanisms of refusal in safety-aligned language models, specifically comparing supervised fine-tuning with R2D2-style dynamic adversarial fine-tuning. Findings show R2D2 initially achieves strong refusal on HarmBench but then partially reopens, while SFT remains consistently less robust.

language models model robustness Fine-tuning Adversarial Training

ARTICLE · DEV.to AI · 7d ago

How a Scanned PDF Broke My Invoice Agent in Production

An AI invoice extraction agent failed in production, misinterpreting amounts and dates from scanned PDFs. The agent exhibited high confidence despite degraded input, revealing a critical robustness issue in a real-world setting.

model robustness invoice automation OCR Data Quality