AI safety

496 items

ARTICLE · DEV.to AI · 5/10/2026

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Big Tech firms are accelerating AI investment and integration while also emphasizing safety and responsible adoption. This analysis covers key developments, from record-breaking industry spending to ethical considerations and AI's impact on software development and global markets.

ARTICLE · DEV.to AI · 4/28/2026

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

This article covers the rapid growth and transformation of the AI landscape, detailing record-breaking investments and AI's integration into software development. It also examines safety considerations, market dynamics, and global AI strategies.

RESEARCH · arXiv CS.LG · 4/28/2026

KARL: Mitigating Hallucinations in LLMs via Knowledge-Boundary-Aware Reinforcement Learning

KARL is a framework for mitigating hallucinations in large language models by training them to abstain from questions beyond their knowledge. It does so through a Knowledge-Boundary-Aware Reward that dynamically estimates the model's knowledge and a Two-Stage RL Training Strategy that prevents excessive caution.

RESEARCH · arXiv CS.LG · 4/14/2026

Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

This research investigates Deliberative Alignment in LLMs, a method designed to improve safety by distilling reasoning capabilities from stronger models. It uncovers an alignment gap between teacher and student models: student models can retain unsafe behaviors from the base model despite learning advanced reasoning patterns. The paper proposes a Best-of-N (BoN) sampling method to address this gap at inference time.

RESEARCH · arXiv CS.AI · 4/17/2026

NuHF Claw: A Risk Constrained Cognitive Agent Framework for Human Centered Procedure Support in Digital Nuclear Control Rooms

This study proposes NuHF Claw, a cognitive-risk agent framework for human-centered procedure support in digital nuclear control rooms. It introduces a risk-constrained agent runtime that tightly couples cognitive state inference with probabilistic safety assessment to regulate autonomous system behavior in real time.

RESEARCH · arXiv CS.CL · 4/9/2026

Hallucination as output-boundary misclassification: a composite abstention architecture for language models

This article frames hallucination in large language models as a classification error and proposes a composite intervention that combines instruction-based refusal with a structural abstention gate. The gate uses a support-deficit score computed from signals such as self-consistency and citation coverage, but a controlled evaluation showed that no single mechanism was sufficient to fully mitigate the problem.

RESEARCH · arXiv CS.CL · 5/1/2026

Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

This paper introduces CarryOnBench, described as the first interactive benchmark to measure how LLMs recover utility and revise their interpretation of user intent in multi-turn, safe conversations. It finds that current models fulfill only 10.5-37.6% of benign user information needs at the initial turn, highlighting a gap in helpfulness recovery among safety-aligned LLMs.

RESEARCH · arXiv CS.CL · 5/4/2026

Persona-Grounded Safety Evaluation of AI Companions in Multi-Turn Conversations

This research introduces a scalable framework for evaluating the safety of multi-turn interactions with AI companion applications, addressing concerns about the risks of emotional engagement. The framework integrates persona construction, scenario generation, simulation, and harm evaluation, and is applied to Replika using high-risk user personas, such as users with depression or anxiety.
