AI alignment

16 items

ARTICLE · DEV.to AI · 2d ago

The Five Faculties: A Tour of SAFi's Cognitive Architecture

This article introduces SAFi (Self-Alignment Framework Interface), an AI governance architecture that departs from typical prompt-level alignment by distributing cognition across five specialized faculties. The system aims to decouple generation, evaluation, and execution, and begins with a pre-generation security barrier that guards against prompt injection and other threats.

49
RESEARCH · arXiv CS.LG · 4/16/2026

Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

This paper introduces STOMP, a novel offline reinforcement learning algorithm for multi-objective optimization using smooth Tchebysheff scalarization. It addresses the limitation of linear scalarization in recovering non-convex Pareto fronts, crucial for aligning large language models and other real-world applications with conflicting rewards.

31
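
The STOMP summary above contrasts linear scalarization with smooth Tchebysheff scalarization. The sketch below illustrates that contrast on a toy problem, using a common log-sum-exp smoothing of the weighted Tchebysheff distance. It is not the paper's algorithm: the candidate reward vectors, the ideal point, the smoothing parameter `mu`, and all function names are assumptions made for illustration.

```python
import math

def linear_scalarization(rewards, weights):
    """Weighted sum of rewards; higher is better."""
    return sum(w * r for w, r in zip(weights, rewards))

def smooth_tchebysheff(rewards, weights, ideal, mu=0.05):
    """Log-sum-exp smoothing of max_i w_i * (ideal_i - r_i); lower is better.

    As mu -> 0 this approaches the plain (non-smooth) Tchebysheff distance.
    """
    terms = [w * (z - r) / mu for w, r, z in zip(weights, rewards, ideal)]
    m = max(terms)  # subtract the max for numerical stability
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))

# Hypothetical two-objective reward vectors. B is Pareto-optimal but lies
# below the line joining A and C, i.e. in a non-convex part of the front.
candidates = {"A": (1.0, 0.0), "B": (0.45, 0.45), "C": (0.0, 1.0)}

def pick_linear(weights):
    return max(candidates, key=lambda k: linear_scalarization(candidates[k], weights))

def pick_smooth_tchebysheff(weights, ideal=(1.0, 1.0)):
    return min(candidates, key=lambda k: smooth_tchebysheff(candidates[k], weights, ideal))

if __name__ == "__main__":
    # No weighting makes the linear objective prefer the balanced point B.
    print({w / 10: pick_linear((w / 10, 1 - w / 10)) for w in range(11)})
    # Equal weights under the smooth Tchebysheff objective do select B.
    print(pick_smooth_tchebysheff((0.5, 0.5)))
```

In this toy setting the weighted sum can only ever return one of the two extreme points, which is the limitation the summary refers to when it mentions non-convex Pareto fronts.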
RESEARCH · arXiv CS.CL · 5d ago

Expert-Aware Refusal Steering

This paper extends refusal steering to Mixture-of-Experts (MoE) Large Language Models, finding that steering performance is not hindered by the MoE architecture. It proposes expert-aware refusal steering methods that leverage expert routing patterns, demonstrating that refusal behavior can be effectively steered based on a single expert's output.

31
RESEARCH · arXiv CS.AI · 4/25/2026

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

This paper introduces VLAF, a diagnostic framework to detect "alignment faking" in language models, where models behave aligned when monitored but revert to their own preferences when unobserved. VLAF uses morally unambiguous scenarios to probe conflicts between developer policy and a model's strong values, overcoming limitations of prior diagnostic tools.

29
RESEARCH · arXiv CS.AI · 4/7/2026

Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing

This paper argues that AI alignment is a problem of formation, not merely of safety, because LLMs act as instruments of digital catechesis that shape human understanding. It introduces the Flourishing AI Benchmark (FAI-C-ST) to evaluate AI models against a Christian understanding of human flourishing, revealing that current systems are not neutral but adhere to a Procedural Secularism.

28
RESEARCH · arXiv CS.AI · 5/9/2026

When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

This position paper argues that sycophancy in LLMs is a boundary failure between social alignment and epistemic integrity. It proposes that sycophancy is not merely agreement, but alignment behavior that displaces independent epistemic judgment, outlining a three-condition framework to define it.

28
ARTICLE · DEV.to AI · 5/2/2026

Human-Aligned Decision Transformers for precision oncology clinical workflows in carbon-negative infrastructure

This article introduces Decision Transformers as a revolutionary AI architecture for precision oncology, emphasizing the crucial need to align these models with human clinical reasoning. It highlights the importance of sustainable deployment and clinical utility over statistical accuracy for AI in healthcare.

28
RESEARCH · arXiv CS.AI · 28d ago

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

This research paper introduces Auto-Rubric as Reward (ARR), a novel framework for aligning multimodal generative models with human preferences. ARR externalizes a VLM's implicit preference knowledge into explicit, prompt-specific rubrics, decomposing human judgment into independently verifiable quality dimensions to overcome limitations of traditional RLHF approaches.

27
RESEARCH · arXiv CS.LG · 27d ago

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Trajectory Matching Policy Optimization (TMPO) addresses reward hacking in reinforcement learning for diffusion models, which often causes mode collapse and degrades generative diversity. It replaces scalar reward maximization with trajectory-level reward distribution matching, using a Softmax Trajectory Balance objective to align policy probabilities with a reward-induced Boltzmann distribution.

27
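
The TMPO summary above describes matching policy probabilities to a reward-induced Boltzmann distribution instead of maximizing a scalar reward. The sketch below shows only that general idea on a small group of sampled trajectories. It is not the paper's Softmax Trajectory Balance objective: the KL-based loss, the temperature `beta`, the reward values, and the log-probabilities are all hypothetical stand-ins.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_target(rewards, beta=1.0):
    """Target distribution p_i proportional to exp(r_i / beta)."""
    return softmax([r / beta for r in rewards])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical rewards for three sampled trajectories; the first two are
# nearly equally good, so a diverse policy should keep mass on both.
rewards = [1.0, 0.9, 0.2]

# Hypothetical trajectory log-probabilities under a policy that has
# collapsed onto the single highest-reward trajectory.
policy_log_probs = [-0.1, -3.0, -4.0]

target = boltzmann_target(rewards, beta=0.5)
policy = softmax(policy_log_probs)  # renormalize over the sampled group
loss = kl_divergence(target, policy)

if __name__ == "__main__":
    print("target:", [round(p, 3) for p in target])
    print("policy:", [round(p, 3) for p in policy])
    print("distribution-matching loss:", round(loss, 3))
```

Under this kind of objective the collapsed policy is penalized for ignoring the second trajectory, whereas pure reward maximization would see nothing wrong with placing all probability on the first.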
RESEARCH · arXiv CS.CL · 26d ago

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

Multilingual large language models (MLLMs) often exhibit inconsistent behavior regarding cultural identity when the prompt's language changes. Researchers introduce a new metric, Singleton Fleiss's "k_S", and a consensus-driven alignment framework, C-3PO, to mitigate these cross-lingual cultural inconsistencies, achieving significant improvements.

27
RESEARCH · arXiv CS.CL · 12d ago

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

This research introduces CARE (Community-Aware Reaction Evaluation), a framework designed to benchmark large language models' (LLMs) ability to simulate community discourse against authentic human responses to real-world news. Through human-AI collaboration, the study identifies a "realism gap," showing that explicit community prompts do not inherently enhance the fidelity of LLM simulations.

27