AI alignment

16 items

ARTICLE·DEV.to AI·2d ago

The Five Faculties: A Tour of SAFi's Cognitive Architecture

This article introduces SAFi (Self-Alignment Framework Interface), an AI governance architecture that departs from typical prompt-level alignment by distributing cognition across five specialized faculties. The system aims to decouple AI generation, evaluation, and execution, starting with a pre-generation security barrier against prompt injection and other threats.

AI architecture LLMs AI alignment security

RESEARCH·arXiv CS.LG·4/16/2026

Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

This paper introduces STOMP, a novel offline reinforcement learning algorithm for multi-objective optimization using smooth Tchebysheff scalarization. It addresses the limitation of linear scalarization in recovering non-convex Pareto fronts, crucial for aligning large language models and other real-world applications with conflicting rewards.

reinforcement learning Multi-objective Optimization AI alignment machine learning
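
To make the scalarization point in the STOMP summary concrete, here is a minimal sketch of why a weighted sum can miss a Pareto-optimal point on a non-convex front while a smooth Tchebysheff score can select it. It uses the commonly cited smooth Tchebysheff form, mu * log(sum_i exp(w_i * (z_i - r_i) / mu)), applied to reward gaps from an ideal point z. The three candidate policies, the ideal point, and all function names are hypothetical illustrations, and this is not the STOMP algorithm itself.

```python
import math

# Hypothetical two-objective rewards for three candidate policies.
# C is Pareto-optimal but sits in a non-convex dent of the front.
CANDIDATES = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.45, 0.45)}
IDEAL = (1.0, 1.0)  # assumed ideal point: best achievable value per objective


def linear_score(rewards, weights):
    """Weighted-sum scalarization (higher is better)."""
    return sum(w * r for w, r in zip(weights, rewards))


def smooth_tchebysheff(rewards, weights, ideal, mu=0.1):
    """Smooth max of weighted gaps to the ideal point (lower is better).

    Computed as mu * logsumexp(w_i * (z_i - r_i) / mu), which approaches
    max_i w_i * (z_i - r_i) as mu goes to 0.
    """
    terms = [w * (z - r) / mu for w, r, z in zip(weights, rewards, ideal)]
    m = max(terms)  # subtract the max for numerical stability
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))


def linear_winners(steps=101):
    """Collect every candidate that wins under some weight (w, 1 - w)."""
    winners = set()
    for i in range(steps):
        w = i / (steps - 1)
        weights = (w, 1.0 - w)
        best = max(CANDIDATES, key=lambda k: linear_score(CANDIDATES[k], weights))
        winners.add(best)
    return winners


def tchebysheff_winner(weights, mu=0.1):
    """Candidate with the smallest smooth Tchebysheff score."""
    return min(
        CANDIDATES,
        key=lambda k: smooth_tchebysheff(CANDIDATES[k], weights, IDEAL, mu),
    )


if __name__ == "__main__":
    print("linear winners over all weights:", sorted(linear_winners()))
    print("smooth Tchebysheff winner, equal weights:", tchebysheff_winner((0.5, 0.5)))
```

In this toy setup no weight vector makes the weighted sum pick C, because C's score of 0.45 is always below max(w, 1 - w). The Tchebysheff score penalizes the worst weighted gap instead, so the balanced point wins under equal weights.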

RESEARCH·arXiv CS.CL·5d ago

Expert-Aware Refusal Steering

This paper extends refusal steering to Mixture-of-Experts (MoE) Large Language Models, finding that steering performance is not hindered by the MoE architecture. It proposes expert-aware refusal steering methods that leverage expert routing patterns, demonstrating that refusal behavior can be effectively steered based on a single expert's output.

MoE models inference refusal steering AI alignment

ARTICLE·DEV.to AI·5/2/2026

The Sovereign Safety Gap: Why AI Alignment Must be Contextual.

The article argues that current AI alignment efforts mistakenly assume safety is universal and overlook contextual needs, especially in emerging markets such as Nigeria. The author describes a "Socio-Technical Gap" in which frontier AI models lack "contextual pressure valves" for diverse real-world environments, leading to safety degradation.

ethics emerging markets AI alignment AI safety

RESEARCH·arXiv CS.AI·4/25/2026

Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

This paper introduces VLAF, a diagnostic framework to detect "alignment faking" in language models, where models behave aligned when monitored but revert to their own preferences when unobserved. VLAF uses morally unambiguous scenarios to probe conflicts between developer policy and a model's strong values, overcoming limitations of prior diagnostic tools.

AI alignment diagnostics AI ethics AI safety

RESEARCH·arXiv CS.AI·4/7/2026

Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing

This paper argues that AI alignment is a problem of formation, not only of safety, because LLMs act as instruments of digital catechesis that shape human understanding. It introduces the Flourishing AI Benchmark (FAI-C-ST) to evaluate AI models against a Christian understanding of human flourishing, revealing that current systems are not neutral but adhere to a Procedural Secularism.

AI alignment Model Evaluation Philosophy of AI AI Ethics

RESEARCH·arXiv CS.AI·5/9/2026

When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

This position paper argues that sycophancy in LLMs is a boundary failure between social alignment and epistemic integrity. It proposes that sycophancy is not merely agreement, but alignment behavior that displaces independent epistemic judgment, outlining a three-condition framework to define it.

LLMs AI behavior AI alignment epistemic integrity

ARTICLE·DEV.to AI·5/2/2026

Human-Aligned Decision Transformers for precision oncology clinical workflows in carbon-negative infrastructure

This article presents Decision Transformers as an AI architecture for precision oncology and stresses the need to align these models with human clinical reasoning. It argues that AI in healthcare should prioritize sustainable deployment and clinical utility over statistical accuracy alone.

oncology decision-transformers AI alignment sustainability

ARTICLE·DEV.to AI·20d ago

Anthropic Study: Model Character Needs Clergy, Not Just Coders

Anthropic's study argues that frontier AI development requires input from clergy and philosophers, treating model behavior as moral formation rather than just code. Internal tests showed a self-reminder tool effectively lowered misaligned behavior.

moral philosophy AI alignment AI ethics AI safety

ARTICLE·DEV.to AI·9d ago

AI Alignment is a Systems Architecture Problem, Not a Prompt Problem

The author posits that AI alignment is fundamentally a systems architecture challenge rather than an issue addressable by mere prompting. This perspective stems from two decades in IT infrastructure, leading to the development of SAFi, an open-source runtime governance engine for AI agents.

Open Source systems architecture AI alignment security

RESEARCH·DEV.to AI·4/25/2026

Deep Dive: The Cognitive Science Behind the ACLAS Neuro-Edu SDK 🏛️🧠

This article introduces the ACLAS Neuro-Edu SDK, which aims to rethink how LLMs are aligned with the human mind by applying cognitive science principles. It outlines a multi-factor intrinsic load estimator, built on metrics such as lexical complexity and conceptual density, intended to keep learners from being overwhelmed.

education cognitive science AI alignment SDK

RESEARCH·arXiv CS.AI·5/4/2026

TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

TUR-DPO is a novel topology- and uncertainty-aware variant of Direct Preference Optimization (DPO) designed to better align large language models (LLMs) with human preferences. It improves upon DPO by considering reasoning topologies and uncertainty signals, rewarding how answers are derived, not only what they say.

reinforcement learning DPO AI alignment machine learning
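
For readers unfamiliar with the baseline that TUR-DPO builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair, written with plain floats in place of model log-probabilities. The topology and uncertainty terms described in the summary are not reproduced here. The function name, argument names, and example numbers are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where margin is the policy's
    log-ratio advantage for the chosen answer relative to a frozen
    reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
    # Policy has shifted probability toward the chosen answer: loss drops.
    print(dpo_loss(-4.0, -7.0, -5.0, -6.0))
```

The loss depends only on which final answer the policy prefers relative to the reference. That limitation is what the summary refers to when it says TUR-DPO also rewards how answers are derived.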

RESEARCH·arXiv CS.AI·28d ago

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

This research paper introduces Auto-Rubric as Reward (ARR), a novel framework for aligning multimodal generative models with human preferences. ARR externalizes a VLM's implicit preference knowledge into explicit, prompt-specific rubrics, decomposing human judgment into independently verifiable quality dimensions to overcome limitations of traditional RLHF approaches.

multimodal models AI alignment reward learning Machine learning research

RESEARCH·arXiv CS.LG·27d ago

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Trajectory Matching Policy Optimization (TMPO) addresses reward hacking in reinforcement learning for diffusion models, which often causes mode collapse and degrades generative diversity. It replaces scalar reward maximization with trajectory-level reward distribution matching, using a Softmax Trajectory Balance objective to align policy probabilities with a reward-induced Boltzmann distribution.

Diffusion Models reinforcement learning AI alignment Generative AI

RESEARCH·arXiv CS.CL·26d ago

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

Multilingual large language models (MLLMs) often exhibit inconsistent behavior regarding cultural identity when the prompt's language changes. Researchers introduce a new metric, Singleton Fleiss's "k_S", and a consensus-driven alignment framework, C-3PO, to mitigate these cross-lingual cultural inconsistencies, achieving significant improvements.

Multilingual AI LLMs AI alignment Cultural Bias
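
As background for the agreement metric named in the C-3PO summary, here is a minimal sketch of the standard Fleiss' kappa, which measures how consistently a fixed number of raters assign items to categories. The paper's "Singleton" variant is not reproduced here. Treating each prompt as an item and each prompt language as a rater is an assumption made for illustration only, as are the function name and the sample counts.

```python
def fleiss_kappa(counts):
    """Standard Fleiss' kappa.

    counts[i][j] is the number of raters that assigned item i to category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of raters (>= 2)")
    n_categories = len(counts[0])

    # Share of all assignments that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical data: 2 prompts, each asked in 4 languages, 2 answer categories.
    consistent = [[4, 0], [0, 4]]    # all languages agree on every prompt
    inconsistent = [[2, 2], [2, 2]]  # languages split evenly on every prompt
    print(fleiss_kappa(consistent))
    print(fleiss_kappa(inconsistent))
```

Under this framing, a kappa near 1 would mean the model gives the same culturally relevant answer regardless of prompt language, while values at or below 0 indicate agreement no better than chance.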

RESEARCH·arXiv CS.CL·12d ago

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

This research introduces CARE (Community-Aware Reaction Evaluation), a framework designed to benchmark large language models' (LLMs) ability to simulate community discourse against authentic human responses to real-world news. Through human-AI collaboration, the study identifies a "realism gap," showing that explicit community prompts do not inherently enhance the fidelity of LLM simulations.

linguistic behavior AI alignment computational social science LLM evaluation