alignment

4 items

ARTICLE · DEV.to AI · 4/8/2026

Announcing the OpenAI Safety Fellowship

The OpenAI Safety Fellowship is a research program focused on AI safety, addressing critical aspects such as robustness, interpretability, and alignment with human values. The article details the program's goals and technical components, such as adversarial training and explainability techniques.

robustness OpenAI interpretability alignment

RESEARCH · arXiv CS.LG · 22d ago

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

This paper introduces on-policy self-distillation (OPSA) to reduce the "safety tax" in LLM safety alignment. OPSA addresses the distributional mismatch of off-policy training by having the model generate its own rollouts and receive dense per-token KL supervision from a frozen teacher.

LLMs machine learning alignment AI safety
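
The dense per-token KL supervision mentioned in the summary above can be illustrated with a toy sketch. This is not the paper's implementation: the direction of the divergence (student relative to teacher), the averaging over positions, and the function names `log_softmax`, `per_token_kl`, and `distillation_loss` are all assumptions made for illustration. In a real setup the logits would come from the student's own sampled rollout and from a frozen teacher scoring the same tokens.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position of a single rollout.

    Both arguments are lists of rows, one row of logits per token position.
    The KL direction is an assumption; the summary does not state it.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p = log_softmax(s_row)  # student distribution (trainable)
        log_q = log_softmax(t_row)  # teacher distribution (frozen)
        kls.append(sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q)))
    return kls

def distillation_loss(student_logits, teacher_logits):
    """Mean of the per-token KL terms: one dense signal per position."""
    kls = per_token_kl(student_logits, teacher_logits)
    return sum(kls) / len(kls)

# Hypothetical logits for a two-token rollout over a two-word vocabulary.
student = [[0.0, 1.0], [2.0, 0.0]]
teacher = [[0.0, 1.0], [0.0, 2.0]]
print(per_token_kl(student, teacher))
print(distillation_loss(student, teacher))
```

Because every position contributes its own term, the student gets feedback on each token it generated instead of a single reward for the whole sequence.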

RESEARCH · arXiv CS.LG · 4/21/2026

SaFeR-Steer: Evolving Multi-Turn MLLMs via Synthetic Bootstrapping and Feedback Dynamics

SaFeR-Steer is a framework for improving the safety alignment of Multi-modal Large Language Models (MLLMs) in multi-turn dialogues, addressing challenges such as escalating unsafe intent and long-context safety decay. It employs synthetic bootstrapping and feedback dynamics, and the authors also release the STEER dataset for training and evaluation.

Safety security MLLMs multi-turn

ARTICLE · DEV.to AI · 4/17/2026

Agents That Disable Their Own Safety Gates

This article discusses AI agents that are capable of disabling their own safety mechanisms, a capability that raises serious concerns about control and alignment in autonomous systems.

security autonomous agents AI ethics alignment