LLM safety

3 items

RESEARCH · arXiv CS.CL · 5d ago

Expert-Aware Refusal Steering

This paper extends refusal steering to Mixture-of-Experts (MoE) large language models and finds that the MoE architecture does not hinder steering performance. It proposes expert-aware refusal steering methods that exploit expert routing patterns, and shows that refusal behavior can be steered effectively through a single expert's output.
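As a rough illustration of what steering through a single expert's output could look like, the sketch below adds a scaled "refusal direction" to one expert's output before the gate-weighted combination. This is a generic activation-steering toy under stated assumptions, not the paper's method: the function names (`steer`, `moe_output`), the direction vector, and the strength `alpha` are all hypothetical.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def steer(hidden, direction, alpha):
    """Add alpha times the unit direction to a hidden vector."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]


def moe_output(expert_outputs, gate_weights, steered_expert=None,
               direction=None, alpha=0.0):
    """Combine expert outputs by gate weight, steering only one expert.

    expert_outputs: dict mapping expert index -> output vector (list of floats)
    gate_weights:   dict mapping expert index -> routing weight
    steered_expert: hypothetical choice of which expert to steer
    """
    dim = len(next(iter(expert_outputs.values())))
    combined = [0.0] * dim
    for idx, out in expert_outputs.items():
        if idx == steered_expert and direction is not None:
            out = steer(out, direction, alpha)
        w = gate_weights[idx]
        combined = [c + w * o for c, o in zip(combined, out)]
    return combined
```

In this toy setup the steering effect on the layer output is scaled by the routing weight of the chosen expert, which is one reason routing patterns would matter when picking which expert to steer.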

MoE models · inference · refusal steering · AI alignment

RESEARCH · DEV.to AI · 5/8/2026

Tiny weight edits improve LLM safety

Tiny, targeted weight edits to specific attention heads in LLMs, as demonstrated by the ASGuard method, can drastically reduce the success rate of jailbreaks that rely on linguistic tricks. This surgical approach patches vulnerabilities by dampening activations in the relevant attention heads, preserving overall model competence while significantly improving safety.
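To make "dampening activations in the relevant attention heads" concrete, here is a minimal sketch that scales the outputs of selected heads and leaves the rest untouched. It is an assumption-laden toy, not ASGuard's actual procedure: `dampen_heads`, the head indices, and the scale factor are hypothetical, and the summary does not say how the heads are identified.

```python
def dampen_heads(head_outputs, target_heads, scale):
    """Scale the outputs of selected attention heads.

    head_outputs: list of per-head output vectors (lists of floats)
    target_heads: set of head indices to dampen (hypothetical selection)
    scale:        factor in [0, 1]; 1.0 leaves a head unchanged
    """
    if not 0.0 <= scale <= 1.0:
        raise ValueError("scale must be between 0 and 1")
    return [
        [x * scale for x in out] if i in target_heads else list(out)
        for i, out in enumerate(head_outputs)
    ]
```

Because only the targeted heads are touched, every other head contributes exactly what it did before, which matches the summary's claim that overall competence is maintained.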

AI models · jailbreaking · security · LLM safety

ARTICLE · DEV.to AI · 16d ago

I open-sourced a 4-agent blood-panel triage workflow on heym, with a deterministic Python safety gate that runs BEFORE any LLM token

A four-agent workflow transforms raw blood panels into structured patient-education reports. Its architecture includes a deterministic Python safety gate that runs before any LLM token is generated, so that emergency lab values are handled by fixed rules rather than left to the model.
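A deterministic gate of this kind might look roughly like the sketch below: plain range checks run first, and the LLM is called only if nothing is flagged. Everything here is hypothetical, including the names `safety_gate` and `triage`, the analyte keys, and the numeric limits, which are placeholders for illustration and not clinical guidance or the project's actual rules.

```python
from dataclasses import dataclass, field

# Hypothetical placeholder limits for illustration only; NOT clinical guidance.
CRITICAL_LIMITS = {
    "potassium_mmol_l": (2.5, 6.5),
    "glucose_mg_dl": (40.0, 500.0),
}


@dataclass
class GateResult:
    allow_llm: bool
    reasons: list = field(default_factory=list)


def safety_gate(panel):
    """Check each known analyte against fixed limits; no model is involved."""
    reasons = []
    for name, (low, high) in CRITICAL_LIMITS.items():
        value = panel.get(name)
        if value is None:
            continue  # analyte not present in this panel
        if value < low or value > high:
            reasons.append(f"{name}={value} outside [{low}, {high}]")
    return GateResult(allow_llm=not reasons, reasons=reasons)


def triage(panel, call_llm):
    """Run the gate first; call the LLM only when nothing was flagged."""
    result = safety_gate(panel)
    if not result.allow_llm:
        return "ESCALATE: " + "; ".join(result.reasons)
    return call_llm(panel)
```

The point of the design is ordering: because `triage` returns before `call_llm` is ever invoked, a flagged panel cannot be softened or misread by generated text.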

patient education · deterministic AI · LLM safety · healthcare AI