Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate (arxiv.org) AI

The paper proposes a two-stage fine-tuning method that distills multi-agent debate into a single LLM. The resulting model matches or exceeds explicit debate performance while using up to 93% fewer tokens, and the authors use activation steering to analyze how the debate is internalized. They also report that internalized models make harmful behaviors easier to localize and suppress: after malicious agents are instilled, negative steering suppresses the behavior with a smaller reduction in general performance than steering the base model does.
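For readers unfamiliar with activation steering, the generic idea can be sketched as adding a scaled direction vector to a hidden state, with a negative scale pushing the state away from that direction. This is a toy illustration under stated assumptions, not the paper's procedure: the function names `steer` and `projection`, the three-dimensional vectors, and the value of `alpha` are all hypothetical.

```python
# Toy sketch of activation steering on a single hidden-state vector.
# All names and numbers are hypothetical; this is not the paper's code.

def _norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return sum(x * x for x in vec) ** 0.5

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction).

    alpha > 0 amplifies the direction; alpha < 0 suppresses it
    ("negative steering").
    """
    length = _norm(direction)
    if length == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / length for d in direction]
    return [h + alpha * u for h, u in zip(hidden, unit)]

def projection(hidden, direction):
    """Scalar component of hidden along direction."""
    return sum(h * d for h, d in zip(hidden, direction)) / _norm(direction)

# Hypothetical hidden state and a hypothetical "behavior" direction.
hidden = [0.5, 2.0, -1.0]
direction = [0.0, 1.0, 0.0]

# Negative steering: reduce the component along the behavior direction.
steered = steer(hidden, direction, alpha=-1.5)

before = projection(hidden, direction)
after = projection(steered, direction)
```

In a real model the same update would typically be applied to a layer's activations during the forward pass, and the direction would be estimated from data rather than chosen by hand.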

June 04, 2026 23:40 Source: Hacker News