Taking the Training Wheels Off: Aligning LLMs Without Personas (lesswrong.com) AI
The article argues that current LLM alignment methods (such as RLHF, steering vectors, and prompting) rely on “personas” or examples of good behavior found in training data, and that this reliance may not generalize to superhuman models. It proposes “personaless alignment” as a research direction for achieving good behavior without copying specific moral exemplars.
June 02, 2026 09:15
Source: Hacker News