News
The persona selection model — LessWrong
1+ hour, 35+ min ago (1636+ words) The behavior of the resulting AI assistant can then be understood largely via the traits of the Assistant persona. This general idea is not unique to us. Our goal in this post is to articulate and name the idea, discuss…
Abstract/Concrete Axis: Robustness Testing Semantic Feature Separation in Gemma 3 270M — LessWrong
4+ hour, 25+ min ago (618+ words) The most important limitation that was raised was that all prompts shared a similar hand-crafted structure. A natural question: is this separation a property of the concepts themselves, or of my prompt templates? This post attempts to answer that. I'd…
AGI is Here — LessWrong
3+ day, 9+ hour ago (279+ words) I'm somewhat hesitant to write this post because I worry its central claim will be misconstrued, but I think it's important to say now, so I'm writing it anyway. Claude Opus 4.6 was released on February 5th. GPT-5.3 came out the same…
"Recursive Self-Improvement" Is Three Different Things — LessWrong
1+ week, 6+ day ago (446+ words) Treating them as one thing produces confused models and bad intuitions. So I want to pull them apart explicitly. This is the one that's already happening and empirically observable with coding agents. The mechanism: better orchestration → better task decomposition → better…
Distributed vs centralized agents — LessWrong
2+ week, 4+ hour ago (213+ words) Much of my thinking over the last year has focused on understanding the concept of "distributed agents", as opposed to the "centralized agents" that the existing paradigm of rational agency describes. Roughly speaking, centralized agents are more efficient (as sometimes…
Playing with an Infrared Camera — LessWrong
2+ week, 2+ day ago (378+ words) I recently got a Thermal Master P1 infrared camera attachment for my phone. The goal was a house project, but it's also a great toy, especially with the kids. Getting a room pitch black but still being able to 'see' with…
What did we learn from the AI Village in 2025? — LessWrong
2+ week, 6+ day ago (697+ words) Standard AI benchmarks test narrow capabilities in controlled settings. They tell us whether a model can solve a coding problem or answer a factual question. They don't tell us what happens when you give an AI agent a computer, internet…
Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits — LessWrong
3+ week, 3+ hour ago (1486+ words) Many thanks to Michael Hanna and Joshua Batson for useful feedback and discussion. Kat Dearstyne and Kamal Maher conducted experiments during the SPAR Fall 2025 Cohort. We demonstrate the "feature skipping" behavior of CLTs on a toy model where ground-truth features…
“Features” aren’t always the true computational primitives of a model, but that might be fine anyways — LessWrong
3+ week, 5+ hour ago (448+ words) Probably the most debated core concept in mechanistic interpretability is that of the "feature": common questions include "are there non-linear features, and does this mean that the linear representation hypothesis is false?", "do SAEs recover a canonical set of features,…
Whence unchangeable values? — LessWrong
3+ week, 1+ day ago (239+ words) Published on February 1, 2026 3:49 AM GMT. Some values don't change. Maybe sometimes that's because a system isn't "goal seeking." For example, AlphaZero doesn't change its value of "board-state = win." (Thankfully! Because if that changed to "board-state = not lose," then a reasonable instrumental…