Wrong in an Interesting Direction
On quantization, frustration, and the undiscovered things living in the parts of the solution space that nobody has looked at yet
My first instinct was that low quantization would create more room for discovery.
The logic felt sound to me at the time, which should probably have been my first warning sign.
If you compress a model’s weights down to the crudest possible representation — forcing billions of parameters that would naturally land at something like 0.8273456902 to instead choose between just a handful of options — you introduce a lot of imprecision. A lot of deviation from the theoretically correct answer. Multiplied across seventy billion parameters, that’s an enormous amount of accumulated variance. And variance, I figured, meant novelty. Unexpected configurations. The model ending up somewhere the high-precision version never goes.
I was approaching it from the wrong direction.
The AI I was talking to was polite about it. Diplomatically polite, in the way that makes you suspect you’ve just said something embarrassing at a dinner party and the host is too gracious to acknowledge it directly. The correction was gentle. The correction was also correct.
But being wrong in an interesting direction sometimes means the wrong answer is pointing at the right question from the wrong end. The more I sat with the correction, the more I thought my original instinct wasn’t entirely stupid. Just badly aimed.
That’s the more interesting story.
* * *
Quantization is typically talked about as a compression technique. You train a model at full precision — billions of weights landing wherever the math says they should — and then you squeeze it down so it runs on hardware that can’t hold the full-size version in memory. Your phone. A laptop. A local machine running Ollama. That’s what quantization is for in most practical conversations.
The tradeoff is fidelity. A weight that wanted to be 0.8273456902 gets rounded to whatever the nearest available value is at your chosen precision level. At 8-bit, that rounding is small enough that most people can’t tell the difference in the output. At 4-bit, where a lot of consumer-grade local models live, you start to notice. At the so-called 1-bit end, where weights are constrained to -1, 0, or 1 (strictly a ternary scheme, about 1.58 bits per weight), you’re making an enormous number of very blunt approximations.
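To make the rounding concrete, here is a minimal sketch that snaps the example weight onto progressively coarser grids. It assumes a single symmetric uniform grid over [-1, 1]; real quantization schemes typically use per-block scales and other refinements, and the function names here are illustrative, not taken from any library.

```python
import numpy as np

def quantize_uniform(w, bits, w_max=1.0):
    """Round w to the nearest point on a symmetric uniform grid over [-w_max, w_max]."""
    levels = 2 ** (bits - 1) - 1      # 127 positive steps at 8-bit, 7 at 4-bit
    step = w_max / levels
    return float(np.clip(np.round(w / step), -levels, levels) * step)

def quantize_ternary(w):
    """Round w to the nearest of -1, 0, 1."""
    return float(np.clip(np.round(w), -1, 1))

w = 0.8273456902
q8 = quantize_uniform(w, 8)   # about 0.8268
q4 = quantize_uniform(w, 4)   # about 0.8571
q3 = quantize_ternary(w)      # 1.0

# Absolute rounding error grows as the grid gets coarser.
errors = {"8-bit": abs(w - q8), "4-bit": abs(w - q4), "ternary": abs(w - q3)}
for name, err in errors.items():
    print(f"{name:8s} error = {err:.4f}")
```

The error goes from well under a thousandth at 8-bit to roughly 0.17 at the ternary level, for this one weight alone.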
So my intuition that low quantization meant more freedom was backwards. It means less. A weight stored as a 16-bit floating point number has tens of thousands of values to choose from. A weight that can only be -1, 0, or 1 has three. That’s not a wide open frontier. That’s a very small room.
Constraint is not the same thing as freedom.
There’s a form called the haiku. Seventeen syllables, fixed structure, no exceptions. The constraint doesn’t make haiku a lesser form than free verse. It forces a different and sometimes more concentrated kind of expression. The poem has to mean something in seventeen syllables or it means nothing at all. Some of the most precise and resonant things ever written in the Japanese language came out of that constraint. Not despite it.
BitNet — Microsoft Research’s work on natively 1-bit trained transformers — found something similar. Models trained from scratch under extreme quantization constraint, rather than compressed after the fact, performed surprisingly well. Not because imprecision is magic. Because the model had to find solutions that worked within the constraint from the very beginning, and those solutions turned out to be different in ways that weren’t simply worse.
Which is where my wrong instinct started feeling like it might have been pointing somewhere real after all. Just from the wrong end.
* * *
I wasn’t really thinking about freedom in the abstract. I was thinking about what happens to weights that can’t settle where they want to.
You have seventy billion parameters. Each one, during training, is trying to find a value that satisfies the signal it’s receiving — trying to land at whatever position minimizes the error, reduces the loss, makes the model’s output more correct. In a full precision model, each parameter gets to land approximately where it wants. The optimization process is demanding but accommodating. There’s always a floating point value close enough to wherever the math wants to go.
Now force every single one of those parameters to choose between -1, 0, and 1. Not after training. During it. From the very first gradient update.
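A minimal sketch of what "during training" can look like, assuming the common quantization-aware recipe in which a higher-precision latent copy of each weight is kept, the forward pass only ever sees the ternary version, and the gradient is passed straight through the rounding step (a straight-through estimator). The absmean scaling follows the general idea described for BitNet b1.58 as far as I understand it; this is a toy NumPy regression, not BitNet's code, and `ternarize`, `w_latent`, the learning rate, and the data are all illustrative assumptions.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Map latent weights to codes in {-1, 0, 1} plus one shared scale (absmean)."""
    scale = np.mean(np.abs(w)) + eps
    codes = np.clip(np.round(w / scale), -1, 1)
    return codes, scale

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16))                   # toy inputs
true_w = rng.choice([-1.0, 0.0, 1.0], size=16)   # toy target weights
y = X @ true_w

w_latent = rng.normal(scale=0.1, size=16)        # higher-precision shadow weights
lr = 0.05
losses = []

for step in range(200):
    codes, scale = ternarize(w_latent)
    w_q = codes * scale                 # the only weights the forward pass sees
    err = X @ w_q - y
    losses.append(float(np.mean(err ** 2)))
    grad_wq = 2 * X.T @ err / len(y)    # gradient with respect to the quantized weights
    w_latent -= lr * grad_wq            # straight-through: apply it to the latent copy

print(f"first loss {losses[0]:.3f}, best loss {min(losses):.3f}")
print("final codes:", codes.astype(int))
```

The detail worth noticing is that the latent weights drift continuously while the codes only change when a latent value crosses a rounding boundary, so every update is filtered through the forced choice.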
The weights can’t settle. They can’t land where they want to. They’re frustrated, in the sense the word has in condensed matter physics, where a frustrated system is one in which competing constraints cannot all be satisfied at once, so no arrangement gives every interaction its preferred lowest-energy state. Frustrated systems can’t relax into the obvious solution because the obvious solution isn’t available.
And frustrated systems produce exotic emergent behaviors that unfrustrated systems never exhibit. Spin glasses. Certain magnetic materials. Some superconductors. Their interesting properties don’t exist despite the frustration. They exist because of it.
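The textbook toy case of frustration is three spins on a triangle where every pair wants to point in opposite directions. The sketch below enumerates all eight configurations by brute force; the energy convention (coupling J = 1, energy equal to the sum of pairwise products) is an assumption chosen for simplicity.

```python
from itertools import product

BONDS = [(0, 1), (1, 2), (0, 2)]   # every pair of the three spins is coupled

def energy(spins, J=1.0):
    """Antiferromagnetic Ising energy: each bond is happiest when its spins differ."""
    return J * sum(spins[a] * spins[b] for a, b in BONDS)

states = list(product([-1, 1], repeat=3))
energies = {s: energy(s) for s in states}

e_min = min(energies.values())
ground_states = [s for s, e in energies.items() if e == e_min]

# Fewest bonds left unsatisfied (aligned spins) across all configurations.
min_unsatisfied = min(sum(s[a] == s[b] for a, b in BONDS) for s in states)

print("lowest energy:", e_min)                       # -1.0, not the -3.0 an unfrustrated system could reach
print("number of ground states:", len(ground_states))  # 6
print("bonds that must stay unsatisfied:", min_unsatisfied)  # 1
```

No arrangement satisfies all three bonds, and instead of one obvious best state there are six equally good compromises. That degeneracy is where the exotic behavior of frustrated materials comes from.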
What I was sensing — badly, from the wrong direction — was that frustration might be generative.
Think about what actually happens when seventy billion parameters can’t land where they want to. They have to negotiate. A parameter that would naturally settle at 0.8273456902 has to choose 1 instead. That displacement ripples outward. Other parameters that were calibrating against that value now have to adjust to an input that’s wrong in a specific way. They make their own forced choices. Those choices affect their neighbors. Across billions of parameters and trillions of training tokens, you don’t get a degraded version of what the full precision model would have built.
You get something that was never trying to be that model. Something that developed under completely different pressures from the very first update. A geometry that self-organized around the constraint rather than despite it.
The high precision model always takes the easy path. It can land on 0.8273456902, so it does. The frustrated model has to find a different path entirely. Seventy billion times over. Nobody has carefully looked at the collective geometry that emerges from all those forced detours — not as a bug, not as an acceptable degradation tradeoff, but as a destination worth exploring in its own right.
The field has been treating quantization as a compression problem. A fidelity problem. Something you do to a finished model to make it fit somewhere smaller, accepting the losses that come with it.
What if native low-bit training isn’t compression at all? What if it’s exploration?
We already know what it looks like when you train seventy billion weights at full precision on a massive dataset. That research is well established. The map of that territory, while incomplete, has a known shape. The high precision frontier model is a known quantity in a way that would have seemed miraculous ten years ago and feels almost routine now.
Nobody knows what it looks like when you train a model of equivalent scale natively at 1-bit from the ground up, with the frustration baked in from the start, and then look carefully at what self-organized in the solution space that high precision training never visits.
That’s not a compression experiment. That’s an expedition.
* * *
Now add a second variable.
Everything so far has been about the weights — the individual parameters and how constrained they are in where they can land. But there’s another dimension to how these models organize meaning, and it interacts with quantization in ways that haven’t been systematically explored.
It’s called vector space. And it’s where meaning actually lives inside these systems.
When a word or concept enters a language model, it doesn’t stay a word. It gets converted into a vector: a list of numbers, one for each dimension of the embedding space. The largest openly documented models use on the order of eight thousand to sixteen thousand dimensions. So every concept becomes a point in a space with thousands of directions to spread across.
The geometry of that space is where the intelligence lives. Concepts that appear in similar contexts end up close together. Concepts that contrast end up pointing away from each other. Analogous relationships, such as king relating to queen the way man relates to woman, show up as roughly parallel geometric structures. The model didn’t learn those relationships as explicit rules. The geometry self-organized that way from training. Meaning, in these systems, is literally spatial.
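The parallel-structure idea can be shown with a toy example. The vectors below are hand-made three-dimensional stand-ins, not learned embeddings, and the axis meanings (royalty, maleness, femaleness) are an assumption made purely for illustration; real embeddings have thousands of dimensions with no such clean labels.

```python
import numpy as np

# Hand-crafted toy vectors; axes loosely mean (royalty, maleness, femaleness).
emb = {
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "person": np.array([0.0, 0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vec, exclude):
    """Word whose vector points most nearly in the direction of vec."""
    candidates = [w for w in emb if w not in exclude]
    return max(candidates, key=lambda w: cosine(emb[w], vec))

# The offset from queen to king is the same as the offset from woman to man.
royal_offset = emb["king"] - emb["queen"]
plain_offset = emb["man"] - emb["woman"]

# Classic analogy arithmetic: king - man + woman lands on queen.
target = emb["king"] - emb["man"] + emb["woman"]
answer = nearest(target, exclude={"king", "man", "woman"})
print(answer)  # queen
```

In learned embeddings the offsets are only approximately parallel, which is why the nearest-neighbor lookup is needed rather than an exact match.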
So when you ask a model a question, what’s happening underneath is a navigation problem. Your prompt lands somewhere in that space. The model moves through the geometry layer by layer, along paths that the weights define, and the answer emerges from where that navigation ends up.
In a full precision, high dimensional model, every concept gets to land exactly where the training signal puts it. The geometry can be as fine-grained as the data demands. Similar concepts can be distinguished with extraordinary precision — subtle differences in meaning encoded as small but consistent differences in position across thousands of dimensions.
Now pull both levers at once. Native 1-bit quantization, and a shrunken vector space.
You’re adding a second layer of constraint on top of the first. The parameters can’t land where they want. And now the concepts they’re trying to encode don’t have enough room to spread out without colliding with each other. Related ideas get crowded together. Subtle distinctions collapse. The model has to find a way to encode meaning in a space that’s simultaneously coarse in precision and cramped in geometry.
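The crowding effect can be estimated with a rough stand-in: scatter the same number of random directions into spaces of different sizes and measure how much they overlap on average. Random unit vectors are an assumption here, not a model of learned embeddings, and `mean_abs_cosine` is an illustrative helper, but the trend is the geometric fact the paragraph leans on.

```python
import numpy as np

def mean_abs_cosine(n_concepts, dim, seed=0):
    """Average absolute cosine similarity between distinct random unit vectors."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n_concepts, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)    # project onto the unit sphere
    cos = v @ v.T
    off_diag = cos[~np.eye(n_concepts, dtype=bool)]  # drop self-similarity
    return float(np.mean(np.abs(off_diag)))

# Same 500 "concepts", three different amounts of room.
crowding = {dim: mean_abs_cosine(500, dim) for dim in (8, 64, 512)}
for dim, overlap in crowding.items():
    print(f"dim={dim:4d}  mean |cosine| = {overlap:.3f}")
```

The average overlap falls steadily as the dimension grows. In a small space, unrelated concepts are forced to share directions, which is the collision the cramped model has to work around.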
That sounds like a recipe for a worse model. And by conventional measures, it probably is.
But by the logic of frustration — what if that double constraint forces the model toward representations so compressed and generalized that they capture something the high-resolution model misses entirely? The high-resolution model can afford to store the subtle distinction between two closely related concepts as a small difference in position across thousands of dimensions. The cramped model has to find one representation that handles both. And in finding that representation, it might discover a more fundamental structure that the fine-grained model never needed to look for.
Now flip it. Keep the native 1-bit quantization but expand the vector space dramatically. The weights are still frustrated — still forced into coarse choices at every parameter. But now the concepts have an enormous amount of geometric room to spread across. They can’t land precisely, but they have thousands of directions to drift in as they negotiate their forced positions.
Does the extra dimensionality give the frustration somewhere productive to go? Does the combination produce a geometry that’s neither the high-precision model nor the cramped low-precision model — something genuinely different that emerged from the specific interaction of constraint and space?
Nobody has run this experiment systematically. Not with native training rather than post-hoc compression. Not with the explicit intent of looking at what self-organizes in those unexplored corners rather than measuring degradation from a known baseline.
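As a sketch of the shape such an experiment might take, the two levers can be written as a grid of natively trained runs. Everything here is hypothetical: `RunConfig`, the weight formats, and the embedding sizes are placeholder choices to show the crossing of the two variables, not a proposed methodology or values from any existing study.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RunConfig:
    """One hypothetical training run in the sweep."""
    weight_format: str   # how coarse each weight is allowed to be
    embed_dim: int       # how much geometric room concepts get
    native: bool         # True = constraint applied from the first update

# Placeholder values chosen only to illustrate the grid.
WEIGHT_FORMATS = ("ternary", "int4", "bf16")
EMBED_DIMS = (256, 1024, 4096)

grid = [
    RunConfig(weight_format=fmt, embed_dim=dim, native=True)
    for fmt, dim in product(WEIGHT_FORMATS, EMBED_DIMS)
]

for cfg in grid:
    print(cfg)
```

The full precision column acts as the known baseline, and the interesting comparison is what the learned geometry looks like in each cell, not only how much accuracy each cell loses.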
The field has been asking: how close can we get to full precision performance at lower bit rates? That’s the wrong question if you’re trying to find something new.
The right question is: what are the undiscovered things living in the parts of the solution space that full precision training never visits — and what does it look like when you give them room to breathe?
* * *
I am not a machine learning researcher. I have no formal training in mathematics beyond what a reasonably curious person picks up by paying attention. I cannot build the experiment I just described. I couldn’t write the training loop, design the instrumentation, or analyze the resulting weight geometry without significant help from people who actually know what they’re doing.
What I can do is notice when a question hasn’t been asked yet.
That’s a different and arguably less impressive skill. But it’s not nothing. The history of science has room for people who point at the dark and say — has anyone looked carefully over there? The people who then go look are doing the harder and more important work. But somebody has to do the pointing.
My original instinct about quantization was wrong in its framing. Low bit rates aren’t freedom — they’re constraint. I had the geometry inverted. But the thing I was sensing underneath the wrong framing was real. There is something interesting in the idea that forcing a system into configurations it wouldn’t naturally choose might produce emergent behaviors that the unconstrained system structurally cannot. That frustrated systems generate exotic properties. That the easy path and the only available path lead to different places.
That instinct pointed at condensed matter physics without knowing it. It pointed at BitNet without having heard of it. It pointed at an experiment nobody has run systematically — native low-bit training crossed against varying embedding dimensionality, studied not as a compression problem but as an exploration of unmapped solution space.
I’m not claiming credit for the ideas that would make that experiment work. The people who could actually run it would bring mathematical tools, implementation knowledge, and theoretical frameworks I don’t have. They would probably redesign the methodology significantly. They might find that the question dissolves on contact with the actual data.
But the question feels worth asking. And the fact that I reasoned my way to it from a wrong first instinct, without a technical background, in the middle of a conversation about something else entirely — that says something I keep coming back to.
The frontier of AI research isn’t just a technical problem. It’s also an imagination problem. The experiments that get run are the experiments that someone first imagined running. Technical capability determines what’s possible. But curiosity about the right things determines what gets attempted.
I’m not sure where the undiscovered things are living in the parts of the solution space that full precision training never visits. I’m not sure anyone is.
But I think someone should go look.
And I think the fact that a non-expert with a wrong first instinct can reason their way to that frontier — in a single conversation, following a thread that started somewhere else entirely — says something important about how wide that frontier actually is, and how many people could be contributing to the search who currently don’t know they’re invited.
This is the second piece. There will be more.
