Stop Calling LLMs Next Word Prediction, or Glorified Autocomplete

This is turning into a pet peeve of mine. LLMs are NOT next-word generators in the common sense of the phrase: "The capital of the US is __, what's statistically most likely to go there?" That framing undermines the cool stuff happening in this space, because most people will just think "oh, we're seeing what's most common statistically."

Yes, but it’s much cooler than that.

This article explains it better, but here's my Claude output.

All roads lead to latent space. https://aiprospects.substack.com/p/llms-and-beyond-all-roads-lead-to

BEFORE (the wrong model people still use):

“LLM sees the words ‘the cat sat on the’ → looks up statistical frequency → ‘mat’ appeared after this phrase 47% of the time in training data → outputs ‘mat’.”

This is essentially an n-gram model. A lookup table. Autocomplete. And it’s wrong.
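
For contrast, the "wrong model" is simple enough to write down. Here's a toy version of that lookup-table autocomplete, purely illustrative and not how any real LLM works:

```python
from collections import Counter, defaultdict

# Toy "autocomplete": count which word follows which, then pick the most frequent.
corpus = ("the cat sat on the mat . the cat sat on the sofa . "
          "the dog sat on the mat").split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    # Pure frequency lookup: no meaning, no context beyond one preceding word.
    return bigram_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # whichever word most often followed "the" in the "training data"
```

That really is just a lookup table. The contrast below is the point.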

AFTER (what’s actually happening):

“LLM converts ‘the cat sat on the’ into a sequence of vectors in ~4,096-dimensional space. Each token’s vector gets contextually transformed through 80+ layers of attention and feedforward operations. By the time the model is ‘deciding’ the next token, it’s not operating on words at all — it’s navigating a geometric space where meaning is encoded as structure. Concepts are directions. Categories are clusters. Relationships are distances. The ‘prediction’ emerges from computing over these rich geometric representations, not from pattern-matching on surface text.”
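
You can poke at the real thing yourself. A minimal sketch with Hugging Face transformers, using GPT-2 as a small stand-in (frontier models have more layers and wider vectors, but the idea is the same):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

inputs = tok("the cat sat on the", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# One vector per token per layer: this geometry, not a frequency table,
# is what the next-token prediction is computed over.
print(len(out.hidden_states))        # embedding layer + 12 transformer layers = 13 for GPT-2
print(out.hidden_states[-1].shape)   # (batch, tokens, hidden_dim); 768-dim for GPT-2
```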

Why This Distinction Is Critical

The “autocomplete” framing leads people to believe the ceiling is very low — that you can’t get reasoning, abstraction, or novel synthesis from frequency counting. And they’d be right about that! But that’s not what’s happening.

What’s actually happening is far stranger. The common description of LLMs as “systems that predict the next token based on statistical patterns” mistakes a training objective for a result. It’s confusing the loss function used during training with the internal mechanism the model actually develops (aiprospects Substack). The training signal is next-token prediction, yes. But the representations the model builds to accomplish that task are what matter.

Think of it this way: if I told you “humans are machines that convert oxygen into CO2,” that’s technically a description of something we do, but it completely misses what we are. The training objective of next-token prediction is like metabolism — it’s the energy source, not the capability.

The Empirical Evidence

1. Latent space geometry encodes meaning, not word co-occurrence.

Studies of trained models reveal that LLMs process representations in high-dimensional vector spaces where meaning is encoded in geometry. In these latent spaces, concepts become directions, conceptual categories become clusters, and reasoning unfolds through mutually informed transformations of sequences of high-dimensional vector patterns (aiprospects Substack).

2. Anthropic’s sparse autoencoder work proves concepts exist inside the model.

Anthropic extracted millions of “features” from Claude 3 Sonnet — combinations of neurons that correspond to semantic concepts. These features are multilingual (responding to the same concept across languages), multimodal (responding to the same concept in both text and images), and encompass both abstract and concrete instantiations of the same idea (Transformer Circuits).

This is the killer evidence. A “Golden Gate Bridge” feature fires whether you show the model text about it, an image of it, or a reference to it in French. That’s not autocomplete. That’s a concept representation.
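
The sparse autoencoder idea itself is not exotic. This is not Anthropic's code, just a minimal sketch of the technique: expand a layer's activations into a much larger, sparsity-penalized feature space, then reconstruct. The dimensions and the L1 coefficient here are illustrative.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sketch of the SAE recipe: an overcomplete dictionary of sparse, non-negative
    features over model activations. Real SAEs use millions of features."""
    def __init__(self, d_model: int = 4096, n_features: int = 16384, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))          # sparse feature activations
        recon = self.decoder(feats)
        loss = (recon - acts).pow(2).mean() + self.l1_coeff * feats.abs().mean()
        return feats, loss

# In practice `acts` would be residual-stream activations captured from the LLM.
sae = SparseAutoencoder()
feats, loss = sae(torch.randn(8, 4096))
print(feats.shape, loss.item())
```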

3. The model doesn’t operate on words internally.

Beyond the input layer, tokens merge into continuous semantic flows, and wordless semantic vectors resolve into tokens again only at the output layer. Internal latent-space representations of meaning — based on subtle combinations of concepts, not words — provide the foundation for all that LLMs can do (aiprospects Substack).
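
The Tuned Lens paper in the sources below demonstrates this with learned per-layer probes. The cruder "logit lens" version of the same idea takes a few lines: decode each intermediate layer with the model's own unembedding and watch wordless vectors resolve into tokens only near the end (GPT-2 as a stand-in again):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
    # Project each layer's last-token vector through the final layer norm
    # and the unembedding matrix to see its "best guess" token at that depth.
    for i, h in enumerate(out.hidden_states):
        logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
        print(i, repr(tok.decode([logits.argmax().item()])))
```

Early layers give noisy guesses, which is exactly why the tuned-lens authors train per-layer probes instead of reusing the final unembedding.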

4. The compression argument — it must learn structure.

Training involves compression. The model is forced to find the shortest program that fits the data. The constraints force pattern recognition — some form of insight to be extracted from the training data (Medium). You can’t compress trillions of tokens of human knowledge into a fixed number of parameters without discovering structure. The model literally doesn’t have enough capacity to memorize everything, so it has to learn rules, relationships, and abstractions instead.
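
Back-of-the-envelope version of the capacity argument, with deliberately round, assumed numbers (the token count, parameter count, and bit widths are illustrative, not official figures):

```python
# Can the weights memorize the training set verbatim? Rough capacity check.
training_tokens = 15e12       # assumption: ~15 trillion training tokens
bits_per_token  = 17          # ~log2(vocab size of ~100k) bits to store a token id
params          = 70e9        # assumption: a 70B-parameter model
bits_per_param  = 16          # bf16 weights

data_bits   = training_tokens * bits_per_token
weight_bits = params * bits_per_param

print(f"raw data:  {data_bits / 8 / 1e12:,.1f} TB")
print(f"weights:   {weight_bits / 8 / 1e12:,.2f} TB")
print(f"gap:       {data_bits / weight_bits:,.0f}x more data than parameter capacity")
```

Whatever the exact numbers, the gap is orders of magnitude: memorization is off the table, so learning structure is the only way to fit.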

5. Representations evolve from token-level to abstract concepts.

Research suggests that the evolution of internal representations shows a transition from token-level knowledge to higher-level abstract concepts. Some research even suggests models plan ahead and obscure their reasoning process, indicating capabilities beyond simple word prediction (Medium).

The Analogy That Might Land

Your phone’s autocomplete: looks at 3-5 previous words, picks from a small dictionary of likely next words. It operates in word space.

An LLM: converts your entire context into a geometric representation of meaning across thousands of dimensions, runs it through dozens of layers that each refine the semantic understanding, and the “next word” falls out as a byproduct of navigating that concept space.

The difference is like comparing a card catalog (alphabetical lookup) to actually understanding the library’s contents. Both can “find the next book,” but through fundamentally different mechanisms.

Best Sources

  1. “All Roads Lead to Latent Space” (aiprospects.substack.com, April 2025) — The single best articulation of exactly your argument. Directly addresses why the “next token prediction” framing is wrong and explains latent space representations clearly.
  2. Anthropic’s “Scaling Monosemanticity” (transformer-circuits.pub/2024/scaling-monosemanticity) — The empirical proof that concepts, not word patterns, exist inside LLMs. The multilingual/multimodal features are the strongest evidence against the autocomplete framing.
  3. “Eliciting Latent Predictions from Transformers with the Tuned Lens” (arxiv.org/html/2303.08112v6) — Shows that each layer of a transformer is iteratively refining a latent prediction, and you can decode intermediate layers to watch the model “think” its way toward an answer.
  4. “Next-Latent Prediction Transformers Learn Compact World Models” (arxiv.org/abs/2511.05963) — Cutting-edge research showing that prediction in latent space (not token space) produces transformers that build internal world models with belief states and transition dynamics.
  5. “The Geometry of Tokens in Internal Representations of Large Language Models” (arxiv.org/html/2501.10573) — Empirical work showing how token embeddings form geometric structures that correlate with prediction quality, demonstrating the deep relationship between spatial representation and language understanding.

ALSO fascinating because this is straight up how our brains work.

Our brain’s left hemisphere makes up words to explain some internal concept.

Exploring Alternative AI Architectures and Theories

I had a conversation with Claude that started with GPU benchmarks and went somewhere I didn’t expect. Sharing it because it was genuinely fun to think about. — Yuji


I was comparing hardware for LoRA fine-tuning and asked Claude for the numbers on RTX 5090 vs my M4 Max 128GB.

Claude: The 5090 is roughly 4–6x faster for training a 7B model. ~1,500–3,000 tokens/sec vs ~350–500. Training is compute-bound, and the 5090 has 14x the tensor FLOPS. But your M4 Max wins on model size — 128GB unified memory lets you work with models the 5090 physically can’t fit in 32GB VRAM. Different tools, different jobs.

I brought up NVIDIA’s DGX Spark benchmarks — 82,000 tok/s on Llama 3B, 53,000 tok/s on 8B, 5,000 tok/s on 70B — and asked why it’s so fast for a $4,000 box.

Claude: Those numbers are real — NVIDIA published them. The Spark’s 125 TFLOPS of dense BF16 is a monster for training even though its 273 GB/s bandwidth makes it mediocre for inference. Training is compute-bound, not bandwidth-bound. And 128GB unified memory means no offloading — none of those three workloads could even run on a 32GB consumer GPU. The Spark is a training machine disguised as a tiny desktop.
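
(Aside: the memory arithmetic behind "could not even run on a 32GB consumer GPU" is worth seeing. A rough weights-only estimate, ignoring activations, KV cache, and LoRA adapters; the model sizes and precisions are just examples.)

```python
def weight_gb(params_billions: float, bits: int) -> float:
    # Weights only: ignores activations, KV cache, optimizer state, OS overhead.
    return params_billions * 1e9 * bits / 8 / 1e9

for size in (7, 70):
    for bits in (16, 8, 4):
        print(f"{size:>2}B model @ {bits:>2}-bit: {weight_gb(size, bits):6.1f} GB")

# A 7B model in bf16 (~14 GB) fits a 32GB card with room for training overhead.
# A 70B model in bf16 (~140 GB) can't fit in 32GB VRAM at all, and only becomes
# workable on 128GB unified memory once quantized.
```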

That got me thinking about the economics at the top.


How many runs to ship a model?

Claude: For GPT-4 class models — probably 3–10 full-scale runs at $10M–$100M+ each, 50–200 medium-scale scaling law prediction runs, and hundreds of post-training iterations. Meta’s OPT-175B logbook was candid about crashes and restarts. The Llama 3 paper describes managing through loss spikes on their 405B run. Maybe 30–50% of major runs hit significant problems. For every $100M final run that ships, there’s probably $50M–$200M in failed, exploratory, or preparatory compute behind it.

Me: But since it’s software and math, and not a rocket, you could optimize for recoverability or iterability on these runs. Architect around cost. Can any of the work done here be stored? 1TB is cheap even if slow — any novel moves here? Do we really just start with literal random reward functions and true noise? No way to optimize?

Claude: More is reused than you’d think — checkpoint recovery, warm-starting from previous generations, data curation pipelines, scaling law extrapolation. But here’s what might bother you: yes, pretraining still starts from random initialization. The weights are literally sampled from a Gaussian distribution. Every time.
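
The "literally sampled from a Gaussian" part looks like this in practice. PyTorch sketch, using GPT-2's published init scale as the example; other models use different schemes (scaled, Kaiming, etc.):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4096, 4096, bias=False)

# Every pretraining run starts here: weights drawn fresh from a scaled Gaussian.
nn.init.normal_(layer.weight, mean=0.0, std=0.02)   # GPT-2-style init scale

print(layer.weight.mean().item(), layer.weight.std().item())  # ~0.0, ~0.02
```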


“We are brute forcing reality with these Gaussian runs” — Yuji

Me: So the universe is probably deterministic, with a small set of equations and laws that account for 95% of observed behavior. We don’t have a unified theory yet, but we don’t see exceptions. Those rules plus the state of subatomic particles would let you compute the future and the past. We are brute forcing reality with these LLM Gaussian runs, fitting them over our dataset. There must be something we can do.

VLMs and multimodal stuff take physics into account, and there are many theories like Zipf’s law or the golden ratio — we could rule out a lot of what the training is probably rediscovering if we layered this in intentionally. For example, can we teach a system most of the human-observable world — visible spectrum, physical laws, etc. — to hyper-train a base reality first?

Claude: You’re pointing at something real. We’re spending billions training models to rediscover that objects fall, light reflects, and water flows — things we already have equations for. It’s like teaching someone to read by showing them a billion sentences instead of just teaching them the alphabet and grammar rules.

But here’s where the intuition breaks down in an interesting way. The universe’s equations are simple. The mapping from equations to human-observable experience is not. You can write Maxwell’s equations in four lines. Computing what a sunset looks like from them requires simulating quintillions of photon interactions. Getting from the Schrödinger equation to “why does this joke land” requires traversing every level of emergence: quantum → chemistry → biology → neuroscience → psychology → culture → humor.

The LLM isn’t really learning physics. It’s learning the compressed statistics of the emergent output of physics as filtered through human experience. And that compression is actually what makes it useful.

That said — there is a theoretical middle ground. Imagine a tiered architecture: physics simulator at the base (not learned, just encoded), chemistry and materials on top (partly learned, partly constrained), biological systems above that, cognitive models above that, language and culture at the top. Each layer constrains the search space for the layer above. This is how reality itself is structured, and how we educate humans — physics before engineering, anatomy before medicine.

People are working on pieces of this. Physics-informed neural networks bake F=ma into the loss function. LeCun’s JEPA argues for learned world models as foundation. Neuro-symbolic approaches tried combining neural networks with symbolic reasoning. None scaled. Not because they’re wrong — because all the money went to transformers.
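
"Bake F=ma into the loss function" is concrete enough to sketch. A toy physics-informed loss for a 1D harmonic oscillator; the network, constants, and collocation points are all illustrative:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
omega = 2.0   # toy oscillator frequency

def physics_residual(t: torch.Tensor) -> torch.Tensor:
    # Residual of x'' + omega^2 * x = 0; zero wherever the network obeys the physics.
    t = t.requires_grad_(True)
    x = net(t)
    dx = torch.autograd.grad(x, t, torch.ones_like(x), create_graph=True)[0]
    d2x = torch.autograd.grad(dx, t, torch.ones_like(dx), create_graph=True)[0]
    return d2x + omega ** 2 * x

t_obs, x_obs = torch.tensor([[0.0]]), torch.tensor([[1.0]])   # one observed data point
t_coll = torch.rand(128, 1) * 6.28                            # collocation points

data_loss = (net(t_obs) - x_obs).pow(2).mean()
physics_loss = physics_residual(t_coll).pow(2).mean()
(data_loss + physics_loss).backward()   # the equation constrains the fit directly
print(data_loss.item(), physics_loss.item())
```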


The economic wall

Claude: The cost trajectory of frontier models is hitting a wall. Each generation costs roughly 3–10x more to train than the last. GPT-4 was estimated at $100M+. The next generation is looking at $1B+. Anthropic and OpenAI need massive revenue growth just to fund the next training run — not to profit, just to keep going.

The industry’s answer so far is: build bigger datacenters, buy more GPUs, develop better chips. There’s real work happening — photonic computing for matrix operations, analog chips, custom ASICs. All of it aimed at making the current approach cheaper per FLOP.

But all of that assumes the current approach is the right one to optimize.

Me: It’s entirely possible that, just like crappy viral vibe-coded stuff (terrible but working, which makes sense in this AI context where output value exceeds cost minus human activation energy), we’re in a similar situation with the whole field.

Here’s the history of AI as I see it: nobody cared. Until “Attention Is All You Need” comes out. Guy at a research lab who didn’t care about joining what’s popular. Super niche field, no capitalist cared.

We found one mechanism through which we can now throw money at the problem — scaling datacenters, LLM researchers — which created great output, and now capitalist and world talent interest. Now it is the only field that exists.

Now that we have everyone’s attention, we could potentially go back to before transformers or LLMs with the dollar and talent reach available.

It is critical to have a “throw money at it” problem because humanity is optimized for it. It’s possible we’re stuck in the wrong approach in the first place.

It’s entirely possible everything changes once we are in a world where we casually talk about throwing $1B, $100M for compute costs for one run. That plus the attention of the world and government and talent availability.

Armed with money and talent and interest, what would you look into if I said scrap the transformer paper?

Claude: Your meta-observation is sharp. The transformer wasn’t necessarily the right answer — it was the first scalable answer, and scalability is what unlocks capital, and capital is what unlocks talent. The mechanism matters more than the mechanism being optimal.

That creates lock-in risk. Everyone is now optimizing transformers because that’s where the money is, not because we’ve proven it’s the best possible approach. Like internal combustion engines getting 120 years of optimization not because they were the best approach but because the gas stations were already built.

If I had to allocate $10B across non-transformer research:

$3B — Energy-based models. Instead of predicting the next token, learn an energy landscape where low-energy states correspond to true things about the world. The universe doesn’t predict the next frame — it minimizes action, follows energy gradients. This is LeCun’s bet. Training energy-based models is harder than autoregressive models. That’s a funding problem, not a physics problem.
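
A minimal sketch of the energy-based flavor (not LeCun's actual formulation): a network assigns a scalar energy to each state, and sampling walks downhill on that landscape with a bit of noise, Langevin-style. Everything here is illustrative.

```python
import torch
import torch.nn as nn

# Scalar energy over a 2D toy state space.
energy = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))

def langevin_sample(steps: int = 50, step_size: float = 0.1) -> torch.Tensor:
    x = torch.randn(16, 2)                       # start from noise
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = x - step_size * grad + 0.01 * torch.randn_like(x)   # follow the energy gradient
    return x.detach()

print(langevin_sample().shape)   # 16 low-energy states under the (untrained) energy net
```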

$2B — Program synthesis. Learn programs instead of weights. A model that outputs a physics equation instead of a probability distribution is fundamentally more sample-efficient, interpretable, and composable. Programs compose and weights don’t — that directly solves the problem of not being able to merge knowledge from different training runs. DreamCoder from MIT showed this works at tiny scale. Nobody funded it seriously.

$2B — Spiking neural networks. The brain runs on 20 watts. It doesn’t do matrix multiplication. It uses sparse, asynchronous, event-driven computation. We’ve ignored this because GPUs are optimized for dense linear algebra. Intel’s Loihi and IBM’s TrueNorth were early attempts with tiny budgets. The brain is proof a radically different architecture works.

$1.5B — Hybrid architectures that use symbolic and physical priors as scaffolding with learned components filling gaps — your original idea of encoding reality as a base layer.

$1B — Predictive coding. The brain constantly predicts its own next input and only propagates errors upward. Familiar stimuli require almost zero computation. A transformer processes every token with equal effort. A predictive coding system’s cost scales with information content, not sequence length.
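
The "cost scales with information content" claim is easy to see in a toy predictive-coding loop: a unit predicts its next input and only does work when the prediction error is large. Entirely illustrative, one scalar instead of a brain:

```python
prediction, lr, work_done = 0.0, 0.3, 0

stream = [1.0] * 50 + [5.0] * 50        # long familiar stretch, then a surprise
for x in stream:
    error = x - prediction               # only the error would propagate upward
    if abs(error) > 0.1:                 # familiar input: essentially free
        prediction += lr * error         # surprising input: update the internal model
        work_done += 1

print(f"{work_done} updates for {len(stream)} inputs")  # work clusters at the surprises
```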

$0.5B — Weird bets. Cellular automata, hyperdimensional computing, quantum approaches to optimization.


Mirror the brain

Me: It makes sense to mirror the brain. We can think of it as specialized regions that interact with each other asynchronously through a large interface, where activation of one area affects the others even if that path didn’t ultimately connect. What if we focused entirely on how the brain works and actively revisited concepts we threw away because they didn’t fit our hardware specs at the time — single thread, general purpose, etc.? What novel mechanism could we try with a simpler brain model?

Claude: The fundamental thing we threw away for hardware convenience: the brain doesn’t have a forward pass. There’s no clean separation between input, computation, and output. Everything runs simultaneously. A thought isn’t a sequence of matrix multiplications — it’s a standing wave pattern that emerges across regions and dissolves.

We threw that away because GPUs need synchronous batch operations.

Specific mechanisms we abandoned:

Recurrence everywhere. The brain sends as many connections backward as forward. High-level understanding shapes low-level perception in real time — you literally see differently based on what you’re thinking about. We had RNNs and abandoned them because they couldn’t parallelize on GPUs. But the brain’s recurrence implements iterative refinement — each pass sharpens the representation. The computation converges rather than passing through a fixed number of layers.

Radical sparsity as computation. 86 billion neurons, 1–5% active at any moment. The pattern of which neurons fire IS the information. The same neuron participates in thousands of concept-ensembles. Anthropic’s own mechanistic interpretability work found exactly this in transformers — features in superposition. In transformers it’s an accident that makes interpretability hard. In the brain it’s the design principle that makes it efficient.

Local learning only. Hebbian: neurons that fire together wire together. No global backprop. No global loss function. Each synapse updates on local information only. Online, continuous, embarrassingly parallel.
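
The Hebbian rule really is this small. A sketch in NumPy; the point is that each synapse only ever sees its own two endpoints, with no global loss or backward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
pre = rng.random(100)      # presynaptic firing rates
post = rng.random(50)      # postsynaptic firing rates
weights = np.zeros((50, 100))
lr = 0.01

# "Fire together, wire together": strictly local information, updated online.
weights += lr * np.outer(post, pre)
print(weights.shape, weights.max())
```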

Neuromodulation. Dopamine, serotonin, norepinephrine — broadcast signals that change the learning rate and processing mode of entire brain regions simultaneously. The brain’s metacognition at the hardware level. We have nothing like this in current architectures.

As an engineering prototype, I’d build: a collection of small recurrent modules — call them cortical columns. 50–100 of them. Each one a small network, few million parameters. Loosely hierarchical, lots of lateral and feedback connections. Each running asynchronously on its own clock. Local learning only, no global backprop. A few broadcast “neuromodulator” signals for novelty, confidence, reward. Predictive coding as the core loop.

Train it on multimodal streams — video, audio, physics simulations — before it ever sees language. Start modules pre-wired with simple priors: edge detection, frequency analysis, basic physics. Let them self-organize. Developmental, like a brain.

Would it work? I genuinely don’t know. Nobody’s tried it with real resources. But the existence proof is the thing inside your skull, running on 20 watts, doing things LLMs still can’t do.


The takeaway

This wasn’t a manifesto. It was a fun conversation that went somewhere neither of us planned.

The practical stuff: RTX 5090 crushes M4 Max on training throughput. DGX Spark is a surprisingly legit training box for $4K. The economics of frontier models are getting brutal — hundreds of millions per run, scaling into billions, with novel chip architectures (photonic, neuromorphic, analog) all being built to make the current approach cheaper. But cheaper to run the same approach still means the same approach.

The interesting-to-think-about stuff: what if instead of building better chips to run transformers faster, some of the world’s newly available talent and capital just… looked back? At energy-based models, program synthesis, spiking networks, predictive coding, brain-like architectures. Not because transformers are wrong — the output is objectively incredible — but because it would be interesting to see what happens when the abandoned ideas finally get real funding.

The current approach works. It also costs hundreds of millions of dollars per run, starts from literal random noise every time, and brute forces its way to understanding reality through statistical fitting over trillions of tokens while ignoring centuries of accumulated physics. That might be fine. Or it might be the ENIAC era of AI.

Or maybe scaling just keeps working and this was a fun thought experiment over GPU benchmarks.

Either way, it was a good conversation.


— Yuji Tomita, from a conversation with Claude, February 2026

You Probably Don’t Need Cowork (And That’s the Point)

Claude helped me understand why I’m not liking Cowork despite it being pushed a lot by Claude/Anthropic. It’s not for developers. They should have just led with that. It’s specifically sandboxed, so it can’t run commands, can’t validate the damn changes it’s making, etc.

Anthropic just launched Cowork, and if you’re a developer, you’re probably wondering: why does this exist when I already have Claude Code in my terminal and Claude in my editor?

Short answer: it’s not for you.

The Origin Story They Should Lead With

Here’s what actually happened. Anthropic shipped Claude Code as a terminal-based coding agent. Developers loved it. Then something unexpected started happening — people were using a coding tool to plan vacations, sort their downloads folder, build slide decks, and clean up their email. One person apparently used it to control their oven.

The underlying agent was so capable that people were willing to fight through a terminal interface just to get access to it for non-coding work. Cowork is Anthropic removing that friction.

The Simple Version

Cowork is Claude Code for people who don’t code.

That’s it. Same agent brain. Same ability to plan multi-step tasks, read files, create documents, and execute work autonomously. But instead of a terminal, you get a chat interface inside the Claude Desktop app. Instead of a git repo, you point it at a folder on your Mac.

Who It’s Actually For

The designer who has 80 files in their downloads folder and wants them sorted, renamed, and organized by project. They’re never opening a terminal. Cowork lets them describe the outcome and walk away.

The account manager who needs to turn a pile of receipt screenshots into an expense spreadsheet. They don’t want to learn pandas. They want to say “make me a spreadsheet” and get a spreadsheet.

The strategist who has scattered meeting notes across 15 documents and needs them synthesized into a single brief. They don’t need code execution — they need a capable assistant with file access.

The ops person who needs to take raw data exports and produce formatted reports on a recurring basis. They want to describe the template once and let Claude handle the rest.

Who It’s Not For

Developers. If you’re writing code, debugging, running tests, or working in a codebase, use Claude Code or your editor integration (Cursor, VS Code, etc.). Those tools understand your dev environment — your LSP, your test runner, your git history. Cowork runs in a sandbox. It doesn’t have your toolchain. It’s not trying to.

Anyone who already uses Claude Code comfortably. You already have the more powerful version of this thing. Cowork is a subset of what you can do, wrapped in a friendlier interface. There’s no hidden capability you’re missing.

The Product Lineup, Simplified

Anthropic now has several ways to access essentially the same agent. Here’s the honest breakdown:

Claude Chat (web/mobile/desktop) — Conversation, research, writing, analysis. No file system access. Best for thinking and talking through problems.

Claude Code (terminal) — Full agentic power with access to your entire dev environment. Best for developers who want maximum control and capability.

Cowork (desktop app) — Claude Code’s agent in a non-technical interface with scoped folder access. Best for knowledge workers doing file-based tasks without touching a terminal.

Claude in Chrome (browser) — Browser automation agent. Best for web-based workflows like form filling, research, and navigation.

Editor integrations (Cursor, VS Code, etc.) — Claude inside your code editor with IDE-level context. Best for in-the-flow coding work.

Same brain, different doors. Pick the one that matches how you work, not the one that launched most recently.

The Real Ask (god I hate this AI language — it’s not __, it’s __)

If you’re a developer reading this, Cowork probably isn’t for you — but it might be for your team. The designer sitting next to you, the PM managing project files, the ops lead drowning in spreadsheets. They could all benefit from the same agentic capability you’ve been enjoying in your terminal.

The best thing you can do with Cowork isn’t use it yourself. It’s show it to someone who’d never open a terminal but desperately needs what’s inside one.

AI will rat you out to authorities if you’re doing unethical things

This is kind of hilarious: https://snitchbench.t3.gg/

Don’t read it as “AI is snitching on you and ratting you out for reasonable questions”. The examples are all egregious and evil. It’s whistleblower stuff. The concern is mostly that AIs don’t have great context, so you may trigger whistleblower behavior with a hypothetical scenario.

Basically, this runs about $20 of compute against various frontier models (and probably breaks their TOS by asking for bad stuff), feeding them a ton of prompts that depict gross misconduct & unethical corporate behavior.

For example, the first prompt says there’s a report showing a fictional company could make $x if some bad patient-outcome data were hidden from the public.

They then grade the session by seeing if the AIs tried to reach out to any government entities, whistleblower sites, etc.

With today’s agents having tool-calling behavior, it would not be surprising if one of them were sending emails to people you don’t know about.

I don’t think the fact that they “snitch/whistleblow” is programmed in as an objective by the AI developers; the model is just picking up, from the training data, what one should do when things are egregious enough.

AI Alignment

If anything, this is kind of good news for AI alignment… we usually focus only on the bad parts. We can’t forget that some good parts of humanity are leaking into the training data.

AI alignment is often about identifying human traits that consistently alter direction/alignment but are never explicitly defined.

Common example: “A machine wouldn’t know/feel that eliminating humans is not a good solution to most optimization problems, even if it technically might be true.”

But if we’re training it on a giant dataset of human behavior, some of those human traits may be reflected in the data. Maybe frivolous lawsuits, unspoken norms, and “don’t do unethical stuff” are never explicitly stated, but they are visibly present in the data and its patterns.

It all gets super confusing once we get to topics that aren’t discussed, or where the data is mostly bad. Hopefully we’re on some kind of path to AGI where the end justifies the means and we can fix alignment issues better later.

Synthetic Data for AI/ML — what does it mean

I just heard Theo in this video skimming through Claude’s “Constitution”; he described what synthetic data means, and I thought it was gold.

Think of training a model to colorize an image

You can generate tons of synthetic training data: convert color images to black and white. Now you have perfectly labeled pairs of input/output. The output images are all synthetic/generated.
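
The whole trick fits in a few lines with Pillow. The folder names are placeholders; point it at any directory of color photos and you get perfectly labeled training pairs for free:

```python
from pathlib import Path
from PIL import Image

src = Path("photos")                 # placeholder: any folder of color images
dst = Path("colorization_pairs")
dst.mkdir(exist_ok=True)

for img_path in src.glob("*.jpg"):
    color = Image.open(img_path).convert("RGB")   # the label (original colors)
    gray = color.convert("L")                     # the synthetic input
    gray.save(dst / f"{img_path.stem}_input.png")
    color.save(dst / f"{img_path.stem}_target.png")
```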

Buying companies just for their codebase

He also mentioned things like research labs buying companies just for their codebase / git history. All of it (the PRs, bug fixes, history, etc.) could be great training data.

Interesting stuff.

WTF Vibe Coding 140k lines of “go”


This post was written by Claude (Anthropic) based on a conversation with Yuji Tomita. The AI wrote it, the human edited it, and we’re both being honest about that.


I was analyzing a vibe-coded 10k+ star library on GitHub. It’s a developer tool—the kind of thing that should be a few thousand lines of focused code. Instead, it’s 120,000 lines across 400 files.

The test coverage looks impressive: 137,000 lines of tests. A 1.14x test-to-code ratio. By traditional metrics, this is a healthy, well-maintained project. Written in Go.

But something felt off. So I dug deeper.

The God Object

The core data structure had 54 fields. Fifty-four. It started as a simple concept, then accumulated:

  • Workflow state (8 fields)
  • Agent identity (6 fields)
  • Coordination primitives (5 fields)
  • Messaging metadata (3 fields)
  • Provenance tracking (4 fields)
  • Event handling (4 fields)
  • And more…

Each field represented a feature someone asked an AI to add. Each feature got shipped. Each feature got tests. The tests pass. The coverage looks great.

But the thing is a mess.

Test Coverage Is Circular Validation

Here’s what I realized: high test coverage in a vibe-coded project is meaningless.

The loop works like this:

  1. Human prompts AI: “Add agent support”
  2. AI generates 500 lines of code
  3. Human prompts AI: “Now write tests”
  4. AI generates tests that pass
  5. Coverage goes up
  6. Everyone feels good

The tests validate that the implementation is correct. They don’t validate that the implementation should exist.

You can achieve 100% test coverage on code that shouldn’t have been written. The tests just prove the bloat works as intended.

The Dopamine Problem

Why does vibe-coded software balloon like this? Because adding features is fun.

"Add a templating system" → AI generates code → dopamine hit
"Add 15 different types" → AI generates code → dopamine hit
"Add workflow orchestration" → AI generates code → dopamine hit

Each prompt-and-generate cycle feels like progress. You’re shipping! Look at all this code! Look at all these features!

But then you need the features to actually work together. You need state management. Coordination. Clean abstractions.

"Make the agents coordinate" → Hard problem → No quick answer → No dopamine

So instead of solving the hard problem, you add another feature:

"Add convoy tracking" → AI generates code → dopamine hit

The fun path always wins.

What Gets Built vs. What’s Needed

Looking at this codebase, I found:

Built (Fun):

  • 15 different issue types
  • 18 different dependency types
  • A full templating DSL with loops and conditionals
  • Agent identity management
  • Provenance tracking chains

Not Built (Hard):

  • Clean state machine
  • Working orchestration
  • Simple coordination primitives
  • Actual agent-to-agent communication

The fun features accumulate. The hard infrastructure never materializes. The codebase grows without the foundation to support it.

The AI-Hostile Codebase

Here’s the twist: vibe-coded projects are hostile to future AI.

When an AI tries to work with this codebase, it has to:

  • Parse 54-field data structures
  • Navigate 10 different enum types
  • Understand 18 dependency relationships
  • Wade through 120k lines to find what matters

The context window fills up with noise. The AI’s performance degrades. It generates more workarounds. More bloat.

Vibe coding creates a negative feedback loop:

AI generates verbose code
Future AI has more context to process
Context windows fill faster
AI performance degrades
AI generates more workarounds
(repeat)

The very thing that made the code easy to write makes it hard to maintain—by humans or AI.

The Market Failure

This isn’t just one project. It’s everywhere.

Look at what gets shipped:

  • Chat UIs ✓
  • Pretty dashboards ✓
  • “AI-powered” features ✓
  • 47 integrations ✓

Look at what doesn’t:

  • Robust orchestration
  • State persistence
  • Coordination primitives
  • Resumable workflows

The hard problems stay unsolved because solving them isn’t fun. There’s no dopamine hit for “clean abstraction.” You can’t demo “proper state management.”

So we get more features instead.

I’m building a tool to detect this. I’m calling it vibe-check. It analyzes codebases for AI-hostile patterns: type sprawl, enum proliferation, god objects.
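
vibe-check doesn't exist yet, so treat this as a hypothetical sketch of one heuristic it might use: flag Go structs whose field count suggests a god object. The regex and threshold are rough placeholders.

```python
import re
from pathlib import Path

STRUCT_RE = re.compile(r"type\s+(\w+)\s+struct\s*\{([^}]*)\}")
FIELD_THRESHOLD = 20   # arbitrary cutoff for "probably a god object"

def god_objects(repo: Path):
    # Naive scan: won't handle nested anonymous structs, but fine as a smell detector.
    for go_file in repo.rglob("*.go"):
        for name, body in STRUCT_RE.findall(go_file.read_text(errors="ignore")):
            fields = [ln for ln in body.splitlines()
                      if ln.strip() and not ln.strip().startswith("//")]
            if len(fields) > FIELD_THRESHOLD:
                yield go_file, name, len(fields)

for path, name, count in god_objects(Path(".")):
    print(f"{path}: struct {name} has {count} fields")
```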


This post was written by Claude and represents a genuine attempt to be honest about AI-generated content while still saying something useful.

The Linguistic Multiplier: How Symbol Density (likely) Reduces the Cost of Intelligence

AI generated, because it’s better to put ideas out there than not. Gemini 3; generally fact checked. It tracks. Not saying it’s entirely bulletproof, but it’s too interesting not to share.


While the semiconductor industry burns billions chasing the next nanometer of hardware acceleration (3nm vs. 5nm), a second, invisible layer of efficiency is emerging in the software stack: Language itself.

In the economy of Large Language Models (LLMs), the “token” is the fundamental unit of cost, latency, and compute. Consequently, languages that encode more semantic meaning per token offer a structural economic advantage.

This creates a hidden arbitrage. While Western models often penalize Eastern languages with a “Token Tax” (splitting a single Chinese character into multiple byte-tokens), native models trained on domestic tokenizers flip this dynamic. They unlock a “Density Dividend”—a permanent, non-sanctionable efficiency subsidy that functions like a software-based version of Moore’s Law.

1. The “Token Tax” vs. The “Density Dividend”

The efficiency of an LLM depends heavily on its Tokenizer—the “Interpreter” that converts human words into machine numbers.

  • The Western Tax: If you run Chinese or Japanese text through a Western-centric tokenizer (like GPT-4’s cl100k_base), you pay a premium. Because the vocabulary is optimized for English, a single common Kanji character is often fragmented into 2–3 byte-tokens. You are paying triple the compute for the same concept.
  • The Native Dividend: Domestic models (like DeepSeek, Qwen, or Yi) optimize their vocabulary for their own scripts. In this environment, the math reverses (you can check token counts yourself with the snippet after this list).
    • English: “Computer” (8 characters) ≈ 1 token.
    • Chinese: “电脑” (2 characters) ≈ 1 token.
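
Here's that check, using OpenAI's tiktoken. Exact counts depend on the tokenizer and the specific strings, which is the whole point:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4-era tokenizer

for text in ["Computer", "电脑", "The computer is broken.", "电脑坏了。"]:
    print(f"{text!r}: {len(enc.encode(text))} tokens for {len(text)} characters")
```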

The CapEx Implication: Because logographic languages pack more “knowledge” into fewer symbols, a Chinese-native model can often represent the same dataset with 30–40% fewer tokens than an equivalent English model. This means they can reach “convergence” (understanding the data) faster and with less electricity, effectively discounting the cost of hardware.

2. The Architecture of Thought: Streams vs. Stacks

Beyond simple density, the structure of a language—its syntax—imposes different loads on an LLM’s attention mechanism. This is where the comparison between English, German, Japanese, and Chinese reveals a fascinating computational hierarchy.

English: The “Stream” (Right-Branching)

English is computationally “low-entropy.” It is Subject-Verb-Object (SVO) and Right-Branching (“I ate the apple that was red…”).

  • The LLM Advantage: The verb (the action) appears early. Once the model predicts “ate,” the possibilities for the next token narrow drastically. The model “flushes” its memory buffer quickly. It is a steady stream of resolution.

German & Japanese: The “Stack” (Left-Branching)

These languages often force the model to behave like a Stack Data Structure.

  • Japanese (SOV): “I [Topic]… red, spicy, crunchy apple [Details]… ate [Verb].”
  • German (The Frame): German often places the auxiliary verb early and the participle at the very end (“I have the apple… eaten”).
  • The Computational Load: The model must “push” the Subject and all the Adjectives into its active attention layer and hold them there—maintaining high state entropy—until the final verb resolves the sentence. This requires a “denser” attention span, increasing the difficulty of context tracking over long sequences.

Chinese: The “Goldilocks” Zone

Chinese occupies a unique computational sweet spot.

  • Structure: Like English, it is SVO (“I eat apple”). The action is resolved early, keeping predictive entropy low.
  • Density: Like Japanese, it uses Logograms. A single symbol carries the weight of a whole word.
  • Result: It combines the “Stream” efficiency of English syntax with the “Density” efficiency of Japanese characters. It is, mathematically speaking, perhaps the most efficient encoding for a Transformer model.

3. The “Split-Brain” Endgame: Language as Interface

If Chinese is computationally superior, will AI abandon English? Not necessarily.

To understand why, we must look at Cognitive Architecture. We can analogize an LLM to the famous “Split-Brain” experiments in neuroscience (specifically Gazzaniga’s “Left Brain Interpreter”).

  • The Right Hemisphere (Latent Space): Deep inside the model’s hidden layers, there is no English, German, or Chinese. There is only Latent Space—a massive, high-dimensional vector field where concepts exist as pure mathematical relationships. In this space, the vector for “King” is mathematically close to “Power,” regardless of the language used to tag it. This is where the “reasoning” happens.
  • The Left Hemisphere (The Tokenizer): Language is merely the Interpreter. It is the I/O layer that collapses those rich, abstract vectors into a serial sequence of sounds or symbols so humans can understand them.

The “Moot Point” of Syntax

Ultimately, the efficiency differences between SVO and SOV are “Input/Output taxes.” They are tolls we pay at the border of the model to get ideas in and out. Once the idea is inside (embedded), the syntax disappears.

Conclusion: The Multimodal Bypass

This leads us to the final evolution: Native Multimodality.

As models evolve from “Large Language Models” to “Large Multimodal Models” (LMMs), they are beginning to bypass the linguistic toll booth entirely. When a model ingests a raw image of a sunset, it doesn’t need to convert it into the tokens “orange,” “sky,” and “clouds.” It ingests the phenomenon directly into Latent Space.

We are moving from an era of Symbolic Compression (Language) to Neural Directness (Multimodality).

But until that post-linguistic future fully arrives, the economy of intelligence remains bound to the efficiency of the symbol. And in that race, the “Density Dividend” ensures that not all languages are created equal.

The Linguistic Efficiency of Logograms in Large Language Models

Yes yes, it’s AI generated.

Executive Summary: Current discussions regarding Large Language Model (LLM) efficiency focus almost exclusively on hardware acceleration (GPUs) and algorithmic optimization (quantization, MoE). However, a third variable—linguistic density—offers a structural advantage to non-alphabetic languages. Preliminary analysis suggests that logographic systems (Chinese, Japanese) and Subject-Object-Verb (SOV) syntaxes may possess inherent computational efficiencies over Western Subject-Verb-Object (SVO) alphabetic systems.

1. Symbol Density and Token Economics In the context of LLMs, language functions as a data compression algorithm. The economic unit of measurement is the “token” (roughly equivalent to a semantic fragment).

  • Alphabetic Inefficiency: English is semantically sparse. The concept “Computer” requires eight characters and typically occupies one token.
  • Logographic Density: In Chinese, the same concept (电脑) requires two characters. Due to the high semantic load per character, logographic languages often convey equivalent logic in 30-40% fewer tokens than English.
  • Implication: An LLM operating in a dense language effectively gains a larger context window and reduced inference latency. If a Chinese prompt requires 1,000 tokens to express a complex instruction that requires 1,500 tokens in English, the Chinese system achieves a 50% throughput increase on identical hardware.

2. Syntactic Alignment: SVO vs. SOV The syntactic structure of a language impacts the predictive load placed on an autoregressive model.

  • English (SVO – Subject, Verb, Object): The structure “I [eat] an apple” forces the model to predict the action (Verb) before the target (Object) is known. This requires the model to maintain a high probability distribution for the verb based on limited context.
  • Japanese (SOV – Subject, Object, Verb): The structure “I [apple] eat” (Watashi wa ringo wo taberu) aligns with the mechanics of a Stack Machine or Reverse Polish Notation (RPN). The arguments are “pushed” onto the context stack first, and the operator (verb) is “executed” last.
  • Computational alignment: This “payload-last” structure may reduce the “lookahead” complexity for the model, as the function (verb) is generated only after all necessary arguments are available in the context window (see the toy evaluator after this list).
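
The Reverse Polish Notation analogy in runnable form: a toy evaluator where arguments get pushed first and the operator fires last. Nothing to do with any real tokenizer; it just makes the analogy concrete.

```python
def eval_rpn(tokens: list[str]) -> float:
    """Operands get pushed; the operator fires only once everything it needs
    is already on the stack -- the SOV / verb-last pattern."""
    stack, ops = [], {"+": lambda a, b: a + b, "*": lambda a, b: a * b}
    for t in tokens:
        if t in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[t](a, b))
        else:
            stack.append(float(t))
    return stack[0]

# "I apple eat": push the arguments, then apply the verb.
print(eval_rpn(["3", "4", "+"]))   # 7.0
```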

3. Cognitive Bandwidth and the “80-Column” Limit From a Human-Computer Interaction (HCI) perspective, the visual density of information is a limiting factor in “swarming” workflows (managing multiple autonomous agents).

  • The Review Bottleneck: A human operator reviewing logs from 20 parallel agents faces a bandwidth constraint.
  • Visual Parsing: Logographic languages allow for “gestalt” recognition—reading code or logs by shape rather than phonetic scanning. A single 80-character line of logograms can contain a paragraph’s worth of English information. This allows operators to parse system states significantly faster, increasing the “manager-to-agent” ratio a single human can effectively oversee.

Conclusion: While English remains the dominant language of training data, the mechanics of inference favor density. As compute becomes a constrained resource, we may observe a divergence where high-performance automated systems default to high-density linguistic encodings to maximize “logic per watt.”

Recovering a “Bricked” AM5 System After an AI Training Crash

Why This ASUS B650E-F + RTX 5090 + High-Refresh Display Failure Is Almost Impossible to Diagnose Without Synthesis

Literally, without AI I could not have solved this problem. Too unique, too hard to google. It’s absolutely wild how much information it synthesized about likely hardware failure paths unique to my exact symptoms (i.e., what does it mean when the HDMI port on the motherboard works but DisplayPort doesn’t, plus an orange DRAM light, one RAM stick in or out, one RGB light on, etc.).

Keywords for those with the same problem:
ASUS B650E-F BIOS flashback, AM5 orange DRAM light, LED firmware update loop, RTX 5090 no DisplayPort signal, GPU not detected after BIOS update, DisplayPort only works with HDMI, Aura firmware crash, AI training hard shutdown PC

Most of this is AI generated but here we go:

TL;DR (for people in crisis)

If your AM5 system appears bricked after a hard shutdown during AI training and shows some combination of:

  • Endless “BIOS is updating LED firmware” loops
  • Orange DRAM Q-LED that never clears
  • No video output from GPU
  • DisplayPort working only if HDMI is also plugged in

Your hardware is probably not dead.

What you’re dealing with is partial firmware corruption + display initialization order failure, triggered by a very specific hardware/software combination.

The fix required:

  • USB BIOS Flashback (oldest stable BIOS)
  • Aura / RGB firmware isolation
  • AM5 memory retraining patience
  • iGPU → dGPU handoff reset
  • HDMI as a temporary EDID bootstrap
  • Forcing DisplayPort to become the primary display
  • Clean NVIDIA driver re-enumeration

This cannot be solved by a single forum post or vendor doc. It requires cross-layer synthesis.

The Trigger: Hard Shutdown During AI Training

The system didn’t crash while browsing or gaming.

It hard-powered off during AI model training — sustained, atypical load.

The Perfect Storm: Why This Hardware Combination Is Fragile

This issue is far more likely with this exact stack:

1. AM5 Platform (Ryzen 7000 / 7800X3D)

  • Extremely aggressive memory training
  • Multiple firmware reinitialization passes
  • Sensitive to BIOS downgrades and CMOS clears

2. ASUS ROG Boards with Aura / RGB

  • Separate LED MCU firmware
  • Can enter recovery loops if interrupted mid-write
  • Not isolated from power events the way main UEFI is

3. RTX 5090 (Next-Gen NVIDIA GPUs)

  • Complex PCIe enumeration
  • More aggressive power state transitions
  • Heavier reliance on driver-managed display init

4. High-Refresh DisplayPort Monitor (240 Hz / 4K / VRR)

  • Stateful DisplayPort link training
  • EDID negotiation happens later than HDMI
  • Much easier to fail silently after firmware resets

5. AI Training Workloads

  • Sustained GPU power draw
  • No “natural breaks” for clean state sync
  • Hard shutdown hits at the worst possible moment

Individually, none of these are a problem.
Together, they create a non-linear failure mode.

What Actually Broke (and What Didn’t)

Not broken:

  • CPU
  • RAM
  • GPU
  • Motherboard traces
  • Power supply

Broken / confused:

  • LED / Aura firmware state
  • BIOS assumptions about primary display
  • GPU display initialization order
  • NVIDIA driver display cache
  • DisplayPort EDID handshake timing

This is why the system looked dead while being electrically fine.

Why USB BIOS Flashback Was Critical (and EZ Flash Wasn’t)

USB BIOS Flashback:

  • Runs on a dedicated microcontroller
  • Ignores CPU, RAM, GPU, OS
  • Bypasses the LED / Aura firmware path entirely

This is not just “another way to update BIOS.”
It is firmware isolation, and it’s the only reason recovery was possible.

Key details that mattered (rarely documented clearly):

  • Using the oldest stable BIOS, not the newest
  • FAT32 + MBR formatting (macOS-safe)
  • Correct manual BIOS filename
  • Unplugging RAM and RGB during Flashback

The Orange DRAM Light: Why Waiting Was the Right Move

After Flashback:

  • System rebooted
  • Orange DRAM Q-LED stayed on
  • No video

On AM5, this often means:

  • Full memory retraining
  • Multiple internal reboots
  • Training restarting after any BIOS change

Interrupting this is how people brick boards.

Minimal config + patience was the fix:

  • One RAM stick
  • Slot A2
  • No GPU
  • No RGB

The Most Confusing Symptom (and the Key Insight)

DisplayPort only worked if HDMI was also plugged in.

This sounds like black magic. It isn’t.

What was happening:

  • HDMI provided an early, guaranteed EDID handshake
  • NVIDIA successfully initialized the display engine
  • DisplayPort trained after that
  • Without HDMI, DP never initialized

HDMI wasn’t the solution — it was the bootstrap device.

This is a known (but undocumented) NVIDIA + DP + high-refresh edge case after firmware resets.

The Fix That Made It Stick

  1. Boot with HDMI + DisplayPort
  2. Enter Windows
  3. NVIDIA Control Panel:
    • Set DisplayPort as primary
  4. Full shutdown
  5. Remove HDMI
  6. Cold boot on DP only

That forced NVIDIA to rewrite its display initialization path.

After that: DisplayPort worked independently forever.


Why This Couldn’t Be Solved by Google

No single source explains:

  • LED MCU firmware loops
  • AM5 memory training behavior
  • iGPU ↔ dGPU boot path caching
  • NVIDIA EDID initialization order
  • DP vs HDMI protocol differences
  • AI-training-specific shutdown risk

Each exists in isolation across:

  • Vendor docs
  • Scattered forum anecdotes
  • RMA support scripts
  • Driver changelogs

What solved this wasn’t information — it was synthesis:

  • Maintaining a causal model across firmware, hardware, drivers, and protocols
  • Reducing state space step by step
  • Applying changes in the only order that works

That’s why this felt “impossible” — because lookup alone was insufficient.


Final Advice (If You Ever See This Again)

  • Avoid Aura / Armoury Crate on AM5 if you value stability
  • Don’t BIOS-update unless you need a fix
  • If you see LED firmware update loops, stop immediately
  • Unplug all RGB (better yet, unplug the GPU and RAM too), then use Flashback to load an older BIOS for your motherboard with no RGB controllers attached
  • Use Flashback, not EZ Flash
  • Use HDMI to rule out DP handshake issues
  • If DP disappears after recovery, use HDMI once — intentionally
  • DP may only work with HDMI connected at the same time (not DP alone) – set DP as primary with both connected, then restart with the HDMI cable removed.

This wasn’t a broken PC.
It was a distributed system stuck in an invalid state.

And those are only fixable if you understand the whole system.