Companion piece: this essay has a looser Intuition sibling, "Do I Have to Take Gary Marcus Seriously?"

The conversation behind this

The author + AI development record for this issue: what came from the first reaction, what the process challenged, and what changed before publication.

Audio companion: Listen to this essay as a narrated audio episode: The Gary Marcus Audit.

I want Gary Marcus to be wrong.

That is not the same as knowing that he is.

More specifically, I would like to write him off. I would like to ignore his claims as irrelevant, unimportant, and, most gratifyingly, just plain wrong. If Marcus is just a crank in the road, then I do not have to do much with him. I can notice the irritation, file the warning under familiar public contrarianism, and move on.

But do my wishes hold up under scrutiny?

The crude dismissal fails quickly. Marcus is not simply anti-AI. He grants practical usefulness in some contexts while disputing the conversion of usefulness into reliability, understanding, or AGI. He argues for different AI, not no AI, and often for more machinery around current AI rather than for pure abstinence.

My synthesis of his objection is narrower and more annoying: current LLM-centered systems can be genuinely useful while remaining unreliable in ways that matter. Practical capability is being converted too quickly into claims about reliability and general intelligence. The governance and authority question is the further implication I want to test.

That claim keeps surviving the first pass.

So the question is not whether Marcus annoys me. He does.

The question is which claims survive after I subtract irritation, tribe, and politics.

The irritation is not imaginary. Marcus makes it easy to feel. He has used the headline "Generative AI was a scam", then qualified the literal fraud claim in the body. On X, he called Meta's AI direction a "data-labeling sweatshop" (mirror/context). In a post about Yann LeCun and AI-bubble warnings, he used "sociopathic" and later added that "the only constant is his ego" (direct X lead; mirror/context).

Those examples matter because they explain why this reader sometimes experiences his style as costly. Depending on your politics and your place in the AI argument, they may sound brilliant, rejuvenating, gratifying, funny, bracing, unfair, vindicating, petty, or exhausting.

They do not settle the technical claims.

They also do not measure his overall influence.

That is the whole point.

The more serious hypothesis is not that Marcus is wrong because he is irritating. It is that he may be technically right about important failure modes while still having limited update power among some readers already entangled with adoption: builders, adopters, investors, product leaders, and AI-adjacent readers who have already moved from "Should we use this?" to "How far can this go?"

That is a reception hypothesis, not a measured cultural fact.

Start with adoption.

The tools are useful. That part is no longer speculative. People and organizations use them to draft, code, summarize, translate, search, plan, explain, automate, and avoid blank pages. The question of use moved faster than the question of trust.

So what does the warning mean once many users have already made some version of the use decision?

A warning that once sounded like "do not use this" may now be heard as "do not confuse use with trust." The second warning can remain live after the first decision has been made.

That is where the architecture question becomes unavoidable.

Marcus often treats confidence that LLMs will become AGI as an intellectual or technical category error. He argues that current systems are broad but shallow, fluent but unreliable, and too weak at truth-tracking, generalization, and self-checking to count as the kind of general intelligence the word AGI was meant to name.

But the phrase "LLMs will become AGI" hides several different claims.

One claim is about LLM-centered training and inference improvement. Make the model bigger. Train it on better data. Add synthetic data. Use reinforcement learning. Let it spend more compute at inference time. The system may become far more capable, but the public story is still centered on learned foundation models and the ways they are trained, sampled, evaluated, and run.

A second claim is about surrounding systems and orchestration. Give the model tools, retrieval, external memory, code execution, search, verifiers, safety layers, routers, and agent loops. Now the system can do things a bare chatbot cannot do. But some of that gain comes from operations outside the model's weights. A calculator can provide a bounded, externally checkable operation when the system invokes it correctly. A code interpreter runs code. A verifier checks a step. The model may be coordinating the system rather than doing every part of the work internally.

It is important not to dump every new capability into that bucket. Vision, audio, and robot action are modalities or capability surfaces. They may be learned inside a model, connected through a larger system, or both. Reinforcement learning and test-time computation change training or inference behavior. Tools and external memory supply operations outside the weights. An "agent" often describes a deployment pattern in which a model repeatedly chooses and invokes tools; it is not, by itself, a theory of intelligence.
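The "model in a loop" pattern described above can be sketched in a few lines. This is a minimal illustration, not any lab's implementation: `stub_model`, `add_numbers`, `lookup`, and `run_agent` are hypothetical names, and the stub stands in for a learned model with a hard-coded policy. The point is only that the arithmetic and the lookup happen outside the "model", which merely decides what to call and when to stop.

```python
from typing import Callable

# Hypothetical tools: ordinary functions that live outside any model weights.
def add_numbers(args: str) -> str:
    a, b = (float(x) for x in args.split(","))
    return str(a + b)

def lookup(args: str) -> str:
    facts = {"capital of france": "Paris"}
    return facts.get(args.strip().lower(), "unknown")

TOOLS: dict[str, Callable[[str], str]] = {"add": add_numbers, "lookup": lookup}

def stub_model(task: str, observations: list[str]) -> tuple[str, str]:
    """Stand-in for a learned model (assumption): returns (action, argument)."""
    if not observations:
        # First step: delegate the arithmetic to a tool.
        return ("add", "2,3")
    # Second step: wrap the tool's result in an answer and stop.
    return ("finish", f"The answer is {observations[-1]}")

def run_agent(task: str, model, tools, max_steps: int = 5) -> str:
    """Repeatedly ask the model for an action and invoke the chosen tool."""
    observations: list[str] = []
    for _ in range(max_steps):
        action, arg = model(task, observations)
        if action == "finish":
            return arg
        tool = tools.get(action)
        if tool is None:
            # A model can pick a tool that does not exist; record it and go on.
            observations.append(f"error: unknown tool {action!r}")
            continue
        observations.append(tool(arg))
    return "stopped: step limit reached"

print(run_agent("What is 2 + 3?", stub_model, TOOLS))
```

Nothing in the loop guarantees that the model picks the right tool, passes the right arguments, or stops at the right time. The step limit is the only bound the loop itself provides.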

A third claim is about architecturally distinct or explicitly hybrid AI. This is closer to what Marcus usually wants. Neurosymbolic systems combine neural pattern recognition with more explicit reasoning, planning, verification, rules, search, or conventional algorithms. World models try to learn a representation of how reality changes, then use that model to predict consequences and plan actions. Those are not just bigger chatbots. They are attempts to make parts of reasoning less dependent on fluent token generation.

Once the categories are separated, the argument changes.

If Marcus means that bare LLM pretraining scale alone will not produce dependable general intelligence, he may be arguing against a view that is no longer the dominant explicit public roadmap. But that does not mean strong scaling confidence has disappeared.

The reviewed labs do not publicly reduce their programs to bare pretraining scale alone. They combine scaling with reinforcement learning, test-time computation, tools, orchestration, multimodality, robotics, verification, interpretability, and safeguards. At the same time, some lab leaders still make forceful scaling-law forecasts about rapidly increasing general capability.

On the public record, OpenAI's reasoning work is best classified as LLM-centered training and inference improvement, often embedded in a larger tool-using system. That is an editorial taxonomy, not a disclosure of every proprietary component. Google DeepMind's robotics work is more hybrid: it includes embodied reasoning and vision-language-action models that turn perception and instructions into robot action, while still building on a Gemini foundation-model core. LeCun's JEPA and world-model work, developed at Meta and now pursued through Advanced Machine Intelligence, is the cleaner public example of a distinct architectural bet. That should not be confused with Meta's current flagship messaging, which again emphasizes a scalable LLM family and agentic orchestration.

That leaves Marcus with a narrower but stronger target: not "serious labs publicly believe bare scaling is enough," but "public confidence in LLM-centered systems may still be outrunning the reliability, grounding, security, and governance those systems have demonstrated."

The distinction matters because "agents," "reasoning," "tools," and "multimodal systems" can be labels that clarify nothing. An agent can be an LLM in a loop. A tool-using model can still be brittle about when to use the tool. A reasoning model can spend more compute and still hallucinate. A multimodal model can see more of the world without having a stable model of cause, consequence, and truth.

The question is not whether labs are doing more than 2022-style chatbots. They are.

Nor is the question whether routing around the failures Marcus describes is legitimate. In real-world systems, a workaround that is stable, inspectable, and bounded can be a solution. The question is which domains actually provide those boundaries.

Coding is a comparatively favorable example. The model can propose code. The surrounding software environment can compile it, run tests, execute a linter, and surface stack traces. That is external machinery with unusually crisp failure signals. Those checks do not guarantee correct requirements, complete tests, or secure software. They do make many failures visible faster than an open-ended prose answer does.
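A small sketch makes the "crisp failure signal" concrete. It assumes a hypothetical harness, `check_candidate`, that takes model-proposed source as a string, compiles it, and runs a handful of test cases. The candidate strings and tests are invented for illustration, and a real system would run untrusted code in a sandbox rather than call `exec` directly.

```python
def check_candidate(source: str, func_name: str, tests) -> list[str]:
    """Return a list of failure messages; an empty list means every check passed."""
    try:
        code = compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return [f"compile error: {exc.msg}"]
    namespace: dict = {}
    try:
        # Illustration only: never exec untrusted code outside a sandbox.
        exec(code, namespace)
    except Exception as exc:
        return [f"load error: {exc!r}"]
    func = namespace.get(func_name)
    if not callable(func):
        return [f"missing function: {func_name}"]
    failures = []
    for args, expected in tests:
        try:
            got = func(*args)
        except Exception as exc:
            failures.append(f"{func_name}{args} raised {type(exc).__name__}")
            continue
        if got != expected:
            failures.append(f"{func_name}{args} returned {got!r}, expected {expected!r}")
    return failures

# Hypothetical candidates a model might propose for a median function.
GOOD = (
    "def median(xs):\n"
    "    s = sorted(xs)\n"
    "    n = len(s)\n"
    "    mid = n // 2\n"
    "    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2\n"
)
BUGGY = "def median(xs):\n    return sorted(xs)[len(xs) // 2]\n"
BROKEN = "def median(xs) return 0"

TESTS = [(([1, 3, 2],), 2), (([1, 2, 3, 4],), 2.5)]

for name, src in [("good", GOOD), ("buggy", BUGGY), ("broken", BROKEN)]:
    print(name, check_candidate(src, "median", TESTS))

# With only the odd-length test, the buggy version passes every check.
print("buggy, thin tests:", check_candidate(BUGGY, "median", TESTS[:1]))
```

The last line is the caveat in miniature: the harness surfaces the syntax error and the even-length bug quickly, but with an incomplete test set the buggy candidate looks clean.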

That is not how every domain works.

Medicine, law, education, management, security, government, and personal advice do not all provide crisp compilers. Many of their hardest errors are delayed, contextual, social, or hidden. They also happen inside a world that keeps changing. A policy changes. A patient develops new symptoms. A precedent shifts. A number that was true in the morning is stale by afternoon. A model's initial answer can remain fluent while the world it described has moved.

In those domains, broad performance is related to dependable general intelligence, but it is not identical to it. A model can be stunningly capable across many benchmarks and still fail in exactly the kind of case where a person or institution is tempted to trust it.

This is where Mythos complicates the argument.

The current Anthropic/Fable/Mythos dispute refuses to stay on either side of the argument. Anthropic's public launch material described Fable 5 and Mythos 5 as highly capable models with stronger long-horizon, software, scientific, visual, and memory-related performance. Those are Anthropic's claims and evaluations, not an independent public decomposition of the models. "Mythos class" is a capability and access designation in Anthropic's materials, not an architecture category.

On June 12, Anthropic said a U.S. export-control directive forced the company to disable access to both models. Anthropic said the directive did not disclose a sufficiently specific technical basis and argued that the cited behavior was narrow, reproducible elsewhere, and better managed through defense in depth. Axios reported a government-side account involving a jailbreak report and subsequent testing. The complete technical record is not public, and an outside reader cannot settle the dispute from the available documents.

That creates a split pressure test.

On the capability side, Anthropic's claims and the government's treatment of the models put pressure on casual claims that frontier systems are strategically trivial or economically inconsequential.

They do not settle the ceiling of LLM-centered systems.

The public sources do not reveal how much performance came from the underlying learned model, test-time computation, tools, memory, orchestration, specialized environments, or human supervision. Nor does government concern independently validate the strongest capability claims.

On the governance side, the dispute shows how quickly strong capability claims can coexist with disagreement about safeguards, access, evidence, and acceptable residual risk.

Anthropic's defense was not that perfect reliability exists. It was that perfect jailbreak resistance does not appear possible today, that defense in depth is the realistic standard, and that the demonstrated behavior was narrow and available from other deployed models too. The lab's argument was not "there is no risk." It was closer to: this risk exists across models, the safeguards are comparatively strong, and the government's standard is technically and commercially overbroad.

That is almost the whole tension.

The system may be important enough to matter while its security, alignment, and dependable operating limits remain disputed or incompletely demonstrated. Capable enough for national-security concern does not automatically mean fit for high-trust institutional authority. In cyber, even a system that succeeds inconsistently can be dangerous because attackers can retry, filter failures, and chain partial successes. In medicine, law, finance, government, or education, the tolerance for hidden error, however small, is far lower.
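The asymmetry between the attacker's position and the institution's can be put as simple arithmetic. The sketch below assumes independent attempts and uses made-up rates (5% per-attempt success for an attacker, 99% per-use correctness for an institution). Neither number comes from the Fable/Mythos record or any other source. The function names are hypothetical.

```python
def p_any_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts

def p_all_correct(p_single: float, uses: int) -> float:
    """Chance that every one of `uses` independent uses is correct."""
    return p_single ** uses

# Attacker's view: an unreliable capability, retried and filtered for hits.
print(f"5% per attempt, 100 attempts: {p_any_success(0.05, 100):.3f}")

# Institution's view: a reliable-looking capability, used with no retries.
print(f"99% per use, 100 uses, zero errors: {p_all_correct(0.99, 100):.3f}")
```

Under these illustrative assumptions, the attacker with a 5% tool is close to certain of at least one success, while the institution with a 99% tool has only about a one-in-three chance of getting through a hundred uses without an error. Real attempts are rarely independent, so the numbers show the direction of the asymmetry, not its size.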

So the reliability critique is not obsolete. It has moved.

It is no longer enough to say "the model is too dumb." The capabilities Anthropic claimed for Fable and Mythos make that sound complacent. The live concern is sharper: a system can be capable enough to create serious institutional leverage while still not being secure, governed, grounded, or dependable enough for the authority it is being handed.

That is a better version of Marcus than the one I wanted to dismiss.

It is also a harder version for Marcus himself.

Marcus has supplied meaningful AGI criteria: flexibility, generality, resourcefulness, depth, reliability, and the ability to generalize and check answers. If "LLMs are not the path to AGI" means "scaling alone will not solve reliability, grounding, and generalization," the claim is serious and increasingly specific. If it means "no LLM-centered system can ever reach transformative general capability," it needs clearer update conditions.

What sustained performance would count against a low-ceiling claim? What reliability would have to persist outside benchmark-like settings? When does a successful mixed system count as evidence for an LLM-centered path, and when does it count as the hybrid architecture Marcus expected?

Those are not trick questions. They are necessary if the thesis is to remain discriminating as systems change. The fair pressure on Marcus is the same pressure the essay puts on my dismissal reflex: specify the claim enough that reality can push back.

So where does the audit land?

Not with endorsement.

I still do not want Marcus as my oracle for AI because his rhetoric often makes this reader do avoidable cleanup. The claim arrives entangled with scorekeeping, personal rivalry, political heat, or a sense that every new failure confirms what he already knew. That style may give his opponents permission to ignore him cheaply.

Some of the reception critique holds.

But not enough to wash away the claims.

The reviewed sources support this much:

  • Marcus is not simply anti-AI.

  • He grants practical usefulness in some contexts.

  • He has a longstanding hybrid-AI program.

  • He distinguishes current AI risk from AGI or superintelligence risk.

  • He treats agents as a case in which reliability failures matter more because systems can act.

  • He has substantive criteria behind his broad-but-shallow criticism.

The reviewed sources do not establish this much:

  • That his warnings lack update power with the readers most entangled with adoption.

  • That builders, adopters, or investors have fully absorbed his argument.

  • That the architecture or capability ceiling of LLM-centered systems has been settled.

  • That Fable or Mythos proves either side's architecture thesis.

  • That current labs rely on bare scaling alone.

  • That his public criteria already provide a clear update rule for every future mixed system.

My own conclusion is narrower than endorsement.

Usefulness has been overconverted into reliability in some public and product rhetoric. Agents make reliability failures matter more because the system can act. Serious AI risk does not require superintelligence; powerful but unreliable AI with access can create institutional harm before anything like settled AGI arrives.

Marcus is on shakier ground if the claim is a hard low ceiling on LLM-centered capability. The capabilities Anthropic claims for Fable and Mythos make that harder to say casually, even though they do not settle the question. Scaling, reinforcement learning, test-time compute, synthetic data, tools, and verification may produce far more general capability than some anti-scaling rhetoric makes room for.

But that does not collapse the governance critique. It clarifies it.

The alternative to AI hype is not abstinence from useful tools. The alternative to dismissing Marcus is not surrendering to his whole worldview. The alternative to trusting frontier AI is not treating it as useless. A warning can be live even if its messenger is costly. A technology can be useful, powerful, and institutionally consequential before its reliability and governance are mature. A culture can accept a tool before it has understood the risk it is accepting.

The hard work is keeping those sentences separate.

Self-critique sounds simple. Write the sentence you want to be true. Write the strongest version of the claim you are tempted to dismiss. Name what would embarrass your view if it turned out to be real.

That is useful. It is also hard to do from inside your own head. Motivated reasoning rarely announces itself as motivated reasoning. Filtering feels like discernment while it is happening. The story you prefer can sound like judgment, taste, prudence, standards, or realism.

This is where an AI-assisted editorial process can be genuinely useful, not because the model knows what to believe, but because it can be made to work against preference.

Write down your view of a person, institution, technology, or claim, then ask a strong model for the best adversarial critique. The result may expose a missing distinction, a weak source, a false certainty, a category error, or the inconvenient part of the story you wanted to smooth away.

But the model's critique is not evidence. It can invent objections, misread sources, flatten categories, or sound certain about a false distinction. Its output still has to be checked against the record. The value is not that the model knows what to believe. The value is that it can make avoidance more difficult.

I began with a wish: I want Gary Marcus to be wrong.

The audit gives me something less convenient.

Marcus may have limited update power with some people who think like I do.

He may also be right about enough that dismissing him says more about my filters than about his claims.

What this is: Field Notes testing an irritating but potentially useful AI critic against primary-source claims, public architecture descriptions, and a dated governance dispute. It is not a Gary Marcus profile, a proprietary architecture analysis, policy advice, investment advice, or proof that an editorial process makes the conclusions true.

Review date: June 20, 2026. The Fable/Mythos access dispute and company descriptions are time-sensitive.

Confidence: Medium on the Marcus claim map; medium on the public-source architecture taxonomy; low on proprietary internals and the government's complete technical rationale; medium-low on Marcus's update power with adoption-entangled readers, which remains an unmeasured reception hypothesis.

What would change this view over the next 12-24 months: sustained, inspectable evidence that LLM-centered systems operate reliably in changing high-trust contexts; clearer evidence about how Marcus would update his low-ceiling claim when mixed systems succeed; or measured evidence that his warnings materially alter, or fail to alter, deployment and governance decisions.

Process transparency: AI tools assisted with drafting and adversarial critique. The human author selected the frame, checked the sources, made the judgments, and owns the published claims and errors. Process review is not evidence that the claims are true.

Sources and anchors

Marcus's positions

Current public architecture framing

Fable/Mythos

Rhetoric examples

  • Marcus's "Generative AI was a scam" headline/body contrast.

  • Meta "data-labeling sweatshop" direct X lead (https://x.com/i/status/2067375496346894404) plus mirror/context.

  • LeCun "sociopathic" / "ego" direct X lead (https://x.com/i/status/2067592852587401567) plus mirror/context.

Source limitations

  • Public material does not reveal complete proprietary architecture. The taxonomy is an editorial synthesis of disclosed system features.

  • "Mythos class" is an Anthropic capability/access designation, not an independently verified architecture class.

  • Anthropic and government-side accounts do not provide a complete public technical record.

  • X is unstable as a reference surface. Mirrors are included for inspectability, not as stronger evidence than original posts.

  • Marcus's update power with adoption-entangled readers remains a reception hypothesis, not an empirical measurement.
