GLM-5.2 Benchmark Deep Dive: Open-Weight Frontier
Marcus Bell
Frontier Models Correspondent

TLDR: How Z.ai's GLM-5.2 stacks up across Code Arena, GDPval-AA, ARC-AGI-2, FrontierSWE, and Artificial Analysis, and what the bench data still can't tell us.
GLM-5.2 Benchmark Deep Dive: How Z.ai's Open-Weight Model Closed the Frontier Gap
Ten days ago, GLM-5.2 was an unannounced number on Artificial Analysis. Today it sits at #3 on a real-world agentic benchmark, has the CEO of Vercel calling its coding output "almost shocking," and is the only open-weight model anyone is seriously comparing to Claude Opus 4.8.
TLDR: Z.ai's GLM-5.2, released June 13 and benchmarked publicly from June 16 onward, scored 51 on the Artificial Analysis Intelligence Index (4th overall, 1st among open weights), surpassed every Claude Opus variant on Code Arena: Frontend at 1595, ranked #3 on GDPval-AA ahead of GPT-5.5 (xhigh), and scored 22.8% on ARC-AGI-2 at roughly 1/7 the per-task cost of GPT-5.5. Independent commentators now put the frontier gap between US closed models and Chinese open models at roughly seven months. Z.ai has not released a detailed architecture paper, at least one commentator reports internal evals weaker than the published benchmarks, and reproducibility remains the open question.
Key Takeaways
- GLM-5.2 reached 51 on the Artificial Analysis Intelligence Index, the new SOTA open-weight score and 4th overall behind Fable 5, Opus 4.8, and GPT-5.5 (xhigh).
- On Code Arena: Frontend, the Max tier hit 1595, surpassing every Claude Opus variant and closing in on Fable 5 at 1665.
- Pricing on Fireworks is $1.40 input / $4.40 output per 1M tokens — roughly 5x–7x cheaper than Opus 4.8 and GPT-5.5 on output.
- The model is MIT-licensed open weights, with a 1M-token context and 131,072-token max output, runnable locally via 1-bit GGUF on consumer hardware.
- Reproducibility is contested: at least one prominent commentator calls it bench-maxxed, while Teortaxes notes GLM-5.1 scored 0.0% on at least one bench GLM-5.2 now performs well on.
What GLM-5.2 Actually Shipped
GLM-5.2 was released on June 13, 2026 by Z.ai (formerly Zhipu AI), and the first benchmark scorecard appeared three days later on Artificial Analysis. The release pattern matters: Z.ai shipped weights first and numbers second, which compressed the usual launch-narrative window into a 72-hour discovery period.
The model is MIT-licensed open-weight. Fireworks lists it as a 743B-parameter Mixture-of-Experts model with a 1,040K context window and 131,072-token max output. The product page also references IndexShare, a new attention-side architecture, and an improved Multi-Token Prediction (MTP) layer that Fireworks credits with longer speculative-decoding chains. No formal Z.ai paper has been published explaining either component in detail.
API access went live across multiple providers within days. AI/ML API listed GLM-5.2 on June 16 with a head-to-head test against Opus 4.8 on a Backrooms game implementation — GLM-5.2 finished in 1:08 at $0.37 versus Opus's 2:14 at $1.94. Fireworks priced the model at $1.40 input / $4.40 output per 1M tokens, with $0.26 for cached input.
For local inference, Unsloth published 1-bit GGUF quantizations that run at ~21.6 tokens per second on a Mac Studio M3 Ultra with 256GB RAM. That puts a frontier-adjacent open model on a consumer desktop, which is the part of this release with the longest tail.
The Benchmark Sweep
The benchmark coverage on GLM-5.2 is unusually broad for a model released less than two weeks ago. The following are confirmed from named sources in the available signal:
- Artificial Analysis Intelligence Index: 51, 4th overall, 1st among open weights. TestingCatalog noted it ranked second on Frontend Code Arena behind only the currently restricted Claude Fable 5.
- FrontierSWE: 74.4%, with Mark Kretschmann posting it as "just behind Opus 4.8 at 75.1% and ahead of GPT-5.5 at 72.6%".
- SWE-bench Pro: 62.1 per Lushbinary's June 16 comparison, ahead of GPT-5.5 at 58.6.
- VibeCodeBench: 63.96%, an enormous jump from GLM-5.1's 31.46% per Teortaxes's series breakdown.
- Code Arena: Frontend: 1595 at Max tier, per Arena.ai's head-to-head trajectory post. The model beat every Claude Opus variant in matched pairings, including a 55.0% win share against Opus 4.7 (Thinking) and 59.4% against Sonnet 4.6.
- GDPval-AA: #3 overall in Max tier, ahead of GPT-5.5 (xhigh), per Chubby's writeup. GDPval-AA evaluates real-world agentic deliverables such as retail task lists, circuit schematics, and music-video moodboards.
- DeepSwe: 44% at $3.92 / 78K tokens, compared to Gemini 3.5 Flash at 37% / $7.34 / 276K tokens, per Haider's June 21 post.
- ARC-AGI-2: 22.8% at $0.25/task, per ChrissGPT's verified screenshot. Far behind GPT-5.5 at 85%, but 7.6x above the best verified score from May 2025.
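The win shares quoted in the Code Arena entry above can be related to rating gaps. The sketch below assumes the standard Elo logistic model (base 10, scale 400) and ignores ties; whether Arena.ai's published scores follow exactly this convention is an assumption, and the function names are illustrative.

```python
import math


def elo_gap_from_win_share(win_share: float) -> float:
    """Rating gap implied by a head-to-head win share under the Elo logistic model."""
    if not 0.0 < win_share < 1.0:
        raise ValueError("win_share must be strictly between 0 and 1")
    return 400.0 * math.log10(win_share / (1.0 - win_share))


def win_share_from_elo_gap(gap: float) -> float:
    """Inverse mapping: expected win share for a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))


if __name__ == "__main__":
    # Win shares quoted in the Code Arena entry above.
    for opponent, share in (("Opus 4.7 (Thinking)", 0.550), ("Sonnet 4.6", 0.594)):
        print(f"vs {opponent}: implied gap of about {elo_gap_from_win_share(share):.0f} points")
```

Under this model a win share only modestly above 50% corresponds to a gap of a few dozen rating points, which is worth keeping in mind when reading leaderboard differences.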

Two more numbers are worth flagging. Teortaxes posted a private-eval result showing GLM-5.2 closer to Opus 4.8 than Sonnet 4.6 on a benchmark GLM-5.1 scored 0.0% on — a useful counter to claims of direct benchmark targeting. Separately, Teortaxes tracked Artificial Analysis's AA-Omniscience hallucination rate moving from 67% (GLM-4.5) and 95% (GLM-4.6) down to 28% (GLM-5.2), which suggests improved RL stabilization across the version line.
The Pricing and Inference Math
The pricing column is what makes the benchmark column matter. Cost lines up like this:
- GLM-5.2 (Fireworks serverless): $1.40 / $4.40 per 1M input/output tokens.
- Claude Opus 4.8: $5.00 / $25.00 per 1M.
- GPT-5.5: $5.00 / $30.00 per 1M.
That is a roughly 5.7x to 6.8x gap on output tokens versus the two closed frontier models. On the trading-desk coding test reported by Rohan Paul, Sakana's Fugu Ultra produced richer output but at 17x the cost of GLM-5.2 — and Fugu is a multi-model orchestration layer, not a single model.
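The output-price ratios above follow directly from the listed rates. A minimal sketch of the arithmetic, using only the per-1M-token prices quoted in this section (the dictionary layout and function name are illustrative):

```python
# Per-1M-token list prices quoted above: (input USD, output USD).
PRICES = {
    "GLM-5.2 (Fireworks)": (1.40, 4.40),
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def price_ratio(model: str, baseline: str = "GLM-5.2 (Fireworks)") -> tuple[float, float]:
    """Return (input ratio, output ratio) of `model` relative to `baseline`."""
    m_in, m_out = PRICES[model]
    b_in, b_out = PRICES[baseline]
    return m_in / b_in, m_out / b_out


if __name__ == "__main__":
    for name in ("Claude Opus 4.8", "GPT-5.5"):
        in_ratio, out_ratio = price_ratio(name)
        print(f"{name}: {in_ratio:.1f}x input, {out_ratio:.1f}x output")
```

The input-side gap is smaller (about 3.6x for both closed models), so the blended saving depends on the input/output mix of a given workload.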
A Hacker News thread on the Artificial Analysis listing surfaced a concrete reasoning-token tradeoff. Per the discussion of GPT-5.5 vs GLM-5.2 reasoning efficiency, GPT-5.5 xhigh averages around 16K tokens to first-file on a math-evaluator task, while GLM-5.2 (max) spent 45K tokens and over 15 minutes reasoning. GLM-5.2 wins on per-token cost; GPT-5.5 still wins on reasoning efficiency, which is the relevant axis for latency-sensitive agents.
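To see how token counts and per-token prices interact, the sketch below multiplies the token figures from the thread by the output prices listed earlier. It assumes, for simplicity, that all reasoning tokens are billed at the output rate and it ignores input tokens and caching; both are simplifications rather than details confirmed by the source.

```python
def output_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of `tokens` output tokens at a per-1M-token price."""
    return tokens * price_per_million / 1_000_000


def break_even_token_multiple(cheap_price: float, expensive_price: float) -> float:
    """How many times more tokens the cheaper model can spend before costs match."""
    return expensive_price / cheap_price


# Token counts from the Hacker News discussion; prices from the list above.
gpt_cost = output_cost_usd(16_000, 30.00)  # GPT-5.5 xhigh
glm_cost = output_cost_usd(45_000, 4.40)   # GLM-5.2 (max)

if __name__ == "__main__":
    print(f"GPT-5.5 xhigh: ${gpt_cost:.3f}")
    print(f"GLM-5.2 max:   ${glm_cost:.3f}")
    print(f"Token multiple: {45_000 / 16_000:.1f}x")
    print(f"Break-even multiple: {break_even_token_multiple(4.40, 30.00):.1f}x")
```

Under these assumptions GLM-5.2 still costs less per task despite spending about 2.8x the tokens. The calculation does not capture wall-clock time, which is where the 15-minute reasoning run hurts.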
For self-hosting math, the Unsloth 1-bit GGUF result is the new floor: a 743B MoE running at ~21.6 tok/s on a Mac Studio M3 Ultra with 256GB unified memory. That isn't most developers' setup, but it is achievable hardware, and it is the first time a frontier-adjacent model is plausibly local.
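A rough way to sanity-check whether a quantized checkpoint fits in memory is parameter count times bits per weight. The sketch below applies that to the 743B figure. The bits-per-weight values are illustrative assumptions: the actual average for this quantization is not stated in the sources, and dynamic low-bit schemes may keep some layers at higher precision. KV cache and runtime overhead are also ignored.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weights-only size in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    PARAMS_B = 743  # parameter count as listed by Fireworks
    RAM_GB = 256    # Mac Studio configuration cited above
    for bpw in (1.0, 1.6, 2.0):  # hypothetical effective bits per weight
        size = weights_gb(PARAMS_B, bpw)
        print(f"{bpw:.1f} bpw -> {size:.0f} GB, fits in {RAM_GB} GB: {size < RAM_GB}")
```

Even at a hypothetical 2 bits per weight the weights alone stay under 256GB, which is consistent with the reported hardware, though long contexts would add memory pressure this estimate does not model.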
GLM-5.2 vs Claude Opus 4.8: What the Signal Says
Opus 4.8 is the only model that GLM-5.2 is being repeatedly compared against in matched-pair testing across the available signal. Here is what the bundle actually shows:
- License & deployment: GLM-5.2 ships MIT-licensed open weights with HuggingFace distribution and 1-bit GGUF support. Opus 4.8 is closed and API-only.
- Context window: Both list 1M-token windows per the Lushbinary comparison. GLM-5.2 publishes a 131,072-token max output cap; Opus 4.8's max output is described as "high" but no exact number appears in this signal set.
- FrontierSWE: 74.4% (GLM-5.2) vs 75.1% (Opus 4.8) — within a point per Mark Kretschmann's reading of the official scorecard.
- Code Arena: Frontend: 1595 (GLM-5.2 Max) vs Opus 4.8 sitting below it per Arena.ai's June 25 chart. GLM-5.2 takes the higher win share in every matched head-to-head pairing against an Opus variant.
- Pricing: $1.40 / $4.40 vs $5.00 / $25.00 per 1M — roughly 5.7x cheaper on output.
- Agent reliability: A third-party report Teortaxes quoted claims zero failed runs for GLM-5.2 across 84 runs versus ~10% failure for Opus agents — unverified in the available signal but worth tracking.
What the bundle doesn't settle: SWE-bench Pro head-to-head (Lushbinary marked Opus 4.8 as "not directly comparable"), and any hard private-eval ranking. Teortaxes himself notes that on "hard model evals" GLM-5.2 lands "around 4.8" in some cases and well below in others — the "on par with X" framing varies a lot by benchmark.
The other consistent caveat: GLM-5.2 lacks image understanding, which limits it on visual analysis workflows where Opus and Fable can read charts directly. Chubby flagged this as the binding constraint on GLM-5.2's autoresearch use case.
What We Know vs. What We Don't
What we can verify from public sources
- GLM-5.2 reached 51 on the Artificial Analysis Intelligence Index, the highest open-weight score recorded so far, per TestingCatalog.
- GLM-5.2 (Max) hit 1595 on Code Arena: Frontend, surpassing every Claude Opus variant per Arena.ai.
- Fireworks lists GLM-5.2 as 743B parameters, MoE, 1,040K context, 131,072 max output at $1.40/$4.40 per 1M tokens.
- The model is MIT-licensed open weights distributed via HuggingFace under the zai-org organization.
- On ARC-AGI-2, GLM-5.2 scored 22.8% at $0.25/task versus GPT-5.5 at 85% — far behind on capability, far ahead on cost-per-task.
- On GDPval-AA, GLM-5.2 (Max) ranked #3 overall, ahead of GPT-5.5 (xhigh).
- The Vercel CEO publicly described being "almost shocked" by GLM-5.2's coding output.
- Unsloth's 1-bit GGUF runs at ~21.6 tok/s on a Mac Studio M3 Ultra (256GB RAM), confirming consumer-hardware viability for the smallest quantization.
- Zhipu's founder reportedly believes a Fable-class Chinese model could arrive before end-of-2026.
What remains unverified or contested
- Whether GLM-5.2 is bench-maxxed. Bindu Reddy explicitly claimed it with the qualifier that internal evals are weaker than published numbers. Teortaxes argues against direct benchmark targeting given GLM-5.1's 0.0% on at least one eval GLM-5.2 now passes.
- No detailed model card or architecture paper has been published by Z.ai. IndexShare and the improved MTP layer are described only in Fireworks's product copy.
- Training data composition and size: no public numbers.
- Active parameters per token (the activated slice of the 743B MoE): not disclosed in the available signal.
- High vs Max effort-tier algorithmic difference: not documented beyond "more reasoning tokens."
- Image understanding: confirmed absent, per Chubby's autoresearch writeup — but no roadmap for GLM-5.3V is published, despite Teortaxes hoping for one.
- Reasoning efficiency: GLM-5.2 max appears to spend roughly 2.8x as many tokens as GPT-5.5 xhigh on equivalent reasoning (45K vs. about 16K), per the Hacker News discussion. Not yet measured systematically.
- The "zero failed runs across 84" agent-reliability claim is unverified.
- Whether the 7-month frontier gap estimate holds beyond Code Arena and GDPval. On ARC-AGI-2 the gap is much larger; on private hard evals Teortaxes describes more variance.
- Anthropic and OpenAI public response: none documented in the available signal.
Why This Matters for Builders
The practical question for engineering teams is whether GLM-5.2 is good enough to displace closed-frontier reliance for specific workloads. On the available data the answer is workload-dependent.
For frontend coding and React/HTML generation, the Code Arena trajectory is hard to argue with. GLM-5.2 (Max) beats Opus 4.8 head-to-head on real-world web-dev tasks, and at ~1/6 the output cost. For agencies doing high-volume UI work, the cost math alone justifies an evaluation.
For long-horizon agentic work, the GDPval-AA #3 placement and the agent-reliability anecdote ("zero failed runs across 84") point in the same direction, but verification matters. Run a 10-task agent eval on your own harness before betting infrastructure on the reliability claim.
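Sample size matters when checking a zero-failure claim. As a sketch, the exact one-sided upper bound on the true failure rate after observing zero failures in n independent runs is 1 - alpha^(1/n). The function name and the 95% confidence level are choices made here for illustration, and independence between runs is an assumption.

```python
def zero_failure_upper_bound(n_runs: int, confidence: float = 0.95) -> float:
    """Upper bound on the failure rate given zero observed failures in n_runs.

    Solves (1 - p) ** n_runs = 1 - confidence for p.
    """
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_runs)


if __name__ == "__main__":
    for n in (10, 84, 300):
        bound = zero_failure_upper_bound(n)
        print(f"n={n}: true failure rate could still be up to {bound:.1%}")
```

Ten clean runs cannot rule out a 10% failure rate, so a 10-task run works as a smoke test rather than a confirmation. 84 clean runs, if the report holds up, would be consistent with a true failure rate below roughly 3.5%.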
For research acceleration, Chubby's autoresearch result — debugging multi-node H100 RL training experiments — is the most concrete capability claim outside benchmarks. The lack of image understanding is the binding constraint; if your workflow needs chart interpretation, GLM-5.2 has to fall back to programmatic WandB analysis.
For latency-sensitive production, GPT-5.5 still wins on reasoning efficiency. GLM-5.2 Max spending 45K tokens and 15 minutes on a math-evaluator scaffold is a real cost on user-facing systems, even if the per-token rate is lower.
For self-hosted inference, the Unsloth 1-bit GGUF result changes the conversation. A frontier-adjacent model running locally on $5K of consumer hardware is the kind of capability shift that filters into product roadmaps over months, not weeks.
How to Evaluate It Yourself
Before relying on any of the published numbers, the community has converged on a small set of evaluation patterns worth replicating:
- Run a private SWE-style task you've never published. Teortaxes's argument against bench-maxxing rests on GLM-5.2 performing well on benchmarks GLM-5.1 scored zero on. Mirror that test on your own internal eval.
- Compare reasoning token consumption against GPT-5.5 high/xhigh on equivalent tasks. The per-token price advantage shrinks in proportion to any extra tokens spent: at 3x the reasoning tokens, a 6.8x output-price gap becomes roughly a 2.3x per-task gap.
- Measure tool-call reliability over a multi-turn agentic task. The "zero failed runs across 84" claim is the most interesting and least verified data point in the entire signal set.
- Test the 1M context window past 200K tokens. Z.ai claims the window is usable rather than degraded; Opus's window is similarly claimed; neither has been independently stress-tested in this signal set.
- Validate on your domain. The bundle includes a meaningful skeptic note from a Reddit thread on r/theprimeagen where a user reports a smaller MiniMax M3 model outperformed GLM-5.2 on code review and verification. Benchmark wins don't transfer uniformly.
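One way to organize the checks above is a small harness that records, per model, whether each private task passed and how many output tokens it consumed. The sketch below is illustrative only: `RunResult`, `summarize`, the model labels, and the sample numbers are hypothetical, and the actual model calls are left to whatever client your provider supplies.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    model: str
    task_id: str
    passed: bool
    output_tokens: int


def summarize(
    results: list[RunResult], price_per_million: dict[str, float]
) -> dict[str, dict[str, float]]:
    """Aggregate pass rate, mean output tokens, and mean output cost per model."""
    summary: dict[str, dict[str, float]] = {}
    for model in sorted({r.model for r in results}):
        runs = [r for r in results if r.model == model]
        mean_tokens = mean(r.output_tokens for r in runs)
        summary[model] = {
            "pass_rate": sum(r.passed for r in runs) / len(runs),
            "mean_output_tokens": mean_tokens,
            "mean_output_cost_usd": mean_tokens * price_per_million[model] / 1_000_000,
        }
    return summary


if __name__ == "__main__":
    # Made-up placeholder results; replace with runs from your own harness.
    demo = [
        RunResult("model-a", "t1", True, 40_000),
        RunResult("model-a", "t2", False, 50_000),
        RunResult("model-b", "t1", True, 15_000),
        RunResult("model-b", "t2", True, 17_000),
    ]
    prices = {"model-a": 4.40, "model-b": 30.00}
    for model, stats in summarize(demo, prices).items():
        print(model, stats)
```

Keeping pass rate, token use, and cost side by side makes it harder to read a per-token price win as a per-task win.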
The reproducibility loop matters more than any individual leaderboard. The benchmarks GLM-5.2 wins are likely the ones where its training emphasized the relevant capability; the ones it loses (mathematics relative to other open models, per Teortaxes) reveal the shape of that emphasis.
The Geopolitical Frame
The benchmark discussion has a second layer that's harder to evaluate but worth noting. Multiple commentators in the signal — Haider, Teortaxes, Chubby — converge on a "7-month frontier gap" estimate based on GLM-5.2's placement. Elon Musk reportedly extends that to a similar timeline for open-source reaching Claude Fable.
These estimates are predictions, not measurements. The gap on ARC-AGI-2 is much larger than 7 months. The gap on Code Arena: Frontend is reportedly closed. The reasonable read is that the gap is benchmark-dependent, and "7 months" is an average that obscures wide variance.
What is measurable: Zhipu shipped GLM-5.1 in late May and GLM-5.2 in mid-June, a roughly two-week intra-generation cadence. If GLM-5.3 maintains that pace and continues the per-version capability uplift, the schedule alone forces a re-evaluation of US-China frontier timing assumptions.
What to Watch Next
Three signals are worth tracking over the next two to four weeks:
- Z.ai's official model card and architecture paper. IndexShare and the improved MTP layer are currently described only in third-party product copy. A formal release would either validate or recalibrate the open-weight architecture story.
- Independent reproduction of the agent-reliability claim. The "zero failed runs across 84" anecdote is the most consequential and least verified data point in the bundle. Watch for a public harness publishing matched-pair runs between GLM-5.2 and Opus 4.8.
- GLM-5.3 cadence. If Zhipu maintains a roughly 2–4 week minor-version cadence and another VibeCodeBench-scale uplift appears, the "7-month gap" narrative will need to be re-priced. Watch the NVIDIA NIM forum thread for serving-side adoption signals and HuggingFace for weight-drop activity.
About Marcus Bell
Marcus reports on frontier model launches and leaks, weighing community testing against official specs.