Zhipu AI Releases GLM-4.7-Flash: A 30B-A3B MoE Model for Efficient Local Coding and Agents


GLM-4.7-Flash is a new member of the GLM-4.7 family and targets developers who want strong coding and reasoning performance in a model that is practical to run locally. Zhipu AI (Z.ai) describes GLM-4.7-Flash as a 30B-A3B MoE model, i.e., roughly 30B total parameters with about 3B active per token, and presents it as the strongest model in the 30B class, designed for lightweight deployment where performance and efficiency both matter.

Model class and position inside the GLM-4.7 family

GLM-4.7-Flash is a text generation model with 31B parameters, BF16 and F32 tensor types, and the architecture tag glm4_moe_lite. It supports English and Chinese and is configured for conversational use. GLM-4.7-Flash sits in the GLM-4.7 collection next to the larger GLM-4.7 and GLM-4.7-FP8 models.

Z.ai positions GLM-4.7-Flash as a free-tier, lightweight deployment option relative to the full GLM-4.7 model, while still targeting coding, reasoning, and general text generation tasks. This makes it attractive for developers who cannot deploy a 358B-class model but still want a modern MoE design and strong benchmark results.

Architecture and context length

In a Mixture of Experts architecture of this kind, the model stores more parameters than it activates for each token. A router selects a small subset of expert networks per token, which allows specialization across experts while keeping the effective compute per token closer to that of a much smaller dense model.
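To make the stored-versus-active distinction concrete, here is a minimal, illustrative top-k routing layer in PyTorch. This is a generic sketch of the MoE technique, not GLM-4.7-Flash's actual implementation; the dimensions, expert count, and k are invented for the example.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Generic top-k MoE layer: stores n_experts MLPs but activates only k per token."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)     # choose k experts per token
        weights = weights.softmax(dim=-1)              # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoE()
x = torch.randn(10, 64)
print(layer(x).shape)                                  # (10, 64)
print(sum(p.numel() for p in layer.parameters()))      # total stored parameters
# Per token, only k of n_experts expert MLPs run, so active compute scales with
# k/n_experts of the expert parameters: the same pattern as "30B total, 3B active".
```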

GLM-4.7-Flash supports a context length of 128K tokens and achieves strong performance on coding benchmarks among models of comparable scale. This context size is suitable for large codebases, multi-file repositories, and long technical documents, where many existing models would need aggressive chunking.

GLM-4.7-Flash uses a standard causal language modeling interface and a chat template, which allows integration into existing LLM stacks with minimal changes, as in the sketch below.
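A minimal sketch of that integration with Hugging Face Transformers, using the standard apply_chat_template API and assuming the repo id from the model card linked below:

```python
# Sketch only: the repo id is assumed from the model card, and a recent
# transformers release may be required to recognize the glm4_moe_lite architecture.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-4.7-Flash"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```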

Benchmark performance in the 30B class

The Z.ai team compares GLM-4.7-Flash with Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. GLM-4.7-Flash leads or is competitive across a mix of math, reasoning, long-horizon, and coding-agent benchmarks.

Benchmark comparison table: https://huggingface.co/zai-org/GLM-4.7-Flash

This comparison shows why GLM-4.7-Flash is one of the strongest models in the 30B class, at least among the models included here. The important point is that GLM-4.7-Flash is not only a compact deployment of GLM but also a high-performing model on established coding and agent benchmarks.

Evaluation parameters and thinking mode

For most tasks, the default settings are temperature 1.0, top-p 0.95, and max new tokens 131,072. This defines a relatively open sampling regime with a large generation budget.

For Terminal Bench and SWE-bench Verified, the configuration uses temperature 0.7, top-p 1.0, and max new tokens 16,384. For τ²-Bench, the configuration uses temperature 0 and max new tokens 16,384. These stricter settings reduce randomness for tasks that need stable tool use and multi-step interaction.
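Expressed as vLLM SamplingParams, the documented settings would look roughly like this; the parameter names are vLLM's public API, while the values come from the evaluation notes above:

```python
from vllm import SamplingParams

# Default open sampling regime for most tasks.
default_params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=131072)

# Stricter settings for Terminal Bench and SWE-bench Verified.
agent_params = SamplingParams(temperature=0.7, top_p=1.0, max_tokens=16384)

# Greedy decoding for τ²-Bench.
tau2_params = SamplingParams(temperature=0.0, max_tokens=16384)
```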

The Z.ai team also recommends turning on Preserved Thinking mode for multi-turn agentic tasks such as τ²-Bench and Terminal Bench 2. This mode preserves internal reasoning traces across turns, which is useful when you build agents that need long chains of function calls and corrections.
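As a purely hypothetical illustration of the idea, an agent loop with preserved thinking keeps the model's reasoning trace in the conversation history rather than stripping it between turns. The field name below is an assumption for illustration, not Z.ai's documented API:

```python
# Hypothetical sketch: "reasoning_content" is an assumed field name, not Z.ai's API.
history = []

def record_turn(user_msg, assistant_msg, reasoning=None):
    history.append({"role": "user", "content": user_msg})
    turn = {"role": "assistant", "content": assistant_msg}
    if reasoning is not None:
        turn["reasoning_content"] = reasoning  # kept across turns instead of discarded
    history.append(turn)

record_turn("List the repo files.", "<tool call: ls>", reasoning="Need the file tree first.")
# On the next request, `history` is sent back whole, so earlier reasoning stays visible
# to the model, which is what a preserved-thinking mode provides automatically.
```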

How GLM-4.7-Flash fits developer workflows

GLM-4.7-Flash combines several properties that are relevant for agentic, coding-focused applications:

  • A 30B-A3B MoE architecture with 31B parameters and a 128K-token context length.
  • Strong benchmark results on AIME 25, GPQA, SWE-bench Verified, τ²-Bench, and BrowseComp compared with the other models in the same table.
  • Documented evaluation parameters and a Preserved Thinking mode for multi-turn agent tasks.
  • First-class support for vLLM, SGLang, and Transformers-based inference, with ready-to-use commands (see the sketch after this list).
  • A growing set of finetunes and quantizations, including MLX conversions, in the Hugging Face ecosystem.
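For local serving, a minimal vLLM sketch might look like the following. The repo id is assumed from the model card linked above, and a recent vLLM release is likely needed to support a new architecture tag such as glm4_moe_lite:

```python
from vllm import LLM, SamplingParams

# Sketch only: repo id assumed from the model card.
llm = LLM(model="zai-org/GLM-4.7-Flash")
params = SamplingParams(temperature=0.7, top_p=1.0, max_tokens=1024)

messages = [{"role": "user", "content": "Write a unit test for a merge-sort function."}]
outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)
```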

Check out the model weights on Hugging Face.

