By Harshit
SAN JOSE, DEC. 17 — 12 AM EST
Nvidia has released Nemotron 3, a new family of open-source large language models designed specifically for advanced reasoning, long-context understanding, and multi-agent AI systems. The release marks one of Nvidia’s most ambitious moves yet in open AI, positioning the company as a major force in the next phase of model development beyond traditional chat-based systems.
Nemotron 3 is available in Nano, Super, and Ultra variants and introduces a hybrid Mamba–Transformer Mixture-of-Experts (MoE) architecture. The models support a native one-million-token context window, enabling AI agents to retain entire histories, evidence sets, and multi-stage plans within a single inference pass.
Unlike most open-weight releases, Nvidia has open-sourced the full development stack, including training data, recipes, and reinforcement-learning environments. Industry observers see the release as a strategic attempt to reshape the open-source AI ecosystem — and to align it tightly with Nvidia’s hardware platform.
A Hybrid Architecture Optimized for Reasoning at Scale
At the core of Nemotron 3 is a hybrid Mamba–Transformer MoE backbone, designed to balance efficiency with high-level reasoning.
- Mamba layers handle long-range sequence modeling with minimal memory overhead, allowing efficient processing of extremely long contexts.
- Transformer layers provide high-precision attention, critical for tasks such as coding, mathematics, and structured reasoning.
- Mixture-of-Experts routing activates only a subset of parameters per token, dramatically improving compute efficiency while preserving model capacity.
The currently available Nemotron 3 Nano model contains 30 billion parameters, with approximately 3.6 billion active per token. Nvidia plans to release Nemotron 3 Super (~100B parameters) and Ultra (~500B parameters) in the first half of 2026, allowing developers to scale from high-throughput systems to deeper reasoning engines over time.
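To make the routing idea concrete, here is a minimal top-k MoE layer in PyTorch. The expert count, dimensions, and top-k value are illustrative placeholders rather than Nemotron 3's actual configuration; the point is only that each token exercises a small fraction of the total parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal Mixture-of-Experts layer: each token is routed to only
    k experts, so only a fraction of total parameters is active."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                            # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # normalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(16, 512))                  # only 2 of 8 expert MLPs run per token
```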
One Million Tokens, Without Fragmentation
One of Nemotron 3’s defining features is its native one-million-token context window. This eliminates the need for complex chunking or retrieval heuristics commonly used in long-context applications.
For agentic systems — such as research agents, planning systems, or autonomous coding tools — this enables persistent memory across extended workflows. According to Nvidia, the hybrid architecture keeps per-token compute low enough to make such long contexts viable in production environments.
Developers can also toggle “Reasoning ON/OFF” modes and define a configurable “thinking budget,” allowing precise control over inference cost versus reasoning depth.
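Nvidia's exact parameter names are not documented in this article, but toggles like these are typically exposed through an OpenAI-compatible endpoint. A hedged sketch, assuming a hypothetical provider URL, model id, and `reasoning`/`thinking_budget` request fields; check your provider's documentation for the real names:

```python
import os
import requests

# The endpoint URL, model id, and the "reasoning" / "thinking_budget"
# fields below are hypothetical placeholders, standing in for whatever
# the chosen inference provider actually exposes.
resp = requests.post(
    "https://api.example-provider.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json={
        "model": "nvidia/nemotron-3-nano",       # placeholder model id
        "messages": [{"role": "user", "content": "Outline a 3-step refactor plan."}],
        "reasoning": "on",                       # assumed ON/OFF toggle
        "thinking_budget": 2048,                 # assumed cap on reasoning tokens
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```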
Innovations Coming in Super and Ultra
The upcoming Super and Ultra models introduce deeper architectural advances aimed at complex, multi-step reasoning.
One key innovation is Latent MoE. Traditional MoE models suffer from memory and communication bottlenecks when routing tokens across many experts. Latent MoE compresses token representations into a smaller latent space before routing, drastically reducing data transfer between GPUs.
This allows the model to consult more experts simultaneously (for example, routing compressed tokens to 22 experts instead of 6) without increasing inference cost. Nvidia reports that this results in stronger coding and math performance at equivalent latency.
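A back-of-the-envelope sketch of why the compression helps, using illustrative hidden and latent widths (the real dimensions are not public in this article):

```python
d_model, d_latent, n_routed = 4096, 512, 22   # illustrative sizes only

per_token_standard = n_routed * d_model       # values shipped per token, full width
per_token_latent   = n_routed * d_latent      # values shipped per token, compressed

print(f"standard MoE dispatch: {per_token_standard:,} values/token")  # 90,112
print(f"latent MoE dispatch:   {per_token_latent:,} values/token")    # 11,264 (8x less)
```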
Another major feature is Multi-Token Prediction (MTP). Instead of predicting one token at a time, Super and Ultra models are trained to predict multiple future tokens simultaneously. This encourages forward planning and improves coherence and reasoning. During inference, MTP acts as a high-speed drafting mechanism, accelerating generation by validating multiple tokens at once.
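A minimal sketch of the training side of MTP, assuming illustrative dimensions and one simple linear head per future-token offset (the actual Nemotron 3 head design is not described here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """Sketch of multi-token prediction: from each position's hidden
    state, n_heads separate heads predict the next n_heads tokens."""
    def __init__(self, d_model=512, vocab=32000, n_heads=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(n_heads))

    def loss(self, hidden, targets):
        # hidden: (batch, seq, d_model); targets: (batch, seq) token ids
        total = 0.0
        for i, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-i])            # predict the token at offset +i
            total += F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                targets[:, i:].reshape(-1),
            )
        return total / len(self.heads)

heads = MTPHeads()
h = torch.randn(2, 128, 512)
t = torch.randint(0, 32000, (2, 128))
print(heads.loss(h, t))                              # averaged loss across 4 offsets
```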
Training Efficiency Through Low-Precision Computing
To support these massive models, Nvidia uses NVFP4, its proprietary 4-bit floating-point training format. Most of the network operates in 4-bit precision to reduce memory and compute cost, while sensitive components — such as Mamba outputs and MTP projections — remain in higher precision formats like BF16 or MXFP8.
This hybrid precision strategy allows Nemotron 3 to maintain reasoning accuracy while dramatically lowering training and inference overhead, particularly on Nvidia GPUs.
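Weight memory alone shows why 4-bit matters. A rough calculation for an Ultra-scale (~500B-parameter) model, ignoring NVFP4's per-block scale factors and all activation and optimizer state:

```python
params = 500e9                         # ~Ultra-scale parameter count

bf16_gb  = params * 2 / 1e9            # BF16: 2 bytes per weight
nvfp4_gb = params * 0.5 / 1e9          # FP4: 4 bits = half a byte per weight

print(f"BF16 weights : {bf16_gb:,.0f} GB")   # 1,000 GB
print(f"NVFP4 weights: {nvfp4_gb:,.0f} GB")  # 250 GB, a 4x reduction
```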
Designed for the Agentic Sweet Spot
Nemotron 3 Nano currently scores 52 on the Artificial Analysis Intelligence Index, matching OpenAI’s gpt-oss-20b and outperforming Nvidia’s previous Nemotron Nano by a wide margin. It also delivers approximately 380 tokens per second on serverless endpoints, making it well-suited for real-time agent interactions.
The model is optimized for tasks such as:
- Software debugging
- Information retrieval
- Long-form summarization
- Multi-agent coordination
Training was conducted using Reinforcement Learning from Verifiable Rewards (RLVR) across multiple domains simultaneously, reducing performance degradation as agents switch between task types — a common weakness in long, multi-step workflows.
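The defining property of RLVR is that rewards come from checks that can be verified mechanically rather than from a learned preference model. A toy sketch for the coding domain (a real pipeline would sandbox execution; this one does not):

```python
import subprocess
import tempfile

def verifiable_reward(generated_code: str, test_code: str) -> float:
    """Toy RLVR-style reward: 1.0 if the model's code passes the tests,
    else 0.0. For illustration only -- real pipelines sandbox this step."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + test_code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=10)
    return 1.0 if result.returncode == 0 else 0.0

print(verifiable_reward("def add(a, b):\n    return a + b",
                        "assert add(2, 3) == 5"))    # 1.0
```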
An Open Ecosystem, Not Just an Open Model
Nvidia’s release goes beyond model weights. The company is also open-sourcing NeMo Gym, a reinforcement-learning environment library that enables developers to train and evaluate agents using the same environments Nvidia used internally.
In addition, Nvidia is releasing:
- 3 trillion tokens of new pretraining data, with expanded coverage of math and code
- The Nemotron Agentic Safety Dataset, containing nearly 11,000 real-world telemetry traces for evaluating safety and behavior in autonomous systems
Developers can access Nemotron 3 Nano via Hugging Face and inference providers including AWS Bedrock, Baseten, and DeepInfra, with pricing around $0.06 per million input tokens and $0.24 per million output tokens.
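At those rates, even context-heavy agent runs stay inexpensive. A quick cost check using the quoted prices:

```python
in_price, out_price = 0.06, 0.24       # USD per million tokens (quoted above)

# Example session: 800K input tokens (a long agent context), 50K output tokens.
cost = 800_000 / 1e6 * in_price + 50_000 / 1e6 * out_price
print(f"${cost:.3f} per session")      # $0.060
```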
A Strategic Play for the Open AI Era
While competitors such as OpenAI and Anthropic rely heavily on closed, paid APIs, Nvidia benefits directly from widespread model adoption. The more Nemotron models are deployed, the greater the demand for AI accelerators — a market Nvidia already dominates.
By open-sourcing high-performance models optimized for its hardware, Nvidia is effectively expanding the open-source AI pie while reinforcing its position as the industry’s infrastructure backbone.
For developers building agentic systems, Nemotron 3 offers a rare combination: open access, long-context reasoning, and production-grade efficiency.