NVIDIA Cosmos 3 and Isaac GR00T: The Open Physical AI Stack, Reviewed
NVIDIA released Cosmos 3 on May 31, 2026 — and announced the Isaac GR00T Reference Humanoid Robot on June 1 at GTC Taipei. Together, they constitute the most significant Physical AI release of 2026, and arguably the most important open-source robotics announcement since the original Universal Robots ROS stack a decade ago. Two parts. One ambition: democratize the entire humanoid robotics stack from world model down to physical hardware.
Cosmos 3 is NVIDIA's first fully open omnimodel. Two variants shipped immediately — Cosmos 3 Nano at 16B parameters (8B reasoner + 8B generator) and Cosmos 3 Super at 64B parameters (32B reasoner + 32B generator) — with a Cosmos 3 Edge variant at 2B announced for later release. Both shipped variants are available on Hugging Face right now under the OpenMDW 1.1 license, which permits commercial use. They accept text, images, video, ambient audio, and action sequences as input, and produce the same modalities as output — making them the first models in the field to truly merge what we've previously called 'language models,' 'video generators,' and 'robot policies' into one architecture.
Isaac GR00T is the matching robot. NVIDIA paired the Unitree H2 Plus chassis with Sharpa Wave five-finger tactile hands, plugged in a Jetson AGX Thor T5000 brain (2,070 FP4 teraflops), and wrapped the whole thing in the Isaac GR00T open software stack. The result: a roughly 6-foot, 150-pound humanoid robot with 75 degrees of freedom that academic and research teams can buy from Unitree and program against the same model stack NVIDIA's internal research uses. Available late 2026.
This is the detailed review. What Cosmos 3 actually is. Why the omnimodel framing matters. The full GR00T specification. The honest take on what works in practice today, what's still vapor, and which teams should be paying attention.
1. What is NVIDIA Cosmos 3?
Cosmos 3 is NVIDIA's first fully open omnimodel for Physical AI — meaning the artificial intelligence that controls robots, autonomous vehicles, and machines that act in the physical world. The 'omnimodel' framing is doing work: this is not a language model with a vision add-on, not a video generator with a text interface, not a robot policy fine-tuned for one task. It is a single architecture that handles five modalities natively — text, images, video, ambient audio, and action sequences — in both directions.
Cosmos 3 was released on Hugging Face on May 31, 2026 and formally announced at GTC Taipei on June 1, 2026. Jensen Huang's launch quote captures the framing precisely: 'The big bang of physical AI is just around the corner thanks to breakthroughs in multimodal reasoning language, vision and world models. The Cosmos 3 family of open, frontier omnimodels gives developers a generational leap in ability to build robots, autonomous vehicles and vision AI that perceive, reason, plan and act in the physical world.'
Training corpus: 20 trillion multimodal tokens — a scale comparable to large frontier language models, but applied to physics and motion rather than pure text. License: OpenMDW 1.1, which permits commercial use. Distribution: weights, code, and tooling all openly available, with no proprietary restrictions on building products on top.
This release continues NVIDIA's 2026 strategy of going far beyond GPUs. Following LongLive-2.0's NVFP4 long-video infrastructure release in May and the full NVIDIA AI models 2026 lineup, Cosmos 3 is the third major NVIDIA model release in three months. The pattern: NVIDIA is no longer just selling the picks and shovels — it's now publishing the frameworks that show everyone else how to use them at the frontier.
2. The omnimodel architecture — what's actually new
The architectural choice that makes Cosmos 3 interesting is the Mixture-of-Transformers (MoT) design, which separates the reasoning component from the generation component. Each Cosmos 3 model contains two equally sized transformers working together: a Reasoner (which understands scenes, plans actions, and outputs the structured representations that guide generation) and a Generator (which produces the actual pixels, audio waveforms, or action sequences).
For Cosmos 3 Nano: an 8B-parameter Reasoner paired with an 8B-parameter Generator, totaling 16B. For Cosmos 3 Super: 32B + 32B, totaling 64B. Both variants are initialized from pre-trained vision-language models, then trained on the full 20-trillion-token multimodal corpus. The reasoner-generator split matters because it lets you swap the generator (for example, replace the video generator with a robot-action generator) while keeping the same reasoning backbone, a property that traditional monolithic VLMs don't have.
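To make the swap property concrete, here is a minimal structural sketch in plain Python. It is not NVIDIA's code: the class names (`Reasoner`, `VideoGenerator`, `ActionGenerator`, `OmniModel`) and the shape of the `plan` dictionary are hypothetical stand-ins, and the "generation" is just string formatting. It only illustrates how one reasoning backbone could drive interchangeable generators.

```python
from dataclasses import dataclass
from typing import Protocol


class Generator(Protocol):
    """Anything that can turn a reasoner plan into one output modality."""
    params_b: int

    def generate(self, plan: dict) -> str: ...


@dataclass
class Reasoner:
    params_b: int  # parameter count in billions

    def plan(self, prompt: str) -> dict:
        # Stand-in for scene understanding and planning (hypothetical format).
        return {"goal": prompt, "steps": 3}


@dataclass
class VideoGenerator:
    params_b: int

    def generate(self, plan: dict) -> str:
        return f"video: {plan['steps']} segments for '{plan['goal']}'"


@dataclass
class ActionGenerator:
    params_b: int

    def generate(self, plan: dict) -> str:
        return f"actions: {plan['steps']} waypoints for '{plan['goal']}'"


@dataclass
class OmniModel:
    reasoner: Reasoner
    generator: Generator

    @property
    def total_params_b(self) -> int:
        return self.reasoner.params_b + self.generator.params_b

    def run(self, prompt: str) -> str:
        # The reasoner plans once; whichever generator is attached renders it.
        return self.generator.generate(self.reasoner.plan(prompt))


# One shared reasoning backbone, two different generators (Nano-sized: 8B + 8B).
backbone = Reasoner(params_b=8)
video_model = OmniModel(backbone, VideoGenerator(params_b=8))
action_model = OmniModel(backbone, ActionGenerator(params_b=8))

print(video_model.run("stack two boxes"))
print(action_model.run("stack two boxes"))
```

Both `OmniModel` instances share the same `backbone` object, which is the point of the split: only the generator half changes between the video and the robot-action configuration.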
Compare this to the alternatives. A standard vision-language model like GPT-5.5 or Claude Opus 4.8 can read images but not generate video or actions. A specialist video generator like Veo or Sora can produce video but can't reason about physics or plan actions. A robot policy trained on imitation learning can produce actions but can't reason about new scenarios. Cosmos 3 unifies all three behaviors in one model — input any combination of the five modalities, output any combination of the five modalities.
Practically: this means you can prompt Cosmos 3 with a text description and a starting image and ask it to generate a video of the next 5 seconds plus the joint trajectories a humanoid robot would execute to reproduce that video physically. The model handles the full chain — scene understanding, physical reasoning, motion prediction, and action planning — without a pipeline of separate models stitched together.
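As a rough illustration of what such a request could look like as data, the sketch below models an any-to-any request over the five modalities. Everything here is an assumption for illustration: `OmniRequest`, its field names, and the modality labels are hypothetical and do not come from the Cosmos 3 model card or any NVIDIA API.

```python
from dataclasses import dataclass

# The five modalities named in the article (labels are illustrative).
MODALITIES = frozenset({"text", "image", "video", "audio", "action"})


@dataclass(frozen=True)
class OmniRequest:
    """Hypothetical any-to-any request: which modalities go in, which come out."""
    inputs: dict            # modality name -> payload (string, file path, array, ...)
    outputs: tuple          # modality names the model should generate
    horizon_s: float = 5.0  # how far ahead to predict, in seconds

    def __post_init__(self):
        unknown = (set(self.inputs) | set(self.outputs)) - MODALITIES
        if unknown:
            raise ValueError(f"unsupported modalities: {sorted(unknown)}")
        if not self.inputs or not self.outputs:
            raise ValueError("need at least one input and one output modality")


# Text + starting image in; 5 seconds of video plus joint trajectories out.
request = OmniRequest(
    inputs={"text": "pick up the red mug", "image": "start_frame.png"},
    outputs=("video", "action"),
)
print(sorted(request.inputs), "->", request.outputs)
```

The validation step is the useful part of the sketch: a single request type covers every input/output combination, so there is no per-task pipeline to assemble.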
3. Cosmos 3 Nano vs Super vs Edge — sizing and use cases
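NVIDIA released two variants immediately, with a third announced for later release:
- Cosmos 3 Nano: 16B parameters (8B reasoner + 8B generator); workstation inference on a single RTX PRO 6000; available now
- Cosmos 3 Super: 64B parameters (32B reasoner + 32B generator); datacenter Hopper or Blackwell GPUs; available now
- Cosmos 3 Edge: 2B parameters (dense transformer); Jetson-class edge devices; announced for later release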
Sizing intuition. BF16 stores each parameter in 2 bytes, so Cosmos 3 Nano's 16B parameters come to roughly 32GB of weights. That fits comfortably in the 96GB of VRAM on an RTX PRO 6000, with substantial room left for context and activations. This is the variant designed for real-time robotics inference: a humanoid robot's onboard compute or an autonomous vehicle's compute stack. The latency target is fractions of a second for action prediction.
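The sizing arithmetic is easy to check. The sketch below computes weight-only footprints from parameter counts, assuming 2 bytes per parameter (BF16) and decimal gigabytes. Applying BF16 to the unreleased Edge variant is an assumption, and the figures ignore KV cache, activations, and framework overhead.

```python
BF16_BYTES_PER_PARAM = 2  # bfloat16 stores each weight in 16 bits


def weight_gb(params_billion: float, bytes_per_param: float = BF16_BYTES_PER_PARAM) -> float:
    """Weight-only footprint in decimal GB: billions of params x bytes each."""
    return params_billion * bytes_per_param


def headroom_gb(vram_gb: float, params_billion: float) -> float:
    """VRAM left for context and activations after the weights are loaded."""
    return vram_gb - weight_gb(params_billion)


for name, size_b in {"Nano": 16, "Super": 64, "Edge": 2}.items():
    print(f"Cosmos 3 {name}: {weight_gb(size_b):.0f} GB of BF16 weights")

print(f"Nano headroom on a 96 GB card: {headroom_gb(96, 16):.0f} GB")
```

By the same arithmetic, Super's 64B parameters need about 128GB for weights alone, more than a single 96GB card holds, which is consistent with its datacenter positioning.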
Cosmos 3 Super at 64B parameters is the high-quality datacenter variant. Use case: large-scale synthetic data generation, advanced physical reasoning research, post-training of smaller robot models. NVIDIA explicitly positions Super for shortening physical AI training cycles — generate millions of synthetic scenarios in Super, then distill the behavior into smaller models for deployment.
Cosmos 3 Edge at 2B parameters is the variant most teams will eventually deploy on actual hardware. Built as a dense transformer (no reasoner-generator split, which makes sense at that size), it targets Jetson-class edge devices. NVIDIA hasn't specified the Edge release date beyond 'later release.' For teams who can't wait, the Step 3.7 Flash open-weight agent model is the closest currently-available alternative at the edge tier, though Step is not purpose-built for physical robotics.
4. The 5-modality input/output surface
Cosmos 3 handles five distinct modalities: text, images, video, ambient audio, and action sequences. Per NVIDIA's framing, each works in both directions, as input and as output.
The two modalities that distinguish Cosmos 3 from prior frontier models are ambient audio and action sequences. Ambient audio means the kind of environmental sound a robot or autonomous vehicle would hear in operation — not speech, but the world: traffic, machinery, footsteps, doors closing. Cosmos 3 can both perceive these audio cues (a robot hearing a glass break in the next room) and generate them (synthesizing realistic warehouse acoustics for simulation training).
Action sequences are where the framing matters most. In Cosmos 3, 'action' is concrete numerical data: joint angles for a robotic arm at each timestep, gripper position values, trajectory waypoints for an autonomous vehicle. This is what makes the model usable as a robot policy directly. You don't have to write a separate motor controller that translates language into actions — Cosmos 3 emits the action sequence in the format the robot's low-level controllers expect.
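To show what "concrete numerical data" means here, the sketch below represents an action sequence as a list of timestamped joint-angle and gripper values. The `ActionStep` layout, the units, and the `interpolate` helper are assumptions for illustration; the actual Cosmos 3 action format is defined by NVIDIA's tooling, and the linear interpolation merely stands in for model output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionStep:
    t: float                   # seconds since the start of the sequence
    joint_angles: List[float]  # radians, one entry per joint
    gripper: float             # 0.0 = fully open, 1.0 = fully closed


def interpolate(start: List[float], goal: List[float],
                steps: int, dt: float, gripper: float = 0.0) -> List[ActionStep]:
    """Linearly blend two joint poses into a fixed-rate trajectory."""
    if steps < 2:
        raise ValueError("need at least a start and an end step")
    if len(start) != len(goal):
        raise ValueError("start and goal must have the same number of joints")
    trajectory = []
    for i in range(steps):
        alpha = i / (steps - 1)  # 0.0 at the first step, 1.0 at the last
        angles = [s + alpha * (g - s) for s, g in zip(start, goal)]
        trajectory.append(ActionStep(t=i * dt, joint_angles=angles, gripper=gripper))
    return trajectory


# A two-joint arm moving between poses over 5 steps at 50 Hz (dt = 0.02 s).
trajectory = interpolate(start=[0.0, 0.5], goal=[1.0, -0.5], steps=5, dt=0.02)
for step in trajectory:
    print(step)
```

A low-level controller would consume a structure like this directly, one step per control tick, which is why no language-to-motor translation layer is needed in between.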
Cosmos 3 supports multiple practical workflows from these modalities. As a vision-language model, it analyzes video and can detect anomalies in real-world footage — partner Linker Vision uses it for smart city traffic monitoring. As a world model, it generates photorealistic video sequences of rare situations: near-miss scenarios for autonomous vehicle training, unusual object arrangements for warehouse robotics. As a planner, it takes a goal state and produces the action sequence to reach it. As a simulator, it predicts the consequences of a proposed action before the robot commits.
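The simulator workflow, predicting an action's consequences before committing to it, reduces to a simple control pattern. The sketch below uses a toy one-dimensional world in place of a real world model; `first_safe_action`, `predict`, and `is_safe` are hypothetical names, and in practice the prediction step would be a call into the model rather than an addition.

```python
from typing import Callable, Iterable, Optional


def first_safe_action(state: float,
                      candidates: Iterable[float],
                      predict: Callable[[float, float], float],
                      is_safe: Callable[[float], bool]) -> Optional[float]:
    """Return the first candidate whose predicted outcome passes the safety check."""
    for action in candidates:
        predicted_state = predict(state, action)  # simulate, do not execute
        if is_safe(predicted_state):
            return action
    return None  # nothing acceptable: the robot should not move


# Toy world: the state is a position on a rail that must stay within [0, 1].
def predict(position: float, displacement: float) -> float:
    return position + displacement


def is_safe(position: float) -> bool:
    return 0.0 <= position <= 1.0


chosen = first_safe_action(0.75, [0.5, 0.25, -0.25], predict, is_safe)
print(chosen)  # 0.5 would overshoot to 1.25, so 0.25 is chosen
```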
5. The Isaac GR00T Reference Humanoid Robot — every spec
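Isaac GR00T is the hardware companion. NVIDIA's framing: 'the first open humanoid robot reference design.' Full specification:
- Chassis: Unitree H2 Plus
- Overall: roughly 6 feet tall, about 150 pounds, 75 degrees of freedom
- Hands: Sharpa Wave five-finger tactile hands
- Onboard compute: NVIDIA Jetson AGX Thor T5000 (2,070 FP4 teraflops)
- Software: Isaac GR00T open software stack, including Isaac Sim, Isaac Lab, Cosmos, ROS middleware, and CUDA-X libraries
- Availability: from Unitree in late 2026; pricing not yet announced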
The combined system. A Unitree H2 Plus body, a real production humanoid that Unitree has been refining for over a year, is paired with Sharpa Wave hands that close the dexterity gap, long the biggest barrier for general-purpose humanoid robotics. Five-finger tactile manipulation means the robot can grip varied objects, manipulate small components, and provide the kind of haptic feedback to its software stack that simpler grippers cannot.
The Jetson AGX Thor T5000 onboard compute is the critical hardware detail most coverage will skip. 2,070 FP4 teraflops in a robot-mountable form factor means Cosmos 3 Nano can run onboard the robot itself rather than requiring a connection to cloud or workstation compute. This is the first generation of humanoid platforms where frontier-tier AI runs on the robot at real-time speeds.
Critical clarification on availability: NVIDIA is not selling the robot itself. Unitree is the manufacturer and seller — NVIDIA provides the reference design, the integration specs, and the software stack. Robot developers will buy from Unitree in late 2026 (specific pricing not yet announced; expect academic-research pricing in the tens of thousands of dollars range based on Unitree's existing humanoid pricing).
Partner ecosystem at launch. Confirmed academic and research institutions using the reference design: Ai2 (Allen Institute for AI), ETH Zurich, Stanford Robotics Center, and UC San Diego's Advanced Robotics and Controls Laboratory. The mix is deliberate — Ai2 brings the open-AI-research credibility, ETH and Stanford bring the robotics depth, UCSD brings the controls expertise. This is the partner list NVIDIA needs to credibly position GR00T as 'the new research baseline' rather than just a vendor product.
6. The Cosmos + GR00T full-stack story
The strategic story is what makes this release significant. NVIDIA has been telling a 'full-stack Physical AI' narrative for over a year. With Cosmos 3 + Isaac GR00T, they actually shipped the full stack — for the first time anywhere in the industry.
The five-stage pipeline they cover end-to-end:
- Data collection — synthetic data generation through Cosmos 3 Super, real-world capture through GR00T teleoperation
- Simulation and training — Isaac Sim and Isaac Lab for physically accurate environments and reinforcement learning
- World models — Cosmos 3 itself, generating photorealistic scenarios for closed-loop training
- Robot reasoning — Cosmos 3 Nano running onboard the robot for real-time planning and action prediction
- Deployment on physical hardware — GR00T reference robot from Unitree, with the same Cosmos 3 model the simulation used
The competitive context. Tesla has Optimus but operates a closed proprietary stack. Figure AI is closed and capital-constrained. Apptronik has manufacturing but lacks an open model. Boston Dynamics has the engineering pedigree but no public-facing AI stack. With Cosmos 3 + GR00T, NVIDIA is offering academic and startup teams the closest thing to a free pass past all four — open models, integrated hardware reference, complete software stack, and the institutional partnerships that signal serious research support.
Whether this displaces existing humanoid investment depends on the next twelve months of execution. The pattern to watch: which production-scale humanoid robotics startups choose to build on top of GR00T versus continuing their proprietary stacks. The bet NVIDIA is making is that the answer to that question by mid-2027 will look a lot like the answer to 'which AI startups build on top of NVIDIA GPUs' — almost all of them. For broader context on how this fits with NVIDIA's other 2026 model releases, our NVIDIA AI models 2026 guide covers the full ecosystem from Nemotron through Cosmos.
7. The honest limitations and what's still unproven
Four things worth flagging honestly before treating this as a turnkey humanoid stack.
First: the benchmark verification gap is large. NVIDIA claims Cosmos 3 ranks #1 across 7+ robotics benchmarks. Many of these benchmarks (physics understanding, robot planning and control, physical AI reasoning) don't have the same independent reproduction infrastructure as SWE-Bench or GPQA Diamond do for language models. Independent third-party reproduction of the Cosmos 3 benchmark claims is just starting. Treat the headline numbers as strong directional signals, not as guarantees.
Second: 'open' has some specifics worth reading. The OpenMDW 1.1 license permits commercial use, which is the meaningful permission. Weights and code are public on Hugging Face. The training data, however, is not fully published — like most frontier model releases, NVIDIA publishes the model artifacts but not the full training corpus. This is fine for most use cases but matters for teams that need to verify training data provenance for regulated deployments.
Third: real-world humanoid robotics is harder than benchmarks suggest. Cosmos 3 can produce action sequences that work in simulation. Whether those actions translate cleanly to a physical Unitree H2 Plus with Sharpa Wave hands on actual factory floors is a separate empirical question — the sim-to-real gap remains the dominant unsolved problem in humanoid robotics. Late-2026 GR00T shipments will be the first real test. Until then, treat demos as demos. For context on how the field has progressed on simulation quality, our June 2026 AI models leaderboard tracks the broader state of multimodal models including the ones that compete with Cosmos in specific dimensions.
Fourth: hardware costs and availability constraints. Even with NVIDIA democratizing the software stack, Unitree H2 Plus pricing in academic research configurations is expected to be in the tens of thousands of dollars range. Jetson AGX Thor T5000 modules are not cheap. A complete GR00T setup is closer to a $50,000-$100,000 commitment than a $5,000 one. This is dramatically cheaper than building a comparable platform from scratch — but it's not a hobbyist budget. The democratization is real for funded research labs and startups, less so for individual developers.
8. How to access Cosmos 3 today
Cosmos 3 Nano and Super are both available for download today. Three access paths:
Hugging Face (direct weights)
nvidia/Cosmos3-Nano and nvidia/Cosmos3-Super. BF16 weights. Supported through Hugging Face Diffusers, PyTorch, and the official Cosmos framework. The Hugging Face model card includes example inference code for each modality combination.
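A minimal download sketch, assuming the repository IDs quoted above are correct and that the `huggingface_hub` package is installed. `snapshot_download` is that library's standard call for fetching a full repository; `repo_for` and `download_variant` are hypothetical helper names, and the model card remains the authority on the actual loading code.

```python
from typing import Optional

# Repository IDs as given in this article; check them against the model cards.
REPOS = {
    "nano": "nvidia/Cosmos3-Nano",
    "super": "nvidia/Cosmos3-Super",
}


def repo_for(variant: str) -> str:
    """Map a variant name to its Hugging Face repository ID."""
    key = variant.strip().lower()
    if key not in REPOS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(REPOS)}")
    return REPOS[key]


def download_variant(variant: str, local_dir: Optional[str] = None) -> str:
    """Download all files for a variant and return the local folder path."""
    # Imported here so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_for(variant), local_dir=local_dir)


if __name__ == "__main__":
    # Nano is roughly 32 GB in BF16, so make sure the disk has room first.
    print(download_variant("nano", local_dir="./cosmos3-nano"))
```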
vLLM-Omni (production inference)
NVIDIA has shipped first-party support for Cosmos 3 in vLLM-Omni, a multimodal-aware variant of vLLM. This is the path teams running production Cosmos 3 inference should take — proper batching, KV cache management, and the multimodal scheduling primitives that one-shot Hugging Face Diffusers calls don't provide.
NVIDIA NIM (managed enterprise)
For teams that want managed inference without operating their own GPU infrastructure, NVIDIA NIM provides Cosmos 3 deployment with the standard NIM API surface. This is the path for enterprise customers who already have an NVIDIA AI Enterprise subscription.
For the broader Isaac GR00T platform, the reference workflow for the Unitree G1 humanoid (a related Unitree model that's currently available) is rolling out to GitHub and Hugging Face. This gives teams a way to start experimenting with the software stack before the full GR00T reference humanoid ships in late 2026.
For builders composing Cosmos 3 with custom multimodal pipelines, agent frameworks, or simulation environments, the 130+ open-source cookbooks at Build Fast with AI cover the LangChain, LangGraph, and multimodal orchestration patterns that compose models like Cosmos 3 with other frontier models in production stacks. Cosmos as the world-model brain, Claude or Qwen as the planning brain, custom action runtimes as the physical interface — the orchestration patterns are well-trodden ground in the cookbook collection.
9. Frequently Asked Questions
What is NVIDIA Cosmos 3?
Cosmos 3 is NVIDIA's first fully open omnimodel for Physical AI, released May 31, 2026 and formally announced at GTC Taipei on June 1, 2026. It is a single AI model that handles five modalities — text, images, video, ambient audio, and action sequences — in both input and output. The Cosmos 3 family includes Nano (16B parameters) and Super (64B parameters) available now, with Edge (2B parameters) announced for later release. License: OpenMDW 1.1, which permits commercial use.
What is an omnimodel?
An omnimodel is a single AI model that handles multiple modalities (text, image, video, audio, actions) as both input and output, without relying on a pipeline of separate models stitched together. Cosmos 3's omnimodel framing distinguishes it from prior models that handled one or two modalities. A language model handles text. A vision-language model handles text + image input. A video generator handles text input and video output. Cosmos 3 handles all five modalities in both directions through a single Mixture-of-Transformers architecture.
Is Cosmos 3 open source?
Yes, with the standard frontier model caveats. The weights are public on Hugging Face under the OpenMDW 1.1 license, which permits commercial use. The code and tooling are openly available. The training data corpus (20 trillion multimodal tokens) is not fully published — like most frontier model releases, NVIDIA shares the model artifacts but not the complete training data. For teams that need full data provenance, this matters; for most production use cases, the open weights and commercial-friendly license are what counts.
What is the difference between Cosmos 3 Nano and Super?
Cosmos 3 Nano is the 16-billion-parameter variant (8B reasoner + 8B generator) optimized for fast workstation inference — runs on a single NVIDIA RTX PRO 6000 GPU. Use case: real-time robotics inference and physical AI applications. Cosmos 3 Super is the 64-billion-parameter variant (32B reasoner + 32B generator) for datacenter deployment on Hopper or Blackwell GPUs. Use case: large-scale synthetic data generation, advanced physical reasoning research, post-training of smaller robot models. Both are available now on Hugging Face.
What is the Isaac GR00T Reference Humanoid Robot?
The Isaac GR00T Reference Humanoid Robot is NVIDIA's open reference design for a complete humanoid robot platform, announced June 1, 2026 at GTC Taipei. It combines a Unitree H2 Plus humanoid chassis (~6 feet, ~150 pounds, 75 degrees of freedom), Sharpa Wave five-finger tactile hands for dexterous manipulation, an NVIDIA Jetson AGX Thor T5000 onboard compute module (2,070 FP4 teraflops), and the Isaac GR00T open software stack including Isaac Sim, Isaac Lab, Cosmos, ROS middleware, and CUDA-X libraries. Available from Unitree in late 2026.
Is NVIDIA selling the Isaac GR00T robot directly?
No. NVIDIA is providing the reference design — the integration specifications, the software stack, and the partner ecosystem — but the physical robot will be sold by Unitree, the manufacturer of the H2 Plus chassis. NVIDIA's role is the open-stack equivalent of a hardware reference design: defining the architecture, validating the integration, and ensuring software compatibility, while the actual hardware ships through Unitree. Pricing has not been disclosed; expect academic research configurations in the tens of thousands of dollars range.
What hardware do I need to run Cosmos 3?
Three tiers. For Cosmos 3 Nano: workstation-grade NVIDIA hardware, specifically an RTX PRO 6000 or equivalent (96GB+ VRAM recommended for full BF16 inference with reasonable context). For Cosmos 3 Super: datacenter Hopper or Blackwell GPUs (H100, H200, B200 class). For the upcoming Cosmos 3 Edge variant: Jetson family edge devices for real-time onboard robotics inference. NVIDIA's framework supports vLLM-Omni for production inference, plus standard PyTorch and Hugging Face Diffusers integration for research workflows.
When will the Isaac GR00T robot ship?
Late 2026, available from Unitree. NVIDIA has not provided a more specific date than 'late 2026' in the official announcement. The Isaac GR00T reference workflow for the existing Unitree G1 humanoid is rolling out earlier — expected 'soon' on GitHub and Hugging Face — giving research teams a path to start experimenting with the software stack on already-shipping hardware before the full reference humanoid arrives.
Which institutions are using Isaac GR00T?
Confirmed academic and research institutions at launch: Ai2 (Allen Institute for AI), ETH Zurich, Stanford Robotics Center, and UC San Diego's Advanced Robotics and Controls Laboratory. NVIDIA Research will also use the reference design internally to advance the Isaac GR00T open models and frameworks. The partner mix is deliberate — Ai2 brings open-AI-research credibility, ETH and Stanford bring deep robotics expertise, and UCSD contributes the advanced controls and manipulation research.