CppMjStep: fast, differentiable MuJoCo inside PyTorch

Published on Aug 20, 2025

Overview

Modern robotics research often needs gradients that pass cleanly through physics. You might want to tune a feedback policy with backprop, identify physical parameters from rollouts, or train a network that ingests sensor streams and outputs controls. CppMjStep is my answer to that workflow. It plugs MuJoCo into PyTorch as a custom autograd op written in C++, so you can differentiate through full simulation trajectories and keep the speed you need for real experiments. The extension supports multi-step unrolling, batching, and sensor gradients, and it uses multithreading for better throughput. On large models with large batch sizes, my measurements show more than a 2x speedup over a pure Python approach.

Why CppMjStep exists

Differentiating through physics is powerful but expensive. Finite differences across a rollout can balloon into thousands of simulator calls. CppMjStep leans on MuJoCo’s built-in derivative routines to keep the math both correct and fast. Under the hood, it calls functions like mjd_transitionFD to obtain state and control Jacobians for the discrete dynamics, then chains them across the horizon so you can backpropagate through your entire rollout.
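
To make the chaining concrete, here is a rough NumPy sketch of the backward pass, assuming one state Jacobian and one control Jacobian per step; the function and variable names are illustrative, not the actual C++ implementation.

import numpy as np

def backprop_through_rollout(A_list, B_list, grad_xT):
    """Chain per-step Jacobians backward through a rollout of length T.

    A_list[t] ~ ∂x_{t+1}/∂x_t and B_list[t] ~ ∂x_{t+1}/∂u_t, as finite-differenced by MuJoCo.
    grad_xT is ∂loss/∂x_T, the gradient arriving from the rest of the graph.
    """
    grad_x = grad_xT
    grad_u = [None] * len(A_list)
    for t in reversed(range(len(A_list))):
        grad_u[t] = B_list[t].T @ grad_x   # ∂loss/∂u_t
        grad_x = A_list[t].T @ grad_x      # propagate to ∂loss/∂x_t
    return grad_x, grad_u                  # ∂loss/∂x_0 and the per-step control gradients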

Two design choices make this practical for day-to-day work. First, the core op is implemented in C++ and wired into PyTorch’s autograd, so gradient flow feels native while avoiding Python overhead. Second, it adds sensor support, which means you can include IMU, force, or other sensor readings in the computational graph with minimal extra cost. If you have ever wanted end-to-end gradients from controls through contact-rich dynamics and back through sensors, this is built for that.
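
For intuition about the wiring, a Python-side analog of such an op is sketched below; the real extension does this work in C++, and the class name and the sim.step_and_jacobians helper here are hypothetical placeholders.

import torch

class ToyMjStepFn(torch.autograd.Function):
    """Illustrative analog of a differentiable simulator step (not the actual CppMjStep code)."""

    @staticmethod
    def forward(ctx, state, ctrl, sim):
        # sim is a hypothetical helper that advances MuJoCo and also returns the
        # finite-difference Jacobians A = ∂x'/∂x and B = ∂x'/∂u for that step.
        next_state, A, B = sim.step_and_jacobians(state.detach(), ctrl.detach())
        ctx.save_for_backward(torch.as_tensor(A, dtype=state.dtype),
                              torch.as_tensor(B, dtype=state.dtype))
        return torch.as_tensor(next_state, dtype=state.dtype)

    @staticmethod
    def backward(ctx, grad_next_state):
        A, B = ctx.saved_tensors
        grad_state = A.T @ grad_next_state   # vector-Jacobian product w.r.t. the state
        grad_ctrl = B.T @ grad_next_state    # vector-Jacobian product w.r.t. the controls
        return grad_state, grad_ctrl, None   # no gradient for the sim handle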

What you get

- A MuJoCo step that behaves like a native PyTorch autograd op, implemented in C++
- Multi-step unrolling, so a single call advances the simulator several steps
- Batched states and controls, with optional multithreading for higher throughput
- Gradients through sensor readings as well as states and controls
- A Python reference implementation (PyMjStep) for sanity-checking the C++ path
- More than a 2x speedup over the Python reference on large models with large batches

Quick start

Here is a minimal example that treats the simulator as a differentiable layer. Create your model and data in MuJoCo, wrap them with MjStep, and call it like a PyTorch module.

import torch
import mujoco as mj
from CppMjStep import MjStep, PyMjStep  # C++ op and a Python reference

# Load model
xml_path = "path/to/model.xml"
m = mj.MjModel.from_xml_path(filename=xml_path)
d = mj.MjData(m)

# Build differentiable step, choose horizon and threading
step = MjStep(m, d, n_steps=5, multithread=False)

# State and control tensors
state = torch.rand(m.nq + m.nv + m.na + m.nsensordata, requires_grad=True)
ctrl  = torch.rand(m.nu, requires_grad=True)

# One differentiable call that unrolls n_steps simulator steps
next_state = step(state, ctrl)
loss = next_state.square().mean()
loss.backward()

Two practical notes

First, MuJoCo 3.1.2 expects the initial state for rollouts to use the full physics state size, which includes time, and the project calls out that requirement explicitly. Second, you can toggle multithreading at initialization to trade a little determinism for speed, as sketched below.
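
As a sketch of the second point, continuing from the quick start above, a batched call might look like this. I am assuming a leading batch dimension on both tensors; check the README for the exact convention.

# Hypothetical batched usage: leading dimension is the batch, and multithread=True
# lets the extension spread the per-sample finite differencing across threads.
step_mt = MjStep(m, d, n_steps=5, multithread=True)

batch = 64
states = torch.rand(batch, m.nq + m.nv + m.na + m.nsensordata, requires_grad=True)
ctrls  = torch.rand(batch, m.nu, requires_grad=True)

next_states = step_mt(states, ctrls)    # shape (batch, state_dim)
next_states.square().mean().backward()  # gradients for every state and control in the batch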

View details on GitHub

Performance in practice

The repo includes timing plots that compare the C++ extension with the Python reference on models with and without sensors. The headline trend is simple: as batch size and model size grow, the C++ path pulls away, passing a 2x speedup over the Python reference at the scales typical of reinforcement learning workloads. That difference is what turns a proof of concept into a trainable loop.
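
If you want to reproduce the comparison on your own model, a rough timing harness over the quick-start objects could look like the sketch below; the iteration count is arbitrary, and I am assuming PyMjStep takes the same constructor arguments as MjStep.

import time

def seconds_per_call(op, iters=100):
    """Average wall-clock time of one forward+backward pass through a step op."""
    start = time.perf_counter()
    for _ in range(iters):
        s = torch.rand(m.nq + m.nv + m.na + m.nsensordata, requires_grad=True)
        u = torch.rand(m.nu, requires_grad=True)
        op(s, u).square().mean().backward()
    return (time.perf_counter() - start) / iters

cpp_time = seconds_per_call(MjStep(m, d, n_steps=5))
py_time = seconds_per_call(PyMjStep(m, d, n_steps=5))  # Python reference from the quick-start import
print(f"speedup ≈ {py_time / cpp_time:.2f}x")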

Benchmarks on GitHub

How it works under the hood

MuJoCo exposes discrete-time Jacobians through mjd_transitionFD: for a single step it finite-differences the transition matrices A = ∂x_next/∂x and B = ∂x_next/∂u, and it can also return the sensor Jacobians C = ∂s/∂x and D = ∂s/∂u. CppMjStep composes these across steps so gradients flow from the loss back to the initial state and controls. You keep the convenience of PyTorch while relying on finite differencing that is tailored to MuJoCo’s solver and contact model.
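
To see those matrices directly, you can call the routine yourself from the official Python bindings; the epsilon and centering values below are just illustrative choices.

import numpy as np
import mujoco as mj

m = mj.MjModel.from_xml_path("path/to/model.xml")
d = mj.MjData(m)

nx = 2 * m.nv + m.na                 # tangent-space state dimension used by the FD routine
A = np.zeros((nx, nx))               # ∂x_next/∂x
B = np.zeros((nx, m.nu))             # ∂x_next/∂u
C = np.zeros((m.nsensordata, nx))    # ∂sensor/∂x
D = np.zeros((m.nsensordata, m.nu))  # ∂sensor/∂u

eps, centered = 1e-6, True           # finite-difference step size and centering flag
mj.mjd_transitionFD(m, d, eps, centered, A, B, C, D)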

MuJoCo documentation

Relationship to prior work

CppMjStep builds on the idea introduced by DiffMjStep and extends it with sensor support and a multithreaded C++ core. If you have used DiffMjStep before, the mental model carries over, so moving projects onto the faster path should be straightforward.

Project GitHub

When to reach for it

If your problem needs extremely long horizons or very fine time steps, remember that backpropagation through time still carries a memory cost that grows with the horizon. CppMjStep helps with speed, but in heavy regimes you will still want techniques like truncated horizons or gradient checkpointing, as sketched below.
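
As a sketch of the checkpointing idea, using PyTorch’s generic torch.utils.checkpoint utility rather than anything built into CppMjStep, and assuming the step object from the quick start plus a (T, nu) tensor of controls:

from torch.utils.checkpoint import checkpoint

def rollout_with_checkpointing(state, ctrls, chunk=10):
    """Unroll a long horizon while only storing activations at chunk boundaries."""
    def run_chunk(s, u_chunk):
        for u in u_chunk:     # each u has shape (nu,)
            s = step(s, u)    # differentiable MuJoCo step from the quick start
        return s
    for t in range(0, ctrls.shape[0], chunk):
        # Intermediate states inside the chunk are recomputed during the backward pass.
        state = checkpoint(run_chunk, state, ctrls[t:t + chunk], use_reentrant=False)
    return state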

Install and learn more

The package is pip installable, with environment specifics and troubleshooting in the project’s install guide. The README also lists the precise state size requirement for MuJoCo 3.1.2, along with usage snippets and benchmark figures.

CppMjStep on GitHub

Acknowledgment

CppMjStep draws inspiration from DiffMjStep and uses MuJoCo’s official derivative functions to keep results consistent with the core simulator.

GitHub | mujoco.readthedocs.io