Parametric Lunar Lander: A Controlled-Dynamics Testbed

Physics variation for studying RL agents and world models

Lunar Lander is a small control environment: a policy uses main and side thrust to bring a vehicle down safely between two flags. In the stock environment, the physics stays the same every episode. Gravity stays the same. Engine powers stay the same. Density stays the same. The wind regime stays the same. That is exactly what a fixed control environment is for: hold the dynamics constant so the policy is the thing being measured.

That constancy is not enough when the question is what an RL agent or a world model has actually learned about its environment. If the physics never changes, an RL agent cannot show whether it learned to adapt to different dynamics, and a world model cannot be checked against physics variation it never had to predict. To make those questions measurable, the parameterized fork in this repo lets the physics vary from one episode to the next.

The Parametric Fork

parametric-lunar-lander is a parameterized fork of Gymnasium’s Lunar Lander. It exposes seven continuous physics parameters that can be sampled per episode: gravity, main and side engine powers, lander density, angular damping, wind power, and turbulence power. The default physics values match the stock environment. Gymnasium supports both discrete and continuous control for Lunar Lander; the fork keeps only the continuous action space.

When physics labels are exposed, the observation includes the stock 8D Lunar Lander state (position, velocity, angle, angular velocity, and leg-contact indicators) together with the current 7D physics vector; when they are withheld, the agent sees only the stock state. Sampling is configured through YAML profiles that can hold any subset of parameters fixed and vary the rest.
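A profile might look like the following sketch. The key names and layout are illustrative, not the fork's actual schema; the point is that a scalar pins a parameter and a range makes it vary per episode.

```yaml
# Hypothetical sampling profile; field names are illustrative.
observation: labeled            # or: blind, history
parameters:
  gravity:           {low: -12.0, high: -2.0}   # sampled each episode
  wind_power:        {low: 0.0,  high: 30.0}    # sampled each episode
  turbulence_power:  1.5                        # held fixed
  main_engine_power: 13.0                       # held fixed at default
```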

Why Lunar Lander

Lunar Lander is one of the smallest environments where the chain from agent action to world consequence is short enough to write down. The policy emits main and side thrust commands; those become forces on the lander body, alongside gravity and any wind or turbulence. Box2D’s rigid-body solver resolves them, together with contact forces at the legs, into linear and angular motion. At a schematic level, the chain reads

action → forces on lander → rigid-body integration → velocity, position, orientation
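That chain is short enough to write out as code. The following is a toy point-mass sketch with assumed constants (mass, inertia, power scaling), not Box2D's actual solver; it only illustrates how an action and the physics parameters combine into motion.

```python
import math

def step(state, action, p, dt=0.02):
    """One toy integration step mirroring the schematic chain.
    state  = (x, y, vx, vy, theta, omega)
    action = (main, side), each in [0, 1]
    p      = dict of toy physics constants (illustrative, not the env's units)."""
    x, y, vx, vy, theta, omega = state
    main, side = action
    # action -> forces on lander: main thrust along the body axis, plus gravity
    fx = -math.sin(theta) * p["main_power"] * main
    fy = math.cos(theta) * p["main_power"] * main + p["gravity"] * p["mass"]
    torque = p["side_power"] * side
    # forces -> rigid-body integration (semi-implicit Euler)
    vx += fx / p["mass"] * dt
    vy += fy / p["mass"] * dt
    omega += torque / p["inertia"] * dt
    # integration -> velocity, position, orientation
    return (x + vx * dt, y + vy * dt, vx, vy, theta + omega * dt, omega)
```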

That transparency is what makes “what gravity does this trained world model implicitly assume?” a measurable question later.

The dynamics are coupled (translational and rotational motion interact through orientation, lateral drift matters, ground contact matters) but small enough to remain analytically tractable. Hover thrust, thrust-to-weight ratio, and minimum controllable TWR can be written down. That gives this substrate ground truth, not just trajectories. Many control-relevant quantities are computable in closed form for any sampled physics config.
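As a sketch of that closed-form ground truth, thrust-to-weight ratio and hover throttle fall straight out of the sampled parameters. The `area` and `thrust_scale` values below are placeholder assumptions standing in for the fixture geometry and the engine-power-to-force conversion inside the env.

```python
def lander_mass(density, area=4.0):
    # Mass from density and cross-sectional area; the area value is
    # illustrative, the real fixture geometry sets it.
    return density * area

def thrust_to_weight(main_power, density, gravity, area=4.0, thrust_scale=1.0):
    # TWR = max thrust / weight. thrust_scale stands in for the
    # power-to-force conversion, which depends on the env internals.
    weight = lander_mass(density, area) * abs(gravity)
    return thrust_scale * main_power / weight

def hover_throttle(main_power, density, gravity, area=4.0, thrust_scale=1.0):
    # Throttle fraction that exactly cancels weight; > 1 means hover
    # is impossible for this configuration.
    return 1.0 / thrust_to_weight(main_power, density, gravity, area, thrust_scale)
```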

This fork also ships with calibration checks. The repo includes a calibration module that runs canonical maneuvers such as hover, freefall, and pure thrust across sampled physics configs and checks that the parameterization behaves as intended under controlled probes. The target of those checks is the exposed parameterization, not Box2D itself: they are there to catch mistakes in how the seven knobs translate into actual forces and responses, not to verify the underlying solver. A heuristic PD controller using only the stock state lands the default config; that is the floor the substrate sits on.
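A freefall probe of that kind can be sketched generically: drop the lander with zero thrust, estimate vertical acceleration from observed velocity changes, and compare against the configured gravity. The `reset_fn`/`step_fn` interface below is illustrative, not the repo's actual calibration API.

```python
def freefall_probe(reset_fn, step_fn, gravity, dt, n_steps=50, tol=0.05):
    """Zero-action probe: average the observed vertical acceleration and
    check it against the configured gravity within a relative tolerance.
    reset_fn() returns the initial vertical velocity; step_fn(action)
    advances one timestep and returns the new vertical velocity."""
    vy = reset_fn()
    accs = []
    for _ in range(n_steps):
        vy_next = step_fn((0.0, 0.0))        # no main or side thrust
        accs.append((vy_next - vy) / dt)
        vy = vy_next
    implied_g = sum(accs) / len(accs)
    return abs(implied_g - gravity) <= tol * abs(gravity), implied_g
```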

The Body-World Split

The seven parameters split into two categories that pose different adaptation problems, even though they couple in the dynamics.

Parameter           Default   Range         Category
main engine power   13.0      [5, 25]       body
side engine power   0.6       [0.2, 1.5]    body
lander density      5.0       [2.5, 10]     body
angular damping     0.0       [0, 5]        body
gravity             -10.0     [-12, -2]     world
wind power          15.0      [0, 30]       world
turbulence power    1.5       [0, 5]        world

Body parameters describe what the agent’s vehicle is and how it responds to commands: the engine powers, the density that determines how a force becomes an acceleration, the damping that controls how angular velocity decays. World parameters describe what the environment is doing to the vehicle regardless of the agent’s choice: gravity acting through the body’s mass, wind applying a horizontal force, turbulence applying angular torque.

The split isn’t cosmetic. Varying body parameters forces the agent to recalibrate its action-to-consequence mapping; the same command now produces different accelerations. Varying world parameters forces the agent to infer what environment it is in; the vehicle responds the same way, but the external forces differ. Those are different adaptation problems. The split is a useful organizing distinction for adaptation, not a claim that the underlying dynamics decouple cleanly.

The raw parameters can also be combined into derived quantities that are useful at sampling time. Thrust-to-weight ratio is the main example here: it is computed from main engine power, density, lander area, and gravity, and gives a compact way to describe how much thrust margin a sampled configuration has. The substrate exposes twr_min and twr_max as sampling constraints, so distributions can be restricted to regimes with the desired controllability properties. That does not make TWR an eighth free parameter; it is a derived quantity enforced by rejection sampling over the underlying body and world parameters. The same pattern extends naturally to other derived quantities.
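The rejection-sampling pattern is simple enough to sketch; the names and the `twr_fn` hook below are illustrative, not the substrate's actual sampler.

```python
import random

def sample_config(ranges, twr_fn, twr_min, twr_max, max_tries=10_000, rng=random):
    """Rejection-sample a physics config until its derived TWR lands in
    [twr_min, twr_max]. ranges maps parameter name -> (low, high);
    twr_fn computes thrust-to-weight from a sampled config dict."""
    for _ in range(max_tries):
        cfg = {k: rng.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
        if twr_min <= twr_fn(cfg) <= twr_max:
            return cfg                        # TWR stays derived, not free
    raise RuntimeError("TWR constraint too tight for the given ranges")
```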

Episode Variation and Observation Variants

The parameterized version makes two choices available at configuration time.

Episode variation. Each parameter can either be fixed or allowed to vary across episodes. A fixed parameter keeps the same value every episode. A varied parameter is sampled from a specified range each episode. That can be done for any subset of the seven parameters, so the vehicle, the world, or both can be held constant or varied together.

In practice these recurring setups often get short labels for convenience: stock physics as gym-default, vehicle-only variation as body-only, environment-only variation as world-only, and joint variation as full-variation. Named range configurations such as easy, medium, and hard are reusable presets over the same parameter space. They package different variation ranges and controllability margins, but do not change the task itself. Curriculum schedules can sequence those configurations across training when needed.

Observation variation. The observation can either expose the physics labels or withhold them. In all variants, the state is augmented with seven terrain raycasts over a 120-degree arc in the lander’s ego frame. These rays are a fixed sensing design choice, not part of the physics parameterization. The labeled variant provides the stock 8D state together with the 7D physics vector, for a 22D observation once the terrain rays are included. The blind variant provides only the stock 8D state plus the same terrain rays, for a 15D observation. A history variant stacks a configurable number K of blind-observation frames for temporal context.
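The dimensions above, and the history variant's frame stacking, can be sketched directly. The constants come from the text; the `FrameStack` class is a minimal illustration, not the fork's wrapper.

```python
from collections import deque

STOCK_DIM, PHYSICS_DIM, RAY_DIM = 8, 7, 7   # stock state, physics labels, terrain rays

def obs_dim(variant, k=1):
    # Observation sizes implied by the variants described above.
    if variant == "labeled":
        return STOCK_DIM + PHYSICS_DIM + RAY_DIM      # 22
    if variant == "blind":
        return STOCK_DIM + RAY_DIM                    # 15
    if variant == "history":
        return (STOCK_DIM + RAY_DIM) * k              # K stacked blind frames
    raise ValueError(f"unknown variant: {variant}")

class FrameStack:
    """Minimal history-variant sketch: concatenate the last K blind frames,
    padding with copies of the first frame after reset."""
    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)
    def reset(self, obs):
        self.frames.clear()
        self.frames.extend([obs] * self.k)
        return [x for frame in self.frames for x in frame]
    def step(self, obs):
        self.frames.append(obs)
        return [x for frame in self.frames for x in frame]
```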

These choices compose directly, so a run can, for example, use the blind observation variant with medium world-only variation, or the labeled variant with hard full variation.

Implied Dynamics, Not Causal Structure

Lunar Lander gives a short action-to-consequence chain, analytical ground truth, and controllable variation along seven physical axes. Those properties make it the right substrate for asking what scalar dynamics a trained model has implicitly assumed and how information channels shape the controllers a policy converges to.

The environment contains real physics, including coupled dynamics, contact forces, and multi-axis control, but it does not strongly constrain a learned model toward a reusable or identifiable decomposition of that structure. Even with more explicit factoring of environment and action effects, a network could still solve parametric Lunar Lander by learning (state, action) → next_state as one entangled function; the substrate may make some separations easier to probe, but it does not force decomposition into reusable mechanisms.

The distinction worth marking explicitly is between two different questions:

  • Implied-dynamics consistency: what scalar physics the model’s transitions are consistent with, against known ground truth.
  • Identifiable causal representation: whether the model has discovered factored, reusable, compositional mechanisms.

The distinction matters because these are different kinds of claims. This substrate is well suited to questions about implied dynamics: whether a learned model or controller behaves in ways consistent with known scalar physics under controlled variation. It may not be well suited to claims about identifiable or reusable causal mechanisms, because a model can still succeed here with a fairly entangled representation of the dynamics rather than a clean decomposition into modular parts.

Toward Factored Dynamics

One possible extension to the substrate is a factored-dynamics mode. Each timestep currently runs one Box2D step that integrates environment forces and action forces together, and the trajectory is their combined result. That is the right default. But some questions become sharper when the channels can be separated explicitly: what the environment did to the lander regardless of the agent’s choice, and what the agent’s action contributed on top of that. A factored version would run two steps per timestep: one env-only and one env-plus-action. Both trajectories and their difference would be exposed, so the decomposition is available directly downstream. That explicit factoring is not built yet, and is noted here as a possible extension of the current substrate.
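Under that design, a factored step might look like the following sketch. `env_step` is a hypothetical pure transition function standing in for a Box2D step; nothing like this exists in the repo yet.

```python
def factored_step(state, action, env_step):
    """Hypothetical factored-dynamics step: run the same transition twice
    from the same state, once with a null action and once with the agent's
    action, and expose both results plus their difference as the action's
    marginal contribution at this timestep."""
    null_action = tuple(0.0 for _ in action)
    env_only = env_step(state, null_action)       # what the world did alone
    env_plus_action = env_step(state, action)     # the combined result
    action_effect = tuple(a - b for a, b in zip(env_plus_action, env_only))
    return env_only, env_plus_action, action_effect
```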

Two Lines of Work

This substrate currently motivates two downstream lines of work.

One line of work uses the parameterization to evaluate world models trained to predict transitions on parametric trajectories. The substrate alone is a precondition, not the measurement: the actual move is an extraction procedure that infers a scalar physical constant, such as an implied gravity, thrust scaling, or damping, from a model’s predicted state changes and compares it against the known ground-truth value for that environment configuration. The substrate is what makes that comparison clean; the extraction procedure is what makes it a measurement. That makes it possible to ask not just whether the model predicts the next state well, but whether the dynamics it implicitly encoded look anything like the physics the data came from.
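A minimal version of that extraction, under the strong assumption that the probed transitions are zero-thrust and wind-free, is just an average acceleration over the model's predicted velocity changes; the interface below is illustrative.

```python
def implied_gravity(transitions, dt):
    """Extraction sketch: given zero-thrust transitions predicted by a
    world model as (vy, vy_next) pairs, the implied gravity is the mean
    vertical acceleration. A real extraction would have to control for
    wind, turbulence, and ground contact."""
    accs = [(vy_next - vy) / dt for vy, vy_next in transitions]
    return sum(accs) / len(accs)
```

Comparing that scalar against the ground-truth gravity of the sampled configuration is the measurement the text describes.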

The second is an RL agents line. Because the physics can vary across episodes and observation variants can expose or withhold the physics labels, RL agents can be trained on the same task under changing dynamics and different information channels, then compared through the controllers they converge to. That makes it possible to ask not just whether an agent can land, but how it adapts when the dynamics move and how different information channels shape that adaptation.

This post is the reference point for the downstream work: the substrate gives ground truth, controlled variation, and a legible mechanical chain. What counts as a meaningful answer is not something it can settle.