Why MARS-MINDS uses five models and a controller

The load-bearing architectural decision behind the 99.91 per cent Mission Planning benchmark. An end-to-end network would have lost more than three points and broken the upgrade path. Here is the engineering case for decomposition.

The single most consequential decision in the MARS-MINDS architecture was rejecting the end-to-end network. The published prior work from 2022, Mars-TRP, used a unified deep network across the 25-class Curiosity surface task and reported 88 per cent accuracy. The natural extension would have been to enlarge that network, add more data, and fine-tune harder. We did not. We decomposed the problem into four task heads and a reinforcement-learning controller. The Mission Planning head alone reaches 99.91 per cent. The path that got there is worth writing down.

The case for end-to-end

End-to-end deep networks have an attractive property: the loss propagates everywhere. The model can learn features that serve multiple tasks at once, share representations, exploit cross-task signal, and avoid the brittle handoff layers between sub-models. For most computer-vision workloads on terrestrial imagery, the end-to-end framing has won. ResNet-50 to a 1000-class softmax. ViT to detection. A single Mask R-CNN to instance segmentation. The default for a senior engineer should be end-to-end unless a stronger reason exists.

So why does Mars break that default?

The four reasons we decomposed

1. The tasks have incompatible inductive biases

Mission Planning is a fine-grained classification problem over a 25-way taxonomy. The right inductive bias is global attention over the whole image — a Vision Transformer wins this class of problem because the discriminative signal is distributed across the scene. Dust Storm Insight, by contrast, is a temporal sequence problem. Storms evolve over hours and sols, and the right inductive bias is short-range temporal recurrence — a ConvLSTM dominates this class of problem because the discriminative signal is in the sequence, not the single frame. Habitat Building is a segmentation problem with multi-scale features — EfficientNetV2 plus ConvNeXt-V2 with a Mask R-CNN head is the right family because the discriminative signal is in pixel-level boundary placement. Safe Landing is a multi-modal fusion problem over visible-light, thermal, and elevation channels — a fusion architecture is the right family because the discriminative signal is in cross-channel correlation.

A single end-to-end network has to compromise across all four inductive biases. Each task pays a tax on the architectural decisions the others would prefer. The ablation we ran inside the thesis put that tax at approximately five accuracy points on Mission Planning (96.2 per cent single-task against 91.4 per cent multi-task) and roughly five on Dust Storm. Together that is the difference between a publishable benchmark and an unremarkable one.

2. The upgrade path is independent per task

The state of the art moves at different speeds across the four task families. Vision Transformer variants improve approximately twice a year. ConvLSTM has been substantially static since 2017, with newer variants like ConvNeXt-Recurrent emerging at a slower cadence. Mask R-CNN improvements are typically incremental. Multi-modal fusion has accelerated dramatically since 2023. If we had bet on one network, we would have had to retrain the whole network whenever any one family advanced. With the decomposition, we swap the head, validate against the head-specific evaluation harness, and ship the new version. Maintainability across a mission programme lifetime is the load-bearing benefit, and it is invisible at benchmark time.
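The swap-and-validate workflow can be sketched as a small registry with a promotion gate. This is an illustrative sketch only: HeadRegistry, promote, and make_harness are hypothetical names rather than anything from the MARS-MINDS codebase, and each harness is assumed to reduce to a single score where higher is better.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Assumption: a head is any callable from an input to an output.
Head = Callable[[object], object]
# Assumption: a harness scores one head and returns a number, higher is better.
Harness = Callable[[Head], float]


@dataclass
class HeadRegistry:
    """Holds the live head per task and gates upgrades on that task's own harness."""
    harnesses: Dict[str, Harness]
    live: Dict[str, Head] = field(default_factory=dict)
    scores: Dict[str, float] = field(default_factory=dict)

    def promote(self, task: str, candidate: Head) -> bool:
        """Replace the live head for `task` unless the candidate scores worse."""
        score = self.harnesses[task](candidate)
        if task in self.live and score < self.scores[task]:
            return False  # candidate regressed; keep the current head
        self.live[task] = candidate
        self.scores[task] = score
        return True


def make_harness(expected: str) -> Harness:
    """Toy harness: a stand-in for a real held-out evaluation of one head."""
    return lambda head: 1.0 if head(None) == expected else 0.0


registry = HeadRegistry(harnesses={
    "mission_planning": make_harness("class_7"),
    "dust_storm": make_harness("no_storm"),
})
registry.promote("mission_planning", lambda x: "class_3")  # first head, accepted
registry.promote("dust_storm", lambda x: "no_storm")       # first head, accepted
upgraded = registry.promote("mission_planning", lambda x: "class_7")  # scores higher
rejected = registry.promote("dust_storm", lambda x: "storm")          # scores lower
print(upgraded, rejected)
```

The point of the design is that promoting a new Mission Planning head never touches the Dust Storm entry, so a regression in one task cannot block or contaminate an upgrade in another.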

3. The evaluation harness has to be readable

A Q1 manuscript reviewer reads the evaluation harness more carefully than the benchmark number. With an end-to-end network the harness is necessarily entangled — a single train-test split, a single confusion matrix, a single failure-mode analysis covering all four tasks at once. With the decomposition, each head has its own harness, its own confusion matrix, its own failure-mode analysis. A reviewer can examine the Mission Planning errors without having to disentangle them from the Dust Storm errors. The submission becomes legible. Five Q1 manuscripts are in the pipeline from this architecture; one Q1 manuscript would have come out of the end-to-end equivalent.
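A per-head harness of this kind can be as small as one confusion matrix per task. The sketch below uses made-up labels and predictions purely for illustration; confusion_matrix, accuracy, and per_head_report are hypothetical helper names, not the thesis tooling.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confusion_matrix(y_true: List[str], y_pred: List[str]) -> Counter:
    """Count (true_label, predicted_label) pairs for one head."""
    if len(y_true) != len(y_pred):
        raise ValueError("labels and predictions must have the same length")
    return Counter(zip(y_true, y_pred))


def accuracy(cm: Counter) -> float:
    """Fraction of pairs on the diagonal of the confusion matrix."""
    total = sum(cm.values())
    correct = sum(n for (t, p), n in cm.items() if t == p)
    return correct / total if total else 0.0


def per_head_report(results: Dict[str, Tuple[List[str], List[str]]]) -> Dict[str, dict]:
    """Build an independent accuracy figure and error list for every head."""
    report = {}
    for head, (y_true, y_pred) in results.items():
        cm = confusion_matrix(y_true, y_pred)
        errors = {pair: n for pair, n in cm.items() if pair[0] != pair[1]}
        report[head] = {"accuracy": accuracy(cm), "errors": errors}
    return report


# Invented toy data, not MARS-MINDS results.
report = per_head_report({
    "mission_planning": (["crater", "dune", "crater", "rock"],
                         ["crater", "dune", "rock", "rock"]),
    "dust_storm": (["storm", "clear", "clear"],
                   ["storm", "clear", "storm"]),
})
for head, stats in report.items():
    print(head, stats)
```

Because each head's errors are keyed only by that head's own label space, a reader can inspect the Mission Planning confusions without ever seeing a Dust Storm label.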

4. The reinforcement-learning controller does what the network cannot

The controller is trained with PPO over a reward shaped against ground-truth mission-outcome trajectories with safety penalties on hazardous decisions. It learns to weight the four head outputs against the mission context — for example, downweighting Dust Storm Insight in a season where storms are rare, upweighting Safe Landing when the rover is in a hazardous terrain class. An end-to-end network cannot learn that policy because the policy is over the head outputs, not over the input pixels. The controller is the layer that turns four perception modules into a decision system.
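To make the training signal concrete, here is a minimal sketch of a shaped reward and the standard PPO clipped surrogate objective. The source does not give the exact reward form, the penalty size, or which PPO variant is used, so shaped_reward, its fixed safety_penalty, and the clip_eps value of 0.2 are all assumptions made for illustration.

```python
from typing import Sequence


def shaped_reward(outcome_score: float, hazardous: bool,
                  safety_penalty: float = 1.0) -> float:
    """Assumed form: mission-outcome reward minus a fixed penalty for a hazardous decision."""
    return outcome_score - (safety_penalty if hazardous else 0.0)


def ppo_clipped_objective(ratios: Sequence[float], advantages: Sequence[float],
                          clip_eps: float = 0.2) -> float:
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over a batch.

    `ratios` are new-policy / old-policy action probabilities and
    `advantages` are the matching advantage estimates.
    """
    if not ratios or len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must be non-empty and equal length")
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(r, 1.0 + clip_eps))
        terms.append(min(r * a, clipped * a))
    return sum(terms) / len(terms)


# A hazardous decision is pushed below a safe one with the same outcome score.
safe = shaped_reward(1.0, hazardous=False)
unsafe = shaped_reward(1.0, hazardous=True, safety_penalty=2.0)

# Clipping caps the gain from a large ratio and keeps the full cost of a bad one.
objective = ppo_clipped_objective(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
print(safe, unsafe, objective)
```

In this toy batch the first term is capped at 1.2 instead of 1.5 and the second is held at -0.8 instead of -0.5, which is the conservative update behaviour that makes PPO a reasonable fit for a safety-penalised reward.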

The numbers that decided it

Inside the thesis we ran an apples-to-apples ablation with three arms: a single end-to-end Vision Transformer trained on all four tasks simultaneously; a single end-to-end ViT trained only on Mission Planning, to isolate the multi-task tax; and the five-model decomposition with the controller.

End-to-end on all four tasks: 91.4 per cent Mission Planning. End-to-end on Mission Planning only: 96.2 per cent. Five-model decomposition with controller: 99.91 per cent.

The end-to-end network across all four tasks lost more than 8 points on Mission Planning compared to the decomposition. The end-to-end network on Mission Planning alone — paying no multi-task tax — still lost almost four points compared to the decomposition. The decomposition wins on accuracy at the benchmark line and on every other dimension that matters for a maintainable mission-grade system.
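The gaps quoted above follow directly from the three reported accuracies. The short calculation below reproduces them; the dictionary keys and the gap helper are names chosen for this sketch, and Decimal is used only to keep the subtraction exact.

```python
from decimal import Decimal

# Mission Planning accuracy per ablation arm, in per cent, as reported above.
ACCURACY = {
    "end_to_end_all_tasks": Decimal("91.4"),
    "end_to_end_mission_planning_only": Decimal("96.2"),
    "decomposition_with_controller": Decimal("99.91"),
}


def gap(better: str, worse: str) -> Decimal:
    """Accuracy difference in percentage points between two ablation arms."""
    return ACCURACY[better] - ACCURACY[worse]


gap_vs_multi_task = gap("decomposition_with_controller", "end_to_end_all_tasks")
gap_vs_single_task = gap("decomposition_with_controller",
                         "end_to_end_mission_planning_only")
multi_task_tax = gap("end_to_end_mission_planning_only", "end_to_end_all_tasks")

print("decomposition vs end-to-end (all tasks):", gap_vs_multi_task)
print("decomposition vs end-to-end (Mission Planning only):", gap_vs_single_task)
print("multi-task tax on Mission Planning:", multi_task_tax)
```

The first two figures are the "more than 8" and "almost four" point gaps; the third is the cost of sharing one network across four tasks, isolated by the single-task arm.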

The architecture diagram in plain English

At inference time the unified input (a HiRISE or Curiosity tile) arrives and is fanned out to four heads in parallel. Each head emits a structured output specific to its task: a 25-way class distribution for Mission Planning, a temporal-window dust-storm probability vector for Dust Storm Insight, a segmentation mask plus suitability score for Habitat Building, and a landing-safety class plus confidence for Safe Landing. The controller ingests all four outputs along with the mission context (rover position, mission phase, hazard budget, seasonal calendar) and emits a unified decision plus an explanation trace. Per-head latency is under 200 milliseconds, and the controller sustains more than five thousand unified inferences per minute. The whole system runs on a Tesla T4 in production.
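The data flow can be sketched in a few lines. Everything here is a stand-in: the four head functions return fixed dummy values, the field names are hypothetical, and the controller is a hand-written placeholder rule where the real system uses the trained PPO policy.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class MissionContext:
    rover_position: Tuple[float, float]
    mission_phase: str
    hazard_budget: float
    season: str


@dataclass
class Decision:
    action: str
    trace: List[str]


# Stub heads with dummy outputs; the real heads are trained models.
def mission_planning_head(tile) -> dict:
    return {"class_probs": [0.1, 0.9]}  # 25-way in the system described above

def dust_storm_head(tile) -> dict:
    return {"storm_probs": [0.05, 0.07, 0.04]}

def habitat_building_head(tile) -> dict:
    return {"mask": [[0, 1], [1, 1]], "suitability": 0.6}

def safe_landing_head(tile) -> dict:
    return {"safety_class": "safe", "confidence": 0.97}


HEADS: Dict[str, Callable] = {
    "mission_planning": mission_planning_head,
    "dust_storm": dust_storm_head,
    "habitat_building": habitat_building_head,
    "safe_landing": safe_landing_head,
}


def run_heads(tile) -> Dict[str, dict]:
    """Fan one tile out to all heads in parallel and collect their outputs."""
    with ThreadPoolExecutor(max_workers=len(HEADS)) as pool:
        futures = {name: pool.submit(fn, tile) for name, fn in HEADS.items()}
        return {name: future.result() for name, future in futures.items()}


def controller(outputs: Dict[str, dict], ctx: MissionContext) -> Decision:
    """Placeholder rule, not the learned policy: hold if landing looks risky."""
    trace = [f"{name}: {out}" for name, out in sorted(outputs.items())]
    landing = outputs["safe_landing"]
    risky = (landing["safety_class"] != "safe"
             or landing["confidence"] < 1.0 - ctx.hazard_budget)
    trace.append(f"context: phase={ctx.mission_phase}, hazard_budget={ctx.hazard_budget}")
    return Decision(action="hold" if risky else "proceed", trace=trace)


ctx = MissionContext((4.5, 137.4), "traverse", hazard_budget=0.2, season="low_dust")
outputs = run_heads(tile=None)
decision = controller(outputs, ctx)
print(decision.action)
```

Keeping the controller's input as a plain mapping of head name to structured output is what lets a head be replaced without changing the controller's interface.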

When to choose end-to-end instead

The decomposition wins for MARS-MINDS specifically because the four tasks have incompatible inductive biases, an independent upgrade cadence, a manuscript-readability requirement, and a downstream decision policy that cannot live inside a perception network. None of those conditions necessarily hold for a different problem. If you are building a single-task perception system with no policy layer above it, end-to-end is still the right default. If you are building a multi-task perception system where the tasks share a strong inductive bias — for example, all four tasks are pixel-level segmentation over similar imagery — end-to-end is still likely the right default.

The decision to decompose is not the default. It is the right answer when the four conditions hold. For Mars mission intelligence they hold strongly. For most enterprise vision workloads they do not. The architecture choice is downstream of the problem structure, not upstream of a general preference.

What the controller learned that surprised us

The PPO controller learned an explicit hazard-budget policy that we did not encode. In low-hazard terrain classes, the controller weighted Mission Planning heavily and Safe Landing lightly. In high-hazard terrain classes, the controller inverted the weighting and pulled in Habitat Building scores as a tie-break. The behaviour resembles the way a human mission planner would triage the same decision under time pressure. We did not write that policy; the reward function rewarded mission-outcome trajectories and the controller induced the policy. That is the property that makes the controller worth its weight — a property no perception network would produce.

Closing

The 99.91 per cent number is the headline. The architecture is the story. The five-model decomposition with a PPO controller is the load-bearing decision that produced the headline, that produced the Q1 manuscript pipeline, and that produced a system maintainable across a mission programme lifetime. End-to-end was the default. Decomposition was the right answer. The Mars mission decision surface is not a perception problem with a softmax on top; it is a policy problem with four perception inputs. That is the framing that unlocked the benchmark.
