Technical Report · February 2025

Mimir: Adaptive Multi-Model Reasoning for General-Purpose Agent Tasks

Team Norse

Abstract

We present Mimir, a general-purpose reasoning agent that dynamically routes sub-tasks across heterogeneous foundation model families to maximize performance on complex, multi-step benchmarks. Unlike single-model agents that inherit the blind spots of their backbone, Mimir maintains a lightweight capability profile for each available model and dispatches planning, retrieval, code generation, and verification steps to the model best suited for each operation.

The system employs a three-stage pipeline: (1) hierarchical task decomposition with dependency-aware scheduling, (2) adaptive model selection via a learned router that considers task type, context length, and historical accuracy, and (3) a self-verification loop that cross-checks intermediate outputs using an independent model to reduce hallucination and compounding errors. We evaluate Mimir on GAIA, MMLU-Pro, HumanEval+, and an internal multi-hop QA suite, observing consistent improvements over single-model baselines across all task categories.

Architecture

High-level overview of the Mimir agent pipeline. The router selects among available model backends for each sub-task based on capability profiles and task requirements.

[Figure 1 diagram: User Query → Task Decomposer (DAG generation, dependency analysis) → Adaptive Router (capability matrix C ∈ ℝ^(M×T)) → Model Family A (reasoning, long context) / Model Family B (code gen, structured output) / Model Family C (retrieval, fast classification); a shared Tool Registry (search, code exec, file I/O, APIs) serves all executors, and a Cross-Model Verifier (consistency, grounding) can trigger up to 2 retries.]

Figure 1. Mimir agent architecture. The adaptive router dispatches sub-tasks to model backends based on a learned capability matrix. The cross-model verifier checks outputs and triggers re-dispatch on failure.

Methodology

1. Hierarchical Task Decomposition

Given a user query, Mimir first generates a structured execution plan as a directed acyclic graph (DAG) of sub-tasks. Each node is annotated with the estimated task type (reasoning, retrieval, code generation, arithmetic, or synthesis) and required context window. This decomposition step itself is performed by a planning-specialized model, selected from the available pool based on prior decomposition accuracy.
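The DAG-of-sub-tasks plan can be sketched as follows. This is a minimal illustration, not Mimir's actual schema: the `SubTask` fields and the example plan are hypothetical, and scheduling simply uses a standard topological sort over the dependency edges.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical sub-task node; field names are illustrative.
@dataclass
class SubTask:
    task_id: str
    task_type: str          # "reasoning" | "retrieval" | "code_gen" | "arithmetic" | "synthesis"
    context_tokens: int     # estimated context window required
    deps: list = field(default_factory=list)

def schedule(nodes):
    """Return sub-task IDs in a dependency-respecting execution order."""
    ts = TopologicalSorter({n.task_id: set(n.deps) for n in nodes})
    return list(ts.static_order())

# Toy plan: retrieve evidence, reason over it, then synthesize an answer.
plan = [
    SubTask("t1", "retrieval", 2000),
    SubTask("t2", "reasoning", 8000, deps=["t1"]),
    SubTask("t3", "synthesis", 4000, deps=["t1", "t2"]),
]
order = schedule(plan)  # ["t1", "t2", "t3"]
```

Dependency-aware scheduling of this form also exposes parallelism for free: any sub-tasks that become ready in the same round of the sort can be dispatched concurrently.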

2. Adaptive Model Routing

The core routing mechanism maintains a capability matrix C ∈ ℝ^(M×T), where M is the number of available models and T the number of task types. Each entry c_{m,t} represents the estimated accuracy of model m on task type t, updated online via an exponential moving average after each verified execution. At dispatch time, the router selects arg max_m c_{m,t} · w_m(context_len, latency_budget), where w_m is a per-model weighting function that penalizes models exceeding their effective context window or violating latency constraints.
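The EMA update and the weighted argmax can be sketched in a few lines. The dimensions, the step size `ALPHA`, the context limits, and the hard-cutoff form of `w_m` (zeroing out models whose window is exceeded) are all assumptions for illustration; the paper does not specify these hyperparameters.

```python
import numpy as np

M, T = 3, 5                        # toy sizes: 3 models, 5 task types
C = np.full((M, T), 0.5)           # capability matrix, neutral init (assumed)
ALPHA = 0.1                        # EMA step size (assumed hyperparameter)
CTX_LIMIT = np.array([128_000, 32_000, 8_000])  # effective windows (assumed)

def update(m, t, success):
    """Online EMA update of c_{m,t} after a verified execution."""
    C[m, t] = (1 - ALPHA) * C[m, t] + ALPHA * float(success)

def route(t, context_len):
    """Select arg max_m c_{m,t} * w_m, with w_m = 0 if the context doesn't fit."""
    w = (CTX_LIMIT >= context_len).astype(float)
    return int(np.argmax(C[:, t] * w))

update(0, 2, True)                 # model 0 verified correct on task type 2
update(1, 2, False)                # model 1 failed verification
m = route(2, 50_000)               # only model 0's window fits 50k tokens -> 0
```

A latency term would multiply into `w` the same way; the hard cutoff shown here is the simplest choice, and a smooth penalty would also fit the formula in the text.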

3. Tool-Augmented Execution

Each sub-task executor has access to a shared tool registry including web search, code interpreters (Python, JavaScript), file I/O, and structured data APIs. Tool calls are generated by the dispatched model and executed in a sandboxed environment. Results are injected back into the model's context for the next reasoning step. We impose a maximum tool-call depth of 8 per sub-task to prevent runaway loops.
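The execute-and-reinject loop with a depth cap can be sketched as below. The `model_step` and `execute_tool` callables are hypothetical stand-ins for the dispatched model and the sandboxed tool registry, and the action dictionary format is an assumption, not Mimir's actual protocol.

```python
MAX_TOOL_DEPTH = 8  # per-sub-task cap, as stated in the text

def run_subtask(model_step, execute_tool, prompt):
    """Alternate model steps and sandboxed tool calls, up to MAX_TOOL_DEPTH.

    model_step(context) returns either {"type": "final", "answer": ...}
    or {"type": "tool", "tool": name, "args": {...}} (assumed format).
    """
    context = prompt
    for _ in range(MAX_TOOL_DEPTH):
        action = model_step(context)
        if action["type"] == "final":
            return action["answer"]
        # Execute the tool call in the sandbox and inject the result
        # back into the model's context for the next reasoning step.
        result = execute_tool(action["tool"], action["args"])
        context += f"\n[tool:{action['tool']}] {result}"
    return None  # depth exhausted; the caller treats this as a failure
```

Returning `None` on depth exhaustion lets the verification stage (below) treat a runaway loop the same way as any other failed sub-task.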

4. Cross-Model Verification

After each sub-task completes, its output is passed to a verification step executed by a different model than the one that produced the answer. The verifier checks for internal consistency, factual grounding (when retrieval sources are available), and adherence to the original sub-task specification. If verification fails, the sub-task is re-dispatched — optionally to a different model — with the verification feedback appended to the prompt. A maximum of two retries is enforced before the system proceeds with a confidence-discounted answer.
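The retry loop with a confidence-discounted fallback can be sketched as follows. The `produce` and `verify` hooks stand in for the dispatched model and the independent verifier model; the discount factor is an assumption, since the text specifies the two-retry cap but not the penalty.

```python
MAX_RETRIES = 2     # cap from the text
DISCOUNT = 0.5      # confidence penalty for unverified answers (assumed)

def verified_execute(produce, verify, spec):
    """Run a sub-task, cross-check with an independent model, retry on failure.

    produce(spec, feedback) -> (answer, confidence); feedback from the failed
    verification is appended to the prompt on retries (None on first attempt).
    verify(spec, answer) -> (ok, feedback), run by a *different* model.
    """
    feedback = None
    for _ in range(1 + MAX_RETRIES):
        answer, confidence = produce(spec, feedback)
        ok, feedback = verify(spec, answer)
        if ok:
            return answer, confidence
    # Retries exhausted: proceed with a confidence-discounted answer.
    return answer, confidence * DISCOUNT
```

Re-dispatching to a different model on retry would amount to calling the router again before `produce`; the control flow above is unchanged either way.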

Limitations

Mimir's routing mechanism relies on task-type classification, which can misclassify ambiguous sub-tasks and dispatch them to sub-optimal models. The capability matrix requires a warm-up period of approximately 50–100 tasks to converge to reliable estimates, during which routing decisions are effectively random. Additionally, the cross-model verification step roughly doubles inference cost per sub-task, which may be prohibitive for latency-sensitive applications.

We also note that our evaluation is limited to English-language benchmarks and may not generalize to multilingual or low-resource settings where model capability profiles differ substantially.

Citation

If you find this work useful, please cite:

@article{lindqvist2025mimir,
  title={Mimir: Adaptive Multi-Model Reasoning for General-Purpose Agent Tasks},
  author={Team Norse},
  journal={arXiv preprint arXiv:2502.14308},
  year={2025}
}