Optimized for multi-chip. Across every workload family.
Every benchmark on this page measures the metric that actually matters on multi-chip hardware: cross-chip communication. Fewer cuts mean fewer cross-chip teleportations and a lower cloud bill.
Limbo matches or beats the leading distributed compiler on 19 of 21 cells in our v0.2 matrix, and beats TKET on 12 of 12 multi-chip-focused cells. Versus a graph-blind naïve baseline, Limbo averages 71% fewer cuts per cell.
Workload coverage
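Seven workloads grouped into four families, each run at three scales (16 / 32 / 64 logical qubits) on the same hardware envelope: 4 QPUs with a capacity of 20 qubits each. Cut count = cross-QPU 2-qubit operations = Bell-pair teleportations the runtime must execute. Lower is better. "Alt." is the leading distributed alternative, and Δ vs alt. is the relative change in Limbo's cut count against it (negative means Limbo cuts less).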
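To make the cut-count definition concrete, here is a minimal sketch that counts cross-chip 2-qubit gates for a given qubit-to-chip assignment and applies it to an index-striped (graph-blind) partition of a textbook QFT. The function names are illustrative and not part of the Limbo SDK. The gate model is an assumption: every controlled-phase is lowered to two CNOTs and the final SWAP layer is left out. Under that assumption the sketch happens to reproduce the Naïve cuts column for QFT (192 / 768 / 3072).

```python
from itertools import combinations

def stripe_partition(n_qubits, n_chips):
    """Graph-blind baseline: assign qubits to chips in contiguous index stripes."""
    per_chip = -(-n_qubits // n_chips)  # ceiling division
    return [q // per_chip for q in range(n_qubits)]

def cut_count(two_qubit_gates, assignment):
    """Number of 2-qubit gates whose operands sit on different chips."""
    return sum(1 for a, b in two_qubit_gates if assignment[a] != assignment[b])

def qft_two_qubit_gates(n_qubits, cnots_per_cphase=2):
    """All-to-all controlled-phase pattern of a textbook QFT.

    Assumption: each controlled-phase lowers to `cnots_per_cphase` CNOTs
    and the final SWAP layer is omitted.
    """
    gates = []
    for a, b in combinations(range(n_qubits), 2):
        gates.extend([(a, b)] * cnots_per_cphase)
    return gates

for n in (16, 32, 64):
    assignment = stripe_partition(n, n_chips=4)
    print(n, cut_count(qft_two_qubit_gates(n), assignment))
```

A graph-aware partitioner optimizes the same `cut_count` objective but chooses `assignment` from the circuit's interaction structure instead of from qubit indices.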
Fourier
| Workload | Qubits | Limbo cuts | Alt. cuts | Naïve cuts | Δ vs alt. |
|---|---|---|---|---|---|
| QFT | 16 | 0 | 0.0 | 192 | tie |
| QFT | 32 | 330 | 562.0 | 768 | −41% |
| QFT | 64 | 910 | 3008.0 | 3072 | −70% |
| IQFT | 16 | 0 | 0.0 | 192 | tie |
| IQFT | 32 | 350 | 562.0 | 768 | −38% |
| IQFT | 64 | 1028 | 3008.0 | 3072 | −66% |
Variational
| Workload | Qubits | Limbo cuts | Alt. cuts | Naïve cuts | Δ vs alt. |
|---|---|---|---|---|---|
| QAOA MaxCut | 16 | 0 | 0.0 | 128 | tie |
| QAOA MaxCut | 32 | 248 | 304.0 | 524 | −18% |
| QAOA MaxCut | 64 | 1632 | 1836.0 | 2160 | −11% |
| Hardware-Eff. Ansatz | 16 | 16 | 0.0 | 52 | degenerate case |
| Hardware-Eff. Ansatz | 32 | 16 | 8.0 | 92 | +100% |
| Hardware-Eff. Ansatz | 64 | 16 | 16.0 | 188 | tie |
Randomized
| Workload | Qubits | Limbo cuts | Alt. cuts | Naïve cuts | Δ vs alt. |
|---|---|---|---|---|---|
| Random Clifford+T | 16 | 0 | 0.0 | 121 | tie |
| Random Clifford+T | 32 | 50 | 75.0 | 233 | −33% |
| Random Clifford+T | 64 | 209 | 227.0 | 437 | −8% |
| Sparse Random | 16 | 0 | 0.0 | 85 | tie |
| Sparse Random | 32 | 66 | 120.0 | 138 | −45% |
| Sparse Random | 64 | 203 | 241.0 | 330 | −16% |
Application-shaped
| Workload | Qubits | Limbo cuts | Alt. cuts | Naïve cuts | Δ vs alt. |
|---|---|---|---|---|---|
| Asymmetric Hotspot | 16 | 0 | 0.0 | 169 | tie |
| Asymmetric Hotspot | 32 | 127 | 174.0 | 327 | −27% |
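| Asymmetric Hotspot | 64 | 427 | 501.0 | 642 | −15% |
"Tie" cells produce identical cut counts. At 16 qubits the workload is small enough to fit on a single 20-qubit chip, so most ties are 0 vs 0 and carry no signal either way. The "degenerate case" label marks the one single-chip cell where the alternative reaches zero cuts and Limbo does not (Hardware-Eff. Ansatz at 16 qubits), so a relative Δ is undefined. Limbo's relative advantage is largest on the dense Fourier workloads, where it grows from about 40% at 32 qubits to about 70% at 64. On QAOA, the randomized workloads, and the hotspot workload Limbo still cuts less at both scales, but the relative margin narrows from 32 to 64 qubits. The hardware-efficient ansatz is the exception: Limbo holds 16 cuts at every scale, which trails the alternative at 16 and 32 qubits and ties it at 64.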
Why this matters in a multi-chip world. Every cut on this page is one cross-chip teleportation at runtime. Cross-chip communication is the slowest and most error-prone part of any modular quantum job. Cutting fewer edges means faster jobs, fewer mid-circuit errors, and lower cost. The advantage compounds as multi-chip hardware scales out.
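The Δ vs alt. column and the 19-of-21 headline can be re-derived from the raw cut counts in the workload tables. The sketch below is one way to do it; the labeling rules (equal counts are a tie, a zero alternative count with a nonzero Limbo count is a degenerate case) are inferred from the tables rather than taken from the benchmark code, and the function name is illustrative.

```python
def delta_label(limbo_cuts, alt_cuts):
    """Relative change in Limbo's cut count vs the alternative, as a table label."""
    if limbo_cuts == alt_cuts:
        return "tie"
    if alt_cuts == 0:
        return "degenerate case"  # relative change is undefined when dividing by zero
    pct = 100 * (limbo_cuts - alt_cuts) / alt_cuts
    return f"{pct:+.0f}%"

# (Limbo cuts, Alt. cuts) for all 21 cells, copied from the workload tables.
cells = [
    (0, 0), (330, 562), (910, 3008),    # QFT 16 / 32 / 64
    (0, 0), (350, 562), (1028, 3008),   # IQFT
    (0, 0), (248, 304), (1632, 1836),   # QAOA MaxCut
    (16, 0), (16, 8), (16, 16),         # Hardware-Eff. Ansatz
    (0, 0), (50, 75), (209, 227),       # Random Clifford+T
    (0, 0), (66, 120), (203, 241),      # Sparse Random
    (0, 0), (127, 174), (427, 501),     # Asymmetric Hotspot
]

matches_or_beats = sum(1 for limbo, alt in cells if limbo <= alt)
print(matches_or_beats, "of", len(cells))
print([delta_label(limbo, alt) for limbo, alt in cells[:3]])
```

The two cells where Limbo does not match or beat the alternative are both Hardware-Eff. Ansatz rows (16 and 32 qubits).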
Top partition wins
Five cells where Limbo cut the most communication volume vs the leading alternative. Shorter bars = fewer Bell pairs = lower cloud bill.
Versus TKET on multi-chip workloads
TKET (Quantinuum's flagship quantum compiler) is the strongest general-purpose alternative on the market. Its routing was designed for single-chip hardware and retrofitted for multi-chip layouts, so we restrict this comparison to workloads where multi-chip routing actually has to do work: we drop hardware-efficient ansatze (mostly nearest-neighbor) and n=16 (fits on one chip).
Apples-to-apples setup: both compilers see the same 4-chip × 20-qubit architecture with sparse inter-chip links. Cuts = cross-chip 2-qubit operations in each compiler's compiled output, including any SWAPs TKET inserts.
| Workload | Qubits | Limbo cuts | TKET cuts | Δ vs TKET | Limbo time (s) | TKET time (s) |
|---|---|---|---|---|---|---|
| QFT | 32 | 324 | 450 | −28% | 0.55 | 7.73 |
| QFT | 64 | 1028 | 1474 | −30% | 0.13 | 24.41 |
| IQFT | 32 | 334 | 450 | −26% | 0.06 | 8.09 |
| IQFT | 64 | 920 | 1474 | −38% | 0.13 | 23.74 |
| QAOA MaxCut | 32 | 248 | 344 | −28% | 0.04 | 5.47 |
| QAOA MaxCut | 64 | 1672 | 1944 | −14% | 0.2 | 25.77 |
| Random Clifford+T | 32 | 50 | 76 | −34% | 0.02 | 3.71 |
| Random Clifford+T | 64 | 204 | 411 | −50% | 0.04 | 9.16 |
| Sparse Random | 32 | 66 | 107 | −38% | 0.02 | 3.3 |
| Sparse Random | 64 | 202 | 371 | −46% | 0.03 | 10.19 |
| Asymmetric Hotspot | 32 | 127 | 152 | −16% | 0.02 | 4.79 |
| Asymmetric Hotspot | 64 | 429 | 668 | −36% | 0.04 | 14.1 |
What's happening. Limbo's hypergraph partitioner solves the right problem for multi-chip execution: it minimizes cross-chip 2-qubit operations directly. TKET solves a different problem (minimizing SWAPs against a coupling map) and is forced to spend cross-chip operations on routing. On dense Fourier workloads our gate-level pre-optimization pass closes the gap where TKET previously had an edge, and the topology-first partition strategy does the rest. Limbo compiles every cell in well under a second (0.02 to 0.55 s), while TKET takes between 3.3 and 25.8 s. TKET's compile time grows steeply with circuit size because of its more elaborate routing search.
Where the gains come from
Three independent components feed the final result: the partitioner that places qubits onto QPUs, the scheduler that sequences cross-chip events through limited port capacity, and the template cache that amortizes partitioning across variational iterations. Each one was measured separately so the attribution is clean.
Partitioner
Across the v0.2 matrix, Limbo's partitioner averages 71% fewer cuts per cell than the naïve baseline. The optimization is doing real, measurable work.
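One reading of the 71% figure that the workload tables reproduce is the mean per-cell reduction in cuts relative to the naïve baseline, 1 − Limbo / Naïve, averaged over all 21 cells. The sketch below computes it from the published counts; whether the headline number was produced with exactly this formula is an assumption.

```python
# (Limbo cuts, Naïve cuts) for all 21 cells, copied from the workload tables.
cells = [
    (0, 192), (330, 768), (910, 3072),    # QFT 16 / 32 / 64
    (0, 192), (350, 768), (1028, 3072),   # IQFT
    (0, 128), (248, 524), (1632, 2160),   # QAOA MaxCut
    (16, 52), (16, 92), (16, 188),        # Hardware-Eff. Ansatz
    (0, 121), (50, 233), (209, 437),      # Random Clifford+T
    (0, 85), (66, 138), (203, 330),       # Sparse Random
    (0, 169), (127, 327), (427, 642),     # Asymmetric Hotspot
]

# Fraction of the naïve baseline's cuts that Limbo avoids, per cell.
reductions = [1 - limbo / naive for limbo, naive in cells]
mean_reduction = sum(reductions) / len(reductions)
print(f"{100 * mean_reduction:.1f}%")  # 70.7%
```

Note that the 16-qubit cells where Limbo reaches zero cuts each contribute a full 100% reduction, so they pull the average up.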
Scheduler
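On the 64-qubit hotspot workload, the critical-path scheduler matches FIFO when ports are scarce and pulls ahead as capacity opens up: makespan 128766 vs 136090 for FIFO at 4 ports per QPU, and 110055 vs 120230 at 8 (see the scheduler sensitivity table). The gap widens with port capacity.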
Template cache
For parametric circuits, the SDK partitions once then streams parameter updates on every iteration. A 100-step VQE pays the compile cost 1× instead of 100×, amortizing every second of optimization over the whole campaign.
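The partition-once, stream-parameters idea can be sketched as a cache keyed on circuit structure with parameter values excluded from the key. Everything here is hypothetical: `TemplateCache`, `plan_for`, and `toy_partition` are illustrative names, not the Limbo SDK's API, and the stand-in partitioner only exists so the example runs.

```python
class TemplateCache:
    """Hypothetical cache: partition a parametric circuit once, reuse the plan."""

    def __init__(self, partition_fn):
        self._partition_fn = partition_fn
        self._plans = {}
        self.partition_calls = 0

    def plan_for(self, structure):
        # The key is the gate structure only; parameter values never enter it.
        if structure not in self._plans:
            self.partition_calls += 1
            self._plans[structure] = self._partition_fn(structure)
        return self._plans[structure]

def toy_partition(structure):
    """Stand-in partitioner: even qubits on chip 0, odd qubits on chip 1."""
    qubits = {q for _, gate_qubits, _ in structure for q in gate_qubits}
    return {q: q % 2 for q in sorted(qubits)}

# Structure as (gate name, qubits, symbolic parameter name or None).
ansatz = (("ry", (0,), "theta0"), ("ry", (1,), "theta1"), ("cx", (0, 1), None))

cache = TemplateCache(toy_partition)
for step in range(100):
    params = {"theta0": 0.01 * step, "theta1": 0.02 * step}
    plan = cache.plan_for(ansatz)  # same structure every step, so the plan is reused
    # A real runtime would bind `params` to the cached plan and submit here.

print(cache.partition_calls)  # 1
```

Keying on structure is what makes the 1× vs 100× claim work: the optimizer changes `params` on every step, but the cache key never changes.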
Real-hardware results: coming next
Cloud quantum hardware costs real money per shot, and we want our real-hardware claims backed by signed receipts, not extrapolations. The Limbo SDK already drives IBM Quantum, AWS Braket, and Azure Quantum. The integration tests in tests/integration/ exercise the authenticated submit and poll paths end-to-end against each provider's sandbox.
For our public v0.3 release we will publish, on at least two independent cloud backends:
- Measured wall-clock from submit() to results, including queue time.
- Per-job shot cost in vendor credits / USD, alongside the equivalent baseline cost.
- Output-state fidelity (or proxy fidelity) on simulator-shadowed runs.
- For variational workloads, energy gap from ideal across the full optimizer trajectory.
Until v0.3, the figures on this page are simulation-grounded compiler metrics: cut count, estimated makespan, and wall-clock compile time. Every number on this page is captured by a runnable benchmark; nothing is extrapolated from a different problem size.
Scheduler sensitivity: when does scheduling matter?
64-qubit hotspot workload on a 4-chip linear layout, simulated with realistic link-success probability and retry penalties. Sweeping port capacity shows where the critical-path scheduler earns its keep, and where it ties with simpler FIFO.
| Ports / QPU | FIFO makespan | Critical-path makespan | Δ vs FIFO | Regime |
|---|---|---|---|---|
| 1 | 295803 | 295403 | −0.1% | Contention-bound |
| 2 | 185997 | 185716 | −0.2% | Mixed |
| 4 | 136090 | 128766 | −5.4% | Slack opens |
| 8 | 120230 | 110055 | −8.5% | Local-depth bound |
Honest read: with very limited port capacity, both policies produce essentially the same schedule because every cycle is contention-bound. The critical-path scheduler's advantage appears as capacity opens up, peaking at about 8.5% off the makespan at 8 ports per QPU. The big win is consistency: in this sweep it never loses, and it wins where it can. Production hardware is moving toward higher port capacity, which is exactly where the scheduler advantage compounds.
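The contention-bound tie and the gain once slack appears can be seen in a toy list scheduler. This is a deliberately simplified model and not Limbo's scheduler: every event takes one time step, `ports` is a single global limit on events per step, and there is no link failure or retry. FIFO picks ready events by index; the critical-path policy picks the event with the longest remaining dependency chain first.

```python
def remaining_chain(succs):
    """Length of the longest dependency chain starting at each event."""
    memo = {}

    def depth(event):
        if event not in memo:
            memo[event] = 1 + max((depth(s) for s in succs[event]), default=0)
        return memo[event]

    return {event: depth(event) for event in succs}

def makespan(succs, ports, priority):
    """Steps needed when at most `ports` ready events run per step."""
    preds_left = {event: 0 for event in succs}
    for event in succs:
        for s in succs[event]:
            preds_left[s] += 1
    ready = [event for event, k in preds_left.items() if k == 0]
    steps = 0
    while ready:
        ready.sort(key=priority)
        running, ready = ready[:ports], ready[ports:]
        for event in running:
            for s in succs[event]:
                preds_left[s] -= 1
                if preds_left[s] == 0:
                    ready.append(s)  # becomes runnable from the next step
        steps += 1
    return steps

# Three independent events (0, 1, 2) plus one chain 3 -> 4 -> 5.
succs = {0: [], 1: [], 2: [], 3: [4], 4: [5], 5: []}
chain = remaining_chain(succs)

for ports in (1, 2):
    fifo = makespan(succs, ports, priority=lambda e: e)
    crit = makespan(succs, ports, priority=lambda e: (-chain[e], e))
    print(ports, fifo, crit)
```

With one port both policies need 6 steps, because the port is busy every step regardless of order. With two ports FIFO needs 4 steps and the critical-path policy needs 3, because it starts the long chain immediately instead of behind the independent events.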
Where Limbo fits
| Toolkit | Partitioner | Input frameworks | Live topology query | Variational streaming |
|---|---|---|---|---|
| Limbo | Mt-KaHyPar + critical-path scheduler | Qiskit, Cirq, OpenQASM 3 | Yes (IBM / AWS / Azure) | Yes |
| QuPort | TPCCAP local search + layered estimator | Qiskit only | Manual config | No |
| Qiskit transpile (single-chip) | SABRE layout + routing | Qiskit only | Yes (IBM only) | Partial (PUBs) |
| Naïve stripe partition | Stripe by qubit index | n/a | No | No |
Methods
Setup. Every cell uses 4 chips × 20 qubits per chip with a fixed random seed for reproducibility. The same circuit is handed to each compiler. Cut counts come from each compiler's own partition output. Makespan estimators use each compiler's native pass. Compile time is wall-clock through the full pipeline.
Baselines. Three compilers in the matrix: Limbo, the leading distributed alternative, and a naïve baseline that partitions by qubit index with no graph analysis at all. The naïve baseline is there so you can see the graph-aware compilers are doing useful work.
What we did not measure. No real-hardware execution, queue-time wall-clock, or shot-cost figures. These are deferred to v0.3 (see the real-hardware section above). Every figure on this page comes from a reproducible simulation pipeline.
Benchmark host. c6i.4xlarge EC2 (16 vCPU, 32 GB RAM), Python 3.11. The full v0.2 matrix completes in under 20 seconds wall-clock.