Benchmarks · v0.3 · multi-chip compiler

Optimized for multi-chip. Across every workload family.

Every benchmark on this page measures the metric that actually matters on multi-chip hardware: cross-chip communication. Fewer cuts mean fewer cross-chip teleportations and a lower cloud bill.

Limbo wins 24 of 24 cells against TKET (Quantinuum's flagship quantum compiler), across both fully-connected and constrained intra-chip layouts, with up to 78% fewer cross-chip operations and a 118× median compile speedup (up to 2735× in the best cell). Against QuPort, a recent research partitioner, it reduces cuts by up to 70% at 32 and 64 qubits, but ties or loses on the hardware-efficient ansatz. Versus a graph-blind naïve baseline, the average cut reduction rises to 71%.

Wins vs TKET

24 / 24

cells across both topology sweeps

Cut reduction

up to 78%

vs TKET

Compile speedup

118× median

up to 2735× on hardest cell

Workload families

Fourier · Variational · Random · App

Preprint · Zenodo · May 2026

Limbo: A Multi-Chip-First Quantum Compiler Unifying Hypergraph Partitioning with Topological Cross-Chip Routing

Jaymin Ding · Princeton University · doi:10.5281/zenodo.20442129

We present Limbo, a quantum multi-chip-first compiler that combines graph-partitioning and routing-first approaches to compilation. On a 24-cell benchmark against TKET, Limbo outperforms in every cell. On the hardest cell, it reduces cross-chip 2-qubit operations by 78% and compile time by a factor of 2735. The pipeline analyzes the structural topology of the input interaction graph and selects a matching cloud vendor, then partitions qubits across chips using multi-level hypergraph partitioning, sequences residual cross-chip operations through an event scheduler, and transpiles each sub-circuit.


arXiv mirror pending endorsement; the Zenodo record is the stable archival copy.

Versus TKET on multi-chip workloads

TKET (Quantinuum's flagship quantum compiler) is the most capable general-purpose comparison point we have access to. Its routing was designed for single-chip hardware and retrofitted for multi-chip layouts, so we restrict this comparison to workloads where multi-chip routing actually has to do work: we drop hardware-efficient ansatze (mostly nearest-neighbor) and n=16 (fits on one chip).

Apples-to-apples setup. Both compilers do a full end-to-end compile: TKET routes the whole circuit against the multi-chip Architecture (placement + SWAP-insertion + peephole); Limbo pre-optimizes, partitions across chips, then transpiles every per-chip sub-circuit against the intra-chip coupling map (native basis gates, optimization_level=3). Cuts = cross-chip 2-qubit operations in each compiler's final physical output, including any SWAPs.
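As a concrete reading of the cut definition above, the sketch below counts cross-chip 2-qubit operations in a compiled gate list. The gate-list format, the `chip_of` mapping, and the function name are illustrative assumptions, not Limbo's or TKET's actual data structures. A cross-chip SWAP is counted as one operation here; a real metric might first decompose it into native gates.

```python
from typing import Dict, List, Tuple

Gate = Tuple[str, Tuple[int, ...]]  # (gate name, physical qubits it acts on)


def count_cuts(gates: List[Gate], chip_of: Dict[int, int]) -> int:
    """Count 2-qubit operations whose operands sit on different chips."""
    cuts = 0
    for _name, qubits in gates:
        if len(qubits) != 2:
            continue  # 1-qubit gates never cross a chip boundary
        a, b = qubits
        if chip_of[a] != chip_of[b]:
            cuts += 1  # SWAPs count too: any cross-chip 2q op is a cut
    return cuts


# Hypothetical 2-chip device: physical qubits 0-1 on chip 0, qubits 2-3 on chip 1.
chip_of = {0: 0, 1: 0, 2: 1, 3: 1}
compiled = [
    ("h", (0,)),
    ("cx", (0, 1)),    # intra-chip
    ("cx", (1, 2)),    # cross-chip
    ("swap", (1, 2)),  # cross-chip SWAP inserted by routing
    ("cz", (2, 3)),    # intra-chip
]
print(count_cuts(compiled, chip_of))  # 2
```

Counting on the final physical output, rather than on the logical circuit, is what lets routing-inserted SWAPs show up in the metric.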

We run the matrix under two intra-chip topologies so you can see whether the win depends on which type of physical chip is underneath. It doesn't.

Fully-connected chips

Intra-chip topology: every qubit can talk to every other on the same chip. No SWAPs needed within a chip; only cross-chip routing has real cost. This matches the model Limbo's partitioner is built on, and the one most trapped-ion architectures actually deliver.

Limbo wins

12 / 12

cells

Avg cut reduction

35%

across the matrix

Best-case reduction

58%

single-cell maximum

Workload	Qubits	Limbo cuts	TKET cuts	Δ vs TKET	Limbo time (s)	TKET time (s)	Speedup
QFT	32	324	472	−31%	0.9	15.54	17×
QFT	64	1028	1482	−31%	0.78	44.69	58×
IQFT	32	334	450	−26%	0.38	15.22	40×
IQFT	64	920	1474	−38%	0.43	44.13	102×
QAOA MaxCut	32	248	344	−28%	0.35	10.66	31×
QAOA MaxCut	64	1672	1944	−14%	0.33	46.19	140×
Random Clifford+T	32	50	120	−58%	0.14	6.66	48×
Random Clifford+T	64	204	421	−52%	0.23	15.64	68×
Sparse Random	32	66	125	−47%	0.2	5.79	30×
Sparse Random	64	202	416	−51%	0.17	14.16	84×
Asymmetric Hotspot	32	127	148	−14%	0.24	8.59	35×
Asymmetric Hotspot	64	429	654	−34%	0.18	21.59	119×

Constrained chips (2-D nearest-neighbor)

Intra-chip topology: 4×5 nearest-neighbor grid. Both compilers now have to insert real SWAPs even within a chip. This is the strictest realistic apples-to-apples comparison: the layout superconducting hardware sees.

Limbo wins

12 / 12

cells

Avg cut reduction

47%

across the matrix

Best-case reduction

78%

single-cell maximum

Workload	Qubits	Limbo cuts	TKET cuts	Δ vs TKET	Limbo time (s)	TKET time (s)	Speedup
QFT	32	324	500	−35%	0.13	17.64	140×
QFT	64	1028	1558	−34%	0.28	53.63	192×
IQFT	32	326	538	−39%	0.13	18.95	143×
IQFT	64	904	1614	−44%	0.32	55.34	173×
QAOA MaxCut	32	224	424	−47%	0.11	13.63	126×
QAOA MaxCut	64	1676	1980	−15%	0.31	514.48	1654×
Random Clifford+T	32	50	228	−78%	0.1	8.43	83×
Random Clifford+T	64	211	566	−63%	0.17	24.11	144×
Sparse Random	32	66	204	−68%	0.07	7.91	117×
Sparse Random	64	208	441	−53%	0.1	275.93	2735×
Asymmetric Hotspot	32	126	230	−45%	0.11	13.78	130×
Asymmetric Hotspot	64	427	776	−45%	0.2	112.6	563×
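The derived columns and headline figures can be re-checked from the two tables above. This is a minimal sketch: the helper name is ours, and the speedup list is copied by hand from the published Speedup columns (which are themselves rounded).

```python
from statistics import median


def cut_reduction_pct(limbo_cuts: int, baseline_cuts: int) -> float:
    """Relative change in cuts vs the baseline, in percent (negative = fewer cuts)."""
    return 100.0 * (limbo_cuts - baseline_cuts) / baseline_cuts


# Speedup column copied from the two tables above (fully-connected, then constrained).
speedups = [
    17, 58, 40, 102, 31, 140, 48, 68, 30, 84, 35, 119,
    140, 192, 143, 173, 126, 1654, 83, 144, 117, 2735, 130, 563,
]

print(round(cut_reduction_pct(50, 228)))  # -78 (Random Clifford+T, n=32, constrained)
print(median(speedups))                   # 118.0
print(max(speedups))                      # 2735
```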

What's happening. Limbo's hypergraph partitioner solves the right problem for multi-chip execution (minimize cross-chip 2-qubit operations directly) and then hands each per-chip sub-circuit to Qiskit's routing pass as a small, dense intra-chip problem. TKET solves a different problem (one big SWAP-minimization against the full multi-chip coupling map) and incurs cross-chip operations as a side effect of routing. The win in cuts is structural and survives the strictest intra-chip topology. On the harder constrained-grid layout it actually widens (average reduction 47% vs 35%), because TKET cascades extra inter-chip SWAPs whenever intra-chip routing tightens. The win in compile time is also structural: partitioning N qubits into k sub-problems gives the routing pass k smaller graphs to search, not one bigger one.

Output fidelity under multi-chip noise

The cut-count comparison above shows Limbo emits fewer cross-chip operations than TKET. This section tests the causal claim: fewer cross-chip ops → higher output-state fidelity under realistic multi-chip noise. Both compilers' physical outputs run through the same Aer noise simulator. We measure Hellinger fidelity between the noisy measurement distribution and the noiseless ideal.
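For readers unfamiliar with the metric, here is a minimal pure-Python sketch of Hellinger fidelity between two measurement-count dictionaries. It assumes the convention F = (Σᵢ √(pᵢ qᵢ))², which is the one Qiskit's `hellinger_fidelity` uses; the page does not state which convention it follows, and the counts below are made up for illustration.

```python
from math import sqrt
from typing import Dict


def hellinger_fidelity(counts_p: Dict[str, int], counts_q: Dict[str, int]) -> float:
    """(sum_i sqrt(p_i * q_i))**2 for two count dictionaries."""
    total_p = sum(counts_p.values())
    total_q = sum(counts_q.values())
    overlap = 0.0
    for bitstring, n_p in counts_p.items():
        n_q = counts_q.get(bitstring, 0)  # outcomes missing from q contribute 0
        overlap += sqrt((n_p / total_p) * (n_q / total_q))
    return overlap ** 2


ideal = {"00": 4096, "11": 4096}                        # noiseless Bell-state counts
noisy = {"00": 3900, "11": 3880, "01": 210, "10": 202}  # hypothetical noisy counts
print(round(hellinger_fidelity(ideal, noisy), 4))
```

The value is 1.0 for identical distributions and 0.0 for distributions with disjoint support, so leakage of probability into wrong bitstrings lowers it.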

Noise model. Intra-chip 2q gates at 0.005 depolarizing error (≈ IBM Heron level); cross-chip 2q ops at 0.050 (10× worse, representative of photonic-EPR cross-chip links on real modular hardware). 1q gates and readout at small uniform rates. Both compilers receive the same noise budget on every qubit pair, so the only architectural variable is how many cross-chip ops they each emit.
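A back-of-envelope sketch of why cross-chip operations dominate under this noise model. It treats each depolarizing rate as an independent per-gate failure probability, which is a deliberate simplification of what the simulator does, and the gate counts are hypothetical.

```python
from math import log

P_INTRA = 0.005  # intra-chip 2q depolarizing error (from the noise model above)
P_CROSS = 0.050  # cross-chip 2q depolarizing error (from the noise model above)


def no_error_probability(n_intra: int, n_cross: int) -> float:
    """Probability that no 2q gate faults, assuming independent errors."""
    return (1 - P_INTRA) ** n_intra * (1 - P_CROSS) ** n_cross


# How many intra-chip gates cost as much as one cross-chip op under this model?
equivalent_intra = log(1 - P_CROSS) / log(1 - P_INTRA)
print(round(equivalent_intra, 1))  # 10.2

# Hypothetical circuit: the same 100 intra-chip gates, with 20 vs 40 cross-chip ops.
print(round(no_error_probability(100, 20), 3))
print(round(no_error_probability(100, 40), 3))
```

Under these assumptions, removing one cross-chip operation is worth roughly ten intra-chip gates, which is why the cut count is the quantity to minimize.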

Scale. n=12 (4 chips × 3) and n=16 (4 × 4), small enough for full statevector noise simulation on a laptop. 8192 shots per cell.

Limbo fidelity wins

8 / 8

informative cells (see note 1)

Best fidelity gain

+6.66 pp

Sparse Random (n=12)

Avg fidelity gain

+2.54 pp

all cells (incl. uninformative)

Workload	n	Limbo fidelity	TKET fidelity	Δ (pp)	Limbo 2q ops	TKET 2q ops	Notes
QFT	12	0.8067	0.805	+0.16	132	159	uniform ideal (¹)
IQFT	12	0.8021	0.8054	−0.33	132	159	uniform ideal (¹)
QAOA MaxCut	12	0.6615	0.6474	+1.42	112	136
Random Clifford+T	12	0.4774	0.4312	+4.61	63	84
Sparse Random	12	0.6358	0.5692	+6.66	48	88
Asymmetric Hotspot	12	0.1647	0.1056	+5.91	89	134
QFT	16	0.1165	0.116	tie	240	282	uniform ideal (¹) near noise floor (²)
IQFT	16	0.1161	0.116	tie	240	279	uniform ideal (¹) near noise floor (²)
QAOA MaxCut	16	0.1214	0.1146	+0.69	172	205	near noise floor (²)
Random Clifford+T	16	0.1798	0.1315	+4.83	96	137	near noise floor (²)
Sparse Random	16	0.127	0.1147	+1.23	69	130	near noise floor (²)
Asymmetric Hotspot	16	0.0765	0.0242	+5.23	113	168	near noise floor (²)

(¹) Uniform-ideal cells. QFT|0⟩ produces a uniform superposition, so the ideal output distribution is already uniform across all 2ⁿ bitstrings. Hellinger fidelity to "noisy roughly-uniform" output stays high regardless of which compiler produced it. These cells are kept in the matrix for completeness but they aren't where the architectural claim is testable; we exclude them from the headline win count.

(²) Noise-floor cells. At n=16 with this noise budget, both compilers approach the floor (~0.1 fidelity) because the absolute number of cross-chip ops saturates the photonic-link error budget. Limbo still wins these cells (Hotspot n=16 is 3× the TKET fidelity), but the absolute numbers are small, so we annotate rather than hide them.

The architectural claim, validated. Across every cell where the metric is informative, Limbo's partition-first approach produces output state distributions that are measurably closer to the ideal than TKET's routing-first approach. Best-case improvement: +6.66 percentage points on Sparse Random (n=12). Average over the full matrix: +2.54 pp, including the QFT/IQFT cells where the metric is structurally insensitive.

Broader coverage versus a recent research partitioner

QuPort is a recent research-grade partitioner (TPCCAP local search + layered makespan estimator). We run a wider matrix here (seven workload generators across three scales: 16, 32, and 64 logical qubits) on the same 4-QPU / cap-20 hardware envelope, to show that Limbo's partition advantage holds across more circuit shapes, not just the multi-chip-focused subset we run TKET on. Cut count = cross-QPU 2-qubit operations = Bell-pair teleportations the runtime must execute. Lower is better.

Fourier

Workload	Qubits	Limbo cuts	QuPort cuts	Naïve cuts	Δ vs QuPort
QFT	16	0	0	192	tie
QFT	32	330	562	768	−41%
QFT	64	910	3008	3072	−70%
IQFT	16	0	0	192	tie
IQFT	32	350	562	768	−38%
IQFT	64	1028	3008	3072	−66%

Variational

Workload	Qubits	Limbo cuts	QuPort cuts	Naïve cuts	Δ vs QuPort
QAOA MaxCut	16	0	0	128	tie
QAOA MaxCut	32	248	304	524	−18%
QAOA MaxCut	64	1632	1836	2160	−11%
Hardware-Eff. Ansatz	16	16	0	52	degenerate case
Hardware-Eff. Ansatz	32	16	8	92	+100%
Hardware-Eff. Ansatz	64	16	16	188	tie

Randomized

Workload	Qubits	Limbo cuts	QuPort cuts	Naïve cuts	Δ vs QuPort
Random Clifford+T	16	0	0	121	tie
Random Clifford+T	32	50	75	233	−33%
Random Clifford+T	64	209	227	437	−8%
Sparse Random	16	0	0	85	tie
Sparse Random	32	66	120	138	−45%
Sparse Random	64	203	241	330	−16%

Application-shaped

Workload	Qubits	Limbo cuts	QuPort cuts	Naïve cuts	Δ vs QuPort
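Asymmetric Hotspot	16	0	0	169	tie
Asymmetric Hotspot	32	127	174	327	−27%
Asymmetric Hotspot	64	427	501	642	−15%

"Degenerate" cells are workloads small enough to fit on a single chip, so the cut count is zero for one or both compilers and the comparison carries little signal. "Tie" cells produce identical cut counts. At 32 and 64 qubits Limbo emits fewer cuts than QuPort on every workload except the hardware-efficient ansatz, whose mostly nearest-neighbor structure leaves little to optimize (QuPort wins at n=32 and ties at n=64). The relative advantage is largest on the dense Fourier workloads, where it grows with circuit size (QFT: −41% at n=32, −70% at n=64); on the sparser workloads the relative gap narrows at n=64, although Limbo still leads.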

Why this matters in a multi-chip world. Every cut on this page is one cross-chip teleportation at runtime. Cross-chip communication is the slowest and most error-prone part of any modular quantum job. Cutting fewer edges means faster jobs, fewer mid-circuit errors, and lower cost. The advantage compounds as multi-chip hardware scales out.

Top partition wins vs QuPort

Five cells with the largest relative cut reduction vs QuPort on the broader matrix. Fewer cuts = fewer Bell pairs = lower cloud bill.

Workload	Qubits	Limbo cuts	QuPort cuts	Δ vs QuPort
QFT	64	910	3008	−70%
IQFT	64	1028	3008	−66%
Sparse Random	32	66	120	−45%
QFT	32	330	562	−41%
IQFT	32	350	562	−38%

Where the gains come from

Three independent components feed the final result: the partitioner that places qubits onto QPUs, the scheduler that sequences cross-chip events through limited port capacity, and the template cache that amortizes partitioning across variational iterations. Each one was measured separately so the attribution is clean.

Partitioner

Against the naïve baseline, Limbo's partitioner emits 71% fewer cuts on average across the matrix. The optimization is doing real, measurable work.
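One reading that reproduces the 71% figure from the tables on this page is the mean per-cell cut reduction of Limbo relative to the naïve baseline, taken over all 21 cells of the broader matrix (including the n=16 cells where Limbo reaches zero cuts). This is an inference from the published columns, not a documented definition.

```python
# (Limbo cuts, naive cuts) per cell, copied from the four broader-coverage tables.
cells = [
    (0, 192), (330, 768), (910, 3072),    # QFT at 16 / 32 / 64 qubits
    (0, 192), (350, 768), (1028, 3072),   # IQFT
    (0, 128), (248, 524), (1632, 2160),   # QAOA MaxCut
    (16, 52), (16, 92), (16, 188),        # Hardware-Eff. Ansatz
    (0, 121), (50, 233), (209, 437),      # Random Clifford+T
    (0, 85), (66, 138), (203, 330),       # Sparse Random
    (0, 169), (127, 327), (427, 642),     # Asymmetric Hotspot
]

# Fraction of the naive baseline's cuts that Limbo avoids, per cell.
reductions = [1 - limbo / naive for limbo, naive in cells]
mean_reduction = sum(reductions) / len(reductions)
print(f"{mean_reduction:.0%}")  # 71%
```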

Scheduler

In the port-capacity sweep on the 64-qubit hotspot workload (see the scheduler-sensitivity table below), the critical-path scheduler finishes the same job at makespan 110055 vs 120230 for FIFO at 8 ports per QPU. The gap widens with port capacity.

Template cache

For parametric circuits, the SDK partitions once then streams parameter updates on every iteration. A 100-step VQE pays the compile cost 1× instead of 100×, amortizing every second of optimization over the whole campaign.
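The amortization idea can be sketched as a cache keyed on circuit structure rather than parameter values. Everything here (`TemplateCache`, the `Structure` tuple, the stand-in stripe partitioner) is hypothetical and only illustrates the pattern; it is not the SDK's API.

```python
from typing import Dict, List, Tuple

Structure = Tuple[Tuple[str, int, int], ...]  # gate skeleton without parameter values


class TemplateCache:
    """Cache the expensive partition step per circuit structure."""

    def __init__(self) -> None:
        self._plans: Dict[Structure, Dict[int, int]] = {}
        self.partition_calls = 0

    def _partition(self, structure: Structure) -> Dict[int, int]:
        # Stand-in for the real partitioner: stripe qubits across 2 chips.
        self.partition_calls += 1
        qubits = sorted({q for _, a, b in structure for q in (a, b)})
        half = (len(qubits) + 1) // 2
        return {q: (0 if i < half else 1) for i, q in enumerate(qubits)}

    def compile(self, structure: Structure,
                params: List[float]) -> Tuple[Dict[int, int], List[float]]:
        if structure not in self._plans:
            self._plans[structure] = self._partition(structure)
        # Binding parameters is cheap: the qubit-to-chip plan is reused as-is.
        return self._plans[structure], list(params)


ansatz: Structure = (("rzz", 0, 1), ("rzz", 1, 2), ("rzz", 2, 3))
cache = TemplateCache()
for step in range(100):  # 100-step VQE loop
    plan, bound = cache.compile(ansatz, [0.01 * step] * len(ansatz))
print(cache.partition_calls)  # 1
```

Because the key ignores parameter values, every iteration after the first skips partitioning entirely.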

Real hardware: pipeline validated on AWS Braket

Limbo compiles, submits to, and retrieves results from real quantum hardware. We've executed the following workloads on AWS Braket's Rigetti Cepheus-1-108Q superconducting QPU. Real silicon, real photons, real measurement counts.

Workload	Qubits	Device	Shots	Hellinger fidelity	Status
Sparse Random	12	Cepheus-1-108Q	256	0.0351	completed
Asymmetric Hotspot	12	Cepheus-1-108Q	256	0.0041	completed

How to read these numbers. The Hellinger fidelities above are low because both workloads (12 logical qubits, ~100+ native gates after lattice routing on Cepheus) sit at current NISQ hardware's noise floor. Two-qubit gates on hardware of this class typically carry error rates on the order of 1–2%, and 100+ of them in one circuit drive the output toward a near-uniform distribution regardless of which compiler produced the program. This isn't a Limbo-specific outcome: any compiler running circuits of this depth on current hardware would see the same. What these receipts do show is that the entire Limbo pipeline (compile, partition, transpile, submit, poll, retrieve) works end-to-end on real cloud-accessible quantum hardware.

Single-qubit-noise-aware routing on real hardware

For real-hardware targets, Limbo's compile pipeline pulls live per-qubit calibration data from the device and feeds it into Qiskit's SABRE routing pass as an instruction-weighted Target. SABRE then picks a layout biased toward the highest-fidelity qubits, avoiding qubits the device reports as broken or uncalibrated.

Validated against Cepheus-1-108Q's live calibration: 102 per-qubit randomized-benchmarking fidelities consumed, 5 qubits with fidelity = 0.5 (broken / uncalibrated) skipped during layout selection. Rigetti doesn't publish per-edge 2-qubit fidelities or per-qubit readout fidelities for Cepheus today, so SABRE uses uniform defaults for those classes. This means single-qubit routing is genuinely calibration-driven; two-qubit routing remains uniform-cost until Rigetti exposes the data. Compile metrics surface noise_aware_routing: true on every job hitting Cepheus.
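The filtering step described above can be illustrated with a small sketch. The function name and the calibration snapshot are hypothetical, and the sketch ignores device connectivity, which a real layout pass must also respect.

```python
from typing import Dict, List

BROKEN_FIDELITY = 0.5  # value the page reports for broken / uncalibrated qubits


def usable_qubits(rb_fidelity: Dict[int, float], needed: int) -> List[int]:
    """Drop broken qubits, then return the `needed` best ones by RB fidelity."""
    healthy = {q: f for q, f in rb_fidelity.items() if f > BROKEN_FIDELITY}
    if len(healthy) < needed:
        raise ValueError("not enough calibrated qubits for this circuit")
    ranked = sorted(healthy, key=lambda q: healthy[q], reverse=True)
    return sorted(ranked[:needed])


# Hypothetical calibration snapshot (not real Cepheus data).
calibration = {0: 0.9991, 1: 0.5, 2: 0.9978, 3: 0.9985, 4: 0.5, 5: 0.9969}
print(usable_qubits(calibration, 3))  # [0, 2, 3]
```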

Why we can't validate the multi-chip claim on real hardware yet

The architectural claim Limbo makes (that partition-first compilation reduces cross-chip operations, which on real modular hardware reduces noise and improves output fidelity) is fundamentally a claim about multi-chip topologies. To test it on hardware, you need real multi-chip hardware: independent chips connected by a (lossy) photonic interconnect, where cross-chip gates carry a meaningfully different error rate than intra-chip gates.

That hardware doesn't exist on a public cloud provider today. The closest candidates are Quantinuum's H-series (multiple ion-trap zones acting as separate compute regions, not yet quite the photonic-link model Limbo is built around) and IBM's Quantum System Two prototype with multi-chip Heron interconnects (not yet on IBM Cloud). Until one of those lands, the multi-chip architectural claim has to live in the simulator-shadowed fidelity matrix (the "Output fidelity under multi-chip noise" section above), where we model cross-chip operations at a representative ~5% error rate against ~0.5% for intra-chip gates. The Cepheus-1-108Q runs above validate the pipeline; the simulator runs validate the architecture.

Nothing on this page is extrapolated from a different problem size. The cut-count and compile-time figures are compiler metrics from runnable benchmarks; the fidelity figures are noise-shadowed simulator metrics from another runnable benchmark; the hardware figures are real Braket task results.

Scheduler sensitivity: when does scheduling matter?

64-qubit hotspot workload on a 4-chip linear layout, simulated with realistic link-success probability and retry penalties. Sweeping port capacity shows where the critical-path scheduler earns its keep, and where it ties with simpler FIFO.

Ports / QPU	FIFO makespan	Critical-Path makespan	FIFO ÷ Critical-Path	Regime
1	295803	295403	100%	Contention-bound
2	185997	185716	100%	Mixed
4	136090	128766	106%	Slack opens
8	120230	110055	109%	Local-depth bound

Honest read: with very limited port capacity, both policies produce essentially the same schedule because every cycle is contention-bound. The critical-path scheduler's advantage appears as capacity opens up, peaking at about 8.5% off the FIFO makespan at 8 ports per QPU (120230 → 110055). The big win is consistency: it never significantly loses, and wins where it can. Production hardware is moving toward higher port capacity, which is exactly where the scheduler advantage compounds.
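A toy list scheduler shows the qualitative effect: with one port both policies are contention-bound and tie, while with a second port the critical-path policy starts the long chain earlier and finishes sooner. This is a simplified sketch with unit-length events, a single global port budget, and no link failures or retries; it is not Limbo's scheduler.

```python
from typing import Dict, List


def remaining_path_length(deps: Dict[int, List[int]]) -> Dict[int, int]:
    """Length of the longest dependency chain that starts at each event."""
    succs: Dict[int, List[int]] = {e: [] for e in deps}
    for event, preds in deps.items():
        for p in preds:
            succs[p].append(event)
    rank: Dict[int, int] = {}

    def visit(e: int) -> int:
        if e not in rank:
            rank[e] = 1 + max((visit(s) for s in succs[e]), default=0)
        return rank[e]

    for e in deps:
        visit(e)
    return rank


def makespan(deps: Dict[int, List[int]], ports: int, policy: str) -> int:
    """Time steps needed when at most `ports` unit-length events run per step."""
    rank = remaining_path_length(deps)
    done: set = set()
    steps = 0
    while len(done) < len(deps):
        ready = [e for e in sorted(deps)
                 if e not in done and all(p in done for p in deps[e])]
        if not ready:
            raise ValueError("dependency cycle")
        if policy == "critical_path":
            ready.sort(key=lambda e: -rank[e])  # longest remaining chain first
        done.update(ready[:ports])              # FIFO keeps event-id order
        steps += 1
    return steps


# Toy job: three independent events (0-2) plus one chain 3 -> 4 -> 5 -> 6.
deps = {0: [], 1: [], 2: [], 3: [], 4: [3], 5: [4], 6: [5]}
for ports in (1, 2):
    print(ports, makespan(deps, ports, "fifo"), makespan(deps, ports, "critical_path"))
# 1 7 7
# 2 5 4
```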

Where Limbo fits

Toolkit	Partitioner	Input frameworks	Live topology query	Variational streaming
Limbo	Mt-KaHyPar + critical-path scheduler	Qiskit, Cirq, OpenQASM-3	Yes (IBM / AWS / Azure)	Yes
QuPort	TPCCAP local search + layered estimator	Qiskit only	Manual config	No
Qiskit transpile (single-chip)	SABRE layout + routing	Qiskit only	Yes (IBM only)	Partial (PUBs)
Naïve stripe partition	Stripe by qubit index	n/a	No	No

Methods

Setup. Every compiler-benchmark cell uses 4 chips × 20 qubits per chip with a fixed random seed for reproducibility; the fidelity matrix uses the smaller 4 × 3 and 4 × 4 layouts described in its own section. The same circuit is handed to each compiler. Cut counts come from each compiler's own partition output. Makespan estimators use each compiler's native pass. Compile time is wall-clock through the full pipeline.
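A seeded wall-clock measurement of the kind described above might look like the following sketch. `timed_compile` and `dummy_compile` are hypothetical names; the dummy function merely stands in for a compiler entry point.

```python
import random
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_compile(compile_fn: Callable[[int], T], seed: int) -> Tuple[T, float]:
    """Run one compile with a fixed seed and return (result, wall-clock seconds)."""
    random.seed(seed)  # pin any stdlib randomness the compiler relies on
    start = time.perf_counter()
    result = compile_fn(seed)
    elapsed = time.perf_counter() - start
    return result, elapsed


def dummy_compile(seed: int) -> int:
    # Hypothetical stand-in for a compiler; returns a fake, seed-dependent cut count.
    rng = random.Random(seed)
    return sum(rng.randint(0, 1) for _ in range(1000))


cuts_a, secs_a = timed_compile(dummy_compile, seed=7)
cuts_b, secs_b = timed_compile(dummy_compile, seed=7)
print(cuts_a == cuts_b)  # True: same seed, same result
```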

Baselines. Three compilers in the broader matrix: Limbo, QuPort (the leading distributed alternative), and a naïve baseline that partitions by qubit index with no graph analysis at all. TKET is benchmarked separately in the multi-chip comparison above. The naïve baseline is there so you can see the graph-aware compilers are doing useful work.
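To make the naïve baseline concrete, the sketch below stripes qubits across chips by index and compares its cut count with the best balanced split found by brute force on a toy interaction list. The brute-force search only stands in for a graph-aware partitioner on tiny inputs; function names and the toy edges are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def stripe_partition(n_qubits: int, n_chips: int) -> Dict[int, int]:
    """Naive baseline: assign qubits to chips in contiguous index blocks."""
    per_chip = -(-n_qubits // n_chips)  # ceiling division
    return {q: q // per_chip for q in range(n_qubits)}


def cuts(edges: List[Edge], chip_of: Dict[int, int]) -> int:
    """Number of 2-qubit interactions whose endpoints land on different chips."""
    return sum(1 for a, b in edges if chip_of[a] != chip_of[b])


def best_bipartition(n_qubits: int, edges: List[Edge]) -> Dict[int, int]:
    """Brute-force balanced 2-way split; only feasible for tiny circuits."""
    best, best_cuts = None, None
    for chip0 in combinations(range(n_qubits), n_qubits // 2):
        chip_of = {q: (0 if q in chip0 else 1) for q in range(n_qubits)}
        c = cuts(edges, chip_of)
        if best_cuts is None or c < best_cuts:
            best, best_cuts = chip_of, c
    return best


# Toy interaction list: qubit i mostly talks to qubit i + 4.
edges = [(0, 4), (1, 5), (2, 6), (3, 7), (0, 1), (4, 5), (2, 3), (6, 7)]
print(cuts(edges, stripe_partition(8, 2)))      # 4
print(cuts(edges, best_bipartition(8, edges)))  # 0
```

The stripe split ignores which qubits interact, so it cuts every long-range pair; a split that follows the interaction graph keeps all of them on one chip.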

What we did not measure. No queue-time wall-clock or shot-cost figures, and no real-hardware execution on multi-chip devices. The only hardware results are the two single-chip Cepheus-1-108Q pipeline runs in the real-hardware section above; every other figure on this page comes from a reproducible simulation pipeline.

Hardware envelope. c6i.4xlarge EC2 (16 vCPU, 32 GB RAM), Python 3.11. The full v0.2 matrix completes in under 20 seconds wall-clock.

Read the full paper

The complete preprint (including the noise model, the real-hardware appendix, ablation matrices, and the threats-to-validity section) is available from the Zenodo record at https://doi.org/10.5281/zenodo.20442129.


Cite this work

BibTeX entry for the Zenodo deposit:

@misc{ding2026limbo,
  author    = {Ding, Jaymin},
  title     = {{Limbo: A Multi-Chip-First Quantum Compiler Unifying
                Hypergraph Partitioning with Topological Cross-Chip
                Routing}},
  year      = {2026},
  month     = may,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20442129},
  url       = {https://doi.org/10.5281/zenodo.20442129},
}