As datacenter networks grow in complexity and AI training clusters reach scales of tens or hundreds of thousands of GPUs, the tools used to design and evaluate these networks have become a bottleneck. Our research explores techniques to increase the tractability of large-cluster performance estimation by multiple orders of magnitude, using ML (MimicNet, SIGCOMM '21), Data-Oriented Design principles (DONS, SIGCOMM '23), GPU acceleration (Multiverse, NSDI '25), and statistical techniques (CCEval, NSDI '26).
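
To give a flavor of the Data-Oriented Design principle DONS applies, the sketch below contrasts an array-of-structs packet layout with a struct-of-arrays one; this is a generic illustration, not DONS's actual code, and the type and field names (`PacketAoS`, `PacketsSoA`, `depart_ns`) are hypothetical. The point is that when a simulation pass touches only one field of millions of packets, a struct-of-arrays layout streams that field through cache contiguously instead of dragging every other field along with it.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Array-of-structs: fields of unrelated packets are interleaved in memory,
// so a pass that only reads `depart_ns` still pulls src/dst/size into cache.
struct PacketAoS {
    uint32_t src, dst;
    uint32_t size_bytes;
    uint64_t depart_ns;
};

// Struct-of-arrays: each field is its own contiguous array, so a per-field
// pass is a sequential scan that caches and vectorizes well.
struct PacketsSoA {
    std::vector<uint32_t> src, dst;
    std::vector<uint32_t> size_bytes;
    std::vector<uint64_t> depart_ns;
};

// Advance every packet's departure time by a fixed link delay.
// With the SoA layout this touches only the depart_ns array.
void add_link_delay(PacketsSoA& pkts, uint64_t delay_ns) {
    for (auto& t : pkts.depart_ns) t += delay_ns;
}

int main() {
    PacketsSoA pkts;
    for (uint32_t i = 0; i < 4; ++i) {
        pkts.src.push_back(i);
        pkts.dst.push_back(i + 1);
        pkts.size_bytes.push_back(1500);
        pkts.depart_ns.push_back(1000ull * i);
    }
    add_link_delay(pkts, 500);
    std::printf("packet 3 departs at %llu ns\n",
                (unsigned long long)pkts.depart_ns[3]);
    return 0;
}
```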