Scheduling, ILP, and a Bluespec Bug That Ate 64 GB of RAM

I have been studying architectural synthesis for a while now, the formal problem of mapping an algorithm, expressed as a data-flow graph, onto a circuit with area and timing constraints. Most of this started as background reading for wsa-gen, a personal project asking a simple question: can we describe an algorithm and let a compiler decide what to harden?

The theory is elegant. The practice, I learned, will occasionally eat your entire machine’s RAM and not give it back.

The Formal Problem

In high-level synthesis, the starting point is a sequence graph, a directed acyclic graph where nodes are operations and edges represent data dependencies. The scheduling problem assigns each operation to a control step (clock cycle) such that:

No operation executes before its inputs are ready (data dependency constraint)
The number of concurrently active operations respects the available functional units (resource constraint)
The overall latency meets the timing specification

Two classical algorithms bound the feasible scheduling space:

ASAP (As Soon As Possible). Assign each operation to the earliest control step allowed by its data dependencies. This gives a lower bound on latency, the critical path of the DAG. No resource constraints are applied; ASAP assumes unlimited functional units.

ALAP (As Late As Possible). Working backward from the timing constraint, assign each operation to the latest control step that still allows downstream dependents to meet their deadlines. ALAP gives an upper bound, the slack for each operation.

The difference between an operation’s ASAP and ALAP times is its mobility: the window of freedom the scheduler has. Operations with zero mobility are on the critical path and must be placed exactly.

In practice, you want to satisfy resource constraints while staying close to ASAP latency. This is where it gets hard: resource-constrained scheduling with minimized latency is NP-hard in general. The standard exact approach is Integer Linear Programming. A typical ILP formulation introduces a binary variable x_i, k indicating whether operation i is scheduled at step k, then minimizes overall latency subject to dependency and resource constraints. For small circuits this is fine. For large ones, the combinatorial explosion is savage.

Enter Bluespec and BVI

Bluespec SystemVerilog is a hardware description language with Haskell-derived semantics. It models hardware as a set of atomic rules, guarded transactions that fire when their preconditions are met. The compiler resolves scheduling conflicts between rules automatically, proving that concurrent rule executions remain coherent.

This is wonderful, until you need to wrap existing Verilog IP. Bluespec’s answer is the BVI (Bluespec Verilog Import) mechanism: you write a module that declares a Verilog interface in Bluespec types, and annotate each method with scheduling attributes (CF, SB, SBR, C…) telling the compiler what pairs of methods can fire simultaneously and in what order.

Internally, the BSC compiler uses those annotations to build a conflict matrix and then runs its ILP-based scheduler to verify that the annotations are consistent and to compute a global schedule for the design.

The Bug: 10,000 Signals, 64 GB RAM, 15 Hours

At InCore, we work with SoC generators that wire up large Verilog IP blocks into a coherent system. One such integration involved a Verilog module with roughly 10,000 interface signals. We knew there were no real scheduling conflicts, the signals were mostly orthogonal. The BVI annotations reflected this: everything marked CF (conflict-free).

Compilation started. Memory climbed. After a few hours it had consumed 64 GB. After 15 hours it had not finished. This was production CI, unacceptable.

Hypothesis. BSC is written in Haskell. The entire design elaborates into memory before code generation. The conflict resolution algorithm the compiler uses internally is roughly O(n²) in the number of annotated method pairs, and with 10,000 signals, that’s on the order of 10⁸ pairs. The ILP instance is enormous, the Haskell heap swells to match, and the solver grinds.

Fix: signal sharding. The key insight: if you split the monolithic BVI module into multiple smaller BVI modules, each covering a disjoint subset of signals, the ILP instances are independent. A 10,000-signal module becomes fifty 200-signal modules. The conflict matrices are tiny. BSC can resolve each independently and infer the composed schedule polymorphically.

After sharding, compilation time dropped to 8 minutes and peak memory usage to 4 GB. The generated RTL was identical.

What This Teaches About Compilers

The BVI case is a microcosm of a broader compiler design problem: scalable constraint solving. The BSC scheduler’s correctness is not in question, it produces the right answer given the right inputs. The issue is that the ILP formulation doesn’t scale to 10⁸ pairs.

Modern synthesis tools handle this through a combination of problem decomposition, SAT/SMT instead of pure ILP, and incremental solving. The theoretical complexity doesn’t change, but the practical tractability improves dramatically when the solver can exploit structure.

For wsa-gen, this is a concrete lesson about how the backend will have to work. A naïve global schedule for a large algorithm graph will be intractable. The compiler will need to partition the graph, perhaps using the sequence graph partitioning methods studied in architectural synthesis, and solve sub-problems independently before composing results. The challenge is proving that independently scheduled partitions compose correctly, which is exactly the kind of guarantee a typed IR can provide.

The Bigger Picture

There is a thread that connects all of this:

Architectural synthesis gives us the formal language for the scheduling problem
Bluespec gives us a proof-carrying intermediate representation where the scheduler can verify correctness by construction
The BVI bug shows what happens when that machinery meets the combinatorial reality of large designs
wsa-gen is an attempt to push this problem one level higher: starting not from RTL but from an algorithm description, and letting the compiler decide the boundary between hard and soft

The notes on scheduling algorithms and abstract synthesis models from my reading on this are in the archive if you want the formal treatment. This post is more about where that theory collides with production reality.