
A Walk Down Polyhedral Lane.


Combinatorial Optimization.

Combinatorial optimization seeks the optimal object from a finite collection of objects. This collection typically has a representation such as a graph or tree. Since the number of objects grows exponentially with the size of the representation, such as in the case of Hamiltonian circuits, scanning all objects one by one to select the best solution is infeasible at scale. Thus, efficient search methods are necessary.

In the 1960s, Jack Edmonds advocated for defining an efficient method as one whose running time is bounded by a polynomial in the size of its representation. Edmonds developed polynomial-time algorithms like the Blossom algorithm for the matching problem. Today, problems solvable in polynomial time are classified as belonging to the complexity class P. In the 1970s, Cook and Karp identified combinatorial optimization problems, such as the Travelling Salesman Problem, that are NP-complete. By definition, every problem in NP is reducible to each NP-complete problem. All NP-complete problems are equivalent in that the polynomial-time solvability of one implies the same for all. So far, most combinatorial optimization problems have been classified either as polynomial-time solvable or NP-complete, but none have been proven to be both. This raises the fundamental question: are these two properties disjoint or do they coincide? In other words: \[P\neq NP\quad\text{or}\quad P=NP\] I will focus on three key aspects of combinatorial optimization problems that are in class P:
  1. Polynomial-time solvability.
  2. Polyhedral representations.
  3. Min-Max relations.
Edmonds showed that these aspects are closely related. In many cases, a polynomial-time algorithm yields a description of an associated polyhedron by linear inequalities. Conversely, an appropriate description of a polyhedron by inequalities implies the polynomial-time solvability of an associated optimization problem via linear programming techniques. The duality theorem of linear programming then allows polyhedral representations to yield Min-Max relations. This field of discrete math is called polyhedral combinatorics.

Maximum-Size Matching.

Let $G = (V, E)$ be an undirected graph and let $w : E\to \mathbb{R}_{+}$. For any subset $F$ of $E$, let \[w(F) := \sum_{e\in F}w(e)\] We call $w(F)$ the weight of $F$. Now suppose you want to find a matching $M$, where a matching is a set of pairwise disjoint edges in $G$ (no two edges share a vertex), such that the weight $w(M)$ is maximized. In notation this is \[\max\{w(M)\mid M\text{ matching in } G\}\] This has an equivalent formulation: for any matching $M$, denote the indicator vector of $M$ in $\mathbb{R}^{E}$ by $\phi^M$, that is \[\phi^M(e) := \begin{cases} 1 \quad \text{if }e\in M,\\ 0 \quad \text{if }e \notin M \end{cases}\quad \text{ for } e\in E\] Since $w$ is a vector in $\mathbb{R}^{E}$, we have $w(M) = w^\intercal \phi^M$, letting us rewrite our maximization as \[\max\left\{w^\intercal\phi^M \mid M\text{ matching in } G\right\}\] This is equivalent to maximizing the linear function $w^\intercal x$ over a finite set of vectors, so the optimal value does not change if we maximize it over the convex hull of these vectors: \[\max\left\{w^\intercal x \mid x \in\text{conv.hull}\left\{\phi^M \mid M\text{ matching in } G\right\} \right\}\] The set \[\text{conv.hull} \{\phi^M\mid M\text{ matching in } G\}\] is a polytope in $\mathbb{R}^E$, called the matching polytope of $G$. There are two equivalent ways to define a polytope:
  1. V-representation: defines a polytope as the convex hull of a finite set of points, where this finite set must contain every extreme point (vertex) of the polytope.

  2. H-representation: defines a polytope as the bounded intersection of a finite number of half-spaces, which is the solution to a system of linear inequalities. These half-spaces are described by a matrix $A$ and a vector $b$ in the form: \[H = \{x \in\mathbb{R}^n : Ax\leq b\}\]
Because of the H-representation, there exist a matrix $A$ and a vector $b$ such that \[\text{conv.hull} \{\phi^M\mid M\text{ matching in } G\} = \{x\in\mathbb{R}^E\mid x\geq 0, Ax\leq b\}\] which allows us to say \[\max\left\{w^\intercal x \mid x \in\text{conv.hull}\left\{\phi^M \mid M\text{ matching in } G\right\} \right\} = \max\left\{w^\intercal x\mid x\geq 0, Ax\leq b\right\}\] In this way we translated our original problem $\max\left\{w(M)\mid M\text{ matching in } G\right\}$ into a linear programming problem that we can attack with standard techniques. The main issue is that we still need to find the matrix $A$ and the vector $b$, which we only know exist.
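For intuition, here is a tiny example of my own (not from any particular reference). Take the triangle $K_3$ with edge set $E = \{e_1, e_2, e_3\}$. Its matchings are $\emptyset$ and the three single edges, so its matching polytope is \[\text{conv.hull}\left\{\begin{pmatrix}0\\0\\0\end{pmatrix}, \begin{pmatrix}1\\0\\0\end{pmatrix}, \begin{pmatrix}0\\1\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}\right\} = \left\{x\in\mathbb{R}^3 \mid x\geq 0,\ x_1 + x_2 + x_3 \leq 1\right\}\] Keep this example in mind; it will come back when we look at nonbipartite graphs.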

Consider $G$ to be bipartite, then we know that the matching polytope of $G$ is equivalent to the set of all vectors $x$ satisfying \[\begin{align*} &x(e)\geq 0 &\text{for }e \in E\\ &\sum_{e \ni v}x(e)\leq 1 &\text{for }v\in V \end{align*}\] Note that the sum ranges over all edges $e$ containing $v$. Thus for $A$ we can take the $V\times E$ incidence matrix of $G$ and for $b$ the all-one vector $\textbf{1}$ in $\mathbb{R}^V$. We can show that the matching polytope for bipartite graphs is completely determined by the above inequalities in the following steps:
  1. Containment Of The Matching Polytope: The matching polytope is the convex hull of the characteristic vectors \( \phi^M \) of matchings \( M \) in the bipartite graph \( G \). Since each \( \phi^M \) satisfies the inequalities defined above, it follows that the matching polytope is contained within the polytope defined by these inequalities.

  2. Total Unimodularity: For a bipartite graph \( G \), the constraint matrix \( A \) associated with the inequalities (the vertex-edge incidence matrix) is totally unimodular. This property means that every square submatrix of \( A \) has a determinant that is either \( 0 \), \( 1 \), or \( -1 \), and it is the key structural fact behind the integrality result in the next step.

  3. Integer Vertices: Because \( A \) is totally unimodular and \( b = \textbf{1} \) is integral, every vertex of the polytope defined by the inequalities is an integer vector. In particular, the linear program over these inequalities always attains its optimum at an integer vector in \( \mathbb{Z}^E \), where \( E \) represents the edges of the bipartite graph.

  4. Characterization Of Integer Solutions: Every integer vector satisfying the inequalities is a \( 0 \)-\( 1 \) vector whose edges form a matching, i.e., it equals \( \phi^M \) for some matching \( M \) in \( G \). Combined with the previous step, every vertex of the polytope defined by the inequalities is the indicator vector of a matching, so this polytope is exactly the matching polytope of the bipartite graph \( G \).
We can now apply linear programming techniques to handle our original problem; specifically, we can find a maximum-weight matching in a bipartite graph in polynomial time with any polynomial-time linear programming algorithm. The duality theorem of linear programming also gives us \[ \begin{align*} \max\left\{w(M)\mid M\text{ matching in }G\right\} &= \max\left\{w^\intercal x \mid x \geq \textbf{0},Ax\leq \textbf{1}\right\}\\ &= \min\left\{y^\intercal\textbf{1}\mid y\geq \textbf{0}, y^\intercal A\geq w^\intercal\right\} \end{align*}\] If we take for $w$ the all-one vector $\textbf{1}$ in $\mathbb{R}^E$, we can derive Kőnig's theorem which states: the maximum size of a matching in a bipartite graph is equal to the minimum size of a vertex cover (a set of vertices intersecting each edge). We can see this in action directly. First look at the expression \[\max\left\{w(M)\mid M\text{ matching in }G\right\}\] and notice that with $w = \textbf{1}$ it is indeed the maximum size of a matching. Next, the minimum can be attained by an integer vector $y$ (from the total unimodularity of $A$), and this $y\in\mathbb{R}^V$ can be taken to be a $0$-$1$ vector, so it is the indicator vector $\phi^U$ of some subset $U\subseteq V$. Then $y^\intercal A \geq \textbf{1}^\intercal$ means every edge has an endpoint in $U$, i.e., $U$ is a vertex cover. Therefore the rightmost expression is equal to the minimum size of a vertex cover.
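As a quick sanity check, here is a toy example of my own. Take the bipartite path $a - b - c$ with edges $e_1 = \{a, b\}$ and $e_2 = \{b, c\}$. The incidence matrix and the two linear programs are \[A = \begin{pmatrix} 1 & 0\\ 1 & 1\\ 0 & 1 \end{pmatrix}, \qquad \max\{x_1 + x_2 \mid x\geq 0,\ Ax\leq \textbf{1}\} = 1 = \min\{y_a + y_b + y_c \mid y\geq 0,\ y^\intercal A\geq \textbf{1}^\intercal\}\] The primal optimum is attained at $x = (1, 0)$ (the matching $\{e_1\}$) and the dual optimum at $y = \phi^{\{b\}}$ (the vertex cover $\{b\}$), so the maximum matching size and the minimum vertex cover size are both $1$, exactly as Kőnig's theorem predicts.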

Kőnig's matching theorem is an example of a Min-Max formula that, as we've shown, can be derived from a polyhedral representation. We can also go the other way: a Min-Max formula in weighted form yields a polyhedral representation. This representation together with linear programming duality gives a definitive proof of optimality of a matching $M$. So if you ever need to convince someone that a certain matching $M$ has maximum size, it is sufficient (and always possible) to exhibit a vertex cover of size $|M|$. This is what yields a good characterization for the maximum-size matching problem in bipartite graphs.

What About Nonbipartite Graphs?

If $G$ is nonbipartite, the matching polytope is no longer determined by the inequalities \[\begin{align*} &x(e)\geq 0 &\text{for }e \in E\\ &\sum_{e \ni v}x(e)\leq 1 &\text{for }v\in V \end{align*}\] This is because if \( C \) is an odd circuit in the graph \( G \), we can define a vector \( x \in \mathbb{R}^E \) such that:
  1. \( x(e) = \frac{1}{2} \) for each edge \( e \) in the circuit \( C \).

  2. \( x(e) = 0 \) for edges not in \( C \).
This vector satisfies the inequalities, but it does not lie in the matching polytope: a matching can contain at most $\lfloor |C|/2\rfloor$ edges of the odd circuit $C$, so any convex combination of matching indicator vectors puts total weight at most $\lfloor |C|/2\rfloor$ on the edges of $C$, whereas this $x$ puts weight $|C|/2$ on them. To accurately describe the matching polytope of any graph, we need to add specific inequalities to the existing conditions:
  1. Non-Negativity: \[ x(e) \geq 0,\quad \forall e \in E \]

  2. Vertex Degree Constraints: \[ \sum_{e \ni v} x(e) \leq 1,\quad \forall v \in V \]
In addition to these, we need the following constraint for every odd-sized subset \( U \) of vertices \( V \): \[ \sum_{e \subseteq U} x(e) \leq \left\lfloor \frac{1}{2} |U| \right\rfloor \] This means that the total weight assigned to the edges with both endpoints in \( U \) must not exceed half the number of vertices in \( U \), rounded down. Since a matching has at most \( \left\lfloor \frac{1}{2}|U|\right\rfloor \) edges inside \( U \), every indicator vector \( \phi^M \) of a matching \( M \) satisfies this extra condition, so the matching polytope of \( G \) is contained in the polytope defined by this combination of inequalities. Edmonds proved that these inequalities in fact completely determine the matching polytope of any graph.
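To make this concrete with the triangle $K_3$ from earlier: the fractional vector $x = \left(\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2}\right)$ satisfies both the non-negativity and the vertex degree constraints (each vertex meets two edges, so each vertex sum is $1$), but for the odd set $U = V$ with $|U| = 3$ we get \[\sum_{e\subseteq U} x(e) = \tfrac{3}{2} > 1 = \left\lfloor \tfrac{1}{2}|U|\right\rfloor\] so the odd-set inequality cuts this point off, as it must, since we saw that the matching polytope of $K_3$ is $\{x\geq 0,\ x_1 + x_2 + x_3\leq 1\}$.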

One can derive from this representation the polynomial-time solvability of the weighted matching problem, with the ellipsoid method. However, it suffices to have a polynomial-time algorithm that simply answers the following question: \[\text{given } x\in\mathbb{R}^E\text{, does $x$ belong to the matching polytope of $G$?}\] Such an algorithm does exist, and note that while any single inequality can be checked in time bounded by a polynomial in $|V|$, $|E|$, and the size of $x$, the method must obviously avoid testing all the (exponentially many) inequalities \(\sum_{e \subseteq U} x(e) \leq \left\lfloor \frac{1}{2} |U| \right\rfloor \) one by one. This is one of the main motivations for studying polyhedral methods: although the ellipsoid method proves polynomial-time solvability, it does not yield a direct, practical method, and rather incentivizes the search for a practically efficient algorithm. The polyhedral method can be helpful in this, for example by imitating the simplex method with a constraint generation technique, or by a primal-dual approach.

Why Should You Care?

Note that I've only barely scratched the surface of the rigor behind polyhedral optimization and it's already quite complex. If you lack the mathematical maturity you might've just skimmed past and probably just want to see what you can do in the "real world" with polyhedral combinatorics. Since my last post was on compilers, I'll dive into scheduling for this post.

Designing A Polyhedral Scheduler.

Polyhedral optimization of kernels can be thought of in three phases:
  1. Input code and loop nests are represented as polyhedra.
  2. Algebraic transformations are applied to them.
  3. A new optimized code that scans the transformed polyhedra is generated.
I'm going to focus on the first two stages for brevity.

Polyhedral Model.

  1. Iteration Domain: For each statement in the code, the iteration domain represents the range of values taken by the loop iterators surrounding the statement. For example, in the following C code:

    
    int N = 3;                              // External parameter
    for (int i = 0; i < N; i++) {           // Loop 1
        for (int j = 0; j < N - i; j++) {   // Loop 2
            printf("i = %d, j = %d\n", i, j);
        }
    }
    

    The iteration domain represents all possible pairs of the values $(i, j)$, which together we call iteration vectors (denoted $\vec{it}$), that the variables will take when the loops execute. In this example:

    • The outer loop ($i$) runs from $0$ to $2$ (inclusive of $0$, exclusive of $3$), so the values of $i$ are $\{0, 1, 2\}$.

    • The inner loop ($j$) runs from $0$ to $N - i - 1$, so the values of $j$ vary depending on $i$. For example, when $i = 1$, $j$ runs from $0$ to $1$.

    • Thus, the iteration domain is constrained by \[\begin{cases} 0\leq i < N\\ 0\leq j < N - i\end{cases}\] where $N$ is an external parameter appearing in the bounds of the outer and inner loops. External parameters typically refer to constants or values defined outside the loop that influence loop bounds. This gives us the final iteration domain: \[ \begin{align*}\mathcal{D} &= \{(i, j)\mid 0\leq i < N, 0\leq j < N - i\}\\ &= \{(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)\} \end{align*}\]

    As $i$ increases, the range of $j$ decreases. This creates fewer $j$ values as $i$ gets larger, resulting in a triangular pattern:

        j
        ↑
    2   ●
    1   ●  ●
    0   ●  ●  ●
        ------------
        0  1  2   → i
    

    This triangular shape is the 2D polyhedron of iteration vectors, constrained by the loop bounds, and can be generalized for larger values of $N$.

    In general, the domain $\mathcal{D}_X$ of a statement $X$ is defined as \[\mathcal{D}_X = \left\{\vec{it}\quad\middle|\quad M_X\cdot\begin{pmatrix} \vec{it}\\ \vec{N}\\ 1 \end{pmatrix}\geq 0 \right\}\] where $M_X$ is a matrix defining the domain polyhedron.
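    For the example loop above, writing the four bound constraints as rows acting on $(i, j, N, 1)^\intercal$, one valid choice of $M_X$ (my own worked instance of the definition) is \[M_X = \begin{pmatrix} 1 & 0 & 0 & 0\\ -1 & 0 & 1 & -1\\ 0 & 1 & 0 & 0\\ -1 & -1 & 1 & -1 \end{pmatrix}\] whose rows encode $i\geq 0$, $N - 1 - i\geq 0$, $j\geq 0$, and $N - 1 - i - j\geq 0$ respectively.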

  2. Dependencies & Legality: A dependency $\Delta_{X\to Y}$ from statement $X$ to $Y$ means that the statement $X$ needs to be executed before statement $Y$ to preserve the semantics of the program. This dependency is defined on a set of iteration vector values for $X$ and $Y$ with the following constraints:

    • $X$ and $Y$ access the same memory location (and at least one of the two accesses is a write).

    • $X$ is executed before $Y$.
    To see this concept, consider the following C code block that demonstrates the dependency:

    
    #include <stdio.h>
    
    // Example: Two statements X and Y accessing the same memory location
    int shared_data = 0; // Shared memory location
    
    // Function representing statement X
    void statement_X(int it_X) {
        shared_data = it_X; // Write access
        printf("X executed with iterator: %d\n", it_X);
    }
    
    // Function representing statement Y
    void statement_Y(int it_Y) {
        printf("Y executed with iterator: %d, shared_data: %d\n", it_Y, shared_data);
    }
    
    int main() {
        int iterations = 5;
    
        // Simulating the execution of X and Y with dependencies
        for (int i = 0; i < iterations; i++) {
            // Dependency: X must be executed before Y
            statement_X(i); // Execute X first
            statement_Y(i); // Then execute Y
        }
    
        return 0;
    }             
        

    These constraints, similarly to the domains of statements, define a polyhedron: \[\Delta_{X\to Y} = \left\{\begin{pmatrix} \vec{it}_X\\ \vec{it}_Y \end{pmatrix}\quad \middle|\quad M_{X\to Y}\cdot\begin{pmatrix} \vec{it}_X\\ \vec{it}_Y\\ \vec{N}\\ 1 \end{pmatrix}\geq 0 \right\}\]
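    For a concrete (hypothetical) instance of my own, take a one-dimensional loop whose body is A[i] = A[i-1] + 1 for i from 1 to N - 1: the write of iteration $\vec{it}_X = (i_X)$ is read by iteration $\vec{it}_Y = (i_Y)$ with $i_Y = i_X + 1$ (here the source and sink happen to be instances of the same statement, which the formalism handles identically). Encoding the bounds and that equality (as two opposite inequalities) over $(i_X, i_Y, N, 1)^\intercal$ gives one possible $M_{X\to Y}$: \[M_{X\to Y} = \begin{pmatrix} 1 & 0 & 0 & -1\\ -1 & 0 & 1 & -1\\ 0 & 1 & 0 & -1\\ 0 & -1 & 1 & -1\\ -1 & 1 & 0 & -1\\ 1 & -1 & 0 & 1 \end{pmatrix}\] whose rows encode $i_X\geq 1$, $i_X\leq N - 1$, $i_Y\geq 1$, $i_Y\leq N - 1$, $i_Y - i_X - 1\geq 0$, and $i_X - i_Y + 1\geq 0$.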
  3. Scheduling Function: A scheduling function $\Lambda$ maps each statement and iteration vector of its domain to a lexicographically-ordered unique multidimensional timestamp. For a specific statement $X$, the function $\Lambda_X$ maps its iteration vectors to a multidimensional timestamp using affine forms, $\Phi_{X,i}$. These affine forms depend on iterators $\vec{it}$ and parameters $\vec{N}$, thus allowing us to define \[\Lambda_X:\ \mathcal{D}_X(\vec{N})\to\mathbb{N}^m,\qquad \vec{it}\mapsto\left(\Phi_{X, 0}(\vec{it}), \dots, \Phi_{X,m - 1}(\vec{it})\right)\] where $m$ is the number of scheduling dimensions, and each affine form $\Phi_{X, i}$ is defined by \[\Phi_{X, i}(\vec{it}) = T_{X, i}\cdot \begin{pmatrix} \vec{it}\\ \vec{N}\\ 1 \end{pmatrix}\] where $T_{X, i}$ is the transformation vector that computes the $i$-th component of the timestamp.

    For a statement $X$ surrounded by $k$ nested loops, one can prove that at most $2k + 1$ dimensions are necessary to express all possible scheduling transformations, but in practice there is no fixed upper limit on the number of dimensions a scheduler may use.
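    To tie this back to the triangular loop nest from before (my own illustrative choice, not a schedule any particular tool would necessarily pick), interchanging the two loops corresponds to \[\Lambda_X(i, j) = (j, i), \qquad T_{X, 0} = (0, 1, 0, 0), \quad T_{X, 1} = (1, 0, 0, 0)\] where each $T_{X, i}$ acts on $(i, j, N, 1)^\intercal$: the timestamp orders iterations first by $j$ and then by $i$, so the generated code would scan the same iteration domain with the $j$ loop outermost.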

Polyhedral Scheduler.

A polyhedral scheduler is the algorithm that computes our scheduling function $\Lambda$. We now know that two types of integer affine constraints determine the computation of this function, and that it has to preserve the semantics of the initial code while optimizing some cost functions. Now let's go into how to build the ILP problem, using the proximity cost function, the most common cost function defined in SOTA schedulers (I'll be choosing an iterative scheduler):
  1. Scheduling Problem Formulation: Consider an iterative scheduler where the algorithm must find the full scheduling transformation $\Lambda$ step by step, building an ILP problem to find each scheduling dimension $\Phi_{X,i}$, starting from the outermost dimension and moving to the innermost. It terminates when enough scheduling dimensions have been found, and at each step it aims to find the optimal vector of coefficients $T_{X,i}$ for every statement $X$.

  2. Validity/Legality Constraint: This is the core of the polyhedral scheduler because it constrains the transformation vectors $T_{X, i}$ to values that preserve the program semantics, which ensures the legality of the schedule. For each dependency $\Delta_{X\to Y}$, $X$ has to be executed before $Y$: \[\left(\vec{it},\vec{it}'\right)\in\Delta_{X\to Y}\implies \Lambda_Y\left(\vec{it}'\right)\succ \Lambda_X\left(\vec{it}\right)\] where $\succ$ represents lexicographically greater.

    Because, for an iterative scheduler, each dimension $\Phi_{X, i}$ is computed from the outermost to the innermost, the definition of validity becomes \[\left(\vec{it},\vec{it}'\right)\in\Delta_{X\to Y}\implies \Phi_{Y, i}\left(\vec{it}'\right)\geq \Phi_{X, i}\left(\vec{it}\right)\] for each dimension until the dependency $\Delta_{X\to Y}$ is satisfied. You can linearize this implication using Farkas' lemma, which lets the constraints be expressed only over the space of variables formed by the coefficient vectors $T_{X, i}$ and $T_{Y, i}$.
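    For reference, the affine form of Farkas' lemma that makes this linearization work states that an affine form $\psi$ is non-negative everywhere over a non-empty polyhedron $\left\{\vec{x}\mid A\vec{x} + \vec{b}\geq 0\right\}$ if and only if it is a non-negative affine combination of the polyhedron's constraints: \[\psi\left(\vec{x}\right) \equiv \lambda_0 + \vec{\lambda}^\intercal\left(A\vec{x} + \vec{b}\right), \qquad \lambda_0\geq 0,\ \vec{\lambda}\geq \vec{0}\] Applying this with $\psi = \Phi_{Y, i} - \Phi_{X, i}$ (or $\Phi_{Y, i} - \Phi_{X, i} - 1$ once the dependency must be strictly satisfied) over the dependence polyhedron $\Delta_{X\to Y}$ eliminates the iteration vectors and leaves constraints on the $T$ coefficients and the Farkas multipliers alone.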

  3. Progression Constraint: The progression constraint is added at each scheduling iteration to ensure the progression of the algorithm. Its purpose is to ensure that the schedule defines a complete order for the iteration space and to make sure that the trivial zero solution is completely avoided. The constraint definition forces the next scheduling solution to be linearly independent of previous solutions in the iteration space.

    Define the matrix $Z_X$ as the row-by-row concatenation of the previously found scheduling dimension solutions $T_{X, i}$. We can define the orthogonal complement $Z^{\perp}_X$ as \[Z^{\perp}_X = I - Z^{\intercal}_X\left(Z_X Z^{\intercal}_X\right)^{-1}Z_X\] where $I$ is the identity matrix and $Z^{\intercal}_X$ is the transpose of $Z_X$. Given that we're limiting the scheduling search space to non-negative solutions only, the progression constraint is expressed through the rows of the orthogonal complement matrix: \[\forall i,\ Z^{\perp}_{X, i}\cdot z^{*}_X\geq 0\quad \wedge\quad \sum_i Z^{\perp}_{X, i}\cdot z^{*}_X \geq 1\] where $Z^{\perp}_{X, i}$ is a row of $Z^{\perp}_X$ and $z^{*}_X$ is the next solution to be computed.
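    A tiny worked instance of my own, restricting attention to the two iterator coefficients of the triangular loop nest: suppose the first dimension found was $T_{X, 0} = (1, 0)$, i.e., $\Phi_{X, 0} = i$. Then \[Z_X = \begin{pmatrix}1 & 0\end{pmatrix}, \qquad Z^{\perp}_X = I - \begin{pmatrix}1\\0\end{pmatrix}\begin{pmatrix}1 & 0\end{pmatrix} = \begin{pmatrix}0 & 0\\ 0 & 1\end{pmatrix}\] and the progression constraint reduces to requiring that the second component of $z^{*}_X$ (the coefficient of $j$) be at least $1$: the next solution must be linearly independent of $(1, 0)$, and the trivial zero solution is excluded.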

  4. Proximity Cost Function: The proximity cost function is defined in order to find, among all legal solutions, the ones that optimize temporal locality. The idea is to minimize the distance (in scheduling time) between multiple accesses to the same memory position. Since data dependencies describe multiple accesses to the same memory position, the proximity objective is to minimize the dependency distance. For a dependency $\Delta_{X\to Y}$ we can bound that distance by \[\left(\vec{it},\vec{it}'\right)\in\Delta_{X\to Y}\implies \Phi_{Y, i}\left(\vec{it}'\right) - \Phi_{X, i}\left(\vec{it}\right)\leq \vec{u}\cdot\vec{N} + w \] where $\vec{u}$ and $w$ are the cost-function coefficients to minimize. In essence, proximity captures a useful transformation and indirectly favors making the first dimensions parallel, with a dependency distance of $0$.
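    For instance, with the hypothetical dependency from earlier where iteration $i$ produces a value consumed at iteration $i + 1$, a schedule with $\Phi(i) = i$ gives a dependency distance of $\Phi(i + 1) - \Phi(i) = 1$, so $\vec{u} = \vec{0}$, $w = 1$ suffices; a dependency whose source and sink fall in the same timestamp reaches distance $0$, which is exactly the situation the proximity objective steers the scheduler towards.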

Wrapping Up.

Now you have all the pieces to go ahead and start building your own polyhedral scheduler and add lots of optimizations that I didn't cover in this post, like loop fusion, to make it highly performant. Hopefully you learned something cool from the math too!

I know it's likely the contents of this post are too rigorous for the average reader so if you have any questions, feel free to dm me @hy3na_xyz and I can help out wherever.

Always laugh at the competition.

- hyena

09/19/2024