A Walk Down Polyhedral Lane.
Combinatorial Optimization.
Combinatorial optimization seeks the optimal object from a finite collection of objects. This collection typically has a representation such as a graph or tree. Since the number of objects grows exponentially with the size of the representation, as in the case of Hamiltonian circuits, scanning all objects one by one to select the best solution is infeasible at scale. Thus, efficient search methods are necessary. In the 1960s, Jack Edmonds advocated for defining an efficient method as one whose running time is bounded by a polynomial in the size of its representation, and he developed polynomial-time algorithms such as the Blossom algorithm for the matching problem. Today, problems solvable in polynomial time are classified as belonging to the complexity class P. In the 1970s, Cook and Karp identified combinatorial optimization problems, such as the Travelling Salesman Problem, that are NP-complete: by definition, every problem in NP is polynomial-time reducible to each of them. All NP-complete problems are therefore equivalent in the sense that the polynomial-time solvability of one implies the same for all. So far, most combinatorial optimization problems have been classified either as polynomial-time solvable or as NP-complete, but none have been proven to be both. This raises the fundamental question: are these two properties disjoint or do they coincide? In other words: \[P\neq NP\quad\text{or}\quad P=NP\] I will focus on three key aspects of combinatorial optimization problems that are in class P:
- Polynomial-time solvability.
- Polyhedral representations.
- Min-Max relations.
Maximum-Size Matching.
Let $G = (V, E)$ be an undirected graph and let $w : E\to \mathbb{R}_{+}$. For any subset $F$ of $E$, let \[w(F) := \sum_{e\in F}w(e)\] We call $w(F)$ the weight of $F$. Now suppose you want to find a matching $M$, where a matching is a set of pairwise disjoint edges of $G$ (no two edges share a vertex), such that the weight $w(M)$ is maximized. In notation this is \[\max\{w(M)\mid M\text{ matching in } G\}\] This has an equivalent formulation: for any matching $M$, denote the indicator vector of $M$ in $\mathbb{R}^{E}$ by $\phi^M$, that is \[\phi^M(e) := \begin{cases} 1 \quad \text{if }e\in M,\\ 0 \quad \text{if }e \notin M \end{cases}\quad \text{ for } e\in E\] Since $w$ is a vector in $\mathbb{R}^{E}$, we have $w(M) = w^\intercal \phi^M$, letting us rewrite our maximization as \[\max\left\{w^\intercal\phi^M \mid M\text{ matching in } G\right\}\] This is equivalent to maximizing the linear function $w^\intercal x$ over a finite set of vectors, so the optimal value does not change if we maximize over the convex hull of these vectors: \[\max\left\{w^\intercal x \mid x \in\text{conv.hull}\left\{\phi^M \mid M\text{ matching in } G\right\} \right\}\] The set \[\text{conv.hull} \{\phi^M\mid M\text{ matching in } G\}\] is a polytope in $\mathbb{R}^E$, called the matching polytope of $G$. There are two equivalent ways to define a polytope:
- V-representation: defines a polytope as the convex hull of a finite set of points, where the finite set must contain the set of extreme points of the polytope.
- H-representation: defines a polytope as the bounded intersection of a finite number of half-spaces, i.e., the solution set of a system of linear inequalities. These half-spaces are described by a matrix $A$ and a vector $b$ in the form: \[H = \{x \in\mathbb{R}^n : Ax\leq b\}\]
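As a quick illustration (a toy example of mine, unrelated to matchings), the unit square in $\mathbb{R}^2$ admits both representations: \[\text{conv.hull}\{(0,0),(1,0),(0,1),(1,1)\} = \left\{x\in\mathbb{R}^2 \;\middle|\; \begin{pmatrix} -1 & 0\\ 1 & 0\\ 0 & -1\\ 0 & 1 \end{pmatrix}x \leq \begin{pmatrix} 0\\ 1\\ 0\\ 1 \end{pmatrix}\right\}\] The V-representation lists its four extreme points, while the H-representation intersects the four half-spaces $x_1\geq 0$, $x_1\leq 1$, $x_2\geq 0$, $x_2\leq 1$.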
For bipartite graphs, the matching polytope turns out to be exactly the polytope given by the inequalities $x(e)\geq 0$ for $e\in E$ and $\sum_{e\ni v}x(e)\leq 1$ for $v\in V$. The argument goes in four steps:
- Containment Of The Matching Polytope: The matching polytope is the convex hull of the characteristic vectors \( \phi^M \) of matchings \( M \) in the bipartite graph \( G \). Since each characteristic vector satisfies the inequalities above, the matching polytope is contained within the polytope defined by these inequalities.
- Total Unimodularity: For a bipartite graph \( G \), the constraint matrix \( A \) associated with the inequalities is the vertex-edge incidence matrix of \( G \), which is totally unimodular: every square submatrix of \( A \) has a determinant that is either \( 0 \), \( 1 \), or \( -1 \) (see the small example after this list). This property guarantees that, whenever the right-hand side vector is integral, every vertex of the polyhedron \( \{x \mid Ax\leq b,\ x\geq 0\} \) is an integer vector.
- Integer Vertices: Due to the total unimodularity of the matrix \( A \) and the integrality of the right-hand sides, every vertex of the feasible region defined by the inequalities is an integer vector. In particular, the linear program formed by these inequalities always has an optimal solution in \( \mathbb{Z}^E \), where \( E \) is the edge set of the bipartite graph.
- Characterization Of Integer Solutions: Every integer vector satisfying the inequalities is a \( 0 \)-\( 1 \) vector selecting at most one edge at each vertex, so it equals \( \phi^M \) for some matching \( M \) in the graph. Thus, the integer vectors satisfying the inequalities are exactly the characteristic vectors of matchings in the bipartite graph \( G \), and the polytope defined by the inequalities is the matching polytope.
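To make the total unimodularity claim concrete, here is a tiny hand-picked example of mine: the path on vertices $a, b, c$ with edges $e_1 = \{a, b\}$ and $e_2 = \{b, c\}$ is bipartite, and its vertex-edge incidence matrix (rows $a, b, c$; columns $e_1, e_2$) is \[A = \begin{pmatrix} 1 & 0\\ 1 & 1\\ 0 & 1 \end{pmatrix}\] Every $1\times 1$ submatrix is $0$ or $1$, and each of the three $2\times 2$ submatrices has determinant $1$, so every square submatrix has determinant in $\{0, 1, -1\}$ and $A$ is totally unimodular.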
Kőnig's matching theorem, which states that in a bipartite graph the maximum size of a matching equals the minimum size of a vertex cover, is an example of a Min-Max formula that can be derived, as we've shown, from a polyhedral representation. We can also go the other way: a Min-Max formula in weighted form yields a polyhedral representation. This representation together with linear programming duality gives a definitive proof of optimality of a matching $M$ (spelled out below). So if you ever need to convince someone that a certain matching $M$ has maximum size, it is sufficient, and always possible, to exhibit a vertex cover of size $|M|$. This is what yields a good characterization for the maximum-size matching problem in bipartite graphs.
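To spell out the duality step (a standard argument, sketched here under the bipartite assumption): with $A$ the vertex-edge incidence matrix and $\mathbf{1}$ the all-ones vector, linear programming duality gives \[\max\{\mathbf{1}^\intercal x \mid Ax\leq \mathbf{1},\ x\geq 0\} = \min\{\mathbf{1}^\intercal y \mid A^\intercal y\geq \mathbf{1},\ y\geq 0\}\] and total unimodularity lets us take both optima at integer vectors. The left-hand side then counts a maximum matching and the right-hand side a minimum vertex cover, which is exactly Kőnig's Min-Max relation.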
What About Nonbipartite Graphs?
If $G$ is nonbipartite, the matching polytope is not defined directly by the inequalities \[\begin{align*} &x(e)\geq 0 &\text{for }e \in E\\ &\sum_{e \ni v}x(e)\leq 1 &\text{for }v\in V \end{align*}\] This is because if \( C \) is an odd circuit in the graph \( G \), we can define a vector \( x \in \mathbb{R}^E \) such that:
- \( x(e) = \frac{1}{2} \) for each edge \( e \) in the circuit \( C \).
- \( x(e) = 0 \) for edges not in \( C \).
This vector satisfies both families of inequalities:
- Non-Negativity: \[ x(e) \geq 0,\quad \forall e \in E \]
- Vertex Degree Constraints: \[ \sum_{e \ni v} x(e) \leq 1,\quad \forall v \in V \]
Yet $x$ is not in the matching polytope: every matching contains at most $\left\lfloor \frac{1}{2}|C|\right\rfloor$ edges of $C$, so any convex combination of characteristic vectors of matchings puts total weight at most $\left\lfloor \frac{1}{2}|C|\right\rfloor$ on $C$, while $x$ puts $\frac{1}{2}|C|$ there. Edmonds showed that adding the odd-set inequalities \[\sum_{e\subseteq U}x(e)\leq \left\lfloor \tfrac{1}{2}|U|\right\rfloor\quad\text{for each odd-size }U\subseteq V\] to the two families above yields exactly the matching polytope of $G$.
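For concreteness, take the smallest odd circuit (my own toy instance): a triangle on vertices $\{1, 2, 3\}$ with edges $C=\{12, 23, 13\}$, and set $x(12)=x(23)=x(13)=\tfrac{1}{2}$. Every vertex sees $\tfrac{1}{2}+\tfrac{1}{2}=1\leq 1$ and all entries are non-negative, but \[\sum_{e\subseteq U}x(e)=\tfrac{3}{2} > 1 = \left\lfloor\tfrac{1}{2}|U|\right\rfloor\quad\text{for }U=\{1,2,3\},\] so $x$ violates an odd-set inequality; indeed it cannot be a convex combination of matchings, since every matching of a triangle contains at most one edge.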
One can derive from this representation the polynomial-time solvability of the weighted matching problem via the ellipsoid method. For that, it suffices to have a polynomial-time algorithm that answers the following question: \[\text{given } x\in\mathbb{R}^E\text{, does $x$ belong to the matching polytope of $G$?}\] We know that such an algorithm exists; each of the inequalities can be checked in time bounded by a polynomial in $|V|$, $|E|$, and the size of $x$, but the method should obviously avoid testing the exponentially many inequalities \(\sum_{e \subseteq U} x(e) \leq \left\lfloor \frac{1}{2} |U| \right\rfloor \) one by one. This is one of the main motivations for studying polyhedral methods: although the ellipsoid method proves polynomial-time solvability, it does not yield a direct method and rather incentivizes a search for a practically efficient algorithm. The polyhedral method can help here, for example by imitating the simplex method with a constraint generation technique, or by a primal-dual approach. A sketch of the easy half of such a membership test follows.
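As a small, hedged sketch (my own code, not taken from any solver): the snippet below only checks the easy inequalities, non-negativity and the vertex degree constraints, for a given fractional vector $x$ on a hypothetical toy instance; the exponentially many odd-set inequalities would require a genuine separation routine (e.g., one based on minimum odd cuts), which is precisely the nontrivial part this sketch leaves out.

```c
#include <stdio.h>

#define NV 3 /* number of vertices in the toy instance */
#define NE 3 /* number of edges in the toy instance */

/* Edge list of a triangle: {0,1}, {1,2}, {0,2} (hypothetical instance). */
static const int edge_u[NE] = {0, 1, 0};
static const int edge_v[NE] = {1, 2, 2};

/* Check only the easy inequalities: non-negativity and vertex degree
   constraints. Returns 1 if x satisfies them, 0 otherwise. */
int satisfies_trivial_inequalities(const double x[NE]) {
    double degree_sum[NV] = {0.0};
    for (int e = 0; e < NE; e++) {
        if (x[e] < 0.0) return 0;          /* x(e) >= 0 */
        degree_sum[edge_u[e]] += x[e];
        degree_sum[edge_v[e]] += x[e];
    }
    for (int v = 0; v < NV; v++)
        if (degree_sum[v] > 1.0) return 0; /* sum of x(e) over e containing v <= 1 */
    return 1;
}

int main(void) {
    /* The fractional odd-circuit point passes these checks even though it
       lies outside the matching polytope (it violates an odd-set
       inequality), which is why a real separation routine is needed. */
    double x[NE] = {0.5, 0.5, 0.5};
    printf("trivial inequalities satisfied: %d\n",
           satisfies_trivial_inequalities(x));
    return 0;
}
```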
Why Should You Care?
Note that I've only barely scratched the surface of the rigor behind polyhedral optimization and it's already quite complex. If you're lacking the mathematical maturity you might've just skimmed by, and likely just want to see what you can do in the "real world" with polyhedral combinatorics. A brief list includes:
- Compilers.
- Schedulers.
- Network design.
- Supply chain optimization.
Designing A Polyhedral Scheduler.
Polyhedral optimization of kernels can be thought of in three phases:
- Input code and loop nests are represented as polyhedra.
- Algebraic transformations are applied to them.
- A new optimized code that scans the transformed polyhedra is generated.
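To make these phases less abstract, here is a hypothetical before/after pair (my own toy code, not the output of any particular tool): the second loop nest is what a polyhedral code generator might emit after applying a loop interchange, one of the simplest algebraic transformations, to the first one.

```c
#include <stdio.h>

#define N 4
#define M 4

int main(void) {
    static double A[N][M]; /* zero-initialized */

    /* Original input: walks A column by column, which strides through
       memory for a row-major C array. */
    for (int j = 0; j < M; j++)
        for (int i = 0; i < N; i++)
            A[i][j] += 1.0;

    /* After a loop interchange: the same iteration polyhedron, scanned
       in a cache-friendlier row-major order by the generated code. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            A[i][j] += 1.0;

    printf("A[0][0] = %f\n", A[0][0]);
    return 0;
}
```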
Polyhedral Model.
- Iteration Domain: For each statement,
the iteration domain represents the range of values taken by the loop iterators
surrounding the statement. For example, in the following C code:
```c
int N = 3; // External parameter
for (int i = 0; i < N; i++) {         // Loop 1
    for (int j = 0; j < N - i; j++) { // Loop 2
        printf("i = %d, j = %d\n", i, j);
    }
}
```
The iteration domain represents all possible pairs of values $(i, j)$, which together we call iteration vectors (denoted $\vec{it}$), that the variables will take when the loops execute. In this example:
- The outer loop ($i$) runs from $0$ to $2$ (inclusive of $0$, exclusive of $3$), so the values of $i$ are $\{0, 1, 2\}$.
- The inner loop ($j$) runs from $0$ to $N - i - 1$, so the values of $j$ vary depending on $i$. For example, when $i = 1$, $j$ runs from $0$ to $1$.
- Thus, the iteration domain is constrained by \[\begin{cases} 0\leq i < N\\ 0\leq j < N - i\end{cases}\] where $N$ is the external parameter that appears in the loop bounds. External parameters typically refer to constants or values defined outside the loop that influence loop bounds. This gives us the final iteration domain (here for $N = 3$): \[ \begin{align*}\mathcal{D} &= \{(i, j)\mid 0\leq i < N,\ 0\leq j < N - i\}\\ &= \{(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (2, 0)\} \end{align*}\]
```
j
↑
2  ●
1  ●  ●
0  ●  ●  ●
   ---------→ i
   0  1  2
```
This triangular shape is the 2D polyhedron of iteration vectors, constrained by the loop bounds, and it generalizes to larger values of $N$. In general, the domain $\mathcal{D}_X$ of a statement $X$ is defined as \[\mathcal{D}_X = \left\{\vec{it}\quad\middle|\quad M_X\cdot\begin{pmatrix} \vec{it}\\ \vec{N}\\ 1 \end{pmatrix}\geq 0 \right\}\] where $M_X$ is a matrix defining the domain polyhedron.
- Dependencies & Legality: A dependency $\Delta_{X\to Y}$ from statement $X$ to statement $Y$ means that $X$ needs to be executed before $Y$ to preserve the semantics of the program. This dependency is defined on a set of iteration vector values for $X$ and $Y$ with the following constraints:
- $X$ and $Y$ access the same memory location (either $X$ or $Y$ is a write).
- $X$ is executed before $Y$.
```c
// Example: Two statements X and Y accessing the same memory location
#include <stdio.h>

int shared_data = 0; // Shared memory location

// Function representing statement X
void statement_X(int it_X) {
    shared_data = it_X; // Write access
    printf("X executed with iterator: %d\n", it_X);
}

// Function representing statement Y
void statement_Y(int it_Y) {
    printf("Y executed with iterator: %d, shared_data: %d\n", it_Y, shared_data);
}

int main() {
    int iterations = 5;
    // Simulating the execution of X and Y with dependencies
    for (int i = 0; i < iterations; i++) {
        // Dependency: X must be executed before Y
        statement_X(i); // Execute X first
        statement_Y(i); // Then execute Y
    }
    return 0;
}
```
These constraints, similarly to the statement domains, define a polyhedron: \[\Delta_{X\to Y} = \left\{\begin{pmatrix} \vec{it}_X\\ \vec{it}_Y \end{pmatrix}\quad \middle|\quad M_{X\to Y}\cdot\begin{pmatrix} \vec{it}_X\\ \vec{it}_Y\\ \vec{N}\\ 1 \end{pmatrix}\geq 0 \right\}\]
- Scheduling Function: A scheduling function $\Lambda$ maps each statement and iteration vector of its domain to a unique, lexicographically ordered multidimensional timestamp. For a specific statement $X$, the function $\Lambda_X$ maps its iteration vectors to a multidimensional timestamp using affine forms $\Phi_{X,i}$. These affine forms depend on the iterators $\vec{it}$ and the parameters $\vec{N}$, allowing us to define \[\Lambda_X: \begin{align*} &\mathcal{D}_X(\vec{N})\quad\to\mathbb{N}^m\\ &\vec{it}\qquad\hspace{6mm}\mapsto\left(\Phi_{X, 0}(\vec{it}), \dots, \Phi_{X,m - 1}(\vec{it})\right) \end{align*}\] where $m$ is the number of scheduling dimensions, and each affine form $\Phi_{X, i}$ is defined by \[\Phi_{X, i} = T_{X, i}\cdot \begin{pmatrix} \vec{it}\\ \vec{N}\\ 1 \end{pmatrix}\] where $T_{X, i}$ is the transformation vector that computes the timestamp. For a statement $X$ surrounded by $k$ nested loops, one can prove that at most $2k + 1$ dimensions are necessary to express all possible scheduling transformations, but in practice there is no upper limit on the number of dimensions.
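As a hand-worked illustration (my own, reusing the loop nest from the iteration domain example): with $\vec{it}=(i, j)$, $\vec{N}=(N)$, and $m = 2$ scheduling dimensions, the original execution order corresponds to the identity schedule, while a loop interchange corresponds to swapping the two transformation vectors: \[T_{X,0} = (1, 0, 0, 0),\quad T_{X,1} = (0, 1, 0, 0)\quad\implies\quad \Lambda_X(i, j) = (i, j)\] \[T_{X,0} = (0, 1, 0, 0),\quad T_{X,1} = (1, 0, 0, 0)\quad\implies\quad \Lambda_X(i, j) = (j, i)\] In both cases each $\Phi_{X,k}$ is just $T_{X,k}\cdot(i, j, N, 1)^\intercal$, and the lexicographic order of the resulting timestamps dictates the new execution order.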
Polyhedral Scheduler.
A polyhedral scheduler is the algorithm that computes our scheduling function $\Lambda$. We now know that two types of integer affine constraints determine the computation of this function, forcing it to preserve the semantics of the initial code, and that it also has to optimize some cost functions. Now let's go into how to build the ILP problem, with the proximity cost function representing the most common cost function defined in SOTA schedulers (I'll be choosing an iterative scheduler):
- Scheduling Problem Formulation: Consider an iterative scheduler where the algorithm must find the full scheduling transformation $\Lambda$ step by step, building an ILP problem to find each scheduling dimension $\Phi_{X,i}$, starting from the outermost dimension and ending at the innermost. At each step it aims to find the optimal vector of coefficients $T_{X,i}$ for every statement $X$, and it terminates when enough scheduling dimensions have been found.
- Validity/Legality Constraint: This is the core of the polyhedral scheduler because it constrains the transformation vectors $T_{X, i}$ to take values that preserve the program semantics, which ensures the legality of the schedule. For each dependency $\Delta_{X\to Y}$, $X$ has to be executed before $Y$:
\[\left(\vec{it},\vec{it}'\right)\in\Delta_{X\to Y}\implies \Lambda_Y\left(\vec{it}'\right)\succ \Lambda_X\left(\vec{it}\right)\]
where $\succ$ represents lexicographically greater.
Because, for an iterative scheduler, each dimension $\Phi_{X, i}$ is computed from the outermost to the innermost, the definition of validity becomes: \[\left(\vec{it},\vec{it}'\right)\in\Delta_{X\to Y}\implies \Phi_{Y, i}\left(\vec{it}'\right)\geq \Phi_{X, i}\left(\vec{it}\right)\] until the dependency $\Delta_{X\to Y}$ is satisfied. You can actually linearize this implication using Farkas' lemma, which makes the constraints expressible only over the space of variables formed by the vectors $T_{X, i}$ and $T_{Y, i}$ (the precise statement is recalled below).
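For completeness, the statement being used here is the affine form of Farkas' lemma (paraphrased, with $A$, $\vec{b}$, and the multipliers $\lambda$ as my own notation): an affine function $\psi$ is non-negative at every point of a non-empty polyhedron $\{\vec{z}\mid A\vec{z}+\vec{b}\geq 0\}$ if and only if \[\psi(\vec{z}) \equiv \lambda_0 + \vec{\lambda}^\intercal\left(A\vec{z}+\vec{b}\right)\quad\text{with }\lambda_0\geq 0,\ \vec{\lambda}\geq \vec{0}.\] Taking the polyhedron to be $\Delta_{X\to Y}$ (read off from $M_{X\to Y}$) and $\psi = \Phi_{Y, i}(\vec{it}') - \Phi_{X, i}(\vec{it})$, then identifying the coefficients of $\vec{it}$, $\vec{it}'$, $\vec{N}$, and the constant term on both sides, yields finitely many linear constraints over $T_{X, i}$, $T_{Y, i}$, and the Farkas multipliers; eliminating the multipliers leaves constraints on the transformation vectors alone.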
- Progression Constraint: The progression constraint is added at each scheduling iteration to guarantee that the algorithm makes progress. Its purpose is to ensure that the schedule defines a complete order for the iteration space and that the trivial zero solution is avoided. The constraint forces the next scheduling solution to be linearly independent of the previous solutions in the iteration space.
Define the matrix $Z_X$ as the row-by-row concatenation of previous scheduling dimension solutions $T_{X, i}$. We can define the orthogonal complement $Z^{\perp}_X$ as: \[Z^{\perp}_X = I - Z^{\intercal}_X\left(Z_X Z^{\intercal}_X\right)^{-1}Z_X\] where $I$ is the identity matrix and $Z^{\intercal}_X$ is the transpose of $Z_X$. Given we're limiting the scheduling search space to non-negative solutions, the progression constraint is expressed row by row over the orthogonal complement matrix: \[\forall i,\ Z^{\perp}_{X, i}\cdot z^{*}_X\geq 0\quad \wedge\quad \sum_i Z^{\perp}_{X, i}\cdot z^{*}_X \geq 1\] where $Z^{\perp}_{X, i}$ is a row of $Z^{\perp}_X$ and $z^{*}_X$ is the next solution to be computed (a small worked example follows).
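A small sanity check (my own toy computation, restricting $z^{*}_X$ to the two iterator coefficients of a 2-deep loop nest and ignoring parameter and constant coefficients): if the first dimension found was $T_{X, 0}=(1, 0)$, i.e. $\Phi_{X, 0}= i$, then \[Z_X = \begin{pmatrix}1 & 0\end{pmatrix},\qquad Z^{\perp}_X = I - \begin{pmatrix}1\\0\end{pmatrix}(1)^{-1}\begin{pmatrix}1 & 0\end{pmatrix} = \begin{pmatrix}0 & 0\\ 0 & 1\end{pmatrix}\] The row constraints read $0\geq 0$ and $z^{*}_{X,2}\geq 0$, and the row-sum constraint forces $z^{*}_{X,2}\geq 1$, where $z^{*}_{X,2}$ is the $j$ coefficient of the next solution: the scheduler must pick something like $T_{X, 1}=(0, 1)$, which is indeed linearly independent of $T_{X, 0}$.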
- Proximity Cost Function: The proximity cost function is defined in order to find, among all legal solutions, the ones that optimize temporal locality. Basically the idea is to minimize the distance (in scheduling time) between multiple accesses to the same memory position. Since data dependencies describe multiple accesses to the same memory position, the proximity objective is to minimize the dependency distance. For a dependency $\Delta_{X\to Y}$ we can define the bound \[\left(\vec{it},\vec{it}'\right)\in\Delta_{X\to Y}\implies \Phi_{Y, i}\left(\vec{it}'\right)-\Phi_{X, i}\left(\vec{it}\right)\leq \vec{u}\cdot\vec{N} + w \] where the coefficients $\vec{u}$ and $w$ are what the cost function minimizes. In essence, proximity accurately represents a useful transformation and indirectly favors the first dimensions to be parallel, with a dependency distance of $0$.
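As a final toy example (mine, not taken from any particular scheduler): suppose a doubly nested loop carries a single uniform dependency $(i, j)\to(i+1, j)$. For a candidate dimension $\Phi = i$ the dependency distance is $1$, so the tightest valid bound is $\vec{u}=\vec{0},\ w=1$; for $\Phi = j$ the distance is $0$ and $\vec{u}=\vec{0},\ w=0$ suffices: \[\Phi = i:\ \Phi\left(\vec{it}'\right)-\Phi\left(\vec{it}\right) = 1,\qquad \Phi = j:\ \Phi\left(\vec{it}'\right)-\Phi\left(\vec{it}\right) = 0\] Minimizing $(\vec{u}, w)$ therefore prefers $\Phi = j$ as the outer dimension: it keeps the dependency distance at $0$ and can run in parallel, which is exactly the behaviour described above.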