Constraint Transforms

To avoid having to deal with constraints while simulating the Hamiltonian dynamics during sampling, every (multivariate) parameter in a Stan model is transformed to an unconstrained variable behind the scenes by the model compiler. The transform is based on the constraints, if any, in the parameter’s definition. Scalars or the scalar values in vectors, row vectors or matrices may be constrained with lower and/or upper bounds. Vectors may alternatively be constrained to be ordered, positive ordered, or simplexes. Matrices may be constrained to be correlation matrices or covariance matrices. This chapter provides a definition of the transforms used for each type of variable. For examples of how to declare and define these variables in a Stan program, see section Variable declaration. To directly access the functional form of these transformations from inside the Stan language, see Variable Transformation Functions in the Functions Reference.

Stan converts models to C++ classes which define probability functions with support on all of \(\mathbb{R}^K\), where \(K\) is the number of unconstrained parameters needed to define the constrained parameters defined in the program. The C++ classes also include code to transform the parameters from unconstrained to constrained and apply the appropriate Jacobians.

Limitations due to finite precision representation

The transforms in this chapter are described mathematically, in exact arithmetic. However, the behavior observed in practice can differ because the computations are carried out in finite-precision floating point.

Stan’s arithmetic is implemented using double-precision floating-point numbers, which may cause computation to behave differently than the mathematics would suggest. For example, the lower bound constraint is defined below by an exponential inverse transform, which mathematically excludes the lower bound; but if the closest floating-point number to the inverse-transformed value is the boundary, the value is rounded to the boundary. This may cause unexpected warnings or errors if the boundary value is invalid in other parts of the code. For example, we may observe the floating-point value 0 for a variance parameter that has been declared with lower=0. In general, double-precision floating-point numbers cannot reliably store more than about 16 decimal digits of a number. See more about floating point arithmetic in the Stan User’s Guide.

These issues are exacerbated by the fact that CmdStan stores its output to CSV files with 8 digits of precision by default. More digits can be requested by the user at the cost of additional disk usage, as discussed in the CmdStan Command-Line Interface Overview.

Changes of variables

The support of a random variable \(X\) with density \(p_X(x)\) is that subset of values for which it has non-zero density,

\[ \mathrm{supp}(X) = \{ x | p_X(x) > 0 \}. \]

If \(f\) is a total function defined on the support of \(X\), then \(Y = f(X)\) is a new random variable. This section shows how to compute the probability density function of \(Y\) for well-behaved transforms \(f\). The rest of the chapter details the transforms used by Stan.

Univariate changes of variables

Suppose \(X\) is one dimensional and \(f: \mathrm{supp}(X) \rightarrow \mathbb{R}\) is a one-to-one, monotonic function with a differentiable inverse \(f^{-1}\). Then the density of \(Y\) is given by

\[ p_Y(y) = p_X(f^{-1}(y)) \, \left| \, \frac{d}{dy} f^{-1}(y)\, \right|. \]

The absolute derivative of the inverse transform measures how the scale of the transformed variable changes with respect to the underlying variable.

Multivariate changes of variables

The multivariate generalization of an absolute derivative is a Jacobian, or more fully the absolute value of the determinant of the Jacobian matrix of the transform. The Jacobian matrix measures the change of each output variable relative to every input variable and the absolute determinant uses that to determine the differential change in volume at a given point in the parameter space.

Suppose \(X\) is a \(K\)-dimensional random variable with probability density function \(p_X(x)\). A new random variable \(Y = f(X)\) may be defined by transforming \(X\) with a suitably well-behaved function \(f\). It suffices for what follows to note that if \(f\) is one-to-one and its inverse \(f^{-1}\) has a well-defined Jacobian, then the density of \(Y\) is

\[ p_Y(y) = p_X(f^{-1}(y)) \, \left| \, \det \, J_{f^{-1}}(y) \, \right|, \]

where \(\det{}\) is the matrix determinant operation and \(J_{f^{-1}}(y)\) is the Jacobian matrix of \(f^{-1}\) evaluated at \(y\). Taking \(x = f^{-1}(y)\), the Jacobian matrix is defined by

\[ J_{f^{-1}}(y) = \left[ \begin{array}{ccc}\displaystyle \frac{\partial x_1}{\partial y_1} & \cdots & \displaystyle \frac{\partial x_1}{\partial y_{K}} \\ \vdots & \ddots & \vdots \\ \displaystyle\frac{\partial x_{K}}{\partial y_1} & \cdots & \displaystyle\frac{\partial x_{K}}{\partial y_{K}} \end{array} \right]. \]

If the Jacobian matrix is triangular, the determinant reduces to the product of the diagonal entries,

\[ \det \, J_{f^{-1}}(y) = \prod_{k=1}^K \frac{\partial x_k}{\partial y_k}. \]

Triangular matrices naturally arise in situations where the variables are ordered, for instance by dimension, and each variable’s transformed value depends on the previous variables’ values. Diagonal matrices, a simple form of triangular matrix, arise if each transformed variable depends on only a single untransformed variable.

Lower bounded scalar

Stan uses a logarithmic transform for lower and upper bounds.

Lower bound transform

If a variable \(X\) is declared to have lower bound \(a\), it is transformed to an unbounded variable \(Y\), where

\[ Y = \log(X - a). \]

Lower bound inverse transform

The inverse of the lower-bound transform maps an unbounded variable \(Y\) to a variable \(X\) that is bounded below by \(a\) by

\[ X = \exp(Y) + a. \]

Absolute derivative of the lower bound inverse transform

The absolute derivative of the inverse transform is

\[ \left| \, \frac{d}{dy} \left( \exp(y) + a \right) \, \right| = \exp(y). \]

Therefore, given the density \(p_X\) of \(X\), the density of \(Y\) is

\[ p_Y(y) = p_X\!\left( \exp(y) + a \right) \cdot \exp(y). \]
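
As an illustration, the same transform can be written out by hand in a Stan program. The following is a minimal sketch in which the bound a is passed as data and the exponential density is an arbitrary example; declaring real<lower=a> x; performs these steps automatically.

  data {
    real a;                      // lower bound, passed as data for this sketch
  }
  parameters {
    real y;                      // unconstrained
  }
  transformed parameters {
    real x = exp(y) + a;         // inverse transform: x > a
  }
  model {
    // log absolute derivative of the inverse transform: log(exp(y)) = y
    target += y;
    // an arbitrary example density with support above a
    target += exponential_lpdf(x - a | 1);
  }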

Upper bounded scalar

Stan uses a negated logarithmic transform for upper bounds.

Upper bound transform

If a variable \(X\) is declared to have an upper bound \(b\), it is transformed to the unbounded variable \(Y\) by

\[ Y = \log(b - X). \]

Upper bound inverse transform

The inverse of the upper bound transform converts the unbounded variable \(Y\) to the variable \(X\) bounded above by \(b\) through

\[ X = b - \exp(Y). \]

Absolute derivative of the upper bound inverse transform

The absolute derivative of the inverse of the upper bound transform is

\[ \left| \, \frac{d}{dy} \left( b - \exp(y) \right) \, \right| = \exp(y). \]

Therefore, the density of the unconstrained variable \(Y\) is defined in terms of the density of the variable \(X\) with an upper bound of \(b\) by

\[ p_Y(y) = p_X \!\left( b - \exp(y) \right) \cdot \exp(y). \]

Lower and upper bounded scalar

For lower and upper-bounded variables, Stan uses a scaled and translated log-odds transform.

Log odds and the logistic sigmoid

The log-odds function is defined for \(u \in (0,1)\) by

\[ \mathrm{logit}(u) = \log \frac{u}{1 - u}. \]

The inverse of the log odds function is the logistic sigmoid, defined for \(v \in (-\infty,\infty)\) by

\[ \mathrm{logit}^{-1}(v) = \frac{1}{1 + \exp(-v)}. \]

The derivative of the logistic sigmoid is

\[ \frac{d}{dy} \mathrm{logit}^{-1}(y) = \mathrm{logit}^{-1}(y) \cdot \left( 1 - \mathrm{logit}^{-1}(y) \right). \]

Lower and upper bounds transform

For variables constrained to be in the open interval \((a, b)\), Stan uses a scaled and translated log-odds transform. If variable \(X\) is declared to have lower bound \(a\) and upper bound \(b\), then it is transformed to a new variable \(Y\), where

\[ Y = \mathrm{logit} \left( \frac{X - a}{b - a} \right). \]

Lower and upper bounds inverse transform

The inverse of this transform is

\[ X = a + (b - a) \cdot \mathrm{logit}^{-1}(Y). \]

Absolute derivative of the lower and upper bounds inverse transform

The absolute derivative of the inverse transform is given by

\[ \left| \frac{d}{dy} \left( a + (b - a) \cdot \mathrm{logit}^{-1}(y) \right) \right| = (b - a) \cdot \mathrm{logit}^{-1}(y) \cdot \left( 1 - \mathrm{logit}^{-1}(y) \right). \]

Therefore, the density of the transformed variable \(Y\) is

\[ p_Y(y) = p_X \! \left( a + (b - a) \cdot \mathrm{logit}^{-1}(y) \right) \cdot (b - a) \cdot \mathrm{logit}^{-1}(y) \cdot \left( 1 - \mathrm{logit}^{-1}(y) \right). \]

Despite the apparent complexity of this expression, most of the terms are repeated and thus only need to be evaluated once. Most importantly, \(\mathrm{logit}^{-1}(y)\) only needs to be evaluated once, so there is only one call to \(\exp(-y)\).
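
The same logic can be written out by hand; the following is a minimal sketch in which the bounds a and b are passed as data and the uniform density is an arbitrary example. Declaring real<lower=a, upper=b> x; performs these steps automatically.

  data {
    real a;                                  // lower bound
    real<lower=a> b;                         // upper bound
  }
  parameters {
    real y;                                  // unconstrained
  }
  transformed parameters {
    real x = a + (b - a) * inv_logit(y);     // inverse transform: a < x < b
  }
  model {
    // log absolute derivative of the inverse transform
    target += log(b - a) + log_inv_logit(y) + log1m_inv_logit(y);
    x ~ uniform(a, b);                       // arbitrary example density on (a, b)
  }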

Affinely transformed scalar

Stan uses an affine transform to be able to specify parameters with a given offset and multiplier.

Affine transform

For variables with expected offset \(\mu\) and/or (positive) multiplier \(\sigma\), Stan uses an affine transform. Such a variable \(X\) is transformed to a new variable \(Y\), where

\[ Y = \frac{X - \mu}{\sigma}. \]

The default value for the offset \(\mu\) is \(0\) and for the multiplier \(\sigma\) is \(1\); either may be specified without the other.

Affine inverse transform

The inverse of this transform is

\[ X = \mu + \sigma \cdot Y. \]

Absolute derivative of the affine inverse transform

The absolute derivative of the affine inverse transform is

\[ \left| \frac{d}{dy} \left( \mu + \sigma \cdot y \right) \right| = \sigma. \]

Therefore, the density of the transformed variable \(Y\) is

\[ p_Y(y) = p_X \! \left( \mu + \sigma \cdot y \right) \cdot \sigma. \]

For an example of how to code this in Stan, see section Affinely Transformed Real.
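
As a minimal sketch of the declaration syntax, the offset and multiplier may themselves be parameters, which yields a non-centered parameterization; the group count J, the variable names, and the priors below are illustrative rather than part of the transform.

  data {
    int<lower=1> J;                                   // number of groups (illustrative)
  }
  parameters {
    real mu;
    real<lower=0> sigma;
    vector<offset=mu, multiplier=sigma>[J] alpha;     // unconstrained y_j = (alpha_j - mu) / sigma
  }
  model {
    mu ~ normal(0, 5);
    sigma ~ exponential(1);
    alpha ~ normal(mu, sigma);                        // sampled on an approximately unit scale
  }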

Ordered vector

For some modeling tasks, a vector-valued random variable \(X\) is required with support on ordered sequences. One example is the set of cut points in ordered logistic regression.

In constraint terms, an ordered \(K\)-vector \(x \in \mathbb{R}^K\) satisfies

\[ x_k < x_{k+1} \]

for \(k \in \{ 1, \ldots, K-1 \}\).

Ordered transform

Stan’s transform follows the constraint directly. It maps an increasing vector \(x \in \mathbb{R}^{K}\) to an unconstrained vector \(y \in \mathbb{R}^K\) by setting

\[ y_k = \left\{ \begin{array}{ll} x_1 & \mbox{if } k = 1, \mbox{ and} \\ \log \left( x_{k} - x_{k-1} \right) & \mbox{if } 1 < k \leq K. \end{array} \right. \]

Ordered inverse transform

The inverse transform for an unconstrained \(y \in \mathbb{R}^K\) to an ordered sequence \(x \in \mathbb{R}^K\) is defined by the recursion

\[ x_k = \left\{ \begin{array}{ll} y_1 & \mbox{if } k = 1, \mbox{ and} \\ x_{k-1} + \exp(y_k) & \mbox{if } 1 < k \leq K. \end{array} \right. \]

Unrolling the recursion, \(x_k\) can also be expressed in closed form as

\[ x_k = y_1 + \sum_{k'=2}^k \exp(y_{k'}). \]

Absolute Jacobian determinant of the ordered inverse transform

The Jacobian of the inverse transform \(f^{-1}\) is lower triangular, with diagonal elements for \(1 \leq k \leq K\) of

\[ J_{k,k} = \left\{ \begin{array}{ll} 1 & \mbox{if } k = 1, \mbox{ and} \\ \exp(y_k) & \mbox{if } 1 < k \leq K. \end{array} \right. \]

Because \(J\) is triangular, the absolute Jacobian determinant is

\[ \left| \, \det \, J \, \right| \ = \ \left| \, \prod_{k=1}^K J_{k,k} \, \right| \ = \ \prod_{k=2}^K \exp(y_k). \]

Putting this all together, if \(p_X\) is the density of \(X\), then the transformed variable \(Y\) has density \(p_Y\) given by

\[ p_Y(y) = p_X(f^{-1}(y)) \ \prod_{k=2}^K \exp(y_k). \]
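
The ordered transform can likewise be written out by hand. The following is a minimal sketch in which the standard normal density on x is an arbitrary example; declaring ordered[K] x; performs these steps automatically.

  data {
    int<lower=2> K;
  }
  parameters {
    vector[K] y;                    // unconstrained
  }
  transformed parameters {
    vector[K] x;                    // strictly increasing
    x[1] = y[1];
    for (k in 2:K) {
      x[k] = x[k - 1] + exp(y[k]);
    }
  }
  model {
    // log absolute Jacobian determinant of the inverse transform
    target += sum(y[2:K]);
    x ~ normal(0, 1);               // arbitrary example density for the ordered values
  }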

Positive ordered vector

The positive ordered transform is defined in the same style as the ordered transform above, except that the first element is also constrained to be positive by a log transform. The transform maps a positive, increasing vector \(x \in \mathbb{R}^{K}\) to an unconstrained vector \(y \in \mathbb{R}^K\) by setting

\[ y_k = \left\{ \begin{array}{ll} \log(x_1) & \mbox{if } k = 1, \mbox{ and} \\ \log \left( x_{k} - x_{k-1} \right) & \mbox{if } 1 < k \leq K. \end{array} \right. \]

The inverse transform is

\[ x_k = \left\{ \begin{array}{ll} \exp(y_1) & \mbox{if } k = 1, \mbox{ and} \\ x_{k-1} + \exp(y_k) & \mbox{if } 1 < k \leq K. \end{array} \right. \]

The Jacobian of the inverse transform is lower triangular with diagonal entries \(\exp(y_k)\), so the absolute Jacobian determinant is

\[ \left| \, \det \, J \, \right| \ = \ \prod_{k=1}^K \exp(y_k). \]

Sum-to-zero transforms

Stan provides built-in constraint transforms for sum-to-zero vectors and sum-to-zero matrices. A sum-to-zero vector is a length-\(N\) vector of real values whose elements sum to zero. A sum-to-zero matrix is an \(N \times M\) matrix in which every row and every column sums to zero.

Stan uses an orthogonal basis as the starting point of the construction. The orthogonal basis balances the constraint evenly across the unconstrained values. The basis is a matrix \(H \in \mathbb{R}^{N\times (N-1)}\) such that \(H^{\mathsf T}H = I_{N-1}\) and \(H^{\mathsf T}\mathbf 1 = 0\), so its columns span the subspace of vectors that sum to zero. The orthogonal basis construction makes all the marginal variances of the constrained vector equal; see, e.g., Seyboldt (2024). Simpler alternatives, such as setting the final element to the negative sum of the first \(N - 1\) elements, do not result in equal variances. It is worth noting that even with equal marginal variances, the elements of the sum-to-zero constrained vector are negatively correlated with one another.

In many cases one wishes to model the sum-to-zero vector as normally distributed; in that case the induced covariance matrix is fully known:

\[ \sigma^2 \begin{pmatrix} 1-\tfrac{1}{N} & -\tfrac{1}{N} & \cdots&-\tfrac{1}{N} \\ -\tfrac{1}{N} & 1-\tfrac{1}{N} & \cdots & -\tfrac{1}{N} \\ \vdots & \vdots & \ddots & \vdots \\ -\tfrac{1}{N} & -\tfrac{1}{N} & \cdots &1-\tfrac{1}{N} \end{pmatrix}. \]

The marginal standard deviation no longer corresponds to \(\sigma\) but is \(\sigma \sqrt{1-\tfrac{1}{N}}\). The properties of the normal distribution allow multiplying the sum-to-zero vector by the reciprocal of \(\sqrt{1-\tfrac{1}{N}}\) so that the marginal standard deviation is the intended \(\sigma\):

sum_to_zero_vector[N] x;
x ~ normal(0, sqrt(N * inv(N - 1)));  // marginal standard deviation of each element is 1

When \(\sigma\) is a parameter, there is an additional adjustment when using the centered version of the sum-to-zero constraint. Let \(y\) be the unconstrained \((N - 1)\)-vector and give \(y\) an implicit standard normal prior. Then a sum-to-zero vector that is marginally normal with mean zero and standard deviation \(\sigma\) is given by \[ x = \underbrace{\sigma\sqrt{\frac{N}{N-1}}\;H}_{\text{size } N \times (N - 1)}\,y. \] This is the classic non-centered parameterization. The crucial detail is that the multiplication by \(\sigma\) acts on \(N - 1\) dimensions, not \(N\) dimensions. When writing the centered model as \[ x \sim \mathcal N \bigg(0,\, \sigma\sqrt{\frac{N}{N-1}} \bigg), \] the log density is incremented by an additional \(-\log(\sigma)\), which must be compensated by incrementing the log density with

target += log(sigma * sqrt(N * inv(N - 1)));
// or, because `sqrt(N * inv(N - 1))` is constant, equivalently
target += log(sigma);

which correctly adjusts for the \(N - 1\) free variables.

More details on the Helmert matrix can be found in Lancaster (1965); for the basic definitions of the isometric log-ratio transform, see Egozcue et al. (2003); and see Chapter 3 of Filzmoser, Hron, and Templ (2018) for the pivot coordinate version used here.

Sum-to-zero vector

Vectors that are constrained to sum-to-zero are useful for, among other things, additive varying effects, such as varying slopes or intercepts in a regression model (e.g., for income deciles).

A sum-to-zero \(K\)-vector \(x \in \mathbb{R}^K\) satisfies the constraint \[ \sum_{k=1}^K x_k = 0. \]

Sum-to-zero vector transform

The transform is defined iteratively. Given an \(x \in \mathbb{R}^{N + 1}\) that sums to zero (i.e., \(\sum_{n=1}^{N+1} x_n = 0\)), the transform proceeds as follows to produce an unconstrained \(y \in \mathbb{R}^N\). This is mathematically equivalent to multiplication by an orthogonal basis, e.g., one constructed from the standard basis using the Gram-Schmidt process. The single-loop version below achieves low computational and memory costs because no matrices are created or multiplied.

The transform is initialized by setting \[ S_N = 0 \] and \[ y_N = -x_{N + 1} \sqrt{1 + \frac{1}{N}}. \] Then, for each \(n\) from \(N - 1\) down to \(1\), let \[ w_{n + 1} = \frac{y_{n + 1}}{\sqrt{(n + 1)(n + 2)}}, \] \[ S_n = S_{n + 1} + w_{n + 1}, \] and \[ y_n = (S_n - x_{n + 1}) \frac{\sqrt{n (n + 1)}}{n}. \]

The constraining direction of this mapping (the inverse transform described in the next section, taking an unconstrained \(y\) to a sum-to-zero vector \(z\)) can be expressed within Stan as:

  vector manual_sum_to_zero_jacobian(vector y) {
    int N = num_elements(y);
    vector[N + 1] z = zeros_vector(N + 1);    // constrained result; sums to zero
    real sum_w = 0;
    for (n in 1:N) {
      int i = N - n + 1;                      // iterate from i = N down to 1
      real w = y[i] * inv_sqrt(i * (i + 1));  // w_i = y_i / sqrt(i (i + 1))
      sum_w += w;                             // running sum of w_i, ..., w_N
      z[i] += sum_w;
      z[i + 1] -= w * i;                      // subtract i copies of w_i from z[i + 1]
    }
    return z;
  }

Note that there is no target += (or jacobian +=) increment: the transform is linear, and because the basis is orthonormal the log absolute Jacobian determinant is constant, so it can be dropped.
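
As a usage sketch, assuming the function above is placed in the functions block, it can be applied to an unconstrained parameter vector in the transformed parameters block to reproduce what the sum_to_zero_vector type does automatically; the names N and y below are illustrative.

  data {
    int<lower=1> N;
  }
  parameters {
    vector[N] y;                                          // unconstrained
  }
  transformed parameters {
    vector[N + 1] x = manual_sum_to_zero_jacobian(y);     // sums to zero
  }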

Sum-to-zero vector inverse transform

The inverse transform follows the isometric log-ratio transform. It maps an unconstrained vector \(y \in \mathbb{R}^N\) to a sum-to-zero vector \(x \in \mathbb{R}^{N + 1}\) such that \[ \sum_{n=1}^{N + 1} x_n = 0. \] The values are defined directly, starting with \[ x_1 = \sum_{n=1}^N \frac{y_n}{\sqrt{n (n + 1)}}, \] and, for \(n \in 1{:}N\), \[ x_{n + 1} = \sum_{i = n + 1}^N \frac{y_i}{\sqrt{i (i + 1)}} - n \cdot \frac{y_n}{\sqrt{n (n + 1)}}. \]

The definition ensures that \[ \sum_{n = 1}^{N + 1} x_n = 0 \] by construction: each term \(w_n = y_n / \sqrt{n(n+1)}\) is added once to each of \(x_1, \ldots, x_n\) and subtracted \(n\) times from \(x_{n + 1}\), so its contributions cancel in the sum.

Absolute Jacobian determinant of the zero sum inverse transform

The inverse transform is a linear operation, leading to a constant Jacobian determinant which is therefore not included.

Sum-to-zero matrix transform

The matrix case of the sum-to-zero transform generalizes the vector case so that every row and every column of the matrix sums to zero. In fact, any \(N\)-dimensional array can be given a sum-to-zero constraint in every dimension using the sum-to-zero vector transform, because that transform is a linear bijection onto the sum-to-zero subspace and can be applied along each array dimension in turn. For the matrix case there are two dimensions, rows and columns; the sum-to-zero vector transform is applied along one and then the other.

Let the unconstrained matrix be \(Y \in \mathbb{R}^{n_1 \times n_2}\), and write the sum-to-zero vector inverse transform as the linear map \[ \mathbf z = \mathcal C_n(\mathbf y) = H_{n+1} \, \mathbf y \;\in\; \mathbb{R}^{n+1}, \] where \(H_{n+1} \in \mathbb{R}^{(n+1)\times n}\) denotes the orthogonal basis introduced above for vectors of length \(n + 1\), satisfying \(H_{n+1}^{\mathsf T} H_{n+1} = I_n\) and \(\mathbf 1^{\mathsf T} H_{n+1} = 0\).

Applying the transform along the columns and then the rows yields

\[ Z \;=\; H_{n_1 + 1} \, Y \, H_{n_2 + 1}^{\mathsf T} \;\in\; \mathbb{R}^{(n_1+1) \times (n_2+1)}. \]

Because each \(\mathcal C_n\) is a linear bijection onto the subspace of vectors orthogonal to \(\mathbf 1\), the composite map is a linear bijection between \(\mathbb{R}^{n_1 \times n_2}\) and the subspace of \(\mathbb{R}^{(n_1+1) \times (n_2+1)}\) consisting of matrices whose rows and columns all sum to zero.

Unit simplex

Variables constrained to the unit simplex show up in multivariate discrete models as both parameters (categorical and multinomial) and as variates generated by their priors (Dirichlet and multivariate logistic).

The unit \(K\)-simplex is the set of points \(x \in \mathbb{R}^K\) such that for \(1 \leq k \leq K\),

\[ x_k > 0, \]

and

\[ \sum_{k=1}^K x_k = 1. \]

An alternative definition is as the convex hull of the vertices. For instance, in two dimensions, the simplex vertices are the extreme values \((0,1)\) and \((1,0)\), and the unit 2-simplex is the line segment connecting these two points; values such as \((0.3,0.7)\) and \((0.99,0.01)\) lie on this segment. In three dimensions, the vertices are \((0,0,1)\), \((0,1,0)\), and \((1,0,0)\), and the unit 3-simplex is the boundary and interior of the triangle with these vertices. Points in the 3-simplex include \((0.5,0.5,0)\), \((0.2,0.7,0.1)\), and all other triples of non-negative values summing to 1.

As these examples illustrate, the simplex always picks out a subspace of \(K-1\) dimensions from \(\mathbb{R}^K\). Therefore a point \(x\) in the \(K\)-simplex is fully determined by its first \(K-1\) elements \(x_1, x_2, \ldots, x_{K-1}\), with

\[ x_K = 1 - \sum_{k=1}^{K-1} x_k. \]

Unit simplex inverse transform

The length-\(K\) unit simplex inverse transform is given by the softmax of a sum-to-zero vector of length \(K\).

Let \(y\) represent the \(K - 1\) unconstrained values in \((-\infty, \infty)\). The intermediate sum-to-zero vector \(z = \text{sum\_to\_zero\_transform}(y)\) has length \(K\). The unit simplex is then given by the softmax, \[ x_i = \text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{k = 1}^K \exp(z_k)}. \]

The sum-to-zero vector transform is described in further detail at the sum-to-zero vector section of the Reference Manual.
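
Putting the pieces together, a hand-coded version of the simplex constraint might look as follows. This is a minimal sketch that reuses the manual_sum_to_zero_jacobian function from the sum-to-zero section above and uses a Dirichlet density as an arbitrary example; declaring simplex[K] theta; performs these steps automatically.

  data {
    int<lower=2> K;
    vector<lower=0>[K] alpha;        // Dirichlet parameters (illustrative)
  }
  parameters {
    vector[K - 1] y;                 // unconstrained
  }
  transformed parameters {
    // sum-to-zero step (constant Jacobian) followed by softmax
    simplex[K] theta = softmax(manual_sum_to_zero_jacobian(y));
  }
  model {
    // log absolute Jacobian determinant of the softmax step
    target += sum(log(theta));
    theta ~ dirichlet(alpha);        // arbitrary example density on the simplex
  }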

Note

All versions of Stan pre-2.37 used the stick-breaking transform. This is documented at Stan 2.36 Reference Manual: Simplex Transform.

Absolute Jacobian determinant of the unit-simplex inverse transform

The Jacobian \(J\) of the inverse unit-simplex transform is computed on the \(K - 1\) free coordinates, restricting attention to the first \(K - 1\) components of \(x\) and \(z\). The Jacobian is the \((K - 1) \times (K - 1)\) matrix \(J\) with entries

\[ J_{ij} = \frac{\partial x_i}{\partial z_j} = \frac{\partial}{\partial z_j} \left( \frac{\exp(z_i)}{{\sum_{k = 1}^K \exp(z_k)}} \right) \] for \(i,j \in 1, \ldots, K - 1\).

The diagonal and off-diagonal derivatives are found using the derivative quotient rule and algebraic simplification

\[ J_{ij} = \begin{cases} x_i (1 - x_i), & \text{if } i = j, \\ -x_i x_j, & \text{if } i \neq j. \end{cases} \]

In matrix form this can be expressed as

\[ J = \text{diag}(x) - x x^\top \]

The determinant of this matrix can be found using the Matrix Determinant Lemma:

\[ \det\bigl(A + u v^{\top}\bigr) = \det(A)\,\bigl(1 + v^{\top}A^{-1}u\bigr). \]

Here,

\[ A \;=\; \operatorname{diag}(x_{1},\ldots, x_{K-1}), \quad u \;=\; -\bigl(x_1,\ldots, x_{K-1}\bigr)^{\!\top}, \quad v \;=\; \bigl(x_{1}, \ldots, x_{K-1}\bigr)^{\!\top}. \] Therefore,

\[ \begin{aligned} \det(J) &= \bigg(\prod_{i=1}^{K-1} x_i \bigg) \bigg(1 + (x_{1},\ldots, x_{K-1})\,\mathrm{diag}\bigl(x_{1}^{-1},\ldots,x_{K-1}^{-1}\bigr)\, \big(-x_{1},\ldots,-x_{K-1}\big)^{\top} \bigg) \\ &= \bigg(\prod_{i=1}^{K-1} x_{i}\bigg) \bigg(1 - \sum_{i=1}^{K-1} x_{i}\bigg) = \bigg(\prod_{i=1}^{K-1} x_{i}\bigg) x_{K} \\ &= \prod_{i=1}^{K} x_{i}. \end{aligned} \]

Unit simplex transform

The transform \(Y = f(X)\) can be derived by reversing the stages of the inverse transform,

\[ y = H^\top \bigg(\log(x) - \frac{1}{K}\sum_{i=1}^K\log(x_i) \, \mathbf{1} \bigg), \] where \(\log(x)\) is applied elementwise and \(\mathbf{1}\) is the vector of ones.

The matrix \(H\) is the orthogonal basis matrix used by the sum-to-zero transform. Because its columns are orthonormal, its transpose acts as its inverse on the sum-to-zero subspace.

Stochastic Matrix

The column_stochastic_matrix[N, M] type in Stan represents an \(N \times M\) matrix whose columns are unit simplexes of dimension \(N\), and the row_stochastic_matrix[M, N] type represents an \(M \times N\) matrix whose rows are unit simplexes of dimension \(N\). In other words, each column (row) of the matrix is a vector constrained to have non-negative entries that sum to one.

Definition of a Stochastic Matrix

A column stochastic matrix \(X \in \mathbb{R}^{N \times M}\) is defined such that each column is a simplex. For column \(m\) (where \(1 \leq m \leq M\)):

\[ X_{n, m} \geq 0 \quad \text{for } 1 \leq n \leq N, \]

and

\[ \sum_{n=1}^N X_{n, m} = 1. \]

A row stochastic matrix is any matrix whose transpose is a column stochastic matrix (i.e., the rows of the matrix are simplexes), so that for each row \(n\) (where \(1 \leq n \leq N\)),

\[ X_{n, m} \geq 0 \quad \text{for } 1 \leq m \leq M, \]

and

\[ \sum_{m=1}^M X_{n, m} = 1. \]

This definition ensures that each column (row) of the matrix \(X\) lies on the \(N-1\) dimensional unit simplex, similar to the simplex[N] type, but extended across multiple columns (rows).
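
For example, a column stochastic matrix can be declared directly and each of its columns given a simplex distribution. The following is a minimal sketch in which the Dirichlet prior and the variable names are illustrative.

  data {
    int<lower=2> N;
    int<lower=1> M;
    vector<lower=0>[N] alpha;                  // Dirichlet parameters (illustrative)
  }
  parameters {
    column_stochastic_matrix[N, M] theta;      // each column is an N-simplex
  }
  model {
    for (m in 1:M) {
      col(theta, m) ~ dirichlet(alpha);        // independent prior on each column
    }
  }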

Inverse Transform for Stochastic Matrix

For the column and row stochastic matrices the inverse transform is the same as simplex, but applied to each column (row).

Absolute Jacobian Determinant for the Inverse Transform

The Jacobian determinant of the inverse transform for each column \(m\) is the same as for the unit-simplex inverse transform applied to that column,

\[ \left| \det J_m \right| = \prod_{n=1}^{N} X_{n, m}. \]

Thus, the overall Jacobian determinant for the entire column_stochastic_matrix and row_stochastic_matrix is the product of the determinants for each column (row):

\[ \left| \det J \right| = \prod_{m=1}^{M} \left| \det J_m \right|. \]

Transform for Stochastic Matrix

For the column and row stochastic matrices the transform is the same as simplex, but applied to each column (row).

Unit vector

An \(n\)-dimensional vector \(x \in \mathbb{R}^n\) is said to be a unit vector if it has unit Euclidean length, so that

\[ \Vert x \Vert \ = \ \sqrt{x^{\top}\,x} \ = \ \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} \ = \ 1\ . \]

Unit vector inverse transform

Stan divides an unconstrained vector \(y \in \mathbb{R}^{n}\) by its norm, \(\Vert y \Vert = \sqrt{y^\top y}\), to obtain a unit vector \(x\),

\[ x = \frac{y}{\Vert y \Vert}. \]

To generate a unit vector, Stan draws points at random in \(\mathbb{R}^n\) with independent standard normal distributions and standardizes them by dividing by their Euclidean length. Muller (1959) showed that this generates points uniformly at random on \(S^{n-1}\). That is, if we draw \(y_i \sim \textsf{Normal}(0, 1)\) for \(i \in 1{:}n\), then \(x = \frac{y}{\Vert y \Vert}\) has a uniform distribution over \(S^{n-1}\). This provides an \(n\)-dimensional basis for \(S^{n-1}\) that preserves local neighborhoods, in that points that are close to each other in \(\mathbb{R}^n\) map to points that are near each other on \(S^{n-1}\). The mapping is not perfectly distance preserving, however, because there are points arbitrarily far from each other in \(\mathbb{R}^n\) that map to the same point on \(S^{n-1}\).

Warning: undefined at zero!

The above mapping from \(\mathbb{R}^n\) to \(S^{n-1}\) is not defined at zero. While this occurs with probability zero during sampling and may thus be ignored, zero is the default initialization point, so unit vector parameters cannot be initialized at zero. A simple workaround is to initialize from a small interval around zero, which is an option built into all of the Stan interfaces.

Absolute Jacobian determinant of the unit vector inverse transform

The Jacobian matrix relating the input vector \(y\) to the output vector \(x\) is singular because \(x^\top x = 1\) for any non-zero input vector \(y\). Thus, there technically is no unique transformation from \(x\) to \(y\). To circumvent this issue, let \(r = \sqrt{y^\top y}\) so that \(y = r x\). The transformation from \(\left(r, x_{-n}\right)\) to \(y\) is well-defined but \(r\) is arbitrary, so we set \(r = 1\). In this case, the determinant of the Jacobian is proportional to \(e^{-\frac{1}{2} y^\top y}\), which is the kernel of a standard multivariate normal distribution with \(n\) independent dimensions.
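
A hand-coded version of this construction is sketched below; declaring unit_vector[n] x; performs the same steps automatically, and the standard normal increment on y corresponds to the Jacobian treatment just described.

  data {
    int<lower=2> n;
  }
  parameters {
    vector[n] y;                            // unconstrained
  }
  transformed parameters {
    vector[n] x = y / sqrt(dot_self(y));    // unit vector
  }
  model {
    // standard normal kernel on y; x is then uniform on the sphere
    target += -0.5 * dot_self(y);
  }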

Correlation matrices

A \(K \times K\) correlation matrix \(x\) must be symmetric, so that

\[ x_{k,k'} = x_{k',k} \]

for all \(k,k' \in \{ 1, \ldots, K \}\), it must have a unit diagonal, so that

\[ x_{k,k} = 1 \]

for all \(k \in \{ 1, \ldots, K \}\), and it must be positive definite, so that for every non-zero \(K\)-vector \(a\),

\[ a^{\top} x a > 0. \]

The number of free parameters required to specify a \(K \times K\) correlation matrix is \(\binom{K}{2}\).

There is more than one way to map \(\binom{K}{2}\) unconstrained parameters to a \(K \times K\) correlation matrix. Stan implements the Lewandowski-Kurowicka-Joe (LKJ) transform of Lewandowski, Kurowicka, and Joe (2009).
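
In practice the transform is invoked implicitly by declaring a correlation matrix parameter. The following minimal sketch pairs such a declaration with an LKJ prior; the shape value 2 is an arbitrary example.

  data {
    int<lower=2> K;
  }
  parameters {
    corr_matrix[K] Omega;        // constrained via the transform described below
  }
  model {
    Omega ~ lkj_corr(2);         // example prior; the shape value 2 is illustrative
  }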

Correlation matrix inverse transform

It is easiest to specify the inverse, going from its \(\binom{K}{2}\) parameter basis to a correlation matrix. The basis will actually be broken down into two steps. To start, suppose \(y\) is a vector containing \(\binom{K}{2}\) unconstrained values. These are first transformed via the bijective function \(\tanh : \mathbb{R} \rightarrow (-1, 1)\)

\[ \tanh y = \frac{\exp(2y) - 1}{\exp(2y) + 1}. \]

Then, define a \(K \times K\) matrix \(z\), the upper triangular values of which are filled by row with the transformed values, and the diagonal entries are set to one. For example, in the \(4 \times 4\) case, there are \(\binom{4}{2}\) values arranged as

\[ z = \left[ \begin{array}{cccc} 1 & \tanh y_1 & \tanh y_2 & \tanh y_4 \\ 0 & 1 & \tanh y_3 & \tanh y_5 \\ 0 & 0 & 1 & \tanh y_6 \\ 0 & 0 & 0 & 1 \end{array} \right] . \]

Lewandowski, Kurowicka and Joe (LKJ) show how to bijectively map the array \(z\) to a correlation matrix \(x\). The entry \(z_{i,j}\) for \(i < j\) is interpreted as the canonical partial correlation (CPC) between \(i\) and \(j\), which is the correlation between \(i\)’s residuals and \(j\)’s residuals when both \(i\) and \(j\) are regressed on all variables \(i'\) such that \(i'< i\). In the case of \(i=1\), there are no earlier variables, so \(z_{1,j}\) is just the Pearson correlation between \(i\) and \(j\).

In Stan, the LKJ transform is reformulated in terms of a Cholesky factor \(w\) of the final correlation matrix, defined for \(1 \leq i,j \leq K\) by

\[ w_{i,j} = \left\{ \begin{array}{cl} 0 & \mbox{if } i > j, \\ 1 & \mbox{if } 1 = i = j, \\ \prod_{i'=1}^{i - 1} \left( 1 - z_{i'\!,\,j}^2 \right)^{1/2} & \mbox{if } 1 < i = j, \\ z_{i,j} & \mbox{if } 1 = i < j, \mbox{ and} \\ z_{i,j} \, \prod_{i'=1}^{i-1} \left( 1 - z_{i'\!,\,j}^2 \right)^{1/2} & \mbox{ if } 1 < i < j. \end{array} \right. \]

This does not require as much computation per matrix entry as it may appear; calculating the rows in terms of earlier rows yields the more manageable expression

\[ w_{i,j} = \left\{ \begin{array}{cl} 0 & \mbox{if } i > j, \\ 1 & \mbox{if } 1 = i = j, \\ z_{i,j} & \mbox{if } 1 = i < j, \mbox{ and} \\ \frac{z_{i,j}}{z_{i-1,j}} \ w_{i-1,j} \left( 1 - z_{i-1,j}^2 \right)^{1/2} & \mbox{ if } 1 < i \leq j. \end{array} \right. \]

Given the upper-triangular Cholesky factor \(w\), the final correlation matrix is

\[ x = w^{\top} w. \]

Lewandowski, Kurowicka, and Joe (2009) show that the determinant of the correlation matrix can be defined in terms of the canonical partial correlations as

\[ \mbox{det} \, x = \prod_{i=1}^{K-1} \ \prod_{j=i+1}^K \ (1 - z_{i,j}^2) = \prod_{1 \leq i < j \leq K} (1 - z_{i,j}^2). \]

Absolute Jacobian determinant of the correlation matrix inverse transform

From the inverse of equation 11 in (Lewandowski, Kurowicka, and Joe 2009), the absolute Jacobian determinant is

\[ \sqrt{\prod_{i=1}^{K-1}\prod_{j=i+1}^K \left(1-z_{i,j}^2\right)^{K-i-1}} \ \times \prod_{i=1}^{K-1}\prod_{j=i+1}^K \frac{\partial z_{i,j}}{\partial y_{i,j}} \]

Correlation matrix transform

The correlation transform is defined by reversing the steps of the inverse transform defined in the previous section.

Starting with a correlation matrix \(x\), the first step is to find the unique upper triangular matrix \(w\) with positive diagonal such that \(x = w^{\top} w\). Because \(x\) is positive definite, \(w\) can be computed with a Cholesky decomposition,

\[ w = \mbox{chol}(x)^{\top}, \] where \(\mbox{chol}(x)\) is the lower-triangular Cholesky factor of \(x\).

The next step from the Cholesky factor \(w\) back to the array \(z\) of canonical partial correlations (CPCs) is simplified by the ordering of the elements in the definition of \(w\), which when inverted yields

\[ z_{i,j} = \left\{ \begin{array}{cl} 0 & \mbox{if } i \geq j, \\ w_{i,j} & \mbox{if } 1 = i < j, \mbox{ and} \\ {w_{i,j}} \ \prod_{i'=1}^{i-1} \left( 1 - z_{i'\!,j}^2 \right)^{-1/2} & \mbox{if } 1 < i < j. \end{array} \right. \]

The final stage of the transform reverses the hyperbolic tangent transform, which is defined by

\[ y = \tanh^{-1} z = \frac{1}{2} \log \left( \frac{1 + z}{1 - z} \right). \]

The inverse hyperbolic tangent function, \(\tanh^{-1}\), is also called the Fisher transformation.

Covariance matrices

A \(K \times K\) matrix is a covariance matrix if it is symmetric and positive definite (see the previous section for definitions). It requires \(K + \binom{K}{2}\) free parameters to specify a \(K \times K\) covariance matrix.

Covariance matrix transform

Stan’s covariance transform is based on a Cholesky decomposition composed with a log transform of the positive-constrained diagonal elements.1

If \(x\) is a covariance matrix (i.e., a symmetric, positive definite matrix), then there is a unique lower-triangular matrix \(z = \mathrm{chol}(x)\) with positive diagonal entries, called a Cholesky factor, such that

\[ x = z \, z^{\top}. \]

The off-diagonal entries of the Cholesky factor \(z\) are unconstrained, but the diagonal entries \(z_{k,k}\) must be positive for \(1 \leq k \leq K\).

To complete the transform, the diagonal is log-transformed to produce a fully unconstrained lower-triangular matrix \(y\) defined by

\[ y_{m,n} = \left\{ \begin{array}{cl} 0 & \mbox{if } m < n, \\ \log z_{m,m} & \mbox{if } m = n, \mbox{ and} \\ z_{m,n} & \mbox{if } m > n. \end{array} \right. \]

Covariance matrix inverse transform

The inverse transform reverses the two steps of the transform. Given an unconstrained lower-triangular \(K \times K\) matrix \(y\), the first step is to recover the intermediate matrix \(z\) by reversing the log transform,

\[ z_{m,n} = \left\{ \begin{array}{cl} 0 & \mbox{if } m < n, \\ \exp(y_{m,m}) & \mbox{if } m = n, \mbox{ and} \\ y_{m,n} & \mbox{if } m > n. \end{array} \right. \]

The covariance matrix \(x\) is recovered from its Cholesky factor \(z\) by taking

\[ x = z \, z^{\top}. \]

Absolute Jacobian determinant of the covariance matrix inverse transform

The Jacobian is the product of the Jacobians of the exponential transform from the unconstrained lower-triangular matrix \(y\) to matrix \(z\) with positive diagonals and the product transform from the Cholesky factor \(z\) to \(x\).

The transform from unconstrained \(y\) to Cholesky factor \(z\) has a diagonal Jacobian matrix, the absolute determinant of which is thus

\[ \prod_{k=1}^K \frac{\partial}{\partial y_{k,k}} \, \exp(y_{k,k}) \ = \ \prod_{k=1}^K \exp(y_{k,k}) \ = \ \prod_{k=1}^K z_{k,k}. \]

The Jacobian matrix of the second transform from the Cholesky factor \(z\) to the covariance matrix \(x\) is also triangular, with diagonal entries corresponding to pairs \((m,n)\) with \(m \geq n\), defined by

\[ \frac{\partial}{\partial z_{m,n}} \left( z \, z^{\top} \right)_{m,n} \ = \ \frac{\partial}{\partial z_{m,n}} \left( \sum_{k=1}^K z_{m,k} \, z_{n,k} \right) \ = \ \left\{ \begin{array}{cl} 2 \, z_{n,n} & \mbox{if } m = n \mbox{ and } \\ z_{n,n} & \mbox{if } m > n. \end{array} \right. \]

The absolute Jacobian determinant of the second transform is thus

\[ 2^{K} \ \prod_{m = 1}^{K} \ \prod_{n=1}^{m} z_{n,n} \ = \ 2^{K} \ \prod_{n=1}^K \ \prod_{m=n}^K z_{n,n} \ = \ 2^{K} \ \prod_{k=1}^K z_{k,k}^{K - k + 1}. \]

Finally, the full absolute Jacobian determinant of the inverse of the covariance matrix transform from the unconstrained lower-triangular \(y\) to a symmetric, positive definite matrix \(x\) is the product of the Jacobian determinants of the exponentiation and product transforms,

\[ \left( \prod_{k=1}^K z_{k,k} \right) \left( 2^{K} \ \prod_{k=1}^K z_{k,k}^{K - k + 1} \right) \ = \ 2^K \, \prod_{k=1}^K z_{k,k}^{K-k+2}. \]

Let \(f^{-1}\) be the inverse transform from a \(K + \binom{K}{2}\)-vector \(y\) to the \(K \times K\) covariance matrix \(x\). A density function \(p_X(x)\) defined on \(K \times K\) covariance matrices is transformed to the density \(p_Y(y)\) over \(K + \binom{K}{2}\) vectors \(y\) by

\[ p_Y(y) = p_X(f^{-1}(y)) \ 2^K \ \prod_{k=1}^K z_{k,k}^{K-k+2}. \]
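
An alternative, described in the footnote to this section, is to build a covariance matrix from a scale vector and a correlation matrix directly in the Stan program. The following is a minimal sketch in which the priors are arbitrary examples.

  data {
    int<lower=2> K;
  }
  parameters {
    vector<lower=0>[K] sigma;    // scales
    corr_matrix[K] Omega;        // correlations
  }
  transformed parameters {
    // Sigma = diag(sigma) * Omega * diag(sigma)
    cov_matrix[K] Sigma = quad_form_diag(Omega, sigma);
  }
  model {
    sigma ~ exponential(1);      // illustrative priors
    Omega ~ lkj_corr(2);
  }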

Cholesky factors of covariance matrices

An \(M \times M\) covariance matrix \(\Sigma\) can be Cholesky factored to a lower triangular matrix \(L\) such that \(L\,L^{\top} = \Sigma\). If \(\Sigma\) is positive definite, then \(L\) will be \(M \times M\). If \(\Sigma\) is only positive semi-definite, then \(L\) will be \(M \times N\), with \(N < M\).

An \(M \times N\) matrix is a Cholesky factor for a covariance matrix if and only if it is lower triangular (its above-diagonal entries are zero), its diagonal entries are positive, and \(M \geq N\). Such a matrix \(L\) ensures that \(L \, L^{\top}\) is positive semi-definite if \(M > N\) and positive definite if \(M = N\).

A Cholesky factor of a covariance matrix requires \(N + \binom{N}{2} + (M - N)N\) unconstrained parameters.

Cholesky factor of covariance matrix transform

Stan’s Cholesky factor transform only requires the first step of the covariance matrix transform, namely log transforming the positive diagonal elements. Suppose \(x\) is an \(M \times N\) Cholesky factor. The above-diagonal entries are zero, the diagonal entries are positive, and the below-diagonal entries are unconstrained. The transform required is thus

\[ y_{m,n} = \left\{ \begin{array}{cl} 0 & \mbox{if } m < n, \\ \log x_{m,m} & \mbox{if } m = n, \mbox{ and} \\ x_{m,n} & \mbox{if } m > n. \end{array} \right. \]

Cholesky factor of covariance matrix inverse transform

The inverse transform need only invert the logarithm with an exponentiation. If \(y\) is the unconstrained matrix representation, then the elements of the constrained matrix \(x\) are defined by

\[ x_{m,n} = \left\{ \begin{array}{cl} 0 & \mbox{if } m < n, \\ \exp(y_{m,m}) & \mbox{if } m = n, \mbox{ and} \\ y_{m,n} & \mbox{if } m > n. \end{array} \right. \]

Absolute Jacobian determinant of Cholesky factor inverse transform

The transform has a diagonal Jacobian matrix, the absolute determinant of which is

\[ \prod_{n=1}^N \frac{\partial}{\partial y_{n,n}} \, \exp(y_{n,n}) \ = \ \prod_{n=1}^N \exp(y_{n,n}) \ = \ \prod_{n=1}^N x_{n,n}. \]

Let \(x = f^{-1}(y)\) be the inverse transform from a \(N + \binom{N}{2} + (M - N)N\) vector to an \(M \times N\) Cholesky factor for a covariance matrix \(x\) defined in the previous section. A density function \(p_X(x)\) defined on \(M \times N\) Cholesky factors of covariance matrices is transformed to the density \(p_Y(y)\) over \(N + \binom{N}{2} + (M - N)N\) vectors \(y\) by

\[ p_Y(y) = p_X(f^{-1}(y)) \prod_{n=1}^N x_{n,n}. \]

Cholesky factors of correlation matrices

A \(K \times K\) correlation matrix \(\Omega\) is positive definite and has a unit diagonal. Because it is positive definite, it can be Cholesky factored to a \(K \times K\) lower-triangular matrix \(L\) with positive diagonal elements such that \(\Omega = L\,L^{\top}\). Because the correlation matrix has a unit diagonal,

\[ \Omega_{k,k} = L_k\,L_k^{\top} = 1, \]

each row vector \(L_k\) of the Cholesky factor is of unit length. The length and positivity constraint allow the diagonal elements of \(L\) to be calculated from the off-diagonal elements, so that a Cholesky factor for a \(K \times K\) correlation matrix requires only \(\binom{K}{2}\) unconstrained parameters.

Cholesky factor of correlation matrix inverse transform

It is easiest to start with the inverse transform from the \(\binom{K}{2}\) unconstrained parameters \(y\) to the \(K \times K\) lower-triangular Cholesky factor \(x\). The inverse transform is based on the hyperbolic tangent function, \(\tanh\), which satisfies \(\tanh(x) \in (-1,1)\). Here it will function like an inverse logit with a sign to pick out the direction of an underlying canonical partial correlation; see the section on correlation matrix transforms for more information on the relation between canonical partial correlations and the Cholesky factors of correlation matrices.

Suppose \(y\) is a vector of \(\binom{K}{2}\) unconstrained values. Let \(z\) be a lower-triangular matrix with zero diagonal and below diagonal entries filled by row. For example, in the \(3 \times 3\) case,

\[ z = \left[ \begin{array}{ccc} 0 & 0 & 0 \\ \tanh y_1 & 0 & 0 \\ \tanh y_2 & \tanh y_3 & 0 \end{array} \right] \]

The matrix \(z\), with entries in the range \((-1, 1)\), is then transformed to the Cholesky factor \(x\), by taking2

\[ x_{i,j} = \left\{ \begin{array}{lll} 0 & \mbox{ if } i < j & \mbox{ [above diagonal]} \\ \sqrt{1 - \sum_{j' < j} x_{i,j'}^2} & \mbox{ if } i = j & \mbox{ [on diagonal]} \\ z_{i,j} \ \sqrt{1 - \sum_{j' < j} x_{i,j'}^2} & \mbox{ if } i > j & \mbox{ [below diagonal]} \end{array} \right. \]

In the \(3 \times 3\) case, this yields

\[ x = \left[ \begin{array}{ccc} 1 & 0 & 0 \\ z_{2,1} & \sqrt{1 - x_{2,1}^2} & 0 \\ z_{3,1} & z_{3,2} \sqrt{1 - x_{3,1}^2} & \sqrt{1 - (x_{3,1}^2 + x_{3,2}^2)} \end{array} \right], \]

where the \(z_{i,j} \in (-1,1)\) are the \(\tanh\)-transformed \(y\).

The approach is a signed stick-breaking process on the quadratic (Euclidean length) scale. Starting from length 1 at \(j=1\), each below-diagonal entry \(x_{i,j}\) is determined by the (signed) fraction \(z_{i,j}\) of the remaining length for the row that it consumes. The diagonal entries \(x_{i,i}\) get any leftover length from earlier entries in their row. The above-diagonal entries are zero.
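
In practice the transform is invoked by declaring a Cholesky factor parameter directly. The following minimal sketch pairs such a declaration with the Cholesky-factor form of the LKJ prior (the shape value 2 is an arbitrary example) and recovers the full correlation matrix as a generated quantity.

  data {
    int<lower=2> K;
  }
  parameters {
    cholesky_factor_corr[K] L;        // lower triangular, rows of unit length
  }
  model {
    L ~ lkj_corr_cholesky(2);         // example prior; the shape value 2 is illustrative
  }
  generated quantities {
    corr_matrix[K] Omega = multiply_lower_tri_self_transpose(L);   // Omega = L * L'
  }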

Cholesky factor of correlation matrix transform

Suppose \(x\) is a \(K \times K\) Cholesky factor for some correlation matrix. The first step of the transform reconstructs the intermediate values \(z\) from \(x\),

\[ z_{i,j} = \frac{x_{i,j}}{\sqrt{1 - \sum_{j' < j}x_{i,j'}^2}}. \]

The mapping from the resulting \(z\) to \(y\) inverts \(\tanh\),

\[ y \ = \ \tanh^{-1} z \ = \ \frac{1}{2} \left( \log (1 + z) - \log (1 - z) \right). \]

Absolute Jacobian determinant of inverse transform

The Jacobian of the full transform is the product of the Jacobians of its component transforms.

First, for the inverse transform \(z = \tanh y\), the derivative is

\[ \frac{d}{dy} \tanh y = \frac{1}{(\cosh y)^2}. \]

Second, for the inverse transform of \(z\) to \(x\), the resulting Jacobian matrix \(J\) is of dimension \(\binom{K}{2} \times \binom{K}{2}\), with indexes \((i,j)\) for \((i > j)\). The Jacobian matrix is lower triangular, so that its determinant is the product of its diagonal entries, of which there is one for each \((i,j)\) pair,

\[ \left| \, \mbox{det} \, J \, \right| \ = \ \prod_{i > j} \left| \frac{d}{dz_{i,j}} x_{i,j} \right|, \]

where

\[ \frac{d}{dz_{i,j}} x_{i,j} = \sqrt{1 - \sum_{j' < j} x^2_{i,j'}}. \]

So the combined density for unconstrained \(y\) is

\[ p_Y(y) = p_X(f^{-1}(y)) \ \ \prod_{n=1}^{\binom{K}{2}} \frac{1}{(\cosh y_n)^2} \ \ \prod_{i > j} \left( 1 - \sum_{j' < j} x_{i,j'}^2 \right)^{1/2}, \]

where \(x = f^{-1}(y)\) is used for notational convenience. The log Jacobian determinant of the complete inverse transform \(x = f^{-1}(y)\) is given by

\[ \log \left| \, \det J \, \right| = -2 \sum_{n=1}^{\binom{K}{2}} \log \cosh y_n \ + \ \frac{1}{2} \ \sum_{i > j} \log \left( 1 - \sum_{j' < j} x_{i,j'}^2 \right) . \]


References

Egozcue, Juan José, Vera Pawlowsky-Glahn, Glòria Mateu-Figueras, and Carles Barcelo-Vidal. 2003. “Isometric Logratio Transformations for Compositional Data Analysis.” Mathematical Geology 35 (3): 279–300.
Filzmoser, Peter, Karel Hron, and Matthias Templ. 2018. Geometrical Properties of Compositional Data. Springer.
Lancaster, H. O. 1965. “The Helmert Matrices.” The American Mathematical Monthly 72 (1): 4–12. http://www.jstor.org/stable/2312989.
Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100: 1989–2001.
Muller, Mervin E. 1959. “A Note on a Method for Generating Points Uniformly on n-Dimensional Spheres.” Commun. ACM 2 (4): 19–20. https://doi.org/10.1145/377939.377946.
Seyboldt, Adrian. 2024. “Add ZeroSumNormal Distribution.” https://github.com/pyro-ppl/numpyro/pull/1751#issuecomment-1980569811.

Footnotes

  1. An alternative to the transform in this section, which can be coded directly in Stan, is to parameterize a covariance matrix as a scaled correlation matrix. An arbitrary \(K \times K\) covariance matrix \(\Sigma\) can be expressed in terms of a \(K\)-vector \(\sigma\) and correlation matrix \(\Omega\) as \[\Sigma = \mbox{diag}(\sigma) \times \Omega \times \mbox{diag}(\sigma),\] so that each entry is just a deviation-scaled correlation, \[\Sigma_{m,n} = \sigma_m \times \sigma_n \times \Omega_{m,n}.\]↩︎

  2. For convenience, a summation with no terms, such as \(\sum_{j' < 1} x_{i,j'}\), is defined to be 0. This implies \(x_{1,1} = 1\) and that \(x_{i,1} = z_{i,1}\) for \(i > 1\).↩︎