Stan Math Library
5.0.0
Automatic Differentiation
Before reading this section it's a good idea to at least skim the getting started guide. Stan's univariate distribution functions must work with mixes of scalars and vectors. This requirement can make the code for the distributions look a bit daunting, but below we'll cover each of the general steps needed to add a new distribution.
We will use the normal distribution as an example, adding only the lpdf function. Note that for acceptance into Stan math a distribution must have its respective lpdf, lcdf, cdf, lccdf, and rng functions implemented. Though we will only write the lpdf below, all of the notes here apply to the other functions as well.
So for the normal distribution, the probability density function is
\[ \text{Normal}(y|\mu,\sigma)=\frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2} \]
and taking the log of the above gives the log probability density function
\[ \ln{\left(\text{Normal}(y|\mu,\sigma)\right)}=-\frac{1}{2} \left(\frac{y-\mu}{\sigma}\right)^2 - \ln{\left(\sigma\right)} - \frac{1}{2}\ln{\left(2\pi\right)} \]
Now we can directly plug this into Stan as a custom lpdf function.
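To sanity-check the math before templating anything, it also helps to have a plain double-only version of the log density in C++. A minimal sketch (the helper name `normal_lpdf_ref` is hypothetical, not part of Stan Math):

```cpp
#include <cassert>
#include <cmath>

// Double-only reference implementation of the normal log density:
// -0.5 * ((y - mu) / sigma)^2 - log(sigma) - 0.5 * log(2 * pi)
inline double normal_lpdf_ref(double y, double mu, double sigma) {
  const double pi = 3.14159265358979323846;
  const double z = (y - mu) / sigma;
  return -0.5 * z * z - std::log(sigma) - 0.5 * std::log(2.0 * pi);
}
```

A reference implementation like this is also what you will later reuse as `log_prob_function()` in the distribution tests.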
This is nice because now we can test this function against another implementation to verify its correctness. At this point it is a good idea to post on Discourse or file an issue to let folks know you would like to add this distribution.
For an efficient implementation of the distribution we want to calculate each of its partials with respect to the distribution's inputs. This is easy for the normal but can be rough for other distributions. In that case, matrixcalculus.org or Wolfram Alpha is your friend: there we can plug in the lpdf and get back each of the partials. Note that for univariate distributions Wolfram handles things a bit more simply, though for multivariate distributions you'll have to use matrixcalculus. For both you'll have a much nicer time if you plug in the logged version of the function. One other nice thing about the matrixcalculus site is that it can generate LaTeX, which is handy for documentation.
\begin{aligned} f = \text{ln}\left(\text{Normal}(y|\mu,\sigma)\right) &= -\frac{1}{2} \left(\frac{y-\mu}{\sigma}\right)^2 - \ln\left(\sigma\right) - \frac{1}{2}\ln{\left(2\pi\right)} \cr \frac{\partial f}{\partial y} &= -\frac{y-\mu}{\sigma^{2}} \cr \frac{\partial f}{\partial \mu} &= \frac{y-\mu}{\sigma^{2}} \cr \frac{\partial f}{\partial \sigma} &= -\frac{1}{\sigma} + \frac{(y-\mu)^{2}}{\sigma^{3}} \end{aligned}
It's a little early, but once we get the lpdf function working with the above, we will want to get out pen and paper to simplify and find common subexpressions we only need to calculate once. For instance, in the normal lpdf we can compute y - mu and 1 / sigma once and reuse them:
\begin{aligned} f(y|\mu,\sigma) = \ln\left(\text{Normal}(y|\mu,\sigma)\right) &= -\frac{1}{2} \left(\frac{y-\mu}{\sigma}\right)^2 - \ln{\left(\sigma\right)} - \frac{1}{2}\ln{\left(2\pi\right)} \cr \frac{\partial f}{\partial y} &= -t_3 \cr \frac{\partial f}{\partial \mu} &= t_3 \cr \frac{\partial f}{\partial \sigma} &= t_{2}^2 \cdot t_1 \cdot t_0 - t_0 \cr \text{where} \cr t_0 &= \frac{1}{\sigma} \cr t_1 &= t_{0}^2 \cr t_2 &= y - \mu \cr t_3 &= t_2 \cdot t_1 \end{aligned}
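As a sketch of how those shared subexpressions translate into code (plain doubles, with illustrative names; this is not the templated Stan Math version):

```cpp
#include <cassert>
#include <cmath>

struct NormalPartials {
  double d_y, d_mu, d_sigma;
};

// Partials of ln(Normal(y | mu, sigma)) built from the shared
// subexpressions t0 = 1/sigma, t1 = t0^2, t2 = y - mu, t3 = t2 * t1.
inline NormalPartials normal_lpdf_partials(double y, double mu, double sigma) {
  const double t0 = 1.0 / sigma;
  const double t1 = t0 * t0;
  const double t2 = y - mu;
  const double t3 = t2 * t1;
  return {-t3,                      // d/dy     = -(y - mu) / sigma^2
          t3,                       // d/dmu    =  (y - mu) / sigma^2
          t2 * t2 * t1 * t0 - t0};  // d/dsigma =  (y - mu)^2 / sigma^3 - 1 / sigma
}
```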
So now let's add the lpdf function in stan/math/prim/dist/new_normal_lpdf.hpp. First we'll go over what we have to do before doing any math, breaking down Stan's current normal_lpdf function, which you can find here.
Each of the input arguments represents an input to the Normal function we wrote out above. The template parameters for univariate distributions are very general: they must work for all of Stan's scalar types (double, var, and fvar<T>) while accepting mixtures of scalars and vectors. The return of the function is the joint log probability accumulated over all of the inputs, which is a scalar whose type is the least upper bound of all the parameters' scalar types. That's a lot of big words, but in essence it means that if one of the inputs is a var and the others are double, the return type needs to be a var. If the input signature contained fvar<var>, var, double, then the return type would be fvar<var>. See the Common pitfalls guide for an explanation of return_type_t.
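That "least upper bound" promotion can be pictured with a toy trait. The `var`/`fvar` stand-ins and `toy_return_type_t` below are ours for illustration only; Stan's real machinery is `return_type_t`:

```cpp
#include <type_traits>

// Toy stand-ins for Stan's autodiff scalar types (illustrative only).
struct var {};
template <typename T>
struct fvar {};

// Crude "least upper bound" ordering: double < var < fvar<T>.
template <typename T>
struct scalar_rank : std::integral_constant<int, 0> {};
template <>
struct scalar_rank<var> : std::integral_constant<int, 1> {};
template <typename T>
struct scalar_rank<fvar<T>> : std::integral_constant<int, 2> {};

// Toy return_type_t for two arguments: the higher-ranked scalar wins.
template <typename T1, typename T2>
using toy_return_type_t =
    std::conditional_t<(scalar_rank<T1>::value >= scalar_rank<T2>::value), T1, T2>;
```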
Notice the bool propto template parameter; it is used to decide whether the function needs to propagate constants to the joint log probability we'll be calculating.
At the start of the function we take each argument, deduce whether it is an unevaluated Eigen expression, extract its values, and convert them into Eigen::Array types. ref_type_t is the return type of to_ref(), which is explained in the getting started guide. ref_type_if_t<> will conditionally evaluate an Eigen expression if both the type passed is an Eigen expression and the compile-time conditional passed to it is true.
Then we need to check that all of the vector inputs' sizes match, and that each input satisfies the conditions of the distribution. For the normal distribution we need to check that y does not contain NaN values, mu is finite, and sigma is positive.
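A vector-only sketch of what such checks look like. The real code uses Stan's check_not_nan, check_finite, and check_positive, which also handle scalars and Eigen types and produce more informative error messages:

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>
#include <vector>

// Simplified stand-ins for Stan's argument checks.
inline void check_not_nan(const std::vector<double>& y) {
  for (double v : y)
    if (std::isnan(v)) throw std::domain_error("y contains NaN");
}

inline void check_finite(const std::vector<double>& mu) {
  for (double v : mu)
    if (!std::isfinite(v)) throw std::domain_error("mu is not finite");
}

inline void check_positive(const std::vector<double>& sigma) {
  for (double v : sigma)
    if (!(v > 0.0)) throw std::domain_error("sigma must be positive");
}
```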
The if statements here check whether propto=true and all of the inputs are constant (i.e. they are all of type double). If both conditions are met then there's no need to calculate the rest of the lpdf function and we can return zero.
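A toy version of the trait that drives this decision. Stan's real include_summand lives in the library's meta headers; the `var` stand-in here is illustrative only:

```cpp
#include <type_traits>

struct var {};  // toy autodiff type standing in for stan::math::var

// A term of the lpdf must be computed when propto is false, or when at
// least one of the term's argument types is not a plain arithmetic type.
template <bool Propto, typename... Ts>
struct include_summand
    : std::bool_constant<!Propto || !(std::is_arithmetic_v<Ts> && ...)> {};
```

With propto=true and all-double arguments the trait is false, which is exactly the "return zero early" case described above.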
Woof! That was a good bit of stuff just to get to the math, but here we are! Our goal is to calculate the partial adjoints using the derivatives above, but we only want to bother calculating adjoints of parameters which are not constant (double). There are some more technical bits to building the joint log probability, but those are all hidden away in the operands_and_partials class, so we won't cover them here. For now you can take the evaluated inputs and pass them to the operands_and_partials class.
This sets up each of the input operand's partials so that we only store and calculate the ones we need.
On a side note, it would be nice to have a helper function like make_ops_partials which would construct that class for us so the construction could be written more compactly. This would also let us clean up the aliases for the unevaluated Eigen expressions so we could use to_ref_if().
There are two ways of doing the math: one using a simple loop and another utilizing Eigen expressions. Below I will cover both, starting with the loop version.
The loop version requires one other piece of overhead to make sure that vectors and scalars can both be iterated over in the loop.
For vectors, scalar_seq_view simply holds a reference to the vector it's passed; calling its .val(i) method returns element i of the vector after calling value_of() on it, and the actual element can be accessed with operator[]. For scalars, scalar_seq_view's .val(i) and operator[] just return the scalar, no matter what index is passed.
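A cut-down illustration of that behavior. This is not the real scalar_seq_view, which also handles Eigen types and applies value_of() to autodiff scalars:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar case: every index maps to the same underlying value.
template <typename T>
class scalar_seq_view {
  T s_;

 public:
  explicit scalar_seq_view(T s) : s_(s) {}
  T val(std::size_t) const { return s_; }  // value_of(s) is s itself for doubles
  T operator[](std::size_t) const { return s_; }
  std::size_t size() const { return 1; }
};

// Vector case: holds a reference and indexes normally.
template <typename T>
class scalar_seq_view<std::vector<T>> {
  const std::vector<T>& v_;

 public:
  explicit scalar_seq_view(const std::vector<T>& v) : v_(v) {}
  T val(std::size_t i) const { return v_[i]; }
  const T& operator[](std::size_t i) const { return v_[i]; }
  std::size_t size() const { return v_.size(); }
};
```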
But with that, we can now get the maximum size of the input arguments and run a loop calculating the partials for each input argument's values.
The logp variable accumulates the log probability density function's value, where propto decides whether or not that value should have constants added or dropped.
The odd bits here are mostly the ifs that use include_summand<propto> and !is_constant_all<T_loc>. We only want to compute the partials and accumulate the constants if those values are not constant (double), so we guard them with if statements. Since each conditional is a type trait whose value is known at compile time, we pay nothing for the branches that are false: the compiler removes them during the dead code elimination phase of optimization.
We collect the partials for each of our inputs via their respective edge*_ members of the operands_and_partials class: the first argument has edge1_, the second edge2_, and so on. One important question to ask here is: what if the operand is a scalar? It seems odd that we can call partials_[n] when the operand can be either a vector or a scalar. Under the hood, operands_and_partials wraps the partials for Scalar types in what's called a broadcast_array, which has an overloaded operator[] for scalars that simply returns back the partial's scalar. Similarly, broadcast_array has an overloaded operator= which, when assigning a vector to the partial, sums the vector before assigning it.
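A miniature of that trick. The real broadcast_array is templated over the operand's type; this double-only version just shows the two overloads described above:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Double-only sketch of broadcast_array: one stored partial that can be
// indexed like a vector and assigned from a vector.
struct broadcast_array {
  double partial = 0.0;

  // Any index exposes the single underlying partial.
  double& operator[](std::size_t) { return partial; }

  // Assigning a vector sums it into the single partial.
  broadcast_array& operator=(const std::vector<double>& v) {
    partial = 0.0;
    for (double x : v) partial += x;
    return *this;
  }
};
```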
Finally, once the loop is finished we call ops_partials.build(), passing it the joint log probability value. For reverse mode this places a callback on the callback stack that takes each edge's partials_ and accumulates them into the operand's adjoint.
The for loop version is nice and simple, but there are a few things we can do better for performance. For instance, in the loop version we are constantly reading and writing memory in a bunch of different places. We could fix that by rewriting the above to use multiple loops, but unless we have separate loops that turn on and off for each combination of partials that needs to be calculated, we lose places where we can share calculations between partials.
For a more efficient version we can do the math for the partials with no loop, possibly sharing computation. The code below replaces the loop with Eigen expressions, using most of the same tricks we've used previously.
Some of the same tricks from the above sections are used here in clever ways. For instance, when calculating inv_sigma: if that expression is used for calculating multiple partials then we want to evaluate it once on that line and reuse the precomputed result multiple times in the following code. However, if it's only used once then we just want an expression that will later be evaluated at its final destination. The same goes for y_scaled_sq and scaled_diff.
One odd piece of code here is the line that decrements the joint log probability by the log(sigma) term.
If sigma_val is a scalar then we want to decrement the joint log probability by log(sigma_val) * N, but if it's a vector we want to decrement it by sum(log(sigma_val)). In Stan, passing a scalar to sum() is a no-op that just returns back the scalar, so we can use that to make the left-hand side of the multiply work with both vectors and scalars. We always multiply the scalar returned from sum() by N, which we don't want when sigma_val is a vector. So we then divide by the size() of sigma, which is 1 for a scalar and N for a vector.
It might be easier to see how this would look if Stan used C++17
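Here is a sketch of both variants, with std::vector standing in for Eigen types; the sum(), log_all(), and size_of() helper overloads are ours for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <type_traits>
#include <vector>

// Overloads that make scalars and vectors interchangeable.
inline double sum(double x) { return x; }  // sum() is a no-op on a scalar
inline double sum(const std::vector<double>& v) {
  double s = 0.0;
  for (double x : v) s += x;
  return s;
}
inline double log_all(double x) { return std::log(x); }
inline std::vector<double> log_all(const std::vector<double>& v) {
  std::vector<double> out;
  for (double x : v) out.push_back(std::log(x));
  return out;
}
inline std::size_t size_of(double) { return 1; }
inline std::size_t size_of(const std::vector<double>& v) { return v.size(); }

// The trick from the text: the * N is undone for vectors (size N) by
// dividing by size_of(sigma_val), which is 1 for scalars.
template <typename T>
double log_sigma_term(const T& sigma_val, std::size_t N) {
  return sum(log_all(sigma_val)) * N / size_of(sigma_val);
}

// The same logic written with C++17's if constexpr:
template <typename T>
double log_sigma_term17(const T& sigma_val, std::size_t N) {
  if constexpr (std::is_arithmetic_v<T>) {
    return std::log(sigma_val) * N;  // scalar: shared by all N terms
  } else {
    return sum(log_all(sigma_val));  // vector: one term per element
  }
}
```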
Side note: we should make size() constexpr for scalars; then tricks like this would let the compiler see we are dividing by 1 and remove the operation.
But that's it! You can see the full normal_lpdf function, which uses the Eigen version, here in Stan.
One other little piece you'll want to add is a normal_lpdf function with the exact same signature but without the propto template parameter. Unless told otherwise we assume users want the full log density, constants included, so we provide a default signature that does not require setting the propto parameter.
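A toy of that forwarding pattern (my_lpdf is a made-up name; the real code forwards normal_lpdf(y, mu, sigma) to the propto = false instantiation):

```cpp
#include <cassert>
#include <cmath>

// Templated version: propto = true drops terms that are constant
// with respect to the parameters.
template <bool propto>
double my_lpdf(double y) {
  double logp = -0.5 * y * y;
  if (!propto) logp -= 0.5 * std::log(2.0 * 3.14159265358979323846);
  return logp;
}

// Default signature without propto: keeps the constants, i.e. propto = false.
inline double my_lpdf(double y) { return my_lpdf<false>(y); }
```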
For testing, you'll be adding your new distribution to the Distribution Testing Framework. In test/prob you will add a new folder, named after your distribution, containing the .hpp files that generate the tests. Looking at the normal distribution's testing files you'll see you need:
mydist_test.hpp, containing a class inheriting from AgradDistributionTest with the following methods:
- valid_values(), which fills the [in/out] params argument with valid testing values for your distribution and the [in/out] log_prob argument with the log probability given those parameters.
- invalid_values(), which fills the [in/out] value argument with testing values that should fail.
- two log_prob() methods which call your distribution, one with propto and one without.
- log_prob_function(), which is a simple implementation of the log probability density function, much like the one you wrote in Stan.

The other test files are quite similar, though note that the normal cdf test uses some extra numeric tricks; the tests for gumbel_lcdf, for example, are very similar to the normal_lpdf test.
Once you've added those files you can run the tests for your distribution with ./runTests.py, pointing it at the folder containing your tests. This will generate all of the tests for each of Stan's scalar types and mixtures of scalars and vectors.
For more details see the distribution testing docs.
The above covered the lpdf function, and the same patterns apply to the lcdf, cdf, lccdf, and rng functions needed to add a new distribution to Stan. If you have further questions please reach out on our Discourse or file an issue!