Friday, June 28, 2013

Stochastic models and matrix exponentials in R and Julia

Continuous time Markov chain (CTMC) models crop up a lot in my work, for example, in terms of evolution of genetic sequences, as well as in transmission of infectious diseases. For those of you interested in the following, see the papers by Keeling and Ross here and here.

Let's say we are interested in the probability distribution of a CTMC model at a given time. First, define a rate matrix, \( Q \), with entries \( q(i,j) \), representing the rate of transition from state \( i \) to state \( j \), for \( j \neq i \), with the rates along the diagonal \( q(i,i)=-q(i) \), where \( q(i)=\sum_{j \neq i} q(i,j) \). If \( p \) is the probability of finding the system in a particular state at a given time, then \( p \) evolves through time according to the following differential equation.

\[ \begin{align} \frac{dp}{dt} & = Qp \end{align} \]

We can simulate this system numerically using standard differential equation solvers. Alternatively, we can solve the system directly:

\[ \begin{align} p(t) & = e^{Qt}p(0) \end{align} \]

As an example, let's consider a simple susceptible-infected-susceptible (SIS) model in a fixed population of size \( N \). If we were to model this deterministically, we might model the dynamics of the number of infected individuals as follows.

\[ \begin{align} \frac{dI}{dt} & = \beta S I - \gamma I \end{align} \]

The stochastic analogue of this model is as follows.

\[ \begin{align} {\rm Transition} & \quad {\rm Rate} \cr I \rightarrow I+1 & \quad \beta S(t) I(t) \cr I \rightarrow I-1 & \quad \gamma I(t) \end{align} \]

For this system \( p \) is a vector of \( N+1 \) probabilities of there being \( I=0,1,\dots,N \) infected individuals, and the rate matrix \( Q \) is as follows.

\[ \begin{align} Q=\left(\begin{array}{cccccc} 0 & \gamma & 0 & 0 & 0 & \cdots \cr 0 & -(\beta(N-1)\times 1 +\gamma) & 2\gamma & 0 & 0 & \cdots \cr 0 & \beta(N-1)\times 1 & -(\beta(N-2)\times 2+2\gamma) & 3\gamma & 0 & \cdots \cr \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{array} \right) \end{align} \]

In the above, \( \beta \) is an infectivity parameter, and \( \gamma \) is the recovery rate.

Here's how we can solve this model in R, using the matrix exponential functions from the expm package. I use the bandSparse function from the Matrix library in order to construct the triadiagonal matrix \( Q \).

# Load libraries
library(Matrix)
## Loading required package: lattice
library(expm)
## Attaching package: 'expm'
## The following object is masked from 'package:Matrix':
## 
## expm

# Define (sup/super) diagonals
diagFun <- function(b, g, N) {
    n <- seq(0, N)
    -(b * (N - n) * n + g * n)
}

lDiagFun <- function(b, g, N) {
    n <- seq(1, N)
    b * (N - n + 1) * (n - 1)
}

uDiagFun <- function(g, N) {
    n <- seq(1, N)
    g * n
}

# Generate marux from duagonals
Qmatrix <- function(b, g, N, sparse = FALSE) {
    d <- diagFun(b, g, N)
    l <- lDiagFun(b, g, N)
    u <- uDiagFun(g, N)
    Q <- bandSparse(n = N + 1, m = N + 1, k = c(-1, 0, 1), diagonals = list(l, 
        d, u))
    # as.matrix(Q)
    if (sparse) {
        return(Q)
    } else {
        return(as.matrix(Q))
    }
}

# Sp;ve
sisexpm <- function(Q, p, tm, ...) {
    expm(Q * tm, ...) %*% p
}

Here's a specific example for \( \beta=0.002 \), \( \gamma=1 \) and \( N=1000 \). For illustrative purposes, I'm going to start \( I(0)=500 \) (i.e. at the deterministic equilibrium) to avoid having stochastic fadeouts, and solve for \( t=100 \). An alternative strategy would be to consider a model with a 'background' infection rate, or by considering a reduced matrix, \( Q_0 \), in which the first row and column have been removed, as in Keeling and Ross.

Qmat <- Qmatrix(2/1000, 1, 1000)
pinit <- rep(0, 1001)
pinit[501] <- 1  # Note that this vector refers to 0..N
print(system.time(result <- sisexpm(Qmat, pinit, 100, method = "R_Pade"))["elapsed"])
## elapsed 
##   36.71

This code plots out the density.

library(ggplot2)
df <- data.frame(S = seq(0, 1000), P = result)
ggplot(data = df, aes(x = S, y = P)) + geom_line()

plot of chunk unnamed-chunk-3

We can compare this with output from Gillespie's stochastic simulation algorithm. Here is the code in plain R.

sis <- function(b, g, N, I0, tm) {
    time <- 0
    I <- I0
    while (time < tm) {
        pf1 <- b * (N - I) * I
        pf2 <- g * I
        pf <- pf1 + pf2
        dt <- rexp(1, rate = pf)
        time <- time + dt
        if (time > tm) {
            break
        }
        ru <- runif(1)
        if (ru < (pf1/pf)) {
            I <- I + 1
        } else {
            I <- I - 1
        }
        if (I == 0) {
            break
        }
    }
    return(I)
}

However, the simulation in R is rather slow, so as before, I'll recode in C++ using Rcpp.

library(Rcpp)
cppFunction('
  double sisc(double b, double g, double N, double I0, double tm){
    double t = 0;
    double I = I0;
    do{
      double pf1 = b*(N-I)*I;
      double pf2 = g*I;
      double pf = pf1+pf2;
      double dt = rexp(1,pf)[0];
      t += dt;
      double r = runif(1)[0];
      if(r<pf1/pf){
        I++;
      }else{
        I--;
      }
      if(I==0){break;}
    } while (t<=tm && (I>0));
  return (I);
  }'
)

For those who are interested, here's a comparison of the speed of the two implementations.

nsims <- 100  # Run 100 times
set.seed(1)
sistime <- sum(replicate(nsims, system.time(x <- sis(2/1000, 1, 1000, 500, 100))["elapsed"]), 
    trim = 0.05)
library(compiler)
sisjit <- cmpfun(sis)
set.seed(1)
sisjittime <- sum(replicate(nsims, system.time(x <- sisjit(2/1000, 1, 1000, 
    500, 100))["elapsed"]), trim = 0.05)
set.seed(1)
sisctime <- sum(replicate(nsims, system.time(x <- sisc(2/1000, 1, 1000, 500, 
    100))["elapsed"]), trim = 0.05)
paste("Plain R:", sistime)
## [1] "Plain R: 176.55"
paste("R JIT:", sisjittime)
## [1] "R JIT: 105.913"
paste("Rcpp version:", sisctime)
## [1] "Rcpp version: 4.38999999999997"

Let's compare the simulations with the numerical solution.

nsims <- 10000
fI <- replicate(nsims, sisc(2/1000, 1, 1000, 500, 100))
qplot(fI, geom = "blank") + geom_histogram(aes(y = ..density..), binwidth = 1, 
    color = "blue") + geom_line(data = df, aes(x = S, y = P)) + scale_x_continuous(limits = c(400, 
    600)) + xlab("I")
## Warning: Removed 800 rows containing missing values (geom_path).

plot of chunk unnamed-chunk-7

Now the same, but in Julia.

function diagFun(b,g,N)
    [-(b*(N-n)*n+g*n) for n in 0:N]
end

function lDiagFun(b,g,N)
    [b*(N-n+1)*(n-1) for n in 1:N]
end

function uDiagFun(g,N)
    [g*n for n in 1:N]
end

function Qmatrix(b,g,N)
    d = diagFun(b,g,N)
    l = lDiagFun(b,g,N)
    u = uDiagFun(g,N)
    Q = zeros(N+1,N+1)
    Q[diagind(Q)] = d
    Q[diagind(Q,1)] = u
    Q[diagind(Q,-1)] = l
    Q
end

function sisexpm(Q,p,tm)
    expm(Q*tm)*p
end

Qmat = Qmatrix(2./1000,1,1000)
pinit = zeros(1001)
pinit[2] = 1

@elapsed result = sisexpm(Qmat,pinit,1000)

On my computers, Julia runs much faster (about 10 times).

It would be fun (well, if you like that sort of thing) to try some of the approaches described here.


No comments:

Post a Comment