Friday, June 28, 2013

Stochastic models and matrix exponentials in R and Julia

Continuous-time Markov chain (CTMC) models crop up a lot in my work, for example in the evolution of genetic sequences and in the transmission of infectious diseases. For those interested in the background to the following, see the papers by Keeling and Ross here and here.

Let's say we are interested in the probability distribution of a CTMC model at a given time. First, define a rate matrix, \( Q \), with off-diagonal entries \( q(i,j) \) representing the rate of transition from state \( j \) to state \( i \), for \( i \neq j \) (so each column corresponds to a source state), and diagonal entries \( q(j,j)=-\sum_{i \neq j} q(i,j) \), so that each column sums to zero. If \( p \) is the (column) vector of probabilities of finding the system in each state at a given time, then \( p \) evolves through time according to the following differential equation.

\[ \begin{align} \frac{dp}{dt} & = Qp \end{align} \]

We can simulate this system numerically using standard differential equation solvers. Alternatively, we can solve the system directly:

\[ \begin{align} p(t) & = e^{Qt}p(0) \end{align} \]
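
To make the two routes concrete, here is a minimal sketch on a hypothetical two-state chain (the rates are arbitrary), integrating \( dp/dt=Qp \) numerically with the deSolve package and comparing the result with the matrix exponential from the expm package; as above, columns of \( Q \) correspond to source states.

library(deSolve)
library(expm)
Qtoy <- matrix(c(-1, 1,
                  2, -2), nrow = 2)  # columns correspond to the source state
p0 <- c(1, 0)                        # start in state 1 with probability 1
# Route 1: numerical integration of dp/dt = Q p
sol <- ode(y = p0, times = c(0, 1), parms = Qtoy,
           func = function(t, p, Q) list(as.vector(Q %*% p)))
sol[2, -1]
# Route 2: direct solution p(t) = expm(Q t) p(0)
expm(Qtoy * 1) %*% p0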

As an example, let's consider a simple susceptible-infected-susceptible (SIS) model in a fixed population of size \( N \). If we were to model this deterministically, we might model the dynamics of the number of infected individuals as follows.

\[ \begin{align} \frac{dI}{dt} & = \beta S I - \gamma I \end{align} \]

The stochastic analogue of this model is as follows.

\[ \begin{align} {\rm Transition} & \quad {\rm Rate} \cr I \rightarrow I+1 & \quad \beta S(t) I(t) \cr I \rightarrow I-1 & \quad \gamma I(t) \end{align} \]

For this system \( p \) is a vector of \( N+1 \) probabilities of there being \( I=0,1,\dots,N \) infected individuals, and the rate matrix \( Q \) is as follows.

\[ \begin{align} Q=\left(\begin{array}{cccccc} 0 & \gamma & 0 & 0 & 0 & \cdots \cr 0 & -(\beta(N-1)\times 1 +\gamma) & 2\gamma & 0 & 0 & \cdots \cr 0 & \beta(N-1)\times 1 & -(\beta(N-2)\times 2+2\gamma) & 3\gamma & 0 & \cdots \cr \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{array} \right) \end{align} \]

In the above, \( \beta \) is an infectivity parameter, and \( \gamma \) is the recovery rate.

Here's how we can solve this model in R, using the matrix exponential functions from the expm package. I use the bandSparse function from the Matrix package to construct the tridiagonal matrix \( Q \).

# Load libraries
library(Matrix)
## Loading required package: lattice
library(expm)
## Attaching package: 'expm'
## The following object is masked from 'package:Matrix':
## 
## expm

# Define the main, sub- and super-diagonals of Q
diagFun <- function(b, g, N) {
    n <- seq(0, N)
    -(b * (N - n) * n + g * n)
}

lDiagFun <- function(b, g, N) {
    n <- seq(1, N)
    b * (N - n + 1) * (n - 1)
}

uDiagFun <- function(g, N) {
    n <- seq(1, N)
    g * n
}

# Generate matrix from diagonals
Qmatrix <- function(b, g, N, sparse = FALSE) {
    d <- diagFun(b, g, N)
    l <- lDiagFun(b, g, N)
    u <- uDiagFun(g, N)
    Q <- bandSparse(n = N + 1, m = N + 1, k = c(-1, 0, 1), diagonals = list(l, 
        d, u))
    # as.matrix(Q)
    if (sparse) {
        return(Q)
    } else {
        return(as.matrix(Q))
    }
}

# Solve p(t) = expm(Q t) p(0)
sisexpm <- function(Q, p, tm, ...) {
    expm(Q * tm, ...) %*% p
}
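
Before solving, a quick sanity check on the generator, using a tiny population (N = 3, chosen arbitrarily for illustration): each column of \( Q \) should sum to zero, since probability is conserved.

Qsmall <- Qmatrix(2/1000, 1, 3)
Qsmall
colSums(Qsmall)  # each column should sum to zero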

Here's a specific example for \( \beta=0.002 \), \( \gamma=1 \) and \( N=1000 \). For illustrative purposes, I'm going to start at \( I(0)=500 \), i.e. at the deterministic equilibrium \( I^{\ast}=N-\gamma/\beta=500 \), to avoid stochastic fadeouts, and solve for \( t=100 \). An alternative strategy would be to consider a model with a 'background' infection rate, or to work with a reduced matrix, \( Q_0 \), in which the first row and column have been removed, as in Keeling and Ross (a minimal sketch of this appears after the plot below).

Qmat <- Qmatrix(2/1000, 1, 1000)
pinit <- rep(0, 1001)
pinit[501] <- 1  # Note that this vector refers to 0..N
print(system.time(result <- sisexpm(Qmat, pinit, 100, method = "R_Pade"))["elapsed"])
## elapsed 
##   36.71

This code plots out the resulting probability distribution.

library(ggplot2)
df <- data.frame(I = seq(0, 1000), P = result)
ggplot(data = df, aes(x = I, y = P)) + geom_line()

[Figure: probability distribution of the number of infected individuals at t = 100]
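
As mentioned above, an alternative to keeping the absorbing state \( I=0 \) in \( Q \) is the reduced-matrix approach of Keeling and Ross. Here is a hedged sketch, reusing Qmat from above: the first row and column are dropped, and the result is renormalised so that it is conditional on the epidemic not having died out by time \( t \).

# Sketch of the reduced-matrix (Q_0) approach; see Keeling and Ross for details
Q0 <- Qmat[-1, -1]            # drop the row and column for the absorbing state I = 0
pinit0 <- rep(0, 1000)
pinit0[500] <- 1              # I(0) = 500 in the reduced 1..N indexing
pcond <- expm(Q0 * 100, method = "R_Pade") %*% pinit0
pcond <- pcond / sum(pcond)   # renormalise: condition on non-extinction at t = 100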

We can compare this with output from Gillespie's stochastic simulation algorithm. Here is the code in plain R.

sis <- function(b, g, N, I0, tm) {
    time <- 0
    I <- I0
    while (time < tm) {
        pf1 <- b * (N - I) * I  # rate of infection
        pf2 <- g * I  # rate of recovery
        pf <- pf1 + pf2
        dt <- rexp(1, rate = pf)  # time to next event
        time <- time + dt
        if (time > tm) {
            break
        }
        ru <- runif(1)
        if (ru < (pf1/pf)) {
            I <- I + 1
        } else {
            I <- I - 1
        }
        if (I == 0) {
            break
        }
    }
    return(I)
}
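
A single realisation as a quick check, using the same parameter values as the matrix exponential example above (the seed is arbitrary).

set.seed(123)
sis(2/1000, 1, 1000, 500, 100)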

However, the simulation in R is rather slow, so, as before, I'll recode it in C++ using Rcpp.

library(Rcpp)
cppFunction('
  double sisc(double b, double g, double N, double I0, double tm){
    double t = 0;
    double I = I0;
    do{
      double pf1 = b*(N-I)*I;   // rate of infection
      double pf2 = g*I;         // rate of recovery
      double pf = pf1+pf2;
      double dt = rexp(1,pf)[0];
      t += dt;
      if(t>tm){break;}          // as in the R version, do not apply events past the end time
      double r = runif(1)[0];
      if(r<pf1/pf){
        I++;
      }else{
        I--;
      }
      if(I==0){break;}
    } while (t<=tm && I>0);
    return (I);
  }'
)

For those who are interested, here's a comparison of the speed of three implementations: plain R, byte-compiled R (via cmpfun from the compiler package), and the Rcpp version.

nsims <- 100  # Run 100 times
set.seed(1)
sistime <- sum(replicate(nsims, system.time(x <- sis(2/1000, 1, 1000, 500, 100))["elapsed"]))
library(compiler)
sisjit <- cmpfun(sis)
set.seed(1)
sisjittime <- sum(replicate(nsims, system.time(x <- sisjit(2/1000, 1, 1000, 
    500, 100))["elapsed"]))
set.seed(1)
sisctime <- sum(replicate(nsims, system.time(x <- sisc(2/1000, 1, 1000, 500, 
    100))["elapsed"]))
paste("Plain R:", sistime)
## [1] "Plain R: 176.55"
paste("R JIT:", sisjittime)
## [1] "R JIT: 105.913"
paste("Rcpp version:", sisctime)
## [1] "Rcpp version: 4.38999999999997"

Let's compare the simulations with the numerical solution.

nsims <- 10000
fI <- replicate(nsims, sisc(2/1000, 1, 1000, 500, 100))
qplot(fI, geom = "blank") + geom_histogram(aes(y = ..density..), binwidth = 1, 
    color = "blue") + geom_line(data = df, aes(x = I, y = P)) + scale_x_continuous(limits = c(400, 
    600)) + xlab("I")
## Warning: Removed 800 rows containing missing values (geom_path).

[Figure: histogram of simulated numbers of infected individuals at t = 100, with the matrix exponential solution overlaid]

Now the same, but in Julia.

function diagFun(b,g,N)
    [-(b*(N-n)*n+g*n) for n in 0:N]
end

function lDiagFun(b,g,N)
    [b*(N-n+1)*(n-1) for n in 1:N]
end

function uDiagFun(g,N)
    [g*n for n in 1:N]
end

function Qmatrix(b,g,N)
    d = diagFun(b,g,N)
    l = lDiagFun(b,g,N)
    u = uDiagFun(g,N)
    Q = zeros(N+1,N+1)
    Q[diagind(Q)] = d
    Q[diagind(Q,1)] = u
    Q[diagind(Q,-1)] = l
    Q
end

function sisexpm(Q,p,tm)
    expm(Q*tm)*p
end

Qmat = Qmatrix(2./1000,1,1000)
pinit = zeros(1001)
pinit[501] = 1    # I(0) = 500; note that Julia vectors are 1-indexed, so index 501 corresponds to I = 500

@elapsed result = sisexpm(Qmat,pinit,100)

On my computers, Julia runs much faster (about 10 times).

It would be fun (well, if you like that sort of thing) to try some of the approaches described here.


Tuesday, June 25, 2013

A simple phylogeny workflow in R

I often have to do routine phylogenetic analyses, in which one first finds a 'good' model of substitution using some sort of model-testing framework and then reconstructs a tree using maximum likelihood. This can all be done in R with the ape and phangorn libraries. Here, I'm going to reconstruct a phylogeny from 71 hepatitis C sequences of the core/E1 region from infected individuals in Egypt, studied by Ray et al. 2000. You can download the sequences here.

First of all, we need to load the libraries.

library(ape)
library(phangorn)

Next, we set the filename of a multiple sequence alignment in FASTA format and read in the alignment.

myalignment.filename <- "ray2000.fas"
myalignment <- read.dna(myalignment.filename, format = "fasta", as.matrix = TRUE)

Use of the as.matrix=TRUE option means that an error will be thrown if sequences are of unequal length (which means they may not be aligned), and also allows an image of the alignment to be displayed.

image(myalignment, show.labels = FALSE)

[Figure: image of the sequence alignment]

In order to obtain a good-fitting model, I need a rough tree. I use neighbour joining (NJ), assuming a TN93 distance model (just because I like it). NJ takes a distance matrix as input, so I generate this matrix first.

myalignment.dist.tn93 <- dist.dna(myalignment, model = "TN93", as.matrix = TRUE)

It's always sensible to plot out the resulting distances.

hist(myalignment.dist.tn93[lower.tri(myalignment.dist.tn93)], xlab = "Distance", 
    ylab = "Frequency", main = "", col = "grey")

[Figure: histogram of pairwise TN93 distances]

Neighbour joining is but a single command away.

myalignment.tn93.nj <- nj(myalignment.dist.tn93)

Here is a plot of the resulting tree.

plot(myalignment.tn93.nj, "unrooted", cex = 0.5, show.tip.label = FALSE)

[Figure: unrooted neighbour-joining tree]

In order to fit models using maximum likelihood in phangorn, I have to change the format of the data from ape's DNAbin format, to the phyDat structure used in phangorn. I can then run a selection of (nucleotide) substitution models.

myalignment.phydat <- as.phyDat(myalignment)  # Convert to the phyDat format that modelTest understands
myalignment.modeltest <- modelTest(myalignment.phydat, tree = myalignment.tn93.nj, 
    model = c("JC", "F81", "K80", "HKY", "SYM", "GTR"), G = TRUE, I = TRUE, 
    k = 4, control = pml.control(epsilon = 1e-08, maxit = 3, trace = 1), multicore = FALSE)
## Warning: negative edges length changed to 0!
myalignment.modeltest
##      Model  df logLik   AIC   BIC
## 1       JC 139  -9094 18465 19024
## 2     JC+I 140  -8618 17516 18079
## 3     JC+G 140  -8368 17017 17579
## 4   JC+G+I 141  -8352 16985 17552
## 5      F81 142  -9006 18296 18866
## 6    F81+I 143  -8525 17337 17911
## 7    F81+G 143  -8252 16790 17364
## 8  F81+G+I 144  -8239 16766 17344
## 9      K80 140  -8483 17246 17809
## 10   K80+I 141  -7980 16243 16810
## 11   K80+G 141  -7686 15653 16220
## 12 K80+G+I 142  -7668 15619 16190
## 13     HKY 143  -8441 17167 17742
## 14   HKY+I 144  -7916 16120 16699
## 15   HKY+G 144  -7636 15561 16139
## 16 HKY+G+I 145  -7617 15524 16106
## 17     SYM 144  -8445 17179 17758
## 18   SYM+I 145  -7943 16177 16759
## 19   SYM+G 145  -7650 15589 16172
## 20 SYM+G+I 146  -7632 15555 16142
## 21     GTR 147  -8436 17166 17757
## 22   GTR+I 148  -7903 16103 16698
## 23   GTR+G 148  -7622 15541 16136
## 24 GTR+G+I 149  -7602 15502 16100

It can be difficult to spot the 'best' model, so I write a function that pulls out the row of the dataframe with the lowest AIC or BIC.

findBestModel <- function(mt, ic = "AIC") {
    if (ic == "AIC") {
        minic <- min(mt$AIC)
        bestmodel <- mt[mt$AIC == minic, ]
    } else if (ic == "BIC") {
        minic <- min(mt$BIC)
        bestmodel <- mt[mt$BIC == minic, ]
    }
    bestmodel
}
myalignment.bestmodel <- findBestModel(myalignment.modeltest)
myalignment.bestmodel
##      Model  df logLik   AIC   BIC
## 24 GTR+G+I 149  -7602 15502 16100

As a first step in getting a maximum likelihood tree, I generate a phylogenetic maximum likelihood fit using pml. Note that this doesn't optimise the parameters; to do that, I have to run optim.pml. In this example, I do this in several stages, to show how different aspects of the fit (the substitution model, the branch lengths, and the topology) can be optimised separately from one another.

myalignment.mltree <- pml(myalignment.tn93.nj, myalignment.phydat, model = as.character(myalignment.bestmodel$Model), 
    k = 4)
## Warning: negative edges length changed to 0!
# optimise model parameters without fitting edges
myalignment.mltree <- optim.pml(myalignment.mltree, optNni = FALSE, optBf = TRUE, 
    optQ = TRUE, optInv = TRUE, optGamma = TRUE, optEdge = FALSE, optRate = TRUE, 
    control = pml.control(epsilon = 1e-08, maxit = 10, trace = 0))
# Now fix substitution model and fit tree
myalignment.mltree <- optim.pml(myalignment.mltree, optNni = TRUE, optBf = FALSE, 
    optQ = FALSE, optInv = FALSE, optGamma = FALSE, optEdge = TRUE, control = pml.control(epsilon = 1e-08, 
        maxit = 10, trace = 0))
# Now fine tune
myalignment.mltree <- optim.pml(myalignment.mltree, optNni = TRUE, optBf = TRUE, 
    optQ = TRUE, optInv = TRUE, optGamma = TRUE, optEdge = TRUE, optRate = FALSE, 
    control = pml.control(epsilon = 1e-08, maxit = 10, trace = 0))

Finally, I display the tree and save it as a Newick-formatted string to a file.

myalignment.mltree
## 
##  loglikelihood: -7529 
## 
## unconstrained loglikelihood: -1991 
## Proportion of invariant sites: 0.2699 
## Discrete gamma model
## Number of rate categories: 4 
## Shape parameter: 0.9088 
## 
## Rate matrix:
##       a       c      g      t
## a 0.000  1.3620 8.0664  1.791
## c 1.362  0.0000 0.7302 11.374
## g 8.066  0.7302 0.0000  1.000
## t 1.791 11.3741 1.0000  0.000
## 
## Base frequencies:  
## 0.1749 0.3425 0.2533 0.2293
myalignment.mltree$tree
## 
## Phylogenetic tree with 71 tips and 69 internal nodes.
## 
## Tip labels:
##  AF271819, AF271820, AF271821, AF271823, AF271824, AF271822, ...
## 
## Unrooted; includes branch lengths.
plot(myalignment.mltree$tree, "unrooted", show.tip.label = FALSE)

[Figure: unrooted maximum likelihood tree]

write.tree(myalignment.mltree$tree, "mytree.tre")
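
If you later want to read the saved tree back into R and, say, midpoint-root it before plotting, a minimal sketch (using read.tree from ape and midpoint from phangorn) might look like this.

mytree <- read.tree("mytree.tre")   # read the Newick file back in
mytree.rooted <- midpoint(mytree)   # midpoint rooting from phangorn
plot(mytree.rooted, show.tip.label = FALSE)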

An R markdown file with the workflow can be found here.


Wednesday, June 19, 2013

An SIR model in Julia

I've been playing around a bit with a relatively new language, Julia, which is dynamic, like R and Python, but purports to have performance not dissimilar to C (see here for my setup). My initial attempts at coding up a simple SIR-type epidemiological model, like this one, were disappointing, which I subsequently discovered was due to how I was storing the results. Here is my second try; you'll need an up-to-date build of Julia from the git repository, especially in order to get plotting working.

using DataFrames
using Distributions

function sir(beta,gamma,N,S0,I0,R0,tf)
    t = 0
    S = S0
    I = I0
    R = R0
    ta=DataFrames.DataArray(Float64,0)
    Sa=DataFrames.DataArray(Float64,0)
    Ia=DataFrames.DataArray(Float64,0)
    Ra=DataFrames.DataArray(Float64,0)
    while t < tf
        push!(ta,t)
        push!(Sa,S)
        push!(Ia,I)
        push!(Ra,R)
        pf1 = beta*S*I
        pf2 = gamma*I
        pf = pf1+pf2
        dt = rand(Distributions.Exponential(1/pf))
        t = t+dt
        if t>tf
            break
        end
        ru = rand()
        if ru<(pf1/pf)
            S=S-1
            I=I+1
        else
            I=I-1
            R=R+1
        end
    end
    results = DataFrames.DataFrame()
    results["t"] = ta
    results["S"] = Sa
    results["I"] = Ia
    results["R"] = Ra
    return(results)
end

We can now run the model as follows, with parameter values \( \beta=0.1/10000 \) and \( \gamma=0.05 \), initial conditions \( S(0)=9999 \), \( I(0)=1 \), \( R(0)=0 \), and a simulation time of 1000.

s=sir(0.1/10000,0.05,10000,9999,1,0,1000)

We can plot using Winston, and save to a PNG file.

using Winston
wp = plot(s[:,"t"],s[:,"I"])
setattr(wp,"xlabel","t")
setattr(wp,"ylabel","I")
file(wp,"sir_winston.png")
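
For comparison with the R code in the SIS post above, here is a hedged sketch of an equivalent plain-R version of this SIR simulation, storing the whole trajectory in a data frame; it is written for clarity rather than speed (the vectors grow inside the loop), and the function name sirR is just for illustration.

sirR <- function(beta, gamma, N, S0, I0, R0, tf) {
    t <- 0; S <- S0; I <- I0; R <- R0
    ta <- t; Sa <- S; Ia <- I; Ra <- R
    while (t < tf && I > 0) {
        pf1 <- beta * S * I          # rate of infection
        pf2 <- gamma * I             # rate of recovery
        pf <- pf1 + pf2
        t <- t + rexp(1, rate = pf)  # time to the next event
        if (t > tf) break
        if (runif(1) < pf1/pf) {
            S <- S - 1; I <- I + 1   # infection
        } else {
            I <- I - 1; R <- R + 1   # recovery
        }
        ta <- c(ta, t); Sa <- c(Sa, S); Ia <- c(Ia, I); Ra <- c(Ra, R)
    }
    data.frame(t = ta, S = Sa, I = Ia, R = Ra)
}
s <- sirR(0.1/10000, 0.05, 10000, 9999, 1, 0, 1000)
library(ggplot2)
ggplot(s, aes(x = t, y = I)) + geom_line()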