Tuesday, June 18, 2013

Comparing the performance of R and Rcpp for a simple epidemiological model

While R is great for statistics, like many dynamic languages, it's terribly slow in loops. This can make dynamical models, like those commonly used in epidemiology, rather slow. However, thanks to the wonderful Rcpp package, such simulations can be sped up considerably. Let's take the 'standard' susceptible-infected-recovered (SIR) model. A deterministic version of this model is as follows.

\[\begin{align} \frac{dS(t)}{dt} & = -\beta S(t) I(t) \cr \frac{dI(t)}{dt} & = \beta S(t) I(t)- \gamma I(t) \cr \frac{dR(t)}{dt} & = \gamma I(t) \end{align}\]
Let's consider a stochastic version of the SIR model.

\[ \begin{align} {\rm Transition} & \quad {\rm Rate} \cr S \rightarrow S-1,\; I \rightarrow I+1 & \quad \beta S(t) I(t) \cr I \rightarrow I-1,\; R \rightarrow R+1 & \quad \gamma I(t) \end{align}\]
We can simulate this model using the Gillespie method, also known as the stochastic simulation algorithm, or SSA. In plain R, the stochastic model is as follows.

sir <- function(beta, gamma, N, S0, I0, R0, tf) {
    time <- 0
    S <- S0
    I <- I0
    R <- R0
    ta <- numeric(0)
    Sa <- numeric(0)
    Ia <- numeric(0)
    Ra <- numeric(0)
    while (time < tf) {
        ta <- c(ta, time)
        Sa <- c(Sa, S)
        Ia <- c(Ia, I)
        Ra <- c(Ra, R)
        pf1 <- beta * S * I
        pf2 <- gamma * I
        pf <- pf1 + pf2
        dt <- rexp(1, rate = pf)
        time <- time + dt
        if (time > tf) {
            break
        }
        ru <- runif(1)
        if (ru < (pf1/pf)) {
            S <- S - 1
            I <- I + 1
        } else {
            I <- I - 1
            R <- R + 1
        }
        if (I == 0) {
            break
        }
    }
    results <- data.frame(time = ta, S = Sa, I = Ia, R = Ra)
    # print(head(results))
    return(results)
}

Here is the version written using Rcpp, inlining the C++ code.

library(Rcpp)
cppFunction('
  List sirc(double beta, double gamma, double N, double S0, double I0, double R0, double tf){
    double t = 0;
    double S = S0;
    double I = I0;
    double R = R0;
    std::vector<double> ta;
    std::vector<double> Sa;
    std::vector<double> Ia;
    std::vector<double> Ra;
    do{
      ta.push_back(t);
      Sa.push_back(S);
      Ia.push_back(I);
      Ra.push_back(R);
      double pf1 = beta*S*I;
      double pf2 = gamma*I;
      double pf = pf1+pf2;
      double dt = rexp(1,pf)[0];
      t += dt;
      double r = runif(1)[0];
      if(r<pf1/pf){
        S--;
        I++;
      }else{
        I--;
        R++;
      }
      if(I==0){break;}
    } while (t<=tf && (I>0));
  return List::create(_["time"] = ta, _["S"] = Sa, _["I"] = Ia, _["R"]=Ra);
  }'
)

Now we run both the vanilla R and Rcpp versions of the same model 100 times, with parameter values \( \beta \) = 0.1/10000 and \( \gamma \) = 0.05, initial conditions \( S(0) \) =9999, \( I(0) \)=1, \( R(0) \)=0, and a simulation time of 1000.

N <- 100  # Run 100 times
set.seed(4)
sirsim <- sum(replicate(N, system.time(x <- sir(0.1/10000, 0.05, 10000, 9999, 
    1, 0, 1000))["elapsed"]), trim = 0.05)
set.seed(4)
sircsim <- sum(replicate(N, system.time(x <- sirc(0.1/10000, 0.05, 10000, 9999, 
    1, 0, 1000))["elapsed"]), trim = 0.05)

As you can see, the Rcpp version is considerably faster.

paste("Plain R:", sirsim)

## [1] "Plain R: 114.178"

paste("Rcpp version:", sircsim)

## [1] "Rcpp version: 0.175000000000028"

paste("R took", sirsim/sircsim, "times longer")

## [1] "R took 652.445714285608 times longer"

No comments:

Post a Comment