Sunday, July 21, 2013

Individual-based modeling in Julia

SimJulia is a package for process-based simulations, similar to SimPy in Python, which I contributed to a little to many moons ago. Over the years, I've implemented simulation models of disease transmission to help me learn new computer languages (see my efforts in Python, C, and Eiffel on my Google Code site here). Here's my first try at a process-based version of a stochastic SIR type epidemiological model.

module SIRSim

using SimJulia
using Distributions
using DataFrames

export
    SIRPerson,
    SIRModel,
    activate,
    run,
    out

type SIRPerson
    state::Char
end

function increment(a::Array{Int64})
    push!(a,a[length(a)]+1)
end

function decrement(a::Array{Int64})
    push!(a,a[length(a)]-1)
end

function carryover(a::Array{Int64})
    push!(a,a[length(a)])
end

type SIRModel
    sim::Simulation
    parray::Array{Process}
    beta::Float64
    c::Float64
    gamma::Float64
    ta::Array{Float64}
    Sa::Array{Int64}
    Ia::Array{Int64}
    Ra::Array{Int64}
    allIndividuals::Array{SIRPerson}
end

function SIRModel(heapsize::Int64,beta::Float64,c::Float64,gamma::Float64,S::Int64,I::Int64,R::Int64)
        sim = Simulation(uint(heapsize))
        N=S+I+R
        parray = [Process(sim,string(i)) for i in 1:N]
        states = [fill!(Array(Char,S),'S'),
                  fill!(Array(Char,I),'I'),
                  fill!(Array(Char,R),'R')]
        allIndividuals=[SIRPerson(state) for state in states]
        ta=Array(Float64,0)
        push!(ta,0.0)
        Sa=Array(Int64,0)
        push!(Sa,S)
        Ia=Array(Int64,0)
        push!(Ia,I)
        Ra=Array(Int64,0)
        push!(Ra,R)
        SIRModel(sim,parray,beta,c,gamma,ta,Sa,Ia,Ra,allIndividuals)
end


function live(p::Process,individual::SIRPerson,s::SIRModel)
  while individual.state=='S'
      # Wait until next contact
      hold(p,rand(Distributions.Exponential(1/s.c)))
      # Choose random alter
      alter=individual
      while alter==individual
          N=length(s.allIndividuals)
          index=rand(Distributions.DiscreteUniform(1,N))
          alter=s.allIndividuals[index]
      end
      # If alter is infected
      if alter.state=='I'
          infect = rand(Distributions.Uniform(0,1))
          if infect < s.beta
              individual.state='I'
              push!(s.ta,now(p))
              decrement(s.Sa)
              increment(s.Ia)
              carryover(s.Ra)
          end
      end
  end
  if individual.state=='I'
      # Wait until recovery
      hold(p,rand(Distributions.Exponential(1/s.gamma)))
      individual.state='R'
      push!(s.ta,now(p))
      carryover(s.Sa)
      decrement(s.Ia)
      increment(s.Ra)
  end
end

function activate(s::SIRModel)
    [SimJulia.activate(s.parray[i],0.0,live,s.allIndividuals[i],s) for i in 1:length(s.parray)]
end

function run(s::SIRModel,tf::Float64)
    SimJulia.run(s.sim,tf)
end

function out(s::SIRModel)
    result = DataFrame()
    result["t"] = s.ta
    result["S"] = s.Sa
    result["I"] = s.Ia
    result["R"] = s.Ra
    result
end

end # module

The function, live, contains the logic behind the simulation; each susceptible individual contacts others at rate \( c \); if the contact is infected, then there is a probability \( \beta \) that the susceptible individual will become infected. After infection, individuals recover at rate \( \gamma \).

Some things to note:

  • Julia doesn't permit redefining types in the main scope, so if you want to run in the REPL, types can be defined in a module, which can be reloaded.
  • Julia doesn't allow subtyping of concrete types. In SimPy, one could define SIRPerson as a subclass of Process. Here, I use members of SIR model to hold instances of SimJulia.Simulation and an array of SimJulia.Process.
  • I use an outer constructor method for SIRModel.
  • I update S, I, and R every time step, as the output method constructs a DataFrame for the results.
  • I've separated the contact rate \( c \) from the probability of infection, \( \beta \), just to clarify their different roles in transmission.
  • The exclamation mark in push! denotes that the operation is in place (hence, I should really use activate! etc.).

Here is the model in action.

# Load simulation code and libraries
include("sir_sim.jl")
using SimJulia
using Gadfly

# Set parameters
# All Float64 so use decimal points
beta = 0.1
c = 1.0
gamma = 0.05

# Set initial conditions
# All Int64
S0 = 99
I0 = 1
R0 = 0
N0 = S0+I0+R0

# Initialise and run
sirmodel = SIRSim.SIRModel(N0,beta,c,gamma,S0,I0,R0);
SIRSim.activate(sirmodel);
SIRSim.run(sirmodel,1000.0);
# Collate results
result=SIRSim.out(sirmodel)

# Plot using Gadfly
p=plot(results,x="t",y="I",Geom.line)
10 20 5 15 25 1 I 400 -100 -400 300 700 200 -300 0 500 -200 600 100 t

This all seems a lot more complex than the stochastic model using Gillespie's algorithm I wrote about before. However, this approach makes it easier to build more complex models:

  • The distribution of passage times (here the recovery time) can easily be changed to a distribution other than an exponential.
  • Infectivity terms and contact rates can be allowed to vary in a complex way across individuals.
  • Rather than assuming a well-mixed population, we can easily assume a contact network, such that individuals only contact a subset of the population.

Friday, July 19, 2013

Emboss is a handy collection of tools for processing sequence data, in the form of command-line programs. One program, getorf finds and outputs open reading frames (ORFs) in sequences; however, it doesn't (yet) have a function to output the longest ORF. It also renames the sequences, by tacking on an identifier for each ORF found. Borrowing from here, here is a shell script that outputs the longest ORF for each sequence, intepreting ORFs as the sequence between stop codons, so it can be applied to partial gene sequences. It's slow, but for extracting ORFs from influenza haemagglutinin sequences, it seems to work fine.

#!/bin/bash

if [ $# != 1 ]; then
    echo "USAGE: ./longorf <fasta-file>"
    exit
fi

numseq=$(grep -c  ">" $1)

for i in $(seq 1 $numseq)
do
    (nthseq $1 -filter -number $i | getorf -filter -table 0 -find 2 -noreverse | sizeseq -filter -desc | seqret -filter -first | sed -r -e 's/_[0-9]* (.+?)//g')
done

You'll have to tweak in order to use a different table, or for circular genomes.

To use, just redirect the output to stdout.

./longorf.sh myseqs.fas > myseqs_orf.fas

Thursday, July 11, 2013

Differential equation modeling with Julia

In the last few examples of epidemiological modeling, I focused on stochastic models. However, most published models are in the form of a set of differential equations. In R, I often use simecol. The ODE package in Julia offers some functionality for solving ordinary differential equations, which should be very familiar to users of Matlab/Octave.

Here's the old favourite, the susceptible-infected-recovered model again.

# Load libraries
using ODE
using DataFrames
using Gadfly

# Define the model
# Has to return a column vector
function SIR(t,x)
    S=x[1]
    I=x[2]
    R=x[3]
    beta=x[4]
    gamma=x[5]
    N=S+I
    dS=-beta*S*I/N
    dI=beta*S*I/N-gamma*I
    dR=gamma*I
    return([dS;dI;dR;0;0])
end

# Initialise model
t = linspace(0,500,101);
inits=[9999,1,0,0.1,0.05];

# Run model
result=ode23(SIR,t,inits);

# Collate results in DataFrame
df=DataFrame();
df["t"]=result[1];
df["S"]=result[2][:,1];
df["I"]=result[2][:,2];
df["R"]=result[2][:,3];

# Plot using Gadfly
p=Gadfly.plot(df,x="t",y="I",Geom.line)

4000 -4000 2000 6000 1000 3000 -1000 5000 0 7000 -2000 -3000 I -600 600 900 -400 800 0 1000 -300 -500 200 700 1100 300 500 -200 -100 400 100 t

There are a few things to note; firstly, the syntax of the ode23 command expects only the time and a state vector. For parameter values, one can hard-code them in the model description (not good) or pass them as additional variables (with gradient 0, so they don't change). I don't really like either option. Perhaps one can use macros to do much the same job, but I'm not that well acquainted with Julia yet.