Tuesday, June 25, 2013

A simple phylogeny workflow in R

I often have to do some routine phylogenetic analyses, where one has to find a 'good' model of substitution using some sort of model testing framework, then reconstruct a tree using maximum likelihood. This can all be done in R with the ape and phangorn libraries. Here, I'm going to reconstruct a phylogeny from 71 hepatitis C sequences of the core/E1 region from infected individuals in Egypt, studied by Ray et al. 2000. You can download the sequences here.

First of all, we need to load the libraries.

library(ape)
library(phangorn)

Next, we set the filename of a multiple sequence alignment in FASTA format.

myalignment.filename <- "ray2000.fas"
myalignment <- read.dna(myalignment.filename, format = "fasta", as.matrix = TRUE)

Use of the as.matrix=TRUE option means that an error will be thrown if sequences are of unequal length (which means they may not be aligned), and also allows an image of the alignment to be displayed.

image(myalignment, show.labels = FALSE)

plot of chunk unnamed-chunk-3

In order to obtain a good-fitting model, I need a rough tree. I use neighbour joining (NJ), assuming a TN93 distance model (just because I like it). NJ takes a distance matrix as input, so I generate this matrix first.

myalignment.dist.tn93 <- dist.dna(myalignment, model = "TN93", as.matrix = TRUE)

It's always sensible to plot out the resulting distances.

hist(myalignment.dist.tn93[lower.tri(myalignment.dist.tn93)], xlab = "Distance", 
    ylab = "Frequency", main = "", col = "grey")

plot of chunk unnamed-chunk-5

Neighbour joining is but a single command away.

myalignment.tn93.nj <- nj(myalignment.dist.tn93)

Here is a plot of the resulting tree.

plot(myalignment.tn93.nj, "unrooted", cex = 0.5, show.tip.label = FALSE)

plot of chunk unnamed-chunk-7

In order to fit models using maximum likelihood in phangorn, I have to change the format of the data from ape's DNAbin format, to the phyDat structure used in phangorn. I can then run a selection of (nucleotide) substitution models.

myalignment.phydat <- as.phyDat(myalignment)  # Convert to a format modeltest understand
myalignment.modeltest <- modelTest(myalignment.phydat, tree = myalignment.tn93.nj, 
    model = c("JC", "F81", "K80", "HKY", "SYM", "GTR"), G = TRUE, I = TRUE, 
    k = 4, control = pml.control(epsilon = 1e-08, maxit = 3, trace = 1), multicore = FALSE)
## Warning: negative edges length changed to 0!
myalignment.modeltest
##      Model  df logLik   AIC   BIC
## 1       JC 139  -9094 18465 19024
## 2     JC+I 140  -8618 17516 18079
## 3     JC+G 140  -8368 17017 17579
## 4   JC+G+I 141  -8352 16985 17552
## 5      F81 142  -9006 18296 18866
## 6    F81+I 143  -8525 17337 17911
## 7    F81+G 143  -8252 16790 17364
## 8  F81+G+I 144  -8239 16766 17344
## 9      K80 140  -8483 17246 17809
## 10   K80+I 141  -7980 16243 16810
## 11   K80+G 141  -7686 15653 16220
## 12 K80+G+I 142  -7668 15619 16190
## 13     HKY 143  -8441 17167 17742
## 14   HKY+I 144  -7916 16120 16699
## 15   HKY+G 144  -7636 15561 16139
## 16 HKY+G+I 145  -7617 15524 16106
## 17     SYM 144  -8445 17179 17758
## 18   SYM+I 145  -7943 16177 16759
## 19   SYM+G 145  -7650 15589 16172
## 20 SYM+G+I 146  -7632 15555 16142
## 21     GTR 147  -8436 17166 17757
## 22   GTR+I 148  -7903 16103 16698
## 23   GTR+G 148  -7622 15541 16136
## 24 GTR+G+I 149  -7602 15502 16100

It can be difficult to spot the 'best' model, so I write a function that pulls out the row of the dataframe with the lowest AIC or BIC.

findBestModel <- function(mt, ic = "AIC") {
    if (ic == "AIC") {
        minic <- min(mt$AIC)
        bestmodel <- mt[mt$AIC == minic, ]
    } else if (ic == "BIC") {
        minic <- min(mt$BIC)
        bestmodel <- mt[mt$BIC == minic, ]
    }
    bestmodel
}
myalignment.bestmodel <- findBestModel(myalignment.modeltest)
myalignment.bestmodel
##      Model  df logLik   AIC   BIC
## 24 GTR+G+I 149  -7602 15502 16100

As a first step in getting a maximum likelihood tree, I generate a phylogenetic maximum likelihood fit using pml. Note that this doesn't optimise the parameters; to do that, I have to run optim.pml. In this example, I do this in several stages, to show how different aspects of the tree - the substitution model, the branch lengths, and the topology - can be optimised separately from one another.

myalignment.mltree <- pml(myalignment.tn93.nj, myalignment.phydat, model = as.character(myalignment.bestmodel$Model), 
    k = 4)
## Warning: negative edges length changed to 0!
# optimise model parameters without fitting edges
myalignment.mltree <- optim.pml(myalignment.mltree, optNni = FALSE, optBf = TRUE, 
    optQ = TRUE, optInv = TRUE, optGamma = TRUE, optEdge = FALSE, optRate = TRUE, 
    control = pml.control(epsilon = 1e-08, maxit = 10, trace = 0))
# Now fix substitution model and fit tree
myalignment.mltree <- optim.pml(myalignment.mltree, optNni = TRUE, optBf = FALSE, 
    optQ = FALSE, optInv = FALSE, optGamma = FALSE, optEdge = TRUE, control = pml.control(epsilon = 1e-08, 
        maxit = 10, trace = 0))
# Now fine tune
myalignment.mltree <- optim.pml(myalignment.mltree, optNni = TRUE, optBf = TRUE, 
    optQ = TRUE, optInv = TRUE, optGamma = TRUE, optEdge = TRUE, optRate = FALSE, 
    control = pml.control(epsilon = 1e-08, maxit = 10, trace = 0))

Finally, I display the tree and save it as a Newick-formatted string to a file.

myalignment.mltree
## 
##  loglikelihood: -7529 
## 
## unconstrained loglikelihood: -1991 
## Proportion of invariant sites: 0.2699 
## Discrete gamma model
## Number of rate categories: 4 
## Shape parameter: 0.9088 
## 
## Rate matrix:
##       a       c      g      t
## a 0.000  1.3620 8.0664  1.791
## c 1.362  0.0000 0.7302 11.374
## g 8.066  0.7302 0.0000  1.000
## t 1.791 11.3741 1.0000  0.000
## 
## Base frequencies:  
## 0.1749 0.3425 0.2533 0.2293
myalignment.mltree$tree
## 
## Phylogenetic tree with 71 tips and 69 internal nodes.
## 
## Tip labels:
##  AF271819, AF271820, AF271821, AF271823, AF271824, AF271822, ...
## 
## Unrooted; includes branch lengths.
plot(myalignment.mltree$tree, "unrooted", show.tip.label = FALSE)

plot of chunk unnamed-chunk-11

write.tree(myalignment.mltree$tree, "mytree.tre")

An R markdown file with the workflow can be found here.


1 comment:

  1. Thank you. This was useful in 2020. Delightful that the sequence data was still there to be downloaded.

    Hope you are doing well!

    ReplyDelete