Friday, July 19, 2013

Emboss is a handy collection of tools for processing sequence data, in the form of command-line programs. One program, getorf finds and outputs open reading frames (ORFs) in sequences; however, it doesn't (yet) have a function to output the longest ORF. It also renames the sequences, by tacking on an identifier for each ORF found. Borrowing from here, here is a shell script that outputs the longest ORF for each sequence, intepreting ORFs as the sequence between stop codons, so it can be applied to partial gene sequences. It's slow, but for extracting ORFs from influenza haemagglutinin sequences, it seems to work fine.

#!/bin/bash

if [ $# != 1 ]; then
    echo "USAGE: ./longorf <fasta-file>"
    exit
fi

numseq=$(grep -c  ">" $1)

for i in $(seq 1 $numseq)
do
    (nthseq $1 -filter -number $i | getorf -filter -table 0 -find 2 -noreverse | sizeseq -filter -desc | seqret -filter -first | sed -r -e 's/_[0-9]* (.+?)//g')
done

You'll have to tweak in order to use a different table, or for circular genomes.

To use, just redirect the output to stdout.

./longorf.sh myseqs.fas > myseqs_orf.fas

No comments:

Post a Comment