Emboss is a handy collection of tools for processing sequence data, in the form of command-line programs. One program, getorf finds and outputs open reading frames (ORFs) in sequences; however, it doesn't (yet) have a function to output the longest ORF. It also renames the sequences, by tacking on an identifier for each ORF found. Borrowing from here, here is a shell script that outputs the longest ORF for each sequence, intepreting ORFs as the sequence between stop codons, so it can be applied to partial gene sequences. It's slow, but for extracting ORFs from influenza haemagglutinin sequences, it seems to work fine.
#!/bin/bash
if [ $# != 1 ]; then
echo "USAGE: ./longorf <fasta-file>"
exit
fi
numseq=$(grep -c ">" $1)
for i in $(seq 1 $numseq)
do
(nthseq $1 -filter -number $i | getorf -filter -table 0 -find 2 -noreverse | sizeseq -filter -desc | seqret -filter -first | sed -r -e 's/_[0-9]* (.+?)//g')
done
You'll have to tweak in order to use a different table, or for circular genomes.
To use, just redirect the output to stdout.
./longorf.sh myseqs.fas > myseqs_orf.fas
No comments:
Post a Comment