Sequence data in GenBank are often annotated with very useful information. In the case of viral sequence data, information such as the sampling time and country of sampling is invaluable for performing phylogeographic analyses, for example. One can download data from GenBank in a variety of formats. Unfortunately, Julia doesn't (yet) have a native parser for the GenBank flatfile format, but it's also possible to obtain the data in XML in one of two flavours; GenBank and the International Nucleotide Sequence Database (INSD) formats (see here ). I'll give a brief demo of how the INSD XML format can be parsed in Julia.
There are a couple of libraries for processing XML in Julia; LightXML and LibExpat. I'll use the former, as processing XML is more Julia-esque, as opposed to LibExpat, which offers the facility to search using XPath.
For demonstration purposes, I'll use hepatitis E virus (HEV), which causes acute hepatitis. HEV infects multiple species, including humans, wild and domestic swine, deer, chickens, mongooses, rats, ferrets, fish, and rabbits. As I'm meant to be working on zoonotic viruses, HEV is a good pathogen to study.
First of all, I load the LightXML, DataArrays, and DataFrames packages.
using LightXML
using DataArrays
using DataFrames
I searched for all available HEV sequences on GenBank, and downloaded the sequences in INSD XML. Although this can be done using Julia (another time, perhaps), the sequences can be downloaded by following this link, selecting the INSDSeq XML format, and saving the sequences as sequences.gbc.xml
.
The document can be parsed using parse_file
from LightXML
.
xdoc = parse_file("sequence.gbc.xml");
I start by identifying the root element, which is an INSDSet
.
xroot = root(xdoc)
println(name(xroot));
I extract all the sequences and accession numbers as lists, the latter using a comprehension.
sequences = get_elements_by_tagname(xroot, "INSDSeq")
accessions = [content(find_element(s,"INSDSeq_primary-accession")) for s in sequences];
There are an increasing number of HEV sequences available; over 10,000 as of the time of writing.
numseq=length(sequences)
Let's look at the first entry.
sequences[1]
To extract all the information about organism, host, sampling time, etc., that is held in the list of INSDQualifier
s, I loop through all the sequences and generate a dictionary with accession as the key and a dictionary of qualifiers as the value.
I start by initialising an empty dictionary, with strings as both the key and the value.
seq_dict=(ASCIIString=>Dict{ASCIIString,ASCIIString})[]
Extracting the information is a mixture of find_element
and find_elements_by_tagname
to search for the right elements, get_elements_by_tagname
, and finally using content
to extract the contents of the qualifiers.
for i in 1:numseq
s=sequences[i]
accession=content(find_element(s, "INSDSeq_primary-accession"))
feature_table=find_element(s,"INSDSeq_feature-table")
features=get_elements_by_tagname(feature_table,"INSDFeature")
feature_quals=get_elements_by_tagname(features[1], "INSDFeature_quals")
qualifiers=get_elements_by_tagname(feature_quals[1], "INSDQualifier")
qualifier_dict=(ASCIIString=>ASCIIString)[]
for q in qualifiers
n=find_element(q,"INSDQualifier_name")
v=find_element(q,"INSDQualifier_value")
if v!=nothing
qualifier_dict[content(n)]=content(v)
end
end
seq_dict[accession]=qualifier_dict
end;
Here is an example of the features for the first accession.
seq_dict[accessions[1]]
To flatten the dictionary, I first make a dictionary of all feature names, with the number of times the field is found.
fn_dict=(ASCIIString=>Int64)[]
for acc in keys(seq_dict)
features=seq_dict[acc]
for k in keys(features)
current_count=get(fn_dict,k,0)
fn_dict[k]=current_count+1
end
end
fn_dict
I extract the names of the qualifiers as a list, that will be used below to construct a DataFrame
.
feature_names=collect(keys(fn_dict))
I then loop through each feature name, for each sequence, determine whether the feature is present, and construct a DataArray
, which is then added to a DataFrame
.
df=DataFrame(accession=accessions)
numfeatures=length(feature_names)
for i in 1:numfeatures
key=feature_names[i]
dv=DataArray(ASCIIString[],Bool[])
for j in 1:numseq
acc=accessions[j]
f=seq_dict[acc]
val=get(f,key,NA) # NA is the default
push!(dv,val)
end
df[symbol(key)]=dv
end;
I now have a DataFrame
that has the features in a flat format.
head(df)
The annotations can now be written to file as a table.
writetable("annotations.txt", df, separator = '\t', header = true)
Fields can be summarised using countmap
, Julia's equivalent of R's table
. Here's the content of the host
field.
countmap(df[:host])
That's not the only field that has information relevant to the host. Sometimes, the organism is labelled with the host.
countmap(df[:organism])
Isolation source can also contain information on host type.
countmap(longseqdf[:isolation_source])
As you can see, quite a lot of work is needed to generate a 'clean' list of host types.
Of course, we would normally go on to do something with the sequences, which are also contained in the XML.
seqstrings=[content(find_element(s,"INSDSeq_sequence")) for s in sequences];
Once I have the sequences as an array of strings, I can do all the usual sorts of things like look at sequence length.
seqlen =[length(s) for s in seqstrings]
The Base library in Julia has a histogram function; one can see that only a small fraction of sequences (258 of them) are almost full genomes (7000-7500 base pairs long).
hist(convert(Array{Int64},seqlen))
There are various options to output these sequences as a FASTA file. Perhaps the most straightforward is to using the @printf
macro, which if given a stream as the first argument, will print formatted text to the stream.
f=open("hev_sequences.fas","w")
for i in 1:numseq
@printf(f,">%s\n%s\n",accessions[i],seqstrings[i])
end
close(f)