Tuesday, August 19, 2014

Parsing XML from GenBank in Julia

Sequence data in GenBank are often annotated with very useful information. In the case of viral sequence data, information such as the sampling time and country of sampling is invaluable for performing phylogeographic analyses, for example. One can download data from GenBank in a variety of formats. Unfortunately, Julia doesn't (yet) have a native parser for the GenBank flatfile format, but it's also possible to obtain the data in XML in one of two flavours; GenBank and the International Nucleotide Sequence Database (INSD) formats (see here ). I'll give a brief demo of how the INSD XML format can be parsed in Julia.

There are a couple of libraries for processing XML in Julia; LightXML and LibExpat. I'll use the former, as processing XML is more Julia-esque, as opposed to LibExpat, which offers the facility to search using XPath.

For demonstration purposes, I'll use hepatitis E virus (HEV), which causes acute hepatitis. HEV infects multiple species, including humans, wild and domestic swine, deer, chickens, mongooses, rats, ferrets, fish, and rabbits. As I'm meant to be working on zoonotic viruses, HEV is a good pathogen to study.

First of all, I load the LightXML, DataArrays, and DataFrames packages.

In [1]:
using LightXML
using DataArrays
using DataFrames

I searched for all available HEV sequences on GenBank, and downloaded the sequences in INSD XML. Although this can be done using Julia (another time, perhaps), the sequences can be downloaded by following this link, selecting the INSDSeq XML format, and saving the sequences as sequences.gbc.xml.

The document can be parsed using parse_file from LightXML.

In [2]:
xdoc = parse_file("sequence.gbc.xml");

I start by identifying the root element, which is an INSDSet.

In [3]:
xroot = root(xdoc)
println(name(xroot));
INSDSet

I extract all the sequences and accession numbers as lists, the latter using a comprehension.

In [4]:
sequences = get_elements_by_tagname(xroot, "INSDSeq")
accessions = [content(find_element(s,"INSDSeq_primary-accession")) for s in sequences];

There are an increasing number of HEV sequences available; over 10,000 as of the time of writing.

In [5]:
numseq=length(sequences)
Out[5]:
10049

Let's look at the first entry.

In [6]:
sequences[1]
<INSDSeq>
  <INSDSeq_locus>AB073909</INSDSeq_locus>
  <INSDSeq_length>421</INSDSeq_length>
  <INSDSeq_strandedness>single</INSDSeq_strandedness>
  <INSDSeq_moltype>RNA</INSDSeq_moltype>
  <INSDSeq_topology>linear</INSDSeq_topology>
  <INSDSeq_division>VRL</INSDSeq_division>
  <INSDSeq_update-date>16-JAN-2009</INSDSeq_update-date>
  <INSDSeq_create-date>15-DEC-2001</INSDSeq_create-date>
  <INSDSeq_definition>Swine hepatitis E virus gene for capsid protein, partial cds, strain: swJ570</INSDSeq_definition>
  <INSDSeq_primary-accession>AB073909</INSDSeq_primary-accession>
  <INSDSeq_accession-version>AB073909.1</INSDSeq_accession-version>
  <INSDSeq_other-seqids>
    <INSDSeqid>dbj|AB073909.1|</INSDSeqid>
    <INSDSeqid>gi|17826977</INSDSeqid>
  </INSDSeq_other-seqids>
  <INSDSeq_source>Swine hepatitis E virus</INSDSeq_source>
  <INSDSeq_organism>Swine hepatitis E virus</INSDSeq_organism>
  <INSDSeq_taxonomy>Viruses; ssRNA positive-strand viruses, no DNA stage; Hepeviridae; Hepevirus</INSDSeq_taxonomy>
  <INSDSeq_references>
    <INSDReference>
      <INSDReference_reference>1</INSDReference_reference>
      <INSDReference_authors>
        <INSDAuthor>Okamoto,H.</INSDAuthor>
        <INSDAuthor>Takahashi,M.</INSDAuthor>
        <INSDAuthor>Nishizawa,T.</INSDAuthor>
        <INSDAuthor>Fukai,K.</INSDAuthor>
        <INSDAuthor>Muramatsu,U.</INSDAuthor>
        <INSDAuthor>Yoshikawa,A.</INSDAuthor>
      </INSDReference_authors>
      <INSDReference_title>Analysis of the complete genome of indigenous swine hepatitis E virus isolated in Japan</INSDReference_title>
      <INSDReference_journal>Biochem. Biophys. Res. Commun. 289 (5), 929-936 (2001)</INSDReference_journal>
      <INSDReference_xref>
        <INSDXref>
          <INSDXref_dbname>doi</INSDXref_dbname>
          <INSDXref_id>10.1006/bbrc.2001.6088</INSDXref_id>
        </INSDXref>
      </INSDReference_xref>
      <INSDReference_pubmed>11741279</INSDReference_pubmed>
    </INSDReference>
    <INSDReference>
      <INSDReference_reference>2</INSDReference_reference>
      <INSDReference_position>1..421</INSDReference_position>
      <INSDReference_authors>
        <INSDAuthor>Okamoto,H.</INSDAuthor>
      </INSDReference_authors>
      <INSDReference_title>Direct Submission</INSDReference_title>
      <INSDReference_journal>Submitted (05-NOV-2001) Contact:Hiroaki Okamoto Jichi Medical School, Immunology Division and Division of Molecular Virology; 3311 Yakushiji, Minamikawachi-Machi, Kawachi-Gun, Tochigi-Ken 329-0498, Japan</INSDReference_journal>
    </INSDReference>
  </INSDSeq_references>
  <INSDSeq_feature-table>
    <INSDFeature>
      <INSDFeature_key>source</INSDFeature_key>
      <INSDFeature_location>1..421</INSDFeature_location>
      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>1</INSDInterval_from>
          <INSDInterval_to>421</INSDInterval_to>
          <INSDInterval_accession>AB073909.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>organism</INSDQualifier_name>
          <INSDQualifier_value>Swine hepatitis E virus</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>mol_type</INSDQualifier_name>
          <INSDQualifier_value>genomic RNA</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>strain</INSDQualifier_name>
          <INSDQualifier_value>swJ570</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>isolate</INSDQualifier_name>
          <INSDQualifier_value>No. 570</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>isolation_source</INSDQualifier_name>
          <INSDQualifier_value>pig</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>taxon:63421</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>country</INSDQualifier_name>
          <INSDQualifier_value>Japan</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
    <INSDFeature>
      <INSDFeature_key>gene</INSDFeature_key>
      <INSDFeature_location>&lt;1..&gt;421</INSDFeature_location>
      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>1</INSDInterval_from>
          <INSDInterval_to>421</INSDInterval_to>
          <INSDInterval_accession>AB073909.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_partial5 value="true"/>
      <INSDFeature_partial3 value="true"/>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>gene</INSDQualifier_name>
          <INSDQualifier_value>ORF2</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
    <INSDFeature>
      <INSDFeature_key>CDS</INSDFeature_key>
      <INSDFeature_location>&lt;1..&gt;421</INSDFeature_location>
      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>1</INSDInterval_from>
          <INSDInterval_to>421</INSDInterval_to>
          <INSDInterval_accession>AB073909.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_partial5 value="true"/>
      <INSDFeature_partial3 value="true"/>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>gene</INSDQualifier_name>
          <INSDQualifier_value>ORF2</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>codon_start</INSDQualifier_name>
          <INSDQualifier_value>2</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>transl_table</INSDQualifier_name>
          <INSDQualifier_value>1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>product</INSDQualifier_name>
          <INSDQualifier_value>capsid protein</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>protein_id</INSDQualifier_name>
          <INSDQualifier_value>BAB79301.1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>GI:17826978</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>translation</INSDQualifier_name>
          <INSDQualifier_value>TGVAEEEATSGLVMLCIHGSPVNSYTNTPYTGALGLLDFALELEFRNLTPGNTNTRVSRYTSTARHRLRRGADGTAELTTTAATRFMKDLHFTGTNGVGEVGRGIALTLFNLADTLLGGLPTELISSAGGQLFYSRPVVS</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
  </INSDSeq_feature-table>
  <INSDSeq_sequence>cacgggcgtggctgaagaggaggctacttccggcctagtaatgctctgcattcatggctcccctgtcaattcttatactaatacaccctacacaggagcactggggcttcttgattttgcattggagcttgaatttaggaatttgacacctgggaatactaacacccgtgtatcccgatataccagcacagcccgccatcggctgcgccgcggtgccgacgggactgccgagcttactactacggcggccactcgatttatgaaagatctgcacttcactggtacgaatggagttggagaggtgggccgtggtattgctctgaccttgtttaatcttgctgatacgcttctcggtggcctgccgacagaattgatttcgtcggctgggggtcagctcttctactcccgccctgtcgtctcg</INSDSeq_sequence>
</INSDSeq>
Out[6]:

To extract all the information about organism, host, sampling time, etc., that is held in the list of INSDQualifiers, I loop through all the sequences and generate a dictionary with accession as the key and a dictionary of qualifiers as the value.

I start by initialising an empty dictionary, with strings as both the key and the value.

In [7]:
seq_dict=(ASCIIString=>Dict{ASCIIString,ASCIIString})[]


Out[7]:
Dict{ASCIIString,Dict{ASCIIString,ASCIIString}} with 0 entries

Extracting the information is a mixture of find_element and find_elements_by_tagname to search for the right elements, get_elements_by_tagname, and finally using content to extract the contents of the qualifiers.

In [8]:
for i in 1:numseq
    s=sequences[i]
    accession=content(find_element(s, "INSDSeq_primary-accession"))
    feature_table=find_element(s,"INSDSeq_feature-table")
    features=get_elements_by_tagname(feature_table,"INSDFeature")
    feature_quals=get_elements_by_tagname(features[1], "INSDFeature_quals")
    qualifiers=get_elements_by_tagname(feature_quals[1], "INSDQualifier")
    qualifier_dict=(ASCIIString=>ASCIIString)[]
    for q in qualifiers
        n=find_element(q,"INSDQualifier_name")
        v=find_element(q,"INSDQualifier_value")
        if v!=nothing
            qualifier_dict[content(n)]=content(v)
        end
    end
    seq_dict[accession]=qualifier_dict
end;

Here is an example of the features for the first accession.

In [9]:
seq_dict[accessions[1]]
Out[9]:
Dict{ASCIIString,ASCIIString} with 7 entries:
  "organism"         => "Swine hepatitis E virus"
  "isolation_source" => "pig"
  "mol_type"         => "genomic RNA"
  "strain"           => "swJ570"
  "isolate"          => "No. 570"
  "db_xref"          => "taxon:63421"
  "country"          => "Japan"

To flatten the dictionary, I first make a dictionary of all feature names, with the number of times the field is found.

In [10]:
fn_dict=(ASCIIString=>Int64)[]
for acc in keys(seq_dict)
    features=seq_dict[acc]
    for k in keys(features)
        current_count=get(fn_dict,k,0)
        fn_dict[k]=current_count+1
    end
end
fn_dict
Out[10]:
Dict{ASCIIString,Int64} with 27 entries:
  "collected_by"     => 250
  "lab_host"         => 18
  "collection_date"  => 5800
  "cell_line"        => 1
  "acronym"          => 19
  "organism"         => 10038
  "isolation_source" => 6309
  "map"              => 61
  "subtype"          => 212
  "clone"            => 1133
  "genotype"         => 4932
  "isolate"          => 6898
  "country"          => 8991
  "note"             => 2062
  "mol_type"         => 10038
  "strain"           => 2943
  "specimen_voucher" => 1
  "lat_lon"          => 45
  "db_xref"          => 10038
  "clone_lib"        => 3
  "host"             => 7503
  "segment"          => 7
  "PCR_primers"      => 241
  "tissue_lib"       => 1
  "type"             => 7
  ⋮                  => ⋮

I extract the names of the qualifiers as a list, that will be used below to construct a DataFrame.

In [11]:
feature_names=collect(keys(fn_dict))
Out[11]:
27-element Array{ASCIIString,1}:
 "collected_by"    
 "lab_host"        
 "collection_date" 
 "cell_line"       
 "acronym"         
 "organism"        
 "isolation_source"
 "map"             
 "subtype"         
 "clone"           
 "genotype"        
 "isolate"         
 "country"         
 ⋮                 
 "strain"          
 "specimen_voucher"
 "lat_lon"         
 "db_xref"         
 "clone_lib"       
 "host"            
 "segment"         
 "PCR_primers"     
 "tissue_lib"      
 "type"            
 "identified_by"   
 "tissue_type"     

I then loop through each feature name, for each sequence, determine whether the feature is present, and construct a DataArray, which is then added to a DataFrame.

In [12]:
df=DataFrame(accession=accessions)
numfeatures=length(feature_names)
for i in 1:numfeatures
    key=feature_names[i]
    dv=DataArray(ASCIIString[],Bool[])
    for j in 1:numseq
        acc=accessions[j]
        f=seq_dict[acc]
        val=get(f,key,NA) # NA is the default
        push!(dv,val)
    end
    df[symbol(key)]=dv
end;

I now have a DataFrame that has the features in a flat format.

In [13]:
head(df)
Out[13]:
accessioncollected_bylab_hostcollection_datecell_lineacronymorganismisolation_sourcemapsubtypeclonegenotypeisolatecountrynotemol_typestrainspecimen_voucherlat_londb_xrefclone_libhostsegmentPCR_primerstissue_libtypeidentified_bytissue_type
1AB073909NANANANANASwine hepatitis E viruspigNANANANANo. 570JapanNAgenomic RNAswJ570NANAtaxon:63421NANANANANANANANA
2AB073910NANANANANASwine hepatitis E viruspigNANANANANo. 681JapanNAgenomic RNAswJ681NANAtaxon:63421NANANANANANANANA
3AB073911NANANANANASwine hepatitis E viruspigNANANANANo. 791JapanNAgenomic RNAswJ791NANAtaxon:63421NANANANANANANANA
4AB073912NANANANANASwine hepatitis E viruspigNANANANANo. 570JapanNAgenomic RNAswJ570NANAtaxon:63421NANANANANANANANA
5AB074915NANANANANAHepatitis E virusNANANANANAJAK-SaiJapan: SaitamaHEV; Japanese strain; isolated from serumgenomic RNANANANAtaxon:12461NAHomo sapiensNANANANANANA
6AB074916NANANANANAHepatitis E virushuman serumNANANANAJHA-SapJapan: SapporoHEV; Japanese straingenomic RNANANANAtaxon:12461NAHomo sapiensNANANANANANA

The annotations can now be written to file as a table.

In [14]:
writetable("annotations.txt", df, separator = '\t', header = true)

Fields can be summarised using countmap, Julia's equivalent of R's table. Here's the content of the host field.

In [15]:
countmap(df[:host])
Out[15]:
Dict{Union(NAtype,ASCIIString),Int64} with 82 entries:
  "Sus scrofa"             => 274
  "Homo sapiens; 40 year … => 1
  "ferret"                 => 18
  NA                       => 2546
  "Oryctolagus cuniculus"  => 4
  "Homo sapiens 70 year-o… => 2
  "Homo sapiens"           => 4762
  "Suncus murinus"         => 5
  "pet pig"                => 1
  "domestic pig"           => 28
  "Homo sapiens; travel t… => 1
  "Homo sapiens 43 year-o… => 2
  "domestic swine"         => 28
  "Rattus rattus (wild ra… => 102
  "Japanese monkey"        => 1
  "Oryctolagus cuniculus … => 3
  "Rattus rattus"          => 41
  "Bandicota indica"       => 4
  "Homo sapiens 63 year-o… => 2
  "humna"                  => 1
  "Homo sapiens 51 years … => 1
  "Neovison vison (mink)"  => 4
  "silver pheasant"        => 1
  "sheep N3"               => 1
  "swine"                  => 1094
  ⋮                        => ⋮

That's not the only field that has information relevant to the host. Sometimes, the organism is labelled with the host.

In [16]:
countmap(df[:organism])
Out[16]:
Dict{Union(NAtype,ASCIIString),Int64} with 19 entries:
  "Hepatitis E virus Hu/0… => 1
  "Hepatitis E virus rat/… => 1
  "Hepatitis E virus type… => 24
  "Hepatitis E virus type… => 46
  "Hepatitis E virus Hu/1… => 1
  "Hepatitis E virus type… => 360
  "Hepatitis E virus Hu/2… => 1
  "Hepatitis E virus Hu/6… => 1
  "Mink hepatitis E virus" => 4
  "Hepatitis E virus huma… => 1
  "Hepatitis E virus rat/… => 1
  "Hepatitis E virus Hu/1… => 1
  "Hepatitis E virus huma… => 1
  "Hepatitis E virus"      => 8603
  "Hepatitis E virus rat/… => 1
  "Swine hepatitis E viru… => 983
  "Big liver and spleen d… => 1
  "Ferret hepatitis E vir… => 17
  "Hepatitis E virus Hu/0… => 1

Isolation source can also contain information on host type.

In [17]:
countmap(longseqdf[:isolation_source])
longseqdf not defined
while loading In[17], in expression starting on line 1

As you can see, quite a lot of work is needed to generate a 'clean' list of host types.

Of course, we would normally go on to do something with the sequences, which are also contained in the XML.

In [18]:
seqstrings=[content(find_element(s,"INSDSeq_sequence")) for s in sequences];

Once I have the sequences as an array of strings, I can do all the usual sorts of things like look at sequence length.

In [19]:
seqlen =[length(s) for s in seqstrings]
Out[19]:
10049-element Array{Any,1}:
  421
  421
  421
 7257
 7236
  326
 7235
 7256
  326
 7240
  326
   98
   98
    ⋮
 1117
  849
  453
  506
  327
  327
  327
  327
  210
  210
 7202
 7194

The Base library in Julia has a histogram function; one can see that only a small fraction of sequences (258 of them) are almost full genomes (7000-7500 base pairs long).

In [20]:
hist(convert(Array{Int64},seqlen))
Out[20]:
(0.0:500.0:7500.0,[8014,1354,330,52,19,1,1,1,1,0,3,0,0,15,258])

There are various options to output these sequences as a FASTA file. Perhaps the most straightforward is to using the @printf macro, which if given a stream as the first argument, will print formatted text to the stream.

In [21]:
f=open("hev_sequences.fas","w")
for i in 1:numseq
    @printf(f,">%s\n%s\n",accessions[i],seqstrings[i])
end
close(f)

No comments:

Post a Comment