Tuesday, August 19, 2014

Parsing XML from GenBank in Julia

Sequence data in GenBank are often annotated with very useful information. In the case of viral sequence data, information such as the sampling time and country of sampling is invaluable for performing phylogeographic analyses, for example. One can download data from GenBank in a variety of formats. Unfortunately, Julia doesn't (yet) have a native parser for the GenBank flatfile format, but it's also possible to obtain the data in XML in one of two flavours; GenBank and the International Nucleotide Sequence Database (INSD) formats (see here ). I'll give a brief demo of how the INSD XML format can be parsed in Julia.

There are a couple of libraries for processing XML in Julia; LightXML and LibExpat. I'll use the former, as processing XML is more Julia-esque, as opposed to LibExpat, which offers the facility to search using XPath.

For demonstration purposes, I'll use hepatitis E virus (HEV), which causes acute hepatitis. HEV infects multiple species, including humans, wild and domestic swine, deer, chickens, mongooses, rats, ferrets, fish, and rabbits. As I'm meant to be working on zoonotic viruses, HEV is a good pathogen to study.

First of all, I load the LightXML, DataArrays, and DataFrames packages.

In [1]:
using LightXML
using DataArrays
using DataFrames

I searched for all available HEV sequences on GenBank, and downloaded the sequences in INSD XML. Although this can be done using Julia (another time, perhaps), the sequences can be downloaded by following this link, selecting the INSDSeq XML format, and saving the sequences as sequences.gbc.xml.

The document can be parsed using parse_file from LightXML.

In [2]:
xdoc = parse_file("sequence.gbc.xml");

I start by identifying the root element, which is an INSDSet.

In [3]:
xroot = root(xdoc)
println(name(xroot));
INSDSet

I extract all the sequences and accession numbers as lists, the latter using a comprehension.

In [4]:
sequences = get_elements_by_tagname(xroot, "INSDSeq")
accessions = [content(find_element(s,"INSDSeq_primary-accession")) for s in sequences];

There are an increasing number of HEV sequences available; over 10,000 as of the time of writing.

In [5]:
numseq=length(sequences)
Out[5]:
10049

Let's look at the first entry.

In [6]:
sequences[1]
<INSDSeq>
  <INSDSeq_locus>AB073909</INSDSeq_locus>
  <INSDSeq_length>421</INSDSeq_length>
  <INSDSeq_strandedness>single</INSDSeq_strandedness>
  <INSDSeq_moltype>RNA</INSDSeq_moltype>
  <INSDSeq_topology>linear</INSDSeq_topology>
  <INSDSeq_division>VRL</INSDSeq_division>
  <INSDSeq_update-date>16-JAN-2009</INSDSeq_update-date>
  <INSDSeq_create-date>15-DEC-2001</INSDSeq_create-date>
  <INSDSeq_definition>Swine hepatitis E virus gene for capsid protein, partial cds, strain: swJ570</INSDSeq_definition>
  <INSDSeq_primary-accession>AB073909</INSDSeq_primary-accession>
  <INSDSeq_accession-version>AB073909.1</INSDSeq_accession-version>
  <INSDSeq_other-seqids>
    <INSDSeqid>dbj|AB073909.1|</INSDSeqid>
    <INSDSeqid>gi|17826977</INSDSeqid>
  </INSDSeq_other-seqids>
  <INSDSeq_source>Swine hepatitis E virus</INSDSeq_source>
  <INSDSeq_organism>Swine hepatitis E virus</INSDSeq_organism>
  <INSDSeq_taxonomy>Viruses; ssRNA positive-strand viruses, no DNA stage; Hepeviridae; Hepevirus</INSDSeq_taxonomy>
  <INSDSeq_references>
    <INSDReference>
      <INSDReference_reference>1</INSDReference_reference>
      <INSDReference_authors>
        <INSDAuthor>Okamoto,H.</INSDAuthor>
        <INSDAuthor>Takahashi,M.</INSDAuthor>
        <INSDAuthor>Nishizawa,T.</INSDAuthor>
        <INSDAuthor>Fukai,K.</INSDAuthor>
        <INSDAuthor>Muramatsu,U.</INSDAuthor>
        <INSDAuthor>Yoshikawa,A.</INSDAuthor>
      </INSDReference_authors>
      <INSDReference_title>Analysis of the complete genome of indigenous swine hepatitis E virus isolated in Japan</INSDReference_title>
      <INSDReference_journal>Biochem. Biophys. Res. Commun. 289 (5), 929-936 (2001)</INSDReference_journal>
      <INSDReference_xref>
        <INSDXref>
          <INSDXref_dbname>doi</INSDXref_dbname>
          <INSDXref_id>10.1006/bbrc.2001.6088</INSDXref_id>
        </INSDXref>
      </INSDReference_xref>
      <INSDReference_pubmed>11741279</INSDReference_pubmed>
    </INSDReference>
    <INSDReference>
      <INSDReference_reference>2</INSDReference_reference>
      <INSDReference_position>1..421</INSDReference_position>
      <INSDReference_authors>
        <INSDAuthor>Okamoto,H.</INSDAuthor>
      </INSDReference_authors>
      <INSDReference_title>Direct Submission</INSDReference_title>
      <INSDReference_journal>Submitted (05-NOV-2001) Contact:Hiroaki Okamoto Jichi Medical School, Immunology Division and Division of Molecular Virology; 3311 Yakushiji, Minamikawachi-Machi, Kawachi-Gun, Tochigi-Ken 329-0498, Japan</INSDReference_journal>
    </INSDReference>
  </INSDSeq_references>
  <INSDSeq_feature-table>
    <INSDFeature>
      <INSDFeature_key>source</INSDFeature_key>
      <INSDFeature_location>1..421</INSDFeature_location>
      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>1</INSDInterval_from>
          <INSDInterval_to>421</INSDInterval_to>
          <INSDInterval_accession>AB073909.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>organism</INSDQualifier_name>
          <INSDQualifier_value>Swine hepatitis E virus</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>mol_type</INSDQualifier_name>
          <INSDQualifier_value>genomic RNA</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>strain</INSDQualifier_name>
          <INSDQualifier_value>swJ570</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>isolate</INSDQualifier_name>
          <INSDQualifier_value>No. 570</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>isolation_source</INSDQualifier_name>
          <INSDQualifier_value>pig</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>taxon:63421</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>country</INSDQualifier_name>
          <INSDQualifier_value>Japan</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
    <INSDFeature>
      <INSDFeature_key>gene</INSDFeature_key>
      <INSDFeature_location>&lt;1..&gt;421</INSDFeature_location>
      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>1</INSDInterval_from>
          <INSDInterval_to>421</INSDInterval_to>
          <INSDInterval_accession>AB073909.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_partial5 value="true"/>
      <INSDFeature_partial3 value="true"/>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>gene</INSDQualifier_name>
          <INSDQualifier_value>ORF2</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
    <INSDFeature>
      <INSDFeature_key>CDS</INSDFeature_key>
      <INSDFeature_location>&lt;1..&gt;421</INSDFeature_location>
      <INSDFeature_intervals>
        <INSDInterval>
          <INSDInterval_from>1</INSDInterval_from>
          <INSDInterval_to>421</INSDInterval_to>
          <INSDInterval_accession>AB073909.1</INSDInterval_accession>
        </INSDInterval>
      </INSDFeature_intervals>
      <INSDFeature_partial5 value="true"/>
      <INSDFeature_partial3 value="true"/>
      <INSDFeature_quals>
        <INSDQualifier>
          <INSDQualifier_name>gene</INSDQualifier_name>
          <INSDQualifier_value>ORF2</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>codon_start</INSDQualifier_name>
          <INSDQualifier_value>2</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>transl_table</INSDQualifier_name>
          <INSDQualifier_value>1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>product</INSDQualifier_name>
          <INSDQualifier_value>capsid protein</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>protein_id</INSDQualifier_name>
          <INSDQualifier_value>BAB79301.1</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>db_xref</INSDQualifier_name>
          <INSDQualifier_value>GI:17826978</INSDQualifier_value>
        </INSDQualifier>
        <INSDQualifier>
          <INSDQualifier_name>translation</INSDQualifier_name>
          <INSDQualifier_value>TGVAEEEATSGLVMLCIHGSPVNSYTNTPYTGALGLLDFALELEFRNLTPGNTNTRVSRYTSTARHRLRRGADGTAELTTTAATRFMKDLHFTGTNGVGEVGRGIALTLFNLADTLLGGLPTELISSAGGQLFYSRPVVS</INSDQualifier_value>
        </INSDQualifier>
      </INSDFeature_quals>
    </INSDFeature>
  </INSDSeq_feature-table>
  <INSDSeq_sequence>cacgggcgtggctgaagaggaggctacttccggcctagtaatgctctgcattcatggctcccctgtcaattcttatactaatacaccctacacaggagcactggggcttcttgattttgcattggagcttgaatttaggaatttgacacctgggaatactaacacccgtgtatcccgatataccagcacagcccgccatcggctgcgccgcggtgccgacgggactgccgagcttactactacggcggccactcgatttatgaaagatctgcacttcactggtacgaatggagttggagaggtgggccgtggtattgctctgaccttgtttaatcttgctgatacgcttctcggtggcctgccgacagaattgatttcgtcggctgggggtcagctcttctactcccgccctgtcgtctcg</INSDSeq_sequence>
</INSDSeq>
Out[6]:

To extract all the information about organism, host, sampling time, etc., that is held in the list of INSDQualifiers, I loop through all the sequences and generate a dictionary with accession as the key and a dictionary of qualifiers as the value.

I start by initialising an empty dictionary, with strings as both the key and the value.

In [7]:
seq_dict=(ASCIIString=>Dict{ASCIIString,ASCIIString})[]


Out[7]:
Dict{ASCIIString,Dict{ASCIIString,ASCIIString}} with 0 entries

Extracting the information is a mixture of find_element and find_elements_by_tagname to search for the right elements, get_elements_by_tagname, and finally using content to extract the contents of the qualifiers.

In [8]:
for i in 1:numseq
    s=sequences[i]
    accession=content(find_element(s, "INSDSeq_primary-accession"))
    feature_table=find_element(s,"INSDSeq_feature-table")
    features=get_elements_by_tagname(feature_table,"INSDFeature")
    feature_quals=get_elements_by_tagname(features[1], "INSDFeature_quals")
    qualifiers=get_elements_by_tagname(feature_quals[1], "INSDQualifier")
    qualifier_dict=(ASCIIString=>ASCIIString)[]
    for q in qualifiers
        n=find_element(q,"INSDQualifier_name")
        v=find_element(q,"INSDQualifier_value")
        if v!=nothing
            qualifier_dict[content(n)]=content(v)
        end
    end
    seq_dict[accession]=qualifier_dict
end;

Here is an example of the features for the first accession.

In [9]:
seq_dict[accessions[1]]
Out[9]:
Dict{ASCIIString,ASCIIString} with 7 entries:
  "organism"         => "Swine hepatitis E virus"
  "isolation_source" => "pig"
  "mol_type"         => "genomic RNA"
  "strain"           => "swJ570"
  "isolate"          => "No. 570"
  "db_xref"          => "taxon:63421"
  "country"          => "Japan"

To flatten the dictionary, I first make a dictionary of all feature names, with the number of times the field is found.

In [10]:
fn_dict=(ASCIIString=>Int64)[]
for acc in keys(seq_dict)
    features=seq_dict[acc]
    for k in keys(features)
        current_count=get(fn_dict,k,0)
        fn_dict[k]=current_count+1
    end
end
fn_dict
Out[10]:
Dict{ASCIIString,Int64} with 27 entries:
  "collected_by"     => 250
  "lab_host"         => 18
  "collection_date"  => 5800
  "cell_line"        => 1
  "acronym"          => 19
  "organism"         => 10038
  "isolation_source" => 6309
  "map"              => 61
  "subtype"          => 212
  "clone"            => 1133
  "genotype"         => 4932
  "isolate"          => 6898
  "country"          => 8991
  "note"             => 2062
  "mol_type"         => 10038
  "strain"           => 2943
  "specimen_voucher" => 1
  "lat_lon"          => 45
  "db_xref"          => 10038
  "clone_lib"        => 3
  "host"             => 7503
  "segment"          => 7
  "PCR_primers"      => 241
  "tissue_lib"       => 1
  "type"             => 7
  ⋮                  => ⋮

I extract the names of the qualifiers as a list, that will be used below to construct a DataFrame.

In [11]:
feature_names=collect(keys(fn_dict))
Out[11]:
27-element Array{ASCIIString,1}:
 "collected_by"    
 "lab_host"        
 "collection_date" 
 "cell_line"       
 "acronym"         
 "organism"        
 "isolation_source"
 "map"             
 "subtype"         
 "clone"           
 "genotype"        
 "isolate"         
 "country"         
 ⋮                 
 "strain"          
 "specimen_voucher"
 "lat_lon"         
 "db_xref"         
 "clone_lib"       
 "host"            
 "segment"         
 "PCR_primers"     
 "tissue_lib"      
 "type"            
 "identified_by"   
 "tissue_type"     

I then loop through each feature name, for each sequence, determine whether the feature is present, and construct a DataArray, which is then added to a DataFrame.

In [12]:
df=DataFrame(accession=accessions)
numfeatures=length(feature_names)
for i in 1:numfeatures
    key=feature_names[i]
    dv=DataArray(ASCIIString[],Bool[])
    for j in 1:numseq
        acc=accessions[j]
        f=seq_dict[acc]
        val=get(f,key,NA) # NA is the default
        push!(dv,val)
    end
    df[symbol(key)]=dv
end;

I now have a DataFrame that has the features in a flat format.

In [13]:
head(df)
Out[13]:
accessioncollected_bylab_hostcollection_datecell_lineacronymorganismisolation_sourcemapsubtypeclonegenotypeisolatecountrynotemol_typestrainspecimen_voucherlat_londb_xrefclone_libhostsegmentPCR_primerstissue_libtypeidentified_bytissue_type
1AB073909NANANANANASwine hepatitis E viruspigNANANANANo. 570JapanNAgenomic RNAswJ570NANAtaxon:63421NANANANANANANANA
2AB073910NANANANANASwine hepatitis E viruspigNANANANANo. 681JapanNAgenomic RNAswJ681NANAtaxon:63421NANANANANANANANA
3AB073911NANANANANASwine hepatitis E viruspigNANANANANo. 791JapanNAgenomic RNAswJ791NANAtaxon:63421NANANANANANANANA
4AB073912NANANANANASwine hepatitis E viruspigNANANANANo. 570JapanNAgenomic RNAswJ570NANAtaxon:63421NANANANANANANANA
5AB074915NANANANANAHepatitis E virusNANANANANAJAK-SaiJapan: SaitamaHEV; Japanese strain; isolated from serumgenomic RNANANANAtaxon:12461NAHomo sapiensNANANANANANA
6AB074916NANANANANAHepatitis E virushuman serumNANANANAJHA-SapJapan: SapporoHEV; Japanese straingenomic RNANANANAtaxon:12461NAHomo sapiensNANANANANANA

The annotations can now be written to file as a table.

In [14]:
writetable("annotations.txt", df, separator = '\t', header = true)

Fields can be summarised using countmap, Julia's equivalent of R's table. Here's the content of the host field.

In [15]:
countmap(df[:host])
Out[15]:
Dict{Union(NAtype,ASCIIString),Int64} with 82 entries:
  "Sus scrofa"             => 274
  "Homo sapiens; 40 year … => 1
  "ferret"                 => 18
  NA                       => 2546
  "Oryctolagus cuniculus"  => 4
  "Homo sapiens 70 year-o… => 2
  "Homo sapiens"           => 4762
  "Suncus murinus"         => 5
  "pet pig"                => 1
  "domestic pig"           => 28
  "Homo sapiens; travel t… => 1
  "Homo sapiens 43 year-o… => 2
  "domestic swine"         => 28
  "Rattus rattus (wild ra… => 102
  "Japanese monkey"        => 1
  "Oryctolagus cuniculus … => 3
  "Rattus rattus"          => 41
  "Bandicota indica"       => 4
  "Homo sapiens 63 year-o… => 2
  "humna"                  => 1
  "Homo sapiens 51 years … => 1
  "Neovison vison (mink)"  => 4
  "silver pheasant"        => 1
  "sheep N3"               => 1
  "swine"                  => 1094
  ⋮                        => ⋮

That's not the only field that has information relevant to the host. Sometimes, the organism is labelled with the host.

In [16]:
countmap(df[:organism])
Out[16]:
Dict{Union(NAtype,ASCIIString),Int64} with 19 entries:
  "Hepatitis E virus Hu/0… => 1
  "Hepatitis E virus rat/… => 1
  "Hepatitis E virus type… => 24
  "Hepatitis E virus type… => 46
  "Hepatitis E virus Hu/1… => 1
  "Hepatitis E virus type… => 360
  "Hepatitis E virus Hu/2… => 1
  "Hepatitis E virus Hu/6… => 1
  "Mink hepatitis E virus" => 4
  "Hepatitis E virus huma… => 1
  "Hepatitis E virus rat/… => 1
  "Hepatitis E virus Hu/1… => 1
  "Hepatitis E virus huma… => 1
  "Hepatitis E virus"      => 8603
  "Hepatitis E virus rat/… => 1
  "Swine hepatitis E viru… => 983
  "Big liver and spleen d… => 1
  "Ferret hepatitis E vir… => 17
  "Hepatitis E virus Hu/0… => 1

Isolation source can also contain information on host type.

In [17]:
countmap(longseqdf[:isolation_source])
longseqdf not defined
while loading In[17], in expression starting on line 1

As you can see, quite a lot of work is needed to generate a 'clean' list of host types.

Of course, we would normally go on to do something with the sequences, which are also contained in the XML.

In [18]:
seqstrings=[content(find_element(s,"INSDSeq_sequence")) for s in sequences];

Once I have the sequences as an array of strings, I can do all the usual sorts of things like look at sequence length.

In [19]:
seqlen =[length(s) for s in seqstrings]
Out[19]:
10049-element Array{Any,1}:
  421
  421
  421
 7257
 7236
  326
 7235
 7256
  326
 7240
  326
   98
   98
    ⋮
 1117
  849
  453
  506
  327
  327
  327
  327
  210
  210
 7202
 7194

The Base library in Julia has a histogram function; one can see that only a small fraction of sequences (258 of them) are almost full genomes (7000-7500 base pairs long).

In [20]:
hist(convert(Array{Int64},seqlen))
Out[20]:
(0.0:500.0:7500.0,[8014,1354,330,52,19,1,1,1,1,0,3,0,0,15,258])

There are various options to output these sequences as a FASTA file. Perhaps the most straightforward is to using the @printf macro, which if given a stream as the first argument, will print formatted text to the stream.

In [21]:
f=open("hev_sequences.fas","w")
for i in 1:numseq
    @printf(f,">%s\n%s\n",accessions[i],seqstrings[i])
end
close(f)

Tuesday, July 22, 2014

Plotting probability distributions in Julia

I often find the need to plot out probability distributions of parameters, for example posterior distributions. There are lots of ways to do this in Julia, and a clear 'winner' for plotting has yet to be established (although there's always room for implementations). Here, I outline some of the ways to generate probability histograms.

To generate the data, I first need to load Distributions and DataFrames.

In [1]:
using Distributions
using DataFrames;

There doesn't appear to be a function in the standard library to normalise histograms, so here is a function to calculate the total area under the curve for a histogram, using the tuple returned by the hist function.

In [2]:
function auc(h::(Range{Float64},Array{Int64,1}))
  freq=convert(Array{Float64,1},h[2])
  e=convert(Array{Float64,1},h[1])
  numbins=length(e)
  deltax=e[2:numbins]-e[1:(numbins-1)]
  sum(deltax.*freq)
end;

I now generate some random data from a $Gamma(3,1)$ distribution.

In [3]:
srand(1)
x=rand(Distributions.Gamma(3,1),10000)
xhist=hist(x,100)
nbins=length(xhist[1])
xauc=auc(xhist)
xvec=vec([z::Float64 for z in xhist[1]])
xdf=DataFrame(xmin=vec(xvec[1:(nbins-1)]),xmax=vec(xvec[2:nbins]),count=xhist[2])
xdf[:dens]=xdf[:count]/xauc;

Here's what the data look like.

In [4]:
xdf
Out[4]:
xminxmaxcountdens
10.00.2170.0085
20.20.4620.031
30.40.61810.0905
40.60.82440.122
50.81.03610.1805
61.01.23960.198
71.21.44640.232
81.41.65070.2535
91.61.85380.269
101.82.05350.2675
112.02.25540.277
122.22.44740.237
132.42.65270.2635
142.62.85200.26
152.83.04020.201
163.03.24380.219
173.23.43900.195
183.43.63550.1775
193.63.83300.165
203.84.02890.1445
214.04.22760.138
224.24.42560.128
234.44.62220.111
244.64.81950.0975
254.85.01860.093
265.05.21420.071
275.25.41520.076
285.45.61440.072
295.65.81340.067
305.86.01020.051

I generate a dataframe containing points from the PDF, which I will use to compare the random values.

In [5]:
xp=DataFrame(x=linspace(0,15,1501),dens=pdf(Distributions.Gamma(3,1),linspace(0,15,1501)))
xp
Out[5]:
xdens
10.00.0
20.014.950249168745848e-5
30.020.00019603973466135092
40.030.0004367004900968286
50.040.0007686315513218585
60.050.0011890367806258926
70.060.0016951761604516473
80.070.0022843648587695725
90.080.0029539723084372345
100.090.0037014213003484743
110.10.004524187090179799
120.110.005419796518543995
130.120.006385827144363534
140.130.007419906391278744
150.140.008519710706908304
160.150.009682964734781897
170.160.010907440498767502
180.170.012190956599817747
190.180.013531377424862616
200.190.014926612367677677
210.20.016374615061559628
220.210.017873382623642627
230.220.019420954910691974
240.229999999999999980.021015413786213192
250.240.022654882398716738
260.250.024337524470981413
270.260.02606154360016055
280.269999999999999960.02782518256857831
290.280.02962672266506445
300.290000000000000040.031464483016678674

PyPlot

As I use IJulia a lot, the most straightforward place to start is to use PyPlot, which is just an interface to Python's matplotlib. PyPlot has an option to normalise the histogram, so we pass the raw data, and PyPlot generates the histogram for us.

In [6]:
using PyPlot
PyPlot.plt.hist(x,nbins,normed="True")
PyPlot.plt.plot(xp[:x],xp[:dens])
INFO: Loading help data...

Out[6]:
1-element Array{Any,1}:
 PyObject <matplotlib.lines.Line2D object at 0xf43ee10>

Gadfly

Gadfly is one of the more polished plotting packages out there, and takes many of its cues from Hadley Wickham's excellent ggplot2 library. However, it seems that there isn't an option to normalise a histogram, so I use a barplot of the pre-computed densities instead. To get different colours for the histogram and the density plot, I use Themes.

In [7]:
using Gadfly
Gadfly.plot(layer(x=xdf[:xmin],y=xdf[:dens],Geom.bar,Theme(default_color=color("lightgray"))),
layer(x=xp[:x],y=xp[:dens],Geom.line,Theme(default_color=color("orange"))))
Out[7]:
x -20 -15 -10 -5 0 5 10 15 20 25 30 35 -15.0 -14.5 -14.0 -13.5 -13.0 -12.5 -12.0 -11.5 -11.0 -10.5 -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 20.5 21.0 21.5 22.0 22.5 23.0 23.5 24.0 24.5 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5 29.0 29.5 30.0 -20 0 20 40 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 -0.31 -0.30 -0.29 -0.28 -0.27 -0.26 -0.25 -0.24 -0.23 -0.22 -0.21 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 -0.5 0.0 0.5 1.0 -0.32 -0.30 -0.28 -0.26 -0.24 -0.22 -0.20 -0.18 -0.16 -0.14 -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 0.58 0.60 0.62 y

Winston

Winston is another nice plotting library, which is very easy to use. It can take the output of the hist command, as well as easily add a curve on top.

In [8]:
using Winston
p = Winston.FramedPlot(title="", xlabel="x", ylabel="Density")
xhist2=(xhist[1],xhist[2]./xauc)
Winston.add(p, Winston.Histogram(xhist2...))
Winston.add(p, Winston.Curve(xp[:x],xp[:dens],color="orange"))
Out[8]:

Plotly

Plotly is an online service that generates some nice plots; one will need an account and an API key to be able to use it, however (fill in yours below). To install, you need to run Pkg.clone("https://github.com/plotly/Plotly.jl"). I also had to edit the URL, as I run my IJulia notebook over HTTPS, and Chrome doesn't like insecure content loaded from a secure site.

In [9]:
using Plotly
username=""
api_key=""
Plotly.signin(username, api_key)
data0 = ["x" => x, "type" => "histogramx", "histnorm" => "probability density"]
data1 = ["x" => xp["x"],"y" => xp["dens"], "type" => "scatter"]
layout = {
  "title"=> "PDF",
  "autosize"=> "false",
  "width"=> 700,
  "height"=> 700,
  "titlefont"=>
          {
           "family"=> "Open Sans",
           "size"=> 18,
           "color"=> "rgb(84, 39, 143)"
           },
  "margin"=> {"l"=>160, "pad"=>0},
          "xaxis"=> {
                     "title"=> "x"
                     },
  "yaxis"=> {"title"=> "Density"},
  "showlegend" => false
          };
response = Plotly.plot([data0, data1],["layout"=>layout])
url = response["url"]
url = replace(url,"http","https")
s = string("<iframe height='750' id='igraph' scrolling='no' seamless='seamless' src='",
            url,
            "/700/700' width='750'></iframe>")
display("text/html", s);
Warning: using Winston.text in module Main conflicts with an existing identifier.
WARNING: indexing DataFrames with strings is deprecated; use symbols instead
 in getindex at /home/simon/.julia/v0.3/DataFrames/src/deprecated.jl:170
 in include_string at loading.jl:97
WARNING: indexing DataFrames with strings is deprecated; use symbols instead
 in getindex at /home/simon/.julia/v0.3/DataFrames/src/deprecated.jl:170
 in include_string at loading.jl:97

Conclusions

There are some other packages for plotting in Julia, like Gaston and Vega. Gaston is an interface to Gnuplot. Out of the box, it supports basic histograms, but if you want to do anything more complex, like add a line to a histogram, one has to drop down to the mid-level interface. In addition, Gaston doesn't currently support inline images in IJulia, and I've had problems with it recently. Vega has potential, but isn't as fully featured as some of the other plotting packages and hasn't been updated recently.