Tuesday, July 22, 2014

Plotting probability distributions in Julia

I often find the need to plot out probability distributions of parameters, for example posterior distributions. There are lots of ways to do this in Julia, and a clear 'winner' for plotting has yet to be established (although there's always room for implementations). Here, I outline some of the ways to generate probability histograms.

To generate the data, I first need to load Distributions and DataFrames.

In [1]:
using Distributions
using DataFrames;

There doesn't appear to be a function in the standard library to normalise histograms, so here is a function to calculate the total area under the curve for a histogram, using the tuple returned by the hist function.

In [2]:
function auc(h::(Range{Float64},Array{Int64,1}))
  freq=convert(Array{Float64,1},h[2])
  e=convert(Array{Float64,1},h[1])
  numbins=length(e)
  deltax=e[2:numbins]-e[1:(numbins-1)]
  sum(deltax.*freq)
end;

I now generate some random data from a $Gamma(3,1)$ distribution.

In [3]:
srand(1)
x=rand(Distributions.Gamma(3,1),10000)
xhist=hist(x,100)
nbins=length(xhist[1])
xauc=auc(xhist)
xvec=vec([z::Float64 for z in xhist[1]])
xdf=DataFrame(xmin=vec(xvec[1:(nbins-1)]),xmax=vec(xvec[2:nbins]),count=xhist[2])
xdf[:dens]=xdf[:count]/xauc;

Here's what the data look like.

In [4]:
xdf
Out[4]:
xminxmaxcountdens
10.00.2170.0085
20.20.4620.031
30.40.61810.0905
40.60.82440.122
50.81.03610.1805
61.01.23960.198
71.21.44640.232
81.41.65070.2535
91.61.85380.269
101.82.05350.2675
112.02.25540.277
122.22.44740.237
132.42.65270.2635
142.62.85200.26
152.83.04020.201
163.03.24380.219
173.23.43900.195
183.43.63550.1775
193.63.83300.165
203.84.02890.1445
214.04.22760.138
224.24.42560.128
234.44.62220.111
244.64.81950.0975
254.85.01860.093
265.05.21420.071
275.25.41520.076
285.45.61440.072
295.65.81340.067
305.86.01020.051

I generate a dataframe containing points from the PDF, which I will use to compare the random values.

In [5]:
xp=DataFrame(x=linspace(0,15,1501),dens=pdf(Distributions.Gamma(3,1),linspace(0,15,1501)))
xp
Out[5]:
xdens
10.00.0
20.014.950249168745848e-5
30.020.00019603973466135092
40.030.0004367004900968286
50.040.0007686315513218585
60.050.0011890367806258926
70.060.0016951761604516473
80.070.0022843648587695725
90.080.0029539723084372345
100.090.0037014213003484743
110.10.004524187090179799
120.110.005419796518543995
130.120.006385827144363534
140.130.007419906391278744
150.140.008519710706908304
160.150.009682964734781897
170.160.010907440498767502
180.170.012190956599817747
190.180.013531377424862616
200.190.014926612367677677
210.20.016374615061559628
220.210.017873382623642627
230.220.019420954910691974
240.229999999999999980.021015413786213192
250.240.022654882398716738
260.250.024337524470981413
270.260.02606154360016055
280.269999999999999960.02782518256857831
290.280.02962672266506445
300.290000000000000040.031464483016678674

PyPlot

As I use IJulia a lot, the most straightforward place to start is to use PyPlot, which is just an interface to Python's matplotlib. PyPlot has an option to normalise the histogram, so we pass the raw data, and PyPlot generates the histogram for us.

In [6]:
using PyPlot
PyPlot.plt.hist(x,nbins,normed="True")
PyPlot.plt.plot(xp[:x],xp[:dens])
INFO: Loading help data...

Out[6]:
1-element Array{Any,1}:
 PyObject <matplotlib.lines.Line2D object at 0xf43ee10>

Gadfly

Gadfly is one of the more polished plotting packages out there, and takes many of its cues from Hadley Wickham's excellent ggplot2 library. However, it seems that there isn't an option to normalise a histogram, so I use a barplot of the pre-computed densities instead. To get different colours for the histogram and the density plot, I use Themes.

In [7]:
using Gadfly
Gadfly.plot(layer(x=xdf[:xmin],y=xdf[:dens],Geom.bar,Theme(default_color=color("lightgray"))),
layer(x=xp[:x],y=xp[:dens],Geom.line,Theme(default_color=color("orange"))))
Out[7]:
x -20 -15 -10 -5 0 5 10 15 20 25 30 35 -15.0 -14.5 -14.0 -13.5 -13.0 -12.5 -12.0 -11.5 -11.0 -10.5 -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 20.0 20.5 21.0 21.5 22.0 22.5 23.0 23.5 24.0 24.5 25.0 25.5 26.0 26.5 27.0 27.5 28.0 28.5 29.0 29.5 30.0 -20 0 20 40 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 -0.31 -0.30 -0.29 -0.28 -0.27 -0.26 -0.25 -0.24 -0.23 -0.22 -0.21 -0.20 -0.19 -0.18 -0.17 -0.16 -0.15 -0.14 -0.13 -0.12 -0.11 -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 -0.5 0.0 0.5 1.0 -0.32 -0.30 -0.28 -0.26 -0.24 -0.22 -0.20 -0.18 -0.16 -0.14 -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 0.20 0.22 0.24 0.26 0.28 0.30 0.32 0.34 0.36 0.38 0.40 0.42 0.44 0.46 0.48 0.50 0.52 0.54 0.56 0.58 0.60 0.62 y

Winston

Winston is another nice plotting library, which is very easy to use. It can take the output of the hist command, as well as easily add a curve on top.

In [8]:
using Winston
p = Winston.FramedPlot(title="", xlabel="x", ylabel="Density")
xhist2=(xhist[1],xhist[2]./xauc)
Winston.add(p, Winston.Histogram(xhist2...))
Winston.add(p, Winston.Curve(xp[:x],xp[:dens],color="orange"))
Out[8]:

Plotly

Plotly is an online service that generates some nice plots; one will need an account and an API key to be able to use it, however (fill in yours below). To install, you need to run Pkg.clone("https://github.com/plotly/Plotly.jl"). I also had to edit the URL, as I run my IJulia notebook over HTTPS, and Chrome doesn't like insecure content loaded from a secure site.

In [9]:
using Plotly
username=""
api_key=""
Plotly.signin(username, api_key)
data0 = ["x" => x, "type" => "histogramx", "histnorm" => "probability density"]
data1 = ["x" => xp["x"],"y" => xp["dens"], "type" => "scatter"]
layout = {
  "title"=> "PDF",
  "autosize"=> "false",
  "width"=> 700,
  "height"=> 700,
  "titlefont"=>
          {
           "family"=> "Open Sans",
           "size"=> 18,
           "color"=> "rgb(84, 39, 143)"
           },
  "margin"=> {"l"=>160, "pad"=>0},
          "xaxis"=> {
                     "title"=> "x"
                     },
  "yaxis"=> {"title"=> "Density"},
  "showlegend" => false
          };
response = Plotly.plot([data0, data1],["layout"=>layout])
url = response["url"]
url = replace(url,"http","https")
s = string("<iframe height='750' id='igraph' scrolling='no' seamless='seamless' src='",
            url,
            "/700/700' width='750'></iframe>")
display("text/html", s);
Warning: using Winston.text in module Main conflicts with an existing identifier.
WARNING: indexing DataFrames with strings is deprecated; use symbols instead
 in getindex at /home/simon/.julia/v0.3/DataFrames/src/deprecated.jl:170
 in include_string at loading.jl:97
WARNING: indexing DataFrames with strings is deprecated; use symbols instead
 in getindex at /home/simon/.julia/v0.3/DataFrames/src/deprecated.jl:170
 in include_string at loading.jl:97

Conclusions

There are some other packages for plotting in Julia, like Gaston and Vega. Gaston is an interface to Gnuplot. Out of the box, it supports basic histograms, but if you want to do anything more complex, like add a line to a histogram, one has to drop down to the mid-level interface. In addition, Gaston doesn't currently support inline images in IJulia, and I've had problems with it recently. Vega has potential, but isn't as fully featured as some of the other plotting packages and hasn't been updated recently.


4 comments:

  1. The syntax for both histograms and tuples for arguments has changed as of Julia 0.5+, so the above code to generate the data doesn't work. Try the following:

    ```
    using Distributions
    using DataFrames

    function auc(h::Tuple{Range{Float64},Array{Int64,1}})
    freq=convert(Array{Float64,1},h[2])
    e=convert(Array{Float64,1},h[1])
    numbins=length(e)
    deltax=e[2:numbins]-e[1:(numbins-1)]
    sum(deltax.*freq)
    end;

    function auc(h::Histogram)
    freq=convert(Array{Float64,1},h.weights)
    e=convert(Array{Float64,1},h.edges[1])
    numbins=length(e)
    deltax=e[2:numbins]-e[1:(numbins-1)]
    sum(deltax.*freq)
    end;

    srand(1)
    x=rand(Distributions.Gamma(3,1),10000)

    xhist=hist(x,100)
    nbins=length(xhist[1])
    xauc=auc(xhist)
    xvec=vec([z::Float64 for z in xhist[1]])
    xdf=DataFrame(xmin=vec(xvec[1:(nbins-1)]),xmax=vec(xvec[2:nbins]),count=xhist[2])
    xdf[:dens]=xdf[:count]/xauc

    xhist2=fit(Histogram,x,nbins=100)
    xauc2=auc(xhist2)
    nbins2=length(xhist2.edges[1])
    xvec2=vec([z::Float64 for z in xhist2.edges[1]])
    xdf2=DataFrame(xmin=vec(xvec2[1:(nbins-1)]),xmax=vec(xvec2[2:nbins]),count=xhist2.weights)
    xdf2[:dens]=xdf2[:count]/xauc2
    ```

    ReplyDelete
    Replies
    1. Great post! The `hist` function has been removed in Juia 0.6:

      https://github.com/JuliaLang/julia/pull/16450

      Delete