Sharp BellPersonal blog of Osvaldo Zagordi, data scientist at University of Zurich and founder of enGene Statistics. Data science, statistics, science, machine learning, python, R and so on.
https://ozagordi.github.io/
Mon, 21 Nov 2016 15:55:40 +0100
Jekyll v3.1.6

A very simple plot

<p>If you look at the front cover of <em>The Visual Display of Quantitative Information</em>,
one of the famous books by Edward Tufte, you will see a plot made of many
lines: an intricate network of red lines of varying thickness,
cutting diagonally across a regular grey grid.</p>
<p><img src="https://www.edwardtufte.com/tufte/graphics/vdqi_bookcover.gif" alt="VDQI bookcover" class="center-block" /></p>
<p>The plot is one of the examples in the first chapter, <em>Graphical Excellence</em>:
a graphical train schedule from Paris to Lyon in the 1880s, by
<a href="https://en.wikipedia.org/wiki/Étienne-Jules_Marey">Étienne-Jules Marey</a>. The
stations are arranged vertically (the top one being Paris) at distances
proportional to their actual distances, while time is on the horizontal axis. Each
line is a train, and its slope is the speed. In the 1880s the entire trip took
around nine hours; the TGV, first introduced in 1981, completes it in less
than three hours. A red line overlaid on the schedule of 100 years before
marks the progress made.</p>
<p><img src="/img/marey_tgv.jpg" alt="Marey's schedule with TGV overlaid" class="center-block" /></p>
<p>I like these plots because of their simplicity: position vs. time, something I
must have seen for the first time in secondary school or even earlier. Angle,
slope, tangent, derivative, speed: everything is there; you just name it
differently at different stages of your education.</p>
<p>I decided to reproduce this plot for the Zurich-Milano train connection, which
is about to experience a historic moment. In a few weeks, all passenger trains
on this line will be traveling through the
<a href="https://en.wikipedia.org/wiki/Gotthard_Base_Tunnel">Gotthard Base Tunnel</a>.
It is the world’s longest railway tunnel, running for more than 57 km under the Swiss
Alps at altitudes between 312 and 469 m. Trains currently travel through
another tunnel, the
<a href="https://en.wikipedia.org/wiki/Gotthard_Tunnel">Gotthard Tunnel</a>, 15 km long,
built at the end of the 19th century at an altitude of around 1,100 m. In order
to climb to that altitude, trains travel through several railway spirals.
The view is great, but the speed is not.</p>
<p>I downloaded the schedule for a few morning trains that go through the
<em>Panoramastrecke</em> (the panoramic track) and a few that will go through the
base tunnel; they are shown in the plot below. It might not look that impressive
compared to the TGV. But if you look at the mountains while approaching them
from the plain, it’s a different feeling.</p>
<p><img src="/img/gotthard_trains.svg" alt="Trains schedule Zurich-Milano" class="center-block" /></p>
<p><a href="https://www.alptransit.ch/en/media/press-releases/detail/article/memorial-ceremony-for-deceased-tunnel-workers/">Nine people died</a>
during construction of the Gotthard Base Tunnel. None of them was of Swiss nationality.</p>
<p>Below is the R code used to produce the schedule plot. The timetable for a few trains
was obtained from the SBB <a href="http://www.sbb.ch">website</a> and manually written into
a <a href="https://github.com/ozagordi/ozagordi.github.io/blob/gh-pages/_source/train_gotthard/timetable-table.csv">csv file</a>.
The coordinates of the cities were taken from Google Maps and saved
<a href="https://github.com/ozagordi/ozagordi.github.io/blob/gh-pages/_source/train_gotthard/coordinates-table.csv">here</a>.</p>
<div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">tidyverse</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggthemr</span><span class="p">)</span><span class="w">
</span><span class="n">ggthemr</span><span class="p">(</span><span class="s1">'solarized'</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s1">'outer'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">geosphere</span><span class="p">)</span><span class="w">
</span><span class="c1"># parse coordinates and compute distances
</span><span class="n">coordinates_table</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"train_gotthard/coordinates-table.csv"</span><span class="p">)</span><span class="w">
</span><span class="n">cities</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">coordinates_table</span><span class="o">$</span><span class="n">city</span><span class="w">
</span><span class="n">coordinates_table</span><span class="o">$</span><span class="n">city</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="kc">NULL</span><span class="w">
</span><span class="n">p</span><span class="m">1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">coordinates_table</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="m">2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="n">p</span><span class="m">1</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">cities</span><span class="p">)),</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">p</span><span class="m">1</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">cities</span><span class="p">))),</span><span class="w"> </span><span class="n">ncol</span><span class="o">=</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">dists</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">distHaversine</span><span class="p">(</span><span class="n">p</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">trains</span><span class="w"> </span><span class="o"><-</span><span class="w">
</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"train_gotthard/timetable-table.csv"</span><span class="p">,</span><span class="w">
</span><span class="n">col_types</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">(</span><span class="n">city</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col_character</span><span class="p">(),</span><span class="w">
</span><span class="n">time</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col_character</span><span class="p">(),</span><span class="w">
</span><span class="n">train</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col_character</span><span class="p">(),</span><span class="w">
</span><span class="n">track</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">col_character</span><span class="p">()))</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">time</span><span class="o">=</span><span class="n">as.POSIXct</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">format</span><span class="o">=</span><span class="s2">"%H:%M"</span><span class="p">))</span><span class="w">
</span><span class="n">distances</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data_frame</span><span class="p">(</span><span class="n">city</span><span class="o">=</span><span class="n">cities</span><span class="p">,</span><span class="w"> </span><span class="n">d</span><span class="o">=</span><span class="n">dists</span><span class="p">)</span><span class="w">
</span><span class="n">trains</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">trains</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">full_join</span><span class="p">(</span><span class="n">distances</span><span class="p">,</span><span class="w"> </span><span class="n">by</span><span class="o">=</span><span class="s2">"city"</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="n">train_track</span><span class="o">=</span><span class="n">paste</span><span class="p">(</span><span class="n">train</span><span class="p">,</span><span class="w"> </span><span class="n">track</span><span class="p">,</span><span class="w"> </span><span class="n">sep</span><span class="o">=</span><span class="s1">'_'</span><span class="p">))</span><span class="w">
</span><span class="n">cbPalette</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"#657b83"</span><span class="p">,</span><span class="w"> </span><span class="s2">"#268bd2"</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">trains</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">group</span><span class="o">=</span><span class="n">train_track</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="n">track</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">()</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_y_continuous</span><span class="p">(</span><span class="n">breaks</span><span class="o">=</span><span class="n">distances</span><span class="o">$</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">labels</span><span class="o">=</span><span class="n">distances</span><span class="o">$</span><span class="n">city</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_x_datetime</span><span class="p">(</span><span class="n">date_breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"1 hour"</span><span class="p">,</span><span class="w"> </span><span class="n">date_labels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"%H:%M"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">ylab</span><span class="p">(</span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">15</span><span class="p">),</span><span class="w">
</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">15</span><span class="p">),</span><span class="w">
</span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">15</span><span class="p">),</span><span class="w">
</span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">15</span><span class="p">),</span><span class="w">
</span><span class="n">legend.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">15</span><span class="p">),</span><span class="w">
</span><span class="n">legend.text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="m">15</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">scale_colour_manual</span><span class="p">(</span><span class="n">values</span><span class="o">=</span><span class="n">cbPalette</span><span class="p">)</span><span class="w">
</span></code></pre>
</div>
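For readers who prefer Python, here is a minimal sketch of what <code class="highlighter-rouge">distHaversine</code> from the geosphere package computes: the great-circle distance between two points on the Earth. The city coordinates below are approximate and only for illustration; they are not the values in the csv above.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    half_dphi = radians(lat2 - lat1) / 2
    half_dlmb = radians(lon2 - lon1) / 2
    a = sin(half_dphi) ** 2 + cos(phi1) * cos(phi2) * sin(half_dlmb) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Approximate city coordinates (illustration only)
zurich = (47.378, 8.540)
milano = (45.464, 9.190)
print(round(haversine_km(*zurich, *milano)))  # about 219 km as the crow flies
```

In the R code above, every city is measured from the first city in the table, which gives the y-coordinates of the plot.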
<p>Header image via <a href="https://flic.kr/p/q8P66v">Flickr</a> (under CC BY-ND 2.0).</p>
Sat, 19 Nov 2016 00:00:00 +0100
https://ozagordi.github.io/2016/11/19/a-very-simple-plot/
https://ozagordi.github.io/2016/11/19/a-very-simple-plot/
Tags: dataviz

Breaking ground

<p>Most of our daily experiences, be it writing a blog post on a laptop or slicing
bread with a knife, build on the enormous amount of knowledge and progress
mankind has accumulated over the millennia. Most of the time, and quite
understandably, we take all this for granted and live thinking about the future
rather than the past. It is not worth recalling the Iron Age every time we use
a fork, or all the people involved in the invention of the transistor every
time we use a computer. Every now and then, though, it is hard not to be
overwhelmed by the magnitude of what past scientists achieved.</p>
<p>This happened to me some weeks ago with mathematicians, when I read a nice
<a href="http://www.nature.com/news/majority-of-mathematicians-hail-from-just-24-scientific-families-1.20491">article</a>
by Davide Castelvecchi on Nature News. It reports an analysis done on the
<a href="http://genealogy.math.ndsu.nodak.edu/">Mathematics Genealogy Project</a>, a database
aiming to list present and past mathematicians together with their advisor, in
order to build a <em>family tree</em> of advisor-advisee relationships. The
analysis shows that two thirds of the over 200 thousand mathematicians present
in the database can be assigned to one of only 24 families, the biggest of which
was founded by the Italian mathematician (physician and natural philosopher)
Sigismondo Polcastro, who lived between the 14th and 15th centuries.</p>
<p>Although I had explored this database before, after reading the
article I decided to go further into the past, something easily done by clicking
on the <strong>advisor</strong> link. While doing so I encountered many <em>big names</em>,
the ones you meet over and over when you study calculus. And I was somewhat
startled to see them in direct advisor-advisee relationships! It is really not too
difficult to end up at <a href="https://genealogy.math.ndsu.nodak.edu/id.php?id=38586">Euler</a>
(over 96 thousand descendants), meeting Dirichlet, Fourier and Lagrange on the way,
to name a few.</p>
<p>So I decided to save some of these advisor-advisee relationships and plot them
as a directed graph with <a href="https://d3js.org">d3</a>. In the graph below, an arrow
from A to B means <em>A was the advisor of B</em>.
You can (and probably have to) drag the points around to untangle some of the links.</p>
<div id="d3graph"></div>
<style>
.box {
font: 10px sans-serif;
}
.node {
fill: #657b83;
opacity: 0.4;
}
.node text {
fill: #000;
pointer-events: none;
font: 13px sans-serif;
}
.link {
stroke: #657b83;
stroke-opacity: .7;
}
</style>
<script type="text/javascript" src="https://d3js.org/d3.v3.js"></script>
<script>
var links = [
{source: "Marin Mersenne", target: "Blaise Pascal"},
{source: "Marin Mersenne", target: "Frans van Schooten, Jr."},
{source: "Frans van Schooten, Jr.", target: "Christiaan Huygens"},
// {source: "Frans van Schooten, Jr.", target: "Johan de Witt"},
{source: "Christiaan Huygens", target: "Gottfried W. Leibniz"},
{source: "Gottfried W. Leibniz", target: "Nicolas Malebranche"},
{source: "Nicolas Malebranche", target: "Jakob Bernoulli"},
{source: "Jakob Bernoulli", target: "Nikolaus (I) Bernoulli"},
{source: "Jakob Bernoulli", target: "Johann Bernoulli"},
{source: "Johann Bernoulli", target: "Daniel Bernoulli"},
{source: "Johann Bernoulli", target: "Leonhard Euler"},
{source: "Leonhard Euler", target: "Joseph-Louis Lagrange"},
{source: "Giovanni B. Beccaria", target: "Joseph-Louis Lagrange"},
{source: "Joseph-Louis Lagrange", target: "Jean-Baptiste Fourier"},
{source: "Joseph-Louis Lagrange", target: "Simeon D. Poisson"},
{source: "Jean Le Rond d'Alembert", target: "Pierre-Simon Laplace"},
{source: "Pierre-Simon Laplace", target: "Simeon D. Poisson"},
{source: "Jean-Baptiste Fourier", target: "Gustav Dirichlet"},
{source: "Simeon D. Poisson", target: "Gustav Dirichlet"},
{source: "Simeon D. Poisson", target: "Joseph Liouville"},
{source: "Jean-Baptiste Fourier", target: "Giovanni A. A. Plana"},
{source: "Joseph-Louis Lagrange", target: "Giovanni A. A. Plana"},
// {source: "Gustav Dirichlet", target: "August Kramer"},
{source: "Gustav Dirichlet", target: "Leopold Kronecker"},
{source: "Gustav Dirichlet", target: "Rudolf Lipschitz"},
{source: "Rudolf Lipschitz", target: "C. Felix Klein"}
];
var nodes = {};
// Compute the distinct nodes from the links.
links.forEach(function(link) {
link.source = nodes[link.source] || (nodes[link.source] = {name: link.source});
link.target = nodes[link.target] || (nodes[link.target] = {name: link.target});
});
var width = 740,
height = 800;
var force = d3.layout.force()
.nodes(d3.values(nodes))
.links(links)
.size([width, height])
.linkDistance(55)
.charge(-900)
.start();
//var svg = d3.select("body")
var svg = d3.select("#d3graph")
.append("svg")
.attr("class", "box")
.attr("width", width)
.attr("height", height);
//Create all the line svgs but without locations yet
var link = svg.selectAll(".link")
.data(force.links())
.enter().append("line")
.attr("class", "link")
.style("marker-end", "url(#advisor)"); //Added
var node = svg.selectAll(".node")
.data(force.nodes())
.enter().append("g")
.attr("class", "node")
.call(force.drag);
node.append("circle")
.attr("r", 4)
node.append("text")
.attr("dx", 8)
.attr("dy", ".85em")
.text(function(d) { return d.name });
force.on("tick", function () {
link.attr("x1", function (d) {
return d.source.x;
})
.attr("y1", function (d) {
return d.source.y;
})
.attr("x2", function (d) {
return d.target.x;
})
.attr("y2", function (d) {
return d.target.y;
});
d3.selectAll("circle").attr("cx", function (d) {
return d.x;
})
.attr("cy", function (d) {
return d.y;
});
d3.selectAll("text").attr("x", function (d) {
return d.x;
})
.attr("y", function (d) {
return d.y;
});
});
// this defines the marker-end programmatically
svg.append("defs").selectAll("marker")
.data(["advisor"])
.enter().append("marker")
.attr("id", function(d) { return d; })
.attr("viewBox", "0 -1 2 2")
.attr("refX", 2)
.attr("refY", 0)
.attr("markerWidth", 6)
.attr("markerHeight", 6)
.attr("orient", "auto")
.append("path")
.attr("d", "M0,-1 L2,0 L0,1 z")
//.style("stroke", "#f00")
.style("fill", "#268bd2")
.style("opacity", "1.0");
</script>
<p>The selection of nodes is completely arbitrary: I only saved nodes that
somehow struck me because something important in mathematics is named after
them.</p>
<p>There should be other interesting cliques in the database. For example, where
are Gauss, Riemann, Dedekind? And what about Cauchy and Hilbert?</p>
<h4 id="technical-notes">Technical notes</h4>
<p>I found the following links useful to draw the network:</p>
<ul>
<li>An A to Z of extra features for the D3 force layout, a
<a href="http://www.coppelia.io/2014/07/an-a-to-z-of-extra-features-for-the-d3-force-layout/">tutorial</a>
by Simon Raper,</li>
<li>a <a href="http://stackoverflow.com/questions/22651346/how-to-embed-a-d3-js-example-to-the-jekyll-blog-post">question</a>
on Stack Overflow, <em>How to embed d3 in a jekyll blog post</em> and references therein.</li>
<li>Last but not least, <a href="http://stackoverflow.com/questions/13865606/append-svg-canvas-to-element-other-than-body-using-d3">this answer</a>
helped me correct a fundamental mistake I was making in selecting the relevant HTML element.</li>
</ul>
<p><em>Header image from <a href="https://flic.kr/p/xZjJW">Flickr</a></em></p>
Sat, 08 Oct 2016 00:00:00 +0200
https://ozagordi.github.io/2016/10/08/breaking-ground/
https://ozagordi.github.io/2016/10/08/breaking-ground/
Tags: mathematics, dataviz, d3

There is (almost) always a model

<p>One of the nice things about my job is the close contact between the experiments
and the analysis. While my collaborators develop the method in the lab,
working with reagents, patient samples and sophisticated machines, I sit in
front of a computer developing tools like <a href="http://github.com/ozagordi/VirMet">this</a>
and using them to analyse the results coming from the lab.</p>
<p>Recently, an interesting question came up when a student defended her
PhD thesis. I wasn’t there, so I was only asked about it later. The question
was: “did the development of the experimental method and that of
the analysis pipeline influence each other?” No, said one colleague.
Of course they did, I later replied.</p>
<p>Why did we have different opinions, given that we both know how everything was
developed? My reasoning was that, without a computational pipeline to analyse
the results produced in the lab, my colleagues would have had hardly any hint
as to whether their experiments were working or not. At the same time, I could test
my tools on data of realistic size, with realistic signal-to-noise ratios and so
on. Neither side could have done it without the other. My colleague, maybe because
of the context in which the question was asked, interpreted it in a more
specific way: did you tweak the parameters in the software specifically to
respond to a change in the experimental conditions? And the answer here is
undoubtedly no.</p>
<p>My most successful tool, at least in terms of citations, is a piece of
<a href="http://github.com/ozagordi/shorah/">software</a> to denoise DNA sequences, chiefly
in the analysis of viral samples. It is an unsupervised learning algorithm that
explicitly models, though in a simplified way, the process of sequencing a genetically diverse
sample. We first wrote the <em>generative</em> model, then the algorithm to do
inference on its parameters, and finally we implemented it in an efficient way. There is a
nice correspondence between the parameters in the model and the physical processes
taking place in the lab, so it is easily interpretable.</p>
<p>It was great, but it took quite some time.</p>
<p><img src="/img/model.png" alt="Graphical representation of a generative model" /></p>
<p>But do you always need such a model? According to <a href="http://r4ds.had.co.nz/model-intro.html">some</a>,
<q>the goal of a model is to provide a simple low-dimensional summary of a dataset</q>,
and I would stretch this to say that even a simple data visualisation can be considered a model.</p>
<p>For example, in this more recent project I relied heavily on some heuristic
tools to denoise my data, match them by similarity and obtain summary statistics.
The advantages of these heuristic tools: they are fast,
thoroughly tested by a whole community of researchers, and easy to use.
I did not have to struggle to write efficient code from scratch.
On the other hand, I do not have many parameters that I can directly link to
anything happening in the lab.</p>
<p>A slightly related topic of discussion in machine learning is the trade-off
between the interpretability and the accuracy of a model. Although accuracy depends
on many more things, this observation makes me appreciate even more the power
of some machine learning methods. The models underlying tools like clustering,
trees and regression are so general and powerful that they can be routinely used for
a plethora of problems in all sorts of domains, with great results. It may be
worth remembering this the next time we apply one of them.</p>
<p><em>Header image from <a href="https://flic.kr/p/66jA7n">Flickr</a></em></p>
Wed, 07 Sep 2016 00:00:00 +0200
https://ozagordi.github.io/2016/09/07/modelling/
https://ozagordi.github.io/2016/09/07/modelling/
Tags: fundamentals, science, mathematics

First steps with Docker

<p>You might have read or heard that science, I mean academic science, has a few
<a href="http://www.vox.com/2016/7/14/12016710/science-challeges-research-funding-peer-review-process">problems</a>,
among these a <a href="http://www.vox.com/2016/7/14/12016710/science-challeges-research-funding-peer-review-process#3">reproducibility
crisis</a>.</p>
<p>Although the terms are often used interchangeably, there is an important
difference between reproducibility and replicability. The former is the ability
to repeat an analysis because those who first performed it have shared
sufficient detail. Replicability is the probability that an independent
experiment will reach the same results and conclusions as those first
reported.</p>
<p>A data scientist is typically responsible for the analysis of data that have
been produced in the lab (or in the field) by someone else, so he/she is
primarily responsible for guaranteeing its reproducibility. This shouldn’t be too
hard, since all you have to do is 1) say which data were analyzed and 2) say
exactly how. You share both and you are done; certainly easier than replicating a
complex and usually expensive experiment. Rmarkdown documents (by the way, this
blog is written in Rmarkdown) and Python notebooks have become popular in recent
years partly because they address this need.</p>
<p>For more complex projects, where multiple tools are needed, it can be more
complicated than that, and it is recognized that much of the scientific
literature is hard to reproduce. Reasons for this are largely to be found in a
lack of right incentives, but there are several remarkable technical challenges
posed by the large number of factors that influence the results. Even if we
only consider the analysis part, the huge variety of available systems makes
reproducibility challenging.
<em>You performed the analysis on a Linux cluster, will it run the same on a Mac
OS X? If you ran Python 3.4, will it still work and give the same results with
Python 3.5.1? Do I have to update that library, or is the one I have already
installed recent enough? And I’m still missing that dependency…</em></p>
<h2 id="full-virtualization">Full virtualization</h2>
<p>Virtual machines (VMs), which I understand were created for entirely different
reasons, offer a possible way to make reproducing an analysis easier.
The researcher can set up an environment to run the analysis, then make an image
of it and share it. Another user who wants to reproduce the analysis can
launch this image in a VM, and it will be like sitting at the same computer:
everything exactly the same.</p>
<p>This solution has a few disadvantages. Images are pretty big objects,
often on the order of several gigabytes, so moving them around is fairly
inconvenient, to begin with. Further, a VM takes on the order of minutes to
launch: not too long for a one-time analysis, but still slightly annoying. In
general, VMs tend to be quite heavy on the hardware.</p>
<h2 id="enter-containers">Enter containers</h2>
<p>Containers offer a lightweight alternative to VMs. They are usually smaller and,
above all, they launch in a second or less.</p>
<p>I’ve recently been developing a <a href="http://github.com/ozagordi/VirMet">pipeline</a>
for the analysis of DNA sequences that relies on many external tools. As is
customary in academia, the plan is to “advertise” it with a scientific paper
and hope that many other users will find it useful, use it and cite it. But
having to install ten other tools is certainly a disincentive.</p>
<p>So, I recently looked into <a href="https://www.docker.com">Docker</a>, which is now
probably the most famous containerisation software.</p>
<p>A developer usually starts from one of the images found on Docker Hub and adds
the necessary configuration by writing a <code class="highlighter-rouge">Dockerfile</code>. Each instruction in this
file is a layer, in Docker terms, and the engine builds the image by stacking
these layers in an optimized way. The resulting image can be made available to
others via Docker Hub. In this way one can offer the community an
environment in which the application is guaranteed to run as expected. The
advantages of Docker over VMs are not limited to being light and fast: the
presence of a central registry and the possibility of developing images easily
and in an open format also make Docker interesting. It must be said that at the
same time this poses a small risk: making dozens of images for every possible
little project and pushing them all onto the Hub.</p>
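As an illustration, a Dockerfile for such a pipeline could look roughly like the sketch below. The base image, package names and paths are invented for the example; they are not the actual VirMet setup.

```dockerfile
# Hypothetical Dockerfile: each instruction becomes one layer of the image.
FROM ubuntu:16.04

# Install system dependencies in a single layer.
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
 && rm -rf /var/lib/apt/lists/*

# Copy the pipeline into the image and install its Python requirements.
COPY . /opt/pipeline
RUN pip3 install -r /opt/pipeline/requirements.txt

WORKDIR /opt/pipeline
ENTRYPOINT ["python3", "run_pipeline.py"]
```

Building with <code class="highlighter-rouge">docker build -t user/pipeline .</code> and pushing with <code class="highlighter-rouge">docker push user/pipeline</code> then makes the environment available on Docker Hub.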
Sun, 24 Jul 2016 00:00:00 +0200
https://ozagordi.github.io/2016/07/24/first-steps-with-docker/
https://ozagordi.github.io/2016/07/24/first-steps-with-docker/
Tags: docker, replicability, software development, devops

Confidence interval and hypothesis test

<h2 id="if-all-tests-are-negative-the-positive-rate-is-zero-and-its-confidence-interval">If all tests are negative, the positive rate is zero. And its confidence interval?</h2>
<p>Some time ago a colleague came to me with a simple statistics question which, as
usual, turned out to be quite interesting and intriguing. They had run 23 tests
that were all negative, and they wanted a confidence interval for the
proportion of positive outcomes. To put it in statistical terms, we have
<script type="math/tex">n</script> observations that can be 0 or 1, each with probability
<script type="math/tex">\theta</script> of being positive. In formulas,</p>
<script type="math/tex; mode=display">x_i \in \{0, 1\} \quad \forall i = 1, \ldots , n \\
p(x_i=1) = \theta .</script>
<p>We know that the number of successes is binomially distributed, so a
sufficient statistic is <script type="math/tex">T = \sum_i x_i</script>, and the maximum likelihood
estimator for the binomial proportion is</p>
<script type="math/tex; mode=display">\hat{\theta} = \frac{T}{n} .</script>
<p>If the observations are all negative, the estimate for the rate is clearly zero,
but what about its confidence interval? I had never been asked to estimate the
confidence interval for the binomial distribution, so I was totally unprepared
(shame on me!). Quite instinctively, I computed the probability of observing
all 23 negative results for a given <script type="math/tex">\theta</script>, set this to 5% and solved for
<script type="math/tex">\theta</script>. The result: 12.2%. In formulas,</p>
<script type="math/tex; mode=display">p(\mathrm{all\;negatives}\, |\, \theta) = (1-\theta)^n \leq \alpha,</script>
<p>which gives, solving for <script type="math/tex">\theta</script>,</p>
<script type="math/tex; mode=display">\theta_u = 1 - \alpha^{1/n} \\
\theta_u(n=23) = 12.2\% .</script>
<p>In other words, what I did was to compute the highest <script type="math/tex">\theta</script> for which I
would be very surprised (5%) to see all negative outcomes. This is equivalent
to being quite sure (95% sure) to observe at least one positive outcome.</p>
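<p>A quick numerical check of the formula above (in Python, for convenience):</p>

```python
# Upper limit theta_u = 1 - alpha**(1/n): the largest theta for which
# observing n negatives out of n still has probability at least alpha.
n = 23
alpha = 0.05
theta_u = 1 - alpha ** (1 / n)
print(round(100 * theta_u, 1))  # -> 12.2
```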
<p>My colleague had found the formula for the
<a href="http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval">Clopper-Pearson interval</a>
and applied it to their case: different result. Of course, I thought: I had
computed the probability of observing data as extreme as, or more extreme
than, those observed, for a given value of the parameter. I had done a
hypothesis test, not built a confidence interval.</p>
<p>It would have been the end of the story, had I not tried this
<a href="http://www.danielsoper.com/statcalc3/calc.aspx?id=85">online calculator</a>.
It reports <script type="math/tex">0 \leq \theta \leq 0.12</script> to be the 90% interval for <script type="math/tex">n=23</script>.
This was no coincidence, as the plot below shows.</p>
<p>In other words, for any number of tests (at least, between five and thirty)
<em>my</em> estimate (violet points) matches the upper limit of the 90% confidence
interval computed with the Clopper-Pearson method (magenta line).</p>
<div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">ggthemr</span><span class="p">)</span><span class="w">
</span><span class="n">ggthemr</span><span class="p">(</span><span class="s1">'solarized'</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="o">=</span><span class="s1">'outer'</span><span class="p">)</span><span class="w">
</span><span class="c1"># computing 90% confidence interval with "exact" meaning Clopper-Pearson
</span><span class="n">library</span><span class="p">(</span><span class="n">binom</span><span class="p">)</span><span class="w">
</span><span class="n">cis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">binom.confint</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="o">:</span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="n">conf.level</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.9</span><span class="p">,</span><span class="w"> </span><span class="n">methods</span><span class="o">=</span><span class="s2">"exact"</span><span class="p">)</span><span class="w">
</span><span class="c1"># max theta for which n all negative outcomes have probability 5%
</span><span class="n">p_all_neg</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1-0.05</span><span class="o">**</span><span class="p">(</span><span class="m">1</span><span class="n">.</span><span class="o">/</span><span class="m">5</span><span class="o">:</span><span class="m">30</span><span class="p">)</span><span class="w">
</span><span class="n">df</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="m">5</span><span class="o">:</span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="m">1</span><span class="o">=</span><span class="n">cis</span><span class="o">$</span><span class="n">upper</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="m">2</span><span class="o">=</span><span class="n">p_all_neg</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_line</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="o">$</span><span class="n">y</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="s1">'#d33682'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">df</span><span class="o">$</span><span class="n">y</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">show.legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="o">=</span><span class="s1">'#6c71c4'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s2">"trials"</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="nf">expression</span><span class="p">(</span><span class="n">theta</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylim</span><span class="p">(</span><span class="m">0.0</span><span class="p">,</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="n">axis.title.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">face</span><span class="o">=</span><span class="s2">"bold"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="m">18</span><span class="p">),</span><span class="w">
</span><span class="n">axis.text.x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">vjust</span><span class="o">=</span><span class="m">0.0</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="m">14</span><span class="p">),</span><span class="w">
</span><span class="n">axis.title.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">face</span><span class="o">=</span><span class="s2">"bold"</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="m">18</span><span class="p">),</span><span class="w">
</span><span class="n">axis.text.y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">angle</span><span class="o">=</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">vjust</span><span class="o">=</span><span class="m">0.0</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="o">=</span><span class="m">14</span><span class="p">))</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">p</span><span class="p">)</span><span class="w">
</span></code></pre>
</div>
<p><img src="/figure/source/2016-07-07-confidence-interval-and-hypothesis-test/plot-1.png" alt="plot of chunk plot" /></p>
<h2 id="finding-confidence-intervals">Finding confidence intervals</h2>
<p>There is a one-to-one correspondence between confidence intervals and hypothesis
tests. As a matter of fact, confidence sets are found by inverting a test
statistic.</p>
<p>Let’s start revisiting a few concepts, following the
<a href="http://books.google.ch/books/about/Statistical_inference.html?id=0x_vAAAAMAAJ&redir_esc=y">classics</a>.
We have a hypothesis for a parameter of interest <script type="math/tex">\theta</script>. The hypothesis
says that it has a certain value</p>
<script type="math/tex; mode=display">H_0 : \theta = \theta_0.</script>
<p>We have data <script type="math/tex">X</script> and a test statistic telling us whether to reject the
hypothesis or not. The acceptance region <script type="math/tex">A(\theta_0)</script> is the region of the
sample space for which we do not reject <script type="math/tex">H_0</script> at a level <script type="math/tex">\alpha</script>. In symbols</p>
<script type="math/tex; mode=display">p(X \in A(\theta_0)) \geq 1 - \alpha</script>
<p>or</p>
<script type="math/tex; mode=display">p(X \notin A(\theta_0)) \leq \alpha.</script>
<p>Now, for each realisation of the data <script type="math/tex">X</script>, we take the values of <script type="math/tex">\theta</script> for
which the hypothesis <script type="math/tex">H_0</script> is accepted: this builds a confidence set for
<script type="math/tex">\theta</script>. We define <script type="math/tex">C(X)</script> as the set in parameter space</p>
<script type="math/tex; mode=display">C(X)=\{\theta_0 : X \in A(\theta_0)\}.</script>
<p>Then, <script type="math/tex">C(X)</script> is a <script type="math/tex">1 - \alpha</script> confidence set. This follows quite immediately
from the definition of acceptance region above</p>
<script type="math/tex; mode=display">p(\theta \in C(X)) = p(X \in A(\theta_0)) \geq 1 - \alpha.</script>
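<p>For the binomial case with all negatives, this inversion can be carried out numerically: scan a grid of candidate values <script type="math/tex">\theta_0</script>, keep those for which <script type="math/tex">T=0</script> falls in the acceptance region, and the retained values form the confidence set. A sketch in Python:</p>

```python
# Invert a one-sided test for the binomial proportion when T = 0:
# accept theta0 whenever observing zero successes is not too surprising,
# i.e. P(T = 0 | theta0) = (1 - theta0)**n > alpha.
n = 23
alpha = 0.05
grid = [i / 10000 for i in range(10001)]             # candidate theta0 values
accepted = [t for t in grid if (1 - t) ** n > alpha]  # the confidence set
print(min(accepted), max(accepted))  # from 0 up to roughly 1 - alpha**(1/n)
```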
<h2 id="clopper-pearson-estimator">Clopper-Pearson estimator</h2>
<p>There are several estimators for the confidence interval of the binomial
proportion. The advantage of this one is that it is exact, rather than based on
the normal approximation (see the Wikipedia page linked above). The
disadvantage is that it is conservative, i.e. there might be a smaller interval
with the same confidence level. The estimated interval is defined as</p>
<script type="math/tex; mode=display">\{\theta : p(\mathrm{Bin}(n, \theta) \leq T) > \frac{\alpha}{2} \} \cap
\{\theta : p(\mathrm{Bin}(n, \theta) \geq T) > \frac{\alpha}{2} \}.</script>
<p>This definition reconciles the fact that my estimate at 95% coincides with
the Clopper-Pearson interval at 90%. In fact, since we are in the special case
of <script type="math/tex">T=0</script>, we can write the Clopper-Pearson set as</p>
<script type="math/tex; mode=display">\{\theta : p(\mathrm{Bin}(n, \theta) \leq 0) > \frac{\alpha}{2} \} \cap
\{\theta : p(\mathrm{Bin}(n, \theta) \geq 0) > \frac{\alpha}{2} \} =
\{\theta : p(\mathrm{Bin}(n, \theta) = 0) > \frac{\alpha}{2} \}.</script>
<p>By taking <script type="math/tex">\alpha = 0.10</script> in the Clopper-Pearson estimate, we have the formula
I used for my estimate.</p>
<h2 id="a-final-observation">A final observation</h2>
<p>Which result is more interesting to my colleague? The hypothesis test says
that these data already exclude <script type="math/tex">\theta = 0.12</script> or higher at the 5% level,
while the 95% interval according to Clopper-Pearson is <script type="math/tex">0 \leq \theta \leq 0.15</script>.
As we saw, these are two different but intimately related things; which one
matters more depends largely on your taste.
We also had the chance to underline something that is often neglected: finding
a <script type="math/tex">1 - \alpha</script> confidence interval doesn’t mean we found the smallest interval
that contains the true value with probability <script type="math/tex">1 - \alpha</script>. If you want to
investigate an extreme consequence of this, you can visit
<a href="http://www.roma1.infn.it/~dagos/ci_calc.html">the ultimate confidence intervals calculator</a>.</p>
<p><em>Header image from <a href="https://flic.kr/p/9u9wZk">Flickr</a></em></p>
Thu, 07 Jul 2016 00:00:00 +0200
https://ozagordi.github.io/2016/07/07/confidence-interval-and-hypothesis-test/
https://ozagordi.github.io/2016/07/07/confidence-interval-and-hypothesis-test/statistical inferencenhstconfindence intervalbinomialThe most important concept<p>If you had to choose a single statement to pass on to the generations after
an imaginary destruction of the whole of scientific knowledge, what would you
choose? Which idea would help the subsequent generations the most in recreating
the lost body of knowledge and, ultimately, civilization? Richard Feynman once
stated that it is the atomic hypothesis,</p>
<blockquote>
that all things are made of atoms—little particles that move around in perpetual
motion, attracting each other when they are a little distance apart, but
repelling upon being squeezed into one another.
</blockquote>
<p>This struck me again a few days ago when a new coworker started here and I
was asked to introduce her to (loosely speaking) data analysis and computation.
For years now I’ve been working mainly with people without a strong
mathematical background. I’ve always done my best to explain my side of
things, and I’ve always been irritated by scientists who try to impress or
humiliate others with some “theorem dropping”.</p>
<p><img src="/figure/source/2016-07-06-most-important-concept/unnamed-chunk-1-1.png" alt="plot of chunk unnamed-chunk-1" /></p>
<p>So, if you had to choose one single concept that a person absolutely needs to
grasp before endeavouring in data analysis, what should this concept be?</p>
<p>This is a very different question from the one Feynman answered. I am now
asking what concept is absolutely needed to understand methods and results in
data analysis <strong>explained by someone who knows it</strong>, not which single notion
would most help a new civilization rebuild science from scratch.</p>
<p>I cannot imagine a more pervasive and fundamental concept than that of a
function in mathematics. From plotting any kind of result to writing even the
simplest program, everything is hard or impossible to understand without this
type of abstraction.</p>
<p><em>Header image from <a href="https://flic.kr/p/ecpHBw">Flickr</a></em></p>
Wed, 06 Jul 2016 00:00:00 +0200
https://ozagordi.github.io/2016/07/06/most-important-concept/
https://ozagordi.github.io/2016/07/06/most-important-concept/fundamentalssciencemathematics