Short Reads Assembly into Haplotypes


What is ShoRAH

ShoRAH was updated in October 2013 to version 0.8. See changes

ShoRAH is an open source project for the analysis of next generation sequencing data. It is designed to analyse genetically heterogeneous samples. Its tools are written in different programming languages and provide error correction, haplotype reconstruction and estimation of the frequency of the different genetic variants present in a mixed sample.

The software suite ShoRAH (Short Reads Assembly into Haplotypes) consists of several programs, the most important of which are:

- amplicon-based analysis
- local error correction based on diri_sampler
- diri_sampler - Gibbs sampling for error correction via Dirichlet process mixture
- contain - removal of redundant reads
- maximum matching haplotype construction
- freqEst - EM algorithm for haplotype frequency
- detection of single nucleotide variants, taking strand bias into account
- wrapper for everything
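To make "removal of redundant reads" concrete, here is a minimal Python sketch. It is not the code of contain. It assumes each read is given as a hypothetical (start, sequence) pair on a common coordinate system, and it treats a read as redundant when it is a duplicate or when it lies entirely inside another read with identical bases. The function name remove_redundant is chosen only for this illustration.

```python
def remove_redundant(reads):
    """Drop duplicate reads and reads fully contained in another read.

    reads: iterable of (start, seq) pairs (hypothetical input format).
    Returns the surviving reads sorted by start position.
    """
    # Deduplicate, then visit the longest reads first so that any read
    # able to contain another is already in `kept` when it is needed.
    unique = sorted(set(reads), key=lambda r: (-len(r[1]), r[0], r[1]))
    kept = []
    for start, seq in unique:
        contained = False
        for k_start, k_seq in kept:
            offset = start - k_start
            fits = offset >= 0 and offset + len(seq) <= len(k_seq)
            if fits and k_seq[offset:offset + len(seq)] == seq:
                contained = True
                break
        if not contained:
            kept.append((start, seq))
    return sorted(kept)


example_reads = [
    (0, "ACGTAC"),
    (2, "GTA"),     # lies inside the first read with the same bases
    (2, "GTT"),     # overlaps the first read but differs, so it is kept
    (0, "ACGTAC"),  # exact duplicate
    (4, "ACGG"),    # extends past the first read, so it is kept
]
print(remove_redundant(example_reads))
```

The quadratic scan is fine for a small illustration; a real tool working on deep sequencing data would need a more efficient data structure.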


If you use shorah, please cite the application note published in BMC Bioinformatics.


Dependencies and installation

Please download and install:

Please note that these dependencies can also be satisfied using the package manager of many operating systems, for example MacPorts on Mac OS X or yum on several Linux distributions.

Type ‘make’ to build the C++ programs. This should be enough in most cases. If your GSL installation is not standard, you might need to edit the relevant lines in the Makefile (location /opt/local/ is already included).

Windows users

Although we did not develop shorah for Windows, Cygwin offers a Linux-like environment on Windows machines. Users have reported successful compilation by installing Cygwin and then

Then, compile shorah with make. We would like to hear if you succeed. Thanks to NKC


The input is a sorted BAM file. Analysis can be performed in local or global mode.

Local analysis

The local analysis alone can be run by invoking the local error correction program or, for amplicon mode, the amplicon-based analysis program. Both work by cutting windows from the multiple sequence alignment, running diri_sampler on each window and then calling the SNV caller.
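The window-cutting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic shorah uses: the alignment is represented as a list of equal-length strings, and the window length and shift are hypothetical parameters chosen for the example.

```python
def cut_windows(start, end, win_len, shift):
    """Yield half-open (win_start, win_end) intervals covering [start, end)."""
    if win_len <= 0 or shift <= 0:
        raise ValueError("win_len and shift must be positive")
    pos = start
    while pos < end:
        yield pos, min(pos + win_len, end)
        if pos + win_len >= end:
            break  # this window already reaches the end of the region
        pos += shift


def window_reads(msa, win_start, win_end):
    """Return the slice of every aligned row that falls inside the window."""
    return [row[win_start:win_end] for row in msa]


# Toy alignment: two rows of equal length (hypothetical data).
msa = ["ACGTACGTAC", "ACGTTCGTAC"]
for w_start, w_end in cut_windows(0, len(msa[0]), win_len=4, shift=2):
    print(w_start, w_end, window_reads(msa, w_start, w_end))
```

With a shift smaller than the window length the windows overlap, so each position is seen in more than one window; each slice would then be handed to the error-correction step.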

Global analysis

The whole global reconstruction consists of the following steps:

  1. error correction (i.e. local haplotype reconstruction);
  2. SNV calling;
  3. removal of redundant reads;
  4. global haplotype reconstruction;
  5. frequency estimation.

These can be run one after the other, or one can invoke the wrapper program, which runs the whole process from BAM file to frequency estimation and SNV calling.
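As an illustration of the last step, frequency estimation, the following Python sketch shows a generic EM iteration for mixture proportions. It is not the implementation of freqEst. It assumes a hypothetical 0/1 compatibility matrix in which compat[i][j] is 1 when read i is consistent with haplotype j, and it ignores sequencing errors and read lengths.

```python
def em_frequencies(compat, n_iter=200, tol=1e-9):
    """Estimate haplotype frequencies from a read/haplotype compatibility matrix.

    compat[i][j] is 1 if read i is compatible with haplotype j, else 0
    (hypothetical input format for this sketch).
    """
    n_haps = len(compat[0])
    freqs = [1.0 / n_haps] * n_haps  # start from the uniform distribution
    for _ in range(n_iter):
        counts = [0.0] * n_haps
        # E-step: split each read among its compatible haplotypes,
        # in proportion to the current frequency estimates.
        for row in compat:
            weights = [f * c for f, c in zip(freqs, row)]
            total = sum(weights)
            if total == 0:
                continue  # read matches no haplotype; skip it
            for j, w in enumerate(weights):
                counts[j] += w / total
        assigned = sum(counts)
        if assigned == 0:
            raise ValueError("no read is compatible with any haplotype")
        # M-step: new frequencies are the normalised expected counts.
        new_freqs = [c / assigned for c in counts]
        converged = max(abs(a - b) for a, b in zip(freqs, new_freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


# Two haplotypes, four reads; the last read is compatible with both.
compat = [[1, 0], [1, 0], [0, 1], [1, 1]]
print(em_frequencies(compat))
```

In this toy input the ambiguous read is shared between the two haplotypes according to their current estimated frequencies, which is what lets EM resolve reads that do not identify a single haplotype.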

Getting help

See the dedicated page.

Authors and Contributors

Niko Beerenwinkel

Arnab Bhattacharya

Nicholas Eriksson

Moritz Gerstung

Lukas Geyrhofer

Fabio Luciani

Kerensa McElroy

Osvaldo Zagordi