Genome Assembly

SPAdes genome assembler. The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. SPAdes is a new genome assembler for both single-cell and standard (multicell) assembly. SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies.

QUAST: QUality Assessment Tool for Genome Assembly. Limitations of genome sequencing techniques have led to dozens of assembly algorithms, none of which is perfect. A number of methods for comparing assemblers have been developed, but none is yet a recognized benchmark. Further, most existing methods for comparing assemblies are only applicable to new assemblies of finished genomes; the problem of evaluating assemblies of previously unsequenced species has not been adequately considered. QUAST is a quality assessment tool for evaluating and comparing genome assemblies. This tool improves on leading assembly comparison software with new ideas and quality metrics. QUAST can evaluate assemblies both with a reference genome, as well as without a reference. QUAST produces many reports, summary tables and plots to help scientists in their research and in their publications.

EULER genome assembler. I was on the project team for some releases of the EULER genome assembler. EULER replaced the classical "overlap - layout - consensus" paradigm used in genome assembly tools by an Eulerian path approach that accomodates repeats. EULER also was the first genome assembler to use a de Bruijn graph, now used in many genome assemblers. EULER has been superceded by our current assembler, SPAdes.

EULER + Velvet-SC (E+V-SC): Single-cell genome assembler. Whole genome amplification by the multiple displacement amplification (MDA) method allows sequencing of DNA from single cells of bacteria that cannot be cultured. Assembling a genome is challenging, however, because MDA generates highly nonuniform coverage of the genome. Here we describe an algorithm tailored for short-read data from single cells that improves assembly through the use of a progressively increasing coverage cutoff. Assembly of reads from single Escherichia coli and Staphylococcus aureus cells captures >91% of genes within contigs, approaching the 95% captured from an assembly based on many E. coli cells. We apply this method to assemble a genome from a single cell of an uncultivated SAR324 clade of Deltaproteobacteria, a cosmopolitan bacterial lineage in the global ocean. Metabolic reconstruction suggests that SAR324 is aerobic, motile and chemotaxic. Our approach enables acquisition of genome assemblies for individual uncultivated bacteria using only short reads, providing cell-specific genetic information absent from metagenomic studies.

Our software and datasets are still available, but the software been superceded by SPAdes.

Genome Rearrangements

GRIMM: Genome Rearrangements in Man and Mouse (and other pairwise genome rearrangements)
MGR: Multiple Genome Rearrangements

Two genomes may many genes in common, but the genes may be arranged in a different sequence or be moved between chromosomes. The pairwise genome rearrangement problem is to find an optimal scenario transforming one genome to another via reversals, translocation, fission, and fusion. We provide a web server combining rearrangement algorithms for unichromosomal and multichromosomal genomes, with either signed or unsigned gene data. In each case, it computes the minimum possible number of rearrangement steps, and determines a possible scenario taking this number of steps. GRIMM is integrated into a related project, MGR, for constructing optimal phylogenetic trees with multiple genomes.

GRIMM-Synteny: Determine synteny blocks for use by GRIMM and MGR
Distribution of Segment Lengths in Genome Rearrangements. The study of gene orders for constructing phylogenetic trees was introduced by Dobzhansky and Sturtevant in 1938. Different genomes may have homologous genes arranged in different orders. In the early 1990s, Sankoff and colleagues modelled this as ordinary (unsigned) permutations on a set of numbered genes 1,2,...,n, with biological events such as inversions modelled as operations on the permutations. Signed permutations may be used when the relative strands of the genes are known, and "circular permutations" may be used for circular genomes. We use combinatorial methods (generating functions, commutative and noncommutative formal power series, asymptotics, recursions, and enumeration formulas) to study the distributions of the number and lengths of conserved segments of genes between two or more unichromosomal genomes, including signed and unsigned genomes, and linear and circular genomes. This generalizes classical work on permutations from the 1940s-60s by Wolfowitz, Kaplansky, Riordan, Abramson, and Moser, who studied decompositions of permutations into strips of ascending or descending consecutive numbers. In our setting, their work corresponds to comparison of two unsigned genomes (known gene orders, unknown gene orientations). Maple software implementing our formulas is available at

More Bioinformatics Software

BLASR: Mapping single molecule sequencing reads using Basic Local Alignment with Successive Refinement. BLASR (Basic Local Alignment with Successive Refinement) is a tool for mapping Single Molecule Sequencing (SMS) reads that are thousands to tens of thousands of bases long with divergence between the read and genome dominated by insertion and deletion error. The software is by Mark Chaisson; Glenn Tesler helped develop the underlying mathematical model.

RAPID: RApid Pair IDentification. In complex disorders, independently evolving locus pairs might interact to confer disease susceptibility, with only a modest effect at each locus. With genome-wide association studies on large cohorts, testing all pairs for interaction confers a heavy computational burden, and a loss of power due to large Bonferroni-like corrections. Correspondingly, limiting the tests to pairs that show marginal effect at either locus, also has reduced power. Here, we describe an algorithm that discovers interacting locus pairs without explicitly testing all pairs, or requiring a marginal effect at each locus. The central idea is a mathematical transformation that maps ‘statistical correlation between locus pairs’ to ‘distance between two points in a Euclidean space’. This enables the use of geometric properties to identify proximal points (correlated locus pairs), without testing each pair explicitly. For large datasets (~106 SNPs), this reduces the number of tests from 1012 to 106, significantly reducing the computational burden, without loss of power. The speed of the test allows for correction using permutation-based tests. The algorithm is encoded in a tool called RAPID (RApid Pair IDentification) for identifying paired interactions in case-control GWAS.

PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps Without Knowledge of the Favored Allele. Methods for detecting the genomic signatures of natural selection are heavily studied, and have been successful in identifying many selective sweeps. For the vast majority of these sweeps the favored allele remains unknown, making it difficult to distinguish carriers of the sweep from non-carriers. In an ongoing selective sweep, carriers of the favored allele are likely to contain a future most recent common ancestor. Therefore, identifying them may prove useful in predicting the trajectory evolutionary for example, in contexts involving drug-resistant pathogen strains or cancer subclones. The main contribution of this paper is the development and analysis of a new statistic, the Haplotype Allele Frequency (HAF) score. The HAF score, assigned to individual haplotypes in a sample, naturally captures many of the properties shared by haplotypes carrying a favored allele. We provide a theoretical model for computing expected HAF scores under different evolutionary scenarios, and validate the theoretical predictions through extensive simulations. As an application of HAF score computations, we develop an algorithm (PreCIOSS: Predicting Carriers of Ongoing Selective Sweeps) to identify carriers of the favored allele in selective sweeps, and we demonstrate its power on simulations of both hard and soft sweeps, as well as on data from well-known sweeps in human populations.

Pysub: Characterizing the Relationship between Steady State and Response Using Analytical Expressions for the Steady States of Mass Action Models. The steady states of cells affect their response to perturbation. Indeed, diagnostic markers for predicting the response to therapeutic perturbation are often based on steady state measurements. In spite of this, no method exists to systematically characterize the relationship between steady state and response. Mathematical models are established tools for studying cellular responses, but characterizing their relationship to the steady state requires that it have a parametric, or analytical, expression. For some models, this expression can be derived by the King-Altman method. However, King-Altman requires that no substrate act as an enzyme, and is therefore not applicable to most models of signal transduction. For this reason we developed py-substitution, a simple but general method for deriving analytical expressions for the steady state of any mass action model. Where the King-Altman method is applicable, we show that py-substitution yields an equivalent expression, and at comparable efficiency. We use py-substitution to study the relationship between steady state and sensitivity to the anti-cancer drug candidate, dulanermin (recombinant human TRAIL). First, we use py-substitution to derive an analytical expression for the steady state of a published model of TRAIL-induced apoptosis. We show that the amount of TRAIL required for cell death is sensitive to the steady state concentrations of procaspase 8 and its negative regulator, Bar, but not the other procaspase molecules. This suggests that activation of caspase 8 is a critical point in the death decision process. Finally, we show that changes in the threshold at which TRAIL results in cell death is not always equivalent to changes in the time of death, as is commonly assumed. Our work demonstrates that an analytical expression is a powerful tool for identifying steady state determinants of the cellular response to perturbation.

Other Mathematical Software

Multi de Bruijn Sequences. We generalize the notion of a de Bruijn sequence to a “multi de Bruijn sequence”: a cyclic or linear sequence that contains every k-mer over an alphabet of size q exactly m times. For example, over the binary alphabet {0,1}, the cyclic sequence (00010111) and the linear sequence 000101110 each contain two instances of each 2-mer 00,01,10,11. We derive formulas for the number of such sequences. The formulas and derivation generalize classical de Bruijn sequences (the case m=1). We also determine the number of multisets of aperiodic cyclic sequences containing every k-mer exactly m times; for example, the pair of cyclic sequences (00011)(011) contains two instances of each 2-mer listed above. This uses an extension of the Burrows-Wheeler Transform due to Mantaci et al, and generalizes a result by Higgins for the case m=1.