DNA divergence models that are less crazy

A DNA sequence is the information system that encodes all life ¹. Multiple scientific disciplines (e.g. medicine, public health, genetic epidemiology) rely on genomic data ². The value of DNA sequences arises from knowledge of what they encode. This knowledge is presented as “annotations” that pinpoint, for instance, the linear segments of DNA that correspond to individual genes ³. Other annotation types map relationships between genes from different organisms, exploiting the principle of descent from a common ancestor ⁴.

One of the things biologists love to do is use DNA sequences to establish evolutionary relationships. This “works” for three reasons: Darwin said all species evolved from a common ancestor; Mendel said there are hereditary units that encode an organism’s capabilities; and numerous scientists have since shown that mutations (the changes in genetic information transmitted between generations) accumulate in proportion to the time elapsed since two organisms last shared a common ancestor. So far so good.

To uncover relationships among sequences, biologists and their quantitative colleagues have developed probabilistic models that explain how sequences change. Historically, it was necessary to impose some simplifying assumptions on these models to make them computationally tractable. One assumption that remains in widespread use is that DNA is just a soup of unlinked chemical units (the “bases”), so that each position changes independently of its neighbours. Given we all know DNA is a polymer, that’s crazy and almost certainly a major factor in why there is so much controversy in the field.

In reality, estimates from empirical data indicate that a DNA letter is affected by up to six of the letters preceding it. In other words, DNA is Markov order 6, not Markov order 0!
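To make the idea of Markov order concrete, here is a minimal sketch that tabulates how often each letter follows a context of a given length. It only illustrates what "order" means along a single sequence; it is not the evolutionary model itself, and the function names `context_counts` and `conditional_freqs` are made up for this illustration.

```python
from collections import Counter, defaultdict


def context_counts(seq: str, order: int) -> dict[str, Counter]:
    """Count which letter follows each context of length `order` in `seq`."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for i in range(order, len(seq)):
        context = seq[i - order:i]  # the `order` letters preceding position i
        counts[context][seq[i]] += 1
    return dict(counts)


def conditional_freqs(seq: str, order: int) -> dict[str, dict[str, float]]:
    """Turn the counts into estimates of P(letter | preceding context)."""
    freqs = {}
    for context, counter in context_counts(seq, order).items():
        total = sum(counter.values())
        freqs[context] = {base: n / total for base, n in counter.items()}
    return freqs


if __name__ == "__main__":
    seq = "ACGACGACT"
    # order 0: the context is the empty string, so these are plain letter frequencies
    print(conditional_freqs(seq, 0))
    # order 2: frequencies of a letter given the two letters before it
    print(conditional_freqs(seq, 2))
```

With order 0 every letter shares the single empty context, which is the "soup" assumption. With order 6 there are 4^6 = 4096 possible contexts, which hints at why the parameter space needs careful handling.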

Keywords: Software design; Plugin architecture; Genomics; Bioinformatics; Computational Biology; Biological Viruses; Open source

Goals

To accommodate the long-range interactions among DNA letters, you will implement a flexible algorithm (representing a Markov random field of sequence evolution) to compute a pseudo-likelihood. The project is inspired by the Bayesian approach of Hwang and Green (2004), followed by the frequentist work of Christensen (2006) and a log-linear formulation to reduce parameter space by Yap and Huttley (unpublished).
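As a rough illustration of the pseudo-likelihood idea, the sketch below scores an aligned ancestor/descendant pair as a product of per-site terms, each conditioned on the neighbouring letters. It is a deliberately simplified toy, not the method of the cited papers and not part of the cogent3 API. It assumes the ancestral sequence is known, that only the immediate left and right neighbours matter, that those neighbours are held at their ancestral states, and that the conditional probabilities are estimated by simple counting with a pseudocount. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict

BASES = "ACGT"


def estimate_conditionals(anc: str, desc: str, pseudocount: float = 1.0):
    """Estimate P(descendant letter | ancestral triplet) from an aligned pair."""
    counts = defaultdict(Counter)
    for i in range(1, len(anc) - 1):
        triplet = anc[i - 1:i + 2]  # left neighbour, focal letter, right neighbour
        counts[triplet][desc[i]] += 1
    probs = {}
    for triplet, counter in counts.items():
        total = sum(counter.values()) + len(BASES) * pseudocount
        probs[triplet] = {b: (counter[b] + pseudocount) / total for b in BASES}
    return probs


def log_pseudo_likelihood(anc: str, desc: str, probs) -> float:
    """Sum per-site log conditional probabilities over interior positions."""
    if len(anc) != len(desc):
        raise ValueError("sequences must be aligned to the same length")
    total = 0.0
    for i in range(1, len(anc) - 1):
        total += math.log(probs[anc[i - 1:i + 2]][desc[i]])
    return total


if __name__ == "__main__":
    ancestor = "ACGTACGTACGT"
    descendant = "ACGTACTTACGT"
    table = estimate_conditionals(ancestor, descendant)
    print(log_pseudo_likelihood(ancestor, descendant, table))
```

The per-site terms are multiplied as though sites were independent given their neighbours, which is what makes this a pseudo-likelihood rather than a full likelihood. Swapping `estimate_conditionals` for a different parameterisation is one way the model space could be explored.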

In the project, you will get to solve problems related to numerical precision and compute performance. Additionally, you will define an API that enables the exploration of distinct model parameterisations.
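One numerical precision problem that typically shows up in likelihood calculations is underflow: multiplying thousands of per-site probabilities produces a number too small for a double-precision float. The sketch below contrasts the naive product with a sum of logarithms. The probabilities are invented for illustration and the function names are hypothetical.

```python
import math


def naive_likelihood(site_probs):
    """Multiply per-site probabilities directly (prone to underflow)."""
    result = 1.0
    for p in site_probs:
        result *= p
    return result


def log_likelihood(site_probs):
    """Sum per-site log probabilities; fsum limits accumulated rounding error."""
    return math.fsum(math.log(p) for p in site_probs)


if __name__ == "__main__":
    # 10,000 sites, each with probability 0.25 (an invented example)
    site_probs = [0.25] * 10_000
    print(naive_likelihood(site_probs))  # underflows to 0.0
    print(log_likelihood(site_probs))    # finite, equals 10_000 * log(0.25)
```

Working in log space keeps the value representable and turns products into sums, which also tends to suit vectorised or GPU implementations.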

Requirements

Sophisticated understanding of Python
Software design patterns
Experience in C / C++, Rust, or GPU computing is desirable

Gain

You will join a multi-disciplinary team consisting of computer scientists, computational biologists, geneticists and mathematical statisticians. The project leads have extensive experience in successfully teaching and mentoring students to develop their practical skill set in this multi-disciplinary domain.

You will contribute to the cogent3 open source project for computational biology. The project follows industry best-practice software engineering processes, and you will be mentored in employing these practices.

By contributing to an open source project, your work benefits the large global community of bioinformatics scientists. All contributions will be acknowledged on the project documentation website, and significant contributions will further be acknowledged by co-authorship on academic publications arising from the project.

You will get access to working space in the Robertson Building.

References

Hwang, Dick G, and Phil Green. “Bayesian Markov Chain Monte Carlo Sequence Analysis Reveals Varying Neutral Substitution Patterns in Mammalian Evolution.” Proc Natl Acad Sci U S A 101, no. 39 (2004): 13994–1. https://doi.org/10.1073/pnas.0404142101.

Christensen, Ole F. “Pseudo-Likelihood for Non-Reversible Nucleotide Substitution Models with Neighbour Dependent Rates.” Stat Appl Genet Mol Biol 5 (2006): Article 18. https://doi.org/10.2202/1544-6115.1217.

Contacts

Prof Gavin Huttley and Dr Thang Bui

¹ The exception is that many viruses use RNA instead, thus ending your first lesson in biology – all rules are broken!
² The genome of an organism is its complete set of genetic material and can be computationally represented as a string of the four letters A, C, G, T.
³ A gene is a DNA segment that encodes a molecular machine, e.g., a protein.
⁴ Thanks Charles Darwin.