Accelerating biological sequence alignment algorithms

A DNA sequence is the information system that encodes all life ¹. Multiple scientific disciplines (e.g. medicine, public health, genetic epidemiology) rely on genomic data ². The value of DNA sequences arises from knowledge of what they encode. This knowledge is presented as “annotations” that pinpoint, for instance, the linear segments of DNA that correspond to individual genes ³. Other annotation types map relationships between genes from different organisms, exploiting the principle of descent from a common ancestor ⁴.

New DNA sequencers are increasingly portable – they can be powered off a USB hub – which is accelerating the production of novel sequence data. Making sense of these data depends on sequence alignment algorithms. Unsurprisingly, these technologies have increased demand for fast, efficient, statistically robust alignment algorithms. Sequence alignment algorithms are dynamic programming algorithms that can be represented as finite-state machines or, probabilistically, as Hidden Markov Models.
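To make the dynamic programming idea concrete, here is a minimal sketch of global pairwise alignment (the classic Needleman-Wunsch recurrence) in plain Python. It is not the cogent3 implementation. The function name `needleman_wunsch` and the scoring values (match, mismatch, linear gap penalty) are assumptions chosen purely for illustration.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Globally align two strings; return (score, aligned1, aligned2).

    Illustrative only: the scoring scheme is an assumption, not cogent3's.
    """
    n, m = len(seq1), len(seq2)
    # score[i][j] = best score for aligning seq1[:i] with seq2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: each cell depends on its three neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two characters
                score[i - 1][j] + gap,      # gap in seq2
                score[i][j - 1] + gap,      # gap in seq1
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out1.append(seq1[i - 1])
                out2.append(seq2[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out1)), "".join(reversed(out2))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

The nested loops visit every cell of an (n+1) by (m+1) table, so run time grows with the product of the sequence lengths. That cost is what makes the choice of implementation technology matter.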

Keywords: Software design; Plugin architecture; Genomics; Bioinformatics; Computational Biology; Biological Viruses; Open source

Goals

In this project, you will explore how fast sequence alignment algorithms implemented in Python can go. You will focus initially on pairwise alignment, using the current cogent3 algorithms as motivation, and explore using numba (numba.pydata.org) and SIMD for performance optimisation. You will then extend this to the special case of aligning three sequences, and evaluate other performance optimisation approaches, e.g. Cython, JAX and GPU-based algorithms. The work will then be extended to multiple sequence alignment. The developed algorithm(s) will integrate with cogent3 objects.
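As a rough idea of the starting point, the sketch below compiles a score-only global pairwise alignment kernel with numba's `njit` decorator. It is an assumption-laden illustration, not cogent3 code. The names `nw_score` and `encode` are hypothetical, the scoring scheme is a simple match/mismatch/linear-gap model, and the code falls back to uncompiled Python if numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # numba is unavailable: run the same function as ordinary Python.
    def njit(func):
        return func


@njit
def nw_score(a, b, match, mismatch, gap):
    """Return the optimal global alignment score of two encoded sequences.

    Only two rows of the dynamic programming table are kept in memory.
    """
    n = a.shape[0]
    m = b.shape[0]
    prev = np.empty(m + 1, dtype=np.int64)
    for j in range(m + 1):
        prev[j] = j * gap
    for i in range(1, n + 1):
        curr = np.empty(m + 1, dtype=np.int64)
        curr[0] = i * gap
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            best = prev[j - 1] + s   # align the two characters
            up = prev[j] + gap       # gap in the second sequence
            left = curr[j - 1] + gap # gap in the first sequence
            if up > best:
                best = up
            if left > best:
                best = left
            curr[j] = best
        prev = curr
    return prev[m]


def encode(seq):
    """Hypothetical helper: turn an ASCII string into a uint8 array."""
    return np.array(list(seq.encode("ascii")), dtype=np.uint8)


print(nw_score(encode("ACGT"), encode("AGT"), 1, -1, -2))
```

Encoding sequences as integer arrays before calling the kernel keeps the inner loop free of Python string operations, which is the kind of code numba is designed to compile. Any speed-up still needs to be measured rather than assumed.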

The final result will be made available as a plugin for the genome data science library cogent3.

Requirements

Sophisticated understanding of Python
Software design patterns
Experience in C / C++, Rust, or GPU computing is desirable

Gain

You will join a multi-disciplinary team consisting of computer scientists, computational biologists, geneticists and mathematical statisticians. The project leads have extensive experience in successfully teaching and mentoring students to develop their practical skill set in this multi-disciplinary domain.

You will contribute to the cogent3 open source project for computational biology. The project is being developed with adherence to industry best-practice software engineering processes. You will be mentored in employing these practices.

By contributing to an open source project, your work benefits the large global community of bioinformatics scientists. All contributions will be acknowledged on the project documentation website, and significant contributions will further be acknowledged by co-authorship on academic publications arising from the project.

You will get access to working space in the Robertson Building.

Contacts

Prof Gavin Huttley and Dr Thang Bui

¹ The exception is that many viruses use RNA instead, thus ending your first lesson in biology – all rules are broken!
² The genome of an organism is its complete set of genetic material and can be computationally represented as a string of the four letters A, C, G, T.
³ A gene is a DNA segment that encodes a molecular machine, e.g., a protein.
⁴ Thanks Charles Darwin.