Applying geometry and machine learning to detect genes in DNA sequences

A DNA sequence is the information system that encodes all life ¹. Multiple scientific disciplines (e.g. medicine, public health, genetic epidemiology) rely on genomic data ². The value of DNA sequences arises from knowledge of what they encode. This knowledge is presented as “annotations” that pinpoint, for instance, the linear segments of DNA that correspond to individual genes ³. Other annotation types map relationships between genes from different organisms, exploiting the principle of descent from a common ancestor ⁴.

Microorganisms are in nearly every environment surveyed, including extremes such as rocks, hot springs, and glaciers. They play an extremely influential role in the biology of multi-cellular organisms such as us. Biologists detect the presence of bacteria by extracting and sequencing DNA from environmental materials – a study termed metagenomics. This DNA, and the derived sequence data, are typically fragmented. Additionally, a single environmental sample can contain millions of such reads.

One of the major computational challenges presented by metagenomic studies is the identification of what genes and/or species are present in a sample. Techniques exist, but they are extremely slow and error prone.

**Keywords **: Software design; Plugin architecture; Genomics; Bioinformatics; Computational Biology; Biological Viruses; Open source

Goals

In this project, we will examine ways in which the linear nature of DNA can be transformed into a geometric representation suitable as a feature set for the application of machine learning to the problem of classifying the gene content of a sample.

Requirements

Sophisticated understanding of Python
Software design patterns
background in machine learning

Gain

You will join a multi-disciplinary team consisting of computer scientists, computational biologists, geneticists and mathematical statisticians. The project leads have extensive experience in successfully teaching and mentoring students to develop their practical skill set in this multi-disciplinary domain.

You will contribute to the cogent3 open source project for computational biology [^*]. The project is being developed with adherence to industry best-practice software engineering processes. You will be mentored in employing these practices.

By contributing to an open source project, your work benefits the large global community of bioinformatics scientists. All contributions will be acknowledged on the project documentation website and significant contributions will further be acknowledged by co-authorship on academic publication of the project.

You will get access to working space in the Robertson Building.

Contacts

Prof Gavin Huttley and Dr Thang Bui

The exception is that many viruses use RNA instead, thus ending your first lesson in biology – all rules are broken!\ ↩
The genome of an organism is its complete set of genetic material and can be computationally represented as a string of the four letters A, C, G, T.\ ↩
A gene is a DNA segment that encodes a molecular machine, e.g., a protein.\ ↩
Thanks Charles Darwin. ↩