Graph algorithms for efficient classification of pathogens

Ahad N. Zehmakan

19 Dec 2024

Body

DNA sequences are the information system used to encode life [1]. Multiple scientific disciplines (e.g. medicine, public health, genetic epidemiology) rely on genomic data [2]. The value of DNA sequences arises from knowledge of what they encode. This knowledge is presented as “annotations” that pinpoint, for instance, the linear segments of DNA that correspond to individual genes [3]. Other annotation types map relationships between genes from different organisms, exploiting the principle of descent from a common ancestor [4].

Major advances in DNA sequencing technology make it possible to sequence DNA sampled from a diverse array of environmental conditions. From this data, biologists want to determine which species and, in many cases, which biological functions (encoded in the DNA) are present in a sample of environmental DNA sequences. This information is critical to solving central problems such as detecting pathogens of humans, domesticated animals or important crop species.

Keywords

Graph Algorithms; Software Design; Genomics; Bioinformatics; Computational Biology; Biological Viruses; Open source

Goals

This project seeks to apply graph algorithms to the critical problem of predicting the species of origin and functional capabilities of environmental DNA.

Current algorithms applied to this problem are highly inefficient. This project will tackle the problem using an innovative approach that combines de Bruijn graphs and partial order graphs.
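For readers new to the idea, here is a minimal sketch of how a de Bruijn graph can be built from DNA strings. It is an illustration only, not the project's method and not part of cogent3: the function name `de_bruijn_graph`, the toy reads and the choice of k = 3 are all assumptions made for this example.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Build a de Bruijn graph from DNA strings (hypothetical helper).

    Each k-mer found in the input contributes one edge, from its first
    k-1 letters (prefix) to its last k-1 letters (suffix).
    """
    graph = defaultdict(set)
    for seq in sequences:
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            kmer = seq[i : i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two toy, overlapping reads (made-up data for illustration).
reads = ["ACGTAC", "CGTACG"]
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because the two reads overlap, they contribute the same edges, so shared sequence is stored only once. This compaction of redundant sequence is one reason graphs of this kind are attractive for large collections of reads.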

While the motivating problem is in the field of genomics, students do not need a background in that field to undertake the project.

Requirements

  • Sophisticated understanding of Python
  • An interest in graph algorithms
  • Familiarity with software design patterns
  • Experience in C / C++ / Rust is desirable

Gain

You will join a multi-disciplinary team consisting of computer scientists, computational biologists, geneticists and mathematical statisticians. The project leads have extensive experience in successfully teaching and mentoring students to develop their practical skill set in this multi-disciplinary domain.

You will be using, and possibly contributing to, the cogent3 open source project for computational biology [5]. The project is being developed with adherence to industry best-practice software engineering processes. You will be mentored in employing these practices.

By contributing to an open source project, your work benefits the large global community of bioinformaticians and scientists. All contributions will be acknowledged on the project documentation website, and significant contributions will further be acknowledged by co-authorship on academic publications arising from the project.

You will get access to working space in the Robertson Building.

Contact

Supervisors: Gavin Huttley and Ahad N. Zehmakan

If you are interested, please email gavin.huttley@anu.edu.au and ahadn.zehmakan@anu.edu.au, including (1) which aspects of this project interest you the most, (2) what type of research project you are looking for (6-unit, 12-unit, or 24-unit), (3) a copy of your transcripts and/or CV, and (4) any questions you may have.

References

[1]: The exception is that many viruses use RNA instead, thus ending your first lesson in biology – all rules are broken!

[2]: The genome of an organism is its complete set of genetic material and can be computationally represented as a string of the four letters A, C, G, T.

[3]: A gene is a DNA segment that encodes a molecular machine, e.g., a protein.

[4]: Thanks Charles Darwin.

[5]: cogent3 is available on PyPI (~13k downloads per month) and bioconda. It is the successor to the high impact `PyCogent` library, which provided the critical foundations for multiple widely used spinoff projects, including `QIIME`, `QIIME2` and `scikit-bio`.
