Online Service for Enriching Text with Semantic Annotation: Visual Interactive Deep, Transfer, and Active (Machine) Learning Approach to Sequence Labelling

Research areas



Motivation: In the big data era, although structured data are growing fast, most information still originates from unstructured text, including web pages, social media, research papers, interview transcripts, and clinical notes. Failures in the flow of this information lead to inefficiencies and adverse events in qualitative research, healthcare surveillance, business decision-making, and other data-intensive fields.

Approach: To gain useful insights from this vast amount of text, our Natural Language Processing (NLP) Team has been developing powerful, low-cost techniques to map it into structured representations. Our deep learning methods can use fewer than 100 expert-annotated sentences to achieve performance comparable to state-of-the-art systems initialised with ten times more data. Similarly, our methods for clinical language processing have been among the best in the ALTA, CLEF, and TREC shared tasks.


In this project, we focus on the automated coding, annotation, or extraction of entities in free-form text (e.g., “Data61” or other proper nouns, or words relevant to headings such as “Medical History” and “Care Plan”) as a key to unlocking digital language for content or sequence analysis and situational awareness. This improves capabilities not only for qualitative research but also for computational surveillance and decision-making. When combined with best practice in the science of annotation and performance evaluation, it also yields further gains in the reliability and repeatability of experimental results and related data. The project addresses multiple disciplines, with an emphasis on health and social sciences. To this end, it uses free-form text; semantic annotations of that text, marked either by hand using a web-based, visual, interactive interface or automatically by a computational classifier; and state-of-the-art methods to build and evaluate these classifiers.
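As an illustration only, the sketch below shows one common way to represent entity annotations for sequence labelling (BIO tags, one per token) and to score a classifier's output at the entity level. It is not the project's software: the label names ORG and CARE_PLAN, the example sentence, and the function names are hypothetical, and exact-match F1 is just one of several possible evaluation measures.

```python
from typing import List, Tuple

# A span is (start_token, end_token_exclusive, label). Labels are hypothetical.
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping entity spans as one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags back into spans; stray I- tags without a B- are ignored."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a final span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[Span], pred: List[Span]) -> float:
    """Exact-match entity-level F1: a prediction counts only if span and label agree."""
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentence with two hand-marked entities.
tokens = "Patient seen at Data61 for care plan review".split()
gold_spans = [(3, 4, "ORG"), (5, 7, "CARE_PLAN")]
gold_tags = spans_to_bio(len(tokens), gold_spans)

# Pretend a classifier found only the organisation.
pred_tags = ["O", "O", "O", "B-ORG", "O", "O", "O", "O"]
score = entity_f1(gold_spans, bio_to_spans(pred_tags))
```

Here a hand annotation and a classifier's prediction share the same tag format, so the same scoring function can compare either against a reference annotation.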


To increase the project's visibility and make it easy to engage with real-life applications, we are seeking a student to help us extend our existing NLP software with a web-based tool for annotation and demonstration of results. This work can be divided into three desired outcomes: 1) release of an open-source web and cloud-computing service that promotes best practice in developing, evaluating, and transferring annotated text collections and the machine learning methods for processing them, 2) completion of our deep and transfer learning modules and their inclusion in the computing service as a way to improve performance in clinical form-filling and job-advertisement analysis tasks, and 3) launch of web-based demonstration systems as a proof of concept that our technology generalises from these tasks to others.


This project will appeal to students with excellent skills in experimentation, programming, and teamwork. Preference will be given to students who have completed or are taking units in Artificial Intelligence, Document Analysis, and/or Machine Learning at The ANU, or similar units elsewhere.

Background Literature

See, for example, the following paper: Suominen H, Zhou L, Hanlen L, Ferraro G. Benchmarking clinical speech recognition and information extraction: New data, methods and evaluations. JMIR Medical Informatics 2015;3(2):e19.


This student project is part of the activities of the NLP Team within the ML Group at The Australian National University (ANU) and Data61 in Canberra, the capital of Australia. The OECD Regional Well-Being Report 2014 rated Canberra as the most livable city in the world. In 2014, the ML Group was ranked among the top five in the world in ML, the others being Microsoft Research; Max Planck Institute Tübingen; the University of California, Berkeley; and the University of Cambridge. According to the QS World University Rankings for 2015-16, The ANU ranks within the top 20 universities globally, with an overall score of 91.0 out of 100.0 (19th), whilst the next best Australian university scored 83.1 (42nd). For the field of research (FOR) code of AI and Image Processing, which is applicable to ML and NLP and falls under Information and Computer Sciences, The ANU obtained the top score of 5 out of 5 in the Excellence in Research for Australia (ERA) evaluations in both 2010 and 2012.


Active Learning, Artificial Intelligence (AI), Big Data, Data Analytics, Deep Learning, Machine Learning (ML), Natural Language Processing (NLP), Transfer Learning, Visual-Interactive Text Search and Exploration

Updated: 1 June 2019 / Responsible Officer: Dean, CECS / Page Contact: CECS Marketing