Natural Language Processing for Small Languages



Research areas

External Member

Dr Danielle Barth, ANU College of Asia and the Pacific


We need to understand which groups of people are similar or different, and how and why. This project uses machine learning to classify text collected in Papua New Guinea as part of a language documentation project for an endangered indigenous language. Machine learning will be used to classify speakers into groups (younger vs older, male vs female, etc.) and to classify texts into types (conversation vs narrative, etc.). Which n-grams are most strongly associated with each group? What kind of model works best with this small dataset? This project will contribute to the anthropological and sociolinguistic understanding of a small community in PNG, and will provide valuable insight into using machine learning with non-English data in a small and variable dataset, which is a very common kind of data.
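As a rough illustration of the n-gram question above, the following minimal sketch scores each n-gram by a smoothed log-odds ratio between two groups of texts. It is not the project's method: the function names (ngrams, count_ngrams, log_odds), the choice of log-odds with add-alpha smoothing, whitespace tokenisation, and the toy tokens are all assumptions made for illustration only.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(texts, n):
    """Count n-grams over a list of whitespace-tokenised texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.split(), n))
    return counts


def log_odds(group_a, group_b, n=1, alpha=1.0):
    """Smoothed log-odds ratio of each n-gram for group A versus group B.

    Positive scores mean the n-gram is relatively more frequent in group A,
    negative scores mean it is relatively more frequent in group B.
    """
    a, b = count_ngrams(group_a, n), count_ngrams(group_b, n)
    vocab = set(a) | set(b)
    # Add-alpha smoothing keeps unseen n-grams from producing log(0).
    total_a = sum(a.values()) + alpha * len(vocab)
    total_b = sum(b.values()) + alpha * len(vocab)
    scores = {}
    for gram in vocab:
        p_a = (a[gram] + alpha) / total_a
        p_b = (b[gram] + alpha) / total_b
        scores[gram] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return scores


# Invented placeholder tokens, NOT real language data from the project.
conversation = ["ka ba ka", "ka da ba"]
narrative = ["mo ba mo", "mo da mo ba"]

scores = log_odds(conversation, narrative, n=1)
top_conversation = max(scores, key=scores.get)  # most conversation-like n-gram
top_narrative = min(scores, key=scores.get)     # most narrative-like n-gram
print(top_conversation, top_narrative)
```

With a dataset this small, the smoothing constant alpha has a large effect on the ranking, so any such scores would need to be checked against held-out texts before being interpreted.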


External supervisor

Dr. Danielle Barth.

ANU College of Asia and the Pacific


Familiarity with machine learning. Good coding skills in Python are a plus!



Gain a good understanding of machine learning models for natural language processing, and learn how to implement and apply these techniques in a research project.


Natural Language Processing, Machine Learning, Small data

Updated:  1 June 2019/Responsible Officer:  Dean, CECS/Page Contact:  CECS Marketing