Data61 Projects in Machine Learning and Computer Vision (2 newly added)


Supervisory Chair


External Member

Weihao Li


Please direct enquiries about each project to the contact named below it.

(New) Vision Transformer for extreme crowd counting:

This project aims to evaluate crowd counting and density estimation in highly dense scenes using a vision transformer. The dataset provided will be used for benchmarking and for comparative analysis against other state-of-the-art detection and density-estimation approaches. The images in the dataset are very high resolution and can easily be divided into patches, making them suitable for a vision transformer to model contextual information. The problem is challenging because object detectors perform very well on low-density crowds, while density-based approaches perform well on high-density crowds. The project will evaluate how the transformer holds up when applied to both high-density and low-density crowds. The students will be provided with a fully annotated dataset and will be encouraged to extend it. Computational resources will be provided for the experimental evaluations.
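As an illustration of the patch-based input a vision transformer expects, the sketch below splits an image into fixed-size patches and flattens each into a token vector. This is a toy assumption about the preprocessing, not code from the project:

```python
# Sketch: split an image into non-overlapping P x P patches (ViT-style tokens).
# Illustrative only; the project's actual pipeline may tile or pad differently.

def patchify(image, patch_size):
    """image: H x W list of lists (grayscale for simplicity)."""
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0, "pad the image first"
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            # Flatten each patch into a 1-D token vector.
            patch = [image[top + i][left + j]
                     for i in range(patch_size)
                     for j in range(patch_size)]
            patches.append(patch)
    return patches

# A 4x4 "image" split into 2x2 patches yields 4 tokens of length 4.
img = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(img, 2)
```

For a genuinely high-resolution crowd image, the same idea yields thousands of tokens, which is what lets the transformer attend across the whole scene.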

Contact: Saeed Anwar,


(New) Domain adaptation for object detection:

The performance of an object detector trained on one domain (the source domain) drops significantly when it is applied directly to another domain (the target domain), which may differ in illumination, the number of objects in a scene, scene layout, and object co-occurrence. The project aims to extract features that are independent of the domain. Several techniques, such as image-to-image translation and feature-level alignment, have been developed to reduce the domain shift between the source and target domains. The students will be provided with computational resources to perform experimental evaluations.
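To make "domain shift" concrete, the toy sketch below measures the gap between source and target feature statistics as the distance between their mean feature vectors, a crude stand-in for the feature-level alignment objectives (e.g. MMD-style losses) mentioned above. The feature values are invented for illustration:

```python
# Sketch: a toy domain-gap measure between source and target feature sets.
# Feature-level alignment methods minimise statistics like this so that the
# detector cannot tell which domain a feature came from.
import math

def mean_feature(features):
    n, d = len(features), len(features[0])
    return [sum(f[i] for f in features) / n for i in range(d)]

def domain_gap(source_feats, target_feats):
    mu_s = mean_feature(source_feats)
    mu_t = mean_feature(target_feats)
    # Euclidean distance between the two domain means.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_s, mu_t)))

source = [[1.0, 0.0], [3.0, 0.0]]  # mean (2, 0), hypothetical features
target = [[2.0, 4.0], [2.0, 2.0]]  # mean (2, 3)
gap = domain_gap(source, target)
```

Real alignment losses compare richer statistics than the mean, but the principle is the same: drive this gap toward zero while keeping the detection loss low.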

Contact: Saeed Anwar,


Few-shot Action Recognition:

One of the most critical tasks in video understanding is recognizing human actions. It has numerous real-world applications, including behaviour interpretation, video retrieval, human-robot interaction, gaming, and entertainment. Human action understanding includes recognizing, localizing, and predicting human behaviours. The task of identifying human actions in a video is termed video action recognition. Knowledge about an action is usually inferred by learning from labelled data in a supervised manner. Even as more complex models are built, the number of action classes keeps growing. Annotating videos for this growing number of classes is a cumbersome task, which limits the scalability of fully supervised action recognition to a large number of categories. To address this problem, we need a video action recognition framework that can work in a low-data regime, where some categories in the training set have only a few examples.
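Few-shot methods are typically trained on episodes: each episode samples N classes with K labelled "support" clips and a handful of "query" clips to classify. The sketch below shows this N-way K-shot sampling; the class names and dataset layout are hypothetical:

```python
# Sketch: sampling an N-way K-shot episode, the standard training unit in
# few-shot recognition. Dataset contents here are made up for illustration.
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """dataset: dict mapping class label -> list of video clip ids."""
    classes = rng.sample(sorted(dataset), n_way)
    support, query = {}, {}
    for c in classes:
        clips = rng.sample(dataset[c], k_shot + n_query)
        support[c] = clips[:k_shot]   # labelled examples the model adapts from
        query[c] = clips[k_shot:]     # examples it must classify
    return support, query

videos = {f"action_{i}": [f"clip_{i}_{j}" for j in range(10)] for i in range(20)}
support, query = sample_episode(videos, n_way=5, k_shot=1, n_query=3,
                                rng=random.Random(0))
```

Training on many such episodes teaches the model to generalise from K examples per class, mirroring the low-data regime it will face at test time.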

Contact: Ali Cheraghian,


Long-tailed object detection:

Object detection is one of the most vital yet challenging tasks in computer vision. Recent progress is chiefly driven by large-scale datasets that are manually balanced, such as the PASCAL VOC and COCO datasets. In reality, however, the distribution of object classes is typically long-tailed. For instance, in underwater environmental monitoring applications we want to identify many species of corals, algae, fish, and so on, with new species showing up on the go; the distribution of these species is naturally long-tailed. We aim to design an object detection model that can handle long-tailed data.

Contact: Ali Cheraghian,


Few-shot incremental learning:

In a real-world scenario, we may not have access to information about all possible classes when the system is first trained. It is more realistic to assume that we will obtain class-specific data incrementally over time. In such a scenario, we require that our model adapt to new information as it becomes available without hampering performance on what has been learnt so far. Although this is a natural task for human beings, it is difficult for an intelligent machine because of catastrophic forgetting: a trained model tends to forget old tasks when learning new information. Furthermore, in many applications, new tasks (sets of novel classes) come with only a few examples per class, making class-incremental learning even more challenging. This setting is called few-shot class-incremental learning (FSCIL). The main challenges in FSCIL are catastrophic forgetting of already acquired knowledge and overfitting the network to the novel classes. The aim of this work is to design a framework that addresses the FSCIL problem.
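A common tool against catastrophic forgetting is knowledge distillation: while learning new classes, the new model is penalised for drifting from the old model's predictions on the base classes. The sketch below computes such a penalty on toy logits; it illustrates one standard regulariser, not the project's eventual method:

```python
# Sketch: a distillation penalty that measures how far the updated model's
# predictions drift from the frozen old model's predictions on base classes.
import math

def softmax(logits, t=1.0):
    m = max(logits)
    exps = [math.exp((x - m) / t) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(old_logits, new_logits, t=2.0):
    """KL(old || new) over temperature-softened distributions."""
    p = softmax(old_logits, t)
    q = softmax(new_logits, t)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Zero when the new model agrees with the old one; grows as it drifts.
same = distill_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
drift = distill_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

In FSCIL this term is typically combined with the cross-entropy loss on the few novel-class examples, balancing stability (not forgetting) against plasticity (learning the new classes).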

Contact: Ali Cheraghian,


Data-efficient Self-supervised Learning

Self-supervised pretraining has achieved significant success in learning image feature representations. Most recent methods are trained using large-scale general image datasets like ImageNet. However, for many specific domains, such as underwater, soil, biology, and hyperspectral imaging, collecting large-scale images is extremely expensive and difficult. In this project, we will explore a new data-efficient self-supervised learning method for small-scale image datasets. 
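Many recent self-supervised methods are contrastive: two augmented views of the same image are pulled together in feature space while other images are pushed away. The sketch below shows the core InfoNCE/NT-Xent-style term on toy embeddings; the project may well use a different pretext task, so treat this as a representative baseline only:

```python
# Sketch: an InfoNCE-style contrastive term. The loss is small when the
# anchor is closest to its positive (the other view of the same image) and
# large when a negative is closer. Embeddings here are invented.
import math

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce(anchor, positive, negatives, temp=0.1):
    sims = [cos(anchor, positive)] + [cos(anchor, n) for n in negatives]
    exps = [math.exp(s / temp) for s in sims]
    return -math.log(exps[0] / sum(exps))

good = info_nce([1, 0], [0.9, 0.1], [[0, 1], [-1, 0]])  # positive is close
bad = info_nce([1, 0], [0, 1], [[0.9, 0.1], [-1, 0]])   # positive is far
```

The data-efficiency question this project asks is how well such objectives behave when the pool of images (and hence of negatives and augmentations) is small.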

Contact: Weihao Li,


Open World Object Detection and Segmentation

Object detection models are often trained inside a closed-world paradigm, for example, detecting and segmenting objects from a fixed set of categories. However, our visual world is naturally open, dynamic, vast, and unpredictable. Algorithms developed in the closed world cannot adapt to, or generalize robustly and efficiently in, the open world. This project will develop new object detection and segmentation algorithms for open-world vision systems.

Contact: Weihao Li,


Image Segmentation Evaluation Measure Analysis

This project will perform an extensive analysis across different error types and object sizes for image segmentation evaluation measures. Then we will design a new evaluation measure for image segmentation tasks with desirable characteristics, such as symmetry w.r.t. prediction/ground truth pairs and balanced responsiveness across scales. 
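As a concrete example of one desirable characteristic named above, the sketch below computes intersection-over-union on binary masks and checks that it is symmetric: swapping prediction and ground truth leaves the score unchanged. This is a toy check of an existing measure, not the new measure the project will design:

```python
# Sketch: IoU on binary masks represented as sets of foreground pixel
# coordinates. IoU is symmetric in its two arguments by construction,
# since set intersection and union are both commutative.

def iou(mask_a, mask_b):
    a, b = set(mask_a), set(mask_b)
    return len(a & b) / len(a | b)

pred = {(0, 0), (0, 1), (1, 0)}  # predicted foreground pixels
gt = {(0, 1), (1, 0), (1, 1)}    # ground-truth foreground pixels
score = iou(pred, gt)            # 2 shared pixels / 4 in the union = 0.5
```

An asymmetric measure, by contrast, would score boundary errors differently depending on which mask is treated as the prediction, which is exactly the kind of behaviour the analysis in this project would surface.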

Contact: Weihao Li,


Vision Transformer: 

Recently, transformer-based language models have achieved superhuman performance on several Natural Language Processing (NLP) tasks, including language translation, question answering, summarization, and sentiment analysis. These state-of-the-art models are mostly inspired by Google's BERT [1] or OpenAI's GPT [2] transformer architectures. Following the success of transformer-based models in NLP, there has been significant interest in building efficient transformer models for computer vision tasks, commonly dubbed Vision Transformers [3].

One of the major challenges when working with vision transformers is the need for gigantic datasets and expensive training. In this project, the student will be tasked with applying transfer learning, meta-learning, and multi-task learning to effectively train and build vision transformers on a reasonably sized dataset and computational budget. The target application of the proposed vision transformer could be any computer vision task, such as object detection, semantic segmentation, or instance segmentation.
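A back-of-the-envelope sketch of why transfer learning fits a modest budget: with a pretrained backbone frozen, only the task head's parameters are trained. The figures below (a roughly ViT-Base-sized backbone, a 20-class linear head on 768-dimensional features) are assumptions for illustration, not a prescribed architecture:

```python
# Sketch: trainable-parameter counts for full fine-tuning vs. a linear probe
# on a frozen pretrained backbone. All sizes are assumed round numbers.

def linear_params(d_in, d_out):
    return d_in * d_out + d_out  # weight matrix + bias vector

backbone_params = 86_000_000          # ~ViT-Base scale, assumed
head_params = linear_params(768, 20)  # hypothetical 20-class head

full_finetune = backbone_params + head_params  # everything trainable
linear_probe = head_params                     # backbone frozen
ratio = full_finetune / linear_probe
```

Fitting thousands of head parameters instead of tens of millions is what makes a reasonably sized dataset viable; meta-learning and multi-task learning attack the same data-hunger problem from other directions.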


[1] Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

[2] Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).

[3] Khan, Salman, et al. "Transformers in Vision: A Survey." arXiv preprint arXiv:2101.01169 (2021).

Contact: Moshiur Farazi,


Updated: 10 August 2021 / Responsible Officer: Dean, CECS / Page Contact: CECS Marketing