Fast Parameter-free Clustering using Enhanced Iterative Label Spreading

People

External Member

Amanda Parker (Primary Supervisor), School of Computing

Description

Iterative Label Spreading (ILS) is an unsupervised learning algorithm that overcomes the challenges of applying clustering methods appropriately and effectively to high-dimensional scientific data. It was developed specifically for small (<10e5) materials science data sets and is based on a general definition of a cluster and of cluster result quality. Even in simple two-dimensional cases, where clustering results can be inspected visually, common methods such as k-Means clustering, Ward hierarchical clustering, and DBSCAN can fail to give the expected result. ILS, on the other hand, can be used both to perform clustering and to assess a clustering result found by any method, and it does not require pre-defined hyperparameters. The trade-off is scaling, which is limited by the iterative nature of the algorithm. The goal of this project is to combine agglomerative clustering with ILS to improve scaling while maintaining the integrity and quality of the clustering result.
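To give a feel for the iterative scheme and why it scales poorly, here is a minimal pure-Python sketch of label spreading. It is not the project's ILS software. It assumes that each iteration labels the single unlabelled point closest to any already-labelled point, copies that neighbour's label, and records the distance at which the point was reached. The function name `iterative_label_spreading`, its arguments, and the toy data are hypothetical and chosen only for illustration.

```python
import math


def iterative_label_spreading(points, seeds):
    """Spread labels from seed points to every other point, one point per iteration.

    points: list of coordinate tuples.
    seeds:  non-empty dict mapping a point index to its initial label.
    Returns (labels, order, r_min): the final labels, the order in which
    unlabelled points were labelled, and the distance at which each was reached.
    """
    labels = dict(seeds)
    unlabelled = set(range(len(points))) - set(labels)

    # For each unlabelled point, remember (distance, index) of its nearest labelled point.
    nearest = {
        i: min((math.dist(points[i], points[j]), j) for j in labels)
        for i in unlabelled
    }

    order, r_min = [], []
    while unlabelled:
        # Assumed rule: label the unlabelled point closest to any labelled point.
        i = min(unlabelled, key=lambda k: nearest[k][0])
        distance, j = nearest[i]
        labels[i] = labels[j]
        unlabelled.remove(i)
        order.append(i)
        r_min.append(distance)

        # The newly labelled point may now be the nearest labelled neighbour of others.
        for k in unlabelled:
            d_new = math.dist(points[k], points[i])
            if d_new < nearest[k][0]:
                nearest[k] = (d_new, i)

    return labels, order, r_min


# Toy data: two well-separated groups in two dimensions, one seed in each.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (5.2, 5.1)]
labels, order, r_min = iterative_label_spreading(points, {0: "a", 3: "b"})
```

Because one point is labelled per iteration and every iteration scans the remaining unlabelled points, this sketch does work quadratic in the number of points, which is one way to read the scaling trade-off mentioned above. Under the same assumptions, starting from a single seed produces a large jump in `r_min` when the spreading crosses from one group to the other, which hints at how such a distance sequence might be used to assess a clustering result.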

The Primary Supervisor for this project is Dr Amanda Parker, who can be contacted at amanda.parker@anu.edu.au

Goals

To create a new version of the ILS software and profile the improvements in performance.

Requirements

Python programming and experience in data science and machine learning are essential (for example, from courses such as COMP3720, COMP4660, COMP4670, COMP6670, or COMP8420). Familiarity with libraries such as scikit-learn is desirable.

Gain

This can be a 12cp or a 24cp project.

Keywords

machine learning, clustering, software engineering, data science

Updated: 10 August 2021 / Responsible Officer: Dean, CECS / Page Contact: CECS Marketing