Background
Vector databases and Approximate Nearest Neighbour (ANN) search underpin much of modern AI, from LLM retrieval systems to recommendation engines, by enabling fast similarity search across massive datasets. With the rapid adoption of Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the technical demands on vector search have changed fundamentally, and algorithm evaluation needs fresh approaches to match.
Research Challenge
Current ANN benchmarks remain stuck in the past. Most rely on decade-old datasets like SIFT and GIST that poorly reflect the complex distributions and high dimensionality of modern LLM embeddings. Worse still, benchmarking practices often employ inconsistent parameter selection methodologies, leading to potentially misleading conclusions about algorithm performance. We need a rigorous, reproducible benchmarking framework specifically designed for the LLM era.
Project Aim
Design, implement and evaluate a comprehensive framework in C++ for fair and insightful benchmarking of ANN algorithms on modern, high-dimensional datasets. This project will produce both valuable research contributions and a practical tool for the vector database community.
Key Research Tasks
Task 1: Dataset Generation & Analysis
Task 2: Benchmarking Test
Task 3: Parameter Impact Test & Novel Algorithm Design
Skills you’ll harness
Through this project, you’ll gain advanced C++ programming skills including modern standards, SIMD optimization, and cache-efficient implementations. You’ll develop expertise in high-performance computing and algorithm optimization techniques while cultivating rigorous experimental methodology and advanced data analysis capabilities. This work will provide you with a deep understanding of vector database internals and ANN algorithms, along with valuable skills in high-dimensional data visualization and analysis.
Expected Outcomes
The project will deliver a comprehensive benchmarking framework that provides value to the broader vector database community. You’ll produce a thesis with publication-quality research findings and make valuable contributions to open-source ANN libraries. The skills you develop will be highly transferable and sought after in AI/ML industry roles, positioning you well for future career opportunities.
What we’re looking for
Essential: strong C++, a solid data structures and algorithms background, and eligibility for enrolment in a 24-unit Honours/Masters project or MPhil
Desirable: basic understanding of vector embeddings; experience with Linux
Ready to apply?
Email mengxuan.zhang@anu.edu.au with your CV, academic transcript, and a brief statement of interest.