Vector Search for the LLM Era

This project offers an exceptional opportunity to work at the intersection of databases, algorithms, and modern AI systems—developing skills that are in high demand across industry and academia.

Mengxuan Zhang

10 May 2025

Background

Vector databases and Approximate Nearest Neighbour (ANN) search underpin many modern AI systems, from LLM retrieval pipelines to recommendation engines, by enabling fast similarity search across massive datasets. With the rapid adoption of Large Language Models and Retrieval-Augmented Generation (RAG), the technical demands on vector search have fundamentally changed, and algorithm evaluation needs fresh approaches to match.

Research Challenge

Current ANN benchmarks remain stuck in the past. Most rely on decade-old datasets like SIFT and GIST that poorly reflect the complex distributions and high dimensionality of modern LLM embeddings. Worse still, benchmarking practices often employ inconsistent parameter selection methodologies, leading to potentially misleading conclusions about algorithm performance. We need a rigorous, reproducible benchmarking framework specifically designed for the LLM era.

Project Aim

Design, implement and evaluate a comprehensive framework in C++ for fair and insightful benchmarking of ANN algorithms on modern, high-dimensional datasets. This project will produce both valuable research contributions and a practical tool for the vector database community.
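As a rough illustration of the kind of building block such a framework rests on, the sketch below computes exact nearest neighbours by linear scan and uses them as ground truth for recall@k, a common accuracy measure in ANN evaluation. This is a minimal sketch only, not the project's design: the names `Vec`, `squared_l2`, `exact_knn` and `recall_at_k` are hypothetical, squared Euclidean distance is an assumed metric, and the approximate result list is assumed to contain no duplicate ids.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical type alias for illustration: one dense embedding vector.
using Vec = std::vector<float>;

// Squared Euclidean distance between two vectors of equal dimension.
// (Assumed metric; a real benchmark may also use cosine or inner product.)
float squared_l2(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Exact k nearest neighbours by linear scan over the whole base set.
// Returns base indices ordered from nearest to farthest.
std::vector<std::size_t> exact_knn(const std::vector<Vec>& base,
                                   const Vec& query, std::size_t k) {
    std::vector<std::pair<float, std::size_t>> dist;
    dist.reserve(base.size());
    for (std::size_t i = 0; i < base.size(); ++i) {
        dist.emplace_back(squared_l2(base[i], query), i);
    }
    k = std::min(k, dist.size());
    // Only the first k entries need to be sorted.
    std::partial_sort(dist.begin(),
                      dist.begin() + static_cast<std::ptrdiff_t>(k),
                      dist.end());
    std::vector<std::size_t> ids(k);
    for (std::size_t i = 0; i < k; ++i) {
        ids[i] = dist[i].second;
    }
    return ids;
}

// recall@k: fraction of the true neighbours that the approximate result found.
// Assumes `approx` holds no duplicate ids.
double recall_at_k(const std::vector<std::size_t>& truth,
                   const std::vector<std::size_t>& approx) {
    if (truth.empty()) {
        return 1.0;
    }
    std::size_t hits = 0;
    for (std::size_t id : approx) {
        if (std::find(truth.begin(), truth.end(), id) != truth.end()) {
            ++hits;
        }
    }
    return static_cast<double>(hits) / static_cast<double>(truth.size());
}
```

In use, one would call `exact_knn` once per query to build the ground truth, then pass each ANN algorithm's returned ids to `recall_at_k` and report recall alongside query throughput. The linear scan costs time proportional to the dataset size per query, so it serves only as a reference, not as a competitor.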

Key Research Tasks

Task 1: Dataset Generation & Analysis

Task 2: Benchmarking Test

Task 3: Parameter Impact Test and Novel Algorithm Design

Skills you’ll harness

Through this project, you’ll gain advanced C++ programming skills including modern standards, SIMD optimization, and cache-efficient implementations. You’ll develop expertise in high-performance computing and algorithm optimization techniques while cultivating rigorous experimental methodology and advanced data analysis capabilities. This work will provide you with a deep understanding of vector database internals and ANN algorithms, along with valuable skills in high-dimensional data visualization and analysis.

Expected Outcomes

The project will deliver a comprehensive benchmarking framework that provides value to the broader vector database community. You’ll produce a thesis with publication-quality research findings and make valuable contributions to open-source ANN libraries. The skills you develop will be highly transferable and sought after in AI/ML industry roles, positioning you well for future career opportunities.

What we’re looking for

Essential: strong C++ skills, a solid background in data structures and algorithms, and eligibility for enrolment in a 24-unit Honours/Masters project or MPhil

Desirable: basic understanding of vector embeddings; experience with Linux

Ready to apply?

Email mengxuan.zhang@anu.edu.au with your CV, academic transcript and a brief statement of interest.
