Research Projects: Retrieval Infrastructure for LLMs

How do AI agents remember, retrieve, and reason over massive knowledge bases? The answer lies in vector databases, and this project puts you at the center of building them.

Embedding unstructured data (e.g., text, images, audio) as high-dimensional vectors is an increasingly popular way to represent, manage, and use data from diverse sources. Systems that store and index such large-scale, high-dimensional vectors are called vector databases. They serve as indispensable external knowledge repositories that provide AI models with contextually relevant information (e.g., Retrieval-Augmented Generation (RAG) in LLMs), act as infrastructure for AI agents (e.g., supporting retrieval-augmented planning), and play important roles in modern recommendation (e.g., music recommendation at Spotify). To retrieve information efficiently, Approximate Nearest Neighbor (ANN) search uses an index to return vectors that are semantically similar to a given query vector. Nevertheless, the high dimensionality of vector datasets introduces the curse of dimensionality, which poses great challenges to ANN index performance.
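To make the trade-off concrete, here is a minimal sketch of the exact baseline that ANN indexes try to approximate, together with recall@k, a common way to score an approximate answer against it. The function names, the random data, and the "search only half the data" stand-in for an approximate index are all assumptions made for illustration; they do not come from any particular vector database.

```python
import heapq
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_knn(query, vectors, k):
    """Exact top-k by scanning every vector: O(n * d) work per query."""
    return heapq.nsmallest(k, range(len(vectors)), key=lambda i: l2(query, vectors[i]))


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate answer recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


random.seed(0)
dim, n, k = 32, 1000, 10
vectors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
query = [random.gauss(0, 1) for _ in range(dim)]

truth = exact_knn(query, vectors, k)

# Hypothetical stand-in for an approximate index: search a random half of the data.
sample = random.sample(range(n), n // 2)
approx = heapq.nsmallest(k, sample, key=lambda i: l2(query, vectors[i]))

print("recall@10 =", recall_at_k(approx, truth))
```

The stand-in does half the distance computations but may miss true neighbors; a real ANN index aims for a much better balance between work saved and recall lost.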

For more background on vector databases, see the Facebook AI Similarity Search (Faiss) introductions on GitHub and Pinecone.

What you can get from working on this project:

One-to-one weekly mentoring, with hands-on guidance on how to do research
Potential top-tier publications and access to my international research network. Joining one of these projects is therefore a potential pathway to a PhD scholarship in my research lab.
A better chance of finding a job. For example, many jobs listed on Indeed require vector-database-related skills. Here are other examples (link1, link2, link3, link4).

Below are several research projects (listed in no particular order) aimed at improving ANN search performance.

Project 1: Effective Graph Construction for Approximate Nearest Neighbor Search (two positions)

One type of index is a similarity graph, in which similar vectors are retrieved via graph traversal. The quality of the similarity graph is critical, as it largely determines search performance, i.e., efficiency and accuracy. This project will therefore focus on designing a novel, effective graph construction method to improve ANN search performance.
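As a rough illustration of retrieval via graph traversal, the sketch below builds a brute-force k-nearest-neighbor graph and runs a greedy walk over it. This is a simplified textbook-style routine, not the construction or search procedure of any of the works listed below; the names build_knn_graph and greedy_search, the degree m, and the random data are assumptions.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_knn_graph(vectors, m):
    """Brute-force kNN graph: link every node to its m nearest neighbors."""
    graph = {}
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        others.sort(key=lambda j: l2(v, vectors[j]))
        graph[i] = others[:m]
    return graph


def greedy_search(graph, vectors, query, entry):
    """Hop to the neighbor closest to the query until no neighbor is closer."""
    current, current_dist = entry, l2(query, vectors[entry])
    hops = 0
    while True:
        best, best_dist = current, current_dist
        for nb in graph[current]:
            d = l2(query, vectors[nb])
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:  # local minimum: every neighbor is farther away
            return current, current_dist, hops
        current, current_dist = best, best_dist
        hops += 1


random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(300)]
graph = build_knn_graph(vectors, m=8)
query = [random.random() for _ in range(8)]

found, dist, hops = greedy_search(graph, vectors, query, entry=0)
exact = min(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
print("greedy result:", found, "exact nearest:", exact, "hops:", hops)
```

The walk stops at a node whose neighbors are all farther from the query, which need not be the true nearest neighbor. Which edges the graph contains decides how often that happens and how many hops are needed, and that is the sense in which graph quality drives both accuracy and efficiency.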

Requirements: basic understanding of graph theory

Related works on similarity graph construction and graph-based ANN search:

Fu, Cong, et al. “Fast approximate nearest neighbor search with the navigating spreading-out graph.” PVLDB 2019.

Wang, Mengzhao, et al. “A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search.” PVLDB 2021.

Malkov, Yu A., and Dmitry A. Yashunin. “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.” TPAMI 2018.

Dong, Wei, Moses Charikar, and Kai Li. “Efficient k-nearest neighbor graph construction for generic similarity measures.” WWW 2011.

Project 2: Intrinsic-dimension Aware ANN Indexing and Search (one position)

Although unstructured data is embedded into high-dimensional vectors, the intrinsic dimensionality of these vectors may be lower than their actual dimensionality, offering an opportunity to optimize ANN search. Therefore, the objective of this project is to improve ANN search performance by exploiting the intrinsic dimensionality of high-dimensional datasets, e.g., by transforming them into their intrinsic representation.
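The gap between actual and intrinsic dimensionality can be seen on synthetic data. The sketch below places points from a 2-D square into a 6-D space through a fixed linear map and then estimates the intrinsic dimension with a Two-NN-style maximum-likelihood estimator, which uses only the ratio of each point's second to first nearest-neighbor distance. The estimator is swapped in here for illustration and is not taken from the works listed below; the function name, the coefficients, and the sample size are assumptions.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_nn_intrinsic_dim(points):
    """Estimate intrinsic dimension as N / sum(log(r2 / r1)) over all points,
    where r1 and r2 are the distances to the first and second nearest neighbor."""
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(l2(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates, where the ratio is undefined
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


random.seed(0)
# Each output coordinate is a fixed linear combination a*x + b*y of two latent values,
# so the data lives on a 2-D plane inside a 6-D space.
coeffs = [(1, 0), (0, 1), (1, 1), (1, -1), (2, 1), (1, 2)]
points = []
for _ in range(400):
    x, y = random.random(), random.random()
    points.append([a * x + b * y for a, b in coeffs])

estimate = two_nn_intrinsic_dim(points)
print("actual dimension:", len(points[0]))
print("estimated intrinsic dimension:", round(estimate, 2))
```

On data like this the estimate should land near 2 rather than 6, which suggests that an index working in the lower-dimensional representation could avoid part of the cost imposed by the ambient dimension.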

Requirements: passion for dimension reduction or related theoretical study

Related works on ANN search and intrinsic dimension exploration:

Harwood, Ben, and Tom Drummond. “Fanng: Fast approximate nearest neighbour graphs.” CVPR 2016.

Buyanov, Igor, Vasiliy Yadrintsev, and Ilya Sochenkov. “Using Autoencoders to improve Nearest Neighbor Search on large Datasets.”

Project 3: Query-Aware ANN Search (one position)

In real-life applications, submitted queries tend to be highly skewed, following a head-heavy, long-tail distribution. Based on this phenomenon, we can identify the frequently queried regions of the vector space, whose results can be reused to accelerate subsequent queries. Therefore, this project focuses on speeding up ANN query answering through query awareness.
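One simple way to picture this kind of reuse is a result cache keyed by a coarse grid cell of the query vector, so that repeated or near-identical queries skip the search. The sketch below is an illustrative assumption, not the method of the works listed below: the class name CachedSearcher, the cell size, the LRU eviction policy, and the synthetic skewed workload are all hypothetical, and a brute-force scan stands in for the underlying ANN index.

```python
import heapq
import math
import random
from collections import OrderedDict


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class CachedSearcher:
    """Brute-force top-k search fronted by an LRU cache over quantized queries."""

    def __init__(self, vectors, capacity=128, cell=0.25):
        self.vectors = vectors
        self.capacity = capacity
        self.cell = cell            # grid cell width used to bucket queries
        self.cache = OrderedDict()  # key -> list of result ids, oldest first
        self.hits = 0
        self.misses = 0

    def _key(self, query, k):
        # Queries falling into the same grid cell share one cache entry.
        return (k, tuple(math.floor(x / self.cell) for x in query))

    def search(self, query, k):
        key = self._key(query, k)
        if key in self.cache:
            self.hits += 1
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        self.misses += 1
        ids = range(len(self.vectors))
        result = heapq.nsmallest(k, ids, key=lambda i: l2(query, self.vectors[i]))
        self.cache[key] = result
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return result


random.seed(1)
vectors = [[random.random() for _ in range(4)] for _ in range(500)]
hot = [[random.random() for _ in range(4)] for _ in range(5)]

searcher = CachedSearcher(vectors)
for _ in range(200):
    # Skewed workload: about 90% of queries come from a small set of hot queries.
    if random.random() < 0.9:
        q = random.choice(hot)
    else:
        q = [random.random() for _ in range(4)]
    searcher.search(q, k=5)

print("hits:", searcher.hits, "misses:", searcher.misses)
```

A cached answer is exact only for the query that created it; a different query in the same cell gets an approximate answer. Choosing what to cache and how to bound that error is the kind of question a query-aware design has to settle.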

Requirements: foundation in data structures, algorithms, and embedding models

Related works on cache-based query speed-up:

Li, Lingli, Zhanyu He, and Zhuo Zhang. “ANN-Cache: Accelerating Approximate Nearest Neighbor Search via Caching.” Authorea Preprints (2025).

Zeng, Ximu, et al. “LIRA: A Learning-based Query-aware Partition Framework for Large-scale ANN Search.” Proceedings of the ACM on Web Conference 2025.

Project 4: GPU-Accelerated ANN Search (one position)

Real-life applications that rely on vector DBs (e.g., at OpenAI, Copilot, Spotify, Amazon, Meta) place high demands on query efficiency: hundreds of thousands of queries need to be processed per second on million- or billion-scale datasets. To meet real-time response requirements, this project will explore how to accelerate ANN search using GPUs, which offer strong parallel computation.

Requirements: GPU programming skills

Related works on GPU-accelerated computation and ANN search:

Groh, Fabian, et al. “Ggnn: Graph-based gpu nearest neighbor search.” IEEE Transactions on Big Data 9.1 (2022): 267-279.

Zhao, Weijie, Shulong Tan, and Ping Li. “Song: Approximate nearest neighbor search on gpu.” ICDE 2020.

Ootomo, Hiroyuki, et al. “Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus.” ICDE 2024.

Project 5: Temporal/Spatial/Hybrid ANN Search (two positions)

General ANN search focuses on finding the top-k vectors most similar to the query vector. In real-life applications, vectors may be associated with temporal attributes or validity periods (e.g., answering time-relevant questions), spatial attributes (e.g., job finding or activity recommendations with location constraints), or other attributes (e.g., shoe color). Answering queries under such temporal, spatial, or other constraints is of great importance for advancing the performance of ANN algorithms.
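Two baseline strategies for such constrained queries are post-filtering (search first, then drop results that violate the constraint) and pre-filtering (apply the constraint first, then search among the survivors). The sketch below contrasts them with brute-force search on synthetic items carrying a hypothetical "year" attribute; the function names, the attribute, and the data are assumptions for illustration only.

```python
import heapq
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def post_filter(query, items, k, predicate):
    """Take the unconstrained top-k, then drop items that fail the predicate."""
    top = heapq.nsmallest(k, items, key=lambda it: l2(query, it["vec"]))
    return [it["id"] for it in top if predicate(it)]


def pre_filter(query, items, k, predicate):
    """Keep only items that pass the predicate, then take the top-k among them."""
    allowed = [it for it in items if predicate(it)]
    top = heapq.nsmallest(k, allowed, key=lambda it: l2(query, it["vec"]))
    return [it["id"] for it in top]


def is_recent(item):
    """Example temporal constraint: the item is from 2023 or later."""
    return item["year"] >= 2023


random.seed(0)
items = [
    {"id": i, "vec": [random.random() for _ in range(8)], "year": random.randint(2015, 2024)}
    for i in range(1000)
]
query = [random.random() for _ in range(8)]
k = 10

post = post_filter(query, items, k, is_recent)
pre = pre_filter(query, items, k, is_recent)
print("post-filter returned", len(post), "results; pre-filter returned", len(pre))
```

Post-filtering reuses an ordinary index but can return fewer than k results when the constraint is selective, while pre-filtering always fills the result list but, in this brute-force form, gives up the index. Handling the constraint inside the index itself is where the research questions lie.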

Requirements: foundation in data structures and algorithms