How do AI agents remember, retrieve, and reason over massive knowledge bases? The answer lies in vector databases, and this project puts you at the center of building them.
Embedding unstructured data (e.g., text, images, audio) as high-dimensional vectors is an increasingly popular way to represent, manage, and use data from diverse sources. Systems that store and index such large-scale, high-dimensional vectors are called vector databases. They serve as indispensable external knowledge repositories that supply AI models with contextually relevant information (e.g., Retrieval-Augmented Generation (RAG) in LLMs), act as infrastructure for AI agents (e.g., supporting retrieval-augmented planning), and play an important role in modern recommendation (e.g., music recommendation at Spotify). To make information retrieval efficient, Approximate Nearest Neighbor (ANN) search uses an index to return vectors that are semantically similar to a given query vector. Nevertheless, the high dimensionality of vector datasets brings the curse of dimensionality, which poses great challenges to ANN index performance.
For more background on vector databases, see the introductions to Facebook AI Similarity Search (Faiss) on GitHub and on Pinecone.
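ANN results are usually judged against the exact nearest neighbors, so it helps to see what the exact baseline looks like. The following is a minimal C++ sketch, assuming squared Euclidean distance and a small in-memory dataset. The names `exact_knn` and `recall_at_k` are illustrative and do not come from any library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance; both vectors must have the same dimension.
float squared_l2(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Exact top-k by linear scan: ids of the k closest vectors, nearest first.
std::vector<std::size_t> exact_knn(const std::vector<Vec>& data,
                                   const Vec& query, std::size_t k) {
    std::vector<std::pair<float, std::size_t>> scored;
    scored.reserve(data.size());
    for (std::size_t id = 0; id < data.size(); ++id) {
        scored.emplace_back(squared_l2(data[id], query), id);
    }
    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end());
    std::vector<std::size_t> ids;
    ids.reserve(k);
    for (std::size_t i = 0; i < k; ++i) ids.push_back(scored[i].second);
    return ids;
}

// Fraction of the exact neighbors that an approximate result also contains.
double recall_at_k(const std::vector<std::size_t>& approx,
                   const std::vector<std::size_t>& exact) {
    if (exact.empty()) return 1.0;
    std::size_t hits = 0;
    for (std::size_t id : exact) {
        if (std::find(approx.begin(), approx.end(), id) != approx.end()) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(exact.size());
}
```

The linear scan costs time proportional to the dataset size for every query, which is the cost an ANN index tries to avoid while keeping recall close to 1.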
What you can get from working on this project:
- One-to-one weekly mentoring, with hands-on guidance on how to do research
- Potential top-tier publications and access to my international research network. Joining one of these projects is therefore a potential pathway to a PhD scholarship in my research lab.
- A better chance of finding a job. For example, many jobs listed on Indeed require vector-database-related skills. Here are other examples (link1, link2, link3, link4).
Below are several research projects (listed in no particular order) aimed at improving ANN search performance.
Project 1: Effective Graph Construction for Approximate Nearest Neighbor Search (two positions)
One type of index is a similarity graph, in which similar vectors are retrieved via graph traversal. The quality of the similarity graph is critical, as it largely determines search performance, i.e., efficiency and accuracy. This project will therefore focus on designing a novel, effective graph construction method to improve ANN search performance.
Requirements: basic understanding of graph theory
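To illustrate why graph quality matters, here is a minimal C++ sketch of greedy routing on a similarity graph: starting from an entry node, it repeatedly moves to the neighbor closest to the query and stops at a local minimum. This is a simplified stand-in for the beam search that practical graph indexes use. The `Graph` type, the `greedy_search` name, and the use of squared Euclidean distance are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
// graph[i] holds the ids of the out-neighbors of node i.
using Graph = std::vector<std::vector<std::size_t>>;

// Squared Euclidean distance; both vectors must have the same dimension.
float squared_l2(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Greedy routing: returns the node where no neighbor is closer to the query.
std::size_t greedy_search(const std::vector<Vec>& data, const Graph& graph,
                          const Vec& query, std::size_t entry) {
    assert(data.size() == graph.size());
    assert(entry < data.size());
    std::size_t current = entry;
    float current_dist = squared_l2(data[current], query);
    while (true) {
        std::size_t next = current;
        float next_dist = current_dist;
        for (std::size_t nb : graph[current]) {
            const float d = squared_l2(data[nb], query);
            if (d < next_dist) {
                next = nb;
                next_dist = d;
            }
        }
        if (next == current) return current;  // local minimum reached
        current = next;                       // distance strictly decreased
        current_dist = next_dist;
    }
}
```

On five points placed on a line at 0, 1, 2, 3, 4 and linked as a chain, a query at 3.2 entered from node 0 reaches node 3. If the edge between nodes 1 and 2 is missing, the same search stops at node 1, so the result depends directly on how the graph was built.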
Related works on similarity graph construction and graph-based ANN search:
Fu, Cong, et al. “Fast approximate nearest neighbor search with the navigating spreading-out graph.” PVLDB 2019.
Wang, Mengzhao, et al. “A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search.” PVLDB 2021.
Malkov, Yu A., and Dmitry A. Yashunin. “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.” TPAMI 2018.
Dong, Wei, Moses Charikar, and Kai Li. “Efficient k-nearest neighbor graph construction for generic similarity measures.” WWW 2011.
Project 2: Intrinsic-dimension Aware ANN Indexing and Search (one position)
Although unstructured data is embedded as high-dimensional vectors, the intrinsic dimensionality of these vectors may be lower than their actual dimensionality, which offers an opportunity to optimize ANN search. The objective of this project is therefore to improve ANN search performance by exploiting the intrinsic dimensionality of high-dimensional datasets, e.g., by transforming them into their intrinsic representation.
Requirements: passion for dimension reduction or related theoretical study
Related works on ANN search and intrinsic dimension exploration:
Harwood, Ben, and Tom Drummond. “Fanng: Fast approximate nearest neighbour graphs.” CVPR 2016.
Buyanov, Igor, Vasiliy Yadrintsev, and Ilya Sochenkov. “Using Autoencoders to improve Nearest Neighbor Search on large Datasets.”
Project 3: Query-Aware ANN search (one position)
In real-life applications, submitted queries tend to be highly skewed, following a head-heavy, long-tail distribution. Based on this observation, we can identify the frequently queried region of the vector space and reuse the work done there to accelerate subsequent queries. This project therefore focuses on speeding up ANN query answering through query awareness.
Requirements: foundation in data structures, algorithms, and embedding models
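One simple way to picture query reuse is a cache that returns a stored result when a new query lands close to a previously answered one. The C++ sketch below is only an illustration of that idea under stated assumptions (squared Euclidean distance, a fixed reuse radius, a linear scan over cached queries). The `QueryCache` class is hypothetical and is not the method of the papers listed for this project.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<float>;
using Result = std::vector<std::size_t>;  // ids returned by an earlier search

// Squared Euclidean distance; both vectors must have the same dimension.
float squared_l2(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

class QueryCache {
public:
    // radius: maximum distance at which a cached result is reused (assumed).
    explicit QueryCache(float radius) : radius_sq_(radius * radius) {}

    // Remember the result that was computed for this query.
    void insert(Vec query, Result result) {
        entries_.push_back(Entry{std::move(query), std::move(result)});
    }

    // On a hit, copies the result of the closest cached query into `out`.
    bool lookup(const Vec& query, Result& out) const {
        const Entry* best = nullptr;
        float best_dist = radius_sq_;
        for (const Entry& e : entries_) {
            const float d = squared_l2(e.query, query);
            if (d <= best_dist) {
                best = &e;
                best_dist = d;
            }
        }
        if (best == nullptr) return false;
        out = best->result;
        return true;
    }

private:
    struct Entry {
        Vec query;
        Result result;
    };
    float radius_sq_;
    std::vector<Entry> entries_;
};
```

A reused result is only approximate for the new query, so the radius trades accuracy against hit rate. With a head-heavy workload, a small cache of this kind would see most of its hits on the few popular query regions.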
Related works on cache-based query speed-up:
Li, Lingli, Zhanyu He, and Zhuo Zhang. “ANN-Cache: Accelerating Approximate Nearest Neighbor Search via Caching.” Authorea Preprints (2025).
Zeng, Ximu, et al. “LIRA: A Learning-based Query-aware Partition Framework for Large-scale ANN Search.” Proceedings of the ACM on Web Conference 2025.
Project 4: GPU-accelerated ANN search (one position)
Real-life applications that rely on vector databases (e.g., at OpenAI, Copilot, Spotify, Amazon, Meta) place high demands on query efficiency: hundreds of thousands of queries per second must be processed on million- or billion-scale datasets. To meet real-time response requirements, this project will explore how to accelerate ANN search using GPUs, which offer massively parallel computation.
Requirements: GPU programming skills
Related works on GPU-accelerated computation and ANN search:
Groh, Fabian, et al. “Ggnn: Graph-based gpu nearest neighbor search.” IEEE Transactions on Big Data 9.1 (2022): 267-279.
Zhao, Weijie, Shulong Tan, and Ping Li. “Song: Approximate nearest neighbor search on gpu.” ICDE 2020.
Ootomo, Hiroyuki, et al. “Cagra: Highly parallel graph construction and approximate nearest neighbor search for gpus.” ICDE 2024.
Project 5: Temporal/spatial/hybrid ANN search (two positions)
General ANN search focuses on finding the top-k most similar vectors to the query vector. In real-life applications, vectors may also be associated with temporal attributes or validity periods (e.g., answering time-relevant questions), spatial attributes (e.g., job finding or activity recommendations with location constraints), or other attributes (e.g., shoe color). Efficiently answering queries with temporal, spatial, or other constraints is therefore of great importance for advancing ANN algorithms.
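As a reference point for such constrained queries, the following C++ sketch answers a top-k query restricted to a timestamp range by filtering first and then scanning the survivors exactly. It is a brute-force baseline under stated assumptions (an `Item` struct with a single integer timestamp, squared Euclidean distance), not an indexed solution, and the name `filtered_knn` is illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// A vector with one temporal attribute (assumed layout, for illustration).
struct Item {
    Vec vec;
    std::int64_t timestamp;
};

// Squared Euclidean distance; both vectors must have the same dimension.
float squared_l2(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Top-k among items whose timestamp lies in [t_begin, t_end], nearest first.
std::vector<std::size_t> filtered_knn(const std::vector<Item>& items,
                                      const Vec& query, std::size_t k,
                                      std::int64_t t_begin,
                                      std::int64_t t_end) {
    std::vector<std::pair<float, std::size_t>> scored;
    for (std::size_t id = 0; id < items.size(); ++id) {
        const std::int64_t t = items[id].timestamp;
        if (t < t_begin || t > t_end) continue;  // apply the constraint first
        scored.emplace_back(squared_l2(items[id].vec, query), id);
    }
    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end());
    std::vector<std::size_t> ids;
    ids.reserve(k);
    for (std::size_t i = 0; i < k; ++i) ids.push_back(scored[i].second);
    return ids;
}
```

Filtering before the scan guarantees k valid results whenever k matching items exist. Searching an unconstrained index first and filtering afterwards can return fewer than k results when the constraint is selective, which is one reason attribute-aware indexes are studied.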
Requirements: foundation in data structures and algorithms
Related reading:
Wang, Y., et al. Timestamp Approximate Nearest Neighbor Search over High-Dimensional Vector Data. ICDE 2025
Dhingra, B., et al. Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics 2022.
Project 6: Dynamic ANN search (one position)
Existing algorithms solve ANN search quite well, but they typically assume that the dataset is static. Real-life datasets, however, evolve continually: new vectors are inserted and old vectors are deleted to keep the data fresh and better support downstream applications. This project will therefore explore how to maintain high-performance ANN query processing in dynamic scenarios.
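To make the insert and delete operations concrete, here is a minimal C++ sketch of a flat index that supports insertion and lazy deletion through tombstones. It is a hypothetical illustration (the `DynamicFlatIndex` class and its methods are assumed names), and it uses a linear scan rather than a graph, so it sidesteps the index-repair problem that a real dynamic ANN index has to solve.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<float>;

// Squared Euclidean distance; both vectors must have the same dimension.
float squared_l2(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

class DynamicFlatIndex {
public:
    // Appends a vector and returns its id.
    std::size_t insert(Vec v) {
        vectors_.push_back(std::move(v));
        deleted_.push_back(false);
        return vectors_.size() - 1;
    }

    // Lazy deletion: the slot is only marked, not physically removed.
    void remove(std::size_t id) {
        assert(id < deleted_.size());
        deleted_[id] = true;
    }

    // Finds the closest live vector; returns false if none is left.
    bool nearest(const Vec& query, std::size_t& nearest_id) const {
        bool found = false;
        float best = 0.0f;
        for (std::size_t id = 0; id < vectors_.size(); ++id) {
            if (deleted_[id]) continue;  // skip tombstones
            const float d = squared_l2(vectors_[id], query);
            if (!found || d < best) {
                found = true;
                best = d;
                nearest_id = id;
            }
        }
        return found;
    }

private:
    std::vector<Vec> vectors_;
    std::vector<bool> deleted_;
};
```

Tombstones keep deletions cheap, but the deleted slots still cost memory and scan time until they are compacted. In a graph-based index, a deleted node may also sit on search paths to other nodes, so edges around it generally need repair.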
Requirements: foundation in data structures and algorithms
Related works on dynamic ANN search:
Yamashita, T., et al. How Should We Evaluate Data Deletion in Graph-Based ANN Indexes?. arXiv preprint arXiv:2512.06200.
Mishra, N., et al. Graph-based Nearest Neighbors with Dynamic Updates via Random Walks. arXiv preprint arXiv:2512.18060.
Liu, D., et al. Wolverine: Highly Efficient Monotonic Search Path Repair for Graph-Based ANN Index Updates. VLDB 2025
Note: All of these projects will be implemented in C++. Ideally, you already have some experience with C++, but if not, that’s totally fine, as long as you’re willing to learn and pick it up during the project.
Each of these projects is research-focused and conducted independently. Each project is expected to produce at least one high-quality paper if properly conducted. There are many other interesting vector database topics that are not listed on this page; you are welcome to contact me if you would like to study them.
If you are interested in one of these projects and meet the listed requirements, please first read the relevant papers, then contact Mengxuan by email, specifying the title of the project you are interested in and attaching your transcripts.