A Brief Benchmark Test of Approximate Nearest Neighbor Search in Vector Database

Introduction

In the midst of the AI revolution, most applications rely on vector embedding to capture the semantic information. The vector database is designed to index and store those vector embeddings for fast retrieval and similarity search. Approxiamate Nearest Neighbor search in vector database is a significant driver in clicks, views, and sales across several platforms like Google and Amazon. To achieve better performance of vector search, several types of algorithms have been proposed including Locality Sensitive Hashing, IVFSQ index, Hierarchical Navigable Small-world Graph. Those approaches are often compared on multiple representative datasets, but it is unclear how they perform under the same data size with different dimension or under the same dimension with different data size, and the possible reason behind the observation. Therefore, in this project, we will conduct a brief benchmark text through experimental studies on the existing indexes.

Research tasks

We will implement this project by finishing the following subtasks:

1) Familiarize with the principle of three or more indexes [1][2] for vector database

2) Generate the vector datasets

3) Implement the algorithms for index management

4) Experimental study and observation

More advanced exploration can follow by optimizing one of those algorithms or designing an algorithm selection mechanism to guide the real-life applications.

References

[1] Pan, James, Jianguo Wang, and Guoliang Li. “Vector Database Management Techniques and Systems.” In Companion of the International Conference on Management of Data (SIGMOD). 2024.

[2] Li, Wen, Ying Zhang, Yifang Sun, Wei Wang, Mingjie Li, Wenjie Zhang, and Xuemin Lin. “Approximate nearest neighbor search on high dimensional data—experiments, analyses, and improvement.” IEEE Transactions on Knowledge and Data Engineering 32, no. 8 (2019): 1475-1488.

Requirements

Background and experience in data structure and graph theory (preferred).

Programming experience in C++ is essential.

Contact

If you are interested in this project, contact Dr. Mengxuan Zhang