Are you interested in exploring recent advances in the capabilities of AI models?
What does this project do?
Introduction
Large Language Models (LLMs) have attracted significant attention for their impressive performance on tasks such as natural language processing and text generation. However, deploying these models can be expensive because they require large amounts of GPU memory. A key contributor to this memory usage is the KV cache, the intermediate key and value tensors stored on the GPU during generation, whose size grows with the input length. To address this, researchers have developed methods that reduce memory usage by retaining only the most important tokens or by applying compression techniques [1, 2]. While these approaches achieve notable performance, they still leave considerable room for improvement. In this project, we aim to explore a more effective way to compress the KV cache, reducing memory usage further and improving overall efficiency.
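To make the memory pressure concrete, the short Python sketch below estimates KV cache size as a function of input length. The configuration (32 layers, 32 KV heads, head dimension 128, fp16 storage) is an assumption roughly matching a 7B-parameter model and is used only for illustration, not a measurement of any specific system.

# Back-of-the-envelope KV cache size: two tensors (K and V) per layer,
# per head, per token. All configuration values are illustrative assumptions.
def kv_cache_bytes(seq_len, batch_size=1, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    return 2 * batch_size * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for seq_len in (1024, 8192, 32768):
    print(f"seq_len={seq_len:>6}: ~{kv_cache_bytes(seq_len) / 2**30:.1f} GiB (fp16)")

Under these assumptions the cache grows from roughly 0.5 GiB at 1K tokens to about 16 GiB at 32K tokens, which is why long-context inference quickly becomes memory-bound.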
Research Objectives
1) Investigate KV relationships (e.g., cosine-similarity comparisons between keys, possibly per layer) and the impact of different pivot token choices (see the sketch after this list)
2) Propose a more effective KV cache compression method with performance guarantees
3) Explore hybrid approaches that combine our compression method with other memory management techniques
4) Conduct a comprehensive experimental study and analyze the results
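As a starting point for objective 1), the PyTorch sketch below computes per-layer cosine similarity between every token's key vector and a chosen pivot token's key (here the first token, in the spirit of attention sinks [2]). The tensor shapes, the helper name pivot_key_similarity, and the random inputs are illustrative assumptions; in practice the keys would be taken from a model's cached key/value tensors.

import torch
import torch.nn.functional as F

def pivot_key_similarity(keys_per_layer, pivot_idx=0):
    # keys_per_layer: list of [n_heads, seq_len, head_dim] key tensors, one per layer.
    # Returns a [n_layers, seq_len] tensor of cosine similarities between each
    # token's key and the pivot token's key, averaged over attention heads.
    sims = []
    for keys in keys_per_layer:
        pivot = keys[:, pivot_idx:pivot_idx + 1, :]      # [n_heads, 1, head_dim]
        sim = F.cosine_similarity(keys, pivot, dim=-1)   # [n_heads, seq_len]
        sims.append(sim.mean(dim=0))                     # average over heads
    return torch.stack(sims)                             # [n_layers, seq_len]

# Toy example with random keys standing in for a real KV cache.
n_layers, n_heads, seq_len, head_dim = 4, 8, 16, 64
fake_keys = [torch.randn(n_heads, seq_len, head_dim) for _ in range(n_layers)]
print(pivot_key_similarity(fake_keys).shape)  # torch.Size([4, 16])

Tokens whose keys remain highly similar to the pivot across layers could be candidates for merging or more aggressive compression; characterizing such relationships is exactly what objective 1) asks for.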
Requirements
Familiarity with the Linux operating system and PyTorch
Implementation in Python is mandatory
General background in machine learning (e.g., COMP3670, COMP4670)
Background and experience in C++ and CUDA preferred
Want to join?
Make sure to describe your relevant experience.
Contact
If you are interested in this project, contact Dr. Mengxuan Zhang.
References
[1] Zhang, Zhenyu, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, et al. "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models." Advances in Neural Information Processing Systems 36 (2024).
[2] Xiao, Guangxuan, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. "Efficient Streaming Language Models with Attention Sinks." arXiv preprint arXiv:2309.17453 (2023).