Authors:
(1) Jianhui Pang, University of Macau ([email protected]); work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab;
(2) Fanghua Ye, University College London ([email protected]); work was done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab;
(3) Derek F. Wong, University of Macau;
(4) Longyue Wang, Tencent AI Lab (corresponding author).
3 Anchor-based Large Language Models
3.2 Anchor-based Self-Attention Networks
4 Experiments and 4.1 Our Implementation
4.2 Data and Training Procedure
7 Conclusion, Limitations, Ethics Statement, and References
In Section 5, we report the testing acceleration ratio following the setting of Wang et al. (2023), comparing the time difference between caching and non-caching inference. Although our method reduces the keys/values caches, requiring less space to store prefix information and improving testing time by up to 3.5×, we are still curious whether it would also enhance inference efficiency relative to conventional methods that use full caches, saving all keys/values of prefix tokens. As a supplement to Table 1, we present the testing acceleration ratio between anchor-caching and full-caching inference in Table 3. The acceleration ratios for AnLLM-EP-AnSAN and AnLLM-AC-AnSAN are highest on tasks such as HS, SCIQ, and BoolQ, and their average acceleration ratios are 1.03. These results demonstrate that our anchor-based caching approach, which retains only the keys/values caches of anchor tokens, can accelerate inference over lengthy prefix texts even when compared to conventional methods that cache all keys/values of prefix tokens.
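To make this comparison concrete, below is a minimal timing sketch rather than the paper's benchmark code: it times single-token attention steps over a full prefix cache and over a much smaller anchor-only cache, then reports the resulting acceleration ratio. The hidden size, cache lengths, and step count are illustrative assumptions, not the settings behind Table 3.

import time
import torch

D_MODEL, N_STEPS = 4096, 256                 # assumed hidden size and number of decode steps
FULL_CACHE_LEN, ANCHOR_CACHE_LEN = 2048, 64  # assumed full-cache vs. anchor-only cache lengths

def decode_with_cache(cache_len: int) -> float:
    """Time N_STEPS single-token attention steps against a cache of length cache_len."""
    keys = torch.randn(cache_len, D_MODEL)
    values = torch.randn(cache_len, D_MODEL)
    start = time.perf_counter()
    for _ in range(N_STEPS):
        query = torch.randn(1, D_MODEL)
        scores = query @ keys.T / D_MODEL ** 0.5  # (1, cache_len) attention scores
        attn = torch.softmax(scores, dim=-1)
        _ = attn @ values                         # (1, D_MODEL) attention output
    return time.perf_counter() - start

full_time = decode_with_cache(FULL_CACHE_LEN)
anchor_time = decode_with_cache(ANCHOR_CACHE_LEN)
print(f"acceleration ratio (full / anchor): {full_time / anchor_time:.2f}")

In a real model, the per-layer key/value tensors and the full forward pass dominate the cost; the sketch only isolates how the attention step scales with cache length.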
To examine the scalability of our approach, we extend the AnLLM-AC model to 13B and assess its performance on the eight question-answering benchmarks using the same evaluation strategy as before. Compared to the 7B AnLLM models in Table 1, the results in Table 4 indicate that as the model size grows, the AnLLM-AC model achieves accuracies of 67.5% and 70.0% for 0-shot and 5-shot testing, respectively, an improvement of up to 2.4%. Moreover, by incorporating anchor-based attention, the AnLLM-AC-AnSAN model achieves an average accuracy of 69.5%, a 2.0% increase. This performance gain underscores the effectiveness of our methods at larger model capacities. The consistent improvements of the AnLLM-AC model across various scenarios highlight its robustness and adaptability, and the gains of the AnLLM-AC-AnSAN model, enabled by anchor-based attention, emphasize the potential of our approaches for optimizing LLMs. Collectively, these findings point to promising avenues for future research aimed at maximizing the utility and efficiency of AnLLM.
To elaborate on how AnLLM-EP and AnLLM-AC optimize the keys/values caches during real-time inference, we reference examples from the translation task in Section 6.2. As shown in Table 5, AnLLM-EP uses the endpoint (".") as its anchor token, while AnLLM-AC uses its newly appended anchor token. During inference, both models employ auto-regressive generation, producing outputs token by token. Upon generating an anchor token (Line 16, Algorithm 1), the Reduction function (defined in Line 1) is activated, preserving the relevant caches and eliminating the others. As a result, the keys/values cache lengths are reduced to roughly the sequence length, averaging around 50 for AnLLM-EP and 35 for AnLLM-AC, as shown in Table 2.
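The following sketch illustrates this reduction loop under stated assumptions; it is not the released AnLLM implementation, and KVCache, ANCHOR_TOKENS, and next_token_fn are hypothetical names introduced only for illustration. The cache appends one entry per token and, whenever an anchor token is emitted, drops the keys/values of all non-anchor positions, mirroring the Reduction step described above.

from dataclasses import dataclass, field
from typing import Callable, List

ANCHOR_TOKENS = {"."}  # AnLLM-EP style anchor; AnLLM-AC would use its newly appended anchor token instead

@dataclass
class KVCache:
    tokens: List[str] = field(default_factory=list)      # one cached position per token
    is_anchor: List[bool] = field(default_factory=list)  # marks which positions are anchors

    def append(self, token: str) -> None:
        self.tokens.append(token)
        self.is_anchor.append(token in ANCHOR_TOKENS)

    def reduce(self) -> None:
        """Keep only anchor positions (mirrors the Reduction step of Algorithm 1)."""
        kept = [(t, a) for t, a in zip(self.tokens, self.is_anchor) if a]
        self.tokens = [t for t, _ in kept]
        self.is_anchor = [a for _, a in kept]

def generate(prompt_tokens: List[str],
             next_token_fn: Callable[[List[str]], str],
             max_new_tokens: int) -> List[str]:
    cache = KVCache()
    for tok in prompt_tokens:
        cache.append(tok)
    output = []
    for _ in range(max_new_tokens):
        tok = next_token_fn(cache.tokens)  # stand-in for one decoder step over the current cache
        cache.append(tok)
        output.append(tok)
        if tok in ANCHOR_TOKENS:           # anchor emitted: compress the finished sequence
            cache.reduce()
    return output

In a real model, each cache position would hold per-layer key/value tensors rather than a token string, but the control flow is the same: append, detect an anchor, then reduce, so the cache only ever holds the anchors of earlier sequences plus the tokens of the current, unfinished sequence.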