Authors:
(1) Xueguang Ma, David R. Cheriton School of Computer Science, University of Waterloo;
(2) Liang Wang, Microsoft Research;
(3) Nan Yang, Microsoft Research;
(4) Furu Wei, Microsoft Research;
(5) Jimmy Lin, David R. Cheriton School of Computer Science, University of Waterloo.
Pre-trained language models based on the Transformer architecture (Vaswani et al., 2017) have demonstrated impressive capabilities when fine-tuned for various downstream tasks since the advent of BERT (Devlin et al., 2019). Depending on their architecture, pre-trained Transformers can be classified into three categories: encoder-only models (Devlin et al., 2019; Liu et al., 2019; Conneau et al., 2020), encoder–decoder models (Raffel et al., 2020; Lewis et al., 2020a), and decoder-only models (Radford et al., 2018). Decoder-only models like GPT/GPT-2 have been lauded for the simplicity of their model architecture and pre-training procedures (Radford et al., 2018, 2019).
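As a rough illustration of the three families, each corresponds to a different model class in the Hugging Face transformers library (the use of this library and the specific checkpoints below are assumptions of this sketch, not models used in this work):

```python
from transformers import AutoModel, AutoModelForSeq2SeqLM, AutoModelForCausalLM

# Encoder-only (BERT-style): produces contextual token representations, no generation head.
encoder_only = AutoModel.from_pretrained("bert-base-uncased")

# Encoder-decoder (T5/BART-style): encodes an input sequence and decodes an output sequence.
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Decoder-only (GPT-style): a single causal language model trained for next-token prediction.
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")
```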
Recent research has shown that scaling up LLMs by pre-training larger decoder-only models on larger and higher-quality corpora can significantly enhance model capabilities for general-purpose NLP tasks such as question answering and code generation (Wei et al., 2022; Chen et al., 2021). These capabilities can be further enhanced by fine-tuning the pre-trained LLMs on instruction-following data using reinforcement learning from human feedback (RLHF). InstructGPT (Ouyang et al., 2022) and GPT-4 (OpenAI, 2023) are two popular representatives of this class of models. Among the many open-source large language models, LLaMA (Touvron et al., 2023a,b) is among the most recent and best-performing across a variety of tasks.
While multi-stage retrieval pipelines date back well over a decade (Matveeva et al., 2006; Cambazoglu et al., 2010; Wang et al., 2011), they have benefited immensely from pre-trained language models such as BERT in recent years, starting with the monoBERT reranking model (Nogueira and Cho, 2019). Nogueira et al. (2019) proposed a multi-stage retrieval pipeline that employs a BM25 retriever followed by two BERT-based reranking stages, demonstrating the effectiveness of pre-trained language models for reranking. RankLLaMA follows the same basic design as monoBERT. The dense passage retriever (DPR; Karpukhin et al., 2020) subsequently fine-tuned BERT to replace the BM25 retriever with a dense retrieval model in a bi-encoder design: text is encoded into low-dimensional dense vector representations and retrieval is treated as a nearest-neighbor search task. RepLLaMA follows the same bi-encoder design.
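The following is a minimal sketch of the bi-encoder design described above: queries and passages are encoded independently into dense vectors, and retrieval reduces to nearest-neighbor search over dot-product scores. The encoder checkpoint and CLS pooling are illustrative assumptions; DPR and RepLLaMA use their own backbones and pooling choices.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative encoder; DPR fine-tunes BERT with separate query/passage towers in practice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts):
    # Tokenize and take the [CLS] vector as the dense representation (one common pooling choice).
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden[:, 0]  # [batch, hidden_dim]

query_emb = encode(["what is dense retrieval?"])                        # [1, d]
passage_embs = encode(["Dense retrieval encodes text into vectors ...",
                       "BM25 is a lexical ranking function ..."])       # [2, d]

# Retrieval as nearest-neighbor search: score each passage by dot product, then rank.
scores = query_emb @ passage_embs.T
ranking = scores.argsort(dim=-1, descending=True)
```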
Several solutions have been introduced to enhance the effectiveness of retrievers and rerankers in a multi-stage pipeline. On the retriever side, works such as ANCE (Xiong et al., 2021), RocketQA (Qu et al., 2021), CoCondenser (Gao and Callan, 2022b), RetroMAE (Xiao et al., 2022), and SimLM (Wang et al., 2023) have shown that augmenting the training data with hard negative mining or continued retrieval-oriented pre-training can improve the effectiveness of dense retrievers. On the reranker side, monoT5 (Nogueira et al., 2020) and monoELECTRA (Pradeep et al., 2022) demonstrated that initializing the reranker with a custom pre-trained model can enhance effectiveness. Gao et al. (2021) proposed replacing the default pairwise loss with a contrastive loss for reranker training. Zhuang et al. (2023) studied the use of T5 as a reranker, analyzing the influence of different model architectures and loss functions. However, directly fine-tuning modern billion-parameter-scale large language models for multi-stage retrieval has not been explored to date.
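To make the training side concrete, here is a minimal sketch of contrastive training with hard negatives in the spirit of the retriever work cited above; the batch layout, dot-product scoring, and temperature are illustrative assumptions, not the exact recipes of any cited system.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, neg_embs, temperature=0.05):
    """InfoNCE-style loss: pull each query toward its relevant passage and push it
    away from hard negatives (e.g., high-ranking but non-relevant BM25 passages).

    q_emb:    [batch, d]    query embeddings
    pos_emb:  [batch, d]    one relevant passage per query
    neg_embs: [batch, k, d] k hard negatives per query
    """
    pos_scores = (q_emb * pos_emb).sum(-1, keepdim=True)        # [batch, 1]
    neg_scores = torch.einsum("bd,bkd->bk", q_emb, neg_embs)    # [batch, k]
    logits = torch.cat([pos_scores, neg_scores], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long)      # the positive sits at index 0
    return F.cross_entropy(logits, labels)
```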
Recently, LLMs have demonstrated impressive effectiveness when prompted to perform few-shot or zero-shot text generation. As mentioned in the introduction, researchers have cast reranking as a text generation task. These models can be prompted to directly generate a reordered list of candidates, e.g., LRL (Ma et al., 2023), RankGPT (Sun et al., 2023), RankVicuna (Pradeep et al., 2023), or to compare passages in a pairwise manner, e.g., PRP (Qin et al., 2023). Although prompt-based methods have shown good zero-shot effectiveness, they require multiple decoding passes, making them slow and non-parallelizable. Furthermore, reranking with prompts makes it difficult to exploit available human relevance judgments such as MS MARCO (Bajaj et al., 2016) to further improve effectiveness. Finally, these approaches do not allow for joint reranker–retriever optimization. In contrast, our approach addresses all of these issues.
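To illustrate why these prompt-based approaches require repeated LLM decoding, below is a rough sketch of listwise reranking in the style of LRL/RankGPT. The prompt wording, output format, and helper names are assumptions of this sketch; real systems additionally slide a window over longer candidate lists, calling the LLM once per window.

```python
import re

def build_listwise_prompt(query, passages):
    # Ask the LLM to emit a permutation of passage identifiers (RankGPT/LRL-style prompt).
    lines = [f"[{i + 1}] {p}" for i, p in enumerate(passages)]
    return (
        "Rank the following passages by relevance to the query.\n"
        f"Query: {query}\n" + "\n".join(lines) +
        "\nAnswer with identifiers only, e.g. [2] > [1] > [3]."
    )

def parse_permutation(llm_output, num_passages):
    # Recover a ranking from text like "[2] > [1]"; unmentioned passages keep their original order.
    ids = [int(x) - 1 for x in re.findall(r"\[(\d+)\]", llm_output) if 0 < int(x) <= num_passages]
    seen = list(dict.fromkeys(ids))
    rest = [i for i in range(num_passages) if i not in seen]
    return seen + rest

prompt = build_listwise_prompt("what is dense retrieval?", ["passage A ...", "passage B ..."])
ranking = parse_permutation("[2] > [1]", num_passages=2)  # -> [1, 0]
```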
Our work is most similar to GTR-XXL (Ni et al., 2022) and SGPT (Muennighoff, 2022), which also used billion-parameter-scale models as the backbones of dense retrievers, achieving better zero-shot effectiveness than smaller models. However, LLaMA has demonstrated even better effectiveness on natural language generation tasks, suggesting that it might serve as a better backbone and warranting further exploration. The cpt-text model (Neelakantan et al., 2022), initialized with the 175-billion-parameter GPT-3 model, also shows strong zero-shot effectiveness. However, cpt-text is not an open-source model. Additionally, none of the models referenced above are fully optimized for a multi-stage retrieval pipeline. Our RepLLaMA and RankLLaMA models are fully open-source and optimized for multi-stage retrieval, achieving state-of-the-art effectiveness on both retrieval and reranking, in both in-domain and out-of-domain evaluations.