Published at 2023-07-23 | Last Update 2023-07-23
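This post is based on a 2023 paper on large language models, Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond, and translates the parts the translator found most interesting.
Paper information: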
@article{yang2023harnessing,
title={Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond},
author={Jingfeng Yang and Hongye Jin and Ruixiang Tang and Xiaotian Han and Qizhang Feng and Haoming Jiang and Bing Yin and Xia Hu},
year={2023},
eprint={2304.13712},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
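Some engineering figures from the paper:
Training:
- GPT-3 175B: a single training run costs $4.6 million [3].
- GPT-3 175B: trained on 499 billion tokens [16].
Inference:
- Inference time: 1.969s.
- 0.21s/request.
The translator's skill is limited, so omissions or errors are inevitable. If in doubt, please consult the original paper.
The translation follows.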
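This article is a practical guide to large language models (LLMs), intended to help practitioners and end users get better results on their natural language processing (NLP) tasks, which are the typical downstream use case for LLMs. It discusses and analyzes how to select and use LLMs from the perspectives of models, data, and downstream tasks.
We also examine the spurious biases of large models, along with engineering factors such as efficiency, cost, and latency, so that practitioners get a comprehensive picture of deploying large models in practice.
The goal is to give researchers and practitioners up-to-date technical insights and best practices so that large models can be applied more successfully across NLP tasks.
We maintain a regularly updated list of resources at github.com/Mooler0410/LLMsPracticalGuide.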
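In recent years, the rapid development of large language models has revolutionized the field of natural language processing [12, 128, 131]. These powerful models have shown great potential across NLP tasks, from natural language understanding (NLU) to generation tasks, and have even paved the way toward artificial general intelligence (AGI). Using them effectively and efficiently, however, requires understanding their actual capabilities and limitations, as well as the specific NLP task and the data it involves.
As a guide for practitioners and users of large models, this article focuses on how to use LLMs in downstream NLP tasks. For example:
This article summarizes the following practical guidelines for LLMs:
To assess the capabilities of (general-purpose) large language models, we compare them with fine-tuned models. Neither LLMs nor fine-tuned models have a universally accepted definition yet. With practical utility in mind, this article uses the following definitions: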
From a practical standpoint, we consider models with less than 20B parameters to be fine-tuned models. While it’s possible to fine-tune even larger models like PaLM (540B), in reality, it can be quite challenging, particularly for academic research labs and small teams. Fine-tuning a model with 3B parameters can still be a daunting task for many individuals or organizations.
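The rest of this article is organized as follows:
This section briefly introduces the current state-of-the-art LLMs. These models differ in training strategy, model architecture, and use cases. To make the development of LLMs easier to follow, this article groups them into two types:
Figure 1 shows the evolution of language models.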
Figure 1: The evolutionary tree of modern LLMs traces the development of language models in recent years and highlights some of the most well-known models. Models on the same branch have closer relationships. Transformer-based models are shown in non-grey colors: decoder-only models in the blue branch, encoder-only models in the pink branch, and encoder-decoder models in the green branch. The vertical position of the models on the timeline represents their release dates. Open-source models are represented by solid squares, while closed-source models are represented by hollow ones. The stacked bar plot in the bottom right corner shows the number of models from various companies and institutions.
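A few observations:
Decoder-only models are gradually becoming the dominant trend in LLM development.
LLMs are trending toward closed source.
Encoder-decoder models still hold promise.
Table 1 briefly summarizes the characteristics and representative LLMs of each type.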
Table 1: Summary of current large language models (LLMs)
| | Encoder-Decoder or Encoder-only (BERT-style) | Decoder-only (GPT-style) |
|---|---|---|
| Training | Masked language models | Autoregressive language models |
| Model type | Discriminative | Generative |
| Pretraining task | Predict masked words | Predict next word |
| Large language models | ELMo [80], BERT [28], RoBERTa [65], DistilBERT [90], BioBERT [57], XLM [54], XLNet [119], ALBERT [55], ELECTRA [24], T5 [84], GLM [123], XLM-E [20], ST-MoE [133], AlexaTM [95] | GPT-3/4 [16, 76], OPT [126], PaLM [22], BLOOM [92], MT-NLG [93], GLaM [32], Gopher [83], Chinchilla [41], LaMDA [102], GPT-J [107], LLaMA [103], BloombergGPT [117] |
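The two pretraining tasks in Table 1 can be made concrete with a small sketch of how training examples might be built from one sentence. This is an illustration only: the function names are hypothetical, tokens are whitespace-split words rather than subword units, and the masking is simplified (BERT, for instance, masks about 15% of tokens and sometimes substitutes random or unchanged tokens instead of the mask symbol).

```python
import random


def make_mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Masked LM: hide some tokens; the model recovers them using context on both sides."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets[i] = tok  # position -> original token to predict
        else:
            inputs.append(tok)
    return inputs, targets


def make_autoregressive_examples(tokens):
    """Autoregressive LM: predict each token from the tokens that precede it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    print(make_mlm_example(sentence, mask_prob=0.5, seed=1))
    for prefix, next_token in make_autoregressive_examples(sentence):
        print(prefix, "->", next_token)
```

The masked objective sees context on both sides of a gap, which suits discriminative tasks, while the autoregressive objective only ever looks left, which is what generation needs.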
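Natural language data is easy to obtain. To make better use of very large datasets, many unsupervised training paradigms have been proposed, which in turn has advanced unsupervised learning of natural language.
A common approach is to predict the masked words in a sentence given the surrounding context. This training paradigm is called the Masked Language Model (MLM).
Typical models include:
These models have achieved state-of-the-art results on many NLP tasks, such as sentiment analysis and named entity recognition, and have become an important tool in the field.
Although language models are usually task-agnostic in architecture, they still need to be fine-tuned on datasets for specific downstream tasks.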
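Researchers found that scaling up language models significantly improves few-shot and even zero-shot performance [16]. The most successful models for few-shot and zero-shot use are Autoregressive Language Models (ALMs).
Typical autoregressive language models include:
Among these, GPT-3 was a landmark model: it was the first to show, through prompting and in-context learning, that reasonable few-shot/zero-shot performance is achievable, demonstrating the strength of autoregressive language models.
Some models are optimized for specific tasks, such as:
The most recent breakthrough is ChatGPT, which adapts GPT-3 specifically for conversational tasks, yielding more interactive, coherent, and context-aware conversations across a range of real-world applications.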
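As this section will show, data plays a key role when choosing the right model for a given task. The influence of data on model effectiveness begins at the pre-training stage and continues through training and inference.
Choosing between general-purpose models (LLMs) and fine-tuned models
- On out-of-distribution data (for example, adversarial examples and domain shifts), LLMs generalize better than fine-tuned models;
- With limited annotated data, LLMs are the better choice;
- With abundant annotated data, either can work, depending on the specific task requirements;
- Prefer models pre-trained on data similar to the final downstream task.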
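The guidelines above can be read as a small decision rule. The sketch below encodes them in Python purely as an illustration: the function name, the returned labels, and especially the `abundant_threshold` cut-off are assumptions made here, since the source gives no numeric boundary between "limited" and "abundant" annotated data.

```python
def recommend_model(num_labeled_examples, out_of_distribution,
                    abundant_threshold=10_000):
    """Return a rough recommendation following the data-based guidelines.

    abundant_threshold is a hypothetical cut-off, not a value from the paper.
    """
    if out_of_distribution:
        # Adversarial examples or domain shift: LLMs generalize better.
        return "LLM"
    if num_labeled_examples == 0:
        # No labels means no fine-tuning; use the LLM zero-shot.
        return "LLM (zero-shot)"
    if num_labeled_examples < abundant_threshold:
        # A handful of labels can be supplied as in-context examples.
        return "LLM (few-shot)"
    # Plenty of labels: decide by performance, compute, and deployment needs.
    return "either (task-dependent)"


if __name__ == "__main__":
    print(recommend_model(0, out_of_distribution=False))
    print(recommend_model(50_000, out_of_distribution=False))
```

In practice the boundary is fuzzy and depends on the task, so the threshold is best treated as a knob to tune rather than a fixed rule.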
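Pre-training data plays a key role in the development of large language models.
As the foundation of LLMs' remarkable capabilities [5, 47], the quality, quantity, and diversity of pre-training data significantly affect LLM performance [124]. Commonly used pre-training data spans many kinds of text, such as books, articles, and websites. The data is carefully curated to ensure a comprehensive representation of human knowledge, linguistic nuance, and cultural perspectives.
Pre-training data matters because it strongly shapes a language model's grasp of vocabulary, grammar, syntax, and semantics, as well as its ability to recognize context and generate coherent responses. The diversity of pre-training data is also critical: LLM performance depends heavily on the composition of the pre-training data. For example:
In short, when selecting an LLM for an NLP task, prefer models pre-trained on data from a similar domain.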
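Once a general-purpose large model is available and the next step is to serve it in production, the amount of annotated data at hand determines
whether to adjust the model's configuration or fine-tune it before deployment.
In this case there is no annotated data, so fine-tuning is not possible; configuring the LLM for zero-shot use
is the most suitable option.
Zero-shot LLMs now outperform earlier zero-shot methods [120]. In addition, because the model parameters remain unaltered and there is no parameter update process, catastrophic forgetting [49] is avoided.
In this case, the few available examples can be placed directly in the prompt given to the LLM, which is called in-context learning. These examples effectively guide the LLM to generalize to the task.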
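As a minimal sketch of in-context learning, the function below assembles a few-shot prompt from labeled examples. The prompt layout (`Input:` / `Output:` lines) and the function name are assumptions chosen for illustration; no particular model or API requires this exact format, and no model parameters are updated at any point.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Concatenate an instruction, labeled demonstrations, and the new query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final block is left open so the model completes the label.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


if __name__ == "__main__":
    demos = [
        ("The movie was wonderful.", "positive"),
        ("I want my money back.", "negative"),
    ]
    print(build_few_shot_prompt(
        "Classify the sentiment of each input as positive or negative.",
        demos,
        "The plot dragged but the acting was superb.",
    ))
```

The resulting string is sent to the LLM as-is; the demonstrations steer the model toward the task format without any fine-tuning.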
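When abundant annotated data is available for a specific task, both fine-tuned models and LLMs are worth considering.
Whether to use a fine-tuned model or an LLM in this case depends on the specific task and on factors such as required performance, compute resources, and deployment constraints.
In brief: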
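When deploying LLMs for real tasks, performance problems often arise from distribution differences between test/user data and training data. These differences may involve:
They greatly reduce the effectiveness of fine-tuned models in real applications, because fine-tuned models are fit to a specific data distribution and generalize poorly to OOD data.
LLMs, on the other hand, perform quite well in this setting because they have no explicit fitting process.
In addition, recent reinforcement learning from human feedback (RLHF) has further improved the generalization ability of LLMs [77]. For example:
This section discusses in detail where LLMs are and are not suitable across NLP tasks, along with the corresponding model capabilities. As shown in Figure 2, we summarize the discussion as a decision flow that can serve as a quick guide when facing a task: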
Figure 2: The decision flow for choosing LLMs or fine-tuned models for user’s NLP applications. The decision flow helps users assess whether their downstream NLP applications at hand meet specific conditions and, based on that evaluation, determine whether LLMs or fine-tuned models are the most suitable choice for their applications. During the decision process in the figure, Y means meeting the condition, and N means not meeting the condition. The yellow circle for Y of the last condition means there’s no model working well on this kind of application.
Traditional NLU tasks are some fundamental tasks in NLP including text classification, named entity recognition (NER), entailment prediction, and so on. Many of them are designed to serve as intermediate steps in larger AI systems, such as NER for knowledge graph construction.
Note: As we mention in Section 1, LLMs are pretrained on large and diverse datasets without fine-tuning, while fine-tuned models are typically pretrained on a large dataset and then further trained on a smaller, task-specific dataset to optimize their performance on that task.
Remark 2
Fine-tuned models are generally a better choice than LLMs in traditional NLU tasks, but LLMs can help when strong generalization ability is required.
In most natural language understanding tasks, such as tasks in GLUE[106] and SuperGLUE[105], fine-tuned models still have better performance, if such tasks come with rich well-annotated data and contain very few out-of-distribution examples on test sets. For different tasks and datasets, the gap between small fine-tuned models and LLMs varies.
In text classification, on most datasets, LLMs perform slightly worse than fine-tuned models. For sentiment analysis, such as on IMDB [69] and SST [94], fine-tuned models and LLMs perform equally well. For toxicity detection, which is another iconic text classification task, the gap is much larger. No LLM performs well on this task, and on CivilComments [13] even the best one is only better than random guessing [59]. On the other hand, most popular fine-tuned models can obtain much better performance [33], and the Perspective API is still one of the best for detecting toxicity. This API is powered by a multilingual BERT-based model, which is tuned on publicly available toxicity data, and several smaller single-language CNNs distilled from this model. This might be due to the fact that toxicity is defined by subtle nuances in linguistic expressions, and large language models are unable to accurately comprehend this task solely based on the provided input.
The trend of performance gaps is similar in some other tasks. For natural language inference (NLI) tasks, on most datasets, such as on RTE [106] and SNLI [14], fine-tuned models perform better than LLMs, while on some data such as CB [105], LLMs have obtained comparable performance with fine-tuned models [22]. For question answering (QA), on SQuADv2 [86], QuAC [21] and many other datasets, fine-tuned models have superior performance, while on CoQA [87], LLMs perform as well as fine-tuned models [22].
In information retrieval (IR) tasks, LLMs are not widely exploited yet. One major reason is that IR tasks are fundamentally different from others. There’s no natural way to transform the thousands of candidate texts into a few/zero-shot form which is required by LLMs. The existing evaluation results on MS MARCO(regular/TREC) [73] show that methods based on fine-tuned models have better performance [59]. In this evaluation, the LLMs rank passages in an unorthodox way, which requires the LLMs to produce probabilities for passages one by one.
For some low-level intermediate tasks, which are not intended for regular users but rather for high-level tasks, such as named entity recognition (NER) and dependency parsing, there are not enough results from LLMs, because most current evaluations of LLMs focus on practical tasks. According to available evaluation results, for the NER task, CoNLL03 [89] is still a challenge for LLMs [81], where the performance of fine-tuned models is roughly twice that of LLMs. These intermediate tasks may vanish soon because LLMs can take over high-level tasks without the help of those intermediate tasks (e.g. dependency parsing for coding tasks; NER for some text generation tasks).
In brief, for most traditional NLU tasks, a fine-tuned model is a better choice in terms of the performance on benchmark datasets and the computational cost. The scale of LLMs is usually 10× or even 100× larger than fine-tuned models. One possible cause for the inferior performance of LLMs on certain tasks can be the design of instructions/prompts. Transforming input from tasks like IR and sentence labeling into a few/zero-shot instruction form is non-trivial. There may be better ways to adapt language models to traditional NLP tasks in the future. On the other hand, the upper limit of capabilities of fine-tuned models is not reached, and some methods like FLAN-tuning [67] can further boost the performance on NLU tasks. Another interesting finding is that on NLU tasks, after fine-tuning, masked language models, like T5[85], are better than most auto-regressive language models at the same scale, while some recent results imply that this gap can be bridged by scaling[22].
One of the representative tasks is miscellaneous text classification [59]. In contrast to classic domain-specific text classification tasks such as sentiment analysis, miscellaneous text classification deals with a diverse range of topics and categories that may not have a clear or strong relationship with one another. It’s closer to real-world cases and hard to format for fine-tuned models. Another is the Adversarial NLI (ANLI)[74]. It is a difficult dataset composed of adversarially mined natural language inference questions in three rounds (R1, R2, and R3). LLMs have shown superior performance on ANLI, especially on R3 and R2. Both examples demonstrate the exceptional ability of LLMs to generalize well on out-of-distribution and sparsely annotated data in traditional NLP tasks, surpassing that of fine-tuned models. We’ve discussed this in Section 3.3 above.
Natural Language Generation broadly encompasses two major categories of tasks, with the goal of creating coherent, meaningful, and contextually appropriate sequences of symbols. The first type focuses on converting input texts into new symbol sequences, as exemplified by tasks like paragraph summarization and machine translation. The second type, “open-ended” generation, aims to generate text or symbols from scratch to accurately match input descriptions such as crafting emails, composing news articles, creating fictional stories and writing code.
Remark 3 Due to their strong generation ability and creativity, LLMs show superiority at most generation tasks.
Generation tasks require models to have a comprehensive understanding of the input contents or requirements and a certain level of creativity. This is what LLMs excel at.
For summarization tasks, although LLMs do not have an obvious advantage over fine-tuned models under traditional automatic evaluation metrics, such as ROUGE [60], human evaluation results indicate that humans tend to prefer the results generated by LLMs [38, 127] compared to that of fine-tuned models. For example, on CNN/DailyMail [71] and XSUM [72], fine-tuned models like Brio [66] and Pegasus [125] have much better performance than any LLMs w.r.t. ROUGE, but LLMs like OPT [126] perform far better in human evaluation considering all aspects including faithfulness, coherence, and relevance [127]. This demonstrates the superiority of LLMs in summarization tasks. On the other hand, it implies that current summarization benchmarks don’t contain summaries with high quality or the automatic metrics are not proper for the evaluation of summarization.
In machine translation (MT), LLMs can perform competent translation, although the average performance is slightly worse than some commercial translation tools [45] considering some automatic metrics like BLEU[78]. LLMs are particularly good at translating some low-resource language texts to English texts. For example, in the Romanian-English translation of WMT’16 [11], zero-shot or few-shot LLMs can perform better than the SOTA fine-tuned model[22]. This is mainly due to the fact that English resources compose the main part of the pre-training data. BLOOM [92] is pre-trained on more multi-lingual data, leading to better translation quality in both rich-resource and low-resource translation. Another interesting finding is that BLOOM achieves good translation quality among Romance languages, even for translation from Galician, which is not included in the pre-training data. One reasonable explanation is that texts from some languages in the same language group can help the LLMs learn more from the similarity. If more multi-lingual texts can be added to the pre-training data, the translation capability may be improved further.
Additionally, LLMs are highly skilled in open-ended generation. One example is that the news articles generated by LLMs are almost indistinguishable from real news articles by humans [16]. LLMs are remarkably adept at code synthesis as well. Either for text-code generation, such as HumanEval [18] and MBPP [7], or for code repairing, such as DeepFix [39], LLMs can perform pretty well. GPT-4 can even pass 25% of the problems on LeetCode, which are not trivial for most human coders [76]. With training on more code data, the coding capability of LLMs can be improved further [22]. While LLMs perform well on such tasks, the code they generate should be tested carefully to find any subtle bugs, which is one of the main challenges for applying LLMs in code synthesis.
Fine-tuned models, such as DeltaLM+Zcode [118], still perform best on most rich-resource translation and extremely low-resource translation tasks. In rich resource machine translation, fine-tuned models slightly outperform LLMs [22, 92]. And in extremely low-resource machine translation, such as English-Kazakh translation, fine-tuned models significantly perform better than LLMs.
Knowledge-intensive NLP tasks refer to a category of tasks that have a strong reliance on background knowledge, domain-specific expertise, or general real-world knowledge. These tasks go beyond simple pattern recognition or syntax analysis. And they are highly dependent on memorization and proper utilization of knowledge about specific entities, events, and common sense of our real world.
Remark 4 (1) LLMs excel at knowledge-intensive tasks due to their massive real-world knowledge. (2) LLMs struggle when the knowledge requirements do not match their learned knowledge, or when they face tasks that only require contextual knowledge, in which case fine-tuned models can work as well as LLMs.
In general, with billions of training tokens and parameters, LLMs have much more real-world knowledge than fine-tuned models.
Closed-book question-answering tasks require the model to answer a given question about factual knowledge without any external information. It does require the memorization of real-world knowledge in the model. LLMs perform better on nearly all datasets, such as on NaturalQuestions [52], WebQuestions [9], and TriviaQA [46]. On TriviaQA, even zero-shot LLMs are still much better [22].
The massive multitask language understanding (MMLU) [40] is also highly knowledge-intensive. It contains multiple-choice questions spanning 57 different subjects and requires general knowledge of the model. It’s pretty challenging even for LLMs, although the newly released GPT-4 [76] outperforms existing models by a considerable margin in English with a satisfactory 86.5% accuracy.
Also, some tasks in Big-bench[96], which are designed to probe LLMs and extrapolate their future capabilities, heavily relied on the memorization of real-world knowledge. In such tasks, the performance of some LLMs is better than the average level of humans, and even comparable to the best human performance. For example, the task Hindu_knowledge requires models to give facts about Hindu mythology, Periodic Elements require the capability of predicting the element name from the periodic table and Physics tests the physics knowledge of models by asking for the formula needed to solve a given physics problem.
There are some other tasks requiring knowledge different from that learned by LLMs. The required knowledge is not that learned by LLMs about the real world. In such tasks, LLMs are not notably superior. Some tasks only require the model to capture the self-contained knowledge in the contexts. The knowledge in the contexts from the input is enough for the model to make predictions. For these tasks, small fine-tuned models can work pretty well. One such task is machine reading comprehension (MRC). An MRC task provides several paragraphs and requires the model to predict the answer to questions based on these paragraphs. We’ve discussed MRC in the previous section because it’s also a traditional NLU task.
Another scenario is that the knowledge within LLMs about real world is useless to the task, or even the required knowledge is counterfactual to the real world. As a result, the LLMs cannot work well on such tasks. In some cases, inconsistent knowledge may even make the LLMs worse than random guessing. For example, in Big-Bench, the Mnist ascii task requires the model to tell the digit represented by an ASCII art. The capability required by this task is nothing about real-world knowledge. Also, in the Inverse Scaling Phenomenon competition [70], the task redefine math redefines a common symbol and requires the model to choose between the original meaning and the meaning derived from the redefinition. What it requires contrasts to the LLMs’ knowledge, thus LLMs even perform worse than random guessing. As an alternative to real-world knowledge in LLMs, access to extra knowledge is allowed, and models can thus get enough knowledge for a task via retrieval augmentation. The basic idea of retrieval augmentation is to add an extra information retrieval step prior to making predictions, in which, some useful texts related to the task will be retrieved from a large corpus. Then, the model will make predictions based on both the input contexts and the retrieved texts. With retrieved additional information, the closed-book task can become “open-book”. In such a scenario, fine-tuned models are pretty good with much smaller sizes, because the required knowledge can be obtained by retrieving. For example, on NaturalQuestions [52], with extra corpus, retrieval augmented models [44, 48] are much better than any other methods.
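The retrieval-augmentation idea described above (retrieve useful texts first, then predict from the input plus the retrieved texts) can be sketched in a few lines. This is a toy illustration under stated assumptions: the retriever ranks passages by simple word overlap, whereas real systems use learned or index-based retrievers, and the function names and prompt layout are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, corpus, k=2):
    """Return the k passages sharing the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(corpus,
                    key=lambda passage: len(query_tokens & tokenize(passage)),
                    reverse=True)
    return ranked[:k]


def build_open_book_prompt(question, corpus, k=2):
    """Prepend retrieved passages so the model can answer 'open-book'."""
    context = "\n".join(f"- {p}" for p in retrieve(question, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    corpus = [
        "Paris is the capital of France",
        "The Nile is a river in Africa",
        "Berlin is the capital of Germany",
    ]
    print(build_open_book_prompt("What is the capital of France?", corpus, k=1))
```

Because the needed facts arrive through the context, the model that consumes the prompt no longer has to memorize them, which is why much smaller fine-tuned models can do well in this setting.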
Scaling of LLMs (e.g. parameters, training computation, etc.) can greatly empower pretrained language models. With the model scaling up, a model generally becomes more capable in a range of tasks. Reflected in some metrics, the performance shows a power-law relationship with the model scale. For example, the cross-entropy loss which is used to measure the performance for language modeling decreases linearly with the exponential increase in the model scale, which is also called ’scaling-law’ [41, 47]. For some crucial abilities, such as reasoning, scaling the model has gradually transformed these abilities from a very low state to a usable state, and even approaching human capabilities. In this section, we provide an overview of the usage of LLMs in terms of the abilities and behaviors of LLMs along with scaling.
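A power-law relationship between loss and model scale, loss ≈ a · N^(−α), appears as a straight line on log-log axes, so its exponent can be estimated with ordinary least squares on the logarithms. The sketch below demonstrates this on synthetic data; the constants 5.0 and 0.08 are made up for illustration and are not measurements from any model or paper.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) via least squares on (log size, log loss)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


if __name__ == "__main__":
    # Synthetic, noise-free points generated from a known power law.
    sizes = [1e8, 1e9, 1e10, 1e11]
    losses = [5.0 * s ** -0.08 for s in sizes]
    a, alpha = fit_power_law(sizes, losses)
    print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

With a small exponent like this, each tenfold increase in size removes only a fixed fraction of the loss, which is one way to see why gains from scaling are steady but expensive.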
Remark 5 (1) With the exponential increase of model scales, LLMs become especially capable of reasoning like arithmetic reasoning and commonsense reasoning. (2) Emergent abilities become serendipity for uses that arise as LLMs scale up, such as ability in word manipulation and logical ability. (3) In many cases, performance does not steadily improve with scaling due to the limited understanding of how large language models’ abilities change as they scale up.
Reasoning, which involves making sense of information, drawing inferences, and making decisions, is one of the essential aspects of human intelligence. It is challenging for NLP. Many existing reasoning tasks can be classified into commonsense reasoning and arithmetic reasoning.
Arithmetic reasoning/problem solving. The arithmetic reasoning capability of LLMs benefits greatly from the scaling of model size. For GPT-3, the ability of two-digit addition only becomes apparent when the number of parameters exceeds 13B [16]. Tasks to test arithmetic reasoning are trivial for humans and designed to challenge the capability of transferring natural language into mathematical symbols and multi-step inference. On GSM8k [26], SVAMP [79] and AQuA [61], LLMs, as generalists, have competitive performance with most methods which have task-specific designs. And GPT-4 outperforms any other methods [76], even some huge models particularly tuned for arithmetic problems [104]. Nevertheless, it should be noted that, without the intervention of external tools, LLMs may occasionally make mistakes in performing basic calculations, although chain-of-thought (CoT) prompting [115] can significantly improve LLMs’ ability in calculations.
Commonsense reasoning. Commonsense reasoning not only requires LLMs to remember factual knowledge but also requires LLMs to do several inference steps about the facts. Commonsense reasoning increases gradually with the growth of model size. Compared to fine-tuned models, LLMs keep the superiority on most datasets, such as StrategyQA [36] and ARC-C [25]. Especially on ARC-C, which contains difficult questions in science exams from grade 3 to grade 9, GPT-4 has been close to the performance of 100% (96.3%) [76].
Scaling of models also endows the model with some unprecedented, fantastic abilities that go beyond the power-law rule. These abilities are called “emergent ability”. As defined in [113], emergent abilities of LLMs are abilities that are not present in smaller-scale models but are present in large-scale models. This means such abilities cannot be predicted by extrapolating the performance improvements on smaller-scale models and the model suddenly gains good performance on some tasks once the scale exceeds a certain range. The emergent ability is typically unpredictable and surprising, leading to tasks that emerge randomly or unexpectedly. We examine concrete examples of the emergent abilities of LLMs and provide them as an important reference for deciding whether to leverage LLMs’ emergent abilities.
Handling word manipulation is a typical emergent ability. It refers to the ability to learn symbolic manipulations, such as the reversed words [16], in which the model is given a word spelled backwards, and must output the original word. For example, GPT-3 [16] shows the emergent ability for word sorting and word unscrambling tasks. PaLM [22] exhibits the emergent ability on the ASCII word recognition and hyperbaton tasks. The logical abilities of language models tend to emerge as the model scales up, such as logical deduction, logical sequence, and logic grid puzzles. Additionally, other tasks, such as advanced coding (e.g., auto debugging, code line description), and concept understanding (e.g., novel concepts, simple Turing concepts), are also use cases with the emergent abilities of large language models.
Although in most cases, as discussed above, larger models bring better performance, there are still many exceptions that should be considered when choosing the appropriate model.
On certain tasks, with the size of LLMs increasing, the performance begins to decrease, such as Redefine-math: tests whether language models are able to work with common symbols when they are redefined to mean something else; Into-the-unknown: requires the model to choose which piece of information would help answer a question; Memo-trap: asks an LM to write a phrase in a way that starts like a famous quote but ends differently. This is also called the Inverse Scaling Phenomenon. Another interesting phenomenon observed in the scaling of LLMs is called the U-shaped Phenomenon [114]. As the name implies, this phenomenon refers to the observation that as LLM size increases, performance on certain tasks initially improves but then starts to decline before eventually improving again, such as on: Hindsight-neglect: it tests whether language models are able to assess whether a bet was worth taking based on its expected value; NegationQA: this task takes an existing multiple-choice dataset and negates a part of each question to see if language models are sensitive to negation; Quote-repetition: it asks models to repeat back sentences given in the prompt, with few-shot examples to help it recognize the task. Hence the risk of diminishing performance should be noted and if the task is similar to those we just discussed, careful consideration should be given to whether or not to use huge LLMs.
Gaining a deeper understanding of emergent abilities, inverse scaling phenomenon and U-shape phenomenon in LLMs is essential for advancing research in this field. In a certain sense, the U-shape phenomenon suggests that small-scale models and huge-scale models make predictions with different internal mechanisms. From this perspective, the U-shape phenomenon can be seen as a transformation of the inverse-scaling phenomenon due to some emergent abilities from sufficiently large models [114]. GPT-4 [76] exhibits a reversal of the inverse scaling phenomenon in some cases, such as on a task called Hindsight Neglect. The explanation for these behaviors of LLMs during scaling is still an open problem. Several hypotheses have been proposed. For emergent abilities, one explanation is that there may be multiple key steps for a task and the LLM cannot handle this task until it’s large enough to handle every step, and another explanation is focused on the granularity of evaluation metrics [113]. For inverse-scaling phenomenon and U-shape phenomenon, the explanations mainly focus on the model’s over-reliance on information from its prior rather than the input prompts, valid but misleading few-shot examples, and distracting easier tasks within a hard task [114].
This section explores miscellaneous tasks which cannot be involved in previous discussions, to better understand LLMs’ strengths and weaknesses.
Remark 6 (1) Fine-tuned models or specified models still have their space in tasks that are far from LLMs’ pretraining objectives and data. (2) LLMs are excellent at mimicking humans, data annotation and generation. They can also be used for quality evaluation in NLP tasks and have bonuses like interpretability.
LLMs generally struggle with some tasks due to differences in objectives and training data. Although LLMs have achieved remarkable success in various natural language processing tasks, their performance in regression tasks has been less impressive. For example, ChatGPT’s performance on the GLUE STS-B dataset, which is a regression task evaluating sentence similarity, is inferior to that of a fine-tuned RoBERTa [130]. Regression tasks typically involve predicting a continuous value rather than a discrete label, posing unique challenges for LLMs. One primary reason for their subpar performance is the inherent difference between the language modeling objective and the regression task objective. LLMs are designed to predict the next word in a sequence or generate coherent text, with their pre-training focused on capturing linguistic patterns and relationships. Consequently, their internal representations may not be well-suited for modeling continuous numerical outputs. Besides, LLMs have predominantly been trained on text data, focusing on capturing the intricacies of natural language processing. As a result, their performance on multimodal data, which involves handling multiple data types such as text, images, audio, video, actions, and robotics, remains largely unexplored. And fine-tuned multimodal models, like BEiT[110] and PaLI [19], still dominate many tasks such as visual question answering (VQA) and image captioning. Nonetheless, the recently introduced GPT-4 [76] has taken the step in multimodal fusion, but there is still a lack of detailed evaluation of its capabilities.
LLMs are particularly suitable for certain tasks.
LLMs are very good at mimicking humans, acting as a chatbot, and performing various kinds of tasks. The LLM-powered ChatGPT is surprising for its consistency, reliability, informativeness, and robustness during multiple utterances with humans. The human-feedback procedure plays an important role in acquiring such abilities. LLMs can act as both a good annotator and a data generator for data augmentation, such as in [27, 29, 99, 121, 122]. Some LLMs have been found as good as human annotators [37] in some tasks. And the collected texts from GPT-3.5 (text-davinci-003) have been used as human-like instruction-following demonstrations to train other language models [100].
LLMs can also be used for quality assessment on some NLG tasks, such as summarization and translation. On summarization tasks, GPT-4 as an evaluator achieves a higher correlation with humans than other methods with a large margin [64]. Some other evaluators based on LLMs [34, 50, 64, 108] also show good human alignment in more NLG tasks, especially compared with traditional automatic metrics. But the LLM evaluator may have a bias towards the LLM-generated texts [64]. Also, as we discussed above, some abilities of LLMs bring bonuses in addition to performance improvement, such as interpretability. The CoT reasoning ability of LLMs can show how an LLM reaches the prediction, which is a good interpretation on the instance level, while it also improves the performance.
In the last part of this section, we would like to discuss the usage of LLMs and fine-tuned models in real-world “tasks”. We use the term “tasks” loosely, as real-world scenarios often lack well-formatted definitions like those found in academia. Many requests to models cannot even be treated as NLP tasks. Models face challenges in the real world from three perspectives:
Remark 7 LLMs are better suited to handle real-world scenarios compared to fine-tuned models. However, evaluating the effectiveness of models in the real world is still an open problem.
Handling such real-world scenarios requires coping with ambiguity, understanding context, and handling noisy input. Compared to fine-tuned models, LLMs are better equipped for this because they have been trained on diverse data sets that encompass various writing styles, languages, and domains. Additionally, LLMs demonstrate a strong ability to generate open-domain responses, making them well-suited for these scenarios. Fine-tuned models, on the other hand, are often tailored to specific, well-defined tasks and may struggle to adapt to new or unexpected user requests. They heavily rely on clear objectives and well-formed training data that specify the types of instructions the models should learn to follow. Fine-tuned models may struggle with noisy input due to their narrower focus on specific distributions and structured data. An additional system is often required as an assistant for fine-tuned models to process unstructured context, determine possible intents, and refine model responses accordingly. Additionally, some mechanics such as instruction tuning [91, 112] and human alignment tuning [77] further boost the capabilities of LLMs to better comprehend and follow user instructions. These methods improve the model’s ability to generate helpful, harmless, and honest responses while maintaining coherence and consistency [77, 91, 112]. While both methods can make LLMs better generalize to unseen tasks and instructions, it has been noticed that human labelers prefer models tuned for human alignment [77] to models tuned with instructions from public NLP tasks, such as FLAN [112] and T0 [91]. The reason may be similar to reasons for fine-tuned models’ inferiority: public NLP tasks/datasets are designed for easy and automatic evaluation, and they can only cover a small part of real-world usage.
One of the main issues when it comes to real-world scenarios is how to evaluate whether the model is good or not. Without any formalized tasks or metrics, the evaluation of model effectiveness can only rely on feedback from human labelers. Considering the complexity and cost of human evaluation, there’s no massive and systematic comparison between fine-tuned models and LLMs yet. Nevertheless, the huge success and popularity of LLMs such as ChatGPT have confirmed the superiority of LLMs to some extent.
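Although general-purpose LLMs are applicable to many downstream tasks, other important factors also need to be considered, such as efficiency and trustworthiness.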
Remark 8
- Light, local, fine-tuned models should be considered rather than LLMs, especially by those who are sensitive to cost or have strict latency requirements. Parameter-efficient tuning can be a viable option for model deployment and delivery.
- The zero-shot approach of LLMs prevents the learning of shortcuts from task-specific datasets, which is prevalent in fine-tuned models. Nevertheless, LLMs still exhibit a degree of shortcut learning.
- Safety concerns associated with LLMs should be given the utmost importance, as potentially harmful or biased outputs and hallucinations from LLMs can result in severe consequences. Methods such as human feedback have shown promise in mitigating these problems.
Beyond model accuracy, practical deployment must also take performance, cost, and latency into account.
In practice, practitioners must balance efficiency against effectiveness.
LLMs have grown steadily larger in recent years: GPT-1, GPT-2, and GPT-3 have 0.117B, 1.5B, and 175B parameters, respectively.
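The cost of training an LLM is directly related to its parameter count: a single training run of GPT-3 175B requires $4.6 million [3]. The energy consumed in training large models is equally striking.
Dataset size also grows rapidly with model size: GPT-3 175B was trained on 499 billion tokens [16]. Another key metric that reflects computing cost is FLOPs.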
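As a rough illustration of how FLOPs scale with model and dataset size, the sketch below applies the widely used rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. That rule is not taken from the text above, and `estimate_training_flops` is a hypothetical helper. The parameter and token counts are the GPT-3 figures quoted above, taken as-is, so the result is only an order-of-magnitude estimate; the true value depends on how many tokens were actually processed during training.

```python
def estimate_training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute as 6 * N * D.

    The factor 6 is an assumed rule of thumb (forward plus backward pass
    for a dense transformer), not a figure from the article.
    """
    if num_params <= 0 or num_tokens <= 0:
        raise ValueError("num_params and num_tokens must be positive")
    return 6.0 * num_params * num_tokens


if __name__ == "__main__":
    gpt3_params = 175e9   # parameter count quoted in the text
    gpt3_tokens = 499e9   # token count quoted in the text
    flops = estimate_training_flops(gpt3_params, gpt3_tokens)
    print(f"Estimated training compute: {flops:.2e} FLOPs")
```

Because the estimate is linear in both inputs, doubling either the model size or the number of training tokens doubles the compute budget.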
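Beyond these costs, the hardware requirements are also substantial. OpenAI partnered with Microsoft on a supercomputer hosted in Microsoft Azure, consisting of 285k CPU cores and 10k high-end GPUs, to support the training of large models.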
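For users of the OpenAI API, pricing varies by model and usage.
Therefore, for users who cannot afford such costs, such as small startups and individual users, a smaller fine-tuned (non-GPT) model may be a better choice.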
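To make the budgeting question concrete, the following sketch estimates a monthly API bill from request volume and token counts. The per-1K-token prices are made-up placeholders, not actual OpenAI prices, and `TokenPricing` and `estimate_monthly_cost` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Hypothetical prices in USD per 1,000 tokens (placeholders, not real prices)."""
    prompt_per_1k: float
    completion_per_1k: float


def estimate_monthly_cost(
    pricing: TokenPricing,
    requests_per_day: int,
    prompt_tokens: int,
    completion_tokens: int,
    days: int = 30,
) -> float:
    """Estimate the monthly cost in USD for a steady request volume."""
    per_request = (
        prompt_tokens / 1000 * pricing.prompt_per_1k
        + completion_tokens / 1000 * pricing.completion_per_1k
    )
    return per_request * requests_per_day * days


if __name__ == "__main__":
    placeholder = TokenPricing(prompt_per_1k=0.01, completion_per_1k=0.02)
    cost = estimate_monthly_cost(
        placeholder, requests_per_day=10_000, prompt_tokens=500, completion_tokens=200
    )
    print(f"Estimated monthly cost: ${cost:,.2f}")
```

Running the same calculation against the amortized cost of hosting a small fine-tuned model gives a first-order basis for the choice described above.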
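Latency is a key factor to consider when deploying LLMs in practice.
Inference time is a common measure of latency, and it depends heavily on model size, architecture, and the number of tokens.
Because LLMs are usually too large to run on a single user machine, companies offer them as a service through APIs. API latency can vary with the user's location.
Large LLMs may be unsuitable where high latency is unacceptable. For example, scalability is critical in many information retrieval applications.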
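Simple arithmetic shows why scalability becomes the bottleneck in retrieval-style workloads: if every query-passage pair is scored by a separate request, per-request latency is multiplied by the number of candidates. The sketch below makes that explicit. The default of 0.21 s per request is the InstructGPT davinci v2 figure this section reports; `rerank_wall_time`, the candidate counts, and the parallelism level are illustrative assumptions, and real systems add network and queueing delays on top.

```python
import math


def rerank_wall_time(
    num_candidates: int,
    seconds_per_request: float = 0.21,
    parallel_requests: int = 1,
) -> float:
    """Wall-clock seconds to score every candidate passage for one query.

    Assumes each request takes the same time and that up to
    `parallel_requests` requests run concurrently.
    """
    if num_candidates < 0:
        raise ValueError("num_candidates must be non-negative")
    if parallel_requests < 1:
        raise ValueError("parallel_requests must be at least 1")
    rounds = math.ceil(num_candidates / parallel_requests)
    return rounds * seconds_per_request


if __name__ == "__main__":
    for candidates, parallel in [(100, 1), (100, 10), (1000, 10)]:
        seconds = rerank_wall_time(candidates, parallel_requests=parallel)
        print(f"{candidates} candidates, {parallel} in parallel: {seconds:.1f} s")
```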
The idealized denoised inference time of InstructGPT davinci v2 (175B*) is 0.21s per request (i.e., a query-passage pair to be scored), which is too slow for a web search engine.
In practice, the model may be tuned on a specific dataset.
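Parameter-efficient tuning (PET) refers to techniques that freeze most of the parameters of a pretrained LLM and fine-tune only a small portion of the model's parameters (or a small set of extra parameters). The main goal is to greatly reduce computation and storage costs while preserving the performance of the original model. Common PET techniques include LoRA (Low-Rank Adaptation) [42].
These PET methods are useful both for fine-tuning a model to a specific task and for tuning LLMs to meet special requirements such as human alignment.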
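A minimal NumPy sketch of the LoRA idea may help: the pretrained weight matrix stays frozen, and only a low-rank update, the product of two small matrices, would be trained. The class name `LoRALinear`, the layer sizes, the rank, the scaling factor `alpha`, and the initialization constants are illustrative assumptions rather than details from the text, and the training loop itself is omitted.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer plus a low-rank update (forward pass only)."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float = 16.0, seed: int = 0):
        d_out, d_in = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                     # frozen, shape (d_out, d_in)
        self.lora_a = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable, small random init
        self.lora_b = np.zeros((d_out, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        """Compute x W^T + scale * x A^T B^T for x of shape (batch, d_in)."""
        base = x @ self.weight.T
        update = (x @ self.lora_a.T) @ self.lora_b.T
        return base + self.scale * update

    def trainable_params(self) -> int:
        return self.lora_a.size + self.lora_b.size

    def frozen_params(self) -> int:
        return self.weight.size


if __name__ == "__main__":
    pretrained = np.random.default_rng(42).normal(size=(1024, 1024))
    layer = LoRALinear(pretrained, rank=8)
    share = layer.trainable_params() / layer.frozen_params()
    print(f"trainable: {layer.trainable_params()}, frozen: {layer.frozen_params()}")
    print(f"trainable share: {share:.2%}")
```

Because `lora_b` starts at zero, the adapted layer initially reproduces the frozen layer exactly; only the two small matrices need to be stored per task, which is where the storage savings come from.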
Given that LLMs are now involved in sensitive areas such as healthcare, finance, and law, it is crucial to ensure that they are trustworthy and capable of producing reliable output.
Robustness and Calibration. The accuracy and robustness of LLMs have been shown to be very strongly correlated [59]: models with high accuracy on a scenario also tend to be robust. However, zero-shot robustness degrades after a model is tuned on extra application-specific task data [116]. This may be due to overfitting, which leads to poor generalizability given the extremely high complexity of the model and the limited training samples from downstream tasks [43]. In a similar vein, fine-tuning a model has been observed to cause significant miscalibration, owing to over-parameterization [51]. Therefore, fine-tuned models may not be an optimal choice when robustness and calibration are critical considerations. Human alignment, however, has been identified as a potential way to enhance model robustness: InstructGPT davinci v2 (175B*) has been shown to outperform other models in terms of robustness. Achieving optimal calibration, on the other hand, depends on the scenario and the adaptation procedure employed.
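Calibration can be quantified in several ways. The sketch below computes expected calibration error (ECE), a commonly used metric that is not named in the text above and is shown only as one possible choice: predictions are binned by confidence, and the gap between average confidence and accuracy in each bin is averaged, weighted by bin size. The function name, the number of bins, and the toy inputs are assumptions.

```python
import numpy as np


def expected_calibration_error(confidences, correct, num_bins: int = 10) -> float:
    """ECE over equal-width confidence bins on [0, 1].

    confidences: predicted probability of the chosen answer, one per example.
    correct: 1 if the prediction was right, else 0, one per example.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if confidences.shape != correct.shape:
        raise ValueError("confidences and correct must have the same shape")
    if confidences.size == 0:
        raise ValueError("at least one prediction is required")

    edges = np.linspace(0.0, 1.0, num_bins + 1)
    # Interior edges split [0, 1] into num_bins bins; confidence 1.0 lands in the last bin.
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, num_bins - 1)

    ece = 0.0
    for b in range(num_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of examples in the bin
    return float(ece)


if __name__ == "__main__":
    # Toy example: an overconfident model that is right only half the time.
    conf = [0.95, 0.95, 0.95, 0.95]
    hits = [1, 0, 0, 1]
    print(f"ECE: {expected_calibration_error(conf, hits):.3f}")
```

Comparing this value before and after fine-tuning on held-out data is one way to check whether tuning has degraded calibration.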
Fairness and Bias. LLMs have been shown to exhibit disparate treatment and impact, perpetuating societal biases and potentially leading to discrimination [10, 17]. To ensure fairness and equity for all users, it is crucial to address these issues in the development and deployment of NLP models. Disparities in performance between demographic groups can serve as an indicator of fairness problems. LLMs are particularly susceptible to fairness issues, as significant performance disparities have been observed across demographic categories such as dialect, religion, gender, and race [59]. However, research has shown that aligning models with human instructions can improve LLM performance regardless of model size, with the InstructGPT model (davinci v2) exhibiting smaller performance disparities than other LLMs [23].
Spurious Biases. The shortcut learning problem has been observed in various natural language understanding tasks under the pretraining and fine-tuning paradigm, where models rely heavily on spurious correlations between input and labels in the fine-tuning data for prediction [31, 35, 98]. For example, in reading comprehension tasks, fine-tuned models tend to focus on the lexical matching of words between the question and the original passage, neglecting the intended reading comprehension task itself [53]. In contrast, large language models are not directly trained on fine-tuning datasets, which makes them less likely to learn the shortcut features present in those datasets and thereby enhances their generalization capabilities. However, LLMs are not infallible and may exhibit some shortcut learning during in-context learning. For example, recent preliminary studies have begun investigating the robustness of prompt-based methods in large-scale language models [111, 129]. One such study evaluates the few-shot learning performance of GPT-3 on text classification and information extraction tasks [129] and reveals that the examined LLMs are susceptible to majority label bias and position bias, where they tend to predict answers based on the frequency or position of the answers in the training data. Moreover, these LLMs exhibit common token bias, favoring answers that are prevalent in their pre-training corpus. Recent studies show that this positional bias can be mitigated by selecting proper prompts [68]. In summary, while LLMs significantly reduce the shortcut learning problem prevalent in fine-tuned models, they still exhibit some shortcut learning issues and should be approached with caution when deployed in downstream applications.
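One practical precaution these findings suggest is to inspect the few-shot demonstrations before sending a prompt. The sketch below reports how skewed the demonstration labels are and which label appears last, the two properties that majority label bias and position bias would exploit. Reading "training data" as the in-context demonstrations, simplifying position bias to the final label, the helper name, and the example labels are all assumptions; this is a simple diagnostic, not the mitigation method of the cited studies.

```python
from collections import Counter
from typing import Dict, List, Union


def demonstration_bias_report(labels: List[str]) -> Dict[str, Union[str, float, Dict[str, int]]]:
    """Summarize label frequency and position in an ordered list of demonstration labels."""
    if not labels:
        raise ValueError("at least one demonstration label is required")
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return {
        "majority_label": majority_label,                 # label a frequency-biased model may favor
        "majority_share": majority_count / len(labels),   # 1 / num_labels would be perfectly balanced
        "last_label": labels[-1],                         # label a position-biased model may favor
        "label_counts": dict(counts),
    }


if __name__ == "__main__":
    # Hypothetical sentiment demonstrations, in prompt order.
    demo_labels = ["positive", "positive", "negative", "positive"]
    report = demonstration_bias_report(demo_labels)
    for key, value in report.items():
        print(f"{key}: {value}")
```

A skewed `majority_share`, or a `last_label` that matches most of the model's predictions, is a signal to rebalance or reorder the demonstrations and re-evaluate.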
LLMs have demonstrated extremely strong capabilities in many areas such as reasoning, knowledge retention, and coding. As they become more powerful and human-like, their potential to influence people’s opinions and actions in significant ways grows. As a result, new safety challenges to society need to be considered, and they have attracted considerable attention in recent work [75, 76].
Hallucinations. The potential for LLMs to “hallucinate,” or generate nonsensical or untruthful content, can have significant negative impacts on the quality and reliability of information in various applications. As LLMs become increasingly convincing and believable, users may develop an overreliance on them and trust them to provide accurate information in areas with which they are somewhat familiar. This can be particularly dangerous if the model produces content that is entirely false or misleading, leading to incorrect decisions or actions taken based on that information. Such outcomes can have serious consequences in many domains, such as healthcare, finance, or public policy, where the accuracy and reliability of information are critical. To mitigate these issues, reinforcement learning from human feedback (RLHF) is widely used [75, 77] and LLMs themselves have been integrated into the loop [75].
Harmful content. Because texts generated by LLMs are highly coherent, high-quality, and plausible, harmful content from LLMs can cause significant harm, including hate speech, discrimination, incitement to violence, false narratives, and even social engineering attacks. Implementing safeguards to detect and correct such content can serve as a mitigation [97]. These LLMs also have dual-use potential: they can provide illicit information on request, leading to risks such as the proliferation of weapons [75] and even the planning of terrorist attacks. It is crucial to use these LLMs responsibly, with safeguards in place to prevent harm. In existing work, human feedback also plays an important role in eliminating harmful outputs.
Privacy. LLMs can face serious security issues, user privacy among them. Samsung employees reportedly leaked top-secret data inadvertently while using ChatGPT for their work, including the source code of a new program and internal meeting minutes related to hardware. The Italian data protection agency declared that OpenAI, the developer of ChatGPT, illicitly gathered personal user data, making Italy the first government to prohibit ChatGPT over privacy concerns [1].
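The rapid development of LLMs in recent years is profoundly reshaping the field of natural language processing. Using LLMs effectively requires understanding their capabilities and limitations across NLP tasks. This article serves as a practical guide to using LLMs for downstream NLP tasks.
Looking ahead, we believe LLMs face the following challenges:
Evaluating model performance on real-world “datasets”.
Existing deep learning models are evaluated mainly on standard academic datasets such as ImageNet, which do not fully reflect the real world. As models advance, it is crucial to evaluate them on more diverse, complex, and realistic data that reflects real-world needs. Evaluating models on real-world “datasets” requires more rigorous ways of testing their capabilities, as well as a better understanding of their effectiveness in practical applications. This ensures that models can meet real-world challenges and deliver practical solutions.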
Model alignment.
Safety alignment.
While the importance of discussing AI existential risks is self-evident, concrete research is needed to ensure the safe development of advanced AI. This includes techniques for interpretability, scalable oversight and governance, and formal verification of model properties. Safety should be treated not as an add-on but as an integral part of the model-building process.
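Performance prediction with scaling.
As model size and complexity increase dramatically, it becomes hard to predict how model performance will change. Developing better methods for predicting model performance, or proposing new architectures, would allow resources to be used more efficiently and speed up training. Several approaches are possible; they can provide insight into a model's performance before it is even built.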
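As one illustration of predicting performance before building a model, the sketch below fits a power law, loss ≈ a · N^(−b), to the losses of a few small models and extrapolates to a larger parameter count N. The power-law form, the helper names, and the data points (generated from a made-up curve, not measured) are all assumptions for illustration; real loss curves typically need extra terms, such as an irreducible loss floor.

```python
import numpy as np


def fit_power_law(param_counts, losses):
    """Fit loss = a * N**(-b) by least squares in log-log space; returns (a, b)."""
    n = np.asarray(param_counts, dtype=float)
    y = np.asarray(losses, dtype=float)
    if n.shape != y.shape or n.size < 2:
        raise ValueError("need at least two (param_count, loss) pairs of equal length")
    if np.any(n <= 0) or np.any(y <= 0):
        raise ValueError("param counts and losses must be positive")
    # log(loss) = log(a) - b * log(N) is a straight line in log-log space.
    slope, intercept = np.polyfit(np.log(n), np.log(y), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(a: float, b: float, param_count: float) -> float:
    """Extrapolate the fitted power law to a new model size."""
    return a * param_count ** (-b)


if __name__ == "__main__":
    # Synthetic "small model" results from a made-up curve: loss = 20 * N**(-0.1).
    sizes = np.array([1e7, 3e7, 1e8, 3e8])
    losses = 20.0 * sizes ** (-0.1)
    a, b = fit_power_law(sizes, losses)
    print(f"fitted a={a:.2f}, b={b:.3f}")
    print(f"predicted loss at 1e10 parameters: {predict_loss(a, b, 1e10):.3f}")
```

The further the target size lies from the fitted points, the less the extrapolation should be trusted, which is why benchmarking at several intermediate scales is useful.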