Published at 2024-03-24 | Last Update 2024-03-24
本文翻译自 2022 年 OpenAI 的论文: Training language models to follow instructions with human feedback, 整理翻译了其中感兴趣的部分。
大模型进化树,可以看到 InstructGPT 所处的年代和位置。来自 大语言模型(LLM)综述与实用指南(Amazon,2023)。
GPT -> InstructGPT -> ChatGPT 的演进过程,可参考 如何训练一个企业级 GPT 助手(OpenAI,2023)。
译者水平有限,不免存在遗漏或错误之处。如有疑问,敬请查阅原文。
以下是译文。
增大模型尺寸未必就能提高它对用户意图的理解能力。 例如,一些大模型可能会生成不真实、有毒或对用户并无帮助(untruthful, toxic, or simply not helpful)的输出。 换句话说,这些模型与它们的用户没有对齐(not aligned)。
本文展示了一种基于人类反馈进行微调(fine-tuning with human feedback),从而在各种任务上将语言模型与用户意图对齐的方法。简单来说,我们将最终得到的这种模型称为 InstructGPT。

175B GPT-3 vs. 1.3B InstructGPT 的人工测评显示,大家更喜欢后者,尽管它的参数不到前者的 1%。尽管 InstructGPT 仍然会犯一些简单的错误,但我们的研究结果表明,基于人类反馈进行微调(fine-tuning with human feedback)是一个很有前途的将语言模型与人类意图对齐的方向。
给定一些任务示例(examples of the task)作为输入,大语言模型(LLMs)可以被 “prompt” 去执行一系列自然语言处理(NLP)任务。
然而,这些模型经常会出现一些意外的行为,比如编造事实、生成有偏见或有毒的文本, 或者忽视用户的指示(Bender 等,2021;Bommasani 等,2021;Kenton 等,2021; Weidinger 等,2021;Tamkin 等,2021;Gehman 等,2020)。
出现以上现象,是因为许多近期的 LLM 建模目标都是(基于互联网数据训练)预测下一个 token,而并不是“有益且安全地遵循用户的指令”(Radford 等,2019;Brown 等,2020;Fedus 等,2021;Rae 等,2021;Thoppilan 等,2022)。也就是说,语言建模目标是有偏差的(the language modeling objective is misaligned)。
由于 LLM 已经部署在大量实际应用中,因此解决大模型的这些非预期行为非常重要。
我们希望通过训练语言模型按照用户意图行事(Leike 等,2018)来推进语言模型的对齐。这里的意图既包括显式意图(例如遵循指令),也包括隐式意图(例如保持真实,不带偏见、不产生有毒或其他有害内容)。

使用 Askell 等(2021)的术语,我们希望语言模型是 helpful(有帮助,应该帮助用户解决任务)、honest(诚实,不应该捏造信息或误导用户)、harmless(无害,不应该对人或环境造成身体、心理或社会伤害)的。我们在第 3.6 节中详细阐述了这些标准的评估。
本文专注于通过微调方法来对齐语言模型。具体来说, 使用人类反馈强化学习(RLHF;Christiano 等,2017;Stiennon 等,2020) 来微调 GPT-3,以便它能遵循类型广泛的各种用户指令。具体过程如图 2 所示,
Figure 2: InstructGPT 三部曲:(1) SFT, (2)
RM training, (3) RLHF via proximal policy optimization (PPO) on RM.
蓝色箭头表示相应的数据用于训练模型。Step 2 中 A-D 是模型输出的采样,然后标注员对它们进行排序。详见 Section 3。
三个步骤:

1. 收集示范数据(demonstration data),训练一个监督策略(supervised policy)。对于给定的输入,标注员给出期望的行为(详见 3.2 节)。然后,使用监督学习(supervised learning)对一个预训练的 GPT-3 模型进行微调。
2. 收集对比数据(comparison data),训练一个奖励模型(RM)。对给定输入,收集两个输出,标注员给出他们的偏好(which output they prefer)。然后,训练一个奖励模型来预测人类偏好输出(human-preferred output)。
3. 针对奖励模型,使用 PPO 对策略进行优化(optimize a policy)。将 RM 的输出作为一个标量奖励,通过 PPO 算法(Schulman 等,2017)对监督策略进行微调(fine-tune the supervised policy),以优化这一奖励。
步骤 2 和 3 可以持续迭代;在当前最佳策略上收集更多的对比数据,这些数据又用于训练新的 RM 和新的策略。 实际上,大部分对比数据来自于我们的 supervised policies,一小部分来自于我们的 PPO policies。
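下面用一小段 Python 伪代码勾勒这三个步骤以及步骤 2、3 的迭代关系(仅是示意性草图,train_sft、train_reward_model、train_ppo 等函数名均为说明而假设,并非论文或某个库的真实 API):

```python
# 三步训练流程的示意性串联(函数名为假设,非真实 API)

def train_sft(pretrained_lm, demonstrations):
    """步骤 1:在标注员示范数据上做监督微调,得到 SFT 模型。"""
    ...

def train_reward_model(policy, comparisons):
    """步骤 2:在人工排序的对比数据上训练奖励模型(RM)。"""
    ...

def train_ppo(policy, reward_model, prompts):
    """步骤 3:以 RM 输出为标量奖励,用 PPO 继续微调策略。"""
    ...

def rlhf_pipeline(pretrained_lm, demonstrations, comparisons, prompts, n_iter=1):
    policy = train_sft(pretrained_lm, demonstrations)
    for _ in range(n_iter):  # 步骤 2、3 可以持续迭代
        reward_model = train_reward_model(policy, comparisons)
        policy = train_ppo(policy, reward_model, prompts)
        # 实际流程中,这里会用当前最优策略采样新输出、收集新的对比数据
    return policy
```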
这个过程将 GPT-3 的行为与特定人群的偏好(stated preferences of a specific group of people,大多是我们的标注员和研究人员), 而非任何更广泛的“人类价值观”对齐;5.2 节将进一步讨论这个问题。
我们训练了三种尺寸的 InstructGPT 模型:1.3B、6B 和 175B,所有模型都使用 GPT-3 架构。
我们的主要发现如下。
我们将 175B GPT-3 与 1.3B InstructGPT 的输出进行了人工测评,大家明显更喜欢后者,尽管它的参数不到前者的 1%。这两类模型具有相同的架构,唯一的区别是 InstructGPT 在人工数据上进行了微调。

作为对比,我们还给 GPT-3 添加了一个 few-shot prompt 以使其更好地遵循指令(即提示词调优过的 GPT-3),但效果仍赶不上 InstructGPT:InstructGPT 的输出在 85 ± 3% 的对比中优于 175B GPT-3,在 71 ± 4% 的对比中优于 few-shot 175B GPT-3。根据标注员的反馈,InstructGPT 模型的输出更符合 prompt,并能更可靠地遵循指令中的明确约束。
在 TruthfulQA 基准测试中,InstructGPT 生成 truthful & informative 答案的概率比 GPT-3 高约一倍。
对于“封闭域”(closed-domain)任务(即输出不应包含输入中不存在的信息,例如摘要和封闭域问答),InstructGPT 的信息虚构率(编造输入中不存在的信息)只有 GPT-3 的一半左右(21% vs. 41%)。
为了衡量毒性,我们使用了 RealToxicityPrompts 数据集(Gehman 等,2020),并进行了自动和人工评估。 当提示模型需要 respectful 时(prompted to be respectful),InstructGPT 生成的有毒输出比 GPT-3 少约 25%。
在 Winogender(Rudinger 等,2018)和 CrowSPairs(Nangia 等,2020)数据集上, InstructGPT 相比 GPT-3 没有明显改进。
在 RLHF 微调过程中,我们观察到在某些公开 NLP 数据集上 InstructGPT 相比 GPT-3 存在性能下降, 尤其是 SQuAD(Rajpurkar 等,2018)、DROP(Dua 等,2019)、HellaSwag(Zellers 等,2019)和 WMT 2015 法英翻译(Bojar 等,2015)。
这是一个“对齐税”(alignment tax)的例子 —— 对齐可能会牺牲在某些任务上的性能。 在不降低标注员偏好分数的前提下,我们通过混合 PPO updates 与 PPO-ptx updates(增加预训练分布的对数似然), 大大减少了在这些数据集上的性能下降。
InstructGPT 可以推广到那些未参与编写训练数据的标注员(held-out labelers)。 为测试 InstructGPT 的泛化能力而进行的初步实验结果表明,与参与训练的标注员(training labelers)一样, 未参与训练的标注员也更喜欢 InstructGPT 而不是 GPT-3 的输出。 当然,还需要进一步研究这些模型在更广泛的用户群体上的表现, 以及它们在人们所期望的行为存在分歧的输入上的表现(inputs where humans disagree about the desired behavior)。
在公开 NLP 数据集上微调,不如在人类偏好数据上微调的效果好。我们比较了两种微调的 GPT-3:一种在人类偏好数据上微调(即 InstructGPT);另一种在两个公开 NLP 数据集(FLAN(Wei 等,2021)和 T0/T0++(Sanh 等,2021))上微调。
这两个数据集包含多种 NLP 任务,以及每个任务的自然语言指令(natural language instructions)。
标注员明显更喜欢 InstructGPT 的输出:相比基线,InstructGPT 的胜率为 73.4 ± 2%,而在 T0 和 FLAN 上微调的模型分别仅为 26.8 ± 2% 和 29.8 ± 2%。

我们对 InstructGPT 的能力进行了定性探究,发现它能够遵循训练分布之外的指令,例如总结代码、回答关于代码的问题,有时还能遵循其他语言的指令。
相比之下,GPT-3 虽然也可以执行这些任务,但需要更精心设计的 prompt ,并且遵循这些领域指令的效果欠佳。
这个结果很令人兴奋,因为它表明 InstructGPT 能够推广“遵循指令”的概念。 即使在只有非常少的直接监督信号(SFT 训练样本)的任务上,它们仍然具备了一定的对齐性。
InstructGPT 仍然会犯一些简单的错误。例如,可能无法遵循指令,捏造事实,对简单问题给出冗长的回答,或无法检测出有错误前提的指令。 但总体而言,我们的结果表明,使用人类偏好微调大语言模型可以显著改善它们在各种任务上的行为, 相应的,也需要更多工作提高它们的安全性和可靠性。
InstructGPT 建立在前人的技术基础上,特别是用人类反馈强化学习(RLHF)来对齐模型。
RLHF 最初是为了在模拟环境(simulated environments)和 Atari 游戏中训练简单机器人而开发的(Christiano 等,2017; Ibarz 等,2018), 最近被用于微调语言模型来总结文本(Ziegler 等,2019; Stiennon 等,2020; Böhm 等,2019; Wu 等,2021)。
这项工作还受到了下列类似工作的影响:
Madaan 等(2022)使用人类反馈来增强 prompts,以提高 GPT-3 的性能。还有工作在基于文本的环境中,使用带有 normative prior 的 RL 来对齐 agents(Nahian 等,2021)。
我们的工作可以看作是用 RLHF 在更广泛的语言任务上对齐语言模型。
近期,“语言模型对齐意味着什么”这一问题备受关注(Gabriel,2020)。
Our work is also related to research on cross-task generalization in language models, where LMs are fine-tuned on a broad range of public NLP datasets (usually prefixed with an appropriate instruction) and evaluated on a different set of NLP tasks. There has been a range of work in this domain (Yi et al., 2019; Mishra et al., 2021; Wei et al., 2021; Khashabi et al., 2020; Sanh et al., 2021; Aribandi et al., 2021), which differ in training and evaluation data, formatting of instructions, size of pretrained models, and other experimental details. A consistent finding across studies is that fine-tuning LMs on a range of NLP tasks, with instructions, improves their downstream performance on held-out tasks, both in the zero-shot and few-shot settings.
There is also a related line of work on instruction following for navigation, where models are trained to follow natural language instructions to navigate in a simulated environment (Bahdanau et al., 2018; Abramson et al., 2020; Zhao et al., 2021).
A goal of modifying the behavior of language models is to mitigate the harms of these models when they’re deployed in the real world. These risks have been extensively documented (Bender et al., 2021; Bommasani et al., 2021; Kenton et al., 2021; Weidinger et al., 2021; Tamkin et al., 2021). Language models can produce biased outputs (Dhamala et al., 2021; Liang et al., 2021; Manela et al., 2021; Caliskan et al., 2017; Kirk et al., 2021), leak private data (Carlini et al., 2021), generate misinformation (Solaiman et al., 2019; Buchanan et al., 2021), and be used maliciously; for a thorough review we direct the reader to Weidinger et al. (2021). Deploying language models in specific domains gives rise to new risks and challenges, for example in dialog systems (Henderson et al., 2018; Xu et al., 2020; Dinan et al., 2019b). There is a nascent but growing field that aims to build benchmarks to concretely evaluate these harms, particularly around toxicity (Gehman et al., 2020), stereotypes (Nadeem et al., 2020), and social bias (Dhamala et al., 2021; Nangia et al., 2020; Rudinger et al., 2018). Making significant progress on these problems is hard since well-intentioned interventions on LM behavior can have side-effects (Welbl et al., 2021; Blodgett et al., 2020); for instance, efforts to reduce the toxicity of LMs can reduce their ability to model text from under-represented groups, due to prejudicial correlations in the training data (Xu et al., 2021).
There are many ways to change the generation behavior of language models. Solaiman and Dennison (2021) fine-tune LMs on a small, value-targeted dataset, which improves the models’ ability to adhere to these values on a question answering task. Ngo et al. (2021) filter the pretraining dataset by removing documents on which a language model has a high conditional likelihood of generating a set of researcher-written trigger phrases. When trained on this filtered dataset, their LMs generate less harmful text, at the cost of a slight decrease in language modeling performance. Xu et al. (2020) use a variety of approaches to improve the safety of chatbots, including data filtering, blocking certain words or n-grams during generation, safety-specific control tokens (Keskar et al., 2019; Dinan et al., 2019a), and human-in-the-loop data collection (Dinan et al., 2019b). Other approaches for mitigating the bias generated by LMs use word embedding regularization (Liu et al., 2019; Huang et al., 2019), data augmentation (Liu et al., 2019; Dinan et al., 2019a; Sheng et al., 2019), null space projection to make the distribution over sensitive tokens more uniform (Liang et al., 2021), different objective functions (Qian et al., 2019), or causal mediation analysis (Vig et al., 2020). There is also work on steering the generation of language models using a second (usually smaller) language model (Dathathri et al., 2019; Krause et al., 2020), and variants of this idea have been applied to reducing language model toxicity (Schick et al., 2021).
我们沿用了 Ziegler 等(2019)和 Stiennon 等(2020)在 stylistic continuation 和 summarization 任务上应用的方法,做了如下基础准备:从一个预训练语言模型(GPT-3;Radford 等,2019;Brown 等,2020;Fedus 等,2021;Rae 等,2021;Thoppilan 等,2022)开始,按照以下三个步骤进行训练,如图 2 所示,
Figure 2: InstructGPT 三部曲:(1) SFT, (2)
RM training, (3) RLHF via proximal policy optimization (PPO) on RM.
蓝色箭头表示相应的数据用于训练模型。Step 2 中 A-D 是模型输出的采样,然后标注员对它们进行排序。详见 Section 3。
1. 收集示范数据(demonstration data),训练一个监督策略(supervised policy)。对于给定的输入,标注员给出期望的行为(详见 3.2 节)。然后,使用监督学习(supervised learning)对一个预训练的 GPT-3 模型进行微调。
2. 收集对比数据(comparison data),训练一个奖励模型(RM)。对给定输入,收集两个输出,标注员给出他们的偏好(which output they prefer)。然后,训练一个奖励模型来预测人类偏好输出(human-preferred output)。
3. 针对奖励模型,使用 PPO 对策略进行优化(optimize a policy)。将 RM 的输出作为一个标量奖励,通过 PPO 算法(Schulman 等,2017)对监督策略进行微调(fine-tune the supervised policy),以优化这一奖励。
步骤 2 和 3 可以持续迭代;每次在当前最佳策略上收集更多的对比数据,这些数据又用于训练新的 RM 和新的策略。 实际上,大部分对比数据来自于我们的 supervised policies,一小部分来自于我们的 PPO policies。
我们的 prompts 数据集主要来自用户提交给 OpenAI API 的文本 prompts,尤其是用户通过 OpenAI Playground 界面提交的那些 prompts —— 本文并没有用到生产环境 OpenAI API 的用户数据。
我们的去重比较简单,有共同长前缀的 prompt 就认为是重复的,并将每个用户 ID 的 prompt 数量限制为 200 个。
我们还基于用户 ID 创建训练、验证和测试集(train, validation, and test splits)。 为了避免模型学习到客户信息,我们过滤掉了训练数据集中包含个人身份信息(PII)的 prompts。
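下面是一个极简的 Python 草图,演示这种“按共同长前缀去重、按用户 ID 限量并划分数据集”的思路(其中前缀长度阈值、字段名与 8/1/1 的划分比例均为假设,论文并未给出具体实现):

```python
import hashlib
from collections import defaultdict

PREFIX_LEN = 200      # 假设:前缀多长算“共同长前缀”,论文未给出具体数值
MAX_PER_USER = 200    # 每个用户 ID 最多保留 200 条 prompt

def dedup_and_split(records):
    """records: [{'user_id': str, 'prompt': str}, ...]"""
    seen_prefixes = set()
    per_user = defaultdict(int)
    splits = {"train": [], "valid": [], "test": []}

    for r in records:
        prefix = r["prompt"][:PREFIX_LEN]
        if prefix in seen_prefixes:              # 有共同长前缀的 prompt 视为重复
            continue
        if per_user[r["user_id"]] >= MAX_PER_USER:
            continue
        seen_prefixes.add(prefix)
        per_user[r["user_id"]] += 1

        # 按 user_id 的哈希划分 train/valid/test,保证同一用户只出现在一个 split 中
        h = int(hashlib.md5(r["user_id"].encode()).hexdigest(), 16) % 10
        split = "train" if h < 8 else ("valid" if h == 8 else "test")
        splits[split].append(r)
    return splits

# 用法示例
if __name__ == "__main__":
    data = [{"user_id": "u1", "prompt": "Write a story about a frog"},
            {"user_id": "u2", "prompt": "Write a story about a frog pond"}]
    print({k: len(v) for k, v in dedup_and_split(data).items()})
```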
为了训练最初的 InstructGPT 模型,我们要求标注员自己编写 prompt。这是因为我们需要一些初始的指令式 prompts 来启动这个过程,而这类数据很难从 GPT-3 API 的用户数据中获得(用户通常不会提交这些格式的 prompts)。
我们要求标注员编写三种类型的 prompt:

- Plain:标注员提出任意的任务,只需确保任务具有足够的多样性。
- Few-shot:标注员提出一条指令,并为该指令提供多个查询/响应对(query/response pairs)。
- User-based:OpenAI API 的 waitlist applications 中我们列了一些使用案例,我们要求标注员提供与这些使用案例相关的 prompts。详见附录 A。
根据以上 prompts,我们生成了三个不同的数据集用于不同的微调过程,如表 6 所示,
Table 6: Dataset sizes, in terms of number of prompts.
split | source | size |
---|---|---|
SFT train | labeler | 11,295 |
SFT train | customer | 1,430 |
SFT valid | labeler | 1,550 |
SFT valid | customer | 103 |
SFT 总计 | | ~15k |
RM train | labeler | 6,623 |
RM train | customer | 26,584 |
RM valid | labeler | 3,488 |
RM valid | customer | 14,399 |
RM 总计 | | ~50k |
PPO train | customer | 31,144 |
PPO valid | customer | 16,185 |
PPO 总计 | | ~47k |
这三个数据集分别是:SFT 数据集(13k training prompts)、RM 数据集(33k training prompts)和 PPO 数据集(31k training prompts)。表 1 中展示了 API prompt(尤其是 RM 数据集)的类别分布,这些类别由我们的承包商标注。可以看到,占比最大的是文本生成。
Table 1: API prompt dataset 中 use case 类别及占比
Use-case | (%) |
---|---|
Generation | 45.6% |
Open QA | 12.4% |
Brainstorming | 11.2% |
Chat | 8.4% |
Rewrite | 6.6% |
Summarization | 4.2% |
Classification | 3.5% |
Other | 3.5% |
Closed QA | 2.6% |
Extract | 1.9% |
表 2 展示了几个 prompt 示例(由研究人员编写,提交给 InstructGPT 的格式),
Table 2: API prompt 具体例子。
Use-case | Prompt |
---|---|
Brainstorming | List five ideas for how to regain enthusiasm for my career |
Generation | Write a short story where a bear goes to the beach, makes friends with a seal, and then returns home. |
Rewrite | This is the summary of a Broadway play: “”“ {summary} “”“ This is the outline of the commercial for that play: “”“ |
更多信息:InstructGPT 的 prompts 见附录 A.2.1,GPT-3 的 prompts(做对比)见附录 A.2.2。

我们的训练任务有两个来源:(1) 由标注员编写的 prompt 数据集;(2) 通过 API 提交给早期 InstructGPT 模型的 prompt 数据集(见表 6)。
这些 prompt 种类繁多,包括生成、问答、对话、摘要、提取和其他自然语言任务 (见表 1)。
我们的数据集中 96% 以上是英文,但在 4.3 节中,我们也探讨了 InstructGPT 对其他语言指令的响应能力以及完成代码任务的能力。
对于每个自然语言 prompt,任务通常是通过自然语言指令直接指定的,但也可以间接指定,例如通过:

- few-shot examples:提供两个关于青蛙的故事作为示例,prompt 模型生成一个新的故事;
- implicit continuation:提供一个关于青蛙的故事的开头,让模型续写。

在每种情况下,我们都要求标注员尽力推断每个 prompt 背后的用户意图,并要求他们跳过那些任务非常模糊的 prompt。此外,标注员还会根据我们提供的指导(见附录 B)和他们自己的判断,思考其中隐含的意图(implicit intentions),例如回答的真实性,以及潜在的有偏见或有毒输出。
为了生成示范和对比数据,以及进行结果评估,我们通过 Upwork 和 ScaleAI 雇了大约 40 名外包人员。
与之前关于摘要任务的人类偏好数据收集工作 (Ziegler 等,2019; Stiennon 等,2020; Wu 等,2021) 相比, 我们的输入数据涵盖了范围更广泛的任务,甚至还包括有争议和敏感的主题。
我们的目标是选择一组标注员,他们对不同人群的偏好很敏锐,并擅长识别潜在的有害输出。因此,我们通过一个筛选测试(screening test)来衡量标注员在这些方面的表现,并选择在测试中表现良好的人。相关的选择过程和标注员分布信息,见附录 B.1。
在训练和评估过程中,我们的对齐标准可能会发生冲突:例如,当用户请求一个潜在有害的响应时。针对这种情况,在训练时我们优先考虑对用户的有用性(helpfulness);而在最终评估时,我们要求标注员优先考虑真实性和无害性,因为这才是我们真正关心的。
与 Stiennon 等 (2020) 一样,我们在项目过程中与标注员密切合作。我们有一个入职流程,对标注员进行培训, 为每个任务编写详细的说明 (见附录 B.2),并在群聊中回答标注员的问题。
为了解 InstructGPT 推广到其他标注员的偏好时表现有多好, 我们雇佣了另一组独立的标注员,他们不参与编写任何训练数据。 这些标注员来自相同的供应商,但没有经过前面的筛选过程。
Despite the complexity of the task, we find that inter-annotator agreement rates are quite high: training labelers agree with each other 72.6 ± 1.5% of the time, while for held-out labelers this number is 77.3 ± 1.3%. For comparison, in the summarization work of Stiennon et al. (2020), researcher-researcher agreement was 73 ± 4%.
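作为参考,下面用一个极简的 Python 示例说明这种标注员两两一致率(inter-annotator agreement)指标的含义(数据结构与示例数据均为虚构,仅用于演示):

```python
from itertools import combinations

def pairwise_agreement(labels_by_annotator):
    """labels_by_annotator: {标注员: {样本 id: 偏好标签}};
    返回所有标注员两两之间,在共同标注样本上的平均一致率。"""
    rates = []
    for a, b in combinations(labels_by_annotator, 2):
        common = labels_by_annotator[a].keys() & labels_by_annotator[b].keys()
        if not common:
            continue
        agree = sum(labels_by_annotator[a][i] == labels_by_annotator[b][i] for i in common)
        rates.append(agree / len(common))
    return sum(rates) / len(rates)

# 用法示例:三个标注员对两条 comparison 的偏好(A/B 哪个更好)
print(pairwise_agreement({
    "l1": {1: "A", 2: "B"},
    "l2": {1: "A", 2: "A"},
    "l3": {1: "A", 2: "B"},
}))  # 输出 (0.5 + 1.0 + 0.5) / 3 ≈ 0.667
```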
我们从 GPT-3 预训练模型开始微调。GPT-3 在大量互联网数据上进行了训练,适用于各种下游任务, 但其行为尚未充分符合人类需求。基于 GPT-3,我们使用三种不同技术进行了模型微调。
使用监督学习的方式,在我们的示范数据上对 GPT-3 进行微调。我们训练了 16 个 epoch,使用 cosine learning rate decay,residual dropout 为 0.2,得到很多个 SFT 模型。最后根据 validation set 上的 RM 分数选择最终的 SFT 模型。
与 Wu 等(2021)类似,我们发现我们的 SFT 模型在 1 个 epoch 后在 validation loss 上就会过拟合(overfit); 但是,同时我们发现,尽管存在过拟合问题,但更多 epoch 对 RM 分数和人类偏好得分都有帮助。
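这种“不看 validation loss、而按 RM 分数挑选 SFT checkpoint”的做法,大致可以用下面的草图表示(其中 model.generate 与 reward_model.score 等接口均为假设,并非论文或某个库的真实 API):

```python
def select_best_checkpoint(checkpoints, reward_model, validation_prompts):
    """对每个 SFT checkpoint:在验证集 prompts 上采样回复,
    用奖励模型打分,返回平均 RM 分数最高的 checkpoint(接口为假设)。"""
    def mean_rm_score(model):
        scores = []
        for prompt in validation_prompts:
            response = model.generate(prompt)                     # 假设的生成接口
            scores.append(reward_model.score(prompt, response))   # 假设的打分接口
        return sum(scores) / len(scores)

    return max(checkpoints, key=mean_rm_score)
```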
将 SFT 模型去掉最后的 unembedding 层,从它开始训练一个新模型:输入是 prompt 和 response,输出是一个标量奖励(scalar reward)。最后得到的就是一个 RM 模型。
在本文中,我们仅使用 6B 的 RM,因为这样可以节省大量计算资源;并且我们发现 175B 的 RM 训练可能不稳定,因此不太适合在 RL 中用作 value function(更多细节见附录 C)。
在 Stiennon 等(2020)中,给两个模型相同的输入,然后得到两份输出作为对比数据, RM 是在这个对比数据集(dataset of comparisons)上进行训练的。
They use a cross-entropy loss, with the comparisons as labels—the difference in rewards represents the log odds that one response will be preferred to the other by a human labeler.
为了快速收集对比数据,我们将 $K=4$ 到 $K=9$ 之间的输出(即一个 input/prompt 喂给模型,得到 K 个 output)都提供给标注员,并要求他们对其进行排名(rank)。 这样每个 prompt 就对应 ${K \choose 2}$ 个对比数据。
由于每个 labeling task 内的 comparisons 非常相关,我们发现如果简单地 shuffle the comparisons into one dataset, 对数据集训练一次就会导致 RM 过拟合。
That is, if each of the possible ${K \choose 2}$ comparisons is treated as a separate data point, then each completion will potentially be used for $K-1$ separate gradient updates.
The model tends to overfit after a single epoch, so repeating data within an epoch also causes it to overfit.
因此,我们将每个 prompt 的所有 ${K \choose 2}$ 个对比作为单个 batch element 进行训练。 这样做在计算上更加高效,因为只需要一次正向遍历(forward pass of the RM for each completion, 而不是 ${K \choose 2}$ forward passes for $K$ completions), 并且由于不再过拟合,它实现了更好的 validation accuracy and log loss。
具体来说,奖励模型的损失函数为:
\[\begin{equation} \label{eq1} \begin{split} \operatorname{loss}\left(\theta \right) = -\frac{1}{\binom{K}{2}} E_{\left(x, y_{w}, y_{l}\right) \sim D}\left[\log \left(\sigma\left(r_{\theta}\left(x, y_{w}\right)-r_{\theta}\left(x, y_{l}\right)\right)\right)\right] \end{split} \end{equation}\]
其中,$r_{\theta}(x, y)$ 是奖励模型对 prompt $x$ 和 completion $y$ 输出的标量奖励,$y_w$ 是一对 completion 中人类更偏好的那个,$y_l$ 是另一个,$D$ 是人类对比数据集。
最后,由于 RM loss 对奖励的平移不变,我们使用一个 bias 来对奖励模型进行归一化,这样标注员的示范在进行 RL 之前的平均分数为 0。
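下面给出公式 (1) 的一个 PyTorch 参考实现草图:把同一个 prompt 的 K 个 completion 的奖励当作一个 batch element,一次性计算所有 ${K \choose 2}$ 个组合的损失(变量约定与末尾的归一化注释是本文的假设性写法,并非论文代码):

```python
import torch
import torch.nn.functional as F

def rm_pairwise_loss(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape (K,),同一个 prompt 下 K 个 completion 的 RM 标量输出,
    且已按人类偏好从高到低排序(rewards[i] 对应排名第 i 的回复)。
    对所有 C(K, 2) 个 (y_w, y_l) 组合计算 -log(sigmoid(r_w - r_l)) 并取平均。"""
    K = rewards.shape[0]
    idx_w, idx_l = torch.triu_indices(K, K, offset=1)   # 所有 i < j 的组合,i 为偏好较高者
    diff = rewards[idx_w] - rewards[idx_l]               # r_theta(x, y_w) - r_theta(x, y_l)
    return -F.logsigmoid(diff).mean()                    # 对 C(K, 2) 个组合取平均,即公式 (1)

# 用法示例:K = 4 个 completion 的奖励(随机数,仅为演示)
rewards = torch.randn(4, requires_grad=True)
loss = rm_pairwise_loss(rewards)
loss.backward()

# 训练结束后,由于该 loss 对奖励整体平移不敏感,
# 可以再加一个偏置项,使标注员示范(demonstrations)的平均 RM 分数为 0。
```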
再次沿用 Stiennon 等(2020)的做法,我们使用 PPO(Schulman 等,2017)对 SFT 模型进行微调。我们创建一个 bandit 环境:随机给定一个客户 prompt,策略生成一个 response;奖励模型对这组 prompt/response 打分,给出一个标量奖励,该 episode 随即结束。
此外,我们添加了一个 per-token 的 KL 惩罚(来自 SFT 模型),以减轻奖励模型的过优化(over-optimization)。
值函数(value function)是从 RM 初始化的。我们将这些模型称为 PPO。
我们还尝试将预训练梯度(pretraining gradients)mixing into PPO 梯度中,以减轻在公开 NLP 数据集上的性能下降。
我们将这些模型称为 PPO-ptx。
我们在 RL 训练中最大化以下组合目标函数:
\[\begin{equation} \label{eq2} \begin{split} \operatorname{objective}\left(\phi\right) = \; & E_{\left(x, y\right) \sim D_{\pi_{\phi}^{\mathrm{RL}}}}\left[r_{\theta}(x, y)-\beta \log \left(\pi_{\phi}^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right] + \\ & \gamma E_{x \sim D_\textrm{pretrain}}\left[\log(\pi_{\phi}^{\mathrm{RL}}(x))\right] \end{split} \end{equation}\]
其中,$\pi_{\phi}^{\mathrm{RL}}$ 是要学习的 RL 策略,$\pi^{\mathrm{SFT}}$ 是监督微调得到的模型,$D_\textrm{pretrain}$ 是预训练分布;KL 系数 $\beta$ 和预训练损失系数 $\gamma$ 分别控制 KL 惩罚和预训练梯度的强度。对于 PPO 模型,$\gamma$ 设为 0。
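下面用一小段 PyTorch 风格的草图示意这一组合目标的计算方式(把 per-token 的 KL 惩罚折算成序列级奖励,再加上预训练分布的对数似然项;张量形状与系数取值仅为示意,并非论文实现):

```python
import torch

def ppo_ptx_objective(rm_reward, logp_rl, logp_sft, logp_pretrain, beta=0.02, gamma=27.8):
    """rm_reward:          (B,)   RM 对每个 (prompt, response) 的标量奖励
    logp_rl / logp_sft:    (B, T) RL 策略 / SFT 模型对 response 每个 token 的 log prob
    logp_pretrain:         (B, T) RL 策略对预训练数据 token 的 log prob
    beta、gamma 取自附录 C 中给出的数值,仅作示意。"""
    kl_per_token = logp_rl - logp_sft                     # log(pi_RL / pi_SFT)
    reward = rm_reward - beta * kl_per_token.sum(dim=-1)  # 公式 (2) 第一项(对 token 求和)
    ptx_term = gamma * logp_pretrain.sum(dim=-1)          # 公式 (2) 第二项:预训练对数似然
    return (reward + ptx_term).mean()                     # 需要最大化的目标(PPO 中取负作为 loss)

# 示意用法(随机张量)
B, T = 2, 5
obj = ppo_ptx_objective(torch.randn(B), torch.randn(B, T),
                        torch.randn(B, T), torch.randn(B, T))
```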
除非另有说明,在本文中,InstructGPT 指的是 PPO-ptx models。
我们将 PPO 模型与下列模型进行比较:
GPT-3-prompted:向 GPT-3 提供一个 few-shot prefix 以“提示”它进入指令跟随模式。在实现上,就是把这个前缀插入到用户输入的指令之前。
To obtain this prefix, authors RL and DA held a prefix-finding competition: each spent an hour interacting with GPT-3 to come up with their two best prefixes. The winning prefix was the one that led GPT-3 to attain the highest RM score on the prompt validation set. DA won.
在 FLAN 和 T0 数据集上微调过的 175B GPT-3
这两个数据集都包含各种 NLP 任务,以及每个任务的自然语言指令。 我们分别在约 1 million examples 上对它们进行微调,并选择在验证集上获得最高奖励分数的 checkpoint。更多细节见附录 C。
为了评估我们模型的“对齐”程度,首先需要明确“对齐”的含义。
一直以来,“对齐”的定义都很模糊和令人困惑,有很多提法(Chen 等,2021;Leike 等,2018;Gabriel,2020)。
按照 Leike 等(2018)的方法,我们的目标是训练符合用户意图的模型。
更实际地说,为了完成语言任务,我们使用了类似于 Askell 等(2021)的框架:他们认为,如果模型是 helpful, honest, and harmless 的,那这个模型就是对齐的(aligned)。

要做到有帮助(helpful),模型不仅要能遵循指令,还应该能从一个 few-shot prompt 或其他可解释的模式(例如 Q: {question}\nA:)中推断意图(infer intention)。
给定的 prompt 的意图可能不清楚,这种情况下就依赖于标注员的判断,我们的主要指标是标注员的偏好评分。但另一方面,由于我们的标注员不是生成 prompt 的用户,因此,用户实际的意图与标注员仅通过阅读 prompt 所推断出的意图之间,可能存在偏差。
在纯生成式模型中如何衡量诚实度尚无定论;这需要将模型的实际输出与其关于正确输出的“信念”进行比较 (model’s actual output to its “belief” about the correct output), 由于模型是一个黑盒,我们无法推断它的信念。
因此,我们使用真实性(truthfulness)—— 模型关于世界的陈述是否真实 —— 来衡量。具体到两个指标:(1) 评估模型在封闭域任务上编造信息(即“幻觉”,hallucination)的倾向;(2) 使用 TruthfulQA 数据集(Lin 等,2021)。
显然,这只能覆盖真实性实际含义(what is actually meant by truthfulness)的一小部分。
与诚实度类似,衡量语言模型的有害性也很难。在大多数情况下,语言模型的危害取决于其输出在现实世界中是如何被使用的。例如,一个生成有毒输出的模型,部署在聊天机器人中可能是有害的;但如果用于数据增强、以训练更准确的毒性检测模型,则可能反而是有益的。
在项目早期,我们让标注员评估输出是否“可能有害”。但后来停止了这项工作,因为这需要太多关于输出最终将如何被使用的猜测(speculation),尤其是我们的部分数据还来自 Playground API 用户(而非生产环境的使用场景)。
因此,我们使用了一套更具体的替代方案,旨在捕捉最终部署的模型中可能导致有害的不同行为方面: 我们让标注员从一个用户助理的角度来评估输出是否恰当,是否 denigrates a protected class,或是否包含性或暴力内容。
我们还在衡量偏见和毒性的数据集上对 InstructGPT 进行基准测试,例如 RealToxicityPrompts(Gehman 等,2020)和 CrowS-Pairs(Nangia 等,2020)。
我们的定量评估分为两个独立的部分。
数据来源:通过 OpenAI Playground API(背后是 InstructGPT)收集来的用户 prompts。所以,评估用的 prompts 与训练用的 prompts 同源,但未参与训练,也就是说只选择那些未参与训练的客户 prompts。

但这里有个问题:训练用的 prompts 是专门为 InstructGPT 设计的,因此 GPT-3 在这些 prompts 上的效果可能不佳,有失公平。为此,我们还收集了用户通过 OpenAI GPT-3 API 提交的 prompts 进行评估;这些 prompts 通常不是“遵循指令”的风格,而是专门为 GPT-3 设计的。
主要评估指标是人类偏好评分。
对于每个模型,都计算其输出相对于 baseline 被人类偏好的频率;这里用我们的 175B SFT 模型作为 baseline,因为它的性能处于中等水平。
此外,我们要求标注员使用 1-7 Likert scale 判断每个 response 的整体质量,并为每个输出收集一些元数据(见表 3)。
Table 3: Labeler-collected metadata on the API distribution
Metadata | Scale |
---|---|
Overall quality | Likert scale; 1-7 |
Fails to follow the correct instruction / task | Binary |
Inappropriate for customer assistant | Binary |
Hallucination | Binary |
Satisfies constraint provided in the instruction | Binary |
Contains sexual content | Binary |
Contains violent content | Binary |
Encourages or fails to discourage violence/abuse/terrorism/self-harm | Binary |
Denigrates a protected class | Binary |
Gives harmful advice | Binary |
Expresses opinion | Binary |
Expresses moral judgment | Binary |
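论文的主要指标是相对 175B SFT baseline 的人类偏好胜率,图中的误差条为 95% 置信区间。下面的小段 Python 代码演示如何从两两对比结果计算这一指标(数据为虚构示例;论文实际使用的区间估计方法未必是这里的正态近似):

```python
import math

def win_rate_with_ci(wins: int, total: int, z: float = 1.96):
    """返回胜率及其置信区间(正态近似;z=1.96 对应 95%)。"""
    p = wins / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, (p - half_width, p + half_width)

# 示例:假设在 1000 次两两对比中,某模型有 730 次被标注员偏好于 175B SFT baseline
p, (low, high) = win_rate_with_ci(730, 1000)
print(f"win rate = {p:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```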
我们在两类公开数据集上进行评估:(1) 能够反映语言模型安全性(尤其是真实性、毒性和偏见)的数据集;(2) 能够反映传统 NLP 任务(如问答、阅读理解和摘要)零样本性能的数据集。
我们还在 RealToxicityPrompts 数据集(Gehman 等,2020)上人工评估了毒性。
We are releasing samples from our models on all of the sampling-based NLP tasks.
暂略。 见原文。
Figure 1: Human evaluations of various models on our API prompt distribution, evaluated by how often outputs from each model were preferred to those from the 175B SFT model. Our InstructGPT models (PPO-ptx) as well as its variant trained without pretraining mix (PPO) significantly outperform the GPT-3 baselines (GPT, GPT prompted); outputs from our 1.3B PPO-ptx model are preferred to those from the 175B GPT-3. Error bars throughout the paper are 95% confidence intervals
暂略。 见原文。
Prompt 长什么样非常重要,因此这里给出完整附录。此外,有些 prompts 很有意思。译注。
We first give slightly more details on our prompt bootstrapping process. As previously mentioned, for the majority of the project, we obtained prompts directly from external users of the instruct beta models in the OpenAI API. However, this strategy only works once you have a model that accepts instruction-like prompts. In order to train the very first such model, we asked contractors to write prompts themselves. We asked labelers to write three kinds of prompts:
In order to preserve the anonymity of the application information, we had a separate labeler create vague high level tasks based on looking at a list of applications, modifying the task descriptions to eliminate any information that were specific to a given application. This data was used to train the first InstructGPT model via supervised learning, which was deployed in beta in the API in early 2021.
For API prompts, we use prompts submitted by users to the aforementioned earlier version of the InstructGPT model on the OpenAI API Playground. Throughout the paper, we only use data from the Playground, rather than customers using our model in production, as it was easier to get informed consent: every time a user switched to an InstructGPT model, an alert message would pop up stating that prompts submitted to these models could be used to train future versions of our models. We also communicated this in a message on the developer Slack channel upon launching the beta of the InstructGPT models. We filter out prompts from the training split containing personally identifiable information (PII).
To ensure a diversity of use cases, we heuristically deduplicate prompts by checking for prompts that share a long common prefix, and limited the number of prompts to roughly 200 per organization. In addition, we create train, validation, and test splits based on organization IDs, so that e.g. the validation set contains different use cases than the training set. We conceptualized API requests as belonging to one of ten use cases: generation, open QA, closed QA, brainstorming, chat, rewriting, summarization, classification, extraction, or other. Below, we show fictional but realistic prompts from a variety of use cases:
Use Case | Example |
---|---|
brainstorming | List five ideas for how to regain enthusiasm for my career |
brainstorming | What are some key points I should know when studying Ancient Greece? |
brainstorming | What are 4 questions a user might have after reading the instruction manual for a trash compactor? {user manual} 1. |
brainstorming | What are 10 science fiction books I should read next? |
classification | Take the following text and rate, on a scale from 1-10, how sarcastic the person is being (1 = not at all, 10 = extremely sarcastic). Also give an explanation {text} Rating: |
classification | This is a list of tweets and the sentiment categories they fall into. Tweet: {tweet_content1} Tweet: {tweet_content2} |
classification | {java code} What language is the code above written in? |
classification | You are a very serious professor, and you check papers to see if they contain missing citations. Given the text, say whether it is missing an important citation (YES/NO) and which sentence(s) require citing. {text of paper} |
extract | Extract all course titles from the table below: |
extract | Extract all place names from the article below: {news article} |
extract | Given the following list of movie titles, write down any names of cities in the titles. {movie titles} |
generation | Write a creative ad for the following product to run on Facebook aimed at parents: Product: {product description} |
generation | Write a short story where a brown bear to the beach, makes friends with a seal, and then return home. |
generation | Here’s a message to me: — {email} — Here are some bullet points for a reply: Write a detailed reply |
generation | This is an article about how to write a cover letter when applying for jobs: — It’s important to spend some time |
generation | write rap lyrics on the topics mentioned in this news article: — {article} — |
rewrite | This is the summary of a Broadway play: “”“ {summary} “”“ This is the outline of the commercial for that play: |
rewrite | Translate this sentence to Spanish: |
rewrite | Create turn-by-turn navigation given this text: Go west on {road1} unto you hit {road2}. then take it east to {road3}. 1. |
rewrite | Rewrite the following text to be more light-hearted: — {very formal text} — |
chat | The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly. Human: Hello, who are you? |
chat | Marv is a chatbot that reluctantly answers questions with sarcastic responses: You: How many pounds are in a kilogram? |
chat | This is a conversation with an enlightened Buddha. Every response is full of wisdom and love. Me: How can I achieve greater peace and equanimity? |
closed qa | Help me answer questions about the following short story: {story} What is the moral of the story? |
closed qa | Answer the following question: What shape is the earth? A) A circle |
closed qa | Tell me how hydrogen and helium are different, using the following facts: {list of facts} |
open qa | I am a highly intelligent question answering bot. If you ask me a question that is rooted in truth, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with “Unknown”. Q: What is human life expectancy in the United States? |
open qa | Who built the statue of liberty? |
open qa | How do you take the derivative of the sin function? |
open qa | who are the indiginous people of New Zealand? |
summarization | Summarize this for a second-grade student: {text} |
summarization | {news article} Tl;dr: |
summarization | {chat transcript} Summarize the above conversation between a customer and customer assistant. Make sure to state any complaints that the customer has. |
other | start with where |
other | Look up “cowboy” on Google and give me the results. |
other | Johnathan Silver goes to the market every day, and brings back a |
Next, we list some schematic examples of API requests for each use-case category, for prompts submitted to GPT-3 models. These are generally less ‘instruction-style’, and contain more explicit prompting. Note that there are some prompts where the user intent is unclear.
Use Case | Example |
---|---|
brainstorming | indie movie ideas: - A guy travels to South America to become a shaman. - A documentary about the world of juggling. |
brainstorming | Baby name ideas for a boy: 1. Alfred 2. Theo 3. |
brainstorming | Tell me a list of topics related to: - interior design - sustainable ecosystems - fake plants |
brainstorming | Name some rare gems |
classification | This is a tweet sentiment classifier. {tweet} |
classification | The following is a list of products and the kind of product they are. Product: {product}. Type: {type} Product: {product}. Type: {type} Product: {product}. Type: |
classification | The following is a list of companies and the categories they fall into: Apple, Facebook, Fedex Apple Category: Technology Category: Social Media Fedex Category: |
extract | Text: {text} Keywords: |
generation | “Hey, what are you doing there?” Casey was startled. He hadn’t even begun to |
generation | The name of the next Star Wars movie is |
generation | This is the research for an essay: === {description of research} === Write a high school essay on these topics: === |
generation | Write an outline for an essay about John von Neumann and his contributions to computing: I. Introduction, his life and background A: His early life B: |
rewrite | Covert my resume into a profile overview. {resume} Profile overview: |
rewrite | Rephrase this for me: “I can’t seem to find out how to work this darn thing.” Alternate phrasing: “ |
rewrite | Original: She no go to sleep. Standard American English: She didn’t go to sleep Original: It real bad for I to make do of this. |
chat | The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly. Human: Hello, who are you? |
chat | This is a conversation with Steven. Steven likes to watch Netflix and hasn’t left his home in 2 weeks. John: Hey man what’s up? |
closed qa | When you drop a heavy stone from a tree, what happens? A. The stone falls to the ground. B: The stone stays in the tree. C: The stone floats. D: Nothing happens. Answer: |
closed qa | Text: {article describing what yoga mats to buy} Question: What are the things I should consider when buying a yoga mat? Answer: |
open qa | Q: Who is Batman? A: Batman is a fictional comic book character. Q: What is torsalplexity? A: ? Q: What is Devz9? A: ? Q: Who is George Lucas? A: George Lucas is American film director and producer famous for creating Star Wars. Q: What is the capital of California? A: |
open qa | Who was the best human who ever lived? |
open qa | Q: Who is Leonardo da Vinci? A: |
summarization | My second grader asked me what this passage means. “”“ I rephrased it for him in plain terms that a second grader could understand: “”” |
summarization | ””“ {text} “”“ I summarized the above as: |
other | She said, and I quote AI: |
other | - I like to play Call of Duty - I like to play Call of Duty - I like to play Call of Duty - I like to play Call of Duty |
SFT 15k / RM 50k / PPO 47k
用来 train/validate SFT、RM、RL 三个模型的数据集大小,以及其中多少是标注员写的、多少来自 OpenAI API 的用户数据,如下表所示。
Table 6: Dataset sizes, in terms of number of prompts.
split | source | size |
---|---|---|
SFT train | labeler | 11,295 |
SFT train | customer | 1,430 |
SFT valid | labeler | 1,550 |
SFT valid | customer | 103 |
SFT 总计 | | ~15k |
RM train | labeler | 6,623 |
RM train | customer | 26,584 |
RM valid | labeler | 3,488 |
RM valid | customer | 14,399 |
RM 总计 | | ~50k |
PPO train | customer | 31,144 |
PPO valid | customer | 16,185 |
PPO 总计 | | ~47k |
For SFT, note that we have many more labeler-written prompts than customer prompts—this is because, at the start of the project, we had labelers write instructions with a user interface that asked them to give an overarching template instruction as well as few-shot examples for that instruction.
We synthetically constructed multiple SFT datapoints from the same instruction by sampling different sets of few-shot examples.
For the RM, recall that for every prompt, we collected rankings for K outputs (ranging from 4 to 9) and trained the model on all ${K \choose 2}$ comparisons, so the number of ranked pairs we trained the model on is an order of magnitude larger than the number of prompts.
The data that we collect spans a wide range of categories and use cases. Table 1 shows the diversity of categories in our RM training and validation datasets, as labeled by our contractors. The distribution of categories for the PPO datasets was similar. We additionally show a subset of our labeled prompt metadata in Table 7.
Table 7: Dataset annotations
Annotation | RM test | RM train | SFT valid | SFT train | SFT valid |
---|---|---|---|---|---|
Ambiguous | – | 7.9% | 8.0% | 5.1% | 6.4% |
Sensitive content | – | 6.9% | 5.3% | 0.9% | 1.0% |
Identity dependent | – | – | – | 0.9% | 0.3% |
Closed domain | 11.8% | 19.4% | 22.9% | 27.4% | 40.6% |
Continuation style | – | 15.5% | 16.2% | 17.9% | 21.6% |
Requests opinionated content | 11.2% | 7.7% | 7.5% | 8.6% | 3.4% |
Requests advice | 3.9% | – | – | - | |
Requests moral judgment | 0.8% | 1.1% | 0.3% | 0.3% | 0.0% |
Contains explicit safety constraints | – | 0.4% | 0.4% | 0.3% | 0.0% |
Contains other explicit constraints | – | 26.3% | 28.9% | 25.6% | 20.7% |
Intent unclear | 7.9% | – | – | – | – |
Note that our annotation fields changed over the course of the project, so not every prompt was annotated for every field.
We used a lightweight classifier (langid.py) to classify the language of all instructions in our dataset. Empirically, around 96% of our dataset (110k datapoints) is classified as English, although we estimate that the actual fraction may be 99% or higher, due to classifier inaccuracies. Besides English, a small minority of prompts were found in at least 20 other languages: Spanish, French, German, Portuguese, Italian, Dutch, Romanian, Catalan, Chinese, Japanese, Swedish, Polish, Danish, Turkish, Indonesian, Czech, Norwegian, Korean, Finnish, Hungarian, Hebrew, Russian, Lithuanian, Esperanto, Slovak, Croatian, Swahili, Estonian, Slovenian, Arabic, Thai, Vietnamese, Malayalam, Greek, Albanian, and Tibetan.
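文中提到的轻量级语言分类器 langid.py 的用法大致如下(示例文本与英文占比统计仅为演示):

```python
import langid  # pip install langid

prompts = [
    "Write a story about a wise frog.",
    "Écris une histoire sur une grenouille sage.",
]
for p in prompts:
    lang, score = langid.classify(p)   # 返回 (语言代码, 置信分数)
    print(lang, score, p)

# 统计英文占比(示意)
english_ratio = sum(langid.classify(p)[0] == "en" for p in prompts) / len(prompts)
print(f"English fraction: {english_ratio:.0%}")
```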
Table 8 shows the average number of prompts each customer contributed to the dataset.
Table 8: Average prompts per customer
Model | Split | Prompts per customer |
---|---|---|
SFT | train | 1.65 |
SFT | valid | 1.87 |
RM | train | 5.35 |
RM | valid | 27.96 |
PPO | train | 6.01 |
PPO | valid | 31.55 |
– | test | 1.81 |
Table 9: Prompt lengths by dataset
Model | Split | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
---|---|---|---|---|---|---|---|---|---|
SFT | train | 12725 | 408 | 433 | 1 | 37 | 283 | 632 | 2048 |
SFT | valid | 1653 | 401 | 433 | 4 | 41 | 234 | 631 | 2048 |
RM | train | 33207 | 199 | 334 | 1 | 20 | 64 | 203 | 2032 |
RM | valid | 17887 | 209 | 327 | 1 | 26 | 77 | 229 | 2039 |
PPO | train | 31144 | 166 | 278 | 2 | 19 | 62 | 179 | 2044 |
PPO | valid | 16185 | 186 | 292 | 1 | 24 | 71 | 213 | 2039 |
– | test set | 3196 | 115 | 194 | 1 | 17 | 49 | 127 | 1836 |
In Table 9, we report descriptive statistics for prompt lengths (in tokens) used to train various models, and in Table 10 we break down token lengths by use case.
Table 10: Prompt lengths by category
Category | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
---|---|---|---|---|---|---|---|---|
Brainstorming | 5245 | 83 | 149 | 4 | 17 | 36 | 85 | 1795 |
Chat | 3911 | 386 | 376 | 1 | 119 | 240 | 516 | 1985 |
Classification | 1615 | 223 | 318 | 6 | 68 | 124 | 205 | 2039 |
Extract | 971 | 304 | 373 | 3 | 74 | 149 | 390 | 1937 |
Generation | 21684 | 130 | 223 | 1 | 20 | 52 | 130 | 1999 |
QA, closed | 1398 | 325 | 426 | 5 | 68 | 166 | 346 | 2032 |
QA, open | 6262 | 89 | 193 | 1 | 10 | 18 | 77 | 1935 |
Rewrite | 3168 | 183 | 237 | 4 | 52 | 99 | 213 | 1887 |
Summarization | 1962 | 424 | 395 | 6 | 136 | 284 | 607 | 1954 |
Other | 1767 | 180 | 286 | 1 | 20 | 72 | 188 | 1937 |
Table 11: Prompt and demonstration lengths
Prompt source | Measurement | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
---|---|---|---|---|---|---|---|---|---|
Contractor | prompt length | 12845 | 437 | 441 | 5 | 42 | 324 | 673 | 2048 |
Contractor | demo length | 12845 | 38 | 76 | 1 | 9 | 18 | 41 | 2048 |
Customer | prompt length | 1533 | 153 | 232 | 1 | 19 | 67 | 186 | 1937 |
Customer | demo length | 1533 | 88 | 179 | 0 | 15 | 39 | 88 | 2048 |
Finally, in Table 11 we also report the lengths of contractor-written demonstrations used for our SFT model, for both contractor-written and customer-submitted prompts.
暂略。见原文。
所有模型都使用 GPT-3 架构(Brown et al., 2020),使用 fp16 权重和激活(with fp32 master copies of weights),上下文长度为 2k token;超过 1k token 的 prompt 会被过滤掉,response 的最大长度也限制为 1k token。所有模型都使用 Adam 优化器训练,β1 = 0.9,β2 = 0.95。

SFT 模型训练
最终模型是基于 RM 分数选择的,我们发现与 validation loss 相比,RM 分数更能预测人类偏好结果。
同一个 6B RM 模型用于所有尺寸的 PPO 模型。175B RM 有可能实现更低的 validation loss,但初步实验结果显示,6B RM 模型在大范围的学习率上都很稳定,能训练出一样强大的 PPO 模型。
The final reward model was initialized from a 6B GPT-3 model that was fine-tuned on a variety of public NLP datasets (ARC, BoolQ, CoQA, DROP, MultiNLI, OpenBookQA, QuAC, RACE, and Winogrande). This was mostly for historical reasons; we find similar results when initializing the RM from the GPT-3 or SFT models. We trained for a single epoch over the full reward model training set (see Table 6) at a learning rate of lr = 9e-6, a cosine learning rate schedule (dropping to 10% of its initial value by the end of training), and a batch size of 64. Training did not appear to be very sensitive to the learning rate or schedule; changes of up to 50% in the learning rate resulted in similar performance. Training was quite sensitive to the number of epochs: multiple epochs quickly overfit the model to the training data with obvious deterioration in the validation loss. The batch size here represents the distinct number of prompts per batch. Each prompt had between K = 4 and K = 9 labeled completions, from which there were up to ${K \choose 2}$ possible comparisons. Ties were dropped. Therefore, a single batch could contain up to 64 × ${K \choose 2}$ ≤ 2,304 comparisons.
We initialize the RLHF models from a pretrained GPT-3 model and apply supervised fine-tuning for 2 epochs on the demonstration dataset. We also mix in 10% pretraining data during fine-tuning, since we find it helpful for PPO training (see Appendix E.11 for details). Cosine learning rate schedule is used and the learning rate eventually decays to 10% of the peak learning rate. We use a batch size of 32 for 1.3B and 6B models and 8 for the 175B model. We compare a few different peak learning rates for each model and pick the one with low losses on both the demonstration and the pretraining validation datasets. A log linear sweep of 5 values of the LR’s are compared for 1.3B and 6B models and 3 values are compared for the 175B model. The resultant LR’s for the 1.3B, 6B, and 175B models are 5e-6, 1.04e-5 and 2.45e-6, respectively.
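文中多处用到“cosine 学习率调度、最终衰减到峰值的 10%”。下面是一个通用的 PyTorch 草图(总步数等参数为假设,学习率与 betas 取自上文 6B 模型的数值):

```python
import math
import torch

def make_cosine_to_10pct(optimizer, total_steps, final_ratio=0.1):
    """余弦调度:从峰值学习率平滑衰减到峰值的 final_ratio(这里为 10%)。"""
    def lr_lambda(step):
        progress = min(step / max(total_steps, 1), 1.0)
        cosine = 0.5 * (1 + math.cos(math.pi * progress))   # 从 1 衰减到 0
        return final_ratio + (1 - final_ratio) * cosine      # 从 1 衰减到 final_ratio
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# 示意用法
model = torch.nn.Linear(8, 8)
opt = torch.optim.Adam(model.parameters(), lr=1.04e-5, betas=(0.9, 0.95))
sched = make_cosine_to_10pct(opt, total_steps=1000)
for _ in range(1000):
    opt.step()
    sched.step()
```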
We then initialize the RL policies from the above supervised fine-tuned models with pretraining mix. These models are also used to compute the KL reward, in the same way as Stiennon et al. (2020), with β = 0.02 (see Equation 2). We train all the RL models for 256k episodes. These episodes include about 31k unique prompts, after filtering out prompts with PII and deduplication based on common prefixes. The batch size for each iteration is 512, with a minibatch size of 64. In other words, each batch is randomly split into 8 minibatches and is trained on for only a single inner epoch (Schulman et al., 2017). A constant learning rate is applied with a warmup over the first 10 iterations, starting with one tenth of the peak learning rate. Exponential moving averages of the weights are applied, with a decay rate of 0.992. No discount is applied when estimating the generalized advantage (Schulman et al., 2016). The PPO clip ratio is set to 0.2, and the sampling temperature is 1 for rollouts. As previously mentioned, for all PPO models we use a 6B RM and a 6B value function, and the latter is initialized from the former. By using the same 6B reward model and value function on policies of all model sizes, it’s easier to compare the effect of policy model size on policy performance. A fixed learning rate of 9e-6 for the value function is used for 1.3B and the 6B policies and 5e-6 for the 175B policy.
Our initial RLHF experiments showed regressions on public NLP datasets, such as SQuADv2 and DROP, and we mitigate the regressions by mixing in pretraining gradients during PPO training. We use 8 times more pretraining examples than the number of the RL training episodes. The pretraining data is randomly drawn from the dataset used to train the GPT-3 models. For each minibatch, we compute the PPO gradients and pretraining gradients in consecutive steps and accumulate them both into the gradient buffers. We multiply the pretraining gradients by a coefficient, γ = 27.8 (see Equation 2), to control the relative strength of gradients from PPO and pretraining distributions.
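上面描述的“PPO 梯度与预训练梯度在连续两步中分别计算、累加到同一梯度缓冲区”的做法,大致对应下面的 PyTorch 草图(损失函数与 batch 的构造均为假设,仅作示意):

```python
import torch

GAMMA = 27.8   # 预训练梯度的系数,见公式 (2)

def mixed_update(policy, optimizer, ppo_loss_fn, ppo_batch, lm_loss_fn, pretrain_batch):
    """先计算 PPO 损失的梯度,再计算预训练(语言建模)损失的梯度并乘以 GAMMA;
    两次 backward 的梯度自然累加在 .grad 缓冲区中,最后一起执行一次参数更新。"""
    optimizer.zero_grad()
    ppo_loss_fn(policy, ppo_batch).backward()                  # PPO 梯度
    (GAMMA * lm_loss_fn(policy, pretrain_batch)).backward()    # 预训练梯度(已缩放)
    optimizer.step()
```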
We obtain our FLAN and T0 baselines by fine-tuning a 175B GPT-3 model on the FLAN and T0 datasets. For T0, note that we trained on the T0++ version of the dataset. Because T0 contains much more data (96M datapoints) than FLAN (1.2M datapoints), we subsampled T0 to 1 million datapoints to make the amount of training data comparable for each model. Note that the original models train on epochs where datapoints can be repeated, but in our epochs we go through every datapoint without repeats (to better match the way we trained our SFT baselines). We applied a cosine learning rate schedule, and try initial learning rates of 4e-6 and 6e-6 for each dataset. The learning rate decays to 10% of its peak at the end of training, and we use a batch size of 64 for both experiments.
To choose the best FLAN checkpoint, we use our 6B reward model to score the completions on the validation set of prompts. As shown in Figure 13, the reward saturates after the initial 400k examples of training. This indicates that training for even longer will unlikely improve the human eval performance. We picked the checkpoint with the highest RM score for our human evaluation, which is the one trained with learning rate of 4e-6 and for 896k examples.
We perform two similar experiments to find the best T0 checkpoint. In one experiment, we used a batch size of 128, a learning rate of 4e-6 and 1.28 million examples. The other experiment used a batch size of 64, a learning rate of 6e-6 and 1 million examples. Once again using the reward model score, we picked the checkpoint from the former experiment after 896k examples of training
暂略。见原文。
暂略。见原文。
In this section, we provide some additional samples from both the 175B GPT-3 and 175B InstructGPT (PPO-ptx) models. We sample at T = 1 for InstructGPT, and use T = 0.7 for GPT-3, since GPT-3 performs poorly at high temperatures (this slightly disadvantages InstructGPT).
In Figure 42, we show the full French sample from Figure 8, illustrating that our model is sometimes able to follow instructions in other languages, despite our dataset containing almost exclusively English. In Figure 44, we show our model’s propensity to answer instructions that may be harmful, a result of us prioritizing helpfulness to the user in our training data. In Figure 45, we show another example of our model describing code, though it is still far from perfect.
In Figures 46–50, we show labeler-written prompts from our dataset, along with model samples and the human-written demonstration. These 5 prompts were selected from 15 to show a range of different tasks.
(略)。