Authors:
(1) Suzanna Sia, Johns Hopkins University;
(2) David Mueller;
(3) Kevin Duh.
Self-supervised large language models have demonstrated the ability to perform Machine Translation (MT) via in-context learning, but little is known about where the model performs the task with respect to prompt instructions and demonstration examples. In this work, we attempt to characterize the region where large language models transition from in-context learners to translation models. Through a series of layer-wise context-masking experiments on GPTNEO2.7B, BLOOM3B, LLAMA7B, and LLAMA7B-CHAT, we demonstrate evidence of a "task recognition" point where the translation task is encoded into the input representations and attention to context is no longer necessary. We further observe that the layers whose removal most degrades performance when masking out entire layers correspond to the task recognition layers. Exploiting this redundancy yields 45% computational savings when prompting with 5 examples, with task recognition achieved at layer 14 of 32. Our layer-wise fine-tuning experiments indicate that the most effective layers for MT fine-tuning are those critical to task recognition.
In-context learning (ICL) refers to the phenomenon in which large generative pretrained transformers (GPTs) perform tasks with no gradient updates when shown task examples or descriptions in their context (Brown et al., 2020; Bommasani et al., 2021). While in-context learning in GPT models appears to be generally applicable to any natural language task, to study task location we use Machine Translation (MT), as there is little to no ambiguity in evaluating whether the model has recognised the task: it must generate tokens in a different language. While in-context MT has yet to reach parity with supervised neural MT models, its off-the-shelf translation performance is comparatively strong and suggests a promising direction for the future of MT (Hendy et al., 2023; Garcia et al., 2023). Prior work on in-context MT has focused on prompt engineering, treating GPT models as black boxes and asking which examples to provide in context (Moslem et al., 2023). Agrawal et al. (2022) apply similarity-based retrieval to select in-context examples, while Sia & Duh (2023) suggest a coherence-based approach. However, these works apply surface-level interventions, leaving the internal mechanism of MT in GPT models largely unexplained.
In this work, we ask: where does in-context Machine Translation occur in GPT models? We conduct an initial exploration into locating the self-attention layers responsible for in-context MT in three base pre-trained and one instruction-tuned open-source GPT models. Using causal masking over different parts of the context, we demonstrate that there exists a "task recognition" point after which attention to the context is no longer necessary (Section 3). A potential implication is large computational savings when the context is several times longer than the test source sentence (Section 5). Having identified the layers in which "task recognition" occurs, we study the extent to which subsequent layers are redundant and how they relate to the "task recognition" layers. Simple layer-wise masking shows that for 3B-parameter models, removing attention around the "task recognition" layers can cause the model to fail to perform translation altogether, whereas layers towards the end of the model are much more redundant (Section 4.1).
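To make the context-masking setup concrete, the following is a minimal sketch (not the authors' exact implementation) of how a per-layer attention mask could remove attention to the prompt and demonstrations from a chosen layer onward. The names `ctx_len` and `mask_from_layer` are illustrative assumptions.

```python
import torch

def build_layer_masks(seq_len: int, ctx_len: int, n_layers: int,
                      mask_from_layer: int) -> torch.Tensor:
    """Return a (n_layers, seq_len, seq_len) boolean mask; True = attention allowed."""
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    masks = causal.unsqueeze(0).repeat(n_layers, 1, 1)
    # For layers >= mask_from_layer, query positions after the context
    # (the test source sentence and its translation) can no longer attend
    # to the first `ctx_len` context tokens (instructions + demonstrations).
    masks[mask_from_layer:, ctx_len:, :ctx_len] = False
    return masks

# Example: a 100-token 5-shot prompt in a 120-token input to a 32-layer model,
# with attention to the context removed from layer 14 onward.
layer_masks = build_layer_masks(seq_len=120, ctx_len=100, n_layers=32,
                                mask_from_layer=14)
```

If translation quality is unaffected for sufficiently deep `mask_from_layer`, the task has already been "recognised" by that point and the context tokens need not be processed in later layers.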
Next, we observe that very lightweight fine-tuning of LoRA parameters (Hu et al., 2021) is most effective at earlier layers of the model compared to later ones (Section 6.2). This supports the conjecture that earlier layers are more important for the task.
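As an illustration of layer-wise LoRA fine-tuning, the sketch below re-implements a minimal LoRA adapter and attaches it only to a chosen subset of layers; it is not the authors' training code, and the attribute paths in the usage comment are hypothetical.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # only the adapter is trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Hypothetical usage: adapt only the attention projections of the earlier layers
# (e.g. layers 0-13 of a 32-layer model). `model.layers[i].attn.q_proj` is an
# assumed attribute path, not a real API.
# for i in range(14):
#     attn = model.layers[i].attn
#     attn.q_proj = LoRALinear(attn.q_proj)
#     attn.v_proj = LoRALinear(attn.v_proj)
```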
We further investigate the extent of MT task redundancy using differentiable L0 regularisation to train discrete attention head gates (Section 6.5). We find that around 10% of the attention heads can be masked, which differs fundamentally from findings in the supervised NMT literature, where attention heads are highly specialised for MT (Voita et al., 2019b; Michel et al., 2019; Behnke & Heafield, 2021).
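A common way to realise differentiable L0 regularisation is the hard-concrete gate of Louizos et al. (2018), also used for head pruning by Voita et al. (2019b). The sketch below is an illustrative re-implementation under that assumption, not the authors' code.

```python
import torch
import torch.nn as nn

class HeadGates(nn.Module):
    """One stochastic hard-concrete gate per attention head; penalty approximates the L0 norm."""
    def __init__(self, n_heads: int, beta: float = 2/3,
                 gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_heads))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self) -> torch.Tensor:
        if self.training:
            # Sample from the hard-concrete distribution (reparameterised, differentiable).
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        s_bar = s * (self.zeta - self.gamma) + self.gamma
        return s_bar.clamp(0.0, 1.0)             # gate values in [0, 1]; exact zeros prune heads

    def l0_penalty(self) -> torch.Tensor:
        # Expected number of non-zero gates: the differentiable surrogate for the L0 norm.
        return torch.sigmoid(
            self.log_alpha - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        ).sum()

# Usage sketch: multiply each head's output by its gate and add
# lambda * gates.l0_penalty() to the task loss during training.
```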
In-Context Learning was first demonstrated by Brown et al. (2020), who showed that GPT-3 could perform a wide variety of tasks without any task-specific parameters or training by conditioning the model's generation on a prompt that included a few labeled examples of the task of interest. Since then, interest in using GPT models for ICL has grown significantly, with several recent works introducing methods such as instruction-tuning (Sanh et al., 2022; Wang et al., 2022) or chain-of-thought prompting (Wei et al., 2022) to improve downstream ICL accuracy.
Ostensibly, ICL can work for nearly any task that can be defined or described in natural language, and therefore has potential for incredibly broad impact. However, ICL can often still underperform supervised fine-tuning (Bhatia et al., 2023), prompting research into the mechanisms underlying ICL. One line of work studies in-context learning of linear functions, typically linear regression, characterizing the learnability of these functions with ICL (Li et al., 2023; Garg et al., 2022) and even the learning algorithm a transformer uses (Akyürek et al., 2022; Dai et al., 2023; von Oswald et al., 2023). A second body of work suggests that in-context learning locates existing latent concepts (tasks) already learnt during pretraining (Xie et al., 2021; Wies et al., 2023). Finally, Wei et al. (2023) suggest that model size may change the mechanism behind ICL from latent inference to an actual learning algorithm as size increases. Our work, which focuses on Machine Translation, fits into this line of work by demonstrating that there exists a point in the model's layers where the task has been located.
Many works use the layers of the model as a natural unit of analysis for interpretability (Hewitt & Liang, 2019; De Cao et al., 2020; Pasad et al., 2021; Durrani et al., 2022; Ben-Shaul & Dekel, 2022; Sajjad et al., 2023). We highlight the work most closely related to task performance. Xie et al. (2022) study layer-wise adaptability via a hidden-state variability ratio, while Voita et al. (2019a) study the evolution of representations in MT-supervised transformer models. Phang et al. (2021) study when model layers can be skipped by feeding intermediate representations into the final output layer of a pre-trained supervised model. Our work adds to this body of work by considering when and where layers are responsible for task location in in-context learning models.
In-Context Machine Translation While GPT models are strong few-shot learners, their pre-training data is historically dominated by English, limiting their ability to perform translation tasks (Hendy et al., 2023). Lin et al. (2022) find that an explicitly multilingual GPT significantly outperforms traditional English models such as GPT-3, and Garcia et al. (2023) find that such models can even be competitive with supervised MT models in some settings. However, even with explicit multilingual pre-training, in-context MT has been found to be very sensitive to the examples used (Liu et al., 2022) and their ordering (Lu et al., 2022). In response, recent work focuses on how to select prompts that elicit the best downstream MT performance (Agrawal et al., 2022; Sia & Duh, 2023). However, further improvement of translation with GPT models is hindered by our limited understanding of how MT emerges inside them. Our work directly analyses when, in the layer representations, a GPT model becomes a translation model via in-context learning, and how this may inform decisions around parameter tuning and redundancy.