Authors:
(1) Aviad Rom, The Data Science Institute, Reichman University, Herzliya, Israel;
(2) Kfir Bar, The Data Science Institute, Reichman University, Herzliya, Israel.
K et al. (2020) have suggested that structural similarity of languages is essential for a language model's multilingual generalization capabilities. Their suggestion was further discussed by Dufter and Schütze (2020), who highlighted the essential components for a model to possess "multilinguality" and showed that the order of the words in the sentence is key to the model's cross-lingual generalization capabilities.
mBERT, as introduced by Devlin et al. (2019), was a pioneering language model that encompassed multiple languages, including Arabic and Hebrew. However, both Arabic and Hebrew are significantly under-represented in its pre-training data, resulting in inferior performance compared to the equivalent monolingual models on various downstream tasks (Antoun et al., 2020; Lan et al., 2020; Chriqui and Yahav, 2022; Seker et al., 2022). GigaBERT (Lan et al., 2020) is another multilingual model, trained for English and Arabic.
However, the best results for most of the known NLP tasks in both Arabic and Hebrew are typically achieved by one of the large monolingual models. CAMeLBERT (Inoue et al., 2021) is one of those models. It is trained on texts written in Modern Standard Arabic (MSA), Classical Arabic, and dialectal variants. Among Hebrew language models, AlephBERT (Seker et al., 2022) stands out as one of the leading performers, alongside others such as HeBERT (Chriqui and Yahav, 2022). Among other datasets, the monolingual models mentioned above use the relevant parts of the OSCAR dataset (Ortiz Suárez et al., 2020) for training. Our model relies solely on the OSCAR dataset for both Hebrew and Arabic, resulting in a considerably smaller total number of words for each language in comparison to the existing monolingual language models.
The effect of transliteration on cross-lingual generalization was discussed previously by Dhamecha et al. (2021) and Chau and Smith (2021), and more recently by Moosa et al. (2023) and Purkayastha et al. (2023). None of these works studies the languages of our focus: Arabic and Hebrew. Dhamecha et al. (2021) focused on languages from the Indo-Aryan family, which has been studied before for cross-lingual generalization and also has several publicly available multilingual models. To the best of our knowledge, our work is the first to study generalization between Arabic and Hebrew, and no multilingual masked language model that includes both languages has been published apart from mBERT.
Chau and Smith (2021) address the generalization from high- to low-resourced languages. However, both Arabic and Hebrew are currently considered medium- to high-resourced languages. Furthermore, their evaluation focuses solely on token-level classification tasks, such as dependency parsing and part-of-speech tagging, whereas our evaluation targets machine translation, a sequence-to-sequence bilingual task.
Purkayastha et al. (2023) employ Romanization for transliteration, whereas we transliterate Arabic into the Hebrew script. Analogous to Chau and Smith (2021), their evaluation centers on token-level classification tasks, which are not addressed in our work.
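To make the contrast with Romanization concrete, the following is a minimal sketch of character-level transliteration from Arabic script into Hebrew script. The mapping table is an assumption made for illustration: it pairs only the 22 Arabic letters that have a conventional cognate among the Hebrew letters, ignores the remaining Arabic letters, diacritics, and Hebrew final forms, and is not the transliteration scheme used in this work. The names `ARABIC_TO_HEBREW` and `transliterate` are hypothetical.

```python
# Illustrative, partial Arabic -> Hebrew letter mapping (an assumption,
# not the scheme used in the paper). Only letters with a conventional
# cognate are listed; everything else is passed through unchanged.
ARABIC_TO_HEBREW = dict([
    ("ا", "א"), ("ب", "ב"), ("ج", "ג"), ("د", "ד"),
    ("ه", "ה"), ("و", "ו"), ("ز", "ז"), ("ح", "ח"),
    ("ط", "ט"), ("ي", "י"), ("ك", "כ"), ("ل", "ל"),
    ("م", "מ"), ("ن", "נ"), ("س", "ס"), ("ع", "ע"),
    ("ف", "פ"), ("ص", "צ"), ("ق", "ק"), ("ر", "ר"),
    ("ش", "ש"), ("ت", "ת"),
])


def transliterate(text: str) -> str:
    """Replace each mapped Arabic letter with its Hebrew counterpart.

    Characters without an entry in the table (unmapped Arabic letters,
    digits, punctuation, Latin text) are kept as they are.
    """
    return "".join(ARABIC_TO_HEBREW.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # The Arabic word for "peace", rendered letter by letter in Hebrew script.
    print(transliterate("سلام"))
```

A Romanization approach would instead map both scripts into Latin characters; the sketch above keeps the output in the Hebrew script, so that transliterated Arabic text and native Hebrew text share one alphabet.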