Training Large Language Models (LLMs) requires significant computational resources, and most existing models are trained primarily on English text. This creates substantial challenges for training LLMs in other languages, chiefly because of high computational costs and insufficient training data in those languages.
To address these challenges, we need methods that facilitate the training of LLMs in new languages while minimizing environmental impact. One promising approach is Language Transfer, which involves initializing a model in a target language using a model trained on a source language, typically English.
Language Transfer approaches can be categorized into two main types:
- Mono-Lingual or Cross-Lingual Transfer
  - This approach uses a model trained on a single source language to initialize a new model in a target language that is different from the source language, for example transferring a model trained on English to German.
- Multi-Lingual Transfer
  - This method uses a model trained on one or more source languages to initialize a model in a target language, which can be one of the source languages or a different one, for example transferring a model trained on English and German to either German or French.
WECHSEL
The paper [Min22W] introduces WECHSEL, a method to transfer trained models to new languages.
It requires as input:
- a tokenizer in the source language.
- a pre-trained language model in the source language.
- a tokenizer in the target language.
- two monolingual fastText embeddings [Boj17E], one for the source language and one for the target language. They can be obtained in one of two ways:
  - use pre-trained fastText embeddings.
  - train fastText embeddings from scratch.
- a bilingual dictionary mapping words in the source language to words in the target language (see the loading sketch below).
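As a rough illustration, the required inputs could be gathered as follows. This is a minimal sketch rather than the authors' code: the checkpoint names, file paths, and the use of the fasttext and transformers libraries are assumptions.

```python
# Minimal input-loading sketch (assumed setup; names and paths are placeholders).
import fasttext
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Source-language tokenizer and pre-trained source model (English RoBERTa as an example).
source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
source_model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Target-language tokenizer, trained beforehand on target-language text.
target_tokenizer = AutoTokenizer.from_pretrained("./target-tokenizer")

# Monolingual fastText embeddings for the source and target languages.
ft_source = fasttext.load_model("cc.en.300.bin")
ft_target = fasttext.load_model("cc.de.300.bin")  # e.g. German as the target language

# Bilingual dictionary as (source_word, target_word) pairs, e.g. exported from MUSE.
with open("en-de.txt", encoding="utf-8") as f:
    dictionary = [tuple(line.split()[:2]) for line in f if line.strip()]
```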
Algorithm
Figure 1 gives a high-level overview of the method and its different components.
The method proceeds as follows:
Use the tokenizers to split the words in the bilingual dictionary into subwords (tokens).
Use the fastText embeddings to compute subword embeddings as the sum of the embeddings of their n-grams.
$$ \mathbf{u}_x = \sum_{g \in \mathbb{G}(x)} \mathbf{w}_g $$
where $\mathbb{G}(x)$ is the set of n-grams occurring in the subword $x$ and $\mathbf{w}_g$ is the embedding of the n-gram $g$.
The embeddings of subwords in which no known n-gram occurs are initialized to zero.
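A possible implementation of this step with the fasttext Python bindings is sketched below, continuing the loading sketch above; `get_subwords` returns the character n-grams of a string together with their indices, and `get_input_vector` returns the corresponding n-gram vector. The helper and variable names are ours.

```python
import numpy as np

def subword_embedding(token: str, ft_model) -> np.ndarray:
    """u_x: sum of the fastText n-gram vectors occurring in the subword `token`."""
    # In practice, tokenizer-specific prefix characters (e.g. 'Ġ' or '▁') would be stripped first.
    _, ngram_ids = ft_model.get_subwords(token)
    if len(ngram_ids) == 0:
        # No known n-gram: keep a zero vector; it is re-initialized randomly later.
        return np.zeros(ft_model.get_dimension(), dtype=np.float32)
    return np.sum([ft_model.get_input_vector(i) for i in ngram_ids], axis=0)

# Subword embedding matrices for the full source and target tokenizer vocabularies,
# with rows ordered by token id.
src_vocab = sorted(source_tokenizer.get_vocab().items(), key=lambda kv: kv[1])
tgt_vocab = sorted(target_tokenizer.get_vocab().items(), key=lambda kv: kv[1])
U_s = np.stack([subword_embedding(tok, ft_source) for tok, _ in src_vocab])
U_t = np.stack([subword_embedding(tok, ft_target) for tok, _ in tgt_vocab])
```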
Align subword embeddings using the bilingual dictionary and the Orthogonal Procrustes method [Sch66G, Art16L].
$$ \underset{W}{\text{argmin}} \lVert \mathbf{U}^t W - \mathbf{U}^s \rVert_F^2 $$
where $\lVert \cdot \rVert_F$ is the Frobenius norm and $W$ is constrained to be an orthogonal matrix ($W^T W = I$).
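The sketch below shows one way to realize this step with scipy's `orthogonal_procrustes`. Representing each dictionary word by the mean of its subword embeddings is a simplifying assumption of this sketch, not a detail taken from the paper.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def word_vector(word: str, tokenizer, ft_model) -> np.ndarray:
    """Represent a dictionary word by the mean of its subword embeddings (sketch simplification)."""
    return np.mean([subword_embedding(t, ft_model) for t in tokenizer.tokenize(word)], axis=0)

# Paired matrices built from the bilingual dictionary (row i of both matrices is a translation pair).
X_s = np.stack([word_vector(src, source_tokenizer, ft_source) for src, _ in dictionary])
X_t = np.stack([word_vector(tgt, target_tokenizer, ft_target) for _, tgt in dictionary])

# Solve argmin_W ||X_t W - X_s||_F with W^T W = I, then rotate every target subword embedding.
W, _ = orthogonal_procrustes(X_t, X_s)
U_t_aligned = U_t @ W
```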
Compute the cosine similarity $s_{x,y}$ between every subword $x$ in the target language and every subword $y$ in the source language.
$$ s_{x,y} = \frac{ \mathbf{u}_x^t \mathbf{u}_y^{sT} }{ \lVert \mathbf{u}_x^t \rVert \lVert \mathbf{u}_y^s \rVert } $$
Initialize each embedding of the target model as a weighted average of the source model's embeddings, using a softmax over the cosine similarities between aligned subword embeddings as weights.
$$ \mathbf{e}_x^t = \frac{ \sum_{y \in \mathcal{J}_x} \exp(s_{x,y} / \tau ) \cdot \mathbf{e}_y^s }{ \sum_{y^{\prime} \in \mathcal{J}_x} \exp(s_{x,y^{\prime}} / \tau) } $$
where $\mathcal{J}_x$ is the set of the $k$ source-language subwords that are nearest neighbours of $x$ (by cosine similarity) and $\tau$ is a temperature hyperparameter.
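Continuing the sketch, these two steps (the similarity matrix and the temperature-weighted average over the $k$ nearest source subwords) could be implemented as follows. The default values `k=10` and `tau=0.1` are assumptions of this sketch, and `E_s` stands for the source model's input embedding matrix.

```python
import numpy as np

def wechsel_init(U_t_aligned: np.ndarray, U_s: np.ndarray, E_s: np.ndarray,
                 k: int = 10, tau: float = 0.1) -> np.ndarray:
    """Initialize target embeddings E_t as softmax-weighted averages of source embeddings.

    U_t_aligned: aligned target subword embeddings, shape (|V_t|, d_ft)
    U_s:         source subword embeddings,         shape (|V_s|, d_ft)
    E_s:         source model input embeddings,     shape (|V_s|, d_model)
    """
    # Cosine similarity s_{x,y} between every target subword x and every source subword y.
    Tn = U_t_aligned / (np.linalg.norm(U_t_aligned, axis=1, keepdims=True) + 1e-8)
    Sn = U_s / (np.linalg.norm(U_s, axis=1, keepdims=True) + 1e-8)
    S = Tn @ Sn.T

    E_t = np.zeros((U_t_aligned.shape[0], E_s.shape[1]), dtype=E_s.dtype)
    for x in range(S.shape[0]):
        neighbours = np.argpartition(-S[x], k)[:k]   # J_x: the k most similar source subwords
        w = np.exp(S[x, neighbours] / tau)           # temperature-scaled softmax weights
        E_t[x] = w @ E_s[neighbours] / w.sum()
    return E_t
```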
Subword embeddings that were set to zero (because no known n-gram occurred in them) are instead initialized from a normal distribution $\mathcal{N}(\mathbb{E}[\mathbf{E}^s], \mathrm{Var}[\mathbf{E}^s])$, where $\mathbf{E}^s$ is the source model's embedding matrix.
Copy the non-embedding parameters of the source model to the target model.
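Putting the last steps together, a rough PyTorch/transformers sketch might look as follows. Treating every parameter whose name contains "embed" as an embedding parameter, and drawing the random re-initialization per dimension, are simplifications of this sketch rather than details from the paper.

```python
import numpy as np
import torch
from transformers import AutoConfig, AutoModelForMaskedLM

# Fresh target model: same architecture as the source, but with the target vocabulary size.
config = AutoConfig.from_pretrained("roberta-base", vocab_size=len(target_tokenizer))
target_model = AutoModelForMaskedLM.from_config(config)

# WECHSEL initialization of the target input embeddings.
E_s = source_model.get_input_embeddings().weight.detach().cpu().numpy()
E_t = wechsel_init(U_t_aligned, U_s, E_s)

# Subwords whose fastText embedding stayed all-zero: draw from N(E[E^s], Var[E^s]) (per dimension here).
zero_rows = ~U_t_aligned.any(axis=1)
E_t[zero_rows] = np.random.normal(E_s.mean(axis=0), E_s.std(axis=0),
                                  size=(int(zero_rows.sum()), E_s.shape[1]))

# Copy all non-embedding parameters of the source model into the target model.
target_state = target_model.state_dict()
for name, tensor in source_model.state_dict().items():
    if "embed" not in name and name in target_state and target_state[name].shape == tensor.shape:
        target_state[name] = tensor.clone()
target_model.load_state_dict(target_state)

# Finally, write the WECHSEL-initialized embedding matrix into the target model.
with torch.no_grad():
    target_model.get_input_embeddings().weight.copy_(torch.as_tensor(E_t))
```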
Experiments
For their experiments, the authors used RoBERTa and GPT-2 models trained on English, transferring them to four medium-resource languages: French, German, Chinese, and Swahili, as well as four low-resource languages: Sundanese, Scottish Gaelic, Uyghur, and Malagasy.
They employed automatically generated bilingual dictionaries from MUSE [Con18W] for French, German, and Chinese, and a bilingual dictionary from FreeDict for Swahili. For the low-resource languages, they used bilingual dictionaries scraped from Wiktionary and stored in their repository.
The authors compared their method to two other approaches:
- FullRand: randomly initializes the target model and trains it from scratch.
- TransInner: randomly initializes the embedding parameters and copies the non-embedding parameters; it trains only the embedding parameters for a fixed number of steps while freezing the remaining parameters, then trains the entire model.
For all methods and models, they trained on 65.5 billion tokens, significantly fewer than the baselines they compare against (e.g., CamemBERT with 419.4 billion tokens, GBERT-Base with 255.6 billion, and BERT-Base-Chinese with 131.1 billion). All models were trained for 250k steps with consistent hyperparameters across languages.
WECHSEL-RoBERTa was evaluated by fine-tuning on XNLI for NLI performance and on the balanced train-dev-test split of WikiANN for NER performance. WECHSEL-GPT-2 was evaluated by Perplexity (PPL) on a held-out set from the training corpus.
Results
From Figure 2, we observe that WECHSEL significantly improves cross-lingual parameter transfer and outperforms comparably sized models trained from scratch, requiring up to 64 times less training effort.
As shown in Table 1 and Table 2, models initialized with WECHSEL generally outperform models trained from scratch and those initialized with TransInner across all languages examined.
Notably, the close relatedness of the source and target languages is not a prerequisite for effective transfer. For instance, in NLI tasks, WECHSEL improves absolute accuracy by 7.15%, 6.31%, 6.94%, and 4.71% over models trained from scratch for French, German, Chinese, and Swahili, respectively.
Comments
Despite its increased efficiency compared to training from scratch, the WECHSEL method has several inherent weaknesses, including its complexity and its strong reliance on bilingual dictionaries for aligning the embedding vectors.
This reliance becomes a real problem when the dictionaries contain mistakes, since their quality directly impacts the alignment and, consequently, the initialization of the model. In the experiments, the authors used automatically extracted dictionaries for three of the languages, which introduced omissions and errors. For instance, in the French-English dictionary, several mistakes can be identified:
- Incorrect word mappings: some entries are mapped to the wrong word. For example, on line 80, feasible is mapped to viable, whereas it should be mapped to faisable; on line 48641, friend is mapped to padna, which is a word that does not exist in French, according to this online dictionary.
- Inconsistent order: the order of the languages is inconsistent. In most lines, English precedes French (e.g., line 94), while in some, French precedes English (e.g., line 81).
These inaccuracies in the dictionaries can significantly affect the performance and reliability of the resulting language models.
Additionally, a crucial gap in the research is the lack of investigation into the method's effect on social biases within the models. Although the authors acknowledge the potential risk of such biases, they neither explore their extent nor propose ways to mitigate them. Addressing these concerns is essential for improving the reliability and ethical soundness of using WECHSEL to transfer language models.