Training Large Language Models (LLMs) requires significant computational resources, and most existing models are trained primarily on English text. This creates substantial challenges for training LLMs in other languages, chiefly because of high computational costs and insufficient training data in those languages.
To address these challenges, we need methods that facilitate the training of LLMs in new languages while minimizing environmental impact. One promising approach is Language Transfer, which involves initializing a model in a target language using a model trained on a source language, typically English.
Language Transfer approaches can be categorized into two main types:
- Mono-Lingual or Cross-Lingual Transfer
  - This approach uses a model trained on a single source language to initialize a new model in a target language that is different from the source language, for example transferring a model trained on English to German.
- Multi-Lingual Transfer
  - This method uses a model trained on one or more source languages to initialize a model in a target language, which can be one of the source languages or a different one, for example transferring a model trained on English and German to either German or French.
WECHSEL
The paper [Min22W] introduces WECHSEL, a method to transfer trained models to new languages.
It requires as input:
- a tokenizer in the source language.
- a pre-trained language model in the source language.
- a tokenizer in the target language.
- two monolingual fastText embeddings [Boj17E], one for the source language and one for the target language. They can be obtained in one of two ways:
  - use pre-trained fastText embeddings.
  - train fastText embeddings from scratch.
- a bilingual dictionary mapping words in the source language to words in the target language (see the loading sketch below).
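As a rough illustration, the required inputs could be gathered as follows. This is a minimal sketch rather than the authors' code: the checkpoint names, file paths, and the use of the fasttext and transformers libraries are assumptions.

```python
# Minimal input-loading sketch (assumed setup; names and paths are placeholders).
import fasttext
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Source-language tokenizer and pre-trained source model (English RoBERTa as an example).
source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
source_model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Target-language tokenizer, trained beforehand on target-language text.
target_tokenizer = AutoTokenizer.from_pretrained("./target-tokenizer")

# Monolingual fastText embeddings for the source and target languages.
ft_source = fasttext.load_model("cc.en.300.bin")
ft_target = fasttext.load_model("cc.de.300.bin")  # e.g. German as the target language

# Bilingual dictionary as (source_word, target_word) pairs, e.g. exported from MUSE.
with open("en-de.txt", encoding="utf-8") as f:
    dictionary = [tuple(line.split()[:2]) for line in f if line.strip()]
```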
Algorithm
Figure 1 gives a high-level overview of the method and its different components.
The method proceeds as follows:
Use the tokenizers to split the words in the bilingual dictionary into subwords (tokens).
Use the fastText embeddings to compute subword embeddings as the sum of the embeddings of their n-grams.
$$ \mathbf{u}_x = \sum_{g \in \mathbb{G}(x)} \mathbf{w}_g $$
where $\mathbb{G}(x)$ is the set of n-grams occurring in the subword $x$ and $\mathbf{w}_g$ is the embedding of the n-gram $g$.
The embeddings of subwords in which no known n-gram occurs are initialized to zero.
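A possible implementation of this step with the fasttext Python bindings is sketched below, continuing the loading sketch above; `get_subwords` returns the character n-grams of a string together with their indices, and `get_input_vector` returns the corresponding n-gram vector. The helper and variable names are ours.

```python
import numpy as np

def subword_embedding(token: str, ft_model) -> np.ndarray:
    """u_x: sum of the fastText n-gram vectors occurring in the subword `token`."""
    # In practice, tokenizer-specific prefix characters (e.g. 'Ġ' or '▁') would be stripped first.
    _, ngram_ids = ft_model.get_subwords(token)
    if len(ngram_ids) == 0:
        # No known n-gram: keep a zero vector; it is re-initialized randomly later.
        return np.zeros(ft_model.get_dimension(), dtype=np.float32)
    return np.sum([ft_model.get_input_vector(i) for i in ngram_ids], axis=0)

# Subword embedding matrices for the full source and target tokenizer vocabularies,
# with rows ordered by token id.
src_vocab = sorted(source_tokenizer.get_vocab().items(), key=lambda kv: kv[1])
tgt_vocab = sorted(target_tokenizer.get_vocab().items(), key=lambda kv: kv[1])
U_s = np.stack([subword_embedding(tok, ft_source) for tok, _ in src_vocab])
U_t = np.stack([subword_embedding(tok, ft_target) for tok, _ in tgt_vocab])
```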
Align subword embeddings using the bilingual dictionary and the Orthogonal Procrustes method [Sch66G, Art16L].
$$ \underset{W}{\text{argmin}} \lVert \mathbf{U}^t W - \mathbf{U}^s \rVert_F^2 $$
where $\lVert \cdot \rVert_F$ is the Frobenius norm and $W$ is constrained to be an orthogonal matrix ($W^T W = I$).
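The sketch below shows one way to realize this step with scipy's `orthogonal_procrustes`. Representing each dictionary word by the mean of its subword embeddings is a simplifying assumption of this sketch, not a detail taken from the paper.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def word_vector(word: str, tokenizer, ft_model) -> np.ndarray:
    """Represent a dictionary word by the mean of its subword embeddings (sketch simplification)."""
    return np.mean([subword_embedding(t, ft_model) for t in tokenizer.tokenize(word)], axis=0)

# Paired matrices built from the bilingual dictionary (row i of both matrices is a translation pair).
X_s = np.stack([word_vector(src, source_tokenizer, ft_source) for src, _ in dictionary])
X_t = np.stack([word_vector(tgt, target_tokenizer, ft_target) for _, tgt in dictionary])

# Solve argmin_W ||X_t W - X_s||_F with W^T W = I, then rotate every target subword embedding.
W, _ = orthogonal_procrustes(X_t, X_s)
U_t_aligned = U_t @ W
```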
Compute the cosine similarity $s_{x,y}$ between every subword $x$ in the target language and every subword $y$ in the source language.
$$ s_{x,y} = \frac{ \mathbf{u}_x^t \mathbf{u}_y^{sT} }{ \lVert \mathbf{u}_x^t \rVert \lVert \mathbf{u}_y^s \rVert } $$
Initialize each embedding of the target model as a weighted average of the source model's embeddings, using a softmax over the cosine similarities between aligned subword embeddings as weights.
$$ \mathbf{e}_x^t = \frac{ \sum_{y \in \mathcal{J}_x} \exp(s_{x,y} / \tau ) \cdot \mathbf{e}_y^s }{ \sum_{y^{\prime} \in \mathcal{J}_x} \exp(s_{x,y^{\prime}} / \tau) } $$
where $\mathcal{J}_x$ is the set of the $k$ source-language subwords that are nearest neighbours of $x$ (by cosine similarity) and $\tau$ is a temperature hyperparameter.
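Continuing the sketch, these two steps (the similarity matrix and the temperature-weighted average over the $k$ nearest source subwords) could be implemented as follows. The default values `k=10` and `tau=0.1` are assumptions of this sketch, and `E_s` stands for the source model's input embedding matrix.

```python
import numpy as np

def wechsel_init(U_t_aligned: np.ndarray, U_s: np.ndarray, E_s: np.ndarray,
                 k: int = 10, tau: float = 0.1) -> np.ndarray:
    """Initialize target embeddings E_t as softmax-weighted averages of source embeddings.

    U_t_aligned: aligned target subword embeddings, shape (|V_t|, d_ft)
    U_s:         source subword embeddings,         shape (|V_s|, d_ft)
    E_s:         source model input embeddings,     shape (|V_s|, d_model)
    """
    # Cosine similarity s_{x,y} between every target subword x and every source subword y.
    Tn = U_t_aligned / (np.linalg.norm(U_t_aligned, axis=1, keepdims=True) + 1e-8)
    Sn = U_s / (np.linalg.norm(U_s, axis=1, keepdims=True) + 1e-8)
    S = Tn @ Sn.T

    E_t = np.zeros((U_t_aligned.shape[0], E_s.shape[1]), dtype=E_s.dtype)
    for x in range(S.shape[0]):
        neighbours = np.argpartition(-S[x], k)[:k]   # J_x: the k most similar source subwords
        w = np.exp(S[x, neighbours] / tau)           # temperature-scaled softmax weights
        E_t[x] = w @ E_s[neighbours] / w.sum()
    return E_t
```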
Subword embeddings that were set to zero (because no known n-gram occurred in them) are instead initialized from a normal distribution $\mathcal{N}(\mathbb{E}[\mathbf{E}^s], \mathrm{Var}[\mathbf{E}^s])$, where $\mathbf{E}^s$ is the source model's embedding matrix.
Copy the non-embedding parameters of the source model to the target model.
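Putting the last steps together, a rough PyTorch/transformers sketch might look as follows. Treating every parameter whose name contains "embed" as an embedding parameter, and drawing the random re-initialization per dimension, are simplifications of this sketch rather than details from the paper.

```python
import numpy as np
import torch
from transformers import AutoConfig, AutoModelForMaskedLM

# Fresh target model: same architecture as the source, but with the target vocabulary size.
config = AutoConfig.from_pretrained("roberta-base", vocab_size=len(target_tokenizer))
target_model = AutoModelForMaskedLM.from_config(config)

# WECHSEL initialization of the target input embeddings.
E_s = source_model.get_input_embeddings().weight.detach().cpu().numpy()
E_t = wechsel_init(U_t_aligned, U_s, E_s)

# Subwords whose fastText embedding stayed all-zero: draw from N(E[E^s], Var[E^s]) (per dimension here).
zero_rows = ~U_t_aligned.any(axis=1)
E_t[zero_rows] = np.random.normal(E_s.mean(axis=0), E_s.std(axis=0),
                                  size=(int(zero_rows.sum()), E_s.shape[1]))

# Copy all non-embedding parameters of the source model into the target model.
target_state = target_model.state_dict()
for name, tensor in source_model.state_dict().items():
    if "embed" not in name and name in target_state and target_state[name].shape == tensor.shape:
        target_state[name] = tensor.clone()
target_model.load_state_dict(target_state)

# Finally, write the WECHSEL-initialized embedding matrix into the target model.
with torch.no_grad():
    target_model.get_input_embeddings().weight.copy_(torch.as_tensor(E_t))
```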
Experiments
For their experiments, the authors used RoBERTa and GPT-2 models trained on English, transferring them to four medium-resource languages: French, German, Chinese, and Swahili, as well as four low-resource languages: Sundanese, Scottish Gaelic, Uyghur, and Malagasy.
They employed automatically generated bilingual dictionaries from MUSE [Con18W] for French, German, and Chinese, and a bilingual dictionary from FreeDict for Swahili. For the low-resource languages, they used bilingual dictionaries scraped from Wiktionary and stored in their repository.
The authors compared their method to two other approaches:
- FullRand: randomly initializes the target model and trains it from scratch.
- TransInner: randomly initializes the embedding parameters and copies the non-embedding parameters; it trains only the embedding parameters for a fixed number of steps while freezing the remaining parameters, then trains the entire model.
For all methods and models, they trained on 65.5 billion tokens, significantly fewer than the baselines they compare against (e.g., CamemBERT with 419.4 billion tokens, GBERT-Base with 255.6 billion, and BERT-Base-Chinese with 131.1 billion). All models were trained for 250k steps with consistent hyperparameters across languages.
WECHSEL-RoBERTa was evaluated by fine-tuning on XNLI for NLI performance and on the balanced train-dev-test split of WikiANN for NER performance. WECHSEL-GPT-2 was evaluated by Perplexity (PPL) on a held-out set from the training corpus.
Results
From Figure 2, we observe that WECHSEL significantly improves cross-lingual parameter transfer and outperforms comparably sized models trained from scratch, requiring up to 64 times less training effort.
As shown in Table 1 and Table 2, models initialized with WECHSEL generally outperform models trained from scratch and those initialized with TransInner across all languages examined.
Notably, the close relatedness of the source and target languages is not a prerequisite for effective transfer. For instance, in NLI tasks, WECHSEL improves absolute accuracy by 7.15%, 6.31%, 6.94%, and 4.71% over models trained from scratch for French, German, Chinese, and Swahili, respectively.
Comments
Despite its increased efficiency compared to training from scratch, the WECHSEL method has several inherent weaknesses, including its complexity and its strong reliance on bilingual dictionaries for aligning the embedding vectors.
This reliance becomes a real problem when the dictionaries contain mistakes, since their quality directly impacts the alignment and, consequently, the initialization of the model. In the experiments, the authors used automatically extracted dictionaries for three of the languages, which introduced omissions and errors. For instance, in the French-English dictionary, several mistakes can be identified:
- Incorrect word mappings: some entries are mapped to the wrong word. For example, on line 80, feasible is mapped to viable, whereas it should be mapped to faisable; on line 48641, friend is mapped to padna, which is a word that does not exist in French, according to this online dictionary.
- Inconsistent order: the order of the languages is inconsistent. In most lines, English precedes French (e.g., line 94), while in some, French precedes English (e.g., line 81).
These inaccuracies in the dictionaries can significantly affect the performance and reliability of the resulting language models.
Additionally, a crucial gap in the research is the lack of investigation into the method's effect on social biases within the models. Although the authors acknowledge the potential risk of such biases, they neither explore their extent nor propose ways to mitigate them. Addressing these concerns is essential for improving the reliability and ethical soundness of using WECHSEL to transfer language models.