In [Min22W], we explored WECHSEL, a cross-lingual transfer method leveraging fastText embeddings and bilingual dictionaries to initialize a new language model in a target language from trained models in one or more source languages.
In this post, we examine Cross-Lingual and Progressive Transfer (CLP-Transfer), introduced by [Ost23E]. This method uses a smaller language model in the target language with the desired tokenizer for cross-lingual transfer.
Key Assumptions of CLP-Transfer
1. Vocabulary Overlap
Given source and target tokenizers with vocabularies $V_s$ and $V_t$, respectively, CLP-Transfer assumes a significant token overlap, i.e., $|V_s \cap V_t| \gg 0$.
Table 1 shows the results of an empirical experiment demonstrating that this assumption holds across various tokenizers and languages.
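The size of this overlap is easy to check directly. Below is a minimal sketch using Hugging Face tokenizers; the checkpoint names are illustrative and can be swapped for any source/target pair.

```python
# Minimal sketch: measure the token overlap |V_s ∩ V_t| between two tokenizers.
# The checkpoint names are illustrative; any source/target pair works the same way.
from transformers import AutoTokenizer

source_tok = AutoTokenizer.from_pretrained("gpt2-xl")                       # English source
target_tok = AutoTokenizer.from_pretrained("benjamin/gpt2-wechsel-german")  # German target

source_vocab = set(source_tok.get_vocab())
target_vocab = set(target_tok.get_vocab())
overlap = source_vocab & target_vocab

print(f"|V_s| = {len(source_vocab)}, |V_t| = {len(target_vocab)}")
print(f"|V_s ∩ V_t| = {len(overlap)} ({len(overlap) / len(target_vocab):.1%} of V_t)")
```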
Overlapping token embeddings from the source model are copied to the target model¹:
$$ \mathbf{u}^{\text{(large)}}_t := \mathbf{u}^{\text{(large)}}_s, \quad \text{if} \quad v \in V_s \cap V_t $$
These overlapping tokens serve as anchors for computing embeddings of non-overlapping tokens.
2. Comparable Token Embeddings
Token embeddings are assumed to be comparable across models of different sizes but with the same tokenizer, i.e., they can be substituted for one another in computations.
Figure 1 shows the results of an experiment comparing token embeddings across various sizes of English OPT models. For each token $v$, the set of its $k = 10$ nearest neighbors $N_v$ was computed in each model, and the overlap between model sizes, $N_v^{(large)} \cap N_v^{(small)}$, was measured. This overlap is normalized and computed over all available tokens.
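This experiment can be reproduced in a few lines. The sketch below compares, for a random sample of tokens, the nearest neighbors of the input embeddings of two OPT checkpoints that share a tokenizer; the checkpoint names and sample size are illustrative choices, not the exact setup of the paper.

```python
# Minimal sketch of the neighborhood-overlap experiment: for a sample of tokens,
# compare the k nearest neighbors (by cosine similarity) of each token embedding
# in a small and a large model that share the same tokenizer.
import torch
from transformers import AutoModelForCausalLM

def knn_indices(emb: torch.Tensor, queries: torch.Tensor, k: int) -> torch.Tensor:
    """k nearest neighbors (cosine) of the query tokens over the full vocabulary."""
    normed = torch.nn.functional.normalize(emb, dim=-1)
    sims = normed[queries] @ normed.T                          # (num_queries, |V|)
    sims[torch.arange(len(queries)), queries] = float("-inf")  # exclude the token itself
    return sims.topk(k, dim=-1).indices

k, n_samples = 10, 1000
small = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").get_input_embeddings().weight.detach()
large = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").get_input_embeddings().weight.detach()

queries = torch.randperm(small.shape[0])[:n_samples]
nn_small = knn_indices(small, queries, k)
nn_large = knn_indices(large, queries, k)

# Normalized overlap |N_v^(large) ∩ N_v^(small)| / k, averaged over the sampled tokens.
overlaps = [
    len(set(s.tolist()) & set(l.tolist())) / k
    for s, l in zip(nn_small, nn_large)
]
print(f"mean neighborhood overlap: {sum(overlaps) / len(overlaps):.2f}")
```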
This finding justifies using a smaller helper model in the target language as a proxy for the larger target model: cosine similarities between the helper model's token embeddings stand in for the (not yet available) similarities in the large target model. The embedding of each non-overlapping token is then initialized as a weighted average of the source model's embeddings of the overlapping tokens:
$$ \mathbf{u}^{\text{(large)}}_t := \underset{\widehat{v} \in V_s \cap V_t}{\sum} \delta\!\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\text{(small)}}_t\right) \widehat{\mathbf{u}}^{\text{(large)}}_s, \quad \text{if} \quad v \notin V_s \cap V_t $$
where the weight function $\delta$ transfers the spatial structure of the small model's embedding space to the large model. It is the cosine similarity between the two embeddings in the small model, normalized over all overlapping tokens:
$$ \delta\!\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\text{(small)}}_t\right) := \frac{ \cos\!\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\text{(small)}}_t\right) }{ \underset{\widehat{v}^{\prime} \in V_s \cap V_t}{\sum} \cos\!\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\prime\text{(small)}}_t\right) } $$
The remaining model parameters are copied over as in WECHSEL [Min22W].
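Putting the two steps together, the embedding initialization can be sketched as follows. This is a minimal re-implementation under the notation above, not the authors' released code: the embedding matrices and vocabulary mappings are assumed to be given, and the remaining transformer weights would simply be copied from the source model.

```python
# Minimal sketch of the CLP-Transfer embedding initialization (not the authors' code).
# source_emb:  embedding matrix of the large source-language model   (|V_s|, d)
# helper_emb:  embedding matrix of the small target-language model   (|V_t|, d_small)
# source_vocab / target_vocab: token -> id mappings for V_s and V_t
import torch

def clp_transfer_embeddings(
    source_emb: torch.Tensor,
    helper_emb: torch.Tensor,
    source_vocab: dict[str, int],
    target_vocab: dict[str, int],
) -> torch.Tensor:
    target_emb = torch.empty(len(target_vocab), source_emb.shape[1])

    overlap = [tok for tok in target_vocab if tok in source_vocab]
    missing = [tok for tok in target_vocab if tok not in source_vocab]
    overlap_t = torch.tensor([target_vocab[t] for t in overlap])
    overlap_s = torch.tensor([source_vocab[t] for t in overlap])
    missing_t = torch.tensor([target_vocab[t] for t in missing])

    # 1. Overlapping tokens: copy the source embeddings directly (the anchors).
    target_emb[overlap_t] = source_emb[overlap_s]

    # 2. Non-overlapping tokens: δ-weighted average of the anchors' source embeddings,
    #    with cosine similarities taken from the small helper model.
    helper_normed = torch.nn.functional.normalize(helper_emb, dim=-1)
    sims = helper_normed[missing_t] @ helper_normed[overlap_t].T   # (|missing|, |overlap|)
    weights = sims / sims.sum(dim=-1, keepdim=True)                # normalize per token
    target_emb[missing_t] = weights @ source_emb[overlap_s]

    return target_emb
```

For large vocabularies the similarity matrix can get big, so in practice the second step would be computed in chunks of non-overlapping tokens.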
Experiments
Models
For their experiments, the authors used the following models:
- GPT-2:
- Source: English GPT-2-XL model with 1.5B parameters.
- Helper (small): GPT-2-base model with 117M parameters initialized with WECHSEL and then further trained.
- Target: GPT-2 models with parameter sizes from 117M to 1.5B.
- BLOOM:
- Source: Multilingual² BLOOM model with 7.1B parameters.
- Helper (small): German BLOOM model with 1.5B parameters.
- Target: BLOOM models with parameter sizes from 1.5B to 6.4B.
Datasets
- GPT-2
- Data: Web-crawled data from the German subset of OSCAR v2019, similar to WECHSEL [Min22W].
- Training Set: First 4GB of the data, approximately 30.8B tokens.
- Validation Set: Next 0.4GB of the data.
- BLOOM
- Data: Web-crawled content from the German subset of OSCAR v22.01 (excluding headers, footers, noisy, and adult content) and the GC4 Corpus (including only head and middle parts).
- Deduplication: Removed duplicated content from CommonCrawl.
- Additional Data: German court decisions from Open Legal Data.
- Training Set: Approximately 50.4B tokens.
Evaluation
The trained models were evaluated by:
- Perplexity on the GPT-2 validation dataset (a minimal sketch of this evaluation follows the list below).
- Zero-shot performance on German downstream tasks:
- Sentiment analysis from GermEval 2017.
- Hate speech classification from GermEval 2018.
- News topic classification from GNAD10.
- Paraphrase identification from PAWSX.
- Natural language inference from XNLI.
- Stance detection from X-Stance.
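As an illustration of the perplexity evaluation, the sketch below scores a held-out text file with a causal language model; the checkpoint name and file path are placeholders, and the chunking into fixed-size blocks is a simplification rather than the authors' exact evaluation code.

```python
# Minimal sketch: perplexity of a causal LM on a held-out plain-text file.
# "gpt2" and "validation.txt" are placeholders for the transferred checkpoint
# and the validation split described above.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = open("validation.txt", encoding="utf-8").read()
ids = tokenizer(text, return_tensors="pt").input_ids[0]

block_size = 1024  # GPT-2 context length
nll_sum, n_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, ids.numel(), block_size):
        chunk = ids[start : start + block_size].unsqueeze(0)
        if chunk.shape[1] < 2:
            continue
        # labels are shifted inside the model; .loss is the mean NLL per predicted token
        loss = model(chunk, labels=chunk).loss
        nll_sum += loss.item() * (chunk.shape[1] - 1)
        n_tokens += chunk.shape[1] - 1

print(f"perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```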
Methods
The authors compared their method against three others:
- From-Scratch Training (referred to as FullRand in WECHSEL): randomly initializes the target model and trains it from scratch.
- WECHSEL
- Random: chooses a class uniformly at random for the downstream classification tasks, without using a model.
Additionally, the authors compared their monolingual German models against multilingual models trained on German data, namely XGLM and mGPT.
Results
Perplexity Evaluation
As shown in Table 2, all models initialized with CLP-Transfer achieve the best evaluation perplexity on the OSCAR dataset. This is further illustrated in Figure 2 and Figure 3.
GPT-2-XL: Figure 2 shows that the GPT-2-XL model initialized with CLP-Transfer achieves the same perplexity as the from-scratch training after training on only ~50% of the total number of tokens (dashed line). The difference between CLP-Transfer and WECHSEL is noticeable, favoring CLP-Transfer.
BLOOM: Figure 3 shows that the BLOOM-6B-German model initialized with CLP-Transfer achieves the same perplexity as from-scratch training after training on only ~20% of the total number of tokens (dashed line).
Downstream Task Performance
The results on downstream tasks are generally disappointing, as noted by the authors. The models initialized with CLP-Transfer do not perform significantly better than the random baseline and on some tasks even perform worse.
The GPT-2-XL-CLP model does not achieve the best results on any specific dataset and performs significantly worse on the hate speech classification task from GermEval 2018.
The seemingly strange discrepancy between the good perplexity score on the validation set and the poor results on the downstream tasks can be explained by several factors:
- Dataset Splits: Both the training and validation sets come from the same dataset (OSCAR) with splits based on the order of the data rather than content. This can lead to better perplexity scores that do not necessarily translate to improved performance on downstream tasks.
- Perplexity as a Proxy: Perplexity is merely a proxy measure and does not necessarily correlate with performance on specific tasks.
Additionally, the authors mention two other points that contribute to the poor performance on downstream tasks:
- Lack of Fine-Tuning and Prompt Engineering: The models were neither fine-tuned nor subjected to prompt engineering, which is crucial for achieving good performance, especially given the model sizes and number of training tokens used.
- Dataset Quality: The quality of the datasets used for validation is variable. Some datasets, such as PAWSX, contain poorly translated samples, leading to less meaningful results.
Comments
Although CLP-Transfer does not necessarily achieve better results than WECHSEL on downstream tasks, it is significantly simpler: it does not require training fastText embeddings or aligning them with a bilingual dictionary, which makes it more straightforward to implement.
CLP-Transfer relies on token overlap and the existence of a small pre-trained model using the desired tokenizer. This requirement can be a limitation in cases where:
- A custom tokenizer is used, which does not have pre-trained models available.
- The desired tokenizer has minimal overlap with existing ones (e.g., between languages that use different alphabets like English and Arabic).
The authors evaluated their method only on decoder-only language models. It would be more comprehensive to include encoder-only and encoder-decoder language models in future evaluations to better understand the method’s versatility.
All tokenizers in the study were trained using Byte-Pair Encoding (BPE). It would be beneficial to investigate the effectiveness of CLP-Transfer with tokenizers trained using other methods such as Unigram and WordPiece.
The authors have made the pre-trained model checkpoints and source code publicly available on the HuggingFace Hub (BLOOM-CLP 6.4B, GPT2-XL-CLP 1.5B) and GitHub, respectively. This facilitates reproducibility and further research by the community.
A web-based demo for the German BLOOM-CLP model with 6.4B parameters was initially provided by the authors. However, it is no longer available at the time of this publication.
¹ Unlike in the paper, we denote the embedding vector by $\mathbf{u}$ instead of $\mathbf{v}$ to avoid any possible confusion with the token, which is denoted by $v$.
² Trained on Arabic, Basque, Bengali, Chinese, Catalan, English, French, Hindi, Indonesian, Portuguese, Spanish, Urdu, and Vietnamese.