In [Min22W], we explored WECHSEL, a cross-lingual transfer method leveraging fastText embeddings and bilingual dictionaries to initialize a new language model in a target language from trained models in one or more source languages.
In this post, we examine Cross-Lingual and Progressive Transfer (CLP-Transfer), introduced by [Ost23E]. This method uses a smaller language model in the target language with the desired tokenizer for cross-lingual transfer.
Key Assumptions of CLP-Transfer
1. Vocabulary Overlap
Given source and target tokenizers with vocabularies $V_s$ and $V_t$, respectively, CLP-Transfer assumes a significant token overlap, i.e., $|V_s \cap V_t| \gg 0$.
Table 1 shows the results of an empirical experiment demonstrating that this assumption holds across various tokenizers and languages.
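The size of this overlap is easy to check directly. Below is a minimal sketch using Hugging Face tokenizers; the checkpoint names are illustrative and can be swapped for any source/target pair.

```python
# Minimal sketch: measure the token overlap |V_s ∩ V_t| between two tokenizers.
# The checkpoint names are illustrative; any source/target pair works the same way.
from transformers import AutoTokenizer

source_tok = AutoTokenizer.from_pretrained("gpt2-xl")                       # English source
target_tok = AutoTokenizer.from_pretrained("benjamin/gpt2-wechsel-german")  # German target

source_vocab = set(source_tok.get_vocab())
target_vocab = set(target_tok.get_vocab())
overlap = source_vocab & target_vocab

print(f"|V_s| = {len(source_vocab)}, |V_t| = {len(target_vocab)}")
print(f"|V_s ∩ V_t| = {len(overlap)} ({len(overlap) / len(target_vocab):.1%} of V_t)")
```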
Overlapping token embeddings from the source model are copied to the target model¹:
$$ \mathbf{u}^{\text{(large)}}_t := \mathbf{u}^{\text{(large)}}_s, \quad \text{if} \quad v \in V_s \cap V_t $$
These overlapping tokens serve as anchors for computing embeddings of non-overlapping tokens.
2. Comparable Token Embeddings
Token embeddings are assumed to be comparable across models of different sizes but with the same tokenizer, i.e., they can be substituted for one another in computations.
Figure 1 shows the results of an experiment comparing token embeddings across various sizes of English OPT models. For each token $v$, the set of its $k = 10$ nearest neighbors $N_v$ was computed in each model, and the overlap between model sizes, $N_v^{(large)} \cap N_v^{(small)}$, was measured. This overlap is normalized and computed over all available tokens.
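This experiment can be reproduced in a few lines. The sketch below compares, for a random sample of tokens, the nearest neighbors of the input embeddings of two OPT checkpoints that share a tokenizer; the checkpoint names and sample size are illustrative choices, not the exact setup of the paper.

```python
# Minimal sketch of the neighborhood-overlap experiment: for a sample of tokens,
# compare the k nearest neighbors (by cosine similarity) of each token embedding
# in a small and a large model that share the same tokenizer.
import torch
from transformers import AutoModelForCausalLM

def knn_indices(emb: torch.Tensor, queries: torch.Tensor, k: int) -> torch.Tensor:
    """k nearest neighbors (cosine) of the query tokens over the full vocabulary."""
    normed = torch.nn.functional.normalize(emb, dim=-1)
    sims = normed[queries] @ normed.T                          # (num_queries, |V|)
    sims[torch.arange(len(queries)), queries] = float("-inf")  # exclude the token itself
    return sims.topk(k, dim=-1).indices

k, n_samples = 10, 1000
small = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").get_input_embeddings().weight.detach()
large = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").get_input_embeddings().weight.detach()

queries = torch.randperm(small.shape[0])[:n_samples]
nn_small = knn_indices(small, queries, k)
nn_large = knn_indices(large, queries, k)

# Normalized overlap |N_v^(large) ∩ N_v^(small)| / k, averaged over the sampled tokens.
overlaps = [
    len(set(s.tolist()) & set(l.tolist())) / k
    for s, l in zip(nn_small, nn_large)
]
print(f"mean neighborhood overlap: {sum(overlaps) / len(overlaps):.2f}")
```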
This finding justifies using a smaller helper model in the target language as a proxy for the larger target model: cosine similarities between the helper model's token embeddings stand in for the (not yet available) similarities in the large target model. The embedding of each non-overlapping token is then initialized as a weighted average of the source model's embeddings of the overlapping tokens:
$$ \mathbf{u}^{\text{(large)}}_t := \underset{\widehat{v} \in V_s \cap V_t}{\sum} \delta\!\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\text{(small)}}_t\right) \widehat{\mathbf{u}}^{\text{(large)}}_s, \quad \text{if} \quad v \notin V_s \cap V_t $$
where the weight function $\delta$ transfers the spatial structure of the small model's embedding space to the large model. It is the cosine similarity between the two embeddings in the small model, normalized over all overlapping tokens:
$$ \delta\!\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\text{(small)}}_t\right) := \frac{ \cos\!\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\text{(small)}}_t\right) }{ \underset{\widehat{v}^{\prime} \in V_s \cap V_t}{\sum} \cos\!\left(\mathbf{u}^{\text{(small)}}_t, \widehat{\mathbf{u}}^{\prime\text{(small)}}_t\right) } $$
The remaining model parameters are copied over as in WECHSEL [Min22W].
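Putting the two steps together, the embedding initialization can be sketched as follows. This is a minimal re-implementation under the notation above, not the authors' released code: the embedding matrices and vocabulary mappings are assumed to be given, and the remaining transformer weights would simply be copied from the source model.

```python
# Minimal sketch of the CLP-Transfer embedding initialization (not the authors' code).
# source_emb:  embedding matrix of the large source-language model   (|V_s|, d)
# helper_emb:  embedding matrix of the small target-language model   (|V_t|, d_small)
# source_vocab / target_vocab: token -> id mappings for V_s and V_t
import torch

def clp_transfer_embeddings(
    source_emb: torch.Tensor,
    helper_emb: torch.Tensor,
    source_vocab: dict[str, int],
    target_vocab: dict[str, int],
) -> torch.Tensor:
    target_emb = torch.empty(len(target_vocab), source_emb.shape[1])

    overlap = [tok for tok in target_vocab if tok in source_vocab]
    missing = [tok for tok in target_vocab if tok not in source_vocab]
    overlap_t = torch.tensor([target_vocab[t] for t in overlap])
    overlap_s = torch.tensor([source_vocab[t] for t in overlap])
    missing_t = torch.tensor([target_vocab[t] for t in missing])

    # 1. Overlapping tokens: copy the source embeddings directly (the anchors).
    target_emb[overlap_t] = source_emb[overlap_s]

    # 2. Non-overlapping tokens: δ-weighted average of the anchors' source embeddings,
    #    with cosine similarities taken from the small helper model.
    helper_normed = torch.nn.functional.normalize(helper_emb, dim=-1)
    sims = helper_normed[missing_t] @ helper_normed[overlap_t].T   # (|missing|, |overlap|)
    weights = sims / sims.sum(dim=-1, keepdim=True)                # normalize per token
    target_emb[missing_t] = weights @ source_emb[overlap_s]

    return target_emb
```

For large vocabularies the similarity matrix can get big, so in practice the second step would be computed in chunks of non-overlapping tokens.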
Experiments
Models
For their experiments, the authors used the following models:
- GPT-2:
- Source: English GPT-2-XL model with 1.5B parameters.
- Helper (small): GPT-2-base model with 117M parameters initialized with WECHSEL and then further trained.
- Target: GPT-2 models with parameter sizes from 117M to 1.5B.
- BLOOM:
- Source: Multilingual² BLOOM model with 7.1B parameters.
- Helper (small): German BLOOM model with 1.5B parameters.
- Target: BLOOM models with parameter sizes from 1.5B to 6.4B.
Datasets
- GPT-2
- Data: Web-crawled data from the German subset of OSCAR v2019, similar to WECHSEL [Min22W].
- Training Set: First 4GB of the data, approximately 30.8B tokens.
- Validation Set: Next 0.4GB of the data.
- BLOOM
- Data: Web-crawled content from the German subset of OSCAR v22.01 (excluding headers, footers, noisy, and adult content) and the GC4 Corpus (including only head and middle parts).
- Deduplication: Removed duplicated content from CommonCrawl.
- Additional Data: German court decisions from Open Legal Data.
- Training Set: Approximately 50.4B tokens.
Evaluation
The trained models were evaluated by:
- Perplexity on the GPT-2 validation dataset (a minimal sketch of this evaluation follows the list below).
- Zero-shot performance on German downstream tasks:
- Sentiment analysis from GermEval 2017.
- Hate speech classification from GermEval 2018.
- News topic classification from GNAD10.
- Paraphrase identification from PAWSX.
- Natural language inference from XNLI.
- Stance detection from X-Stance.
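As an illustration of the perplexity evaluation, the sketch below scores a held-out text file with a causal language model; the checkpoint name and file path are placeholders, and the chunking into fixed-size blocks is a simplification rather than the authors' exact evaluation code.

```python
# Minimal sketch: perplexity of a causal LM on a held-out plain-text file.
# "gpt2" and "validation.txt" are placeholders for the transferred checkpoint
# and the validation split described above.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = open("validation.txt", encoding="utf-8").read()
ids = tokenizer(text, return_tensors="pt").input_ids[0]

block_size = 1024  # GPT-2 context length
nll_sum, n_tokens = 0.0, 0
with torch.no_grad():
    for start in range(0, ids.numel(), block_size):
        chunk = ids[start : start + block_size].unsqueeze(0)
        if chunk.shape[1] < 2:
            continue
        # labels are shifted inside the model; .loss is the mean NLL per predicted token
        loss = model(chunk, labels=chunk).loss
        nll_sum += loss.item() * (chunk.shape[1] - 1)
        n_tokens += chunk.shape[1] - 1

print(f"perplexity: {math.exp(nll_sum / n_tokens):.2f}")
```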
Methods
The authors compared their method against three others:
- From-Scratch Training (referred to as FullRand in WECHSEL): randomly initializes the target model and trains it from scratch.
- WECHSEL
- Random: chooses a class uniformly at random for the downstream classification tasks, without using a model.
Additionally, the authors compared their monolingual German models against multilingual models trained on German data, namely XGLM and mGPT.
Results
Perplexity Evaluation
As shown in Table 2, all models initialized with CLP-Transfer achieve the best evaluation perplexity on the OSCAR dataset. This is further illustrated in Figure 2 and Figure 3.
GPT-2-XL: Figure 2 shows that the GPT-2-XL model initialized with CLP-Transfer achieves the same perplexity as the from-scratch training after training on only ~50% of the total number of tokens (dashed line). The difference between CLP-Transfer and WECHSEL is noticeable, favoring CLP-Transfer.
BLOOM: Figure 3 shows that the BLOOM-6B-German model initialized with CLP-Transfer achieves the same perplexity as from-scratch training after training on only ~20% of the total number of tokens (dashed line).
Downstream Task Performance
The results on downstream tasks are generally disappointing, as noted by the authors. The models initialized with CLP-Transfer do not perform significantly better than the random baseline and on some tasks even perform worse.
The GPT-2-XL-CLP model does not achieve the best results on any specific dataset and performs significantly worse on the hate speech classification task from GermEval 2018.
The seemingly strange discrepancy between the good perplexity score on the validation set and the poor results on the downstream tasks can be explained by several factors:
- Dataset Splits: Both the training and validation sets come from the same dataset (OSCAR) with splits based on the order of the data rather than content. This can lead to better perplexity scores that do not necessarily translate to improved performance on downstream tasks.
- Perplexity as a Proxy: Perplexity is merely a proxy measure and does not necessarily correlate with performance on specific tasks.
Additionally, the authors mention two other points that contribute to the poor performance on downstream tasks:
- Lack of Fine-Tuning and Prompt Engineering: The models were neither fine-tuned nor subjected to prompt engineering, which is crucial for achieving good performance, especially given the model sizes and number of training tokens used.
- Dataset Quality: The quality of the datasets used for validation is variable. Some datasets, such as PAWSX, contain poorly translated samples, leading to less meaningful results.
Comments
Although CLP-Transfer does not necessarily achieve better results than WECHSEL on downstream tasks, it is significantly simpler: it does not require training fastText embeddings or aligning them with a bilingual dictionary, which makes it more straightforward to implement.
CLP-Transfer relies on token overlap and the existence of a small pre-trained model using the desired tokenizer. This requirement can be a limitation in cases where:
- A custom tokenizer is used, which does not have pre-trained models available.
- The desired tokenizer has minimal overlap with existing ones (e.g., between languages that use different alphabets like English and Arabic).
The authors evaluated their method only on decoder-only language models. It would be more comprehensive to include encoder-only and encoder-decoder language models in future evaluations to better understand the method’s versatility.
All tokenizers in the study were trained using Byte-Pair Encoding (BPE). It would be beneficial to investigate the effectiveness of CLP-Transfer with tokenizers trained using other methods such as Unigram and WordPiece.
The authors have made the pre-trained model checkpoints and source code publicly available on the HuggingFace Hub (BLOOM-CLP 6.4B, GPT2-XL-CLP 1.5B) and GitHub, respectively. This facilitates reproducibility and further research by the community.
A web-based demo for the German BLOOM-CLP model with 6.4B parameters was initially provided by the authors. However, it is no longer available at the time of this publication.
¹ Unlike in the paper, we denote the embedding vector by $\mathbf{u}$ instead of $\mathbf{v}$ to avoid any possible confusion with the token, which is denoted by $v$.
² Trained on Arabic, Basque, Bengali, Chinese, Catalan, English, French, Hindi, Indonesian, Portuguese, Spanish, Urdu, and Vietnamese.