
How Tokenizer Choices Shape Hidden Risks in Popular Language Models

Abstract and 1. Introduction

  2. Methods

    2.1 Tokenizer analysis

    2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens

  3. Results

    3.1 Effectiveness of indicators and verification

    3.2 Common observations

    3.3 Model-specific observations

  4. Closed-source models

  5. Discussion, Acknowledgments, and References

A. Verification details

B. A short primer on UTF-8 encoding

C. Outputs for API-based verification

3.3 Model-specific observations

In this section we outline some model-specific observations, grouped by the tokenizer used. These examples are mainly intended to illustrate the variety of under-trained tokens and configuration issues that can be found using our methods, and are not exhaustive.

3.3.1 Models based on the GPT-2 tokenizer

(a) Token indicators for Rakuten 7B, based on non-tied embeddings. The Rakuten model clearly shows a bi-modal distribution, with the new Japanese tokens appearing as a different peak closer to zero. All three indicators are suitable here and detect similar under-trained candidate tokens.

(b) Token indicators for Gemma 7B, based on tied embeddings. Note the clearer separation caused by removing the first principal component, and the high correlation between the resulting metric and the indicator based on embedding norms at lower values, showing both are effective predictors of under-trained tokens.

Figure 3: Comparison of (un)embedding-based indicators. The scatter plots are coloured by token id, from light green to dark blue.

GPT-2 introduced the framework for much of current-day tokenization and training of LLMs [1], and its tokenizer has been re-used extensively. We confirm previous findings, with a significant number of tokens related to (fragments of) usernames (e.g. _TheNitrome, _RandomRedditor, StreamerBot). Although the model is aimed at English text, there are a few under-trained non-English tokens, including the Japanese token _サーティ[4]. In addition to the 13 bytes unused in UTF-8, we detect that all ASCII characters in the 0-31 range, except for the newline character, appear untrained. Although removing or replacing many of these 'control' characters is a reasonable normalization step, the tokenizer as published does not perform such normalization. Most notably, this means that the horizontal tab character \t as well as the carriage return \r are out of distribution for the models.
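As a concrete illustration (a minimal sketch, not the paper's verification code), the published tokenizer can be loaded with the Hugging Face transformers library and probed with control characters directly:

    # Minimal sketch: inspect how the published GPT-2 tokenizer handles control characters.
    # Assumes the `transformers` package and the public "gpt2" checkpoint.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")

    for text in ["\t", "\r", "\n"]:
        ids = tok.encode(text)
        print(repr(text), ids, tok.convert_ids_to_tokens(ids))

    # The tab and carriage return are not stripped or normalized; each maps to its own
    # byte-level token, which (unlike the newline token) the indicators above suggest
    # was rarely or never seen during model training.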

We also evaluated a few models that base their tokenizer on GPT-2, including Phi-2 and GPT-J 6B [15, 16] [5]. These models share many of the same under-trained tokens, and have significantly more confirmed tokens than GPT-2, likely because their training data is further removed from the data used to train the tokenizer. These additional tokens include _SolidGoldMagikarp, which is not among the verified candidates for GPT-2 itself.

3.3.2 Models based on the GPT-NeoX tokenizer

GPT-NeoX is an open-source library and associated family of models which use a tokenizer with the same vocabulary size as GPT-2, but trained on 'The Pile', the same dataset used for model training, and with added tokens for multiple spaces [14]. The GPT-NeoX 20B model has very few under-trained tokens, likely in part due to this alignment between tokenizer and model training, with the fragment FFIRMED showing up most consistently. The Pythia 6.9B model, based on the same library [17], also shows very similar results.

The OLMo open language models [13] also use this tokenizer, but have a much higher rate of under-trained tokens, including a wide range of punctuation-based tokens. We detect over 200 unreachable tokens in the tokenizer, representing combinations of spaces and line breaks, which appear to be caused by the aforementioned 'multiple spaces' tokens taking precedence. However, many of them appear to have been seen in training, based on both our indicators and the token counts published for the training data[6].
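Unreachable tokens of this kind can be flagged with a simple round-trip check of the kind described in Section 2.1: decode each token id to a string, re-encode that string, and flag ids that never reappear. The sketch below is a minimal illustration, shown with the GPT-NeoX tokenizer as a stand-in; note that byte-fallback and partial-UTF-8 tokens also fail this naive check and need separate handling.

    # Minimal sketch of a round-trip reachability check (cf. Section 2.1).
    # "EleutherAI/gpt-neox-20b" is used here only as an example repository.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

    unreachable = []
    for token_id in range(len(tok)):
        text = tok.decode([token_id])
        if token_id not in tok.encode(text, add_special_tokens=False):
            # Re-encoding this token's own string never reproduces the id,
            # e.g. because a 'multiple spaces' token takes precedence.
            unreachable.append(token_id)

    print(f"{len(unreachable)} unreachable token ids")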

In addition, we noticed that embedding-based indicators are not near zero for untrained tokens in the NeoX, Pythia, and OLMo v1 models. For the NeoX/Pythia models, this was explained by a specific implementation of weight decay in which only weights used in the forward pass are affected. The OLMo v1 model instead applies no weight decay. For the NeoX and Pythia models, we find that having low but non-zero embedding norms is still a good predictor of under-trained tokens, and keep the default choice of indicator. For the OLMo v1 model, we find a large number of tokens near the minimum of approximately 1, and use the more discriminative metric based on unembeddings. The OLMo v1.7 model does apply weight decay to embeddings, and its embedding norms are near zero for untrained tokens (cf. Figure 2), but we maintain the same choice for a more consistent comparison between the two versions.
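Both families of indicators can be computed directly from model weights. The following is an illustrative sketch (not the paper's implementation): an embedding-norm indicator, plus removal of the first principal component, which matters most for models with tied embeddings (cf. Figure 3b). Variable names and the commented access path are assumptions.

    import numpy as np

    def embedding_norm_indicator(E: np.ndarray) -> np.ndarray:
        """L2 norm of each (un)embedding row; unusually small norms suggest a
        token's weights were rarely (or never) updated during training."""
        return np.linalg.norm(E, axis=1)

    def remove_first_principal_component(E: np.ndarray) -> np.ndarray:
        """Project out the first principal component, which can dominate the rows
        of tied (un)embedding matrices. A truncated SVD would be cheaper for
        very large vocabularies; full SVD keeps the sketch simple."""
        E_centered = E - E.mean(axis=0, keepdims=True)
        _, _, vt = np.linalg.svd(E_centered, full_matrices=False)
        pc1 = vt[0]  # first principal direction
        return E_centered - np.outer(E_centered @ pc1, pc1)

    # Hypothetical usage: rank tokens by the norm-based indicator after PC removal.
    # E = model.get_output_embeddings().weight.detach().cpu().numpy()
    # scores = embedding_norm_indicator(remove_first_principal_component(E))
    # candidate_ids = np.argsort(scores)[:100]  # lowest values = most suspect tokens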

3.3.3 Llama2 and related models

The Llama2 family of models by Meta uses a SentencePiece-based BPE tokenizer with 32,000 tokens [18]. We detect no unreachable or partial UTF-8 sequence tokens aside from the expected single-byte fallback tokens. The model's relatively few confirmed under-trained tokens typically relate to specific long non-English words, including _Mediabestanden, _Portály, and _Расподела. We also observe a few fragments which are 'occluded' by more complete words, such as ederbörd and nederbörd (_årsnederbörd, _nederbörd), _gepublic (_gepubliceerd), and oreferrer (noreferrer). Several of these tokens were also found in previous work on optimizing prompts for steering model outputs [8].

3.3.4 Mistral and related models

Derived models such as Zephyr beta and variants such as Solar 10.7B use the same tokenizer as Mistral 7B [12, 21], and show a slight increase in the number of under-trained tokens within our threshold, but no change in the most severely under-trained tokens. The model by Rakuten [22] is an extended-vocabulary Japanese model based on the Mistral 7B base model, with continued pre-training. Among the extended vocabulary we find a few under-trained fragments such as 稲田大学 (a fragment of 早稲田大学, 'Waseda University'). Their presence is proportional to the size of the extended vocabulary, and the extended-vocabulary tokens form a distinct cluster when visualising their indicators (see Figure 3a).

3.3.5 Gemma

Finally, the Gemma models stand out due to the high similarity between the rows of their (un)embedding matrix, which makes removal of the first principal component a particularly important step in constructing an indicator for detecting under-trained tokens in these models (see Figure 3b).

3.3.6 Command R and R+

Cohere’s ‘Command R’ and ‘Command R+’ models [23] also have a large multi-lingual vocabulary with over 255,000 tokens. The most notable discovery in these models was that over 1,400 emoji-related tokens are all clearly untrained according to their indicators, and are categorized as ‘unreachable’ because the tokenizer uses a different representation when encoding text that includes them. Among the confirmed under-trained tokens we find the typical fragments of long words, such as ephritidae (Tephritidae) and ishockeyspieler (Eishockeyspieler).

3.3.7 Models using the OpenAI ‘tiktoken’ tokenizer

A number of models have been published that use the OpenAI ‘cl100k’ tokenizer used in GPT-3.5 and GPT-4 and published in the ‘tiktoken’ library [25]. The pattern used to ‘pre-tokenize’ text is unusual in allowing not just a single space character in front of words, but any character that is not a number, letter, or line break. This choice results in tokens such as \tTokenNameIdentifier and $PostalCodesNL, which are highly sensitive to pre-tokenization splitting: a leading space before the token results in a different tokenization such as _$, PostalCodesNL. In combination with their specific content, this has likely made them more severely under-trained across models.
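This sensitivity is easy to observe with the tiktoken library itself. The sketch below simply compares encodings with and without a leading space; the specific ids and splits are whatever the cl100k_base vocabulary assigns, so it is illustrative rather than definitive.

    # Minimal sketch: how a leading space changes cl100k tokenization.
    # Requires the `tiktoken` package.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    for text in ["$PostalCodesNL", " $PostalCodesNL"]:
        ids = enc.encode(text)
        print(repr(text), ids, [enc.decode([i]) for i in ids])

    # Without a leading space the string can match a dedicated token; with one,
    # pre-tokenization splits off " $" first and the remainder tokenizes differently.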

StableLM2 is a model by Stability AI [26] which uses a slightly modified version of this tokenizer, adding digit splitting as well as special code tokens similar to StarCoder2. Due to the added digit splitting, the original multi-digit tokens in the tokenizer are expected to show up as both unreachable and untrained. However, initially these tokens appeared only as untrained due to a tokenizer configuration error [9].
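The kind of digit splitting added here can be illustrated with the pre-tokenizers shipped in the Hugging Face tokenizers library; the sketch below is illustrative and not StableLM2’s actual configuration.

    # Minimal sketch of digit-splitting pre-tokenization (illustrative only).
    # Requires the Hugging Face `tokenizers` package.
    from tokenizers.pre_tokenizers import Digits

    pre = Digits(individual_digits=True)
    print(pre.pre_tokenize_str("Postcode 12345"))
    # Roughly: ('Postcode ', ...), ('1', ...), ('2', ...), ('3', ...), ('4', ...), ('5', ...)
    # With every digit split off individually, any multi-digit token inherited from the
    # original cl100k vocabulary can no longer be produced, making it unreachable.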

The Qwen model family by Alibaba significantly extends the tokenizer to over 150,000 tokens with “commonly used Chinese characters and words, as well as those in other languages” [27]. The combination of many thousands of manually added tokens, as well as a training corpus that is likely even further removed from the tokenizer’s training data than usual, results in many under-trained tokens. These include archaic Chinese characters (such as 𬳽) and Korean characters which are typographically valid but never seen in normal text (such as 앐).

Llama3 is a recent model by Meta AI which also extends the tiktoken tokenizer with 28,000 additional tokens.[10] Aside from sharing many under-trained tokens with other models using this tokenizer, the newly added tokens include additional under-trained tokens such as ЎыџNЎыџN and krvldkf.

3.3.8 StarCoder2

StarCoder2 is a series of models resulting from the BigCode project, an open scientific collaboration focused on the development of code models [28], which has published both its tokenizer training method and the corresponding dataset. The token ittrLoremipumdolorsitametconsecteturadipiscingelitIntegervelvel and fragments thereof are the most eye-catching among the verified under-trained tokens. Being a code-focused model, there are also a number of long and specific variable and function names, such as simpleIndexQueryParserTests, and fragments of them. Additional notable under-trained tokens include the words for a number of Swiss German dialects (Ostschwizertütsch, Baseldytsch, _Züritüütsch, _Bärndütsch), and a number of seemingly random tokens such as BjKPZFq.

The open nature of the project represents a great opportunity for further investigation, and allowed us to determine the source of these tokens in the tokenizer training data. We find a single document which repeats ‘LoremipumdolorsitametconsecteturadipiscingelitIntegervelvelittr’ to illustrate maximal variable lengths in Java, a single document with base-64 encoded strings as the origin of the random looking tokens, and a single source code file with a list of solutions of a German Wordle game with words categorized by dialect.

As mentioned in Section 3.2.1, some models exclude bytes not used in UTF-8. The StarCoder2 tokenizer is unique in additionally missing the 0xF1 fallback byte. Although this byte is not used in any defined Unicode block, a string containing a character in that range, such as the Unicode escape \U0006baeb, encodes to [0, 142, 142, 142], where token id 0 is the special <|endoftext|> token, used in the absence of both a 0xF1 byte token and a dedicated fallback token. This could potentially be used to disrupt systems which use not only this specific model, but any derived fine-tuned models, regardless of their training data.
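This behaviour can be reproduced by encoding a code point in that range and inspecting the resulting ids. A minimal sketch follows; the bigcode/starcoder2-15b repository name is an example, and loading it may require accepting the model licence on the Hugging Face Hub.

    # Minimal sketch: probe how the StarCoder2 tokenizer handles a code point whose
    # UTF-8 encoding requires the missing 0xF1 fallback byte.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b")

    ch = "\U0006baeb"                     # UTF-8 bytes: f1 ab ab ab (0xF1 lead byte)
    print(list(ch.encode("utf-8")))
    ids = tok.encode(ch, add_special_tokens=False)
    print(ids, tok.convert_ids_to_tokens(ids))
    # If the 0xF1 byte token is absent, the tokenizer falls back to a special token
    # for the first byte, which downstream systems may not expect in ordinary text.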

3.3.9 Yi

Yi-9B is a base model by 01.ai whose training data focuses on English and Chinese [29]. The model has a number of typical under-trained tokens, including punctuation-based tokens, (partial) non-English/Chinese words such as Разпространение, and some apparent user handles such as mabaochang. Unique to the model are a number of strange tokens starting with ‘n’, including nConsequently and nInterestingly, which may have been caused by incorrect processing of newline characters in some data. In addition, three tokens containing Chinese phrases are unusual in being unreachable: they re-encode to a different token sequence. These include 毛泽东 (‘Mao Zedong’), which tokenizes to three separate tokens despite the presence of a dedicated token. Finally, a number of tokens representing HTML tags appear to have been seen in training, although they initially appeared as unreachable when using the ‘fast’ version of the Hugging Face tokenizer[11].

3.3.10 Jamba


[4] A fragment of the token _サーティワン (‘thirty-one’), the Japanese name for the ‘Baskin-Robbins’ ice cream chain.

[5] Although GPT-J uses a bias in the unembedding layer, it does not affect our indicators due to its small magnitude.

[6] This was in part traced to a breaking change in tokenizers v0.14 (Luca Soldaini, personal communication).

[7] This bug was reported and subsequently fixed by the Gemma team, and our results are based on the fixed version.

[8] We submitted a fix for this at https://github.com/spencermountain/wtf_wikipedia/pull/573.

[9] This was reported to the Stability AI team and has been fixed by disabling the incorrect ‘slow’ tokenizer.

[10] We use tokenizers v0.19.1, in which the tokenization of 588 formerly unreachable tokens was fixed.

[11] This has been reported to the 01.ai team, who advise not to use the ‘fast’ version.
