
Comprehensive Detection of Untrained Tokens in Language Model Tokenizers

Abstract and 1. Introduction

  2. Methods

    2.1 Tokenizer analysis

    2.2 Indicators for detecting under-trained tokens and 2.3 Verification of candidate tokens

  3. Results

    3.1 Effectiveness of indicators and verification

    3.2 Common observations

    3.3 Model-specific observations

  4. Closed-source models

  5. Discussion, Acknowledgments, and References

A. Verification details

B. A short primer on UTF-8 encoding

C. Outputs for API-based verification

Abstract

The disconnect between tokenizer creation and model training in language models has been known to allow for certain inputs, such as the infamous _SolidGoldMagikarp token, to induce unwanted behaviour. Although such ‘glitch tokens’, which are present in the tokenizer vocabulary but nearly or fully absent from training, have been observed across a variety of models, a consistent way of identifying them has been missing. We present a comprehensive analysis of Large Language Model (LLM) tokenizers, specifically targeting the detection of untrained and under-trained tokens. Through a combination of tokenizer analysis, model weight-based indicators, and prompting techniques, we develop effective methods for automatically detecting these problematic tokens. Our findings demonstrate the prevalence of such tokens across various models and provide insights into improving the efficiency and safety of language models.

1 Introduction

Large Language Models (LLMs) have undergone remarkable advancements, becoming increasingly capable of understanding and generating human-like text. While most components of these models are trained in an unsupervised fashion on vast amounts of data, the tokenizer typically remains a separately trained component based on custom algorithms and smaller datasets.

GPT-2 laid the foundation for much of current-day transformer-based language modelling [1], including a framework for tokenization building on previous work in byte-pair encoding (BPE) [2], which has since been widely adopted. Tokenization using BPE converts input text to a sequence of token ids by iteratively merging two neighbouring tokens according to a fixed set of merge rules. These rules are learned using a greedy training algorithm on a smaller dataset. In addition to choosing this training dataset, which is ideally representative of the LLM's training data, training a tokenizer involves optimizing various settings, such as vocabulary size [3], the addition of special tokens, and strategies for handling out-of-vocabulary tokens.
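As a minimal illustration of this encoding step (the use of the Hugging Face transformers library and GPT-2's tokenizer is an assumption for the example; any BPE-based tokenizer behaves analogously), the sketch below converts a string to token ids and shows the corresponding byte-level tokens:

```python
# Illustrative sketch: byte-level BPE encoding with GPT-2's tokenizer
# (assumes the Hugging Face `transformers` library is installed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = " SolidGoldMagikarp"
ids = tokenizer.encode(text)

# The learned merge rules determine how the byte sequence is grouped into tokens.
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```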

Recent work in this area has primarily focused on techniques that remove the need for tokenization altogether by moving to raw byte input [4]. This typically comes at a significant cost in inference speed, which can be compensated for by specialized architectures at the initial and final layers [5], or variable compute at intermediate layers [6]. However, these techniques have not been widely adopted, and the vast majority of current models still rely on standard BPE tokenization.

Despite its widespread use, the tokenization step has generally been found to be unsatisfactory, being at the root of many unwanted behaviours and problems in LLMs [7]. In particular, the disconnect between tokenizer and model training creates the potential for some tokens to rarely or never be seen in training. Including such tokens in model inputs can lead to unexpected model behaviour, such as hallucination or the generation of garbled outputs, which is why they are commonly referred to as ‘glitch tokens’ [8]. We refer to these as ‘under-trained’ or ‘untrained’ tokens, reserving the latter term for cases in which we have a clear indication that the specific token never occurred in the model's training data.

The presence of such under-trained tokens has several drawbacks. Firstly, they occupy capacity in a fixed-size tokenizer vocabulary that could be better utilized for more common tokens, which would reduce input/output length and inference costs. Secondly, their deliberate or accidental presence in input data has the potential to cause unwanted outputs and break downstream applications. Robustness to such unexpected or malicious input data is increasingly important with the proliferation of tool use and agents in LLMs that retrieve and process external data. Lastly, these tokens can potentially be exploited to more easily circumvent guardrails by pushing the model beyond its trained distribution [8]. Although some work has been done on identifying such tokens through model and tokenizer analysis [9, 10, 11], there is a lack of reliable, well-explained automated methods that have been tested across a wide range of models. Reliable tools for detecting tokenizer problems not only provide a way to test and iteratively improve the development of tokenizers, but can also protect deployed models from unwanted input via input sanitization.

In this work, we present effective and efficient techniques for identifying such problematic tokens based on the model (un)embedding weights and tokenizer configuration. We apply these methods to a range of popular and recent open-weight models, including the Cohere Command R, Google Gemma, Meta Llama2, Mistral, Alibaba Qwen and OpenAI GPT-2 models. Finally, we include a brief exploration of extensions of these techniques to closed-source models. We also publish a general analysis tool compatible with Hugging Face models, along with detailed results for each analyzed model.

2 Methods

Our method consists of three steps: (i) we perform a tokenizer analysis by inspecting its vocabulary and observing its encoding/decoding behaviour; (ii) we calculate a number of indicators that identify candidate tokens which have likely not been seen during model training; and (iii) we verify whether the identified candidate tokens are indeed out of distribution by prompting the target model.

2.1 Tokenizer analysis

We start by defining a number of useful categories for tokens:

• Partial UTF-8 sequences: The token contains a partial UTF-8 sequence and cannot be converted to a string by itself. This is typical for ‘fallback byte’ tokens in the 0x80-0xFF range (see also Appendix B), but depending on tokenizer configuration, can also include combinations of full and partial characters.

• Unreachable: When no input string can result in the token id, we categorize the token as ‘unreachable’. We test this by decoding the token id to a string and re-encoding that string, checking whether the original token id is recovered. Such tokens are typically the result of tokenizer configuration errors or conflicts between trained and manually added vocabulary. As this test does not work for tokens that cannot be decoded to a string, we exclude partial UTF-8 sequences from this category.

• Special tokens: Manually defined tokens carrying specific meanings as control tokens, such as <s>. We identify special tokens using the patterns <…> and […], and list them separately from unreachable tokens, even if they might be considered as such due to input sanitization in tokenizer preprocessing.

• Tokens not in any of the other categories, which constitute the vast majority.

We detect and exclude partial UTF-8 sequences and unreachable tokens from our token detection pipeline, as they are not suitable for automatically building verification prompts. Our published model reports include tables with such tokens, and we briefly discuss some interesting model-specific results in section 3.3.
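The sketch below illustrates how these categories can be detected automatically, assuming a Hugging Face tokenizer; it is a simplified approximation rather than the exact implementation behind our published tool (in particular, partial UTF-8 sequences are approximated by decodes that yield the Unicode replacement character):

```python
# Sketch of the tokenizer analysis step, assuming a Hugging Face tokenizer.
# Partial UTF-8 sequences are approximated by a decode containing the
# replacement character; unreachable tokens fail the decode/re-encode round trip.
from transformers import AutoTokenizer

def categorize_tokens(model_name: str):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    special_ids = set(tokenizer.all_special_ids)
    partial_utf8, unreachable = [], []

    for token_id in range(len(tokenizer)):
        if token_id in special_ids:
            continue
        decoded = tokenizer.decode([token_id])
        if "\ufffd" in decoded:            # replacement character: likely partial UTF-8 sequence
            partial_utf8.append(token_id)
            continue
        reencoded = tokenizer.encode(decoded, add_special_tokens=False)
        if token_id not in reencoded:      # round trip fails: unreachable token
            unreachable.append(token_id)

    return partial_utf8, unreachable

partial, unreachable = categorize_tokens("gpt2")
print(len(partial), len(unreachable))
```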

2.2 Indicators for detecting under-trained tokens

We propose and use model architecture-dependent indicators to identify potentially under-trained tokens. A key distinction is whether a model uses the same matrix for its token embeddings E and for the final model layer, the ‘unembedding’ matrix U, which converts the final internal embeddings to a probability distribution over tokens.[1] Regardless of model architecture, all weights of the unembedding matrix influence the token predictions at every training step. Specifically, the training loss is minimized when the probability of unused tokens is predicted as zero regardless of the input, making their logits converge towards −∞. The model can achieve such an input-independent prediction by maintaining a constant vector in the residual stream and the negative of this vector in the unembedding rows of unused tokens, resulting in a constant negative contribution to their logit values. Using this intuition, we can find unused tokens from the unembedding weights as follows:
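One plausible realization of this idea is sketched below under simplifying assumptions (it is not necessarily the exact indicator used in our analysis, and the reference set of token ids assumed to be unused is hypothetical, e.g. unreachable or fallback-byte tokens found during tokenizer analysis): estimate the shared ‘unused token’ direction from the unembedding matrix and rank all tokens by the cosine similarity of their mean-centred unembedding rows to that direction.

```python
# A sketch of an unembedding-based indicator consistent with the intuition above.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
U = model.get_output_embeddings().weight.detach().float()   # (vocab_size, hidden_dim)

# Hypothetical reference set: token ids we already believe are unused
# (e.g. unreachable or fallback-byte tokens identified during tokenizer analysis).
reference_unused_ids = [177, 178, 179]

# Remove the mean row so a component shared by *all* tokens does not dominate.
U_centred = U - U.mean(dim=0, keepdim=True)

# Estimate the common direction of unused-token rows.
direction = U_centred[reference_unused_ids].mean(dim=0)
direction = direction / direction.norm()

# Rank tokens by how strongly their row aligns with this direction;
# high similarity marks likely under-trained candidates.
scores = F.cosine_similarity(U_centred, direction.unsqueeze(0), dim=-1)
candidate_ids = torch.argsort(scores, descending=True)[:20]
print(candidate_ids.tolist())
```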

2.3 Verification of candidate tokens

Our proposed indicators naturally provide a ranking of candidate under-trained tokens, but do not give a definitive threshold, and their relative simplicity is likely to result in a somewhat noisy relation between indicator values and model behaviour. To confirm that candidate tokens indeed induce unwanted model outputs, we verify all tokens that rank among the most likely 2% according to the chosen indicator, excluding partial UTF-8 sequences and unreachable tokens. This verification process involves constructing specific repetitive prompts that induce a high output probability for normal tokens, and checking whether a candidate token instead has a very low output probability (see Appendix A for details).
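The following sketch shows a heavily simplified version of this check; the prompt wording, model choice, and probability threshold are illustrative assumptions rather than the exact setup described in Appendix A:

```python
# Simplified verification sketch: a repetitive prompt should make a normal
# token highly likely at some position, while an under-trained token never
# receives meaningful probability. Prompt, model and threshold are
# illustrative assumptions, not the exact setup of Appendix A.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def fails_verification(token_id: int, threshold: float = 0.01) -> bool:
    token_str = tokenizer.decode([token_id])
    prompt = f"Please repeat this exactly:{token_str}{token_str}{token_str}"
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits[0]              # (seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)[:, token_id]
    # If the model never predicts the token with non-negligible probability
    # anywhere in the repetitive prompt, the candidate is confirmed.
    return probs.max().item() < threshold

print(fails_verification(tokenizer.encode(" hello", add_special_tokens=False)[0]))
```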


[1] We assume the conventional final layer structure, consisting solely of the unembedding matrix without a bias.
