Optimization of Language Models: Decoding Griffin's Local Attention and Memory Efficiency

Authors:

(1) Soham De, Google DeepMind and with equal contribution;

(2) Samuel L. Smith, Google DeepMind and with equal contribution;

(3) Anushan Fernando, Google DeepMind and with equal contribution;

(4) Aleksandar Botev, Google DeepMind and with equal contribution;

(5) George Cristian-Muraru, Google DeepMind and with equal contribution;

(6) Albert Gu, work done while at Google DeepMind;

(7) Ruba Haroun, Google DeepMind;

(8) Leonard Berrada, Google DeepMind;

(9) Yutian Chen, Google DeepMind;

(10) Srivatsan Srinivasan, Google DeepMind;

(11) Guillaume Desjardins, Google DeepMind;

(12) Arnaud Doucet, Google DeepMind;

(13) David Budden, Google DeepMind;

(14) Yee Whye Teh, Google DeepMind;

(15) Razvan Pascanu, Google DeepMind;

(16) Nando de Freitas, Google DeepMind;

(17) Caglar Gulcehre, Google DeepMind.

1 Introduction

2 Model Architecture

3 Recurrent Models Scale as Efficiently as Transformers

3.1. Scaling curves

3.2. Evaluation on downstream tasks

4 Training Recurrent Models Efficiently on Device and 4.1. Model parallelism for large scale training

4.2. Efficient linear recurrences on device

4.3. Training speed on longer sequences

5. Inference Speed

5.1. A simple model of the decode step

5.2. Results

6. Long Context Modeling and 6.1. Improving next token prediction with longer contexts

6.2. Copy and retrieval capabilities

7. Related Works

8. Conclusion, Acknowledgements and References

A. RG-LRU Recurrence Gate

B. Complex-Gated Linear Recurrent Unit (CG-LRU)

C. Model Scale Hyper-parameters

D. Efficient Linear Recurrences on Device

E. The Local Attention Window Size of Griffin

F. Inference Speed

G. Improving Next Token Prediction with Longer Contexts: Additional Results

H. Additional Details of the Copy and Retrieval Tasks

E. The Local Attention Window Size of Griffin

Griffin uses a mixture of recurrent blocks and local attention layers in its temporal mixing blocks. For all experiments shown so far, which were trained on sequence length 2048, we used a local attention window size of 1024. We now investigate how the performance of different local attention window sizes varies with the training sequence length.

We consider 400M parameter models trained on sequence lengths of 2048, 4096 and 8192 tokens,

Figure 9 | Performance of 400M parameter Griffin and MQA Transformer models using a range of local attention window sizes and training sequence lengths. The window sizes of the local attention layers are shown above each bar in the plot. We find that the global attention MQA Transformer performs much better than the local attention MQA Transformer variants (whose window sizes are smaller than the training sequence length). Furthermore, Griffin with a fixed local attention window of 1024 (denoted '1K') outperforms both the global attention and the local attention MQA Transformer baselines at all training sequence lengths.

where we keep the total number of training tokens fixed. For each sequence length, we train Griffin models with different local attention window sizes. As baselines, we train MQA Transformers using global attention layers, as well as MQA Transformers using local attention layers with different window sizes. The results are shown in Figure 9, where the window sizes used are shown above each bar (MQA Transformer bars whose window size equals the training sequence length denote the global attention MQA Transformer baseline).

Notably, we see in Figure 9 that even with a fixed window size of 1024 for its local attention layers, Griffin outperforms the global attention MQA Transformer baseline at all sequence lengths tested. However, it is worth noting that the performance gap between Griffin with a local attention window of 1024 and the global attention MQA Transformer narrows as the sequence length increases. Therefore, if the sequence length grows further, it is likely important to slowly grow the local attention window size. In practice, the hardware used also determines the optimal local attention window size in terms of training and inference speed. Finally, we note that MQA Transformers using purely local attention (with window sizes smaller than the training sequence length) perform significantly worse than both the global attention MQA Transformer and Griffin.
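To make the local attention pattern concrete, here is a minimal sketch of a causal sliding-window mask of the kind used by local attention layers, assuming a window size of 1024 and a sequence length of 2048 as in the experiments above; the function name and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask: query position i may attend to key
    position j if j <= i (causal) and i - j < window (local window)."""
    pos = np.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]                # j <= i
    in_window = (pos[:, None] - pos[None, :]) < window   # i - j < window
    return causal & in_window

# Example: training sequence length 2048 with a fixed 1K local attention window.
mask = local_attention_mask(seq_len=2048, window=1024)
print(mask.shape)        # (2048, 2048)
print(mask[2000].sum())  # 1024 -- late queries attend to at most `window` positions
```

With this masking, the attention cost and cache per token stop growing once the sequence exceeds the window size, which is the efficiency benefit weighed against global attention above.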

F. Inference Speed

F.1. Estimating memory boundedness

Language model inference during decoding is bound by memory loading. As described in Section 4.2, the linear RNN is memory bound. In the following, we show that the same holds for the other components (namely the linear layers and self-attention) of our recurrent models and Transformer models.

F.2. Estimating the memory boundedness of linear layers

As shown in D.1, the outer dimension (usually comprising the batch 𝐵 and sequence length 𝑇 dimensions) must be at least 136 for a linear layer to be compute bound. During decoding 𝑇 = 1, and since we assume 𝐵 ≲ 128, the linear layers are memory bound during decoding.
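As a rough illustration of this argument, the sketch below estimates the arithmetic intensity (FLOPs per byte moved) of a single linear layer; the hidden sizes, batch size, and bfloat16 (2 bytes per element) storage are assumed values chosen only to show the contrast between decoding and prefill/training.

```python
def linear_layer_intensity(batch, seq, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for y = x @ W, with x: [batch*seq, d_in], W: [d_in, d_out]."""
    flops = 2 * batch * seq * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out            # weights
                                    + batch * seq * d_in    # input activations
                                    + batch * seq * d_out)  # output activations
    return flops / bytes_moved

# Decoding: outer dimension B*T = B <= 128 -> low intensity, memory bound.
print(linear_layer_intensity(batch=16, seq=1, d_in=4096, d_out=4096))     # ~16
# Prefill/training: large outer dimension -> intensity far above ~136, compute bound.
print(linear_layer_intensity(batch=16, seq=2048, d_in=4096, d_out=4096))  # ~1900
```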

F.3. Estimating the memory boundedness of self-attention

In the following, we compute the ratio of memory accesses to arithmetic operations for the attention computation during the decode stage, to show that it too is memory bound.

To simplify the following analysis, we assume that we start from an empty cache (i.e., that the prefill contains 0 tokens).
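A minimal sketch of this accounting is given below; the head counts, head dimension, and bfloat16 cache are assumed values used only to illustrate why attention stays memory bound during decoding, when the cache holds 𝑇 = t tokens at the current step.

```python
def attention_decode_intensity(t, n_heads, d_head, kv_heads, bytes_per_elem=2):
    """Approximate FLOPs per byte moved for one attention decode step with a
    cache of t tokens, ignoring softmax and the small query/output tensors."""
    flops = 4 * n_heads * d_head * t                       # q.K^T plus A.V over all heads
    kv_bytes = 2 * t * kv_heads * d_head * bytes_per_elem  # load the K and V caches
    return flops / kv_bytes

# Both values sit far below typical accelerator FLOPs-per-byte ratios (~100+),
# so attention during decoding is memory bound regardless of t.
print(attention_decode_intensity(t=2048, n_heads=16, d_head=128, kv_heads=16))  # MHA: 1.0
print(attention_decode_intensity(t=2048, n_heads=16, d_head=128, kv_heads=1))   # MQA: 16.0
```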

F.4. Cache sizes

In the following, we analyze the relative cache sizes of our recurrent models and Transformers. All cache sizes scale linearly with the batch size, and in the following we assume 𝐵 = 1.

F.4.1. Size of the KV cache

For MHA or MQA, the size of the KV cache can exceed the number of model parameters when the sequence length 𝑇 is large. We therefore expect to observe a transition from a parameter-bound regime at short sequence lengths, where the decode speed is dominated by loading the model parameters onto the device, to a cache-bound regime at large sequence lengths, where the decode speed is dominated by loading the KV cache.
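The back-of-the-envelope comparison below illustrates this transition for a hypothetical MHA configuration (24 layers, 16 heads of dimension 128, roughly 1B parameters, bfloat16 weights and cache); none of these numbers are taken from the paper.

```python
def kv_cache_bytes(t, n_layers, kv_heads, d_head, bytes_per_elem=2):
    """KV cache size in bytes at sequence length t (batch size 1):
    K and V tensors for every layer and every cached token."""
    return 2 * n_layers * kv_heads * d_head * t * bytes_per_elem

param_bytes = 2 * 1_000_000_000  # ~1B parameters stored in bfloat16
for t in (2048, 8192, 32768):
    ratio = kv_cache_bytes(t, n_layers=24, kv_heads=16, d_head=128) / param_bytes
    print(f"T={t}: KV cache is {ratio:.1f}x the parameter memory")
# The ratio crosses 1 as T grows: decode speed shifts from being dominated by
# loading the model parameters to being dominated by loading the KV cache.
```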

F.4.2. Size of the recurrent state

F.4.3. Local attention cache

G. Improving Next Token Prediction with Longer Contexts: Additional Results

Figure 10 shows an additional result illustrating next token prediction performance at different context lengths on a held-out dataset of ArXiv articles. We find the results on this dataset to be qualitatively similar to those shown in Figure 5.

Figure 10 | Evaluation performance of 1B parameter models across a range of sequence lengths on an evaluation set of ArXiv articles. On the left, we compare the performance of different models trained with sequence length 2048 and evaluated at sequence lengths up to 32,768. On the right, we compare Griffin and Hawk when trained on sequence lengths of 2048 (2K) and 8192 (8K). The results are qualitatively similar to the evaluation on Books shown in Figure 5.

H. Additional Details of the Copy and Retrieval Tasks

Figure 11 contains an illustration of the selective copying and induction heads tasks.

In the selective copying task, the model must learn to copy the data tokens (the coloured tokens in Figure 11) from a sequence while ignoring the noise tokens (the white tokens in Figure 11). The crossed-out tokens in Figure 11 denote the tokens that are masked out of the loss.

Figure 11 | Illustration of the selective copying (left) and induction heads (right) tasks.
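For concreteness, the sketch below generates a toy instance of the selective copying task described above; the vocabulary size, sequence length, and use of token id 0 as the noise token are illustrative assumptions, not the paper's exact task construction.

```python
import numpy as np

def selective_copy_example(seq_len=32, n_data=8, vocab=16, noise_id=0, seed=0):
    """One toy selective-copying example: data tokens (ids > 0) scattered among
    noise tokens (id 0); the target is the data tokens in their original order,
    and only those target positions contribute to the loss."""
    rng = np.random.default_rng(seed)
    inputs = np.full(seq_len, noise_id)
    positions = np.sort(rng.choice(seq_len, size=n_data, replace=False))
    data = rng.integers(1, vocab, size=n_data)
    inputs[positions] = data
    targets = data                           # the tokens the model must copy, in order
    loss_mask = np.ones(n_data, dtype=bool)  # loss is computed only on the copied tokens
    return inputs, targets, loss_mask

x, y, m = selective_copy_example()
print(x)  # e.g. data ("coloured") tokens scattered among noise ("white") tokens
print(y)  # the 8 data tokens in the order they appeared
```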

In the induction heads task, the model must learn to recall the token that immediately follows a special token (the black token in Figure 11). As before, crossed-out tokens denote the tokens that are masked out of the loss.
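Similarly, a toy induction heads example can be constructed as sketched below; the choice of token id 0 as the special token and the sequence length are again illustrative assumptions.

```python
import numpy as np

def induction_heads_example(seq_len=32, vocab=16, special_id=0, seed=0):
    """One toy induction-heads example: a special token appears once in the
    sequence; when it reappears at the end, the model must predict the token
    that immediately followed its first occurrence."""
    rng = np.random.default_rng(seed)
    seq = rng.integers(1, vocab, size=seq_len)  # random non-special tokens
    pos = int(rng.integers(1, seq_len - 2))     # first occurrence of the special token
    seq[pos] = special_id
    answer = int(seq[pos + 1])                  # the token to be recalled
    seq[-1] = special_id                        # special token repeated as the query
    return seq, answer                          # loss only on the final prediction

x, y = induction_heads_example()
print(x[-1], y)  # the input ends with the special token; the target is y
```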
