Actual runtime. Training and inference can be decomposed into forward, backward, and generation. Since both the forward pass (during training and inference) and the backward pass can be parallelized, the researchers used the dual form. Generating new tokens (also known as decoding) is inherently sequential, so the researchers used the primal form. Due to resource constraints, this experiment was written in JAX and run on TPUs. However, since Mamba (implemented in PyTorch, Triton, and CUDA) can only run on GPUs, the researchers also rewrote their method to run on GPUs for a fair comparison. Specifically, the researchers wrote a GPU kernel for the forward pass in ThunderKittens.
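To make the primal/dual split above concrete, here is a minimal JAX sketch, not the researchers' implementation: it assumes a toy linear test-time-training update with a per-token reconstruction loss, and contrasts a strictly sequential token-by-token loop (primal-style, as needed for decoding) with a chunked version that collapses the per-token gradients into matrix multiplications (dual-style, as used for the parallelizable forward and backward passes). The loss, step size, and function names are illustrative assumptions, and the chunked version applies one aggregated update per chunk rather than reproducing the exact sequential updates.

```python
# Minimal sketch only: a toy linear TTT-style update, not the paper's kernels.
import jax
import jax.numpy as jnp

def token_loss(W, x):
    # Toy per-token self-supervised reconstruction loss (assumed for illustration).
    return 0.5 * jnp.sum((x @ W - x) ** 2)

grad_token_loss = jax.grad(token_loss)

def primal_decode(W, tokens, eta=0.1):
    # Sequential (primal-style) processing: emit an output, then take one
    # gradient step per token. Decoding is inherently sequential like this.
    def step(W, x):
        z = x @ W
        W = W - eta * grad_token_loss(W, x)
        return W, z
    return jax.lax.scan(step, W, tokens)

def dual_chunk_forward(W, tokens, eta=0.1):
    # Parallel (dual-style) processing of a whole chunk: for this toy loss the
    # summed per-token gradient is X^T (X W - X), a pure matmul expression, so
    # the chunk maps onto hardware-friendly matrix products. Outputs here are
    # computed against the chunk-start W, a simplification of the real dual
    # form, which also accounts for intra-chunk updates via extra matmuls.
    X = tokens                      # [chunk_len, d]
    Z = X @ W                       # outputs for the chunk
    G = X.T @ (X @ W - X)           # summed gradient over the chunk
    return W - eta * G, Z

# Tiny usage example with random data.
key = jax.random.PRNGKey(0)
W0 = jnp.eye(8) * 0.5
tokens = jax.random.normal(key, (16, 8))
W_seq, z_seq = primal_decode(W0, tokens)
W_par, z_par = dual_chunk_forward(W0, tokens)
```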
Historically, RNNs have been inefficient in the forward and backward passes due to poor use of parallelism and matrix multiplication. The goal of this forward kernel is to demonstrate the effectiveness of mini-batch TTT and the dual form for these problems. The left panel of Figure 15 shows the latency of the forward kernel for batch size 16; all model parameters are 1.3B (1.4B for Mamba). For the Transformer, the time per token increases linearly with context length, but it remains roughly constant for the other methods. In addition, the researchers wrote another GPU kernel for generation in Triton and benchmarked its speed at batch size 512 in the right panel of Figure 15.
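For readers who want to reproduce the shape of these latency curves on their own hardware, the following is a hedged JAX benchmarking sketch, not the ThunderKittens or Triton kernels from the experiment: it times a jitted forward pass at several context lengths and reports time per token, so an attention-like model shows the linear growth described above while a recurrent-style model would stay roughly flat. The toy forward function, model width, and context lengths are assumptions of this sketch.

```python
# Hedged benchmark sketch; forward_fn and the settings below are assumptions.
import time
import jax
import jax.numpy as jnp

def per_token_latency(forward_fn, d_model=64,
                      context_lengths=(512, 1024, 2048, 4096), n_iters=10):
    key = jax.random.PRNGKey(0)
    results = {}
    for T in context_lengths:
        x = jax.random.normal(key, (T, d_model))
        fwd = jax.jit(forward_fn)
        fwd(x).block_until_ready()             # compile and warm up
        start = time.perf_counter()
        for _ in range(n_iters):
            fwd(x).block_until_ready()         # force execution before stopping the clock
        seconds = (time.perf_counter() - start) / n_iters
        results[T] = seconds / T               # latency per token at this context length
    return results

def toy_attention_forward(x):
    # Quadratic-cost, attention-like forward pass: time per token should grow
    # roughly linearly with context length, as in the Transformer curve.
    scores = jax.nn.softmax(x @ x.T / jnp.sqrt(x.shape[-1]), axis=-1)
    return scores @ x

if __name__ == "__main__":
    for T, t in per_token_latency(toy_attention_forward).items():
        print(f"context {T}: {t * 1e6:.2f} us/token")
```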
It can be seen that the latency of TTT-Linear and Mamba is almost the same, and both are significantly lower than that of Transformer and TTT-MLP. After the arrival of such a powerful new architecture, there has been no shortage of heated discussion in the community. One netizen remarked that this may be the closest thing yet to real-time context, and said they would like to hear everyone's thoughts: it means the model can keep learning and adapting even while in use, offering better performance on long contexts without incurring the high computational costs usually associated with Transformers. Video generation researchers said that this research looks very interesting.