Attention and Transformer: Seven Questions

Beck Moulton

Introduction

Recently, ChatGPT and other chatbots have pushed large language models (LLMs) to the forefront, which has led many people outside of ML and NLP to start studying attention and Transformer models. In this article, we raise several questions about the structure of the Transformer model and dig into the technical theory behind it. The intended audience is readers who have already read the Transformer paper and have a general understanding of how attention works.

Without further ado, let’s get started!

Why is Attention necessary?

First, let's take machine translation as an example. Before the attention mechanism appeared, most machine translation systems were built on an encoder-decoder architecture. The encoder's role is to encode the input sentence (such as "I love you") into a feature vector using an RNN; the decoder's role is to receive that feature vector and decode it into another language (for example, the French "Je t'aime").
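
As an illustration (not code from the article), here is a minimal PyTorch-style sketch of that pre-attention encoder-decoder setup; the class names, the choice of GRU, and the dimensions are my own assumptions.

```python
# Minimal sketch of an RNN encoder-decoder translator (pre-attention).
# All names and sizes here are illustrative assumptions, not the article's code.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        _, hidden = self.rnn(self.embed(src_tokens))
        return hidden  # (1, batch, hidden_dim): one fixed-size vector per sentence

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt_tokens, context):
        # context is the encoder's final hidden state; every target word
        # must be generated from this single vector.
        output, _ = self.rnn(self.embed(tgt_tokens), context)
        return self.out(output)  # (batch, tgt_len, vocab_size) logits
```

Note that the only thing the decoder ever sees is the encoder's final hidden state, which is exactly the limitation discussed next.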

With this approach, the encoder must compress the entire input into a single fixed-size vector, which is then passed to the decoder, and we expect that one vector to contain all the information in the input…
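
For contrast, here is a minimal sketch (again my own illustration, with assumed shapes and names) of dot-product attention over all encoder hidden states, which is the idea attention introduces instead of relying on one fixed vector:

```python
# Sketch of dot-product attention over all encoder states (illustrative only).
import torch
import torch.nn.functional as F

def attention(decoder_state, encoder_states):
    # decoder_state:  (batch, hidden)          current decoder hidden state
    # encoder_states: (batch, src_len, hidden) one vector per source token
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)                       # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
    return context, weights  # a fresh context vector at every decoding step
```

Because the context is recomputed from every encoder state at each decoding step, the model no longer has to squeeze the whole sentence into a single vector.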


Beck Moulton

Focused on back-end development, sharing hands-on technical content. Buy me a coffee if you appreciate my work: https://www.buymeacoffee.com/BeckMoulton