As far as I understand, ChatGPT occasionally creates a summary of older parts of the conversation. These summaries are not just keywords used to keep context; they are complete sentences.
The attention mechanism enables a transformer to handle fairly long sequences effectively, provided the device has enough memory, since the attention matrix grows quadratically with sequence length.
Self-attention is the core mechanism that lets transformers model long sequences effectively. In a transformer, each position in the input sequence can attend to every other position. This allows the model to weigh the importance of all tokens, including the current one, when making predictions.
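Concretely, the standard transformer computes this weighting with scaled dot-product attention (Vaswani et al., 2017):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value projections of the token representations and $d_k$ is the key dimension.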
For each token, self-attention computes a weighted sum of all tokens' representations, with the weights given by attention scores derived from projections learned during training. In essence, the model learns the relationships between tokens, such as positional and contextual dependencies, directly from the data.
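To make the "weighted sum" concrete, here is a minimal single-head sketch in NumPy. The names `W_q`, `W_k`, `W_v` and the toy dimensions are illustrative assumptions, not anything defined above:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence.

    X:             (seq_len, d_model) token representations
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    Returns        (seq_len, d_k) context-mixed representations.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Every token's query is compared against every token's key,
    # so each position can attend to all positions, itself included.
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy usage: 5 tokens, model width 8, head width 4 (all hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 4)
```

A real transformer runs many such heads in parallel and adds masking, but the per-head computation is this same score-softmax-sum pattern.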
However, neural networks (NNs) such as transformers are largely black boxes, which makes it very hard to explain exactly how they work or why they respond the way they do. Many have tried to explain this clearly, but no explanation is complete. While it is possible to probe CNN-based models by visualizing feature-map activations, the inner workings of transformers are harder to visualize in the same way.