To understand this, one most likely needs not only the overall concept of a transformer architecture using an attention mechanism to capture global input-output dependencies, but also the specific mathematical, logical, and computational details. How is the attention mechanism employed, and what is its essence from an information-processing point of view? How are inputs of varying lengths handled, and how does the mechanism adapt its attention to the length of the sequence? Why can it replace sophisticated recurrent or convolutional neural networks?
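To make the question about variable-length inputs concrete: here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the transformer. It is an illustrative simplification (no learned projections, no multi-head structure, no masking), but it shows the key property being asked about: the attention weights are computed from pairwise dot products, so no parameter depends on the sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n_q, n_k) pairwise similarities
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V                  # weighted average of the values

# The same function handles any sequence length: the weight matrix
# is (n, n) and is recomputed per input, not stored as a parameter.
rng = np.random.default_rng(0)
for n in (3, 7, 50):
    X = rng.normal(size=(n, 8))
    out = attention(X, X, X)            # self-attention over n tokens
    assert out.shape == (n, 8)
```

Because every output position attends directly to every input position in one step, no recurrence over time steps or stacking of convolutional layers is needed to connect distant positions, which is the usual argument for why attention can replace them.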