Soft Attention:
Mechanism: Soft attention computes its weights in a differentiable manner, typically with a softmax over alignment scores. The model attends to every part of the input at once, with the weights indicating how important each part is.
Training: Because it is fully differentiable, soft attention can be trained with standard backpropagation, which makes it easy to integrate into most neural network architectures.
Determinism: It is deterministic; given the same input, it produces the same output.
Examples: the Transformer attention used in BERT and the GPT series.
Efficiency: Generally more computationally efficient, since it avoids sampling; see the sketch below.
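To make this concrete, here is a minimal sketch of soft attention as scaled dot-product attention in PyTorch. The function name `soft_attention` and the single-query setup are illustrative choices for this example, not a standard API; the key point is that the softmax keeps the whole computation differentiable, so `backward()` works directly.

```python
# A minimal sketch of soft (scaled dot-product) attention in PyTorch.
import torch
import torch.nn.functional as F

def soft_attention(query, keys, values):
    """Differentiable attention: a softmax-weighted average of all values.

    query:  (d,)     -- what we are looking for
    keys:   (n, d)   -- one key per input position
    values: (n, d_v) -- one value per input position
    """
    d = query.shape[-1]
    # Similarity score between the query and every key.
    scores = keys @ query / d ** 0.5      # (n,)
    # Softmax keeps the weights differentiable; they sum to 1.
    weights = F.softmax(scores, dim=-1)   # (n,)
    # Deterministic: the same input always yields the same weighted sum.
    return weights @ values, weights

# Gradients flow through the weights, so standard backprop suffices.
query = torch.randn(8, requires_grad=True)
keys, values = torch.randn(5, 8), torch.randn(5, 16)
output, weights = soft_attention(query, keys, values)
output.sum().backward()   # no sampling, no special gradient estimator needed
print(weights)            # every position receives some nonzero weight
```

Note how every input position contributes to the output; nothing is discarded, which is exactly what makes the operation smooth and trainable end to end.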
Hard Attention:
Mechanism: Hard attention selects certain parts of the input and ignores the rest entirely. Because this involves making discrete choices about which parts to attend to, it is typically non-differentiable.
Training: Its non-differentiable nature means hard attention usually requires alternative training techniques, such as reinforcement-learning-style gradient estimators (e.g., REINFORCE) combined with variance-reduction methods.
Stochasticity: It is stochastic; it may focus on different parts of the input on different runs, leading to varied outputs.
Examples: Used in applications where discrete attention decisions are beneficial, such as certain image processing tasks (e.g., hard visual attention in image captioning).
Efficiency: Can be less computationally efficient, due to the need for sampling and the more complex training techniques involved; see the sketch below.
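For symmetry, here is a minimal sketch of hard attention under the same illustrative setup. The name `hard_attention`, the single-sample design, and the `reward` signal are assumptions made for this example; it demonstrates one common training workaround, a REINFORCE-style surrogate loss, rather than the only possible approach.

```python
# A minimal sketch of hard attention with a REINFORCE-style gradient estimator.
import torch
import torch.nn.functional as F

def hard_attention(query, keys, values):
    """Stochastic attention: sample ONE position and return only its value.

    The discrete sample blocks ordinary gradients, so we also return the
    log-probability of the choice for a score-function (REINFORCE) update.
    """
    d = query.shape[-1]
    scores = keys @ query / d ** 0.5
    probs = F.softmax(scores, dim=-1)
    dist = torch.distributions.Categorical(probs=probs)
    idx = dist.sample()   # discrete, non-differentiable choice
    return values[idx], dist.log_prob(idx)

query = torch.randn(8, requires_grad=True)
keys, values = torch.randn(5, 8), torch.randn(5, 16)

# Stochastic: repeated calls may select different input positions.
selected, log_prob = hard_attention(query, keys, values)

# REINFORCE surrogate: a reward-weighted log-prob stands in for the blocked
# gradient. `reward` here is a hypothetical task signal (e.g., negative loss).
reward = selected.sum().detach()
(-reward * log_prob).backward()
```

The sampling step is what makes training costlier: each update sees only one attended position, and the score-function gradient is high-variance, which is why variance-reduction tricks such as subtracting a learned baseline from the reward are common in practice.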