Enter a sentence to see how its tokens might attend to each other.
Shows simulated attention scores (Query rows attend to Key columns).
Hover over a token label (row) in the heatmap to see how that token's context vector is updated.
Hover over a token label on the left of the heatmap to see details here.
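The heatmap values correspond to scaled dot-product attention: each Query row is compared against every Key column, and the resulting weights combine the Value vectors into an updated context. A minimal sketch of that computation, assuming random Q/K/V matrices stand in for the demo's simulated values:

```python
# Minimal sketch of scaled dot-product attention; random Q/K/V matrices
# are assumptions standing in for the demo's simulated values.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each Query row attends to every Key column; each row of the
    returned weight matrix sums to 1 after the softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    context = weights @ V                             # weighted sum of Values
    return weights, context

n, d_k = 9, 128                # 9 tokens, assumed head dimension (see below)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
weights, context = scaled_dot_product_attention(Q, K, V)
print(weights.shape, context.shape)   # (9, 9) heatmap, (9, 128) contexts
```

Hovering a row in the heatmap corresponds to reading off one row of `weights` and the context vector it produces.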
Estimated operations for key attention steps, highlighting quadratic scaling.
Sequence Length (n): 9 tokens
(Using assumed dimensions per layer: 96 heads, d_k=128)
Key takeaway: The core attention computation is O(n²), scaling quadratically with sequence length. This becomes very expensive for long sequences.
Note: These are rough estimates focusing on the major matrix multiplications. Actual FLOPs depend on the specific implementation and hardware. Linear projection costs (which scale roughly O(n) in sequence length) are not included in the total shown.
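The quadratic term comes from the two n×n steps: forming the QKᵀ score matrix and taking the weighted sum over Values, each roughly 2·n²·d_k multiply-adds per head. A back-of-the-envelope sketch under the assumed dimensions above (96 heads, d_k = 128), with linear projections excluded as noted:

```python
# Rough FLOP estimate for the attention score and weighted-sum steps,
# assuming 96 heads and d_k = 128 as stated above; linear projection
# costs (which grow only linearly in n) are deliberately excluded.
def attention_flops(n, heads=96, d_k=128):
    qk_scores  = 2 * n * n * d_k   # Q @ K^T per head (multiply-adds)
    weighted_v = 2 * n * n * d_k   # weights @ V per head
    return heads * (qk_scores + weighted_v)

for n in (9, 128, 1024, 8192):
    print(f"n={n:>5}: ~{attention_flops(n):,.0f} FLOPs")
# Doubling n roughly quadruples the total, i.e. the O(n²) scaling above.
```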