Transformer Attention Visualization

Input Sentence

Enter a sentence to see how its tokens might attend to each other. The example below uses "the quick brown fox jumps over the lazy dog" (9 tokens).

Attention Heatmap

Shows simulated attention scores (Query rows attend to Key columns).

Query \ Key     the  quick  brown    fox  jumps   over    the   lazy    dog
the            0.04   0.12   0.12   0.13   0.14   0.17   0.11   0.10   0.07
quick          0.06   0.01   0.23   0.20   0.10   0.14   0.08   0.07   0.12
brown          0.19   0.07   0.03   0.08   0.13   0.19   0.06   0.15   0.10
fox            0.09   0.21   0.20   0.01   0.08   0.11   0.10   0.16   0.05
jumps          0.15   0.10   0.11   0.12   0.04   0.18   0.11   0.08   0.12
over           0.07   0.06   0.08   0.09   0.21   0.02   0.23   0.16   0.09
the            0.13   0.07   0.11   0.17   0.16   0.15   0.00   0.06   0.15
lazy           0.06   0.08   0.16   0.06   0.08   0.08   0.20   0.01   0.28
dog            0.12   0.06   0.07   0.11   0.10   0.12   0.16   0.21   0.05
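For reference, the following sketch shows how such a heatmap would be produced by standard scaled dot-product attention for a single head. The random Q and K matrices are stand-ins for learned projections of the token embeddings (an assumption for illustration); the scores in the table above are simulated, not the output of this code.

```python
import numpy as np

# Minimal sketch of one head's attention heatmap via scaled dot-product attention.
# Q and K are random stand-ins for learned projections (assumption for illustration).
rng = np.random.default_rng(0)

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
n, d_k = len(tokens), 64            # sequence length and per-head key dimension

Q = rng.standard_normal((n, d_k))   # one query vector per token (rows)
K = rng.standard_normal((n, d_k))   # one key vector per token (columns)

# Each query row attends to every key column; softmax makes each row sum to 1.
scores = Q @ K.T / np.sqrt(d_k)                             # (n, n) similarities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Print a table in the same layout as the heatmap above.
print("Query \\ Key " + " ".join(f"{t:>6}" for t in tokens))
for t, row in zip(tokens, weights):
    print(f"{t:<11} " + " ".join(f"{w:6.2f}" for w in row))
```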

Conceptual Meaning Shift (Delta)

Hover over a token label (row) in the heatmap to see its context update.

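In standard attention, a token's context update is the attention-weighted sum of the value vectors of the tokens it attends to. The sketch below shows one plausible reading of the "delta": how far that weighted mix moves the token away from its original embedding. The random X, Q, K, V matrices and the choice of "fox" as the hovered row are illustrative assumptions; the page does not specify its exact formula.

```python
import numpy as np

# Sketch of one reading of the "context update" and its delta (an assumed
# interpretation): attention output = weighted sum of value vectors, and the
# delta = how far that output sits from the token's original embedding.
rng = np.random.default_rng(0)
tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
n, d_k = len(tokens), 64

X = rng.standard_normal((n, d_k))                 # original token embeddings (stand-ins)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))

i = tokens.index("fox")                           # the hovered query row
scores = Q[i] @ K.T / np.sqrt(d_k)                # similarity to every key
weights = np.exp(scores - scores.max())
weights /= weights.sum()                          # attention weights, sum to 1

context_update = weights @ V                      # weighted mix of value vectors
delta = context_update - X[i]                     # shift from the original vector

top = np.argsort(weights)[::-1][:3]
print("'fox' attends most to:", [(tokens[j], round(float(weights[j]), 2)) for j in top])
print("magnitude of meaning shift:", round(float(np.linalg.norm(delta)), 3))
```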

Computational Complexity

Estimated operations for key attention steps, highlighting quadratic scaling.

Sequence Length (n): 9 tokens

(Using assumed dimensions per layer: 96 heads, d_k=128)

Dominant Quadratic Operations (Approx.):

  • Q·Kᵀ calculation: 995.3 K ops (≈ n² · d_k · heads)
  • Scores·V calculation: 995.3 K ops (≈ n² · d_k · heads)
  • Total (quadratic part): 2.0 M ops

Key takeaway: The core attention calculation complexity is O(n²), scaling quadratically with sequence length. This becomes very expensive for long sequences.
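To make the arithmetic behind these figures explicit, here is a minimal sketch using the assumed dimensions above (96 heads, d_k = 128) and the stated approximation of n² · d_k · heads operations per quadratic matrix multiply; attention_op_estimate is a hypothetical helper for illustration, not part of any library.

```python
# Rough op-count estimate per attention layer, following the approximation above.
def attention_op_estimate(n, d_k=128, heads=96):
    qk_ops = n * n * d_k * heads    # Q @ K^T:    (n, d_k) x (d_k, n) per head
    sv_ops = n * n * d_k * heads    # scores @ V: (n, n) x (n, d_k) per head
    return qk_ops, sv_ops, qk_ops + sv_ops

for n in (9, 128, 1024):
    qk, sv, total = attention_op_estimate(n)
    print(f"n={n:>5}: Q.K^T ~{qk:,} ops, scores.V ~{sv:,} ops, total ~{total:,} ops")

# n=9 gives 995,328 ops per matmul (the "995.3 K" above) and ~2.0 M in total;
# doubling n multiplies these totals by ~4, the O(n^2) behaviour in question.
```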

Note: These are rough estimates focusing on major matrix multiplications. Actual FLOPs depend on specific implementations and hardware. Linear projection costs (~O(n)) are not included in the total shown.