ttnn.transformer.windowed_scaled_dot_product_attention

ttnn.transformer.windowed_scaled_dot_product_attention(input_tensor_q: ttnn.Tensor, input_tensor_k: ttnn.Tensor, input_tensor_v: ttnn.Tensor, cu_window_seqlens: ttnn.Tensor, *, scale: float = None, memory_config: ttnn.MemoryConfig = None, program_config: SDPAProgramConfig = None, compute_kernel_config: ttnn.DeviceComputeKernelConfig = None) → ttnn.Tensor

Windowed scaled dot product attention. This is similar to the standard SDPA but instead of accepting an explicit attention mask, it accepts cumulative window sequence lengths and builds the attention mask internally to create block-diagonal attention patterns.
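As an illustrative sketch of the block-diagonal pattern described above (plain NumPy, not the internal ttnn implementation), the mask implied by cu_window_seqlens can be reconstructed like this; `block_diagonal_mask` is a hypothetical helper, not part of the ttnn API:

```python
import numpy as np

def block_diagonal_mask(cu_window_seqlens):
    # Token i may attend to token j only when both fall inside the
    # same window, i.e. between the same pair of consecutive boundaries.
    s = cu_window_seqlens[-1]  # total sequence length
    mask = np.zeros((s, s), dtype=bool)
    for start, end in zip(cu_window_seqlens[:-1], cu_window_seqlens[1:]):
        mask[start:end, start:end] = True
    return mask

# Three windows of sizes 10, 15, and 20 over a 45-token sequence:
mask = block_diagonal_mask([0, 10, 25, 45])
```

Each window contributes one dense square block on the diagonal; everything outside those blocks is masked out.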

This is particularly useful for vision transformers with windowed attention mechanisms like Qwen2.5-VL where attention is restricted to specific windows in the sequence.

Parameters:
  • input_tensor_q (ttnn.Tensor) – the query tensor. [b x nqh x s x dh]

  • input_tensor_k (ttnn.Tensor) – the key tensor. [b x nkv x s x dh]

  • input_tensor_v (ttnn.Tensor) – the value tensor. [b x nkv x s x dh]

  • cu_window_seqlens (ttnn.Tensor) – cumulative window sequence lengths that define attention boundaries. [window_count + 1]
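cu_window_seqlens is the prefix sum of the individual window sizes with a leading 0, so its last entry equals the total sequence length. A minimal sketch of building it from per-window sizes:

```python
from itertools import accumulate

window_sizes = [10, 15, 20]  # tokens in each attention window
cu_window_seqlens = [0, *accumulate(window_sizes)]
# -> [0, 10, 25, 45]; the final entry is the total sequence length s.
```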

Keyword Arguments:
  • scale (float, optional) – Scale factor applied to QK^T. Defaults to None.

  • memory_config (ttnn.MemoryConfig, optional) – Memory configuration for the operation. Defaults to None.

  • program_config (SDPAProgramConfig, optional) – Program configuration for the operation. Defaults to None.

  • compute_kernel_config (ttnn.DeviceComputeKernelConfig, optional) – Compute kernel configuration for the operation. Defaults to None.

Returns:

ttnn.Tensor – the output tensor [b x nqh x s x dh].

Example

# For a 45-token sequence split into 3 windows of 10, 15, and 20 tokens,
# the cumulative window boundaries are [0, 10, 25, 45]. They must be passed
# as a ttnn.Tensor (created here via ttnn.from_torch; q, k, v and device are
# assumed to be set up already):
cu_window_seqlens = ttnn.from_torch(
    torch.tensor([0, 10, 25, 45], dtype=torch.int32), device=device
)
output = ttnn.transformer.windowed_scaled_dot_product_attention(
    q, k, v, cu_window_seqlens
)