ttnn.transformer.scaled_dot_product_attention

ttnn.transformer.scaled_dot_product_attention(input_tensor_q: ttnn.Tensor, input_tensor_k: ttnn.Tensor, input_tensor_v: ttnn.Tensor, *, attn_mask: ttnn.Tensor = None, is_causal: bool = True, scale: float = None, sliding_window_size: int = None, memory_config: ttnn.MemoryConfig = None, program_config: SDPAProgramConfig = None, compute_kernel_config: ttnn.DeviceComputeKernelConfig = None, attention_sink: ttnn.Tensor = None) → ttnn.Tensor

Causal scaled dot product attention. This API mimics the PyTorch API of the same name. The implementation uses the FlashAttention-2 algorithm.

Accepts an SDPAProgramConfig, which specifies the grid size and the chunk tiles along the Q and K sequence lengths. The op parallelizes over b, nqh, and Q’s s dimension.
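The chunking over the Q and K sequence lengths can be illustrated with a minimal NumPy sketch of the online-softmax accumulation that FlashAttention-style kernels use. This is not the device kernel: it handles a single head without masking, the names `chunked_attention`, `full_attention`, `q_chunk` and `k_chunk` are hypothetical, chunk sizes are plain row counts, and the default scale of 1/sqrt(dh) is an assumption borrowed from the PyTorch convention.

```python
import numpy as np


def full_attention(q, k, v, scale=None):
    """Plain softmax attention for one head; q, k, v are [s, dh] arrays."""
    scale = 1.0 / np.sqrt(q.shape[1]) if scale is None else scale
    scores = (q @ k.T) * scale
    scores = scores - scores.max(axis=1, keepdims=True)
    p = np.exp(scores)
    return (p / p.sum(axis=1, keepdims=True)) @ v


def chunked_attention(q, k, v, q_chunk=4, k_chunk=4, scale=None):
    """Same result as full_attention, computed one (Q chunk, K chunk) pair at a time."""
    s_q, dh = q.shape
    scale = 1.0 / np.sqrt(dh) if scale is None else scale  # assumed default
    out = np.zeros((s_q, v.shape[1]))
    for qs in range(0, s_q, q_chunk):
        qc = q[qs:qs + q_chunk]
        m = np.full(qc.shape[0], -np.inf)           # running row maximum
        l = np.zeros(qc.shape[0])                   # running softmax denominator
        acc = np.zeros((qc.shape[0], v.shape[1]))   # running unnormalized output
        for ks in range(0, k.shape[0], k_chunk):
            scores = (qc @ k[ks:ks + k_chunk].T) * scale
            m_new = np.maximum(m, scores.max(axis=1))
            p = np.exp(scores - m_new[:, None])
            correction = np.exp(m - m_new)          # rescales earlier partial sums
            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ v[ks:ks + k_chunk]
            m = m_new
        out[qs:qs + q_chunk] = acc / l[:, None]
    return out
```

Each Q chunk is independent of the others, which is why work can be split across Q's s dimension as well as over b and nqh.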

Parameters:
  • input_tensor_q (ttnn.Tensor) – the query tensor. [b x nqh x s x dh]

  • input_tensor_k (ttnn.Tensor) – the key tensor. [b x nkv x s x dh]

  • input_tensor_v (ttnn.Tensor) – the value tensor. [b x nkv x s x dh]
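As a rough host-side reference for the shapes above, the following NumPy sketch computes the same quantity on CPU arrays. It is an illustration only, not the ttnn implementation. It assumes that nqh is a multiple of nkv, that each key/value head is shared by nqh / nkv consecutive query heads, and that the scale defaults to 1/sqrt(dh) as in PyTorch; the name `sdpa_reference` is hypothetical.

```python
import numpy as np


def sdpa_reference(q, k, v, is_causal=True, scale=None):
    """q: [b, nqh, s, dh]; k, v: [b, nkv, s, dh]; returns [b, nqh, s, dh]."""
    b, nqh, s, dh = q.shape
    nkv = k.shape[1]
    assert nqh % nkv == 0, "assumed: nqh is a multiple of nkv"
    group = nqh // nkv
    # Assumed grouping: query heads [g*group, (g+1)*group) share KV head g.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scale = 1.0 / np.sqrt(dh) if scale is None else scale  # assumed default
    scores = np.einsum("bhqd,bhkd->bhqk", q, k) * scale
    if is_causal:
        # Position i may not attend to positions j > i.
        future = np.triu(np.ones((s, s), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("bhqk,bhkd->bhqd", weights, v)
```

A reference like this can be useful for checking device output within a tolerance, since the device op may run at reduced precision.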

Keyword Arguments:
  • attn_mask (ttnn.Tensor, optional) – Defaults to None. Either [b x 1 x s x s] with head broadcasting implied or [b x nqh x s x s].

  • is_causal (bool) – Whether to apply causal masking. Defaults to True.

  • scale (float, optional) – Defaults to None.

  • sliding_window_size (int, optional) – Defaults to None. Size of the sliding window for attention. If provided and is_causal is True, each position attends only to the last sliding_window_size tokens. If provided and is_causal is False, each position attends to a window of size sliding_window_size centered at the current position.

  • memory_config (ttnn.MemoryConfig, optional) – Memory configuration for the operation. Defaults to None.

  • program_config (SDPAProgramConfig, optional) – Defaults to None.

  • compute_kernel_config (ttnn.DeviceComputeKernelConfig, optional) – Defaults to None.

  • attention_sink (ttnn.Tensor, optional) – Defaults to None. [1 x nqh x 1 x 1]. Single attention sink value per head. The kernel will efficiently replicate this value across all query positions.
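The two sliding_window_size modes can be pictured as boolean masks over (query position, key position) pairs. The sketch below is dependency-free Python with a hypothetical helper name, `sliding_window_allowed`. The boundary conventions are assumptions that the text above does not pin down: in the causal case the current token is counted as one of the last sliding_window_size tokens, and in the non-causal case the window extends sliding_window_size // 2 positions to each side.

```python
def sliding_window_allowed(s, window, is_causal):
    """Return an s x s list of lists; entry [i][j] is True if query i may attend to key j."""
    half = window // 2  # assumed half-width for the centered (non-causal) case
    allowed = []
    for i in range(s):
        row = []
        for j in range(s):
            if is_causal:
                # Assumed: the window covers positions i - window + 1 .. i.
                row.append(i - window < j <= i)
            else:
                # Assumed: the window covers positions i - half .. i + half.
                row.append(abs(i - j) <= half)
        allowed.append(row)
    return allowed
```

For example, with s = 5 and a window of 2 in causal mode, query position 3 may attend to key positions 2 and 3 under these assumptions.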

Returns:

ttnn.Tensor – the output tensor [b x nqh x s x dh].