ttnn.transformer.scaled_dot_product_attention_decode

ttnn.transformer.scaled_dot_product_attention_decode(input_tensor_q: ttnn.Tensor, input_tensor_k: ttnn.Tensor, input_tensor_v: ttnn.Tensor, *, is_causal: bool = True, attn_mask: ttnn.Tensor | None = None, cur_pos: List of int | None = None, memory_config: ttnn.MemoryConfig | None = None, queue_id: int | None = 0, cur_pos_tensor: ttnn.Tensor | None = None, scale: float | None = None, program_config: SDPAProgramConfig | None = None, compute_kernel_config: ttnn.DeviceComputeKernelConfig | None = None) → ttnn.Tensor

A version of scaled dot product attention specialized for decode. The implementation follows Flash-Decode and currently supports only MQA (multi-query attention) when decoding a single token.

Accepts an SDPAMultiCoreProgramConfig, which specifies the grid size and the chunk size, in tiles, along the K/V/Mask sequence length (the Q chunk size is not used). The op parallelizes over b and over the s dimension of K/V/Mask.
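The chunked parallelism over the K/V sequence dimension can be illustrated off-device. The following pure-Python sketch shows the general Flash-Decode idea for one query vector: compute a partial softmax per K/V chunk, then merge the partials with a max-based rescaling. It is a reference for the math only, not the ttnn kernel; the function names, the `chunk_size` argument (in positions rather than tiles), and the default scale of 1/sqrt(dh) are assumptions made for illustration.

```python
import math


def attend_chunk(q, keys, values, scale):
    """Partial attention over one K/V chunk.

    Returns (max score, sum of exponentials, unnormalized weighted sum of values).
    """
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    acc = [sum(w * v[d] for w, v in zip(weights, values))
           for d in range(len(values[0]))]
    return m, sum(weights), acc


def merge(partials):
    """Combine per-chunk partials into the exact softmax-weighted output."""
    m_all = max(m for m, _, _ in partials)
    denom = 0.0
    out = [0.0] * len(partials[0][2])
    for m, total, acc in partials:
        correction = math.exp(m - m_all)  # rescale each chunk to the global max
        denom += correction * total
        out = [o + correction * a for o, a in zip(out, acc)]
    return [o / denom for o in out]


def chunked_decode_attention(q, keys, values, chunk_size, scale=None):
    """Attention for a single query vector, computed chunk by chunk."""
    if scale is None:
        scale = 1.0 / math.sqrt(len(q))  # assumed default, not confirmed by this page
    partials = [
        attend_chunk(q, keys[i:i + chunk_size], values[i:i + chunk_size], scale)
        for i in range(0, len(keys), chunk_size)
    ]
    return merge(partials)
```

Because each chunk is processed independently before the merge, the per-chunk work can be distributed across cores, which is the property the op relies on when it parallelizes over s.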

Parameters:
  • input_tensor_q (ttnn.Tensor) – the query tensor [1 x b x nh x dh]

  • input_tensor_k (ttnn.Tensor) – the key tensor [b x nkv x s x dh]

  • input_tensor_v (ttnn.Tensor) – the value tensor [b x nkv x s x dh]

Keyword Arguments:
  • is_causal (bool) – whether the attention is causal. Defaults to True.

  • attn_mask (ttnn.Tensor, optional) – the attention mask tensor [b x 1 x s x s]. Defaults to None.

  • cur_pos (List of int, optional) – list of b integers, one position per batch entry. Defaults to None.

  • memory_config (ttnn.MemoryConfig, optional) – Memory configuration for the operation. Defaults to None.

  • queue_id (int, optional) – command queue id. Defaults to 0.

  • cur_pos_tensor (ttnn.Tensor, optional) – tensor of b integers, shape [b]. Defaults to None.

  • scale (float, optional) – Defaults to None.

  • program_config (SDPAProgramConfig, optional) – Defaults to None.

  • compute_kernel_config (ttnn.DeviceComputeKernelConfig, optional) – Defaults to None.

Returns:

ttnn.Tensor – the output tensor [1 x b x pnh x dh].
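To make the shapes and the role of cur_pos concrete, here is a minimal host-side reference for one batch entry in the MQA case (a single K/V head shared by all query heads). It is a sketch under stated assumptions, not the device implementation: the function name is hypothetical, positions 0 through cur_pos inclusive are assumed to be attended when is_causal is True, and the default scale is assumed to be 1/sqrt(dh).

```python
import math


def decode_attention_reference(q, k_cache, v_cache, cur_pos, scale=None):
    """Reference decode attention for one batch entry with one shared K/V head.

    q:        [nh][dh] one query vector per head (the single token being decoded)
    k_cache:  [s][dh]  keys of the shared K/V head
    v_cache:  [s][dh]  values of the shared K/V head
    cur_pos:  current position; positions 0..cur_pos are attended (assumption)
    Returns a [nh][dh] nested list.
    """
    dh = len(q[0])
    if scale is None:
        scale = 1.0 / math.sqrt(dh)  # assumed default, not confirmed by this page
    keys = k_cache[: cur_pos + 1]
    values = v_cache[: cur_pos + 1]
    out = []
    for q_head in q:  # every query head reads the same K/V head (MQA)
        scores = [scale * sum(a * b for a, b in zip(q_head, k)) for k in keys]
        m = max(scores)
        weights = [math.exp(sc - m) for sc in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dh)
        ])
    return out
```

Cache entries beyond cur_pos do not influence the result in this sketch, which is why a preallocated K/V cache of length s can be passed while only part of it is filled.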

If a position is given as -1, compute for the corresponding index in the batch is skipped.
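One way to use the -1 convention is to mark unused batch slots when fewer than b users are active. The helpers below are hypothetical (they are not part of ttnn) and only illustrate how a cur_pos list might be built and which batch indices would then be computed; what the output rows of skipped entries contain is not specified here.

```python
def pad_cur_pos(positions, batch_size):
    """Build a cur_pos list of length batch_size, marking unused slots with -1."""
    if len(positions) > batch_size:
        raise ValueError("more positions than batch slots")
    return list(positions) + [-1] * (batch_size - len(positions))


def active_batch_indices(cur_pos):
    """Return the batch indices that are computed; entries equal to -1 are skipped."""
    return [b for b, pos in enumerate(cur_pos) if pos != -1]


cur_pos = pad_cur_pos([17, 3], batch_size=4)
print(cur_pos)                        # [17, 3, -1, -1]
print(active_batch_indices(cur_pos))  # [0, 1]
```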