ttnn.transformer.flash_mla_prefill
- ttnn.transformer.flash_mla_prefill(input_tensor_q: ttnn.Tensor, input_tensor_k: ttnn.Tensor, head_dim_v: uint32_t, *, attn_mask: ttnn.Tensor = None, is_causal: bool = true, memory_config: ttnn.MemoryConfig = None, scale: float = None, program_config: SDPAProgramConfig = None, compute_kernel_config: ttnn.DeviceComputeKernelConfig = None) → ttnn.Tensor
-
Causal MLA attention.
Accepts an SDPAProgramConfig, which specifies the grid size and the chunk tiles along the Q and K sequence lengths. The op parallelizes over b, nqh, and Q’s s dimension.
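As a rough illustration of that parallelization, the sketch below counts the independent units of work available when Q is split into chunks along s. The helper name and the `q_chunk` parameter are hypothetical: `q_chunk` stands in for the Q chunk length (in sequence positions) that the program config would select. How these units are actually assigned to cores in the grid is not described here.

```python
import math


def count_q_work_units(b: int, nqh: int, s: int, q_chunk: int) -> int:
    """Count independent (batch, query head, Q-chunk) units.

    Hypothetical helper for illustration only; it is not part of ttnn.
    q_chunk is an assumed Q chunk length in sequence positions.
    """
    # A partial trailing chunk still counts as one unit of work.
    chunks_along_s = math.ceil(s / q_chunk)
    return b * nqh * chunks_along_s


# Example: 1 batch, 16 query heads, 1024 positions, chunks of 128 positions.
units = count_q_work_units(b=1, nqh=16, s=1024, q_chunk=128)
print(units)
```

A larger chunk gives fewer, heavier units, while a smaller chunk gives more units to spread across the grid.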
- Parameters:
-
input_tensor_q (ttnn.Tensor) – the query input tensor. [b x nqh x s x dh]
input_tensor_k (ttnn.Tensor) – the key input tensor. [b x nkv x s x dh]
head_dim_v (uint32_t) – the head dimension of V.
- Keyword Arguments:
-
attn_mask (ttnn.Tensor, optional) – Defaults to None. [b x 1 x s x s]. Head broadcasting is implied.
is_causal (bool) – Defaults to true.
memory_config (ttnn.MemoryConfig, optional) – Memory configuration for the operation. Defaults to None.
scale (float, optional) – Defaults to None.
program_config (SDPAProgramConfig, optional) – Defaults to None.
compute_kernel_config (ttnn.DeviceComputeKernelConfig, optional) – Defaults to None.
- Returns:
-
ttnn.Tensor – the output tensor [b x nqh x s x dh].
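The attn_mask shape [b x 1 x s x s] and its implied head broadcasting can be sketched host-side with NumPy. This is a minimal illustration, not ttnn code. It assumes an additive mask convention (0 where attention is allowed, a large negative value where it is blocked), which this page does not confirm. The helper name `causal_mask` is hypothetical.

```python
import numpy as np


def causal_mask(b: int, s: int, masked_value: float = -np.inf) -> np.ndarray:
    """Build a [b, 1, s, s] causal mask (assumed additive convention)."""
    # True above the diagonal: key position is later than the query position.
    future = np.triu(np.ones((s, s), dtype=bool), k=1)
    mask = np.where(future, masked_value, 0.0).astype(np.float32)
    # Singleton head axis so one mask is shared by every query head.
    return np.broadcast_to(mask, (b, 1, s, s)).copy()


b, nqh, s = 2, 8, 4
mask = causal_mask(b, s)

# Stand-in attention scores of shape [b, nqh, s, s].
scores = np.zeros((b, nqh, s, s), dtype=np.float32)

# NumPy broadcasting expands the singleton head axis across all nqh heads.
masked_scores = scores + mask
print(mask.shape, masked_scores.shape)
```

Because the head axis has size 1, a single mask per batch entry applies to all nqh query heads without being materialized once per head.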