ttnn.transformer.scaled_dot_product_attention

ttnn.transformer.scaled_dot_product_attention = Operation(python_fully_qualified_name='ttnn.transformer.scaled_dot_product_attention', function=<ttnn._ttnn.operations.transformer.scaled_dot_product_attention_t object>, preprocess_golden_function_inputs=<function default_preprocess_golden_function_inputs>, golden_function=None, postprocess_golden_function_outputs=<function default_postprocess_golden_function_outputs>, is_cpp_operation=True, is_experimental=False)

Causal scaled dot product attention. This API mimics the PyTorch API of the same name. The implementation is FlashAttention-2.

Accepts an SDPAProgramConfig, which specifies the grid size and the chunk tiles along the Q and K sequence lengths. The op parallelizes over the b, nqh, and Q sequence (s) dimensions.
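The chunking described above can be illustrated on the host with a minimal NumPy sketch of FlashAttention-style causal attention for a single head. This is not the device kernel: the function names and the `q_chunk` / `k_chunk` parameters are hypothetical stand-ins for the chunk sizes in the program config, and the scale of 1/sqrt(dh) is an assumption. It only shows how Q and K can be processed in chunks with a running softmax so that the full s x s score matrix is never materialized.

```python
import numpy as np


def naive_causal_attention(q, k, v):
    """Reference: softmax(Q K^T / sqrt(dh)) V with a causal mask. q, k, v: [s, dh]."""
    s, dh = q.shape
    scores = (q @ k.T) / np.sqrt(dh)
    scores = np.where(np.tril(np.ones((s, s), dtype=bool)), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def chunked_causal_attention(q, k, v, q_chunk=4, k_chunk=4):
    """Same result, computed chunk by chunk with an online softmax.

    q_chunk and k_chunk are hypothetical names for the Q and K chunk lengths.
    """
    s, dh = q.shape
    scale = 1.0 / np.sqrt(dh)  # assumed default scale
    out = np.zeros((s, v.shape[1]))
    for q0 in range(0, s, q_chunk):  # each Q chunk is independent work
        q1 = min(q0 + q_chunk, s)
        rows = np.arange(q0, q1)[:, None]
        m = np.full((q1 - q0, 1), -np.inf)    # running row maximum
        l = np.zeros((q1 - q0, 1))            # running softmax denominator
        acc = np.zeros((q1 - q0, v.shape[1]))  # running unnormalized output
        # K chunks that start at or after q1 are fully masked, so skip them.
        for k0 in range(0, q1, k_chunk):
            k1 = min(k0 + k_chunk, s)
            cols = np.arange(k0, k1)[None, :]
            scores = (q[q0:q1] @ k[k0:k1].T) * scale
            scores = np.where(cols <= rows, scores, -np.inf)
            m_new = np.maximum(m, scores.max(axis=-1, keepdims=True))
            p = np.exp(scores - m_new)
            correction = np.exp(m - m_new)  # rescale earlier partial results
            l = l * correction + p.sum(axis=-1, keepdims=True)
            acc = acc * correction + p @ v[k0:k1]
            m = m_new
        out[q0:q1] = acc / l
    return out
```

Because each Q chunk carries its own running statistics, Q chunks can be distributed across workers independently, which is consistent with the op parallelizing over Q's sequence dimension.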

Parameters:

input_tensor_q (ttnn.Tensor) – the query tensor. [b x nqh x s x dh]
input_tensor_k (ttnn.Tensor) – the key tensor. [b x nkv x s x dh]
input_tensor_v (ttnn.Tensor) – the value tensor. [b x nkv x s x dh]

Keyword Arguments:

attn_mask (ttnn.Tensor, optional) – the attention mask. [b x 1 x s x s]. Head broadcasting is implied. Defaults to None.
is_causal (bool, optional) – Defaults to True.
memory_config (ttnn.MemoryConfig, optional) – Memory configuration for the operation. Defaults to None.
queue_id (int, optional) – command queue id. Defaults to 0.
scale (float, optional) – Defaults to None.
program_config (SDPAProgramConfig, optional) – Defaults to None.
compute_kernel_config (ttnn.DeviceComputeKernelConfig, optional) – Defaults to None.

Returns:

ttnn.Tensor – the output tensor [b x nqh x s x dh].
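To make the documented shapes concrete, here is a host-side NumPy sketch of the computation. It is a reference for the semantics only, not the ttnn implementation. `sdpa_reference` is a hypothetical helper name, and two behaviors are assumptions not stated above: that `scale=None` means 1/sqrt(dh), as in the PyTorch API of the same name, and that when nkv is smaller than nqh each K/V head is shared by nqh // nkv consecutive query heads.

```python
import numpy as np


def sdpa_reference(q, k, v, is_causal=True, scale=None):
    """q: [b, nqh, s, dh]; k, v: [b, nkv, s, dh]; returns [b, nqh, s, dh]."""
    b, nqh, s, dh = q.shape
    nkv = k.shape[1]
    if scale is None:
        scale = 1.0 / np.sqrt(dh)  # assumption: PyTorch-style default
    # Assumption: each K/V head serves nqh // nkv consecutive query heads.
    group = nqh // nkv
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("bhqd,bhkd->bhqk", q, k) * scale
    if is_causal:
        # Position i may only attend to positions j <= i.
        mask = np.tril(np.ones((s, s), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("bhqk,bhkd->bhqd", weights, v)


rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8, 32, 64))  # b=1, nqh=8, s=32, dh=64
k = rng.standard_normal((1, 2, 32, 64))  # nkv=2
v = rng.standard_normal((1, 2, 32, 64))
out = sdpa_reference(q, k, v)
print(out.shape)  # (1, 8, 32, 64)
```

The output has the same shape as the query tensor. With causal masking, the first sequence position attends only to itself, so its output equals the corresponding row of V.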