ttnn.SoftmaxShardedMultiCoreProgramConfig

class ttnn.SoftmaxShardedMultiCoreProgramConfig

Bases: object

Multi-core sharded program configuration for Softmax operations.

This configuration is designed for sharded tensors and enables multi-core execution with customizable block sizes and compute grid configuration. It provides fine-grained control over the computation parameters for optimal performance on sharded data.

Parameters:
  • compute_with_storage_grid_size (CoreCoord) – The grid size for compute cores with storage capability.

  • subblock_w (int) – Width of sub-blocks for computation, in tiles. Must evenly divide the block width (block_w).

  • block_h (int) – Height of blocks for processing, in tiles. Controls the vertical granularity of computation.

  • block_w (int) – Width of blocks for processing, in tiles. Controls the horizontal granularity of computation. Can be modified after creation.

Note

  • This configuration is specifically designed for sharded tensors.

  • Block dimensions must be compatible with the tensor’s shard specification.

  • Proper block sizing can significantly impact performance.
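
Block dimensions are expressed in 32x32 tiles and are derived from the shard shape. A minimal sketch of that arithmetic in plain Python (the helper name is illustrative, not part of the ttnn API):

```python
TILE = 32  # ttnn tile dimension (tensors are tiled into 32x32 blocks)

def block_dims_from_shard(shard_h, shard_w, subblock_w):
    """Derive block dimensions (in tiles) from a shard shape (in elements).

    Illustrative helper only; the real constraint checks live inside ttnn.
    """
    assert shard_h % TILE == 0 and shard_w % TILE == 0, "shard must be tile-aligned"
    block_h = shard_h // TILE
    block_w = shard_w // TILE
    assert block_w % subblock_w == 0, "subblock_w must evenly divide block_w"
    return block_h, block_w

# Matches the sharded example below: shard_shape = [2 * 384, 768], subblock_w = 8
print(block_dims_from_shard(2 * 384, 768, 8))  # -> (24, 24)
```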

Example

import ttnn
from loguru import logger

# Open a device (assumes a single available device with id 0)
device = ttnn.open_device(device_id=0)

# Setup input tensor and mask
input_shape = (1, 1, 32, 32)

attention_mask_t = ttnn.rand(input_shape, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)
input_tensor = ttnn.rand(input_shape, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)

# Apply in-place scale mask softmax
tt_output = ttnn.scale_mask_softmax_in_place(
    input_tensor=input_tensor,
    scale=1.0,
    mask=attention_mask_t,
)
logger.info(f"Scale Mask Softmax In Place result: {tt_output}")

compute_grid_size = device.compute_with_storage_grid_size()
fuse_head = 2
batch = compute_grid_size.x
num_cores_r = compute_grid_size.y

input_shape = (batch, num_cores_r, fuse_head * 384, 768)

attention_mask_t = ttnn.rand((batch, 1, 384, 768), dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)

input_tensor = ttnn.rand(input_shape, dtype=ttnn.bfloat8_b, layout=ttnn.TILE_LAYOUT, device=device)

# Shard the input tensor
grid_coord = ttnn.CoreCoord(compute_grid_size.x - 1, compute_grid_size.y - 1)
shard_grid = ttnn.CoreRangeSet({ttnn.CoreRange(ttnn.CoreCoord(0, 0), grid_coord)})
shard_shape = [fuse_head * 384, 768]
shard_spec = ttnn.ShardSpec(shard_grid, shard_shape, ttnn.ShardOrientation.ROW_MAJOR)
sharded_mem_config = ttnn.MemoryConfig(ttnn.TensorMemoryLayout.HEIGHT_SHARDED, ttnn.BufferType.L1, shard_spec)

input_sharded = ttnn.to_memory_config(input_tensor, sharded_mem_config)

# Create sharded program config
program_config = ttnn.SoftmaxShardedMultiCoreProgramConfig(
    compute_with_storage_grid_size=compute_grid_size,
    subblock_w=8,
    block_h=12 * fuse_head,
    block_w=24,
)

tt_output = ttnn.scale_mask_softmax_in_place(
    input_tensor=input_sharded,
    scale=1.0,
    mask=attention_mask_t,
    program_config=program_config,
)
logger.info(f"Scale Mask Softmax In Place result: {tt_output}")
property block_w

(self) -> int

Width of blocks for processing, in tiles. Can be modified after creation.
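
Because block_w is exposed as a writable property, an existing config can be retuned without reconstructing it, for example when sweeping block widths for performance. A sketch (fragment; assumes ttnn is imported, a compute_grid_size is in hand, and the new value still suits the tensor's shard spec):

```python
# Construct a config, then retune block_w afterwards (illustrative values)
program_config = ttnn.SoftmaxShardedMultiCoreProgramConfig(
    compute_with_storage_grid_size=compute_grid_size,
    subblock_w=8,
    block_h=24,
    block_w=24,
)
# block_w is a writable property; keep it a multiple of subblock_w
program_config.block_w = 16
```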