ttnn.MatmulMultiCoreReuseMultiCast1DProgramConfig

class ttnn.MatmulMultiCoreReuseMultiCast1DProgramConfig

Bases: pybind11_object

Configuration class for 1D multicast matmul operations with advanced features.

This program config is intended for width- or height-sharded tensors, or for very narrow interleaved tensors.
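As a sketch, a config for a width-sharded 1D matmul might be constructed as below. The parameter values are illustrative only; a valid configuration depends on the tensor shapes, tile size, and the device's core grid, and `input_tensor_a` / `input_tensor_b` stand in for pre-allocated device tensors.

```python
import ttnn

# Illustrative values; not a definitive configuration.
program_config = ttnn.MatmulMultiCoreReuseMultiCast1DProgramConfig(
    compute_with_storage_grid_size=(8, 4),  # 2D grid of compute cores
    in0_block_w=4,           # K-dimension block width, in tiles
    out_subblock_h=1,        # output subblock height, in tiles
    out_subblock_w=2,        # output subblock width, in tiles
    per_core_M=1,            # output tiles per core along M
    per_core_N=2,            # output tiles per core along N
    fuse_batch=True,         # fold batch dims into the matrix dims
    fused_activation=None,   # optionally fuse an activation into the matmul
    mcast_in0=True,          # broadcast input_tensor_a across cores
)

# input_tensor_a and input_tensor_b are assumed to be device tensors.
output = ttnn.matmul(input_tensor_a, input_tensor_b,
                     program_config=program_config)
```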

property compute_with_storage_grid_size

Grid size for compute cores with storage capability.

Defines the 2D grid of cores that will be used for computation. In 1D multicast, this grid is used to determine the communication pattern for broadcasting data along one dimension while distributing computation.

from_json(self: str) -> ttnn._ttnn.operations.matmul.MatmulMultiCoreReuseMultiCast1DProgramConfig
property fuse_batch

Whether to fuse batch dimensions into matrix dimensions.

When true, batch dimensions are incorporated into the matrix computation, allowing for more efficient processing of batched operations in the 1D multicast implementation.

property fused_activation

Optional fused activation function to apply during computation.

If specified, the activation function is applied directly during the matmul operation, eliminating the need for a separate activation pass and improving overall performance in 1D multicast scenarios.

property gather_in0

Defaults to false. Used by ops that call matmul internally; for all other uses, leave it unspecified or at the default value.

property hop_cores

Defaults to an empty set. Used by ops that call matmul internally; for all other uses, leave it unspecified or at the default value.

property in0_block_w

Block width for both input tensors along the K dimension (shared inner dimension).

Determines the data granularity by specifying how many tiles wide each block is along the inner dimension for both input_tensor_a and input_tensor_b. This parameter impacts 1D multicast performance as it affects the size of data chunks that are broadcast across cores and memory access patterns for both tensors.
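To illustrate the tile arithmetic (assuming the standard 32x32 tile size; the values below are hypothetical), `in0_block_w` splits the K dimension into equally sized blocks, so it must evenly divide the K extent in tiles:

```python
TILE = 32  # tile edge length in elements (standard ttnn tile size)

K = 4096                 # shared inner dimension, in elements (illustrative)
Kt = K // TILE           # K in tiles: 128
in0_block_w = 4          # block width along K, in tiles

# in0_block_w must evenly divide Kt; each core iterates over these blocks.
assert Kt % in0_block_w == 0
num_k_blocks = Kt // in0_block_w   # 32 blocks along K
```

Larger blocks mean fewer, bigger multicast transfers but more L1 memory per core; smaller blocks do the opposite.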

property mcast_in0

Whether to multicast the first input tensor (input_tensor_a).

When true, input_tensor_a is broadcast across cores using the 1D multicast pattern, which can significantly reduce memory bandwidth requirements for certain matrix shapes and improve performance.

property num_global_cb_receivers

Defaults to 1. Used by ops that call matmul internally; for all other uses, leave it unspecified or at the default value.

property out_block_h

Height of output blocks in tiles.

Defines the output block size along the M dimension. If not specified, defaults to per_core_M. This parameter is important for optimizing the 1D multicast pattern and memory access efficiency.

property out_block_w

Width of output blocks in tiles.

Defines the output block size along the N dimension. If not specified, defaults to per_core_N. This affects the efficiency of data distribution in the 1D multicast implementation.

property out_subblock_h

Height of output subblocks in tiles.

Controls computation granularity within output blocks along the M dimension. In 1D multicast, this affects how computation is scheduled and memory usage patterns across the participating cores.

property out_subblock_w

Width of output subblocks in tiles.

Controls computation granularity within output blocks along the N dimension. This parameter affects the efficiency of the 1D multicast communication pattern and compute scheduling.

property per_core_M

Number of output tiles each core processes along the M dimension.

Determines the workload distribution along the M dimension in the 1D multicast pattern. This affects both load balancing and communication efficiency.

property per_core_N

Number of output tiles each core processes along the N dimension.

Determines the workload distribution along the N dimension in the 1D multicast pattern. This parameter is crucial for achieving optimal performance in 1D multicast scenarios.
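A small sketch of how `per_core_M` and `per_core_N` relate to the output shape in a width-split 1D layout (all numbers are illustrative assumptions, not requirements of the API):

```python
TILE = 32  # tile edge length in elements

M, N = 32, 4096                  # output shape in elements (illustrative)
Mt, Nt = M // TILE, N // TILE    # output shape in tiles: 1 x 128
num_cores = 64                   # hypothetical number of participating cores

# 1D split along N: each core keeps the full M extent and a slice of N.
per_core_M = Mt                  # 1 tile along M per core
per_core_N = Nt // num_cores     # 2 tiles along N per core

# The per-core work tiled across all cores must cover the whole output.
assert per_core_M * (per_core_N * num_cores) == Mt * Nt
```

The subblock parameters then subdivide this per-core region, so `out_subblock_h` and `out_subblock_w` are typically chosen to divide `per_core_M` and `per_core_N` respectively.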

to_json(self: ttnn._ttnn.operations.matmul.MatmulMultiCoreReuseMultiCast1DProgramConfig) -> str
property untilize_out

Whether to untilize the output tensor.

When true, the output is converted from tiled layout to row-major layout during the operation. This can be useful when the subsequent operation expects row-major data and can eliminate a separate untilization pass. Defaults to false.