ttnn.all_reduce

ttnn.all_reduce = Operation(python_fully_qualified_name='ttnn.all_reduce', function=<ttnn._ttnn.operations.ccl.all_reduce_t object>, preprocess_golden_function_inputs=<function default_preprocess_golden_function_inputs>, golden_function=None, postprocess_golden_function_outputs=<function default_postprocess_golden_function_outputs>, is_cpp_operation=True, is_experimental=False)

all_reduce(input_tensor: ttnn.Tensor, cluster_axis: Optional[int] = None, subdevice_id: Optional[ttnn.SubDeviceId] = None, memory_config: Optional[ttnn.MemoryConfig] = None, num_links: Optional[int] = None, topology: Optional[ttnn.Topology] = None) -> ttnn.Tensor

All-reduce is a collective operation that reduces data from all devices with a Sum reduction and returns the result to every device. If cluster_axis is specified, the all-reduce is performed only along that axis of the mesh device.
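Conceptually, the output on every device is the element-wise sum of the tensors held by the participating devices. The following is a minimal host-side reference sketch of that behavior in plain PyTorch; all_reduce_reference is a hypothetical helper used only for illustration, not part of ttnn:

import torch

def all_reduce_reference(per_device_tensors):
    # Element-wise sum of every device's tensor; each device
    # receives an identical copy of the summed result.
    total = torch.stack(per_device_tensors).sum(dim=0)
    return [total.clone() for _ in per_device_tensors]

# Example: 8 devices, each holding a [1, 1, 32, 32] shard.
shards = [torch.randn(1, 1, 32, 32, dtype=torch.bfloat16) for _ in range(8)]
outputs = all_reduce_reference(shards)
assert all(torch.equal(outputs[0], out) for out in outputs)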

Args:

input_tensor (ttnn.Tensor): Input tensor to be reduced.

Keyword Args:

cluster_axis (int, optional): The axis of the mesh device to reduce across; if provided, the all-reduce runs only along this axis (see the second example below). Defaults to None.

subdevice_id (ttnn.SubDeviceId, optional): Subdevice id for worker cores.

memory_config (ttnn.MemoryConfig, optional): Output memory configuration.

num_links (int, optional): Number of links to use for the all_reduce operation. Defaults to None.

topology (ttnn.Topology, optional): Fabric topology. Defaults to None.

Returns:

ttnn.Tensor – The reduced tensor with the same shape as the input tensor.

Example:
>>> import torch
>>> import ttnn
>>> full_tensor = torch.randn([1, 1, 32, 256], dtype=torch.bfloat16)
>>> mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, 8))
>>> ttnn_tensor = ttnn.from_torch(
                full_tensor,
                dtype=ttnn.bfloat16,
                device=mesh_device,
                layout=ttnn.TILE_LAYOUT,
                memory_config=ttnn.DRAM_MEMORY_CONFIG,
                mesh_mapper=ttnn.ShardTensor2dMesh(mesh_device, mesh_shape=(1, 8), dims=(-1, -2)))
>>> output = ttnn.all_reduce(ttnn_tensor)
>>> print(output.shape)
[1, 1, 32, 256]
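For a mesh with more than one axis, cluster_axis restricts the reduction to a single mesh dimension. A hedged sketch follows, assuming a 2x4 mesh is available; the specific keyword values shown (cluster_axis=1, num_links=1, topology=ttnn.Topology.Linear) are illustrative choices, not requirements of the op:

>>> mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(2, 4))
>>> ttnn_tensor = ttnn.from_torch(
                torch.randn([1, 1, 128, 512], dtype=torch.bfloat16),
                dtype=ttnn.bfloat16,
                device=mesh_device,
                layout=ttnn.TILE_LAYOUT,
                memory_config=ttnn.DRAM_MEMORY_CONFIG,
                mesh_mapper=ttnn.ShardTensor2dMesh(mesh_device, mesh_shape=(2, 4), dims=(-1, -2)))
>>> # Reduce only across the second mesh axis (the 4-device rows).
>>> output = ttnn.all_reduce(
                ttnn_tensor,
                cluster_axis=1,
                num_links=1,
                topology=ttnn.Topology.Linear)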
Type:

Operation