ttnn.experimental.all_reduce

ttnn.experimental.all_reduce = Operation(python_fully_qualified_name='ttnn.experimental.all_reduce', function=<ttnn._ttnn.operations.experimental.ccl_experimental.all_reduce_t object>, preprocess_golden_function_inputs=<function default_preprocess_golden_function_inputs>, golden_function=None, postprocess_golden_function_outputs=<function default_postprocess_golden_function_outputs>, is_cpp_operation=True, is_experimental=False)

Performs an all_reduce operation on the multi-device input_tensor across all devices.

Parameters:

input_tensor (ttnn.Tensor) – multi-device tensor

Keyword Arguments:
  • num_links (int, optional) – Number of links to use for the all-reduce operation. Defaults to 1.

  • memory_config (ttnn.MemoryConfig, optional) – Memory configuration for the operation. Defaults to the input tensor's memory config.

  • num_workers (int, optional) – Number of workers to use for the operation. Defaults to None.

  • num_buffers_per_channel (int, optional) – Number of buffers per channel to use for the operation. Defaults to None.

  • topology (ttnn.Topology, optional) – The topology configuration to run the operation in. Valid options are Ring and Linear. Defaults to ttnn.Topology.Ring.
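The topology option names the communication pattern between devices. As a rough illustration of what a ring schedule involves, here is a textbook ring all-reduce (reduce-scatter followed by all-gather) simulated on plain Python lists. It is a sketch under assumptions: a sum reduction, a hypothetical function name `ring_all_reduce`, and no claim that ttnn implements the operation this way.

```python
def ring_all_reduce(per_device):
    """Simulate a textbook ring all-reduce (sum) on host-side lists.

    Hypothetical helper for illustration only; it does not call ttnn.
    per_device[i] is the flat buffer held by device i.
    """
    n = len(per_device)
    length = len(per_device[0])
    assert length % n == 0, "buffer length must divide evenly into n chunks"
    chunk = length // n
    bufs = [list(v) for v in per_device]

    def span(c):
        # Slice covering chunk index c of a buffer.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n-1 steps, device i holds the fully
    # summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing data so all sends in a step happen "at once".
        sends = [((i - step) % n, bufs[i][span((i - step) % n)]) for i in range(n)]
        for i, (c, data) in enumerate(sends):
            dst = (i + 1) % n  # right-hand neighbour on the ring
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], data)]

    # Phase 2: all-gather. Each fully summed chunk travels around the ring
    # and overwrites the partial copy on every other device.
    for step in range(n - 1):
        sends = [((i + 1 - step) % n, bufs[i][span((i + 1 - step) % n)]) for i in range(n)]
        for i, (c, data) in enumerate(sends):
            dst = (i + 1) % n
            bufs[dst][span(c)] = data

    return bufs


per_device = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
result = ring_all_reduce(per_device)
expected = [sum(col) for col in zip(*per_device)]  # plain elementwise sum
```

In this simulation every device ends with the same buffer, equal to the elementwise sum of the inputs. A linear arrangement differs in that the two end devices are not connected to each other, so data cannot wrap around.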

Returns:

ttnn.Tensor – the output tensor.

Example

>>> import torch
>>> import ttnn
>>> full_tensor = torch.randn([1, 1, 256, 256], dtype=torch.bfloat16)
>>> mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(1, 8))
>>> # Shard the last dimension across the devices of the mesh.
>>> input_tensor = ttnn.from_torch(
...     full_tensor,
...     mesh_mapper=ttnn.ShardTensorToMesh(mesh_device, dim=3),
... )
>>> output = ttnn.experimental.all_reduce(input_tensor, topology=ttnn.Topology.Linear)
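When checking a result on the host, it can help to have a reference for what an all-reduce computes. The following is a minimal pure-Python sketch, assuming a sum reduction (the reduction operator is not stated on this page) and using a hypothetical helper name `reference_all_reduce`; it does not call ttnn.

```python
def reference_all_reduce(per_device_tensors):
    """Host-side reference: elementwise sum across devices.

    Hypothetical helper for illustration. Each entry of per_device_tensors
    is the flat shard held by one device; every device receives the same
    reduced result.
    """
    total = [sum(vals) for vals in zip(*per_device_tensors)]
    return [list(total) for _ in per_device_tensors]


# A row of 8 values sharded along its last dimension across 4 devices,
# loosely mirroring the sharding in the example above.
full_row = [1, 2, 3, 4, 5, 6, 7, 8]
shards = [full_row[i:i + 2] for i in range(0, len(full_row), 2)]
outputs = reference_all_reduce(shards)
```

Here each of the four simulated devices ends up holding `[16, 20]`, the elementwise sum of the shards `[1, 2]`, `[3, 4]`, `[5, 6]`, and `[7, 8]`.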