Matmul (Multi Core)
We’ll build a program that performs matmul on two tensors with matching inner dimensions, using as many cores as possible on the accelerator chip. This example builds on top of the previous single core example.
In terms of API usage, there isn’t much change. We will discuss the specific changes:
- Using different kernels, each with its own runtime arguments.
- Looping over all of the cores we will use when programming them.
All of the important differences in how we use the API are in the new matmul_multi_core function.
The full example program is in
tt_metal/programming_examples/matmul_multi_core/matmul_multi_core.cpp
To build and execute, you may use the following commands. Note that we set the necessary environment variables here, but you may need additional ones depending on the current installation instructions.
export ARCH_NAME=<arch name>
export TT_METAL_HOME=<this repo dir>
./build_metal.sh
./build/programming_examples/matmul_multi_core
Accessing all the cores
We first need information about the layout of the entire chip, namely the grid of cores and its x/y dimensions in number of cores.
auto compute_with_storage_grid_size = device->compute_with_storage_grid_size();
uint32_t num_cores_x = compute_with_storage_grid_size.x;
uint32_t num_cores_y = compute_with_storage_grid_size.y;
Splitting the work across cores
We need to split the work across multiple cores, which means figuring out how many output tiles each core will compute.
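The total amount of work is simply the number of output tiles. A minimal sketch, assuming Mt and Nt are the output dimensions in tiles and B is the batch count used later in this example:
// Every output tile is an independent unit of work to hand out to the cores.
uint32_t num_output_tiles_total = B * Mt * Nt;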
We use a helper function that the Metal team developed to do this. Given the number of tiles of the multiplication and the entire grid upon which we want to execute, we receive back:
- The total number of cores, which is the product of the grid dimensions
- The CoreRangeSet of all cores
- The first CoreRangeSet of cores, which will each perform a certain count of tile operations, A
- The second CoreRangeSet of cores, which will each perform another count of tile operations, B, if applicable
- A as uint32_t
- B as uint32_t, if applicable
auto [num_cores, all_cores, core_group_1, core_group_2, num_output_tiles_per_core_group_1, num_output_tiles_per_core_group_2] = split_work_to_cores(compute_with_storage_grid_size, num_output_tiles_total);
We may get two separate sets of cores and tile counts because, depending on the grid size, it may not be possible to distribute the tiles evenly over the full set of cores. In that case, a second set of cores handles the “spill-over” computations.
This means we need to be careful when programming each set of cores. We also need to account for the case where the work divides evenly, meaning the second set will be empty.
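As an illustration of the spill-over split (this is just the arithmetic the split has to satisfy, not the actual split_work_to_cores implementation), consider 70 output tiles on a hypothetical 8x8 grid of 64 cores:
// Hypothetical numbers, for illustration only.
uint32_t total_tiles = 70;
uint32_t total_cores = 64;
uint32_t tiles_per_core_group_2 = total_tiles / total_cores;   // 1 tile per core
uint32_t remainder = total_tiles % total_cores;                // 6 tiles left over
uint32_t tiles_per_core_group_1 = tiles_per_core_group_2 + 1;  // 2 tiles per core
// Group 1: 6 cores x 2 tiles; group 2: 58 cores x 1 tile -> 6*2 + 58*1 = 70 tiles.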
Using different kernels for reader/writer
We use more complex reader/writer kernels to feed our compute engine.
These kernels use banking and interleaving techniques to keep data flowing consistently and efficiently. We must ensure that each core receives, and computes on, exactly its own set of tiles, no more and no less.
auto reader_id = tt_metal::CreateKernel(
    program,
    "tt_metal/programming_examples/matmul_common/kernels/dataflow/reader_bmm_8bank_output_tiles_partitioned.cpp",
    all_cores,
    tt_metal::DataMovementConfig{.processor = DataMovementProcessor::RISCV_1, .noc = NOC::RISCV_1_default, .compile_args = reader_compile_time_args});

auto writer_id = tt_metal::CreateKernel(
    program,
    "tt_metal/programming_examples/matmul_common/kernels/dataflow/writer_unary_interleaved_start_id.cpp",
    all_cores,
    tt_metal::DataMovementConfig{.processor = DataMovementProcessor::RISCV_0, .noc = NOC::RISCV_0_default, .compile_args = writer_compile_time_args});
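The reader_compile_time_args and writer_compile_time_args passed above tell the kernels where their buffers live. A minimal sketch of how they might be built, assuming DRAM-interleaved source/destination buffers and an output circular buffer index as in the single core example (these variable names are assumptions, not quoted from the example):
// Whether each buffer lives in DRAM (vs. L1) is passed as a compile-time arg so
// the kernels can construct the right interleaved address generators.
bool src0_is_dram = src0_dram_buffer->buffer_type() == tt_metal::BufferType::DRAM;
bool src1_is_dram = src1_dram_buffer->buffer_type() == tt_metal::BufferType::DRAM;
std::vector<uint32_t> reader_compile_time_args = {(uint32_t)src0_is_dram, (uint32_t)src1_is_dram};

bool dst_is_dram = dst_dram_buffer->buffer_type() == tt_metal::BufferType::DRAM;
std::vector<uint32_t> writer_compile_time_args = {(uint32_t)output_cb_index, (uint32_t)dst_is_dram};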
Compute kernel args
We need to account for the fact that we may have two separate groups of cores that require different tile-count arguments.
vector<uint32_t> compute_args_group_1 = {
    1, // B
    1, // Mt
    Kt, // Kt
    num_output_tiles_per_core_group_1 // Nt
}; // in the bmm compute kernel, B, Mt, and Nt are just 3 for-loops that together act as 1 large loop, so for simplicity we only set Nt

auto matmul_multi_core_kernel_group_1_id = tt_metal::CreateKernel(
    program,
    "tt_metal/programming_examples/matmul_common/kernels/compute/bmm.cpp",
    core_group_1,
    tt_metal::ComputeConfig{.math_fidelity = math_fidelity, .compile_args = compute_args_group_1}
);

if (!core_group_2.ranges().empty()) {
    vector<uint32_t> compute_args_group_2 = {
        1, // B
        1, // Mt
        Kt, // Kt
        num_output_tiles_per_core_group_2 // Nt
    }; // same loop structure as group 1; only Nt (the output tile count) differs

    auto matmul_multi_core_kernel_group_2_id = tt_metal::CreateKernel(
        program,
        "tt_metal/programming_examples/matmul_common/kernels/compute/bmm.cpp",
        core_group_2,
        tt_metal::ComputeConfig{.math_fidelity = math_fidelity, .compile_args = compute_args_group_2}
    );
}
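To see why only Nt needs to change between the two groups, here is a simplified sketch of the loop structure inside the bmm compute kernel. This is a schematic based on the compute-kernel API, not the verbatim bmm.cpp source; the circular-buffer indices and the destination-register acquisition calls are assumptions and may differ from the shipped kernel.
#include "compute_kernel_api/matmul.h"

namespace NAMESPACE {
void MAIN {
    // Compile-time args set on the host: {B, Mt, Kt, Nt}.
    uint32_t B  = get_compile_time_arg_val(0);
    uint32_t Mt = get_compile_time_arg_val(1);
    uint32_t Kt = get_compile_time_arg_val(2);
    uint32_t Nt = get_compile_time_arg_val(3);

    constexpr uint32_t cb_in0 = 0, cb_in1 = 1, cb_out = 16;  // assumed CB indices
    mm_init(cb_in0, cb_in1, cb_out);

    // B, Mt, and Nt together form one flat loop over output tiles (B * Mt * Nt
    // iterations). With B = 1 and Mt = 1, the kernel simply produces Nt tiles.
    for (uint32_t b = 0; b < B; ++b)
    for (uint32_t mt = 0; mt < Mt; ++mt)
    for (uint32_t nt = 0; nt < Nt; ++nt) {
        acquire_dst();
        for (uint32_t kt = 0; kt < Kt; ++kt) {
            cb_wait_front(cb_in0, 1);                      // one tile of in0 from the reader
            cb_wait_front(cb_in1, 1);                      // one tile of in1 from the reader
            matmul_tiles(cb_in0, cb_in1, 0, 0, 0, false);  // accumulate into dst register 0
            cb_pop_front(cb_in0, 1);
            cb_pop_front(cb_in1, 1);
        }
        cb_reserve_back(cb_out, 1);
        pack_tile(0, cb_out);     // pack the finished output tile
        cb_push_back(cb_out, 1);  // hand it off to the writer kernel
        release_dst();
    }
}
}  // namespace NAMESPACE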
Reader/writer kernel runtime args
Here we introduce the pattern of looping over all the cores to apply an API call to each one.
In this case, we must set the runtime arguments for the reader/writer kernels. Note that we must account for the split core groups and use the appropriate tile count when assigning arguments.
for (uint32_t i = 0, num_tiles_written = 0; i < num_cores; i++) {
    CoreCoord core = {i / num_cores_y, i % num_cores_y};

    uint32_t num_output_tiles_per_core;
    if (core_group_1.core_coord_in_core_ranges(core)) {
        num_output_tiles_per_core = num_output_tiles_per_core_group_1;
    } else if (core_group_2.core_coord_in_core_ranges(core)) {
        num_output_tiles_per_core = num_output_tiles_per_core_group_2;
    } else {
        TT_ASSERT(false, "Core not in specified core ranges");
    }

    tt_metal::SetRuntimeArgs(
        program, reader_id, core,
        {src0_addr,
         src1_addr,
         Mt,
         Kt,
         Nt,
         MtKt,
         KtNt,
         B,
         uint32_t(bcast_batch),
         num_tiles_written,
         num_output_tiles_per_core,
         MtNt}
    );

    tt_metal::SetRuntimeArgs(
        program,
        writer_id,
        core,
        {dst_addr,
         num_output_tiles_per_core,
         num_tiles_written}
    );

    num_tiles_written += num_output_tiles_per_core;
}
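With runtime arguments set for every core, launching the program works the same way as in the single core example. A minimal sketch, assuming a command queue cq, input vectors a and b, an output vector, and the DRAM buffers created earlier (these names are assumptions, not quoted from the example):
// Upload the inputs, run the multi-core program, and read back the result.
CommandQueue& cq = device->command_queue();
EnqueueWriteBuffer(cq, src0_dram_buffer, a.data(), /*blocking=*/false);
EnqueueWriteBuffer(cq, src1_dram_buffer, b.data(), /*blocking=*/false);
EnqueueProgram(cq, program, /*blocking=*/false);
EnqueueReadBuffer(cq, dst_dram_buffer, output.data(), /*blocking=*/true);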
Conclusion
Those are all the major changes needed to upgrade our single core matmul example into one that uses as many cores as possible. To see a more complicated example that reuses data among these cores, please refer to the Matmul multi-core data reuse example.