Data Reuse in matmul_multicore_reuse
Fine-Grained Block Size Control
Advanced matrix dimension controls are found in the Programming Example’s matmul_common directory, namely Block Matrix Multiply Ops (bmm_op.hpp). Including this header gives us dynamic helpers for defining and retrieving matrix parameters. Our matmul kernels work out of the box on row-major and tile-major layouts, so you can define your own outer-dimension tile sizes, desired core grid dimensions, and input block width, depending on the problem at hand.
In our reuse example, we can call the get_large_matmul_params(...) function with the inputs described above. By doing so, we let Metalium’s bmm op utility functions do the mathematical heavy lifting and compute our matmul’s exact work-per-core size and output size. (Consult the header for the prime-factorization method used, plus many other details.)
auto matmul_params = bmm_op_utils::get_large_matmul_params(Mt, Nt, num_cores_y, num_cores_x, in0_block_w);
uint32_t per_core_M = std::get<0>(matmul_params);
uint32_t per_core_N = std::get<1>(matmul_params);
uint32_t out_subblock_h = std::get<2>(matmul_params);
uint32_t out_subblock_w = std::get<3>(matmul_params);
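For context, the tile counts Mt and Nt passed to get_large_matmul_params(...) are simply the matrix dimensions divided by the 32x32 tile size. A minimal sketch of that setup, assuming element-space dimensions M, N, and K that are multiples of the tile size (the variable names and the in0_block_w choice here are illustrative, not necessarily those used in the example):

// Illustrative sizing of the tile-count inputs to get_large_matmul_params(...).
constexpr uint32_t TILE_HEIGHT = 32;
constexpr uint32_t TILE_WIDTH = 32;
uint32_t Mt = M / TILE_HEIGHT;  // rows of the output, in tiles
uint32_t Kt = K / TILE_WIDTH;   // inner dimension, in tiles
uint32_t Nt = N / TILE_WIDTH;   // columns of the output, in tiles
uint32_t in0_block_w = 2;       // inner-dimension block width, in tiles (example choice)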
Take note of the example’s use of “subblocks” above. Recall that until now, we have optimized matmul by dividing matrices into blocks and subdividing those blocks into tiles, which are laid out neatly on our compute cores. A key optimization in matmul_multicore_reuse is the introduction of an intermediate subdivision of blocks, called subblocks. Below are some optimal subblock layouts already provided for you in the header, which run efficiently on our hardware.
constexpr std::array<std::tuple<uint32_t, uint32_t>, 20> SUBBLOCK_HW_CHOICES = {{
    {4, 2}, {2, 4}, {8, 1}, {1, 8},
    {7, 1}, {1, 7},
    {3, 2}, {2, 3}, {6, 1}, {1, 6},
    {5, 1}, {1, 5},
    {2, 2}, {4, 1}, {1, 4},
    {3, 1}, {1, 3},
    {2, 1}, {1, 2},
    {1, 1},
}};
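As an illustration of how such a table could be consumed (a simplified sketch, not the exact selection logic in bmm_op.hpp), one can walk the list and take the first shape that evenly divides the per-core output block. The helper name pick_subblock_hw below is hypothetical:

// Hypothetical helper: pick the first subblock shape whose height divides
// per_core_M and whose width divides per_core_N.
std::tuple<uint32_t, uint32_t> pick_subblock_hw(uint32_t per_core_M, uint32_t per_core_N) {
    for (const auto& choice : SUBBLOCK_HW_CHOICES) {
        uint32_t h = std::get<0>(choice);
        uint32_t w = std::get<1>(choice);
        if (per_core_M % h == 0 && per_core_N % w == 0) {
            return {h, w};
        }
    }
    return {1, 1};  // {1, 1} always divides evenly
}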
Intermediate Circular Buffer Configuration
In addition to our double-buffer config, we introduce a third circular buffer, denoted interm0_cb_index. Out of the 32 circular buffers provided by the API (which you can view in the $TT_METAL_HOME/tt_metal/hostdevcommon/kernel_structs.h file), this one belongs to the subset reserved for intermediate CBs. This buffer acts as temporary storage for the intermediate results of the matrix multiplication before they are combined into the final output.
uint32_t output_cb_index = CBIndex::c_16;
uint32_t interm0_cb_index = CBIndex::c_24; // Index for the intermediate circular buffer
std::map<uint8_t, tt::DataFormat> output_cb_data_format_spec {
    {output_cb_index, cb_data_format},   // Output buffer configuration
    {interm0_cb_index, cb_data_format}   // Intermediate buffer configuration
};
CircularBufferConfig cb_output_config = CircularBufferConfig(out_CB_size, output_cb_data_format_spec)
    .set_page_size(output_cb_index, single_tile_size)
    .set_page_size(interm0_cb_index, single_tile_size);
auto cb_output = tt_metal::v0::CreateCircularBuffer(program, all_cores, cb_output_config);
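For comparison, the double-buffered input CBs referenced above follow the same configuration pattern. A sketch, assuming each input block is double-buffered (twice its tile count); the size variable in0_CB_size here is an assumption for illustration:

// Illustrative input CB setup: space for two in0 blocks so the reader can
// fill one block while the compute kernel consumes the other.
uint32_t src0_cb_index = CBIndex::c_0;
uint32_t in0_CB_size = 2 * in0_block_num_tiles * single_tile_size;  // double buffer
CircularBufferConfig cb_src0_config = CircularBufferConfig(in0_CB_size, {{src0_cb_index, cb_data_format}})
    .set_page_size(src0_cb_index, single_tile_size);
auto cb_src0 = tt_metal::v0::CreateCircularBuffer(program, all_cores, cb_src0_config);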
Stride Kernel Arguments
The runtime arguments for the reader, writer, and compute kernels are set up to employ data reuse through the intermediate circular buffer. This setup aligns with the execution model of the bmm_tile_layout reader and writer kernels and the bmm_large_block_zm.cpp compute kernel.
/*
* Create Kernels (Reader, Writer, Compute)
*/
// Create reader and writer kernels per core
auto reader_id = tt_metal::CreateKernel(
    program,
    "tt_metal/programming_examples/matmul_common/kernels/dataflow/reader_bmm_tile_layout.cpp",
    all_cores,
    tt_metal::DataMovementConfig{.processor = DataMovementProcessor::RISCV_1, .noc = NOC::RISCV_1_default, .compile_args = reader_compile_time_args});

auto writer_id = tt_metal::CreateKernel(
    program,
    "tt_metal/programming_examples/matmul_common/kernels/dataflow/writer_bmm_tile_layout.cpp",
    all_cores,
    tt_metal::DataMovementConfig{.processor = DataMovementProcessor::RISCV_0, .noc = NOC::RISCV_0_default, .compile_args = writer_compile_time_args});

// Create compute kernel
auto mm_kernel_id = tt_metal::CreateKernel(
    program,
    "tt_metal/programming_examples/matmul_common/kernels/compute/bmm_large_block_zm.cpp",
    all_cores,
    tt_metal::ComputeConfig{.math_fidelity = math_fidelity, .compile_args = compute_kernel_args}
);
Recall our compile-time kernel compute args:
vector<uint32_t> compute_kernel_args = {
    in0_block_w,             // in0_block_w
    in0_num_subblocks,       // in0_num_subblocks
    in0_block_num_tiles,     // in0_block_num_tiles
    in0_subblock_num_tiles,  // in0_subblock_num_tiles
    in1_num_subblocks,       // in1_num_subblocks
    in1_block_num_tiles,     // in1_block_num_tiles
    in1_per_core_w,          // in1_per_core_w
    num_blocks,              // num_blocks
    out_subblock_h,          // out_subblock_h
    out_subblock_w,          // out_subblock_w
    out_subblock_num_tiles,  // out_subblock_num_tiles
    B                        // batch
};
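These values follow directly from the block and subblock geometry computed earlier. A sketch of the arithmetic, consistent with the parameter names above (the exact host-side code may differ):

// How the compile-time args relate to the block/subblock geometry.
uint32_t in0_num_subblocks = per_core_M / out_subblock_h;           // subblock rows per in0 block
uint32_t in0_block_num_tiles = in0_block_w * per_core_M;            // tiles per in0 block
uint32_t in0_subblock_num_tiles = out_subblock_h * in0_block_w;     // tiles per in0 subblock
uint32_t in1_num_subblocks = per_core_N / out_subblock_w;           // subblock columns per in1 block
uint32_t in1_block_num_tiles = per_core_N * in0_block_w;            // tiles per in1 block
uint32_t in1_per_core_w = per_core_N;                               // in1 block width per core, in tiles
uint32_t num_blocks = Kt / in0_block_w;                             // blocks along the inner dimension
uint32_t out_subblock_num_tiles = out_subblock_h * out_subblock_w;  // tiles per output subblock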
To properly run the reader and writer kernels, we must set up the runtime arguments with this information. For each block of the in0 and in1 matrices, we read the tiles belonging to a given subblock from DRAM into that core’s L1, and bmm_large_block_zm operates on those tiles using stride arguments. Recall that each tile is a member of a certain subblock, and subblocks are distributed across the cores of the core grid (specifically, into each core’s L1). The writer kernel then stores the partial matmul results into the corresponding output subblock.
Reader:
std::vector<uint32_t> mm_reader_args = {
    (std::uint32_t) src0_dram_buffer->address(),     // in0_tensor_addr
    (std::uint32_t) Kt * per_core_M * output_idx_y,  // in0_tensor_start_tile_id
    (std::uint32_t) 1,                               // in0_tensor_stride_w
    (std::uint32_t) Kt,                              // in0_tensor_stride_h
    (std::uint32_t) in0_block_w,                     // in0_tensor_next_block_stride
    (std::uint32_t) in0_block_w,                     // in0_block_w
    (std::uint32_t) per_core_M,                      // in0_block_h
    (std::uint32_t) in0_block_w * per_core_M,        // in0_block_num_tiles
    (std::uint32_t) src1_dram_buffer->address(),     // in1_tensor_addr
    (std::uint32_t) per_core_N * output_idx_x,       // in1_tensor_start_tile_id
    (std::uint32_t) 1,                               // in1_tensor_stride_w
    (std::uint32_t) Nt,                              // in1_tensor_stride_h
    (std::uint32_t) in0_block_w * Nt,                // in1_tensor_next_block_stride
    (std::uint32_t) per_core_N,                      // in1_block_w
    (std::uint32_t) in0_block_w,                     // in1_block_h
    (std::uint32_t) per_core_N * in0_block_w,        // in1_block_num_tiles
    (std::uint32_t) Kt / in0_block_w,                // num_blocks
    (std::uint32_t) Mt * Kt,                         // MtKt
    (std::uint32_t) Kt * Nt,                         // KtNt
    (std::uint32_t) B,                               // batch
    (std::uint32_t) bcast_batch                      // bcast_B
};
Writer:
std::vector<uint32_t> writer_args = {
    (std::uint32_t) dst_dram_buffer->address(),                                  // out_buffer_addr
    (std::uint32_t) output_idx_x * per_core_N + output_idx_y * per_core_M * Nt,  // out_tensor_start_tile_id
    (std::uint32_t) 1,                                                           // out_tensor_stride_w
    (std::uint32_t) Nt,                                                          // out_tensor_stride_h
    (std::uint32_t) out_subblock_w,                                              // out_tensor_next_subblock_stride_w
    (std::uint32_t) out_subblock_h * Nt,                                         // out_tensor_next_subblock_stride_h
    (std::uint32_t) out_subblock_w,                                              // out_subblock_w
    (std::uint32_t) out_subblock_h,                                              // out_subblock_h
    (std::uint32_t) out_subblock_w * out_subblock_h,                             // out_subblock_w * out_subblock_h
    (std::uint32_t) per_core_N / out_subblock_w,                                 // out_num_subblocks_w
    (std::uint32_t) per_core_M / out_subblock_h,                                 // out_num_subblocks_h
    (std::uint32_t) Mt * Nt,                                                     // MtNt
    (std::uint32_t) B                                                            // batch
};
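These argument vectors are built and bound per core. A sketch of the host-side loop, assuming a num_cores_y x num_cores_x grid in which output_idx_y and output_idx_x identify each core’s slice of the output (the exact loop in the example may differ slightly):

// Per-core argument binding over the core grid.
for (uint32_t output_idx_y = 0; output_idx_y < num_cores_y; output_idx_y++) {
    for (uint32_t output_idx_x = 0; output_idx_x < num_cores_x; output_idx_x++) {
        CoreCoord core = {output_idx_x, output_idx_y};
        // ... build mm_reader_args and writer_args for this core as shown above ...
        tt_metal::SetRuntimeArgs(program, reader_id, core, mm_reader_args);
        tt_metal::SetRuntimeArgs(program, writer_id, core, writer_args);
    }
}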
Intermediate Results Handling
In bmm_large_block_zm.cpp:

- Preparing the Intermediate Buffer:
Reserving Partial Results Space: For a given block (excluding the last block), we reserve space for intermediate (i.e., partial) results at the rear of the intermediate circular buffer with cb_reserve_back(...). Each consecutive subblock within this block will access this space and contribute its partial results.
cb_reserve_back(tt::CBIndex::c_24, out_subblock_num_tiles);
Storing Partial Results: Partial results are stored into the above reserved space via a packing mechanism with pack_tile(...).
for (uint32_t i = 0; i < out_subblock_num_tiles; i++) {
    pack_tile(i, tt::CBIndex::c_24);
}
cb_push_back(tt::CBIndex::c_24, out_subblock_num_tiles);
- Computing with Partial Results:
Result Retrieval: During block computations after the first block, we retrieve the stored results with cb_wait_front(...) for further computation. This retrieval, also known as “reloading” data, is the heart of our data reuse concept. It happens only when the flag enable_reload is set to true. Recall from our understanding of circular buffers that synchronization is required: all tile work so far must be finished before more partial results are contributed.
if (enable_reload) {
    cb_wait_front(tt::CBIndex::c_24, out_subblock_num_tiles);
    for (uint32_t i = 0; i < out_subblock_num_tiles; i++) {
        copy_tile(tt::CBIndex::c_24, i, i);
    }
    cb_pop_front(tt::CBIndex::c_24, out_subblock_num_tiles);
}
Execution with `matmul_tiles`: Now we are ready to compute partial results and integrate them back into the computation stream (or, for the last block of computation, complete our data reuse and produce the final output tensor). We call the matmul_tiles(...) function to execute our matmul on the core’s subblocks of tiles.
// Compute output sub-block from in0_subblock x in1_subblock
int dst_index = 0;
int in0_index_h_offset = 0;
for (uint32_t h = 0; h < out_subblock_h; h++) {
    for (uint32_t w = 0; w < out_subblock_w; w++) {
        int in1_index_inner_dim_offset = 0;
        for (uint32_t inner_dim = 0; inner_dim < in0_block_w; inner_dim++) {
            int in0_index = in0_index_subblock_offset + in0_index_h_offset + inner_dim;
            int in1_index = in1_index_subblock_offset + in1_index_inner_dim_offset + w;
            matmul_tiles(tt::CBIndex::c_0, tt::CBIndex::c_1, in0_index, in1_index, dst_index, false /* transpose */);
            in1_index_inner_dim_offset += in1_per_core_w;
        }
        dst_index++;
    }
    in0_index_h_offset += in0_block_w;
}
- Wrapping Up the Intermediate Buffer:
Freeing Up Space: After all partial results have been computed and stored in our output subblock, we have completed the cycle of reuse, so we free up the space in the intermediate circular buffer with cb_pop_front(...).
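The pop takes the same form as the call in the reload snippet above; in sketch form:

cb_pop_front(tt::CBIndex::c_24, out_subblock_num_tiles);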
Conclusion
Those are the additional steps for getting matmul_multicore_reuse operations up and running on the compute engine. To see a more advanced example that uses core-to-core data movement, please refer to the Matmul multi-core data mcast example.