pack_untilize_block
-
template<uint32_t block_ct_dim = 8, uint32_t full_ct_dim = block_ct_dim, bool narrow_row = false, std::uint32_t row_num_datums = TILE_C_DIM>
void ckernel::pack_untilize_dest_init(uint32_t ocb, uint32_t face_r_dim = 16, uint32_t num_faces = 4)
-
Performs the necessary hardware and software initialization for the pack untilize operation. This initialization function should be used when the desired PACK input is already in DEST register - therefore, it doesn’t configure UNPACK and MATH threads for transferring data from circular buffers to DEST register (this is done with
pack_untilize_init
function). Matching pack untilize operation for this initialization function ispack_untilize_dest
. In order for this untilization to be performed correctly, some other function must place the tiles in the DEST register, e.g.reduce_tile
,copy_tile
, etc. This initialization function should, therefore, be called right after the op-specific initialization function (reduce_init
,copy_init
, etc.).Since pack untilize works on a block of tiles, the user should specify the width of a single block (block_ct_dim), and the width of the full block (
full_ct_dim
). It is not needed to provide the height of the block during the initialization, sincepack_untilize_block
will loop over the height of the block. Note that the maximum size of the block is limited by the size of the DEST and synchronization mode used. These are maximum sizes:half-sync mode (16-bit mode): 8 tiles
half-sync mode (32-bit mode): 4 tiles
full-sync mode (16-bit mode): 16 tiles
full-sync mode (32-bit mode): 8 tiles
NOTE: This function allows the user to specify
face_r_dim
andnum_faces
through function parameters. Setting these parameters results in an expensive MMIO write and cannot be avoided currently. This should be addressed more systematically within the issue tt-metal#22820, since these two values can be inferred from the circular buffer description, the same way as it is done inllk_pack_hw_configure_disaggregated
. This would remove the need forllk_pack_untilize_hw_configure_disaggregated
altogether and we would pay the price of the MMIO write only once, incompute_kernel_hw_startup
.Return value: None
Param Type
Name
Description
Type
Valid Range
Required
Template
block_ct_dim
Width of a single block in tiles
uint32_t
1 to max (see note)
False (default = 8)
Template
full_ct_dim
Width of a full input in tiles
uint32_t
>= block_ct_dim
False
Template
narrow_row
Whether the provided input is narrow
bool
true/false
False
Template
row_num_datums
Number of datums per row
uint32_t
>= 1
False
Function
ocb
Output circular buffer identifier
uint32_t
0 to 31
True
Function
face_r_dim
Face height in rows
uint32_t
1, 8 or 16
False (default = 16)
Function
num_faces
Number of faces
uint32_t
1, 2 or 4
False (default = 4)
-
template<uint32_t block_ct_dim = 8, uint32_t full_ct_dim = block_ct_dim>
void ckernel::pack_untilize_init(uint32_t icb, uint32_t ocb)
-
Performs the necessary hardware and software initialization for the pack untilize operation. Initializes all three threads: UNPACK, MATH, and PACK. This function should be used when the desired PACK input is not yet in DEST register. Internally, this function calls
pack_untilize_dest_init
to initialize the PACK thread.Since pack untilize works on a block of tiles, the user should specify the width of a single block (block_ct_dim), and the width of the full block (
full_ct_dim
). It is not needed to provide the height of the block during the initialization, sincepack_untilize_block
will loop over the height of the block. Note that the maximum size of the block is limited by the size of the DEST and synchronization mode used. These are maximum sizes:half-sync mode (16-bit mode): 8 tiles
half-sync mode (32-bit mode): 4 tiles
full-sync mode (16-bit mode): 16 tiles
full-sync mode (32-bit mode): 8 tiles
NOTE: This function uses default
face_r_dim
andnum_faces
values (16 and 4, respectively). Setting these parameters results in an expensive MMIO write and cannot be avoided currently. This should be addressed more systematically within the issue tt-metal#22820, since these two values can be inferred from the circular buffer description, the same way as it is done inllk_pack_hw_configure_disaggregated
. This would remove the need forllk_pack_untilize_hw_configure_disaggregated
altogether and we would pay the price of the MMIO write only once, incompute_kernel_hw_startup
.Return value: None
Param Type
Name
Description
Type
Valid Range
Required
Template
block_ct_dim
Width of a single block in tiles
uint32_t
1 to max (see note)
False (default = 8)
Template
full_ct_dim
Width of a full input in tiles
uint32_t
>= block_ct_dim
False
Function
icb
Input circular buffer identifier
uint32_t
0 to 31
True
Function
ocb
Output circular buffer identifier
uint32_t
0 to 31
True
-
template<uint32_t block_ct_dim = 8, uint32_t full_ct_dim = block_ct_dim>
void ckernel::pack_untilize_block(uint32_t icb, uint32_t block_rt_dim, uint32_t ocb, uint32_t block_c_index = 0)
-
Performs the untilize operation on a block of tiles. Loops over the provided block size. The block is characterized by its width in tiles (block_ct_dim) and height in tiles (block_rt_dim). The width of the block has to be the same as the one provided during the initialization of the pack untilize operation (
pack_untilize_init
). It is not needed to provide the height of the block during the initialization, sincepack_untilize_block
will loop over the height of the block. Note that the maximum size of the block is limited by the size of the DEST and synchronization mode used. These are maximum sizes:half-sync mode (16-bit mode): 8 tiles
half-sync mode (32-bit mode): 4 tiles
full-sync mode (16-bit mode): 16 tiles
full-sync mode (32-bit mode): 8 tiles
Return value: None
Param Type
Name
Description
Type
Valid Range
Required
Template
block_ct_dim
Width of a single block in tiles
uint32_t
1 to max (see note)
False (default = 8)
Template
full_ct_dim
Width of a full input in tiles
uint32_t
>= block_ct_dim
False
Function
icb
Input circular buffer identifier
uint32_t
0 to 31
True
Function
block_rt_dim
Height of a single block in tiles
uint32_t
>= 1
True
Function
ocb
Output circular buffer identifier
uint32_t
0 to 31
True
Function
block_c_index
Index of the currently processed block
uint32_t
>= 0
False
-
template<uint32_t block_ct_dim = 8, uint32_t full_ct_dim = block_ct_dim, bool diagonal = false, bool narrow_row = false, std::uint32_t row_num_datums = TILE_C_DIM>
void ckernel::pack_untilize_dest(uint32_t ocb, uint32_t block_rt_dim = 1, uint32_t block_c_index = 0, uint32_t face_r_dim = 16, uint32_t num_faces = 4)
-
Performs the pack untilize operation when PACK input is already in DEST register. In order to properly initialize the operation, a call to
pack_untilize_dest_init
must be made before this function. The width of the block has to be the same as the one provided during the initialization of the pack untilize operation (pack_untilize_dest_init
). In order for this untilization to be performed correctly, some other function must place the tiles in the DEST register, e.g.reduce_tile
,copy_tile
, etc. Similarly aspack_untilize_block
, this function operates on a whole block that needs to be untilized. Note that the maximum size of the block is limited by the size of the DEST and synchronization mode used. These are maximum sizes:half-sync mode (16-bit mode): 8 tiles
half-sync mode (32-bit mode): 4 tiles
full-sync mode (16-bit mode): 16 tiles
full-sync mode (32-bit mode): 8 tiles
Return value: None
Param Type
Name
Description
Type
Valid Range
Required
Template
block_ct_dim
Width of a single block in tiles
uint32_t
1 to max (see note)
False (default = 8)
Template
full_ct_dim
Width of a full input in tiles
uint32_t
>= block_ct_dim
False
Template
diagonal
Whether to use diagonal packing
bool
true/false
False
Template
narrow_row
Whether the provided input is narrow
bool
true/false
False
Template
row_num_datums
Number of datums per row
uint32_t
>= 1
False
Function
ocb
Output circular buffer identifier
uint32_t
0 to 31
True
Function
block_rt_dim
Height of a single block in tiles
uint32_t
>= 1
False (default=1)
Function
block_c_index
Block column index (used when full_ct_dim > block_ct_dim)
uint32_t
>= 0
False (default=0)
Function
face_r_dim
Face height in rows
uint32_t
1, 8 or 16
False (default=16)
Function
num_faces
Number of faces
uint32_t
1, 2 or 4
False (default=4)
-
void ckernel::pack_untilize_uninit(uint32_t ocb)
-
Uninitializes the pack untilize operation, allowing another operations to be initialized. Needs to be called after the last call to
pack_untilize_dest
orpack_untilize_block
, before initializing another operation.NOTE: This function is not in line with our programming model, and will be removed by the end of 2025 as a part of tt-metal#22904.
Return value: None
Param Type
Name
Description
Type
Valid Range
Required
Function
ocb
Output circular buffer identifier
uint32_t
0 to 31
True