pack_untilize_block

template<uint32_t block_ct_dim = 8, uint32_t full_ct_dim = block_ct_dim, bool narrow_row = false, std::uint32_t row_num_datums = TILE_C_DIM, bool dense = false, bool configure_remap = true>
void ckernel::pack_untilize_dest_init(uint32_t ocb, uint32_t call_line = __builtin_LINE())

Performs the hardware and software initialization required for the pack untilize operation. Use this function when the PACK input is already in the DEST register; it does not configure the UNPACK and MATH threads to transfer data from circular buffers to DEST (pack_untilize_init does that). The matching pack untilize operation is pack_untilize_dest. For the untilization to be correct, another function (e.g. reduce_tile, copy_tile) must place the tiles in the DEST register, so this function should be called right after the op-specific initialization function (reduce_init, copy_init, etc.).

Since pack untilize works on a block of tiles, the user should specify the width of a single block (block_ct_dim) and the width of the full input (full_ct_dim). The block height does not need to be provided at initialization, because pack_untilize_block loops over the height of the block. Note that the maximum block size is limited by the size of DEST and by the synchronization mode in use. The maximum sizes are:

  • half-sync mode (16-bit mode): 8 tiles

  • half-sync mode (32-bit mode): 4 tiles

  • full-sync mode (16-bit mode): 16 tiles

  • full-sync mode (32-bit mode): 8 tiles
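The limits above can be captured as a small compile-time check. This is only a sketch: `SyncMode`, `DestMode`, `max_block_ct_dim` and `is_valid_block` are hypothetical names introduced for illustration and are not part of the ckernel API. The numbers are transcribed from the list above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only; not part of the ckernel API.
enum class SyncMode { Half, Full };
enum class DestMode { Bits16, Bits32 };

// Maximum block width in tiles, transcribed from the documented list:
// half-sync 16-bit = 8, half-sync 32-bit = 4, full-sync 16-bit = 16, full-sync 32-bit = 8.
constexpr std::uint32_t max_block_ct_dim(SyncMode sync, DestMode dest) {
    // In the documented list, full-sync allows twice as many tiles as half-sync,
    // and 32-bit mode allows half as many tiles as 16-bit mode.
    return ((sync == SyncMode::Full) ? 16u : 8u) / ((dest == DestMode::Bits32) ? 2u : 1u);
}

// A block width is usable when it is at least 1, fits within the documented
// maximum, and evenly divides the full input width.
constexpr bool is_valid_block(std::uint32_t block_ct_dim, std::uint32_t full_ct_dim,
                              SyncMode sync, DestMode dest) {
    return block_ct_dim >= 1 &&
           block_ct_dim <= max_block_ct_dim(sync, dest) &&
           full_ct_dim % block_ct_dim == 0;
}

static_assert(max_block_ct_dim(SyncMode::Half, DestMode::Bits16) == 8, "documented limit");
static_assert(max_block_ct_dim(SyncMode::Full, DestMode::Bits32) == 8, "documented limit");
```

A check like this could guard the template arguments before they are passed to pack_untilize_dest_init.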

NOTE: Face geometry (face_r_dim, num_faces) is derived from the output circular buffer metadata configured on the host via CircularBufferConfig::set_unpack_face_geometry / CBFormatDescriptor::face_geometry. Callers that need non-default face geometry must configure it on the output CB at program creation time.

By default this init configures BH DEST remap. Pass configure_remap = false only when the caller has already configured BH DEST remap and no intervening operation requires a different DEST remap state.

Return value: None

| Param Type | Name | Description | Type | Valid Range | Required |
|---|---|---|---|---|---|
| Template | block_ct_dim | Width of a single block in tiles | uint32_t | 1 to max (see note) | False (default = 8) |
| Template | full_ct_dim | Width of a full input in tiles | uint32_t | Divisible by block_ct_dim | False |
| Template | narrow_row | Whether the provided input is narrow | bool | true/false | False |
| Template | row_num_datums | Number of datums per row | uint32_t | >= 1 | False |
| Template | dense | Packs two 2 face tiles in a single 4 face region | bool | true/false | False (default false) |
| Template | configure_remap | Whether to (re)configure BH DEST remap (BH only) | bool | true/false | False (default true) |
| Function | ocb | Output circular buffer identifier | uint32_t | 0 to 31 | True |

template<uint32_t block_ct_dim = 8, uint32_t full_ct_dim = block_ct_dim>
void ckernel::pack_untilize_init(uint32_t icb, uint32_t ocb, uint32_t call_line = __builtin_LINE())

Performs the necessary hardware and software initialization for the pack untilize operation. Initializes all three threads: UNPACK, MATH, and PACK. This function should be used when the desired PACK input is not yet in DEST register. Internally, this function calls pack_untilize_dest_init to initialize the PACK thread.

Since pack untilize works on a block of tiles, the user should specify the width of a single block (block_ct_dim) and the width of the full input (full_ct_dim). The block height does not need to be provided at initialization, because pack_untilize_block loops over the height of the block. Note that the maximum block size is limited by the size of DEST and by the synchronization mode in use. The maximum sizes are:

  • half-sync mode (16-bit mode): 8 tiles

  • half-sync mode (32-bit mode): 4 tiles

  • full-sync mode (16-bit mode): 16 tiles

  • full-sync mode (32-bit mode): 8 tiles
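Because full_ct_dim must be divisible by block_ct_dim and the block must fit within the limits above, one way to choose a block width is to take the largest divisor of the full width that does not exceed the limit. The helper below is a hypothetical sketch (`pick_block_ct_dim` is not a ckernel function); the caller supplies the limit for its own sync and DEST mode.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the ckernel API.
// Returns the largest width (in tiles) that divides full_ct_dim and is no
// larger than max_block_ct_dim. Falls back to 1, which divides every width.
constexpr std::uint32_t pick_block_ct_dim(std::uint32_t full_ct_dim,
                                          std::uint32_t max_block_ct_dim) {
    std::uint32_t w = (full_ct_dim < max_block_ct_dim) ? full_ct_dim : max_block_ct_dim;
    for (; w > 1; --w) {
        if (full_ct_dim % w == 0) {
            return w;
        }
    }
    return 1;
}

// Number of column blocks needed to cover one row of the full input.
constexpr std::uint32_t num_column_blocks(std::uint32_t full_ct_dim,
                                          std::uint32_t block_ct_dim) {
    return full_ct_dim / block_ct_dim;
}
```

For example, a full width of 12 tiles with a limit of 8 yields a block width of 6 and two column blocks per row. A prime full width larger than the limit degrades to a block width of 1.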

NOTE: Face geometry (face_r_dim, num_faces) is derived from the output circular buffer metadata configured on the host via CircularBufferConfig::set_unpack_face_geometry / CBFormatDescriptor::face_geometry.

This default init configures BH DEST remap. Use pack_untilize_init_skip_remap only when the caller has already configured BH DEST remap and no intervening operation requires a different DEST remap state.

Return value: None

| Param Type | Name | Description | Type | Valid Range | Required |
|---|---|---|---|---|---|
| Template | block_ct_dim | Width of a single block in tiles | uint32_t | 1 to max (see note) | False (default = 8) |
| Template | full_ct_dim | Width of a full input in tiles | uint32_t | Divisible by block_ct_dim | False |
| Function | icb | Input circular buffer identifier | uint32_t | 0 to 31 | True |
| Function | ocb | Output circular buffer identifier | uint32_t | 0 to 31 | True |

template<uint32_t block_ct_dim = 8, uint32_t full_ct_dim = block_ct_dim>
void ckernel::pack_untilize_block(uint32_t icb, uint32_t block_rt_dim, uint32_t ocb, uint32_t block_c_index = 0)

Performs the untilize operation on a block of tiles, looping over the provided block size. The block is characterized by its width in tiles (block_ct_dim) and its height in tiles (block_rt_dim). The block width must match the one provided when the pack untilize operation was initialized (pack_untilize_init). The block height is not provided at initialization, because pack_untilize_block loops over the height of the block. Note that the maximum block size is limited by the size of DEST and by the synchronization mode in use. The maximum sizes are:

  • half-sync mode (16-bit mode): 8 tiles

  • half-sync mode (32-bit mode): 4 tiles

  • full-sync mode (16-bit mode): 16 tiles

  • full-sync mode (32-bit mode): 8 tiles
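To make the data movement concrete, the following host-side reference shows what untilizing a block means in terms of indices. It is a simplified sketch under stated assumptions, not the hardware implementation: tiles are assumed to be 32 x 32, each tile is assumed to be stored contiguously in row-major order, and the face sub-structure of a tile is ignored. `untilize_reference`, `kTileH` and `kTileW` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed tile shape for illustration. Face sub-structure is ignored here.
constexpr std::size_t kTileH = 32;
constexpr std::size_t kTileW = 32;

// Input: block_rt_dim x block_ct_dim tiles stored one after another
// (tile rows first), each tile contiguous and row-major.
// Output: one row-major matrix of (block_rt_dim * kTileH) rows and
// (block_ct_dim * kTileW) columns.
std::vector<std::uint16_t> untilize_reference(const std::vector<std::uint16_t>& tiled,
                                              std::size_t block_rt_dim,
                                              std::size_t block_ct_dim) {
    const std::size_t tile_elems = kTileH * kTileW;
    assert(tiled.size() == tile_elems * block_rt_dim * block_ct_dim);

    const std::size_t out_width = block_ct_dim * kTileW;
    std::vector<std::uint16_t> out(tiled.size());

    for (std::size_t tr = 0; tr < block_rt_dim; ++tr) {          // tile row
        for (std::size_t tc = 0; tc < block_ct_dim; ++tc) {      // tile column
            const std::size_t tile_base = (tr * block_ct_dim + tc) * tile_elems;
            for (std::size_t r = 0; r < kTileH; ++r) {           // row inside tile
                for (std::size_t c = 0; c < kTileW; ++c) {       // column inside tile
                    const std::size_t src = tile_base + r * kTileW + c;
                    const std::size_t dst = (tr * kTileH + r) * out_width + tc * kTileW + c;
                    out[dst] = tiled[src];
                }
            }
        }
    }
    return out;
}
```

In this model, the first output row is assembled from the first row of every tile in the tile row, which is why the operation works on a whole block width at a time.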

Return value: None

| Param Type | Name | Description | Type | Valid Range | Required |
|---|---|---|---|---|---|
| Template | block_ct_dim | Width of a single block in tiles | uint32_t | 1 to max (see note) | False (default = 8) |
| Template | full_ct_dim | Width of a full input in tiles | uint32_t | Divisible by block_ct_dim | False |
| Function | icb | Input circular buffer identifier | uint32_t | 0 to 31 | True |
| Function | block_rt_dim | Height of a single block in tiles | uint32_t | >= 1 | True |
| Function | ocb | Output circular buffer identifier | uint32_t | 0 to 31 | True |
| Function | block_c_index | Index of the currently processed block | uint32_t | >= 0 | False |

template<uint32_t block_ct_dim = 8, uint32_t full_ct_dim = block_ct_dim, bool diagonal = false, bool narrow_row = false, std::uint32_t row_num_datums = TILE_C_DIM, uint32_t tile_dst_ct_offset = 0, bool dense = false>
void ckernel::pack_untilize_dest(uint32_t ocb, uint32_t block_rt_dim = 1, uint32_t block_c_index = 0, uint32_t tile_dst_rt_offset = 0)

Performs the pack untilize operation when the PACK input is already in the DEST register. To initialize the operation properly, pack_untilize_dest_init must be called before this function, and the block width must match the one provided to pack_untilize_dest_init. For the untilization to be correct, another function (e.g. reduce_tile, copy_tile) must place the tiles in the DEST register. Like pack_untilize_block, this function operates on a whole block that needs to be untilized. Note that the maximum block size is limited by the size of DEST and by the synchronization mode in use. The maximum sizes are:

  • half-sync mode (16-bit mode): 8 tiles

  • half-sync mode (32-bit mode): 4 tiles

  • full-sync mode (16-bit mode): 16 tiles

  • full-sync mode (32-bit mode): 8 tiles
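When full_ct_dim is larger than block_ct_dim, one output row is covered by several column blocks, selected with block_c_index. The sketch below illustrates one plausible interpretation of that index. It is an assumption, not a statement about the implementation: block `b` is taken to start `b * block_ct_dim` tiles into the output row, and TILE_C_DIM is assumed to be 32. `kTileCDim` and both helpers are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value of TILE_C_DIM (datums per tile row); an assumption for this sketch.
constexpr std::uint32_t kTileCDim = 32;

// Hypothetical helper: first datum column of a column block within one
// untilized output row, assuming blocks are laid out left to right.
constexpr std::uint32_t block_start_datum(std::uint32_t block_ct_dim,
                                          std::uint32_t block_c_index) {
    return block_c_index * block_ct_dim * kTileCDim;
}

// Hypothetical helper: total datums in one untilized output row.
constexpr std::uint32_t row_width_datums(std::uint32_t full_ct_dim) {
    return full_ct_dim * kTileCDim;
}
```

Under these assumptions, with block_ct_dim = 4 and full_ct_dim = 8, block 0 covers datum columns 0 to 127 and block 1 covers columns 128 to 255 of each 256-datum row.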

Return value: None

| Param Type | Name | Description | Type | Valid Range | Required |
|---|---|---|---|---|---|
| Template | block_ct_dim | Width of a single block in tiles | uint32_t | 1 to max (see note) | False (default = 8) |
| Template | full_ct_dim | Width of a full input in tiles | uint32_t | Divisible by block_ct_dim | False |
| Template | diagonal | Whether to use diagonal packing | bool | true/false | False |
| Template | narrow_row | Whether the provided input is narrow | bool | true/false | False |
| Template | row_num_datums | Number of datums per row | uint32_t | >= 1 | False |
| Template | tile_dst_ct_offset | Compile time offset for the index of the tile in the dest from which to pack | uint32_t | 0 to 7 (0 to 3 if fp32 dest is enabled) | False (default=0) |
| Template | dense | Packs two 2 face tiles in a single 4 face region | bool | true/false | False (default false) |
| Function | ocb | Output circular buffer identifier | uint32_t | 0 to 31 | True |
| Function | block_rt_dim | Height of a single block in tiles | uint32_t | >= 1 | False (default=1) |
| Function | block_c_index | Block column index (used when full_ct_dim > block_ct_dim) | uint32_t | >= 0 | False (default=0) |
| Function | tile_dst_rt_offset | Runtime offset for the index of the tile in the dest from which to pack | uint32_t | 0 to 7 (0 to 3 if fp32 dest is enabled) | False (default=0) |

NOTE: Face geometry (face_r_dim, num_faces) is derived from the output circular buffer metadata configured on the host via CircularBufferConfig::set_unpack_face_geometry / CBFormatDescriptor::face_geometry.

void ckernel::pack_untilize_uninit(uint32_t ocb)

Uninitializes the pack untilize operation, allowing other operations to be initialized. It must be called after the last call to pack_untilize_dest or pack_untilize_block and before another operation is initialized.

NOTE: This function is not in line with our programming model, and will be removed by the end of 2025 as a part of tt-metal#22904.
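The call order described above (init, then one block call per column block and tile row, then uninit) can be sketched with stand-in functions. Everything in the `mock` namespace is a stub that only records calls so the ordering can be checked on a host. The stubs mirror the documented signatures but are not the ckernel implementations, and `untilize_rows` is a hypothetical wrapper. A real kernel would also need circular buffer synchronization around the block call, which this section does not cover and which is omitted here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

namespace mock {
// Records the order of calls; stands in for the real hardware effects.
inline std::vector<std::string>& call_log() {
    static std::vector<std::string> log;
    return log;
}

template <std::uint32_t block_ct_dim = 8, std::uint32_t full_ct_dim = block_ct_dim>
void pack_untilize_init(std::uint32_t /*icb*/, std::uint32_t /*ocb*/) {
    call_log().push_back("init");
}

template <std::uint32_t block_ct_dim = 8, std::uint32_t full_ct_dim = block_ct_dim>
void pack_untilize_block(std::uint32_t /*icb*/, std::uint32_t /*block_rt_dim*/,
                         std::uint32_t /*ocb*/, std::uint32_t block_c_index = 0) {
    call_log().push_back("block " + std::to_string(block_c_index));
}

inline void pack_untilize_uninit(std::uint32_t /*ocb*/) {
    call_log().push_back("uninit");
}
}  // namespace mock

// Hypothetical wrapper: untilizes num_tile_rows rows of tiles, one tile row
// at a time, splitting each row into full_ct_dim / block_ct_dim column blocks.
template <std::uint32_t block_ct_dim, std::uint32_t full_ct_dim>
void untilize_rows(std::uint32_t icb, std::uint32_t ocb, std::uint32_t num_tile_rows) {
    static_assert(full_ct_dim % block_ct_dim == 0,
                  "full_ct_dim must be divisible by block_ct_dim");
    mock::pack_untilize_init<block_ct_dim, full_ct_dim>(icb, ocb);
    for (std::uint32_t r = 0; r < num_tile_rows; ++r) {
        for (std::uint32_t b = 0; b < full_ct_dim / block_ct_dim; ++b) {
            // block_rt_dim = 1: process one tile row per call.
            mock::pack_untilize_block<block_ct_dim, full_ct_dim>(icb, 1, ocb, b);
        }
    }
    // Must come after the last block call and before any other op is initialized.
    mock::pack_untilize_uninit(ocb);
}
```

Keeping init and uninit outside the loops reflects the requirement that uninit follows the last untilize call rather than each one.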

Return value: None

| Param Type | Name | Description | Type | Valid Range | Required |
|---|---|---|---|---|---|
| Function | ocb | Output circular buffer identifier | uint32_t | 0 to 31 | True |