Achieving FP32 Accuracy for Computation
Tensix provides two main compute engines: the matrix engine (FPU) and the vector engine (SFPU). Each has distinct strengths and limitations that affect numerical accuracy and throughput. For a detailed overview of these engines, see Compute Engines and Data Flow within Tensix.
The matrix engine is built for speed and scale, handling large matrix operations efficiently. Its design favors throughput, but this comes with a trade-off: most operations use bfloat16 or TF32 formats, which offer less precision than standard IEEE 754 FP32. Additionally, the matrix engine does not handle special values (inf, NaN, …) properly. For many machine learning tasks, this is sufficient, but it may not meet the needs of workloads that demand high numerical accuracy. For detailed information about FPU and SFPU numerical accuracy characteristics, please review the following documentation:
The vector engine, on the other hand, supports full 32-bit floating-point (FP32) arithmetic and is more IEEE 754-compliant (though not fully). This makes it suitable for computations where precision is critical. However, as a vector unit, it processes data in smaller batches and at lower throughput, much like the SIMD units found in conventional CPUs and GPUs.
Choosing between these engines depends on the requirements of your workload. Use the matrix engine for bulk computation where speed is the priority and the vector engine when higher accuracy is needed.
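To make the precision gap concrete, here is a minimal host-side sketch in plain C++ that emulates 16-bit bfloat16 storage by rounding an FP32 accumulator after every addition. It does not run on Tensix and does not model the matrix engine's actual rounding behavior; round_to_bf16, accumulate_fp32, and accumulate_bf16 are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: round an FP32 value to bfloat16 precision
// (round-to-nearest-even) and widen it back to FP32.
// Assumes a finite input; NaN and inf handling is omitted for brevity.
float round_to_bf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    const std::uint32_t lsb = (bits >> 16) & 1u;  // lowest bit that survives
    bits += 0x7FFFu + lsb;                        // round to nearest, ties to even
    bits &= 0xFFFF0000u;                          // drop the low 16 mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Sum `n` copies of `term`, keeping the accumulator in full FP32.
float accumulate_fp32(float term, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc += term;
    return acc;
}

// Same sum, but the accumulator is rounded to bfloat16 after every step,
// emulating a 16-bit intermediate result.
float accumulate_bf16(float term, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) acc = round_to_bf16(acc + term);
    return acc;
}
```

With term = 1.0f and n = 1000, the FP32 accumulator reaches 1000, while the emulated bfloat16 accumulator stalls at 256: bfloat16 carries an 8-bit significand, so 256 + 1 rounds back to 256.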
To achieve maximum accuracy with the vector engine, several conditions must be met, from host-side configuration to kernel-side implementation.
Host-Side Configuration
On the host, the DeviceComputeKernelConfig struct controls the precision settings for compute kernels, covering the matrix engine (FPU), the vector engine (SFPU), and other components. To ensure the highest possible accuracy, enable the following two options:
- fp32_dest_acc_en = true: This setting allocates 32-bit space in the Dst registers. This is required to store intermediate and final results at FP32 precision. If disabled (false, the default), the Dst registers will store 16-bit data, with FP32 values automatically converted to BFP16.
- math_approx_mode = false: This disables optimizations that approximate certain math operations, ensuring that calculations are performed with the maximum fidelity that the kernel library provides. By default, this is true.
Note
The math_fidelity setting in DeviceComputeKernelConfig only applies to the matrix engine. The vector engine always performs operations in 32-bit mode.
// On the host, configure the kernel for FP32 computation
KernelHandle compute_kernel = CreateKernel(
    program,
    "/path/to/your/kernels/compute.cpp",
    core,
    DeviceComputeKernelConfig{
        .math_approx_mode = false,
        .fp32_dest_acc_en = true,
    }
);
Additionally, ensure that the circular buffers that will handle the FP32 data are created with the DataFormat::Float32 type.
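Switching a circular buffer to Float32 also changes how much memory it needs. The following is a rough sizing sketch in plain C++, assuming the usual 32x32 tile shape; tile_size_bytes and cb_size_bytes are hypothetical helpers for illustration, not part of the host API.

```cpp
#include <cstdint>

// Assumption: tiles are 32x32 elements.
constexpr std::uint32_t kTileHeight = 32;
constexpr std::uint32_t kTileWidth = 32;

constexpr std::uint32_t kFloat32Bytes = 4;   // element size for DataFormat::Float32
constexpr std::uint32_t kBfloat16Bytes = 2;  // 16-bit bfloat16, for comparison

// Hypothetical helper: bytes occupied by one tile of a flat (non-block) format.
constexpr std::uint32_t tile_size_bytes(std::uint32_t bytes_per_element) {
    return kTileHeight * kTileWidth * bytes_per_element;
}

// Hypothetical helper: bytes a circular buffer needs to hold `num_tiles` tiles.
constexpr std::uint32_t cb_size_bytes(std::uint32_t num_tiles,
                                      std::uint32_t bytes_per_element) {
    return num_tiles * tile_size_bytes(bytes_per_element);
}
```

Under these assumptions a Float32 tile takes 4096 bytes, twice the 2048 bytes of a bfloat16 tile, so a double-buffered FP32 circular buffer (two tiles) needs 8192 bytes of L1.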
Note
Some functions, most notably exp_tile and the various trigonometric functions, have inherent limitations due to their polynomial approximations. Some functions have multiple available approximations (e.g. approx and fast_and_approx template parameters for exp_tile). These limitations can lead to reduced accuracy for certain input ranges, even when using the vector engine with FP32 settings. Always validate the accuracy of results for your specific use case. The operator implementations are built to balance performance and accuracy for the intended (machine learning) workloads. If your application requires higher precision across all input ranges, consider implementing custom functions.
Kernel-Side Implementation
Inside the compute kernel, you must use the vector engine (SFPU) for computations and correctly configure the unpacker and packer for FP32 data.
- Configure Unpacker and Packer: Before moving data, you must explicitly configure the unpacker and packer to handle the FP32 format.
  - Call copy_tile_init() before unpacking data from a circular buffer into the Dst registers. This function reconfigures the unpacker to correctly interpret the 32-bit data from the circular buffer.
  - Call pack_reconfig_data_format() before packing data from Dst registers to an output circular buffer. This ensures the packer formats the data correctly for the destination.
Warning
If you are unpacking from or packing to multiple circular buffers with different data formats, you must call copy_tile_init() and pack_reconfig_data_format() each time you switch between circular buffers with different formats. Otherwise, the data may be misinterpreted, leading to incorrect results.
The following example demonstrates a typical compute kernel structure for achieving FP32 accuracy.
#include "compute_kernel_api/common.h"
#include "compute_kernel_api/tile_move_copy.h"
#include "compute_kernel_api/binary.h"

namespace NAMESPACE {
void MAIN {
    constexpr auto cb_in0 = tt::CBIndex::c_in0;
    constexpr auto cb_in1 = tt::CBIndex::c_in1;
    constexpr auto cb_out0 = tt::CBIndex::c_out0;
    constexpr uint32_t num_tiles = 8;

    // Initialize for a binary operation on the SFPU
    init_sfpu(cb_in0, cb_out0);
    add_binary_tile_init();

    for (uint32_t i = 0; i < num_tiles; i++) {
        // Wait for input data
        cb_wait_front(cb_in0, 1);
        cb_wait_front(cb_in1, 1);

        // Acquire Dst registers
        tile_regs_acquire();

        // Configure unpacker for FP32 and copy data from CB to Dst
        copy_tile_init(cb_in0);
        copy_tile(cb_in0, 0, 0); // Copy tile from cb_in0 to Dst[0]
        copy_tile_init(cb_in1);
        copy_tile(cb_in1, 0, 1); // Copy tile from cb_in1 to Dst[1]

        // Perform computation on the SFPU
        add_binary_tile(0, 1, 0); // Dst[0] = Dst[0] + Dst[1]

        // Commit results and release Dst for the packer
        tile_regs_commit();

        // Reserve space in the output CB
        cb_reserve_back(cb_out0, 1);

        // Wait for packer to be ready
        tile_regs_wait();

        // Configure packer for FP32 and pack data from Dst to CB
        // This can be hoisted out of the loop as only one output
        // exists in the kernel
        pack_reconfig_data_format(cb_out0);
        pack_tile(0, cb_out0);

        // Release Dst registers
        tile_regs_release();

        // Announce data is available in output CB
        cb_push_back(cb_out0, 1);

        // Pop from input CBs
        cb_pop_front(cb_in0, 1);
        cb_pop_front(cb_in1, 1);
    }
}
} // NAMESPACE
Warning
Failing to call copy_tile_init() and pack_reconfig_data_format() will result in data being treated as 16-bit, leading to a loss of precision, even if fp32_dest_acc_en is enabled.
Distinguishing Between Matrix and Vector Engine APIs
A general way to distinguish between matrix engine (FPU) and vector engine (SFPU) APIs is by their parameters.
- Matrix engine APIs typically take circular buffer indices as arguments, as the FPU operates directly on data unpacked from circular buffers into its dedicated SrcA and SrcB registers.
- Vector engine APIs operate on data already present in the Dst registers. Therefore, their arguments are indices into the Dst register set.
For example:
// Adding tiles using the FPU
// Operands are specified by their location in circular buffers.
// Result is written to Dst tile 0.
// DO NOT use if accuracy is of concern
add_tiles(cb_in0, cb_in1, 0, 0, 0);
// Adding tiles using the SFPU
// Operands are specified by their location in Dst registers.
// Result is written back to Dst tile 0.
add_binary_tile(0, 1, 0);