Smoothstep using SFPI

This document details the implementation of a custom SFPI kernel for the smoothstep function. It is intended for developers familiar with parallel programming concepts who are new to the Tenstorrent platform.

This example builds upon the Vector addition using SFPI example, and introduces the following advanced SFPI concepts:

  • Parameter Passing: Passing scalar arguments to an SFPI kernel.

  • Vector Predicates: Performing element-wise conditional operations.

The smoothstep function is a non-linear interpolation function commonly used in graphics (see GLSL smoothstep documentation for reference). It is defined as:

\[\begin{split}\operatorname{smoothstep}(e_0, e_1, x) = \begin{cases} 0, & x \leq e_0, \\ 1, & x \geq e_1, \\ \left( \dfrac{x - e_0}{e_1 - e_0} \right)^2 \bigl(3 - 2 \tfrac{x - e_0}{e_1 - e_0}\bigr), & e_0 < x < e_1 . \end{cases}\end{split}\]

Although smoothstep is conceptually simple, its implementation is complex enough to demonstrate several advanced features of SFPI, such as parameter passing and vector predicates.

The full source code is available in the tt_metal/programming_examples/custom_sfpi_smoothstep directory.

Building and Running

Build the example using the --build-programming-examples flag:

export TT_METAL_HOME=</path/to/tt-metal>
./build_metal.sh --build-programming-examples

Run the example:

./build/programming_examples/metal_example_custom_sfpi_smoothstep

Warning

Tenstorrent does not guarantee backward compatibility for user-implemented SFPI functions.

Program setup

The host-side setup for this custom SFPI example follows the standard pattern: mesh device initialization, DRAM buffer creation, circular buffer allocation for kernel communication, and kernel creation. Unlike binary operations that require two input buffers, smoothstep is a unary operation requiring only a single input buffer (src0_dram_buffer) plus the output buffer (dst_dram_buffer). Correspondingly, we need only one input circular buffer (cb_in0) and one output buffer (cb_out0).

// Create a 1x1 mesh on device 0 (same API scales to multi-device meshes)
constexpr int device_id = 0;
auto mesh_device = distributed::MeshDevice::create_unit_mesh(device_id);

// Submit work via the mesh command queue: uploads/downloads and program execution
distributed::MeshCommandQueue& cq = mesh_device->mesh_command_queue();
Program program = CreateProgram();

// A MeshWorkload is a collection of programs that will be executed on the mesh
distributed::MeshWorkload workload;
distributed::MeshCoordinateRange device_range = distributed::MeshCoordinateRange(mesh_device->shape());

// Configure mesh buffers with single-tile page size
distributed::DeviceLocalBufferConfig dram_config{
    .page_size = tile_size_bytes,
    .buffer_type = BufferType::DRAM};
distributed::ReplicatedBufferConfig dram_buffer_config{
    .size = dram_buffer_size};

// DRAM buffers: single input + one output for unary operation
auto src0_dram_buffer = distributed::MeshBuffer::create(dram_buffer_config, dram_config, mesh_device.get());
auto dst_dram_buffer = distributed::MeshBuffer::create(dram_buffer_config, dram_config, mesh_device.get());

// Circular buffers for kernel communication
CreateCircularBuffer(program, core, /* cb_in0 config */);
CreateCircularBuffer(program, core, /* cb_out0 config */);

// Kernels: reader, writer, and custom SFPU compute
auto reader = CreateKernel(program, "..../read_tiles.cpp", core, DataMovementConfig{...});
auto writer = CreateKernel(program, "..../write_tile.cpp", core, DataMovementConfig{...});
auto compute = CreateKernel(program, "..../tiles_smoothstep.cpp", core, ComputeConfig{});

The Kernels

Data Movement Kernels

The reader kernel reads tiles from a single source DRAM buffer and pushes them into the input circular buffer. Since smoothstep is a unary operation, we only need to read from one source buffer.

// tt_metal/programming_examples/custom_sfpi_smoothstep/kernels/dataflow/read_tiles.cpp
void kernel_main() {
    uint32_t in0_addr = get_arg_val<uint32_t>(0);
    uint32_t n_tiles = get_arg_val<uint32_t>(1);
    ...
    for (uint32_t i = 0; i < n_tiles; i++) {
        cb_reserve_back(cb_in0, 1);
        uint32_t cb_in0_addr = get_write_ptr(cb_in0);
        noc_async_read_tile(i, in0, cb_in0_addr);
        noc_async_read_barrier();
        cb_push_back(cb_in0, 1);
    }
}

The writer kernel is straightforward: it reads result tiles from the output circular buffer and writes them to the destination DRAM buffer.

// tt_metal/programming_examples/custom_sfpi_smoothstep/kernels/dataflow/write_tile.cpp
void kernel_main() {
    uint32_t c_addr = get_arg_val<uint32_t>(0);
    uint32_t n_tiles = get_arg_val<uint32_t>(1);
    ...
    for (uint32_t i = 0; i < n_tiles; i++) {
        cb_wait_front(cb_out0, 1);
        uint32_t cb_out0_addr = get_read_ptr(cb_out0);
        noc_async_write_tile(i, out0, cb_out0_addr);
        noc_async_write_barrier();
        cb_pop_front(cb_out0, 1);
    }
}

SFPI Compute Kernel

The compute kernel is where the custom SFPI logic resides. It waits for tiles from the input CB, performs the smoothstep operation using the SFPI, and pushes the result to the output CB.

The overall flow follows the standard pattern for unary compute kernels:

// tt_metal/programming_examples/custom_sfpi_smoothstep/kernels/compute/tiles_smoothstep.cpp
namespace NAMESPACE {
void MAIN {
    uint32_t n_tiles = get_arg_val<uint32_t>(0);

    constexpr auto cb_in0 = tt::CBIndex::c_0;
    constexpr auto cb_out0 = tt::CBIndex::c_16;

    constexpr float edge0 = 0.0f;
    constexpr float edge1 = 1.0f;
    // pre-calculate inverse as it is used multiple times and slow (the Baby RISC-V cores)
    // uses software floating-point. Constexpr making this evaulation compile-time
    constexpr float inv_delta = 1.0f / (edge1 - edge0);

    init_sfpu(cb_in0, cb_out0);

    for (uint32_t i = 0; i < n_tiles; i++) {
        cb_wait_front(cb_in0, 1);
        tile_regs_acquire();
        copy_tile(cb_in0, 0, 0); // input x
        my_smoothstep_tiles(0, edge0, edge1, inv_delta);  // <-- Custom SFPI smoothstep
        tile_regs_commit();
        tile_regs_wait();
        cb_reserve_back(cb_out0, 1);
        pack_tile(0, cb_out0);
        cb_push_back(cb_out0, 1);
        cb_pop_front(cb_in0, 1);
        tile_regs_release();
    }
}

Custom SFPI Implementation of Smoothstep

The my_smoothstep_tiles function uses the layered abstraction pattern shown in previous examples. This section focuses on the new concepts introduced in this kernel.

// tt_metal/programming_examples/custom_sfpi_smoothstep/kernels/compute/tiles_smoothstep.cpp

#ifdef TRISC_MATH

// Low-level function operating on a tile face
void my_smoothstep_tile_face(float edge0, float edge1, float inv_delta) {
    constexpr size_t vectors_per_face = 8;
    for (size_t i = 0; i < vectors_per_face; i++) {
        vFloat x = dst_reg[i];
        vFloat t = (x - edge0) * inv_delta;
        v_if(t < sfpi::vConst0) { t = sfpi::vConst0; }
        v_elseif(t > sfpi::vConst1) { t = sfpi::vConst1; }
        v_endif;
        vFloat result = t * t * (3.0f - 2.0f * t);
        dst_reg[i] = result;
    }
}
#endif // TRISC_MATH

// High-level API function
// Accepts `edge0`, `edge1` and `inv_delta` as parameters
inline void my_smoothstep_tile(uint32_t idx_dst0, float edge0, float edge1, float inv_delta) {
    MATH(_llk_math_eltwise_unary_sfpu_params_<false>(
        smoothstep_tile_face,
        idx_dst0,
        VectorMode::RC, // Apply on all 4 faces of the tile
        edge0,
        edge1,
        inv_delta));
}

Parameter Passing

The smoothstep function needs two scalar parameters: edge0 and edge1. These are passed to the SFPI kernel using the _llk_math_eltwise_unary_sfpu_params_ helper function.

// Passes edge0 and edge1 as arguments to the SFPI kernel
my_smoothstep_tile(uint32_t idx_dst0, float edge0, float edge1, float inv_delta);
// ↓
// Use the parameters for all elements in the tile face
my_smoothstep_tile_face(float edge0, float edge1, float inv_delta);

The helper function is a template that takes the low-level face function as its first argument, followed by the destination register index, vector mode, and any scalar parameters required by the face function. This approach makes it easy to pass constants or runtime values into the SFPI kernel.

Vector Predicates

The clamping of the intermediate value t to the [0, 1] range is implemented using vector predicates.

v_if(t < sfpi::vConst0) { t = sfpi::vConst0; }
v_elseif(t > sfpi::vConst1) { t = sfpi::vConst1; }
v_endif;

The v_if and v_elseif instructions perform element-wise conditional assignments on the vFloat vector t. Each lane of the SIMD vector is evaluated independently. A v_endif is required to terminate the conditional block.

The SFPI constants sfpi::vConst0 and sfpi::vConst1 are vectors with all 32 lanes set to 0.0f and 1.0f, respectively. These constants are hardware-defined, readily available for SFPI programs, and do not require manual initialization. Using these pre-defined constants is more efficient than using literal values because the SFPU operates on vectors. Literal values would require broadcasting to a vector, which adds instructions and overhead.

This is analogous to conditional execution in other parallel programming models, where a mask is used to control which processing elements are active.

Runtime Arguments and Execution

Back on the host, we set the runtime arguments for the kernels. Since this is a unary operation, the reader and writer kernels need only a single DRAM buffer address each, and all three kernels need to know the number of tiles to process.

// tt_metal/programming_examples/custom_sfpi_smoothstep/custom_sfpi_smoothstep.cpp
SetRuntimeArgs(program, reader, core, {
    src0_dram_buffer->address(),
    n_tiles
});

SetRuntimeArgs(program, writer, core, {
    dst_dram_buffer->address(),
    n_tiles
});

SetRuntimeArgs(program, compute, core, {
    n_tiles
});

Finally, we add the program to the mesh workload and enqueue it for execution, then read back the results from the destination DRAM buffer to verify correctness against the expected smoothstep function output.

// tt_metal/programming_examples/custom_sfpi_smoothstep/custom_sfpi_smoothstep.cpp
// Add the program to the workload for the mesh
workload.add_program(device_range, std::move(program));

// Enqueue the workload for execution on the mesh (non-blocking) and wait for completion
distributed::EnqueueMeshWorkload(cq, workload, /*blocking=*/false);
distributed::Finish(cq);

std::vector<bfloat16> result_vec;
distributed::EnqueueReadMeshBuffer(cq, result_vec, dst_dram_buffer, /*blocking*/ true);

// Validation against golden smoothstep output
for (size_t i = 0; i < result_vec.size(); ++i) {
    // CPU version of the same smoothstep function for validation
    auto smoothstep = [](float edge0, float edge1, float x) {
        x = (x - edge0) / (edge1 - edge0);
        x = std::clamp(x, 0.0f, 1.0f);
        return x * x * (3 - 2 * x);
    };
    const float expected = smoothstep(0.0f, 1.0f, a_data[i].to_float());
    const float actual = result_vec[i].to_float();
    // Check for match within tolerance...
}

Conclusion

This example demonstrates the implementation of a custom SFPI kernel with parameter passing and conditional logic. Key takeaways are:

  • Parameter Passing: The _llk_math_eltwise_*_sfpu_params_ family of functions is used to pass scalar arguments to a custom SFPI kernel.

  • Vector Predicates: The v_if, v_elseif, and v_endif instructions provide a mechanism for element-wise conditional logic within an SFPI kernel.

  • Unary Operations: Unary SFPI kernels can be implemented efficiently by performing the computation in-place in the destination registers.