Adding New TT-NN Operation
Note
This document is meant for contributors to TT-NN.
Not all operations may be functional on all Tenstorrent hardware (Grayskull, Wormhole, or others).
FAQ
What is a TT-NN operation?
A TT-NN operation is a function that takes in one or more input tensors and produces one or more output tensors. It is implemented in C++ and can be called from Python.
What steps are needed to add TT-NN operation in C++?
There are 2 options for writing a new operation. Option
ais to write a device operation and optionbis to write an operation that calls other operations a. Implement device operation in C++. Device operation is a struct that satisfies DeviceOperationConcept and specifies how to create output tensors and a program to run on the device. b. Implement an operation in C++ that calls other operations. This type of operation simply defines aninvoke()method that calls other operations.Expose the operation as a free function under
ttnnorttnn::experimental(e.g.ttnn::tilizeorttnn::experimental::dropout) namespace that invokes correspondingttnn::primoperation(e.g.ttnn::prim::dropout).
What steps are needed to add TT-NN operation in Python?
Take an existing C++ operation and add a nanobind Python binding for it using
ttnn::bind_function. If the operation is calledttnn::addin C++, then the Python binding will bettnn.add.(Optional) Attach golden function to the operation using
ttnn.attach_golden_function. This is useful for debugging and testing.
Example of Adding a new Device Operation
Let’s implement ttnn.example (It will just copy the input tensor to the output tensor on the device)
C++ Implementation
Step 1: Implement device operation
In order to add a new device operation, follow the directory structure shown below:
ttnn/cpp/ttnn/operations/<category>/<operation_name>/device/<operation_name>_device_operation.hpp ttnn/cpp/ttnn/operations/<category>/<operation_name>/device/<operation_name>_device_operation.cpp ttnn/cpp/ttnn/operations/<category>/<operation_name>/device/<program_factory_0>_program_factory.cpp
Note
Add as many program factories as needed. But the minimum requirement is one program factory.
Note
All new operations must use the ProgramDescriptor pattern (see below).
The old CachedProgram / shared_variables_t pattern is legacy and should not
be used for new operations.
A concrete example of a device operation can be found in ttnn/cpp/ttnn/operations/examples/example/device
ProgramDescriptor Pattern (Recommended)
The ProgramDescriptor pattern is the recommended way to write program factories.
Instead of imperatively constructing a Program object and returning a CachedProgram,
you declaratively describe the program using a ProgramDescriptor struct. The framework
then handles program construction, caching, and buffer address patching on cache hits.
Key benefits:
No ``shared_variables_t`` — you don’t need to store kernel handles or core lists.
No manual buffer address patching — the framework auto-patches buffer addresses on cache hits.
Cleaner code — the declarative style is easier to read and less error-prone.
Single-descriptor operations (recommended):
Place create_descriptor directly on the operation struct. No wrapper struct,
program_factory_t, or select_program_factory needed:
struct MyDeviceOperation {
// ... operation_attributes_t, tensor_args_t, etc. ...
static tt::tt_metal::ProgramDescriptor create_descriptor(
const operation_attributes_t& operation_attributes,
const tensor_args_t& tensor_args,
tensor_return_value_t& tensor_return_value);
};
Multi-variant programs (advanced):
When an operation needs different program strategies, define named factory structs
with create_descriptor and put them in a variant:
struct SmallInput {
static tt::tt_metal::ProgramDescriptor create_descriptor(
const operation_attributes_t&, const tensor_args_t&, tensor_return_value_t&);
};
struct LargeInput {
static tt::tt_metal::ProgramDescriptor create_descriptor(
const operation_attributes_t&, const tensor_args_t&, tensor_return_value_t&);
};
using program_factory_t = std::variant<SmallInput, LargeInput>;
static program_factory_t select_program_factory(
const operation_attributes_t&, const tensor_args_t&);
Building a ProgramDescriptor:
ProgramDescriptor desc;
// 1. Declare circular buffers
desc.cbs.push_back(CBDescriptor{
.total_size = num_tiles * tile_size,
.core_ranges = all_cores,
.format_descriptors = {{CBFormatDescriptor{
.buffer_index = cb_id,
.data_format = data_format,
.page_size = tile_size,
}}},
});
// 2. Declare kernels with compile-time args and config
// Use ReaderConfigDescriptor{} for reader, WriterConfigDescriptor{} for writer,
// and ComputeConfigDescriptor{...} for compute kernels.
//
// IMPORTANT: Kernels are identified by their index in desc.kernels (0, 1, 2, ...).
// The framework uses these indices as kernel handles when applying runtime
// arguments to a cached Program. Push kernels in a fixed, deterministic
// order so that indices are stable across create_descriptor calls.
KernelDescriptor reader_desc;
reader_desc.kernel_source = "path/to/reader_kernel.cpp";
reader_desc.source_type = KernelDescriptor::SourceType::FILE_PATH;
reader_desc.core_ranges = all_cores;
reader_desc.compile_time_args = {cb_id};
// If the kernel uses get_named_compile_time_arg_val(), set named args:
reader_desc.named_compile_time_args = {{"cb_in0", tt::CBIndex::c_0}};
reader_desc.config = ReaderConfigDescriptor{};
KernelDescriptor compute_desc;
compute_desc.kernel_source = "path/to/compute_kernel.cpp";
compute_desc.source_type = KernelDescriptor::SourceType::FILE_PATH;
compute_desc.core_ranges = all_cores;
compute_desc.compile_time_args = {cb_id};
compute_desc.named_compile_time_args = {
{"cb_in0", tt::CBIndex::c_0},
{"cb_out", tt::CBIndex::c_4},
{"cb_intermed0", tt::CBIndex::c_5},
};
compute_desc.config = ComputeConfigDescriptor{
.math_fidelity = MathFidelity::HiFi4,
.fp32_dest_acc_en = false,
.math_approx_mode = false,
};
// 3. Add runtime args per core
reader_desc.runtime_args.emplace_back(
core, KernelDescriptor::CoreRuntimeArgs{buffer_addr, tiles_per_core, offset});
// 4. Push kernels in a fixed order (reader=0, compute=1 here).
desc.kernels.push_back(std::move(reader_desc));
desc.kernels.push_back(std::move(compute_desc));
return desc;
Warning
If a kernel source uses get_named_compile_time_arg_val() to retrieve
compile-time arguments by name, you must set named_compile_time_args
on the corresponding KernelDescriptor. This field maps string names to
tt::CBIndex values and causes the KERNEL_COMPILE_TIME_ARG_MAP macro
to be defined during JIT compilation. Without it, the kernel will fail to
compile with a 'get_named_compile_time_arg_val' was not declared in this
scope error. This applies to all kernel types (reader, writer, and
compute).
Note
Always use named_compile_time_args to map CB indices by name in every
kernel descriptor, even when the kernel only uses positional compile-time args.
This enables automated tooling to introspect which circular buffers each kernel
references without parsing the kernel source.
Full example files:
1// SPDX-FileCopyrightText: © 2023 Tenstorrent USA, Inc.
2//
3// SPDX-License-Identifier: Apache-2.0
4
5#pragma once
6
7#include <optional>
8#include <variant>
9
10#include "ttnn/tensor/tensor.hpp"
11#include "ttnn/core.hpp"
12#include "ttnn/device_operation.hpp"
13#include "ttnn/types.hpp"
14#include <tt-metalium/program_descriptors.hpp>
15
16namespace ttnn::operations::examples {
17
18struct ExampleDeviceOperation {
19 // Define the operation attributes. This is used to store all variables needed by operations that aren't tensors.
20 struct operation_attributes_t {
21 bool attribute;
22 int some_other_attribute;
23 };
24
25 // Define the tensor arguments. This is used to store all tensors passed in and/or out of the operation.
26 // Tensor arguments don't need to be just input tensors, they can be output tensors, input/output tensors, optional
27 // tensors, etc.
28 struct tensor_args_t {
29 // This example will use a tensor that can only be used as an input
30 const Tensor& input_tensor;
31
32 // However, the following examples show what else can be done with tensor_args_t
33
34 // An example of the tensor that can be used for input/output or just for pre-allocated output
35 // Tensor& io_tensor;
36
37 // An example of an optional tensor
38 // std::optional<Tensor> optional_output_tensor;
39
40 // An example of a vector of tensors
41 // std::vector<Tensor> vector_of_tensors;
42
43 // An example of a vector of optional tensors
44 // std::vector<std::optional<Tensor>> vector_of_optional_tensors;
45 };
46
47 // Define the return types for the spec(s) of the operation.
48 // Can be a single ttnn::TensorSpec, std::optional<ttnn::TensorSpec>, std::vector<ttnn::TensorSpec>,
49 // std::tuple<ttnn::TensorSpec, ...> etc.
50 using spec_return_value_t = ttnn::TensorSpec;
51
52 // Define the return types for the tensor(s) of the operation.
53 // Can be a single Tensor, std::optional<Tensor>, std::vector<Tensor>, std::tuple<Tensor, ...> etc.
54 using tensor_return_value_t = Tensor;
55
56 // Note: spec_return_value_t and tensor_return_value_t should follow the same pattern.
57 // i.e. if spec_return_value_t is a std::vector<std::optional<ttnn::TensorSpec>> then tensor_return_value_t should
58 // be std::vector<std::optional<Tensor>>
59
60 // -------------------------------------------------------------------------
61 // Descriptor-based program factories
62 //
63 // Each factory returns a ProgramDescriptor. The framework handles program
64 // construction, caching, and runtime argument patching automatically --
65 // no shared_variables_t or override_runtime_arguments needed.
66 // -------------------------------------------------------------------------
67
68 // Single-core: pins work to core {0,0}
69 struct SingleCore {
70 static tt::tt_metal::ProgramDescriptor create_descriptor(
71 const operation_attributes_t& operation_attributes,
72 const tensor_args_t& tensor_args,
73 tensor_return_value_t& tensor_return_value);
74 };
75
76 // Multi-core: distributes tiles across all available cores
77 struct MultiCore {
78 static tt::tt_metal::ProgramDescriptor create_descriptor(
79 const operation_attributes_t& operation_attributes,
80 const tensor_args_t& tensor_args,
81 tensor_return_value_t& tensor_return_value);
82 };
83
84 using program_factory_t = std::variant<SingleCore, MultiCore>;
85
86 static program_factory_t select_program_factory(const operation_attributes_t&, const tensor_args_t&);
87
88 // Validate the operation when it creates a program. Also called on cache hit by default.
89 static void validate_on_program_cache_miss(const operation_attributes_t&, const tensor_args_t&);
90
91 // Optional: override to use lighter validation on cache hit.
92 // If not provided, the framework calls validate_on_program_cache_miss.
93 // static void validate_on_program_cache_hit(const operation_attributes_t&, const tensor_args_t&);
94
95 // Compute the output specs based on the operation attributes and tensor args.
96 static spec_return_value_t compute_output_specs(const operation_attributes_t&, const tensor_args_t&);
97
98 // Create the output tensors based on the operation attributes and tensor args.
99 static tensor_return_value_t create_output_tensors(const operation_attributes_t&, const tensor_args_t&);
100};
101
102} // namespace ttnn::operations::examples
103
104namespace ttnn::prim {
105ttnn::operations::examples::ExampleDeviceOperation::tensor_return_value_t example(const Tensor& input_tensor);
106} // namespace ttnn::prim
1// SPDX-FileCopyrightText: © 2023 Tenstorrent USA, Inc.
2//
3// SPDX-License-Identifier: Apache-2.0
4
5#include "example_device_operation.hpp"
6#include "ttnn/device_operation.hpp"
7#include "ttnn/tensor/tensor_ops.hpp"
8
9namespace ttnn::operations::examples {
10
11ExampleDeviceOperation::program_factory_t ExampleDeviceOperation::select_program_factory(
12 const operation_attributes_t& operation_attributes, const tensor_args_t& /*tensor_args*/) {
13 if (operation_attributes.attribute) {
14 return MultiCore{};
15 }
16 return SingleCore{};
17}
18
19void ExampleDeviceOperation::validate_on_program_cache_miss(
20 const operation_attributes_t& /*attributes*/, const tensor_args_t& /*tensor_args*/) {}
21
22ExampleDeviceOperation::spec_return_value_t ExampleDeviceOperation::compute_output_specs(
23 const operation_attributes_t&, const tensor_args_t& tensor_args) {
24 const auto& input_tensor = tensor_args.input_tensor;
25 return TensorSpec(
26 input_tensor.logical_shape(),
27 tt::tt_metal::TensorLayout(
28 input_tensor.dtype(), tt::tt_metal::PageConfig(input_tensor.layout()), MemoryConfig{}));
29}
30
31ExampleDeviceOperation::tensor_return_value_t ExampleDeviceOperation::create_output_tensors(
32 const operation_attributes_t& operation_attributes, const tensor_args_t& tensor_args) {
33 auto output_spec = compute_output_specs(operation_attributes, tensor_args);
34 return create_device_tensor(output_spec, tensor_args.input_tensor.device());
35}
36
37} // namespace ttnn::operations::examples
38
39namespace ttnn::prim {
40ttnn::operations::examples::ExampleDeviceOperation::tensor_return_value_t example(const Tensor& input_tensor) {
41 using OperationType = ttnn::operations::examples::ExampleDeviceOperation;
42 auto operation_attributes = OperationType::operation_attributes_t{true, 42};
43 auto tensor_args = OperationType::tensor_args_t{input_tensor};
44
45 return ttnn::device_operation::launch<OperationType>(operation_attributes, tensor_args);
46}
47} // namespace ttnn::prim
1// SPDX-FileCopyrightText: © 2023 Tenstorrent USA, Inc.
2//
3// SPDX-License-Identifier: Apache-2.0
4
5#include "example_device_operation.hpp"
6#include <tt-metalium/work_split.hpp>
7#include <tt-metalium/tensor_accessor_args.hpp>
8
9namespace ttnn::operations::examples {
10
11using namespace tt;
12using namespace tt::tt_metal;
13
14ProgramDescriptor ExampleDeviceOperation::SingleCore::create_descriptor(
15 const operation_attributes_t& /*operation_attributes*/,
16 const tensor_args_t& tensor_args,
17 tensor_return_value_t& tensor_return_value) {
18 const auto& input_tensor = tensor_args.input_tensor;
19 auto& output_tensor = tensor_return_value;
20
21 auto* src_buffer = input_tensor.buffer();
22 auto* dst_buffer = output_tensor.buffer();
23
24 tt::DataFormat cb_data_format = datatype_to_dataformat_converter(input_tensor.dtype());
25 uint32_t single_tile_size = tile_size(cb_data_format);
26 tt::DataFormat cb_data_format_output = datatype_to_dataformat_converter(output_tensor.dtype());
27 uint32_t single_tile_size_output = tt::tile_size(cb_data_format_output);
28
29 uint32_t num_tiles = input_tensor.physical_volume() / constants::TILE_HW;
30
31 CoreCoord compute_with_storage_grid_size = {1, 1};
32 uint32_t num_cores_y = compute_with_storage_grid_size.y;
33 auto [num_cores, all_cores, core_group_1, core_group_2, num_tiles_per_core_group_1, num_tiles_per_core_group_2] =
34 split_work_to_cores(compute_with_storage_grid_size, num_tiles);
35
36 // ---- Build the ProgramDescriptor ----
37
38 ProgramDescriptor desc;
39
40 // Circular buffers
41 constexpr uint32_t src0_cb_index = CBIndex::c_0;
42 constexpr uint32_t num_input_tiles = 2;
43 desc.cbs.push_back(CBDescriptor{
44 .total_size = num_input_tiles * single_tile_size,
45 .core_ranges = all_cores,
46 .format_descriptors = {{CBFormatDescriptor{
47 .buffer_index = src0_cb_index,
48 .data_format = cb_data_format,
49 .page_size = single_tile_size,
50 }}},
51 });
52
53 constexpr uint32_t output_cb_index = CBIndex::c_2;
54 constexpr uint32_t num_output_tiles = 2;
55 desc.cbs.push_back(CBDescriptor{
56 .total_size = num_output_tiles * single_tile_size_output,
57 .core_ranges = all_cores,
58 .format_descriptors = {{CBFormatDescriptor{
59 .buffer_index = output_cb_index,
60 .data_format = cb_data_format_output,
61 .page_size = single_tile_size_output,
62 }}},
63 });
64
65 // Reader kernel
66 std::vector<uint32_t> reader_compile_time_args;
67 TensorAccessorArgs(*src_buffer).append_to(reader_compile_time_args);
68
69 KernelDescriptor reader_desc;
70 reader_desc.kernel_source =
71 "ttnn/cpp/ttnn/operations/eltwise/unary/device/kernels/dataflow/reader_unary_interleaved_start_id.cpp";
72 reader_desc.source_type = KernelDescriptor::SourceType::FILE_PATH;
73 reader_desc.core_ranges = all_cores;
74 reader_desc.compile_time_args = reader_compile_time_args;
75 reader_desc.config = ReaderConfigDescriptor{};
76
77 // Writer kernel
78 std::vector<uint32_t> writer_compile_time_args = {output_cb_index};
79 TensorAccessorArgs(*dst_buffer).append_to(writer_compile_time_args);
80
81 KernelDescriptor writer_desc;
82 writer_desc.kernel_source =
83 "ttnn/cpp/ttnn/operations/eltwise/unary/device/kernels/dataflow/writer_unary_interleaved_start_id.cpp";
84 writer_desc.source_type = KernelDescriptor::SourceType::FILE_PATH;
85 writer_desc.core_ranges = all_cores;
86 writer_desc.compile_time_args = writer_compile_time_args;
87 writer_desc.config = WriterConfigDescriptor{};
88
89 // Compute kernel (eltwise_sfpu.cpp reads num_tiles via get_arg_val, i.e. runtime args)
90 KernelDescriptor compute_desc;
91 compute_desc.kernel_source = "ttnn/cpp/ttnn/operations/eltwise/unary/device/kernels/compute/eltwise_sfpu.cpp";
92 compute_desc.source_type = KernelDescriptor::SourceType::FILE_PATH;
93 compute_desc.core_ranges = core_group_1;
94 compute_desc.config = ComputeConfigDescriptor{
95 .math_fidelity = MathFidelity::HiFi4,
96 .math_approx_mode = false,
97 };
98
99 // Runtime args per core
100 for (uint32_t i = 0, num_tiles_written = 0; i < num_cores; i++) {
101 CoreCoord core = {i / num_cores_y, i % num_cores_y};
102 uint32_t num_tiles_per_core = 0;
103 if (core_group_1.contains(core)) {
104 num_tiles_per_core = num_tiles_per_core_group_1;
105 } else if (core_group_2.contains(core)) {
106 num_tiles_per_core = num_tiles_per_core_group_2;
107 } else {
108 TT_ASSERT(false, "Core not in specified core ranges");
109 }
110
111 reader_desc.runtime_args.emplace_back(
112 core, KernelDescriptor::CoreRuntimeArgs{src_buffer->address(), num_tiles_per_core, num_tiles_written});
113
114 writer_desc.runtime_args.emplace_back(
115 core, KernelDescriptor::CoreRuntimeArgs{dst_buffer->address(), num_tiles_per_core, num_tiles_written});
116
117 compute_desc.runtime_args.emplace_back(core, KernelDescriptor::CoreRuntimeArgs{num_tiles_per_core});
118
119 num_tiles_written += num_tiles_per_core;
120 }
121
122 desc.kernels.push_back(std::move(reader_desc));
123 desc.kernels.push_back(std::move(writer_desc));
124 desc.kernels.push_back(std::move(compute_desc));
125
126 return desc;
127}
128
129} // namespace ttnn::operations::examples
1// SPDX-FileCopyrightText: © 2023 Tenstorrent USA, Inc.
2//
3// SPDX-License-Identifier: Apache-2.0
4
5#include "example_device_operation.hpp"
6#include <tt-metalium/work_split.hpp>
7#include <tt-metalium/tensor_accessor_args.hpp>
8
9namespace ttnn::operations::examples {
10
11using namespace tt;
12using namespace tt::tt_metal;
13
14ProgramDescriptor ExampleDeviceOperation::MultiCore::create_descriptor(
15 const operation_attributes_t& /*operation_attributes*/,
16 const tensor_args_t& tensor_args,
17 tensor_return_value_t& tensor_return_value) {
18 const auto& input_tensor = tensor_args.input_tensor;
19 auto& output_tensor = tensor_return_value;
20
21 auto* src_buffer = input_tensor.buffer();
22 auto* dst_buffer = output_tensor.buffer();
23
24 tt::DataFormat cb_data_format = datatype_to_dataformat_converter(input_tensor.dtype());
25 uint32_t single_tile_size = tile_size(cb_data_format);
26 tt::DataFormat cb_data_format_output = datatype_to_dataformat_converter(output_tensor.dtype());
27 uint32_t single_tile_size_output = tt::tile_size(cb_data_format_output);
28
29 uint32_t num_tiles = input_tensor.physical_volume() / constants::TILE_HW;
30
31 IDevice* device = input_tensor.device();
32 auto compute_with_storage_grid_size = device->compute_with_storage_grid_size();
33 uint32_t num_cores_y = compute_with_storage_grid_size.y;
34 auto [num_cores, all_cores, core_group_1, core_group_2, num_tiles_per_core_group_1, num_tiles_per_core_group_2] =
35 split_work_to_cores(compute_with_storage_grid_size, num_tiles);
36
37 ProgramDescriptor desc;
38
39 constexpr uint32_t src0_cb_index = CBIndex::c_0;
40 constexpr uint32_t num_input_tiles = 2;
41 desc.cbs.push_back(CBDescriptor{
42 .total_size = num_input_tiles * single_tile_size,
43 .core_ranges = all_cores,
44 .format_descriptors = {{CBFormatDescriptor{
45 .buffer_index = src0_cb_index,
46 .data_format = cb_data_format,
47 .page_size = single_tile_size,
48 }}},
49 });
50
51 constexpr uint32_t output_cb_index = CBIndex::c_2;
52 constexpr uint32_t num_output_tiles = 2;
53 desc.cbs.push_back(CBDescriptor{
54 .total_size = num_output_tiles * single_tile_size_output,
55 .core_ranges = all_cores,
56 .format_descriptors = {{CBFormatDescriptor{
57 .buffer_index = output_cb_index,
58 .data_format = cb_data_format_output,
59 .page_size = single_tile_size_output,
60 }}},
61 });
62
63 // Reader kernel
64 std::vector<uint32_t> reader_compile_time_args;
65 TensorAccessorArgs(*src_buffer).append_to(reader_compile_time_args);
66
67 KernelDescriptor reader_desc;
68 reader_desc.kernel_source =
69 "ttnn/cpp/ttnn/operations/eltwise/unary/device/kernels/dataflow/reader_unary_interleaved_start_id.cpp";
70 reader_desc.source_type = KernelDescriptor::SourceType::FILE_PATH;
71 reader_desc.core_ranges = all_cores;
72 reader_desc.compile_time_args = reader_compile_time_args;
73 reader_desc.config = ReaderConfigDescriptor{};
74
75 // Writer kernel
76 std::vector<uint32_t> writer_compile_time_args = {output_cb_index};
77 TensorAccessorArgs(*dst_buffer).append_to(writer_compile_time_args);
78
79 KernelDescriptor writer_desc;
80 writer_desc.kernel_source =
81 "ttnn/cpp/ttnn/operations/eltwise/unary/device/kernels/dataflow/writer_unary_interleaved_start_id.cpp";
82 writer_desc.source_type = KernelDescriptor::SourceType::FILE_PATH;
83 writer_desc.core_ranges = all_cores;
84 writer_desc.compile_time_args = writer_compile_time_args;
85 writer_desc.config = WriterConfigDescriptor{};
86
87 // Compute kernel (eltwise_sfpu.cpp reads num_tiles via get_arg_val, i.e. runtime args)
88 KernelDescriptor compute_desc;
89 compute_desc.kernel_source = "ttnn/cpp/ttnn/operations/eltwise/unary/device/kernels/compute/eltwise_sfpu.cpp";
90 compute_desc.source_type = KernelDescriptor::SourceType::FILE_PATH;
91 compute_desc.core_ranges = all_cores;
92 compute_desc.config = ComputeConfigDescriptor{
93 .math_fidelity = MathFidelity::HiFi4,
94 .math_approx_mode = false,
95 };
96
97 // Runtime args per core
98 for (uint32_t i = 0, num_tiles_written = 0; i < num_cores; i++) {
99 CoreCoord core = {i / num_cores_y, i % num_cores_y};
100 uint32_t num_tiles_per_core = 0;
101 if (core_group_1.contains(core)) {
102 num_tiles_per_core = num_tiles_per_core_group_1;
103 } else if (core_group_2.contains(core)) {
104 num_tiles_per_core = num_tiles_per_core_group_2;
105 } else {
106 TT_ASSERT(false, "Core not in specified core ranges");
107 }
108
109 reader_desc.runtime_args.emplace_back(
110 core, KernelDescriptor::CoreRuntimeArgs{src_buffer->address(), num_tiles_per_core, num_tiles_written});
111
112 writer_desc.runtime_args.emplace_back(
113 core, KernelDescriptor::CoreRuntimeArgs{dst_buffer->address(), num_tiles_per_core, num_tiles_written});
114
115 compute_desc.runtime_args.emplace_back(core, KernelDescriptor::CoreRuntimeArgs{num_tiles_per_core});
116
117 num_tiles_written += num_tiles_per_core;
118 }
119
120 desc.kernels.push_back(std::move(reader_desc));
121 desc.kernels.push_back(std::move(writer_desc));
122 desc.kernels.push_back(std::move(compute_desc));
123
124 return desc;
125}
126
127} // namespace ttnn::operations::examples
Step 2: Implement the operation in C++
In order to add a new operation, add the following file:
ttnn/cpp/ttnn/operations/<category>/<operation_name>/<operation_name>.hpp
A concrete example:
1// SPDX-FileCopyrightText: © 2023 Tenstorrent USA, Inc.
2//
3// SPDX-License-Identifier: Apache-2.0
4
5#pragma once
6
7#include "device/example_device_operation.hpp"
8
9namespace ttnn {
10
11// A composite operation is an operation that calls multiple operations in sequence
12// It is written using invoke and can be used to call multiple primitive and/or composite operations
13Tensor composite_example(const Tensor& input_tensor);
14
15} // namespace ttnn
Python Implementation
Step 1: Add Python binding
In order to add a python binding for the operation, follow the directory structure shown below:
ttnn/python/ttnn/operations/<category>/<operation_name>/<operation_name>_nanobind.hpp ttnn/python/ttnn/operations/<category>/<category>_nanobind.hpp
A concrete example:
1// SPDX-FileCopyrightText: © 2025 Tenstorrent USA, Inc.
2//
3// SPDX-License-Identifier: Apache-2.0
4
5#pragma once
6
7#include "ttnn-nanobind/nanobind_fwd.hpp"
8
9namespace ttnn::operations::examples {
10namespace nb = nanobind;
11void bind_example_operation(nb::module_& mod);
12} // namespace ttnn::operations::examples
1// SPDX-FileCopyrightText: © 2025 Tenstorrent USA, Inc.
2//
3// SPDX-License-Identifier: Apache-2.0
4
5#pragma once
6
7#include "ttnn-nanobind/nanobind_fwd.hpp"
8
9namespace ttnn::operations::examples {
10
11namespace nb = nanobind;
12void py_module(nb::module_& mod);
13
14} // namespace ttnn::operations::examples
Finally, call the module defined in examples/example/example_nanobind.hpp wherever you want it to be added.
Step 2: (Optional) Add golden function for the operation in Python
A golden function can be added to an operation in order to compare its output with an equivalent torch implementation
Add the following code in a python file:
import ttnn
# For the golden function, use the same signature as the operation
# Keep in mind that all `ttnn.Tensor`s are converted to `torch.Tensor`s
# And arguments not needed by torch can be ignored using `*args` and `**kwargs`
def golden_function(input_tensor: "torch.Tensor", *args, **kwargs):
output_tensor: "torch.Tensor" = ...
return output_tensor
# TT-NN Tensors are converted to torch tensors before calling the golden function automatically
# And the outputs are converted back to TT-NN Tensors
# But in some cases you may need to preprocess the inputs and postprocess the outputs manually
# In order to preprocess the inputs manually, use the following signature
# Note that the arguments are not packed into *args and **kwargs as in the golden function!!!
def preprocess_golden_function_inputs(args, kwargs):
# i.e.
ttnn_input_tensor = args[0]
return ttnn.to_torch(ttnn_input_tensor)
# In order to postprocess the outputs manually, use the following signature
# Note that the arguments are not packed into *args and **kwargs as in the golden function!!!
def postprocess_golden_function_outputs(args, kwargs, output):
# i.e.
ttnn_input_tensor = args[0]
torch_output_tensor = outputs[0]
return ttnn.from_torch(torch_output_tensor, dtype=ttnn_input_tensor.dtype, device=ttnn_input_tensor.device)
ttnn.attach_golden_function(
ttnn.example,
golden_function=golden_function,
preprocess_golden_function_inputs=preprocess_golden_function_inputs, # Optional
postprocess_golden_function_outputs=postprocess_golden_function_outputs # Optional
)
Note
ttnn.example is the name of the operation in Python because the operation was registered as ttnn::example in C++.
Step 3: (Optional) Add example usage to docs
It is good practice to include an example demonstrating how to use the new function.
The simplest method is to add an Example section directly in the documentation passed to the ttnn::bind_function function. However, this approach makes it difficult to keep the example up to date and prevents the snippet from being tested.
A better approach is to place the example code in a test file and have it included automatically during the documentation build process.
In the file examples_mapping.py, each function is mapped to an example usage snippet that will appear in its documentation.
Add the new operation to the FUNCTION_TO_EXAMPLES_MAPPING_DICT dictionary, as shown below:
FUNCTION_TO_EXAMPLES_MAPPING_DICT = {
...
"ttnn.example": example.test_example,
...
}
Place the example usage function in a new file named test_example_examples.py (or an existing file, if appropriate).
Make sure the file is imported at the top of examples_mapping.py:
# ...
from . import test_data_movement_examples as data_movement
from . import test_core_examples as core
# Import the new file
from . import test_example_examples as example
# ...
Implement the example as a standard ttnn pytest:
def test_example(device):
# Create tensor
tensor = ttnn.rand((2, 3), ttnn.bfloat16, layout=ttnn.ROW_MAJOR_LAYOUT, device=device)
# Call the new operation
output_tensor = ttnn.example(tensor)
This ensures that all example code snippets are executed and validated in the TT-NN CI pipeline.