API Reference

Python Runtime API

run_inference([module, inputs, input_count, …])

Main “run” function for inference.

run_training([epochs, steps, …])

Main “run” function for training.

initialize_pipeline(training[, …])

Initialize the pipeline to run inference and training through manual run_forward, run_backward, run_optimizer, etc.

run_forward([input_count, _sequential])

Run forward passes on the pre-compiled and initialized pipeline of devices.

run_backward([input_count, zero_grad, …])

Run backward passes on the pre-compiled and initialized pipeline of devices.

run_optimizer([checkpoint, _sequential])

Run optimizer on all devices.

get_parameter_checkpoint([device, _sequential])

Return current parameter values.

get_parameter_gradients([device, _sequential])

Return currently accumulated parameter gradients.

update_device_parameters([device, …])

Push new parameters onto the given device or, if none is provided, onto all devices in the pipeline.

shutdown()

Shut down running processes and clean up PyBuda.

run_inference(module: PyBudaModule | None = None, inputs: List[Tuple[torch.Tensor | Tensor, …] | Dict[str, torch.Tensor | Tensor]] = [], input_count: int = 1, output_queue: Queue | None = None, _sequential: bool = False, _perf_trace: bool = False, _verify_cfg: VerifyConfig | None = None)

Main “run” function for inference. After all modules have been defined and placed on devices, this will execute the workload. Unless sequential mode is set, the function will return as soon as the devices are set up to run, and inference will run as long as new inputs are pushed into the device(s). If sequential mode is on, the function will run through inputs that are already in the input buffer and return when done.

  • Parameters:

    • module (PyBudaModule, optional) – If provided, place the given module on a TT device and run inference. Alternatively, manually create device(s) and place module(s) on them.

    • inputs (List[Union[Tuple[Union[torch.Tensor, Tensor], …], Dict[str, Union[torch.Tensor, Tensor]]]], optional) – An optional list of input tensor tuples or dictionaries (passed as args or kwargs to module), to feed into the inference pipeline. Alternatively, use device.push_to_inputs to manually provide inputs outside of this call.

    • input_count (int , default=1) – The number of inputs to run inference on. If 0, inference will run “forever”, until shutdown or run_inference is called again.

    • output_queue (queue.Queue , optional) – If provided, outputs will be pushed into the queue as they are calculated. Otherwise, one will be created and returned.

    • _sequential (bool , Internal) – Don’t use.

    • _perf_trace (bool , Internal) – Don’t use.

    • _verify_cfg (Internal) – Don’t use.

  • Returns: Queue holding the output results. Either the output_queue provided, or one that’s created.

  • Return type: queue.Queue
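
For example, a minimal end-to-end inference sketch (the wrapped model and shapes are illustrative):

    import torch
    import pybuda

    # Wrap a PyTorch model so PyBuda can place it on a TT device
    module = pybuda.PyTorchModule("linear", torch.nn.Linear(128, 128))

    # Feed one input and read the result from the returned output queue
    output_q = pybuda.run_inference(module, inputs=[(torch.rand(1, 128),)], input_count=1)
    print(output_q.get())

    pybuda.shutdown()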

run_training(epochs: int = 1, steps: int = 1, accumulation_steps: int = 1, microbatch_count: int = 1, checkpoint_queue: Queue | None = None, loss_queue: Queue | None = None, checkpoint_interval: int = 0, _sequential: bool = False, _perf_trace: bool = False, _verify_cfg: VerifyConfig | None = None)

Main “run” function for training. After all modules have been defined and placed on devices, this will execute the workload.

  • Parameters:

    • epochs (int) – The number of epochs to run. The scheduler, if provided, will be stepped after each one.

    • steps (int) – The number of batches to run. After every step, the optimizer will be stepped.

    • accumulation_steps (int) – The number of mini-batches in a batch. Each mini-batch is limited in size by how much of the intermediate data can fit in device memory.

    • microbatch_count (int) – Each mini-batch is optionally further broken into micro-batches. This is necessary to fill a multi-device pipeline, and should be roughly 4-6x the number of devices in the pipeline for ideal performance.

    • checkpoint_queue (Queue , optional) – If provided, weight checkpoints will be pushed into this queue, along with the final set of weights. If one is not provided, one will be created and returned.

    • loss_queue (Queue, optional) – If provided, loss values will be pushed into this queue.

    • checkpoint_interval (int , optional) – The weights will be checkpointed into checkpoint queues on host every checkpoint_interval optimizer steps, if set to non-zero. Zero by default.

    • _sequential (Internal) – Don’t use

    • _perf_trace (Internal) – Don’t use

    • _verify_cfg (Internal) – Don’t use.

  • Returns: Checkpoint queue, holding weight checkpoints, and final trained weights.

  • Return type: queue.Queue
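
A sketch of a single training step (the module, loss, and optimizer choices are illustrative; pybuda.optimizers.SGD is assumed to exist, as in PyBuda's shipped examples):

    import torch
    import pybuda

    # One TT device holding the model and, as the last device, the loss module
    tt0 = pybuda.TTDevice("tt0", optimizer=pybuda.optimizers.SGD(learning_rate=0.1))
    tt0.place_module(pybuda.PyTorchModule("linear", torch.nn.Linear(32, 32)))
    tt0.place_loss_module(pybuda.PyTorchModule("loss", torch.nn.L1Loss()))

    # One microbatch of activations and training targets
    tt0.push_to_inputs((torch.rand(4, 32),))
    tt0.push_to_target_inputs(torch.rand(4, 32))

    checkpoint_q = pybuda.run_training(epochs=1, steps=1)
    pybuda.shutdown()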

shutdown()

Shut down running processes and clean up PyBuda.

initialize_pipeline(training: bool, output_queue: Queue | None = None, checkpoint_queue: Queue | None = None, sample_inputs: Tuple[torch.Tensor | Tensor, …] | Dict[str, torch.Tensor | Tensor] = (), sample_targets: Tuple[torch.Tensor | Tensor, …] = (), microbatch_count: int = 1, d2d_fwd_queues: List[Queue] = [], d2d_bwd_queues: List[Queue] = [], _sequential: bool = False, _verify_cfg: VerifyConfig | None = None, _device_mode: DeviceMode = DeviceMode.CompileAndRun)

Initialize the pipeline to run inference and training through manual run_forward, run_backward, run_optimizer, etc. calls. This should not be used with “all-in-one” APIs like run_inference and run_training, which initialize the pipeline themselves.

  • Parameters:

    • training (bool) – Set to true to prepare the pipeline for training.

    • output_queue (queue.Queue , optional) – If provided, inference outputs will be pushed into the queue as they are calculated. Otherwise, one will be created and returned (in inference mode)

    • checkpoint_queue (Queue , optional) – If provided, weight checkpoints will be pushed into this queue, along with the final set of weights. If one is not provided, one will be created and returned (in training mode)

    • sample_inputs (Tuple[Union[torch.Tensor, Tensor], …], optional) – If calling initialize_pipeline directly to compile models and initialize devices, a representative sample of inputs must be provided to accurately compile the design. Typically, this would be the first input that will be sent through the model post-compile. The tensors must be of the correct shape and data type.

    • sample_targets (Tuple[Union[torch.Tensor, Tensor], …], optional) – If calling initialize_pipeline directly to compile models and initialize devices for training, a representative sample of training targets must be provided to accurately compile the design. Typically, this would be the first target that will be sent to the last device post-compile. The tensors must be of the correct shape and data type.

    • microbatch_count (int) – Only relevant for training. This represents the number of microbatches that are pushed through fwd path before bwd path runs. The device will ensure that buffering is large enough to contain microbatch_count number of microbatch intermediate data.

    • d2d_fwd_queues (List[queue.Queue], optional) – If provided, device-to-device intermediate data that passes through the host will also be stored in the provided queues. The queues are assigned in order from the first device in the pipeline. The last device will not be assigned a queue.

    • d2d_bwd_queues (List[queue.Queue], optional) – If provided, device-to-device intermediate data in the training backward pass that passes through the host will also be stored in the provided queues. The queues are assigned in order from the second device in the pipeline. The first device will not be assigned a queue.

    • _sequential (Internal) – Don’t use

    • _verify_cfg (Internal) – Don’t use.

  • Returns: Output queue for inference, or checkpoint queue for training

  • Return type: queue.Queue
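
A sketch of the manual flow these APIs enable (assumes a training pipeline with a loss module has already been placed on devices, as in the run_training example above):

    import torch
    import pybuda

    # Compile the placed pipeline for training; returns the checkpoint queue
    checkpoint_q = pybuda.initialize_pipeline(
        training=True,
        sample_inputs=(torch.rand(4, 32),),   # representative input(s)
        sample_targets=(torch.rand(4, 32),),  # representative target(s)
    )

    num_steps = 4
    for step in range(num_steps):
        pybuda.run_forward(input_count=1)
        pybuda.run_backward(input_count=1, zero_grad=True)  # first bwd call of the batch
        pybuda.run_optimizer(checkpoint=(step == num_steps - 1))

    pybuda.shutdown()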

run_forward(input_count: int = 1, _sequential: bool = False)

Run forward passes on the pre-compiled and initialized pipeline of devices. This API should be called from custom implementations of inference and training loops, in lieu of calling the run_inference and run_training APIs.

If this is part of an inference run, the results will be placed in the output queues, which should have already been set up through the initialize_pipeline call. If this is called as part of a training pass, then loss will be pushed to the output queue, if one was set up.

  • Parameters:

    • input_count (int , default=1) – The number of inputs to run inference on. If 0, inference will run “forever”, until shutdown or run_inference is called again.

    • _sequential (Internal) – Don’t use

run_backward(input_count: int = 1, zero_grad: bool = False, _sequential: bool = False)

Run backward passes on the pre-compiled and initialized pipeline of devices. This API should be called from custom implementations of inference and training loops, in lieu of calling the run_inference and run_training APIs.

zero_grad should be set for the first backward call of a batch, to zero out accumulated gradients.

No results will be returned. get_parameter_gradients() can be used to get a snapshot of gradients after the backward pass has completed.

  • Parameters:

    • input_count (int, default=1) – The number of inputs to run backward passes on.

    • zero_grad (bool, optional) – If set, accumulated gradients on device will be zeroed out before the backward pass begins.

    • _sequential (Internal) – Don’t use

run_optimizer(checkpoint: bool = False, _sequential: bool = False)

Run optimizer on all devices. If checkpoint is set, a checkpoint of parameters will be taken and placed into the checkpoint queue that has been set up during initialize_pipeline call.

  • Parameters:

    • checkpoint (bool , optional) – If set, checkpoint of parameters will be placed into checkpoint queue.

    • _sequential (Internal) – Don’t use

get_parameter_checkpoint(device: CPUDevice | TTDevice | None = None, _sequential: bool = False)

Return current parameter values. If a device is specified, only parameters for that device will be returned, otherwise a list of parameters for all devices will come back.

  • Parameters:

    • device (Union[CPUDevice, TTDevice], optional) – Device to read parameter values from. If None, all devices will be read from.

    • _sequential (Internal) – Don’t use

  • Returns: List of parameter checkpoints for devices in the pipeline, or the given device

  • Return type: List[Dict[str, Tensor]]

get_parameter_gradients(device: CPUDevice | TTDevice | None = None, _sequential: bool = False)

Return currently accumulated parameter gradients. If a device is specified, only gradients for that device will be returned, otherwise a list of gradients for all devices will come back.

  • Parameters:

    • device (Union[CPUDevice, TTDevice], optional) – Device to read parameter gradients from. If None, all devices will be read from.

    • _sequential (Internal) – Don’t use

  • Returns: List of parameter gradients for devices in the pipeline, or the given device

  • Return type: List[Dict[str, Tensor]]

update_device_parameters(device: CPUDevice | TTDevice | None = None, parameters: List[Dict[str, Tensor]] = [], _sequential: bool = False)

Push new parameters onto the given device or, if none is provided, onto all devices in the pipeline.

  • Parameters:

    • device (Union[CPUDevice, TTDevice], optional) – Device to push parameter values to. If None, parameters will be pushed to all devices in the pipeline.

    • parameters (List *[*Dict *[*str , torch.Tensor ] ]) – List of dictionaries of parameters to update

    • _sequential (Internal) – Don’t use
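
These three calls compose into a host-side checkpoint round trip; a sketch (pipeline setup elided):

    import pybuda

    # One Dict[str, Tensor] per device in the pipeline
    checkpoints = pybuda.get_parameter_checkpoint()
    gradients = pybuda.get_parameter_gradients()

    # ... inspect or adjust parameter values on host ...

    # Push the (possibly modified) parameters back to all devices
    pybuda.update_device_parameters(parameters=checkpoints)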

C++ Runtime API

The BUDA Backend used by Python Runtime can be optionally used stand-alone to run pre-compiled TTI models. The API reference for stand-alone BUDA Backend Runtime can be found here.

Configuration and Placement

set_configuration_options([…])

Set global compile configuration options.

set_epoch_break(op_names)

Instruct place & route to start a new placement epoch on the given op(s)

set_chip_break(op_names)

Instruct place & route to start placing ops on the next chip in the pipeline.

override_op_size(op_name, grid_size)

Override automatic op sizing with given grid size.

detect_available_devices()

Returns a list of available devices on the system.

set_configuration_options(enable_recompute: bool | None = None, balancer_policy: str | None = None, place_on_one_row: bool | None = None, enable_t_streaming: bool | None = None, manual_t_streaming: bool | None = None, enable_consteval: bool | None = None, default_df_override: DataFormat | None = None, accumulate_df: DataFormat | None = None, math_fidelity: MathFidelity | None = None, performance_trace: PerfTraceLevel | None = None, backend_opt_level: int | None = None, backend_output_dir: str | None = None, backend_device_descriptor_path: str | None = None, backend_cluster_descriptor_path: str | None = None, backend_runtime_params_path: str | None = None, backend_runtime_args: str | None = None, enable_auto_fusing: bool | None = None, enable_conv_prestride: bool | None = None, enable_stable_softmax: bool | None = None, amp_level: int | None = None, harvested_rows: List[List[int]] | None = None, store_backend_db_to_yaml: bool | None = None, input_queues_on_host: bool | None = None, output_queues_on_host: bool | None = None, enable_auto_transposing_placement: bool | None = None, use_interactive_placer: bool | None = None, op_intermediates_to_save: List[str] | None = None, enable_enumerate_u_kt: bool | None = None, enable_device_tilize: bool | None = None, dram_placement_algorithm: DRAMPlacementAlgorithm | None = None, chip_placement_policy: str | None = None, enable_forked_dram_inputs: bool | None = None, device_config: str | None = None)

Set global compile configuration options.

  • Parameters:

    • enable_recompute (Optional[bool]) – For training only. Enable the ‘recompute’ feature, which significantly reduces memory requirements at a cost of some performance.

    • balancer_policy (Optional[str]) –

      Override the default place & route policy. Valid values are:

      • "NLP": Custom policy with reasonable defaults for NLP-like models

      • "Ribbon": Custom policy with reasonable defaults for CNN-like models

      • "MaximizeTMinimizeGrid" [DEBUG ONLY]: Maximize t-streaming. Verification only.

      • "MinimizeGrid" [DEBUG ONLY]: Super simple policy that always chooses the smallest grid. Verification only.

      • "Random" [DEBUG ONLY]: Pick random valid grids for each op. Verification only.

      • "CNN" [DEPRECATED]

    • place_on_one_row (Optional[bool]) – Force place & route to place every op on one row of cores only.

    • enable_t_streaming (Optional[bool]) – Enable a buffering optimization which reduces memory usage and latency.

    • manual_t_streaming (Optional[bool]) – Only respect override_t_stream_dir op overrides, otherwise no streaming. enable_t_streaming must also be true for this to take effect.

    • enable_consteval (Optional[bool]) – Use constant propagation to simplify the model.

    • default_df_override (Optional[DataFormat], default None) – Set the default override for all node data formats. None means automatically inferred.

    • accumulate_df (Optional[DataFormat], default Float16_b) – Set the default accumulation format for all operations, if supported by the device.

    • math_fidelity (Optional[MathFidelity], default MathFidelity.HiFi3) – Set the default math fidelity for all operations.

    • performance_trace (Optional[PerfTraceLevel]) – Set to a value other than None to enable performance tracing. Note that the Verbose level can impact performance due to the amount of data being captured and stored.

    • backend_opt_level (Optional[int]) – The level of performance optimization in the backend runtime (0-3).

    • backend_output_dir (Optional[str]) – Set the location for backend compile temporary files and binaries.

    • backend_device_descriptor_path (Optional[str]) – Set the location of the YAML file from which to load the device descriptor.

    • backend_cluster_descriptor_path (Optional[str]) – Set the location of the YAML file from which to load the multi-device cluster descriptor.

    • backend_runtime_params_path (Optional[str]) – Set the location of the YAML file used to dump/load backend database configurations.

    • enable_auto_fusing (Optional[bool]) – Enable automatic fusing of small operations into complex ops.

    • enable_conv_prestride (Optional[bool]) – Enable host-side convolution prestriding (occurs during host tilizing) for a more efficient first convolution layer.

    • amp_level (Optional[int]) – Configures the optimization setting for Automatic Mixed Precision (AMP).

      • 0: No optimization (default)

      • 1: Optimizer ops are set with { OutputDataFormat.Float32, MathFidelity.HiFi4 }

    • harvested_rows (Optional[List[List[int]]]) – Configures manually induced harvested rows. Only row indices within 1-5 or 7-11 are harvestable.

    • store_backend_db_to_yaml (Optional[bool]) – Enable automatic dumping of the backend database configuration to the YAML file specified with backend_runtime_params_path. Note that all backend configurations are loaded from that YAML file if an existing file is specified and this flag is set to False.

    • use_interactive_placer (Optional[bool]) – Enable or disable usage of the interactive placer within balancer policies which support it. Enabled by default.

    • enable_device_tilize (Optional[bool]) – Enable or disable the Tilize op on the embedded platform.

    • chip_placement_policy (Optional[str]) – Determine the order of the chip IDs used in placement.

    • dram_placement_algorithm (Optional[DRAMPlacementAlgorithm]) – Set the algorithm to use for DRAM placement. Valid values are: ROUND_ROBIN, ROUND_ROBIN_FLIP_FLOP, GREATEST_CAPACITY, CLOSEST.

    • enable_forked_dram_inputs (Optional[bool]) – Enable or disable the forked DRAM input optimization.

    • device_config (Optional[str]) – Configure and set runtime_param.yaml for offline WH compile based on the value. YAML files for supported configurations are mapped in ‘supported_backend_configurations’.
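
For instance, a typical configuration sketch (the values are illustrative, not recommendations; DataFormat and MathFidelity are imported from pybuda._C, as referenced elsewhere in this document):

    import pybuda
    from pybuda._C import DataFormat, MathFidelity

    pybuda.set_configuration_options(
        balancer_policy="Ribbon",                  # CNN-like model
        enable_t_streaming=True,
        default_df_override=DataFormat.Float16_b,  # default data format for all nodes
        math_fidelity=MathFidelity.HiFi3,
        backend_opt_level=3,
    )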

set_epoch_break(op_names: str | NodePredicateBuilder | List[str | NodePredicateBuilder])

Instruct place & route to start a new placement epoch on the given op(s)

  • Parameters: op_names (Union[str, query.NodePredicateBuilder, List[Union[str, query.NodePredicateBuilder]]]) – Op(s) or predicate matches on which to start a new placement epoch

set_chip_break(op_names: str | NodePredicateBuilder | List[str | NodePredicateBuilder])

Instruct place & route to start placing ops on the next chip in the pipeline.

  • Parameters: op_names (Union[str, query.NodePredicateBuilder, List[Union[str, query.NodePredicateBuilder]]]) – Op(s) or predicate matches on which to start a new chip

override_op_size(op_name: str, grid_size: Tuple[int, int])

Override automatic op sizing with given grid size.

  • Parameters:

    • op_name (str) – Name of the op to override

    • grid_size (Tuple[int, int]) – Rectangular shape (row, column) of the placed op

detect_available_devices()

Returns a list of available devices on the system.
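
A short sketch combining these helpers (the op names are hypothetical and must match ops in the compiled graph):

    import pybuda

    # Check what silicon is present before choosing a layout
    devices = pybuda.detect_available_devices()
    print(f"{len(devices)} device(s) available")

    # Start a new placement epoch at a given op, and pin its grid size
    pybuda.set_epoch_break("matmul_17")
    pybuda.override_op_size("matmul_17", (2, 4))  # 2 rows x 4 columns of cores

    # In a multi-chip pipeline, move subsequent ops to the next chip
    pybuda.set_chip_break("matmul_42")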

Operations

General

Matmul(name, operandA, operandB[, bias])

Matrix multiplication transformation on input activations, with optional bias.

Add(name, operandA, operandB)

Elementwise add of two tensors

Subtract(name, operandA, operandB)

Elementwise subtraction of two tensors

Multiply(name, operandA, operandB)

Elementwise multiply of two tensors

ReduceSum(name, operandA, dim)

Reduce by summing along the given dimension

ReduceAvg(name, operandA, dim)

Reduce by averaging along the given dimension

Constant(name, *, constant)

Op representing user-defined constant

Identity(name, operandA[, unsqueeze, …])

Identity operation.

Buffer(name, operandA)

Identity operation.

Matmul(name: str, operandA: Tensor, operandB: Tensor | Parameter, bias: Tensor | Parameter | None = None)

Matrix multiplication transformation on input activations, with optional bias. y = ab + bias

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – Input operand A

    • operandB (Tensor) – Input operand B

    • bias (Tensor, optional) – Optional bias tensor
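
Ops like Matmul are typically composed inside a PyBudaModule’s forward; a minimal sketch (assuming the op wrappers are exposed under pybuda.op, as in PyBuda’s examples):

    import torch
    import pybuda
    import pybuda.op

    class TinyLinear(pybuda.PyBudaModule):
        def __init__(self, name):
            super().__init__(name)
            self.weights = pybuda.Parameter(torch.rand(128, 128), requires_grad=True)

        def forward(self, act):
            # y = act @ weights, then doubled elementwise
            y = pybuda.op.Matmul("mm", act, self.weights)
            return pybuda.op.Add("add", y, y)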

Add(name: str, operandA: Tensor, operandB: Tensor | Parameter)

Elementwise add of two tensors

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • operandB (Tensor) – Second operand

  • Returns: Buda tensor

  • Return type: Tensor

Subtract(name: str, operandA: Tensor, operandB: Tensor)

Elementwise subtraction of two tensors

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • operandB (Tensor) – Second operand

  • Returns: Buda tensor

  • Return type: Tensor

Multiply(name: str, operandA: Tensor, operandB: Tensor | Parameter)

Elementwise multiply of two tensors

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • operandB (Tensor) – Second operand

  • Returns: Buda tensor

  • Return type: Tensor

Identity(name: str, operandA: Tensor, unsqueeze: str | None = None, unsqueeze_dim: int | None = None)

Identity operation.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • unsqueeze (str) – If set, the operation returns a new tensor with a dimension of size one inserted at the specified position.

    • unsqueeze_dim (int) – The index at which the singleton dimension is inserted

  • Returns: Buda tensor

  • Return type: Tensor

Buffer(name: str, operandA: Tensor)

Identity operation. Unlike Identity, a Buffer op will not be lowered into a NOP, so it avoids being removed by the time it gets to lowering.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

  • Returns: Buda tensor

  • Return type: Tensor

ReduceSum(name: str, operandA: Tensor, dim: int)

Reduce by summing along the given dimension

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • dim (int) – Dimension along which to reduce. A positive number 0 - 3 or negative from -1 to -4.

  • Returns: Buda tensor

  • Return type: Tensor

ReduceAvg(name: str, operandA: Tensor, dim: int)

Reduce by averaging along the given dimension

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • dim (int) – Dimension along which to reduce. A positive number 0 - 3 or negative from -1 to -4.

  • Returns: Buda tensor

  • Return type: Tensor

Constant(name: str, *, constant: float)

Op representing user-defined constant

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • constant (float) – Constant value

  • Returns: Buda tensor

  • Return type: Tensor

Transformations

HSlice(name, operandA, slices)

Slice along horizontal axis into given number of pieces.

VSlice(name, operandA, slices)

Slice along vertical axis into given number of pieces.

HStack(name, operandA[, slices])

Stack Z dimension along horizontal dimension.

VStack(name, operandA[, slices])

Stack Z dimension along vertical dimension.

Reshape(name, operandA, shape)

TM

Index(name, operandA, dim, start[, stop, stride])

TM

Select(name, operandA, dim, index[, stride])

TM

Pad(name, operandA, pad[, mode, channel_last])

TM

Concatenate(name, *operands, axis)

Concatenate tensors along axis

BinaryStack(name, operandA, operandB, dim)

Stack two tensors along the given dimension

Heaviside(name, operandA, operandB)

Elementwise Heaviside step function of two tensors

Heaviside(name: str, operandA: Tensor, operandB: Tensor | Parameter)

Elementwise Heaviside step function of two tensors

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • operandB (Tensor) – Second operand

  • Returns: Buda tensor

  • Return type: Tensor

BinaryStack(name: str, operandA: Tensor, operandB: Tensor | Parameter, dim: int)

Stack two tensors along the given dimension

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • operandB (Tensor) – Second operand

    • dim (int) – Dimension along which to stack

  • Returns: Buda tensor

  • Return type: Tensor

HSlice(name: str, operandA: Tensor, slices: int)

Slice along horizontal axis into given number of pieces.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • slices (int) – The number of slices to create

  • Returns: Buda tensor

  • Return type: Tensor

HStack(name: str, operandA: Tensor, slices: int = -1)

Stack Z dimension along horizontal dimension.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • slices (int, optional) – The number of slices to create. If not provided, it will be equal to the current Z dimension.

  • Returns: Buda tensor

  • Return type: Tensor

VSlice(name: str, operandA: Tensor, slices: int)

Slice along vertical axis into given number of pieces.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • slices (int) – The number of slices to create

  • Returns: Buda tensor

  • Return type: Tensor

VStack(name: str, operandA: Tensor, slices: int = -1)

Stack Z dimension along vertical dimension.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • slices (int, optional) – The number of slices to create. If not provided, it will be equal to the current Z dimension.

  • Returns: Buda tensor

  • Return type: Tensor

Reshape(name: str, operandA: Tensor, shape: Tuple[int, …])

TM

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – Input operand A

    • shape (Tuple[int, …]) – Target shape of the output tensor

  • Returns: Buda tensor

  • Return type: Tensor

Index(name: str, operandA: Tensor, dim: int, start: int, stop: int | None = None, stride: int = 1)

TM

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – Input operand A

    • dim (int) – Dimension to slice

    • start (int) – Starting slice index (inclusive)

    • stop (int) – Stopping slice index (exclusive)

    • stride (int) – Stride amount along that dimension

  • Returns: Buda tensor

  • Return type: Tensor

Select(name: str, operandA: Tensor, dim: int, index: int | Tuple[int, int], stride: int = 0)

TM

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – Input operand A

    • dim (int) – Dimension to slice

    • index (Union[int, Tuple[int, int]]) – Either a single index to select from that dimension, or a (start, length) range to select from that dimension

    • stride (int) – Stride amount along that dimension

  • Returns: Buda tensor

  • Return type: Tensor
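
As an illustration of the slicing TMs, a sketch inside a module’s forward (pybuda.op namespace and shapes assumed as above):

    import pybuda
    import pybuda.op

    class SliceDemo(pybuda.PyBudaModule):
        def forward(self, x):
            # Keep indices 0..63 of the last dimension, taking every other element
            head = pybuda.op.Index("head", x, dim=-1, start=0, stop=64, stride=2)
            # Select a (start=0, length=16) range from the row dimension
            return pybuda.op.Select("rows", head, dim=-2, index=(0, 16))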

Pad(name: str, operandA: Tensor, pad: Tuple[int, int, int, int] | Tuple[int, int], mode: str = 'constant', channel_last: bool = False)

TM

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – Input operand A

    • pad (tuple) – Either (padding_left, padding_right) or (padding_left, padding_right, padding_top, padding_bottom)

  • Returns: Buda tensor

  • Return type: Tensor

Concatenate(name: str, *operands: Tensor, axis: int)

Concatenate tensors along axis

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operands (Tuple[Tensor, …]) – Tensors to be concatenated

    • axis (int) – concatenate axis

  • Returns: Buda tensor

  • Return type: Tensor

Activations

Relu(name, operandA[, threshold, mode])

ReLU

Gelu(name, operandA[, approximate])

GeLU

Sigmoid(name, operandA)

Sigmoid activation function.

Relu(name: str, operandA: Tensor, threshold=0.0, mode='min')

ReLU

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

  • Returns: Buda tensor

  • Return type: Tensor

Gelu(name: str, operandA: Tensor, approximate='none')

GeLU

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • approximate (str) – The gelu approximation algorithm to use: 'none' | 'tanh'. Default: 'none'

  • Returns: Buda tensor

  • Return type: Tensor

Sigmoid(name: str, operandA: Tensor)

Sigmoid activation function.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

  • Returns: Buda tensor

  • Return type: Tensor

Math

Exp(name, operandA)

Exponent operation.

Reciprocal(name, operandA)

Reciprocal operation.

Sqrt(name, operandA)

Square root.

Log(name, operandA)

Log operation: natural logarithm of the elements of operandA

Abs(name, operandA)

Elementwise absolute value

Clip(name, operandA, min, max)

Clips tensor values between min and max

Max(name, operandA, operandB)

Elementwise max of two tensors

Argmax(name, operandA[, dim])

Argmax operation.

Max(name: str, operandA: Tensor, operandB: Tensor | Parameter)

Elementwise max of two tensors

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • operandB (Tensor) – Second operand

  • Returns: Buda tensor

  • Return type: Tensor

Exp(name: str, operandA: Tensor)

Exponent operation.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

  • Returns: Buda tensor

  • Return type: Tensor

Reciprocal(name: str, operandA: Tensor)

Reciprocal operation.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

  • Returns: Buda tensor

  • Return type: Tensor

Sqrt(name: str, operandA: Tensor)

Square root.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

  • Returns: Buda tensor

  • Return type: Tensor

Log(name: str, operandA: Tensor)

Log operation: natural logarithm of the elements of operandA : yi = log_e(xi) for all xi in operandA tensor

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

  • Returns: Buda tensor

  • Return type: Tensor

Argmax(name: str, operandA: Tensor, dim: int | None = None)

Argmax operation.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

  • Returns: Buda tensor

  • Return type: Tensor

Abs(name: str, operandA: Tensor)

Elementwise absolute value

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

  • Returns: Buda tensor

  • Return type: Tensor

Clip(name: str, operandA: Tensor, min: float, max: float)

Clips tensor values between min and max

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • min (float) – Minimum value

    • max (float) – Maximum value

  • Returns: Buda tensor

  • Return type: Tensor

Convolutions

Conv2d(name, activations, weights[, bias, …])

Conv2d transformation on input activations, with optional bias.

Conv2dTranspose(name, activations, weights)

Conv2dTranspose transformation on input activations, with optional bias.

MaxPool2d(name, activations, kernel_size[, …])

Maxpool2d transformation on input activations

AvgPool2d(name, activations, kernel_size[, …])

Avgpool2d transformation on input activations

Conv2d(name: str, activations: Tensor, weights: Tensor | Parameter, bias: Tensor | Parameter | None = None, stride: int = 1, padding: int | str | List = 'same', dilation: int = 1, groups: int = 1, channel_last: bool = False)

Conv2d transformation on input activations, with optional bias.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • activations (Tensor) – Input activations of shape (N, Cin, iH, iW)

    • weights –

      Tensor: Input weights of shape (Cout, Cin / groups, kH, kW)

      List[Tensor] (internal use): Pre-split input weights, a list of tensors of shape [(weight_grouping, Cin / groups, Cout)], of length (K*K // weight_grouping)

    • bias (Tensor, optional) – Optional bias tensor of shape (Cout)

Conv2dTranspose(name: str, activations: Tensor, weights: Tensor | Parameter, bias: Tensor | Parameter | None = None, stride: int = 1, padding: int | str = 'same', dilation: int = 1, groups: int = 1, channel_last: bool = False)

Conv2dTranspose transformation on input activations, with optional bias.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • activations (Tensor) – Input activations of shape (N, Cin, iH, iW)

    • weights –

      Tensor: Input weights of shape (Cout, Cin / groups, kH, kW)

      List[Tensor] (internal use): Pre-split input weights, a list of tensors of shape [(weight_grouping, Cin / groups, Cout)], of length (K*K // weight_grouping)

    • bias (Tensor, optional) – Optional bias tensor of shape (Cout)

MaxPool2d(name: str, activations: Tensor, kernel_size: int | Tuple[int, int], stride: int = 1, padding: int | str = 'same', dilation: int = 1, ceil_mode: bool = False, return_indices: bool = False, max_pool_add_sub_surround: bool = False, max_pool_add_sub_surround_value: float = 1.0, channel_last: bool = False)

Maxpool2d transformation on input activations

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • activations (Tensor) – Input activations of shape (N, Cin, iH, iW)

    • kernel_size (Union[int, Tuple[int, int]]) – Size of the pooling region

AvgPool2d(name: str, activations: Tensor, kernel_size: int | Tuple[int, int], stride: int = 1, padding: int | str = 'same', ceil_mode: bool = False, count_include_pad: bool = True, divisor_override: float | None = None, channel_last: bool = False)

Avgpool2d transformation on input activations

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • activations (Tensor) – Input activations of shape (N, Cin, iH, iW)

    • kernel_size (Union[int, Tuple[int, int]]) – Size of the pooling region

NN

Softmax(name, operandA, *, dim[, stable])

Softmax operation.

Layernorm(name, operandA, weights, bias[, …])

Layer normalization.

Softmax(name: str, operandA: Tensor, *, dim: int, stable: bool = True)

Softmax operation.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

    • dim (int) – A dimension along which Softmax will be computed (so every slice along dim will sum to 1).

    • stable (bool) – Use stable softmax or not.

  • Returns: Buda tensor

  • Return type: Tensor

Layernorm(name: str, operandA: Tensor, weights: Tensor | Parameter, bias: Tensor | Parameter, dim: int = -1, epsilon: float = 1e-05)

Layer normalization.

  • Parameters:

    • name (str) – Op name, unique to the module, or leave blank to autoset

    • operandA (Tensor) – First operand

  • Returns: Buda tensor

  • Return type: Tensor

Module Types

Module(name)

Module class contains a workload that can be assigned to a single device.

PyTorchModule(name, module[, redirect_forward])

A wrapper around a PyTorch module.

TFModule(name, module)

A wrapper around a TF module.

OnnxModule(name, module, onnx_path)

A wrapper around an ONNX module.

PyBudaModule(name)

A base class for all PyBuda modules.

class Module(name: str)

Module class contains a workload that can be assigned to a single device. The workload can be implemented in PyTorch or in PyBuda.

get_device()

Returns the device that this op is placed onto.

  • Returns: Device, or None if op has not been placed yet

  • Return type: Optional[Device]

get_name()

Returns the name of the module.

  • Returns: Name of the module

  • Return type: str

run(*args)

Run inference on this module on a TT device. There should be no other modules manually placed on any devices.

  • Parameters: *args (tensor) – Inference inputs

  • Returns: Outputs of inference

  • Return type: Tuple[Tensor, …]

class PyTorchModule(name: str, module: torch.nn.Module, redirect_forward: bool = True)

A wrapper around a PyTorch module. If placed on a CPU device, PyTorchModules will be executed as is, and if placed on a TT device, modules will be lowered to PyBuda.

forward(*args, **kwargs)

Run PyTorch module forward, with pre-loaded inputs in input queues

  • Parameters:

    • *args – Inputs into the module

    • **kwargs – Keyword inputs into the module

  • Returns: Output tensors, one for each of the module outputs

  • Return type: Tuple[torch.tensor]

backward(*args)

Run PyTorch module backward, with pre-loaded inputs in input queues

  • Parameters: *args (List[Tuple[torch.Tensor, torch.Tensor]]) – List of tuples of output tensors and incoming loss tensors

add_parameter(name: str, parameter: Parameter)

Adds a new parameter.

  • Parameters:

    • name (str) – Parameter name

    • parameter (Parameter) – Parameter to add

set_parameters(**kwargs)

Set parameters (weights) in this module, by name.

  • Parameters: kwargs – Name-value pairs of parameter/weight names and tensor values

get_parameters()

Return the list of parameters defined in this module

  • Returns: List of all parameters in this module

  • Return type: List[Parameter]

class TFModule(name: str, module: tf.keras.Model)

A wrapper around a TF module. Currently, TF modules can only run on a CPU device.

forward(*args, **kwargs)

Run TF module forward, converting pytorch tensors as necessary

  • Parameters:

    • *args – Inputs into the module

    • **kwargs – Keyword inputs into the module

  • Returns: Output tensors, one for each of the module outputs

  • Return type: Tuple[tf.Tensor]

call(*args, **kwargs)

Run TF module forward, with pre-loaded inputs in input queues

  • Parameters:

    • *args – Inputs into the module

    • **kwargs – Keyword inputs into the module

  • Returns: Output tensors, one for each of the module outputs

  • Return type: Tuple[tf.Tensor]

backward(*args)

Run TF module backward, with pre-loaded inputs in input queues

  • Parameters: *args (List[Tuple[tf.Tensor, tf.Tensor]]) – List of tuples of output tensors and incoming loss tensors

class OnnxModule(name: str, module: ModelProto, onnx_path: str)

A wrapper around an ONNX module.

class PyBudaModule(name: str)

A base class for all PyBuda modules. Users should extend this class and implement a forward function containing the workload implementation.
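
A minimal subclass, together with parameter access (names are illustrative; the op wrappers are assumed under pybuda.op):

    import torch
    import pybuda
    import pybuda.op

    class Scale(pybuda.PyBudaModule):
        def __init__(self, name):
            super().__init__(name)
            self.scale = pybuda.Parameter(torch.ones(1, 1), requires_grad=True)

        def forward(self, act):
            return pybuda.op.Multiply("scale_mul", act, self.scale)

    m = Scale("scale")
    print(m.get_parameters())                          # list of pybuda.Parameter
    m.set_parameter("scale", torch.full((1, 1), 2.0))  # parameter name is illustrative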

pre_forward(*args, **kwargs)

Called before forward. Override this function to add custom logic.

add_parameter(name: str, parameter: Parameter, prepend_name: bool = False)

Adds a new parameter.

  • Parameters:

    • name (str) – Parameter name

    • parameter (Parameter) – Parameter to add

    • prepend_name (Bool) – Whether to prepend module name to parameter name

add_constant(name: str, prepend_name: bool = False, shape: Tuple[int] | None = None)

Adds a new constant.

  • Parameters:

    • name (str) – Constant name

    • prepend_name (Bool) – Whether to prepend module name to constant name

    • shape (Tuple[int], optional) – Shape of the constant

get_constant(name)

Gets a constant by name

  • Parameters: name (str) – constant name

  • Returns: constant in module

  • Return type: pybuda.Tensor

set_constant(name: str, data: torch.Tensor | Tensor | ndarray)

Set value for a module constant.

  • Parameters:

    • name (str) – constant name

    • data (SomeTensor) – Tensor value to be set

get_parameter(name)

Gets a parameter by name

  • Parameters: name (str) – Parameter name

  • Returns: Module parameter

  • Return type: Parameter

get_parameters(submodules: bool = True)

Return the list of parameters defined in this module and (optionally) all submodules.

  • Parameters: submodules (bool , optional) – If set, parameters of submodules will be returned, as well. True by default.

  • Returns: List of all parameters in this (and submodules, optionally) module

  • Return type: List[Parameter]

set_parameter(name: str, data: torch.Tensor | Tensor | ndarray)

Set value for a module parameter.

  • Parameters:

    • name (str) – Parameter name

    • data (SomeTensor) – Tensor value to be set

load_parameter_dict(data: Dict[str, torch.Tensor | Tensor | ndarray])

Load all parameter values specified in the dictionary.

  • Parameters: data (Dict[str, SomeTensor]) – Dictionary of name->tensor pairs to be loaded into parameters

insert_tapout_queue_for_op(op_name: str, output_index: int)

Insert an intermediate queue for op (used for checking/debugging)

  • Parameters:

    • op_name (str) – Op name

    • output_index (int) – Index of the output tensor on the op you want to associate with the queue

  • Returns: Unique handle for the tapout queue, used to retrieve values later

  • Return type: IntQueueHandle

Device Types

Device(name[, mp_context])

Device class represents a physical device which can be a Tenstorrent device, or a CPU.

CPUDevice(name[, optimizer_f, scheduler_f, …])

CPUDevice represents a CPU processor.

TTDevice(name, num_chips, chip_ids, arch, …)

TTDevice represents one or more Tenstorrent devices that will receive modules to run.

class Device(name: str, mp_context=None)

Device class represents a physical device which can be a Tenstorrent device, or a CPU. In a typical operation, each device spawns a process on the host CPU which is either used to run commands on the CPU (if device is a CPU), or feeds commands to the Tenstorrent device.

Each device will allocate input queues for the first module it will execute. On a CPU, these are usually some kind of multiprocessing queues with shared memory storage, and Tenstorrent devices have queues in on-device memory.

One or more Modules can be placed on the device to be executed.

place_module(module: Module | Tuple[Module] | List[Module])

Places a module, or list of modules, on this device for execution. Modules will be run as a sequential pipeline on this single device.

  • Parameters: module (Union[Module, Tuple[Module], List[Module]]) – A single Module or a list of Modules to be placed on the device

place_loss_module(module: Module)

Places a module used to calculate loss on this device. This must be the last device in the pipeline.

  • Parameters: module (Module) – A single loss module

remove_loss_module()

Remove module used to calculate loss from this device

push_to_inputs(*tensors: Tuple[torch.Tensor | Tensor, …] | Dict[str, torch.Tensor | Tensor])

Push tensor(s) to module inputs, either in order, or by keyword argument if a dictionary is used. The data will be queued up on the target device until it is ready to be consumed.

This call can block if there is no space on the target device’s input queues.

  • Parameters: *tensors (Union[torch.Tensor, Tensor]) – Ordered list of inputs to be pushed into the module’s input queue. Can be PyTorch or PyBuda tensors.
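
A sketch of a two-stage pipeline driven through these calls (the models and shapes are illustrative):

    import torch
    import pybuda

    cpu0 = pybuda.CPUDevice("cpu0")
    cpu0.place_module(pybuda.PyTorchModule("pre", torch.nn.Linear(64, 64)))

    tt0 = pybuda.TTDevice("tt0")
    tt0.place_module(pybuda.PyTorchModule("main", torch.nn.Linear(64, 64)))

    # Inputs are pushed to the first device in the pipeline
    cpu0.push_to_inputs((torch.rand(1, 64),))
    output_q = pybuda.run_inference()
    print(output_q.get())

    pybuda.shutdown()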

push_to_target_inputs(*tensors)

Push tensor(s) to module training target inputs, in order. The data will be queued up on the target device until it is ready to be consumed.

This call can block if there is no space on the target device’s input queues.

  • Parameters: tensors – Ordered list of inputs to be pushed into the module’s target input queue

push_to_command_queue(cmd)

Send command to the running main loop in another process

get_command_queue_response()

Read from command queue response. This is blocking.

  • Returns: Command-specific dictionary with response data, or None in case of failures

  • Return type: Optional[Dict]

get_next_command(command_queue: Queue)

Read next command to run, from the given command queue. Blocking.

  • Parameters: command_queue (queue.Queue) – Queue of commands

  • Returns: Next command from the queue, or None if shutdown_event was set

  • Return type: Command

run_next_command(cmd: Command)

In concurrent mode, this is called in a forever loop by the process dedicated to this device. In sequential mode, the main process will call this until there’s no more work to do.

  • Parameters: cmd (Command) – Command to execute

  • Returns: True if quit command was seen

  • Return type: bool

dc_transfer_thread(direction: str, direction_queue: Queue)

Keep transferring data in a thread. One per direction.

dc_transfer(direction: str)

Transfer data between devices

run(output_dir: str)

Main process loop in concurrent mode.

The loop receives commands through its command queue, which indicate how many epochs & iterations to run, whether to run training or inference, and position in the pipeline.

The loop will run until a shutdown command is sent in the command queue, or a shutdown event is raised due to an exception in another process.

  • Parameters: output_dir (str) – Output directory needed by perf trace on every process

compile_for(training: bool, microbatch_size: int = 0, microbatch_count: int = 1)

Save microbatch size and count

get_first_targets()

Return the tuple of first targets pushed to this device

get_first_inputs(peek=False)

Return the microbatch size, and first input in microbatch pushed into the device. If input_shapes/input_types are provided, then those will be used to create input tensors.

This is used to compile and optimize the model for dimensions provided by the first input.

shutdown_device()

Check for any mp queues that are not empty, and drain them

cpueval_backward(bw_inputs: List[Tensor], parameters: Dict[str, Tensor])

Evaluate backward pass for verification. cpueval_forward should’ve been called first, with save_for_backward set.

  • Parameters:

    • bw_inputs (List[torch.Tensor]) – BW inputs, i.e. losses for each fw output

    • parameters (Dict[str, torch.Tensor]) – Module parameters

  • Returns:

    • List[Tensor] – Gradients on ordered inputs

    • Dict[str, Tensor] – Gradients on parameters

generate(loop_count: int, write_index: int)

Run generate forward pass on each module on this device, in order

  • Parameters:

    • loop_count (int) – Number of micro-batches to run

    • write_index (int) – Write location for past cache buffers

forward(loop_count: int)

Run forward pass on each module on this device, in order

  • Parameters: loop_count (int) – Number of micro-batches to run

backward(loop_count: int, zero_grad: bool)

Run backward pass on each module on this device, in reverse order

  • Parameters:

    • loop_count (int) – Each mini-batch is broken into micro-batches. This is necessary to fill a multi-device pipeline, and should be roughly 4-6x the number of devices in the pipeline for ideal performance.

    • zero_grad (bool) – Set to true to have optimizer zero out gradients before the run

class CPUDevice(name: str, optimizer_f: Callable | None = None, scheduler_f: Callable | None = None, mp_context=None, retain_backward_graph=False, module: PyTorchModule | List[PyTorchModule] | None = None, input_dtypes: List[dtype] | None = None)

CPUDevice represents a CPU processor. It will spawn a process and run local operations on the assigned processor.

forward_pt(loop_count: int)

Run forward pass on each module on this device, in order

  • Parameters: loop_count (int) – Number of micro-batches to run

forward_tf(loop_count: int)

Run forward pass on each module on this device, in order

  • Parameters: loop_count (int) – Number of micro-batches to run

forward(loop_count: int)

Run forward pass on each module on this device, in order

  • Parameters: loop_count (int) – Number of micro-batches to run

backward(loop_count: int, zero_grad: bool)

Run backward pass on each module on this device, in reverse order

  • Parameters:

    • loop_count (int) – Each mini-batch is broken into micro-batches. This is necessary to fill a multi-device pipeline, and should be roughly 4-6x the number of devices in the pipeline for ideal performance.

    • zero_grad (bool) – Set to true to have optimizer zero out gradients before the run

generate(loop_count: int, write_index: int)

Run forward pass on each module on this device, in order

  • Parameters: loop_count (int) – Number of micro-batches to run

compile_for_pt(inputs: Tuple[Tensor, …], compiler_cfg: CompilerConfig, targets: List[Tensor] = [], microbatch_size: int = 0, microbatch_count: int = 1, verify_cfg: VerifyConfig | None = None)

For a CPU device, there is currently no compilation. This function propagates input shapes through the model to return output shapes and formats.

  • Parameters:

    • inputs (Tuple[Tensor, …]) – Tuple of input tensors. They must have shape and format set, but do not need to hold data unless auto-verification is set.

    • compiler_cfg (CompilerConfig) – Compiler configuration

    • targets (List[Tensor], optional) – Optional list of target tensors, if this device has a loss module

    • microbatch_size (int, optional) – The size of microbatch. Must be non-zero for training mode.

    • microbatch_count (int) – Only relevant for training and TT devices.

    • verify_cfg (Optional[VerifyConfig]) – Optional auto-verification of compile process

  • Returns: Output tensors

  • Return type: Tuple[Tensor, …]

compile_for_tf(inputs: Tuple[Tensor, …], compiler_cfg: CompilerConfig, targets: List[Tensor] = [], microbatch_size: int = 0, verify_cfg: VerifyConfig | None = None)

For a CPU device, there is currently no compilation. This function propagates input shapes through the model to return output shapes and formats.

  • Parameters:

    • inputs (Tuple[Tensor, …]) – Tuple of input tensors. They must have shape and format set, but do not need to hold data unless auto-verification is set.

    • compiler_cfg (CompilerConfig) – Compiler configuration

    • targets (List[Tensor], optional) – Optional list of target tensors, if this device has a loss module

    • microbatch_size (int, optional) – The size of microbatch. Must be non-zero for training mode.

    • verify_cfg (Optional[VerifyConfig]) – Optional auto-verification of compile process

  • Returns: Output tensors

  • Return type: Tuple[Tensor, …]

compile_for(inputs: Tuple[Tensor, …], compiler_cfg: CompilerConfig, targets: List[Tensor] = [], microbatch_size: int = 0, microbatch_count: int = 1, verify_cfg: VerifyConfig | None = None)

For a CPU device, there is currently no compilation. This function propagates input shapes through the model to return output shapes and formats.

  • Parameters:

    • inputs (Tuple[Tensor, …]) – Tuple of input tensors. They must have shape and format set, but do not need to hold data unless auto-verification is set.

    • compiler_cfg (CompilerConfig) – Compiler configuration

    • targets (List[Tensor], optional) – Optional list of target tensors, if this device has a loss module

    • microbatch_size (int, optional) – The size of microbatch. Must be non-zero for training mode.

    • microbatch_count (int) – Only relevant for training and TT devices.

    • verify_cfg (Optional[VerifyConfig]) – Optional auto-verification of compile process

  • Returns: Output tensors

  • Return type: Tuple[Tensor, …]

cpueval_forward_pt(inputs: List[Tensor], parameters: Dict[str, Tensor], save_for_backward: bool, targets: List[Tensor] = [])

Evaluate forward pass for verification

  • Parameters:

    • inputs (List[torch.Tensor]) – One input into the model (for each ordered input node)

    • parameters (Dict[str, torch.Tensor]) – Map of model parameters

    • save_for_backward (bool) – If set, input and output tensors will be saved so we can run the backward pass later.

    • targets (List[torch.Tensor], optional) – If we’re running training, and there’s a loss module on this device, provide target

  • Returns: Forward graph output

  • Return type: List[Tensor]

cpueval_forward_tf(inputs: List[Tensor], parameters: Dict[str, Tensor], save_for_backward: bool, targets: List[Tensor] = [])

Evaluate forward pass for verification

  • Parameters:

    • inputs (List[torch.Tensor]) – One input into the model (for each ordered input node)

    • parameters (Dict[str, torch.Tensor]) – Map of model parameters

    • save_for_backward (bool) – If set, input and output tensors will be saved so we can run the backward pass later.

    • targets (List[torch.Tensor], optional) – If we’re running training, and there’s a loss module on this device, provide target

  • Returns: Forward graph output

  • Return type: List[Tensor]

cpueval_forward(inputs: List[Tensor], parameters: Dict[str, Tensor], save_for_backward: bool, targets: List[Tensor] = [])

Evaluate forward pass for verification

  • Parameters:

    • inputs (List[torch.Tensor]) – One input into the model (for each ordered input node)

    • parameters (Dict[str, torch.Tensor]) – Map of model parameters

    • save_for_backward (bool) – If set, input and output tensors will be saved so we can run the backward pass later.

    • targets (List[torch.Tensor], optional) – If we’re running training, and there’s a loss module on this device, provide target

  • Returns: Forward graph output

  • Return type: List[Tensor]

cpueval_backward(bw_inputs: List[Tensor], parameters: Dict[str, Tensor])

Evaluate backward pass for verification. cpueval_forward should’ve been called first, with save_for_backward set.

  • Parameters:

    • bw_inputs (List[torch.Tensor]) – BW inputs, i.e. losses for each fw output

    • parameters (Dict[str, torch.Tensor]) – Module parameters

  • Returns:

    • List[Tensor] – Gradients on ordered inputs

    • Dict[str, Tensor] – Gradients on parameters

place_module(module: Module | Tuple[Module] | List[Module])

Places a module, or list of modules, on this device for execution. Modules will be run as a sequential pipeline on this single device.

  • Parameters: module (Union[Module, Tuple[Module], List[Module]]) – A single Module or a list of Modules to be placed on the device

pop_parameter_checkpoint()

Return a dictionary of current parameter values for the models on this device.

set_debug_gradient_trace_queue(q: Queue)

[debug feature] Provide a queue to which incoming and outgoing gradients will be stored, for debug tracing.

sync()

Block until queued up commands have completed and the device is idle.

class TTDevice(name: str, num_chips: int | None = None, chip_ids: List[int] | List[Tuple[int]] | None = None, arch: BackendDevice | None = None, devtype: BackendType | None = None, device_mode: DeviceMode | None = None, optimizer: Optimizer | None = None, scheduler: LearningRateScheduler | None = None, fp32_fallback: DataFormat = DataFormat.Float16_b, mp_context=None, module: Module | List[Module] | None = None)

TTDevice represents one or more Tenstorrent devices that will receive modules to run.
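
For example, pinning a device to a specific chip, architecture, and backend type (a sketch; the enum values are taken from pybuda._C.backend_api, as referenced in the signature above):

    import pybuda
    from pybuda._C.backend_api import BackendDevice, BackendType

    tt0 = pybuda.TTDevice(
        "tt0",
        chip_ids=[0],                    # first chip in the cluster
        arch=BackendDevice.Wormhole_B0,  # target architecture
        devtype=BackendType.Silicon,     # run on silicon rather than a simulator
    )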

get_device_config(compiler_cfg=None)

Figure out which silicon devices will be used, if in silicon mode

place_module(module: Module | Tuple[Module] | List[Module])

Places a module, or list of modules, on this device for execution. Modules will be run as a sequential pipeline on this single device.

  • Parameters: module (Union[Module, Tuple[Module], List[Module]]) – A single Module or a list of Modules to be placed on the device

remove_modules()

Remove placed modules, and clear the device

set_active_subgraph(subgraph_index: int)

Set the currently active subgraph by limiting the io queues.

get_active_subgraph()

Gets the currently active subgraph.

generate_graph(*inputs: Tensor, target_tensors: List[Tensor] = [], return_intermediate: bool = False, graph_name: str = 'default_graph', compiler_cfg: CompilerConfig | None = None, trace_only: bool = False, verify_cfg: VerifyConfig | None = None)

Generate a buda graph from the modules on the device, and return the graph and output tensors. If input tensors have a value set, the output tensor will also have the calculated output value set.

  • Parameters:

    • inputs (Tuple[Tensor, …]) – Input tensors

    • target_tensors (List[Tensor]) – Target inputs. Optional, if trace_only is set. Otherwise, value must be provided.

    • return_intermediate (bool) – Optional. If set, a dictionary of node IDs -> tensors will be returned with intermediate values, for data mismatch debug.

    • trace_only (bool) – If set, the graph is made for a quick trace only and shouldn’t have side-effects

  • Returns: Buda graph, outputs, optional intermediates, original inputs, target tensor

  • Return type: Graph, Tuple[Tensor, …], Dict[str, Tensor], Tuple[Tensor, …], Optional[Tensor]

compile_for(inputs: Tuple[Tensor, …], compiler_cfg: CompilerConfig, targets: List[Tensor] = [], microbatch_size: int = 0, microbatch_count: int = 1, verify_cfg: VerifyConfig | None = None)

Compile modules placed on this device, with given input shapes, input formats, and microbatch size.

  • Parameters:

    • training (bool) – Specify whether to compile for training or inference. If set to true, autograd will be executed before the compile.

    • inputs (Tuple[Tensor, …]) – Tuple of input tensors. They must have shape and format set, but do not need to hold data unless auto-verification is set.

    • compiler_cfg (CompilerConfig) – Compiler configuration

    • targets (List[Tensor], optional) – Optional list of target tensors, if this device has a loss module

    • microbatch_size (int, optional) – The size of microbatch. Must be non-zero for training mode.

    • microbatch_count (int) – Only relevant for training. This represents the number of microbatches that are pushed through fwd path before bwd path runs. The device will ensure that buffering is large enough to contain microbatch_count number of microbatch intermediate data.

    • verify_cfg (Optional[VerifyConfig]) – Optional auto-verification of compile process

  • Returns: Output tensors

  • Return type: Tuple[Tensor, …]

forward(loop_count: int)

Run forward pass on each module on this device, in order

  • Parameters: loop_count (int) – Number of micro-batches to run

generate(loop_count: int, write_index: int, tokens_per_iter: int, token_id: int)

Run forward pass on each module on this device, in order

  • Parameters: loop_count (int) – Number of micro-batches to run

cpueval_forward(inputs: List[Tensor], parameters: Dict[str, Tensor], save_for_backward: bool, targets: List[Tensor] = [])

Evaluate forward pass for verification

  • Parameters:

    • inputs (List[torch.Tensor]) – One input into the model (for each ordered input node)

    • parameters (Dict[str, torch.Tensor]) – Map of model parameters

    • save_for_backward (bool) – If set, input and output tensors will be saved so we can run the backward pass later.

    • targets (List[torch.Tensor], optional) – If we’re running training, and there’s a loss module on this device, provide target

  • Returns: Forward graph output

  • Return type: List[Tensor]

backward(loop_count: int, zero_grad: bool)

Run backward pass on each module on this device, in reverse order

  • Parameters:

    • loop_count (int) – Each mini-batch is broken into micro-batches. This is necessary to fill a multi-device pipeline, and should be roughly 4-6x the number of devices in the pipeline for ideal performance.

    • zero_grad (bool) – Set to true to have optimizer zero out gradients before the run

get_parameter_checkpoint()

Return a dictionary of current parameter values for the models on this device

get_all_parameters()

Return a dictionary of current parameter values for the models on this device

get_parameter_gradients()

Return a dictionary of currently accumulated gradient values for the models on this device

get_parameters(ignore_unused_parameters: bool = True)

  • Parameters: ignore_unused_parameters (bool) – If true, any parameter not recorded by the graph trace (i.e. the parameter is unused in graph execution) is not included in the list returned to the user.

get_optimizer_params(is_buda: bool)

Return a dictionary of dictionaries of optimizer parameters for each model parameter.

get_scheduler_params(is_buda: bool)

Return a dictionary of dictionaries of optimizer parameters used by scheduler.

get_dram_io_queues(queue_type: str)

Returns the appropriate queue description, tile broadcast information, and original shapes, where applicable

shutdown_device()

Shutdown device at the end of the workload

sync()

Block until queued up commands have completed and the device is idle.

Miscellaneous

DataFormat

Enumeration of supported data formats.

MathFidelity

Enumeration of math fidelity levels.

class DataFormat

Members:

Float32

Float16

Bfp8

Bfp4

Bfp2

Float16_b

Bfp8_b

Bfp4_b

Bfp2_b

Lf8

UInt16

Int8

RawUInt8

RawUInt16

RawUInt32

Int32

Invalid

from_json(self: str)

property name

to_json(self: pybuda._C.DataFormat)

class MathFidelity

Members:

LoFi

HiFi2

HiFi3

HiFi4

Invalid

from_json(self: str)

property name

to_json(self: pybuda._C.MathFidelity)