API Reference
Python Runtime API
run_inference – Main “run” function for inference.
run_training – Main “run” function for training.
initialize_pipeline – Initialize the pipeline to run inference and training through manual run_forward, run_backward, run_optimizer, etc.
run_forward – Run forward passes on the pre-compiled and initialized pipeline of devices.
run_backward – Run backward passes on the pre-compiled and initialized pipeline of devices.
run_optimizer – Run optimizer on all devices.
get_parameter_checkpoint – Return current parameter values.
get_parameter_gradients – Return currently accumulated parameter gradients.
update_device_parameters – Push new parameters onto the given device, or all devices in the pipeline if none is provided.
shutdown – Shut down running processes and clean up pybuda.
run_inference(module: PyBudaModule | None = None, inputs: List[Tuple[torch.Tensor | Tensor, …] | Dict[str, torch.Tensor | Tensor]] = [], input_count: int = 1, output_queue: Queue | None = None, _sequential: bool = False, _perf_trace: bool = False, _verify_cfg: VerifyConfig | None = None)
Main “run” function for inference. After all modules have been defined and placed on devices, this will execute the workload. Unless ‘sequential’ is set, the function will return as soon as the devices are set up to run, and inference will run as long as new inputs are pushed into the device(s). If sequential mode is on, the function will run through inputs that are already in the input buffer and return when done.
Parameters:
module (PyBudaModule, optional) – If provided, place the given module on a TT device and run inference. Alternatively, manually create device(s) and place module(s) on them.
inputs (List[Union[Tuple[Union[torch.Tensor, Tensor], …], Dict[str, Union[torch.Tensor, Tensor]]]], optional) – An optional list of input tensor tuples or dictionaries (passed as args or kwargs to the module) to feed into the inference pipeline. Alternatively, use device.push_to_inputs to manually provide inputs outside of this call.
input_count (int , default=1) – The number of inputs to run inference on. If 0, inference will run “forever”, until shutdown or run_inference is called again.
output_queue (queue.Queue , optional) – If provided, outputs will be pushed into the queue as they are calculated. Otherwise, one will be created and returned.
_sequential (bool , Internal) – Don’t use.
_perf_trace (bool , Internal) – Don’t use.
_verify_cfg (Internal) – Don’t use.
Returns: Queue holding the output results. Either the output_queue provided, or one that’s created.
Return type: queue.Queue
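A minimal sketch of the all-in-one inference flow, based on the signature above. MyModule is a hypothetical user-defined PyBudaModule, and shapes are illustrative:

```python
import torch
import pybuda

# MyModule is a hypothetical user-defined PyBudaModule (see Module Types).
module = MyModule("my_module")

# One tuple of input tensors per inference iteration; shapes are illustrative.
inputs = [(torch.rand(1, 128, 128),)]

output_q = pybuda.run_inference(module, inputs=inputs, input_count=1)
result = output_q.get()  # blocks until the first output is available
pybuda.shutdown()
```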
run_training(epochs: int = 1, steps: int = 1, accumulation_steps: int = 1, microbatch_count: int = 1, checkpoint_queue: Queue | None = None, loss_queue: Queue | None = None, checkpoint_interval: int = 0, _sequential: bool = False, _perf_trace: bool = False, _verify_cfg: VerifyConfig | None = None)
Main “run” function for training. After all modules have been defined and placed on devices, this will execute the workload.
Parameters:
epochs (int) – The number of epochs to run. The scheduler, if provided, will be stepped after each one.
steps (int) – The number of batches to run. After every step, the optimizer will be stepped.
accumulation_steps (int) – The number of mini-batches in a batch. Each mini-batch is limited in size by how much of the intermediate data can fit in device memory.
microbatch_count (int) – Each mini-batch is optionally further broken into micro-batches. This is necessary to fill a multi-device pipeline, and should be roughly 4-6x the number of devices in the pipeline for ideal performance.
checkpoint_queue (Queue , optional) – If provided, weight checkpoints will be pushed into this queue, along with the final set of weights. If one is not provided, one will be created and returned.
loss_queue (Queue , optional) – If provided, loss values will be pushed into this queue.
checkpoint_interval (int , optional) – The weights will be checkpointed into checkpoint queues on host every checkpoint_interval optimizer steps, if set to non-zero. Zero by default.
_sequential (Internal) – Don’t use
_perf_trace (Internal) – Don’t use
_verify_cfg (Internal) – Don’t use.
Returns: Checkpoint queue, holding weight checkpoints, and final trained weights.
Return type: queue.Queue
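A sketch of a training run, assuming a device with a module and loss module has already been set up (see Device Types below). The module names are hypothetical, and the optimizer constructor form is an assumption:

```python
import torch
import pybuda

# Hypothetical modules; TTDevice and placement APIs are documented below.
tt0 = pybuda.TTDevice("tt0", optimizer=pybuda.optimizers.SGD(learning_rate=0.1))
tt0.place_module(MyModule("my_module"))
tt0.place_loss_module(MyLoss("loss"))

# Push one batch of inputs and training targets.
tt0.push_to_inputs(torch.rand(4, 128, 128))
tt0.push_to_target_inputs(torch.rand(4, 128))

checkpoint_q = pybuda.run_training(epochs=1, steps=1, checkpoint_interval=1)
final_weights = checkpoint_q.get()
pybuda.shutdown()
```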
shutdown()
Shutdown running processes and clean up pybuda
initialize_pipeline(training: bool, output_queue: Queue | None = None, checkpoint_queue: Queue | None = None, sample_inputs: Tuple[torch.Tensor | Tensor, …] | Dict[str, torch.Tensor | Tensor] = (), sample_targets: Tuple[torch.Tensor | Tensor, …] = (), microbatch_count: int = 1, d2d_fwd_queues: List[Queue] = [], d2d_bwd_queues: List[Queue] = [], _sequential: bool = False, _verify_cfg: VerifyConfig | None = None, _device_mode: DeviceMode = DeviceMode.CompileAndRun)
Initialize the pipeline to run inference and training through manual run_forward, run_backward, run_optimizer, etc. calls. This should not be used with “all-in-one” APIs like run_inference and run_training, which will initialize the pipeline themselves.
Parameters:
training (bool) – Set to true to prepare the pipeline for training.
output_queue (queue.Queue, optional) – If provided, inference outputs will be pushed into the queue as they are calculated. Otherwise, one will be created and returned (in inference mode).
checkpoint_queue (Queue, optional) – If provided, weight checkpoints will be pushed into this queue, along with the final set of weights. If one is not provided, one will be created and returned (in training mode).
sample_inputs (Tuple[Union[torch.Tensor, Tensor], …], optional) – If calling initialize_pipeline directly to compile models and initialize devices, then a representative sample of inputs must be provided to accurately compile the design. Typically, this would be the first input that will be sent through the model post-compile. The tensors must be of the correct shape and data type.
sample_targets (Tuple[Union[torch.Tensor, Tensor], …], optional) – If calling initialize_pipeline directly to compile models and initialize devices for training, then a representative sample of training targets must be provided to accurately compile the design. Typically, this would be the first target that will be sent to the last device post-compile. The tensors must be of the correct shape and data type.
microbatch_count (int) – Only relevant for training. This represents the number of microbatches that are pushed through the fwd path before the bwd path runs. The device will ensure that buffering is large enough to hold intermediate data for microbatch_count microbatches.
d2d_fwd_queues (List[queue.Queue], optional) – If provided, device-to-device intermediate data that passes through the host will also be stored in the provided queues. The queues are assigned in order from the first device in the pipeline. The last device will not be assigned a queue.
d2d_bwd_queues (List[queue.Queue], optional) – If provided, device-to-device intermediate data in the training backward pass that passes through the host will also be stored in the provided queues. The queues are assigned in order from the second device in the pipeline. The first device will not be assigned a queue.
_sequential (Internal) – Don’t use
_verify_cfg (Internal) – Don’t use.
Returns: Output queue for inference, or checkpoint queue for training
Return type: queue.Queue
run_forward(input_count: int = 1, _sequential: bool = False)
Run forward passes on the pre-compiled and initialized pipeline of devices. This API should be called from custom implementations of inference and training loops, in lieu of calling the run_inference and run_training APIs.
If this is a part of an inference run, the results will be placed in the output queues which should have already been set up through the initialize_pipeline call. If this is called as a part of the training pass, then loss will be pushed to the output queue, if one was set up.
Parameters:
input_count (int , default=1) – The number of inputs to run forward passes on. If 0, passes will run “forever”, until shutdown.
_sequential (Internal) – Don’t use
run_backward(input_count: int = 1, zero_grad: bool = False, _sequential: bool = False)
Run backward passes on the pre-compiled and initialized pipeline of devices. This API should be called from custom implementations of inference and training loops, in lieu of calling the run_inference and run_training APIs.
zero_grad should be set for the first backward call of a batch, to zero out accumulated gradients.
No results will be returned. get_parameter_gradients() can be used to get a snapshot of gradients after the backward pass has completed.
Parameters:
input_count (int , default=1) – The number of inputs to run backward passes on. If 0, passes will run “forever”, until shutdown.
zero_grad (bool , optional) – If set, accumulated gradients on device will be zeroed out before the backward pass begins.
_sequential (Internal) – Don’t use
run_optimizer(checkpoint: bool = False, _sequential: bool = False)
Run optimizer on all devices. If checkpoint is set, a checkpoint of parameters will be taken and placed into the checkpoint queue that has been set up during initialize_pipeline call.
Parameters:
checkpoint (bool , optional) – If set, checkpoint of parameters will be placed into checkpoint queue.
_sequential (Internal) – Don’t use
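A sketch of a manual training loop built from the calls above, in lieu of run_training. Device setup is assumed to match the run_training example earlier, and shapes are illustrative:

```python
import torch
import pybuda

# Compile and initialize with representative sample shapes.
checkpoint_q = pybuda.initialize_pipeline(
    training=True,
    sample_inputs=(torch.rand(4, 128, 128),),
    sample_targets=(torch.rand(4, 128),),
)

for step in range(10):
    tt0.push_to_inputs(torch.rand(4, 128, 128))      # tt0 created earlier
    tt0.push_to_target_inputs(torch.rand(4, 128))
    pybuda.run_forward(input_count=1)
    pybuda.run_backward(input_count=1, zero_grad=True)  # new batch each step
    pybuda.run_optimizer(checkpoint=(step % 5 == 4))    # checkpoint every 5 steps

weights = pybuda.get_parameter_checkpoint()  # list of per-device name->Tensor dicts
pybuda.shutdown()
```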
get_parameter_checkpoint(device: CPUDevice | TTDevice | None = None, _sequential: bool = False)
Return current parameter values. If a device is specified, only parameters for that device will be returned, otherwise a list of parameters for all devices will come back.
Parameters:
device (CPUDevice | TTDevice , optional) – Device to read parameter values from. If not provided, all devices in the pipeline will be read.
_sequential (Internal) – Don’t use
Returns: List of parameter checkpoints for devices in the pipeline, or the given device
Return type: List[Dict[str, Tensor]]
get_parameter_gradients(device: CPUDevice | TTDevice | None = None, _sequential: bool = False)
Return currently accumulated parameter gradients. If a device is specified, only gradients for that device will be returned, otherwise a list of gradients for all devices will come back.
Parameters:
device (CPUDevice | TTDevice , optional) – Device to read gradients from. If not provided, all devices in the pipeline will be read.
_sequential (Internal) – Don’t use
Returns: List of parameter gradients for devices in the pipeline, or the given device
Return type: List[Dict[str, Tensor]]
update_device_parameters(device: CPUDevice | TTDevice | None = None, parameters: List[Dict[str, Tensor]] = [], _sequential: bool = False)
Push new parameters onto the given device, or onto all devices in the pipeline if none is provided.
C++ Runtime API
The BUDA Backend used by Python Runtime can be optionally used stand-alone to run pre-compiled TTI models. The API reference for stand-alone BUDA Backend Runtime can be found here.
Configuration and Placement
set_configuration_options – Set global compile configuration options.
set_epoch_break – Instruct place & route to start a new placement epoch on the given op(s).
set_chip_break – Instruct place & route to start placing ops on the next chip in the pipeline.
override_op_size – Override automatic op sizing with the given grid size.
detect_available_devices – Return a list of available devices on the system.
set_configuration_options(enable_recompute: bool | None = None, balancer_policy: str | None = None, place_on_one_row: bool | None = None, enable_t_streaming: bool | None = None, manual_t_streaming: bool | None = None, enable_consteval: bool | None = None, default_df_override: DataFormat | None = None, accumulate_df: DataFormat | None = None, math_fidelity: MathFidelity | None = None, performance_trace: PerfTraceLevel | None = None, backend_opt_level: int | None = None, backend_output_dir: str | None = None, backend_device_descriptor_path: str | None = None, backend_cluster_descriptor_path: str | None = None, backend_runtime_params_path: str | None = None, backend_runtime_args: str | None = None, enable_auto_fusing: bool | None = None, enable_conv_prestride: bool | None = None, enable_stable_softmax: bool | None = None, amp_level: int | None = None, harvested_rows: List[List[int]] | None = None, store_backend_db_to_yaml: bool | None = None, input_queues_on_host: bool | None = None, output_queues_on_host: bool | None = None, enable_auto_transposing_placement: bool | None = None, use_interactive_placer: bool | None = None, op_intermediates_to_save: List[str] | None = None, enable_enumerate_u_kt: bool | None = None, enable_device_tilize: bool | None = None, dram_placement_algorithm: DRAMPlacementAlgorithm | None = None, chip_placement_policy: str | None = None, enable_forked_dram_inputs: bool | None = None, device_config: str | None = None)
Set global compile configuration options.
Parameters:
enable_recompute (Optional[bool]) – For training only. Enable the ‘recompute’ feature, which significantly reduces memory requirements at a cost of some performance.
balancer_policy (Optional[str]) – Override the default place & route policy. Valid values are:
“NLP”: Custom policy with reasonable defaults for NLP-like models
“Ribbon”: Custom policy with reasonable defaults for CNN-like models
[DEBUG ONLY] “MaximizeTMinimizeGrid”: Maximize t-streaming. Verification only.
[DEBUG ONLY] “MinimizeGrid”: Super simple policy that always chooses the smallest grid. Verification only.
[DEBUG ONLY] “Random”: Pick random valid grids for each op. Verification only.
[DEPRECATED] “CNN”
place_on_one_row (Optional[bool]) – For place & route to place every op on one row of cores only.
enable_t_streaming (Optional[bool]) – Enable a buffering optimization which reduces memory usage and latency.
manual_t_streaming (Optional[bool]) – Only respect override_t_stream_dir op overrides, otherwise no streaming. enable_t_streaming must also be true for this to take effect.
enable_consteval (Optional[bool]) – Use constant propagation to simplify the model.
default_df_override (Optional[DataFormat], None default) – Set the default override for all node data formats. None means automatically inferred.
accumulate_df (Optional[DataFormat], Float16_b default) – Set the default accumulation format for all operations, if supported by the device.
math_fidelity (Optional[MathFidelity], MathFidelity.HiFi3 default) – Set the default math fidelity for all operations.
performance_trace (Optional[PerfTraceLevel]) – Set to a value other than None to enable performance tracing. Note that the Verbose level can impact performance due to the amount of data being captured and stored.
backend_opt_level (Optional[int]) – The level of performance optimization in the backend runtime (0-3).
backend_output_dir (Optional[str]) – Set the location for backend compile temporary files and binaries.
backend_device_descriptor_path (Optional[str]) – Set the location of the YAML file from which the device descriptor is loaded.
backend_cluster_descriptor_path (Optional[str]) – Set the location of the YAML file from which the multi-device cluster descriptor is loaded.
backend_runtime_params_path (Optional[str]) – Set the location of the YAML file used to dump/load backend database configurations.
enable_auto_fusing (Optional[bool]) – Enable automatic fusing of small operations into complex ops.
enable_conv_prestride (Optional[bool]) – Enable host-side convolution prestriding (occurs during host tilizing) for a more efficient first convolution layer.
amp_level (Optional[int]) – Configures the optimization setting for Automatic Mixed Precision (AMP). 0: no optimization (default). 1: optimizer ops are set with { OutputDataFormat.Float32, MathFidelity.HiFi4 }.
harvested_rows (Optional[List[int]]) – Configures manually induced harvested rows. Only row indices within 1-5 or 7-11 are harvestable.
store_backend_db_to_yaml (Optional[bool]) – Enable automatic dumping of the backend database configuration to the YAML file specified with backend_runtime_params_path. Note that all backend configurations are loaded from that YAML file if an existing file is specified and this flag is set to False.
use_interactive_placer (Optional[bool]) – Enable or disable usage of the interactive placer within balancer policies which support it. Enabled by default.
enable_device_tilize (Optional[bool]) – Enable or disable the Tilize op on the embedded platform.
chip_placement_policy (Optional[str]) – Determine the order of the chip ids used in placement.
dram_placement_algorithm (Optional[DRAMPlacementAlgorithm]) – Set the algorithm to use for DRAM placement. Valid values are: ROUND_ROBIN, ROUND_ROBIN_FLIP_FLOP, GREATEST_CAPACITY, CLOSEST.
enable_forked_dram_inputs (Optional[bool]) – Enable or disable the forked DRAM input optimization.
device_config (Optional[str]) – Configure and set runtime_param.yaml for offline WH compile based on the value. YAML files for supported configurations are mapped in ‘supported_backend_configurations’.
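An illustrative configuration call. The option values are examples rather than recommendations, and the top-level imports of DataFormat and MathFidelity are an assumption:

```python
import pybuda
from pybuda import DataFormat, MathFidelity  # top-level re-exports are an assumption

pybuda.set_configuration_options(
    balancer_policy="Ribbon",                  # CNN-friendly placement policy
    enable_t_streaming=True,                   # reduce memory usage and latency
    default_df_override=DataFormat.Float16_b,  # default data format for all nodes
    math_fidelity=MathFidelity.HiFi3,
    backend_opt_level=3,                       # maximum backend optimization
)
```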
set_epoch_break(op_names: str | NodePredicateBuilder | List[str | NodePredicateBuilder])
Instruct place & route to start a new placement epoch on the given op(s)
Parameters: op_names (Union[str, query.NodePredicateBuilder, List[Union[str, query.NodePredicateBuilder]]]) – Op or ops or predicate matches to start a new placement epoch
set_chip_break(op_names: str | NodePredicateBuilder | List[str | NodePredicateBuilder])
Instruct place & route to start placing ops on the next chip in the pipeline.
Parameters: op_names (Union[str, query.NodePredicateBuilder, List[Union[str, query.NodePredicateBuilder]]]) – Op or ops or predicate matches to start a new chip
override_op_size(op_name: str, grid_size: Tuple[int, int])
Override automatic op sizing with given grid size.
Parameters:
op_name (str) – Name of the op to override
grid_size (Tuple[int, int]) – Rectangular shape (row, column) of the placed op
detect_available_devices()
Returns a list of available devices on the system.
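A sketch of manual placement control using the calls above. The op names are hypothetical and must match op names in the compiled graph:

```python
import pybuda

pybuda.set_epoch_break("encoder1_matmul")           # start a new placement epoch here
pybuda.set_chip_break("decoder0_matmul")            # continue on the next chip
pybuda.override_op_size("encoder1_matmul", (2, 4))  # force a 2x4 grid of cores

print(pybuda.detect_available_devices())            # enumerate attached devices
```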
Operations
General
Matmul – Matrix multiplication transformation on input activations, with optional bias.
Add – Elementwise add of two tensors.
Subtract – Elementwise subtraction of two tensors.
Multiply – Elementwise multiply of two tensors.
ReduceSum – Reduce by summing along the given dimension.
ReduceAvg – Reduce by averaging along the given dimension.
Constant – Op representing a user-defined constant.
Identity – Identity operation.
Buffer – Identity operation that is not removed during lowering.
Matmul(name: str, operandA: Tensor, operandB: Tensor | Parameter, bias: Tensor | Parameter | None = None)
Matrix multiplication transformation on input activations, with optional bias. y = ab + bias
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – Input operand A
operandB (Tensor) – Input operand B
bias (Tensor , optional) – Optional bias tensor
Add(name: str, operandA: Tensor, operandB: Tensor | Parameter)
Elementwise add of two tensors
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
operandB (Tensor) – Second operand
Returns: Buda tensor
Return type: Tensor
Subtract(name: str, operandA: Tensor, operandB: Tensor)
Elementwise subtraction of two tensors
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
operandB (Tensor) – Second operand
Returns: Buda tensor
Return type: Tensor
Multiply(name: str, operandA: Tensor, operandB: Tensor | Parameter)
Elementwise multiply of two tensors
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
operandB (Tensor) – Second operand
Returns: Buda tensor
Return type: Tensor
Identity(name: str, operandA: Tensor, unsqueeze: str | None = None, unsqueeze_dim: int | None = None)
Identity operation.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
unsqueeze (str) – If set, the operation returns a new tensor with a dimension of size one inserted at the specified position.
unsqueeze_dim (int) – The index at which the singleton dimension is inserted
Returns: Buda tensor
Return type: Tensor
Buffer(name: str, operandA: Tensor)
Identity operation. The key difference from Identity is that a Buffer op will not be lowered into a NOP, so it is not removed by the time it reaches lowering.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
Returns: Buda tensor
Return type: Tensor
ReduceSum(name: str, operandA: Tensor, dim: int)
Reduce by summing along the given dimension
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
dim (int) – Dimension along which to reduce. A positive number 0 - 3 or negative from -1 to -4.
Returns: Buda tensor
Return type: Tensor
ReduceAvg(name: str, operandA: Tensor, dim: int)
Reduce by averaging along the given dimension
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
dim (int) – Dimension along which to reduce. A positive number 0 - 3 or negative from -1 to -4.
Returns: Buda tensor
Return type: Tensor
Constant(name: str, *, constant: float)
Op representing user-defined constant
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
constant (float) – Constant value
Returns: Buda tensor
Return type: Tensor
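A minimal sketch of a PyBudaModule composed from the ops above. The pybuda.op namespace and the Parameter constructor form are assumptions, and shapes are illustrative:

```python
import torch
import pybuda

class LinearBlock(pybuda.PyBudaModule):
    def __init__(self, name: str):
        super().__init__(name)
        # Parameter construction from a torch tensor is an assumption.
        self.weights = pybuda.Parameter(torch.rand(128, 128), requires_grad=True)
        self.bias = pybuda.Parameter(torch.rand(1, 128), requires_grad=True)

    def forward(self, x):
        # Matmul with fused bias, then reduce along the last dimension.
        y = pybuda.op.Matmul("mm", x, self.weights, self.bias)
        return pybuda.op.ReduceAvg("avg", y, -1)
```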
Transformations
HSlice – Slice along horizontal axis into given number of pieces.
VSlice – Slice along vertical axis into given number of pieces.
HStack – Stack Z dimension along horizontal dimension.
VStack – Stack Z dimension along vertical dimension.
Reshape – TM (tensor manipulation) op.
Index – TM op.
Select – TM op.
Pad – TM op.
Concatenate – Concatenate tensors along axis.
Heaviside – Elementwise Heaviside step function of two tensors.
BinaryStack – Stack two tensors along the given dimension.
Heaviside(name: str, operandA: Tensor, operandB: Tensor | Parameter)
Elementwise Heaviside step function of two tensors
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
operandB (Tensor) – Second operand
Returns: Buda tensor
Return type: Tensor
BinaryStack(name: str, operandA: Tensor, operandB: Tensor | Parameter, dim: int)
Stack two tensors along the given dimension
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
operandB (Tensor) – Second operand
dim (int) – Dimension along which to stack
Returns: Buda tensor
Return type: Tensor
HSlice(name: str, operandA: Tensor, slices: int)
Slice along horizontal axis into given number of pieces.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
slices (int) – The number of slices to create
Returns: Buda tensor
Return type: Tensor
HStack(name: str, operandA: Tensor, slices: int = -1)
Stack Z dimension along horizontal dimension.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
slices (int , optional) – The number of slices to create. If not provided, it will be equal to current Z dimension.
Returns: Buda tensor
Return type: Tensor
VSlice(name: str, operandA: Tensor, slices: int)
Slice along vertical axis into given number of pieces.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
slices (int) – The number of slices to create
Returns: Buda tensor
Return type: Tensor
VStack(name: str, operandA: Tensor, slices: int = -1)
Stack Z dimension along vertical dimension.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
slices (int , optional) – The number of slices to create. If not provided, it will be equal to current Z dimension.
Returns: Buda tensor
Return type: Tensor
Reshape(name: str, operandA: Tensor, shape: Tuple[int, …])
TM (tensor manipulation) op: reshape the input tensor into the given shape.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – Input operand A
shape (Tuple[int, …]) – Target shape
Returns: Buda tensor
Return type: Tensor
Index(name: str, operandA: Tensor, dim: int, start: int, stop: int | None = None, stride: int = 1)
TM (tensor manipulation) op: slice out the range [start, stop) along the given dimension, with the given stride.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – Input operand A
dim (int) – Dimension to slice
start (int) – Starting slice index (inclusive)
stop (int) – Stopping slice index (exclusive)
stride (int) – Stride amount along that dimension
Returns: Buda tensor
Return type: Tensor
Select(name: str, operandA: Tensor, dim: int, index: int | Tuple[int, int], stride: int = 0)
TM (tensor manipulation) op: select an index or index range along the given dimension, with optional stride.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – Input operand A
dim (int) – Dimension to slice
index (Union[int, Tuple[int, int]]) – int: index to select from that dimension; [start, length]: index range to select from that dimension
stride (int) – Stride amount along that dimension
Returns: Buda tensor
Return type: Tensor
Pad(name: str, operandA: Tensor, pad: Tuple[int, int, int, int] | Tuple[int, int], mode: str = 'constant', channel_last: bool = False)
TM (tensor manipulation) op: pad the input tensor along its last one or two dimensions.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – Input operand A
pad (tuple) – Either (padding_left, padding_right) or (padding_left, padding_right, padding_top, padding_bottom)
Returns: Buda tensor
Return type: Tensor
Concatenate(name: str, *operands: Tensor, axis: int)
Concatenate tensors along axis
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operands (Tuple[Tensor, …]) – Tensors to be concatenated
axis (int) – concatenate axis
Returns: Buda tensor
Return type: Tensor
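A sketch of the slice/stack pairs in use, e.g. splitting attention heads into the Z dimension and merging them back. The pybuda.op naming is assumed and shapes are illustrative:

```python
import pybuda

def split_and_merge_heads(x, num_heads: int):
    # x: (1, 1, 128, 512) -> heads: (1, num_heads, 128, 512 // num_heads)
    heads = pybuda.op.HSlice("split_heads", x, num_heads)
    # ... per-head attention math would go here ...
    return pybuda.op.HStack("merge_heads", heads)  # back to (1, 1, 128, 512)
```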
Activations
Relu(name: str, operandA: Tensor, threshold=0.0, mode='min')
ReLU
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
Returns: Buda tensor
Return type: Tensor
Gelu(name: str, operandA: Tensor, approximate='none')
GeLU
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
approximate (str) – The gelu approximation algorithm to use: ‘none’ | ‘tanh’. Default: ‘none’
Returns: Buda tensor
Return type: Tensor
Sigmoid(name: str, operandA: Tensor)
Sigmoid operation.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
Returns: Buda tensor
Return type: Tensor
Math
Exp – Exponent operation.
Reciprocal – Reciprocal operation.
Sqrt – Square root.
Log – Log operation: natural logarithm of the elements of operandA.
Abs – Absolute value operation.
Clip – Clips tensor values between min and max.
Max – Elementwise max of two tensors.
Argmax – Argmax operation.
Max(name: str, operandA: Tensor, operandB: Tensor | Parameter)
Elementwise max of two tensors
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
operandB (Tensor) – Second operand
Returns: Buda tensor
Return type: Tensor
Exp(name: str, operandA: Tensor)
Exponent operation.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
Returns: Buda tensor
Return type: Tensor
Reciprocal(name: str, operandA: Tensor)
Reciprocal operation.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
Returns: Buda tensor
Return type: Tensor
Sqrt(name: str, operandA: Tensor)
Square root.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
Returns: Buda tensor
Return type: Tensor
Log(name: str, operandA: Tensor)
Log operation: natural logarithm of the elements of operandA : yi = log_e(xi) for all xi in operandA tensor
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
Returns: Buda tensor
Return type: Tensor
Argmax(name: str, operandA: Tensor, dim: int | None = None)
Argmax operation: returns indices of the maximum values along the given dimension.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
dim (int , optional) – Dimension along which to compute the argmax
Returns: Buda tensor
Return type: Tensor
Abs(name: str, operandA: Tensor)
Absolute value operation.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
Returns: Buda tensor
Return type: Tensor
Clip(name: str, operandA: Tensor, min: float, max: float)
Clips tensor values between min and max
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
min (float) – Minimum value
max (float) – Maximum value
Returns: Buda tensor
Return type: Tensor
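An illustrative composition of the math ops above: a hand-built softmax from Exp, ReduceSum and Reciprocal. The built-in Softmax op below is preferred in practice, and broadcasting on Multiply is assumed:

```python
import pybuda

def manual_softmax(x):
    e = pybuda.op.Exp("exp", x)
    total = pybuda.op.ReduceSum("sum", e, -1)       # sum along the last dimension
    inv = pybuda.op.Reciprocal("recip", total)
    return pybuda.op.Multiply("normalize", e, inv)  # broadcast divide
```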
Convolutions
Conv2d – Conv2d transformation on input activations, with optional bias.
Conv2dTranspose – Conv2dTranspose transformation on input activations, with optional bias.
MaxPool2d – Maxpool2d transformation on input activations.
AvgPool2d – Avgpool2d transformation on input activations.
Conv2d(name: str, activations: Tensor, weights: Tensor | Parameter, bias: Tensor | Parameter | None = None, stride: int = 1, padding: int | str | List = 'same', dilation: int = 1, groups: int = 1, channel_last: bool = False)
Conv2d transformation on input activations, with optional bias.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
activations (Tensor) – Input activations of shape (N, Cin, iH, iW)
weights –
Tensor: Input weights of shape (Cout, Cin / groups, kH, kW)
[Tensor]: Internal use. Optional pre-split input weights list of shape [(weight_grouping, Cin / groups, Cout)], of length (K*K // weight_grouping)
bias (Tensor , optional) – Optional bias tensor of shape (Cout)
Conv2dTranspose(name: str, activations: Tensor, weights: Tensor | Parameter, bias: Tensor | Parameter | None = None, stride: int = 1, padding: int | str = 'same', dilation: int = 1, groups: int = 1, channel_last: bool = False)
Conv2dTranspose transformation on input activations, with optional bias.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
activations (Tensor) – Input activations of shape (N, Cin, iH, iW)
weights –
Tensor: Input weights of shape (Cout, Cin / groups, kH, kW)
[Tensor]: Internal use. Optional pre-split input weights list of shape [(weight_grouping, Cin / groups, Cout)], of length (K*K // weight_grouping)
bias (Tensor , optional) – Optional bias tensor of shape (Cout)
MaxPool2d(name: str, activations: Tensor, kernel_size: int | Tuple[int, int], stride: int = 1, padding: int | str = 'same', dilation: int = 1, ceil_mode: bool = False, return_indices: bool = False, max_pool_add_sub_surround: bool = False, max_pool_add_sub_surround_value: float = 1.0, channel_last: bool = False)
Maxpool2d transformation on input activations
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
activations (Tensor) – Input activations of shape (N, Cin, iH, iW)
kernel_size – Size of pooling region
AvgPool2d(name: str, activations: Tensor, kernel_size: int | Tuple[int, int], stride: int = 1, padding: int | str = 'same', ceil_mode: bool = False, count_include_pad: bool = True, divisor_override: float | None = None, channel_last: bool = False)
Avgpool2d transformation on input activations
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
activations (Tensor) – Input activations of shape (N, Cin, iH, iW)
kernel_size – Size of pooling region
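A sketch of a convolution stage followed by pooling. Weight and bias construction is assumed to happen elsewhere, and shapes are illustrative:

```python
import pybuda

def conv_stage(name: str, x, weights, bias):
    # x: (N, Cin, iH, iW); weights: (Cout, Cin, kH, kW); bias: (Cout)
    y = pybuda.op.Conv2d(f"{name}_conv", x, weights, bias,
                         stride=1, padding="same")
    return pybuda.op.MaxPool2d(f"{name}_pool", y, kernel_size=2, stride=2)
```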
NN
Softmax(name: str, operandA: Tensor, *, dim: int, stable: bool = True)
Softmax operation.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
dim (int) – A dimension along which Softmax will be computed (so every slice along dim will sum to 1).
stable (bool) – Use stable softmax or not.
Returns: Buda tensor
Return type: Tensor
Layernorm(name: str, operandA: Tensor, weights: Tensor | Parameter, bias: Tensor | Parameter, dim: int = -1, epsilon: float = 1e-05)
Layer normalization.
Parameters:
name (str) – Op name, unique to the module, or leave blank to autoset
operandA (Tensor) – First operand
weights (Tensor | Parameter) – Scale applied after normalization
bias (Tensor | Parameter) – Bias applied after normalization
dim (int) – Dimension along which to normalize. Default: -1
epsilon (float) – Small constant added for numerical stability. Default: 1e-05
Returns: Buda tensor
Return type: Tensor
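A sketch of the NN ops in a transformer-style tail. The layernorm weight and bias parameters are assumed to be defined on the enclosing module:

```python
import pybuda

def norm_then_softmax(x, ln_weights, ln_bias):
    y = pybuda.op.Layernorm("ln", x, ln_weights, ln_bias, dim=-1, epsilon=1e-05)
    return pybuda.op.Softmax("probs", y, dim=-1, stable=True)
```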
Module Types
Module – Module class contains a workload that can be assigned to a single device.
PyTorchModule – A wrapper around a PyTorch module.
TFModule – A wrapper around a TF module.
OnnxModule – A wrapper around an ONNX module.
PyBudaModule – A base class for all PyBuda modules.
class Module(name: str)
Module class contains a workload that can be assigned to a single device. The workload can be implemented in PyTorch or in PyBuda.
get_device()
Returns the device that this op is placed onto.
Returns: Device, or None if op has not been placed yet
Return type: Optional[Device]
get_name()
Returns the name of the module.
Returns: Name of the module
Return type: str
run(*args)
Run inference on this module on a TT device. There should be no other modules manually placed on any devices.
Parameters: *args (tensor) – Inference inputs
Returns: Outputs of inference
Return type: Tuple[Tensor, …]
class PyTorchModule(name: str, module: Module, redirect_forward: bool = True)
A wrapper around a PyTorch module. If placed on a CPU device, PyTorchModules will be executed as is, and if placed on a TT device, modules will be lowered to PyBuda.
forward(*args, **kwargs)
Run PyTorch module forward, with pre-loaded inputs in input queues
Parameters:
*args – Inputs into the module
**kwargs – Keyword inputs into the module
Returns: Output tensors, one for each of the module outputs
Return type: Tuple[torch.tensor]
backward(*args)
Run PyTorch module backward, with pre-loaded inputs in input queues
Parameters: *args (List[Tuple[torch.tensor, torch.tensor]]) – List of tuples of output tensors and incoming loss tensors
add_parameter(name: str, parameter: Parameter)
Adds a new parameter.
Parameters:
name (str) – Parameter name
parameter (Parameter) – Parameter to add
prepend_name (Bool) – Whether to prepend module name to parameter name
set_parameters(**kwargs)
Set parameters (weights) in this module, by name.
Parameters: kwargs – Name-value pairs of parameter/weight names and tensor values
get_parameters()
Return the list of parameters defined in this module
Returns: List of all parameters in this module
Return type: List[Parameter]
class TFModule(name: str, module: Model)
A wrapper around a TF module. Currently, TF modules can only run on a CPU device.
forward(*args, **kwargs)
Run TF module forward, converting pytorch tensors as necessary
Parameters:
*args – Inputs into the module
**kwargs – Keyword inputs into the module
Returns: Output tensors, one for each of the module outputs
Return type: Tuple[tf.Tensor]
call(*args, **kwargs)
Run TF module forward, with pre-loaded inputs in input queues
Parameters:
*args – Inputs into the module
**kwargs – Keyword inputs into the module
Returns: Output tensors, one for each of the module outputs
Return type: Tuple[tf.Tensor]
backward(*args)
Run TF module backward, with pre-loaded inputs in input queues
Parameters: *args (List[Tuple[tf.Tensor, tf.Tensor]]) – List of tuples of output tensors and incoming loss tensors
class OnnxModule(name: str, module: ModelProto, onnx_path: str)
A wrapper around an ONNX module.
class PyBudaModule(name: str)
A base class for all PyBuda modules. Users should extend this class and implement the forward function with the workload implementation.
pre_forward(*args, **kwargs)
Called before forward. Override this function to add custom logic.
add_parameter(name: str, parameter: Parameter, prepend_name: bool = False)
Adds a new parameter.
Parameters:
name (str) – Parameter name
parameter (Parameter) – Parameter to add
prepend_name (Bool) – Whether to prepend module name to parameter name
add_constant(name: str, prepend_name: bool = False, shape: Tuple[int] | None = None)
Adds a new constant.
Parameters:
name (str) – Constant name
prepend_name (Bool) – Whether to prepend module name to constant name
get_constant(name)
Gets a constant by name
Parameters: name (str) – constant name
Returns: constant in module
Return type: pybuda.Tensor
set_constant(name: str, data: torch.Tensor | Tensor | ndarray)
Set value for a module constant.
Parameters:
name (str) – constant name
data (SomeTensor) – Tensor value to be set
get_parameter(name)
Gets a parameter by name
Parameters: name (str) – Parameter name
Returns: Module parameter
Return type: Parameter
get_parameters(submodules: bool = True)
Return the list of parameters defined in this module and (optionally) all submodules.
Parameters: submodules (bool , optional) – If set, parameters of submodules will be returned, as well. True by default.
Returns: List of all parameters in this (and submodules, optionally) module
Return type: List[Parameter]
set_parameter(name: str, data: torch.Tensor | Tensor | ndarray)
Set value for a module parameter.
Parameters:
name (str) – Parameter name
data (SomeTensor) – Tensor value to be set
load_parameter_dict(data: Dict[str, torch.Tensor | Tensor | ndarray])
Load all parameter values specified in the dictionary.
Parameters: data (Dict[str, SomeTensor]) – Dictionary of name->tensor pairs to be loaded into parameters
insert_tapout_queue_for_op(op_name: str, output_index: int)
Insert an intermediate queue for op (used for checking/debugging)
Parameters:
op_name (str) – Op name
output_index (int) – Index of the output tensor on the op you want to associate with the queue
Returns: Unique handle for the tapout queue, used to retrieve values later
Return type: IntQueueHandle
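A sketch of wrapping an existing PyTorch model with PyTorchModule and running it through the all-in-one inference API:

```python
import torch
import pybuda

torch_model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU())
module = pybuda.PyTorchModule("torch_block", torch_model)

output_q = pybuda.run_inference(module, inputs=[(torch.rand(1, 128),)])
print(output_q.get())
pybuda.shutdown()
```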
Device Types
Device – Device class represents a physical device which can be a Tenstorrent device, or a CPU.
CPUDevice – CPUDevice represents a CPU processor.
TTDevice – TTDevice represents one or more Tenstorrent devices that will receive modules to run.
class Device(name: str, mp_context=None)
Device class represents a physical device which can be a Tenstorrent device, or a CPU. In a typical operation, each device spawns a process on the host CPU which is either used to run commands on the CPU (if device is a CPU), or feeds commands to the Tenstorrent device.
Each device will allocate input queues for the first module it will execute. On a CPU, these are usually some kind of multiprocessing queues with shared memory storage, and Tenstorrent devices have queues in on-device memory.
One or more Modules can be placed on the device to be executed.
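A sketch of a two-stage pipeline: a CPUDevice running a torch embedding feeds a TTDevice running a hypothetical PyBuda encoder module:

```python
import torch
import pybuda

cpu0 = pybuda.CPUDevice("cpu0")
cpu0.place_module(pybuda.PyTorchModule("embed", torch.nn.Embedding(1000, 128)))

tt0 = pybuda.TTDevice("tt0")
tt0.place_module(MyEncoder("encoder"))  # hypothetical PyBudaModule

cpu0.push_to_inputs(torch.randint(0, 1000, (1, 64)))  # pushed to the first device
output_q = pybuda.run_inference()
pybuda.shutdown()
```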
place_module(module: Module | Tuple[Module] | List[Module])
Places a module, or list of modules, on this device for execution. Modules will be run as a sequential pipeline on this single device.
place_loss_module(module: Module)
Places a module used to calculate loss on this device. This must be the last device in the pipeline.
Parameters: module (Module) – A single loss module
remove_loss_module()
Remove module used to calculate loss from this device
push_to_inputs(*tensors: Tuple[torch.Tensor | Tensor, …] | Dict[str, torch.Tensor | Tensor])
Push tensor(s) to module inputs, either in order, or by keyword argument if a dictionary is used. The data will be queued up on the target device until it is ready to be consumed.
This call can block if there is no space on the target device’s input queues.
Parameters: *tensors (Union[torch.Tensor, Tensor]) – Ordered list of inputs to be pushed into the module’s input queue. Can be pytorch or pybuda tensors.
push_to_target_inputs(*tensors)
Push tensor(s) to module training target inputs, in order. The data will be queued up on the target device until it is ready to be consumed.
This call can block if there is no space on the target device’s input queues.
Parameters: tensors – Ordered list of inputs to be pushed into the module’s target input queue
push_to_command_queue(cmd)
Send command to the running main loop in another process
get_command_queue_response()
Read from command queue response. This is blocking.
Returns: Command-specific dictionary with response data, or None in case of failures
Return type: Optional[Dict]
get_next_command(command_queue: Queue)
Read next command to run, from the given command queue. Blocking.
Parameters: command_queue (queue.Queue) – Queue of commands
Returns: Next command from the queue, or None if shutdown_event was set
Return type: Command
run_next_command(cmd: Command)
In concurrent mode, this is called in a forever loop by the process dedicated to this device. In sequential mode, the main process will call this until there’s no more work to do.
Parameters: cmd (Command) – Next command to run
Returns: True if quit command was seen
Return type: bool
dc_transfer_thread(direction: str, direction_queue: Queue)
Keep transferring data in a thread. One per direction.
dc_transfer(direction: str)
Transfer data between devices
run(output_dir: str)
Main process loop in concurrent mode.
The loop receives commands through its command queue, which indicate how many epochs & iterations to run, whether to run training or inference, and position in the pipeline.
The loop will run until a shutdown command is sent through the command queue, or a shutdown event is raised due to an exception in another process.
Parameters: output_dir (str) – Output directory needed by perf trace on every process
compile_for(training: bool, microbatch_size: int = 0, microbatch_count: int = 1)
Save microbatch size and count
get_first_targets()
Return the tuple of first targets pushed to this device
get_first_inputs(peek=False)
Return the microbatch size, and first input in microbatch pushed into the device. If input_shapes/input_types are provided, then those will be used to create input tensors.
This is used to compile and optimize the model for dimensions provided by the first input.
shutdown_device()
Check for any mp queues that are not empty, and drain them
cpueval_backward(bw_inputs: List[Tensor], parameters: Dict[str, Tensor])
Evaluate backward pass for verification. cpueval_forward should’ve been called first, with save_for_backward set.
Parameters:
bw_inputs (List[torch.Tensor]) – BW inputs, i.e. losses for each fw output
parameters (Dict[str, torch.Tensor]) – Module parameters
Returns:
List[Tensor] – Gradients on ordered inputs
Dict[str, Tensor] – Gradients on parameters
generate(loop_count: int, write_index: int)
Run generate forward pass on each module on this device, in order
Parameters:
loop_count (int) – Number of micro-batches to run
write_index (int) – Write location for past cache buffers
forward(loop_count: int)
Run forward pass on each module on this device, in order
Parameters: loop_count (int) – Number of micro-batches to run
backward(loop_count: int, zero_grad: bool)
Run backward pass on each module on this device, in reverse order
Parameters:
loop_count (int) – Each mini-batch is broken into micro-batches. This is necessary to fill a multi-device pipeline, and should be roughly 4-6x the number of devices in the pipeline for ideal performance.
zero_grad (bool) – Set to true to have optimizer zero out gradients before the run
class CPUDevice(name: str, optimizer_f: Callable | None = None, scheduler_f: Callable | None = None, mp_context=None, retain_backward_graph=False, module: PyTorchModule | List[PyTorchModule] | None = None, input_dtypes: List[dtype] | None = None)
CPUDevice represents a CPU processor. It will spawn a process and run local operations on the assigned processor.
forward_pt(loop_count: int)
Run forward pass on each module on this device, in order
Parameters: loop_count (int) – Number of micro-batches to run
forward_tf(loop_count: int)
Run forward pass on each module on this device, in order
Parameters: loop_count (int) – Number of micro-batches to run
forward(loop_count: int)
Run forward pass on each module on this device, in order
Parameters: loop_count (int) – Number of micro-batches to run
backward(loop_count: int, zero_grad: bool)
Run backward pass on each module on this device, in reverse order
Parameters:
loop_count (int) – Each mini-batch is broken into micro-batches. This is necessary to fill a multi-device pipeline, and should be roughly 4-6x the number of devices in the pipeline for ideal performance.
zero_grad (bool) – Set to true to have optimizer zero out gradients before the run
generate(loop_count: int, write_index: int)
Run generate forward pass on each module on this device, in order
Parameters: loop_count (int) – Number of micro-batches to run
compile_for_pt(inputs: Tuple[Tensor, …], compiler_cfg: CompilerConfig, targets: List[Tensor] = [], microbatch_size: int = 0, microbatch_count: int = 1, verify_cfg: VerifyConfig | None = None)
For a CPU device, there is currently no compilation. This function propagates input shapes through the model to return output shapes and formats.
Parameters:
inputs (Tuple[Tensor, …]) – Tuple of input tensors. They must have shape and format set, but do not need to hold data unless auto-verification is set.
compiler_cfg (CompilerConfig) – Compiler configuration
targets (List[Tensor] , optional) – Optional list of target tensors, if this device has a loss module
microbatch_size (int , optional) – The size of microbatch. Must be non-zero for training mode.
microbatch_count (int) – Only relevant for training and TT devices.
verify_cfg (Optional[VerifyConfig]) – Optional auto-verification of compile process
Returns: Output tensors
Return type: Tuple[Tensor, …]
compile_for_tf(inputs: Tuple[Tensor, …], compiler_cfg: CompilerConfig, targets: List[Tensor] = [], microbatch_size: int = 0, verify_cfg: VerifyConfig | None = None)
For a CPU device, there is currently no compilation. This function propagates input shapes through the model to return output shapes and formats.
Parameters:
inputs (Tuple[Tensor, …]) – Tuple of input tensors. They must have shape and format set, but do not need to hold data unless auto-verification is set.
compiler_cfg (CompilerConfig) – Compiler configuration
targets (List[Tensor] , optional) – Optional list of target tensors, if this device has a loss module
microbatch_size (int , optional) – The size of microbatch. Must be non-zero for training mode.
verify_cfg (Optional[VerifyConfig]) – Optional auto-verification of compile process
Returns: Output tensors
Return type: Tuple[Tensor, …]
compile_for(inputs: Tuple[Tensor, …], compiler_cfg: CompilerConfig, targets: List[Tensor] = [], microbatch_size: int = 0, microbatch_count: int = 1, verify_cfg: VerifyConfig | None = None)
For a CPU device, there is currently no compilation. This function propagates input shapes through the model to return output shapes and formats.
Parameters:
inputs (Tuple[Tensor, …]) – Tuple of input tensors. They must have shape and format set, but do not need to hold data unless auto-verification is set.
compiler_cfg (CompilerConfig) – Compiler configuration
targets (List[Tensor] , optional) – Optional list of target tensors, if this device has a loss module
microbatch_size (int , optional) – The size of microbatch. Must be non-zero for training mode.
microbatch_count (int) – Only relevant for training and TT devices.
verify_cfg (Optional[VerifyConfig]) – Optional auto-verification of compile process
Returns: Output tensors
Return type: Tuple[Tensor, …]
cpueval_forward_pt(inputs: List[Tensor], parameters: Dict[str, Tensor], save_for_backward: bool, targets: List[Tensor] = [])
Evaluate forward pass for verification
Parameters:
inputs (List[torch.Tensor]) – One input into the model (for each ordered input node)
parameters (Dict[str, torch.Tensor]) – Map of model parameters
save_for_backward (bool) – If set, input and output tensors will be saved so we can run the backward pass later.
targets (List[torch.Tensor] , optional) – If we’re running training, and there’s a loss module on this device, provide target
Returns: Forward graph output
Return type: List[Tensor]
cpueval_forward_tf(inputs: List[Tensor], parameters: Dict[str, Tensor], save_for_backward: bool, targets: List[Tensor] = [])
Evaluate forward pass for verification
Parameters:
inputs (List[torch.Tensor]) – One input into the model (for each ordered input node)
parameters (Dict[str, torch.Tensor]) – Map of model parameters
save_for_backward (bool) – If set, input and output tensors will be saved so we can run the backward pass later.
targets (List[torch.Tensor] , optional) – If we’re running training, and there’s a loss module on this device, provide target
Returns: Forward graph output
Return type: List[Tensor]
cpueval_forward(inputs: List[Tensor], parameters: Dict[str, Tensor], save_for_backward: bool, targets: List[Tensor] = [])
Evaluate forward pass for verification
Parameters:
inputs (List[torch.Tensor]) – One input into the model (for each ordered input node)
parameters (Dict[str, torch.Tensor]) – Map of model parameters
save_for_backward (bool) – If set, input and output tensors will be saved so we can run the backward pass later.
targets (List[torch.Tensor] , optional) – If we’re running training, and there’s a loss module on this device, provide target
Returns: Forward graph output
Return type: List[Tensor]
cpueval_backward(bw_inputs: List[Tensor], parameters: Dict[str, Tensor])
Evaluate backward pass for verification. cpueval_forward should’ve been called first, with save_for_backward set.
Parameters:
bw_inputs (List[torch.Tensor]) – BW inputs, i.e. losses for each fw output
parameters (Dict[str, torch.Tensor]) – Module parameters
Returns:
List[Tensor] – Gradients on ordered inputs
Dict[str, Tensor] – Gradients on parameters
place_module(module: Module | Tuple[Module] | List[Module])
Places a module, or list of modules, on this device for execution. Modules will be run as a sequential pipeline on this single device.
pop_parameter_checkpoint()
Return a dictionary of current parameter values for the models on this device.
set_debug_gradient_trace_queue(q: Queue)
[debug feature] Provide a queue to which incoming and outgoing gradients will be stored, for debug tracing.
sync()
Block until queued up commands have completed and the device is idle.
class TTDevice(name: str, num_chips: int | None = None, chip_ids: List[int] | List[Tuple[int]] | None = None, arch: BackendDevice | None = None, devtype: BackendType | None = None, device_mode: DeviceMode | None = None, optimizer: Optimizer | None = None, scheduler: LearningRateScheduler | None = None, fp32_fallback: DataFormat = DataFormat.Float16_b, mp_context=None, module: Module | List[Module] | None = None)
TTDevice represents one or more Tenstorrent devices that will receive modules to run.
get_device_config(compiler_cfg=None)
Figure out which silicon devices will be used, if in silicon mode
place_module(module: Module | Tuple[Module] | List[Module])
Places a module, or list of modules, on this device for execution. Modules will be run as a sequential pipeline on this single device.
remove_modules()
Remove placed modules, and clear the device
set_active_subgraph(subgraph_index: int)
Set the currently active subgraph by limiting the io queues.
get_active_subgraph()
Gets the currently active subgraph.
generate_graph(*inputs: Tensor, target_tensors: List[Tensor] = [], return_intermediate: bool = False, graph_name: str = 'default_graph', compiler_cfg: CompilerConfig | None = None, trace_only: bool = False, verify_cfg: VerifyConfig | None = None)
Generate a buda graph from the modules on the device, and return the graph and output tensors. If input tensors have a value set, the output tensor will also have the calculated output value set.
Parameters:
inputs (Tuple[Tensor, …]) – Input tensors
target_tensors (List[Tensor]) – Target inputs. Optional, if trace_only is set. Otherwise, values must be provided.
return_intermediate (bool) – Optional. If set, a dictionary of node IDs -> tensors will be returned with intermediate values, for data mismatch debug.
trace_only (bool) – If set, the graph is made for a quick trace only and shouldn’t have side-effects
Returns: Buda graph, outputs, optional intermediates, original inputs, target tensor
Return type: Graph, Tuple[Tensor, …], Dict[str, Tensor], Tuple[Tensor, …], Optional[Tensor]
compile_for(inputs: Tuple[Tensor, …], compiler_cfg: CompilerConfig, targets: List[Tensor] = [], microbatch_size: int = 0, microbatch_count: int = 1, verify_cfg: VerifyConfig | None = None)
Compile modules placed on this device, with given input shapes, input formats, and microbatch size.
Parameters:
training (bool) – Specify whether to compile for training or inference. If set to true, autograd will be executed before the compile.
inputs (Tuple[Tensor, …]) – Tuple of input tensors. They must have shape and format set, but do not need to hold data unless auto-verification is set.
compiler_cfg (CompilerConfig) – Compiler configuration
targets (List[Tensor] , optional) – Optional list of target tensors, if this device has a loss module
microbatch_size (int , optional) – The size of microbatch. Must be non-zero for training mode.
microbatch_count (int) – Only relevant for training. This represents the number of microbatches that are pushed through the fwd path before the bwd path runs. The device will ensure that buffering is large enough to hold intermediate data for microbatch_count microbatches.
verify_cfg (Optional[VerifyConfig]) – Optional auto-verification of compile process
Returns: Output tensors
Return type: Tuple[Tensor, …]
forward(loop_count: int)
Run forward pass on each module on this device, in order
Parameters: loop_count (int) – Number of micro-batches to run
generate(loop_count: int, write_index: int, tokens_per_iter: int, token_id: int)
Run generate forward pass on each module on this device, in order
Parameters: loop_count (int) – Number of micro-batches to run
cpueval_forward(inputs: List[Tensor], parameters: Dict[str, Tensor], save_for_backward: bool, targets: List[Tensor] = [])
Evaluate forward pass for verification
Parameters:
inputs (List[torch.Tensor]) – One input into the model (for each ordered input node)
parameters (Dict[str, torch.Tensor]) – Map of model parameters
save_for_backward (bool) – If set, input and output tensors will be saved so we can run the backward pass later.
targets (List[torch.Tensor] , optional) – If we’re running training, and there’s a loss module on this device, provide target
Returns: Forward graph output
Return type: List[Tensor]
backward(loop_count: int, zero_grad: bool)
Run backward pass on each module on this device, in reverse order
Parameters:
loop_count (int) – Each mini-batch is broken into micro-batches. This is necessary to fill a multi-device pipeline, and should be roughly 4-6x the number of devices in the pipeline for ideal performance.
zero_grad (bool) – Set to true to have optimizer zero out gradients before the run
get_parameter_checkpoint()
Return a dictionary of current parameter values for the models on this device
get_all_parameters()
Return a dictionary of current parameter values for the models on this device
get_parameter_gradients()
Return a dictionary of currently accumulated gradient values for the models on this device
get_parameters(ignore_unused_parameters: bool = True)
Parameters: ignore_unused_parameters (bool) – If true, any parameter not recorded by the graph trace (i.e. the parameter is unused in graph execution) is not included in the list returned to the user.
get_optimizer_params(is_buda: bool)
Return a dictionary of dictionaries of optimizer parameters for each model parameter.
get_scheduler_params(is_buda: bool)
Return a dictionary of dictionaries of optimizer parameters used by scheduler.
get_dram_io_queues(queue_type: str)
Returns the appropriate queue description, tile broadcast information, and original shapes, where applicable
shutdown_device()
Shutdown device at the end of the workload
sync()
Block until queued up commands have completed and the device is idle.
Miscellaneous
DataFormat – Data format enumeration.
MathFidelity – Math fidelity enumeration.
class DataFormat
Members:
Float32
Float16
Bfp8
Bfp4
Bfp2
Float16_b
Bfp8_b
Bfp4_b
Bfp2_b
Lf8
UInt16
Int8
RawUInt8
RawUInt16
RawUInt32
Int32
Invalid
from_json(self: str)
property name
to_json(self: pybuda._C.DataFormat)
class MathFidelity
Members:
LoFi
HiFi2
HiFi3
HiFi4
Invalid
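An illustrative use of these enums through set_configuration_options; the top-level import locations are an assumption:

```python
import pybuda
from pybuda import DataFormat, MathFidelity  # top-level re-exports are an assumption

pybuda.set_configuration_options(
    default_df_override=DataFormat.Bfp8_b,  # block floating-point format
    accumulate_df=DataFormat.Float16_b,     # accumulate at higher precision
    math_fidelity=MathFidelity.HiFi2,
)
```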