Inspector
Overview
The Inspector is a tool that provides insights into Metal host runtime. It is designed to be on by default and to have minimal impact on performance. It consists of two components: one that logs necessary data to do investigation and one that allows clients to connect and query Metal host runtime data.
Enabling
Configure the Inspector by setting the following environment variables:
export TT_METAL_INSPECTOR=1 # optional: enable/disable the Inspector. Default is `1` (enabled).
export TT_METAL_INSPECTOR_LOG_PATH=logging_path # optional: set logging path. Default is `$TT_METAL_HOME/generated/inspector`
export TT_METAL_INSPECTOR_INITIALIZATION_IS_IMPORTANT=1 # optional: enable/disable stopping execution if the Inspector is not initialized properly. Default is `0` (disabled).
export TT_METAL_INSPECTOR_WARN_ON_WRITE_EXCEPTIONS=0 # optional: enable/disable warnings on logging write exceptions (like disk out of space). Default is `1` (enabled).
export TT_METAL_INSPECTOR_RPC_SERVER_ADDRESS=localhost:50051 # optional: set the address of the Inspector RPC server. Default is `localhost:50051`.
export TT_METAL_INSPECTOR_RPC=1 # optional: enable/disable the Inspector RPC server. Default is `1` (enabled).
Enabling the Inspector will override TT_METAL_RISCV_DEBUG_INFO and debugging info will be generated for riscv elfs. You can also use unix sockets for the RPC server by setting TT_METAL_INSPECTOR_RPC_SERVER_ADDRESS to a unix socket path, e.g. unix:/tmp/inspector_socket.
Extending
To add new RPC functionality, review the Inspector
interface in the tt_metal/impl/debug/inspector/rpc.capnp
file.
Add your new method to this interface.
This will generate a corresponding callback that you can attach by calling Inspector::get_rpc_server().setYourNewMethodCallback(...)
.
Next, implement your callback to handle the request and send a response to the client.
Examples
Example changes in tt_metal/impl/debug/inspector/rpc.capnp
:
interface Inspector {
getHelloWorld @0 () -> (message :Text);
sumTwoNumbers @1 (a :Int32, b :Int32) -> (result :Int32);
... other methods ...
}
Example changes in your Metal host runtime code:
#include "tt_metal/impl/debug/inspector.hpp"
#include "impl/debug/inspector/rpc_server_generated.hpp"
// In your initialization code
Inspector::get_rpc_server().setGetHelloWorldCallback(
[](auto result) {
result.setMessage("Hello, World!");
});
Inspector::get_rpc_server().setSumTwoNumbersCallback(
[](auto params, auto result) {
result.setResult(params.getA() + params.getB());
});
Using in tt-triage
After adding your new RPC method, you can use it in the tt-triage
tool to fetch data from the Metal host runtime.
The tool automatically generates code to call your new RPC method.
In your script, add a dependency to the inspector data script:
from inspector_data import run as get_inspector_data, InspectorData
script_config = ScriptConfig(
depends=["inspector_data"],
)
def run(args, context: Context):
inspector_data = get_inspector_data(args, context)
You can then call your new RPC method:
hello_world = inspector_data.getHelloWorld()
print(hello_world.message)
sum_result = inspector_data.sumTwoNumbers(3, 5)
print(f"Sum of 3 and 5 is {sum_result.result}")
Limitations
Inspector data is designed to be available even after the Metal host runtime exits.
To achieve this, data is serialized to disk in the generated/inspector
directory.
Methods of the Inspector
interface that do not require arguments are automatically serialized during Metal runtime exit.
If you add a method that requires arguments, you must implement serialization and deserialization for that data yourself.
Serialization should be implemented in the Metal host runtime code, and deserialization in the tt-triage
tool.
It is acceptable to add methods that require arguments and query Metal host runtime state during execution, and use this data in
tt-triage
scripts to provide insights into the system’s state during execution.
If you run a tt-triage
script that requires data which is not serialized, you will receive an error indicating that script execution is blocked due to a dependency failure.