TTNN, TT-Lang, and TT-Forge from the ground up.
You want to go below the model API. You want to write kernels, understand the three-processor execution model, and know exactly where every byte is at every step. This track starts with chip anatomy, works through a live kernel dispatch, introduces TT-Lang for writing custom kernels in Python, and finishes with a practical profiling workflow. Five chapters, all runnable on your QB2.