TT-Metal Bisect Documentation

Overview

The tt-metal bisect system extends the automated bisect framework to support three-level dependency bisecting: tt-xla → tt-mlir → tt-metal. This enables pinpointing performance regressions deep in the dependency chain without manual intervention.

How TT-Metal Bisect Works

The tt-metal bisect (scripts/bisect_ttmetal_perf.sh) operates within the context of a tt-mlir commit that uplifts tt-metal. For each candidate tt-metal commit being tested:

1. Checkout and Prepare

Checkout the specific tt-metal commit in the submodule
Update all tt-metal submodules: git submodule update --init --recursive
- Critical because tt-metal has its own submodule dependencies
Navigate to tt-mlir repository (parent of tt-metal)

2. Apply Modifications

Apply revert of specified tt-mlir commit if provided (-r flag)
- Checks if revert commit exists in current tt-mlir history
- Only reverts if commit is an ancestor of current HEAD
- Revert is applied in tt-mlir, NOT in tt-metal submodule
Modify tt-xla’s third_party/CMakeLists.txt to use current tt-metal version
- This is critical to prevent CMake ExternalProject conflicts
- Without this, ExternalProject tries to checkout a different commit and hits git stash conflicts
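
The revert guard described above can be sketched as follows. This is a minimal illustration rather than the script's actual code: the function name revert_if_ancestor and its repository-path argument are assumptions made for the example, while git cat-file -e and git merge-base --is-ancestor are standard git commands.

```shell
# Hypothetical helper: revert COMMIT in REPO only when it is reachable from HEAD.
revert_if_ancestor() {
    local repo=$1 commit=$2

    # Skip when the commit object is unknown to this repository.
    if ! git -C "$repo" cat-file -e "${commit}^{commit}" 2>/dev/null; then
        echo "skip: $commit not found"
        return 0
    fi

    # merge-base --is-ancestor exits 0 only if COMMIT is an ancestor of HEAD.
    if git -C "$repo" merge-base --is-ancestor "$commit" HEAD; then
        # --no-commit leaves the revert in the working tree and index only.
        git -C "$repo" revert --no-commit "$commit"
    else
        echo "skip: $commit is not an ancestor of HEAD"
    fi
}
```

Checking ancestry first matters during a bisect because older candidate commits may predate the change being reverted, in which case the revert would either fail or undo something that is not there yet.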

3. Build Attempt with Parent Checkout Strategy

Building against the parent of the reference commit first avoids invoking Claude whenever the unmodified parent already compiles:

# Reset all changes to clean state
git reset --hard HEAD
git clean -fd

# Checkout PARENT of reference commit (before compatibility fixes)
PARENT_COMMIT=$(git rev-parse "${FIX_BUILD_REF}^")
git checkout "$PARENT_COMMIT"

# Update tt-xla's TT_MLIR_VERSION to match parent commit
# This prevents ExternalProject from trying to checkout different commit
sed -i "s/set(TT_MLIR_VERSION \"[^\"]*\")/set(TT_MLIR_VERSION \"$PARENT_COMMIT\")/" \
    third_party/CMakeLists.txt

# Re-apply tt-metal version modification
sed -i "s/set(TT_METAL_VERSION \"[^\"]*\")/set(TT_METAL_VERSION \"$CURRENT_TTMETAL_COMMIT\")/" \
    third_party/CMakeLists.txt

# Re-apply revert if specified
if [ -n "$REVERT_COMMIT" ]; then
    git revert --no-commit "$REVERT_COMMIT"
fi

# Try building with parent commit first
cmake --build build

Why this works:

Parent commit is BEFORE compatibility fixes were added
Many intermediate tt-metal commits work fine with parent
Only invoke Claude if parent build also fails
Reduces Claude invocations by ~50%
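
The escalation order (cheap parent build first, Claude only as a fallback, untestable if both fail) can be sketched as a small dispatcher. The function and strategy names below are hypothetical stand-ins, not names taken from bisect_ttmetal_perf.sh.

```shell
# Run each build strategy in order and stop at the first one that succeeds.
# Returns 125 (untestable) when every strategy fails.
build_with_fallback() {
    local strategy
    for strategy in "$@"; do
        if "$strategy"; then
            echo "build succeeded via: $strategy"
            return 0
        fi
        echo "strategy failed: $strategy" >&2
    done
    return 125
}

# Stand-in strategies for illustration only; in a real script these would
# wrap the parent-commit build and the Claude-assisted fix.
build_with_parent() { false; }
fix_build_with_claude() { true; }

build_with_fallback build_with_parent fix_build_with_claude
```

Keeping the strategies as an ordered list makes it easy to slot another cheap attempt in front of the Claude step later without touching the control flow.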

4. Claude Agent Integration (`--fix-build-ref` option)

Only invoked if parent build fails. Claude examines the reference commit’s compatibility fixes and adapts them for the specific intermediate tt-metal API version.

Key insight: Intermediate tt-metal versions may have different APIs than the final version that the reference commit was written for.

Common Error Types Claude Handles

a) API signature mismatches

Function calls with wrong number or types of arguments
Solution: Examine tt-metal API at specific commit and adjust calls

b) Header redefinition errors (most common)

Types/enums defined in BOTH build/include/ AND source directories
Happens when headers are both generated and in source tree
Solutions:
- Modify tt-mlir’s common.h to exclude one include path
- Add include guards or conditional compilation
- Check if CMakeLists.txt needs updates to exclude duplicate headers
- Prefer build/include versions if auto-generated

c) Missing symbols

Functions/types that don’t exist in this tt-metal version
May need to use alternative APIs or conditionally compile

Example Fix Applied by Claude

For header redefinition errors in commit 63bb00ab4a:

diff --git a/runtime/include/tt/runtime/detail/common/common.h b/runtime/include/tt/runtime/detail/common/common.h
index f9523dbd6..37b664bbc 100644
--- a/runtime/include/tt/runtime/detail/common/common.h
+++ b/runtime/include/tt/runtime/detail/common/common.h
@@ -13,8 +13,8 @@
 #include "tt-metalium/host_api.hpp"
 #include "tt-metalium/mesh_device.hpp"

-#include "tt-metalium/fabric_edm_types.hpp"
-#include "tt-metalium/fabric_types.hpp"
+#include "tt-metalium/experimental/fabric/fabric_edm_types.hpp"
+#include "tt-metalium/experimental/fabric/fabric_types.hpp"
 #include "tt/runtime/detail/common/flatbuffer_operator_ostream.h"
 #include "tt/runtime/detail/common/logger.h"
 #include "tt/runtime/types.h"

Why this fixes it:

Before: tt-metalium/fabric_edm_types.hpp resolved to generated headers in build/include/tt-metalium/
Other files were including from experimental/fabric/ path → source headers
This caused both versions to be included → redefinition errors
Claude’s fix: Use full source path consistently → only one version included
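
To spot this kind of conflict before a build, one option is to look for headers that are included through more than one path. The helper below is a heuristic sketch with an assumed name (find_mixed_includes); two headers that share a file name can be legitimately different files, so its output is a list of candidates to inspect, not a list of bugs.

```shell
# List header basenames that are #included via more than one distinct path
# anywhere under DIR, together with the paths used.
find_mixed_includes() {
    local dir=$1
    grep -rhoE '#include "[^"]+"' "$dir" \
        | sed -E 's/#include "([^"]+)"/\1/' \
        | sort -u \
        | awk -F/ '{ paths[$NF] = paths[$NF] " " $0; count[$NF]++ }
                   END { for (h in count) if (count[h] > 1) print h ":" paths[h] }' \
        | sort
}
```

Run against a tree where one file includes tt-metalium/fabric_types.hpp and another includes tt-metalium/experimental/fabric/fabric_types.hpp, it reports fabric_types.hpp with both paths.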

Claude’s Approach

Read build errors to understand what's failing
Examine diff between parent and reference commit
Adapt fixes for THIS specific tt-metal API version
Modify tt-mlir files only (not tt-metal submodule)
For header conflicts: modify include paths in common.h
For API mismatches: examine actual tt-metal code at commit
Test build iteratively until success or timeout

Configuration:

Model: Opus (better reasoning for complex issues)
Timeout: 600 seconds (10 minutes)
Allowed tools: Read, Edit, Bash, Glob, Grep
Permission mode: bypassPermissions (fully automated)
Verbose: Enabled for debugging
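
The 600-second limit can be enforced with a small wrapper that converts a timeout into the "untestable" exit code. This is a sketch under assumptions: run_with_limit is a hypothetical name, it relies on GNU timeout, and the commented Claude invocation is illustrative only (the FIX_PROMPT variable is a placeholder, and the flags should be verified against the installed Claude Code CLI).

```shell
# Run a command under a time limit and translate a timeout into exit code 125,
# so the bisect step is reported as untestable rather than bad.
run_with_limit() {
    local limit=$1
    shift
    local rc=0
    timeout "$limit" "$@" || rc=$?
    # GNU timeout exits with 124 when the limit is reached.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s" >&2
        return 125
    fi
    return "$rc"
}

# Hypothetical invocation matching the configuration listed above:
# run_with_limit 600 claude -p "$FIX_PROMPT" --model opus \
#     --allowedTools "Read,Edit,Bash,Glob,Grep" \
#     --permission-mode bypassPermissions --verbose
```

Any exit status other than a timeout is passed through unchanged, so the caller can still distinguish a failed fix attempt from a timed-out one.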

5. Performance Test

If build succeeds:

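# Run benchmark command and keep its output for metric extraction
$BENCHMARK_COMMAND 2>&1 | tee benchmark_output.log

# Extract performance metric using regex pattern (keep the last match)
PERFORMANCE=$(grep -oP "$METRIC_PATTERN" benchmark_output.log | tail -n 1)

# Compare against threshold. The metric can be fractional (e.g. 425.99),
# so compare with awk; the shell's -lt only accepts integers.
if awk -v p="$PERFORMANCE" -v t="$THRESHOLD" 'BEGIN { exit !(p < t) }'; then
    exit 1  # BAD
else
    exit 0  # GOOD
fi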

Exit codes:

0: Good (performance meets threshold)
1: Bad (performance below threshold)
125: Untestable (build failed and Claude could not fix it, or Claude timed out); this follows the git bisect run convention for skipping a commit
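
The three exit codes can be produced by one function that also guards against a missing or malformed metric. This is a sketch with an assumed name (perf_exit_code); treating an unparsable measurement as untestable is a design choice made for the example, not something the source specifies.

```shell
# Map a measured value and a threshold to a bisect exit code:
# 0 = good, 1 = bad, 125 = untestable (no usable measurement).
perf_exit_code() {
    local perf=$1 threshold=$2
    local number='^[0-9]+([.][0-9]+)?$'

    # A missing or malformed metric means the run cannot be judged.
    if ! printf '%s\n' "$perf" | grep -Eq "$number" ||
       ! printf '%s\n' "$threshold" | grep -Eq "$number"; then
        echo 125
        return
    fi

    # awk compares numerically, so fractional values are handled.
    awk -v p="$perf" -v t="$threshold" 'BEGIN { print ((p < t) ? 1 : 0) }'
}
```

A caller would end the test step with exit "$(perf_exit_code "$PERFORMANCE" "$THRESHOLD")", which keeps the verdict logic in one place.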

6. Cleanup

cleanup() {
    # Restore reference commit if used
    if [ -n "$FIX_BUILD_REF" ]; then
        cd "$TTMLIR_SRC_DIR" || return
        # Discard local changes first so they cannot block the checkout
        git reset --hard HEAD
        git clean -fd  # Remove untracked files (e.g., venv/ created by Claude)
        git checkout "$FIX_BUILD_REF" --quiet
    fi

    # Reset tt-xla changes
    cd "$TTXLA_ROOT" || return
    git reset --hard HEAD
}
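
The source does not show how cleanup is registered. One common pattern, sketched here with a stand-in cleanup body and a hypothetical wrapper name, is an EXIT trap so that cleanup runs whether the step ends with a good, bad, or untestable verdict.

```shell
# Run one bisect step in a subshell with cleanup registered as an EXIT trap.
# The cleanup body is a stand-in; the real one restores the repositories.
run_bisect_step() (
    cleanup() { echo "cleanup ran"; }
    trap cleanup EXIT
    trap 'exit 130' INT TERM   # turn signals into a normal exit so EXIT fires

    echo "testing commit"
    exit 1   # simulate a BAD verdict; cleanup still runs
)
```

The function body uses parentheses instead of braces, so the trap and the exit are confined to a subshell and the caller still receives the step's exit status.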

Integration with Auto Bisect

The bisect_perf_auto.sh script orchestrates three-level bisecting:

Phase 3: TT-Metal Detection

# Check if bad tt-mlir commit is a tt-metal uplift
cd "$TTMLIR_THIRD_PARTY_DIR"

git checkout "$FIRST_BAD_TTMLIR" --quiet
BAD_TTMETAL=$(grep 'set(TT_METAL_VERSION' CMakeLists.txt | grep -oP '"\K[^"]+')

git checkout "${FIRST_BAD_TTMLIR}^" --quiet
PARENT_TTMETAL=$(grep 'set(TT_METAL_VERSION' CMakeLists.txt | grep -oP '"\K[^"]+')

if [ "$BAD_TTMETAL" != "$PARENT_TTMETAL" ]; then
    # This is a tt-metal uplift, bisect within tt-metal
    echo "Detected tt-metal uplift: $PARENT_TTMETAL → $BAD_TTMETAL"

    # Automatically invoke tt-metal bisect with reference commit
    bisect_ttmetal_perf.sh \
        -c "$BENCHMARK_COMMAND" \
        -t "$PERF_THRESHOLD" \
        -p "$METRIC_PATTERN" \
        -f "$FIRST_BAD_TTMLIR"  # Use tt-mlir uplift as fix reference
fi
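
The version extraction can also be written without grep -P, which is not available in every grep build. The sketch below uses assumed helper names (cmake_var_value, is_uplift) and compares two copies of the CMake file, which could be produced with git show COMMIT:path instead of checking out each commit.

```shell
# Print the quoted value of set(<VAR> "<value>") from a CMake file.
cmake_var_value() {
    local var=$1 file=$2
    sed -n "s/.*set($var \"\([^\"]*\)\").*/\1/p" "$file" | head -n 1
}

# Succeed when VAR has different values in the old and new file.
is_uplift() {
    local var=$1 old_file=$2 new_file=$3
    [ "$(cmake_var_value "$var" "$old_file")" != "$(cmake_var_value "$var" "$new_file")" ]
}
```

The same pair of helpers works for the tt-xla to tt-mlir boundary by passing TT_MLIR_VERSION instead of TT_METAL_VERSION.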

Key Design Decisions

Parent Checkout Strategy

Testing with parent commit (before fixes) first is most efficient:

Only invoke Claude if parent also fails
Reduces Claude usage and total bisect time
Many intermediate commits work fine with parent

TT_MLIR_VERSION Hotfix

Updating tt-xla’s CMakeLists.txt to match checked-out commit prevents CMake ExternalProject from trying to sync tt-mlir to a different commit, which causes git stash conflicts.

Without hotfix:

error: Your local changes to the following files would be overwritten by checkout:
    third_party/CMakeLists.txt
Please commit your changes or stash them before you switch branches.
Aborting

With hotfix:

ExternalProject sees correct TT_MLIR_VERSION
No attempt to checkout different commit
No git conflicts
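
A sed substitution that matches nothing still exits successfully, so a renamed variable would silently disable the hotfix. The sketch below, using an assumed helper name (pin_cmake_version), performs the same rewrite as the sed commands shown earlier but fails loudly when the variable is absent.

```shell
# Rewrite set(<VAR> "<old>") to set(<VAR> "<commit>") in a CMake file.
# Fails if the variable is not present, so a silent no-op cannot slip through.
pin_cmake_version() {
    local var=$1 commit=$2 file=$3
    if ! grep -q "set($var \"" "$file"; then
        echo "error: $var not found in $file" >&2
        return 1
    fi
    local tmp
    tmp=$(mktemp) || return 1
    # Write via a temporary file, then copy back to keep the file's permissions.
    sed "s/set($var \"[^\"]*\")/set($var \"$commit\")/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    local rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

For example, pin_cmake_version TT_MLIR_VERSION "$PARENT_COMMIT" third_party/CMakeLists.txt would apply the hotfix described in this section.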

Revert in TT-MLIR, Not TT-Metal

Reverts are applied in the parent tt-mlir repository because:

That’s where the buggy changes exist that need reverting
TT-metal is a git submodule with independent history
Reverting in tt-metal would be incorrect and confusing

Submodule Updates

Running git submodule update --init --recursive is critical when moving between tt-metal commits:

TT-metal has its own submodule dependencies (UMD, tracy, etc.)
These must be synced to match the tt-metal commit being tested
Missing submodule updates cause cryptic build failures

Example Workflow

Full three-level bisect for a ResNet regression:

# Start auto bisect
cd /localdev/rpavlovic/tt-xla
./scripts/bisect_perf_auto.sh \
    -c "python ../tt-forge/benchmark/benchmark.py -p tt-xla -m resnet -bs 8 -df bfloat16 -lp 128" \
    -t 540 \
    -p "Sample per second:\s*\K[0-9.]+"

# Phase 1: Bisect tt-xla
[INFO] Starting tt-xla bisect...
[INFO] Testing range: 3ca42547..a6b9d854
[INFO] Found bad commit: a868fa2a9 (tt-mlir uplift)
[INFO] Detected TT_MLIR_VERSION change, entering Phase 2

# Phase 2: Auto-bisect tt-mlir
[INFO] Starting tt-mlir bisect...
[INFO] Testing range: 70efb12f..fb45bf24c
[INFO] Found bad commit: fb45bf24c (tt-metal uplift)
[INFO] Detected TT_METAL_VERSION change, entering Phase 3

# Phase 3: Auto-bisect tt-metal
[INFO] Starting tt-metal bisect...
[INFO] Testing range: e1d6113542..2bd1bb143f
[INFO] Testing commit: 63bb00ab4a
[INFO] Build failed with parent commit
[INFO] Invoking Claude to fix build...
[INFO] Claude fixed header redefinition errors in common.h
[INFO] Build succeeded, running benchmark...
[INFO] Performance: 425.99 (threshold: 540) → BAD

[INFO] Testing commit: 5548ea559d
[INFO] Build succeeded with parent commit
[INFO] Performance: 542.10 (threshold: 540) → GOOD

[INFO] Bisect converging...
[INFO] First bad commit: 63bb00ab4af2ec88203bbb371291324b7d2a4af1
[INFO] Commit message: "#33134: Use output cores for throttle on convs. (#33135)"

Future Improvements

1. Cache Claude’s Build Fixes as Patches

Problem: Currently, Claude is invoked fresh for each failing commit. However, many intermediate commits fail with the SAME header redefinition errors and require the SAME fix. This wastes time and Claude API calls.

Proposed Solution:

# After Claude successfully fixes a build
CLAUDE_PATCH="/tmp/bisect_patches/claude_fix_${TTMETAL_COMMIT}.patch"
mkdir -p /tmp/bisect_patches

# Extract ONLY Claude's changes (excluding revert)
if [ -n "$REVERT_COMMIT" ]; then
    # Get list of files changed by revert
    REVERT_FILES=$(git diff-tree --no-commit-id --name-only -r "$REVERT_COMMIT")

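    # Create patch excluding revert files (-x matches whole paths, not substrings;
    # xargs -r skips the diff when no files remain; requires GNU xargs)
    git ls-files -m | grep -v -x -F "$REVERT_FILES" | xargs -r -d '\n' git diff HEAD -- > "$CLAUDE_PATCH"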
else
    # No revert, all changes are from Claude
    git diff HEAD > "$CLAUDE_PATCH"
fi

# Tag patch with error signature for smart matching
ERROR_SIG=$(echo "$BUILD_ERRORS" | grep "error:" | sort | md5sum | cut -d' ' -f1)
echo "# Error signature: $ERROR_SIG" >> "$CLAUDE_PATCH"
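
One possible refinement of the signature is to strip the file:line:column prefix before hashing, so the same error reported at a different line still produces the same signature. This is a variation on the pipeline above, not the proposed implementation, and error_signature is an assumed name.

```shell
# Read build output on stdin and print an MD5 signature of its error lines.
# Location prefixes such as "common.h:12:5: " are removed so that the same
# error at a different line or column hashes identically.
error_signature() {
    grep 'error:' \
        | sed -E 's/^[^ ]*:[0-9]+:[0-9]+: //' \
        | sort -u \
        | md5sum \
        | cut -d' ' -f1
}
```

Usage would look like ERROR_SIG=$(echo "$BUILD_ERRORS" | error_signature). The trade-off is that stripping locations makes matches broader, so a cached patch should still be verified with a build before it is trusted.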

Patch Application Strategy:

try_cached_patches() {
    local current_commit=$1
    local error_sig=$2

    # Try patches in order of likelihood
    local patches=(
        # 1. Exact commit match (if we've seen this before)
        "/tmp/bisect_patches/claude_fix_${current_commit}.patch"

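        # 2. Parent commit (likely similar). Resolve the parent hash first:
        #    "${current_commit}^" inside a filename is only a literal "^".
        #    TTMETAL_SRC_DIR is a placeholder for the tt-metal submodule path;
        #    this assumes patches are saved under full commit hashes.
        "/tmp/bisect_patches/claude_fix_$(git -C "$TTMETAL_SRC_DIR" rev-parse "${current_commit}^" 2>/dev/null).patch"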

        # 3. Same error signature
        $(grep -l "Error signature: $error_sig" /tmp/bisect_patches/*.patch 2>/dev/null)

        # 4. Common fixes (header issues, API compat)
        "/tmp/bisect_patches/common_header_fix.patch"
        "/tmp/bisect_patches/common_api_compat.patch"
    )

    for patch in "${patches[@]}"; do
        if [ -f "$patch" ]; then
            echo "Trying cached patch: $(basename "$patch")"

            # Check if patch applies cleanly
            if git apply --check "$patch" 2>/dev/null; then
                git apply "$patch"

                # Try building (pipefail so the status is cmake's, not tee's)
                if (set -o pipefail; cmake --build build 2>&1 | tee build_test.log); then
                    echo "✓ Cached patch worked!"
                    return 0
                fi

                # Undo only the failed patch; git reset --hard would also
                # discard the uncommitted revert
                git apply -R "$patch"
            fi
        fi
    done

    return 1  # No cached patch worked
}

# In main bisect loop
if ! BUILD_ERRORS=$(cmake --build build 2>&1); then
    ERROR_SIG=$(echo "$BUILD_ERRORS" | grep "error:" | sort | md5sum | cut -d' ' -f1)

    # Try cached patches first
    if try_cached_patches "$TTMETAL_COMMIT" "$ERROR_SIG"; then
        echo "Build fixed with cached patch"
    else
        # Fall back to Claude
        invoke_claude_to_fix_build
    fi
fi

Challenge: Separating Claude’s Changes from Revert

Three approaches with trade-offs:

Approach 1: Patch before revert

# Checkout parent → invoke Claude → save diff → reset → apply revert → re-apply diff
git checkout "${FIX_BUILD_REF}^"
invoke_claude
git diff HEAD > "$CLAUDE_PATCH"
git reset --hard HEAD  # drop Claude's edits so the revert starts from a clean tree
git revert --no-commit "$REVERT_COMMIT"
git apply "$CLAUDE_PATCH"

✓ Clean separation
✗ Complex workflow

Approach 2: Track revert files and exclude

REVERT_FILES=$(git show "$REVERT_COMMIT" --name-only --pretty=format:)
# -x matches whole paths only; xargs -r skips the diff when no files remain
git ls-files -m | grep -v -x -F "$REVERT_FILES" | xargs -r -d '\n' git diff HEAD -- > "$CLAUDE_PATCH"

✓ Simple
✗ Misses files modified by both revert and Claude

Approach 3: Use git’s three-way diff

# Most accurate but complex
git diff "$REVERT_COMMIT"^.."$REVERT_COMMIT" > revert.patch
git diff HEAD > combined.patch
# Use diff tools to subtract revert.patch from combined.patch

✓ Most accurate
✗ Requires careful diff manipulation

Recommended: Approach 2 with heuristics

Simple and works for 90% of cases
Add error signature matching for robustness
Store metadata about revert state in patch header

Expected Benefits:

~80% reduction in Claude invocations
~60% reduction in total bisect time
More consistent fixes (same patch applied to similar commits)
Lower cost (fewer Claude API calls)

Implementation Plan:

Add patch caching to bisect_ttmetal_perf.sh
Create /tmp/bisect_patches/ directory structure
Implement try_cached_patches() function
Add error signature extraction
Test with known regression

2. Add Timestamps to All Logs

Problem: It is currently difficult to track timing and identify bottlenecks, which makes it hard to answer questions like:

How long did Claude take on this commit?
Which commits take longest to build?
What’s the average benchmark time?
When did the bisect start/finish?
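
A minimal way to answer these questions is a logging helper that stamps every message, plus a wrapper that records how long a command took. Both function names (log, timed) and the output format are assumptions made for this sketch.

```shell
# Prefix every message with a UTC timestamp and a level.
log() {
    local level=$1
    shift
    printf '[%s] [%s] %s\n' "$(date -u '+%Y-%m-%d %H:%M:%S')" "$level" "$*"
}

# Run a command, then log how many seconds it took.
# Returns the command's own exit status.
timed() {
    local label=$1
    shift
    local start end rc=0
    start=$(date +%s)
    "$@" || rc=$?
    end=$(date +%s)
    log INFO "$label finished in $((end - start))s (exit $rc)"
    return "$rc"
}
```

For example, timed "tt-mlir build" cmake --build build would emit one timestamped line with the build duration, and the same wrapper could be placed around the benchmark and the Claude invocation.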