Model Auto-Discovery Tests
Overview
- What: A pytest-based runner that auto-discovers Torch models from `tt-forge-models` and generates tests for inference and training across parallelism modes.
- Why: Standardize model testing, reduce bespoke tests in repos, and scale coverage as models are added or updated.
- Scope: Discovers `loader.py` under `<model>/pytorch/` in `third_party/tt_forge_models`, queries variants, and runs each combination of:
  - Run mode: `inference`, `training`
  - Parallelism: `single_device`, `data_parallel`, `tensor_parallel`
Note: Discovery currently targets PyTorch models only. JAX model auto-discovery is planned.
Prerequisites
- A working TT-XLA development environment, built and ready to run tests, with `pytest` installed.
- `third_party/tt_forge_models` git submodule initialized and up to date:
  git submodule update --init --recursive third_party/tt_forge_models
- Device availability matching your chosen parallelism mode (e.g., multiple devices for data/tensor parallel).
- Optional internet access for per-model pip installs during test execution.
- Env var `IRD_LF_CACHE` set to point to the large-file cache / webserver mirroring the S3 bucket. Reach out to the team for details.
Quick start / commonly used commands
Warning: Since the number of models and variants supported here is high (1000+), it is a good idea to run with `--collect-only` first to see what will be discovered/collected before running non-targeted pytest commands locally.
Also, running the full matrix can collect thousands of tests and may install per-model Python packages during execution. Prefer targeted runs locally using `-m`, `-k`, or an exact node ID.
Tip: Use `-q --collect-only` to list tests with their full paths shown; remove `--collect-only` and use `-vv` when actually running.
- List all tests without running:
 
pytest --collect-only -q tests/runner/test_models.py |& tee collect.log
- List only tensor-parallel expected-passing tests on `n300-llmbox` (remove `--collect-only` to run):
pytest --collect-only -q tests/runner/test_models.py -m "tensor_parallel and expected_passing and n300_llmbox" --arch n300-llmbox |& tee tests.log
- Run a specific collected test node id exactly:
 
pytest -vv tests/runner/test_models.py::test_all_models[llama/sequence_classification/pytorch-llama_3_2_1b-single_device-full-inference] |& tee test.log
- Validate test_config files for typos or stale model names; useful when making updates:
 
pytest -svv --validate-test-config tests/runner/test_models.py |& tee validate.log
- List all expected-passing llama inference tests for n150 (using substring `-k` and markers with `-m`):
pytest -q --collect-only -k "llama" tests/runner/test_models.py -m "n150 and expected_passing and inference" |& tee tests.log
tests/runner/test_models.py::test_all_models[deepcogito/pytorch-v1_preview_llama_3b-single_device-full-inference]
tests/runner/test_models.py::test_all_models[huggyllama/pytorch-llama_7b-single_device-full-inference]
tests/runner/test_models.py::test_all_models[llama/sequence_classification/pytorch-llama_3_8b_instruct-single_device-full-inference]
tests/runner/test_models.py::test_all_models[llama/sequence_classification/pytorch-llama_3_1_8b-single_device-full-inference]
tests/runner/test_models.py::test_all_models[llama/causal_lm/pytorch-llama_3_8b-single_device-full-inference]
tests/runner/test_models.py::test_all_models[llama/causal_lm/pytorch-llama_3_8b_instruct-single_device-full-inference]
tests/runner/test_models.py::test_all_models[llama/causal_lm/pytorch-llama_3_1_8b-single_device-full-inference]
<snip>
21/3048 tests collected (3027 deselected) in 3.53s
How discovery and parametrization work
- The runner scans `third_party/tt_forge_models/**/pytorch/loader.py` (the git submodule) and imports `ModelLoader` to call `query_available_variants()`.
- For every discovered variant, the runner generates tests across run modes and parallelism.
- Implementation highlights:
  - Discovery and IDs: `tests/runner/test_utils.py` (`setup_test_discovery`, `discover_loader_paths`, `create_test_entries`, `create_test_id_generator`)
  - Main test: `tests/runner/test_models.py`
  - Config loading/validation: `tests/runner/test_config/config_loader.py` (merges YAML into Python with validation)
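For intuition, discovery can be pictured roughly as the sketch below. This is a simplified illustration, not the actual `tests/runner/test_utils.py` code, and it assumes `query_available_variants()` can be called on the imported `ModelLoader` class directly:

```python
# Simplified sketch of loader discovery (illustrative only).
import importlib.util
from pathlib import Path

MODELS_ROOT = Path("third_party/tt_forge_models")

def discover_variants():
    """Yield (loader_path, variant) pairs for every discovered PyTorch loader."""
    for loader_path in sorted(MODELS_ROOT.glob("**/pytorch/loader.py")):
        spec = importlib.util.spec_from_file_location("model_loader", loader_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # may need per-model requirements installed
        for variant in module.ModelLoader.query_available_variants():
            yield loader_path, variant
```

Each discovered (loader, variant) pair is then crossed with run modes and parallelism modes to produce the parametrized tests.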
Test IDs and filtering
- Test ID format: `<relative_model_path>-<variant_name>-<parallelism>-full-<run_mode>` (illustrated below)
- Examples: `squeezebert/pytorch-squeezebert-mnli-single_device-full-inference`, `...-data_parallel-full-training`
- Filter by substring with `-k` or by markers with `-m`:
pytest -q -k "qwen_2_5_vl/pytorch-3b_instruct" tests/runner/test_models.py
pytest -q -m "training and tensor_parallel" tests/runner/test_models.py
Take a look at `model-test-passing.json` and related `.json` files inside `.github/workflows/test-matrix-presets` to see how filtering works for CI jobs.
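For illustration, the ID format maps to something like the hypothetical helper below (the real generator is `create_test_id_generator` in `tests/runner/test_utils.py`):

```python
def make_test_id(relative_model_path: str, variant: str, parallelism: str, run_mode: str) -> str:
    # <relative_model_path>-<variant_name>-<parallelism>-full-<run_mode>
    return f"{relative_model_path}-{variant}-{parallelism}-full-{run_mode}"

# Produces: llama/causal_lm/pytorch-llama_3_8b-single_device-full-inference
print(make_test_id("llama/causal_lm/pytorch", "llama_3_8b", "single_device", "inference"))
```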
Parallelism modes
- `single_device`: Standard execution on one device.
- `data_parallel`: Inputs are automatically batched to `xr.global_runtime_device_count()`; the shard spec is inferred on batch dim 0.
- `tensor_parallel`: Mesh derived from `loader.get_mesh_config(num_devices)`; execution is sharded along model dimensions.
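As a rough illustration of the data-parallel behavior, the sketch below uses public torch_xla SPMD APIs; it is not the runner's actual implementation (which lives in `DynamicTorchModelTester`), and the 4-D input shape is just an example:

```python
# Illustrative sketch of data_parallel input batching/sharding (not the runner's code).
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs

xr.use_spmd()
num_devices = xr.global_runtime_device_count()

# Replicate a single-sample input along batch dim 0, one copy per device...
inputs = torch.randn(1, 3, 224, 224).repeat(num_devices, 1, 1, 1).to(xm.xla_device())

# ...then shard that batch dimension across a 1-D device mesh.
mesh = xs.Mesh(np.arange(num_devices), (num_devices,), ("batch",))
xs.mark_sharding(inputs, mesh, ("batch", None, None, None))
```

For `tensor_parallel`, the mesh shape and axis names come from `loader.get_mesh_config(num_devices)` instead, and sharding follows the loader's shard spec.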
Per-model requirements
- If a model provides a `requirements.txt` next to its `loader.py`, the runner will:
  - Freeze the current environment
  - Install those requirements (and the optional `requirements.nodeps.txt` with `--no-deps`)
  - Run the tests
  - Uninstall newly added packages and restore changed versions
- Environment toggles:
  - `TT_XLA_DISABLE_MODEL_REQS=1` to disable install/uninstall management
  - `TT_XLA_REQS_DEBUG=1` to print pip operations for debugging
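Conceptually, the management behaves like this simplified context manager (a sketch; the real implementation is `tests/runner/requirements.py` and also handles `requirements.nodeps.txt` and version restoration):

```python
import contextlib
import os
import subprocess
import sys

@contextlib.contextmanager
def model_requirements(requirements_txt: str):
    """Install a model's requirements for the duration of its tests, then clean up."""
    if os.environ.get("TT_XLA_DISABLE_MODEL_REQS") == "1" or requirements_txt is None:
        yield
        return
    pip = [sys.executable, "-m", "pip"]
    before = set(subprocess.run(pip + ["freeze"], capture_output=True, text=True).stdout.splitlines())
    subprocess.run(pip + ["install", "-r", requirements_txt], check=True)
    try:
        yield
    finally:
        after = set(subprocess.run(pip + ["freeze"], capture_output=True, text=True).stdout.splitlines())
        # Simplified cleanup: remove specs that changed; the real runner also restores prior versions.
        for spec in after - before:
            subprocess.run(pip + ["uninstall", "-y", spec.split("==")[0]], check=False)
```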
 
Test configuration and statuses
- Central configuration is authored as YAML in `tests/runner/test_config/*` and loaded/validated by `tests/runner/test_config/config_loader.py` (merged into Python at runtime).
- Example: `tests/runner/test_config/test_config_inference_single_device.yaml` for all single-device inference test tagging, and `tests/runner/test_config/test_config_inference_data_parallel.yaml` for data-parallel inference test tagging.
- Each entry is keyed by the collected test ID and can specify:
  - Status: `EXPECTED_PASSING`, `KNOWN_FAILURE_XFAIL`, `NOT_SUPPORTED_SKIP`, `UNSPECIFIED`, `EXCLUDE_MODEL`
  - Comparators: `required_pcc`, `assert_pcc`, `assert_allclose`, `allclose_rtol`, `allclose_atol`
  - Metadata: `bringup_status`, `reason`, custom `markers` (e.g., `push`, `nightly`)
  - Architecture scoping: `supported_archs` for filtering by CI job, and optional `arch_overrides` for modifying entries per arch
YAML to Python loading and validation
- The YAML files in `tests/runner/test_config/*` are the single source of truth. At runtime, `tests/runner/test_config/config_loader.py`:
  - Loads and merges all YAML fragments into a single Python dictionary keyed by collected test IDs
  - Normalizes enum-like values (accepts both names like `EXPECTED_PASSING` and values like `expected_passing`)
  - Applies `--arch <archname>`-specific `arch_overrides` when provided
  - Validates field names/types and raises helpful errors on typos or invalid values
  - Uses `ruamel.yaml` for parsing, which flags duplicate mapping keys and detects duplicate test entries both within a single YAML file and across multiple YAML files; duplicates cause validation errors with clear messages
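A minimal sketch of the load/merge/duplicate-detection idea, assuming `ruamel.yaml` (the real loader in `tests/runner/test_config/config_loader.py` also normalizes enums, applies `arch_overrides`, and validates fields):

```python
from pathlib import Path
from ruamel.yaml import YAML

def load_test_config(config_dir: str = "tests/runner/test_config") -> dict:
    """Merge all YAML fragments into one dict keyed by collected test ID."""
    yaml = YAML(typ="safe")  # ruamel.yaml rejects duplicate mapping keys within a file
    merged = {}
    for path in sorted(Path(config_dir).glob("*.yaml")):
        data = yaml.load(path.read_text()) or {}
        for test_id, entry in data.items():
            if test_id in merged:  # duplicate test entry across files
                raise ValueError(f"Duplicate test entry '{test_id}' found in {path}")
            merged[test_id] = entry
    return merged
```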
 
Model status and bringup_status guidance
Use `tests/runner/test_config/*` to declare intent for each collected test ID. Typical fields:
- `status` (from `ModelTestStatus`) controls filtering of tests in CI:
  - `EXPECTED_PASSING`: Test is green and should run in Nightly CI. Optionally set thresholds.
  - `KNOWN_FAILURE_XFAIL`: Known failure that should xfail; include `reason` and `bringup_status` to set them statically, otherwise they will be set dynamically at runtime.
  - `NOT_SUPPORTED_SKIP`: Skip on this architecture or generally unsupported; provide `reason` and (optionally) `bringup_status`.
  - `UNSPECIFIED`: Default for new tests; runs in Experimental Nightly until triaged.
  - `EXCLUDE_MODEL`: Deselect from auto-run entirely (rare; use for temporary exclusions).
- `bringup_status` (from `BringupStatus`) summarizes current health for Superset dashboard reporting: `PASSED` (set automatically on pass), `INCORRECT_RESULT` (e.g., PCC mismatch), `FAILED_FE_COMPILATION` (frontend compile error), `FAILED_TTMLIR_COMPILATION` (tt-mlir compile error), `FAILED_RUNTIME` (runtime crash), `NOT_STARTED`, `UNKNOWN`.
- `reason`: Short human-readable context, ideally with a link to a tracking issue.
- Comparator controls: prefer `required_pcc`; use `assert_pcc=False` sparingly as a temporary measure.
Examples
- Passing with a tuned PCC threshold, when the decrease is reasonable and understood:
 
"resnet/pytorch-resnet_50_hf-single_device-full-inference": {
  "status": ModelTestStatus.EXPECTED_PASSING,
  "required_pcc": 0.98,
}
- Known compile failure (xfail) with issue link:
 
"clip/pytorch-openai/clip-vit-base-patch32-single_device-full-inference": {
  "status": ModelTestStatus.KNOWN_FAILURE_XFAIL,
  "bringup_status": BringupStatus.FAILED_TTMLIR_COMPILATION,
  "reason": "Error Message - Github issue link",
}
- For a minor unexpected PCC mismatch, open a ticket, decrease the threshold, and set `bringup_status`/`reason`:
 
"wide_resnet/pytorch-wide_resnet101_2-single_device-full-inference": {
  "status": ModelTestStatus.EXPECTED_PASSING,
  "required_pcc": 0.96,
  "bringup_status": BringupStatus.INCORRECT_RESULT,
  "reason": "PCC regression after consteval changes - Github Issue Link",
}
- For a severe unexpected PCC mismatch, open a ticket, disable the PCC check, and set `bringup_status`/`reason`:
 
"gpt_neo/causal_lm/pytorch-gpt_neo_2_7B-single_device-full-inference": {
  "status": ModelTestStatus.EXPECTED_PASSING,
  "assert_pcc": False,
  "bringup_status": BringupStatus.INCORRECT_RESULT,
  "reason": "AssertionError: PCC comparison failed. Calculated: pcc=-1.0000001192092896. Required: pcc=0.99 - Github Issue Link",
}
- Architecture-specific overrides (e.g., PCC thresholds, status):
 
"qwen_3/embedding/pytorch-embedding_8b-single_device-full-inference": {
    "status": ModelTestStatus.EXPECTED_PASSING,
    "arch_overrides": {
        "n150": {
            "status": ModelTestStatus.NOT_SUPPORTED_SKIP,
            "reason": "Too large for single chip",
            "bringup_status": BringupStatus.FAILED_RUNTIME,
        },
    },
},
Targeting architectures
- Use `--arch {n150,p150,n300,n300-llmbox}` on the pytest command line to enable `arch_overrides` resolution in the config, in case there are arch-specific overrides (like PCC requirements, check enablement, tagging).
- Tests are also marked with supported-arch markers (or defaults), so you can select subsets using `-m`, for example:
pytest -q -m n300 --arch n300 tests/runner/test_models.py
pytest -q -m n300_llmbox --arch n300-llmbox tests/runner/test_models.py
Placeholder models (report-only)
- Placeholder models are declared in YAML at `tests/runner/test_config/test_config_placeholders.yaml` and list important customer `ModelGroup.RED` models not yet merged, typically marked with `BringupStatus.NOT_STARTED`. These entries are loaded using the same config loader as the other YAML files.
- `tests/runner/test_models.py::test_placeholder_models` emits report entries with the `placeholder` marker; these are used for reporting on the Superset dashboard and run in tt-xla Nightly CI (typically via `model-test-xfail.json`).
- Be sure to remove the placeholder at the same time the real model is added, to avoid duplicate reports.
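For illustration only, a placeholder entry might look like the following in its merged Python form (the model path and exact field set here are hypothetical; see `test_config_placeholders.yaml` for the authoritative schema):

```python
# Hypothetical placeholder entry (merged Python form); the model path is made up
# for illustration and the field names follow the config schema described above.
"new_customer_model/pytorch-base-single_device-full-inference": {
    "bringup_status": BringupStatus.NOT_STARTED,
    "reason": "Model not yet merged into tt-forge-models - Github issue link",
},
```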
 
CI setup
- Push/PR: A small, fast subset runs on each pull request (e.g., tests marked `push`). This provides quick signal without large queues.
- Nightly: The broad model matrix (inference/training across supported parallelism) runs nightly and reports to the Superset dashboard. Tests are selected via markers and `tests/runner/test_config/*` statuses/arch tags like `ModelTestStatus.EXPECTED_PASSING`.
- Experimental nightly: New or experimental models not yet promoted/tagged in `tests/runner/test_config/*` (typically `UNSPECIFIED`) run separately. These do not report to Superset until promoted with proper status/markers.
Adding a new model to run in Nightly CI
It is not difficult, but it potentially involves two projects (tt-xla and tt-forge-models). If the model has already been added to tt-forge-models and uplifted into tt-xla, skip steps 1-4.
1. In `tt-forge-models/<model>/pytorch/loader.py`, implement a `ModelLoader` if one doesn't already exist (see the sketch after this list), exposing:
   - `query_available_variants()` and `get_model_info(variant=...)`
   - `load_model(...)` and `load_inputs(...)`
   - `load_shard_spec(...)` (if needed) and `get_mesh_config(num_devices)` (for tensor parallel)
2. Optionally add `requirements.txt` (and `requirements.nodeps.txt`) next to `loader.py` for per-model dependencies.
3. Contribute the model upstream: open a PR in the `tt-forge-models` repository and land it (see the tt-forge-models repo: https://github.com/tenstorrent/tt-forge-models).
4. Uplift the `third_party/tt_forge_models` submodule in `tt-xla` to the merged commit so the loader is discoverable. Update the submodule and commit the pointer:
git submodule update --remote third_party/tt_forge_models
git add third_party/tt_forge_models
git commit -m "Uplift tt-forge-models submodule to <version> to include <model>"
5. Verify the test appears via `--collect-only` and run the desired flavor locally if needed.
6. Add or update the corresponding entry in `tests/runner/test_config/*` to set status/thresholds/markers/arch support so that the model test runs in tt-xla Nightly CI. Look at existing tests for reference.
7. Remove any corresponding placeholder entry from `PLACEHOLDER_MODELS` in `test_config_placeholders.yaml` if it exists.
8. Locally run `pytest -q --validate-test-config tests/runner/test_models.py` to validate the `tests/runner/test_config/*` updates (on-PR jobs run it too).
9. Open a PR in `tt-xla` with the changes, consider running the full set of expected-passing models on CI to qualify the `tt_forge_models` uplift (if it is risky), and land the PR in `tt-xla` main when confident in the changes.
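A skeletal `ModelLoader` for step 1 might look like the sketch below. This is illustrative only: the method names match the list above, but whether they are class methods or instance methods, their exact signatures, and their return types should be copied from existing loaders in `tt-forge-models`:

```python
# tt-forge-models/<model>/pytorch/loader.py -- illustrative skeleton, not a template.
import torch

class ModelLoader:
    @classmethod
    def query_available_variants(cls):
        # Names of supported variants (real loaders return richer variant configs).
        return ["base"]

    @classmethod
    def get_model_info(cls, variant=None):
        # Metadata used for reporting (real loaders return structured metadata).
        return {"model": "my_model", "variant": variant}

    def __init__(self, variant=None):
        self.variant = variant

    def load_model(self):
        return torch.nn.Linear(32, 32).eval()

    def load_inputs(self):
        return torch.randn(1, 32)

    def load_shard_spec(self):
        return None  # only needed for sharded execution

    def get_mesh_config(self, num_devices):
        return (1, num_devices)  # only needed for tensor parallel
```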
Troubleshooting
- Discovery/import errors show as `Cannot import path: <loader.py>: <error>`; add per-model requirements or set `TT_XLA_DISABLE_MODEL_REQS=1` to isolate issues.
- Runtime/compilation failures are recorded with a bringup status and reason in the test properties; check the test report's `tags` and `error_message`.
- Some models may be temporarily excluded from discovery; see the logs printed during collection.
- Use `-vv` and `--collect-only` for detailed collection/ID debugging.
Future enhancements
- Expand auto-discovery beyond PyTorch to include JAX models
- Automate updates of `tests/runner/test_config/*`, potentially based on Nightly CI results, including automatic promotion of tests from Experimental Nightly to stable Nightly
- Broader usability improvements and workflow polish tracked in issue #1307
 
Reference
- `tests/runner/test_models.py`: main parametrized pytest runner
- `tests/runner/test_utils.py`: discovery, IDs, `DynamicTorchModelTester`
- `tests/runner/requirements.py`: per-model requirements context manager
- `tests/runner/conftest.py`: config attachment, markers, `--arch`, config validation
- `tests/runner/test_config/*.yaml`: YAML test config files (source of truth)
- `tests/runner/test_config/config_loader.py`: loads/merges/validates YAML into Python at runtime
- `third_party/tt_forge_models/config.py`: `Parallelism` and model metadata