The Problem With GPU Compute Today

Modern GPU compute is fragmented by hardware vendor. CUDA runs on NVIDIA. ROCm runs on AMD. Metal runs on Apple silicon. If you want code that runs on all three, you're writing it three times or depending on a compatibility layer that introduces its own constraints.

PyTorch partially addresses this through device abstraction, but it introduces a different problem: approximately 2,000 operations, a surface area designed for human-readable research code rather than automated tooling. An LLM asked to reason about a PyTorch kernel is likely to hit context limits before it can see the whole API, and the operations carry no machine-readable complexity, constraint, or FLOP-count metadata.

tpt-gpu is an attempt to solve both problems at once: hardware-agnostic GPU compute with a language specifically designed for automated reasoning.

TPT Script: ~200 Operations With Machine-Readable Metadata

The headline feature of tpt-gpu is TPT Script, a statically typed language for GPU compute. Its design makes one deliberate trade-off: a small surface of approximately 200 operations, each annotated with machine-readable metadata.

Every operation carries four annotations:

@doc("Matrix multiply A × B")
@constraint("A.cols == B.rows")
@complexity("O(m*n*k)")
@flops("2 * m * n * k")
fn matmul(A: Tensor<f32>, B: Tensor<f32>) -> Tensor<f32>

The @constraint annotation expresses shape preconditions in a form that both the type checker and an LLM can reason over. The @flops annotation lets automated tools estimate computational cost before compiling. At approximately 200 operations, the surface area is small enough to fit in a single context window.
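As a rough illustration of what "machine-readable" buys a tool, the sketch below reads the four annotations from the matmul example and evaluates the @flops expression for concrete sizes. It is written in Python and is not part of tpt-gpu: parse_annotations and eval_arith are hypothetical helpers, the regular expression only handles the single-string annotation form shown above, and binding m, n, k to A.rows, B.cols, A.cols is an assumption.

```python
import ast
import operator
import re

# Annotation text copied from the matmul example above.
SOURCE = '''
@doc("Matrix multiply A × B")
@constraint("A.cols == B.rows")
@complexity("O(m*n*k)")
@flops("2 * m * n * k")
fn matmul(A: Tensor<f32>, B: Tensor<f32>) -> Tensor<f32>
'''

# Matches lines of the form @name("value"); an assumed, simplified grammar.
ANNOTATION = re.compile(r'^@(\w+)\("(.*)"\)\s*$')

def parse_annotations(source):
    """Collect @name("value") lines into a dict (hypothetical helper)."""
    meta = {}
    for line in source.splitlines():
        match = ANNOTATION.match(line.strip())
        if match:
            meta[match.group(1)] = match.group(2)
    return meta

_BINOPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
}

def eval_arith(expr, env):
    """Evaluate an integer arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.Name):
            return env[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            return _BINOPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval").body)

meta = parse_annotations(SOURCE)
# Assumed binding: m = A.rows, n = B.cols, k = A.cols.
flops = eval_arith(meta["flops"], {"m": 128, "n": 256, "k": 64})
print(sorted(meta))
print(flops)
```

Walking the syntax tree instead of calling eval() keeps the estimator limited to plain arithmetic, which matters if annotation text comes from untrusted source files.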

TPT Script supports tensor shape inference (shapes tracked through the type system), dual compilation (host functions compile to Rust, GPU kernels compile to TPTIR), and deployment annotations (@requires_gpu, @distributed, @deploy). There is a full LSP server for IDE integration and a VS Code extension with syntax highlighting, completion, and hover.
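The idea behind shape inference can be sketched in a few lines of Python. This is an illustration of the general technique only, not TPT Script's type checker: TensorType, ShapeError, and infer_matmul are hypothetical names, and the sketch covers only 2-D shapes and the matmul constraint quoted earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TensorType:
    """Hypothetical stand-in for a Tensor<f32> type with a known 2-D shape."""
    rows: int
    cols: int

class ShapeError(TypeError):
    """Raised when a shape precondition fails."""

def infer_matmul(a, b):
    """Return the result type of A x B, enforcing A.cols == B.rows."""
    if a.cols != b.rows:
        raise ShapeError(f"A.cols == B.rows violated: {a.cols} != {b.rows}")
    return TensorType(rows=a.rows, cols=b.cols)

a = TensorType(128, 64)
b = TensorType(64, 256)
c = infer_matmul(a, b)  # result shape is derived, not declared
print(c)

try:
    infer_matmul(b, b)  # 256 columns against 64 rows
    rejected = False
except ShapeError:
    rejected = True
print(rejected)
```

Because the result shape is computed from the operand types, a mismatch surfaces when the expression is checked rather than when a kernel runs.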

The 7-Layer Architecture

tpt-gpu is structured in seven layers, each with a clean boundary:

Layer 1 — ISA. SystemVerilog-based instruction set with 32-bit fixed-length instructions and a 9-stage SIMT pipeline. Tensor operations are first-class instructions.

Layer 2 — Drivers. Kernel modules for Linux (DRM/Rust for Linux), Windows (WDM), and macOS (DriverKit).

Layer 3 — Compiler. TPTIR — an MLIR-compatible intermediate representation with optimisation passes. Implemented in C++ with a parallel Rust port in progress.

Layer 4 — Runtime. Scheduler and three-tier memory allocator (Slab → Buddy → Fallback). Python bindings via PyO3 for framework integration.

Layer 5 — Primitives. Optimised kernels: GEMM, Attention, Conv2D. Written in TPTIR. The project's benchmarks report GEMM outperforming cuBLAS and Attention matching FlashAttention v2.

Layer 6 — Frameworks. PyTorch custom backend and JAX primitive implementations. HuggingFace support. Distributed training via FSDP and pipeline parallelism.

Layer 7 — Language. TPT Script. The compiler frontend that makes the rest of the stack accessible without writing TPTIR directly.
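To make the Layer 4 allocator more concrete, here is a minimal Python sketch of how a Slab → Buddy → Fallback scheme might choose a tier for a request. Everything specific in it is an assumption for illustration: the slab size classes, the 16 MiB buddy cap, and the choose_tier function are hypothetical and are not taken from the tpt-gpu runtime.

```python
SLAB_CLASSES = (64, 256, 1024, 4096)  # assumed slab size classes, in bytes
BUDDY_MAX = 1 << 24                   # assumed 16 MiB cap for the buddy tier

def next_power_of_two(n):
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()

def choose_tier(size):
    """Return (tier, block_size) for a request of `size` bytes."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Tier 1: small requests go to the first slab class that fits.
    for slab in SLAB_CLASSES:
        if size <= slab:
            return ("slab", slab)
    # Tier 2: medium requests are rounded up to a power-of-two buddy block.
    if size <= BUDDY_MAX:
        return ("buddy", next_power_of_two(size))
    # Tier 3: anything larger falls through to a direct allocation.
    return ("fallback", size)

small = choose_tier(100)
medium = choose_tier(5000)
large = choose_tier(BUDDY_MAX + 1)
print(small, medium, large)
```

The usual motivation for tiering like this is that slabs keep small allocations cheap and fragmentation-free, power-of-two buddy blocks can be split and merged for mid-sized buffers, and the fallback path handles the rare oversized request.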

Why It Matters for the TPT Suite

tpt-gpu's TPTIR (Layer 3) is an MLIR-compatible IR designed to serve as a shared interchange format across the TPT suite. tpt-crucible currently generates its own IR from GGUF/ONNX/PyTorch inputs. Once the two IRs are unified (on the roadmap), a model compiled once to TPTIR can be routed to GPU via tpt-gpu, or to FPGA, MCU swarm, or analog targets via tpt-crucible, without recompilation.

The Layer 4 runtime is also designed to be exposed as a Rust crate, allowing tpt-spark to add a TptGpuEngine backend that replaces hand-written WGSL shaders with production-quality primitives.

Getting Started

git clone https://github.com/PhillipC05/tpt-gpu
cd tpt-gpu
cargo build --release

Full setup instructions, including driver installation and framework integration, are in the README.
