What Ollama Gets Wrong

Ollama is genuinely useful. It normalised running language models locally. But it carries architectural decisions that create overhead you may not want.

Ollama is a Go HTTP daemon. It starts a background service and listens for requests. Every inference call goes through an HTTP round-trip, even when the client is on the same machine. The daemon persists in memory whether or not you are running inference. Startup overhead is around 100 MB of RAM before any model is loaded. If you are building a desktop application that calls a local model, you are building on top of an HTTP server that was designed to serve multiple clients — not to be embedded in a single-user app.

tpt-spark was built from the opposite assumption: one user, one device, one binary, no network layer.

Architecture: Rust + Tauri, Single Binary

tpt-spark is a Tauri v2 application. The frontend is TypeScript and Vite. The backend is Rust. They communicate via Tauri IPC channels: typed function calls, not HTTP. There is no daemon, and no background process when the app is not running. The binary is around 10 MB, not counting model files.

The inference engine is behind a trait:

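use std::path::Path;
// `Result` is assumed here to be a crate-level alias such as anyhow::Result.
trait LlmEngine {
    fn load(&mut self, model_path: &Path) -> Result<()>; // load a model file from disk
    fn infer(&mut self, prompt: &str, callback: impl Fn(u32)) -> Result<()>; // callback runs once per generated token
    fn cancel(&mut self); // ask a running inference loop to stop
}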

Three implementations ship with the project:

StubEngine: returns echo tokens for testing the UI without loading a real model.

CandleEngine: real CPU inference via HuggingFace Candle with GGUF model support. No GPU required.

WgpuEngine: GPU acceleration via wgpu (Vulkan on Windows and Linux, Metal on macOS, DirectX 12 as a fallback on Windows). It uses custom WGSL compute shaders and automatically falls back to CandleEngine if no supported GPU is detected.

At startup, Spark detects the available hardware and picks the best engine. The UI doesn't know or care which engine is running.
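A minimal sketch of how that startup selection could look. The `Engine` trait here is a cut-down stand-in for `LlmEngine`, and `GpuProbe`, `select_engine`, and the engine names are hypothetical; the project's actual detection logic is not shown in this article.

```rust
/// Cut-down stand-in for the article's `LlmEngine` trait (hypothetical).
trait Engine {
    fn name(&self) -> &'static str;
}

struct CpuEngine;
struct GpuEngine;

impl Engine for CpuEngine {
    fn name(&self) -> &'static str {
        "cpu"
    }
}

impl Engine for GpuEngine {
    fn name(&self) -> &'static str {
        "gpu"
    }
}

/// Hypothetical result of probing the machine for a supported GPU.
#[derive(Clone, Copy)]
enum GpuProbe {
    Supported,
    Unsupported,
}

/// Prefer the GPU engine and fall back to the CPU engine otherwise.
/// Callers only ever see `Box<dyn Engine>`, never the concrete type.
fn select_engine(probe: GpuProbe) -> Box<dyn Engine> {
    match probe {
        GpuProbe::Supported => Box::new(GpuEngine),
        GpuProbe::Unsupported => Box::new(CpuEngine),
    }
}
```

Calling `select_engine(GpuProbe::Unsupported).name()` yields `"cpu"`. Note that a method taking `impl Fn(u32)` is generic, so a trait containing it cannot be used as a `dyn` trait object as written; the sketch sidesteps this by leaving `infer` out.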

Zero-Copy Model Loading

GGUF models are large. A 4-bit quantised LLaMA 3 8B model is around 4.7 GB. Loading that by reading it into RAM and then copying it to VRAM would double the peak memory usage during the load operation.

Spark uses memory-mapped file I/O. The model file is mapped directly into the process address space, and the operating system pages in the relevant sections on demand. Weights are uploaded to VRAM from the mapped pages, so the application never allocates a second full-size copy of the file in RAM. On machines with fast NVMe storage, this can make model loading noticeably faster than approaches that read the whole file into a buffer first.
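To Rust code, a mapped file looks like an ordinary borrowed byte slice, so parsing can work in place. The sketch below reads the fixed GGUF header fields from such a slice without copying the file. It assumes the GGUF version 2+ layout (4 magic bytes, a u32 version, then two u64 counts, all little-endian). How the slice is obtained (for example from a memory-mapping crate) is left out, and `GgufHeader` and `parse_gguf_header` are illustrative names, not Spark's own.

```rust
/// Fixed-size fields at the start of a GGUF file (version 2 and later).
#[derive(Debug, PartialEq)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Reads a little-endian u32 at `offset`. The caller guarantees the bounds.
fn read_u32_le(bytes: &[u8], offset: usize) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&bytes[offset..offset + 4]);
    u32::from_le_bytes(buf)
}

/// Reads a little-endian u64 at `offset`. The caller guarantees the bounds.
fn read_u64_le(bytes: &[u8], offset: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[offset..offset + 8]);
    u64::from_le_bytes(buf)
}

/// Parses the header from a borrowed slice; only 24 bytes are ever touched.
/// Returns None if the slice is too short or the magic bytes do not match.
fn parse_gguf_header(bytes: &[u8]) -> Option<GgufHeader> {
    if bytes.len() < 24 || &bytes[0..4] != b"GGUF" {
        return None;
    }
    Some(GgufHeader {
        version: read_u32_le(bytes, 4),
        tensor_count: read_u64_le(bytes, 8),
        metadata_kv_count: read_u64_le(bytes, 16),
    })
}
```

Because the function only borrows the slice, passing it a multi-gigabyte mapping costs nothing beyond the pages the operating system faults in for the first 24 bytes.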

Real-Time Token Streaming

Tokens appear in the UI one by one as they are generated, not after the whole response is complete. This is implemented via Tauri's IPC channel streaming support: the Rust backend emits a token event for each generated token, and the TypeScript frontend appends it to the response.

Generation can be cancelled mid-inference. The cancel() method on the engine trait signals the inference loop to stop; partial output is preserved.
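The streaming and cancellation behaviour can be sketched with the standard library alone. This is an illustration under assumptions, not Spark's code: `generate` is a hypothetical loop over pre-made tokens standing in for the model, the cancel signal is modelled as an `AtomicBool`, and the callback is a plain closure where the real app would presumably forward each token over the Tauri IPC channel.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical generation loop. `tokens` stands in for the model's output.
/// Each token is handed to `on_token` as soon as it is "generated".
/// Returns how many tokens were emitted before finishing or being cancelled.
fn generate(tokens: &[&str], cancel: &AtomicBool, mut on_token: impl FnMut(&str)) -> usize {
    let mut emitted = 0;
    for &token in tokens {
        // Check the flag before each token so a cancel request takes effect
        // at the next token boundary.
        if cancel.load(Ordering::Relaxed) {
            break;
        }
        on_token(token);
        emitted += 1;
    }
    emitted
}
```

Whatever the callback has already received stays with the caller, which is how partial output survives a cancel: the loop simply stops producing more.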

Supported Models

LLaMA 3 (1B, 3B, 8B) — tested against Meta's official GGUF releases
Mistral 7B
Phi-3 Mini

Models are GGUF-quantised. 4-bit quantised versions (Q4_K_M) are the default recommendation: good quality, and an 8B model fits in 6–8 GB of VRAM.
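A rough way to check whether a model fits, as a back-of-envelope sketch. The 4.7 bits per weight used in the usage note is back-derived from this article's own figure (about 4.7 GB for an 8B model), and the fixed overhead for KV cache and buffers is an illustrative assumption, not a measured value. Both function names are hypothetical.

```rust
/// Approximate size of the quantised weights in gigabytes (10^9 bytes).
fn weights_gb(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * bits_per_weight / 8.0
}

/// True if the weights plus a fixed allowance for KV cache and buffers fit.
fn fits_in_vram(weights: f64, overhead_gb: f64, vram_gb: f64) -> bool {
    weights + overhead_gb <= vram_gb
}
```

With `weights_gb(8.0, 4.7)` giving roughly 4.7 GB and an assumed 1 GB of overhead, a 6 GB card passes the check and a 4 GB card does not. Real overhead grows with context length, so treat the result as a lower bound.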

Privacy By Design

Spark is fully offline after the initial model download. There is no telemetry. No usage data is sent anywhere. Conversations are stored as JSON files in OS-specific directories on your local machine. There are no accounts, no login, no cloud.
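For readers wondering what "OS-specific directories" usually means, here is a sketch following common platform conventions. It is an assumption about typical locations, not a statement of where Spark writes: `data_dir`, the `conversations` subfolder, and the app folder name are all hypothetical, and a Tauri app would more likely ask Tauri's own path resolver than read environment variables directly.

```rust
use std::path::PathBuf;

/// Returns a per-user data directory for `app`, following common conventions.
/// `os` takes the same values as `std::env::consts::OS`; `env` looks up an
/// environment variable (injected so the function can be tested).
fn data_dir(os: &str, app: &str, env: impl Fn(&str) -> Option<String>) -> Option<PathBuf> {
    let base = match os {
        // Windows: roaming application data.
        "windows" => PathBuf::from(env("APPDATA")?),
        // macOS: per-user Application Support folder.
        "macos" => PathBuf::from(env("HOME")?)
            .join("Library")
            .join("Application Support"),
        // Linux and others: XDG data home, defaulting to ~/.local/share.
        _ => match env("XDG_DATA_HOME") {
            Some(dir) if !dir.is_empty() => PathBuf::from(dir),
            _ => PathBuf::from(env("HOME")?).join(".local").join("share"),
        },
    };
    Some(base.join(app).join("conversations"))
}
```

In a real program the lookup would be `|key| std::env::var(key).ok()` and the first argument `std::env::consts::OS`.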

The model download feature uses HTTPS to download from any URL — HuggingFace by default — but that is the only network call the application makes, and only when you explicitly trigger a download.

Getting Started

git clone https://github.com/PhillipC05/tpt-spark
cd tpt-spark
npm install
npm run tauri dev

Rust (1.77+) and Node.js are required. Full setup instructions including GPU driver requirements are in the README.
