Building Swift-Llama: Running LLMs Locally on Apple Silicon

How I wrapped llama.cpp in a native Swift API with Metal GPU acceleration for on-device LLM inference.

6 min read
Gagan Deep Singh

Founder | GLINR Studios


I wanted to run LLMs directly on my Mac and iPhone without sending data to a remote server. The obvious starting point was llama.cpp -- it's fast, well-maintained, and supports Apple Silicon. The problem is that bridging C++ to Swift is genuinely painful. You're writing header files, managing Objective-C shims, and fighting the type system the whole way. So I built Swift-Llama to solve that once and do it properly.

The Problem with C++ and Swift

llama.cpp exposes a C API, which Swift can technically call through a bridging header. In practice, managing memory lifetimes across the boundary, handling callbacks, and converting C strings to Swift strings quickly turns into a mess. Every project that wants local LLM inference ends up solving the same bridging problems from scratch.

I wanted a package you could drop into any Swift project and get a clean, idiomatic API with no C++ visible at the call site. That's what Swift-Llama is.

A Clean Native API

The core of Swift-Llama is a LlamaRunner actor that owns the model context and exposes inference through async Swift methods. Loading a model looks like this:

let runner = try await LlamaRunner(modelPath: "/path/to/model.gguf")

From there you generate text, stream tokens, or call into the chat interface. The C layer is fully encapsulated -- you never touch a raw pointer or worry about manual memory management.

The actor model here is important. LLM inference is inherently stateful and not thread-safe at the C level. Wrapping everything in a Swift actor gives you safe concurrent access for free and integrates naturally with structured concurrency.
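As a minimal sketch of that pattern (illustrative only, not Swift-Llama's actual internals), an actor protecting non-thread-safe state looks like this:

```swift
import Foundation

// Illustrative sketch: an actor serializes every call into state that is
// not thread-safe at the C level. `callCount` stands in for the C context.
actor InferenceContext {
    private var callCount = 0  // mutable state, protected by actor isolation

    func generate(prompt: String) -> String {
        // In the real library this is where calls into the C API would happen;
        // actor methods run one at a time, so they can never overlap.
        callCount += 1
        return "completion for: \(prompt)"
    }
}

// Concurrent callers are safe: the actor queues the calls automatically.
let context = InferenceContext()
async let first = context.generate(prompt: "hello")
async let second = context.generate(prompt: "world")
let results = await [first, second]
```

The compiler enforces that the context is only ever touched through the actor, which is exactly the guarantee the C layer needs.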

Metal GPU Acceleration

Apple Silicon has a unified memory architecture, which means the GPU can access the same memory as the CPU without copying. llama.cpp's Metal backend takes advantage of this, and Swift-Llama surfaces it through a simple configuration option.

let config = LlamaConfig(
    nGpuLayers: 32,  // offload 32 layers to Metal
    contextSize: 4096
)
let runner = try await LlamaRunner(modelPath: modelPath, config: config)

On an M-series chip, offloading most or all layers to the GPU gives you dramatically faster token generation compared to CPU-only inference. On my M3 MacBook Pro with a 7B parameter model, inference runs comfortably above 30 tokens per second -- fast enough to feel interactive.

Streaming with Async/Await

Waiting for a full completion before showing the user anything makes LLMs feel slow. Streaming token by token is the right UX, and Swift's async sequences make it straightforward to implement.

Swift-Llama returns an AsyncThrowingStream<String, Error> from the generation methods, so you can iterate tokens as they arrive:

for try await token in runner.stream(prompt: prompt) {
    print(token, terminator: "")
}

This composes naturally with SwiftUI's task modifier, making it easy to drive a streaming text view without any manual thread management. The stream closes when generation hits an end-of-sequence token or your configured token limit.
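To show the mechanism (a sketch, not the library's actual implementation), producing an AsyncThrowingStream from a token loop looks roughly like this:

```swift
import Foundation

// Sketch of producing a token stream with AsyncThrowingStream.
// In the real library, the loop body would pull tokens from the C sampler.
func tokenStream(for tokens: [String]) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        Task {
            for token in tokens {
                continuation.yield(token)  // emit each token as it is decoded
            }
            continuation.finish()          // close on end-of-sequence
        }
    }
}

var output = ""
for try await token in tokenStream(for: ["Hel", "lo", "!"]) {
    output += token
}
```

The consumer side is identical to the snippet above: a plain `for try await` loop, with cancellation handled by structured concurrency.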

Tool Call Parsing for Agent Workflows

A model that can only generate text has limited utility for building agents. You need the model to be able to call functions -- look up data, take actions, chain reasoning steps. Most LLMs that support tool use emit structured JSON within their output to signal a tool call.

Swift-Llama includes a parser that detects and extracts these structured calls from the token stream. Rather than building that parsing logic into every app that uses the package, it lives in the library with a clean delegate interface:

runner.onToolCall = { call in
    switch call.name {
    case "search":
        return try await searchWeb(query: call.arguments["query"])
    default:
        throw ToolError.unknown(call.name)
    }
}

The model output is intercepted, the tool call is dispatched, and the result is injected back into the context so the model can continue generating. This gives you the foundation for agent loops without the boilerplate.
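The detection step can be sketched like this. This is a simplified, self-contained version; the `<tool_call>` delimiters and flat string arguments are assumptions for illustration, and the real parser works incrementally on the token stream rather than on a completed string:

```swift
import Foundation

// A structured tool call as the model might emit it in JSON.
struct ToolCall: Decodable {
    let name: String
    let arguments: [String: String]
}

// Extract a tool call wrapped in assumed <tool_call>…</tool_call> markers;
// the actual delimiters depend on the model's template.
func parseToolCall(from output: String) -> ToolCall? {
    guard let start = output.range(of: "<tool_call>"),
          let end = output.range(of: "</tool_call>") else { return nil }
    let json = output[start.upperBound..<end.lowerBound]
    return try? JSONDecoder().decode(ToolCall.self, from: Data(json.utf8))
}

let sample = #"<tool_call>{"name": "search", "arguments": {"query": "swift actors"}}</tool_call>"#
let call = parseToolCall(from: sample)
```

Keeping this logic in the library means every app gets the same tested parser instead of reimplementing string matching against model output.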

ChatML Template Support

Different models expect prompts formatted in specific ways. Llama 3, Mistral, and Phi each have their own chat templates. Getting the template wrong causes the model to generate garbage or break formatting entirely.

Swift-Llama ships with built-in support for the ChatML format and makes it easy to construct properly formatted conversations:

let messages: [ChatMessage] = [
    .system("You are a helpful assistant."),
    .user("Explain attention mechanisms briefly.")
]
let prompt = ChatMLTemplate.format(messages)

Adding support for other templates is straightforward because the template logic is isolated from the inference engine. You format the prompt, pass it to the runner, and get tokens back.
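For reference, ChatML wraps each message in `<|im_start|>`/`<|im_end|>` markers. A minimal formatter (a sketch of the standard format, not `ChatMLTemplate` itself) looks like this:

```swift
// Minimal ChatML formatter sketch: each message becomes
// <|im_start|>{role}\n{content}<|im_end|>, ending with an open
// assistant turn for the model to complete.
struct Message {
    let role: String
    let content: String
}

func formatChatML(_ messages: [Message]) -> String {
    messages
        .map { "<|im_start|>\($0.role)\n\($0.content)<|im_end|>" }
        .joined(separator: "\n")
        + "\n<|im_start|>assistant\n"
}

let prompt = formatChatML([
    Message(role: "system", content: "You are a helpful assistant."),
    Message(role: "user", content: "Explain attention mechanisms briefly.")
])
```

Sending the model a prompt without these exact markers is how you end up with the garbage output mentioned above.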

Swift 6 Concurrency

The whole library is written against Swift 6's strict concurrency model. All mutable state is either actor-isolated or protected behind a Sendable boundary. This means the compiler catches data races at build time rather than at runtime under load.

Adopting Swift 6 concurrency was more work upfront, but the result is a library that you can integrate into an async context -- a SwiftUI app, a server-side Swift backend, a command-line tool -- without introducing concurrency bugs.

Distributed as a Swift Package

Swift-Llama is available as a standard SPM package. Add it to your project by pointing at the GitHub URL and you're done -- no build scripts, no manually compiled libraries, no CocoaPods.

The package declares its llama.cpp dependency as a submodule and compiles it as part of the Swift build, so everything works with swift build and in Xcode without any extra steps.
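In a `Package.swift` manifest, the dependency declaration looks something like this (the `branch` requirement is illustrative; check the repository for the recommended version pin):

```swift
// Package.swift dependency fragment (version requirement is an assumption)
dependencies: [
    .package(url: "https://github.com/profclaw/swift-llama", branch: "main")
]
```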

The code is at github.com/profclaw/swift-llama.

Why On-Device AI Matters

Every prompt you send to a cloud API leaves your device. For personal notes, private documents, or sensitive queries, that's a real tradeoff. On-device inference means the model runs on your hardware and your data stays there. No API keys, no usage logs, no dependency on a third party's uptime.

There's also a latency argument. Local inference eliminates the round-trip to a remote server. For interactive applications, that makes a meaningful difference in how the experience feels.

Part of the ProfClaw Ecosystem

Swift-Llama is one piece of a broader set of tools I'm building under the ProfClaw project. The goal is a stack for building capable, private AI applications on Apple platforms -- tools that developers can use to ship real products without depending on cloud inference for every call.

If you're building something with it or running into issues, open an issue or a PR. The library is young and there's a lot of ground left to cover.
