Building a first local AI stack to save tokens

INDEX

  1. Introduction
  2. Requirements
  3. What we are building
  4. Installing Ollama
  5. Choosing small models
  6. Valid SLM tasks
  7. Local classification service
  8. Measuring quality, cost and time
  9. Extras
  10. Full stack

1. Introduction

Local AI does not have to start with a huge GPU. The first goal is to find narrow, repeated, low-risk tasks where a small model can prepare work before a paid model is called.

2. Requirements

We need a machine that can run Ollama, a set of real examples, a shortlist of candidate low-risk tasks, a way to measure accuracy, and a paid model to compare against.

3. What we are building

We install Ollama, download a small model, test it on classification, summarization and extraction, wrap it in a local function, and escalate to a paid model only when needed.

Pipeline for using local AI as a first layer

4. Installing Ollama

Ollama serves a local HTTP API at http://localhost:11434 and provides endpoints such as /api/generate and /api/chat.

# download the model and open an interactive prompt (exit with /bye)
ollama run llama3.2:1b

# with the Ollama server running, call the chat endpoint directly
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:1b",
  "messages": [
    {"role": "user", "content": "Classify this ticket: I cannot access VPN"}
  ],
  "stream": false
}'
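
Before wiring anything else, it helps to confirm the server is actually reachable. A quick way is the /api/tags endpoint, which lists the models available locally; a minimal check in Python:

import requests

# /api/tags lists locally available models; a 200 response confirms
# the Ollama server is up on the default port
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
print([m["name"] for m in resp.json()["models"]])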

5. Choosing small models

The Llama 3.2 page on Ollama lists small 1B and 3B variants. Use 1B for simple classification, tags and short rewrites; use 3B for summaries, JSON extraction and slightly more nuanced classification.
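
As a sketch of that split in code (the task names and the mapping are illustrative assumptions, not a rule from the Llama 3.2 docs):

def pick_model(task):
    # illustrative mapping: trivial label/tag work goes to the 1B model,
    # anything needing structure or nuance goes to the 3B model
    small = {"classify", "tag", "short_rewrite"}
    return "llama3.2:1b" if task in small else "llama3.2:3b"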

6. Valid SLM tasks

Good first tasks: classifying emails, summarizing internal notes, extracting dates, detecting duplicates, preparing context and creating drafts. Bad first tasks: critical decisions, final client-facing writing and anything that is hard to review.
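
One way to enforce that boundary in code is a simple allowlist; the task names here are hypothetical:

# anything not explicitly allowlisted goes straight to the paid model
LOCAL_TASKS = {"classify_email", "summarize_note", "extract_dates",
               "detect_duplicate", "prepare_context", "draft"}

def handled_locally(task):
    return task in LOCAL_TASKS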

7. Local classification service

import requests

def classify_ticket(text):
    # constrain the answer to a fixed label set so the output is easy to validate
    prompt = f"Classify this ticket as vpn, email, hardware, software or other:\n{text}"
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3.2:1b",
            "stream": False,  # return one JSON object instead of a token stream
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    # normalize the label so downstream checks can compare it directly
    return response.json()["message"]["content"].strip().lower()
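
A minimal escalation rule on top of this function: trust the local answer only when it is one of the expected labels, and fall back to the paid model otherwise. call_paid_model is a placeholder stub, not a real client:

VALID_LABELS = {"vpn", "email", "hardware", "software", "other"}

def call_paid_model(text):
    # placeholder: swap in your real paid-API client here
    raise NotImplementedError("wire up your paid model client")

def classify_with_fallback(text):
    # anything outside the expected label set escalates to the paid model
    label = classify_ticket(text)
    if label in VALID_LABELS:
        return label, "local"
    return call_paid_model(text), "paid"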

8. Measuring quality, cost and time

Track accuracy, tokens avoided, how often manual review is needed, and the final routing decision. If the model fails only on specific cases, add escalation rules for those cases instead of abandoning the local layer.
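
A sketch of that measurement loop, reusing classify_ticket from section 7 and assuming a small labeled file of real tickets (labeled.jsonl is a hypothetical name, one {"text": ..., "label": ...} object per line):

import json

def evaluate(path="labeled.jsonl"):
    correct = total = 0
    for line in open(path, encoding="utf-8"):
        example = json.loads(line)
        total += 1
        if classify_ticket(example["text"]) == example["label"]:
            correct += 1
    # accuracy tells you whether the local layer is safe to trust;
    # every correct local answer is paid tokens avoided
    print(f"accuracy: {correct}/{max(total, 1)} = {correct / max(total, 1):.0%}")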

9. Extras

Use an Ollama Modelfile to specialize behavior, keep a log of failed examples for later analysis, and do not measure speed alone; a fast wrong answer still costs review time.
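
A minimal Modelfile sketch that pins the classification behavior (the system prompt and model name are illustrative, not a prescribed setup):

FROM llama3.2:1b
SYSTEM "Answer with exactly one word: vpn, email, hardware, software or other."
PARAMETER temperature 0

Build and run it with:

ollama create ticket-classifier -f Modelfile
ollama run ticket-classifier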

10. Full stack

Ollama, a small model, real examples, a test script, accuracy tracking and a fallback rule to paid AI.