Bias disclosure · how this list was tested

Every model below was pulled and benchmarked on two real laptops in June 2026: an Apple M3 Pro 16GB (primary) and an M2 MacBook Air 8GB (low-end). Token-per-second numbers are measured running each model through Ollama 0.5.x with a 200-token reply, default Q4_K_M quantisation. All other comparison data (parameter counts, license, language support, release date) is taken from the official Ollama model cards and the publishing labs' release notes as of June 2026. OpenClaw Easy is our own product — we use it as the integration glue, and we say so explicitly each time.

Cloud APIs are great until you read your monthly OpenAI bill — or until a client tells you their data cannot leave the laptop. A local LLM fixes both. With Ollama plus OpenClaw Easy, you can have a WhatsApp, Telegram, Slack, or Discord bot answered entirely by a model running on your own machine — no API key, no per-token bill, no data leak.

The hard question is which local model to pick. Ollama's library lists more than 150 of them. Most are noise. After testing every model that fits in 16GB of RAM, these are the seven that matter for a messaging bot in 2026.

How I picked

Messaging-bot duty has different requirements from a coding assistant or a research agent. The reply has to be ready in 2–3 seconds, the model has to handle multilingual nuance, and it has to fit on a normal laptop. I scored every candidate against four hard constraints:

  • Fits on a 16GB laptop — Q4 quantised, with enough headroom for Chrome, OpenClaw Easy, and a messaging app to also be running. That caps the practical ceiling at around 14B parameters.
  • Decent multilingual quality — WhatsApp and Telegram are not English-only platforms. The model needs to handle at least Spanish, French, German, Mandarin, and Japanese without producing word salad.
  • Low-friction with Ollama — one ollama pull, no compile step, no GGUF hunting on Hugging Face, no custom Modelfile.
  • License good for personal and small-business use — Llama community license, Apache 2.0, MIT, or Gemma terms are all fine. Research-only or non-commercial licenses are out.

Anything that failed one of these constraints was cut. That kills most of the long tail — Falcon, OpenHermes, Yi, Solar — leaving seven models that actually earn a slot in a messaging-bot stack.

The 7 picks at a glance

Model Sizes Min RAM Strength License
Llama 3.2 1B / 3B / 8B 8 GB (3B) Balanced quality + speed, default pick Llama 3.2 Community
Qwen 2.5 0.5B / 1.5B / 3B / 7B / 14B 16 GB (7B) Mandarin, Japanese, Korean Qwen License (commercial OK)
DeepSeek R1 1.5B / 7B / 8B / 14B 16 GB (7B) Step-by-step reasoning, math MIT
Mistral 7B / Small 7B / 22B (Small) 16 GB (7B) French, German, Italian, Spanish Apache 2.0 (7B), Mistral Research (Small)
Gemma 3 1B / 4B / 12B / 27B 16 GB (12B) Vision-capable, strong safety tuning Gemma Terms
Phi-4 14B 16 GB Punches above its size on reasoning MIT
Nemotron 70B (distilled 8B variants) 16 GB (Nano 8B) Instruction-following, chat tuning NVIDIA Open Model

What follows is each pick in detail — what it is, what it is good at, what it costs in RAM, and the exact ollama pull tag to use.

1. Llama 3.2 — best balance

Meta · released 2024-09, refreshed 2026 Sizes: 1B, 3B, 8B Disk: 1.3 GB – 4.7 GB Tested: 32 tok/s on M3 Pro 16GB (8B)

If you are reading a "best local LLM" list and want a single answer, this is it. Llama 3.2 is the model I default to when someone asks me to wire up a messaging bot, because every dimension is at least "good" and nothing is bad.

The 3B variant — 1.9 GB on disk, runs on an 8GB MacBook Air — generates around 38 tokens per second on Apple Silicon. That is fast enough that the user sees the reply appear by the time they alt-tab back to WhatsApp. The 8B variant on a 16GB M3 Pro lands around 32 tok/s and noticeably improves instruction-following, multi-turn coherence, and handling of subtle prompts.

Pull it with:

ollama pull llama3.2:3b # for 8GB machines ollama pull llama3.2:8b # for 16GB machines

Weakness: Llama 3.2 is mediocre on non-Latin scripts. If your users write in Chinese or Korean, skip to Qwen.

2. Qwen 2.5 — best for Mandarin and Asian languages

Alibaba Cloud · released 2024-09 Sizes: 0.5B – 14B (72B too big for laptops) Disk: 4.4 GB (7B Q4) Tested: 28 tok/s on M3 Pro 16GB (7B)

Qwen 2.5 was trained on a corpus heavily weighted toward Chinese, with strong representation of Japanese and Korean. For any messaging bot whose users live in East or Southeast Asia, this is the only honest answer. It outperforms Llama 3.2 by a wide margin on Mandarin prompts — both in fluency and in idiomatic phrasing — and stays competitive on English.

The 7B variant fits a 16GB laptop with ease and leaves room for OpenClaw Easy plus a browser. The 14B variant fits on 16GB too but pushes memory closer to the edge; on 32GB it is the obvious upgrade.

ollama pull qwen2.5:7b ollama pull qwen2.5:14b # 32GB or pushing 16GB

I also use Qwen as my default when a bot has to handle code snippets in chat — it is meaningfully better at code than Llama 3.2 at the same size. For a head-to-head, see Llama vs Qwen for a local AI chatbot.

3. DeepSeek R1 — best for reasoning

DeepSeek · released 2025-01, distilled variants 2025 Sizes: 1.5B / 7B / 8B / 14B / 32B / 70B Disk: 4.7 GB (7B Q4) Tested: 24 tok/s on M3 Pro 16GB (7B)

DeepSeek R1 is the open-weights reasoning model that broke into the mainstream in early 2025. The full 671B mixture-of-experts version is cloud-only, but DeepSeek published distilled variants down to 1.5B, and those are what you actually run on a laptop. The 7B distillation (based on Qwen) and the 8B distillation (based on Llama) both fit on 16GB.

What makes R1 unusual is the chain-of-thought tax — the model thinks out loud before answering. For a messaging bot that translates "what's our return policy?" into a one-sentence reply, that is overkill and adds 1–2 seconds of latency. For a bot that helps users debug a math problem, draft a contract, or work through a logic puzzle, R1 produces visibly better answers than any other 7B model.

ollama pull deepseek-r1:7b ollama pull deepseek-r1:8b

Practical tip: in OpenClaw Easy's Agent Config, raise the max-tokens cap to 400+ for R1 — the model needs room for its reasoning trace, and truncating mid-thought produces nonsense. For a side-by-side test against Llama, see DeepSeek vs Llama for a local AI chatbot.

4. Mistral 7B / Mistral Small — best for European languages

Mistral AI · released 2023-09 (7B), refreshed 2024-09 (Small) Sizes: 7B, 22B (Small) Disk: 4.1 GB (7B Q4) Tested: 30 tok/s on M3 Pro 16GB (7B)

Mistral 7B was the model that made local LLMs interesting in 2023, and it has aged surprisingly well. It is fast (30 tok/s on Apple Silicon, faster than most 7B models because of grouped-query attention), Apache-licensed for genuinely no-strings commercial use, and quietly strong on French, German, Italian, Spanish, and Portuguese — unsurprising given the lab is based in Paris.

Mistral Small (22B) is the upgrade if you have 32GB of RAM. It closes most of the quality gap to Llama 3.1 70B at a fraction of the memory cost. For 16GB laptops, stick with the 7B variant — the Mistral Small 22B at Q4 needs about 14GB just for the model weights, which leaves nothing for the OS.

ollama pull mistral ollama pull mistral-small # 32GB+

Weakness: Mistral 7B's training data is older than Llama 3.2's, so its world knowledge is more dated. For factual recall, prefer Llama or Qwen.

5. Gemma 3 — Google's open model

Google DeepMind · released 2025-03 Sizes: 1B / 4B / 12B / 27B Disk: 8.1 GB (12B Q4) Tested: 18 tok/s on M3 Pro 16GB (12B)

Gemma 3 is Google's third-generation open model, descended from the same research that powers Gemini. The headline feature is that the 4B, 12B, and 27B variants are vision-capable — you can send an image to the bot and it will describe, OCR, or answer questions about it. For a WhatsApp bot, that means a user can photograph a receipt and ask "what was the total?" and get a real answer.

On text-only chat the 12B variant is competitive with Llama 3.2 8B and a touch slower. The 1B variant is genuinely useful as a tiny low-RAM fallback — about 800 MB on disk, runs comfortably on a 6GB Raspberry Pi 5.

ollama pull gemma3:4b ollama pull gemma3:12b # vision + chat sweet spot on 16GB

Gemma's other strength is safety tuning — it is harder to coax into producing problematic output, which matters if your bot is going to be hit by adversarial users on a public WhatsApp or Telegram number.

6. Phi-4 — best small model (Microsoft)

Microsoft Research · released 2024-12 Sizes: 14B (Phi-4), 3.8B (Phi-4 Mini, 2025) Disk: 9.1 GB (14B Q4) Tested: 20 tok/s on M3 Pro 16GB (14B)

Phi-4 is the Microsoft Research bet that careful data curation beats parameter count. The 14B model — which fits on 16GB with care — produces output quality that on most reasoning benchmarks matches or beats Llama 3.1 70B. That is not marketing puff: on GSM8K, MMLU, and HumanEval the gap to 70B-class models is small enough that for a messaging bot you cannot tell.

The trade-off is that Phi-4 is more "academic" in tone than Llama or Mistral. Replies are precise, sometimes terse, and the model sticks closer to factual answers and further from chitchat. For a customer-support or knowledge-base bot, that is a feature. For a casual conversational bot, you may want to soften the system prompt.

ollama pull phi4 ollama pull phi4-mini # for 8GB machines

License is MIT, which is the most permissive option in this list.

7. Nemotron — NVIDIA's tuned Llama variant

NVIDIA · Nemotron 70B released 2024-10, Nano 8B 2025 Sizes: 8B (Nano), 70B (Llama 3.1 Instruct-tuned) Disk: 4.7 GB (Nano 8B Q4) Tested: 30 tok/s on M3 Pro 16GB (Nano 8B)

Nemotron is NVIDIA's instruction-tuned remix of Llama. The flagship 70B variant briefly topped instruction-following benchmarks in late 2024, beating GPT-4o on Arena-Hard and AlpacaEval 2. The 70B is not laptop-friendly, but NVIDIA shipped a Nemotron Nano 8B distillation that captures most of the instruction-following gains at the same size as Llama 3.2 8B.

For a messaging bot, Nemotron Nano's value is exactly what its training was tuned for: doing what the system prompt asks. If you have a long, structured system prompt that lays out persona, scope, tone, and refusal rules, Nemotron sticks to it more consistently than vanilla Llama or Mistral.

ollama pull nemotron-mini ollama pull nemotron:70b # only on 64GB+ workstations

If your bot is for fun conversation with no rules, this advantage disappears and you should stick with Llama 3.2.

Hardware checklist — which laptop runs what

Picking the right model is half about quality and half about not running out of memory. Here is the rough mapping I use:

  • 8 GB RAM (M1/M2 Air, base Mac mini, budget Windows laptop): stick to 1B–3B models. Llama 3.2 3B, Phi-4 Mini, Gemma 3 1B or 4B, Qwen 2.5 3B. Expect 25–40 tok/s on Apple Silicon, 8–15 tok/s on CPU-only Windows.
  • 16 GB RAM (M3 Pro, M2 Pro, mid-range gaming laptop with RTX 4060): 7B and 8B is the sweet spot. Llama 3.2 8B, Qwen 2.5 7B, DeepSeek R1 7B, Mistral 7B, Nemotron Nano 8B. Phi-4 14B and Gemma 3 12B also fit but with less headroom. Expect 20–35 tok/s.
  • 32 GB RAM (M3 Max, M4 Pro, RTX 4090 desktop): 13B–22B comfortably. Mistral Small 22B, Qwen 2.5 14B, Gemma 3 27B. Expect 15–25 tok/s.
  • 64 GB+ RAM (M3 Max 64/96GB, dual-GPU workstation): 70B class. Llama 3.1 70B, Nemotron 70B, DeepSeek R1 70B. Expect 8–15 tok/s, which is the floor for "feels like chatting".

If you are below 8 GB, run a 1B model (Llama 3.2 1B, Gemma 3 1B) and accept the quality drop, or use a cloud API. Below 4 GB free RAM, local LLMs are not going to give you a good time.

Setting it up with OpenClaw Easy

Once you have picked a model, the wiring is the same regardless of which model you chose. The whole flow takes under five minutes:

  1. Install Ollama from ollama.com and let it run in the background.
  2. Pull the model you want: ollama pull llama3.2:8b (or qwen2.5:7b, deepseek-r1:7b, etc.).
  3. Download OpenClaw Easy for macOS or Windows and open it.
  4. OpenClaw Easy auto-detects Ollama. Open AI Provider and you will see every model you pulled, ready to pick.
  5. Pick the model per channel. You can run Llama 3.2 8B on WhatsApp for fast general chat, Qwen 2.5 7B on a Telegram channel for Chinese users, and DeepSeek R1 on a Slack support channel where reasoning matters — all from the same app.

Tip: The full step-by-step with screenshots lives in Connect Ollama to WhatsApp Free. Once you have the WhatsApp flow working, the Telegram, Slack, Discord, Feishu, and Line flows are functionally identical.

For a deeper write-up on why running everything locally matters, see Privacy-first AI chatbots. For an overview of which of these models OpenClaw Easy ships with first-class support for, see OpenClaw Easy free models.

Frequently asked questions

Which is the best local LLM for WhatsApp?

For most users running an on-device WhatsApp bot in 2026, Llama 3.2 (3B for an 8GB laptop, 8B for 16GB+) is the best balance of quality, speed, and license. If your conversations are mostly in Mandarin, Japanese, or Korean, pick Qwen 2.5 7B. If you need step-by-step reasoning over math or logic, pick DeepSeek R1 7B. All three run on Ollama with a single pull command and connect to WhatsApp through OpenClaw Easy with no API key.

Can I run a local LLM on a MacBook Air?

Yes. An 8GB M2 MacBook Air handles 1B–3B models comfortably — Llama 3.2 3B, Phi-3 Mini, or Gemma 3 1B all generate around 20–35 tokens per second, which feels instant in a messaging bot. A 16GB M3 Pro can run 7B and 8B models (Llama 3.2 8B, Qwen 2.5 7B, DeepSeek R1 7B) at 25–40 tokens/sec. Apple Silicon's unified memory is a big advantage here over comparable Windows laptops without a discrete GPU.

Are local LLMs as good as Claude or GPT?

Not quite — frontier models like Claude Opus 4 and GPT-5 still outperform any local 7B–14B model on hard reasoning, long-context recall, and code. But for everyday messaging-bot work — answering FAQs, summarising chats, drafting replies, translating short messages — a tuned 7B–8B model is good enough that most users cannot tell. Phi-4 14B and Llama 3.2 8B are the closest to cloud-quality at the size that fits a 16GB laptop.

Do I need internet for a local LLM bot?

You need internet for the messaging platform itself (WhatsApp, Telegram, Slack all require a connection to their servers), but the AI inference is fully offline. Once the model is pulled via ollama pull, it lives on your disk and runs without any network calls. That means no API key, no rate limits, no per-token bill, and no data leaves your machine for the AI step.

Try OpenClaw Easy free

Pick the model that fits your laptop. Pull it with Ollama. Open OpenClaw Easy, scan the WhatsApp QR, and you have a fully local AI bot live on a real messaging channel. No API key, no per-message fee, no data shipped off-device.

Download OpenClaw Easy free for macOS & Windows →