How to run LLMs locally using Ollama

PS: This post has definitely not been written using Ollama.

What’s Ollama?

Ollama provides an easy way to manage LLMs. It’s a command-line app that lets you run large language models right on your computer. It’s built on top of llama.cpp, a C++ library that makes it easy to run models on CPUs or GPUs.

Whether you’re using an older GPU or don’t have one at all, Ollama’s got you covered. It’s designed to be simple and focused, making it a breeze to get started with LLMs.

How do I run models locally?

Simple. Just run this in the terminal if you’re on GNU+Linux or macOS.

curl -fsSL https://ollama.com/install.sh | sh

If you’re on Windows, download the installer from the website and run it. Either way, this sets up an Ollama service (on all of the above operating systems) that listens on localhost:11434 by default.

To pull your first model and start chatting with it right away:

ollama run neural-chat 

I’ve used neural-chat as an example, but other models such as Llama 2 are also available. Some models, such as LLaVA, can even understand images, for example by identifying details in them.
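Once a model is pulled, the same Ollama service can also be queried over HTTP, which is what graphical front ends build on. Here is a minimal sketch, assuming the service listens on its default address localhost:11434 and that curl is installed. The helper names generate_payload and ask_ollama are made up for illustration, and the payload builder does no JSON escaping, so keep double quotes and backslashes out of the prompt:

```shell
# Hypothetical helper: build the JSON body for Ollama's /api/generate endpoint.
# No escaping is done, so the model name and prompt must not contain
# double quotes or backslashes.
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

# Hypothetical helper: send one prompt to the local Ollama service.
# Assumes the default address localhost:11434.
ask_ollama() {
  curl -fsS http://localhost:11434/api/generate \
    -d "$(generate_payload "$1" "$2")"
}
```

Something like `ask_ollama neural-chat "Why is the sky blue?"` should then print a single JSON object whose response field holds the model's answer.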

Ugh, this is a pain to use. Is there a better way?

Of course! Chatting with an LLM in a terminal gets boring fast. Open WebUI gives you a browser-based chat interface on top of Ollama. I’ll be running it via Podman; Docker works too.

podman run --rm -p 3000:8080  \
   -v open-webui:/app/backend/data  \
   --network slirp4netns:allow_host_loopback=true \
   --add-host=ollama.local:10.0.2.2 \
   --env OLLAMA_BASE_URL=http://ollama.local:11434 \
   --env ANONYMIZED_TELEMETRY=False \
   --name open-webui ghcr.io/open-webui/open-webui:main

If you are using Docker instead of Podman:

# With --network=host the container shares the host's network stack, so
# -p and --add-host are unnecessary: the UI is served on port 8080 and
# Ollama is reachable at 127.0.0.1.
docker run --rm \
   -v open-webui:/app/backend/data \
   --network=host \
   -e OLLAMA_BASE_URL=http://127.0.0.1:11434 \
   -e ANONYMIZED_TELEMETRY=False \
   --name open-webui ghcr.io/open-webui/open-webui:main

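Now open http://localhost:3000 in a browser (or http://localhost:8080 if you used the Docker command, since host networking bypasses the port mapping) and start configuring Open WebUI! Here’s a demo on how to use Open WebUI.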

Bonus

Here’s how you can configure Emacs to talk to Ollama using GPTel. I’m using straight.el, so the package is pulled from its GitHub repo. If you’re not using straight.el, the installation instructions may vary:

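(use-package gptel
  :straight t
  :config
  ;; OPTIONAL configuration: make the local Ollama service the default backend.
  ;; Newer gptel versions document model names as symbols
  ;; (e.g. 'neural-chat:latest); switch to that form if strings are rejected.
  (setq
   gptel-model "neural-chat:latest"
   gptel-backend (gptel-make-ollama "Ollama"
                   :host "localhost:11434"
                   :stream t
                   :models '("neural-chat:latest"))))
;; Unrelated to gptel: sets the garbage-collection threshold to about 2 MB.
(setq gc-cons-threshold (* 2 1000 1000))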

I’ve configured GPTel (and hence Ollama) to use neural-chat because I’ve pulled the model locally. Feel free to choose any other model(s) of your liking.
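To see which model names you could put there, you can list what has been pulled locally. Here is a small sketch, assuming `ollama list` prints a header row followed by one row per model with the name in the first column. model_names is a hypothetical helper, not part of Ollama:

```shell
# Hypothetical helper: read `ollama list` output on stdin and print only
# the model names (first column), skipping the header row and blank lines.
model_names() {
  awk 'NR > 1 && NF > 0 { print $1 }'
}
```

Run it as `ollama list | model_names` and copy any of the printed names into the GPTel configuration.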

Fin

Pick whichever model suits you. The more parameters a model has (e.g. 3B, 7B, 13B), the more resource-intensive it is to run. On my potato laptop (which lacks a dGPU), running these models strains the CPU and can consume a ton of memory. But I guess that’s fine: the models run locally and don’t send telemetry data anywhere. Thanks for reading!