LocalAI QuickStart: Run OpenAI-Compatible LLMs Locally

Self-host OpenAI-compatible APIs with LocalAI in minutes.

LocalAI is a self-hosted, local-first inference server designed to behave like a drop-in OpenAI API for running AI workloads on your own hardware (laptop, workstation, or on-prem server).

The project targets practical “replace the cloud API URL” compatibility, while supporting multiple backends and modalities (text, images, audio, embeddings, and more).

[Infographic: LocalAI LLM quickstart]

What LocalAI is and why engineers use it

LocalAI presents an HTTP REST API that mirrors key OpenAI endpoints, including chat completions, embeddings, image generation, and audio endpoints, so existing OpenAI-compatible tooling can be repointed to your own infrastructure.

Beyond basic text generation, LocalAI’s feature set spans common “production building blocks” such as embeddings for RAG, diffusion-based image generation, speech-to-text, and text-to-speech, with optional GPU acceleration and distributed patterns.

If you’re evaluating self-hosted LLM serving, LocalAI is interesting because it focuses on API compatibility (for easier integration) while also providing a built-in Web UI and a model gallery workflow to reduce the friction of installing and configuring models.

For a broader comparison of self-hosted and cloud LLM hosting options — including Ollama, vLLM, Docker Model Runner, and managed cloud providers — see the LLM hosting guide for 2026.

If you want a side-by-side breakdown of LocalAI against Ollama, vLLM, LM Studio, and others, comparing the main local LLM tools in 2026 covers API support, hardware compatibility, and production readiness. For the broader case for keeping models on your own infrastructure, LLM self-hosting and AI sovereignty covers data residency and compliance motivations.

LocalAI installation options that work well in practice

LocalAI can be installed in multiple ways, but for most teams the quickest, lowest-risk starting point is containers (Docker or Podman). If you want a command reference while working through the examples below, the Docker cheatsheet covers the most frequent and useful Docker commands.

Fastest start with Docker

This starts the LocalAI server and binds the API and Web UI on port 8080:

docker run -p 8080:8080 --name local-ai -ti localai/localai:latest

LocalAI’s container documentation calls this the fastest path for getting a working server up, with the API reachable on http://localhost:8080.

Choosing the right LocalAI container image

LocalAI publishes multiple container flavours so you can match your hardware:

  • A CPU image for broad compatibility.
  • GPU-specific images for NVIDIA CUDA, AMD ROCm, Intel oneAPI, and Vulkan.
  • All-in-One (AIO) images that come pre-configured with models mapped to OpenAI-like model names.

The upstream GitHub README includes concrete docker run examples for CPU-only and several GPU options (NVIDIA CUDA variants, AMD ROCm, Intel, Vulkan), plus AIO variants.
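The CUDA flavour, for example, typically runs like the CPU command plus a GPU flag and a GPU image tag. This is a sketch based on the README's naming at the time of writing; verify current tags upstream, and note the NVIDIA Container Toolkit must be installed on the host:

```shell
# NVIDIA CUDA 12 variant (image tag from the upstream README -- check current releases).
docker run -ti --name local-ai -p 8080:8080 \
  --gpus all \
  localai/localai:latest-gpu-nvidia-cuda-12
```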

Persist models between restarts

If you don’t mount storage, downloaded models live only inside the container and are lost when it is removed. The container guide recommends mounting a models volume, for example:

docker run -ti --name local-ai -p 8080:8080 \
  -v "$PWD/models:/models" \
  localai/localai:latest-aio-cpu

This makes /models inside the container persistent on your host.

A minimal Docker Compose QuickStart

LocalAI also provides a reference docker-compose.yaml in the repository, demonstrating a common pattern: bind port 8080, mount a /models volume, set MODELS_PATH=/models, and optionally preload a model by specifying it in the command list (the repo example shows phi-2). The Docker Compose cheatsheet is a handy reference while adapting this to your setup.

A “good default” Compose setup (CPU) looks like this:

services:
  localai:
    image: localai/localai:latest
    container_name: local-ai
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    environment:
      - MODELS_PATH=/models

The key idea is the same as the upstream example: host models directory ↔ container /models.
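On NVIDIA hardware, the same Compose shape extends with a GPU reservation and a GPU image tag. A sketch, assuming the latest-gpu-nvidia-cuda-12 tag and the NVIDIA Container Toolkit are available:

```yaml
services:
  localai:
    image: localai/localai:latest-gpu-nvidia-cuda-12
    container_name: local-ai
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    environment:
      - MODELS_PATH=/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```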

If you are also using Docker’s native docker model tooling alongside LocalAI, the Docker Model Runner cheatsheet covers pull, run, package, and configuration commands.

Non-container LocalAI installs

LocalAI also supports installs via platform-specific methods (for example, a macOS DMG and Linux binaries), and broader deployment options like Kubernetes.

If you prefer scripted installs on Linux, the DeepWiki quick start describes an install.sh path that auto-detects hardware and configures the system accordingly.

A predictable usage sequence

A reliable LocalAI workflow is:

Start LocalAI → install or import a model → verify loaded models → call OpenAI-compatible endpoints.

This sequence matches the official “Try it out” and “Setting up models” guidance, which frames the process around starting the server, installing models via gallery or CLI, and then testing endpoints with curl.

Start the server and confirm it is healthy

Once the server is running, a common sanity check is the readiness endpoint:

curl http://localhost:8080/readyz

The troubleshooting guide uses /readyz as a first diagnostic to confirm LocalAI is responsive.

LocalAI provides two mainstream model onboarding flows:

  • Model Gallery install via Web UI, where you open the UI, go to the Models tab, browse models, and click Install.
  • CLI-driven install and run, using local-ai models list, local-ai models install, and local-ai run.

The documentation also supports importing models by URI (Hugging Face repositories, direct model file URIs, and other registries), and the Web UI includes a dedicated Import Model flow with a YAML editor for advanced configuration.
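The CLI-driven flow looks like this in a terminal (the model name is illustrative; use one that local-ai models list actually shows):

```shell
# Browse the model gallery from the CLI.
local-ai models list

# Install a gallery model by name (illustrative; pick one from the list above).
local-ai models install qwen3-4b

# Start the server with that model loaded.
local-ai run qwen3-4b
```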

Verify what LocalAI thinks it can serve

To list deployed models through the OpenAI-compatible API:

curl http://localhost:8080/v1/models

This is explicitly recommended both as a “next step” after container install and as a troubleshooting diagnostic.

Main command-line parameters worth learning

LocalAI’s CLI is built around the local-ai run command, with a comprehensive configuration surface. Two operational behaviours are worth highlighting:

  • Every CLI flag can be set via an environment variable.
  • Environment variables take precedence over CLI flags.

Below are the parameters most practitioners end up using early, grouped by intent. All defaults and env-var names are taken from the upstream CLI reference. If you are evaluating Ollama alongside LocalAI, the Ollama CLI cheatsheet covers its serve, run, ps, and model management commands for comparison.
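A quick sketch of that flag/env-var duality, using the thread count as an example:

```shell
# Equivalent ways to request 8 threads:
local-ai run --threads 8
LOCALAI_THREADS=8 local-ai run

# If both are set, the environment variable wins per the CLI reference:
LOCALAI_THREADS=8 local-ai run --threads 4   # effective thread count: 8
```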

Core server and storage flags

  • Change bind address and port: --address (env: LOCALAI_ADDRESS). Default is :8080.
  • Change where models live: --models-path (env: LOCALAI_MODELS_PATH). Critical for persistent storage and disk planning.
  • Separate mutable state from config: --data-path (env: LOCALAI_DATA_PATH). Stores persistent data like agent state and jobs.
  • Set upload location: --upload-path (env: LOCALAI_UPLOAD_PATH). Used by the file-related APIs.

LocalAI’s FAQ also documents default model storage locations and explicitly recommends LOCALAI_MODELS_PATH or --models-path if you want models outside the default directory (for example, to avoid filling a home directory).

Performance and capacity flags

  • Tune CPU usage: --threads (env: LOCALAI_THREADS). Suggested to match physical cores; used widely for performance tuning.
  • Control per-model context: --context-size (env: LOCALAI_CONTEXT_SIZE). Default context size for models.
  • Enable GPU acceleration mode: --f16 (env: LOCALAI_F16). Documented as “Enable GPU acceleration”.
  • Limit loaded models in memory: --max-active-backends (env: LOCALAI_MAX_ACTIVE_BACKENDS). Enables LRU eviction when the limit is exceeded; can cap memory footprint.
  • Stop idle or stuck backends: --enable-watchdog-idle / --enable-watchdog-busy (env: LOCALAI_WATCHDOG_IDLE / LOCALAI_WATCHDOG_BUSY). Useful when running many models or unstable backends.
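Combined, a memory-conscious single-node start might look like this (values are illustrative; match --threads to your physical core count):

```shell
local-ai run \
  --threads 8 \
  --context-size 4096 \
  --max-active-backends 1 \
  --enable-watchdog-idle
```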

For broader compatibility and acceleration constraints, the model compatibility table documents which backends support which acceleration modes (CUDA, ROCm, SYCL, Vulkan, Metal, CPU), and also notes that models not explicitly configured may be auto-loaded, while YAML config lets you pin behaviour. For higher-throughput multi-GPU deployments with PagedAttention, the vLLM quickstart guide walks through a comparable OpenAI-compatible server with production-oriented configuration.

API, security, and UI flags

  • Require API keys: --api-keys (env: LOCALAI_API_KEY / API_KEY). When set, all requests must authenticate with a configured key.
  • Allow browsers to call the API: --cors / --cors-allow-origins (env: LOCALAI_CORS / LOCALAI_CORS_ALLOW_ORIGINS). Keep disabled unless you need it.
  • Disable the Web UI entirely: --disable-webui (env: LOCALAI_DISABLE_WEBUI). API-only mode for hardened deployments.
  • Harden error responses: --opaque-errors (env: LOCALAI_OPAQUE_ERRORS). Useful in high-security environments.

If you expose LocalAI remotely, protect the endpoints and gate access with an API key; note that a configured API key effectively grants full access.
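A hardened, API-only start combining these flags could look like this (the key value is a placeholder; generate a long random value of your own):

```shell
local-ai run \
  --api-keys "replace-with-a-long-random-key" \
  --disable-webui \
  --opaque-errors
```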

Web UI tour and how it maps to the system

By default, LocalAI serves a built-in Web UI alongside the API (unless you disable it). The docs state the UI is accessible on the same host and port as the server, typically http://localhost:8080.

What you can do in the built-in UI

The Web UI is a browser-based interface that covers:

  • Model management and the gallery browsing experience
  • Chat interactions
  • Image generation and text-to-speech interfaces
  • Distributed and P2P configuration

The route structure gives a clear mental model of the UI surface area:

  • / for the dashboard
  • /browse for the model gallery browser
  • /chat/ for chat
  • /text2image/ for image generation
  • /tts/ for text-to-speech
  • /talk/ for voice interaction
  • /p2p for P2P settings and monitoring

For engineers, the most important UI feature is model onboarding. The official “Setting Up Models” guide describes:

  • Installing models via the Models tab with one-click install.
  • Importing models via an Import Model UI that supports a simple mode (URI + preferences) and an advanced mode with a YAML editor and validation tools.

This matters because LocalAI ultimately runs models based on YAML configuration: you can manage individual YAML files in the models directory, use a single file with multiple model definitions via --models-config-file, or reference remote YAML URLs at startup.
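As a sketch of what such a per-model YAML file can look like (field names follow the documented model-config schema; the model name, backend id, and file name are illustrative placeholders):

```yaml
# models/my-model.yaml -- illustrative; one model definition per file.
name: my-model                  # the value clients pass as "model"
backend: llama-cpp              # backend id varies by version; check the model-config docs
context_size: 4096
f16: true
parameters:
  model: my-model.Q4_K_M.gguf   # model file under the models directory
```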

Examples you can paste into a terminal

LocalAI’s OpenAI-compatible endpoints are designed to accept familiar request formats and return JSON responses (with audio endpoints returning audio payloads).

Example chat completions with curl

The LocalAI “Try it out” page shows calling the chat completions endpoint directly:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [
      { "role": "user", "content": "Write a one paragraph explanation of what LocalAI is." }
    ],
    "temperature": 0.2
  }'

AIO images ship pre-configured models mapped to OpenAI-like names such as gpt-4, and the container documentation explains these are backed by open-source models.

If you are not using an AIO image, replace "model" with the model name you installed (check with /v1/models).
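Checking /v1/models is easy to script with nothing but the standard library. A stdlib-only way to pull out the model ids, shown here against an inlined sample response so it runs without a server (the two ids are illustrative):

```shell
# Sample /v1/models response body (illustrative model ids).
response='{"object":"list","data":[{"id":"gpt-4","object":"model"},{"id":"phi-2","object":"model"}]}'

# Extract one model id per line. Against a live server, pipe in the real
# response instead:  curl -s http://localhost:8080/v1/models | python3 -c '...'
ids=$(printf '%s' "$response" | python3 -c 'import json,sys; print("\n".join(m["id"] for m in json.load(sys.stdin)["data"]))')
echo "$ids"
```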

Example embeddings for RAG pipelines

LocalAI supports embeddings, and its documentation states that the embeddings endpoint is compatible with several backends, including llama.cpp, bert.cpp, and sentence-transformers.

A minimal “embed this text” request against the OpenAI-compatible endpoint looks like this:

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "text-embedding-ada-002",
    "input": "LocalAI embeddings are handy for semantic search and RAG."
  }'

LocalAI’s embeddings documentation also shows how embeddings are enabled via YAML configuration by setting embeddings: true.
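A minimal embeddings model definition along those lines might look like this (a sketch: the model name and backend id are illustrative; the key line is embeddings: true):

```yaml
# models/my-embedder.yaml -- illustrative names.
name: my-embedder
backend: sentencetransformers   # backend id as shown in the embeddings docs
embeddings: true
parameters:
  model: all-MiniLM-L6-v2
```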

Example using an OpenAI-compatible client

LocalAI is designed so you can use standard OpenAI client libraries by pointing them at the LocalAI base URL (and optionally setting an API key if you enabled authentication). This “drop-in replacement” goal is described in both the upstream README and the OpenAI-compatibility documentation.

A typical configuration is:

  • Base URL: http://localhost:8080/v1
  • API key: either not required (default) or required if you configured --api-keys

Security and troubleshooting essentials

Secure a LocalAI server before exposing it

LocalAI runs with no authentication by default (and its default bind address of :8080 listens on all interfaces). If you bind to a public interface or expose it through an ingress, add at least one of these controls:

  • Enable API key authentication using --api-keys / API_KEY.
  • Put a reverse proxy and network controls in front of it (firewall, allowlisting, VPN).
  • Disable the Web UI if you only need the API (--disable-webui).
  • Keep CORS disabled unless a browser-based client genuinely needs it.

When API keys are enabled, the OpenAI-compatible endpoints accept credentials in common places such as an Authorization Bearer header or x-api-key header.
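For example, once a key is configured, either of these authenticated calls should work (assuming the key is exported as LOCALAI_KEY):

```shell
# Authorization: Bearer style
curl http://localhost:8080/v1/models \
  -H "Authorization: Bearer $LOCALAI_KEY"

# x-api-key style
curl http://localhost:8080/v1/models \
  -H "x-api-key: $LOCALAI_KEY"
```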

Quick diagnostics when something is not working

LocalAI’s troubleshooting guide suggests a small set of checks that resolve most “is it running” incidents:

# readiness
curl http://localhost:8080/readyz

# list models
curl http://localhost:8080/v1/models

# version
local-ai --version

It also documents enabling debug logging via DEBUG=true or --log-level=debug, and for Docker deployments, checking container logs with docker logs local-ai.