Text embeddings for RAG and search - Python, Ollama, OpenAI-compatible APIs



If you are working through retrieval-augmented generation (RAG), this section walks through text embeddings in plain terms: what they are, how they fit search and retrieval, and how to call two common local setups from Python, using either Ollama or an OpenAI-compatible HTTP API of the kind many llama.cpp-based servers expose.

Text embeddings and retrieval

For Go clients and SDK comparisons for Ollama, see Go SDKs for Ollama — comparison with examples.

What is a text embedding?

A text embedding is a vector (a list of floats) produced by an embedding model. The model maps variable-length text into a fixed-dimensional space so that texts with similar meaning tend to sit close under a distance or similarity measure (often cosine similarity on L2-normalized vectors).

Embeddings are not the same as token IDs and not the same as a chat completion. They are a representation layer you use for search, clustering, deduplication, and — in RAG — retrieval.
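As a minimal illustration of that similarity measure (plain Python, no libraries; the function name here is mine, not from any API), cosine similarity between two vectors is:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|): 1.0 for same direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction; length ignored)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

On L2-normalized vectors the denominator is 1, so cosine similarity reduces to a plain dot product, which is why many stores normalize once at index time.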

Common use cases

  Semantic search / RAG retrieval: embed queries and document chunks; rank by similarity to fetch relevant passages.
  Reranking with embedding models: embed the query and each candidate; score pairs by similarity (see Reranking with embedding models).
  Clustering and deduplication: group or dedupe items in embedding space without labeling every example by hand.
  Classification-style scoring: compare text to prototype descriptions or class names in the same space (patterns vary by model).

For multimodal settings (image–text and related ideas), see Cross-modal embeddings.

Embeddings inside a RAG pipeline

A typical offline path is:

  1. Chunk documents (size, overlap, and structure matter — see Chunking strategies in RAG).
  2. Embed each chunk; optionally store metadata (source id, section, ACL).
  3. Index vectors in memory, a library index, or a vector database (tradeoffs in Vector stores for RAG — comparison).
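The offline path above can be sketched as follows. The chunker is a naive fixed-size character split, and embed is a stand-in for either HTTP client shown later; all names are illustrative, not from any library:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with overlap between neighbors."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def build_index(docs: dict[str, str], embed) -> list[dict]:
    """Embed every chunk and keep vector + text + metadata together."""
    index = []
    for doc_id, text in docs.items():
        for i, chunk in enumerate(chunk_text(text)):
            index.append({
                "vector": embed(chunk),
                "text": chunk,
                "meta": {"source_id": doc_id, "chunk_no": i},
            })
    return index
```

Real chunkers split on structure (headings, paragraphs, sentences) rather than raw characters; see the chunking article linked above.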

At query time:

  1. Embed the user query (one short string or a small batch).
  2. Retrieve the top‑k similar chunks by vector search (optionally plus keyword / hybrid search).
  3. Build a prompt from the retrieved plain text chunks and call your chat model.

Important nuance — large language models in chat APIs consume text (and tools), not arbitrary embedding tensors. You use embeddings to choose which text to inject. If you see “query the LLM with precalculated embeddings,” in practice that means retrieve with embeddings, then send the selected text to the LLM.

Get embeddings with Ollama (Python)

Ollama exposes an HTTP API. For embeddings, call POST /api/embed on your Ollama host (default http://127.0.0.1:11434). The JSON body includes a model name and input (a string or a list of strings). The response includes embeddings, a list of vectors aligned with your inputs.

Install httpx (or use requests the same way).

import httpx

OLLAMA = "http://127.0.0.1:11434"
MODEL = "nomic-embed-text"  # replace with an embedding model you have pulled

def embed_ollama(texts: list[str]) -> list[list[float]]:
    r = httpx.post(
        f"{OLLAMA}/api/embed",
        json={"model": MODEL, "input": texts},
        timeout=120.0,
    )
    r.raise_for_status()
    data = r.json()
    return data["embeddings"]

if __name__ == "__main__":
    q = "What is retrieval-augmented generation?"
    chunks = [
        "RAG combines retrieval with generation.",
        "Embeddings map text into vector space for similarity search.",
    ]
    qv = embed_ollama([q])[0]
    doc_vs = embed_ollama(chunks)
    print(len(qv), len(doc_vs), len(doc_vs[0]))
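Once embed_ollama has returned the vectors, ranking chunks against the query is pure local computation with no further API calls. This helper is an illustration, not part of the Ollama API:

```python
import math

def rank_chunks(query_vec: list[float],
                chunk_vecs: list[list[float]],
                chunks: list[str]) -> list[tuple[float, str]]:
    """Return (score, chunk) pairs sorted by cosine similarity, best first."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    scored = [(cos(query_vec, v), c) for v, c in zip(chunk_vecs, chunks)]
    return sorted(scored, key=lambda p: p[0], reverse=True)
```

Feeding it the qv and doc_vs from the example above gives you the chunks in retrieval order, ready to paste into a prompt.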


Get embeddings with an OpenAI-compatible server (Python)

Many local servers (including common llama.cpp HTTP setups) expose OpenAI-compatible routes such as POST /v1/embeddings. You can use the official openai Python package and point base_url at your server’s …/v1 root.

from openai import OpenAI

# Example — replace host, port, and model id with your server’s values
client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="not-needed",  # many local servers ignore this
)

def embed_openai_compatible(text: str, model: str) -> list[float]:
    r = client.embeddings.create(model=model, input=text)
    return r.data[0].embedding

if __name__ == "__main__":
    v = embed_openai_compatible("hello from llama.cpp", "your-embedding-model-id")
    print(len(v))
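The /v1/embeddings route also accepts a list as input, which is how you batch. Each response item carries an index field, so sorting by it keeps outputs aligned with inputs even if a server reorders them. A hedged sketch (embed_batch assumes a running server and is not exercised here; restore_order is a pure helper):

```python
def restore_order(pairs: list[tuple[int, list[float]]]) -> list[list[float]]:
    """Sort (index, embedding) pairs so outputs line up with the inputs."""
    return [emb for _, emb in sorted(pairs, key=lambda p: p[0])]

def embed_batch(texts: list[str], model: str, base_url: str) -> list[list[float]]:
    """Batch-embed via an OpenAI-compatible /v1/embeddings endpoint."""
    from openai import OpenAI  # imported here so the sketch stays optional
    client = OpenAI(base_url=base_url, api_key="not-needed")
    r = client.embeddings.create(model=model, input=texts)
    return restore_order([(d.index, d.embedding) for d in r.data])
```

Batching matters during offline indexing: one request per chunk is the slowest possible way to embed a corpus.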

Why keep both patterns on one page? The concepts (chunk, embed, index, query, retrieve text) are identical; only the HTTP surface changes. One workshop-style article avoids duplicating the same narrative under two URLs.

Persist vectors and query them

At minimum you must store three things per chunk — vector, text, and metadata (source id, offsets, ACL). For a quick prototype you can keep everything in a Python list and use cosine similarity with NumPy or scikit-learn. For growing data, use a vector database or a library index (FAISS, etc.); see Vector stores for RAG — comparison for product-level tradeoffs.
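For the quick-prototype path, a minimal in-memory store with NumPy cosine search might look like this. It is an illustrative sketch, not a library API; vectors are L2-normalized once at add time so search becomes a dot product:

```python
import numpy as np

class TinyVectorStore:
    """Prototype store: vector, text, and metadata kept in parallel lists."""

    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.texts: list[str] = []
        self.meta: list[dict] = []

    def add(self, vector: list[float], text: str, meta: dict) -> None:
        v = np.asarray(vector, dtype=np.float32)
        self.vectors.append(v / np.linalg.norm(v))  # normalize once at index time
        self.texts.append(text)
        self.meta.append(meta)

    def search(self, query_vec: list[float], k: int = 3) -> list[tuple[float, str, dict]]:
        q = np.asarray(query_vec, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vectors) @ q  # cosine via dot of unit vectors
        top = np.argsort(sims)[::-1][:k]
        return [(float(sims[i]), self.texts[i], self.meta[i]) for i in top]
```

This linear scan is fine for thousands of vectors; beyond that, reach for FAISS or a vector database as discussed in the comparison article.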

Conceptual query loop:

  1. query_vec = embed(query)
  2. neighbors = index.search(query_vec, k)
  3. context = "\n\n".join(chunk.text for chunk in neighbors)
  4. Send context and the user question to your chat API.
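Steps 3 and 4 of the loop reduce to string assembly. A small sketch of the prompt-building step (the wording and function name are mine, not a fixed convention):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved plain-text chunks into a grounded prompt for a chat model."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The resulting string is what actually goes to the chat API; the embeddings themselves never leave the retrieval step.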

Reranking after retrieval

A reranker (often a cross-encoder or a second scoring model) can re-order the top candidates after vector retrieval. This site has Python and Go examples, including Reranking with embedding models and Reranking with Ollama and Qwen3 Embedding in Go.

Related articles

  Full RAG architecture: RAG tutorial — architecture, implementation, production
  Chunking before you embed: Chunking strategies in RAG
  Vector DB choice: Vector stores for RAG — comparison
  Qwen3 on Ollama: Qwen3 Embedding & Reranker on Ollama
  Cross-modal: Cross-modal embeddings
  Ollama CLI & tips: Ollama cheatsheet
  Go + Ollama: Using Ollama in Go — SDK comparison