Text embeddings for RAG and search - Python, Ollama, OpenAI-compatible APIs



If you are working through retrieval-augmented generation (RAG), this section walks through text embeddings in plain terms: what they are, how they fit search and retrieval, and how to call two common local setups from Python, using either Ollama or an OpenAI-compatible HTTP API of the kind many llama.cpp-based servers expose.

Text embeddings and retrieval

For Go clients and SDK comparisons for Ollama, see Go SDKs for Ollama — comparison with examples.

What is a text embedding?

A text embedding is a vector (a list of floats) produced by an embedding model. The model maps variable-length text into a fixed-dimensional space so that texts with similar meaning tend to sit close under a distance or similarity measure (often cosine similarity on L2-normalized vectors).

Embeddings are not the same as token IDs and not the same as a chat completion. They are a representation layer you use for search, clustering, deduplication, and — in RAG — retrieval.
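As a minimal illustration of that similarity measure (plain Python, no libraries; the function name here is mine, not from any API), cosine similarity between two vectors is:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|): 1.0 for same direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (same direction; length ignored)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

On L2-normalized vectors the denominator is 1, so cosine similarity reduces to a plain dot product, which is why many stores normalize once at index time.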

Common use cases

  Semantic search / RAG retrieval: embed queries and document chunks; rank by similarity to fetch relevant passages.
  Reranking with embedding models: embed the query and each candidate; score pairs by similarity (see Reranking with embedding models).
  Clustering and deduplication: group or dedupe items in embedding space without labeling every example by hand.
  Classification-style scoring: compare text to prototype descriptions or class names in the same space (patterns vary by model).

For multimodal settings (image–text and related ideas), see Cross-modal embeddings.

Embeddings inside a RAG pipeline

A typical offline path is:

  1. Chunk documents (size, overlap, and structure matter — see Chunking strategies in RAG).
  2. Embed each chunk; optionally store metadata (source id, section, ACL).
  3. Index vectors in memory, a library index, or a vector database (tradeoffs in Vector stores for RAG — comparison).
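The offline path above can be sketched as follows. The chunker is a naive fixed-size character split, and embed is a stand-in for either HTTP client shown later; all names are illustrative, not from any library:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with overlap between neighbors."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def build_index(docs: dict[str, str], embed) -> list[dict]:
    """Embed every chunk and keep vector + text + metadata together."""
    index = []
    for doc_id, text in docs.items():
        for i, chunk in enumerate(chunk_text(text)):
            index.append({
                "vector": embed(chunk),
                "text": chunk,
                "meta": {"source_id": doc_id, "chunk_no": i},
            })
    return index
```

Real chunkers split on structure (headings, paragraphs, sentences) rather than raw characters; see the chunking article linked above.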

At query time:

  1. Embed the user query (one short string or a small batch).
  2. Retrieve the top‑k similar chunks by vector search (optionally plus keyword / hybrid search).
  3. Build a prompt from the retrieved plain text chunks and call your chat model.

Important nuance — large language models in chat APIs consume text (and tools), not arbitrary embedding tensors. You use embeddings to choose which text to inject. If you see “query the LLM with precalculated embeddings,” in practice that means retrieve with embeddings, then send the selected text to the LLM.

Get embeddings with Ollama (Python)

Ollama exposes an HTTP API. For embeddings, call POST /api/embed on your Ollama host (default http://127.0.0.1:11434). The JSON body includes a model name and input (a string or a list of strings). The response includes embeddings, a list of vectors aligned with your inputs.

Install httpx (or use requests the same way).

import httpx

OLLAMA = "http://127.0.0.1:11434"
MODEL = "nomic-embed-text"  # replace with an embedding model you have pulled

def embed_ollama(texts: list[str]) -> list[list[float]]:
    r = httpx.post(
        f"{OLLAMA}/api/embed",
        json={"model": MODEL, "input": texts},
        timeout=120.0,
    )
    r.raise_for_status()
    data = r.json()
    return data["embeddings"]

if __name__ == "__main__":
    q = "What is retrieval-augmented generation?"
    chunks = [
        "RAG combines retrieval with generation.",
        "Embeddings map text into vector space for similarity search.",
    ]
    qv = embed_ollama([q])[0]
    doc_vs = embed_ollama(chunks)
    print(len(qv), len(doc_vs), len(doc_vs[0]))
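Once embed_ollama has returned the vectors, ranking chunks against the query is pure local computation with no further API calls. This helper is an illustration, not part of the Ollama API:

```python
import math

def rank_chunks(query_vec: list[float],
                chunk_vecs: list[list[float]],
                chunks: list[str]) -> list[tuple[float, str]]:
    """Return (score, chunk) pairs sorted by cosine similarity, best first."""
    def cos(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    scored = [(cos(query_vec, v), c) for v, c in zip(chunk_vecs, chunks)]
    return sorted(scored, key=lambda p: p[0], reverse=True)
```

Feeding it the qv and doc_vs from the example above gives you the chunks in retrieval order, ready to paste into a prompt.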


Get embeddings with an OpenAI-compatible server (Python)

Many local servers (including common llama.cpp HTTP setups) expose OpenAI-compatible routes such as POST /v1/embeddings. You can use the official openai Python package and point base_url at your server’s …/v1 root.

from openai import OpenAI

# Example — replace host, port, and model id with your server’s values
client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="not-needed",  # many local servers ignore this
)

def embed_openai_compatible(text: str, model: str) -> list[float]:
    r = client.embeddings.create(model=model, input=text)
    return r.data[0].embedding

if __name__ == "__main__":
    v = embed_openai_compatible("hello from llama.cpp", "your-embedding-model-id")
    print(len(v))
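The /v1/embeddings route also accepts a list as input, which is how you batch. Each response item carries an index field, so sorting by it keeps outputs aligned with inputs even if a server reorders them. A hedged sketch (embed_batch assumes a running server and is not exercised here; restore_order is a pure helper):

```python
def restore_order(pairs: list[tuple[int, list[float]]]) -> list[list[float]]:
    """Sort (index, embedding) pairs so outputs line up with the inputs."""
    return [emb for _, emb in sorted(pairs, key=lambda p: p[0])]

def embed_batch(texts: list[str], model: str, base_url: str) -> list[list[float]]:
    """Batch-embed via an OpenAI-compatible /v1/embeddings endpoint."""
    from openai import OpenAI  # imported here so the sketch stays optional
    client = OpenAI(base_url=base_url, api_key="not-needed")
    r = client.embeddings.create(model=model, input=texts)
    return restore_order([(d.index, d.embedding) for d in r.data])
```

Batching matters during offline indexing: one request per chunk is the slowest possible way to embed a corpus.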

Why keep both patterns on one page? The concepts (chunk, embed, index, query, retrieve text) are identical; only the HTTP surface changes. One workshop-style article avoids duplicating the same narrative under two URLs.

Persist vectors and query them

At minimum you must store three things per chunk — vector, text, and metadata (source id, offsets, ACL). For a quick prototype you can keep everything in a Python list and use cosine similarity with NumPy or scikit-learn. For growing data, use a vector database or a library index (FAISS, etc.); see Vector stores for RAG — comparison for product-level tradeoffs.
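For the quick-prototype path, a minimal in-memory store with NumPy cosine search might look like this. It is an illustrative sketch, not a library API; vectors are L2-normalized once at add time so search becomes a dot product:

```python
import numpy as np

class TinyVectorStore:
    """Prototype store: vector, text, and metadata kept in parallel lists."""

    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.texts: list[str] = []
        self.meta: list[dict] = []

    def add(self, vector: list[float], text: str, meta: dict) -> None:
        v = np.asarray(vector, dtype=np.float32)
        self.vectors.append(v / np.linalg.norm(v))  # normalize once at index time
        self.texts.append(text)
        self.meta.append(meta)

    def search(self, query_vec: list[float], k: int = 3) -> list[tuple[float, str, dict]]:
        q = np.asarray(query_vec, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vectors) @ q  # cosine via dot of unit vectors
        top = np.argsort(sims)[::-1][:k]
        return [(float(sims[i]), self.texts[i], self.meta[i]) for i in top]
```

This linear scan is fine for thousands of vectors; beyond that, reach for FAISS or a vector database as discussed in the comparison article.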

Conceptual query loop:

  1. query_vec = embed(query)
  2. neighbors = index.search(query_vec, k)
  3. context = "\n\n".join(chunk.text for chunk in neighbors)
  4. Send context and the user question to your chat API.
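Steps 3 and 4 of the loop reduce to string assembly. A small sketch of the prompt-building step (the wording and function name are mine, not a fixed convention):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved plain-text chunks into a grounded prompt for a chat model."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The resulting string is what actually goes to the chat API; the embeddings themselves never leave the retrieval step.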

Reranking after retrieval

A reranker (often a cross-encoder or a second scoring model) can re-order the top candidates after vector retrieval. This site has Python and Go examples, including Reranking with embedding models and Reranking with Ollama and Qwen3 Embedding in Go.

Related articles

  Full RAG architecture: RAG tutorial — architecture, implementation, production
  Chunking before you embed: Chunking strategies in RAG
  Vector DB choice: Vector stores for RAG — comparison
  Qwen3 on Ollama: Qwen3 Embedding & Reranker on Ollama
  Cross-modal: Cross-modal embeddings
  Ollama CLI & tips: Ollama cheatsheet
  Go + Ollama: Using Ollama in Go — SDK comparison