# Text embeddings for RAG and search - Python, Ollama, OpenAI-compatible APIs
If you are working through retrieval-augmented generation (RAG), this section walks through text embeddings in plain terms — what they are, how they fit search and retrieval, and how to call two common local setups from Python using Ollama or an OpenAI-compatible HTTP API (as many llama.cpp-based servers expose).

For Go clients and SDK comparisons for Ollama, see Go SDKs for Ollama — comparison with examples.
## What is a text embedding?
A text embedding is a vector (a list of floats) produced by an embedding model. The model maps variable-length text into a fixed-dimensional space so that texts with similar meaning tend to sit close under a distance or similarity measure (often cosine similarity on L2-normalized vectors).
Embeddings are not the same as token IDs and not the same as a chat completion. They are a representation layer you use for search, clustering, deduplication, and — in RAG — retrieval.
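To make the similarity measure concrete, here is a minimal cosine-similarity helper in NumPy (a sketch; `cosine_sim` is an illustrative name, not a library function):

```python
import numpy as np

def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the two L2-normalized vectors."""
    va = np.asarray(a, dtype=float)
    vb = np.asarray(b, dtype=float)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

If your model already returns L2-normalized vectors, cosine similarity reduces to a plain dot product.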
## Common use cases
| Use case | Role of embeddings |
|---|---|
| Semantic search / RAG retrieval | Embed queries and document chunks; rank by similarity to fetch relevant passages. |
| Reranking with embedding models | Embed the query and each candidate; score pairs by similarity (see Reranking with embedding models). |
| Clustering and deduplication | Group or dedupe items in embedding space without labeling every example by hand. |
| Classification-style scoring | Compare text to prototype descriptions or class names in the same space (patterns vary by model). |
For multimodal settings (image–text and related ideas), see Cross-modal embeddings.
## Embeddings inside a RAG pipeline
A typical offline path is:
- Chunk documents (size, overlap, and structure matter — see Chunking strategies in RAG).
- Embed each chunk; optionally store metadata (source id, section, ACL).
- Index vectors in memory, a library index, or a vector database (tradeoffs in Vector stores for RAG — comparison).
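As a sketch of the chunking step, a naive fixed-size splitter with overlap might look like this (sizes are illustrative; production pipelines usually split on sentence or section boundaries, as the chunking article above discusses):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap between
    neighbors, so a sentence cut at a boundary still appears whole somewhere."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("abcdefghij", size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```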
At query time:
- Embed the user query (one short string or a small batch).
- Retrieve the top‑k similar chunks by vector search (optionally plus keyword / hybrid search).
- Build a prompt from the retrieved plain text chunks and call your chat model.
Important nuance — large language models in chat APIs consume text (and tools), not arbitrary embedding tensors. You use embeddings to choose which text to inject. If you see “query the LLM with precalculated embeddings,” in practice that means retrieve with embeddings, then send the selected text to the LLM.
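The query-time steps above can be sketched with toy vectors standing in for real embeddings (all names here are illustrative):

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Rank document vectors by cosine similarity to the query vector."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    d = np.asarray(doc_vecs, dtype=float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

chunks = [
    "RAG combines retrieval with generation.",
    "Bananas are yellow.",
]
vecs = [[0.9, 0.1], [0.1, 0.9]]      # toy 2-d vectors standing in for embeddings
hits = top_k([1.0, 0.0], vecs, k=1)  # the RAG chunk ranks first
context = "\n\n".join(chunks[i] for i, _ in hits)  # text to inject into the prompt
```

`context` plus the user question is what actually goes to the chat model; the vectors never leave the retrieval layer.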
## Get embeddings with Ollama (Python)
Ollama exposes an HTTP API. For embeddings, call `POST /api/embed` on your Ollama host (default `http://127.0.0.1:11434`). The JSON body includes a model name and `input` (a string or a list of strings). The response includes `embeddings`, a list of vectors aligned with your inputs.
Install `httpx` (or use `requests` the same way).

```python
import httpx

OLLAMA = "http://127.0.0.1:11434"
MODEL = "nomic-embed-text"  # replace with an embedding model you have pulled

def embed_ollama(texts: list[str]) -> list[list[float]]:
    r = httpx.post(
        f"{OLLAMA}/api/embed",
        json={"model": MODEL, "input": texts},
        timeout=120.0,
    )
    r.raise_for_status()
    data = r.json()
    return data["embeddings"]

if __name__ == "__main__":
    q = "What is retrieval-augmented generation?"
    chunks = [
        "RAG combines retrieval with generation.",
        "Embeddings map text into vector space for similarity search.",
    ]
    qv = embed_ollama([q])[0]
    doc_vs = embed_ollama(chunks)
    print(len(qv), len(doc_vs), len(doc_vs[0]))
```
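For larger corpora you generally want to send chunks in several requests rather than one giant payload. A small batching helper (the batch size is an illustrative choice, not an Ollama limit) could look like:

```python
def batched(items: list[str], batch_size: int = 64):
    """Yield successive slices of items so each HTTP call stays a manageable size."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# usage with the embed_ollama() function above:
# vectors = [v for batch in batched(all_chunks) for v in embed_ollama(batch)]

print(list(batched(["a", "b", "c", "d", "e"], batch_size=2)))
```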
### Operational notes

- Pull the model first (`ollama pull …` for your chosen tag). For Qwen3-class models on Ollama, see Qwen3 Embedding & Reranker Models on Ollama.
- Under load, embedding throughput interacts with how Ollama schedules work — see How Ollama handles parallel requests.
## Get embeddings with an OpenAI-compatible server (Python)
Many local servers (including common llama.cpp HTTP setups) expose OpenAI-compatible routes such as `POST /v1/embeddings`. You can use the official `openai` Python package and point `base_url` at your server’s `…/v1` root.
```python
from openai import OpenAI

# Example — replace host, port, and model id with your server’s values
client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="not-needed",  # many local servers ignore this
)

def embed_openai_compatible(text: str, model: str) -> list[float]:
    r = client.embeddings.create(model=model, input=text)
    return r.data[0].embedding

if __name__ == "__main__":
    v = embed_openai_compatible("hello from llama.cpp", "your-embedding-model-id")
    print(len(v))
```
Why keep both patterns on one page? The concepts (chunk, embed, index, query, retrieve text) are identical; only the HTTP surface changes. One workshop-style article avoids duplicating the same narrative under two URLs.
## Persist vectors and query them
At minimum you must store three things per chunk — vector, text, and metadata (source id, offsets, ACL). For a quick prototype you can keep everything in a Python list and use cosine similarity with NumPy or scikit-learn. For growing data, use a vector database or a library index (FAISS, etc.); see Vector stores for RAG — comparison for product-level tradeoffs.
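A toy in-memory store along those lines, keeping vector, text, and metadata side by side (a prototype sketch, not a substitute for a real vector database):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MemoryStore:
    """Minimal in-memory store: one vector, one text, one metadata dict per chunk."""
    vectors: list = field(default_factory=list)
    texts: list = field(default_factory=list)
    metas: list = field(default_factory=list)

    def add(self, vector, text, meta=None):
        self.vectors.append(np.asarray(vector, dtype=float))
        self.texts.append(text)
        self.metas.append(meta or {})

    def search(self, query_vec, k=3):
        """Return the k most cosine-similar (text, metadata) pairs."""
        m = np.stack(self.vectors)
        m = m / np.linalg.norm(m, axis=1, keepdims=True)
        q = np.asarray(query_vec, dtype=float)
        q = q / np.linalg.norm(q)
        order = np.argsort(m @ q)[::-1][:k]
        return [(self.texts[i], self.metas[i]) for i in order]

store = MemoryStore()
store.add([1.0, 0.0], "about cats", {"id": 1})
store.add([0.0, 1.0], "about dogs", {"id": 2})
results = store.search([0.9, 0.1], k=1)
print(results)
```

Swapping this class for FAISS or a vector database changes the `add`/`search` internals, not the shape of the pipeline.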
Conceptual query loop:
- `query_vec = embed(query)`
- `neighbors = index.search(query_vec, k)`
- `context = "\n\n".join(chunk.text for chunk in neighbors)`
- Send `context` and the user question to your chat API.
## Reranking after retrieval
A reranker (often a cross-encoder or a second scoring model) can re-order the top candidates after vector retrieval. This site has Python and Go examples, including Reranking with embedding models and Reranking with Ollama and Qwen3 Embedding in Go.
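The re-ordering step itself is simple; here is a sketch in which `score` stands in for whatever pairwise scorer you use (an embedding similarity or a cross-encoder), with a toy word-overlap scorer for illustration:

```python
def rerank(query: str, candidates: list[str], score) -> list[str]:
    """Re-order retrieved candidates by a second, usually more precise, scorer."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)

def overlap(q: str, c: str) -> int:
    """Toy scorer: count of shared lowercase words. A real reranker model goes here."""
    return len(set(q.lower().split()) & set(c.lower().split()))

print(rerank("red apple", ["blue sky", "red apple pie"], overlap))
```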
## On this site — related articles
| Topic | Article |
|---|---|
| Full RAG architecture | RAG tutorial — architecture, implementation, production |
| Chunking before you embed | Chunking strategies in RAG |
| Vector DB choice | Vector stores for RAG — comparison |
| Qwen3 on Ollama | Qwen3 Embedding & Reranker on Ollama |
| Cross-modal | Cross-modal embeddings |
| Ollama CLI & tips | Ollama cheatsheet |
| Go + Ollama | Using Ollama in Go — SDK comparison |
## Useful links
- Ollama — models and runtime
- Ollama API — embed — upstream API reference for `/api/embed`
- OpenAI Python library — works with any OpenAI-compatible `base_url`