Memory allocation and model scheduling in the new Ollama version - v0.12.1
My own test of Ollama model scheduling
Here I am comparing how much VRAM the new version of Ollama allocates for a model versus the previous version. The new version is worse.
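To reproduce this measurement yourself, you can query Ollama’s /api/ps endpoint, which reports each loaded model’s total size and how much of it is resident in VRAM. Here is a minimal Go sketch, assuming the default server address localhost:11434:

```go
// Query Ollama's /api/ps endpoint and print, for each loaded model,
// its total memory footprint and the portion allocated in GPU VRAM.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type psResponse struct {
	Models []struct {
		Name     string `json:"name"`
		Size     int64  `json:"size"`      // total bytes the model occupies
		SizeVRAM int64  `json:"size_vram"` // bytes allocated in GPU VRAM
	} `json:"models"`
}

func main() {
	resp, err := http.Get("http://localhost:11434/api/ps")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var ps psResponse
	if err := json.NewDecoder(resp.Body).Decode(&ps); err != nil {
		panic(err)
	}
	for _, m := range ps.Models {
		fmt.Printf("%s: total %.1f GiB, VRAM %.1f GiB\n",
			m.Name, float64(m.Size)/(1<<30), float64(m.SizeVRAM)/(1<<30))
	}
}
```

Run it against each Ollama version with the same model loaded and compare the VRAM numbers.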
My view on the current state of Ollama development
Ollama has quickly become one of the most popular tools for running LLMs locally. Its simple CLI and streamlined model management have made it a go-to option for developers who want to work with AI models outside the cloud. But as with many promising platforms, there are already signs of enshittification.
Quick overview of the most prominent UIs for Ollama in 2025
Locally hosted Ollama lets you run large language models on your own machine, but using it via the command line isn’t user-friendly. Here are several open-source projects that provide ChatGPT-style interfaces connecting to a local Ollama instance.
Implementing RAG? Here are some Go code bits - 2...
Since standard Ollama doesn’t have a direct rerank API, you’ll need to implement reranking with Qwen3 Reranker in Go by generating embeddings for query-document pairs and scoring them.
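A minimal Go sketch of the embedding step, calling Ollama’s /api/embed endpoint. The model name qwen3-embedding is an assumption here; substitute whichever embedding model you have pulled locally.

```go
// Fetch embeddings for a batch of texts from Ollama's /api/embed endpoint.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type embedRequest struct {
	Model string   `json:"model"`
	Input []string `json:"input"`
}

type embedResponse struct {
	Embeddings [][]float64 `json:"embeddings"`
}

func embed(texts []string) ([][]float64, error) {
	// "qwen3-embedding" is an assumed model name; use any embedding model you have.
	body, _ := json.Marshal(embedRequest{Model: "qwen3-embedding", Input: texts})
	resp, err := http.Post("http://localhost:11434/api/embed",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out embedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Embeddings, nil
}

func main() {
	vecs, err := embed([]string{"what is RAG?", "Retrieval-augmented generation combines search with LLMs."})
	if err != nil {
		panic(err)
	}
	fmt.Println("got", len(vecs), "embeddings of dimension", len(vecs[0]))
}
```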
qwen3 8b, 14b and 30b, devstral 24b, mistral small 24b
In this test I’m comparing how different LLMs hosted on Ollama translate a Hugo page from English to German. The three pages I tested covered different topics and had nicely structured markdown: headers, lists, tables, links, etc.
Implementing RAG? Here are some code snippets in Golang...
This little Go reranking example calls Ollama to generate embeddings for the query and for each candidate document, then sorts the documents in descending order of cosine similarity.
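A sketch of the scoring half, assuming the query and document embeddings have already been fetched (for instance with the /api/embed call shown above):

```go
// Score documents against a query by cosine similarity and sort descending.
package main

import (
	"fmt"
	"math"
	"sort"
)

func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

type scoredDoc struct {
	Text  string
	Score float64
}

func rerank(queryVec []float64, docs []string, docVecs [][]float64) []scoredDoc {
	scored := make([]scoredDoc, len(docs))
	for i, d := range docs {
		scored[i] = scoredDoc{Text: d, Score: cosine(queryVec, docVecs[i])}
	}
	sort.Slice(scored, func(i, j int) bool { return scored[i].Score > scored[j].Score })
	return scored
}

func main() {
	// Toy vectors just to show the mechanics; real ones come from the embedding model.
	queryVec := []float64{1, 0}
	docs := []string{"doc A", "doc B"}
	docVecs := [][]float64{{0.2, 0.9}, {0.9, 0.1}}
	for _, s := range rerank(queryVec, docs, docVecs) {
		fmt.Printf("%.3f  %s\n", s.Score, s.Text)
	}
}
```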
New awesome LLMs available in Ollama
The Qwen3 Embedding and Reranker models are the latest releases in the Qwen family, specifically designed for advanced text embedding, retrieval, and reranking tasks.
Thinking of installing a second GPU for LLMs?
How do PCIe lanes affect LLM performance? It depends on the task. For training and multi-GPU inference, the performance drop is significant.
LLM to extract text from HTML...
The Ollama models library includes models that can convert HTML content to Markdown, which is useful for content conversion tasks.
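As a hedged sketch of what such a conversion call can look like in Go: reader-lm is one HTML-to-Markdown model in the Ollama library at the time of writing, so treat the model name as an assumption and check the library for current options.

```go
// Send raw HTML to an HTML-to-Markdown model via Ollama's /api/generate.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

type generateResponse struct {
	Response string `json:"response"`
}

func main() {
	html := "<h1>Title</h1><p>Some <b>bold</b> text.</p>"
	// "reader-lm" is an assumed model name; it takes raw HTML as the prompt.
	body, _ := json.Marshal(generateRequest{Model: "reader-lm", Prompt: html, Stream: false})
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out generateResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Response) // expected Markdown, e.g. "# Title" followed by the paragraph
}
```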
Cursor AI vs GitHub Copilot vs Cline AI vs...
Here I list some AI-assisted coding tools and AI coding assistants, and their strong points.
Ollama on Intel CPU: Efficient vs Performance cores
I’ve got a theory to test: would utilising ALL the cores on an Intel CPU raise the speed of LLMs? It’s bugging me that the new Gemma3 27B model (gemma3:27b, 17GB on Ollama) doesn’t fit into the 16GB of VRAM on my GPU and partially runs on the CPU.
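One way to test the theory: pin the CPU thread count per request via the num_thread option of /api/generate and compare tokens per second. A minimal Go sketch; the thread count is a placeholder to vary per CPU (e.g. P-cores only vs all cores):

```go
// Run one generation with a fixed num_thread and report tokens per second,
// computed from the eval_count and eval_duration fields of the response.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	req := map[string]any{
		"model":  "gemma3:27b",
		"prompt": "Why is the sky blue?",
		"stream": false,
		"options": map[string]any{
			"num_thread": 8, // placeholder: try P-core count vs total core count
		},
	}
	body, _ := json.Marshal(req)
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		EvalCount    int64 `json:"eval_count"`    // generated tokens
		EvalDuration int64 `json:"eval_duration"` // nanoseconds spent generating
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("tokens/s: %.2f\n", float64(out.EvalCount)/(float64(out.EvalDuration)/1e9))
}
```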
Configuring Ollama for parallel request execution
When the Ollama server receives two requests at the same time, its behavior depends on its configuration and available system resources.
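You can observe the scheduling behavior by firing two requests concurrently and comparing their timings; server-side parallelism is governed by environment variables such as OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS, which the Ollama server reads at startup. A minimal Go sketch (llama3.1 is just a placeholder model name):

```go
// Send two /api/generate requests at the same time and print how long each takes.
// Overlapping timings indicate the server processed them in parallel.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
	"time"
)

func ask(prompt string, wg *sync.WaitGroup) {
	defer wg.Done()
	body, _ := json.Marshal(map[string]any{
		"model": "llama3.1", "prompt": prompt, "stream": false,
	})
	start := time.Now()
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	resp.Body.Close()
	fmt.Printf("%q answered in %s\n", prompt, time.Since(start))
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go ask("Summarize RAG in one sentence.", &wg)
	go ask("What is cosine similarity?", &wg)
	wg.Wait()
}
```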
Comparing two deepseek-r1 models to two base ones
DeepSeek’s first generation of reasoning models achieves performance comparable to OpenAI-o1 and includes six dense models distilled from DeepSeek-R1, based on Llama and Qwen.
Compiled this Ollama command list some time ago...
Here is a list, with examples, of the most useful Ollama commands (an Ollama commands cheatsheet) that I compiled some time ago. Hopefully it will be useful to you too.
Next round of LLM tests
Mistral Small was released not long ago. Let’s catch up and test how it performs compared to other LLMs.
Python code for RAG reranking