Go clients for Ollama: SDK comparison and Qwen3/GPT-OSS examples
Integrate Ollama with Go: SDK guide, examples, and production best practices.
This guide provides a comprehensive overview of available Go SDKs for Ollama and compares their feature sets.
We’ll explore practical Go examples for calling Qwen3 and GPT-OSS models hosted on Ollama—both via raw REST API calls and the official Go client—including detailed handling of thinking and non-thinking modes in Qwen3.
Why Ollama + Go?
Ollama exposes a small, pragmatic HTTP API (typically running at `http://localhost:11434`) designed for generate and chat workloads, with built-in streaming support and model management capabilities. The official documentation thoroughly covers the `/api/generate` and `/api/chat` request/response structures and streaming semantics.
Go is an excellent choice for building Ollama clients due to its strong standard library support for HTTP, excellent JSON handling, native concurrency primitives, and statically-typed interfaces that catch errors at compile time.
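Before reaching for an SDK, the standard library alone is enough to confirm the server is reachable. A minimal sketch using the `/api/version` endpoint:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// GET /api/version is a cheap way to confirm the Ollama server is up.
	resp, err := http.Get("http://localhost:11434/api/version")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var v struct {
		Version string `json:"version"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		panic(err)
	}
	fmt.Println("ollama version:", v.Version)
}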
As of October 2025, here are the Go SDK options you’ll most likely consider.
Go SDKs for Ollama — what’s available?
| SDK / Package | Status & “owner” | Scope (Generate/Chat/Streaming) | Model mgmt (pull/list/etc.) | Extras / Notes |
|---|---|---|---|---|
| `github.com/ollama/ollama/api` | Official package inside the Ollama repo; used by the ollama CLI itself | Full coverage mapped to REST; streaming supported | Yes | Considered the canonical Go client; API mirrors docs closely. |
| LangChainGo (`github.com/tmc/langchaingo/llms/ollama`) | Community framework (LangChainGo) with Ollama LLM module | Chat/Completion + streaming via framework abstractions | Limited (model mgmt not primary goal) | Great if you want chains, tools, vector stores in Go; less of a raw SDK. |
| `github.com/swdunlop/ollama-client` | Community client | Focus on chat; good tool-calling experiments | Partial | Built for experimenting with tool calling; not a 1:1 full surface. |
| Other community SDKs (e.g., `ollamaclient`, third-party `go-ollama-sdk`) | Community | Varies | Varies | Quality and coverage vary; evaluate per repo. |
Recommendation: For production, prefer `github.com/ollama/ollama/api` — it’s maintained with the core project and mirrors the REST API.
Qwen3 & GPT-OSS on Ollama: thinking vs non-thinking (what to know)
- Thinking mode in Ollama separates the model’s “reasoning” from the final output when enabled. Ollama documents first-class enable/disable behavior for thinking across supported models.
- [Qwen3](https://www.glukhov.org/post/2025/10/qwen3-30b-vs-gpt-oss-20b/ "Qwen3:30b vs GPT-OSS:20b: Technical details, performance and speed comparison") supports dynamic toggling: add `/think` or `/no_think` to system/user messages to switch modes turn by turn; the latest instruction wins.
- GPT-OSS: users report that disabling thinking (e.g., `/set nothink` or `--think=false`) can be unreliable on `gpt-oss:20b`; plan to filter/hide any reasoning your UI shouldn’t surface.
Part 1 — Calling Ollama via raw REST (Go, net/http)
Shared types
First, let’s define the common types and helper functions we’ll use across our examples:
package main
import (
"bytes"
"encoding/json"
"fmt"
"io"
"net/http"
"time"
)
// ---- Chat API Types ----
type ChatMessage struct {
Role string `json:"role"`
Content string `json:"content"`
}
type ChatRequest struct {
Model string `json:"model"`
Messages []ChatMessage `json:"messages"`
// Some servers expose thinking control as a boolean flag.
// Even if omitted, you can still control Qwen3 via /think or /no_think tags.
Think *bool `json:"think,omitempty"`
Stream *bool `json:"stream,omitempty"`
Options map[string]any `json:"options,omitempty"`
}
type ChatResponse struct {
Model string `json:"model"`
CreatedAt string `json:"created_at"`
Message struct {
Role string `json:"role"`
Content string `json:"content"`
Thinking string `json:"thinking,omitempty"` // present when thinking is on
} `json:"message"`
Done bool `json:"done"`
}
// ---- Generate API Types ----
type GenerateRequest struct {
Model string `json:"model"`
Prompt string `json:"prompt"`
Think *bool `json:"think,omitempty"`
Stream *bool `json:"stream,omitempty"`
Options map[string]any `json:"options,omitempty"`
}
type GenerateResponse struct {
Model string `json:"model"`
CreatedAt string `json:"created_at"`
Response string `json:"response"` // final text for non-stream
Thinking string `json:"thinking,omitempty"` // present when thinking is on
Done bool `json:"done"`
}
// ---- Helper Functions ----
func httpPostJSON(url string, payload any) ([]byte, error) {
body, err := json.Marshal(payload)
if err != nil {
return nil, err
}
	c := &http.Client{Timeout: 60 * time.Second}
	resp, err := c.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	// Surface HTTP-level errors (e.g. model not found) instead of handing the error body back as JSON.
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("ollama: %s: %s", resp.Status, data)
	}
	return data, nil
}
// bptr returns a pointer to a boolean value
func bptr(b bool) *bool { return &b }
Chat — Qwen3 with thinking ON (and how to turn it OFF)
func chatQwen3Thinking() error {
endpoint := "http://localhost:11434/api/chat"
req := ChatRequest{
Model: "qwen3:8b-thinking", // any :*-thinking tag you have pulled
Think: bptr(true),
Stream: bptr(false),
Messages: []ChatMessage{
{Role: "system", Content: "You are a precise assistant."},
{Role: "user", Content: "Explain recursion with a short Go example."},
},
}
raw, err := httpPostJSON(endpoint, req)
if err != nil {
return err
}
var out ChatResponse
if err := json.Unmarshal(raw, &out); err != nil {
return err
}
fmt.Println("🧠 thinking:\n", out.Message.Thinking)
fmt.Println("\n💬 answer:\n", out.Message.Content)
return nil
}
// Turn thinking OFF for the next turn by:
// (a) setting Think=false, and/or
// (b) adding "/no_think" to the most-recent system/user message (Qwen3 soft switch).
// Qwen3 honors the latest /think or /no_think instruction in multi-turn chats.
func chatQwen3NoThinking() error {
endpoint := "http://localhost:11434/api/chat"
req := ChatRequest{
Model: "qwen3:8b-thinking",
Think: bptr(false),
Stream: bptr(false),
Messages: []ChatMessage{
{Role: "system", Content: "You are brief. /no_think"},
{Role: "user", Content: "Explain recursion in one sentence."},
},
}
raw, err := httpPostJSON(endpoint, req)
if err != nil {
return err
}
var out ChatResponse
if err := json.Unmarshal(raw, &out); err != nil {
return err
}
// Expect thinking to be empty; still handle defensively.
if out.Message.Thinking != "" {
fmt.Println("🧠 thinking (unexpected):\n", out.Message.Thinking)
}
fmt.Println("\n💬 answer:\n", out.Message.Content)
return nil
}
(Qwen3’s `/think` and `/no_think` soft switch is documented by the Qwen team; the last instruction wins in multi-turn chats.)
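To make the turn-by-turn behavior concrete, here is a sketch of a two-turn exchange where the second user message flips the mode with `/no_think`; it reuses the `ChatRequest` and `httpPostJSON` helpers defined above and simply replays the message history on each call:

func chatQwen3ToggleMidConversation() error {
	endpoint := "http://localhost:11434/api/chat"

	// Turn 1: reasoning allowed (explicit /think soft switch).
	history := []ChatMessage{
		{Role: "user", Content: "Plan a 3-step approach to learn goroutines. /think"},
	}
	raw, err := httpPostJSON(endpoint, ChatRequest{Model: "qwen3:8b-thinking", Stream: bptr(false), Messages: history})
	if err != nil {
		return err
	}
	var first ChatResponse
	if err := json.Unmarshal(raw, &first); err != nil {
		return err
	}
	history = append(history, ChatMessage{Role: "assistant", Content: first.Message.Content})

	// Turn 2: the latest instruction wins, so /no_think disables reasoning for this turn.
	history = append(history, ChatMessage{Role: "user", Content: "Now give me a one-line summary. /no_think"})
	raw, err = httpPostJSON(endpoint, ChatRequest{Model: "qwen3:8b-thinking", Stream: bptr(false), Messages: history})
	if err != nil {
		return err
	}
	var second ChatResponse
	if err := json.Unmarshal(raw, &second); err != nil {
		return err
	}

	fmt.Println("turn 1 thinking present:", first.Message.Thinking != "")
	fmt.Println("turn 2 answer:", second.Message.Content)
	return nil
}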
Chat — GPT-OSS with thinking (and a caveat)
func chatGptOss() error {
endpoint := "http://localhost:11434/api/chat"
req := ChatRequest{
Model: "gpt-oss:20b",
Think: bptr(true), // request separated reasoning if supported
Stream: bptr(false),
Messages: []ChatMessage{
{Role: "user", Content: "What is dynamic programming? Explain the core idea."},
},
}
raw, err := httpPostJSON(endpoint, req)
if err != nil {
return err
}
var out ChatResponse
if err := json.Unmarshal(raw, &out); err != nil {
return err
}
// Known quirk: disabling thinking may not fully suppress reasoning on gpt-oss:20b.
// Always filter/hide thinking in UI if you don't want to surface it.
fmt.Println("🧠 thinking:\n", out.Message.Thinking)
fmt.Println("\n💬 answer:\n", out.Message.Content)
return nil
}
Users report that disabling thinking on `gpt-oss:20b` (e.g., `/set nothink` or `--think=false`) can be ignored—plan for client-side filtering if needed.
Generate — Qwen3 and GPT-OSS
func generateQwen3() error {
endpoint := "http://localhost:11434/api/generate"
req := GenerateRequest{
Model: "qwen3:4b-thinking",
Prompt: "In 2–3 sentences, what are B-Trees used for in databases?",
Think: bptr(true),
}
raw, err := httpPostJSON(endpoint, req)
if err != nil {
return err
}
var out GenerateResponse
if err := json.Unmarshal(raw, &out); err != nil {
return err
}
if out.Thinking != "" {
fmt.Println("🧠 thinking:\n", out.Thinking)
}
fmt.Println("\n💬 answer:\n", out.Response)
return nil
}
func generateGptOss() error {
endpoint := "http://localhost:11434/api/generate"
req := GenerateRequest{
Model: "gpt-oss:20b",
Prompt: "Briefly explain backpropagation in neural networks.",
Think: bptr(true),
}
raw, err := httpPostJSON(endpoint, req)
if err != nil {
return err
}
var out GenerateResponse
if err := json.Unmarshal(raw, &out); err != nil {
return err
}
if out.Thinking != "" {
fmt.Println("🧠 thinking:\n", out.Thinking)
}
fmt.Println("\n💬 answer:\n", out.Response)
return nil
}
REST shapes and streaming behavior come straight from the Ollama API reference.
Part 2 — Calling Ollama via the official Go SDK (`github.com/ollama/ollama/api`)
The official package exposes a `Client` with methods that correspond to the REST API. The Ollama CLI itself uses this package to talk to the service, which makes it the safest bet for compatibility.
Install
go get github.com/ollama/ollama/api
Chat — Qwen3 (thinking ON / OFF)
package main
import (
"context"
"fmt"
"log"
"github.com/ollama/ollama/api"
)
func chatWithQwen3Thinking(ctx context.Context, thinking bool) error {
client, err := api.ClientFromEnvironment() // honors OLLAMA_HOST if set
if err != nil {
return err
}
req := &api.ChatRequest{
Model: "qwen3:8b-thinking",
// Many server builds expose thinking as a top-level flag;
// additionally, you can control Qwen3 via /think or /no_think in messages.
Think: api.Ptr(thinking),
Messages: []api.Message{
{Role: "system", Content: "You are a precise assistant."},
{Role: "user", Content: "Explain merge sort with a short Go snippet."},
},
}
	// The official client delivers results through a callback; chunks stream in
	// by default, so accumulate them into a single response.
	var resp api.ChatResponse
	if err := client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp.Message.Role = r.Message.Role
		resp.Message.Content += r.Message.Content
		resp.Message.Thinking += r.Message.Thinking
		return nil
	}); err != nil {
		return err
	}
if resp.Message.Thinking != "" {
fmt.Println("🧠 thinking:\n", resp.Message.Thinking)
}
fmt.Println("\n💬 answer:\n", resp.Message.Content)
return nil
}
func main() {
ctx := context.Background()
if err := chatWithQwen3Thinking(ctx, true); err != nil {
log.Fatal(err)
}
// Example: non-thinking
if err := chatWithQwen3Thinking(ctx, false); err != nil {
log.Fatal(err)
}
}
Chat — GPT-OSS (handle reasoning defensively)
func chatWithGptOss(ctx context.Context) error {
client, err := api.ClientFromEnvironment()
if err != nil {
return err
}
req := &api.ChatRequest{
Model: "gpt-oss:20b",
Think: api.Ptr(true),
Messages: []api.Message{
{Role: "user", Content: "What is memoization and when is it useful?"},
},
}
	// Accumulate streamed chunks into a single response via the callback.
	var resp api.ChatResponse
	if err := client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp.Message.Role = r.Message.Role
		resp.Message.Content += r.Message.Content
		resp.Message.Thinking += r.Message.Thinking
		return nil
	}); err != nil {
		return err
	}
// If you intend to hide reasoning, do it here regardless of flags.
if resp.Message.Thinking != "" {
fmt.Println("🧠 thinking:\n", resp.Message.Thinking)
}
fmt.Println("\n💬 answer:\n", resp.Message.Content)
return nil
}
Generate — Qwen3 & GPT-OSS
func generateWithQwen3(ctx context.Context) error {
client, err := api.ClientFromEnvironment()
if err != nil {
return err
}
req := &api.GenerateRequest{
Model: "qwen3:4b-thinking",
Prompt: "Summarize the role of a B-Tree in indexing.",
Think: api.Ptr(true),
}
	// Generate also streams through a callback; accumulate chunks into one result.
	var resp api.GenerateResponse
	if err := client.Generate(ctx, req, func(r api.GenerateResponse) error {
		resp.Response += r.Response
		resp.Thinking += r.Thinking
		return nil
	}); err != nil {
		return err
	}
if resp.Thinking != "" {
fmt.Println("🧠 thinking:\n", resp.Thinking)
}
fmt.Println("\n💬 answer:\n", resp.Response)
return nil
}
func generateWithGptOss(ctx context.Context) error {
client, err := api.ClientFromEnvironment()
if err != nil {
return err
}
req := &api.GenerateRequest{
Model: "gpt-oss:20b",
Prompt: "Explain gradient descent in simple terms.",
Think: api.Ptr(true),
}
	var resp api.GenerateResponse
	if err := client.Generate(ctx, req, func(r api.GenerateResponse) error {
		resp.Response += r.Response
		resp.Thinking += r.Thinking
		return nil
	}); err != nil {
		return err
	}
if resp.Thinking != "" {
fmt.Println("🧠 thinking:\n", resp.Thinking)
}
fmt.Println("\n💬 answer:\n", resp.Response)
return nil
}
The official package’s surface mirrors the REST docs and is updated alongside the core project.
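Beyond generate and chat, the same client covers model management. As a quick illustration, listing local models (the rough equivalent of `ollama list`) looks like this; method and field names follow recent versions of the package, so check the version you have pinned:

func listModels(ctx context.Context) error {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return err
	}
	models, err := client.List(ctx)
	if err != nil {
		return err
	}
	for _, m := range models.Models {
		// Name is the tag you pass as Model in chat/generate requests.
		fmt.Printf("%-40s %d bytes\n", m.Name, m.Size)
	}
	return nil
}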
Streaming responses
For real-time streaming, set `Stream: bptr(true)` in your request. The response is delivered as newline-delimited JSON chunks:
func streamChatExample() error {
endpoint := "http://localhost:11434/api/chat"
req := ChatRequest{
Model: "qwen3:8b-thinking",
Think: bptr(true),
Stream: bptr(true), // Enable streaming
Messages: []ChatMessage{
{Role: "user", Content: "Explain quicksort algorithm step by step."},
},
}
body, _ := json.Marshal(req)
resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
if err != nil {
return err
}
defer resp.Body.Close()
decoder := json.NewDecoder(resp.Body)
for {
var chunk ChatResponse
if err := decoder.Decode(&chunk); err == io.EOF {
break
} else if err != nil {
return err
}
// Process thinking and content as they arrive
if chunk.Message.Thinking != "" {
fmt.Print(chunk.Message.Thinking)
}
fmt.Print(chunk.Message.Content)
if chunk.Done {
break
}
}
return nil
}
With the official SDK, the same callback pattern handles streaming: each chunk arrives as a separate callback invocation:
func streamWithOfficialSDK(ctx context.Context) error {
client, _ := api.ClientFromEnvironment()
req := &api.ChatRequest{
Model: "qwen3:8b-thinking",
Think: api.Ptr(true),
Messages: []api.Message{
{Role: "user", Content: "Explain binary search trees."},
},
}
err := client.Chat(ctx, req, func(resp api.ChatResponse) error {
if resp.Message.Thinking != "" {
fmt.Print(resp.Message.Thinking)
}
fmt.Print(resp.Message.Content)
return nil
})
return err
}
Working with Qwen3 thinking vs non-thinking (practical guidance)
- Two levers:
  - A boolean `think` flag supported by Ollama’s thinking feature; and
  - Qwen3’s soft-switch commands `/think` and `/no_think` in the latest system/user message. The most recent instruction governs the next turn(s).
- Default posture: non-thinking for quick replies; escalate to thinking for tasks that need step-by-step reasoning (math, planning, debugging, complex code analysis).
- Streaming UIs: when thinking is enabled, you may see interleaved reasoning/content in streamed frames—buffer or render them separately and give users a “show reasoning” toggle. (See the API docs for the streaming format; a minimal buffering sketch follows this list.)
- Multi-turn conversations: Qwen3 remembers the thinking mode from previous turns. If you want to toggle it mid-conversation, use both the flag AND the soft-switch command for reliability.
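A minimal sketch of that buffering approach with the official client’s callback (it assumes the streamed message exposes a `Thinking` field, as in the earlier examples; add `strings` to your imports):

// streamBuffered collects reasoning and answer text into separate buffers so a
// UI can show the answer immediately and reveal reasoning behind a toggle.
func streamBuffered(ctx context.Context, client *api.Client, req *api.ChatRequest) (thinking, answer string, err error) {
	var think, content strings.Builder
	err = client.Chat(ctx, req, func(r api.ChatResponse) error {
		think.WriteString(r.Message.Thinking)
		content.WriteString(r.Message.Content)
		return nil
	})
	return think.String(), content.String(), err
}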
Notes for GPT-OSS
- Treat reasoning as present even if you tried to disable it; filter on the client if your UX shouldn’t show it.
- For production applications using GPT-OSS, implement client-side filtering logic that can detect and strip reasoning patterns if needed (a sketch follows this list).
- Test your specific GPT-OSS model variant thoroughly, as behavior may vary between different quantizations and versions.
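As a starting point, here is a hedged sketch of such a filter. It assumes that reasoning, when it leaks into the content instead of the separate thinking field, is wrapped in `<think>...</think>` tags (a convention several reasoning models use); adjust the pattern to whatever your model variant actually emits.

import (
	"regexp"
	"strings"
)

// thinkBlock matches inline reasoning wrapped in <think>...</think> tags.
// The tag format is an assumption; verify it against your model's raw output.
var thinkBlock = regexp.MustCompile(`(?s)<think>.*?</think>`)

// stripReasoning removes inline reasoning from a message before it reaches the UI.
func stripReasoning(content string) string {
	return strings.TrimSpace(thinkBlock.ReplaceAllString(content, ""))
}

Run it over `Message.Content` right before rendering, regardless of what the think flag was set to.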
Best practices and production tips
Error handling and timeouts
Always implement proper timeout handling and error recovery:
func robustChatRequest(ctx context.Context, model string, messages []api.Message) (*api.ChatResponse, error) {
// Set a reasonable timeout
ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
defer cancel()
client, err := api.ClientFromEnvironment()
if err != nil {
return nil, fmt.Errorf("creating client: %w", err)
}
req := &api.ChatRequest{
Model: model,
Messages: messages,
Options: map[string]interface{}{
"temperature": 0.7,
"num_ctx": 4096, // context window size
},
}
	// Accumulate streamed chunks into a single response via the callback.
	var resp api.ChatResponse
	if err := client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp.Message.Role = r.Message.Role
		resp.Message.Content += r.Message.Content
		resp.Message.Thinking += r.Message.Thinking
		return nil
	}); err != nil {
		return nil, fmt.Errorf("chat request failed: %w", err)
	}
return &resp, nil
}
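If the server is briefly unavailable or still loading a model, a small retry loop with backoff smooths things over. A minimal sketch building on `robustChatRequest` above (the attempt count and delays are arbitrary choices, not Ollama requirements):

// chatWithRetry retries transient failures with a simple exponential backoff.
func chatWithRetry(ctx context.Context, model string, messages []api.Message) (*api.ChatResponse, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		resp, err := robustChatRequest(ctx, model, messages)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Duration(1<<attempt) * time.Second): // 1s, 2s, 4s
		}
	}
	return nil, fmt.Errorf("after retries: %w", lastErr)
}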
Connection pooling and reuse
Reuse the Ollama client across requests instead of creating a new one each time:
type OllamaService struct {
client *api.Client
}
func NewOllamaService() (*OllamaService, error) {
client, err := api.ClientFromEnvironment()
if err != nil {
return nil, err
}
return &OllamaService{client: client}, nil
}
func (s *OllamaService) Chat(ctx context.Context, req *api.ChatRequest) (*api.ChatResponse, error) {
	var resp api.ChatResponse
	if err := s.client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp.Message.Role = r.Message.Role
		resp.Message.Content += r.Message.Content
		resp.Message.Thinking += r.Message.Thinking
		return nil
	}); err != nil {
		return nil, err
	}
return &resp, nil
}
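Because the client only wraps `net/http`, which is safe for concurrent use, a single service instance can fan requests out across goroutines. A small sketch (add `sync` to your imports; the model tag is just the one used throughout this post, and real throughput is bounded by `OLLAMA_NUM_PARALLEL`):

// ChatMany runs one prompt per goroutine against the shared client.
func (s *OllamaService) ChatMany(ctx context.Context, prompts []string) []string {
	results := make([]string, len(prompts))
	var wg sync.WaitGroup
	for i, p := range prompts {
		wg.Add(1)
		go func(i int, p string) {
			defer wg.Done()
			resp, err := s.Chat(ctx, &api.ChatRequest{
				Model:    "qwen3:8b-thinking",
				Messages: []api.Message{{Role: "user", Content: p}},
			})
			if err != nil {
				results[i] = "error: " + err.Error()
				return
			}
			results[i] = resp.Message.Content
		}(i, p)
	}
	wg.Wait()
	return results
}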
Environment configuration
Use environment variables for flexible deployment:
export OLLAMA_HOST=http://localhost:11434
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_MAX_LOADED_MODELS=2
The official SDK automatically respects `OLLAMA_HOST` via `api.ClientFromEnvironment()`.
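The raw REST examples hard-code the endpoint. If you want them to honor the same variable, a tiny helper works; the function name and fallback below are ours, not part of Ollama, and unlike `api.ClientFromEnvironment()` it does not normalize scheme-less values such as `0.0.0.0:11434`:

import "os"

// ollamaBaseURL returns OLLAMA_HOST if set, otherwise the default local endpoint.
func ollamaBaseURL() string {
	if host := os.Getenv("OLLAMA_HOST"); host != "" {
		return host
	}
	return "http://localhost:11434"
}

Then build endpoints as `ollamaBaseURL() + "/api/chat"`.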
Monitoring and logging
Implement structured logging for production systems:
func loggedChat(ctx context.Context, logger *log.Logger, req *api.ChatRequest) error {
start := time.Now()
client, _ := api.ClientFromEnvironment()
	var resp api.ChatResponse
	err := client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp.Message.Content += r.Message.Content
		return nil
	})
	duration := time.Since(start)
	logger.Printf("model=%s duration=%v error=%v chars=%d",
		req.Model, duration, err, len(resp.Message.Content))
return err
}
Conclusion
- For Go projects, `github.com/ollama/ollama/api` is the most complete, production-ready choice. It’s maintained alongside the Ollama core project, used by the official CLI, and provides comprehensive API coverage that tracks the server closely.
- For higher-level abstractions, consider LangChainGo when you need chains, tools, vector stores, and RAG pipelines—though you’ll trade some low-level control for convenience.
- Qwen3 gives you clean, reliable control over thinking mode with both flags and message-level toggles (`/think`, `/no_think`), making it ideal for applications that need both fast responses and deep reasoning.
- For GPT-OSS, always plan to sanitize reasoning output client-side when necessary, as the thinking disable flag may not be consistently honored.
- In production, implement proper error handling, connection pooling, timeouts, and monitoring to build robust Ollama-powered applications.
The combination of Go’s strong typing, excellent concurrency support, and Ollama’s straightforward API makes it an ideal stack for building AI-powered applications—from simple chatbots to complex RAG systems.
Key takeaways
Here’s a quick reference for choosing your approach:
| Use Case | Recommended Approach | Why |
|---|---|---|
| Production application | `github.com/ollama/ollama/api` | Official support, complete API coverage, maintained with core project |
| RAG/chains/tools pipeline | LangChainGo | High-level abstractions, integrations with vector stores |
| Learning/experimentation | Raw REST with net/http | Full transparency, no dependencies, educational |
| Quick prototyping | Official SDK | Balance of simplicity and power |
| Streaming chat UI | Official SDK with callbacks | Built-in streaming support, clean API |
Model selection guidance:
- Qwen3: Best for applications requiring controllable thinking mode, reliable multi-turn conversations
- GPT-OSS: Strong performance but requires defensive handling of thinking/reasoning output
- Other models: Test thoroughly; thinking behavior varies by model family
References & further reading
Official documentation
- Ollama API reference — Complete REST API documentation
- Official Ollama Go package — Go SDK documentation
- Ollama GitHub repository — Source code and issues
Go SDK alternatives
- LangChainGo Ollama integration — For chain-based applications
- swdunlop/ollama-client — Community client with tool calling
- xyproto/ollamaclient — Another community option
Model-specific resources
- Qwen documentation — Official Qwen model information
- GPT-OSS information — GPT-OSS model details
Related topics
- Building RAG applications with Go — LangChainGo examples
- Go context package — Essential for timeouts and cancellation
- Go HTTP client best practices — Standard library documentation
Other Useful Links
- Install and configure Ollama
- Ollama cheatsheet
- Go Cheatsheet
- How Ollama Handles Parallel Requests
- Reranking text documents with Ollama and Qwen3 Embedding model - in Go
- Comparing Go ORMs for PostgreSQL: GORM vs Ent vs Bun vs sqlc
- Constraining LLMs with Structured Output: Ollama, Qwen3 & Python or Go
- Using Ollama in Python Code
- LLMs Comparison: Qwen3:30b vs GPT-OSS:20b
- Ollama GPT-OSS Structured Output Issues