Go clients for Ollama: SDK comparison and Qwen3/GPT-OSS examples

Integrate Ollama with Go: SDK guide, examples, and production best practices.


This guide provides a comprehensive overview of available Go SDKs for Ollama and compares their feature sets.

We’ll explore practical Go examples for calling Qwen3 and GPT-OSS models hosted on Ollama—both via raw REST API calls and the official Go client—including detailed handling of thinking and non-thinking modes in Qwen3.


Why Ollama + Go?

Ollama exposes a small, pragmatic HTTP API (typically running at http://localhost:11434) designed for generate and chat workloads, with built-in streaming support and model management capabilities. The official documentation thoroughly covers /api/generate and /api/chat request/response structures and streaming semantics.

Go is an excellent choice for building Ollama clients due to its strong standard library support for HTTP, excellent JSON handling, native concurrency primitives, and statically-typed interfaces that catch errors at compile time.
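Before picking an SDK, it helps to confirm the server is reachable from Go. The sketch below uses only the standard library and the GET /api/tags endpoint, which lists locally pulled models; the endpoint comes from the Ollama API docs, but treat the struct here as a minimal assumption rather than the full response schema:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// tagsResponse captures only the fields of GET /api/tags that we print below.
type tagsResponse struct {
	Models []struct {
		Name string `json:"name"`
	} `json:"models"`
}

func main() {
	c := &http.Client{Timeout: 5 * time.Second}
	resp, err := c.Get("http://localhost:11434/api/tags")
	if err != nil {
		fmt.Println("Ollama is not reachable:", err)
		return
	}
	defer resp.Body.Close()

	var tags tagsResponse
	if err := json.NewDecoder(resp.Body).Decode(&tags); err != nil {
		fmt.Println("unexpected response:", err)
		return
	}
	for _, m := range tags.Models {
		fmt.Println("available model:", m.Name)
	}
}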

As of October 2025, here are the Go SDK options you’ll most likely consider.


Go SDKs for Ollama — what’s available?

SDK / Package | Status & "owner" | Scope (Generate/Chat/Streaming) | Model mgmt (pull/list/etc.) | Extras / Notes
--- | --- | --- | --- | ---
github.com/ollama/ollama/api | Official package inside the Ollama repo; used by the ollama CLI itself | Full coverage mapped to REST; streaming supported | Yes | Considered the canonical Go client; the API mirrors the docs closely.
LangChainGo (github.com/tmc/langchaingo/llms/ollama) | Community framework (LangChainGo) with an Ollama LLM module | Chat/Completion + streaming via framework abstractions | Limited (model mgmt is not a primary goal) | Great if you want chains, tools, and vector stores in Go; less of a raw SDK.
github.com/swdunlop/ollama-client | Community client | Focus on chat; good tool-calling experiments | Partial | Built for experimenting with tool calling; not a 1:1 full API surface.
Other community SDKs (e.g., ollamaclient, third-party "go-ollama-sdk") | Community | Varies | Varies | Quality and coverage vary; evaluate per repo.

Recommendation: For production, prefer github.com/ollama/ollama/api—it’s maintained with the core project and mirrors the REST API.


Qwen3 & GPT-OSS on Ollama: thinking vs non-thinking (what to know)

  • Thinking mode in Ollama separates the model’s “reasoning” from final output when enabled. Ollama documents a first-class enable/disable thinking behavior across supported models.
  • Qwen3 supports dynamic toggling: add /think or /no_think to system/user messages to switch modes turn by turn; the latest instruction wins (see the Qwen3:30b vs GPT-OSS:20b comparison at https://www.glukhov.org/post/2025/10/qwen3-30b-vs-gpt-oss-20b/).
  • GPT-OSS: users report that disabling thinking (e.g., /set nothink or --think=false) can be unreliable on gpt-oss:20b; plan to filter/hide any reasoning your UI shouldn’t surface.

Part 1 — Calling Ollama via raw REST (Go, net/http)

Shared types

First, let’s define the common types and helper functions we’ll use across our examples:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// ---- Chat API Types ----

type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRequest struct {
	Model    string        `json:"model"`
	Messages []ChatMessage `json:"messages"`
	// Some servers expose thinking control as a boolean flag.
	// Even if omitted, you can still control Qwen3 via /think or /no_think tags.
	Think   *bool          `json:"think,omitempty"`
	Stream  *bool          `json:"stream,omitempty"`
	Options map[string]any `json:"options,omitempty"`
}

type ChatResponse struct {
	Model     string `json:"model"`
	CreatedAt string `json:"created_at"`
	Message   struct {
		Role     string `json:"role"`
		Content  string `json:"content"`
		Thinking string `json:"thinking,omitempty"` // present when thinking is on
	} `json:"message"`
	Done bool `json:"done"`
}

// ---- Generate API Types ----

type GenerateRequest struct {
	Model   string         `json:"model"`
	Prompt  string         `json:"prompt"`
	Think   *bool          `json:"think,omitempty"`
	Stream  *bool          `json:"stream,omitempty"`
	Options map[string]any `json:"options,omitempty"`
}

type GenerateResponse struct {
	Model     string `json:"model"`
	CreatedAt string `json:"created_at"`
	Response  string `json:"response"`           // final text for non-stream
	Thinking  string `json:"thinking,omitempty"` // present when thinking is on
	Done      bool   `json:"done"`
}

// ---- Helper Functions ----

func httpPostJSON(url string, payload any) ([]byte, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	c := &http.Client{Timeout: 60 * time.Second}
	resp, err := c.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	// Surface HTTP-level failures instead of trying to parse an error body as JSON.
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("ollama returned %s: %s", resp.Status, data)
	}
	return data, nil
}

// bptr returns a pointer to a boolean value
func bptr(b bool) *bool { return &b }

Chat — Qwen3 with thinking ON (and how to turn it OFF)

func chatQwen3Thinking() error {
	endpoint := "http://localhost:11434/api/chat"

	req := ChatRequest{
		Model:   "qwen3:8b-thinking", // any :*-thinking tag you have pulled
		Think:   bptr(true),
		Stream:  bptr(false),
		Messages: []ChatMessage{
			{Role: "system", Content: "You are a precise assistant."},
			{Role: "user",   Content: "Explain recursion with a short Go example."},
		},
	}

	raw, err := httpPostJSON(endpoint, req)
	if err != nil {
		return err
	}
	var out ChatResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return err
	}
	fmt.Println("🧠 thinking:\n", out.Message.Thinking)
	fmt.Println("\n💬 answer:\n", out.Message.Content)
	return nil
}

// Turn thinking OFF for the next turn by:
// (a) setting Think=false, and/or
// (b) adding "/no_think" to the most-recent system/user message (Qwen3 soft switch).
// Qwen3 honors the latest /think or /no_think instruction in multi-turn chats.
func chatQwen3NoThinking() error {
	endpoint := "http://localhost:11434/api/chat"

	req := ChatRequest{
		Model:  "qwen3:8b-thinking",
		Think:  bptr(false),
		Stream: bptr(false),
		Messages: []ChatMessage{
			{Role: "system", Content: "You are brief. /no_think"},
			{Role: "user",   Content: "Explain recursion in one sentence."},
		},
	}

	raw, err := httpPostJSON(endpoint, req)
	if err != nil {
		return err
	}
	var out ChatResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return err
	}
	// Expect thinking to be empty; still handle defensively.
	if out.Message.Thinking != "" {
		fmt.Println("🧠 thinking (unexpected):\n", out.Message.Thinking)
	}
	fmt.Println("\n💬 answer:\n", out.Message.Content)
	return nil
}

(Qwen3’s /think and /no_think soft switch is documented by the Qwen team; the last instruction wins in multi-turn chats.)

Chat — GPT-OSS with thinking (and a caveat)

func chatGptOss() error {
	endpoint := "http://localhost:11434/api/chat"

	req := ChatRequest{
		Model:  "gpt-oss:20b",
		Think:  bptr(true),   // request separated reasoning if supported
		Stream: bptr(false),
		Messages: []ChatMessage{
			{Role: "user", Content: "What is dynamic programming? Explain the core idea."},
		},
	}

	raw, err := httpPostJSON(endpoint, req)
	if err != nil {
		return err
	}
	var out ChatResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return err
	}
	// Known quirk: disabling thinking may not fully suppress reasoning on gpt-oss:20b.
	// Always filter/hide thinking in UI if you don't want to surface it.
	fmt.Println("🧠 thinking:\n", out.Message.Thinking)
	fmt.Println("\n💬 answer:\n", out.Message.Content)
	return nil
}

Users report that disabling thinking on gpt-oss:20b (e.g., /set nothink or --think=false) can be ignored—plan for client-side filtering if needed.

Generate — Qwen3 and GPT-OSS

func generateQwen3() error {
	endpoint := "http://localhost:11434/api/generate"
	req := GenerateRequest{
		Model:  "qwen3:4b-thinking",
		Prompt: "In 2–3 sentences, what are B-Trees used for in databases?",
		Think:  bptr(true),
		Stream: bptr(false), // /api/generate streams by default; ask for one JSON object
	}
	raw, err := httpPostJSON(endpoint, req)
	if err != nil {
		return err
	}
	var out GenerateResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return err
	}
	if out.Thinking != "" {
		fmt.Println("🧠 thinking:\n", out.Thinking)
	}
	fmt.Println("\n💬 answer:\n", out.Response)
	return nil
}

func generateGptOss() error {
	endpoint := "http://localhost:11434/api/generate"
	req := GenerateRequest{
		Model:  "gpt-oss:20b",
		Prompt: "Briefly explain backpropagation in neural networks.",
		Think:  bptr(true),
		Stream: bptr(false), // request a single, complete JSON response
	}
	raw, err := httpPostJSON(endpoint, req)
	if err != nil {
		return err
	}
	var out GenerateResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return err
	}
	if out.Thinking != "" {
		fmt.Println("🧠 thinking:\n", out.Thinking)
	}
	fmt.Println("\n💬 answer:\n", out.Response)
	return nil
}

REST shapes and streaming behavior come straight from the Ollama API reference.


Part 2 — Calling Ollama via the official Go SDK (github.com/ollama/ollama/api)

The official package exposes a Client with methods that correspond to the REST API. The Ollama CLI itself uses this package to talk to the service, which makes it the safest bet for compatibility.

Install

go get github.com/ollama/ollama/api

Chat — Qwen3 (thinking ON / OFF)

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ollama/ollama/api"
)

func chatWithQwen3Thinking(ctx context.Context, thinking bool) error {
	client, err := api.ClientFromEnvironment() // honors OLLAMA_HOST if set
	if err != nil {
		return err
	}

	req := &api.ChatRequest{
		Model: "qwen3:8b-thinking",
		// Many server builds expose thinking as a top-level flag;
		// additionally, you can control Qwen3 via /think or /no_think in messages.
		Think: api.Ptr(thinking),
		Messages: []api.Message{
			{Role: "system", Content: "You are a precise assistant."},
			{Role: "user",   Content: "Explain merge sort with a short Go snippet."},
		},
	}

	stream := false
	req.Stream = &stream // ask for one complete response

	// The official client delivers responses through a callback; with
	// Stream=false the callback is invoked exactly once with the full reply.
	var resp api.ChatResponse
	if err := client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp = r
		return nil
	}); err != nil {
		return err
	}

	if resp.Message.Thinking != "" {
		fmt.Println("🧠 thinking:\n", resp.Message.Thinking)
	}
	fmt.Println("\n💬 answer:\n", resp.Message.Content)
	return nil
}

func main() {
	ctx := context.Background()
	if err := chatWithQwen3Thinking(ctx, true); err != nil {
		log.Fatal(err)
	}
	// Example: non-thinking
	if err := chatWithQwen3Thinking(ctx, false); err != nil {
		log.Fatal(err)
	}
}

Chat — GPT-OSS (handle reasoning defensively)

func chatWithGptOss(ctx context.Context) error {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return err
	}
	req := &api.ChatRequest{
		Model: "gpt-oss:20b",
		Think: api.Ptr(true),
		Messages: []api.Message{
			{Role: "user", Content: "What is memoization and when is it useful?"},
		},
	}
	stream := false
	req.Stream = &stream

	// Chat takes a callback; with Stream=false it fires once with the full reply.
	var resp api.ChatResponse
	if err := client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp = r
		return nil
	}); err != nil {
		return err
	}
	// If you intend to hide reasoning, do it here regardless of flags.
	if resp.Message.Thinking != "" {
		fmt.Println("🧠 thinking:\n", resp.Message.Thinking)
	}
	fmt.Println("\n💬 answer:\n", resp.Message.Content)
	return nil
}

Generate — Qwen3 & GPT-OSS

func generateWithQwen3(ctx context.Context) error {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return err
	}
	req := &api.GenerateRequest{
		Model:  "qwen3:4b-thinking",
		Prompt: "Summarize the role of a B-Tree in indexing.",
		Think:  api.Ptr(true),
	}
	stream := false
	req.Stream = &stream

	// Generate also uses a callback; with Stream=false it is called once.
	var resp api.GenerateResponse
	if err := client.Generate(ctx, req, func(r api.GenerateResponse) error {
		resp = r
		return nil
	}); err != nil {
		return err
	}
	if resp.Thinking != "" {
		fmt.Println("🧠 thinking:\n", resp.Thinking)
	}
	fmt.Println("\n💬 answer:\n", resp.Response)
	return nil
}

func generateWithGptOss(ctx context.Context) error {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return err
	}
	req := &api.GenerateRequest{
		Model:  "gpt-oss:20b",
		Prompt: "Explain gradient descent in simple terms.",
		Think:  api.Ptr(true),
	}
	stream := false
	req.Stream = &stream

	var resp api.GenerateResponse
	if err := client.Generate(ctx, req, func(r api.GenerateResponse) error {
		resp = r
		return nil
	}); err != nil {
		return err
	}
	if resp.Thinking != "" {
		fmt.Println("🧠 thinking:\n", resp.Thinking)
	}
	fmt.Println("\n💬 answer:\n", resp.Response)
	return nil
}

The official package’s surface mirrors the REST docs and is updated alongside the core project.


Streaming responses

For real-time streaming, set Stream: bptr(true) in your request. The response will be delivered as newline-delimited JSON chunks:

func streamChatExample() error {
	endpoint := "http://localhost:11434/api/chat"
	req := ChatRequest{
		Model:  "qwen3:8b-thinking",
		Think:  bptr(true),
		Stream: bptr(true), // Enable streaming
		Messages: []ChatMessage{
			{Role: "user", Content: "Explain quicksort algorithm step by step."},
		},
	}

	body, _ := json.Marshal(req)
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	decoder := json.NewDecoder(resp.Body)
	for {
		var chunk ChatResponse
		if err := decoder.Decode(&chunk); err == io.EOF {
			break
		} else if err != nil {
			return err
		}
		
		// Process thinking and content as they arrive
		if chunk.Message.Thinking != "" {
			fmt.Print(chunk.Message.Thinking)
		}
		fmt.Print(chunk.Message.Content)
		
		if chunk.Done {
			break
		}
	}
	return nil
}

With the official SDK, use a callback function to handle streaming chunks:

func streamWithOfficialSDK(ctx context.Context) error {
	client, _ := api.ClientFromEnvironment()
	
	req := &api.ChatRequest{
		Model: "qwen3:8b-thinking",
		Think: api.Ptr(true),
		Messages: []api.Message{
			{Role: "user", Content: "Explain binary search trees."},
		},
	}
	
	err := client.Chat(ctx, req, func(resp api.ChatResponse) error {
		if resp.Message.Thinking != "" {
			fmt.Print(resp.Message.Thinking)
		}
		fmt.Print(resp.Message.Content)
		return nil
	})
	
	return err
}
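If your UI should keep reasoning separate from the final answer (for example behind a "show reasoning" toggle), buffer the two streams independently instead of printing them interleaved. This is a minimal sketch that reuses the request shape from the examples above (including api.Ptr and the Think field) and assumes the strings package is imported:

func streamBuffered(ctx context.Context) (thinking, answer string, err error) {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return "", "", err
	}

	req := &api.ChatRequest{
		Model: "qwen3:8b-thinking",
		Think: api.Ptr(true),
		Messages: []api.Message{
			{Role: "user", Content: "Explain binary search trees."},
		},
	}

	var thinkBuf, answerBuf strings.Builder
	err = client.Chat(ctx, req, func(resp api.ChatResponse) error {
		// Each streamed chunk may carry reasoning, answer text, or both.
		thinkBuf.WriteString(resp.Message.Thinking)
		answerBuf.WriteString(resp.Message.Content)
		return nil
	})
	return thinkBuf.String(), answerBuf.String(), err
}

The caller decides what to render: show the answer immediately and reveal the buffered thinking only on demand.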

Working with Qwen3 thinking vs non-thinking (practical guidance)

  • Two levers:

    1. A boolean thinking flag supported by Ollama’s thinking feature; and
    2. Qwen3’s soft switch commands /think and /no_think in the latest system/user message. The most recent instruction governs the next turn(s).
  • Default posture: non-thinking for quick replies; escalate to thinking for tasks that need step-by-step reasoning (math, planning, debugging, complex code analysis).

  • Streaming UIs: when thinking is enabled, you may see interleaved reasoning/content in streamed frames—buffer or render them separately and give users a “show reasoning” toggle. (See API docs for streaming format.)

  • Multi-turn conversations: Qwen3 remembers the thinking mode from previous turns. If you want to toggle it mid-conversation, use both the flag AND the soft-switch command for reliability, as in the helper sketched below.
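A small helper can apply both levers at once. This is a sketch under the same assumptions as the examples above (the boolean Think flag plus api.Ptr, and Qwen3's /think and /no_think tags); it appends the tag to the latest user message, since the most recent instruction wins:

// withThinking applies both thinking levers to a chat request:
// the request-level flag and Qwen3's soft-switch tag on the latest user message.
func withThinking(req *api.ChatRequest, enabled bool) {
	req.Think = api.Ptr(enabled)

	tag := "/no_think"
	if enabled {
		tag = "/think"
	}
	// Append the soft switch to the most recent user message, if there is one.
	for i := len(req.Messages) - 1; i >= 0; i-- {
		if req.Messages[i].Role == "user" {
			req.Messages[i].Content += " " + tag
			break
		}
	}
}

Call withThinking(req, false) before a turn that should answer quickly, and withThinking(req, true) before a turn that needs step-by-step reasoning.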

Notes for GPT-OSS

  • Treat reasoning as present even if you tried to disable it; filter on the client if your UX shouldn’t show it.
  • For production applications using GPT-OSS, implement client-side filtering logic that can detect and strip reasoning patterns if needed; a sketch follows this list.
  • Test your specific GPT-OSS model variant thoroughly, as behavior may vary between different quantizations and versions.
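A defensive filter can ignore the structured Thinking field entirely and also strip any reasoning that leaks into the content itself. The sketch below needs the regexp and strings packages, and the <think>…</think> tag format is an assumption: inspect what your GPT-OSS variant actually emits and adjust the pattern accordingly.

// thinkBlock matches reasoning wrapped in <think>...</think> tags.
// NOTE: the tag format is an assumption; adapt it to what your model emits.
var thinkBlock = regexp.MustCompile(`(?s)<think>.*?</think>`)

// visibleAnswer returns only the text that should be shown to end users.
// The structured Thinking field is ignored on purpose.
func visibleAnswer(msg api.Message) string {
	cleaned := thinkBlock.ReplaceAllString(msg.Content, "")
	return strings.TrimSpace(cleaned)
}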

Best practices and production tips

Error handling and timeouts

Always implement proper timeout handling and error recovery:

func robustChatRequest(ctx context.Context, model string, messages []api.Message) (*api.ChatResponse, error) {
	// Set a reasonable timeout
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()
	
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return nil, fmt.Errorf("creating client: %w", err)
	}
	
	req := &api.ChatRequest{
		Model:    model,
		Messages: messages,
		Options: map[string]interface{}{
			"temperature": 0.7,
			"num_ctx":     4096, // context window size
		},
	}
	
	stream := false
	req.Stream = &stream

	var resp api.ChatResponse
	if err := client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp = r
		return nil
	}); err != nil {
		return nil, fmt.Errorf("chat request failed: %w", err)
	}
	
	return &resp, nil
}

Connection pooling and reuse

Reuse the Ollama client across requests instead of creating a new one each time:

type OllamaService struct {
	client *api.Client
}

func NewOllamaService() (*OllamaService, error) {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return nil, err
	}
	return &OllamaService{client: client}, nil
}

func (s *OllamaService) Chat(ctx context.Context, req *api.ChatRequest) (*api.ChatResponse, error) {
	stream := false
	req.Stream = &stream // return one complete response

	var resp api.ChatResponse
	if err := s.client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp = r
		return nil
	}); err != nil {
		return nil, err
	}
	return &resp, nil
}

Environment configuration

Use environment variables for flexible deployment:

export OLLAMA_HOST=http://localhost:11434
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_MAX_LOADED_MODELS=2

The official SDK automatically respects OLLAMA_HOST via api.ClientFromEnvironment().
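If you need to point at a specific host programmatically rather than via the environment, the package also exposes a NewClient constructor that takes a base URL and an *http.Client; the exact signature may differ between releases, so check the version you have pinned. A sketch (it needs net/url, net/http, and time):

func newClientForHost(raw string) (*api.Client, error) {
	base, err := url.Parse(raw) // e.g. "http://ollama.internal:11434"
	if err != nil {
		return nil, err
	}
	// A generous timeout: large models can take a while to respond.
	return api.NewClient(base, &http.Client{Timeout: 2 * time.Minute}), nil
}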

Monitoring and logging

Implement structured logging for production systems:

func loggedChat(ctx context.Context, logger *log.Logger, req *api.ChatRequest) error {
	start := time.Now()
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return err
	}

	var resp api.ChatResponse
	err = client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp = r
		return nil
	})

	duration := time.Since(start)
	logger.Printf("model=%s duration=%v error=%v content_chars=%d",
		req.Model, duration, err, len(resp.Message.Content))

	return err
}

Conclusion

  • For Go projects, github.com/ollama/ollama/api is the most complete, production-ready choice. It’s maintained alongside the Ollama core project, used by the official CLI, and provides comprehensive API coverage with guaranteed compatibility.

  • For higher-level abstractions, consider LangChainGo when you need chains, tools, vector stores, and RAG pipelines—though you’ll trade some low-level control for convenience.

  • Qwen3 gives you clean, reliable control over thinking mode with both flags and message-level toggles (/think, /no_think), making it ideal for applications that need both fast responses and deep reasoning.

  • For GPT-OSS, always plan to sanitize reasoning output client-side when necessary, as the thinking disable flag may not be consistently honored.

  • In production, implement proper error handling, connection pooling, timeouts, and monitoring to build robust Ollama-powered applications.

The combination of Go’s strong typing, excellent concurrency support, and Ollama’s straightforward API makes it an ideal stack for building AI-powered applications—from simple chatbots to complex RAG systems.

Key takeaways

Here’s a quick reference for choosing your approach:

Use Case | Recommended Approach | Why
--- | --- | ---
Production application | github.com/ollama/ollama/api | Official support, complete API coverage, maintained with the core project
RAG/chains/tools pipeline | LangChainGo | High-level abstractions, integrations with vector stores
Learning/experimentation | Raw REST with net/http | Full transparency, no dependencies, educational
Quick prototyping | Official SDK | Balance of simplicity and power
Streaming chat UI | Official SDK with callbacks | Built-in streaming support, clean API

Model selection guidance:

  • Qwen3: Best for applications requiring controllable thinking mode, reliable multi-turn conversations
  • GPT-OSS: Strong performance but requires defensive handling of thinking/reasoning output
  • Other models: Test thoroughly; thinking behavior varies by model family
