Go clients for Ollama: SDK comparison and Qwen3/GPT-OSS examples

Integrate Ollama with Go: SDK guide, examples, and production best practices.


This guide provides a comprehensive overview of available Go SDKs for Ollama and compares their feature sets.

We’ll explore practical Go examples for calling Qwen3 and GPT-OSS models hosted on Ollama—both via raw REST API calls and the official Go client—including detailed handling of thinking and non-thinking modes in Qwen3.


Why Ollama + Go?

Ollama exposes a small, pragmatic HTTP API (typically running at http://localhost:11434) designed for generate and chat workloads, with built-in streaming support and model management capabilities. The official documentation thoroughly covers /api/generate and /api/chat request/response structures and streaming semantics.

Go is an excellent choice for building Ollama clients due to its strong standard library support for HTTP, excellent JSON handling, native concurrency primitives, and statically-typed interfaces that catch errors at compile time.
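Before picking an SDK, it helps to confirm the server is reachable from Go. The sketch below uses only the standard library and the GET /api/tags endpoint, which lists locally pulled models; the endpoint comes from the Ollama API docs, but treat the struct here as a minimal assumption rather than the full response schema:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// tagsResponse captures only the fields of GET /api/tags that we print below.
type tagsResponse struct {
	Models []struct {
		Name string `json:"name"`
	} `json:"models"`
}

func main() {
	c := &http.Client{Timeout: 5 * time.Second}
	resp, err := c.Get("http://localhost:11434/api/tags")
	if err != nil {
		fmt.Println("Ollama is not reachable:", err)
		return
	}
	defer resp.Body.Close()

	var tags tagsResponse
	if err := json.NewDecoder(resp.Body).Decode(&tags); err != nil {
		fmt.Println("unexpected response:", err)
		return
	}
	for _, m := range tags.Models {
		fmt.Println("available model:", m.Name)
	}
}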

As of October 2025, here are the Go SDK options you’ll most likely consider.


Go SDKs for Ollama — what’s available?

SDK / Package | Status & "owner" | Scope (Generate/Chat/Streaming) | Model mgmt (pull/list/etc.) | Extras / Notes
--- | --- | --- | --- | ---
github.com/ollama/ollama/api | Official package inside the Ollama repo; used by the ollama CLI itself | Full coverage mapped to REST; streaming supported | Yes | Considered the canonical Go client; the API mirrors the docs closely.
LangChainGo (github.com/tmc/langchaingo/llms/ollama) | Community framework (LangChainGo) with an Ollama LLM module | Chat/Completion + streaming via framework abstractions | Limited (model mgmt is not a primary goal) | Great if you want chains, tools, and vector stores in Go; less of a raw SDK.
github.com/swdunlop/ollama-client | Community client | Focus on chat; good tool-calling experiments | Partial | Built for experimenting with tool calling; not a 1:1 full API surface.
Other community SDKs (e.g., ollamaclient, third-party "go-ollama-sdk") | Community | Varies | Varies | Quality and coverage vary; evaluate per repo.

Recommendation: For production, prefer github.com/ollama/ollama/api—it’s maintained with the core project and mirrors the REST API.


Qwen3 & GPT-OSS on Ollama: thinking vs non-thinking (what to know)

  • Thinking mode in Ollama separates the model’s “reasoning” from final output when enabled. Ollama documents a first-class enable/disable thinking behavior across supported models.
  • Qwen3 supports dynamic toggling: add /think or /no_think to system/user messages to switch modes turn by turn; the latest instruction wins (see the Qwen3:30b vs GPT-OSS:20b comparison at https://www.glukhov.org/post/2025/10/qwen3-30b-vs-gpt-oss-20b/).
  • GPT-OSS: users report that disabling thinking (e.g., /set nothink or --think=false) can be unreliable on gpt-oss:20b; plan to filter/hide any reasoning your UI shouldn’t surface.

Part 1 — Calling Ollama via raw REST (Go, net/http)

Shared types

First, let’s define the common types and helper functions we’ll use across our examples:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// ---- Chat API Types ----

type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRequest struct {
	Model    string        `json:"model"`
	Messages []ChatMessage `json:"messages"`
	// Some servers expose thinking control as a boolean flag.
	// Even if omitted, you can still control Qwen3 via /think or /no_think tags.
	Think   *bool          `json:"think,omitempty"`
	Stream  *bool          `json:"stream,omitempty"`
	Options map[string]any `json:"options,omitempty"`
}

type ChatResponse struct {
	Model     string `json:"model"`
	CreatedAt string `json:"created_at"`
	Message   struct {
		Role     string `json:"role"`
		Content  string `json:"content"`
		Thinking string `json:"thinking,omitempty"` // present when thinking is on
	} `json:"message"`
	Done bool `json:"done"`
}

// ---- Generate API Types ----

type GenerateRequest struct {
	Model   string         `json:"model"`
	Prompt  string         `json:"prompt"`
	Think   *bool          `json:"think,omitempty"`
	Stream  *bool          `json:"stream,omitempty"`
	Options map[string]any `json:"options,omitempty"`
}

type GenerateResponse struct {
	Model     string `json:"model"`
	CreatedAt string `json:"created_at"`
	Response  string `json:"response"`           // final text for non-stream
	Thinking  string `json:"thinking,omitempty"` // present when thinking is on
	Done      bool   `json:"done"`
}

// ---- Helper Functions ----

func httpPostJSON(url string, payload any) ([]byte, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	c := &http.Client{Timeout: 60 * time.Second}
	resp, err := c.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	// Surface HTTP-level failures instead of trying to parse an error body as JSON.
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("ollama returned %s: %s", resp.Status, data)
	}
	return data, nil
}

// bptr returns a pointer to a boolean value
func bptr(b bool) *bool { return &b }

Chat — Qwen3 with thinking ON (and how to turn it OFF)

func chatQwen3Thinking() error {
	endpoint := "http://localhost:11434/api/chat"

	req := ChatRequest{
		Model:   "qwen3:8b-thinking", // any :*-thinking tag you have pulled
		Think:   bptr(true),
		Stream:  bptr(false),
		Messages: []ChatMessage{
			{Role: "system", Content: "You are a precise assistant."},
			{Role: "user",   Content: "Explain recursion with a short Go example."},
		},
	}

	raw, err := httpPostJSON(endpoint, req)
	if err != nil {
		return err
	}
	var out ChatResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return err
	}
	fmt.Println("🧠 thinking:\n", out.Message.Thinking)
	fmt.Println("\n💬 answer:\n", out.Message.Content)
	return nil
}

// Turn thinking OFF for the next turn by:
// (a) setting Think=false, and/or
// (b) adding "/no_think" to the most-recent system/user message (Qwen3 soft switch).
// Qwen3 honors the latest /think or /no_think instruction in multi-turn chats.
func chatQwen3NoThinking() error {
	endpoint := "http://localhost:11434/api/chat"

	req := ChatRequest{
		Model:  "qwen3:8b-thinking",
		Think:  bptr(false),
		Stream: bptr(false),
		Messages: []ChatMessage{
			{Role: "system", Content: "You are brief. /no_think"},
			{Role: "user",   Content: "Explain recursion in one sentence."},
		},
	}

	raw, err := httpPostJSON(endpoint, req)
	if err != nil {
		return err
	}
	var out ChatResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return err
	}
	// Expect thinking to be empty; still handle defensively.
	if out.Message.Thinking != "" {
		fmt.Println("🧠 thinking (unexpected):\n", out.Message.Thinking)
	}
	fmt.Println("\n💬 answer:\n", out.Message.Content)
	return nil
}

(Qwen3’s /think and /no_think soft switch is documented by the Qwen team; the last instruction wins in multi-turn chats.)

Chat — GPT-OSS with thinking (and a caveat)

func chatGptOss() error {
	endpoint := "http://localhost:11434/api/chat"

	req := ChatRequest{
		Model:  "gpt-oss:20b",
		Think:  bptr(true),   // request separated reasoning if supported
		Stream: bptr(false),
		Messages: []ChatMessage{
			{Role: "user", Content: "What is dynamic programming? Explain the core idea."},
		},
	}

	raw, err := httpPostJSON(endpoint, req)
	if err != nil {
		return err
	}
	var out ChatResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return err
	}
	// Known quirk: disabling thinking may not fully suppress reasoning on gpt-oss:20b.
	// Always filter/hide thinking in UI if you don't want to surface it.
	fmt.Println("🧠 thinking:\n", out.Message.Thinking)
	fmt.Println("\n💬 answer:\n", out.Message.Content)
	return nil
}

Users report that disabling thinking on gpt-oss:20b (e.g., /set nothink or --think=false) can be ignored—plan for client-side filtering if needed.

Generate — Qwen3 and GPT-OSS

func generateQwen3() error {
	endpoint := "http://localhost:11434/api/generate"
	req := GenerateRequest{
		Model:  "qwen3:4b-thinking",
		Prompt: "In 2–3 sentences, what are B-Trees used for in databases?",
		Think:  bptr(true),
		Stream: bptr(false), // /api/generate streams by default; ask for one JSON object
	}
	raw, err := httpPostJSON(endpoint, req)
	if err != nil {
		return err
	}
	var out GenerateResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return err
	}
	if out.Thinking != "" {
		fmt.Println("🧠 thinking:\n", out.Thinking)
	}
	fmt.Println("\n💬 answer:\n", out.Response)
	return nil
}

func generateGptOss() error {
	endpoint := "http://localhost:11434/api/generate"
	req := GenerateRequest{
		Model:  "gpt-oss:20b",
		Prompt: "Briefly explain backpropagation in neural networks.",
		Think:  bptr(true),
		Stream: bptr(false), // request a single, complete JSON response
	}
	raw, err := httpPostJSON(endpoint, req)
	if err != nil {
		return err
	}
	var out GenerateResponse
	if err := json.Unmarshal(raw, &out); err != nil {
		return err
	}
	if out.Thinking != "" {
		fmt.Println("🧠 thinking:\n", out.Thinking)
	}
	fmt.Println("\n💬 answer:\n", out.Response)
	return nil
}

REST shapes and streaming behavior come straight from the Ollama API reference.


Part 2 — Calling Ollama via the official Go SDK (github.com/ollama/ollama/api)

The official package exposes a Client with methods that correspond to the REST API. The Ollama CLI itself uses this package to talk to the service, which makes it the safest bet for compatibility.

Install

go get github.com/ollama/ollama/api

Chat — Qwen3 (thinking ON / OFF)

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/ollama/ollama/api"
)

func chatWithQwen3Thinking(ctx context.Context, thinking bool) error {
	client, err := api.ClientFromEnvironment() // honors OLLAMA_HOST if set
	if err != nil {
		return err
	}

	req := &api.ChatRequest{
		Model: "qwen3:8b-thinking",
		// Many server builds expose thinking as a top-level flag;
		// additionally, you can control Qwen3 via /think or /no_think in messages.
		Think: api.Ptr(thinking),
		Messages: []api.Message{
			{Role: "system", Content: "You are a precise assistant."},
			{Role: "user",   Content: "Explain merge sort with a short Go snippet."},
		},
	}

	stream := false
	req.Stream = &stream // ask for one complete response

	// The official client delivers responses through a callback; with
	// Stream=false the callback is invoked exactly once with the full reply.
	var resp api.ChatResponse
	if err := client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp = r
		return nil
	}); err != nil {
		return err
	}

	if resp.Message.Thinking != "" {
		fmt.Println("🧠 thinking:\n", resp.Message.Thinking)
	}
	fmt.Println("\n💬 answer:\n", resp.Message.Content)
	return nil
}

func main() {
	ctx := context.Background()
	if err := chatWithQwen3Thinking(ctx, true); err != nil {
		log.Fatal(err)
	}
	// Example: non-thinking
	if err := chatWithQwen3Thinking(ctx, false); err != nil {
		log.Fatal(err)
	}
}

Chat — GPT-OSS (handle reasoning defensively)

func chatWithGptOss(ctx context.Context) error {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return err
	}
	req := &api.ChatRequest{
		Model: "gpt-oss:20b",
		Think: api.Ptr(true),
		Messages: []api.Message{
			{Role: "user", Content: "What is memoization and when is it useful?"},
		},
	}
	stream := false
	req.Stream = &stream

	// Chat takes a callback; with Stream=false it fires once with the full reply.
	var resp api.ChatResponse
	if err := client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp = r
		return nil
	}); err != nil {
		return err
	}
	// If you intend to hide reasoning, do it here regardless of flags.
	if resp.Message.Thinking != "" {
		fmt.Println("🧠 thinking:\n", resp.Message.Thinking)
	}
	fmt.Println("\n💬 answer:\n", resp.Message.Content)
	return nil
}

Generate — Qwen3 & GPT-OSS

func generateWithQwen3(ctx context.Context) error {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return err
	}
	req := &api.GenerateRequest{
		Model:  "qwen3:4b-thinking",
		Prompt: "Summarize the role of a B-Tree in indexing.",
		Think:  api.Ptr(true),
	}
	stream := false
	req.Stream = &stream

	// Generate also uses a callback; with Stream=false it is called once.
	var resp api.GenerateResponse
	if err := client.Generate(ctx, req, func(r api.GenerateResponse) error {
		resp = r
		return nil
	}); err != nil {
		return err
	}
	if resp.Thinking != "" {
		fmt.Println("🧠 thinking:\n", resp.Thinking)
	}
	fmt.Println("\n💬 answer:\n", resp.Response)
	return nil
}

func generateWithGptOss(ctx context.Context) error {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return err
	}
	req := &api.GenerateRequest{
		Model:  "gpt-oss:20b",
		Prompt: "Explain gradient descent in simple terms.",
		Think:  api.Ptr(true),
	}
	stream := false
	req.Stream = &stream

	var resp api.GenerateResponse
	if err := client.Generate(ctx, req, func(r api.GenerateResponse) error {
		resp = r
		return nil
	}); err != nil {
		return err
	}
	if resp.Thinking != "" {
		fmt.Println("🧠 thinking:\n", resp.Thinking)
	}
	fmt.Println("\n💬 answer:\n", resp.Response)
	return nil
}

The official package’s surface mirrors the REST docs and is updated alongside the core project.


Streaming responses

For real-time streaming, set Stream: bptr(true) in your request. The response will be delivered as newline-delimited JSON chunks:

func streamChatExample() error {
	endpoint := "http://localhost:11434/api/chat"
	req := ChatRequest{
		Model:  "qwen3:8b-thinking",
		Think:  bptr(true),
		Stream: bptr(true), // Enable streaming
		Messages: []ChatMessage{
			{Role: "user", Content: "Explain quicksort algorithm step by step."},
		},
	}

	body, _ := json.Marshal(req)
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	decoder := json.NewDecoder(resp.Body)
	for {
		var chunk ChatResponse
		if err := decoder.Decode(&chunk); err == io.EOF {
			break
		} else if err != nil {
			return err
		}
		
		// Process thinking and content as they arrive
		if chunk.Message.Thinking != "" {
			fmt.Print(chunk.Message.Thinking)
		}
		fmt.Print(chunk.Message.Content)
		
		if chunk.Done {
			break
		}
	}
	return nil
}

With the official SDK, use a callback function to handle streaming chunks:

func streamWithOfficialSDK(ctx context.Context) error {
	client, _ := api.ClientFromEnvironment()
	
	req := &api.ChatRequest{
		Model: "qwen3:8b-thinking",
		Think: api.Ptr(true),
		Messages: []api.Message{
			{Role: "user", Content: "Explain binary search trees."},
		},
	}
	
	err := client.Chat(ctx, req, func(resp api.ChatResponse) error {
		if resp.Message.Thinking != "" {
			fmt.Print(resp.Message.Thinking)
		}
		fmt.Print(resp.Message.Content)
		return nil
	})
	
	return err
}
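If your UI should keep reasoning separate from the final answer (for example behind a "show reasoning" toggle), buffer the two streams independently instead of printing them interleaved. This is a minimal sketch that reuses the request shape from the examples above (including api.Ptr and the Think field) and assumes the strings package is imported:

func streamBuffered(ctx context.Context) (thinking, answer string, err error) {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return "", "", err
	}

	req := &api.ChatRequest{
		Model: "qwen3:8b-thinking",
		Think: api.Ptr(true),
		Messages: []api.Message{
			{Role: "user", Content: "Explain binary search trees."},
		},
	}

	var thinkBuf, answerBuf strings.Builder
	err = client.Chat(ctx, req, func(resp api.ChatResponse) error {
		// Each streamed chunk may carry reasoning, answer text, or both.
		thinkBuf.WriteString(resp.Message.Thinking)
		answerBuf.WriteString(resp.Message.Content)
		return nil
	})
	return thinkBuf.String(), answerBuf.String(), err
}

The caller decides what to render: show the answer immediately and reveal the buffered thinking only on demand.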

Working with Qwen3 thinking vs non-thinking (practical guidance)

  • Two levers:

    1. A boolean thinking flag supported by Ollama’s thinking feature; and
    2. Qwen3’s soft switch commands /think and /no_think in the latest system/user message. The most recent instruction governs the next turn(s).
  • Default posture: non-thinking for quick replies; escalate to thinking for tasks that need step-by-step reasoning (math, planning, debugging, complex code analysis).

  • Streaming UIs: when thinking is enabled, you may see interleaved reasoning/content in streamed frames—buffer or render them separately and give users a “show reasoning” toggle. (See API docs for streaming format.)

  • Multi-turn conversations: Qwen3 remembers the thinking mode from previous turns. If you want to toggle it mid-conversation, use both the flag AND the soft-switch command for reliability, as in the helper sketched below.
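A small helper can apply both levers at once. This is a sketch under the same assumptions as the examples above (the boolean Think flag plus api.Ptr, and Qwen3's /think and /no_think tags); it appends the tag to the latest user message, since the most recent instruction wins:

// withThinking applies both thinking levers to a chat request:
// the request-level flag and Qwen3's soft-switch tag on the latest user message.
func withThinking(req *api.ChatRequest, enabled bool) {
	req.Think = api.Ptr(enabled)

	tag := "/no_think"
	if enabled {
		tag = "/think"
	}
	// Append the soft switch to the most recent user message, if there is one.
	for i := len(req.Messages) - 1; i >= 0; i-- {
		if req.Messages[i].Role == "user" {
			req.Messages[i].Content += " " + tag
			break
		}
	}
}

Call withThinking(req, false) before a turn that should answer quickly, and withThinking(req, true) before a turn that needs step-by-step reasoning.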

Notes for GPT-OSS

  • Treat reasoning as present even if you tried to disable it; filter on the client if your UX shouldn’t show it.
  • For production applications using GPT-OSS, implement client-side filtering logic that can detect and strip reasoning patterns if needed; a sketch follows this list.
  • Test your specific GPT-OSS model variant thoroughly, as behavior may vary between different quantizations and versions.
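A defensive filter can ignore the structured Thinking field entirely and also strip any reasoning that leaks into the content itself. The sketch below needs the regexp and strings packages, and the <think>…</think> tag format is an assumption: inspect what your GPT-OSS variant actually emits and adjust the pattern accordingly.

// thinkBlock matches reasoning wrapped in <think>...</think> tags.
// NOTE: the tag format is an assumption; adapt it to what your model emits.
var thinkBlock = regexp.MustCompile(`(?s)<think>.*?</think>`)

// visibleAnswer returns only the text that should be shown to end users.
// The structured Thinking field is ignored on purpose.
func visibleAnswer(msg api.Message) string {
	cleaned := thinkBlock.ReplaceAllString(msg.Content, "")
	return strings.TrimSpace(cleaned)
}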

Best practices and production tips

Error handling and timeouts

Always implement proper timeout handling and error recovery:

func robustChatRequest(ctx context.Context, model string, messages []api.Message) (*api.ChatResponse, error) {
	// Set a reasonable timeout
	ctx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()
	
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return nil, fmt.Errorf("creating client: %w", err)
	}
	
	req := &api.ChatRequest{
		Model:    model,
		Messages: messages,
		Options: map[string]interface{}{
			"temperature": 0.7,
			"num_ctx":     4096, // context window size
		},
	}
	
	stream := false
	req.Stream = &stream

	var resp api.ChatResponse
	if err := client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp = r
		return nil
	}); err != nil {
		return nil, fmt.Errorf("chat request failed: %w", err)
	}
	
	return &resp, nil
}

Connection pooling and reuse

Reuse the Ollama client across requests instead of creating a new one each time:

type OllamaService struct {
	client *api.Client
}

func NewOllamaService() (*OllamaService, error) {
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return nil, err
	}
	return &OllamaService{client: client}, nil
}

func (s *OllamaService) Chat(ctx context.Context, req *api.ChatRequest) (*api.ChatResponse, error) {
	stream := false
	req.Stream = &stream // return one complete response

	var resp api.ChatResponse
	if err := s.client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp = r
		return nil
	}); err != nil {
		return nil, err
	}
	return &resp, nil
}

Environment configuration

Use environment variables for flexible deployment:

export OLLAMA_HOST=http://localhost:11434
export OLLAMA_NUM_PARALLEL=2
export OLLAMA_MAX_LOADED_MODELS=2

The official SDK automatically respects OLLAMA_HOST via api.ClientFromEnvironment().
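If you need to point at a specific host programmatically rather than via the environment, the package also exposes a NewClient constructor that takes a base URL and an *http.Client; the exact signature may differ between releases, so check the version you have pinned. A sketch (it needs net/url, net/http, and time):

func newClientForHost(raw string) (*api.Client, error) {
	base, err := url.Parse(raw) // e.g. "http://ollama.internal:11434"
	if err != nil {
		return nil, err
	}
	// A generous timeout: large models can take a while to respond.
	return api.NewClient(base, &http.Client{Timeout: 2 * time.Minute}), nil
}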

Monitoring and logging

Implement structured logging for production systems:

func loggedChat(ctx context.Context, logger *log.Logger, req *api.ChatRequest) error {
	start := time.Now()
	client, err := api.ClientFromEnvironment()
	if err != nil {
		return err
	}

	var resp api.ChatResponse
	err = client.Chat(ctx, req, func(r api.ChatResponse) error {
		resp = r
		return nil
	})

	duration := time.Since(start)
	logger.Printf("model=%s duration=%v error=%v content_chars=%d",
		req.Model, duration, err, len(resp.Message.Content))

	return err
}

Conclusion

  • For Go projects, github.com/ollama/ollama/api is the most complete, production-ready choice. It’s maintained alongside the Ollama core project, used by the official CLI, and provides comprehensive API coverage with guaranteed compatibility.

  • For higher-level abstractions, consider LangChainGo when you need chains, tools, vector stores, and RAG pipelines—though you’ll trade some low-level control for convenience.

  • Qwen3 gives you clean, reliable control over thinking mode with both flags and message-level toggles (/think, /no_think), making it ideal for applications that need both fast responses and deep reasoning.

  • For GPT-OSS, always plan to sanitize reasoning output client-side when necessary, as the thinking disable flag may not be consistently honored.

  • In production, implement proper error handling, connection pooling, timeouts, and monitoring to build robust Ollama-powered applications.

The combination of Go’s strong typing, excellent concurrency support, and Ollama’s straightforward API makes it an ideal stack for building AI-powered applications—from simple chatbots to complex RAG systems.

Key takeaways

Here’s a quick reference for choosing your approach:

Use Case | Recommended Approach | Why
--- | --- | ---
Production application | github.com/ollama/ollama/api | Official support, complete API coverage, maintained with the core project
RAG/chains/tools pipeline | LangChainGo | High-level abstractions, integrations with vector stores
Learning/experimentation | Raw REST with net/http | Full transparency, no dependencies, educational
Quick prototyping | Official SDK | Balance of simplicity and power
Streaming chat UI | Official SDK with callbacks | Built-in streaming support, clean API

Model selection guidance:

  • Qwen3: Best for applications requiring controllable thinking mode, reliable multi-turn conversations
  • GPT-OSS: Strong performance but requires defensive handling of thinking/reasoning output
  • Other models: Test thoroughly; thinking behavior varies by model family
