Day 0 Support: Claude Opus 4.5 (+ Advanced Features)
This guide covers Anthropic's latest model (Claude Opus 4.5) and its advanced features now available in LiteLLM: Tool Search, Programmatic Tool Calling, Tool Input Examples, and the Effort Parameter.
| Feature | Supported Models |
|---|---|
| Tool Search | Claude Opus 4.5, Sonnet 4.5 |
| Programmatic Tool Calling | Claude Opus 4.5, Sonnet 4.5 |
| Input Examples | Claude Opus 4.5, Sonnet 4.5 |
| Effort Parameter | Claude Opus 4.5 only |
Supported Providers: Anthropic, Bedrock, Vertex AI.
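Each provider addresses the model with its own provider-prefixed model string. The IDs below are the ones used throughout this guide:

```python
# Model strings for Claude Opus 4.5, per provider (as used in this guide).
MODEL_IDS = {
    "anthropic": "claude-opus-4-5-20251101",
    "bedrock": "bedrock/us.anthropic.claude-opus-4-5-20251101-v1:0",
    "vertex_ai": "vertex_ai/claude-opus-4-5@20251101",
}
```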
Usage
- LiteLLM Python SDK
- LiteLLM Proxy
import os
from litellm import completion
# set env - replace with your Anthropic API key
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
messages = [{"role": "user", "content": "Hey! how's it going?"}]
## OPENAI /chat/completions API format
response = completion(model="claude-opus-4-5-20251101", messages=messages)
print(response)
1. Setup config.yaml
model_list:
- model_name: claude-4 ### RECEIVED MODEL NAME ###
litellm_params: # all params accepted by litellm.completion() - https://docs.litellm.ai/docs/completion/input
model: claude-opus-4-5-20251101 ### MODEL NAME sent to `litellm.completion()` ###
api_key: "os.environ/ANTHROPIC_API_KEY" # does os.getenv("ANTHROPIC_API_KEY")
2. Start the proxy
litellm --config /path/to/config.yaml
3. Test it!
- OpenAI Chat Completions
- Anthropic /v1/messages API
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}
'
curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}
'
Usage - Bedrock
LiteLLM uses the boto3 library to authenticate with Bedrock.
For more ways to authenticate with Bedrock, see the Bedrock documentation.
- LiteLLM Python SDK
- LiteLLM Proxy
import os
from litellm import completion
os.environ["AWS_ACCESS_KEY_ID"] = ""
os.environ["AWS_SECRET_ACCESS_KEY"] = ""
os.environ["AWS_REGION_NAME"] = ""
## OPENAI /chat/completions API format
response = completion(
model="bedrock/us.anthropic.claude-opus-4-5-20251101-v1:0",
messages=[{ "content": "Hello, how are you?","role": "user"}]
)
1. Setup config.yaml
model_list:
- model_name: claude-4 ### RECEIVED MODEL NAME ###
litellm_params: # all params accepted by litellm.completion() - https://docs.litellm.ai/docs/completion/input
model: bedrock/us.anthropic.claude-opus-4-5-20251101-v1:0 ### MODEL NAME sent to `litellm.completion()` ###
aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
aws_region_name: os.environ/AWS_REGION_NAME
2. Start the proxy
litellm --config /path/to/config.yaml
3. Test it!
- OpenAI Chat Completions
- Anthropic /v1/messages API
- Bedrock /invoke API
- Bedrock /converse API
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}
'
curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}
'
curl --location 'http://0.0.0.0:4000/bedrock/model/claude-4/invoke' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"max_tokens": 1024,
"messages": [{"role": "user", "content": "Hello, how are you?"}]
}'
curl --location 'http://0.0.0.0:4000/bedrock/model/claude-4/converse' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"messages": [{"role": "user", "content": "Hello, how are you?"}]
}'
Usage - Vertex AI
- LiteLLM Python SDK
- LiteLLM Proxy
from litellm import completion
import json
## GET CREDENTIALS
## RUN ##
# !gcloud auth application-default login - run this to add vertex credentials to your env
## OR ##
file_path = 'path/to/vertex_ai_service_account.json'
# Load the JSON file
with open(file_path, 'r') as file:
vertex_credentials = json.load(file)
# Convert to JSON string
vertex_credentials_json = json.dumps(vertex_credentials)
## COMPLETION CALL
response = completion(
model="vertex_ai/claude-opus-4-5@20251101",
messages=[{ "content": "Hello, how are you?","role": "user"}],
vertex_credentials=vertex_credentials_json,
vertex_project="your-project-id",
vertex_location="us-east5"
)
1. Setup config.yaml
model_list:
- model_name: claude-4 ### RECEIVED MODEL NAME ###
litellm_params:
model: vertex_ai/claude-opus-4-5@20251101
vertex_credentials: "/path/to/service_account.json"
vertex_project: "your-project-id"
vertex_location: "us-east5"
2. Start the proxy
litellm --config /path/to/config.yaml
3. Test it!
- OpenAI Chat Completions
- Anthropic /v1/messages API
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}
'
curl --location 'http://0.0.0.0:4000/v1/messages' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}
'
Tool Search
Tool Search lets Claude work with thousands of tools by loading them on demand, instead of loading every tool definition into the context window upfront.
Usage Example
- LiteLLM Python SDK
- LiteLLM Proxy
import litellm
import os
# Configure your API key
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
# Define your tools with defer_loading
tools = [
# Tool search tool (regex variant)
{
"type": "tool_search_tool_regex_20251119",
"name": "tool_search_tool_regex"
},
# Deferred tools - loaded on-demand
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given location. Returns temperature and conditions.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
},
"defer_loading": True # Load on-demand
},
{
"type": "function",
"function": {
"name": "search_files",
"description": "Search through files in the workspace using keywords",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"file_types": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["query"]
}
},
"defer_loading": True
},
{
"type": "function",
"function": {
"name": "query_database",
"description": "Execute SQL queries against the database",
"parameters": {
"type": "object",
"properties": {
"sql": {"type": "string"}
},
"required": ["sql"]
}
},
"defer_loading": True
}
]
# Make a request - Claude will search for and use relevant tools
response = litellm.completion(
model="anthropic/claude-opus-4-5-20251101",
messages=[{
"role": "user",
"content": "What's the weather like in San Francisco?"
}],
tools=tools
)
print("Claude's response:", response.choices[0].message.content)
print("Tool calls:", response.choices[0].message.tool_calls)
# Check tool search usage
if hasattr(response.usage, 'server_tool_use'):
print(f"Tool searches performed: {response.usage.server_tool_use.tool_search_requests}")
- Setup config.yaml
model_list:
- model_name: claude-4
litellm_params:
model: anthropic/claude-opus-4-5-20251101
api_key: os.environ/ANTHROPIC_API_KEY
- Start the proxy
litellm --config /path/to/config.yaml
- Test it!
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"messages": [{
"role": "user",
"content": "What is the weather like in San Francisco?"
}],
"tools": [
{
"type": "tool_search_tool_regex_20251119",
"name": "tool_search_tool_regex"
},
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a given location. Returns temperature and conditions.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
}
},
"defer_loading": true
},
{
"type": "function",
"function": {
"name": "search_files",
"description": "Search through files in the workspace using keywords",
"parameters": {
"type": "object",
"properties": {
"query": {"type": "string"},
"file_types": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["query"]
}
},
"defer_loading": true
},
{
"type": "function",
"function": {
"name": "query_database",
"description": "Execute SQL queries against the database",
"parameters": {
"type": "object",
"properties": {
"sql": {"type": "string"}
},
"required": ["sql"]
}
},
"defer_loading": true
}
]
}
'
BM25 Variant (Natural Language Search)
For natural language queries instead of regex patterns:
tools = [
{
"type": "tool_search_tool_bm25_20251119", # Natural language variant
"name": "tool_search_tool_bm25"
},
# ... your deferred tools
]
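If you maintain a large OpenAI-style tool list, the `defer_loading` flag can be applied in bulk. A minimal sketch, assuming you want a small "hot" set of tools kept in context (the `defer_tools` helper is ours, not a LiteLLM API):

```python
# Hypothetical helper: mark every function tool as deferred except a "hot" set.
def defer_tools(tools, hot=()):
    out = []
    for tool in tools:
        if tool.get("type") != "function":
            out.append(tool)  # server tools (e.g. the tool search tool) pass through
            continue
        entry = dict(tool)  # shallow copy so the input list is not mutated
        if entry["function"]["name"] not in hot:
            entry["defer_loading"] = True  # loaded on demand via tool search
        out.append(entry)
    return out
```

Pass the result as `tools=` alongside a tool search tool, as in the examples above.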
Programmatic Tool Calling
Programmatic tool calling lets Claude write code that invokes your tools from within the code execution environment, reducing round trips for multi-step tool workflows. Learn more
- LiteLLM Python SDK
- LiteLLM Proxy
import litellm
import json
# Define tools that can be called programmatically
tools = [
# Code execution tool (required for programmatic calling)
{
"type": "code_execution_20250825",
"name": "code_execution"
},
# Tool that can be called from code
{
"type": "function",
"function": {
"name": "query_database",
"description": "Execute a SQL query against the sales database. Returns a list of rows as JSON objects.",
"parameters": {
"type": "object",
"properties": {
"sql": {
"type": "string",
"description": "SQL query to execute"
}
},
"required": ["sql"]
}
},
"allowed_callers": ["code_execution_20250825"] # Enable programmatic calling
}
]
# First request
response = litellm.completion(
model="anthropic/claude-sonnet-4-5-20250929",
messages=[{
"role": "user",
"content": "Query sales data for West, East, and Central regions, then tell me which had the highest revenue"
}],
tools=tools
)
print("Claude's response:", response.choices[0].message)
# Handle tool calls
messages = [
{"role": "user", "content": "Query sales data for West, East, and Central regions, then tell me which had the highest revenue"},
{"role": "assistant", "content": response.choices[0].message.content, "tool_calls": response.choices[0].message.tool_calls}
]
# Process each tool call
for tool_call in response.choices[0].message.tool_calls:
# Check if it's a programmatic call
if hasattr(tool_call, 'caller') and tool_call.caller:
print(f"Programmatic call to {tool_call.function.name}")
print(f"Called from: {tool_call.caller}")
# Simulate tool execution
if tool_call.function.name == "query_database":
args = json.loads(tool_call.function.arguments)
# Simulate database query
result = json.dumps([
{"region": "West", "revenue": 150000},
{"region": "East", "revenue": 180000},
{"region": "Central", "revenue": 120000}
])
messages.append({
"role": "user",
"content": [{
"type": "tool_result",
"tool_use_id": tool_call.id,
"content": result
}]
})
# Get final response
final_response = litellm.completion(
model="anthropic/claude-sonnet-4-5-20250929",
messages=messages,
tools=tools
)
print("\nFinal answer:", final_response.choices[0].message.content)
- Setup config.yaml
model_list:
- model_name: claude-4
litellm_params:
model: anthropic/claude-opus-4-5-20251101
api_key: os.environ/ANTHROPIC_API_KEY
- Start the proxy
litellm --config /path/to/config.yaml
- Test it!
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"messages": [{
"role": "user",
"content": "Query sales data for West, East, and Central regions, then tell me which had the highest revenue"
}],
"tools": [
{
"type": "code_execution_20250825",
"name": "code_execution"
},
{
"type": "function",
"function": {
"name": "query_database",
"description": "Execute a SQL query against the sales database. Returns a list of rows as JSON objects.",
"parameters": {
"type": "object",
"properties": {
"sql": {
"type": "string",
"description": "SQL query to execute"
}
},
"required": ["sql"]
}
},
"allowed_callers": ["code_execution_20250825"]
}
]
}
'
Tool Input Examples
You can now provide Claude with concrete examples of valid inputs for your tools, which helps it produce correctly structured tool calls. Learn more
- LiteLLM Python SDK
- LiteLLM Proxy
import litellm
tools = [
{
"type": "function",
"function": {
"name": "create_calendar_event",
"description": "Create a new calendar event with attendees and reminders",
"parameters": {
"type": "object",
"properties": {
"title": {"type": "string"},
"start_time": {
"type": "string",
"description": "ISO 8601 format: YYYY-MM-DDTHH:MM:SS"
},
"duration_minutes": {"type": "integer"},
"attendees": {
"type": "array",
"items": {
"type": "object",
"properties": {
"email": {"type": "string"},
"optional": {"type": "boolean"}
}
}
},
"reminders": {
"type": "array",
"items": {
"type": "object",
"properties": {
"minutes_before": {"type": "integer"},
"method": {"type": "string", "enum": ["email", "popup"]}
}
}
}
},
"required": ["title", "start_time", "duration_minutes"]
}
},
# Provide concrete examples
"input_examples": [
{
"title": "Team Standup",
"start_time": "2025-01-15T09:00:00",
"duration_minutes": 30,
"attendees": [
{"email": "alice@company.com", "optional": False},
{"email": "bob@company.com", "optional": False}
],
"reminders": [
{"minutes_before": 15, "method": "popup"}
]
},
{
"title": "Lunch Break",
"start_time": "2025-01-15T12:00:00",
"duration_minutes": 60
# Demonstrates optional fields can be omitted
}
]
}
]
response = litellm.completion(
model="anthropic/claude-sonnet-4-5-20250929",
messages=[{
"role": "user",
"content": "Schedule a team meeting for tomorrow at 2pm for 45 minutes with john@company.com and sarah@company.com"
}],
tools=tools
)
print("Tool call:", response.choices[0].message.tool_calls[0].function.arguments)
- Setup config.yaml
model_list:
- model_name: claude-4
litellm_params:
model: anthropic/claude-opus-4-5-20251101
api_key: os.environ/ANTHROPIC_API_KEY
- Start the proxy
litellm --config /path/to/config.yaml
- Test it!
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"messages": [{
"role": "user",
"content": "Schedule a team meeting for tomorrow at 2pm for 45 minutes with john@company.com and sarah@company.com"
}],
"tools": [
{
"type": "function",
"function": {
"name": "create_calendar_event",
"description": "Create a new calendar event with attendees and reminders",
"parameters": {
"type": "object",
"properties": {
"title": {"type": "string"},
"start_time": {
"type": "string",
"description": "ISO 8601 format: YYYY-MM-DDTHH:MM:SS"
},
"duration_minutes": {"type": "integer"},
"attendees": {
"type": "array",
"items": {
"type": "object",
"properties": {
"email": {"type": "string"},
"optional": {"type": "boolean"}
}
}
},
"reminders": {
"type": "array",
"items": {
"type": "object",
"properties": {
"minutes_before": {"type": "integer"},
"method": {"type": "string", "enum": ["email", "popup"]}
}
}
}
},
"required": ["title", "start_time", "duration_minutes"]
}
},
"input_examples": [
{
"title": "Team Standup",
"start_time": "2025-01-15T09:00:00",
"duration_minutes": 30,
"attendees": [
{"email": "alice@company.com", "optional": false},
{"email": "bob@company.com", "optional": false}
],
"reminders": [
{"minutes_before": 15, "method": "popup"}
]
},
{
"title": "Lunch Break",
"start_time": "2025-01-15T12:00:00",
"duration_minutes": 60
}
]
}
]
}
'
Effort Parameter: Control Token Usage
Controls how much effort the model puts into its response, via `output_config={"effort": ...}`.
Soon, we will map OpenAI's `reasoning_effort` parameter to this.
Supported values for the `effort` parameter: `"high"`, `"medium"`, `"low"`.
Usage Example
- LiteLLM Python SDK
- LiteLLM Proxy
import litellm
message = "Analyze the trade-offs between microservices and monolithic architectures"
# High effort (default) - Maximum capability
response_high = litellm.completion(
model="anthropic/claude-opus-4-5-20251101",
messages=[{"role": "user", "content": message}],
output_config={"effort": "high"}
)
print("High effort response:")
print(response_high.choices[0].message.content)
print(f"Tokens used: {response_high.usage.completion_tokens}\n")
# Medium effort - Balanced approach
response_medium = litellm.completion(
model="anthropic/claude-opus-4-5-20251101",
messages=[{"role": "user", "content": message}],
output_config={"effort": "medium"}
)
print("Medium effort response:")
print(response_medium.choices[0].message.content)
print(f"Tokens used: {response_medium.usage.completion_tokens}\n")
# Low effort - Maximum efficiency
response_low = litellm.completion(
model="anthropic/claude-opus-4-5-20251101",
messages=[{"role": "user", "content": message}],
output_config={"effort": "low"}
)
print("Low effort response:")
print(response_low.choices[0].message.content)
print(f"Tokens used: {response_low.usage.completion_tokens}\n")
# Compare token usage
print("Token Comparison:")
print(f"High: {response_high.usage.completion_tokens} tokens")
print(f"Medium: {response_medium.usage.completion_tokens} tokens")
print(f"Low: {response_low.usage.completion_tokens} tokens")
- Setup config.yaml
model_list:
- model_name: claude-4
litellm_params:
model: anthropic/claude-opus-4-5-20251101
api_key: os.environ/ANTHROPIC_API_KEY
- Start the proxy
litellm --config /path/to/config.yaml
- Test it!
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"messages": [{
"role": "user",
"content": "Analyze the trade-offs between microservices and monolithic architectures"
}],
"output_config": {
"effort": "high"
}
}
'
Cost Tracking: Monitor Tool Search Usage
Understanding Tool Search Costs
Tool search operations are tracked separately in the usage object, under `server_tool_use.tool_search_requests`, so you can monitor and optimize costs.
Anthropic charges $0.0001 per tool search request.
Tracking Example
- LiteLLM Python SDK
- LiteLLM Proxy
import litellm
tools = [
{
"type": "tool_search_tool_regex_20251119",
"name": "tool_search_tool_regex"
},
# ... 100 deferred tools
]
response = litellm.completion(
model="anthropic/claude-sonnet-4-5-20250929",
messages=[{
"role": "user",
"content": "Find and use the weather tool for San Francisco"
}],
tools=tools
)
# Standard token usage
print("Token Usage:")
print(f" Input tokens: {response.usage.prompt_tokens}")
print(f" Output tokens: {response.usage.completion_tokens}")
print(f" Total tokens: {response.usage.total_tokens}")
# Tool search specific usage
if hasattr(response.usage, 'server_tool_use') and response.usage.server_tool_use:
print(f"\nTool Search Usage:")
print(f" Search requests: {response.usage.server_tool_use.tool_search_requests}")
# Calculate cost (example pricing)
input_cost = response.usage.prompt_tokens * 0.000003 # $3 per 1M tokens
output_cost = response.usage.completion_tokens * 0.000015 # $15 per 1M tokens
search_cost = response.usage.server_tool_use.tool_search_requests * 0.0001 # Example
total_cost = input_cost + output_cost + search_cost
print(f"\nCost Breakdown:")
print(f" Input tokens: ${input_cost:.6f}")
print(f" Output tokens: ${output_cost:.6f}")
print(f" Tool searches: ${search_cost:.6f}")
print(f" Total: ${total_cost:.6f}")
- Setup config.yaml
model_list:
- model_name: claude-4
litellm_params:
model: anthropic/claude-opus-4-5-20251101
api_key: os.environ/ANTHROPIC_API_KEY
- Start the proxy
litellm --config /path/to/config.yaml
- Test it!
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"messages": [{
"role": "user",
"content": "Find and use the weather tool for San Francisco"
}],
"tools": [
{
"type": "tool_search_tool_regex_20251119",
"name": "tool_search_tool_regex"
}
]
}
'
Expected Response:
{
...,
"usage": {
...,
"server_tool_use": {
"tool_search_requests": 1
}
}
}
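When calling through the proxy, the same counter can be read from the raw JSON response body. A short sketch (token values are illustrative; `server_tool_use` may be absent when no search ran, hence the defensive lookup):

```python
import json

# Illustrative proxy response body, with the shape shown above.
body = ('{"usage": {"prompt_tokens": 120, "completion_tokens": 40, '
        '"server_tool_use": {"tool_search_requests": 1}}}')
usage = json.loads(body).get("usage", {})
searches = (usage.get("server_tool_use") or {}).get("tool_search_requests", 0)
print(searches)  # → 1
```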
Cost Optimization Tips
- Keep frequently used tools non-deferred (3-5 tools)
- Use tool search for large catalogs (10+ tools)
- Monitor search requests to identify optimization opportunities
- Combine with effort parameter for maximum efficiency
Combining Features
The Power of Integration
These features work together seamlessly. Here's a real-world example combining all of them:
- LiteLLM Python SDK
- LiteLLM Proxy
import litellm
import json
# Large tool catalog with search, programmatic calling, and examples
tools = [
# Enable tool search
{
"type": "tool_search_tool_regex_20251119",
"name": "tool_search_tool_regex"
},
# Enable programmatic calling
{
"type": "code_execution_20250825",
"name": "code_execution"
},
# Database tool with all features
{
"type": "function",
"function": {
"name": "query_database",
"description": "Execute SQL queries against the analytics database. Returns JSON array of results.",
"parameters": {
"type": "object",
"properties": {
"sql": {
"type": "string",
"description": "SQL SELECT statement"
},
"limit": {
"type": "integer",
"description": "Maximum rows to return"
}
},
"required": ["sql"]
}
},
"defer_loading": True, # Tool search
"allowed_callers": ["code_execution_20250825"], # Programmatic calling
"input_examples": [ # Input examples
{
"sql": "SELECT region, SUM(revenue) as total FROM sales GROUP BY region",
"limit": 100
}
]
},
# ... 50 more tools with defer_loading
]
# Make request with effort control
response = litellm.completion(
model="anthropic/claude-opus-4-5-20251101",
messages=[{
"role": "user",
"content": "Analyze sales by region for the last quarter and identify top performers"
}],
tools=tools,
output_config={"effort": "medium"} # Balanced efficiency
)
# Track comprehensive usage
print("Complete Usage Metrics:")
print(f" Input tokens: {response.usage.prompt_tokens}")
print(f" Output tokens: {response.usage.completion_tokens}")
print(f" Total tokens: {response.usage.total_tokens}")
if hasattr(response.usage, 'server_tool_use') and response.usage.server_tool_use:
print(f" Tool searches: {response.usage.server_tool_use.tool_search_requests}")
print(f"\nResponse: {response.choices[0].message.content}")
- Setup config.yaml
model_list:
- model_name: claude-4
litellm_params:
model: anthropic/claude-opus-4-5-20251101
api_key: os.environ/ANTHROPIC_API_KEY
- Start the proxy
litellm --config /path/to/config.yaml
- Test it!
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer $LITELLM_KEY' \
--data ' {
"model": "claude-4",
"messages": [{
"role": "user",
"content": "Analyze sales by region for the last quarter and identify top performers"
}],
"tools": [
{
"type": "tool_search_tool_regex_20251119",
"name": "tool_search_tool_regex"
}
],
"output_config": {
"effort": "medium"
}
}
'
Expected Response:
{
...,
"usage": {
...,
"server_tool_use": {
"tool_search_requests": 1
}
}
}
Real-World Benefits
This combination enables:
- Massive scale - Handle 1000+ tools efficiently
- Low latency - Programmatic calling reduces round trips
- High accuracy - Input examples ensure correct tool usage
- Cost control - Effort parameter optimizes token spend
- Full visibility - Track all usage metrics

