Hemanth's Scribes

Building a Visual Product Agent with Gemma 4

Hemanth HM
I’ve been poking at Gemma 4’s multi-modal capabilities and wanted to see how far I could push a single model into doing real agentic work — vision, tool use, multi-turn reasoning, all at once. Turns out, pretty far.

The idea is simple: show the model a product image, let it figure out what it’s looking at, then have it call tools to gather pricing, reviews, and specs. No hardcoded product names. No separate classification step. Just: here’s a photo, tell me everything.

The architecture

The agent runs an explicit loop with a maximum of 5 turns:

  1. User sends image(s) + a question
  2. Model looks at the image, thinks (literally — thinking mode is on), and decides what tools to call
  3. Tools execute, results feed back into the conversation
  4. Model either calls more tools or writes a final report
  5. Repeat until done or max turns hit

Turn 1: [Image] → model identifies product → calls search_product()
Turn 2: model gets search results → calls get_price(), get_reviews(), get_specs()
Turn 3: model has all data → writes structured analysis report

The key thing: the model batches independent tool calls into a single turn. It doesn’t call get_price, wait, then call get_reviews, wait, then call get_specs. It fires all three at once. That’s Gemma 4 figuring out the dependency graph on its own.

Setting up tools

Four tools, each doing one thing:

def search_product(query: str) -> dict:
    """Searches for a product by name or description.

    Args:
        query: Search terms to find the product.
    """
    # Match against a product database
    # Returns: product_name, brand, category, match_confidence
    ...

def get_price(product_name: str) -> dict:
    """Looks up the current price across multiple retailers."""
    ...

def get_reviews(product_name: str) -> dict:
    """Gets review ratings and summary for a product."""
    ...

def get_specs(product_name: str) -> dict:
    """Gets detailed technical specifications."""
    ...

The docstrings matter here. Gemma 4’s processor auto-extracts function schemas from type hints and Google-style docstrings. No separate JSON schema definition needed — just write normal Python functions with proper annotations and the model knows what to call and how.
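To make that concrete, here is roughly what gets derived from one of the tools above, sketched with stdlib introspection. This is illustrative only — the real extraction happens inside Gemma 4's processor, and the exact schema shape it produces may differ:

```python
import inspect
import typing

def schema_from_function(func) -> dict:
    """Derive a minimal JSON-schema-style tool description from a
    function's type hints and the first line of its docstring.
    (Illustrative sketch, not the processor's actual implementation.)"""
    py_to_json = {str: "string", int: "integer", float: "number",
                  bool: "boolean", dict: "object", list: "array"}
    hints = typing.get_type_hints(func)
    hints.pop("return", None)  # only parameters go into the schema
    doc = inspect.getdoc(func) or ""
    return {
        "name": func.__name__,
        "description": doc.split("\n")[0],
        "parameters": {
            "type": "object",
            "properties": {name: {"type": py_to_json.get(tp, "string")}
                           for name, tp in hints.items()},
            "required": list(hints),
        },
    }

def get_price(product_name: str) -> dict:
    """Looks up the current price across multiple retailers."""
    ...

schema = schema_from_function(get_price)
# schema["name"] is "get_price"; product_name is typed as "string"
```

The point is that everything the model needs — name, description, parameter names and types — is already sitting in the function definition.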

Registry is a plain dict:

TOOL_REGISTRY = {
    "search_product": search_product,
    "get_price": get_price,
    "get_reviews": get_reviews,
    "get_specs": get_specs,
}

TOOLS = list(TOOL_REGISTRY.values())

Parsing tool calls

Gemma 4 uses a specific token format for tool calls. Parsing them out:

import re

def extract_tool_calls(text: str) -> list[dict]:
    """Parse Gemma 4 tool-call tokens into structured dicts."""
    return [
        {
            "name": name,
            "arguments": {
                k: cast((v1 or v2).strip())
                for k, v1, v2 in re.findall(
                    r'(\w+):(?:<\|"\|>(.*?)<\|"\|>|([^,}]*))', args
                )
            },
        }
        for name, args in re.findall(
            r"<\|tool_call\>call:(\w+)\{(.*?)\}<tool_call\|\>",
            text, re.DOTALL,
        )
    ]

Regex doing heavy lifting, as usual. The cast helper converts string values to ints, floats, or bools where appropriate.
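The cast helper itself isn't shown above; a plausible minimal version, covering the conversions just described, looks like this:

```python
def cast(value: str):
    """Best-effort conversion of a string argument to bool, int, or
    float, falling back to the original string. (Plausible sketch --
    the post doesn't show the original helper.)"""
    v = value.strip()
    if v.lower() in ("true", "false"):
        return v.lower() == "true"
    for convert in (int, float):
        try:
            return convert(v)
        except ValueError:
            pass
    return v
```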

The generation step (the tricky part)

This is where I hit a wall and had to work around it. The naive approach — apply_chat_template(tokenize=True) — crashes on multi-turn conversations that contain assistant tool calls and tool-role messages. The processor tries to iterate through their content looking for images and chokes.

The fix is a two-step approach:

def generate(messages, images=None, max_new_tokens=2048):
    # Step 1: Build text prompt (handles tool schemas, thinking, etc.)
    text_prompt = processor.apply_chat_template(
        messages,
        tools=TOOLS,
        tokenize=False,           # <-- text only, don't tokenize yet
        add_generation_prompt=True,
        enable_thinking=True,
    )

    # Step 2: Encode text + images together
    inputs = processor(
        text=text_prompt,
        images=images if images else None,
        return_tensors="pt",
    ).to(model.device)
    input_len = inputs["input_ids"].shape[-1]  # prompt length, sliced off below

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=1.0,
            top_p=0.95,
            top_k=64,
            do_sample=True,
        )
    return processor.decode(outputs[0][input_len:], skip_special_tokens=False)

tokenize=False first to get the text prompt, then feed text + images into the processor separately. Not obvious, but it works reliably across all turns.

The agent loop

The main loop orchestrates everything:

def run_agent(image_sources, user_query, max_turns=5, verbose=True):
    pil_images = [load_image(src) for src in image_sources]

    # Build initial message with image(s) + text
    user_content = [{"type": "image", "image": img} for img in pil_images]
    user_content.append({"type": "text", "text": user_query})

    messages = [
        {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
        {"role": "user", "content": user_content},
    ]

    for turn_num in range(1, max_turns + 1):
        raw_output = generate(messages, images=pil_images)
        parsed = processor.parse_response(raw_output)

        raw_tool_calls = extract_tool_calls(raw_output)

        if raw_tool_calls:
            results = execute_tool_calls(raw_tool_calls)

            # Append assistant tool calls + tool results to conversation
            messages.append({
                "role": "assistant",
                "tool_calls": [
                    {"type": "function", "function": c}
                    for c in raw_tool_calls
                ],
            })
            for call, result in zip(raw_tool_calls, results):
                messages.append({
                    "role": "tool",
                    "name": call["name"],
                    "content": json.dumps(result["response"]),
                })
            continue

        # No tool calls = final answer
        return {
            "final_answer": parsed.get("content", ""),
            "total_turns": turn_num,
        }

    # Max turns exhausted without a final answer
    return {"final_answer": "", "total_turns": max_turns}

The pattern is clean: generate → check for tool calls → if yes, execute and loop → if no, return the answer. The conversation history accumulates naturally, so each turn has full context of everything that happened before.
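One piece the loop relies on but the post doesn't show is execute_tool_calls. A plausible sketch, dispatching through the TOOL_REGISTRY dict from earlier (the stub tool here stands in for the real ones):

```python
def get_price(product_name: str) -> dict:
    """Stub standing in for the real tool from the registry section."""
    return {"product_name": product_name, "price_usd": 278.0}

TOOL_REGISTRY = {"get_price": get_price}

def execute_tool_calls(tool_calls: list[dict]) -> list[dict]:
    """Run each parsed tool call against the registry, collecting
    results in order. Errors are returned as data rather than raised,
    so the model sees the failure and can react on the next turn."""
    results = []
    for call in tool_calls:
        func = TOOL_REGISTRY.get(call["name"])
        if func is None:
            results.append({"name": call["name"],
                            "response": {"error": f"unknown tool: {call['name']}"}})
            continue
        try:
            response = func(**call["arguments"])
        except Exception as exc:  # surface tool failures to the model
            response = {"error": str(exc)}
        results.append({"name": call["name"], "response": response})
    return results
```

Returning errors as tool output instead of raising keeps the loop alive when the model passes a bad argument — it just gets the error string back as a tool message.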

The system prompt

This is where you shape the agent’s behavior:

SYSTEM_PROMPT = """You are a Product Analyst Agent. When given product images, you:

1. FIRST: Look at the image carefully, identify the product, and call search_product.
2. THEN: Call get_price, get_reviews, and get_specs together in a single turn.
3. FINALLY: Produce a structured analysis report in markdown.

Be efficient: batch independent tool calls into one turn.
Do NOT make up prices or specs -- always use the tools."""

That last line is doing real work. Without it, the model occasionally skips tool calls and hallucinates specs from training data. With it, I haven’t seen a single hallucinated spec.

What happens in practice

Show it a photo of Sony WH-1000XM4 headphones:

Turn 1: Thinking... "I can see these are over-ear headphones, they look like Sony..."
        → search_product(query='Sony noise canceling over-ear headphones black')

Turn 2: Got search results (Sony WH-1000XM4, 0.95 confidence)
        → get_price(product_name='Sony WH-1000XM4')
        → get_reviews(product_name='Sony WH-1000XM4')
        → get_specs(product_name='Sony WH-1000XM4')

Turn 3: Final report with price comparison table, review summary, specs breakdown

Three turns. The model identified the product from the image alone, batched the data-gathering calls, and wrote a structured report. The thinking traces are genuinely useful — you can see it reasoning about what it sees in the image before making decisions.

Multi-image comparison

The same agent handles comparisons without any code changes. Show it two headphones and ask “which is the better value?”:

result = run_agent(
    image_sources=["images/sony_xm4.jpg", "images/beats_solo3.jpg"],
    user_query=(
        "Compare these two products. For each one: identify it, "
        "look up pricing, reviews, and specs. Then give me a "
        "side-by-side comparison and tell me which is the better value."
    ),
    max_turns=5,
)

It identifies both products, calls tools for each (batching where possible), and generates a comparison table. The thinking traces show it actually weighing trade-offs — battery life vs. noise cancellation, price per feature.

Hardware reality check

This runs on Gemma 4 31B-IT in bfloat16. That’s ~65GB VRAM. A100 80GB, H100, or the new RTX PRO 6000 Blackwell work. I tried 4-bit quantization early on and the vision encoder fell apart — images looked like solid gray to the model. bf16 or nothing for this one.

What I learned

Thinking mode changes everything for agents. Without it, the model jumps to tool calls without reasoning about what it sees. With it, you get actual deliberation — “I can see the Sony logo,” “these look like noise-canceling headphones based on the ear cup size.” The quality of tool call arguments goes way up.

Docstrings are your schema. Gemma 4’s processor extracts function schemas from Python’s type hints and docstrings directly. Write good docstrings and you never need to maintain a separate tool definition format.

The tokenize=False workaround is essential for multi-turn tool-using agents. The processor’s apply_chat_template with tokenize=True crashes when the conversation history contains assistant tool calls and tool role messages. Separate text generation from tokenization.

Batching tool calls is emergent. I didn’t explicitly program the model to batch independent calls. The system prompt says “batch independent tool calls into one turn” and it does. Three calls in a single generation step, correctly identifying that price/reviews/specs don’t depend on each other.
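You can sanity-check that batching end to end without a GPU by feeding the parser a synthetic batched output. The parser and cast helper are reproduced from earlier so this snippet runs standalone; the token format is simply what the regex assumes, not an official spec:

```python
import re

def cast(value: str):
    """Best-effort string-to-bool/int/float conversion (sketch)."""
    v = value.strip()
    if v.lower() in ("true", "false"):
        return v.lower() == "true"
    for convert in (int, float):
        try:
            return convert(v)
        except ValueError:
            pass
    return v

def extract_tool_calls(text: str) -> list[dict]:
    """Parse Gemma 4 tool-call tokens into structured dicts."""
    return [
        {
            "name": name,
            "arguments": {
                k: cast((v1 or v2).strip())
                for k, v1, v2 in re.findall(
                    r'(\w+):(?:<\|"\|>(.*?)<\|"\|>|([^,}]*))', args
                )
            },
        }
        for name, args in re.findall(
            r"<\|tool_call\>call:(\w+)\{(.*?)\}<tool_call\|\>",
            text, re.DOTALL,
        )
    ]

# Synthetic model output: three calls emitted in one generation step
raw = (
    '<|tool_call>call:get_price{product_name:<|"|>Sony WH-1000XM4<|"|>}<tool_call|>'
    '<|tool_call>call:get_reviews{product_name:<|"|>Sony WH-1000XM4<|"|>}<tool_call|>'
    '<|tool_call>call:get_specs{product_name:<|"|>Sony WH-1000XM4<|"|>}<tool_call|>'
)
calls = extract_tool_calls(raw)
# calls contains three entries: get_price, get_reviews, get_specs,
# each with arguments {"product_name": "Sony WH-1000XM4"}
```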

The full notebook is on GitHub: Gemma4_31B_Multi_Turn_Visual_Agent.ipynb

#ai #gemma #agents #vision #function-calling #python
About Hemanth HM

Hemanth HM is a Sr. Machine Learning Manager at PayPal, Google Developer Expert, TC39 delegate, FOSS advocate, and community leader with a passion for programming, AI, and open-source contributions.