Building agents with Google Gemini and open source frameworks

The landscape of artificial intelligence is moving at a breakneck pace. We've shifted from models that simply predict text to sophisticated systems that can understand and interact with the world. At the forefront of this evolution is the concept of "AI agents"—autonomous systems that can reason, plan, and execute tasks. Powering these agents requires a state-of-the-art "brain," and this is where Google Gemini enters the picture. As Google's most capable and natively multi-modal model, it offers unprecedented capabilities for reasoning across text, images, code, and more. But a great brain needs a body and tools to interact with its environment. This is where open-source frameworks like LangChain and LlamaIndex shine, providing the essential scaffolding to build robust, production-ready agents. This article provides a comprehensive guide for MLOps engineers, DevOps specialists, and AI developers on how to build powerful agents by combining the intelligence of Google Gemini with the flexibility of these open-source frameworks.

What Are AI Agents and Why Are They a Big Deal?

For years, Large Language Models (LLMs) have been incredible text generators. You give them a prompt, and they provide a coherent, often insightful, response. However, this interaction is largely passive. An AI agent, by contrast, is an active participant. It doesn't just answer; it *acts*.

From Language Models to Action Takers

Think of a standard LLM as a brilliant scholar in a locked room. It has access to vast knowledge but can't affect the outside world. An AI agent is that same scholar given a phone, a computer, and a set of keys.

An agent architecture enhances an LLM by wrapping it in a loop that allows it to:

  • Reason: Analyze a complex, multi-step user request.
  • Plan: Break down the request into a series of actionable steps.
  • Use Tools: Execute actions, such as searching the web, running code, or calling an API.
  • Observe: Analyze the results of its actions.
  • Iterate: Repeat the process until the final goal is achieved.

This "reasoning loop" (often implemented with patterns like ReAct, or "Reason + Act") is what elevates a model from a simple chatbot to a genuine assistant that can book flights, summarize a repository's latest commits, or analyze a complex dataset.

The Core Components of an Agent

While architectures vary, most agents you'll build share a common anatomy:

  1. The Brain (LLM): This is the core reasoning engine. It makes all the decisions. This is where Google Gemini fits in, providing world-class reasoning and, critically, native function calling.
  2. Tools: These are the "hands" of the agent. A tool is simply a function or API that the agent can choose to call. This could be a search_web function, a run_python_script tool, or a get_stock_price API.
  3. The Agent Executor (or Loop): This is the orchestration logic that feeds the user's prompt to the LLM, parses the LLM's response to see if it wants to use a tool, calls the tool if needed, and feeds the tool's output back to the LLM to continue the process.
  4. Memory: This component provides the agent with short-term and long-term context, allowing it to remember previous interactions and tool outputs to inform future decisions.

Understanding Google Gemini: The Multi-Modal Powerhouse

Not all LLMs are created equal, especially when it comes to being an agent's brain. Google Gemini was built from the ground up to be different. Its native multi-modality and sophisticated reasoning make it an ideal choice for agent development.

Why Google Gemini for Agent Development?

  • Native Multi-modality: Gemini wasn't just trained on text. It was pre-trained from the start on a vast dataset of interleaved text, images, audio, and video. This means you can build agents that can *natively* reason about a chart, a code snippet, and a user's text query all in a single prompt.
  • Advanced Reasoning & Planning: Agents live and die by their ability to plan. Gemini models exhibit state-of-the-art performance on reasoning benchmarks, making them less likely to get "stuck" in a loop or fail at a multi-step task.
  • Native Function Calling: This is perhaps the most critical feature for agent development. You can describe your tools (functions) to the Gemini API in a structured format. When you ask it a question, Gemini can intelligently decide *which* tool to call and *what arguments* to pass to it, returning a structured JSON object that your code can easily parse and execute. This drastically simplifies the agent orchestration loop.
  • Long Context Window: With context windows scaling up to 1 million tokens, Gemini agents can maintain context over very long conversations or analyze entire codebases, books, or video transcripts in one go.

Overview of the Gemini API

When working with Google Gemini, you'll primarily interact with a few key models via Google AI Studio or the official SDKs:

  • gemini-pro: The workhorse model, ideal for text-based reasoning, planning, and chat.
  • gemini-pro-vision: The multi-modal variant that accepts both text and image/video inputs. This is your go-to for building agents that can "see."
  • Function Calling: This isn't a separate model but a feature of the API. You pass a list of tools to the generateContent call, and the model responds with a functionCall object when it decides a tool is needed, as sketched below.
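To make this concrete, here is a minimal sketch using the google-generativeai Python SDK. The get_stock_price function is a hypothetical stub, and the exact response-parsing details can vary slightly between SDK versions.

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# A hypothetical tool: the SDK reads the signature and docstring
# to build the function declaration that is sent to the model.
def get_stock_price(ticker: str) -> float:
    """Return the latest price for a stock ticker symbol."""
    return 123.45  # stubbed value for this sketch

model = genai.GenerativeModel("gemini-pro", tools=[get_stock_price])
response = model.generate_content("What is the current price of GOOG?")

# If the model decides a tool is needed, the candidate contains a
# functionCall part instead of plain text.
part = response.candidates[0].content.parts[0]
if part.function_call.name:
    print("Model wants to call:", part.function_call.name)
    print("With arguments:", dict(part.function_call.args))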

The Role of Open-Source Frameworks

You *could* build an agent by making raw HTTP requests to the Gemini API, writing your own tool-parsing logic, and managing state manually. But you'd be reinventing the wheel. This is where frameworks come in.

Why Not Build from Scratch?

Building a robust agent involves a surprising amount of "plumbing":

  • Parsing complex, sometimes finicky, LLM outputs.
  • Managing the loop of (Reason -> Act -> Observe).
  • Handling API errors, retries, and rate limits.
  • Abstracting tool definitions.
  • Managing conversation memory and context.
  • Connecting to and indexing data from dozens of different sources (PDFs, databases, APIs, etc.).

Open-source frameworks handle this plumbing, letting you focus on the *business logic* of your agent.

Meet the Titans: LangChain and LlamaIndex

Two frameworks dominate the LLM application ecosystem:

  • LangChain: A comprehensive, general-purpose framework for "chaining" LLM calls together. It provides a vast library of components for agents, memory, data connection, and more. It's the "Swiss Army knife" for building complex, multi-step applications.
  • LlamaIndex: A data-centric framework optimized for building Retrieval-Augmented Generation (RAG) applications. While it also supports agents, its primary strength is in ingesting, indexing, and querying your private data, especially with multi-modal capabilities.

Both frameworks have excellent, first-class integration with Google Gemini, making it simple to swap it in as the brain for your agents.

Building Your First Agent with Google Gemini and LangChain

Let's build a practical agent that can search the web and perform calculations. We'll use LangChain and Gemini's native function calling capabilities.

Setting Up Your Environment

First, get your API key from Google AI Studio (formerly MakerSuite) and set it as an environment variable.

export GOOGLE_API_KEY="your_api_key_here"

Then, install the necessary Python libraries. We'll also install tavily-python for a simple search tool.

pip install langchain langchain-community langchain-google-genai google-generativeai tavily-python

You will also need a Tavily API key for the search tool, so set that as an environment variable too.

export TAVILY_API_KEY="your_tavily_key_here"

Step 1: Define Your Tools

LangChain makes it easy to define tools. We'll use a pre-built TavilySearchResults tool and create a custom calculator tool using a decorator.

from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_core.tools import tool

# 1. A pre-built tool for web search
# By default, it uses the TAVILY_API_KEY env variable
search_tool = TavilySearchResults(max_results=2)

# 2. A custom tool for calculations
@tool
def simple_calculator(expression: str) -> str:
    """A simple calculator that evaluates a mathematical expression."""
    try:
        # Using eval is generally unsafe, but fine for this controlled demo
        result = eval(expression)
        return f"The result is: {result}"
    except Exception as e:
        return f"Error evaluating expression: {e}"

tools = [search_tool, simple_calculator]

Step 2: Initialize the Google Gemini Model

Next, we initialize the ChatGoogleGenerativeAI model from the langchain-google-genai package. We'll bind our tools to it so the model knows they exist and can call them.

from langchain_google_genai import ChatGoogleGenerativeAI

# Initialize the Google Gemini model
# We use gemini-pro as it's great for function calling
llm = ChatGoogleGenerativeAI(model="gemini-pro")

# Bind the tools to the LLM
# This tells the model what functions it can call
llm_with_tools = llm.bind_tools(tools)

Step 3: Creating the Agent with Function Calling

With Google Gemini, we don't need complex ReAct-style prompt engineering to get it to use tools. We can use its native function calling. LangChain provides helpers to build an agent that leverages this directly.

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.agents.format_scratchpad.tools import format_to_tool_messages
from langchain.agents.output_parsers.tools import ToolsAgentOutputParser

# Define the prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant. You have access to a web search tool and a calculator."),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

# Create the agent by chaining the prompt, LLM, and output parser
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_tool_messages(
            x["intermediate_steps"]
        ),
    }
    | prompt
    | llm_with_tools
    | ToolsAgentOutputParser()
)

Step 4: Running the Agent Executor

Finally, we use LangChain's AgentExecutor to run the agent. The executor is the loop that we discussed earlier: it calls the agent, executes the chosen tools, and feeds the results back to the agent until a final answer is produced.

from langchain.agents import AgentExecutor

# The AgentExecutor runs the loop of:
# 1. Pass user input to agent
# 2. Agent decides to call a tool (or answer)
# 3. If tool call, execute the tool
# 4. Pass tool output back to agent
# 5. Repeat until agent gives a final answer
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Let's run it!
response = agent_executor.invoke(
    {"input": "What is the weather in San Francisco, and what is 12 * 29?"}
)

print(response["output"])

If you run this (with verbose=True), you'll see the agent's thought process. It will first call the Tavily search tool for the weather, then call the simple_calculator tool for the math problem, and finally synthesize both results into a single, coherent answer. This is the power of a Google Gemini-powered agent.

Practical Example: A Multi-Modal RAG Agent with LlamaIndex

Now let's explore the other pillar: multi-modality. We'll build an agent that can answer questions about a folder containing both text files and images, using LlamaIndex and gemini-pro-vision.

The Concept: RAG Meets Multi-Modality

Retrieval-Augmented Generation (RAG) is a technique where an LLM's knowledge is augmented with external data at query time. A multi-modal RAG agent can retrieve *both* text chunks and relevant images to answer a query. For example, "Show me the diagram explaining our new architecture and summarize the text doc that describes it."

Setting Up LlamaIndex with Google Gemini

First, install the required LlamaIndex packages.

pip install llama-index llama-index-llms-gemini llama-index-multi-modal-llms-gemini llama-index-embeddings-gemini llama-index-readers-file

Create a folder named multimodal_data/ and add some sample images (e.g., diagram.png, chart.jpg) and text files (e.g., project_spec.md) to it.

Step 1: Loading Multi-Modal Data

LlamaIndex has powerful data loaders. We'll use the SimpleDirectoryReader to load all files, and it's smart enough to distinguish file types.

from llama_index.core import SimpleDirectoryReader

# Load all files from the directory
# This will load both the images and the text files
documents = SimpleDirectoryReader("./multimodal_data/").load_data()

Step 2: Configuring the Multi-Modal LLM (Gemini Pro Vision)

We need to explicitly configure LlamaIndex to use Google Gemini for text generation (gemini-pro), for embeddings, and for multi-modal reasoning (gemini-pro-vision), so nothing falls back to another provider's defaults.

from llama_index.llms.gemini import Gemini
from llama_index.multi_modal_llms.gemini import GeminiMultiModal
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.core import Settings

# Configure the text LLM and the embedding model globally
Settings.llm = Gemini(model="models/gemini-pro")
Settings.embed_model = GeminiEmbedding(model_name="models/embedding-001")

# The multi-modal LLM is passed to the query engine later
gemini_mm_llm = GeminiMultiModal(model_name="models/gemini-pro-vision")

Step 3: Building the Vector Index

LlamaIndex will handle the complex task of embedding both text chunks and images. We'll use a MultiModalVectorStoreIndex.

import os

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.indices import MultiModalVectorStoreIndex

# Define a persistent storage directory
PERSIST_DIR = "./storage_multimodal"

if not os.path.exists(PERSIST_DIR):
    # If the index doesn't exist, create it.
    # Text chunks use the Gemini embedding model configured above; images are
    # embedded with LlamaIndex's default image embedding model (CLIP), which
    # may require an extra package such as llama-index-embeddings-clip.
    index = MultiModalVectorStoreIndex.from_documents(
        documents,
        show_progress=True,
    )
    # Persist the index
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # If the index exists, load it
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

Step 4: Creating the Query Engine Agent

LlamaIndex abstracts the agent logic into a QueryEngine. We'll create a multi-modal query engine that can handle both text and image inputs/outputs.

# Create a multi-modal query engine from the index
query_engine = index.as_query_engine(multi_modal_llm=gemini_mm_llm)

# Ask a text question about the documents
response_text = query_engine.query("Summarize the project_spec.md file for me.")
print(f"Text Response: {response_text}\n")

# Ask a question that requires reasoning over an image already in the index
# (in a real app, you could also attach a user-supplied image)
response_img = query_engine.query("What is shown in the 'diagram.png' image?")
print(f"Image Response: {response_img}\n")

This agent, powered by gemini-pro-vision, can now seamlessly answer questions that require understanding text, images, or a combination of both. This is a capability that was science fiction just a short time ago, now made accessible via Google Gemini and LlamaIndex.

Advanced Concepts and Best Practices

Building simple agents is just the start. As you move to production, consider these factors.

Agentic Planning and Reasoning (ReAct)

While native function calling is powerful and efficient, some complex tasks require a more explicit "chain of thought." Frameworks like LangChain also support ReAct-style agents for Google Gemini. This involves a more complex prompt that instructs the model to think step-by-step, choose an action, and observe the result, all in a text loop. This can be more robust for highly complex, long-running tasks but is often slower and more expensive than native function calling.
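As a rough sketch, LangChain's create_react_agent helper wires a ReAct prompt to Gemini. This reuses the llm and tools objects from the earlier example and pulls the community "hwchase17/react" prompt from LangChain Hub, which requires the langchainhub package.

from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent

# Pull a standard ReAct prompt (Thought / Action / Observation format)
react_prompt = hub.pull("hwchase17/react")

react_agent = create_react_agent(llm, tools, react_prompt)
react_executor = AgentExecutor(
    agent=react_agent,
    tools=tools,
    verbose=True,
    handle_parsing_errors=True,  # text-based ReAct output can occasionally fail to parse
)

react_executor.invoke({"input": "What is 12 * 29, and what is the capital of France?"})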

Managing Agent State and Memory

A stateless agent is not very useful. Your agent needs to remember the conversation history. Both LangChain and LlamaIndex provide "Memory" modules (e.g., ConversationBufferMemory) that automatically attach the chat history to each call. For Google Gemini's long context, you can use more advanced memory like ConversationSummaryBufferMemory, which summarizes older parts of the conversation to save tokens while preserving context.
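Here is a minimal sketch of ConversationBufferMemory on its own. In a real agent, the executor saves each exchange into this memory and the prompt exposes it through a "chat_history" placeholder so the model sees prior turns.

from langchain.memory import ConversationBufferMemory

# Short-term memory keyed under "chat_history"
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Simulate two turns of a conversation
memory.save_context({"input": "My name is Alice."}, {"output": "Nice to meet you, Alice!"})
memory.save_context({"input": "What is 3 * 7?"}, {"output": "3 * 7 = 21."})

# These are the messages that would be injected into the next prompt
print(memory.load_memory_variables({})["chat_history"])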

Evaluating Your Agents

How do you know your agent is working well? Or that a change didn't break something? Agent evaluation is a critical, complex field. You need to move beyond "it feels right" to rigorous testing. This involves creating a dataset of "golden" questions and answers and evaluating your agent's responses based on criteria like:

  • Correctness: Did it give the right final answer?
  • Tool Use: Did it call the correct tools with the correct arguments?
  • Efficiency: Did it solve the problem in a reasonable number of steps?

Tools like LangChain's LangSmith are built specifically for tracing, debugging, and evaluating complex agentic applications.
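Even without a dedicated platform, a small "golden dataset" harness catches regressions. Here is a minimal, framework-agnostic sketch, assuming the agent_executor from the LangChain example above; the dataset and scoring rule are illustrative only.

golden_dataset = [
    {"input": "What is 12 * 29?", "expected": "348"},
    {"input": "What is 100 / 4?", "expected": "25"},
]

def evaluate(executor, dataset):
    passed = 0
    for case in dataset:
        result = executor.invoke({"input": case["input"]})
        # Naive correctness check: the expected answer appears in the final output
        if case["expected"] in result["output"]:
            passed += 1
    return passed / len(dataset)

print(f"Correctness: {evaluate(agent_executor, golden_dataset):.0%}")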

Frequently Asked Questions

What's the main difference between using LangChain and LlamaIndex with Gemini?
Think of it this way: Use LangChain when your focus is on complex *interactions* and *process flows* (e.g., an agent that calls 5 different APIs and a code interpreter). Use LlamaIndex when your focus is on *data* (e.g., an agent that needs to deeply understand and query your private collection of text, PDFs, and images).
Can Google Gemini agents interact with real-world APIs?
Absolutely. That's the primary purpose of "tools." A tool can be any Python function. You can write a function that makes a POST request to your company's internal JIRA API, a GET request to a weather API, or runs a SQL query against your database. As long as you can represent it in Python, Google Gemini can learn to call it.
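For example, here is a hedged sketch of a weather tool built on the requests library; the endpoint and response fields are hypothetical placeholders for your real service.

import requests
from langchain_core.tools import tool

@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    resp = requests.get(
        "https://api.example.com/weather",  # hypothetical endpoint
        params={"q": city},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    return f"{city}: {data.get('description', 'n/a')}, {data.get('temp_c', '?')} °C"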
How do I handle errors or hallucinations in my Gemini agent?
This is a critical part of production-grade development:

  1. Prompt Engineering: Be very specific in your system prompt about what the agent should and shouldn't do.
  2. Tool Design: Make your tools robust. If a tool fails, it should return a clear error message (e.g., "API timeout," "Invalid stock ticker") that you can feed back to the agent. The agent can then retry or inform the user.
  3. Guardrails: Add a validation layer (another LLM call or a simple regex) to check the agent's output before showing it to the user or executing a dangerous action (like rm -rf /).
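To illustrate the Tool Design point, here is a minimal sketch of a tool that turns failures into clear messages the agent can reason about instead of raising and breaking the loop; the price lookup is a hypothetical stub.

from langchain_core.tools import tool

# Hypothetical price data used for illustration only
PRICES = {"GOOG": 172.5, "MSFT": 410.2}

def fetch_price(ticker: str) -> float:
    if ticker not in PRICES:
        raise KeyError(ticker)
    return PRICES[ticker]

@tool
def get_stock_price(ticker: str) -> str:
    """Return the latest price for a stock ticker."""
    try:
        price = fetch_price(ticker)
        return f"{ticker}: {price} USD"
    except TimeoutError:
        return "Error: API timeout. Retry later or inform the user."
    except KeyError:
        return f"Error: Invalid stock ticker '{ticker}'."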
Is gemini-pro or gemini-1.5-pro better for building agents?
gemini-1.5-pro is generally the superior choice for complex agentic tasks. Its key advantages are a massive context window (up to 1 million tokens) and even stronger reasoning and instruction-following capabilities. This means it can handle much longer conversation histories, analyze entire codebases in one go, and follow complex multi-step plans more reliably. Use gemini-pro for simpler, faster tasks, and gemini-1.5-pro for state-of-the-art, complex agents.
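Swapping the model is typically a one-line change in either framework. A quick sketch follows; the model identifier strings use the Google AI naming and may differ slightly between SDK versions.

from langchain_google_genai import ChatGoogleGenerativeAI
from llama_index.llms.gemini import Gemini

# LangChain: point the chat model at Gemini 1.5 Pro
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro")

# LlamaIndex: the same idea with its Gemini LLM class
llm_li = Gemini(model="models/gemini-1.5-pro")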

Conclusion

We are at an inflection point in AI. The combination of hyper-capable multi-modal models like Google Gemini and robust open-source frameworks like LangChain and LlamaIndex has unlocked the ability to build true AI agents. We've moved beyond simple text-in, text-out systems to creating sophisticated entities that can reason, plan, and interact with digital (and soon, physical) environments to accomplish complex goals.

Whether you're a DevOps engineer automating infrastructure tasks, an MLOps engineer building a self-healing model pipeline, or a developer creating a next-generation application, the tools are now at your fingertips. By mastering the integration of Google Gemini with these powerful frameworks, you are not just building applications; you are building the future of autonomous systems. The only question left is: what will you build first? Thank you for reading huuphan.com!
