LLMs Leave the Notebook: How to Manage AI Agents in Production?


You Opened ChatGPT and Solved the Problem. But What About the Real World?

Playing with LLMs is easy. You write a prompt to ChatGPT and get an answer. But when it comes to integrating it into a real application, that "chat" interface suddenly becomes insufficient. If you're building a customer support bot, just answering isn't enough: it needs to open tickets, connect to databases, and call APIs. This is where AI agents come in.

What I call an AI agent is a system that uses the LLM as a "brain" and gives it tools. The LLM thinks, decides, chooses which tool to use, and delivers the result to you. In this article, I'll explain how to manage these agents in a production environment.

What is an Agent, and How is it Different from a Simple Bot?

Traditional chatbots are usually rule-based or perform simple intent classification. If you say "Show me my account balance," it runs a predefined flow. An AI agent is more dynamic. You tell the LLM, "I have these tools, you use them to fulfill the user's request." The LLM evaluates the situation, asks the user for additional information if necessary, and runs the tools in sequence.

Let me give an example: The user says, "Search for flight tickets from Istanbul to Izmir for next week, show me the cheapest 3 options, and make sure they match my free slots in my calendar." A traditional bot struggles to understand this sentence. An AI agent thinks like this:

  1. Connect to the user's calendar and fetch free slots for next week (Calendar Tool)
  2. Query the flight API (Flight API Tool)
  3. Filter and sort the results (Filter Tool)
  4. Present them to the user in a formatted way (Response Formatter)

You don't code this flow in advance. You introduce the tools to the LLM, and it creates its own plan based on the context.
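To make this concrete, here is a minimal sketch of how such tools might be declared with LangChain's @tool decorator. The function names, signatures, and return values are illustrative placeholders, not a real calendar or flight integration:

# Illustrative tool declarations the LLM can plan with
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def get_free_slots(week_offset: int) -> list:
    """Return the user's free calendar slots for the given week."""
    # A real implementation would call the calendar API here
    return ["2025-06-10 09:00", "2025-06-11 14:00"]

@tool
def search_flights(origin: str, destination: str, date: str) -> list:
    """Query the flight API for flights on the given date."""
    # A real implementation would call the flight API here
    return [{"flight": "XY123", "price": 1450}]

# The LLM only sees these tool schemas and decides which to call, in what order
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.2)
llm_with_tools = llm.bind_tools([get_free_slots, search_flights])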

3 Major Hurdles You'll Face When Moving to Production

1. The Non-Determinism Problem

LLMs are probabilistic. If you give the same prompt 10 times, you might get 10 different answers. This is unacceptable in production. A customer querying their balance shouldn't sometimes get correct and sometimes incorrect results.

Solution: Focus your agent on as narrow a domain as possible. Don't try to build a "general assistant." Define specific tools and prevent the LLM from going outside them. Also, keep the temperature parameter low (between 0.1 and 0.3).
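As a minimal sketch of that configuration (the system prompt and model choice are just examples):

# Low temperature -> more repeatable decisions; a narrow system prompt keeps
# the agent inside its domain
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.2)

SYSTEM_PROMPT = (
    "You are a customer support agent. You may ONLY answer questions about "
    "orders and support tickets, using the tools you are given. If a request "
    "is outside this scope, say you cannot help and offer a human agent."
)

response = llm.invoke([
    SystemMessage(content=SYSTEM_PROMPT),
    HumanMessage(content="What is the status of my ticket?"),
])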

2. Latency and Cost

A simple ChatGPT call takes 2-3 seconds. An agent can make multiple LLM calls (thinking, tool selection, result evaluation), which multiplies the latency. Also, every LLM call costs money: a complex agent flow on GPT-4 can cost around $0.50 per run.

Solution: Use smaller models such as GPT-3.5-turbo or Claude Haiku for non-critical decisions, and only switch to the large model when necessary. Also, set up a caching mechanism and serve similar queries from the cache.

# A simple cache example: identical (query, context) pairs hit the LLM only once
from functools import lru_cache

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.2)

@lru_cache(maxsize=1000)
def cached_agent_response(user_query: str, context: str) -> str:
    # lru_cache keys on the string arguments themselves, so no manual hashing
    # is needed; repeated identical requests are served from memory
    result = llm.invoke(f"Context: {context}\n\nQuestion: {user_query}")
    return result.content
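For the "small model first, big model only when needed" idea, a minimal routing sketch could look like this; the escalation rule here is deliberately crude and only illustrative:

# Route queries to a cheap model first, escalate to the expensive one if needed
from langchain_openai import ChatOpenAI

cheap_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.2)
strong_llm = ChatOpenAI(model="gpt-4", temperature=0.2)

def answer(query: str) -> str:
    draft = cheap_llm.invoke(query).content
    # Illustrative escalation rule: retry with the larger model when the cheap
    # model signals uncertainty
    if "not sure" in draft.lower():
        return strong_llm.invoke(query).content
    return draft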

3. Error Handling and Monitoring

The agent calls a tool and the API returns a 500 error. What happens now? How will you communicate this error to the LLM? The agent must have its own recovery mechanism.

Solution: Set up a retry mechanism for each tool call, for example a maximum of 3 attempts. Also, log every step the agent takes. Use LangSmith or a similar tool to trace how your agent makes decisions.
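A minimal retry wrapper for tool calls might look like this; the attempt count and backoff values are just example numbers:

# Retry a flaky tool call with exponential backoff, then surface the failure
# to the LLM as structured data instead of crashing the whole flow
import time
import logging

logger = logging.getLogger("agent.tools")

def call_tool_with_retry(tool_fn, *args, max_attempts=3, **kwargs):
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool_fn(*args, **kwargs)
            logger.info("tool=%s attempt=%d status=ok", tool_fn.__name__, attempt)
            return result
        except Exception as exc:
            logger.warning("tool=%s attempt=%d error=%s", tool_fn.__name__, attempt, exc)
            if attempt == max_attempts:
                # The agent can read this and decide to escalate or apologize
                return {"error": f"{tool_fn.__name__} failed after {max_attempts} attempts"}
            time.sleep(2 ** attempt)  # exponential backoff: 2s, 4s, ...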

Let's Build a Real Agent Architecture: FastAPI + LangGraph

Let's move from theory to practice. Let's build a simple customer support agent. Our tools: Ticket creation (Zendesk API), Knowledge base search (Elasticsearch), Escalation (Redirect to a human).

I'll use LangGraph because it's ideal for modeling complex agent flows as a state machine. We'll also create our API endpoint with FastAPI.

from fastapi import FastAPI
from langgraph.graph import StateGraph, END
from typing import TypedDict
from langchain_openai import ChatOpenAI

class AgentState(TypedDict):
    user_query: str
    conversation_history: list
    tools_called: list
    final_response: str
    needs_human: bool
    next_step: str  # set by the router node, read by the conditional edge

def route_query(state: AgentState):
    """Have the LLM analyze the query and decide which tool to use"""
    llm = ChatOpenAI(model="gpt-3.5-turbo")
    # Tool selection via prompt engineering
    # ...
    return {"next_step": "search_knowledge_base"}

def search_knowledge_base(state: AgentState):
    """Search in the knowledge base"""
    # Elasticsearch query
    # ...
    return {"final_response": "Most relevant knowledge base article..."}

def create_ticket(state: AgentState):
    """Open a ticket via the Zendesk API"""
    # Zendesk API call
    # ...
    return {"final_response": "A ticket has been created for your request."}

# Create the graph
workflow = StateGraph(AgentState)
workflow.add_node("route", route_query)
workflow.add_node("search_kb", search_knowledge_base)
workflow.add_node("create_ticket", create_ticket)
workflow.set_entry_point("route")
workflow.add_conditional_edges(
    "route",
    lambda x: x["next_step"],
    {"search_knowledge_base": "search_kb", "create_ticket": "create_ticket"}
)
workflow.add_edge("search_kb", END)
workflow.add_edge("create_ticket", END)

app = FastAPI()
agent = workflow.compile()

@app.post("/support-agent")
async def handle_query(query: str):
    initial_state = {
        "user_query": query,
        "conversation_history": [],
        "tools_called": [],
        "final_response": "",
        "needs_human": False
    }
    result = agent.invoke(initial_state)
    return {"response": result["final_response"], "needs_human": result["needs_human"]}

In this architecture, you control every step the agent takes. Each node is a function, and the edges direct the flow. That makes scaling, monitoring, and error handling much easier in production.
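Once the service is running (for example with uvicorn), calling it from a client is a single POST. The host, port, and example query below are placeholders:

# Calling the agent endpoint (assumes the app runs locally on port 8000)
import requests

resp = requests.post(
    "http://localhost:8000/support-agent",
    params={"query": "My last invoice looks wrong, can you check it?"},
)
print(resp.json())  # {"response": "...", "needs_human": false}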

How Do You Test Your Agent?

Writing unit tests is difficult because LLM output isn't deterministic. Instead, use evaluation approaches. For example:

  1. Scenario-based testing: Write 10-20 realistic user scenarios. Run these scenarios before each deployment and record the results.
  2. LLM-as-a-judge: Give the agent's answer to a larger LLM (GPT-4) and ask, "Is this answer correct?" to automate the evaluation (see the sketch after this list).
  3. A/B testing: Direct a small portion of your live traffic to the new agent version. Compare your metrics (resolution rate, customer satisfaction).
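A minimal LLM-as-a-judge sketch, assuming GPT-4 as the judge and a simple PASS/FAIL verdict (the prompt is illustrative):

# LLM-as-a-judge: a stronger model scores the agent's answer automatically
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4", temperature=0)

JUDGE_PROMPT = """You are evaluating a customer support agent.
Question: {question}
Agent answer: {answer}
Reply with only PASS or FAIL: is the answer correct and helpful?"""

def judge_answer(question: str, answer: str) -> bool:
    verdict = judge.invoke(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.content.strip().upper().startswith("PASS")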

Start Now: Deploy Your First Agent to Production

The biggest mistake is starting with a complex agent. Do this instead:

  1. Start simple: Build an agent with only 1-2 tools. For example: "FAQ search" + "Ticket creation".
  2. Human-in-the-loop: Initially, submit all the agent's answers for human approval, and only auto-send answers above a 95% confidence threshold (a minimal sketch of this gate follows the list).
  3. Set up monitoring: LangSmith, OpenTelemetry, or custom logging. Track which tools your agent calls, how often, how long it takes, and how much it costs.
  4. Scale slowly: Increase traffic gradually. Check your metrics with each increase.
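Here is a minimal sketch of the human-in-the-loop gate from step 2. The 95% threshold and the two delivery paths are illustrative, and where the confidence score comes from (a judge model, a classifier, etc.) is up to you:

# Only auto-send answers above the confidence threshold; everything else
# waits in a queue for human approval
CONFIDENCE_THRESHOLD = 0.95

def deliver(answer: str, confidence: float) -> str:
    if confidence >= CONFIDENCE_THRESHOLD:
        return f"AUTO-SENT: {answer}"
    return f"QUEUED FOR HUMAN REVIEW: {answer}"

print(deliver("Your ticket #1234 has been resolved.", 0.97))
print(deliver("I think you should reinstall everything.", 0.60))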

AI agents are one of the most powerful ways to apply LLMs to real-world problems. But running them in production is very different from playing in a notebook. There are engineering challenges like determinism, latency, cost, and reliability.

Start today: If you have an existing chatbot, add a simple tool to it. When you say "Get my recent tickets," have it call the Zendesk API. With this small step, you enter the world of agents. Then gradually increase the complexity.

Remember: The best agent is one that works perfectly in a narrow domain. An agent that tries to do everything does nothing well.