Search with AI four ways: How much AI does your Rails app really need?

Documentation search: When your Rails app needs which approach

Imagine spending an afternoon watching a developer tear out a perfectly functional search feature. They replace their solid Postgres full-text search with a vector database and RAG pipeline because, well, that’s what you’re supposed to do now, right? The new system is slower, costs them $200 a month in OpenAI API calls, and returns worse results for their specific use case.

This keeps happening. The AI hype cycle has convinced developers that every search problem needs embeddings, vector databases, and agentic loops. Sometimes that’s true. Often it’s not.

Let’s build the same feature four different ways and see what each approach actually costs you.

The use case: searching Ruby gem documentation

We’re building a search feature for a documentation site that indexes about 5,000 Ruby gems. Each gem has README content, API documentation, and code examples. Users ask questions like “How do I upload files to S3?” or “What’s the best gem for handling webhooks?”

This is a realistic scale for most Rails apps. Not Google-sized, not trivial. Just normal business software that needs to help users find information.

I’ll show you four implementations, each adding a layer of complexity. We’ll look at the code, measure the actual costs, and figure out when the added complexity pays for itself.

Approach 1: Traditional search with AI summarization

Start with what works. Postgres full-text search has been solving search problems since before your junior devs were born.

# Requires the pg_search gem; the trigram matcher also needs the pg_trgm
# Postgres extension (enable_extension "pg_trgm" in a migration)
class Documentation < ApplicationRecord
  include PgSearch::Model
  
  pg_search_scope :search_content,
    against: {
      title: 'A',          # highest weight
      content: 'B',
      code_examples: 'C'
    },
    using: {
      tsearch: { prefix: true },      # match partial words
      trigram: { threshold: 0.3 }     # tolerate typos
    }
end

class DocumentationSearcher
  def initialize(query)
    @query = query
  end
  
  def search
    results = Documentation.search_content(@query).limit(10)
    
    {
      results: results,
      summary: summarize_results(results)
    }
  end
  
  private
  
  def summarize_results(results)
    return nil if results.empty?
    
    prompt = <<~PROMPT
      User question: #{@query}
      
      Here are the top search results:
      #{format_results(results)}
      
      Provide a concise answer to the user's question based on these results.
      If the results don't contain relevant information, say so.
    PROMPT
    
    client = OpenAI::Client.new
    response = client.chat(
      parameters: {
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
        temperature: 0.3
      }
    )

    response.dig(:choices, 0, :message, :content)
  end
  
  def format_results(results)
    results.map.with_index do |doc, i|
      "#{i + 1}. #{doc.title}\n#{doc.content.truncate(500)}"
    end.join("\n\n")
  end
end

This approach does one database query and one API call. The search uses proven Postgres features: full-text search with ranking, trigram matching for typos, and weighted fields. Then we send the top results to GPT-4o-mini to generate a summary.

Cost per query:

  • Database: ~5ms
  • OpenAI API: ~$0.002 (about 1,000 input tokens, 200 output tokens)
  • Total latency: ~800ms

When this fails:

  • User queries are conceptually different from how docs are written (“async jobs” versus “background processing”)
  • Important information is buried in the middle of long documents
  • You need to combine information from multiple sources

The failure mode is subtle. Traditional search ranks by keyword matching and field weights. When users phrase questions in different terminology than your documentation uses, they get poor results. You can’t fix this with better prompt engineering, because the LLM never sees the relevant documents.
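A toy illustration of that gap. This is not how Postgres ranks results (ts_rank with stemming and weights is smarter), but the underlying problem is the same: when the query and the document share no terms, lexical scoring has nothing to match on.

```ruby
# Crude lexical overlap: count shared words between query and document.
def token_overlap(query, document)
  q = query.downcase.scan(/[a-z]+/)
  d = document.downcase.scan(/[a-z]+/)
  (q & d).length
end

token_overlap("async jobs", "Sidekiq handles background processing")       # => 0
token_overlap("background jobs", "Sidekiq handles background processing")  # => 1
```

The first query is conceptually on target but lexically invisible; no amount of field weighting fixes a score of zero.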

Approach 2: Basic RAG with vector embeddings

RAG (Retrieval Augmented Generation) means embedding your documents as vectors, embedding the user’s query as a vector, and finding documents with similar embeddings. This solves the terminology mismatch problem.

# We need to store embeddings
# Requires the pgvector extension (and the neighbor gem in the model)
class AddEmbeddingsToDocumentation < ActiveRecord::Migration[7.1]
  def change
    enable_extension "vector"
    add_column :documentations, :embedding, :vector, limit: 1536
    add_index :documentations, :embedding, using: :hnsw, opclass: :vector_cosine_ops
  end
end

class Documentation < ApplicationRecord
  has_neighbors :embedding
  
  # Use saved_change_to_content? here: in an after_save callback,
  # content_changed? is always false because the record has already saved.
  # In production, move this API call into a background job.
  after_save :generate_embedding, if: :saved_change_to_content?
  
  private
  
  def generate_embedding
    text = "#{title}\n\n#{content}"
    
    client = OpenAI::Client.new
    response = client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: text
      }
    )

    update_column(:embedding, response.dig(:data, 0, :embedding))
  end
end

class RagSearcher
  def initialize(query, client: OpenAI::Client.new)
    @query = query
    @client = client
  end

  def search
    query_embedding = generate_embedding(@query)
    results = Documentation.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(5)

    {
      results: results,
      answer: generate_answer(results)
    }
  end

  private

  def generate_embedding(text)
    response = @client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: text
      }
    )

    response.dig(:data, 0, :embedding)
  end

  def generate_answer(results)
    context = results.map { |doc| "#{doc.title}\n#{doc.content}" }.join("\n\n---\n\n")

    prompt = <<~PROMPT
      Answer the user's question based only on the following documentation:

      #{context}

      Question: #{@query}

      If you cannot answer based on the provided documentation, say so clearly.
    PROMPT

    response = @client.chat(
      parameters: {
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
        temperature: 0.3
      }
    )

    response.dig(:choices, 0, :message, :content)
  end
end

Now we’re making two API calls per search: one to embed the query, one to generate the answer. We’re also using pgvector with HNSW indexing for fast similarity search.

Cost per query:

  • Database: ~15ms (vector similarity search)
  • OpenAI embeddings API: ~$0.00001 (negligible)
  • OpenAI chat API: ~$0.003
  • Total latency: ~1,200ms

When this works better: The semantic matching is noticeably better. A query about “background jobs” will match documents about “async processing” and “delayed tasks” because the concepts are similar in vector space. This is a real improvement over keyword search.
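Concretely, “similar in vector space” means high cosine similarity between embedding vectors. The three-dimensional vectors below are invented for illustration; real embeddings from text-embedding-3-small have 1,536 dimensions, but the arithmetic is identical.

```ruby
# Cosine similarity: dot product divided by the product of magnitudes.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (mag.call(a) * mag.call(b))
end

background_jobs = [0.8, 0.1, 0.2] # pretend embedding for "background jobs"
async_tasks     = [0.7, 0.2, 0.3] # pretend embedding for "async processing"
invoicing       = [0.1, 0.9, 0.1] # pretend embedding for "invoice PDFs"

cosine_similarity(background_jobs, async_tasks) # high: close in meaning
cosine_similarity(background_jobs, invoicing)   # low: unrelated concepts
```

The `nearest_neighbors` call with `distance: "cosine"` is doing exactly this comparison, just over the whole table via the HNSW index.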

When this still fails:

  • Complex questions requiring information from many documents
  • Multi-step reasoning (“compare these two approaches”)
  • Questions where the first retrieval doesn’t get the right context

Here’s a concrete failure case I hit: A user asks “What’s the difference between Sidekiq and GoodJob?” The vector search retrieves five documents, but three are about Sidekiq and two are about GoodJob. The LLM tries to compare them but doesn’t have complete information about both systems. It hedges and gives a vague answer.

Approach 3: Agentic RAG with adaptive retrieval

This is where we let the LLM decide if it needs more information. Instead of one retrieve-then-generate pass, we give the LLM tools to search again, rephrase queries, or combine results.

class AgenticRagSearcher
  MAX_ITERATIONS = 3

  def initialize(query, client: OpenAI::Client.new)
    @query = query
    @client = client
    @conversation_history = []
    @retrieved_docs = []
  end
  
  def search
    initial_prompt = <<~PROMPT
      You are a helpful assistant that searches Ruby gem documentation.
      
      User question: #{@query}
      
      You have access to these tools:
      - search_docs(query): Search documentation with a semantic query
      - get_related(doc_id): Get documents related to a specific document
      
      Think step by step. You can search multiple times with different queries
      to gather complete information before answering.
    PROMPT
    
    @conversation_history << { role: "user", content: initial_prompt }
    
    iterations = 0
    
    MAX_ITERATIONS.times do
      iterations += 1
      response = call_llm_with_tools
      
      break if response[:finish_reason] == "stop"
      
      handle_tool_calls(response[:tool_calls]) if response[:tool_calls]
    end
    
    {
      results: @retrieved_docs.uniq,
      answer: @conversation_history.last[:content],
      iterations: iterations # LLM round-trips, not message count
    }
  end
  
  private
  
  def call_llm_with_tools
    response = @client.chat(
      parameters: {
        model: "gpt-4o",
        messages: @conversation_history,
        tools: tool_definitions,
        temperature: 0.3
      }
    )

    message = response.dig(:choices, 0, :message)
    @conversation_history << message

    {
      finish_reason: response.dig(:choices, 0, :finish_reason),
      tool_calls: message[:tool_calls]
    }
  end
  
  def tool_definitions
    [
      {
        type: "function",
        function: {
          name: "search_docs",
          description: "Search documentation using semantic search",
          parameters: {
            type: "object",
            properties: {
              query: {
                type: "string",
                description: "The search query"
              }
            },
            required: ["query"]
          }
        }
      },
      {
        type: "function",
        function: {
          name: "get_related",
          description: "Get documents related to a specific document",
          parameters: {
            type: "object",
            properties: {
              doc_id: {
                type: "integer",
                description: "The ID of the document"
              }
            },
            required: ["doc_id"]
          }
        }
      }
    ]
  end
  
  def handle_tool_calls(tool_calls)
    results = tool_calls.map do |tool_call|
      function_name = tool_call.dig(:function, :name)
      arguments = JSON.parse(tool_call.dig(:function, :arguments) || "{}")

      result = case function_name
      when "search_docs"
        search_docs(arguments["query"])
      when "get_related"
        get_related_docs(arguments["doc_id"])
      else
        [] # unknown tool name: return nothing rather than nil
      end

      @retrieved_docs.concat(result.to_a)

      {
        role: "tool",
        tool_call_id: tool_call[:id],
        content: format_docs_for_llm(result)
      }
    end

    @conversation_history.concat(results)
  end
  
  def search_docs(query)
    embedding = generate_embedding(query)
    Documentation.nearest_neighbors(:embedding, embedding, distance: "cosine").limit(3)
  end
  
  def get_related_docs(doc_id)
    doc = Documentation.find(doc_id)
    Documentation
      .nearest_neighbors(:embedding, doc.embedding, distance: "cosine")
      .where.not(id: doc_id)
      .limit(3)
  end
  
  def generate_embedding(text)
    response = @client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: text
      }
    )

    response.dig(:data, 0, :embedding)
  end
  
  def format_docs_for_llm(docs)
    docs.map do |doc|
      {
        id: doc.id,
        title: doc.title,
        content: doc.content.truncate(1000)
      }
    end.to_json
  end
end

This is a real step up in complexity. We’re now orchestrating multiple LLM calls with tool use. The LLM can search multiple times, explore related documents, and build up context before answering.

Cost per query:

  • Database: 30-90ms (multiple vector searches)
  • OpenAI embeddings API: $0.00002-0.00006 (2-6 embedding calls)
  • OpenAI chat API: $0.015-0.045 (1-3 LLM calls with larger context)
  • Total latency: 3-8 seconds

Notice the variance. Some queries get answered in one iteration. Complex ones burn through all three, searching with different phrasings and pulling in related documents before answering.

When this works better: That comparison query from before (“What’s the difference between Sidekiq and GoodJob?”) now works great. The LLM searches for “Sidekiq background jobs”, gets those docs, then searches for “GoodJob background jobs”, gets those docs, then synthesizes a real comparison.

Multi-part questions work too. “How do I set up Stripe payments and handle webhooks?” triggers two separate searches that gather comprehensive information.

When this gets expensive: Every query where the LLM decides it needs more information costs you 3-5x more. If your users ask a lot of complex questions, your API bill climbs fast.

The latency is also noticeable. Eight seconds feels slow in a web UI. You need to stream responses or show progress indicators.

Approach 4: Full conversational agent with external tools

Now we’re building a real agent that can search your documentation, browse external sites, and maintain conversation context across multiple turns.

class DocumentationAgent
  def initialize(session_id, client: OpenAI::Client.new)
    @session_id = session_id
    @client = client
    @conversation_history = load_conversation_history
  end
  
  def chat(message)
    @conversation_history << { role: "user", content: message }
    
    loop do
      response = call_llm_with_tools
      
      break if response[:finish_reason] == "stop"
      
      if response[:tool_calls]
        handle_tool_calls(response[:tool_calls])
      else
        break
      end
    end
    
    save_conversation_history
    
    {
      response: @conversation_history.last[:content],
      sources: extract_sources
    }
  end
  
  private

  def call_llm_with_tools
    response = @client.chat(
      parameters: {
        model: "gpt-4o",
        messages: @conversation_history,
        tools: tool_definitions,
        temperature: 0.3
      }
    )

    message = response.dig(:choices, 0, :message)
    @conversation_history << message

    {
      finish_reason: response.dig(:choices, 0, :finish_reason),
      tool_calls: message[:tool_calls]
    }
  end

  def generate_embedding(text)
    response = @client.embeddings(
      parameters: {
        model: "text-embedding-3-small",
        input: text
      }
    )

    response.dig(:data, 0, :embedding)
  end

  def tool_definitions
    [
      {
        type: "function",
        function: {
          name: "search_internal_docs",
          description: "Search our Ruby gem documentation",
          parameters: {
            type: "object",
            properties: {
              query: { type: "string" }
            },
            required: ["query"]
          }
        }
      },
      {
        type: "function",
        function: {
          name: "fetch_external_url",
          description: "Fetch content from an external URL like GitHub or RubyGems.org",
          parameters: {
            type: "object",
            properties: {
              url: { type: "string" }
            },
            required: ["url"]
          }
        }
      },
      {
        type: "function",
        function: {
          name: "search_github",
          description: "Search GitHub repositories and code",
          parameters: {
            type: "object",
            properties: {
              query: { type: "string" }
            },
            required: ["query"]
          }
        }
      }
    ]
  end
  
  def handle_tool_calls(tool_calls)
    results = tool_calls.map do |tool_call|
      function_name = tool_call.dig(:function, :name)
      arguments = JSON.parse(tool_call.dig(:function, :arguments) || "{}")

      result = case function_name
      when "search_internal_docs"
        search_internal_docs(arguments["query"])
      when "fetch_external_url"
        fetch_external_url(arguments["url"])
      when "search_github"
        search_github(arguments["query"])
      end

      {
        role: "tool",
        tool_call_id: tool_call[:id],
        content: result.to_json
      }
    end

    @conversation_history.concat(results)
  end
  
  def search_internal_docs(query)
    embedding = generate_embedding(query)
    docs = Documentation.nearest_neighbors(:embedding, embedding, distance: "cosine").limit(5)

    docs.map { |d| { title: d.title, content: d.content.truncate(800), source: "internal", id: d.id } }
  end
  
  def fetch_external_url(url)
    # Uses the http gem with a timeout. extract_main_content is not shown;
    # in production you might strip HTML boilerplate with Nokogiri or
    # ruby-readability, plus add retries and an allowlist of domains.
    response = HTTP.timeout(5).get(url)
    
    {
      url: url,
      content: extract_main_content(response.body.to_s).truncate(2000),
      source: "external"
    }
  rescue HTTP::Error => e
    { error: "Failed to fetch URL: #{e.message}" }
  end
  
  def search_github(query)
    # Use Octokit or similar
    client = Octokit::Client.new(access_token: ENV['GITHUB_TOKEN'])
    results = client.search_code(query, language: "ruby")
    
    results.items.first(3).map do |item|
      {
        name: item.name,
        repo: item.repository.full_name,
        url: item.html_url,
        source: "github"
      }
    end
  rescue Octokit::Error => e
    { error: "GitHub search failed: #{e.message}" }
  end
  
  def load_conversation_history
    cache_key = "agent_conversation:#{@session_id}"
    # symbolize_names keeps the keys consistent with the symbol-keyed
    # messages this class builds; without it, msg[:role] lookups fail
    JSON.parse(Rails.cache.read(cache_key) || "[]", symbolize_names: true)
  end
  
  def save_conversation_history
    cache_key = "agent_conversation:#{@session_id}"
    # Keep the last 10 messages to control context size. Note that a naive
    # slice can separate a tool message from its assistant tool_call, which
    # the API rejects; in production, trim on conversation-turn boundaries.
    trimmed_history = @conversation_history.last(10)
    Rails.cache.write(cache_key, trimmed_history.to_json, expires_in: 1.hour)
  end
  
  def extract_sources
    @conversation_history
      .select { |msg| msg[:role] == "tool" }
      .flat_map { |msg| JSON.parse(msg[:content] || "[]") }
      .select { |item| item.is_a?(Hash) && item["source"] }
      .uniq { |item| item["id"] || item["url"] || item["title"] }
  end
end

This is a different beast. We’re maintaining conversation state, hitting external APIs, and letting the LLM orchestrate complex research tasks.

Cost per conversation turn:

  • Database: 15-50ms
  • External API calls: 200-2000ms (GitHub, external sites)
  • OpenAI embeddings: $0.00001-0.00005
  • OpenAI chat: $0.02-0.15 (larger context windows, multiple turns)
  • Total latency: 5-15 seconds

When this is worth it: You’re building a research assistant or technical support bot where users have complex, multi-turn conversations. They ask follow-up questions, need you to check external sources, and expect the system to remember context.

A user might ask “What’s the best gem for image processing?”, then follow up with “Show me examples from the ImageMagick wrapper”, then “Is there a more modern alternative?” The agent maintains context and can search different sources for each question.

When this is overkill: Most search features. If users are doing one-off queries and moving on, you’re paying for conversational capabilities they don’t need.

Choosing your approach

I’ve built all four of these systems in production. Here’s how I decide which to use.

Start with enhanced traditional search if:

  • You have well-written documentation with consistent terminology
  • Queries are mostly straightforward lookup tasks
  • You need predictable costs and latency
  • Your document corpus is under 10,000 items

The cost difference matters. At 1,000 queries per day, enhanced traditional search costs you $2/day. Basic RAG costs $3/day. Agentic RAG costs $15-45/day. A full agent runs $20-150 per 1,000 conversation turns, and conversations take several turns each, so budget $100-300/day.

Move to basic RAG when:

  • Users phrase questions differently than your docs
  • Keyword search returns poor matches for valid queries
  • You have good quality source documents
  • Your corpus is large enough that keyword search becomes unwieldy (50,000+ documents)

You’ll know you need this when users complain that search doesn’t work, and you look at their queries and think “we have docs about that, but they’re using different words.”

Move to agentic RAG when:

  • Users ask complex questions requiring multiple sources
  • You see patterns of users doing multiple searches in sequence
  • Simple RAG returns incomplete answers
  • You have budget for 3-5x higher API costs

Watch your analytics. If users do three searches in a row and then give up, they’re manually doing what an agentic system would do automatically.

Build a full agent when:

  • You’re building a product feature, not just search
  • Users need multi-turn conversations with context
  • You need to integrate external data sources
  • You have engineering resources for proper tool integration and error handling

The engineering complexity here is significant. You need proper timeout handling, retry logic, conversation state management, and graceful degradation when external APIs fail. This is a feature, not a quick enhancement.
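As a taste of what that retry logic involves, here is a minimal exponential-backoff wrapper in plain Ruby. The defaults are invented; in production you would add jitter, cap total elapsed time, and likely reach for a gem such as retriable.

```ruby
# Retry a block up to max_attempts times, doubling the delay each time.
def with_retries(max_attempts: 3, base_delay: 0.5)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError
    raise if attempts >= max_attempts
    sleep(base_delay * (2**(attempts - 1))) # 0.5s, 1s, 2s, ...
    retry
  end
end

# Usage (hypothetical):
# with_retries { client.chat(parameters: { model: "gpt-4o", messages: messages }) }
```

Rescuing StandardError is deliberately broad for the sketch; real code should retry only transient errors (timeouts, rate limits) and re-raise the rest immediately.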

The implementation details that matter

Some practical considerations that aren’t obvious from the code samples.

Chunking strategy for vector search: Don’t just embed entire documents. Break them into logical chunks. For documentation, consider chunking by section with overlap:

class DocumentationChunker
  CHUNK_SIZE = 1000 # characters
  OVERLAP = 200
  
  def chunk(document)
    sections = document.content.split(/^##\s+/)
    
    sections.flat_map do |section|
      break_into_overlapping_chunks(section)
    end
  end
  
  private
  
  def break_into_overlapping_chunks(text)
    chunks = []
    start = 0
    
    while start < text.length
      chunk_end = start + CHUNK_SIZE
      chunks << text[start...chunk_end]
      start += CHUNK_SIZE - OVERLAP
    end
    
    chunks
  end
end

This means each document generates multiple rows in your database with different embeddings. Your vector search returns chunks, not whole documents.

Hybrid search combines the best of both:

def hybrid_search(query)
  # Keyword search results
  keyword_results = Documentation.search_content(query).limit(20)
  
  # Vector search results
  query_embedding = generate_embedding(query)
  vector_results = Documentation.nearest_neighbors(:embedding, query_embedding, distance: "cosine").limit(20)
  
  # Combine with reciprocal rank fusion
  combine_results(keyword_results, vector_results)
end

def combine_results(keyword_results, vector_results)
  scores = Hash.new(0)
  
  keyword_results.each_with_index do |doc, i|
    scores[doc.id] += 1.0 / (i + 60)
  end
  
  vector_results.each_with_index do |doc, i|
    scores[doc.id] += 1.0 / (i + 60)
  end
  
  Documentation.where(id: scores.keys).sort_by { |doc| -scores[doc.id] }
end

This gives you semantic understanding from vectors plus precise keyword matching. The reciprocal rank fusion formula is from research on combining search results. It works better than naive score addition.
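To see why rank fusion beats naive score addition, here is the same formula in plain Ruby over made-up ranked ID lists: an item that ranks well in both lists beats an item that tops only one.

```ruby
# Reciprocal rank fusion over any number of ranked ID lists.
# k = 60 is the constant from the original RRF paper.
def rrf(*ranked_lists, k: 60)
  scores = Hash.new(0.0)
  ranked_lists.each do |list|
    list.each_with_index { |id, rank| scores[id] += 1.0 / (rank + k) }
  end
  scores.sort_by { |_, score| -score }.map(&:first)
end

keyword = [:sidekiq, :resque, :goodjob]       # pretend keyword ranking
vector  = [:goodjob, :sidekiq, :solid_queue]  # pretend vector ranking

rrf(keyword, vector) # => [:sidekiq, :goodjob, :resque, :solid_queue]
```

Sidekiq (ranked 1st and 2nd) edges out GoodJob (3rd and 1st), and both beat the items that appear in only one list. Raw similarity scores and ts_rank scores live on incompatible scales, which is exactly why RRF uses ranks instead.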

Caching saves you money:

class CachedRagSearcher
  def search(query)
    cache_key = "rag_search:#{Digest::MD5.hexdigest(query)}"
    
    Rails.cache.fetch(cache_key, expires_in: 1.hour) do
      perform_search(query)
    end
  end
end

Popular queries get asked repeatedly. Cache the embeddings and the LLM responses. This cuts your API costs dramatically for common questions.

The above caching method uses the raw query as the cache key. If users ask the same question with slightly different wording, they won’t hit the cache. You might want to normalize queries before hashing them for the cache key. For example, you could lowercase the query, remove stop words, etc.
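One possible normalization, sketched below. The stop-word list and tokenization are stand-ins; tune both for the queries you actually see in your logs.

```ruby
require "digest"

# Tiny stand-in stop-word list; a real one would be longer.
STOP_WORDS = %w[a an the how do i to for in of].freeze

# Lowercase, strip punctuation, drop stop words, and sort tokens so
# word order doesn't change the cache key.
def normalized_cache_key(query)
  tokens = query.downcase.scan(/[a-z0-9]+/) - STOP_WORDS
  "rag_search:#{Digest::MD5.hexdigest(tokens.sort.join(' '))}"
end

normalized_cache_key("How do I upload files to S3?") ==
  normalized_cache_key("upload files S3") # => true
```

The trade-off: the more aggressively you normalize, the more distinct questions collapse onto one cached answer, so keep the expiry short enough that a slightly-wrong cached answer doesn’t linger.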

Monitor your failure modes:

class SearchMetrics
  # `response` is the hash a searcher returns, extended with timing data,
  # e.g. { results: [...], latency_ms: 812, ... }
  def self.track(query, approach, response)
    SearchLog.create!(
      query: query,
      approach: approach,
      result_count: response[:results].length,
      latency_ms: response[:latency_ms],
      cost_cents: calculate_cost(response),
      user_clicked: false # updated when the user clicks a result
    )
  end
end

Track which results users actually click. If they click the first result, your search works. If they reformulate their query three times, it doesn’t. This data tells you whether to upgrade your approach.
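The “reformulated three times and gave up” signal can be computed straight from that log. A toy sketch over in-memory events, assuming you group SearchLog rows by session (the event shape here is hypothetical):

```ruby
# A session looks frustrated when it contains several searches and no clicks.
def frustrated_session?(events, search_threshold: 3)
  searches = events.count { |e| e[:type] == :search }
  clicks   = events.count { |e| e[:type] == :click }
  searches >= search_threshold && clicks.zero?
end

events = [
  { type: :search, query: "upload files s3" },
  { type: :search, query: "s3 file upload gem" },
  { type: :search, query: "aws sdk upload" }
]

frustrated_session?(events) # => true
```

The rate of frustrated sessions per week is a far better upgrade trigger than any benchmark: when it climbs, users are manually doing the multi-query work an agentic system would automate.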

What I actually recommend

Build the simplest thing first. Most Rails apps should start with Postgres full-text search plus GPT summarization. It costs almost nothing, has predictable latency, and works fine for straightforward queries.

Add instrumentation immediately. Track user behavior, measure latency, and watch your API costs. You need this data to know if upgrading is worth it.

When you see concrete evidence that simple search fails for your use case, add vector embeddings. This is a real improvement for semantic search. The pgvector extension makes this straightforward in Postgres. You don’t need a separate vector database until you have millions of documents.

Only add agentic features when you can point to specific query patterns that need them. “Users ask comparison questions and we don’t have comparison docs” is a good reason. “Agents are cool and I want to try them” is not.

Save full conversational agents for when you’re building a product feature that needs conversations. This is engineering work, not just adding a library. Budget for it appropriately.

The hype cycle pushes developers toward complexity. Resist it. Your users don’t care about your architecture. They care about getting answers quickly and cheaply. Often the simplest approach gives them exactly that.


But does the code work? See this repo for an implementation of all four approaches.


Disclaimer: The code samples in this post are simplified for clarity. In production, you need proper error handling, timeouts, retries, and security considerations (e.g., sanitizing user input before embedding). The costs and latencies mentioned are estimates based on recent OpenAI pricing and typical response times; your actual costs may vary based on usage patterns and model choices. Always monitor your API usage and costs when deploying AI features. Void where prohibited.