TestGenAI: Building a Ruby CLI that writes your missing tests

Creating a gem that writes your missing tests

No new project survives contact with the real world unscathed. We built TestGenAI, ran it on itself, and it worked well. Then we ran it on another codebase, and two things broke immediately. The fixes turned out to be just as interesting as the original build.

This is a walkthrough of how the tool works and what we learned when we took it outside the greenhouse.

The code here is from TestGenAI, a working Ruby CLI gem you can install and run against your own codebase.

The pipeline

The pipeline has five stages:

  1. Scan your codebase to find classes and methods without test coverage
  2. Build context for each untested method
  3. Generate tests using an LLM with the mechanically curated context
  4. Validate that the generated tests run and pass
  5. Collect the results

Each stage needs to be reliable enough that you can walk away and trust the process to complete. That means handling errors gracefully, providing clear output about what happened, and making it easy to pick up where things left off if something breaks.

Finding untested code

Before you can generate tests, you need to know what needs testing. The right approach depends on whether SimpleCov is available in the project.

If SimpleCov is set up, TestGenAI runs your test suite with COVERAGE=true, reads the resulting coverage/.resultset.json, and uses AST parsing to find methods where every executable line has zero hits. This scanner handles partially-tested files correctly. It reports individual methods that were never exercised, even if other methods in the same file have full coverage.
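To make the coverage-reading step concrete, here is a minimal sketch of loading per-line hit counts from the resultset. It is not TestGenAI's actual code: the names load_line_hits and merge_hits are hypothetical, and summing hits across multiple test runs in the same file is our assumption. It accepts both the newer layout (an array under a "lines" key) and the older bare-array layout.

```ruby
require "json"

# Hypothetical helper: returns { "/path/to/file.rb" => [1, nil, 0, ...] }.
# Each array index is a line (0-based); nil means "not executable".
def load_line_hits(resultset_json)
  merged = {}
  JSON.parse(resultset_json).each_value do |run|
    run.fetch("coverage", {}).each do |path, data|
      # Newer SimpleCov nests the array under "lines"; older versions store it directly.
      lines = Array(data.is_a?(Hash) ? data["lines"] : data)
      merged[path] = merge_hits(merged[path], lines)
    end
  end
  merged
end

# Assumption: when several runs cover the same file, add their hit counts.
def merge_hits(existing, incoming)
  return incoming if existing.nil?

  existing.zip(incoming).map do |a, b|
    a.nil? && b.nil? ? nil : a.to_i + b.to_i
  end
end
```

In a real project you would call it as load_line_hits(File.read("coverage/.resultset.json")).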

If SimpleCov isn’t available, the scanner falls back to checking whether a spec or test file exists for each source file. This approach is less accurate. A file tested only through integration tests or through specs for its subclasses will appear fully untested even if its methods are exercised constantly. The SimpleCov scanner is worth setting up.
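The fallback check can be sketched in a few lines. The path mapping below (strip a leading lib/ or app/, then look under spec/ and test/) is an assumption for illustration, and candidate_test_paths and untested_file? are hypothetical names, not the gem's API.

```ruby
# Hypothetical mapping from a source file to the test files that might cover it.
def candidate_test_paths(source_path)
  base = source_path.sub(%r{\A(lib|app)/}, "").sub(/\.rb\z/, "")
  ["spec/#{base}_spec.rb", "test/#{base}_test.rb"]
end

# A file counts as untested when none of the candidate test files exist.
def untested_file?(source_path, root: Dir.pwd)
  candidate_test_paths(source_path).none? do |path|
    File.exist?(File.join(root, path))
  end
end
```

This makes the weakness visible: the check only knows about file names, so code exercised through another file's tests still shows up as untested.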

Both scanners share the same underlying logic for locating methods in source files, which brings up something worth explaining.

Walking the AST

To locate methods, TestGenAI parses each Ruby source file into an abstract syntax tree and walks it recursively. The walker looks for :def and :defs nodes (instance and class methods), tracks the current class/module namespace, and records each method’s file, class, name, and line range.
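A walker along those lines might look like the sketch below. It assumes parser-gem style nodes: each node responds to type, children, and loc (with line and last_line), and class, module, def, and defs nodes have the child layout the parser gem documents. MethodRecord and MethodWalkerSketch are hypothetical names, and the sketch ignores cases such as class << self.

```ruby
# One record per method found (hypothetical structure for this sketch).
MethodRecord = Struct.new(:file, :class_name, :name, :kind, :first_line, :last_line)

module MethodWalkerSketch
  module_function

  def walk(node, file:, namespace: [], found: [])
    # Children can be symbols, strings, or nil; only real nodes have children.
    return found unless node.respond_to?(:children)

    case node.type
    when :class, :module
      # First child is the constant being defined; the rest is superclass/body.
      nested = namespace + [const_name(node.children.first)]
      node.children.drop(1).each do |child|
        walk(child, file: file, namespace: nested, found: found)
      end
    when :def, :defs
      # (:def name args body) versus (:defs receiver name args body)
      name = node.type == :def ? node.children[0] : node.children[1]
      kind = node.type == :def ? :instance : :class
      found << MethodRecord.new(file, namespace.join("::"), name, kind,
                                node.loc.line, node.loc.last_line)
    else
      node.children.each do |child|
        walk(child, file: file, namespace: namespace, found: found)
      end
    end
    found
  end

  # Turns (:const scope :Name) into "Scope::Name".
  def const_name(const_node)
    scope, name = const_node.children
    if scope.respond_to?(:type) && scope.type == :const
      "#{const_name(scope)}::#{name}"
    else
      name.to_s
    end
  end
end
```

With the parser gem loaded, you would pass it the result of parsing a file, for example MethodWalkerSketch.walk(ast, file: "lib/payments/processor.rb").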

That line range matters. The SimpleCov scanner uses it to check whether the executable lines in the method all had zero hits. A nil in SimpleCov’s coverage array means a line isn’t executable, like a blank line, a comment, or an end. The scanner filters those out before checking for zeros, so it only flags methods where runnable code was never touched.
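The check itself is small. This is a sketch under one assumption of ours, not a copy of the gem's logic: Ruby's coverage counts the def line as executed when the file is loaded, so the sketch looks only at the lines after it. The name never_exercised? and the skip_def_line option are hypothetical.

```ruby
# line_hits is SimpleCov's per-file array (index 0 is line 1).
# Returns true when the method has executable lines and none of them ran.
def never_exercised?(line_hits, first_line, last_line, skip_def_line: true)
  start = skip_def_line ? first_line + 1 : first_line
  return false if start > last_line # one-line method: nothing left to inspect

  hits = line_hits[(start - 1)..(last_line - 1)] || []
  executable = hits.compact # drop nil entries (blank lines, comments, end)
  executable.any? && executable.all?(&:zero?)
end
```

Passing skip_def_line: false gives the stricter reading, where the def line must also have zero hits.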

The parser compatibility problem

To parse Ruby, the gem relies on the parser gem. On Ruby versions before 3.4, you’d call Parser::CurrentRuby.parse(source) and get back an AST. This worked fine until Ruby 3.4, which made prism its default parser. Using Parser::CurrentRuby with Ruby 3.4 produces warnings, and in some configurations it fails entirely.

The prism project ships a compatibility shim, Prism::Translation::ParserCurrent, that produces the same AST node types as the old parser gem. The AST-walking code works unchanged. The only question is which one to load.

The solution is a small file that runs at load time and sets a constant:

module Testgenai
  if Gem::Version.new(RUBY_VERSION) >= Gem::Version.new("3.4")
    require "prism"
    CurrentParser = Prism::Translation::ParserCurrent
  else
    require "parser/current"
    CurrentParser = Parser::CurrentRuby
  end
end

The rest of the codebase calls CurrentParser.parse(source) and never thinks about which parser is underneath. The pattern (check the version at load time, expose a constant as the abstraction) is a clean way to handle the kind of compatibility gap you’ll run into whenever Ruby ships a significant internal change.

Context, generation, and validation

When you ask an LLM to write tests for a method, you can’t just paste in the method body. It needs the full class, the dependencies that file requires, examples of how the method is called elsewhere in the codebase, and existing test files it can match in style. Context quality is where quick-and-dirty AI test generators fall apart: too little and the tests don’t even load, too much and you hit token limits.

The generator builds a prompt from all of that, sends it to the LLM via the ruby_llm gem (which keeps the generator code provider-agnostic), and strips any markdown fences from the response before passing it to the validator.
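Stripping fences is the kind of step that is easy to get subtly wrong, so here is a minimal sketch. It assumes the model wraps its answer in at most one triple-backtick block, possibly with prose around it; strip_markdown_fences is a hypothetical name. The fence marker is built from a constant only to keep literal fences out of the listing.

```ruby
FENCE_MARK = "`" * 3

# Matches the first fenced block: an opening fence with an optional
# language tag, the body, and a closing fence on its own line.
FENCED_BLOCK = /^#{FENCE_MARK}[\w-]*[ \t]*\r?\n(.*?)^#{FENCE_MARK}[ \t]*$/m

def strip_markdown_fences(response)
  match = response.match(FENCED_BLOCK)
  (match ? match[1] : response).strip
end
```

If the response contains no fence at all, the text is returned as-is apart from surrounding whitespace.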

The validator writes the code to a temp file, runs bundle exec rspec or the Minitest equivalent, and distinguishes between three outcomes: the file failed to load (syntax errors, undefined constants), the tests ran but failed, or the tests passed. Each outcome needs different handling. A file that doesn’t load gets deleted immediately because it’s useless. A file that runs but fails gets its error output fed back to the LLM for a retry.
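One way to sketch the three-way split for RSpec is shown below. Matching on RSpec's "An error occurred while loading" and "occurred outside of examples" messages is our assumption about how to detect a load failure; the real validator may use a different signal. ValidatorSketch and its methods are hypothetical names.

```ruby
require "open3"
require "tempfile"

module ValidatorSketch
  # Assumption: RSpec prints one of these when a spec file cannot be loaded.
  LOAD_ERROR = /An error occurred while loading|errors? occurred outside of examples/

  module_function

  # Maps a finished run to :passed, :load_error, or :failed.
  def classify(success, output)
    return :passed if success

    output.match?(LOAD_ERROR) ? :load_error : :failed
  end

  # Writes the generated code to a temp file and runs it through RSpec.
  def run(code)
    Tempfile.create(["generated", "_spec.rb"]) do |file|
      file.write(code)
      file.flush
      output, status = Open3.capture2e("bundle", "exec", "rspec", file.path)
      [classify(status.success?, output), output]
    end
  end
end
```

Keeping classify separate from run means the decision logic can be tested without shelling out.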

The pipeline retries up to three times, passing failure details back each time. LLMs are reasonably good at fixing specific errors when told what went wrong. Undefined constants and wrong require paths almost always resolve in one retry. More complex failures, like incorrect behavior assumptions, may not, and those end up in a failed bucket for manual review.
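The retry loop can be sketched with the generator and validator passed in as callables. The names GenerationResult and generate_with_retries are hypothetical; the only details taken from the description above are the limit of three attempts and feeding the failure output back into the next attempt.

```ruby
GenerationResult = Struct.new(:status, :code, :attempts)

# generate: callable taking the previous failure output (nil on the first try)
#           and returning test code.
# validate: callable taking code and returning [outcome, output].
def generate_with_retries(generate:, validate:, max_attempts: 3)
  feedback = nil
  code = nil

  max_attempts.times do |index|
    code = generate.call(feedback)
    outcome, output = validate.call(code)
    return GenerationResult.new(:passed, code, index + 1) if outcome == :passed

    feedback = output # tell the model what went wrong on the next attempt
  end

  GenerationResult.new(:failed, code, max_attempts)
end
```

A result with status :failed is what would land in the bucket for manual review.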

Then we ran it on a real project

The first external test run revealed two problems, both on the same day.

The first: generated tests were syntactically valid, ran, and passed, but they looked nothing like the rest of the project’s test suite. Wrong authentication setup, wrong factory usage, helpers that weren’t available. Tests that technically pass but violate project conventions create a maintenance burden.

The second: the tool was silently destroying existing tests. When a spec file already existed at the output path, the pipeline would overwrite it with the newly generated content. Any tests already in that file were gone.

Both problems make complete sense in retrospect. The tool had only ever run on its own codebase, where it was always generating new files and where the conventions were deeply familiar to the model from the context it was seeing. A different project broke both assumptions.

Fixing the conventions gap

The core problem is that the LLM knows what your method does, but it doesn’t know how your team writes tests. It doesn’t know that you authenticate in before blocks a certain way, or that you have specific factory traits available, or that you’re not using rails-controller-testing so assigns isn’t an option.

The fix is a conventions system with two parts.

ConventionsExtractor scans your existing test files and pulls out mechanical facts: the most common authentication setup pattern, available factory traits from your factories directory, frequently stubbed objects, whether transactional fixtures are disabled and how cleanup is handled, and whether specific helpers are unavailable based on what’s in your Gemfile. These aren’t judgments; they’re observations extracted directly from the code.
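Two of those facts are simple enough to sketch. This is an illustration of the kind of extraction involved, not the ConventionsExtractor implementation; factory_traits and most_common_line are hypothetical helpers, and the regular expressions are deliberately naive.

```ruby
# Collects trait names declared in a FactoryBot-style factory file.
def factory_traits(factory_source)
  factory_source.scan(/^\s*trait\s+:(\w+)/).flatten.uniq
end

# Finds the most frequent line matching a pattern across spec sources,
# e.g. the dominant authentication setup line.
def most_common_line(spec_sources, pattern)
  lines = spec_sources.flat_map { |source| source.lines.map(&:strip).grep(pattern) }
  lines.tally.max_by { |_line, count| count }&.first
end
```

Both return plain data (a list of names, a single line of code), which is what makes them safe to hand to the synthesizer as raw facts.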

ConventionsSynthesizer takes those raw facts and sends them to the LLM with a prompt asking it to write a concise conventions guide explaining the rule behind each pattern. The result is plain prose. Something like “authentication is handled in before blocks using session[:user_id] = rather than Devise helpers; use this pattern consistently.” That text gets prepended to every generation prompt.

The synthesized guide is cached to spec/conventions.md and invalidated automatically when spec files or Gemfiles change. Regenerating it costs one LLM call.
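We haven't shown how the invalidation works, so treat the following as one plausible approach and not a description of the gem: hash the spec files and Gemfiles, store the digest next to the cache, and regenerate when it no longer matches. The stamp file and both function names are hypothetical.

```ruby
require "digest"

# Digest over every spec file plus Gemfile and Gemfile.lock under root.
def conventions_fingerprint(root)
  paths = Dir.glob(File.join(root, "spec/**/*_spec.rb")) +
          Dir.glob(File.join(root, "{Gemfile,Gemfile.lock}"))
  digest = Digest::SHA256.new
  paths.sort.each do |path|
    # Include the relative path so renames also change the fingerprint.
    digest << path.delete_prefix(root) << "\0" << File.read(path) << "\0"
  end
  digest.hexdigest
end

# The cached guide is stale when no stamp exists or the stamp differs.
def conventions_stale?(root, stamp_path)
  !File.exist?(stamp_path) ||
    File.read(stamp_path).strip != conventions_fingerprint(root)
end
```

After regenerating the guide, you would write the new fingerprint to the stamp file.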

Enable it with the --conventions flag:

testgenai generate --provider anthropic --model claude-opus-4-7 --conventions

Add spec/conventions.md to your .gitignore. It’s a derived artifact and probably not something to check in.

Fixing the overwrite problem

The overwrite bug was straightforward to diagnose and subtle to fix correctly.

The naive fix would be to skip generation if a spec file already exists. That’s less than ideal. The point is to add tests for untested methods, and partially-tested files are the most common case.

Better behavior is to inject the generated tests into the existing file. The pipeline now reads existing content before doing anything else. It combines existing content with the newly generated code and validates the combined file. If it passes, the combined content is written to the spec file.

If it fails, the pipeline restores the original file exactly as it was and writes the generated-only code to a fallback path. A method in lib/payments/processor.rb that already has a spec/payments/processor_spec.rb would get its fallback at spec/payments/processor_context_spec.rb (using the method name to scope the filename). The failure output tells you where to find it:

  ✗ Payments::Processor#context failed after 3 attempt(s)
    → Generated tests saved to spec/payments/processor_context_spec.rb for manual review

You end up with your original tests intact and the generated attempt sitting somewhere you can look at it and decide what to do.
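The inject-or-restore flow can be sketched as follows. Appending the generated code to the end of the existing file stands in for whatever merge the pipeline performs, and inject_generated and fallback_path are hypothetical names. The validator is passed in as a callable that returns true when the file at the given path passes.

```ruby
require "fileutils"

# spec/payments/processor_spec.rb + "context"
#   becomes spec/payments/processor_context_spec.rb
def fallback_path(spec_path, method_name)
  safe = method_name.to_s.gsub(/\W/, "")
  spec_path.sub(/_spec\.rb\z/, "_#{safe}_spec.rb")
end

# Returns the path where the generated tests ended up.
def inject_generated(spec_path, generated, method_name, validate:)
  original = File.read(spec_path) # read existing content before anything else
  File.write(spec_path, "#{original.chomp}\n\n#{generated}")
  return spec_path if validate.call(spec_path)

  # Combined file failed: restore the original exactly and park the
  # generated code somewhere it can be reviewed by hand.
  File.write(spec_path, original)
  fallback = fallback_path(spec_path, method_name)
  FileUtils.mkdir_p(File.dirname(fallback))
  File.write(fallback, generated)
  fallback
end
```

The important property is that the original content is held in memory until validation succeeds, so a failed attempt can never lose existing tests.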

Running it

Install the gem and point it at your project:

gem install testgenai
cd your_project
testgenai generate --provider anthropic --model claude-opus-4-7

Or add it to your Gemfile in the development group and use bundle exec. Configuration can also come from environment variables:

export TESTGENAI_PROVIDER=anthropic
export TESTGENAI_MODEL=claude-opus-4-7
export ANTHROPIC_API_KEY=your_api_key
testgenai generate --conventions

Two diagnostic commands are available before you commit to a full run:

testgenai scan      # find untested methods without making any API calls
testgenai context   # show what context would be sent to the LLM for each method

scan gives you a picture of your coverage gaps. context is useful for understanding what the LLM will see before spending API credits.

The goal isn’t to replace the developer who understands the code and makes decisions about testing. It’s to handle the mechanical work: setting up describe blocks, wiring test data, writing happy-path coverage. Then you can spend your time on the parts that actually need your judgment. The second project taught us that “mechanical” is more context-dependent than it looks.