Combating false positives in CI

Scope and the problem

If you experience a CI build turning red due to an intermittent failure, such as in a Capybara test, this post is for you. During my last encounter with such issue, the team and I wanted to strengthen the meaning behind the bright red color. Red build should be an actionable failure indicating a real fault in our application.

Source of inaccuracy

In the context of Rails applications, false positive build failures most often occur from Capybara acceptance tests. In it’s implementation, capybara relies on a variety of timeouts and waits, which may or may not delay long enough for the acceptable behaviour to take place. This is even more common when CI runs on slower hardware such as a small EC2 instance.

Step 1: rspec-rerun and the controvercy

The first and most impactful step we took was to use rspec-rerun gem which, in its rake task, automatically reruns failed tests until either they pass or a limit of retries is reached.

Using rspec-rerun is fairly simple: In Gemfile:

group :development, :test do

gem “rspec-rerun” end

a rake task for your CI:

RSPEC_RERUN_RETRY_COUNT=3 rake rspec-rerun:spec
Results:

rspec-rerun immediately lowered the number of false positive build outcomes. The impact of this was huge:

Controvercy

rspec-rerun approach to flaky tests is similar to that of strategies other teams have tried. One other technique is to manually flag tests as flaky and run them in a separate build.

Given enough skill with Capybara, it is possible to argue that most tests can be fixed instead of merely retried. On the other hand, even with skill this costs engineering time and it may be worthwhile to pursue other changes to the system.

A measurable aspect of rspec-rerun and it’s impact on our tests is it’s cost in terms of total build duration. If we were to rerun tests, we wanted at least some view into the impact of false positives on the build duration.

Step 2: Tracking incidence of false positives over time

We wrote a gem to track how often a test shows up as a false positive, via logging rspec data. Here’s how it works:

rspec_log_formatter gem comes as a pair of 2 formatters:

RSpec.configure do |config|
  config.formatters << RspecLogFormatter::Formatter::Factory.new.build
  config.formatters << RspecLogFormatter::AnalyzerFormatter::Factory.new(
    max_reruns: 3, #ENV['RSPEC_RERUN_RETRY_COUNT'] is also a great choice
    builds_to_analyze: 5
  ).build
end

RspecLogFormatter::Formatter appends to a file in your rails directory (rspec.history), storing a history of pass/failure outcomes that can then be analyzed for many purposes.

RspecLogFormatter::AnalyzerFormatter calculates false-positive rates and can report the likelihood of failure along with the estimated duration penalty to the test suite.

Output includes the following table:

Top 6 flakiest examples
  1) Spec 1 -- 50% (cost: 10.3s)
    * RuntimeError
  1) Spec 2 -- 30% (cost: 8.4s)
    * RuntimeError
  1) Spec 3 -- 34% (cost: 6.7s)
    * RuntimeError
  1) Spec 4 -- 21% (cost: 4.6s)
    * RuntimeError
  1) Spec 5 -- 22% (cost: 3.9s)
    * RuntimeError
  1) Spec 6 -- 12% (cost: 1.7s)
    * RuntimeError

Which indicates that the most unreliable test would, on average, slow down the whole suite by 10 seconds due to occasionally being re-ran multiple times.

When to act on this data

What I like about this, is that it collects the data and leaves it completely to your team to act upon it. In our case, we occasionaly looked for most costly examples and prioritized a task to make that example more reliable.

Learn more & respond

The project is on github. Please follow it there for new features, changes.

What do you think of this tool? Do you love it? Do you hate it? Let me know in the comments.