Techblog

Tech Blog

Our latest geek adventures!

Posts Tagged ‘regular expressions’

27 September Performance tweaking of Ruby algorithms

I have been working on request-log-analyzer quite a lot recently. One of the things I focused on was improving the parsing performance: because it parses log files that often are very big, processing times tend to be long. So all savings are very welcome.

Improving the performance of a command line application that does a lot of processing is very different from optimizing the performance of a web application. Request-log-analyzer basically reads a file line by line and processes it, so small performance improvements in the line processing algorithm can really add up when the file has a lot of lines. I used ruby-prof to get information about the performance of our algorithms split out by method to focus my tweaking efforts. I have written down some of my findings below; hopefully they can be helpful.

Parallelization

My first thought for improving performance was parallelization: parse multiple lines at the same time. Unfortunately, this did not yield the results I was hoping for: it instead become slower. Probably, this is because Ruby implements its own in-process threading and thus only uses one core of my processor.

Block overhead

The overhead of using a block should not be underestimated. Consider the following simple change, which improved the performance of request-log-analyzer by 1.5-2% on large log files:

# with block
io.each_line { |line| process_line(line) }
 
# without block
process_line(line) while line = io.gets

Regular expressions

If you’re using complex regular expressions, and you do not expect that every string will match it successfully, it can be beneficial to test the string with a simpler regexp first. For example, request-log-analyzer uses a complex regexp to see if a line in a log file is a “Completed in…” line with the request duration in it:

# Check every line to see if it is a "completed" line and capture the values
if line =~ /Completed in (\d+)ms \((?:View: (\d+), )?DB: (\d+)\) \| (\d\d\d).+\[(http.+)\]/
  # do something with the captured values
end

I improved this by first checking for a superficial regexp that tells me with 99% certainty that the complex regexp will match the line successfully as well:

if line =~ /Completed in/
  if line =~ /Completed in (\d+)ms \((?:View: (\d+), )?DB: (\d+)\) \| (\d\d\d).+\[(http.+)\]/
    # do something with the captured values
  else
    $stderr.puts "#{line.inspect} expected to match 'Completed' regexp!"
  end
end

Depending on the log file, this can increase performance by ~3%. Another benefit of this approach is that it will give feedback on lines that matched the simple regexp, but not the complex one. This information can be used to correct the regular expressions.

Calculate things that do not change only once

Calculate things that do not change only once is easier said than done. For instance, request-log analyzer can aggregate durations in a category. A category can be based on a request field that is parsed from the log, or a Proc that calculates it:

# during parsing:
if categorizer.respond_to?(:call)
  category = categorizer.call(request)
else
  category = request[categorizer]
end
categories[category].aggregate_request(request)

With this implementation, categorizer will be checked to be a Proc for every request, but as the value of categorizer will not change, the result of the check will be constant as well. I solved this my making sure that it is always a Proc beforehand, so the check is no longer necessary during parsing:

# before parsing:
if categorizer.respond_to?(:call)
  categorizer_proc = categorizer
else
  categorizer_proc = lambda { |request| request[categorizer] }
end
 
# during parsing:
category = categorizer_proc.call(request)
categories[category].aggregate_request(request)

Performance gain: ~1%! And because on several instances a similar technique could be applied, the performance got improved by about 4% in total.

Check the most common case first

Consider the following example, which converts a number in any traffic unit (kilobytes, MB, etc) to bytes:

def convert_traffic(value, unit)
  case unit
  when :GB, :G, :gigabyte      then (value.to_f * 1000_000_000).round
  when :MB, :M, :megabyte      then (value.to_f * 1000_000).round
  when :KB, :K, :kilobyte, :kB then (value.to_f * 1000).round
  # ... even more units here
  else value.to_i
  end
end

In most cases, the value will simply given in bytes, which will be returned by the final else after all possibilities have been checked. This can be improved by checking for this possibility first:

# Converts traffic in any unit to bytes.
def convert_traffic(value, unit)
  case unit
  when nil, :B, :byte          then value.to_i
  when :GB, :G, :gigabyte      then (value.to_f * 1000_000_000).round
  when :MB, :M, :megabyte      then (value.to_f * 1000_000).round
  when :KB, :K, :kilobyte, :kB then (value.to_f * 1000).round
  # ... even more units here
  else raise "Unknown unit: #{unit.inspect}!"
  end
end

Again this change adds up to a ~1% performance increase if this method is called very often.

These kinds of changes really improved the performance of request-log-analyzer by quite a bit, so upgrade to the latest version to get some of your valuable time back! :-)

6 Comments - Tags: , , , , , , , , ,