Skip to content

Performance Issue on writing, and trying to fix it #267

@anykeyh

Description

@anykeyh

I open this thread related to what I've posted #245

Because initially the problem was on import/update of a excel file, and in my case on export.

See progress here: https://github.com/anykeyh/rubyXL

The problem

  • Exporting an excel of around ~100k cells take 25 minutes on sidekiq. Gathering the information is good (few seconds), but rendering using RubyXL is not, and take 99% of the time.

After analyzing the situation, the main problem comes from the reusability of the stylesheets, font and border styles.

All comes from this method in ooxml_object.rb:

    def ==(other)
      other.is_a?(self.class) &&
        obtain_class_variable(:@@ooxml_attributes).all? { |k, v| self.send(v[:accessor]) == other.send(v[:accessor]) } &&
        obtain_class_variable(:@@ooxml_child_nodes).all? { |k, v| self.send(v[:accessor]) == other.send(v[:accessor]) }
    end

This check if a style/font/border etc... already exists and return the ID of it for linking in the excel file.
The main problem is this computation is long, as all fields are tested, then all subnodes and all fields of subnodes are tested also...

Theses tests are present in convenience_methods.rb, for example:

    def register_new_font(new_font, old_xf)
      new_xf = old_xf.dup
      # v--- HERE for fonts
      new_xf.font_id = fonts.find_index { |x| x == new_font } # Reuse existing font, if it exists
      new_xf.font_id ||= fonts.size # If this font has never existed before, add it to collection.
      fonts[new_xf.font_id] = new_font
      new_xf.apply_font = true
      new_xf
    end

And performance keep going down as more style exists in your workbook.

Here is an example of excel which take ~8 seconds to generate, and which should not. Ruby-prof give good insight of the problem:

require 'rubyXL'
require 'benchmark'

begin
  require 'ruby-prof'
rescue LoadError
  puts "This example case is using ruby-prof gem..."
  exit
end

def generate_my_excel
  @workbook = RubyXL::Workbook.new
  @workbook.worksheets.clear

  worksheet = @workbook.add_worksheet("TEST")

  x = 0
  y = 0

  bm = Benchmark.realtime do
    200.times do |x|
      worksheet.add_cell(x, y, "#{x}")
      worksheet.sheet_data[x][y].change_border(:top, 'thin')
      worksheet.sheet_data[x][y].change_border(:right, 'thin')
      worksheet.sheet_data[x][y].change_border(:left, 'thin')
      worksheet.sheet_data[x][y].change_border(:bottom, 'medium')
      worksheet.sheet_data[x][y].change_font_bold true
      x+=1
    end
  end
  puts "Time for header: #{bm}ms"

  x = 0
  y = 1

  50.times do |row|
    200.times do |cols|
      x+=1
      value = rand*100
      worksheet.add_cell(x, y, value)

      if value > 0.8
        worksheet.sheet_data[x][y].change_border(:top, 'thin')
        worksheet[x][y].change_fill('00ff00')
      elsif value > 0.5
        worksheet.sheet_data[x][y].change_border(:top, 'thin')
        worksheet[x][y].change_fill('ffff00')
      else
        worksheet.sheet_data[x][y].change_border(:top, 'thin')
        worksheet[x][y].change_fill('ff0000')
      end

    end
    y+=1
    x = 0
  end
end

RubyProf.start
puts "Generation 200x50 excel with style..."
generate_my_excel
result = RubyProf.stop
printer = RubyProf::GraphPrinter.new(result)
printer.print(STDOUT)

output = "write_complex_table.xlsx"
@workbook.write(output)

My idea

I want to implement a test by hash, and an auto hash update on modification of the fields and child nodes.

Eg.

def some_accessor= x
  @hash ^= @some_accessor.hash #Remove the X-OR hash merge of the current version of the object
  @some_accessor = x #Setter
  @hash ^= x.hash # Merge the hash of the object
end

# ...

def hash
  @hash
end

# ...

def ==(other)
      other.is_a?(self.class) && self.hash == other.hash
end

So far I've implemented a naive method which is working more or less, with some bugs on some styles.
The performance are astonishing (30 sec against 20 minutes... lol), but not yet ready for production.

The problem of hash collision also is not taken in consideration. Hash are 64 bit long and chance of collision on a small set like stylesheets and fonts is less than 10^-15. In short, you can win billions of time in the lottery before it happens.
However, I still need to figure out if the hash function is good enough and spread uniformly. For example, an object with only two nil value will have a hash equals to zero (the two nil.hash canceling each others in the xor operator...)

The second problem is when a field/subnode mutate, it should update the hash of the parent, which is not the case now, leading to issues on finding the styles and wrong styles rendering.

If you have any better idea than this hashing method, I would be pleased to hear from you !

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions