I need to parse an XLSX file which is around 25 MB in size (it has about 1 million records). I read through a lot of Node modules, including the ones below:
https://github.com/trevordixon/excel.js
https://github.com/dkiyatkin/node-office
I also tried using the Ruby with Roo
https://github.com/Empact/roo
But they hang. Is there any suggestion for doing this, or do I need to end up splitting the file into multiple small pieces?
While Using "oxcelix" as per "carlosramireziii" suggestion!
" https://github.com/gbiczo/oxcelix "
2.0.0-p247 :001 > require 'oxcelix'
=> true
2.0.0-p247 :002 > s = Oxcelix::Workbook.new("/var/www/fullcontact/current/public/uploads/fileupload/filename/Book1.xlsx")
Killed
root@createresume:/var/www/fullcontact/current/public/uploads# irb
2.0.0-p247 :001 > require 'oxcelix'
=> true
2.0.0-p247 :002 > s = Oxcelix::Workbook.new("/var/www/fullcontact/current/public/uploads/fileupload/filename/Book1.xlsx")
Errno::EEXIST: File exists - /var/www/fullcontact/shared/uploads/tmp
from /usr/local/rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/fileutils.rb:245:in `mkdir'
from /usr/local/rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/fileutils.rb:245:in `fu_mkdir'
from /usr/local/rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/fileutils.rb:174:in `block in mkdir'
from /usr/local/rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/fileutils.rb:173:in `each'
from /usr/local/rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/fileutils.rb:173:in `mkdir'
from /usr/local/rvm/gems/ruby-2.0.0-p247/gems/oxcelix-0.3.2/lib/oxcelix/workbook.rb:52:in `initialize'
from (irb):2:in `new'
from (irb):2
from /usr/local/rvm/rubies/ruby-2.0.0-p247/bin/irb:13:in `<main>'
2.0.0-p247 :003 > exit
root@createresume:/var/www/fullcontact/current/public/uploads# rm -rf tmp/
root@createresume:/var/www/fullcontact/current/public/uploads# irb
2.0.0-p247 :001 > require 'oxcelix'
=> true
2.0.0-p247 :002 > s = Oxcelix::Workbook.new("/var/www/fullcontact/current/public/uploads/fileupload/filename/Book1.xlsx")
Killed
root@createresume:/var/www/fullcontact/current/public/uploads#
Depending on the parsing library you use, your parsing routine might be attempting to turn the entire XLSX file into objects which are then stored in memory. For very large files, this can result in the hanging behavior you are seeing.
One option frequently used to avoid this issue is a SAX parser. Rather than trying to parse the entire file at once, a SAX parser reads the document sequentially, one piece at a time, which avoids the memory explosion of the former approach.
For parsing XLSX documents, you should try the Oxcelix gem for Ruby, which uses a SAX parser under the covers.
https://github.com/gbiczo/oxcelix
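For illustration, here is a minimal sketch of the SAX-style approach itself, using Nokogiri directly (an assumption, not Oxcelix's API; an .xlsx is a zip archive, so the sheet XML, e.g. xl/worksheets/sheet1.xml, would have to be extracted first):
require 'nokogiri'

# Count <row> elements in the sheet XML without building a document tree.
class RowCounter < Nokogiri::XML::SAX::Document
  attr_reader :rows

  def initialize
    @rows = 0
  end

  def start_element(name, attrs = [])
    @rows += 1 if name == 'row'
  end
end

handler = RowCounter.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('sheet1.xml'))
puts handler.rows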
UPDATE:
Unfortunately, while the Oxcelix gem does use SAX parsing under the covers, it then returns the result of the parsing as an array, which, in the case of very large files, will blow up in memory.
If you were able to convert your Excel sheet into XML, then you could make use of any SAX-style parser. In this case, I would recommend this fork of SAXMachine, which allows you to create declarative models and returns them sequentially using the lazy option.
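To illustrate the declarative style, here is a minimal sketch using the standard sax-machine API (the lazy enumeration mentioned above is specific to that fork, and the element names here are assumptions about your converted XML):
require 'sax-machine'

# Hypothetical record layout; adjust the element names to your actual XML.
class Record
  include SAXMachine
  element :name
  element :email
end

class Records
  include SAXMachine
  elements :record, as: :records, class: Record
end

parsed = Records.parse(File.read('records.xml'))
parsed.records.each { |r| puts "#{r.name} <#{r.email}>" }
Note that the stock gem still collects every Record into an array; the lazy option from that fork is what lets you consume the records one at a time instead.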
I had a similar problem with a very large XML file. Performance-wise, it is best to "cut" it down into smaller chunks and process each of them separately.
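One way to apply the same chunked-processing idea (a sketch; it assumes the sheet has first been exported to data.csv, and import_rows is a hypothetical handler):
require 'csv'

# Stream the rows and process them in slices of 10,000 so the whole
# file is never held in memory at once.
CSV.foreach('data.csv', headers: true).each_slice(10_000) do |batch|
  import_rows(batch) # hypothetical handler for one chunk
end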
Related
Mongoid 3.1.6
Rails 3.2.21
MongoDB 2.4.9
We're seeing strange performance issues with find() vs where().first:
$ rails c
2.1.5 :001 > Benchmark.ms { User.find('5091e4beccbce30200000006') }
=> 7.95
2.1.5 :002 > Benchmark.ms { User.find('5091e4beccbce30200000006') }
=> 0.27599999999999997
2.1.5 :003 > Benchmark.ms { User.find('5091e4beccbce30200000006') }
=> 0.215
2.1.5 :004 > exit
$ rails c
2.1.5 :001 > Benchmark.ms { User.where(id: '5091e4beccbce30200000006').first }
=> 7.779999999999999
2.1.5 :002 > Benchmark.ms { User.where(id: '5091e4beccbce30200000006').first }
=> 4.84
2.1.5 :003 > Benchmark.ms { User.where(id: '5091e4beccbce30200000006').first }
=> 5.297
2.1.5 :004 > exit
These both appear to be firing off the same queries. Can someone explain why we're seeing such a huge difference in performance?
Configuration:
production:
  sessions:
    default:
      uri: <%= REDACTED %>
      options:
        consistency: :strong
        safe: true
        max_retries: 1
        retry_interval: 0
  options:
    identity_map_enabled: true
Here is my assumption about why the first call was so much slower than the later ones (I am writing this from a MongoDB point of view and have zero knowledge about Ruby).
The first time you fired the query, the document was not in the working set, which caused the slower performance. On subsequent calls it was already there, so performance was better. If you have a small number of documents, I would find this behavior strange (because I would expect all of them to fit in the working set).
The second part, with $where, surprises me, because I would expect all the numbers to be bigger than with find() (which is not the case for the first call), since:
The $where provides greater flexibility, but requires that the
database processes the JavaScript expression or function for each
document in the collection.
It appears that find uses the identity map, while where does not. If I set identity_map_enabled to false, then the performance of find vs where is identical.
Moral of the story: use find instead of where when possible.
I've heard that the identity map is removed in Mongoid 4.x. So maybe this issue only affects folks on older versions.
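One way to see the identity map at work from the console (a sketch, not from the original answer, assuming Mongoid 3.x with identity_map_enabled: true; the id is the one from the question):
a = User.find('5091e4beccbce30200000006')
b = User.find('5091e4beccbce30200000006')
a.equal?(b) # => true if the second find was served from the identity map

c = User.where(id: '5091e4beccbce30200000006').first
a.equal?(c) # => false if where goes back to the database and builds a new object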
Is there a way to check the size of the Rails cache?
Something in the vein of: Rails.cache.size => 390 MB
I assume there's some slight variation between data stores, but right now I'm not sure how to even start to check the disk space a cache is taking up.
That totally depends on your cache store and the backend you use.
This is an example from my Heroku instance running MemCachier:
Rails.cache.stats
# => {"xxx.memcachier.com:11211"=>{"curr_items"=>"278", "bytes"=>"3423104", "evictions"=>"0", "total_items"=>"7373", "curr_connections"=>"7", "total_connections"=>"97", "cmd_get"=>"141674", "cmd_set"=>"7373", "cmd_delete"=>"350", "cmd_flush"=>"6", "get_hits"=>"63716", "get_misses"=>"77958", "delete_hits"=>"162", "delete_misses"=>"188", "incr_hits"=>"0", "incr_misses"=>"0", "decr_hits"=>"0", "decr_misses"=>"0"}}
FileStore does not have such a method:
Rails.cache.stats
# => NoMethodError: undefined method `stats' for #<ActiveSupport::Cache::FileStore:0x007ff1cbe905b0>
And when running memcached locally, I get a different result set:
Rails.cache.stats
# => {"127.0.0.1:11211"=>{"pid"=>"327", "uptime"=>"517931", "time"=>"1392163858", "version"=>"1.4.16", "libevent"=>"2.0.21-stable", "pointer_size"=>"64", "rusage_user"=>"2.257386", "rusage_system"=>"4.345445", "curr_connections"=>"15", "total_connections"=>"16", "connection_structures"=>"16", "reserved_fds"=>"20", "cmd_get"=>"0", "cmd_set"=>"0", "cmd_flush"=>"0", "cmd_touch"=>"0", "get_hits"=>"0", "get_misses"=>"0", "delete_misses"=>"0", "delete_hits"=>"0", "incr_misses"=>"0", "incr_hits"=>"0", "decr_misses"=>"0", "decr_hits"=>"0", "cas_misses"=>"0", "cas_hits"=>"0", "cas_badval"=>"0", "touch_hits"=>"0", "touch_misses"=>"0", "auth_cmds"=>"0", "auth_errors"=>"0", "bytes_read"=>"48", "bytes_written"=>"30", "limit_maxbytes"=>"67108864", "accepting_conns"=>"1", "listen_disabled_num"=>"0", "threads"=>"4", "conn_yields"=>"0", "hash_power_level"=>"16", "hash_bytes"=>"524288", "hash_is_expanding"=>"0", "malloc_fails"=>"0", "bytes"=>"0", "curr_items"=>"0", "total_items"=>"0", "expired_unfetched"=>"0", "evicted_unfetched"=>"0", "evictions"=>"0", "reclaimed"=>"0"}}
In addition to @phoet's answer, for a Redis cache you can use the following to get a human-readable value:
Rails.cache.stats["used_memory_human"] #=> 178.32M
Here used_memory_human can actually be any key that is returned when running an INFO command on the Redis server.
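Equivalently, if you talk to the Redis server directly (a sketch assuming the redis gem and a default local server):
require 'redis'

redis = Redis.new                # assumes localhost:6379
redis.info['used_memory_human']  # => e.g. "178.32M"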
I'm getting an intermittent error saying that the pseudo-random number generator is not seeded when trying to generate the form authenticity token. I've copied the relevant part of the stack trace below.
Here's what I know/see:
- Restarting Passenger seems to temporarily fix the issue.
- Running the same code from the console works as expected.
- /dev/urandom exists, so it should be possible to seed from it.
- This is happening on Ubuntu 10.04, with OpenSSL 0.9.8k, REE 1.8.7 p253, and Passenger 3.0.3.
- I've read about an issue with Unicorn that sounds somewhat similar and happens when restarting workers, but I haven't seen anything like that described for Passenger.
SessionsController#new (ActionView::TemplateError) "PRNG not seeded"
/usr/local/lib/ruby/1.8/securerandom.rb:53:in `random_bytes'
/usr/local/lib/ruby/1.8/securerandom.rb:53:in `random_bytes'
/usr/local/lib/ruby/1.8/securerandom.rb:105:in `base64'
vendor/bundle/ruby/1.8/gems/actionpack-2.3.14/lib/action_controller/request_forgery_protection.rb:109:in `form_authenticity_token'
(eval):2:in `send'
(eval):2:in `form_authenticity_token'
Pretty stumped. Any help greatly appreciated.
Assuming both /dev/random and /dev/urandom are readable/writable and you are still getting this error, maybe you need to run or install an entropy daemon such as prngd?
Try:
$ sudo /etc/init.d/prngd start
And if that fails, install prngd first:
$ sudo apt-get install prngd
A far-fetched hypothesis. The relevant bit of code should be something like this (possibly not the same version):
def self.random_bytes(n=nil)
  n ||= 16
  if defined? OpenSSL::Random
    @pid = 0 if !defined?(@pid)
    pid = $$
    if @pid != pid
      # Reseed OpenSSL's PRNG when the process id has changed (i.e. after a fork).
      now = Time.now
      ary = [now.to_i, now.usec, @pid, pid]
      OpenSSL::Random.seed(ary.join('.'))
      @pid = pid
    end
    return OpenSSL::Random.random_bytes(n)
  end
  # (the /dev/urandom fallback that follows in the real file is omitted here)
end
We know that OpenSSL gives the error you reported if it is fed with less than 128 bits of entropy, which is 16 bytes in all (or 18–22 bytes, if OpenSSL is clever enough to spot a string of ASCII printable characters and ignore the high bit).
The string used to seed OpenSSL is something like 1342652367.A.0.B; could it be that, sometimes, the pid is small enough and the microseconds value close enough to zero that the resulting entropy falls below the critical threshold?
This should be quite easy to test: replace
ary.join('.')
with
Digest::MD5.hexdigest(ary.join('.'))
in order to have a string of surely 128 (possibly even 256) bits of reasonable unpredictability.
A more definitive check would be to rescue the exception and print out what ary was when the error was triggered.
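Putting those two suggestions together, a self-contained sketch of what such a check might look like (the module and method names here are hypothetical, not the stock securerandom code):
require 'openssl'
require 'digest/md5'

module SecureRandomDebug
  # Reseed with an MD5-hashed (always >= 128-bit) string and surface the
  # raw seed material if OpenSSL still refuses to produce random bytes.
  def self.random_bytes(n = 16)
    now = Time.now
    ary = [now.to_i, now.usec, $$]
    OpenSSL::Random.seed(Digest::MD5.hexdigest(ary.join('.')))
    OpenSSL::Random.random_bytes(n)
  rescue OpenSSL::Random::RandomError => e
    warn "PRNG failure (#{e.message}); seed material was #{ary.inspect}"
    raise
  end
end

SecureRandomDebug.random_bytes(16) # => 16 random bytes, or a logged failure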
One thing I miss from IPython is its ? operator, which digs up the docs for a particular function.
I know Ruby has a similar command-line tool, but it is extremely inconvenient to call it while I am in irb.
Does Ruby/irb have anything similar?
Pry is a Ruby version of IPython; it supports the ? command to look up documentation on methods, but uses a slightly different syntax:
pry(main)> ? File.dirname
From: file.c in Ruby Core (C Method):
Number of lines: 6
visibility: public
signature: dirname()
Returns all components of the filename given in file_name
except the last one. The filename must be formed using forward
slashes ("/") regardless of the separator used on the
local file system.
File.dirname("/home/gumby/work/ruby.rb") #=> "/home/gumby/work"
You can also look up source code with the $ command:
pry(main)> $ File.link
From: file.c in Ruby Core (C Method):
Number of lines: 14
static VALUE
rb_file_s_link(VALUE klass, VALUE from, VALUE to)
{
    rb_secure(2);
    FilePathValue(from);
    FilePathValue(to);
    from = rb_str_encode_ospath(from);
    to = rb_str_encode_ospath(to);
    if (link(StringValueCStr(from), StringValueCStr(to)) < 0) {
        sys_fail2(from, to);
    }
    return INT2FIX(0);
}
See http://pry.github.com for more information :)
You can start with
irb(main):001:0> `ri Object`
Although the output of this is less than readable. You'd need to filter out some metacharacters.
In fact, someone has already made a gem for it:
gem install ori
Then in irb
irb(main):001:0> require 'ori'
=> true
irb(main):002:0> Object.ri
Looking up topics [Object]
= Object < BasicObject
------------------------------------------------------------------------------
= Includes:
Java (from gem activesupport-3.0.9)
(from gem activesupport-3.0.9) [...]
No, it doesn't. Python has docstrings:
def my_method(arg1, arg2):
    """ What's inside this string will be made available as the __doc__ attribute """
    # some code
So, when ? is called from IPython, it probably reads the __doc__ attribute of the object. Ruby doesn't have an equivalent.
When I run a multi-line statement in the Rails 3.0.1 console, pressing Enter doesn't actually run the statement. Instead, it goes to a new console line with the cursor tabbed to the right. Then I have to run a basic line (like p "hey") before the multi-line statement will run.
ruby-1.9.2-p0 > images = Image.all;images.each do |im|; if im.imagestore_width.blank?;im.save;end;
ruby-1.9.2-p0 > p "hey"
I've been doing it like this for a while and it's been working okay. But now I've got a problem in the console that might be related. When I ran the above code, instead of working as it normally does, it just went to a new console line with a ? added:
ruby-1.9.2-p0 > images = Image.all;images.each do |im|; if im.imagestore_width.blank?;im.save;end;
ruby-1.9.2-p0 > p "hey"
ruby-1.9.2-p0 ?>
When it does this, I can't exit the console:
ruby-1.9.2-p0 ?> exit
ruby-1.9.2-p0 ?> ^C
Are these problems related? How can I fix them?
In the line:
images = Image.all;images.each do |im|; if im.imagestore_width.blank?;im.save;end;
You have an end to close the if but not an end to close the do block of the each.
This is why the console is redisplaying the prompt asking for more input before executing your statements.
Try:
images = Image.all;images.each do |im|; if im.imagestore_width.blank?;im.save;end;end
Note that you will see the same behaviour with brackets: irb or the console won't execute until the brackets balance, e.g.:
irb(main):010:0> (3 *
irb(main):011:1* (2 + 1)
irb(main):012:1> )
=> 9
Dunno what's wrong with irb/console, but your Ruby code could look a lot nicer:
images = Image.all.each { |im| im.save if im.imagestore_width.blank? }
The general consensus in Ruby is to use {} rather than do/end for single-line blocks.