I am trying to parse a fairly small (< 100MB) xml file with:
(require '[clojure.data.xml :as xml]
         '[clojure.java.io :as io])

(xml/parse (io/reader "data/small-sample.xml"))
and I am getting an error:
OutOfMemoryError Java heap space
clojure.lang.Numbers.byte_array (Numbers.java:1216)
clojure.tools.nrepl.bencode/read-bytes (bencode.clj:101)
clojure.tools.nrepl.bencode/read-netstring* (bencode.clj:153)
clojure.tools.nrepl.bencode/read-token (bencode.clj:244)
clojure.tools.nrepl.bencode/read-bencode (bencode.clj:254)
clojure.tools.nrepl.bencode/token-seq/fn--3178 (bencode.clj:295)
clojure.core/repeatedly/fn--4705 (core.clj:4642)
clojure.lang.LazySeq.sval (LazySeq.java:42)
clojure.lang.LazySeq.seq (LazySeq.java:60)
clojure.lang.RT.seq (RT.java:484)
clojure.core/seq (core.clj:133)
clojure.core/take-while/fn--4236 (core.clj:2564)
Here is my project.clj:
(defproject dats "0.1.0-SNAPSHOT"
  ...
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.xml "0.0.7"]
                 [criterium "0.4.1"]]
  :jvm-opts ["-Xmx1g"])
I tried setting LEIN_JVM_OPTS and JVM_OPTS in my .bash_profile without success.
When I tried the following project.clj:
(defproject barber "0.1.0-SNAPSHOT"
  ...
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.xml "0.0.7"]
                 [criterium "0.4.1"]]
  :jvm-opts ["-Xms128m"])
I get the following error:
Error occurred during initialization of VM
Incompatible minimum and maximum heap sizes specified
Exception in thread "Thread-5" clojure.lang.ExceptionInfo: Subprocess failed {:exit-code 1}
Any idea how I could increase the heap size for my Leiningen REPL?
Thanks.
Any form evaluated at the top level of the REPL is realized in full, as a result of the print step of the Read-Eval-Print Loop. It is also stored in the heap, so that you can later access it via *1.
If you store the return value as follows:
(def parsed (xml/parse (io/reader "data/small-sample.xml")))
this returns immediately, even for a file hundreds of megabytes in size (I have verified this locally). You can then walk the clojure.data.xml.Element tree that is returned; it is realized incrementally as it is parsed from the input stream.
If you do not hold on to the elements (that is, keep references to them bound somewhere), you can iterate over the entire structure without using more RAM than it takes to hold a single node of the XML tree.
user> (time (def n (xml/parse (clojure.java.io/reader "/home/justin/clojure/ok/data.xml"))))
"Elapsed time: 0.739795 msecs"
#'user/n
user> (time (keys n))
"Elapsed time: 0.025683 msecs"
(:tag :attrs :content)
user> (time (-> n :tag))
"Elapsed time: 0.031224 msecs"
:catalog
user> (time (-> n :attrs))
"Elapsed time: 0.136522 msecs"
{}
user> (time (-> n :content first))
"Elapsed time: 0.095145 msecs"
#clojure.data.xml.Element{:tag :book, :attrs {:id "bk101"}, :content (#clojure.data.xml.Element{:tag :author, :attrs {}, :content ("Gambardella, Matthew")} #clojure.data.xml.Element{:tag :title, :attrs {}, :content ("XML Developer's Guide")} #clojure.data.xml.Element{:tag :genre, :attrs {}, :content ("Computer")} #clojure.data.xml.Element{:tag :price, :attrs {}, :content ("44.95")} #clojure.data.xml.Element{:tag :publish_date, :attrs {}, :content ("2000-10-01")} #clojure.data.xml.Element{:tag :description, :attrs {}, :content ("An in-depth look at creating applications \n with XML.")})}
user> (time (-> n :content count))
"Elapsed time: 48178.512106 msecs"
459000
user> (time (-> n :content count))
"Elapsed time: 86.931114 msecs"
459000
;; redefining n so that we can test the performance without the pre-parsing done when we counted
user> (time (def n (xml/parse (clojure.java.io/reader "/home/justin/clojure/ok/data.xml"))))
"Elapsed time: 0.702885 msecs"
#'user/n
user> (time (doseq [el (take 100 (drop 100 (-> n :content)))] (println (:tag el))))
:book
:book
.... ;; output truncated
"Elapsed time: 26.019374 msecs"
nil
user>
Notice that it is only when I first ask for the count of the content of n (thus forcing the whole file to be parsed) that the huge time delay occurs. If I doseq across subsections of the structure, this happens very quickly.
I don't know lein as well, but in mvn you can do the following:
mvn -Dclojure.vmargs="-d64 -Xmx2G" clojure:nrepl
(I don't think it matters, but I've always seen it with a capital G; is it case-sensitive?)
Pulling 100MB of data into memory should be no problem. I routinely route GB worth of data through my projects.
I always use the server or 64-bit version for large heaps too, and that seems to be what they are doing here:
JVM options using Leiningen
I think the bigger problem, though, is that as you have it written this might be getting evaluated at compile time. You need to wrap that call in a function and defer its execution. I think the compiler is trying to read that file, and that's likely not what you want. I know that with mvn you get different memory settings for compile vs. run, and you might be hitting that too.
Related
I created a model: Lecture(start_time, end_time, location). I want to write a validation function to check whether the new lecture's time overlaps with lectures already saved in the database, so that I can find out if the location is occupied at that time. My function is:
class Lecture < ActiveRecord::Base
  validates :title, :position, presence: true
  validates :start_time, :end_time, format: { with: /([01][0-9]|2[0-3]):([0-5][0-9])/,
                                              message: "Incorrect time format" }
  validate :time_overlap

  def time_overlap
    Lecture.all.each do |user|
      if (user.start_time - end_time) * (start_time - user.end_time) >= 0
        errors.add(:Base, "time overlaps")
      end
    end
  end
end
The error message is: NoMethodError in LecturesController#create
undefined method `-@' for nil:NilClass. How do I write this function in the right format?
Take a look at Ruby 2.3.0's Time class: http://ruby-doc.org/core-2.3.0/Time.html
You can use it to check if a Time instance is before or after another Time instance, such as:
t1 = Time.now
t2 = Time.now
t1 < t2
=> true
t1 > t2
=> false
So, to check if a given time would exist during an existing Lecture in the database, you could write some Ruby to check if the proposed Lecture's start time or finish time sits after the start time AND before the end time of any existing Lectures.
Let's say you have two slots of time:
start_time_a
end_time_a
start_time_b
end_time_b
There are three possible cases where there can be an overlap between the two time slots.
1) start_time_b >= start_time_a && start_time_b <= end_time_a (i.e., slot b starts somewhere in the middle of slot a)
2) end_time_b >= start_time_a && end_time_b <= end_time_a (i.e., slot b ends somewhere within slot a)
3) start_time_b <= start_time_a && end_time_b >= end_time_a (i.e., slot b is larger than slot a and completely covers it)
If you check for these three conditions, you can determine if there's an overlap between two time slots.
Conditions 1 & 2 can be simplified using start_time_b.between?(start_time_a, end_time_a).
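Putting those conditions together, a rough sketch of the validation could look like this (my assumptions: Rails 4+, a no_time_overlap method name, and start_time/end_time values that compare sensibly, i.e. Time columns or zero-padded "HH:MM" strings):
class Lecture < ActiveRecord::Base
  validate :no_time_overlap

  private

  # Two slots overlap when each one starts before the other ends;
  # this single comparison covers all three cases listed above.
  def no_time_overlap
    Lecture.where(location: location).where.not(id: id).each do |other|
      if start_time <= other.end_time && other.start_time <= end_time
        errors.add(:base, "time overlaps with another lecture at this location")
      end
    end
  end
end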
Conditions: have to use a web-based solution (HTML/CSS), have to use Ruby on Rails, no use of a database.
Imagine we have a list of jobs, each represented by a character. Because certain jobs must be done before others, a job may have a dependency on another job. For example, a may depend on b, meaning the final sequence of jobs should place b before a. If a has no dependency, the position of a in the final sequence does not matter. The jobs would be input via a simple text box (also, how does one store multiple variables?).
Given the following job structure:
a =>
b =>
c =>
The result should be a sequence containing all three jobs abc in no significant order.
Given the following job structure:
a =>
b => c
c => f
d => a
e => b
f =>
The result should be a sequence that positions f before c, c before b, b before e and a before d containing all six jobs abcdef.
Given the following job structure:
a =>
b => c
c => f
d => a
e =>
f => b
The result should be an error stating that jobs can’t have circular dependencies.
This should work:
module JobDependenciesResolver
  def self.organize(dependencies)
    unresolved_dependencies = dependencies.dup
    sorted_jobs = []

    until unresolved_dependencies.empty? do
      doable_jobs = get_doable_jobs unresolved_dependencies, sorted_jobs
      raise Exception.new("Not able to resolve dependencies") if doable_jobs.empty?
      sorted_jobs += doable_jobs
      unresolved_dependencies.delete_if { |key, value| sorted_jobs.include? key }
    end
    sorted_jobs
  end

  private

  def self.get_doable_jobs(dependencies, job_array)
    dependencies.select { |job, dependency| ([*dependency] - job_array).empty? }.keys
  end
end
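For example, writing the job structures from the question as a hash (each job mapping to its dependency, or nil when it has none; that input format is my assumption), usage looks like this with Ruby's insertion-ordered hashes:
jobs = { 'a' => nil, 'b' => 'c', 'c' => 'f', 'd' => 'a', 'e' => 'b', 'f' => nil }
JobDependenciesResolver.organize(jobs)
# => ["a", "f", "c", "d", "b", "e"]   (f before c, c before b, b before e, a before d)

circular = { 'a' => nil, 'b' => 'c', 'c' => 'f', 'd' => 'a', 'e' => nil, 'f' => 'b' }
JobDependenciesResolver.organize(circular)
# raises Exception, "Not able to resolve dependencies"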
I am building a Rails backend to an iPhone app.
After profiling my application, I have found the following call to be especially expensive in terms of performance:
@messages.as_json
This call returns about 30 message objects, each including many child records. As you can see, composing the JSON response for a single message may require many DB calls:
def as_json(options={})
  super(:only => [...],
        :include => {
          :user => {...},
          :checkin => {...},
          :likes => {:only => [...],
                     :include => { :user => {...} }},
          :comments => {:only => [...],
                        :include => { :user => {:only => [...]} }}
        },
        :methods => :top_highlight)
end
On average the @messages.as_json call (all 30 objects) takes almost 1100ms.
Wanting to optimize I've employed memcached. With the solution below, when all my message objects are in cache, average response is now 200-300ms. I'm happy with this, but the issue I have is that this has made cache miss scenarios even slower. In cases where nothing is in cache, it now takes over 2000ms to compute.
# Note: @messages has the 30 message objects in it, but none of the child records have been grabbed
@messages.each_with_index do |m, i|
  @messages[i] = Rails.cache.fetch("message/#{m.id}/#{m.updated_at.to_i}") do
    m.as_json
  end
end
I understand that there will have to be some overhead to check the cache for each object. But I'm guessing there is a more efficient way to do it than the way I am now, which is basically serially, one-by-one. Any pointers on making this more efficient?
I believe Rails.cache uses the ActiveSupport::Cache::Store interface, which has a read_multi method for this exact purpose. [1]
I think swapping out fetch for read_multi will improve your performance because ActiveSupport::Cache::MemCacheStore has an optimized implementation of read_multi. [2]
Code
Here's the updated implementation:
keys = @messages.collect { |m| "message/#{m.id}/#{m.updated_at.to_i}" }
hits = Rails.cache.read_multi(*keys)
keys.each_with_index do |key, i|
  if hits.include?(key)
    @messages[i] = hits[key]
  else
    Rails.cache.write(key, @messages[i] = @messages[i].as_json)
  end
end
The cache writes are still performed synchronously with one round trip to the cache for each miss. If you want to cut down on that overhead, look into running background code asynchronously with something like workling.
Be careful that the overhead of starting the asynchronous job is actually less than the overhead of Rails.cache.write before you start expanding your architecture.
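As a rough illustration only (workling's API has changed over time, so this sketch uses an ActiveJob-style worker from Rails 4.2+ instead; the CacheMessageJob name is made up), moving the write off the request path could look like:
# Hypothetical job that just persists an already-computed payload to the cache.
class CacheMessageJob < ActiveJob::Base
  queue_as :default

  def perform(key, payload)
    Rails.cache.write(key, payload)
  end
end

# In the miss branch above, instead of writing synchronously:
# @messages[i] = @messages[i].as_json
# CacheMessageJob.perform_later(key, @messages[i])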
Memcached Multi-Set
It looks like the Memcached team has at least considered providing Multi-Set (batch write) commands, but there aren't any ActiveSupport interfaces for it yet and it's unclear what level of support is provided by implementations. [3]
As of Rails 4.1, you can now do fetch_multi and pass in a block.
http://api.rubyonrails.org/classes/ActiveSupport/Cache/Store.html#method-i-fetch_multi
keys = @messages.collect { |m| "message/#{m.id}/#{m.updated_at.to_i}" }
hits = Rails.cache.fetch_multi(*keys) do |key|
  # the block is called once per missing key; map it back to its message
  @messages[keys.index(key)].as_json
end
Note: if you're setting many items, you may want to consider writing to the cache in some sort of background worker.
I know that serializing an object is (to my knowledge) the only way to effectively deep-copy an object (as long as it isn't stateful like IO and whatnot), but is one way particularly more efficient than another?
For example, since I'm using Rails, I could always use ActiveSupport::JSON, to_xml - and from what I can tell marshalling the object is one of the most accepted ways to do this. I'd expect that marshalling is probably the most efficient of these since it's a Ruby internal, but am I missing anything?
Edit: note that its implementation is something I already have covered - I don't want to replace existing shallow copy methods (like dup and clone), so I'll likely just end up adding Object::deep_copy, implemented with whichever of the above methods (or any suggestions you have :) has the least overhead.
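For what it's worth, the Marshal-based version I have in mind would be roughly this (assuming everything being copied is marshalable, i.e. no IO handles, procs, and so on):
class Object
  # Deep copy by round-tripping through Marshal; fails for objects that
  # can't be dumped (IO, Proc, singletons, ...).
  def deep_copy
    Marshal.load(Marshal.dump(self))
  end
end

original = { list: [1, 2, { nested: "value" }] }
copy = original.deep_copy
copy[:list][2][:nested] << "!"
original[:list][2][:nested]   # => "value" -- the original is untouched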
I was wondering the same thing, so I benchmarked a few different techniques against each other. I was primarily concerned with Arrays and Hashes - I didn't test any complex objects. Perhaps unsurprisingly, a custom deep-clone implementation proved to be the fastest. If you are looking for quick and easy implementation, Marshal appears to be the way to go.
I also benchmarked an XML solution with Rails 3.0.7, not shown below. It was much, much slower, ~10 seconds for only 1000 iterations (the solutions below all ran 10,000 times for the benchmark).
Two notes regarding my JSON solution. First, I used the C variant, version 1.4.3. Second, it doesn't actually work 100%, as symbols will be converted to Strings.
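For instance, symbol keys and values come back as strings after the round trip:
require 'json'

original = { :x => [1, :two] }
JSON.load(JSON.dump(original))   # => {"x"=>[1, "two"]}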
This was all run with ruby 1.9.2p180.
#!/usr/bin/env ruby
require 'benchmark'
require 'yaml'
require 'json/ext'
require 'msgpack'

def dc1(value)
  Marshal.load(Marshal.dump(value))
end

def dc2(value)
  YAML.load(YAML.dump(value))
end

def dc3(value)
  JSON.load(JSON.dump(value))
end

def dc4(value)
  if value.is_a?(Hash)
    result = value.clone
    value.each { |k, v| result[k] = dc4(v) }
    result
  elsif value.is_a?(Array)
    result = value.clone
    result.clear
    value.each { |v| result << dc4(v) }
    result
  else
    value
  end
end

def dc5(value)
  MessagePack.unpack(value.to_msgpack)
end

value = {'a' => {:x => [1, [nil, 'b'], {'a' => 1}]}, 'b' => ['z']}

Benchmark.bm do |x|
  iterations = 10000
  x.report { iterations.times { dc1(value) } }
  x.report { iterations.times { dc2(value) } }
  x.report { iterations.times { dc3(value) } }
  x.report { iterations.times { dc4(value) } }
  x.report { iterations.times { dc5(value) } }
end
results in:
user system total real
0.230000 0.000000 0.230000 ( 0.239257) (Marshal)
3.240000 0.030000 3.270000 ( 3.262255) (YAML)
0.590000 0.010000 0.600000 ( 0.601693) (JSON)
0.060000 0.000000 0.060000 ( 0.067661) (Custom)
0.090000 0.010000 0.100000 ( 0.097705) (MessagePack)
I think you need to add an initialize_copy method to the class you are copying. Then put the logic for the deep copy in there. Then when you call clone it will fire that method. I haven't done it but that's my understanding.
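A minimal sketch of that idea (the Profile class and its tags attribute are just made up for illustration):
class Profile
  attr_accessor :tags

  def initialize(tags = [])
    @tags = tags
  end

  # dup/clone call this on the new object with the original as the argument,
  # so deep-copying the mutable state here makes clone effectively deep.
  def initialize_copy(source)
    super
    @tags = source.tags.map { |t| t.dup }
  end
end

a = Profile.new(["ruby"])
b = a.clone
b.tags << "rails"
a.tags   # => ["ruby"] -- the copy no longer shares the array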
I think plan B would be just overriding the clone method:
class CopyMe
  attr_accessor :var

  def initialize(var = '')
    @var = var
  end

  def clone(deep = false)
    deep ? CopyMe.new(@var.clone) : CopyMe.new()
  end
end
a = CopyMe.new("test")
puts "A: #{a.var}"
b = a.clone
puts "B: #{b.var}"
c = a.clone(true)
puts "C: #{c.var}"
Output
mike@sleepycat:~/projects$ ruby ~/Desktop/clone.rb
A: test
B:
C: test
I'm sure you could make that cooler with a little tinkering, but for better or for worse that is probably how I would do it.
Probably the reason Ruby doesn't contain a deep clone has to do with the complexity of the problem. See the notes at the end.
To make a clone that will "deep copy" Hashes, Arrays, and elemental values (i.e., make a copy of each element in the original such that the copy has the same values, but new objects), you can use this:
class Object
  def deepclone
    case
    when self.class == Hash
      hash = {}
      self.each { |k, v| hash[k] = v.deepclone }
      hash
    when self.class == Array
      array = []
      self.each { |v| array << v.deepclone }
      array
    else
      if defined?(self.class.new)
        self.class.new(self)
      else
        self
      end
    end
  end
end
If you want to redefine the behavior of Ruby's clone method, you can name it just clone instead of deepclone (in 3 places), but I have no idea how redefining Ruby's clone behavior will affect Ruby libraries or Ruby on Rails, so caveat emptor. Personally, I can't recommend doing that.
For example:
a = {'a'=>'x','b'=>'y'} => {"a"=>"x", "b"=>"y"}
b = a.deepclone => {"a"=>"x", "b"=>"y"}
puts "#{a['a'].object_id} / #{b['a'].object_id}" => 15227640 / 15209520
If you want your classes to deepclone properly, their new method (initialize) must be able to deepclone an object of that class in the standard way, i.e., if the first parameter is given, it's assumed to be an object to be deepcloned.
Suppose we want a class M, for example. The first parameter must be an optional object of class M. Here we have a second optional argument z to pre-set the value of z in the new object.
class M
  attr_accessor :z

  def initialize(m = nil, z = nil)
    if m
      # deepclone all the variables in m to the new object
      @z = m.z.deepclone
    else
      # default all the variables in M
      @z = z # default is nil if not specified
    end
  end
end
The z pre-set is ignored during cloning here, but your method may have a different behavior. Objects of this class would be created like this:
# a new 'plain vanilla' object of M
m=M.new => #<M:0x0000000213fd88 @z=nil>
# a new object of M with m.z pre-set to 'g'
m=M.new(nil,'g') => #<M:0x00000002134ca8 @z="g">
# a deepclone of m in which the strings are the same value, but different objects
n=m.deepclone => #<M:0x00000002131d00 @z="g">
puts "#{m.z.object_id} / #{n.z.object_id}" => 17409660 / 17403500
Where objects of class M are part of a Hash:
a = {'a'=>M.new(nil,'g'),'b'=>'y'} => {"a"=>#<M:0x00000001f8bf78 @z="g">, "b"=>"y"}
b = a.deepclone => {"a"=>#<M:0x00000001766f28 @z="g">, "b"=>"y"}
puts "#{a['a'].object_id} / #{b['a'].object_id}" => 12303600 / 12269460
puts "#{a['b'].object_id} / #{b['b'].object_id}" => 16811400 / 17802280
Notes:
If deepclone tries to clone an object which doesn't clone itself in the standard way, it may fail.
If deepclone tries to clone an object which can clone itself in the standard way, and if it is a complex structure, it may (and probably will) make a shallow clone of itself.
deepclone doesn't deep copy the keys in the Hashes. The reason is that they are not usually treated as data, but if you change hash[k] to hash[k.deepclone], they will be deep copied as well.
Certain elemental values have no new method, such as Fixnum. These objects always have the same object ID, and are copied, not cloned.
Be careful because when you deep copy, two parts of your Hash or Array that contained the same object in the original will contain different objects in the deepclone.
I have a calculation that generates what appears to be the Float 22.23, and a literal 22.23 like so:
some_object.total => 22.23
some_object.total.class => Float
22.23 => 22.23
22.23.class => Float
But for some reason, the following is false:
some_object.total == 22.23 ? true : false
Wacky, right?
Is there some kind of precision mechanism being used that maybe isn't completely transparent through the some_object.total call?
Floating-point numbers cannot precisely represent all decimal numbers within their range. For example, 0.9 is not exactly 0.9; it's a number really close to 0.9 that winds up being printed as 0.9 in most cases. As you do floating-point calculations, these errors can accumulate and you wind up with something very close to the right number but not exactly equal to it. For example, 0.3 * 3 == 0.9 will return false. This is the case in every computer language you will ever use; it's just how binary floating-point math works. See, for example, this question about Haskell.
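You can see this directly in irb (the exact displayed digits may vary slightly by Ruby version):
0.3 * 3 == 0.9   # => false
0.3 * 3          # => 0.8999999999999999
0.9              # => 0.9 (printed this way, but not stored exactly either)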
To test for floating point equality, you generally want to test whether the number is within some tiny range of the target. So, for example:
def float_equal(a, b)
  if a + 0.00001 > b and a - 0.00001 < b
    true
  else
    false
  end
end
You can also use the BigDecimal class in Ruby to represent arbitrary decimal numbers.
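A quick sketch of what that looks like:
require 'bigdecimal'

a = BigDecimal("22.23")
b = BigDecimal("20.00") + BigDecimal("2.23")
a == b    # => true -- decimal arithmetic stays exact here
a.to_f    # => 22.23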
If this is a test case, you can use assert_in_delta:
def test_some_object_total_is_calculated_correctly
  assert_in_delta 22.23, some_object.total, 0.01
end
Float#to_s and Float#inspect round. Try "%.30f" % some_object.total and you will see that it's not quite 22.23.
There is something else going on here. This is from a 1.8.7 irb:
irb(main):001:0> class Test
irb(main):002:1> attr_accessor :thing
irb(main):003:1> end
=> nil
irb(main):004:0> t = Test.new
=> #<Test:0x480ab78>
irb(main):005:0> t.thing = 22.5
=> 22.5
irb(main):006:0> t.thing == 22.5
=> true