Optimizing MongoDB driver performance in a Rails application

client = Mongo::Client.new(hosts.split(","),
  :database            => "#{database}",
  :replica_set         => "#{replica_set}",
  :connection_timeout  => 1,
  :socket_timeout      => 1,
  :wait_queue_timeout  => 1,
  :connect             => :replica_set,
  :read                => { :mode => :primary_preferred },
  :heartbeat_frequency => 1,
  :min_pool_size       => 3,
  :max_pool_size       => 20
)
I use the code above to connect to a MongoDB database configured as a replica set of three nodes. Queries are made on an indexed field. I noticed that when the :connect option is set to :direct, performance is better because no auto-discovery takes place; but when it is set to :replica_set, performance deteriorates to more than 12,000 ms. One of the nodes is in fact unavailable. Is there any way to improve the configuration above so that :connect => :replica_set performs as well as :connect => :direct? Any other performance improvement tips would be appreciated. I should add that each queried document is large, averaging 15707.931229566184 bytes (avgObjSize) as reported by db.stats() below:
uat-replSet:PRIMARY> db.stats()
{
  "db" : "fuse_transactions",
  "collections" : 5,
  "objects" : 1896978,
  "avgObjSize" : 15707.931229566184,
  "dataSize" : 29797599968,
  "storageSize" : 44447108352,
  "numExtents" : 114,
  "indexes" : 9,
  "indexSize" : 242091360,
  "fileSize" : 46082293760,
  "nsSizeMB" : 16,
  "extentFreeList" : {
    "num" : 0,
    "totalSize" : 0
  },
  "dataFileVersion" : {
    "major" : 4,
    "minor" : 22
  },
  "ok" : 1
}
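For reference, the :connect => :direct configuration I benchmarked against looks roughly like this (the host below is a placeholder, not one of the real nodes):
direct_client = Mongo::Client.new(["node1.example.com:27017"],
  # with :connect => :direct the driver talks to the single seed node
  # and skips replica-set topology discovery
  :database => "#{database}",
  :connect  => :direct
)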

Related

Ruby Mongo Driver Projection Elemmatch

Following the code in http://www.w3resource.com/mongodb/mongodb-elemmatch-projection-operators.php I have set up a test database using the Ruby MongoDB driver.
For those following along at home, you first need to install the mongo driver as described at https://docs.mongodb.com/ecosystem/tutorial/ruby-driver-tutorial/#creating-a-client, then run the following commands.
client = Mongo::Client.new(['127.0.0.1:27017'], :database => 'mydb')
test = client['test']
doc = {
  "_id" => 1,
  "batch" => 10452,
  "tran_details" => [
    { "qty" => 200, "prate" => 50, "mrp" => 70 },
    { "qty" => 250, "prate" => 50, "mrp" => 60 },
    { "qty" => 190, "prate" => 55, "mrp" => 75 }
  ]
}
test.insert_one(doc)
Insert all of the different docs created in the w3 tutorial.
If you look at example 2 in the w3 tutorial, the translated Ruby find is:
test.find({"batch" => 10452}).projection({"tran_details" => {"$elemMatch" => {"prate" => 50, "mrp" => {"$gte" => 70}}}}).to_a
which returns the same result as in the example.
=> [{"_id"=>1, "tran_details"=>[{"qty"=>200, "prate"=>50, "mrp"=>70}]}, {"_id"=>3}, {"_id"=>4}]
My problem is that I would like to constrain the results with the constraints above (mrp gte 70 etc) while also specifying which fields are returned.
For instance, constraining only the tran_details that have a mrp gte 70, but in the results returned only include the prate field (or any subset of the fields).
I can return only the prate field with the query:
test.find({"batch" => 10452}).projection({"tran_details.prate" => 1}).to_a
I would like to combine the effects of the two different projections, but I haven't seen any documentation about how to do that online. If you string the two projections to each other, only the final projection has an effect.
To anyone out there --
The problem can be solved for up to one element by using $elemMatch in the projection. However, $elemMatch only returns the first match it finds. To return only the parts of embedded documents multiple layers down that fit certain criteria, you need to use the aggregation framework.
test.find({
'tran_details.prate' => { '$gt' => 56 }
}).projection({
tran_details: {
'$elemMatch' => {
prate: { '$gt' => 56 }
}
},
'tran_details.qty' => 1,
'tran_details.mrp' => 1,
batch: 1,
_id: 0
}).to_a
Here is example code using the aggregation framework:
test.aggregate([
  { '$match' => {
      '$or' => [
        { 'batch' => 10452 }, { 'batch' => 73292 }
      ]
    }
  },
  { '$project' => {
      'tran_details' => {
        '$filter' => {
          'input' => '$tran_details',
          'as'    => 'item',
          'cond'  => { '$and' => [
            { '$gte' => ['$$item.prate', 51] },
            { '$gt'  => ['$$item.prate', 53] } # note: operators need the leading $
          ]}
        }
      }
    }
  }
]).to_a
If anyone sees this and knows how to dynamically construct queries in Ruby from strings, please let me know! Something to do with BSON, but I'm still trying to find relevant documentation. Thanks.
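For what it's worth, operators and field names in the Ruby driver are ordinary hash keys, so a query can be assembled from strings at runtime; a small sketch (field, operator, and value are made-up inputs, not from the post above):
# Hypothetical: build a find() filter from runtime strings.
field    = "tran_details.prate" # assumed user-supplied field path
operator = "$gt"                # assumed user-supplied operator
value    = 51

test.find({ field => { operator => value } }).to_a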

Count by group and subgroup

I want to generate some stats regarding some data I have in a model. I want to create stats according to an association and a status column, i.e.
Model.group(:association_id).group(:status).count
to get an outcome like
[{ association_id1 => { status1 => x1, status2 => y1 } },
 { association_id2 => { status1 => x2, status2 => y2 } }, ...etc.]
Not really bothered whether it comes out in as an array or hash, just need the numbers to come out consistently.
Is there a 'rails' way to do this or a handy gem?
Ok. Worked out something a little better, though happy to take advice on how to clean this up.
group_counts = Model.group(["association_id","status"]).count
This returns something like:
=> {[nil, "status1"]=>2,
[ass_id1, "status1"]=>58,
[ass_id2, "status7"]=>1,
[ass_id2, "status3"]=>71 ...etc
Which, while it contains the data, is a pig to work with.
stats = group_counts.group_by{|k,v| k[0]}.map{|k,v| {k => v.map{|x| {x[0][1] => x[1] }}.inject(:merge) }}
Gives me something a little friendlier
=> [{
nil => {
"status1" => 10,
"status2" => 23},
"ass_id1" => {
"status1" => 7,
"status2" => 23
}...etc]
Hope that helps somebody.
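For anyone who prefers it, the same reshaping can be written a little more plainly with each_with_object; a sketch, untested against the exact schema above:
# Build { association_id => { status => count } } from the grouped counts.
stats = Model.group(:association_id, :status).count
             .each_with_object(Hash.new { |h, k| h[k] = {} }) do |((association_id, status), count), acc|
  acc[association_id][status] = count
end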
This is pretty ugly, inefficient and there must be a better way to do it, but...
Model.pluck(:association_id).uniq.map { |ass|
  {
    name: (ass.blank? ? nil : Association.find(ass).name),
    data: Model.where(association_id: ass).group(:status).count
  }
}
Gives what I need.
Obviously if you didn't need name the first term would be a little simpler.
i.e.
Model.pluck(:association_id).uniq.map { |ass|
  {
    id:   ass,
    data: Model.where(association_id: ass).group(:status).count
  }
}
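If you do need the names, one way to avoid a per-group Association.find (an N+1 pattern) is to fetch all the names in one query first; a rough sketch, assuming Association has id and name columns as above:
# One query for all the names instead of one Association.find per group.
names = Association.pluck(:id, :name).to_h

Model.group(:association_id, :status).count
     .group_by { |(association_id, _status), _count| association_id }
     .map do |association_id, rows|
  {
    name: names[association_id],
    data: rows.map { |(_id, status), count| [status, count] }.to_h
  }
end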

Why does using a regex in a MongoDB query cause it to scan all documents instead of using the index?

While investigating a performance issue with queries, I found that the source was this query. I am using Rails 4 with the Mongoid gem.
Order.where("customer.email" => /\Atest#email.com\z/i) where test#email.com is just an example.
Customer is an embedded document inside the Order document, and the customer's email is indexed.
When I benchmarked performance using Benchmark.bmbm where Order.where("customer.email" => /\Atest#email.com\z/i).count was repeated 100 times, I got the following result.
user system total real
0.090000 0.010000 0.100000 ( 27.656723)
I thought perhaps \A and \z were causing the slowness, so I tried the following, which looks for emails that start with the given argument: Order.where("customer.email" => /^test/i).count
And result wasn't much different.
user system total real
0.090000 0.010000 0.100000 ( 28.712883)
As a last resort, I tried matching the entire string without a regexp. This time, it made a huge difference: Order.where("customer.email" => "test#email.com").count
user system total real
0.080000 0.000000 0.080000 ( 0.122888)
When I looked at the output of explain, it showed that using the regexp scans all documents.
{
  "cursor" => "BtreeCursor customer.email_1",
  "isMultiKey" => false,
  "n" => 781,
  "nscannedObjects" => 781,
  "nscanned" => 500000,
  "nscannedObjectsAllPlans" => 781,
  "nscannedAllPlans" => 500000,
  "scanAndOrder" => false,
  "indexOnly" => false,
  "nYields" => 1397,
  "nChunkSkips" => 0,
  "millis" => 406,
  "indexBounds" => {
    "customer.email" => [
      [0] [
        [0] "",
        [1] {}
      ],
      [1] [
        [0] /test/i,
        [1] /test/i
      ]
    ]
  }
}
Using the entire string, by contrast, scanned only a subset, which was what I expected.
{
  "cursor" => "BtreeCursor customer.email_1",
  "isMultiKey" => false,
  "n" => 230,
  "nscannedObjects" => 230,
  "nscanned" => 230,
  "nscannedObjectsAllPlans" => 230,
  "nscannedAllPlans" => 230,
  "scanAndOrder" => false,
  "indexOnly" => false,
  "nYields" => 1,
  "nChunkSkips" => 0,
  "millis" => 0,
  "indexBounds" => {
    "customer.email" => [
      [0] [
        [0] "test#email.com",
        [1] "test#email.com"
      ]
    ]
  }
}
Can someone please explain why using a regexp in a MongoDB query causes it to scan all documents instead of using the index?
EDIT: Added indexBounds in the explain output, which was omitted in the original post.
The quick version:
You have a case-insensitive regex here (the /i flag); that means that Mongo can't do prefix matching on the index, and thus has to scan the entire index (since it doesn't know if you want test#example.com or TEST#example.com or teST#exAMple.com or whatnot).
If you want case-insensitive lookups in Mongo, the correct solution is to downcase the values prior to storage. If you need to not affect the user-entered input, then store the downcased copy in a secondary field on the document (i.e., email and email_normalized).
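A minimal sketch of that approach (email_normalized and input_email are made-up names; the field would need to be defined on the embedded Customer and indexed through Order):
# Keep a downcased copy of the email alongside the original...
order.customer.email_normalized = order.customer.email.downcase
order.save

# ...then query with an exact match, which can use an index on the
# normalized field instead of a case-insensitive regex:
Order.where("customer.email_normalized" => input_email.downcase).count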
The longer version
Mongo's indexes are B-Trees, and when you perform a regex query, it will see a) if it can use an index (by field) and b) how much of that tree it has to scan to be assured of a result. In the case where you have a specific prefix, Mongo knows that it can limit its search to only a portion of the tree. You left out the most important part of your explains - the index bounds. Given a collection with an index and some emails:
kerrigan:PRIMARY> db.test.ensureIndex({email: 1})
kerrigan:PRIMARY> db.test.insert({email: "test#example.com"})
kerrigan:PRIMARY> db.test.insert({email: "teTE#example.com"})
kerrigan:PRIMARY> db.test.insert({email: "teST#example.com"})
kerrigan:PRIMARY> db.test.insert({email: "TEst#example.com"})
If we explain the find with a case-sensitive match:
kerrigan:PRIMARY> db.test.find({email: /\Atest#example.com\z/}).explain()
{
  "cursor" : "IndexCursor email_1 multi",
  "isMultiKey" : false,
  "n" : 1,
  "nscannedObjects" : 1,
  "nscanned" : 1,
  "nscannedObjectsAllPlans" : 1,
  "nscannedAllPlans" : 1,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nChunkSkips" : 0,
  "millis" : 0,
  "indexBounds" : {
    "email" : [
      [
        "test#example",
        "test#examplf"
      ],
      [
        /\Atest#example.com\z/,
        /\Atest#example.com\z/
      ]
    ]
  },
  "server" : "luna:27019"
}
You'll see that it only has to scan one document, and that the scan upper and lower bounds are well-defined ("test#example".."test#examplf"). This is because Mongo looks at the prefix and says "That explicit prefix is guaranteed to be in every matching result", and thus knows that it can limit the portion of the index that it has to scan.
If we add the /i flag though:
kerrigan:PRIMARY> db.test.find({email: /\Atest#example.com\z/i}).explain()
{
  "cursor" : "IndexCursor email_1 multi",
  "isMultiKey" : false,
  "n" : 3,
  "nscannedObjects" : 3,
  "nscanned" : 4,
  "nscannedObjectsAllPlans" : 3,
  "nscannedAllPlans" : 4,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nChunkSkips" : 0,
  "millis" : 0,
  "indexBounds" : {
    "email" : [
      [
        "",
        {}
      ],
      [
        /\Atest#example.com\z/i,
        /\Atest#example.com\z/i
      ]
    ]
  },
  "server" : "luna:27019"
}
Suddenly those index bounds are "".."", or a full index scan; because there's no guaranteed static prefix for the field, Mongo has to scan and check each value in the index to see if it matches your provided regex.
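Applied back to the original query: dropping the /i flag and anchoring on a literal prefix keeps the bounded index scan. A sketch using the model from the question:
# A case-sensitive, anchored prefix lets Mongo bound the scan to
# "test".."tesu" on the index; it is only the /i flag that forces
# the full index scan shown above.
Order.where("customer.email" => /\Atest/).count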

Understanding a Ruby script for a Cassandra database

I am a Ruby novice, but I have to handle this code because our Ruby developer is not available. We use a Cassandra database: values come in from a Ruby (Sinatra) web service and are inserted into the Cassandra keyspace. The data is failing to insert.
In the following code, partners_daily, partners_monthly, etc. are column families (tables) in the stats keyspace (database).
if params and !partner_id.nil? and !activity_type.nil?
  {
    :partners_daily    => "#{partner_id}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}_#{time.year}_#{time.month}_#{time.day}",
    :partners_monthly  => "#{partner_id}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}_#{time.year}_#{time.month}",
    :partners_alltime  => "#{partner_id}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}",
    :channels_daily    => "#{channel_id}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}_#{time.year}_#{time.month}_#{time.day}",
    :channels_monthly  => "#{channel_id}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}_#{time.year}_#{time.month}",
    :channels_alltime  => "#{channel_id}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}",
    :countries_daily   => "#{country}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}_#{time.year}_#{time.month}_#{time.day}",
    :countries_monthly => "#{country}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}_#{time.year}_#{time.month}",
    :countries_alltime => "#{country}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}"
  }.each do |k, v|
    stats.add(k, v, 1, 'count')
  end
  return "Activity stored in stats"
else
  return "Error: client headers missing"
end
end # closes the enclosing route/method (not shown in the excerpt)
def count(table, key)
  require 'cassandra-cql' # requiring this at the top was leading to error: unconfigured columnfamily
  cqldb = CassandraCQL::Database.new('127.0.0.1:9160', { :keyspace => 'plystats' })
  query = "update partners_daily set count = ? where key = ?" # "update #{table} set count = count+1 where key = ?;"
  # return cqldb.execute(query, 0, 'sonia').inspect
  return query
end
I want to know how and where the data-inserting logic is performed. Is it in stats.add(k, v, 1, 'count')?
And is there any error in the inserting part, since it's failing?
I want to know how and where the data-inserting logic is performed. Is it in stats.add(k, v, 1, 'count')?
Yes, that's where it should be happening. Between the {} are dictionary/hash values:
{
:partners_daily => # …
}.each do |k,v|
A loop is started with the each method; each entry is decomposed into k and v, the key in k and the value in v. For example, the first record in the hash is:
:partners_daily => "#{partner_id}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}_#{time.year}_#{time.month}_#{time.day}",
This would then decompose within the each loop to:
k = :partners_daily
v = # The result of
"#{partner_id}_#{activity_type}_#{success == 1 ? 'sucess' : "failure:#{failure_code}"}_#{time.year}_#{time.month}_#{time.day}",
I don't know what the values are for partner_id etc., but making some up, it'd look something like "123_sales_sucess_2013_6_01".
Notice there's a typo for the word success in there.
It's a bit confusing due to the multiple double quotes and braces, so I'd change this to:
[partner_id, activity_type, (success == 1 ? 'success' : "failure:#{failure_code}"), time.year, time.month, time.day].join("_")
But notice that there's a lot of repetition in there, so I'd change the whole hash to (at least):
success_string = success == 1 ?
'success' :
"failure:#{failure_code}"
data = {
:partners_daily => [partner_id, activity_type,success_string,time.year,time.month,time.day].join("_"),
:partners_monthly => [partner_id,activity_type,success_string,time.year,time.month].join("_"),
:partners_alltime => [partner_id,activity_type,success_string].join("_"),
:channels_daily => [channel_id,activity_type,success_string,time.year,time.month,time.day].join("_"),
:channels_monthly => [channel_id,activity_type,success_string,time.year,time.month].join("_"),
:channels_alltime => [channel_id,activity_type,success_string].join("_"),
:countries_daily => [country,activity_type,success_string,time.year,time.month,time.day].join("_"),
:countries_monthly => [country,activity_type,success_string,time.year,time.month].join("_"),
:countries_alltime => [country,activity_type,success_string].join("_")
}
data.each do |k,v|
# more code…
It starts to be easier to read and see the logic. Also, by putting the hash into the data variable instead of working on it immediately, it allows you to inspect it more easily, e.g.
warn "data = #{data.inspect}"
would output a representation of the data to the console, so at least you could get an idea of what the script is attempting to put in. At the top of this code, you could also add warn "stats = #{stats.inspect}" to check what the stats object looks like.
If stats is a Cassandra instance, i.e. there's something like stats = Cassandra.new "blah", "blahblah" that sets it up, then the add method is this one.
The signature given is add(column_family, key, value, *columns_and_options) but that doesn't seem to match the call you have:
stats.add(k, v, 1, 'count')
should (probably) be:
stats.add('count', k, v, 1)
In fact, I'm not even sure that the concatenation in the data hash should happen and maybe all of that should just be passed to add, but it's your data so I can't be sure.
Feel free to comment below and I'll update this.
Trying it in IRB to check it for syntax errors:
success = 1
# => 1
partner_id = 123
# => 123
activity_type = "something"
# => "something"
time = Time.now
# => 2013-06-05 11:17:50 +0100
channel_id = 456
# => 456
country = "UK"
# => "UK"
success_string = success == 1 ?
'success' :
"failure:#{failure_code}"
# => "success"
data = {
:partners_daily => [partner_id, activity_type,success_string,time.year,time.month,time.day].join("_"),
:partners_monthly => [partner_id,activity_type,success_string,time.year,time.month].join("_"),
:partners_alltime => [partner_id,activity_type,success_string].join("_"),
:channels_daily => [channel_id,activity_type,success_string,time.year,time.month,time.day].join("_"),
:channels_monthly => [channel_id,activity_type,success_string,time.year,time.month].join("_"),
:channels_alltime => [channel_id,activity_type,success_string].join("_"),
:countries_daily => [country,activity_type,success_string,time.year,time.month,time.day].join("_"),
:countries_monthly => [country,activity_type,success_string,time.year,time.month].join("_"),
:countries_alltime => [country,activity_type,success_string].join("_")
}
# => {:partners_daily=>"123_something_success_2013_6_5", :partners_monthly=>"123_something_success_2013_6", :partners_alltime=>"123_something_success", :channels_daily=>"456_something_success_2013_6_5", :channels_monthly=>"456_something_success_2013_6", :channels_alltime=>"456_something_success", :countries_daily=>"UK_something_success_2013_6_5", :countries_monthly=>"UK_something_success_2013_6", :countries_alltime=>"UK_something_success"}
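One more note on the count helper in the question: if count is a Cassandra counter column, CQL only allows relative increments, not assignment, so the commented-out count = count + 1 form is the one that should run. A hedged sketch reusing the same cassandra-cql calls as the snippet (table and key values are made up):
require 'cassandra-cql'

# Counter columns can't be assigned, only incremented, so the query
# interpolates the table name and bumps count rather than setting it.
cqldb = CassandraCQL::Database.new('127.0.0.1:9160', :keyspace => 'plystats')

def bump_count(cqldb, table, key)
  cqldb.execute("UPDATE #{table} SET count = count + 1 WHERE key = ?", key)
end

bump_count(cqldb, 'partners_daily', 'some_key')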

Mongoid or/any_of unexpected behaviour

I'm having an issue with Mongoid's any_of. I'm trying to find objects that have either one field > 0 or another field > 0. My query is:
Model.any_of(best_friend_method.gt => 0, method.gt => 0).desc(best_friend_method, method)
It is "translated" in :
#<Mongoid::Criteria
selector: {"$or"=>[{:best_friends_lc_sum=>{"$gt"=>0}, :lc_sum=>{"$gt"=>0}}]},
options: {:sort=>[[:best_friends_lc_sum, :desc], [:lc_sum, :desc]]},
class: FbAlbum,
embedded: false>
As I understand it, this is what I want, but it only returns 6 results. Model.where(:best_friends_lc_sum.gt => 0).count also returns 6 results, while Model.where(:lc_sum.gt => 0).count returns ~850 objects.
I expect my query to return the union of those two: is this a Mongoid/MongoDB error, or am I doing something wrong?
FYI : mongoid 2.4.5, mongodb 2.0.2, rails 3.1.3
Thanks for your time!
It's because you pass only one argument, not two, so there is effectively no $or. Try:
Model.any_of({best_friend_method.gt => 0}, {method.gt => 0}).desc(best_friend_method, method)
In this case the Criteria become :
#<Mongoid::Criteria
selector: {"$or"=>[{:best_friends_lc_sum=>{"$gt"=>0}}, {:lc_sum=>{"$gt"=>0}}]},
options: {:sort=>[[:best_friends_lc_sum, :desc], [:lc_sum, :desc]]},
class: FbAlbum,
embedded: false>
Sometimes explicit braces are mandatory to separate the different hashes.
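A quick illustration of why, in plain Ruby (nothing Mongoid-specific): bare key-value pairs at the end of an argument list collapse into a single hash unless you add braces.
def args(*a)
  a
end

args(a: 1, b: 2)     # => [{:a=>1, :b=>2}]   one hash   -> one $or branch
args({a: 1}, {b: 2}) # => [{:a=>1}, {:b=>2}] two hashes -> two branches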
In case this helps someone... In Mongoid 3, the Origin gem provides the query syntax. Here is the list of methods that you can use to write your Mongoid 3 queries. Among those methods is the or method, which allows you to perform an $or query:
# Mongoid query:
Model.or(
{ name: "Martin" }, { name: "Dave" }
)
# resulting MongoDB query:
{
"$or" => [
{ "name" => "Martin" }, { "name" => "Dave" }
]
}
Using the OP's original example, it can be rewritten as:
Model.or(
{ best_friend_method.gt => 0 },
{ method.gt => 0 }
).order_by(
best_friend_method,
method
)
At least one of the hashes passed to the or method must match in order for a record to be returned.
