How do you add new documents to an inverted index - search-engine

Consider an inverted index with positional records stored in a MySQL database as:
Word (VARCHAR) | Documents (LONGTEXT)
-------------------------------------------------------------
Hello | {id: 11, freq: 4, pos: [18, 37, 43, 119]},
| {id: 19, freq: 2, pos: [17, 32]}
-------------------------------------------------------------
Now a new document comes along and most of its words are already indexed. What should the index operation be? The basic approach seems to be: if a word is already present in the database, fetch its documents, append the current document, and update the record.
Is this sustainable as the number of documents grows to, say, millions? How do real-world search engines like Solr, Xapian, Google, Bing etc. handle this?

When a new document is added to your collection, the operation would be to:
Assign the document an id, say 20, which uniquely identifies the document. This id is typically incremented by 1 for each new document added to the collection.
Make a list of all the words in the new document and the positions at which they occur.
For the document Hi Hello Hello Bye, this would be:
Bye: {id: 20, freq: 1, pos: [15]}
Hello: {id: 20, freq: 2, pos: [3, 9]}
Hi: {id: 20, freq: 1, pos: [0]}
For any new word (Bye, Hi), add an entry to the database for that word. For any existing word in the database (Hello), append the new data to its value.
Below is how your database would look after adding the document (a small code sketch of this merge follows the table).
Word (VARCHAR) | Documents (LONGTEXT)
-------------------------------------------------------------
Bye | {id: 20, freq: 1, pos: [15]}
Hello | {id: 11, freq: 4, pos: [18, 37, 43, 119]},
| {id: 19, freq: 2, pos: [17, 32]}
| {id: 20, freq: 2, pos: [3, 9]}
Hi | {id: 20, freq: 1, pos: [0]}
-------------------------------------------------------------
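As a rough sketch of that update step (plain Ruby, keeping the postings in an in-memory hash keyed by word; persisting to the Word/Documents table would follow the same merge logic), it could look like this:
def add_document(index, doc_id, text)
  # Collect one posting per distinct word: {id:, freq:, pos: [...]},
  # using character offsets as positions, as in the example above.
  postings = Hash.new { |h, w| h[w] = { id: doc_id, freq: 0, pos: [] } }
  text.scan(/\w+/) do |word|
    entry = postings[word]
    entry[:freq] += 1
    entry[:pos] << Regexp.last_match.begin(0)
  end
  # Merge: new words get a fresh postings list, existing words get appended to.
  postings.each { |word, entry| (index[word] ||= []) << entry }
  index
end

index = { "Hello" => [{ id: 11, freq: 4, pos: [18, 37, 43, 119] },
                      { id: 19, freq: 2, pos: [17, 32] }] }
add_document(index, 20, "Hi Hello Hello Bye")
# index["Hello"] now also contains { id: 20, freq: 2, pos: [3, 9] }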
The quick answer to your other question is: yes, this is sustainable for large indexes. Inverted indexes are typically optimized for lookup, using hash tables or binary trees, making retrieval practically independent of the size of the document collection.
As for how large search engines handle this: I don't know the details (even though I'd like to). They obviously use data clusters to spread the load over multiple servers (yes, I said spread the load; it wasn't intentional). I bet they have preprocessed a bunch of stuff and cached common queries like "Stack Overflow", so there's already a result page prepared for that.

Related

Determine differences between incoming CSV data and existing Mongo collection for large data sets

I have an incoming CSV that I am trying to compare with an existing collection of mongo documents (Note objects) to determine additions, deletions, and updates. The incoming CSV and mongo collection are quite large at around 500K records each.
ex. csv_data
[
  { id: 1, text: "zzz" },
  { id: 2, text: "bbb" },
  { id: 4, text: "ddd" },
  { id: 5, text: "eee" }
]
Mongo collection of Note objects:
[
  { id: 1, text: "aaa" },
  { id: 2, text: "bbb" },
  { id: 3, text: "ccc" },
  { id: 4, text: "ddd" }
]
As a result I would want to get
an array of additions
[
  { id: 5, text: "eee" }
]
an array of removals
[
  { id: 3, text: "ccc" }
]
an array of updates
[
  { id: 1, text: "zzz" }
]
I tried using select statements to filter for each particular difference but it is failing / taking hours when using the real data set with all 500k records.
additions = csv_data.select{|record| !Note.where(id: record[:id]).exists?}
deletions = Note.all.select{|note| !csv_data.any?{|row| row[:id] == note.id}}
updates = csv_data.select do |record|
  note = Note.where(id: record[:id])
  note.exists? && note.first.text != record[:text]
end
How would I better optimize this?
Assumption: the CSV file is a snapshot of the data in the database taken at some other time, and you want a diff.
In order to get the answers you want, you need to read every record in the DB. Right now you are effectively doing this three times, once to obtain each statistic, which is roughly 1.5 million DB calls, and possibly more if there are significantly more notes in the DB than there are in the file. I'd follow these steps:
Read the CSV data into a hash keyed by ID
Read each record in the database, and for each record:
If the DB ID is found in the CSV hash, remove it from the hash; if its text differs from the DB record, add it to the updates
If the DB ID isn't found in the CSV hash, add it to the deletes
When you reach the end of the DB, anything still left in the CSV hash must therefore be an addition
While it's still not super-slick, at least you only get to do the database I/O once instead of three times...
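A sketch of those steps in Ruby (assuming each CSV row is a hash with :id and :text, and that Note responds to id and text, as in the question):
csv_by_id = csv_data.each_with_object({}) { |row, h| h[row[:id]] = row }

updates   = []
deletions = []

# Single pass over the collection; Mongoid streams this with a cursor
# (with ActiveRecord you would use find_each to batch the reads).
Note.all.each do |note|
  row = csv_by_id.delete(note.id)
  if row.nil?
    deletions << note           # in the DB but not in the CSV
  elsif row[:text] != note.text
    updates << row              # present in both, text has changed
  end
end

# Anything left in the hash never matched a DB record, so it is new.
additions = csv_by_id.values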

Efficient way to randomize data model in Rails

I am creating a programming schedule based on the videos in my object model.
I want to run a task every day to shuffle this model so that the programming is different each day.
I am aware of product.shuffle.all, for example, but I want the order to be saved once each day rather than recomputed on every server call.
I am thinking of adding an integer attribute named order to each product to order by. How would I shuffle just product.order for all products in this case?
Would this be the most efficient way? Thanks for the help!
You could use the random parameter of shuffle. It allows for stable randomization:
# When the Random object's seed is different, multiple runs result in different outcome:
pry(main)> [1,2,3,4,5,6,7].shuffle(random: Random.new)
=> [5, 6, 3, 4, 1, 7, 2]
pry(main)> [1,2,3,4,5,6,7].shuffle(random: Random.new)
=> [1, 7, 6, 4, 5, 3, 2]
# When the Random object is created with the same seed, multiple runs return the same result:
pry(main)> [1,2,3,4,5,6,7].shuffle(random: Random.new(1))
=> [7, 3, 2, 1, 5, 4, 6]
pry(main)> [1,2,3,4,5,6,7].shuffle(random: Random.new(1))
=> [7, 3, 2, 1, 5, 4, 6]
By basing the seed on, for example, the day of the year, you control when the randomization changes. You can (obviously) reproduce the randomization for any given day if you need to.
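For example, a daily-stable ordering (a sketch; assumes a Product model and uses the day of the year as the seed):
require 'date'

# Same seed all day long, so every request sees the same order;
# the seed (and therefore the order) changes at midnight.
daily_seed = Date.today.yday
todays_schedule = Product.all.to_a.shuffle(random: Random.new(daily_seed))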
What I think you want to do would best be solved with a combination of the paper_trail gem, your product.shuffle.all, and an update_attributes call to the DB. That way you can view past versions as they are updated in your DB.
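A sketch of that daily task (assuming the integer order column from the question has been added to products; the task and namespace names are placeholders):
# lib/tasks/shuffle_schedule.rake
namespace :schedule do
  desc "Persist a fresh random ordering for products once a day"
  task shuffle: :environment do
    Product.all.shuffle.each_with_index do |product, index|
      # Store the shuffled position so requests can simply order(:order)
      # instead of re-shuffling on every call.
      product.update_attributes(order: index)
    end
  end
end
With has_paper_trail on Product, each day's ordering is also versioned, so past schedules can be looked up later.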

How To Sum Column In Relation with active record rails

I have a question about summing two columns across a relation, given records such as:
#<RequestProduct id: 26, request_id: 27, product_service_id: 9, quantity: 12, created_at: "2015-09-12 04:58:19", updated_at: "2015-09-12 04:58:19">,
#<RequestProduct id: 27, request_id: 27, product_service_id: 10, quantity: 11, created_at: "2015-09-12 04:58:19", updated_at: "2015-09-12 04:58:19">,
#<RequestProduct id: 28, request_id: 27, product_service_id: 11, quantity: 10, created_at: "2015-09-12 04:58:20", updated_at: "2015-09-12 04:58:20">
I want to sum quantity on the RequestProduct model multiplied by price on the ProductService model. I've tried several approaches, but they all failed:
@request.request_products.sum("quantity * request_products.product_service.price")
@request.request_products.sum("quantity * product_service.price")
@request.request_products.sum("quantity * ?", product_service.price)
Is there any other way?
You need to try something like this:
# assumes RequestProduct belongs_to :product_service
res = @request.request_products
              .joins(:product_service)
              .select("SUM(request_products.quantity) AS quantities, SUM(product_services.price * request_products.quantity) AS total")
              .take
Now you can access the sum values via:
res.quantities # total quantity
res.total # total amount
One method, which keeps this largely invisible to Rails, is to define a view in the database that joins the tables, performs the arithmetic, and maybe even aggregates the numbers.
By exposing the appropriate key id columns of the underlying tables and creating a model for the view (marked as readonly), you can treat it as a regular Rails model, with all of the complexity pushed down into the database layer.
Some developers will hate this, as it arguably moves business logic into the database layer, which they're not comfortable with. Also, merely joining tables and multiplying values may not be a strong enough use case on its own; the argument in favour gets stronger if you also need to aggregate.
However, it's not complex logic, and the simplicity at the Rails level and the performance of the system would probably be hard to match with any other method.
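A hedged sketch of that approach (the view name, column names, and model name are illustrative only, not from the original answer):
# Migration: create a view that joins the tables and does the arithmetic.
class CreateRequestProductTotals < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE VIEW request_product_totals AS
      SELECT rp.id                   AS request_product_id,
             rp.request_id           AS request_id,
             rp.quantity             AS quantity,
             ps.price                AS price,
             rp.quantity * ps.price  AS line_total
      FROM request_products rp
      JOIN product_services ps ON ps.id = rp.product_service_id
    SQL
  end

  def down
    execute "DROP VIEW request_product_totals"
  end
end

# Read-only model backed by the view.
class RequestProductTotal < ActiveRecord::Base
  self.table_name = "request_product_totals"

  def readonly?
    true
  end
end

# Usage: total value of one request.
# RequestProductTotal.where(request_id: 27).sum(:line_total)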

Counting associated objects from an array of parent objects

I am working on a report that counts neighbors in a household. I am looking at using either a helper method, a simple query, or both.
A household is an object that has several neighbors, and I want to count the total number of neighbors in a select group of households. I have an array of household IDs:
@household_ids = [31, 15, 30, 38, 1, 5, 32, 25, 10, 26, 14, 29]
I tried this:
def household_neighbor_count(houses)
  houses.each do |id|
    @neigh = Household.find(id).neighbor_count
    @neigh
  end
end
Which doesn't work - it returns a list of the IDs
Since this is Rails, I could also do an ActiveRecord query; this is my attempt in pseudo-SQL:
Neighbors where household_id == household_id in @household_ids
I am using squeel if that helps
How do I do this? Either approach is fine, or a recommendation of the best approach would be great.
It is unclear what you are trying to do, but I guess this is what you want:
houses.map{|id| Household.find(id).neighbor_count}.inject(:+)
If there are duplicate neighbors among the houses, then you need a method to get the set of neighbors for a given house, not just the count. Since you have not shown such a method, I guess either that is not an issue, or your question is underspecified.
Assuming there is a household_id on the Neighbor model:
@household_ids = [31, 15, 30, 38, 1, 5, 32, 25, 10, 26, 14, 29]
total_number_of_neighbors = Neighbor.where('household_id IN (?)', @household_ids).count
Try this out:
Household.includes(:neighbors).select('households.id, neighbors.id').where(:id => [31, 15, 30, 38, 1, 5, 32, 25, 10, 26, 14,29]).map{|r| {r.id => r.neighbors.count}}

add fields having similar values

I have a trading model which has two fields, number_of_shares and price_per_share.
I want to show this in a pie chart. To do that, I have to find all the trading objects associated with a user, then group the trading objects that have the same price_per_share and add up their number_of_shares.
Example :-
trading id: 1, price_per_share: 10, number_of_shares: 20
trading id: 2, price_per_share: 10, number_of_shares: 12
trading id: 3, price_per_share: 12, number_of_shares: 10
Now I want to add up all the records whose price_per_share has the same value (10 in this case). How can I do that?
This should work:
Trading.group(:price_per_share).sum(:number_of_shares)
# => {10=>32, 12=>10}
The SQL will be something like:
SELECT SUM(`tradings`.`number_of_shares`) AS sum_number_of_shares, price_per_share AS price_per_share FROM `tradings` GROUP BY `tradings`.`price_per_share`
I hope I understood that right, but:
Trading.where(:price_per_share => 10).sum(:number_of_shares)
should give you the result you are looking for.
