I have a Rails app with PostgreSQL.
I'm trying to implement a method to suggest alternative names for a certain resource, if the user input has been already chosen.
My reference is Slack.
Is there any solution that could do this efficiently?
By efficiently I mean using only one query, or a small number of queries. A pure SQL solution would be great, though.
My initial implementation looked like this:
def generate_alternative_names(model, column_name, count)
  words = model[column_name].split(/[,\s\-_]+/).reject(&:blank?)
  # 100.times returns an Enumerator, which has map but no map!
  candidates = 100.times.map { |i| generate_candidates_using_a_certain_strategy(i, words) }
  already_used = model.class.where(column_name => candidates).pluck(column_name)
  (candidates - already_used).first(count)
end
# Usage example:
model = Domain.new
model.name = 'hello-world'
generate_alternative_names(model, :name, 5)
# => ["hello_world", "hello-world2", "world_hello", ...]
It generates 100 candidates, then checks the database for matches and removes them from the candidate list. Finally, it returns the first count values that remain.
This method is a best-effort implementation: it works for small sets of suggestions that have few conflicts (in my case, up to 100 conflicts).
Even if I increase this magic number (100), it does not scale indefinitely.
Do you know a way to improve this so that it scales to a large number of conflicts without relying on magic numbers?
I would go with the reverse approach: query the database for existing records using LIKE, then generate suggestions while skipping the names that are already taken:
def alternatives(model, column, word, count)
  # Bind the pattern as a parameter instead of interpolating user input into the SQL
  taken = model.class.where("#{column} LIKE ?", "%#{word}%").pluck(column)
  count.times.map do |i|
    generate_candidates_using_a_certain_strategy(i, taken)
  end
end
Change generate_candidates_using_a_certain_strategy so that it receives an array of already-taken names to skip. There is one possible glitch: a race condition in which two requests take the same name. I don't think it will cause real problems, though, since you can always apologize to the user if the actual creation fails.
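One way to drop the magic number is to make the candidate strategy an endless, lazily evaluated stream and keep taking from it until enough free names are found. The sketch below is only an illustration under assumptions: candidate_names and its particular strategy are hypothetical, and taken stands in for the array that the LIKE query above would return.

```ruby
require 'set'

# Hypothetical strategy: an endless stream of candidate names for +base+.
def candidate_names(base)
  words = base.split(/[,\s\-_]+/).reject(&:empty?)
  Enumerator.new do |out|
    out << words.join('_')
    out << words.join('-')
    out << words.reverse.join('_')
    # Numbered suffixes never run out, so no upper bound is needed.
    (2..Float::INFINITY).each { |n| out << "#{base}#{n}" }
  end
end

# Returns the first +count+ distinct candidates that are not in +taken+.
def free_alternatives(base, taken, count)
  taken = taken.to_set
  candidate_names(base).lazy
                       .reject { |name| taken.include?(name) }
                       .uniq
                       .first(count)
end

taken = %w[hello-world hello_world hello-world2] # stand-in for the LIKE query result
suggestions = free_alternatives('hello-world', taken, 3)
# => ["world_hello", "hello-world3", "hello-world4"]
```

Because the stream is lazy, only as many candidates are generated as are needed to find count free names, however many conflicts exist.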
This is a follow-up to this last question I asked: Sort Users by Number of Followers. That code is:
@ordered_users = User.all.sort { |a, b| b.followers.count <=> a.followers.count }
What I hope to accomplish is to take the ordered users, get the top 100, and then randomly choose 5 of those 100. Is there a way to accomplish this?
Thanks.
users_in_descending_order_of_followers = User.all.sort_by { |u| -u.followers.count }
sample_of_top = users_in_descending_order_of_followers.take(100).sample(5)
You can use sort_by which can be easier to use than sort, and combine take and sample to get the top 100 users and sample 5 of those users.
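As a self-contained illustration of the take-then-sample step, here is a sketch on plain Ruby objects rather than ActiveRecord models. FakeUser and its followers_count attribute are made up for the example, and max_by(100) is used as an alternative to sorting the whole list and taking the first 100.

```ruby
# Hypothetical stand-in for a User record with a known follower count.
FakeUser = Struct.new(:name, :followers_count)

users = (1..500).map { |i| FakeUser.new("user#{i}", i) }

# The 100 users with the most followers, without sorting the full list.
top_hundred = users.max_by(100, &:followers_count)

# Five distinct users picked at random from those 100.
picked = top_hundred.sample(5)
```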
User.all.sort can cause problems in the long run, depending on the total number of users and the available resources, particularly memory. It is also a lot slower: the sort block calls .followers.count twice per comparison, and every call issues its own COUNT query, so the number of queries is at least on the order of 2×N, N being the number of users. On top of that, User.all.sort executes the User.all query immediately and loads every User record into memory, whereas a bare User.all is lazy and runs nothing until you enumerate it (for example with .each or, better yet, .find_each).
I suggest something like below (I extended Deekshith's answer referring to your link to the other question):
User.joins(:followers).group('users.id').order(Arel.sql('COUNT(followers.user_id) DESC')).limit(100).sample(5)
The .joins, .order, and .limit calls above are combined into a single SQL query, which is then executed. Only after that does .sample(5) run; at that point it is no longer SQL, just a plain Ruby method called on the loaded records, and it yields the result you need.
I would strongly consider using a counter cache on the User model, to hold the count of followers.
This would have a very small performance cost when adding or removing followers, and would greatly improve performance when sorting:
User.order(followers_count: :desc)
This would be particularly noticeable if you wanted the top-n users by follower count, or wanted to find users with no followers.
User.order(followers_count: :desc).limit(100).sample(5)
This method will out-perform others using count(*). Add an index on followers_count for best effect.
I am developing a Rails app and one of my actions compares two of the same kind of objects and returns a decimal value between 0 and 1. There are roughly 800 objects that need to be compared, thus there are roughly 800*800 possible decimal values that can be returned. Each action call requires about 300 or so comparisons, which are made via an API.
Because of the number of API calls that are needed, I have decided that the best approach is to make a lookup table with all 800*800 API comparison values stored locally, to avoid having to rely on the API, which has call limits and a significant overhead per call.
Basically I have decided that a lookup table best suits this task (although I am open for suggestions on this too).
My question is this: what is the best way to implement a two-dimensional lookup table with ~800 "rows" and ~800 "columns" in Rails? For example, if I wanted to compare objects 754 and 348, would it be best to create models for the rows and columns and access the decimal comparison like:
object1.754.object2.348 # => 0.8738
Or should I store all of the values in a CSV or something like this? If this is the better approach, how should I even approach setting this up? I am relatively new to the rails world so apologies if an obvious answer is dangling in front of me!
Bear in mind that the entire point of this approach was to avoid the overheads from API calls and thus avoid large waiting times for the end user, so I am looking for the most time-efficient way to approach this task!
I would consider a hash of hashes, so you retrieve the values with:
my_hash[754][348]
=> 0.8738
If the value for a particular combination might not have been loaded yet, be careful to use:
my_hash[754].try(:[],348)
There could be some subtleties in the implementation, to do with loading the hash, that make it beneficial to use the hashie gem.
https://rubygems.org/gems/hashie/versions/3.4.2
If you want to persist the values, the hash can be written to a database using serialize, and you can also extend the approach to put expiry dates on the values if you wish.
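A minimal sketch of the hash-of-hashes lookup in plain Ruby. The helper names are hypothetical, and Hash#dig (Ruby 2.3+) is used here as an alternative to try for combinations that have not been stored yet.

```ruby
# Store one comparison value under table[a][b], creating the row on demand.
def store_score(table, a, b, value)
  (table[a] ||= {})[b] = value
end

# Returns the stored value, or nil when the pair has not been loaded.
def lookup_score(table, a, b)
  table.dig(a, b)
end

scores = {}
store_score(scores, 754, 348, 0.8738)

lookup_score(scores, 754, 348) # => 0.8738
lookup_score(scores, 754, 999) # => nil (row exists, column missing)
lookup_score(scores, 1, 2)     # => nil (row missing)
```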
I had a similar problem comparing the contents of a large collection of ebooks. I stored the already-compared results in a matrix that I serialise with Marshal; the lookup key is a two-element array of the MD5 digests of the two file paths.
Here is the Matrix class I created for this task.
require 'digest/md5'

class Matrix
  attr_accessor :path, :store

  def initialize(path)
    @path = path
    # Load a previously saved store, or start empty if the file is missing or unreadable
    @store = File.open(@path, 'rb') { |f| Marshal.load(f.read) } rescue Hash.new(nil)
  end

  # Write the store to disk; returns self so calls can be chained
  def save
    File.open(@path, 'wb') { |f| f.write(Marshal.dump(@store)) }
    self
  end

  def add(file1, file2, value)
    @store[[Digest::MD5.hexdigest(file1), Digest::MD5.hexdigest(file2)]] = value
  end

  def has?(file1, file2)
    !@store[[Digest::MD5.hexdigest(file1), Digest::MD5.hexdigest(file2)]].nil?
  end

  def value(file1, file2)
    @store[[Digest::MD5.hexdigest(file1), Digest::MD5.hexdigest(file2)]]
  end

  def each(&blk)
    @store.each(&blk)
  end
end
Intro
I'm doing a system where I have a very simple layout only consisting of transactions (with basic CRUD). Each transaction has a date, a type, a debit amount (minus) and a credit amount (plus). Think of an online banking statement and that's pretty much it.
The issue I'm having is keeping my controller skinny and worrying about possibly over-querying the database.
A Simple Report Example
The total debit over the chosen period e.g. SUM(debit) as total_debit
The total credit over the chosen period e.g. SUM(credit) as total_credit
The overall total e.g. total_credit - total_debit
The report must allow a dynamic date range e.g. where(date BETWEEN 'x' and 'y')
The date range would never be more than a year and will only be a max of say 1000 transactions/rows at a time
So in the controller I create:
def report
  @d = Transaction.select("SUM(debit) as total_debit").where("date BETWEEN 'x' AND 'y'")
  @c = Transaction.select("SUM(credit) as total_credit").where("date BETWEEN 'x' AND 'y'")
  @t = @c.credit_total - @d.debit_total
end
Additional Question Info
My actual report has closer to 6 or 7 database queries (e.g. pulling out the total credit/debit as per type == 1 or type == 2 etc) and has many more calculations e.g totalling up certain credit/debit types and then adding and removing these totals off other totals.
I'm trying my best to adhere to 'fat model, skinny controller', but I'm having issues with the number of variables my controller needs to pass to the view. Rails seemed very straightforward up until the point where you create variables to pass to the view. I don't see how else to do it apart from putting the variable-creating lines into the controller and making it 'skinnier' by moving some of the query logic into the model.
Is there something I'm missing where you create variables in the model and then have the controller pass those to the view?
A more idiomatic way of writing your query in ActiveRecord would probably be something like:
class Transaction < ActiveRecord::Base
  def self.within(start_date, end_date)
    where(:date => start_date..end_date)
  end

  def self.total_credit
    sum(:credit)
  end

  def self.total_debit
    sum(:debit)
  end
end
This would mean issuing 3 queries in your controller, which should not be a big deal if you create database indices, and limit the number of transactions as well as the time range to a sensible amount:
@transactions = Transaction.within(start_date, end_date)
@total = @transactions.total_credit - @transactions.total_debit
Finally, you could also use Ruby's Enumerable#reduce method to compute your total by directly traversing the list of transactions retrieved from the database.
@total = @transactions.reduce(0) { |memo, t| memo + (t.credit - t.debit) }
For very small datasets this might be faster, as you would hit the database only once. However, I reckon the first approach is preferable, and it will certainly perform better as the number of records in your database grows.
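To make the single-pass idea concrete, the following sketch totals debits and credits over rows that are already in memory. LedgerRow is a hypothetical stand-in for a loaded Transaction, and amounts are kept as integer cents purely to keep the arithmetic in the example exact.

```ruby
# Stand-in for a Transaction record that has already been loaded.
LedgerRow = Struct.new(:type, :debit, :credit)

rows = [
  LedgerRow.new(1, 0, 5_000),
  LedgerRow.new(2, 1_250, 0),
  LedgerRow.new(1, 300, 0)
]

# One traversal accumulates both sums.
totals = rows.each_with_object(Hash.new(0)) do |row, acc|
  acc[:debit]  += row.debit
  acc[:credit] += row.credit
end

net = totals[:credit] - totals[:debit] # => 3450
```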
I'm putting in params[:year_start]/params[:year_end] for x and y, is that safe to do?
You should never embed params[:anything] directly in a query string. Instead use this form:
where("date BETWEEN ? AND ?", params[:year_start], params[:year_end])
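Beyond binding the values, it can help to turn the raw strings into real dates before they reach the query. A sketch, assuming the parameters arrive as ISO 8601 strings such as '2015-01-01'; the helper name date_range_from is hypothetical, and the resulting Range could be passed to a hash condition such as where(date: range).

```ruby
require 'date'

# Returns a Range of Dates, or nil if either value is missing or malformed,
# or if the start date is after the end date.
def date_range_from(start_param, end_param)
  start_date = Date.iso8601(start_param)
  end_date   = Date.iso8601(end_param)
  start_date <= end_date ? (start_date..end_date) : nil
rescue ArgumentError, TypeError
  nil
end

range = date_range_from('2015-01-01', '2015-12-31')
date_range_from('2015-01-01', "'; DROP TABLE transactions; --") # => nil
date_range_from(nil, '2015-12-31')                              # => nil
```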
My actual report probably has closer to 5 database calls and then 6 or 7 calculations on those variables, should I just be querying the date range once and then doing all the work on the array/hash etc?
This is a little subjective but I'll give you my opinion. Typically it's easier to scale the application layer than the database layer. Are you currently having performance issues with the database? If so, consider moving the logic to Ruby and adding more resources to your application server. If not, maybe it's too soon to worry about this.
I'm really not seeing how I would get the majority of the work/calculations into the model, I understand scopes but how would you put the date range into a scope and still utilise GET params?
Have you seen has_scope? This is a great gem that lets you define scopes in your models and have them automatically get applied to controller actions. I generally use this for filtering/searching, but it seems like you might have a good use case for it.
If you could give an example on creating an array via a broad database call and then doing various calculations on that array and then passing those variables to the template that would be awesome.
This is not a great fit for Stack Overflow and it's really not far from what you would be doing in a standard Rails application. I would read the Rails guide and a Ruby book and it won't be too hard to figure out.
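For what it's worth, one common way to keep the number of instance variables down is to wrap the loaded rows in a small plain Ruby object and hand only that object to the view. The sketch below is hypothetical throughout: TransactionReport, Entry and the integer type codes are invented for illustration, and the entries would normally come from a single date-range query.

```ruby
# Stand-in for a loaded Transaction row.
Entry = Struct.new(:type, :debit, :credit)

class TransactionReport
  def initialize(entries)
    @entries = entries
  end

  def total_debit
    @entries.sum(&:debit)
  end

  def total_credit
    @entries.sum(&:credit)
  end

  def total
    total_credit - total_debit
  end

  # Net amount (credit minus debit) per transaction type.
  def totals_by_type
    @entries.group_by(&:type).transform_values do |rows|
      rows.sum { |row| row.credit - row.debit }
    end
  end
end

report = TransactionReport.new([
  Entry.new(1, 0, 5_000),
  Entry.new(2, 1_250, 0),
  Entry.new(1, 300, 0)
])
report.total          # => 3450
report.totals_by_type # => {1=>4700, 2=>-1250}
```

In a controller this would become a single assignment to one instance variable, with the view calling methods such as total on it.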
I'm trying to figure out the best way to model a simple lookup in my rails app.
I have a model, Coupon, which can be of two "types", either Percent or Fixed Amount. I don't want to build a database table around the "type" lookup, so what I'd like to do is just have a coupon_type(integer) field on the Coupon table which can be either 1 (for Percent) or 2 (for Fixed).
What is the best way to handle this?
I found this: Ruby on Rails Static List Options for Drop Down Box but not sure how that would work when I want each value to have two fields, both ID and Description.
I'd like this to populate a select list as well.
Thank you for the feedback!
If this is really unlikely to change, or if it does change it will be an event significant enough to require redeployment, the easiest approach is to have a constant that defines the conditions:
COUPON_TYPES = {
  :percent => 1,
  :fixed => 2
}
COUPON_TYPES_FOR_SELECT = COUPON_TYPES.to_a
The first constant defines a forward mapping; the second provides the same data in a format suitable for options_for_select.
It's important to note that this sort of thing would take, at most, a millisecond to read from a database. Unless you're rendering hundreds of forms per second it would hardly impact performance.
Don't worry about optimizing things that aren't perceptible problems.
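Since the question asks for both an ID and a description per type, here is a small sketch that extends the constant idea. The CouponTypes module, its label strings and its helper methods are hypothetical; the only assumption carried over from the answer is the 1/2 integer coding.

```ruby
module CouponTypes
  IDS    = { percent: 1, fixed: 2 }.freeze
  LABELS = { percent: 'Percent', fixed: 'Fixed Amount' }.freeze

  # [text, value] pairs, the shape that select helpers expect.
  def self.for_select
    IDS.map { |key, id| [LABELS.fetch(key), id] }
  end

  # Reverse lookup: stored integer -> symbolic name (nil if unknown).
  def self.name_for(id)
    IDS.key(id)
  end
end

CouponTypes.for_select  # => [["Percent", 1], ["Fixed Amount", 2]]
CouponTypes.name_for(2) # => :fixed
```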
I have the following code for
h2.each {|k, v|
  @count += 1
  puts @count
  sq.each do |word|
    if Wordsdoc.find_by_docid(k).tf.include?(word)
      sum += Wordsdoc.find_by_docid(k).tf[word] * @s[word]
    end
  end
  rec_hash[k] = sum
  sum = 0
}
h2 -> a hash that contains document ids; it holds more than 1,000 of them
Wordsdoc -> a model/table in my database...
sq -> a hash that contains around 10 words
What I'm doing is going through each of the document ids and then, for each word in sq, looking up in the Wordsdoc table whether the word exists (Wordsdoc.find_by_docid(k).tf.include?(word)); here tf is a hash of {word => value}.
If it does, I get the value of that word in Wordsdoc and multiply it by the value of the word in @s, which is also a hash of {word => value}.
This seems to be running very slowly. It processes one document per second. Is there a way to make this faster?
Thanks, I really appreciate your help on this!
You do a lot of duplicate querying. While ActiveRecord can do some caching in the background to speed things up, there is a limit to what it can do, and there is no reason to make things harder for it.
The most obvious cause of the slowdown is Wordsdoc.find_by_docid(k). For each value of k you call it once per word in sq (about 10 times), and every time the word is found you call it a second time. That means you call the method with the same argument 10-20 times for each entry in h2. Queries to the database are expensive, since the database lives on the hard disk, and accessing the hard disk is expensive in any system. You can just as easily call Wordsdoc.find_by_docid(k) once, before you enter the sq.each loop, and store the result in a variable. That would save a lot of querying and make your loop much faster.
Another optimization, though not nearly as important as the first one, is to get all the Wordsdoc records in a single query. Almost all mid- to high-level (and some low-level, too!) programming languages and libraries work better and faster when they work in bulk, and ActiveRecord is no exception. If you query for all the Wordsdoc entries whose docid is among h2's keys, you turn 1,000 queries (after the first optimization; before it, 10,000-20,000 queries) into a single, large query. That lets ActiveRecord and the underlying database retrieve your data in bigger chunks and saves you a lot of disk access.
There are some more minor optimizations you could make, but the two I've described should be more than enough.
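The two optimizations can be sketched without a database. In the example below everything is a stand-in: DocRecord plays the role of a Wordsdoc row, records is what one bulk query filtered by h2's keys might return, and weights plays the role of the @s hash. Words with no weight are treated as contributing zero, which is an assumption of the sketch, and Array#to_h with a block needs Ruby 2.6 or later.

```ruby
# Stand-in for a Wordsdoc record: a document id plus its {word => value} hash.
DocRecord = Struct.new(:docid, :tf)

# Pretend these came back from ONE bulk query filtered by the ids in h2.
records = [
  DocRecord.new(1, { 'ruby' => 3, 'sql' => 1 }),
  DocRecord.new(2, { 'rails' => 2, 'ruby' => 1 })
]
tf_by_docid = records.to_h { |rec| [rec.docid, rec.tf] }

doc_ids     = [1, 2, 3]                         # keys of h2
query_words = %w[ruby rails sql]                # words in sq
weights     = { 'ruby' => 2.0, 'rails' => 1.5 } # the @s hash

scores = doc_ids.to_h do |id|
  tf = tf_by_docid.fetch(id, {})                # one in-memory lookup per document
  [id, query_words.sum { |w| tf.fetch(w, 0) * weights.fetch(w, 0) }]
end
# => {1=>6.0, 2=>5.0, 3=>0.0}
```

Each document's tf hash is looked up once and reused for every word, so the inner loop touches only in-memory hashes.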
You're calling Wordsdoc.find_by_docid(k) twice.
You could refactor the code to:
wordsdoc = Wordsdoc.find_by_docid(k)
if wordsdoc.tf.include?(word)
  sum += wordsdoc.tf[word] * @s[word]
end
...but still it will be ugly and inefficient.
You should prefetch all records in batches, see: https://makandracards.com/makandra/1181-use-find_in_batches-to-process-many-records-without-tearing-down-the-server
For example, something like this should be much more efficient:
# find_each loads the matching records in batches and yields them one at a time
Wordsdoc.where(:docid => array_of_doc_ids).find_each do |wordsdoc|
  if wordsdoc.tf.include?(word)
    sum += wordsdoc.tf[word] * @s[word]
  end
end
You can also load only the columns you need from the Wordsdoc table, for example by adding .select(:id, :docid, :tf) to the query; keep the primary key among the selected columns, since batching relies on it.
As you have a lot going on, I'm just going to offer up a few things to check out.
A book called Eloquent Ruby deals with documents and iterating through them to count the number of times a word was used. All of the author's examples are about a document system he was maintaining, so it could even help with other problems you run into.
inject is a method that could speed up what you're looking to do for the sum part, maybe.
Use Delayed Job for the whole thing if you can do this asynchronously. If this is a web app, the request must be timing out if it waits 1,000 seconds for this job to complete before showing its answers on the screen.
Go get em.