I have a list of words and want to find which ones already exist in the database.
Instead of making tens of SQL queries, I decided to use "SELECT word FROM table WHERE word IN(array_of_words)" and then loop through the result.
The problem is database collation.
http://www.collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html
There are many different characters, which MySQL treats as the same. However, in Ruby code string1 would not be equal to string2.
For example: if the word is "šuo", the database might also return "suo" if it exists (and that's OK), but when I check whether anything was found for "šuo", Ruby of course returns false (šuo != suo).
So, is there any way to compare two strings in Ruby in terms of the same collation?
I've used iconv like this for something similar:
require 'iconv'
class String
def to_ascii_iconv
Iconv.new('ASCII//IGNORE//TRANSLIT', 'UTF-8').iconv(self).unpack('U*').select { |cp| cp < 127 }.pack('U*')
end
end
puts 'suo'.to_ascii_iconv
# => suo
puts 'šuo'.to_ascii_iconv
# => suo
puts 'suo'.to_ascii_iconv == 'šuo'.to_ascii_iconv
# => true
Hope that helps!
Zubin
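On Ruby versions where iconv is no longer bundled (it left the standard library in 2.0), a rough alternative is to compare normalized keys instead of raw strings. The sketch below assumes Ruby 2.4 or later and only approximates utf8_general_ci: it strips combining accents and case, and it will not reproduce every equivalence in the collation chart (letters such as "ø" or "ß" have no decomposition). The helper name collation_key is made up for illustration.

```ruby
# Build an approximate accent- and case-insensitive comparison key.
# Assumption: this mimics only part of what utf8_general_ci treats as equal.
def collation_key(str)
  str.unicode_normalize(:nfd) # split "š" into "s" + combining caron
     .gsub(/\p{Mn}/, '')      # drop the combining marks
     .downcase
end

words   = ['šuo', 'Katė', 'namas']  # words we asked the database about
db_rows = ['suo', 'kate']           # what the IN (...) query returned

found   = db_rows.map { |w| collation_key(w) }
matches = words.select { |w| found.include?(collation_key(w)) }
```

Here matches contains the words whose keys were found among the returned rows, even though the raw strings differ.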
I am working on a legacy Rails project that relies on Ruby version 1.8
I have a string that looks like this:
my_str = "a,b,c"
I would like to convert it to
value_list = "('a','b','c')"
so that I can directly use it in my SQL statement like:
"SELECT * from my_table WHERE value IN #{value_list}"
I tried:
my_str.split(",")
but it returns "abc" :(
How to convert it to what I need?
To split the string you can just do
my_str.split(",")
=> ["a", "b", "c"]
The easiest way to use that in a query, is using where as follows:
Post.where(value: my_str.split(","))
This will just work as expected. But, I understand you want to be able to build the SQL-string yourself, so then you need to do something like
quoted_values_str = my_str.split(",").map{|x| "'#{x}'"}.join(",")
=> "'a','b','c'"
sql = "SELECT * from my_table WHERE value IN (#{quoted_values_str})"
Note that this is a naive approach: normally you should also escape any quotes contained in your strings, and skipping that leaves you vulnerable to SQL injection. Using where handles all those edge cases correctly for you.
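To make the quoting problem concrete, here is a small sketch in plain Ruby (no Rails needed). It assumes the input may contain a single quote, and it contrasts naive quoting with building a list of "?" placeholders whose values would be handed to the database driver separately. The variable names are illustrative only.

```ruby
my_str = "a,b,o'brien"
values = my_str.split(",")

# Naive quoting: the embedded quote ends the SQL literal early.
naive = values.map { |x| "'#{x}'" }.join(",")

# Placeholder version: the SQL text contains only "?" markers,
# one per value, so the values never become part of the SQL string.
placeholders = Array.new(values.size, "?").join(", ")
sql = "SELECT * FROM my_table WHERE value IN (#{placeholders})"
```

The naive string ends in 'o'brien', which is malformed SQL, while the placeholder string stays the same shape no matter what the values contain.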
Under no circumstances should you reinvent the wheel for this. Rails has built-in methods for constructing SQL strings, and you should use them. In this case, you want sanitize_sql_for_conditions (aliased to sanitize_sql):
my_str = "a,b,c"
conditions = sanitize_sql(["value IN (?)", my_str.split(",")])
# => value IN ('a','b','c')
query = "SELECT * from my_table WHERE #{conditions}"
This will give you the result you want while also protecting you from SQL injection attacks (and other errors related to badly formed SQL).
The correct usage may depend on which version of Rails you're using, but this method exists as far back as Rails 2.0, so it will definitely work even with a legacy app; just consult the docs for the version of Rails you're using.
value_list = "('#{my_str.split(",").join("','")}')"
But this is a very bad way to query. You'd better use:
Model.where(value: my_str.split(","))
The string can be manipulated directly; there is no need to convert it to an array, modify the array then join the elements.
str = "a,b,c"
"(%s)" % str.gsub(/([^,]+)/, "'\\1'")
#=> "('a','b','c')"
The regular expression reads, "match one or more characters other than commas and save to capture group 1". \\1 retrieves the contents of capture group 1 when forming gsub's replacement string.
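As a quick check of how the pattern behaves, the sketch below compares the gsub form with the split/map/join form and shows one case where they differ: an empty field. Neither form escapes quotes inside the values, so this is only about string shape, not about safe SQL.

```ruby
str = "a,b,c"

via_gsub  = "(%s)" % str.gsub(/([^,]+)/, "'\\1'")
via_split = "(%s)" % str.split(",").map { |x| "'#{x}'" }.join(",")

# [^,]+ needs at least one character, so an empty field is left unquoted
# by gsub, whereas split keeps it as an empty string and quotes it.
gap_gsub  = "(%s)" % "a,,c".gsub(/([^,]+)/, "'\\1'")
gap_split = "(%s)" % "a,,c".split(",").map { |x| "'#{x}'" }.join(",")
```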
A couple of use cases:
def full_name
[last_name, first_name].join(' ')
end
or
def address_line
[address[:country], address[:city], address[:street], address[:zip]].join(', ')
end
I'm attempting to write a Ruby method which accepts an array of strings (for example, ["EG", "K", "C"]), and returns all records from a database table where the icao_code field starts with any of those strings (for example, KORD, EGLL, and CYVR would all match). The length of the array will vary, and it will be input by a user, so it needs to be sanitized.
If I were only searching for a single string, I could do something like Airport.where("icao_code LIKE ?", "#{icao_start}%"). However, since I need to search against an arbitrary number of strings, I can't use that syntax.
Right now, I've got it working as follows:
def in_region(icao_starts)
where_clause = icao_starts.map{|i| "icao_code LIKE '#{i}%'"}.join(" OR ")
return Airport.where(where_clause)
end
However, I'm a bit worried using a setup like this with untrusted user input, since I suspect it would be vulnerable to SQL injection.
Is there a better way to get the same result in a more secure way?
You could consider something like this:
def in_region(icao_starts)
where_clause = "icao_code LIKE ? OR " * icao_starts.length
return Airport.where(where_clause.sub(/\ OR\ $/, ''), *icao_starts.map { |s| "#{s}%" })
end
This will build up a (potentially very long?) string with ? placeholders. The *icao_starts will expand that array into arguments to the where clause, so each ? will end up getting safely replaced. The sub(/\ OR\ $/, '') simply trims off the final OR (you could append 1=0 instead if you wanted).
If I were you I would also perform a .uniq on icao_starts before you do anything, truncate the array at some sensible upper length limit, and also have a whitelist of permitted values (oh, forget that, I thought users were searching by airport code). That should be pretty much infallible.
You are right about not interpolating user input into your SQL query. This is dangerous and makes your code vulnerable to SQL injection attacks.
def in_region(icao_starts)
conditions = icao_starts.map { "icao_code LIKE ?"}
Airport.where(conditions.join(' OR '), *icao_starts.map { |name| "#{name}%"})
end
It is pretty similar to bogardpd's solution but does not need a Regexp to get rid of the last " OR".
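One thing the placeholder approach does not cover is LIKE wildcards inside the user's input: a prefix such as "%" would match every row. The sketch below is plain Ruby that builds the condition string and the bind values separately, escaping % and _ along the way. It assumes backslash is the LIKE escape character (the default in MySQL and PostgreSQL); the method name, the column argument and the limit of 20 are made up for illustration.

```ruby
# Returns [sql_fragment, bind_values] for a "starts with any of" search.
# column must be a trusted, hard-coded name: it is interpolated directly.
def like_prefix_conditions(prefixes, column = "icao_code", max = 20)
  cleaned = prefixes.map(&:to_s).reject(&:empty?).uniq.first(max)
  sql = cleaned.map { "#{column} LIKE ?" }.join(" OR ")
  # Escape the LIKE metacharacters, then append the trailing wildcard.
  binds = cleaned.map { |p| p.gsub(/[\\%_]/) { |m| "\\#{m}" } + "%" }
  [sql, binds]
end

sql, binds = like_prefix_conditions(["EG", "K", "EG", "5%"])
```

With ActiveRecord the pair would be passed as Airport.where(sql, *binds). An empty input list yields an empty condition string, which the caller should handle separately.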
I'm trying to find a database agnostic way of comparing dates with active record queries. I've the following query:
UserRole.where("(effective_end_date - effective_start_date) > ?", 900.seconds)
This works fine on MySQL but produces an error on PG as the sql it generates doesn't contain the 'interval' syntax. From the console:
UserRole Load (2.0ms) SELECT "user_roles".* FROM "user_roles" WHERE "user_roles"."effective_end_date" IS NULL AND ((effective_end_date - effective_start_date) > '--- 900
...
')
ActiveRecord::StatementInvalid: PG::Error: ERROR: invalid input syntax for type interval: "--- 900
When I run this with the to_sql option I get:
irb(main):001:0> UserRole.where("effective_end_date - effective_start_date) > ?", 900.seconds).to_sql
=> "SELECT \"user_roles\".* FROM \"user_roles\" WHERE \"user_roles\".\"effective_end_date\" IS NULL AND (effective_end_date - effective_start_date) > '--- 900\n...\n')"
All help appreciated.
If your effective_end_date and effective_start_date columns really are dates then your query is pointless because dates have a minimum resolution of one day and 900s is quite a bit smaller than 86400s (AKA 24*60*60 or 1 day). So I'll assume that your "date" columns are actually datetime (AKA timestamp) columns; if this is true then you might want to rename the columns to avoid confusion during maintenance: effectively_starts_at and effectively_ends_at would probably be good matches for the usual Rails conventions. If this assumption is invalid then you should change your column types or stop using 900s.
Back to the real problem. ActiveRecord converts Ruby values to SQL values using the ActiveRecord::ConnectionAdapters::Quoting#quote method:
def quote(value, column = nil)
# records are quoted as their primary key
return value.quoted_id if value.respond_to?(:quoted_id)
case value
#...
else
"'#{quote_string(YAML.dump(value))}'"
end
end
So if you try to use something as a value for a placeholder and there isn't any specific handling built in for that type, then you get YAML (a bizarre choice of defaults IMO). Also, 900.seconds is an ActiveSupport::Duration object (despite what 900.seconds.class says) and the case value has no branch for ActiveSupport::Duration so 900.seconds will get YAMLified.
The PostgreSQL adapter provides its own quote in ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#quote but that doesn't know about ActiveSupport::Duration either. The MySQL adapter's quote is also ignorant of ActiveSupport::Duration. You could monkey patch some sense into these quote methods. Something like this in an initializer:
class ActiveRecord::ConnectionAdapters::PostgreSQLAdapter
# Grab an alias for the standard quote method
alias :std_quote :quote
# Bludgeon some sense into things
def quote(value, column = nil)
return "interval '#{value.to_i} seconds'" if(value.is_a?(ActiveSupport::Duration))
std_quote(value, column)
end
end
With that patch in place, you get intervals that PostgreSQL understands when you use an ActiveSupport::Duration:
> Model.where('a - b > ?', 900.seconds).to_sql
=> "SELECT \"models\".* FROM \"models\" WHERE (a - b > interval '900 seconds')"
> Model.where('a - b > ?', 11.days).to_sql
=> "SELECT \"models\".* FROM \"models\" WHERE (a - b > interval '950400 seconds')"
If you add a similar patch to the MySQL adapter's quote (which is left as an exercise for the reader), then things like:
UserRole.where("(effective_end_date - effective_start_date) > ?", 900.seconds)
will do The Right Thing in both PostgreSQL and MySQL and your code won't have to worry about it.
That said, developing and deploying on different databases is a really bad idea that will make Santa Claus cry and go looking for some coal (possibly laced with arsenic, possibly radioactive) for your stocking. So don't do that.
If on the other hand you're trying to build database-agnostic software, then you're in for some happy fun times! Database portability is largely a myth and database-agnostic software always means writing your own portability layer on top of the ORM and database interfaces that your platform provides. You will have to exhaustively test everything on each database you plan to support: everyone pays lip service to the SQL Standard but no one seems to fully support it, and everyone has their own extensions and quirks to worry about. You will end up writing your own portability layer that will consist of a mixture of utility methods and monkey patches.
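As an illustration of what one such utility method might look like, here is a sketch that avoids binding a duration object at all: it picks a per-database SQL expression for "seconds between two columns" and leaves a plain integer to be bound. The method name is hypothetical, and the adapter names are assumed to look like the strings ActiveRecord reports ("PostgreSQL", "Mysql2"). EXTRACT(EPOCH FROM ...) and TIMESTAMPDIFF are the PostgreSQL and MySQL functions respectively.

```ruby
# Pick a database-specific expression for the number of seconds
# between two timestamp columns. Column names must be trusted,
# hard-coded values because they are interpolated directly.
def seconds_between_sql(adapter_name, start_col, end_col)
  case adapter_name.to_s
  when /postgres/i
    "EXTRACT(EPOCH FROM (#{end_col} - #{start_col}))"
  when /mysql/i
    "TIMESTAMPDIFF(SECOND, #{start_col}, #{end_col})"
  else
    raise ArgumentError, "no duration SQL known for #{adapter_name}"
  end
end

pg_condition    = "#{seconds_between_sql('PostgreSQL', 'effective_start_date', 'effective_end_date')} > ?"
mysql_condition = "#{seconds_between_sql('Mysql2', 'effective_start_date', 'effective_end_date')} > ?"
```

The placeholder would then be bound to an integer such as 900 (or 900.seconds.to_i), which every adapter knows how to quote.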
I need to store a regular expression related to other fields in a database table with ActiveRecord.
I found the to_s method in Regexp class which states
Returns a string containing the regular expression and its options
(using the (?opts:source) notation. This string can be fed back in to
Regexp::new to a regular expression with the same semantics as the
original. (However, Regexp#== may not return true when comparing the
two, as the source of the regular expression itself may differ, as the
example shows). Regexp#inspect produces a generally more readable
version of rxp.
So it seems a working solution, but it will store the expression with an unusual syntax, and to get the string to store I need to build it manually with /my-exp/.to_s. Also, I may not be able to edit the regexp directly. For instance, a simple regexp produces:
/foo/i.to_s # => "(?i-mx:foo)"
The other option is to eval the field content: I could store the plain expression in the db column and then do an eval(record.pattern) to get the actual regexp. This works, and since I'm the only one responsible for managing the regexp records there should be no issues in doing that, except application bugs ;-)
Do I have other options? I'd prefer not to eval db fields, but on the other hand I don't want to work with a syntax I don't know.
use serialize to store your regex 'as-is'
class Foo < ActiveRecord::Base
serialize :my_regex, Regexp
end
see the API doc to learn more about this.
Not sure I understand your constraints exactly.
If you store a string in db, you could make a Regexp from it:
a = 'foo'
=> "foo"
/#{a}/
=> /foo/
Regexp.new('dog', Regexp::EXTENDED + Regexp::MULTILINE + Regexp::IGNORECASE)
=> /dog/mix
There are other constructors, see doc.
The very best solution to not use eval'd code is to store the regexp part in a string column and flags in a separate integer column. In this way the regexp can be built with:
record = Record.new pattern: 'foo', flags: Regexp::IGNORECASE
Regexp.new record.pattern, record.flags # => /foo/i
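The round trip can be tried without a database. In the sketch below a Struct stands in for the ActiveRecord model, and the two helper names are made up for illustration. Regexp#source and Regexp#options supply the two column values, and an invalid stored pattern is turned into nil instead of an exception.

```ruby
# Stand-in for a model with a string column and an integer column.
Record = Struct.new(:pattern, :flags)

# Split a Regexp into the two values to store.
def regexp_to_record(regexp)
  Record.new(regexp.source, regexp.options)
end

# Rebuild the Regexp; returns nil when the stored pattern is invalid.
def record_to_regexp(record)
  Regexp.new(record.pattern, record.flags)
rescue RegexpError
  nil
end

record  = regexp_to_record(/fo+/i)
rebuilt = record_to_regexp(record)
broken  = record_to_regexp(Record.new("(", 0))
```

Keeping the source and the flags apart means the pattern column stays editable as plain text such as fo+, without the (?i-mx:...) wrapper that to_s produces.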
You can use #{} within regular expressions to insert variables, so you could insert a carefully cleaned regexp by storing "foo" in the db under record.pattern as a string, and then evaluating it with:
/#{record.pattern}/
So, in the db, you would store:
"pattern"
in your code, you could do:
if record.other_field =~ /#{record.pattern}/
# do something
end
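This compiles the regexp from a dynamic string in the db that you can change, and allows you to use it in code. Be careful with it, though:
Interpolation only inserts the text of the string, so Ruby code stored in the column is not executed the way it would be with eval; in the example below the assignment runs only because the #{} is written in a string literal in the source. The stored pattern is still untrusted input, though: it can be invalid or pathologically slow to match, so treat it with similar caution: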
a = "foo"
puts a
=> foo
b = "#{a = 'bar'}"
a =~ /#{b}/
puts a
=> bar
You might be better to consider whether for security it is worth decomposing your regex tests into something you can map to methods which you write in the code, so you could store keys in the db for constraints, something like:
'alpha,numeric' etc.
And then have hard-coded tests which you run depending on the keys stored. Perhaps look at rails validations for hints here, although those are stored in code, it's probably the best approach (generalise your requirements, and keep the code out of the db). Even if you don't think you need security now, you might want it later, or forget about this and grant access to someone malicious.
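A minimal sketch of that idea follows, assuming the column holds a comma-separated list of keys and that a value must satisfy every listed constraint. The key 'short' and the method name are invented for the example; only 'alpha' and 'numeric' come from the text above. It needs Ruby 2.4 or later for String#match?.

```ruby
# Hard-coded checks, looked up by the keys stored in the database.
CHECKS = {
  'alpha'   => ->(s) { s.match?(/\A[[:alpha:]]+\z/) },
  'numeric' => ->(s) { s.match?(/\A\d+\z/) },
  'short'   => ->(s) { s.length <= 10 }
}.freeze

# keys is the stored string, e.g. "alpha,short".
# An unknown key raises KeyError rather than being silently ignored.
def satisfies?(value, keys)
  keys.split(',').all? { |key| CHECKS.fetch(key).call(value) }
end
```

Because the database only ever stores key names, nothing in the column is compiled or evaluated.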
I'm trying to limit the number of times I do a mysql query, as this could end up being 2k+ queries just to accomplish a fairly small result.
I'm going through a CSV file, and I need to check that the format of the content in the csv matches the format the db expects, and sometimes I try to accomplish some basic clean-up (for example, I have one field that is a string, but is sometimes in the csv as jb2003-343, and I need to strip out the -343).
The first thing I do is get from the database the list of fields by name that I need to retrieve from the csv, then I get the index of those columns in the csv, then I go through each line in the csv and get each of the indexed columns
get_fields = BaseField.find(:all, :conditions=>['group IN (?)',params[:group_ids]])
csv = CSV.read(csv.path)
# CSV.read already returns each row as an array, so no split is needed
first_line = csv.first
# declared outside the block so they survive between rows
col_indexes = []
csv_data = []
csv.each_with_index do |row, i|
  if i == 0
    # header row: remember where each wanted column sits
    get_fields.each do |col|
      col_indexes << row.index(col.name)
    end
  else
    csv_row = []
    col_indexes.each do |col|
      #possibly check the value here against another mysql query but that's ugly
      csv_row << row[col]
    end
    csv_data << csv_row
  end
end
The problem is that when I'm adding the content of the csv_data for output, I no longer have any connection to the original get_fields query. Therefore, I can't seem to say 'does this match the type of data expected from the db'.
I could work my way back through the same process that got me down to that level, and make another query like this
get_cleanup = BaseField.find_by_csv_col_name(first_line[col])
if get_cleanup.format==row[col].is_a
csv_row << row[col]
else
# do some data clean-up
end
but as I mentioned, that could mean the get_cleanup is run 2000+ times.
Instead of doing this, is there a way to search within the original get_fields result for the name, and then get the associated field?
I tried searching for 'search rails object', but kept getting back results about building search, not searching within an already existing object.
I know I can do array.search, but don't see anything in the object api about search.
Note: The code above may not be perfect, because I'm not running it yet, just wrote that off the top of my head, but hopefully it gives you the idea of what I'm going for.
When you populate your col_indexes array, rather than storing a single value, you can store a hash which includes index and the datatype.
get_fields.each do |col|
col_info = {:row_index => row.index(col.name), :name => col.name, :format => col.format}
col_indexes << col_info
end
You can then access all your data in the for loop
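Another way to avoid the repeated queries is to index the original get_fields result by name once and look fields up in that hash. The sketch below uses a Struct as a stand-in for the BaseField records and invented sample data, so it runs without Rails; in a Rails app get_fields.index_by(&:name) builds the same hash.

```ruby
# Stand-in for the records returned by the single get_fields query.
Field = Struct.new(:name, :format)
get_fields = [Field.new('id', 'integer'), Field.new('code', 'string')]

# One pass over the result: field name => field record.
fields_by_name = get_fields.each_with_object({}) { |f, h| h[f.name] = f }

header = ['code', 'ignored', 'id']     # first line of the CSV
rows   = [['jb2003-343', 'x', '7']]    # remaining lines

csv_data = rows.map do |row|
  header.each_with_index.map do |name, i|
    field = fields_by_name[name]       # in-memory lookup, no query
    next unless field                  # skip columns we did not ask for
    { :name => name, :format => field.format, :value => row[i] }
  end.compact
end
```

Each cell now carries the expected format next to its value, so the clean-up step can compare the two without going back to the database.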