Accent Insensitive ordering in Sphinx - ruby-on-rails

I am using Sphinx with the Thinking Sphinx plugin to search my data. I am using MySQL.
My data contains accented chars ("á", "é", "ã") and I want them to be equivalent to their non-accented counterparts ("a", "e", "a", for example) when searching and ordering.
I got the search working using a charset table (pastie.org/204316), and a search for "AGUA" returns "ÁGUA", but the ordering of the results is not working properly. In a search for "AGUA", "ÁGUA" comes after "MUITA ÁGUA", for example, but I wanted it to be sorted as if it were written with an "A", not an "Á".
The only solution I can think of is to index a new column containing the non-accented chars and use it for sorting, using the REPLACE (http://dev.mysql.com/doc/refman/5.4/en/string-functions.html#function_replace) MySQL function to strip the accented chars, but I would need one call to REPLACE for each possible accented char (and there are many), and it seems to me a not very maintainable workaround.
Does anybody know a better way to handle this issue?
Thanks!

Sphinx handles sorting on string fields by storing all the values in a list, sorting the list and then storing the index of each string as an int attribute. According to the docs the sorting of this list is done at a byte level and currently isn't configurable.
Ideally the strings should be sorted differently, depending on the encoding and locale. For instance, if the strings are known to be Russian text in KOI8R encoding, sorting the bytes 0xE0, 0xE1, and 0xE2 should produce 0xE1, 0xE2 and 0xE0, because in KOI8R value 0xE0 encodes a character that is (noticeably) after characters encoded by 0xE1 and 0xE2. Unfortunately, Sphinx does not support that at the moment and will simply sort the strings bytewise.
-- from http://www.sphinxsearch.com/docs/current.html
So, no easy way to achieve this within Sphinx. A modification to your REPLACE() based idea would be to have a separate column and populate it using a callback in your model. This would let you handle the replace in Ruby instead of MySQL, an arguably more maintainable solution.
# Save an unaccented copy of your title. Normalise method borrowed from
# http://stackoverflow.com/questions/522715/removing-accents-diacritics-from-string-while-preserving-other-special-chars-tri
class MyModel < ActiveRecord::Base
  before_validation :update_sort_col

  private

  def update_sort_col
    # assign to the attribute (self.sort_col), not a local variable, or the
    # value is discarded; decompose accented chars, then strip the non-ASCII bytes
    self.sort_col = title.to_s.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n, '').to_s
  end
end

You can also use a special index for this; you don't even need a new column in your db:
indexes "LOWER(title)", :as => :title, :sortable => true
It's raw SQL, so you can call your REPLACE there as well.
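For example, a sketch chaining REPLACE calls inside the indexed expression (the accent pairs shown are illustrative; you would list each one you need):
indexes "LOWER(REPLACE(REPLACE(REPLACE(title, 'á', 'a'), 'é', 'e'), 'ã', 'a'))", :as => :title, :sortable => true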

Just build the index on a lower-cased version with the following syntax. It's a very simple and elegant solution for case-insensitive search using Sphinx.
indexes title, as: :title, sortable: :insensitive

Related

convert my string to comma based elements

I am working on a legacy Rails project that relies on Ruby version 1.8
I have a string that looks like this:
my_str = "a,b,c"
I would like to convert it to
value_list = "('a','b','c')"
so that I can directly use it in my SQL statement like:
"SELECT * from my_table WHERE value IN #{value_list}"
I tried:
my_str.split(",")
but it returns "abc" :(
How can I convert it to what I need?
To split the string you can just do
my_str.split(",")
=> ["a", "b", "c"]
The easiest way to use that in a query, is using where as follows:
Post.where(value: my_str.split(","))
This will just work as expected. But I understand you want to be able to build the SQL string yourself, so then you need to do something like
quoted_values_str = my_str.split(",").map{|x| "'#{x}'"}.join(",")
=> "'a','b','c'"
sql = ""SELECT * from my_table WHERE value IN (#{quoted_values_str})"
Note that this is a naive approach: normally you should also escape any quotes contained inside your strings, and this leaves you vulnerable to SQL injection. Using where will handle all those edge cases correctly for you.
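If you do build the string yourself, you can at least let ActiveRecord quote each value for you; a sketch using the connection's quote method:
quoted_values_str = my_str.split(",").map { |x| ActiveRecord::Base.connection.quote(x) }.join(",")
#=> "'a','b','c'"
sql = "SELECT * from my_table WHERE value IN (#{quoted_values_str})"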
Under no circumstances should you reinvent the wheel for this. Rails has built-in methods for constructing SQL strings, and you should use them. In this case, you want sanitize_sql_for_assignment (aliased to sanitize_sql):
my_str = "a,b,c"
conditions = sanitize_sql(["value IN (?)", my_str.split(",")])
# => value IN ('a','b','c')
query = "SELECT * from my_table WHERE #{conditions}"
This will give you the result you want while also protecting you from SQL injection attacks (and other errors related to badly formed SQL).
The correct usage may depend what version of Rails you're using, but this method exists as far back as Rails 2.0 so it will definitely work even with a legacy app; just consult the docs for the version of Rails you're using.
value_list = "('#{my_str.split(",").join("','")}')"
But this is a very bad way to query. You would be better off using:
Model.where(value: my_str.split(","))
The string can be manipulated directly; there is no need to convert it to an array, modify the array then join the elements.
str = "a,b,c"
"(%s)" % str.gsub(/([^,]+)/, "'\\1'")
#=> "('a','b','c')"
The regular expression reads, "match one or more characters other than commas and save them to capture group 1". \\1 retrieves the contents of capture group 1 when forming gsub's replacement string.
A couple of use cases:
def full_name
  [last_name, first_name].join(' ')
end
or
def address_line
  [address[:country], address[:city], address[:street], address[:zip]].join(', ')
end

How to sort by salary with Ransack and ignoring the dollar symbol?

I am currently attempting to create multiple sort functions for my Rails project using the Ransack gem. The issue I am having with ransacker is that I cannot read past the format of the string, because some of the posts have a dollar symbol ($) in them, and commas as well. What I would like to do is sort the data attribute while ignoring both the conditional dollar symbol and the thousands-separator commas (which may not be included in certain cases), and append the current input from the search box.
For example:
string = "$30,000" -> parse to remove $ and leave only 30000 for the search engine to find the records that include the number & what was written in the search_form input (job.job_title). The code that I wrote is below, it may not be correct as I was trying multiple approaches. Final result: Ransack should search for "30000 marketing position"
rails view
<li>$30,000+ <%= sort_link(@q, :salary_between_30_and_40k, default_order: :desc) %></li>
job.rb
ransacker :salary_between_30_and_40k do
  Arel.sql('SELECT * FROM JOBS WHERE job.hourly_wage_salary BETWEEN 30000 AND 40000')
end
The correct approach here is to migrate your database so that salary details are stored as a numeric value rather than a string with formatting.
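A minimal sketch of such a migration, assuming PostgreSQL and the jobs.hourly_wage_salary column from the question (the class name and version number are illustrative; the :using cast option is PostgreSQL-specific):
class ConvertHourlyWageSalaryToInteger < ActiveRecord::Migration[5.2]
  def up
    # strip the "$" and "," formatting first, then change the column type
    execute "UPDATE jobs SET hourly_wage_salary = REPLACE(REPLACE(hourly_wage_salary, '$', ''), ',', '')"
    change_column :jobs, :hourly_wage_salary, :integer, using: 'hourly_wage_salary::integer'
  end

  def down
    change_column :jobs, :hourly_wage_salary, :string
  end
end
Once the column is numeric, sort_link on the attribute itself sorts correctly and the ransacker workaround becomes unnecessary.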

Rails Amounts in Thousands Are Truncated

In my Rails 5 app, I read in a feed for products. In the JSON, when the price is over $1,000, the JSON has a comma, like 1,000.
My code seems to be truncating it, so it's storing as 1 instead of 1,000.
All other fields are storing correctly. Can someone please tell me what I'm doing wrong?
In this example, the reg_price saves as 2, instead of 2590.
json sample (for reg_price field):
[
  {
    "reg_price": "2,590"
  }
]
schema
create_table "products", force: :cascade do |t|
  t.decimal "reg_price", precision: 10, scale: 2
end
model
response = open_url(url_string).to_s
products = JSON.parse(response)
products.each do |product|
  # use the block variable product here (the original snippet referenced an
  # undefined item variable), and don't shadow it with the new record
  record = Product.new(
    reg_price: product['reg_price']
  )
  record.save
end
You are not doing anything wrong. Decimals don't work with a comma separator. I'm not sure there is a nice way to fix this, but as an option you could define a virtual attribute:
def reg_price=(reg_price)
  self[:reg_price] = reg_price.gsub(',', '')
end
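With that setter in place, assignment strips the separator before the normal decimal cast:
product = Product.new(reg_price: "2,590")
product.reg_price.to_f #=> 2590.0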
The reason this is happening has nothing to do with Rails.
JSON is a pretty simple document structure and doesn't have any support for number separators. The values in your JSON document are strings.
When you receive a String as input and you want to store it as an Integer, you need to cast it to the appropriate type.
Ruby has built-in support for this, and Rails is using it: "1".to_i #=> 1
The particular heuristic Ruby uses to convert a string to an integer is to take any number up to a non-numerical character and cast it as an integer. Commas are non-numeric, at least by default, in Ruby.
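For example:
"2,590".to_i #=> 2 (parsing stops at the comma)
"2590".to_i  #=> 2590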
The solution is to convert the string value in your JSON to an integer using another method. You can do this any of these ways:
1. Cast the string to an integer before sending it to your ActiveRecord model.
2. Alter the string in such a way that the default Ruby casting will cast the string into the expected value.
3. Use a custom caster to handle the casting for this particular attribute (inside of ActiveRecord and ActiveModel); see the sketch below.
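A minimal sketch of option 3 using the Rails 5 attributes API (DelimitedDecimal is an illustrative name, not a built-in type):
class DelimitedDecimal < ActiveModel::Type::Decimal
  # strip thousands separators before the standard decimal cast
  def cast(value)
    value = value.delete(',') if value.is_a?(String)
    super
  end
end

ActiveRecord::Type.register(:delimited_decimal, DelimitedDecimal)

class Product < ApplicationRecord
  attribute :reg_price, :delimited_decimal
end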
The solution proposed by @Danil follows option 2 above, and it has some shortcomings (as @tadman pointed out).
A more robust way of handling this without getting down in the mud is to use a library like Delocalize, which will automatically handle numeric string parsing and casting with consideration for separators used by the active locale. See this excellent answer by Benoit Garret for more information.

Multi parameter search via user input - ruby on rails & mongodb

I have a web page where a user can search through documents in a mongoDB collection.
I get the user's input through @q = params[:search].to_s
I then run a Mongoid query:
@story = Story.any_of({ :Tags => /#{@q}/i }, { :Name => /#{@q}/i }, { :Genre => /#{@q}/i })
This works fine if the user looks for something like 'humor', 'romantic comedy' or 'mystery'. But if looking for 'romance fiction', nothing comes up. Basically I'd like to add 'and'/'or' functionality to my search so that it will find documents in the database that are related to all strings a user types into the input field.
How can this be done while still maintaining the substring search capabilities I currently have? Thanks in advance for the help!
UPDATE:
Per Eugene's comment below...
I tried converting to case-insensitive with @q.map! { |x| x = "/#{x}/i" }. It does save it properly as ["/romantic/i", "/comedy/i"], but the query Story.any_of({ :Tags.in => @q }, { :Story.in => @q }) finds nothing.
When I change the array to be ["Romantic", "Comedy"], then it does.
How can I properly make it case-insensitive?
Final:
Removing the quotes worked.
However, there is now no way to use an .and() search to find a book that has both words across all these fields.
To create an OR statement, you can convert the string into an array of strings, then convert the array of strings into an array of regexes and use the '$in' option. First, pick a delimiter - perhaps a comma or a space, or you can set up a custom one like ||. Let's say you go comma-separated. When the user enters:
romantic, comedy
you split that into ['romantic', 'comedy'], then convert it to [/romantic/i, /comedy/i], then do
@story = Story.any_of({ :Tags.in => [/romantic/i, /comedy/i] }....
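Putting it together, a sketch of the whole flow (model and field names are taken from the question; Regexp.escape guards against regex metacharacters in the user's input):
terms   = params[:search].to_s.split(',').map(&:strip)  # ["romantic", "comedy"]
regexes = terms.map { |t| /#{Regexp.escape(t)}/i }      # [/romantic/i, /comedy/i]
@story  = Story.any_of({ :Tags.in => regexes }, { :Name.in => regexes })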
To create an AND query, it can get a little more complicated. There is an elemMatch function you could use.
I don't think you could do {:Tags => /romantic/i, :Tags => /comedy/i }
So my best thought would be to do sequential queries, even though there would be a performance hit; if your DB isn't that big, it shouldn't be a big issue. So if you want Romantic AND Comedy, you can do the following (a code sketch follows the steps):
query 1: find all collections that match /romantic/i
query 2: take results of query 1, find all collections that match /comedy/i
And so on by iterating through your array of selectors.
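A sketch of that sequential approach, assuming a Mongoid version that supports pluck and in (each pass only searches the ids returned by the previous one, giving AND semantics):
ids = nil
[/romantic/i, /comedy/i].each do |re|
  scope = Story.any_of({ :Tags => re }, { :Name => re })
  scope = scope.in(id: ids) unless ids.nil?
  ids = scope.pluck(:id)
end
@stories = Story.in(id: ids)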

Thinking Sphinx attribute from polymorphic association's datetime field

I have a model A associated to model B via INNER JOIN:
class A
  has_many :bees, as: :bable
  scope :bees, -> {
    joins("INNER JOIN bees AS b ON id = b.bable_id .......")
  }
end
class B
  self.table_name = "bees"  # assign via self, or this is just a local variable
  belongs_to :bable, polymorphic: true
end
I need to filter using B's datetime field (created_at), so I declared a new attribute thus:
has bees.created_at, as: :b_created_at
The sphinx query statement generated now includes:
GROUP_CONCAT(DISTINCT UNIX_TIMESTAMP(bees.`created_at`) SEPARATOR ',') AS `b_created_at`
After indexing, my sphinx index file size exploded.
How much is the "GROUP_CONCAT" part of the query causing the problem, and is there a better way to filter by this attribute?
How can I debug the indexer and find other causes of the large index file being generated?
Thanks
It appears that the indexer is creating, within the index file, a comma-separated list of all created timestamps of all bees. As created timestamps are generally unique (!), this indexing is going to create one item for every bee; if you have a lot of bees, then this is going to be big.
I would be looking at some way to bypass Sphinx for this part of the query, if that is possible, and get it to add a direct SQL BETWEEN LowDateTs AND HighDateTs against the built-in created_at instead. I hope this is possible - it will definitely be better than using a text index to find it.
Hope this is of some help.
Edit:
Speed-reading Sphinx's docs:
[...] WHERE clause. This clause will map both to fulltext query and filters. Comparison operators (=, !=, <, >, <=, >=), IN, AND, NOT, and BETWEEN are all supported and map directly to filters [...]
So the key is to stop it treating the timestamp as a text search and use a BETWEEN, which will be vastly more efficient and hopefully stop it trying to use text indexing on this field.
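In Thinking Sphinx terms, that means filtering on the attribute as a range rather than matching it as text; a sketch using the b_created_at attribute declared above:
# maps to a Sphinx attribute filter (effectively a BETWEEN), not a fulltext match
A.search 'some query', with: { b_created_at: 1.month.ago..Time.zone.now }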
