thinking_sphinx results order

I have no experience with thinking_sphinx (I take pride in the fact that I even got it working).
I would like to sort my results by a blend of search relevance and recency, maybe 5x for relevance and 1x for time (I'd have to play with the ratio to get it right). Obviously, if there are no search criteria, I'd like to sort by time alone.
I know I need to add the created_at column to the search model, but not as an indexed field (what term do I use?)
Report controller:
def index
  @reports = Report.search params[:search]
  # unknown sorting code here
end
Report model:
define_index do
  indexes apparatus
  indexes body
  indexes comments.body, :as => :comment_body
  ????? created_at
end

You would just do:
define_index do
  indexes apparatus
  indexes body
  indexes comments.body, :as => :comment_body
  has created_at
end
By using has you declare the attributes Sphinx needs for sorting and filtering but shouldn't index for text matching.
For search sorting, you'll want to read the Thinking Sphinx docs on how you'd like results weighted and sorted:
http://freelancing-god.github.com/ts/en/searching.html#sorting
http://freelancing-god.github.com/ts/en/searching.html#fieldweights
By default, Sphinx sorts based on how relevant it thinks the results are to the given inputs.
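To get the blend the question asks for (roughly 5x relevance, 1x recency), Sphinx's expression-based sort mode can combine the two. A hedged sketch against the Thinking Sphinx v1/v2 API used above; the option names and the exact expression may need adjusting for your version:

```ruby
def index
  if params[:search].present?
    # @weight is Sphinx's relevance score; created_at is the attribute
    # declared with `has`, stored by Sphinx as a unix timestamp.
    @reports = Report.search params[:search],
                             :sort_mode => :expr,
                             :order     => "@weight * 5 + created_at"
  else
    # No search criteria: sort by time alone.
    @reports = Report.search :order => 'created_at DESC'
  end
end
```

Since @weight and a unix timestamp live on very different scales, expect to tune the multiplier (or normalize created_at) rather than use 5 literally.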


Searching multiple tables with postgreSQL 13 and Rails 6+

I provide a lot of context to set the stage for the question. What I'm trying to solve is fast and accurate fuzzysearch against multiple database tables using structured data, not full-text document search.
I'm using PostgreSQL 13.4+ and Rails 6+, if it matters.
I have fairly structured data for several tables:
class Contact
attribute :id
attribute :first_name
attribute :last_name
attribute :email
attribute :phone
end
class Organization
attribute :name
attribute :license_number
end
...several other tables...
I'm trying to implement a fast and accurate fuzzysearch so that I can search across all these tables (Rails models) at once.
Currently I have a separate ILIKE query for each model that concats the columns I want to search against on the fly:
# contact.rb
scope :search, ->(q) { where("concat_ws(' ', first_name, last_name, email, phone) ILIKE :q", q: "%#{q}%") }
# organization.rb
scope :search, ->(q) { where("concat_ws(' ', name, license_number) ILIKE :q", q: "%#{q}%") }
In my search controller I query each of these tables separately and display the top 3 results for each model.
@contacts = Contact.search(params[:q]).limit(3)
@organizations = Organization.search(params[:q]).limit(3)
This works but is fairly slow and not as accurate as I would like.
Problems with my current approach:
Slow (relatively speaking) even with only thousands of records.
Not accurate, because ILIKE requires an exact match somewhere in the string and I want fuzzysearch (i.e., with ILIKE, "smth" would not match "smith").
Not weighted; I would like to weight contacts.last_name over, say, organizations.name, because the contacts table is generally the higher-priority search target.
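The "smth" vs "smith" gap above is exactly what trigram similarity closes. A plain-Ruby sketch of the pg_trgm-style calculation, purely to illustrate the idea (the real work happens inside PostgreSQL):

```ruby
require 'set'

# pg_trgm pads a word with two leading spaces and one trailing space,
# then compares the sets of 3-character substrings.
def trigrams(word)
  padded = "  #{word.downcase} "
  (0..padded.length - 3).map { |i| padded[i, 3] }.to_set
end

# Similarity = shared trigrams / total distinct trigrams.
def similarity(a, b)
  ta = trigrams(a)
  tb = trigrams(b)
  (ta & tb).size.to_f / (ta | tb).size
end

puts similarity("smth", "smith") # => 0.375, above pg_trgm's default 0.3 threshold
```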
My solution
My theoretical solution is to create a search_entries polymorphic table with a separate record for each contact, organization, etc. that I want to search against; this search_entries table could then be indexed for fast retrieval.
class SearchEntry
  attribute :data
  belongs_to :searchable, polymorphic: true

  # Store data lowercased to optimize search (avoids calling lower() in PG)
  def data=(text)
    self[:data] = text.downcase
  end
end
However, what I'm getting stuck on is how to structure this table so that it can be indexed and searched quickly.
contact = Contact.first
SearchEntry.create(searchable: contact, data: "#{contact.first_name} #{contact.last_name} #{contact.email} #{contact.phone}")
organization = Organization.first
SearchEntry.create(searchable: organization, data: "#{organization.name} #{organization.license_number}")
This gives me the ability to do something like:
SearchEntry.where("data LIKE :q", q: "%#{q}%")
or even something like fuzzysearch using PG's similarity() function (with the query value bound rather than interpolated, to avoid SQL injection):
SearchEntry.connection.execute(
  SearchEntry.sanitize_sql(["SELECT * FROM search_entries ORDER BY similarity(data, ?) DESC LIMIT 10", q])
)
I believe I can use a GIN index with pg_trgm on this data field as well to optimize searching (not 100% on that...).
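For reference, the index that paragraph is reaching for might look like this (a sketch; the extension has to be enabled once per database, and the table/column names come from the question):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN + gin_trgm_ops accelerates ILIKE '%...%' and the % similarity operator
CREATE INDEX index_search_entries_on_data_trgm
  ON search_entries USING gin (data gin_trgm_ops);
```

One caveat: ORDER BY similarity(...) by itself won't use this index; for index-assisted ordering by the distance operator (data <-> 'query'), a GiST index with gist_trgm_ops is the variant that supports it.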
This simplifies my search into a single query on a single table, but it still doesn't allow me to do weighted column searching (i.e., contacts.last_name being more important than organizations.name).
Questions
Would this approach enable me to index the data so that I could have very fast fuzzysearch? (I know "very fast" is subjective; what I mean is an efficient use of PG to get results as quickly as possible.)
Would I be able to use a GIN index combined with pg_trgm trigrams to index this data for fast fuzzysearch?
How would I implement weighting certain values higher than others in an approach like this?
One potential solution is to create a materialized view consisting of a union of data from the two (or more) tables. Take this simplified example:
CREATE MATERIALIZED VIEW searchables AS
SELECT
  resource_id,
  resource_type,
  name,
  weight
FROM (
  SELECT
    id AS resource_id,
    'Contact' AS resource_type,
    concat_ws(' ', first_name, last_name) AS name,
    2 AS weight
  FROM contacts
  UNION ALL
  SELECT
    id AS resource_id,
    'Organization' AS resource_type,
    name,
    1 AS weight
  FROM organizations
) AS resources;
class Searchable < ApplicationRecord
  belongs_to :resource, polymorphic: true

  def readonly?
    true
  end

  # Search contacts and organizations, with a higher weight on contacts
  def self.search(name)
    where(arel_table[:name].matches(name)).order(weight: :desc)
  end
end
Since materialized views are stored in a table-like structure, you can apply indexes just as you would on a normal table:
CREATE INDEX searchables_name_trgm ON searchables USING gist (name gist_trgm_ops);
To ActiveRecord it also behaves just like a normal table.
Of course, the complexity here grows with the number of columns you want to search, and the end result might be both underwhelming in functionality and overwhelming in complexity compared to an off-the-shelf solution with thousands of hours behind it.
The scenic gem can be used to make the migrations for creating materialized views simpler.
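For example, a scenic migration for the view above might look like this sketch (assuming the view SQL is saved as db/views/searchables_v01.sql, per scenic's conventions):

```ruby
class CreateSearchables < ActiveRecord::Migration[6.1]
  def change
    # scenic reads the view definition from db/views/searchables_v01.sql
    create_view :searchables, materialized: true
  end
end
```

Remember that materialized views go stale: refresh them after the underlying tables change, e.g. with Scenic.database.refresh_materialized_view(:searchables, concurrently: false, cascade: false).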

Sunspot Solr index time boost

I'm trying to use document boost at index time, but it seems to have no effect. I've set up my model for Sunspot like this:
Spree::Product.class_eval do
  searchable :auto_index => true, :auto_remove => true do
    text :name, :boost => 2.0, :stored => true
    text :description, :boost => 1.2, :stored => false
    boost { boost_value }
  end
end
The boost_value field is a field in the database where a user can change the boost from the frontend. It gets stored at index time (either the first time I build the index, or when a product is updated). I have about 3600 products in my database, with a default boost_value of 1.0. Two of the products have different boost_values: one 5.0 and the other 2.0.
However, if I just retrieve all products from Solr, the document boost seems to have no effect on the order or the score:
solr = ::Sunspot.new_search(Spree::Product) do |query|
  query.order_by("score", "desc")
  query.paginate(page: 1, per_page: Spree::Product.count)
end
solr.execute
solr.results.first
The Solr query itself looks like this:
http://localhost:8982/solr/collection1/select?sort=score+desc&start=0&q=*:*&wt=xml&fq=type:Spree\:\:Product&rows=3600&debugQuery=true
I've appended debugQuery=true at the end to see what the scores are, but no scores are shown.
The same thing happens when I search for a term. For example, I have 2 products that have a unique string, testtest, inside the name field. When I search for this term, the document boost has no effect on the order.
So my questions are:
Can per-document index-time boosting be based on a database field?
Does the document boost have any effect for q=*:*?
How can I debug this?
Or do I have to specify that Solr should apply the document boost?
In Solr, boosts only apply to text searches, so the document boost only takes effect when you run a fulltext query.
Something like this:
solr = ::Sunspot.new_search(Spree::Product) do |query|
  query.fulltext 'somesearch'
  query.order_by("score", "desc") # I think this isn't necessary
  query.paginate(page: 1, per_page: Spree::Product.count)
end
If you want to boost certain products more than others:
solr = ::Sunspot.new_search(Spree::Product) do |query|
  query.fulltext 'somesearch' do
    boost(2.0) { with(:featured, true) }
  end
  query.paginate(page: 1, per_page: Spree::Product.count)
end
As you can see, this is much more powerful than boosting at index time: you can apply different boosts for different conditions, all at query time, with no reindexing needed when you want to change the boost or the conditions.

How do I get the search to use the attr_accessor?

OK, so I have a date field that I need to search on, but I need to search on it by day, like in this MySQL-style condition:
search_conditions << ["DAY(open_date) != ?", event.thursday.day] if options[:thur].blank?
and I need to express this condition with Thinking Sphinx, so I tried this:
attr_accessor :event_day

def event_day
  self.start_date.day
end

# Thinking Sphinx configuration for the event search
define_index do
  indexes event_day
  ...
  ...
and in the search I tried this:
search_string = "@event_day -#{event.thursday.day}" unless options[:thur].blank?
but I keep getting this error:
index event_core: query error: no field 'event_day' found in schema
Is there any way to make this work?
You can't use a Ruby attribute in an SQL query. Rails isn't that clever.
You need to write SQL that replicates that method, or filter the results of a query through it, e.g.
@my_query.where(:a => "b").select { |rec| rec.some_method == "some value" }
As Michael pointed out, Ruby attributes aren't accessible to Sphinx, which talks directly to your database.
So you can either create a column that holds the event-day value and reference that via Sphinx, or create a field that uses an SQL function (which could vary between MySQL and PostgreSQL) to extract the day from the start_date column. It's not particularly complex; it'd probably end up looking like this:
indexes "GET_DAY_FROM_DATE(start_date)", :as => :event_day
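Concretely, the placeholder function above might become (a hedged sketch; start_date comes from the question, and the exact SQL varies by database):

```ruby
# MySQL:
indexes "DAY(start_date)", :as => :event_day

# PostgreSQL:
indexes "EXTRACT(DAY FROM start_date)", :as => :event_day
```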

Solr/Lucene is it possible to order first by relevance, and then by a second attribute?

In Solr/Lucene is it possible to order first by relevance, and then by a second attribute?
As far as I can tell if I set an ordering parameter, it totally overrides relevance, and sorts by the ordering parameter(s).
How can I have results sorted first by relevance and then, when two entries have exactly the same relevance, give the nod to the item that, say, comes first alphabetically?
If it makes any difference I'm using Solr through Sunspot in Ruby on Rails.
Solved my own problem!
The :score keyword can be passed to order_by to sort results by relevance.
So in Rails Sunspot terms:
Article.search do
  keywords params[:query]
  order_by :score, :desc
  order_by :name, :asc
end

Filtering Sphinx search results by date range

I have Widget.title, Widget.publish_at, and Widget.unpublish_at. It's a Rails app with thinking_sphinx running, indexing once a night. I want to find all Widgets that have 'foo' in the title and are published (publish_at < Time.now, unpublish_at > Time.now).
To get pagination to work properly, I really want to do this in a Sphinx query. I have
'has :publish_at, :unpublish_at' to get the attributes, but what's the syntax for 'Widget.search("foo @publish_at > #{Time.now}", :match_mode => :extended)'? Is this even possible?
Yep, easily possible, just make sure you're covering the times in your index:
class Widget < ActiveRecord::Base
  define_index do
    indexes title
    has publish_at
    has unpublish_at
    ...
  end
end
To filter purely on the dates, a small amount of trickery is required because Sphinx requires a bounded range (x..y, as opposed to x >= y). The min/max placeholder values are inelegant, but I'm not aware of a better way around it at the moment.
min_time = Time.now.advance(:years => -10)
max_time = Time.now.advance(:years => 10)
title = "foo"
Widget.search title, :with => {:publish_at => min_time..Time.now, :unpublish_at => Time.now..max_time}
I haven't used Sphinx with Rails yet, but this is possible through the Sphinx API.
What you need to do is define a datetime attribute in your sphinx.conf.
And don't forget to use UNIX_TIMESTAMP(publish_at) and UNIX_TIMESTAMP(unpublish_at) in your index's SELECT.
