ThinkingSphinx: OR-condition on the SQL-backed indices? - ruby-on-rails

I am trying to use ThinkingSphinx in my Rails 5 project. I read an instruction at http://freelancing-gods.com/thinking-sphinx/
I need to implement the OR logic on SQL-backed indices.
Here is my class:
class Message < ApplicationRecord
  belongs_to :sender, class_name: 'User', :inverse_of => :messages
  belongs_to :recipient, class_name: 'User', :inverse_of => :messages
end
and its indexer:
ThinkingSphinx::Index.define :message, :with => :active_record, :delta => true do
  indexes text
  indexes sender.email, :as => :sender_email, :sortable => true
  indexes recipient.email, :as => :recipient_email, :sortable => true
  has sender_id, created_at, updated_at
  has recipient_id
end
schema.rb:
create_table "messages", force: :cascade do |t|
  t.integer "sender_id"
  t.integer "recipient_id"
  t.text "text"
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
  t.boolean "read", default: false
  t.boolean "spam", default: false
  t.boolean "delta", default: true, null: false
  t.index ["recipient_id"], name: "index_messages_on_recipient_id", using: :btree
  t.index ["sender_id"], name: "index_messages_on_sender_id", using: :btree
end
So I need to search only within 2 indices at once - :sender_email and :recipient_email - but ignoring indexes text.
In pseudocode I need something like this:
Message.search 'manager1@example.com', :conditions => {:sender_email => 'client1@example.com' OR :recipient_email => 'client1@example.com'}
Which means: find all the messages between 'manager1@example.com' and 'client1@example.com' (each of them could be either the sender or the recipient) - ignoring messages whose text merely contains the words 'manager1@example.com' or 'client1@example.com'.
Unfortunately, the docs say:
The :conditions option must be a hash, with each key a field and each value a string.
In other words, I need a conditional index set (at run-time) - but simultaneously over 2 indices (not 1 as documented).
I mean that it is a bad idea to allow only hashes as a condition - and no strings (like ActiveRecord queries do allow http://guides.rubyonrails.org/active_record_querying.html#pure-string-conditions ).
PS I would say that the ThinkingSphinx documentation http://freelancing-gods.com/thinking-sphinx/ is pretty bad and needs to be totally rewritten from scratch. I read it all and did not understand anything. It has no complete examples, only partial ones, so it is totally unclear. I don't even understand what fields and attributes are and how they differ. Associations, conditions, etc - all is unclear. Very bad. The gem itself looks pretty good - but its documentation is awful.

I'm sorry to hear you've not been able to find a solution that works for you in the documentation. The challenging thing with Sphinx is that it uses the SphinxQL syntax, which is very similar to SQL, but also quite different at times - and so people often expect SQL-like behaviour.
It's also part of the challenge of maintaining this gem - I'm not sure it's wise to mimic the ActiveRecord syntax too closely, otherwise that could make things more confusing.
The key thing to note here is that you can make use of Sphinx's extended query syntax for matches to get the behaviour you're after:
Message.search :conditions => {
  :sender_email    => "(client1@example.com | manager1@example.com)",
  :recipient_email => "(client1@example.com | manager1@example.com)"
}
This will return anything where the sender is either of the two values, and the receiver is either of the two values. Of course, this will include any messages sent from client1 to client1, or manager1 to manager1, but I'd expect that's rare and maybe not that big a problem.
One caveat to note is that @ and . aren't usually treated as searchable word characters, so you may need to add them to your charset_table.
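A minimal sketch of building that conditions hash for any pair of addresses. `dialog_conditions` is a hypothetical helper, not part of Thinking Sphinx, and wrapping each address in double quotes (phrase syntax) is an added assumption, so that an address split into several tokens still has to match as a unit.

```ruby
# Hypothetical helper: builds the :conditions hash for a two-party dialog.
def dialog_conditions(email_a, email_b)
  # Quote each address so it is matched as a phrase (assumption, see above).
  either = [email_a, email_b].map { |email| %("#{email}") }.join(' | ')
  { sender_email: "(#{either})", recipient_email: "(#{either})" }
end

conditions = dialog_conditions('client1@example.com', 'manager1@example.com')
# With Thinking Sphinx loaded: Message.search(conditions: conditions)
```

As with the hand-written hash, this also matches messages a user sent to themselves.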
Also, given you're actually performing exact matches on the entire values of database columns, this does feel like a query that's actually better served by some database indices on the columns and using SQL instead. Sphinx (and I'd say most/all other full-text search libraries) are best suited to matching words and phrases within larger text fields.
As for the documentation… I've put a lot of effort into trying to make them useful, though I realise there's still a lot of improvement that could take place. I do have a page that outlines how fields and attributes differ - if that's not clear, feedback is definitely welcome.
Keeping documentation up-to-date requires a lot of effort in small and new projects - and Thinking Sphinx is neither of these, being 10 years old in a few months. I'm proud of how it still works well, it still supports the latest versions of Rails, and it's still actively maintained and supported. But it's open source - it's done in my (and others') spare time. It's not perfect. If you find ways to improve things, then please do contribute! The code and docs are on GitHub, and pull requests are very much welcome.

Related

ThinkingSphinx: dynamic indices on the SQL-backed indices?

I am trying to use ThinkingSphinx (with SQL-backed indices) in my Rails 5 project.
I need some dynamic run-time indices to search over.
I have a Message model:
class Message < ApplicationRecord
  belongs_to :sender, class_name: 'User', :inverse_of => :messages
  belongs_to :recipient, class_name: 'User', :inverse_of => :messages
end
and its indexer:
ThinkingSphinx::Index.define :message, :with => :active_record, :delta => true do
  indexes text
  indexes sender.email, :as => :sender_email, :sortable => true
  indexes recipient.email, :as => :recipient_email, :sortable => true
  indexes [sender.email, recipient.email], :as => :messager_email, :sortable => true
  has sender_id, created_at, updated_at
  has recipient_id
end
schema.rb:
create_table "messages", force: :cascade do |t|
  t.integer "sender_id"
  t.integer "recipient_id"
  t.text "text"
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
  t.boolean "read", default: false
  t.boolean "spam", default: false
  t.boolean "delta", default: true, null: false
  t.index ["recipient_id"], name: "index_messages_on_recipient_id", using: :btree
  t.index ["sender_id"], name: "index_messages_on_sender_id", using: :btree
end
The problem is about so-called "dialogs". They don't exist in the database - they are determined at run-time. A dialog - that's a set of messages between 2 users, where each user may be either a sender or a receiver.
The task is to search through my dialogs and to find the dialog (dialog's messages) by the piece of the correspondent email. So complicated!
Here's my effort:
conditions = {messager_email: search_email}
with_current_user_dialogs =
  "*, IF(sender_id = #{current_user.id} OR recipient_id = #{current_user.id}, 1, 0) AS current_user_dialogs"
messages = Message.search search_email, conditions: conditions,
                                        select: with_current_user_dialogs,
                                        with: {'current_user_dialogs' => 1}
This is almost fine - but still not. This query correctly searches only within my dialog (within the messages I sent or received) and only within :sender and :recipient fields simultaneously (which is not best).
Say my email is "client1@example.com". Other emails are like "client2@example.com", "client3@example.com", "manager1@example.com".
The trouble is that when I search for "client1" - I get all the messages where I was either a sender or a receiver. But I should get nothing in response - I need to search only across my correspondents' emails - not mine.
It gets even worse when querying "client" - I get back the correct correspondents with "client2@example.com", "client3@example.com" - but the result is spoiled by the wrong "client1@example.com".
I need a way to choose at run-time - which index subset to search within.
I mean this condition is not enough for me:
indexes [sender.email, recipient.email], :as => :messager_email, :sortable => true
It searches (for "client") within all the sender.email and all the recipient.email at once.
But I need to choose dynamically, like: "search only within sender.email values where sender.id != current_user.id" OR "search only within recipient.email values where recipient.id != current_user.id" (because I can be either the sender or the recipient).
That's what I call a "dynamic index".
How to do that? Such a "dynamic index" would surely depend on the current_user value - so it will be different for different users - even on the same total set of messages.
It is clear that I can't apply any post-search cut-offs (what to cut off?) - I need to somehow limit the search itself.
I tried to search over some scope - but got the error that "searching is impossible over scopes" - something like that.
Maybe I should use the real-time indexing instead of the SQL-backed indexing?
Sorry for the complexity of my question.
Would the following work?
other = User.find_by :email => search_email
with_current_user_dialogs = "*, IF((sender_id = #{current_user.id} AND recipient_id = #{other.id}) OR (recipient_id = #{current_user.id} AND sender_id = #{other.id}), 1, 0) AS current_user_dialogs"
Or do you need partial matches on the searched email address?
[EDIT]
Okay, from the discussion in the comments below, it's clear that the field data is critical. While you can construct a search query that uses both fields and attributes, you can't have logic in the query that combines the two. That is, you can't say: "Search field 'x' when attribute 'i' is 1, otherwise search field 'y'."
The only way I can possibly see this working is if you're using fields for both parts of the logic. Perhaps something like the following:
current_user_email = "\"" + current_user.email + "\""
Message.search(
  "(@sender_email #{current_user_email} @recipient_email #{search_email}) | (@sender_email #{search_email} @recipient_email #{current_user_email})"
)
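The query above interpolates the search term straight into extended-query syntax, so characters such as `@`, `|` or `(` in user input would be read as operators. Here is a rough sketch of a wrapper that backslash-escapes such characters first. `escape_sphinx` and `dialog_query` are hypothetical names, and the character list is an assumption rather than an exhaustive one. Inside a Rails app, the gem's own `ThinkingSphinx::Query.escape` is probably the better choice.

```ruby
# Characters treated as operators in Sphinx extended query syntax (assumed list).
SPHINX_SPECIAL = /[()|\-!@~"&\/\\^$=<]/

# Hypothetical helper: prefix each special character with a backslash.
def escape_sphinx(term)
  term.gsub(SPHINX_SPECIAL) { |char| "\\#{char}" }
end

# Hypothetical helper: one party must be own_email (as a phrase) and the
# search term must match the other party's field, in either direction.
def dialog_query(own_email, search_term)
  own   = %("#{escape_sphinx(own_email)}")
  other = escape_sphinx(search_term)
  "(@sender_email #{own} @recipient_email #{other}) | " \
    "(@sender_email #{other} @recipient_email #{own})"
end

query = dialog_query('client1@example.com', 'client2')
# With Thinking Sphinx loaded: Message.search(query)
```

Escaping before interpolation keeps the field operators written in the template as the only operators in the final query.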

Rails relationship confusion

I am trying to get some relationships in Rails set up and am having some confusion with how to use the ones I have configured.
My scenario is this:
I have a model called Coaster. I want each Coaster to be able to have 0 or more versions. I want to be able to find all versions of a Coaster from its instance, and also in reverse.
My models and relationships as they stand:
coaster.rb:
has_many :incarnations
has_many :coaster_versions,
         through: :incarnations
incarnation.rb:
belongs_to :coaster
belongs_to :coaster_version,
           class_name: "Coaster",
           foreign_key: "coaster_id"
Database schema for Incarnations:
create_table "incarnations", force: :cascade do |t|
  t.integer "coaster_id"
  t.integer "is_version_of_id"
  t.boolean "is_latest"
  t.integer "version_order"
end
and my code that happens when importing Coasters from my CSV data file:
# Versions
# Now determine if this is a new version of existing coaster or not
if Coaster.where(order_ridden: row[:order_ridden]).count == 1
  # Create Coaster Version that equals itself.
  coaster.incarnations.create!({is_version_of_id: coaster.id, is_latest: true})
else
  # Set original and/or previous incarnations of this coaster to not be latest
  Coaster.where(order_ridden: row[:order_ridden]).each do |c|
    c.incarnations.each do |i|
      i.update({is_latest: false})
    end
  end
  # Add new incarnation by finding original version
  original_coaster = Coaster.unscoped.where(order_ridden: row[:order_ridden]).order(version_number: :asc).first
  coaster.incarnations.create!({is_version_of_id: original_coaster.id, is_latest: true})
end
Now all my DB tables get filled in but I am unsure how to ensure everything is working how I want it to.
For example, I have two coasters (A and B), and B is a version of A. When I get A and ask for a count of its coaster_versions, I only get 1 returned as a result. Surely I should get 2, or is that correct?
In the same line, if I get B and call coaster_versions I get 1 returned as well.
I just need to ensure I am getting back the correct results really.
Any comments would be highly appreciated as I have been working on this for ages now and not getting very far.
Just in case anyone is going to reply telling me to look at versioning gems: I went this route initially and it was great, but the problem there is that in MY case a Coaster and a VERSION of a coaster are both as important as each other, and I can't do Coaster.all to get ALL coasters whether they were versions or not. Other issues along the same line also cropped up.
Thanks
First of all, welcome to the wonderful world of history tracking! As you've found, it's not actually that easy to keep track of how your data changes in a relational database. And while there are definitely gems out there that can just track history for audit purposes (e.g. auditable), sounds like you want your history records to still be first-class citizens. So, let me first analyze the problems with your current approach, and then I'll propose a simpler solution that might make your life easier.
In no particular order, here are some pain points with your current system:
The is_latest column has to be maintained, and is at risk for going out of sync. You likely wouldn't see this in testing, but in production, at scale, it's a very valid risk.
Your incarnations table creates a one-master-version-with-many-child-versions structure, which is fine except that (similar to is_latest) the ordering of the versions is controlled by the version_order column which again needs to be maintained and is at risk of being incorrect. Your import script doesn't seem to set it, at the moment.
The incarnations relationship makes it difficult to tell that B is a version of A; you could fix this with some more relations, but that will also make your code more complex.
Complexity. It's hard to follow how history is tracked, and as you've found, it's hard to manage the details of inserting a new version (A and B should both agree that they have 2 versions, right? Since they're the same coaster?)
Now, I think your data model is still technically valid -- the issues you're seeing are, I think, problems with your script. However, with a simpler data model, your script could become much simpler and thus less prone to error. Here's how I'd do it, using just one table:
create_table "coasters", force: :cascade do |t|
  t.string "name"
  t.integer "original_version_id"
  t.datetime "superseded_at"
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
end
The original_version_id serves the same purpose as your incarnations table does, to link a version back to the original record. The superseded_at column is both usable as an is_latest check and a way to order the versions (though below, I just order by id for simplicity). With that structure, this is my Coaster class:
class Coaster < ActiveRecord::Base
  belongs_to :original_version, class_name: "Coaster"
  scope :latest, -> { where(superseded_at: nil) }
  scope :original, -> { where('original_version_id = id') }

  # Create a new record, linked properly in history.
  def self.insert(attrs)
    # Find the current latest version.
    if previous_coaster = Coaster.latest.find_by(name: attrs[:name])
      # At the same time, create the new version (linked back to the original version)
      # and deprecate the current latest version. A transaction ensures either both
      # happen, or neither do.
      transaction do
        create!(attrs.merge(original_version_id: previous_coaster.original_version_id))
        previous_coaster.update_column(:superseded_at, Time.now)
      end
    else
      # Create the first version. Set its original version id to itself, to simplify
      # our logic.
      transaction do
        new_record = create!(attrs)
        new_record.update_column(:original_version_id, new_record.id)
      end
    end
  end

  # Retrieve all records linked to the same original version. This will return the
  # same result for any of the versions.
  def versions
    self.class.where(original_version_id: original_version_id)
  end

  # Return our version as an ordinal, e.g. 1 for the very first version.
  def version
    versions.where(['id <= ?', id]).count
  end
end
This makes adding new records simple:
irb> 5.times { Coaster.insert(name: "Coaster A") }
irb> 4.times { Coaster.insert(name: "Coaster B") }
irb> Coaster.latest.find_by(name: "Coaster A").version
(2.2ms) SELECT COUNT(*) FROM "coasters" WHERE "coasters"."original_version_id" = $1 AND (id <= 11) [["original_version_id", 7]]
=> 5
irb> Coaster.original.find_by(name: "Coaster A").version
(2.3ms) SELECT COUNT(*) FROM "coasters" WHERE "coasters"."original_version_id" = $1 AND (id <= 7) [["original_version_id", 7]]
=> 1
Granted, it's still complex code that would be nice to have made simpler. My approach is certainly not the only one, nor necessarily the best. Hopefully you learned something, though!
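The bookkeeping behind `insert`, `versions` and `version` can be tried out without a database. Below is a rough in-memory sketch of the same idea. `Row` and `CoasterStore` are hypothetical stand-ins for the table and the model, and ids are assigned sequentially on the assumption that this mirrors an auto-increment primary key.

```ruby
# In-memory stand-in for one row of the coasters table (hypothetical).
Row = Struct.new(:id, :name, :original_version_id, :superseded_at)

class CoasterStore
  def initialize
    @rows = []
  end

  # Mirrors Coaster.insert: link the new row to the original version and
  # mark the previous latest row with the same name as superseded.
  def insert(name)
    id = @rows.size + 1
    previous = @rows.find { |r| r.name == name && r.superseded_at.nil? }
    original_id = previous ? previous.original_version_id : id
    previous.superseded_at = Time.now if previous
    @rows << Row.new(id, name, original_id, nil)
    @rows.last
  end

  # All rows that share the given row's original version.
  def versions(row)
    @rows.select { |r| r.original_version_id == row.original_version_id }
  end

  # Ordinal of the row within its version chain (1 for the first version).
  def version(row)
    versions(row).count { |r| r.id <= row.id }
  end
end

store  = CoasterStore.new
first  = store.insert('Coaster A')
store.insert('Coaster B')
latest = store.insert('Coaster A')
```

Here `store.versions(first)` and `store.versions(latest)` return the same two rows, which is the "A and B agree on the version count" property discussed above.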

Rails :uniqueness validation not finding previous records

I'm having a bizarre glitch where Rails does not validate the uniqueness of an attribute on a model, despite the attribute being saved perfectly and the validation being written correctly.
I added a validation to ensure the uniqueness of a value on one of my Rails models, Spark, with this code:
validates :content_hash, :presence => true, :uniqueness => true
The content_hash is an attribute created from the model's other attributes in a method called using a before_validation callback. Using the Rails console, I've confirmed that this hash is actually being created before the validation, so that is not the issue.
When I call in the Rails console spark.valid? on a spark for which I know a collision exists on its content_hash, the console tells me that it has run this query:
Spark Exists (0.2ms) SELECT 1 AS one FROM "sparks" WHERE "sparks"."content_hash" = '443524b1c8e14d627a3fadfbdca50118c6dd7a7f' LIMIT 1
And the method returns that the object is valid. It seems that the validator is working perfectly fine and is running the correct query to check the uniqueness of the content_hash; the problem is instead on the database end (I'm using sqlite3). I know this because I checked the database myself to see if a collision really exists, using this query:
SELECT "sparks".* FROM "sparks" WHERE "sparks"."content_hash" = '443524b1c8e14d627a3fadfbdca50118c6dd7a7f'
Bizarrely, this query returns nothing from the database, despite the fact that I can see with my own eyes that other records with this content_hash exist on the table.
For some reason, this is an issue that exists exclusively with the content_hash attribute of the sparks table, because when I run similar queries for the other attributes of the table, the output is correct.
The content_hash column is no different from the others which work as expected, as seen in this relevant part of my schema.rb file:
create_table "sparks", :force => true do |t|
  t.string "spark_type"
  t.string "content_type"
  t.text "content"
  t.text "content_hash"
  t.datetime "created_at", :null => false
  t.datetime "updated_at", :null => false
end
Any help on this problem would be much appreciated; I'm about ready to tear my hair out over this thing.
Okay, I managed to fix the problem. I think it was an sqlite3 issue, because everything worked perfectly once I changed the type of content_hash from a text column to a string column. Weird.

ActiveRecord::Base.connection.execute(..) looking for column with name of a value I am trying to insert into a table

Newbie question.
I am trying to use ActiveRecord::Base.connection.execute(..) in a ruby 3.1 app. The docs I have read seem straightforward, but for the life of me I can't understand why I can't get the code below to work. The error message suggests that the execute function is looking for a column with the name of one of the values I am trying to save, but I don't understand why.
Firstly, my db table structure is as follows:
create_table "countries", :force => true do |t|
  t.string "iso3"
  t.string "iso2"
  t.string "name"
  t.datetime "created_at"
  t.datetime "updated_at"
end
And the code I'm playing with is as follows:
code = 'ZA'
name = 'South Africa'
ActiveRecord::Base.connection.execute("INSERT INTO countries ('iso3', 'iso2', 'name')
  VALUES ('Null', #{code}, #{name})")
The error message I am getting is as follows:
SQLite3::SQLException: no such column: ZA: INSERT INTO countries ('iso3', 'iso2', 'name')
VALUES ('Null', ZA, SouthAfrica)
Where did you get the basis for this? Code of this variety is a sterling example of what not to do.
If you have ActiveRecord, then you have ActiveRecord::Base, and with that you're on the right track and pretty much done. You don't need to write raw SQL for routine things of this variety. It's not necessary, and more, it's extremely dangerous for the reasons you've just discovered. You can't just shove arbitrary values into your query or you will end up with nothing but trouble.
What you should be doing is declaring a model and then using it:
# Typically app/models/country.rb
class Country < ActiveRecord::Base
end
Inserting a record once you have a model is seriously easy:
Country.create(
  :iso2 => 'ZA',
  :name => 'South Africa'
)
A good ActiveRecord reference is invaluable as this facility will make your life significantly easier if you make use of it.
Within Rails you usually go about generating these automatically so that you have something rough to start with:
rails generate model country
This will take care of creating the migration file, the model file, and some unit test stubs you can fill in later.
The error is just because of the missing quotes. It should be like:
INSERT INTO countries ('iso3', 'iso2', 'name') VALUES ('Null', 'ZA', 'SouthAfrica')
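To see why the unquoted version fails, it helps to build both statements as plain strings. This is a minimal sketch. `sql_quote` is a hypothetical helper that only doubles single quotes, and it assumes a real NULL was intended for iso3 rather than the string 'Null'. Inside Rails, the adapter-aware `ActiveRecord::Base.connection.quote(value)`, or simply the model, is the safer route.

```ruby
# Hypothetical minimal quoting helper: wraps a value in single quotes and
# doubles any embedded single quote, as SQL string literals require.
def sql_quote(value)
  "'#{value.to_s.gsub("'", "''")}'"
end

code = 'ZA'
name = 'South Africa'

# Without quoting, ZA is parsed as a column name, which is the reported error.
broken = "INSERT INTO countries (iso3, iso2, name) VALUES (NULL, #{code}, #{name})"

# With quoting, both values become string literals.
fixed = "INSERT INTO countries (iso3, iso2, name) " \
        "VALUES (NULL, #{sql_quote(code)}, #{sql_quote(name)})"
```

The doubled-quote rule matters as soon as a value contains an apostrophe, which is one reason hand-built SQL is discouraged above.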

How do I get only unique results from two dissimilar arrays?

This might seem like a duplicate question, but I can't find any information on this. I want to show the results from a remotely acquired json array excluding certain results by comparing them to a local table. I have a gallery model with:
t.integer :smugmug_id
t.string :smugmug_key
t.integer :category_id
t.string :category_name
t.string :description
t.integer :highlight_id
t.string :highlight_key
t.string :highlight_type
t.string :keywords
t.string :nicename
t.integer :subcategory_id
t.string :subcategory_name
t.string :title
t.string :url
The data for this model gets populated by a rake task that connects to the smugmug api (json) and stores the data locally. I'm trying to create a view that shows all the smugmug galleries that are not stored locally.
Here's what I've tried so far, but it's not excluding the locally stored galleries like I thought it would.
def self.not_stored
  smugmug_list = Smug::Client.new.albums(heavy = true)
  gallery_list = Gallery.select(:smugmug_id)
  smugmug_list.each do |smugmug|
    smugmug unless gallery_list.include? smugmug.id
  end
end
Hopefully this makes sense. I'm getting a json array of galleries, and I want to display that array excluding results where the album id matches the smugmug_id of any of my locally stored records.
Quick edit: I'm using an adaptation of this gem to connect to the smugmug api.
Just use the difference operator.
General Example:
ruby-1.9.2-p136 :001 > [3,2,1] - [2,1]
=> [3]
So you would have:
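smugmug_list.collect { |e| e.id } - gallery_list.map(&:smugmug_id)
Enumerable#collect will turn the smugmug_list into a list of ids, and gallery_list.map(&:smugmug_id) does the same for the locally stored records (Gallery.select(:smugmug_id) returns Gallery objects, not bare ids). From there, the difference operator returns the ids of all the smugmug galleries that are not stored locally.
Another option, which keeps the gallery objects rather than just their ids:
smugmug_list.reject { |e| gallery_list.map(&:smugmug_id).include?(e.id) }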
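The original method returned every album because `each` returns the array it was called on, whatever the block does. Below is a minimal sketch of the filtering step using plain Ruby objects. `Album` is a hypothetical stand-in for whatever the API client returns, and in the Rails app the stored ids would presumably come from something like `Gallery.pluck(:smugmug_id)`.

```ruby
require 'set'

# Hypothetical stand-in for an album object returned by the API client.
Album = Struct.new(:id, :title)

# Return the remote albums whose id is not among the locally stored ids.
def not_stored(remote_albums, stored_ids)
  stored = stored_ids.to_set # constant-time membership checks
  remote_albums.reject { |album| stored.include?(album.id) }
end

remote  = [Album.new(1, 'Beach'), Album.new(2, 'City'), Album.new(3, 'Forest')]
missing = not_stored(remote, [2])
```

Using `reject` keeps the full album objects, so the view can still show titles and other details for the galleries that are not yet stored.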
