Solr join across multiple collections and fetch data from both collections - join

I have 2 solr collections:
Ads {id, title, body, description, etc etc)
AdPlacement (ad_id, placement_id, price)
Each Ad can have 500-1000 placements, with different prices.
The search usecase is where I have a placement and some search keyword and I want to find the Ads that map the keyword provided in the title/body/description fields and it should be sorted by the price in the AdPlacement collection for the given placement. We would like to get the Ad details and the price in the output returned.
Is there any way to achieve this in solr using join across multiple collections? What I have read so far says you can only get data from one collection and use the other one just for filtering.

Solr is a Document database and supports nested documents so ideally you would want to model such that your add placement records are a part of the Ad document. This would be the better way to handle your scenario. Please go through this blog Solr Nested Objects and the relevant Solr documentation
In case modifying the document structure is not an option then consider this documentation which mentions about allowing some level of join between collections.

Related

Can I Order Results Using Two Columns from Different Tables?

I have a Rails application featuring a city in the US. I'm working on a database process that will feature businesses that pay to be on the website. The goal is to feature businesses within an hour's drive of the city's location in order to make visitors aware of what is available. My goal is to group the businesses by city where the businesses in the city are listed first then the businesses from the next closest city are displayed. I want the cities to be listed by distance and the businesses within the city group to be listed by the business name.
I have two tables that I want to join in order to accomplish this.
city (has_many :businesses) - name, distance
business (belongs_to :city) - name, city_id, other columns
I know I can do something like the statement below that should only show data where business rows exist for a city row.
#businesses = City.order(“distance ASC").joins('JOIN businesses ON businesses.city_id = cities.id')
I would like to add order by businesses.name. I've seen an example ORDER BY a.Date, p.title which referencing columns from two databases.
ORDER BY a.Date, p.title
Can I add code to my existing statement to order businesses by name or will I have to embed SQL code to do this? I have seen examples with other databases doing this but either the answer is not Rails specific or not using PostgreSQL.
After lots more research I was finally able to get this working the way I wanted to.
Using .joins(:businesses) did not yield anything because it only included the columns for City aka BusinessCity and no columns for Business. I found that you have to use .pluck or .select to get access to the columns from the table you are joining. This is something I did not want to do because I foresee more columns being added in the future.
I ended up making Business the main table instead of BusinessCity as my starting point since I was listing data from Business on my view as stated in my initial question. When I did this I could not use the .joins(:business_cities) clause because it said the relation did not exist. I decided to go back to what I had originally started with using Business as the main table.
I came up with the following statement that provides all the columns from both tables ordered by distance on the BusinessCity table and name on the Business table. I was successful in added .where clauses as needed to accommodate the search functionality on my view.
#businesses = Business.joins("JOIN business_cities ON business_cities.id = businesses.business_city_id").order("business_cities.distance, businesses.name")

Indexing criteria for elasticsearch

I am working with twitter streaming api. and am a little confused about deciding the criteria for indexing the data. Right now I have a single index that contains all the tweets in one doc_type and users in another doc type.
Is it the best way to go about storing them or should i create a new doc type for every category (category can be decided on basis of hashtag and tweet content)
What should be the best approach to storing such data?
Thanks in advance.
At first, the answer to your question is that this very much depends on your use case. What is your application doing? What do you do with the tweets? How many categories do you plan to have?
I'd in general, however, go for a solution where you use the same index and the same doc_type for all tweets. This allows you to build queries and aggregations over all your tweets without thinking about the different types of categories. It also allows you to add new categories easily without having to change your queries.
If you want to do some classification of the tweets you could add a category field to the tweet document stored in elasticsearch. You can then use this category field to implement your specific application logic.
If your category names have spaces or punctuation marks don't forget to define the category field as not_analyzed. Otherwise it will be broken up in parts.

How to do a join in Elasticsearch -- or at the Lucene level

What's the best way to do the equivalent of an SQL join in Elasticsearch?
I have an SQL setup with two large tables: Persons and Items.
A Person can own many items.
Both Person and Item rows can change (i.e. be updated).
I have to run searches which filter by aspects of both the person and the item.
In Elasticsearch, it looks like you could make Person a nested document of Item, then use has_child.
But: if you then update a Person, I think you'd need to update every Item they own (which could be a lot).
Is that correct?
Is there a nice way to solve this query in Elasticsearch?
As already mentioned the way to go is parent/child. The point is that nested documents are extremely performant but in order for them to be updated you need to re-submit the whole structure (parent + nested documents). Although the internal implementation of nested documents consists of separate lucene documents, those nested doc are not visible nor directly accessible. In fact when using nested documents you then need to use proper queries to access them (nested query, nested filter, nested facet etc.).
On the other hand parent/child allows you to have separate documents that refer to each other, which can be updated independently. It has a cost in terms of performance and memory used but it is way more flexible than nested documents.
As mentioned in this article though, the fact that elasticsearch helps you managing relations doesn't mean that you must use those features. In a lot of complex usecases it is just better to have some custom logic on the application layer that handles with relations. In facet there are limitations with parent/child too: for instance you can never get back both parent and children at the same time, as opposed to nested documents that doesn't allow to get back only matching children (for now).
Take a look at my answer for: In Elasticsearch, can multiple top-level documents share a single nested document?
This discusses the use of _parent mapping as a way to avoid the issue with needing to update every Item when a Person is updated.

Returning Search Results in Rails

I am having a problem implementing a special kind of search for my Rails application. I am working on an achievement system where you can search for a set of users in a search form (e.g., the query being "Ross, Adam, Jake") and it returns all of the common achievements that the users have unlocked (e.g., if users Ross, Adam, and Jake all had an achievement named "You are winner!"). I have three tables, one for achievements, one for users, and a join table. We have tested the associations and such, so we know that works.
My first idea was to put the search terms in an array and get the search results for each item in the array and place them into respective "search result arrays". Then, I was thinking to go through each item in search result array 1 to see if it appears in both of the other result arrays. The objects that appear in all three of the search result arrays would be returned and displayed on a page.
Is there an easy way to implement this without writing a bunch of my own code? Are there some functions I should know about? Any help will be appreciated!
Well, both Ransack and it's predecessor (MetaSearch) are useful gems for creating complex search forms.
In general I think you want to do something like select distinct achievement ids for user ids in an array. Off the top of my head I'm not quite sure how you should write it... others may know.
Look at the documentation on MetaSearch (more established) and see if you see a pattern that fits, if not check Ransack (more advanced).
You can use some autocomplete plugin for user names and convert the names to ids on the fly, that way you won't have to deal with converting user names to ids in backend later.
For common achievements, if a user can have a achievement only once, aggregating the results in join table and counting the results with achievement ids would be the way to go.
You can provide more details for a more detailed answer. :)
You can use Sunspot which is allows easy solr integration with Ruby and Rails

Single Inheritance or Polymorphic?

I'm programming a website that allows users to post classified ads with detailed fields for different types of items they are selling. However, I have a question about the best database schema.
The site features many categories (eg. Cars, Computers, Cameras) and each category of ads have their own distinct fields. For example, Cars have attributes such as number of doors, make, model, and horsepower while Computers have attributes such as CPU, RAM, Motherboard Model, etc.
Now since they are all listings, I was thinking of a polymorphic approach, creating a parent LISTINGS table and a different child table for each of the different categories (COMPUTERS, CARS, CAMERAS). Each child table will have a listing_id that will link back to the LISTINGS TABLE. So when a listing is fetched, it would fetch a row from LISTINGS joined by the linked row in the associated child table.
LISTINGS
-listing_id
-user_id
-email_address
-date_created
-description
CARS
-car_id
-listing_id
-make
-model
-num_doors
-horsepower
COMPUTERS
-computer_id
-listing_id
-cpu
-ram
-motherboard_model
Now, is this schema a good design pattern or are there better ways to do this?
I considered single inheritance but quickly brushed off the thought because the table will get too large too quickly, but then another dilemma came to mind - if the user does a global search on all the listings, then that means I will have to query each child table separately. What happens if I have over 100 different categories, wouldn't it be inefficient?
I also thought of another approach where there is a master table (meta table) that defines the fields in each category and a field table that stores the field values of each listing, but would that go against database normalization?
How would sites like Kijiji do it?
Your database design is fine. No reason to change what you've got. I've seen the search done a few ways. One is to have your search stored procedure join all the tables you need to search across and index the columns to be searched. The second way I've seen it done which worked pretty well was to have a table that is only used for search which gets a copy of whatever fields that need to be searched. Then you would put triggers on those fields and update the search table.
They both have drawbacks but I preferred the first to the second.
EDIT
You need the following tables.
Categories
- Id
- Description
CategoriesListingsXref
- CategoryId
- ListingId
With this cross reference model you can join all your listings for a given category during search. Then add a little dynamic sql (because it's easier to understand) and build up your query to include the field(s) you want to search against and call execute on your query.
That's it.
EDIT 2
This seems to be a little bigger discussion that we can fin in these comment boxes. But, anything we would discuss can be understood by reading the following post.
http://www.sommarskog.se/dyn-search-2008.html
It is really complete and shows you more than 1 way of doing it with pro's and cons.
Good luck.
I think the design you have chosen will be good for the scenario you just described. Though I'm not sure if the sub class tables should have their own ID. Since a CAR is a Listing, it makes sense that the values are from the same "domain".
In the typical classified ads site, the data for an ad is written once and then is basically read-only. You can exploit this and store the data in a second set of tables that are more optimized for searching in just the way you want the users to search. Also, the search problem only really exists for a "general" search. Once the user picks a certain type of ad, you can switch to the sub class tables in order to do more advanced search (RAM > 4gb, cpu = overpowered).

Resources