Given I have a model
class Firm < ActiveRecord::Base
searchable do
text :name
end
end
And solr's schema.xml contains
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="30"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And I have a Firm with name == 'Ойл-М (Oil-M)'
When I try to search
Sunspot.search(Firm) do
fulltext 'Ойл-М'
end
Then I get nothing
When I try to search
Sunspot.search(Firm) do
fulltext 'Ойл'
end
Then I get needed Firm
How should I set up Solr and/or search to be able to find this Firm by both queries?
Your NGramFilter is cutting off the final 'M', because you have minGramSize=2. Setting minGramSize=1 will work, but this greatly increases the size of data Solr will have to store, and also drives up noise.
When you index and query a field in Solr, two things happen:
The field is split up into smaller pieces (tokenized),
Each token is then filtered.
This happens separately for indexing and querying.
In this case, you are indexing the field with StandardTokenizerFactory, StandardFilter, LowerCaseFilter, and an NGramFilter, and querying the field with everything except the NGramFilter.
Here's what's happening when you index "Ойл-М (Oil-M)" into Solr.
StandardTokenizerFactory: ['Ойл', 'М', 'Oil', 'M']
StandardFilter: ['Ойл', 'М', 'Oil', 'M']
LowerCaseFilter: ['ойл', 'м', 'oil', 'm']
NGramFilter: ['ой', 'йл', 'ойл', 'oi', 'il', 'oil']
The single-character tokens ('м' and 'm') drop away completely, because they are shorter than minGramSize. Searching for "Ойл-М" returns nothing, because there is no 'м' in the index to match.
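To make the walk-through above concrete, here is a rough plain-Ruby imitation of that analysis chain. It does not use Solr at all: `tokenize`, `ngrams`, `index_terms` and `match_all?` are made-up helper names, the regex is only a crude stand-in for StandardTokenizerFactory, and it assumes every query term must match (which is the behaviour described in the question). It also assumes Ruby 2.4+ so that `downcase` handles Cyrillic.

```ruby
# Simplified stand-ins for the Solr analysis chain (NOT the real Solr classes).

# Crude approximation of StandardTokenizer + LowerCaseFilter.
def tokenize(text)
  text.scan(/[[:alnum:]]+/).map(&:downcase)
end

# All substrings of token whose length is between min and max.
# Tokens shorter than min produce no grams at all.
def ngrams(token, min, max)
  (min..[max, token.length].min).flat_map do |size|
    (0..token.length - size).map { |start| token[start, size] }
  end
end

# Index-time analysis: tokenize, lowercase, then n-gram every token.
def index_terms(text, min: 2, max: 30)
  tokenize(text).flat_map { |token| ngrams(token, min, max) }.uniq
end

# Query-time analysis has no n-gram step; assume every term must be present.
def match_all?(indexed, query)
  tokenize(query).all? { |term| indexed.include?(term) }
end

indexed = index_terms("Ойл-М (Oil-M)")
puts indexed.inspect              # ["ой", "йл", "ойл", "oi", "il", "oil"]
puts match_all?(indexed, "Ойл-М") # false: the term "м" was never indexed
puts match_all?(indexed, "Ойл")   # true
```

Calling `index_terms("Ойл-М (Oil-M)", min: 1)` in this sketch makes the first query match as well, at the cost of a much larger term list.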
Cut out the NGramFilter unless you have a very good reason to use it, and stick with the standard Russian fieldType.
<fieldType name="text_ru" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ru.txt" format="snowball" enablePositionIncrements="~
<filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
</analyzer>
</fieldType>
NOTE: Notice that there is no distinction here between the index analyzer and query analyzer. Each query is transformed in the exact same manner as when indexed.
Related
I'm running Sunspot Solr in my rails app. I am using it to enable a user to search for different "articles" by using fulltext search on the :name attribute. At this point in time, I have Sunspot Solr configured and it's working nicely.
However, when I search for dog mouse cat (as an example), it only returns articles that contain all of the keywords. How can I configure Solr to show articles like 'The dog and the cat' - which contains only 2 of the 3 search keywords in the query example above?
My searchable block in the model:
searchable do
text :name
end
My current schema.xml for fulltext search looks like this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
Model.search do
fulltext text do
fields :name
minimum_match 2
end
end
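As a rough illustration of what `minimum_match 2` asks for (a document qualifies once at least two of the query terms are found in it), here is a plain-Ruby sketch. It does not talk to Solr; `terms` and `matches_with_minimum?` are hypothetical helpers, and the stemming, synonym and edge n-gram steps from the schema above are ignored.

```ruby
# Split text into lowercase word terms (a crude stand-in for Solr analysis).
def terms(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# True when at least `minimum` distinct query terms occur in the document.
def matches_with_minimum?(document, query, minimum)
  doc_terms = terms(document)
  hits = terms(query).uniq.count { |term| doc_terms.include?(term) }
  hits >= minimum
end

articles = ["The dog and the cat", "A mouse", "Dog, mouse and cat"]
selected = articles.select { |name| matches_with_minimum?(name, "dog mouse cat", 2) }
puts selected.inspect # ["The dog and the cat", "Dog, mouse and cat"]
```

"A mouse" is left out because it contains only one of the three keywords.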
I am trying to configure Solr (with Sunspot and Rails) to work like this:
Given a name: Willie Price
I want to be able to search for: llie (for example)
In my schema.xml I added:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
My Rails model:
searchable do
text :name, :as => :ngram
end
I also tried just:
searchable do
text :name
end
On my controller:
@search = Client.search do
keywords params[:search]
end
And I also tried this:
@search = Client.search do
full_text params[:search]
end
The problem: I am only able to search using complete words, so given "Willie Price", only Willie or Price works.
Thanks a lot
I wanted to understand whether Sunspot, in standard mode, searches for words or sequences of characters in full-text search and how to make it search for sequences.
For example, I have the following setup:
class User < ActiveRecord::Base
searchable do
text :email
end
end
with one User with e-mail "panayotis@matsinopoulos.gr",
the following query:
search = User.search do
fulltext 'matsinopoulos'
end
does not return any results, whereas:
search = User.search do
fulltext 'panayotis@matsinopoulos.gr'
end
does.
Is there any configuration setting for sunspot to match sequences of characters instead of words?
Or, am I doing something wrong?
You need to edit the file:
solr/conf/schema.xml
The standard entry:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
has to be changed to:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory"
minGramSize="3"
maxGramSize="30"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
A very nice reference on Solr configuration can be found here:
http://techbot.me/2011/01/full-text-search-in-in-rails-with-sunspot-and-solr/
but note that, when it comes to partial word matching, this reference talks about the EdgeNGramFilterFactory, which indexes only the beginnings of words. To make Solr match any part of a word, the NGramFilterFactory needs to be used.
Note also that we have set minGramSize to 3 and maxGramSize to 30, so search terms shorter than 3 or longer than 30 characters will not match anything.
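The difference between the two filters can be sketched in plain Ruby. This is only an imitation of what the filters emit, not Solr code; `ngrams` and `edge_ngrams` are made-up helper names, and the sizes 3 and 30 are taken from the schema above.

```ruby
# Every substring of token with a length between min and max
# (roughly what NGramFilterFactory emits for one token).
def ngrams(token, min, max)
  (min..[max, token.length].min).flat_map do |size|
    (0..token.length - size).map { |start| token[start, size] }
  end
end

# Only the prefixes of token with a length between min and max
# (roughly what EdgeNGramFilterFactory emits from the front of a token).
def edge_ngrams(token, min, max)
  (min..[max, token.length].min).map { |size| token[0, size] }
end

word = "matsinopoulos"
puts edge_ngrams(word, 3, 30).first(2).inspect    # ["mat", "mats"]
puts ngrams(word, 3, 30).include?("sinop")        # true: inner substring is indexed
puts edge_ngrams(word, 3, 30).include?("sinop")   # false: not a prefix
puts ngrams(word, 3, 30).include?("ma")           # false: shorter than minGramSize
```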
I have just finished setting up sunspot_rails, and it seems to be working well except for one thing.
After I created 3 records like the ones below
name=John
name=John2
name=John3
when I search with the keyword "John", only the 1st record shows up. It looks like it requires an exact match.
I'd like all of them to appear in the search results.
Is this supposed to happen by default?
Or did I set something up wrong?
If you want to match substrings in fulltext search, take a look at
https://github.com/sunspot/sunspot/wiki/Matching-substrings-in-fulltext-search
You can also add a file sunspot_solr.rb in myapp/config/initializers/ to set the pagination of results:
Sunspot.config.pagination.default_per_page = 100
This returns up to 100 results per page.
Added:
Your schema.xml file is found in yourappfolder/solr/conf
You can also add <filter class="solr.NGramFilterFactory"/> to match arbitrary substrings.
This is my particular config for schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldtype class="solr.TextField" name="text_pre" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="10"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldtype>
For me it works fine with full strings and substrings for all keywords. Please do not forget to restart the server and reindex your models for the changes to take effect.
Regards!
Thanks!!!
Block from the girls controller (girls_controller.rb):
def index
@search = Girl.search do
fulltext params[:search]
end
@girls = @search.results
# @girls = Girl.all
#
# respond_to do |format|
# format.html # index.html.erb
# format.json { render json: @girls }
# end
end
Block from the Girl model (girl.rb):
searchable do
text :name_en, :name_es, :name_ja
end
Say, I have this code in my model:
class Facility < ActiveRecord::Base
...
searchable do
text :name
text :facility_type do
end
...
And this in the search controller:
@search = Facility.search do
keywords(query) do
boost_fields :name => 1.9,
:facility_type => 1.98
end
...
And I have two Facility objects: the first has the type "cafe" but does not have the word "cafe" in its name; the second is called "cafe sun", for example, but is actually of type "bar".
I run the search with query="cafe" and get both facilities in the response, but the score is 5.003391 for "cafe sun" and 1.250491 for the real "cafe".
For the second try I set
boost_fields :name => 1.9, :facility_type => 3
The score for "cafe sun" doesn't change, but the score for "cafe" grows somewhat, to 1.8946824.
So, since results are sorted by score, I am interested in how it is calculated.
Or am I choosing the wrong tokenizers or something? Here is what I have in schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory"
minGramSize="3"
maxGramSize="30"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Scoring results is the domain of the Lucene library, and the crux of its algorithm is described in detail here:
http://lucene.apache.org/core/3_6_1/api/core/org/apache/lucene/search/Similarity.html
http://lucene.apache.org/core/3_6_1/scoring.html
To inspect the raw scoring data, run a query against your Solr instance directly and append the debugQuery=on parameter:
http://localhost:8983/solr/select?q=test&defType=dismax&qf=name_text+facility_type_text&debugQuery=on
For general relevancy optimizations in Solr, you can consult the SolrRelevancyFAQ. It also has one question specifically demonstrating the output of debugQuery
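If you would rather build that debug URL from Ruby than type it by hand, something along these lines should do. It is only a sketch: `solr_debug_url` is a made-up helper, and the host, port and `_text` field names are simply copied from the example URL above, so adjust them to your own setup.

```ruby
require "uri"

# Build a dismax query URL with debugQuery=on for the given search fields.
# The default base URL is an assumption taken from the example above.
def solr_debug_url(query, fields, base: "http://localhost:8983/solr/select")
  params = {
    q: query,
    defType: "dismax",
    qf: fields.join(" "), # encoded as "+" between field names
    debugQuery: "on"
  }
  "#{base}?#{URI.encode_www_form(params)}"
end

puts solr_debug_url("cafe", %w[name_text facility_type_text])
```

Open the printed URL in a browser (or fetch it with curl) and look at the debug section of the response to see how each document's score was put together.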
All in all: you ask a very good question with a very deep answer. I may edit my response down the road to expand on the subject.