EdgeNGramFilterFactory still giving no results? - ruby-on-rails

I followed the Railscast to get Sunspot running and then this tutorial on enabling wildcard searching on my search field but for some reason it still isn't working.
Inside of my solr/conf/schema.xml I replaced the default lines with these instead for the EdgeNGramFilterFactory:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
When I search for "ste" or "steve jobs" I get Steve Jobs, but when I try "stv jbs" or "stv jobs" I get no results.
I reindexed and restarted the Sunspot server (and the Rails server) a couple of times.
Am I missing something here? What could be the issue?

EdgeNGramFilterFactory generates prefix n-grams for each token at index time.
With the whitespace tokenizer and a minimum gram size of 1, "steve jobs" is split into two tokens and indexed as:
s, st, ste, stev, steve, j, jo, job, jobs
Searches for "stv jbs" or "stv jobs" are misspellings rather than partial matches: "stv" and "jbs" are not prefixes of any indexed token, so they will not match the document.
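To see why the misspelled queries miss, here is a minimal Ruby sketch that imitates the analyzer chain from the question (whitespace tokenizer, lowercase, front edge n-grams at index time; no n-grams at query time). It is only a simplified model, not Solr itself, and the method names are made up for illustration.

```ruby
# Simplified model of the index-time analyzer from the question:
# whitespace tokenizer -> lowercase -> front edge n-grams.
def edge_ngrams(token, min_gram: 1, max_gram: 50)
  upper = [max_gram, token.length].min
  (min_gram..upper).map { |n| token[0, n] }
end

def index_terms(text)
  text.split.flat_map { |token| edge_ngrams(token.downcase) }
end

# Query-time analyzer: whitespace tokenizer -> lowercase, no n-grams.
def query_terms(text)
  text.split.map(&:downcase)
end

# Query terms that are present among the indexed terms.
def matching_terms(document, query)
  query_terms(query) & index_terms(document)
end

p index_terms("Steve Jobs")
p matching_terms("Steve Jobs", "ste")      # "ste" is an indexed prefix
p matching_terms("Steve Jobs", "stv jbs")  # neither term was indexed
```

Note that in this model "stv jobs" still matches on "jobs" alone. Getting no results for it, as reported in the question, is consistent with the query requiring every term to match.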

Related

give importance to documents which contains the word proximity + solr + sunspot

I am working on a Rails application that uses the Apache Solr search engine through the Sunspot gem. My problem: when I search for house rent, the AND query returns thousands of results, but they are not relevant.
I expect documents that contain the words house and rent near each other to come out on top. Instead, the documents that contain the most occurrences of house and rent come out on top, with no regard for word proximity.
My schema.xml contains the following definition:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,\.;\(\)]+"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
What changes do I need to make to achieve this? Do I need to add any filters?
You can try this:
<fieldType name="shingleString" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99"/>
<filter class="solr.PositionFilterFactory" />
</analyzer>
</fieldType>
Use phrase fields and boost them, or try a proximity query such as "house rent"~5.
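To make the ~5 in "house rent"~5 concrete, here is a rough Ruby sketch of how much slop a two-word phrase needs to match a document. It is a simplified model of Lucene-style sloppy phrase matching (plain whitespace tokens, no stemming or stopwords), and the method names are hypothetical.

```ruby
# Smallest slop a two-word phrase query needs in order to match a document,
# under a simplified model of sloppy phrase matching.
def required_slop(document, first, second)
  words = document.downcase.split
  first_positions  = words.each_index.select { |i| words[i] == first }
  second_positions = words.each_index.select { |i| words[i] == second }
  slops = first_positions.product(second_positions).map do |a, b|
    # In the query, `second` sits one position after `first`.
    ((b - 1) - a).abs
  end
  slops.min # nil when either word is missing
end

def matches_with_slop?(document, first, second, slop)
  needed = required_slop(document, first, second)
  !needed.nil? && needed <= slop
end

p required_slop("house rent", "house", "rent")
p required_slop("house available for rent", "house", "rent")
p matches_with_slop?("house available for rent", "house", "rent", 5)
```

In this model, documents where the two words sit closer together need less slop, which is the property that phrase boosting relies on to rank them higher.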

Sunspot - How to use a regex with sunspot SOLR? [duplicate]

any help is always welcome
I am using Sunspot with Solr but have not been able to find a good way to perform a wildcard search with Sunspot.
If I search for 8088***
it should return all numbers that start with 8088, but not 228088560.
Look for the following lines of code in /solr/conf/schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
...
</fieldType>
and replace them with this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Remember to restart the Solr server and reindex after these changes:
rake sunspot:solr:stop
rake sunspot:solr:start
rake sunspot:reindex
Sunspot gives you wildcard matching for free* with the NGramTokenizer (there are sometimes NGramTokenizer issues with subsets that are too small, and other quirks), which means that exclusion is actually the tricky part. If you know the number of digits in the number (say 6), a crude but effective way to handle this is to use without(:field).greater_than(808900) and without(:field).less_than(808799). I don't remember whether .greater_than and .less_than are actually >= and <=; if they are strictly > and <, you may want to use 808899 and 808800 instead, but you get the idea.
**Correction: There is a solution for this: you can change the NGramFilterFactory in your solr/conf/schema.xml to an EdgeNGramFilterFactory (assuming you had an NGramFilterFactory in the first place to get partial-word searching). This makes the index break words up only from the beginning of each string. After this, restart your server and reindex.
***All credit to Zach Moazeni at Collective Idea for this
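For the crude range approach, the bounds can be computed rather than typed by hand. This is a small sketch, assuming the field holds integers with a known number of digits; prefix_range is a made-up helper, not part of Sunspot.

```ruby
# Inclusive range of all numbers with `total_digits` digits that start with `prefix`.
def prefix_range(prefix, total_digits)
  prefix = prefix.to_s
  free_digits = total_digits - prefix.length
  raise ArgumentError, "prefix longer than total_digits" if free_digits < 0

  low  = prefix.to_i * 10**free_digits
  high = low + 10**free_digits - 1
  low..high
end

range = prefix_range(8088, 6)
p range                        # 808800..808899
p range.cover?(808812)         # true
p range.cover?(228088560)      # false
```

Since the bounds are inclusive, they sidestep the question of whether greater_than and less_than are strict. A range like this could probably be passed to a range restriction in the search block, assuming the field is indexed as an integer.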

EdgeNGramFilterFactory not working (not indexing?)

I am having trouble getting ngrams to work. Here's my schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
</analyzer>
</fieldType>
My database has a bunch of entries with
"Elizabeth"
and
"Elizabeths"
When I try to query on "Elizabeth" I get only "Elizabeth" and not "Elizabeths".
The odd thing is, when I check the Solr admin, the Analysis page shows that the EdgeNGramFilterFactory is indeed available, and results in "Elizabeths" being expanded into
e el eli eliz eliza elizab elizabe elizabet elizabeth
It seems like the indexer isn't picking up on this. I have the same problem when I move the synonyms filter from the query block to the index block. That is to say, when I have the synonyms filter in the query block, it works, but when I put it in the index block, it has no effect.
I have restarted Sunspot and reindexed multiple times. No dice. Any ideas? How can I directly check the indexed words list?
I think I found the problem, and it looks like a noob error.
In my model, I was using the following construct, as per one of the tutorials:
class Institution < ActiveRecord::Base
.
.
.
end
Sunspot.setup(Institution) do
text :name
end
This did not seem to throw any errors when I started, stopped, or reindexed. It struck me as strange that I was able to reindex immediately after stopping Solr.
I switched to
class Institution < ActiveRecord::Base
.
.
.
searchable do
text :name
end
end
When I did this, I found that I could not reindex after stopping Solr. However, when I started Solr and reindexed, the index appeared to be truly refreshed and my queries finally behaved as expected.

Adding stemming to my schema.xml file doesn't work

I'm trying to set up Websolr on my Heroku app. I'm following the instructions in the Heroku docs. I've got the initial setup working fine.
In development:
ruby-1.9.2-p0 > Note.search { keywords 'grit' }.results.length
=> 3
I am trying to add stemming. I updated the relevant part of my schema.xml file to this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
I then reindexed:
$ rake sunspot:reindex
But it doesn't seem to work at all:
ruby-1.9.2-p0 > Note.search { keywords 'gri' }.results.length
=> 0
What am I doing wrong?
I have two ideas for you here:
Firstly, you didn't mention whether you restarted Solr after changing your schema.xml. So: are you restarting Solr so your changes can take effect? :)
Next, I wonder whether the term grit would even qualify to have its t removed under the Porter stemming algorithm. You would need to read the Porter stemmer algorithm closely to be sure, but you could also try some more obvious examples (say, writing to write).
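It may also help to keep in mind that stemming matches words that reduce to the same stem; it does not turn arbitrary prefixes into matches. The sketch below uses a deliberately tiny suffix stripper to show the idea. It is NOT the Porter algorithm, only an illustration with made-up method names.

```ruby
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
TOY_SUFFIXES = %w[ing ed s].freeze

def toy_stem(word)
  word = word.downcase
  # Strip the first listed suffix that leaves at least three characters.
  suffix = TOY_SUFFIXES.find { |s| word.end_with?(s) && word.length - s.length >= 3 }
  suffix ? word[0...-suffix.length] : word
end

# With stemming on both the index and query side, two words match
# when they reduce to the same stem.
def stem_match?(indexed_word, query_word)
  toy_stem(indexed_word) == toy_stem(query_word)
end

p stem_match?("notes", "note")  # true: both reduce to "note"
p stem_match?("grit", "gri")    # false: "gri" is a prefix, not a shared stem
```

If prefix matches like gri are what you are after, that is closer to what an edge n-gram filter does than to what a stemmer does.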
