how to implement wildcard search with sunspot - ruby-on-rails

Any help is welcome.
I am using Sunspot with Solr, but I cannot find a good way to perform a wildcard search with Sunspot.
If I search for 8088***,
it should return all numbers that start with 8088, but not 228088560.

Look for the following lines of code in /solr/conf/schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
...
</fieldType>
and replace them with this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Remember to restart the Solr server and reindex after these changes:
rake sunspot:solr:stop
rake sunspot:solr:start
rake sunspot:reindex

Sunspot gives you wildcard matching for free* with NGramTokenizer (subsets that are too small sometimes cause NGramTokenizer issues, among other quirks), which means that exclusion is actually the tricky part. If you know the number of digits in the number (say 6), a crude but effective way to handle this would be to use without(:field).greater_than(808900) and without(:field).less_than(808799). I don't remember whether .greater_than and .less_than actually mean >= and <=; if they are just > and <, you may want to use 808899 and 808800 instead, but you get the idea.
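The bounds used in the range trick above follow from the prefix and the total digit count. A minimal sketch in plain Ruby, with no Sunspot involved; `prefix_bounds` is a hypothetical helper written for this illustration, not part of Sunspot:

```ruby
# Hypothetical helper: smallest and largest integers that have
# `total_digits` digits and start with `prefix`.
def prefix_bounds(prefix, total_digits)
  padding = total_digits - prefix.length
  raise ArgumentError, "prefix longer than total_digits" if padding < 0

  low  = (prefix + "0" * padding).to_i  # pad the prefix with zeros
  high = (prefix + "9" * padding).to_i  # pad the prefix with nines
  [low, high]
end

low, high = prefix_bounds("8088", 6)
puts "keep values from #{low} to #{high}"
```

Everything below the lower bound or above the upper bound is what the two without clauses need to exclude; whether the boundary values themselves have to be shifted by one depends on whether the comparison is strict.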
**Correction: there is a solution for this. You can change the NGramFilterFactory in your solr/conf/schema.xml to an EdgeNGramFilterFactory (assuming you had an NGramFilterFactory in the first place to get partial-word searching). This makes the index break up words only from the beginning of each string. After this, restart your server and reindex.
***All credit to Zach Moazeni at Collective Idea for this
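The difference between the two filters can be sketched in plain Ruby. This is only a simplified model of what ends up in the index, not Solr's implementation; the helper names `ngrams` and `edge_ngrams` are made up for the illustration, and the gram sizes 1 and 20 are taken from the schema shown earlier.

```ruby
# All substrings of length min..max (roughly what a plain n-gram filter indexes).
def ngrams(token, min, max)
  grams = []
  (min..max).each do |size|
    (0..token.length - size).each { |start| grams << token[start, size] }
  end
  grams
end

# Only the substrings anchored at the start of the token (side="front").
def edge_ngrams(token, min, max)
  upper = [max, token.length].min
  (min..upper).map { |size| token[0, size] }
end

query = "8088"
%w[8088123 228088560].each do |number|
  puts "#{number}: ngram=#{ngrams(number, 1, 20).include?(query)} " \
       "edge=#{edge_ngrams(number, 1, 20).include?(query)}"
end
```

With plain n-grams both numbers contain the gram 8088, so both match. With front edge n-grams only the number that starts with 8088 does.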

Related

Give importance to documents in which the words appear close together (word proximity) + solr + sunspot

I am working on a Rails application that uses the Apache Solr search engine through the Sunspot gem. I am facing one problem: if I search for house rent, I get thousands of results using an AND query, but the results are not relevant.
I expect documents in which the words house and rent appear near each other to come out on top. For now, the documents that contain the most occurrences of house and rent come out on top, with no regard for word proximity.
My schema.xml contains the following definition:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,\.;\(\)]+"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
What changes are needed to achieve this? Do I need to add any filters?
You can try this:
<fieldType name="shingleString" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99"/>
<filter class="solr.PositionFilterFactory" />
</analyzer>
</fieldType>
Use phrase fields and boost them, or you can try a proximity search like "house rent"~5.
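To get a feel for what the query-side ShingleFilterFactory in that fieldType produces, here is a rough sketch in plain Ruby. It only models whitespace splitting plus word n-grams with single words included (as with outputUnigrams="true"); `shingles` is a hypothetical helper, and the order of the real filter's output is not something this sketch claims to reproduce.

```ruby
# Word n-grams ("shingles") over whitespace-separated query terms,
# including the single words themselves.
def shingles(query, max_size)
  words = query.split
  result = []
  words.each_index do |start|
    (1..max_size).each do |size|
      break if start + size > words.length
      result << words[start, size].join(" ")
    end
  end
  result
end

p shingles("house rent", 99)
p shingles("cheap house rent", 99)
```

Because the index side of that fieldType uses KeywordTokenizerFactory, each of these shingles is compared against the whole indexed value, so the two-word shingle house rent can only match a value that is exactly that phrase.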

Improper results from searching on a URL in solr

I must be doing something wrong when running the following search:
http://localhost:8983/solr/collection1/select?q=url:www.abc.com&wt=xml&indent=true
It does not return just this site's results; it returns everything. The schema.xml is pretty vanilla in how url is set up.
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="url" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"/>
</analyzer>
</fieldType>
If I use host:www.abc.com, it works.
Why do I get seemingly incorrect results when using the url field?
Thanks for any and all help.
This assumes that you are on Solr 3.1 or greater.
StandardTokenizerFactory creates tokens based on word boundary rules. This means URLs will be broken into multiple tokens, and a match on any one of them is considered a hit.
Try using KeywordTokenizerFactory for your url fieldType. This should preserve the complete URL and match against it only.
In addition to using KeywordTokenizerFactory, you will have to remove the WordDelimiterFilterFactory. WDF splits tokens on punctuation and other delimiters ... which are very plentiful in URLs. You'll have to rebuild your index after making the change and restarting Solr or reloading the core.
An alternate idea, if you don't need to force URLs to lowercase: Switch from TextField to StrField and get rid of the analyzer config entirely.
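To see why per-token matching returns everything while whole-value matching does not, here is a minimal sketch in plain Ruby. It only approximates the analysis chain by splitting on non-alphanumeric characters (Solr's real tokenizer and filters apply more rules than this), and the sample URLs and helper names are made up for the illustration.

```ruby
# Rough stand-in for delimiter-based splitting: break on anything that is
# not a letter or digit. Solr's real filters have more rules than this.
def split_tokens(url)
  url.downcase.split(/[^a-z0-9]+/).reject(&:empty?)
end

# Keyword-style analysis: the whole value is a single token.
def keyword_token(url)
  [url.downcase]
end

docs  = %w[www.abc.com www.example.com www.other.com]  # made-up sample data
query = "www.abc.com"

# A hit on any shared token counts as a match.
any_token = docs.select { |d| (split_tokens(d) & split_tokens(query)).any? }
# Only an identical whole value counts as a match.
whole     = docs.select { |d| keyword_token(d) == keyword_token(query) }

p any_token
p whole
```

Every sample URL shares the tokens www and com with the query, so the first list contains all of them, while the second contains only the exact URL.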

EdgeNGramFilterFactory still giving no results?

I followed the Railscast to get Sunspot running and then this tutorial on enabling wildcard searching on my search field but for some reason it still isn't working.
Inside of my solr/conf/schema.xml I replaced the default lines with these instead for the EdgeNGramFilterFactory:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
When I search for "ste" or "steve jobs" I get Steve Jobs, but when I try "stv jbs" or "stv jobs" I get no results.
I reindexed and restarted the Sunspot server a couple of times (and the Rails server as well).
Am I missing something here? What could be the issue?
EdgeNGramFilterFactory creates n-grams from the start of each term.
With the WhitespaceTokenizerFactory in your schema, steve jobs is first split into two terms, so with a minimum gram size of 1 the following tokens would be generated:
s, st, ste, stev, steve, j, jo, job, jobs
Searches for stv jbs or stv jobs are misspellings rather than partial matches, so they would not match the documents.
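The behaviour can be reproduced with a small model of the two analyzers in plain Ruby. This is a sketch under assumptions: it uses the whitespace tokenizer from the question's schema, so each word gets its own front edge n-grams, it assumes every query term has to be found (which is consistent with "stv jobs" returning nothing), and the helper names are made up.

```ruby
# Index side: whitespace tokenizer, lowercase, then front edge n-grams.
def indexed_terms(text, min = 1, max = 50)
  text.downcase.split.flat_map do |word|
    (min..[max, word.length].min).map { |size| word[0, size] }
  end
end

# Query side: whitespace tokenizer and lowercase only, no n-grams.
def query_terms(text)
  text.downcase.split
end

# Assumption: every query term has to be present in the indexed terms.
def match?(document, query)
  terms = indexed_terms(document)
  query_terms(query).all? { |term| terms.include?(term) }
end

["ste", "steve jobs", "stv jbs", "stv jobs"].each do |q|
  puts "#{q.inspect}: #{match?('Steve Jobs', q)}"
end
```

Only prefixes of the indexed words are in the index, so ste matches while stv does not. Catching misspellings like that would need something other than edge n-grams, such as fuzzy or phonetic matching.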

Adding stemming to my schema.xml file doesn't work

I'm trying to setup Websolr on my Heroku app. I'm following the instructions in the Heroku docs. I've got the initial setup working fine.
In development:
ruby-1.9.2-p0 > Note.search { keywords 'grit' }.results.length
=> 3
I am trying to add stemming. I updated the relevant part of my schema.xml file to this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
I then reindexed:
$ rake sunspot:reindex
But it doesn't seem to work at all:
ruby-1.9.2-p0 > Note.search { keywords 'gri' }.results.length
=> 0
What am I doing wrong?
I have two ideas for you here:
Firstly, you didn't mention whether you were restarting Solr after changing your schema.xml. So: are you restarting Solr so your changes can take effect? :)
Next, I am wondering if the term grit would even qualify to have its t removed under the Porter stemming algorithm. You would need to have a close read of the PorterStemmer algorithm to be sure. But you may also try some more obvious examples (say, writing to write).
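It may also help to keep in mind that stemming is not prefix matching: both the indexed word and the query word are reduced to a stem, and the stems have to be equal. The sketch below uses a deliberately tiny toy stemmer to show the idea in plain Ruby. It is NOT the Porter algorithm, and `toy_stem` and its suffix list are made up for the illustration.

```ruby
# Toy stemmer for illustration only; this is NOT the Porter algorithm.
# It strips one of a few suffixes if at least three characters remain.
def toy_stem(word)
  word = word.downcase
  suffix = %w[ing ed s].find do |s|
    word.end_with?(s) && word.length - s.length >= 3
  end
  suffix ? word[0, word.length - suffix.length] : word
end

# Both sides are stemmed, then compared for equality.
def stem_match?(indexed_word, query_word)
  toy_stem(indexed_word) == toy_stem(query_word)
end

puts stem_match?("searching", "searched")  # both reduce to the same stem
puts stem_match?("grit", "gri")            # neither word changes, so no match
```

Under this model a truncated query such as gri only matches if some indexed word happens to stem to exactly gri, which is why prefix-style matching is usually done with edge n-grams instead.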
