any help is always welcome
I am using sunspot with solr but not able to find any good solution that how to perform wildcard search with sunspot
if i search for 8088***
it should return all numbers starts with 8088 but not 228088560
Look for the following lines of code in /solr/conf/schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
...
</fieldType>
and replace them with this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Remember to restart the solr server, and reindex after these changes
rake sunspot:solr:stop
rake sunspot:solr:start
rake sunspot:reindex
Sunspot gives you wildcard for free* with NGramToeknizer(there are sometimes NGramTokenizer issues for subsets that are too small and other quirks), which means that exclusion is actually the tricky part. If you know the number of digits in the number (say 6), a crude, but effective, way to handle this would be to use without (:field).greater_than(808900) without (:field).less_than(808700) <-- I don't remember whether .greater_than and .less_than are actually => and =< , so if they are just > and < you may want to do 808899 and 808800 instead, but you get the idea.
**Correction There is a solution for this: you can change the NGramFilterFactory in your solr/config/schema.xml to an EdgeNGramFilterFactory (assuming you had an NGramFilterFactory in the first place to get the partial-word seaching). This makes the index only break up words starting at the beginning of strings. After this, restart your server and reindex.
***All credit to Zach Moazeni at Collective Idea for this
Related
I must be doing something wrong trying to run the following search
http://localhost:8983/solr/collection1/select?q=url:www.abc.com&wt=xml&indent=true
It is not giving this sites results back, it's giving everything back. The schema.xml is pretty vanilla in how url is set up.
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="url" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"/>
</analyzer>
</fieldType>
If I use host:www.abc.com, it works.
Why the seemingly incorrect results when using the url field?
Thanks for any and all help.
Assuming that you are on Solr 3.1 or greater.
StandardTokenizerFactory - It creates token based on Word Boundary rules. This means URLs will be broken into multiple tokens and match on any one of them would be considered a hit.
Try using KeywordTokenizerFactory, for your url fieldtype. This should preserve the complete URL and match against it only.
In addition to using KeywordTokenizerFactory, you will have to remove the WordDelimiterFilterFactory. WDF splits tokens on punctuation and other delimiters ... which are very plentiful in URLs. You'll have to rebuild your index after making the change and restarting Solr or reloading the core.
An alternate idea, if you don't need to force URLs to lowercase: Switch from TextField to StrField and get rid of the analyzer config entirely.
any help is always welcome
I am using sunspot with solr but not able to find any good solution that how to perform wildcard search with sunspot
if i search for 8088***
it should return all numbers starts with 8088 but not 228088560
Look for the following lines of code in /solr/conf/schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
...
</fieldType>
and replace them with this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Remember to restart the solr server, and reindex after these changes
rake sunspot:solr:stop
rake sunspot:solr:start
rake sunspot:reindex
Sunspot gives you wildcard for free* with NGramToeknizer(there are sometimes NGramTokenizer issues for subsets that are too small and other quirks), which means that exclusion is actually the tricky part. If you know the number of digits in the number (say 6), a crude, but effective, way to handle this would be to use without (:field).greater_than(808900) without (:field).less_than(808700) <-- I don't remember whether .greater_than and .less_than are actually => and =< , so if they are just > and < you may want to do 808899 and 808800 instead, but you get the idea.
**Correction There is a solution for this: you can change the NGramFilterFactory in your solr/config/schema.xml to an EdgeNGramFilterFactory (assuming you had an NGramFilterFactory in the first place to get the partial-word seaching). This makes the index only break up words starting at the beginning of strings. After this, restart your server and reindex.
***All credit to Zach Moazeni at Collective Idea for this
I followed the Railscast to get Sunspot running and then this tutorial on enabling wildcard searching on my search field but for some reason it still isn't working.
Inside of my solr/conf/schema.xml I replaced the default lines with these instead for the EdgeNGramFilterFactory:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
When I search for "ste" or "steve jobs" I get Steve Jobs, but when I try "stv jbs" or stv jobs" I get no results.
I reindexed and restarted the sunspot server a couple of times ( also the rails server).
Am I missing something here? What could be the issue?
EdgeNGramFilterFactory basically creates n-grams for the terms.
So for steve jobs with min gram size as 1 the following tokens would be generated -
s, st, ste, stev, steve, steve j, steve jo, steve job, steve jobs
As in your case searching for stv jbs or stv jobs are more of an misspellings rather than partial matches, and would not match the documents.
I am having trouble getting ngrams to work. Here's my schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
</analyzer>
</fieldType>
My database has a bunch of entries with
"Elizabeth"
and
"Elizabeths"
When I try to query on "Elizabeth" I get only "Elizabeth" and not "Elizabeths".
The odd thing is, when I check out the solr admin, the Analysis page shows that the EdgenGramFilterFactory is indeed available, and results in "Elizabeths" being expanded into
e el eli eliz eliza elizab elizabe elizabet elizabeth
It seems like the indexer isn't picking up on this. I have the same problem when I move the synonyms filter from the query block to the index block. That is to say, when I have the synonyms filter in the query block, it works, but when I put it in the index block, it has no effect.
I have restarted Sunspot and reindexed multiple times. No dice. Any ideas? How can I directly check the indexed words list?
I think I found the problem and it looks like a noob error.
In my model, is was using the following construct as per one of the tutorials:
class Institution < ActiveRecord::Base
.
.
.
end
Sunspot.setup(Institution) do
text :name
end
This did not seem to throw any errors when I started, stopped, or reindexed. It struck me as strange that I was able to reindex immediately after stopping Solr.
I switched to
class Institution < ActiveRecord::Base
.
.
.
searchable do
text :name
end
endH
When I did this, I found that I could not reindex after stopping Solr. However, when I started Solr and reindexed, the index appeared to be truly
refreshed and my queries finally behaved as expected.
I'm trying to setup Websolr on my Heroku app. I'm following the instructions in the Heroku docs. I've got the initial setup working fine.
In development:
ruby-1.9.2-p0 > Note.search { keywords 'grit' }.results.length
=> 3
I am trying to add stemming. I updated the relevant part of my schema.xml file to this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
I then reindexed:
$ rake sunspot:reindex
But it doesn't seem to work at all:
ruby-1.9.2-p0 > Note.search { keywords 'gri' }.results.length
=> 0
What am I doing wrong?
I have two ideas for you here:
Firstly, you didn't mention whether you were restarting Solr after changing your schema.xml. So: are you restarting Solr so your changes can take effect? :)
Next, I am wondering if the term grit would even qualify to have its t removed under the Porter stemming algorithm. You would need to have a close read of the PorterStemmer algorithm to be sure. But you may also try some more obvious examples (say, writing to write).