Finding singular versions of a word in Sunspot/Solr

Finding singular versions of a word in Sunspot/Solr - ruby-on-rails

I have a Rails+Sunspot application and I'm working on configuring it so that searching returns the singluar version of the query. For instance:
I want a search for "cookies" to return something named "cookie". Currently my Sunspot search returns "cookies" but not "cookie" (singluar).
I've made some customizations to Solr's schema.xml, adding solr.EdgeNGramFilterFactory to provide more flexibility but EdgeNGramFilterFactory doesn't suite this case as it only allows matches when the query is a substring of the result's name. My understanding is EdgeNGramFilterFactory will return "cookie" when the user searches for "co", "coo", "cook" or "cooki", but not a superstring of "cookie" (ie: cookies). Simply put, this is because "cookies" is not a substring within "cookie".
I've tried adding all three of Solr's build-in stemming factories but to no avail. You can see one commented out in my schema.
In schema.xml, the relevant field looks as follows:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
<!-- <filter class="solr.EnglishMinimalStemFilterFactory"/> -->
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I supposed I could singluarize the user's query but I would rather not touch their query before it hits Solr.
You can play with this here: http://staging.zisboombah.com/parent/food_guide/?search=cookie. Try changing the query between "cookie" and "cookies".
Any tips on how to do this in Solr would be greatly appreciated!

The solr xml options are ordered. You want the stemmer to come before the ngram filter, so that you ngram-ize cooki, rather than stemming c, co, etc.
Combining filters in this way may lead to some odd results, mostly depending on how aggressive your stemmer is. You should definitely add the stemmer to the query analyzer, but that will mess with your autocomplete.
A better solution: use a copyField to make independent text_stemmed and text_autocomplete fields. Then search using an OR query over both fields.

Like Kyle mentions, you probably want to use more text field types for each of these different use cases.
Here's an example of mine:
schema.xml
<schema>
<types>
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_stopwords" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
<!-- ... -->
</types>
<fields>
<!-- ... -->
</fields>
<copyField source="*_text" dest="text"/>
<copyField source="*_texts" dest="text"/>
<copyField source="*_textsv" dest="text"/>
<copyField source="*_textv" dest="text"/>
</schema>
Sunspot modeling
Using the copyField directive can save some setup work in the model. However, Sunspot uses those text declarations to decide which fields to keywords-search by default, so I like to include distinct text invocations that use :as to specify the full Solr document field name.
searchable do
text :name, stored: true, default_boost: 10
text :name, as: 'name_text_en'
text :description, stored: true
end

Related

solr sunspot unicode? characters

How do I use this file provided in sunspot (mapping-ISOLatin1Accent.txt)(or is this the one I need as well)? I need to be able to search for "las pinas" and include results like "las piñas" in my database. Meaning n => ñ? I have my config schema.xml like this for now:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" tokenizerFactory="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
and I have tried moving the <charFilter> setting around as well.
I have also searched and found various solutions mostly pointing to this or this articles but those don't seem to work either.

give importance to documents which contains the word proximity + solr + sunspot

I am working on rails application and which is based on Apache Solr search engine and we are using Sunspot gem. But I am facing one problem, If I search query house rent then its giving me thousands of results by using and query. But the results what I am getting are not relevant.
I am expecting the documents which contains the house and rent words near to each other, those documents should come on top. But for now the documents which contains more number of house and rent documents are coming on top. But there is no any word proximity.
My schema.xml contains following definition:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,\.;\(\)]+"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
To achieve this what changes are need to do? or any filter are necessary to add for this?

You can try this
<fieldType name="shingleString" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99"/>
<filter class="solr.PositionFilterFactory" />
</analyzer>
</fieldType>

Use phrase fields and boost them or you can try terms boosting like "house rent"~5

Solr ShingleFilterFactory in query analysis not worikng

I have field with definition below, I works perfect in analysis but when I try to query it in that way, query analysis behaves different. What I am missing?
data: thd_keyphrase: Privately held companies based in California,Social media,Privately held companies
query: q=thd_keyphrase:find any social media
in analysis query is processed this way: |find any|any social|social media
and it matches Social media
output from debug query is sifferent:
"rawquerystring": "thd_keyphrase:find any social media",
"querystring": "thd_keyphrase:find any social media",
"parsedquery": "thd_keyphrase:find text:ani text:social text:media",
"parsedquery_toString": **"thd_keyphrase:find text:ani text:social text:media",**
or when I remove default field text : "msg": "no field name specified in query and no default specified via 'df' param",
<fieldType name="keyphrase" class="solr.TextField" omitNorms="false" termVectors="false" multiValued="false">
<analyzer type="index">
<tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"
outputUnigrams="false" outputUnigramsIfNoShingles="true" tokenSeparator=" "/>
<!-- <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true" enablePositionIncrements="false"/>-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
</types>

Since you have spaces in the text string make sure to surround it with double quotes like so:
q=thd_keyphrase:"find any social media"
Also, do you mean to tokenize the field on comma?

Improper results from searching on a URL in solr

I must be doing something wrong trying to run the following search
http://localhost:8983/solr/collection1/select?q=url:www.abc.com&wt=xml&indent=true
It is not giving this sites results back, it's giving everything back. The schema.xml is pretty vanilla in how url is set up.
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="url" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"/>
</analyzer>
</fieldType>
If I use host:www.abc.com, it works.
Why the seemingly incorrect results when using the url field?
Thanks for any and all help.

Assuming that you are on Solr 3.1 or greater.
StandardTokenizerFactory - It creates token based on Word Boundary rules. This means URLs will be broken into multiple tokens and match on any one of them would be considered a hit.
Try using KeywordTokenizerFactory, for your url fieldtype. This should preserve the complete URL and match against it only.

In addition to using KeywordTokenizerFactory, you will have to remove the WordDelimiterFilterFactory. WDF splits tokens on punctuation and other delimiters ... which are very plentiful in URLs. You'll have to rebuild your index after making the change and restarting Solr or reloading the core.
An alternate idea, if you don't need to force URLs to lowercase: Switch from TextField to StrField and get rid of the analyzer config entirely.

Rails sunspot-solr - words with hyphen

I'm using the sunspot_rails gem and everything is working perfect so far but: I'm not getting any search results for words with a hyphen.
Example:
The string "tron" returns a lot of results(the word mentioned in all articles is e-tron)
The string "e-tron" returns 0 results even though this is the correct word mentioned in all my articles.
My current schema.xml config:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
What I want: The behaviour for the search string tron is okay of course, but I also want to have the correct matches for the search string e-tron.

The problem is that solr.StandardTokenizerFactory is splitting words by hyphens so "e-tron" generates the tokens "e", "tron". Presumably "e" is lost as solr.TextField filters with a minimum token size of 2.
This is one example that would show your specific problem.
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
solr.WhitespaceTokenizerFactory will generate tokens on whitespace. ["e-tron"]
solr.WordDelimiterFilterFactory will split on hyphens but also preserve the original word. ["e", "tron", "e-tron"]

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Finding singular versions of a word in Sunspot/Solr - ruby-on-rails

Related

solr sunspot unicode? characters

give importance to documents which contains the word proximity + solr + sunspot

Solr ShingleFilterFactory in query analysis not worikng

Improper results from searching on a URL in solr

Rails sunspot-solr - words with hyphen

Categories

Resources