Getting fuzzy searching to work for Sunspot? - ruby-on-rails

I have the following two products in my database / Solr index: Total War: Shogun 2 [Download] and Eggs.
What I want the search to be able to do is match these two products despite mistakes, e.g.:
"Egggs", "Eggz", "Eg", "Egs" and "Shogn Download", "Totle War", "Tutal War: Shogunn 2 Download", etc.
EDIT (working somewhat):
This will get you started, though I'm still having issues with certain characters inside a search, i.e. only things like "Eggs" and "Great Value Vitamin D Whole Milk" can be misspelled, not "Total War: Shogun 2".
New code:
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" splitOnNumerics="1" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" splitOnNumerics="1" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
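To see why the index-time EdgeNGramFilterFactory makes truncated queries like "Eg" match, here is a rough plain-Ruby illustration (not Solr code) of the front edge n-grams it emits for a token:

```ruby
# Sketch of what EdgeNGramFilterFactory (side="front") produces per token:
# every prefix from minGramSize up to maxGramSize characters long.
def edge_ngrams(token, min: 1, max: 50)
  (min..[max, token.length].min).map { |n| token[0, n] }
end

edge_ngrams("eggs")  # => ["e", "eg", "egg", "eggs"]
```

Because "eg" is indexed as one of these grams, the misspelled or truncated query still finds the document.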
The ideal is to have my search behave like Google's, where it does a pretty good job of correcting your spelling, whether lowercase or uppercase, and with a couple of errors. How would I make my search similar to what Google does?

Fuzzy searches do not undergo query-time analysis,
so there is a chance that your query does not match the index terms.
The terms in the above config undergo lowercase filtering during indexing, which stores all the terms in lower case.
Searching for Egggs would then never produce any results, as Egggs would not match eggs.
The searched terms need to be lowercased explicitly.
Also, in the above config the index-time analysis is very different from the query-time analysis.
It's usually recommended to have similar filters at query and index time, so that the indexed terms match the searched terms.
solr.PorterStemFilterFactory may produce a completely different root for the searched term, which may never match the indexed terms.
Revisit your configuration; maybe check the example Solr schema.xml for reference.
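A minimal sketch of the explicit lowercasing suggested above, in plain Ruby (the trailing `~` is Lucene's fuzzy operator; the helper name is hypothetical, and in a Sunspot app you would apply the same normalization before calling `fulltext`):

```ruby
# Lowercase each term before appending Lucene's fuzzy operator, because
# fuzzy terms bypass query-time analysis (including LowerCaseFilterFactory).
def fuzzy_query(raw)
  raw.split(/\s+/).map { |t| "#{t.downcase}~" }.join(" ")
end

fuzzy_query("Egggs")      # => "egggs~"
fuzzy_query("Totle War")  # => "totle~ war~"
```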

Related

Elasticsearch/ Searchkick gem - boosting fields do not return results with special characters (e.g. apostrophes)

We're using the searchkick gem in our app and have many documents with fields that contain special characters such as apostrophes, e.g. an offer with the title Valentine's Day Special.
Without boosters, a search for Valentines or Valentine's or Valentine would return the correct search results:
Activity.search "Valentines"
However, when a boost on the title field is incorporated, a search with any of the above queries will not return the Valentine's Day Special result.
Activity.search "Valentines", fields: ["title^10"]
I've been trying to troubleshoot through the Elasticsearch/Searchkick documentation but haven't found a solution yet. Has anyone else encountered this problem?
Resolved this with a partial word match workaround:
Offer.search("Valentine's", fields: ["name"], match: :word_middle)
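Note that `match: :word_middle` only works if the field was indexed with word-middle data; a hedged sketch of the model-side declaration (assuming the searchkick gem's `word_middle` option):

```ruby
class Offer < ApplicationRecord
  # Index partial (middle-of-word) matches for name, so searches like
  # "Valentine" or "Valentines" can hit "Valentine's Day Special".
  searchkick word_middle: [:name]
end
```

After changing the searchkick options, the index has to be rebuilt (e.g. `Offer.reindex`) for the new matching to take effect.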

Solr: Perform stemming on a field and get the sorted list of stemmed words which were most frequent

Is there a way that I can use stemming on a field at index time and then retrieve a sorted list of stemmed words by frequency of their original occurrence at query time?
For example assume my 'text' field has contents of a document and contains only these words:
walk walking walked moved run running.
I want to use stemming on this field to get the base forms sorted by the occurrence of their original words i.e.
walk
run
move
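The desired ranking can be illustrated in plain Ruby (no Solr; the stem table is hand-written to stand in for what PorterStemFilterFactory would produce):

```ruby
# Count original occurrences per stem, then list stems by descending count.
def stem_ranking(words, stems)
  counts = words.group_by { |w| stems[w] }.transform_values(&:size)
  counts.sort_by { |_, c| -c }.map(&:first)
end

# Hand-written stand-in for a Porter-style stemmer.
STEMS = {
  "walk" => "walk", "walking" => "walk", "walked" => "walk",
  "run"  => "run",  "running" => "run",
  "moved" => "move"
}

stem_ranking(%w[walk walking walked moved run running], STEMS)
# => ["walk", "run", "move"]
```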
My understanding is that Solr uses stemming to reduce walk, walking and walked to the one base form walk, and then stores it in the index. I am not interested in retrieving the counts, just the list of words. Does Solr keep track of such word counts at index time? Here is my configuration:
My schema.xml has the text field:
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true" />
and
The field type 'text_general' is defined as:
<fieldType class="solr.TextField" name="text_general" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Thanks for the help.

How to use AngularDart ng-repeat over a list of simple values possibly with duplicates

The AngularDart ng-repeat directive seems to require unique values; e.g., the following
<li ng-repeat="x in ['Alice', 'Bob', 'Alice']">...</li>
results in
[NgErr50] ngRepeat error! Duplicates in a repeater are not allowed.
Use 'track by' expression to specify unique keys.
Assuming that the list of strings is obtained from some external source and that uniqueness of values is not guaranteed, how can the [NgErr50] error be avoided?
This works:
<li ng-repeat="x in ['Alice', 'Bob', 'Alice'] track by $index">...</li>
For more options, see the ng-repeat API docs.

How To extract the values from csv file in ant?

My Name.csv file contains:
Washington
New York
New Delhi
India
United States Of America
In Ant I want to extract all the values as separate words, like washington, new, york, new, delhi, india, united, states, of, america. Although I am able to extract them line-wise with:
<loadfile property="message" srcFile="../Ant_Scripts/Name.csv"/>
<target name="init">
  <for list="${message}" delimiter="${line.separator}" param="val">
    <sequential>
      <echo message="${val}"/>
    </sequential>
  </for>
</target>
but I am not able to extract them as individual units; that is, once I get New Delhi or New York, I should also be able to get New and Delhi separately.
Can you please post your Ant script? – Satya
<loadfile property="message" srcFile="../Ant_Scripts/Name.csv"/>
<target name="init">
  <for list="${message}" delimiter="${line.separator}" param="val">
    <sequential>
      <echo>${val}</echo>
    </sequential>
  </for>
</target>
This code will print all the names line by line, but after this I want to break those lines on the basis of space.
There is one fundamental error (see http://dailyraaga.wordpress.com/2010/12/21/ant-for-loop/):
you have to access the loop parameter with @{param}, not ${param}. This loop uses attributes, not properties ;)
There is also one task you might find handy for your line-separation needs:
http://ant.apache.org/manual/Tasks/fixcrlf.html
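To split each line on spaces as well, a hedged sketch of a nested loop (still assuming ant-contrib's <for> task, as in the question; note the @{...} attribute syntax inside <sequential>):

```xml
<!-- Outer loop iterates lines, inner loop splits each line on spaces. -->
<for list="${message}" delimiter="${line.separator}" param="line">
  <sequential>
    <for list="@{line}" delimiter=" " param="word">
      <sequential>
        <echo message="@{word}"/>
      </sequential>
    </for>
  </sequential>
</for>
```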

ant - compare two lists

How do I compare two lists in Ant? Basically I am getting all the message_ids from the database for all my ids, and I want to compare them to the same ids after some messages in the database are deleted.
A very simple approach: assuming the lists are written to files and are exactly identical in number of lines, characters, whitespace, etc., this would yield a match; otherwise the property will not be set.
<condition property="comp" value="the files match">
<filesmatch file1="a.txt" file2="b.txt"/>
</condition>
<echo> !!!! </echo>
<echo> ${comp}</echo>
<echo> !!!! </echo>
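As a usage sketch, the same <filesmatch> condition can also fail the build outright when the two lists differ (a.txt and b.txt are placeholder names for the exported id lists):

```xml
<!-- Fail the build when the two exported id lists are not identical. -->
<target name="compare-ids">
  <fail message="message_id lists differ">
    <condition>
      <not>
        <filesmatch file1="a.txt" file2="b.txt"/>
      </not>
    </condition>
  </fail>
</target>
```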
