Solr search with non-standard ASCII characters

A model indexes the string "Ordoñez" as:
text :lastname
It is then searched with:
User.solr_search do
keywords 'Ordonez'
end
This returns 0 results.
How can I index the string "Ordoñez" with Solr and get a match when the search is performed with either
keywords 'Ordonez' or keywords 'Ordoñez'?
I tried the ASCIIFoldingFilter at index time, but that did not do the job.
Here's what I did to try to make this work.

You probably need to configure character handling on the servlet container side as well.
See "Why don't International Characters Work?" in the Solr FAQ.

My problem turned out to be these three fields, which happened to be unused:
<field name="firstname_text" type="textgen" stored="false" multiValued="true" indexed="true"/>
<field name="lastname_text" type="textgen" stored="false" multiValued="true" indexed="true"/>
<field name="specialty_text" type="textgen" stored="false" multiValued="true" indexed="true"/>
I am not sure why, but as soon as I removed them, the ASCII folding filter started working.
The ASCIIFoldingFilterFactory does do the job:
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.SynonymFilterFactory"/>
</analyzer>
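To see why folding makes both spellings match, here is a minimal Ruby sketch that approximates accent folding with the standard library. It is only an illustration of the idea, not Solr's ASCIIFoldingFilter implementation, and the helper name fold_accents is made up for this example. The point is that the same folding has to happen to the indexed value and to the query.

```ruby
# Approximate accent folding using only Ruby's standard library.
# Illustration only: this is not how Solr's ASCIIFoldingFilter is implemented.
def fold_accents(text)
  # NFD splits "ñ" into "n" plus a combining tilde; \p{Mn} then drops the mark.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# The indexed value and both query spellings go through the same folding.
indexed_term = fold_accents('Ordoñez').downcase
query_terms  = ['Ordonez', 'Ordoñez'].map { |q| fold_accents(q).downcase }

puts indexed_term
puts query_terms.all? { |q| q == indexed_term }
```

If the folding were applied on only one side (say, the query but not the index), the two forms would no longer be equal, which is the usual reason the filter appears "not to work".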

Related

Configuration of Solr 5 (Umlaute, special characters, and string length)

Been running Solr 5 on a German site for the past few months and dealing with Umlaute appears to be a nightmare. I am not a specialist in Solr and on all other projects I am running elastic. It is a bit of an uphill battle to find your way through Solr documentation.
I am wondering if the following two things can be easily configured via schema.xml:
1.) UMLAUTE and Special characters
Special characters are stored in the Database in HTML code. For example:
"an einer Außenwand. Eine Brandschutztür sorgt für maximale Sicherheit."
Now Solr does not know how to deal with this in any way. So if a user searches for "für", nothing comes up. I also tried searching for "für" and for "fr"; neither returns the expected result.
The same happens if I type in "Regelungs-App": nothing comes up, but if I enter "Regelungs App" I get hits. Why does a simple dash throw Solr off track? And what setting can I change to make it ignore the dash?
2.) Length of Search string
If I search for a string within indexed content, the match seems to be limited to a certain number of characters. For example:
"Erreicht als einziger Staubemissionen" - no results
"als einziger Staubemissionen" - no results
"einziger Staubemissionen" - correct results
"Staubemissionen" - correct result
How can I set this?
My current schema.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!--
This is the Solr schema file. This file should be named "schema.xml" and
should be in the conf directory under the solr home
(i.e. ./solr/conf/schema.xml by default)
or located where the classloader for the Solr webapp can find it.
This example schema is the recommended starting point for users.
It should be kept correct and concise, usable out-of-the-box.
For more information, on how to customize this file, please see
http://wiki.apache.org/solr/SchemaXml
PERFORMANCE NOTE: this schema includes many optional features and should not
be used for benchmarking. To improve performance one could
- set stored="false" for all fields possible (esp large fields) when you
only need to search on the field but don't need to return the original
value.
- set indexed="false" if you don't need to search on the field, but only
return the field as a result of searching on other indexed fields.
- remove all unneeded copyField statements
- for best index size and searching performance, set "index" to false
for all general text fields, use copyField to copy them to the
catchall "text" field, and use that for searching.
- For maximum indexing performance, use the StreamingUpdateSolrServer
java client.
- Remember to run the JVM in server mode, and use a higher logging level
that avoids logging every request
-->
<schema name="sunspot" version="1.0">
<types>
<!-- field type definitions. The "name" attribute is
just a label to be used by field definitions. The "class"
attribute and any other attributes determine the real
behavior of the fieldType.
Class names starting with "solr" refer to java classes in the
org.apache.solr.analysis package.
-->
<!-- *** This fieldType is used by Sunspot! *** -->
<fieldType name="string" class="solr.StrField" omitNorms="true"/>
<!-- *** This fieldType is used by Sunspot! *** -->
<fieldType name="tdouble" class="solr.TrieDoubleField" omitNorms="true"/>
<!-- *** This fieldType is used by Sunspot! *** -->
<fieldType name="rand" class="solr.RandomSortField" omitNorms="true"/>
<!-- *** This fieldType is used by Sunspot! *** -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" splitOnNumerics="1" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
</analyzer>
</fieldType>
<!-- *** This fieldType is used by Sunspot! *** -->
<fieldType name="boolean" class="solr.BoolField" omitNorms="true"/>
<!-- *** This fieldType is used by Sunspot! *** -->
<fieldType name="tint" class="solr.TrieIntField" omitNorms="true"/>
<!-- *** This fieldType is used by Sunspot! *** -->
<fieldType name="tlong" class="solr.TrieLongField" omitNorms="true"/>
<!-- *** This fieldType is used by Sunspot! *** -->
<fieldType name="tfloat" class="solr.TrieFloatField" omitNorms="true"/>
<!-- *** This fieldType is used by Sunspot! *** -->
<fieldType name="tdate" class="solr.TrieDateField"
omitNorms="true"/>
<fieldType name="daterange" class="solr.DateRangeField" omitNorms="true" />
<!-- Special field type for spell correction. Be careful about
adding filters here, as they apply *before* your values go in
the spellcheck. For example, the lowercase filter here means
all spelling suggestions will be lower case (without it,
though, you'd have duplicate suggestions for lower and proper
cased words). -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->
<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
</types>
<fields>
<!-- Valid attributes for fields:
name: mandatory - the name for the field
type: mandatory - the name of a previously defined type from the
<types> section
indexed: true if this field should be indexed (searchable or sortable)
stored: true if this field should be retrievable
compressed: [false] if this field should be stored using gzip compression
(this will only apply if the field type is compressable; among
the standard field types, only TextField and StrField are)
multiValued: true if this field may contain multiple values per document
omitNorms: (expert) set to true to omit the norms associated with
this field (this disables length normalization and index-time
boosting for the field, and saves some memory). Only full-text
fields or fields that need an index-time boost need norms.
termVectors: [false] set to true to store the term vector for a
given field.
When using MoreLikeThis, fields used for similarity should be
stored for best performance.
termPositions: Store position information with the term vector.
This will increase storage costs.
termOffsets: Store offset information with the term vector. This
will increase storage costs.
default: a value that should be used if no value is specified
when adding a document.
-->
<!-- *** This field is used by Sunspot! *** -->
<field name="id" stored="true" type="string" multiValued="false" indexed="true"/>
<!-- *** This field is used by Sunspot! *** -->
<field name="type" stored="false" type="string" multiValued="true" indexed="true"/>
<!-- *** This field is used by Sunspot! *** -->
<field name="class_name" stored="false" type="string" multiValued="false" indexed="true"/>
<!-- *** This field is used by Sunspot! *** -->
<field name="text" stored="false" type="string" multiValued="true" indexed="true"/>
<!-- *** This field is used by Sunspot! *** -->
<field name="lat" stored="true" type="tdouble" multiValued="false" indexed="true"/>
<!-- *** This field is used by Sunspot! *** -->
<field name="lng" stored="true" type="tdouble" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="random_*" stored="false" type="rand" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="_local*" stored="false" type="tdouble" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_text" stored="false" type="text" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_texts" stored="true" type="text" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_b" stored="false" type="boolean" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_bm" stored="false" type="boolean" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_bs" stored="true" type="boolean" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_bms" stored="true" type="boolean" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_d" stored="false" type="tdate" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_dm" stored="false" type="tdate" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_ds" stored="true" type="tdate" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_dms" stored="true" type="tdate" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_e" stored="false" type="tdouble" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_em" stored="false" type="tdouble" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_es" stored="true" type="tdouble" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_ems" stored="true" type="tdouble" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_f" stored="false" type="tfloat" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_fm" stored="false" type="tfloat" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_fs" stored="true" type="tfloat" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_fms" stored="true" type="tfloat" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_i" stored="false" type="tint" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_im" stored="false" type="tint" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_is" stored="true" type="tint" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_ims" stored="true" type="tint" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_l" stored="false" type="tlong" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_lm" stored="false" type="tlong" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_ls" stored="true" type="tlong" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_lms" stored="true" type="tlong" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_s" stored="false" type="string" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_sm" stored="false" type="string" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_ss" stored="true" type="string" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_sms" stored="true" type="string" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_it" stored="false" type="tint" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_itm" stored="false" type="tint" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_its" stored="true" type="tint" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_itms" stored="true" type="tint" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_ft" stored="false" type="tfloat" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_ftm" stored="false" type="tfloat" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_fts" stored="true" type="tfloat" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_ftms" stored="true" type="tfloat" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_dt" stored="false" type="tdate" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_dtm" stored="false" type="tdate" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_dts" stored="true" type="tdate" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_dtms" stored="true" type="tdate" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_textv" stored="false" termVectors="true" type="text" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_textsv" stored="true" termVectors="true" type="text" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_et" stored="false" termVectors="true" type="tdouble" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_etm" stored="false" termVectors="true" type="tdouble" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_ets" stored="true" termVectors="true" type="tdouble" multiValued="false" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_etms" stored="true" termVectors="true" type="tdouble" multiValued="true" indexed="true"/>
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_dr" stored="false" type="daterange" multiValued="false" indexed="true" />
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_drm" stored="false" type="daterange" multiValued="true" indexed="true" />
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_drs" stored="true" type="daterange" multiValued="false" indexed="true" />
<!-- *** This dynamicField is used by Sunspot! *** -->
<dynamicField name="*_drms" stored="true" type="daterange" multiValued="true" indexed="true" />
<!-- Type used to index the lat and lon components for the "location" FieldType -->
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false" multiValued="false"/>
<dynamicField name="*_p" type="location" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="*_ll" stored="false" type="location" multiValued="false" indexed="true"/>
<dynamicField name="*_llm" stored="false" type="location" multiValued="true" indexed="true"/>
<dynamicField name="*_lls" stored="true" type="location" multiValued="false" indexed="true"/>
<dynamicField name="*_llms" stored="true" type="location" multiValued="true" indexed="true"/>
<field name="textSpell" stored="false" type="textSpell" multiValued="true" indexed="true"/>
<!-- required by Solr 4 -->
<field name="_version_" type="string" indexed="true" stored="true" multiValued="false" />
</fields>
<!-- Field to use to determine and enforce document uniqueness.
Unless this field is marked with required="false", it will be a required field
-->
<uniqueKey>id</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>text</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="AND"/>
<!-- copyField commands copy one field to another at the time a document
is added to the index. It's used either to index the same field differently,
or to add multiple fields to the same field for easier/faster
searching. -->
<!-- Use copyField to copy the fields you want to run spell checking
on into one field. For example: -->
<copyField source="*_text" dest="textSpell" />
<copyField source="*_s" dest="textSpell" />
</schema>

You don't say which field type you're searching, but if it is the "text" type, the analysis chain looks unsuitable for what you're trying to do. Indexing with the NGramTokenizer plus lowercasing only will not give the results you expect when the query side uses the StandardTokenizer with stemming and word delimiting.
Create a new field type with a simpler definition (probably the same for both index and query for now) that consists of just a whitespace tokenizer or another, more standard tokenizer; see the reference manual for examples of the differences. You'll probably want a lowercase filter as well.
You might still run into issues with umlauts and other German-specific terms, but the ICU* range of filters and tokenizers is more international than the others. There's also a filter for splitting compound words into their components (German has the same issue as Norwegian, where words are written together instead of being split up as in English).
The "Analysis" page under the Solr Admin is a great place to start debugging this - it'll show you exactly which transformations are made both on the index and query side, allowing you to see why terms don't match and what the terms look like at each step.
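One consequence of the mismatch described above can be sketched in a few lines of Ruby. This is a simplified stand-in for index-side character n-grams with minGramSize="3" and maxGramSize="15", not Solr's code, and the sample sentence is a hypothetical one chosen for the illustration. A query term only matches if it is exactly equal to one of the indexed grams, so any query token longer than 15 characters (or altered by query-side stemming) has nothing to match.

```ruby
# Simplified stand-in for an index-side character n-gram tokenizer followed
# by lowercasing. It only mirrors the minGramSize/maxGramSize idea.
def char_ngrams(text, min, max)
  chars = text.downcase.chars
  grams = []
  (min..max).each do |size|
    # each_cons slides a window of `size` characters over the whole input
    chars.each_cons(size) { |window| grams << window.join }
  end
  grams.uniq
end

# Hypothetical indexed text, used only for this illustration.
indexed = char_ngrams('Neue Sicherheitsvorschriften für Türen', 3, 15)

long_word = 'sicherheitsvorschriften'          # 23 characters

puts indexed.include?('für')                   # short token: present as a gram
puts indexed.include?(long_word[0, 15])        # 15-character prefix: present
puts indexed.include?(long_word)               # whole word: longer than any gram
```

The Analysis page will show the same effect on the real chain, including what the query-side stemmer and word delimiter do to each term.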

For Latin characters with accents, use
<filter class="solr.ASCIIFoldingFilterFactory"/>
This stores all words in their ASCII form with the accents removed.
Your text type looks pretty unbalanced on the query side. Try something like this to start with, then reindex your data (I'm currently working on a French site). I don't think the PorterStemFilterFactory is good for German; there are other stemmers that work better:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="40" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>

give importance to documents which contains the word proximity + solr + sunspot

I am working on a Rails application that is based on the Apache Solr search engine, and we are using the Sunspot gem. I am facing one problem: if I search for house rent, I get thousands of results using an AND query, but the results are not relevant.
I expect documents in which the words house and rent appear near each other to come out on top. Instead, the documents that contain the most occurrences of house and rent come out on top, with no regard for word proximity.
My schema.xml contains following definition:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,\.;\(\)]+"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
What changes are needed to achieve this? Do I need to add any filters?

You can try this
<fieldType name="shingleString" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99"/>
<filter class="solr.PositionFilterFactory" />
</analyzer>
</fieldType>

Use phrase fields (pf) and boost them, or try a proximity query such as "house rent"~5.
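As a rough sketch of the phrase-field suggestion, the following Ruby builds the request parameters for a (e)dismax query by hand. The parameters defType, qf, pf and ps are standard Solr (e)dismax parameters; the field name description_text and the boost value are assumptions made for this example, so substitute whatever your Sunspot model actually indexes.

```ruby
require 'uri'

# Hypothetical field name; replace it with the field your model indexes.
FIELD = 'description_text'

params = {
  'defType' => 'edismax',
  'q'       => 'house rent',
  'qf'      => FIELD,            # field searched for the individual terms
  'pf'      => "#{FIELD}^10",    # extra boost when the terms match as a phrase
  'ps'      => 5                 # slop allowed for the pf phrase match
}

# Encode the parameters as they would appear after /select? in the request URL.
query_string = URI.encode_www_form(params)
puts query_string
```

Documents that merely contain both words still match through qf; the pf clause only adds score when the words occur close together, which is the ordering asked for in the question.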

Improper results from searching on a URL in solr

I must be doing something wrong trying to run the following search
http://localhost:8983/solr/collection1/select?q=url:www.abc.com&wt=xml&indent=true
It is not giving back just this site's results; it's giving everything back. The schema.xml is pretty vanilla in how url is set up.
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="url" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"/>
</analyzer>
</fieldType>
If I use host:www.abc.com, it works.
Why the seemingly incorrect results when using the url field?
Thanks for any and all help.

Assuming that you are on Solr 3.1 or greater.
StandardTokenizerFactory creates tokens based on word boundary rules. This means URLs will be broken into multiple tokens, and a match on any one of them is considered a hit.
Try using KeywordTokenizerFactory for your url field type. This should preserve the complete URL and match only against it.

In addition to using KeywordTokenizerFactory, you will have to remove the WordDelimiterFilterFactory. WDF splits tokens on punctuation and other delimiters ... which are very plentiful in URLs. You'll have to rebuild your index after making the change and restarting Solr or reloading the core.
An alternate idea, if you don't need to force URLs to lowercase: Switch from TextField to StrField and get rid of the analyzer config entirely.
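The effect described in these answers can be illustrated with a toy Ruby sketch. It simply lowercases and splits on punctuation, which imitates the end result of the analysis chain rather than Solr's actual classes, and the second URL is a hypothetical document invented for the comparison.

```ruby
# Toy analysis: lowercase and split on anything that is not a letter or digit.
# This imitates the effect of splitting on punctuation, not Solr's own classes.
def split_tokens(value)
  value.downcase.split(/[^a-z0-9]+/).reject(&:empty?)
end

query_tokens = split_tokens('www.abc.com')
other_doc    = split_tokens('http://www.xyz.com/index.html') # hypothetical document

# Tokens the query and the unrelated document have in common.
shared = query_tokens & other_doc
puts shared.inspect

# Keyword-style handling keeps each value as a single token, so only an
# identical URL can match.
puts 'www.abc.com'.downcase == 'www.xyz.com'.downcase
```

Because almost every URL contains tokens such as "www" and "com", a query that is split this way overlaps with nearly every document, which is consistent with "it's giving everything back".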

While searching with sunspot_rails on Solr, how can I boost whole word matching over partial word matching?

I am using sunspot_rails to submit queries to a Solr instance. Everything works OK, but I want to order my results by the following criterion: documents where the matching term appears as a whole word should come before documents where it appears only as part of a word.
Hence, if I have the two documents:
1) Solr searching with Solr is fantastic
and
2) Solr is very good to support search with free text
and the term I am looking for is : search, then
I want to take both documents in the results, but I want document (2) to appear first.
I have tried order_by :score, :desc but it does not seem to be working. Unless I find a way to tell how the "score" is calculated.
Thanks in advance
Panayotis

You would need to maintain two fields in Solr:
one with the original value and another with the analyzed value, e.g. text_org and text (which is analyzed).
Then you can adjust the boosts accordingly, boosting the original field over the analyzed one, e.g. text_org^2 text^1.
Remember that if a term matches the original field, it will also match the analyzed field, so an exact whole-word match scores higher than a partial match.

Expanding on Jayendra's answer a bit, you should index into two separate fields.
Here's an example schema.xml excerpt for Sunspot, from my answer to an earlier question: How to boost longer ngrams in solr?
<schema>
<types>
<!--
A text type with minimal text processing, for the greatest semantic
value in a term match. Boost this field heavily.
-->
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StandardFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<!--
Looser matches with NGram processing for substrings of terms and synonyms
-->
<fieldType name="text_ngram" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StandardFilterFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="6" side="front" />
</analyzer>
</fieldType>
<!-- other stuff -->
</types>
<fields>
<!-- other fields; refer to *_text -->
<dynamicField name="*_ngram" type="text_ngram" ... />
</fields>
</schema>
In your searchable block, you can use the :as option to specify the fieldname:
searchable do
text :title
text :title, :as => :title_ngram
# ...
end
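To make the two-field idea concrete, here is a toy Ruby model of it applied to the two documents from the question. It is not Lucene's scoring: each field simply contributes its boost once if it matches, with the boosts 2 and 1 taken from the text_org^2 text^1 example above and the edge n-gram sizes 1 to 6 taken from the schema excerpt. The helper names are made up for this sketch.

```ruby
# Split text into lowercase word tokens.
def tokens(text)
  text.downcase.split(/\W+/).reject(&:empty?)
end

# Front edge n-grams of a token, e.g. "s", "se", "sea", ... up to max characters.
def edge_ngrams(token, min, max)
  upper = [max, token.length].min
  (min..upper).map { |n| token[0, n] }
end

# Toy score: the exact field and the n-gram field each add their boost on a match.
def score(doc, term, exact_boost: 2, ngram_boost: 1)
  words   = tokens(doc)
  exact   = words.include?(term) ? exact_boost : 0
  partial = words.any? { |w| edge_ngrams(w, 1, 6).include?(term) } ? ngram_boost : 0
  exact + partial
end

doc1 = 'Solr searching with Solr is fantastic'
doc2 = 'Solr is very good to support search with free text'

puts score(doc1, 'search')
puts score(doc2, 'search')
```

Document 2 matches in both fields while document 1 matches only through the n-gram field, so document 2 ranks first.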

Finding singular versions of a word in Sunspot/Solr

I have a Rails+Sunspot application and I'm working on configuring it so that searching also matches the singular version of the query. For instance:
I want a search for "cookies" to return something named "cookie". Currently my Sunspot search returns "cookies" but not "cookie" (singular).
I've made some customizations to Solr's schema.xml, adding solr.EdgeNGramFilterFactory to provide more flexibility, but EdgeNGramFilterFactory doesn't suit this case, as it only allows matches when the query is a leading substring (prefix) of the result's name. My understanding is that EdgeNGramFilterFactory will return "cookie" when the user searches for "co", "coo", "cook" or "cooki", but not for a superstring of "cookie" (i.e. "cookies"). Simply put, this is because "cookies" is not a substring of "cookie".
I've tried adding all three of Solr's built-in stemming factories, but to no avail. You can see one commented out in my schema.
In schema.xml, the relevant field looks as follows:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
<!-- <filter class="solr.EnglishMinimalStemFilterFactory"/> -->
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I suppose I could singularize the user's query, but I would rather not touch the query before it hits Solr.
You can play with this here: http://staging.zisboombah.com/parent/food_guide/?search=cookie. Try changing the query between "cookie" and "cookies".
Any tips on how to do this in Solr would be greatly appreciated!

The filters in Solr's XML config are applied in the order they are listed. You want the stemmer to come before the ngram filter, so that you ngram-ize cooki, rather than stemming c, co, etc.
Combining filters in this way may lead to some odd results, mostly depending on how aggressive your stemmer is. You should definitely add the stemmer to the query analyzer, but that will mess with your autocomplete.
A better solution: use a copyField to make independent text_stemmed and text_autocomplete fields. Then search using an OR query over both fields.
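The ordering argument can be checked with a small Ruby sketch. The toy_stem function is a deliberately tiny stand-in that only handles the "cookie"/"cookies" endings; it is not the Porter stemmer or any Solr filter. The edge n-gram sizes 2 to 15 come from the schema in the question.

```ruby
# Toy stand-in for a stemmer: handles only "cookie"/"cookies" style endings.
def toy_stem(word)
  word.sub(/ies?\z/, 'i')
end

# Front edge n-grams of a token, from min up to max characters.
def edge_ngrams(token, min, max)
  upper = [max, token.length].min
  (min..upper).map { |n| token[0, n] }
end

# Index side: stem first, then build the edge n-grams of the stem.
index_terms = edge_ngrams(toy_stem('cookie'), 2, 15)
puts index_terms.inspect

puts index_terms.include?(toy_stem('cookies')) # query stemmed the same way
puts index_terms.include?('cookies')           # query left unstemmed
```

The stemmed query lands on one of the indexed grams, while the unstemmed plural does not, which is why the stemmer needs to be present in the query analyzer as well.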

Like Kyle mentions, you probably want to use more text field types for each of these different use cases.
Here's an example of mine:
schema.xml
<schema>
<types>
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_stopwords" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
<!-- ... -->
</types>
<fields>
<!-- ... -->
</fields>
<copyField source="*_text" dest="text"/>
<copyField source="*_texts" dest="text"/>
<copyField source="*_textsv" dest="text"/>
<copyField source="*_textv" dest="text"/>
</schema>
Sunspot modeling
Using the copyField directive can save some setup work in the model. However, Sunspot uses those text declarations to decide which fields to keywords-search by default, so I like to include distinct text invocations that use :as to specify the full Solr document field name.
searchable do
text :name, stored: true, default_boost: 10
text :name, as: 'name_text_en'
text :description, stored: true
end
