Improper results from searching on a URL in solr

Improper results from searching on a URL in solr - url

I must be doing something wrong trying to run the following search
http://localhost:8983/solr/collection1/select?q=url:www.abc.com&wt=xml&indent=true
It is not giving this sites results back, it's giving everything back. The schema.xml is pretty vanilla in how url is set up.
<fieldType name="text" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="url" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"/>
</analyzer>
</fieldType>
If I use host:www.abc.com, it works.
Why the seemingly incorrect results when using the url field?
Thanks for any and all help.

Assuming that you are on Solr 3.1 or greater.
StandardTokenizerFactory - It creates token based on Word Boundary rules. This means URLs will be broken into multiple tokens and match on any one of them would be considered a hit.
Try using KeywordTokenizerFactory, for your url fieldtype. This should preserve the complete URL and match against it only.

In addition to using KeywordTokenizerFactory, you will have to remove the WordDelimiterFilterFactory. WDF splits tokens on punctuation and other delimiters ... which are very plentiful in URLs. You'll have to rebuild your index after making the change and restarting Solr or reloading the core.
An alternate idea, if you don't need to force URLs to lowercase: Switch from TextField to StrField and get rid of the analyzer config entirely.

Related

solr sunspot unicode? characters

How do I use this file provided in sunspot (mapping-ISOLatin1Accent.txt)(or is this the one I need as well)? I need to be able to search for "las pinas" and include results like "las piñas" in my database. Meaning n => ñ? I have my config schema.xml like this for now:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false" tokenizerFactory="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
and I have tried moving the <charFilter> setting around as well.
I have also searched and found various solutions mostly pointing to this or this articles but those don't seem to work either.

give importance to documents which contains the word proximity + solr + sunspot

I am working on rails application and which is based on Apache Solr search engine and we are using Sunspot gem. But I am facing one problem, If I search query house rent then its giving me thousands of results by using and query. But the results what I am getting are not relevant.
I am expecting the documents which contains the house and rent words near to each other, those documents should come on top. But for now the documents which contains more number of house and rent documents are coming on top. But there is no any word proximity.
My schema.xml contains following definition:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,\.;\(\)]+"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
To achieve this what changes are need to do? or any filter are necessary to add for this?

You can try this
<fieldType name="shingleString" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" outputUnigrams="true" outputUnigramIfNoNgram="true" maxShingleSize="99"/>
<filter class="solr.PositionFilterFactory" />
</analyzer>
</fieldType>

Use phrase fields and boost them or you can try terms boosting like "house rent"~5

Sunspot - How to use a regex with sunspot SOLR? [duplicate]

any help is always welcome
I am using sunspot with solr but not able to find any good solution that how to perform wildcard search with sunspot
if i search for 8088***
it should return all numbers starts with 8088 but not 228088560

Look for the following lines of code in /solr/conf/schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
...
</fieldType>
and replace them with this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Remember to restart the solr server, and reindex after these changes
rake sunspot:solr:stop
rake sunspot:solr:start
rake sunspot:reindex

Sunspot gives you wildcard for free* with NGramToeknizer(there are sometimes NGramTokenizer issues for subsets that are too small and other quirks), which means that exclusion is actually the tricky part. If you know the number of digits in the number (say 6), a crude, but effective, way to handle this would be to use without (:field).greater_than(808900) without (:field).less_than(808700) <-- I don't remember whether .greater_than and .less_than are actually => and =< , so if they are just > and < you may want to do 808899 and 808800 instead, but you get the idea.
**Correction There is a solution for this: you can change the NGramFilterFactory in your solr/config/schema.xml to an EdgeNGramFilterFactory (assuming you had an NGramFilterFactory in the first place to get the partial-word seaching). This makes the index only break up words starting at the beginning of strings. After this, restart your server and reindex.
***All credit to Zach Moazeni at Collective Idea for this

Finding singular versions of a word in Sunspot/Solr

I have a Rails+Sunspot application and I'm working on configuring it so that searching returns the singluar version of the query. For instance:
I want a search for "cookies" to return something named "cookie". Currently my Sunspot search returns "cookies" but not "cookie" (singluar).
I've made some customizations to Solr's schema.xml, adding solr.EdgeNGramFilterFactory to provide more flexibility but EdgeNGramFilterFactory doesn't suite this case as it only allows matches when the query is a substring of the result's name. My understanding is EdgeNGramFilterFactory will return "cookie" when the user searches for "co", "coo", "cook" or "cooki", but not a superstring of "cookie" (ie: cookies). Simply put, this is because "cookies" is not a substring within "cookie".
I've tried adding all three of Solr's build-in stemming factories but to no avail. You can see one commented out in my schema.
In schema.xml, the relevant field looks as follows:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
<!-- <filter class="solr.EnglishMinimalStemFilterFactory"/> -->
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I supposed I could singluarize the user's query but I would rather not touch their query before it hits Solr.
You can play with this here: http://staging.zisboombah.com/parent/food_guide/?search=cookie. Try changing the query between "cookie" and "cookies".
Any tips on how to do this in Solr would be greatly appreciated!

The solr xml options are ordered. You want the stemmer to come before the ngram filter, so that you ngram-ize cooki, rather than stemming c, co, etc.
Combining filters in this way may lead to some odd results, mostly depending on how aggressive your stemmer is. You should definitely add the stemmer to the query analyzer, but that will mess with your autocomplete.
A better solution: use a copyField to make independent text_stemmed and text_autocomplete fields. Then search using an OR query over both fields.

Like Kyle mentions, you probably want to use more text field types for each of these different use cases.
Here's an example of mine:
schema.xml
<schema>
<types>
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_en" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_stopwords" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
<!-- ... -->
</types>
<fields>
<!-- ... -->
</fields>
<copyField source="*_text" dest="text"/>
<copyField source="*_texts" dest="text"/>
<copyField source="*_textsv" dest="text"/>
<copyField source="*_textv" dest="text"/>
</schema>
Sunspot modeling
Using the copyField directive can save some setup work in the model. However, Sunspot uses those text declarations to decide which fields to keywords-search by default, so I like to include distinct text invocations that use :as to specify the full Solr document field name.
searchable do
text :name, stored: true, default_boost: 10
text :name, as: 'name_text_en'
text :description, stored: true
end

how to implement wildcard search with sunspot

any help is always welcome
I am using sunspot with solr but not able to find any good solution that how to perform wildcard search with sunspot
if i search for 8088***
it should return all numbers starts with 8088 but not 228088560

Look for the following lines of code in /solr/conf/schema.xml:
<fieldType name="text" class="solr.TextField" omitNorms="false">
...
</fieldType>
and replace them with this:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Remember to restart the solr server, and reindex after these changes
rake sunspot:solr:stop
rake sunspot:solr:start
rake sunspot:reindex

Sunspot gives you wildcard for free* with NGramToeknizer(there are sometimes NGramTokenizer issues for subsets that are too small and other quirks), which means that exclusion is actually the tricky part. If you know the number of digits in the number (say 6), a crude, but effective, way to handle this would be to use without (:field).greater_than(808900) without (:field).less_than(808700) <-- I don't remember whether .greater_than and .less_than are actually => and =< , so if they are just > and < you may want to do 808899 and 808800 instead, but you get the idea.
**Correction There is a solution for this: you can change the NGramFilterFactory in your solr/config/schema.xml to an EdgeNGramFilterFactory (assuming you had an NGramFilterFactory in the first place to get the partial-word seaching). This makes the index only break up words starting at the beginning of strings. After this, restart your server and reindex.
***All credit to Zach Moazeni at Collective Idea for this

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Improper results from searching on a URL in solr - url

Related

solr sunspot unicode? characters

give importance to documents which contains the word proximity + solr + sunspot

Sunspot - How to use a regex with sunspot SOLR? [duplicate]

Finding singular versions of a word in Sunspot/Solr

how to implement wildcard search with sunspot

Categories

Resources