Indexing and Querying URLs in Solr

I have a database of URLs that I would like to search. Because URLs are not always written the same way (they may or may not have www), I am looking for the correct way to index and query URLs.
I've tried a few things, and I think I'm close, but I'm not sure why it doesn't work:
Here is my custom field type:
<fieldType name="customUrlType" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
For example:
http://www.twitter.com/AndersonCooper, when indexed, will have the following words in different positions: http, www, twitter, com, andersoncooper
If I search for simply twitter.com/andersoncooper, I would like this query to match the record that was indexed, which is why I also use the WDF to split the search query.
However, the search query ends up like so:
myfield:("twitter com andersoncooper"), when I really want it to match all records that have all of the following separate words: twitter com andersoncooper
Is there a different query filter or tokenizer I should be using?

If I understand this statement from your question
myfield:("twitter com andersoncooper") when really want it to match all records that have all of the following separate words: twitter com andersoncooper
You are trying to write a query that would match both:
http://www.twitter.com/AndersonCooper
and
http://www.andersoncooper.com/socialmedia/twitter
(both links contain all of the tokens), but not match either
http://www.facebook.com/AndersonCooper
or
http://www.twitter.com/AliceCooper
If that is correct, your existing configuration should work just fine. Assuming that you are using the standard query parser and you are querying via curl or some other URL-based mechanism, you need the query parameter to look like this:
&q=myField:andersoncooper AND myField:twitter AND myField:com
One of the gotchas that may have been tripping you up is that the default query operator (between terms in a query) is "OR", which is why the ANDs must be explicitly specified above. Alternatively, to save some space, you can change the default query operator to "AND" like this:
&q.op=AND&q=myField:(andersoncooper twitter com)
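For example, assuming a default local Solr instance and core (the host, port, and path here are hypothetical), the full request might look like:
http://localhost:8983/solr/select?q.op=AND&q=myField:(andersoncooper+twitter+com)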

This should be the simplest solution:
<field name="iconUrl" type="string" indexed="true" stored="true" />
But for your requirement you will need to make it multivalued and index each URL three ways: 1. unchanged, 2. without http://, 3. without www,
or make the URL searchable via leading wildcards (which is slower, I'd guess).
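For instance, a minimal sketch of that multivalued variant (the field name urlVariants is just an example; your indexing code would add the three forms of each URL as separate values):
<field name="urlVariants" type="string" indexed="true" stored="true" multiValued="true" />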

You can try the keyword tokenizer.
From the book Solr 1.4 Enterprise Search Server, published by Packt:
KeywordTokenizerFactory: This doesn't actually do any tokenization or anything at all for that matter! It returns the original text as one term. There are cases where you have a field that always gets one word, but you need to do some basic analysis like lowercasing. However, it is more likely that due to sorting or faceting requirements you will require an indexed field with no more than one term. Certainly a document's identifier field, if supplied and not a number, would use this.
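For the URL use case, a minimal sketch of such a field type (the name urlKeyword is just an example) might look like:
<fieldType name="urlKeyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- keep the whole URL as a single term -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- normalize case so Twitter.com and twitter.com match -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>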

Related

WCM REST API Query

I need to use the REST API to query IBM WCM 8.0 for the content stored in it.
When I use the following query format, it works fine:
wcmrest/query?keyword=ABC&keyword=DEF
This returns all the contents which has both ABC and DEF as values in keywords.
My requirement is to search for content that matches either the ABC or the DEF keyword.
Kindly let me know what query I need to use for this.
Also, is it possible to search WCM based on user defined metadata?
The dynamic/ad hoc queries do not have a query parameter that can perform an OR across multiple keywords.
This can be achieved using a pre-defined query: http://infolib.lotus.com/resources/portal/8.0.0/doc/en_us/PT800ACD004/wcm/wcm_rest_defined.html
For example you could use the following user defined query:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<definedQuery pageSize="10" page="1" restrictParameters="false" depth="DESCENDANTS">
  <select>
    <keywordIn>
      <keyword>ABC</keyword>
      <keyword>XYZ</keyword>
    </keywordIn>
  </select>
  <allowParameters/>
</definedQuery>

Sitecore issue on replacing Danish characters in URLs

I used this solution to optimize URLs. All works fine, but there is a problem with the Danish characters (æ and ø), which should be replaced with "a" and "o". I used this in Web.config:
<replace mode="on" find="æ" replaceWith="a" />
<replace mode="on" find="ø" replaceWith="o" />
The URLs look good, but when I try to follow such a link I get a 404 error, and if I manually change "a" back to "æ" in the URL, the page opens.
Help me please!:)
Remember that replacement is two-way. Generated URLs will substitute a for æ.
Incoming URLs will replace a with æ when looking up items.
As Danish uses both letters, simply replacing æ with a when you generate URLs will cause you all sorts of headaches - e.g. the item at-spise-æbler ("to eat apples") will generate the URL at-spise-abler, which will be reverse-replaced during item lookup to try and find the item æt-spise-æbler, which doesn't exist.
To be more consistent you should replace æ with ae, å with aa and ø with oe if you wish to replace Danish characters.
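In Web.config, that would look something like this (same replace syntax as above):
<replace mode="on" find="æ" replaceWith="ae" />
<replace mode="on" find="ø" replaceWith="oe" />
<replace mode="on" find="å" replaceWith="aa" />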
If you are also using replace mode to ensure all URLs are lower-cased (e.g. <replace mode="on" find="A" replaceWith="a" /> ) then your incoming URL containing an "a" will be interpreted as containing an "A" (assuming replacement is in order of the entries in the web.config and your lowercasing matches are first - if it's the other way round then you still have other problems!). The item at-spise-æbler will still generate a URL at-spise-abler, but your item lookup may match a to A first, trying to find At-spise-Abler, which doesn't exist.
The double-letter substitution won't help you here either, as Sitecore will simply match each letter to its uppercase version.
A better solution for you would be to actually rename items (or their display names) when they are created or edited.
This link should point you in the right direction: http://briancaos.wordpress.com/2007/05/30/sc-53-ensure-item-names/

Sharepoint Lists.asmx: remove "ows_MetaInfo" field from GetListItems method response xml

The following question was posted in another forum but got no response. I am facing the same problem and I think it will get some answers here:
Question :
I am making use of the SharePoint 2007 GetListItems web service to programmatically retrieve all documents within a document library. However, my program is throwing an Exception due to an invalid character contained within the XML response. The bad data is within the Word document itself: there are control characters within the Comments section of the document properties. The bad characters then end up in the ows_MetaInfo field as character references that are invalid in the XML output.
I have no need for the ows_MetaInfo field and so I have been trying to use the viewFields parameter to specify which fields to return along with setting the query option IncludeMandatoryColumns to false but the ows_MetaInfo field is always returned.
Does anyone know if it is possible to remove the ows_MetaInfo field from the output or somehow handle these invalid characters that are appearing in the XML output
In my case (SharePoint 2010) this solved the problem:
<soap:viewFields>
  <ViewFields Properties="True">
    <FieldRef Name="MetaInfo" Property="ModifiedBy" />
    <FieldRef Name="ID" />
    <FieldRef Name="LinkFilename" />
  </ViewFields>
</soap:viewFields>
This works for me to exclude the ows_MetaInfo field:
<soap:GetListItems>
  <soap:listName>{....}</soap:listName>
  <soap:viewFields>
    <ViewFields Properties="True">
      <FieldRef Name="*"/>
      <FieldRef Name="MetaInfo"/>
    </ViewFields>
  </soap:viewFields>
</soap:GetListItems>
See also http://msdn.microsoft.com/en-us/library/dd964860(v=office.12).aspx
There is no way to remove this field from the output, or at least none that I've found.
The MSDN documentation says that even if you set IncludeMandatoryColumns to false, it will still return some mandatory fields.
I think your best option here is to file a bug report with Microsoft, saying that invalid characters are put inside the ows_MetaInfo field.
Another thing you can try, though I don't know if it will resolve the problem, is setting the Properties attribute of the ViewFields element to TRUE:
<ViewFields Properties="TRUE">your fieldrefs</ViewFields>

Tokenizing Twitter Posts in Lucene

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?
More detailed version:
I want to index a number of tweets in Lucene and keep terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (but it does other useful stuff like keeping domain names, email addresses or recognizing acronyms). How can I have an analyzer which does everything StandardTokenizer does but does not touch terms like @user and #hashtag?
My current solution is to preprocess the tweet text before feeding it into the analyzer and replace the characters with other alphanumeric strings. For example,
String newText = tweetText.replaceAll("#", "hashtag");
newText = newText.replaceAll("@", "addresstag");
Unfortunately this method breaks legitimate email addresses but I can live with that. Does that approach make sense?
Thanks in advance!
Amaç
The StandardTokenizer and StandardAnalyzer basically pass your tokens through a StandardFilter (which removes all kinds of characters from your standard tokens like 's at ends of words), followed by a Lowercase filter (to lowercase your words) and finally by a StopFilter. That last one removes insignificant words like "as", "in", "for", etc.
What you could easily do to get started is implement your own analyzer that performs the same as the StandardAnalyzer but uses a WhitespaceTokenizer as the first item that processes the input stream.
For more details on the inner workings of the analyzers you can have a look over here
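Expressed as a Solr field type, a rough sketch of that idea (the name text_tweets is an assumption, and solr.StopFilterFactory only approximates StandardAnalyzer's stop-word removal; tokenizing on whitespace leaves @user and #hashtag intact):
<fieldType name="text_tweets" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace only, so punctuation like @ and # survives -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- mirror StandardAnalyzer's lowercasing and stop-word removal -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
  </analyzer>
</fieldType>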
It is cleaner to use a custom tokenizer that handles Twitter usernames natively. I have made one here: https://github.com/wetneb/lucene-twitter
This tokenizer will recognize Twitter usernames and hashtags, and a companion filter can be used to lowercase them (given that they are case-insensitive):
<fieldType name="text_twitter" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
<filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="org.opentapioca.analysis.twitter.TwitterTokenizerFactory" />
<filter class="org.opentapioca.analysis.twitter.TwitterLowercaseFilterFactory" />
</analyzer>
</fieldType>
There's a Twitter-specific tokenizer here: https://github.com/brendano/ark-tweet-nlp/blob/master/src/cmu/arktweetnlp/Twokenize.java
A tutorial on a Twitter-specific tokenizer, which is a modified version of the ark-tweet-nlp API, can be found at http://preciselyconcise.com/apis_and_installations/tweet_pos_tagger.php
This API is capable of identifying emoticons, hashtags, interjections, etc. present in a tweet.
The Twitter API can be told to return all Tweets, Bios etc with the "entities" (hashtags, userIds, urls etc) already parsed out of the content into collections.
https://dev.twitter.com/docs/entities
So aren't you just looking for a way to re-do something that the folks at Twitter have already done for you?
Twitter has open-sourced their text-processing library, which implements token handlers for hashtags etc., such as HashtagExtractor:
https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/text/extractor/HashtagExtractor.java
It is based on Lucene's TokenStream.

BizTalk 2006 R2 mapping problem

I have this data (all the elements are optional):
<data>
  <optionalElement1>...</optionalElement1>
  <optionalElement2>...</optionalElement2>
  <optionalElement3>...</optionalElement3>
</data>
I need to map this to another schema (all the elements are required):
<request>
  <Element1>...</Element1>
  <Element2>...</Element2>
  <Element3>...</Element3>
</request>
Since the elements in the original request are optional, the mapping will only generate the corresponding elements for the originally included elements. But the validation of the request will fail.
Example:
<data>
  <optionalElement3>
    <value1>1</value1>
    <value2>2</value2>
  </optionalElement3>
</data>
will be mapped to
<request>
  <Element3>
    <subelement1>1</subelement1>
    <subelement2>2</subelement2>
  </Element3>
</request>
And the validation will fail because I'm missing Element1 and Element2. The result should be (I think):
<request>
  <Element1 xsi:nil="true" />
  <Element2 xsi:nil="true" />
  <Element3>
    <subelement1>1</subelement1>
    <subelement2>2</subelement2>
  </Element3>
</request>
How can I do this in the mapping? How can I ensure that the element is created in the output message?
And, by the way, if a subelement is not present (let's say "data/optionalElement1/value1"), how can I make sure that the destination subelement "request/Element1/subelement1" is created?
Make it very simple: use an XSLT file for the mapping.
Using a simple if condition you can check whether a value exists for the optional element; if it exists, map it, otherwise map a null (empty) value. That way the complex element will be generated even when there is no value for the optional element.
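A minimal sketch of such an XSLT (element names taken from your example above; the xsi:nil handling assumes your target schema declares the elements nillable):
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <xsl:template match="/data">
    <request>
      <xsl:choose>
        <!-- if the optional element exists, map its values across -->
        <xsl:when test="optionalElement1">
          <Element1>
            <subelement1><xsl:value-of select="optionalElement1/value1"/></subelement1>
            <subelement2><xsl:value-of select="optionalElement1/value2"/></subelement2>
          </Element1>
        </xsl:when>
        <!-- otherwise emit the required element as nil -->
        <xsl:otherwise>
          <Element1 xsi:nil="true"/>
        </xsl:otherwise>
      </xsl:choose>
      <!-- repeat the same pattern for Element2 and Element3 -->
    </request>
  </xsl:template>
</xsl:stylesheet>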
Hope it will solve your problem.
You can do all this in the mapper. I haven't been into BizTalk for a while and I don't have it near me, but I know there are functoids in the mapper that let you check for the existence of the fields you need. Depending on the existence of these fields, you can specify the appropriate action for the mapper.
You force the creation of fields by giving them default values in the target schema. This can also be done using the mapper, via the properties window.
Jose,
You'll want to look at the table looping functoid. Here's a post about it.
http://geekswithblogs.net/Chilberto/archive/2008/04/16/121274.aspx
Using this functoid with the table extraction should give you your solution. Also, here's a good series on understanding the mapper.
http://www.bizbert.com/bizbert/2008/02/07/Understanding+The+BizTalk+Mapper+Part+1+Introduction.aspx
-Bryan
