Solr: exclude parts of a URL

I have a field url in schema.xml. I need to separate my search results based on this field.
For example
in one search I want results from www.example.com/part1/, i.e. all results that have this prefix;
in another search I want results from www.example.com, but without any documents containing /part1/ in their URL.
How can I achieve this? fq doesn't accept special characters, and I don't want to split the content with NGramFilterFactory, so this behaviour should apply only at search time.

The PathHierarchyTokenizerFactory should do what you need, I believe. It splits a path-like string into multiple tokens, building up from the root forwards. See https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-PathHierarchyTokenizer for more details.
You can then run a query such as q=path:www.example.com -path:*/part1, assuming that you use the Path Hierarchy Tokenizer for both index-time and query-time analysis on that field.
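For reference, a minimal schema.xml sketch of such a field type (the type and field names here are illustrative, not from the question):

<!-- PathHierarchyTokenizer emits one token per path prefix, e.g.
     "www.example.com/part1/page" becomes "www.example.com",
     "www.example.com/part1", and "www.example.com/part1/page".
     A single <analyzer> applies at both index and query time. -->
<fieldType name="url_hierarchy" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
</fieldType>
<field name="path" type="url_hierarchy" indexed="true" stored="true"/>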

Related

Querying a Lucene index with arbitrarily long article text to check for all matches within the article (through neo4j)

I'm trying to query the Lucene index I've added to a neo4j field (it's a "name" field that isn't very long; one to ten words at most).
What I do right now is take all the text of a given webpage, sanitize it with a JavaScript function to keep only words, spaces, and alphanumeric characters, and use that to query my index.
.replace(/[^\w\s]|or|and|not|return+/gi, "") // <- escaping the input
I'm not sure whether the length of the search text is limited somehow, but results do seem to disappear after about 1,050 words (~6,500 characters).
Ideally, I'd like to be able to use a couple thousand words in one query, with the end goal of highlighting the matches found within the webpage itself.
Why is my query not returning any results past a certain number of characters? Am I missing some keyword in my escaping regex?
Is what I'm trying to achieve feasible? Is there a better approach I could use?
Thanks for reading :)
(for anyone finding this, I found a somewhat related question here: Handling large search queries on relatively small index documents in Lucene)

Supported search queries with OneDrive

What values the 'search-text' can take in the following query?
GET /me/drive/root/search(q='{search-text}')
From experiments, it looks like the {search-text} is a single string that is searched for in the contents of the file. Meaning, if the search text is a multi-word sentence, the entire sentence is searched for rather than the individual words in the sentence? Is this the right assumption?
E.g., say I would like to search for 'word1' 'word2' ... 'wordn'; it looks like a search query has to be issued for each of the n words individually. Is there a format/way in which we can search for all n words in a single query?
Thanks,
/Girish BK
Searching is phrase-based and does not support wildcards or similar search augmentations.
For example, the query /me/drive/search(q='pizza shop') would search for files that contain the phrase "pizza shop" in a filename, a file's metadata, and a file's content.
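For illustration, the raw Microsoft Graph request behind that example looks something like this (the bearer token is a placeholder for whatever your auth flow returns; the matching driveItem resources come back in the response's value array):

GET https://graph.microsoft.com/v1.0/me/drive/root/search(q='pizza shop')
Authorization: Bearer {access-token}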

How to give multiple hyperlinks to a single string

I am of the opinion that giving multiple hyperlinks to a single string is not possible in an MS Word document. I don't have any knowledge of C#, but after reading How insert multiple hyperlinks to one comment in MS WORD using C#? I just want to clarify whether multiple hyperlinks to a single comment are possible or not.
For example I have a "string"
I want to give different hyperlinks like this
example.com/s
example.com/t
example.com/r
example.com/i
example.com/n
example.com/g
So that I get the chance to choose where I want to go from that string.
In Word, a hyperlink is not a string; it's a field code, and field codes are special objects. It is, therefore, not possible to pass multiple hyperlink objects as part of a string. You can't even pass one hyperlink object as part of a string...
You can, however, pass multiple hyperlink targets as a delimited string, then "split" the string into an array and loop over the array to create multiple hyperlink objects.
If you want to open a hyperlink or hyperlinks from code, there is a FollowHyperlink method, as I recall (I'm on a mobile device at the moment, so can't double-check). You can pass a string to that.
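A quick sketch of that split-then-loop idea in JavaScript terms (createHyperlink is a hypothetical stand-in for whatever your Word automation API provides, e.g. Hyperlinks.Add in the Word object model):

// One delimited string in, one hyperlink object out per target.
const targets = "example.com/s;example.com/t;example.com/r".split(";");
for (const url of targets) {
  createHyperlink(url); // hypothetical wrapper around the real hyperlink-creation call
}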

How can I smartly extract information from an HTML page?

I am building something that can more or less extract key information from an arbitrary web site. For example, if I crawled a McDonalds page and wanted to figure out programmatically the opening and closing times of McDonalds, what is an intelligent way to do it?
In a general case, maybe I also want to find out whether McDonalds sells chicken wings, or the address of McDonalds.
What I am thinking is that I will have a specific case for time, wings, and address and have code that is unique for each of those 3 cases.
But I am not sure how to approach this. I have the sites crawled, and the HTML and related information parsed into JSON, already. My current approach is something like finding the title tag and checking whether it contains key words like address or location, etc. If the title contains those key words, then I look through the current page and identify chunks of content that resemble an address, such as content that names cities or countries, or content that has the word St or Street in it.
I am wondering if there is a better approach to finding key data; I'm looking for a nicer starting point, or to bounce some ideas around. Even pointers to good articles to read about this would be great.
Let me know if this is unclear.
Thanks for the help.
In order to parse such HTML pages you have to have knowledge of their structure. There's no general solution to this problem; each webpage needs its own. However, a good approach would be to ensure the HTML code is valid XML too, and then use XPath to access elements at known positions. Maybe there's even an XPath-like solution for standard HTML (which is not always valid XML). This way you can define a set of XPaths for each page which give you the specific elements if they exist.
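As a minimal sketch of that XPath idea in browser JavaScript (the URL and the XPath expression are invented for illustration; every site needs its own):

// Fetch a page, parse it, and pull a known element out via XPath.
const html = await (await fetch("https://example.com/store/123")).text();
const doc = new DOMParser().parseFromString(html, "text/html");
// 'opening-hours' is a hypothetical class name; inspect the real page for yours.
const node = doc.evaluate(
  "//div[@class='opening-hours']", doc, null,
  XPathResult.FIRST_ORDERED_NODE_TYPE, null
).singleNodeValue;
console.log(node ? node.textContent.trim() : "not found");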

Google Search Appliance wildcard character support

Does Google Search Appliance support wildcard characters in the query string? If not, is there any way I can fetch the entire result set through my query?
The GSA does not support wildcarding. One option is to n-gram the fields or content that you want wildcarded; this would be done in your feeder or pipeline.
If waiting for and upgrading to GSA software v7.2, coming mid-December, is an option, you will have wildcard search built in.
Otherwise you have to dig deeper. A possible option is a document filter; I have developed such a document filter, so if you are interested in that option I might be able to help.
GSA software 7.4 has wildcard search built in. From the documentation:
Enabling Wildcard Search
Wildcard search is a feature that enables your users to search by entering a word pattern rather than the exact spelling of a term. The search appliance supports two wildcard operators:
*--Matches zero or more characters
?--Matches exactly 1 character
Using wildcards can simplify queries for long names, technical data, pharmaceutical information, or strings where the exact spelling varies or is unknown. A user can search for all words starting with a particular pattern, ending with a particular pattern, or having a particular substring pattern.
By default, wildcard indexing is disabled for your search appliance. You can enable or disable wildcard indexing by using the Index > Index Settings page. You can disable or enable wildcard search for one or more front ends by using the Filters tab of the Search > Search Features > Front Ends page.
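To make the two operators concrete, a few illustrative patterns:

pharma*        matches pharmacy, pharmaceutical, ...
*mycin         matches streptomycin, vancomycin, ...
organi?ation   matches organisation and organization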
One way to get all indexed items from a collection is to use a query that will match every indexed record. E.g., supposing you're indexing some set of URLs from subdomain.companyname.com, just query for "companyname" with the &num=1000&filter=0 query-string parameters.
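Put together, such a fetch-everything request might look like the following (the host, collection, and front-end names are placeholders for your own appliance's settings):

http://gsa.yourcompany.com/search?q=companyname&num=1000&filter=0&site=default_collection&client=default_frontend&output=xml_no_dtd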
