Querying a Lucene index with arbitrarily long article text to check for all matches within the article (through Neo4j)

I'm trying to query the Lucene index I've added to a Neo4j field (it's a "name" field that isn't very long: one to ten words at most).
What I do right now is take all the text in a given webpage, sanitize it with a JavaScript function to keep only words, spaces and alphanumeric characters, and use that to query my index.
.replace(/[^\w\s]|\b(?:or|and|not|return)\b/gi, "") // <- escaping the input: the word boundaries stop "or"/"and" from being stripped out of ordinary words
I'm not sure if the length of the search text is limited somehow, but results do seem to disappear after about 1050 words (~6500 characters).
Ideally, I'd like to be able to use a couple thousand words in one query, with the end goal of highlighting the matches found within the webpage itself.
Why is my query not returning any results past a certain number of characters? Am I missing some keyword in my escaping regex?
Is what I'm trying to achieve feasible? Is there a better approach I could use?
Thanks for reading :)
(for anyone finding this, I found a somewhat related question here: Handling large search queries on relatively small index documents in Lucene)
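Edit: one thing I plan to try, on the hunch that the ~1050-word cutoff is Lucene's default BooleanQuery limit of 1024 clauses, is splitting the word list into batches and merging the results client-side. A rough sketch in Python (the index name, the fulltext procedure, and the connection details are assumptions about my setup):

from neo4j import GraphDatabase

# Assumed connection details and a fulltext index named "names" on the
# "name" field; the fulltext procedures require Neo4j 3.5 or later.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def search_in_batches(words, batch_size=1000):
    # Union the matches from several sub-queries, each kept under
    # Lucene's default 1024-clause BooleanQuery limit.
    matches = set()
    with driver.session() as session:
        for i in range(0, len(words), batch_size):
            query_text = " OR ".join(words[i:i + batch_size])
            result = session.run(
                "CALL db.index.fulltext.queryNodes('names', $q) "
                "YIELD node RETURN node.name AS name",
                q=query_text,
            )
            matches.update(record["name"] for record in result)
    return matches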

Related

OR Operator in Google Sheets Not Working?

This is the formula I am working on right now:
=FILTER(Data!A:K, SEARCH("Gretsch", Data!F:F)+SEARCH("Krutz", Data!F:F))
I'm trying to bring in both rows that include "Gretsch" and rows that include "Krutz". Just using one of those instead of both works fine, and so does pairing keywords that appear together in the other's results. For example, searching just "Gretsch" brings up 10 or so (out of the 100+) products that include "Streamliner" in column F as well as "Gretsch", so this formula:
=FILTER(Data!A:K, SEARCH("Gretsch", Data!F:F)+SEARCH("Streamliner", Data!F:F))
brings up the 10 or so products containing both. But I'm looking for one OR the other, so why is that '+' acting like an AND operator instead? Am I just completely off base?
SEARCH returns a number: the starting position where a string is found within another string. It's not a simple 1 like other tests for TRUE, and there is no 0 case (i.e., FALSE) as you have it written. In addition, if either SEARCH does not find the target string, it returns an error; and a number plus an error returns an error (which is not TRUE and therefore will not be included in the FILTER). For example, if F2 contains "Gretsch Streamliner", SEARCH("Krutz", F2) returns #VALUE!, so the whole sum for that row is #VALUE! and the row is dropped.
A better approach to achieving OR with FILTER:
=FILTER(Data!A:K, REGEXMATCH(LOWER(Data!F:F),"gretsch|krutz"))
The pipe symbol ("|") means OR in this context, and you may list as many pipe-separated strings as you like. Notice that the search range is wrapped in LOWER and that the terms to search are also lowercase; this assures the same kind of case-insensitive search you were getting from SEARCH.
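The same alternation works in any regex engine, so it's easy to sanity-check the pattern outside Sheets; for instance, a quick Python mock-up of what REGEXMATCH is doing per row (purely illustrative, not Sheets itself):

import re

rows = ["Gretsch Streamliner", "Krutz Violin", "Fender Stratocaster"]
pattern = re.compile(r"gretsch|krutz")  # the pipe means OR, as in REGEXMATCH
kept = [row for row in rows if pattern.search(row.lower())]
print(kept)  # ['Gretsch Streamliner', 'Krutz Violin']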
By the way, based on your other recent post, you can also use REGEXMATCH with NOT, e.g.:
=FILTER(Data!A:K, REGEXMATCH(LOWER(Data!F:F),"gretsch|krutz"), NOT(REGEXMATCH(LOWER(Data!F:F),"case")))
One additional note: your post is tagged both "Excel" and "Google Sheets." The formulas I've proposed will only work in Google Sheets. In the vast majority of cases, it is best to tag a post here with either "Excel" or "Google Sheets" but not both, since the differences between the two are substantial.
try:
=QUERY(Data!A:K, "where lower(F) contains 'gretsch'
or lower(F) contains 'krutz'")

Storing words in a text

I am building an application for learning languages, with Rails and Postgresql.
Texts get uploaded. The texts will be of varying length, but let’s assume they’ll be 100-3000 words long.
On upload, each text position gets transformed into a “token”, representing information about the word at that position (base word, noun/verb/adjective/etc., grammar tags, definition_id).
On click of a word in the text, I need to find (and show) all other texts in the database that have words with the same attributes (base_word, part of speech, tags) as the clicked word.
The easiest and most relational way to do this is a join table TextWord between the Text and Word tables. Each text_word would represent a position in the text, and would contain the text_id, word_id, grammar_tags, start_index, and end_index.
However, if a text has between 100-3000 words, this would mean 100-3000 entries for each text object.
Is that crazy? Expensive? What problems could this lead to?
Is there a better way?
I can’t use Postgres full text search because, for example, if I click “left” in “I left Nashville”, I don’t want “take a left at the light” to show up. I want only “left” as a verb, as well as other forms of “leave” as a verb. Furthermore, I might want only “left” with a specific definition_id (ex. “Left” used as “The political party”, not “the opposite of right”).
The other option I can think of is to store a JSON on the text object, with the tokens as a big hash of hashes, or array of hashes (either way). Does PostgreSQL have a way to search through that kind of nested data structure?
A third option is to have the same JSON as option 2 (to store all the positions in a text), and a 2nd json on each word object / definition object / grammar object (to store all the positions across all texts where that object appears). However, this seems like it might take up more storage than a join table, and I’m not sure if it would bring any tangible benefit.
Any advice would be much appreciated.
Thanks,
Michael.
An easy solution would be to have a database with several indexes: one for the base word, one for the part-of-speech, and one for every other feature you're interested in.
When you click on "left", you identify that it's a form of "leave", and a "verb" in the "past tense". Now you go to your indexes and get all token positions for "leave", "verb", and "past tense". You take the intersection of all the index positions, and you are left with the token positions of the forms you're after.
If you want to save space, have a look at Managing Gigabytes, which is an excellent book on the topic. I have in the past used that to fully index text corpora with millions of words (which was quite a lot 20 years ago...)
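A minimal sketch of that intersection idea in Python (the feature names and in-memory dicts are illustrative; in practice the postings would live in database indexes):

from collections import defaultdict

# One inverted index per feature: feature value -> set of (text_id, position).
indexes = {
    "base_word": defaultdict(set),
    "pos": defaultdict(set),
    "tense": defaultdict(set),
}

def add_token(text_id, position, base_word, pos, tense):
    # Called once per token at upload time.
    indexes["base_word"][base_word].add((text_id, position))
    indexes["pos"][pos].add((text_id, position))
    indexes["tense"][tense].add((text_id, position))

def find(base_word, pos, tense):
    # Intersect the postings for each feature of the clicked word,
    # e.g. find("leave", "verb", "past") for "left" in "I left Nashville".
    return (indexes["base_word"][base_word]
            & indexes["pos"][pos]
            & indexes["tense"][tense])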

JQL actual "contains"

I want to perform a simple search on a text field with part of its content, but I don't know the beginning. I basically want what someone would expect of a "contains" search. If I search issues for 345, I would want this result:
123456
234567
345678
...
Which, in JQL, would be the result of the query issue ~ "*345*", but * is not allowed as the first character in a wildcard query. Is there an easy way to get this result, preferably with a JQL query?
Right now it's impossible to do a true "contains" search in JIRA. As described in Search syntax for text fields, JIRA supports word stemming:
Since JIRA cannot search for issues containing parts of words, word 'stemming' allows you to retrieve issues from a search based on the 'root' (or 'stem') forms of words instead of requiring an exact match with specific forms of these words. The number of issues retrieved from a search based on a stemmed word is typically larger, since any other issues containing words that are stemmed back to the same root will also be retrieved in the search results.
That means, that you can search for common root of some word, but can't search for arbitrary part of it.
There is an issue in the official JIRA bug tracker, Allow searching for part of a word (prefix / substring searches), which describes why this can't be implemented:
Lucene doesn't support prefix search.
As a workaround, the suggestion is to use the Script Runner plugin for JIRA:
issueFunction in issueFieldMatch("project = JRA", "description", "ABC\\d{4}")
See more on IssueFieldMatch here.
Another plugin, which can do regex jql is JQL Search Toolkit.
Filter issues by "345" substring in the Summary field:
summary ~ "345"
Filter issues by "345" substring in the Description field:
description ~ "345"

Supported search queries with OneDrive

What values the 'search-text' can take in the following query?
GET /me/drive/root/search(q='{search-text}')
From experiments, it looks like the {search-text} is a single string that is searched for in the contents of the file. Meaning, if the search text is a multi-word sentence, then the entire sentence is searched for rather than the individual words in the sentence? Is this the right assumption?
E.g., say I would like to search for 'word1' 'word2' ... 'wordn'; then it looks like a search query has to be issued for each of the n words individually. Is there a format/way in which we can search for all n words in a single query?
Thanks,
/Girish BK
Searching is phrase-based and does not support wildcards or similar search augmentations.
For example, the query /me/drive/search(q='pizza shop') would search for files that contain the phrase "pizza shop" in a filename, a file's metadata, and a file's content.
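Given that, the n-word case has to be emulated client-side. A sketch against the Microsoft Graph endpoint (token acquisition is omitted and the placeholder must be filled in; intersecting the per-term results approximates an AND of all n words):

import requests

GRAPH = "https://graph.microsoft.com/v1.0"
TOKEN = "..."  # placeholder; obtain via MSAL or another OAuth flow

def search_all_terms(terms):
    # The API treats q as a single phrase, so issue one request per term
    # and intersect the result sets to require every word somewhere.
    hits = None
    headers = {"Authorization": f"Bearer {TOKEN}"}
    for term in terms:
        resp = requests.get(f"{GRAPH}/me/drive/root/search(q='{term}')",
                            headers=headers)
        resp.raise_for_status()
        ids = {item["id"] for item in resp.json().get("value", [])}
        hits = ids if hits is None else hits & ids
    return hits or set()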

tag generation from a small text content (such as tweets)

I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets such as user tweets to generate tags (keywords).
And it seems like the accepted suggestion (the point-wise mutual information algorithm) is meant to work on bigger documents.
With this constraint (working on a small set of texts), how can I generate tags?
Regards
Two Stage Approach for Multiword Tags
You could pool all the tweets into a single larger document and then extract the n most interesting collocations from the whole collection of tweets. You could then go back and tag each tweet with the collocations that occur in it. Using this approach, n would be the total number of multiword tags that would be generated for the whole dataset.
For the first stage, you could use the NLTK code posted here. The second stage could be accomplished with just a simple for loop over all the tweets. However, if speed is a concern, you could use pylucene to quickly find the tweets that contain each collocation.
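A rough sketch of those two stages with NLTK (the frequency cutoff and n are arbitrary choices, and nltk.download('punkt') may be needed first):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def collocation_tags(tweets, n=50):
    # Stage 1: pool every tweet and find the n top-scoring bigrams.
    words = [w.lower() for tweet in tweets for w in nltk.word_tokenize(tweet)]
    finder = BigramCollocationFinder.from_words(words)
    finder.apply_freq_filter(3)  # drop very rare pairs (misspellings, noise)
    measures = BigramAssocMeasures()
    top = finder.nbest(measures.pmi, n)
    # Stage 2: tag each tweet with the collocations that occur in it.
    return [[" ".join(bigram) for bigram in top
             if " ".join(bigram) in tweet.lower()]
            for tweet in tweets]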
Tweet Level PMI for Single Word Tags
As also suggested here, for single-word tags, you could calculate the point-wise mutual information of each individual word and the tweet itself, i.e.
PMI(term, tweet) = log [ P(term, tweet) / (P(term) * P(tweet)) ]
Again, this will roughly tell you how much less (or more) surprised you are to come across the term in the specific document as opposed to coming across it in the larger collection. You could then tag the tweet with a few terms that have the highest PMI with the tweet.
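If you treat every tweet as equally likely (P(tweet) = 1/N) and P(term, tweet) as 1/N whenever the term occurs in the tweet, the formula above reduces to log(N / df(term)), where df is the number of tweets containing the term. A minimal sketch under that simplification:

import math
from collections import Counter

def pmi_tags(tweets, k=3):
    # PMI(term, tweet) reduces to log(N / df(term)) for terms in the tweet,
    # so tag each tweet with its k rarest (highest-PMI) terms.
    docs = [set(t.lower().split()) for t in tweets]
    df = Counter(w for d in docs for w in d)
    n = len(docs)
    return [sorted(d, key=lambda w: math.log(n / df[w]), reverse=True)[:k]
            for d in docs]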
General Changes for Tweets
Some changes you might want to make when tagging with tweets include:
Only use a word or collocation as a tag for a tweet, if it occurs within a certain number or percentage of other tweets. Otherwise, PMI will tend to tag tweets with odd terms that occur in just one tweet but that are not seen anywhere else, e.g. misspellings and keyboard noise like ##$##$%!.
Scale the number of tags used with the length of each tweet. You might be able to extract 2 or 3 interesting tags for longer tweets. But, for a shorter 2 word tweet, you probably don't want to use every single word and collocation to tag it. It's probably worth experimenting with different cut-offs for how many tags you want to extract given the tweet length.
I have used a method before, for small text content such as SMSes, where I would just repeat the same line twice. Surprisingly, that works well for such content, where a noun could well be the topic; the noun doesn't actually need to repeat to be the topic.
