Keyword Text Recognition and Extraction

I'm trying to build an architecture that recognizes words related to a subject word in a paragraph of text. These "related" words can be words that describe the subject word or provide information about it.
Here's a basic example:
John is 36, male and lives in New York. He's skinny, about 5'9, with fair skin.
In this example, the subject word would be, "John".
The related words are "36", "male", "New York", "skinny", "5'9" and "fair skin".
I already have a rule-based approach to identify the subject word, which works perfectly fine. Identifying the "related" words, however, is not yielding the accuracy I'm hoping for. To identify them, I've taken a supervised-learning approach with an LSTM architecture. While I used a combination of PoS tags and dependency tags in the beginning, I have now switched to pure embeddings (from transformer-based models).
Any architectural or method recommendations would be greatly appreciated.
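Since the question describes token-level supervision over transformer embeddings, here is a minimal, hypothetical sketch (not the asker's actual code) of one common framing: binary token classification, where precomputed per-token embeddings feed a BiLSTM tagger that labels each token as related or unrelated to the subject. All names and dimensions below are illustrative assumptions.

```python
# A hedged sketch: per-token binary tagging over precomputed embeddings.
# Assumes token embeddings (e.g. 768-d vectors from a transformer) are
# produced upstream; all names and sizes are illustrative.
import torch
import torch.nn as nn

class RelatedWordTagger(nn.Module):
    def __init__(self, emb_dim=768, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)  # related / not related

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, emb_dim)
        out, _ = self.lstm(token_embeddings)
        return self.classifier(out)  # (batch, seq_len, 2) logits

# Usage: argmax over the last dimension gives a per-token prediction.
tagger = RelatedWordTagger()
logits = tagger(torch.randn(1, 12, 768))   # one 12-token sentence
print(logits.argmax(dim=-1))               # 0/1 label per token
```

One variant worth trying is concatenating the subject word's embedding onto every token embedding, so the tagger knows which entity it is tagging relative to.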

Related

NLP tfidf with LSTM

I have a basic question in NLP.
When we consider traditional models like decision trees, the feature column order is important: the first column is fixed to some particular attribute. So if I use TF-IDF, each word has a fixed index and the model can learn.
But in the case of an LSTM, sentences can be jumbled, e.g. "There is heavy rain" vs. "Heavy rain is there".
In these two sentences, the word "heavy" occurs in different positions. So in order for the model to understand that we have passed the word "there", we would require some unique representation for the word "there", either a one-hot vector or word2vec. Is my understanding so far right?
My final question is: if I use TF-IDF for the above, how will it work? How will the model understand that the word "heavy" was passed? This question has been bugging me for a long time.
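A small, hedged sketch of the contrast the question is getting at (the two example sentences are from the question; scikit-learn is used only to build the vocabulary):

```python
# TF-IDF turns a whole sentence into ONE fixed-length vector indexed by
# vocabulary position, discarding word order; an LSTM instead consumes a
# SEQUENCE of per-token representations, so order is preserved.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["There is heavy rain", "Heavy rain is there"]

vec = TfidfVectorizer()
X = vec.fit_transform(sentences)
print(vec.get_feature_names_out())  # ['heavy' 'is' 'rain' 'there']
print(X.toarray())                  # both rows identical: order is lost

# For an LSTM you would instead map each token to an index (and then an
# embedding), giving a sequence that preserves order:
vocab = {w: i for i, w in enumerate(vec.get_feature_names_out())}
seq = [vocab[w.lower()] for w in "There is heavy rain".split()]
print(seq)  # [3, 1, 0, 2] -- a per-token sequence, not one bag vector
```

This is why a TF-IDF vector is a poor fit as the direct input to an LSTM: there is no sequence left for the recurrence to run over.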

How to find similar sentences using fastText (sentences with out-of-vocabulary words)

I am trying to create an NLP model which can find similar sentences. For example, it should be able to say that "Software Engineer", "Software Developer", "Software Dev" and "Soft Engineer" are similar sentences.
I have a dataset with a list of roles such as Chief Executive and Software Engineer, and the variations of these terms will be unknown (out of vocabulary).
I am trying to use fastText with Gensim but am struggling.
Does anyone have suggested readings/tutorials that might help me?
A mere list of roles may not be enough data for fastText (and similar word2vec-like algorithms), which need to see words (or tokens) in natural usage contexts, alongside other related words, to gradually nudge them into interesting relative-similarity alignments.
Do you just have the titles, or other descriptions of the roles?
To the extent that the titles are composed of individual words which, in their title context, mostly mean the same as in normal contexts, and the titles are very short (2-3 words each), one potential approach is to try the "word mover's distance" (WMD) metric.
You'd want good word vectors trained elsewhere, with good contexts and compatible word senses, so that the vectors for 'software', 'engineer', etc. are individually all reasonably good. Then you could use the .wmdistance() method on Gensim's word-vector classes to calculate a measure of how much one run of words differs from another, across all of a text's words.
Update: note that for the values from WMD (and those from cosine similarity), you generally shouldn't obsess over their absolute values, only over how they affect relative rankings. That is, no matter what raw value wmd(['software', 'engineer'], ['electric', 'engineer']) returns, be it 0.01 or 100, the important measure is how that number compares to other pairwise comparisons, like say wmd(['software', 'engineer'], ['software', 'developer']).
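A minimal sketch of that workflow, assuming pretrained GloVe vectors fetched via gensim.downloader (an illustrative choice; any good general-purpose vectors would do) and with the POT package installed, which recent Gensim versions use for WMD:

```python
# Rank candidate titles against a query title by Word Mover's Distance.
# Lower distance = more similar; only the relative ordering matters.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # returns a KeyedVectors object

query = ["software", "engineer"]
candidates = [
    ["software", "developer"],
    ["electric", "engineer"],
    ["chief", "executive"],
]

dists = {tuple(doc): wv.wmdistance(query, doc) for doc in candidates}
for doc, dist in sorted(dists.items(), key=lambda kv: kv[1]):
    print(doc, round(dist, 3))
```

Lower-casing and tokenizing the titles consistently with the pretrained vectors' vocabulary matters more here than the exact model chosen.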

Retrieving the top 5 sentences - algorithm, if any present

I am new to Data Science. This could be a dumb question, but I just want to hear opinions and confirm whether my approach could be improved.
I have a question about getting the 5 most common/frequent sentences from a database. I know I could gather all the sentences into a list and, using the Counter library, fetch the 5 most frequently occurring ones, but I am interested to know whether any algorithm (ML/DL/NLP) exists for such a requirement. All the sentences are given by the user. I need to know his top 5 (most occurring/frequent) sentences (not phrases, please)!
Examples of sentences -
"Welcome to the world of Geeks"
"This portal has been created to provide well written subject"
"If you like Geeks for Geeks and would like to contribute"
"to contribute at geeksforgeeks org See your article appearing on "
"to contribute at geeksforgeeks org See your article appearing on " (occurring for the second time)
"the Geeks for Geeks main page and help thousands of other Geeks."
Note: all the sentences in my database are distinct (contextually, with no literal duplicates either). This is just an example of my requirement.
Thanks in Advance.
I'd suggest you start with sentence embeddings. Briefly, a sentence embedding is a vector computed for a given sentence that roughly represents the sentence's meaning.
Let's say you have n sentences in your database and you compute the sentence embedding of each one, so now you have n vectors.
Once you have the vectors, you can use dimensionality-reduction techniques such as t-SNE to visualize your sentences in 2 or 3 dimensions. In this visualization, sentences with similar meanings should ideally be close to each other. That may help you pinpoint the most frequent sentences that are also close in meaning.
I think one problem is that it's still hard to draw boundaries around the meanings of sentences, since meaning is intrinsically subjective. You may have to add some heuristics to the process described above.
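A short sketch of that visualization step, assuming the n sentence embeddings are already computed as a NumPy array (random vectors stand in for real embeddings here):

```python
# Project sentence embeddings to 2-D with t-SNE and plot them; nearby
# points should correspond to sentences with similar meanings.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.rand(100, 384)  # placeholder for real embeddings

coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1])
plt.title("Sentences in embedding space (t-SNE)")
plt.show()
```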
Adding to MGoksu's answer: once you get sentence embeddings, you can apply LSH (locality-sensitive hashing) to group the embeddings into clusters.
Once you have the clusters of embeddings, it is trivial to get the clusters with the highest number of vectors.
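Putting both answers together, here is a hedged end-to-end sketch. It uses the sentence-transformers package (the model name is an illustrative choice) and plain agglomerative clustering in place of LSH, which keeps the code short; the distance threshold is an assumption to tune on real data:

```python
# Embed sentences, cluster them by cosine distance, and report the
# largest clusters as the "most frequent" sentences by meaning.
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

sentences = [
    "Welcome to the world of Geeks",
    "This portal has been created to provide well written subject",
    "If you like Geeks for Geeks and would like to contribute",
    "to contribute at geeksforgeeks org See your article appearing on ",
    "to contribute at geeksforgeeks org See your article appearing on ",
    "the Geeks for Geeks main page and help thousands of other Geeks.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(sentences, normalize_embeddings=True)

# The threshold controls how loosely "same meaning" is interpreted
# (older scikit-learn versions name the `metric` parameter `affinity`).
clusterer = AgglomerativeClustering(
    n_clusters=None, metric="cosine", linkage="average",
    distance_threshold=0.3)
labels = clusterer.fit_predict(embeddings)

# The biggest clusters stand in for the top 5 most frequent sentences.
for label, size in Counter(labels).most_common(5):
    example = next(s for s, l in zip(sentences, labels) if l == label)
    print(size, example)
```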

Classification of single sentence

I have 4 different categories, and I also have around 3000 words that belong to each of these categories. Now when a new sentence comes in, I am able to break the sentence into words and get more words related to it, so for each new sentence I can get 20-30 words generated from the sentence.
Now what is the best way to classify this sentence into one of the above-mentioned categories? I know bag of words works well.
I also looked at LDA, but it works with documents, whereas I have a list of words as a training corpus. LDA looks at the position of a word in the document, so I could not get meaningful results from it.
I'm not sure if I fully understand what your question is exactly.
Bag of words works well for some purposes, but in a lot of cases it throws away a lot of potentially useful information (which could be taken from word order, for example).
And assuming that you get a grammatical sentence as input, why not use your sentence as the document and still use LDA? The position of a word in your sentence can still be very meaningful.
There are plenty of classification methods available. Which one is best depends largely on your purpose. If you're new to this area, this may be interesting to have a look at: https://www.coursera.org/course/ml
Like Igor, I am also a bit confused regarding your problem. Be it a document or a sentence, the terms will be part of the feature set for categorization, in some form. You can find out the most relevant terms of each category and, using this knowledge, do a better classification of the new sentences. For example, suppose your sentence is "There is a stray dog near our layout which bites everyone who goes near to it". If you take the useful keywords from this sentence, removing stopwords, they are few in number (stray, dog, layout, bites, near), and you can categorize it into a bucket such as "animals_issue". If you train your system with a larger set of examples, this bag-of-words model can help. Otherwise, you can go for LDA or other topic-modelling approaches.
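A minimal, hypothetical sketch of that bag-of-words route (the training data and category names here are invented for illustration; with ~3000 words per category, the real training set would be far larger):

```python
# Bag-of-words + Naive Bayes: classify a sentence into one of a few
# categories from its keywords, ignoring word order.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "stray dog near our layout bites everyone",
    "cat stuck on a tree rescued by neighbours",
    "council approved the new road repair budget",
    "potholes on the main road remain unfixed",
]
train_labels = ["animals_issue", "animals_issue",
                "civic_issue", "civic_issue"]

clf = make_pipeline(CountVectorizer(stop_words="english"),
                    MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["a dog was found near the layout"]))
# -> ['animals_issue']
```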

Text Corpus for finding important words in a news headline

I am working on a small project which requires me to find the keywords of a news headline. I simply used TF-IDF with nltk.webtext (http://www.nltk.org/book/ch02.html#web-and-chat-text) as the corpus, treating each sentence as a document. The idea is that IDF gives a measure of how important a word is.
The results clearly depend largely on the underlying corpus. Webtext is plainly biased towards internet language, so words that are common on the web end up tagged as not-so-important by the algorithm.
What, in your opinion, would be a relevant corpus, and
what would the corresponding document be?
Headlines tend to be centred around politics, events, sports, etc. So, say, a book by Charles Dickens would be pretty neutral, but is there a more organized way to approach this?
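For reference, a hedged sketch of the basic setup with a news-domain corpus instead of webtext, here NLTK's Brown "news" category (an illustrative, if dated, choice), keeping each sentence as the "document":

```python
# Compute IDF over a news corpus and rank headline words by it;
# higher IDF = rarer in the corpus = presumed more informative.
import math
from collections import Counter
import nltk

nltk.download("brown", quiet=True)
from nltk.corpus import brown

docs = [set(w.lower() for w in sent)
        for sent in brown.sents(categories="news")]
N = len(docs)
df = Counter(w for doc in docs for w in doc)

def idf(word):
    # +1 smoothing so unseen words don't divide by zero
    return math.log(N / (1 + df[word.lower()]))

headline = "Prime minister announces new sports funding"
print(sorted(headline.lower().split(), key=idf, reverse=True))
```

A larger, more current alternative would be to build the document-frequency table from a news dataset such as a year of real headlines, since the corpus should match the register of the text being scored.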
