How to remove irrelevant text data from a large dataset [closed] - machine-learning

I am working on an ML project where the data come from social media and the topic is supposed to be depression under Covid-19. However, when I read some of the retrieved data, I noticed that although a small fraction of the texts (around 1-5%) mention covid-related keywords, they are not actually about the pandemic: they tell a life story (from age 5 to age 27) instead of describing how covid affects the writer's life.
The data I want to use, and am looking for, are texts that describe how covid makes depression worse and so on.
Is there a general way to clean out these irrelevant texts whose context is not covid-related (outliers)?
Or is it OK to keep them in the dataset, since they only account for 1-5% of it?

I think what you want is Topic Modeling, or perhaps a TextRank algorithm, or at least something along those lines. Check out the link below for some ideas of where to go with this.
https://monkeylearn.com/keyword-extraction/
There are numerous weaknesses with the bag-of-words model, especially when applied to natural language processing tasks, that graph-based ranking algorithms such as TextRank are able to address; in particular, TextRank can incorporate word-sequence information. Bag of words simply refers to a matrix in which the rows are documents and the columns are words, and the value matching a document with a word can be a count of that word's occurrences within the document or a tf-idf weight. The bag-of-words matrix is then provided to a machine learning algorithm. Using word counts or tf-idf alone, we can only identify key single-word terms in a document.
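For instance, here is a minimal sketch of building such a matrix with scikit-learn (the library choice and the toy documents are my own, not from the linked article):

# Minimal sketch (assumption: scikit-learn): build a tf-idf bag-of-words matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "covid lockdown made my depression much worse",
    "my life story from age five to twenty seven",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)           # rows = documents, columns = words
print(vectorizer.get_feature_names_out())    # the vocabulary (column labels)
print(X.toarray())                           # tf-idf weight per document/word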
Also, see the link below.
https://towardsdatascience.com/topic-modeling-quora-questions-with-lda-nmf-aff8dce5e1dd
You can find the accompanying sample data used in the example in that link, directly below.
https://raw.githubusercontent.com/susanli2016/NLP-with-Python/master/data/quora_sample.csv
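If you want to try this on your own data, here is a rough sketch of LDA topic modeling with scikit-learn (my own toy corpus and untuned parameters, not from the linked article); the idea is to inspect each document's dominant topic and drop documents whose dominant topic is not covid/depression related:

# Rough sketch (assumption: scikit-learn, toy data, untuned parameters).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "covid lockdown made my depression and anxiety so much worse",
    "since the pandemic started I cannot sleep and feel hopeless",
    "my life story from age five to twenty seven, school and family",
    "covid was mentioned once but this post is really about my childhood",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)            # shape: (n_documents, n_topics)

terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    print("topic", topic_idx, [terms[i] for i in weights.argsort()[-5:]])

# Dominant topic per document; drop documents whose dominant topic is off-topic.
print(doc_topics.argmax(axis=1))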

Related

Is there a machine learning or NLP model to separate questions and answers in raw text?

I have some raw text that has questions and answers in it. I would like to identify which parts of the text are questions and which parts are the answers. This seems like it would be easy, but the questions aren't necessarily terminated with question marks. The only thing I know for sure is that after a question is over the answer begins, and after the answer is over another question begins, but there is no consistent format on how many \n are included in the answers. A question is definitely its own paragraph though.
I'm hoping for some sort of pre-trained model for this?
One possibility would be to take some existing data, manually tag each paragraph as q vs a, and then use Google's Universal Sentence Encoder on each paragraph to get the 512-dimensional output and use that as the input to train a neural net or some other classification model on the labeled data. I'm hoping to avoid this path because I don't want to manually tag a few thousand paragraphs, and after all that work, who knows whether the model will achieve a decent classification error.
Another possibility is to use something like gpt3: feed it the entire text and just ask it what are the questions/requests. The problem with this is that the gpt3 api is still a bit sandboxed. I tried a sample on the gpt3 playground and it only identified 80% of the questions.
Any other suggestions?
To give you an idea, the text may look like this:
What is the name of the company?
We are Acme Inc.
How many employees are there.
There are 50 employees.
Describe a day in the life of an employee.
An employee arrives at 9am.
Then they go to the factory and make widgets for 4 hours. After making widgets they eat lunch and then go to the QA engineer to make sure their widgets are good enough.
After QA, they write a report about how many widgets they made.
Most employees leave around 5pm.
List the pay range of your employees.
The starting salary is $22/hour.
After 1 year pay increases to $25 an hour and then increases 3% per year.
Contact information:
Acme Inc
123 Main Street
Anyplace, USA
Based on the description and the text sample that you provided, I would split this problem into two parts:
How to split the whole text into sentences
How to "classify" whether a sentence (or paragraph) is a question or an answer
I tried solving this problem using a heuristics-based approach with spaCy (you can use other libraries).
You can use this technique directly, or use it to build a weakly supervised dataset that you can train a classification model on (try skweak).
Sentence Detection
This is the easy part: all you have to do is follow the details at this link: https://spacy.io/usage/linguistic-features#sbd
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Hi, I'm sentence number 1. Hi, I'm sentence number 2.")
for sent in doc.sents:
    print(sent.text)
# Hi, I'm sentence number 1.
# Hi, I'm sentence number 2.
Question or Answer
From the sample that you shared, I can see that you want to detect questions and also imperative phrases:
Questions: How many employees are there. To detect this type of question, you can use spaCy's tag_ attribute and look for WH tokens (who, where, etc.). You could use the same logic to find subject-verb inversions, for example.
Imperative phrases: List the pay range of your employees. To detect these cases, you can check whether the first token of a sentence is a verb.
Here's a small example that you can follow:
def is_question(sent):
    d = nlp(sent)
    token = d[0]  # first token in the sentence
    # Imperative phrase: the sentence starts with a root verb.
    if token.pos_ == "VERB" and token.dep_ == "ROOT":
        return True
    # Question: the sentence contains a WH token (who, what, where, ...).
    for token in d:
        if token.tag_ in ("WDT", "WP", "WP$", "WRB"):
            return True
    return False

doc = nlp(text)
for sent in doc.sents:
    print(sent.text.strip())
    if is_question(sent.text.strip()):
        print("is question")
    else:
        print("not a question")
    print("***")

# What is the name of the company?
# is question
# ***
# We are Acme Inc.
# not a question
# ***
# How many employees are there.
# is question
# ***
# There are 50 employees.
# not a question
You can apply this function on a large corpus and get a weakly annotated dataset that you can use to train a classifier or you can just use the function. But...
Beware!
This is a heuristic-based approach, so not all of the results will be correct. For example: "What a beautiful day!"
The sentence starts with a WH token but it's not a question. You could fix this by checking whether it ends with a question mark, but in your corpus questions don't always end with a question mark.
A possible solution would be to apply this to a corpus and then manually filter out these outliers.

Find the most similar terms from a list of given terms in a huge text corpora [closed]

I have a list of 2 million podcast names. I also have a huge text corpus scraped from a subreddit (posts, comments, threads, etc.) where the podcasts from our list are mentioned a lot by the users. The task I'm trying to solve is to count the number of mentions of each name in the corpus; in other words, to generate a dictionary of (name: count) pairs.
The challenge here is that most of these podcast names are several words long, for example "Utah's Noon News" or "Congress Hears Tech Policy Debates". However, the mentions that Reddit users make are often a crude substring of the original name, for example "Utah Noon" / "Utah New" or "Congress Tech Debates" / "Congress Hears Tech". This makes identifying names from the list quite difficult.
What I've Tried:
First, I processed and concatenated all the words in each original podcast name into a single word. For instance,
"Congress Hears Tech Policy Debates" -> "congresshearstechpolicydebates"
As I traversed the subreddit corpus, whenever I found a named entity or a potential podcast name, I processed its words in the same way,
"Congress Hears Tech" (assuming this is what I found in the corpus) -> "congresshearstech"
I compared this "congresshearstech" string to all the processed names in the podcast list, using scores calculated from word-spelling similarity. I did this with Python's difflib library; there are also similarity scores like Levenshtein and Hamming distance. Finally, I picked the podcast name with the maximum similarity score to the string found in the corpus.
My problem:
The thing is, the above strategy does in fact work accurately. However, it's far too slow to run over the entire corpus, and my list of names is very long. Can anyone suggest a faster algorithm or data structure for comparing so many names against such a huge corpus? Is a deep learning based approach possible here, something like training an LSTM on the 2 million podcast names, so that whenever a possible name is encountered the trained model can output the closest-spelled podcast name from our list?
You may be able to use something like tf-idf and cosine similarity to solve this problem. I'm not familiar with any approach to use machine learning that would be helpful here.
This article gives a more detailed description of the process and links to some useful libraries. You should also read this article which describes a somewhat similar project to yours and includes information on improving performance. I'll describe the method as I understand it here.
tf-idf is an acronym meaning "term frequency inverse document frequency". Essentially, you look at a subset of text and find the frequency of the terms in your subset relative to the frequency of those terms in the entire corpus of text. Terms that are common in your subset and in the corpus as a whole will have a low value, whereas terms that are common in your subset but rare in the corpus would have a high value.
If you can compute the tf-idf for a "document" (or subset of text) you can turn a subset of text into a vector of tf-idf values. Once you have this vector you can use it to compute the cosine-similarity of your text subset with other subsets. Say, find the similarity of an excerpt from reddit with all of your titles. (There is a way to manage this so you aren't continuously checking each reddit excerpt against literally every title - see this post).
Once you can do this then I think the solution is to pick some value n, and scan through the reddit posts n words at a time doing the tf-idf / cosine similarity scan on your titles and marking matches when the cosine-similarity is higher than a certain value (you'll need to experiment with this to find what gives you a good result). Then, you decrement n and repeat until n is 0.
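As a rough illustration of that pipeline (my own sketch with scikit-learn, not from the linked articles; character n-grams are my own choice because they are forgiving of crude substrings):

# Rough sketch (assumption: scikit-learn; thresholds and n-gram ranges are untuned).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = ["Utah's Noon News", "Congress Hears Tech Policy Debates"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
title_vectors = vectorizer.fit_transform(titles)    # vectorize all titles once

window = "congress hears tech"                      # an n-word window from a reddit post
scores = cosine_similarity(vectorizer.transform([window]), title_vectors)[0]
best = scores.argmax()
if scores[best] > 0.5:                              # cutoff needs experimentation
    print(titles[best], scores[best])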
If exact text matching (with or without your whitespace removal preprocessing) is sufficient, consider the Aho-Corasick string matching algorithm for detecting substring matches (i.e. the podcast names) in a body of text (i.e. the subreddit content). There are many implementations of this algorithm for python, but ahocorapy has a good readme that summarizes how to use it on a dataset.
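A minimal sketch of the exact-match route, based on ahocorapy's readme (double-check the API against the package documentation):

from ahocorapy.keywordtree import KeywordTree

# Build the trie once over all (preprocessed) podcast names.
kwtree = KeywordTree(case_insensitive=True)
for name in ["Utah's Noon News", "Congress Hears Tech Policy Debates"]:
    kwtree.add(name)
kwtree.finalize()

text = "I was listening to Congress Hears Tech Policy Debates yesterday."
for keyword, start_index in kwtree.search_all(text):
    print(keyword, start_index)   # each exact occurrence and its position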
If fuzzy matching is a requirement (also matching when the mention text of the podcast name is not an exact match), then consider a fuzzy string matching library like thefuzz (aka fuzzywuzzy) if per query-document operations offer sufficient performance. Another approach is to precompute n-grams from the substrings and accumulate the support counts across all n-grams for each document as the fuzzyset package does.
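For the fuzzy route, a minimal sketch with thefuzz (my own example; the scorer choice and cutoff are arbitrary and would need tuning):

from thefuzz import fuzz, process

names = ["Utah's Noon News", "Congress Hears Tech Policy Debates"]
mention = "Congress Tech Debates"

# Score the mention against every name and keep the best match above a cutoff.
best_name, score = process.extractOne(mention, names, scorer=fuzz.token_set_ratio)
if score >= 80:
    print(best_name, score)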
If additional information about the podcasts is available in a knowledge base (i.e. more than just the name is known), then the problem is more like the general NLP task of entity linking but to a custom knowledge base (i.e. the podcast list). This is an area of active research and state of the art methods are discussed on NLP Progress here.

Can Word2Vec be used for information extraction?

I am using Gensim to train Word2Vec. I know word similarities are determined by whether the words can replace each other and still make sense in a sentence. But can word similarities be used to extract relationships between entities?
Example:
I have a bunch of interview documents, and in each interview the interviewee always says the name of their manager. If I wanted to extract the name of the manager from these interview transcripts, could I just get a list of all human names in the document (using NLP) and take the name that is most similar to the word "manager" according to Word2Vec as most likely being the manager?
Does this thought process make any sense with Word2Vec? If it doesn't, would the ML solution to this problem then be to input my word embeddings into a sequence to sequence model?
Yes, word-vector similarities & relative-arrangements can indicate relationships.
In the original Word2Vec paper, this was demonstrated by using word-vectors to solve word-analogies. The most famous example involves the analogy "'man' is to 'king' as 'woman' is to ?".
By starting with the word-vector for 'king', then subtracting the vector for 'man', and adding the vector for 'woman', you arrive at a new point in the coordinate system. And then, if you look for other words close to that new point, often the closest word will be queen. Essentially, the directions & distances have helped find a word that's related in a particular way – a gender-reversed equivalent.
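For example, with gensim and a set of pre-trained vectors (my own illustration; you would use your own trained model instead):

import gensim.downloader as api

# Pre-trained GloVe vectors (an assumption for illustration; downloads ~130 MB).
# Any trained Word2Vec model's KeyedVectors exposes the same methods.
wv = api.load("glove-wiki-gigaword-100")

# 'king' - 'man' + 'woman' lands near 'queen'
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))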
And, in large news-based corpuses, famous names like 'Obama' or 'Bush' do wind up with vectors closer to their well-known job titles like 'president'. (There will be many contexts in such corpuses where the words appear immediately together – "President Obama today signed…" – or simply in similar roles – "The President appointed…" or "Obama appointed…", etc.)
However, I suspect that's less-likely to work with your 'manager' interview-transcripts example. Achieving meaningful word-to-word arrangements depends on lots of varied examples of the words in shared usage contexts. Strong vectors require large corpuses of millions to billions of words. So the transcripts with a single manager wouldn't likely be enough to get a good model – you'd need transcripts across many managers.
And in such a corpus each manager's name might not be strongly associated with just manager-like contexts. The same name(s) will be repeated when other roles are mentioned as well, and transcripts may not refer to managerial action in the helpful third-person ways that would position specific name-vectors well. (That is, there won't be clean expository statements like "John_Smith called a staff meeting" or "John_Smith cancelled the project" alongside others like "…manager John_Smith…" or "The manager cancelled the project".)

Summarization of simple Q&A

Is there a way to generate a one-sentence summarization of Q&A pairs?
For example, provided:
Q: What is the color of the car?
A: Red
I want to generate a summary as
The color of the car is red
Or, given
Q: Are you a man?
A: Yes
to
Yes, I am a man.
which accounts for both question and answer.
What would be some of the most reasonable ways to do this?
I had to once work on solving the opposite problem, i.e. generating questions out of sentences from Wikipedia articles.
I used the Stanford Parser to generate parse trees out of all possible sentences in my training dataset.
e.g.
Go to http://nlp.stanford.edu:8080/parser/index.jsp
Enter "The color of the car is red." and click "Parse".
Then look at the Parse section of the response. The first layer of that sentence is NP VP (noun phrase followed by a verb phrase).
The second layer is NP PP VBZ ADJP.
I basically collected these patterns across thousands of sentences, sorted them by how common each pattern was, and then figured out how best to modify the parse tree to convert each sentence into a different wh-question (What, Who, When, Where, Why, etc.).
You could easily do something very similar. Study the parse trees of all of your training data and figure out what patterns you could extract to get your work done. In many cases, just replacing the wh-word in the question with the answer would give you a valid, albeit somewhat awkwardly phrased, sentence.
e.g. "Red is the color of the car."
In the case of questions like "Are you a man?" (i.e. the primary verb is something like 'are', 'can', 'should', etc.), swapping the first two words usually does the trick: "You are a man?"
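A toy sketch of the "replace the wh-word with the answer" idea, using spaCy tags (my own illustration; it only handles the simplest patterns and ignores word order):

import spacy

nlp = spacy.load("en_core_web_sm")

def naive_summarize(question, answer):
    tokens = []
    for token in nlp(question):
        if token.tag_ in ("WDT", "WP", "WP$", "WRB"):
            tokens.append(answer)       # substitute the answer for the wh-word
        elif token.text == "?":
            tokens.append(".")          # turn the question into a statement
        else:
            tokens.append(token.text)
    return " ".join(tokens)

print(naive_summarize("What is the color of the car?", "Red"))
# Red is the color of the car .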
I don't know of any NLP task that explicitly handles your requirement.
Broadly, there are two kinds of questions. Questions that expect a passage as the answer, such as definition or explanation questions: What is Ebola fever? The second type are fill-in-the-blank questions, referred to as factoid questions in the literature, such as What is the height of Mt. Everest? It is not clear what kind of question you would like to summarize; I am assuming you are interested in factoid questions, as your examples only involve them.
A very similar problem arises in the task of Question Answering, where one of the first stages is query generation. In the paper "An Exploration of the Principles Underlying Redundancy-Based Factoid Question Answering" (Jimmy Lin, 2007), the author claims that better performance can be achieved by reformulating the query (see section 4.1) into the form more likely to appear in free text. Let me copy some of the examples discussed in the paper.
1. What year did Alaska become a state?
2. Alaska became a state ?x
1. Who was the first person to run the mile in less than four minutes?
2. The first person to run the mile in less than four minutes was ?x
In the above examples, the query in 1 is reformulated into 2. As you might have already observed, ?x is the blank that should be filled by the answer. This reformulation is carried out through a dozen hand-written rules built into the software tool discussed in the paper, ARANEA. All you have to do is find the tool and use it; the paper is a good ten years old though, so I cannot promise you anything :)
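To give a flavour of what such hand-written reformulation rules can look like, here is a toy sketch (my own, not ARANEA's actual rules):

import re

# Two toy rewrite rules in the spirit of the paper's examples; a real system
# would use a dozen or more hand-written patterns.
RULES = [
    (re.compile(r"^Who was (.+)\?$", re.IGNORECASE), r"\1 was ?x"),
    (re.compile(r"^What is (.+)\?$", re.IGNORECASE), r"\1 is ?x"),
]

def reformulate(question):
    for pattern, template in RULES:
        if pattern.match(question):
            return pattern.sub(template, question)
    return question   # no rule matched; fall back to the original query

print(reformulate("Who was the first person to run the mile in less than four minutes?"))
# the first person to run the mile in less than four minutes was ?x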
Hope this helps.

What are some good methods to find the "relatedness" of two bodies of text?

Here's the problem -- I have a few thousand small text snippets, anywhere from a few words to a few sentences; the largest snippet is about 2k on disk. I want to be able to compare each snippet to every other one and calculate a relatedness factor so that I can show users related information.
What are some good ways to do this? Are there known algorithms for doing this that are any good? Are there any GPL'd solutions, etc.?
I don't need this to run in realtime, as I can precalculate everything. I'm more concerned with getting good results than runtime.
I just thought I would ask the Stack Overflow community before going and writing my own thing. There HAVE to be people out there who have found good solutions to this before.
These articles on semantic relatedness and semantic similarity may be helpful. And this SO question about Latent Semantic Analysis.
You could also look into Soundex for words that "sound alike" phonetically.
I've never used it, but you might want to look into Levenshtein distance
Jeff talked about something like this on the podcast, about how the Related questions listed on the right side here are found (in podcast #32).
One big tip was to remove all common words, like "the" "and" "this" etc. This will leave you with more meaningful words to compare.
And here is a similar question Is there an algorithm that tells the semantic similarity of two phrases
This is quite doable for reasonably large texts, but harder for smaller ones.
I did it once like this, and it worked pretty well:
Filter out all "general" words (like a, an, the, in, etc.); this removes about 10-30% of the words.
Count the frequencies of the remaining words and store the top x most frequent words; these are your topics.
As an extra step you can create groups of 2/3/4 subsequent words and compare them with the groups in other texts. I used it as a measure for plagiarism.
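A minimal sketch of that recipe (my own illustration; the stop-word list, k, and the weighting are placeholders):

from collections import Counter

STOPWORDS = {"a", "an", "the", "in", "of", "and", "is", "to", "on", "this"}  # tiny placeholder list

def content_words(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def top_terms(text, k=10):
    return {w for w, _ in Counter(content_words(text)).most_common(k)}  # the text's "topics"

def ngrams(text, n=3):
    words = content_words(text)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def relatedness(a, b, k=10):
    term_overlap = len(top_terms(a, k) & top_terms(b, k)) / k
    gram_overlap = len(ngrams(a) & ngrams(b)) / max(1, min(len(ngrams(a)), len(ngrams(b))))
    return term_overlap + gram_overlap

print(relatedness("the cat sat on the mat near the cat flap",
                  "a cat sat on a mat by the door"))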
See Manning and Raghavan course notes about MinHashing and searching for similar items, and a C#(?) version. I believe the techniques come from Ullman and Motwani's research.
This book may be relevant.
Edit: here is a related SO question
Phonetic algorithms
The article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install and use the SimMetrics library into SQL Server. This library lets you find relative similarity between strings and includes numerous algorithms.
I ended up mostly using Jaro Winkler to match on names. Here's more information where I asked about matching names on SO: Matching records based on Person Name
A few algorithms based on Levenshtein distance are also available in the SimMetrics library and would probably be useful in your application.
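If you are working in Python rather than SQL Server, comparable metrics are available in, for example, the jellyfish package (my own substitution, not part of the original answer):

import jellyfish

a, b = "Jellyfish", "Smellyfish"
print(jellyfish.levenshtein_distance(a, b))         # edit distance
print(jellyfish.jaro_winkler_similarity(a, b))      # 0..1, favours shared prefixes
                                                    # (named jaro_winkler in older versions)
print(jellyfish.soundex("Robert"), jellyfish.soundex("Rupert"))  # both R163: they "sound alike"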
