I am working on an NLP project in which I have a list of emails, all related to appreciation. I am trying to determine from the email content who is being appreciated. This will, in turn, help the organization in our performance evaluation program.
Apart from identifying who is being appreciated, I am also trying to identify the type of work the person has done and score it. I am using Apache OpenNLP (maximum entropy / logistic regression) to classify the emails, and some heuristics to identify the person being appreciated.
The approach for person identification is as follows:
Determine if an email is related to appreciation
Get the list of people in the "To:" list
Check if that person is being referred to in the email
Tag that person as the receiver of appreciation
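The steps above can be sketched in Python like this (the matching rule, names, and the stubbed classifier are illustrative; in practice the classifier would be the OpenNLP model):

```python
import re

def find_appreciated(email_body, to_list, is_appreciation):
    """Tag 'To:' recipients who are also mentioned in the body.

    to_list: list of (display_name, address) pairs.
    is_appreciation: a classifier function, e.g. a maxent model.
    """
    if not is_appreciation(email_body):
        return []
    receivers = []
    for name, address in to_list:
        # Match the recipient's first name as a whole word, case-insensitively.
        first = name.split()[0]
        if re.search(r"\b" + re.escape(first) + r"\b", email_body, re.IGNORECASE):
            receivers.append(address)
    return receivers

# Toy example
body = "Great job, Alice! The launch went smoothly thanks to your work."
to = [("Alice Smith", "alice@corp.com"), ("Bob Jones", "bob@corp.com")]
print(find_appreciated(body, to, lambda text: True))
```

As the question notes below, this breaks as soon as an email mentions people who are not the ones being appreciated.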
However, this approach is very simple and does not work for the more complex emails we generally see. An email can mention many addresses or people who are not the receivers of the appreciation. The context around each person mention is not available, and hence the accuracy is not very good.
I am thinking of using an HMM and word2vec to solve the person-identification issue. I would appreciate it if anyone who has come across this problem has any suggestions.
Use the tm package for R, and use tf-idf (term frequency - inverse document frequency) to determine who is being appreciated.
I'm suggesting this because, from what I can read, this is an unsupervised learning problem (you don't know in advance who is being appreciated). So you have to describe the documents' (emails') content, and that formula (tf-idf) will tell you which words are used most in a particular document but are rarely used in all the others.
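The same tf-idf idea, shown in Python with scikit-learn rather than R's tm (the emails are made up; terms unique to one email get the highest weights):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

emails = [
    "thanks alice for the great demo",
    "the quarterly report is attached",
    "meeting moved to friday",
]
vec = TfidfVectorizer()
tfidf = vec.fit_transform(emails)

# Terms in column order of the tf-idf matrix.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Highest-weighted terms in the first email: frequent there, rare elsewhere.
row = tfidf[0].toarray().ravel()
top = sorted(zip(terms, row), key=lambda t: -t[1])[:3]
print(top)
```

Note that tf-idf surfaces distinctive words, not necessarily person names, so this is at best a rough signal for who is being appreciated.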
One way to solve this problem is through the use of Named Entity Recognition. You can run something like Stanford NER over the text, which will help you recognize all the person names mentioned in the email, and then use a rule-based chunker such as Stanford TokensRegex to extract sentences where person names and appreciation words are mentioned.
The best way to solve this is to treat it as a supervised learning problem. You will then need to annotate a bunch of training data with entities, expression phrases, and the relations between them. Then you can use the Stanford Relation Extractor to extract the appropriate relations.
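A rough Python stand-in for the NER-plus-rules pipeline (the name list and cue words below replace a real NER model and TokensRegex rules, which you would use in practice):

```python
import re

# Stand-ins: KNOWN_PEOPLE would come from a real NER model such as
# Stanford NER; CUES would be a proper TokensRegex-style rule set.
KNOWN_PEOPLE = {"Alice", "Bob"}
CUES = {"thanks", "kudos", "great", "appreciate"}

def appreciation_pairs(text):
    """Return (person, sentence) pairs where a name co-occurs with a cue word."""
    pairs = []
    for sent in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"\w+", sent.lower()))
        if words & CUES:
            for person in KNOWN_PEOPLE:
                if person in sent:
                    pairs.append((person, sent))
    return pairs

text = "Kudos to Alice for the migration. Bob will send the minutes."
print(appreciation_pairs(text))
```

Sentence-level co-occurrence is crude; the supervised relation-extraction route handles cases where the name and the appreciation phrase are further apart.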
Related
I am trying to solve a problem where I'm identifying entities in articles (ex: names of cars), and trying to predict sentiment about each car within the article. For that, I need to extract the text relevant to each entity from within the article.
Currently, the approach I am using is as follows:
If a sentence contains only 1 entity, tag the sentence as text for that entity
If sentence has more than 1 entity, ignore it
If sentence contains no entity, tag as a sentence for previously identified entity
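The three rules above can be sketched in Python like this (the entity list is illustrative; the example is the one from the question and reproduces its failure mode):

```python
import re

ENTITIES = {"Honda Civic", "Ford Focus"}  # illustrative entity list

def tag_sentences(article):
    """Rule 1: single entity -> tag the sentence with it.
    Rule 2: multiple entities -> ignore the sentence.
    Rule 3: no entity -> carry over the last tagged entity."""
    tagged = []
    current = None
    for sent in re.split(r"(?<=[.!?])\s+", article):
        found = [e for e in sorted(ENTITIES) if e in sent]
        if len(found) == 1:
            current = found[0]
            tagged.append((current, sent))
        elif len(found) == 0 and current is not None:
            tagged.append((current, sent))
        # more than one entity: skipped
    return tagged

article = ("Lets talk about the Honda Civic. The car was great, but failed "
           "in comparison to the Ford Focus. The car also has good economy.")
# The last sentence gets tagged Ford Focus even if "The car" refers to the
# Civic - exactly the wrong result described in the question.
print(tag_sentences(article))
```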
However, this approach is not yielding accurate results, even if we assume that our sentiment classification is working.
Is there any method that the community may have come across that can solve this problem?
The approach fails in many cases and gives wrong results. For example, suppose the text is: 'Let's talk about the Honda Civic. The car was great, but failed in comparison to the Ford Focus. The car also has good economy.'
Here, the program would pick up Ford Focus as the entity for the last two sentences and tag them for it, even though "The car" in the last sentence refers back to the Honda Civic.
I am using NLTK for descriptive-word tagging, and scikit-learn for classification (linear SVM model).
If anyone could point me in the right direction, it would be greatly appreciated. Is there some classifier I could build with custom features that can detect this type of text if I were to manually tag, say, 50 articles and the text in them?
Thanks in advance!
The problem that I'm facing is:
I want to read a document, get the raw string of this document, and classify the information.
For example, I want to identify when the string is a "name" or a "date", or some other useful piece of information.
Is it possible to use machine learning to do that?
How may I approach the problem?
The hardest part here is that I'm not trying to classify the document itself, but the string information inside the document.
So it's all about how you think about your problem. I think your problem can be formulated as an entity extraction/recognition problem, where you have a document and want to identify specific entities within the text (where an entity might be a person, date, etc). Take a look at Conditional Random Fields and their applications to named entity recognition (NER for short), as there are some libraries & tools already implemented.
For example, check out StanfordNER.
I have a Machine Learning project where, given a group of users' reactions to a collection of online articles (expressed as like/dislike), I need to make a decision for a newly arrived article.
The task is, given each individual's reactions, to predict whether the newly arrived article should be recommended to the community as a whole.
I have been wondering how I am supposed to incorporate each user's feedback to decide whether this would be an interesting article to recommend.
Bearing in mind that some users will like and others dislike the same article, is there a way to incorporate all this information and reach a conclusion about the article?
Thank you in advance.
There are a lot of different ways to define "interesting." I think Reddit has a pretty good model to look at when considering different options: it has different categories, like "hot", "controversial", etc.
So a couple options depending on what you/your professor want:
Take the net number of likes (like = +1, dislike = -1)
Take just the number of likes
Take the total number of ratings (who's read it at all)
Take the ones with the highest percentage of likes vs. dislikes
Some combination of these things
Etc.
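A quick sketch of a few of these scoring options in Python (the like/dislike counts are invented), showing that different notions of "interesting" can rank the same articles differently:

```python
def net_score(likes, dislikes):
    """Net number of likes: like = +1, dislike = -1."""
    return likes - dislikes

def total_ratings(likes, dislikes):
    """Total engagement: everyone who rated at all."""
    return likes + dislikes

def like_ratio(likes, dislikes):
    """Fraction of ratings that are likes."""
    total = likes + dislikes
    return likes / total if total else 0.0

# article -> (likes, dislikes), invented for illustration
articles = {"a1": (120, 30), "a2": (45, 5), "a3": (200, 190)}
by_net = max(articles, key=lambda a: net_score(*articles[a]))
by_ratio = max(articles, key=lambda a: like_ratio(*articles[a]))
print(by_net, by_ratio)  # different winners under different metrics
```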
So there are a lot of different things you could try. Maybe try a few and see which produce results most like what you want?
As for predicting how a new article compares to the articles you already have information about, that's a much broader question. I don't think it's what you're asking here, though it does seem to be what the Machine Learning project is about.
I am not sure if recommending an article in this way is good, but if this is what your requirement is, then let me suggest an approach.
Approach:
First, for every article, assign a label (like/dislike) based on the number of likes and dislikes. Now you have a set of articles with like/dislike labels. Based on this data, you need to identify whether a new article's label is like or dislike. This is a simple linear classification problem, which can be solved using any of the open-source ML frameworks.
Let us say we have:
- n number of users in the Group
- m number of articles
sample data
user1 article1 like
user1 article2 dislike
user2 article3 dislike
....
usern articlem like
Implementation:
for each article
    count the number of likes
    count the number of dislikes
    if no. of likes > no. of dislikes,
        label = like
    else
        label = dislike
Give this input (articles with labels) to a Naive Bayes (or any other) classifier to build a model.
Use this model to classify the new article.
Output: like/dislike; if you get like, recommend the article.
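A sketch of this approach in Python with scikit-learn. The reactions and article texts below are invented, and since the answer leaves the article features unspecified, using the article text as the feature is an assumption:

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# (user, article, reaction) triples, as in the sample data above.
reactions = [
    ("user1", "a1", "like"), ("user2", "a1", "like"), ("user3", "a1", "dislike"),
    ("user1", "a2", "dislike"), ("user2", "a2", "dislike"),
]

# Step 1: majority-vote label per article.
votes = {}
for _, art, r in reactions:
    votes.setdefault(art, Counter())[r] += 1
labels = {a: ("like" if c["like"] > c["dislike"] else "dislike")
          for a, c in votes.items()}

# Step 2: train a Naive Bayes classifier on the labelled articles.
# (Article texts are made up; any feature representation would do here.)
texts = {"a1": "funny cat pictures compilation",
         "a2": "quarterly tax filing deadlines"}
arts = sorted(labels)
vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform([texts[a] for a in arts]),
                          [labels[a] for a in arts])

# Step 3: classify a new article; recommend it if the prediction is "like".
new = vec.transform(["new cat pictures thread"])
print(clf.predict(new)[0])
```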
Known Issues:
1. What if half of the users like the article and the other half dislike it? Will you consider it a like or a dislike?
2. What if 11 users dislike it and 10 users like it? Is it okay to consider this a dislike?
Such questions should be answered by you or your client as part of requirement clarification.
Recently I've been working on my course project: an Android app that can automatically help fill in an expense form based on the user's voice. So here is one sample sentence:
So what I want to do is let the app fill forms automatically. My forms have several fields: time (yesterday), location (MacDonald), cost (10 dollars), type (food). The "type" field can include food, shopping, transport, etc.
I have used a word-splitting library to split the sentence into several parts and parse it, so I can already extract the time, location, and cost fields from the user's voice.
What I want to do is deduce the "type" field with some kind of machine learning model. So there should be some records in advance, input manually by the user, to train the model. After training, when a new record comes in, I first extract the time, location, and cost fields, and then infer the type field using the model.
But I don't know how to represent the location field. Should I use a dictionary of many well-known locations and use an index to represent the location? If so, which kind of machine learning method should I use to model this requirement?
I would start with the Naive Bayes classifier. The links below should be useful in understanding it:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
http://cs229.stanford.edu/notes/cs229-notes2.pdf
http://scikit-learn.org/stable/modules/naive_bayes.html
I wonder if time and cost are that discriminative/informative in comparison to location for your task.
In general, look at the following link on working with text data (it should be useful even if you don't know Python):
http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html
It should include three stages:
Feature Representation:
One way to represent the features is the bag-of-words representation, in which you fix an order for the dictionary and use a word-frequency vector to represent each document. See https://en.wikipedia.org/wiki/Bag-of-words_model for details.
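A tiny illustration of that fixed-dictionary counting in Python (the vocabulary is made up):

```python
# A fixed dictionary order turns each text into a count vector:
# position i holds how often vocab[i] occurs in the text.
vocab = ["mcdonald", "dinner", "bus", "mall"]  # illustrative dictionary

def bow(text):
    words = text.lower().split()
    return [words.count(v) for v in vocab]

print(bow("dinner at mcdonald"))  # [1, 1, 0, 0]
```

Words outside the dictionary (like "at" here) are simply dropped, which is why the dictionary should be built from representative data.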
Data and Label Collection:
Basically, in this stage, you should prepare some [feature]-[type] pairs to train your model, which can be tedious or expensive. If you have already published your app and collected a lot of [sentence]-[type] pairs (probably chosen by app users), you can extract the features and build a training set.
Model Learning:
Cdeepakroy has suggested a good choice of model: Naive Bayes, which is very efficient for classification tasks like this. At this stage, you can just find a suitable package, feed in your training data, and enjoy the classifier it returns.
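Putting the three stages together, a sketch in Python with scikit-learn (the past records and their type labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical past records: location text -> type label.
locations = ["mcdonald", "walmart", "subway station", "kfc", "grocery store"]
types     = ["food", "shopping", "transport", "food", "shopping"]

# Bag-of-words features + Naive Bayes, the stages described above.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(locations, types)

# A new record's location field, classified into a type.
print(model.predict(["kfc downtown"])[0])
```

Representing the location as text this way avoids maintaining an explicit index of famous places: unseen words are ignored, and known words carry the signal.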
I am working on a problem where I need to cluster search phrases based on what they are looking for (for now, let's assume they are looking only for places, such as a bookstore, supermarket, ...).
"Where can I find a cheesecake ?"
could get clustered probabilistically to 'desserts', 'restaurants', ...
"Where can I buy groceries ?"
could get clustered probabilistically to 'supermarkets', 'vegetables', ...
Assume, to begin with, that the set of categories the search phrases could be classified into already exists.
I looked into topic modeling, but I feel like I might be heading in the wrong direction. Any suggestions on how to get started, or on what to look into, would be highly helpful.
Thanks a lot.
Topic modelling certainly provides one possible solution. Induce a topic model from a large corpus, as representative as possible of the texts you're indexing and searching with. Then represent each query as the posterior over the topics given the query. If you want to obtain a clustering of queries, you could then do so on this reduced set, or if you're doing IR you could use the resulting vectors instead of the original bag of words.
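A sketch of that pipeline in Python with scikit-learn (the corpus and queries are toy examples; a real topic model needs a much larger, representative corpus):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny stand-in for a large representative corpus.
corpus = [
    "cheesecake dessert bakery cake sweet",
    "grocery supermarket vegetables fruit store",
    "dessert cake bakery sweet pastry",
    "supermarket store grocery buy food",
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Represent each query as its posterior distribution over topics.
queries = ["where can i find a cheesecake", "where can i buy groceries"]
posteriors = lda.transform(vec.transform(queries))
print(posteriors.shape)  # one topic distribution per query
```

Each query is now a low-dimensional topic vector; you could cluster these vectors (e.g. with k-means) or use them in place of the original bag-of-words for retrieval, as described above.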
If this isn't what you want, can you elaborate on the problem? What do you hope to do with the clustered queries?