I have a dataset of customer messages and their final categories; an example is the following:
key  message                                                    final category
1    i want customer care no i want to talk with ur team        other
2    hi I 9986443603cjhh had qkuiv1uhqllljqvocally q illgi vq   noclass
3    hai points not coming                                      checking
The dataset is a huge file with at least 20 final category types. Please suggest an appropriate method to classify the data, i.e. given a message, predict its final category. I am thinking of building a feature vector from the message words and feeding it into a Bayesian classifier; would that be a good approach, or do I have to use another technique?
Thanks a lot.
You can consider word embeddings.
You can download pre-trained embeddings from here (that link is for GloVe; you can alternatively use word2vec).
The idea is that similar words will have similar vectors.
After you convert each word in your message to a vector, you can average all the vectors (or take a TF-IDF-weighted average, for better results) to get a vector representation of your message.
Of course, words like qkuiv1uhqllljqvocally will not appear in the vocabulary.
To check your results, you can cluster all your vectors (using 20-means clustering, if you have 20 classes) to see whether similar messages fall into the same group.
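Here is a minimal sketch of the averaging idea; the glove dict, the messages list, and the 100-dimensional vector size are assumptions that depend on which embedding file you download:

import numpy as np
from sklearn.cluster import KMeans

def message_vector(message, glove, dim=100):
    # average the vectors of all in-vocabulary words; OOV tokens are skipped
    vecs = [glove[w] for w in message.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# sanity check via clustering: with 20 categories, use 20 clusters
# vectors = np.vstack([message_vector(m, glove) for m in messages])
# labels = KMeans(n_clusters=20).fit_predict(vectors)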
I am building a tool from scratch that takes a sample of text and turns it into a list of categories. I am not using any libraries for this at the moment, but I am interested if anyone has experience in this territory, as the hardest part I'm struggling with is building sentiment into the search. It's easy to word-match, but sentiment is much more challenging.
The goal would be to take something like this paragraph:
"Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"
and turn it into
categories = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
If possible, I'd like to end up adding a filter for negative sentiment so that if the text said:
"I hate cooking"
'cooking' wouldn't be included in the categories.
Any help is greatly appreciated. TIA 👍
You seem to have at least two tasks: 1. sequence classification by topic; 2. sentiment analysis. [Edit: I only noticed now that you are using Ruby/Rails, but the code below is in Python. Maybe this answer is still useful for some people, and the steps can be applied in any language.]
1. For sequence classification by topic, you can define categories simply with a list of words, as you said. Depending on the use case, this might be the easiest option. If such a list of words is too time-intensive to create, you can use a pre-trained zero-shot classifier. I would recommend the zero-shot classifier from HuggingFace; see details with code here.
Applied to your use-case, this would look like this:
# pip install transformers   # install in the terminal first
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
sequence = ["Whenever I am out walking with my son, I like to take portrait photographs of him to see how he changes over time. My favourite is a pic of him when we were on holiday in Spain and when his face was covered in chocolate from a cake we had baked"]
candidate_labels = ['father', 'photography', 'travel', 'spain', 'cooking', 'chocolate']
# multi_class lets several labels score high independently
# (the parameter was renamed to multi_label in newer transformers versions)
classifier(sequence, candidate_labels, multi_class=True)
# output:
{'labels': ['photography', 'spain', 'chocolate', 'travel', 'father', 'cooking'],
'scores': [0.9802802205085754, 0.7929317951202393, 0.7469273805618286, 0.6030028462409973, 0.08006269484758377, 0.005216470453888178]}
The classifier returns a score for each candidate_label depending on how certain it is that the label is represented in your sequence. It doesn't catch everything, but it works quite well and is fast to put into practice.
2. For sentiment analysis you can use HuggingFace's sentiment classification pipeline. In your use-case, this would look like this:
classifier = pipeline("sentiment-analysis")
sequence = ["I hate cooking"]
classifier(sequence)
# Output
[{'label': 'NEGATIVE', 'score': 0.9984041452407837}]
Putting 1. and 2. together:
I would probably (a) first take your entire text and split it into sentences (see here how to do that); then (b) run the sentiment classifier on each sentence and discard those with a high negative sentiment score (see step 2. above); and then (c) run your labeling/topic classification on the remaining sentences (see step 1. above).
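A minimal sketch of (a)-(c), reusing the two pipelines from above and NLTK's sent_tokenize for the sentence splitting; the two thresholds are assumptions to tune on your own data:

from nltk.tokenize import sent_tokenize  # requires nltk.download('punkt') once

topic_classifier = pipeline("zero-shot-classification")
sentiment_classifier = pipeline("sentiment-analysis")

def extract_categories(text, candidate_labels, topic_threshold=0.5, negative_threshold=0.9):
    categories = set()
    for sentence in sent_tokenize(text):               # (a) split into sentences
        sentiment = sentiment_classifier(sentence)[0]  # (b) drop strongly negative sentences
        if sentiment['label'] == 'NEGATIVE' and sentiment['score'] > negative_threshold:
            continue
        result = topic_classifier(sentence, candidate_labels, multi_class=True)
        # (c) keep confidently matched labels
        categories.update(label for label, score in zip(result['labels'], result['scores'])
                          if score > topic_threshold)
    return sorted(categories)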
I have a set of 20 small documents which talk about a particular kind of issue (training data). Now I want to identify, out of 10K documents, those which talk about the same issue.
For this purpose I am using the doc2vec implementation:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import csv

# tokenize_and_stem tokenizes, stems, and returns the list of tokens
# documents_prb stores the list of 20 docs
tagged_data = [TaggedDocument(words=tokenize_and_stem(_d.lower()), tags=[str(i)])
               for i, _d in enumerate(documents_prb)]

max_epochs = 20
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model = Doc2Vec.load("d2v.model")

# to find the vector of a document which is not in the training data
def doc2vec_score(s):
    s_list = tokenize_and_stem(s)
    v1 = model.infer_vector(s_list)
    similar_doc = model.docvecs.most_similar([v1])
    original_match = (X[int(similar_doc[0][0])])
    score = similar_doc[0][1]
    match = similar_doc[0][0]
    return score, match

final_data = []
# df_ws is the DataFrame of 10K docs to compare against the 20 docs above
for index, row in df_ws.iterrows():
    print(row['processed_description'])
    data = doc2vec_score(row['processed_description'])
    L1 = list(data)
    L1.append(row['Number'])
    final_data.append(L1)

with open('file_cosine_d2v.csv', 'w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['score', 'match', 'INC_NUMBER'])
    for row in final_data:
        csv_out.writerow(row)
But I am facing a strange issue: the results are highly unreliable (the score is 0.9 even when there is not the slightest match), and the score changes by a large margin every time I run the doc2vec_score function. Can someone please help me understand what is wrong here?
First & foremost, avoid the anti-pattern of calling train() multiple times in your own loop while also managing alpha yourself (a corrected sketch follows the link below).
See this answer for more details: My Doc2Vec code, after many loops of training, isn't giving good results. What might be wrong?
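Here is a minimal sketch of the corrected setup, keeping the question's parameter values. Note that parameter names differ across gensim versions: size and iter in gensim 3.x became vector_size and epochs in 4.x, so use whichever matches your install:

from gensim.models.doc2vec import Doc2Vec

# build once, train once; gensim handles the epochs and the alpha decay internally
model = Doc2Vec(vector_size=vec_size,   # `size=` in older gensim
                alpha=0.025,
                min_alpha=0.00025,
                min_count=1,
                dm=1,
                epochs=max_epochs)      # `iter=` in older gensim
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)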
If there's still a problem after that fix, edit your question to show the corrected code, and a more clear example of the output you consider unreliable.
For example, show the actual doc-IDs & scores, and explain why you think the probe document you're testing should be "not a slightest match" for any documents returned.
And note that if a document is truly nothing like the training documents, for example by using words that weren't in the training documents, it's not really possible for a Doc2Vec model to detect that. When it infers vectors for new documents, all unknown words are ignored. So you'll be left with a document using only known words, and it will return the best matches for that subset of your document's words.
More fundamentally, a Doc2Vec model is really only learning ways to contrast the documents that are in the universe demonstrated by the training set, by their words' cooccurrences. If presented with a document with either totally different words, or words whose frequencies/cooccurrences are totally unlike anything seen before, its output will be essentially random, without much meaningful relationship to other more-typical documents. (That'll be maybe-close, maybe-far, because in a way the training on the 'known universe' tends to fill the whole available space.)
So, you wouldn't want to use a Doc2Vec model trained on only positive examples of what you want to recognize, if you also want to recognize negative examples. Rather, include all kinds, then remember the subset that's relevant for certain in/out decisions – and use that subset for downstream comparisons, or multiple subsets to feed a more-formal classification or clustering algorithm (a rough sketch follows).
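As a hedged sketch of that idea, reusing the names from the question (documents_prb, df_ws, tokenize_and_stem) and its docvecs attribute (model.dv in gensim 4.x):

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# train on ALL documents (the 20 positives plus the 10K others)
all_docs = list(documents_prb) + list(df_ws['processed_description'])
tagged = [TaggedDocument(words=tokenize_and_stem(d.lower()), tags=[str(i)])
          for i, d in enumerate(all_docs)]
model = Doc2Vec(vector_size=20, min_count=1, dm=1, epochs=20)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# remember which tags are the known-relevant subset
relevant_tags = [str(i) for i in range(len(documents_prb))]

def relevance(text):
    # highest cosine similarity between the inferred vector and any known-relevant doc-vector
    v = model.infer_vector(tokenize_and_stem(text))
    return max(np.dot(v, model.docvecs[t]) /
               (np.linalg.norm(v) * np.linalg.norm(model.docvecs[t]))
               for t in relevant_tags)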
I currently have a trained topic model using MALLET (http://mallet.cs.umass.edu/topics.php) that is based on about 80 000 collected news articles (these articles all belong to one category).
I wish to give a relevancy score each time a new article comes in (it might or might not be related to the category). Is there any way to achieve this? I've read up on tf-idf, but it seems that it gives a score based on existing articles, not on incoming new ones. The end goal is to filter out articles that might be irrelevant.
Any ideas or help is greatly appreciated. Thank you!
After you have the model (topics), you can test on new unseen documents as per the documentation (the parameter --evaluator-filename [FILENAME] is where you pass the new unseen documents). From the docs, under "Topic Held-out probability":

--evaluator-filename [FILENAME]: The previous section describes how to get topic proportions for new documents. We often want to estimate the log probability of new documents, marginalized over all topic configurations. Use the MALLET command bin/mallet evaluate-topics --help to get information on using held-out probability estimation. As with topic inference, you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.
Note: I have used gensim's LDA and LSI more, and there you can pass new documents as follows:
# `dictionary` and `lda_model` come from training (gensim's corpora.Dictionary and LdaModel)
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(lda_model[new_vec])
# output: [(0, 0.020229542), (1, 0.49642297)]
Interpretation: the pair (1, 0.49642297) means that, of the two topics (categories), the new document is most strongly represented by topic #1. In your case you can take the maximum from the output list and treat it as the relevancy "coefficient": a high coefficient means the article is in the category, a low one means it is not. (I used two topics here for easier visualization; if you have only one topic, just add a simple threshold for the minimum probability you want to accept, e.g. 0.40, and if the document falls below it, it is not in the category.)
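For example, a small hypothetical helper along those lines (the 0.40 threshold is an assumption, to be tuned on held-out articles):

# accept a new article as relevant if its strongest topic probability clears a threshold
def is_relevant(lda_model, dictionary, text, threshold=0.40):
    bow = dictionary.doc2bow(text.lower().split())
    topic_probs = lda_model[bow]          # list of (topic_id, probability) pairs
    best_topic, best_prob = max(topic_probs, key=lambda t: t[1])
    return best_prob >= threshold, best_topic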
Let's say I want to predict which courses a final-year student will take and which grades they will receive in those courses. We have data of previous students' courses and grades for each year (not just the final year) to train with. We also have the grades and courses from previous years for the students we want to make predictions for. I want to use a recurrent neural network with long short-term memory to solve this problem. (I know this problem can be solved by regression, but I want to try a neural network specifically, to see if the problem can be properly solved with one.)
The way I want to set up the output (label) space is by having a feature for each of the possible courses a student can take, with a value between 0 and 1 in each entry describing whether the student will attend the class (if not, the entry for that course would be 0) and, if so, what their mark would be (i.e. if the student attends class A and gets 57%, then the label for class A will contain 0.57).
Am I setting the output space properly?
If yes, which optimization and activation functions should I use?
If no, how can I re-shape my output space to get good predictions?
If I understood you correctly, you want the network to be given the history of a student and then output one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking it, 1 for taking it) and also give the expected grade? The interpretation of the output for a single course would then be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise to rethink it.
Some obviously realistic cases do not fit this pattern. For example, how would you represent that an (A+)-student is "unlikely" to take a course? Should the network output 0.9999, because (s)he is very likely to get the maximum number of points if (s)he takes the course, OR should it output 0.0001, because the student is very unlikely to take the course at all?
Instead, you should output two values between [0,1] for each student and each course.
First value in [0, 1] gives the probability that the student will participate in the course
Second value in [0, 1] gives the expected relative number of points.
As loss, I'd propose something like binary cross-entropy on the first value and a simple squared error on the second, and then combine all the per-course losses using some L^p metric of your choice (e.g. simply add everything up for p=1, or square and add for p=2). A rough sketch is given after the examples below.
Few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
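To make the loss concrete, here is a minimal sketch in TensorFlow/Keras (a framework assumption; the question doesn't name one). It uses the p=1 combination and, as one reasonable extra choice, counts the grade error only for courses actually taken:

import tensorflow as tf

def course_loss(y_true, y_pred):
    # y_true, y_pred: shape (batch, n_courses, 2);
    # channel 0 = participation probability, channel 1 = expected relative grade
    took = y_true[..., 0]
    participation_loss = tf.keras.losses.binary_crossentropy(
        y_true[..., :1], y_pred[..., :1])                    # BCE per course
    grade_loss = tf.square(y_pred[..., 1] - y_true[..., 1])  # squared error per course
    return tf.reduce_mean(participation_loss + took * grade_loss)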
There is more than one way to solve this problem. Andrey's answer gives one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
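As a rough sketch of that architecture in Keras (the value of n is an illustrative assumption; the grouped softmax is expressed here by reshaping the output to (n, 6) rather than flattening to n*6, which is equivalent):

import tensorflow as tf
from tensorflow.keras import layers, models

n = 40            # number of possible classes (assumption for illustration)
num_results = 6   # ['A', 'B', 'C', 'D', 'F', 'did_not_take']

model = models.Sequential([
    layers.Flatten(input_shape=(n, num_results)),  # input shape (m, n, 6)
    layers.Dense(2 * n, activation='relu'),
    layers.Dense(2 * n, activation='relu'),
    layers.Dense(n * num_results),
    layers.Reshape((n, num_results)),
    layers.Softmax(axis=-1),   # softmax over each group of 6 outputs
])
# per-course categorical cross-entropy, averaged into a single loss value
model.compile(optimizer='adam', loss='categorical_crossentropy')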
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for input data only, so the two networks take the same input data, but for the grade regression network your last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding to the predicted grade for each class.
If you want to go even further, consider using an architecture that considers the order in which students took previous classes. For example, if a student took French I the previous year, it is more likely he/she will take French II this year than if he/she took French in freshman year and did not continue with it after that.
The example training exercise labels single-term names after tokenizing with something like a simple split(' ').
I need to train for and recognize names that include spaces. How do I train the recognizer?
Example: "I saw a Big Red Apple Tree." -- How would I tokenize for training and then recognize "Big Red Apple Tree" instead of recognizing four separate words?
Will this work for the training data?
I\tO
saw\tO
a\tO
Big Red Apple Tree\tMyName
.\tO
Would the output from the recognizer look the same as that?
The training section in the FAQ says "The training file parser isn't very forgiving: You should make sure each line consists of solely content fields and tab characters. Spaces don't work."
The problem you are trying to solve is phrase identification. There are different ways in which you can tag the words; for example, you can tag them with IOB tags. Train the Stanford NER model on this newly created data, then write a post-processing step to concatenate the predicted tokens (see the sketch at the end of this answer).
For example, your training data should look like this:
I\tO
saw\tO
a\tO
Big\tB-MyName
Red\tI-MyName
Apple\tI-MyName
Tree\tI-MyName
.\tO
So basically, you are using [O, B-MyName, I-MyName] as tags.
I have solved a similar problem this way and it works great. But make sure you have enough data to train on.
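For the post-processing step, a small hypothetical helper (merge_entities and its input format are illustrative, not part of the Stanford NER API) that merges consecutive B-/I- tagged tokens back into multi-word names:

def merge_entities(tagged_tokens):
    # tagged_tokens: list of (token, tag) pairs, e.g. [('Big', 'B-MyName'), ('Red', 'I-MyName'), ...]
    entities, current = [], []
    for token, tag in tagged_tokens:
        if tag.startswith('B-'):
            if current:
                entities.append(' '.join(current))
            current = [token]
        elif tag.startswith('I-') and current:
            current.append(token)
        else:
            if current:
                entities.append(' '.join(current))
            current = []
    if current:
        entities.append(' '.join(current))
    return entities

# merge_entities([('I', 'O'), ('saw', 'O'), ('a', 'O'), ('Big', 'B-MyName'),
#                 ('Red', 'I-MyName'), ('Apple', 'I-MyName'), ('Tree', 'I-MyName'), ('.', 'O')])
# -> ['Big Red Apple Tree']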