I have a huge amount of Yelp data and I have to classify the reviews into 8 different categories.
Categories
Cleanliness
Customer Service
Parking
Billing
Food Pricing
Food Quality
Waiting time
Unspecified
Reviews can contain multiple categories, so I have used multi-label classification. But I am confused about how to handle the positive/negative sentiment. For example, a review may be positive for food quality but negative for customer service, e.g. "food taste was very good but staff behaviour was very bad" contains positive Food Quality but negative Customer Service. How can I handle this case? Should I do sentiment analysis before classification?
I think your data is very similar to the Restaurants reviews dataset. It contains around 100 reviews, with a varied number of aspect terms in each (more information). So you can use Aspect-Based Sentiment Analysis like this:
1- Aspect term extraction
Extract the aspect terms from the reviews.
2- Aspect polarity detection
For a given set of aspect terms within a sentence, determine whether the polarity of each aspect term is positive or negative.
3- Aspect category identification
Given a predefined set of aspect categories (e.g., Food Quality, Customer Service), identify the aspect categories discussed in a given sentence.
4- Aspect category polarity
Given a set of pre-identified aspect categories (e.g., Food Quality, Customer Service), determine the polarity (positive or negative) of each aspect category.
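As a rough illustration of steps 3 and 4, here is a minimal sketch in Python, assuming NLTK's VADER sentiment analyzer is available (pip install nltk, then nltk.download('vader_lexicon')); the aspect keyword lists are illustrative assumptions, not a real lexicon:

    import re
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    # Illustrative keyword lists; a real system would learn or curate these.
    ASPECT_KEYWORDS = {
        "Food Quality": ["food", "taste", "dish", "flavour"],
        "Customer Service": ["staff", "waiter", "service", "behaviour"],
        "Cleanliness": ["clean", "dirty", "hygiene"],
    }

    analyzer = SentimentIntensityAnalyzer()

    def aspect_polarities(review):
        """Return {category: 'positive'/'negative'} for each detected aspect."""
        results = {}
        # Split on clause boundaries so opposing opinions are scored separately.
        for clause in re.split(r"[,;.]|\bbut\b", review.lower()):
            score = analyzer.polarity_scores(clause)["compound"]
            for category, words in ASPECT_KEYWORDS.items():
                if any(w in clause for w in words):
                    results[category] = "positive" if score >= 0 else "negative"
        return results

    print(aspect_polarities("Food taste was very good but staff behaviour was very bad."))
    # {'Food Quality': 'positive', 'Customer Service': 'negative'}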
Please see this for more information about a similar project.
I hope this can help you.
Yes, you would need sentiment analysis. Why not tokenize your data, i.e. find the required words in each sentence? The most feasible approach for you is to find the relevant words along with their sentiment, e.g. "food was good but the cleanliness was not appropriate".
In this case you have [food, good, cleanliness, not, appropriate]: "food" links with its next term "good", and "cleanliness" with its next terms "not appropriate".
Again, you can classify into two classes, i.e. 1 and 0 for good and bad, or you can add classes depending on your case.
Then you would have data as such:
--------------------------
FEATURE          |  VAL
--------------------------
Cleanliness      |    0
Customer Service |   -1
Parking          |   -1
Billing          |   -1
Food Pricing     |   -1
Food Quality     |    1
Waiting time     |   -1
Unspecified      |   -1
I have given this just as an example, where -1, 1, and 0 stand for no review, good, and bad respectively. You can add more levels, e.g. 0, 1, 2 for bad, fair, good.
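For illustration, a minimal sketch of encoding such a feature row, assuming the (category, sentiment) pairs have already been extracted from the text upstream:

    CATEGORIES = ["Cleanliness", "Customer Service", "Parking", "Billing",
                  "Food Pricing", "Food Quality", "Waiting time", "Unspecified"]

    def encode_review(detected):
        """detected: dict like {'Food Quality': 'good', 'Cleanliness': 'bad'}.
        Returns a row with -1 = no review, 1 = good, 0 = bad."""
        return {cat: (-1 if cat not in detected
                      else 1 if detected[cat] == "good" else 0)
                for cat in CATEGORIES}

    print(encode_review({"Food Quality": "good", "Cleanliness": "bad"}))
    # {'Cleanliness': 0, 'Food Quality': 1, ...} with -1 everywhere else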
I may not be the best at answering this, but this is how I see it.
Note: you need to understand that your model cannot be perfect; that is the nature of machine learning. It will be wrong for certain inputs, and over time it will learn from them and improve.
There are many ways of doing multi-label classification.
The simplest one would be having one model per class: if a review achieves a certain threshold score for a label, you apply that label to the review.
This treats the classes independently, but it seems like a good solution to your problem.
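A minimal sketch of that per-label idea using scikit-learn's OneVsRestClassifier (the toy reviews, label columns, and the 0.5 threshold are assumptions for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline

    # Toy data: y is a binary indicator matrix, one column per category.
    X = ["food was great but staff was rude", "spotless place, fair prices"]
    y = [[1, 1, 0], [0, 0, 1]]  # columns: Food Quality, Customer Service, Cleanliness

    clf = make_pipeline(TfidfVectorizer(),
                        OneVsRestClassifier(LogisticRegression()))
    clf.fit(X, y)

    # One probability per label; apply a threshold to decide which labels stick.
    probs = clf.predict_proba(["the food tasted amazing"])
    labels = probs >= 0.5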
I'm taking my first steps in AI and ML.
I have chosen a project that I want to tackle with ML, but I'm unsure which method to use.
Business case: a customer can place orders and set a date on which he wants to receive his products.
He is able to change the amount of products he buys at any time.
I have to deal with the costs of unsold products, and with the missed profit in case I produced less than he wanted.
I have plenty of data from past transactions containing the original amount of products ordered and the amount I actually sent to the customer.
My goal is a predictive analytics model that can tell me, after a customer has changed the number of products in an order, how probable it is that this change is final.
I'm really new to this topic and don't quite grasp all the information on the different methods. I know classification and regression are the big players and can be implemented in different ways, but is one of those approaches a fit for my problem?
Many thanks in advance.
You can go with a classification-based approach, since your goal is to predict whether the order change is final or not. A classifier can output a score for each prediction, and how much trust to place in those scores can be gauged from your model's accuracy/F1: the higher the values, the more successful the predictions. In layman's terms, think of this as classifying whether the order is final or not.
You would go for a regression approach if you were trying to predict a value based on the order change. For example, if you wanted to predict the cost of the next order change, you would use regression.
As I understand it, your use case matches the first scenario.
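To make the first scenario concrete, here is a hedged sketch with scikit-learn; the feature columns (ordered quantity, shipped quantity, days until delivery, number of prior changes) are hypothetical stand-ins for whatever your transaction data actually contains:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Toy history: one row per order change; label 1 = the change turned out final.
    X = np.array([
        [100, 100, 30, 0],   # [ordered, shipped, days_to_delivery, prior_changes]
        [100,  80, 10, 2],
        [ 50,  50, 25, 1],
        [200, 150,  5, 3],
    ])
    y = np.array([1, 0, 1, 0])

    clf = RandomForestClassifier(random_state=0).fit(X, y)

    # predict_proba yields the probability that a new change is final.
    new_change = np.array([[120, 120, 20, 1]])
    print(clf.predict_proba(new_change)[:, 1])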
We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import the ICD10-CM data as description-code pairs, but obviously it didn't work, since AutoML needs more text per code (label). I found a dataset on Kaggle, but it only contained hrefs to an ICD10 website. I did find that the website contains multiple texts and descriptions associated with each code that can be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I build a dataset from the sentences found on these pages and assign them to their codes (labels), will it be enough for AutoML training? Each label would finally have 2 or more texts instead of just one, but still far fewer than the 100 per code seen in demos/tutorials.
From what I can see here, the disease codes have a tree-like structure where, for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue". At the same time, the L00-L08 codes refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90,000 examples for 90,000 different independent labels, but a decision tree: you take several decisions, each as a function of the previous one (the first step would be choosing which of the roughly 15 most general categories fits best, then choosing which of its subcategories, etc.).
In this sense, AutoML is probably not the best product, given that you cannot implement a specially designed decision tree model that takes all of this into account.
Another way of using AutoML would be to train a separate model for each of the decisions and then combine the different models. This would easily work for the first layer of decisions, but becomes exponentially time-consuming: the number of models you need to train grows exponentially with the level of accuracy (by accurate I mean affirming that it is L00-L08 instead of just L00-L99).
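To sketch that layered idea, one model per level of the hierarchy, chained at prediction time; the descriptions and code ranges below are a tiny illustrative subset, not real training data:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Level 1: description text -> chapter range (e.g. "A00-B99", "L00-L99").
    chapter_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(
        ["tuberculosis of lung", "cellulitis of skin", "pemphigus vulgaris"],
        ["A00-B99", "L00-L99", "L00-L99"],
    )

    # Level 2: one classifier per chapter, narrowing to a block within it.
    block_clfs = {
        "L00-L99": make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(
            ["cellulitis of skin", "pemphigus vulgaris"],
            ["L00-L08", "L10-L14"],
        ),
    }

    text = ["impetigo of skin"]
    chapter = chapter_clf.predict(text)[0]
    block = block_clfs[chapter].predict(text)[0] if chapter in block_clfs else None
    print(chapter, block)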
I hope this helps you understand better the problem and the different approaches you can give to it!
I have a set of ratings entered by users for N items, along with the reasons why they selected that rating for that item. The ratings are on an ordinal scale (-2, -1, 0, +1, +2).
I would like to come up with meaningful groupings of these reasons. For example, say users are rating movies, and the reasons behind the ratings fall under 3 broad groups: 1) "They are a huge fan of the actor", 2) "Amazing story line", 3) "Lacks originality". This is just a dummy example.
More concretely: given a set of free-form textual entries, can one come up with such groupings? I know that topic modeling is one way of doing this. I can specify the number of topics K and then feed the data into my topic model (LDA etc.); the model will output K topics, where each topic is a list of the most probable words in that topic. With respect to this dummy example, topic 1 may contain words and phrases like "fan", "actor", "great acting".
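For instance, a minimal sketch of that LDA workflow with gensim, where K = 3 and the tokenized documents are placeholders:

    from gensim import corpora
    from gensim.models import LdaModel

    docs = [
        ["huge", "fan", "actor", "great", "acting"],
        ["amazing", "story", "line", "plot"],
        ["lacks", "originality", "predictable", "plot"],
    ]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    # K topics; each topic is a distribution over words.
    lda = LdaModel(corpus, num_topics=3, id2word=dictionary,
                   passes=10, random_state=0)
    for topic_id, words in lda.print_topics(num_words=4):
        print(topic_id, words)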
Are there other ways to do this clustering? Do I need to consider the ordinal rating scale while clustering, and if so, how can I take it into account?
Word embeddings might be useful. Here is a recent, relevant Stanford project.
It depends on how sophisticated you wish the handling of the text to be. If matching single words (1-grams) is sufficient, then:
remove stop words
possibly do stemming or other text preprocessing
apply a Naive Bayes classification algorithm; options are listed at http://en.wikipedia.org/wiki/Naive_Bayes_classifier (a sketch of this pipeline follows the list)
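A hedged sketch of that 1-gram pipeline with NLTK (requires nltk.download('stopwords'); the toy training texts and labels are placeholders):

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.classify import NaiveBayesClassifier

    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))

    def features(text):
        """Bag of stemmed, stop-word-filtered 1-grams."""
        tokens = [w for w in text.lower().split() if w not in stops]
        return {stemmer.stem(w): True for w in tokens}

    train = [
        (features("huge fan of the actor"), "actor"),
        (features("amazing story line"), "story"),
        (features("lacks originality"), "originality"),
    ]
    clf = NaiveBayesClassifier.train(train)
    print(clf.classify(features("great acting by the actor")))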
However, you may also wish to do a better job with phrases / related words. In that case there is plenty of research, and there are implementations, to help you. N-grams are a relatively simple approach, but more advanced methods that understand the semantics of the language have better statistical performance.
I have a decision tree trained on the columns (Age, Sex, Time, Day, Views, Clicks), which classifies into two classes, Yes or No, representing the buying decision for an item X.
Using these values,
I'm trying to predict the buying probability for 1000 samples (customers) that look like ('12', 'Male', '9:30', 'Monday', '10', '3'),
('50','Female','10:40','Sunday','50','6')
........
I want to get an individual probability or score that will help me recognize which customers are most likely to buy the item X, so that I can sort the predictions and show a particular item to only the 5 customers most likely to buy it.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.
Using a decision tree with a small sample set, you will definitely run into an overfitting problem, especially at the lower levels of the decision tree, where you have exponentially less data to train your decision boundaries. Your data set should have a lot more samples than the number of categories, and enough samples for each category.
Speaking of decision boundaries, make sure you understand how you are handling the data type of each dimension. For example, 'sex' is categorical data, whereas 'age', 'time of day', etc. are real-valued inputs (discrete/continuous). Different parts of your tree will therefore need different formulations; otherwise, your model might end up handling 9:30, 9:31, 9:32... as separate classes.
Try some other algorithms, starting with simple ones like k-nearest neighbours (KNN). Keep a validation set to test each algorithm. Use MATLAB (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend something very specific.
In particular, I suggest you try KNN, which is able to capture affinity in data. Say product X is bought by people around age 20, during evenings, after about 5 clicks on the product page: KNN will be able to tell you how close each new customer is to the customers who bought the item, and based on this you can just pick the top 5. It is very easy to implement and works great as a benchmark for more complex methods.
(Assuming 'views' and 'clicks' mean the number of views and clicks by each customer for product X.)
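A hedged sketch of that KNN scoring with scikit-learn, assuming the categorical columns have already been encoded numerically (e.g. sex as 0/1, time as minutes since midnight, day as 0-6); all values below are toy data, and in practice you would scale the features first:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Toy training rows: [age, sex, minutes, day, views, clicks]; label = bought X.
    X_train = np.array([
        [12, 1,  570, 0, 10, 3],
        [50, 0,  640, 6, 50, 6],
        [22, 1, 1140, 4, 30, 5],
        [35, 0,  480, 2,  5, 1],
    ])
    y_train = np.array([0, 1, 1, 0])

    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

    # Score new customers and keep the most likely buyers.
    X_new = np.array([
        [20, 1, 1100, 5, 25, 4],
        [60, 0,  500, 1,  8, 0],
        [24, 1, 1150, 4, 40, 6],
    ])
    scores = knn.predict_proba(X_new)[:, 1]   # P(buy) per customer
    top = np.argsort(scores)[::-1][:5]        # indices, most likely first
    print(top, scores[top])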
A decision tree is a classifier, and in general it is not suitable as the basis for a recommender system. But given that you are only predicting the likelihood of buying one item, not tens of thousands, it does make sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?
I'm trying to figure out a way to represent a Facebook user as a vector. I decided to go with stacking the different attributes/parameters of the user into one big vector (e.g. age is a vector of size 100, where 100 is the maximum age you can have; if you are, say, 50, the first 50 values of the vector would be 1, just like a thermometer). I just can't figure out a way to represent the Facebook interests as a vector too: they are a collection of words, and the space representing all the words is huge, so I can't go for a model like bag of words or something similar. Does anyone know how I should proceed? I'm still new to this; any reference would be highly appreciated.
If you are inclined to downvote this question, please let me know what is wrong with it so that I can improve the wording and context.
Thanks
The "right" approach depends on what your learning algorithm is and what the decision problem is.
It would often be better, though, to represent age as a single numeric feature rather than 100 indicator features. That way learning algorithms don't have to learn the relationship between those hundred features (it's baked in), and the problem has 99 fewer dimensions, which will make everything better.
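A tiny illustration of the two encodings, the thermometer vector from the question versus a single scaled numeric feature:

    import numpy as np

    age = 50
    thermometer = np.array([1] * age + [0] * (100 - age))  # 100 dimensions
    numeric = np.array([age / 100.0])                      # 1 dimension, scaled to [0, 1]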
To model the interests, you might want to start with an extremely high-dimensional bag-of-words model and then use one of various options to reduce the dimensionality (a sketch follows this list):
a general dimensionality-reduction technique like PCA, or smarter nonlinear ones such as Kernel PCA: see Wikipedia's overviews of dimensionality reduction and of nonlinear techniques specifically
pass it through a topic model and use the learned topic weights as your features; examples include LSA, LDA, HDP and many more
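As a minimal sketch of the bag-of-words-then-reduce route with scikit-learn (TruncatedSVD here plays the role of LSA; the toy interest strings are placeholders):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    interests = [
        "hiking photography travel",
        "machine learning python chess",
        "cooking travel food photography",
    ]

    bow = CountVectorizer().fit_transform(interests)    # high-dimensional, sparse
    svd = TruncatedSVD(n_components=2, random_state=0)  # LSA-style reduction
    user_vectors = svd.fit_transform(bow)               # dense, low-dimensional
    print(user_vectors.shape)  # (3, 2)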