I have a course project that I need to finish. I'm using Weka 3.8 and I need to classify text, and the result needs to be as accurate as possible. We received a train and a test .arff file. We train on the train file, of course, and then classify the test file. The professor uploaded a 100% accurate classification of the test file; we upload our own results and then the system compares the two files. So far I've been using a FilteredClassifier composed of SMO and StringToWordVector with the Snowball stemmer, but I can't get better accuracy than 65.9% for some reason (this is not the split accuracy, but the one I get when the system compares my results to the 100% accurate one). I can't figure out why.
The train.arff file:
@relation train
@attribute index numeric
@attribute ingredients string
@attribute cuisine {greek,southern_us,filipino,indian,jamaican,spanish,italian,mexican,chinese,british,thai,vietnamese,cajun_creole,brazilian,french,japanese,irish,korean,moroccan,russian}
@data
0,'romaine lettuce;black olives;grape tomatoes;garlic;pepper;purple onion;seasoning;garbanzo beans;feta cheese crumbles',greek
1,'plain flour;ground pepper;salt;tomatoes;ground black pepper;thyme;eggs;green tomatoes;yellow corn meal;milk;vegetable oil',southern_us
2,'eggs;pepper;salt;mayonaise;cooking oil;green chilies;grilled chicken breasts;garlic powder;yellow onion;soy sauce;butter;chicken livers',filipino
3,'water;vegetable oil;wheat;salt',indian
...
and 4995 more lines like these.
The test.arff is similar to this:
@relation test
@attribute index numeric
@attribute ingredients string
@attribute cuisine {greek,southern_us,filipino,indian,jamaican,spanish,italian,mexican,chinese,british,thai,vietnamese,cajun_creole,brazilian,french,japanese,irish,korean,moroccan,russian}
@data
0,'white vinegar;sesame seeds;english cucumber;sugar;extract;Korean chile flakes;shallots;garlic cloves;pepper;salt',?
1,'eggplant;fresh parsley;white vinegar;salt;extra-virgin olive oil;onions;tomatoes;feta cheese crumbles',?
... and 4337 more lines, like these.
This is my Weka configuration: [screenshot of the FilteredClassifier setup, not included]
The professor told us that in the .arff file some ingredients in the @data section are accidentally separated with ',' and that there are frequently occurring words that might not help much. I don't know if this is important or not. Is there any way I could improve the classification accuracy? Am I even using the right classifier for the job? Thanks in advance!
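For reference, the setup described above (bag-of-words features feeding a linear SVM) corresponds roughly to this scikit-learn sketch. The tokenizer splitting on both ';' and ',' is an assumption to cope with the accidental comma separators; all data and names here are illustrative, not the actual Weka settings:

```python
import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Split ingredient lists on ';' and on the accidental ',' separators,
# keeping each whole ingredient phrase as one token.
def ingredient_tokenizer(text):
    return [t.strip().lower() for t in re.split(r"[;,]", text) if t.strip()]

# LinearSVC is roughly comparable to Weka's SMO with a linear kernel.
pipeline = Pipeline([
    ("vec", TfidfVectorizer(tokenizer=ingredient_tokenizer, token_pattern=None)),
    ("clf", LinearSVC()),
])

train_docs = [
    "romaine lettuce;black olives;feta cheese crumbles",
    "soy sauce;ginger;sesame oil",
]
train_labels = ["greek", "chinese"]
pipeline.fit(train_docs, train_labels)
print(pipeline.predict(["feta cheese crumbles;black olives"]))
```

Whether to treat whole phrases or single words as tokens, and whether to stem them, are exactly the choices that StringToWordVector's tokenizer options control on the Weka side.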
I am trying to solve a text classification problem. I have a limited number of labels that capture the category of my text data. If the incoming text data doesn't fit any label, it is tagged as 'Other'. In the below example, I built a text classifier to classify text data as 'breakfast' or 'italian'. In the test scenario, I included couple of text data that do not fit into the labels that I used for training. This is the challenge that I'm facing. Ideally, I want the model to say - 'Other' for 'i like hiking' and 'everyone should understand maths'. How can I do this?
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train = np.array(["coffee is my favorite drink",
                    "i like to have tea in the morning",
                    "i like to eat italian food for dinner",
                    "i had pasta at this restaurant and it was amazing",
                    "pizza at this restaurant is the best in nyc",
                    "people like italian food these days",
                    "i like to have bagels for breakfast",
                    "olive oil is commonly used in italian cooking",
                    "sometimes simple bread and butter works for breakfast",
                    "i liked spaghetti pasta at this italian restaurant"])
y_train_text = ["breakfast", "breakfast", "italian", "italian", "italian",
                "italian", "breakfast", "italian", "breakfast", "italian"]
X_test = np.array(['this is an amazing italian place. i can go there every day',
                   'i like this place. i get great coffee and tea in the morning',
                   'bagels are great here',
                   'i like hiking',
                   'everyone should understand maths'])

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())])
classifier.fit(X_train, y_train_text)
predicted = classifier.predict(X_test)
proba = classifier.predict_proba(X_test)
print(predicted)
print(proba)
['italian' 'breakfast' 'breakfast' 'italian' 'italian']
[[0.25099411 0.74900589]
[0.52943091 0.47056909]
[0.52669142 0.47330858]
[0.42787443 0.57212557]
[0.4 0.6 ]]
I consider the 'Other' category as noise and I cannot model this category.
I think Kalsi might have suggested this, but it was not clear to me: you could define a confidence threshold for your classes. If the predicted probability does not reach the threshold for any of your classes ('italian' and 'breakfast' in your example), the sample could not be confidently classified and falls back to the 'other' "class".
I say "class" because other is not exactly a class. You probably don't want your classifier to be good at predicting "other", so this confidence threshold might be a good approach.
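A minimal sketch of that thresholding idea, using the same pipeline shape as the question; the training docs and the 0.6 cutoff are illustrative values you would tune on held-out data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Same pipeline shape as in the question, trained on two tiny example docs.
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())])
classifier.fit(["coffee and bagels in the morning",
                "pasta at this italian restaurant"],
               ["breakfast", "italian"])

THRESHOLD = 0.6  # illustrative cutoff; tune on held-out data

proba = classifier.predict_proba(["i like hiking",
                                  "coffee and bagels in the morning"])
best = proba.argmax(axis=1)
# Fall back to 'other' when no class reaches the confidence threshold.
labels = np.where(proba.max(axis=1) >= THRESHOLD,
                  classifier.classes_[best], "other")
print(labels)
```

"i like hiking" contains no training vocabulary, so the model falls back to its (uniform) class priors, never reaches the threshold, and gets labelled 'other'.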
You cannot do that.
You have trained the model to predict only two labels, i.e. breakfast or italian, so the model has no idea about a third label, a fourth, etc.
You and I know that "i like hiking" is neither breakfast nor italian, but how would the model know that? It only knows breakfast and italian. So there has to be a way to tell the model: if you get confused between breakfast and italian, predict the label as other.
You can achieve this by training a model that has other as a label, with some texts like "i like hiking" etc.
But in your case, a little hack can be done as follows.
What does it mean when a model predicts a label with a probability of (approximately) 0.5? It means the model is getting confused between the labels breakfast and italian. So here you can take advantage of this.
You can take all the predicted probability values and assign the label other whenever the probability is between 0.45 and 0.55. In this way you can predict the other label (obviously with some errors) without the model ever knowing there is a label called other.
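That hack might look like the following sketch; the tiny training data is illustrative, and the 0.45-0.55 band is the one suggested above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vec = CountVectorizer()
clf = MultinomialNB()
X = vec.fit_transform(["coffee and bagels for breakfast",
                       "pasta at this italian restaurant"])
clf.fit(X, ["breakfast", "italian"])

proba = clf.predict_proba(vec.transform(["i like hiking"]))
top = proba.max(axis=1)

# If the model is torn between the two known classes, call it 'other'.
labels = np.where((top > 0.45) & (top < 0.55),
                  "other", clf.classes_[proba.argmax(axis=1)])
print(labels)
```

Since "i like hiking" shares no vocabulary with the training data, its predicted probabilities collapse to the class priors (0.5/0.5), which lands in the band and yields 'other'.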
You can try setting class priors when creating the MultinomialNB. You could create a dummy "Other" training example, and then set the prior high enough for "Other" that instances default to it when there isn't enough evidence to select the other classes.
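A sketch of that idea; the dummy document and the prior values are illustrative assumptions that would need tuning in practice:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["coffee and bagels for breakfast",
        "pasta at this italian restaurant",
        "placeholder other text"]          # dummy 'Other' example
labels = ["breakfast", "italian", "other"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# class_prior follows the sorted class order: breakfast, italian, other.
clf = MultinomialNB(class_prior=[0.2, 0.2, 0.6])  # illustrative values
clf.fit(X, labels)

print(clf.predict(vec.transform(["i like hiking"])))
```

For a sentence with no known vocabulary the likelihoods are uninformative, so the inflated prior makes 'other' the default prediction.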
No, you cannot do that.
You have to define a third category "other" (or whatever name suits you) and give your model some data for that category. Make sure the number of training examples for all three categories is roughly equal; otherwise "other", being a very broad category, could skew your model towards it.
Another way to approach this is to extract noun phrases from all your sentences for the different categories, including other, and feed those into the model; consider it a feature-selection step for your machine learning model. That way the noise added by irrelevant words is removed, giving better performance than tf-idf.
If you have huge amounts of data, go for deep learning models, which do feature selection automatically.
Don't go manipulating the probabilities yourself: a 50/50 probability means that the model is confused between the two classes you have defined; it has no idea about a third "other" class.
Let's say the sentence is "I want italian breakfast". The model will be confused whether this sentence belongs to the "italian" or the "breakfast" category, but that doesn't mean it belongs to the "other" category.
I am trying to use Weka to classify a dataset with logistic regression, but the Logistic option is unavailable even though I use only numeric values for the attributes and a nominal class (other main classifiers are also unavailable, like NaiveBayes, J48, etc.). My .arff file is:
@RELATION data_weka
@ATTRIBUTE class {1,0}
@ATTRIBUTE 1 NUMERIC
.
.
.
@ATTRIBUTE 30 NUMERIC
@DATA
1,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
.
.
.
The dataset contains 562 examples.
Can anyone help me please?
In your file, the class attribute is not the last attribute. Did you set the class attribute to be the last (class) attribute in the Preprocess editor (right-click to see that menu)?
By default, Weka assumes the class attribute is the last attribute in the file. Your last attribute (30) is numeric, so Weka treats it as a numeric target and won't let you run logistic regression.
I am a newbie in machine learning. I am building a complaint categorizer and I want to provide a feedback model so that it can improve over time.
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

value = [
    'drought',
    'robber',
]
targets = [
    'water_department',
    'police_department',
]

classifier = MultinomialNB()
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(value)
classifier.partial_fit(counts[:1], targets[:1], classes=numpy.unique(targets))
for c, t in zip(counts[1:], targets[1:]):
    classifier.partial_fit(c, t.split())

value.append('dogs')  # new value to train
targets.append('animal_department')  # new target
vectorize = CountVectorizer()
counts = vectorize.fit_transform(value)
print(counts)
print(targets)
print(vectorize.vocabulary_)

# ### problem lies here
classifier.partial_fit(counts["""dont know the index of new value"""], targets[-1])
# ### problem lies here
Even if I somehow find the index of the newly inserted value, I get the error
ValueError: Number of features 3 does not match previous data 2.
even though I insert only one value at a time.
I will try to answer the question from a general point of view. There are two sources of problem in the Naive Bayes (NB) approach described here:
Out-of-vocabulary (OOV) problem
Incremental training of NB
OOV problem: The simplest way to tackle the OOV problem is to decompose every word into character 3-grams. How many such 3-grams are possible? Assuming lower-casing, there are only 26 possible ways to fill each position, so the total number of possible character 3-grams is 26^3 = 17576, which is significantly lower than the number of possible English words you're likely to see in text.
Hence, generally speaking, while training NB a good idea is to use probabilities of character n-grams (n = 3, 4, 5). This will drastically reduce the OOV problem.
Incremental training: For incremental training, given a new sentence, decompose it into terms (character n-grams) and update the count of each term for its observed class label. For example, if count(t, c) denotes how many times the term t was observed in class c, simply update that count when you see t in class 0 (or class 1) during incremental training. Updating the counts updates the maximum-likelihood probability estimates as well.
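Both points can be sketched together: a HashingVectorizer over character 3-grams gives a fixed-size feature space, so a brand-new word never changes the dimensionality, and partial_fit with the full classes list upfront handles the incremental updates. The department labels are taken from the question; alternate_sign=False keeps the features non-negative for MultinomialNB:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Fixed-width character 3-gram features: new words never change dimensionality.
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 3),
                               n_features=2**16, alternate_sign=False)

classes = np.array(["water_department", "police_department", "animal_department"])
clf = MultinomialNB()

# Initial incremental training; all possible classes declared upfront.
clf.partial_fit(vectorizer.transform(["drought"]), ["water_department"],
                classes=classes)
clf.partial_fit(vectorizer.transform(["robber"]), ["police_department"])

# Later feedback: a brand-new word, no refitting of the vectorizer needed.
clf.partial_fit(vectorizer.transform(["dogs"]), ["animal_department"])

# An unseen variant still shares most of its character 3-grams with 'drought'.
print(clf.predict(vectorizer.transform(["droughts"])))
```

This avoids the ValueError from the question because the hashing vectorizer is stateless: it never needs re-fitting, so the feature count stays constant across updates.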
I have a situation where I don't know if it is possible to use Weka classification.
The class is a pricing plan, and there is a large number of boolean attributes describing it, like this:
@attribute 'plan' {'Free', 'Basic', 'Premium', 'Enterprise'}
@attribute 'atr01' {TRUE, FALSE}
@attribute 'atr02' {TRUE, FALSE}
@attribute 'atr03' {TRUE, FALSE}
@attribute 'atr04' {TRUE, FALSE}
@attribute 'atr05' {TRUE, FALSE}
...
@attribute 'atr60' {TRUE, FALSE}
This list of attributes can grow in the future... we expect to have 120 attributes.
What we need is a form where the user can check true or false for each attribute, so that our recommendation system selects the most appropriate plan for the user based on our training set.
The problem is that our training set contains only 1 row for each plan, just like that:
'Free',FALSE,TRUE,TRUE,FALSE...[+many trues and falses]...TRUE
'Basic',TRUE,FALSE,FALSE,FALSE...[+many trues and falses]...TRUE
'Premium',FALSE,FALSE,FALSE,FALSE...[+many trues and falses]...FALSE
'Enterprise',FALSE,TRUE,FALSE,FALSE...[+many trues and falses]...FALSE
The decision should match as many of the user's selected options as possible. I can't use filters, because filtering can yield zero results and I need at least one result.
I don't know whether this is a machine learning problem and whether Weka can help us.
Thanks.
You don't have a machine-learning problem, because you do not have multiple training examples per class.
What you want is a similarity measure, so you can score the suitability of the 4 plans. The most popular similarity measure that comes to mind is Euclidean distance: your attributes represent a vector in a Euclidean space, and given the user's vector you can calculate the distance to each of the 4 plan vectors and present the "nearest" plan.
See http://en.wikipedia.org/wiki/Euclidean_distance
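A minimal sketch of that nearest-plan lookup, with the plan vectors shortened to five illustrative attributes (TRUE/FALSE encoded as 1/0):

```python
import numpy as np

# Each row: one plan's TRUE/FALSE attributes encoded as 1/0
# (shortened to 5 illustrative attributes).
plans = {
    "Free":       np.array([0, 1, 1, 0, 1]),
    "Basic":      np.array([1, 0, 0, 0, 1]),
    "Premium":    np.array([0, 0, 0, 0, 0]),
    "Enterprise": np.array([0, 1, 0, 0, 0]),
}

def nearest_plan(user):
    """Return the plan whose attribute vector is closest in Euclidean distance."""
    user = np.asarray(user)
    return min(plans, key=lambda name: np.linalg.norm(plans[name] - user))

print(nearest_plan([0, 1, 1, 0, 1]))
```

For 0/1 vectors, Euclidean distance ranks plans the same way as simply counting mismatched attributes (Hamming distance), which matches the "as many matching options as possible" requirement.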
I have code to create a decision tree from a data set; I am using the weather data set from the Weka examples. How can I generate the rules from the decision tree in Java?
Data set::
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
You can get decision rules from a tree by following the path to each leaf and joining the conditions at the internal nodes with "and". That is, for each leaf you end up with one rule that tells you what conditions must be met to reach that leaf.
It might be easier, though, to train a set of decision rules directly instead of a tree, e.g. with the DecisionTable classifier.
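As an illustration of the path-to-leaf idea (sketched in scikit-learn rather than Weka/Java, since the mechanics are the same): each root-to-leaf path printed by export_text is one rule whose conditions are joined with "and". The one-hot encoding of the nominal attributes is an assumption made to fit the numeric API:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# The weather data from the question, nominal attributes one-hot / 0-1 encoded.
# Columns: sunny, overcast, rainy, temperature, humidity, windy
X = [
    [1, 0, 0, 85, 85, 0], [1, 0, 0, 80, 90, 1], [0, 1, 0, 83, 86, 0],
    [0, 0, 1, 70, 96, 0], [0, 0, 1, 68, 80, 0], [0, 0, 1, 65, 70, 1],
    [0, 1, 0, 64, 65, 1], [1, 0, 0, 72, 95, 0], [1, 0, 0, 69, 70, 0],
    [0, 0, 1, 75, 80, 0], [1, 0, 0, 75, 70, 1], [0, 1, 0, 72, 90, 1],
    [0, 1, 0, 81, 75, 0], [0, 0, 1, 71, 91, 1],
]
y = ["no", "no", "yes", "yes", "yes", "no", "yes",
     "no", "yes", "yes", "yes", "yes", "yes", "no"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each indented branch is one condition; a full chain down to a leaf is one rule.
rules = export_text(tree, feature_names=["sunny", "overcast", "rainy",
                                         "temperature", "humidity", "windy"])
print(rules)
```

In Weka itself you would do the equivalent walk over the J48 tree's nodes in Java; the printed output just makes the leaf-per-rule structure easy to see.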