I have code to create a decision tree from a data set. I am using the weather data set from the Weka examples. How can I generate the rules from the decision tree in Java?
Data set:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
You can get decision rules from a tree by following the path to each leaf and connecting the conditions at the junctions with "and". That is, for each leaf you end up with one rule that tells you which conditions must be met to reach that leaf.
It might be easier, though, to train a set of decision rules directly instead of a tree, e.g. with the DecisionTable classifier. A sketch of both approaches follows.
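For the Java side, here is a minimal sketch using the Weka API (the file path is an assumption; adjust it to your setup):

import weka.classifiers.rules.DecisionTable;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WeatherRules {
    public static void main(String[] args) throws Exception {
        // Load the weather data and mark "play" (the last attribute) as the class
        Instances data = new DataSource("weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Train J48; printing it shows the tree, and each root-to-leaf path,
        // with its conditions joined by "and", reads off as one rule
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);

        // Alternatively, learn a rule set directly with DecisionTable
        DecisionTable rules = new DecisionTable();
        rules.buildClassifier(data);
        System.out.println(rules);
    }
}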
I am trying to use Weka to classify a dataset with logistic regression, but the option Logistic is unavailable even though I use only numeric values for attributes and a nominal class (other main classifiers are also unavailable, like NaiveBayes, J48, etc.). My .arff file is:
@RELATION data_weka
@ATTRIBUTE class {1,0}
@ATTRIBUTE 1 NUMERIC
.
.
.
@ATTRIBUTE 30 NUMERIC
@DATA
1,17.99,10.38,122.8,1001,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
.
.
.
The dataset contains 562 examples.
Can anyone help me please?
In your file, the class attribute is not the last attribute. Did you set the class attribute to be the last attribute, e.g. in the Preprocess editor (right-click an attribute to see that menu)?
By default, Weka assumes the class attribute is the last attribute in the file. Your last attribute (30) is numeric, so Weka treats it as a numeric target and greys out classifiers that cannot handle a numeric class, such as Logistic, NaiveBayes, and J48.
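If you would rather fix it in code than in the Preprocess editor, a minimal sketch using Weka's Reorder filter could look like this (the file name is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Reorder;

public class MoveClassLast {
    public static void main(String[] args) throws Exception {
        // File name is an assumption; point this at your own .arff file
        Instances data = new DataSource("data_weka.arff").getDataSet();

        // Move the first attribute (the nominal class) behind all the others
        Reorder reorder = new Reorder();
        reorder.setAttributeIndices("2-last,1");
        reorder.setInputFormat(data);
        Instances reordered = Filter.useFilter(data, reorder);

        // The class is now genuinely the last attribute
        reordered.setClassIndex(reordered.numAttributes() - 1);
        System.out.println("Class: " + reordered.classAttribute().name());
    }
}

(In the Java API you can also simply call data.setClassIndex(0); the last-attribute assumption only applies when Weka has to guess, as in the Explorer.)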
I am brand new to WEKA and ML, so please excuse my ignorance with the following. I've wasted several hours trying to figure it out, so hopefully someone could point me in the right direction:
I am trying to run a J48 decision tree on data for USDJPY. The data was loaded via a .csv file, and the class value is of nominal type, specifically TRUE or FALSE depending on whether USDJPY was trading more than 1% higher after 20 sessions. The problem is, when I run the algorithm, the decision tree simply uses the class value to solve the problem, which is useless. There are 22 attributes other than the class attribute from which I am looking to predict the class attribute.
When comparing my dataset to the example "glass" dataset, I cannot find any difference between the two that would explain my problem. "glass.arff" works as expected when I run J48 (with identical settings) by trying to predict the class value (type of glass) via the other attributes (i.e. it gets some guesses wrong).
What am I missing here? Here is a list of the attributes:
@ATTRIBUTE date NUMERIC
@ATTRIBUTE open NUMERIC
@ATTRIBUTE high NUMERIC
@ATTRIBUTE low NUMERIC
@ATTRIBUTE close NUMERIC
@ATTRIBUTE 1daypctchg NUMERIC
@ATTRIBUTE smavg50onclose NUMERIC
@ATTRIBUTE smavg100onclose NUMERIC
@ATTRIBUTE smavg200onclose NUMERIC
@ATTRIBUTE ubb2 NUMERIC
@ATTRIBUTE bollma2 onclose NUMERIC
@ATTRIBUTE lbb2 NUMERIC
@ATTRIBUTE bollwjpybgn NUMERIC
@ATTRIBUTE %bjpybgn NUMERIC
@ATTRIBUTE rsi NUMERIC
@ATTRIBUTE ma50>100 {FALSE,TRUE}
@ATTRIBUTE ma50>200 {FALSE,TRUE}
@ATTRIBUTE ma100>200 {FALSE,TRUE}
@ATTRIBUTE up1pct5d? {FALSE,TRUE}
@ATTRIBUTE up1pct20d? {FALSE,TRUE}
@ATTRIBUTE dwn1pct5d? {FALSE,TRUE}
@ATTRIBUTE dwn1pct20d? {FALSE,TRUE}
Weka (and its J48 implementation) should be able to classify your data as long as the ground-truth class is consistently in the same column of your .csv file. However, it looks like your attribute up1pct20d? encodes exactly the event your class describes (up more than 1% after 20 sessions). If an attribute duplicates the class, J48 will split on it and trivially "solve" the problem, so remove that attribute before training.
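If that is the case, here is a rough sketch of the cleanup in the Weka Java API (the file name and the 1-based index of the leaking attribute are assumptions; check them against your own file):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class UsdJpyTree {
    public static void main(String[] args) throws Exception {
        // Load the CSV; the file name is an assumption
        CSVLoader loader = new CSVLoader();
        loader.setSource(new java.io.File("usdjpy.csv"));
        Instances data = loader.getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class in the last column

        // Drop the attribute that duplicates the class (up1pct20d? is the
        // 20th attribute in the list above; verify the index in your file)
        Remove remove = new Remove();
        remove.setAttributeIndices("20");
        remove.setInputFormat(data);
        Instances cleaned = Filter.useFilter(data, remove);
        cleaned.setClassIndex(cleaned.numAttributes() - 1);

        // The tree now has to work from the remaining attributes
        J48 tree = new J48();
        tree.buildClassifier(cleaned);
        System.out.println(tree);
    }
}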
I have a course project that I need to finish. I'm using Weka 3.8 and I need to classify text. The result needs to be as accurate as possible. We received a train and a test .arff file. We need to train on the train file, of course, and then let the classifier label the test file. The professor uploaded a 100% accurate classification of the test file; we need to upload our own results, and then the system compares the two files. For now I've been using a FilteredClassifier composed of SMO and StringToWordVector with the Snowball stemmer, but for some reason I can't get a better accuracy than 65.9% (this is not the split accuracy, but the one I get when the system compares my results to the 100% accurate one). I can't figure out why.
The train.arff file:
@relation train
@attribute index numeric
@attribute ingredients string
@attribute cuisine {greek,southern_us,filipino,indian,jamaican,spanish,italian,mexican,chinese,british,thai,vietnamese,cajun_creole,brazilian,french,japanese,irish,korean,moroccan,russian}
@data
0,'romaine lettuce;black olives;grape tomatoes;garlic;pepper;purple onion;seasoning;garbanzo beans;feta cheese crumbles',greek
1,'plain flour;ground pepper;salt;tomatoes;ground black pepper;thyme;eggs;green tomatoes;yellow corn meal;milk;vegetable oil',southern_us
2,'eggs;pepper;salt;mayonaise;cooking oil;green chilies;grilled chicken breasts;garlic powder;yellow onion;soy sauce;butter;chicken livers',filipino
3,'water;vegetable oil;wheat;salt',indian
...
and 4995 more lines like these.
The test.arff is similar to this:
@relation test
@attribute index numeric
@attribute ingredients string
@attribute cuisine {greek,southern_us,filipino,indian,jamaican,spanish,italian,mexican,chinese,british,thai,vietnamese,cajun_creole,brazilian,french,japanese,irish,korean,moroccan,russian}
@data
0,'white vinegar;sesame seeds;english cucumber;sugar;extract;Korean chile flakes;shallots;garlic cloves;pepper;salt',?
1,'eggplant;fresh parsley;white vinegar;salt;extra-virgin olive oil;onions;tomatoes;feta cheese crumbles',?
... and 4337 more lines, like these.
This is my Weka configuration:
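The screenshot itself is not reproduced here; roughly, the setup described corresponds to the sketch below in the Weka Java API. The WordTokenizer on ';' is my own addition (the ingredients are ';'-separated), and the exact filter and SMO options from the screenshot are assumptions:

import weka.classifiers.functions.SMO;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.stemmers.SnowballStemmer;
import weka.core.tokenizers.WordTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class CuisineClassifier {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("train.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1); // "cuisine"

        // Turn the ingredient strings into word features; tokenizing on ';'
        // keeps multi-word ingredients like "feta cheese crumbles" intact
        StringToWordVector s2wv = new StringToWordVector();
        WordTokenizer tokenizer = new WordTokenizer();
        tokenizer.setDelimiters(";");
        s2wv.setTokenizer(tokenizer);
        s2wv.setStemmer(new SnowballStemmer()); // needs the snowball jar on the classpath

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(s2wv);
        fc.setClassifier(new SMO());
        fc.buildClassifier(train);
        System.out.println(fc);
    }
}

You may also want to remove the numeric index attribute before training, since it carries no information about the cuisine.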
He told us that in some instances in the .arff file, ingredients in the @data section are accidentally separated with ',', and that there are frequently occurring words that might not help much. I don't know if this is important or not. Is there any way I could improve the classification accuracy? Am I even using the right classifier for the job? Thanks in advance!
I am following the book "Machine Learning: Hands-On for Developers and Technical Professionals" to create a decision tree with WEKA. Though I followed the same process as shown in the book, I am not getting the same decision tree. I am using the C4.5 (J48) algorithm.
Data (arff file)
@relation ladygaga
@attribute placement {end_rack, cd_spec, std_rack}
@attribute prominence numeric
@attribute pricing numeric
@attribute eye_level {TRUE, FALSE}
@attribute customer_purchase {yes, no}
@data
end_rack,85,85,FALSE,yes
end_rack,80,90,TRUE,yes
cd_spec,83,86,FALSE,no
std_rack,70,96,FALSE,no
std_rack,68,80,FALSE,no
std_rack,65,70,TRUE,yes
cd_spec,64,65,TRUE,yes
end_rack,72,95,FALSE,yes
end_rack,69,70,FALSE,no
std_rack,75,80,FALSE,no
end_rack,75,70,TRUE,no
cd_spec,72,90,TRUE,no
cd_spec,81,75,FALSE,yes
std_rack,71,91,TRUE,yes
Expected output vs. my output (tree screenshots, not reproduced here): the two trees differ.
What am I doing wrong?
It is a problem with the book (keeping the answer here so that it can help other readers of the book).
The book expects only one negative case in the end_rack category (look for (5,1) at the end_rack leaf in the author's tree diagram). In the data provided in the book, and even on the book's website, there are actually two negative cases (5,2). I flipped one of the negative end_rack cases to yes and got the same decision tree as the book.
Here is the corrected .arff data file:
@relation ladygaga
@attribute placement {end_rack, cd_spec, std_rack}
@attribute prominence numeric
@attribute pricing numeric
@attribute eye_level {TRUE, FALSE}
@attribute customer_purchase {yes, no}
@data
end_rack,85,85,FALSE,yes
end_rack,80,90,TRUE,yes
cd_spec,83,86,FALSE,no
std_rack,70,96,FALSE,no
std_rack,68,80,FALSE,no
std_rack,65,70,TRUE,yes
cd_spec,64,65,TRUE,yes
end_rack,72,95,FALSE,yes
end_rack,69,70,FALSE,yes
std_rack,75,80,FALSE,no
end_rack,75,70,TRUE,no
cd_spec,72,90,TRUE,no
cd_spec,81,75,FALSE,yes
std_rack,71,91,TRUE,yes
Correct output: J48 now produces the same tree as the book (screenshot not reproduced here).
I have a situation where I don't know whether it is possible to use Weka classifiers.
There is a class with four pricing plans and a large number of boolean attributes describing them, like this:
@attribute 'plan' {'Free', 'Basic', 'Premium', 'Enterprise'}
@attribute 'atr01' {TRUE, FALSE}
@attribute 'atr02' {TRUE, FALSE}
@attribute 'atr03' {TRUE, FALSE}
@attribute 'atr04' {TRUE, FALSE}
@attribute 'atr05' {TRUE, FALSE}
...
@attribute 'atr60' {TRUE, FALSE}
This list of attributes can grow in the future; we expect to have 120 attributes.
What we need is to present a form where the user can check true or false for each attribute, and our recommendation system will select the most appropriate plan for the user based on our training set.
The problem is that our training set contains only one row for each plan, like this:
'Free',FALSE,TRUE,TRUE,FALSE...[+many trues and falses]...TRUE
'Basic',TRUE,FALSE,FALSE,FALSE...[+many trues and falses]...TRUE
'Premium',FALSE,FALSE,FALSE,FALSE...[+many trues and falses]...FALSE
'Enterprise',FALSE,TRUE,FALSE,FALSE...[+many trues and falses]...FALSE
The decision should try to match as many of the user's selected options as possible. I can't use filters, because filtering can return zero results, and I need at least one result.
I don't know whether this is a machine-learning problem and whether Weka can help us.
Thanks.
You don't have a machine-learning problem, because you do not have multiple training examples per class.
What you want is rather a similarity measure, so you can score the suitability of the four plans. The most popular similarity measure that comes to mind is Euclidean distance: your attributes represent a vector in a Euclidean space. Given the user's vector, you can calculate the distance to each of the four plan vectors and present the "nearest" plan.
See http://en.wikipedia.org/wiki/Euclidean_distance
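A minimal plain-Java sketch of that idea, with toy five-attribute vectors standing in for the real sixty:

public class NearestPlan {
    static final String[] PLANS = {"Free", "Basic", "Premium", "Enterprise"};

    // Toy 5-attribute plan vectors; the real ones would have 60+ entries
    static final boolean[][] FEATURES = {
        {false, true, true, false, true},    // Free
        {true, false, false, false, true},   // Basic
        {false, false, false, false, false}, // Premium
        {false, true, false, false, false},  // Enterprise
    };

    // Euclidean distance between TRUE/FALSE vectors: the square root of
    // the number of attributes on which the two vectors disagree
    static double distance(boolean[] a, boolean[] b) {
        int mismatches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) mismatches++;
        }
        return Math.sqrt(mismatches);
    }

    public static void main(String[] args) {
        boolean[] user = {false, true, true, false, false}; // the user's form answers
        int best = 0;
        for (int p = 1; p < FEATURES.length; p++) {
            if (distance(user, FEATURES[p]) < distance(user, FEATURES[best])) {
                best = p;
            }
        }
        System.out.println("Nearest plan: " + PLANS[best]);
    }
}

With binary attributes this ranks plans exactly like counting mismatched answers (Hamming distance), and it always returns one plan, even when no plan matches perfectly.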