I am trying to output the predictions for a test data set after loading a model into Weka. The file is in .csv format and the classifier I am using is NaiveBayes. I set the supplied test set to a test file that has about 110,000 instances, with every label set to '?'. When I run the model on this test file and output the predictions into a .csv file, I get a file like this:
inst#   actual   predicted   error   prediction
1       0        0.677       0.677
2       0        0.978       0.978
3       0        1           1
4       0        1           1
5       0        0.991       0.991
6       0        0.996       0.996
7       0        1           1
8       0        0.999       0.999
9       0        0.996       0.996
10      0        0.965       0.965
Can anyone tell me why the prediction column is empty? Why isn't the predicted label being printed, and how can I resolve this?
I am very new to Weka and was unable to solve this.
I have a dataset that came from NLP on technical documents. The dataset has 60,000 records and 30,000 features, where each value is the number of times that word/feature appeared. Here is a sample of the dataset:
RowID              Microsoft  Internet  PCI  Laptop  Google  AWS  iPhone  Chrome
1                  8          2         0    0       5       1    0       0
2                  0          1         0    1       1       4    1       0
3                  0          0         0    7       1       0    5       0
4                  1          0         0    1       6       7    5       0
5                  5          1         0    0       5       0    3       1
6                  1          5         0    8       0       1    0       0
---------------------------------------------------------------------------
Total Appearance   9,470      821       5    107     4,605   719  25      8
Some words appear fewer than 10 times in the whole dataset. The technique is to select only the words/features that appear in the dataset more than a certain number of times (say 100).
What is this technique called, i.e., the one that keeps only the features whose total count exceeds a threshold?
This technique for feature selection is rather trivial, so I don't believe it has a particular name beyond something intuitive like "low-frequency feature filtering", "k-occurrence feature filtering", or "top-k-occurrence feature selection" in the machine learning sense, and "term-frequency filtering" or "rare word removal" in the Natural Language Processing (NLP) sense.
If you'd like to use more sophisticated means of feature selection, I'd recommend looking into the various supervised and unsupervised methods available. Cai et al. [1] provide a comprehensive survey; if you can't access the article, this page by JavaTPoint covers some of the supervised methods. A quick web search for supervised/unsupervised feature selection also yields many good blog posts, most of which make use of the SciPy and sklearn Python libraries.
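If your counts are already in a DataFrame like the sample above, the filtering itself is a one-liner. Here is a minimal pandas sketch; the column values and the threshold below are just placeholders, not your real data:

```python
import pandas as pd

# Toy stand-in for the 60,000 x 30,000 count matrix from the question.
df = pd.DataFrame({
    "Microsoft": [8, 0, 0, 1, 5, 1],
    "Internet":  [2, 1, 0, 0, 1, 5],
    "PCI":       [0, 0, 0, 0, 0, 0],
    "Laptop":    [0, 1, 7, 1, 0, 8],
})

MIN_TOTAL = 10  # with the real data you might use 100, as in the question

# Keep only the features whose total count across all documents exceeds the threshold.
kept = df.loc[:, df.sum(axis=0) > MIN_TOTAL]
print(kept.columns.tolist())  # ['Microsoft', 'Laptop'] for this toy frame
```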
References
[1] Cai, J., Luo, J., Wang, S. and Yang, S., 2018. Feature selection in machine learning: A new perspective. Neurocomputing, 300, pp.70-79.
I have a dataset of factory workstations.
There are two types of errors in the same time window:
The user selects an error and a time interval (dependent variable, y).
The machines produce errors during production (independent variables, x).
There are 8 unique user-selected error types in total, so I tried to predict them using the machine-produced errors (188 types in total) and some other numerical features such as average machine speed, machine volume, etc.
Each row represents a user-selected error in a particular time window. For example, in the first line the user selects the time interval
2018-01-03 12:02:00 - 2018-01-03 12:05:37
and m_er_1 (machine error 1) also occurred 12 times in that same interval. m_er_1_dur (machine error 1 duration) is the total duration of machine error 1 in seconds.
So I matched those two tables, and the result looks like this:
user_error   m_er_1   m_er_2   m_er_3   ...   m_er_188   avg_m_speed   ...   m_er_1_dur
A            12       0        0              0          150                 217
B            0        0        2              0          10                  0
A            3        0        0              6          34                  37
A            0        0        0              0          5                   0
D            0        0        0              0          3                   0
E            0        0        0              0          1000                0
In the end, I have 1900 rows and 390 columns (376 (188 machine-error counts + 188 machine-error durations) + 14 numerical features), and because of the machine errors it is a sparse dataset with lots of zeros.
There are no outliers and no NaN values. I normalized the data and tried several classification algorithms (SVM, Logistic Regression, MLPC, XGBoost, etc.).
I also tried PCA, but it didn't work well: for 165 components the explained_variance_ratio is around 0.95.
But the accuracy metrics are very low: for logistic regression the accuracy score is 0.55 and the MCC is around 0.1; recall, F1, and precision are also very low.
Are there any steps that I am missing? What would you suggest for multiclass classification on a sparse dataset?
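For reference, here is a minimal sklearn sketch of the pipeline I described, run on synthetic stand-in data (the shapes and the steps match my setup; the random data and names are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 1900 rows, 390 mostly-zero columns, 8 error classes.
rng = np.random.default_rng(0)
X = rng.poisson(0.05, size=(1900, 390)).astype(float)
y = rng.integers(0, 8, size=1900)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Normalize -> PCA (165 components kept ~0.95 of the variance) -> classifier.
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=165),
                    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("MCC:", matthews_corrcoef(y_test, pred))
```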
Thanks in advance
I am working on an ML problem to predict house prices, and Zip Code is one feature that will be useful. I am also using a Random Forest Regressor to predict the log of the price.
However, should I use One-Hot Encoding or LabelEncoder for Zip Code? I have about 2000 Zip Codes in my dataset, and One-Hot Encoding would expand the columns significantly.
https://datascience.stackexchange.com/questions/9443/when-to-use-one-hot-encoding-vs-labelencoder-vs-dictvectorizor
To rephrase: does it make sense to use LabelEncoder instead of One-Hot Encoding on Zip Codes?
Like the link says:

> LabelEncoder can turn [dog,cat,dog,mouse,cat] into [1,2,1,3,2], but then the imposed ordinality means that the average of dog and mouse is cat. Still there are algorithms like decision trees and random forests that can work with categorical variables just fine and LabelEncoder can be used to store values using less disk space.
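For concreteness, the quoted example in sklearn looks like this (note that LabelEncoder assigns 0-based codes in alphabetical order, so the exact integers differ from the quote):

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(["dog", "cat", "dog", "mouse", "cat"])
print(codes)  # [1 0 1 2 0] -- cat=0, dog=1, mouse=2
# The integer codes impose an artificial ordering (cat < dog < mouse) that isn't real.
```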
And yes, you are right: with 2000 zip-code categories, one-hot encoding may blow up your feature set massively. In many cases where I had such problems, I opted for binary encoding, and it worked out fine most of the time, so it is worth a shot for you.
Imagine you have 9 categories; mark them from 1 to 9 and binary-encode those numbers, and you get:
cat 1 - 0 0 0 1
cat 2 - 0 0 1 0
cat 3 - 0 0 1 1
cat 4 - 0 1 0 0
cat 5 - 0 1 0 1
cat 6 - 0 1 1 0
cat 7 - 0 1 1 1
cat 8 - 1 0 0 0
cat 9 - 1 0 0 1
There you go: you overcome the LabelEncoder ordinality problem, and you also get 4 feature columns instead of the 8 or 9 you would get with one-hot encoding. This is the basic intuition behind the Binary Encoder.
**PS:** Given that 2^11 = 2048 and you have 2000 categories for zip codes, you can reduce your feature columns to 11, instead of the 1999 in the case of one-hot encoding!
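If you want to try it, the category_encoders package implements this directly. A minimal sketch (the DataFrame and column names here are hypothetical):

```python
import pandas as pd
import category_encoders as ce  # pip install category-encoders

# Hypothetical toy frame standing in for the real housing data.
df = pd.DataFrame({
    "zip_code":  ["10001", "94103", "60601", "94103", "10001"],
    "log_price": [13.1, 14.0, 12.7, 13.9, 13.2],
})

encoder = ce.BinaryEncoder(cols=["zip_code"])
encoded = encoder.fit_transform(df[["zip_code"]])
print(encoded)  # ~log2(n_categories) binary columns instead of one column per zip code
```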
Consider this data table:

NumberOfAccidents   MeanDistance
1                   5
3                   0
0                   NA
0                   NA
6                   1.2
2                   0
The first feature is the number of accidents and the second is the average distance of these accidents from a certain point. Obviously, a record with zero accidents cannot have a value for MeanDistance, so imputing these missing values is not logical!
MY SOLUTION: I have decided to discretize MeanDistance, with the NAs being one level (bin) and the rest of the data falling into bins like [0, 1), [1, 2.5), [2.5, Inf). The final table will look like this (a pandas sketch follows the table):
NumberOfAccidents   NAs   first_bin   sec_bin   third_bin
1                   0     0           0         1
3                   0     1           0         0
0                   1     0           0         0
0                   1     0           0         0
6                   0     0           1         0
2                   0     1           0         0
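Here is a minimal pandas sketch of that binning, with the bin edges and labels from my solution:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"NumberOfAccidents": [1, 3, 0, 0, 6, 2],
                   "MeanDistance": [5, 0, np.nan, np.nan, 1.2, 0]})

# Bin MeanDistance into [0,1), [1,2.5), [2.5,Inf); NaN falls outside every bin.
bins = pd.cut(df["MeanDistance"], [0, 1, 2.5, np.inf],
              right=False, labels=["first_bin", "sec_bin", "third_bin"])

dummies = pd.get_dummies(bins)                       # one indicator column per bin
dummies.insert(0, "NAs", df["MeanDistance"].isna().astype(int))
result = pd.concat([df[["NumberOfAccidents"]], dummies], axis=1)
print(result)
```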
What is your idea with these types of missing values that cannot be imputed?
what is your solution to this problem?
It really depends on the domain and on what you are trying to predict. Even though your solution is fine, I wouldn't bin the rest of the data as you did. Given that the NumberOfAccidents feature already tells you which MeanDistance values are NA, I would probably just impute 0 for the NA values (so the computations go through) and leave the rest of the data as it is.
Nevertheless, there is no need to limit yourself; just try different approaches and keep the one that boosts your KPI (Key Performance Indicator).
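In pandas, that zero-fill is a one-liner (reusing the toy table from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"NumberOfAccidents": [1, 3, 0, 0, 6, 2],
                   "MeanDistance": [5, 0, np.nan, np.nan, 1.2, 0]})

# NumberOfAccidents == 0 already flags the rows where MeanDistance is undefined,
# so zero-filling keeps the column numeric without losing information.
df["MeanDistance"] = df["MeanDistance"].fillna(0)
print(df)
```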
I performed classification on a small data set (65x9) using decision trees (Random Forest and Random Tree). I have four classes, 8 attributes, and 65 instances.
My application is in assistive robotics. I am extracting some parameters from my sensor data that I think are relevant for classifying the user's runs while they perform some task. I get the movement data from the sensor package deployed on the wheelchair. I classify certain actions, like turning 180 degrees, and give each run a mark (from 1 to 4). From the sensor package and the software I extracted parameters like velocity, distance, time, standard deviation of the velocity, etc. that are relevant for classifying the user's runs. So my data are all numbers.
When I ran the decision-tree classifiers I got these results:
=== Classifier model (full training set) ===
Random forest of 10 trees, each constructed while considering 4 random features.
Out of bag error: 0.5231
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 64 98.4615 %
Incorrectly Classified Instances 1 1.5385 %
Kappa statistic 0.9791
Mean absolute error 0.0715
Root mean squared error 0.1243
Relative absolute error 19.4396 %
Root relative squared error 29.0038 %
Total Number of Instances 65
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 c1
1 0 1 1 1 1 c2
0.952 0 1 0.952 0.976 1 c3
1 0.019 0.917 1 0.957 1 c4
Weighted Avg. 0.985 0.003 0.986 0.985 0.985 1
=== Confusion Matrix ===
a b c d <-- classified as
14 0 0 0 | a = c1
0 19 0 0 | b = c2
0 0 20 1 | c = c3
0 0 0 11 | d = c4
This is too good. Am I doing something wrong?