I have a dataset of factory workstations.
There are two types of errors that can occur in the same time interval:
User-selected errors with a time interval (dependent variable, y)
Errors the machines produce during production (independent variables, x)
There are 8 unique user-selected error types in total, so I tried to predict those errors using the machine-produced errors (188 types in total) and some other numerical features such as avg. machine speed, machine volume, etc.
Each row represents a user-selected error in a particular time interval.
For example, in the first line the user selects the time interval
2018-01-03 12:02:00 - 2018-01-03 12:05:37
and m_er_1 (machine error 1) also occurred 12 times in that same interval.
m_er_1_dur (machine error 1 duration) is the total duration of that machine error in seconds.
So I matched those two tables; the result looks like the table below (a rough sketch of the matching step follows it):
user_error  m_er_1  m_er_2  m_er_3  ...  m_er_188  avg_m_speed  ...  m_er_1_dur
A               12       0       0              0          150              217
B                0       0       2              0           10                0
A                3       0       0              6           34               37
A                0       0       0              0            5                0
D                0       0       0              0            3                0
E                0       0       0              0         1000                0
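The matching step looks roughly like this in pandas; the table layouts and column names here are hypothetical placeholders, not my real schema:

import pandas as pd

# Hypothetical layout: one table of user-selected error intervals and one table of
# machine error events; all column names are placeholders for the real data.
user_errors = pd.DataFrame({
    "user_error": ["A", "B"],
    "start": pd.to_datetime(["2018-01-03 12:02:00", "2018-01-03 12:10:00"]),
    "end":   pd.to_datetime(["2018-01-03 12:05:37", "2018-01-03 12:12:00"]),
})
machine_errors = pd.DataFrame({
    "error_type": ["m_er_1", "m_er_1", "m_er_3"],
    "timestamp": pd.to_datetime(["2018-01-03 12:03:10",
                                 "2018-01-03 12:04:55",
                                 "2018-01-03 12:11:20"]),
    "duration_s": [100, 117, 40],
})

rows = []
for _, interval in user_errors.iterrows():
    # machine errors whose timestamp falls inside the user-selected interval
    inside = machine_errors[machine_errors["timestamp"].between(interval["start"],
                                                                interval["end"])]
    row = {"user_error": interval["user_error"]}
    # per-error-type counts (m_er_1, ...) and total durations (m_er_1_dur, ...)
    row.update(inside.groupby("error_type").size().to_dict())
    row.update(inside.groupby("error_type")["duration_s"].sum().add_suffix("_dur").to_dict())
    rows.append(row)

features = pd.DataFrame(rows).fillna(0)
print(features)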
In the end, I have 1900 rows and 390 columns (376 (188 machine error counts + 188 machine error durations) + 14 numerical features), and because most machine errors do not occur in a given interval it is a sparse dataset with lots of zeros.
There are no outliers and no NaN values. I normalized the data and tried several classification algorithms (SVM, Logistic Regression, MLPC, XGBoost, etc.).
I also tried PCA, but it didn't work well: the explained_variance_ratio only reaches about 0.95 at around 165 components.
But the accuracy metrics are very low: for logistic regression the accuracy score is 0.55 and the MCC is around 0.1; recall, F1, and precision are also very low.
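For reference, my current setup looks roughly like the following scikit-learn sketch (the data here is a random stand-in with the same shape as mine, not my real features):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Stand-in for the real data: 1900 rows, 390 mostly-zero features, 8 classes
X = np.random.poisson(0.1, size=(1900, 390)).astype(float)
y = np.random.choice(list("ABCDEFGH"), size=1900)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean())  # around 0.55 on my real data; near chance on this random stand-in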
Are there any steps that I'm missing? What would you suggest for multiclass classification on a sparse dataset?
Thanks in advance
I have a dataset that came from NLP processing of technical documents.
My dataset has 60,000 records and 30,000 features, and each value is the number of times that word/feature appeared.
Here is a sample of the dataset:
RowID             Microsoft  Internet  PCI  Laptop  Google  AWS  iPhone  Chrome
1                         8         2    0       0       5    1       0       0
2                         0         1    0       1       1    4       1       0
3                         0         0    0       7       1    0       5       0
4                         1         0    0       1       6    7       5       0
5                         5         1    0       0       5    0       3       1
6                         1         5    0       8       0    1       0       0
--------------------------------------------------------------------------------
Total appearance      9,470       821    5     107   4,605  719      25       8
There are some words that appeared fewer than 10 times in the whole dataset.
The technique is to select only the words/features that appear in the dataset more than a certain number of times (say 100).
What is this technique called, i.e. the one that keeps only features whose total count exceeds a certain threshold?
This technique for feature selection is rather trivial, so I don't believe it has a particular name beyond something intuitive like "low-frequency feature filtering", "k-occurrence feature filtering", or "top-k occurrence feature selection" in the machine-learning sense, and "term-frequency filtering" or "rare word removal" in the Natural Language Processing (NLP) sense.
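For illustration, a minimal sketch of this kind of filter, assuming the counts live in a pandas DataFrame shaped like the sample above:

import pandas as pd

# Toy version of the count matrix from the question
counts = pd.DataFrame({
    "Microsoft": [8, 0, 0, 1, 5, 1],
    "PCI":       [0, 0, 0, 0, 0, 0],
    "Laptop":    [0, 1, 7, 1, 0, 8],
})

min_total = 5  # keep only words whose total count across all documents exceeds this
keep = counts.sum(axis=0) > min_total
filtered = counts.loc[:, keep]
print(filtered.columns.tolist())  # ['Microsoft', 'Laptop']

If the data starts from raw text, scikit-learn's CountVectorizer offers a related cutoff via its min_df parameter, though that one counts the number of documents a term appears in rather than its total count.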
If you'd like to use more sophisticated means of feature selection, I'd recommend looking into the various supervised and unsupervised methods available. Cai et al. [1] provide a comprehensive survey; if you can't access the article, this page by JavaTpoint covers some of the supervised methods. A quick web search for supervised/unsupervised feature selection also yields many good blogs, most of which make use of the SciPy and scikit-learn Python libraries.
References
[1] Cai, J., Luo, J., Wang, S. and Yang, S., 2018. Feature selection in machine learning: A new perspective. Neurocomputing, 300, pp.70-79.
I have used Weka to train on my dataset, but I don't know whether I got a good result.
Can someone give me some ideas?
This is my result:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 2823 97.9188 %
Incorrectly Classified Instances 60 2.0812 %
Kappa statistic 0
Mean absolute error 0.0208
Root mean squared error 0.1443
Relative absolute error 50.6234 %
Root relative squared error 101.0568 %
Coverage of cases (0.95 level) 97.9188 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 2883
Ignored Class Unknown Instances 119
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.000 0.000 0.000 0.000 0.000 0.500 0.020 0
1.000 1.000 0.979 1.000 0.989 0.000 0.500 0.940 1
Weighted Avg. 0.979 0.979 0.959 0.979 0.969 0.000 0.500 0.921
=== Confusion Matrix ===
a b <-- classified as
0 60 | a = 0
0 2823 | b = 1
Two classes were used. Every instance tagged with class b (1) was classified correctly, so the Correctly Classified Instances rate is 2823/(2823+60) = 97.9188 % and the TP rate for class 1 is 1.000. However, none of the instances tagged with class a (0) was classified correctly, so the Incorrectly Classified Instances rate is 60/(2823+60) = 2.0812 % and the TP rate for class 0 is 0.000.
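To see the same effect outside Weka, here is a minimal scikit-learn check with labels reconstructed from the confusion matrix above (all 2883 classified instances are predicted as class 1):

import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, matthews_corrcoef

# 60 true 0s and 2823 true 1s, everything predicted as class 1
y_true = np.array([0] * 60 + [1] * 2823)
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))     # ~0.979, matching "Correctly Classified Instances"
print(cohen_kappa_score(y_true, y_pred))  # 0.0: no better than always guessing the majority class
print(matthews_corrcoef(y_true, y_pred))  # 0.0, matching the MCC column

So the high accuracy only reflects the class imbalance; the kappa and MCC of 0 show the classifier never identifies the minority class.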
In Mahout there is an implemented method for item-based collaborative filtering called itemsimilarity.
In theory, the similarity between two items should be calculated only over the users who rated both items. During testing I realized that Mahout works differently.
In the example below, the similarity between items 11 and 12 should be equal to 1, but the Mahout output is 0.36.
Example 1: items 11 and 12.
Similarity between items:
101 102 0.36602540378443865
Matrix with preferences:
     11   12
1     1
2     1
3     1    1
4          1
It looks like Mahout treats null as 0.
Example 2: items 101-103.
Similarity between items:
101 102 0.2612038749637414
101 103 0.4340578302732228
102 103 0.2600070276638468
Matrix with preferences:
101 102 103
1 1 0.1
2 1 0.1
3 1 0.1
4 1 1 0.1
5 1 1 0.1
6 1 0.1
7 1 0.1
8 1 0.1
9 1 0.1
10 1 0.1
The similarity between items 101 and 102 should be calculated using only the ratings of users 4 and 5, and the same logic applies to items 101 and 103 (at least according to the theory). Here (101, 103) comes out more similar than (101, 102), and it shouldn't.
Both examples were run without any additional parameters.
Is this problem solved somewhere, somehow? Any ideas?
Source: http://files.grouplens.org/papers/www10_sarwar.pdf
Those users are not identical. Collaborative filtering needs a measure of co-occurrence, and the same items do not co-occur between those users. Likewise, the items are not identical; they each have different users who preferred them.
The data is turned into a "sparse matrix" where only non-zero values are recorded. The rest are treated as 0; this is expected and correct. The algorithms treat 0 as "no preference", not as a negative preference.
It's doing the right thing.
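To make the "null is treated as 0" point concrete, here is a small NumPy sketch of Example 1, assuming a Euclidean-distance-based similarity of the form 1/(1+d) (an assumption, but it happens to reproduce the 0.366 above):

import numpy as np

# Preference matrix from Example 1 (rows = users 1-4, columns = items 11 and 12);
# user 4's single rating is assumed to belong to item 12, and missing preferences
# are stored as 0 in the sparse representation
prefs = np.array([
    [1, 0],
    [1, 0],
    [1, 1],
    [0, 1],
])
item_11, item_12 = prefs[:, 0], prefs[:, 1]

# similarity with missing values treated as 0
d = np.linalg.norm(item_11 - item_12)
print(1 / (1 + d))  # 0.366..., matching the Mahout output

# restricting to users who rated both items (only user 3) gives identical vectors,
# hence the similarity of 1 that the question expected
both = (item_11 > 0) & (item_12 > 0)
d_common = np.linalg.norm(item_11[both] - item_12[both])
print(1 / (1 + d_common))  # 1.0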
I am trying to output the predictions for a test dataset after loading a model into Weka. The file is in .csv format and the classifier I am using is NaiveBayes. I am setting the supplied test set to a test file which has about 110,000 instances with the label set to ?. When I run the model on this test file and output the predictions into a .csv file, I get a file like this:
inst# actual predicted error prediction
1 0 0.677 0.677
2 0 0.978 0.978
3 0 1 1
4 0 1 1
5 0 0.991 0.991
6 0 0.996 0.996
7 0 1 1
8 0 0.999 0.999
9 0 0.996 0.996
10 0 0.965 0.965
Can anyone tell me why the prediction column is empty? Why isn't the label being printed, and how can I resolve this?
I am very new to Weka and was unable to solve this.
I performed classification on a small dataset (65x9) using decision trees (Random Forest and Random Tree). I have four classes, 8 attributes, and 65 instances.
My application is in assistive robotics, so I'm extracting some parameters from my sensor data that I think are relevant for classifying the user's run while they perform some task. I get the movement data from the sensor package deployed on the wheelchair. I classify certain actions, like turning 180 degrees, and give the user a mark (from 1 to 4). From the sensor package and the software I extracted parameters like velocity, distance, time, standard deviation of the velocity, etc. that are relevant for classifying the user's run, so my data are all numbers.
When I performed the decision tree classification I got these results:
=== Classifier model (full training set) ===
Random forest of 10 trees, each constructed while considering 4 random features.
Out of bag error: 0.5231
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 64 98.4615 %
Incorrectly Classified Instances 1 1.5385 %
Kappa statistic 0.9791
Mean absolute error 0.0715
Root mean squared error 0.1243
Relative absolute error 19.4396 %
Root relative squared error 29.0038 %
Total Number of Instances 65
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 c1
1 0 1 1 1 1 c2
0.952 0 1 0.952 0.976 1 c3
1 0.019 0.917 1 0.957 1 c4
Weighted Avg. 0.985 0.003 0.986 0.985 0.985 1
=== Confusion Matrix ===
a b c d <-- classified as
14 0 0 0 | a = c1
0 19 0 0 | b = c2
0 0 20 1 | c = c3
0 0 0 11 | d = c4
This seems too good to be true. Am I doing something wrong?
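To check whether the 98 % is just training-set optimism, one comparison I can make is accuracy on the training set versus cross-validated accuracy. Below is a minimal scikit-learn sketch on random data of the same shape as mine (purely illustrative, not my sensor data):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Random data of the same shape (65 instances, 8 features, 4 classes)
rng = np.random.default_rng(0)
X = rng.normal(size=(65, 8))
y = rng.integers(0, 4, size=65)

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))                          # near 1.0: accuracy on the training set
print(cross_val_score(rf, X, y, cv=5).mean())  # near 0.25: chance level for 4 classes

Even on pure noise, a random forest scored on its own training set looks almost perfect, which is why the "Evaluation on training set" numbers above can be misleading compared with the out-of-bag error of 0.5231.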