I just need some guidance. I'm seeing a lot of directions to go and I want to see what would be my best avenue. Essentially, I have a pandas dataframe of groups similar to this (groups are in fours):
Name Role XP Acumen
0 Johnny Tsunami Driver 1000 39
1 Michael B. Jackson Pistol 2500 46
2 Bobby Zuko Pistol 3000 50
3 Greg Ritcher Lookout 200 25
4 Johnny Tsunami Driver 1000 39
5 Michael B. Jackson Pistol 2500 46
6 Bobby Zuko Pistol 3000 50
7 Appa Derren Lookout 250 30
8 Baby Hitsuo Driver 950 35
9 Michael B. Jackson Pistol 2500 46
10 Bobby Zuko Pistol 3000 50
11 Appa Derren Lookout 250 30
So basically I want to train a model to pick similar groups based on the dataframe above. The end goal is to give it a massive dataset and have it pick out rows to create groups similar to the ones above, perhaps refined so that it picks rows with similar values.
What's the best route to take? Supervised or unsupervised? Linear models? k-means clustering? Where do I need to point my research, and what are the best steps to take?
The first step I would take is to understand how you want to calculate similarity in the above data, which seems fairly categorical. The most basic approach would be to run a clustering/classification algorithm (mostly unsupervised in your case). Personally, even k-means runs fairly quickly and accurately if you have no idea how to proceed (DBSCAN is my favourite). I would also do an exploratory analysis (Self-Organizing Maps/Kohonen Maps may be useful in your case) to understand how the data is distributed.
You want to create groups and compare the groups to one another after your clustering/classification, right? Then you will also need to come up with a similarity metric, such as the KL divergence, to compare them.
The main issue is coming up with a 'k' that will cluster your data well; I feel you will need to try out different values, and your intuition will play an important role!
Links:
SOM: https://www.ncbi.nlm.nih.gov/pubmed/16566459
DBSCAN: https://scikit-learn.org/stable/modules/clustering.html#dbscan
KL Divergence/ Cross-Entropy Loss: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
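To make the clustering suggestion concrete, here is a minimal sketch of k-means on the numeric XP/Acumen columns from the dataframe above, written with plain NumPy rather than scikit-learn so the mechanics are visible. Standardizing first is my own assumption, so that XP's larger scale doesn't dominate Acumen:

```python
import numpy as np

# The numeric columns (XP, Acumen) from the question's dataframe.
X = np.array([
    [1000, 39], [2500, 46], [3000, 50], [200, 25],
    [1000, 39], [2500, 46], [3000, 50], [250, 30],
    [950, 35], [2500, 46], [3000, 50], [250, 30],
], dtype=float)

# Standardize so XP's large scale doesn't dominate Acumen.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

def kmeans(X, k, iters=100, seed=0):
    """Bare-bones k-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels

centroids, labels = kmeans(Xs, k=3)
print(labels)  # rows with similar XP/Acumen share a cluster id
```

In practice you would reach for sklearn.cluster.KMeans or DBSCAN (see the links above) instead of hand-rolling this, but the idea is the same: rows that land in the same cluster are your candidate group members.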
I have the following question. I have already created a forecasting model, but it doesn't really predict the future: I just have a bunch of data, I split it into a "training" sample and a "testing" sample, and then I can check how good my prediction is. But now I want to forecast the next 10 days, which are not in the data I have. How on earth can I do it?
Example : Let's say I have the data for these days:
04-07-2017: 213
05-07-2017: 321
06-07-2017: 111
07-07-2017: 90
08-07-2017: 78
Now I want to forecast the data for the next 3 days. How can I do it?
I would assume you could use encoder/decoder. Something similar to seq2seq as in NLU.
I tried to play with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I have its number of objects (and its percentage):
Category 1. 492 (14%)
Category 2. 574 (16%)
Category 3. 738 (21%)
Category 4. 164 (5%)
Category 5. 369 (10%)
Category 6. 123 (3%)
So I have in total 3585 objects.
I have followed the practical guide of libSVM. As a reminder:
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing sets.
By doing 5-fold cross validation, I was able to determine good values for C and gamma.
However, I obtained poor results (cross-validation accuracy is about 30-40% and my test accuracy is about 50%).
Then I thought about my data and saw that some classes are unbalanced (categories 4 and 6, for example). I discovered that libSVM has an option for class weights. That's why I would now like to set up good weights.
So far I'm doing this :
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However, the result is the same. I'm sure this is not the right way to do it, which is why I'm asking for help.
I saw some topics on the subject, but they were related to binary classification, not multiclass classification.
I know that libSVM does "one against one" (i.e., binary classifiers), but I don't know how to handle that when I have multiple classes.
Could you please help me?
Thank you in advance for your help.
I've met the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend training with a subset of the dataset.
Try to use approximately equal numbers of samples from each class. You can use all category 4 and 6 samples, and then pick about 150 samples from every other category.
I used this method and the accuracy did improve. Hope this helps!
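The subsampling recipe above can be sketched in a few lines, independent of libSVM itself. The class sizes mirror the question; the descriptors are dummy placeholders, and the cap of 150 is the rough figure suggested above (classes already smaller than the cap, like category 6, are kept whole):

```python
import random
from collections import Counter

random.seed(0)

# Dummy (label, descriptor) pairs mirroring the question's class sizes.
sizes = {1: 492, 2: 574, 3: 738, 4: 164, 5: 369, 6: 123, 7: 1025}
dataset = [(label, f"obj_{label}_{i}")
           for label, n in sizes.items() for i in range(n)]

def balanced_subset(dataset, cap=150):
    """Keep at most `cap` randomly chosen samples per class."""
    by_class = {}
    for sample in dataset:
        by_class.setdefault(sample[0], []).append(sample)
    subset = []
    for samples in by_class.values():
        subset.extend(samples if len(samples) <= cap
                      else random.sample(samples, cap))
    return subset

subset = balanced_subset(dataset)
print(Counter(label for label, _ in subset))
```

You would then write the balanced subset out in libSVM's sparse format and train on it as before.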
I am implementing a Naive Bayes classifier for text category detection.
I have 37 categories, and I get about 36% accuracy on my test set.
I want to improve accuracy, so I decided to implement 37 two-way classifiers, as suggested in many sources (Ways to improve the accuracy of a Naive Bayes Classifier? is one of them). Each of these classifiers would answer, for a given text:
specific_category OR everything_else
and I would determine the text's category by applying them sequentially.
But I've got a problem with the first classifier: it always wrongly decides in favor of the "specific_category" class.
I have training data: 37 categories, with 100 documents of the same size for each category.
For each category I selected a list of 50 features by the mutual information criterion (features are just words).
For the sake of example, I use two categories "agriculture" and "everything_else" (except agriculture).
For category "agriculture":
number of words in all documents of this class
(first term in denominator in http://nlp.stanford.edu/IR-book/pdf/13bayes.pdf, (13.7))
W_agriculture = 31649.
Size of vocabulary V_agriculture = 6951.
Log probability of Unknown word (UNK) P(UNK|agriculture) = -10.56
Log probability of class P(agriculture) = log(1/37) = -3.61 (we have 37 categories of same-size documents)
For category "everything_else":
W_everything_else = 1030043
V_everything_else = 44221
P(UNK|everything_else) = -13.89
P(everything_else) = log(36/37) = -0.03
Then I have a text not related to agriculture; suppose it consists mostly of unknown words (UNK). It has 270 words, mostly unknown to both categories "agriculture" and "everything_else". Let's assume 260 words are UNK for "everything_else" and the other 10 are known.
Then, when I calculate probabilities
P(text|agriculture) = P(agriculture) + SUM(P(UNK|agriculture) for 270 times)
P(text|everything_else) = P(everything_else) + SUM(P(UNK|everything_else) for 260 times) + SUM(P(word|everything_else) for 10 times)
In the last line we counted 260 words as UNK and 10 as known for a category.
Main problem. As P(UNK|agriculture) >> P(UNK|everything_else) (the log is much greater), the influence of those 270 P(UNK|agriculture) terms outweighs the influence of the sum of P(word|everything_else) over each word in the text.
Because
SUM(P(UNK|agriculture) for 270 times) = -2851.2
SUM(P(UNK|everything_else) for 260 times) = -3611.4
and the first sum is much larger and cannot be compensated for by P(agriculture) or by SUM(P(word|everything_else) for 10 words), because the difference is huge. So the text always falls into the "agriculture" category even though it does not belong to it.
The question is: am I missing something? How should I deal with a large number of UNK words whose probability is significantly higher for small categories?
UPD: I tried to enlarge the training data for the "agriculture" category (just concatenating the documents 36 times) to be equal in number of documents. It helped for a few categories, but not much for others; I suspect that due to the smaller number of words and dictionary size, P(UNK|specific_category) gets bigger and outweighs P(UNK|everything_else) when summed 270 times.
So it seems this method is very sensitive to the number of words in the training data and to the vocabulary size. How can I overcome this? Maybe bigrams/trigrams would help?
Right, ok. You're pretty confused, but I'll give you a couple of basic pointers.
Firstly, even if you're following a 1-vs-all scheme, you can't have different vocabularies for the different classes. If you do this, the event spaces of the random variables are different, so probabilities are not comparable. You need to decide on a single common vocabulary for all classes.
Secondly, throw out the unknown token. It doesn't help you. Ignore any words that aren't part of the vocabulary you decide upon.
Finally, I don't know what you're doing with summing probabilities. You're confused about taking logs, I think. This formula is not correct:
P(text|agriculture) = P(agriculture) + SUM(P(UNK|agriculture) for 270 times)
Instead it's:
p(text|agriculture) = p(agriculture) * p(unk|agriculture)^270 * p(all other words in doc|agriculture)
If you take logs, this becomes:
log( p(t|a) ) = log(p(agriculture)) + 270*log(p(unk|agriculture)) + log(p(all other words|agriculture))
Finally, if your classifier is right, there's no real reason to believe that one-vs-all will work better than just a straight n-way classification. Empirically it might, but theoretically their results should be equivalent. In any case, you shouldn't apply decisions sequentially, but do all n 2-way problems and assign to the class where the positive probability is highest.
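To make the pointers above concrete, here is a toy sketch of log-space Naive Bayes scoring with one shared vocabulary across all classes, Laplace smoothing instead of an UNK token, and out-of-vocabulary words simply skipped. The two tiny corpora are invented for illustration; only the priors (1/37 vs 36/37) come from the question:

```python
import math
from collections import Counter

# Toy training corpora (invented for illustration).
docs = {
    "agriculture": ["crop yield soil crop farm", "soil irrigation farm crop"],
    "everything_else": ["stock market price trade", "price trade market news"],
}
priors = {"agriculture": 1 / 37, "everything_else": 36 / 37}

# One common vocabulary for ALL classes, as recommended above.
vocab = {w for texts in docs.values() for t in texts for w in t.split()}

counts = {c: Counter(w for t in texts for w in t.split())
          for c, texts in docs.items()}
totals = {c: sum(cnt.values()) for c, cnt in counts.items()}

def log_score(text, c):
    """log p(class) + sum of log p(word|class), Laplace-smoothed.
    Words outside the shared vocabulary are ignored, not mapped to UNK."""
    score = math.log(priors[c])
    for w in text.split():
        if w not in vocab:
            continue
        score += math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
    return score

text = "crop soil farm irrigation yield zzz"  # 'zzz' is out-of-vocabulary
best = max(docs, key=lambda c: log_score(text, c))
print(best)
```

Note that with one vocabulary the smoothed scores are comparable across classes, so the runaway effect the asker describes (one class's UNK probability dominating) cannot occur.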
Dear all,
I am looking for an appropriate algorithm that can learn how some numeric values are mapped to an array.
Try to imagine that I have a training data set like this:
1 1 2 4 5 --> [0 1 5 7 8 7 1 2 3 7]
2 3 2 4 1 --> [9 9 5 6 6 6 2 4 3 5]
...
1 2 1 8 9 --> [1 4 5 8 7 4 1 2 3 4]
So, given a new set of numeric values, I would like to predict the corresponding new array:
5 8 7 4 2 --> [? ? ? ? ? ? ? ? ? ?]
Thank you very much in advance.
Best regards!
Some considerations:
Let us suppose that all numbers are integers and the length of the arrays is fixed.
The quality of each predicted array can be determined by means of a distance function that measures the similarity between the ideal and the predicted array.
This is a challenging task in general. Are your array lengths fixed? What's the loss function? For example, is it better to be "closer" on single digits -- is predicting 2 instead of 1 better than predicting 9, or doesn't it matter? Do you get credit for partial matches on the array, such as predicting the first half correctly?
In any case, classical regression or classification techniques would likely not work very well for your scenario. I think the best bet would be to try a genetic programming approach. The fitness function would then be the loss measure I mentioned earlier. You can check this nice comparison of genetic programming libraries for different languages.
This is called a structured output problem, where the target you are trying to predict is a complex structure, rather than a simple class (classification) or number (regression).
As mentioned above, the loss function is an important thing you will have to think about. Minimum edit distance, RMS or simple 0-1 loss could be used.
Structured support vector machine or variations on ridge regression for structured output problems are two known algorithms that can tackle this problem. See wikipedia of course.
We have a research group on this topic at Universite Laval (Canada), led by Mario Marchand and Francois Laviolette. You might want to search for their publications like "Risk Bounds and Learning Algorithms for the Regression Approach to Structured Output Prediction" by Sebastien Giguere et al.
Good luck!
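Whichever learner is chosen, the loss functions mentioned above are easy to pin down. Here is a sketch of 0-1 loss and RMS distance between an ideal array (taken from the question's first example) and a hypothetical prediction:

```python
import math

target    = [0, 1, 5, 7, 8, 7, 1, 2, 3, 7]   # ideal array from the question
predicted = [0, 1, 5, 6, 8, 7, 1, 2, 3, 4]   # a hypothetical prediction

def zero_one_loss(pred, ideal):
    """Fraction of positions that differ (0 = perfect match)."""
    return sum(p != t for p, t in zip(pred, ideal)) / len(ideal)

def rms_distance(pred, ideal):
    """Root-mean-square distance; penalizes being numerically far off."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, ideal)) / len(ideal))

print(zero_one_loss(predicted, target))   # 0.2 -- two positions differ
print(rms_distance(predicted, target))    # sqrt((1 + 9)/10) = 1.0
```

The two metrics encode different preferences: 0-1 loss only cares whether a position matches, while RMS rewards being numerically close, which answers the "is 2 instead of 1 better than 9" question above.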
I am studying for my Machine Learning (ML) class and I have a question that I couldn't answer with my current knowledge. Assume that I have the following data-set:
att1 att2 att3 class
5 6 10 a
2 1 5 b
47 8 4 c
4 9 8 a
4 5 6 b
The above data-set is clear, and I think I can apply classification algorithms to new incoming data after I train on it. Since each instance has a label, it is easy to understand which class each instance belongs to. Now, my question is: what if we had a class consisting of multiple instances together, such as gesture recognition data, where several instances jointly specify one class? For example:
xcor ycord depth
45 100 10
50 20 45
10 51 12
The above three instances belong to class A, and the three instances below belong to class B as a group; I mean those three data instances together constitute the class. For gesture data, these are the coordinates of the movement of your hand.
xcor ycord depth
45 100 10
50 20 45
10 51 12
Now, I want every incoming group of three instances to be classified as either A or B. Is it possible to label all of them together as A or B without labeling each instance independently? For example, assume the following group belongs to B; I want all of its instances to be labelled together as B, not individually based on each instance's independent similarity to class A or B. If this is possible, what is it called?
xcor ycord depth
45 10 10
5 20 87
10 51 44
I don't see a scenario where you would want to group an indeterminate number of rows in your dataset as features of a given class. Either each row is independently associated with a class, or the rows together form the features of a single, unique row. Something like:
Instead of
xcor ycord depth
45 10 10
5 20 87
10 51 44
Would be something like:
xcor1 ycord1 depth1 xcor2 ycord2 depth2 xcor3 ycord3 depth3
45 10 10 5 20 87 10 51 44
This is pretty much the same approach that is used to model time series.
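The reshaping described above -- turning each group of three (xcor, ycord, depth) rows into one nine-feature row carrying a single group label -- is only a few lines. The A/B labels follow the question's examples:

```python
# Each gesture: three (xcor, ycord, depth) rows, labelled as a group.
groups = [
    ([(45, 100, 10), (50, 20, 45), (10, 51, 12)], "A"),
    ([(45, 10, 10), (5, 20, 87), (10, 51, 44)], "B"),
]

def flatten(group):
    """Concatenate the rows into one feature vector:
    [xcor1, ycord1, depth1, xcor2, ycord2, depth2, xcor3, ycord3, depth3]"""
    return [value for row in group for value in row]

X = [flatten(rows) for rows, _ in groups]   # one row per gesture
y = [label for _, label in groups]          # one label per gesture
print(X[0])  # [45, 100, 10, 50, 20, 45, 10, 51, 12]
print(y)     # ['A', 'B']
```

X and y in this shape can be fed to any standard classifier, since each gesture is now a single labelled instance.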
It seems you may be confused between different types of machine learning.
The dataset given in your class is an example of a supervised classification algorithm. That is, given some data and some classes, learn a classifier that can predict classes on new, unseen data. Classifiers that you can apply to this problem include
decision trees,
support vector machines
artificial neural networks, etc.
The second problem you are describing is an example of an unsupervised classification problem. That is, given some data without labels, we want to find an automatic way to separate the different types of data (your A and B) algorithmically. Algorithms that solve this problem include
K-means clustering
Mixture models
Principal components analysis followed by some sort of clustering
I would look into running a factor analysis or normalizing your data, then running K-means or a Gaussian mixture model. This should discover the A and B types in your data if they are distinguishable.
Take a peek at the use of neural networks for recognizing hand-written text. You can think of a gesture as a hand-written figure with an additional time component (so, give each pixel an "age".) If your training data also includes similar time data, then I think the technique should carry over well.