Machine learning, classification type

I am studying for my Machine Learning (ML) class and I have a question that I couldn't answer with my current knowledge. Assume that I have the following dataset:
att1  att2  att3  class
5     6     10    a
2     1     5     b
47    8     4     c
4     9     8     a
4     5     6     b
The above dataset is clear, and I think I can apply classification algorithms to new incoming data after training on it. Since each instance has a label, it is easy to see which class each instance belongs to. Now, my question is: what if a class consisted of several instances taken together, as in gesture recognition data? Each class would have multiple instances that jointly specify it. For example,
xcor ycord depth
45 100 10
50 20 45
10 51 12
The above three instances belong to class A, and the three below belong to class B as a group; I mean that those three data instances together constitute the class. For gesture data, these would be the coordinates of the movement of your hand.
xcor ycord depth
45 100 10
50 20 45
10 51 12
Now, I want every incoming group of three instances to be classified as either A or B. Is it possible to label them all together as A or B, without labeling each instance independently? As an example, assume the following group belongs to B: I want all of its instances labelled together as B, not individually according to each one's own similarity to class A or B. If this is possible, what is it called?
xcor ycord depth
45 10 10
5 20 87
10 51 44

I don't see a scenario where you would want to group an indeterminate number of rows in your dataset as features of a given class. Either each row is independently associated with a class, or all the values are features of one unique row. So instead of
xcor ycord depth
45 10 10
5 20 87
10 51 44
you would have something like:
xcor1  ycord1  depth1  xcor2  ycord2  depth2  xcor3  ycord3  depth3
45     10      10      5      20      87      10     51      44
This is pretty much the same approach that is used to model time series.
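As a minimal sketch of that flattening (NumPy assumed; the values are the sample group from the question), each group of three (xcor, ycord, depth) rows becomes one nine-feature row:

import numpy as np

# The three rows of one gesture group (sample values from the question)
group = np.array([[45, 100, 10],
                  [50,  20, 45],
                  [10,  51, 12]])

# Flatten to a single training row:
# xcor1, ycord1, depth1, ..., xcor3, ycord3, depth3
row = group.reshape(-1)
print(row)  # [ 45 100  10  50  20  45  10  51  12]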

It seems you may be confusing different types of machine learning.
The dataset given in your class is an example of a supervised classification problem. That is, given some data and some classes, learn a classifier that can predict classes on new, unseen data. Classifiers that you can apply to this problem include
decision trees,
support vector machines
artificial neural networks, etc.
The second problem you are describing is an example of an unsupervised classification problem. That is, given some data without labels, we want to find an automatic way to separate the different types of data (your A and B) algorithmically. Algorithms that solve this problem include
K-means clustering
Mixture models
Principal components analysis followed by some sort of clustering
I would look into running a factor analysis or normalizing your data, then running k-means or a Gaussian mixture model. This should discover the A and B types in your data if they are distinguishable.
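A hedged sketch of that pipeline (scikit-learn assumed; the rows are the sample values from the question):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Six gesture rows from the question; no labels are given to the algorithms
X = np.array([[45, 100, 10], [50, 20, 45], [10, 51, 12],
              [45, 10, 10], [5, 20, 87], [10, 51, 44]], dtype=float)

X_std = StandardScaler().fit_transform(X)  # normalize first, as suggested
print(KMeans(n_clusters=2, n_init=10).fit_predict(X_std))
print(GaussianMixture(n_components=2, random_state=0).fit_predict(X_std))

With only six rows this is a toy illustration; whether the A and B types separate cleanly depends on how distinguishable they are after scaling.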

Take a peek at the use of neural networks for recognizing handwritten text. You can think of a gesture as a handwritten figure with an additional time component (so, give each pixel an "age"). If your training data also includes similar time data, then I think the technique should carry over well.
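A toy sketch of that "age" encoding (NumPy assumed; the grid size and path values are hypothetical):

import numpy as np

# (x, y) samples of a gesture over time (toy values)
path = [(45, 100), (50, 20), (10, 51)]

canvas = np.zeros((128, 128))           # hypothetical image grid
for t, (x, y) in enumerate(path):
    canvas[y, x] = (t + 1) / len(path)  # later samples get a larger "age"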

Related

Ideas for model selection for predicting sales at locations based on time component and class column

I am trying to build a model for sales prediction for three different storages based on previous sales. However, there is an extra (and very important) component: a column with the values A and B. These letters indicate a price category, where A signifies a comparatively cheaper price compared to similar products. Here is a mock example of the table:
week  Letter  Storage1 sales  Storage2 sales  Storage3 sales
1     A       50              28              34
2     A       47              29              19
3     B       13              11              19
4     B       14              19              8
5     B       21              13              3
6     A       39              25              23
I have previously worked with both types of prediction problems separately, namely time series analysis and regression problems, using both classical methods and machine learning, but I have not built a model that takes both prediction types into account.
I am writing this to hear suggestions on how to tackle such a prediction problem. I am thinking of converting the three storage sales columns into one, in order to have a single feature column, and adding three one-hot encoded columns to indicate the storage. However, I am not sure how to tackle this problem with a machine learning approach and would like to hear if anyone knows where to start.
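A sketch of that reshaping (pandas assumed; column names and values taken from the mock table above):

import pandas as pd

df = pd.DataFrame({
    'week':   [1, 2, 3, 4, 5, 6],
    'Letter': ['A', 'A', 'B', 'B', 'B', 'A'],
    'Storage1 sales': [50, 47, 13, 14, 21, 39],
    'Storage2 sales': [28, 29, 11, 19, 13, 25],
    'Storage3 sales': [34, 19, 19, 8, 3, 23],
})

# Melt the three storage columns into one sales column plus a storage label
long = df.melt(id_vars=['week', 'Letter'],
               var_name='storage', value_name='sales')

# One-hot encode the storage indicator (and the price category)
long = pd.get_dummies(long, columns=['storage', 'Letter'])
print(long.head())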

External linkage - what to do when there is a tie

I am considering implementing a complete-linkage clustering algorithm from scratch for study purposes. I've seen that there is a big difference compared to single linkage:
Unlike single linkage, the complete linkage method can be strongly affected by draw cases (where there are 2 groups/clusters with the same distance value in the distance matrix).
I'd like to see an example of a distance matrix where this occurs and understand why it happens.
Consider the 1-dimensional data set
1 2 3 4 5 6 7 8 9 10
Depending on how you do the first merges, you can get pretty good or pretty bad results. Every adjacent pair is at distance 1, so the first merges are all ties. For example, first merge 2-3, 5-6 and 8-9, then 2-3-4 and 7-8-9. Compare this to the "obvious" result that most humans would produce.
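To see the tie-breaking effect concretely, here is a hedged sketch (SciPy assumed): since all adjacent pairs sit at distance 1, the library breaks ties by its internal ordering, so permuting the input can change the whole merge sequence.

import numpy as np
from scipy.cluster.hierarchy import linkage

# The 1-dimensional data set from above, as a column of observations
X = np.arange(1, 11, dtype=float).reshape(-1, 1)

# Complete linkage on the points in their given order
print(linkage(X, method='complete'))

# A permuted copy of the same points can merge in a different order
print(linkage(np.random.default_rng(0).permutation(X), method='complete'))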

Is a small vocabulary for Neural Nets ok?

I am designing a neural network to try to generate music. The network would be a 2-layer LSTM (Long Short-Term Memory).
I am hoping to encode the music into a many-hot format for training, i.e., a position is 1 if that note is playing and 0 if it is not.
Here is an excerpt of what this data would look like:
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000011010100100001010000000000000000000000
There are 88 columns which represent 88 notes, and each row represents a new beat. The output will be at a character level.
I am just wondering since there are only 2 characters in the vocabulary, would the probability of a 0 being next always be higher than the probability of a 1 being next?
I know that a large vocabulary needs a large training set, but I only have a small vocabulary. I have 229 files, which correspond to about 50,000 lines of text. Is this enough to prevent the output being all 0s?
Also, would it be better to have 88 nodes, 1 for each note, or just one node for one character at a time?
Thanks in advance
A small vocabulary is fine as long as your dataset is not skewed overwhelmingly toward one of the "words".
As to "would it be better to have 88 nodes, 1 for each note, or just one node for one character at a time?", each timestep is represented as 88 characters. Each character is a feature of that timestep. Your LSTM should be outputting the next timestep, so you should have 88 nodes. Each node should output the probability of that node being present in that timestep.
Finally, since you are building a char-RNN, I would strongly suggest using ABC notation to represent your data. A song in ABC notation looks like this:
X:1
T:Speed the Plough
M:4/4
C:Trad.
K:G
|:GABc dedB|dedB dedB|c2ec B2dB|c2A2 A2BA|
GABc dedB|dedB dedB|c2ec B2dB|A2F2 G4:|
|:g2gf gdBd|g2f2 e2d2|c2ec B2dB|c2A2 A2df|
g2gf g2Bd|g2f2 e2d2|c2ec B2dB|A2F2 G4:|
This is perfect for char-RNNs because it represents every song as a sequence of characters, and you can run conversions from MIDI to ABC and vice versa. All you have to do is train your model to predict the next character in this sequence instead of dealing with 88 output nodes.
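As a small sketch of the character-level setup (plain Python; the text is the first two tune lines from above):

# Two lines of the ABC tune above, used as raw training text
abc = ("|:GABc dedB|dedB dedB|c2ec B2dB|c2A2 A2BA|\n"
       "GABc dedB|dedB dedB|c2ec B2dB|A2F2 G4:|")

chars = sorted(set(abc))                  # the character vocabulary
idx = {c: i for i, c in enumerate(chars)}
encoded = [idx[c] for c in abc]

# Next-character prediction pairs: each character predicts its successor
pairs = list(zip(encoded[:-1], encoded[1:]))
print(len(chars), pairs[:5])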

Unbalanced model, confused as to what steps to take

This is my first data mining project. I am using SAS Enterprise Miner to train and test a classifier.
I have 3 files at my disposal:
Training file: 85 input variables and 1 target variable, with 5800+ observations
Prediction file: 85 input variables with 4000 observations
Verification file: 1 variable containing the correct predictions for the second file. Since this is an academic project, this file is there to tell us whether we are doing a good job or not.
My problem is that the dataset is unbalanced (95% 0s and 5% 1s for the target variable in the training file). So naturally, I tried to re-sample the data using the "sampling node" as described in the following link.
Here are the 2 approaches I used. They give slightly different results, but here is the generally unsatisfactory outcome I am getting:
Without resampling: the model predicts fewer than ten solicited individuals (target variable = 1) out of 4000 observations.
With resampling: the model predicts about 1500 solicited individuals out of 4000 observations.
I am looking for 100 to 200 solicited individuals for the model to be considered acceptable.
Why do you think our predictions are so far off, and how can we remedy this situation?
Here is a screen shot of both models
There are some techniques to deal with unbalanced data. One that I remember from many years ago is this approach:
Say you have 100 solicited (minority) observations that make up 5% of all your observations.
Cluster the non-solicited (majority) class into 20 groups (each with about 100 non-solicited observations) using a clustering algorithm such as k-means, mean shift, or DBSCAN.
Then, for each group of clustered majority observations, create a dataset containing all 100 solicited (minority) observations. You now have 20 datasets, each balanced with 100 solicited and about 100 non-solicited observations.
Train on each balanced group and create a model for each of them.
At prediction time, run all 20 models and vote; for example, if 15 out of 20 models say an individual is solicited, label it solicited.
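A sketch of that procedure (scikit-learn assumed; the classifier choice and helper names are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def train_balanced_ensemble(X, y, n_groups=20):
    # Cluster the majority class, pair each cluster with the full minority class
    X_min, X_maj = X[y == 1], X[y == 0]
    groups = KMeans(n_clusters=n_groups, n_init=10).fit_predict(X_maj)
    models = []
    for g in range(n_groups):
        Xg = np.vstack([X_min, X_maj[groups == g]])
        yg = np.hstack([np.ones(len(X_min)), np.zeros((groups == g).sum())])
        models.append(DecisionTreeClassifier().fit(Xg, yg))
    return models

def predict_by_vote(models, X_new, threshold=0.75):
    # With threshold=0.75 and 20 models, 15 must agree (as in the answer)
    votes = np.mean([m.predict(X_new) for m in models], axis=0)
    return (votes >= threshold).astype(int)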

LibSVM - Multi class classification with unbalanced data

I tried to play with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I have its number of objects (and its percentage):
Category 1: 492 (14%)
Category 2: 574 (16%)
Category 3: 738 (21%)
Category 4: 164 (5%)
Category 5: 369 (10%)
Category 6: 123 (3%)
Category 7: 1025 (30%)
So I have 3585 objects in total.
I have followed the libsvm practical guide. As a reminder, the steps are:
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing sets.
By doing 5-fold cross-validation, I was able to determine good values for C and gamma.
However, I obtained poor results (CV accuracy is about 30-40% and my test accuracy is about 50%).
Then, thinking about my data, I saw that some classes are unbalanced (categories 4 and 6, for example). I discovered that libSVM has an option for class weights, which is why I would now like to set appropriate weights.
So far I'm doing this :
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However, the results are the same. I'm sure this is not the right way to do it, which is why I'm asking for help.
I saw some topics on the subject, but they were related to binary classification, not multiclass classification.
I know that libSVM does "one against one" (so, binary classifiers), but I don't know how to handle that with multiple classes.
Could you please help me?
Thank you in advance for your help.
I've met the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend training with a subset of the dataset.
Try to use approximately equal numbers of samples from the different classes. You can use all the category 4 and 6 samples, and then pick about 150 samples from each of the other categories.
I used this method and the accuracy did improve. Hope this helps!
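A sketch of that subsampling (NumPy assumed; the helper name and the 150-sample cap are illustrative):

import numpy as np

def balanced_subset(X, y, cap=150, seed=0):
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        if len(idx) > cap:
            # Large class: randomly downsample to the cap
            idx = rng.choice(idx, size=cap, replace=False)
        # Rare classes (4 and 6 here) fall through with all their samples
        keep.append(idx)
    keep = np.concatenate(keep)
    return X[keep], y[keep]

The balanced subset can then be written out in libsvm format and passed to svm-train as before.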
