I am considering implementing a complete-linkage clustering algorithm from scratch for study purposes. I've seen that there is a big difference compared to single linkage:
Unlike single linkage, the complete-linkage method can be strongly affected by ties (draw cases, where two pairs of groups/clusters have the same distance value in the distance matrix).
I'd like to see an example of a distance matrix where this occurs and understand why it happens.
Consider the 1-dimensional data set
1 2 3 4 5 6 7 8 9 10
Depending on how you do the first merges, you can get pretty good or pretty bad results. All adjacent pairs are tied at distance 1, so the first merges are pure tie-breaking: for example, first merge 2-3, 5-6 and 8-9, then grow these to 2-3-4 and 7-8-9. Compare this to the "obvious" result that most humans would produce.
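To see one merge sequence concretely, here is a minimal sketch (assuming SciPy is installed); note that SciPy breaks distance ties in its own fixed order, so other implementations may return a different tree:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# the 1-dimensional data set above, as a column of observations
X = np.arange(1, 11, dtype=float).reshape(-1, 1)

# complete linkage; every adjacent pair is tied at distance 1,
# so the early merge order is pure tie-breaking
Z = linkage(X, method='complete')
print(Z)  # each row: [cluster_i, cluster_j, merge distance, new size]
print(fcluster(Z, t=2, criterion='maxclust'))  # flat labels for 2 clusters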
I am having trouble with a classification problem.
I have almost 400k vectors in my training data with two labels, and I'd like to train an MLP that classifies the data into two classes.
However, the dataset is heavily imbalanced: 95% of the vectors have label 1, the rest have label 0. The accuracy grows as training progresses and stops after reaching 95%. I guess this is because the network predicts label 1 for all vectors.
So far, I have tried dropout layers with 0.5 probability, but the result is the same. Are there any ways to improve the accuracy?
I think the best way to deal with unbalanced data is to use class weights. For example, you can weight your classes so that the sum of the weights for each class is equal.
import pandas as pd

df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})
# weight every row so that each class's weights sum to len(df) / 2
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())
print(df)
# dict aggregation was removed from pandas; use named aggregation instead
print(df.groupby('y')['weight'].agg(samples='size', weight='sum'))
output:
x y weight
0 0 0 1.75
1 1 0 1.75
2 2 1 0.70
3 3 1 0.70
4 4 1 0.70
5 5 1 0.70
6 6 1 0.70
   samples  weight
y
0        2     3.5
1        5     3.5
You could try another classifier on a subset of the examples. SVMs may work well with small data, so you could take, say, only 10k examples, with a 5/1 proportion between the classes.
You could also oversample the small class and under-sample the other; a rough sketch of that idea follows.
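A minimal sketch (assuming NumPy arrays X and y with labels 0 and 1; the 4x factors are arbitrary) of random over- and under-sampling:

import numpy as np

def resample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 0)   # the 5% class
    majority = np.flatnonzero(y == 1)   # the 95% class
    # duplicate minority rows, drop a random share of majority rows
    up = rng.choice(minority, size=len(minority) * 4, replace=True)
    down = rng.choice(majority, size=len(majority) // 4, replace=False)
    keep = np.concatenate([up, down])
    rng.shuffle(keep)
    return X[keep], y[keep]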
You can also simply weight your classes.
Also think about a proper metric. It's good that you noticed that your model predicts only one label; that failure, however, is not easily seen using accuracy (per-class precision and recall expose it immediately).
Some nice ideas about unbalanced dataset here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
That's a common situation: the network learns a constant prediction and can't get out of this local minimum.
When the data is very unbalanced, as in your case, one possible solution is a weighted cross-entropy loss function. For instance, in TensorFlow you can apply the built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
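As a minimal sketch (assuming TensorFlow 2.x; the exact class ratio is a placeholder): pos_weight scales the positive-class term of the loss, and since label 1 is the majority class here, a value below 1 down-weights it:

import tensorflow as tf

def weighted_bce(labels, logits):
    # ~5% zeros vs ~95% ones -> weight the positive (majority) class by 5/95
    return tf.reduce_mean(
        tf.nn.weighted_cross_entropy_with_logits(
            labels=labels, logits=logits, pos_weight=5.0 / 95.0))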
But I should say that getting more data to balance both classes (if that's possible) will always help.
What is the best approach for dealing with a feature that is both numerical and categorical? Take the following feature X for example:
X
1
5
3
0
1
10
10
7
0
5
9
9
Here X represents a credit score that should range from 1 to 10, and X=0 means that for this instance the credit score doesn't exist.
How should I deal with this when using models like random forest or logistic regression for a classification problem? Thank you.
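One common treatment (a sketch of one option, not the only one; pandas assumed) is to keep the numeric score, add a binary "missing" indicator, and replace the 0 sentinel so it no longer behaves like a score below 1:

import pandas as pd

X = pd.DataFrame({'credit_score': [1., 5., 3., 0., 1., 10., 10., 7., 0., 5., 9., 9.]})
# flag the instances where the credit score doesn't exist
X['score_missing'] = (X['credit_score'] == 0).astype(int)
# impute the sentinel with the median of the real scores (an arbitrary choice)
median = X.loc[X['credit_score'] != 0, 'credit_score'].median()
X.loc[X['credit_score'] == 0, 'credit_score'] = median
print(X)

A random forest can often split on the indicator directly, while logistic regression benefits from removing the sentinel so the score keeps a meaningful scale.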
I tried playing with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I have its number of objects (and its percentage):
Category 1. 492 (14%)
Category 2. 574 (16%)
Category 3. 738 (21%)
Category 4. 164 (5%)
Category 5. 369 (10%)
Category 6. 123 (3%)
Category7. 1025 (30%)
So I have 3485 objects in total.
I have followed the practical guide of libsvm.
Here it is as a reminder:
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing.
By doing 5-fold cross-validation, I was able to determine good values for C and gamma.
However, I obtained poor results (the CV accuracy is about 30-40% and my test accuracy is about 50%).
Then I thought about my data and saw that it is somewhat unbalanced (categories 4 and 6, for example). I discovered that libSVM has an option for class weights, so I would now like to set up good weights.
So far I'm doing this :
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
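(As a reference point, a common heuristic, not something the libSVM guide prescribes, is to make each -wN inversely proportional to its class frequency; a rough Python sketch using the counts above:)

counts = {1: 492, 2: 574, 3: 738, 4: 164, 5: 369, 6: 123, 7: 1025}
total = sum(counts.values())
# weight_c = total / (n_classes * count_c): rarer classes get larger weights
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
print(" ".join(f"-w{c} {w:.2f}" for c, w in weights.items()))
# e.g. category 6 gets ~4.0 while category 7 gets ~0.5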
However, the results are the same. I'm sure this is not the right way to do it, and that's why I'm asking for help.
I saw some topics on the subject, but they were related to binary classification, not multiclass classification.
I know that libSVM does "one against one" (so binary classifiers internally), but I don't know how to handle the weights when I have multiple classes.
Could you please help me ?
Thank you in advance for your help.
I've met the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend that you train with a subset of the dataset.
Try to use an approximately equal number of samples from each class: you can use all of the category 4 and 6 samples, and then pick about 150 samples from every other category.
I used this method and the accuracy did improve. Hope this helps!
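A minimal sketch (assuming NumPy arrays X and y holding the descriptors and category labels) of that balanced subsampling:

import numpy as np

def balanced_subsample(X, y, per_class=150, seed=0):
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # classes smaller than per_class (like category 6) are kept whole
        n = min(per_class, len(idx))
        keep.extend(rng.choice(idx, size=n, replace=False))
    keep = np.asarray(keep)
    return X[keep], y[keep]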
Let's say I have a set of training examples where A_i is an attribute and the output is Iris-setosa
The values in the dataset are
A1, A2, A3, A4 outcome
3 5 2 2 Iris-setosa
3 4 2 2 Iris-setosa
2 4 2 2 Iris-setosa
3 6 2 2 Iris-setosa
2 5 3 2 Iris-setosa
3 5 2 2 Iris-setosa
3 5 2 3 Iris-setosa
4 6 2 2 Iris-setosa
3 7 2 2 Iris-setosa
From analysis, the ranges of the attributes are:
A1 ----> [2,3,4]
A2 ----> [4,5,6,7]
A3 ----> [2,3]
A4 ----> [2,3]
I have defined:
A1 ----> [Low(2),Medium(3),High(4)]
A2 ----> [Low(4,5),Medium(6),High(7)]
A3 ----> [Low(<2),Medium(2),High(3)]
A4 ----> [Low(<2),Medium(2),High(3)]
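In code, the discretization could be sketched like this (the bin tables come from the definitions above; anything outside them falls back to Low):

bins = [
    {2: 'Low', 3: 'Medium', 4: 'High'},            # A1
    {4: 'Low', 5: 'Low', 6: 'Medium', 7: 'High'},  # A2
    {2: 'Medium', 3: 'High'},                      # A3, values < 2 -> Low
    {2: 'Medium', 3: 'High'},                      # A4, values < 2 -> Low
]

def discretize(row):
    return tuple(bins[i].get(v, 'Low') for i, v in enumerate(row))

print(discretize((3, 5, 2, 2)))  # ('Medium', 'Low', 'Medium', 'Medium')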
I have discretized the set as below:
A1, A2, A3, A4 outcome
Medium Low Medium Medium Iris-setosa
Medium Low Medium Medium Iris-setosa
Low Low Medium Medium Iris-setosa
Medium Medium Medium Medium Iris-setosa
Low Low High Medium Iris-setosa
Medium Low Medium Medium Iris-setosa
Medium Low Medium High Iris-setosa
High Medium Medium Medium Iris-setosa
Medium High Medium Medium Iris-setosa
I know I have to define a fitness function. What should it be for this problem? In my actual problem there are 50 training examples, but this is a similar problem.
How can I optimize the rules using a GA? How should I encode them?
Suppose I input (4,7,2,3); how can the optimization help me classify whether the input is Iris-setosa or not?
Thank You for your patience.
The task you describe is known as one-class classification.
Identifying elements of a specific class amongst all elements, by learning from a training set containing only the objects of that class is
... different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all the classes.
A viable approach is to build the outlier-class data artificially and train a two-class model, but it can be tricky.
When generating artificial outlier data you need a wider range of possible values than the target data covers (you have to ensure that the target data is surrounded in all attribute directions).
The resulting two-class training data set tends to be unbalanced and large.
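A minimal sketch (assuming the target class sits in a NumPy array X, one row per example) of generating such surrounding outliers:

import numpy as np

def make_outliers(X, n_outliers, margin=0.5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = hi - lo
    # sample uniformly from a box wider than the target data
    # in every attribute direction
    return rng.uniform(lo - margin * span, hi + margin * span,
                       size=(n_outliers, X.shape[1]))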
Anyway, if you want to try Genetic Programming for one-class classification, take a look at:
One-Class Genetic Programming - Robert Curry, Malcolm I. Heywood (presented at EuroGP'10, the 13th European Conference on Genetic Programming).
Also consider anomaly detection techniques (a simple introduction is the 9th week of the Coursera Machine Learning class by Andrew Ng; notes here).
Okay, if you just want to know how to program a fitness function... Assume the training data is a list of tuples, like so:
training_data = [(3, 6, 3, 5), (8, 3, 1, 2), (3, 5, 2, 4), ...]
Make a reference set for the elements of A1, A2, etc. as follows, assuming the first tuple tells us the length of all the others (that way you can have any number of tuples in your training data):
A = []
for index in range(len(training_data[0])):
    # collect every value seen at this attribute position
    A.append(set(row[index] for row in training_data))
Now all your reference data is easy to refer to (sets A[0], A[1], etc.). Let's make a fitness function that takes a tuple and returns a fitness score that will help a GA converge on a right answer (up to 4 points for the right elements, plus 5 more if the tuple is in training_data). Play around with the scoring, but these should work fine.
def fitness_function(target):
    # target is a tuple of the same length as the reference data
    score = 0
    # give a point for each element that appears at that
    # position somewhere in the data set
    for index, t in enumerate(target):
        if t in A[index]:
            score += 1
    # give 5 bonus points if the entire tuple is an exact match
    if target in training_data:
        score += 5
    return score
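For example, with the hypothetical training_data above:

print(fitness_function((3, 6, 3, 5)))  # exact match: 4 + 5 = 9
print(fitness_function((3, 6, 9, 5)))  # three elements in range: 3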
What you have here is a multi-class classification problem that can be solved with Genetic Programming and related techniques.
I suppose that the data are from the well-known Iris data set: https://en.wikipedia.org/wiki/Iris_flower_data_set
If you need a quick start, you can use the source code of my method, Multi Expression Programming (which is based on Genetic Programming), which can be downloaded from here: https://github.com/mepx/mep-basic-src
There is a C++ source file named mep_multi_class.cpp in the src folder which can "solve" the iris dataset. Just call the read_training_data function with the iris.txt file (which can also be downloaded from the dataset folder on GitHub).
Or, if you are not familiar with C++, you can try the MEPX software directly, which has a simple user interface: http://www.mepx.org. A project with the iris dataset can also be downloaded from GitHub.