Agglomerative hierarchical clustering technique

Earlier in the year, my AI lecturer taught us about agglomerative hierarchical clustering and k-means clustering, but I have lost his explanations, and I'm trying to figure out how he uses the data in the table below to create a dendrogram. Any help would be great.
He comes up with the following table that outlines the similarities:
Animal pair           Similarity
Bear & Tiger              4
Bear & Dog                3
Bear & Giant Squid        2
Bear & Cat                3
Tiger & Dog               2
Tiger & Giant Squid       1
Tiger & Cat               2
Dog & Giant Squid         2
Dog & Cat                 5
Giant Squid & Cat         2
And the resulting dendrogram from the table above:

Always connect the most similar ones first.
Dog and Cat are most similar, and are thus connected at level 5.
Bear and Tiger are connected at level 4.
From there, clusters are merged by cluster-to-cluster similarity. Assuming single linkage (a cluster pair's similarity is the highest similarity between any of their members), {Bear, Tiger} joins {Dog, Cat} at level 3 (Bear & Dog = Bear & Cat = 3), and Giant Squid joins last at level 2.
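If you want to reproduce this programmatically, here is a minimal sketch using SciPy. It assumes we turn each similarity s into a distance 6 - s (so the most similar pair gets the smallest distance) and use single linkage; the labels and matrix come straight from the table above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

labels = ["Bear", "Tiger", "Dog", "Giant Squid", "Cat"]
# Symmetric similarity matrix from the table (row/column order matches labels).
sim = np.array([
    [5, 4, 3, 2, 3],
    [4, 5, 2, 1, 2],
    [3, 2, 5, 2, 5],
    [2, 1, 2, 5, 2],
    [3, 2, 5, 2, 5],
])
dist = 6 - sim                # higher similarity -> smaller distance
np.fill_diagonal(dist, 0)     # each animal is at distance 0 from itself
Z = linkage(squareform(dist), method="single")
dendrogram(Z, labels=labels)  # Dog/Cat merge first, then Bear/Tiger, etc.
plt.show()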

Number of outputs in dense softmax layer

I've been working through a Coursera course for extra practice and ran into an issue I don't understand.
Link to the Colab notebook.
As far as I've worked on ML neural network problems, I've always been taught that the output layer of a multiclass classification problem should be Dense, with the number of nodes equal to the number of classes. E.g. dog, cat, horse: 3 classes = 3 nodes.
However, in the notebook there are 5 classes in the labels (checked using len(label_tokenizer.word_index)), but with 5 nodes I had terrible results, while with 6 nodes the model worked properly.
Can anyone please explain why this is the case? I can't find any online example explaining this. Cheers!
I figured it out. A dense output layer trained with (sparse) categorical cross-entropy expects the labels/targets to start from zero. For example:
cat - 0
dog - 1
horse - 2
In this case, the number of dense nodes is 3.
However, in the Colab, the labels were generated using the Keras Tokenizer, which assigns indices starting from 1 (because index 0 is usually reserved for padding).
from tensorflow.keras.preprocessing.text import Tokenizer

label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
# label_tokenizer.word_index:
# {'business': 2, 'entertainment': 5, 'politics': 3, 'sport': 1, 'tech': 4}
This leads to a weird case where, with 5 dense nodes, the output classes are indices 0-4, which doesn't match up with our labels, which run 1-5.
I proved this empirically by rerunning the code with all labels reduced by 1: the model then trains successfully with 5 dense nodes, since the labels are now 0-4.
I suspect that labels 1-5 with 6 dense nodes work because the model simply learns that class 0 is never used and focuses on 1-5.
If anyone understands the inner workings of categorical cross entropy, do feel free to add on!
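To make the failure mode concrete, here is a minimal standalone sketch (my own toy example, not the course notebook) showing that sparse categorical cross-entropy requires integer labels in the range [0, num_classes):

import numpy as np
import tensorflow as tf

labels_1_to_5 = np.array([1, 2, 3, 4, 5])   # Tokenizer-style labels
labels_0_to_4 = labels_1_to_5 - 1           # shifted to start at zero

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

x = np.random.rand(5, 10).astype("float32")
model.fit(x, labels_0_to_4, epochs=1, verbose=0)   # works: labels hit nodes 0..4
# model.fit(x, labels_1_to_5, epochs=1, verbose=0) # fails: label 5 has no node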

Predicting the next destination. What type of classification is it?

I have a very different machine learning problem, or at least I am facing such a problem for the first time.
It would be really great if you could guide me on how to solve it.
I have 3 datasets, as follows:

hotel  star_rating
1      3
2      2

user  home_continent  gender
1     2               female
2     3               female
3     1               male

user  hotel
1     39
1     44
2     63
I need to find which hotel a user will visit next.
To me it does not look like a normal classification or regression problem. There are 66 hotels and 4400 users in total.
Can you please guide me?
Thanks.
It is a ranking problem (which, in many cases, can be solved by applying classification techniques).
Have a look at the documentation for this Kaggle competition. It poses essentially the same kind of problem you are trying to solve.
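To illustrate the "ranking via classification" idea, here is a hedged sketch (with tiny stand-in data, not a full solution): fit a classifier over the hotel ids, then rank hotels for a user by predicted probability.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny stand-ins for the datasets above (values are illustrative only).
users = pd.DataFrame({"user": [1, 2, 3],
                      "home_continent": [2, 3, 1],
                      "gender": ["female", "female", "male"]})
visits = pd.DataFrame({"user": [1, 1, 2, 3],
                       "hotel": [39, 44, 63, 39]})

# Join user features onto each observed visit; the hotel id is the class.
data = visits.merge(users, on="user")
X = pd.get_dummies(data[["home_continent", "gender"]].astype(str))
y = data["hotel"]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Rank hotels for the first row's user by predicted probability.
probs = clf.predict_proba(X.iloc[[0]])[0]
print(sorted(zip(clf.classes_, probs), key=lambda p: -p[1]))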

LibSVM - Multi class classification with unbalanced data

I have been playing with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I have its number of objects (and its percentage):
Category 1:  492 (14%)
Category 2:  574 (16%)
Category 3:  738 (21%)
Category 4:  164 (5%)
Category 5:  369 (10%)
Category 6:  123 (3%)
Category 7: 1025 (30%)
So I have in total 3585 objects.
I have followed the practical guide to libsvm. As a reminder, the steps are:
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing sets.
By doing 5-fold cross-validation, I was able to determine good values for C and gamma.
However, I obtained poor results (cross-validation accuracy is about 30-40% and my test accuracy is about 50%).
Then I thought about my data and saw that some classes are unbalanced (categories 4 and 6, for example). I discovered that libSVM has an option for class weights, so I would now like to set up good weights.
So far I'm doing this :
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However, the results are the same. I'm sure this is not the right way to do it, and that's why I'm asking for help.
I saw some topics on the subject, but they were related to binary classification, not multiclass classification.
I know that libSVM does "one against one" (so binary classifiers), but I don't know how to handle that when I have multiple classes.
Could you please help me? Thank you in advance.
I've met the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend training with a subset of the dataset.
Try to use an approximately equal number of samples from each class. You can use all of the category 4 and 6 samples, and then pick about 150 samples from each of the other categories.
I used this method and the accuracy did improve. Hope this helps you!
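Here is a minimal sketch of that undersampling idea, using scikit-learn's SVC rather than the libsvm command line (my own substitution; for reference, the -wi flags correspond to SVC's class_weight parameter, and class_weight="balanced" applies the usual inverse-frequency heuristic):

import numpy as np
from sklearn.svm import SVC

def undersample(X, y, per_class=150, seed=0):
    # Keep at most per_class samples of each class (all of them if fewer).
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        keep.extend(idx[:per_class])
    keep = np.asarray(keep)
    return X[keep], y[keep]

# Stand-ins for the 3585 descriptors and their 7 category labels.
X = np.random.rand(3585, 10)
y = np.random.randint(1, 8, size=3585)

Xb, yb = undersample(X, y)
clf = SVC(C=1.0, gamma="scale").fit(Xb, yb)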

Genetic Algorithm - Fitness function and Rule optimization

Let's say I have a set of training examples where A_i is an attribute and the output is Iris-setosa
The values in the dataset are:

A1  A2  A3  A4  outcome
3   5   2   2   Iris-setosa
3   4   2   2   Iris-setosa
2   4   2   2   Iris-setosa
3   6   2   2   Iris-setosa
2   5   3   2   Iris-setosa
3   5   2   2   Iris-setosa
3   5   2   3   Iris-setosa
4   6   2   2   Iris-setosa
3   7   2   2   Iris-setosa
From analysis, the ranges of the attributes are:
A1 ----> [2,3,4]
A2 ----> [4,5,6,7]
A3 ----> [2,3]
A4 ----> [2,3]
I have defined:
A1 ----> [Low(2),Medium(3),High(4)]
A2 ----> [Low(4,5),Medium(6),High(7)]
A3 ----> [Low(<2),Medium(2),High(3)]
A4 ----> [Low(<2),Medium(2),High(3)]
I have mapped the dataset as below:
A1      A2      A3      A4      outcome
Medium  Low     Medium  Medium  Iris-setosa
Medium  Low     Medium  Medium  Iris-setosa
Low     Low     Medium  Medium  Iris-setosa
Medium  Medium  Medium  Medium  Iris-setosa
Low     Low     High    Medium  Iris-setosa
Medium  Low     Medium  Medium  Iris-setosa
Medium  Low     Medium  High    Iris-setosa
High    Medium  Medium  Medium  Iris-setosa
Medium  High    Medium  Medium  Iris-setosa
I know I have to define the fitness function. What is it for this problem? In my actual problem there are 50 training examples, but this is a similar problem.
How can I optimize rules by using a GA? How should I encode them?
Suppose I input (4,7,2,3); how can optimization help me classify whether the input is Iris-setosa or not?
Thank you for your patience.
The task you describe is known as one-class classification.
Identifying elements of a specific class amongst all elements, by learning from a training set containing only the objects of that class is
... different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all the classes.
A viable approach is to build the outlier-class data artificially and train a two-class model, but it can be tricky: when generating artificial outlier data you need a wider range of possible values than the target data (you have to ensure that the target data is surrounded in all attribute directions), and the resulting two-class training set tends to be unbalanced and large.
Anyway:
if you want to try Genetic Programming for one-class classification, take a look at
One-Class Genetic Programming - Robert Curry, Malcolm I. Heywood (presented at EuroGP'10, the 13th European Conference on Genetic Programming);
also consider anomaly detection techniques (a simple introduction is week 9 of the Coursera Machine Learning class by Andrew Ng; notes here).
Okay, if you just want to know how to program a fitness function... Assume the training data is a list of tuples, like so:
training_data = [(3, 6, 3, 5), (8, 3, 1, 2), (3, 5, 2, 4)]
Make a reference set for the values of A1, A2, etc. as follows; the first tuple tells us the length of all the others (that way you can have any number of tuples in your training data):
A = []
for index in range(len(training_data[0])):
    # Collect every value seen at this attribute position across all examples.
    A.append(set(row[index] for row in training_data))
Now all your reference data is easy to refer to (the sets A[0], A[1], etc.). Let's make a fitness function that takes a tuple and returns a fitness score that will help a GA converge on a right answer (1-4 points if the right elements are present, 5 more if the tuple is in training_data). Play around with the scoring, but this should work fine.
def fitness_function(target):
    # Assume target is a tuple of the same length as the reference data.
    global A, training_data
    score = 0
    # Give a point for each element that is in the data set.
    index = 0
    for t in target:
        if t in A[index]:
            score += 1
        index += 1
    # Give 5 points if the entire tuple is an exact match.
    if target in training_data:
        score += 5
    return score
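As a follow-up usage illustration (my own toy sketch, not part of the original answer): a minimal GA that keeps the fittest tuples, mutates them with values drawn from the reference sets, and converges toward members of training_data.

import random

training_data = [(3, 6, 3, 5), (8, 3, 1, 2), (3, 5, 2, 4)]
A = [set(row[i] for row in training_data) for i in range(len(training_data[0]))]

def fitness_function(target):
    score = sum(1 for i, t in enumerate(target) if t in A[i])
    return score + (5 if target in training_data else 0)

def mutate(t):
    # Swap one position for a value actually observed at that position.
    i = random.randrange(len(t))
    child = list(t)
    child[i] = random.choice(sorted(A[i]))
    return tuple(child)

population = [tuple(random.randint(1, 9) for _ in range(4)) for _ in range(20)]
for _ in range(50):
    population.sort(key=fitness_function, reverse=True)
    survivors = population[:10]
    population = survivors + [mutate(random.choice(survivors)) for _ in range(10)]

best = max(population, key=fitness_function)
print(best, fitness_function(best))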
What you have here is a multi-class classification problem that can be solved with Genetic Programming and related techniques.
I suppose that data are those from the well-known Iris data set: https://en.wikipedia.org/wiki/Iris_flower_data_set
If you need a quick start, you can use the source code of my method, Multi Expression Programming (which is based on Genetic Programming), which can be downloaded from here: https://github.com/mepx/mep-basic-src
There is a C++ source file named mep_multi_class.cpp in the src folder which can "solve" the iris dataset. Just call the read_training_data function with the iris.txt file (which can also be downloaded from the dataset folder on GitHub).
Or, if you are not familiar with C++, you can try the MEPX software directly, which has a simple user interface: http://www.mepx.org. A project with the iris dataset can also be downloaded from GitHub.

Best data analytic techniques/models for personal project

I'm not really sure how to word this, and I'm sorry if the formatting is wrong, but I'm trying to get a foundation to be able to tackle this problem myself.
I am trying to develop a prediction algorithm for a set of data of "Hip Surgery Patients" that looks like:
Readmission Time | Symptom Code | Symptom Note    | Related
6                | 2334         | swelling in hip | Yes
12               | 1324         | anxiety         | Maybe
8                | 2334         | swelling in hip | Yes
30               | 1111         | Headaches       | No
3                | 7934         | easily bruising | Yes
For context, doctors can identify whether or not a given Symptom Code is related to the hip replacement surgery that occurred X days ago. I have about 200 entries in my data set that match this format, and my goal is to be able to match results in the given set, as well as predict new results in the Related column (with certainty statistics on predicted results) based on new inputs. For example, given:
Input: 20 | 2334 | swelling in hip
Output: Yes (90% confidence)
I'm very new to data analytics and machine learning, so I would really just like some pointers on what to look up or where to start my research. I imagine there's an optimal function/model that would handle this best, but as I said, I'm very new to the topic, so I have no clue where to start. Since I have a relatively small data set, I'm looking for a technique that isn't easily overfit, if possible.
I really appreciate any help and pointers on where to get started.
Based on your data snippet, it looks like a multiclass classification problem (the three classes being Yes, Maybe, or No).
Your columns (aside from Related) will be your features, which can be reduced to numeric representations. For instance:
For the Symptom Note Feature, you can have a mapping as seen below:
Swelling in hip = 1
Anxiety = 2
Swelling = 3
Easily Bruised = 4
Obviously this can work if you have a definite number of symptoms in this column. Machine learning algorithms usually work with numbers, so your features will be extracted from the raw data into numeric form. Once that has been done, you can feed the data into a classification algorithm. The naive Bayes algorithm is a great place to start, as sketched below.
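A minimal sketch of that pipeline (my own illustration, assuming pandas and scikit-learn, with the toy rows from the question; GaussianNB over one-hot codes is crude, but it shows the shape of the approach):

import pandas as pd
from sklearn.naive_bayes import GaussianNB

df = pd.DataFrame({
    "readmission_time": [6, 12, 8, 30, 3],
    "symptom_code": [2334, 1324, 2334, 1111, 7934],
    "related": ["Yes", "Maybe", "Yes", "No", "Yes"],
})

# One-hot encode the symptom code (it's an id, not a quantity).
X = pd.get_dummies(df[["readmission_time", "symptom_code"]], columns=["symptom_code"])
y = df["related"]

clf = GaussianNB().fit(X, y)

# Score a new visit: 20 days, code 2334 ("swelling in hip").
new = pd.DataFrame({"readmission_time": [20], "symptom_code": [2334]})
new_X = pd.get_dummies(new, columns=["symptom_code"]).reindex(columns=X.columns, fill_value=0)
# predict_proba gives the "certainty statistics" the question asks for.
print(dict(zip(clf.classes_, clf.predict_proba(new_X)[0])))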
Scikit-learn (if you can work with Python) has a great introductory example of a 3-class classification task where all the features are numbers. It tries to classify different types of iris flowers based on sepal length, sepal width, petal length and petal width.
The full tutorial can be found here: Supervised learning: predicting an output variable from high-dimensional observations
Is it feasible to get additional data? If it is, I suggest you get more. 200 instances is quite small and may not properly represent the feature space. In addition, it will be useful to split the data into training and test sets, further reducing the quantity available for training. You can also opt for k-fold cross-validation.
In summary: navigate to that scikit-learn page and try out the flower classification example. Once you're familiar with the environment, your data will need some cleaning and feature extraction. You will need to answer questions like: what is the meaning of the Readmission Time and Symptom Code? Are those values over a specified range with a special internal meaning, or are they just randomly assigned numbers like an id?
I would recommend transcribing your data into ARFF format and then using it with Weka. Weka is a program with many machine learning algorithms you can experiment with; it also has a very simple user interface, so it is good for beginners. Once you have found an algorithm that works well, you can save your trained model and use it to predict new instances.
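For reference, a hedged sketch of what the ARFF file for this dataset might look like (the relation and attribute names are my own guesses from the table above):

@relation hip_surgery_patients

@attribute readmission_time numeric
@attribute symptom_code {1111,1324,2334,7934}
@attribute symptom_note {'swelling in hip','anxiety','Headaches','easily bruising'}
@attribute related {Yes,No,Maybe}

@data
6,2334,'swelling in hip',Yes
12,1324,'anxiety',Maybe
8,2334,'swelling in hip',Yes
30,1111,'Headaches',No
3,7934,'easily bruising',Yes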
