I'm working on clustering maze patterns, which are binary sequences of 0 (available cell) and 1 (brick). Is there a good way to define how similar two patterns are? Suppose I have two patterns:
Pattern 1   Pattern 2
1000        0100
1000        0100
1000        0100
1111        0111
Obviously they are similar, but the metrics I tried give me the following results: Euclidean 2.64575131106, Cosine 0.537089950114, Jaccard 0.7. At the same time, for clearly dissimilar patterns like:
Pattern 1   Pattern 3
1000        1111
1000        0001
1000        0001
1111        0001
it gives me: Euclidean 3.16227766017, Cosine 0.714285714286, Jaccard 0.833333333333. What I don't like is that the numbers are very close. I want something like 0.1 for the first case and 0.9 for the second. Is there a solution?
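For reference, these figures are what SciPy's distance functions return when each 4x4 grid is flattened into a 16-element binary vector; a minimal sketch reproducing them (the arrays are transcribed from the grids above):

import numpy as np
from scipy.spatial import distance

# each pattern flattened row by row into a 16-element vector
p1 = np.array([1,0,0,0, 1,0,0,0, 1,0,0,0, 1,1,1,1])  # pattern 1
p2 = np.array([0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,1,1])  # pattern 2, the "similar" one
p3 = np.array([1,1,1,1, 0,0,0,1, 0,0,0,1, 0,0,0,1])  # pattern 3, the "dissimilar" one

for other in (p2, p3):
    print(distance.euclidean(p1, other),  # sqrt of the number of differing cells
          distance.cosine(p1, other),     # 1 - cos(angle between the vectors)
          distance.jaccard(p1, other))    # differing cells / cells set in either pattern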
I'm currently developing a question plugin for an LMS that auto-grades answers based on the similarity between the answer and the answer key, using cosine similarity. Lately I found an algorithm called TS-SS that is claimed to be more accurate, but its result ranges from 0 to infinity. Not being a machine learning person, I assumed the result is probably a distance, just like Euclidean distance, but I'm not sure. It may be something geometric, because the algorithm calculates a triangle and a sector, so I'm assuming it is a geometric similarity of some kind.
I have some examples in my notes, and I tried converting the result with the commonly suggested formula S = 1 / (1 + D), but the result was not what I was looking for. With cosine similarity I got 0.77, but with TS-SS plus the equation above I got 0.4. Then I found an SO answer that uses S = 1 / (1.1 ** D). When I tried that equation, sure enough it gave me a "relevant" result, 0.81. That is not far from cosine similarity, and in my opinion it is better suited for auto-grading against the answer key than the 0.77 result.
Unfortunately, I don't know where that equation comes from, and I tried to Google it with no luck, which is why I'm asking this question.
How do I convert the TS-SS result into a similarity measure the right way? Is S = 1 / (1.1 ** D) enough, or is there something better?
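For what it's worth, the two candidate conversions are easy to compare side by side; a small sketch (the sample distances are the TS-SS values from the first table in the edit below):

def sim_reciprocal(d):
    # S = 1 / (1 + D): maps D = 0 to 1 and D -> infinity to 0
    return 1.0 / (1.0 + d)

def sim_exponential(d, base=1.1):
    # S = 1 / (base ** D): same endpoints, but decays much more slowly for small D
    return 1.0 / (base ** d)

for d in (38.19, 7.065, 3.001, 1.455, 0.857, 0.006, 0.0):
    print(d, round(sim_reciprocal(d), 3), round(sim_exponential(d), 3))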
Edit:
When calculating TS-SS, cosine similarity is actually used as part of the calculation. So if the cosine similarity is 1, the TS-SS will be 0. But if the cosine similarity is 0, the TS-SS is not infinity. So I think it is reasonable to compare the results of the two to decide which conversion formula to use.
TS-SS Cosine Similarity
38.19 0
7.065 0.45
3.001 0.66
1.455 0.77
0.857 0.81
0.006 0.80
0 1
Another random comparison, from multiple answer keys:
36.89 0
9.818 0.42
7.581 0.45
3.910 0.63
2.278 0.77
2.935 0.75
1.329 0.81
0.494 0.84
0.053 0.75
0.011 0.80
0.003 0.98
0 1
A comparison from the same answer key:
38.11 0.71
4.293 0.33
1.448 0
1.203 0.17
0.527 0.62
Thank you in advance
With these new figures, the answer is simply that we can't give you an answer. The two functions give you a distance measure based on metrics that appear to be different enough that we can't simply transform between TS-SS and CS. In fact, if the two functions are continuous (which they're supposed to be for comfortable use), then the transformation between them isn't a bijection (two-way function).
For a smooth translation between the two, we need the functions to be at least continuous and differentiable over the entire interval of application: a small change in the document results in a small change in the metric. We also need them to be monotonic over the interval, such that a rise in TS-SS would always correspond to a drop in CS.
Your data tables show that we can't even craft such a transformation function for a single document, let alone for the metrics in general.
The cited question was a much simpler problem: there, the OP already had a transformation with all of the desired properties; they needed only to alter the slopes of change and ensure the boundary properties.
I am trying to use Vowpal Wabbit to do binary classification, i.e. given feature values, vw will classify each example as either 1 or 0. This is how I have the training data formatted:
1 'name | feature1:0 feature2:1 feature3:48 feature4:4881 ...
-1 'name2 | feature1:1 feature2:0 feature3:5 feature4:2565 ...
etc
I have about 30,000 positive (1) data points and about 3,000 negative (0) data points. I have 100 positive and 100 negative data points that I use for testing after I create the model. These test data points are labeled 1 by default. Here is how I format the prediction set:
1 'name | feature1:0 feature2:1 feature3:48 feature4:4881 ...
From my understanding of the VW documentation, I need to use either the logistic or the hinge loss function for binary classification. This is how I've been creating the model:
vw -d ../training_set.txt --loss_function logistic/hinge -f model
And this is how I try the predictions:
vw -d ../test_set.txt --loss_function logistic/hinge -i model -t -p /dev/stdout
However, this is where I run into problems. If I use the hinge loss function, all the predictions are -1. When I use the logistic loss function, I get arbitrary values between 5 and 11. There is a general trend for data points that should be 0 to have lower values, 5-7, and for data points that should be 1 to fall between 6 and 11. What am I doing wrong? I've looked around the documentation and checked a bunch of articles about VW to see if I can identify my problem, but I can't figure it out. Ideally I would get a 0/1 value, or a value between 0 and 1 corresponding to how confident VW is in the result. Any help would be appreciated!
If the output should be just -1 and +1 labels, use the --binary option (when testing).
If the output should be a real number between 0 and 1, use --loss_function=logistic --link=logistic. The loss_function=logistic is needed when training, so the number can be interpreted as a probability.
If the output should be a real number between -1 and 1, use --link=glf1.
If your training data is unbalanced, e.g. 10 times more positive examples than negative, but your test data is balanced (and you want to get the best loss on this test data), set the importance weight of the positive examples to 0.1 (because there are 10 times more positive examples).
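Putting those flags together with the commands from the question, the training and prediction calls would look roughly like this (a sketch; file names are the ones used above):

vw -d ../training_set.txt --loss_function logistic -f model
vw -d ../test_set.txt -i model -t --link=logistic -p /dev/stdout    # probabilities in [0, 1]
vw -d ../test_set.txt -i model -t --binary -p /dev/stdout           # hard -1/+1 labels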
Independently of your tool and/or specific algorithm, you can use learning curves and a train/cross-validation/test split to diagnose your algorithm and determine what the problem is. After diagnosing the problem you can adjust your algorithm; for example, if you find you are over-fitting you can take actions like:
Add regularization
Get more training data
Reduce the complexity of your model
Eliminate redundant features.
You can refer to Andrew Ng's "Advice for machine learning" videos on YouTube for more details on this subject.
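As a concrete starting point, scikit-learn can compute a learning curve for any estimator in a few lines; a minimal sketch (the random forest and the synthetic data are placeholders for your own model and dataset):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # placeholder data

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))

# a large, persistent gap between the curves suggests over-fitting;
# two low curves that have converged suggest under-fitting
plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, val_scores.mean(axis=1), label="cross-validation score")
plt.xlabel("training examples")
plt.ylabel("score")
plt.legend()
plt.show()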
I'm new to machine learning and seek some help.
I would like to train a network to predict the next values I expect, as follows:
reference: [val1 val2 ... val15]
val = 0 if it doesn't exist, 1 if it does.
Input: [1 1 1 0 0 0 0 0 1 1 1 0 0 0 0]
Output: [1 1 1 0 0 0 0 0 1 1 1 0 0 1 1] (last two values appear)
So my neural network would have 15 inputs and 15 outputs
I would like to know if there is a better way to do that kind of prediction. Would my data also need normalization?
Now the problem is that I don't have 15 values, but actually 600,000 of them. Can a neural network handle such big tensors? I've heard I would need twice that number of hidden-layer units.
Thanks a lot for your help, you machine learning expert!
Best
This is not a problem for the concept of a neural network: the question is whether your computing configuration and framework implementation deliver the required memory. Since you haven't described your topology, there's not a lot we can do to help you scope this out. What do you have for parameter and weight counts? Each of those is at least a single-precision float (4 bytes). For instance, a direct FC (fully-connected) layer would give you (6e5)^2 weights, or 3.6e11 * 4 bytes => 1.44e12 bytes. Yes, that's pushing 1.5 terabytes for that layer's weights.
You can get around some of this with the style of NN you choose. For instance, splitting into separate channels (say, 60 channels of 1000 features each) can give you significant memory savings, albeit at the cost of speed in training (more layers) and perhaps some accuracy (although crossover can fix a lot of that). Convolutions can also save you overall memory, again at the cost of training speed.
600K => 4 => 600K
That clarification takes care of my main worries: you have 600,000 * 4 weights in each of two places: 1,200,004 parameters and 4.8M weights. That's 6M total floats, which shouldn't stress the RAM of any modern general-purpose computer.
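As a quick sanity check on those numbers, you can count the parameters of a 600K => 4 => 600K network directly; a small sketch in PyTorch (purely illustrative, any framework reports the same order of magnitude):

import torch.nn as nn

# 600,000 inputs -> 4 hidden units -> 600,000 outputs
model = nn.Sequential(
    nn.Linear(600_000, 4),   # 600,000 * 4 weights + 4 biases
    nn.Sigmoid(),
    nn.Linear(4, 600_000),   # 4 * 600,000 weights + 600,000 biases
)

total = sum(p.numel() for p in model.parameters())
print(total)  # 5,400,004 parameters, roughly 21 MB as 4-byte floats

Depending on how the biases are tallied, that lands in the same few-million-float ballpark as above, which is trivial for modern RAM.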
The channelling idea applies when you're trying to have a fatter connection between layers, such as 600K => 600K FC. In that case, you break the data up into smaller groups (usually just 2-12) and make a bunch of parallel fully-connected streams. For instance, you could take your input and make 10 streams, each of which is a 60K => 60K FC. In the next layer, you swap the organization, "dealing out" each set of 60K so that 1/10 goes into each of the next channels.
This way, you have only 10 * 60K * 60K weights, only 10% as many as before ... but now there are 3 layers. Still, it's a 5x saving on memory required for weights, which is where you have the combinatorial explosion.
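The arithmetic behind that claim, assuming two channelled weight matrices (three layers of units instead of two) of 10 streams each, versus one direct 600K => 600K layer:

direct = 600_000 ** 2                # 3.6e11 weights for a single full 600K => 600K layer
per_channel_layer = 10 * 60_000 ** 2 # 3.6e10 weights: 10 parallel 60K => 60K streams
channelled = 2 * per_channel_layer   # two such layers after "dealing out"

print(per_channel_layer / direct)    # 0.1 -> each channelled layer has 10% of the weights
print(direct / channelled)           # 5.0 -> about a 5x saving in weight memory overall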
I am working on an ad-click recommendation system in which I have to predict whether a user will click on an advertisement. I have 98 features in total, covering both USER features and ADVERTISEMENT features. Some of the features that are very important for the prediction have string values like this:
FEATURE
Inakdtive Kunmden
Stammkfunden
Stammkdunden
Stammkfunden
guteg Quartialskunden
gutes Quartialskunden
guteg Quartialskunden
gutes Quartialskunden
There are 14 different string values like this in the whole data column. My model cannot take string values as input, so I have to convert them to categorical int values. I have no idea how to do this and make these features useful. I am using k-means clustering and the random forest algorithm.
Be careful when turning a list of string values into categorical ints, as the model will likely interpret the integers as being numerically significant, but they probably are not.
For instance, if:
'Dog'=1,'Cat'=2,'Horse'=3,'Mouse'=4,'Human'=5
Then the distance metric in your clustering algorithm would think that humans are more like mice than they are like dogs. It is usually more useful to turn them into 14 binary (one-hot) columns, e.g.
Turn this:
'Dog'
'Cat'
'Human'
'Mouse'
'Dog'
Into this:
'Dog' 'Cat' 'Mouse' 'Human'
1 0 0 0
0 1 0 0
0 0 0 1
0 0 1 0
1 0 0 0
Not this:
'Species'
1
2
5
4
1
However, if the data are going to be the 'targets' that you are classifying and not the data 'features', you can leave them as ints in most multi-classification algorithms in SciKit-Learn.
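In practice you rarely build those columns by hand; a short sketch with pandas and scikit-learn (the animal values mirror the example above, and your 14 German strings work the same way):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

values = pd.Series(['Dog', 'Cat', 'Human', 'Mouse', 'Dog'], name='Species')

# one binary column per distinct string
print(pd.get_dummies(values))

# the scikit-learn equivalent, which can be fit once and reused on new data
enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(values.to_frame()).toarray()
print(enc.categories_)
print(encoded)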
I like user1745038's answer, and it should give you reasonably good results. However, if you want to extract more meaningful features out of your strings (especially if the number of distinct strings increases significantly), consider using some NLP techniques. For example, 'Dog' and 'Cat' are more similar than 'Dog' and 'Mouse'.
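One cheap option in that direction, given that several of the values in the question look like spelling variants of the same customer segment, is a character n-gram TF-IDF similarity; a sketch (this particular technique is my suggestion, not part of the answer above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# values copied from the question; near-duplicates should come out as highly similar
values = ['Inakdtive Kunmden', 'Stammkfunden', 'Stammkdunden',
          'guteg Quartialskunden', 'gutes Quartialskunden']

# character 2-4 grams tolerate typos and small spelling variations
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
tfidf = vec.fit_transform(values)

# pairwise similarity matrix; variants of the same segment score much higher than unrelated strings
print(cosine_similarity(tfidf).round(2))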
Good luck
I'm working with SVM in R and I would appreciate any input you may provide.
I have a data set that I need to train with SVM. The format of the data is the following:
ToPredict Data1 Data2 Data3 Data4 DNA
S 1 12 1 11 000000000100
B -1 17 14 3 11011110111110111
S 1 4 0 4 0000
The question that I have is regarding the DNA column.
Is SVM able to take an input like the DNA column and still calculate reliable predictions?
For my data set, 0≠00 and 1≠001; therefore, the values cannot be treated as integers. Every digit represents information that needs to be processed, and the order is very important: it's a string of binary values, each either 1 or 0.
The 0101 information could equally be displayed as ABAB, etc. (A=0, B=1).
How can I train a SVM with the data above?
Thank you.
For SVMs to work, "all" you need is a kernel function.
So what is a sensible kernel function for your "DNA strings"? You probably don't need to be able to prove it is a proper kernel, but you can get away with a good similarity measure.
How would you evaluate the similarity of your sequences? I cannot help you with that, because I don't know what the data means; this is up to the user (i.e. you) to specify.
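Mechanically, once you have settled on a similarity, you can feed it to an SVM as a precomputed kernel matrix; kernlab in R supports custom kernels in the same spirit. A sketch in Python/scikit-learn (the character n-gram Jaccard similarity is only a stand-in for whatever domain-appropriate measure you choose):

import numpy as np
from sklearn.svm import SVC

def ngram_jaccard(a, b, n=3):
    # placeholder similarity: Jaccard overlap of character n-grams
    A = {a[i:i + n] for i in range(max(len(a) - n + 1, 1))}
    B = {b[i:i + n] for i in range(max(len(b) - n + 1, 1))}
    return len(A & B) / len(A | B)

# DNA strings and labels taken from the question's example rows
dna = ['000000000100', '11011110111110111', '0000']
y = ['S', 'B', 'S']

gram = np.array([[ngram_jaccard(a, b) for b in dna] for a in dna])
clf = SVC(kernel='precomputed').fit(gram, y)

# to classify new strings (made up here), compute their similarity to the training strings
new = ['0001000', '1101101']
test_gram = np.array([[ngram_jaccard(a, b) for b in dna] for a in new])
print(clf.predict(test_gram))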