ELKI OPTICS pre-computed distance matrix

I can't seem to get this algorithm to work on my dataset, so I took a very small subset of my data and tried to get it to work, but that didn't work either.
I want to input a precomputed distance matrix into ELKI, and then have it find the reachability distance list of my points, but I get reachability distances of 0 for all my points.
ID=1 reachdist=Infinity predecessor=1
ID=2 reachdist=0.0 predecessor=1
ID=4 reachdist=0.0 predecessor=1
ID=3 reachdist=0.0 predecessor=1
My ELKI arguments were as follows:
Running: -dbc DBIDRangeDatabaseConnection -idgen.start 1 -idgen.count 4 -algorithm clustering.optics.OPTICSList -algorithm.distancefunction external.FileBasedDoubleDistanceFunction -distance.matrix /Users/jperrie/Documents/testfile.txt -optics.epsilon 1.0 -optics.minpts 2 -resulthandler ResultWriter -out /Applications/elki-0.7.0/elkioutputtest
I use the DBIDRangeDatabaseConnection instead of an input file to create indices 1 through 4 and pass in a distance matrix with the following format, where there are 2 indices and a distance on each line.
1 2 0.0895585119724274
1 3 0.19458931684494
2 3 0.196315720677376
1 4 0.137940123677254
2 4 0.135852232575417
3 4 0.141511023044586
Any pointers to where I'm going wrong would be appreciated.

When I change your distance matrix to start counting at 0, then it appears to work:
ID=0 reachdist=Infinity predecessor=-2147483648
ID=1 reachdist=0.0895585119724274 predecessor=-2147483648
ID=3 reachdist=0.135852232575417 predecessor=1
ID=2 reachdist=0.141511023044586 predecessor=3
Maybe you should file a bug report - to me, this appears to be a bug. Also, predecessor=-2147483648 should probably be predecessor=None or something like that.

This is due to a recent change that may not yet be correctly reflected in the documentation.
When you run multiple invocations in the MiniGUI, ELKI will assign fresh object DBIDs. So if you have a data set with 100 objects, the first run would use 0-99, the second 100-199, the third 200-299, etc. This can be desirable (if you think of longer-running processes, you want object IDs to be unique), but it can also be surprising behavior.
However, this makes precomputed distance matrices really hard to use, in particular with real data. Therefore, these classes were changed to use offsets. So the format of the distance matrix now is
DBIDoffset1 DBIDoffset2 distance
where offset 0 = start + 0 is the first object.
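In the meantime, one simple workaround is to shift an existing 1-based matrix file down by one before loading it. A minimal sketch in Python (the file names are just placeholders, not anything ELKI-specific):
with open("testfile.txt") as src, open("testfile_offsets.txt", "w") as dst:
    for line in src:
        i, j, dist = line.split()
        # shift both indices from 1-based IDs to 0-based offsets
        dst.write(f"{int(i) - 1} {int(j) - 1} {dist}\n")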
When I'm back in the office (and do not forget), I will 1. update the documentation to reflect this, 2. provide an offset parameter so that you can continue counting from 1, 3. make the default distance "NaN" or "infinity", and 4. add a sanity check that warns if you have 100 objects but distances are given for objects 1-100 instead of 0-99.

Related

How to scale % change based features so that they are viewed "similarly" by the model

I have some features that are zero-centered values and are supposed to represent the change between a current value and a previous value. Generally speaking, I believe there should be some symmetry between these values. I.e., there should be roughly the same amount of positive values as negative values, and these values should operate on roughly the same scale.
When I try to scale my samples using MaxAbsScaler, I notice that my negative values for this feature get almost completely drowned out by the positive values. And I don't really have any reason to believe my positive values should be that much larger than my negative values.
So what I've noticed is that, fundamentally, the magnitude of percentage change values is not symmetrical in scale. For example, if I have a value that goes from 50 to 200, that would result in a 300.0% change. If I have a value that goes from 200 to 50, that would result in a -75.0% change. I get that there is a reason for this, but in terms of my feature, I don't see a reason why a change from 50 to 200 should be 3x+ more "important" than the same change in value in the opposite direction.
Given this information, I do not believe there would be any reason to want my model to treat a change of 200 to 50 as a "lesser" change than a change of 50 to 200. Since I am trying to represent the change of a value over time, I want to abstract this pattern so that my model can "visualize" the change of a value over time the same way a person would.
Right now I am solving this by using this formula:
def signed_change(prev, curr):   # function wrapper added here; the original post gave only the if/else body
    if curr > prev:
        return curr / prev - 1
    else:
        return (prev / curr - 1) * -1
And this does seem to treat changes in value similarly, regardless of direction. I.e., from the example above, 50→200 = 300, 200→50 = -300. Is there a reason why I shouldn't be doing this? Does this accomplish my goal? Has anyone run into similar dilemmas?
This is a discussion question, and it's difficult to know the right answer without knowing the physical relevance of your feature. You are calculating a percentage change, and a percent change is dependent on the original value. I am not a big fan of a custom formula only to make percent change symmetric, since it adds a layer of complexity that is, in my opinion, unnecessary.
If you want change to be symmetric, you can try a direct difference or a factor change. There's nothing to suggest that difference or factor change is less correct than percent change. So, depending on the physical relevance of your feature, each of the following symmetric measures would be a correct way to measure change:
Difference change -> 50 to 200 yields 150, 200 to 50 yields -150
Factor change with logarithm -> 50 to 200 yields log(4), 200 to 50 yields log(1/4) = -log(4)
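If it helps, here is a quick numeric check (my own illustration, not from the answer itself) of how percent change is asymmetric while the log factor change is not:
import math

def pct_change(prev, curr):
    return (curr - prev) / prev * 100

def log_factor_change(prev, curr):
    return math.log(curr / prev)

print(pct_change(50, 200), pct_change(200, 50))                # 300.0 -75.0
print(log_factor_change(50, 200), log_factor_change(200, 50))  # 1.386... -1.386...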
You're having trouble because you haven't brought the abstract questions into your paradigm.
"... my model can "visualize" ... same way a person would."
In this paradigm, you need a metric for "same way". There is no such empirical standard. You've dropped both of the simple standards -- relative error and absolute error -- and you posit some inherently "normal" standard that doesn't exist.
Yes, we run into these dilemmas: choosing a success metric. You've chosen a classic example from "How To Lie With Statistics"; depending on the choice of starting and finishing proportions and the error metric, you can "prove" all sorts of things.
This brings us to your central question:
Does this accomplish my goal?
We don't know. First of all, you haven't given us your actual goal. Rather, you've given us an indefinite description and a single example of two data points. Second, you're asking the wrong entity. Make your changes, run the model on your data set, and examine the properties of the resulting predictions. Do those properties satisfy your desired end result?
For instance, given your posted data points, (200, 50) and (50, 200), how would other examples fit in, such as (1, 4), (1000, 10), etc.? If you're simply training on the proportion of change over the full range of values involved in that transaction, your proposal is just what you need: use the higher value as the basis. Since you didn't post any representative data, we have no idea what sort of distribution you have.

SPSS: Multiple data lines for individual cases

My dataset looks like this:
ID  Time  Date        v1  v2    v3  v4
1   2300  21/01/2002  1   996   5   300
1   0200  22/01/2002  3   1000  6   100
1   0400  22/01/2002  5   930   3   100
1   0700  22/01/2002  1   945   4   200
I have 50+ cases and 15+ variables in both categorical and measurement form (although SPSS will not allow me to set them as Scale - I only seem to have the options of Nominal and Ordinal?).
I am looking for trends and cannot find a way to get SPSS to recognise each case as a whole rather than as individual rows. I have used a pivot table in Excel, which gives me the means for each variable, but I am aware that this can skew the result as it removes extreme readings (which I ideally need).
I have searched this query online multiple times but I have come up blank so far, any suggestions would be gratefully received!
I'm not sure I understand. If you are saying that each case has multiple records (that is, multiple lines of data) - which is what it looks like in your example - then either
1) Your DATA LIST command needs to change to add RECORDS= (see the Help for the DATA LIST command); or
2) You will have to use CASESTOVARS (C2V) to put all the variables for a case in the same row of the Data Editor.
I may not be understanding, though.
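If it's easier to prototype the restructuring outside SPSS first, here is a rough pandas sketch (my own, with hypothetical column names) of the long-to-wide reshape that CASESTOVARS performs:
import pandas as pd

df_long = pd.DataFrame({
    "ID":   [1, 1, 1, 1],
    "Time": ["2300", "0200", "0400", "0700"],
    "v1":   [1, 3, 5, 1],
})
df_long["rec"] = df_long.groupby("ID").cumcount() + 1           # record number within each case
df_wide = df_long.pivot(index="ID", columns="rec", values=["Time", "v1"])
df_wide.columns = [f"{var}.{i}" for var, i in df_wide.columns]  # Time.1, Time.2, ..., v1.1, ...
print(df_wide)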

NetLogo Histogram - Distribution of visits

I've a simple problem.
In my code there are some patches that contain food. I computed the # of visits for each one of these "resources". Now I want to put the # of visits for each one of these "resource-patches" into a histogram.
If I write in the GUI:
histogram [visits] of resource-patches
It plots something strange.
This is because, usually, the values you put in the brackets [ ] (the patches-own attribute) become the "x" values. Instead, I want:
-on the x-axis, the labels (for example) of the resource-patches (or their amount of food), whereas
-on the y-axis, the # of visits for each one of the resource-patches.
I've been struggling since yesterday but I can't find a solution.
Thank you in advance guys!
A histogram only takes in the values for the x-axis. The y-axis will always be the number of occurrences of x in the list provided. For you, it'll plot the frequency of each of your visit counts.
If your visits list is [1 1 3 5 2 3 4 5]
You'll see a histogram of
x, y
1->2
2->1
3->2
4->1
5->2
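A quick way to sanity-check that tally (in Python, just as an illustration of the counting, not NetLogo code):
from collections import Counter
print(sorted(Counter([1, 1, 3, 5, 2, 3, 4, 5]).items()))
# [(1, 2), (2, 1), (3, 2), (4, 1), (5, 2)]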
I think you may want to look at another plotting tool if this is not what you want.

Mahout Recommender: What relative preference values are suitable for a GenericUserBasedRecommender?

In Mahout, I'm setting up a GenericUserBasedRecommender, pretty straightforward for now, typical settings.
In generating a "preference" value for an item, we have the following 5 data points:
Positive interest
User converted on item (highest possible sign of interest)
Normal like (user expressed interest, e.g. like buttons)
Indirect expression of interest (clicks, cursor movements, measuring "eyeballs")
Negative interest
Indifference (items the user ignored when active on other items, a vague expression of disinterest)
Active dislike (thumbs down, remove item from my view, etc)
Over what range should I express these different attributes? Let's use a 1-100 scale for discussion.
Should I be keeping the 'Active dislike' and 'Indifference' clustered close together, for example, at 1 and 5 respectively, with all the likes clustered in the 90-100 range?
Should 'Indifference' and 'Indirect expressions of interest' be closer to the center? As in 'Indifference' in the 20-35 range and 'Indirect like' in the 60-70 range?
Should 'User conversion' blow the scale away and be head and shoulders above the others? As in: 'User Conversion' at 100, 'Lesser likes' at ~65, 'Dislikes' clustered in the 1-10 range?
On the scale of 1-100, is 50 effectively "null", or equivalent to no data point at all?
I know the final answer lies in trial and error and in the meaning of our data, but as far as the algorithm goes, I'm trying to understand at what point I need to tip the scales between interest and disinterest for the algorithm to function properly.
The actual range does not matter, not for this implementation. 1-100 is OK, 0-1 is OK, etc. The relative values are all that really matters here.
These values are estimated by a simple (linearly) weighted average. Therefore the response ought to be "linear". It ought to match an intuition that if action X gets a score 2x higher than action Y, then X should be an indicator of twice as much interest in real life.
A decent place to start is to simply size them relative to their frequency. If click-to-conversion rate is 2%, you might make a click worth 2% of a conversion.
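For example, a back-of-the-envelope version of that sizing (my own sketch with made-up counts, not anything from the Mahout API):
# hypothetical event totals observed in your logs
event_counts = {"conversion": 20, "like": 200, "click": 1000}

# weight each event by how many of them it "takes" to equal one conversion
weights = {event: round(100 * event_counts["conversion"] / count, 1)
           for event, count in event_counts.items()}
print(weights)   # {'conversion': 100.0, 'like': 10.0, 'click': 2.0}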
I would ignore the "Indifference" signal you propose. It is likely going to be too noisy to be of use.

Package for fast determination of similarity between two bit sequences

I need to compare a query bit sequence with a database of up to a million bit sequences. All bit sequences are 100 bits long. I need the lookup to be as fast as possible. Are there any packages out there for fast determination of the similarity between two bit sequences? --Edit-- The bit sequences are position sensitive.
I have seen a possible algorithm on Bit Twiddling Hacks but if there is a ready made package that would be better.
If the database is rather static, you may want to build a tree data structure on it.
Search the tree recursively or in multiple threads and, per search, keep a running difference variable. If the running difference becomes greater than what you would consider 'similar', abort that branch of the search.
E.g., suppose we have the following tree (its levels written out row by row):
root
level 1: 0 1
level 2: 0 1 0 1
level 3: 0 1 0 1 0 1 0 1
If you want to look for patterns similar to 011, and want to allow at most 1 different bit, search like this (recursively or multi-threaded):
Start at the root.
Take the left branch (0): this matches, so the difference is still 0.
Take the left branch (0): this is different, so the difference becomes 1, which is still acceptable.
Take the left branch (0): this is different, so the difference becomes 2, which is too high. Abort looking in this branch.
Take the right branch (1): this is equal, so the difference remains 1; continue to search in this branch (not shown here).
Take the right branch (1): this is equal, so the difference remains 0; go on.
Take the left branch (0): this is different, so the difference becomes 1, which is still acceptable; go on.
This goes on until you have found your bit patterns.
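To make that concrete, here is a minimal Python sketch of the pruned search (my own illustration, not a ready-made package): it builds a small binary trie and abandons any branch once the running difference exceeds the allowed maximum.
def build_trie(patterns):
    root = {}
    for p in patterns:                    # each p is a bit string such as "011"
        node = root
        for bit in p:
            node = node.setdefault(bit, {})
        node["$"] = p                     # mark a complete pattern at the leaf
    return root

def search(node, query, pos=0, diff=0, max_diff=1, found=None):
    if found is None:
        found = []
    if diff > max_diff:                   # prune: too many differing bits already
        return found
    if pos == len(query):
        if "$" in node:
            found.append(node["$"])
        return found
    for bit in ("0", "1"):
        child = node.get(bit)
        if child is not None:
            search(child, query, pos + 1, diff + (bit != query[pos]), max_diff, found)
    return found

trie = build_trie(["000", "001", "010", "011", "100", "101", "110", "111"])
print(search(trie, "011", max_diff=1))    # ['001', '010', '011', '111']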
If your bit patterns are more dynamic and being updated in your application, you will have to update the tree.
If memory is a problem, consider going to 64-bit.
If you want to look up, let's say, the 50 best-matching patterns, and we can assume that the input data set is rather static (or can be dynamically updated), you can repeat the initial phase of the previous comment, so:
For every bit pattern, count the bits.
Store the bit patterns in a multi_map (if you use STL, Java probably has something similar)
Then, use the following algorithm:
Make 2 collections: one for storing the found patterns, one for storing possibly good patterns (this second collection should probably be a map, mapping 'distances' to patterns)
Take your own pattern and count the bits, assume this is N
Look in the multimap at index N; all these patterns will have the same bit count, but will not necessarily be completely identical
Compare all the patterns at index N. If they are equal, store the result in the first collection. If they are not equal, store the result in the second collection/map, using the difference as key.
Look in the multimap at index N-1; all these patterns will have a distance of 1 or more
Compare all the patterns at index N-1. If they have a distance of 1, store them in the first collection. If they have a larger distance, store the result in the second collection/map, using the difference as key.
Repeat for index N+1
Now look in the second collection/map and see if there is something stored with distance 1. If it is, remove them from the second collection/map and store them in the first collection.
Repeat this for distance 2, distance 3, ... until you have enough patterns.
If the number of required patterns is not too big, and the average distance is also not too big, then the number of real compares between patterns is probably only a few %.
Unfortunately, since the bit counts will be distributed roughly along a Gaussian (binomial) curve, there will still be quite a few patterns to check. I didn't do a mathematical check on it, but in practice, if you don't want too many patterns out of the millions, and the average distance is not too large, you should be able to find the set of closest patterns by checking only a few percent of the total bit patterns.
Please keep me updated of your results.
I came up with a second alternative.
For every one of the million bit patterns, count the number of set bits and store the pattern in an STL multi_map keyed by that count (if you're writing in C++).
Then count the number of bits in your pattern. Suppose you have N bits set in your bit pattern.
If you now want to allow at most D differences, look up all the bit patterns in the multi_map having N-D, N-D+1, ..., N-1, N, N+1, ... N+D-1, N+D bits.
Unfortunately, the distribution of bit patterns over the multi_map will follow a Gaussian (binomial) pattern, which means that in practice you will still have to compare quite a few bit patterns.
(Originally I thought this could be solved by counting even 0's and uneven 1's but this isn't true.)
Assuming that you want to allow 1 difference, you have to look up 3 slots in the multi_map out of the 100 possible slots, leaving you with 3% of the actual bit patterns to do a full compare.
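Here is a small Python sketch of that bucketing idea (my own illustration; a C++ version would use the multi_map in the same way): group the database by bit count, then only do full Hamming-distance compares inside the buckets whose count is within D of the query's count.
from collections import defaultdict

def build_index(patterns):                    # patterns: iterable of ints (e.g. 100-bit)
    index = defaultdict(list)
    for p in patterns:
        index[bin(p).count("1")].append(p)    # bucket by number of set bits
    return index

def lookup(index, query, max_diff):
    n = bin(query).count("1")
    matches = []
    for count in range(n - max_diff, n + max_diff + 1):
        for candidate in index.get(count, []):
            if bin(query ^ candidate).count("1") <= max_diff:   # exact Hamming distance check
                matches.append(candidate)
    return matches

index = build_index([0b0110, 0b0111, 0b1111, 0b0001])
print(lookup(index, 0b0110, max_diff=1))      # [6, 7]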
