I've a simple problem.
In my code there are some patches that contain food. I computed the # of visits for each one of these "resources". Now I want to put in a histogram the # of visits for each one of these "resource-patches" .
If I write in the GUI:
histogram [visits] of resource-patches
It plots me something strange.
This is because, usually, you put the "x" values in the brackets [ ] of the patches-own attribute. Instead, I want that
-in the "x-axis" there are the labels (for example) of the resource-patches (or their amount of food) whereas
-in "y-axis" I want the #of visits for each one of the resource-patches.
I'm struggling since yesterday but I can't find a solution.
Thank you in advance guys!
A histogram only takes in the values for the x-axis. The y-axis will always be the number of occurrences of x in the list provided. For you, it'll plot the frequency of each of yours visits.
If your visits list is [1 1 3 5 2 3 4 5]
You'll see a histogram of
x, y
1->2
2->1
3->1
4->1
5->2
I think you may want to look at another plotting tool if this is not what you want.
Related
I can't seem to get this algorithm to work on my dataset, so I took a very small subset of my data and tried to get it to work, but that didn't work either.
I want to input a precomputed distance matrix into ELKI, and then have it find the reachability distance list of my points, but I get reachability distances of 0 for all my points.
ID=1 reachdist=Infinity predecessor=1
ID=2 reachdist=0.0 predecessor=1
ID=4 reachdist=0.0 predecessor=1
ID=3 reachdist=0.0 predecessor=1
My ELKI arguments were as follows:
Running: -dbc DBIDRangeDatabaseConnection -idgen.start 1 -idgen.count 4 -algorithm clustering.optics.OPTICSList -algorithm.distancefunction external.FileBasedDoubleDistanceFunction -distance.matrix /Users/jperrie/Documents/testfile.txt -optics.epsilon 1.0 -optics.minpts 2 -resulthandler ResultWriter -out /Applications/elki-0.7.0/elkioutputtest
I use the DBIDRangeDatabaseConnection instead of an input file to create indices 1 through 4 and pass in a distance matrix with the following format, where there are 2 indices and a distance on each line.
1 2 0.0895585119724274
1 3 0.19458931684494
2 3 0.196315720677376
1 4 0.137940123677254
2 4 0.135852232575417
3 4 0.141511023044586
Any pointers to where I'm going wrong would be appreciated.
When I change your distance matrix to start counting at 0, then it appears to work:
ID=0 reachdist=Infinity predecessor=-2147483648
ID=1 reachdist=0.0895585119724274 predecessor=-2147483648
ID=3 reachdist=0.135852232575417 predecessor=1
ID=2 reachdist=0.141511023044586 predecessor=3
Maybe you should file a bug report - to me, this appears to be a bug. Also, predecessor=-2147483648 should probably be predecessor=None or something like that.
This is due to a recent change, that may not yet be correctly presented in the documentation.
When you do multiple invocations in the MiniGUI, ELKI will assign fresh object DBIDs. So if you have a data set with 100 objects, the first run would use 0-99, the second 100-199 the third 200-299 etc. - this can be desired (if you think of longer running processes, you want object IDs to be unique), but it can also be surprising behavior.
However, this makes precomputed distance matrixes really hard to use; in particular with real data. Therefore, these classes were changed to use offsets. So the format of the distance matrix now is
DBIDoffset1 DBIDoffset2 distance
where offset 0 = start + 0 is the first object.
When I'm back in the office (and do not forget), I will 1. update the documentation to reflect this, provide 2. an offset parameter so that you can continue counting starting at 1, 3. make the default distance "NaN" or "infinity", and 4. add a sanity check that warns if you have 100 objects, but distances are given for objects 1-100 instead of 0-99.
first time user of this forum - guidance on how to provide enough information is very appreciated. I am trying to replicate the presentation of data used in the Medical education field. This will help improve the quality of examiners' marking of trainees in a Clinical Exam. What I would like to communicate will be similar to what is already communicated in the College of General Practitioners regarding one of their own exams, please see www.gp10.com.au/slides/thursday/slide29.pdf to help understand what it is I want to present. I have access to Excel, SPSS and R, so any help with any of these would be great. However as a first attempt I have used SPSS and created 3 variables: dummy variable, a "station score" and a "global rating score"(GRS). The "station score"(ST) is a value between 0 and 10 (non-integers) and is on the y-axis similar to the pdf presentation of "Candidate Final Marks". The x-axis is the "global rating scale", an integer from 1 to 6 and is represented in the pdf as the "Overall Performance Scale". When I use SPSS's boxplot I get a boxplot as depicted.
.
What I would like to do is overlay a single examiners own scoring of X number of examinees. So for one examiner (examiner A) provided the following marks:
ST: 5.53,7.38,7.38,7.44,6.81
GRS: 3,4,4,5,3
(this is transposed into two columns).
Whether it be SPSS, Excel or R how would I be able to overlay the box and whisker plots with the individual data points provided by the one examiner? This would help show the degree to which the examiners' marking styles are in concordance with the expected distribution of ST scores across GRS. Any help greatly appreciated! I like Excel graphics but I have found it very difficult to work with when choosing the examiners' data as a separate series - somehow the examiners' GRS scores do not line up nicely on the x-axis. I am very new to R but am also very interested in R, and would expend time to get a good result in R if a good result is viable. I understand JMP may be preferable for this type of thing but access to this may not be possible.
I'm working on a suggestion system. For a given input, the system outputs N suggestions.
We have collected data about what suggestions the users like. Example:
input1 - output11 output12 output13
input2 - output21
input3 - output31 output32
...
We now want to evaluate our system based on this data. The first metric is if these outputs are present in the suggestions of our system, that's easy.
But now, we would like to test how well positioned are these outputs in the suggestions. We would like to have the given outputs close to the first suggestions.
We would like a single score for the system or for each input.
Based on the previous data, here is what a score of 100% would be:
input1 - output11 output12 output13 other other other ...
input2 - output21 other other other other other ...
input3 - output31 output32 other other other other ...
...
(The order of output11 output12 output13 is not relevant. What is important is that ideally the three of them should be in the first three suggestions).
We could give a score to each position that is hold by a suggestion or count the displacement from the ideal position, but I don't see a good way to do this.
Is there an existing measure that could be used for that ?
You want something called the mean average precision (it's a metric from information retrieval).
Essentially, for each of the 'real' data points in your output list, you can compute the precision (#of correct entries above that point / #entries above that point). If you average this number across the positions of each of your real data points in the output list, you get a metric that does what you want.
In mahout, I'm setting up a GenericUserBasedRecommender, pretty straight forward for now, typical settings.
In generating a "preference" value for an item, we have the following 5 data points:
Positive interest
User converted on item (highest possible sign of interest)
Normal like (user expressed interest, e.g. like buttons)
Indirect expression of interest (clicks, cursor movements, measuring "eyeballs")
Negative interest
Indifference (items the user ignored when active on other items, a vague expression of disinterest)
Active dislike (thumbs down, remove item from my view, etc)
Over what range I should express these different attributes, let's use a 1-100 scale for discussion?
Should I be keeping the 'Active dislike' and 'Indifference' clustered close together, for example, at 1 and 5 respectively, with all the likes clustered in the 90-100 range?
Should 'Indifference' and 'Indirect expressions of interest' by closer to the center? As in 'Indifference' in the 20-35 range and 'Indirect like' in the 60-70 range?
Should 'User conversion' blow the scale away and be heads and tails higher than the others? As in: 'User Conversion' # 100, 'Lesser likes' # ~65, 'Dislikes' clustered in the 1-10 range?
On the scale of 1-100 is 50 effectively "null", or equivalent to no data point at all?
I know the final answer lies in trial and error and in the meaning of our data, but as far as the algorithm goes, I'm trying to understand at what point I need to tip the scales between interest and disinterest for the algorithm to function properly.
The actual range does not matter, not for this implementation. 1-100 is OK, 0-1 is OK, etc. The relative values are all that really matters here.
These values are estimated by a simple (linearly) weighted average. Therefore the response ought to be "linear". It ought to match an intuition that if action X gets a score 2x higher than action Y, then X should be an indicator of twice as much interest in real life.
A decent place to start is to simply size them relative to their frequency. If click-to-conversion rate is 2%, you might make a click worth 2% of a conversion.
I would ignore the "Indifference" signal you propose. It is likely going to be too noisy to be of use.
Does anyone know how to winsorize data in SPSS? I have outliers for some of my variables and want to winsorize them. Someone taught me how to do use the Transform -> compute variable command, but I forgot what to do. I believe they told me to just compute the square root of the subjects measurement that I want to winsorize. Could someone please elucidate this process for me?
There is a script online to do it already it appears. It could perhaps be simplified (the saving to separate files is totally unnecessary), but it should do the job. If you don't need a script and you know the values of the percentiles you need it would be as simple as;
Get the estimates for the percentiles for variable X (here I get the 5th and 95th percentile);
freq var X /format = notable /percentiles = 5 95.
Then lets say (just by looking at the output) the 5th percentile is equal to 100 and the 95th percentile is equal to 250. Then lets make a new variable named winsor_X replacing all values below the 5th and 95th percentile with the associated percentile.
compute winsor_X = X.
if X <= 100 winsor_X = 100.
if X >= 250 winsor_X = 250.
You could do the last part a dozen different ways, but hopefully that is clear enough to realize what is going on when you winsorize a variable.