Compare means between groups SPSS - spss

I have this problem.
I have an SPSS sheet that looks like this (it's an analogy, so don't ask me how I have measured it). This example is about tennis players.
Player   % of points won   % of points won
         own service       opponent's service
1        50                10
2        80                60
3        70                40
4        80                50
Now I want to know if there's a difference between your own service, and the opponent's service, in terms of points won. (As you see, there probably is. But is it significant?)
Link to boxplot
So Hypothesis: Own service -+-> points won
Now, Kruskal-Wallis, Independent-samples t test, One Way ANOVA, all require a grouping variable or something, but this is already implied. I could have chosen to make the data set:
Player   Own service   Won
1        1             0
2        1             1
3        0             1
For all games, group them on own service and see if there is a statistical difference between these groups. Perfectly doable.
The first set of data contains the same information, but presented differently. I just want to compare means between variables, only based on their own value. Can SPSS handle this type of information as well?

If the two variables are inter-related like points in a tennis match (per your example), and organized in different columns like that, a paired-samples t-test should work for you.
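As a rough cross-check outside SPSS, the paired-samples t statistic can be computed by hand from the four example rows. This is only a sketch: the function name `paired_t` is made up for illustration, and it returns the t statistic and degrees of freedom, not a p-value.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for paired samples."""
    # The paired test works on the per-player differences.
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    standard_error = stdev(differences) / sqrt(n)
    return mean(differences) / standard_error, n - 1


# Values copied from the example table in the question.
own_service = [50, 80, 70, 80]
opponent_service = [10, 60, 40, 50]

t_statistic, degrees_of_freedom = paired_t(own_service, opponent_service)
print(round(t_statistic, 3), degrees_of_freedom)
```

The resulting t value is then compared against the t distribution with n - 1 degrees of freedom (for 3 degrees of freedom the two-sided 5% critical value is about 3.18), which is what the SPSS paired-samples procedure reports as a significance value.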


What is the function to just identify outliers in Google Sheets?

I know of the TRIMMEAN function to help automatically exclude outliers from means, but is there one that will just identify which data points are true outliers? I am working under the classical definition of outliers being 3 SD away from the mean and in the bottom 25% and top 25% of data.
I need to do this to verify that my R code is indeed removing true outliers as we define them in my lab for our research purposes. The workarounds for identifying and removing outliers in R can be awkward, and since our data is mixed (numerical data grouped by factor classes) it gets too tricky to be sure that we are identifying and removing outliers within those class groups. This is why we are turning to a spreadsheet program for a double-check instead of assuming that the code does it correctly.
Is there a specific outlier identification function in Google Sheets?
Data looks like this:
group   VariableOne   VariableTwo   VariableThree   VariableFour
NAC     21            17            0.9             6.48
GAD     21            17            -5.9            0.17
UG      40            20            -0.4            6.8
SP      20            18            -6              -3
NAC     19            4             -8              8.48
UG      18            10            0.1             -1.07
NAC     23            24            -0.2            3.5
SP      21            17            1               3.1
UG      21            17            -5              5.19
As stated, each data point corresponds to a specific group code, so the values should be relatively similar within each group. My data as a whole does show this generally, but there are outliers within these groups which we want to exclude, and I want to ensure we are excluding the correct data.
If I can get even more specific with the function and see outliers within the groups then great, but as long as I can identify outliers in Google Sheets that could suffice.
To get the outliers, you must:
1. Calculate first quartile (Q1): This can be done in Sheets using =Quartile(dataset, 1)
2. Calculate third quartile (Q3): Same as number 1, but with a different quartile number: =Quartile(dataset, 3)
3. Calculate interquartile range (IQR): =Q3-Q1
4. Calculate lower boundary LB: =Q1-(1.5*IQR)
5. Calculate upper boundary UB: =Q3+(1.5*IQR)
By getting the lower and upper boundary, we can easily determine which data in our datasets are outliers.
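The same five steps can be sketched outside the spreadsheet as a cross-check. This is an illustrative sketch, not a Sheets feature: the function name `iqr_outliers` is made up, and it assumes that `statistics.quantiles` with `method="inclusive"` interpolates the same way as the spreadsheet QUARTILE function.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # quantiles(..., n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return [v for v in values if v < lower_bound or v > upper_bound]


# VariableOne column from the question's sample data.
variable_one = [21, 21, 40, 20, 19, 18, 23, 21, 21]
print(iqr_outliers(variable_one))
```

To check outliers within each group, the same function can be applied to the values of one group code at a time, provided each group has enough data points for quartiles to be meaningful.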
Example:
You can use Conditional formatting to highlight the outliers by clicking Format->Conditional Formatting and copy the following:
Click Done and the result should look like this:
Reference:
QUARTILE

SPSS: Multiple data lines for individual cases

My dataset looks like this:
ID   Time   Date         v1   v2     v3   v4
1    2300   21/01/2002   1    996    5    300
1    0200   22/01/2002   3    1000   6    100
1    0400   22/01/2002   5    930    3    100
1    0700   22/01/2002   1    945    4    200
I have 50+ cases and 15+ variables in both categorical and measurement form (although SPSS will not allow me to set it as Ordinal and Scale; I only have the options of Nominal and Ordinal).
I am looking for trends and cannot find a way to get SPSS to recognise each case as a whole rather than as individual rows. I have used a pivot table in Excel, which gives me the means for each variable, but I am aware that this can skew the result as it removes extreme readings (ideally I need these).
I have searched for this online multiple times but have come up blank so far. Any suggestions would be gratefully received!
I'm not sure I understand. If you are saying that each case has multiple records (that is multiple lines of data) - which is what it looks like in your example - then either
1) Your DATA LIST command needs to change to add RECORDS= (see the Help for the DATA LIST command); or
2) You will have to use CASESTOVARS (C2V) to put all the variables for a case in the same row of the Data Editor.
I may not be understanding, though.
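To make the second option concrete, here is a small sketch of the kind of long-to-wide restructuring that CASESTOVARS performs, written in Python rather than SPSS syntax. The helper name `cases_to_vars` and the `v1.1`, `v1.2` naming are assumptions chosen for illustration; they are not output copied from SPSS.

```python
from collections import defaultdict


def cases_to_vars(rows, id_key, value_keys):
    """Collapse several rows per case into one wide row per case."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_key]].append(row)

    wide = []
    for case_id, case_rows in grouped.items():
        record = {id_key: case_id}
        # Number the repeated measurements 1, 2, 3, ... in row order.
        for index, row in enumerate(case_rows, start=1):
            for key in value_keys:
                record[f"{key}.{index}"] = row[key]
        wide.append(record)
    return wide


# First two measurement columns of the question's example data.
rows = [
    {"ID": 1, "Time": "2300", "v1": 1, "v2": 996},
    {"ID": 1, "Time": "0200", "v1": 3, "v2": 1000},
    {"ID": 1, "Time": "0400", "v1": 5, "v2": 930},
    {"ID": 1, "Time": "0700", "v1": 1, "v2": 945},
]
wide = cases_to_vars(rows, "ID", ["v1", "v2"])
print(wide[0])
```

Every reading survives the restructuring, so extreme values stay available for later analysis instead of being averaged away.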

ELKI OPTICS pre-computed distance matrix

I can't seem to get this algorithm to work on my dataset, so I took a very small subset of my data and tried to get it to work, but that didn't work either.
I want to input a precomputed distance matrix into ELKI, and then have it find the reachability distance list of my points, but I get reachability distances of 0 for all my points.
ID=1 reachdist=Infinity predecessor=1
ID=2 reachdist=0.0 predecessor=1
ID=4 reachdist=0.0 predecessor=1
ID=3 reachdist=0.0 predecessor=1
My ELKI arguments were as follows:
Running: -dbc DBIDRangeDatabaseConnection -idgen.start 1 -idgen.count 4 -algorithm clustering.optics.OPTICSList -algorithm.distancefunction external.FileBasedDoubleDistanceFunction -distance.matrix /Users/jperrie/Documents/testfile.txt -optics.epsilon 1.0 -optics.minpts 2 -resulthandler ResultWriter -out /Applications/elki-0.7.0/elkioutputtest
I use the DBIDRangeDatabaseConnection instead of an input file to create indices 1 through 4 and pass in a distance matrix with the following format, where there are 2 indices and a distance on each line.
1 2 0.0895585119724274
1 3 0.19458931684494
2 3 0.196315720677376
1 4 0.137940123677254
2 4 0.135852232575417
3 4 0.141511023044586
Any pointers to where I'm going wrong would be appreciated.
When I change your distance matrix to start counting at 0, then it appears to work:
ID=0 reachdist=Infinity predecessor=-2147483648
ID=1 reachdist=0.0895585119724274 predecessor=-2147483648
ID=3 reachdist=0.135852232575417 predecessor=1
ID=2 reachdist=0.141511023044586 predecessor=3
Maybe you should file a bug report - to me, this appears to be a bug. Also, predecessor=-2147483648 should probably be predecessor=None or something like that.
This is due to a recent change that may not yet be correctly reflected in the documentation.
When you do multiple invocations in the MiniGUI, ELKI will assign fresh object DBIDs. So if you have a data set with 100 objects, the first run would use 0-99, the second 100-199, the third 200-299, etc. This can be desired (if you think of longer running processes, you want object IDs to be unique), but it can also be surprising behavior.
However, this makes precomputed distance matrices really hard to use, in particular with real data. Therefore, these classes were changed to use offsets. So the format of the distance matrix now is
DBIDoffset1 DBIDoffset2 distance
where offset 0 = start + 0 is the first object.
When I'm back in the office (and do not forget), I will 1. update the documentation to reflect this, 2. provide an offset parameter so that you can continue counting starting at 1, 3. make the default distance "NaN" or "infinity", and 4. add a sanity check that warns if you have 100 objects, but distances are given for objects 1-100 instead of 0-99.
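Until such an offset parameter exists, a distance file that counts from 1 can be shifted to 0-based offsets with a few lines of script. This is a minimal sketch: the function name `to_offsets` is made up for illustration, and it assumes one whitespace-separated `id id distance` triple per line, as in the question.

```python
def to_offsets(lines, first_id=1):
    """Rewrite 'id1 id2 distance' lines so that IDs become 0-based offsets."""
    converted = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        id_a, id_b, distance = line.split()
        # Keep the distance text untouched so no precision is lost.
        converted.append(f"{int(id_a) - first_id} {int(id_b) - first_id} {distance}")
    return converted


# Distance lines copied from the question.
original = [
    "1 2 0.0895585119724274",
    "1 3 0.19458931684494",
    "2 3 0.196315720677376",
    "1 4 0.137940123677254",
    "2 4 0.135852232575417",
    "3 4 0.141511023044586",
]
for line in to_offsets(original):
    print(line)
```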

Genetic Algorithms - Crossover and Mutation operators for paths

I was wondering if anyone knew any intuitive crossover and mutation operators for paths within a graph? Thanks!
The question is a bit old, but the problem doesn't seem to be outdated or solved, so I think my research might still be helpful to someone.
Mutation and crossover are fairly trivial for the TSP, where every mutation is valid (the chromosome represents an order of visiting fixed nodes, so swapping the order always creates a valid result). For Shortest Path or Optimal Path, where the chromosome is an exact route representation, this doesn't apply and the operators are not that obvious. So here is how I approach solving Optimal Path using a GA.
For crossover, there are a few options:
For routes that have at least one common point (besides the start and end nodes): find all common points and swap the subroutes at the crossing points.
Parent 1: 51 33 41 7 12 91 60
Parent 2: 51 9 33 25 12 43 15 60
The potential crossing points are 33 and 12. We can get the following children: 51 9 33 41 7 12 43 15 60 and 51 33 25 12 91 60, which are the result of crossing at both of these crossing points.
When two routes don't have a common point, randomly select two points from each parent and connect them (for that you can use random traversal, backtracking, or a heuristic search like A* or beam search). This connecting path may then be treated as the crossover path. For better understanding, see the picture of the two crossover methods below:
see http://i.imgur.com/0gDTNAq.png
Black and gray paths are parents, pink and orange paths are children, the green point is a crossover place, and red points are start and end nodes. The first graph shows the first type of crossover, the second graph is an example of the other one.
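The first crossover type (swapping the subroutes between two common points) can be sketched as follows. The function names are made up for illustration, and the sketch assumes that both crossing points occur in the same order in both parents; children may revisit a node in general, so a real implementation would probably want a validity check afterwards.

```python
def common_points(parent_a, parent_b):
    """Interior nodes shared by both routes, in parent_a's order."""
    interior_b = set(parent_b[1:-1])
    return [node for node in parent_a[1:-1] if node in interior_b]


def swap_between(parent_a, parent_b, first, second):
    """Swap the subroutes lying between two common crossing points."""
    a1, a2 = parent_a.index(first), parent_a.index(second)
    b1, b2 = parent_b.index(first), parent_b.index(second)
    if not (a1 < a2 and b1 < b2):
        raise ValueError("crossing points must appear in the same order")
    child_1 = parent_b[:b1] + parent_a[a1:a2] + parent_b[b2:]
    child_2 = parent_a[:a1] + parent_b[b1:b2] + parent_a[a2:]
    return child_1, child_2


# Parents from the example above.
parent_1 = [51, 33, 41, 7, 12, 91, 60]
parent_2 = [51, 9, 33, 25, 12, 43, 15, 60]

points = common_points(parent_1, parent_2)
child_1, child_2 = swap_between(parent_1, parent_2, points[0], points[1])
print(points, child_1, child_2)
```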
For mutation, there are also a few options. Generally, a naive mutation like swapping the order of nodes or adding a random node is really ineffective for graphs with average density. So here are the approaches that guarantee valid mutations:
Randomly take two points from the path and replace the section between them with a random path between those two nodes.
Chromosome: 51 33 41 7 12 91 60, random points: 33 and 12, random/shortest path between them: 33 29 71 12, mutated chromosome: 51 33 29 71 12 91 60
Pick a random point on the path, remove it and connect its neighbours (really very similar to the first one).
Pick a random point on the path and find a random path to its neighbour.
Try traversing away from the path at some randomly chosen point until reaching any point on the initial route (a slight modification of the first method).
see http://i.imgur.com/19mWPes.png
The graphs correspond to the mutation methods in the order listed. In the last example, the orange path is the one that would replace the original path between the mutation points (green nodes).
Note: these methods may obviously have a performance drawback when the search for an alternative subroute (using a random or heuristic method) gets stuck somewhere or finds a very long and useless subpath, so consider bounding the mutation's execution time or number of trials.
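The first mutation method can be sketched like this. It uses a breadth-first shortest path instead of a random one, which keeps the search bounded; to force an actual change, the search avoids the nodes of the subroute being replaced. The graph, the edge list and the function names are hypothetical, chosen only so that the example chromosome from above is reproduced.

```python
from collections import deque


def shortest_path(graph, src, dst, avoid=frozenset()):
    """Breadth-first search; returns a node list from src to dst, or None."""
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous and neighbour not in avoid:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def mutate_between(path, graph, i, j):
    """Replace path[i..j] with another route between the same two nodes."""
    old_interior = frozenset(path[i + 1:j])
    detour = shortest_path(graph, path[i], path[j], avoid=old_interior)
    if detour is None:
        return list(path)  # no alternative route: leave the path unchanged
    return path[:i] + detour + path[j + 1:]


# Hypothetical undirected graph containing the example route and one detour.
edges = [(51, 33), (33, 41), (41, 7), (7, 12), (12, 91), (91, 60),
         (33, 29), (29, 71), (71, 12)]
graph = {}
for a, b in edges:
    graph.setdefault(a, []).append(b)
    graph.setdefault(b, []).append(a)

chromosome = [51, 33, 41, 7, 12, 91, 60]
mutated = mutate_between(chromosome, graph, 1, 4)  # points 33 and 12
print(mutated)
```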
For my case, which is finding an optimal path in terms of maximizing sum of vertices weights while keeping sum of nodes weight less than given bound, those methods are quite effective and give a good result. Should you have any question, feel free to ask. Also, sorry for my MS Paint skills ;)
Update
One big hint: I basically used this approach in my implementation, but there was one big drawback to generating random paths. I decided to switch to semi-random route generation, using the shortest path that traverses randomly picked point(s). It is much more efficient (but obviously may not be applicable to all problems).
Hmm. That is a very difficult question; people write dissertations on it and there is still no right answer.
The general rule is "it all depends on your domain".
There are some generic GA libraries that will do some work for you, but for the best results it is recommended to implement your GA operations yourself, specifically for your domain.
You might have more luck with answers on Theoretical CS, but you need to expand your question more and add more details about your task and domain.
Update:
So you have a graph. In GA terms, a path through the graph represents an individual, nodes in the path would be chromosomes.
In that case I would say a mutation can be represented as a deviation of the path from the original somewhere: one of the nodes is moved elsewhere, and the path is adjusted so that its start and end nodes remain the same.
Mutation can lead to invalid individuals. In that case you need to make a decision: either allow invalid ones and hope that they will converge to some unexplored solution, or kill them on the spot. When I was working with GAs, I did allow invalid solutions, adding an "unfitness" value alongside the fitness. Some researchers suggest this can help with broad exploration of the solution space.
Crossover can only happen between paths that cross each other: at the crossing point, swap the remainders of the paths between the parents.
Bear in mind that there are various ways to do crossover: individuals can be crossed over at multiple points or at just one. With graphs you can have multiple crossing points, and that can naturally lead to multiple child paths.
As I said before, there is no right or wrong way of doing this, but you will find out the best way only by experimenting on it.

Is spatial search in P2P network possible?

I want to build a JavaScript/HTML5 geolocation-based social network and I wonder what the best possible architecture is. Client-server can be simple to develop, but the drawback is that the system resources required could be very high, especially because the application must handle movement (worst case: a user in a car must see the other users around him who are also in cars).
Basically, in a client-server architecture, the server tasks will be:
collects and stores latitude and longitude of the users (could have thousands of them)
makes geo distance search for that user (to get the list of users present around him in a radius)
builds and sends to the client an XML file with position of the users in the list
These 3 operations must be done periodically, every 3 or 5 seconds, because I want a "live" map that shows the users in the list moving in their environment (city, town).
All 3 of these points could be optimized:
the client sends his position only after moving 10 meters, to reduce the amount of data to process
"spherical rectangle" search in a MyISAM table with a spatial index (using MBRContains) to offload the MySQL database
common output file: the XML that is sent can be the same if 2 users are located within a radius of x meters (the 2 users are close to each other)
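The server-side radius search with a rectangle pre-filter can be sketched as follows. This is only an illustration of the idea in Python, not MySQL code: the function names, the in-memory `users` dictionary, the spherical Earth radius and the kilometres-per-degree constant are all assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0       # mean radius, spherical approximation
KM_PER_DEGREE_LAT = 111.195    # roughly EARTH_RADIUS_KM * pi / 180


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def users_within(center, users, radius_km):
    """Names of users within radius_km of center, using a box pre-filter."""
    lat0, lon0 = center
    dlat = radius_km / KM_PER_DEGREE_LAT
    # Longitude degrees shrink with latitude; guard against division by zero.
    dlon = radius_km / (KM_PER_DEGREE_LAT * max(cos(radians(lat0)), 1e-9))
    found = []
    for name, (lat, lon) in users.items():
        if abs(lat - lat0) > dlat or abs(lon - lon0) > dlon:
            continue  # cheap rectangle test, the role MBRContains plays in SQL
        if haversine_km(lat0, lon0, lat, lon) <= radius_km:
            found.append(name)
    return sorted(found)


users = {"a": (0.0, 0.01), "b": (0.0, 1.0), "c": (0.02, 0.0)}
print(users_within((0.0, 0.0), users, 5.0))
```

The rectangle test discards most candidates cheaply, so the exact distance is only computed for users that are already close.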
It is hard to estimate the load at this stage, but I think a client-server architecture is not appropriate for this type of application, and peer-to-peer could be a nice answer if 2 clients could communicate when they are near each other.
My point is:
Is there any method that lets a client blindly search for other clients located within a certain radius without the help of a central server? (It is possible with UDP broadcast :-)
Edit: Correction. UDP broadcast allows a client to poll a machine wherever it is, within a certain range of IP addresses.
Thank you for your help,
Florent
You will have to have central peers/servers, because you need to centralize some information to be able to provide your functionality.
I would go for the following:
Assign square miles (or whatever size you want) to specific servers.
Have devices send a 'I am here' message with their coordinates to some dispatcher that will forward these to the correct square mile server for handling.
Have servers register when a device enters a square mile they manage. This could be a central map to make sure a device is registered to one and only one square.
Forward this message to all other devices in the square.
And/or make sure you include the square for which this message is intended, and make sure each device checks it before displaying it to the user.
Tune the size of the square and the rate of 'I am here' message. That's it.
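The square-based registration can be sketched with a single in-memory registry. This is a simplification under stated assumptions: squares are cells of a fixed size in degrees rather than square miles, one process plays both dispatcher and square server, and the names `cell_of`, `CellRegistry` and `i_am_here` are made up.

```python
from math import floor


def cell_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to the grid square (cell) that contains it."""
    return (floor(lat / cell_deg), floor(lon / cell_deg))


class CellRegistry:
    """Keeps each device registered in exactly one square."""

    def __init__(self, cell_deg=0.01):
        self.cell_deg = cell_deg
        self.cell_by_device = {}
        self.devices_by_cell = {}

    def i_am_here(self, device, lat, lon):
        """Register a position; return the other devices in the same square."""
        new_cell = cell_of(lat, lon, self.cell_deg)
        old_cell = self.cell_by_device.get(device)
        if old_cell is not None and old_cell != new_cell:
            self.devices_by_cell[old_cell].discard(device)
        self.cell_by_device[device] = new_cell
        members = self.devices_by_cell.setdefault(new_cell, set())
        members.add(device)
        return sorted(members - {device})


registry = CellRegistry()
print(registry.i_am_here("a", 48.853, 2.348))  # first device in its square
print(registry.i_am_here("b", 48.857, 2.342))  # same square as "a"
```

Devices just across a square boundary from each other will not see one another in this sketch; forwarding to neighbouring squares as well would be one way to handle that.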
The answer actually depends on many things, so I'll help out with a basic strategy. To follow it you'll need to understand how Kademlia works (Kademlia is a DHT P2P network that stores information).
In Kademlia, at first startup each node picks a random ID, which is a 160-bit number that represents a point in the space of all possible 160-bit IDs.
The ID of the information to be stored is obtained with the SHA-1 function (it receives an arbitrary string and outputs a 160-bit number that is treated as the ID of the information to be stored).
Once you have the ID of the information, you publish it; the information is physically stored on a node whose ID is close to the information's ID.
The information is queried via its ID. Both information lookups and node lookups take O(log(N)) hops to obtain the required information. The "XOR" metric is used in Kademlia (in your case it can be the ordinary Euclidean metric).
Each node maintains an array of buckets; each bucket contains the addresses of nodes that are appropriate for that bucket. The appropriateness is a measure of how close the IDs are. Consider this example:
bit index:  0                              160
Node 1 ID:  1101000101011111101110101001010...
Node 2 ID:  1101011101011111101110101001010...
Node 3 ID:  1101000101011001101110101001010...
After applying the XOR metric to Nodes #1 and #2 (i.e. computing the number that represents the virtual distance between these nodes) we get:
index - 012345678901234
xor   - 000001100000000... (the difference is in 5-th msb bit)
order - msb         lsb
After applying the XOR metric to Nodes #1 and #3 we get:
index - 012345678901234
xor   - 000000000000011... (the difference is in 13-th msb bit)
order - msb         lsb
Evidently Node 1 is closer to Node 3 than to Node 2, since the IDs of Nodes 1 and 3 first differ at a less significant bit. Therefore, from Node 1's point of view, its neighbor Node 3 goes into the 13th bucket (a higher index means closer IDs), and Node 2 goes into the 5th bucket, which holds the nodes whose IDs first differ from the current node's ID at MSB index 5.
Such a data structure allows each node to know its surroundings at 160 different levels of distance.
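The bucket assignment in this example can be reproduced with a few lines of code. This sketch follows the indexing convention used above (bucket index = position of the first differing bit counted from the MSB), works on the 31 bits shown rather than full 160-bit IDs, and uses made-up function names.

```python
def xor_distance(id_a, id_b):
    """Kademlia-style distance: the XOR of the two IDs read as a number."""
    return id_a ^ id_b


def bucket_index(id_a, id_b, width):
    """Index (from the MSB, starting at 0) of the first differing bit."""
    distance = xor_distance(id_a, id_b)
    if distance == 0:
        raise ValueError("identical IDs have no bucket")
    # bit_length() locates the highest set bit of the distance.
    return width - distance.bit_length()


# The visible 31-bit prefixes of the example IDs.
WIDTH = 31
node_1 = int("1101000101011111101110101001010", 2)
node_2 = int("1101011101011111101110101001010", 2)
node_3 = int("1101000101011001101110101001010", 2)

print(bucket_index(node_1, node_2, WIDTH))
print(bucket_index(node_1, node_3, WIDTH))
print(xor_distance(node_1, node_3) < xor_distance(node_1, node_2))
```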
Back to your example: to allow efficient geospatial queries you'll need to replace Kademlia's XOR metric with the ordinary Euclidean metric. In this case your IDs will be 2D or 3D vectors. Unfortunately, the Euclidean metric yields floating point numbers, which are not directly suitable for this type of algorithm, so you will need to convert them to discrete binary numbers somehow, in a way similar to what the XOR function does. After that, finding a node's neighboring nodes is a trivial task.
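One possible way to turn coordinates into discrete binary IDs is bit interleaving (a Z-order or Morton code). This is an assumption, not something Kademlia prescribes: the function names and the 16 bits per axis are arbitrary choices. Nearby points often, though not always, end up sharing a long ID prefix.

```python
def quantize(value, low, high, bits):
    """Map a float in [low, high] to an integer in [0, 2**bits - 1]."""
    scaled = int((value - low) / (high - low) * (1 << bits))
    return min(scaled, (1 << bits) - 1)


def interleave(x, y, bits):
    """Interleave the bits of x and y, most significant bit first."""
    code = 0
    for i in range(bits - 1, -1, -1):
        code = (code << 1) | ((x >> i) & 1)
        code = (code << 1) | ((y >> i) & 1)
    return code


def location_id(lat, lon, bits=16):
    """Discrete ID for a coordinate: 2 * bits wide, latitude bit first."""
    return interleave(quantize(lat, -90.0, 90.0, bits),
                      quantize(lon, -180.0, 180.0, bits),
                      bits)


print(format(location_id(48.853, 2.348), "032b"))
print(format(location_id(48.857, 2.342), "032b"))
```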
Hope this helps. Oh, by the way, take a look at HyperDex, a new searchable distributed datastore closely tied to the Euclidean metric; it might help.
