I know of the TRIMMEAN function, which helps automatically exclude outliers from means, but is there one that will just identify which data points are true outliers? I am working under the classical definition of an outlier as a point more than 3 SD from the mean and in the bottom 25% or top 25% of the data.
I need to do this in order to verify that my R code is indeed removing true outliers as we define them in my lab for our research purposes. R can be awkward with the workarounds for identifying and removing outliers, and since our data is mixed (numerical data grouped by factor classes) it gets too tricky to be sure that we are identifying and removing outliers within those class groups. This is why we are turning to a spreadsheet program for a double-check instead of assuming that the code is doing it correctly automatically.
Is there a specific outlier identification function in Google Sheets?
Data looks like this:
group VariableOne VariableTwo VariableThree VariableFour
NAC 21 17 0.9 6.48
GAD 21 17 -5.9 0.17
UG 40 20 -0.4 6.8
SP 20 18 -6 -3
NAC 19 4 -8 8.48
UG 18 10 0.1 -1.07
NAC 23 24 -0.2 3.5
SP 21 17 1 3.1
UG 21 17 -5 5.19
As stated, each row of data corresponds to a specific group code. That is to say, the data should be relatively similar within each group. My data as a whole generally shows this, but there are outliers within these groups which we want to exclude, and I want to ensure we are excluding the correct data.
If I can get even more specific with the function and see outliers within the groups then great, but as long as I can identify outliers in Google Sheets that could suffice.
To get the outliers, you must:
1) Calculate the first quartile (Q1): in Sheets, =QUARTILE(dataset, 1)
2) Calculate the third quartile (Q3): same as step 1, but with a different quartile number: =QUARTILE(dataset, 3)
3) Calculate the interquartile range: IQR = Q3 - Q1
4) Calculate the lower boundary: LB = Q1 - (1.5 * IQR)
5) Calculate the upper boundary: UB = Q3 + (1.5 * IQR)
By getting the lower and upper boundary, we can easily determine which data in our datasets are outliers.
Example:
You can use conditional formatting to highlight the outliers: click Format -> Conditional formatting and add a custom formula that flags any value below the lower boundary or above the upper boundary.
Click Done and the outliers will be highlighted.
Reference:
QUARTILE
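If you want an independent double-check of the spreadsheet fences (and of the R code), the same per-group IQR calculation takes only a few lines of Python. This is a sketch with made-up group values; `method="inclusive"` is used because it matches the interpolation that the QUARTILE function uses.

```python
# Sketch: flag IQR outliers within each group.
# The group labels and values below are made-up stand-ins for the real data.
from statistics import quantiles

def iqr_outliers(values):
    # method="inclusive" matches QUARTILE in Sheets/Excel
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lb, ub = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lb or v > ub]

data = {
    "NAC": [21, 19, 23, 4, 98],   # 4 and 98 look like outliers
    "UG":  [40, 18, 21, 20, 19],  # 40 looks like an outlier
}
for group, values in data.items():
    print(group, "->", iqr_outliers(values))
```

Running the same values through the spreadsheet formulas above should produce identical fences, which makes it easy to spot any disagreement with the R code.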
My dataset looks like this:
ID Time Date v1 v2 v3 v4
1 2300 21/01/2002 1 996 5 300
1 0200 22/01/2002 3 1000 6 100
1 0400 22/01/2002 5 930 3 100
1 0700 22/01/2002 1 945 4 200
I have 50+ cases and 15+ variables, in both categorical and measurement form (although SPSS will not allow me to set them as Ordinal and Scale; I only have the options of Nominal and Ordinal?).
I am looking for trends and cannot find a way to get SPSS to recognise each case as a whole rather than as individual rows. I have used a pivot table in Excel, which gives me the means for each variable, but I am aware that this can skew the results as it removes extreme readings (which I ideally need).
I have searched this query online multiple times but I have come up blank so far, any suggestions would be gratefully received!
I'm not sure I understand. If you are saying that each case has multiple records (that is, multiple lines of data) - which is what it looks like in your example - then either
1) Your DATA LIST command needs to change to add RECORDS= (see the Help for the DATA LIST command); or
2) You will have to use CASESTOVARS (C2V) to put all the variables for a case in the same row of the Data Editor.
I may not be understanding, though.
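If it helps to sanity-check the restructure outside SPSS, the CASESTOVARS operation is essentially a long-to-wide pivot. Here is a rough pure-Python sketch of the idea; the field names are made up to mirror the example data, and real CASESTOVARS output naming may differ.

```python
# Sketch: collapse multiple records per case into one row per case,
# roughly what SPSS's CASESTOVARS does (hypothetical field names).
rows = [
    {"ID": 1, "Time": "2300", "v1": 1},
    {"ID": 1, "Time": "0200", "v1": 3},
    {"ID": 2, "Time": "0400", "v1": 5},
]

cases = {}
for row in rows:
    rec = cases.setdefault(row["ID"], {"ID": row["ID"]})
    # Number the new columns by how many records this case already has.
    n = sum(1 for k in rec if k.startswith("Time.")) + 1
    rec[f"Time.{n}"] = row["Time"]
    rec[f"v1.{n}"] = row["v1"]

for case in cases.values():
    print(case)
```

Each case ends up as a single dictionary with suffixed columns (Time.1, Time.2, ...), which is the same shape CASESTOVARS produces in the Data Editor.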
Let us say I have a 2D array that I can read from a file
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
I am looking to store them as 1D array arr[16].
I am aware of row wise and column wise storage.
Either of these messes up the 2D structure of the array. Say I would like to convolve it with a 2x2 filter: then at conv(1,1) I would be accessing the elements 1, 2, 5, 6.
Instead, can I optimize the storage pattern so that the elements 1, 2, 5, 6 are stored next to each other rather than far apart?
This would reduce the memory latency issue.
It depends on your processor, but supposing you have a typical Intel cache line size of 64 bytes, then picking square subregions that are each 64 bytes in size feels like a smart move.
If your individual elements are a byte each then 8x8 subtiles makes sense. So, e.g.
#define index(x, y) (x&7) | ((y&7) << 3) | \
((x&~7) << 6) | ((y&~7) ... shifted and/or multiplied as per size of x)
So in each full tile:
in 49 of every 64 cases all data is going to be within the same cache line;
in a further 14 it's going to lie across two cache lines; and
in one case in 64 it is going to need four.
So that's an average of 1.265625 cache lines touched per output pixel, versus 2.03125 in the naive case.
I found what I was looking for: it is called Morton ordering of an array, and it has been shown to reduce memory access time. Another method is the Hilbert curve, which has been shown to be even more effective than Morton ordering.
I am attaching a link to an article explaining this
https://insidehpc.com/2015/10/morton-ordering/
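For reference, Morton (Z-order) indexing just interleaves the bits of the x and y coordinates, which is what the bit-masking macro above is doing for 8x8 tiles. A minimal Python sketch of the general idea:

```python
# Sketch: Morton (Z-order) indexing interleaves the bits of the x and y
# coordinates, so small 2D neighbourhoods land close together in 1D.
def morton_index(x: int, y: int, bits: int = 8) -> int:
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        idx |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return idx

# The 2x2 block at the origin maps to consecutive indices 0..3:
print([morton_index(x, y) for y in (0, 1) for x in (0, 1)])  # [0, 1, 2, 3]
```

A 2x2 convolution window therefore touches four consecutive (or nearly consecutive) 1D positions instead of two positions in one row plus two a full row-stride away.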
I would like to use attribute selection for a numeric data-set.
My goal is to find the best attributes that I will later use in Linear Regression to predict numeric values.
For testing, I used the autoPrice.arff that I obtained from here (datasets-numeric.jar).
Using ReliefFAttributeEval I get the following outcome:
Ranked attributes:
**0.05793 8 engine-size**
**0.04976 5 width**
0.0456 7 curb-weight
0.04073 12 horsepower
0.03787 2 normalized-losses
0.03728 3 wheel-base
0.0323 10 stroke
0.03229 9 bore
0.02801 13 peak-rpm
0.02209 15 highway-mpg
0.01555 6 height
0.01488 4 length
0.01356 11 compression-ratio
0.01337 14 city-mpg
0.00739 1 symboling
while using the InfoGainAttributeEval (after applying numeric to nominal filter) leaves me with the following results:
Ranked attributes:
6.8914 7 curb-weight
5.2409 4 length
5.228 2 normalized-losses
5.0422 12 horsepower
4.7762 6 height
4.6694 3 wheel-base
4.4347 10 stroke
4.3891 9 bore
**4.3388 8 engine-size**
**4.2756 5 width**
4.1509 15 highway-mpg
3.9387 14 city-mpg
3.9011 11 compression-ratio
3.4599 13 peak-rpm
2.2038 1 symboling
My question is :
How can I justify the contradiction between the two results? If the two methods use different algorithms to achieve the same goal (revealing the relevance of each attribute to the class), why does one say that e.g. engine-size is important while the other says it is not so much!?
There is no reason to think that RELIEF and Information Gain (IG) should give identical results, since they measure different things.
IG looks at the difference in entropies between not having the attribute and conditioning on it; hence, highly informative attributes (with respect to the class variable) will be the most highly ranked.
RELIEF, however, looks at random data instances and measures how well the feature discriminates classes by comparing to "nearby" data instances.
Note that Relief is a more heuristic (i.e. more stochastic) method, and the values and ordering you get depend on several parameters, unlike with IG.
So we would not expect algorithms optimizing different quantities to give the same results, especially when one is parameter-dependent.
However, I'd say that actually your results are pretty similar: e.g. curb-weight and horsepower are pretty close to the top in both methods.
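One way to see that the two evaluators measure different things: a purely univariate score such as information gain can assign zero to an attribute that is useful only in combination with another, while Relief's neighbourhood-based score can still credit such attributes. A minimal sketch with toy XOR data (not the autoPrice attributes):

```python
# Sketch: a univariate measure like information gain scores each attribute
# in isolation, so it gives zero credit to features that only matter jointly.
# Relief-style scores, based on nearest neighbours, can still reward them.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    total = entropy(labels)
    cond = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        cond += len(subset) / len(labels) * entropy(subset)
    return total - cond

# XOR-style toy data: each feature alone tells you nothing about the class.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
y  = [0, 1, 1, 0]  # y = f1 XOR f2

print(info_gain(f1, y), info_gain(f2, y))  # both 0.0
```

The autoPrice attributes are nowhere near this extreme, which is why the two rankings mostly agree; the toy case just illustrates why they are not obliged to.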
I was wondering if anyone knew any intuitive crossover and mutation operators for paths within a graph? Thanks!
Question is a bit old, but the problem doesn't seem to be outdated or solved, so I think my research still might be helpful for someone.
While mutation and crossover are fairly trivial in the TSP, where every mutation is valid (because the chromosome represents an order in which to visit fixed nodes, swapping that order always creates a valid result), in the case of Shortest Path or Optimal Path, where the chromosome is an exact route representation, this doesn't apply and things are not so obvious. So here is how I approach the problem of solving Optimal Path using a GA.
For crossover, there are a few options:
For routes that share at least one common point (besides the start and end nodes): find all common points and swap the subroutes at the crossing point.
Parent 1: 51 33 41 7 12 91 60
Parent 2: 51 9 33 25 12 43 15 60
Potential crossing points are 33 and 12. Using both of these crossing points, we can get the following children: 51 9 33 41 7 12 43 15 60 and 51 33 25 12 91 60.
When two routes have no common point, randomly select two points, one from each parent, and connect them (using random traversal, backtracking, or a heuristic search such as A* or beam search). This new path can then be treated as the crossover path. For better understanding, see the picture of the two crossover methods below:
see http://i.imgur.com/0gDTNAq.png
Black and gray paths are the parents, pink and orange paths are the children, the green point is the crossover place, and the red points are the start and end nodes. The first graph shows the first type of crossover; the second graph is an example of the other.
For mutation, there are also a few options. Generally, dummy mutations like swapping the order of nodes or adding a random node are really ineffective for graphs of average density. So here are approaches that guarantee valid mutations:
Take two random points from the path and replace the segment between them with a random path between those two nodes.
Chromosome: 51 33 41 7 12 91 60; random points: 33 and 12; random/shortest path between them: 33 29 71 12; mutated chromosome: 51 33 29 71 12 91 60
Pick a random point from the path, remove it, and connect its neighbours (very similar to the first method).
Pick a random point from the path and find a random path to one of its neighbours.
Try subtraversing the path from some randomly chosen point until reaching any point on the initial route (a slight modification of the first method).
see http://i.imgur.com/19mWPes.png
Each graph corresponds to a mutation method, in the order above. In the last example, the orange path is the one that would replace the original path between the mutation points (the green nodes).
Note: these methods obviously may have a performance drawback when finding an alternative subroute (by a random or heuristic method) gets stuck somewhere or produces a very long, useless subpath, so consider bounding the mutation's execution time or number of trials.
For my case, which is finding an optimal path that maximizes the sum of vertex weights while keeping the sum of node weights below a given bound, these methods are quite effective and give good results. Should you have any questions, feel free to ask. Also, sorry for my MS Paint skills ;)
Update
One big hint: I basically used this approach in my implementation, but the random path generation turned out to be a big drawback. I decided to switch to semi-random route generation, traversing the shortest path through randomly picked point(s); it is much more efficient (but obviously may not be applicable to all problems).
Emm.. that is a very difficult question; people write dissertations on it and there is still no single right answer.
The general rule is "it all depends on your domain".
There are some generic GA libraries that will do some work for you, but for the best results it is recommended to implement your GA operations yourself, specifically for your domain.
You might have more luck with answers on Theoretical CS, but you need to expand your question more and add more details about your task and domain.
Update:
So you have a graph. In GA terms, a path through the graph represents an individual (the chromosome), and the nodes in the path would be its genes.
In that case, I would say a mutation can be represented as a deviation of the path somewhere from the original: one of the nodes is moved, and the path is adjusted so that the start and end nodes remain the same.
Mutation can lead to invalid individuals, and in that case you need to make a decision: allow the invalid ones and hope they will converge to some unexplored solution, or kill them on the spot. When I worked with GAs, I did allow invalid solutions, tracking an "unfitness" value alongside fitness. Some researchers suggest this can help with broad exploration of the solution space.
Crossover can only happen between paths that cross each other: at the point of crossing, swap the remainders of the paths between the parents.
Bear in mind that there are various ways for crossover: individuals can be crossed-over in multiple points or just in one. In the case with graphs you can have multiple crossing points, and that can naturally lead to the multiple children graphs.
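The "unfitness" idea mentioned above can be sketched as a second score that dominates comparisons, so invalid individuals stay in the pool but always rank below valid ones. The edge set and both scoring functions here are hypothetical placeholders, not part of any particular library.

```python
# Sketch: rank individuals by (unfitness, -fitness), so constraint
# violations dominate but invalid solutions are kept in the population.
# EDGES and both scoring functions are hypothetical placeholders.
EDGES = {(1, 2), (2, 3), (3, 4), (1, 3)}

def fitness(path):
    return sum(path)  # e.g. total collected node weight

def unfitness(path):
    # e.g. number of missing edges; 0 means the path is valid
    return sum(1 for a, b in zip(path, path[1:]) if (a, b) not in EDGES)

population = [[1, 2, 3, 4], [1, 3, 4], [1, 4]]

ranked = sorted(population, key=lambda p: (unfitness(p), -fitness(p)))
print(ranked[0])  # best: a valid path with the highest fitness
```

Selection then simply prefers lower unfitness first, which lets invalid individuals survive long enough to be repaired by later crossover or mutation.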
As I said before, there is no right or wrong way of doing this, but you will find out the best way only by experimenting on it.
Does anyone have a straightforward Delphi example of filling in a grid using Delaunay triangles or kriging? Either method can fill a grid by 'interpolating.'
What do I want to do? I have a grid, similar to:
22 23 xx 17 19 18 05
21 xx xx xx 17 18 07
22 24 xx xx 18 21 20
30 22 25 xx 22 20 19
28 xx 23 24 22 20 18
22 23 xx 17 23 15 08
21 29 30 22 22 17 09
where the xx's represent grid cells with no data, and the x,y coordinates of each cell are known. Both kriging and Delaunay triangles can supply the 'missing' points (which are of course fictitious, but reasonable values).
Kriging is a statistical method for filling 'missing' or unavailable data in a grid with 'reasonable' values. Why would you need it? Principally to 'contour' the data. Contouring algorithms (like CONREC for Delphi, http://local.wasp.uwa.edu.au/~pbourke/papers/conrec/index.html) can contour regularly spaced data.
Google around for 'kriging' and 'Delphi' and you will eventually be pointed to the GEOBLOCK project on SourceForge (http://geoblock.sourceforge.net/). Geoblock has numerous Delphi .pas units for kriging based on GSLIB (a Fortran statistical package developed at Stanford). However, all the kriging/Delaunay units depend on units referred to in the Delphi uses clause, and unfortunately these 'helper' units are not posted with the rest of the source code. It appears none of the kriging units can stand alone or work without helper units that are not posted or, in some cases, contain undefined data types.
Delaunay triangulation is described at http://local.wasp.uwa.edu.au/~pbourke/papers/triangulate/index.html. Posted there is a Delphi example, pretty neat, that generates 'triangles.' Unfortunately, I haven't a clue how to use the unit with a static grid; the example 'generates' a data field on the fly.
Has anyone got either of these units to work to fill an irregular data grid? Any code or hints how to use the existing code for kriging a simple grid or using Delaunay to fill in the holes would be appreciated.
I'm writing this as an answer because it's too long to fit into a comment.
Assuming your grid really is irregular (you give no examples of a typical pattern of grid coordinates), then triangulation only partially helps. Once you have triangulated you would then use that triangulation to do an interpolation, and there are different choices that could be made.
But you've not said anything about how you want to interpolate, what you want to do with that interpolation.
It seems to me that you have asked for some code, but it's not clear that you know what algorithm you want. That's really the question you should have asked.
For example since you appear to have no criteria for how you should do the interpolation, why don't you choose the nearest neighbour for your missing values. Or why don't you use the overall mean for the missing values. Both of these choices meet all the criteria you have specified since you haven't specified any!
Really I think you need to spend some more time explaining what properties you want this interpolation to have, what you are going to do with it etc. I also think you should stop thinking about code for now and think about algorithms. Since you have mentioned statistics you should consider asking at https://stats.stackexchange.com/.
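To make the nearest-neighbour suggestion concrete, here is a minimal sketch (in Python rather than Delphi, purely to illustrate the algorithmic choice) that fills the missing cells of a small grid like the one in the question, with None standing in for the xx cells.

```python
# Sketch: fill missing grid cells with the value of the nearest known cell
# (squared Euclidean distance over cell coordinates; ties broken by scan
# order). This illustrates the simplest interpolation choice, not kriging.
def fill_nearest(grid):
    known = [(r, c, v) for r, row in enumerate(grid)
             for c, v in enumerate(row) if v is not None]
    out = [row[:] for row in grid]
    for r, row in enumerate(grid):
        for c, v in enumerate(row):
            if v is None:
                out[r][c] = min(known,
                                key=lambda k: (k[0]-r)**2 + (k[1]-c)**2)[2]
    return out

grid = [[22, 23, None, 17],
        [21, None, None, 17],
        [22, 24, None, 18]]
print(fill_nearest(grid))
```

Swapping the `min(...)` expression for the overall mean of the known values gives the other trivial baseline; any serious interpolation scheme (triangulation-based, inverse-distance, kriging) should at least beat these two on whatever criterion you care about.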
Code posted by Richard Winston on the Embarcadero Developer Network Code Central, titled "Delaunay triangulation and contouring code" (ID: 29365), demonstrates routines for generating constrained Delaunay triangulations and for plotting contour lines based on data points at arbitrary locations. These algorithms do not manipulate a grid and fill in its holes, but they do provide a method for contouring arbitrary data and do not require a grid without missing values.
I still have not found an acceptable kriging algorithm in Pascal that actually fills in the holes in a grid.