I would like to use attribute selection for a numeric dataset.
My goal is to find the best attributes that I will later use in Linear Regression to predict numeric values.
For testing, I used the autoPrice.arff dataset, which I obtained from datasets-numeric.jar.
Using ReliefFAttributeEval I get the following outcome:
Ranked attributes:
**0.05793 8 engine-size**
**0.04976 5 width**
0.0456 7 curb-weight
0.04073 12 horsepower
0.03787 2 normalized-losses
0.03728 3 wheel-base
0.0323 10 stroke
0.03229 9 bore
0.02801 13 peak-rpm
0.02209 15 highway-mpg
0.01555 6 height
0.01488 4 length
0.01356 11 compression-ratio
0.01337 14 city-mpg
0.00739 1 symboling
while using InfoGainAttributeEval (after applying a numeric-to-nominal filter) leaves me with the following results:
Ranked attributes:
6.8914 7 curb-weight
5.2409 4 length
5.228 2 normalized-losses
5.0422 12 horsepower
4.7762 6 height
4.6694 3 wheel-base
4.4347 10 stroke
4.3891 9 bore
**4.3388 8 engine-size**
**4.2756 5 width**
4.1509 15 highway-mpg
3.9387 14 city-mpg
3.9011 11 compression-ratio
3.4599 13 peak-rpm
2.2038 1 symboling
My question is:
How can I explain the contradiction between the two results? If the two methods use different algorithms to achieve the same goal (revealing the relevance of each attribute to the class), why does one say that, e.g., engine-size is important while the other says it is not so important?
There is no reason to think that RELIEF and Information Gain (IG) should give identical results, since they measure different things.
IG looks at the difference between the entropy of the class and its entropy after conditioning on the attribute; hence, the attributes that are most informative about the class will be ranked highest.
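For intuition, here is a minimal sketch of that entropy-difference computation on toy data (plain NumPy, not Weka's implementation; the attribute and class values are made up):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(attribute, labels):
    """H(class) - H(class | attribute) for one nominal attribute."""
    conditional = sum(
        np.mean(attribute == v) * entropy(labels[attribute == v])
        for v in np.unique(attribute)
    )
    return entropy(labels) - conditional

# A perfectly informative attribute versus an uninformative one.
price = np.array(["cheap", "cheap", "pricey", "pricey"])
print(info_gain(np.array(["small", "small", "big", "big"]), price))  # 1.0
print(info_gain(np.array(["red", "blue", "red", "blue"]), price))    # 0.0
```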
RELIEF, however, samples random data instances and measures how well each feature discriminates between classes by comparing an instance to its "nearby" instances (nearest hits and misses).
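And here is a correspondingly stripped-down Relief-style score (the real ReliefF samples instances, uses k nearest hits/misses and normalizes by feature ranges; this sketch only contrasts the single nearest hit and nearest miss):

```python
import numpy as np

def relief_scores(X, y):
    """Simplified Relief for numeric features (assumes X is already scaled):
    features that differ from the nearest miss gain weight, features that
    differ from the nearest hit lose weight."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dist = np.abs(X - X[i]).sum(axis=1).astype(float)  # distance to every row
        dist[i] = np.inf                                    # ignore the instance itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))  # nearest same-class row
        miss = np.argmin(np.where(y != y[i], dist, np.inf)) # nearest other-class row
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n
```

Even on the same data, the ranking this produces need not match the entropy-based ranking above, which is essentially the effect you are seeing.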
Note that RELIEF is a more heuristic (i.e., more stochastic) method, and the values and ordering you get depend on several parameters (e.g. how many neighbours are compared), unlike IG.
So we would not expect algorithms optimizing different quantities to give the same results, especially when one is parameter-dependent.
However, I'd say that actually your results are pretty similar: e.g. curb-weight and horsepower are pretty close to the top in both methods.
I know of the TRIMMEAN function for automatically excluding outliers from means, but is there a function that will simply identify which data points are true outliers? I am working under the classical definition of outliers as being 3 SD away from the mean and in the bottom 25% or top 25% of the data.
I need to do this to verify that my R code is indeed removing true outliers as we define them in my lab for our research purposes. R can be awkward with the workarounds for identifying and removing outliers, and since our data is mixed (numerical data grouped by factor classes) it gets too tricky to be sure we are identifying and removing outliers within those class groups. This is why we are turning to a spreadsheet program to double-check instead of assuming that the code is doing it correctly.
Is there a specific outlier identification function in Google Sheets?
Data looks like this:
group VariableOne VariableTwo VariableThree VariableFour
NAC 21 17 0.9 6.48
GAD 21 17 -5.9 0.17
UG 40 20 -0.4 6.8
SP 20 18 -6 -3
NAC 19 4 -8 8.48
UG 18 10 0.1 -1.07
NAC 23 24 -0.2 3.5
SP 21 17 1 3.1
UG 21 17 -5 5.19
As stated, each row corresponds to a specific group code; that is to say, the data should be relatively similar within each group. My data as a whole generally shows this, but there are outliers within these groups that we want to exclude, and I want to ensure we are excluding the correct data.
If I can get even more specific with the function and see outliers within the groups then great, but as long as I can identify outliers in Google Sheets that could suffice.
To get the outliers, you must
Calculate first quartile (Q1): This can be done in sheets using =Quartile(dataset, 1)
Calculate third quartile (Q3): Same as number 1, but different quartile number =Quartile(dataset, 3)
Calculate interquartile range (IQR): =Q3-Q1
Calculate lower boundary LB: =Q1-(1.5*IQR)
Calculate upper boundary UB: =Q3+(1.5*IQR)
By getting the lower and upper boundary, we can easily determine which data in our datasets are outliers.
Example:
Compute Q1, Q3, IQR, LB and UB in helper cells for the column you want to check. You can then use conditional formatting to highlight the outliers: click Format -> Conditional formatting, choose "Custom formula is", and enter a formula that flags any value below LB or above UB. Click Done, and the values outside the boundaries will be highlighted.
Reference:
QUARTILE
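Since the question also mentions double-checking grouped data programmatically, here is a minimal pandas sketch (illustrative only; group and column names taken from the sample data above) of the same Q1/Q3/IQR rule applied within each group:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["NAC", "GAD", "UG", "SP", "NAC", "UG", "NAC", "SP", "UG"],
    "VariableOne": [21, 21, 40, 20, 19, 18, 23, 21, 21],
})

def iqr_outliers(s):
    """True for values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Flag outliers of VariableOne separately within each group code.
df["is_outlier"] = df.groupby("group")["VariableOne"].transform(iqr_outliers)
print(df)
```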
My dataset looks like this:
ID Time Date       v1 v2   v3 v4
1 2300 21/01/2002 1 996 5 300
1 0200 22/01/2002 3 1000 6 100
1 0400 22/01/2002 5 930 3 100
1 0700 22/01/2002 1 945 4 200
I have 50+ cases and 15+ variables in both categorical and measurement form (although SPSS will not let me set them as Ordinal or Scale; I only have the options of Nominal and Ordinal?).
I am looking for trends and cannot find a way to get SPSS to recognise each case as a whole rather than as individual rows. I have used a pivot table in Excel, which gives me the means for each variable, but I am aware that this can skew the result as it removes extreme readings (which I ideally need).
I have searched for this online multiple times but have come up blank so far; any suggestions would be gratefully received!
I'm not sure I understand. If you are saying that each case has multiple records (that is, multiple lines of data), which is what it looks like in your example, then either:
1) Your DATA LIST command needs to change to add RECORDS= (see the Help for the DATA LIST command); or
2) You will have to use CASESTOVARS (C2V) to put all the variables for a case in the same row of the Data Editor.
I may not be understanding, though.
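If it helps to see the restructuring idea outside SPSS, here is a minimal pandas sketch (illustrative only; IDs and variable names follow the example above) of the long-to-wide reshaping that CASESTOVARS performs, so that each case ends up on a single row:

```python
import pandas as pd

# Long format: one row per reading, several rows per case (ID).
long = pd.DataFrame({
    "ID":   [1, 1, 1, 1],
    "Time": ["2300", "0200", "0400", "0700"],
    "v1":   [1, 3, 5, 1],
    "v2":   [996, 1000, 930, 945],
})

# Number the readings within each case, then pivot to one row per case.
long["reading"] = long.groupby("ID").cumcount() + 1
wide = long.pivot(index="ID", columns="reading", values=["v1", "v2"])
wide.columns = [f"{var}.{i}" for var, i in wide.columns]   # v1.1, v1.2, ...
print(wide)
```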
Let us say I have a 2D array that I can read from a file
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
I am looking to store it as a 1D array arr[16].
I am aware of row-wise and column-wise storage.
However, this messes up the 2D structure of the data. Say I would like to convolve it with a 2x2 filter; then at conv(1,1) I would be accessing the elements 1, 2, 5 and 6.
Instead, can I optimize the storage pattern so that the elements 1, 2, 5 and 6 are stored next to each other rather than far apart?
This would reduce the memory latency issue.
It depends on your processor, but supposing you have a typical Intel cache line size of 64 bytes, then picking square subregions that are each 64 bytes in size feels like a smart move.
If your individual elements are a byte each, then 8x8 subtiles make sense. So, e.g. (assuming W is the image width in pixels, padded to a multiple of 8):
#define index(x, y) (((x) & 7) | (((y) & 7) << 3)  /* offset inside the 8x8 tile */ \
                   | (((x) >> 3) << 6)             /* tile column, 64 bytes per tile */ \
                   | (((y) & ~7) * (W)))           /* tile row, W*8 bytes per tile row */
So in each full tile:
in 49 of every 64 cases all data is going to be within the same cache line;
in a further 14 it's going to lie across two cache lines; and
in one case in 64 it is going to need four.
So that's an average of 1.265625 cache lines touched per output pixel, versus 2.03125 in the naive case.
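If you want to sanity-check those counts, here is a small Python sketch (assuming one-byte elements, 64-byte cache lines, a 16-pixel-wide image and the 8x8 tiling above) that enumerates the 64 window positions whose top-left corner lies in the first tile and counts how many cache lines each 2x2 window touches:

```python
from collections import Counter

W = 16  # image width in pixels, assumed to be a multiple of 8

def index(x, y):
    """8x8 tiled layout from the macro above, one byte per element."""
    return (x & 7) | ((y & 7) << 3) | ((x >> 3) << 6) | ((y & ~7) * W)

lines_per_window = Counter()
for y in range(8):
    for x in range(8):
        window = [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]
        lines = {index(px, py) // 64 for px, py in window}   # 64-byte lines
        lines_per_window[len(lines)] += 1

print(lines_per_window)                                      # {1: 49, 2: 14, 4: 1}
print(sum(k * v for k, v in lines_per_window.items()) / 64)  # 1.265625
```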
I found what I was looking for: it is called Morton ordering of an array, and it has been shown to reduce memory access time. Another method would be to use a Hilbert curve, which has been shown to be even more effective than Morton ordering.
Here is a link to an article explaining this:
https://insidehpc.com/2015/10/morton-ordering/
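For reference, a minimal Python sketch of the Morton (Z-order) index: it interleaves the bits of the x and y coordinates, so 2D neighbours tend to end up close together in the 1D array:

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y (x in the even bit positions, y in the odd)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# The 2x2 block with its corner at (0, 0) maps to consecutive positions 0..3.
print([morton_index(x, y) for y in (0, 1) for x in (0, 1)])  # [0, 1, 2, 3]
```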
I'm trying to understand how SCD Types 5, 6 & 7 work.
I read an article from the Kimball Group and a Stack Overflow answer on Type 6.
I could understand the Type 6 concept, how it works and when to use it.
However, I'm still unable to understand how Types 5 & 7 work and when to use them.
An explanation of Types 5 & 7 with examples would be highly appreciated.
Thanks in advance.
I wouldn't worry too much: all the types above Type 3 have been called Type 6 at various times. Basically, there is a range of techniques for dealing with more complex history tracking, and it is up to you to pick the mix that works for your situation.
Having said that, I'll have a go at giving more of an idea of Type 5 and 7 from this article:
Design Tip #152 Slowly Changing Dimension Types 0, 4, 5, 6 and 7
Type 5 is a variation on a "mini dimension", whereby some of the attributes of a large dimension are subject to change but you don't want to use Type 2 because the dimension has millions of rows. You break those attributes out into a separate dimension that is built like a junk dimension, and you can use the key of that table in the fact to track history. In the Type 5 variation, you also include that new key in the base dimension itself as a Type 1 attribute, allowing you to query the dimension on its own at any point in time to find the value of those attributes, without having to go via the fact. For more info, google "mini dimension kimball".
Type 7 is a different way of achieving the same thing as Type 6: you maintain the Type 1 version of things separately from the Type 2 version of things. Often the Type 1 version is created as a view over the Type 2 version. By having both keys in the fact you can query how things were at the time of the fact and also how things are based on the current versions of the dimensions, and it avoids the need to update the old values with the current state.
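To make Type 7 concrete, here is a small sketch in pandas (the table and column names are invented for illustration): the fact row carries both the point-in-time surrogate key and the durable key, and a current-rows-only view of the dimension serves the "as is" perspective:

```python
import pandas as pd

# Type 2 dimension: one row per version of a customer, plus a durable key.
dim_customer = pd.DataFrame({
    "customer_sk": [1, 2],         # surrogate key, one per version
    "customer_dk": [100, 100],     # durable key, stable across versions
    "city":        ["Oslo", "Bergen"],
    "is_current":  [False, True],
})

# Fact rows store BOTH keys (this is the Type 7 part).
fact_sales = pd.DataFrame({"customer_sk": [1], "customer_dk": [100], "amount": [250]})

# "As was": join on the surrogate key -> city at the time of the sale (Oslo).
as_was = fact_sales.merge(dim_customer, on="customer_sk")

# "As is": join the durable key to a current-rows-only view -> current city (Bergen).
dim_current = dim_customer[dim_customer["is_current"]]
as_is = fact_sales.merge(dim_current, on="customer_dk")

print(as_was[["amount", "city"]])
print(as_is[["amount", "city"]])
```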
I was wondering if anyone knows of any intuitive crossover and mutation operators for paths within a graph? Thanks!
The question is a bit old, but the problem doesn't seem to be outdated or solved, so I think my research might still be helpful for someone.
While mutation and crossover are quite trivial in the TSP problem, where every mutation is valid (because the chromosome represents an order of visiting fixed nodes, so swapping the order always produces a valid result), in the case of Shortest Path or Optimal Path, where the chromosome is an exact route representation, this doesn't apply and the operators aren't that obvious. So here is how I approach the problem of solving Optimal Path using a GA.
For crossover, there are a few options:
For routes that have at least one common point (besides the start and end nodes): find all common points and swap the subroutes at the crossing place.
Parent 1: 51 33 41 7 12 91 60
Parent 2: 51 9 33 25 12 43 15 60
Potential crossing points are 33 and 12. We can get the following children: 51 9 33 41 7 12 43 15 60 and 51 33 25 12 91 60, which are the result of crossing using both of these crossing points.
When two routes don't have a common point, randomly select a point from each parent and connect them (you can use random traversal, backtracking, or a heuristic search like A* or beam search for that). This connecting path can then be treated as the crossover path. For a better understanding, see the picture of the two crossover methods below:
see http://i.imgur.com/0gDTNAq.png
Black and gray paths are the parents, pink and orange paths are the children, the green point is the crossover place, and the red points are the start and end nodes. The first graph shows the first type of crossover; the second graph is an example of the other one.
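For concreteness, here is a minimal Python sketch of the first (common-point) variant applied to the 33/12 example above; paths are plain lists of node ids, and it assumes the two crossing points appear in the same order in both parents:

```python
def crossover_on_common_points(parent1, parent2):
    """Swap the subroutes lying between two nodes common to both parents
    (start and end nodes excluded)."""
    common = [n for n in parent1[1:-1] if n in parent2[1:-1]]
    if len(common) < 2:
        return parent1[:], parent2[:]          # not enough crossing points
    a, b = common[0], common[-1]               # e.g. 33 and 12
    i1, j1 = parent1.index(a), parent1.index(b)
    i2, j2 = parent2.index(a), parent2.index(b)
    child1 = parent1[:i1] + parent2[i2:j2] + parent1[j1:]
    child2 = parent2[:i2] + parent1[i1:j1] + parent2[j2:]
    return child1, child2

p1 = [51, 33, 41, 7, 12, 91, 60]
p2 = [51, 9, 33, 25, 12, 43, 15, 60]
print(crossover_on_common_points(p1, p2))
# ([51, 33, 25, 12, 91, 60], [51, 9, 33, 41, 7, 12, 43, 15, 60])
```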
For mutation, there are also a few options. Generally, dummy mutations like swapping the order of nodes or adding a random node are really ineffective for graphs of average density. So here are approaches that guarantee valid mutations:
Take two random points from the path and replace the segment between them with a random path between those two nodes (a code sketch of this appears after the picture below).
Chromosome: 51 33 41 7 12 91 60, random points: 33 and 12, random/shortest path between them: 33 29 71 12, mutated chromosome: 51 33 29 71 12 91 60
Find a random point on the path, remove it and connect its neighbours (really very similar to the first one).
Find a random point on the path and find a random path to its neighbour.
Try re-traversing the path from some randomly chosen point until reaching any point on the initial route (a slight modification of the first method).
see http://i.imgur.com/19mWPes.png
Each graph corresponds to one mutation method, in order. In the last example, the orange path is the one that would replace the original path between the mutation points (green nodes).
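Here is a minimal sketch of the first mutation method (the graph is an adjacency dict; BFS stands in for the random/heuristic subroute search, and checks against revisiting nodes already on the route are omitted):

```python
import random
from collections import deque

def bfs_path(graph, src, dst):
    """Shortest path by BFS in an adjacency-dict graph (a stand-in for the
    random or heuristic subroute search mentioned above)."""
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None                                # dst not reachable from src

def mutate_subpath(graph, chromosome, rng=random):
    """Mutation #1: pick two positions on the route and splice in an
    alternative subroute between those two nodes."""
    i, j = sorted(rng.sample(range(len(chromosome)), 2))
    detour = bfs_path(graph, chromosome[i], chromosome[j])
    if detour is None:
        return chromosome[:]                   # no alternative found, keep as is
    return chromosome[:i] + detour + chromosome[j + 1:]
```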
Note: these methods obviously may have a performance drawback when the search for an alternative subroute (using a random or heuristic method) gets stuck somewhere or finds a very long and useless subpath, so consider bounding the mutation execution time or the number of trials.
For my case, which is finding an optimal path in terms of maximizing the sum of vertex weights while keeping the sum of node weights below a given bound, those methods are quite effective and give good results. Should you have any questions, feel free to ask. Also, sorry for my MS Paint skills ;)
Update
One big hint: I basically used this approach in my implementation, but there was one big drawback to using random path generation. I decided to switch to semi-random route generation using shortest paths through randomly picked point(s): it is much more efficient (but obviously may not be applicable to all problems).
Hmm... that is a very difficult question; people write dissertations on this and there is still no right answer.
The general rule is "it all depends on your domain".
There are some generic GA libraries that will do some work for you, but for the best results it is recommended to implement your GA operations yourself, specifically for your domain.
You might have more luck with answers on Theoretical CS, but you need to expand your question more and add more details about your task and domain.
Update:
So you have a graph. In GA terms, a path through the graph represents an individual (the chromosome), and the nodes in the path would be its genes.
In that case I would say a mutation can be represented as a deviation of the path from the original: one of the nodes is moved somewhere else, and the path is adjusted so that the start and end nodes remain the same.
Mutation can lead to invalid individuals, and in that case you need to make a decision: allow the invalid ones and hope that they lead towards some unexplored solution, or kill them on the spot. When I was working with GAs, I did allow invalid solutions, adding an "unfitness" value alongside fitness. Some researchers suggest this can help with broad exploration of the solution space.
Crossover can only happen between paths that cross each other: at the crossing point, swap the remainders of the parents' paths.
Bear in mind that there are various ways to do crossover: individuals can be crossed over at multiple points or just one. With graphs you can have multiple crossing points, and that can naturally lead to multiple child graphs.
As I said before, there is no right or wrong way of doing this; you will only find the best way by experimenting.