Question: I'm looking for a technique that reduces the number of iterations my application has to perform to find the optimal variable combination, without testing every possible combination.
Current situation: I have a list of variables, and each variable has a list of valid values. At the moment I create the Cartesian product of the valid variable values and run logic across each possible variable combination. This means running 2,000,000 different iterations, which takes a lot of time. I'm not interested in how to run 2,000,000 combinations more efficiently; instead, I'm after a technique I could use to home in on an optimal variable combination without running through all the combinations.
Example: let's say I've got 3 variables named "one", "two" & "three", and each variable can take either the value 1 or 2. This means I have 2^3 = 8 different variable combinations. My list of possible variable combinations would look something like:
[
[one:1,two:1,three:1],
[one:1,two:1,three:2],
[one:1,two:2,three:1],
[one:1,two:2,three:2],
[one:2,two:1,three:1],
[one:2,two:1,three:2],
[one:2,two:2,three:1],
[one:2,two:2,three:2]
]
I then run logic against each possible variable combination, which gives me the result for that combination. The end result is that I know which variable combination gives me the best result. This works great for smaller variable sets but takes days for larger ones.
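For reference, the exhaustive approach described above boils down to the following Python sketch, with a hypothetical score function standing in for the real per-combination logic:

from itertools import product

# valid values for each variable, as in the example above
domains = {"one": [1, 2], "two": [1, 2], "three": [1, 2]}

def score(combo):
    # hypothetical placeholder for the real evaluation logic
    return sum(combo.values())

# exhaustive search: every one of the 2^3 combinations is evaluated,
# which is exactly what becomes infeasible at 2,000,000 combinations
names = list(domains)
best = max((dict(zip(names, values)) for values in product(*domains.values())),
           key=score)
print(best)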
There does not seem to be an "easy" way (such as exists in R or Python) to create interaction terms between dummy variables in gretl.
Do we really need to code them manually, which would be difficult for factors with many levels? Here is a minimal example of manual coding:
open credscore.gdt
series SelfemplOwnRent = OwnRent * Selfempl
# model 1
ols Acc 0 OwnRent Selfempl SelfemplOwnRent
Now my manual interaction term will not work for factors with many levels and in fact does not even do the job for binary variables.
Thanks,
ML
One way of doing this is to use lists. Use the dummify command to generate dummies for each level and the ^ operator to create the interactions. Example:
open griliches.gdt
discrete med
list X = dummify(med)
list D = dummify(mrt)
list INT = X^D
ols lw 0 X D INT
The discrete command turns your variable into a discrete variable and allows you to use dummify (this step is not necessary if your variable is already discrete). All interaction terms are now stored in the list INT, and you can easily access them in the subsequent ols command.
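For comparison, the "easy" way the question alludes to in Python would be the statsmodels formula interface. A minimal sketch, assuming the griliches data have been exported to a CSV file (the filename is hypothetical):

import pandas as pd
import statsmodels.formula.api as smf

# hypothetical export of griliches.gdt to CSV (e.g. via gretl's "store" command)
df = pd.read_csv("griliches.csv")

# C() marks a column as categorical; '*' expands to main effects plus all
# interactions, mirroring dummify(med), dummify(mrt) and X^D above
model = smf.ols("lw ~ C(med) * C(mrt)", data=df).fit()
print(model.summary())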
@Markus Loecher, on your second question:
You can always use the rename command to rename a series, so you would have to loop over all elements in the list INT to do so. However, I would rather suggest renaming both input series (in the above example, mrt and med) before computing the interaction terms if you want shorter series names.
I conducted an experiment with a latin square design.
How do I arrange the data in IBM SPSS to analyze it in one file? I've watched this video, https://www.youtube.com/watch?v=APvlPjYSSaI, where groups were formed (e.g., Block 1 = Group 1, Block 2 = Group 2, Block 3 = Group 3) in addition to the time period of measurement. But in the video's example the order of the treatments is the same for everyone (not arranged in a Latin square as in my experiment). So how should I arrange the data from the Latin square design in SPSS?
This is a within-subject design, as I measure my dependent variable after each treatment (3 measurements of the dependent variable per subject). Each block contains 5 subjects.
Latin squares
Latin squares are usually used to balance the possible treatments in an experiment, and to prevent confounding the results with the order of treatment.
There is no special way to analyze the latin square. You just make a note of it when describing your methods.
arranging data for analysis
From your description, this is a between-within (mixed) design: each subject was measured 3 times, and subjects are grouped by the order of treatments.
Your data should be arranged in wide format for the ANOVA: one row per subject, with a subject identifier, a group variable coding the treatment order (i.e., which row of the Latin square the subject received), and one column per measurement occasion (e.g., DV_t1, DV_t2, DV_t3). The repeated-measures procedure then takes the three DV columns as within-subject levels and the group variable as a between-subjects factor.
I believe this thread would be useful for you:
https://stats.stackexchange.com/questions/244636/evaluating-contrasts-in-repeated-measures-anova-spss
I want to save a value in a variable and then use it in a conditional statement. To clarify, I'll give you a simpler example: imagine that I have a database with the kind of animal (dogs and cats), their age (1 or 2), and their weight. I want to do the following conditional:
IF (animal=dog & age=1 & weight >= percentile75) Wdogs=1.
EXECUTE.
IF (animal=dog & age=1 & weight < percentile75) Wdogs=0.
EXECUTE.
I want to calculate percentile75 automatically and save it in a variable so I can use the code on any database I have. I also want the variables to be recomputed if I change the database and run the code again. Is there a way to do this?
Thanks a lot
You can use RANK in order to divide the weights into n groups. The command creates a new rank variable which you can then use in your conditionals.
For the ranking to be done separately in each relevant sub-group, use the BY sub-command.
In your example, each animal/age subgroup will be ranked separately into weight quartiles, and the subsequent commands will use the new variable:
RANK VARIABLES=weight (A) BY animal age /NTILES(4).
IF(animal=dog & age=1 & Nweight=4) Wdogs=1.
IF(animal=dog & age=1 & Nweight<=3) Wdogs=0.
EXECUTE.
You can save a line of syntax by using the following instead of the two IF commands:
IF(animal=dog & age=1) Wdogs=(Nweight=4).
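For readers working outside SPSS, the same grouped-quartile logic can be sketched in Python with pandas; the column names and sample values here are hypothetical, mirroring the example above:

import pandas as pd

# hypothetical data mirroring the animal/age/weight example
df = pd.DataFrame({
    "animal": ["dog"] * 8 + ["cat"] * 4,
    "age":    [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1],
    "weight": [4.0, 5.5, 7.2, 9.1, 6.0, 6.5, 8.0, 10.2, 2.1, 2.9, 3.3, 4.4],
})

# quartile rank of weight within each animal/age subgroup (1..4),
# analogous to RANK VARIABLES=weight BY animal age /NTILES(4)
df["Nweight"] = (df.groupby(["animal", "age"])["weight"]
                   .transform(lambda s: pd.qcut(s, 4, labels=False) + 1))

# flag dogs aged 1 in the top weight quartile, like the single-IF version above
mask = (df.animal == "dog") & (df.age == 1)
df.loc[mask, "Wdogs"] = (df.loc[mask, "Nweight"] == 4).astype(int)
print(df)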
First thing, mate, is to lose those EXECUTE commands (they are unnecessary and betray you as a noob point-and-clicker with default options). From other posts I discern you need to do this with multiple variables. Simplify your life by boning up on VARSTOCASES, keep the current RANK with BY, and put a shrimp (or a prawn) on the Barbie Queue.
If you compare two sets of data (such as two files), the differences between them can be displayed in two columns, or two panes, as WinMerge does.
But are there any visual paradigms to display the differences between multiple data sets?
Update
The starting point of my question was the assumption that displaying the differences between 2 files is relatively easy (hence my mention of WinMerge), whereas comparing 3 or more text files turns out to be more complicated: there will be more and more differences between, say, different versions of a document created over time.
How would you highlight parts of the file that are the same in 2 versions, but different from other versions?
The data sets I have in mind are objects (A, B, C, ...) which may or may not exist and have properties (a, b, c, ...) which may be set or not set.
Example:
Set 1: A(a, b, c), B(b, c), C(c)
Set 2: A(a, b, c), B(b), C(c)
Set 3: A(a, b), B(b)
If you compare 2 sets, e.g. 1 and 2, the difference would be B(c). Comparing sets 2 and 3 results in the differences A(c) and C().
If you compare all 3 sets, you end up with 3 pairwise comparisons (n * (n - 1) / 2 in general).
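A minimal Python sketch of that pairwise comparison, using the example sets above (representing each set as a dict of property sets is an assumption):

from itertools import combinations

# each set maps object name -> set of properties that are set
sets = {
    1: {"A": {"a", "b", "c"}, "B": {"b", "c"}, "C": {"c"}},
    2: {"A": {"a", "b", "c"}, "B": {"b"}, "C": {"c"}},
    3: {"A": {"a", "b"}, "B": {"b"}},
}

# one comparison per pair: n * (n - 1) / 2 of them
for i, j in combinations(sets, 2):
    s, t = sets[i], sets[j]
    for obj in sorted(s.keys() | t.keys()):
        if obj not in s or obj not in t:
            print(f"sets {i}/{j}: object {obj} missing from one set")
        elif s[obj] != t[obj]:
            print(f"sets {i}/{j}: {obj} differs in {s[obj] ^ t[obj]}")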
I have a different view from some of those who provided answers, namely that you need to further specify the problem. The abstraction level is about right: further specification would make the problem easier, but the solution less useful.
A couple of years ago, I saw a graphic on ProgrammableWeb that compared the results of a search on Yahoo with the results of the same search on Google. There's a lot of information to convey: some results are in both sets, some in just one, and the common results have different positions in the respective engines' result lists, which somehow has to be shown.
I liked the graphic and reimplemented it in Matplotlib (a Python scientific plotting library). Below is an example using some random points, as well as the Python code I used to generate it:
import numpy as NP
from matplotlib import pyplot as PLT

# each row pairs an item's x-position in set 1 (top line) with its
# x-position in set 2 (bottom line)
xvals = NP.array([(2, 3), (5, 7), (8, 6), (1.5, 1.8), (3.0, 3.8), (5.3, 5.2),
                  (3.7, 4.1), (2.9, 3.7), (8.4, 6.1), (7.1, 6.4)])
yvals = NP.tile(NP.array([5, 3]), [10, 1])

fig = PLT.figure()
ax1 = fig.add_subplot(111)
# the two horizontal lines, one per result set
x = NP.array([xvals.min(), xvals.max()])
ax1.plot(x, [5, 5], "-", lw=3, color='b')
ax1.plot(x, [3, 3], "-", lw=3, color='b')
# connect each matched pair of points across the two lines
for a, b in zip(xvals, yvals):
    ax1.plot(a, b, '-o', ms=8, mfc='orange', color='g')
PLT.axis("off")
PLT.show()
This model has some interesting features: (i) it actually deals with 'similarity' on a per-item basis (the vertically-oriented line connecting the dots) rather than aggregate similarity; (ii) the degree of similarity between two data points is proportional to the angle of the line connecting them--90 degrees if they are equal, with a decreasing angle as the difference increases; this is very intuitive; (iii) cases in which a point in one data set is not present in the second data set are easy to show--a point will appear on one of the two lines but without a line connecting it to a point on the other line.
This model works well for comparing search results because each search result has a 'score' (its index, or order in the results list). For other types of data, you might have to assign a score to each data point; a similarity metric might work, I suppose (in a sense, that's actually what the search-result order is: a distance from the top of the list).
Since there has been so much work on displaying a diff of two files, you might start by expressing your 'multiple data sets' in an appropriate text format, then use whatever tool you like to show a diff between those text files.
But you should tell us more about your data sets!
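To illustrate the text-format idea, a minimal Python sketch: serialize each of the example sets from the question to one line per object, then let a standard text diff display the differences (difflib here stands in for any diff tool):

import difflib

# serialize an object set to a canonical, line-oriented text form
def render(objects):
    return [f"{name}({','.join(sorted(props))})"
            for name, props in sorted(objects.items())]

set1 = {"A": {"a", "b", "c"}, "B": {"b", "c"}, "C": {"c"}}
set2 = {"A": {"a", "b", "c"}, "B": {"b"}, "C": {"c"}}

# any text-diff tool or viewer can now display the differences
for line in difflib.unified_diff(render(set1), render(set2),
                                 "set1", "set2", lineterm=""):
    print(line)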
I experimented a bit, and implemented two displays:
Matrix
Timeline
I agree with Peter: you should specify what type of data you have and what you wish to bring out in the comparison.
Depending on the nature of the data and the comparison, you can consider different visualisations. Is your data ordered or unordered? How many things are you comparing, i.e., is it a fine-grained or a gross comparison?
Examples:
Visualizing a comparison of unordered data could be as simple as plotting the histograms of your sets (i.e., their distributions):
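A minimal Matplotlib sketch of this idea, using made-up sample data:

import numpy as NP
from matplotlib import pyplot as PLT

# two hypothetical unordered data sets, compared as distributions
rng = NP.random.default_rng(0)
set_a = rng.normal(0.0, 1.0, 500)
set_b = rng.normal(0.5, 1.3, 500)

# overlaid, semi-transparent histograms make the differences visible
PLT.hist(set_a, bins=30, alpha=0.5, label="set A")
PLT.hist(set_b, bins=30, alpha=0.5, label="set B")
PLT.legend()
PLT.show()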
On the other hand, comparing a huge ordered dataset like DNA can be done innovatively.
Also, check out visual complexity; it's a great resource for interesting visualizations.