gretl - dummy interactions - interaction

There does not seem to be an "easy" way (such as in R or python) to create interaction terms between dummy variables in gretl ?
Do we really need to code those manually which will be difficult for many levels? Here is a minimal example of manual coding:
open credscore.gdt
SelfemplOwnRent=OwnRent*Selfempl
# model 1
ols Acc 0 OwnRent Selfempl SelfemplOwnRent
Now my manual interaction term will not work for factors with many levels and in fact does not even do the job for binary variables.
Thanks,
ML

One way of doing this is to use lists. Use the dummify-command for generating dummies for each level and the ^-operator for creating the interactions. Example:
open griliches.gdt
discrete med
list X = dummify(med)
list D = dummify(mrt)
list INT = X^D
ols lw 0 X D INT
The command discrete turns your variable into a discrete variable and allows to use dummify (this step is not necessary if your variable is already discrete). Now all interactions terms are stored in the list INT and you can easily assess them in the following ols-command.

#Markus Loecher on your second question:
You can always use the rename command to rename a series. So you would have to loop over all elements in list INT to do so. However, I would rather suggest to rename both input series, in the above example mrt and med respectively, before computing the interaction terms if you want shorter series names.

Related

When are placeholders necessary?

Every TensorFlow example I've seen uses placeholders to feed data into the graph. But my applications work fine without placeholders. According to the documentation, using placeholders is the "best practice", but they seem to make the code unnecessarily complex.
Are there any occasions when placeholders are absolutely necessary?
According to the documentation, using placeholders is the "best practice"
Hold on, this quote is out-of-context and could be misinterpreted. Placeholders are the best practice when feeding data through feed_dict.
Using a placeholder makes the intent clear: this is an input node that needs feeding. Tensorflow even provides a placeholder_with_default that does not need feeding — but again, the intent of such a node is clear. For all purposes, a placeholder_with_default does the same thing as a constant — you can indeed feed the constant to change its value, but is the intent clear, would that not be confusing? I doubt so.
There are other ways to input data than feeding and AFAICS all have their uses.
A placeholder is a promise to provide a value later.
Simple example is to define two placeholders a,b and then an operation on them like below .
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b # + provides a shortcut for tf.add(a, b)
a,b are not initialized and contains no data Because they were defined as placeholders.
Other approach to do same is to define variables tf.Variable and in this case you have to provide an initial value when you declare it.
like :
tf.global_variables_initializer()
or
tf.initialize_all_variables()
And this solution has two drawbacks
Performance wise that you need to do one extra step with calling
initializer however these variables are updatable .
in some cases you do not know the initial values for these variables
so you have to define it as a placeholder
Conclusion :
use tf.Variable for trainable variables such as weights (W) and biases (B) for your model or when Initial values are required in
general.
tf.placeholder allows you to create operations and build computation graph, without needing the data. In TensorFlow
terminology, we then feed data into the graph through these
placeholders.
I really like Ahmed's answer and I upvoted it, but I would like to provide an alternative explanation that might or might not make things a bit clearer.
One of the significant features of Tensorflow is that its operation graphs are compiled and then executed outside of the original environment used to build them. This allows Tensorflow do all sorts of tricks and optimizations, like distributed, platform independent calculations, graph interoperability, GPU computations etc. But all of this comes at the price of complexity. Since your graph is being executed inside its own VM of some sort, you have to have a special way of feeding data into it from the outside, for example from your python program.
This is where placeholders come in. One way of feeding data into your model is to supply it via a feed dictionary when you execute a graph op. And to indicate where inside the graph this data is supposed to go you use placeholders. This way, as Ahmed said, placeholder is a sort of a promise for data supplied in the future. It is literally a placeholder for things you will supply later. To use an example similar to Ahmed's
# define graph to do matrix muliplication
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
# this is the actual operation we want to do,
# but since we want to supply x and y at runtime
# we will use placeholders
model = tf.matmul(x, y)
# now lets supply the data and run the graph
init = tf.global_variables_initializer()
with tf.Session() as session:
session.run(init)
# generate some data for our graph
data_x = np.random.randint(0, 10, size=[5, 5])
data_y = np.random.randint(0, 10, size=[5, 5])
# do the work
result = session.run(model, feed_dict={x: data_x, y: data_y}
There are other ways of supplying data into the graph, but arguably, placeholders and feed_dict is the most comprehensible way and it provides most flexibility.
If you want to avoid placeholders, other ways of supplying data are either loading the whole dataset into constants on graph build or moving the whole process of loading and pre-processing the data into the graph by using input pipelines. You can read up on all of this in the TF documentation.
https://www.tensorflow.org/programmers_guide/reading_data

How to find a function that fits a given data set?

The search algorithm is a Breadth first search. I'm not sure how to store terms from and equation into a open list. The function f(x) has the form of ax^e1 + bx^e2 + cx^e3 + k, where a, b, c, are coefficients; k is constant. All exponents, coefficients, and constants are integers between 0 and 5.
Initial state: of the problem solving process should be any term from the ax^e1, bx^e2, cX^e3, k.
The algorithm gradually expands the number of terms in each level of the list.
Not sure how to add the terms to an equation from an open Queue. That is the question.
The general problem that you are dealing belongs to the regression analysis area, and several techniques are available to find a function that fits a given data set, including the popular least squares methods for finding the line of best fit given a dataset (a brief starting point is the related page on wikipedia, but if you want to deepen this topic, you should look at the research paper out there).
If you want to stick with the breadth first search algorithm, although this kind of approach is not common for such a problem, first of all, you need to define all the elements for a search problem, namely (see for more information Chapter 3 of the book of Stuart and Russell, Artificial Intelligence: A Modern Approach):
Initial state: Some initial values for the different terms.
Actions: in your case it should be a change in the different terms. Note that you should discretize the changes in the values.
Transition function: function that determines the new states given a state and an action.
Goal test: a check to recognize whether a state is a goal state or not, and so to terminate the search. There are different ways to define this test in a regression problem. One way is to set a threshold for the sum of the square errors.
Step cost: The cost for an action. In such an abstract problem, probably you can consider the unweighted distance from the initial state on the search graph.
Note that you should carefully think about these elements, as, for example, they determine how efficient your search would be or whether you will have cycles in the search graph.
After you defined all of the elements for the search problem, you basically have to implement:
Node, that contains information about the parent, the state, and the current cost;
Function to expand a given node that returns the successor nodes (according to the transition function, the actions, and the step cost);
Goal test;
The actual search algorithm. In the queue at the beginning you will have the node containing the initial state. After, it is updated with the successor nodes.

How do you include categories with 0 responses in SPSS frequency output?

Is there a way to display response options that have 0 responses in SPSS frequency output? The default is for SPSS to omit in the frequency table output any response option that is not selected by at least a single respondent. I looked for a syntax-driven option to no avail. Thank you in advance for any assistance!
It doesn't show because there is no one single case in the data is with that attribute. So, by forcing a row of zero you'll need to realize we're asking SPSS to do something incorrect.
Having said that, you can introduce a fake case with the missing category. E.g. if you have Orange, Apple, and Pear, but no one answered they like Pear, the add one fake case that says Pear.
Now, make a new weight variable that consists of only 1. But for the Pear case, make it very very small like 0.00001. Then, go to Data > Weight Cases > Weight cases by and put that new weight variable over. Click OK to apply. Now what happens is that SPSS will treat the "1" with a weight of 1 and the fake case with a weight that is 1/10000 of a normal case. If you rerun the frequency you should see the one with zero count shows up.
If you have purchased the Custom Table module you can also do that directly as well, as far as I can tell from their technical document. That module costs 637 to 3630 depending on license type, so probably only worth a try if your institute has it.
So, I'm a noob with SPSS, I (shame on me) have a cracked version of SPSS 22 and if I understood your question correctly, this is my solution:
double click the Frequency table in Output
right click table, select Table Properties
go to General and then uncheck the Hide empty rows and columns option
Hope this helps someone!
If your SPSS version has no Custom Tables installed and you haven't collected money for that module yet then use the following (run this syntax):
*Note: please use variable names up to 8 characters long.
set mxloops 1000. /*in case your list of values is longer than 40
matrix.
get vars /vari= V1 V2 /names= names /miss= omit. /*V1 V2 here is your categorical variable(s)
comp vals= {1,2,3,4,5,99}. /*let this be the list of possible values shared by the variables
comp freq= make(ncol(vals),ncol(vars),0).
loop i= 1 to ncol(vals).
comp freq(i,:)= csum(vars=vals(i)).
end loop.
comp names= {'vals',names}.
print {t(vals),freq} /cnames= names /title 'Frequency'. /*here you are - the frequencies
print {t(vals),freq/nrow(vars)*100} /cnames= names /format f8.2 /title 'Percent'. /*and percents
end matrix.
*If variables have missing values, they are deleted listwise. To include missings, use
get vars /vari= V1 V2 /names= names /miss= -999. /*or other value
*To exclude missings individually from each variable, analyze by separate variables.

Getting length of vector in SPSS

I have an sav file with plenty of variables. What I would like to do now is create macros/routines that detect basic properties of a range of item sets, using SPSS syntax.
COMPUTE scale_vars_01 = v_28 TO v_240.
The code above is intended to define a range of items which I would like to observe in further detail. How can I get the number of elements in the "array" scale_vars_01, as an integer?
Thanks for info. (as you see, the SPSS syntax is still kind of strange to me and I am thinking about using Python instead, but that might be too much overhead for my relatively simple purposes).
One way is to use COUNT, such as:
COUNT Total = v_28 TO v_240 (LO THRU HI).
This will count all of the valid values in the vector. This will not work if the vector contains mixed types (e.g. string and numeric) or if the vector has missing values. An inefficient way to get the entire count using DO REPEAT is below:
DO IF $casenum = 1.
COMPUTE Total = 0.
DO REPEAT V = v_28 TO V240.
COMPUTE Total = Total + 1.
END REPEAT.
ELSE.
COMPUTE Total = LAG(Total).
END IF.
This will work for mixed type variables, and will count fields with missing values. (The DO IF would work the same for COUNT, this forces a data pass, but for large datasets and large lists will only evaluate for the first case.)
Python is probably the most efficient way to do this though - and I see no reason not to use it if you are familiar with it.
BEGIN PROGRAM.
import spss
beg = 'X1'
end = 'X10'
MyVars = []
for i in xrange(spss.GetVariableCount()):
x = spss.GetVariableName(i)
MyVars.append(x)
len = MyVars.index(end) - MyVars.index(beg) + 1
print len
END PROGRAM.
Statistics has a built-in macro facility that could be used to define sets of variables, but the Python apis provide much more powerful ways to access and use the metadata. And there is an extension command SPSSINC SELECT VARIABLES that can define macros based on variable metadata such as patterns in names, measurement level, type, and other properties. It generates a macro listing these variables that can then be used in standard syntax.

How to display the results of multiple comparisons

If you compare two sets of data (such as two files), the differences between these sets can be displayed in two columns, or two panes, such as WinMerge does.
But are there any visual paradigms to display the differences between multiple data sets?
Update
The starting point of my question was the assumption that displaying differences between 2 files is relatively easy, as I mentioned WinMerge, whereas comparing 3 or more text files turns out to be more complicated, as there will be more and more differences between, say, different versions of a document that have been created over time.
How would you highlight parts of the file that are the same in 2 versions, but different from other versions?
The data sets I have in mind are objects (A, B, C, ...) which may or may not exist and have properties (a, b, c, ...) which may be set or not set.
Example:
Set 1: A(a, b, c), B(b, c), C(c)
Set 2: A(a, b, c), B(b), C(c)
Set 3: A(a, b), B(b)
If you compare 2 sets, e.g. 1 and 2, the difference would be in B(c). Comparing sets 2 and 3 results in the difference A(c) and C().
If you compare all 3 sets, you end up with 3 comparisons (n * (n-1) / 2)
I have a different view than some of those who provided Answers--i.e., that you need to further specify the problem. The abstraction level is about right. Further specification would make the problem easier, but the solution less useful.
A couple of years ago, i saw a graphic on ProgrammableWeb--it compared the results from a search on Yahoo with the results from the same search on Google. There's a lot of information to covey: some results are in both sets, some in just one, and the common results will have different positions in the respective engine's results, which somehow has to be shown.
I like the graphic and reimplemented it in Matplotlib (a Python scientific plotting library). Below is an example using some random points as well as python code i used to generate it:
from matplotlib import pyplot as PLT
xvals = NP.array([(2,3), (5,7), (8,6), (1.5,1.8), (3.0,3.8), (5.3,5.2),
(3.7,4.1), (2.9, 3.7), (8.4, 6.1), (7.1, 6.4)])
yvals = NP.tile( NP.array([5,3]), [10,1] )
fig = PLT.figure()
ax1 = fig.add_subplot(111)
ax1.plot(x, y, "-", lw=3, color='b')
ax1.plot(x, y2, "-", lw=3, color='b')
for a, b in zip(xvals, yvals) : ax1.plot(a,b,'-o',ms=8,mfc='orange', color='g')
PLT.axis("off")
PLT.show()
This model has some interesting features: (i) it actually deals with 'similarity' on a per-item basis (the vertically-oriented line connecting the dots) rather than aggregate similarity; (ii) the degree of similarity between two data points is proportional to the angle of the line connecting them--90 degrees if they are equal, with a decreasing angle as the difference increases; this is very intuitive; (iii) cases in which a point in one data set is not present in the second data set are easy to show--a point will appear on one of the two lines but without a line connecting it to a point on the other line.
This model works well for comparing search results because each search result has a 'score' (its index, or order in the Results List). For other types of data, you might have to assign a score to each data point--a similarity metric might i suppose (in a sense, that's actually what the search result order is, an distance from the top of the list)
Since there has been so much work into displaying a diff of two files, you might start by expressing your 'multiple data sets' in an appropriate text format, then using whatever you want to show a diff between those text formats.
But you should tell us more about your data sets!
I experimented a bit, and implemented two displays:
Matrix
Timeline
I agree with Peter, you should specify what type your data is and what you wish to bring out in the comparison.
Depending on the nature of the data/comparison you can consider different visualisations. Is your data ordered or unordered? How many things are you comparing, i.e. fine grain or gross comparison?
Examples:
Visualizing a comparison of unordered data could just be plotting the two histograms of your sets (i.e. distributions):
image source
On the other hand, comparing a huge ordered dataset like DNA can be done innovatively.
Also, check out visual complexity, it's a great resource for interesting visualization.

Resources