I have a .sav file with many variables. What I would like to do now is create macros/routines that detect basic properties of a range of item sets, using SPSS syntax.
COMPUTE scale_vars_01 = v_28 TO v_240.
The code above is intended to define a range of items which I would like to observe in further detail. How can I get the number of elements in the "array" scale_vars_01, as an integer?
Thanks for any info. (As you can see, SPSS syntax is still somewhat strange to me, and I am considering using Python instead, but that might be too much overhead for my relatively simple purposes.)
One way is to use COUNT, such as:
COUNT Total = v_28 TO v_240 (LO THRU HI).
This counts the valid values across the list for each case. It will not give you the number of variables if the list mixes types (e.g. string and numeric) or if any of the variables have missing values. A less efficient but more robust way to get the full count using DO REPEAT is below:
DO IF $casenum = 1.
COMPUTE Total = 0.
DO REPEAT V = v_28 TO v_240.
COMPUTE Total = Total + 1.
END REPEAT.
ELSE.
COMPUTE Total = LAG(Total).
END IF.
This will work for mixed-type variables and will count fields with missing values. (The same DO IF trick works with COUNT; it forces a data pass, but for large datasets and long variable lists it only evaluates the count for the first case.)
Python is probably the most efficient way to do this though - and I see no reason not to use it if you are familiar with it.
BEGIN PROGRAM.
import spss
beg = 'X1'
end = 'X10'
MyVars = []
# collect all variable names in file order
for i in range(spss.GetVariableCount()):
    MyVars.append(spss.GetVariableName(i))
# count of variables from beg to end, inclusive
n_vars = MyVars.index(end) - MyVars.index(beg) + 1
print(n_vars)
END PROGRAM.
Statistics has a built-in macro facility that could be used to define sets of variables, but the Python APIs provide much more powerful ways to access and use the metadata. And there is an extension command, SPSSINC SELECT VARIABLES, that can define macros based on variable metadata such as patterns in names, measurement level, type, and other properties. It generates a macro listing these variables that can then be used in standard syntax.
All I want to do is to define a set of integers that may have values above 255, but I'm not seeing any good options. For instance:
with MyObject do Visible := Tag in [100, 155, 200..225, 240]; // Works just fine
but
with MyObject do Visible := Tag in [100, 201..212, 314, 820, 7006]; // Compiler error
I've gotten by with (often lengthy) conditional statements such as:
with MyObject do Visible := (Tag in [100, 202..212]) or (Tag = 314) or (Tag = 820) or (Tag = 7006);
but that seems ridiculous, and this is just a hard-coded example. What if I want to write a procedure and pass a set of integers whose values may be above 255? There HAS to be a better, more concise way of doing this.
The base type of a Delphi set must be an ordinal type with at most 256 distinct values. Under the hood, such a variable has one bit for each possible value, so a variable of type set of Byte has size 256 bits = 32 bytes.
Suppose it were possible to create a variable of type set of Integer. There would be 2^32 = 4,294,967,296 distinct integer values, so this variable must have 4,294,967,296 bits. Hence, it would be of size 512 MB. That's a HUGE variable. Maybe you can put such a value on the stack in 100 years.
Consequently, if you truly need to work with (mathematical) sets of integers, you need a custom data structure; the built-in set types won't do. For instance, you could implement it as an advanced record. Then you can even overload the in operator to make it look like a true Pascal set!
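To make the "custom data structure with an overloaded membership operator" idea concrete without writing out the Delphi advanced-record boilerplate, here is a minimal sketch in Python (names like IntSet and visible_tags are made up for illustration); in Delphi you would achieve the same effect by declaring a record with a class operator In:

```python
class IntSet:
    """Minimal integer-set wrapper with an overloaded membership
    test -- the analogue of overloading the 'in' operator on a
    Delphi advanced record. No 0..255 base-type restriction."""
    def __init__(self, *values):
        self._values = frozenset(values)

    def __contains__(self, n):
        return n in self._values

# the tag list from the question, including values above 255
visible_tags = IntSet(100, *range(201, 213), 314, 820, 7006)
print(314 in visible_tags)   # membership reads like a Pascal set
print(255 in visible_tags)
```

Backing the record with a hash set (or a sorted array) keeps membership tests cheap even for large, sparse value ranges.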
Implementing such a slow and inefficient type is trivial, and that might be good enough for small sets. Implementing a general-purpose integer set data structure with efficient operations (membership test, subset tests, intersection, union, etc.) is more work. There might be third-party code available on the WWW (but StackOverflow is not the place for library recommendations).
If your needs are more modest, you can use a simple array of integers instead (TArray<Integer>). Maybe you don't need O(1) membership tests, subset tests, intersections, and unions?
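As a sketch of the simple-array alternative (shown in Python rather than Delphi; the helper name in_sorted is made up), keeping the array sorted gives an O(log n) membership test via binary search, with no custom set type at all:

```python
import bisect

def in_sorted(values, n):
    """O(log n) membership test on a sorted array of integers."""
    i = bisect.bisect_left(values, n)
    return i < len(values) and values[i] == n

# the question's tag values, kept sorted
tags = sorted([100, 314, 820, 7006] + list(range(201, 213)))
print(in_sorted(tags, 820))
print(in_sorted(tags, 500))
```

In Delphi the same approach would be a sorted TArray&lt;Integer&gt; with TArray.BinarySearch.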
I would say that such a task already calls for a database. Something small and simple like TFDMemTable + TFDLocalSQL should do.
There does not seem to be an "easy" way (as in R or Python) to create interaction terms between dummy variables in gretl. Do we really need to code them manually, which gets difficult with many levels? Here is a minimal example of manual coding:
open credscore.gdt
SelfemplOwnRent=OwnRent*Selfempl
# model 1
ols Acc 0 OwnRent Selfempl SelfemplOwnRent
Now this manual interaction term will not scale to factors with many levels, and in fact it does not even do the full job for binary variables.
Thanks,
ML
One way of doing this is to use lists. Use the dummify-command for generating dummies for each level and the ^-operator for creating the interactions. Example:
open griliches.gdt
discrete med
list X = dummify(med)
list D = dummify(mrt)
list INT = X^D
ols lw 0 X D INT
The command discrete turns your variable into a discrete variable and lets you use dummify (this step is unnecessary if your variable is already discrete). All interaction terms are now stored in the list INT, and you can easily access them in the following ols command.
@Markus Loecher, on your second question: you can always use the rename command to rename a series, so you would have to loop over all elements of the list INT to do so. However, I would rather suggest renaming both input series (mrt and med in the example above) before computing the interaction terms if you want shorter series names.
Recently I have been implementing an algorithm from a paper that I will be using in my master's work, but I've run into problems with the time some operations take.
Before I get into details: my data set comprises roughly 4 million data points.
I have two lists of tuples that I get from a framework (annoy) that calculates the cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two such lists with the same names (the first value of each tuple) but two different cosine similarities. What I have to do is sum the cosines from both lists, then sort the result and take the top-N highest cosine values.
My issue is: it's taking too long. My current implementation is as follows:
from collections import defaultdict

def topN(self, user, session):
    upref = self.m2vTN.get_user_preference(user)
    spref = self.sm2vTN.get_user_preference(session)
    # list of tuples 1
    most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
    # list of tuples 2
    most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
    # concat both lists and sum cosines per name in a dict
    d = defaultdict(int)
    for l, v in (most_ss + most_su):
        d[l] += v
    # convert the dict into a list, and then sort it
    _list = list(d.items())
    _list.sort(key=lambda x: x[1], reverse=True)
    return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads, but I'm not sure they will help. Getting the lists is not the problem here; the concatenation and sorting are.
Thanks! English is not my native language, so sorry for any misspellings.
What do you mean by "too long"? How large are the two lists? Is there a chance your model, and interim results, are larger than RAM and thus forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine similarity with all vectors in the model, the annoy indexer isn't helping any. (Its purpose is to get a small subset of nearest neighbors much faster, at the expense of perfect accuracy. But if you're calculating the similarity to every candidate, there's no speedup or advantage to using ANNOY.)
Further, if you're going to combine all of the distances from two such calculations, there's no need for the sorting that most_similar() usually does - it just makes combining the values more complex later. For the gensim vector models, you can supply a False-ish topn value to get the unsorted distances to all model vectors, in order. You then have two large arrays of distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See for example the gensim source code for that part of most_similar():
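As a minimal sketch of that final ranking step (the two small arrays here are made-up stand-ins for what most_similar(..., topn=False) would return), numpy's argsort gives the indices of the top-N combined similarities directly:

```python
import numpy as np

# hypothetical stand-ins for the two unsorted distance arrays,
# both in the model's native vector order
udists = np.array([0.10, 0.80, 0.30, 0.60])
sdists = np.array([0.20, 0.15, 0.40, 0.30])

combined_dists = udists + sdists           # elementwise sum, model order
N = 2
# indices of the N largest combined similarities, best first
best_idx = np.argsort(combined_dists)[::-1][:N]
# map indices back to names via the model's ordering, e.g.:
# top_names = [self.m2v.index2entity[i] for i in best_idx]
print(best_idx)
```

This avoids building and sorting a dict of several million (name, value) pairs; the heavy lifting stays in vectorized numpy operations.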
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
Finally, you might not need to be doing this calculation yourself at all. You can provide more-than-one vector to most_similar() as the positive target, and then it will return the vectors closest to the average of both vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't be the same value/ranking as your other sum, but may behave very similarly. (If you wanted less-than-all of the similarities, then it might make sense to use the ANNOY indexer this way, as well.)
I need to count the number of solutions of a bitvector theory. I would first like to bit-blast and then call a (propositional) model counter on the resulting CNF. However, for the count to equal the number of solutions of the original theory, I have to perform so-called projected model counting (because of the added Tseitin variables). The problem is that I haven't been able to identify the correct subset of variables (those not added by the Tseitin encoding) required for this task. This is what I'm doing at the moment:
F = z3.parse_smt2_file(inst)
g = z3.Goal()
g.add(F)
t = z3.Then('simplify', 'bit-blast')
subgoal = t(g)
vars = z3_util.get_vars(subgoal.as_expr())
t = z3.Tactic('tseitin-cnf')
subgoal = t(subgoal.as_expr())
print_cnf(subgoal)
Here 'vars' is the subset of variables that I need. However, when I print the CNF to a file and run a tool that performs projected model counting on those variables, the number of models returned is not correct. Any idea how to get the correct subset of variables (i.e. how to exclude the Tseitin variables)?
Are there any shortcuts codes in SPSS for listing multiple variables? Say something similar to v1-v3 instead of v1 v2 v3 in SAS data step?
Some commands (but not all) allow you to use the TO modifier. This depends on the variables being in the correct order in the data matrix. There are also multiple-response sets, and you can define macros that expand to a specific set of variables.
Below I give examples of using TO and of defining a set of variables via a macro. I admittedly never use multiple-response sets, so I can only state that they are an option (more useful for a set of dichotomous items than for continuous variables, I believe).
set seed = 10.
input program.
loop #i = 1 to 100.
compute id = #i.
compute V1 = RV.NORM(0,1).
compute V2 = RV.UNIFORM(0,1).
compute V3 = RV.POISSON(3).
compute V4 = RV.BERNOULLI(.5).
compute V5 = RV.BINOM(5,.8).
end case.
end loop.
end file.
end input program.
dataset name sim.
execute.
freq var V1 to V5 /format = notable /statistics = mean.
DEFINE !myvars () V1 V2 V3 V4 V5.
!ENDDEFINE.
set mprint on.
freq var !myvars /format = notable /statistics = mean.
TO is always based on file order. It would be rare IMO to want a list selected by an interval in alphabetical order. Commands that accept a list of variables pretty much all honor TO.
You can change the variable order by using the KEEP subcommand of MATCH FILES.
You can also define a macro for a list of variables and reference it where you need the list.
Finally, if you install the Python Essentials from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral) and the SPSSINC SELECT VARIABLES extension command, the dialog box makes it easy to define a macro based on file order, alpha order or measurement level, among other criteria.
HTH