Destring many, many variables, checking for non-numeric values - spss

Let's say I've imported data from Excel that has many, many variables, say v1 through v4000. Each of these is intended to be numeric, and most cases have numeric-only values, but some cases contain non-numeric characters. For some of those non-numeric values I know the meaning (e.g., "NA" for missing), but there may also be unknown strings that should be investigated.
For each variable, I think I would like to do something like 1) create a numeric version of that variable that has the original values for all cases that had numeric values, 2) create a list of unique string values for cases with non-numerics so those can be investigated. With 4,000 variables, I would ideally use some type of loop to do this.
How can that be done? Is it even possible?

I was able to solve this using the macro below. It renames each original variable with a "_str" suffix so that the original values are kept, and they can then be used to report frequencies of the values that were turned into system-missing values.
DEFINE destringvars (names=!CMDEND)
!DO !i !IN (!names)
RENAME VARIABLES (!i=!CONCAT(!i,"_str")).
STRING !i (A9).
COMPUTE !i=!CONCAT(!i,"_str").
ALTER TYPE !i (F8).
TEMPORARY.
SELECT IF SYSMIS(!i).
FREQUENCIES !CONCAT(!i,"_str").
!DOEND
EXECUTE.
!ENDDEFINE
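For readers who want to see the logic outside SPSS, here is a minimal Python sketch of the same idea: convert what can be converted, and collect the unique strings that could not be. The function name destring, the known_missing codes, and the sample columns are all hypothetical and only serve the illustration; this is not a translation of the macro above.

```python
def destring(values, known_missing=("NA", "")):
    """Split raw string values into numbers and a set of unexpected strings."""
    numeric = []
    unexpected = set()
    for raw in values:
        text = raw.strip()
        try:
            # Note: float() also accepts forms such as "1e3", "inf" and "nan".
            numeric.append(float(text))
        except ValueError:
            numeric.append(None)  # plays the role of a system-missing value
            if text not in known_missing:
                unexpected.add(text)
    return numeric, unexpected


# Hypothetical sample data standing in for v1 ... v4000.
columns = {
    "v1": ["1", "2.5", "NA", "n/a"],
    "v2": ["10", "", "7", "unknown"],
}

converted = {}
report = {}
for name, raw_values in columns.items():
    converted[name], unexpected = destring(raw_values)
    report[name] = sorted(unexpected)  # strings that still need investigating
```

With this sample data, report lists only the strings that are neither numeric nor a known missing code, which is the list to investigate by hand.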

Related

Check values existence using spss syntax

I need to check whether values exist based on some conditions.
For example, I have 3 variables: varA, varB and varC. varC should be non-empty only if varA>varB (the condition).
I normally use syntax like the following to flag each variable, and then run a frequency on the flag to see if there are errors:
if missing(varC) and (varA>varB) ck_varC=1.
if not(missing(varC)) and not(varA>varB) ck_varC=2.
exe.
fre ck_varC.
exe.
I had some errors when the condition became complex and when it contained missing() or other functions, but I could have made a mistake.
Do you think there is an easier way of doing these checks?
Thanks in advance.
EDIT: Here is an example of what I mean. Think of a questionnaire with some routing: you ask everyone their age; if they are between 17 and 44, you ask them whether they work; if they work, you ask them how many hours.
I have an Excel tool where I put down all variables with all conditions. It then generates syntax like the example above, with the same structure for all variables, covering both situations: we have a value that shouldn't be there, or we don't have a value that should be there.
Is there an easier way of doing that? Is this structure always valid, no matter what the condition is?
In SPSS, missing values are not numbers: a comparison such as varA>varB evaluates to missing when either operand is missing, and an IF whose condition is missing does not fire. You need to program those scenarios explicitly as well. You have varC covered (partially), but no scenario where varA or varB has missing data is covered.
As good practice, you should also initialize your check variable (as sysmis or 0) using syntax:
numeric ck_varC (f1.0).
compute ck_varC=0.
if missing(varC) and (varA>varB) ck_varC=1.
if not(missing(varC)) and not(varA>varB) ck_varC=2.
***additional conditional scenarios go here:.
if missing(varA) or missing(varB) ck_varC=3.
...
fre ck_varC.
By the way, you do not need any of the exe. commands if you are going to run your syntax as a whole.
Later edit, after the poster updated the question:
Your syntax would be something like this. Note the use of the range function, which is not mandatory but might be useful for you in the future.
I am also assuming that work is a string variable, so its values need to be referenced using quotation marks.
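The same decision table can be sketched in Python, with None standing in for a missing value. This is only an illustration of the logic; the function name check_varc is hypothetical, and the return codes mirror the ck_varC values used above.

```python
def check_varc(var_a, var_b, var_c):
    """Return 0 (ok), 1 (varC missing but expected),
    2 (varC present but not expected), 3 (condition cannot be evaluated)."""
    # The comparison is undefined if either operand is missing,
    # so that scenario has to be handled before anything else.
    if var_a is None or var_b is None:
        return 3
    expected = var_a > var_b
    if var_c is None and expected:
        return 1
    if var_c is not None and not expected:
        return 2
    return 0


# One row per scenario: (varA, varB, varC)
rows = [(5, 3, None), (2, 3, 4), (None, 3, 4), (5, 3, 1)]
flags = [check_varc(a, b, c) for a, b, c in rows]
```

Handling the missing operands first makes every case land in exactly one code, which is what the frequency table of the check variable relies on.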
if missing(age) ck_age=1.
if missing(work) and range(age,17,44) ck_work=1.
if missing(hours) and work="yes" ck_hours=1.
if not (missing (age)) and not(1>0) ck_age=2. /*this will never happen because of the not(1>0).
if not(missing(work)) and (not range(age,17,44)) ck_work=2. /*note that if age is missing, this ck_work won't be set here.
if not(missing(hours)) and (not(work="yes")) ck_hours=2.
EXECUTE.
String values are case sensitive.
There is no system-missing equivalent for strings; a blank string ("") is still a string, so not(work="yes") is true when work is blank ("").

How to dynamically define multiple polynomials inside a loop in Maxima

So...I want to create five different polynomials inside a loop in order to make a Sturm sequence, but I don't seem to be able to dynamically name a set of polynomials with different names.
For example:
In the first iteration it would define p1(x):whatever
Then, in the second iteration it would define p2(x):whatever
Lastly, in the Nth iteration it would define pn(x):whatever
So far, I have managed to simply store them in a list and call them one by one by their position. But surely there is a more professional way to accomplish this?
Sorry for the non-technical language :)
I think a subscripted variable is appropriate here. Something like:
for k:1 thru 5 do
p[k] : make_my_polynomial(k);
Then p[1], ..., p[5] are your polynomials.
When you assign to a subscripted variable e.g. something like foo[bar]: baz, where foo hasn't been defined as a list or array already, Maxima creates what it calls an "undeclared array", which is just a lookup table.
EDIT: You can refer to subscripted variables without assigning them any values. E.g. instead of x^2 - 3*x + 1 you could write u[i]^2 - 3*u[i] + 1 where u[i] is not yet assigned any value. Many (most?) functions treat subscripted variables the same as non-subscripted ones, e.g. diff(..., u[i]) to differentiate w.r.t. u[i].

What are buckets in terms of hash functions?

Looking at the book Mining of Massive Datasets, section 1.3.2 has an overview of Hash Functions. Without a computer science background, this is quite new to me; Ruby was my first language, where a hash seems to be equivalent to Dictionary<object, object>. And I had never considered how this kind of datastructure is put together.
The book mentions hash functions, as a means of implementing these dictionary data structures. This paragraph:
First, a hash function h takes a hash-key value as an argument and produces
a bucket number as a result. The bucket number is an integer, normally in the
range 0 to B − 1, where B is the number of buckets. Hash-keys can be of any
type. There is an intuitive property of hash functions that they “randomize”
hash-keys
What exactly are buckets in terms of a hash function? It sounds like buckets are array-like structures, and that the hash function is some kind of algorithm / array-like-structure search that produces the same bucket number every time. What is inside this metaphorical bucket?
I've always read that JavaScript objects, Ruby hashes, etc. don't guarantee order. In practice I've found that the order of keys doesn't change (actually, I think the JS object order DID change when I used an older version of Mozilla's Rhino interpreter, but I can't be sure...).
Does that mean that hashes (Ruby) / objects (JS) ARE NOT resolved by these hash functions?
Does the word hashing take on different meanings depending on the level at which you are working with computers? i.e. it would seem that a Ruby hash is not the same as a C++ hash...
When you hash a value, any useful hash function generally has a smaller range than the domain. This means that out of a large list of input values (for example all possible combinations of letters) it will output any of a smaller list of values (a number capped at a certain length). This means that more than one input value can map to the same output value.
When this is the case, the output values are referred to as buckets.
Consider the function f(x) = x mod 2
This generates the following outputs:
1 => 1
2 => 0
3 => 1
4 => 0
In this case there are two buckets (1 and 0), with a bunch of input values that fall into each.
A good hash function will fill all of these 'buckets' equally, and so enable faster searching etc. If you take the mod of any number, you get the bucket to look into, and thus have to search through fewer results than if you searched the whole set of inputs, since each bucket holds only a fraction of them. In the ideal situation, the hash is fast to calculate and there is only one result in each bucket, so a lookup takes only as long as applying the hash function.
This is a simplified example, of course, but hopefully you get the idea.
The concept of a hash function is always the same. It's a function that calculates some number to represent an object. The properties of this number should be:
it's relatively cheap to compute
it's as different as possible for all objects.
Let's give a really artificial example to show what I mean with this and why/how hashes are usually used.
Take all natural numbers. Now let's assume it's expensive to check if 2 numbers are equal.
Let's also define a relatively cheap hash function as follows:
hash = number % 10
The idea is simple, just take the last digit of the number as the hash. In the explanation you got, this means we put all numbers ending in 1 into an imaginary 1-bucket, all numbers ending in 2 in the 2-bucket etc...
Those buckets don't really exist as a data structure. They just make it easy to reason about the hash function.
Now that we have this cheap hash function we can use it to reduce the cost of other things. For example, we want to create a new datastructure to enable cheap searching of numbers. Let's call this datastructure a hashmap.
Here we actually put all the numbers with hash=1 together in a list/set/..., we put the numbers with hash=5 into their own list/set ... etc.
If we then want to look up some number, we first calculate its hash value. Then we check the list/set corresponding to this hash, and compare only the "similar" numbers to find the exact number we want. This means we only had to do a cheap hash calculation and then run the expensive equality check against roughly 1/10th of the numbers.
Note here that we use the hash function to define a new datastructure. The hash itself isn't a datastructure.
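As a rough sketch of that last-digit example in Python (the names last_digit_hash, buckets and contains, and the sample numbers, are made up for the illustration):

```python
from collections import defaultdict


def last_digit_hash(number):
    """Cheap hash: the last digit of a natural number."""
    return number % 10


# The "hashmap": one list of numbers per hash value.
buckets = defaultdict(list)
for n in [21, 35, 41, 7, 105, 98]:
    buckets[last_digit_hash(n)].append(n)


def contains(number):
    """Compare only against the numbers that share the same hash."""
    # .get avoids creating an empty bucket as a side effect of the lookup.
    return number in buckets.get(last_digit_hash(number), [])
```

Looking up 41 only compares against the numbers stored under hash 1 (here 21 and 41); the other buckets are never touched.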
Consider a phone book.
Imagine that you wanted to look for Donald Duck in a phone book.
It would be very inefficient to have to look at every page, and every entry on that page. So rather than doing that, we do the following:
We create an index
We create a way to obtain an index key from a name
For a phone book, the index goes from A-Z, and the function used to get the index key, is just getting first letter from the Surname.
In this case, the hashing function takes Donald Duck and gives you D.
Then you take D and go to the index where all the people with Surnames starting with D are.
That would be a very oversimplified way to put it.
Let me explain in simple terms. Buckets come into the picture when collisions are handled using the chaining technique (also called open hashing or closed addressing).
Here, each array entry corresponds to a bucket, and each nonempty array entry holds a pointer to the head of a linked list. (The bucket is implemented as a linked list.)
The hash table uses the hash function to calculate an index into the array of buckets, from which the desired value can be found.
That is, when checking whether an element is in the hash table, the key is first hashed to find the correct bucket to look into. Then the corresponding linked list is traversed to locate the desired element.
Similarly, when adding or deleting an element, hashing is used to find the appropriate bucket. The bucket is then checked for the presence or absence of the element by traversing the corresponding linked list, and the element is added to or removed from the bucket accordingly.
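A minimal Python sketch of chaining might look like the following. It is an illustration under stated assumptions rather than how any particular library implements it: the class name ChainedHashTable and its method names are hypothetical, Python's built-in hash() stands in for the hash function, and plain Python lists stand in for the linked lists.

```python
class ChainedHashTable:
    def __init__(self, bucket_count=8):
        # One (initially empty) chain per bucket.
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        # Hash the key to an index in the range 0 .. B - 1.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # key already present: overwrite
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Traverse only the chain of the bucket the key hashes to.
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                del bucket[i]
                return True
        return False


table = ChainedHashTable(bucket_count=8)
table.put(1, "a")
table.put(9, "b")  # 1 and 9 collide: both hash to bucket 1 when B = 8
```

The two colliding keys end up in the same chain, and get, put and remove each walk only that one chain.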

What is the convention to document types used in Lua?

I come from the strongly typed world and I want to write some Lua code. How should I document what type things are? What do Lua natives do? Hungarian notation? Something else?
For example:
local insert = function(what, where, offset)
It's impossible to tell at a glance whether we're talking about strings or tables here.
Should I do
local sInsert = function(sWhat, sWhere, nOffset)
or
-- string what, string where, number offset, return string
local insert = function(what, where, offset)
or something else?
What about local variables? What about table entries (e.g. someThing.someProperty)?
For a reference on thoughts and opinions on Lua style in the community (or a particular community?), read this: LuaStyleGuide.
The closest one could get to an enforced style would be the format used by LuaDoc, as it's a fairly popular documentation generator used by high profile projects such as LuaFileSystem.
There are only eight basic types in Lua (nil, boolean, number, string, function, userdata, thread, and table).
Here are some conventions (some of them might sound a bit obvious; sorry):
Anything that sounds like a string should be a string: street_address, request_method. If you are not sure, you can add _name (or any other suffix that makes clear it's a noun) to it: method_name
Anything that sounds like a number should be a number: mass, temperature, percentage. When in doubt, add number, amount, coefficient, or whatever fits: number_of_children, user_id. The names n and i are usually given to numbers. If a number must be positive or natural, make assertions at the top of the function.
Boolean parameters are either an adjective (cold, dirty) or is_<adjective> (is_wet, is_ready).
Anything that sounds like a verb should be a function: consume, check. You can add _function, _callback or _f if you need to clarify it further: update_function, post_callback. The single letter f quite often represents a function. Usually you should have only one parameter of type function (and it is recommended to put it at the end).
Anything that sounds like a collection should be a table: children, words, dictionary. People typically don't differentiate between array-like tables and dictionary-like tables, since both can be traversed with pairs. If you need to specify that a table is an array, you could add _array or _sequence at the end of the name. The letter t typically means table.
Coroutines are not used very often; you can follow the same rules as with functions, and you can also add _cor to their names.
Any value can be nil.
If it's an optional value, initialize it at the top of the function: options = options or {}
If it's a mandatory value, make an assertion (or return error): assert(name, "The name is mandatory")

Duh? help with f# option types

I am having a brain freeze on F#'s option types. I have 3 books and have read all I can, but I am not getting them.
Does someone have a clear and concise explanation and maybe a real world example?
TIA
Gary
Brian's answer has been rated as the best explanation of option types, so you should probably read it :-). I'll try to write a more concise explanation using a simple F# example...
Let's say you have a database of products and you want a function that searches the database and returns product with a specified name. What should the function do when there is no such product? When using null, the code could look like this:
Product p = GetProduct(name);
if (p != null)
Console.WriteLine(p.Description);
A problem with this approach is that you are not forced to perform the check, so you can easily write code that will throw an unexpected exception when product is not found:
Product p = GetProduct(name);
Console.WriteLine(p.Description);
When using option type, you're making the possibility of missing value explicit. Types defined in F# cannot have a null value and when you want to write a function that may or may not return value, you cannot return Product - instead you need to return option<Product>, so the above code would look like this (I added type annotations, so that you can see types):
let (p:option<Product>) = GetProduct(name)
match p with
| Some prod -> Console.WriteLine(prod.Description)
| None -> () // No product found
You cannot directly access the Description property, because the result of the search is not a Product. To get the actual Product value, you need to use pattern matching, which forces you to handle the case when a value is missing.
Summary. To summarize, the purpose of option type is to make the aspect of "missing value" explicit in the type and to force you to check whether a value is available each time you work with values that may possibly be missing.
See,
http://msdn.microsoft.com/en-us/library/dd233245.aspx
The intuition behind the option type is that it "implements" a null value. But in contrast to null, you have to explicitly declare that a value can be null, whereas in most other languages references can be null by default. There is a similarity to SQL's NULL/NOT NULL, if you are familiar with those.
Why is this clever? It is clever because the language can assume that no output of any expression can ever be null. Hence, it can eliminate all null-pointer checks from the code, yielding a lot of extra speed. Furthermore, it frees the programmer from having to check for the null case everywhere in order to produce safe code.
For the few cases where a program does require a null value, the option type exists. As an example, consider a function which asks for a key inside an .ini file. The value returned is an integer, but the .ini file might not contain the key. In this case, it does make sense to return 'null' if the key is not found. None of the integer values can serve as the marker, since the user might have entered exactly that integer value in the file. Hence, we need to 'lift' the domain of integers and give it a new value representing "no information", i.e., the null. So we wrap the 'int' into an 'int option'. Now, if there is no integer value we will get 'None', and if there is an integer value we will get 'Some(N)', where N is the integer value in question.
There are two beautiful consequences of the choice. One, we can use the general pattern match features of F# to discriminate the values in e.g., a case expression. Two, the framework of algebraic datatypes used to define the option type is exposed to the programmer. That is, if there were no option type in F# we could have created it ourselves!
