How do I run multiple sets of regressions in SPSS without having to retype the command each time? - spss

How do I run multiple sets of regressions in SPSS without having to retype the command each time or without having to change the dependent variable every single time manually?
I need to run a lot of regressions with the same independent variables but I need to change the dependent variable. Is there a possibility to make this process easier?
Thank you very much for your help.

Note also that if you have to repeat this process for each country, you can use SPLIT FILES with the country id, and statistical procedures, including REGRESSION, will automatically iterate over each country.

Let's say you have 50 dependent variables, each of which needs to be regressed on the same predictors using the same regression options. Paste your list of dependent variables into Excel as a vertical list (cells b1:b50). Into cells a1:a:50 paste that part of your regression syntax that comes before the name of the dependent variable, right up to and including "/dependent ". Into cells c1:c50 paste the part of your syntax that follows the name of the dependent variable. Then in cell d1 type "=concatenate(a1,b1,c1)". Paste that formula down through cells d2:d50 and you'll have your 50 commands to paste into a syntax window. It may show gridlines; SPSS will not have any problem with these.
btw, What sort of research context is it that requires these identically-configured regressions for a large number of outcomes?

Related

Extracting PDF Tables into Excel in Automation Anywhere

[![enter image description here][4]][4][![enter image description here][5]][5]I have a PDF that has tabular data that runs over 50+ pages, i want to extract this table into an excel file using Automation Anywhere. (i am using community version of AA 11.3). I watched videos of the PDF integration command but haven't had any success trying this for tabular data.
Requesting assistance.
Thanks.
I am afraid that your case will be quite challenging... and the main reason for that are the values that contains multiple lines. You can still achieve what you need, and with good performance, but the code itself will not be pretty. You will also be facing challanges with Automation Anywhere, since it does not really provide the right tools to do such a thing and you may need to resort to scripting (VBScripts) or Metabots.
Solution 1
This one will try to use purely text extraction and Regular expressions. Mainly standard functionality, nothing too "dirty".
First you need to realise how do the exported data look like. You can see that you can export to Plain or Structured.
The Plain one is not useful at all as the data is all over the place, without any clear pattern.
The Structured one is much better as the data structure resembles the data from the original document. From looking at the data you can make these observations:
Each row contains 5 columns
All columns are always filled (at least in the visible sample set)
The last two columns can serve as a pattern "anchor" (identifier), because they contain a clear pattern (a number followed by minimum of two spaces followed by a dollar sign and another number)
Rows with data are separated by a blank row
The text columns may contain a multiline value, which will duplicate the rows (this one thing makes it especially tricky)
First wou need to ensure that the Structured data contain only the table, nothing else. You can probably use the Before-After string command for that.
Then you need to check if you can reliably identify the character width of every column. You can try this for yourself if you copy the text into Excel, use the Text to Columns with the Fixed Width option and try to play around with the sliders
The you need to try to find a way how to reliably identify each row and prepare it for the Split command in AA. For that you need to have a delimiter. But since each data row can actually consists of multiple text rows, you need to create a delimiter of your own. I used the Replace function with Regular Expression option and replace a specific pattern for a delimiter (pipe). See here.
Now that you have added a custom delimiter, you can use the Split command to add each row into a list and loop through it.
Because each data row may consists of several rows, you will need to use Split again, this time use the [ENTER] as delimiter. Now you need to loop through each of the text line of a single data line and use the Substring function to extract data based on column width and concatenate them to a single value that you store somewhere else.
All in all, a painful process.
Solution 2
This may not be applicable, but it's worth a try - open the PDF in Microsoft Word. It will give you a warning, ignore it. Word will attempt to open the document and, if you're lucky, it will recognise your table as a table. If it works, it will make the data extraction much easier an you will be able to use Macros/VBA or even simple Copy&Paste. I tried it on a random PDF of my own and it works quite well.

How can I tell Z3 where to start when solving a formula?

I am currently dealing with a situation where the assertions given to Z3 contain a large number of inequalities and equalities. They are dependent on each other in a way that it is most efficient to start solving the formula by assigning values to the variables used in the equalities.
Is there a way to alter the heuristics of Z3 such that the solver always chooses to "start" at these formulas?
My guess would be to use a tactic which initially processes a goal containing the mentioned equalities. It would then continue with the other assertions, restarting the whole process if necessary.
However, I'm not sure how to go about implementing this - how can I create custom goals from sets of formulas?
You can try asserting the first set of formulas you want, then issue check-sat, issue the next set, issue check-sat; repeat as necessary. You can also use push-pop to go back to these points if you want.
By issuing multiple check-sats this way, you would be forcing the solver to explore the formulas you asserted up to that point. Whether this would actually achieve what you want of course depends on exactly what your formulas look like and how much the solver can derive at each check-sat call.

How can I merge several files on SPSS by variable label?

I have 48 .sav data sets containing results of a monthly survey. I need to merge the cases of all common variables from them, in order to come up with a 4 years aggregate. As I'm new to SPSS and I'm not very proficient with syntax (although i can follow it) I would normally do this using Data - Merge files - Add Cases but most of these common variables have different variable names on each data set as the questions are not always formulated in the same order and some questions only appear on one or two data sets.
However, the variable labels do not change from one data set to another. It would be great if someone knows a way to merge this data sets by variable label instead of variable name. Swapping variable names and variable labels would also do as then I could use Data - Merge files - Add Cases without problems.
Many thanks beforehand!
The merge procedures such as ADD FILES (Data > Merge Files > Add Cases) provide a capability to rename variables in the input files before merging. However, if there are a lot of variables to merge, this would get pretty tedious and error prone. Also, the dialog box supports only merging two files, while syntax allows up to 50.
Variable labels are generally not valid as variable names due to the typical presence of characters such as blanks and punctuation and length restrictions. If you have a rule that could be used to turn labels into valid variable names, that could be automated, or if the variables are always in the same order and are present in all the files, they could be renamed something like V1, V2, ...
The renaming could be done manually in syntax that you would craft for each file, or this could be done with a short Python program that you run on each file. I can write that for you if you provide details and, preferably, a sample dataset to test with (jkpeck AT gmail.com).
The Python code could loop over all the sav files in a directory and apply the renaming logic to each in one step.

Problem with PMML generation of Random Forest in R

I am trying to generate a PMML from a random forest model I obtained using R. I am using the randomForest package 4.6-12 and the last version of PMML for R. But every time I try to generate the PMML obtain an error. Here is the code:
data_train.rf <- randomForest( TARGET ~ ., data = train, ntree=100, na.action=na.omit, importance=TRUE)
pmml_file = pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem, any thoughts?
Thanks in advance,
Alvaro
Looks like the variable splitNode has not been initialized inside the "pmml" package. The initialization pathway depends on the data type of the split variable (eg. numeric, logical, factor). Please see the source code of the /R/pmml.randomForest.R file inside "pmml" package.
So, what are the columns in your train data.frame object?
Alternatively, you could try out the r2pmml package as it is much better at handling the randomForest model type.
The pmml code assumes the data type of the variables are numeric, simple logical or factor. It wont work if the data you use are some other type; DateTime for example.
It would help if your problem is reproducible; ideally you would provide the dataset you used. If not, at least a sample of it or a description of it...maybe summarize it.
You should also consider emailing the package maintainers directly.
I may have found the origin for this problem. In my dataset I have approx 500000 events and 30 variables, 10 of these variables are factors, and some of them have weakly populated levels in some cases having as little as 1 event.
I built several Random Forest models, each time including and extra variable to the model. I started adding to the model the numerical variables without a problem to generate a PMML, the same happened for the categorical variables with all levels largely populated, when I tried to include categorical variables with levels weakly populated I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose that the origin of the problem is that in some situations when building a tree where the levels is weakly populated then there is no split as there is only one case and although the randomForest package knows how to handle these cases, the pmml package does not.
My tests show that this problem appears when the number of levels of a categorical variable goes beyond the maximum number allowed by the randomForest function. The split defined in the forest sublist is no longer a positive integer which is required by the split definition for categorical objects. Reducing the number of levels fixed the problem.

How to detect tabular data from a variety of sources

In an experimental project I am playing with I want to be able to look at textual data and detect whether it contains data in a tabular format. Of course there are a lot of cases that could look like tabular data, so I was wondering what sort of algorithm I'd need to research to look for common features.
My first thought was to write a long switch/case statement that checked for data seperated by tabs, and then another case for data separated by pipe symbols and then yet another case for data separated in another way etc etc. Now of course I realize that I would have to come up with a list of different things to detect - but I wondered if there was a more intelligent way of detecting these features than doing a relatively slow search for each type.
I realize this question isn't especially eloquently put so I hope it makes some sense!
Any ideas?
(no idea how to tag this either - so help there is welcomed!)
The only reliable scheme would be to use machine-learning. You could, for example, train a perceptron classifier on a stack of examples of tabular and non-tabular materials.
A mixed solution might be appropriate, i.e. one whereby you handled the most common/obvious cases with simple heuristics (handled in "switch-like" manner) as you suggested, and to leave the harder cases, for automated-learning and other types of classifier-logic.
This assumes that you do not already have a defined types stored in the TSV.
A TSV file is typically
[Value1]\t[Value..N]\n
My suggestion would be to:
Count up all the tabs
Count up all of new lines
Count the total tabs in the first row
Divide the total number of tabs by the tabs in the first row
With the result of 4, if you get a remainder of 0 then you have a candidate of TSV files. From there you may either want to do the following things:
You can continue reading the data and ignoring the error of lines with less or more than the predicted tabs per line
You can scan each line before reading to make sure all are consistent
You can read up to the line that does not fit the format and then throw an error
Once you have a good prediction of the amount of tab separated values you can use a regular expression to parse out the values [as a group].

Resources