How can I merge several files on SPSS by variable label? - spss

I have 48 .sav data sets containing the results of a monthly survey. I need to merge the cases of all common variables from them in order to come up with a 4-year aggregate. As I'm new to SPSS and not very proficient with syntax (although I can follow it), I would normally do this using Data - Merge files - Add Cases, but most of these common variables have different variable names in each data set, as the questions are not always asked in the same order and some questions only appear in one or two data sets.
However, the variable labels do not change from one data set to another. It would be great if someone knows a way to merge these data sets by variable label instead of variable name. Swapping variable names and variable labels would also do, as I could then use Data - Merge files - Add Cases without problems.
Many thanks beforehand!

The merge procedures such as ADD FILES (Data > Merge Files > Add Cases) provide a capability to rename variables in the input files before merging. However, if there are a lot of variables to merge, this would get pretty tedious and error prone. Also, the dialog box supports only merging two files, while syntax allows up to 50.
Variable labels are generally not valid as variable names due to the typical presence of characters such as blanks and punctuation and length restrictions. If you have a rule that could be used to turn labels into valid variable names, that could be automated, or if the variables are always in the same order and are present in all the files, they could be renamed something like V1, V2, ...
The renaming could be done manually in syntax that you would craft for each file, or this could be done with a short Python program that you run on each file. I can write that for you if you provide details and, preferably, a sample dataset to test with (jkpeck AT gmail.com).
The Python code could loop over all the sav files in a directory and apply the renaming logic to each in one step.
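As a rough illustration of the "rule to turn labels into valid variable names" idea, here is a minimal sketch in plain Python. It assumes you already have a mapping from each current variable name to its label (how you read that from the .sav file is not shown), and the function names `label_to_name` and `rename_syntax` are made up for this example. It does not check for SPSS reserved words such as ALL or BY, so treat it as a starting point rather than a complete rule.

```python
import re

MAX_LEN = 64  # SPSS variable names are limited to 64 bytes


def label_to_name(label, used):
    """Turn a variable label into a unique, SPSS-friendly variable name."""
    # Replace every run of characters other than ASCII letters, digits and "_".
    name = re.sub(r"[^A-Za-z0-9_]+", "_", label).strip("_")
    # Make sure the name starts with a letter.
    if not name or not name[0].isalpha():
        name = "v_" + name
    name = name[:MAX_LEN].rstrip("_")
    # Resolve collisions case-insensitively by appending _2, _3, ...
    candidate, n = name, 1
    while candidate.lower() in used:
        n += 1
        suffix = f"_{n}"
        candidate = name[:MAX_LEN - len(suffix)] + suffix
    used.add(candidate.lower())
    return candidate


def rename_syntax(labels_by_name):
    """Build a RENAME VARIABLES command from {old_name: label}."""
    used = set()
    pairs = []
    for old, label in labels_by_name.items():
        new = label_to_name(label, used)
        if new.lower() != old.lower():
            pairs.append(f"({old}={new})")
    if not pairs:
        return ""
    return "RENAME VARIABLES " + " ".join(pairs) + "."


print(rename_syntax({"q3": "Age of respondent (years)", "q7": "1. Gender"}))
```

Running the generated command on each file before ADD FILES would give the common variables the same names everywhere, provided the labels really are identical across files.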


Extracting PDF Tables into Excel in Automation Anywhere

I have a PDF with tabular data that runs over 50+ pages, and I want to extract this table into an Excel file using Automation Anywhere (I am using the community version of AA 11.3). I watched videos of the PDF integration command but haven't had any success trying this for tabular data.
Requesting assistance.
Thanks.
I am afraid that your case will be quite challenging, and the main reason is the values that span multiple lines. You can still achieve what you need, and with good performance, but the code itself will not be pretty. You will also face challenges with Automation Anywhere, since it does not really provide the right tools for this, and you may need to resort to scripting (VBScript) or Metabots.
Solution 1
This one will try to use purely text extraction and Regular expressions. Mainly standard functionality, nothing too "dirty".
First you need to understand what the exported data look like. You can export to Plain or Structured.
The Plain one is not useful at all as the data is all over the place, without any clear pattern.
The Structured one is much better as the data structure resembles the data from the original document. From looking at the data you can make these observations:
Each row contains 5 columns
All columns are always filled (at least in the visible sample set)
The last two columns can serve as a pattern "anchor" (identifier), because they contain a clear pattern (a number followed by minimum of two spaces followed by a dollar sign and another number)
Rows with data are separated by a blank row
The text columns may contain a multiline value, which will duplicate the rows (this one thing makes it especially tricky)
First you need to ensure that the Structured data contain only the table, nothing else. You can probably use the Before-After string command for that.
Then you need to check if you can reliably identify the character width of every column. You can try this for yourself if you copy the text into Excel, use the Text to Columns with the Fixed Width option and try to play around with the sliders
Then you need to find a way to reliably identify each row and prepare it for the Split command in AA. For that you need a delimiter. But since each data row can actually consist of multiple text rows, you need to create a delimiter of your own. I used the Replace function with the Regular Expression option to replace a specific pattern with a delimiter (pipe).
Now that you have added a custom delimiter, you can use the Split command to add each row into a list and loop through it.
Because each data row may consist of several text rows, you will need to use Split again, this time with [ENTER] as the delimiter. Then loop through each text line of a single data row, use the Substring function to extract data based on column width, and concatenate the pieces into a single value that you store somewhere else.
All in all, a painful process.
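The logic of the steps above (split on blank rows, slice each text line by column width, concatenate the pieces per column) can be sketched outside Automation Anywhere in a few lines of Python. This is only an illustration of the idea, not AA code. The column widths and the three-column sample below are invented; in practice you would use the widths you found for your own five columns.

```python
def parse_records(text, widths):
    """Split blank-line-separated records and rebuild one value per column."""
    # Turn the column widths into (start, end) slice positions.
    bounds, start = [], 0
    for w in widths:
        bounds.append((start, start + w))
        start += w

    rows, current = [], None
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the record collected so far.
            if current is not None:
                rows.append([" ".join(parts) for parts in current])
                current = None
            continue
        if current is None:
            current = [[] for _ in widths]
        # Slice the text line by column and keep the non-empty pieces.
        for i, (a, b) in enumerate(bounds):
            piece = line[a:b].strip()
            if piece:
                current[i].append(piece)
    if current is not None:
        rows.append([" ".join(parts) for parts in current])
    return rows


# Hypothetical sample: the second text line continues the middle column.
sample = "\n".join([
    "A1".ljust(6) + "Red widget".ljust(12) + "$10",
    "".ljust(6) + "large".ljust(12),
    "",
    "B2".ljust(6) + "Blue bolt".ljust(12) + "$3",
])
for row in parse_records(sample, [6, 12, 6]):
    print(row)
```

If scripting is an option, the same approach could be ported to VBScript, which is what AA can call directly.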
Solution 2
This may not be applicable, but it's worth a try: open the PDF in Microsoft Word. It will give you a warning; ignore it. Word will attempt to open the document and, if you're lucky, it will recognise your table as a table. If it works, it will make the data extraction much easier and you will be able to use Macros/VBA or even simple Copy&Paste. I tried it on a random PDF of my own and it works quite well.

How do I merge POT and PO files so that I exclude entries that are not in the POT file?

In short, I am trying to find a way to create a new PO file from a new POT and an existing PO file - but I want to exclude any strings (and their translations) that are not in the POT file.
Every time we change the wording on our cakePHP site, we generate a new POT file that contains all the translatable strings in the site. But when we merge it with the existing PO file (using POEdit), the merge process only adds the POT entries to the PO file. It doesn't remove the translations we no longer need. We have over 12k unneeded translations in our PO files. This makes our translator very unhappy. She has taken to just looking at the site and sending me translations to add manually, which makes me very unhappy.
I've looked around for tools that do this destructive merge, but I haven't been successful finding one. Before I head off to write one...is there something I missed?
(Sorry if this belongs on a different exchange, I will move this post to a better exchange if anyone tells me which one).
What you describe as "destructive" merge is the standard, normal merge operation in gettext and what everybody wants — you'd have to go out of your way to accomplish non-destructive versions, and I'm not even sure how.
From this it's safe to conclude that (1) you must be doing some weird steps not described above, or (2) your POT file contains more than you think it does (e.g. because you append to it instead of replacing it), or (3) you or the tools you use misinterpret the resulting PO file.
To merge using GNU gettext command line tools:
msgmerge -U your_old_translation.po latest_strings.pot
To merge using Poedit (notice the spelling):
Open PO file with the (now outdated) translations.
Use Catalog → Update from POT file…
Choose the newly regenerated POT file.
Notice that by default, outdated translations are kept in the PO file as backup. In Poedit, you can purge them (see Catalog → Purge deleted translations). However, these obsolete entries are stored in a different way in the PO file, as specially formatted comments, and are not visible or editable in Poedit or any conforming PO editing tool.
If I were to bet, I'd say (3) is the most likely cause (in which case, use a better editor like, ahem, Poedit), or perhaps (2) (should be easy to review by searching the POT for now-unused strings).
But merging really does the right thing that you expect it to do.
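To make the expected outcome concrete, here is a deliberately simplified model of the merge in Python. It is not how msgmerge or Poedit is implemented: it ignores fuzzy matching, plural forms and message contexts, and the function name `merge_catalog` is made up. It only shows which entries should end up active and which should end up set aside as obsolete.

```python
def merge_catalog(old_po, pot_msgids):
    """Return (active, obsolete) for a simplified POT update.

    old_po     -- {msgid: translation} from the existing PO file
    pot_msgids -- iterable of msgids present in the new POT file
    """
    # Every string in the POT is kept, reusing the old translation if any.
    active = {msgid: old_po.get(msgid, "") for msgid in pot_msgids}
    # Old entries that are no longer in the POT are set aside, not kept active.
    obsolete = {m: t for m, t in old_po.items() if m not in active}
    return active, obsolete


old = {"Hello": "Bonjour", "Bye": "Au revoir"}
active, obsolete = merge_catalog(old, ["Hello", "Welcome"])
print(active)
print(obsolete)
```

If strings that are absent from the POT still show up as active entries after the update, that points to cause (2) or (3) above rather than to the merge itself.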

Error Adding Variables in SPSS

I am using the Data > Merge Files > Add Variables in SPSS. The two .sav files both contain a variable called "Student_No" which is numeric with the same width in each file. I am using this as the key variable in which to match cases. I am not indicating that cases are not sorted. It makes no difference if I indicate that the active or non-active data set is keyed. In either case the new variables are not properly matched with the cases.
What are some of the potential problems that might be causing this mismatch?
The dialog box pastes STAR JOIN syntax in some cases and MATCH FILES in others. There were some problems with STAR JOIN in older versions of Statistics, so you might need to use MATCH FILES instead. See the Command Syntax Reference for that command on how to do this.

how the multiple pdbs can be written in single pdb file using biopython libraries

I wonder how the multiple pdbs can be written in single pdb file using biopython libraries. For reading multiple pdbs such as NMR structure, there is content in documentation but for writing, I do not find. Does anybody have an idea on it?
Yes, you can. It's documented here.
Imagine you have a list of structure objects; let's name it structures. You might want to try:
from Bio import PDB

pdb_io = PDB.PDBIO()
target_file = 'all_struc.pdb'
# Open the output file once and write the first model of each structure to it
with open(target_file, 'w') as open_file:
    for struct in structures:
        pdb_io.set_structure(struct[0])
        pdb_io.save(open_file)
That is the simplest solution for this problem. Some important things:
Different protein crystal structures have different coordinate systems, so you will probably need to superimpose them, or apply some transformation, before comparing them.
In pdb_io.set_structure you can select an entity, a chain, or even a bunch of atoms.
pdb_io.save has a second argument, which is a Select class instance. It will help you remove waters, heteroatoms, unwanted chains...
Be aware that NMR structures contain multiple entities. You might want to select one. Hope this can help you.
Mithrado's solution may not actually achieve what you want. With his code, you will indeed write all the structures into a single file. However, it does so in such a way that might not be readable by other software. It adds an "END" line after each structure. Many pieces of software will stop reading the file at that point, as that is how the PDB file format is specified.
A better solution, but still not perfect, is to remove a chain from one Structure and add it to a second Structure as a different chain. You can do this by:
# Get a list of the chains in a structure
chains = list(structure2.get_chains())
# Rename the chain (in my case, I rename from 'A' to 'B')
chains[0].id = 'B'
# Detach this chain from structure2
chains[0].detach_parent()
# Add it onto structure1
structure1[0].add(chains[0])
Note that you have to be careful that the name of the chain you're adding doesn't yet exist in structure1.
In my opinion, the Biopython library is poorly structured or non-intuitive in many respects, and this is just one example. Use something else if you can.
Inspired by Nate's solution, but adding multiple models to one structure, rather than multiple chains to one model:
from Bio import PDB

ms = PDB.Structure.Structure("master")
i = 0
for structure in structures:
    for model in list(structure):
        # Copy each model and give it a new, unique id and serial number
        new_model = model.copy()
        new_model.id = i
        new_model.serial_num = i + 1
        i = i + 1
        ms.add(new_model)
pdb_io = PDB.PDBIO()
pdb_io.set_structure(ms)
pdb_io.save("all.pdb")
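For comparison, the multi-model file layout can also be produced at the text level, without Biopython. This is a different technique from the answers above, shown only as a minimal sketch: each input is wrapped in MODEL/ENDMDL records and a single END is written at the very end. It assumes each input string is the text of one single-model PDB file, it keeps only coordinate records (ATOM, HETATM, TER, ANISOU), and the function name `combine_as_models` is made up.

```python
COORD_RECORDS = {"ATOM", "HETATM", "TER", "ANISOU"}


def combine_as_models(pdb_texts):
    """Wrap each PDB text in MODEL/ENDMDL and finish with a single END."""
    out = []
    for serial, text in enumerate(pdb_texts, start=1):
        # MODEL record: the serial number is right-justified in columns 11-14.
        out.append(f"MODEL     {serial:>4}")
        for line in text.splitlines():
            # The record name occupies the first six columns of each line.
            if line[:6].strip() in COORD_RECORDS:
                out.append(line)
        out.append("ENDMDL")
    out.append("END")
    return "\n".join(out) + "\n"


# Two tiny hypothetical inputs, each ending with its own END record.
first = "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N\nTER\nEND\n"
second = "ATOM      1  N   GLY A   1       1.000   2.000   3.000  1.00  0.00           N\nTER\nEND\n"
print(combine_as_models([first, second]))
```

Header information such as CRYST1 or REMARK records is dropped by this sketch, so it is only suitable when the coordinates are all you need.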

How do I run multiple sets of regressions in SPSS without having to retype the command each time?

How do I run multiple sets of regressions in SPSS without having to retype the command each time or without having to change the dependent variable every single time manually?
I need to run a lot of regressions with the same independent variables but I need to change the dependent variable. Is there a possibility to make this process easier?
Thank you very much for your help.
Note also that if you have to repeat this process for each country, you can use SPLIT FILE with the country id, and statistical procedures, including REGRESSION, will automatically iterate over each country.
Let's say you have 50 dependent variables, each of which needs to be regressed on the same predictors using the same regression options. Paste your list of dependent variables into Excel as a vertical list (cells b1:b50). Into cells a1:a50 paste the part of your regression syntax that comes before the name of the dependent variable, right up to and including "/dependent ". Into cells c1:c50 paste the part of your syntax that follows the name of the dependent variable. Then in cell d1 type "=concatenate(a1,b1,c1)". Paste that formula down through cells d2:d50 and you'll have your 50 commands to paste into a syntax window. It may show gridlines; SPSS will not have any problem with these.
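The same concatenation trick can be done with a few lines of Python instead of Excel. This is a minimal sketch: the variable names are placeholders, the function name `regression_commands` is made up, and the command is reduced to its /DEPENDENT and /METHOD subcommands, so add whatever other options your own regression syntax uses.

```python
def regression_commands(dependents, predictors):
    """Return one REGRESSION command per dependent variable."""
    ivs = " ".join(predictors)
    return [
        f"REGRESSION\n  /DEPENDENT {dv}\n  /METHOD=ENTER {ivs}."
        for dv in dependents
    ]


# Placeholder variable names; replace with your own lists.
commands = regression_commands(["y1", "y2", "y3"], ["x1", "x2"])
print("\n\n".join(commands))
```

The printed text can be pasted into a syntax window and run as is.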
btw, What sort of research context is it that requires these identically-configured regressions for a large number of outcomes?
