Problem reading variables containing a mix of numbers and strings - SPSS

I am reading an Excel file (see syntax below) in which some of the fields contain text mixed with numbers. The problem is that SPSS reads some of these fields as numeric instead of string, and the text is then deleted.
I assume this happens when a large share of the first rows is empty or contains numeric values, so SPSS defines the variable as numeric.
How can this be avoided?
GET DATA
/TYPE=XLSX
/FILE='M:\MyData.xlsx'
/SHEET=name 'Sheet1'
/CELLRANGE=FULL
/READNAMES=ON
/DATATYPEMIN PERCENTAGE=95.0
/HIDDEN IGNORE=YES.

When you use the GET DATA command, the subcommand /DATATYPEMIN PERCENTAGE=95.0 tells SPSS that a column may still be assigned a data type if up to 5% of its values do not conform to that type's format; the non-conforming values (here, the text) are then lost. So, to avoid cases where a field is read as numeric because only very few of its values are text, correct the subcommand to:
/DATATYPEMIN PERCENTAGE=100
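Applied to the syntax in the question, the full command then reads:
GET DATA
/TYPE=XLSX
/FILE='M:\MyData.xlsx'
/SHEET=name 'Sheet1'
/CELLRANGE=FULL
/READNAMES=ON
/DATATYPEMIN PERCENTAGE=100
/HIDDEN IGNORE=YES.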

Related

Character limit in string variables

When I export my data from a .dta file to .sps, my string variables get cut off, and the data contains characters that look like a UTF-8 problem. I think the problem might be that some string variables have a width over 261; at least, they are cut at that point.
Does SPSS have a character limit, and if so, how can I increase it?
"The values of string variables can contain numbers, letters, and special characters and can be up to 32,767 bytes. "
You seem to be trying to export STATA data (.dta) to SPSS Statistics (.sav). What mechanism are you using in STATA to do this? Does STATA have a limitation on the width of string fields?
As horace_vr has already pointed out, an SPSS Statistics command file has the *.sps extension. Are you truly trying to have STATA write SPSS Statistics Command Syntax for you and save as *.sps? Perhaps you meant you were exporting from *.dta (in STATA) to *.sav (in SPSS Statistics).
Note also that if you have created an SPSS Statistics data file (*.sav) in code page mode but then open it while SPSS Statistics is in Unicode mode, your string widths will triple in size. This is an artifact of the conversion of the various code pages to Unicode.
"When code page data files are read in Unicode mode, the defined width of all string variables is tripled. You can use ALTER TYPE to automatically adjust the width of all string variables."
I hope this helps
-ddwyer

How to prevent Google Spreadsheet from interpreting commas as thousand separators?

Currently, pasting 112,359,1003 into Google Sheets automatically converts the value to 1123591003.
This prevents me from applying the Split text to columns option as there are no commas left to split by.
Note that my number format is set to the following (rather than being Automatic):
Selecting the Plain text option prevents the commas from being truncated but also prevents me from being able to use the inserted data in formulas.
This workaround is undesirable when inserting large amounts of data: select the cells you expect to occupy, set them to Plain text, paste the data, then set them back to the desired number format.
How do I disable the automatic interpretation by Google Spreadsheet of the commas in my pasted numeric values?
You cannot paste it in any number format because of the nature of numeric format types: the value will be parsed into an actual number and physically stored in that form. Using plain text, as you are, is the way to go for this.
However, there are some options to perform these tasks in a slightly different way:
- you might be able to use the CSV-import functionality, which avoids having to change types for a sheet.
- you can use the INT() function to parse the plain-text value into an integer (and combine this with lookup functions).
TEXT formatting:
Prepend the number with '. It will be stored as text regardless of the actual formatting.
Alternatively, select the column and set its formatting to Plain text.
In both of the above cases, you can multiply the resulting text by 1 (e.g. =A1*1) to use it in any formula as a number.
NUMBER formatting:
Keep the Number formatting with the , separator, or Automatic.
Here, though Split text to columns might not work, you can use TEXT() or TO_TEXT():
=ARRAYFORMULA(SPLIT(TO_TEXT(A1:A5),","))
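If you need the split pieces back as numbers rather than text, a variant of the same formula should work (hedging slightly: this relies on ARRAYFORMULA coercing each split piece element-wise, as in the *1 trick above):
=ARRAYFORMULA(SPLIT(TO_TEXT(A1:A5),",")*1)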

tFuzzyMatch apparently not working on Arabic text strings

I have created a job in Talend Open Studio for Data Integration v5.5.1.
I am trying to find matches between two customer-name columns, where one is a lookup and the other contains dirty data.
The job runs as expected when the customer names are in English. However, for Arabic names, only exact matches are found, regardless of the matching algorithm I used (Levenshtein, Metaphone, Double Metaphone), even with loose bounds for the Levenshtein algorithm (min 1, max 50).
I suspect this has to do with character encoding. How should I proceed? Is there any way I can operate on the Unicode or even UTF-8 interpretation in Talend?
I am using Excel data sources through tFileInputExcel.
I got it resolved by moving the data to MySQL with a UTF-8 collation. Somehow the Excel input wasn't preserving the encoding.
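Not a Talend fix as such, but a quick sketch of the suspected encoding issue: if a matcher compares raw UTF-8 bytes instead of Unicode code points, every Arabic letter occupies two bytes, so edit distances roughly double and loose matches get filtered out. The names below are arbitrary examples, and the function is a plain textbook implementation:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance; works on str or bytes.
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

print(levenshtein("محمد", "محمود"))  # over code points: 1 (one letter inserted)
print(levenshtein("محمد".encode("utf-8"), "محمود".encode("utf-8")))  # over bytes: 2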

How many chars long can numeric EDIFACT data elements be?

In EDIFACT there are numeric data elements, specified e.g. as format n..5. We want to store those fields in a database table (with alphanumeric fields, so we can check them). How long must the DB fields be so that we can be sure to store every possible valid value? I know it is at least two additional chars (for the decimal mark, be it comma or point, and possibly a leading minus sign).
We are building our tables after the UN/EDIFACT standard used in our message, not the specific implementation guide involved, because we want to be able to store everything matching that standard. But the documentation on the numeric data elements isn't really straightforward (or at least I could not find the relevant part).
Thanks for any help
I finally found the information on the UNECE web site, in the documentation on UN/EDIFACT rules, Part 4 "UN/EDIFACT Rules", Chapter 2.2 "Syntax Rules". They don't say it directly, but when you put all the parts together, you get it. See TOC entry 10: "Representation of numeric data element values".
Here's what it basically says:
10.1: Decimal Mark
The decimal mark must be transmitted (if needed) as specified in UNA (comma or point, but always one character). It shall not be counted as a character of the value when computing the maximum field length of a data element.
10.2: Triad Separator
Triad separators shall not be used in interchange.
10.3: Sign
[...] If a value is to be indicated to be negative, it shall in transmission be immediately preceded by a minus sign e.g. -112. The minus sign shall not be counted as a character of the value when computing the maximum field length of a data element. However, allowance has to be made for the character in transmission and reception.
To put it together:
Other than the digits themselves, only two (optional) chars are allowed in a numeric field: the decimal separator and a minus sign (no blanks are permitted between any of the characters). These two extra chars are not counted against the maximum length of the value in the field.
So the maximum number of characters in a numeric field is the maximum length of the numeric field plus 2. If you want your database to be able to store every syntactically correct value transmitted in a field specified as n..17, your column would have to be 19 chars long (something like varchar(19)). Any EDIFACT message that has a value longer than 19 chars in a field specified as n..17 does not need to be stored in the DB for semantic checking, because it is already syntactically wrong and can be rejected.
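As a quick sanity check, here is a hypothetical helper (Python, not from the standard itself) that computes the required column width and validates a candidate value against an n..max format under the rules above:

import re

def column_width(max_digits):
    # Digits count against the field length; the decimal mark and the
    # minus sign do not, but both must still fit in the database column.
    return max_digits + 2

def is_valid(value, max_digits):
    # Optional minus sign, then digits with at most one decimal mark
    # (comma or point); no blanks anywhere.
    if not re.fullmatch(r"-?\d+([.,]\d+)?", value):
        return False
    return sum(c.isdigit() for c in value) <= max_digits

print(column_width(17))        # 19 -> e.g. varchar(19)
print(is_valid("-112", 5))     # True: the sign is not counted
print(is_valid("1234,56", 5))  # False: six digits exceed n..5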
I used EDI Notepad from Liaison to solve a similar challenge. https://liaison.com/products/integrate/edi/edi-notepad
I recommend that anyone looking at EDI at least get their free (express) version of EDI Notepad.
The "high end" version of their product (EDI Notepad Productivity Suite) comes with a "Dictionary Viewer" tool from which you can export the min/max lengths of the elements, as well as their types. You can export the document to HTML from the Viewer tool. It handles ANSI X12 too.

Reading and sorting a variable-length CSV file

We are using an OpenVMS system, and I believe it is running HP COBOL.
We have a data file with a lot of records (500 MB or more) of variable length. The records are comma-delimited. I would like to parse each record and extract the corresponding fields for processing. After that, I might want to sort by some particular fields. Is this possible with COBOL?
I've seen sorting with fixed-length records only.
Variable length is no problem. I'm not sure exactly how this is done in VMS COBOL, but the IBMese for this is:
FILE SECTION.
FD THE-FILE RECORD IS VARYING DEPENDING ON REC-LENGTH.
01 THE-RECORD PICTURE X(5000).
WORKING-STORAGE SECTION.
01 REC-LENGTH PICTURE 9(5) COMPUTATIONAL.
When you read the file, REC-LENGTH will contain the record length; when you write a record, it will write a record of length REC-LENGTH.
To handle the delimited record files you will probably need to use the "UNSTRING" verb to convert into a fixed format. This is pretty verbose (but then this is COBOL).
UNSTRING THE-RECORD DELIMITED BY ","
    INTO FIELD1, FIELD2, FIELD3, FIELD4, FIELD5 etc....
END-UNSTRING
Once the record is in fixed format you can use the SORT as normal.
The Cobol SORT verb will do what you need.
If the SD file contains variable-length records, all of the KEY data items must be contained within the first n character positions of the record, where n equals the minimum record size specified for the file. In other words, they have to be in the fixed part.
However, you can get around this easily by using an input procedure. This lets you create a virtual file that has its keys in the right place. In your input procedure, you reformat your variable-length, comma-delimited record into one that has its keys at the front, then RELEASE it to the sort, as in the sketch below.
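Putting the pieces together, here is a rough, hypothetical sketch (file names, record sizes, and the choice of the second field as the sort key are made up, and details will vary by compiler):

IDENTIFICATION DIVISION.
PROGRAM-ID. CSVSORT.
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
    SELECT IN-FILE   ASSIGN TO "INPUT.CSV".
    SELECT OUT-FILE  ASSIGN TO "SORTED.CSV".
    SELECT SORT-FILE ASSIGN TO "SORTWORK".
DATA DIVISION.
FILE SECTION.
FD IN-FILE RECORD IS VARYING DEPENDING ON REC-LENGTH.
01 IN-RECORD PICTURE X(5000).
FD OUT-FILE.
01 OUT-RECORD PICTURE X(5030).
SD SORT-FILE.
01 SORT-RECORD.
   05 SORT-KEY  PICTURE X(30).
   05 SORT-DATA PICTURE X(5000).
WORKING-STORAGE SECTION.
01 REC-LENGTH PICTURE 9(5) COMPUTATIONAL.
01 FIELD1 PICTURE X(30).
01 FIELD2 PICTURE X(30).
01 EOF-FLAG PICTURE X VALUE "N".
PROCEDURE DIVISION.
MAIN SECTION.
MAIN-P.
    SORT SORT-FILE ON ASCENDING KEY SORT-KEY
        INPUT PROCEDURE IS LOAD-RECORDS
        GIVING OUT-FILE.
    STOP RUN.
LOAD-RECORDS SECTION.
LR-MAIN.
    OPEN INPUT IN-FILE.
    PERFORM UNTIL EOF-FLAG = "Y"
        READ IN-FILE
            AT END MOVE "Y" TO EOF-FLAG
            NOT AT END
                UNSTRING IN-RECORD DELIMITED BY ","
                    INTO FIELD1, FIELD2
                END-UNSTRING
                MOVE FIELD2 TO SORT-KEY
                MOVE IN-RECORD TO SORT-DATA
                RELEASE SORT-RECORD
        END-READ
    END-PERFORM.
    CLOSE IN-FILE.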
If my memory is correct, VMS has a SORT/MERGE utility that you could use after you have processed the file into a fixed format (variable may also be possible). Typically a standalone SORT utility performs better than an in-line COBOL SORT, and it can be the better design if the sort criteria change in the future.
No need to write a solution in COBOL, at least not to sort the file. The UNIX sort utility should do it just fine; just call sort -t ',' -n with maybe a couple of other options.
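For example, assuming the sort key is the third field and is a plain number:
sort -t ',' -k3,3n input.csv > sorted.csv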
