Character limit in string variables - spss

When I export my data from a .dta file to .sps, my string variables get cut off and the data contains strange characters, which looks like a UTF-8 problem. I think the problem might be that some string variables have a width over 261; at least they are cut off at that point.
Does SPSS have a character limit, and if so, how can I increase the number?

"The values of string variables can contain numbers, letters, and special characters and can be up to 32,767 bytes. "
You seem to be trying to export STATA data (.dta) to SPSS Statistics (.sav). What mechanism are you using in STATA to do this? Does STATA have a limitation on the width of string fields?
As horace_vr has already pointed out, an SPSS Statistics command file has the *.sps extension. Are you truly trying to have STATA write SPSS Statistics Command Syntax for you and save as *.sps? Perhaps you meant you were exporting from *.dta (in STATA) to *.sav (in SPSS Statistics).
Note also that if you have created an SPSS Statistics data file (*.sav) in code page mode, but then open it while SPSS Statistics is in Unicode mode, your string widths will triple in size. This is an artifact of the conversion of the various code pages to Unicode.
"When code page data files are read in Unicode mode, the defined width of all string variables is tripled. You can use ALTER TYPE to automatically adjust the width of all string variables."
I hope this helps
-ddwyer
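Since the quoted limit is stated in bytes rather than characters, the byte width a string needs depends on its encoding. A minimal sketch, assuming the data ends up UTF-8 encoded (the helper name `utf8_width` is hypothetical and not part of SPSS or Stata), shows how the character count and the byte width can diverge:

```python
def utf8_width(text: str) -> int:
    """Return the number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


# Compare the character count with the UTF-8 byte width.
for sample in ["abc", "Müller", "日本語"]:
    print(sample, len(sample), utf8_width(sample))
```

Non-ASCII characters take two or more bytes each, so a width that was sufficient in a single-byte code page can be too small after conversion, which would be consistent with values being cut off.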

Related

Why flang Fortran print adds a new line at a certain width? [duplicate]

I want to write the following output to a txt file using f77:
14 76900.56273 0.000077 -100000 1000000000 -0.769006
I use:
write(6,*) KINC, BM, R2, AF, BK, BM/AF
without any format (which works well in terms of decimal digits). However in my txt file the output is written as:
14 76900.56273 0.000077 -100000
1000000000 -0.769006
I think this is because there is a fixed column width limit by default. I don't know if it is possible to change this so that I can just copy and paste the output into Excel.
I've looked at FORTRAN 77 Language Reference but I haven't found a way to do it. Any ideas? Thanks
Use an explicit format, or check your compiler's options.
If your compiler is one of DEC/Compaq/Intel, read this link:
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-win/hh_goto.htm#GUID-C6A40AAC-81D8-4DD8-A792-62792B3AC213.htm#GUID-C6A40AAC-81D8-4DD8-A792-62792B3AC213
List-directed output (fmt=*) has an 80-column limit by default:
"There is a property of list-directed sequential WRITE statements called the right margin. If you do not specify RECL as an OPEN statement specifier or in environmental variable FORT_FMT_RECL, the right margin value defaults to 80. When RECL is specified, the right margin is set to the value of RECL. If the length of a list-directed sequential WRITE exceeds the value of the right margin value, the remaining characters will wrap to the next line. Therefore, writing 100 characters will produce two lines of output, and writing 180 characters will produce three lines of output."
In Intel's manual, blue color indicates extensions to the Fortran Standards. These extensions (non-standard features) may or may not be implemented by other compilers that conform to the language standard.
Oracle (Sun) F77:
http://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnbu/index.html#z400074369ac
"Output lines longer than 80 characters are avoided where possible"
With the asterisk as the format, you are using list-directed I/O. This is intended as a convenience. It gives the programmer minimal control, with few restrictions on the compiler and incomplete portability. The compiler is free to determine aspects such as line length. If you want control over line length, switch to using an actual format.
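The arithmetic in the quoted Intel description can be checked with a few lines of Python. This is only a simplified model of that description, with a hypothetical helper name `lines_needed`; a real compiler breaks between output items, so the exact break positions will differ:

```python
import math


def lines_needed(record_length: int, right_margin: int = 80) -> int:
    """Lines produced when a record wraps at right_margin characters."""
    if record_length <= 0:
        return 1  # assumption: an empty record still occupies one line
    return math.ceil(record_length / right_margin)


print(lines_needed(100))       # default margin of 80
print(lines_needed(180))
print(lines_needed(100, 132))  # a wider margin, e.g. set through RECL
```

With the default margin, 100 characters need two lines and 180 need three, matching the quote; a margin wider than the record keeps everything on one line.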
P.S. Why use FORTRAN 77? Fortran 90/95/2003 is easier to use and more powerful. gfortran is an open-source compiler.

Problem reading variables containing mix of numbers and strings

I am reading an Excel file (see syntax below) where some of the fields are text mixed with numbers. The problem is that SPSS reads some of these fields as numeric instead of string and then the text is deleted.
I assume this happens in cases where a large part of the first rows are empty or with a numeric value and then it defines the variable as numeric.
How can this be avoided?
GET DATA
/TYPE=XLSX
/FILE='M:\MyData.xlsx'
/SHEET=name 'Sheet1'
/CELLRANGE=FULL
/READNAMES=ON
/DATATYPEMIN PERCENTAGE=95.0
/HIDDEN IGNORE=YES.
When you use the GET DATA command, the subcommand /DATATYPEMIN PERCENTAGE=95.0 tells SPSS that it is still acceptable if up to 5% of the values in a field do not conform to the selected format. To avoid cases where a field with only a few text values is read as numeric, change the subcommand to:
/DATATYPEMIN PERCENTAGE=100
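The effect of the threshold can be illustrated with a small Python model. This is a sketch of the rule as described above, not SPSS's actual implementation; the function names and the handling of empty cells are assumptions made for illustration:

```python
def is_number(value) -> bool:
    """True if value can be parsed as a number."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def infer_type(values, min_percentage: float = 95.0) -> str:
    """Classify a column as 'numeric' or 'string' by share of numeric values."""
    # Assumption: empty cells are ignored when computing the share.
    non_empty = [v for v in values if v not in (None, "")]
    if not non_empty:
        return "numeric"  # arbitrary choice for an all-empty column
    numeric_share = 100.0 * sum(is_number(v) for v in non_empty) / len(non_empty)
    return "numeric" if numeric_share >= min_percentage else "string"


# 97 numeric values and 3 text values: 97% of the column is numeric.
column = ["1"] * 97 + ["A12", "B7", "n/a"]
print(infer_type(column, 95.0))   # numeric: the 3 text values would be lost
print(infer_type(column, 100.0))  # string: every value is kept
```

Under this model, any single text value is enough to make the column a string once the threshold is 100.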

tFuzzyMatch apparently not working on Arabic text strings

I have created a job in Talend Open Studio for Data Integration v5.5.1.
I am trying to find matches between two customer name columns; one is a lookup and the other contains dirty data.
The job runs as expected when the customer names are in English. However, for Arabic names, only exact matches are found regardless of the underlying match algorithm I used (Levenshtein, Metaphone, Double Metaphone), even with loose bounds for the Levenshtein algorithm (min 1, max 50).
I suspect this has to do with character encoding. How should I proceed? Is there any way I can operate using the Unicode or even UTF-8 interpretation in Talend?
I am using Excel data sources through tFileInputExcel.
I got it resolved by moving the data to MySQL with a UTF-8 collation. Somehow the Excel input wasn't preserving the collation.

How many characters long can numeric EDIFACT data elements be?

In EDIFACT there are numeric data elements, specified e.g. as format n..5 -- we want to store those fields in a database table (with alphanumeric fields, so we can check them). How long must the db-fields be, so we can for sure store every possible valid value? I know it's at least two additional chars (for decimal point (or comma or whatever) and possibly a leading minus sign).
We are building our tables after the UN/EDIFACT standard we use in our message, not the specific guide involved, so we want to be able to store everything matching that standard. But documentation on the numeric data elements isn't really straightforward (or at least I could not find that part).
Thanks for any help
I finally found the information on the UNECE web site in the documentation on UN/EDIFACT rules, Part 4: UN/EDIFACT rules, Chapter 2.2: Syntax Rules. They don't say it directly, but when you put all the parts together, you get it. See TOC entry 10: REPRESENTATION OF NUMERIC DATA ELEMENT VALUES.
Here's what it basically says:
10.1: Decimal Mark
The decimal mark must be transmitted (if needed) as specified in UNA (comma or point, but always one character). It shall not be counted as a character of the value when computing the maximum field length of a data element.
10.2: Triad Separator
Triad separators shall not be used in interchange.
10.3: Sign
[...] If a value is to be indicated to be negative, it shall in transmission be immediately preceded by a minus sign e.g. -112. The minus sign shall not be counted as a character of the value when computing the maximum field length of a data element. However, allowance has to be made for the character in transmission and reception.
To put it together:
Other than the digits themselves there are only two (optional) chars allowed in a numeric field: the decimal separator and a minus sign (no blanks are permitted in between any of the characters). These two extra chars are not counted against the maximum length of the value in the field.
So the maximum number of characters in a numeric field is the maximum length of the numeric field plus 2. If you want your database to be able to store every syntactically correct value transmitted in a field specified as n..17, your column would have to be 19 chars long (something like varchar(19)). Every EDIFACT message that has a value longer than 19 chars in a field specified as n..17 does not need to be stored in the DB for semantic checking, because it is already syntactically wrong and can be rejected.
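The rule summarized above can be sketched as a small Python check. The function names are hypothetical, and the sketch assumes the decimal mark must have at least one digit on each side; only the digits count against the n..N limit:

```python
import re


def column_width(max_digits: int) -> int:
    """Characters needed to store any valid n..max_digits value."""
    return max_digits + 2  # room for the minus sign and the decimal mark


def is_valid_numeric(value: str, max_digits: int, decimal_mark: str = ".") -> bool:
    """Check value against an n..max_digits numeric data element."""
    # Optional minus sign, digits, optional decimal mark followed by digits.
    # Assumption: digits are required on both sides of the decimal mark.
    pattern = r"-?[0-9]+(?:" + re.escape(decimal_mark) + r"[0-9]+)?"
    if re.fullmatch(pattern, value) is None:
        return False
    # The sign and the decimal mark are not counted against the limit.
    digit_count = sum(ch.isdigit() for ch in value)
    return digit_count <= max_digits


print(column_width(17))                    # 19
print(is_valid_numeric("-1234.5", 5))      # True: 5 digits, 7 characters
print(is_valid_numeric("-12345.6", 5))     # False: 6 digits
```

A value that passes this check for n..N never exceeds `column_width(N)` characters, so it always fits in the column.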
I used EDI Notepad from Liaison to solve a similar challenge. https://liaison.com/products/integrate/edi/edi-notepad
I recommend anyone looking at EDI to at least get their free (express) version of EDI Notepad.
The "high end" version of their product (EDI Notepad Productivity Suite) comes with a "Dictionary Viewer" tool from which you can export the min / max lengths of the elements, as well as their types. You can export the document to HTML from the Viewer tool. It also handles ANSI X12.

Mixing ASCII and Binary for record delimiters

My requirements are to write binary records inside a file. The binary records can be thought of as raw bytes in memory. I need a way to delimit each record, so that I can do something similar to binary search on the file. For example, start in the middle of the file, find the next record delimiter and start the search.
My question is: can an ASCII string such as "START-RECORD" be used to delimit the binary records?
START-RECORD, data-length, .......binary data...........START-RECORD, data-length, .......binary data...........
When starting from an arbitrary position within a file, I can simply search for the ASCII string "START-RECORD". Is this approach feasible?
Not in a single pass, since you either read the file in binary mode or you don't. If you insert a string or another pattern as a "delimiter", you need to search for its binary representation while reading the file.
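Searching for the binary representation of the marker could look like the following Python sketch. It works on an in-memory byte string for brevity, and the layout (marker, then a 4-byte big-endian length, then the payload) and the function names are assumptions chosen for illustration, not something specified in the question:

```python
import struct

MARKER = b"START-RECORD"  # the ASCII delimiter as raw bytes


def pack_record(payload: bytes) -> bytes:
    """Serialize one record as marker + 4-byte length + payload."""
    return MARKER + struct.pack(">I", len(payload)) + payload


def next_record(buffer: bytes, start: int):
    """Find the first record at or after start; return (offset, payload)."""
    pos = buffer.find(MARKER, start)
    if pos == -1:
        return None
    length_at = pos + len(MARKER)
    if length_at + 4 > len(buffer):
        return None  # truncated header
    (length,) = struct.unpack_from(">I", buffer, length_at)
    data_at = length_at + 4
    return pos, buffer[data_at:data_at + length]


data = pack_record(b"\x00\x01\x02") + pack_record(b"\xff" * 10) + pack_record(b"tail")
# Jump to the middle and resynchronize on the next marker.
print(next_record(data, len(data) // 2))
```

One caveat: the payload itself may happen to contain the marker bytes, so a robust format would need escaping or a check (for example, validating the length field or a checksum) before trusting a match.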

Resources