Reading and sorting a variable-length CSV file - COBOL

We are using an OpenVMS system, and I believe it is running HP COBOL.
We have a data file of many records (500 MB or more) with variable-length, comma-delimited records. I would like to parse each record and extract the corresponding fields for processing. After that, I might want to sort by particular fields. Is this possible in COBOL?
I've seen sorting with fixed-length records only.

Variable length is no problem. I'm not sure exactly how this is done in VMS COBOL, but the IBMese for it is:
FILE SECTION.
FD THE-FILE RECORD IS VARYING DEPENDING ON REC-LENGTH.
01 THE-RECORD PICTURE X(5000).
WORKING-STORAGE SECTION.
01 REC-LENGTH PICTURE 9(5) COMPUTATIONAL.
When you read the file, REC-LENGTH will contain the record length; when you write a record, it will write a record of length REC-LENGTH.
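For example, a minimal read loop (a sketch only; EOF-FLAG is an assumed PIC X working-storage flag, not part of the original answer):
PERFORM UNTIL EOF-FLAG = "Y"
    READ THE-FILE
        AT END
            MOVE "Y" TO EOF-FLAG
        NOT AT END
            DISPLAY "record length: " REC-LENGTH
    END-READ
END-PERFORM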
To handle the delimited records you will probably need to use the "UNSTRING" verb to convert them into a fixed format. This is pretty verbose (but then, this is COBOL).
UNSTRING record DELIMITED BY ","
INTO field1, field2, field3, field4, field5 etc....
END-UNSTRING
Once the record is in fixed format you can use the SORT as normal.

The COBOL SORT verb will do what you need.
If the SD file contains variable-length records, all of the KEY data-items must be contained within the first n character positions of the record, where n equals the minimum record size specified for the file. In other words, they have to be in the fixed part.
However, you can get around this easily by using an input procedure. This lets you create a virtual file that has its keys in the right place: in the input procedure you reformat each variable-length, comma-delimited record into one that has its keys at the front, then RELEASE it to the sort.
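A sketch of that approach (all names here are illustrative, the key is assumed to be the second comma-delimited field, the SELECT/ASSIGN entries and FIELD-n definitions are omitted, GIVING assumes an FD for the sorted output, and some compilers require the input procedure to be a SECTION):
SD  SORT-FILE.
01  SORT-REC.
    05  SORT-KEY   PIC X(20).
    05  SORT-DATA  PIC X(5000).
...
PROCEDURE DIVISION.
MAIN-PARA.
    SORT SORT-FILE
        ON ASCENDING KEY SORT-KEY
        INPUT PROCEDURE IS REFORMAT-INPUT
        GIVING SORTED-FILE.
    STOP RUN.
REFORMAT-INPUT.
    OPEN INPUT THE-FILE.
    PERFORM UNTIL EOF-FLAG = "Y"
        READ THE-FILE
            AT END MOVE "Y" TO EOF-FLAG
            NOT AT END
                UNSTRING THE-RECORD DELIMITED BY ","
                    INTO FIELD-1 FIELD-2 FIELD-3
                END-UNSTRING
                MOVE FIELD-2 TO SORT-KEY
                MOVE THE-RECORD TO SORT-DATA
                RELEASE SORT-REC
        END-READ
    END-PERFORM.
    CLOSE THE-FILE.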

If my memory is correct, VMS has a SORT/MERGE utility that you could use after you have processed the file into a fixed format (variable length may also be possible). Typically a standalone SORT utility performs better than an in-line COBOL SORT, and it can be a better design if the sort criteria change in the future.
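At the DCL prompt that would look something like this (a sketch; the key position and size are assumptions, so check HELP SORT on your system for the exact qualifiers):
$ SORT /KEY=(POSITION:1, SIZE:20) INFILE.DAT OUTFILE.DAT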

No need to write a solution in COBOL, at least not to sort the file. The UNIX sort utility should do it just fine: call sort -t ',' -n with maybe a couple of other options.
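For example, to sort numerically on the second comma-delimited field (the field number and file names here are just placeholders):
sort -t ',' -k 2,2n input.csv > sorted.csv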

Related

Problem reading variables containing mix of numbers and strings

I am reading an Excel file (see syntax below) where some of the fields are text mixed with numbers. The problem is that SPSS reads some of these fields as numeric instead of string and then the text is deleted.
I assume this happens in cases where a large part of the first rows are empty or contain a numeric value, so the variable is defined as numeric.
How can this be avoided?
GET DATA
/TYPE=XLSX
/FILE='M:\MyData.xlsx'
/SHEET=name 'Sheet1'
/CELLRANGE=FULL
/READNAMES=ON
/DATATYPEMIN PERCENTAGE=95.0
/HIDDEN IGNORE=YES.
When you use the GET DATA command, the subcommand /DATATYPEMIN PERCENTAGE=95.0 tells SPSS to assign a field the detected format (numeric, for example) as long as at least 95% of its values conform to it; up to 5% non-conforming values are still OK. So, to avoid cases where a field is read as numeric because only a few of its values are text, correct the subcommand to:
/DATATYPEMIN PERCENTAGE=100

tFuzzyMatch apparently not working on Arabic text strings

I have created a job in Talend Open Studio for Data Integration v5.5.1.
I am trying to find matches between two customer-name columns; one is a lookup and the other contains dirty data.
The job runs as expected when the customer names are in English. However, for Arabic names, only exact matches are found regardless of the underlying match algorithm I used (Levenshtein, Metaphone, Double Metaphone), even with loose bounds for the Levenshtein algorithm (min 1, max 50).
I suspect this has to do with character encoding. How should I proceed? Is there any way I can operate on the Unicode or even UTF-8 interpretation in Talend?
I am using Excel data sources through tFileInputExcel.
I got it resolved by moving the data to MySQL with a UTF-8 collation. Somehow the Excel input wasn't preserving the encoding.
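For example, a staging table with an explicit UTF-8 character set (a sketch; the table and column names are made up, and on older MySQL versions you may need utf8 rather than utf8mb4):
CREATE TABLE customer_names (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;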

Mixing ASCII and Binary for record delimiters

My requirements are to write binary records inside a file. The binary records can be thought of as raw bytes in memory. I need a way to delimit each record so that I can do something similar to a binary search on the file: for example, start in the middle of the file, find the next record delimiter, and start the search from there.
My question is: can an ASCII marker such as "START-RECORD" be used to delimit the binary records?
START-RECORD, data-length, .......binary data...........START-RECORD, data-length, .......binary data...........
When starting from an arbitrary position within the file, I can simply search for the ASCII string "START-RECORD". Is this approach feasible?
Whether you're reading in binary mode or not, if you insert a string or some other pattern as a "delimiter", you'll need to search for its binary representation while reading the file. Note that the same byte sequence can also occur inside the binary payload, so a match is only a candidate boundary until you validate it (for example, against the data-length field).
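A sketch of that scan in C (illustrative, not from the original exchange; a real implementation would read the file in chunks rather than assume one in-memory buffer):
#include <string.h>

static const char MARKER[] = "START-RECORD";

/* Scan buf[from .. len) for the marker bytes; returns the offset of a
   match, or -1 if none. A hit is only a candidate record boundary:
   the payload can contain the same bytes, so validate against the
   data-length field before trusting it. */
long find_marker(const unsigned char *buf, long len, long from)
{
    long mlen = (long)(sizeof MARKER - 1);
    for (long i = from; i + mlen <= len; i++) {
        if (memcmp(buf + i, MARKER, (size_t)mlen) == 0)
            return i;
    }
    return -1;
}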

Stored procedure parameter data type that allows alphanumeric values

I have a stored procedure for SQL Server 2000 that has an input parameter with a data type of varchar(17) to handle a vehicle identifier (VIN), which is alphanumeric. However, whenever I execute it and enter a parameter value that has a numerical digit in it, I get an error. It appears to accept only alphabetic characters. What am I doing wrong here?
Based on comments, there is a subtle "feature" of SQL Server that allows letters a-z to be used as stored proc parameters without delimiters. It's been there forever (since 6.5 at least).
I'm not sure of the full rules, but it's demonstrated in MSDN (rename SQL Server etc.): there are no delimiters around the "local" parameter. And I just found this KB article on it.
In this case, it could be the leading digit that breaks it. I assume it works for a contained digit (but as I said, I'm not sure of the full rules).
Edit: confirmed by Martin as "breaks with a leading number", OK for "containing a number".
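A quick way to see this behaviour (a hypothetical procedure for illustration, not the asker's actual code):
CREATE PROCEDURE GetVehicle @vin varchar(17) AS SELECT @vin
GO
EXEC GetVehicle JH4DA9350LS000000    -- unquoted, leading letter: works
EXEC GetVehicle 1HGCM82633A004352    -- unquoted, leading digit: syntax error
EXEC GetVehicle '1HGCM82633A004352'  -- delimited: works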
This doesn't help much, but somewhere you have a bug, typo, or oversight in your code. I spent 2+ years working with VINs as parameters, and other than regretting not having made it char(17) instead of varchar(17), we never had any problems passing in alphanumeric VIN values. Somewhere, and I'd guess it's in the application layer, something is not liking digits -- perhaps a filter looking for only alphabetical characters?

Adding a field to an existing COBOL data file

I have an existing MF COBOL 4.0 program with years of data in an ISAM file, but I need to add a new field to the existing file. The record currently has 1208 characters and I need to add another 10 to it.
If I simply put the extra PIC X(10) field in my copybook, it gives me an error.
You need to modify the underlying data file to match your file definition in COBOL. One way to do so is to define a line of output exactly like your data lines look now, but with an extra PIC X(10) on the end. You would then read in your data line by line and output it to a new location with 10 extra spaces on the end. That way your data is 10 characters longer, and you can go back and add the extra PIC X(10) to your main program. It should work after that.
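A minimal sketch of such a conversion program (the file, record, and flag names are made up; for a real ISAM target you would keep your existing SELECT ... ORGANIZATION INDEXED clauses):
FD  OLD-FILE.
01  OLD-REC         PIC X(1208).
FD  NEW-FILE.
01  NEW-REC.
    05  NEW-BODY    PIC X(1208).
    05  NEW-FIELD   PIC X(10).
...
CONVERT-FILE.
    OPEN INPUT OLD-FILE
         OUTPUT NEW-FILE.
    PERFORM UNTIL EOF-FLAG = "Y"
        READ OLD-FILE
            AT END MOVE "Y" TO EOF-FLAG
            NOT AT END
                MOVE OLD-REC TO NEW-BODY
                MOVE SPACES  TO NEW-FIELD
                WRITE NEW-REC
        END-READ
    END-PERFORM.
    CLOSE OLD-FILE NEW-FILE.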
By changing the copybook, you're only changing the representation of the data used in your program. Shouldn't you be restructuring the data source (i.e. the ISAM file) as well?
Late answer, but I thought you might be interested.
I've been working on our Cobol system for over 20 years and we've come across this issue many times.
Changes to the structure of our index files are what we consider a "Major Release". These require specific Conversion programs which:
Rename the physical file, moving it aside to an 'old' file
Open the 'old' version of the file (using a version of the copybook before the change)
Open (create) the 'new' version of the file
Move the contents of each 'old' record to a 'new' record and WRITE it
Of course, these conversions require the system to be 'down', which is why they are considered major releases.
If you have files which are likely to have fields added to them in the future, you can add extra FILLER to the end of the record to let you cope with new fields being added. We tend to add a FILLER of 50 or 100 bytes. Of course, this doesn't help you if you change one of the existing fields, or the structure of any of the keys.
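For example (an illustrative layout, not the poster's actual copybook):
01  CUSTOMER-REC.
    05  CUST-KEY    PIC X(10).
    05  CUST-NAME   PIC X(30).
    05  FILLER      PIC X(50).
Future fields can then be carved out of the trailing FILLER without changing the record length or the keys.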
For file errors, you will want to keep a list handy. I recommend starting with a list you find online, and any time you get an error you cannot figure out in 5 seconds, add a detailed explanation of the resolution so you will have it in your notes the next time it happens. Here are a couple of decent lists to start with:
http://www.simotime.com/vsmfsk01.htm
http://www.briarcliff.edu/departments/cis/Cobol/Error%20Codes.html
In my list, file status 39 is:
OPEN-CONFLICT-FILE-ATR - The OPEN statement was unsuccessful because a conflict was detected between the fixed file attributes and the attributes specified for that file in the program. These attributes include the organization of the file (sequential, relative, or indexed), the prime record key, the alternate record keys, the code set, the maximum record size, and the record type (fixed or variable).
And this is from the personalized note: check the file that you have assigned to your ddname in your JCL, especially the length allocation. In your case, you know that the length does not match, since you just changed the program.
There are utilities to reformat datasets, particularly SYNCSORT. Or of course you can write your own.
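For example, a DFSORT/SYNCSORT-style control statement that copies the file and appends 10 blanks to each record (a sketch; check your sort product's manual for the exact syntax):
  SORT FIELDS=COPY
  OUTREC FIELDS=(1,1208,10X)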
