GNUCobol compiled program counts one more record than expected

I'm learning COBOL programming and using GNUCobol (on Linux) to compile and test some simple programs. In one of those programs I have found an unexpected behavior that I don't understand: when reading a sequential file of records, I'm always getting one extra record and, when writing these records to a report, the last record is duplicated.
I have made a very simple program to reproduce this behavior. In this case, I have a text file with a single line of text: "0123456789". The program should count the characters in the file (that is, records 1 character long) and I expect it to display "10" as a result, but instead I get "11".
Also, when displaying the records, as they are read, I get the following output:
0
1
2
3
4
5
6
7
8
9
11
(There are two blank spaces between 9 and 11).
This is the relevant part of this program:
FD  SIMPLE.
01  SIMPLE-RECORD.
    05  SMP-NUMBER PIC 9(1).
[...]
PROCEDURE DIVISION.
000-COUNT-RECORDS.
    OPEN INPUT SIMPLE.
    PERFORM UNTIL SIMPLE-EOF
        READ SIMPLE
            AT END
                SET SIMPLE-EOF TO TRUE
            NOT AT END
                DISPLAY SMP-NUMBER
                ADD 1 TO RECORD-COUNT
        END-READ
    END-PERFORM
    DISPLAY RECORD-COUNT.
    CLOSE SIMPLE.
    STOP RUN.
I'm using the default options for the compiler, and I have tried using 'WITH TEST {BEFORE|AFTER}' but the result is the same. What can be the cause of this behavior or how can I get the expected result?
Edit: I tried using an "empty" file as data source, expecting a 0 record count, using two different methods to empty the file:
$ echo "" > SIMPLE
This way the record count is 1 (ls -l gives a size of 1 byte for the file).
$ rm SIMPLE
$ touch SIMPLE
This way the record count is 0 (ls -l gives a size of 0 bytes for the file). So I guess that somehow the compiled program is detecting an extra character, but I don't know how to avoid this.

I found out that the cause of this behavior is the automatic newline character that vim seems to append when saving the data file.
After disabling this in vim this way
:set binary
:set noeol
the program works as expected.
Edit: A more elegant way to prevent this problem, when working with data files created from a text editor, is using ORGANIZATION IS LINE SEQUENTIAL in the SELECT clause.
Since the problem was caused by the data format, should I delete this question?
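The effect is easy to reproduce outside COBOL. The following Python sketch is only an illustration (the helper name `count_one_byte_records` is made up here and is not part of GnuCOBOL). It treats every byte as a one-character record, which is how the program above reads its file, so a trailing newline counts as an eleventh record.

```python
def count_one_byte_records(data: bytes) -> int:
    """Count records when every single byte is treated as one record."""
    return len(data)


# File as saved by an editor that appends a final newline.
saved_by_editor = b"0123456789\n"
# Same content saved without the final newline (e.g. vim with binary + noeol).
saved_without_eol = b"0123456789"

print(count_one_byte_records(saved_by_editor))    # 11
print(count_one_byte_records(saved_without_eol))  # 10
print(count_one_byte_records(b"\n"))              # 1, like: echo "" > SIMPLE
print(count_one_byte_records(b""))                # 0, like: touch SIMPLE
```

The last two lines match the "empty file" experiment in the question: `echo ""` writes a single newline byte, while `touch` creates a file with no bytes at all.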

Related

Pipe character ignored in SPSS syntax

I am trying to use the pipe character "|" in SPSS syntax with strange results:
In the syntax it appears like this:
But when I copy this line from the syntax window to here, this is what I get:
SELECT IF(SEX = 1 SEX = 2).
The pipe just disappears!
If I run this line, this is the output:
SELECT IF(SEX = 1 SEX = 2).
Error # 4007 in column 20. Text: SEX
The expression is incomplete. Check for missing operands, invalid operators,
unmatched parentheses or excessive string length.
Execution of this command stops.
So the pipe is invisible to the program too!
When I save this syntax and reopen it, the pipe is gone...
The only way I found to get SPSS to work with the pipe is when I edited the syntax (adding the pipe) and saved it in an alternative editor (notepad++ in this case). Now, without opening the syntax, I ran it from another syntax using insert command, and it worked.
EDIT: some background info:
I have spss version 23 (+service pack 3) 64 bit.
The same thing happens whether I use my locale (encoding: windows-1255) or Unicode (encoding: UTF-8). Suspecting my Hebrew keyboard, I tried copying syntax from the web, with the same results.
Can anyone shed any light on this subject?
Turns out (according to SPSS support) that's a version specific (ver. 21) bug and was fixed in later versions.

Read a single number from a text file and advance stream position in Julia

I understand that Julia has a complete set of low-level tools for interfacing with binary files on one hand, and some powerful utilities such as readdlm to load text files containing rectangular data into Array structures on the other hand.
What I cannot discover in the standard library docs, however, is how to easily get input from less structured text files. In particular, what would be the Julia equivalent of the c++ idiom
some_input_stream >> a_variable_int_perhaps;
Given this is such a common usage scenario I am surprised something like this does not feature prominently in the standard library...
You can use readuntil http://docs.julialang.org/en/latest/stdlib/io-network/#Base.readuntil
shell> cat test.txt
1 2 3 4
julia> i,j = open("test.txt") do f
parse(Int, readuntil(f," ")), parse(Int, readuntil(f," "))
end
(1,2)
EDIT: To address comments
To get the last integer on a line of an irregularly formatted ASCII file, you could use split if you know the character preceding the integer (I've used a blank space here):
shell> cat test.txt
1.0, two five:$#!() + 4
last line 3
julia> i = open("test.txt") do f
parse(Int, split(readline(f), " ")[end])
end
4
As far as code length is concerned, the above examples are completely self contained and the file is opened and closed in an exception safe manner (i.e. wrapped in a try-finally block). To do the same in C++ would be quite verbose.

parsing a text file where each record spans more than 1 line

I need to parse a text file that contains hundreds of records that span more than 1 line each. I'm new to Python and have been trying to do this with grep and awk in several complex ways but no luck yet.
The file contains records that look like this:
409547095517 911033 00:47:41 C44 00:47:46 D44 00:47:53 00:47:55
(555) 555-1212 00:47 10/31 100 Main Street - NW
Some_City TX 323 WRLS METRO PCS
P# 122-5217 ALT# 555-555-1212 LEC:MPCSI WIRELESS CALL Q
UERY CALLER FOR LOCATION QUERY CALLER FOR PHONE #*
Really, I could do all I need if I could just get these multi-line records condensed to 1 line per record. Each record will always begin with "40", or I could let 9110 indicate the start, as these will always be there and are unique provided 40 is at the beginning of the line. I used a hex editor and found that I could remove all line breaks (hex 0D0A), but this is no better than manually editing the files, and programmatically I'd need to not remove the last one per record. Some records will be only 2 lines but most will be 5 like this one.
Is there a way, in Python or otherwise, to concatenate the lines that make up a record into one line, where 40 (or maybe better, 9110) indicates the start of the record?
Any ideas or pointers will be much appreciated. I've got python and a good IDE and I'm good with grep and find but learning awk (don't laugh)...
awk will do it. You need to identify the line that starts a record. In this case it is the one beginning with 409547095517.
So let's assume that to be safe if a line starts with 8 numbers it is the start of a record.
awk 'NR > 1 && /^[0-9]{8}/ { printf("\n") }
     { printf("%s", $0) }
     END { printf("\n") }' filename > newfilename
Change the {8} to any number that works for you.
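Since the question also asks about Python, the same record-joining idea can be sketched there. This is a minimal sketch under the same assumption as the awk script (a line starting with at least 8 digits begins a new record); the function name `join_records` and the shortened sample lines are made up for illustration.

```python
import re

# Assumption: a record starts on any line that begins with at least 8 digits.
RECORD_START = re.compile(r"^[0-9]{8}")


def join_records(lines):
    """Yield one string per record, concatenating its continuation lines."""
    current = []
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop the CR/LF line terminator
        if RECORD_START.match(line) and current:
            # A new record begins: emit the one collected so far.
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


sample = [
    "409547095517 911033 00:47:41 C44\r\n",
    "(555) 555-1212 00:47 10/31 100 Main Street - NW\r\n",
    "409547095518 911033 00:48:02 C45\r\n",
    "Some_City TX 323 WRLS METRO PCS\r\n",
]
for record in join_records(sample):
    print(record)
```

Like the awk version, this joins the lines of a record without inserting a separator. To read from a real file, pass the open file object to `join_records` instead of the sample list.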

How can we eliminate junk value in field?

I have some CSV records that vary in length, for example:
0005464560,45667759,ZAMTR,!To ACC 12345678,DR,79.85
0006786565,34567899,ZAMTR,!To ACC 26575443,DR,1000
I need to separate each of these fields, and I need the last field, which should be a money amount.
However, as I read the file and unstring the record into fields, I found that the last field contains junk at its end. The amount (money) field should be 8 characters: 5 digits at the front, 1 dot, 2 digits at the end. The values from the input could be any value such as 13.5, 1000 and 354.23.
"FILE SECTION"
FD INPUT_FILE.
01 INPUT_REC PIC X(66).
"WORKING STORAGE SECTION"
01 WS_INPUT_REC PIC X(66).
01 WS_AMOUNT_NUM PIC 9(5).9(2).
01 WS_AMOUNT_TXT PIC X(8).
"MAIN SECTION"
UNSTRING INPUT_REC DELIMITED BY ","
INTO WS_ID_1, WS_ID_2, WS_CODE, WS_DESCRIPTION, WS_FLAG, WS_AMOUNT_TXT
MOVE WS_AMOUNT_TXT(1:8) TO WS_AMOUNT_NUM(1:8)
DISPLAY WS_AMOUNT_NUM
In the display the values look normal (345.23, 1000), just as they are in the input. However, after I wrote the field to a file, this is what they became:
79.85^M^@^@
137.35^M^@
I have inspected the field WS_AMOUNT_NUM, which came from the field WS_AMOUNT_TXT, and found that ^@ is a kind of LOW-VALUE. However, I cannot find out what ^M is; it is not a space and not a HIGH-VALUE.
I am guessing, but it looks like you may be reading variable length records from a file into a fixed length COBOL record. The junk at the end of the COBOL record is giving you some grief. Hard to say how consistent that junk is going to be from one read to the next (data beyond the bounds of actual input record length are technically undefined). That junk ends up being included in WS_AMOUNT_TXT after the UNSTRING.
There are a number of ways to solve this problem. The suggestion I am giving you here may not
be optimal, but it is simple and should get the job done.
The last INTO field, WS_AMOUNT_TXT, in your UNSTRING statement is the one that receives all of the trailing
junk. That junk needs to be stripped off. Knowing that the only valid characters in the last field are
digits and the decimal character, you could clean it up as follows:
PERFORM VARYING WS_I FROM LENGTH OF WS_AMOUNT_TXT BY -1
        UNTIL WS_I = ZERO
    IF WS_AMOUNT_TXT(WS_I:1) IS NUMERIC OR
       WS_AMOUNT_TXT(WS_I:1) = '.'
        MOVE 1 TO WS_I
    ELSE
        MOVE SPACE TO WS_AMOUNT_TXT(WS_I:1)
    END-IF
END-PERFORM
The basic idea in the above code is to scan from the end of the last UNSTRING output field to the beginning, replacing anything that is not a valid digit or decimal point with a space. Once a valid digit or decimal point is found, exit the loop on the assumption that the rest will be valid. WS_I is set to 1 rather than zero at that point because the BY -1 step is applied before the UNTIL test, so the next test sees zero and the loop ends.
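For readers who want to check the logic of that backward scan outside COBOL, here is a rough Python equivalent. It is a sketch only: `strip_trailing_junk` is a hypothetical name, and the sample values imitate the carriage return and NUL bytes reported in the question.

```python
def strip_trailing_junk(field: str) -> str:
    """Scan from the end, blanking characters until a digit or '.' is found."""
    chars = list(field)
    for i in range(len(chars) - 1, -1, -1):
        if chars[i] in "0123456789.":
            break  # first valid character from the right: stop scanning
        chars[i] = " "
    return "".join(chars)


# "\r" is the ^M and "\x00" is the NUL (LOW-VALUE) seen in the output file.
print(repr(strip_trailing_junk("79.85\r\x00\x00")))  # '79.85   '
print(repr(strip_trailing_junk("137.35\r\x00")))     # '137.35  '
```

The field keeps its length, as the COBOL field does; only the trailing junk turns into spaces.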
After cleanup, use the intrinsic function NUMVAL as outlined in my answer to your previous question to convert WS_AMOUNT_TXT into a numeric data type.
One final piece of advice: MOVE SPACES TO INPUT_REC before each READ to blow away data left over from a previous read that might be left in the buffer. This will protect you when reading a very "short" record after a "long" one; otherwise you may trip over data left over from the previous read.
Hope this helps.
EDIT: Just noticed this answer to your question about reading variable length files. Using a variable length input record is a better approach. Given the actual input record length, you can do something like:
UNSTRING INPUT_REC(1:REC_LEN) INTO...
Where REC_LEN is the variable specified after OCCURS DEPENDING ON for the INPUT_REC file FD. All the junk you are encountering occurs after the end of the record as defined by REC_LEN. Using reference modification as illustrated above trims it off before UNSTRING does its work to separate out the individual data fields.
EDIT 2:
Cannot use reference modification with UNSTRING. Darn... It is possible with some other COBOL dialects but not with OpenVMS COBOL. Try the following:
MOVE INPUT_REC(1:REC_LEN) TO WS_BUFFER
UNSTRING WS_BUFFER INTO...
Where WS_BUFFER is a working storage PIC X variable long enough to hold the longest input record. When you MOVE a short alphanumeric field to a longer one, the destination field (here WS_BUFFER) is left justified, with spaces used to pad the remaining space. Since leading and trailing spaces are acceptable to the NUMVAL function, you have exactly what you need.
I have a reason for pushing you in this direction. Any junk that ends up at the trailing end of a record buffer when reading a short record is undefined. There is a possibility that some of that junk just might end up being a digit or a decimal point. Should this occur, the cleanup routine I originally suggested would fail.
EDIT 3:
There are no ^@ in the resulting WS_AMOUNT_TXT, but there is still a ^M
Looks like the file system is treating <CR> (that ^M thing) at the end of each record as data. If the file you are reading came from a Windows platform and you are now reading it on a UNIX platform, that would explain the problem. Under Windows, records are terminated with <CR><LF>, while on UNIX they are terminated with <LF> only. The UNIX file system treats <CR> as if it were part of the record.
If this is the case, you can be pretty sure that there will be a single <CR> at the end of every record read. There are a number of ways to deal with this:
Method 1: As you already noted, pre-edit the file using Notepad++ or some other tool to remove the <CR> characters before processing through your COBOL program. Personally I don't think this is the best way of going about it. I prefer to use a COBOL-only solution since it involves fewer processing steps.
Method 2: Trim the last character from each input record before processing it. The last character should always be <CR>. Try the following if you are reading records as variable length and have the actual input record length available.
SUBTRACT 1 FROM REC_LEN
MOVE INPUT_REC(1:REC_LEN) TO WS_BUFFER
UNSTRING WS_BUFFER INTO...
Method 3: Treat <CR> as a delimiter when UNSTRINGing as follows:
UNSTRING INPUT_REC DELIMITED BY "," OR x"0D"
INTO WS_ID_1, WS_ID_2, WS_CODE, WS_DESCRIPTION, WS_FLAG, WS_AMOUNT_TXT
Method 4: Condition the last receiving field from UNSTRING by replacing trailing non-digit/non-decimal-point characters with spaces. I outlined this solution a little earlier in this answer. You could also explore the INSPECT statement using the REPLACING option (Format 2). This should be able to do pretty much the same thing: just replace all x"00" by SPACE and x"0D" by SPACE.
Where there is a will, there is a way. Any of the above solutions should work for you. Choose the one you are most comfortable with.
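To make Methods 3 and 4 concrete without a COBOL compiler at hand, here is a small Python sketch of the same two ideas. The names `split_record` and `clean_amount` are invented for this illustration, and the sample record imitates a Windows-terminated line read into a NUL-padded buffer.

```python
import re
from decimal import Decimal


def split_record(record: str):
    """Method 3: split on ',' or carriage return (x"0D"), like the UNSTRING above."""
    return re.split(r"[,\r]", record)


def clean_amount(text: str) -> Decimal:
    """Method 4: replace NUL and CR by spaces, then convert the remaining text."""
    cleaned = text.replace("\x00", " ").replace("\r", " ")
    return Decimal(cleaned.strip())


record = "0005464560,45667759,ZAMTR,!To ACC 12345678,DR,79.85\r\x00\x00"
fields = split_record(record)[:6]  # keep the six fields named after INTO
print(fields[5])                            # 79.85
print(clean_amount("137.35\r\x00"))         # 137.35
```

Either approach alone is enough for the sample data. Treating the carriage return as a delimiter has the advantage that whatever follows it never reaches the amount field.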
^M is a carriage return.
Would Google Refine be useful for rectifying this data?

Correct word-count of a LaTeX document

I'm currently searching for an application or a script that does a correct word count for a LaTeX document.
Up till now, I have only encountered scripts that work on a single file, but what I want is a script that can safely ignore LaTeX keywords and also traverse linked files, i.e. follow \include and \input links, to produce a correct word count for the whole document.
With vim, I currently use ggVGg CTRL+G but obviously that shows the count for the current file and does not ignore LaTeX keywords.
Does anyone know of any script (or application) that can do this job?
I use texcount. The webpage has a Perl script to download (and a manual).
It will include tex files that are included (\input or \include) in the document (see -inc), supports macros, and has many other nice features.
When following included files you will get detail about each separate file as well as a total. For example here is the total output for a 12 page document of mine:
TOTAL COUNT
Files: 20
Words in text: 4188
Words in headers: 26
Words in float captions: 404
Number of headers: 12
Number of floats: 7
Number of math inlines: 85
Number of math displayed: 19
If you're only interested in the total, use the -total argument.
I went with icio's comment and did a word-count on the pdf itself by piping the output of pdftotext to wc:
pdftotext file.pdf - | wc -w
latex file.tex
dvips -o - file.dvi | ps2ascii | wc -w
should give you a fairly accurate word count.
To add to @aioobe,
If you use pdflatex, just do
pdftops file.pdf
ps2ascii file.ps|wc -w
I compared this count to the count in Microsoft Word in a 1599 word document (according to Word). pdftotext produced a text with 1700+ words. texcount did not include the references and produced 1088 words. ps2ascii returned 1603 words. 4 more than in Word.
I say that's a pretty good count. I am not sure where's the 4 word difference, though. :)
In the Texmaker interface, you can get the word count by right-clicking in the PDF preview.
Overleaf has a word count feature in both v2 and v1.
I use the following VIM script:
function! WC()
    let filename = expand("%")
    let cmd = "detex " . filename . " | wc -w | perl -pe 'chomp; s/ +//;'"
    let result = system(cmd)
    echo result . " words"
endfunction
… but it doesn’t follow links. This would basically entail parsing the TeX file to get all linked files, wouldn’t it?
The advantage over the other answers is that it doesn’t have to produce an output file (PDF or PS) to compute the word count so it’s potentially (depending on usage) much more efficient.
Although icio’s comment is theoretically correct, I found that the above method gives quite accurate estimates for the number of words. For most texts, it’s well within the 5% margin that is used in many assignments.
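Collecting the linked files does not need a full TeX parser for simple documents. The following Python sketch is a rough approximation under stated assumptions: includes are written literally as \input{name} or \include{name}, paths are relative to the directory of the main file, and commented-out includes are not skipped. The function name `linked_files` and the demo file names are made up.

```python
import re
import tempfile
from pathlib import Path

# Matches \input{name} and \include{name}; the name is captured.
INCLUDE_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def linked_files(main_file):
    """Return the main file plus every file reachable through input/include."""
    root = Path(main_file).parent
    found = []

    def visit(path):
        if path.suffix == "":
            path = path.with_name(path.name + ".tex")  # LaTeX adds .tex itself
        if path in found or not path.is_file():
            return
        found.append(path)
        text = path.read_text(encoding="utf-8", errors="replace")
        for name in INCLUDE_RE.findall(text):
            visit(root / name.strip())

    visit(Path(main_file))
    return found


# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "chapters").mkdir()
    (tmp / "main.tex").write_text("\\input{intro}\n\\include{chapters/one}\n")
    (tmp / "intro.tex").write_text("Hello world\n")
    (tmp / "chapters" / "one.tex").write_text("More text\n")
    print([p.name for p in linked_files(tmp / "main.tex")])
    # ['main.tex', 'intro.tex', 'one.tex']
```

Each returned path could then be fed to the detex pipeline from the Vim function above and the counts summed.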
If the use of a vim plugin suits you, the vimtex plugin has integrated the texcount tool quite nicely.
Here is an excerpt from their documentation:
:VimtexCountLetters     Shows the number of letters/characters or words in
:VimtexCountWords       the current project or in the selected region. The
                        count is created with `texcount` through a call on
                        the main project file similar to: >

                          texcount -nosub -sum [-letter] -merge -q -1 FILE
<
                        Note: Default arguments may be controlled with
                              |g:vimtex_texcount_custom_arg|.

                        Note: One may access the information through the
                              function `vimtex#misc#wordcount(opts)`, where
                              `opts` is a dictionary with the following
                              keys (defaults indicated): >

                              'range' : [1, line('$')]
                              'count_letters' : 0/1
                              'detailed' : 0
<
                              If `detailed` is 0, then it only returns the
                              total count. This makes it possible to use for
                              e.g. statusline functions. If the `opts` dict
                              is not passed, then the defaults are assumed.

                                                        *VimtexCountLetters!*
                                                          *VimtexCountWords!*
:VimtexCountLetters!    Similar to |VimtexCountLetters|/|VimtexCountWords|, but
:VimtexCountWords!      show separate reports for included files. I.e.
                        presents the result of: >

                          texcount -nosub -sum [-letter] -inc FILE
<
The nice part about this is how extensible it is. On top of counting the number of words in your current file, you can make a visual selection (say two or three paragraphs) and then only apply the command to your selection.
For a very basic article class document I just look at the number of matches for a regex to find words. I use Sublime Text, so this method may not work for you in a different editor, but I just hit Ctrl+F (Command+F on Mac) and then, with regex enabled, search for
(^|\s+|"|((h|f|te){)|\()\w+
which should ignore text declaring a floating environment or captions on figures as well as most kinds of basic equations and \usepackage declarations, while including quotations and parentheticals. It also counts footnotes and \emphasized text and will count \hyperref links as one word. It's not perfect, but it's typically accurate to within a few dozen words or so. You could refine it to work for you, but a script is probably a better solution, since LaTeX source code isn't a regular language. Just thought I'd throw this up here.
