Parsing a text file where each record spans more than one line

I need to parse a text file that contains hundreds of records that span more than 1 line each. I'm new to Python and have been trying to do this with grep and awk in several complex ways but no luck yet.
The file contains records that look like this:
409547095517 911033 00:47:41 C44 00:47:46 D44 00:47:53 00:47:55
(555) 555-1212 00:47 10/31 100 Main Street - NW
Some_City TX 323 WRLS METRO PCS
P# 122-5217 ALT# 555-555-1212 LEC:MPCSI WIRELESS CALL Q
UERY CALLER FOR LOCATION QUERY CALLER FOR PHONE #*
Really, I can do all I need to if I could just get these multi-line records condensed to one line per record. Each record will always begin with "40", or I could let "9110" indicate the start, since these will always be there and are unique, provided "40" is at the beginning of a line. Using a hex editor I found that I could remove all line endings (hex 0D0A, i.e. CR+LF), but that is no better than manually editing the files, and programmatically I would need to keep the last one per record. Some records will be only 2 lines, but most will be 5, like this one.
Is there a way, in Python or otherwise, to concatenate the lines that make up a record into one line, where "40" (or maybe better, "9110") indicates the start of the record?
Any ideas or pointers will be much appreciated. I've got Python and a good IDE, and I'm good with grep and find but still learning awk (don't laugh)...

awk will do it. You need to identify the line that starts a record; in this case it begins with 409547095517.
So, to be safe, let's assume that a line starting with 8 digits is the start of a record.
awk ' NR> 1 && /^[0-9]{8}/ { printf("\n") }
{printf("%s", $0) }
END{ printf("\n") }' filename > newfilename
Change the {8} to any number that works for you.
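If you would rather do this in Python (as asked), here is a minimal sketch of the same idea. The input and output file names and the "8 digits at the start of a line" test are assumptions to adjust for your data:

import re

# A new record starts on a line that begins with (at least) 8 digits.
record_start = re.compile(r'^\d{8}')

records = []
with open('calls.txt') as f:              # hypothetical input file name
    for line in f:
        line = line.rstrip('\r\n')        # drop CR/LF so joined records stay on one line
        if record_start.match(line) or not records:
            records.append(line)          # start a new record
        else:
            records[-1] += ' ' + line     # continuation line: append with a space

with open('calls_oneline.txt', 'w') as out:
    out.write('\n'.join(records) + '\n')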

How to make a variable non-delimited file into a delimited one

Hello guys, I want to convert my non-delimited file into a delimited file.
Example of the file is as follows.
Name. CIF Address line 1 State Phn Address line 2 Country Billing Address line 3
Alex. 44A. Biston NJ 25478163 4th,floor XY USA 55/2018 kenning
And so on all the data are in this format.
First three lines are metadata and then the data.
How can I make it properly delimited using some logic?
There are two parts to this problem:
how to find the column widths
how to split each line into fields and output a new line with delimiters
I could not propose an automated solution for the first one, because (not knowing anything about the metadata format) there is no clear way to find where one column ends and the next one begins. Some of the column headings contain multiple space-separated words, and space is also used as a separator between the headings (and apparently one cannot use the rule "more than one space means the end of a heading name", because there is only one space between "Address line 2" and "Country", and they are clearly separate columns). Finding the correct column widths requires understanding English, and that is not something you can write a program for.
For the second problem, things are much easier - once you have the column positions. If you figure the column positions manually (or programmatically, if you know something about the metadata that I don't - and you have a simple method for finding what's a column heading), then a program written in AWK can do this, for example:
cols="8,15,32,40,53,66,83,105"
awk_prog='BEGIN {
    nt = split(cols, tabs, ",")
    delim = ","
    ORS = ""
}
{
    o = 1
    for (i = 1; i <= nt; i++) { t = tabs[i]; f = substr($0, o, t - o); sub(" *$", "", f); print f delim; o = t }
    print substr($0, o) "\n"
}'
awk -v cols="$cols" "$awk_prog" input_file
NOTE that the above program does not deal correctly with the case when the separator character (e.g. ",") appears inside the data. If you decide to use this as-is, be sure to use a separator that is not present in the input data. It may be better to modify the code to escape any separator characters found in the input data (there are different ways to do this - depends on what you plan to feed the output file to).
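If Python is an option, the second step can also be done there. This is a minimal sketch under the same assumption that the column start positions (here the same 8,15,32,... values) have been worked out by hand; the file names are placeholders:

# Column start positions (1-based) of the second and following columns, found manually.
cols = [8, 15, 32, 40, 53, 66, 83, 105]

with open('input_file') as src, open('output_file.csv', 'w') as dst:
    starts = [0] + [c - 1 for c in cols]              # 0-based start offset of each column
    for line in src:
        line = line.rstrip('\n')
        ends = [c - 1 for c in cols] + [len(line)]    # end offset of each column
        fields = [line[s:e].rstrip() for s, e in zip(starts, ends)]
        dst.write(','.join(fields) + '\n')

The same caveat as above applies: if the separator character appears inside the data, pick a different separator or escape it.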

GNUCobol compiled program counts one more record than expected

I'm learning COBOL programming and using GNUCobol (on Linux) to compile and test some simple programs. In one of those programs I have found an unexpected behavior that I don't understand: when reading a sequential file of records, I'm always getting one extra record and, when writing these records to a report, the last record is duplicated.
I have made a very simple program to reproduce this behavior. In this case, I have a text file with a single line of text: "0123456789". The program should count the characters in the file (i.e. 1-character-long records), and I expect it to display "10" as a result, but instead I get "11".
Also, when displaying the records, as they are read, I get the following output:
0
1
2
3
4
5
6
7
8
9
11
(There are two blank spaces between 9 and 11).
This is the relevant part of this program:
FD  SIMPLE.
01  SIMPLE-RECORD.
    05 SMP-NUMBER PIC 9(1).
[...]
PROCEDURE DIVISION.
000-COUNT-RECORDS.
    OPEN INPUT SIMPLE.
    PERFORM UNTIL SIMPLE-EOF
        READ SIMPLE
            AT END
                SET SIMPLE-EOF TO TRUE
            NOT AT END
                DISPLAY SMP-NUMBER
                ADD 1 TO RECORD-COUNT
        END-READ
    END-PERFORM
    DISPLAY RECORD-COUNT.
    CLOSE SIMPLE.
    STOP RUN.
I'm using the default options for the compiler, and I have tried using 'WITH TEST {BEFORE|AFTER}', but the result is the same. What could be causing this behavior, and how can I get the expected result?
Edit: I tried using an "empty" file as data source, expecting a 0 record count, using two different methods to empty the file:
$ echo "" > SIMPLE
This way the record count is 1 (ls -l gives a size of 1 byte for the file).
$ rm SIMPLE
$ touch SIMPLE
This way the record count is 0 (ls -l gives a size of 0 bytes for the file). So I guess that somehow the compiled program is detecting an extra character, but I don't know how to avoid this.
I found out that the cause of this behavior is the automatic newline character that vim seems to append when saving the data file.
After disabling this in vim this way
:set binary
:set noeol
the program works as expected.
Edit: A more elegant way to prevent this problem, when working with data files created from a text editor, is using ORGANIZATION IS LINE SEQUENTIAL in the SELECT clause.
Since the problem was caused by the data format, should I delete this question?

Need to selectively remove newline characters from a file using unix (solaris)

I am trying to find a way to selectively remove newline characters from a file. I have no issues removing all of them, but I need some to remain.
Here is the example of the bad input file. Note that rows with Permit ID COO789 & COO012 have newlines embedded in the description field that I need to remove.
"Permit Id","Permit Name","Description","Start Date","End Date"
"COO123","Music Festival",,"02/12/2013","02/12/2013"
"COO456","Race Weekend",,"02/23/2013","02/23/2013"
"COO789","Basketball Final 8 Championships - Media vs. Politicians
Skills Competition",,"02/22/2013","02/22/2013"
"COO012","Dragonboat race
weekend",,"05/11/2013","05/11/2013"
Here is an example of how I need the file to look:
"Permit Number/Id","Permit Name","Description","Start Date","End Date"
"COO123","Music Festival",,"02/12/2013","02/12/2013"
"COO456","Race Weekend",,"02/23/2013","02/23/2013"
"COO789","Basketball Final 8 Championships - Media vs. Politicians Skills Competition",,"02/22/2013","02/22/2013"
"COO012","Dragonboat race weekend",,"05/11/2013","05/11/2013"
NOTE: I did simplify the file by removing a few extra columns, but the logic should be able to accommodate any number of columns. Technically, I expect the "extra" newlines to be found in the Description and Location columns. The actual full header line, with all columns, is:
"Permit Number/Id","Permit Name","Description","Start Date","End Date","Custom Status","Owner Name","Total Expected Attendance","Location"
I have tried sed, cut, tr, nawk, etc. I'm open to any solution that can do this and that can be called from within a unix script.
Thanks!!!
If you must remove newline characters from only within the 'Description' and 'Location' fields, you will need a proper csv parser (think Text::CSV). You could also do this fairly easily using GNU awk, but you won't have access to gawk on Solaris unfortunately. Therefore, the next best solution would be to join lines that don't start with a double-quote to the previous line. You can do this using sed. I've written this with compatibility in mind:
sed -e :a -e '$!N; s/ *\n\([^"]\)/ \1/; ta' -e 'P;D' file
Results:
"Permit Id","Permit Name","Description","Start Date","End Date"
"COO123","Music Festival",,"02/12/2013","02/12/2013"
"COO456","Race Weekend",,"02/23/2013","02/23/2013"
"COO789","Basketball Final 8 Championships - Media vs. Politicians Skills Competition",,"02/22/2013","02/22/2013"
"COO012","Dragonboat race weekend",,"05/11/2013","05/11/2013"
sed ':a;N;$!ba;s/ \n/ /g'
Reads the whole file into the pattern space, then removes all newlines which occur directly after a space - assuming that all the errant newlines fit this pattern. If not, when else should newlines be removed?
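If Python happens to be available on the box, its csv module is quote-aware, so the embedded newlines are handled for you. A minimal sketch (the file names are placeholders):

import csv

# csv.reader understands quoted fields, so a quoted field that contains a newline
# comes back as a single value and each logical record arrives as one row.
with open('permits.csv', newline='') as src, open('permits_fixed.csv', 'w') as dst:
    for row in csv.reader(src):
        # Collapse any newline (and surrounding whitespace) inside a field to one space.
        cleaned = [' '.join(field.split()) for field in row]
        # Re-quote non-empty fields and leave empty fields bare, as in the original file.
        dst.write(','.join('"%s"' % f.replace('"', '""') if f else '' for f in cleaned) + '\n')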

Using sed or awk to parse current field/column into additional fields/columns on the same line

Here are 4 sample rows of the text file of interest
EnumerateKey,explorer.exe,HKCR\\Directory\\shellex\\ContextMenuHandlers,NOMORE
CreateSec,explorer.exe,\\WINDOWS\\system32\\verclsid.exe,SUCCESS
QueryKey,AcroRd32.exe,HKCU\\Control Panel\\International,BUFOVRFLOW
QueryValue,AcroRd32.exe,HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Policies\\Explorer\\NoRecentDocsHistory,NOTFOUND
I would like to augment the rows by appending K fields/columns (for example K=3 below) which contain the elements of the path found in $3 but parsed by \\
Here is the desired output for the 4 lines.
EnumerateKey,explorer.exe,HKCR\\Directory\\shellex\\ContextMenuHandlers,NOMORE, Directory, shellex, ContextMenuHandlers
CreateSec,explorer.exe,\\WINDOWS\\system32\\verclsid.exe,SUCCESS, WINDOWS, system32, verclsid.exe
QueryKey,AcroRd32.exe,HKCU\\Control Panel\\International,BUFOVRFLOW, Control Panel, International,
QueryValue,AcroRd32.exe,HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Policies\\Explorer\\NoRecentDocsHistory,NOTFOUND, Software, Microsoft, Windows
After some more study, here are 2 nuances:
Some of the paths begin with HK**, others don't. However, in both cases I only care about the path that starts after the initial \\. This difference is captured between line 1 and 2. Therefore I believe the parsing must be anchored at \\ rather than simply $3 if possible. (Am I using that terminology correctly?)
Second, the depth of the path varies. To keep the columns/fields consistent, I'm willing to lose some information (line 4) as well as have empty fields for the short paths (line 3).
Here's one way using GNU awk:
awk 'BEGIN { FS=OFS="," } { split($3,a,"\\\\\\\\"); print $0, a[2], a[3], a[4] }' file
Results:
EnumerateKey,explorer.exe,HKCR\\Directory\\shellex\\ContextMenuHandlers,NOMORE,Directory,shellex,ContextMenuHandlers
CreateSec,explorer.exe,\\WINDOWS\\system32\\verclsid.exe,SUCCESS,WINDOWS,system32,verclsid.exe
QueryKey,AcroRd32.exe,HKCU\\Control Panel\\International,BUFOVRFLOW,Control Panel,International,
QueryValue,AcroRd32.exe,HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Policies\\Explorer\\NoRecentDocsHistory,NOTFOUND,Software,Microsoft,Windows
With some ugly Perl:
perl -lane '{$l=$_;s/.*?\\\\([^,]*),.*/$1/;@v=split(/\\\\/,$_); print "$l,".join(",",@v[0,1,2]);}' input
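The same idea in Python, as a rough sketch (the file names are placeholders, K is fixed at 3 as in the example, and the lines are assumed to contain no embedded commas):

K = 3  # number of path components to append, as in the example

with open('events.csv') as src, open('events_augmented.csv', 'w') as dst:
    for line in src:
        line = line.rstrip('\n')
        fields = line.split(',')
        # Split the third field on the doubled backslash and drop everything before
        # the first separator (the HK** hive, or the empty piece for a leading \\).
        parts = fields[2].split('\\\\')[1:]
        parts = (parts + [''] * K)[:K]    # pad short paths, truncate deep ones
        dst.write(line + ',' + ','.join(parts) + '\n')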

Correct word-count of a LaTeX document

I'm currently searching for an application or a script that does a correct word count for a LaTeX document.
Up till now, I have only encountered scripts that work on a single file, but what I want is a script that can safely ignore LaTeX keywords and also traverse linked files, i.e. follow \include and \input links, to produce a correct word count for the whole document.
With vim, I currently use ggVGg CTRL+G but obviously that shows the count for the current file and does not ignore LaTeX keywords.
Does anyone know of any script (or application) that can do this job?
I use texcount. The webpage has a Perl script to download (and a manual).
It will include tex files that are included (\input or \include) in the document (see -inc), supports macros, and has many other nice features.
When following included files you will get detail about each separate file as well as a total. For example here is the total output for a 12 page document of mine:
TOTAL COUNT
Files: 20
Words in text: 4188
Words in headers: 26
Words in float captions: 404
Number of headers: 12
Number of floats: 7
Number of math inlines: 85
Number of math displayed: 19
If you're only interested in the total, use the -total argument.
I went with icio's comment and did a word-count on the pdf itself by piping the output of pdftotext to wc:
pdftotext file.pdf - | wc -w
latex file.tex
dvips -o - file.dvi | ps2ascii | wc -w
should give you a fairly accurate word count.
To add to @aioobe:
If you use pdflatex, just do
pdftops file.pdf
ps2ascii file.ps | wc -w
I compared this count to the count in Microsoft Word in a 1599-word document (according to Word). pdftotext produced a text with 1700+ words. texcount did not include the references and produced 1088 words. ps2ascii returned 1603 words, 4 more than in Word.
I'd say that's a pretty good count. I'm not sure where the 4-word difference comes from, though. :)
In the Texmaker interface you can get the word count by right-clicking in the PDF preview.
Overleaf has a word count feature (available in both Overleaf v1 and v2).
I use the following VIM script:
function! WC()
    let filename = expand("%")
    let cmd = "detex " . filename . " | wc -w | perl -pe 'chomp; s/ +//;'"
    let result = system(cmd)
    echo result . " words"
endfunction
… but it doesn’t follow links. This would basically entail parsing the TeX file to get all linked files, wouldn’t it?
The advantage over the other answers is that it doesn’t have to produce an output file (PDF or PS) to compute the word count so it’s potentially (depending on usage) much more efficient.
Although icio’s comment is theoretically correct, I found that the above method gives quite accurate estimates for the number of words. For most texts, it’s well within the 5% margin that is used in many assignments.
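If you want the link-following part without producing a PDF or PS file, here is a rough Python sketch of the idea. It strips comments and \commands very crudely and recurses into \input/\include; the root file name and the stripping rules are assumptions, so treat the result as an estimate, much like detex's:

import os
import re

def count_words(path, seen=None):
    # Very rough word count: strip comments and \commands, follow \input / \include.
    seen = seen if seen is not None else set()
    if path in seen:                              # guard against include cycles
        return 0
    seen.add(path)
    with open(path) as f:
        text = f.read()
    total = 0
    # Recurse into included files first (".tex" is appended when missing).
    for name in re.findall(r'\\(?:input|include)\{([^}]+)\}', text):
        child = name if name.endswith('.tex') else name + '.tex'
        child = os.path.join(os.path.dirname(path), child)
        if os.path.exists(child):
            total += count_words(child, seen)
    text = re.sub(r'(?<!\\)%.*', '', text)                         # drop comments
    text = re.sub(r'\\[A-Za-z]+\*?(\[[^\]]*\])?', ' ', text)       # drop \commands and options
    return total + len(re.findall(r'[A-Za-z0-9]+', text))

print(count_words('main.tex'))    # 'main.tex' is a placeholder for the root document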
If the use of a vim plugin suits you, the vimtex plugin has integrated the texcount tool quite nicely.
Here is an excerpt from their documentation:
:VimtexCountLetters Shows the number of letters/characters or words in
:VimtexCountWords the current project or in the selected region. The
count is created with `texcount` through a call on
the main project file similar to: >
texcount -nosub -sum [-letter] -merge -q -1 FILE
<
Note: Default arguments may be controlled with
|g:vimtex_texcount_custom_arg|.
Note: One may access the information through the
function `vimtex#misc#wordcount(opts)`, where
`opts` is a dictionary with the following
keys (defaults indicated): >
'range' : [1, line('$')]
'count_letters' : 0/1
'detailed' : 0
<
If `detailed` is 0, then it only returns the
total count. This makes it possible to use for
e.g. statusline functions. If the `opts` dict
is not passed, then the defaults are assumed.
*VimtexCountLetters!*
*VimtexCountWords!*
:VimtexCountLetters! Similar to |VimtexCountLetters|/|VimtexCountWords|, but
:VimtexCountWords! show separate reports for included files. I.e.
presents the result of: >
texcount -nosub -sum [-letter] -inc FILE
<
The nice part about this is how extensible it is. On top of counting the number of words in your current file, you can make a visual selection (say two or three paragraphs) and then only apply the command to your selection.
For a very basic article class document I just look at the number of matches for a regex to find words. I use Sublime Text, so this method may not work for you in a different editor, but I just hit Ctrl+F (Command+F on Mac) and then, with regex enabled, search for
(^|\s+|"|((h|f|te){)|\()\w+
which should ignore text declaring a floating environment or captions on figures as well as most kinds of basic equations and \usepackage declarations, while including quotations and parentheticals. It also counts footnotes and \emphasized text and will count \hyperref links as one word. It's not perfect, but it's typically accurate to within a few dozen words or so. You could refine it to work for you, but a script is probably a better solution, since LaTeX source code isn't a regular language. Just thought I'd throw this up here.
