How to convert a variable-width, non-delimited file into a delimited one

Hello guys, I want to convert my non-delimited file into a delimited file.
An example of the file is as follows.
Name. CIF Address line 1 State Phn Address line 2 Country Billing Address line 3
Alex. 44A. Biston NJ 25478163 4th,floor XY USA 55/2018 kenning
And so on; all the data is in this format.
The first three lines are metadata, followed by the data.
How can I make it delimited in a proper format using some logic?

There are two parts to the problem:
1. how to find the column widths
2. how to split each line into fields and output a new line with delimiters
I could not propose an automated solution for the first one because, not knowing anything about the metadata format, there is no clear way to find where one column ends and the next one begins. Some of the column headings contain multiple space-separated words, and space is also used as a separator between the headings (and apparently one cannot use the rule "more than one space means the end of a heading name", because there's only one space between "Address line 2" and "Country", and they're clearly separate columns). Clearly, finding the correct column widths requires understanding English, and this is not something that you can write a program for.
For the second problem, things are much easier once you have the column positions. If you figure out the column positions manually (or programmatically, if you know something about the metadata that I don't and you have a simple method for finding what's a column heading), then a program written in AWK can do this, for example:
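cols="8,15,32,40,53,66,83,105"
awk_prog='BEGIN {
    nt = split(cols, tabs, ",")   # positions where each following column starts
    delim = ","
    ORS = ""
}
{
    o = 1
    # loop by index: "for (i in tabs)" does not guarantee any order
    for (i = 1; i <= nt; i++) {
        t = tabs[i]
        f = substr($0, o, t - o)
        sub(/ *$/, "", f)         # strip trailing padding
        print f delim
        o = t
    }
    print substr($0, o) "\n"
}'
awk -v cols="$cols" "$awk_prog" input_file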
NOTE that the above program does not deal correctly with the case when the separator character (e.g. ",") appears inside the data. If you decide to use this as-is, be sure to use a separator that is not present in the input data. It may be better to modify the code to escape any separator characters found in the input data (there are different ways to do this - depends on what you plan to feed the output file to).
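As a rough illustration of the escaping mentioned in the note, here is a minimal Python sketch of the same splitting step. It assumes the column start positions are already known (the values in the usage below are made up), and it leans on the standard csv module to quote any field that contains the separator. The names split_fixed_width and to_csv_line are hypothetical.

```python
import csv
import io

def split_fixed_width(line, boundaries):
    """Cut a line at 1-based column start positions and strip the padding."""
    fields = []
    start = 1
    for b in boundaries:
        # Characters start..b-1 (1-based) belong to the current column.
        fields.append(line[start - 1:b - 1].rstrip(" "))
        start = b
    # Everything from the last boundary onward is the final column.
    fields.append(line[start - 1:].rstrip(" \n"))
    return fields

def to_csv_line(fields):
    """Join fields with commas, quoting any field that contains a comma."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(fields)
    return buf.getvalue()

if __name__ == "__main__":
    # Assumed layout: columns start at positions 1, 9 and 16.
    print(to_csv_line(split_fixed_width("Alex    44A    4th,floor", [9, 16])))
```

Whether quoting or another escaping scheme is right still depends on what will read the output file.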

Related

How to replace some characters of input file, before it getting lexed in flex?

How to replace all occurrences of some character or char-sequence with some other character or char-sequence, before flex lexes it. For example I want B\65R to match identifier rule as it is equivalent to BAR in my grammar. So, essentially I want to turn a sequence of \dd into its equivalent ascii character and then lex it. (\65 -> A, \66 -> B, …).
I know, I can first search the entire file for a sequence of \dd and replace it with equivalent character and then feed it to flex. But I wonder if there exists a better way. Something like writing a rule that matches \dd and then replacing it with corresponding alternative in the input stream, so that, I don't have to parse entire file twice.
Several options...
One is to have flex read from a filter that substitutes "\dd" with "chr(dd)" (untested).
You could run something along the lines of
yyin = popen("perl -pe 's/\\\\(\\d\\d)/chr($1)/ge'", "r");
yylex()....
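To make the substitution itself concrete, here is a small Python sketch of what such a filter would do. It assumes, as in the question, that an escape is a backslash followed by exactly two decimal digits giving the ASCII code. The function name decode_escapes is hypothetical.

```python
import re

# A backslash followed by exactly two decimal digits, e.g. \65
_ESCAPE = re.compile(r"\\(\d\d)")

def decode_escapes(text):
    """Replace every \\dd sequence with the character whose code is dd."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1))), text)

if __name__ == "__main__":
    print(decode_escapes(r"B\65R"))
```

A filter like this still reads the input once before the lexer sees it, but the two stages run as a pipeline rather than as two separate passes over a file on disk.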

How to use grep to search for strings with (exclusively) a finite set of characters

I have a plain text file with a one string per line. I'd like to identify any instances where a string contains a value outside of a restricted character set. In this particular instance, if the string contains any character outside of the set "[THADGRC.SMBN-WVKY]" I want to retain it and pass it along to a new file.
For example, let's say the original file "mystrings.txt" contained the following data:
THADGRC.SMBN-WVKY
YKVW-NBMS.CRGDHAT
THADGRC.SMBN-WVKYI
My intention is to retain only the third sequence, because it contains a character outside of the allowed set (I) in this case.
It doesn't matter how many times, or in what order, an allowed character is present - all I care about is if a character exists in that string outside of the allowed set.
Originally I tried:
cat mystrings.txt | grep -v [THADGRC\.SMBN-WVKY] > badstrings.txt
but of course the third string contains allowed characters in addition to the non-allowed one, so this search ended up producing no "offending" strings.
Last thing: I'm not sure what characters outside of the allowed set might exist in this text file. It would be great to know ahead of time to just search for anything with an "I", but I don't actually know this ahead of time.
So the question: is there a way to use grep (or another tool, say awk?) to pass in a restricted list of characters, and flag any instances where a string contains any number of characters outside of that set?
Thanks for your consideration
I think that your problem is N-W. This doesn't match "N", "-" and "W"; it matches a range from "N" to "W". You should move "-" to the end of the character class, or escape it. I suggest changing to:
grep '[^THADGRC.SMBNWVKY-]' mystrings.txt
Also, note that "." doesn't have to be escaped when it's inside a character class.
Your attempt says "remove any lines which contain one of these characters at least once". But you want "print any lines which contain at least one character not in this set."
(Also, quote your regular expressions, and lose the useless cat.)
grep '[^-THADGRC.SMBNWVKY]' mystrings.txt > badstrings.txt
I moved the dash to the beginning of the character class on the assumption that you want a literal dash, not the regex range N-W (i.e. N, O, P, Q, R, S, T, U, V, W).
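The same "any character outside the set" test can be sketched in Python, in case grep is not a hard requirement. The allowed set is taken from the question, with the dash treated as a literal; the function name bad_strings is hypothetical.

```python
import re

# Allowed characters from the question; "-" is meant literally, not as a range.
ALLOWED = "THADGRC.SMBNWVKY-"

# Negated character class: matches any single character NOT in ALLOWED.
_OUTSIDE = re.compile("[^" + re.escape(ALLOWED) + "]")

def bad_strings(lines):
    """Return the strings that contain at least one disallowed character."""
    return [s for s in lines if _OUTSIDE.search(s)]

if __name__ == "__main__":
    sample = ["THADGRC.SMBN-WVKY", "YKVW-NBMS.CRGDHAT", "THADGRC.SMBN-WVKYI"]
    print(bad_strings(sample))
```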

Behavior of STRING verb

I am reading a COBOL program file and I am struggling to understand the way the STRING command works in the following example
STRING WK-NO-EMP-SGE
','
WK-DT-DEB-PER-FEU-TEM
','
WK-DT-FIN-PER-FEU-TEM
DELIMITED BY SIZE
INTO UUUUUU-CO-CLE-ERR-DB2
I have three possible understandings of what it does:
1. Either the code concatenates the variables into UUUUUU-CO-CLE-ERR-DB2, separating the values with ',', and only the last variable is delimited by size;
2. Or the code concatenates the variables into UUUUUU-CO-CLE-ERR-DB2, separating the values with ',', but all the values are delimited by size (meaning that DELIMITED BY SIZE in this case applies to all the values passed to the STRING command);
3. Or each variable is delimited by a specific character: for example, WK-NO-EMP-SGE would be delimited by ',', WK-DT-DEB-PER-FEU-TEM by ',', and WK-DT-FIN-PER-FEU-TEM would then be DELIMITED BY SIZE.
Which of my readings is actually the correct one?
Here's the syntax-diagram for STRING (from the Enterprise COBOL Language Reference):
Now you need to know how to read it.
Fortunately, the same document tells you how:
How to read the syntax diagrams
Use the following description to read the syntax diagrams in this
document:
. Read the syntax diagrams from left to right, from top to bottom,
following the path of the line.
The >>--- symbol indicates the beginning of a syntax diagram.
The ---> symbol indicates that the syntax diagram is continued on the
next line.
The >--- symbol indicates that the syntax diagram is continued from
the previous line.
The --->< symbol indicates the end of a syntax diagram. Diagrams of
syntactical units other than complete statements start with the >---
symbol and end with the ---> symbol.
. Required items appear on the horizontal line (the main path).
. Optional items appear below the main path.
. When you can choose from two or more items, they appear vertically,
in a stack.
If you must choose one of the items, one item of the stack appears on
the main path.
If choosing one of the items is optional, the entire stack appears
below the main path.
. An arrow returning to the left above the main line indicates an item
that can be repeated.
A repeat arrow above a stack indicates that you can make more than one
choice from the stacked items, or repeat a single choice.
. Variables appear in italic lowercase letters (for example, parmx).
They represent user-supplied names or values.
. If punctuation marks, parentheses, arithmetic operators, or other
such symbols are shown, they must be entered as part of the syntax.
All that means, if you follow it through, that your number 2 is correct.
You can use a delimiter (when you don't have fixed-length data) or just use the size. Any item that does not say explicitly how it is delimited is delimited by the next DELIMITED BY phrase.
One thing to watch for with STRING, which doesn't matter in your case, is that the target field does not get space-padded if the data is shorter than the target. With variable-length data, you need to clear the field to spaces before the STRING executes.
There is a nuance one must grasp in order to understand the results. DELIMITED BY SIZE can be misleading if one has experience in other programming languages.
Each of the three variables has a size that is defined in WORKING-STORAGE. Let's presume it looks something like this.
05 WK-NO-EMP-SGE PIC X(04).
05 WK-DT-DEB-PER-FEU-TEM PIC X(10).
05 WK-DT-FIN-PER-FEU-TEM PIC X(10).
If the value of the variables were set like this:
MOVE 'BOB' TO WK-NO-EMP-SGE.
MOVE 'Q' TO WK-DT-DEB-PER-FEU-TEM.
MOVE 'D19EIEIO2B' TO WK-DT-FIN-PER-FEU-TEM.
Then one might expect the value of UUUUUU-CO-CLE-ERR-DB2 to be:
BOB,Q,D19EIEIO2B
But it would actually be:
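BOB ,Q         ,D19EIEIO2B
This is because DELIMITED BY SIZE copies each field at its full declared length, trailing spaces included: 'BOB' occupies 4 characters and 'Q' occupies 10.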
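The padding effect can be imitated in a few lines of Python. This is only a model of the behaviour described above, not COBOL semantics in full: pic_x is a hypothetical helper that mimics moving a value into a PIC X(n) field, and the field sizes are the presumed ones from the example.

```python
def pic_x(value, size):
    """Mimic MOVE into a PIC X(size) field: left-justify, pad with spaces, truncate."""
    return value.ljust(size)[:size]

def string_delimited_by_size(*operands):
    """Mimic STRING ... DELIMITED BY SIZE: each operand is copied at full length."""
    return "".join(operands)

if __name__ == "__main__":
    wk_no_emp = pic_x("BOB", 4)          # presumed PIC X(04)
    wk_dt_deb = pic_x("Q", 10)           # presumed PIC X(10)
    wk_dt_fin = pic_x("D19EIEIO2B", 10)  # presumed PIC X(10)
    print(repr(string_delimited_by_size(wk_no_emp, ",", wk_dt_deb, ",", wk_dt_fin)))
```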

Regular expression in Ruby

Could anybody help me make a proper regular expression from a bunch of text in Ruby. I tried a lot but I don't know how to handle variable length titles.
The string will be of format <sometext>title:"<actual_title>"<sometext>. I want to extract actual_title from this string.
I tried /title:"."/ but it doesn't find any matches, because it expects the closing quotation mark exactly one character after the opening one. I couldn't figure out how to make it match a string of variable length. Any help is appreciated. Thanks.
. matches any single character. Putting + after a character will match one or more of those characters. So .+ will match one or more characters of any sort. Also, you should put a question mark after it so that it matches the first closing-quotation mark it comes across. So:
/title:"(.+?)"/
The parentheses are necessary if you want to extract the title text that it matched out of there.
/title:"([^"]*)"/
The parentheses create a capturing group. Inside is first a character class. The ^ means it's negated, so it matches any character that's not a ". The * means 0 or more. You can change it to one or more by using + instead of *.
I like /title:"(.+?)"/ because of its use of lazy matching, which stops the .+ from consuming all text up to the last " on the line.
It won't work if the string wraps lines or includes escaped quotes.
In programming languages where you want to be able to include the string delimiter inside a string, you usually provide an 'escape' character or sequence.
If your escape character was \ then you could write something like this...
/title:"((?:\\"|[^"])+)"/
This is a railroad diagram. Railroad diagrams show you what order things are parsed... imagine you are a train starting at the left. You consume title:" then \" if you can.. if you can't then you consume not a ". The > means this path is preferred... so you try to loop... if you can't you have to consume a '"' to finish.
I made this with https://regexper.com/#%2Ftitle%3A%22((%3F%3A%5C%5C%22%7C%5B%5E%22%5D)%2B)%22%2F
but there is now a plugin for Atom text editor too that does this.
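For comparison, here is how the three patterns above behave on sample input. The sketch uses Python's re module rather than Ruby, on the assumption that these particular constructs (lazy quantifier, negated class, non-capturing group) work the same way in both engines; the sample strings are made up.

```python
import re

LAZY = re.compile(r'title:"(.+?)"')                # stop at the first closing quote
NEGATED = re.compile(r'title:"([^"]*)"')           # anything except a quote
ESCAPED = re.compile(r'title:"((?:\\"|[^"])+)"')   # also accept \" inside the title
GREEDY = re.compile(r'title:"(.+)"')               # for contrast: runs to the last quote

def extract(pattern, text):
    """Return the captured title, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(1) if m else None

if __name__ == "__main__":
    s = 'id:7 title:"Hello World" author:"X"'
    print(extract(LAZY, s), "|", extract(NEGATED, s), "|", extract(GREEDY, s))
    print(extract(ESCAPED, r'title:"say \"hi\" now" x'))
```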

Funny CSV format help

I've been given a large file with a funny CSV format to parse into a database.
The separator character is a semicolon (;). If one of the fields contains a semicolon it is "escaped" by wrapping it in doublequotes, like this ";".
I have been assured that there will never be two adjacent fields with trailing/ leading doublequotes, so this format should technically be ok.
Now, for parsing it in VBScript I was thinking of
Replacing each instance of ";" with a GUID,
Splitting the line into an array by semicolon,
Running back through the array, replacing the GUIDs with ";"
It seems to be the quickest way. Is there a better way? I guess I could use substrings but this method seems to be acceptable...
Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.
One approach I've used for this type of data before is to just split on the semicolons regardless and then, if two adjacent fields end and start with a quote, combine them.
For example:
Pax;is;a;good;guy";" so;says;his;wife.
becomes:
0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.
Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing the field 4 closing quote with a semicolon and removing the field 5 opening quote (and joining them of course).
0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.
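The split-then-merge idea can be sketched in Python as follows. It relies on the assurance from the question that two genuinely separate adjacent fields never end and start with a doublequote; the function name split_semicolon_line is hypothetical.

```python
def split_semicolon_line(line):
    """Split on ';', then rejoin pieces that were cut inside an escaped ";"."""
    fields = []
    for piece in line.split(";"):
        if fields and fields[-1].endswith('"') and piece.startswith('"'):
            # The split landed inside ";": restore the semicolon, drop both quotes.
            fields[-1] = fields[-1][:-1] + ";" + piece[1:]
        else:
            fields.append(piece)
    return fields

if __name__ == "__main__":
    print(split_semicolon_line('Pax;is;a;good;guy";" so;says;his;wife.'))
```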
In pseudo-code, given:
input: A string; first character is input[0], last
       character is input[length-1]. Further, assume one dummy
       character, input[length]. It can be anything except
       ; and ". This string is one line of the "CSV" file.
length: positive integer, number of characters in input
Do this:
set start = 0
if input[0] = ';':
    you have a blank field in the beginning; do whatever with it
    set start = 1
endif
for each c between 1 and length-1:
    next iteration unless input[c] = ';'
    if input[c-1] ≠ '"' or input[c+1] ≠ '"': // test for escape sequence ";"
        found field consisting of half-open range [start,c); do whatever
        with it. Note that in the case of empty fields, start = c, leaving
        an empty range
        set start = c+1
    endif
end foreach
the last field is the half-open range [start,length); do whatever with it
Untested, of course. Debugging code like this is always fun….
The special case of input[0] is to make sure we don't ever look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data—and your parsing—from input[1].
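A Python rendering of this scanning approach might look like the sketch below. Bounds checks stand in for the dummy character and the input[0] special case, and turning each ";" back into a plain semicolon at the end is an addition that the pseudo-code leaves as "do whatever". The function name scan_fields is hypothetical.

```python
def scan_fields(line):
    """Split on ';' except where it is wrapped in doublequotes, as in ";"."""
    fields = []
    start = 0
    n = len(line)
    for c, ch in enumerate(line):
        if ch != ";":
            continue
        # An escaped separator has a doublequote on both sides.
        escaped = 0 < c < n - 1 and line[c - 1] == '"' and line[c + 1] == '"'
        if not escaped:
            fields.append(line[start:c])  # half-open range [start, c)
            start = c + 1
    fields.append(line[start:])           # last field
    # Assumption: the consumer wants ";" unescaped back to a plain semicolon.
    return [f.replace('";"', ";") for f in fields]

if __name__ == "__main__":
    print(scan_fields('Pax;is;a;good;guy";" so;says;his;wife.'))
```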
One option would be to find instances of the regex:
[^"];[^"]
and then break the string apart with substring:
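List<string> ret = new List<string>();
Regex r = new Regex(@"[^""];[^""]");
Match m;
while((m = r.Match(line)).Success)
{
    // m.Index points at the character before the separator, so keep it in the field
    ret.Add(line.Substring(0, m.Index + 1));
    line = line.Substring(m.Index + 2);
}
ret.Add(line); // whatever remains is the last field
(Sorry about the C#, I don't know VBScript)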
Using quotes is normal for .csv files. If you have quotes in the field then you may see opening and closing and the embedded quote all strung together two or three in a row.
If you're using SQL Server you could try using T-SQL to handle everything for you.
SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0',
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]
That will create and populate "MyTable". Read more on this subject here on SO.
I would recommend using RegEx to break up the strings.
Find every ';' that is not a part of ";" and change it to something else that does not appear in your fields.
Then go through and replace ";" with ;
Now you have your fields with the correct data.
Most importers can swap out separator characters pretty easily.
This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).
Jacob
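The placeholder (GUID) idea from the question can be sketched in Python like this. The token comes from the standard uuid module and is regenerated in the unlikely case that it already occurs in the line, which addresses the uniqueness caveat raised above. The function name split_with_placeholder is hypothetical.

```python
import uuid

def split_with_placeholder(line):
    """Protect each escaped ";" with a unique token, split, then restore it."""
    token = uuid.uuid4().hex
    while token in line:
        # Extremely unlikely, but make sure the token is not already in the data.
        token = uuid.uuid4().hex
    protected = line.replace('";"', token)
    return [field.replace(token, ";") for field in protected.split(";")]

if __name__ == "__main__":
    print(split_with_placeholder('Pax;is;a;good;guy";" so;says;his;wife.'))
```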

Resources