Adding tabs to a non-delimited text file with empty and variable-length columns - parsing

I have a non-delimited text file and want to parse it to add tabs at specific spots to delimit the columns. The columns are sometimes empty or vary in length, which is why I need to add tabs at those specific spots. I found the answer to this once, a couple of years ago on the net, using batch, but now can't find it or the code. I already have the following code to replace runs of spaces in the file with a tab, but this doesn't account for when the columns are empty.
gc $FileToOpen | % { $_ -replace ' +',"`t" } | set-content $FileToSave
So, I need to read each line, take it a portion (a certain number of characters) at a time, and add a tab after each portion.
Here is a sample of the data file; the top row is the header, and the data rows have no blank lines between them:
MRUN Number Name X Exception Reason Data CDM# Quantity D.O.S
000000 00000000 Name W MODIFIER CANNOT BE FILED WITHOUT 08/13/2015 0000000 0 08/13/2015
000000 00000000 Name W MODIFIER CANNOT BE FILED WITHOUT 0000000 0 08/13/2015
The second data row is missing the Data field.
Using Ansgar's answer, here is my code, which does handle the empty fields:
gc $FileToOpen |
? { $_ -match '^(.{8})(.{12})(.{20})(.{3})(.{34})(.{62})(.{10})(.{22})(.{10})$' } |
% { "{0}`t{1}`t{2}`t{3}`t{4}`t{5}`t{6}`t{7}`t{8}" -f $matches[1].Trim(), $matches[2].Trim(), $matches[3].Trim(), $matches[4].Trim(), $matches[5].Trim(), $matches[6].Trim(), $matches[7].Trim(), $matches[8].Trim(), $matches[9].Trim() } |
Set-Content $FileToSave
Thanks for your patience Ansgar, I know I tried it! I really do appreciate the help!

Since you seem to have an input file with fixed-width columns, you should probably use a regular expression for transforming the input into a tab-delimited format.
Assume the following input file:
A     B   C
foo   13  22
bar   4   17
baz   142 23
The file has 3 columns. The first column is 6 characters wide, the other two columns 4 characters each.
The transformation could be done with a regular expression like this:
Get-Content 'C:\path\to\input.txt' |
? { $_ -match '^(.{6})(.{4})(.{4})$' } |
% { "{0}`t{1}`t{2}" -f $matches[1].Trim(), $matches[2].Trim(), $matches[3].Trim() } |
Set-Content 'C:\path\to\output.txt'
The regular expression defines the columns by character count and captures them in groups (parentheses). The groups can then be accessed as the indexes 1 and above of the resulting $matches collection. Trimming removes the leading/trailing whitespace. The format operator (-f) then inserts the trimmed values into the tab-separated format string.
If the last column has a variable width (because its values are aligned to the left and don't have trailing spaces) you may need to change the regular expression to ^(.{6})(.{4})(.{0,4})$ to take care of that. The quantifier {0,4} means up to four times the preceding expression (the .NET regex engine PowerShell uses needs the explicit lower bound, so write {0,4} rather than {,4}).

Related

Counting a word in a cell/column based on the number after that word

Please allow me to ask a question about a formula for counting a word based on the number that follows that word.
example:
  |      A      | B
--------------------
1 | thumbnail20 | 20
2 | gallery13   | 13
3 | girl45      | 45
I really appreciate all the answers; sorry for the duplicate question.
Thanks to #ziganotschka and #BHAWANI SINGH, it all works, case closed :)
There are several options depending on your data structure, e.g.
=VALUE(REGEXREPLACE(A1,"[^[:digit:]]", ""))
will extract all digits from the A column to the B column
Should you have several numbers within your string,
=SPLIT(lower(A4),"qwertyuiopasdfghjklzxcvbnm`-=[]\;',./!@#$%^&*()")
will extract the first number into column B, the second into column C etc.
If you want to extract only the digits to the right, then
=arrayformula(RIGHT(A1,LEN(A1)+1-min(SEARCH({0,1,2,3,4,5,6,7,8,9},A1&"0123456789"))))

How to make a variable-width non-delimited file into a delimited one

Hello guys, I want to convert my non-delimited file into a delimited file.
Example of the file is as follows.
Name. CIF Address line 1 State Phn Address line 2 Country Billing Address line 3
Alex. 44A. Biston NJ 25478163 4th,floor XY USA 55/2018 kenning
And so on; all the data is in this format.
The first three lines are metadata, and then the data follows.
How can I convert it into a properly delimited format?
There are two parts in the problem:
how to find the column widths
how to split each line into fields and output a new line with delimiters
I could not propose an automated solution for the first part, because (not knowing anything about the metadata format) there is no clear way to find where one column ends and the next one begins. Some of the column headings contain multiple space-separated words, and space is also used as a separator between the headings; apparently one cannot even use the rule "more than one space means the end of a heading name", because there is only one space between "Address line 2" and "Country", yet they are clearly separate columns. Finding the correct column widths requires understanding English, and that is not something you can write a program for.
For the second problem, things are much easier - once you have the column positions. If you figure the column positions manually (or programmatically, if you know something about the metadata that I don't - and you have a simple method for finding what's a column heading), then a program written in AWK can do this, for example:
cols="8,15,32,40,53,66,83,105"
awk_prog='BEGIN {
nt=split(cols,tabs,",")
delim=","
ORS=""
}
{ o=1 ;
for (i in tabs) { t=tabs[i] ; f=substr($0,o,t-o); sub(" *$","",f) ; print f
delim ; o=t } ;
print substr($0, o) "\n"
}'
awk -v cols="$cols" "$awk_prog" input_file
NOTE that the above program does not deal correctly with the case when the separator character (e.g. ",") appears inside the data. If you decide to use this as-is, be sure to use a separator that is not present in the input data. It may be better to modify the code to escape any separator characters found in the input data (there are different ways to do this - depends on what you plan to feed the output file to).
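If awk isn't convenient, the same splitting step can be sketched in Python as well. This is only a sketch under the same assumptions: the column start positions are the hypothetical ones used above, input_file and output_file are placeholder names, and the same caveat about separator characters appearing in the data applies.
# Sketch: split fixed-width lines at hand-picked column start positions
# (hypothetical positions; adjust them to the real file).
cols = [8, 15, 32, 40, 53, 66, 83, 105]   # 1-based start positions of columns 2..N
delim = ","

with open("input_file") as src, open("output_file", "w") as dst:
    for line in src:
        line = line.rstrip("\n")
        starts = [1] + cols                    # 1-based start of every column
        ends = cols + [len(line) + 1]          # exclusive end of every column
        fields = [line[s - 1:e - 1].rstrip() for s, e in zip(starts, ends)]
        dst.write(delim.join(fields) + "\n")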

How can I extract some data out of the middle of a noisy file using Perl 6?

I would like to do this using idiomatic Perl 6.
I found a wonderful contiguous chunk of data buried in a noisy output file.
I would like to simply print out the header line starting with Cluster Unique and all of the lines following it, up to, but not including, the first occurrence of an empty line. Here's what the file looks like:
</path/to/projects/projectname/ParameterSweep/1000.1.7.dir> was used as the working directory.
....
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
Input file: "../../filename.count.fa"
...
Here's what I want parsed out:
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
One-liner version
.say if /Cluster \s+ Unique/ ff^ /^\s*$/ for lines;
In English
Print every line from the input file, starting with the one containing the phrase Cluster Unique and ending just before the next empty line.
Same code with comments
.say # print the default variable $_
if # do the previous action (.say) "if" the following term is true
/Cluster \s+ Unique/ # Match $_ if it contains "Cluster Unique"
ff^ # Flip-flop operator: becomes true when the term before it matches
# and becomes false once the term after it matches (the ^ excludes that line)
/^\s*$/ # Match $_ if it is an empty (or whitespace-only) line
for # Create a loop placing each element of the following list into $_
lines # Create a list of all of the lines in the file
; # End of statement
Expanded version
for lines() {
.say if (
$_ ~~ /Cluster \s+ Unique/ ff^ $_ ~~ /^\s*$/
)
}
lines() is like <> in perl5. Each line from each file listed on the command line is read in one at a time. Since this is in a for loop, each line is placed in the default variable $_.
say is like print except that it also appends a newline. When written with a starting ., it acts directly on the default variable $_.
$_ is the default variable, which in this case contains one line from the file.
~~ is the match operator that is comparing $_ with a regular expression.
// Create a regular expression between the two forward slashes
\s+ matches one or more spaces
ff is the flip-flop operator. It is false as long as the expression to its left is false. It becomes true when the expression to its left is evaluated as true. It becomes false when the expression to its right becomes true and is never evaluated as true again. In this case, if we used ^ff^ instead of ff^, then the header would not be included in the output.
When ^ comes before (or after) ff, it modifies ff so that it is also false the iteration that the expression to its left (or right) becomes true.
/^\s*$/ matches an empty line (or one containing only whitespace)
^ matches the beginning of a string
\s* matches zero or more spaces
$ matches the end of a string
By the way, the flip-flop operator in Perl 5 is .. when it is in a scalar context (it's the range operator in list context). But its features are not quite as rich as in Perl 6, of course.
I would like to do this using idiomatic Perl 6.
In Perl, the idiomatic way to locate a chunk in a file is to read the file in paragraph mode, then stop reading the file when you find the chunk you are interested in. If you are reading a 10GB file, and the chunk is found at the top of the file, it's inefficient to continue reading the rest of the file--much less perform an if test on every line in the file.
In Perl 6, you can read a paragraph at a time like this:
my $fname  = 'data.txt';
my $infile = open(
    $fname,
    nl => "\n\n",    # Set what perl considers the end of a line.
);                   # Removed die() per Brad Gilbert's comment.
for $infile.lines() -> $para {
    if $para ~~ /^ 'Cluster Unique'/ {
        say $para.chomp;
        last;        # Quit reading the file.
    }
}
$infile.close;
# ^ Match start of string.
# 'Cluster Unique' By default, whitespace is insignificant in a perl6 regex. Quotes are one way to make whitespace significant.
However, in perl6 rakudo/moarVM the open() function does not read the nl argument correctly, so you currently can't set paragraph mode.
Also, there are certain idioms that are considered by some to be bad practice, like:
Postfix if statements, e.g. say 'hello' if $y == 0.
Relying on the implicit $_ variable in your code, e.g. .say
So, depending on what side of the fence you live on, that would be considered a bad practice in Perl.

parsing a text file where each record spans more than 1 line

I need to parse a text file that contains hundreds of records that span more than 1 line each. I'm new to Python and have been trying to do this with grep and awk in several complex ways but no luck yet.
The file contains records that look like this:
409547095517 911033 00:47:41 C44 00:47:46 D44 00:47:53 00:47:55
(555) 555-1212 00:47 10/31 100 Main Street - NW
Some_City TX 323 WRLS METRO PCS
P# 122-5217 ALT# 555-555-1212 LEC:MPCSI WIRELESS CALL Q
UERY CALLER FOR LOCATION QUERY CALLER FOR PHONE #*
Really, I can do all I need to if I could just get these multi-line records condensed to one line per record. Each record will always begin with "40", or I could let 9110 indicate the start, as these will always be there and are unique provided the 40 is at the beginning of a line. Using a hex editor I found that I could remove all line feeds (hex 0D0A), but this is no better than manually editing the files, and programmatically I'd need to keep the last one per record. Some records will be only 2 lines, but most will be 5, like this one.
Is there a way, in Python or otherwise, to concatenate the lines that make up a record into one line, where 40 (or maybe a better choice, 9110) indicates the start of the record?
Any ideas or pointers will be much appreciated. I've got Python and a good IDE, and I'm good with grep and find, but still learning awk (don't laugh)...
awk will do it. You need to identify the line that starts a record; in this case it begins with 409547095517.
So, to be safe, let's assume that any line starting with 8 digits is the start of a record.
awk 'NR > 1 && /^[0-9]{8}/ { printf("\n") }
     { printf("%s", $0) }
     END { printf("\n") }' filename > newfilename
Change the {8} to any number that works for you.
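Since you mentioned Python: here is a rough Python sketch of the same idea, assuming (as above) that a line starting with 8 digits begins a new record; filename and newfilename are the same placeholder names used in the awk command.
# Sketch: glue the lines of each multi-line record into a single line.
# A line starting with 8 digits is assumed to begin a new record.
import re

record_start = re.compile(r"^[0-9]{8}")

with open("filename") as src, open("newfilename", "w") as dst:
    record = []
    for line in src:
        line = line.rstrip("\n")
        if record_start.match(line) and record:
            dst.write("".join(record) + "\n")   # flush the previous record
            record = []
        record.append(line)
    if record:
        dst.write("".join(record) + "\n")       # flush the final record
Like the awk version, this joins the lines with nothing in between; use " ".join(record) if you would rather keep a space where each line break was.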

Print text between ( ) sed

This is an extension of my previous question. In that question, I needed to retrieve the text between parentheses where all the text was on a single line. Now I have this case:
(aop)
(abc
d)
This time, the open parenthesis can be on one line and the close parenthesis on another line, so:
(abc
d)
also counts as text between the delimiters '( )' and I need to print it as
abc
d
EDIT:
In response to possible confusions of my question, let me clarify a little. Basically, I need to print text between delimiters which could span multiple lines.
for example I have this text in my file:
randomtext(1234
567) randomtext
randomtext(abc)randomtext
Now I want Sed to pick out text between the delimiter "(" and ")". So the output would be:
1234
567
abc
Notice that the left and right brackets are not on the same line but they still count as a delimiter for 1234 567, so I need to print that part of the text. (note, I only want the text between the first pair of delimiters).
Any help would be appreciated.
Ah! another tricky sed puzzle :)
I believe this code will work for your problem:
sed -n '/(/,/)/{:a; $!N; /)/!{$!ba}; s/.*(\([^)]*\)).*/\1/p}' file
OUTPUT
For the provided input it produced:
1234
567
abc
Explanation:
-n suppresses the regular sed output
/(/,/)/ is for range selection between ( and )
:a is for marking a label a
$!N means: if this is not the last line ($!), append the next line of input to the current pattern space (N)
/)/! means do the following actions only if ) is not matched in the current pattern space
/)/!{$!ba} means go back to label a if ) is not matched in the current pattern space and this is not the last line
s/.*(\([^)]*\)).*/\1/ means replace the content between ( and ), together with everything around it, by just the content, thus stripping out the parentheses
\1 is a back-reference to group 1, i.e. the text between \( and \)
p is for printing the replaced content
This link has the answer. I am paraphrasing to match your need:
sed -n '1h;1!H;${;g;s/.*(\([^)]*\)).*/\1/;p}' < your_input
The answer given didn't work for my case. What worked for me was:
cat file | tr -d '\n'
This puts the whole file on a single line by deleting the line breaks. I then piped that into the answer here. (Note: instead of brackets, OPEN and CLOSE are used in that question.)
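If sed ever gets too fiddly, the same whole-file-as-one-string idea can also be sketched in Python; this is just a sketch, where file is a placeholder name and the pattern grabs the text between the first ( and the next ):
# Sketch: read the whole file as one string, then take the text between
# the first "(" and the following ")" (newlines inside are kept).
import re

with open("file") as fh:
    text = fh.read()

m = re.search(r"\(([^)]*)\)", text)
if m:
    print(m.group(1))
# re.findall(r"\(([^)]*)\)", text) would return every parenthesised chunk instead.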
