Parsing XML which is in a file but is one single line?

So the problem is I am trying to use AWK or Perl to count how many records are inside an XML file that is one very long line, sometimes megabytes long.
Most if not all examples I've seen assume a nicely structured XML like
<?xml version="1.0" encoding="UTF-8"?>
<spendownrequest xmlns="http://www.foo.com/Adv/HR/SSt">
<spenddowndata>
<employeeId>0002</employeeId>
<transactionId>103</transactionId>
<transactionType>T</transactionType>
</spenddowndata>
<spenddowndata>
<employeeId>0003</employeeId>
<transactionId>104</transactionId>
<transactionType>T</transactionType>
</spenddowndata>
<spenddowndata>
<employeeId>0004</employeeId>
<transactionId>105</transactionId>
<transactionType>T</transactionType>
</spenddowndata>
</spendownrequest>
with newlines at each row. These files instead look like this:
<?xml version="1.0" encoding="UTF-8"?><spendownrequest xmlns="http://www.foo.com/Adv/HR/SSt">
<spenddowndata><employeeId>0002</employeeId><transactionId>103</transactionId>
<transactionType>T</transactionType></spenddowndata><spenddowndata><employeeId>0003</employeeId>
<transactionId>104</transactionId><transactionType>T</transactionType></spenddowndata><spenddowndata>
<employeeId>0005</employeeId><transactionId>105</transactionId><transactionType>T</transactionType>
</spenddowndata></spendownrequest>
One long line with only (1) newline at the end.
I tried:
awk -F'[<|>]' '/spenddowndata/ {i++} { print i }' file.xml
and get back 1.
How would I get the count for all 3 that are in this file?

awk 'BEGIN {RS="<"; count = 0} $0 ~ /^spenddowndata>/ {count++} END {print count}' file.xml
Should work?

You can also store the program in a file, say pat.awk (FPAT requires GNU awk):
BEGIN {
    FPAT = "(<spenddowndata>)"
}
{
    print NF
}
To display the count, run:
awk -f pat.awk file.xml
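If GNU awk is not available, a portable sketch is to count matches with gsub(), which returns the number of substitutions it performed:
awk '{ n += gsub(/<spenddowndata>/, "&") } END { print n }' file.xml
Replacing each match with itself ("&") leaves the text unchanged; only the returned count matters.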

awk -F'</spenddowndata>' 'END{print (NF?NF-1:0)}' file
The ternary condition testing for NF is to avoid printing -1 for an empty file.
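As a quick sanity check of the NF-1 idea, run it on a throwaway string with three closing tags:
printf '<a>x</a><a>y</a><a>z</a>' | awk -F'</a>' 'END{print (NF?NF-1:0)}'
This prints 3: splitting on the closing tag yields one more field than there are tags, hence NF-1.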

With grep:
grep -o '</spenddowndata>' f | wc -l
With awk (in fact gawk, thank you @EdMorton):
gawk -v RS='</spenddowndata>' 'END{print NR-1}' f
With perl:
perl -n0E 's!</spenddowndata>!$i++!ge; say $i+0' f
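If an XML-aware tool happens to be available, counting nodes with XPath is more robust than text matching; a sketch with xmllint, using local-name() because the sample document declares a default namespace:
xmllint --xpath 'count(//*[local-name()="spenddowndata"])' file.xml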

Related

Output lines without multiple patterns

I used cat file | grep -v "pat1" | grep -v "pat2" | ... | grep -v "patN" to drop lines with any of a group of patterns. It looks awkward. Is there a better (more concise) way to do that?
In case you are OK with awk, you could try the following. It creates a variable named valIgnore that holds all the values to be ignored, comma-separated, so you can give N keywords in a single shot in one variable. You could also put the values in a shell variable (again comma-separated) and pass that to this awk program. Since no samples were given I didn't test it, but it should work.
awk -v valIgnore="pat1,pat2,pat3,pat4,pat5,pat6,pat7,pat8,pat9" '
BEGIN{
    num=split(valIgnore,arr,",")
    for(i=1;i<=num;i++){ ignoreVals[arr[i]] }
}
{
    for(key in ignoreVals){
        if(index($0,key)){ next }
    }
}
1' Input_file
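If the patterns are fixed strings rather than regexes, a single grep with one -e per pattern is a more concise alternative to the chain (a sketch with the same placeholder patterns):
grep -Fv -e pat1 -e pat2 -e patN file
-F treats each pattern as a literal string and -v drops any line that matches any of them.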

Grep match only before ":"

Hello, how can I grep to match only before the : mark?
If I run grep test1 file, it shows all three lines.
test1:x:29688:test1,test2
test2:x:22611:test1
test3:x:25163:test1,test3
But I would like to get an output test1:x:29688:test1,test2
I would appreciate any advice.
If the desired lines always start with test1 then you can do:
grep '^test1' file
If it's always followed by : but not the other (potential) matches then you can include it as part of the pattern:
grep 'test1:' file
As your data is in rows with columns delimited by a character, you may consider awk:
awk -F: '$1 == "test1"' file
I think that you just need to add ":" after "test1", see an example:
grep "test1:" file

Grep words with exactly two vowels

I have the following issue: I need to retrieve all words that contain exactly 2 vowels (in any order) from a file. The file contains only one word per line.
My current workaround is:
Grep1: Retrieve words such as earth, over, under, one...
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > A.txt
and
Grep2: Retrieve words such as formless, deep, said...
grep -i "^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > B.txt
The above solution works, but when I concatenate both regexes into a single regex it returns nothing!
Mother of Grep1 & Grep2: should retrieve everything!
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$|^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words
I think the issue is around my implementation of ^$ in the expression, but I have tried different versions with no success!
Any help will be highly appreciated!
OS is AIX 6100-09-04-1441
You were close. This should work:
grep -i "^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > A.txt
So it should find all eight possibilities (two vowels delimit three nonvowel sequences, each possibly empty; 2^3 is 8):
[ ]I[ ]o[ ]
[ ]e[ ]a[r]
[ ]e[r]a[ ]
[ ]e[l]a[n]
[T]e[ ]a[ ]
[D]e[ ]a[r]
[D]e[w]a[r]
[D]a[w]a[ ]
[H]a[w]a[y]
As for concatenation, | needs escaping in a basic regular expression, and you can anchor once around the whole alternation:
^\(regexp1\|regexp2\)$
Since the * can match 0 times or more you should be able to start the string with [^aeiou]*: try
"^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$"
As for fixing your regex, I think you need to escape the bar as \|, so
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$\|^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words
If you don't mind Perl, you could use this:
perl -lne '$m=$_; tr/aeiou//cd; print $m if length()==2;' /usr/share/dict/words
That says... "save the current line (word) in $m. Delete everything that is not a vowel. Print the original word if there are two characters (i.e. vowels) left."
Note that I am using the system dictionary as input for my tests.
You could do pretty much the same thing in awk.
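For instance, a sketch of the same idea in awk, where gsub() reports how many vowels it deleted from a scratch copy of the line:
awk '{ w = $0; if (gsub(/[aeiou]/, "", w) == 2) print }' /usr/share/dict/words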
If you're able to use an alternative to grep, tr with wc works well:
words=/path/to/words.txt
while read -r word; do
    v=$(echo "$word" | tr -cd 'aeiou' | wc -c)
    [[ $v -eq 2 ]] && echo "$word" >> output.txt
done < "$words"
This reads the original file line by line, counts the vowels, and appends only the words with exactly 2 to output.txt.

Recursively grep results and pipe back

I need to find some matching conditions from a file and recursively find the next conditions in the previously matched files. I have something like this:
input.txt
123
22
33
The above terms need to be found in a set of files. The challenge is that if 123 is found in, say, 10 files, then 22 should be searched in those 10 files only, and so on...
Example of files are like f1,f2,f3,f4.....f1200
So it is like I need to grep -w "123" f* | grep -w "22" | ...
It's not possible to list them manually, so is there any easier way?
You can solve this using an awk script; I've encountered a similar problem and this works fine:
awk 'NR==1 { printf("grep -w %d f*", $1); next } { printf(" | grep -w %d", $1) }' input.txt | sh
What it does?
it reads input.txt line by line
for the first record it prints grep -w <term> f*; for every later record it prints | grep -w <term> (note the pipe between commands)
the assembled pipeline is then sent to the shell for execution, so no trailing pipe is left at the end
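With the sample input.txt above, the generated command line handed to sh would be:
grep -w 123 f* | grep -w 22 | grep -w 33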
Perhaps taking a meta-programming viewpoint would help. Have grep output a series of grep commands. Or write a little Perl program. Maybe Ruby, if the mood suits.
You can use grep -lw to print the list of file names that matched (note that it stops reading each file after its first match).
You capture the list of file names and use that for the next iteration in a loop.
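A minimal sketch of that loop, assuming the terms live in input.txt and the haystack files are f* with no spaces in their names:
files=$(ls f*)
while read -r term; do
    # $files is deliberately unquoted so it expands to a list of file names
    files=$(grep -lw "$term" $files)
    [ -z "$files" ] && break    # no remaining file contains every term so far
done < input.txt
echo "$files"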

Matching pattern across multiple files: perl or grep?

I have a pattern.txt file which looks like this:
2gqt+FAD+A+601 2i0z+FAD+A+501
1n1e+NDE+A+400 2qzl+IXS+A+449
1llf+F23+A+800 1y0g+8PP+A+320
1ewf+PC1+A+577 2a94+AP0+A+336
2ydx+TXP+E+1339 3g8i+RO7+A+1
1gvh+HEM+A+1398 1v9y+HEM+A+1140
2i0z+FAD+A+501 3m2r+F43+A+1
1h6d+NDP+A+500 3rt4+LP5+C+501
1w07+FAD+A+1660 2pgn+FAD+A+612
2qd1+PP9+A+701 3gsi+FAD+A+902
There is another file called data (approx 8gb in size) which has lines like this.
2gqt+FAD+A+601 2i0z+FAD+A+501 0.874585 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.145278 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.784512 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.362542 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.251452 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.784521 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.369856 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.925478 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.584785 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.874526 0.125453
However, the data file is not as simple as it looks above. Its large size is due to the fact that there are approx 18000 lines beginning with each string that appears in the first column, i.e. 18000 lines beginning with 2gqt+FAD+A+601, followed by 18000 lines beginning with 1n1e+NDE+A+400, and so on. But there will be only one such line that matches a given pair from pattern.txt.
I am trying to match the lines in pattern.txt with data and want to print out:
2gqt+FAD+A+601 2i0z+FAD+A+501 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.125453
As of now I am using something in perl, like this:
use warnings;
open AS, "combi_output_2_fixed.txt";
open AQ, "NAMES.txt";
@arr=<AS>;
@arr1=<AQ>;
foreach $line (@arr)
{
    @split=split(' ',$line);
    foreach $line1 (@arr1)
    {
        @split1=split(' ',$line1);
        if($split[0] eq $split1[0] && $split[1] eq $split1[1])
        { print $split1[0],"\t",$split1[1],"\t",$split1[3],"\n"; }
    }
}
close AQ;
close AS;
Doing this uses up the entire memory and shows an "Out of memory" error message.
I am aware that this can be done using grep, but I do not know how to do it.
Can anyone please let me know how I can do this using grep -F and without using up the entire memory?
Thanks.
Does pattern.txt fit in memory?
If it does, you could use a command like grep -F -f pattern.txt data.txt to match lines in data.txt against the patterns. You would get the full line though, and extra processing would be required to get only the second column of numbers.
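A sketch of that combination (file names as in the question; assuming the key is always the first two whitespace-separated columns):
grep -F -f pattern.txt data.txt | awk '{ print $1, $2, $4 }'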
Or you could fix the Perl script. The reason you run out of memory is that you read the whole 8 GB file into memory, when you could be processing it line by line, like grep does. For the 8 GB file you should use code like this:
open FH, "<", "data.txt";
while ($line = <FH>) {
    # check $line against list of patterns ...
}
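Alternatively, a sketch of the same streaming idea entirely in awk: only pattern.txt is held in memory, and data.txt is read line by line (file names as in the question):
awk 'NR==FNR { want[$1" "$2]; next }
     ($1" "$2) in want { print $1, $2, $4 }' pattern.txt data.txt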
Try this:
grep "$(cat pattern.txt)" data.txt | awk '{ print $1 " " $2 " " $4 }'
