creating a file with a unique string per line in the command line - parsing

I am trying to create a file (using AWK, but I do not mind switching if another command is easier) that has a unique string on each line (183745 lines total). I am trying to make a file like this:
line1
line2
line3
....
line183745
With my poor knowledge of AWK, and having failed to find a similar example, I have unsuccessfully tried (with 10 lines for this example):
awk '{ i = 1; while (i < 10) { print "line$i \n"}; i++ }'
And this leads to no error or output. Thank you.

Why make it complicated?
seq -f "line%06g" 3
line000001
line000002
line000003
seq -f "line%06g" 183745 >newfile

You'll need to put this in a BEGIN block, as you're not processing any lines of input.
awk 'BEGIN { i = 1 ; while (i <= 10) { print "line"i ; i++ } }'

awk acts like a filter by default. In your case, it's simply blocking while waiting for input. Unblock it by explicitly giving it no input, for example:
awk '...' </dev/null
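Note that a main rule still runs once per input record, so on its own the redirection only stops awk from hanging; putting the loop in a BEGIN block (as in the previous answer) is what produces the output. A minimal sketch of my own combining both ideas, which also fixes the string concatenation (awk does not expand $i inside double-quoted strings):
awk 'BEGIN { for (i = 1; i <= 10; i++) print "line" i }' </dev/null
The redirection is then optional, since a program consisting only of a BEGIN block never reads input.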

If I were doing this, I would do it with seq or in vim,
but since others have already posted the seq and classic awk solutions, I'll add another awk solution for fun.
The very "useful" command yes can help us:
awk '$0="line"NR;NR==183745{exit}'
test with 1-10, for example:
kent$ yes|awk '$0="line"NR;NR==10{exit}'
line1
line2
line3
line4
line5
line6
line7
line8
line9
line10

extract the character adjacent to a selected letter

I have this text file:
# cat letter.txt
this
is
just
a
test
to
check
if
grep
works
The letter "e" appear in 3 words.
# grep e letter.txt
test
check
grep
Is there any way to return the letter printed to the left of the selected character?
expected.txt
t
h
r
With your shown samples, in awk, could you please try the following.
awk '/e/{print substr($0,index($0,"e")-1,1)}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/e/{ ##Looking if current line has e in it then do following.
print substr($0,index($0,"e")-1,1)
##Printing the substring that starts one character before the index of e and is 1 character long.
}
' Input_file ##Mentioning Input_file name here.
You can use positive lookahead to match a character that is followed by an e, without making the e part of the match.
cat letter.txt | grep -oP '.(?=e)'
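For example, applied directly to the sample file (-P requires a grep built with PCRE support, such as GNU grep):
grep -oP '.(?=e)' letter.txt
t
h
r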
With sed:
sed -nE 's/.*(.)e.*/\1/p' letter.txt
Assuming you have this input file:
cat file
this
is
just
a
test
to
check
if
grep
works
egg
element
You may use this grep + sed solution to find the letter (or an empty string) before e:
grep -oE '(^|.)e' file | sed 's/.$//'
t
h
r
l
m
Or alternatively this single awk command should also work:
awk -F 'e' 'NF > 1 {
for (i=1; i<NF; i++) print substr($i, length($i), 1)
}' file
This might work for you (GNU sed):
sed -nE '/(.)e/{s//\n\1\n/;s/^[^\n]*\n//;P;D}' file
Turn off implicit printing and enable extended regexp -nE.
Focus only on lines that meet the requirements i.e. contain a character before e.
Surround the required character by newlines.
Remove any characters before and including the first newline.
Print the first line (up to the second newline).
Delete the first line (including the newline).
Repeat.
N.B. The solution will print each such character on a separate line.
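As a quick illustration of those steps (my own example, assuming GNU sed): in the word check the match (.)e is he, which becomes c, h, ck separated by newlines; the leading c is removed, P prints the h, and D restarts on ck, which no longer matches:
echo 'check' | sed -nE '/(.)e/{s//\n\1\n/;s/^[^\n]*\n//;P;D}'
h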
To print all such characters from the same input line on a single line, use:
sed -nE '/(.e)/{s//\n\1/g;s/^/e/;s/e[^\n]*\n?//g;s/\B/ /g;p}' file
N.B. Remove the s/\B/ /g if space separation is not needed.
With GNU awk you can use an empty string as FS to split the input into individual characters:
awk -v FS= '/[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file
t
h
r
Excluding "e" at the beginning in the for loop.
edited
empty string if e is the first character in the word.
For example, this input:
cat file2
grep
erroneously
egg
Wednesday
effectively
awk -v FS= '/^[e]/ {print ""} /[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file2
r
n
W
n
f
v

Join multiple lines into One (.cap file) CentOS

A single entry spans multiple lines, and entries are separated by blank lines.
Each entry has to be made into a single line followed by a delimiter (;).
Sample Input:
Name:Sid
ID:123

Name:Jai
ID:234

Name:Arun
ID:12
Tried replacing the blank lines with cat test.cap | tr -s [:space:] ';'
Output:
Name:Sid;ID:123;Name:Jai;ID:234;Name:Arun;ID:12;
Expected Output:
Name:SidID:123;Name:JaiID:234;Name:ArunID:12;
The same is the case with xargs.
I've used a sed command as well, but it only joined two lines into one, whereas I have 132 lines per entry and 1000 such entries in one file.
You may use
cat file | awk 'BEGIN { FS = "\n"; RS = "\n\n"; ORS=";" } { gsub(/\n/, "", $0); print }' | sed 's/;;*$//' > output.file
Output:
Name:SidID:123;Name:JaiID:234;Name:ArunID:12
Notes:
FS = "\n" will set field separators to a newline`
RS = "\n\n" will set your record separators to double newline
gsub(/\n/, "", $0) will remove all newlines from a found record
sed 's/;;*$//' will remove the trailing ; added by awk
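A variant of the same idea (my own sketch) uses awk's paragraph mode: setting RS to the empty string makes any run of blank lines a record separator, so no empty records are produced and the sed cleanup is not needed; printing ; after each record reproduces the expected output, including the final semicolon:
awk -v RS= '{ gsub(/\n/, ""); printf "%s;", $0 } END { print "" }' file
Name:SidID:123;Name:JaiID:234;Name:ArunID:12;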
Could you please try the following.
awk 'NF{val=(val?$0~/^ID/?val $0";":val $0:$0)} END{print val}' Input_file
Output will be as follows.
Name:SidID:123;Name:JaiID:234;Name:ArunID:12;
Explanation: Adding explanation of above code too now.
awk ' ##Starting awk program here.
NF{ ##Checking condition if a LINE is NOT NULL and having some value in it.
val=(val?$0~/^ID/?val $0";":val $0:$0) ##Appending the current line to variable val; if the line starts with the string ID, also append a semicolon at the end.
}
END{ ##Starting END section of awk here.
print val ##Printing value of variable val here.
}
' Input_file ##Mentioning Input_file name here.
This might work for you (GNU sed):
sed -r '/./{N;s/\n//;H};$!d;x;s/.//;s/\n|$/;/g' file
If it is not a blank line, append the following line, remove the newline between them, and append the result to the hold space. If it is not the end of the file, delete the current line. At the end of the file, swap in the hold space, remove the first character (which will be a newline), then replace every newline, and the end of the line (so the last record also gets one), with a semi-colon.
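For example, with the sample input saved as file (GNU sed assumed, since -r is a GNU option):
sed -r '/./{N;s/\n//;H};$!d;x;s/.//;s/\n|$/;/g' file
Name:SidID:123;Name:JaiID:234;Name:ArunID:12;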

Grep words with exactly two vowels

I have the following issue: I need to retrieve all words that contain exactly 2 vowels (in any order) from a file. The file contains only one word per line.
My current workaround is:
Grep1: Retrieve words such as earth, over, under, one...
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > A.txt
and
Grep2: Retrieve words such as formless, deep, said...
grep -i "^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > B.txt
The above solutions work, but when I concatenate both regexes into a single regex it returns nothing!
Mother of Grep1 & Grep2: should retrieve everything!
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$|^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words
I think the issue is around my use of ^ and $ in the expression, but I have tried different versions with no success!
Any help will be highly appreciated!
OS is AIX 6100-09-04-1441
You were close. This should work:
grep -i "^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > A.txt
So it should find all eight possibilities (two vowels delimit three nonvowel sequences, each possibly empty; 2^3 is 8):
[ ]I[ ]o[ ]
[ ]e[ ]a[r]
[ ]e[r]a[ ]
[ ]e[l]a[n]
[T]e[ ]a[ ]
[D]e[ ]a[r]
[D]e[w]a[r]
[D]a[w]a[ ]
[H]a[w]a[y]
As for the concatenation, | needs escaping. You can then use a single set of anchors:
^(regexp1\|regexp2)$
Since * matches 0 or more times, you should be able to start the pattern with [^aeiou]*; try
"^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$"
As for fixing your regex, I think you need to escape the bar as \|, so
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$\|^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words
If you don't mind Perl, you could use this:
perl -lne '$m=$_; tr/[aeiou]//cd; print $m if length()==2;' /usr/share/dict/words
That says... "save the current line (word) in $m. Delete everything that is not a vowel. Print the original word if there are two things (i.e. vowels) left."
Note that I am using the system dictionary as input for my tests.
You could do pretty much the same thing in awk.
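For instance, a minimal awk sketch of the same idea (my own illustration, not from the original answer): copy the line, strip everything that is not a vowel from the copy, and print the original line if exactly two characters remain:
awk '{ w = $0; gsub(/[^aeiou]/, "", w); if (length(w) == 2) print }' /usr/share/dict/words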
If you're able to use an alternative to grep, tr with wc works well:
words=/path/to/words.txt
while read -e word ; do
v=$(echo $word | tr -cd 'aeiou' | wc -c)
[[ ! $v -eq "2" ]] || echo $word >> output.txt
done < $words
This reads the original file line by line, counts the vowels, and appends words with exactly 2 vowels to output.txt.

Parsing XML which is in a file but is 1 single line?

So the problem is I am trying to use AWK or Perl to find how many records are inside one XML file that is one long line, sometimes megabytes in size.
Most if not all examples I've seen assume a nicely structured XML like
<?xml version="1.0" encoding="UTF-8"?>
<spendownrequest xmlns="http://www.foo.com/Adv/HR/SSt">
<spenddowndata>
<employeeId>0002</employeeId>
<transactionId>103</transactionId>
<transactionType>T</transactionType>
</spenddowndata>
<spenddowndata>
<employeeId>0003</employeeId>
<transactionId>104</transactionId>
<transactionType>T</transactionType>
</spenddowndata>
<spenddowndata>
<employeeId>0004</employeeId>
<transactionId>105</transactionId>
<transactionType>T</transactionType>
</spenddowndata>
</spendownrequest>
with newlines at the end of each row. These files are like this:
<?xml version="1.0" encoding="UTF-8"?><spendownrequest xmlns="http://www.foo.com/Adv/HR/SSt">
<spenddowndata><employeeId>0002</employeeId><transactionId>103</transactionId>
<transactionType>T</transactionType></spenddowndata><spenddowndata><employeeId>0003</employeeId>
<transactionId>104</transactionId><transactionType>T</transactionType></spenddowndata><spenddowndata>
<employeeId>0005</employeeId><transactionId>105</transactionId><transactionType>T</transactionType>
</spenddowndata></spendownrequest>
One long line with only (1) newline at the end.
I tried:
awk -F'[<|>]' '/spenddowndata/ {i++} { print i }' file.xml
and get back 1.
How would I get the count for all 3 that are in this file?
awk 'BEGIN {RS="<"; count = 0;} { if ($0 ~ /^spenddowndata*/) {count++}} END {print(count);}'
Should work?
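For example, assuming the sample above is saved as file.xml (only records that start with the opening tag name are counted, so the closing tags and the spendownrequest element are ignored):
awk 'BEGIN {RS="<"; count = 0;} { if ($0 ~ /^spenddowndata*/) {count++}} END {print(count);}' file.xml
3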
You can also store the pattern in a file, say pat.awk:
BEGIN{
FPAT = "(<spenddowndata>)"
}
{
print NF
}
To display count, run :
awk -f pat.awk file.xml
awk -F'</spenddowndata>' 'END{print (NF?NF-1:0)}' file
The ternary condition testing for NF is to avoid printing -1 for an empty file.
With grep:
grep -o '</spenddowndata>' f | wc -l
With awk (in fact gawk, thank you @EdMorton):
gawk -v RS='</spenddowndata>' 'END{print NR-1}' f
With perl:
perl -n0E 's!</spenddowndata>!$i++!ge; say $i+0'

Matching pattern across multiple files: perl or grep?

I have a pattern.txt file which looks like this:
2gqt+FAD+A+601 2i0z+FAD+A+501
1n1e+NDE+A+400 2qzl+IXS+A+449
1llf+F23+A+800 1y0g+8PP+A+320
1ewf+PC1+A+577 2a94+AP0+A+336
2ydx+TXP+E+1339 3g8i+RO7+A+1
1gvh+HEM+A+1398 1v9y+HEM+A+1140
2i0z+FAD+A+501 3m2r+F43+A+1
1h6d+NDP+A+500 3rt4+LP5+C+501
1w07+FAD+A+1660 2pgn+FAD+A+612
2qd1+PP9+A+701 3gsi+FAD+A+902
There is another file called data (approx 8gb in size) which has lines like this.
2gqt+FAD+A+601 2i0z+FAD+A+501 0.874585 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.145278 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.784512 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.362542 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.251452 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.784521 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.369856 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.925478 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.584785 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.874526 0.125453
However, the data file is not as simple as it looks above. Its large size is due to the fact that there are approx. 18000 lines beginning with each string that appears in the first column, i.e. 18000 lines beginning with 2gqt+FAD+A+601, followed by 18000 lines beginning with 1n1e+NDE+A+400, and so on. But only one such line will match the pair given in pattern.txt.
I am trying to match the lines in pattern.txt with data and want to print out:
2gqt+FAD+A+601 2i0z+FAD+A+501 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.125453
As of now I am using something in perl, like this:
use warnings;
open AS, "combi_output_2_fixed.txt";
open AQ, "NAMES.txt";
@arr=<AS>;
@arr1=<AQ>;
foreach $line (@arr)
{
@split=split(' ',$line);
foreach $line1 (@arr1)
{
@split1=split(' ',$line1);
if($split[0] eq $split1[0] && $split[1] eq $split1[1])
{ print $split1[0],"\t",$split1[1],"\t",$split1[3],"\n";}
}
}
close AQ;
close AS;
Doing this uses up the entire memory and shows an Out of memory error message.
I am aware that this can be done using grep, but do not know how to do it.
Can anyone please let me know how I can do this using grep -F and without using up the entire memory?
Thanks.
Does pattern.txt fit in memory?
If it does, you could use a command like grep -F -f pattern.txt data.txt to match lines in data.txt against the patterns. You would get the full line though, and extra processing would be required to get only the second column of numbers.
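For example, a sketch of that extra processing (my own illustration; it keeps the two ids and the fourth field, i.e. the second number):
grep -F -f pattern.txt data.txt | awk '{ print $1, $2, $4 }'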
Or you could fix the Perl script. The reason you run out of memory is that you read the 8 GB file entirely into memory, when you could be processing it line by line like grep does. For the 8 GB file you should use code like this:
open FH, "<", "data.txt";
while ($line = <FH>) {
# check $line against list of patterns ...
}
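If awk is an option, the same line-by-line idea fits in one command (a sketch of my own, not from the original answers): the first file builds an array keyed on the first two fields, so only pattern.txt is held in memory while the large file is streamed:
awk 'NR==FNR { want[$1" "$2] = 1; next } ($1" "$2) in want { print $1, $2, $4 }' pattern.txt data.txt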
Try this:
grep "`more pattern.txt`" data.txt | awk -F' ' '{ print $1 " " $2 " " $4}'
