Extracting lines from a fixed-format file without spaces, based on a column and a list of query IDs - grep

I have a fairly large fixed-format file without spaces (file1):
file1:
0808563800555550000367120000500000
0005555566369330000078020000500000
01066666780000000008933600009000005635
0904251263088000000786590056500000
0000469011009904440425120444444440
I want to extract the fields at character positions 4-8, 11-15 and 20-24, but only from lines where positions 4-8 match one of the IDs in file2
file2:
55555
42512
The desired output is:
55555 36933 07802
42512 08800 78659
I have tried the following combination of cut | grep commands:
cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -w -F -f file2
It works fine and the speed is very good, but the problem is that I also get lines where the lookup ID (positions 4-8) is not in the first column of the cut output; that is because grep checks all three columns after cut, not only the first one.
Here is the output of the command above:
85638 55555 36712
55555 36933 07802
66666 00000 89336
42512 08800 78659
04690 00990 42512
I know one could write the output to a file and then post-process it with, for example, awk, but I thought there might be a simpler approach that avoids the extra processing time (for example, making grep match only in a specific column of the cut output).
Any help would be much appreciated, many thanks!

With GNU awk for FIELDWIDTHS:
$ awk -v FIELDWIDTHS='3 5 2 5 4 5 *' 'NR==FNR{a[$0]; next} $2 in a{ print $2, $4, $6 }' file2 file1
55555 36933 07802
42512 08800 78659
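If GNU awk is not available, the same lookup can be written with plain substr() calls; a sketch using the column positions from the question:
$ awk 'NR==FNR{a[$0]; next} substr($0,4,5) in a{print substr($0,4,5), substr($0,11,5), substr($0,20,5)}' file2 file1
55555 36933 07802
42512 08800 78659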

Would you please try the following:
cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -wf <(sed 's/^/^/' file2)
Each line in file2 is prefixed with a caret ^ to anchor the pattern to the start of each line of cut's output.
It may be a bit slower than before due to the lack of the -F option.
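For illustration, the process substitution feeds grep one anchored pattern per ID:
$ sed 's/^/^/' file2
^55555
^42512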

Related

Find matching words

I have a corpus file and a rules file. I am trying to find matching words, where a word from the rules appears in the corpus.
# cat corpus.txt
this is a paragraph number one
second line
third line
# cat rule.txt
a
b
c
This returns 2 lines:
# grep -F -f rule.txt corpus.txt
this is a paragraph number one
second line
But I am expecting 4 words like this...
a
paragraph
number
second
Trying to achieve these results using grep or awk.
Assuming words are separated by whitespace:
awk '{print "\\S*" $1 "\\S*"}' rule.txt | grep -m 4 -o -f - corpus.txt
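For reference, the awk stage just turns each rule word into a search pattern before grep consumes them; \S is a GNU grep extension matching a non-whitespace character:
# awk '{print "\\S*" $1 "\\S*"}' rule.txt
\S*a\S*
\S*b\S*
\S*c\S*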

extract the adjacent character of selected letter

I have this text file:
# cat letter.txt
this
is
just
a
test
to
check
if
grep
works
The letter "e" appear in 3 words.
# grep e letter.txt
test
check
grep
Is there any way to return the letter printed to the left of the selected character?
expected.txt
t
h
r
With your shown samples, could you please try the following in awk.
awk '/e/{print substr($0,index($0,"e")-1,1)}' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Start the awk program here.
/e/{ ##If the current line contains an e, then do the following.
print substr($0,index($0,"e")-1,1)
##Print a 1-character substring starting just before the index of the first e.
}
' Input_file ##Mention the Input_file name here.
You can use positive lookahead to match a character that is followed by an e, without making the e part of the match.
grep -oP '.(?=e)' letter.txt
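With the sample letter.txt this prints the expected three letters (-P requires a grep built with PCRE support):
t
h
r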
With sed:
sed -nE 's/.*(.)e.*/\1/p' letter.txt
Assuming you have this input file:
cat file
this
is
just
a
test
to
check
if
grep
works
egg
element
You may use this grep + sed solution to find the letter, or an empty string, before each e:
grep -oE '(^|.)e' file | sed 's/.$//'
t
h
r
l
m
Or alternatively this single awk command should also work:
awk -F 'e' 'NF > 1 {
for (i=1; i<NF; i++) print substr($i, length($i), 1)
}' file
This might work for you (GNU sed):
sed -nE '/(.)e/{s//\n\1\n/;s/^[^\n]*\n//;P;D}' file
Turn off implicit printing and enable extended regexps with -nE.
Focus only on lines that meet the requirement, i.e. contain a character before an e.
Surround the required character by newlines.
Remove any characters before and including the first newline.
Print the first line (up to the second newline).
Delete the first line (including the newline).
Repeat.
N.B. The solution will print each such character on a separate line.
To print all such characters on the same line, use:
sed -nE '/(.e)/{s//\n\1/g;s/^/e/;s/e[^\n]*\n?//g;s/\B/ /g;p}' file
N.B. Remove the s/\B/ /g if space separation is not needed.
With GNU awk you can use an empty string as FS to split the input into individual characters:
awk -v FS= '/[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file
t
h
r
Excluding "e" at the beginning in the for loop.
edited
empty string if e is the first character in the word.
For example, this input:
cat file2
grep
erroneously
egg
Wednesday
effectively
awk -v FS= '/^[e]/ {print ""} /[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file2
r
n
W
n
f
v

How to grep multiple lines using a .txt vocab, matching only first word as variable?

I'm trying to reduce a .sm file (file1, around 10 GB) by filtering it with a fairly long list of words (around 180,108 items) in a text file, file2.
File1 is structured as follows:
word <http://internet.address.com> 1
i.e. one word followed by a blank space, an internet address, and a number.
File2 is a simple .txt file, a list of words, one on each line.
My aim is to create a third file, file3, containing only those lines of file1 whose first word matches a word in the list in file2, disregarding the rest.
My attempt is the following:
grep -w -F -f file2.txt file1.sm > file3.sm
I've also attempted something along this line:
gawk 'FNR==NR {a[$1]; next } !($2 in a)' file2.txt file1.sm > file3.sm
but with no success. I understand ^ and \b might play a part here, but I don't know how to fit them into the syntax. I've looked around extensively, but no solution seems to fit.
My problem is that grep reads each entire line of file1, so the matching word may lie inside the web address, which is not where I want to find it.
sed 's/^/^/' file2.txt | grep -f - file1.sm
join is the best tool for this, not grep/awk:
join -t' ' <(sort file1.sm) <(sort file2.txt) >file3.sm
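If the original order of file1 matters (join needs sorted input), a plain awk hash lookup is another option; a sketch that also fixes the field index from the gawk attempt above:
awk 'NR==FNR {a[$1]; next} $1 in a' file2.txt file1.sm > file3.sm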

reading semi-formatted data

I'm totally new to AWK; however, I think this is the best way to solve my problem, and a good time to learn AWK.
I am trying to read a large data file that is created by a simulation program. The output is made to be readable by humans, so its formatting isn't very consistent. An example of the output is in this image
http://i.imgur.com/0kf8l.png
I need a way to find a line like "He 2 4686A -2.088 0.0071", by specifying the "He 2 4686A" part and get the following two numbers. The problem is the line "He 2 4686A -2.088 0.0071" can appear anywhere in the table.
I know how to find the entry "He 2 4686A", but I don't know which of the 4 columns it's in. So I don't know how to address the values that follow it.
A command that lets me just read the next two words, or that tells me the location of the pattern once a match is found, would help.
/He 2 4686A/ finds the line
Ca A 3970A -0.900 0.1100 He 2 4686A -2.088 0.0071 S 3 18.67m -0.371 0.3721 Ar 4 444.7A -2.124 0.0066
Any help is appreciated.
The first step should be to bring what seems to be 4 columns of records into a 1-column format...then it's easy with awk, because you can then filter on the first 5 fields - like:
echo "He 2 4686A -2.088 0.0071" | \
awk '$1 == "He" && $2 == 2 && $3 == "4686A" {print $4, $5}'
which gives
-2.088 0.0071
So, for me, the only challenge is to transform your data into one-column format...and from the picture that looks simple, because the columns seem to have a fixed length that you can count.
Assuming that your column width is 30 characters (difficult to tell from a picture; beware of tabs) and your data is in input_file, you could first "cut" the data into 4 columns and then pipe the output to another awk process:
awk '{
print substr($0,1,30)
print substr($0,31,30)
print substr($0,61,30)
print substr($0,91,30)
}' input_file | \
awk '$1 == "He" && $2 == 2 && $3 == "4686A" {print $4, $5}'
If you really just need the next two numbers after an anchor, then I would say the grep solution from Costa below is best for you; however, this approach gives you the possibility to implement further logic...
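For what it's worth, the two awk processes could also be merged into one; a sketch, again assuming 30-character columns:
awk '{
  for (c = 0; c < 4; c++) {                  # walk the 4 fixed-width columns
    split(substr($0, c*30 + 1, 30), f)       # split one column into fields
    if (f[1] == "He" && f[2] == 2 && f[3] == "4686A")
      print f[4], f[5]
  }
}' input_file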
If you're not dead set on using awk, grep would be the easiest way...
egrep -o "He 2 4686A \-?[0-9.]+ \-?[0-9.]+" output.txt
EDIT: The above would only work if the spacing was done with single spaces, which doesn't seem to be the case here. To handle tabs and/or repeated whitespace:
egrep -o "He[ \t]+2[ \t]+4686A[ \t]+\-?[0-9.]+[ \t]+\-?[0-9.]+" output.txt

Recursively grep results and pipe back

I need to find some matching terms from a file and then recursively search for the next terms only in the previously matched files. I have something like this:
input.txt
123
22
33
The terms above need to be found in the following files; the challenge is that if 123 is found in, say, 10 files, then 22 should be searched in those 10 files only, and so on...
The files are named like f1,f2,f3,f4.....f1200
so it is like I need grep -w "123" f* | grep -w "22" | .....
It's not possible to list them manually, so is there an easier way?
You can solve this using an awk script; I've encountered a similar problem and this approach works fine:
awk 'NR==1 { printf "grep -lw %s f*", $1; next }
     { printf " | xargs grep -lw %s", $1 }' input.txt | sh
What does it do?
it reads input.txt line by line
for the first term it prints grep -lw 123 f*, which lists the files containing that term
for every following term it appends | xargs grep -lw term, so that grep searches only the files that matched all previous terms
the assembled command line is then piped to sh for execution, and the output is the list of files that contain every term
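For the three sample terms, the generated command line (what sh actually executes) is:
grep -lw 123 f* | xargs grep -lw 22 | xargs grep -lw 33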
Perhaps taking a meta-programming viewpoint would help. Have grep output a series of grep commands. Or write a little Perl program. Maybe Ruby, if the mood suits.
You can use grep -lw to write out the list of file names that matched (note that grep -l stops reading each file after the first match).
You capture the list of file names and use it for the next iteration in a loop, as sketched below.
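A minimal shell sketch of that loop, assuming the data files match f* and the terms live in input.txt (names taken from the question):
files=$(ls f*)                         # start with every candidate file
while read -r term; do
    files=$(grep -lw "$term" $files)   # keep only the files containing this term
    [ -z "$files" ] && break           # stop early if no file matches
done < input.txt
echo "$files"                          # files that contain all the terms
The unquoted $files relies on the simple f1...f1200 names; each pass narrows the candidate set, mirroring the xargs chain above.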
