joining 2 files on matching column values using awk - join

I know there have been similar questions posted, but I'm still having a bit of trouble getting the output I want using awk's FNR == NR idiom.
I have two files:
File 1:
123|this|is|good
456|this|is|better
...
File 2:
aaa|123
bbb|456
...
So I want to join file 2, column 2 against file 1, column 1, and output columns 2, 3, and 4 of file 1 followed by column 1 of file 2.
Thanks in advance.

With awk you could do something like
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } $1 in val { $(NF + 1) = val[$1]; print }' file2 file1
NF is the number of fields in a record (line by default), so $NF is the last field, and $(NF + 1) is the field after that. By assigning the saved value from the pass over file2 to it, a new field is appended to the record before it is printed.
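With the sample files above, that prints:
123|this|is|good|aaa
456|this|is|better|bbb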
One thing to note: this behaves like an inner join, i.e., only records whose key appears in both files are printed. To make this a right join (keeping every record of file1, the second file on the command line), you can use
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } { $(NF + 1) = val[$1]; print }' file2 file1
That is, you can drop the $1 in val condition on the append-and-print action. If $1 is not in val, val[$1] is empty, and an empty field will be appended to the record before printing.
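For instance, if file1 contained a hypothetical extra line 789|this|is|best with no matching key in file2, the second command would print it as 789|this|is|best| with a trailing empty field.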
But it's probably better to use join (note that join expects both inputs to be sorted on their join fields):
join -1 1 -2 2 -t \| file1 file2
If you don't want the key field to be part of the output, pipe the output of either of those commands through cut -d \| -f 2- to get rid of it, i.e.
join -1 1 -2 2 -t \| file1 file2 | cut -d \| -f 2-
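With the sample files (which happen to be sorted on their join fields already), this prints:
this|is|good|aaa
this|is|better|bbb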

If the files have the same number of lines in the same order, then
paste -d '|' file1 file2 | cut -d '|' -f 2-5
this|is|good|aaa
this|is|better|bbb
I see in a comment to Wintermute's answer that the files aren't sorted. With bash, process substitutions are handy to sort on the fly:
paste -d '|' <(sort -t '|' -k 1,1 file1) <(sort -t '|' -k 2,2 file2) |
cut -d '|' -f 2-5
To reiterate: this solution requires a one-to-one correspondence between the files.

Related

Nullify fields in pipe delimited file

I am not able to get the desired output when a data field has a pipe in it.
The input (sample file tst) is:
hdr1|"hdr2|tst"|"hdr3|tst|tst"|hdr4|"hdr5|tst|tst"
lbl1|"lbl2|tst"|"lbl3|tst|tst"|lbl4|"lbl5|tst|tst"
I tried this command but don't get the expected output: cut -f2,3 -d"|" tst
The expected output is:
"hdr2|tst"|"hdr3|tst|tst"
"lbl2|tst"|"lbl3|tst|tst"
Is there an easy way to get this output? I don't want to go with sed because the tool I'm embedding this command in doesn't allow the backslash character.
Also, I'm using an old version of gawk, so this command doesn't give the desired output:
gawk -v FPAT='[^|]*|("[^"]*")+' '{print $2, $3}' OFS="|"
Output of gawk --version
GNU Awk 3.1.7
Output of cat -vet tst
hdr1|"hdr2|tst"|"hdr3|tst|tst"|hdr4|"hdr5|tst|tst"$
lbl1|"lbl2|tst"|"lbl3|tst|tst"|lbl4|"lbl5|tst|tst"$
Upgrading your gawk version is by far the best approach, as you're missing a few bug fixes and a ton of extremely useful functionality introduced since gawk 3.1.7 came out 10+ years ago (we're currently on gawk version 5.1!). If you can't do that for some reason, here's what you can do without FPAT, using any awk in any shell on every UNIX box:
$ cat tst.awk
BEGIN { OFS="|" }
{
    orig = $0
    $0 = i = ""
    while ( (orig != "") && match(orig,/[^|]*|("[^"]*")+/) ) {
        $(++i) = substr(orig,RSTART,RLENGTH)
        orig = substr(orig,RSTART+RLENGTH+1)
    }
    print $2, $3
}
$ awk -f tst.awk file
"hdr2|tst"|"hdr3|tst|tst"
"lbl2|tst"|"lbl3|tst|tst"
Just to verify that it's identifying all of the fields correctly:
$ cat tst.awk
BEGIN { OFS="|" }
{
    orig = $0
    $0 = i = ""
    while ( (orig != "") && match(orig,/[^|]*|("[^"]*")+/) ) {
        $(++i) = substr(orig,RSTART,RLENGTH)
        orig = substr(orig,RSTART+RLENGTH+1)
    }
    print NF " <" $0 ">"
    for (i=1; i<=NF; i++) {
        print "\t" i " <" $i ">"
    }
}
$ awk -f tst.awk file
5 <hdr1|"hdr2|tst"|"hdr3|tst|tst"|hdr4|"hdr5|tst|tst">
        1 <hdr1>
        2 <"hdr2|tst">
        3 <"hdr3|tst|tst">
        4 <hdr4>
        5 <"hdr5|tst|tst">
5 <lbl1|"lbl2|tst"|"lbl3|tst|tst"|lbl4|"lbl5|tst|tst">
        1 <lbl1>
        2 <"lbl2|tst">
        3 <"lbl3|tst|tst">
        4 <lbl4>
        5 <"lbl5|tst|tst">
If you don't have embedded double quotes, you can substitute the pipes inside quoted values with another, unused character (I used ~) and switch back to the original values after extraction. Obviously this requires that the replacement character not appear in the text.
$ awk 'BEGIN{OFS=FS="\""} {for(i=2;i<NF;i+=2) gsub("\\|","~",$i)}1' file |
awk 'BEGIN{OFS=FS="|"} {print $2,$3}' |
sed 's/~/|/g'
"hdr2|tst"|"hdr3|tst|tst"
"lbl2|tst"|"lbl3|tst|tst"
Not sure it's simpler than the single awk script, though.
The main problem here is the document format design; it requires another patch if there are embedded double quotes, escaped pipes, etc.

Grep: Count the number of times a string occurs if another string does not occur

I have a set of many .json.gz files. In each file, there are entries such as this:
{"type":"e1","public":true, "login":"username1", "org":{"dict","of":"lots_of_things"}}
{"type":"e2","public":true, "login":"username2"}
No matter where in each nested dict "login" appears, I want to be able to detect it and take the username, only if the key "org" does not exist anywhere in the nested dict. I also want to count the number of times each username appears in the files.
My final output should be a file of dicts that looks like this:
{'username2': 1}
because of course username1 wouldn't be counted: the key "org" appears in its dict.
I'm looking for something like:
zgrep -Rv "org" . | zgrep -o 'login":"[^"]*"' /path/to/files/* | cut -d'"' -f3 | sort | uniq -c | sed '1i{
s/\s*\([0-9]*\)\s*\(.*\)/"\2": \1,/;$a}' > outputfile.txt
I'm not sure about this part:
zgrep -Rv "org" . |
The rest successfully creates the type of file I'm looking for. I'm just unsure about the order of operations here.
EDIT
I should have been more clear, I apologize. There are also often multiple instances of the key "login" per main dict object. For example (using "k" for any key that is not login and not org, and using "v" for a value):
{"k":"v","k":{"k":{"k":"v","login":"username1"},"k":"v"},"k":{"k":"v","login":"username2"}}
{"k":{"k":"v","k":"v"},"k":{"org":{"k":"v","k":v,"login":"username3"},"k":"v"},"k":{"k":"v","login":"username4"}}
{"k":{"k":"v"},"k":{"k":{"k":"v","login":"username1"},"login":"username2"}}
Since the key org appears in the second dict, I want to exclude usernames 3 and 4 from the dict I make and save to a file.
For example, I want this in a file:
{'username1': 2}
{'username2': 2}
An awk solution, replacing the recursive zgrep -R with a more reliable find:
find . -type f -name "*.json.gz" -print0 |
    xargs -0 zgrep -v -h '"org"' |
    awk '{ if ( match($0,/"login":"[^"]+"/) ) logins[substr($0,RSTART+8,RLENGTH-8)]++; }
         END { for ( i in logins ) print("{" i ":" logins[i] "}"); }'
Example output:
{"username2":1}
Not grep, but a GNU sed job with a small script (your data in file 'a'):
i=
for e in $(sed -nE '/.*\borg\b.*/!s/.*"login":"(\w+)".*/{\1:}/p' a)
{
    let i++; echo ${e/:/:$i}
}
Add '>' at the end to save the result in a file.
If a better regex engine, i.e., pcregrep, is installed, it does the job as well:
pcregrep -io '(?!.*\borg\b.*)(?<="login":")\w+(?=".*)' a
It can replace the sed script above, with slightly adjusted output.
This worked:
zgrep -v "org" *.json.gz | zgrep -o 'login":"[^"]*"' | cut -d'"' -f3 | sort | uniq -c | sed '1i{
s/\s*\([0-9]*\)\s*\(.*\)/"\2": \1,/;$a}' > usernames_2011.txt

Count the number of occurrences of a string in a large file

I have a large 900MB XML file and the entire file is just one line. There are no line breaks between tags. I need to count the occurrences of a particular tag in that file.
I tried
grep -o '<start tag>' filename | wc -l
I get a grep: line too long error.
How can I get around this?
Here's a bit of a hack:
perl -ne 'BEGIN { $/ = ">"; $c = 0 } $c++ if /<start tag>/; END { print "$c\n" }' filename
The idea is to loop over "lines" that are terminated by > instead of \n (newline). That should avoid "line too long" errors.
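A quick sanity check on hypothetical sample data (big.xml is just a made-up name here):
$ printf '<a><start tag>b<start tag>c</a>' > big.xml
$ perl -ne 'BEGIN { $/ = ">"; $c = 0 } $c++ if /<start tag>/; END { print "$c\n" }' big.xml
2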
Just use awk:
awk -F'<start tag>' '{print NF-1}' file
If that fails, you can do this with GNU awk (for multi-char RS):
awk -v RS='<start tag>' 'END{print NR-1}' file
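If the input could ever span multiple lines, a summing variant of the first awk command (a minor sketch of the same idea) prints a single total instead of one count per line:
awk -F'<start tag>' '{ n += NF - 1 } END { print n }' file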

merge 2 lists based on first 2 columns

I need to merge 2 lists based on columns 1 and 2
file1:
client1,server1,3000.00
client1,server2,2500.00
client1,server3,1500.00
client2,server1,4500.00
client2,server2,2300.00
client2,server3,1230.00
client3,server1,3400.00
client3,server2,4500.00
client3,server3,1245.00
client4,server1,3400.00
client5,server2,4500.00
client6,server3,1245.00
client7,server1,3400.00
client7,server2,4500.00
client8,server3,1245.00
client8,server1,3400.00
client8,server2,4500.00
client9,server3,1245.00
file2:
client1,server1,windows,250g
client1,server2,linux,450g
client1,server3,linux,400g
client2,server1,windows,250g
client2,server2,linux,450g
client2,server3,linux,400g
client3,server1,windows,250g
client3,server2,linux,450g
client3,server3,linux,400g
What I need is to update file2 with the missing column 1 and 2 values from file1, adding commas to keep the same number of columns.
With this example, the output should look like this:
client1,server1,windows,250g
client1,server2,linux,450g
client1,server3,linux,400g
client2,server1,windows,250g
client2,server2,linux,450g
client2,server3,linux,400g
client3,server1,windows,250g
client3,server2,linux,450g
client3,server3,linux,400g
client4,server1,,
client5,server2,,
client6,server3,,
client7,server1,,
client7,server2,,
client8,server3,,
client8,server1,,
client8,server2,,
client9,server3,,
I have tried awk and join but am not able to get this result.
If creating a new file is easier, that's no issue.
Thanks for your help.
Another awk way
awk -F, -vOFS="," 'NR!=FNR{NF--;NF+=2}!a[$1 FS $2]++' test2 test
or
awk -F, 'NR!=FNR{$0=$1 FS $2",,"}!a[$1 FS $2]++' test2 test
Shortest
awk -F, '{x=$1","$2}NR!=FNR{$0=x",,"}!a[x]++' test2 test
give this line a try:
awk -F, '{k=$1 FS $2}NR==FNR{a[k]++;print;next}!a[k]{print k",,"}' file2 file1
Using the join command. The problem is that join cannot join on multiple fields, so we need to manipulate the first comma temporarily:
join -t , -o 0,2.2,2.3 -a 1 <(sed 's/,/:/' file1) <(sed 's/,/:/' file2) | sed 's/:/,/'

grep that matches around the first match

I would like to grep a specific word 'foo' inside specific files, then get the N lines around each match and show only the blocks that also contain a second pattern.
I found this but it doesn't really work...
find . | grep -E '.*?\.(c|asm|mac|inc)$' | \
xargs grep --color -C3 -rie 'foo' | \
xargs -n1 --delimiter='--' | grep --color -l 'bar'
For instance I have the file 'a':
a
b
c
d
bar
f
foo
g
h
i
j
bar
l
The file b:
a
bar
c
d
e
foo
g
h
i
j
k
I expect this output from grep -C2 on both files, because bar is contained in the -C2 range of foo in the first file. The second file gives no match because its bar is not within the -C2 range of foo...
--
./foo- bar
./foo- f
./foo- **foo**
./foo- g
./foo- h
--
Any ideas?
You could do this pretty simply with a "while read line" loop:
find -regextype posix-extended -regex "./file[a-z]" | while read line; do grep -nHC2 "foo" "$line" | grep --color bar; done
Output:
./filea-5-bar
./filec-46-... host pwns.me [94.23.120.252]: 451 4.7.1 Local bar configuration error ...
In this example, I created the following files:
filea - your example a
fileb - your example b
filec - some random exim log output with foo and bar tossed in 2 lines apart
filed - the same exim log output, but with foo and bar tossed in 3 lines apart
You could also pipe the output after done to alter the format:
; done | sed -E 's/-([0-9]{1,6})-/: line: \1 ::: /'
Formatted output
./filea: line: 5 ::: bar
./filec: line: 46 ::: ... host pwns.me [94.23.120.252]: 451 4.7.1 Local bar configuration error ...
I think I only understand the first line of your question and this does what I think you mean!
#!/bin/bash
N=2
pattern1=a
pattern2=z
matchinglines=($(awk -v p="$pattern1" '$0~p{print NR}' file)) # Generate array of matching line numbers
for x in "${matchinglines[@]}"
do
    ((start=x-N))
    [[ $start -lt 1 ]] && start=1 # Avoid passing negative line numbers to sed
    ((end=x+N))
    echo "DEBUG: Checking block between lines $start and $end"
    sed -ne "${start},${end}p" file | grep -q "$pattern2"
    [[ $? -eq 0 ]] && sed -ne "${start},${end}p" file
done
You need to set pattern1 and pattern2 at the start of the script. It uses awk to build an array of the line numbers that match your first pattern. Then it loops through the array and sets the start and end of a range +/-N either side of each matching line number. It uses sed to extract that block and passes it through grep to see if it contains pattern2, printing the block if it does. It may not be the most efficient, but it is easy enough to understand and maintain.
It assumes your file is called file.
Pipe it twice:
grep "[^foo\n]" | grep "\n{ntimes}foo\n{ntimes}"
