join and paste command - join

I have 2 files.
bash-3.2$ cat result2.txt
HOSTNAME=host4 2
HOSTNAME=host1 2
HOSTNAME=host6 1
HOSTNAME=host3 1
HOSTNAME=host2 1
bash-3.2$ cat result1.txt
HOSTNAME=host1 2
HOSTNAME=host2 1
bash-3.2$ cat result.txt
HOSTNAME=host1 2
HOSTNAME=host2 1
HOSTNAME=host3 1
bash-3.2$ cat result3.txt
HOSTNAME=host4 3
HOSTNAME=host1 4
HOSTNAME=host3 7
HOSTNAME=host2 8
HOSTNAME=host6 6
bash-3.2$ join -1 1 -2 1 -a 1 -a 1 result2.txt result1.txt
HOSTNAME=host4 2
HOSTNAME=host1 2
HOSTNAME=host6 1
HOSTNAME=host3 1
HOSTNAME=host2 1
I would like to join 2 files even when the values in the first column are not in the same order and do not all appear in both files.
I want the output to be
hostname result result1 result2 result3
HOSTNAME=host1 2 2 2 4
HOSTNAME=host2 1 1 1 8
HOSTNAME=host3 1 0 1 7
HOSTNAME=host4 0 0 2 3
HOSTNAME=host6 0 0 1 6
Even the paste command does not work, as it assumes the first column of both files is the same. Is there any other command in bash that I can use to get this output?

Update: You changed the question significantly after I had already answered it. Now you say that you have 4 files instead of just 2.
However, the basic logic stays the same; we just need to join again with the result of the previous join operation:
join -o auto -j1 -a1 -a2 -e0 \
<(join -o auto -j1 -a 1 -a 2 -e 0 \
<(join -o auto -j 1 -a 1 -a 2 -e 0 \
<(sort r1.txt) <(sort r0.txt)) <(sort r2.txt)) <(sort r3.txt)
Output:
HOSTNAME=host1 2 2 2 4
HOSTNAME=host2 1 1 1 8
HOSTNAME=host3 0 1 1 7
HOSTNAME=host4 0 0 2 3
HOSTNAME=host6 0 0 1 6
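The chain can be reproduced end to end against the question's original file names (assuming r0.txt .. r3.txt correspond to result.txt, result1.txt, result2.txt and result3.txt; note that -o auto is a GNU join extension and <( ) process substitution needs bash):

```shell
#!/usr/bin/env bash
# Recreate the four sample files from the question in a scratch directory
cd "$(mktemp -d)"
printf 'HOSTNAME=host1 2\nHOSTNAME=host2 1\nHOSTNAME=host3 1\n' > result.txt
printf 'HOSTNAME=host1 2\nHOSTNAME=host2 1\n' > result1.txt
printf 'HOSTNAME=host4 2\nHOSTNAME=host1 2\nHOSTNAME=host6 1\nHOSTNAME=host3 1\nHOSTNAME=host2 1\n' > result2.txt
printf 'HOSTNAME=host4 3\nHOSTNAME=host1 4\nHOSTNAME=host3 7\nHOSTNAME=host2 8\nHOSTNAME=host6 6\n' > result3.txt

# Three chained joins: -a1 -a2 keep unpairable lines, -e0 fills their gaps
join -o auto -j1 -a1 -a2 -e0 \
  <(join -o auto -j1 -a1 -a2 -e0 \
      <(join -o auto -j1 -a1 -a2 -e0 \
          <(sort result.txt) <(sort result1.txt)) \
      <(sort result2.txt)) \
  <(sort result3.txt)
```

Joining result.txt with result1.txt first yields the columns in exactly the order asked for in the question: result, result1, result2, result3.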
You are looking for the following command:
join -o '1.1 1.2 2.2' -j 1 -a 1 -a 2 -e 0 <(sort r2.txt) <(sort r1.txt)
Output:
HOSTNAME=host1 2 2
HOSTNAME=host2 1 1
HOSTNAME=host3 1 0
HOSTNAME=host4 2 0
HOSTNAME=host6 1 0
Explanation:
-j 1 is the same as -1 1 -2 1 (which you had). It means "join by field 1 in both files"
-a 1 -a 2 prints un-joinable lines from file1 and file2
-e 0 uses 0 as the default value for empty columns
<(sort file) is so-called process substitution
-o '1.1 1.2 2.2' tells join that you want to output field 1 from file1, followed by field 2 from file1 and field 2 from file2. If one of the files is missing field 2, a 0 will be used because of -e 0.
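To see what each option contributes, it may help to rebuild the command incrementally on the question's two files (a sketch; file contents copied from the question, <( ) needs bash):

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)"
printf 'HOSTNAME=host4 2\nHOSTNAME=host1 2\nHOSTNAME=host6 1\nHOSTNAME=host3 1\nHOSTNAME=host2 1\n' > result2.txt
printf 'HOSTNAME=host1 2\nHOSTNAME=host2 1\n' > result1.txt

# Plain join: only keys present in BOTH files survive
join -j 1 <(sort result2.txt) <(sort result1.txt)
# HOSTNAME=host1 2 2
# HOSTNAME=host2 1 1

# -a 1 -a 2 keeps unpairable lines; -o with -e 0 pads them to three fields
join -o '1.1 1.2 2.2' -j 1 -a 1 -a 2 -e 0 <(sort result2.txt) <(sort result1.txt)
# host3, host4 and host6 now also appear, with a 0 in the last column
```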

This is a solution on the first requirement, with just two files. For the solution on multiple files, check hek2mgl's answer!
What about using awk for this? It is just a matter of storing the data from one file (result1.txt) in an array and then printing accordingly when reading the other one (result2.txt):
$ awk 'FNR==NR {data[$1]=$2; next} {print $0, ($1 in data) ? data[$1] : 0}' f2 f1
HOSTNAME=host4 2 0
HOSTNAME=host1 2 2
HOSTNAME=host6 1 0
HOSTNAME=host3 1 0
HOSTNAME=host2 1 1
If you need this to be sorted, pipe to sort: awk '...' f2 f1 | sort or say awk '...' f2 <(sort f1).
How does this work?
awk 'things' f2 f1
reads the file f2 and then the file f1.
FNR==NR {data[$1]=$2; next}
FNR is the record number within the current file, while NR is the record number across all input files, so the two are equal only while the first file is being read. This way, FNR==NR allows you to do something only when reading the first file. Here, it consists in storing the data in an array: data[first field] = second field. Then, next skips the rest of the script for the current line. You can read more about this technique in Idiomatic awk.
{print $0, ($1 in data) ? data[$1] : 0}
Now we are reading the second file. Here, we check if the first field is present in the array. If so, we print its corresponding value from the first file; otherwise, we just print a 0.
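The same idea extends from two files to any number of them: remember each file's value per key while reading, then print one padded row per key at the end. A portable sketch (the FNR==1 counter stands in for GNU awk's ARGIND; file names taken from the question, output piped to sort because for (k in keys) has no defined order):

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)"
printf 'HOSTNAME=host1 2\nHOSTNAME=host2 1\nHOSTNAME=host3 1\n' > result.txt
printf 'HOSTNAME=host1 2\nHOSTNAME=host2 1\n' > result1.txt
printf 'HOSTNAME=host4 2\nHOSTNAME=host1 2\nHOSTNAME=host6 1\nHOSTNAME=host3 1\nHOSTNAME=host2 1\n' > result2.txt
printf 'HOSTNAME=host4 3\nHOSTNAME=host1 4\nHOSTNAME=host3 7\nHOSTNAME=host2 8\nHOSTNAME=host6 6\n' > result3.txt

awk 'FNR == 1 { fi++ }                    # fi = index of the current file
     { v[$1, fi] = $2; keys[$1] = 1 }     # one value per (key, file) pair
     END {
       for (k in keys) {
         printf "%s", k
         for (i = 1; i <= fi; i++)        # one column per input file
           printf " %s", (((k, i) in v) ? v[k, i] : 0)
         print ""
       }
     }' result.txt result1.txt result2.txt result3.txt | sort
```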

Why does "docker run -t" output include \r in the command output?

I'm using Docker client Version: 18.09.2.
When I start a container interactively and run a date command, then pipe its output to hexdump for inspection, I see a trailing \n as expected:
$ docker run --rm -i -t alpine
/ # date | hexdump -c
0000000 T h u M a r 7 0 0 : 1 5
0000010 : 0 6 U T C 2 0 1 9 \n
000001d
However, when I pass the date command as the entrypoint directly and run the container, I get \r\n every time there's a newline in the output.
$ docker run --rm -i -t --entrypoint=date alpine | hexdump -c
0000000 T h u M a r 7 0 0 : 1 6
0000010 : 1 9 U T C 2 0 1 9 \r \n
000001e
This is weird.
It totally doesn't happen when I omit -t (not allocating any TTY):
docker run --rm -i --entrypoint=date alpine | hexdump -c
0000000 T h u M a r 7 0 0 : 1 7
0000010 : 3 0 U T C 2 0 1 9 \n
000001d
What's happening here?
This sounds dangerous, as I use the docker run command in my scripts, and if I forget to omit -t, the output I collect from docker run will have invisible/non-printable \r characters, which can cause all sorts of issues.
tl;dr: This is default TTY behaviour and unrelated to Docker, per the ticket filed on GitHub about your exact issue.
Quoting the relevant comments in that ticket:
Looks like the TTY does indeed translate newlines to CRLF by default
$ docker run -t --rm debian sh -c "echo -n '\n'" | od -c
0000000 \r \n
0000002
Disabling "translate newline to carriage return-newline" with stty -onlcr correctly gives:
$ docker run -t --rm debian sh -c "stty -onlcr && echo -n '\n'" | od -c
0000000 \n
0000001
Default TTY options seem to be set by the kernel ... On my linux host it contains:
/*
* Defaults on "first" open.
*/
#define TTYDEF_IFLAG (BRKINT | ISTRIP | ICRNL | IMAXBEL | IXON | IXANY)
#define TTYDEF_OFLAG (OPOST | ONLCR | XTABS)
#define TTYDEF_LFLAG (ECHO | ICANON | ISIG | IEXTEN | ECHOE|ECHOKE|ECHOCTL)
#define TTYDEF_CFLAG (CREAD | CS7 | PARENB | HUPCL)
#define TTYDEF_SPEED (B9600)
ONLCR is indeed there.
When we go looking at the ONLCR flag documentation, we can see that:
[-]onlcr: translate newline to carriage return-newline
To again quote the github ticket:
Moral of the story, don't use -t unless you want a TTY.
TTY line endings are CRLF, this is not Docker's doing.
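If a script really must keep -t (say, the program refuses to run without a TTY), the carriage returns can be stripped afterwards with tr. The snippet below only simulates the TTY's onlcr rewriting with printf instead of invoking docker, so it stays self-contained:

```shell
# Simulated `docker run -t` output: the TTY layer rewrote every \n as \r\n
printf 'Thu Mar  7 00:16:19 UTC 2019\r\n' |
  tr -d '\r' |   # delete the carriage returns again
  od -c          # inspect: only the plain \n remains
```

So docker run -t ... | tr -d '\r' recovers the output a TTY-less run would have produced, as long as the payload itself contains no meaningful \r characters.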

Combine -v option for grep with -A option

I'd like to ask: is it possible to somehow combine -v with -A?
I have example file:
abc
1
2
3
ACB
def
abc
1
2
3
ABC
xyz
with -A I can see the parts I want to "cut":
$ grep abc -A 4 grep_v_test.txt
abc
1
2
3
ACB
--
abc
1
2
3
ABC
Is there some option to specify something so that I see only
def
xyz
?
I found this answer - Combining -v flag and -A flag in grep - but it does not work for me. I tried
$ sed -e "/abc/{2;2;d}" grep_v_test.txt
sed: -e expression #1, char 8: unknown command: `;'
also
$ sed "/abc/2d" grep_v_test.txt
sed: -e expression #1, char 6: unknown command: `2'
or
$ sed "/abc/+2d" grep_v_test.txt
sed: -e expression #1, char 6: unknown command: `+'
Sed version is:
$ sed --version
GNU sed version 4.2.1
edit1:
Based on the comments I experimented a little with both solutions, but they do not work the way I want.
For grep -v -A 1 abc I would expect the lines abc and 1 to be removed and the rest to be printed; awk 'c&&!--c; /abc/ {c=2}' grep_v_test.txt prints just the line containing 2, which is not what I wanted.
Very similar it is with sed
$ sed -n '/abc/{n;n;p}' grep_v_test.txt
2
2
edit2:
It seems, I'm not able to describe it properly, let me try again.
What grep -A N abc file does is to print N lines after abc. I want to remove what grep -A will show, so in a file
abc
1
2
3
ACB
def
DEF
abc
1
2
3
ABC
xyz
XYZ
I'll just remove the part from abc to ABC and print the rest:
def
DEF
xyz
XYZ
so 4 lines will remain... The awk solution prints just def and xyz and skips DEF and XYZ...
To skip 5 lines of context, starting with the initial matching line:
$ awk '/abc/{c=5} c&&c--{next} 1' file
def
xyz
See Extract Nth line after matching pattern for other related scripts.
wrt the comments below, here's the difference between this answer and @fedorqui's answer:
$ cat file
now is the Winter
of our discontent
abc
1
2
bar
$ awk '/abc/{c=3} c&&c--{next} 1' file
now is the Winter
of our discontent
bar
$ awk '/abc/ {c=0} c++>2' file
bar
See how @fedorqui's script unconditionally skips the first 2 lines of the file?
If I understand you properly, you want to print all the lines that occur more than 4 lines after a given match.
For this you can tweak the solutions in Extract Nth line after matching pattern and say:
$ awk '/abc/ {c=0} c++>4' file
def
DEF
xyz
XYZ
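For completeness, the sed attempts in the question were close: GNU sed accepts an addr,+N address, so a matching line plus the N lines after it can be deleted in one expression (a GNU extension, not POSIX sed):

```shell
# The sample file from edit2
printf 'abc\n1\n2\n3\nACB\ndef\nDEF\nabc\n1\n2\n3\nABC\nxyz\nXYZ\n' > grep_v_test.txt
# Delete each /abc/ line together with the 4 lines that follow it
sed '/abc/,+4d' grep_v_test.txt
```

which leaves exactly the 4 lines def, DEF, xyz and XYZ.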

Why can't grep -F -f work correctly once there is an empty line?

There are files a and b, and I want to find their common lines and differing lines.
➜ ~ cat a <(echo) b
1
2
3
4
5
1
2
a
4
5
#find common lines
➜ ~ grep -F -f a b
1
2
4
5
#find b-a
➜ ~ grep -F -v -f a b
a
Everything is OK, but when one file contains an empty line, grep stops working; see below:
# add an empty line in file a
➜ ~ cat a
1
2
3
4
5
# the line "a" is not common, yet it is printed
➜ ~ grep -F -f a b
1
2
a
4
5
# b-a is nothing
➜ ~ grep -F -v -f a b
Why is that? Why does grep stop working correctly once there is an empty line?
In addition, using grep to find common elements has another problem, e.g.:
➜ ~ cat a <(echo) b
1
2
3
4
5
6
1
2
a
4
5
6_id
➜ ~ grep -F -f a b
1
2
4
5
6_id
The reason grep misbehaves is that an empty line in the pattern file is an empty pattern, and an empty pattern matches every line of b: the common-lines run prints everything and the -v run prints nothing. The 6_id case is because grep -F matches substrings, not whole lines.
Can you use comm and diff instead of grep?
To find common lines use (comm requires sorted input, which your samples already are):
comm -12 a b
To find the diff lines:
diff a b
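Staying with grep, both problems have targeted fixes: -x restricts matching to whole lines (which also cures the 6 vs 6_id case), and filtering empty lines out of the pattern file removes the match-everything empty pattern. A sketch with the question's data (<( ) needs bash):

```shell
#!/usr/bin/env bash
cd "$(mktemp -d)"
printf '1\n2\n3\n4\n5\n\n' > a    # note the trailing empty line
printf '1\n2\na\n4\n5\n'   > b

# Whole-line matches only, empty patterns filtered out of a
grep -Fxf  <(grep -v '^$' a) b    # common lines: 1 2 4 5
grep -Fxvf <(grep -v '^$' a) b    # b - a: a
```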

naivebayes Mahout 0.7

I am working on sentiment analysis of tweets.
I am using the Mahout naive Bayes classifier for it. I created a directory "data" with three subdirectories named "positive", "negative" and "uncertain", put 151 files (151 MB total) into each of them, and copied the "data" directory to HDFS. Below is the set of commands I ran to generate the model and labelindex from it:
bin/mahout seqdirectory -i ${WORK_DIR}/data -o ${WORK_DIR}/data-seq
bin/mahout seq2sparse -i ${WORK_DIR}/data-seq -o ${WORK_DIR}/data-vectors -lnorm -nv -wttfidf
bin/mahout split -i ${WORK_DIR}/data-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/data-train-vectors --testOutput ${WORK_DIR}/data-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
bin/mahout trainnb -i ${WORK_DIR}/data-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow $c
I am getting the confusion matrix after testing on the same set of data using "testnb" command as given below:
bin/mahout testnb -i ${WORK_DIR}/data-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data-testing $c
Confusion Matrix
-------------------------------------------------------
a b c <--Classified as
151 0 0 | 151 a = negative
0 151 0 | 151 b = positive
0 0 151 | 151 c = uncertain
Then I created another directory, "data2", in the same way and put some random data (a subset of the training data, 30 files totalling 30 MB in each of the positive, negative and uncertain subdirectories) into it. Then I created vectors from it using the "seq2sparse" command given below:
bin/mahout seqdirectory -i ${WORK_DIR}/data2 -o ${WORK_DIR}/data2-seq
bin/mahout seq2sparse -i ${WORK_DIR}/data2-seq -o ${WORK_DIR}/data2-vectors -lnorm -nv -wttfidf
On running "testnb" using the model/labelindex created from the previous set of data, with the command given below:
bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors/part-r-00000 -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c
I get a confusion matrix like this:
Confusion Matrix
-------------------------------------------------------
a b c <--Classified as
0 30 0 | 30 a = negative
0 30 0 | 30 b = positive
0 30 0 | 30 c = uncertain
Can anyone tell me why this happens. Am I testing the model the correct way, or is it a bug in Mahout 0.7? If this is not the correct way, please suggest one.
Can you try this:
bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c
(remove the "part-r-00000")

How to grep a specific integer

I have a list of number in a file with format: {integer}\n . So a possible list is:
3
12
53
23
18
32
1
4
I want to use grep to get the count of a specific number, but grep -c "1" file returns 3 because besides the 1 it also counts 12 and 18. How can I correct this?
Although all the answers so far are logical, and I had thought of and tested them before, none of them actually works:
username#domain2:~/code/***/project/random/r2$ cat out.txt
2
16
11
1
13
2
1
16
16
9
username#domain2:~/code/***/project/random/r2$ grep -Pc "^1$" out.txt
0
username#domain2:~/code/***/project/random/r2$ grep -Pc ^1$ out.txt
0
username#domain2:~/code/***/project/random/r2$ grep -c ^1$ out.txt
0
username#domain2:~/code/***/project/random/r2$ grep -c "^1$" out.txt
0
username#domain2:~/code/***/project/random/r2$ grep -xc "^1$" out.txt
0
username#domain2:~/code/***/project/random/r2$ grep -xc "1" out.txt
0
Use the -x flag:
grep -xc 1 file
This is what it means:
-x, --line-regexp
Select only those matches that exactly match the whole line.
There are some other ways you can do this besides grep:
$ cat file
3 1 2 100
12 x x x
53
23
18
32
1
4
$ awk '{for(i=1;i<=NF;i++) if ($i=="1") c++}END{print c}' file
2
$ ruby -0777 -ne 'puts $_.scan(/\b1\b/).size' file
2
$ grep -o '\b1\b' file | wc -l
2
$ tr " " "\n" < file | grep -c "\b1\b"
2
Use this regex...
\D1\D
...or ^1$ with multiline mode on.
Tested with RegExr and they both work.
Use e.g. ^123$ to match "Beginning of line, 123, End of line"
grep -wc '23' filename.txt
It counts the number of exact whole-word matches of the number 23.
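A small demonstration of how the plain, -x and -w counts differ, plus one common reason the anchored variants can return 0 as in the question's transcript: invisible trailing characters such as \r, which cat -A or od -c make visible:

```shell
printf '3\n12\n53\n23\n18\n32\n1\n4\n' > nums.txt
grep -c  1 nums.txt   # 3: substring match also counts 12 and 18
grep -xc 1 nums.txt   # 1: whole lines only
grep -wc 1 nums.txt   # 1: whole words only

# A hidden \r makes the line "1\r", so -x (and ^1$) no longer match it
printf '1\r\n' | grep -xc 1                  # 0
printf '1\r\n' | tr -d '\r' | grep -xc 1     # 1 after stripping the CR
```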