GREP to columns along with comma seperation - grep

Im greping a bunch of files in a directory as below
grep -EIho 'abc|def' *|sort|uniq -c >>counts.csv
My output is
150 abc
130 def
What I need is Current date (-1) and the result of grep like below to be inserted to counts.csv
5/21/2018 150,130

grep..|sort|uniq -c
|awk -v d="$(date -d '1 day ago' +%D)" 'NR==1{printf "%s",d}{printf "%s",","$1;}END{print ""}'
will do it.
With your example data, it gives:
05/21/18,150,130

Related

Extract specific number from command outout

I have the following issue.
In a script, I have to execute the hdparm command on /dev/xvda1 path.
From the command output, I have to extract the MB/sec values calculated.
So, for example, if executing the command I have this output:
/dev/xvda1:
Timing cached reads: 15900 MB in 1.99 seconds = 7986.93 MB/sec
Timing buffered disk reads: 478 MB in 3.00 seconds = 159.09 MB/sec
I have to extract 7986.93 and 159.09.
I tried:
grep -o -E '[0-9]+', but it returns to me all the six number in the output
grep -o -E '[0-9]', but it return to me only the first character of the six values.
grep -o -E '[0-9]+$', but the output is empty, I suppose because the number is not the last character set of outoput.
How can I achieve my purpose?
To get the last number, you can add a .* in front, that will match as much as possible, eating away all the other numbers. However, to exclude that part from the output, you need GNU grep or pcregrep or sed.
grep -Po '.* \K[0-9.]+'
Or
sed -En 's/.* ([0-9.]+).*/\1/p'
Consider using awk to just print the fields you want rather than matching on numbers. This will work using any awk in any shell on every Unix box:
$ hdparm whatever | awk 'NF>1{print $(NF-1)}'
7986.93
159.09

print the rest of input along with matching line

I am new to linux and I am experimenting with basic terminal commands. I found out that I can list all users using compgen -u but what if I only want to display the bottom line outputs ?
Ok lets say the output of compgen -u goes like this:
extra
extra
extra
extra
extra
extra
extra
extra
extra
John
William
Kate
Harold
I can only use grep to find a single text (ex. compgen -u | grep John). But what if I want to use grep to display John as well as all the remaining entries after it ?
sed or awk solution would be easier, but if you can only use grep, then the option --after-context (or -A) might do:
grep -A 5 John file
The drawback is that you need to know the number of lines to display after the matching (or use an arbitrary big number for the rest of the file).
compgen -u | grep -A$(compgen -u| wc -l) John
Explanation:
From man grep
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines. Places a line containing a group separator (described under --group-separator) between
contiguous groups of matches.
grep -A -- print number of rows after pattern
$() -- Execute unix command
compgen -u| wc -l --> Get total number of rows of output of command.
You can use the following one-liner :
n=$( compgen -u | grep -n John | head -1 | cut -d ":" -f 1 ) && compgen -u | tail -n +$n
This finds out the line number for first occurrence of John, and prints everything starting that line.

grep every fourth line in .fastq

I am working on a linux machine using bash.
My question is, how can I skip lines in the query file using grep?
I am working with a large ~16Gb .fastq file named example.fastq which has the following format.
example.fastq
#SRR6750041.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
#SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
#SRR6750041.3 3/1
ATCCANAATGATGTGTTGCTCTGGAGGTACAGAGATAACGTCAGCTGGAATAGTTTCCCCTCACAG
+
AAAAA#EE6E6EEEEEE6EEEEAEEEEEEEEEEE//EAEEEEEAAEAEEEAE/EAEEA6/EEA<E/
#SRR6750041.4 4/1
ACACCNAATGCTCTGGCCTCTCAAGCACGTGGATTATGCCAGAGAGGCCAGAGCATTCTTCGTACA
+
/AAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/E/<//AEA/EA//E//
#SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E
I need to extract lines containing a strings of interest #SRR6750041.2 #SRR6750041.5 stored in a bash array called IDarray as well as the 3 lines following each match. The following grep command allows me to do this
for ID in "${IDarray[#]}";
do
grep -F -A 3 "$ID " example.fastq
done
This correctly output the following.
#SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
#SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E
I am looking for ways to speed this process up... one way would be to reduce the number of lines searched by grep by restricting the search to lines beginning with # or skipping lines that can not possibly contain the match #SRR6750041.1 such as lines 2,3,4 and 6,7,8 etc. Is there a way to do this using grep? Alternative methods are also welcome!
Here are some thoughts with examples. For test purposes I created test case as mini version of Yours example_mini.fastq is 145 MB big and IDarray has 999 elements (interests).
Your version has this performance (more than 2 mins in user space):
$ time for i in "${arr[#]}"; do grep -A 3 "${i}" example_mini.fastq; done 1> out.txt
real 3m16.310s
user 2m9.645s
sys 0m53.092s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
First upgrade of grep to end grep after first match -m 1, I am assuming that interest ID is unique. This narrow down by 50% of complexity and takes approx 1 min in user space:
$ time for i in "${arr[#]}"; do grep -m 1 -A 3 "${i}" example_mini.fastq; done 1> out.txt
real 1m19.325s
user 0m55.844s
sys 0m21.260s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
These solutions are linearly dependent on number of elements. Call n times grep on huge file.
Now let's implement in AWK only for one run, I am exporting IDarray into input file so I can process in one run. I am loading big file into associative array per ID and then looping 1x through You array of IDs to search. This is generic scenario where You can define regexp and number of lines after to print. This has complexity with only one run through file + N comparisons. This is 2000% speed up:
$ for i in "${arr[#]}"; do echo $i; done > IDarray.txt
$ time awk '
(FNR==NR) && (linesafter-- > 0) { arr[interest]=arr[interest] RS $0; next; }
(FNR==NR) && /^#/ { interest=$1; arr[interest]=$0; linesafter=3; next; }
(FNR!=NR) && arr[$1] { print(arr[$1]); }
' example_mini.fastq IDarray.txt 1> out.txt
real 0m7.044s
user 0m6.628s
sys 0m0.307s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
As in Your title If You really can confirm that every fourth line is id of interest and three lines after are about to be printed. You can simplify into this and speed up by another 20%:
$ for i in "${arr[#]}"; do echo $i; done > IDarray.txt
$ time awk '
(FNR==NR) && (FNR%4==1) { interest=$1; arr[interest]=$0; next; }
(FNR==NR) { arr[interest]=arr[interest] RS $0; next; }
(FNR!=NR) && arr[$1] { print(arr[$1]); }
' example_mini.fastq IDarray.txt 1> out.txt
real 0m5.944s
user 0m5.593s
sys 0m0.242s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
On 1.5 GB file with 999 elements to search time is:
real 1m4.333s
user 0m59.491s
sys 0m3.460s
So per my predictions on my machine Your 15 GB example with 10k elements would take approx 16 minutes in user space to process.

How do I 'grep -c' and avoid printing files with zero '0' count

The command 'grep -c blah *' lists all the files, like below.
% grep -c jill *
file1:1
file2:0
file3:0
file4:0
file5:0
file6:1
%
What I want is:
% grep -c jill * | grep -v ':0'
file1:1
file6:1
%
Instead of piping and grep'ing the output like above, is there a flag to suppress listing files with 0 counts?
SJ
How to grep nonzero counts:
grep -rIcH 'string' . | grep -v ':0$'
-r Recurse subdirectories.
-I Ignore binary files (thanks #tongpu, warlock).
-c Show count of matches. Annoyingly, includes 0-count files.
-H Show file name, even if only one file (thanks #CraigEstey).
'string' your string goes here.
. Start from the current directory.
| grep -v ':0$' Remove 0-count files. (thanks #LaurentiuRoescu)
(I realize the OP was excluding the pipe trick, but this is what works for me.)
Just use awk. e.g. with GNU awk for ENDFILE:
awk '/jill/{c++} ENDFILE{if (c) print FILENAME":"c; c=0}' *

Multiple counts in grep?

So I have a big log file where each line contains a date. I would like to count the number of lines containing each date.
I came up with an awful solution, consisting of manually typing each of the following commands:
grep -c "2014-01-01" big.log
grep -c "2014-01-02" big.log
grep -c "2014-01-03" big.log
I could also have written a small Python script, but that seems overkill. Is there a quicker / more elegant solution?
You can maybe use of a regex and then uniq -c to count the results.
See an example:
$ cat a
2014-01-03 aaa
2014-01-03 aaa
2014-01-02 aaa
2014-01-01 aaa
2014-01-04 aaa
hello
2014-01-01 aaa
And let's look for all the 2014-01-0X, being X a digit, and count them:
$ grep -o "2014-01-0[0-9]" a | sort | uniq -c
2 2014-01-01
1 2014-01-02
2 2014-01-03
1 2014-01-04
Note piping to sort is needed to make uniq -c work properly. You can see more info about it in my answer to what is the meaning of delimiter in cut and why in this command it is sorting twice?.
Borrowing fedorqui's sample date file - thanks #fedorqui :-)
awk '/2014/{x[$1]++} END{for (k in x) print x[k],k}' file
2 2014-01-01
1 2014-01-02
2 2014-01-03
1 2014-01-04
try this
grep '2014-01-01' big.log |wc -l
grep '2014-01-02' big.log |wc -l
grep '2014-01-03' big.log |wc -l
Hope this will solve ur prob

Resources