I need to find and print only the matched strings from each line of a file using grep.
Below is some example text:
04-06-2013 INFO blah blah blah blah Param : 39 another text Ending the line.
05-06-2013 INFO blah blah allah line 2 ending here with Param : 21.
I want the output to look like this after the grep command:
04-06-2013 INFO Param : 39
05-06-2013 INFO Param : 21
I tried grep with the -o option and the regex '.*INFO'. I was able to print each piece of text with a separate grep command, but I want to do it in a single command.
Thanks in advance.
grep -o ".*INFO\|Param : [0-9]*"
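Note that -o prints each match on its own line, so every input line produces two output lines. A minimal sketch of gluing them back together with paste, assuming every line contains both INFO and a Param field (the function name join_matches and the temporary file are just for illustration, and the ERE form with -E is used here in place of the \| extension):

```shell
# Sample data from the question, written to a temporary file.
foo=$(mktemp)
printf '%s\n' \
  '04-06-2013 INFO blah blah blah blah Param : 39 another text Ending the line.' \
  '05-06-2013 INFO blah blah allah line 2 ending here with Param : 21.' > "$foo"

# grep -o emits one match per line; paste - - joins each pair of
# consecutive lines, separated by a single space.
join_matches() {
  grep -oE '.*INFO|Param : [0-9]+' "$1" | paste -d' ' - -
}

join_matches "$foo"
```

If some lines lack one of the two parts, the pairing drifts, so this only suits input where both are always present.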
I'm not sure you can do this with pure grep, as you'd need to be able to specify a regex with grouped terms, and then only print out certain regex groups rather than everything matched by the entire regex - so you'd e.g. specify (.*INFO)(.*)(Param : [0-9]*) as the regex and then only print groups 1 and 3 (assuming you start counting at 1).
You can however use sed to post-process the output for you:
% cat foo
04-06-2013 INFO blah blah blah blah Param : 39 another text Ending the line.
05-06-2013 INFO blah blah allah line 2 ending here with Param : 21.
% grep 'Param :' foo | sed 's/\(.*INFO\)\(.*\)\(Param : [0-9]*\)\(.*\)/\1 \3/'
04-06-2013 INFO Param : 39
05-06-2013 INFO Param : 21
What I'm doing above is replacing the match with just groups 1 and 3, separated by a space.
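The same substitution can probably be done without the grep step at all. A sketch, assuming the sample data above (the function name extract_info_param is made up for this example): sed -n suppresses the default output, and the p flag prints only the lines where the substitution succeeded.

```shell
# Sample log from the question, written to a temporary file.
log=$(mktemp)
printf '%s\n' \
  '04-06-2013 INFO blah blah blah blah Param : 39 another text Ending the line.' \
  '05-06-2013 INFO blah blah allah line 2 ending here with Param : 21.' > "$log"

# Keep group 1 (up to INFO) and group 2 (the Param field), drop the rest.
# Lines without a match are not printed because of -n plus the p flag.
extract_info_param() {
  sed -n 's/\(.*INFO\).*\(Param : [0-9]*\).*/\1 \2/p' "$1"
}

extract_info_param "$log"
```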
I think this question is related (possibly even a duplicate).
I have an input file such as
file;14;19;;;hello 2019
file2;2019;2020;;;this is a test 2020
file3;25;31;this is a number 31
I would like to grep numbers only after ;;;. For example if I wanted to grep 2019 it would give me
file;14;19;;;hello 2019
instead of what I get from grep '2019' file:
file;14;19;;;hello 2019
file2;2019;2020;;;this is a test 2020
How can I accomplish this task?
Regular expressions can include things other than fixed text. It sounds like all you need is:
grep ';;;.*[0-9]' inputFile.txt
This will deliver all lines that have the text ;;; followed by a digit somewhere after that in the line. In terms of explanation:
;;; is the literal text, three semicolons;
.* is zero or more of any character;
[0-9] is any digit.
That will give you lines with any number. If you want a specific number, use that for the final bullet point above.
Just keep in mind that this will also give you the line xyzzy ;;; 920194 if you go looking for 2019.
If you want just the 2019 numbers (i.e., without any digits on either side), you can use the zero-width negative look-behind and look-ahead assertions, assuming your version of grep has Perl-compatible regular expressions (PCRE, which GNU grep does with the -P flag):
grep -P ';;;.*(?<![0-9])2019(?![0-9])' inputFile.txt
This can be read as:
;;; is the literal text, three semicolons;
.* is zero or more of any character;
(?<![0-9]) means next match cannot be preceded by a digit;
2019 is the number you're looking for;
(?![0-9]) means previous match cannot be followed by a digit.
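If your grep lacks -P, a similar effect can be approximated with an extended regular expression. This is only a sketch under that assumption; the function name find_exact_number and the test file are made up, and the sample includes the false-positive line mentioned above.

```shell
# Sample input, including a line that a plain search for 2019 would match.
input=$(mktemp)
printf '%s\n' \
  'file;14;19;;;hello 2019' \
  'file2;2019;2020;;;this is a test 2020' \
  'file3;25;31;this is a number 31' \
  'xyzzy ;;; 920194' > "$input"

# (.*[^0-9])? : optional text after ;;; that must end in a non-digit
# ([^0-9]|$)  : the number is followed by a non-digit or the end of the line
find_exact_number() {
  grep -E ';;;(.*[^0-9])?2019([^0-9]|$)' "$1"
}

find_exact_number "$input"
```

Unlike the look-around version, the bracket expressions consume a character on each side, which does not matter when you only want to select whole lines.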
Use this Perl one-liner:
perl -F';' -lane 'print if $F[-1] =~ /2019/' in_file
Example:
( echo 'file;14;19;;;hello 2019' ; echo 'file2;2019;2020;;;this is a test 2020' ) | perl -F';' -lane 'print if $F[-1] =~ /2019/'
Prints:
file;14;19;;;hello 2019
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F';' : Split into @F on semicolon (;), rather than on whitespace.
$F[-1] : the last element of the array @F = the last element of the input line split on semicolon. Alternatively, use $F[5] (the 6th element - the arrays are 0-indexed), if you need to count from the left.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
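For comparison, roughly the same test can be written in awk, where $NF plays the role of $F[-1]. A sketch only; the function name match_last_field is hypothetical and the data is the two-line example from above.

```shell
in_file=$(mktemp)
printf '%s\n' \
  'file;14;19;;;hello 2019' \
  'file2;2019;2020;;;this is a test 2020' > "$in_file"

# -F';' splits each line on semicolons; $NF is the last field.
# A pattern with no action prints the matching line.
match_last_field() {
  awk -F';' '$NF ~ /2019/' "$1"
}

match_last_field "$in_file"
```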
I have a question. I have a file like this:
#HWI-ST273:296:C0EFRACXX:2:2101:17125:145325/1
TTAATACACCCAACCAGAAGTTAGCTCCTTCACTTTCAGCTAAATAAAAG
+
8?8A;DDDD;#?++8A?;C;F92+2A#19:1*1?DDDECDE?B4:BDEEI
#BBBB-ST273:296:C0EFRACXX:2:1303:5281:183410/1
TAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTTACCA
+
CCBFFFFFFHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJJJJIJJJJJI
#HWI-ST273:296:C0EFRACXX:2:1103:16617:140195/1
AAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTT
+
#C#FF?EDGFDHH#HGHIIGEGIIIIIEDIIGIIIGHHHIIIIIIIIIII
#HWI-ST273:296:C0EFRACXX:2:1207:14316:145263/1
AATACACCCAACCAGAAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCC
+
CCCFFFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJ
I'm only interested in the lines that start with '#HWI', but I want to count the header lines that do not start with '#HWI'. In the example shown, the result is 1 because there is one line that starts with '#BBB'.
To be clearer: the file consists of repeating 4-line records, and I just want the number of records whose first line does not start with '#HWI'. I hope that is clear enough; please tell me if you need more clarification.
With GNU sed, you can use its extended address to print every fourth line, then use grep to count the ones that don't start with #HWI:
sed -n '1~4p' file.fastq | grep -cv '^#HWI'
Otherwise, you can use Perl, for example:
perl -ne 'print if 1 == $. % 4' -- file.fastq | grep -cv '^#HWI'
$. contains the current line number, % is the modulo operator.
But once we're running Perl, we don't need grep anymore:
perl -lne '++$c if 1 == $. % 4 && !/^#HWI/; END { print $c }' -- file.fastq
-l removes newlines from input and adds them to output.
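The same count can also be sketched in plain awk, which avoids depending on GNU sed or Perl. The function name count_other_headers is made up here, and the data is the sample from the question; NR % 4 == 1 selects the first line of each 4-line record, so a quality line that happens to start with # is not counted.

```shell
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
#HWI-ST273:296:C0EFRACXX:2:2101:17125:145325/1
TTAATACACCCAACCAGAAGTTAGCTCCTTCACTTTCAGCTAAATAAAAG
+
8?8A;DDDD;#?++8A?;C;F92+2A#19:1*1?DDDECDE?B4:BDEEI
#BBBB-ST273:296:C0EFRACXX:2:1303:5281:183410/1
TAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTTACCA
+
CCBFFFFFFHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJJJJIJJJJJI
#HWI-ST273:296:C0EFRACXX:2:1103:16617:140195/1
AAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTT
+
#C#FF?EDGFDHH#HGHIIGEGIIIIIEDIIGIIIGHHHIIIIIIIIIII
#HWI-ST273:296:C0EFRACXX:2:1207:14316:145263/1
AATACACCCAACCAGAAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCC
+
CCCFFFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJ
EOF

# Count record headers (lines 1, 5, 9, ...) that do not start with #HWI.
# c + 0 makes the result print as 0 when nothing was counted.
count_other_headers() {
  awk 'NR % 4 == 1 && !/^#HWI/ { c++ } END { print c + 0 }' "$1"
}

count_other_headers "$fastq"
```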
I have a tab-separated log file, but I need only a few characters from the lines that begin with 30.10.
Using the command
awk '/^30.10/{print}' FOOD_ORDERS_201907041307.DEL
I get this output:
30.1006 35470015000205910002019070420190705 00000014870000000034
30.1006 35470015000205900002019070420190705 00000014890000000029
30.1006 35470023000205920002019070420190705 00000014900000000011
What I need to extract is 3547 and the trailing characters at the very end, after the zeros.
So the expected output will be:
3547
34
29
11
But if the last 10 characters contain leading zeros followed by a number, I need that number.
While your question is unclear, your answer to Ed Morton's comment provides a bit more clarity on what you are trying to achieve. What is still unclear is exactly what you want from the third field. From your question and the various comments, it appears that if the line begins with 30.10, you want the first 4 digits of the second field and the rightmost run of digits in the range [1-9] from the third field.
If that accurately captures what you need, then awk with a combination of substr, match and length string functions can isolate the digits you are interested in. For example:
awk '/^30\.10/ {
    l = match($3, /[1-9]+$/)    # position where the trailing run of 1-9 digits starts
    print substr($2, 1, 4) " " substr($3, l, length($3) - l + 1)
}' test
Would take the input file (borrowed from Dudi Boy's answer), e.g.
$ cat test
30.1006 35470015000205910002019070420190705 00000014870000000034
30.1006 35470015000205900002019070420190705 00000014890000001143
30.1006 35470015000205900002019070420190705 00000014890000000029
30.1006 35470023000205920002019070420190705 00000014900000000011
and return to you:
3547 34
3547 1143
3547 29
3547 11
Let me know if that accurately captures what you need.
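If the value can itself contain zeros (say 120), matching only [1-9] at the end would not capture it. One possible variation, assuming the number always sits in the last 10 characters of the third field as the question suggests, is to let awk convert that substring to a number. The function name last_number is hypothetical, and the third sample line (ending in 120) is made-up data added to show the zero case.

```shell
test_file=$(mktemp)
printf '%s\n' \
  '30.1006 35470015000205910002019070420190705 00000014870000000034' \
  '30.1006 35470015000205900002019070420190705 00000014890000001143' \
  '30.1006 35470015000205900002019070420190705 00000014890000000120' > "$test_file"

# substr($3, length($3) - 9) takes the last 10 characters of field 3;
# adding 0 converts that string to a number, which drops the leading
# zeros but keeps zeros inside or at the end of the value.
last_number() {
  awk '/^30\.10/ { print substr($2, 1, 4), substr($3, length($3) - 9) + 0 }' "$1"
}

last_number "$test_file"
```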
Here is a simple awk script to do the task:
script.awk
/^30.10/ {                                    # for each line starting with 30.10
    last2chars = substr($3, length($3)-1);    # extract last 2 chars from 3rd field into variable last2chars
    if($3 ~ /00001143$/) last2chars = 1143;   # if 3rd field ends with 1143, update variable last2chars respectively
    print last2chars;                         # output variable last2chars
}
input.txt
30.1006 35470015000205910002019070420190705 00000014870000000034
30.1006 35470015000205900002019070420190705 00000014890000001143
30.1006 35470015000205900002019070420190705 00000014890000000029
30.1006 35470023000205920002019070420190705 00000014900000000011
running:
awk -f script.awk input.txt
output:
34
1143
29
11
Got part of it!
awk '/^30.10/{print}' FOOD_ORDERS_201907041307.DEL | sed 's/.*\(..\)/\1/'
I have a fasta file like the test one here:
>HWI-D00196:168:C66U5ANXX:3:1106:16404:19663 1:N:0:GCCAAT
CCTAGCACCATGATTTAATGTTTCTTTTGTACGTTCTTTCTTTGGAAACTGCACTTGTTGCAACCTTGCAAGCCATATAAACACATTTCAGATATAAGGCT
>HWI-D00196:168:C66U5ANXX:3:1106:16404:19663 2:N:0:GCCAAT
AAAACATAAATTTGAGCTTGACAAAAATTAAAAATGAGCCCAGCCTTATATCTGAAATGTGTTTATATGGCTTGCAAGGTTGCAACAAGTGCAGTTTCCAA
>HWI-D00196:168:C66U5ANXX:4:1304:10466:100132 1:N:0:GCCAAT
ATATTTGAATTATCAGAAATAAACACAAAGAAAACCTAGAACAGATAATTTCTTCCACATTATTGATCAGATACAGATTTCAAGGGTACCGTTGTGAATTG
>HWI-D00196:168:C66U5ANXX:4:1304:10466:100132 2:N:0:GCCAAT
AAACGATTGATAGATCTATTTGCATTATAAAAACATTAAAAAAACAAAATACTGATTAAATGTCGTCTTTCTATTCCACAATTTTATAGATCTCACTGTAT
>HWI-D00196:168:C66U5ANXX:4:1307:12056:64030 1:N:0:GCCAAT
CTTACTTTGCCTCTCTCAGCCAATGTCTCCTGAGTCTAATTTTTTGGAGGCTAAGCTATGAGCTAATGATGGGTTCCATTTGGGGCCAATGCTTCAGCCTG
>HWI-D00196:168:C66U5ANXX:4:1307:12056:64030 2:N:0:GCCAAT
CTATTAGTTCTTATCTTTGCCTGCAAATATAAGACTAGCGCTTGAGTAGCTGACAGAGACAAAGTAAGCTGGAGTGTTTATCACCTGGTCACTCCAATTGT
When I type in a simple grep command like:
grep -B1 "CTT" test.fasta
I get a really strange output in which "--" is sometimes placed on a newline above the grep hit like so:
>HWI-D00196:168:C66U5ANXX:4:1304:10466:100132 2:N:0:GCCAAT
AAACGATTGATAGATCTATTTGCATTATAAAAACATTAAAAAAACAAAATACTGATTAAATGTCGTCTTTCTATTCCACAATTTTATAGATCTCACTGTAT
--
>HWI-D00196:168:C66U5ANXX:4:1307:12056:64030 2:N:0:GCCAAT
CTATTAGTTCTTATCTTTGCCTGCAAATATAAGACTAGCGCTTGAGTAGCTGACAGAGACAAAGTAAGCTGGAGTGTTTATCACCTGGTCACTCCAATTGT
I can't figure out why some fasta entries have this and others don't. I don't get this problem when I remove the -B1. I can remove those lines from my file with a grep -v "--" statement, but I'd really like to understand what's going on here.
You are asking for one line of leading context by using the -B1 option. This means grep will display both the line which matched and the line directly before it. Groups of output lines that are not adjacent in the file are separated by -- on a line by itself, as shown below:
$ man grep | grep -B1 context
-A num, --after-context=num
Print num lines of trailing context after each match. See also
--
-B num, --before-context=num
Print num lines of leading context before each match. See also
--
-C[num, --context=num]
Print num lines of leading and trailing context surrounding each
--
--context[=num]
Print num lines of leading and trailing context. The default is
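The reason you aren't seeing -- between every match is that the separator is only printed between groups of output lines that are not adjacent in the file. When consecutive lines match, or when the context line of one match directly follows the previous match, the output is contiguous and no -- appears. See the following example: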
seq 13 | grep -B1 1
1
--
9
10
11
12
13
The seq command produces all the numbers between 1 and 13. Only the first line and the lines from 10 on contain a 1, so you see the 1 in its own group, then --, then the one line context, then the group of consecutive matching lines.
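If the separators are unwanted, they can be filtered out afterwards. A small sketch using the same seq example (the function name context_without_separators is made up): the -e is there because a bare "--" would otherwise be taken as the end-of-options marker rather than as the pattern, and -x restricts the match to lines that are exactly "--". GNU grep also documents a --no-group-separator option that avoids printing them in the first place.

```shell
# grep -B1 prints "--" between non-adjacent groups of output.
# The second grep drops lines that consist of exactly "--".
context_without_separators() {
  seq 13 | grep -B1 1 | grep -v -x -e '--'
}

context_without_separators
```

Keep in mind that this would also drop genuine input lines consisting only of "--", if the data can contain any.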
The GREP_COLORS section of the grep man page says:
Specifies the colors and other attributes used to highlight various parts of the output. Its value is a colon-separated list
of capabilities that defaults to
ms=01;31:mc=01;31:sl=:cx=:fn=35:ln=32:bn=32:se=36 with the rv and
ne boolean capabilities omitted (i.e., false).
and
se=36 SGR substring for separators that are inserted between
selected line fields (:), between context line fields, (-), and
between groups of adjacent lines when nonzero context is
specified (--). The default is a cyan text foreground over the
terminal's default background.
Consider the file sample.txt:
$ cat sample.txt
ABBB
AAB
AAB
S
S
S
AABB
ABAA
BAA
CCC
$ grep -B2 'AAB' sample.txt
ABBB
AAB
AAB
--
S
S
AABB
Here -- is grep's way of telling you that the AAB before -- and the S after -- are not adjacent lines in the actual file.
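Adding -n makes the gap visible, since the line numbers jump across the separator. A short sketch with the same sample.txt contents (the function name show_numbered_context is just for illustration); matching lines are marked with : and context lines with -, the separators the man page excerpt refers to.

```shell
sample=$(mktemp)
printf '%s\n' ABBB AAB AAB S S S AABB ABAA BAA CCC > "$sample"

# -n prefixes each output line with its line number in the file.
show_numbered_context() {
  grep -n -B2 'AAB' "$1"
}

show_numbered_context "$sample"
# 1-ABBB
# 2:AAB
# 3:AAB
# --
# 5-S
# 6-S
# 7:AABB
```

Line 4 is missing from the output, which is exactly the discontinuity that -- marks.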
I have several files that look like this:
abcd
several lines
abcd
several lines
abcd
several lines
.
.
.
What I want to do (preferably using grep) is to get the 20 lines immediately following the LAST abcd line.
Any help is appreciated.
Thanks
Use the -A option:
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines. Places a line
containing a group separator (--) between contiguous groups of matches.
With the -o or --only-matching option, this has no effect and a warning
is given.
So:
$ grep -A 20 abcd file.txt
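will give you each abcd line plus the 20 lines after it. To get just the last match and its 20 following lines (21 lines in total, assuming at least 20 lines follow the last abcd), use tail: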
$ grep -A 20 abcd file.txt | tail -21
You can do this:
awk '/abcd/ {n=NR} {a[NR]=$0} END {for (i=n;i<=n+20;i++) print a[i]}' file
It searches for the pattern abcd and updates n on every match, so only the line number of the last match is kept.
It also stores every line in the array a.
Then, in the END section, it prints the last matching line and the 20 lines after it.
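For large files, a variant that keeps only the lines after the most recent match may be preferable to storing the whole file. This is a sketch, not a drop-in replacement: the function name lines_after_last and its arguments are made up, it prints only the following lines (not the abcd line itself), and the demo uses 2 lines instead of 20 to keep the data short.

```shell
lines_after_last() {
  # $1 = pattern, $2 = number of lines to keep, $3 = file
  awk -v pat="$1" -v n="$2" '
    $0 ~ pat      { seen = 1; c = 0; next }   # new match: start collecting again
    seen && c < n { buf[++c] = $0 }           # keep at most n lines after it
    END           { for (i = 1; i <= c; i++) print buf[i] }
  ' "$3"
}

demo=$(mktemp)
printf '%s\n' abcd a1 a2 abcd b1 b2 b3 > "$demo"

lines_after_last abcd 2 "$demo"
```

With the real data, the call would be lines_after_last abcd 20 file. If fewer than 20 lines follow the last match, it simply prints what is there.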