search for words after splitting on delimiter - grep

I am trying to select all tags having "3" next to a word delimited by /
# cat test.txt
test/1,2,3
new/3
one/2,3
more/1,2,4,5
123456/1,2,4,5
I cannot use a plain grep because it matches a 3 anywhere in the line, and I am looking for that digit only after the /
# grep '3' test.txt
test/1,2,3
new/3
one/2,3
123456/1,2,4,5
This is close, but does not return the entry "new/3":
# grep '/*,3' test.txt
test/1,2,3
one/2,3
What is the correct regular expression for this?
Expected output:
test/1,2,3
one/2,3
new/3

I suggest:
grep '/.*\b3\b' test.txt
Output:
test/1,2,3
new/3
one/2,3
\b: a zero-width word boundary
See: The Stack Overflow Regular Expressions FAQ
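Note that \b is not part of POSIX regular expressions, and it treats any non-word character as a boundary, so the pattern above would also accept a line such as x/a-3. As a sketch of a stricter variant, assuming the part after / is always a comma-separated list of integers (an assumption based on the sample data, not something the question states), the 3 can be matched as a whole list item with plain ERE. The last sample line, x/13,30, is a hypothetical addition to show that digits inside longer numbers are not matched:

```shell
# Sample data from the question, plus one hypothetical line (x/13,30).
data=$(printf '%s\n' 'test/1,2,3' 'new/3' 'one/2,3' 'more/1,2,4,5' '123456/1,2,4,5' 'x/13,30')

# After the slash: zero or more "number," items, then 3, then a comma or end of line.
result=$(printf '%s\n' "$data" | grep -E '/([0-9]+,)*3(,|$)')
printf '%s\n' "$result"
```

This should print the same three lines as the \b version for the sample data.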

Related

Regex for line containing one or more spaces or dashes

I have a .txt file with city names, each on a separate line. Some of them consist of several words separated by one or more spaces, or of words connected with '-'. I need a bash command that prints those lines. Currently I'm using cat piped into grep, but I can't get both spaces and dashes into one search, and I had problems with matching multiple spaces.
print lines with dash:
cat file.txt | grep ".*-.*"
print lines with spaces:
cat file.txt | grep ".*\s.*"
However, when I try:
cat file.txt | grep ".*\s+.*"
I get nothing.
Thanks for help
Something like this should work:
grep -E -- ' |-' file.txt
Explanation:
-E: to interpret patterns as extended regular expressions
--: to signify the end of command options
' |-': the line contains either a space or a dash (the dash needs no backslash here; recent GNU grep releases warn about a stray \ in '\-')
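The -- matters mostly when the pattern itself begins with a dash. A small sketch (the sample lines are made up for illustration) showing two common ways to pass such a pattern, -- and -e:

```shell
# Hypothetical sample input; the third line starts with a dash.
lines=$(printf '%s\n' 'Stratford-upon-Avon' 'Oslo' '-x marks the spot')

# Without -- or -e, grep would try to parse '-x' as an option.
with_dd=$(printf '%s\n' "$lines" | grep -- '-x')
with_e=$(printf '%s\n' "$lines" | grep -e '-x')

printf '%s\n' "$with_dd" "$with_e"
```

Both commands should select only the line that contains -x.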
This does not directly address your question, but is too much to put in a comment.
You don't need the .* in your patterns. .* at the beginning or end of a pattern is useless, because it means "0 or more of any character" and so will always match.
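These commands are all equivalent:
cat file.txt | grep ".*-.*"
cat file.txt | grep -e "-.*"
cat file.txt | grep -e "-"
(Once the pattern starts with a dash, pass it with -e or after --, otherwise grep may parse it as options.)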
Plus you don't need to cat and pipe:
grep "-" file.txt
When grep pattern matches, the default action is to print the whole line, so .* in all your patterns are redundant, you may delete them. Also, you don't have to use cat file | as you may specify the file to grep directly after pattern, i.e. grep 'pattern' file.txt.
Here are some more details:
grep ".*-.*" = grep -- "-" - returns any lines having a - char (-- signals the end of options, the next thing is the pattern)
grep ".*\s.*" = grep "\s" - matches and returns lines containing a whitespace char (only GNU grep)
grep ".*\s+.*" = grep "\s+" - returns lines containing a whitespace char followed by a literal + char (since you are using POSIX BRE regex here, the unescaped + matches a literal plus symbol).
You want
grep "[[:space:]-]" file.txt
See the online demo:
#!/bin/bash
s='abc - def
ghi
jkl mno'
grep '[[:space:]-]' <<< "$s"
Output:
abc - def
jkl mno
The [[:space:]-] POSIX BRE and ERE (enabled with -E option) compliant pattern matches either any whitespace (with the [:space:] POSIX character class) or a hyphen.
Note that [\s-] won't work since \s inside a bracket expression is not treated as a regex escape sequence but as a mere \ or s.
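To see why the \s+ attempt printed nothing, here is a small sketch with made-up city names (the list is illustrative, not from the question). It compares + in a basic regular expression, where it is a literal character, with the same pattern under -E, where it is a quantifier:

```shell
# Hypothetical city list; one name has two consecutive spaces.
cities=$(printf '%s\n' 'New York' 'Oslo' 'Stratford-upon-Avon' 'Rio  de Janeiro')

# BRE: '+' is literal, so this looks for whitespace followed by a plus sign.
# '|| true' keeps the script going when grep finds nothing (exit status 1).
bre=$(printf '%s\n' "$cities" | grep '[[:space:]]+' || true)

# ERE: '+' means "one or more"; the alternation also accepts a dash.
ere=$(printf '%s\n' "$cities" | grep -E '[[:space:]]+|-')

printf 'BRE matches: %s\n' "${bre:-none}"
printf 'ERE matches:\n%s\n' "$ere"
```

The single bracket expression [[:space:]-] remains the shorter choice when one occurrence is enough to select the line.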

How to make "grep" output complete word that includes the match?

I would like grep to print out all complete words that include the match.
Google did not help me. Here is what I tried:
cat file.txt
21676 Mm.24685 NM_009346 ENSMUSG00000055320
20349 Mm.134093 NM_011348 ENSMUSG00000063531
12456 Mm.134000 NM_011228 GM415666
grep -o "ENSMUS" file.txt
ENSMUS
ENSMUS
Desired output:
ENSMUSG00000055320
ENSMUSG00000063531
Thanks for your help!
You may use:
grep -wo "ENSMUS[^[:blank:]]*" file.txt
ENSMUSG00000055320
ENSMUSG00000063531
Here [^[:blank:]]* will match 0 or more characters that are not whitespaces. -w will ensure full word matches.
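The pattern above finds words that start with ENSMUS. As an alternative sketch for words that contain the match anywhere, assuming words are separated by whitespace, awk can test each field and print the ones that match. The search string MUSG is chosen here only to show a match in the middle of a word:

```shell
# Sample data from the question.
data=$(printf '%s\n' \
  '21676 Mm.24685 NM_009346 ENSMUSG00000055320' \
  '20349 Mm.134093 NM_011348 ENSMUSG00000063531' \
  '12456 Mm.134000 NM_011228 GM415666')

# Loop over the whitespace-separated fields and print each one containing "MUSG".
words=$(printf '%s\n' "$data" | awk '{ for (i = 1; i <= NF; i++) if ($i ~ /MUSG/) print $i }')
printf '%s\n' "$words"
```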
To extract ENSEMBL mouse accession numbers without the version number:
grep -Po 'ENSMUS\w+' in_file
With the version number:
grep -Po 'ENSMUS\S+' in_file
Here,
\w+ : 1 or more word characters ([A-Za-z0-9_]).
\S+ : 1 or more non-whitespace characters (you can also be more restrictive and use [\w.]+, which is 1 or more word character or literal dot).
Here, GNU grep uses the following options:
-P : Use Perl regexes.
-o : Print only the matched parts (each match on its own output line), not the entire lines.
SEE ALSO:
grep manual
perlre - Perl regular expressions

Cutting a length of specific string with grep

Let's say we have a string "test123" in a text file.
How do we cut out "test12" only or let's say there is other garbage behind "test123" such as test123x19853 and we want to cut out "test123x"?
I tried with grep -a "test123.\{1,4\}" testasd.txt and so on, but just can't get it right.
I also looked for examples, but never found what I'm looking for.
With expr:
kent$ x="test123x19853"
kent$ expr "$x" : '\(test.\{1,4\}\)'
test123x
What you need is -o, which prints only the matched part:
$ echo "test123x19853"|grep -o "test.\{1,4\}"
test123x
$ echo "test123x19853"|grep -oP "test.{1,4}"
test123x
-o, --only-matching show only the part of a line matching PATTERN
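The same -o approach also covers the first case in the question, cutting out only test12. A sketch, assuming the prefix is the literal text test followed by digits (the variable names are introduced here for illustration):

```shell
s='test123x19853'

# "test" plus exactly two digits.
first=$(printf '%s\n' "$s" | grep -o 'test[0-9]\{2\}')

# "test", any run of digits, then one lowercase letter.
second=$(printf '%s\n' "$s" | grep -o 'test[0-9]*[a-z]')

printf '%s\n' "$first" "$second"
```

Spelling out the character classes, rather than using . with a length, makes the cut point depend on the content instead of on a fixed number of characters.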
If you are OK with awk, then try the following (note: this looks for a run of letters followed by a run of digits; it does not limit the digits to 4 or 5).
echo "test123x19853" | awk 'match($0,/[a-zA-Z]+[0-9]+/){print substr($0,RSTART,RLENGTH)}'
If you want only 1 to 4 digits after the first run of letters, try the following (my awk is an old version, so I am using --re-interval; you can drop it if you have a recent version of awk).
echo "test123x19853" | awk --re-interval 'match($0,/[a-zA-Z]+[0-9]{1,4}/){print substr($0,RSTART,RLENGTH)}'

Grep only exact last 4 digits from Number file

I want to grep only the numbers whose last 4 digits are exactly 0077.
$ cat test
12298700077
56198700770
23192604888
34198701041
89198701285
$ cat test | grep 0077
12298700077
56198700770
The required output is just this:
12298700077
Anchor the pattern with '$', which matches the null string at the end of a line (see man 7 regex):
$ grep '0077$' file
12298700077
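Quoting the pattern keeps the shell from touching the $. A short sketch using the numbers from the question; the suffix variable is a name introduced here for illustration:

```shell
# Sample data from the question.
numbers=$(printf '%s\n' 12298700077 56198700770 23192604888 34198701041 89198701285)

suffix=0077   # hypothetical variable holding the digits to look for

# "\$" inside double quotes yields a literal $, which anchors the match at end of line.
matches=$(printf '%s\n' "$numbers" | grep "${suffix}\$")
printf '%s\n' "$matches"
```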

grep return only one match per line

example file :
foobar random text foobar random text foobar
text
text
text
If I use grep to search for the word foobar, how can I prevent grep from returning the first line 3 times because it finds foobar 3 times? I would like only one result per line, even if the word occurs multiple times on that line.
Simple awk alternative:
awk '/\<foobar\>/{print NR,"foobar"}' file
The output (for your example input):
1 foobar
\< and \> match the start and end of a word (a GNU awk extension)
NR - contains current line number
With perl:
perl -ne 'print $.," ",$1,"\n" if /\b(foobar)\b/' file
The output:
1 foobar
file.txt:
foobar random text foobar random text foobar
text
text
text
command:
grep foobar file.txt
output:
foobar random text foobar random text foobar
grep version: GNU grep 3.4
So, the line containing foobar is shown only once. If you see more lines, include option -n to see the line numbers of each output line, i.e.,
grep -n foobar file.txt
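One possible reason for seeing three results is the -o option, which prints every match on its own output line; that is an assumption, since the question does not show the exact command. The sketch below contrasts the two behaviours by counting:

```shell
# Example file content from the question.
text=$(printf '%s\n' 'foobar random text foobar random text foobar' 'text' 'text' 'text')

# -c counts matching lines, regardless of how many matches each line holds.
line_count=$(printf '%s\n' "$text" | grep -c 'foobar')

# -o emits one output line per match, so wc -l counts individual matches.
match_count=$(printf '%s\n' "$text" | grep -o 'foobar' | wc -l | tr -d ' ')

printf 'matching lines: %s, matches: %s\n' "$line_count" "$match_count"
```

Dropping -o, if it was in use, brings the output back to one line per matching input line.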
