grep: one pattern works but not the other

I have a tab-delimited file that has gene names in one column and expression values for these genes in the other. I want to delete certain genes from this file using grep. So, this:
"42261" "SNHG7" "20.2678"
"42262" "SNHG8" "25.3981"
"42263" "SNHG9" "0.488534"
"42264" "SNIP1" "7.35454"
"42265" "SNN" "2.05365"
"42266" "snoMBII-202" "0"
"42267" "snoMBII-202" "0"
"42268" "snoMe28S-Am2634" "0"
"42269" "snoMe28S-Am2634" "0"
"42270" "snoR26" "0"
"42271" "SNORA1" "0"
"42272" "SNORA1" "0"
becomes this:
"42261" "SNHG7" "20.2678"
"42262" "SNHG8" "25.3981"
"42263" "SNHG9" "0.488534"
"42264" "SNIP1" "7.35454"
"42265" "SNN" "2.05365"
I've used the following command, which I put together with my limited terminal knowledge:
grep -iv sno* <input.text> | grep -iv rp* | grep -iv U6* | grep -iv 7SK* > <output.txt>
So with this command, my output file lacks genes that start with sno, u6 and 7sk, but somehow grep has deleted all the genes that have "r" in them instead of only the ones that start with "rp". I'm very confused about this. Any ideas why sno* works but rp* does not?
Thanks!

The grep command uses regular expressions, not globbing patterns.
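The pattern rp* means "'r' followed by zero or more 'p'". Because zero 'p's is allowed, it matches every line that contains an 'r' anywhere, which is why all those genes disappeared. What you really want is rp.*, or even better, "rp.* (or even just "rp, since there's no point in matching anything after the "rp"). Likewise, sno* means "'sn' followed by zero or more 'o'", so it only appeared to work: with -i it also drops every line containing "sn", including SNHG7 and SNN from your sample. Again, you'd want sno.* or "sno.* (or even just "sno).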
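A small sketch of the difference, plus one way to fold the filters into a single anchored pattern. The file genes.txt and its four rows are made up for illustration, and the anchored pattern assumes the layout from the question (quoted fields separated by whitespace, gene name in the second field).

```shell
# Hypothetical sample data, not taken from the real file
tmp=$(mktemp -d)
printf '%s\n' '"1" "RPL3" "5.1"' '"2" "BRCA1" "3.2"' '"3" "SNN" "2.0"' '"4" "snoR26" "0"' > "$tmp/genes.txt"

# rp* is "r, then zero or more p": drops every line containing r or R
loose=$(grep -iv 'rp*' "$tmp/genes.txt")

# "rp is a quote followed by rp: drops only names that start with rp
strict=$(grep -iv '"rp' "$tmp/genes.txt")

# all four prefixes at once, anchored to the start of the second quoted field
kept=$(grep -iEv '^"[^"]*"[[:space:]]+"(sno|rp|U6|7SK)' "$tmp/genes.txt")

printf 'loose:\n%s\nstrict:\n%s\nkept:\n%s\n' "$loose" "$strict" "$kept"
rm -rf "$tmp"
```

Here loose keeps only the SNN row, strict loses only RPL3, and kept loses RPL3 and snoR26 while leaving BRCA1 and SNN alone.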

Although this doesn't directly answer your question, there is one thing in your sample command line that you may want to be careful with: Whenever you use a special shell metacharacter (like "*"), you need to escape or quote it. So your command line should look more like:
grep -iv 'sno*' <input.text> | grep -iv 'rp*' | grep -iv 'U6*' | grep -iv '7SK*' > <output.txt>
Often, shells are smart, and if no files match the glob, they will use the text as-is (so if you enter "foo*" but there are no filenames starting with "foo", then the string "foo*" will be passed to the command).

grep -iEv "sno|rp|U6|7SK" yourInput
test:
kent$ cat b
"42261" "SNHG7" "20.2678"
"42262" "SNHG8" "25.3981"
"42263" "SNHG9" "0.488534"
"42264" "SNIP1" "7.35454"
"42265" "SNN" "2.05365"
"42266" "snoMBII-202" "0"
"42267" "snoMBII-202" "0"
"42268" "snoMe28S-Am2634" "0"
"42269" "snoMe28S-Am2634" "0"
"42270" "snoR26" "0"
"42271" "SNORA1" "0"
"42272" "SNORA1" "0"
kent$ grep -iEv "sno|rp|U6|7SK" b
"42261" "SNHG7" "20.2678"
"42262" "SNHG8" "25.3981"
"42263" "SNHG9" "0.488534"
"42264" "SNIP1" "7.35454"
"42265" "SNN" "2.05365"

Related

Grep with at least one matching value and at least one not matching

I have some files, and I want grep to return the lines that contain at least one Position:"Engineer" AND at least one Position whose value is not "Engineer".
So for the file below it should return only the first line:
Position:"Engineer" Name:"Jes" Position:"Accountant" Name:"Criss"
Position:"Engineer" Name:"Eva" Position:"Engineer" Name:"Adam"
I could write something like
grep 'Position:"Engineer"' filename | grep 'Position:"Accountant"'
And this works fine (I get only first line), but the thing is I don't know what are all of the possible values in Position, so the grep needs to be generic something like
grep 'Position:"Engineer"' filename | grep -v 'Position:"Engineer"'
But this doesn't return anything (as both grep contradict each other)
Do you have any idea how this can be done?
This line works :
grep "^Position:\"Engineer\"" filename | grep -v " Position:\"Engineer\""
The first expression, anchored with "^", catches only a Position at the beginning of the line; the second expression, which starts with a space, removes the lines where a later Position is also "Engineer".
You can avoid the pipe and additional subshell by using awk if that is allowed, e.g.
awk '
$1~/Engineer/ {if ($3~/Engineer/) next; print}
$3~/Engineer/ {if ($1~/Engineer/) next; print}
' file
Above just checks if the first field contains Engineer and if so checks if field 3 also contains Engineer, and if so skips the record, if not prints it. The second rule, just swaps the order of the tests. The result of the tests is that Engineer can only appear in one of the fields (either first or third, but not both)
Example Use/Output
With your sample input in file, you would have:
$ awk '
$1~/Engineer/ {if ($3~/Engineer/) next; print}
$3~/Engineer/ {if ($1~/Engineer/) next; print}
' file
Position:"Engineer" Name:"Jes" Position:"Accountant" Name:"Criss"
Use negative lookahead to exclude a pattern after match.
grep 'Position:"Engineer"' filename | grep -P 'Position:"(?!Engineer)'
With two greps in a pipe:
grep -F 'Position:"Engineer"' file | grep -Ev '(Position:"[^"]*").*\1'
or, perhaps more robustly
grep -F 'Position:"Engineer"' file | grep -v 'Position:"Engineer".*Position:"Engineer"'
In the general case, if you want to print the lines with unique Position fields,
grep -Ev '(Position:"[^"]*").*\1' file
should do the job, assuming all the lines have the format specified. This will work also when there are more than two Position fields in the line.
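If a line can carry more than two Position fields and you still want "at least one Engineer and at least one non-Engineer", counting the fields is another option. This is only a sketch: it assumes the fields are whitespace-separated with no spaces inside the values, and the third sample line is invented to show the three-field case.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
Position:"Engineer" Name:"Jes" Position:"Accountant" Name:"Criss"
Position:"Engineer" Name:"Eva" Position:"Engineer" Name:"Adam"
Position:"Engineer" Name:"Ann" Position:"Engineer" Name:"Bob" Position:"Clerk" Name:"Cy"
EOF

# Count Engineer positions and all other positions on each line;
# print the line only when both counts are non-zero.
mixed=$(awk '{
  eng = 0; other = 0
  for (i = 1; i <= NF; i++) {
    if ($i == "Position:\"Engineer\"") eng++
    else if ($i ~ /^Position:/) other++
  }
  if (eng && other) print
}' "$tmp/file")

printf '%s\n' "$mixed"
rm -rf "$tmp"
```

With this input the first and third lines are printed; the second, where every Position is Engineer, is skipped.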

Get content inside brackets using grep

I have text that looks like this:
Name (OneData) [113C188D-5F70-44FE-A709-A07A5289B75D] (MoreData)
I want to use grep or some other way to get the ID inside [].
How to do it?
You can do something like this via bash (GNU grep required):
t="Name (OneData) [113C188D-5F70-44FE-A709-A07A5289B75D] (MoreData)"
echo "$t" | grep -Po "(?<=\[).*(?=\])"
The pattern will give you everything between the brackets, and uses a zero-width look-behind assertion (?<= ...) to eliminate the opening bracket and uses a zero-width look-ahead assertion (?= ...) to eliminate the closing bracket.
The -P flag enables Perl-style regexes, which support these assertions and need less escaping. The -o flag prints only the matched text rather than the whole line; the brackets are left out because look-around assertions do not consume any characters.
If you don't have GNU grep available, you can solve the problem in two steps (there are probably also other solutions):
Get the ID with the brackets (\[.*\])
Remove the brackets (] and [, here via sed, for example)
echo "$t" | grep -o "\[.*\]" | sed 's/[][]//g'
As Cyrus commented, you can also use the pattern grep -oE '[0-9A-F-]{36}' if you can ensure not having strings of length 36 or larger containing only the characters 0-9, A-F and - and if all the IDs have the length of 36 characters, of course. Then you can simply ignore the brackets.
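For completeness, a sed-only sketch that needs neither -P nor a second command. It assumes there is one bracketed ID per line and that the ID itself contains no ] character.

```shell
t='Name (OneData) [113C188D-5F70-44FE-A709-A07A5289B75D] (MoreData)'

# Capture everything between [ and the next ], and print only lines where that worked
id=$(printf '%s\n' "$t" | sed -n 's/.*\[\([^]]*\)].*/\1/p')

printf '%s\n' "$id"
```

The -n together with the p flag means lines without brackets produce no output at all instead of being echoed unchanged.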

Grep words with exactly two vowels

I have the following issue, I need to retrieve all words that contains exactly 2 vowels (in any order) from a file. The file only contains one word per line.
My current workaround is:
Grep1: Retrieve words such as earth, over, under, one...
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > A.txt
and
Grep2: Retrieve words such as formless, deep, said...
grep -i "^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > B.txt
The above solution works, but when I concatenate both regexes into a single regex it returns nothing!
Mother of Grep1 & Grep2: should retrieve everything!
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$|^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words
I think the issue is around my use of ^ and $ in the expression, but I have tried different versions with no success!
Any help will be highly appreciated!
OS is AIX 6100-09-04-1441
You were close. This should work:
grep -i "^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > A.txt
So it should find all eight possibilities (two vowels delimit three non-vowel sequences, each possibly empty; 2^3 is 8):
[ ]I[ ]o[ ]
[ ]e[ ]a[r]
[ ]e[r]a[ ]
[ ]e[l]a[n]
[T]e[ ]a[ ]
[D]e[ ]a[r]
[D]e[w]a[r]
[D]a[w]a[ ]
[H]a[w]a[y]
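As for concatenation, | is an ordinary character in basic regular expressions. GNU grep accepts \| (with \( and \) for grouping) as an extension; with grep -E you write | and ( ) unescaped. Either way you can use a single pair of anchors:
^\(regexp1\|regexp2\)$ (GNU grep, basic syntax)
^(regexp1|regexp2)$ (grep -E)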
Since the * can match 0 times or more you should be able to start the string with [^aeiou]*: try
"^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$"
As for fixing your regex, I think you need to escape the bar as \|, so
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$\|^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words
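Since \| is a GNU extension to basic regular expressions, a grep that lacks it may treat it as literal text. A sketch of the same alternation with grep -E, where | and ( ) are standard, follows; the word list is made up, and this has not been tried on AIX.

```shell
tmp=$(mktemp -d)
# Hypothetical stand-in for genesis.words
printf '%s\n' earth over deep formless sky beautiful a > "$tmp/genesis.words"

# Extended syntax: plain | for alternation, one pair of anchors around the group
two=$(grep -iE '^([aeiou][^aeiou]*[aeiou][^aeiou]*|[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*)$' "$tmp/genesis.words")

printf '%s\n' "$two"
rm -rf "$tmp"
```

From this list it keeps earth, over, deep and formless, and drops sky (no vowels), a (one) and beautiful (five).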
If you don't mind Perl, you could use this:
perl -lne '$m=$_; tr/aeiouAEIOU//cd; print $m if length()==2;' /usr/share/dict/words
That says... "save the current line (word) in $m. Delete everything that is not a vowel. Print the original word if there are two things (i.e vowels) left."
Note that I am using the system dictionary as input for my tests.
You could do pretty much the same thing in awk.
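A minimal awk sketch of that idea: gsub returns the number of replacements it made, so running it on a copy of the line gives the vowel count without altering what gets printed. The file words.txt and its contents are invented for the example.

```shell
tmp=$(mktemp -d)
printf '%s\n' earth Over deep sky beautiful > "$tmp/words.txt"

# Strip vowels from a copy (w) of the line; n is how many were removed.
# The bare pattern n == 2 then prints the untouched original line.
two=$(awk '{ w = $0; n = gsub(/[aeiouAEIOU]/, "", w) } n == 2' "$tmp/words.txt")

printf '%s\n' "$two"
rm -rf "$tmp"
```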
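If you're able to use an alternative to grep, tr with wc works well:
words=/path/to/words.txt
# count the vowels (either case) in each word and keep the words that have exactly two
while IFS= read -r word; do
  v=$(printf '%s\n' "$word" | tr -cd 'aeiouAEIOU' | wc -c)
  [[ $v -eq 2 ]] && printf '%s\n' "$word" >> output.txt
done < "$words"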
This reads the original file line by line, counts the vowels & returns results with only 2 to output.txt.

How to filter using grep on a selected word

grep (GNU grep) 2.14
Hello,
I have a log file that I want to filter on a selected word. However, the filter matches more than I want. For example:
tail -f gateway-* | grep "P_SIP:N_iptB1T1"
This will also find words like this:
"P_SIP:N_iptB1T10"
"P_SIP:N_iptB1T11"
"P_SIP:N_iptB1T12"
etc
However, I don't want to display anything after the 1. grep is picking up 11, 12, 13, etc.
Many thanks for any suggestions,
You can restrict the word to end at 1:
tail -f gateway-* | grep "P_SIP:N_iptB1T1\>"
This works as long as the lines you want contain exactly "P_SIP:N_iptB1T1" and not a longer word.
But if you want to extract it from words like P_SIP:N_iptB1T1x and display only the matching part, use -o:
grep -o "P_SIP:N_iptB1T1"
-o, --only-matching show only the part of a line matching PATTERN
At least two approaches can be tried:
grep -w pattern matches for full words. Seems to work for this case too, even though the pattern has punctuation.
grep pattern -m 1 to restrict the output to first match. (Also doable with grep xxx | head -1)
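A quick sketch of the -w behaviour on this kind of input. The log lines are invented; only the P_SIP tokens come from the question.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'call P_SIP:N_iptB1T1 up' 'call P_SIP:N_iptB1T10 up' 'call P_SIP:N_iptB1T11 up' > "$tmp/gateway.log"

# Plain substring match: counts all three lines
plain=$(grep -c 'P_SIP:N_iptB1T1' "$tmp/gateway.log")

# -w: the match must not be followed (or preceded) by a letter, digit or underscore
word=$(grep -w 'P_SIP:N_iptB1T1' "$tmp/gateway.log")

printf '%s matching lines without -w\n%s\n' "$plain" "$word"
rm -rf "$tmp"
```

The trailing 0 and 1 in the longer tokens count as word characters, so -w rejects them.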
If the lines contains the quotes as in your example, just use the -E option in grep and match the closing quote with \". For example:
grep -E "P_SIP:N_iptB1T1\"" file
If these quotes aren't in the text file, and there's blank spaces or endlines after the word, you can match these too:
# The word is followed by one or more blanks
grep -E "P_SIP:N_iptB1T1\s+" file
# Match lines ending with the interesting word
grep -E "P_SIP:N_iptB1T1$" file

Recursively grep results and pipe back

I need to find some matching conditions from a file, and then search for each subsequent condition only in the files that matched the previous one. I have something like this:
input.txt
123
22
33
The terms above need to be found in a set of files. The challenge is that if 123 is found in, say, 10 files, then 22 should be searched for in those 10 files only, and so on.
The files are named like f1, f2, f3, f4 ... f1200,
so it is as if I need grep -w "123" f* | grep -w "123" | .....
It's not possible to list them manually, so is there an easier way?
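You can solve this with awk by generating the pipeline and handing it to the shell. I've encountered a similar problem, and this approach works:
awk 'NR==1 {printf "grep -lw -- %s f*", $1; next} {printf " | xargs grep -lw -- %s", $1} END {print ""}' input.txt | sh
What does it do?
It reads input.txt line by line.
For the first line it prints grep -lw <term> f*, which lists the files containing the first term.
For every later line it appends | xargs grep -lw <term> (note the pipe), so each term is searched only in the files that survived the previous step.
At the end it prints a newline, and the finished command line is sent to sh for execution.
The output is the list of files that contain every term. This assumes the file names contain no whitespace.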
Perhaps taking a meta-programming viewpoint would help. Have grep output a series of grep commands. Or write a little Perl program. Maybe Ruby, if the mood suits.
You can use grep -lw to print just the names of the files that matched (note that with -l, grep stops reading each file after its first match).
You capture the list of file names and use that for the next iteration in a loop.
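A sketch of that loop, using the shell's positional parameters to hold the shrinking file list. The three sample files are invented, and the unquoted command substitution assumes file names without whitespace or glob characters (true for names like f1 ... f1200).

```shell
tmp=$(mktemp -d)
printf '%s\n' 123 22 33 > "$tmp/input.txt"
printf '%s\n' 123 22 33 > "$tmp/f1"   # contains all three terms
printf '%s\n' 123 22    > "$tmp/f2"   # lacks 33
printf '%s\n' 22 33     > "$tmp/f3"   # lacks 123

set -- "$tmp"/f*                      # start with every candidate file
while IFS= read -r term; do
  [ "$#" -gt 0 ] || break             # nothing left to search
  # keep only the files that contain this term as a whole word
  set -- $(grep -lw -- "$term" "$@")
done < "$tmp/input.txt"

matches=$*
printf '%s\n' "$matches"
rm -rf "$tmp"
```

Each pass hands grep only the survivors of the previous pass, so the work shrinks as the list narrows; here only f1 remains at the end.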
