Whitespace in search string when matching using grep

I have a file which looks like this.
10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
10gs+VWW+A+210 10gs-GLU-A-31- 0.363077 0.151282
10gs+VWW+A+210 10gs-GLY-A-207 0.602564 0.060256
10gs+VWW+A+210 10gs-LEU-A-132 0.378151 0.288462
10gs+VWW+A+210 10gs-LEU-A-60- 0.376812 0.133333
10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282
15gs+VWW+A+210 11ba-VAL-A-47- 0.413793 0.215385
I want to grep out the lines that match a pattern [inclusive of the whitespace in it]. Let's say the pattern is: '10gs+VWW+A+210 11ba-'
When I give such a pattern as an argument to grep, I get the matching lines correctly. However the problem arises when I want to match multiple patterns like these from a file say pattern.txt which has a list of all these patterns on each line.
pattern.txt looks like this:
10gs+VWW+A+210 11ba-
10gs+VWW+A+210 10gs-
When I use a shell script like this:
for i in `cat pattern.txt`; do grep -e "^$i" bigfile.txt ; done
the command takes 10gs+VWW+A+210 separately and 11ba separately to match. I want to match the entire thing (separated by a space) i.e. 10gs+VWW+A+210 11ba to be matched, and not the two strings separately.
How do I modify the existing shell script to overcome the white space character in the search string?
Also, the file I am matching these strings against is large (~50GB), so a memory-efficient solution is welcome.
Thanks.

Replace spaces with other symbols
Assuming # never occurs in the patterns
for i in $( cat pattern.txt | tr ' ' '#' ) ; do
j=$(echo "$i" | tr '#' ' ' )
grep -e "^$j" bigfile.txt
done
Timing on my test file
real 0m20.739s
user 0m11.773s
sys 0m8.345s
Use -f flag in grep
grep -f pattern.txt bigfile.txt
Timing on the same test file
real 0m2.190s
user 0m2.163s
sys 0m0.026s
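In other words, grep -f was about 10 times faster on this test, since it reads bigfile.txt once instead of once per pattern. Note that patterns read with -f are not anchored; to match only at the start of a line, as the loop does, prefix each line of pattern.txt with ^.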
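The two approaches above can be sketched side by side as follows. This is only an illustration on a three-line stand-in for bigfile.txt; the helper names loop_match and single_pass_match are hypothetical. The loop uses read instead of word-splitting the output of cat, so each pattern keeps its space, and the single-pass variant writes a ^-anchored copy of the pattern list for grep -f.

```shell
#!/usr/bin/env bash
# Hypothetical demo data standing in for the files from the question.
tmp=$(mktemp -d)
cat > "$tmp/bigfile.txt" <<'EOF'
10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282
EOF
printf '%s\n' '10gs+VWW+A+210 11ba-' '10gs+VWW+A+210 10gs-' > "$tmp/pattern.txt"

# One grep per pattern: read keeps each line intact, spaces included.
# $1 = pattern file, $2 = file to search
loop_match() {
    while IFS= read -r pat; do
        grep -e "^$pat" "$2"
    done < "$1"
}

# One pass over the data: prefix every pattern with ^, then use grep -f.
# $1 = pattern file, $2 = file to search
single_pass_match() {
    anchored=$(mktemp)
    sed 's/^/^/' "$1" > "$anchored"
    grep -f "$anchored" "$2"
    rm -f "$anchored"
}

loop_match "$tmp/pattern.txt" "$tmp/bigfile.txt"
single_pass_match "$tmp/pattern.txt" "$tmp/bigfile.txt"
```

Both print the same two lines, but the loop prints them in pattern order while grep -f prints them in file order. Either way grep streams the input, so memory use should not grow with the size of the searched file.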

Does the following command and corresponding result suit you? The patterns are separated by a pipe (|) so that a line matching either one is printed, and each + is escaped because it is a quantifier in extended regular expressions.
Command:
grep -E '10gs\+VWW\+A\+210 11ba-|10gs\+VWW\+A\+210 10gs-' bigfile.txt
Result:
10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
10gs+VWW+A+210 10gs-GLU-A-31- 0.363077 0.151282
10gs+VWW+A+210 10gs-GLY-A-207 0.602564 0.060256
10gs+VWW+A+210 10gs-LEU-A-132 0.378151 0.288462
10gs+VWW+A+210 10gs-LEU-A-60- 0.376812 0.133333
10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
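If escaping each + is a nuisance, one possible alternative is to treat the patterns as fixed strings with -F. The sketch below compares that with an anchored -E form on a small stand-in file; the helper names fixed_match and regex_match are made up for the example. Note that -F cannot anchor to the start of the line, so it matches the string anywhere in the line.

```shell
#!/usr/bin/env bash
# Hypothetical demo data standing in for bigfile.txt.
tmp=$(mktemp -d)
cat > "$tmp/bigfile.txt" <<'EOF'
10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
15gs+VWW+A+210 11ba-VAL-A-47- 0.413793 0.215385
EOF

# -F: every pattern is a literal string, so + needs no backslash.
fixed_match() {
    grep -F -e '10gs+VWW+A+210 11ba-' -e '10gs+VWW+A+210 10gs-' "$1"
}

# -E: + must be escaped; the shared prefix is written once and anchored with ^.
regex_match() {
    grep -E '^10gs\+VWW\+A\+210 (11ba|10gs)-' "$1"
}

fixed_match "$tmp/bigfile.txt"
regex_match "$tmp/bigfile.txt"
```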

Related

Can grep print its matches to multiple lines, even if found on the same line?

For example, with the following string:
[:variable_one] == options[:variable_two]
and the following grep argument:
grep -Eo "\[\:.*?\]"
It will show the output of:
[:variable_one] == options[:variable_two]
but instead, I'm looking to get an output of:
[:variable_one]
[:variable_two]
Is there a way to "split" each match into a separate line, even if it finds multiple matches on a single line? Basically looking for the opposite answer of this: Print multiple regex matches using grep on the same line
The : and ] (that is not part of a bracket expression) chars are not special inside a regex pattern. *? is treated as * in the POSIX ERE pattern, so it is too greedy and matches until the rightmost occurrence of ].
A POSIX BRE compliant regex for use with grep can look like
#!/bin/bash
s='[:variable_one] == options[:variable_two]'
grep -o "\[:[^][]*]" <<< "$s"
Output:
[:variable_one]
[:variable_two]
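To see the difference between the greedy pattern and the bracket-expression pattern, the following sketch runs both on the sample string. The function names are hypothetical and exist only for this comparison.

```shell
#!/usr/bin/env bash
s='[:variable_one] == options[:variable_two]'

# Greedy: .* runs to the last ], so the whole line is a single match.
greedy_matches() {
    printf '%s\n' "$s" | grep -o '\[:.*]'
}

# Negated bracket expression: [^][]* cannot cross a [ or ],
# so each [:...] group is matched and printed on its own line.
separate_matches() {
    printf '%s\n' "$s" | grep -o '\[:[^][]*]'
}

greedy_matches
separate_matches
```

The first call prints the whole input line once; the second prints the two bracketed groups on separate lines.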

Get content inside brackets using grep

I have text that looks like this:
Name (OneData) [113C188D-5F70-44FE-A709-A07A5289B75D] (MoreData)
I want to use grep or some other way to get the ID inside [].
How to do it?
You can do something like this via bash (GNU grep required):
t="Name (OneData) [113C188D-5F70-44FE-A709-A07A5289B75D] (MoreData)"
echo "$t" | grep -Po "(?<=\[).*(?=\])"
The pattern will give you everything between the brackets, and uses a zero-width look-behind assertion (?<= ...) to eliminate the opening bracket and uses a zero-width look-ahead assertion (?= ...) to eliminate the closing bracket.
The -P flag activates Perl-compatible regular expressions, which provide the look-around assertions used here. The -o flag prints only the matched part of the line; because the assertions are zero-width, the brackets themselves are not part of the match.
If your grep does not support -P (it is a GNU extension), you can solve the problem in two steps (there are probably also other solutions):
Get the ID with the brackets (\[.*\])
Remove the brackets (] and [, here via sed, for example)
echo "$t" | grep -o "\[.*\]" | sed 's/[][]//g'
As Cyrus commented, you can also use the pattern grep -oE '[0-9A-F-]{36}' if you can ensure not having strings of length 36 or larger containing only the characters 0-9, A-F and - and if all the IDs have the length of 36 characters, of course. Then you can simply ignore the brackets.
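As a further option, not taken from the answers above, a single sed substitution with a capture group can extract the ID without -P. The sketch below shows it next to the fixed-length pattern; the function names are made up, and -o is assumed to be available (it is in GNU and BSD grep).

```shell
#!/usr/bin/env bash
t='Name (OneData) [113C188D-5F70-44FE-A709-A07A5289B75D] (MoreData)'

# Capture everything between [ and the next ], print only the capture.
id_by_sed() {
    printf '%s\n' "$t" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Fixed-length variant: 36 characters drawn from 0-9, A-F and -.
id_by_length() {
    printf '%s\n' "$t" | grep -oE '[0-9A-F-]{36}'
}

id_by_sed
id_by_length
```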

Grep match only before ":"

Hello, how can I make grep match only the part before the first : mark?
If I run grep test1 file, it shows all three lines.
test1:x:29688:test1,test2
test2:x:22611:test1
test3:x:25163:test1,test3
But I would like to get an output test1:x:29688:test1,test2
I would appreciate any advice.
If the desired lines always start with test1 then you can do:
grep '^test1' file
If it's always followed by : but not the other (potential) matches then you can include it as part of the pattern:
grep 'test1:' file
As your data is organized in rows, with columns delimited by a character, you may consider awk:
awk -F: '$1 == "test1"' file
I think that you just need to add ":" after "test1", for example:
grep "test1:" file
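The suggestions above can be combined: anchoring with ^ and including the colon restricts the match to the first field, which is what the awk version tests explicitly. A small sketch on the sample data, with hypothetical helper names:

```shell
#!/usr/bin/env bash
# Sample data from the question, written to a temporary file.
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
test1:x:29688:test1,test2
test2:x:22611:test1
test3:x:25163:test1,test3
EOF

# ^ pins the match to the start of the line, : ends the first field,
# so neither "test10:" nor a later field can match.
by_anchor() {
    grep '^test1:' "$1"
}

# Field comparison: split on : and compare the first field exactly.
by_field() {
    awk -F: '$1 == "test1"' "$1"
}

by_anchor "$tmp/file"
by_field "$tmp/file"
```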

Grep words with exact two vowels

I have the following issue, I need to retrieve all words that contains exactly 2 vowels (in any order) from a file. The file only contains one word per line.
My current workaround is:
Grep1: Retrieve words such as earth, over, under, one...
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > A.txt
and
Grep2: Retrieve words such as formless, deep, said...
grep -i "^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > B.txt
The above solution works, but when I combine both regexes into a single regex it returns nothing!
Mother of Grep1 & Grep2: should retrieve everything!
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$|^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words
I think the issue is with my use of ^ and $ in the expression, but I have tried different versions with no success!
Any help will be highly appreciated!
OS is AIX 6100-09-04-1441
You were close. This should work:
grep -i "^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words > A.txt
So it should find all eight possibilities (the two vowels delimit three non-vowel sequences, each possibly empty; 2^3 is 8):
[ ]I[ ]o[ ]
[ ]e[ ]a[r]
[ ]e[r]a[ ]
[ ]e[l]a[n]
[T]e[ ]a[ ]
[D]e[ ]a[r]
[D]e[w]a[r]
[D]a[w]a[ ]
[H]a[w]a[y]
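As for concatenation, | is an alternation operator only in extended regular expressions. With plain grep (basic regular expressions) it is a literal character; GNU grep accepts \| as an extension, and elsewhere you can use grep -E. You can also use a single pair of anchors around a group, written for grep -E as:
^(regexp1|regexp2)$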
Since the * can match 0 times or more you should be able to start the string with [^aeiou]*: try
"^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$"
As for fixing your regex, I think you need to escape the bar as \| (a GNU grep extension in basic regular expressions; with other greps, use grep -E and a plain |), so
grep -i "^[aeiou][^aeiou]*[aeiou][^aeiou]*$\|^[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$" genesis.words
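For a grep without the \| extension (which may include the AIX grep mentioned in the question; that is an assumption, not something tested here), a sketch of the same alternation with grep -E and a single pair of anchors could look like this. The word list and function names are invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical word list, one word per line.
tmp=$(mktemp -d)
printf '%s\n' earth over formless deep sky beautiful a > "$tmp/words"

# Alternation of the two original patterns, anchored once around a group.
two_vowels_alternation() {
    grep -Ei '^([aeiou][^aeiou]*[aeiou][^aeiou]*|[^aeiou][^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*)$' "$1"
}

# Simplified single pattern: the leading consonant run may be empty.
two_vowels_single() {
    grep -i '^[^aeiou]*[aeiou][^aeiou]*[aeiou][^aeiou]*$' "$1"
}

two_vowels_alternation "$tmp/words"
two_vowels_single "$tmp/words"
```

Both functions should select earth, over, formless and deep from this list, which suggests the alternation is not needed once the first [^aeiou]* is allowed to match nothing.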
If you don't mind Perl, you could use this:
perl -lne '$m=$_; tr/aeiou//cd; print $m if length()==2;' /usr/share/dict/words
That says... "save the current line (word) in $m. Delete everything that is not a lowercase vowel. Print the original word if there are two things (i.e. vowels) left."
Note that I am using the system dictionary as input for my tests.
You could do pretty much the same thing in awk.
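A minimal awk sketch of that idea, assuming one word per line: gsub returns the number of substitutions it made, so deleting the vowels from a lowercased copy of the line yields the vowel count. The word list and the function name two_vowels_awk are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical word list, one word per line.
tmp=$(mktemp -d)
printf '%s\n' Earth over formless deep sky beautiful a > "$tmp/words"

# Count vowels case-insensitively and print words that have exactly two.
two_vowels_awk() {
    awk '{ w = tolower($0); n = gsub(/[aeiou]/, "", w); if (n == 2) print }' "$1"
}

two_vowels_awk "$tmp/words"
```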
If you're able to use an alternative to grep, tr with wc works well:
words=/path/to/words.txt
while IFS= read -r word; do
# count the lowercase vowels left after deleting every other character
v=$(printf '%s' "$word" | tr -cd 'aeiou' | wc -c)
[[ $v -eq 2 ]] && echo "$word" >> output.txt
done < "$words"
This reads the original file line by line, counts the vowels, and appends the words with exactly 2 vowels to output.txt.

Opposite of "only-matching" in grep?

Is there any way to do the opposite of showing only the matching part of strings in grep (the -o flag), that is, show everything except the part that matches the regex?
That is, the -v flag is not the answer, since it would drop the lines containing the match entirely; I want to show these lines, just without the part that matches.
EDIT: I wanted to use grep over sed, since it can do "only-matching" matches on multi-line, with:
cat file.xml|grep -Pzo "<starttag>.*?(\n.*?)+.*?</starttag>"
This is a rather unusual requirement; I don't think grep can alter the lines like that. You can achieve this with sed, though (use double quotes so that the shell expands $PATTERN):
sed -n "s/$PATTERN//gp" file
EDIT in response to OP's edit:
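You can do multiline matching with sed, too, if the file is small enough to load it all into memory (sed has no non-greedy quantifiers, so .*? behaves like .* here and a match can run to the last closing tag):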
sed -rn ':r;$!{N;br};s/<starttag>.*?(\n.*?)+.*?<\/starttag>//gp' file.xml
You can do that with a little help from sed:
grep "pattern" input_file | sed 's/pattern//g'
I don't think there is a way in grep.
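The sed approach can be wrapped in a small helper, sketched below. The name strip_match and the sample input are hypothetical, and the pattern is assumed to be a basic regular expression that contains no / characters, since / is used as the delimiter.

```shell
#!/usr/bin/env bash
# Hypothetical input: two lines contain digits, one does not.
tmp=$(mktemp -d)
printf '%s\n' 'foo123bar' 'nomatch' 'abc456' > "$tmp/input"

# Print only the lines that match, with every match removed.
# $1 = basic regex without / characters, $2 = file
strip_match() {
    sed -n "s/$1//gp" "$2"
}

strip_match '[0-9][0-9]*' "$tmp/input"
```

Here -n suppresses the default output and the p flag prints a line only when a substitution happened, so non-matching lines are dropped just as grep would drop them.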
If you use ack, you could output Perl's special variables $` and $' to show everything before and after the match, respectively (the backtick is escaped so the shell does not treat it as command substitution):
ack string --output="\$\`\$'"
Similarly, if you wanted to output what did match along with other text, you could use $&, which contains the matched string:
ack string --output='Matched: $&'

Resources