grep exact match of a string with letters and numbers

I am using grep to extract the lines of file1 that match a string in file2. The strings in file2 contain both letters and numbers, e.g.:
MSTRG.18691.1
MSTRG.18801.1
I used sed to add word boundaries to every string in file2.
file2
\<MSTRG.18691.1\>
\<MSTRG.18801.1\>
and ran grep -f file2 file1
but the output also contains
MSTRG.18691.1.2
MSTRG.18801.1.3
I want only the lines that match exactly,
MSTRG.18691.1
MSTRG.18801.1
and not,
MSTRG.18691.1.2
MSTRG.18801.1.3
A few lines from my file1:
t_name gene_name FPKM TPM
MSTRG.25.1 . 0 0
rna71519 . 93.398872 194.727926057583
gene34024 ND1 2971.72876 6195.77694943117
MSTRG.28.1 . 0 0
MSTRG.28.2 . 0 0
rna71520 . 33.235409 69.2927240732149

Updating the answer
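You can anchor a pattern with ^ (start of line) and $ (end of line). To match exactly MSTRG.18691.1, put ^ at the beginning and $ at the end of the pattern and remove the word boundaries. In addition, . has a special meaning in a regex (it matches any character), so to match a literal . you need to escape it with a backslash: \. Note that ^ and $ anchor to the whole line, so these patterns only match lines that contain the ID and nothing else.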
Example pattern:
^MSTRG\.18691\.1$
^MSTRG\.18801\.1$
file1
MSTRG.18691.1
MSTRG.1311.1
MSTRG.18801.2
MSTRG.18801.3
MSTRG.18801.1.2
MSTRG.18801.1.1
MSTRG.18801.1
PrefixMSTRG.18801.1
Just create a normal file named file1 and paste the above content into it.
file2 (pattern file)
^MSTRG\.18801\.1$
Just create a normal file named file2 and paste the above content into it.
Run the below command from commandline
grep -i --color -f file2 file1
Result:
MSTRG.18801.1
Sed to add changes to the pattern file
Here is the sed command to escape . and add ^ and $ at the beginning and end of the pattern file you already have.
sed -Ee 's/\./\\./g' -e 's/^/\^/g' -e 's/$/\$/g' file2 > file2_updated
-E enables extended regular expressions in BSD sed and in recent GNU sed; on some older GNU sed versions you may need to replace -E with -r.
The updated patterns are saved to file2_updated. Use the new pattern file with grep like this:
grep -i -f file2_updated file1
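If file1 is a table like the one in the question, where the ID is only the first column, a whole-line anchor will not match because of the trailing columns. Here is a minimal sketch of one alternative that compares the first field exactly with awk. The temporary files and the MSTRG.28.1.2 row are made up for illustration, and whitespace-separated columns are assumed:

```shell
# Illustrative data: the MSTRG.28.1.2 row is invented to show the near-miss case.
table=$(mktemp)
ids=$(mktemp)
printf '%s\n' \
  't_name gene_name FPKM TPM' \
  'MSTRG.28.1 . 0 0' \
  'MSTRG.28.1.2 . 0 0' \
  'MSTRG.28.2 . 0 0' > "$table"
printf '%s\n' 'MSTRG.28.1' > "$ids"

# While reading the first file (NR == FNR), remember each ID as an array key.
# While reading the second file, print rows whose first field is one of those keys.
exact_rows=$(awk 'NR == FNR { wanted[$1]; next } $1 in wanted' "$ids" "$table")
printf '%s\n' "$exact_rows"

rm -f "$table" "$ids"
```

Because the comparison is a plain string equality on the first field, the dots in the IDs need no escaping.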

The flag you're looking for is -F. From man grep:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched.
You can use this quite comfortably in conjunction with -f:
grep -Ff file2 file1
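To be clear, this treats every line of file2 as a literal string rather than a regular expression, so . matches only a dot. It is still a substring match: MSTRG.18691.1 is also found inside MSTRG.18691.1.2. Add -x (grep -Fxf file2 file1) when the whole line must equal the string.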
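The difference between a fixed-string substring match and a fixed-string whole-line match can be checked with a small experiment. This is only a sketch: the temporary files are hypothetical, the IDs are borrowed from the example above, and the MSTRGx18801y1 line is invented to show that . is no longer a wildcard.

```shell
ids=$(mktemp)
lines=$(mktemp)
printf '%s\n' 'MSTRG.18801.1' > "$ids"
printf '%s\n' \
  'MSTRG.18801.1' \
  'MSTRG.18801.1.2' \
  'PrefixMSTRG.18801.1' \
  'MSTRGx18801y1' > "$lines"

# -F: literal strings, but still matched anywhere in the line.
substring_hits=$(grep -Ff "$ids" "$lines")
# -F with -x: the literal string must be the entire line.
whole_line_hits=$(grep -Fxf "$ids" "$lines")

printf 'substring:\n%s\n' "$substring_hits"
printf 'whole line:\n%s\n' "$whole_line_hits"

rm -f "$ids" "$lines"
```

Only the -x variant excludes the longer ID and the prefixed line; neither variant matches MSTRGx18801y1.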

Related

Parse the output of a grep to tag files

After a lengthy pipe that ends with a grep, I correctly end up with a set of matching absolute file paths, each followed by a comma delimiter and its match string. I want to tag each file with its match string. A complication is that the path may contain spaces, although there are none between the delimiter and the characters on either side of it.
I need to be able to deal with an absolute path rather than just the filename within the directory. The match strings are free of spaces, but the filenames might not be.
So by way of example, the output of the pipe might look like:
pipe1 | pipe2 |
outputs
/Users/bloggs/Directory One/matched_file.doc,attributes_0001ABC
/Users/bloggs/Directory One/matched_file1.doc,attributeY_2
/Users/bloggs/Directory One/match_file_00x.doc,Attribute_00201
/Users/bloggs/Directory One/matching file 2.doc,attribute_0004
I want to tag each using something which will probably include:
tag --add "$attribute" "$file"
Where attribute refers to the match string eg "Attribute_00201"
Normally I'd just say eg:
tag --add Attribute_00201 /Users/bloggs/Directory\ One/match_file_00x.doc
At this point I am stuck on how to parse each line, ideally via another pipe, deal with the spaces correctly, and execute the tag command. I would be grateful for any help.
So I'm looking for a new pipe, pipe3, to execute or give me the correctly formatted tag command:
pipe1 | pipe2 | pipe3
delivers eg
tag --add Attribute_00201 /Users/bloggs/Directory\ One/match_file_00x.doc
etc
etc
This seems to work
| tee >(cut -f2 -d","| sed 's/^/tag --add /' > temp_out.txt) >(cut -d"," -f1 | sed -e 's/[[:space:]]/\\ /g' > temp_out1.txt) > /dev/null && paste -d' ' temp_out.txt temp_out1.txt > command.sh && chmod +x ./command.sh
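A possibly simpler alternative is to read each line in a loop and split it at the last comma, so that quoting rather than backslash-escaping takes care of the spaces. This is a sketch under assumptions: produce_matches stands in for pipe1 | pipe2, TAG_CMD is a hypothetical variable that defaults to a dry run, and match strings are assumed never to contain a comma.

```shell
# Stand-in for "pipe1 | pipe2"; the sample lines are taken from the question.
produce_matches() {
  printf '%s\n' \
    '/Users/bloggs/Directory One/matched_file.doc,attributes_0001ABC' \
    '/Users/bloggs/Directory One/matching file 2.doc,attribute_0004'
}

# Hypothetical switch: "echo tag" only prints the command; set TAG_CMD=tag to run it.
TAG_CMD=${TAG_CMD:-echo tag}

tag_matches() {
  while IFS= read -r line; do
    file=${line%,*}        # everything before the last comma
    attribute=${line##*,}  # everything after the last comma
    # Quoting "$file" keeps a path with spaces as a single argument.
    $TAG_CMD --add "$attribute" "$file"
  done
}

tagged=$(produce_matches | tag_matches)
printf '%s\n' "$tagged"
```

In the real pipeline this would be used as pipe1 | pipe2 | tag_matches, with no temporary files or generated script.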

Using grep to find a string that starts with a character with numbers after

Okay I have a file that contains numbers like this:
L21479
What I am trying to do is use grep (or a similar tool) to find all the strings in a file that have the format:
L#####
Each # is a digit, so an L followed by 5 digits.
Is this even possible in grep? Should I load the file and perform regex?
You can do this with grep, for example with the following command:
grep -E -o 'L[0-9]{5}' name_of_file
For example, given a file with the text:
kasdhflkashl143112343214L232134614
3L1431413543454L2342L3523269ufoidu
gl9983ugsdu8768IUHI/(JHKJASHD/(888
The command above will output:
L23213
L14314
L35232
If it is just in a single file, you can do something along the lines of:
grep -E 'L[0-9]{5}' filename
If you need to search all files in a directory for these strings:
find . -type f -print0 | xargs -0 grep -E 'L[0-9]{5}'
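The {5} repetition is written differently depending on the regex flavour: with -E (extended regular expressions) the braces are used bare, while in grep's default basic regular expressions they need backslashes. A small sketch using the sample text from the answer above (the temporary file is hypothetical):

```shell
sample=$(mktemp)
printf '%s\n' \
  'kasdhflkashl143112343214L232134614' \
  '3L1431413543454L2342L3523269ufoidu' \
  'gl9983ugsdu8768IUHI/(JHKJASHD/(888' > "$sample"

# Extended syntax: bare braces.
ere_hits=$(grep -E -o 'L[0-9]{5}' "$sample")
# Basic syntax: the same repetition with escaped braces.
bre_hits=$(grep -o 'L[0-9]\{5\}' "$sample")

printf '%s\n' "$ere_hits"

rm -f "$sample"
```

Both commands should print the same three strings; without -E and without the backslashes, the braces are taken literally and nothing matches.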

How to grep with a list of words

I have a file A with 100 words in it separated by new lines. I would like to search file B to see if ANY of the words in file A occur in it.
I tried the following, but it does not work for me:
grep -F A B
You need to use the option -f:
$ grep -f A B
The option -F does a fixed-string search, whereas -f specifies a file of patterns. You may want both if the file contains only fixed strings and not regexps.
$ grep -Ff A B
You may also want the -w option for matching whole words only:
$ grep -wFf A B
Read man grep for a description of all the possible arguments and what they do.
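To see what -w changes, here is a small sketch with made-up word and text files (both are hypothetical stand-ins for A and B):

```shell
words=$(mktemp)
text=$(mktemp)
printf '%s\n' 'apple' 'cherry' > "$words"
printf '%s\n' 'an apple a day' 'pineapple juice' 'banana split' > "$text"

# Without -w, "apple" is also found inside "pineapple".
substring_hits=$(grep -Ff "$words" "$text")
# With -w, a match must be delimited by non-word characters or line ends.
whole_word_hits=$(grep -wFf "$words" "$text")

printf 'substring:\n%s\n' "$substring_hits"
printf 'whole word:\n%s\n' "$whole_word_hits"

rm -f "$words" "$text"
```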
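To find a very long list of words in big files, it can sometimes be more efficient to join the words into a single alternation and use grep -E (egrep is a deprecated alias for it). Join the lines with paste rather than tr so that no trailing | is left behind, because an empty alternative would match every line:
$ paste -sd '|' A > A_regex
$ grep -E -f A_regex B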

Grep in multiple files prints matches line with file name

I'm using grep to find lines in two different files that match patterns from another file. It finds the lines from File1 in File2 and File3 just fine, but as soon as there is more than one file to search, it prints the name of the file in which each line was found next to the line.
grep -w -f File1 File2 File3
Output:
File2: pattern
File2: pattern
File3: pattern
Is there an option to avoid the print of File2: and File3:?
grep --no-filename -w -f File1 File2 File3
If you're on a UNIX system, please refer to the man pages. Whenever you encounter a problem, your first step should be man $programName. In this case, man grep. It appears that you want the "-h" option. Here's an excerpt from the man page:
-h, --no-filename
Suppress the prefixing of file names on output. This is the default when there is only one file (or only standard input) to search.
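A quick way to compare the two behaviours, using three hypothetical temporary files in place of File1, File2 and File3:

```shell
patterns=$(mktemp)
first=$(mktemp)
second=$(mktemp)
printf '%s\n' 'pattern' > "$patterns"
printf '%s\n' 'a pattern here' 'nothing to see' > "$first"
printf '%s\n' 'another pattern' > "$second"

# Default with several input files: each match is prefixed with its file name.
with_names=$(grep -w -f "$patterns" "$first" "$second")
# -h suppresses the prefix.
without_names=$(grep -h -w -f "$patterns" "$first" "$second")

printf '%s\n' "$with_names"
printf '%s\n' "$without_names"

rm -f "$patterns" "$first" "$second"
```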

How to find a pattern and surrounding content in a very large SINGLE line file?

I have a very large file 100Mb+ where all the content is on one line.
I wish to find a pattern in that file and a number of characters around that pattern.
For example I would like to call a command like the one below but where -A and -B are number of bytes not lines:
cat very_large_file | grep -A 100 -B 100 somepattern
So for a file containing content like this:
1234567890abcdefghijklmnopqrstuvwxyz
With a pattern of
890abc
and a before size of -B 3
and an after size of -A 3
I want it to return:
567890abcdef
Any tips would be great.
Many thanks.
You could try the -o option:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
and use a regular expression to match your pattern and the 3 preceding/following characters i.e.
grep -o -P ".{3}pattern.{3}" very_large_file
In the example you gave, it would be
echo "1234567890abcdefghijklmnopqrstuvwxyz" > tmp.txt
grep -o -P ".{3}890abc.{3}" tmp.txt
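One caveat with a fixed .{3} is that a match closer than three characters to the start or end of the line is not reported at all. A hedged variant is to allow up to three characters with .{0,3}. The sketch below uses -E instead of -P, on the assumption that your grep supports -o together with -E (GNU and BSD grep both do):

```shell
line='1234567890abcdefghijklmnopqrstuvwxyz'

# Up to 3 characters of context on each side of the pattern.
middle=$(printf '%s\n' "$line" | grep -o -E '.{0,3}890abc.{0,3}')
# Only one character precedes "234", so the left context is shorter.
near_start=$(printf '%s\n' "$line" | grep -o -E '.{0,3}234.{0,3}')

printf '%s\n' "$middle" "$near_start"
```

For a pattern in the middle of the line the result is the same as with .{3}; near an edge the context is simply shorter instead of the match being dropped.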
Another one with sed (you may need it on systems where GNU grep is not available):
sed -n '
s/.*\(...890abc...\).*/\1/p
' infile
Best way I can think of doing this is with a tiny Perl script.
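#!/usr/bin/perl
use strict;
use warnings;
# Shift the arguments off @ARGV so that <> reads STDIN instead of opening them as files.
my $pattern = shift;
my $before  = shift;
my $after   = shift;
while (<>) {
    # $& holds the text matched by the last successful regex.
    print "$&\n" if /.{$before}$pattern.{$after}/;
}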
You would then execute it thusly:
cat very_large_file | ./myPerlScript.pl 890abc 3 3
EDIT: Dang, Paolo's solution is much easier. Oh well, viva la Perl!
