Grep outputs entire searched file - grep

I'm currently trying to parse the following file type (.fasta):
>SeqID=0001__GroupID=0001
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>SeqID=0002__GroupID=0001
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>SeqID=0003__GroupID=0002
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>SeqID=0004__GroupID=0003
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>SeqID=0005__GroupID=0003
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>SeqID=0006__GroupID=0004
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
I want to extract sequences by their group IDs. I have a file of the IDs to extract in the following format:
GroupID=0002
GroupID=0003
I've been using the following command:
$ grep -A 1 -f groupIDs_to_extract.txt sequences_file.fasta > output.txt
The idea being to perform a grep with each ID in the input text file, with the following line of context included to actually extract the sequence. So, from my example, the output would be all sequences from group 2 and 3:
>SeqID=0003__GroupID=0002
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>SeqID=0004__GroupID=0003
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>SeqID=0005__GroupID=0003
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
However, it just outputs the entire sequences_file.fasta at the end, and I have no idea why. Can anyone help?

Turns out that my file was actually formatted as follows:
>SeqID=0001__GroupID=0001 ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>SeqID=0002__GroupID=0001 ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>SeqID=0003__GroupID=0002 ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>SeqID=0004__GroupID=0003 ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
I didn't notice as my text editor (gedit) wrapped the text, so it looked like a normal .fasta file.
I used a regex find + replace to add newlines in to format it correctly, and now the grep works as intended.
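The same reformatting can also be done from the command line. This is a minimal sketch, assuming each record holds the header and the sequence on one line separated by a single space, as in the sample above; the records are shortened and the variable names are only illustrative:

```shell
# Two shortened records in the one-line layout shown above (illustrative data).
one_line_records='>SeqID=0003__GroupID=0002 ATGCATGC
>SeqID=0004__GroupID=0003 ATGCATGC'

# awk splits each line on whitespace: print the header ($1) and the
# sequence ($2) on separate lines, which restores the usual FASTA layout.
fixed=$(printf '%s\n' "$one_line_records" | awk '{ print $1; print $2 }')
printf '%s\n' "$fixed"
```

On a real file the same awk program would read the file directly and write to a new file, rather than going through a variable.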
As an aside, I altered the end of the command:
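$ grep -A 1 -f groupIDs_to_extract.txt sequences_file.fasta | grep -v -- '^--$' > output.txt
This removes the -- separator lines that grep inserts between groups of matches when a context option such as -A is used. Anchoring the pattern with ^ and $ avoids dropping lines that merely contain --.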
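GNU grep can also leave the separators out itself with --no-group-separator, which avoids the second grep. This is a small sketch using shortened copies of the sample data; the temporary directory is only for illustration, and the option is a GNU extension that other grep implementations may lack:

```shell
tmpdir=$(mktemp -d)
# Shortened copies of the sample files from the question.
printf '%s\n' '>SeqID=0001__GroupID=0001' 'ATGC' \
              '>SeqID=0003__GroupID=0002' 'ATGC' \
              '>SeqID=0006__GroupID=0004' 'ATGC' \
              '>SeqID=0004__GroupID=0003' 'ATGC' > "$tmpdir/sequences_file.fasta"
printf '%s\n' 'GroupID=0002' 'GroupID=0003' > "$tmpdir/groupIDs_to_extract.txt"

# -A 1 adds the sequence line after each matching header;
# --no-group-separator (GNU grep) suppresses the "--" lines between groups.
extracted=$(grep --no-group-separator -A 1 \
    -f "$tmpdir/groupIDs_to_extract.txt" "$tmpdir/sequences_file.fasta")
printf '%s\n' "$extracted"
rm -rf "$tmpdir"
```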

Related

How to exclude from grep double colons?

I'm trying to find lines with words not preceded by double colons (::).
Example
void myClass::doMything() // I don't want this line
myObj->doMyThing() // I want this line
My goal is to get the lines where some methods are used, but not where the methods are defined.
I tried this command:
grep --color=always -rwna "methodName" --include=*.cpp | grep -v "::methodName"
but it doesn't work: it still extracts lines containing
::methodName
I've also tried
grep --color=always -rwna "methodName" --include=*.cpp | grep -v "\:\:methodName"
egrep --color=always -rwna "methodName" --include=*.cpp | egrep -v "\:\:methodName"
but neither works.
What should I do?
Although grep is probably the most commonly used of all Linux CLI tools, that doesn't mean it's perfect. Excluding matches that are preceded by :: is awkward to express in a single basic grep regex; it is the kind of job a Perl-style negative lookbehind is meant for.
As a workaround (I assume you only want to find lines where the method is invoked) you can try:
grep -Eno "(::)?methodName" your_input_files | grep -v "::methodName"
-n prints the line number, which I think you'll find convenient
-o prints only the matched part; I use it here to split the output so that each match is on its own line (if methodName occurs 5 times in one line of code, you get 5 lines of grep output)
(::)? distinguishes a definition from an invocation of methodName; the second grep relies on it
grep -v then discards what you don't want
I guess you'll want to use this many times, so you could even put a function in your .bashrc:
find_invocations () {
    # Searches the files in the current directory; adapt the file list as needed.
    grep --color=yes -Eno "(::)?$1" * 2>/dev/null | grep -v "::$1"
}
In the function above you could take a risk and use $1.* instead of $1, but that misbehaves when methodName and ::methodName appear on the same line. As far as I remember from my C++ lessons (ages ago, around 2010), methodName::methodName is a constructor...
...sorry for bad english
I've finally managed to make it work.
I've tried linux_beginner's suggestion:
grep -Eno '(::)?myMethodName' path/to/one/of/the/files.cpp | grep -v '::myMethodName'
with a single file and this works. (I found I prefer not to use the -o option, because I also want to see how the method is used.)
For this search I need to cover multiple files anyway, so I also tried to include more files:
grep -Eno '(::)?myMethodName' --include=*.cpp | grep -v '::myMethodName'
but in this case it seems to hang during the search (maybe it triggers some slow scripting? perl or python?).
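One likely explanation for the apparent hang, offered as an assumption since the thread does not confirm it: without -r and without a file operand, grep reads standard input, so it simply waits for input. With -r, GNU grep searches the directory it is given (or the current directory if none is given), and --include restricts the search to matching file names. A sketch with made-up files in a temporary directory:

```shell
tmpdir=$(mktemp -d)
mkdir "$tmpdir/src"
# Hypothetical sources: one definition, one call, plus a non-.cpp file.
printf 'void myClass::myMethodName() {}\n' > "$tmpdir/src/a.cpp"
printf 'myObj->myMethodName();\n'          > "$tmpdir/src/b.cpp"
printf 'myObj->myMethodName();\n'          > "$tmpdir/src/notes.txt"

# -r recurses into the directory operand; the quoted --include pattern
# is passed to grep unexpanded and limits the search to *.cpp files.
calls=$(grep -rn --include='*.cpp' 'myMethodName' "$tmpdir/src" \
        | grep -v '::myMethodName')
printf '%s\n' "$calls"
rm -rf "$tmpdir"
```

Only the call in b.cpp should remain; the definition in a.cpp is filtered out by the second grep and notes.txt is never searched.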
I've checked RavinderSingh13's command. On its own, it correctly captures the lines with a double colon (and only those), both for a single file and for multiple files:
grep -rna '::myMethodName' path/to/one/of/the/file.cpp
grep -rna '::myMethodName' --include=*.cpp
but only without the -w switch; the following:
grep -rwna '::myMethodName' path/to/one/of/the/file.cpp
grep -rwna '::myMethodName' --include=*.cpp
return no results.
RavinderSingh13's suggestion, used inside the pipeline, doesn't filter out the double-colon lines (my original goal), either with a single file or with multiple files:
grep -rwna 'myMethodName' path/to/one/of/the/files.cpp | grep -v '::[[:alpha:]]+'
-> extracts both myMethodName and ::myMethodName from the chosen file
grep -rwna 'myMethodName' --include=*.cpp | grep -v '::[[:alpha:]]+'
-> extracts both myMethodName and ::myMethodName from all the cpp files
Now, how I solved it:
Usually, when I chain grep commands, I also add the switch --color=always to the first of them, which preserves the coloring of results across the pipe.
But that... was the culprit !
i.e., doing
grep --color=always -rwna 'myMethodName' --include=*.cpp | grep -v '::myMethodName'
preserves the color in results, but sadly fails to exclude lines containing ::myMethodName, while
grep -rwna 'myMethodName' --include=*.cpp | grep -v '::myMethodName'
gives colorless but correct results (it manages to filter out the double-colon lines).
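A likely reason, stated as an assumption since the thread does not spell it out: --color=always wraps each match in ANSI escape sequences, so in the piped text the escape codes sit between :: and myMethodName and the literal string ::myMethodName no longer occurs. A small sketch with a made-up two-line file:

```shell
tmpdir=$(mktemp -d)
# Hypothetical file with one definition and one call.
printf 'void myClass::myMethodName()\nmyObj->myMethodName()\n' > "$tmpdir/sample.cpp"

# Count the lines that survive the second grep, with and without color.
with_color=$(grep --color=always -w 'myMethodName' "$tmpdir/sample.cpp" \
             | grep -vc '::myMethodName')
without_color=$(grep -w 'myMethodName' "$tmpdir/sample.cpp" \
                | grep -vc '::myMethodName')

echo "lines kept with --color=always: $with_color"
echo "lines kept without color:       $without_color"
rm -rf "$tmpdir"
```

To see the escape sequences themselves, pipe the colored output through cat -v. If color is wanted in the final output, one option is to apply --color only in the last grep of the pipeline.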
I tested these commands and their behaviour on Ubuntu 20.04.1 LTS.
Grep version : grep (GNU grep) 3.4
Thanks everybody for the interest.

Can I grep each file of directory and save output to one file? Example code included

I am running a grep command on each file in a directory. I want the outputs to be appended into the same file. Is this possible?
Here is what I am using unsuccessfully:
for f in /directory/*.txt
do
grep -Eo "[0-9]+\.[0-9]+" $f >> one_output_file.txt
done
I am grepping out a number from each file and I want the numbers to be listed together in ONE output file. Possible?
Thanks!
Why not drop the for loop and do
grep -Eoh "[0-9]+\.[0-9]+" /directory/*.txt > one_output_file.txt
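The -h option matters here: when grep is given several files it normally prefixes each match with the file name, and -h suppresses that prefix so only the numbers are written. A sketch with two made-up input files in a temporary directory:

```shell
tmpdir=$(mktemp -d)
# Hypothetical input files.
printf 'value 1.25\n'          > "$tmpdir/a.txt"
printf 'value 3.50 and 7.75\n' > "$tmpdir/b.txt"

# -E: extended regex, -o: print only the matched text,
# -h: do not prefix matches with the file name.
numbers=$(grep -Eoh '[0-9]+\.[0-9]+' "$tmpdir"/*.txt)
printf '%s\n' "$numbers"
rm -rf "$tmpdir"
```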

(bash) grep -i not making search case insensitive for input files

I am trying to search inside a folder containing several files. The name of the files is written in upper case with a .sub extension in lower case:
AAA.sub
BBB.sub
CCC.sub
DDD.sub
I am searching for a pattern through those files using grep; however, I would like to use only lower-case letters for the input file names.
In the man page for grep it is written:
-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input files. (-i is specified by POSIX.)
So, if I understood properly:
grep -i subckt /schematics/aaa
and
grep -i subckt /schematics/AAA
are both supposed to search for the pattern "subckt" in the file "aaa" regardless of its case (AAA or aaa), and if two files named aaa and AAA are present in the folder at the same time, I expect grep to search through both of them.
However, when I try my search with the first command (lower case), it does not work and gives me a "no such file or directory" message.
When I try the second command (upper case), it works properly.
I obviously misunderstood something about how the -i option works with grep. Can anyone explain?
Is it possible to be case insensitive with the input files when using grep?
EDIT:
My question was lacking details. Even though I have found the answer to my problem, I will add the details in case someone else stumbles upon this:
I have one file that contains a list of the file names I want to grep. My list looks like this:
aaa capacitor C_0
bbb capacitor C_0
ccc resistor R_in
...
The grep is done inside a Perl script; the script parses the list file and gets each individual file name (aaa bbb ccc) inside a while loop.
However, the names in the list file are written in lower case, whereas the names of the files I want to grep are written in upper case.
This is why I wanted the input file lookup to be case insensitive, so that I could directly run grep -i subckt aaa and have it search inside the file 'AAA'.
However, since the grep is launched from a Perl script, and since it is apparently not possible to make grep behave like that, I used Perl's uc() function to convert aaa to AAA and ran my grep with that (see my answer below).
-i affects how the contents are searched, not the name of the files.
When the man page says "Ignore case distinctions in both the PATTERN and the input files." that really means that case is ignored in the pattern ( searching for AAA and aaa are equivalent) and the contents of the input files (a line would match if it includes "AAA" or "aaa" or even "AaA")
I think you want to either list all the filenames on the command line, or find a glob (i.e. wildcard) that matches all the filenames:
grep -i subckt *.sub
In Unix/Linux shells (bash, zsh, and so on) "*" is processed by the shell (bash) not the command (grep). The command receives the list of files and actually can't tell the difference between whether a user typed "grep foo *" and "grep foo file1 file2 file3" (if the directory includes those 3 files)
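The expansion can be made visible by putting echo in front of the command, so the shell prints the argument list grep would receive. A sketch with made-up files in a temporary directory:

```shell
tmpdir=$(mktemp -d)
# Hypothetical files matching the naming scheme from the question.
touch "$tmpdir/AAA.sub" "$tmpdir/BBB.sub"

# The shell replaces *.sub with the matching names before the command runs;
# echo just prints the resulting words instead of executing grep.
expanded=$(cd "$tmpdir" && echo grep -i subckt *.sub)
echo "$expanded"
rm -rf "$tmpdir"
```

Here the printed line should read grep -i subckt AAA.sub BBB.sub, which is exactly what grep would see.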
Please try the following command:
find . -iname aaa.sub -exec grep -Hn subckt {} +
find with the -iname option lists files while ignoring the case of their names, so find . -iname aaa.sub matches both aaa.sub and AAA.sub. The -exec ... {} + part passes the files found to grep as arguments, because grep takes the files to search from its arguments, not from a pipe.
I found a way around my problem by using Perl's uc (upper case) function to convert the file name passed to grep into upper case.
The grep command was launched from a Perl script in the first place:
grep -i subckt /schematics/aaa
So I just did this in my Perl script:
my $tmp = "aaa";
$tmp = uc($tmp);    # "aaa" becomes "AAA"
# system() is an assumed way of launching grep; the original call is not shown
system("grep", "-i", "subckt", "/schematics/$tmp");
Now, the "aaa" name is just an example. In the Perl script it is read from another parsed file that is written in lower case.
Thanks for the answers, though.
grep uses the filenames as they are listed on the command line. The -i option affects the contents of the files, not the names of the files.
You can use find to select filenames to be searched. The -iname option lets you match files ignoring case.
grep subckt $(find /schematics -iname aaa.sub -print)
If you have many filenames, or those filenames include spaces or other characters that would confuse the shell, the safe and secure way to do this is using the -print0 and -0 options:
find /schematics -iname aaa.sub -print0 | xargs -r -0 grep -i subckt
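The pipeline above can be tried out safely in a scratch directory. This sketch assumes a lower-case name taken from a list (the variable name and the file contents are made up) and an upper-case file on disk:

```shell
tmpdir=$(mktemp -d)
# Hypothetical netlist stored under an upper-case name.
printf '.SUBCKT inverter in out\n' > "$tmpdir/AAA.sub"

name=aaa   # lower-case name as it would come from the list file

# -iname matches the file name case-insensitively; -print0 and -0 keep
# unusual file names intact; grep -l prints the names of matching files.
matches=$(find "$tmpdir" -iname "$name.sub" -print0 | xargs -r -0 grep -il subckt)
printf '%s\n' "$matches"
rm -rf "$tmpdir"
```

Note that -r is a GNU xargs option (do not run the command when there is no input), as in the command above.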

How can I get grep to find a line in the file but also show the line following it?

I want to use grep command to extract those lines in a text file which contain a special pattern but i also want to extract the next line of those specific lines. Is it possible using grep?
You can specify how many lines to print after a match with the -A option:
grep -A1 pattern file
Demo:
$ cat file
line one
line two
line three
$ grep -A1 'one' file
line one
line two
Next time man grep!

grep is unable to find all pattern matching "\[\[\[\["

I am having problems with using grep along with a pipe. The scenario is as follows:
I am running a Python script that prints debug messages to the screen. I use ./prog | grep "\[\[\[\[" to catch the strings with "[[[[" in them. It returns some matching results but not others (another observation: the results found by grep come before the results not found by grep in the file). I have run ./prog without the pipe and grep, and it outputs all the strings with the "[[[[" pattern.
The problem is that the left square bracket is a special character in regular expressions. "grep" is not just a string matcher. Regular expressions are an involved language that let you describe patterns of text. Grep is trying to interpret [[[[ as a regular expression, not just a string.
As your question subject suggests, you can usually escape special characters with a backslash. So the following might work:
./prog | grep '\[\[\[\['
You can also "escape" square brackets by putting them inside square brackets. Thus, [[][[][[][[] or [[]{4} if your version of grep handles it.
You also need to determine whether your program, ./prog, is sending output to "standard output" or "standard error". You can put all your stderr through the pipe with:
./prog 2>&1 | egrep '[[]{4}'
UPDATE:
[ghoti@pc ~]$ printf '[[[[\n[[[\n[[[[\n[[[[[\n[[\n' | grep '\[\[\[\['
[[[[
[[[[
[[[[[
[ghoti@pc ~]$ printf '[[[[\n[[[\n[[[[\n[[[[[\n[[\n' | egrep '[[]{4}'
[[[[
[[[[
[[[[[
[ghoti@pc ~]$
Obviously, my results do not match yours. If you can provide more details as to the data you're processing, it will be helpful in trying to duplicate your results.
Error messages are usually sent to stderr, not stdout; your pipe is filtering stdout. (Your "another observation" hints at this.) You can redirect stderr along with stdout to the pipe:
./prog 2>&1 | grep '\[\[\[\['
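The difference between the two streams can be reproduced without the original program. In this sketch the emit function is a made-up stand-in for ./prog, and grep -F (fixed-string matching) is used instead of the backslash escapes so that no regex quoting is involved:

```shell
# Stand-in for ./prog: one matching line on stdout, one on stderr.
emit() {
    echo '[[[[ written to stdout'
    echo '[[[[ written to stderr' >&2
}

# Plain pipe: only stdout reaches grep; the stderr line goes straight
# to the terminal and is never filtered or counted.
stdout_only=$(emit | grep -cF '[[[[')

# 2>&1 sends stderr into the pipe as well, so grep sees both lines.
both_streams=$(emit 2>&1 | grep -cF '[[[[')

echo "matches from stdout only: $stdout_only"
echo "matches with 2>&1:        $both_streams"
```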
