I have a file like this (delimited by \t):
AAED1 Previous_symbol PRXL2C
AARS Previous_symbol AARS1
ABP1 Previous_symbol AOC1
ACN9 Previous_symbol SDHAF3
ADCY3 Previous_symbol ADCY8
AK3 Previous_symbol AK4
AK8 Previous_symbol AK3
I want to delete the rows that contain AAED1 and AK3 in the first column. In reality my file has thousands of lines and I want to delete hundreds of rows. I have a file with the patterns I want to search for (this is an example):
AAED1
AK3
I tried this:
grep -wvf pattern.txt file.txt
Expected output:
AARS Previous_symbol AARS1
ABP1 Previous_symbol AOC1
ACN9 Previous_symbol SDHAF3
ADCY3 Previous_symbol ADCY8
AK8 Previous_symbol AK3
The result I obtained:
AARS Previous_symbol AARS1
ABP1 Previous_symbol AOC1
ACN9 Previous_symbol SDHAF3
ADCY3 Previous_symbol ADCY8
The last row is also deleted because it contains AK3 in the third column. Is there a way to grep only the first column?
In the current setup, grep looks for any occurrence of the patterns anywhere in each line, so the line:
AK8 Previous_symbol AK3
will also match AK3.
You need to add a start-of-line anchor to the patterns to ensure that they are checked at the start of the lines only, so:
^AAED1
^AK3
If you cannot directly edit the file with the patterns, use the following:
grep -wvf <(sed 's/^/^/' pattern.txt) file.txt
Here pattern.txt is the file with the patterns and file.txt is the file to search. The sed command prefixes every line of pattern.txt with ^, and process substitution feeds the result to grep as the patterns to check; -v still inverts the match so that the matching rows are dropped.
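Alternatively, if you would rather compare the first column exactly instead of anchoring regular expressions, a small awk sketch (using the same pattern.txt and file.txt names) would be:
awk 'NR==FNR { skip[$1]; next } !($1 in skip)' pattern.txt file.txt
The first file fills an array of symbols to drop; a line from the second file is printed only when its first field is not in that array.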
Content of testfile.txt
/path1/abc.txt
/path2/abc.txt.1
/path3/abc.txt123
Content of pattern.txt
abc.txt$
Bash Command
grep -i -f pattern.txt testfile.txt
Output:
/path1/abc.txt
This is a working solution, but currently the $ in the pattern is added to each line manually, and this edited pattern file is distributed to users. I am trying to avoid the manual amendment.
An alternate solution is to loop and read the patterns line by line, but that requires scripting skills or uploading scripts to the user environment.
I want to keep the original pattern files in an audited environment; users just log in and run simple cut-and-paste commands.
Is there a one-liner solution?
You can use sed to add $ to pattern.txt and then use grep, but you might run into issues due to regexp metacharacters like the . character. For example, abc.txt$ will also match abc1txt. And unless you take care of matching only the basename from the file path, abc.txt$ will also match /some/path/foobazabc.txt.
I'd suggest using awk instead:
$ awk '!f{a[$0]; next} $NF in a' pattern.txt f=1 FS='/' testfile.txt
/path1/abc.txt
pattern.txt f=1 FS='/' testfile.txt: here the flag f is set between the two files, and the field separator is changed to / for the second file
!f{a[$0]; next}: if the flag f is not set (i.e. for the first file), build an array a with the line contents as keys
$NF in a: for the second file, if the last field is a key in array a, print the line
Just noticed that you are also using the -i option, so use this for case-insensitive matching:
awk '!f{a[tolower($0)]; next} tolower($NF) in a'
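For completeness, with the file arguments spelled out, the case-insensitive version is run the same way (this still assumes pattern.txt holds the plain string abc.txt without the trailing $):
awk '!f{a[tolower($0)]; next} tolower($NF) in a' pattern.txt f=1 FS='/' testfile.txt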
Since pattern.txt contains only a single pattern, and you don't want to change it because it is an audited file, you could do
grep -i -e "$(<pattern.txt)"'$' testfile.txt
instead. Note that this would break if the maintainer of the file one day decided to actually write a terminating $ there.
IMO, it would make more sense to explain to the maintainer of pattern.txt that they are supposed to place a proper regular expression there, one that is meant to match your testfile. In that case they can decide whether the pattern really should match only the end of a line or some inner part of it.
If pattern.txt contains more than one line, and you want to add the $ to each line, you can likewise do a
grep -i -f <(sed 's/$/$/' <pattern.txt) testfile.txt
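If you also want to guard against the . metacharacter issue mentioned in the other answer, the dots can likewise be escaped before the $ is appended (a sketch with the same file names):
grep -i -f <(sed -e 's/\./\\./g' -e 's/$/$/' pattern.txt) testfile.txt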
The '$' symbol anchors a pattern to the end of the line. The following script should work:
#!/bin/bash
file_pattern='pattern.txt'   # path to pattern file
file_test='testfile.txt'     # path to test file
while IFS= read -r line      # IFS= preserves leading/trailing whitespace in each pattern
do
    echo "$line"                     # show which pattern is being searched for
    grep -wn "$line" "$file_test"    # -w: whole words only, -n: print line numbers
done < "$file_pattern"
You can drop the IFS= assignment if you want leading/trailing whitespace in the pattern lines to be trimmed.
Also, the grep option -w matches whole words only, and -n prints the line number of each match.
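If you actually want each pattern anchored to the end of the line, as the $ remark above suggests, the grep call inside the loop could be changed to something like this (a sketch, keeping the same variable names):
grep -wn "${line}"'$' "$file_test"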
I am using grep to extract lines from file1 that match strings in file2. The strings in file2 contain both letters and numbers, e.g.:
MSTRG.18691.1
MSTRG.18801.1
I used sed to add word boundaries to all the strings in file2:
file2
\<MSTRG.18691.1\>
\<MSTRG.18801.1\>
and used grep -f file2 file1
but the output also has
MSTRG.18691.1.2
MSTRG.18801.1.3
I want lines that match exactly:
MSTRG.18691.1
MSTRG.18801.1
and not,
MSTRG.18691.1.2
MSTRG.18801.1.3
A few lines from my file1:
t_name gene_name FPKM TPM
MSTRG.25.1 . 0 0
rna71519 . 93.398872 194.727926057583
gene34024 ND1 2971.72876 6195.77694943117
MSTRG.28.1 . 0 0
MSTRG.28.2 . 0 0
rna71520 . 33.235409 69.2927240732149
Updating the answer
You can use the ^ (start of line) and $ (end of line) anchors to match the beginning and the end of a line. To match exactly MSTRG.18691.1 you can add ^ and $ at both ends and remove the word boundaries. Additionally, . has a special meaning in regex, so to match a literal . we need to escape it with a backslash \
Example pattern:
^MSTRG\.18691\.1$
^MSTRG\.18801\.1$
file1
MSTRG.18691.1
MSTRG.1311.1
MSTRG.18801.2
MSTRG.18801.3
MSTRG.18801.1.2
MSTRG.18801.1.1
MSTRG.18801.1
PrefixMSTRG.18801.1
Just create a normal file named file1 and paste the above content into it.
file2 (pattern file)
^MSTRG\.18801\.1$
Just create a normal file named file2 and paste the above content into it.
Run the command below from the command line:
grep -i --color -f file2 file1
Result:
MSTRG.18801.1
Sed to add the changes to the pattern file
Here is the sed command to escape the . and to add ^ and $ at the beginning and end of each line of the pattern file you already have (the original one with the plain MSTRG IDs):
sed -Ee 's/\./\\./g' -e 's/^/\^/g' -e 's/$/\$/g' file2 > file2_updated
-E enables extended regex on BSD sed; you may need to replace -E with -r depending on your system's sed.
The updated patterns will be saved to file2_updated. Use the new pattern file with grep like this:
grep -i -f file2_updated file1
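If you prefer not to keep the intermediate file2_updated around, the same transformation can be fed to grep directly through process substitution (a sketch, assuming a bash-like shell and that file2 is the original, unescaped pattern file):
grep -i -f <(sed -Ee 's/\./\\./g' -e 's/^/^/' -e 's/$/$/' file2) file1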
The flag you're looking for is -F. From man grep:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched.
You can use this quite comfortably in conjunction with -f:
grep -Ff file2 file1
To be clear, -F makes grep treat every line of file2 as a literal string rather than a regular expression, so . no longer matches any character. The strings can still match anywhere within a line, though, so add -x if you need them to match whole lines only.
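For example, if file2 contained the plain string MSTRG.18801.1 (no regex syntax) and file1 were the one-ID-per-line sample from the previous answer, adding -x would restrict the match to whole lines:
grep -Fxf file2 file1
MSTRG.18801.1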
I have a bunch of files: some contain the word star, some contain the word start, some contain both.
I'd like to grep for files that contain the word star, but not the word start.
How can this be accomplished using only grep?
grep has some options for inverting the matches at the line or file level. You want the latter option, with the -L switch. The following will print the names of all the files in a folder that don't contain the text start:
grep -LF start *
-F tells grep that start is a literal string and not a regex. It's optional here, but might speed things up a tiny bit.
You can use the resulting list to search for files that contain star:
grep -lF star $(grep -LF start *)
-l prints only the names of files containing a match, not any line-by-line or match-by-match details. If this is not exactly what you want, man grep is your friend.
This uses an additional shell construct to run the inverted match, but it technically doesn't call any additional programs that aren't grep.
Update
Since you mention wanting to look through all the files starting with a given root folder, change -LF to -LFr. Replace * with your root folder if you don't want to change working directories.
-r tells grep to recurse into directories, and search every file it finds along the way.
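Putting the two steps together for a hypothetical root folder ~/projects, the recursive form would look roughly like this (like the non-recursive version, it assumes the file names contain no spaces):
grep -lF star $(grep -LFr start ~/projects)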
With GNU grep for -w:
$ cat file
foo star bar
oof start rab
$ grep -w star *
foo star bar
or if you just want the names of the files containing star:
$ grep -lw star *
file
and to just find files to look in:
$ find . -maxdepth 1 -type f -exec grep -w 'star' {} \;
foo star bar
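If you only want the names of the matching files from the find variant, -l can be combined with it as well:
$ find . -maxdepth 1 -type f -exec grep -lw 'star' {} +
./file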
I am trying to search inside a folder containing several files. The names of the files are written in upper case, with a .sub extension in lower case:
AAA.sub
BBB.sub
CCC.sub
DDD.sub
I am searching for a pattern through those files using grep; however, I would like to use only lower-case letters when naming the input files.
In the man page for grep it is written:
-i, --ignore-case
Ignore case distinctions in both the PATTERN and the input files. (-i is specified by POSIX.)
So, if I understood properly:
grep -i subckt /schematics/aaa
and
grep -i subckt /schematics/AAA
are both supposed to be able to search for the pattern "subckt" in the file "aaa" regardless of its case (AAA or aaa), and if two files named aaa and AAA are present at the same time in the folder, I expect grep to search through both of them.
However, when I try my search with the 1st instruction (lower case) it does not work, giving me a "no such file or directory" message.
When I try to search with the 2nd instruction (upper case) it works properly.
I obviously misunderstood something about how the -i option of grep works; can anyone give me an answer regarding this matter?
Is it possible to be case insensitive with the input file names when using grep?
EDIT:
My question was lacking details; even though I have found the answer to my problem, I will add the details in case someone else stumbles upon this:
I have one file that contains a list of each file name I want to grep. My list looks like this:
aaa capacitor C_0
bbb capacitor C_0
ccc resistor R_in
...
The grep is done inside a perl script; the perl script parses the list file and gets each individual file name (aaa, bbb, ccc) inside a while loop.
However, the names inside the list file are written in lower case, whereas the names of the files I want to grep are written in upper case.
This is why I wanted the input file search to be case insensitive, so that I could directly do grep -i subckt aaa and it would search inside the file 'AAA'.
However, since the grep is launched from a perl script, and since it is apparently not possible to make grep behave like that, I used the uc() function of perl to convert aaa to AAA and do my grep with it (see my answer below).
-i affects how the contents are searched, not the name of the files.
When the man page says "Ignore case distinctions in both the PATTERN and the input files," that really means that case is ignored in the pattern (searching for AAA and aaa is equivalent) and in the contents of the input files (a line will match if it includes "AAA", "aaa", or even "AaA").
I think you want to either list all the filenames on the command line, or find a glob (i.e. wildcard) that matches all the filenames:
grep -i subckt *.sub
In Unix/Linux shells (bash, zsh, and so on), "*" is processed by the shell, not by the command (grep). The command receives the list of files and actually can't tell whether a user typed "grep foo *" or "grep foo file1 file2 file3" (if the directory contains just those 3 files).
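To see exactly what the shell hands to grep, you can expand the glob with echo first; with the example files from the question it would print something like:
echo *.sub
AAA.sub BBB.sub CCC.sub DDD.sub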
Please try the following command:
find . -iname aaa.sub | xargs grep -n subckt
find with the -iname option lists files while ignoring the case of their names. In the above case, find . -iname aaa.sub will list both aaa.sub and AAA.sub. The list of files is then passed via xargs to the grep command.
I have found a way to circumvent my problem by using the uc (upper case) function of perl to convert the input file names for the grep call to upper case.
The grep command was launched from a perl script in the first place:
grep -i subckt /schematics/aaa
So, I just did this in my perl script:
my $tmp = "aaa";
$tmp = uc($tmp);                              # "aaa" becomes "AAA"
system("grep -i subckt /schematics/$tmp");    # runs grep against /schematics/AAA
Now, the "aaa" name is just an example. In the perl script it is recovered from another parsed file that is written in lower case.
Thanks for the answers, though.
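For what it's worth, if the wrapper were a plain bash script rather than perl, the same upper-casing could be done with bash's own parameter expansion (this needs bash 4 or later; the variable name tmp is just an example):
tmp=aaa
grep -i subckt /schematics/"${tmp^^}"   # ${tmp^^} upper-cases the value, so this searches /schematics/AAA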
grep uses the filenames as they are listed on the command line. The -i option affects the contents of the files, not the names of the files.
You can use find to select filenames to be searched. The -iname option lets you match files ignoring case.
grep subckt $(find /schematics -iname aaa.sub -print)
If you have many filenames, or those filenames include spaces or other characters that would confuse the shell, the safe and secure way to do this is using the -print0 and -0 options:
find /schematics -iname aaa.sub -print0 | xargs -r -0 grep -i subckt
I'm using grep to find matching lines from one file in two other files. It finds the matching lines from File1 in File2 and File3 just fine, but as soon as there is more than one file to search, it prints the name of the file in which each match was found next to the line.
grep -w -f File1 File2 File3
Output:
File2: pattern
File2: pattern
File3: pattern
Is there an option to avoid printing the File2: and File3: prefixes?
grep --no-filename -w -f File1 File2 File3
If you're on a UNIX system, please refer to the man pages. Whenever you encounter a problem, your first step should be man $programName. In this case, man grep. It appears that you want the "-h" option. Here's an excerpt from the man page:
-h, --no-filename
Suppress the prefixing of file names on output. This is the default when there is only one file (or only standard input) to search.
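With -h (or --no-filename), the search from the question prints just the matching lines, without the File2: and File3: prefixes:
grep -h -w -f File1 File2 File3
pattern
pattern
pattern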