Using grep to extract reads from nanopore fastq files - grep

I'm trying to extract the a specific sequence from a fastq file using grep to search the sequence ID
less all_barcode03.fastq.gz
#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t runid=7204dc15205b93bfd6430ca0f3a0218f11ce0787 read=10 ch=120 start_time=2019-04-12T13:55:25Z
TCGGTAGCCACTTCGTTCAGTCAATTTGGGTTGTTTAACCGAGTCTTGTGTGTCCCAGTTACCAGGGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTCGTGCGCCGCTTCAGTGATCAGTGAAGATGGGTTTGTGGTGGAATACTCTTGCTGCTCATGGCAAACTTTATGTTGGTTTTCTCATGCATTTGTTTCTCGTAATCCCATACGTCATCCAAAGTCATCTGAAAAAGAGGGAAGGGGTGGATTGTGGGTGAAATGTTGTGTACTCCTCTATAATGGGGCTCAGTTGACAAACAGGTGGAGGAGAGGATCATTTGCTTAAAGGGGTGAGTGAAGCGGAGTTTAAGGATAATTCAAGCTTTTAAAAGTGGCTTTAGAGGTAAAGGGTTAGCTCCCATGACCCACAGGATTTATAGGAGATGGCTCTGAACAAACCAGAGCCACACACACA
+
-%&&($$%%#%,-*),-5(&,$$%$%+).'-(+-4-(')%%$*+-,3...14,7/))/03.06-./-3:8.0(*,/7+*,966006.,(*(,-(&(*,./+--902/./),,,0,-/./,4(+0/,0).0-7048,(+*',*/.)*#(((.0--10764+('(%.3/+$&%&'./4'0.;:6.895778+0/*(28/),(+-/404/*'(),.16517&83+*/0/0.--033**$&'*,''*/,,,/..0.*0*0$##*((($/6&('-,.230/01/2+4,,::8719(*.4.'.26/0(*))0*+,(*+-,-+-.4765-$%&.'%.*/')(&''#-()*21,-.;+3).*,,'557686+(-7;-2:8))(&%%'*)**%&&).6&,*(.-'$'(*2+*0587:0+*+)/*/63--/*('#&)-68664&%534)/13.))'14*+**%%$$#
#69e7e435-a78c-4ec8-94cd-b0c1f3c40c11_t runid=7204dc15205b93bfd6430ca0f3a0218f11ce0787 read=15 ch=465 start_time=2019-04-12T13:55:25Z
TCGGTACTTCGTTCGGTTGGAGAAGGTGGTGTTGCCGAGTCTTGTGTCCCAGTTACCAGGGTTTTCGCATTTATCGTGGCTTGCTGCGTTTTCGTGCGCCACCGCTTCATGTGTGTGTGTGTGTCTGGTGTTATTACTCACTTGGCAAGCGTGTCTGGACAGCAGCTGTTTGAGTGTTGAGAGCGCTTCTTCTCCAGGAGAAGCGGTTGAGCCTAAGCTGAATCCCCGTCCGTCTTTATCTTCGGACATGCTCTGGATATGCCTGAGGAGGACAATGGAGGAACAGAACAGATGGATGAAGAGCTCATAAAACTGGCACACATGCATCAAAGCCCACCTTCGTCACTCTGATGACCAGTGACTGCCGTTTATTACTGCGATTTACCATGAAGTTATCTGCTTTTTGGGTCAGTTAGTGTGTGTGTGTGTGTGTGTGTGTGTGCCTTTTCTGTCCTCCAGATACTCAGTACTACAGAGGAGCTATTAATACTTACTACATCGATATGTTATGTAATATCATTCTAGCCTGCTACTCCTGTCTTCTGTATACAACTGTCGTCTGTCCCGAATAGCTCCTGGGTGCCCTCTCCTCCATAGTAGCCACAGTTACAGGAATATTACTCTTTATCATAGAAGCGGTATCTAGTAGAACAGTCCTTAGTTAAAATAATAACGGGGTGTGGGCATGTACAGCCTCTGGTATTCCGTTGCTCAGCAGAGCCTCATAACTCTCCTAGTGGCTCAGGAAGGCTGAAACAGGCTGTGTGCACCCAGCCAGCTGGAACTGTGTTTGAGTGCCATCTTGGAATACTGTTTATAAGCGCTCTTAAGTTATATGTGAGGATGGTGGTATTAGATATGGAAGTGTGTAGGAGGAGAAAGAGGAAATAGTGTCATGTTGATATGAACAGTTTGGTCAGTAAAATGAGGGCAGTAAAAAAGTGTTTTAAGCGTTTTGTCGGTCGACAATATGATAATAAAATGCATTTGGTTCACGATAACAAGAAAACAGAAAAGACCAGCAATGAATATTTAGCATTTTTTGTTTGAAAGATGAAACAAATAATTGAAATAGCTGCCAAATATTTGTGAAATGTACTAAATGGTCAGAGTGAAGATGCAGCTTTGAAAAGAAGATTCGGA
+
+&$%)'./-0,*1(&&&%#%&$(&)'%&&%$"#$&+,'*-*1+++5-73+)*/+,32/46552:/-+2025/+-057,$#$$&)/01,)433/2732'&$#&$"$'$((+*+),+,,,*+,,-11)*'&((*"0#"&*((,*.--.&.+-*,)-17861+&%'%)),73:60-/-32:++(('.')+56894,4+)./'%')%$&-,('%#41.'$%&')$0))/2.*04632,20)(+'&&,+7.97825-++**166678950-))%*+,-26-.6,*/(4.$+'+-5/0/.-02/-+)'%+73//245+(&(%%'))(&$#&&(7.:2-0;7014354398')-83/00/04:*330))&#)))-5/(-*++5#./+50-(,0765/1,,8//05/0.:0/%#$&)--+4+)+5575312+1&-')).'+&*%)(,,,((%++/,.2486112'&#$&##$%'(*+,1/)/+...+-.1312/1+**-(-.8---,*+,-.5,1,(+%..1,)--.8;441019.1780000313658;99621-,,.++)#,-.011537%#&-2,',-,86)(.''%(.2+/24,.23/./+*$)4--.0.340/+())0..62019-7:+).2(/*%),&--30/32*)&)%)$%')+2;829%*)'4:;401/,-71%.,'(*+)2837653/0-&/63861'(*-6*()5:.3--'%')',)2977&(%(%'+-/**-0727112246..*1,-..3&/.4535-3+3.00,7*%'1+12311321.35567:93&)*))'-/,2-7-.6/,..-4;6/3/&(&%**03745+-.-.::95544467..--))'*)#('*+,..(%)&'(%%&-+'++)*/1&&'$%&+*&())$()(,%+'$&'&($&'2.44:0..++#%).78*(((/1'($$&-:;98.(*00;;2-''),053.//3+&))+14-8&**,..01.2:;743425:7(,*.((+*,,-+'&*'+057,*(.53-(+3703/210.06256;.+,01.5<<5,06;:+.7)')3,$(+'4;.,*'*'*-4--)+-*)+&--,*$(+&(-$*,''/2778:;9/.857+%%'()*((*11-,)+-5-+,31/#&%$%5)-#%#
Then try to show one of the sequences by searching for the sequence ID:
grep '#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t' all_barcode03.fastq.gz
grep '*#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t*' all_barcode03.fastq.gz
grep #3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t all_barcode03.fastq.gz
grep *#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t* all_barcode03.fastq.gz
All the above grep commands return no results however there is a line in the file staring with #3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t

Use zgrep not grep on .gz files.
zgrep - search possibly compressed files for a regular expression

Related

show filename with matching word from grep only

I am trying to find which words happened in logfiles plus show the logfilename for anything that matches following pattern:
'BA10\|BA20\|BA21\|BA30\|BA31\|BA00'
so if file dummylogfile.log contains BA10002 I would like to get a result such as:
dummylogfile.log:BA10002
it is totally fine if the logfile shows up twice for duplicate matches.
the closest I got is:
for f in $(find . -name '*.err' -exec grep -l 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' {} \+);do printf $f;printf ':';grep -o 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' $f;done
but this gives things like:
./register-05-14-11-53-59_24154.err:BA10
BA10
./register_mdw_files_2020-05-14-11-54-32_24429.err:BA10
BA10
./process_tables.2020-05-18-11-18-09_11428.err:BA30
./status_load_2020-05-18-11-35-31_9185.err:BA30
so,
1) there are empty lines with only the second match and
2) the full match (e.g., BA10004) is not shown.
thanks for the help
There are a couple of options you can pass to grep:
-H: This will report the filename and the match
-o: only show the match, not the full line
-w: The match must represent a full word (string build from [A-Za-z0-9_])
If we look at your regex, you use BA01, this will match only BA01 which can appear anywhere in the text, also mid word. If you want the regex to match a full word, it should read BA01[[:alnum:]_]* which adds any sequence of word-constituent characters (equivalent to [A-Za-z0-9_]). You can test this with
$ echo "foo BA01234 barBA012" | grep -Ho "BA01"
(standard input):BA01
(standard input):BA01
$ echo "foo BA01234 barBA012" | grep -How "BA01"
$ echo "foo BA01234 barBA012" | grep -How "BA01[[:alnum:]_]*"
(standard input):BA01234
So your grep should look like
grep -How "\('BA10\|BA20\|BA21\|BA30\|BA31\|BA00'\)[[:alnum:]_]*" *.err
From your example it seems that all files are in one directory. So the following works right away:
grep -l 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' *.err
If the files are in different directories:
find . -name '*.err' -print | xargs -I {} grep 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' {} /dev/null
Explanation: the addition of /dev/null to the filename {} forces grep to report the matching filename

grep: Find all files containing the word `star`, but not the word `start`

I have a bunch of files: some contain the word star, some contain the word start, some contain both.
I'd like to grep for files that contain the word star, but not the word start.
How can this be accomplished using only grep?
grep has some options for inverting the matches at the line or file level. You want the latter option, with the -L switch. The following will print the names of all the files in a folder that don't contain the text start:
grep -LF start *
-F tells grep that start is a literal string and not a regex. It's optional here, but might speed things up a tiny bit.
You can use the resulting list to search for files that contain star:
grep -lF star $(grep -LF start *)
-l prints only the names of files containing a match, not any line-by-line or match-by-match details. If this is not exactly what you want, man grep is your friend.
This uses an additional shell construct to run the inverted match, but it technically doesn't call any additional programs that aren't grep.
Update
Since you mention wanting to look through all the files starting with a given root folder, change -LF to -LFr. Replace * with your root folder if you don't want to change working directories.
-r tells grep to recurse into directories, and search every file it finds along the way.
With GNU grep for -w:
$ cat file
foo star bar
oof start rab
$ grep -w star *
foo star bar
or if you just want the names of the files containing star:
$ grep -lw star *
file
and to just find files to look in:
$ find . -maxdepth 1 -type f -exec grep -w 'star' {} \;
foo star bar

How to grep in one line starting from particular string to end with particular string

I want to grep "[calleruid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse"
in Below file
2014-10-15 18:38:32,831 plivo-rest[2781]: INFO: Fetching GET http://*******/outbound_callback.aspx with smscresponse[to]=8912722fsf9&smscresponse[ALegUUID]=5bb516fsd64-546c-11e4-879f-551816a551303677&smscresponse[calluid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse[direction]=outbosund&smscresfdsponse[endreason]=UNALLOCATED_NUMBER&smscresponse[from]=83339995896999&smscresponse[starttime]=0&smscresponse[ALegRequestUUID]=5bb4bafc-546c-11e4-891d-000c29ec6e41&smscresponse[RequestUUID]=5bb4bafc-546c-11e4-891d-000c29ec6e41&smscresponse[callstatus]=completed&smscresponse[endtime]=1413378509&smscresponse[ScheduledHangupId]=5bb4c15a-546c-11e4-891d-000c29ec6e41&smscresponse[event]=missed_call_hangup
I used this command
$ grep -oP '(calluid).*$'
this greps upto end of file
I used this command
$ grep -oP '(calluid).{40}'
it fetches 40 characters but i have 1000's of calleruid's so each have different no.s of characters
So please guide me to grep exact callerid data
Use a lookahead to force the regex engine to do the match upto a specific character or a boundary.
$ grep -oP '\[calluid\][^\]\[]*(?=\[|$)' file
[calluid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse
Here is an gnu awk (due to multiple characters in RS) version:
awk -v RS="[[]calluid[]]=" -F[ 'NR==2 {print $1}' file
aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse
You can also set RS like this: RS="\\\[calluid]="

Using grep to find a string that starts with a character with numbers after

Okay I have a file that contains numbers like this:
L21479
What I am trying to do is use grep (or a similar tool) to find all the strings in a file that have the format:
L#####
The # will be the number. SO an L followed by 5 numbers.
Is this even possible in grep? Should I load the file and perform regex?
You can do this with grep, for example with the following command:
grep -E -o 'L[0-9]{5}' name_of_file
For example, given a file with the text:
kasdhflkashl143112343214L232134614
3L1431413543454L2342L3523269ufoidu
gl9983ugsdu8768IUHI/(JHKJASHD/(888
The command above will output:
L23213
L14314
L35232
If it is just in a single file, you can do something along the lines of:
grep -e 'L[0-9]{5}' filename
If you need to search all files in a directory for these strings:
find . -type f | xargs grep -e 'L[0-9]{5}'

Is there a way in grep to find out how many lines matched the grep result?

Suppose I write a grep query to find out the occurrence of a method call on an object like this:
// might not be accurate, but irrelevant
grep -nr "[[:alnum:]]\.[[:alnum:]](.*)" .
This would give many results. How to find out how many such results are obtained?
What about using | wc -l to count the number of result lines?
What about
man grep | grep "count"
It outputs
-c, --count
Suppress normal output; instead print a count of matching lines for each input file. [...]
Previous answers are OK, I just want to put it into command line instructions in order to have copy-paste versions (from explicit to simplest) for the future:
grep --count "PATTERN" FILE
Is exactly the same as:
grep -c "PATTERN" FILE
And it is equivalent to:
grep "PATTERN" FILE | wc -l
As a bonus, below i give you a version where a file with a list of patterns is used.
grep -count --file=PATTERNFILE FILE
or simply
grep -cf PATTERNFILE FILE

Resources