Using grep to extract reads from nanopore fastq files
I'm trying to extract a specific sequence from a fastq file, using grep to search for the sequence ID:
less all_barcode03.fastq.gz
#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t runid=7204dc15205b93bfd6430ca0f3a0218f11ce0787 read=10 ch=120 start_time=2019-04-12T13:55:25Z
TCGGTAGCCACTTCGTTCAGTCAATTTGGGTTGTTTAACCGAGTCTTGTGTGTCCCAGTTACCAGGGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTCGTGCGCCGCTTCAGTGATCAGTGAAGATGGGTTTGTGGTGGAATACTCTTGCTGCTCATGGCAAACTTTATGTTGGTTTTCTCATGCATTTGTTTCTCGTAATCCCATACGTCATCCAAAGTCATCTGAAAAAGAGGGAAGGGGTGGATTGTGGGTGAAATGTTGTGTACTCCTCTATAATGGGGCTCAGTTGACAAACAGGTGGAGGAGAGGATCATTTGCTTAAAGGGGTGAGTGAAGCGGAGTTTAAGGATAATTCAAGCTTTTAAAAGTGGCTTTAGAGGTAAAGGGTTAGCTCCCATGACCCACAGGATTTATAGGAGATGGCTCTGAACAAACCAGAGCCACACACACA
+
-%&&($$%%#%,-*),-5(&,$$%$%+).'-(+-4-(')%%$*+-,3...14,7/))/03.06-./-3:8.0(*,/7+*,966006.,(*(,-(&(*,./+--902/./),,,0,-/./,4(+0/,0).0-7048,(+*',*/.)*#(((.0--10764+('(%.3/+$&%&'./4'0.;:6.895778+0/*(28/),(+-/404/*'(),.16517&83+*/0/0.--033**$&'*,''*/,,,/..0.*0*0$##*((($/6&('-,.230/01/2+4,,::8719(*.4.'.26/0(*))0*+,(*+-,-+-.4765-$%&.'%.*/')(&''#-()*21,-.;+3).*,,'557686+(-7;-2:8))(&%%'*)**%&&).6&,*(.-'$'(*2+*0587:0+*+)/*/63--/*('#&)-68664&%534)/13.))'14*+**%%$$#
#69e7e435-a78c-4ec8-94cd-b0c1f3c40c11_t runid=7204dc15205b93bfd6430ca0f3a0218f11ce0787 read=15 ch=465 start_time=2019-04-12T13:55:25Z
TCGGTACTTCGTTCGGTTGGAGAAGGTGGTGTTGCCGAGTCTTGTGTCCCAGTTACCAGGGTTTTCGCATTTATCGTGGCTTGCTGCGTTTTCGTGCGCCACCGCTTCATGTGTGTGTGTGTGTCTGGTGTTATTACTCACTTGGCAAGCGTGTCTGGACAGCAGCTGTTTGAGTGTTGAGAGCGCTTCTTCTCCAGGAGAAGCGGTTGAGCCTAAGCTGAATCCCCGTCCGTCTTTATCTTCGGACATGCTCTGGATATGCCTGAGGAGGACAATGGAGGAACAGAACAGATGGATGAAGAGCTCATAAAACTGGCACACATGCATCAAAGCCCACCTTCGTCACTCTGATGACCAGTGACTGCCGTTTATTACTGCGATTTACCATGAAGTTATCTGCTTTTTGGGTCAGTTAGTGTGTGTGTGTGTGTGTGTGTGTGTGCCTTTTCTGTCCTCCAGATACTCAGTACTACAGAGGAGCTATTAATACTTACTACATCGATATGTTATGTAATATCATTCTAGCCTGCTACTCCTGTCTTCTGTATACAACTGTCGTCTGTCCCGAATAGCTCCTGGGTGCCCTCTCCTCCATAGTAGCCACAGTTACAGGAATATTACTCTTTATCATAGAAGCGGTATCTAGTAGAACAGTCCTTAGTTAAAATAATAACGGGGTGTGGGCATGTACAGCCTCTGGTATTCCGTTGCTCAGCAGAGCCTCATAACTCTCCTAGTGGCTCAGGAAGGCTGAAACAGGCTGTGTGCACCCAGCCAGCTGGAACTGTGTTTGAGTGCCATCTTGGAATACTGTTTATAAGCGCTCTTAAGTTATATGTGAGGATGGTGGTATTAGATATGGAAGTGTGTAGGAGGAGAAAGAGGAAATAGTGTCATGTTGATATGAACAGTTTGGTCAGTAAAATGAGGGCAGTAAAAAAGTGTTTTAAGCGTTTTGTCGGTCGACAATATGATAATAAAATGCATTTGGTTCACGATAACAAGAAAACAGAAAAGACCAGCAATGAATATTTAGCATTTTTTGTTTGAAAGATGAAACAAATAATTGAAATAGCTGCCAAATATTTGTGAAATGTACTAAATGGTCAGAGTGAAGATGCAGCTTTGAAAAGAAGATTCGGA
+
+&$%)'./-0,*1(&&&%#%&$(&)'%&&%$"#$&+,'*-*1+++5-73+)*/+,32/46552:/-+2025/+-057,$#$$&)/01,)433/2732'&$#&$"$'$((+*+),+,,,*+,,-11)*'&((*"0#"&*((,*.--.&.+-*,)-17861+&%'%)),73:60-/-32:++(('.')+56894,4+)./'%')%$&-,('%#41.'$%&')$0))/2.*04632,20)(+'&&,+7.97825-++**166678950-))%*+,-26-.6,*/(4.$+'+-5/0/.-02/-+)'%+73//245+(&(%%'))(&$#&&(7.:2-0;7014354398')-83/00/04:*330))&#)))-5/(-*++5#./+50-(,0765/1,,8//05/0.:0/%#$&)--+4+)+5575312+1&-')).'+&*%)(,,,((%++/,.2486112'&#$&##$%'(*+,1/)/+...+-.1312/1+**-(-.8---,*+,-.5,1,(+%..1,)--.8;441019.1780000313658;99621-,,.++)#,-.011537%#&-2,',-,86)(.''%(.2+/24,.23/./+*$)4--.0.340/+())0..62019-7:+).2(/*%),&--30/32*)&)%)$%')+2;829%*)'4:;401/,-71%.,'(*+)2837653/0-&/63861'(*-6*()5:.3--'%')',)2977&(%(%'+-/**-0727112246..*1,-..3&/.4535-3+3.00,7*%'1+12311321.35567:93&)*))'-/,2-7-.6/,..-4;6/3/&(&%**03745+-.-.::95544467..--))'*)#('*+,..(%)&'(%%&-+'++)*/1&&'$%&+*&())$()(,%+'$&'&($&'2.44:0..++#%).78*(((/1'($$&-:;98.(*00;;2-''),053.//3+&))+14-8&**,..01.2:;743425:7(,*.((+*,,-+'&*'+057,*(.53-(+3703/210.06256;.+,01.5<<5,06;:+.7)')3,$(+'4;.,*'*'*-4--)+-*)+&--,*$(+&(-$*,''/2778:;9/.857+%%'()*((*11-,)+-5-+,31/#&%$%5)-#%#
Then I try to show one of the sequences by searching for its sequence ID:
grep '#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t' all_barcode03.fastq.gz
grep '*#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t*' all_barcode03.fastq.gz
grep #3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t all_barcode03.fastq.gz
grep *#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t* all_barcode03.fastq.gz
All of the above grep commands return no results; however, there is a line in the file starting with #3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t
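Use zgrep, not grep, on .gz files: grep searches the compressed bytes, so the plain-text ID never appears in what it reads. From the zgrep man page:
zgrep - search possibly compressed files for a regular expression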
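As a minimal sketch of that advice, assuming gzip and zgrep are installed: the file demo.fastq.gz and its two short records are made up for illustration. -F treats the ID as a fixed string rather than a regex, and -A 3 also prints the three lines after the header, so the whole four-line record comes back.

```shell
# Build a tiny gzipped FASTQ-like file (hypothetical data; headers start
# with '#' as in the question).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '#read-one_t runid=abc read=10' 'ACGT' '+' '!!!!' \
  '#read-two_t runid=abc read=15' 'TTGA' '+' '####' \
  | gzip > "$tmpdir/demo.fastq.gz"

# zgrep decompresses on the fly before matching.
record=$(zgrep -F -A 3 '#read-one_t' "$tmpdir/demo.fastq.gz")
printf '%s\n' "$record"

# Equivalent without zgrep: decompress to stdout and pipe into grep.
record_pipe=$(gzip -dc "$tmpdir/demo.fastq.gz" | grep -F -A 3 '#read-one_t')

rm -rf "$tmpdir"
```

Note that -A 3 only returns a complete record when every read occupies exactly four lines, which is the usual layout but not guaranteed by the format.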
Related
show filename with matching word from grep only
I am trying to find which words occur in log files and show the log file name for anything that matches the following pattern:
'BA10\|BA20\|BA21\|BA30\|BA31\|BA00'
So if the file dummylogfile.log contains BA10002, I would like to get a result such as:
dummylogfile.log:BA10002
It is totally fine if the log file shows up twice for duplicate matches. The closest I got is:
for f in $(find . -name '*.err' -exec grep -l 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' {} \+);do printf $f;printf ':';grep -o 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' $f;done
but this gives things like:
./register-05-14-11-53-59_24154.err:BA10
BA10
./register_mdw_files_2020-05-14-11-54-32_24429.err:BA10
BA10
./process_tables.2020-05-18-11-18-09_11428.err:BA30
./status_load_2020-05-18-11-35-31_9185.err:BA30
So 1) there are lines with only the second match and no file name, and 2) the full match (e.g., BA10004) is not shown. Thanks for the help.
There are a couple of options you can pass to grep:
-H: report the file name along with the match
-o: only show the match, not the full line
-w: the match must form a full word (a string built from [A-Za-z0-9_])
If we look at your regex, a term such as BA01 matches only the four characters BA01, which can appear anywhere in the text, also mid-word. If you want the regex to match a full word, it should read BA01[[:alnum:]_]*, which adds any sequence of word-constituent characters (equivalent to [A-Za-z0-9_]). You can test this with:
$ echo "foo BA01234 barBA012" | grep -Ho "BA01"
(standard input):BA01
(standard input):BA01
$ echo "foo BA01234 barBA012" | grep -How "BA01"
$ echo "foo BA01234 barBA012" | grep -How "BA01[[:alnum:]_]*"
(standard input):BA01234
So your grep should look like:
grep -How "\(BA10\|BA20\|BA21\|BA30\|BA31\|BA00\)[[:alnum:]_]*" *.err
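A quick way to check the combined options on throwaway data. The file names a.err and b.err and their contents are hypothetical, and -E is used here only so the alternation can be written without backslashes; -H, -o and -w are widely supported (GNU and BSD grep) but not all are in POSIX.

```shell
tmpdir=$(mktemp -d)
printf 'started\nerror BA10002 raised\nxBA30 is mid-word\n' > "$tmpdir/a.err"
printf 'BA30017 and BA99 here\n' > "$tmpdir/b.err"

# -H prefixes the file name, -o prints only the match, and -w requires the
# match to start and end at word boundaries, so "xBA30" is skipped.
matches=$(cd "$tmpdir" && grep -HowE 'BA(10|20|21|30|31|00)[[:alnum:]_]*' *.err)
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

BA99 is not reported because 99 is not among the listed codes, and each reported line carries both the file name and the full word.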
From your example it seems that all files are in one directory, so the following works right away (it lists only the names of the matching files):
grep -l 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' *.err
If the files are in different directories:
find . -name '*.err' -print | xargs -I {} grep 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' {} /dev/null
Explanation: adding /dev/null after the file name {} gives grep more than one file to search, which forces it to report the matching file name.
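As an alternative sketch for nested directories, find can hand the files to grep directly, and -H makes grep print the file name even when it receives a single file. The directory layout below is made up for illustration; -H is supported by GNU and BSD grep but is not in POSIX.

```shell
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/logs/sub"
printf 'BA10002\n' > "$tmpdir/logs/top.err"
printf 'noise\nBA31 seen\n' > "$tmpdir/logs/sub/deep.err"
printf 'BA20 ignored: wrong extension\n' > "$tmpdir/logs/sub/other.log"

# "-exec ... {} +" passes many file names to one grep invocation;
# the output is sorted only to make the order predictable.
found=$(cd "$tmpdir/logs" &&
  find . -name '*.err' -exec grep -H -E 'BA(10|20|21|30|31|00)' {} + | LC_ALL=C sort)
printf '%s\n' "$found"

rm -rf "$tmpdir"
```

Compared with xargs -I {}, which starts one grep per file, this form starts far fewer processes and copes with spaces in file names.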
grep: Find all files containing the word `star`, but not the word `start`
I have a bunch of files: some contain the word star, some contain the word start, some contain both. I'd like to grep for files that contain the word star, but not the word start. How can this be accomplished using only grep?
grep has some options for inverting the matches at the line or file level. You want the latter option, with the -L switch. The following will print the names of all the files in a folder that don't contain the text start:
grep -LF start *
-F tells grep that start is a literal string and not a regex. It's optional here, but might speed things up a tiny bit. You can use the resulting list to search for files that contain star:
grep -lF star $(grep -LF start *)
-l prints only the names of files containing a match, not any line-by-line or match-by-match details. If this is not exactly what you want, man grep is your friend. This uses an additional shell construct to run the inverted match, but it technically doesn't call any additional programs that aren't grep.
Update
Since you mention wanting to look through all the files starting with a given root folder, change -LF to -LFr. Replace * with your root folder if you don't want to change working directories. -r tells grep to recurse into directories, and search every file it finds along the way.
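The two-step -L / -l combination can be tried on a scratch directory. The four file names below are hypothetical, and the sketch assumes a grep that supports -L (GNU and BSD grep do).

```shell
tmpdir=$(mktemp -d)
printf 'a star is born\n'   > "$tmpdir/only_star.txt"
printf 'start the engine\n' > "$tmpdir/only_start.txt"
printf 'star\nstart\n'      > "$tmpdir/both.txt"
printf 'nothing relevant\n' > "$tmpdir/neither.txt"

# Inner grep -L lists files WITHOUT "start"; outer grep -l keeps those of
# them that contain "star". The unquoted $(...) relies on word splitting,
# so this form assumes file names without spaces.
result=$(cd "$tmpdir" && grep -lF star $(grep -LF start *))
printf '%s\n' "$result"

rm -rf "$tmpdir"
```

One caveat of this form: if every file contains start, the inner command prints nothing and the outer grep falls back to reading standard input.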
With GNU grep for -w:
$ cat file
foo star bar
oof start rab
$ grep -w star *
foo star bar
or if you just want the names of the files containing star:
$ grep -lw star *
file
and to just find files to look in:
$ find . -maxdepth 1 -type f -exec grep -w 'star' {} \;
foo star bar
How to grep in one line starting from particular string to end with particular string
I want to grep "[calleruid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse" in the file below:
2014-10-15 18:38:32,831 plivo-rest[2781]: INFO: Fetching GET http://*******/outbound_callback.aspx with smscresponse[to]=8912722fsf9&smscresponse[ALegUUID]=5bb516fsd64-546c-11e4-879f-551816a551303677&smscresponse[calluid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse[direction]=outbosund&smscresfdsponse[endreason]=UNALLOCATED_NUMBER&smscresponse[from]=83339995896999&smscresponse[starttime]=0&smscresponse[ALegRequestUUID]=5bb4bafc-546c-11e4-891d-000c29ec6e41&smscresponse[RequestUUID]=5bb4bafc-546c-11e4-891d-000c29ec6e41&smscresponse[callstatus]=completed&smscresponse[endtime]=1413378509&smscresponse[ScheduledHangupId]=5bb4c15a-546c-11e4-891d-000c29ec6e41&smscresponse[event]=missed_call_hangup
I used this command:
$ grep -oP '(calluid).*$'
This matches up to the end of the line. I also used this command:
$ grep -oP '(calluid).{40}'
It fetches 40 characters, but I have thousands of calleruids, each with a different number of characters. Please guide me to grep the exact callerid data.
Use a lookahead to force the regex engine to match up to a specific character or boundary.
$ grep -oP '\[calluid\][^\]\[]*(?=\[|$)' file
[calluid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse
Here is a GNU awk version (GNU awk is needed because RS contains multiple characters):
awk -v RS="[[]calluid[]]=" -F[ 'NR==2 {print $1}' file
aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse
You can also set RS like this:
RS="\\\[calluid]="
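If only the identifier itself is wanted, without the trailing &smscresponse, a negated character class does the job without -P. This is a sketch under assumptions: the log line is shortened, the URL is a placeholder, and the value is taken to end at the next & character.

```shell
# Shortened, hypothetical log line carrying the same calluid value.
line='INFO: Fetching GET http://example.invalid/cb with smscresponse[to]=123&smscresponse[calluid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse[direction]=outbound'

# Match from "calluid]=" up to, but not including, the next "&",
# then keep the part after the "=" sign.
uid=$(printf '%s\n' "$line" | grep -o 'calluid]=[^&]*' | cut -d= -f2)
printf '%s\n' "$uid"
```

Because [^&]* stops at the first &, the length of the identifier no longer matters, which is what the fixed .{40} attempt could not handle.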
Using grep to find a string that starts with a character with numbers after
Okay, I have a file that contains numbers like this: L21479. What I am trying to do is use grep (or a similar tool) to find all the strings in a file that have the format L#####, where each # is a digit: so an L followed by 5 digits. Is this even possible in grep? Should I load the file and perform regex?
You can do this with grep, for example with the following command:
grep -E -o 'L[0-9]{5}' name_of_file
For example, given a file with the text:
kasdhflkashl143112343214L232134614
3L1431413543454L2342L3523269ufoidu
gl9983ugsdu8768IUHI/(JHKJASHD/(888
The command above will output:
L23213
L14314
L35232
If it is just in a single file, you can do something along the lines of:
grep -E 'L[0-9]{5}' filename
If you need to search all files in a directory for these strings:
find . -type f | xargs grep -E 'L[0-9]{5}'
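The {5} interval is written differently in basic and extended regular expressions, which is easy to trip over. A small sketch using the first sample line from the answer above; the variable names are just for illustration.

```shell
sample='kasdhflkashl143112343214L232134614'

# Extended syntax (-E): bare braces form an interval.
ere=$(printf '%s\n' "$sample" | grep -oE 'L[0-9]{5}')

# Basic syntax (the default): the braces need backslashes.
bre=$(printf '%s\n' "$sample" | grep -o 'L[0-9]\{5\}')

# Basic syntax with bare braces: "{5}" is taken literally, so nothing matches.
literal=$(printf '%s\n' "$sample" | grep -o 'L[0-9]{5}' || true)

printf 'ere=%s bre=%s literal=%s\n' "$ere" "$bre" "$literal"
```

Lowercase -e only introduces a pattern and does not switch to extended syntax, so it behaves like the third case.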
Is there a way in grep to find out how many lines matched the grep result?
Suppose I write a grep query to find out the occurrence of a method call on an object like this:
// might not be accurate, but irrelevant
grep -nr "[[:alnum:]]\.[[:alnum:]](.*)" .
This would give many results. How can I find out how many such results are obtained?
What about using | wc -l to count the number of result lines?
What about
man grep | grep "count"
It outputs
-c, --count Suppress normal output; instead print a count of matching lines for each input file. [...]
Previous answers are OK; I just want to put them into command-line instructions in order to have copy-paste versions (from explicit to simplest) for the future:
grep --count "PATTERN" FILE
is exactly the same as:
grep -c "PATTERN" FILE
and it is equivalent to:
grep "PATTERN" FILE | wc -l
As a bonus, below I give you a version where a file with a list of patterns is used:
grep --count --file=PATTERNFILE FILE
or simply
grep -cf PATTERNFILE FILE
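One distinction worth keeping in mind: -c counts matching lines, not individual matches, and with several input files (for example with -r) it prints one count per file rather than a total. A small sketch on made-up input, using a simplified version of the question's pattern:

```shell
tmpfile=$(mktemp)
printf 'foo.bar() and baz.qux()\nno calls here\nobj.run()\n' > "$tmpfile"

# Lines that contain at least one match.
lines=$(grep -c '[[:alnum:]]\.[[:alnum:]]' "$tmpfile")

# Individual matches: -o prints each match on its own line, wc -l counts them.
# tr strips the padding that some wc implementations add.
occurrences=$(grep -o '[[:alnum:]]\.[[:alnum:]]' "$tmpfile" | wc -l | tr -d ' ')

printf 'lines=%s occurrences=%s\n' "$lines" "$occurrences"
rm -f "$tmpfile"
```

For a grand total across many files, piping the plain grep output into wc -l gives a single number.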