I got .txt file with city names, each in separate line. Some of them are few words with one or multiple spaces or words connected with '-'. I need to create bash command which will echo those lines out. Currently I'm using cat piped with grep but I can't get both spaces and dash into one search and I had problems with checking for multiple spaces.
print lines with dash:
cat file.txt | grep ".*-.*"
print lines with spaces:
cat file.txt | grep ".*\s.*"
tho when I try to do:
cat file.txt | grep ".*\s+.*"
I get nothing.
Thanks for help
Something like that should work:
grep -E -- ' |\-' file.txt
Explanation:
-E: to interpret patterns as extended regular expressions
--: to signify the end of command options
' |\-': the line contains either a space or a dash
This does not directly address your question, but is too much to put in a comment.
You don't need the .* in your patterns. .* at the beginning or end of a pattern is useless, because it means "0 or more of any character" and so will always match.
These lines are all identical:
cat file.txt | grep ".*-.*"
cat file.txt | grep "-.*"
cat file.txt | grep "-"
Plus you don't need to cat and pipe:
grep "-" file.txt
When grep pattern matches, the default action is to print the whole line, so .* in all your patterns are redundant, you may delete them. Also, you don't have to use cat file | as you may specify the file to grep directly after pattern, i.e. grep 'pattern' file.txt.
Here are some more details:
grep ".*-.*" = grep -- "-" - returns any lines having a - char (-- singals the end of options, the next thing is the pattern)
grep ".*\s.*" = grep "\s" - matches and returns lines containing a whitespace char (only GNU grep)
grep ".*\s+.*" = grep "\s+" - returns line containing a whitespace followed with a literal + char (since you are using POSIX BRE regex here the unescaped + matches a literal plus symbol).
You want
grep "[[:space:]-]" file.txt
See the online demo:
#!/bin/bash
s='abc - def
ghi
jkl mno'
grep '[[:space:]-]' <<< "$s"
Output:
abc - def
jkl mno
The [[:space:]-] POSIX BRE and ERE (enabled with -E option) compliant pattern matches either any whitespace (with the [:space:] POSIX character class) or a hyphen.
Note that [\s-] won't work since \s inside a bracket expression is not treated as a regex escape sequence but as a mere \ or s.
I'm trying to grep multiple strings which look like this (there's a few hundred) against a file which contains data:string
Example strings: (no sensitive data is provided, they have been modified).
$H$9a...DcuCqC/rMVmfiFNm2rqhK5vFW1
$H$9n...AHZAV.sTefg8ap8qI8U4A5fY91
$H$9o...Bi6Z3E04x6ev1ZCz0hItSh2JJ/
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1
I've been researching how to grep a file of patterns against another file, and came across the following commands
grep -f strings.txt datastring.txt > output.txt
grep -Ff strings.txt datastring.txt > output.txt
But unfortunately, these commands do NOT work successfully, and only print out a handful of results to my output file. I think it may be something to do with the symbols contained in strings.txt, but I'm unsure. Any help/advice would be great.
To further mention, I'm using Cygwin on Windows (if this is relevant).
Here's an updated example:
strings.txt contains the following:
$H$9a...DcuCqC/rMVmfiFNm2rqhK5vFW1
$H$9n...AHZAV.sTefg8ap8qI8U4A5fY91
$H$9o...Bi6Z3E04x6ev1ZCz0hItSh2JJ/
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1
datastring.txt contains the following:
$H$9a...DcuCqC/rMVmfiFNm2rqhK5vFW1:53491
$H$9n...AHZAV.sTefg8ap8qI8U4A5fY91:03221
$H$9o...Bi6Z3E04x6ev1ZCz0hItSh2JJ/:20521
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1:30142
So technically, all lines should be included in the OUTPUT file, but only this line is outputted:
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1:30142
I just don't understand.
You have showed the output of cat -A strings.txt elsewhere, which includes ^M representing a CR (carriage return) character at the end of each line:
This indicates your file has Windows line endings (CR LF) instead of the Unix line endings (only LF) that grep would expect.
You can convert files with dos2unix strings.txt and back with unix2dos strings.txt.
Alternatively, if you don't have dos2unix installed in your Cygwin environment, you can also do that with sed.
sed -i 's/\r$//' strings.txt # dos2unix
sed -i 's/$/\r/' strings.txt # unix2dos
I'm trying to extract the a specific sequence from a fastq file using grep to search the sequence ID
less all_barcode03.fastq.gz
#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t runid=7204dc15205b93bfd6430ca0f3a0218f11ce0787 read=10 ch=120 start_time=2019-04-12T13:55:25Z
TCGGTAGCCACTTCGTTCAGTCAATTTGGGTTGTTTAACCGAGTCTTGTGTGTCCCAGTTACCAGGGTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTCGTGCGCCGCTTCAGTGATCAGTGAAGATGGGTTTGTGGTGGAATACTCTTGCTGCTCATGGCAAACTTTATGTTGGTTTTCTCATGCATTTGTTTCTCGTAATCCCATACGTCATCCAAAGTCATCTGAAAAAGAGGGAAGGGGTGGATTGTGGGTGAAATGTTGTGTACTCCTCTATAATGGGGCTCAGTTGACAAACAGGTGGAGGAGAGGATCATTTGCTTAAAGGGGTGAGTGAAGCGGAGTTTAAGGATAATTCAAGCTTTTAAAAGTGGCTTTAGAGGTAAAGGGTTAGCTCCCATGACCCACAGGATTTATAGGAGATGGCTCTGAACAAACCAGAGCCACACACACA
+
-%&&($$%%#%,-*),-5(&,$$%$%+).'-(+-4-(')%%$*+-,3...14,7/))/03.06-./-3:8.0(*,/7+*,966006.,(*(,-(&(*,./+--902/./),,,0,-/./,4(+0/,0).0-7048,(+*',*/.)*#(((.0--10764+('(%.3/+$&%&'./4'0.;:6.895778+0/*(28/),(+-/404/*'(),.16517&83+*/0/0.--033**$&'*,''*/,,,/..0.*0*0$##*((($/6&('-,.230/01/2+4,,::8719(*.4.'.26/0(*))0*+,(*+-,-+-.4765-$%&.'%.*/')(&''#-()*21,-.;+3).*,,'557686+(-7;-2:8))(&%%'*)**%&&).6&,*(.-'$'(*2+*0587:0+*+)/*/63--/*('#&)-68664&%534)/13.))'14*+**%%$$#
#69e7e435-a78c-4ec8-94cd-b0c1f3c40c11_t runid=7204dc15205b93bfd6430ca0f3a0218f11ce0787 read=15 ch=465 start_time=2019-04-12T13:55:25Z
TCGGTACTTCGTTCGGTTGGAGAAGGTGGTGTTGCCGAGTCTTGTGTCCCAGTTACCAGGGTTTTCGCATTTATCGTGGCTTGCTGCGTTTTCGTGCGCCACCGCTTCATGTGTGTGTGTGTGTCTGGTGTTATTACTCACTTGGCAAGCGTGTCTGGACAGCAGCTGTTTGAGTGTTGAGAGCGCTTCTTCTCCAGGAGAAGCGGTTGAGCCTAAGCTGAATCCCCGTCCGTCTTTATCTTCGGACATGCTCTGGATATGCCTGAGGAGGACAATGGAGGAACAGAACAGATGGATGAAGAGCTCATAAAACTGGCACACATGCATCAAAGCCCACCTTCGTCACTCTGATGACCAGTGACTGCCGTTTATTACTGCGATTTACCATGAAGTTATCTGCTTTTTGGGTCAGTTAGTGTGTGTGTGTGTGTGTGTGTGTGTGCCTTTTCTGTCCTCCAGATACTCAGTACTACAGAGGAGCTATTAATACTTACTACATCGATATGTTATGTAATATCATTCTAGCCTGCTACTCCTGTCTTCTGTATACAACTGTCGTCTGTCCCGAATAGCTCCTGGGTGCCCTCTCCTCCATAGTAGCCACAGTTACAGGAATATTACTCTTTATCATAGAAGCGGTATCTAGTAGAACAGTCCTTAGTTAAAATAATAACGGGGTGTGGGCATGTACAGCCTCTGGTATTCCGTTGCTCAGCAGAGCCTCATAACTCTCCTAGTGGCTCAGGAAGGCTGAAACAGGCTGTGTGCACCCAGCCAGCTGGAACTGTGTTTGAGTGCCATCTTGGAATACTGTTTATAAGCGCTCTTAAGTTATATGTGAGGATGGTGGTATTAGATATGGAAGTGTGTAGGAGGAGAAAGAGGAAATAGTGTCATGTTGATATGAACAGTTTGGTCAGTAAAATGAGGGCAGTAAAAAAGTGTTTTAAGCGTTTTGTCGGTCGACAATATGATAATAAAATGCATTTGGTTCACGATAACAAGAAAACAGAAAAGACCAGCAATGAATATTTAGCATTTTTTGTTTGAAAGATGAAACAAATAATTGAAATAGCTGCCAAATATTTGTGAAATGTACTAAATGGTCAGAGTGAAGATGCAGCTTTGAAAAGAAGATTCGGA
+
+&$%)'./-0,*1(&&&%#%&$(&)'%&&%$"#$&+,'*-*1+++5-73+)*/+,32/46552:/-+2025/+-057,$#$$&)/01,)433/2732'&$#&$"$'$((+*+),+,,,*+,,-11)*'&((*"0#"&*((,*.--.&.+-*,)-17861+&%'%)),73:60-/-32:++(('.')+56894,4+)./'%')%$&-,('%#41.'$%&')$0))/2.*04632,20)(+'&&,+7.97825-++**166678950-))%*+,-26-.6,*/(4.$+'+-5/0/.-02/-+)'%+73//245+(&(%%'))(&$#&&(7.:2-0;7014354398')-83/00/04:*330))&#)))-5/(-*++5#./+50-(,0765/1,,8//05/0.:0/%#$&)--+4+)+5575312+1&-')).'+&*%)(,,,((%++/,.2486112'&#$&##$%'(*+,1/)/+...+-.1312/1+**-(-.8---,*+,-.5,1,(+%..1,)--.8;441019.1780000313658;99621-,,.++)#,-.011537%#&-2,',-,86)(.''%(.2+/24,.23/./+*$)4--.0.340/+())0..62019-7:+).2(/*%),&--30/32*)&)%)$%')+2;829%*)'4:;401/,-71%.,'(*+)2837653/0-&/63861'(*-6*()5:.3--'%')',)2977&(%(%'+-/**-0727112246..*1,-..3&/.4535-3+3.00,7*%'1+12311321.35567:93&)*))'-/,2-7-.6/,..-4;6/3/&(&%**03745+-.-.::95544467..--))'*)#('*+,..(%)&'(%%&-+'++)*/1&&'$%&+*&())$()(,%+'$&'&($&'2.44:0..++#%).78*(((/1'($$&-:;98.(*00;;2-''),053.//3+&))+14-8&**,..01.2:;743425:7(,*.((+*,,-+'&*'+057,*(.53-(+3703/210.06256;.+,01.5<<5,06;:+.7)')3,$(+'4;.,*'*'*-4--)+-*)+&--,*$(+&(-$*,''/2778:;9/.857+%%'()*((*11-,)+-5-+,31/#&%$%5)-#%#
Then try to show one of the sequences by searching for the sequence ID:
grep '#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t' all_barcode03.fastq.gz
grep '*#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t*' all_barcode03.fastq.gz
grep #3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t all_barcode03.fastq.gz
grep *#3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t* all_barcode03.fastq.gz
All the above grep commands return no results however there is a line in the file staring with #3cb04ae7-2c7b-4da8-8d09-59edb5b8f45c_t
Use zgrep not grep on .gz files.
zgrep - search possibly compressed files for a regular expression
I have 18 csv files, all between 1mb and 14mb. The sum of all files is 64mb. I want to create a new csv file that contains a subset of those files-- only the lines featuring the pattern "Hello" (or "HELLO", or "hello" ...). Here's what I'm doing
cat *.csv | head -n 1 > new.csv # I want to create a header first
cat *.csv | grep -i "hello" >> new.csv
I'm running Debian on WSL. The output file is much, much larger than the original 64mb (I stopped the process after 1+ hour, and the file was 300+ GB).
How can a subset of a text file be larger than the original files? Does it have anything to do with WSL?
This is not an OS issue. When you redirect your output to new.csv, shell creates that file first, before the glob expression *.csv is evaluated. That means the expansion of *.csv would include new.csv as well. That seems like the root cause of the recursive grep issue you are facing.
You are reading all the files twice, which is not necessary. You can make your operation a lot simpler and efficient with a single awk command:
awk 'NR==1 {print} tolower($0) ~ /hello/ {print}' *.csv > csv.new
mv csv.new new.csv
since the output file is named csv.new it won't interfere with the glob *.csv
NR==1 picks up the first line (header) from the very first file
The awk command can be written more succinctly as:
awk 'NR==1 || tolower($0) ~ /hello/' *.csv > csv.new
You are using *.csv and redirecting the output to new.csv which falls under *.csv which is causing recursion in grep result. perhaps you can try,
grep -i hello *.csv --exclude="new.csv" >> new.csv
Okay I have a file that contains numbers like this:
L21479
What I am trying to do is use grep (or a similar tool) to find all the strings in a file that have the format:
L#####
The # will be the number. SO an L followed by 5 numbers.
Is this even possible in grep? Should I load the file and perform regex?
You can do this with grep, for example with the following command:
grep -E -o 'L[0-9]{5}' name_of_file
For example, given a file with the text:
kasdhflkashl143112343214L232134614
3L1431413543454L2342L3523269ufoidu
gl9983ugsdu8768IUHI/(JHKJASHD/(888
The command above will output:
L23213
L14314
L35232
If it is just in a single file, you can do something along the lines of:
grep -e 'L[0-9]{5}' filename
If you need to search all files in a directory for these strings:
find . -type f | xargs grep -e 'L[0-9]{5}'