regexp character or end of line with egrep - grep

I have the following regexp:
egrep '(chr1 .*n70$|chr1 .*n70-)' results/files/forbidden_variants
This matches
chr1 n70
chr1 n70-n79
chr1 n70-n79-n83
chr1 n70-n79
chr1 n70-n79-s15-s16
chr1 n70
chr1 n70-n91
chr1 n70
This is terribly slow, as I am replacing ids such as n70 with different values millions of times.
Therefore I wanted to get rid of the OR, and wrote:
egrep '(chr1 .*n70[-\$])' results/files/forbidden_variants
but it does not work, because this command does not match the end of line. The output looks like this:
chr1 n70-n79
chr1 n70-n79-n83
chr1 n70-n79
chr1 n70-n79-s15-s16
chr1 n70-n91
What am I doing wrong here? :) Thank you.

Just add a + to the current regex:
egrep '(chr1 n70[-\$]+)' results/files/forbidden_variants

Why not simply use
chr1 n70
Or you can use an OR:
chr1 n70($|-)
which is basically equivalent to your first expression. In your first expression I don't see the need for .* given your matches.
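A note on why the bracket version misses the end of line: inside a bracket expression, $ is an ordinary character rather than an anchor, so [-\$] can only match a literal -, \ or $. Below is a minimal sketch that compares the bracket form with the grouped alternation. The four-line sample is made up for illustration and is not the real forbidden_variants file.

```shell
# Made-up sample data, one variant per line.
sample='chr1 n70
chr1 n70-n79
chr1 n701
chr2 n70'

# Inside [...] the $ is a literal dollar sign, so only "n70-" is found.
bracket_hits=$(printf '%s\n' "$sample" | grep -E 'chr1 .*n70[-$]')

# In a group, $ stays an anchor: "n70" at end of line or followed by "-".
group_hits=$(printf '%s\n' "$sample" | grep -E 'chr1 .*n70($|-)')

printf 'bracket form:\n%s\n' "$bracket_hits"
printf 'grouped form:\n%s\n' "$group_hits"
```

The grouped form still has only one copy of the chr1 .*n70 prefix, which is what the question was after.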

extract the adjacent character of selected letter

I have this text file:
# cat letter.txt
this
is
just
a
test
to
check
if
grep
works
The letter "e" appears in 3 words.
# grep e letter.txt
test
check
grep
Is there any way to return the letter printed to the left of the selected character?
expected.txt
t
h
r
With the shown samples, could you please try the following awk command.
awk '/e/{print substr($0,index($0,"e")-1,1)}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/e/{ ##Looking if current line has e in it then do following.
print substr($0,index($0,"e")-1,1)
##Printing the 1-character substring that starts one position before the first e.
}
' Input_file ##Mentioning Input_file name here.
You can use positive lookahead to match a character that is followed by an e, without making the e part of the match.
grep -oP '.(?=e)' letter.txt
With sed:
sed -nE 's/.*(.)e.*/\1/p' letter.txt
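These one-liners behave differently once a word contains more than one e. As far as I can tell, the awk index() version reports the character before the first e, the sed version (greedy .*) reports the character before the last e, and a grep -o pipeline reports one per match. A small sketch on the single word "Wednesday"; the grep -o '.e' | cut -c1 pipeline is my own illustration and not one of the answers above:

```shell
word='Wednesday'

# index() finds the first "e", so this prints the character before it.
first=$(printf '%s\n' "$word" | awk '/e/{print substr($0,index($0,"e")-1,1)}')

# The leading .* is greedy, so the capture ends up before the last "e".
last=$(printf '%s\n' "$word" | sed -nE 's/.*(.)e.*/\1/p')

# grep -o prints every "<char>e" match; cut keeps only the first character.
all=$(printf '%s\n' "$word" | grep -o '.e' | cut -c1)

printf 'first: %s\nlast: %s\nall:\n%s\n' "$first" "$last" "$all"
```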
Assuming you have this input file:
cat file
this
is
just
a
test
to
check
if
grep
works
egg
element
You may use this grep + sed solution to find the letter (or an empty string) before each e:
grep -oE '(^|.)e' file | sed 's/.$//'
t
h
r
l
m
Or alternatively this single awk command should also work:
awk -F 'e' 'NF > 1 {
for (i=1; i<NF; i++) print substr($i, length($i), 1)
}' file
This might work for you (GNU sed):
sed -nE '/(.)e/{s//\n\1\n/;s/^[^\n]*\n//;P;D}' file
Turn off implicit printing and enable extended regexp -nE.
Focus only on lines that meet the requirements i.e. contain a character before e.
Surround the required character by newlines.
Remove any characters before and including the first newline.
Print the first line (up to the second newline).
Delete the first line (including the newline).
Repeat.
N.B. The solution will print each such character on a separate line.
To print all such characters from one input line together on a single line, use:
sed -nE '/(.e)/{s//\n\1/g;s/^/e/;s/e[^\n]*\n?//g;s/\B/ /g;p}' file
N.B. Remove the s/\B/ /g if space separation is not needed.
With GNU awk you can use empty string as FS to split the input as individual characters:
awk -v FS= '/[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file
t
h
r
The for loop starts at i=2, which excludes an "e" at the beginning of the word.
Edit: to print an empty string if e is the first character in the word, add a /^[e]/ rule.
For example, with this input:
cat file2
grep
erroneously
egg
Wednesday
effectively
awk -v FS= '/^[e]/ {print ""} /[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file2
r
n
W
n
f
v

How to get ripgrep to tell me which expressions from a list have no matches on the filesystem

For instance, say I have the list of strings that I want to search for:
alfa bravo charlie delta nebuchadnezzar bartholomew
and in my repo there are files that contain alfa, bravo, charlie and delta, but there are no files that contain nebuchadnezzar and no files that contain bartholomew. Then I want the answer to be:
nebuchadnezzar bartholomew
As you might guess, I'm searching for deprecated things. I ended up using the following Ruby code as a workaround, as I couldn't figure out a solution after trying man rg.
%w[alfa bravo charlie delta nebuchadnezzar bartholomew].each do |word|
command = 'rg ' + word
if `#{command}` == '' # execute the command, see if ripgrep found nothing
puts word
end
end
You can use the exit code of rg in a simple shell loop. According to the docs, rg returns exit code 1 when no match is found for the regex and no errors occurred. Adapting that:
for word in alfa bravo charlie delta nebuchadnezzar bartholomew; do
rg "$word" >/dev/null 2>&1
[ "$?" -eq 1 ] && printf '%s\n' "no match for $word"
done
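The same exit-status idea can be wrapped in a small function. This is only a sketch: unmatched_words is a hypothetical helper name, it takes the directory to search as its first argument, and it falls back to grep -rq when rg is not installed, since grep also exits with 1 when nothing matches.

```shell
# Hypothetical helper: print every word that matches nowhere under a directory.
unmatched_words() {
    dir=$1
    shift
    # Prefer ripgrep in quiet mode; fall back to recursive grep if rg is missing.
    if command -v rg >/dev/null 2>&1; then
        search='rg -q'
    else
        search='grep -rq'
    fi
    for word in "$@"; do
        status=0
        $search -e "$word" "$dir" || status=$?
        # Exit status 1 means "no match"; 2 would indicate an error.
        [ "$status" -eq 1 ] && printf '%s\n' "$word"
    done
    return 0
}

# Usage: unmatched_words . alfa bravo charlie delta nebuchadnezzar bartholomew
```

Checking for status 1 specifically, rather than any non-zero status, keeps real errors (unreadable files, a bad regex) from being reported as "unused".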

Join Merge only the first two lines of a file using AWK or SED

I have a file like this
Line1
Line2
Line3
Line4
Line5
I need the output like this:
Line1Line2
Line3
Line4
Line5
I tried sed ":a;N;$!ba;s/\n//g" asd.txt but it combines all lines into one.
An awk solution would be like
$ awk '{ORS=(NR==1?"":"\n")}1 ' input
Line1Line2
Line3
Line4
Line5
OR
$ awk '{ORS=(NR==1?"":RS)}1 ' input
Line1Line2
Line3
Line4
Line5
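To see why the ORS trick works: awk appends ORS after every record it prints, and the assignment is re-evaluated for each record before the print. A small sketch on a four-line sample; the variable n is a hypothetical addition (not part of the answer above) that picks which line gets joined with its successor.

```shell
input=$(printf '%s\n' Line1 Line2 Line3 Line4)

# Same idea as the answer: no separator after record 1, a newline after the others.
first_two=$(printf '%s\n' "$input" | awk '{ ORS = (NR == 1 ? "" : "\n") } 1')

# Hypothetical generalisation: join line n with line n+1.
second_pair=$(printf '%s\n' "$input" | awk -v n=2 '{ ORS = (NR == n ? "" : "\n") } 1')

printf '%s\n---\n%s\n' "$first_two" "$second_pair"
```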
Using sed you can restrict an operation to a specific line number. In this case, we are restricting the append (to pattern space) and substitution to line 1:
sed '1 {N; s/\n//}' file
Note that this solution could also be written without the braces:
sed '1N; s/\n//' file
But please note that this last solution is somewhat less maintainable. Whether or not that's problematic for you is another thing. In either case, the results are:
Line1Line2
Line3
Line4
Line5
You could try the below sed command,
$ sed 'N;0,/\n/s/\n//' file
Line1Line2
Line3
Line4
Line5
N appends the next line to the pattern space. 0,/\n/ specifies a range (a GNU sed extension) that restricts the replacement to the first match only. s/\n// replaces the first newline character with an empty string.
sed '1 {N;s/\n//}'
results
Line1Line2
Line3
Line4
Line5
This takes line 1, appends the next line to it, and then removes the newline character.

creating a file with a unique string per line in command line

I am trying to create a file (using AWK, but do not mind switching if another command is easier) that has a unique string in each line (183745 lines total). I am trying to make a file as such:
line1
line2
line3
....
line183745
With poor knowledge of AWK, and failure to find a similar example, I have unsuccessfully tried (with 10 lines for this example):
awk '{ i = 1; while (i < 10) { print "line$i \n"}; i++ }'
And this leads to no error or output. Thank you.
Why make it complicated?
seq -f "line%06g" 3
line000001
line000002
line000003
seq -f "line%06g" 183745 >newfile
You'll need to put this in a BEGIN block, as you're not processing any lines of input.
awk 'BEGIN { i = 1 ; while (i <= 10) { print "line"i ; i++ } }'
awk acts like a filter by default. In your case, it's simply blocking on input. Unblock it by explicitly giving it no input, for example:
awk '...' </dev/null
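For what it's worth, the original attempt has two more problems besides waiting for input: awk does not expand $i inside a quoted string, and the i++ sits outside the while body, so the loop would never end once a line of input arrived. A minimal sketch of the BEGIN-only form, using 10 lines as in the question:

```shell
# BEGIN runs before any input is read, so no file, pipe or </dev/null is needed.
# "line" i concatenates the string with the value of i.
lines=$(awk 'BEGIN { for (i = 1; i <= 10; i++) print "line" i }')

# Show the first and last generated line.
printf '%s\n' "$lines" | sed -n '1p;$p'
```

Changing 10 to 183745 and redirecting the awk output to a file gives the full list.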
If I had to do this, I would do it with seq or in vim.
But since others have already posted seq and classic awk solutions, I will add another awk solution for fun.
The very "useful" command yes can help us:
yes | awk '$0="line"NR;NR==183745{exit}'
test with 1-10, for example:
kent$ yes|awk '$0="line"NR;NR==10{exit}'
line1
line2
line3
line4
line5
line6
line7
line8
line9
line10

Whitespace in search string when matching using grep.

I have a file which looks like this.
10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
10gs+VWW+A+210 10gs-GLU-A-31- 0.363077 0.151282
10gs+VWW+A+210 10gs-GLY-A-207 0.602564 0.060256
10gs+VWW+A+210 10gs-LEU-A-132 0.378151 0.288462
10gs+VWW+A+210 10gs-LEU-A-60- 0.376812 0.133333
10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282
15gs+VWW+A+210 11ba-VAL-A-47- 0.413793 0.215385
I want to grep out the lines that match a pattern [inclusive of the whitespace in it]. Let's say the pattern is: '10gs+VWW+A+210 11ba-'
When I give such a pattern as an argument to grep, I get the matching lines correctly. However the problem arises when I want to match multiple patterns like these from a file say pattern.txt which has a list of all these patterns on each line.
pattern.txt looks like this:
10gs+VWW+A+210 11ba-
10gs+VWW+A+210 10gs-
When I use a shell script like this:
for i in `cat pattern.txt`; do grep -e "^$i" bigfile.txt ; done
the command takes 10gs+VWW+A+210 separately and 11ba separately to match. I want to match the entire thing (separated by a space) i.e. 10gs+VWW+A+210 11ba to be matched, and not the two strings separately.
How do I modify the existing shell script to overcome the white space character in the search string?
Also, the file against which I am matching this set of strings is large (~50GB), so a memory-efficient solution is welcome.
Thanks.
Replace spaces with other symbols
Assuming # never occurs in the patterns
for i in $( cat pattern.txt | tr ' ' '#' ) ; do
j=$(echo "$i" | tr '#' ' ' )
grep -e "^$j" bigfile.txt
done
Timing on my test file
real 0m20.739s
user 0m11.773s
sys 0m8.345s
Use -f flag in grep
grep -f pattern.txt bigfile.txt
Timing on the same test file
real 0m2.190s
user 0m2.163s
sys 0m0.026s
In other words, the performance of grep -f appears to be about 10 times better with a large pattern file.
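Two details may be worth adding here. The for loop splits on whitespace, which a while read loop avoids, and plain grep -f drops the ^ anchor that the original loop used. A sketch under assumptions: the temporary directory and the three-line sample are made up, and anchored.txt is a hypothetical intermediate file holding the patterns with ^ prepended.

```shell
# Made-up miniature stand-ins for pattern.txt and bigfile.txt.
workdir=$(mktemp -d)
printf '%s\n' '10gs+VWW+A+210 11ba-' > "$workdir/pattern.txt"
printf '%s\n' \
    '10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872' \
    '10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846' \
    '17gs+VWW+A+210 11ba-SER-A-77- 0.415789 0.101282' > "$workdir/bigfile.txt"

# read -r takes one whole line at a time, so the embedded space survives.
loop_hits=$(while IFS= read -r pat; do
    grep -e "^$pat" "$workdir/bigfile.txt"
done < "$workdir/pattern.txt")

# Prepend ^ to every pattern, then let grep scan the big file only once.
sed 's/^/^/' "$workdir/pattern.txt" > "$workdir/anchored.txt"
single_pass_hits=$(grep -f "$workdir/anchored.txt" "$workdir/bigfile.txt")

printf '%s\n' "$single_pass_hits"
rm -rf "$workdir"
```

The loop reads the big file once per pattern, while the -f form reads it once in total, which matters for a ~50GB file. The + characters are literal here because grep uses basic regular expressions by default.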
Does the following command and corresponding result suit you? The patterns must be split by a pipe to make either one of them match.
Command:
egrep '10gs\+VWW\+A\+210 11ba-|10gs\+VWW\+A\+210 10gs-' bigfile.txt
Result:
10gs+VWW+A+210 10gs-ASN-A-206 0.616667 0.094872
10gs+VWW+A+210 10gs-GLU-A-31- 0.363077 0.151282
10gs+VWW+A+210 10gs-GLY-A-207 0.602564 0.060256
10gs+VWW+A+210 10gs-LEU-A-132 0.378151 0.288462
10gs+VWW+A+210 10gs-LEU-A-60- 0.376812 0.133333
10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
10gs+VWW+A+210 11ba-GLU-A-2-z 0.333333 0.065385
10gs+VWW+A+210 11ba-SER-A-15- 0.400000 0.053846
