I want to find out whether a text contains similar lines that directly follow each other. For example, I want to find any lines that contain "cccc" and come one after another.
aaaaaaaa
bbbbaaaa
ccccxxxx
ddddaaaa
eeeeaaaa
ccccxxxx <---
ccccyyyy <---
ddddaaaa
eeeeaaaa
So I should print out only the double cccc**** lines.
I tried something like:
grep "cccc" -A1 file.txt
but got all "cccc*" lines.
Simple problem I know...
Another example:
Search for duplicates of "Finland":
Iceland
Germany
FinlandsIsNiceButNoMatch
France
FinlandWillMatchTHisTime <---
FinlandWillAlsoMatch <---
Hungary
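This will match two consecutive lines that share the same 3 leading letters. It needs a grep with PCRE support (-P), and because the pattern is not anchored, the 3 letters on the first line may also sit mid-line: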
grep -Pzo "([a-zA-Z]{3}).*\n\1.*" file.txt
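If PCRE support is not available, the same idea can probably be sketched with awk. This is only a sketch under assumptions: the prefix is passed as a literal string in a hypothetical variable p, and the sample data is the Finland example from the question, written to a temporary file.

```shell
# Sample input taken from the question (the temporary file is an assumption).
tmp=$(mktemp)
printf '%s\n' Iceland Germany FinlandsIsNiceButNoMatch France \
    FinlandWillMatchTHisTime FinlandWillAlsoMatch Hungary > "$tmp"

# Print every run of two or more adjacent lines that start with the prefix p.
out=$(awk -v p="Finland" '
    index($0, p) == 1 {             # the line starts with the literal prefix
        if (run == 1) print prev    # second line of a run: emit the first one too
        if (run >= 1) print         # second and later lines of the run
        run++; prev = $0; next
    }
    { run = 0 }                     # any other line ends the current run
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

Because index() compares literal text, no regex escaping of the search word is needed.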
I'm trying to print the lines containing 2 or 3 digits, along with the rest of the line. I came up with this code:
grep -P '[[:digit:]]{2,3}' address
But this also prints the line that has 4 digits. I don't know why this is happening.
This code doesn't work either:
grep -E '[0-9]{2,3}' address
Here is the file containing address text:
12 main st
123 main street
1234 main street
I have already specified 2 or 3 digits with {2,3}, but the filter still doesn't work and the line with more than 3 digits is printed. Can anyone assist me with this? Thank you so much.
You can use inverted grep (-v) to filter out all lines containing 4 or more consecutive digits:
grep -vE '[0-9]{4}' address
EDIT:
I noticed you want only lines with 2 or 3 digits, and the first command also lets through lines with a single digit or no digits at all.
Here's the fix, again using the same method:
grep -E '[0-9]{2,3}' address | grep -vE '[0-9]{4}'
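The reason {2,3} alone matches the 4-digit line is that the pattern is unanchored: "1234" contains "123". A possible single-command variant, sketched here under the assumption that a "2 or 3 digit number" means a run of digits with no further digit on either side, requires a non-digit (or a line boundary) around the run. The temporary file below stands in for the address file.

```shell
# Sample data from the question, written to a temporary file (an assumption).
tmp=$(mktemp)
printf '%s\n' '12 main st' '123 main street' '1234 main street' > "$tmp"

# A run of 2 or 3 digits, preceded by line start or a non-digit,
# and followed by a non-digit or line end.
out=$(grep -E '(^|[^0-9])[0-9]{2,3}([^0-9]|$)' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

Unlike the two-step pipeline, this keeps a line that has both a 2 or 3 digit run and a longer number elsewhere, so which one fits depends on what should happen to such lines.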
I'm trying to reduce a .sm file (file1, around 10 GB) by filtering it against a fairly long set of words (around 180.108 items) listed in a text file, file2.
File1 is structured as follows:
word <http://internet.address.com> 1
i.e. one word followed by a blank space, an internet address, and a number.
File2 is a simple .txt file, a list of words, one on each line.
My aim is to create a third file File3 containing only those lines in file1 whose first word matches with the word-list of file2, and disregard the rest.
My attempt is the following:
grep -w -F -f file2.txt file1.sm > file3.sm
I've also attempted something along this line:
gawk 'FNR==NR {a[$1]; next } !($2 in a)' file2.txt file1.sm > file3.sm
but with no success. I understand /^ and \b might play a part here, but I don't know how to fit them in the syntax. I've looked around extensively but no solution seems to fit.
My problem is that here grep reads the entire file1's line, and it can happen that the matching word lies in the webpage address, which I'm not interested in finding out.
sed 's/^/^/' file2.txt | grep -f - file1.sm
join is the best tool for this, not grep/awk:
join -t' ' <(sort file1.sm) <(sort file2.txt) >file3.sm
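For comparison, here is a sketch of the awk approach with the lookup done on the first field, which seems to be what the question is after. It assumes the word list fits in memory and that fields are separated by blanks. The sample lines and example.com addresses are made up for illustration. Note that anchoring a pattern with ^ alone would also match longer words that merely start with a listed word, which an exact field comparison avoids.

```shell
# Hypothetical miniature versions of file1.sm and file2.txt.
dir=$(mktemp -d)
printf '%s\n' 'apple <http://banana.example.com> 1' \
              'banana <http://x.example.com> 2' \
              'cherry <http://apple.example.com> 3' > "$dir/file1.sm"
printf '%s\n' 'banana' > "$dir/file2.txt"

# First pass (NR == FNR): remember every word of the list.
# Second pass: print lines of file1 whose first field is in the list.
out=$(awk 'NR == FNR { words[$1]; next } $1 in words' \
    "$dir/file2.txt" "$dir/file1.sm")

printf '%s\n' "$out"
rm -rf "$dir"
```

The first sample line is not printed even though "banana" appears in its address, because only the first field is compared. No sorting of the 10 GB file is needed with this variant.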
I have several files that look like this:
abcd
several lines
abcd
several lines
abcd
several lines
.
.
.
What I want to do (preferably using grep) is get the 20 lines immediately following the LAST abcd line.
Any help is appreciated.
Thanks
Use -A option:
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines. Places a line
containing a group separator (--) between contiguous groups of matches.
With the -o or --only-matching option, this has no effect and a warning
is given.
So:
$ grep -A 20 abcd file.txt
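will give you each abcd line plus the 20 lines after it. To get only the last abcd line and its 20 following lines (21 lines in total, assuming at least 20 lines follow the last abcd), use tail: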
$ grep -A 20 abcd file.txt | tail -21
You can do this:
awk '/abcd/ {n=NR} {a[NR]=$0} END {for (i=n;i<=n+20;i++) print a[i]}' file
It searches for the pattern abcd and updates n on every match, so only the line number of the last match is kept.
It also stores every line in the array a.
In the END section it prints the last matching line and the 20 lines after it.
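For large files, a variant that does not keep every line in memory might look like the sketch below. It is an illustration rather than a drop-in replacement: it prints only the lines after the last abcd (not the abcd line itself), the count is passed in a hypothetical variable n, and the demo uses n=2 on made-up data to keep the output short.

```shell
# Made-up sample: two abcd markers, each followed by a few lines.
tmp=$(mktemp)
printf '%s\n' abcd a1 a2 abcd b1 b2 b3 > "$tmp"

# On every match, restart the buffer; afterwards keep at most n lines.
out=$(awk -v n=2 '
    /abcd/ { count = 0; seen = 1; next }         # new marker: discard old buffer
    seen && count < n { buf[++count] = $0 }      # collect up to n following lines
    END { for (i = 1; i <= count; i++) print buf[i] }
' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

With n=20 this would hold at most 20 lines at a time, and it prints fewer if the file ends early.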
I have a big txt file and I am looking for the seq ids that start with the species name "ABS". When I do grep "ABS", I only get the list of ABS, but not the seq id that follows that word. For example, the list I am looking for is like this:
ABS|contig05671,
ABS|contig04453,
ABS|CL5170Contig1,
ABS|contig02526,
But, when I do, grep "ABS" filename.txt, I get the result like this:
ABS,
ABS,
ABS,
ABS,
Any help is greatly appreciated. Thanks in advance.
From man grep:
Context Line Control
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-C NUM, -NUM, --context=NUM
Print NUM lines of output context. Places a line containing a
group separator (--) between contiguous groups of matches. With
the -o or --only-matching option, this has no effect and a
warning is given.
So if you need the matching line and the following one, you do grep -A1 ABS file.txt, and similarly for the preceding line with -B1.
However, if you want to format the results in another way (e.g. put the two lines on one and separate by the pipe character) you need a different tool than grep. grep does searching, whereas you also want editing.
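As an illustration of the "different tool" case, here is a sketch using awk. It rests on an assumption the question does not confirm: that each "ABS," line is directly followed by a line holding the seq id. The sample file and the variable names id and nxt are made up.

```shell
# Hypothetical layout: species line, then the seq id on the next line.
tmp=$(mktemp)
printf '%s\n' 'ABS,' 'contig05671,' 'XYZ,' 'contig00001,' \
              'ABS,' 'contig04453,' > "$tmp"

# For each line starting with ABS, read the following line and
# print both on one line, separated by a pipe character.
out=$(awk '/^ABS/ {
    id = $0
    sub(/,$/, "", id)                     # drop the trailing comma of the ABS line
    if ((getline nxt) > 0) print id "|" nxt
}' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

If the seq id actually sits on the same line as ABS, none of this is needed and plain grep already prints the whole line.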
I'm using the operating system's dictionary file to scan. I'm creating a Java program that lets a user enter any combination of letters to find words that contain those letters. How would I do this using grep commands?
To find words that contain only the given letters:
grep -v '[^aeiou]' wordlist
The above filters out every line in wordlist that contains a character other than those listed. It's sort of using a double negative to get what you want. Another way to do this would be (note the -E, since + is not a repetition operator in basic regular expressions):
grep -E '^[aeiou]+$' wordlist
which searches the whole line for a sequence of one or more of the selected letters.
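The two forms differ in one small way that a quick test makes visible: the -v form also lets an empty line through, because an empty line contains no "foreign" character. The sketch below uses a made-up four-line word list with one empty line in it.

```shell
# Made-up word list; the third line is intentionally empty.
tmp=$(mktemp)
printf 'eau\ntea\n\noui\n' > "$tmp"

# Double-negative form: drop lines containing anything outside the set.
out_v=$(grep -v '[^aeiou]' "$tmp")

# Positive form: the whole line must be one or more letters from the set.
out_e=$(grep -E '^[aeiou]+$' "$tmp")

printf 'with -v:\n%s\n' "$out_v"
printf 'with -E:\n%s\n' "$out_e"
rm -f "$tmp"
```

On a dictionary file without empty lines the two commands should give the same result.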
To find words that contain all of the given letters is a bit more lengthy, because there may be other letters in between the ones we want:
cat wordlist | grep a | grep e | grep i | grep o | grep u
(Yes, there is a useless use of cat above, but the symmetry is better this way.)
You can use a single grep to solve the last problem in Greg's answer, provided your grep supports PCRE. (Based on this excellent answer, boiled down a bit)
grep -P "(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)" wordlist
Each positive lookahead asserts that its letter appears somewhere in the line, so the pattern only matches lines that contain an "a", an "e", and so on for every listed letter.
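Since the letters come from user input, the chain of greps or lookaheads would have to be built dynamically. One way to avoid that, sketched here with awk instead of grep, is to pass the letters as a single string and test each one in a loop. The variable name letters and the sample word list are assumptions made for the example.

```shell
# Made-up word list standing in for the system dictionary.
tmp=$(mktemp)
printf '%s\n' education sequoia audio banana facetious > "$tmp"

# Print only the words that contain every character of "letters".
out=$(awk -v letters="aeiou" '{
    for (i = 1; i <= length(letters); i++)
        if (index($0, substr(letters, i, 1)) == 0)
            next                      # a required letter is missing: skip the word
    print
}' "$tmp")

printf '%s\n' "$out"
rm -f "$tmp"
```

The letters are compared as literal characters, so input such as "." or "*" is not treated as a regex.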