GREP: Find words containing multiple specific characters - grep

To clarify a bit, while I am aware how to grab words with a single specific character, I'm unsure how to approach looking for multiple of them. For example, what grep command would be used to retrieve only the words containing both "b" and "p" (in any order), not just one or the other?
Using the above example, if you're given words like "bear," "pear," "biography," and "printable," it would only return the last two words. These are some of my previous attempts.
grep -E "\b[bp]\b" input
grep -E "\b(b|p)\b" input
grep -E "\bb.*p\b" input

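You can do it with a regular expression. For instance, here is a snippet for your problem. It matches a word that has a "b" somewhere before a "p", or a "p" somewhere before a "b" (note that \w and \| are GNU grep extensions):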
grep '\w*[b]\w*[p]\w*\|\w*[p]\w*[b]\w*' test.txt
Helpful links to read further:
https://www.cyberciti.biz/faq/grep-regular-expressions/
https://regexr.com/
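
If the goal is to print only the matching words rather than whole lines, one alternative sketch is to split the input into one word per line and then filter twice, once per required letter. The function name `words_with_both` and the sample input are illustrative, and `-o` is assumed to be available (GNU and BSD grep support it; POSIX does not require it).

```shell
# Print each word that contains both letters, in any order.
# Assumes a grep that supports -o (GNU and BSD grep do).
words_with_both() {
  # Step 1: emit every run of letters on its own line.
  # Steps 2 and 3: keep words containing "b", then only those also containing "p".
  grep -oE '[[:alpha:]]+' | grep 'b' | grep 'p'
}

result=$(printf '%s\n' 'bear pear biography printable' | words_with_both)
printf '%s\n' "$result"
```

Chaining one grep per letter avoids spelling out every ordering, which matters once more than two letters are required.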

Related

GNU grep, backreferences and wildcards

Using grep (GNU grep 3.3) to search for all words with three consecutive double-letters (resulting in "bookkeeper"):
grep -E "((.)\2){3}" /usr/share/dict/american-english
Changing this to search for words with three double-letters, each of them followed by the letter "i" (resulting in "Mississippi"):
grep -E "((.)\2i){3}" /usr/share/dict/american-english
Changing this to search for words with three double-letters, each of them followed by any single letter (with a couple of results):
grep -E "((.)\2.){3}" /usr/share/dict/american-english
Changing this to search for words with three double-letters separated by an optional, single letter (even more results):
grep -E "((.)\2.?){3}" /usr/share/dict/american-english
Now, finally, my original task: Search for all words containing three double-letters:
grep -E "((.)\2.*){3}" /usr/share/dict/american-english
But this results in an empty set. Why? How can .? match something .* does not?
The POSIX regex engine does not handle patterns with back-references well; the fact that matching with back-references is an NP-complete problem might provide some hints on why it is that difficult.
Since you are using GNU grep, the problem is easily solved with the PCRE engine,
grep -P '((.)\2.*){3}' file
since the PCRE engine can handle back-references in a more efficient way than the POSIX regex engine.
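
As a small sanity check of how the group numbering works in these patterns, the two working commands from the question can be run on a handful of sample words instead of the dictionary file. In `((.)\2){3}` the outer parentheses are group 1 and the inner `(.)` is group 2, so `\2` repeats the single character just captured. The word list is made up for illustration, and back-references in `-E` patterns are not required by POSIX, although GNU grep accepts them.

```shell
# Sample words standing in for /usr/share/dict/american-english.
sample='bookkeeper
Mississippi
balloon
committee'

# Three consecutive double letters: oo, kk, ee.
triple=$(printf '%s\n' "$sample" | grep -E '((.)\2){3}')

# Three double letters, each followed by "i": ssi, ssi, ppi.
triple_i=$(printf '%s\n' "$sample" | grep -E '((.)\2i){3}')

printf '%s\n' "$triple" "$triple_i"
```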

How to grep for files using 'and' operator, words might not be on the same line

I have a directory /dir which has several text files in it. These files may or may not contain the words 'rock' and 'stone': some files might contain only 'rock', some only 'stone', some both, and some neither.
How can I list all files in this directory that contain both 'rock' and 'stone'? These words might not be on the same line so I don't think piping through grep twice would work.
Appreciate any help, I was not able to find a stackoverflow post with this problem so I figured I'd ask.
To search files that match the given two (or more) words at any line anywhere in the file, you may want to try ugrep:
ugrep -F --files -e 'rock' --and -e 'stone' dir
This only matches files that have both rock and stone in them. Lines that contain rock or stone are output, or you can use option -l to just list the files. The -F option searches for fixed strings (like grep -F and fgrep), and --files applies the --and file-wide, which is what you want here instead of applying --and per line. Note that there is more than one pattern in this case, so option -e should be used for each one (as grep also requires).
A shorter form with --bool:
ugrep -F --files --bool 'rock stone' dir
where --bool formulates a Boolean query with space as AND (or use AND).
If you want to search directory dir recursively in subdirectories, use option -r.
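
If ugrep is not available, a plain shell loop can approximate the same file-wide AND by testing each file twice with `grep -q`. This is a different technique from the ugrep commands in the answer, offered only as a minimal sketch; the function name `files_with_both` and the temporary demo files are hypothetical.

```shell
# List regular files directly inside a directory that contain both
# fixed strings, on any lines.
files_with_both() {
  dir=$1 first=$2 second=$3
  for f in "$dir"/*; do
    [ -f "$f" ] || continue
    # -q: report via exit status only; -F: treat the pattern as a literal string.
    if grep -qF -- "$first" "$f" && grep -qF -- "$second" "$f"; then
      printf '%s\n' "$f"
    fi
  done
}

# Demo on throwaway files.
demo_dir=$(mktemp -d)
printf 'rock\nsand\nstone\n' > "$demo_dir/both.txt"
printf 'rock\n' > "$demo_dir/rock_only.txt"
printf 'stone\n' > "$demo_dir/stone_only.txt"
result=$(files_with_both "$demo_dir" rock stone)
printf '%s\n' "$result"
rm -rf "$demo_dir"
```

Only `both.txt` is listed, because the two words are checked against the whole file rather than against a single line.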

Does [:space:] in a grep command not include newlines and carriage returns? [duplicate]

I'm currently writing a simple Bash script. The idea is to use grep to find the lines where a certain pattern is found, within some files. The pattern contains 3 capital letters at the start, followed by 6 digits; so the regex is [A-Z]{3}[0-9]{6}.
However, I need to only include the lines where this pattern is not concatenated with other strings, or in other words, if such a pattern is found, it has to be separated from other strings with spaces.
So if the string which matches the pattern is ABC123456 for example, the line something ABC123456 something should be fine, but somethingABC123456something should fail.
I've extended my regex using the [:space:] character class, like so:
[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]
And this seems to work, except for when the string which matches the pattern is the first or last one in the line.
So, the line something ABC123456 something will match correctly;
The line ABC123456 something won't;
And the line something ABC123456 won't as well.
I believe this has something to do with [:space:] not counting new lines and carriage returns as whitespace characters, even though it should from my understanding. Could anyone spot if I'm doing something wrong here?
A common solution to your problem is to normalize the input so that there is a space before and after each word.
sed 's/^/ /;s/$/ /' file |
grep -oE '[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]'
Your question assumes that the newlines are part of what grep sees, but that is not true (or at least not how grep is commonly implemented). Instead, it reads just the contents of each new line into a memory buffer, and then applies the regular expression to that buffer.
A similar but different solution is to specify beginning of line or space, and correspondingly space or end of line:
grep -oE '(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)' file
but this might not be entirely portable.
You might want to postprocess the results to trim any spaces from the extracted strings, too; but I have already had to guess several things about what you are actually trying to accomplish, so I'll stop here.
(Of course, sed can do everything grep can do, and then some, so perhaps switch to sed or Awk entirely rather than build an elaborate normalization pipeline around grep.)
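
The anchored alternation can be checked against the four sample lines from the question. This is only a sketch: the helper name `find_codes` is made up, it assumes the intended digit class is `[0-9]`, and it prints whole matching lines rather than using `-o`.

```shell
# Keep lines where the code is delimited by whitespace or by the line edges.
find_codes() {
  grep -E '(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)'
}

matches=$(printf '%s\n' \
  'something ABC123456 something' \
  'ABC123456 something' \
  'something ABC123456' \
  'somethingABC123456something' | find_codes)
printf '%s\n' "$matches"
```

The first three lines are kept and the concatenated one is rejected, since grep never sees the newline and `^` and `$` stand in for the missing leading and trailing whitespace.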

Ag / Grep Exact Match Only Search

I am having an issue with using Ag (The Silver Searcher)...
In the docs it says to use -Q for exact match, but I don't understand why it does not work for my purposes. If I type something like ag -Q actions or ag -Q 'actions' into my terminal, it returns all instances of actions, including things like transactions and any other strings that actions is part of.
I have tried a couple other combinations of flags from the docs, including -s and -S, among others, but still I cannot get strictly strings matching just actions to return for me.
I can't get this to work with grep either. Does anyone know how I can get what I need with ag? (or even with grep)...?
Thank you in advance!
That's because ag (and grep) find files that contain something. ag -Q means to interpret the search as an exact literal string, not a fuzzy string or a regex. Okay. But a file that has the word "transactions" in it contains exactly, literally the character sequence actions. Sure, it contains more than that too, but that's not surprising.
Probably you're looking for a word-boundary search, grep '\bactions\b' or ag -w -Q actions (maybe ag -w -Q -s actions). But that is not at all the same thing as "just actions"; it's a specific requirement on the things surrounding "actions" (namely that they be the beginning or end of a line, or non-word characters, i.e. anything other than letters, digits, and underscores). You have to tell the computer what you actually mean.
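
The difference between a literal substring search and a whole-word search can be seen on a few sample lines. The sketch below uses only grep (ag is not assumed to be installed), and the sample text is made up.

```shell
sample='actions
transactions
list of actions here
reactions'

# Literal substring search: counts every line containing the character sequence.
substring=$(printf '%s\n' "$sample" | grep -cF 'actions')

# Whole-word search: -w requires non-word characters (or line edges) on both sides.
whole_word=$(printf '%s\n' "$sample" | grep -wF 'actions')

printf 'substring matches: %s\n' "$substring"
printf '%s\n' "$whole_word"
```

Here the literal search counts all four lines, while `-w` keeps only the two lines in which actions stands alone as a word.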

How to find matching words with a recurring character in a file

This might seem like a question that has already been answered, so pardon me if that's the case, but I can't seem to find a clear answer or explanation of how to find words in a file that contain a specified number of occurrences of a character (e.g. words containing the character '-' three times, such as 'long-and-complex-word').
I'm aware that it is possible to use the command
grep -oE '.{n}'
to find consecutive repetitions of a character, but I'm looking for a way to find occurrences of a character that are not necessarily consecutive.
Here are the commands that I've tried that aren't working:
grep -E '*[-]*[-]*[-]*' file
grep -Ex '* \-* \-* \ -*' file
Thanks.
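
One possible approach, not taken from the thread: put each whitespace-separated word on its own line, then keep the words in which the character occurs exactly three times, with any other characters in between. The helper name and sample text are illustrative, and `-o` is assumed to be available (GNU and BSD grep).

```shell
# Print words that contain exactly three hyphens, adjacent or not.
words_with_three_hyphens() {
  # First grep: one whitespace-separated word per line.
  # Second grep: non-hyphens, then three times "a hyphen plus non-hyphens".
  grep -oE '[^[:space:]]+' | grep -E '^[^-]*(-[^-]*){3}$'
}

result=$(printf '%s\n' 'a long-and-complex-word next to a well-known one and a-b-c-d-e' \
  | words_with_three_hyphens)
printf '%s\n' "$result"
```

Changing `{3}` to `{3,}` would accept three or more occurrences instead of exactly three.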
