Grep {n}: "The preceding item is matched exactly n times" is not clear to me in the case of "Hair", "Haair" and "Haaair"

Suppose there are three strings, "Hair", "Haair" and "Haaair". When I use grep -E '^Ha{1}', it returns all three words. I was expecting only "Hair", since I asked for lines that start with H followed by the letter 'a' exactly once.

grep does not check that the whole input line matches the given search expression. It finds substrings of the input that match the search.
See:
grep test <<< This is a test.
The input does not exactly match test. Only part of the input matches,
This is a test.
but that is enough for grep to output the whole line.
Similarly, when you say
grep -E '^Ha{1}' <<< Haaair
The input does not exactly match the search, but a part of it does,
Haaair
and that is enough. Note that the {n,m} syntax is purely a convenience: Ha{1} is exactly equivalent to Ha, Ha{3,} is Haaa+, and Ha{2,5} is Haa(a?){3}, which is Haaa?a?a?. In other words, {1} does not mean "exactly once and no more"; it just means "once", and says nothing about what may follow the match.
What you want to do is match a Ha that is not followed by another a. You have two options:
If your grep supports PCRE, you can use a negative lookahead:
grep -P '^Ha(?!a)'
(?!a) is a zero-length assertion, like ^. It doesn't match any characters; it simply causes the match to fail if there is an a after the first one.
Or, you can keep it simple and use a negative []:
grep -E '^Ha([^a]|$)'
Where [^a] matches any single character that is not a, and the alternation with $ handles the case of no character at all.
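The difference between the two patterns can be checked on the three words from the question. This is a minimal sketch: the variable names are made up for illustration, and it pipes printf output into grep instead of using a here-string so that it also runs in a plain POSIX shell.

```shell
# The three words from the question, one per line.
words=$(printf '%s\n' Hair Haair Haaair)

# Ha{1} is just Ha, so it is found at the start of every line.
loose=$(printf '%s\n' "$words" | grep -E '^Ha{1}')

# Require a non-"a" character (or the end of the line) right after the first "a".
strict=$(printf '%s\n' "$words" | grep -E '^Ha([^a]|$)')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

Only the second pattern narrows the result down to Hair.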

Related

select only a word that is part of colon

I have a text file using markup language (similar to wikipedia articles)
cat test.txt
This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.
I need to select the word "having:" only because that is part of regular text. I tried
grep -v '[*:*]' test.txt
This will correctly avoid the tags, but does not select the expected word.
The square brackets specify a character class, so your regular expression looks for any occurrence of one of the characters * or : (or *, but we said that already, didn't we?)
grep has the option -o to only print the matching text, so something like
grep -ow '[^[:space:]]*:[^[:space:]]*' file.txt
would extract any text with a colon in it, surrounded by zero or more non-whitespace characters on each side. The -w option adds the condition that the match needs to be between word boundaries.
However, if you want to restrict in which context you want to match the text, you will probably need to switch to a more capable tool than plain grep. For example, you could use sed to preprocess each line to remove any bracketed text, and then look for matches in the remaining text.
sed -e 's/\[.*]//g' -e 's/ [^: ]*$/ /' -e 's/[^: ]* //g' -e 's/ /\n/' file.txt
(This assumes that your sed recognizes \n in the replacement string as a literal newline. There are simple workarounds available if it doesn't, but let's not go there if it's not necessary.)
In brief, we first remove any text between square brackets. (This needs to be improved if your input could contain multiple sequences of square brackets on a line with normal text between them. Your example only shows nested square brackets, but my approach is probably too simple for either case.) Then, we remove any words which don't contain a colon, with a special provision for the last word on the line, and some subsequent cleanup. Finally, we replace any remaining spaces with newlines, and (implicitly) print whatever is left. (This still ends up printing one newline too many, but that is easy to fix up later.)
Alternatively, we could use sed to remove any bracketed expressions, then use grep on the remaining tokens.
sed -e :a -e 's/\[[^][]*\]//' -e ta file.txt |
grep -ow '[^[:space:]]*:[^[:space:]]*'
The :a creates a label a and ta says to jump back to that label and try again if the regex matched. This one also demonstrates how to handle nested and repeated brackets. (I suppose it could be refactored into the previous attempt, so we could avoid the pipe to grep. But outlining different solution models is also useful here, I suppose.)
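To see what the bracket-stripping loop buys us, the sketch below runs the plain grep and the sed-plus-grep pipeline on the sample text from the question. The temporary file and the variable names are assumptions made for the illustration, and the sed expression is the one shown above (tested mentally against GNU sed behavior).

```shell
# Sample input from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.
EOF

# Without preprocessing, tokens inside the brackets are reported as well.
plain=$(grep -ow '[^[:space:]]*:[^[:space:]]*' "$tmp")

# Strip innermost bracket pairs until none are left, then search.
filtered=$(sed -e :a -e 's/\[[^][]*\]//' -e ta "$tmp" |
    grep -ow '[^[:space:]]*:[^[:space:]]*')

rm -f "$tmp"
printf 'plain:\n%s\n\nfiltered:\n%s\n' "$plain" "$filtered"
```

The plain search also reports the token from inside the double brackets, while the filtered one leaves only having:.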
If you wanted to ensure that there is at least one non-colon character adjacent to the colon, you could do something like
... file.txt |
grep -owE '[^:[:space:]]+:[^[:space:]]*|[^[:space:]]*:[^: [:space:]]+'
where the -E option selects a slightly more modern regex dialect which allows us to use | between alternatives and + for one or more repetitions. (The original grep of the early 1970s did not have these features at all; much later, the POSIX standard grafted them on with a slightly wacky syntax which requires you to backslash them to remove the literal meaning and select the metacharacter behavior... but let's not go there.)
Notice also how [^:[:space:]] matches a single character which is not a colon or a whitespace character, where [:space:] is the (slightly arcane) special POSIX named character class which matches any whitespace character (regular space, horizontal tab, vertical tab, possibly Unicode whitespace characters, depending on locale).
Awk easily lets you iterate over the tokens on a line. The requirement to ignore matches within square brackets complicates matters somewhat; you could keep a separate variable to keep track of whether you are inside brackets or not.
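awk '{ for(i=1; i<=NF; ++i) {
    # a token with a closing bracket ends the bracketed region
    if($i ~ /\]/) { brackets=0; continue }
    if($i ~ /\[/) brackets=1
    # skip tokens inside brackets; continue (not next) keeps scanning this line
    if(brackets) continue
    if($i ~ /:/) print $i } }' file.txt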
This again hard-codes some perhaps incorrect assumptions about how the brackets can be placed. It will behave unexpectedly if a single token contains a closing square bracket followed by an opening one, and has an oversimplified treatment of nested brackets (the first closing bracket after a series of opening brackets will effectively assume we are no longer inside brackets).
A combined solution using sed and awk:
sed 's/ /\n/g' test.txt | gawk 'i==0 && $0~/:$/{ print $0 }/\[/{ i++} /\]/ {i--}'
sed will change all spaces to a newline
awk (or gawk) will output all lines matching $0~/:$/, as long as i equals zero
The last part of the awk stuff keeps a count of the opening and closing brackets.
Another solution using sed and grep:
sed -r -e 's/\[.*\]+//g' -e 's/ /\n/g' test.txt | grep ':$'
's/\[.*\]+//g' will filter the stuff between brackets
's/ /\n/g' will replace a space with a newline
grep will only find lines ending with :
A third one using only awk:
gawk '{ for (t=1;t<=NF;t++){
if(i==0 && $t~/:$/) print $t;
i=i+gsub(/\[/,"",$t)-gsub(/\]/,"",$t) }}' test.txt
gsub returns the number of replacements.
The variable i counts the bracket nesting level: it is incremented by 1 on every [ and decremented by 1 on every ]. This works because gsub(/\[/,"",$t) returns the number of replacements it made. For a token like [[][ the count increases by (3-1=) 2. When a token has brackets AND a colon my code will fail, because a token that ends with : is tested before its brackets are counted.

grep for path in process(ps) containing number

I would like to grep for process path which has a variable. Example -
This is one of the processes running.
/var/www/vhosts/rcsdfg/psd_folr/rcerr-m-deve-udf-172/bin/magt queue:consumers:start customer.import_proditns --single-thread --max-messages=1000
I would like to grep for "psd_folr/rcerr-m-deve-udf-172/bin/magt queue" from the running processes.
The catch is that the number 172 keeps changing, but it will be a 3 digit number only. Please suggest, I tried below but it is not returning any output.
sudo ps axu | grep "psd_folr/rcerr-m-deve-udf-'^[0-9]$'/bin/magt queue"
The most relevant section of your regular expression is -'^[0-9]$'/, which has the following problems:
the apostrophes have no special meaning to grep; it simply tries to match literal apostrophes
the caret ^ matches the beginning of a line, but there is no beginning of a line in ps's output at this place
the dollar $ matches the end of a line, but there is no end of a line in ps's output at this place
you want to read 3 digits but [0-9] will only match a single one
Thus, the part of your expression should be modified like this -[0-9]+/ to match any number of digits (+ matches the preceding character any number of times but at least once) or like this -[0-9]{3}/ to match exactly three times ({n} matches the preceding character exactly n times).
If you alter your command, give grep the -E flag so it uses extended regular expressions, otherwise you need to escape the plus or the braces:
sudo ps axu | grep -E "psd_folr/rcerr-m-deve-udf-[0-9]+/bin/magt queue"
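The {3} variant can be tried without a running process by feeding grep a captured line. In this sketch the two sample lines are stand-ins for ps output (the second one, with a two-digit number, is invented for contrast), and the variable names are hypothetical.

```shell
# Stand-ins for lines of `ps axu` output; the second is an invented counterexample.
good='/var/www/vhosts/rcsdfg/psd_folr/rcerr-m-deve-udf-172/bin/magt queue:consumers:start'
bad='/var/www/vhosts/rcsdfg/psd_folr/rcerr-m-deve-udf-17/bin/magt queue:consumers:start'

pattern='psd_folr/rcerr-m-deve-udf-[0-9]{3}/bin/magt queue'

# grep -c prints the number of matching lines; "|| true" keeps a zero count
# from being treated as a failure when the script runs under "set -e".
good_count=$(printf '%s\n' "$good" | grep -cE "$pattern" || true)
bad_count=$(printf '%s\n' "$bad" | grep -cE "$pattern" || true)

printf 'three digits: %s, two digits: %s\n' "$good_count" "$bad_count"
```

Because the pattern has a literal - before the digits and a literal / after them, {3} really does pin the number to three digits here.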

Getting only grep exact matches

I am trying to grep a file for the exact occurrence of a match, but I get also longer spurious matches:
grep CAT1717O99 myfile.txt -F -w
Output:
CAT1717O99
CAT1717O99.5
I would like to output only the first exactly matching line. Is there any way to get rid of the second line?
Thanks in advance.
Arturo
This is the file 'myfile.txt':
CAT1717O99
CAT1717O99.5
This will do the work for you.
grep -Fx "CAT1717O99" textfile
-F means the pattern is a fixed string, not a regular expression
-x means the pattern must match the whole line exactly
Use the power of Perl-compatible regular expression (PCRE) and search the matches to the given pattern:
grep -Po "\bCAT1717O99(\s|$)" myfile.txt
(\s|$) - alternative group, ensures matching substring CAT1717O99 if it's followed by whitespace or placed at the end of the line
-P option, enables Perl-compatible regular expressions
-o option, prints only matched parts of matching lines
You'll need to explicitly require spaces (or the line boundaries) around the pattern, so that special characters such as "." do not count as word separators.
grep -E '(^| )CAT1717O99( |$)' myFile.txt
from grep manual :
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
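The manual excerpt explains the spurious match: "." is not a word-constituent character, so CAT1717O99 in CAT1717O99.5 still counts as a whole word. A small sketch comparing -w with -x on the two lines from the question (the temporary file and variable names are assumptions for the illustration):

```shell
# Recreate the two-line file from the question.
tmp=$(mktemp)
printf '%s\n' 'CAT1717O99' 'CAT1717O99.5' > "$tmp"

# -w: "." is not a word character, so both lines count as whole-word matches.
word_hits=$(grep -cFw 'CAT1717O99' "$tmp")

# -x: the pattern must match the entire line.
line_hits=$(grep -Fx 'CAT1717O99' "$tmp")

rm -f "$tmp"
printf 'lines matched with -w: %s\nlines printed with -x: %s\n' "$word_hits" "$line_hits"
```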

GREP How do I search for words that contain specific letters (one or more times)?

I'm using the operating system's dictionary file to scan. I'm creating a Java program that lets a user enter any combination of letters and finds words that contain those letters. How would I do this using grep commands?
To find words that contain only the given letters:
grep -v '[^aeiou]' wordlist
The above filters out the lines in wordlist that contain any character other than those listed. It's sort of using a double negative to get what you want. Another way to do this would be:
grep -E '^[aeiou]+$' wordlist
which searches the whole line for a sequence of one or more of the selected letters. (The -E is needed so that + means "one or more"; in a basic regular expression a bare + is a literal plus sign.)
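Both approaches can be compared on a tiny word list. The list below is made up for the illustration, and the anchored variant is run with -E so that + acts as a repetition operator.

```shell
# Made-up word list: two words use only vowels, two do not.
words=$(printf '%s\n' aeiou eau idea queue)

# Drop every line that contains a character outside the set.
by_exclusion=$(printf '%s\n' "$words" | grep -v '[^aeiou]')

# Keep lines that consist of one or more characters from the set.
by_anchors=$(printf '%s\n' "$words" | grep -E '^[aeiou]+$')

printf 'exclusion:\n%s\n\nanchors:\n%s\n' "$by_exclusion" "$by_anchors"
```

One difference worth knowing: an empty line passes the -v filter but not the anchored pattern, since + demands at least one letter.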
To find words that contain all of the given letters is a bit more lengthy, because there may be other letters in between the ones we want:
cat wordlist | grep a | grep e | grep i | grep o | grep u
(Yes, there is a useless use of cat above, but the symmetry is better this way.)
You can use a single grep to solve the last problem in Greg's answer, provided your grep supports PCRE. (Based on this excellent answer, boiled down a bit)
grep -P "(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)" wordlist
The positive lookaheads mean it will match any line with an "a" anywhere, an "e" anywhere, and so on for each letter.
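Since the question is about letters entered by a user, the chain of greps can also be built in a loop, one pass per letter, which avoids depending on PCRE support. This is only a sketch under stated assumptions: contains_all is a hypothetical helper name, the word list is invented, and each letter is searched with grep -F so that characters like . or * are taken literally.

```shell
# contains_all LETTERS FILE
# Print the lines of FILE that contain every character in LETTERS.
# (Hypothetical helper; runs one grep -F pass per letter.)
contains_all() {
    letters=$1
    lines=$(cat "$2")
    while [ -n "$letters" ] && [ -n "$lines" ]; do
        rest=${letters#?}            # everything after the first character
        letter=${letters%"$rest"}    # the first character itself
        lines=$(printf '%s\n' "$lines" | grep -F -e "$letter" || true)
        letters=$rest
    done
    if [ -n "$lines" ]; then
        printf '%s\n' "$lines"
    fi
}

# Made-up word list for illustration.
tmp=$(mktemp)
printf '%s\n' education sequoia apple queue > "$tmp"
all_vowels=$(contains_all aeiou "$tmp")
rm -f "$tmp"
printf '%s\n' "$all_vowels"
```

A Java program could equally pass the letters to such a script or build the lookahead pattern itself; the loop simply mirrors the pipeline of greps shown earlier.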

Confusion in Linux grep command

I have a very basic confusion about grep. Suppose I have the following file to grep in:
test.txt:
This is an article
from some newspaper
Article is good
newspaper is not.
Now if I grep with following expression
grep -P "is\s*g" test.txt
I get the line:
Article is good
However if I do this:
grep -P "is*g" test.txt
I don't get anything. My question is: since the asterisk (*) is a wildcard which represents 0 or more repetitions of the previous character, shouldn't the output of grep be the same? Why is zero or more repetitions of 's' not giving any output?
What am I missing here. Thanks for the help!
Because there's nothing in your input that matches i, then 0 or more repetitions of s, then g. "Article is good" can't match because it has a space after the s, not a g. The pattern is\s*g matches because \s is a special pattern that matches any sort of whitespace — so the overall pattern is is, then any amount of space, then g, which naturally matches "is g".
I see no ig, isg, issg, issssg in your input...
Since I don't know what you wanted to match, here is my best guess:
grep -P "is.*g" test.txt
You should read up on regular expressions before you use grep; you will also find them useful with other commands: http://www.regular-expressions.info/
It's 0 or more repetitions of the previous regex atom, and in is\s*g that atom is \s. So \s* can match tab-space-tab-space-space.
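The point about which atom the star applies to can be checked on the four lines from the question. As an assumption for portability, this sketch uses -E with the POSIX class [[:space:]] in place of -P and \s; the variable names are made up.

```shell
# The contents of test.txt from the question.
text=$(printf '%s\n' 'This is an article' 'from some newspaper' \
    'Article is good' 'newspaper is not.')

# is*g: "i", then zero or more "s", then "g" immediately after.
# "|| true" keeps a zero count from aborting a script run under "set -e".
star_on_s=$(printf '%s\n' "$text" | grep -cE 'is*g' || true)

# is[[:space:]]*g: here the star repeats the whitespace class instead.
star_on_space=$(printf '%s\n' "$text" | grep -E 'is[[:space:]]*g')

printf 'lines matching is*g: %s\nlines matching is[[:space:]]*g: %s\n' \
    "$star_on_s" "$star_on_space"
```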

Resources