Cutting a length of specific string with grep - grep

Let's say we have a string "test123" in a text file.
How do we cut out "test12" only or let's say there is other garbage behind "test123" such as test123x19853 and we want to cut out "test123x"?
I tried with grep -a "test123.\{1,4\}" testasd.txt and so on, but just can't get it right.
I also looked for example, but never found what I'm looking for.

expr:
kent$ x="test123x19853"
kent$ echo $(expr "$x" : '\(test.\{1,4\}\)')
test123x

What you need is -o which print out matched things only:
$ echo "test123x19853"|grep -o "test.\{1,4\}"
test123x
$ echo "test123x19853"|grep -oP "test.{1,4}"
test123x
-o, --only-matching show only the part of a line matching PATTERN

If you are ok with awkthen try following(not this will look for continuous occurrences of alphabets and then continuous occurrences of digits, didn't limit it to 4 or 5).
echo "test123x19853" | awk 'match($0,/[a-zA-Z]+[0-9]+/){print substr($0,RSTART,RLENGTH)}'
In case you want to look for only 1 to 4 digits after 1st continuous occurrence of alphabets then try following(my awk is old version so using --re-interval you could remove it in case you have latest version of ittoo).
echo "test123x19853" | awk --re-interval 'match($0,/[a-zA-Z]+[0-9]{1,4}/){print substr($0,RSTART,RLENGTH)}'

Related

how to avoid lookbehind assertion is not fixed length

I have a file that contains a version number that I need to output. This version number is apart of a string in this file, that looks something like this:
https://some-link:1234/path/to/file/name-of-file/1.2.345/name-of-file_CXP123456-1.2.345.jar"
I need to get the version number, which is 1.2.345.
This grep command works: grep -Po '(?<=/name-of-file_CXP123456-/)\d.\d.\d\d\d'. However, the CXP number changes and as such I thought I could do something like this: grep -Po '(?<=/name-of-file_*-/)\d.\d.\d\d\d' but that gives the following:
grep: lookbehind assertion is not fixed length
Is there anything I can add to the grep statement to avoid this?
Ultimately, this is part of a stage in Jenkins to get this version number. The sh command looks something like this:
VERSION = sh 'ssh -tt user#ip-address "cat dir/file*.content | grep -Po '(?<=/name-of-file_*-/)\d.\d.\d\d\d' 1>&2"'
You can use
grep -Po '/name-of-file_.*-\K\d+(?:\.\d+)+'
See the regex demo. Details:
/name-of-file_ - a literal text
.* - any zero or more chars other than line break chars as many as possible
- - a hyphen
\K - a match reset operator that omits all text matched so far from the memory buffer
\d+ - one or more digits
(?:\.\d+)+ - one or more sequences of a . and one or more digits.
You don't need lookbehind for this job. You also don't need PCREs, or grep at all.
#!/usr/bin/env bash
# ^^^^- bash, *not* sh
case $BASH_VERSION in '') echo "ERROR: bash required" >&2; exit 1;; esac
string="https://some-link:1234/path/to/file/name-of-file/1.2.345/name-of-file_CXP123456-1.2.345.jar"
regex='.*/name-of-file_CXP[[:digit:]]+-([[:digit:].]+)[.]jar'
if [[ $string =~ $regex ]]; then
echo "Version is ${BASH_REMATCH[1]}"
else
echo "No version found in $string"
fi
Maybe too long for a comment... It looks like the version number is the 2nd-to last field if you split on forward slash?
rev | cut -d/ -f 2 | rev
awk -F/ '{print $(NF-1)}'
perl -lanF/ -e 'print $F[-2]'
Or even something like: basename $(dirname $(cat filename))
For those that are really desperate there is another solution which requires you to pre-build your regex string.
It's not a solution I would recommend but if there is really no other way no one can stop you.
While even with this you won't have true dynamic look-behinds and it is still quite limited it is an option available to you.
The idea is to build the look-behind for each possible length you need it to be.
So for example only match if it's not preceded by a # (0 to a 100 characters look-behind).
reg='';
for ((i = 0 ; i <= 100 ; i++)); do reg+='(?<!#.{'"${i}"'})'; done;
reg+='someVariableName=.*?($|;|\\n)';
grep --perl-regexp "$reg" /usr/local/mgmsbox/msc/scripts/msc.cfg
This might not be the best example but it gets the idea across.
This solution has it's own pitfalls. For example you need to double escape \\ escape-sequences like \n and any character that should not be interpreted should be put in a single-quote string (or use printf).

print the rest of input along with matching line

I am new to linux and I am experimenting with basic terminal commands. I found out that I can list all users using compgen -u but what if I only want to display the bottom line outputs ?
Ok lets say the output of compgen -u goes like this:
extra
extra
extra
extra
extra
extra
extra
extra
extra
John
William
Kate
Harold
I can only use grep to find a single text (ex. compgen -u | grep John). But what if I want to use grep to display John as well as all the remaining entries after it ?
sed or awk solution would be easier, but if you can only use grep, then the option --after-context (or -A) might do:
grep -A 5 John file
The drawback is that you need to know the number of lines to display after the matching (or use an arbitrary big number for the rest of the file).
compgen -u | grep -A$(compgen -u| wc -l) John
Explanation:
From man grep
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines. Places a line containing a group separator (described under --group-separator) between
contiguous groups of matches.
grep -A -- print number of rows after pattern
$() -- Execute unix command
compgen -u| wc -l --> Get total number of rows of output of command.
You can use the following one-liner :
n=$( compgen -u | grep -n John | head -1 | cut -d ":" -f 1 ) && compgen -u | tail -n +$n
This finds out the line number for first occurrence of John, and prints everything starting that line.

Grep: Capture just number

I am trying to use grep to just capture a number in a string but I am having difficulty.
echo "There are <strong>54</strong> cities | grep -o "([0-9]+)"
How am I suppose to just have it return "54"? I have tried the above grep command and it doesn't work.
echo "You have <strong>54</strong>" | grep -o '[0-9]' seems to sort of work but it prints
5
4
instead of 54
Don't parse HTML with regex, use a proper parser :
$ echo "There are <strong>54</strong> cities " |
xmllint --html --xpath '//strong/text()' -
OUTPUT:
54
Check RegEx match open tags except XHTML self-contained tags
You need to use the "E" option for extended regex support (or use egrep). On my Mac OSX:
$ echo "There are <strong>54</strong> cities" | grep -Eo "[0-9]+"
54
You also need to think if there are going to be more than one occurrence of numbers in the line. What should be the behavior then?
EDIT 1: since you have now specified the requirement to be a number between <strong> tags, I would recommend using sed. On my platform, grep does not have the "P" option for perl style regexes. On my other box, the version of grep specifies that this is an experimental feature so I would go with sed in this case.
$ echo "There are <strong>54</strong> 12 cities" | sed -rn 's/^.*<strong>\s*([0-9]+)\s*<\/strong>.*$/\1/p'
54
Here "r" is for extended regex.
EDIT 2: If you have the "PCRE" option in your version of grep, you could also utilize the following with positive lookbehinds and lookaheads.
$ echo "There are <strong>54 </strong> 12 cities" | grep -o -P "(?<=<strong>)\s*([0-9]+)\s*(?=<\/strong>)"
54
RegEx Demo

Use grep to report back only line numbers

I have a file that possibly contains bad formatting (in this case, the occurrence of the pattern \\backslash). I would like to use grep to return only the line numbers where this occurs (as in, the match was here, go to line # x and fix it).
However, there doesn't seem to be a way to print the line number (grep -n) and not the match or line itself.
I can use another regex to extract the line numbers, but I want to make sure grep cannot do it by itself. grep -no comes closest, I think, but still displays the match.
try:
grep -n "text to find" file.ext | cut -f1 -d:
If you're open to using AWK:
awk '/textstring/ {print FNR}' textfile
In this case, FNR is the line number. AWK is a great tool when you're looking at grep|cut, or any time you're looking to take grep output and manipulate it.
All of these answers require grep to generate the entire matching lines, then pipe it to another program. If your lines are very long, it might be more efficient to use just sed to output the line numbers:
sed -n '/pattern/=' filename
Bash version
lineno=$(grep -n "pattern" filename)
lineno=${lineno%%:*}
I recommend the answers with sed and awk for just getting the line number, rather than using grep to get the entire matching line and then removing that from the output with cut or another tool. For completeness, you can also use Perl:
perl -nE 'say $. if /pattern/' filename
or Ruby:
ruby -ne 'puts $. if /pattern/' filename
using only grep:
grep -n "text to find" file.ext | grep -Po '^[^:]+'
You're going to want the second field after the colon, not the first.
grep -n "text to find" file.txt | cut -f2 -d:
To count the number of lines matched the pattern:
grep -n "Pattern" in_file.ext | wc -l
To extract matched pattern
sed -n '/pattern/p' file.est
To display line numbers on which pattern was matched
grep -n "pattern" file.ext | cut -f1 -d:

Can grep show only words that match search pattern?

Is there a way to make grep output "words" from files that match the search expression?
If I want to find all the instances of, say, "th" in a number of files, I can do:
grep "th" *
but the output will be something like (bold is by me);
some-text-file : the cat sat on the mat
some-other-text-file : the quick brown fox
yet-another-text-file : i hope this explains it thoroughly
What I want it to output, using the same search, is:
the
the
the
this
thoroughly
Is this possible using grep? Or using another combination of tools?
Try grep -o:
grep -oh "\w*th\w*" *
Edit: matching from Phil's comment.
From the docs:
-h, --no-filename
Suppress the prefixing of file names on output. This is the default
when there is only one file (or only standard input) to search.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
Cross distribution safe answer (including windows minGW?)
grep -h "[[:alpha:]]*th[[:alpha:]]*" 'filename' | tr ' ' '\n' | grep -h "[[:alpha:]]*th[[:alpha:]]*"
If you're using older versions of grep (like 2.4.2) which do not include the -o option, then use the above. Else use the simpler to maintain version below.
Linux cross distribution safe answer
grep -oh "[[:alpha:]]*th[[:alpha:]]*" 'filename'
To summarize: -oh outputs the regular expression matches to the file content (and not its filename), just like how you would expect a regular expression to work in vim/etc... What word or regular expression you would be searching for then, is up to you! As long as you remain with POSIX and not perl syntax (refer below)
More from the manual for grep
-o Print each match, but only the match, not the entire line.
-h Never print filename headers (i.e. filenames) with output lines.
-w The expression is searched for as a word (as if surrounded by
`[[:<:]]' and `[[:>:]]';
The reason why the original answer does not work for everyone
The usage of \w varies from platform to platform, as it's an extended "perl" syntax. As such, those grep installations that are limited to work with POSIX character classes use [[:alpha:]] and not its perl equivalent of \w. See the Wikipedia page on regular expression for more
Ultimately, the POSIX answer above will be a lot more reliable regardless of platform (being the original) for grep
As for support of grep without -o option, the first grep outputs the relevant lines, the tr splits the spaces to new lines, the final grep filters only for the respective lines.
(PS: I know most platforms by now would have been patched for \w.... but there are always those that lag behind)
Credit for the "-o" workaround from #AdamRosenfield answer
It's more simple than you think. Try this:
egrep -wo 'th.[a-z]*' filename.txt #### (Case Sensitive)
egrep -iwo 'th.[a-z]*' filename.txt ### (Case Insensitive)
Where,
egrep: Grep will work with extended regular expression.
w : Matches only word/words instead of substring.
o : Display only matched pattern instead of whole line.
i : If u want to ignore case sensitivity.
You could translate spaces to newlines and then grep, e.g.:
cat * | tr ' ' '\n' | grep th
Just awk, no need combination of tools.
# awk '{for(i=1;i<=NF;i++){if($i~/^th/){print $i}}}' file
the
the
the
this
thoroughly
grep command for only matching and perl
grep -o -P 'th.*? ' filename
I was unsatisfied with awk's hard to remember syntax but I liked the idea of using one utility to do this.
It seems like ack (or ack-grep if you use Ubuntu) can do this easily:
# ack-grep -ho "\bth.*?\b" *
the
the
the
this
thoroughly
If you omit the -h flag you get:
# ack-grep -o "\bth.*?\b" *
some-other-text-file
1:the
some-text-file
1:the
the
yet-another-text-file
1:this
thoroughly
As a bonus, you can use the --output flag to do this for more complex searches with just about the easiest syntax I've found:
# echo "bug: 1, id: 5, time: 12/27/2010" > test-file
# ack-grep -ho "bug: (\d*), id: (\d*), time: (.*)" --output '$1, $2, $3' test-file
1, 5, 12/27/2010
cat *-text-file | grep -Eio "th[a-z]+"
You can also try pcregrep. There is also a -w option in grep, but in some cases it doesn't work as expected.
From Wikipedia:
cat fruitlist.txt
apple
apples
pineapple
apple-
apple-fruit
fruit-apple
grep -w apple fruitlist.txt
apple
apple-
apple-fruit
fruit-apple
I had a similar problem, looking for grep/pattern regex and the "matched pattern found" as output.
At the end I used egrep (same regex on grep -e or -G didn't give me the same result of egrep) with the option -o
so, I think that could be something similar to (I'm NOT a regex Master) :
egrep -o "the*|this{1}|thoroughly{1}" filename
To search all the words with start with "icon-" the following command works perfect. I am using Ack here which is similar to grep but with better options and nice formatting.
ack -oh --type=html "\w*icon-\w*" | sort | uniq
You could pipe your grep output into Perl like this:
grep "th" * | perl -n -e'while(/(\w*th\w*)/g) {print "$1\n"}'
grep --color -o -E "Begin.{0,}?End" file.txt
? - Match as few as possible until the End
Tested on macos terminal
$ grep -w
Excerpt from grep man page:
-w: Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character.
ripgrep
Here are the example using ripgrep:
rg -o "(\w+)?th(\w+)?"
It'll match all words matching th.

Resources