How does grep match directory URLs?

I have an .xml file that stores a large number of URLs, for example:
<url><loc>http://www.example.com/</loc><changefreq>daily</changefreq></url>
<url><loc>http://www.example.com/en/rWGpqHtU/</loc><changefreq>daily</changefreq></url>
<url><loc>http://www.example.com/de/hVHaViPm/</loc><changefreq>daily</changefreq></url>
<url><loc>http://www.example.com/uk/ysbqqLRj/</loc><changefreq>daily</changefreq></url>
<url><loc>http://www.example.com/jp/EUvnikfR/</loc><changefreq>daily</changefreq></url>
<url><loc>http://www.example.com/hk/UqGauZTv/</loc><changefreq>daily</changefreq></url>
How do I match only the URLs that start with http://www.example.com/uk/ or http://www.example.com/hk/?
This is what I've tried so far, but it matches all URLs.
cat sitemap.xml | grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*"
Thank you!

You may use
grep -Eo 'https?://www\.example\.com/[uh]k/[^<]*'
Here, -E enables POSIX ERE syntax, and the pattern matches:
https?:// - http:// or https://
www\.example\.com/ - www.example.com/
[uh]k/ - uk/ or hk/
[^<]* - 0 or more chars other than <.
See the online demo:
#!/bin/bash
xml='<url><loc>http://www.example.com/</loc><changefreq>daily</changefreq></url>
<url><loc>http://www.example.com/en/rWGpqHtU/</loc><changefreq>daily</changefreq></url>
<url><loc>http://www.example.com/de/hVHaViPm/</loc><changefreq>daily</changefreq></url>
<url><loc>http://www.example.com/uk/ysbqqLRj/</loc><changefreq>daily</changefreq></url>
<url><loc>http://www.example.com/jp/EUvnikfR/</loc><changefreq>daily</changefreq></url>
<url><loc>http://www.example.com/hk/UqGauZTv/</loc><changefreq>daily</changefreq></url>'
grep -Eo "https?://www\.example\.com/[uh]k/[^<]*" <<< "$xml"
Output:
http://www.example.com/uk/ysbqqLRj/
http://www.example.com/hk/UqGauZTv/
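If the wanted directories do not happen to share a letter pattern like [uh]k, an alternation group may be easier to maintain. The following is only a sketch under assumptions: the sample data is a shortened copy of the sitemap above, and the extra jp directory is added purely for illustration.

```shell
#!/bin/sh
# Sample data modelled on the sitemap in the question (shortened).
xml='<url><loc>http://www.example.com/en/rWGpqHtU/</loc></url>
<url><loc>http://www.example.com/uk/ysbqqLRj/</loc></url>
<url><loc>http://www.example.com/jp/EUvnikfR/</loc></url>
<url><loc>http://www.example.com/hk/UqGauZTv/</loc></url>'

# (uk|hk|jp) names each wanted directory explicitly;
# [^<]* then runs up to, but not including, the closing </loc> tag.
matches=$(printf '%s\n' "$xml" |
  grep -Eo 'https?://www\.example\.com/(uk|hk|jp)/[^<]*')
printf '%s\n' "$matches"
```

Adding or removing a directory then only means editing the list inside the parentheses.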

Related

show filename with matching word from grep only

I am trying to find which words occur in log files, and show the log file name, for anything that matches the following pattern:
'BA10\|BA20\|BA21\|BA30\|BA31\|BA00'
so if file dummylogfile.log contains BA10002 I would like to get a result such as:
dummylogfile.log:BA10002
it is totally fine if the logfile shows up twice for duplicate matches.
the closest I got is:
for f in $(find . -name '*.err' -exec grep -l 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' {} \+);do printf $f;printf ':';grep -o 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' $f;done
but this gives things like:
./register-05-14-11-53-59_24154.err:BA10
BA10
./register_mdw_files_2020-05-14-11-54-32_24429.err:BA10
BA10
./process_tables.2020-05-18-11-18-09_11428.err:BA30
./status_load_2020-05-18-11-35-31_9185.err:BA30
so,
1) there are empty lines with only the second match and
2) the full match (e.g., BA10004) is not shown.
thanks for the help

There are a few options you can pass to grep:
-H: print the filename with each match
-o: show only the match, not the full line
-w: the match must form a whole word (a string built from [A-Za-z0-9_])
A pattern such as BA01 matches only the four characters BA01, which can appear anywhere in the text, including mid-word. If you want the regex to match a full word, it should read BA01[[:alnum:]_]*, which appends any sequence of word-constituent characters (equivalent to [A-Za-z0-9_]). You can test this with
$ echo "foo BA01234 barBA012" | grep -Ho "BA01"
(standard input):BA01
(standard input):BA01
$ echo "foo BA01234 barBA012" | grep -How "BA01"
$ echo "foo BA01234 barBA012" | grep -How "BA01[[:alnum:]_]*"
(standard input):BA01234
So your grep should look like
grep -How "\(BA10\|BA20\|BA21\|BA30\|BA31\|BA00\)[[:alnum:]_]*" *.err
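To see the effect of -H, -o and -w together, here is a small self-contained sketch. It assumes a grep that supports -o and -H (GNU and BSD grep do), uses -E so the alternation needs no backslashes, and the file names and log lines are made up for the demonstration.

```shell
#!/bin/sh
# Hypothetical log files created in a scratch directory.
tmp=$(mktemp -d)
printf 'start\nerror BA10002 raised\nxBA20 ignored\n' > "$tmp/dummylogfile.err"
printf 'all good\n' > "$tmp/clean.err"

# -H: prefix each match with its filename
# -o: print only the matched text
# -w: the match must be a whole word, so "xBA20" is rejected
result=$(cd "$tmp" &&
  grep -HowE '(BA10|BA20|BA21|BA30|BA31|BA00)[[:alnum:]_]*' *.err)
printf '%s\n' "$result"
rm -rf "$tmp"
```

The single output line pairs the file name with the full code, which is the filename:match shape the question asks for.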

From your example it seems that all files are in one directory. So the following works right away:
grep -l 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' *.err
If the files are in different directories:
find . -name '*.err' -print | xargs -I {} grep 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' {} /dev/null
Explanation: passing /dev/null as a second file argument after {} forces grep to report the matching filename, because grep prints filenames whenever it is given more than one file.
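For files spread over nested directories, another option is to let find call grep directly and ask for the filename with -H. This is a sketch, not the answerer's method: it assumes a grep with -H and -o (GNU or BSD), and the directory layout and log contents are invented for the example.

```shell
#!/bin/sh
# Hypothetical nested layout in a scratch directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/b"
printf 'BA30007 failed\n' > "$tmp/a/b/load.err"
printf 'BA99 unrelated\n' > "$tmp/a/skip.err"

# find passes every *.err file to grep; -H keeps the filename in the
# output even if grep happens to receive only one file.
result=$(cd "$tmp" &&
  find . -name '*.err' -exec \
    grep -HowE '(BA10|BA20|BA21|BA30|BA31|BA00)[[:alnum:]_]*' {} +)
printf '%s\n' "$result"
rm -rf "$tmp"
```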

How to grep in one line starting from particular string to end with particular string

I want to grep "[calluid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse"
in the file below
2014-10-15 18:38:32,831 plivo-rest[2781]: INFO: Fetching GET http://*******/outbound_callback.aspx with smscresponse[to]=8912722fsf9&smscresponse[ALegUUID]=5bb516fsd64-546c-11e4-879f-551816a551303677&smscresponse[calluid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse[direction]=outbosund&smscresfdsponse[endreason]=UNALLOCATED_NUMBER&smscresponse[from]=83339995896999&smscresponse[starttime]=0&smscresponse[ALegRequestUUID]=5bb4bafc-546c-11e4-891d-000c29ec6e41&smscresponse[RequestUUID]=5bb4bafc-546c-11e4-891d-000c29ec6e41&smscresponse[callstatus]=completed&smscresponse[endtime]=1413378509&smscresponse[ScheduledHangupId]=5bb4c15a-546c-11e4-891d-000c29ec6e41&smscresponse[event]=missed_call_hangup
I used this command
$ grep -oP '(calluid).*$'
but it matches up to the end of the line.
I also tried
$ grep -oP '(calluid).{40}'
which fetches 40 characters, but I have thousands of calluid values, each with a different number of characters.
Please guide me on how to grep the exact calluid data.

Use a lookahead to force the regex engine to match only up to a specific character or boundary.
$ grep -oP '\[calluid\][^\]\[]*(?=\[|$)' file
[calluid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse
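If only the identifier itself is wanted, without the [calluid]= prefix or the trailing &smscresponse, a sed substitution is one alternative that does not need -P. This is a sketch using a different tool than the answer above; it assumes each value ends at the next & and that the key appears once per line. The sample line is a shortened copy of the log entry.

```shell
#!/bin/sh
# Shortened copy of the log line from the question.
line='INFO: Fetching GET with smscresponse[to]=8912722fsf9&smscresponse[calluid]=aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse[direction]=outbosund'

# \[calluid]= matches the literal key; \([^&]*\) captures everything
# up to the next "&"; -n plus the p flag prints only lines that matched.
calluid=$(printf '%s\n' "$line" |
  sed -n 's/.*\[calluid]=\([^&]*\).*/\1/p')
printf '%s\n' "$calluid"
```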

Here is a GNU awk version (GNU awk is needed because RS contains more than one character):
awk -v RS="[[]calluid[]]=" -F[ 'NR==2 {print $1}' file
aab01b055-89e3-49f3-839e-507bb128d07e&smscresponse
You can also set RS like this: RS="\\\[calluid]="

grep mistaking pattern for file?

cat file.txt | grep -x "\d*"
grep: \Documents and Settings: Is a directory
I want to search file.txt for any lines that are numbers only but grep seems to be viewing \d* as a wildcard for files and not the pattern. How can I specify that it's the pattern and it should use stdin for what to grep over?
The file is full of lines of datetime stamps, some end with a letter, some don't.
20140110122200
20131208041510M
...
I'm trying to only get the lines that don't end in a letter.
EDIT: I've also tried setting the filename instead of piping it with cat. Not much different.
C:\long\path>grep -ex "\d*" -f file.txt
grep: \Dell: Is a directory
grep: \Documents and Settings: Is a directory

Why are you using cat to pass the file to grep? Why not just give grep the filename directly?
grep -x '\d*' file.txt
I think the actual problem you're seeing is that the * wildcard is being expanded. That's why grep is giving you errors that mention actual directories (beginning with 'd') on your system.
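Separately from the quoting issue, \d is a Perl-style escape that grep's default basic syntax does not define, so a bracket expression is a safer way to say "digits only". The sketch below assumes a POSIX-style shell rather than the Windows prompt shown in the question; the two sample lines are taken from the question, and the temporary file stands in for file.txt.

```shell
#!/bin/sh
# Stand-in for file.txt with the two sample lines from the question.
tmp=$(mktemp)
printf '20140110122200\n20131208041510M\n' > "$tmp"

# -x requires the whole line to match; [0-9][0-9]* means one or more
# digits, so empty lines and lines ending in a letter are excluded.
digits_only=$(grep -x '[0-9][0-9]*' "$tmp")
printf '%s\n' "$digits_only"
rm -f "$tmp"
```

Single quotes keep the shell from touching the pattern, and because the pattern contains no backslash it also avoids the path-like \d that the error messages point at.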

Simple Grep Issue

I am trying to parse items out of a file I have. I can't figure out how to do this with grep.
here is the syntax
<FQDN>Compname.dom.domain.com</FQDN>
<FQDN>Compname1.dom.domain.com</FQDN>
<FQDN>Compname2.dom.domain.com</FQDN>
I want to spit out just the bits between the > and the <
can anyone assist?
Thanks

grep can do some text extraction, though I'm not sure this is what you want:
grep -Po "(?<=>)[^<]*"
test
kent$ echo "<FQDN>Compname.dom.domain.com</FQDN>
dquote>
dquote> <FQDN>Compname1.dom.domain.com</FQDN>
dquote>
dquote> <FQDN>Compname2.dom.domain.com</FQDN>"|grep -Po "(?<=>)[^<]*"
Compname.dom.domain.com
Compname1.dom.domain.com
Compname2.dom.domain.com

Grep isn't what you are looking for.
Try sed with a regular expression : http://unixhelp.ed.ac.uk/CGI/man-cgi?sed

You can do what you want with grep:
grep -oP '<FQDN>\K[^<]+' FILE
Output:
Compname.dom.domain.com
Compname1.dom.domain.com
Compname2.dom.domain.com

As others have said, grep is not the ideal tool for this. However:
$ echo '<FQDN>Compname.dom.domain.com</FQDN>' | egrep -io '[a-z]+\.[^<]+'
Compname.dom.domain.com
Remember that grep's purpose is to MATCH things. The -o option shows you what it matched. In order to make regex conditions that are not part of the expression that is returned, you'd need to use lookahead or lookbehind, which most command-line grep does not support because it's part of PCRE rather than ERE.
$ echo '<FQDN>Compname.dom.domain.com</FQDN>' | grep -Po '(?<=>)[^<]+'
Compname.dom.domain.com
The -P option will work in most Linux environments, but not in *BSD or OSX or Solaris, etc.
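Where -P is unavailable, a sed substitution with a capture group is one portable way to get the same text. This is a sketch of the sed approach suggested in another answer, not a command taken from it; it assumes each line holds at most one FQDN element.

```shell
#!/bin/sh
# Sample input copied from the question.
input='<FQDN>Compname.dom.domain.com</FQDN>
<FQDN>Compname1.dom.domain.com</FQDN>
<FQDN>Compname2.dom.domain.com</FQDN>'

# \([^<]*\) captures the text between the tags; -n with the p flag
# prints only lines where the substitution happened.
names=$(printf '%s\n' "$input" |
  sed -n 's/.*<FQDN>\([^<]*\)<\/FQDN>.*/\1/p')
printf '%s\n' "$names"
```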

Can grep show only words that match search pattern?

Is there a way to make grep output "words" from files that match the search expression?
If I want to find all the instances of, say, "th" in a number of files, I can do:
grep "th" *
but the output will be something like (bold is by me);
some-text-file : the cat sat on the mat
some-other-text-file : the quick brown fox
yet-another-text-file : i hope this explains it thoroughly
What I want it to output, using the same search, is:
the
the
the
this
thoroughly
Is this possible using grep? Or using another combination of tools?

Try grep -o:
grep -oh "\w*th\w*" *
Edit: matching from Phil's comment.
From the docs:
-h, --no-filename
Suppress the prefixing of file names on output. This is the default
when there is only one file (or only standard input) to search.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.

Cross distribution safe answer (including windows minGW?)
grep -h "[[:alpha:]]*th[[:alpha:]]*" 'filename' | tr ' ' '\n' | grep -h "[[:alpha:]]*th[[:alpha:]]*"
If you're using an older version of grep (such as 2.4.2) that does not include the -o option, use the above. Otherwise, use the simpler, easier-to-maintain version below.
Linux cross distribution safe answer
grep -oh "[[:alpha:]]*th[[:alpha:]]*" 'filename'
To summarize: -oh outputs the parts of the file content that match the regular expression (and not the filename), just as you would expect a regular expression to work in vim and similar tools. Which word or regular expression you search for is up to you, as long as you stay with POSIX and not Perl syntax (see below).
More from the manual for grep
-o Print each match, but only the match, not the entire line.
-h Never print filename headers (i.e. filenames) with output lines.
-w The expression is searched for as a word (as if surrounded by
`[[:<:]]' and `[[:>:]]';
The reason why the original answer does not work for everyone
Support for \w varies from platform to platform, as it is an extended "Perl" syntax. grep installations that are limited to POSIX character classes need [[:alpha:]] instead (strictly speaking, \w also covers digits and the underscore, i.e. [[:alnum:]_]). See the Wikipedia page on regular expressions for more.
Ultimately, the POSIX answer above will be a lot more reliable for grep regardless of platform (being the original syntax).
As for grep without the -o option: the first grep outputs the relevant lines, tr turns the spaces into newlines, and the final grep keeps only the words that match.
(PS: I know most platforms by now would have been patched for \w, but there are always those that lag behind.)
Credit for the "-o" workaround goes to @AdamRosenfield's answer.
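The POSIX variant can be tried on the three sample lines from the question. This is a sketch that assumes a grep with -o and -h (GNU or BSD); it uses [[:alnum:]_] rather than [[:alpha:]] so that the class covers the same characters as \w, and the files are created in a scratch directory.

```shell
#!/bin/sh
# Recreate the three sample files from the question.
tmp=$(mktemp -d)
printf 'the cat sat on the mat\n' > "$tmp/some-text-file"
printf 'the quick brown fox\n' > "$tmp/some-other-text-file"
printf 'i hope this explains it thoroughly\n' > "$tmp/yet-another-text-file"

# -o prints each match on its own line, -h drops the filename prefix;
# [[:alnum:]_]* on both sides extends the match to the whole word.
words=$(cd "$tmp" && grep -oh '[[:alnum:]_]*th[[:alnum:]_]*' *)
printf '%s\n' "$words"
rm -rf "$tmp"
```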

It's more simple than you think. Try this:
egrep -wo 'th.[a-z]*' filename.txt #### (Case Sensitive)
egrep -iwo 'th.[a-z]*' filename.txt ### (Case Insensitive)
Where:
egrep: grep with extended regular expressions.
-w: matches only whole words instead of substrings.
-o: displays only the matched pattern instead of the whole line.
-i: ignores case.

You could translate spaces to newlines and then grep, e.g.:
cat * | tr ' ' '\n' | grep th

Just awk; no need for a combination of tools.
# awk '{for(i=1;i<=NF;i++){if($i~/^th/){print $i}}}' file
the
the
the
this
thoroughly

grep with only-matching output and Perl syntax:
grep -o -P 'th.*? ' filename

I was unsatisfied with awk's hard to remember syntax but I liked the idea of using one utility to do this.
It seems like ack (or ack-grep if you use Ubuntu) can do this easily:
# ack-grep -ho "\bth.*?\b" *
the
the
the
this
thoroughly
If you omit the -h flag you get:
# ack-grep -o "\bth.*?\b" *
some-other-text-file
1:the
some-text-file
1:the
the
yet-another-text-file
1:this
thoroughly
As a bonus, you can use the --output flag to do this for more complex searches with just about the easiest syntax I've found:
# echo "bug: 1, id: 5, time: 12/27/2010" > test-file
# ack-grep -ho "bug: (\d*), id: (\d*), time: (.*)" --output '$1, $2, $3' test-file
1, 5, 12/27/2010

cat *-text-file | grep -Eio "th[a-z]+"

You can also try pcregrep. There is also a -w option in grep, but in some cases it doesn't work as expected.
From Wikipedia:
cat fruitlist.txt
apple
apples
pineapple
apple-
apple-fruit
fruit-apple
grep -w apple fruitlist.txt
apple
apple-
apple-fruit
fruit-apple
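The output above follows from the hyphen counting as a non-word character, so apple- still contains apple as a whole word. If the intent is to keep only lines that consist of exactly apple, -x (whole-line match) is stricter. A small sketch using the same fruit list:

```shell
#!/bin/sh
# Same list as fruitlist.txt above.
fruits='apple
apples
pineapple
apple-
apple-fruit
fruit-apple'

# -w: "apple" must be delimited by non-word characters (or line edges).
whole_word=$(printf '%s\n' "$fruits" | grep -w apple)
# -x: the entire line must be "apple".
whole_line=$(printf '%s\n' "$fruits" | grep -x apple)

printf 'with -w:\n%s\n' "$whole_word"
printf 'with -x:\n%s\n' "$whole_line"
```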

I had a similar problem: I needed a grep regex that prints only the matched pattern as output.
In the end I used egrep with the -o option (the same regex with grep -e or -G didn't give me the same result as egrep).
So I think it could be something like this (I'm NOT a regex master):
egrep -o "the*|this{1}|thoroughly{1}" filename

To search for all the words that start with "icon-", the following command works perfectly. I am using ack here, which is similar to grep but with better options and nicer formatting.
ack -oh --type=html "\w*icon-\w*" | sort | uniq

You could pipe your grep output into Perl like this:
grep "th" * | perl -n -e'while(/(\w*th\w*)/g) {print "$1\n"}'

grep --color -o -E "Begin.{0,}?End" file.txt
? - Match as few as possible until the End
Tested on macos terminal

$ grep -w
Excerpt from grep man page:
-w: Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character.

ripgrep
Here is an example using ripgrep:
rg -o "(\w+)?th(\w+)?"
It'll match all words containing th.
