How to grep out a substring which can change?

Basically I have a very large text file and each line contains
tag=yyyyy;id=xxxxx;db_ref=zzzzz;
What I want is to grep out the id, but the id can change in length and form. I was wondering if it's possible to use grep -o to match "id=" and then extract everything that comes after it up to the semicolon?

You could do:
$ grep -o 'id=[^;]*' file
And if you don't want to include the id= part, you can use a positive look-behind:
$ grep -Po '(?<=id=)[^;]*' file
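For instance, on the sample line from the question (echo used here purely for illustration), the two variants differ only in whether the id= prefix is kept:
$ echo 'tag=yyyyy;id=xxxxx;db_ref=zzzzz;' | grep -o 'id=[^;]*'
id=xxxxx
$ echo 'tag=yyyyy;id=xxxxx;db_ref=zzzzz;' | grep -Po '(?<=id=)[^;]*'
xxxxx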

Try:
grep -Po "(?<=id=)[^;]*" file

Via grep:
grep -o 'id=[^;]*'
Via awk:
awk -F';' '{ print $2}' testlog
id=xxxxx
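Note that the awk version assumes id= is always the second ;-separated field. If the field order can vary, a small sketch like this (a loop over all fields, not part of the original answer) scans every field instead:
awk -F';' '{ for (i = 1; i <= NF; i++) if ($i ~ /^id=/) print $i }' testlog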
Edit: see sudo_O's answer for the look-behind. It's more to the point of your question, IMO.

You could try this awk. It should also work if there are multiple id= entries per line and it would not give a false positive for ...;pid=blabla;...
awk '/^id=/' RS=\; file
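To see how the record splitting behaves, a quick check (sample line taken from the question, written to a scratch file) might look like:
$ printf 'tag=yyyyy;id=xxxxx;db_ref=zzzzz;\n' > file
$ awk '/^id=/' RS=\; file
id=xxxxx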

Try the following:
grep -oP 'id=\K[^;]*' file
Here \K discards everything matched so far (the id= part) from what grep -o reports, so only the value is printed.

perl -lne 'print $1 if(/id=([^\;]*);/)' your_file
Tested:
> echo "tag=yyyyy;id=xxxxx;db_ref=zzzzz; "|perl -lne 'print $1 if(/id=([^\;]*);/)'
xxxxx
>

Related

grep -o search stop at first instance of second expression, rather than last? Greedy?

Not sure how to phrase this question.
This is an example line:
30/Oct/2019:00:17:22 +0000|v1|177.95.140.78|www.somewebsite.com|200|162512|-|-|0.000|GET /product/short-velvet-cloak-with-hood/?attribute_pa_color=dark-blue&attribute_pa_accent-color=gold&attribute_pa_size=small HTTP/1.0|0|0|-
I need to extract attribute_pa_color=
So I have
cat somewebsite.access.log.2.csv | grep -o "?.*=" > just-parameters.txt
Which works, but if there are multiple parameters in the URL it returns all of them.
So instead of stopping the match at the first instance of "=", it's taking the last instance of "=" in the line.
How can I make it stop at the first?
I tried this
cat somewebsite.access.log.2.csv | grep -o "?(.*?)=" > just-parameters2.txt
cat somewebsite.access.log.2.csv | grep -o "\?(.*?)=" > just-parameters2.txt
Both return nothing.
Also, I need each unique parameter, so once I created the file I ran
sort just-parameters.txt | uniq > clean.txt
Which does not appear to work. Is it possible to remove duplicates and have it be part of the same command?
You can try something like this with awk:
awk -F'[?&]' '{print $2}' somewebsite.access.log.2.csv|sort -u > clean.txt
This will work if attribute_pa_color is the first parameter in the URL.
If you want to extract only the text attribute_pa_color=, you can try something like:
awk -F'[?&]' '{print $2}' somewebsite.access.log.2.csv|awk -F\= '{print $1"="}'|sort -u > clean.txt
Instead of using a second awk, you can try something like:
awk -F'[?&]' '{split($2,a,"=");print a[1]"="}' somewebsite.access.log.2.csv|sort -u > clean.txt
This splits the field inside awk, using = as the delimiter.
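As an aside, the non-greedy attempts in the question fail because plain grep -o uses BRE/ERE, which have no non-greedy quantifiers. Where grep -P (PCRE) is available, a sketch along these lines should stop at the first = instead, by excluding = from the repeated class:
grep -oP '\?\K[^=]*=' somewebsite.access.log.2.csv | sort -u > clean.txt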

Grep or egrep to exclude text in a line

I would like some advice on how to exclude a word in a line using grep but still keep the line.
So I have tried:
grep -v '1.942134' results.tbl | egrep '*.fits' results.tbl
to try to list all the strings with the extension .fits but exclude "1.942134", but it still returns the full lines.
Any advice?
Or you can use awk:
awk '/\.fits/ && !/1\.942134/' results.tbl
PS: you should escape the . in both sed and awk, or else it will match any character.
You should pipe to sed. Sed has lots of abilities, some of them more complicated than others, but one of its best is regexp substitution.
grep '\.fits$' results.tbl | sed 's/1\.942134//'
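For example, with an assumed sample line (the question does not show one), this keeps the line but strips the number:
$ echo 'flux=1.942134;file=image.fits' | grep '\.fits$' | sed 's/1\.942134//'
flux=;file=image.fits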

using grep to find the value of a string

I have a file (file.txt) that contains the following:
aa=testing
bb=hello
cc=hi
Expected result
the value of aa is testing
How do I use grep to find the value of aa?
Use a positive lookbehind in grep:
grep -Po "(?<=aa=).*" file.txt
Output:
testing
grep -oP 'aa=\K.*' file.txt
Output:
testing
See: http://www.charlestonsw.com/perl-regular-expression-k-trick/
awk -F= '/^aa=/ { print $2 }' file
sed -n '/^aa=/s|^.*=||p' file
sed -n 's|^aa=||p' file
Output:
testing
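If the goal is literally the sentence shown under "Expected result", a small awk sketch (built from the sample file, not part of the original answers) would be:
awk -F= '/^aa=/ { print "the value of " $1 " is " $2 }' file.txt
Output:
the value of aa is testing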

Simple Grep Issue

I am trying to parse items out of a file I have. I can't figure out how to do this with grep.
Here is the syntax:
<FQDN>Compname.dom.domain.com</FQDN>
<FQDN>Compname1.dom.domain.com</FQDN>
<FQDN>Compname2.dom.domain.com</FQDN>
I want to spit out just the bits between the > and the <.
Can anyone assist?
Thanks
grep can do some text extraction; however, I'm not sure if this is what you want:
grep -Po "(?<=>)[^<]*"
Test:
kent$ echo "<FQDN>Compname.dom.domain.com</FQDN>
<FQDN>Compname1.dom.domain.com</FQDN>
<FQDN>Compname2.dom.domain.com</FQDN>"|grep -Po "(?<=>)[^<]*"
Compname.dom.domain.com
Compname1.dom.domain.com
Compname2.dom.domain.com
Grep isn't what you are looking for.
Try sed with a regular expression: http://unixhelp.ed.ac.uk/CGI/man-cgi?sed
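A minimal sed sketch along those lines (not from the linked page; it assumes each FQDN element sits on its own line, as in the question) would be:
sed -n 's/.*<FQDN>\(.*\)<\/FQDN>.*/\1/p' FILE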
You can do it like you want with grep:
grep -oP '<FQDN>\K[^<]+' FILE
Output:
Compname.dom.domain.com
Compname1.dom.domain.com
Compname2.dom.domain.com
As others have said, grep is not the ideal tool for this. However:
$ echo '<FQDN>Compname.dom.domain.com</FQDN>' | egrep -io '[a-z]+\.[^<]+'
Compname.dom.domain.com
Remember that grep's purpose is to MATCH things. The -o option shows you what it matched. In order to make regex conditions that are not part of the expression that is returned, you'd need to use lookahead or lookbehind, which most command-line grep does not support because it's part of PCRE rather than ERE.
$ echo '<FQDN>Compname.dom.domain.com</FQDN>' | grep -Po '(?<=>)[^<]+'
Compname.dom.domain.com
The -P option will work in most Linux environments, but not in *BSD or OSX or Solaris, etc.

Use grep to report back only line numbers

I have a file that possibly contains bad formatting (in this case, the occurrence of the pattern \\backslash). I would like to use grep to return only the line numbers where this occurs (as in, the match was here, go to line # x and fix it).
However, there doesn't seem to be a way to print the line number (grep -n) and not the match or line itself.
I can use another regex to extract the line numbers, but I want to make sure grep cannot do it by itself. grep -no comes closest, I think, but still displays the match.
Try:
grep -n "text to find" file.ext | cut -f1 -d:
If you're open to using AWK:
awk '/textstring/ {print FNR}' textfile
In this case, FNR is the line number. AWK is a great tool when you're looking at grep|cut, or any time you're looking to take grep output and manipulate it.
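For instance (file contents assumed for the demo):
$ printf 'one\ntwo\nneedle here\nfour\n' > textfile
$ awk '/needle/ {print FNR}' textfile
3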
All of these answers require grep to generate the entire matching lines, then pipe it to another program. If your lines are very long, it might be more efficient to use just sed to output the line numbers:
sed -n '/pattern/=' filename
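For instance, with an assumed test file:
$ printf 'one\ntwo\nneedle here\nfour\n' > filename
$ sed -n '/needle/=' filename
3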
A Bash version (note that if the pattern matches more than once, this keeps only the first line number):
lineno=$(grep -n "pattern" filename)
lineno=${lineno%%:*}
I recommend the answers with sed and awk for just getting the line number, rather than using grep to get the entire matching line and then removing that from the output with cut or another tool. For completeness, you can also use Perl:
perl -nE 'say $. if /pattern/' filename
or Ruby:
ruby -ne 'puts $. if /pattern/' filename
Using only grep:
grep -n "text to find" file.ext | grep -Po '^[^:]+'
The line number is the first field before the colon:
grep -n "text to find" file.txt | cut -f1 -d:
To count the number of lines that match the pattern:
grep -n "Pattern" in_file.ext | wc -l
To extract the matched pattern:
sed -n '/pattern/p' file.ext
To display the line numbers on which the pattern was matched:
grep -n "pattern" file.ext | cut -f1 -d:
