AWK remove query params from URL - parsing

I have an access.log file with >1M lines. An example line:
113.10.154.38 - - [27/May/2016:03:36:26 +0200] "POST /index.php?option=com_jce&task=plugin&plugin=imgmanager&file=imgmanager&method=form&cid=20&6bc427c8a7981f4fe1f5ac65c1246b5f=cf6dd3cf1923c950586d0dd595c8e20b HTTP/1.1" 200 22 "-" "BOT/0.1 (BOT for JCE)" "-"
I need to parse the log lines to count the 10 most common URLs, BUT I need to remove the query params from each URL. Without removing query params, I wrote this code:
awk '{print $7}' test.log | sort | uniq -c | sort -rn | \
head | awk '{print NR,"\b. URL:", $2,"\n Requests:", $1}'
But I don't know how to remove the query params and count the top 10 most common URLs without them, to get a clean top of requests.

Use the sub() function to remove a pattern from a string.
You need to do this while you're extracting the field, so that the sort and uniq count already see the stripped URLs:
awk '{sub(/\?.*/, "", $7); print $7}' test.log | sort | uniq -c | sort -rn | ...
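Putting it together with the formatting step from the question, the full pipeline might look like this (a sketch, reusing the same test.log name; NR"." replaces the \b backspace trick):
# strip the query string from $7 before counting, then format the top 10
awk '{sub(/\?.*/, "", $7); print $7}' test.log | sort | uniq -c | sort -rn | \
head | awk '{print NR". URL:", $2, "\n Requests:", $1}'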

grep -o search: stop at first instance of second expression, rather than last? Greedy?

Not sure how to phrase this question.
This is an example line:
30/Oct/2019:00:17:22 +0000|v1|177.95.140.78|www.somewebsite.com|200|162512|-|-|0.000|GET /product/short-velvet-cloak-with-hood/?attribute_pa_color=dark-blue&attribute_pa_accent-color=gold&attribute_pa_size=small HTTP/1.0|0|0|-
I need to extract attribute_pa_color=
So I have
cat somewebsite.access.log.2.csv | grep -o "?.*=" > just-parameters.txt
This works, but if there are multiple parameters in the URL it returns all of them.
So instead of stopping the match at the first instance of "=", it's taking the last instance of "=" in the line.
How can I make it stop at the first?
I tried this
cat somewebsite.access.log.2.csv | grep -o "?(.*?)=" > just-parameters2.txt
cat somewebsite.access.log.2.csv | grep -o "\?(.*?)=" > just-parameters2.txt
Both return nothing
Also, I need each unique parameter, so once I created the file I ran
sort just-parameters.txt | uniq > clean.txt
This does not appear to work. Is it possible to remove duplicates and have that be part of the same command?
You can try something like this with awk:
awk -F'[?&]' '{print $2}' somewebsite.access.log.2.csv|sort -u > clean.txt
This will work if attribute_pa_color is the first parameter in the URL.
If you want to extract only the text attribute_pa_color=, you can try something like:
awk -F'[?&]' '{print $2}' somewebsite.access.log.2.csv|awk -F\= '{print $1"="}'|sort -u > clean.txt
Instead of using a second awk, you can try something like:
awk -F'[?&]' '{split($2,a,"=");print a[1]"="}' somewebsite.access.log.2.csv|sort -u > clean.txt
This splits internally in awk, using = as the delimiter (note the delimiter must be quoted).
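To make the match itself stop at the first "=", as the question asks: GNU grep's -P flag enables Perl-compatible regexes with non-greedy quantifiers, and sort -u folds the dedup step into the same pipeline. A sketch, assuming GNU grep; the second variant avoids PCRE with a negated character class:
# lazy .*? stops at the first '=' after the '?'
grep -oP '\?.*?=' somewebsite.access.log.2.csv | sort -u > clean.txt
# portable alternative: [^=]* cannot run past an '='
grep -o '?[^=]*=' somewebsite.access.log.2.csv | sort -u > clean.txt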

illegal string body character after dollar sign: shell script from a Groovy function

I'm trying to run a shell script from a Groovy function loaded by a Jenkins pipeline, to retrieve a zip file from an external location. I am building the address in the function and passing it into the shell script via $, but I am getting a syntax error and I'm not sure why.
I've tried escaping the $, but I don't think that's the correct approach here, and my code has been converted from triple single quotes (''') to triple double quotes (""") so I can pass the variable in.
def DownloadBaseLineFromNexus(groupID, artifactID){
//add code for this method
def nexusLink = "${GetNexusLink()}/${GetNexusProdRepo()}/${groupID}/${artifactID}/"
sh """
# retrieving all available version from release repo to versionFile.xml
curl ${nexusLink} | grep "<a href=.*</a>" | grep "http" | cut -d'>' -f3 |cut -d'/' -f1 > versionFile.xml
# creating array from versionFile.xml
fileItemString=$(cat versionFile.xml |tr "\n" " ")
fileItemArray=($fileItemString)
# Finding maximum of array element
maxValue=`printf "%d\n" "${fileItemArray[#]}" | sort -rn | head -1`
# Download latest version artifact from nexus
curl -o ${(artifactID)}.zip ${(nexusLink)}/${(artifactID)}-$maxValue.zip
# Unzip the tool
unzip ${(artifactID)}.zip
"""
}
The results I get are:
Script1.groovy: 28: illegal string body character after dollar sign;
solution: either escape a literal dollar sign "\$5" or bracket the value expression "${5}" # line 28, column 22.
curl "${nexusLink}" | grep "" | grep "http" | cut -d'>' -f3 |cut -d'/' -f1 > versionFile.xml
You have to add escape characters, like below:
curl ${nexusLink} | grep \"<a href=.*</a>\" | grep \"http\" | cut -d'>' -f3 |cut -d'/' -f1 > versionFile.xml
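Note that the error itself comes from the dollar signs: inside a Groovy """ string, $( and bare $var are parsed as Groovy interpolation, which is exactly what produces "illegal string body character after dollar sign". Escaping the shell's own substitutions with \$ while leaving the Groovy interpolations (${nexusLink}, ${artifactID}) alone should clear it. A sketch under those assumptions (same helper functions, [#] read as [@], and the stray parentheses in ${(artifactID)} dropped):
def DownloadBaseLineFromNexus(groupID, artifactID) {
    def nexusLink = "${GetNexusLink()}/${GetNexusProdRepo()}/${groupID}/${artifactID}/"
    sh """
        # retrieving all available versions from the release repo into versionFile.xml
        curl ${nexusLink} | grep "<a href=.*</a>" | grep "http" | cut -d'>' -f3 | cut -d'/' -f1 > versionFile.xml
        # creating an array from versionFile.xml; \$ is escaped so the shell, not Groovy, expands these
        fileItemString=\$(cat versionFile.xml | tr "\\n" " ")
        fileItemArray=(\$fileItemString)
        # finding the maximum array element
        maxValue=\$(printf "%d\\n" "\${fileItemArray[@]}" | sort -rn | head -1)
        # downloading the latest version artifact from nexus and unzipping it
        curl -o ${artifactID}.zip ${nexusLink}/${artifactID}-\$maxValue.zip
        unzip ${artifactID}.zip
    """
}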

How to ignore grep if the line starts with # or ;

I need grep to skip a line if it starts with ; or #, when searching for a specific string in a file. file.ini contains the line below:
output_partition_key=FILE_CREATED_DATE
Doing a grep as below returns the value FILE_CREATED_DATE:
grep -w "output_partition_key" file.ini | cut -d= -f2
But if the line starts with ; or #, then it should not match anything:
;output_partition_key=FILE_CREATED_DATE
I tried solutions from other posts but they're not working. Can anyone tell me how to achieve the expected result?
It seems like what you really want is to find lines that start with output_partition_key=. The simplest way to do that is:
grep ^output_partition_key= file.ini | cut -d= -f2
(where ^ means "the start of a line").
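If you want to express the "skip lines starting with ; or #" requirement literally (say, because the key might not always sit at the start of the line), filtering the comment lines out first also works. A sketch:
# drop comment lines, then extract the value as before
grep -v '^[;#]' file.ini | grep -w 'output_partition_key' | cut -d= -f2
# the same idea in a single awk call
awk -F= '!/^[;#]/ && $1 == "output_partition_key" {print $2}' file.ini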

Grep: Count the number of times a string occurs if another string does not occur

I have a set of many .json.gz files. In each file, there are entries such as this:
{"type":"e1","public":true, "login":"username1", "org":{"dict","of":"lots_of_things"}}
{"type":"e2","public":true, "login":"username2"}
No matter where in each nested dict "login" appears, I want to be able to detect it and take the username, only if the key "org" does not exist anywhere in the nested dict. I also want to count the number of times each username appears in the files.
My final output should be a file of dicts that looks like this:
{'username2': 1}
because of course username1 wouldn't be counted: the key "org" appears in its dict.
I'm looking for something like:
zgrep -Rv "org" . | zgrep -o 'login":"[^"]*"' /path/to/files/* | cut -d'"' -f3 | sort | uniq -c | sed '1i{
s/\s*\([0-9]*\)\s*\(.*\)/"\2": \1,/;$a}' > outputfile.txt
I'm not sure about this part:
zgrep -Rv "org" . |
The rest successfully creates the type of file I'm looking for. I'm just unsure about the order of operations here.
EDIT
I should have been more clear, I apologize. There are also often multiple instances of the key "login" per main dict object. For example (using "k" for any key that is not login and not org, and using "v" for a value):
{"k":"v","k":{"k":{"k":"v","login":"username1"},"k":"v"},"k":{"k":"v","login":"username2"}}
{"k":{"k":"v","k":"v"},"k":{"org":{"k":"v","k":v,"login":"username3"},"k":"v"},"k":{"k":"v","login":"username4"}}
{"k":{"k":"v"},"k":{"k":{"k":"v","login":"username1"},"login":"username2"}}
Since the key org appears in the second dict, I want to exclude usernames 3 and 4 from the dict I make and save to a file.
For example, I want this in a file:
{'username1': 2}
{'username2': 2}
An AWK solution, replacing the -R recursion with a more reliable find:
find . -type f -name "*.json.gz" -print0 | xargs -0 zgrep -v -h '"org"' | awk '
{ if ( match($0,/"login":"[^"]+"/) ) logins[substr($0,RSTART+8,RLENGTH-8)]++; }
END { for ( i in logins ) print("{" i ":" logins[i] "}"); }'
Example output:
{"username2":1}
Not grep, but a GNU sed job with a script; your data is in file 'a':
i=
for e in $(sed -nE '/.*\borg\b.*/!s/.*"login":"(\w+)".*/{\1:}/p' a)
{
  let i++; echo ${e/:/:$i}
}
Use '>' at the end to save to a file.
If the better-regex pcregrep is installed, it does the job as well:
pcregrep -io '(?!.*\borg\b.*)(?<="login":")\w+(?=".*)' a
Replace the sed script above with it; the printout is adjusted a bit.
This worked:
zgrep -v "org" *.json.gz | zgrep -o 'login":"[^"]*"' | cut -d'"' -f3 | sort | uniq -c | sed '1i{
s/\s*\([0-9]*\)\s*\(.*\)/"\2": \1,/;$a}' > usernames_2011.txt

Search file for usernames, and sort number of instances for each user in file?

I am tasked with taking a file that has line entries that include the string username=xxxx:
$ cat file.txt
Yadayada username=jdoe blablabla
Yadayada username=jdoe blablabla
Yadayada username=jdoe blablabla
Yadayada username=dsmith blablabla
Yadayada username=dsmith blablabla
Yadayada username=sjones blablabla
And finding how many times each user in the file shows up, which I can do manually by feeding username=jdoe for example:
$ grep -r "username=jdoe" file.txt | wc -l | tr -d ' '
3
What's the best way to report each user in the file, and the number of lines for each user, sorted from highest to lowest instances:
3 jdoe
2 dsmith
1 sjones
Been thinking of how to approach this, but drawing blanks, figured I'd check with our gurus on this forum. :)
TIA,
Don
In GNU awk:
$ awk '
BEGIN { RS="[ \n]" }
/=/ {
split($0,a,"=")
u[a[2]]++ }
END {
PROCINFO["sorted_in"]="#val_num_desc"
for(i in u)
print u[i],i
}' file
3 jdoe
2 dsmith
1 sjones
Using grep:
$ grep -o 'username=[^ ]*' file | cut -d "=" -f 2 | sort | uniq -c | sort -nr
Awk alone:
awk '
{sub(/.*username=/,""); sub(/ .*/,"")}
{a[$0]++}
END {for(i in a) printf "%d\t%s\n",a[i],i | "sort -nr"}
' file.txt
This uses awk's sub() function to achieve what grep -o does in other answers. It embeds the call to sort within the awk script. You could of course use that pipe after the awk script rather than within it if you prefer.
Oh, and unlike the other awk solutions presented here, this one (1) is portable to non-GNU-awk environments (like BSD and macOS), and (2) doesn't depend on the username being in a predictable location on each line (i.e. $2).
Why might awk be a better choice than simpler tools like uniq? It probably wouldn't be, for a super-simple requirement like this. But it's good to have in your toolbox when you want a little more text-processing capability.
Using sed, uniq, and sort:
sed 's/.*username=\([^ ]*\).*/\1/' file.txt | sort | uniq -c | sort -nr
If there are lines without usernames:
sed -n 's/.*username=\([^ ]*\).*/\1/p' input | sort | uniq -c | sort -nr
$ awk -F'[= ]' '{print $3}' file | sort | uniq -c | sort -nr
3 jdoe
2 dsmith
1 sjones
The following awk may help you with the same too:
awk -F"[ =]" '{a[$3]++} END{for(i in a){print a[i],i | "sort -nr"}}' Input_file
