Grep: Count the number of times a string occurs if another string does not occur - grep

I have a set of many .json.gz files. In each file, there are entries such as this:
{"type":"e1","public":true, "login":"username1", "org":{"dict","of":"lots_of_things"}}
{"type":"e2","public":true, "login":"username2"}
No matter where "login" appears in each nested dict, I want to detect it and take the username, but only if the key "org" does not appear anywhere in that dict. I also want to count the number of times each username appears across the files.
My final output should be a file of dicts that looks like this:
{'username2': 1}
because of course username1 wouldn't be counted: the key "org" appears in its dict.
I'm looking for something like:
zgrep -Rv "org" . | zgrep -o 'login":"[^"]*"' /path/to/files/* | cut -d'"' -f3 | sort | uniq -c | sed '1i{
s/\s*\([0-9]*\)\s*\(.*\)/"\2": \1,/;$a}' > outputfile.txt
I'm not sure about this part:
zgrep -Rv "org" . |
The rest successfully creates the type of file I'm looking for. I'm just unsure about the order of operations here.
EDIT
I should have been clearer, I apologize. There are also often multiple instances of the key "login" per main dict object. For example (using "k" for any key that is not login and not org, and using "v" for a value):
{"k":"v","k":{"k":{"k":"v","login":"username1"},"k":"v"},"k":{"k":"v","login":"username2"}}
{"k":{"k":"v","k":"v"},"k":{"org":{"k":"v","k":v,"login":"username3"},"k":"v"},"k":{"k":"v","login":"username4"}}
{"k":{"k":"v"},"k":{"k":{"k":"v","login":"username1"},"login":"username2"}}
Since the key org appears in the second dict, I want to exclude usernames 3 and 4 from the dict I make and save to a file.
For example, I want this in a file:
{'username1': 2}
{'username2': 2}

An AWK solution, replacing the recursive -R search with a more reliable find:
find . -type f -name "*.json.gz" -print0 | xargs -0 zgrep -v -h '"org"' | awk '{ if ( match($0,/"login":"[^"]+"/) ) logins[substr($0,RSTART+8,RLENGTH-8)]++; } END { for ( i in logins ) print("{" i ":" logins[i] "}"); }'
Example output:
{"username2":1}

Not grep, but a GNU sed job with a small shell script; your data is in the file 'a':
i=
for e in $(sed -nE '/.*\borg\b.*/!s/.*"login":"(\w+)".*/{\1:}/p' a)
{
let i++;echo ${e/:/:$i}
}
Append '>' and a file name at the end to save the output in a file.
If a tool with better regex support, pcregrep, is installed, it does the job as well:
pcregrep -io '(?!.*\borg\b.*)(?<="login":")\w+(?=".*)' a
It can replace the sed script above, with a slightly adjusted printout.
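If you want real per-username counts rather than a running index, the pcregrep output can also be fed through the sort | uniq -c combination used elsewhere in this thread; a minimal sketch, again assuming the sample data sits uncompressed in a file named 'a':

grep -v '"org"' a |
  pcregrep -o '(?<="login":")\w+(?=")' |
  sort | uniq -c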

This worked:
zgrep -v "org" *.json.gz | zgrep -o 'login":"[^"]*"' | cut -d'"' -f3 | sort | uniq -c | sed '1i{
s/\s*\([0-9]*\)\s*\(.*\)/"\2": \1,/;$a}' > usernames_2011.txt
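Since the original doubt was about the order of operations, here is the same working pipeline spelled out one stage per line; the comments are annotations added here, not part of the answer:

zgrep -v "org" *.json.gz |        # 1. drop every gzipped record that mentions org anywhere
  zgrep -o 'login":"[^"]*"' |     # 2. print each login":"username" fragment on its own line
  cut -d'"' -f3 |                 # 3. keep just the username between the quotes
  sort | uniq -c |                # 4. count occurrences per username
  sed '1i{
s/\s*\([0-9]*\)\s*\(.*\)/"\2": \1,/;$a}' > usernames_2011.txt   # 5. reformat and wrap in { }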

Related

show filename with matching word from grep only

I am trying to find which words occur in log files, and show the log file name for anything that matches the following pattern:
'BA10\|BA20\|BA21\|BA30\|BA31\|BA00'
So if the file dummylogfile.log contains BA10002, I would like to get a result such as:
dummylogfile.log:BA10002
It is totally fine if the logfile shows up twice for duplicate matches.
The closest I got is:
for f in $(find . -name '*.err' -exec grep -l 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' {} \+);do printf $f;printf ':';grep -o 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' $f;done
but this gives things like:
./register-05-14-11-53-59_24154.err:BA10
BA10
./register_mdw_files_2020-05-14-11-54-32_24429.err:BA10
BA10
./process_tables.2020-05-18-11-18-09_11428.err:BA30
./status_load_2020-05-18-11-35-31_9185.err:BA30
So:
1) there are extra lines that contain only the match without the filename, and
2) the full match (e.g., BA10004) is not shown.
Thanks for the help.
There are a couple of options you can pass to grep:
-H: This will report the filename and the match
-o: only show the match, not the full line
-w: The match must represent a full word (a string built from [A-Za-z0-9_])
If we look at your regex, you use patterns like BA01; this will match only the text BA01, which can appear anywhere in the line, even mid-word. If you want the regex to match a full word, it should read BA01[[:alnum:]_]*, which adds any sequence of word-constituent characters (equivalent to [A-Za-z0-9_]). You can test this with:
$ echo "foo BA01234 barBA012" | grep -Ho "BA01"
(standard input):BA01
(standard input):BA01
$ echo "foo BA01234 barBA012" | grep -How "BA01"
$ echo "foo BA01234 barBA012" | grep -How "BA01[[:alnum:]_]*"
(standard input):BA01234
So your grep should look like:
grep -How "\(BA10\|BA20\|BA21\|BA30\|BA31\|BA00\)[[:alnum:]_]*" *.err
From your example it seems that all files are in one directory. So the following works right away:
grep -l 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' *.err
If the files are in different directories:
find . -name '*.err' -print | xargs -I {} grep 'BA10\|BA20\|BA21\|BA30\|BA31\|BA00' {} /dev/null
Explanation: the addition of /dev/null to the filename {} forces grep to report the matching filename
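A quick illustration of the rule behind both answers: grep prints the file name only when it is searching more than one file, which is exactly what the /dev/null trick and GNU grep's -H option guarantee. The file name and match below are taken from the question, purely as an example:

grep -o 'BA10[[:alnum:]_]*' dummylogfile.log             # prints: BA10002
grep -o 'BA10[[:alnum:]_]*' dummylogfile.log /dev/null   # prints: dummylogfile.log:BA10002
grep -Ho 'BA10[[:alnum:]_]*' dummylogfile.log            # prints: dummylogfile.log:BA10002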

grep -o search: stop at first instance of second expression, rather than last? Greedy?

Not sure how to phrase this question.
This is an example line.
30/Oct/2019:00:17:22 +0000|v1|177.95.140.78|www.somewebsite.com|200|162512|-|-|0.000|GET /product/short-velvet-cloak-with-hood/?attribute_pa_color=dark-blue&attribute_pa_accent-color=gold&attribute_pa_size=small HTTP/1.0|0|0|-
I need to extract attribute_pa_color=
So I have
cat somewebsite.access.log.2.csv | grep -o "?.*=" > just-parameters.txt
This works, but if there are multiple parameters in the URL it returns all of them.
So instead of stopping the match at the first instance of "=", it's taking the last instance of "=" in the line.
How can I make it stop at the first?
I tried this
cat somewebsite.access.log.2.csv | grep -o "?(.*?)=" > just-parameters2.txt
cat somewebsite.access.log.2.csv | grep -o "\?(.*?)=" > just-parameters2.txt
Both return nothing
Also, I need each unique parameter, so once I created the file I ran:
sort just-parameters.txt | uniq > clean.txt
Which does not appear to work. Is it possible to remove duplicates and have it be part of the same command?
You can try something like this with awk:
awk -F'[?&]' '{print $2}' somewebsite.access.log.2.csv|sort -u > clean.txt
This will work if attribute_pa_color is the first parameter in the URL.
If you want to extract only the text attribute_pa_color=, you can try something like:
awk -F'[?&]' '{print $2}' somewebsite.access.log.2.csv|awk -F\= '{print $1"="}'|sort -u > clean.txt
Instead of using a second awk you can try something like:
awk -F'[?&]' '{split($2,a,"=");print a[1]"="}' somewebsite.access.log.2.csv|sort -u > clean.txt
This splits internally in awk, using = as the delimiter.
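A toy run of the split() variant on a line shaped like the one in the question (the echoed text is just an illustration):

echo 'GET /product/cloak/?attribute_pa_color=dark-blue&attribute_pa_size=small HTTP/1.0' |
  awk -F'[?&]' '{ split($2, a, "="); print a[1] "=" }'
# prints: attribute_pa_color=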

How to ignore grep if the line starts with # or ;

I need grep to ignore the line if it starts with ; or # when searching for a specific string in a file. file.ini contains the line below:
output_partition_key=FILE_CREATED_DATE
Doing a grep as below returns the value FILE_CREATED_DATE:
grep -w "output_partition_key" file.ini | cut -d= -f2
but if, say, the line starts with ; or #, then it should not return anything:
;output_partition_key=FILE_CREATED_DATE
I tried solutions from other posts but it's not working. Can anyone tell me how to achieve the expected result?
It seems like what you really want is to find lines that start with output_partition_key=. The simplest way to do that is:
grep ^output_partition_key= file.ini | cut -d= -f2
(where ^ means "the start of a line").
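If the key might also appear with leading whitespace, a slightly looser pattern still rejects lines commented out with ; or # (a sketch, not part of the original answer):

grep -E '^[[:space:]]*output_partition_key=' file.ini | cut -d= -f2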

How to sort the output of recursive “grep -lr” chronologically by newest modification date last?

I want to get a list of all files, in the current directory or any subdirectory, containing a certain string sorted by modification date.
I am having trouble getting the answer to
How to sort the output of "grep -l" chronologically by newest modification date last?
to work for the purpose of a recursive grep search. How do I obtain such an ordered list, such that all files that would be found by grep -lr are really included?
Assuming your file names don't contain newlines:
find dir -type f -printf '%T#\t%p\n' | sort | cut -f2- | xargs grep -l whatever
More robustly, using GNU versions of the tools to deal with dir/file names containing exotic characters:
find dir -type f -printf '%T#\t%p\0' | sort -z | cut -z -f2- | xargs -0 grep -l whatever
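An alternative order of operations is to let grep find the matching files first and only then sort them by modification time; a rough sketch with GNU tools (note that if xargs has to split the file list across several ls invocations, each batch is sorted separately):

grep -lrZ whatever dir | xargs -0 -r ls -dtr --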

Search file for usernames, and sort number of instances for each user in file?

I am tasked with taking a file that has line entries that include the string username=xxxx:
$ cat file.txt
Yadayada username=jdoe blablabla
Yadayada username=jdoe blablabla
Yadayada username=jdoe blablabla
Yadayada username=dsmith blablabla
Yadayada username=dsmith blablabla
Yadayada username=sjones blablabla
And finding how many times each user shows up in the file, which I can do manually by feeding it username=jdoe, for example:
$ grep -r "username=jdoe" file.txt | wc -l | tr -d ' '
3
What's the best way to report each user in the file, and the number of lines for each user, sorted from highest to lowest instances:
3 jdoe
2 dsmith
1 sjones
Been thinking of how to approach this, but drawing blanks, figured I'd check with our gurus on this forum. :)
TIA,
Don
In GNU awk:
$ awk '
BEGIN { RS="[ \n]" }                        # treat each whitespace-separated word as a record
/=/ {                                       # only records containing "="
    split($0,a,"=")                         # a[1]="username", a[2]="jdoe"
    u[a[2]]++ }                             # count per user
END {
    PROCINFO["sorted_in"]="#val_num_desc"   # GNU awk: iterate in descending value order
    for(i in u)
        print u[i],i
}' file
3 jdoe
2 dsmith
1 sjones
Using grep:
$ grep -o 'username=[^ ]*' file | cut -d "=" -f 2 | sort | uniq -c | sort -nr
Awk alone:
awk '
{sub(/.*username=/,""); sub(/ .*/,"")}
{a[$0]++}
END {for(i in a) printf "%d\t%s\n",a[i],i | "sort -nr"}
' file.txt
This uses awk's sub() function to achieve what grep -o does in other answers. It embeds the call to sort within the awk script. You could of course use that pipe after the awk script rather than within it if you prefer.
Oh, and unlike the other awk solutions presented here, this one (1) is portable to non-GNU-awk environments (like BSD or macOS) and (2) doesn't depend on the username being in a predictable location on each line (i.e. $2).
Why might awk be a better choice than simpler tools like uniq? It probably wouldn't be, for a requirement as simple as this. But it's good to have in your toolbox if you want something with a little more text-processing capability.
Using sed, uniq, and sort:
sed 's/.*username=\([^ ]*\).*/\1/' file.txt | sort | uniq -c | sort -nr
If there are lines without usernames:
sed -n 's/.*username=\([^ ]*\).*/\1/p' input | sort | uniq -c | sort -nr
$ awk -F'[= ]' '{print $3}' file | sort | uniq -c | sort -nr
3 jdoe
2 dsmith
1 sjones
The following awk may help you with the same too.
awk -F"[ =]" '{a[$3]++} END{for(i in a){print a[i],i | "sort -nr"}}' Input_file
