Count text matches in files based on unique ID

I have a file which has the following format:
LINK|Grouping_Indicator|ID_Dat|HASH_Akey|HASH_HUKey|
FALSE|75768163|XY100|c5157cba1b5f20|817f8b50bc9
FALSE|75768409|XY102|9f3de314a224f2|b686e4760f5
TRUE|75769393|XY1005|ce0a50207cc86c|f9233c0b8e7
TRUE|75769885|XY1012|ce0a50207cc86c|ef9eb8ea13f
TRUE|75723124|XY1111|df0q45677ee89v|gt8qc9fb24g
I am trying to count the TRUE rows, counting each distinct HASH_Akey value only once.
I've managed to count the total number of TRUE rows with the following command:
grep -c "TRUE" file.psv
However, I am unsure how to count the TRUE rows where the HASH_Akey is unique.
So the count for the table above should be 2.
Thanks

I would do it with awk:
awk -F'|' '$1=="TRUE"{a[$(NF-1)]}END{print length(a)}' file
With your example, the one-liner above prints 2.
You can also do it with:
awk -F'|' '$1=="TRUE"&&!a[$(NF-1)]++' file|wc -l
This line is a little shorter, but it starts another process (wc) to do the counting.
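
Both one-liners rely on HASH_Akey being the second-to-last field of each data row. A minimal sketch that looks the column up by name in the header row instead is shown here; the function name count_unique_true is made up for illustration, and the sketch assumes the first line is a pipe-separated header that contains HASH_Akey.

```shell
# Hypothetical helper: count distinct HASH_Akey values among TRUE rows.
# Assumes line 1 is a pipe-separated header that names the HASH_Akey column.
count_unique_true() {
    awk -F'|' '
        NR == 1 {                       # find the column index by header name
            for (i = 1; i <= NF; i++) if ($i == "HASH_Akey") col = i
            next
        }
        $1 == "TRUE" && !seen[$col]++ { n++ }   # first time this key is seen
        END { print n + 0 }             # n + 0 prints 0 when nothing matched
    ' "$1"
}

# Demo on the sample data from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
LINK|Grouping_Indicator|ID_Dat|HASH_Akey|HASH_HUKey|
FALSE|75768163|XY100|c5157cba1b5f20|817f8b50bc9
FALSE|75768409|XY102|9f3de314a224f2|b686e4760f5
TRUE|75769393|XY1005|ce0a50207cc86c|f9233c0b8e7
TRUE|75769885|XY1012|ce0a50207cc86c|ef9eb8ea13f
TRUE|75723124|XY1111|df0q45677ee89v|gt8qc9fb24g
EOF
count_unique_true "$sample"    # prints 2
rm -f "$sample"
```

If the header has no HASH_Akey column, the sketch falls back to comparing whole lines, so a real script would probably want to check for that case.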

Related

How to use grep command to filter a log file for a specific keyword within particular timestamp?

So
grep "xyz" file.log
will print all the lines containing xyz as a keyword, and
grep "01/APR/2014:16:3[5-9]" file.log
will print the lines within that time range. How can I combine both, i.e. filter for a keyword within a time range?

Just pipe your two greps together:
grep "xyz" file.log | grep "01/APR/2014:16:3[5-9]"
The first grep will parse out all the lines with xyz, the second grep will winnow that list down by the date given. Depending on your data set, reversing the greps could be faster.
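
The two filters can also be applied in a single pass with awk. This is only a sketch and not part of the answer above: filter_log is a hypothetical name, the keyword is matched as a fixed string, and the timestamp is matched as a regular expression.

```shell
# Hypothetical helper: print lines that contain a keyword AND match a timestamp regex.
# $1 = fixed-string keyword, $2 = timestamp regex, $3 = log file ("-" reads stdin)
filter_log() {
    awk -v kw="$1" -v ts="$2" 'index($0, kw) && $0 ~ ts' "$3"
}

# Demo with three made-up log lines; only the first satisfies both conditions.
printf '%s\n' \
    '10.0.0.1 [01/APR/2014:16:36:02] GET /xyz' \
    '10.0.0.2 [01/APR/2014:16:12:45] GET /xyz' \
    '10.0.0.3 [01/APR/2014:16:38:19] GET /abc' |
    filter_log xyz '01/APR/2014:16:3[5-9]' -
```

Note that awk processes backslash escapes in values passed with -v, so a keyword or pattern containing backslashes would need extra care.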

grep for matching 1 to 2 digits in a sequence of numbers

I have the numbers below in a file:
44700101
44700201
44700301
44700401
44700501
44700601
44700701
44700801
44700901
44701001
I want to fetch the numbers above whose 5th and 6th digits, read together as a number, are greater than 5, using grep wildcards.
Something like "grep ....[6-10].. file" should yield the following:
44700601
44700701
44700801
44700901
44701001
Any help will be appreciated. Thanks

gawk (GNU awk) approach:
awk '{split($0,a,"")}int(a[5]a[6])>5' file
gawk allows FS, and the third argument to split(), to be a null string, which splits the input into individual characters.
split($0,a,"") - splits the numeric string into separate digits (filling array a)
int(a[5]a[6])>5 - prints the line if the integer formed by the 5th and 6th digits is greater than 5
grep approach:
grep '^[0-9]\{4\}\([1-9]\|0[6-9]\).*' file
The output (for both approaches):
44700601
44700701
44700801
44700901
44701001
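
The \| alternation in the grep command above is a GNU extension to basic regular expressions. A sketch of the same idea in extended syntax (grep -E) might look like the following; the function name digits_gt5 is hypothetical, and the sample input is regenerated inline so the snippet runs on its own.

```shell
# Hypothetical helper: keep lines whose 5th and 6th digits, read as a
# two-digit number, exceed 5. Either the 5th digit is 1-9 (10..99) or
# the pair is 06..09.
digits_gt5() {
    grep -E '^[0-9]{4}([1-9][0-9]|0[6-9])' "$@"
}

# Prints the five values from 44700601 through 44701001.
printf '%s\n' 44700101 44700201 44700301 44700401 44700501 \
              44700601 44700701 44700801 44700901 44701001 |
    digits_gt5
```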

Just use awk:
$ awk 'substr($0,5,2)+0 > 5' file
44700601
44700701
44700801
44700901
44701001

Only output values within a certain range

I run a command that produces lots of lines in my terminal; the lines are floats.
I only want certain numbers to be output as a line in my terminal.
I know that I can pipe the results to egrep:
| egrep "(369|433|375|368)"
if I want only certain values to appear. But is it possible to only have lines that have a value within ± 50 of 350 (for example) to appear?

grep matches against string tokens, so you have to either:
figure out the right string match for the number range you want (e.g., for 300-400, you might do something like grep -E [34].., with appropriate additional context added to the expression and a number of additional .s equal to your floating-point precision)
convert the number strings to actual numbers in whatever programming language you prefer to use and filter them that way
I'd strongly encourage you to take the second option.

I would go with awk here:
./yourProgram | awk '$1>250 && $1<350'
e.g.
echo -e "12.3\n342.678\n287.99999" | awk '$1>250 && $1<350'
342.678
287.99999
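
The bounds in the command above are 250 and 350, whereas the range in the question (within ± 50 of 350) would be 300 to 400. A sketch that takes the centre and the tolerance as parameters, under the hypothetical name within, could look like this:

```shell
# Hypothetical helper: pass through lines whose first field lies within
# +/- tol of center (bounds inclusive). Reads standard input.
# $1 = center, $2 = tolerance
within() {
    awk -v c="$1" -v t="$2" '$1 + 0 >= c - t && $1 + 0 <= c + t'
}

# Prints 342.678 and 400; the other values fall outside 300..400.
printf '%s\n' 12.3 342.678 287.99999 400 401.5 | within 350 50
```

In practice this would be used as ./yourProgram | within 350 50. Lines whose first field is not numeric compare as 0, so they are dropped unless the range includes 0.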

Grep wildcard of unknown length in between pipes

I'm trying to grep the following string:
Line must start with a 15 and the rest of the string can have any length of numbers between the pipes. There must be nothing in between the last 2 pipes.
"15|155702|0101|1||"
So far I have:
grep "^15|" $CONCAT_FILE_NAME >> "VAS-"$CONCAT_FILE_NAME
I'm having trouble specifying any length of numbers when using [0-9].

You need to escape the |
grep -E '^15\|([[:digit:]]+\|)+\|$'

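This assumes the line starts with 15|, contains five pipes (|) in total, has nothing between the last two pipes, and has any number of digits between each of the first four pipes. Without -E, grep treats | as a literal character, so it must not be escaped (in GNU grep, \| means alternation in a basic regular expression):
grep '^15|[0-9]*|[0-9]*|[0-9]*||$' "$CONCAT_FILE_NAME" >> "VAS-$CONCAT_FILE_NAME"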

Using awk
cat file
15|155702|0101|1||
15|155702|0101|1|test|
16|155702|0101|1||
awk -F\| '/^15/ && !$(NF-1)' file
15|155702|0101|1||
This prints a line only if it starts with 15 and the second-to-last field (with | as the separator) is blank.
To read from your file and append the matches to the output file, this would be:
awk -F\| '/^15/ && !$(NF-1)' "$CONCAT_FILE_NAME" >> "VAS-$CONCAT_FILE_NAME"
Another, shorter regex that would work:
awk '/^15.*\|\|$/' file
This matches all lines that start with 15 and end with ||.
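
The awk commands above check how the line starts and ends, but not that the fields in between contain only digits. A stricter sketch under the requirements stated in the question (first field exactly 15, digits only in the middle fields, nothing between the last two pipes) is given here; match_vas is a made-up name.

```shell
# Hypothetical helper: print lines of the form 15|<digits>|...|<digits>||
# $1 = input file
match_vas() {
    awk -F'|' '
        $1 != "15" || NF < 3        { next }   # first field must be exactly 15
        $(NF-1) != "" || $NF != ""  { next }   # line must end with two pipes
        {
            # every field between the first and the empty tail must be digits
            for (i = 2; i <= NF - 2; i++)
                if ($i !~ /^[0-9]+$/) next
            print
        }
    ' "$1"
}
```

It could be called as match_vas "$CONCAT_FILE_NAME" >> "VAS-$CONCAT_FILE_NAME". Unlike /^15/, the comparison $1 != "15" also rejects lines whose first field merely begins with 15, such as 155|1||.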

How to generate random numbers under OpenWRT?

With a "normal" (I mean "full") Linux distro, this works just fine:
sleep $(echo "$[ ($RANDOM % 9 ) ]")
OK, it waits for 0-8 seconds.
But under OpenWRT (which uses "ash" rather than bash):
$ sleep $(echo "$[ ($RANDOM % 9 ) ]")
sleep: invalid number '$['
$
And here is why:
$ echo "$[ ($RANDOM % 9 ) ]"
$[ ( % 9 ) ]
$
So does anyone have a way to generate random numbers under OpenWRT, so I can pass the result to sleep?
Thank you

You might try something like this:
sleep `head /dev/urandom | tr -dc "0123456789" | head -c1`
Which works on my WhiteRussian OpenWRT router.
I actually don't know if this will always return a number, but when it does, it will always return 0-9, and only 1 digit (you could make it go up to 99 if you made the second head -c2).
Good luck!

You could also use awk:
sleep $(awk 'BEGIN{srand();print int(rand()*9)}')

For some scenarios, this might not yield a sufficient diversity of answers. Another approach is to use /dev/urandom directly (e.g. https://www.2uo.de/myths-about-urandom/):
echo $(hexdump -n 4 -e '"%u"' </dev/urandom)
When using awk, note that awk uses the time of day as the seed when srand() is called without an argument (https://linux.die.net/man/1/awk). This might matter where the time of day is reset (e.g. no battery-backed clock) or synchronised across a fleet (e.g. a group restart).
srand([expr])
Uses expr as a new seed for the random number generator. If no expr is provided, the time of day is used. The return value is the previous seed for the random number generator.
This is confirmed by looking at the source in busybox (https://github.com/mirror/busybox/blob/master/editors/awk.c):
seed = op1 ? (unsigned)L_d : (unsigned)time(NULL);
At least for some versions of OpenWRT, it seems an explicit call to srand() is required to avoid obtaining the same answers repeatedly:
# awk 'BEGIN{print rand(), rand()}'
0 0.345001
# awk 'BEGIN{print rand(), rand()}'
0 0.345001
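
One way to combine the two ideas above is to seed awk's generator from /dev/urandom instead of the clock. This is a sketch only: rand_below is a hypothetical name, and it assumes that either hexdump or od is available on the device.

```shell
# Hypothetical helper: print a random integer in the range 0 .. max-1.
# The seed is read from /dev/urandom rather than taken from the time of day.
# $1 = max (exclusive upper bound)
rand_below() {
    # Try hexdump first; fall back to od if hexdump is missing or fails.
    seed=$(hexdump -n 4 -e '"%u"' </dev/urandom 2>/dev/null) ||
        seed=$(od -An -N4 -tu4 </dev/urandom | tr -d ' \n')
    awk -v seed="$seed" -v max="$1" 'BEGIN { srand(seed); print int(rand() * max) }'
}

rand_below 9
```

With this in place, sleep "$(rand_below 9)" would wait between 0 and 8 seconds, and two devices that boot with the same clock value should no longer pick the same delay.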
