Given a log file in the standard combined access_log format of nginx or Apache, how would you, in a UNIX shell, calculate the number of visits or page views (i.e. total requests) from each visitor (i.e. IP address) that a given referrer once brought?
In other words, the number of ALL requests by each visitor who found a link to your site on another site.
The best snippet I could come up with is the following:
fgrep http://t.co/ /var/www/logs/access.log | cut -d " " -f 1 | sort -u | \
fgrep -f /dev/fd/0 /var/www/logs/access.log | cut -d " " -f 1 | sort | uniq -c
What does this do?
We first find the unique IP addresses of visits that have http://t.co/ in the log entry. (Notice that this only counts visits that came directly from the referrer, not those that stayed and browsed the site further.)
Having a list of IP addresses that were, at some point, referred from the given URL, we pipe that list to another fgrep through stdin (/dev/fd/0) to find all hits from those addresses. (A far less efficient alternative would have been xargs -n1 fgrep access.log -e instead of fgrep -f /dev/fd/0 access.log.)
After the second fgrep we get the same set of IP addresses as in the first step, but now each address repeats according to its total number of requests -- now sort, uniq -c, done. :)
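For reference, the same aggregation can be done with a single awk process that reads the log twice: the first pass collects the IP addresses referred from http://t.co/, the second counts every request from those addresses. A sketch, assuming the combined format where the IP address is the first field:
awk 'NR==FNR { if (/http:\/\/t\.co\//) ips[$1]=1; next }   # pass 1: remember referred IPs
     $1 in ips { cnt[$1]++ }                               # pass 2: count all their requests
     END { for (ip in cnt) print cnt[ip], ip }' /var/www/logs/access.log /var/www/logs/access.log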
I am new to Linux and am experimenting with basic terminal commands. I found out that I can list all users using compgen -u, but what if I only want to display the last few lines of the output?
OK, let's say the output of compgen -u goes like this:
extra
extra
extra
extra
extra
extra
extra
extra
extra
John
William
Kate
Harold
I can only use grep to find a single string (e.g. compgen -u | grep John). But what if I want grep to display John as well as all the remaining entries after it?
A sed or awk solution would be easier, but if you can only use grep, then the option --after-context (or -A) might do:
grep -A 5 John file
The drawback is that you need to know the number of lines to display after the match (or use an arbitrarily big number to cover the rest of the file):
compgen -u | grep -A $(compgen -u | wc -l) John
Explanation:
From man grep
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines. Places a line containing a group separator (described under --group-separator) between
contiguous groups of matches.
grep -A -- print NUM lines after the matching pattern
$() -- command substitution: execute a command and insert its output
compgen -u | wc -l -- get the total number of lines the command outputs
You can use the following one-liner:
n=$( compgen -u | grep -n John | head -1 | cut -d ":" -f 1 ) && compgen -u | tail -n +$n
This finds the line number of the first occurrence of John and prints everything starting from that line.
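For completeness, the sed and awk variants alluded to above need no line counting at all; both print from the first line matching John to the end of the output:
compgen -u | sed -n '/John/,$p'
compgen -u | awk '/John/{found=1} found'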
I'm searching for a fast method to find all files in a folder which contain two or more patterns.
grep -l -e foo -e bar ./*
or
rg -l -e foo -e bar
Both show all files containing 'foo' AND 'bar' on the same line, or 'foo' OR 'bar' on different lines. But I want only files that have at minimum one 'foo' match AND one 'bar' match on different lines; files with only 'foo' matches or only 'bar' matches shall be filtered out.
I know I could chain the grep calls but this will be too slow.
rg with multiline does work; however, it prints everything in between the criteria as the result, and sometimes that's not useful.
For the use case of chaining searches (in e.g. html, json, etc.), where the 1st criterion merely narrows down the files and the 2nd criterion is what I am actually looking for, this is a possible solution:
rg -0 -l crit1 | xargs -0 -I % rg -H crit2 %
Alternatively, I have just discovered ugrep, which supports combining multiple criteria with boolean operators at both line and file level. This is quite something. It's a bit slower than rg + xargs, but it nicely prints all lines matching all criteria from the files (instead of just those matching the last criterion, as above):
ugrep --files -e crit1 --and -e crit2
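Based on the description above, the same boolean operators should also work at line level if you drop --files, matching only lines where both criteria appear together (a sketch, with crit1/crit2 as placeholders):
ugrep -e crit1 --and -e crit2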
If you want to search for two or more words that occur on multiple lines, you can use ripgrep's --multiline-dotall option in addition to -U/--multiline. You also need to search for foo before bar and for bar before foo, joined with the | operator:
rg -lU --multiline-dotall 'foo.*bar|bar.*foo' .
For any number of words you'll need to | all permutations of those words. For that I use a small Python script (which I called rga) that searches the current directory (and downwards) for files containing all arguments given on the command line:
#!/usr/bin/env python3
import sys
import subprocess
from itertools import permutations

# join every ordering of the words with .*, then OR the orderings together,
# e.g. for two words: 'foo.*bar|bar.*foo'
rgarg = '|'.join('.*'.join(x) for x in permutations(sys.argv[1:]))
cmd = ['rg', '-lU', '--multiline-dotall', rgarg, '.']
# print(' '.join(cmd))
proc = subprocess.run(cmd, capture_output=True)
sys.stdout.write(proc.stdout.decode('utf-8'))
I have successfully searched with six arguments; above that, the command line becomes too long. There are probably ways around that by saving the pattern to a file and adding -f file_name, but I have never needed/investigated that.
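A hypothetical sketch of that -f workaround (untested, per the above): write each permutation on its own line, since rg ORs together the patterns it reads from a file, then point rg at that file. The words foo bar baz and the temp path are placeholders:
python3 -c 'import sys
from itertools import permutations
print("\n".join(".*".join(p) for p in permutations(sys.argv[1:])))' foo bar baz > /tmp/rga_patterns
rg -lU --multiline-dotall -f /tmp/rga_patterns .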
$ cat f1
afoot
2bar
$ cat f2
foo bar
$ cat f3
foot
$ cat f4
bar
$ cat f5
barred
123
foo3
$ rg -Ul '(?s)foo.*?\n.*?bar|bar.*?\n.*?foo'
f5
f1
You can use the -U option to match across lines, and the (?s) flag enables . to match newlines as well. Since you want the matches on different lines, you need to match a newline character between the search terms as well.
So this doesn't perfectly answer the question, but this is the StackOverflow question that pops up every time I google "ripgrep multiple patterns", so I'm leaving my answer here for the future googler (including myself)...
I primarily work in PowerShell, so this is how I perform an AND search with ripgrep in PowerShell. It also matches same-line matches, which is why it's not a perfect answer, but it identifies files that match both patterns and runs relatively quickly:
rg -l 'SecondSearchPattern' (rg -l 'FirstSearchPattern')
Explanation:
First the parentheses run rg -l 'FirstSearchPattern', which searches all files for the pattern FirstSearchPattern. With -l it returns a list of file paths only.
By placing it in (parentheses), PowerShell runs the whole inner command first, then "splats" its results into the outer rg command.
The external rg command is now run like this:
rg -l 'SecondSearchPattern' "file.txt" "directory\file.txt"
And yes, it does put them into quotes, so it handles paths with spaces. This searches all the provided files for SecondSearchPattern, thus returning only files that match both patterns.
You can go one step further and add on | Get-Item (| gi) to return filesystem objects, and | % FullName to get the full path.
rg -l 'SecondSearchPattern' (rg -l 'FirstSearchPattern') | gi | % FullName
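For readers outside PowerShell: a rough bash equivalent of that nesting is command substitution. A sketch, with the caveat that, unlike the PowerShell version, it breaks on paths containing spaces (the rg -0 | xargs -0 pipeline shown earlier handles those safely):
rg -l 'SecondSearchPattern' $(rg -l 'FirstSearchPattern')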
I want to do a search in a log file like this:
/logs/loggy.log:
INFO: cats are people
DEBUG: one doth fig're and therefore one doth be
INFO: cookies made via the catapultation of figs at an acceleration of 1 m/s^2.
INFO: informative information about my information systems
I want just the 3rd line. So I command:
grep 'cat.*fig' /logs/loggy.log
But it's a large file! Let's make it faster:
grep -F -e cat -e fig /logs/loggy.log
Oops. Now I'm getting back all the lines because it now matches either 'cat' or 'fig'. I want it to match only lines containing both. Is there a way to do this without going back into regular-expressions land?
You can use agrep if it is available in your distro's repos; it natively provides an AND operation:
$ agrep 'cat;fig' file1
Or you can use any of the following alternatives:
$ grep 'cat' file1 | grep 'fig'
$ awk '/cat/ && /fig/' file1
In all above cases the result is:
INFO: cookies made via the catapultation of figs at an acceleration of 1 m/s^2.
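If your grep is built with PCRE support (GNU grep's -P option), lookaheads express the same line-level AND without chaining, though this does stray back into regular-expressions land; a sketch:
$ grep -P '(?=.*cat)(?=.*fig)' file1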
I have a file, for example "queries.txt", that contains newline-separated strings. I want to use this list to find matches in a second file, "biglist.txt".
"biglist.txt" may hold multiple matches for each string in "queries.txt". I want to return only the first hit for each query and write the results to another file.
grep -m 1 -wf queries.txt biglist.txt > output
only gives me one line of output, whereas the output should have as many lines as queries.txt.
Any suggestions for this? Many thanks! I searched for past questions but did not find one that was exactly the same sort of case after a few minutes of reading.
If you want to "reset the counter" for each query, you could do
cat queries.txt | xargs -I{} grep -m 1 -w {} biglist.txt > output
This uses xargs to call grep once for each line in the input… should do the trick for you.
Explanation:
cat queries.txt - produce one "search word" per line
xargs -I{} - take the input one line at a time, and insert it at {}
grep -m 1 -w - find only one match of a whole word
{} - this is where xargs inserts the search term (once per call)
biglist.txt - the file to be searched
> output - the file where the result is to be written
An alternate method without xargs (which one should indeed learn):
(this method assumes there are no spaces in the lines in queries.txt)
cat queries.txt | while read target; do grep -m 1 $target biglist.txt; done > output
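A slightly more defensive sketch of the same loop, in case query lines contain spaces or regex metacharacters: IFS= read -r keeps each line intact, and -F makes grep treat the query as a literal string rather than a pattern:
while IFS= read -r target; do
  grep -m 1 -wF -- "$target" biglist.txt   # first whole-word literal match only
done < queries.txt > output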
I might not fully understand your question, but it sounds like something like this might work.
cat queries.txt | while read word; do grep "$word" biglist.txt | tee -a output.txt; done
I want to run ack or grep on HTML files that often have very long lines. I don't want to see very long lines that wrap repeatedly. But I do want to see just that portion of a long line that surrounds a string that matches the regular expression. How can I get this using any combination of Unix tools?
You could use grep's -o and -E options, possibly changing your pattern to ".{0,10}<original pattern>.{0,10}" in order to see some context around the match:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e., force grep to behave as egrep).
For example (from @Renaud's comment):
grep -oE ".{0,10}mysearchstring.{0,10}" myfile.txt
Alternatively, you could try -c:
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file. With the -v, --invert-match option (see
below), count non-matching lines.
Pipe your results through cut. I'm also considering adding a --cut switch so you could say --cut=80 and only get 80 columns.
You could use less as a pager for ack and chop long lines: ack --pager="less -S". This retains the long line but keeps it on one line instead of wrapping. To see more of the line, scroll left/right in less with the arrow keys.
I have the following alias setup for ack to do this:
alias ick='ack -i --pager="less -R -S"'
grep -oE ".\{0,10\}error.\{0,10\}" mylogfile.txt
In the unusual situation where you cannot use -E, use lowercase -e instead.
Explanation:
cut -c 1-100
gets characters from 1 to 100.
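For example, to trim each matching line to its first 100 characters (reusing the myfile.txt example from above):
grep 'mysearchstring' myfile.txt | cut -c 1-100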
The Silver Searcher (ag) supports this natively via the --width NUM option. It replaces the rest of any longer line with [...].
Example (truncate after 120 characters):
$ ag --width 120 '#patternfly'
...
1:{"version":3,"file":"react-icons.js","sources":["../../node_modules/#patternfly/ [...]
In ack3, a similar feature is planned but currently not implemented.
Taken from: http://www.topbug.net/blog/2016/08/18/truncate-long-matching-lines-of-grep-a-solution-that-preserves-color/
The suggested approach ".{0,10}<original pattern>.{0,10}" is perfectly good, except that the highlighting color often gets messed up. I've created a script with similar output that also preserves the color:
#!/bin/bash
# Usage:
# grepl PATTERN [FILE]
# how many characters around the searching keyword should be shown?
context_length=10
# What is the length of the control character for the color before and after the
# matching string?
# This is mostly determined by the environment variable GREP_COLORS.
control_length_before=$(($(echo a | grep --color=always a | cut -d a -f '1' | wc -c)-1))
control_length_after=$(($(echo a | grep --color=always a | cut -d a -f '2' | wc -c)-1))
grep -E --color=always "$1" $2 |
grep --color=none -oE \
".{0,$(($control_length_before + $context_length))}$1.{0,$(($control_length_after + $context_length))}"
Assuming the script is saved as grepl, then grepl pattern file_with_long_lines should display the matching lines but with only 10 characters around the matching string.
I put the following into my .bashrc:
grepl() {
$(which grep) --color=always "$@" | less -RS
}
You can then use grepl on the command line with any arguments that are available for grep. Use the arrow keys to see the tail of longer lines. Use q to quit.
Explanation:
grepl() {: Define a new function that will be available in every (new) bash console.
$(which grep): Get the full path of grep. (Ubuntu defines an alias for grep that is equivalent to grep --color=auto. We don't want that alias but the original grep.)
--color=always: Colorize the output. (--color=auto from the alias won't work since grep detects that the output is put into a pipe and won't color it then.)
"$@": Expands to all arguments given to the grepl function.
less: Display the lines using less
-R: Show colors
-S: Don't break long lines
Here's what I do:
function grep () {
tput rmam;
command grep "$@";
tput smam;
}
In my .bash_profile, I override grep so that it automatically runs tput rmam before and tput smam after, which disables line wrapping and then re-enables it.
ag can also use the regex trick, if you prefer it:
ag --column -o ".{0,20}error.{0,20}"