How can I use xargs to recursively parse email addresses out of text/html files? - grep

I tried recursively parsing email addresses from a directory of text/html files with xargs and grep, but this command keeps including the path (I just want the email addresses in my resulting emails.csv file).
find . -type f | xargs grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" >> ~/emails.csv
Can you explain what's wrong with my grep command? I don't need this to be sorted or unique. I want to match all occurrences of email addresses in files. I need to use xargs because I'm parsing emails in 20 GB worth of text files.
Thanks.

When you tell grep to search in more than one file, it prepends the corresponding filename to the search result. Try the following to see the effect...
First, search in a single file:
grep local /etc/hosts
# localhost is used to configure the loopback interface
127.0.0.1 localhost
Now search in two files:
grep local /etc/hosts /dev/null
/etc/hosts:# localhost is used to configure the loopback interface
/etc/hosts:127.0.0.1 localhost
To suppress the filename in which the match was found, add the -h switch to grep like this
grep -h <something> <somewhere>
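Applied to the command in the question, a minimal sketch might look like this. The sample directory, file names, and addresses are made up for illustration, and the pattern uses `@` as the separator. `-print0` and `-0` are my addition so that file names containing spaces survive the trip through xargs; they are widely supported but not strictly POSIX.

```shell
# Hypothetical sample data so the example is self-contained.
workdir=$(mktemp -d)
out=$(mktemp)
mkdir -p "$workdir/sub dir"
printf 'Contact: alice@example.com\n' > "$workdir/a.txt"
printf '<a href="mailto:bob@example.org">Bob</a>\n' > "$workdir/sub dir/b.html"

# -h suppresses the "filename:" prefix; -o prints only the matched text.
find "$workdir" -type f -print0 |
  xargs -0 grep -h -E -o '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b' >> "$out"

cat "$out"
```

The output file is kept outside the searched directory so that grep never reads the file it is appending to.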

Related

Unable to exclude IPv4 addresses using regex in grep

I used a regex to grep and output only IPv4 addresses from the file content.
But when I try to use the same regex to exclude all IPv4 addresses, it just does not work.
File content:
# cat IPs
172.16.1.125
172.16.1.4
172.16.1.143
172.16.1.140
172.16.1.77
/dev/nvme101
/dev/sda1
/dev/sdb2
172.16.1.60
172.16.1.146
172.16.1.5
172.16.1.51
172.16.1.99
172.16.1.10
172.16.1.189
To grep only IPv4 addresses:
# grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" IPs
172.16.1.125
172.16.1.4
172.16.1.143
172.16.1.140
172.16.1.77
172.16.1.60
172.16.1.146
172.16.1.5
172.16.1.51
172.16.1.99
172.16.1.10
172.16.1.189
When I try to exclude the IPv4 addresses using the same regex:
# grep -voE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" IPs
#
No output at all.
I was expecting the following output:
/dev/nvme101
/dev/sda1
/dev/sdb2
Get rid of the -o. The -o flag says to only show what was matched rather than the entire line. That doesn't make sense when using -v for lines that do NOT match.
In ack, if you try to use -o and -v together, it throws an error.
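As a quick check, here is a sketch that recreates a shortened version of the IPs file (the temporary file name is generated, not from the question) and runs the same regex with -v but without -o:

```shell
# Shortened copy of the IPs file from the question.
ips_file=$(mktemp)
printf '%s\n' 172.16.1.125 172.16.1.4 /dev/nvme101 /dev/sda1 /dev/sdb2 172.16.1.60 > "$ips_file"

# -v inverts the match and prints whole non-matching lines; no -o here.
non_ips=$(grep -vE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b' "$ips_file")
printf '%s\n' "$non_ips"
```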

regex start of line anchor alternative

I have "file.txt" with the following and I need to get only ip addresses that start a line.
I am using GNU utilities for Windows and grep seems to be behaving incorrectly.
Random Text Here
ABC 10.0.0.0 - 10.20.0.255
IP Ping Hostname
100.5.0.20 11ms N/S
GNU grep 2.5.4
grep -Po ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} file.txt
10.0.0.0
10.20.0.255
100.5.0.20
Correct behavior should only allow 100.5.0.20 since I specified the start-of-line anchor.
Any other Linux command solutions?
I ended up improvising,
grep -oP "[0-9]+\.[0-9]+\.[0-9]+\.[0-9]{1,3} " file.txt| awk "{$1=$1};1" > file.txt
This will grab the ip addresses with 2 spaces, and then remove the spaces with awk.
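One possible explanation, and this is an assumption since the question doesn't say which shell was used: in cmd.exe an unquoted `^` is the shell's own escape character, so it is stripped before grep ever sees the pattern and the anchor is lost. Quoting the pattern keeps it. The sketch below is for a POSIX shell and uses single quotes (in cmd.exe you would use double quotes instead). It also uses -E with `[0-9]` rather than -P, purely so it doesn't depend on PCRE support.

```shell
# Recreate file.txt from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' 'Random Text Here' 'ABC 10.0.0.0 - 10.20.0.255' \
  'IP Ping Hostname' '100.5.0.20 11ms N/S' > "$sample"

# The quoted pattern reaches grep with its ^ anchor intact.
line_start_ips=$(grep -oE '^([0-9]{1,3}\.){3}[0-9]{1,3}' "$sample")
printf '%s\n' "$line_start_ips"
```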

How can I extract the IP addresses from .cap file?

I have a fwcapture.cap file, which is used by Wireshark.
It contains many IP addresses, both source IPs and destination IPs.
How can I extract the unique IP addresses (no matter whether source or destination) as a list?
You can use tshark, which is already included in the Wireshark installation.
tshark -T json -e 'ip.src' -e 'ip.dst' -r filename.pcap | grep '\.[0-9]' | sort -u
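If you want bare addresses rather than JSON fragments, a variation using `-T fields` might work better. This is a sketch, not the command above: `unique_ips` is a helper name I made up, and it assumes tshark's default field output, where the two fields are separated by a tab and repeated values within a field by a comma.

```shell
# Split tab- or comma-separated addresses onto separate lines and de-duplicate.
unique_ips() {
  tr '\t,' '\n\n' | grep -v '^$' | sort -u
}

# Run only when tshark and the capture file are actually present.
if command -v tshark >/dev/null 2>&1 && [ -r fwcapture.cap ]; then
  tshark -r fwcapture.cap -T fields -e ip.src -e ip.dst | unique_ips
fi
```

Source and destination addresses end up in one merged list, so an address that appears in both roles is printed once.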

How to output tcpdump with grep expression to stdout / file?

I am trying to output the following tcpdump grep expression to a file :
tcpdump -vvvs 1024 -l -A tcp port 80 | grep -E 'X-Forwarded-For:' --line-buffered | awk '{print $2}'
I understand it is related to the line-buffered option, that sends the output to stdin. However, if I don't use --line-buffered I don't get any output at all from my tcpdump.
How can I use grep so that it will send my output directly to stdout / file in this case ?
I am trying to output the following tcpdump grep expression to a file
Then redirect the output of the last command in the pipeline to the file:
tcpdump -vvvs 1024 -l -A tcp port 80 | grep -E 'X-Forwarded-For:' --line-buffered | awk '{print $2}' >file
I understand it is related to the line-buffered option, that sends the output to stdin.
No, that's not what --line-buffered does:
$ man grep
...
--line-buffered
Force output to be line buffered. By default, output is line
buffered when standard output is a terminal and block buffered
otherwise.
So it doesn't change where the output goes; it only changes when the data is actually written to the output descriptor if that descriptor is not a terminal. Here it is a pipe, not a terminal, so by default the output is block buffered. If grep writes 4 lines of output and that is less than a full buffer block (typically 4K bytes on most modern UN*Xes and on Windows, so 4 lines are unlikely to fill it), those lines will not be written to the pipe immediately, so they won't show up immediately.
--line-buffered changes that behavior, so that each line is written to the pipe as it's generated, and awk sees it sooner.
You're using -l with tcpdump, which has the same effect, at least on UN*X:
$ man tcpdump
...
-l Make stdout line buffered. Useful if you want to see the data
while capturing it. E.g.,
tcpdump -l | tee dat
or
tcpdump -l > dat & tail -f dat
Note that on Windows, ``line buffered'' means ``unbuffered'', so
that WinDump will write each character individually if -l is
specified.
-U is similar to -l in its behavior, but it will cause output to
be ``packet-buffered'', so that the output is written to stdout
at the end of each packet rather than at the end of each line;
this is buffered on all platforms, including Windows.
So the pipeline, as you've written it, will cause grep to see each line that tcpdump prints as soon as tcpdump prints it, and cause awk to see each of those lines that contains "X-Forwarded-For:" as soon as grep sees it and matches it.
However, if I don't use --line-buffered I don't get any output at all from my tcpdump.
You'll see it eventually, as long as grep produces a buffer's worth of output; however, that could take a very long time. --line-buffered causes grep to write out each line as it's produced, so it shows up as soon as grep produces it, rather than when the buffer is full.
How can I use grep so that it will send my output directly to stdout / file in this case ?
grep is sending its (standard) output to awk, which is presumably what you want; you're extracting the second field from grep's output and printing only that.
So you don't want grep to send its (standard) output directly to the terminal or to a file, you want it to send its output to awk and have awk send its (standard) output there. If you want the output to be printed on your terminal, your command is doing the right thing; if you want it sent to a file, redirect the standard output of awk to that file.
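Here is a sketch of the same pipeline shape that can be run without root or live traffic. The printf line is a stand-in for tcpdump's output, and the header values and temporary file are made up. The `fflush()` call is my addition, not part of the original command: awk's own output is usually block buffered when it goes to a file, so flushing after each line makes the file fill in as lines arrive.

```shell
# Temporary output file (hypothetical; use any path you like).
xff_file=$(mktemp)

# printf stands in for: tcpdump -vvvs 1024 -l -A tcp port 80
printf '%s\n' 'X-Forwarded-For: 10.0.0.1' 'Host: example.com' 'X-Forwarded-For: 10.0.0.2' |
  grep --line-buffered -E 'X-Forwarded-For:' |
  awk '{ print $2; fflush() }' > "$xff_file"

cat "$xff_file"
```

With a live capture you could then follow the file from another terminal with tail -f.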

combine grep with the watch and netstat command

Red Hat Enterprise Linux Server release 5.4 (Tikanga)
2.6.18-164.el5
I am trying to use the watch command combined with netstat to see the 2 programs using certain ports.
However, the command I am using below doesn't work for both words:
watch -n1 "netstat -upnlt | grep gateway\|MultiMedia"
Is this the correct way to grep for both program names?
If I use one, it's OK, but both together don't work.
For the grep you need to quote the pattern itself:
grep "gateway\|MultiMedia"
So perhaps try:
watch -n1 'netstat -upnlt | grep "gateway\|MultiMedia"'
There's also the new way of doing things... grep -E is nice and portable (or egrep, which is simply shorthand for grep -E on Linux and BSD), so you don't have to escape the pipe. From the man pages:
-E Interpret pattern as an extended regular expression (i.e. force
grep to behave as egrep).
So...
watch "netstat -upnlt | grep -E 'gateway|MultiMedia'"
or
watch "netstat -upnlt | egrep 'gateway|MultiMedia'"
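Since watch is interactive, it can help to test the pattern on its own first. The sketch below feeds a few made-up netstat-style lines (the ports and PIDs are invented) through both forms to show they select the same lines. Note that grep is case-sensitive, so the pattern has to spell MultiMedia the way the program name appears. The `\|` form relies on a GNU extension to basic regular expressions.

```shell
# Hypothetical netstat-style lines for illustration.
sample='tcp 0 0 0.0.0.0:5060 0.0.0.0:* LISTEN 1234/gateway
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 987/sshd
udp 0 0 0.0.0.0:8000 0.0.0.0:* 4321/MultiMedia'

# Basic regex with escaped alternation (GNU grep) versus extended regex.
bre=$(printf '%s\n' "$sample" | grep 'gateway\|MultiMedia')
ere=$(printf '%s\n' "$sample" | grep -E 'gateway|MultiMedia')

printf '%s\n' "$ere"
```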
I had a similar problem monitoring an ssh connection.
> netstat -tulpan|grep ssh
tcp 0 0 192.168.2.52:58072 192.168.2.1:22 ESTABLISHED 31447/ssh
However watch -n 1 'netstat -tulpan|grep ssh' shows no output (apart from message from watch).
If I change it to watch -n 1 'netstat -tulpan|grep ":22"' I get the required output line. It seems as if the -p option is ignored when netstat is run through watch. Strange.
