How can I remove duplicates from (deduplicate) an mbox format email mailbox? - parsing

I've got an mbox mailbox containing duplicate copies of messages, which differ only in their "X-Evolution:" header.
I want to remove the duplicates as quickly and simply as possible. It seems like this would have been written already, but I haven't found it, although I've looked at the Python mailbox module, the various Perl mbox parsers, formail, and so forth.
Does anyone have any suggestions?

This is a small script that I used for it:
#!/bin/bash
# Temporary Message-ID cache that formail uses to spot duplicates
IDCACHE=$(mktemp -p /tmp)
# -D skips messages whose Message-ID is already in the cache; -s splits stdin into individual messages
formail -D $((1024*1024*10)) "${IDCACHE}" -s
rm "${IDCACHE}"
Pipe the mailbox through it; the deduplicated mailbox comes out on standard output.
-D $((1024*1024*10)) sets a 10 MiB cache, which is more than 10x the amount needed to deduplicate an entire year of my mail. YMMV, so adjust it accordingly. Setting it too high will cost some performance; setting it too low will let duplicates slip through.
formail is part of the procmail utility bundle, mktemp is part of coreutils.
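For comparison, the same Message-ID-based idea can be sketched with the Python mailbox module that the question mentions. This is only a minimal sketch, not a replacement for formail: the function name dedupe_mbox and both path parameters are made up for illustration, and it assumes duplicates share a Message-ID and that the mailbox is not being modified by another program while it runs.

```python
import mailbox


def dedupe_mbox(src_path, dst_path):
    """Copy src_path to dst_path, keeping the first message per Message-ID.

    Returns a (kept, dropped) tuple. Hypothetical helper, not part of any library.
    """
    seen = set()
    kept = dropped = 0
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    try:
        for msg in src:
            msg_id = msg.get("Message-ID")
            # Messages without a Message-ID cannot be matched, so keep them all.
            if msg_id is not None:
                key = msg_id.strip()
                if key in seen:
                    dropped += 1
                    continue
                seen.add(key)
            dst.add(msg)
            kept += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return kept, dropped
```

Unlike the formail cache, the set of seen IDs lives in memory and is unbounded, which should be fine for ordinary mailboxes but is worth keeping in mind for very large ones.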

I didn't look at formail (part of procmail) in enough detail. It does have such an option, as mentioned in places like: http://hints.macworld.com/comment.php?mode=view&cid=115683 and http://us.generation-nt.com/answer/deleting-duplicate-mail-messages-help-172481881.html

'formail -D' and 'reformail -D' can only process one email per execution, so each mail needs to be split out of the mbox before being processed. I use reformail from maildrop instead, since it's still in active development.
Remove any old idcache, tmpmail and nmbox files.
Run dedup.sh with the mbox as its argument.
nmbox is the output, with duplicate messages removed.
dedup.sh
#! /bin/sh
# $1 = mbox, thunderbird mailbox
# wmbox.sh is called for each mail.
reformail -s ./wmbox.sh < "$1"
wmbox.sh
#! /bin/sh
# stdin: an email
# called by dedup.sh
TM=tmpmail
if [ -f "$TM" ] ; then
    echo "error: $TM already exists" >&2
    exit 1
fi
cat > "$TM"
# mbox format: each mail ends with a blank line
echo "" >> "$TM"
reformail -D 99999999 idcache < "$TM"
# if this mail isn't a dup (reformail returns 1 if the message-id is not found)
if [ $? != 0 ]; then
    # each mail shall have a message-id
    if grep -q -i '^message-id:' "$TM"; then
        cat "$TM" >> nmbox
    fi
fi
rm "$TM"
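Note that wmbox.sh drops any mail that has no Message-ID. Since the question says the copies differ only in their "X-Evolution:" header, a different technique is to compare messages by a hash of everything except that header. The following is a rough sketch of that alternative, not what formail or reformail do; the function name content_key and the ignored parameter are made up for illustration.

```python
import hashlib


def content_key(raw_message: bytes, ignored=(b"x-evolution",)) -> str:
    """Hash a raw message while skipping the headers named in `ignored`.

    Hypothetical helper: two messages with the same key differ at most in
    the ignored headers (and in CRLF versus LF line endings).
    """
    normalized = raw_message.replace(b"\r\n", b"\n")
    head, _, body = normalized.partition(b"\n\n")
    kept = []
    skipping = False
    for line in head.split(b"\n"):
        if line[:1] in (b" ", b"\t"):
            # Folded continuation line: it shares the fate of its header.
            if not skipping:
                kept.append(line)
            continue
        name = line.split(b":", 1)[0].strip().lower()
        skipping = name in ignored
        if not skipping:
            kept.append(line)
    digest = hashlib.sha256(b"\n".join(kept) + b"\n\n" + body)
    return digest.hexdigest()
```

A set of these keys could stand in for the Message-ID cache when some messages lack a Message-ID, at the cost of hashing every message in full.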

Related

Using the grep command to get a specific word [LINUX]

I have a test.txt file with links for example:
google.com?test=
google.com?hello=
and this code
xargs -0 -n1 -a FUZZvul.txt -d '\n' -P 20 -I % curl -ks1L '%/?=DarkLotus' | grep -a 'DarkLotus'
When I put a specific word, such as DarkLotus, into the command, it requests each link in the file and prints the lines of the responses in which that word is reflected.
That part works. The problem is that I have many links, and when the results appear in the terminal, I can't tell which site reflected the word DarkLotus.
How can I do it?
Try the -n option. It prefixes each matching line with its line number in the input.
Best Regards,
Haridas.
I'm not sure what you are up to there, but can you invert it? grep by default prints matching lines. The problem here is that you are piping the stdout of the previous commands into grep, so grep no longer knows which link each line came from. Since you have a file to work with:
$ grep 'DarkLotus' FUZZvul.txt
If your intention is to also follow the link then it might be easier to write a bash script:
#!/bin/bash
for line in $(grep 'DarkLotus' FUZZvul.txt)
do
    # placeholder: extract the link from the line (assumed here to be the whole line)
    link=${line}
    echo "${link}"
    curl -ks1L "${link}"
done
Then you could make your script accept user input:
#!/bin/bash
# the first argument is the word to search for ($0 would be the script name)
word="${1}"
for line in $(grep "${word}" FUZZvul.txt)
...
and then
$ my_link_getter "DarkLotus"
https://google?somearg=DarkLotus
...
And then you could make the txt file a parameter.
etc.
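To answer the "which site reflected it" part directly, one option is to keep each URL next to its response instead of merging all responses into one stream. The sketch below only illustrates that idea: the names fetch_text and reflecting_urls are made up, and it assumes each line in the file is a complete URL (including http:// or https://) ending in "=", so that the marker can simply be appended.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Download a URL and return its body as text (undecodable bytes replaced)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def reflecting_urls(base_urls, marker, fetch=fetch_text):
    """Return the URLs whose response body contains `marker`.

    `base_urls` are assumed to end in '=' so the marker can be appended.
    `fetch` is injectable so the logic can be tried without network access.
    """
    hits = []
    for base in base_urls:
        url = base + marker
        try:
            body = fetch(url)
        except (OSError, ValueError):
            # Unreachable host, HTTP error, or malformed URL: skip it.
            continue
        if marker in body:
            hits.append(url)
    return hits
```

Reading the file would then be something like `reflecting_urls([l.strip() for l in open("FUZZvul.txt") if l.strip()], "DarkLotus")`, which prints nothing itself but returns the list of reflecting URLs.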

Shell: Find Matching Lines Across Many Files

I am trying to use a shell script (well a "one liner") to find any common lines between around 50 files.
Edit: Note I am looking for a line (lines) that appears in all the files
So far I've tried grep -v -x -f file1.sp * which just matches that file's contents across ALL the other files.
I've also tried grep -v -x -f file1.sp file2.sp | grep -v -x -f - file3.sp | grep -v -x -f - file4.sp | grep -v -x -f - file5.sp etc... but I believe that feeds the files to be searched in as stdin rather than as the patterns to match on.
Does anyone know how to do this with grep or another tool?
I don't mind if it takes a while to run, I've got to add a few lines of code to around 500 files and wanted to find a common line in each of them for it to insert 'after' (they were originally just c&p from one file so hopefully there are some common lines!)
Thanks for your time,
When I first read this I thought you were trying to find 'any common lines'. I took this as meaning "find duplicate lines". If this is the case, the following should suffice:
sort *.sp | uniq -d
Upon re-reading your question, it seems that you are actually trying to find lines that 'appear in all the files'. If this is the case, you will need to know the number of files in your directory:
find . -type f -name "*.sp" | wc -l
If this returns the number 50, you can then use awk like this:
WHINY_USERS=1 awk '{ array[$0]++ } END { for (i in array) if (array[i] == 50) print i }' *.sp
You can consolidate this process and write a one-liner like this:
WHINY_USERS=1 awk -v find=$(find . -type f -name "*.sp" | wc -l) '{ array[$0]++ } END { for (i in array) if (array[i] == find) print i }' *.sp
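One caveat with the counting approach: the awk array counts every occurrence, so a line that is repeated inside a single file is counted more than once and the total can end up above or below the number of files for the wrong reason. A small Python sketch of the same counting idea that counts each line at most once per file is shown here; the function name lines_in_all is made up, and it takes already-read lists of lines rather than file names to keep the example short.

```python
from collections import Counter


def lines_in_all(files_lines):
    """Return, sorted, the lines that occur in every one of the given files.

    `files_lines` is an iterable with one list of lines per file.
    """
    counts = Counter()
    n_files = 0
    for lines in files_lines:
        n_files += 1
        # set() ensures a line repeated within one file is counted only once.
        counts.update(set(lines))
    return sorted(line for line, seen in counts.items() if seen == n_files)
```

For real files, each element could be produced with something like `open(name).read().splitlines()`.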
old, bash answer (O(n); opens 2 * n files)
From @mjgpy3's answer, you just have to make a for loop and use comm, like this (note that comm expects its input files to be sorted):
#!/bin/bash
tmp1=$(mktemp)
tmp2=$(mktemp)
cp "$1" "$tmp1"
shift
for file in "$@"
do
    # keep only the lines common to the running result and the next file
    comm -1 -2 "$tmp1" "$file" > "$tmp2"
    mv "$tmp2" "$tmp1"
done
cat "$tmp1"
rm "$tmp1"
Save in a comm.sh, make it executable, and call
./comm.sh *.sp
assuming all your filenames end with .sp.
Updated answer, python, opens each file only once
Looking at the other answers, I wanted to give one that opens each file only once, uses no temporary files, and supports duplicated lines. Additionally, let's process the files in parallel.
Here you go (in python3):
#!/usr/bin/env python3
import argparse
import sys
import multiprocessing
import os
EOLS = {'native': os.linesep.encode('ascii'), 'unix': b'\n', 'windows': b'\r\n'}
def extract_set(filename):
    # read one file and return its distinct lines, without line endings
    with open(filename, 'rb') as f:
        return set(line.rstrip(b'\r\n') for line in f)
def find_common_lines(filenames):
    # one worker per CPU; each file is read exactly once
    with multiprocessing.Pool() as pool:
        line_sets = pool.map(extract_set, filenames)
    return set.intersection(*line_sets)
if __name__ == '__main__':
    # usage info and argument parsing
    parser = argparse.ArgumentParser()
    parser.add_argument("in_files", nargs='+',
                        help="find common lines in these files")
    parser.add_argument('--out', type=argparse.FileType('wb'),
                        help="the output file (default stdout)")
    parser.add_argument('--eol-style', choices=EOLS.keys(), default='native',
                        help="(default: native)")
    args = parser.parse_args()
    # actual stuff
    common_lines = find_common_lines(args.in_files)
    # write results to output
    to_print = EOLS[args.eol_style].join(common_lines)
    if args.out is None:
        # find out stdout's encoding, utf-8 if absent
        encoding = sys.stdout.encoding or 'utf-8'
        sys.stdout.write(to_print.decode(encoding))
    else:
        args.out.write(to_print)
Save it into a find_common_lines.py, and call
python ./find_common_lines.py *.sp
More usage info with the --help option.
Combining these two answers (ans1 and ans2), I think you can get the result you need without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
    for file2 in *
    do
        if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ] ; then
            echo "Comparing: $file1 $file2 ..." >> "$ans"
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
        fi
    done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It takes all the files in the current working directory, makes an all-vs-all comparison, and leaves the result in the "matching_lines" file.
Things to be improved:
Skip directories
Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
Maybe add the line number next to the matching string
Hope this helps.
Best,
Alan Karpovsky
See this answer. I originally thought a diff sounded like what you were asking for, but this answer seems much more appropriate.

How can I have grep not print out 'No such file or directory' errors?

I'm grepping through a large pile of code managed by git, and whenever I do a grep, I see piles and piles of messages of the form:
> grep pattern * -R -n
whatever/.git/svn: No such file or directory
Is there any way I can make those lines go away?
You can use the -s or --no-messages flag to suppress errors.
-s, --no-messages suppress error messages
grep pattern * -s -R -n
If you are grepping through a git repository, I'd recommend you use git grep. You don't need to pass in -R or the path.
git grep pattern
That will show all matches from your current directory down.
Errors like that are usually sent to the "standard error" stream, which you can pipe to a file or just make disappear on most commands:
grep pattern * -R -n 2>/dev/null
I have seen that happening several times, with broken links (symlinks that point to files that do not exist), grep tries to search on the target file, which does not exist (hence the correct and accurate error message).
I normally don't bother while doing sysadmin tasks over the console, but from within scripts I do look for text files with "find", and then grep each one:
find /etc -type f -exec grep -nHi -e "widehat" {} \;
Instead of:
grep -nRHi -e "widehat" /etc
I usually don't let grep do the recursion itself. There are usually a few directories you want to skip (.git, .svn...)
You can build clever aliases from constructs like this one:
find . \( -name .svn -o -name .git \) -prune -o -type f -exec grep -Hn pattern {} \;
It may seem overkill at first glance, but when you need to filter out some patterns it is quite handy.
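The same prune-and-search idea can be sketched in Python for use inside scripts. This is only an illustration under stated assumptions: the name grep_tree and its parameters are made up, the pattern is treated as a Python regular expression rather than a grep pattern, and, unlike find -type f, symlinks that point at real files are still searched while dangling ones are skipped silently.

```python
import os
import re


def grep_tree(root, pattern, skip_dirs=(".git", ".svn")):
    """Return (path, line_number, line) for every match under root.

    Directories named in skip_dirs are pruned; broken symlinks and
    unreadable files are skipped instead of producing error messages.
    """
    regex = re.compile(pattern)
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # dangling symlink or special file
            try:
                with open(path, errors="replace") as fh:
                    for lineno, line in enumerate(fh, 1):
                        if regex.search(line):
                            hits.append((path, lineno, line.rstrip("\n")))
            except OSError:
                continue
    return hits
```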
Have you tried the -0 option in xargs? Something like this:
find . -type f -print0 | xargs -0 grep 'some text'
Use -I in grep.
Example: grep SEARCH_ME -Irs ~/logs.
I redirect stderr to stdout and then use grep's invert-match (-v) to exclude the warning/error string that I want to hide:
grep -r <pattern> * 2>&1 | grep -v "No such file or directory"
I was getting lots of these errors running "M-x rgrep" from Emacs on Windows with /Git/usr/bin in my PATH. Apparently in that case, M-x rgrep uses "NUL" (the Windows null device) rather than "/dev/null". I fixed the issue by adding this to .emacs:
;; Prevent issues with the Windows null device (NUL)
;; when using cygwin find with rgrep.
(defadvice grep-compute-defaults (around grep-compute-defaults-advice-null-device)
"Use cygwin's /dev/null as the null-device."
(let ((null-device "/dev/null"))
ad-do-it))
(ad-activate 'grep-compute-defaults)
One easy way to make grep return zero status all the time is to use || true
→ echo "Hello" | grep "This won't be found" || true
→ echo $?
0
As you can see, the exit status here is 0 (success).

grep show all lines, not just matches, set exit status

I'm piping some output of a command to egrep, which I'm using to make sure a particular failure string doesn't appear in it.
The command itself, unfortunately, won't return a proper non-zero exit status on failure, that's why I'm doing this.
command | egrep -i -v "badpattern"
This works as far as giving me the exit code I want (1 if badpattern appears in the output, 0 otherwise), BUT, it'll only output lines that don't match the pattern (as the -v switch was designed to do). For my needs, those lines are the most interesting lines.
Is there a way to have grep just blindly pass through all lines it gets as input, and just give me the exit code as appropriate?
If not, I was thinking I could just use perl -ne "print; exit 1 if /badpattern/". I use -n rather than -p because -p won't print the offending line (since it prints after running the one-liner). So, I use -n and call print myself, which at least gives me the first offending line, but then output (and execution) stops there, so I'd have to do something like
perl -e '$code = 0; while (<>) { print; $code = 1 if /badpattern/; } exit $code'
which does the whole job but is a bit much. Is there a simple command line switch for grep that will just do what I'm looking for?
Actually, your perl idea is not bad. Try:
perl -pe 'END { exit $status } $status=1 if /badpattern/;'
I bet this is at least as fast as the other options being suggested.
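For scripts where Perl is not an option, the same pass-through logic is easy to sketch in Python. This is just an illustration of the technique from the question (print every line, remember whether the pattern was seen, report it through the return value); the function name passthrough is made up, and the match is case-insensitive to mirror the -i in the original egrep call.

```python
import re
import sys


def passthrough(lines, pattern, out=None):
    """Echo every line to `out` and return 1 if any line matched, else 0."""
    if out is None:
        out = sys.stdout
    regex = re.compile(pattern, re.IGNORECASE)
    found = False
    for line in lines:
        out.write(line)
        if regex.search(line):
            found = True
    return 1 if found else 0
```

Used as a filter, the last line of a script would be something like `sys.exit(passthrough(sys.stdin, "badpattern"))`, so the return value becomes the exit status of the pipeline stage.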
$ tee /dev/tty < ~/.bashrc | grep -q spam && echo spam || echo no spam
How about doing a redirect to /dev/null, hence removing all lines, but you still get the exit code?
$ grep spam .bashrc > /dev/null
$ echo $?
1
$ grep alias .bashrc > /dev/null
$ echo $?
0
Or you can simply use the -q switch
-q, --quiet, --silent
Quiet; do not write anything to standard output. Exit
immediately with zero status if any match is found, even if an
error was detected. Also see the -s or --no-messages option.
(-q is specified by POSIX.)

Watch a web page for changes

I googled and couldn't find any code that would compare a webpage to a previous version.
In this case the page I'm trying to watch is link text. There are services that can watch a page, but I'd like to set this up on my own server.
I've set this up as a wiki so anyone can add to the code. Here's my idea
Check if previous version of file exists. If false then download page
If page exists, diff to find differences and email the new content along with dates of new and old versions.
This script would be called nightly via cron or on-demand via the browser (the latter is not a priority)
Sounds simple, maybe I'm just not looking in the right place.
Perhaps a simple sh-script like this, featuring wget, diff & test?
#!/bin/sh
WWWURI="http://foo.bar/testfile.html"
LOCALCOPY="testfile.html"
TMPFILE="tmpfile"
WEBFILE="changed.html"
MAILADDRESS="$(whoami)"
SUBJECT_NEWFILE="$LOCALCOPY is new"
BODY_NEWFILE="first version of $LOCALCOPY loaded"
SUBJECT_CHANGEDFILE="$LOCALCOPY updated"
SUBJECT_NOTCHANGED="$LOCALCOPY not updated"
BODY_CHANGEDFILE="new version of $LOCALCOPY"
# test for old file
if [ -e "$LOCALCOPY" ]
then
    mv "$LOCALCOPY" "$LOCALCOPY.bak"
    wget "$WWWURI" -O"$LOCALCOPY" -o/dev/null
    diff "$LOCALCOPY" "$LOCALCOPY.bak" > "$TMPFILE"
    # test for update
    if [ -s "$TMPFILE" ]
    then
        echo "$SUBJECT_CHANGEDFILE"
        ( echo "$BODY_CHANGEDFILE" ; cat "$TMPFILE" ) | tee "$WEBFILE" | mail -s "$SUBJECT_CHANGEDFILE" "$MAILADDRESS"
    else
        echo "$SUBJECT_NOTCHANGED"
    fi
else
    wget "$WWWURI" -O"$LOCALCOPY" -o/dev/null
    echo "$BODY_NEWFILE"
    echo "$BODY_NEWFILE" | tee "$WEBFILE" | mail -s "$SUBJECT_NEWFILE" "$MAILADDRESS"
fi
[ -e "$TMPFILE" ] && rm "$TMPFILE"
Update: Pipe through tee, little spelling & remove of $TMPFILE
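The compare-with-previous-copy step of the script above can also be sketched in Python with difflib, leaving the download and the mail delivery to whatever tools are already in place. The function name update_snapshot and its parameters are made up for this illustration, and the page content is assumed to have been fetched and decoded to text already.

```python
import difflib
from pathlib import Path


def update_snapshot(new_text, snapshot_path):
    """Compare new_text with the stored copy and refresh the copy.

    Returns None on the first run (no previous copy), an empty string if
    nothing changed, or a unified diff describing the change.
    """
    path = Path(snapshot_path)
    if not path.exists():
        path.write_text(new_text, encoding="utf-8")
        return None
    old_text = path.read_text(encoding="utf-8")
    diff = "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile="previous",
        tofile="current",
    ))
    if diff:
        # Only overwrite the stored copy when the page really changed.
        path.write_text(new_text, encoding="utf-8")
    return diff
```

A nightly cron job could call this with the freshly downloaded page and send the returned diff by mail whenever it is a non-empty string.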
You can check This SO posting to get a few ideas and also information about the challenge of detecting "true" changes to a web page (with fluctuating advertisement block, and other "noise")

Resources