concatenate files and print - printing

I need to concatenate files and print them. However, I also want to add a header (the name of the file) before each file's contents, to distinguish the files.
Example:
file1.txt
content of file1.txt
file2.txt
content of file2.txt
....
....
What are the ways of going about this? I am using the lp command to print.

for i in file?.txt; do echo "$i"; cat "$i"; done | lp
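A slightly more defensive variant of the loop above, as a minimal sketch: it quotes the file names so that names containing spaces survive, and wraps the loop in a function. The function name cat_with_headers and the demo files are made up for illustration, and the demo prints to the terminal instead of piping to lp.

```shell
# Print each file preceded by a header line holding its name.
cat_with_headers() {
    for f in "$@"; do
        printf '%s\n' "$f"
        cat -- "$f"
    done
}

# Demo with two throwaway files (hypothetical names and contents).
demo_dir=$(mktemp -d)
printf 'content of file1.txt\n' > "$demo_dir/file1.txt"
printf 'content of file2.txt\n' > "$demo_dir/file2.txt"
combined=$(cd "$demo_dir" && cat_with_headers file?.txt)
printf '%s\n' "$combined"
rm -rf "$demo_dir"
```

To send the result to the printer, pipe the function instead: cat_with_headers file?.txt | lp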

Related

Unable to match patterns from one file line by line with contents of other file | bash shell

I have a file1.txt containing text like:
123 456 789
I need to search these strings line by line in another file2.txt like this:
"123" This is line 1
"456" This is line 2
"789" This is line 3
Matching lines need to be echoed or redirected to file3.txt
I tried a couple of ways:
while read -r line; do
grep "$line" -c file2.txt
done < file1.txt
This doesn't give me any matches, although there are some.
I also tried grep like this:
grep -f file1.txt -c file2.txt
which unfortunately doesn't work either.
For all three matches, output should have been:
1
1
1
I am new to shell scripting. Could someone please suggest what is wrong here?
Thanks in advance.
If you are OK with awk, could you please try the following:
awk 'FNR==NR{a[$0];next} ($2 in a)' file1.txt FS="\"" file2.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when file1.txt is being read.
a[$0] ##Creating array a with index current line here.
next ##next will skip further statements from here.
}
($2 in a) ##Checking condition if 2nd field is present in array a then print the current line from file2.txt
' file1.txt FS="\"" file2.txt ##Mentioning Input_file names here, where setting FS as " for file2.txt
2nd solution: reading the input files in the opposite order.
awk 'FNR==NR{a[$2]=$0;next} ($0 in a){print a[$0]}' FS="\"" file2.txt file1.txt
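To see the first command in action, here is a small self-contained run. It assumes file1.txt holds one search string per line; the sample data and the temporary directory are made up for illustration. The matching lines are written to file3.txt, as the question asks.

```shell
tmp=$(mktemp -d)
printf '123\n456\n789\n' > "$tmp/file1.txt"
printf '"123" This is line 1\n"456" This is line 2\n"999" This is line 3\n' > "$tmp/file2.txt"

# While reading file1.txt, remember every line as an array key.
# FS is switched to a double quote before file2.txt is read,
# so $2 is the text between the first pair of quotes.
awk 'FNR==NR { a[$0]; next } ($2 in a)' "$tmp/file1.txt" FS='"' "$tmp/file2.txt" > "$tmp/file3.txt"

matches=$(cat "$tmp/file3.txt")
printf '%s\n' "$matches"
rm -rf "$tmp"
```

Only the lines whose quoted key appears in file1.txt come out; the "999" line is dropped.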

Count the number of occurrence of string in a large file

I have a large 900MB XML file and the entire file is just one line. There is no line break between tags. I need to count the occurrences of a particular tag in that file.
I tried
grep -o '<start tag>' filename | wc -l
I get a grep: line too long error.
How can I get around this?
Here's a bit of a hack:
perl -ne 'BEGIN { $/ = ">"; $c = 0 } $c++ if /<start tag>/; END { print "$c\n" }' filename
The idea is to loop over "lines" that are terminated by > instead of \n (newline). That should avoid "line too long" errors.
Just use awk:
awk -F'<start tag>' '{print NF-1}' file
If that fails, you can do this with GNU awk (for multi-char RS):
awk -v RS='<start tag>' 'END{print NR-1}' file
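Another way to sidestep very long lines, as a rough sketch: tr works on a stream of characters rather than on lines, so it can turn every > into a newline before grep does the counting. This assumes the tag is written without attributes, exactly like <item> in the made-up sample below; a tag such as <item id="1"> would need a looser pattern.

```shell
# Hypothetical sample: one long line with no line breaks between tags.
xml='<root><item>a</item><item>b</item><item>c</item></root>'

# After tr, every line ends where a ">" used to be, so an opening
# <item> tag shows up as a line ending in "<item".
# Closing tags end in "/item" and are therefore not counted.
count=$(printf '%s' "$xml" | tr '>' '\n' | grep -c '<item$')
echo "$count"
```

For a real file, replace the printf with: tr '>' '\n' < filename | grep -c '<item$'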

joining 2 files on matching column values using awk

I know there have been similar questions posted but I'm still having a bit of trouble getting the output I want using awk FNR==NR...
I have 2 files as such
File 1:
123|this|is|good
456|this|is|better
...
File 2:
aaa|123
bbb|456
...
So I want to join on values from file 2/column2 to file 1/column1 and output file 1 (col 2,3,4) and file 2 (col 1).
Thanks in advance.
With awk you could do something like
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } $1 in val { $(NF + 1) = val[$1]; print }' file2 file1
NF is the number of fields in a record (line by default), so $NF is the last field, and $(NF + 1) is the field after that. By assigning the saved value from the pass over file2 to it, a new field is appended to the record before it is printed.
One thing to note: This behaves like an inner join, i.e., only records are printed whose key appears in both files. To make this a right join, you can use
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } { $(NF + 1) = val[$1]; print }' file2 file1
That is, you can drop the $1 in val condition on the append-and-print action. If $1 is not in val, val[$1] is empty, and an empty field will be appended to the record before printing.
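The difference between the two variants is easiest to see on data where one key has no partner. A small sketch, with made-up sample rows (the 789 row exists only in file1) and a temporary directory:

```shell
tmp=$(mktemp -d)
printf '123|this|is|good\n456|this|is|better\n789|no|partner|here\n' > "$tmp/file1"
printf 'aaa|123\nbbb|456\n' > "$tmp/file2"

# Inner join: only file1 rows whose key was seen in file2 are printed.
inner=$(awk -F '|' 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } $1 in val { $(NF + 1) = val[$1]; print }' "$tmp/file2" "$tmp/file1")

# Without the "$1 in val" condition every file1 row is printed,
# with an empty trailing field when there is no partner.
outer=$(awk -F '|' 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } { $(NF + 1) = val[$1]; print }' "$tmp/file2" "$tmp/file1")

printf 'inner:\n%s\nouter:\n%s\n' "$inner" "$outer"
rm -rf "$tmp"
```

The second variant keeps the 789 row and appends an empty field to it, which shows up as a trailing | on that line.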
But it's probably better to use join:
join -1 1 -2 2 -t \| file1 file2
If you don't want the key field to be part of the output, pipe the output of either of those commands through cut -d \| -f 2- to get rid of it, i.e.
join -1 1 -2 2 -t \| file1 file2 | cut -d \| -f 2-
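One caveat: join expects both inputs to be sorted on the join field. A minimal sketch that sorts into temporary files first; the sample rows and the .sorted file names are made up, and LC_ALL=C is set so that sort and join agree on the ordering.

```shell
tmp=$(mktemp -d)
# Deliberately unsorted sample input.
printf '456|this|is|better\n123|this|is|good\n' > "$tmp/file1"
printf 'bbb|456\naaa|123\n' > "$tmp/file2"

# Sort each file on its own join column.
LC_ALL=C sort -t '|' -k 1,1 "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -t '|' -k 2,2 "$tmp/file2" > "$tmp/file2.sorted"

# join prints the key first; cut then drops it.
joined=$(LC_ALL=C join -t '|' -1 1 -2 2 "$tmp/file1.sorted" "$tmp/file2.sorted" | cut -d '|' -f 2-)
printf '%s\n' "$joined"
rm -rf "$tmp"
```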
If the files have the same number of lines in the same order, then
paste -d '|' file1 file2 | cut -d '|' -f 2-5
this|is|good|aaa
this|is|better|bbb
I see in a comment to Wintermute's answer that the files aren't sorted. With bash, process substitutions are handy to sort on the fly:
paste -d '|' <(sort -t '|' -k 1,1 file1) <(sort -t '|' -k 2,2 file2) |
cut -d '|' -f 2-5
To reiterate: this solution requires a one-to-one correspondence between the lines of the two files.

output of grep like "file:ln#:matchedpattern:paragraph" possible?

I use the following command to my liking, but perfection is better;-)
grep -w -i -r -n -f all.txt . > output.txt
./index.php:86:complete paragraph1
./index.php:89:complete paragraph2
With this:
grep -w -i -r -o -n -f all.txt . > output.txt
We get:
./index.php:86:match1
./index.php:89:match2
Is it also possible to get a combination of that? Like this:
./index.php:86:match1:complete paragraph1
./index.php:89:match2:complete paragraph2
That would be great. Even better would be just a part of the paragraph, but I guess that is a little much to ask for with such a simple tool;-)
Thanks!
grep doesn't have a facility for this, but it's easy to reimplement the useful parts in a simple Awk script.
awk 'NR==FNR { p[++i] = tolower($0); next }
{ line = tolower($0); for (j=1; j<=i; ++j) if (match(line, p[j]))
{ printf "%s:%i:%s:%s\n", FILENAME, FNR, substr($0, RSTART, RLENGTH), $0;
next } }' all.txt files...
The NR==FNR condition matches on the first input file. Each line in that file is converted to lowercase and read into the array p.
The second action only applies to the second and subsequent files. It loops over the items in p and checks whether the current line matches. If so, a match message is printed, and we skip to the next input line.
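Here is a small run of the script on made-up data, to show the output format. The contents of all.txt and index.php are hypothetical. Two things to keep in mind: the patterns are treated as regular expressions, and the whole-word behaviour of grep -w is not reproduced in this sketch.

```shell
tmp=$(mktemp -d)
printf 'match1\nmatch2\n' > "$tmp/all.txt"
printf 'nothing here\nLine with Match1 inside\nanother MATCH2 line\n' > "$tmp/index.php"

report=$(cd "$tmp" && awk 'NR==FNR { p[++i] = tolower($0); next }
  { line = tolower($0)
    for (j = 1; j <= i; ++j)
      if (match(line, p[j])) {
        # file : line number : matched text (original case) : full line
        printf "%s:%d:%s:%s\n", FILENAME, FNR, substr($0, RSTART, RLENGTH), $0
        next
      }
  }' all.txt index.php)
printf '%s\n' "$report"
rm -rf "$tmp"
```

To approximate grep -r, the same program could probably be driven by find, for example find . -type f -exec awk '...' all.txt {} +, as long as all.txt and the output file are kept out of the searched tree.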

Shell: Find Matching Lines Across Many Files

I am trying to use a shell script (well a "one liner") to find any common lines between around 50 files.
Edit: Note I am looking for a line (lines) that appears in all the files
So far I've tried grep: grep -v -x -f file1.sp *, which just matches that file's contents across ALL the other files.
I've also tried grep -v -x -f file1.sp file2.sp | grep -v -x -f - file3.sp | grep -v -x -f - file4.sp | grep -v -x -f - file5.sp etc... but I believe that uses the files to be searched as stdin, not as the patterns to match on.
Does anyone know how to do this with grep or another tool?
I don't mind if it takes a while to run, I've got to add a few lines of code to around 500 files and wanted to find a common line in each of them for it to insert 'after' (they were originally just c&p from one file so hopefully there are some common lines!)
Thanks for your time,
When I first read this I thought you were trying to find 'any common lines'. I took this as meaning "find duplicate lines". If this is the case, the following should suffice:
sort *.sp | uniq -d
Upon re-reading your question, it seems that you are actually trying to find lines that 'appear in all the files'. If this is the case, you will need to know the number of files in your directory:
find . -type f -name "*.sp" | wc -l
If this returns the number 50, you can then use awk like this:
WHINY_USERS=1 awk '{ array[$0]++ } END { for (i in array) if (array[i] == 50) print i }' *.sp
You can consolidate this process and write a one-liner like this:
WHINY_USERS=1 awk -v find="$(find . -type f -name "*.sp" | wc -l)" '{ array[$0]++ } END { for (i in array) if (array[i] == find) print i }' *.sp
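Note that the count above goes up once per occurrence, so a line repeated inside a single file can throw the result off. A possible variation, sketched under the assumption that every argument after the program text is a file name: count each line at most once per file and compare against ARGC - 1, which is then the number of files. The sample files are made up.

```shell
tmp=$(mktemp -d)
printf 'shared\nshared\nonly-a\n' > "$tmp/a.sp"   # "shared" appears twice here
printf 'shared\nonly-b\n' > "$tmp/b.sp"
printf 'only-c\nshared\n' > "$tmp/c.sp"

# seen[FILENAME, $0] is zero only the first time a line shows up in a file,
# so count[$0] ends up as the number of files containing that line.
common=$(awk '!seen[FILENAME, $0]++ { count[$0]++ }
  END { for (line in count) if (count[line] == ARGC - 1) print line }' "$tmp"/*.sp)
printf '%s\n' "$common"
rm -rf "$tmp"
```

With the plain per-occurrence count, "shared" would reach 4 in this sample and would not equal the file count of 3.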
old, bash answer (O(n); opens 2 * n files)
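From @mjgpy3's answer, you just have to make a for loop and use comm, like this:
#!/bin/bash
# Temporary files holding the running intersection
tmp1="/tmp/tmp1$RANDOM"
tmp2="/tmp/tmp2$RANDOM"
# Start with the first file, then intersect with each remaining one
cp "$1" "$tmp1"
shift
for file in "$@"
do
    # comm -1 -2 keeps only the lines common to both inputs (it expects sorted input)
    comm -1 -2 "$tmp1" "$file" > "$tmp2"
    mv "$tmp2" "$tmp1"
done
cat "$tmp1"
rm "$tmp1"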
Save in a comm.sh, make it executable, and call
./comm.sh *.sp
assuming all your filenames end with .sp.
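Since comm expects sorted input, files that are not already sorted need a sort step first. A possible variation as a sketch: common_lines is a made-up helper name, the sample files are hypothetical, and LC_ALL=C keeps sort and comm on the same ordering. As a side effect, sort -u also removes duplicate lines, so the result comes out sorted and de-duplicated.

```shell
common_lines() {
    acc=$(mktemp)      # running intersection
    sorted=$(mktemp)   # sorted copy of the current file
    next=$(mktemp)     # scratch file for the new intersection
    LC_ALL=C sort -u "$1" > "$acc"
    shift
    for f in "$@"; do
        LC_ALL=C sort -u "$f" > "$sorted"
        LC_ALL=C comm -12 "$acc" "$sorted" > "$next"
        mv "$next" "$acc"
    done
    cat "$acc"
    rm -f "$acc" "$sorted" "$next"
}

# Demo on three unsorted throwaway files.
tmp=$(mktemp -d)
printf 'zeta\nshared\nalpha\n' > "$tmp/a.sp"
printf 'shared\nbeta\nalpha\n' > "$tmp/b.sp"
printf 'alpha\nshared\ngamma\n' > "$tmp/c.sp"
common=$(common_lines "$tmp"/*.sp)
printf '%s\n' "$common"
rm -rf "$tmp"
```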
Updated answer, python, opens only each file once
Looking at the other answers, I wanted to give one that opens each file only once, uses no temporary files, and supports duplicated lines. Additionally, let's process the files in parallel.
Here you go (in python3):
#!/usr/bin/env python3
import argparse
import sys
import multiprocessing
import os

EOLS = {'native': os.linesep.encode('ascii'), 'unix': b'\n', 'windows': b'\r\n'}

def extract_set(filename):
    # read the file as bytes and return its lines as a set, line endings stripped
    with open(filename, 'rb') as f:
        return set(line.rstrip(b'\r\n') for line in f)

def find_common_lines(filenames):
    # build one set per file in parallel, then intersect them all
    pool = multiprocessing.Pool()
    line_sets = pool.map(extract_set, filenames)
    return set.intersection(*line_sets)

if __name__ == '__main__':
    # usage info and argument parsing
    parser = argparse.ArgumentParser()
    parser.add_argument("in_files", nargs='+',
                        help="find common lines in these files")
    parser.add_argument('--out', type=argparse.FileType('wb'),
                        help="the output file (default stdout)")
    parser.add_argument('--eol-style', choices=EOLS.keys(), default='native',
                        help="(default: native)")
    args = parser.parse_args()

    # actual stuff
    common_lines = find_common_lines(args.in_files)

    # write results to output
    to_print = EOLS[args.eol_style].join(common_lines)
    if args.out is None:
        # find out stdout's encoding, utf-8 if absent
        encoding = sys.stdout.encoding or 'utf-8'
        sys.stdout.write(to_print.decode(encoding))
    else:
        args.out.write(to_print)
Save it into a find_common_lines.py, and call
python ./find_common_lines.py *.sp
More usage info with the --help option.
Combining these two answers (ans1 and ans2), I think you can get the result you need without sorting the files:
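#!/bin/bash
ans="matching_lines"
for file1 in *
do
    for file2 in *
    do
        # Skip the results file itself and never compare a file with itself
        if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ] ; then
            echo "Comparing: $file1 $file2 ..." >> "$ans"
            # Print a line of file2 the first time it is seen after appearing in file1
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
        fi
    done
done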
Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files present in the current working directory and will make an all-vs-all comparison leaving in the "matching_lines" file the result.
Things to be improved:
Skip directories
Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
Maybe add the line number next to the matching string
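A possible way to cover these three points, sketched with plain grep instead of the perl one-liner (so this is a swapped-in technique, not the script above): each unordered pair is visited once, anything that is not a regular file is skipped, and grep -n adds the line number within the second file. The function name compare_all_pairs and the sample files are made up.

```shell
compare_all_pairs() {
    # Take the first argument, compare it with everything after it, repeat.
    while [ "$#" -gt 1 ]; do
        first=$1
        shift
        [ -f "$first" ] || continue            # skip directories and the like
        for second in "$@"; do
            [ -f "$second" ] || continue
            echo "Comparing: $first $second ..."
            # -F fixed strings, -x whole-line matches, -n line numbers in $second.
            # grep exits with 1 when nothing matches; that is not an error here.
            grep -n -F -x -f "$first" "$second" || true
        done
    done
}

# Demo: three files and one directory.
tmp=$(mktemp -d)
mkdir "$tmp/sub"
printf 'alpha\nshared\n' > "$tmp/a.txt"
printf 'shared\nbeta\n' > "$tmp/b.txt"
printf 'gamma\nalpha\n' > "$tmp/c.txt"
pairs=$(cd "$tmp" && compare_all_pairs *)
printf '%s\n' "$pairs"
rm -rf "$tmp"
```

Redirect the output of the call to a file outside the directory being compared, so the results file does not get compared with the others.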
Hope this helps.
Best,
Alan Karpovsky
See this answer. I originally thought a diff sounded like what you were asking for, but this answer seems much more appropriate.

Resources