Extracting n rows of text from a large csv file - grep

I have a CSV file (foo.csv) with 200,000 rows. I need to break it into four files (foo1.csv, foo2.csv... etc.) with 50,000 rows each.
I already tried simple ctrl-v/-c using gui text editors, but the my computer slows to a halt.
What unix command(s) could I use to accomplish this task?

I don't have a terminal handy to try it out, but it should be just split -d -l 50000 foo.csv.
Hopefully the naming isn't terribly important because with the -d option, the output files will be named foo.csv00 .. foo.csv03. You can add the -a 1 option so that the suffixes are 0-3, but there's no simple way to get the suffix to be injected into the middle of the filename.

you should use head and tail.
head -n 50000 myfile > part1.csv
head -n 100000 myfile | tail -n 50000 > part2.csv
head -n 150000 myfile | tail -n 50000 > part3.csv
etc ...
Else, but with no control on file names, you can use unix command split.

sed -n 2000,4000p somefile.txt
will print from lines 2000 to 4000 to stdout.

split -l50000 foo.csv

You can use sed

I wrote this little shell script for this topic very similar at yours.
This shell script + awk works fine for me:
#!/bin/bash
awk -v initial_line=$1 -v end_line=$2 '{
if (NR >= initial_line && NR <= end_line)
print $0
}' $3
Used with this sample file (file.txt):
one
two
three
four
five
six
The command (it will extract from second to fourth line in the file):
edu#debian5:~$./script.sh 2 4 file.txt
Output of this command:
two
three
four
Of course, you can improve it, for example by testing that all argument values are the expected :-)

Related

Output file much larger than input files after cat + grep

I have 18 csv files, all between 1mb and 14mb. The sum of all files is 64mb. I want to create a new csv file that contains a subset of those files-- only the lines featuring the pattern "Hello" (or "HELLO", or "hello" ...). Here's what I'm doing
cat *.csv | head -n 1 > new.csv # I want to create a header first
cat *.csv | grep -i "hello" >> new.csv
I'm running Debian on WSL. The output file is much, much larger than the original 64mb (I stopped the process after 1+ hour, and the file was 300+ GB).
How can a subset of a text file be larger than the original files? Does it have anything to do with WSL?
This is not an OS issue. When you redirect your output to new.csv, shell creates that file first, before the glob expression *.csv is evaluated. That means the expansion of *.csv would include new.csv as well. That seems like the root cause of the recursive grep issue you are facing.
You are reading all the files twice, which is not necessary. You can make your operation a lot simpler and efficient with a single awk command:
awk 'NR==1 {print} tolower($0) ~ /hello/ {print}' *.csv > csv.new
mv csv.new new.csv
since the output file is named csv.new it won't interfere with the glob *.csv
NR==1 picks up the first line (header) from the very first file
The awk command can be written more succinctly as:
awk 'NR==1 || tolower($0) ~ /hello/' *.csv > csv.new
You are using *.csv and redirecting the output to new.csv which falls under *.csv which is causing recursion in grep result. perhaps you can try,
grep -i hello *.csv --exclude="new.csv" >> new.csv

run command taking two arguments with GNU parallel

I have a perl program that takes two arguments, dictionary file composed of
english words one per line, and file with concatenated words also one per
line, something like this:
lovetoplayguitar
...
...
So normally program is used like:
perl ./splitwords.pl words-en.txt bigfile.txt
It prints results to stdout.
I am trying to put it through GNU parallel like this:
time parallel -n 2 -j8 -k perl ./splitwords.pl {1} {2} ::: words-en.txt bigfile.txt > splitted.txt
but it doesn't work that way.. Tried many combinations so far but was unable
to run it using parallel.
EDIT
Actually this seems to be working, however it is using only one core..? Why..?
This will chop bigfile into 1 MB chunks:
cat bigfile.txt | parallel --pipe --cat -k perl ./splitwords.pl words-en.txt {}
If the perlscript only reads the file then this will be faster:
cat bigfile.txt | parallel --pipe --fifo -k perl ./splitwords.pl words-en.txt {}

grep using a list to find matches in a file, and print only the first occurrence for each string in the list

I have a file, for example, "queries.txt" that has hard return separated strings. I want to use this list to find matches in a second file, "biglist.txt".
"biglist.txt" may have multiple matches for each string in "queries.txt". I want to return only the first hit for each query and write this to another file.
grep -m 1 -wf queries.txt biglist.txt > output
only gives me one line in output. I should have output that is the same number of lines as queries.txt.
Any suggestions for this? Many thanks! I searched for past questions but did not find one that was exactly the same sort of case after a few minutes of reading.
If you want to "reset the counter" after each file, you could do
cat queries.txt | xargs -I{} grep -m 1 -w {} biglist.txt > output
This uses xargs to call grep once for each line in the input… should do the trick for you.
Explanation:
cat queries.txt - produce one "search word" per line
xargs -I{} - take the input one line at a time, and insert it at {}
grep -m 1 -w - find only one match of a whole word
{} - this is where xargs inserts the search term (once per call)
biglist.txt - the file to be searched
> output - the file where the result is to be written
An alternate method without xargs (which one should indeed learn):
(this method assumes there are no spaces in the lines in queries.txt)
cat queries.txt | while read target; do grep -m 1 $target biglist.txt; done > outr
I might not fully understand your question, but it sounds like something like this might work.
cat queries.txt | while read word; do grep "$word" biglist.txt | tee -a output.txt; done

Shell: Find Matching Lines Across Many Files

I am trying to use a shell script (well a "one liner") to find any common lines between around 50 files.
Edit: Note I am looking for a line (lines) that appears in all the files
So far i've tried grep grep -v -x -f file1.sp * which just matches that files contents across ALL the other files.
I've also tried grep -v -x -f file1.sp file2.sp | grep -v -x -f - file3.sp | grep -v -x -f - file4.sp | grep -v -x -f - file5.sp etc... but I believe that searches using the files to be searched as STD in not the pattern to match on.
Does anyone know how to do this with grep or another tool?
I don't mind if it takes a while to run, I've got to add a few lines of code to around 500 files and wanted to find a common line in each of them for it to insert 'after' (they were originally just c&p from one file so hopefully there are some common lines!)
Thanks for your time,
When I first read this I thought you were trying to find 'any common lines'. I took this as meaning "find duplicate lines". If this is the case, the following should suffice:
sort *.sp | uniq -d
Upon re-reading your question, it seems that you are actually trying to find lines that 'appear in all the files'. If this is the case, you will need to know the number of files in your directory:
find . -type f -name "*.sp" | wc -l
If this returns the number 50, you can then use awk like this:
WHINY_USERS=1 awk '{ array[$0]++ } END { for (i in array) if (array[i] == 50) print i }' *.sp
You can consolidate this process and write a one-liner like this:
WHINY_USERS=1 awk -v find=$(find . -type f -name "*.sp" | wc -l) '{ array[$0]++ } END { for (i in array) if (array[i] == find) print i }' *.sp
old, bash answer (O(n); opens 2 * n files)
From #mjgpy3 answer, you just have to make a for loop and use comm, like this:
#!/bin/bash
tmp1="/tmp/tmp1$RANDOM"
tmp2="/tmp/tmp2$RANDOM"
cp "$1" "$tmp1"
shift
for file in "$#"
do
comm -1 -2 "$tmp1" "$file" > "$tmp2"
mv "$tmp2" "$tmp1"
done
cat "$tmp1"
rm "$tmp1"
Save in a comm.sh, make it executable, and call
./comm.sh *.sp
assuming all your filenames end with .sp.
Updated answer, python, opens only each file once
Looking at the other answers, I wanted to give one that opens once each file without using any temporary file, and supports duplicated lines. Additionally, let's process the files in parallel.
Here you go (in python3):
#!/bin/env python
import argparse
import sys
import multiprocessing
import os
EOLS = {'native': os.linesep.encode('ascii'), 'unix': b'\n', 'windows': b'\r\n'}
def extract_set(filename):
with open(filename, 'rb') as f:
return set(line.rstrip(b'\r\n') for line in f)
def find_common_lines(filenames):
pool = multiprocessing.Pool()
line_sets = pool.map(extract_set, filenames)
return set.intersection(*line_sets)
if __name__ == '__main__':
# usage info and argument parsing
parser = argparse.ArgumentParser()
parser.add_argument("in_files", nargs='+',
help="find common lines in these files")
parser.add_argument('--out', type=argparse.FileType('wb'),
help="the output file (default stdout)")
parser.add_argument('--eol-style', choices=EOLS.keys(), default='native',
help="(default: native)")
args = parser.parse_args()
# actual stuff
common_lines = find_common_lines(args.in_files)
# write results to output
to_print = EOLS[args.eol_style].join(common_lines)
if args.out is None:
# find out stdout's encoding, utf-8 if absent
encoding = sys.stdout.encoding or 'utf-8'
sys.stdout.write(to_print.decode(encoding))
else:
args.out.write(to_print)
Save it into a find_common_lines.py, and call
python ./find_common_lines.py *.sp
More usage info with the --help option.
Combining this two answers (ans1 and ans2) I think you can get the result you are needing without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
for file2 in *
do
if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ] ; then
echo "Comparing: $file1 $file2 ..." >> $ans
perl -ne 'print if ($seen{$_} .= #ARGV) =~ /10$/' $file1 $file2 >> $ans
fi
done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files present in the current working directory and will make an all-vs-all comparison leaving in the "matching_lines" file the result.
Things to be improved:
Skip directories
Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
Maybe add the line number next to the matching string
Hope this helps.
Best,
Alan Karpovsky
See this answer. I originally though a diff sounded like what you were asking for, but this answer seems much more appropriate.

How to find a pattern and surrounding content in a very large SINGLE line file?

I have a very large file 100Mb+ where all the content is on one line.
I wish to find a pattern in that file and a number of characters around that pattern.
For example I would like to call a command like the one below but where -A and -B are number of bytes not lines:
cat very_large_file | grep -A 100 -B 100 somepattern
So for a file containing content like this:
1234567890abcdefghijklmnopqrstuvwxyz
With a pattern of
890abc
and a before size of -B 3
and an after size of -A 3
I want it to return:
567890abcdef
Any tips would be great.
Many thanks.
You could try the -o option:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
and use a regular expression to match your pattern and the 3 preceding/following characters i.e.
grep -o -P ".{3}pattern.{3}" very_large_file
In the example you gave, it would be
echo "1234567890abcdefghijklmnopqrstuvwxyz" > tmp.txt
grep -o -P ".{3}890abc.{3}" tmp.txt
Another one with sed (you may need it on systems where GNU grep is not available):
sed -n '
s/.*\(...890abc...\).*/\1/p
' infile
Best way I can think of doing this is with a tiny Perl script.
#!/usr/bin/perl
$pattern = $ARGV[0];
$before = $ARGV[1];
$after = $ARGV[2];
while(<>) {
print $& if( /.{$before}$pattern.{$after}/ );
}
You would then execute it thusly:
cat very_large_file | ./myPerlScript.pl 890abc 3 3
EDIT: Dang, Paolo's solution is much easier. Oh well, viva la Perl!

Resources