I have a Ruby map/reduce pipeline where the input file is cat'ed to the mapper and the stages are connected over standard input/output:
cat input.csv | mapper.rb | sort | reducer.rb > output.csv
There's a line in mapper.rb which looks something like this:
ARGF.each do |line|
  (field1, field2, field3) = line.split("\t")
  # etc...
end
How would I do this in Elixir?
Also, I've read somewhere that File.stream! is much faster than IO.stream. In this particular case, I could eliminate cat and load the file directly if it's much faster that way.
Use IO.stream for this. File.stream! is faster when you are reading a file directly, but that doesn't apply when the data arrives on standard input, so don't worry about it here.
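A minimal sketch of the mapper in Elixir, assuming tab-separated input arriving on standard input (the field names are just illustrative):
# mapper.exs: the Elixir counterpart of the ARGF loop above
IO.stream(:stdio, :line)
|> Stream.map(&String.trim_trailing(&1, "\n"))
|> Stream.map(&String.split(&1, "\t"))
|> Enum.each(fn [field1, field2, field3 | _rest] ->
  # per-record work goes here; this just re-emits the fields tab-separated
  IO.puts("#{field1}\t#{field2}\t#{field3}")
end)
Run it with cat input.csv | elixir mapper.exs | sort | ... just like the Ruby version. If you do decide to drop cat and read the file directly, File.stream!("input.csv") can take the place of IO.stream(:stdio, :line) in the same pipeline.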
I'm receiving a CSV file that always includes extra lines at the end which I'd like to remove before copying the data into the postgresql database of my rails app.
I can't use head with a negative argument because I'm on MacOS X.
What's a clean and efficient way to pre-process this file?
Right now I'm doing this, but I'm wondering if there is a less mish-mash way:
# Removes last n rows from the file located at PATH
total = `wc -c < #{PATH}`.strip.to_i
chop_index = `tail -n #{n} #{PATH} | wc -c`.strip.to_i
`dd if=/dev/null of=#{PATH} seek=1 bs=#{total - chop_index}`
This is about the simplest way I can think of to do this in pure Ruby that also works for large files, since it processes one line at a time instead of reading the whole file into memory:
INFILE = "input.txt"
OUTFILE = "output.txt"
total_lines = File.foreach(INFILE).inject(0) { |c, _| c+1 }
desired_lines = total_lines - 4
# open output file for writing
File.open(OUTFILE, 'w') do |outfile|
# open input file for reading
File.foreach(INFILE).with_index do |line, index|
# stop after reaching the desired line number
break if index == desired_lines
# copy lines from infile to outfile
outfile << line
end
end
However, this is about twice as slow as what you posted on a 160 MB file I created. You can shave off about a third by using wc to get the total line count and pure Ruby for the rest:
total_lines = `wc -l < #{INFILE}`.strip.to_i
# ...then compute desired_lines and run the same File.open / File.foreach loop as above
Another caveat: this assumes your CSV has no line breaks of its own within any cell content. If it does, you would need a CSV parser, and CSV.foreach(INFILE) do |row| could be used instead, though that was quite a bit slower in my limited testing. But you mentioned above that your cells should be OK to process line by line.
That said, what you posted using wc and dd is much faster, so maybe you should keep using that.
I'm trying to do a simple scripting task, but I have a serious lack of AWK knowledge and I can't work out exactly how to accomplish this silly task.
Basically I have a very big regular vhost.conf with hundreds of domains.
The idea is just to iterate over (or parse) this single file and get a list of ServerName and DocumentRoot values.
The file is divided into multiple <VirtualHost> blocks. If I run this command:
grep -E "DocumentRoot|ServerName" /etc/httpd/conf.d/vhost-pro.conf | awk '!/#/{print $2}'
I get output like this:
/home/webs/t2m/PRO/default
t2m.net
/home/webs/uoc/PRO/default
uoc.com
So... now, how do I process this output? If I were able to concatenate the path and the domain name onto a single line, maybe I could store it in an array or in a file and then take the info piece by piece. But I simply don't know how to do it.
Any clue or tip about how to proceed?
Thanks!
I would suggest something like this:
awk '/<VirtualHost/ { sn=""; dr="" }
/ServerName/ { sn=$2 }
/DocumentRoot/ { dr = $2 }
/\/VirtualHost/ { print dr, sn }' /etc/httpd/conf.d/vhost-pro.conf
I appreciate your comments. I will try the pr command as well. I'm going to share how I did it. I was lucky to find some kind of awk magic. To be honest, it worked, but I don't know what I'm doing, and that is bad for me :P
I had this output:
/home/webs/t2m/PRO/default
t2m.net
/home/webs/uoc/PRO/default
uoc.com
So what I did to turn it into just one line was:
grep -E "DocumentRoot|ServerName" /etc/httpd/conf.d/vhost-pro.conf | awk '{key=$2;getline;print key " " $2}'
this way my new output is something like:
/home/webs/t2m/PRO/default t2m.net
/home/webs/uoc/PRO/default uoc.com
Then I think I will store this in a temp file to iterate over later and store it in a var, something like the loop sketched below.
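For that last step, a minimal sketch (assuming the one-pair-per-line output above is saved to a hypothetical /tmp/vhosts.txt and the paths contain no spaces):
grep -E "DocumentRoot|ServerName" /etc/httpd/conf.d/vhost-pro.conf \
  | awk '{key=$2; getline; print key " " $2}' > /tmp/vhosts.txt

# read the pairs back, one docroot/domain per iteration
while read -r docroot domain; do
  echo "domain: $domain -> docroot: $docroot"
done < /tmp/vhosts.txt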
Thanks!
Any help would be greatly appreciated. I can read code and figure it out, but I have trouble writing from scratch.
I need help starting a ksh script that would search a file for multiple strings and write each line containing one of those strings to an output file.
If I use the following command:
$ grep "search pattern" file >> output file
...that does what I want it to. But I need to search multiple strings, and write the output in the order listed in the file.
Again... any help would be great! Thank you in advance!
Have a look at the regular expression manuals. You can specify multiple strings in the search expression, such as grep -E "John|Bill" (plain grep treats the | as a literal character; -E or egrep enables alternation).
man grep will teach you a lot about regular expressions, but there are several online sites where you can try them out, such as regex101 and (more colorful) regexr.
Sometimes you need egrep.
egrep "first substring|second substring" file
When you have a lot of substrings you can put them in a variable first:
findalot="first substring|second substring"
findalot="${findalot}|third substring"
findalot="${findalot}|find me too"
skipsome="notme"
skipsome="${skipsome}|dirty words"
egrep "${findalot}" file | egrep -v "${skipsome}"
Use "-f" in grep .
Write all the strings you want to match in a file ( lets say pattern_file , the list of strings should be one per line)
and use grep like below
grep -f pattern_file file > output_file
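For example, with a hypothetical pattern_file reusing the names from the answer above:
$ cat pattern_file
John
Bill
$ grep -f pattern_file file > output_file
Matched lines are written in the order they appear in file (grep preserves the input order). If your strings are plain text rather than regular expressions, add -F so they are matched literally.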
I need to find some matching terms from a file, and then recursively search for the next term only in the files that matched previously. I have something like this:
input.txt
123
22
33
These terms need to be found in the files below. The challenge is that if 123 is found in, say, 10 files, then 22 should be searched for in those 10 files only, and so on...
The files are named like f1, f2, f3, f4, ..., f1200.
So it is like I need to do grep -w "123" f* | grep -w "22" | ...
It's not possible to list them manually, so is there an easier way?
You can solve this using an awk script. I've encountered a similar problem and this works fine:
awk 'NR > 1 { printf("|") } { printf("grep -w %d f*", $1) }' input.txt | sh
What does it do?
It reads input.txt line by line.
For every term it prints grep -w <term> f*, with a pipe (|) printed between consecutive commands so the terms end up chained into one pipeline.
The generated command line is then sent to the shell for execution.
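With the sample input.txt above, the command string that gets handed to sh looks like this:
grep -w 123 f*|grep -w 22 f*|grep -w 33 f*
Note that every generated grep still gets the f* file arguments, so the later stages search all the files again rather than only the ones that matched earlier; the grep -lw approach below narrows the file list between steps.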
Perhaps taking a meta-programming viewpoint would help. Have grep output a series of grep commands. Or write a little Perl program. Maybe Ruby, if the mood suits.
You can use grep -lw to get the list of file names that matched (note that with -l, grep stops reading each file after it finds the first match).
You capture the list of file names and use that for the next iteration in a loop.
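A minimal sketch of that loop (assuming the file names f1...f1200 contain no spaces):
# start with every candidate file, then keep only the ones that match each term
files=$(ls f*)
while read -r term; do
  files=$(grep -lw -- "$term" $files)
  [ -n "$files" ] || break   # stop early if nothing matches any more
done < input.txt

echo "$files"   # the files that contain every term in input.txt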
I have a pattern.txt file which looks like this:
2gqt+FAD+A+601 2i0z+FAD+A+501
1n1e+NDE+A+400 2qzl+IXS+A+449
1llf+F23+A+800 1y0g+8PP+A+320
1ewf+PC1+A+577 2a94+AP0+A+336
2ydx+TXP+E+1339 3g8i+RO7+A+1
1gvh+HEM+A+1398 1v9y+HEM+A+1140
2i0z+FAD+A+501 3m2r+F43+A+1
1h6d+NDP+A+500 3rt4+LP5+C+501
1w07+FAD+A+1660 2pgn+FAD+A+612
2qd1+PP9+A+701 3gsi+FAD+A+902
There is another file called data (approx. 8 GB in size) which has lines like this:
2gqt+FAD+A+601 2i0z+FAD+A+501 0.874585 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.145278 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.784512 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.362542 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.251452 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.784521 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.369856 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.925478 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.584785 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.874526 0.125453
However, the data file is not as simple as it looks above. It is so large because there are approx. 18,000 lines beginning with each string found in the first column, i.e. 18,000 lines beginning with 2gqt+FAD+A+601, followed by 18,000 lines beginning with 1n1e+NDE+A+400, and so on. But only one of those lines will match the given pair from pattern.txt.
I am trying to match the lines in pattern.txt with data and want to print out:
2gqt+FAD+A+601 2i0z+FAD+A+501 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.125453
As of now I am using something in perl, like this:
use warnings;

open AS, "combi_output_2_fixed.txt";
open AQ, "NAMES.txt";

@arr  = <AS>;
@arr1 = <AQ>;

foreach $line (@arr)
{
    @split = split(' ', $line);
    foreach $line1 (@arr1)
    {
        @split1 = split(' ', $line1);
        if ($split[0] eq $split1[0] && $split[1] eq $split1[1])
        { print $split1[0], "\t", $split1[1], "\t", $split1[3], "\n"; }
    }
}

close AQ;
close AS;
Doing this uses up all the memory and shows an "Out of memory" error message.
I am aware that this can be done using grep, but I do not know how to do it.
Can anyone please let me know how I can do this using grep -F, and without using up all the memory?
Thanks.
Does pattern.txt fit in memory?
If it does, you could use a command like grep -F -f pattern.txt data.txt to match lines in data.txt against the patterns. You would get the full line though, and extra processing would be required to get only the second column of numbers.
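For example, something along these lines keeps memory use low, since grep streams data.txt and only pattern.txt is held in memory; the awk step then keeps the first two fields and the second number, as in your desired output:
grep -F -f pattern.txt data.txt | awk '{ print $1, $2, $4 }'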
Or you could fix the Perl script. The reason you run out of memory is that you read the 8 GB file entirely into memory, when you could be processing it line by line, like grep does. For the 8 GB file you should use code like this:
open FH, "<", "data.txt";
while ($line = <FH>) {
# check $line against list of patterns ...
}
Try this:
grep "`more pattern.txt`" data.txt | awk -F' ' '{ print $1 " " $2 " " $4}'