I have a very large CSV file, ~800,000 lines. I would like to attempt to process this file in parallel to speed up my script.
How does one use Ruby to break a file into n smaller pieces?
Breaking up the CSV file into chunks is in order, but you have to keep in mind that each chunk needs to keep the first line with the CSV header!
So UNIX 'split' will not cut it!
You'll have to write your own little Ruby script which reads the first line and stores it in a variable, then distributes the next N lines to a new partial CSV file, copying the CSV header line into it first, and so on.
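Something along these lines should do it (a rough sketch; the file names and chunk size are just placeholders):
lines_per_chunk = 10_000   # placeholder; tune to taste

File.open("input.csv") do |input|
  header = input.gets                      # keep the CSV header line
  chunk_index = 0
  until input.eof?
    File.open("chunk_#{chunk_index}.csv", "w") do |out|
      out.write(header)                    # every chunk starts with the header
      lines_per_chunk.times do
        line = input.gets or break
        out.write(line)
      end
    end
    chunk_index += 1
  end
end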
After creating each file with the header and a chunk of lines, you could then use Resque to enqueue those files for parallel processing by Resque workers.
http://railscasts.com/episodes/271-resque
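For reference, the Resque side might look roughly like this (the class, queue and chunk file names are just placeholders):
require 'csv'
require 'resque'

class CsvChunkJob
  @queue = :csv_chunks

  # Resque runs this in a worker process for each enqueued chunk file
  def self.perform(chunk_path)
    CSV.foreach(chunk_path, headers: true) do |row|
      # process one row
    end
  end
end

# enqueue one job per chunk file
Dir.glob("chunk_*.csv") { |path| Resque.enqueue(CsvChunkJob, path) }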
For CSV files, you can do this:
open("your_file.csv").each_line do |line|
  # do your stuff here, like splitting the line into fields
  fields = line.split(",")
  # or store the lines in an array
  some_array << line
  # or write them back to a file
  some_file_handler << line
end
By storing lines (or split lines) in an array (memory) or a file, you can break a large file into smaller pieces. After that, threads can be used to process each piece:
threads = []
1.upto(5) { |i| threads << Thread.new { process_piece(pieces[i]) } }  # process_piece is whatever work you need done on piece i
threads.each(&:join)
Note that you are responsible for thread safety.
Hope this helps!
Update: following pguardiario's advice, we can use CSV from the standard library instead of opening the file directly.
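For example (a minimal sketch, assuming the file has a header row):
require 'csv'

CSV.foreach("your_file.csv", headers: true) do |row|
  # CSV.foreach streams the file row by row instead of loading it all at once
end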
I would use the Linux split command to split this file into many smaller files, then process those smaller parts.
I have a BigQuery table where each row represents a text file (gs://...) and a line number.
file, line, meta
file1.txt, 10, meta1
file2.txt, 12, meta2
file1.txt, 198, meta3
Each file is about 1.5GB and there are about 1k files in my bucket. My goal is to extract the lines specified in the BQ table.
I decided to implement the following plan:
Map table => KV<file,line>
Reduce KV<file,line> => KV<file, [lines]>
Map KV<file, [lines]> => [KV<file, rowData>]
where rowData means the actual data read from the file at one of the line numbers in lines.
If I read the docs and SO correctly, TextIO.Read isn't meant to be used in such conditions. As a workaround I can use GcsIoChannelFactory to read files from GCS. Is that correct? Is it the preferable approach for the described task?
Yes, your approach is correct. There is currently no better approach to reading lines with line numbers from text files, except for doing it yourself using GcsIoChannelFactory (or writing a custom FileBasedSource, but this is more complex, and wouldn't work in your case because the filenames are not known in advance).
This and other similar scenarios will get much better with Splittable DoFn - work on that is in progress, but it is a large amount of work, so no timeline yet.
I had a pre-interview task, which I completed and the solution works; however, I was marked down and did not get an interview due to having used a TADODataset. I basically imported a CSV file which populated the dataset; the data had to be processed in a specific way, so I used filtering and sorting of the dataset to make sure the data was ordered the way I wanted it, and then did the logic processing in a while loop. The feedback that was received said that this was bad as it would be very slow for large files.
My main question here is: if using an in-memory dataset is slow for processing large files, what would have been a better way to access the information from the CSV file? Should I have used String Lists or something like that?
It really depends on how "big" the file is and on the resources (in this case RAM) available for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around (in most cases that I've encountered, files are ~1MB up to ~10MB, but that's not to say others wouldn't dump more data in CSV format) without worrying too much (if at all) about import/export, since the format is extremely simplistic.
Suppose you have an 80MB CSV file; that's a file you want to process in chunks, otherwise (depending on your processing) you can eat up hundreds of MB of RAM. In that case, what I would do is:
while dataToProcess do begin
  // step 1: read <X> lines from the file, where <X> is the maximum number of
  //         lines you read in one go; if fewer lines are left to process
  //         (i.e. you're down to 50 lines and X is 100), read just those
  // step 2: process the information
  // step 3: generate output, database inserts, etc.
end;
In the above case you're not loading 80MB of data into RAM, but only a few hundred KB, and the rest of the RAM is free for the processing itself, i.e. linked lists, dynamic insert queries (batch inserts), etc.
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised; they were probably looking to see whether you're capable of creating algorithm(s) and providing simple solutions on the spot, without using "ready-made" solutions.
They were probably expecting to see you use dynamic arrays and create one (or more) sorting algorithm(s).
"Should I have used String Lists or something like that?"
The response might have been the same; again, I think they wanted to see how you "work".
The interviewer was quite right.
The correct, scalable and fastest solution for any medium-sized file upwards is to use an 'external sort'.
An 'External Sort' is a 2 stage process, the first stage being to split each file into manageable and sorted smaller files. The second stage is to merge these files back into a single sorted file which can then be processed line by line.
It is extremely efficient on any CSV file with, say, over 200,000 lines. The amount of memory the process uses can be controlled, and thus the danger of running out of memory can be eliminated.
I have implemented many such sort processes and in Delphi would recommend a combination of TStringList, TList and TQueue classes.
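For illustration only, here is what the two stages might look like in outline, sketched in Ruby rather than Delphi (the chunk size, file names and sort key are placeholders):
require 'tempfile'

chunk_size = 100_000        # lines per sorted chunk; tune to available memory
chunk_files = []

# Stage 1: split the input into individually sorted temporary files.
File.open("big.csv") do |input|
  header = input.gets                       # set the CSV header aside
  until input.eof?
    lines = []
    lines << input.gets while lines.size < chunk_size && !input.eof?
    lines.sort!                             # sort by whatever key you need
    chunk = Tempfile.new("chunk")
    chunk.write(lines.join)
    chunk.rewind
    chunk_files << chunk
  end
end

# Stage 2: merge the sorted chunks into one sorted stream, line by line.
heads = chunk_files.map(&:gets)
until heads.all?(&:nil?)
  i = heads.each_index.reject { |j| heads[j].nil? }.min_by { |j| heads[j] }
  sorted_line = heads[i]
  heads[i] = chunk_files[i].gets
  # process sorted_line here (or append it to an output file)
end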
Good Luck
I am reading a file that is 10MB in size and contains some IDs. I read them into a list in Ruby. I am concerned that this might cause memory issues in the future, when the number of IDs in the file increases. Is there an effective way of reading a large file in batches?
Thank you
With Lazy Enumerators and each_slice, you can get the best of both worlds. You don't need to worry about cutting lines in the middle, and you can iterate over multiple lines in a batch. batch_size can be chosen freely.
header_lines = 1
batch_size = 2000
File.open("big_file") do |file|
file.lazy.drop(header_lines).each_slice(batch_size) do |lines|
# do something with batch of lines
end
end
It could be used to import a huge CSV file into a database:
require 'csv'
batch_size = 2000
File.open("big_data.csv") do |file|
headers = file.first
file.lazy.each_slice(batch_size) do |lines|
csv_rows = CSV.parse(lines.join, headers: headers)
# do something with 2000 csv rows, e.g. bulk insert them into a database
end
end
There's no universal way.
1) You can read the file in chunks:
File.open('filename', 'r') do |f|
  while chunk = f.read(2048)
    # ... process the chunk here
  end
end
Disadvantage: you can miss a substring if it falls across a chunk boundary, i.e. you're looking for "SOME_TEXT", but "SOME_" is the last 5 bytes of the first 2048-byte chunk and "TEXT" is the first 4 bytes of the second.
2) You can read the file line by line:
File.open('filename', 'r') do |f|
  while line = f.gets
    # ... process the line here
  end
end
Disadvantage: this way it'd be 2x-5x slower than the first method.
If you're this worried about speed and memory efficiency, have you considered shelling out to grep, awk, sed, etc.? If I knew a bit more about the structure of the input file and what you're trying to extract, I could potentially construct a command for you.
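For example, if the goal were simply to pull out the lines containing a fixed string (a made-up case), something like this would do it without holding the whole file in Ruby's memory:
matching_lines = `grep 'SOME_TEXT' filename`.lines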
Hey. How can I get the total number of rows in a file (I do not want to do it with a loop)? I'm reading a CSV file.
Example 1
CSV.open('clients.csv', 'r')
Example 2
FasterCSV.foreach('clients.csv')
Thx.
How large is your file?
This option loads the entire file into memory, so if there are size/memory concerns it might not work.
numrows = FasterCSV.read('clients.csv').size
This option uses Ruby's built-in CSV module, which as you know is quite slow, but it does work. It also loads the entire file into memory:
numrows = CSV.readlines('clients.csv').size
Both FasterCSV.read and CSV.readlines return arrays of arrays, so you can use any array magic you want on the results.
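If the file really is too large to load, one lower-memory option (not covered above, and assuming a reasonably recent CSV library) is to let CSV stream the file and just count the rows:
require 'csv'
numrows = CSV.foreach('clients.csv').count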
I have large data files of values on a 2D grid.
They are organized such that subsequent rows of data in the grid are subsequent lines in the file.
Each column is separated by a tab character.
Essentially, this is a CSV file, but with tabs instead of commas.
I need to transpose the data (first row becomes first column) and output it to another file. What's the best way to do this? Any language is okay (I prefer to use Perl or C/C++). Currently, my Perl script just reads the entire file into memory, but I have files which are simply gigantic.
The simplest way would be to make multiple passes through your input, extracting a subset of columns on each pass. The number of columns would be determined by how much memory you wanted to use and how many rows are in the input file.
For example:
On pass 1 you read the entire input file and process only, say, the first 10 columns. If the input had 1 million rows, the output would be a file with 1 million columns and 10 rows. On the next pass you read the input again and process columns 11 through 20, appending the results to the original output file. And so on....
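A rough sketch of that multi-pass idea in Ruby (the question allows any language; the file names and batch size are placeholders):
cols_per_pass = 10                                    # columns handled per pass
ncols = File.open("input.tsv", &:gets).chomp.split("\t").size

File.open("transposed.tsv", "w") do |out|
  (0...ncols).each_slice(cols_per_pass) do |col_batch|
    # output_rows[i] collects column col_batch[i] across every input row
    output_rows = Array.new(col_batch.size) { [] }
    File.foreach("input.tsv") do |line|
      fields = line.chomp.split("\t")
      col_batch.each_with_index { |c, i| output_rows[i] << fields[c] }
    end
    output_rows.each { |row| out.puts row.join("\t") }
  end
end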
If you have Python with NumPy installed, it's as easy as this:
#!/usr/bin/env python
import numpy, csv

with open('/path/to/data.csv', 'r', newline='') as f:
    # the grid is tab-delimited, so tell the csv reader about it
    csvdata = csv.reader(f, delimiter='\t')
    data = numpy.array(list(csvdata))

transposed = data.T
# write the transposed grid back out to another file, again tab-delimited
numpy.savetxt('/path/to/output.csv', transposed, delimiter='\t', fmt='%s')
... the csv module is part of Python's standard library.