Hey. How can I get the total number of rows in a file (I don't want to do it with a loop)? I'm reading a CSV file.
Example 1
CSV.open('clients.csv', 'r')
Example 2
FasterCSV.foreach('clients.csv')
Thx.
How large is your file?
This option loads the entire file into memory, so if there are size/memory concerns it might not work.
numrows = FasterCSV.read('clients.csv').size
This option uses Ruby's built-in CSV module, which as you know is quite slow, but it does work. It also loads the entire file into memory:
numrows = CSV.readlines('clients.csv').size
Both FasterCSV.read and CSV.readlines return arrays of arrays, so you can use any array magic you want on the results.
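If memory is the main concern, a minimal alternative sketch (assuming one record per physical line, i.e. no embedded newlines inside quoted fields) is to stream the file and just count lines instead of building an array:
numrows = File.foreach('clients.csv').count  # streams line by line; nothing is kept in memory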
Related
I am trying to process a large set of text files delimited by newlines. The files are gzipped, and I've split them into small chunks of roughly 100 MB each when uncompressed. I have a total of 296 individual compressed files with a total uncompressed size of ~30 GB.
The rows are N-Quads, and I'm using a Bag to map them into a format that I can import into a database. The rows are folded by key so that I can combine the rows related to a single page.
This is the code I'm using to read the files and fold them.
with dask.config.set(num_workers=2):
    n_quads_bag = dask.bag.\
        read_text(files)
    uri_nquads_bag = n_quads_bag.\
        map(parser.parse).\
        filter(lambda x: x is not None).\
        map(nquad_tuple_to_page_dict).\
        foldby('uri', binop=binop).\
        pluck(1).\
        map(lang_extract)
Then I'm normalizing the data into pages and entities. I do this with a map function that splits the records into a (page, entities) tuple. I then pluck each part and write the two parts to separate sets of Avro files.
pages_entities_bag = uri_nquads_bag.\
    map(map_page_entities)

pages_bag = pages_entities_bag.\
    pluck(0).\
    map(page_extractor).\
    map(extract_uri_details).\
    map(ntriples_to_dict)

entities_bag = pages_entities_bag.\
    pluck(1).\
    flatten().\
    map(entity_extractor).\
    map(ntriples_to_dict)
with ProgressBar():
    pages_bag.to_avro(
        os.path.join(output_folder, 'pages.*.avro'),
        schema=page_avro_scheme,
        codec='snappy',
        compute=True)
    entities_bag.to_avro(
        os.path.join(output_folder, 'entities.*.avro'),
        schema=entities_avro_schema,
        codec='snappy',
        compute=True)
The code is failing on pages_bag.to_avro(... compute=True) with Killed/MemoryError. I've tried reducing the partition sizes and cut the worker count down to 2.
Am I wrong in setting compute=True? Is that the reason the whole dataset is being brought into memory? If so, how else can I get the files written?
Or is it possible that the partitions of the pages or entities are simply too big for the machine?
Another question: am I using Bags incorrectly, and is this the right approach for the problem I'm trying to solve?
The specs of the machine I'm running this on:
4 CPUs
16 GB of RAM
375 GB scratch disk
The way to keep this from running out of memory is to keep the files at ~100 MB uncompressed and to use a groupby. As the Dask documentation states, you can force the shuffle to happen on disk, and groupby supports setting the number of partitions on the output.
with dask.config.set(num_workers=2):
    n_quads_bag = dask.bag.\
        read_text(files)
    uri_nquads_bag = n_quads_bag.\
        map(parser.parse).\
        filter(lambda x: x is not None).\
        map(nquad_tuple_to_page_dict).\
        groupby(lambda x: x[3], shuffle='disk', npartitions=n_quads_bag.npartitions).\
        map(grouped_nquads_to_dict).\
        map(lang_extract)
I have a BigQuery table where each row represents a text file (gs://...) and a line number.
file, line, meta
file1.txt, 10, meta1
file2.txt, 12, meta2
file1.txt, 198, meta3
Each file is about 1.5 GB, and there are about 1k files in my bucket. My goal is to extract the lines specified in the BQ table.
I decided to implement the following plan:
Map table => KV<file,line>
Reduce KV<file,line> => KV<file, [lines]>
Map KV<file, [lines]> => [KV<file, rowData>]
where rowData means the actual data from the file at one of the line numbers in lines.
If I've read the docs and SO correctly, TextIO.Read isn't meant to be used in such conditions. As a workaround I could use GcsIoChannelFactory to read the files from GCS. Is that correct? Is it the preferable approach for the described task?
Yes, your approach is correct. There is currently no better approach to reading lines with line numbers from text files, except for doing it yourself using GcsIoChannelFactory (or writing a custom FileBasedSource, but this is more complex, and wouldn't work in your case because the filenames are not known in advance).
This and other similar scenarios will get much better with Splittable DoFn - work on that is in progress, but it is a large amount of work, so no timeline yet.
I have a very large CSV file, ~800,000 lines. I would like to process this file in parallel to speed up my script.
How does one use Ruby to break a file into n smaller pieces?
Breaking the CSV file into chunks is in order, but you have to keep in mind that each chunk needs to keep the first line with the CSV header!
So UNIX 'split' will not cut it!
You'll have to write your own little Ruby script that reads the first line and stores it in a variable, then distributes the next N lines to a new partial CSV file, copying the CSV header line into it first, and so on.
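A rough sketch of such a script (the file name and chunk size here are just placeholders):
lines_per_chunk = 10_000          # hypothetical chunk size
header          = nil
out             = nil
chunk_index     = 0

File.foreach('big_file.csv').with_index do |line, i|
  if i.zero?
    header = line                 # remember the CSV header line
    next
  end
  if ((i - 1) % lines_per_chunk).zero?
    out.close if out
    chunk_index += 1
    out = File.open("big_file_part_#{chunk_index}.csv", 'w')
    out.write(header)             # every chunk starts with the header
  end
  out.write(line)
end
out.close if out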
After creating each file with the header and a chunk of lines, you could then use Resque to enlist those files for parallel processing by a Resque worker.
http://railscasts.com/episodes/271-resque
For CSV files, you can do this:
open("your_file.csv").each_line do |line|
# do your stuff here like split lines
line.split(",")
# or store them in an array
some_array << line
# or write them back to a file
some_file_handler << line
end
By storing lines (or split lines) in an array (memory) or a file, you can break a large file into smaller pieces. After that, threads can be used to process each piece:
threads = []
# `process` and `pieces` are placeholders for your own processing code and chunks
1.upto(5) { |i| threads << Thread.new { process(pieces[i]) } }
threads.each(&:join)
Note that you are responsible for keeping the code thread-safe.
Hope this helps!
Update:
Following pguardiario's advice, we can use CSV from the standard library instead of opening the file directly.
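For instance, a minimal sketch with the standard library's CSV (the file name is a placeholder):
require 'csv'

CSV.foreach("your_file.csv") do |row|
  # row is already parsed into an array of fields
end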
I would use the Linux split command to split this file into many smaller files, and then process those smaller parts.
I am reading a file that is 10 MB in size and contains some IDs. I read them into a list in Ruby. I am concerned that this might cause memory issues in the future, when the number of IDs in the file increases. Is there an effective way of reading a large file in batches?
Thank you
With Lazy Enumerators and each_slice, you can get the best of both worlds. You don't need to worry about cutting lines in the middle, and you can iterate over multiple lines in a batch. batch_size can be chosen freely.
header_lines = 1
batch_size = 2000

File.open("big_file") do |file|
  file.lazy.drop(header_lines).each_slice(batch_size) do |lines|
    # do something with batch of lines
  end
end
It could be used to import a huge CSV file into a database:
require 'csv'
batch_size = 2000

File.open("big_data.csv") do |file|
  headers = file.first
  file.lazy.each_slice(batch_size) do |lines|
    csv_rows = CSV.parse(lines.join, headers: headers)
    # do something with 2000 csv rows, e.g. bulk insert them into a database
  end
end
There's no universal way.
1) You can read the file in chunks:
File.open('filename', 'r') do |f|
  while (chunk = f.read(2048))
    # process the 2048-byte chunk here
  end
end
Disadvantage: you can miss a substring if it straddles a chunk boundary, i.e. you look for "SOME_TEXT", but "SOME_" is the last 5 bytes of the first 2048-byte chunk and "TEXT" is the first 4 bytes of the second chunk.
2) You can read the file line by line:
File.open('filename', 'r') do |f|
  while (line = f.gets)
    # process the line here
  end
end
Disadvantage: this way it'll be 2x-5x slower than the first method.
If you're this worried about speed/memory efficiency, have you considered shelling out and using grep, awk, sed, etc.? If I knew a bit more about the structure of the input file and what you're trying to extract, I could potentially construct a command for you.
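For example, a minimal sketch of shelling out from Ruby (the pattern and file name are just placeholders for whatever you actually need):
# run grep in a subshell and only bring the matching lines back into Ruby
matches = `grep 'SOME_TEXT' big_file`.lines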
I have large data files of values on a 2D grid.
They are organized such that subsequent rows of data in the grid are subsequent lines in the file.
Each column is separated by a tab character.
Essentially, this is a CSV file, but with tabs instead of commas.
I need to transpose the data (the first row becomes the first column) and output it to another file. What's the best way to do this? Any language is okay (I prefer Perl or C/C++). Currently, I have a Perl script that just reads the entire file into memory, but I have files that are simply gigantic.
The simplest way would be to make multiple passes through your input, extracting a subset of columns on each pass. The number of columns would be determined by how much memory you wanted to use and how many rows are in the input file.
For example:
On pass 1 you read the entire input file and process only the first, say, 10 columns. If the input had 1 million rows, the output would be a file with 1 million columns and 10 rows. On the next pass you would read the input again and process columns 11 through 20, appending the results to the original output file. And so on.
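Since any language is fine, here's a rough Ruby sketch of that multi-pass idea (the file names and the columns-per-pass value are placeholders):
input         = 'grid.tsv'       # hypothetical input file
output        = 'grid_t.tsv'     # hypothetical output file
cols_per_pass = 10

# column count taken from the first line of the input
total_cols = File.open(input) { |f| f.readline.chomp.split("\t").size }

File.open(output, 'w') do |out|
  (0...total_cols).step(cols_per_pass) do |start|
    # one output row per column handled in this pass
    out_rows = Array.new([cols_per_pass, total_cols - start].min) { [] }
    File.foreach(input) do |line|
      fields = line.chomp.split("\t")
      out_rows.each_with_index { |row, i| row << fields[start + i] }
    end
    out_rows.each { |row| out.puts(row.join("\t")) }
  end
end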
If you have Python with NumPy installed, it's as easy as this:
#!/usr/bin/env python
import numpy, csv

with open('/path/to/data.csv', newline='') as f:
    # the data is tab-delimited, so tell the reader about it
    csvdata = csv.reader(f, delimiter='\t')
    data = numpy.array(list(csvdata))

transpose = data.T
# write the result back out to another file, e.g.:
numpy.savetxt('/path/to/transposed.csv', transpose, fmt='%s', delimiter='\t')
... the csv module is part of Python's standard library.