Filtering input file with chunksize and skiprows using line number as index in dask dataframe - dask

I have ~70 GB of output from MD simulations. A pattern of a fixed number of explanation lines followed by a fixed number of data lines repeats throughout the file. How can I read the file into a Dask DataFrame chunk by chunk so that the explanation lines are ignored?
I successfully passed a lambda function to the skiprows argument of pandas.read_csv to ignore the explanation lines and read only the data lines. I then converted the pandas code to Dask, but it does not work. Here is the Dask code, written by replacing pandas.read_csv with dd.read_csv:
import dask.dataframe as dd

# First extract the number of atoms and hence the number of data lines:
with open(filename[0], mode='r') as file:  # The same as Chanil's code
    line = file.readline()
    line = file.readline()
    line = file.readline()
    line = file.readline()  # natoms
    natoms = int(line)

skiplines = 9  # Number of explanation lines repeating after natoms lines of data

def logic_for_chunk(index):
    """Return True for line numbers that should be skipped (explanation lines)."""
    if index % (natoms + skiplines) > 8:
        return False
    return True

df_chunk = dd.read_csv('trajectory.txt', sep=' ', header=None, index_col=False,
                       skiprows=lambda x: logic_for_chunk(x), chunksize=natoms)
Here the index of the dataframe is the line number in the file. With the above code, in the first chunk, lines 0 to 8 of the file are ignored and lines 9 to 58 are read. In the next chunk, lines 59 to 67 are ignored and a natoms-sized chunk from line 68 to 117 is read. This continues until all the data snapshots have been read.
Unfortunately, while the above code works well in pandas, it does not work in Dask. How can I implement a similar procedure with a Dask DataFrame?

The dask dataframe read_csv function cuts the file up at byte locations. It is unable to determine exactly how many lines are in each partition, so it is unwise to depend on the row index within each partition.
If there is some other way to detect a bad line then I would try that. Ideally you will be able to determine a bad line based on the content of the line, not on its location within the file (like every eighth line).
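As a rough sketch of content-based filtering (my own illustration, not part of the original answer), one could read the file as text lines with dask.bag, keep only lines that parse as a complete row of numbers, and convert the result to a dataframe. The number of fields and the column names below are assumptions:
import dask.bag as db

NFIELDS = 8  # assumed number of columns in a data line

def is_data_line(line):
    # Keep only lines with the expected field count whose fields all
    # parse as numbers; explanation lines fail this test.
    parts = line.split()
    if len(parts) != NFIELDS:
        return False
    try:
        [float(p) for p in parts]
        return True
    except ValueError:
        return False

lines = db.read_text('trajectory.txt').filter(is_data_line)
records = lines.map(lambda line: tuple(float(p) for p in line.split()))
df = records.to_dataframe(columns=['col%d' % i for i in range(NFIELDS)])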

Related

Iterate on (or access directly) xarray chunks

I'm after a way to iterate on xarray chunks, so something similar to dask.array.blocks but that would give me access to xarray chunks with coordinates and dimensions.
For the record, I'm aware that xarray.map_blocks exists, but what I'm doing maps input chunks to output chunks of unknown shape, so I'd like to write something custom by looping directly on the xarray chunks.
I've tried to look into the xarray.map_blocks source code, since I guess something similar to what I need is in there, but I had a hard time understanding what's going on there.
EDIT:
My use case is that I would like, for each xarray chunk, to get an output xarray chunk of variable length along a new dimension (called foo below), and eventually concatenate them along foo.
This is a mocked scenario that should at least clarify what I'm after.
For now I've solved the problem by constructing, from each dask chunk of the DataArray, an "xarray" chunk (but this looks quite convoluted), and then using client.map(fn_on_chunk, xarray_chunks); see the sketch after the example below.
import numpy as np
import xarray as xr

n = 1000
x_raster = y_raster = np.arange(n)
time = np.arange(10)
vals_raster = np.arange(n*n*10).reshape(n, n, 10)
da_raster = xr.DataArray(vals_raster, coords={"y": y_raster, "x": x_raster, "time": time})
da_raster = da_raster.chunk(dict(x=100, y=100))
def fn_on_chunk(da_chunk):
    # Tried to replicate the fact that I can't know in advance
    # the length of one dimension of the output
    len_range = np.random.randint(10)
    outs = []
    for foo in range(len_range):
        # Do some magic that finds the needed coordinates
        # on this particular chunk
        x_chunk, y_chunk = fn_magic(foo)
        out = da_chunk.sel(x=x_chunk, y=y_chunk)
        out['foo'] = foo
        outs.append(out)
    return xr.concat(outs, dim='foo')
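Below is a minimal sketch of the workaround described above, reconstructed by me rather than taken from the post: slice the chunked DataArray into one in-memory DataArray per dask chunk (the helper iter_xarray_chunks is hypothetical), map fn_on_chunk over those chunks with a distributed client, and concatenate the results along foo.
from itertools import product

import numpy as np
import xarray as xr
from dask.distributed import Client

def iter_xarray_chunks(da):
    # da.chunks is a tuple (one entry per dimension) of per-chunk sizes,
    # in the same order as da.dims.
    starts = [np.concatenate(([0], np.cumsum(sizes)[:-1])) for sizes in da.chunks]
    for idx in product(*(range(len(sizes)) for sizes in da.chunks)):
        indexers = {
            dim: slice(int(starts[i][j]), int(starts[i][j]) + da.chunks[i][j])
            for i, (dim, j) in enumerate(zip(da.dims, idx))
        }
        # Materialize one dask chunk as a plain in-memory DataArray,
        # keeping its coordinates and dimensions.
        yield da.isel(indexers).compute()

client = Client()
futures = client.map(fn_on_chunk, list(iter_xarray_chunks(da_raster)))
result = xr.concat(client.gather(futures), dim='foo')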

Reading a column file of x y z into table in Lua

I've been trying to find my way through Lua. I have a file containing N lines of numbers, 3 per line; they are actually x, y, z coordinates. I could make it a CSV file and use some Lua CSV parser, but I figure it's better to learn how to do this regardless.
So what would be the best way to deal with this? So far I am able to read each line into a table with the code snippet below, but 1) I don't know whether it is a table of strings or of numbers, and 2) if I print tbllinesx[1], it prints the whole line of three numbers. I would like tbllinesx[1][1], tbllinesx[1][2] and tbllinesx[1][3] to correspond to the first three numbers of the first line of my file.
local file = io.open("locations.txt")
local tbllinesx = {}
local i = 0
if file then
    for line in file:lines() do
        i = i + 1
        tbllinesx[i] = line
    end
    file:close()
else
    error('file not found')
end
From Programming in Lua https://www.lua.org/pil/21.1.html
You can call read with multiple options; for each argument, the
function will return the respective result. Suppose you have a file
with three numbers per line:
6.0 -3.23 15e12
4.3 234 1000001
... Now you want to print the maximum of each line. You can read all three numbers in a single call to read:
while true do
    local n1, n2, n3 = io.read("*number", "*number", "*number")
    if not n1 then break end
    print(math.max(n1, n2, n3))
end
In any case, you should always consider the alternative of reading the
whole file with option "*all" from io.read and then using
gfind to break it up:
local pat = "(%S+)%s+(%S+)%s+(%S+)%s+"
for n1, n2, n3 in string.gfind(io.read("*all"), pat) do
    print(math.max(n1, n2, n3))
end
I'm sure you can figure out how to modify this to put the numbers into table fields on your own.
If you're using three captures you can just use table.pack to create your line table with three entries.
Assuming you only have valid lines in your data file (locations.txt), all you need to do is change the line:
tbllinesx[i] = line
to:
tbllinesx[i] = { line:match '(%d+)%s+(%d+)%s+(%d+)' }
This will put each of the three space-delimited numbers into its own spot in a table for each line separately.
Edit: The repeated %d+ part of the pattern will need to be adjusted according to your actual input. %d+ assumes plain integers; you need something more involved for a possible minus sign (%-?%d+), for a possible dot (%-?%d-%.?%d+), and so on. Of course, the easy way would be to grab everything that is not a space (%S+) as a potential number.

H5PY Writes Very Slow

I have an h5py dataset like the one below. I want to index the records by string instead of by numeric value, so that, e.g., I could get the value of the first record with dset[dset.attrs['id1']].
I am trying to write the attributes with the code below, but it is extremely slow. If I do a %timeit dset.attrs[rid] = idx in the loop, a single write takes about 310 ms. The strings I am writing are 36 characters long. I have about 100k records to write, which would take about 9 hours. Something must be terribly wrong. Also, the CPU is pegged.
import h5py

ids = ['id1', 'id2', 'id3']
h5 = h5py.File("/tmp/ds.h5", "w")
dset = h5.create_dataset("lds", (100000,), dtype='float32')
for idx, id in enumerate(ids):  # loop takes forever
    dset.attrs[id] = idx  # takes about ~310 ms
EDIT
Minimal "working" example.
for idx, rid in enumerate(range(10)):
    %timeit dset.attrs[str(rid)] = idx
10 loops, best of 3: 470 ms per loop
10 loops, best of 3: 470 ms per loop
...
Nearly 0.5 second for a single write.
Use the latest value for the libver parameter when opening the file. This is a lot faster. For example:
h5py.File('ds.h5', 'w', libver='latest')
See here: https://github.com/h5py/h5py/issues/705
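For completeness, a small sketch of my own applying the libver='latest' suggestion to the original attribute-writing loop (the id list here is just a stand-in for the ~100k 36-character strings):
import h5py

ids = ['id1', 'id2', 'id3']  # stand-in for ~100k string ids
with h5py.File("/tmp/ds.h5", "w", libver='latest') as h5:
    dset = h5.create_dataset("lds", (100000,), dtype='float32')
    for idx, rid in enumerate(ids):
        dset.attrs[rid] = idx  # much faster than with the default libver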

How to read a file block in Rails without reading it again from the beginning

I have a growing file (log) that I need to read by blocks.
I make an Ajax call to get a specified number of lines.
I used File.foreach to read the lines I want, but it always reads from the beginning, and I need to return only the lines I want, directly.
Example Pseudocode:
#First call:
File.open and return 0 to 10 lines
#Second call:
File.open and return 11 to 20 lines
#Third call:
File.open and return 21 to 30 lines
#And so on...
Is there any way to do this?
Solution 1: Reading the whole file
The proposed solution here:
https://stackoverflow.com/a/5052929/1433751
...is not an efficient solution in your case, because it requires reading all the lines of the file for each AJAX request, even if you only need the last 10 lines of the logfile.
That's an enormous waste of time: since each request re-reads everything before the block it needs, the total work to process the whole logfile in blocks of size N grows quadratically with the number of blocks.
Solution 2: Seeking
Since your AJAX calls request sequential lines we can implement a much more efficient approach by seeking to the correct position before reading, using IO.seek and IO.pos.
This requires you to return some extra data (the last file position) back to the AJAX client at the same time you return the requested lines.
The AJAX request then becomes a function call of this form request_lines(position, line_count), which enables the server to IO.seek(position) before reading the requested count of lines.
Here's the pseudocode for the solution:
Client code:
LINE_COUNT = 10
pos = 0
loop {
  data = server.request_lines(pos, LINE_COUNT)
  display_lines(data.lines)
  pos = data.pos
  break if pos == -1 # Reached end of file
}
Server code:
def request_lines(pos, line_count)
  file = File.open('logfile')
  # Seek to the requested position
  file.seek(pos)
  # Read the requested count of lines while checking for EOF
  lines = line_count.times.map { file.readline if !file.eof? }.compact
  # Mark pos with -1 if we reached EOF during reading
  pos = file.eof? ? -1 : file.pos
  file.close
  # Return data
  data = { lines: lines, pos: pos }
end

Reading and parsing a large .dat file

I am trying to parse a huge .dat file (4 GB). I have tried with R, but it just takes too long. Is there a way to parse a .dat file in segments, for example every 30000 lines? Any other solutions would also be welcome.
This is what it looks like:
These are the first two lines with header:
ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F <br/>
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167| <br/>
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|
Here is an option to read the data faster in R, using the fread function from the data.table package.
EDIT
I removed all <br/> new-line tags. This is the edited dataset
ST|ZIPCODE|GEO_ID|GEO_TTL|FOOTID_GEO|NAICS2012|NAICS2012_TTL|FOOTID_NAICS|YEAR|EMPSZES|EMPSZES_TTL|ESTAB|ESTAB_F
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|001|All establishments|167|
01|35004|8610000US35004|35004(MOODY,AL)||00|Total for all sectors||2012|212|Establishments with 1 to 4 employees|91|
Then I matched variables with classes. You should use nrows ~ 100.
colclasses = sapply(read.table(edited_data, nrows=1, sep="|", header=T),class)
Then I read the edited data.
your_data <- fread(edited_data, sep="|", sep2=NULL, nrows=-1L, header=T, na.strings="NA",
                   stringsAsFactors=FALSE, verbose=FALSE, autostart=30L, skip=-1L, select=NULL,
                   colClasses=colclasses)
Everything worked like a charm. In case you have problems removing the tags, use this simple Python script (it will take some time for sure):
original_file = file_path_to_original_file  # e.g. "/Users/User/file.dat"
edited_file = file_path_to_new_file  # e.g. "/Users/User/file_edited.dat"

with open(original_file) as inp:
    with open(edited_file, "w") as op:
        for line in inp:
            op.write(line.replace("<br/>", ""))
P.S.
You can use read.table with similar optimizations, but it won't give you nearly as much speed.
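As a further alternative (my own addition, not part of the original answer), the file can also be processed in segments of 30000 lines in Python with pandas, assuming the <br/> tags have already been removed; the file name and the per-chunk processing below are placeholders:
import pandas as pd

# "file_edited.dat" is a placeholder for the cleaned pipe-delimited file
reader = pd.read_csv("file_edited.dat", sep="|", header=0, chunksize=30000)
for chunk in reader:
    # Placeholder: process each 30000-line segment here
    print(chunk.shape)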
