Most memory efficient way to save binary file from the web with Python 2.6? - buffer

I'm trying to download (and save) a binary file from the web using Python 2.6 and urllib.
As I understand it, read(), readline() and readlines() are the 3 ways to read a file-like object.
Since the binary files aren't really broken into newlines, read() and readlines() read teh whole file into memory.
Is choosing a random read() buffer size the most efficient way to limit memory usage during this process?
i.e.
import urllib
import os
title = 'MyFile'
downloadurl = 'http://somedomain.com/myfile.avi'
webFile = urllib.urlopen(downloadurl)
mydirpath = os.path.join('c:', os.sep,'mydirectory',\
downloadurl.split('/')[-1])
if not os.path.exists(mydirpath):
print "Downloading...%s" % title
localFile = open(mydirpath, 'wb')
data = webFile.read(1000000) #1MB at a time
while data:
localFile.write(data)
data = webFile.read(1000000) #1MB at a time
webFile.close()
localFile.close()
print "Finished downloading: %s" % title
else:
print "%s already exists." % mydirypath
I chose read(1000000) arbitrarily because it worked and kept RAM usage down. I assume if I was working with a raw network buffer choosing a random amount would be bad since the buffer might run dry if the transfer rate was too low. But it seems urllib is already handling lower level buffering for me.
With that in mind, is choosing an arbitrary number fine? Is there a better way?
Thanks.

You should use urllib.urlretrieve for this. It will handle everything for you.

Instead of using your own read-write loop, you should probably check out the shutil module. The copyfileobj method will let you define the buffering. The most efficient method varies from situation to situation. Even copying the same source file to the same destination may vary due to network issues.

Related

How to save dask array as .png files slice by slice?

I'm running a machine learning pipeline for segmentation of very large 3D images. I would like to store the results (dask arrays) as .png files, with each file corresponding to one slice of the dask array. Do you have any suggestions on how to implement this?
I have been trying to save the results by building a parallel for loop using the joblib dask parallel backend and then looping through the results slice by slice. This works fine until a certain point at which my pipe gets stuck without any apparent reason (no memory issue, no too many open file descriptors etc.).
array_to_save has been persisted in memory with client.persist()
with joblib.parallel_backend('dask'):
joblib.Parallel(verbose=100)(joblib.delayed(png_sav)(j, stack_height, client.compute(array_to_save[j])) for j in range(stack_height))
def png_sav(j, stack_height, prediction):
img = Image.fromarray(prediction.result().astype('uint32'), 'I') # I to save as 16 bit binary image
img.save(png_pn+str(j)+'_slice_prediction.png', "PNG")
img.close()
You might consider using either ...
the map_blocks method to call a function on every block of your data. Your function can take a block_info= keyword argument if it wants to know where it is in the stack.
Convert your array to a list of delayed arrays. Maybe something like this (untested, you should read the docs here)
x = x.rechunk((1, None, None)) # many chunks along the first axis
slices = x.to_delayed().flatten()
saves = [dask.delayed(numpy_array_to_png)(slc, filename='...') for slc in slices]
dask.compute(*saves)
Check in with the dask-image project. I suspect that they have something https://github.com/dask/dask-image
thanks a lot for your hints.
I'm trying to understand how to use .map_blocks() and in particular block_info=. But I don't understand how to use the information given by block_info. I would like to save each chunk separately but don't know how to do this. Any hints? Thanks a lot!
da.map_blocks(png_sav(stack_height, prediction,
block_info=True), dtype='uint16')
def png_sav(stack_height, prediction, block_info=True):
# I don't get how I can save each chunk separately
img = Image.fromarray("prediction_chunk".astype('uint32'), 'I') # I to save as 16 bit binary image
img.save(png_pn+str(j)+'_slice_prediction.png', "PNG")
img.close()

Advice (Best practices) for handling large number of large 2D arrays in HDF5 files

I am using a python program to write a 4000x4000 array into an hdf5 file.
Then, I read the data by a c-program where I need it as an input to do some simulations. I need approximately 1000 of these 4000x4000 arrays (meaning, I am doing 1000 simulation runs).
My question now is the following: Which way is "better", 1000 separate hdf5 files or one big hdf5-file with 1000 different dataset (named 'dataset_%04d')?
Any advice or best practices behaviour for this kind of problem is greatly appreciated (as I am not too familiar with hdf5).
In case, this is of interest, here is the python code I am using to write the hdf5 file:
import h5py
h5f = h5py.File( 'data_0001.h5', 'w' )
h5f.create_dataset( 'dataset_1', data=myData )
h5f.close
This is really interesting as I'm currently dealing with similar problem.
Performace
To investigate the problem a little closer, I have created following file
import h5py
import numpy as np
def one_file(shape=(4000, 4000), n=1000):
h5f = h5py.File('data.h5', 'w')
for i in xrange(n):
dataset = np.random.random(shape)
dataset_name = 'dataset_{:08d}'.format(i)
h5f.create_dataset(dataset_name, data=dataset)
print i
h5f.close()
def more_files(shape=(4000, 4000), n=1000):
for i in xrange(n):
file_name = 'data_{:08d}'.format(i)
h5f = h5py.File(file_name, 'w')
dataset = np.random.random(shape)
h5f.create_dataset('dataset', data=dataset)
h5f.close()
print i
Then, in IPython,
>>> from testing import one_file, more_files
>>> %timeit one_file(n=25) # with n=25, the resulting file is 3.0GB
1 loops, best of 3: 42.5 s per loop
>>> %timeit more_files(n=25)
1 loops, best of 3: 41.7 s per loop
>>> %timeit one_file(n=250)
1 loops, best of 3: 7min 29s per loop
>>> %timeit more_files(n=250)
1 loops, best of 3: 8min 10s per loop
The difference is quite surprising to me, for n=25 having more files is faster, however this is no longer truth for more datasets.
Experience
As others noted in comments, there is probably no correct answer as this is very problem specific. I deal with hdf5 files for my research in plasma physics. I don't know if it helps you but I can share my hdf5 experience.
I'm running lots of simulations and output for a given simulation used to go to one hdf5 file. When a simulation finished, it dumped it's state to this hdf5 file, so later I was able to take this state and extend simulation from that point (I could change some parameters as well and I don't need to start from scratch). The output from this simulation went again to the same file. This was great - I had only one file for one simulation. However, there are certain drawbacks with this approach:
When a simulation crashes, you end up with a file that is not 'complete' - you can't start new simulation from that file.
There is no simple way, how you can safely take a look into hdf5 file when another process is writing to that file. If it happen that you try to read from and another process is writing to, you end up corrupted file and all you data is lost!
I don't know of any simple way how you can delete groups from a file (I anyone know a way, let me know). So, if I need to restructure a file, I need to create new one from it (h5copy, h5repack, ...).
So I end up with this approach which works much better:
I'm periodically flushing state from a simulation and after that I'm writing to a new file. If simulation crashes, I need only delete last file and I don't lost that much of cpu time.
I'm currently only plotting data from all files but the last one. Note that there is another way: see here, but my approach is definitely simpler and I'm ok with that.
It is much better to process more small files than one huge file - you see the progress and so on.
Hope this helps.
A little late to the party, I know, but I thought I'd share my experiences. My data sizes are smaller, but from a simplicity-of-analysis standpoint I actually prefer one large (1000, 4000, 4000) dataset. In your case, it looks like you'd need to use the maxshape property to make it extendable as you create new results. Saving multiple separate datasets makes it hard to look at trends across datasets since you have to slice them all separately. With one dataset you could do eg. data[:, 5, 20] to look across the 3rd axis. Also, to address the corruption problem, I highly recommend using h5py.File as a context manager:
with h5py.File('myfilename') as f:
f.create_dataset('mydata', data=data, maxshape=(1000, 4000, 4000))
This automatically closes the file even if there is an exception. I used to curse incessantly due to corrupted data and then I started doing this and haven't had a problem since.

how to find a particular redis key memory size in lua script

redis.call('select','14')
local allKeys = redis.call('keys','orgId#1:logs:email:uid#*')
for i = 1 , #allKeys ,1
do
local object11 = redis.call('DEBUG OBJECT',allKeys[i])
print("kk",object11[1])
end
Here "DEBUG OBJECT" is run successfully on redis-cli, but if we want to run through lua script on multiple key. That send error like this.
(error) ERR Error running script (call to f_b003d960240545d9540ebc2319d863221045
3815): Wrong number of args calling Redis command From Lua script
DEBUG OBJECT is not a good bet. It shows the serialized length of the value, so it is just the size of the object once stored on an RDB file.
To have some hint about the size of an object in Redis, you need to resort to more complex techniques, but you can only get an approximation. You need to run:
TYPE
OBJECT ENCODING
The object-type specific command to get its length.
Sample a few elements to understand the average string length of the object.
Based on this four informations, you need to check the Redis source code to check the different memory footprints of the internal structures used, and do the math. Not easy...
A much more viable approximation is to just to use:
APPROX_USED_MEM = num_elements * avg_size * overhead_factor
You may want to pick an overhead factor which makes sense for a variety of data types. The error is big, but is an approximation good enough for some use cases. Maybe overhead_factor may be something like 2.
TLDR: What you are trying to do is complex and leads to errors. In the future the idea is to provide a MEMORY command which is able to do this.

retrieve sequence alignment score produced by emboss in biopython

I'm trying to retrieve the alignment score of two sequences compared using emboss in biopython. The only way that I know is to retrieve it from an output text file produced by emboss. The problem is that there will be hundreds of these files to iterate over. Is there an easier/cleaner method to retrieve the alignment score, without resorting to that? This is the main part of the code that I'm using.
From Bio.Emboss.Applications import StretcherCommandline
needle_cline = StretcherCommandline(asequence=,bsequence=,gapopen=,gapextend=,outfile=)
stdout, stderr = needle_cline()
I had the same problem and after some time spent on searching for a neat solution I popped up a white flag.
However, to speed up significantly the processing of output files I did the following things:
1) I used re python module for handling regular expressions to extract all data needed.
2) I created a ramdisk space for the output files. The use of a ramdisk here allowed for processing and exchanging all the data in RAM memory (much faster than writing and reading the output files from a hard drive, not to mention it saves your hdd in case of processing massive number of alignments).
I don't know if there is one specifically for your command.
For Primer3CommandLine, there is Primer3. Make your life much easier with something like:
from Bio.Emboss import Primer3
inputFile = "./wherever/your/outputfileis.out"
with open(inputFile) as fileHandle:
record = Primer3.parse(fileHandle)
# XXX check is len>0
primers = record.next().primers
numPrimers = len(primers)
# you should have access to each primer, using a for loop
# to check how to access the data you care about. For example:
I would also check http://biopython.org/wiki/SeqIO#Sequence_Input

+ [NSString stringWithContentsOfFile:] performance

I need to load files in iOS, now I use the + [NSString stringWithContentsOfFile:].
The files are mostly 500kb to 5mb.
I load an approx. 4mb large file and Instruments and stopwatch told me it needs 1.5 seconds to load this file. In my opinion its a bit slow, is there a way to get the string faster?
EDIT:
I try some things and notice now, the creation of the NSString is my problem it takes 97% of the time and not the real loading from disk.
If you either know the encoding or can determine it (the API you use now is basic), you can just treat it as a char buffer (in a manner which is encoding aware).
I'd begin by opening it using memory mapped data (mmap, you can also approach this using NSData). madvise can be used to hint how you will access the file.
If memory mapped I/O consumes too much memory for your use, you should drop down to incremental reads, like C I/O facilities - fopen, fread, etc.. This will typically require more I/O events than memory mapped data (can be much slower, depending on how the data is accessed).
In both cases, you would treat the string as a C string -- don't simply convert the whole file to an NSString upon opening.
Foundation has a lot of tricks, so make sure this actually improves performance for your specific use case.
If those solutions are too 'core for your use, just consider using smaller files instead (dividing existing files).

Resources