I'm trying to retrieve the alignment score of two sequences compared using EMBOSS in Biopython. The only way I know of is to retrieve it from the output text file produced by EMBOSS. The problem is that there will be hundreds of these files to iterate over. Is there an easier/cleaner way to retrieve the alignment score without resorting to that? This is the main part of the code I'm using:
from Bio.Emboss.Applications import StretcherCommandline
needle_cline = StretcherCommandline(asequence=,bsequence=,gapopen=,gapextend=,outfile=)
stdout, stderr = needle_cline()
I had the same problem and, after some time spent searching for a neat solution, I raised the white flag.
However, to speed up the processing of the output files significantly, I did the following:
1) I used Python's re module (regular expressions) to extract all the data I needed from the output files.
2) I created a ramdisk for the output files. Using a ramdisk allowed all the data to be processed and exchanged in RAM, which is much faster than writing and reading the output files from a hard drive (not to mention it saves your HDD when processing a massive number of alignments).
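For example, here is a rough sketch of the regex approach, assuming the output is in an EMBOSS alignment format whose header contains a line like '# Score: 112.0' (needle/water's default srspair output does); the helper name is made up:

import re

score_re = re.compile(r'^#\s*Score:\s*([-0-9.]+)', re.MULTILINE)

def alignment_score(path):
    # return the score from one EMBOSS output file, or None if no score line is found
    with open(path) as handle:
        match = score_re.search(handle.read())
    return float(match.group(1)) if match else None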
I don't know if there is a parser specifically for your command.
For Primer3CommandLine, there is Primer3, which makes your life much easier. Something like:
from Bio.Emboss import Primer3

inputFile = "./wherever/your/outputfileis.out"
with open(inputFile) as fileHandle:
    record = Primer3.parse(fileHandle)
    # check that the parser returned at least one record before indexing
    primers = record.next().primers
    numPrimers = len(primers)
    # you should have access to each primer; use a for loop
    # to check how to access the data you care about. For example:
I would also check http://biopython.org/wiki/SeqIO#Sequence_Input
I'm running a machine learning pipeline for segmentation of very large 3D images. I would like to store the results (dask arrays) as .png files, with each file corresponding to one slice of the dask array. Do you have any suggestions on how to implement this?
I have been trying to save the results by building a parallel for loop using the joblib dask parallel backend and then looping through the results slice by slice. This works fine up to a certain point, at which my pipeline gets stuck for no apparent reason (no memory issue, not too many open file descriptors, etc.).
array_to_save has been persisted in memory with client.persist()
with joblib.parallel_backend('dask'):
    joblib.Parallel(verbose=100)(
        joblib.delayed(png_sav)(j, stack_height, client.compute(array_to_save[j]))
        for j in range(stack_height))

def png_sav(j, stack_height, prediction):
    img = Image.fromarray(prediction.result().astype('uint32'), 'I')  # I to save as 16 bit binary image
    img.save(png_pn + str(j) + '_slice_prediction.png', "PNG")
    img.close()
You might consider using either ...
1) the map_blocks method, to call a function on every block of your data. Your function can take a block_info= keyword argument if it wants to know where it is in the stack.
2) converting your array to a list of delayed arrays, maybe something like this (untested; you should read the docs here):
import dask

x = x.rechunk((1, None, None))  # many chunks along the first axis
slices = x.to_delayed().flatten()
saves = [dask.delayed(numpy_array_to_png)(slc, filename='...') for slc in slices]
dask.compute(*saves)
Also check in with the dask-image project; I suspect they have something for this: https://github.com/dask/dask-image
Thanks a lot for your hints.
I'm trying to understand how to use .map_blocks() and in particular block_info=. But I don't understand how to use the information given by block_info. I would like to save each chunk separately but don't know how to do this. Any hints? Thanks a lot!
da.map_blocks(png_sav(stack_height, prediction, block_info=True), dtype='uint16')

def png_sav(stack_height, prediction, block_info=True):
    # I don't get how I can save each chunk separately
    img = Image.fromarray("prediction_chunk".astype('uint32'), 'I')  # I to save as 16 bit binary image
    img.save(png_pn + str(j) + '_slice_prediction.png', "PNG")
    img.close()
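For what it's worth, here is a minimal, untested sketch of how block_info= is typically used (reusing the names prediction and png_pn from above; the uint32/'I' conversion just mirrors the question's code): map_blocks calls the function once per chunk, and block_info[None]['chunk-location'] gives the index of the current chunk, so after rechunking to one slice per chunk that index is the slice number.

from PIL import Image

def save_block(block, block_info=None):
    # index of this chunk along the first (slice) axis
    j = block_info[None]['chunk-location'][0]
    img = Image.fromarray(block[0].astype('uint32'), 'I')  # same conversion as above
    img.save(png_pn + str(j) + '_slice_prediction.png', "PNG")
    return block  # map_blocks must return an array

x = prediction.rechunk((1, None, None))  # one slice per chunk
x.map_blocks(save_block, dtype=x.dtype).compute()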
I am using a Python program to write a 4000x4000 array into an HDF5 file.
Then I read the data with a C program, where I need it as input for some simulations. I need approximately 1000 of these 4000x4000 arrays (meaning I am doing 1000 simulation runs).
My question is the following: which way is "better", 1000 separate HDF5 files or one big HDF5 file with 1000 different datasets (named 'dataset_%04d')?
Any advice or best practice for this kind of problem is greatly appreciated (as I am not too familiar with HDF5).
In case this is of interest, here is the Python code I am using to write the HDF5 file:
import h5py

h5f = h5py.File('data_0001.h5', 'w')
h5f.create_dataset('dataset_1', data=myData)
h5f.close()
This is really interesting, as I'm currently dealing with a similar problem.
Performance
To investigate the problem a little closer, I created the following file:
import h5py
import numpy as np

def one_file(shape=(4000, 4000), n=1000):
    h5f = h5py.File('data.h5', 'w')
    for i in xrange(n):
        dataset = np.random.random(shape)
        dataset_name = 'dataset_{:08d}'.format(i)
        h5f.create_dataset(dataset_name, data=dataset)
        print i
    h5f.close()

def more_files(shape=(4000, 4000), n=1000):
    for i in xrange(n):
        file_name = 'data_{:08d}'.format(i)
        h5f = h5py.File(file_name, 'w')
        dataset = np.random.random(shape)
        h5f.create_dataset('dataset', data=dataset)
        h5f.close()
        print i
Then, in IPython,
>>> from testing import one_file, more_files
>>> %timeit one_file(n=25) # with n=25, the resulting file is 3.0GB
1 loops, best of 3: 42.5 s per loop
>>> %timeit more_files(n=25)
1 loops, best of 3: 41.7 s per loop
>>> %timeit one_file(n=250)
1 loops, best of 3: 7min 29s per loop
>>> %timeit more_files(n=250)
1 loops, best of 3: 8min 10s per loop
The difference is quite surprising to me: for n=25, having more files is faster, but this is no longer true for more datasets.
Experience
As others noted in the comments, there is probably no single correct answer, as this is very problem-specific. I deal with HDF5 files for my research in plasma physics. I don't know if it helps you, but I can share my HDF5 experience.
I'm running lots of simulations, and the output for a given simulation used to go to one HDF5 file. When a simulation finished, it dumped its state to this HDF5 file, so later I was able to take this state and extend the simulation from that point (I could change some parameters as well and didn't need to start from scratch). The output from this simulation then went into the same file again. This was great - I had only one file per simulation. However, there are certain drawbacks to this approach:
1) When a simulation crashes, you end up with a file that is not 'complete' - you can't start a new simulation from that file.
2) There is no simple way to safely take a look into an HDF5 file while another process is writing to it. If you try to read from a file while another process is writing to it, you end up with a corrupted file and all your data is lost!
3) I don't know of any simple way to delete groups from a file (if anyone knows a way, let me know). So, if I need to restructure a file, I need to create a new one from it (h5copy, h5repack, ...).
So I ended up with the following approach, which works much better:
I periodically flush the state of a simulation, and after each flush I write to a new file. If the simulation crashes, I only need to delete the last file and I don't lose that much CPU time.
I currently plot data only from all files but the last one. Note that there is another way: see here, but my approach is definitely simpler and I'm OK with that.
It is much better to process many small files than one huge file - you can see the progress, and so on.
Hope this helps.
A little late to the party, I know, but I thought I'd share my experiences. My data sizes are smaller, but from a simplicity-of-analysis standpoint I actually prefer one large (1000, 4000, 4000) dataset. In your case, it looks like you'd need to use the maxshape property to make it extendable as you create new results. Saving multiple separate datasets makes it hard to look at trends across datasets, since you have to slice them all separately. With one dataset you could do e.g. data[:, 5, 20] to look at a single point across all 1000 results. Also, to address the corruption problem, I highly recommend using h5py.File as a context manager:
with h5py.File('myfilename') as f:
    f.create_dataset('mydata', data=data, maxshape=(1000, 4000, 4000))
This automatically closes the file even if there is an exception. I used to curse incessantly due to corrupted data and then I started doing this and haven't had a problem since.
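As a rough sketch of the maxshape idea (the file name, dataset name, and result array here are made up), you can create the dataset with room to grow along the first axis and append each new 4000x4000 result to it:

import h5py
import numpy as np

result = np.random.random((4000, 4000))  # stand-in for one simulation result

with h5py.File('results.h5', 'a') as f:
    if 'mydata' not in f:
        # start empty along the first axis; allow it to grow to 1000 slices
        f.create_dataset('mydata', shape=(0, 4000, 4000),
                         maxshape=(1000, 4000, 4000),
                         chunks=(1, 4000, 4000), dtype='f8')
    dset = f['mydata']
    dset.resize(dset.shape[0] + 1, axis=0)  # make room for one more result
    dset[-1] = result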
I had a pre-interview task, which I completed and the solution works; however, I was marked down and did not get an interview because I had used a TADODataset. I basically imported a CSV file which populated the dataset; the data had to be processed in a specific way, so I used filtering and sorting of the dataset to make sure the data was ordered the way I wanted, and then I did the logic processing in a while loop. The feedback I received said that this was bad, as it would be very slow for large files.
My main question here is: if using an in-memory dataset is slow for processing large files, what would have been a better way to access the information from the CSV file? Should I have used string lists or something like that?
It really depends on how "big" the files are and on the resources (in this case RAM) available for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around (in most cases that I've encountered, the files are ~1MB up to ~10MB, but that's not to say others wouldn't dump more data in CSV format) without worrying too much (if at all) about import/export, since the format is extremely simplistic.
Suppose you have an 80MB CSV file; that's a file you want to process in chunks, otherwise (depending on your processing) you can eat up hundreds of MB of RAM. In that case, what I would do is:
while dataToProcess do begin
  // step 1
  read <X> lines from the file, where <X> is the max number of lines
  to read in one go; if there are fewer lines left to process
  (i.e. you're down to 50 lines and X is 100), then read those
  // step 2
  process the information
  // step 3
  generate output, database inserts, etc.
end;
In the above case, you're not loading 80MB of data into RAM, only a few hundred KB, and the rest you use for processing, i.e. linked lists, dynamic insert queries (batch inserts), etc.
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised; they were probably looking to see if you're capable of creating algorithm(s) and providing simple solutions on the spot, but without using "ready-made" solutions.
They were probably expecting to see you use dynamic arrays and create one (or more) sorting algorithm(s).
"Should I have used String Lists or something like that?"
The response might have been the same; again, I think they wanted to see how you "work".
The interviewer was quite right.
The correct, scalable, and fastest solution for any medium-sized file upwards is to use an 'external sort'.
An 'external sort' is a two-stage process. The first stage splits the file into smaller, sorted files of manageable size; the second stage merges these files back into a single sorted file, which can then be processed line by line.
It is extremely efficient for any CSV file with more than, say, 200,000 lines. The amount of memory the process runs in can be controlled, eliminating the danger of running out of memory.
I have implemented many such sort processes and in Delphi would recommend a combination of TStringList, TList and TQueue classes.
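For illustration only, here is a minimal external-sort sketch in Python rather than Delphi (made-up names; it assumes every line ends with a newline and that sorting whole lines lexically is what you want):

import heapq
import os
import tempfile

def external_sort(in_path, out_path, max_lines=100000):
    # Stage 1: split the input into sorted chunk files of at most max_lines lines each.
    chunk_paths = []
    with open(in_path) as src:
        while True:
            lines = [line for _, line in zip(range(max_lines), src)]
            if not lines:
                break
            lines.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, 'w') as chunk:
                chunk.writelines(lines)
            chunk_paths.append(path)
    # Stage 2: k-way merge of the sorted chunks into one sorted output file.
    chunks = [open(p) for p in chunk_paths]
    try:
        with open(out_path, 'w') as dst:
            dst.writelines(heapq.merge(*chunks))
    finally:
        for f in chunks:
            f.close()
        for p in chunk_paths:
            os.remove(p)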
Good Luck
I'm trying to do the following in Hadoop MapReduce (written in Java, on a Linux OS):
Text files 'rules-1' and 'rules-2' (3GB in total) contain rules, one rule per line (each line terminated by a newline character), so the files can be read using the readLine() function.
The files 'rules-1' and 'rules-2' need to be imported as a whole from HDFS in every map function in my cluster, i.e. these files are not splittable across different map functions.
The input to the mapper's map function is a text file called 'record' (each line terminated by a newline character), so from the 'record' file we get the (key, value) pairs. This file is splittable and can be given as input to the different map functions used in the whole MapReduce job.
What needs to be done is to compare each value (i.e. each line from the record file) with the rules inside 'rules-1' and 'rules-2'.
The problem is: if I pull each line of the rules-1 and rules-2 files into a static ArrayList only once, so that each mapper can share the same ArrayList and compare its elements with each input value from the record file, I get an out-of-memory error, since 3GB cannot be stored in the ArrayList at one time.
Alternatively, if I import only a few lines from the rules-1 and rules-2 files at a time and compare them to each value, the MapReduce job takes a very long time to finish.
Could you provide any other ideas on how this can be done without the memory error? Would it help if I put those rules files into a database backed by HDFS, or something similar? I'm actually running out of ideas and would really appreciate your valuable suggestions.
If your input files are small, you can load them into static variables and use the rules as an input.
If that is not the case, I can suggest the following ways:
a) Give rules-1 and rules-2 a high replication factor, close to the number of nodes you have. Then reading rules-1 and rules-2 from HDFS for each record in the input is relatively efficient, because it will be a sequential read from the local datanode.
b) If you can devise a hash function which, when applied to a rule and to an input string, predicts without false negatives whether they can match, then you can emit this hash for the rules and for the input records, and resolve all possible matches in the reducer. It would be very similar to the way a join is done with MapReduce.
c) I would also consider other optimization techniques like building search trees or sorting, since otherwise the problem looks computationally expensive and will take forever...
On this page, find "Real-World Cluster Configurations"; it covers file size configuration.
You could use the param "mapred.child.java.opts" in conf/mapred-site.xml to increase the memory for your mappers. You might not be able to run as many map slots per server but with more servers in your cluster you could still parallelize your job.
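For example (the 2GB heap value below is only an illustration), the entry in conf/mapred-site.xml would look something like:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>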
Read the content text file as the normal MapReduce input, and read the keyword (rules) text file from HDFS inside your mapper function. In the mapper, split value.toString() from the MapReduce input using a StringTokenizer, and write code that reads the HDFS text file; it reads line by line, so use two nested while loops to do the comparison. Whenever you find data you want, send it to the reducer.
Alternatively, split the 3GB text file into several text files and run all of them through your previous MapReduce program as usual.
For splitting the text file I wrote a Java program, and you decide how many lines you want to write to each text file.
I'm trying to download (and save) a binary file from the web using Python 2.6 and urllib.
As I understand it, read(), readline() and readlines() are the 3 ways to read a file-like object.
Since the binary files aren't really broken into newlines, read() and readlines() read the whole file into memory.
Is choosing a random read() buffer size the most efficient way to limit memory usage during this process?
i.e.
import urllib
import os

title = 'MyFile'
downloadurl = 'http://somedomain.com/myfile.avi'
webFile = urllib.urlopen(downloadurl)
mydirpath = os.path.join('c:', os.sep, 'mydirectory',
                         downloadurl.split('/')[-1])

if not os.path.exists(mydirpath):
    print "Downloading...%s" % title
    localFile = open(mydirpath, 'wb')
    data = webFile.read(1000000)  # 1MB at a time
    while data:
        localFile.write(data)
        data = webFile.read(1000000)  # 1MB at a time
    webFile.close()
    localFile.close()
    print "Finished downloading: %s" % title
else:
    print "%s already exists." % mydirpath
I chose read(1000000) arbitrarily because it worked and kept RAM usage down. I assume that if I were working with a raw network buffer, choosing an arbitrary amount would be bad, since the buffer might run dry if the transfer rate were too low. But it seems urllib is already handling the lower-level buffering for me.
With that in mind, is choosing an arbitrary number fine? Is there a better way?
Thanks.
You should use urllib.urlretrieve for this. It will handle everything for you.
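For example, with the same names as in the question, the whole download loop collapses to something like:

import urllib
urllib.urlretrieve(downloadurl, mydirpath)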
Instead of using your own read-write loop, you should probably check out the shutil module. The copyfileobj method will let you define the buffering. The most efficient method varies from situation to situation. Even copying the same source file to the same destination may vary due to network issues.
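For instance, with the file objects from the question (the 64KB buffer size here is just an example):

import shutil
shutil.copyfileobj(webFile, localFile, 65536)  # copy in 64KB chunks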