I am new to biopython and want to search a protein fasta file and extract the neighborhood of 5 upstream and downstream proteins from an input accession number. Then store the 10 protein sequences in a fasta. How do I "scroll" through a fasta file and even create multiple neighborhood fasta files based on a series of input accessions?
Thanks
I have several HDF5 files all of which have a /dataset that contains vectors. I would like to combine all these vectors into one dataset in one file (that is repeatedly append from one file to another). The combined dataset would have chunked storage and be resizable.
Every option I've seen for doing this seems to require reading all the data into a buffer and then writing it back out. Is there a simpler way to pass a dataset/dataspace from one file to another in order to append the data?
Have you investigated h5py Group .copy() method? Although documented as a group action, it works with any h5py object (groups, datasets, links and references). By default it copies object attributes, and supports recursive copying of group members. If you prefer a command line tool, the HDF Group has one to do this. Take a look at h5copy here: HDF5 Group h5 copy doc
Here is an example that demonstrates a simple h5py .copy() implementation. It creates a set of 3 files -- each with 1 dataset (named /dataset, dtype=float, shape=(10,10)). It then creates a NEW HDF5 file, and a second loop opens the previous files and copies the dataset from each "read" file (h5r) to the new "write" file (h5w).
import h5py
import numpy as np

# Create 3 source files, each with one 10x10 float dataset named 'dataset'
for i in range(1, 4):
    with h5py.File('SO_68025342_'+str(i)+'.h5', mode='w') as h5f:
        arr = np.random.random(100).reshape(10,10)
        h5f.create_dataset('dataset', data=arr)

# Copy each source dataset into the new file under a unique name
with h5py.File('SO_68025342_all.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_'+str(i)+'.h5', mode='r') as h5r:
            h5r.copy('dataset', h5w, name='dataset_'+str(i))
Here is a method to copy data from multiple files into a single dataset in the merged file. It comes with caveats: 1) all datasets must have the same shape, and 2) you must know the number of datasets in advance to size the new dataset. (If not, you can create a resizable dataset by adding maxshape=(None,a0,a1), and then use .resize() as needed.) I have another post with 2 examples here: How can I combine multiple .h5 file? Look at Methods 3a and 3b.
with h5py.File('SO_68025342_merge.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_'+str(i)+'.h5', mode='r') as h5r:
            if 'dataset' not in h5w.keys():
                a0, a1 = h5r['dataset'].shape
                h5w.create_dataset('dataset', shape=(3,a0,a1))
            h5w['dataset'][i-1,:] = h5r['dataset']
Assuming your files aren't so conveniently named, you can use glob.iglob() to loop on the file names to read. Then use .keys() to get the dataset names in each file. Also, if all of your datasets really are named /dataset, you need to come up with a naming convention for the new datasets.
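For the case in the question (an unknown number of files appended into one chunked, resizable dataset), the two ideas above could be combined roughly as follows. This is only a sketch: the function name append_all, the part_*.h5 file pattern and the temporary directory are made up for illustration, and it assumes every source dataset has the same trailing dimensions and dtype. Note that the data still passes through memory, one source dataset at a time.

```python
import glob
import os
import tempfile

import h5py
import numpy as np


def append_all(src_paths, dst_path, name="dataset"):
    """Append the rows of `name` from each source file into one resizable dataset."""
    with h5py.File(dst_path, "w") as h5w:
        merged = None
        for path in src_paths:
            with h5py.File(path, "r") as h5r:
                src = h5r[name]
                if merged is None:
                    # First file: create an empty, chunked dataset that can grow along axis 0
                    merged = h5w.create_dataset(
                        name,
                        shape=(0,) + src.shape[1:],
                        maxshape=(None,) + src.shape[1:],
                        dtype=src.dtype,
                        chunks=True,
                    )
                start = merged.shape[0]
                # Grow the target, then write the new rows at the end
                merged.resize(start + src.shape[0], axis=0)
                merged[start:] = src[...]
        return merged.shape if merged is not None else None


# Demo with three small throwaway files (hypothetical names)
tmpdir = tempfile.mkdtemp()
for i in range(1, 4):
    with h5py.File(os.path.join(tmpdir, f"part_{i}.h5"), "w") as h5f:
        h5f.create_dataset("dataset", data=np.full((10, 10), float(i)))

sources = sorted(glob.glob(os.path.join(tmpdir, "part_*.h5")))
merged_path = os.path.join(tmpdir, "merged.h5")
merged_shape = append_all(sources, merged_path)
```

Here the merged dataset grows by one source dataset per iteration, so the number of files does not need to be known up front.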
Here is a link to the h5py docs with more details: h5py Group .copy() method
If you are not bound to a particular library and programming language, one way to solve your issue could be with the usage of HDFql (in C, C++, Java, Python, C#, Fortran or R).
Given that your posts seem to mention C# quite often, find below a solution in C#. It assumes that 1) the dataset name is dset, 2) each dataset is of data type float, and 3) each dataset is a vector of one dimension (size 100) - feel free to adapt the code to your concrete use-case:
// declare variable
float []data = new float[100];

// retrieve all file names (from current directory) that end with '.h5'
HDFql.Execute("SHOW FILE LIKE \\.h5$");

// create an HDF5 file named 'output.h5' and use (i.e. open) it
HDFql.Execute("CREATE AND USE FILE output.h5");

// create a chunked and extendible HDF5 dataset named 'dset' in file 'output.h5'
HDFql.Execute("CREATE CHUNKED(100) DATASET dset AS FLOAT(0 TO UNLIMITED)");

// register variable 'data' for subsequent usage (by HDFql)
HDFql.VariableRegister(data);

// loop cursor and process each file found
while(HDFql.CursorNext() == HDFql.Success)
{
    // alter (i.e. extend) dataset 'dset' (from file 'output.h5') by 100 more floats
    HDFql.Execute("ALTER DIMENSION dset TO +100");

    // select (i.e. read) dataset 'dset' (from file found) and populate variable 'data'
    HDFql.Execute("SELECT FROM \"" + HDFql.CursorGetChar() + "\" dset INTO MEMORY " + HDFql.VariableGetNumber(data));

    // insert (i.e. write) values stored in variable 'data' into dataset 'dset' (from file 'output.h5') at the end of it (using a hyperslab)
    HDFql.Execute("INSERT INTO dset(-1:::) VALUES FROM MEMORY " + HDFql.VariableGetNumber(data));
}
I have a batch data parsing job where the input is a list of zip files, and each zip file contains numerous small text files to parse. The data is on the order of 100 GB compressed across 50 zip files, and each zip has 1 million text files.
I am using Apache Beam's package in Python and running the job through Dataflow.
I wrote it as
Create collection from the list of zip file paths
FlatMap with a function that yields for every text file inside the zip (one output is a bytes string for all the bytes read from the text file)
ParDo with a method that yields for every row in the data from the text file / bytes read
...do other stuff like insert each row in the relevant table of some database
I notice this is too slow - CPU resources are only a few % utilised. I suspect that each node is getting a zip file, but work is not distributed among local CPUs - so it's just one CPU working per node. I don't understand why that is the case considering I used FlatMap.
The Dataflow runner makes use of Fusion optimisation:
'...Such optimizations can include fusing multiple steps or transforms in your pipeline's execution graph into single steps.'
If you have a transform which in its DoFn has a large fan-out, which I suspect the Create transform in your description does, then you may want to manually break fusion by introducing a shuffle stage to your pipeline as described in the linked documentation.
I am using biopython's wrapper API for ncbi eutils to retrieve related proteins, identical proteins and variant proteins (transcripts, splice variants, etc) for a certain protein coding gene.
This information is displayed for a protein coding gene on its ncbi page under the "mRNA and Protein(s)" section.
I am retrieving identical proteins via LinkName=protein_protein_identical and related via LinkName=protein_protein.
Example call
Is there a way to retrieve the transcripts for a protein coding gene?
It's easy but annoying (XML craziness involved). First you retrieve your record from Entrez:
from Bio import Entrez

handle = Entrez.efetch(db="gene",
                       id="10555",
                       retmode="xml")
Now handle is a file-like object from which the XML can be read. You can parse it with Entrez.parse() from Biopython, but I find the XML too entangled to deal with easily. Your mRNA IDs are in
<Entrezgene_comments>
  <Gene-commentary>
    <Gene-commentary_comment>
      <Gene-commentary>
        <Gene-commentary_products>
          <Gene-commentary>
            <Gene-commentary_type value="mRNA">
            <Gene-commentary_products>
              <Gene-commentary>
                <Gene-commentary_type value="peptide">
                <Gene-commentary_accession>NP_001012745</Gene-commentary_accession>
After parsing with Entrez.parse() you'll have a mix of dicts and lists to dive into until you reach your accession ID. Once you have this ID, you can request the sequence from Entrez with:
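One way to avoid walking that nesting by hand is a small recursive search that collects every value stored under a given key. The sketch below is plain Python; the record dict is a hand-built stand-in that only mimics the nesting shown above, not real Entrez output, and the helper name find_values is made up. It assumes the parsed result behaves like ordinary nested dicts and lists.

```python
def find_values(node, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key can appear again deeper down
            yield from find_values(v, key)
    elif isinstance(node, (list, tuple)):
        for item in node:
            yield from find_values(item, key)


# Hypothetical stand-in for a parsed gene record (structure simplified)
record = {
    "Entrezgene_comments": [
        {
            "Gene-commentary_products": [
                {
                    "Gene-commentary_accession": "NM_001012727",
                    "Gene-commentary_products": [
                        {"Gene-commentary_accession": "NP_001012745"}
                    ],
                }
            ]
        }
    ]
}

accessions = list(find_values(record, "Gene-commentary_accession"))
print(accessions)
```

A real record will contain many more accessions (genomic, other products), so the result would still need filtering, for example by prefix.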
handle = Entrez.efetch(db="protein",
                       id="NP_001012745",
                       rettype="fasta",
                       retmode="text")
An alternative approach involves parsing a gene_table. Fetch the same record as before, but instead of XML ask for a gene_table:
handle = Entrez.efetch(db="gene",
                       id="10555",
                       rettype="gene_table",
                       retmode="text")
In the gene_table you'll find some lines in the form:
mRNA transcript variant 2 NM_001012727.1
protein isoform b precursor NP_001012745.1
Exon table for mRNA NM_001012727.1 and protein NP_001012745.1
from which you can extract your IDs.
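Pulling the IDs out of that text can be done with a regular expression. The sketch below runs on the three sample lines quoted above rather than on a live download, and the function name extract_accessions is made up. It only looks for the NM_ and NP_ prefixes; other RefSeq prefixes such as XM_ or XP_ would need to be added to the pattern if you want them.

```python
import re

# Matches accessions such as NM_001012727.1 or NP_001012745 (version is optional)
ACCESSION = re.compile(r"\b(NM|NP)_\d+(?:\.\d+)?\b")


def extract_accessions(text):
    """Return (mRNA accessions, protein accessions), de-duplicated, in order of appearance."""
    mrna, protein = [], []
    for match in ACCESSION.finditer(text):
        target = mrna if match.group(1) == "NM" else protein
        if match.group(0) not in target:
            target.append(match.group(0))
    return mrna, protein


sample = """mRNA transcript variant 2 NM_001012727.1
protein isoform b precursor NP_001012745.1
Exon table for mRNA NM_001012727.1 and protein NP_001012745.1"""

mrna_ids, protein_ids = extract_accessions(sample)
print(mrna_ids, protein_ids)
```

With a real gene_table you would pass the text read from the handle instead of sample.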
Suppose I have a dataset I want to run a Mahout clustering job on. I want each data point to have a unique identifier, such as an ID number. I don't want to append the ID to the vector as this way it will be included in the clustering calculations. How can I include an identifier in the data without the algorithm including the ID number in its calculations? Is there a way to have the input be a key-value pair where the key is the ID and the value is the Vector I want to run the algorithm on?
Alison, before worrying about this, look at the output first. Often the output consists of lines of assigned cluster IDs in the same line order as the input file. For example, the node on the first line of your input file will be on the first line of the output file. So you can keep the IDs in a separate file and their vectors in the input file. Then you can combine the ID file and the output file to see which node is assigned to which cluster.
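Combining the two files then comes down to pairing them line by line. The sketch below is plain Python and independent of Mahout; the function name and the sample IDs and cluster labels are invented for illustration. It relies entirely on the assumption stated above, that the output keeps the input's line order, so check that this holds for your job before trusting the result.

```python
def join_ids_with_clusters(id_lines, cluster_lines):
    """Pair the n-th ID with the n-th cluster assignment."""
    ids = [line.strip() for line in id_lines]
    clusters = [line.strip() for line in cluster_lines]
    if len(ids) != len(clusters):
        # A length mismatch means the line-order assumption cannot hold
        raise ValueError(f"{len(ids)} IDs but {len(clusters)} cluster assignments")
    return dict(zip(ids, clusters))


# Hypothetical contents of the ID file and the clustering output file
id_lines = ["point-001\n", "point-002\n", "point-003\n"]
cluster_lines = ["2\n", "0\n", "2\n"]

assignment = join_ids_with_clusters(id_lines, cluster_lines)
print(assignment)
```

In practice id_lines and cluster_lines would be the two open file objects, since iterating over a file yields its lines.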
I have large data files of values on a 2D grid.
They are organized such that subsequent rows of data in the grid are subsequent lines in the file.
Each column is separated by a tab character.
Essentially, this is a CSV file, but with tabs instead of commas.
I need to transpose the data (first row becomes first column) and output it to another file. What's the best way to do this? Any language is okay (I prefer to use Perl or C/C++). Currently, I have a Perl script that just reads the entire file into memory, but I have files which are simply gigantic.
The simplest way would be to make multiple passes through your input, extracting a subset of columns on each pass. The number of columns would be determined by how much memory you wanted to use and how many rows are in the input file.
For example:
On pass 1 you read the entire input file and process only the first, say, 10 columns. If the input had 1 million rows, the output would be a file with 1 million columns and 10 rows. On the next pass you would read the input again, and process columns 11 thru 20, appending the results to the original output file. And so on....
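The multi-pass idea can be sketched in Python as follows. The function name, the cols_per_pass parameter and the tiny demo file are made up for illustration, and the code assumes a rectangular, tab-separated file with no blank lines. Memory use is roughly cols_per_pass times the number of rows, at the cost of re-reading the input once per pass.

```python
import os
import tempfile


def transpose_in_passes(src_path, dst_path, cols_per_pass=10, sep="\t"):
    """Transpose a delimited text file, holding only a few columns in memory at a time."""
    # Count the columns from the first line
    with open(src_path) as src:
        n_cols = len(src.readline().rstrip("\n").split(sep))

    with open(dst_path, "w") as dst:
        for start in range(0, n_cols, cols_per_pass):
            stop = min(start + cols_per_pass, n_cols)
            # One growing list of values per input column in this batch
            batch = [[] for _ in range(stop - start)]
            with open(src_path) as src:
                for line in src:
                    fields = line.rstrip("\n").split(sep)
                    for j in range(start, stop):
                        batch[j - start].append(fields[j])
            # Each input column becomes one output row
            for column in batch:
                dst.write(sep.join(column) + "\n")


# Demo on a 2x3 grid, processing 2 columns per pass
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "grid.tsv")
dst_path = os.path.join(tmpdir, "grid_T.tsv")
with open(src_path, "w") as f:
    f.write("1\t2\t3\n4\t5\t6\n")

transpose_in_passes(src_path, dst_path, cols_per_pass=2)
with open(dst_path) as f:
    transposed_text = f.read()
```

Choosing a larger cols_per_pass reduces the number of passes over the input but increases memory use proportionally.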
If you have Python with NumPy installed, it's as easy as this:
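#!/usr/bin/env python
import csv
import numpy

# Note: this loads the whole file into memory before transposing
with open('/path/to/data.csv', newline='') as file:
    # The question's data is tab-separated, so set the delimiter explicitly
    csvdata = csv.reader(file, delimiter='\t')
    data = numpy.array(list(csvdata))
transpose = data.T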
... the csv module is part of Python's standard library.