How do I do multiple pairwise alignments from a FASTA file and print the percentage similarity? - biopython

I want to do multiple pairwise comparisons between every protein sequence in a FASTA file and then print the percentage sequence similarity (either as an average or individually). I think I need to use itertools to create all of the combinations, align them, and then probably divide the number of matches by the aligned sequence length to get the % sequence similarity, but I am having trouble writing the specific script to do this, preferably in Biopython if possible. Any help is appreciated.
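For reference, here is a minimal sketch of that itertools approach using Biopython's pairwise2 module. It assumes the input file is named proteins.fasta and uses simple match/mismatch scoring (globalxx) rather than a substitution matrix such as BLOSUM62:
from itertools import combinations
from Bio import SeqIO, pairwise2

# Assumed input file name; replace with your own FASTA file.
records = list(SeqIO.parse("proteins.fasta", "fasta"))
percent_identities = {}
for rec_a, rec_b in combinations(records, 2):
    # Take the best-scoring global alignment for each pair.
    aln = pairwise2.align.globalxx(str(rec_a.seq), str(rec_b.seq))[0]
    aligned_a, aligned_b = aln[0], aln[1]
    # Matches divided by aligned length, as described above.
    matches = sum(a == b for a, b in zip(aligned_a, aligned_b))
    identity = 100.0 * matches / len(aligned_a)
    percent_identities[(rec_a.id, rec_b.id)] = identity
    print(rec_a.id, rec_b.id, round(identity, 2))

# Average over all pairs, if a single number is wanted.
print(sum(percent_identities.values()) / len(percent_identities))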

My answer does not involve Biopython, but since no other answer has been posted yet, I will post it anyway:
The bioinformatics package Biotite (https://www.biotite-python.org/), a package I am currently developing, would solve your problem using the following script:
import numpy as np
import biotite
import biotite.sequence as seq
import biotite.sequence.io.fasta as fasta
import biotite.sequence.align as align
import biotite.database.entrez as entrez
# 5 example sequences (bacterial luciferase variants)
uids = [
    'Q7N575', 'P19839', 'P09140', 'P07740', 'P24113'
]
# Download these sequences as one file from NCBI
file_name = entrez.fetch_single_file(
    uids, biotite.temp_file("fasta"), db_name="protein", ret_type="fasta"
)
# Read each sequence in the file as 'ProteinSequence' object
fasta_file = fasta.FastaFile()
fasta_file.read(file_name)
sequences = list(fasta.get_sequences(fasta_file).values())
# BLOSUM62
substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()
# Matrix that will be filled with pairwise sequence identities
identities = np.ones((len(sequences), len(sequences)))
# Iterate over sequences
for i in range(len(sequences)):
    for j in range(i):
        # Align sequences pairwise
        alignment = align.align_optimal(
            sequences[i], sequences[j], substitution_matrix
        )[0]
        # Calculate pairwise sequence identities and fill matrix
        identity = align.get_sequence_identity(alignment)
        identities[i,j] = identity
        identities[j,i] = identity
print(identities)
The output:
[[1. 0.97214485 0.62921348 0.84225352 0.59776536]
[0.97214485 1. 0.62359551 0.85352113 0.60055866]
[0.62921348 0.62359551 1. 0.61126761 0.85393258]
[0.84225352 0.85352113 0.61126761 1. 0.59383754]
[0.59776536 0.60055866 0.85393258 0.59383754 1. ]]
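If you want a single average value rather than the full matrix, one option (not part of the script above) is to average the strictly upper triangle of identities, which excludes the self-comparisons on the diagonal:
# indices of the strictly upper triangle (i < j)
iu = np.triu_indices(len(sequences), k=1)
print(identities[iu].mean())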

Related

No result after calculating the similarity of two words based on word vectors via Spacy's parser?

I have an example in spacy code:
from numpy import dot
from numpy.linalg import norm
from spacy.lang.en import English
parser = English()
# you can access known words from the parser's vocabulary
nasa = parser.vocab[u'NASA']
# cosine similarity
cosine = lambda v1, v2: dot(v1, v2) / (norm(v1) * norm(v2))
# gather all known words, take only the lowercased versions
allWords = list({w for w in parser.vocab if w.has_vector and
                 w.orth_.islower() and w.lower_ != unicode("nasa")})
# sort by similarity to NASA
allWords.sort(key=lambda w: cosine(w.vector, nasa.vector))
allWords.reverse()
print("Top 10 most similar words to NASA:")
for word in allWords[:10]:
    print(word.orth_)
The result is like this:
Top 10 most similar words to NASA:
Process finished with exit code 0
So no similar words come out.
I have tried to install the parser and glove via cmd:
python -m spacy.en.download parser
python -m spacy.en.download glove
But that failed with:
C:\Python\python.exe: No module named en
By the way, I use:
Python 2.7.9
Spacy 2.0.9
What's wrong with it? Thank you
The parser you are instantiating contains no word vectors. Check https://spacy.io/models/ for an overview of models.
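For example, with spaCy 2.x you could load a model that ships with vectors. This is only a sketch; en_core_web_md is one such model and must be downloaded first with python -m spacy download en_core_web_md:
import spacy

nlp = spacy.load("en_core_web_md")          # medium English model, includes word vectors
doc = nlp("NASA rocket")
nasa, rocket = doc[0], doc[1]
print(nasa.has_vector, rocket.has_vector)   # both should be True
print(nasa.similarity(rocket))              # cosine similarity of the two word vectors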

'Pipeline' object has no attribute 'get_feature_names' in scikit-learn

I am basically clustering some of my documents using the mini_batch_kmeans and kmeans algorithms. I simply followed the tutorial on the scikit-learn website; the link is given below:
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
They use several methods for vectorizing, one of which is HashingVectorizer. For the HashingVectorizer they build a pipeline with the TfidfTransformer() method.
# Perform an IDF normalization on the output of HashingVectorizer
hasher = HashingVectorizer(n_features=opts.n_features,
                           stop_words='english', non_negative=True,
                           norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())
Once I do that, the vectorizer I get back does not have the method get_feature_names(). But since I am using it for clustering, I need to get the "terms" via get_feature_names():
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
How do I solve this error?
My whole code is shown below:
X_train_vecs, vectorizer = vector_bow.count_tfidf_vectorizer(_contents)
mini_kmeans_batch = MiniBatchKmeansTechnique()
# MiniBatchKmeans without the LSA dimensionality reduction
mini_kmeans_batch.mini_kmeans_technique(number_cluster=8, X_train_vecs=X_train_vecs,
                                        vectorizer=vectorizer, filenames=_filenames,
                                        contents=_contents, is_dimension_reduced=False)
The count vectorizer piped with TF-IDF:
def count_tfidf_vectorizer(self, contents):
    count_vect = CountVectorizer()
    vectorizer = make_pipeline(count_vect, TfidfTransformer())
    X_train_vecs = vectorizer.fit_transform(contents)
    print("The count of bow : ", X_train_vecs.shape)
    return X_train_vecs, vectorizer
and the mini_batch_kmeans class is as below:
class MiniBatchKmeansTechnique():
    def mini_kmeans_technique(self, number_cluster, X_train_vecs, vectorizer,
                              filenames, contents, svd=None, is_dimension_reduced=True):
        km = MiniBatchKMeans(n_clusters=number_cluster, init='k-means++', max_iter=100, n_init=10,
                             init_size=1000, batch_size=1000, verbose=True, random_state=42)
        print("Clustering sparse data with %s" % km)
        t0 = time()
        km.fit(X_train_vecs)
        print("done in %0.3fs" % (time() - t0))
        print()
        cluster_labels = km.labels_.tolist()
        print("List of the cluster names is : ", cluster_labels)
        data = {'filename': filenames, 'contents': contents, 'cluster_label': cluster_labels}
        frame = pd.DataFrame(data=data, index=[cluster_labels], columns=['filename', 'contents', 'cluster_label'])
        print(frame['cluster_label'].value_counts(sort=True, ascending=False))
        print()
        grouped = frame['cluster_label'].groupby(frame['cluster_label'])
        print(grouped.mean())
        print()
        print("Top Terms Per Cluster :")
        if is_dimension_reduced:
            if svd != None:
                original_space_centroids = svd.inverse_transform(km.cluster_centers_)
                order_centroids = original_space_centroids.argsort()[:, ::-1]
        else:
            order_centroids = km.cluster_centers_.argsort()[:, ::-1]
        terms = vectorizer.get_feature_names()
        for i in range(number_cluster):
            print("Cluster %d:" % i, end=' ')
            for ind in order_centroids[i, :10]:
                print(' %s' % terms[ind], end=',')
            print()
            print("Cluster %d filenames:" % i, end='')
            for file in frame.ix[i]['filename'].values.tolist():
                print(' %s,' % file, end='')
            print()
Pipeline doesn't have a get_feature_names() method, as it is not straightforward to implement this method for Pipeline - one needs to consider all pipeline steps to get feature names. See https://github.com/scikit-learn/scikit-learn/issues/6424, https://github.com/scikit-learn/scikit-learn/issues/6425, etc. - there are a lot of related tickets and several attempts to fix it.
If your pipeline is simple (TfidfVectorizer followed by MiniBatchKMeans) then you can get feature names from TfidfVectorizer.
If you want to use HashingVectorizer, it is more complicated, as HashingVectorizer doesn't provide feature names by design. HashingVectorizer doesn't store a vocabulary and uses hashes instead - it means it can be applied in an online setting, and that it doesn't require any RAM - but the tradeoff is exactly that you don't get feature names.
It is still possible to get feature names from HashingVectorizer though; to do this you need to apply it for a sample of documents, store which hashes correspond to which words, and this way learn what these hashes mean, i.e. what are the feature names. There may be collisions, so it is not possible to be 100% sure the feature name is correct, but usually this approach works ok. This approach is implemented in eli5 library; see http://eli5.readthedocs.io/en/latest/tutorials/sklearn-text.html#debugging-hashingvectorizer for an example. You will have to do something like this, using InvertableHashingVectorizer:
from eli5.sklearn import InvertableHashingVectorizer
ivec = InvertableHashingVectorizer(vec) # vec is a HashingVectorizer instance
# content_sample is a sample from contents; you can use the
# whole contents array, or just e.g. every 10th element
ivec.fit(content_sample)
hashing_feat_names = ivec.get_feature_names()
Then you can use hashing_feat_names as your feature names, as TfidfTransformer doesn't change input vector size and just scales the same features.
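For instance, hashing_feat_names can simply replace terms in the cluster-printing loop from the question. This is only a sketch, assuming km is the fitted MiniBatchKMeans and number_cluster is the number of clusters:
# hashing_feat_names plays the role of "terms" from the original example
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(number_cluster):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % hashing_feat_names[ind], end='')
    print()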
From the make_pipeline documentation:
This is a shorthand for the Pipeline constructor; it does not require, and
does not permit, naming the estimators. Instead, their names will be set
to the lowercase of their types automatically.
So, in order to access the feature names after you have fitted the pipeline to the data, you can:
# Perform an IDF normalization on the output of HashingVectorizer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
hasher = HashingVectorizer(n_features=10,
                           stop_words='english', non_negative=True,
                           norm=None, binary=False)
tfidf = TfidfVectorizer()
vectorizer = make_pipeline(hasher, tfidf)
# ...
# fit to the data
# ...
# use the instance's class name to lower
terms = vectorizer.named_steps[tfidf.__class__.__name__.lower()].get_feature_names()
# or to be more precise, as used in `_name_estimators`:
# terms = vectorizer.named_steps[type(tfidf).__name__.lower()].get_feature_names()
# btw TfidfTransformer and HashingVectorizer do not have get_feature_names afaik
Hope this helps, good luck!
Edit: After seeing your updated question with the example you follow, @Vivek Kumar is correct: this code, terms = vectorizer.get_feature_names(), will not run for the pipeline, but only when:
vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                             min_df=2, stop_words='english',
                             use_idf=opts.use_idf)

TensorBoard - Plot training and validation losses on the same graph?

Is there a way to plot both the training losses and validation losses on the same graph?
It's easy to have two separate scalar summaries for each of them individually, but this puts them on separate graphs. If both are displayed in the same graph it's much easier to see the gap between them and whether or not they have begun to diverge due to overfitting.
Is there a built in way to do this? If not, a work around way? Thank you much!
The work-around I have been doing is to use two SummaryWriter with different log dir for training set and cross-validation set respectively. And you will see something like this:
Rather than displaying the two lines separately, you can instead plot the difference between validation and training losses as its own scalar summary to track the divergence.
This doesn't give as much information on a single plot (compared with adding two summaries), but it helps with being able to compare multiple runs (and not adding multiple summaries per run).
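For instance, with the PyTorch SummaryWriter used elsewhere in this thread (a sketch; train_losses and val_losses are hypothetical per-step lists):
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("logs/gap")
for step, (train_loss, val_loss) in enumerate(zip(train_losses, val_losses)):
    # A single curve showing how far validation has drifted from training.
    writer.add_scalar("loss_gap", val_loss - train_loss, step)
writer.close()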
Just for anyone coming across this via a search: the current best practice to achieve this goal is to just use the SummaryWriter.add_scalars method from torch.utils.tensorboard. From the docs:
import numpy as np
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
r = 5
for i in range(100):
    writer.add_scalars('run_14h', {'xsinx': i*np.sin(i/r),
                                   'xcosx': i*np.cos(i/r),
                                   'tanx': np.tan(i/r)}, i)
writer.close()
# This call adds three values to the same scalar plot with the tag
# 'run_14h' in TensorBoard's scalar section.
Expected result:
Many thanks to niko for the tip on Custom Scalars.
I was confused by the official custom_scalar_demo.py because there's so much going on, and I had to study it for quite a while before I figured out how it worked.
To show exactly what needs to be done to create a custom scalar graph for an existing model, I put together the following complete example:
# + <
# We need these to make a custom protocol buffer to display custom scalars.
# See https://developers.google.com/protocol-buffers/
from tensorboard.plugins.custom_scalar import layout_pb2
from tensorboard.summary.v1 import custom_scalar_pb
# >
import tensorflow as tf
from time import time
import re
# Initial values
(x0, y0) = (-1, 1)
# This is useful only when re-running code (e.g. Jupyter).
tf.reset_default_graph()
# Set up variables.
x = tf.Variable(x0, name="X", dtype=tf.float64)
y = tf.Variable(y0, name="Y", dtype=tf.float64)
# Define loss function and give it a name.
loss = tf.square(x - 3*y) + tf.square(x+y)
loss = tf.identity(loss, name='my_loss')
# Define the op for performing gradient descent.
minimize_step_op = tf.train.GradientDescentOptimizer(0.092).minimize(loss)
# List quantities to summarize in a dictionary
# with (key, value) = (name, Tensor).
to_summarize = dict(
    X = x,
    Y_plus_2 = y + 2,
)
# Build scalar summaries corresponding to to_summarize.
# This should be done in a separate name scope to avoid name collisions
# between summaries and their respective tensors. The name scope also
# gives a title to a group of scalars in TensorBoard.
with tf.name_scope('scalar_summaries'):
    my_var_summary_op = tf.summary.merge(
        [tf.summary.scalar(name, var)
         for name, var in to_summarize.items()
        ]
    )
# + <
# This constructs the layout for the custom scalar, and specifies
# which scalars to plot.
layout_summary = custom_scalar_pb(
    layout_pb2.Layout(category=[
        layout_pb2.Category(
            title='Custom scalar summary group',
            chart=[
                layout_pb2.Chart(
                    title='Custom scalar summary chart',
                    multiline=layout_pb2.MultilineChartContent(
                        # regex to select only summaries which
                        # are in "scalar_summaries" name scope:
                        tag=[r'^scalar_summaries\/']
                    )
                )
            ])
    ])
)
# >
# Create session.
with tf.Session() as sess:
    # Initialize session.
    sess.run(tf.global_variables_initializer())
    # Create writer.
    with tf.summary.FileWriter(f'./logs/session_{int(time())}') as writer:
        # Write the session graph.
        writer.add_graph(sess.graph) # (not necessary for scalars)
        # + <
        # Define the layout for creating custom scalars in terms
        # of the scalars.
        writer.add_summary(layout_summary)
        # >
        # Main iteration loop.
        for i in range(50):
            current_summary = sess.run(my_var_summary_op)
            writer.add_summary(current_summary, global_step=i)
            writer.flush()
            sess.run(minimize_step_op)
The above consists of an "original model" augmented by three blocks of code indicated by
# + <
[code to add custom scalars goes here]
# >
My "original model" has these scalars:
and this graph:
My modified model has the same scalars and graph, together with the following custom scalar:
This custom scalar chart is simply a layout which combines the original two scalar charts.
Unfortunately the resulting graph is hard to read because both values have the same color. (They are distinguished only by marker.) This is however consistent with TensorBoard's convention of having one color per log.
Explanation
The idea is as follows. You have some group of variables which you want to plot inside a single chart. As a prerequisite, TensorBoard should be plotting each variable individually under the "SCALARS" heading. (This is accomplished by creating a scalar summary for each variable, and then writing those summaries to the log. Nothing new here.)
To plot multiple variables in the same chart, we tell TensorBoard which of these summaries to group together. The specified summaries are then combined into a single chart under the "CUSTOM SCALARS" heading. We accomplish this by writing a "Layout" once at the beginning of the log. Once TensorBoard receives the layout, it automatically produces a combined chart under "CUSTOM SCALARS" as the ordinary "SCALARS" are updated.
Assuming that your "original model" is already sending your variables (as scalar summaries) to TensorBoard, the only modification necessary is to inject the layout before your main iteration loop starts. Each custom scalar chart selects which summaries to plot by means of a regular expression. Thus for each group of variables to be plotted together, it can be useful to place the variables' respective summaries into a separate name scope. (That way your regex can simply select all summaries under that name scope.)
Important Note: The op which generates the summary of a variable is distinct from the variable itself. For example, if I have a variable ns1/my_var, I can create a summary ns2/summary_op_for_myvar. The custom scalars chart layout cares only about the summary op, not the name or scope of the original variable.
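As a small illustration of that note (a sketch using the same TF1 API as the example above; the scope and tag names are arbitrary):
with tf.name_scope("ns1"):
    my_var = tf.Variable(0.0, name="my_var")        # variable lives at ns1/my_var
with tf.name_scope("ns2"):
    # The chart layout's regex matches this summary op's tag
    # ("ns2/summary_op_for_myvar"), not the variable's own name.
    summary_op_for_myvar = tf.summary.scalar("summary_op_for_myvar", my_var)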
Here is an example that creates two tf.summary.FileWriters sharing the same root directory and a single tf.summary.scalar shared by both. At every time step, it gets the summary and updates each tf.summary.FileWriter.
import os
import tqdm
import tensorflow as tf
def tb_test():
    sess = tf.Session()
    x = tf.placeholder(dtype=tf.float32)
    summary = tf.summary.scalar('Values', x)
    merged = tf.summary.merge_all()
    sess.run(tf.global_variables_initializer())
    writer_1 = tf.summary.FileWriter(os.path.join('tb_summary', 'train'))
    writer_2 = tf.summary.FileWriter(os.path.join('tb_summary', 'eval'))
    for i in tqdm.tqdm(range(200)):
        # train
        summary_1 = sess.run(merged, feed_dict={x: i-10})
        writer_1.add_summary(summary_1, i)
        # eval
        summary_2 = sess.run(merged, feed_dict={x: i+10})
        writer_2.add_summary(summary_2, i)
    writer_1.close()
    writer_2.close()

if __name__ == '__main__':
    tb_test()
Here is the result:
The orange line shows the result of the evaluation stage, and correspondingly, the blue line illustrates the data of the training stage.
Also, there is a very useful post by the TF team to which you can refer.
For completeness, since tensorboard 1.5.0 this is now possible.
You can use the custom scalars plugin. For this, you need to first make tensorboard layout configuration and write it to the event file. From the tensorboard example:
import tensorflow as tf
from tensorboard import summary
from tensorboard.plugins.custom_scalar import layout_pb2
# The layout has to be specified and written only once, not at every step
layout_summary = summary.custom_scalar_pb(layout_pb2.Layout(
    category=[
        layout_pb2.Category(
            title='losses',
            chart=[
                layout_pb2.Chart(
                    title='losses',
                    multiline=layout_pb2.MultilineChartContent(
                        tag=[r'loss.*'],
                    )),
                layout_pb2.Chart(
                    title='baz',
                    margin=layout_pb2.MarginChartContent(
                        series=[
                            layout_pb2.MarginChartContent.Series(
                                value='loss/baz/scalar_summary',
                                lower='baz_lower/baz/scalar_summary',
                                upper='baz_upper/baz/scalar_summary'),
                        ],
                    )),
            ]),
        layout_pb2.Category(
            title='trig functions',
            chart=[
                layout_pb2.Chart(
                    title='wave trig functions',
                    multiline=layout_pb2.MultilineChartContent(
                        tag=[r'trigFunctions/cosine', r'trigFunctions/sine'],
                    )),
                # The range of tangent is different. Let's give it its own chart.
                layout_pb2.Chart(
                    title='tan',
                    multiline=layout_pb2.MultilineChartContent(
                        tag=[r'trigFunctions/tangent'],
                    )),
            ],
            # This category we care less about. Let's make it initially closed.
            closed=True),
    ]))
writer = tf.summary.FileWriter(".")
writer.add_summary(layout_summary)
# ...
# Add any summary data you want to the file
# ...
writer.close()
A Category is a group of Charts. Each Chart corresponds to a single plot which displays several scalars together. A Chart can plot simple scalars (MultilineChartContent) or filled areas (MarginChartContent, e.g. when you want to plot the deviation of some value). The tag member of MultilineChartContent must be a list of regexes which match the tags of the scalars that you want to group in the Chart. For more details, check the proto definitions of the objects in https://github.com/tensorflow/tensorboard/blob/master/tensorboard/plugins/custom_scalar/layout.proto. Note that if you have several FileWriters writing to the same directory, you need to write the layout in only one of the files. Writing it to a separate file also works.
To view the data in TensorBoard, you need to open the Custom Scalars tab. Here is an example image of what to expect https://user-images.githubusercontent.com/4221553/32865784-840edf52-ca19-11e7-88bc-1806b1243e0d.png
The solution in PyTorch 1.5 with the approach of two writers:
import os
from torch.utils.tensorboard import SummaryWriter
LOG_DIR = "experiment_dir"
train_writer = SummaryWriter(os.path.join(LOG_DIR, "train"))
val_writer = SummaryWriter(os.path.join(LOG_DIR, "val"))
# while in the training loop
for k, v in train_losses.items():
    train_writer.add_scalar(k, v, global_step)

# in the validation loop
for k, v in val_losses.items():
    val_writer.add_scalar(k, v, global_step)
# at the end
train_writer.close()
val_writer.close()
Keys in the train_losses dict have to match those in the val_losses to be grouped on the same graph.
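For example (hypothetical values; what matters is that both dicts use the same keys):
# Both dicts use the tags "loss" and "accuracy", so TensorBoard overlays the
# train/ and val/ curves for each tag on a single chart.
train_losses = {"loss": 0.41, "accuracy": 0.88}
val_losses = {"loss": 0.57, "accuracy": 0.83}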
TensorBoard is a really nice tool, but its declarative nature can make it difficult to get it to do exactly what you want.
I recommend you checkout Losswise (https://losswise.com) for plotting and keeping track of loss functions as an alternative to Tensorboard. With Losswise you specify exactly what should be graphed together:
import losswise
losswise.set_api_key("project api key")
session = losswise.Session(tag='my_special_lstm', max_iter=10)
loss_graph = session.graph('loss', kind='min')
# train an iteration of your model...
loss_graph.append(x, {'train_loss': train_loss, 'validation_loss': validation_loss})
# keep training model...
session.done()
And then you get something that looks like:
Notice how the data is fed to a particular graph explicitly via the loss_graph.append call, the data for which then appears in your project's dashboard.
In addition, for the above example Losswise would automatically generate a table with columns for min(training_loss) and min(validation_loss) so you can easily compare summary statistics across your experiments. Very useful for comparing results across a large number of experiments.
Let me contribute a code sample to the answer given by @Lifu Huang. First download logger.py from here and then:
from logger import Logger
def train_model(parameters...):
    N_EPOCHS = 15
    # Set the logger
    train_logger = Logger('./summaries/train_logs')
    test_logger = Logger('./summaries/test_logs')
    for epoch in range(N_EPOCHS):
        # Code to get train_loss and test_loss
        # ============ TensorBoard logging ============#
        # Log the scalar values
        train_info = {
            'loss': train_loss,
        }
        test_info = {
            'loss': test_loss,
        }
        for tag, value in train_info.items():
            train_logger.scalar_summary(tag, value, step=epoch)
        for tag, value in test_info.items():
            test_logger.scalar_summary(tag, value, step=epoch)
Finally you run tensorboard --logdir=summaries/ --port=6006 and you get:

Biopython retrieving particular CDS from a whole genome

I am new to Stack Overflow. I am trying to automate a search process using Biopython. I have two lists, one with protein GI numbers and one with corresponding nucleotide GI numbers.
For example:
protein_GI=[588489721,788136950,409084506]
nucleo_GI=[588489708,788136846,409084493]
The second list was created using ELink. However, the nucleotide GIs correspond to whole genomes. I need to retrieve the particular CDS from each genome that matches the protein GI.
I tried using ELink again with different link names ("protein_nucleotide_cds", "protein_nuccore") but all I get is id numbers for whole genomes. Should I try some other link names?
I also tried the following EFetch code:
import Bio
from Bio import Entrez
Entrez.email = None
handle=Entrez.efetch(db="sequences",id="588489708,588489721",rettype="fasta",retmode="text")
print(handle.read())
This method gives me the nucleotide and protein sequences in FASTA format, but the nucleotide sequence is a whole genome.
I would be very grateful, if somebody could help me.
Thanking you in advance!
I hope this helps:
import Bio
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "mail@example.com"

gi_protein = "GI:588489721"
gi_genome = "GI:588489708"

handle = Entrez.efetch(db="sequences", id=gi_protein, rettype="fasta", retmode="text")
protein = next(SeqIO.parse(handle, "fasta"))

handle = Entrez.efetch(db="sequences", id=gi_genome, rettype="gbwithparts", retmode="text")
genome = next(SeqIO.parse(handle, "gb"))

# to extract the feature whose 'db_xref' matches the protein GI
feature = [f for f in genome.features
           if "db_xref" in f.qualifiers and gi_protein in f.qualifiers["db_xref"]]

# to get the location of the CDS
start = feature[0].location.start.position
end = feature[0].location.end.position
strand = feature[0].location.strand
seq = genome[start:end]

if strand == 1:
    print(seq.seq)
else:
    # if strand is -1 then take the reverse complement
    print(seq.reverse_complement().seq)

print(protein.seq)
then you get:
ATGGATTATATTGTTTCAGCACGAAAATATCGTCCCTCTACCTTTGTTTCGGTGGTAGGG
CAGCAGAACATCACCACTACATTAAAAAATGCCATTAAAGGCAGTCAACTGGCACACGCC
TATCTTTTTTGCGGACCGCGAGGTGTGGGAAAGACGACTTGTGCCCGTATCTTTGCTAAA
ACCATCAACTGTTCGAATATATCAGCTGATTTTGAAGCGTGTAATGAGTGTGAATCCTGT
AAGTCTTTTAATGAGAATCGTTCTTATAATATTCATGAACTGGATGGAGCCTCCAATAAC
TCAGTAGAGGATATCAGGAGTCTGATTGATAAAGTTCGTGTTCCACCTCAGATAGGTAGT
TATAGTGTATATATTATCGATGAGGTTCACATGTTATCGCAGGCAGCTTTTAATGCTTTT
CTTAAAACATTGGAAGAGCCACCCAAGCATGCCATCTTTATTTTGGCCACTACTGAAAAA
CATAAAATACTACCAACGATCCTGTCTCGTTGCCAGATTTACGATTTTAATAGGATTACC
ATTGAAGATGCGGTAGGTCATTTAAAATATGTAGCAGAGAGTGAGCATATAACTGTGGAA
GAAGAGGGGTTAACCGTCATTGCACAAAAAGCTGATGGAGCTATGCGGGATGCACTTTCC
ATCTTTGATCAGATTGTGGCTTTCTCAGGTAAAAGTATCAGCTATCAGCAAGTAATCGAT
AATTTGAATGTATTGGATTATGATTTTTACTTTAGGTTGGTGGATGCTTTTCTGGCAGAA
GATACTACACAAACACTATTGATTTTTGATGAGATATTGAAACGGGGATTTGATGCACAT
CATTTTATTTCCGGTTTAAGTTCTCATTTGCGTGATTTACTTGTATGTAAGGATGCAGCC
ACCATTCAGTTGCTGGATGTGGGTGCTAAAATTAAGGAGAAGTACGGTGTTCAGGCGCAA
AAAAGTACGATTGACTTTTTAATGGATGCTTTAAATATTACCAACGATTGCGATTTGCAA
TATAGGGTGGCTAAAAATAAGCGTTTGCATGTGGAGTTTGCTCTTCTTAAGATAGCACGT
GTATTAGATGAACAAAGAAAAAAGTAG
MDYIVSARKYRPSTFVSVVGQQNITTTLKNAIKGSQLAHAYLFCGPRGVGKTTCARIFAK
TINCSNISADFEACNECESCKSFNENRSYNIHELDGASNNSVEDIRSLIDKVRVPPQIGS
YSVYIIDEVHMLSQAAFNAFLKTLEEPPKHAIFILATTEKHKILPTILSRCQIYDFNRIT
IEDAVGHLKYVAESEHITVEEEGLTVIAQKADGAMRDALSIFDQIVAFSGKSISYQQVID
NLNVLDYDFYFRLVDAFLAEDTTQTLLIFDEILKRGFDAHHFISGLSSHLRDLLVCKDAA
TIQLLDVGAKIKEKYGVQAQKSTIDFLMDALNITNDCDLQYRVAKNKRLHVEFALLKIAR
VLDEQRKK

How can I read a complex HDF5 array in Julia?

I have many HDF5 datasets containing complex number arrays, which I have created using Python and h5py. For example:
import numpy, h5py
with h5py.File("test.h5", "w") as f:
f["mat"] = numpy.array([1.0 + .5j, 2.0 - 1.0j], dtype=complex)
HDF5 has no native concept of complex numbers, so h5py stores them as a compound data type, with fields "r" and "i" for the real and imaginary parts.
How can I load such arrays of complex numbers in Julia, using HDF5.jl?
EDIT: The obvious attempt
using HDF5
h5open("test.h5", "r") do fd
println(read(fd, "mat"))
end
returns a cryptic response:
HDF5Compound(Uint8[0,0,0,0,0,0,240,63,0,0,0,0,0,0,224,63,0,0,0,0,0,0,0,64,0,0,0,0,0,0,240,191],Type[Float64,Float64],ASCIIString["r","i"],Uint64[0,8])
As @simonster pointed out, there is a fast and safe way to do this.
If you had written:
a = read(fd, "mat"))
then the complex vector that you want is simply:
cx_vec = reinterpret(Complex{Float64}, a.data)
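Putting the two pieces together with the same (older) HDF5.jl API used in the question, the whole read might look like this sketch:
using HDF5

h5open("test.h5", "r") do fd
    a = read(fd, "mat")                            # HDF5Compound with fields "r" and "i"
    cx_vec = reinterpret(Complex{Float64}, a.data)
    println(cx_vec)
end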
I hadn't thought of this before, but one solution is simply to use h5py with PyCall:
using PyCall
@pyimport h5py
f = h5py.File("test.h5", "r")
mat = get(get(f, "mat"), pybuiltin("Ellipsis"))
f[:close]()
println(mat)
In Julia 0.6 you can do the following. As long as you have the HDF5 module and DataFrames already installed this example is immediately executable because the example HDF5 file comes with HDF5.jl. In all likelihood it only works on common types. I haven't tested it beyond the example file as I'm still trying to figure out how to write/create compound tables from Julia.
using HDF5
using DataFrames
# Compound Table Read
d = h5read(joinpath(Pkg.dir("HDF5"),"test","test_files","compound.h5"),"/data")
# Convert d to a dataframe, D
types = [typeof(i) for i in d[1].data] # Data type list
names_HDF5 = [Symbol(i) for i in d[1].membername] # Column name list
D = DataFrame(types,names_HDF5,length(d)) # Preallocate the array
rows = length(d) # Number of rows
cols = length(d[1].data) # Number of columns
for i=1:rows
    for j=1:cols
        D[i,j] = d[i].data[j] # Save each element to the preallocated dataframe
    end
end
d is a vector of table rows. Each element is of type HDF5Compound, with three fields: data, membername, and membertype.
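For example, you can inspect a single row through those fields (using the d read above):
first_row = d[1]
println(first_row.membername)   # column names of the compound type
println(first_row.membertype)   # corresponding element types
println(first_row.data)         # the values stored in that row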
