Performance of multiple chunked datasets in the same HDF5 file? - hdf5

Suppose (I am adding a code example below) that I create multiple chunked datasets in the same HDF5 file and start appending data to each dataset in random order. Since HDF5 does not know in advance what size to allocate for each dataset, I would think that each append operation (or perhaps a dataset buffer, once filled) is directly appended to the HDF5 file. If so, the data of each dataset would be interleaved with the data from the other datasets, and would be spread out in chunks over the entire HDF5 file.
My question is: if the above description is more or less accurate, would this not adversely affect the performance of read operations done later from that file, and perhaps also the file size if more metadata records are required? And (as a corollary), if the option exists to store each dataset in a separate file, would it not be better to do so from the viewpoint of read performance?
Here is an example of how the HDF5 file that I describe at the beginning could be created:
import h5py, numpy as np
dtype1 = np.dtype( [ ('t','f8'), ('T','f8') ] )
dtype2 = np.dtype( [ ('q','i2'), ('Q','f8'), ('R','f8') ] )
dtype3 = np.dtype( [ ('p','f8'), ('P','i8') ] )
with h5py.File('foo.hdf5','w') as f:
    dset1 = f.create_dataset('dset1', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype1))
    dset2 = f.create_dataset('dset2', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype2))
    dset3 = f.create_dataset('dset3', (1,), maxshape=(None,), dtype=h5py.vlen_dtype(dtype3))
    for _ in range(10):
        random_lengths = np.random.randint(low=1, high=10, size=3)
        d1 = np.ones( (random_lengths[0],), dtype=dtype1 )
        dset1[-1] = d1
        dset1.resize( (dset1.shape[0]+1,) )
        d2 = np.ones( (random_lengths[1],), dtype=dtype2 )
        dset2[-1] = d2
        dset2.resize( (dset2.shape[0]+1,) )
        d3 = np.ones( (random_lengths[2],), dtype=dtype3 )
        dset3[-1] = d3
        dset3.resize( (dset3.shape[0]+1,) )
I know I could try it both ways (a single file with multiple datasets, or multiple files with a single dataset each) and time it, but the result might depend on the specifics of the example data used. I would rather have a more general answer to this question, and possibly some insight into how HDF5/h5py work internally in this case.
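The layout the question worries about can be pictured with a toy model. The sketch below is not HDF5's allocator; it only assumes, as the question does, that every appended chunk lands at the current end of the file. The function names simulate_appends and contiguous_runs are made up for this illustration. It counts how many separate contiguous runs each dataset ends up with, which is a rough proxy for how many seeks a sequential read of that dataset would need.

```python
import random

def simulate_appends(n_datasets, n_appends, seed=0):
    """Append one fixed-size chunk at a time for a randomly chosen dataset.

    Returns a list where entry i is the id of the dataset owning file slot i.
    """
    rng = random.Random(seed)
    return [rng.randrange(n_datasets) for _ in range(n_appends)]

def contiguous_runs(layout, dataset_id):
    """Count the separate contiguous runs of slots owned by one dataset."""
    runs = 0
    previous = None
    for owner in layout:
        if owner == dataset_id and previous != dataset_id:
            runs += 1
        previous = owner
    return runs

layout = simulate_appends(n_datasets=3, n_appends=30)
for ds in range(3):
    print(f"dataset {ds}: {layout.count(ds)} chunks in {contiguous_runs(layout, ds)} runs")
```

Under this assumption, a dataset stored alone in its own file would always form a single run, so the difference between the two designs comes down to how expensive the extra seeks are on the storage in use.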

Related

How to save sentence-Bert output vectors to a file?

I am using BERT to get the similarity between multi-term words. Here is the code that I used for embedding:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-large-uncased-whole-word-masking')
words = [
    "Artificial intelligence",
    "Data mining",
    "Political history",
    "Literature book"]
I also have a dataset which contains 540000 other words.
Vocabs = [
    "Winter flooding",
    "Cholesterol diet", ....]
The problem is that embedding Vocabs into vectors takes forever.
words_embeddings = model.encode(words)
Vocabs_embeddings = model.encode(Vocabs)
Is there any way to make it faster? Alternatively, I would like to embed Vocabs in a for loop and save the output vectors to a file, so that I don't have to embed 540000 vocabs every time I need them. Is there a way to save embeddings to a file and use them again?
I would really appreciate your time in trying to help me.
You can pickle your corpus and embeddings like this. You can also pickle a dictionary instead, or write them to a file in any other format you prefer.
import pickle
with open("my-embeddings.pkl", "wb") as fOut:
    pickle.dump({'sentences': words, 'embeddings': words_embeddings}, fOut)
Or more generally like below, so you encode when the embeddings don't exist but after that any time you need them you load from file, instead of re-encoding your corpus:
import os
import pickle
import numpy as np

# embedding_cache_path and model are assumed to be defined earlier
if not os.path.exists(embedding_cache_path):
    # read your corpus etc
    corpus_sentences = ...
    print("Encoding the corpus. This might take a while")
    corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)
    # normalize each embedding to unit length
    corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)
    print("Storing file on disc")
    with open(embedding_cache_path, "wb") as fOut:
        pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)
else:
    print("Loading pre-computed embeddings from disc")
    with open(embedding_cache_path, "rb") as fIn:
        cache_data = pickle.load(fIn)
        corpus_sentences = cache_data['sentences']
        corpus_embeddings = cache_data['embeddings']
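The same encode-once, load-afterwards pattern can be wrapped in a small helper. This is only a sketch using the standard library: load_or_compute and fake_encode are hypothetical names, and fake_encode is a stand-in for model.encode so that the example runs without downloading a model. The helper also re-encodes when the cached sentences differ from the requested ones, which is an extra safety check not present in the answer above.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, sentences, encode):
    """Return embeddings for `sentences`, reusing `cache_path` when it matches."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f_in:
            cache = pickle.load(f_in)
        if cache["sentences"] == sentences:
            return cache["embeddings"]
    # No usable cache: encode once and store the result for next time.
    embeddings = encode(sentences)
    with open(cache_path, "wb") as f_out:
        pickle.dump({"sentences": sentences, "embeddings": embeddings}, f_out)
    return embeddings

calls = []  # records how often the (expensive) encoder actually ran

def fake_encode(sentences):
    """Hypothetical stand-in for model.encode: one tiny vector per sentence."""
    calls.append(len(sentences))
    return [[float(len(s)), float(s.count(" "))] for s in sentences]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my-embeddings.pkl")
    vocabs = ["Winter flooding", "Cholesterol diet"]
    first = load_or_compute(path, vocabs, fake_encode)   # encodes and stores
    second = load_or_compute(path, vocabs, fake_encode)  # loads from disk

print(first == second, calls)
```

With a real model, the encode argument would be model.encode, and the list of 540000 vocabs would only be encoded on the first run.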

Writing Dask/XArray to NetCDF - Parallel IO

I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on an HPC system. I have the computation component complete, which takes about 30 minutes. I want to save the final result to a NETCDF4 file, but writing the data to a NETCDF file is quite slow (~3 hrs) and does not seem to run in parallel. It is unclear to me whether the "to_netcdf" function in Xarray is supposed to support parallel writes. Currently my approach is to write an empty netCDF file with NetCDF4 and then append the data from the Xarray:
f_mosaic = 't1.nc'
meta = {'width': dat_f.shape[1],
        'height': dat_f.shape[2],
        'crs': rasterio.crs.CRS(init='epsg:'+fi['CPER']['Reflectance']['Metadata']['Coordinate_System']['EPSG Code'].value.decode("utf-8")),
        'transform': aff_final,
        'count': dat_f.shape[0]}
with netCDF4.Dataset(f_mosaic, mode='w', format="NETCDF4") as t1:
    # Create spatial dimensions
    y = t1.createDimension('y', meta['width'])
    x = t1.createDimension('x', meta['height'])
    wl_dim = t1.createDimension('wl',meta['count'])
    reflectance = t1.createVariable("reflectance","int16",("wl","y","x",),fill_value=null_val,zlib=True)
    reflectance.setncattr('grid_mapping', 'crs')
    crs = t1.createVariable('crs', 'c')
    crs.spatial_ref = meta['crs'].wkt
    crs.epsg_code = meta['crs'].to_string()
    crs.GeoTransform = " ".join(str(x) for x in meta['transform'].to_gdal())
dat_f.to_netcdf(path=f_mosaic,mode='a',format='NETCDF4',encoding={'reflectance':{'zlib':True}})
Overall, the question is, how can I write this data to a NETCDF4 file quickly? Does dask/Xarray support parallel writes with NETCDF4? If so, what am I doing incorrectly?
Thanks!

Keras Text Preprocessing - Saving Tokenizer object to file for scoring

I've trained a sentiment classifier model using the Keras library by (broadly) following the steps below:
Convert Text corpus into sequences using Tokenizer object/class
Build a model using the model.fit() method
Evaluate this model
Now for scoring using this model, I was able to save the model to a file and load from a file. However I've not found a way to save the Tokenizer object to file. Without this I'll have to process the corpus every time I need to score even a single sentence. Is there a way around this?
The most common way is to use either pickle or joblib. Here you have an example on how to use pickle in order to save Tokenizer:
import pickle
# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
The Tokenizer class has a function to save data in JSON format:
import io
import json
tokenizer_json = tokenizer.to_json()
with io.open('tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps(tokenizer_json, ensure_ascii=False))
The data can be loaded using the tokenizer_from_json function from keras_preprocessing.text:
from keras_preprocessing.text import tokenizer_from_json
with open('tokenizer.json') as f:
    data = json.load(f)
tokenizer = tokenizer_from_json(data)
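One detail of the JSON approach is easy to miss: assuming to_json() returns a JSON string (as it does in keras_preprocessing), calling json.dumps on it adds a second layer of JSON encoding, and json.load on the file removes exactly that layer, so what reaches tokenizer_from_json is again a string. The sketch below reproduces the round trip with the standard library only; the config dictionary is a made-up stand-in for a real tokenizer configuration, and an in-memory buffer replaces the file.

```python
import io
import json

# Hypothetical stand-in for the string returned by tokenizer.to_json().
config = {"class_name": "Tokenizer", "config": {"num_words": None, "lower": True}}
tokenizer_json = json.dumps(config)

# Saving as in the answer: json.dumps wraps the string in one more JSON layer.
buffer = io.StringIO()
buffer.write(json.dumps(tokenizer_json, ensure_ascii=False))

# Loading as in the answer: json.load removes that layer and yields a str.
buffer.seek(0)
data = json.load(buffer)

# A second decode, done inside the loader in the real case, gives the config.
restored = json.loads(data)
print(type(data).__name__, restored == config)
```

Writing tokenizer_json to the file directly would also work, but then the loading side would have to read the file as text rather than with json.load.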
The accepted answer clearly demonstrates how to save the tokenizer. The following is a comment on the problem of (generally) scoring after fitting or saving. Suppose that a list texts is composed of two lists, Train_text and Test_text, where the set of tokens in Test_text is a subset of the set of tokens in Train_text (an optimistic assumption). Then calling fit_on_texts(Train_text) gives different results for texts_to_sequences(Test_text) than first calling fit_on_texts(texts) and then texts_to_sequences(Test_text).
Concrete Example:
from keras.preprocessing.text import Tokenizer
docs = ["A heart that",
"full up like",
"a landfill",
"no surprises",
"and no alarms"
"a job that slowly"
"Bruises that",
"You look so",
"tired happy",
"no alarms",
"and no surprises"]
docs_train = docs[:7]
docs_test = docs[7:]
# EXPERIMENT 1: FIT TOKENIZER ONLY ON TRAIN
T_1 = Tokenizer()
T_1.fit_on_texts(docs_train) # only train set
encoded_train_1 = T_1.texts_to_sequences(docs_train)
encoded_test_1 = T_1.texts_to_sequences(docs_test)
print("result for test 1:\n%s" %(encoded_test_1,))
# EXPERIMENT 2: FIT TOKENIZER ON BOTH TRAIN + TEST
T_2 = Tokenizer()
T_2.fit_on_texts(docs) # both train and test set
encoded_train_2 = T_2.texts_to_sequences(docs_train)
encoded_test_2 = T_2.texts_to_sequences(docs_test)
print("result for test 2:\n%s" %(encoded_test_2,))
Results:
result for test 1:
[[3], [10, 3, 9]]
result for test 2:
[[1, 19], [5, 1, 4]]
Of course, if the above optimistic assumption is not satisfied and the set of tokens in Test_text is disjoint from that of Train_text, then test 1 results in a list of empty lists, one [] per test document.
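The effect can be reproduced without Keras. The sketch below is a simplified stand-in, not the Keras implementation: fit_word_index and to_sequences are hypothetical helpers that rank words by frequency (most frequent word gets index 1) and drop unknown words, which is the behaviour the experiment above relies on.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a 1-based rank, most frequent first (ties keep first-seen order)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequences(word_index, texts):
    """Encode texts with a fitted index, silently dropping unknown words."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

train = ["a quiet life", "no surprises"]
test = ["no alarms", "no surprises"]

index_train = fit_word_index(train)        # fitted on train only
index_all = fit_word_index(train + test)   # fitted on train + test

seq_train_only = to_sequences(index_train, test)
seq_all = to_sequences(index_all, test)
print(seq_train_only)  # [[4], [4, 5]]
print(seq_all)         # [[1, 6], [1, 2]]
```

Two things change between the runs: "alarms" is dropped when it was never seen during fitting, and the indices of the shared words shift because the frequencies differ. A model trained on one index therefore cannot be scored with the other, which is why the fitted tokenizer has to be saved alongside the model.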
I've created the issue https://github.com/keras-team/keras/issues/9289 in the Keras repo. Until the API is changed, the issue has a link to a gist with code that demonstrates how to save and restore a tokenizer without having the original documents the tokenizer was fit on. I prefer to store all my model information in a JSON file (because reasons, but mainly a mixed JS/Python environment), and this will allow for that, even with sort_keys=True.
I found the following snippet, provided at the following link by @thusv89.
Save objects:
import pickle
with open('data_objects.pickle', 'wb') as handle:
    pickle.dump(
        {'input_tensor': input_tensor,
         'target_tensor': target_tensor,
         'inp_lang': inp_lang,
         'targ_lang': targ_lang,
        }, handle, protocol=pickle.HIGHEST_PROTOCOL)
Load objects (from the same file that was written above):
with open('data_objects.pickle', 'rb') as f:
    data = pickle.load(f)
    input_tensor = data['input_tensor']
    target_tensor = data['target_tensor']
    inp_lang = data['inp_lang']
    targ_lang = data['targ_lang']
This is quite easy, because the Tokenizer class provides two functions for saving and loading:
save: Tokenizer.to_json()
load: keras.preprocessing.text.tokenizer_from_json
The to_json() method calls the get_config method, which handles this:
json_word_counts = json.dumps(self.word_counts)
json_word_docs = json.dumps(self.word_docs)
json_index_docs = json.dumps(self.index_docs)
json_word_index = json.dumps(self.word_index)
json_index_word = json.dumps(self.index_word)
return {
    'num_words': self.num_words,
    'filters': self.filters,
    'lower': self.lower,
    'split': self.split,
    'char_level': self.char_level,
    'oov_token': self.oov_token,
    'document_count': self.document_count,
    'word_counts': json_word_counts,
    'word_docs': json_word_docs,
    'index_docs': json_index_docs,
    'index_word': json_index_word,
    'word_index': json_word_index
}

Save a meta-model for future use

I am using openMDAO to construct a co-kriging metamodel that I would like to export and then import in another python code.
I've found a message on the old forum (http://openmdao.org/forum/questions/444/how-can-i-save-the-metamodel-for-later-use?sort=votes) in which someone used pickle to save a meta-model.
I have also read about the recorders; however, none of the tests I performed with them succeeded.
Is there a way to save the meta-model and use it in a future code?
EDIT: I think I found a kind of solution using 'pickle'. I succeeded in doing this with a kriging meta-model, but I assume it would work the same with the co-kriging.
As in the post on the 'old' OpenMDAO forum, I saved the trained meta-model in a file and then reused it in another Python script. Here is the part of the code that saves the trained meta-model:
cok = MultiFiCoKrigingSurrogate()
prob = Problem(Simulation(cok, nfi=2))
prob.setup(check=False)
prob['mm.train:x1'] = DATA_HF_dim
prob['mm.train:x1_fi2'] = DATA_LF_dim
prob['mm.train:y1'] = rastri_e
prob['mm.train:y1_fi2'] = rastri_c
prob.run()
import pickle
# the with block closes the file (the original f.close lacked parentheses and never ran)
with open('meta_model_info.p','wb') as f:
    pickle.dump(prob,f)
Once the trained meta-model is saved in the file meta_model_info.p, I can load it in another script, skipping the learning phase. Part of the code of the second script is here:
class Simulation(Group):
    def __init__(self, surrogate, nfi):
        super(Simulation, self).__init__()
        self.surrogate = surrogate
        mm = self.add("mm", MultiFiMetaModel(nfi=nfi))
        mm.add_param('x1', val=0.)
        mm.add_output('y1', val=(0.,0.), surrogate=surrogate)

cok = MultiFiCoKrigingSurrogate()
prob = Problem(Simulation(cok, nfi=2))
prob.setup(check=False)
import pickle
with open('meta_model_info.p','rb') as f:
    clf = pickle.load(f)
pred_cok_clf = []
for x in inputs:
    clf['mm.x1'] = x
    clf.run()
    pred_cok_clf.append(clf['mm.y1'])
pred_mu_clf = np.array([float(p[0]) for p in pred_cok_clf])
pred_sigma_clf = np.array([float(p[1]) for p in pred_cok_clf])
However, I was forced to redefine the class of the problem and to set up the problem in this second script as well.
I don't know whether this is a proper use of 'pickle' or whether there is another way to do this. Any suggestions are welcome :)
There is not currently any provision for saving and reloading the surrogate model. You have two options:
1) Save off the training data, then import and re-train the model in your other script. You can call the fit and predict methods of the surrogate model directly for this by importing them from the library.
2) If you want to skip the cost of re-training each time, then you need to modify the surrogate model itself to save off the result of the fitting process, then re-load it into a new instance later: https://github.com/OpenMDAO/OpenMDAO/blob/c69e00f6f9eeb617863e782246e2e7ed1fe9e019/openmdao/surrogate_models/multifi_cokriging.py#L322
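Option 2 can be sketched independently of OpenMDAO. The class below is a hypothetical toy (a one-dimensional least-squares line), not the co-kriging surrogate and not OpenMDAO's API; the method names get_state and set_state are made up. It only illustrates the idea of storing the result of the fitting process and loading it into a fresh instance, instead of pickling the whole problem object.

```python
import pickle

class ToyLinearSurrogate:
    """Hypothetical stand-in for a surrogate: fits y = slope * x + intercept."""

    def __init__(self):
        self.slope = None
        self.intercept = None

    def train(self, xs, ys):
        """Ordinary least-squares fit on one input variable."""
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope = sxy / sxx
        self.intercept = mean_y - self.slope * mean_x

    def predict(self, x):
        return self.slope * x + self.intercept

    def get_state(self):
        """Return only the fitted numbers, not the object itself."""
        return {"slope": self.slope, "intercept": self.intercept}

    def set_state(self, state):
        self.slope = state["slope"]
        self.intercept = state["intercept"]

trained = ToyLinearSurrogate()
trained.train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

blob = pickle.dumps(trained.get_state())  # would be written to a file in practice

restored = ToyLinearSurrogate()           # new instance, no retraining
restored.set_state(pickle.loads(blob))
print(restored.predict(4.0))
```

For a real surrogate the state would be whatever the fitting step computes (for kriging-type models, typically the hyperparameters and the factorized correlation data), which is why the answer points at the surrogate's source code.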

How can I read a complex HDF5 array in Julia?

I have many HDF5 datasets containing complex number arrays, which I have created using Python and h5py. For example:
import numpy, h5py
with h5py.File("test.h5", "w") as f:
f["mat"] = numpy.array([1.0 + .5j, 2.0 - 1.0j], dtype=complex)
HDF5 has no native concept of complex numbers, so h5py stores them as a compound data type, with fields "r" and "i" for the real and imaginary parts.
How can I load such arrays of complex numbers in Julia, using HDF5.jl?
EDIT: The obvious attempt
using HDF5
h5open("test.h5", "r") do fd
    println(read(fd, "mat"))
end
returns a cryptic response:
HDF5Compound(Uint8[0,0,0,0,0,0,240,63,0,0,0,0,0,0,224,63,0,0,0,0,0,0,0,64,0,0,0,0,0,0,240,191],Type[Float64,Float64],ASCIIString["r","i"],Uint64[0,8])
As @simonster pointed out, there is a fast and safe way to do this.
If you had written:
a = read(fd, "mat")
then the complex vector that you want is simply:
cx_vec = reinterpret(Complex{Float64}, a.data)
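To see why the reinterpret call works, the raw bytes printed in the question can be decoded by hand. The sketch below uses Python's struct module purely as an illustration and assumes the bytes are little-endian, as they would be for a file written on a typical x86 machine. Each 16-byte element holds two float64 values, "r" at offset 0 and "i" at offset 8, matching the offsets in the HDF5Compound output.

```python
import struct

# Raw bytes copied from the HDF5Compound output shown in the question.
raw = bytes([0, 0, 0, 0, 0, 0, 240, 63, 0, 0, 0, 0, 0, 0, 224, 63,
             0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 240, 191])

# Each element is two little-endian float64 values: real part, then imaginary part.
values = [
    complex(*struct.unpack_from("<dd", raw, offset))
    for offset in range(0, len(raw), 16)
]
print(values)  # [(1+0.5j), (2-1j)]
```

This is the array that was written from Python, so the byte buffer already has the memory layout of a vector of complex doubles, and reinterpreting it involves no copying or conversion.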
I hadn't thought of this before, but one solution is simply to use h5py with PyCall:
using PyCall
@pyimport h5py
f = h5py.File("test.h5", "r")
mat = get(get(f, "mat"), pybuiltin("Ellipsis"))
f[:close]()
println(mat)
In Julia 0.6 you can do the following. As long as you have the HDF5 module and DataFrames already installed this example is immediately executable because the example HDF5 file comes with HDF5.jl. In all likelihood it only works on common types. I haven't tested it beyond the example file as I'm still trying to figure out how to write/create compound tables from Julia.
using HDF5
using DataFrames
# Compound Table Read
d = h5read(joinpath(Pkg.dir("HDF5"),"test","test_files","compound.h5"),"/data")
# Convert d to a dataframe, D
types = [typeof(i) for i in d[1].data] # Data type list
names_HDF5 = [Symbol(i) for i in d[1].membername] # Column name list
D = DataFrame(types,names_HDF5,length(d)) # Preallocate the array
rows = length(d) # Number of rows
cols = length(d[1].data) # Number of columns
for i=1:rows
    for j=1:cols
        D[i,j] = d[i].data[j] # Save each element to the preallocated dataframe
    end
end
d is a vector of table rows. Each element is of type HDF5Compound, which has three fields: data, membername, and membertype.
