How to read variable-length strings with h5py - hdf5

I'm trying to use h5py to read an array of variable length strings from an HDF5 file created in C. For a simple example, I used the variable length string array example from the HDF5 group, h5ex_t_vlstringatt.c at https://support.hdfgroup.org/HDF5/examples/api-c.html. I compile with h5pcc and the example program runs fine (it reads in the file it writes and prints out the contents).
However, I just get an empty object in python; with a simple example program
import h5py
fnam = 'h5ex_t_vlstringatt.h5'
data = h5py.File(fnam, 'r')
print data['DS1']
I get
<HDF5 dataset "DS1": shape None, type "<i4">
Also, I'm using an anaconda distribution of python that I just updated, so h5py is version ~2.8.

Downloading the file from the link:
2148:~/mypy$ h5dump h5ex_t_vlstringatt.h5
HDF5 "h5ex_t_vlstringatt.h5" {
GROUP "/" {
DATASET "DS1" {
DATATYPE H5T_STD_I32LE
DATASPACE NULL
DATA {
}
ATTRIBUTE "A1" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_SPACEPAD;
CSET H5T_CSET_ASCII;
CTYPE H5T_C_S1;
}
DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
DATA {
(0): "Parting", "is such", "sweet", "sorrow."
}
}
}
}
}
In a Ipython session
In [167]: f = h5py.File('h5ex_t_vlstringatt.h5', 'r')
In [168]: list(f.keys())
Out[168]: ['DS1']
In [169]: f['DS1']
Out[169]: <HDF5 dataset "DS1": shape (), type "<i4">
In [170]: f['DS1'].attrs
Out[170]: <Attributes of HDF5 object at 2826604252>
In [171]: list(f['DS1'].attrs.keys())
Out[171]: ['A1']
In [172]: f['DS1'].attrs['A1']
Out[172]: array([b'Parting', b'is such', b'sweet', b'sorrow.'], dtype=object)
The strings are stored in an attribute of the dataset, not as the value of the set.

Related

Saving an sklearn.svm.SVR model as JSON instead of pickling

I have a trained SVR model which needs to be saved in a JSON format instead of pickling.
The idea behind JSONifying the trained model is to simply capture the state of the weights and other 'fitted' attributes. Then, I can set these attributes later to make predictions. Here is an implementation of it I did:
# assume SVR has been trained
regressor = SVR()
regressor.fit(x_train, y_train)
# saving the regressor params in a JSON file for later retrieval
with open(f'saved_regressor_params.json', 'w', encoding='utf-8') as outfile:
json.dump(regressor.get_params(), outfile)
# finding the fitted attributes of SVR()
# if an attribute is trailed by '_', it's a fitted attribute
attrs = [i for i in dir(regressor) if i.endswith('_') and not i.endswith('__')]
remove_list = ['coef_', '_repr_html_', '_repr_mimebundle_'] # unnecessary attributes
for attr in remove_list:
if attr in attrs:
attrs.remove(attr)
# deserialize NumPy arrays and save trained attribute values into JSON file
attr_dict = {i: getattr(regressor, i) for i in attrs}
for k in attr_dict:
if isinstance(attr_dict[k], np.ndarray):
attr_dict[k] = attr_dict[k].tolist()
# dump JSON for prediction
with open(f'saved_regressor_{index}.json', 'w', encoding='utf-8') as outfile:
json.dump(attr_dict,
outfile,
separators=(',', ':'),
sort_keys=True,
indent=4)
This would create two separate json files. One file called saved_regressor_params.json which saves certain required parameters for SVR and another is called saved_regressor.json which stores attributes and their trained values as objects. Example (saved_regressor.json):
{
"_dual_coef_":[
[
-1.0,
-1.0,
-1.0,
]
],
"_intercept_":[
1.323423423
],
...
...
"_n_support_":[
3
]
}
Later, I can create a new SVR() model and simply set these parameters and attributes into it by calling them from the existing JSON files we just created. Then, call in the predict() method to predict. Like so (in a new file):
predict_svr = SVR()
#load the json from the files
obj_text = codecs.open('saved_regressor_params.json', 'r', encoding='utf-8').read()
params = json.loads(obj_text)
obj_text = codecs.open('saved_regressor.json', 'r', encoding='utf-8').read()
attributes = json.loads(obj_text)
#setting params
predict_svr.set_params(**params)
# setting attributes
for k in attributes:
if isinstance(attributes[k], list):
setattr(predict_svr, k, np.array(attributes[k]))
else:
setattr(predict_svr, k, attributes[k])
predict_svr.predict(...)
However, during this process, a particular attribute called: n_support_ cannot be set due to some reason. And even if I ignore n_support_ attribute, it creates additional errors. (Is my logic wrong or am I missing something here?)
Therefore, I am looking for different ways or ingenious methods to save an SVR model into JSON.
I have tried the existing third party helper libraries like: sklearn_json. These libraries tend to export perfectly for linear models but not for support vectors.
Making a reproducible example missing in the OP, based on the docs (version 1.1.2)
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
regressor = SVR(C=1.0, epsilon=0.2)
regressor.fit(X, y)
Then a sketch of the a JSON serialization/deserialization
import json
# serialize
serialized = json.dumps({
k: v.tolist() if isinstance(v, np.ndarray) else v
for k, v in regressor.__dict__.items()
})
# deserialize
regressor2 = SVR()
regressor2.__dict__ = {
k: np.asarray(v) if isinstance(v, list) else v
for k, v in json.loads(serialized).items()
}
# test
assert np.all(regressor.predict(X) == regressor2.predict(X))
EDIT: Serialization preserving data type
A not so elegant solution to address the first issue mentioned in a comment is to save the data type together with the data.
import json
# serialize
serialized = json.dumps({
k: [v.tolist(), 'np.ndarray', str(v.dtype)] if isinstance(v, np.ndarray) else v
for k, v in regressor.__dict__.items()
})
# deserialize
regressor2 = SVR()
regressor2.__dict__ = {
k: np.asarray(v[0], dtype=v[2]) if isinstance(v, list) and v[1] == 'np.ndarray' else v
for k, v in json.loads(serialized).items()
}
# test
assert np.all(regressor.predict(X) == regressor2.predict(X))

How to handle alphanumeric values in machine learning

I am trying to the find the best algorithm for my claims data. The claims data include some diagnosis code which are alphanumeric like 'EA43454' . when i run the below code to evaluate the models
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state=None)
cv_results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)
i get the error
ValueError: could not convert string to float: 'U0003'
How to handle these alphanumeric values?
You need to convert your strings to an indicator variable (dummy variables). Each value of the string variable has to be associated with a number so that the models can train on that data.
Scikit-learn has several preprocessors to help you with this such as OneHotEncoder. You can also use pandas.get_dummies, but using sklearn's own classes is more composable - for example, you can use them as part of a pipeline.
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
rng = np.random.default_rng()
animals = pd.DataFrame({"animal": rng.choice(["cat", "dog"], size=10),
"age": rng.integers(1, 20, size=10)})
animals_ohe = OneHotEncoder().fit_transform(animals.drop(columns=["age"]))

How to fine tune a masked language model?

I'm trying to follow the huggingface tutorial on fine tuning a masked language model (masking a set of words randomly and predicting them). But they assume that the dataset is in their system (can load it with from datasets import load_dataset; load_dataset("dataset_name")). However, my input dataset is a long string:
text = "This is an attempt of a great example. "
dataset = text * 3000
I followed their approach and tokenized each it:
from transformers import AutoTokenizer
from transformers import AutoModelForMaskedLM
import torch
from transformers import DataCollatorForLanguageModeling
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
def tokenize_long_text(tokenizer, long_text):
individual_sentences = long_text.split('.')
tokenized_sentences_list = tokenizer(individual_sentences)['input_ids']
tokenized_sequence = [x for xs in tokenized_sentences_list for x in xs]
return tokenized_sequence
tokenized_sequence = tokenize_long_text(tokenizer, long_text)
Following by chunking it into equal length segments:
def chunk_long_tokenized_text(tokenizer_text, chunk_size):
# Compute length of long tokenized texts
total_length = len(tokenizer_text)
# We drop the last chunk if it's smaller than chunk_size
total_length = (total_length // chunk_size) * chunk_size
return [tokenizer_text[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
chunked_sequence = chunk_long_tokenized_text(tokenized_sequence, 30)
Created a data collator for random masking:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15) # expects a list of dicts, where each dict represents a single chunk of contiguous text
Example of how it works:
d = {}
d['input_ids'] = chunked_sequence[0]
d
>>>{'input_ids': [101,
2023,
2003,
1037,
2307,
103,...
for chunk in data_collator([ d ])["input_ids"]:
print(f"\n'>>> {tokenizer.decode(chunk)}'")
>>>'>>> [CLS] this is a great [MASK] [SEP] [CLS] this is a great [MASK] [SEP] [CLS] this is a great [MASK] [SEP] [CLS] this is a great [MASK] [SEP] [CLS] this'
However, the remaining steps (which I believe is just the training component) seem to only work using their trainer method, which can only take their dataset.
How can this work with a dataset in the form of a string?

h5py reading raw data with escape characters

Hi I want to read the hdf5 file data as it is written
But when I read it with the following code I get the following output
COde
hf = h5py.File('Json.h5', 'r')
data_read = hf.get("BinaryData_metadata")
rmdwrite = open("Test.json", "w")
rmdwrite.write(str(np.array(data_read)))
rmdwrite.close()
hf.close()
Output
[b'{\n\t"TestReport": {\n\t\t"TestName": "XYZ",\n\t\t"Description"................
How to get the exact output with the same formatting in my output file?
When I print with this
Data_arr = str(np.array(data_read))
Data_arr = repr(Data_arr)
I get
'[b\'{\\n\\t"TestReport": {\\n\\t\\t"Te................
OKey this is how I am writing the data via C++
DataSpace dataspace(1, dimsf); //Creating Dataspace
StrType datatype(PredType::C_S1); //Creating Datatype of type char
datatype.setOrder(order); //Data Store Order
datatype.setSize(file_datastring.length()); //Datalength
datatype.setCset(H5T_CSET_UTF8);
DataSet dataset = Hdf5::fileObject.createDataSet(WriteDataSet, datatype, dataspace); //Create dataset
dataset.write(file_datastring, datatype); //Write to dataset
is there something here which is appending that extra \
The solution which I found was
hf = h5py.File(H5FileName, 'r')
FileObj = open(OutFileName, "w")
hf.get(H5DataSetName).value.tofile(FileObj)
FileObj.close()
hf.close()
This works perfectly
Regards.
Siddharth

how to convert pandas str.split call to to dask

I have a dask data frame where the index is a string which looks like this:
12/09/2016 00:00;32.0046;-106.259
12/09/2016 00:00;32.0201;-108.838
12/09/2016 00:00;32.0224;-106.004
(its basically a string encoding the datetime;latitude;longitude of the row)
I'd like to split that while still in the dask context to individual columns representing each of the fields.
I can do that with a pandas dataframe as:
df['date'], df['Lat'], df['Lon'] = df.index.str.split(';', 2).str
But that doesn't work in dask for several of the attempts I've tried. If I directly substitute the df for a dask df I get the error:
'Index' object has no attribute 'str'
If I use the column name instead of index as:
forecastDf['date'], forecastDf['Lat'], forecastDf['Lon'] = forecastDf['dateLocation'].str.split(';', 2).str
I get the error:
TypeError: 'StringAccessor' object is not iterable
Here is an runnable example of this working in Pandas
import pandas as pd
df = pd.DataFrame()
df['dateLocation'] = ['12/09/2016 00:00;32.0046;-106.259','12/09/2016 00:00;32.0201;-108.838','12/09/2016 00:00;32.0224;-106.004']
df = df.set_index('dateLocation')
df['date'], df['Lat'], df['Lon'] = df.index.str.split(';', 2).str
df.head()
Here is the error I get if I directly convert that to dask
import dask.dataframe as dd
dd = dd.from_pandas(df, npartitions=1)
dd['date'], dd['Lat'], dd['Lon'] = dd.index.str.split(';', 2).str
>>TypeError: 'StringAccessor' object is not iterable
forecastDf['date'] = forecastDf['dateLocation'].str.partition(';')[0]
forecastDf['Lat'] = forecastDf['dateLocation'].str.partition(';')[2]
forecastDf['Lon'] = forecastDf['dateLocation'].str.partition(';')[4]
Let me know if this works for you!
First make sure the column is string dtype
forecastDD['dateLocation'] = forecastDD['dateLocation'].astype('str')
Then you can use this to split in dask
splitColumns = client.persist(forecastDD['dateLocation'].str.split(';',2))
You can then index the columns in the new dataframe splitColumns and add them back to the original data frame.
forecastDD = forecastDD.assign(Lat=splitColumns.apply(lambda x: x[0], meta=('Lat', 'f8')), Lon=splitColumns.apply(lambda x: x[1], meta=('Lat', 'f8')), date=splitColumns.apply(lambda x: x[2], meta=('Lat', np.dtype(str))))
Unfortunately I couldn't figure out how to do it without calling compute and creating the temp dataframe.

Resources