I am training libsvm on a huge data file and the resulting model file is too large. Is there any way to save the libsvm library's model file in a binary format?
If you are using Matlab:
Download svm_savemodel.c and svm_model_matlab.c to your libsvm directory (svm_model_matlab.c is already included in libsvm; you can try the original one, but if it doesn't work, try this link). Compile the MEX file (mex svm_savemodel.c), and then it should work:
% save the model struct to a MAT-file instead: save('model.mat', 'model')
fid = fopen('model.bin', 'w');
fwrite(fid, model, 'int16');
fclose(fid);
% load('model.mat');
fid = fopen('model.bin', 'r');
model = fread(fid, Inf, 'int16');
fclose(fid);
% export the model in libsvm's own model format using the compiled MEX function
svm_savemodel(model, 'model.model');
If you are using C++:
There is a function that saves a model to a file:
int svm_save_model(const char *model_file_name, const struct svm_model *model);
More details are included in the libsvm repository on GitHub.
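If you happen to call libsvm from Python instead, the bundled svmutil module exposes the same save/load calls. A minimal sketch, assuming the pip-installed libsvm package and a training file train.txt in libsvm's sparse text format (older installs import svmutil directly rather than from the libsvm package):
# sketch: train, save and reload a model with libsvm's Python bindings
from libsvm.svmutil import svm_read_problem, svm_train, svm_save_model, svm_load_model

# read data in libsvm's sparse text format
y, x = svm_read_problem('train.txt')

# train a C-SVC with an RBF kernel
model = svm_train(y, x, '-t 2 -c 1')

# persist the model and load it back later
svm_save_model('model.model', model)
model = svm_load_model('model.model')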
I train GBM models with H2O and want to use them in my backend (not Java). To do so, I download the MOJOs, convert them to ONNX and run them in my apps.
In order to run inference, I need to know how the categorical columns were transformed into their one-hot encoded versions. I was able to find it in the POJO:
static final void fill(String[] sa) {
    sa[0] = "Age";
    sa[1] = "Fare";
    sa[2] = "Pclass.1";
    sa[3] = "Pclass.2";
    sa[4] = "Pclass.3";
    sa[5] = "Pclass.missing(NA)";
    sa[6] = "Sex.female";
    sa[7] = "Sex.male";
    sa[8] = "Sex.missing(NA)";
}
So, here is the workflow for a non-Java backend as I see it:
Encode categorical features with OneHotExplicit.
Train GBM model.
Download the MOJO and convert it to ONNX (see the conversion sketch after this list).
Download POJO and find feature alignment in the source code.
Implement the inference in your backend.
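For the MOJO-to-ONNX step, a minimal conversion sketch, assuming onnxmltools' H2O converter; the file names gbm_mojo.zip and gbm_mojo.onnx are placeholders for your own paths:
# sketch: convert a downloaded H2O GBM MOJO to ONNX with onnxmltools
import onnxmltools

onnx_model = onnxmltools.convert_h2o('gbm_mojo.zip')        # path to the MOJO zip
onnxmltools.utils.save_model(onnx_model, 'gbm_mojo.onnx')   # serialized ONNX model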
Is it the most straightforward and correct way?
Thank you for your question.
Can you access the stored categorical values here?
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/tree/SharedTreeMojoModel.java#L72
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/tree/SharedTreeMojoReader.java#L34
https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/tree/SharedTreeMojoWriter.java#L61
The index in the array corresponds to the translated categorical value.
The EasyPredictModelWrapper did it this way:
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/easy/RowToRawDataConverter.java#L44
Can you access the model.ini inside of the MOJO zip? There is a [domains] tag, and under that tag is a list of files in the domains/ directory which correspond to the categorical encoding of each feature.
e.g.:
[columns]
AGE
RACE
DPROS
DCAPS
PSA
VOL
GLEASON
CAPSULE
[domains]
7: 2 d000.txt
means that column 7 (CAPSULE) has 2 categorical values, listed in d000.txt.
Alternatively, there is an experimental/modelDetails.json file that has the categorical values under output.domains. The index in that list corresponds to the feature at the same index in the output.names list.
E.g. output.domains[7] holds the domains for the output.names[7] feature.
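If your backend is Python, here is a sketch of pulling that mapping straight out of the MOJO zip; it assumes the model.ini/domains layout described above and a placeholder file name gbm_mojo.zip:
# sketch: read per-feature categorical domains from an H2O MOJO zip
import zipfile

domains = {}  # column index -> list of categorical levels
with zipfile.ZipFile('gbm_mojo.zip') as z:
    in_domains = False
    for line in z.read('model.ini').decode('utf-8').splitlines():
        line = line.strip()
        if line.startswith('['):
            in_domains = (line == '[domains]')
        elif in_domains and line:
            # lines look like "7: 2 d000.txt"
            col, rest = line.split(':', 1)
            fname = rest.split()[-1]
            domains[int(col)] = z.read('domains/' + fname).decode('utf-8').splitlines()

print(domains)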
Hello, can someone tell me what format my input data has to be in? Right now I have it in CSV format, with the first column being the target variable, but I always get an AlgorithmError which I think is due to the wrong input data format.
trainpath = sess.upload_data(
    path='revenue_train.csv', bucket=bucket,
    key_prefix='production')

testpath = sess.upload_data(
    path='revenue_test.csv', bucket=bucket,
    key_prefix='production')

# launch training job, with asynchronous call
sklearn_estimator.fit({'train': trainpath, 'test': testpath}, wait=False)
When you use a custom Docker or framework estimator (like you do) you can use any file format (CSV, PDF, MP4, whatever you have in S3). The SKLearn container and estimator are agnostic of the file format; it is the role of your user-provided Python code in the estimator to know how to read those files.
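For example, here is a minimal sketch of the reading side in your entry-point script, assuming the channel names 'train' and 'test' used above, a CSV whose first column is the target, and the file names from your upload call (SageMaker exposes each channel's local path via an SM_CHANNEL_* environment variable):
# sketch of the user-provided training script that reads the uploaded CSVs
import os
import pandas as pd

train_dir = os.environ['SM_CHANNEL_TRAIN']   # e.g. /opt/ml/input/data/train
test_dir = os.environ['SM_CHANNEL_TEST']

train_df = pd.read_csv(os.path.join(train_dir, 'revenue_train.csv'))
test_df = pd.read_csv(os.path.join(test_dir, 'revenue_test.csv'))

# first column is the target, the rest are features
y_train, X_train = train_df.iloc[:, 0], train_df.iloc[:, 1:]
y_test, X_test = test_df.iloc[:, 0], test_df.iloc[:, 1:]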
I have several HDF5 files all of which have a /dataset that contains vectors. I would like to combine all these vectors into one dataset in one file (that is repeatedly append from one file to another). The combined dataset would have chunked storage and be resizable.
Every option I've seen for doing this seems to require reading all the data into a buffer, and then writing it back out, is there a way to more simply pass a dataset/dataspace from one file to another in order to append the data?
Have you investigated the h5py Group .copy() method? Although documented as a group action, it works with any h5py object (groups, datasets, links and references). By default it copies object attributes and supports recursive copying of group members. If you prefer a command line tool, the HDF Group has one to do this. Take a look at h5copy here: HDF5 Group h5copy doc
Here is an example that demonstrates a simple h5py .copy() implementation. It creates a set of 3 files, each with 1 dataset (named /dataset, dtype=float, shape=(10,10)). It then creates a NEW HDF5 file, followed by another loop that opens each of the previous files and copies the dataset from the "read" file (h5r) to the new "write" file (h5w).
import h5py
import numpy as np

for i in range(1, 4):
    with h5py.File('SO_68025342_' + str(i) + '.h5', mode='w') as h5f:
        arr = np.random.random(100).reshape(10, 10)
        h5f.create_dataset('dataset', data=arr)

with h5py.File('SO_68025342_all.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_' + str(i) + '.h5', mode='r') as h5r:
            h5r.copy('dataset', h5w, name='dataset_' + str(i))
Here is a method to copy data from multiple files to a single dataset in the merged file. It comes with caveats: 1) all datasets must have the same shape, and 2) you must know the number of datasets in advance to size the new dataset. (If not, you can create a resizeable dataset by adding maxshape=(None,a0,a1) and then use .resize() as needed.) I have another post with 2 examples here: How can I combine multiple .h5 file? Look at Methods 3a and 3b.
with h5py.File('SO_68025342_merge.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_' + str(i) + '.h5', mode='r') as h5r:
            if 'dataset' not in h5w.keys():
                a0, a1 = h5r['dataset'].shape
                h5w.create_dataset('dataset', shape=(3, a0, a1))
            h5w['dataset'][i-1, :] = h5r['dataset']
Assuming your files aren't so conveniently named, you can use glob.iglob() to loop over the file names to read. Then use .keys() to get the dataset names in each file. Also, if all of your datasets really are named /dataset, you need to come up with a naming convention for the new datasets.
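Combining the glob loop with the resizeable-dataset idea above, here is a minimal sketch; it assumes every matching file holds a /dataset of identical (a0, a1) shape, and the file pattern is a placeholder for your real file names:
# sketch: append /dataset from every matching file into one chunked, resizable dataset
import glob
import h5py

with h5py.File('SO_68025342_resize.h5', mode='w') as h5w:
    for fname in glob.iglob('SO_68025342_?.h5'):
        with h5py.File(fname, mode='r') as h5r:
            arr = h5r['dataset'][...]
            if 'dataset' not in h5w:
                a0, a1 = arr.shape
                # chunked and extendible along the first axis
                h5w.create_dataset('dataset', shape=(0, a0, a1),
                                   maxshape=(None, a0, a1), chunks=True)
            ds = h5w['dataset']
            ds.resize(ds.shape[0] + 1, axis=0)
            ds[-1, :, :] = arr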
Here is a link to the h5py docs with more details: h5py Group .copy() method
If you are not bound to a particular library and programming language, one way to solve your issue could be with the usage of HDFql (in C, C++, Java, Python, C#, Fortran or R).
Given that your posts seem to mention C# quite often, find below a solution in C#. It assumes that 1) the dataset name is dset, 2) each dataset is of data type float, and 3) each dataset is a one-dimensional vector of size 100 - feel free to adapt the code to your concrete use-case:
// declare variable
float []data = new float[100];
// retrieve all file names (from current directory) that end with '.h5'
HDFql.Execute("SHOW FILE LIKE \\.h5$");
// create an HDF5 file named 'output.h5' and use (i.e. open) it
HDFql.Execute("CREATE AND USE FILE output.h5");
// create a chunked and extendible HDF5 dataset named 'dset' in file 'output.h5'
HDFql.Execute("CREATE CHUNKED(100) DATASET dset AS FLOAT(0 TO UNLIMITED)");
// register variable 'data' for subsequent usage (by HDFql)
HDFql.VariableRegister(data);
// loop cursor and process each file found
while(HDFql.CursorNext() == HDFql.Success)
{
// alter (i.e. extend) dataset 'dset' (in file 'output.h5') with 100 more floats
HDFql.Execute("ALTER DIMENSION dset TO +100");
// select (i.e. read) dataset 'dset' (from file found) and populate variable 'data'
HDFql.Execute("SELECT FROM \"" + HDFql.CursorGetChar() + "\" dset INTO MEMORY " + HDFql.VariableGetNumber(data));
// insert (i.e. write) the values stored in variable 'data' into dataset 'dset' (in file 'output.h5') at the end of it (using a hyperslab)
HDFql.Execute("INSERT INTO dset(-1:::) VALUES FROM MEMORY " + HDFql.VariableGetNumber(data));
}
I am using a custom image set to train a neural network using the TensorFlow API. After a successful training process I get checkpoint files containing the values of the different training variables. I now want to build an inference model from these checkpoint files; I found this script which does that, and which I can then use to generate DeepDream images as explained in this tutorial. The problem is when I load my model using:
import numpy as np
import tensorflow as tf

model_fn = 'export'

graph = tf.Graph()
sess = tf.InteractiveSession(graph=graph)
with tf.gfile.FastGFile(model_fn, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
t_input = tf.placeholder(np.float32, name='input')
imagenet_mean = 117.0
t_preprocessed = tf.expand_dims(t_input - imagenet_mean, 0)
tf.import_graph_def(graph_def, {'input': t_preprocessed})
I get this error:
graph_def.ParseFromString(f.read())
self.MergeFromString(serialized)
raise message_mod.DecodeError('Unexpected end-group tag.')
google.protobuf.message.DecodeError: Unexpected end-group tag.
The script expects a protocol buffer file, and I am not sure whether the script I am using to generate the inference model is actually producing protocol buffer files.
Can someone please suggest what I am doing wrong, or whether there is a better way to achieve this? I simply want to convert the checkpoint files generated by TensorFlow into a protocol buffer.
Thanks
The link to the script you ran is broken, but in any case the recommended approach is not to generate an inference model from a checkpoint, but rather to embed code at the end of your training program that emits a "SavedModel" export (which is not the same thing as a checkpoint).
Please see [1], and in particular the heading "Building a Saved Model". Note that a SavedModel consists of multiple files, one of which is indeed a protocol buffer (which directly answers your question, I hope); the others are variable files and (optional) asset files.
[1] https://www.tensorflow.org/programmers_guide/saved_model
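A minimal sketch of what that export code can look like at the end of a TensorFlow 1.x training program, following the guide in [1]; the toy graph, tensor names and export directory are placeholders for your real model:
# sketch: emit a SavedModel at the end of training (TensorFlow 1.x API)
import tensorflow as tf

# a tiny stand-in graph; replace with your real model tensors
x = tf.placeholder(tf.float32, shape=[None, 3], name='input')
w = tf.Variable(tf.zeros([3, 1]))
y = tf.matmul(x, w, name='output')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    export_dir = './saved_model'
    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)

    # describe the inference signature: which tensors are inputs and outputs
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={'input': x}, outputs={'output': y})

    builder.add_meta_graph_and_variables(
        sess,
        [tf.saved_model.tag_constants.SERVING],
        signature_def_map={'predict': signature})
    builder.save()  # writes saved_model.pb (a protocol buffer) plus variables/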
I have a file containing vectors of data, where each row contains a comma-separated list of values. I am wondering how to perform k-means clustering on this data using mahout. The example provided in the wiki mentions creating sequenceFiles, but otherwise I am not sure if I need to do some type of conversion in order to obtain these sequenceFiles.
I would recommend manually reading in the entries from the CSV file, creating NamedVectors from them, and then using a sequence file writer to write the vectors in a sequence file. From there on, the KMeansDriver run method should know how to handle these files.
Sequence files encode key-value pairs, so the key would be an ID of the sample (it should be a string), and the value is a VectorWritable wrapper around the vectors.
Here is a simple code sample on how to do this:
import java.util.LinkedList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

List<NamedVector> vector = new LinkedList<NamedVector>();
NamedVector v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
vector.add(v1);

Configuration config = new Configuration();
FileSystem fs = FileSystem.get(config);
Path path = new Path("datasamples/data");

// write a SequenceFile from the vectors
SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector v : vector) {
    vec.set(v);
    writer.append(new Text(v.getName()), vec);
}
writer.close();
Also, I would recommend reading chapter 8 of Mahout in Action. It gives more details on data representation in Mahout.
Maybe you could use Elephant Bird to write vectors in Mahout format:
https://github.com/kevinweil/elephant-bird#hadoop-sequencefiles-and-pig