I've trained a Keras LSTM classification model on characters, saved its model architecture and weights, and now want to load it into a separate application that I can run separately from the training system, stick a REST endpoint on it, and then be able to make predictions via REST...
I haven't found - maybe poor google-fu - references to how other people are doing it, and the main vagueness I'm running into is how to load the original text index & the corresponding labels index.
i.e. the index of 1="a", 2="g", 3=" ", 4="b", and the "original" labels of ["green","blue","red","orange"] prior to the labels being one-hot encoded...
So this is my understanding:
the weights are based on the numerical inputs that were given to the originally trained model
the numerical inputs & the generated index are based on the specific data set that was used for training
the outputs from the model that represent the classification are based on the order in which the original data set's labels were added - i.e. if green was in position 0 of the training labels, and is in position 1 of the actual "runtime" labels, then that's not gonna work... true?
which means that reusing the model + weights requires not only the actual model architecture & weights, but also the indices of the input & output data...
Is that correct? Or am I missing something major?
Because the thing then is... IF this is the case, is there a way to save & load the indices other than doing it manually?
Because if it needs to be done manually, then we kinda lose the benefits of Keras' preprocessing functionality (like the Tokenizer and np_utils.to_categorical) that we WERE able to use in the training system...
Does anybody have a pattern for doing this sort of activity?
I'm currently doing something along the lines of the following (sketched in code after this list):
save the X index & the Y label array during training together with the model architecture & weights
in the prediction application, load and recreate the model with the architecture & weights
have a custom class to tokenise input words based on the X index, pad it to the right length, etc
make the prediction
take the prediction and map the highest probability item to the original Y labels array, and therefore figure out what the predicted label is
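Roughly, in code, the pattern looks like this (a hedged sketch only; the file names model.h5 and preproc.pkl, the use of pickle, and the maxlen value are just what I happen to do, not anything Keras mandates):

```python
import pickle
import numpy as np
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# training side (illustrative): persist the fitted preprocessing state next to the model
#   pickle.dump({"tokenizer": tokenizer, "labels": labels}, open("preproc.pkl", "wb"))

# prediction application
model = load_model("model.h5")                            # architecture + weights
state = pickle.load(open("preproc.pkl", "rb"))
tokenizer, labels = state["tokenizer"], state["labels"]   # same X index & Y label order as training

def predict_label(text, maxlen=100):
    seq = tokenizer.texts_to_sequences([text])   # reuse the training char->index mapping
    x = pad_sequences(seq, maxlen=maxlen)        # pad to the length the model expects
    probs = model.predict(x)[0]
    return labels[int(np.argmax(probs))]         # map highest probability back to the original label
```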
Related
I have a list of columns and each column is to be labelled by a label from another list of labels.
E.g. two columns, namely ALT_ID and MTRC_NM, are matched with the labels Alternate ID and Metric Name respectively.
This fuzzy string matching has been taken care of. Problem is, I want to incorporate a learning model in this.
Essentially, after the matched results are displayed, the user curates the matches as CORRECT or INCORRECT. Based on this feedback and other features of the column (like minimum value, maximum value), I want to train a classifier such that the learning model will eventually stop making the incorrect matches in the future.
Note: In the first run, only the name of the column is used to produce the first set of results. After this, I want to use other features (like minimum value) to train the model.
Problem is, there can be 10,000 terms (or labels), maybe even more, and the user just marks these as CORRECT or INCORRECT. For incorrect classifications, the user does not tell us what the correct classification should be.
I believe one solution could be to make separate classifiers for each label and based on the Correct/Incorrect feedback for a particular classification, we can use these feature vectors to train a classifier for this classification. So in the future, if the fuzzy string matching nominates Metric Name as the classification for some column, we can let the "Metric Name" classifier decide if it is correct or incorrect.
I don't know how to make separate classifiers for each label. I also don't know if this approach is feasible. Any other solution to this problem will also help.
You do not want to create separate models for each label, as training more than 10,000 models isn't really feasible. Two possible things come to my mind:
Create a supervised learning model that takes a column (its features) as input and outputs a probability for each of the 10,000 labels, trained only on the examples curated as CORRECT (a rough sketch follows this list).
Create a reinforcement learning model with the same input, but with an output that maximises a reward function defined as +1 for each positive prediction and -1 for each negative prediction. This model will also try to maximise the number of correct predictions, but it will be able to learn from incorrect predictions at the same time, i.e. predict a -1 score for an incorrect pair (x, y).
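For illustration only, a rough sketch of the first option with toy data (the column features, label names and acceptance threshold are made up; scikit-learn's LogisticRegression just stands in for any probabilistic multi-class model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy feature vectors per column, e.g. [min_value, max_value, name_length]
X_correct = np.array([
    [0.0, 99999.0, 6],   # ALT_ID
    [1.0, 100.0, 7],     # MTRC_NM
    [0.0, 1.0, 9],       # some other curated-CORRECT column (illustrative)
])
y_correct = ["Alternate ID", "Metric Name", "Active Flag"]   # labels the user confirmed

clf = LogisticRegression(max_iter=1000).fit(X_correct, y_correct)

# at matching time: score the fuzzy-match candidate for a new column
candidate = "Metric Name"
probs = clf.predict_proba([[2.0, 80.0, 7]])[0]
score = probs[list(clf.classes_).index(candidate)]
accept = score > 0.5            # threshold is an arbitrary illustrative choice
```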
I have a file of raw feedbacks that needs to be labeled (categorized) and then used as the training input for an SVM classifier (or any classifier for that matter).
But the catch is, I'm not assigning a whole feedback to a certain category. One feedback may belong to more than one category based on the topics it talks about (noun n-grams are extracted). So I'm labeling the topics (terms), not the feedbacks (documents). I've extracted the n-grams using TFIDF, saving their features so I could train my model on them. The problem is that TFIDF returns a document-term matrix, which becomes train_x, while on the other side I've got train_y: the labels assigned to each n-gram (not to the whole document). So I've ended up with a feature matrix of x rows (number of documents) against y labels (number of unique n-grams/topics extracted).
Below is a sample of what the data looks like. Blue is the n-grams (extracted by TFIDF) while red is the labels/categories (calculated for each n-gram with a function I've written manually).
Instead of putting code, this is my strategy for implementing my concept:
The problem lies in the part where TFIDF produces x_train = tf.transform(feedbacks), which is a document-term matrix, and it doesn't make sense for it to be an input for the classifier against y_train, which holds the labels for the terms and not the documents. I've tried to transpose the matrix, but it gave me an error. I've tried to input a 1-D array that holds only feature values for the terms directly, which also gave me an error because the classifier expects X to be in a (sample, feature) format. I'm using sklearn's version of SVM and TfidfVectorizer.
Simply put, I want to be able to use an SVM classifier on a list of terms (n-grams) against a list of labels to train the model, and then test new data (after cleaning and extracting its n-grams) so the SVM can predict its labels. A rough sketch of the shape I'm after is below.
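To make that shape concrete, here is the kind of layout I'm aiming for (a toy sketch with made-up feedbacks and a placeholder labelling rule standing in for my real labelling function, using sklearn's TfidfVectorizer and LinearSVC):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

feedbacks = [
    "the app crashes on login",
    "login screen is very slow",
    "great customer support team",
]

tfidf = TfidfVectorizer(ngram_range=(1, 2))
doc_term = tfidf.fit_transform(feedbacks)        # shape: (n_documents, n_terms)
terms = tfidf.get_feature_names_out()            # one entry per column of doc_term

# one feature row per term: its tf-idf weights across all documents
X_terms = doc_term.T.toarray()                   # shape: (n_terms, n_documents)

# one label per term, here produced by a placeholder rule instead of my real function
y_terms = ["performance" if ("crash" in t or "slow" in t) else "other" for t in terms]

clf = LinearSVC().fit(X_terms, y_terms)          # X and y now line up term by term
print(clf.predict(X_terms[:3]))
```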
The solution might be something very technical, like using another classifier that expects a different format, or not using TFIDF since it's document-focused, or even something broader: a whole change of approach and concept (if mine is wrong).
I'd very much appreciate it if someone could help.
I have data in an Excel file that I need to use to perform multi-label classification using SVM. It has two columns, as shown below: 'tweet' (values A, B, C, D, E, F, G) and 'category' (combinations of X, Y, Z).
tweet   category
A       X
B       Y
C       Z
D       X,Y
E       Y,Z
F       X,Y,Z
G       X,Z
Given a tweet, I want to train my model to predict the category it belongs to. Both the tweets and categories are text. I am trying to use Weka's LibSVM classifier to do the classification, as I read that it does multi-label classification. I converted the CSV file to an ARFF file, loaded it into Weka, and ran the LibSVM classifier. However, I am getting very poor results, as shown below. Any idea what I am doing wrong? Is multi-label text classification even possible with LibSVM?
Correctly Classified Instances 82 25.9494 %
Incorrectly Classified Instances 234 74.0506 %
Kappa statistic 0
Mean absolute error 0.0423
Root mean squared error 0.2057
Relative absolute error 89.9823 %
Root relative squared error 134.3377 %
Total Number of Instances 316
SVM can definitely be used for multiclass classification.
I have not used Weka's LibSVM before, but if you haven't already, you would need to do some data cleaning before you input text for any sort of classification.
The type of cleaning also depends on your classification task, but you can look into the following techniques, which are used in practice for text analysis (a small sketch of some of them follows the list):
1) Remove Twitter handles from your text
2) Remove stop words, or words that you know for sure do not impact your classifications. Maybe you can preserve only pronouns and remove all other words. You can use POS tagging to perform this task. More info here
3) Remove punctuation
4) Use n-grams to get contextual meaning out of your text. This site has a good explanation of how that works. Essentially, this means you treat a sequence of words as a feature rather than using a single word as a data point in your model. Mind you, this might increase the amount of memory your model occupies while training.
5) Remove words that occur either too frequently or too rarely in your data set.
6) Balance your classes, or categories in your case. This means that before training your model, you should make sure the training data has a similar number of X, Y and Z categories. It is possible that your data had a lot of tweets that classify to X and Y, but your test set had tweets that mostly mapped to the Z category.
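A small Python sketch of steps 1-5, purely illustrative (scikit-learn based with made-up tweets; Weka covers similar ground with its own preprocessing filters):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "@alice the metric report is SLOW!!",
    "new release looks great, thanks @bob",
    "app crashes every time I open the report",
]

def clean(text):
    text = re.sub(r"@\w+", " ", text)       # 1) strip Twitter handles
    text = re.sub(r"[^\w\s]", " ", text)    # 3) strip punctuation
    return text.lower()

vectorizer = TfidfVectorizer(
    preprocessor=clean,
    stop_words="english",                   # 2) drop common stop words
    ngram_range=(1, 2),                     # 4) unigrams + bigrams as features
    min_df=1,                               # 5) raise to drop very rare terms
    max_df=0.9,                             #    lower to drop very frequent terms
)
X = vectorizer.fit_transform(tweets)        # ready for a classifier, e.g. an SVM
```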
Let's suppose I have a noisy 2D data set where a person looking at the data could easily draw a straight line through it so that the mean squared error is minimized.
The model of the line has the form y = mx + b, where x is the input value, y is the predicted value of the model and m and b are trained variables to minimize the cost.
My question is: if we plug some input x1 into the model, it will always output the same number, regardless of how scattered the data is. How can a model like this predict different values from the same input?
Maybe this could be done by taking all the errors from the model line to the points, making a distribution of them, taking the expected value of that distribution and then adding it to y?
If the data is 2D and it can be perfectly modelled with a straight line, then there is no data-based or statistical reason not to claim that the process is fully deterministic, and you should output one value.
However, if you have many more dimensions, or your fit is not perfect (the error is minimised but not 0), then what you are after is either predicting a distribution of values or at least confidence bounds. There are many probabilistic models that can model a distribution of outputs rather than a single value. In particular, linear regression does that: it assumes a Gaussian error around your predictions, so effectively, once you obtain the MSE "A", you can draw predictions from N(mx + b, A), which, as you can easily see, degenerates to the deterministic model when A = 0. These predictions are optimal in expectation, and they are simply your way of "simulating observations" according to the model.
There are also meta-methods: if you treat your predictor as a black box, you can train multiple models on subsets of the data and treat their predictions as samples to fit a distribution (again, for simplicity it could be a single Gaussian).
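A tiny numpy sketch of that idea, with made-up numbers: fit the line, estimate the residual variance (the "A" above), then sample predictions around the deterministic mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)      # noisy 2D data

m, b = np.polyfit(x, y, 1)                    # least-squares line, minimises MSE
A = np.mean((y - (m * x + b)) ** 2)           # residual variance, the "A" above

x1 = 4.2
mean_prediction = m * x1 + b                               # the single deterministic answer
samples = rng.normal(mean_prediction, np.sqrt(A), size=5)  # simulated observations around it
```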
I have a training set where the input vectors are speed, acceleration and turn angle change. The output is a crisp class - an activity state from the given set {rest, walk, run}, e.g. for the input vector [3.1 1.2 2] --> run; [2.1 1 1] --> walk, and so on.
I am using Weka to develop a neural network model. I am defining the outputs as crisp ones (or rather qualitative ones in words - categorical values). After training, the model classifies the test data fairly well.
I was wondering how the internal process (the mapping function) takes place. Are the qualitative output states given some numeric value inside the model and then converted back to categorical data after processing? A NN model cannot map float input values to categorical data through hidden neurons, so what is actually happening, even though the model works fine?
If the model converts the categorical outputs into numeric values and then starts processing, then on what basis does it convert the categorical values into these arbitrary numerical values?
Yes, categorical values are usually converted to numbers, and the network learns to associate input data with these numbers. However, these numbers are often further encoded so that more than a single output neuron is used. The most common way to do this for unordered labels is to add a dummy output neuron for each category and use 1-of-C encoding, with 0.1 and 0.9 as target values. The output is interpreted using the winner-take-all paradigm.
Using only one neuron and encoding unordered categories as different numbers often leads to problems, as the network will treat middle categories as "averages" of the boundary categories. This may, however, sometimes be desired if you have ordered categorical data.
You can find a very good explanation of this issue in this part of the online Neural Network FAQ.
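For concreteness, a small numpy sketch of 1-of-C encoding with 0.1/0.9 targets and winner-take-all decoding (the class order and the network output shown here are just illustrative):

```python
import numpy as np

classes = ["rest", "walk", "run"]             # unordered categorical outputs

# 1-of-C targets with 0.1/0.9 instead of 0/1, e.g. "walk" -> [0.1, 0.9, 0.1]
targets = {c: np.where(np.arange(len(classes)) == i, 0.9, 0.1)
           for i, c in enumerate(classes)}

# winner-take-all decoding of a hypothetical network output
network_output = np.array([0.15, 0.22, 0.87])
predicted = classes[int(np.argmax(network_output))]   # -> "run"
```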
The neural net's computations all take place on continuous values. To do multiclass classification with discrete output, its final layer produces a vector of such values, one for each class. To make a discrete class prediction, take the index of the maximum element in that vector.
So if the final layer in a classification network for four classes predicts [0 -1 2 1], then the third element of the vector is the largest and the third class is selected. Often, these values are also constrained to form a probability distribution by means of a softmax activation function.
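Worked out in numpy for that example vector (illustrative only):

```python
import numpy as np

scores = np.array([0.0, -1.0, 2.0, 1.0])      # final-layer outputs for four classes
predicted_index = int(np.argmax(scores))      # -> 2, i.e. the third class

softmax = np.exp(scores) / np.sum(np.exp(scores))
# softmax ~ [0.087, 0.032, 0.644, 0.237]  -- the same scores as a probability distribution
```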