Supervised training and testing in GenSims FastText implementation - machine-learning

I am currently training a Gensim FastText model with a document from a certain domain with the unsupervised training method from Gensim.
After this training of the word representations i would like to train a set of sentence+label lines and ultimately test the model and return a precision and recall value like it is possible in facebooks fastText implementation via train_supervised + test. Does GenSims implementation support the supervised training and testing? I couldnt get it to work / find the required methods.
Any help is much appreciated.

Gensim's FastText implementation has so far chosen not to support the same supervised mode of Facebook's original FastText, where known-labels can be used to drive the training of word-vectors – because gensim sees it focus as being unsupervised topic-modeling techniques.

Related

Image Classification using Single Class Dataset using Transfer Learning

I only have around 1000 images of vehicle. I need to train a model that can identify if the image is vehicle or not-vehicle. I do not have a dataset for not-vehicle, as it could be anything besides vehicle.
I guess the best method for this would be to apply transfer learning. I am trying to train data on a pre-trained VGG19 Model. But still, I am unaware on how to train a model with just vehicle images without any non-vehicle images. I am not being able to classify it.
I am new to ML Overall, Any solution based on practical implementation will be highly appreciated.
You are right about transfer learning approach. Have a look a this article, it is exactly about going from multi-class to binary classification with transfer learning - https://medium.com/#mandygu/seefood-creating-a-binary-classifier-using-transfer-learning-da751db7cf9c
You can try using pretrained model and take the output. You might need to apply dimensionality reduction e.g. PCA, to get a more managable size input. After that you can train novelty detection model to identify whether the output is different than your training set.
Refer to this example: https://github.com/J-Yash/Hotdog-Not-Hotdog
Hope this helps.
This is a binary classification problem: whether the input is a vehicle or not.
If you are new to ML, I would suggest you should start implementing basic binary classifiers like Logistic Regression, Support Vector Machines before jumping to Convolutional Neural Networks (CNNs).
I am providing some links for the binary classification problem implementations using different algorithms. I hope this would help.
Logistic Regression: https://github.com/JB1984/Logistic-Regression-Cat-Classifier
SVM: https://github.com/Witsung/SVM-Fruit-Image-Classifier
CNN: https://github.com/A-Jatin/CNN-implementation-for-binary-image-classification

Logistic Regression only recognizing predominant classes

I am participating in the Kaggle San Francisco Crime competition and i am currently trying o number of different classifiers to test benchmark performances. I am using a LogisticRegressionClassifier from sklearn, without any parameter tuning and I noticed from sklearn.metrict.classification_report that it is only predicting the predominant classses,i.e. the classes which have the highest number of occurrences in my training set.
Intuition tells me that this has to parameter tuning, but I am not sure which parameters I have to tweek in order to make the classifier more aware of less predominant classes ( LogisticRegressionClassifier has quite a few ). At the moment it is predicting only 3 classes from 38 or smth like that so it definitely needs improvement.
Any ideas?
If your model is classifying only predominant classes then you are facing problem of imbalance classes. Here are some good reads to tackle this in machine learning.
Logistic Regression is a binary classifier and uses one-vs-all or one-vs-one technique for multiclass classification, which is not good if you have higher number of output classes (33 in your case). Try using other classifier. For a start , use softmax classifier which is an extension of logistic classifier having support for multi-class classification. In scikit learn, set multi_class variable as multinomial to use softmax regression.
Other way to improve your model could be using GridSearch for parameter tuning.
On a side note, I would recommend you to use other models as well.

Machine Learning Text Classification technique

I am new to Machine Learning.I am working on a project where the machine learning concept need to be applied.
Problem Statement:
I have large number(say 3000)key words.These need to be classified into seven fixed categories.Each category is having training data(sample keywords).I need to come with a algorithm, when a new keyword is passed to that,it should predict to which category this key word belongs to.
I am not aware of which text classification technique need to applied for this.do we have any tools that can be used.
Please help.
Thanks in advance.
This comes under linear classification. You can use naive-bayes classifier for this. Most of the ml frameworks will have an implementation for naive-bayes. ex: mahout
Yes, I would also suggest to use Naive Bayes, which is more or less the baseline classification algorithm here. On the other hand, there are obviously many other algorithms. Random forests and Support Vector Machines come to mind. See http://machinelearningmastery.com/use-random-forest-testing-179-classifiers-121-datasets/ If you use a standard toolkit, such as Weka, Rapidminer, etc. these algorithms should be available. There is also OpenNLP for Java, which comes with a maximum entropy classifier.
You could use the Word2Vec Word Cosine distance between descriptions of each your category and keywords in the dataset and then simple match each keyword to a category with the closest distance
Alternatively, you could create a training dataset from already matched to category, keywords and use any ML classifier, for example, based on artificial neural networks by using vectors of keywords Cosine distances to each category as an input to your model. But it could require a big quantity of data for training to reach good accuracy. For example, the MNIST dataset contains 70000 of the samples and it allowed me reach 99,62% model's cross validation accuracy with a simple CNN, for another dataset with only 2000 samples I was able reached only about 90% accuracy
There are many classification algorithms. Your example looks to be a text classification problems - some good classifiers to try out would be SVM and naive bayes. For SVM, liblinear and libshorttext classifiers are good options (and have been used in many industrial applcitions):
liblinear: https://www.csie.ntu.edu.tw/~cjlin/liblinear/
libshorttext:https://www.csie.ntu.edu.tw/~cjlin/libshorttext/
They are also included with ML tools such as scikit-learna and WEKA.
With classifiers, it is still some operation to build and validate a pratically useful classifier. One of the challenges is to mix
discrete (boolean and enumerable)
and continuous ('numbers')
predictive variables seamlessly. Some algorithmic preprocessing is generally necessary.
Neural networks do offer the possibility of using both types of variables. However, they require skilled data scientists to yield good results. A straight-forward option is to use an online classifier web service like Insight Classifiers to build and validate a classifier in one go. N-fold cross validation is being used there.
You can represent the presence or absence of each word in a separate column. The outcome variable is desired category.

How do I update a trained model (weka.classifiers.functions.MultilayerPerceptron) with new training data in Weka?

I would like to load a model I trained before and then update this model with new training data. But I found this task hard to accomplish.
I have learnt from Weka Wiki that
Classifiers implementing the weka.classifiers.UpdateableClassifier interface can be trained incrementally.
However, the regression model I trained is using weka.classifiers.functions.MultilayerPerceptron classifier which does not implement UpdateableClassifier.
Then I checked the Weka API and it turns out that no regression classifier implements UpdateableClassifier.
How can I train a regression model in Weka, and then update the model later with new training data after loading the model?
I have some data mining experience in Weka as well as in scikit-learn and r and updateble regression models do not exist in weka and scikit-learn as far as I know. Some R libraries however do support updating regression models (take a look at this linear regression model for example: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/update.html), so if you are free to switching data mining tool this might help you out.
If you need to stick to Weka than I'm afraid that you would probably need to implement such a model yourself, but since I'm not a complete Weka expert please check with the guys at weka list (http://weka.wikispaces.com/Weka+Mailing+List).
The SGD classifier implementation in Weka supports multiple loss functions. Among them are two loss functions that are meant for linear regression, viz. Epsilon insensitive, and Huber loss functions.
Therefore one can use a linear regression trained with SGD as long as either of these two loss functions are used to minimize training error.

Which classification algorithm to choose?

I would like to classify text documents into four categories. Also I have lot of samples which are already classified that can be used for training. I would like the algorithm to learn on the fly.. please suggest an optimal algorithm that works for this requirement.
If by "on the fly" you mean online learning (where training and classification can be interleaved), I suggest the k-nearest neighbor algorithm. It's available in Weka and in the package TiMBL.
A perceptron will also be able to do this.
"Optimal" isn't a well-defined term in this context.
there are several algorithms which can be learned on fly. Examples: k-nearest neighbors, naive Bayes, neural networks. You can try how appropriate each of these methods are on a sample corpus.
Since you have unlabeled data you might want to use a model where this helps. The first thing that comes to my mind is nonlinear NCA: Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure, (Salakhutdinov, Hinton).
Well....I have to say that document classification is kind of different what you guys are thinking.
Typically, in document classification, after preprocessing, the test data is always extremely huge, for example, O(N^2)...Therefore it might be too computationally expensive.
The another typical classifier that came into my mind is discriminant classifier...which doesn't need the generative model for your dataset. After training, you have to do is to put your single entry to the algorithm, and it is gonna be classified.
Good luck with this. For example, you can check E. Alpadin's book, Introduction to Machine Learning.

Resources