Using scikit-learn to decide if a given text is similar to previously learnt texts

I am a newbie to scikit-learn.
What I want to do is quite simple - just feed my model with a bunch of similar texts.
Then, I want to be able to give it a new text, and see if it is similar to the existing texts in the dataset.
How should this be done?
Thanks very much in advance.

One good approach might be using cosine similarity. This is a very good tutorial to start with:
Machine Learning :: Cosine Similarity for Vector Space Models (Part III)
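A minimal sketch of that idea with scikit-learn (the example corpus and the 0.5 cutoff are placeholders of my own, not taken from the tutorial):

    # Minimal sketch: TF-IDF vectors plus cosine similarity in scikit-learn.
    # The corpus and the 0.5 threshold are made-up placeholders.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    known_texts = [
        "the cat sat on the mat",
        "a cat was sitting on a mat",
        "dogs and cats make good pets",
    ]

    vectorizer = TfidfVectorizer()
    known_vectors = vectorizer.fit_transform(known_texts)  # fit on the known corpus

    new_text = "a kitten sat down on the mat"
    new_vector = vectorizer.transform([new_text])  # reuse the fitted vocabulary

    scores = cosine_similarity(new_vector, known_vectors).ravel()  # one score per known text
    print("similar" if scores.max() > 0.5 else "not similar")  # arbitrary cutoff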

Another good approach would be a Bayesian classifier, like the ones used for spam detection. Take a look at this link to learn more about them.
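If you go the classifier route, keep in mind that a classifier needs counter-examples as well: spam filters learn from both spam and non-spam. A hypothetical sketch with scikit-learn's MultinomialNB (the tiny dataset is invented for illustration):

    # Hypothetical sketch: a Naive Bayes classifier needs examples of BOTH
    # classes, unlike the pure-similarity approach above.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts = ["cheap pills buy now", "meeting at noon tomorrow",
             "win a free prize now", "lunch with the team"]
    labels = [1, 0, 1, 0]  # 1 = spam-like, 0 = normal; placeholder data

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    print(model.predict(["free pills, act now"]))  # expected: [1]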

Related

How to give a logical reason for choosing a model

I trained machine-learning models on depression-related sentences, and it was LinearSVC that performed best. In addition to LinearSVC, I experimented with MultinomialNB and LogisticRegression, and I chose the model with the highest accuracy among the three. By the way, what I want is to be able to reason in advance about which model will fit, like the ml_map provided by scikit-learn. Where can I get this information? I searched a few papers, but couldn't find anything more detailed than the statement that SVMs are suitable for text classification. How do I study to get prior knowledge like this ml_map?
Try working with different example datasets, covering different data types, using different algorithms. There are hundreds to explore. Once you get a good grasp of how they work, things will become clearer. And do not forget to try googling something like "advantages of algorithm X"; it helps a lot.
And here are my thoughts; I used to ask such questions myself, so I hope this helps if you are struggling: the more you work with different machine learning models on a specific problem, the sooner you will realize that data and feature engineering play a more important part than the algorithms themselves. The road map provided by scikit-learn gives you a good view of which group of algorithms to use for certain types of data, and that is a good start. The boundaries between them, however, are rather subtle. In other words, one problem can be solved by different approaches depending on how you organize and engineer your data.
To sum it up: in order to achieve good out-of-sample performance (i.e., good generalization) while solving a problem, it is essential to examine the training/testing process under different setting combinations and to be mindful of your data (for example, answer this question: does it cover most samples of the distribution in the wild, or just a portion of it?).
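As a concrete illustration of that advice, here is a hedged sketch that compares the three models mentioned above with cross-validation; the 20 newsgroups data merely stands in for your own sentences and labels:

    # Sketch: compare several text classifiers with cross-validation rather
    # than a single train/test split. Swap in your own X_text and y.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    data = fetch_20newsgroups(subset="train", categories=["sci.med", "rec.autos"])
    X_text, y = data.data, data.target

    for clf in (LinearSVC(), MultinomialNB(), LogisticRegression(max_iter=1000)):
        pipe = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipe, X_text, y, cv=5)  # 5-fold accuracy
        print(type(clf).__name__, scores.mean().round(3))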

Is there any model/classifier that works best for NLP based projects like this?

I've written a program to analyze a given piece of text from a website and make conclusory classifications as to its validity. The code basically vectorizes the description (taken from the HTML of a given webpage in real-time) and takes in a few inputs from that as features to make its decisions. There are some more features like the domain of the website and some keywords I've explicitly counted.
The highest accuracy I've been able to achieve is with a RandomForestClassifier (>90%). I'm not sure what I can do to improve this accuracy other than incorporating a more sophisticated model. I tried using an MLP, but for no set of hyperparameters does it seem to exceed the previous accuracy. I have around 2000 data points available for training.
Is there any classifier that works best for such projects? Does anyone have any suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)
Any suggestions on how I can improve this project in general? Should I include the text on a webpage as well? How should I do so? I tried going through a few sites, but the text doesn't seem to be contained in any specific element, whereas the description is easy to obtain from the HTML. Any help?
What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.
You can search with the keyword NLP. The task you are facing is a hot topic among those studying deep learning, and it is called natural language processing.
RandomForest is a machine learning algorithm, and it probably works quite well here. Using other machine learning algorithms might improve your accuracy, or it might not. If you want to try out other lightweight machine learning algorithms, that's fine.
Deep learning will most likely outperform your current model. Starting from the keyword NLP, you'll find many models, such as Word2Vec, BERT, and so on. You can find code for all of them on GitHub.
One tip for you: think carefully about whether you can actually train the model. Trying to train BERT from scratch is a crazy thing for a beginner to attempt, and hard even for an expert. Instead, take a pretrained model and fine-tune it, or just use the pretrained word vectors.
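A hedged sketch of the "just use the word vectors" option, assuming gensim and its hosted GloVe vectors (the model name and the averaging scheme are illustrative choices, not a prescription):

    # Sketch: average pretrained GloVe word vectors per document; the result
    # is a fixed-length feature vector usable with any scikit-learn model.
    import numpy as np
    import gensim.downloader

    word_vectors = gensim.downloader.load("glove-wiki-gigaword-50")  # 50-dim vectors

    def doc_vector(text):
        # Average the vectors of the words we know; zeros if none are known.
        words = [w for w in text.lower().split() if w in word_vectors]
        if not words:
            return np.zeros(50)
        return np.mean([word_vectors[w] for w in words], axis=0)

    docs = ["the product page looks legitimate", "suspicious redirect and fake offer"]
    X = np.vstack([doc_vector(d) for d in docs])
    print(X.shape)  # (n_docs, 50)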
I hope that this works out.

Research papers classification on the basis of title of the research paper

Dear all, I am working on a project in which I have to categorize research papers into their appropriate fields using the titles of the papers. For example, if the phrase "computer network" occurs somewhere in the title, then the paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers, so I want to know how I should start. I have tried tf-idf but could not get useful results. Does someone know of a library to do this task easily? Kindly suggest one. I shall be thankful.
If you don't know the categories in advance, then it's not classification, but clustering. Basically, you need to do the following:
Select algorithm.
Select and extract features.
Apply algorithm to features.
Quite simple. You only need to choose the combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has an enormous number of implementations, even in libraries not specialized in ML. Another popular choice is the Expectation-Maximization (EM) algorithm. Both of them, however, require an initial guess about the number of clusters. If you can't predict the number of clusters even approximately, other algorithms, such as hierarchical clustering or DBSCAN, may work better for you (see the discussion here).
As for features, the words themselves normally work fine for clustering by topic. Just tokenize your text, then normalize and vectorize the words (see this if you don't know what all that means).
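Putting those three steps together, a minimal scikit-learn sketch (the titles and the cluster count are placeholders):

    # Sketch: tokenize/normalize/vectorize titles with TF-IDF, then cluster
    # with k-means. The titles and n_clusters are placeholders.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    titles = [
        "a survey of computer network protocols",
        "routing in wireless computer networks",
        "deep learning for image classification",
        "convolutional networks for vision tasks",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(titles)

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)  # cluster id per title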
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of task, but if you prefer another language, you will most probably be able to find similar libraries for it too.
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. To start out, you can try a simple word-frequency model (bag of words) and later move on to more complex feature extraction methods (string kernels). You can start by using SVMs (Support Vector Machines) to classify the data, for example via LibSVM (a well-known SVM package).
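A small sketch of that pipeline, with scikit-learn's LinearSVC standing in for LibSVM and NLTK doing the tokenization (the four titles and labels are invented):

    # Sketch: NLTK tokenization + bag-of-words features + a linear SVM.
    import nltk
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    nltk.download("punkt", quiet=True)  # tokenizer models; newer NLTK also ships "punkt_tab"

    titles = ["computer network routing", "network protocol design",
              "protein folding simulation", "gene expression analysis"]
    labels = ["cs", "cs", "bio", "bio"]  # placeholder labels

    # Hand NLTK's tokenizer to the vectorizer instead of the default regexp one.
    vectorizer = CountVectorizer(tokenizer=nltk.word_tokenize)
    X = vectorizer.fit_transform(titles)

    clf = LinearSVC().fit(X, labels)
    print(clf.predict(vectorizer.transform(["wireless network security"])))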
Since you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you to give it a try.

Catalog of Features. Feature extraction from images for SVM

I'm looking for reliable features for classifying cell types in microscope images, and I wonder what the best approach is.
1) I've tried the approach described by Pontil & Verri: using each pixel of the normalized images as a feature. It is easy to implement, but the results are not fully satisfactory. Another problem is that the classification happens through some kind of statistical magic, and I can't understand why some results are bad.
2) I've tried to extract high-level features such as peaks and holes. My implementation is slow, but the advantage is that I understand why one cell is identified as such and another is not, as these features can be visualized in the test images.
3) Recently I've found the following features in an article (they appear to be Haralick-style GLCM texture features, with some names garbled in translation): angular second moment, dissimilarity, contrast, entropy, inverse difference moment, correlation, sum average, difference average, sum entropy, difference entropy, variance, sum variance, and difference variance.
I wonder whether there are standard libraries for extracting these features (preferably in C/C++)?
Is there a catalogue of feature-types with pros/cons, use-case description, etc?
Thank you for any suggestion in advance!
I can recommend this article by Lindblad et al., published in the scientific journal Cytometry. It covers some aspects of feature extraction and classification of cells. It does not use any standard libraries for feature extraction/classification, but it contains some information on how to build a classifier based on general features.
This might not solve your problem completely, but I hope it might help you move towards a better solution.
You should try the Gabor feature extraction technique, as it is supposed to extract features very similar to the responses of human visual cortical cells: set up filters at different orientations and scales, then extract features from each set-up.
You can start learning from Wikipedia.
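A brief sketch of a Gabor filter bank using OpenCV's Python bindings (the same call exists in the C++ API; the kernel parameters and image path here are arbitrary):

    # Sketch: a small Gabor filter bank at several orientations, pooling a
    # simple statistic of each response into a feature vector.
    import cv2
    import numpy as np

    image = cv2.imread("cell.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

    features = []
    for theta in np.arange(0, np.pi, np.pi / 4):  # 4 orientations
        kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0)
        response = cv2.filter2D(image, cv2.CV_32F, kernel)
        features.extend([response.mean(), response.std()])  # pooled statistics

    print(features)  # one (mean, std) pair per orientation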
I think that the Insight Segmentation and Registration Toolkit (ITK) or Visualization Toolkit (VTK) would work well.
Some other options (that might not necessarily include all the features you want) are
http://opencv.org/
http://gdal.org/
http://www.vips.ecs.soton.ac.uk/index.php?title=VIPS
http://www.xdp.it/cximage.htm
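For texture features of the kind listed in the question (they read like Haralick's GLCM features), scikit-image offers a ready-made implementation, albeit in Python rather than C/C++. A minimal sketch, with a placeholder image path:

    # Sketch: grey-level co-occurrence matrix (GLCM) texture features with
    # scikit-image (these spellings require scikit-image >= 0.19; older
    # releases use greycomatrix/greycoprops).
    import numpy as np
    from skimage.feature import graycomatrix, graycoprops
    from skimage.io import imread

    image = imread("cell.png", as_gray=True)          # placeholder path
    image = (image * 255).astype(np.uint8)            # GLCM wants integer grey levels

    glcm = graycomatrix(image, distances=[1], angles=[0, np.pi / 2], levels=256)
    for prop in ("contrast", "homogeneity", "energy", "correlation", "ASM"):
        print(prop, graycoprops(glcm, prop).ravel())  # one value per distance/angle pair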
Finally I've found what I was searching for and would like to share:
https://sites.google.com/site/cvonlinewiki/home/geometric-feature-extraction-methods
The list looks pretty mature and complete.
EDIT
Another good article for features in biological cells is:
A feature set for cytometry on digitized microscopic images
A good description of shape features:
http://www.math.uci.edu/icamp/summer/research_11/park/shape_descriptors_survey.pdf

How to use SIFT/SURF as features for a machine learning algorithm?

I'm working on an automatic image annotation problem in which I'm trying to associate tags with images. For that I'm trying to use SIFT features for learning. But the problem is that the SIFT features are a set of keypoints, each with its own descriptor array, and the number of keypoints is huge. How many of them do I use, and how do I feed them to my learning algorithm, which typically accepts only one-dimensional features?
You can represent each SIFT descriptor as a "visual word", which is a single number, and use those as SVM input; I think that is what you need. This is usually done by k-means clustering.
This method is called "bag of words" and is described in this paper.
Here is a short presentation reviewing the method.
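A condensed sketch of that pipeline with OpenCV and scikit-learn (image paths and the vocabulary size k are placeholders):

    # Sketch: SIFT descriptors -> k-means "visual words" -> one fixed-length
    # histogram per image, usable as a 1-D SVM input.
    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder image files
    sift = cv2.SIFT_create()

    per_image_desc = []
    for p in paths:
        gray = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)  # (n_keypoints, 128) array
        per_image_desc.append(desc)

    # Build the visual vocabulary by clustering all descriptors together.
    k = 50
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.vstack(per_image_desc))

    # Each image becomes a k-bin histogram of its descriptors' cluster labels.
    histograms = [np.bincount(kmeans.predict(d), minlength=k) for d in per_image_desc]
    print(np.array(histograms).shape)  # (n_images, k): fixed-length, 1-D per image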
You should read the original paper about SIFT; it tells you what SIFT is and how to use it. Read chapter 7 and the rest carefully to understand how to use it in practice.
Here is the link to the original paper.
You can use the Bag of Words approach, which you can read about in the following post:
http://gilscvblog.wordpress.com/2013/08/23/bag-of-words-models-for-visual-categorization/
SIFT and SURF are invariant feature extractors, so matching their features helps solve a lot of problems.
But matching is itself a problem, since not all keypoints in one image will appear in another (and the same holds when comparing for similarity). Therefore you should use only the features that actually match and discard the rest.
Another problem is that these algorithms extract a huge number of features, which makes exhaustive matching infeasible on large datasets.
There is a good solution to both problems, called the "bag of visual words" model.
https://github.com/dermotte/LIRE implements a complete bag-of-visual-words pipeline. There is also a LIRE demo site.
The code is quite simple; once you understand the bag-of-visual-words model, you can modify it as well.
After obtaining visual words, you should use the information retrieval approaches used in search engines. Conveniently, LIRE also includes an information retrieval library called Lucene. Follow the LIRE approach until you get the complete idea, then implement your own.
