HOG and LBP on Weka - machine-learning

I'm new to the subject of ML so I apologize as my questions may seem too basic.
I have an image dataset and my supervisor asked me to do feature extraction using HOG and LBP filters. So far I have been working with Weka, and I couldn't find any useful tutorials on how to implement these filters in Weka. Is it possible? And if not, how else can I implement these filters to extract features from my dataset?

Weka
You could use and/or extend the imageFilter Weka package. It uses LIRE under the hood, which has a range of feature extraction methods implemented.
ADAMS
If you don't mind using another framework, then you could make use of the plugins for LIRE in ADAMS. Its adams-imaging module also offers support for a range of LIRE feature generators (download either the adams-annotator or adams-base-all snapshot).
Using its workflow engine, you can run the flow adams-imaging-feature_generation, which generates PHOG and LocalBinaryPatterns features from a range of images and displays them as a spreadsheet. You could use this flow as a basis and turn it into one that allows you to select images interactively and then saves the features as a CSV or ARFF file.
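If you are not tied to Weka for the extraction step itself, another route is to compute the HOG and LBP features outside Weka and import them as a CSV or ARFF file. Below is a minimal sketch using Python with scikit-image; the image folder, resize dimensions, HOG cell sizes and LBP radius are illustrative choices, not values prescribed by Weka, LIRE or ADAMS.

```python
# Sketch: extract a HOG vector and an LBP histogram per image and save them
# as CSV, which Weka can load. All parameter values are illustrative defaults.
import glob
import numpy as np
from skimage import io, transform
from skimage.feature import hog, local_binary_pattern

P, R = 8, 1  # LBP neighbourhood: 8 sampling points, radius 1

rows = []
for path in glob.glob("images/*.png"):  # hypothetical image folder
    gray = io.imread(path, as_gray=True)
    gray = transform.resize(gray, (128, 128))  # fixed size so the HOG length is constant

    hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2), feature_vector=True)

    lbp = local_binary_pattern(gray, P, R, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=np.arange(0, P + 3), density=True)

    rows.append(np.concatenate([hog_vec, lbp_hist]))

np.savetxt("features.csv", np.array(rows), delimiter=",")
```

The resulting features.csv can then be opened in Weka (append a class column first if you want to train a classifier on it).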

Related

Difference between spacy_sklearn and tensorflow_embedding pipelines

I want to know if there is any basic difference between how the spacy_sklearn and tensorflow_embedding pipelines operate under the hood. I mean, tensorflow_embedding must also be using the same concepts of word embeddings, reducing the dimensionality of data using PCA, etc. Is the only difference then that spacy_sklearn has some pre-trained data to draw upon in the form of pre-trained vectors, while the tensorflow pipeline does not? Is my understanding correct? Also, how is the tensorflow_embedding pipeline related to the TensorFlow framework offered by Google?
I tried looking up the tensorflow framework on Google, but could not get any specific answer. I also searched for it on the RASA community page, but again found no help.
The spacy_sklearn pipeline uses pre-trained word vectors. This is useful if we don't have very much training data.
The tensorflow_embedding pipeline doesn't use any pre-trained word vectors; it fits the vectors specifically to our dataset. The advantage of the tensorflow_embedding pipeline is that the word vectors will be customised for our domain.
For more information, please refer to the link below:
https://rasa.com/docs/nlu/choosing_pipeline/

Research papers classification on the basis of title of the research paper

Dear all, I am working on a project in which I have to categorize research papers into their appropriate fields using the titles of the papers. For example, if the phrase "computer network" occurs somewhere in the title, then the paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers. So I want to know how I should start. I have tried to use tf-idf but could not get meaningful results. Does someone know about a library to do this task easily? Kindly suggest one. I shall be thankful.
If you don't know the categories in advance, then it's not classification, but clustering. Basically, you need to do the following:
Select algorithm.
Select and extract features.
Apply algorithm to features.
Quite simple. You only need to choose the combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has an enormous number of implementations, even in libraries not specialized in ML. Another popular choice is the Expectation-Maximization (EM) algorithm. Both of them, however, require an initial guess about the number of clusters. If you can't predict the number of clusters even approximately, other algorithms - such as hierarchical clustering or DBSCAN - may work better for you (see the discussion here).
As for features, words themselves normally work fine for clustering by topic. Just tokenize your text, normalize and vectorize words (see this if you don't know what it all means).
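To make that concrete, here is a minimal sketch of the tokenize/vectorize/cluster pipeline with scikit-learn; the sample titles and the cluster count are placeholders, not values derived from your data:

```python
# Sketch: TF-IDF vectorization + k-means clustering of paper titles.
# The cluster count is a placeholder; tune it, or switch to DBSCAN or
# hierarchical clustering if it cannot be guessed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

titles = [
    "A survey of computer network protocols",
    "Deep learning for image segmentation",
    "Routing in wireless sensor networks",
]  # replace with your 3 million titles

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(titles)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # cluster id per title
```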
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of task, but if you have another language of preference, you will most probably be able to find similar libraries for it too.
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. To start out, you can try a simple word frequency model (bag of words) and later on move to more complex feature extraction methods (string kernels). You can start by using SVMs (Support Vector Machines) to classify the data using LibSVM (the best SVM package).
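As a rough illustration of the bag-of-words plus SVM route, here is a sketch using scikit-learn's LinearSVC (which wraps liblinear rather than LibSVM, so treat it as a sketch of the idea rather than the exact setup recommended above; the titles and labels are made up):

```python
# Sketch: bag-of-words features + linear SVM for title classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

titles = ["A survey of computer network protocols",
          "Convolutional networks for image recognition"]
labels = ["computer network", "computer vision"]

clf = make_pipeline(CountVectorizer(stop_words="english"), LinearSVC())
clf.fit(titles, labels)
print(clf.predict(["Congestion control in computer networks"]))
```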
Given that you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you to give it a try.

How to use SIFT/SURF as features for a machine learning algorithm?

I'm working on an automatic image annotation problem in which I'm trying to associate tags with images. For that I'm trying SIFT features for learning. But the problem is that the SIFT features are a set of keypoints, each with its own descriptor (so each image gives a 2-D array), and the number of keypoints is also huge. How many do I use, and how do I feed them to my learning algorithm, which typically accepts only one-dimensional features?
You can represent each SIFT descriptor as a "visual word", which is a single number, and use the resulting representation as SVM input; I think that is what you need. The mapping is usually done by k-means clustering.
This method is called "bag-of-words" and is described in this paper.
Here is a short presentation reviewing the method.
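A rough sketch of that bag-of-visual-words pipeline using OpenCV and scikit-learn (the vocabulary size of 100 and the image folder are assumptions for illustration):

```python
# Sketch: build a visual vocabulary with k-means over SIFT descriptors,
# then represent each image as a fixed-length histogram of visual words.
import glob
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

sift = cv2.SIFT_create()
K = 100  # vocabulary size (placeholder; needs at least K descriptors in total)

descriptors_per_image, all_descriptors = [], []
for path in glob.glob("images/*.jpg"):  # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, des = sift.detectAndCompute(gray, None)
    if des is not None:
        descriptors_per_image.append(des)
        all_descriptors.append(des)

vocab = MiniBatchKMeans(n_clusters=K, random_state=0).fit(np.vstack(all_descriptors))

# Each image becomes one K-dimensional histogram, usable as a single SVM input row.
histograms = [np.bincount(vocab.predict(des), minlength=K) for des in descriptors_per_image]
X = np.array(histograms, dtype=float)
```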
You should read the original paper about SIFT; it tells you what SIFT is and how to use it. Read chapter 7 and the rest carefully to understand how to use it in practice.
Here is the link to the original paper.
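Chapter 7 of that paper covers matching descriptors between images; in OpenCV the usual recipe is brute-force matching with Lowe's ratio test, roughly as sketched below (the 0.75 threshold is the commonly quoted value and the file names are placeholders):

```python
# Sketch: match SIFT descriptors between two images using Lowe's ratio test.
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical files
img2 = cv2.imread("train.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} good matches")
```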
You can use the Bag of Words approach, of which you can read about in the following post:
http://gilscvblog.wordpress.com/2013/08/23/bag-of-words-models-for-visual-categorization/
SIFT and SURF are invariant feature extractors, so matching their features helps solve lots of problems.
But there is a matching problem, since not all points will be the same in two different images (and likewise in the case of similarity problems). Therefore you should rely on the features that do match; the others may not be reliable.
Another problem is that these algorithms extract lots of features, which makes matching impractical on large datasets.
There is a good solution to those problems, which is called "Bag of Visual Words".
In https://github.com/dermotte/LIRE a complete bag of visual words pipeline is fully implemented. Here is the LIRE demo site.
The code is quite simple; if you understand bag of visual words, you can also modify it.
After getting the visual words, you should use the information retrieval approaches used in search engines. By the way, LIRE also includes an information retrieval library called Lucene. You should follow the LIRE approach until you get the complete idea, then implement your own.

OpenCV vs Mahout for Computer Vision based Machine Learning?

For some time, I have been using OpenCV. It has satisfied all my needs for feature extraction, matching and clustering (k-means till now) and classification (SVM). Recently, I came across Apache Mahout. But most of the algorithms for machine learning are already available in OpenCV as well. Are there any advantages of using Mahout over OpenCV if the work relates to videos and images?
This question might be put on hold since it is opinion based. I still want to add a basic comparison.
OpenCV is capable of almost anything in vision and ML that has been researched or invented. The vision literature is based on it, and it develops alongside the literature. Even newly published algorithms, like TLD, which originated in MATLAB (http://www.tldvision.com/), can also be implemented using OpenCV (http://gnebehay.github.io/OpenTLD/) with some effort.
Mahout is capable too, and specific to ML. It includes not only the well-known ML algorithms, but also more specialized ones. Say you came across a paper "Processing Apples with K-means Orientation Filtering". You can find OpenCV implementations of this paper all around the web. Even the actual algorithm might be open source and developed using OpenCV. With OpenCV, say it takes 500 lines of code, but with Mahout, the paper might already be implemented as a single method, making everything easier.
An example of this case is http://en.wikipedia.org/wiki/Canopy_clustering_algorithm, which is harder to implement using OpenCV right now.
Since you are going to work with image datasets, you will need to learn about HIPI, too.
To sum up, here is a simple pro-con table:
know-how (learning curve): OpenCV is easier, since you already know about it. Mahout+HIPI will take more time.
examples: The literature and the vision community commonly use OpenCV. Open source algorithms are mostly written against OpenCV's C++ API.
ml algorithms: Mahout is only about ml, whereas OpenCV is more generic. Still, OpenCV has access to basic ml algorithms.
development: Mahout is easier to work with in terms of coding and time complexity (I am not sure about the latter, but I reckon it is).

Naive Bayesian for Topic detection using "Bag of Words" approach

I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a Naive Bayesian implementation that I might be able to look up for this?
Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occurrences of words other than the ones that are already mapped, I want to add them to the mappings, hence improving and learning about new words that map to a topic, and also changing the probabilities of words.
How should I go about doing this? Is my approach the right one?
Which programming language would be best suited for the implementation?
Existing Implementations of Naive Bayes
You would probably be better off just using one of the existing packages that supports document classification using naive Bayes, e.g.:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
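For the Python option above, the core of an NLTK Naive Bayes topic classifier fits in a few lines. A minimal sketch, where the feature function and training examples are made-up illustrations rather than code from the NLTK book:

```python
# Sketch: train an NLTK Naive Bayes classifier on hand-labelled documents.
import nltk

def bag_of_words(text):
    # Presence features: each lowercased token maps to True.
    return {token.lower(): True for token in text.split()}

train = [
    (bag_of_words("stock markets fell sharply today"), "finance"),
    (bag_of_words("the team won the championship game"), "sports"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(bag_of_words("shares rallied after the earnings call")))
classifier.show_most_informative_features(5)
```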
Bootstrapping Classification from Keywords
It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.
This is a reasonably clever idea. Take a look at the paper Text Classification by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% using a bootstrapped Naive Bayes classifier. For their data, the latter is close to human levels of agreement, as people agreed with each other about document labels 72% of the time.
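A very crude approximation of that bootstrapping idea, expressed with scikit-learn (the keyword lists, confidence threshold and iteration count are assumptions, and the paper itself uses EM over Naive Bayes rather than this simple self-training loop):

```python
# Sketch: seed labels from hard-coded keywords, train Naive Bayes,
# then fold in confidently classified documents and retrain.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = [
    "the team won the game last night",
    "stock market rallies as rates fall",
    "injury forces star player out of the match",
    "investors worry about rising volatility",
]  # placeholder documents
seeds = {"sports": ["game", "team"], "finance": ["stock", "market"]}

def seed_label(doc):
    for topic, keywords in seeds.items():
        if any(k in doc.lower() for k in keywords):
            return topic
    return None

X = CountVectorizer(stop_words="english").fit_transform(docs)
labels = np.array([seed_label(d) for d in docs], dtype=object)

for _ in range(5):  # a few self-training rounds
    known = np.array([l is not None for l in labels])
    if known.all():
        break
    clf = MultinomialNB().fit(X[known], labels[known].astype(str))
    proba = clf.predict_proba(X[~known])
    confident = proba.max(axis=1) > 0.9  # confidence threshold (assumption)
    idx = np.flatnonzero(~known)[confident]
    labels[idx] = clf.classes_[proba.argmax(axis=1)[confident]]

print(list(labels))
```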
