I need to use Principal Component Analysis from Python. The MDP library offers PCA functionality, but the documentation is not very clear to me. I used previously PCA from SPSS, which offered various options for PCA (such as different rotations).
If anyone is familiar with both MDP and SPSS, please highlight the equivalent settings for PCA in SPSS to obtain similar results to PCA from MDP (mdp.nodes.PCANode() - for simplicity, let's consider default values).
Try scikit-learn. It's great toolkit and has gorgeous documentation.
PCA: http://scikit-learn.org/dev/modules/generated/sklearn.decomposition.PCA.html
Related
As we know Facebook's FastText is a great open-source, free, lightweight library which can be used for text classification. But here a problem is the pipeline seem to be end-to end black-box. Yes, we can change the hyper-parameters from these options for setting training configuration. But I couldn't manage to find a way to access to the vector embedding it generates internally.
Actually I want to do some manipulation on the vector embedding - like introducing tf-idf weighting apart from these word2vec representations and another thing I want to to is oversampling using SMOTE which requires numerical representation. For these reasons I need to introduce my custom code in between the overall pipeline which seems to be inaccessible for me. How introduce custom steps in this pipeline?
The full source code is available:
https://github.com/facebookresearch/fastText
So, you can make any changes or extensions you can imagine - if you're comfortable reading & modifying its C++ source code. Nothing is hidden or inaccessible.
Note that both FastText, and its supervised classification mode, are chiefly conventions for training a shallow neural-network. It may not be helpful to think of it as a "pipeline" like in the architecture of other classifier libraries - as none of the internal interfaces use that sort of language or modular layout.
Specifically, if you get the gist of word2vec training, FastText classifier mode really just replaces attempted-predictions of neighboring (in-context-window) vocabulary words, with attempted-predictions of known labels instead.
For the sake of understanding FastText's relationship to other techniques, and potential aspects for further extension, I think it's useful to also review:
this skeptical blog post comparing FastText to the much-earlier 'vowpal wabbit' tool: "Fast & easy baseline text categorization with vw"
Facebook's far-less discussed extension of such vector-training for more generic categorical or numerical tasks, "StarSpace"
Dear all I am working on a project in which I have to categories research papers into their appropriate fields using titles of papers. For example if a phrase "computer network" occurs somewhere in then title then this paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers. So I want to know how I should start. I have tried to use tf-idf but could not get actual results. Does someone know about a library to do this task easily? Kindly suggest one. I shall be thankful.
If you don't know categories in advance, than it's not classification, but instead clustering. Basically, you need to do following:
Select algorithm.
Select and extract features.
Apply algorithm to features.
Quite simple. You only need to choose combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has enormous number of implementations, even in libraries not specialized in ML. Another popular choice is Expectation-Maximization (EM) algorithm. Both of them, however, require initial guess about number of classes. If you can't predict number of classes even approximately, other algorithms - such as hierarchical clustering or DBSCAN - may work for you better (see discussion here).
As for features, words themselves normally work fine for clustering by topic. Just tokenize your text, normalize and vectorize words (see this if you don't know what it all means).
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of tasks, but if you have another language of preference, you most probably will be able to find similar libraries for it too.
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. For starting out, you can maybe try a simple word frequency model (bag of words) and later on move to more complex feature extraction methods (string kernels). You can start by using SVM's (Support Vector Machines) to classify the data using LibSVM (the best SVM package).
The fact, that you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover the clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you, to give it a try.
I want to learn General SVM implementation which uses QP problem for training. Initially I do not want to learn Sequential minimal Optimization(SMO) kind of algorithm which over comes the QP matrix size issue. Can any one please give me some references to learn Pure General SVM implementation in any programming languages like C,C++ or Java. So that I can understand basic issues in SVM and it will help me in learning some other SVM optimized algorithms.
This blog post by Mathieu Blondel explains how to solve the SVM problem both with and without kernels using a generic QP solver in Python (in this case he is using CVXOPT).
The source code is published on this gist and is very simple to understand thanks to the numpy array notation for n-dimensional arrays (in this case, mostly 2D matrices and 1D vectors).
You could check some of the resources mentioned here. It is also advisable to have a look at the existing code. One of the most popular implementations, LIBSVM, is open-source, so you can study the implementation.
I have two dependent continuous variables and i want to use their combined values to predict the value of a third binary variable. How do i go about discretizing/categorizing the values? I am not looking for clustering algorithms, i'm specifically interested in obtaining 'meaningful' discrete categories i can subsequently use in in a Bayesian classifier.
Pointers to papers, books, online courses, all very much appreciated!
That is the essence of machine learning and problem one of the most studied problem.
Least-square regression, logistic regression, SVM, random forest are widely used for this type of problem, which is called binary classification.
If your goal is to pragmatically classify your data, several libraries are available, like Scikits-learn in python and weka in java. They have a great documentation.
But if you want to understand what's the intrinsics of machine learning, just search (here or on google) for machine learning resources.
If you wanted to be a real nerd, generate a bunch of different possible discretizations and then train a classifier on it, and then characterize the discretizations by features and then run a classifier on that, and see what sort of discretizations are best!?
In general discretizing stuff is more of an art and having a good understanding of what the input variable ranges mean.
I was reading the paper on Relational Fisher Kernel which involves Bayesian Logic Programs to calculate the Fisher score and then uses SVM to obtain the class labels for each data item.
I don't have strong background from Machine learning. Can someone please let me know about how to go about implementing an end-to-end Relational Fisher Kernel and what sort of input would it expect? I could not find any easy step-by-step flow showing this implementation. I am ok with using libraries for SVM etc. (e.g. libsvm), but I would like to know the end-to-end flow (in as easy language as possible). Any help will be highly appreciated.
libsvm does not implement the Relation Fisher Kernel, however, you can calculate the Fisher information matrix as described in the paper, and the use it as the precomputed kernel input to libsvm. See: using precomputed kernels with libsvm