I'm a beginner in machine learning and scikit-learn, so this might be a stupid question.
I'm trying to do something like this:
features = [['adam'], ['james'], ['amy']]
labels = ['hello adam', 'hello james', 'hello amy']
clf = clf.fit(features, labels)
print(clf.predict([['john']]))
# This should give out 'hello john'
Is this possible using scikit-learn?
Thanks in advance!
The principled way to solve this would be sequence-to-sequence learning, which is a more complicated beast and outside scikit-learn's scope.
With enough feature engineering and the right problem formulation, you can still help a simpler algorithm, like the ones in scikit-learn, achieve this task. There are two main difficulties to tackle:
how to convert your features and your labels into a numeric representation (one-hot, embeddings, ...)
how to encode a variable-length sequence into a fixed-length vector that can be fed to scikit-learn algorithms (bag of words, mean pooling, RNN).
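As a rough illustration of the second point, here is a minimal sketch that turns names of different lengths into fixed-length count vectors using character n-grams (a bag-of-words variant). The choice of `CountVectorizer` with `analyzer="char"` and the n-gram range are assumptions made for the example, not something the question prescribes. It only covers the input encoding; producing a string such as 'hello john' still needs a sequence model or a different problem formulation.

```python
from sklearn.feature_extraction.text import CountVectorizer

names = ["adam", "james", "amy"]

# Count character unigrams and bigrams; every name maps to a vector
# whose length equals the size of the learned n-gram vocabulary.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(names)      # sparse matrix, one row per name

# An unseen name is encoded with the same vocabulary, so it has the same
# number of columns; n-grams not seen during fitting are simply ignored.
X_new = vectorizer.transform(["john"])

print(X.shape, X_new.shape)
```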
I want to label some documents. I tried the LDA algorithm, but the results were too messy. I decided to use a supervised approach, so I created my own topic-word matrix, but I don't know how to generate a document-topic matrix. Do you know a good topic modeling algorithm that can be trained using a topic-word matrix?
If you already have a correct topic-word matrix, you only need to compute the topic weights for each document. For example, you could count the occurrences of each word in each document and then sum the topic weights of those words. You might need to add some coefficients, such as the number of occurrences, but it is pretty straightforward.
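A minimal sketch of that computation, assuming a document-word count matrix and a topic-word weight matrix over the same vocabulary. The toy numbers and the row normalization at the end are illustrative choices, not part of the original suggestion.

```python
import numpy as np

# Hypothetical toy data: 3 documents, a vocabulary of 4 words, 2 topics.
doc_word = np.array([[2, 0, 1, 0],
                     [0, 3, 0, 1],
                     [1, 1, 1, 1]], dtype=float)    # word counts per document
topic_word = np.array([[0.5, 0.0, 0.5, 0.0],
                       [0.0, 0.6, 0.0, 0.4]])       # word weights per topic

# For each document, sum the topic weights of its words, weighted by counts.
scores = doc_word @ topic_word.T                    # shape (n_docs, n_topics)

# Normalize each row so the topic weights of a document sum to 1
# (rows with no matching words are left as zeros).
totals = scores.sum(axis=1, keepdims=True)
doc_topic = np.divide(scores, totals, out=np.zeros_like(scores), where=totals > 0)

print(doc_topic)
```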
You can also use the LDA algorithm but skip the training step, whose purpose is to estimate the topic-word matrix. I do not know which implementation you use, but with the scikit-learn one you can assign your matrix to the components_ attribute and then use the transform function.
I have a dataset of MFCCs that I know is good. I know how to put a row vector into a machine learning algorithm. My question is how to do it with MFCCs, since each sample is a matrix. For example, how would I put this into a machine learning algorithm?
http://archive.ics.uci.edu/ml/machine-learning-databases/00195/Test_Arabic_Digit.txt
Any algorithm will work. I am looking at a binary classifier, but will be looking into it more. Scikit seems like a good resource. For now I would just like to know how to input MFCC into an algorithm. Step by step would help me a lot! I have looked in a lot of places but have not found an answer.
Thank you
In Python, you can easily flatten a matrix into a vector, for example with NumPy's flatten function. Another idea that comes to mind (it's just an idea and may or may not work) is to use convolutions, which work very well with 2D structures.
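A minimal sketch of the flattening idea. It assumes each sample is a (frames, coefficients) matrix and that samples may differ in their number of frames, so the hypothetical helper `mfcc_to_vector` pads with zeros or truncates to a fixed number of frames before flattening. The padding strategy and the `max_frames` value are assumptions, not something the answer specifies.

```python
import numpy as np

def mfcc_to_vector(mfcc, max_frames):
    """Pad or truncate a (frames, coeffs) matrix to max_frames rows, then flatten it."""
    mfcc = np.asarray(mfcc, dtype=float)
    n_frames, n_coeffs = mfcc.shape
    fixed = np.zeros((max_frames, n_coeffs))   # zero padding for short samples
    keep = min(n_frames, max_frames)
    fixed[:keep] = mfcc[:keep]                 # drop extra frames of long samples
    return fixed.flatten()                     # row vector of length max_frames * n_coeffs

# Two hypothetical samples with 13 coefficients per frame and different lengths.
short_sample = np.ones((3, 13))
long_sample = np.ones((7, 13))

X = np.stack([mfcc_to_vector(s, max_frames=5) for s in (short_sample, long_sample)])
print(X.shape)
```

Each row of `X` can then be passed to a classifier like any other feature vector.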
Core question: what are the right way(s) of using word embeddings to represent text?
I am building a sentiment classification application for tweets. It classifies tweets as negative, neutral or positive.
I am doing this using Keras on top of Theano and using word embeddings (Google's word2vec or Stanford's GloVe).
To represent tweet text I have done as follows:
Use a pre-trained model (such as the word2vec-twitter model) [M] to map words to their embeddings.
Use the words in the text to query M and get the corresponding vectors. So if the tweet (T) is "Hello world", M gives vectors V1 and V2 for the words 'Hello' and 'World'.
The tweet T can then be represented (V) as either V1+V2 (add the vectors) or V1V2 (concatenate the vectors). [These are 2 different strategies.] [Concatenation means juxtaposition, so if V1 and V2 are d-dimensional vectors, in my example T is a 2d-dimensional vector.]
Then, the tweet T is represented by vector V.
If I follow the above, then my dataset is nothing but vectors (which are the sum or concatenation of word vectors, depending on which strategy I use).
I am training a deep net such as an FFN or LSTM on this dataset, but my results aren't coming out great.
Is this the right way to use word embeddings to represent text? What are other, better ways?
Your feedback/critique will be of immense help.
I think that, for your purpose, it is better to think about another way of composing those vectors. The literature on word embeddings contains criticisms of these kinds of composition (I will edit the answer with the correct references as soon as I find them).
I would suggest you also consider other possible approaches, for instance:
Using the individual word vectors as input to your net (I do not know your architecture, but an LSTM is recurrent, so it can deal with sequences of words).
Using a full paragraph embedding (e.g. https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
Summing them doesn't make much sense, to be honest. The sum is another vector, and I don't think it represents the semantics of "Hello World". Maybe it does in this case, but it surely won't hold for longer sentences in general.
Instead, it would be better to feed them as a sequence, because that at least preserves word order in a meaningful way, which seems to fit your problem better.
E.g. "A hates apple" vs. "Apple hates A": this difference is captured when you feed the words as a sequence into an RNN, but their sums are identical.
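A small sketch of that point. The embeddings here are random 4-dimensional vectors standing in for real word2vec or GloVe vectors, and `embed_sum` and `embed_sequence` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical embeddings; in practice these come from a pre-trained model.
embedding = {word: rng.normal(size=4) for word in ["a", "hates", "apple"]}

def embed_sum(tokens):
    """Represent a sentence by the sum of its word vectors (order is lost)."""
    return np.sum([embedding[t] for t in tokens], axis=0)

def embed_sequence(tokens):
    """Represent a sentence as a (n_words, dim) matrix (order is kept)."""
    return np.stack([embedding[t] for t in tokens])

s1 = ["a", "hates", "apple"]
s2 = ["apple", "hates", "a"]

print(np.allclose(embed_sum(s1), embed_sum(s2)))                 # sums coincide
print(np.array_equal(embed_sequence(s1), embed_sequence(s2)))    # sequences differ
```

The sequence form is what a recurrent layer would consume, one row per time step.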
I hope you get my point!
I have 20 numeric input parameters (or more) and a single output parameter, and I have thousands of these data points. I need to find the relation between the input parameters and the output parameter. Some input parameters might not relate to the output parameter, or none of them might. I want some magic system that can statistically calculate the output parameter when I provide all the input parameters, and it would be much better if this system also provided a confidence rate with the output result.
What technique (in machine learning) do I need to use to solve this problem? I think it should be a neural network, a genetic algorithm or something related, but I'm not sure. More than that, I need to know the limitations of this technique.
Thanks,
Your question simply seems to define the regression problem, which can be solved by numerous algorithms and models, not just neural networks:
Support Vector Regression
Neural Networks
Linear regression (and its many modifications and generalizations), fitted for example with the OLS method
Nearest Neighbours Regression
Decision Tree Regression
many, many more!
Simply look for "regression methods", "regression models", etc. In particular, the sklearn library implements many such methods.
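As a rough sketch of how one might compare a couple of these methods in scikit-learn, the example below uses synthetic data with 20 inputs, of which only two actually influence the output. The data, the two models chosen, and the 5-fold cross-validation setup are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 500 samples, 20 inputs, output depends on inputs 0 and 5 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 5] + rng.normal(scale=0.1, size=500)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
print(scores)
```

Comparing cross-validated scores like this is one way to see which family of models suits the data before investing in tuning.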
I would recommend Genetic Programming (GP), a genetic-based machine learning approach where the learnt model is a single mathematical expression/equation that best fits your data. Most GP packages out there come with a standard regression suite which you can run "as is" on your data, with minimal setup costs.
The problem is a bit different from traditional handwriting recognition. I have a dataset consisting of thousands of the following: for one drawn character, I have several sequential (x, y) coordinates where the pen was pressed down. So, this is a sequential (temporal) problem.
I want to be able to classify handwritten characters based on this data, and would love to implement HMMs for learning purposes. But, is this the right approach? How can they be used to do this?
I think HMMs can be used for both problems mentioned by #jens. I'm working on online handwriting too, and HMMs are used in many articles. The simplest approach is like this:
Select a feature.
If the selected feature is continuous, convert it to a discrete one.
Choose the HMM parameters: topology and number of states.
Train character models using HMMs, one model for each class.
Test using test set.
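The last two steps (one model per class, then testing) can be sketched as scoring a discrete observation sequence under each class model with the forward algorithm and picking the most likely class. The two left-right models below have hand-set, hypothetical parameters; in practice they would be estimated from training data (for example with Baum-Welch), which this sketch omits.

```python
import numpy as np

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    alpha = start * emit[:, obs[0]]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = (alpha @ trans) * emit[:, obs[t]]
        scale = alpha.sum()
        log_prob += np.log(scale)
        alpha = alpha / scale          # rescale to avoid numerical underflow
    return log_prob

def classify(obs, models):
    """Return the label whose model gives the sequence the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(obs, *models[label]))

# Hypothetical 2-state left-right models over 2 discrete symbols.
start = np.array([1.0, 0.0])
trans = np.array([[0.7, 0.3],
                  [0.0, 1.0]])
models = {
    "A": (start, trans, np.array([[0.9, 0.1], [0.1, 0.9]])),
    "B": (start, trans, np.array([[0.1, 0.9], [0.9, 0.1]])),
}

print(classify([0, 0, 1, 1], models))
print(classify([1, 1, 0, 0], models))
```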
Notes on the first steps:
The simplest feature is the angle of the vector connecting consecutive points. You can use more complicated features, such as the angles of the vectors obtained by the Douglas-Peucker algorithm.
The simplest way to discretize is to use Freeman codes, but clustering algorithms like k-means and GMM can be used too.
HMM topologies include ergodic, left-right, Bakis and linear. The number of states can be found by trial and error. The HMM parameters can differ for each model. The number of observation symbols is determined by the discretization. Observation sequences can have variable length.
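A minimal sketch of the angle feature and its Freeman-code discretization, assuming 8 directions with code 0 pointing along the positive x axis and codes increasing counterclockwise in a y-up coordinate system. With screen coordinates where y grows downward, the rotation direction flips. The function name `freeman_codes` is a hypothetical helper.

```python
import numpy as np

def freeman_codes(points, n_directions=8):
    """Quantize the direction between consecutive (x, y) points into chain codes."""
    pts = np.asarray(points, dtype=float)
    deltas = np.diff(pts, axis=0)                      # vectors between consecutive points
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])    # angle of each vector, in (-pi, pi]
    step = 2 * np.pi / n_directions
    # Round to the nearest direction and wrap negative angles into 0..n_directions-1.
    return np.round(angles / step).astype(int) % n_directions

# A unit square traced counterclockwise: right, up, left, down.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(freeman_codes(square))
```

The resulting integer sequence has one symbol per pen movement, so its length varies with the stroke, which is what a discrete HMM expects as input.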
I recommend Kevin Murphy's HMM toolbox.
Good luck.
This problem is actually a mix of two problems:
recognizing one character from your data
recognizing a word from a (noisy) sequence of characters
An HMM is used for finding the most likely sequence of a finite number of discrete states from noisy measurements. This is exactly problem 2, since noisy measurements of the discrete states a-z, 0-9 follow each other in a sequence.
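Finding the most likely state sequence is usually done with the Viterbi algorithm. Below is a minimal sketch for a discrete HMM, using a hypothetical 2-state model with hand-set probabilities instead of the full a-z, 0-9 alphabet. The single noisy measurement in the middle of the sequence is smoothed out by the decoder.

```python
import numpy as np

def viterbi(obs, start, trans, emit):
    """Most likely hidden state sequence for a discrete HMM, computed in log space."""
    with np.errstate(divide="ignore"):                 # allow log(0) = -inf
        log_start = np.log(np.asarray(start, dtype=float))
        log_trans = np.log(np.asarray(trans, dtype=float))
        log_emit = np.log(np.asarray(emit, dtype=float))
    n_steps, n_states = len(obs), len(log_start)
    score = log_start + log_emit[:, obs[0]]
    back = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        # cand[i, j]: best score of a path that is in state i at t-1 and moves to j.
        cand = score[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[:, obs[t]]
    # Trace the best path backwards from the best final state.
    path = [int(score.argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical model: states tend to persist, measurements are right 80% of the time.
start = [0.5, 0.5]
trans = [[0.9, 0.1],
         [0.1, 0.9]]
emit = [[0.8, 0.2],
        [0.2, 0.8]]

print(viterbi([0, 0, 1, 0, 0], start, trans, emit))
```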
For problem 1, an HMM is useless because you aren't interested in the underlying sequence. What you want is to augment your handwritten digit with information on how you wrote it.
Personally, I would start by implementing regular state-of-the-art handwriting recognition which already is very good (with convolutional neural networks or deep learning). After that, you can add information about how it was written, for example clockwise/counterclockwise.