Is it possible to use Feature Learning for Binary Text Classification? - machine-learning

I'm currently working with the CHILDES corpus trying to create a classifier that distinguishes children whom suffer from specific language impairment (SLI) from those who are typically developing (TD).
In my readings I noticed that there really isn't a convincing set of features to distinguish the two that have been discovered yet, so I came upon the crazy idea of trying to create a feature learning algorithm that could potentially make better ones.
Is this possible? If so how do you suggest I approach this? From the reading I have done, most feature learning is done on image processing. Another problem is the dataset I have is potentially too small to make it work (in the 100's) unless I find a way to get more transcripts from children.

Create a dataset consisting of children text with three labels:
1- Normal
2- SLI
3- TD
So you'll have 3 labels.
You put aside 40% of your dataset 20% for development and 20% for test.
Then you run a LogisticRegression Classifier (e.g. using scikit-learn) using bag of character n-gram features. You can easily do this by TfidfVectorizer in scikit-learn.
Then you train the model over the 60% training set and you tune the hyper-parameters (e.g. regularization strength) by choosing the best performing development model.
Then, you train again using the chosen hyper-parameters and you get the top important features as in this example.
For each class, it gives you the weight of features associated with each label so you'll have your top linguistic symptoms for each of your two diseases.

Related

Is it a bad idea to use the cluster ID from clustering text data using K-means as feature to your supervised learning model?

I am building a model that will predict the lead time of products flowing through a pipeline.
I have a lot of different features, one is a string containing a few words about the purpose of the product (often abbreviations, name of the application it will be a part of and so forth). I have previously not used this field at all when doing feature engineering.
I was thinking that it would be nice to do some type of clustering on this data, and then use the cluster ID as a feature for my model, perhaps the lead time is correlated with the type of info present in that field.
Here was my line of thinking)
1) Cleaning & tokenizing text.
2) TF-IDF
3) Clustering
But after thinking more about it, is it a bad idea? Because the clustering was based on the old data, if new words are introduced in the new data this will not be captured by the clustering algorithm, and the data should perhaps be clustered differently now. Does this mean that I would have to retrain the entire model (k-means model and then the supervised model) whenever I want to predict new data points? Are there any best practices for this?
Are there better ways of finding clusters for text data to use as features in a supervised model?
I understand the urge to use an unsupervised clustering algorithm first to see for yourself, which clusters were found. And of course you can try if such a way helps your task.
But as you have labeled data, you can pass the product description without an intermediate clustering. Your supervised algorithm shall then learn for itself if and how this feature helps in your task (of course preprocessing such as removal of stopwords, cleaining, tokenizing and feature extraction needs to be done).
Depending of your text descriptions, I could also imagine that some simple sequence embeddings could work as feature-extraction. An embedding is a vector of for example 300 dimensions, which describes the words in a manner that hp office printer and canon ink jet shall be close to each other but nice leatherbag shall be farer away from the other to phrases. For example fasText-Word-Embeddings are already trained in english. To get a single embedding for a sequence of hp office printerone can take the average-vector of the three vectors (there are more ways to get an embedding for a whole sequence, for example doc2vec).
But in the end you need to run tests to choose your features and methods!

Is there a way to find the most representative set of samples of the entire dataset?

I'm working on text classification and I have a set of 200.000 tweets.
The idea is to manually label a short set of tweets and train classifiers to predict the labels of the rest. Supervised learning.
What I would like to know is if there is a method to choose what samples to include in the train set in a way that this train set is a good representation of the whole data set, and because the high diversity included in the train set, the trained classifiers have considerable trust to be applied on the rest of tweets.
This sounds like a stratification question - do you have pre-existing labels or do you plan to design the labels based on the sample you're constructing?
If it's the first scenario, I think the steps in order of importance would be:
Stratify by target class proportions (so if you have three classes, and they are 50-30-20%, train/dev/test should follow the same proportions)
Stratify by features you plan to use
Stratify by tweet length/vocabulary etc.
If it's the second scenario, and you don't have labels yet, you may want to look into using n-grams as a feature, coupled with a dimensionality reduction or clustering approach. For example:
Use something like PCA or t-SNE to maximize distance between tweets (or a large subset), then pick candidates from different regions of the projected space
Cluster them based on lexical items (unigrams or bigrams, possibly using log frequencies or TF-IDF and stop word filtering, if content words are what you're looking for) - then you can cut the tree at a height that gives you n bins, which you can then use as a source for samples (stratify by branch)
Use something like LDA to find n topics, then sample stratified by topic
Hope this helps!
It seems that before you know anything about the classes you are going to label, a simple uniform random sample will do almost as well as any stratified sample - because you don't know in advance what to stratify on.
After labelling this first sample and building the first classifier, you can start so-called active learning: make predictions for the unlabelled dataset, and sample some tweets in which your classifier is least condfident. Label them, retrain the classifier, and repeat.
Using this approach, I managed to create a good training set after several (~5) iterations, with ~100 texts in each iteration.

Machine learning with my car dataset

I’m very new to machine learning.
I have a dataset with data given me by a f1 race. User is playing this game and is giving me this dataset.
With machine learning, I have to work with this data and when a user (I know they are 10) plays a game I have to recognize who’s playing.
The data consists of datagram packet occurred in 1/10 second freq, the packets contains the following Time, laptime, lapdistance, totaldistance, speed, car position, traction control, last lap time, fuel, gear,..
I’ve thought to use a kmeans used in a supervised way.
Which algorithm could be better?
The task must be a multiclass classification. The very first step in any machine learning activity is to define a score metric (https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/). That allows you to compare models between themselves and decide which is better. Then build a base model with random forest or/and logistic regression as suggested in another answer - they perform well out-of-the-box. Then try to play with features and understand which of them are more informative. And don't forget about a visualizations - they give many hints for data wrangling, etc.
this is somewhat a broad question, so I'll try my best
kmeans is unsupervised algorithm meaning it will find the classes itself and it best used when you know there are multiple classes but you don't know what exactly they are... using it with labeled data just means you will compute the distance of new vector v to each vector in the dataset and pick the one (or ones using majority vote) which give the min distance , this is not considered as machine learning
in this case when you do have the labels, supervised approach will yield much better results
I suggest try random forest and logistic regression at first, those are the most basic and common algorithms and they give pretty good results
if you haven't achieve the desired accuracy you can use deep learning and build a neural network with input layer as big as your packet's values and output layer of the number of classes, in between you can use one or multiple hidden layers with various nodes, but this is advanced approach and you better pick up some experience in machine learning field before pursue it
Note: the data is a time series, meaning that every driver has it's own behaviour of driving a car, so data should be considered as bulks of points, with this you can apply pattern matching technics, also there are a several neural networks build exactly for this data (like RNN) but this is far far advanced and much more difficult to implement

How to discover new classes in a classification machine learning algorithm?

I'm using a multiclass classifier (a Support Vector Machine, via One-Vs-All) to classify data samples. Let's say I currently have n distinct classes.
However, in the scenario I'm facing, it is possible that a new data sample may belong to a new class n+1 that hasn't been seen before.
So I guess you can say that I need a form of Online Learning, as there is no distinct training set in the beginning that suits all data appearing later. Instead I need the SVM to adapt dynamically to new classes that may appear in the future.
So I'm wondering about if and how I can...
identify that a new data sample does not quite fit into the existing classes but instead should result in creating a new class.
integrate that new class into the existing classifier.
I can vaguely think of a few ideas that might be approaches to solve this problem:
If none of the binary SVM classifiers (as I have one for each class in the OVA case) predicts a fairly high probability (e.g. > 0.5) for the new data sample, I could assume that this new data sample may represent a new class.
I could train a new binary classifier for that new class and add it to the multiclass SVM.
However, these are just my naive thoughts. I'm wondering if there is some "proper" approach for this instead, e.g. using a Clustering algorithms to find all classes.
Or maybe my approach of trying to use an SVM for this is not even appropriate for this kind of problem?
Help on this is greatly appreciated.
As in any other machine learning problem, if you do not have a quality criterion, you suck.
When people say "classification", they have supervised learning in mind: there is some ground truth against which you can train and check your algorithms. If new classes can appear, this ground truth is ambiguous. Imagine one class is "horse", and you see many horses: black horses, brown horses, even white ones. And suddenly you see a zebra. Whoa! Is it a new class or just an unusual horse? The answer will depend on how you are going to use your class labels. The SVM itself cannot decide, because SVM does not use these labels, it only produces them. The decision is up to a human (or to some decision-making algorithm which knows what is "good" and "bad", that is, has its own "loss function" or "utility function").
So you need a supervisor. But how can you assist this supervisor? Two options come to mind:
Anomaly detection. This can help you with early occurences of new classes. After the very first zebra your algorithm sees it can raise an alarm: "There is something unusual!". For example, in sklearn various algorithms from random forest to one-class SVM can be used to detect unusial observations. Then your supervisor can look at them and decide whether they deserve to form an entirely new class.
Clustering. It can help you to make decision about splitting your classes. For example, after the first zebra, you decided it is not worth making a new class. But over time, your algorithm has accumulated dozens of their images. So if you run a clustering algorithm on all the observations labeled as "horses", you might end up with two well-separated clusters. And it will be again up to the supervisor to decide, whether the striped horses should be detached from the plain ones into a new class.
If you want this decision to be purely authomatic, you can split classes if the ratio of within-cluster mean distance to between-cluster distance is low enough. But it will work well only if you have a good distance metric in the first place. And what is "good" is again defined by how you use your algorithms and what your ultimate goal is.

machine learning from words found in text

I would like to use a supervised machine learning algorithm to predict a binary function (true or false) for a set of sentences based on the presence or absence of words in the sentences.
Ideally, I would like to avoid having to hardcode the set of words used to decide on the output so that the algorithm automatically learns which words are (together ?) most likely to trigger specific outputs.
http://shop.oreilly.com/product/9780596529321.do (Programming Collective Intelligence) has a nice section in chapter 4 titled "Learning From Clicks" which describes how to do this by using 1 layer of hiden nodes in a neural network with one new hidden node for each new combination of input words.
Similarly, it is possible to create a feature for each word in the training data set and train pretty much any classic machine learning algorithm using these features. Adding new training data will generate new features which will require me to re-train the algorithm from scratch.
Which brings me to my questions:
is it actually a problem if I have to retrain everything from scratch whenever the training data set is extended ?
what kind of algorithm would more experience machine learning users recommend to use for this kind of problem ?
what criteria should I use in picking an algorithm versus another ? (other than actually trying them all and see which perform better with precision/recall metrics)
if you have worked on similar problems, what about extending the features with 2-grams (1 if a specific 2-gram is present, 0 if not) ? 3-grams ?
You could look into the general area of topic modelling if you want to find words which are generally found together.
The most simple approach would be to use latent semantic analysis ( http://en.wikipedia.org/wiki/Latent_semantic_analysis ), which is just applying SVD to a term document matrix. You'd then need to do some additional post hoc analysis to fit this to your particular outcome.
A more involved, and much more complex approach would be to use latent dirichlet allocation ( http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation )
In terms of just adding new features (words) that is fine as long as you are going to retrain. You can also use TF/IDF to give that particular word a value when representing the matrix (Instead of just a 1 or 0).
I don't know what programming language you are trying to do this in, but I know there are libraries out there in Java and Pythont hat do all of the above.

Resources