How to classify text with 35+ classes; only ~100 samples per class? [closed] - machine-learning

The task seems straightforward: given a list of classes and some samples/rules describing what belongs in each class, assign every relevant text sample to it. The classes are arguably dissimilar, but they have a high degree of vocabulary overlap.
Precision is the priority, but recall needs to be around 80% to be acceptable.
Here is what I have done so far:
Checked whether any text samples have direct word or lemma matches against the words in each class's corpus. (High precision but low recall; this covered about 40% of the texts.)
Built a cosine-similarity matrix between each class's corpus of words and the remaining text samples. Cutting it off at an empirically chosen threshold identified a couple of new texts that are very similar. (Covered maybe 10% more of the texts.)
Appended each sample picked up by the word match / lemma match / embedding match (using SBERT) to the class's corpus of words.
Essentially this increased the number of samples per class. Note that there are 35+ classes, and even with this method I only reached about 200-250 samples per class.
Converted each class's samples to embeddings via SBERT, then used UMAP to reduce the dimensionality. UMAP also has a secondary, less-used capability: it can learn a transformation and map new data into the same reduced representation. I used this to embed the texts, reduce them via UMAP, and save the fitted UMAP transform. On this reduced representation I built a voting classifier (XGBoost, Random Forest, k-Nearest Neighbours, SVC, and Logistic Regression) with a hard voting criterion.
The unclassified texts then go through the prediction pipeline (SBERT embeddings -> lower-dimensional embeddings via the saved UMAP transform -> class prediction via the voter); a rough sketch of this pipeline is shown below.
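For concreteness, here is a minimal sketch of that pipeline, assuming sentence-transformers, umap-learn, scikit-learn, and xgboost are installed. `class_texts` and `unlabeled_texts` are hypothetical stand-ins for the real data; the tiny toy values only exist so the snippet runs end to end.

```python
# Minimal sketch: SBERT embeddings -> saved UMAP transform -> hard-voting classifier.
from sentence_transformers import SentenceTransformer
from umap import UMAP
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from xgboost import XGBClassifier

class_texts = {                      # hypothetical toy data; the real corpora are far larger
    "billing": ["invoice overdue", "payment failed", "refund not processed"],
    "shipping": ["package delayed", "tracking number missing", "parcel lost in transit"],
}
unlabeled_texts = ["where is my parcel", "the payment was charged twice"]

texts = [t for samples in class_texts.values() for t in samples]
labels = [c for c, samples in class_texts.items() for _ in samples]

sbert = SentenceTransformer("all-MiniLM-L6-v2")
X = sbert.encode(texts)                                  # SBERT embeddings

reducer = UMAP(n_components=2, n_neighbors=3, random_state=42)
X_low = reducer.fit_transform(X)                         # fit (and keep) the UMAP transform

le = LabelEncoder()
y = le.fit_transform(labels)                             # integer labels for XGBoost

voter = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier()),
        ("rf", RandomForestClassifier(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("svc", SVC()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
voter.fit(X_low, y)

# Prediction pipeline: SBERT -> saved UMAP transform -> voter.
X_new = reducer.transform(sbert.encode(unlabeled_texts))
print(le.inverse_transform(voter.predict(X_new)))
```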
Is this the right approach when trying to classify across a large number of classes with a small amount of training data per class?

Related

Batch normalization [closed]

Why does batch normalization operate on different samples of the same feature instead of on different features of the same sample? Shouldn't it be normalization across the different features? In the diagram, why do we use the first row and not the first column?
Could someone help me?
Because different features of the same object mean different things, and it is not meaningful to compute statistics over them together. They can have different ranges, means, standard deviations, etc. For example, one of your features could be a person's age and another the person's height; the mean of those two values is not a meaningful number.
In classic machine learning (especially with linear models and KNN) you should normalize your features, i.e. calculate the mean and std of each feature over the entire dataset and transform it to (X - mean(X)) / std(X). Batch normalization is the analogue of this applied to stochastic optimization methods like SGD (it is not practical to use global statistics on a mini-batch, and furthermore you want to apply batch norm more often than just before the first layer). More fundamental ideas can be found in the original paper.
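To make the row-vs-column point concrete, here is a tiny numpy sketch; the batch values are made up. Batch norm takes its statistics per feature, across the samples in the batch, which is the first case below.

```python
import numpy as np

# Hypothetical mini-batch: rows are samples, columns are features (age, height in cm).
batch = np.array([[25.0, 180.0],
                  [40.0, 165.0],
                  [31.0, 172.0]])

# Batch-norm style: mean/std per feature, computed across the samples (down each column).
per_feature = (batch - batch.mean(axis=0)) / batch.std(axis=0)

# The other axis: mean/std per sample, mixing age with height -- not meaningful.
per_sample = (batch - batch.mean(axis=1, keepdims=True)) / batch.std(axis=1, keepdims=True)

print(per_feature)  # each column now has mean ~0 and std ~1
print(per_sample)   # statistics computed over incomparable quantities
```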

Word embedding training [closed]

I have one corpus for word embeddings, and I trained my word embeddings on it. However, every time I train, the results are quite different (judged by the k-nearest neighbours of a word). For example, in the first training run the nearest neighbours of 'computer' are 'laptops', 'computerized', 'hardware'; but in the second run they are 'software', 'machine', ... ('laptops' is ranked low!). All training runs are performed independently for 20 epochs, and the hyper-parameters are all the same.
I want the trained embeddings to be very similar across runs (e.g., 'laptops' stays highly ranked). How should I do this? Should I adjust hyper-parameters (learning rate, initialization, etc.)?
You didn't say what word2vec software you're using, which might change the relevant factors.
The word2vec algorithm inherently uses randomness, in both initialization and several aspects of its training (like the selection of negative-examples, if using negative-sampling, or random downsampling of very-frequent words). Additionally, if you're doing multithreaded training, the essentially-random jitter in the OS thread scheduling will change the order of training examples, introducing another source of randomness. So you shouldn't necessarily expect subsequent runs, even with the exact same parameters and corpus, to give identical results.
Still, with enough good data, suitable parameters, and a proper training loop, the relative-neighbors results should be fairly similar from run to run. If they're not, more data or more iterations might help.
Wildly different results would be most likely if the model is overlarge (too many dimensions/words) for your corpus, and thus prone to overfitting. That is, it finds a great configuration for the data by essentially memorizing its idiosyncrasies, without achieving any generalization power. And if such overfitting is possible, there are typically many equally good such memorizations, so they can be very different from run to run. Meanwhile, a right-sized model with lots of data will instead capture true generalities, and those will be more consistent from run to run, despite any randomization.
Getting more data, using smaller vectors, using more training passes, or upping the minimum-count of word-occurrences to retain/train a word all might help. (Very-infrequent words don't get high-quality vectors, so wind up just interfering with the quality of other words, and then randomly intruding in most-similar lists.)
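If you do want tighter run-to-run reproducibility, here is a minimal sketch assuming gensim 4.x; `sentences` is a hypothetical toy corpus. Fixing the seed and using a single worker thread removes most of the randomness sources described above (at the cost of slower training); for full determinism PYTHONHASHSEED must also be fixed before the interpreter starts, since word-vector initialization hashes the words.

```python
# A minimal sketch, assuming gensim 4.x and a toy corpus; a real corpus would be far larger.
from gensim.models import Word2Vec

sentences = [                      # hypothetical toy corpus (lists of tokens)
    ["the", "computer", "runs", "software"],
    ["laptops", "are", "portable", "computers"],
    ["new", "computer", "hardware", "arrived"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # smaller vectors are less prone to overfitting a small corpus
    window=5,
    min_count=1,      # raise this on a real corpus to drop very rare words
    epochs=20,
    seed=42,          # fixes initialization and sampling randomness
    workers=1,        # single thread -> deterministic ordering of training examples
)

print(model.wv.most_similar("computer", topn=3))
```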
To know what else might be awry, you should clarify in your question things like:
software used
modes/metaparameters used
corpus size, in number of examples, average example size in words, and unique-words count (both in the raw corpus, and after any minimum-count is applied)
methods of preprocessing
code you're using for training (if you're managing the multiple training-passes yourself)

Can machine learning algorithm learn to predict a "random" number from a generator to at least 50% accuracy? [closed]

Because all random number generators are really pseudo-random number generators, can a machine learning algorithm eventually, with enough test data, learn to predict the next random number with 50% accuracy?
If you are generating just random bits (0 or 1), then any method will get 50%, literally any, ML or not, trained or not, as long as it does not directly exploit the underlying random number generator (like reading the seed and then using the same generator as the predictor). So the answer is yes.
If you consider more than two "numbers", then no, it is not possible unless your random number generator is invalid. The weaker the generating process and the better the model you try to learn, the easier it is to predict what is happening. For example, if you know exactly what the random number generator looks like, and it is just an iterated function with some parameters f(x|params), where we start with some random seed s and compute x1 = f(s|params), x2 = f(x1|params), ..., then you can learn the state of such a system with ML; it is just a matter of finding the "params" which, plugged into f, generate the observed values. Now, the more complex f is, the harder the problem. For typical random number generators f is too complex to learn, because you cannot observe any relation between close values: if you predict "5.8" and the answer was "5.81", the next sample from your model might be "123" while the true generator gives "-2". It is a completely chaotic process.
To sum up: this is possible only for very easy cases:
either there are just 2 values (then there is nothing to learn; literally any method that is not cheating will get 50%),
or the random number generator is seriously flawed, you know what type of flaw it is, and you can design a parametric model to approximate it (a small sketch of this case follows below).
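As an illustration of that second case, here is a small sketch with a deliberately weak linear congruential generator whose modulus is assumed known; recovering the remaining parameters from three observed outputs makes every later "random" number exactly predictable. The parameters are recovered algebraically here rather than with an ML model, but it is the same "find the params of f" idea, and all values are made up for illustration.

```python
# Hypothetical weak generator: x_{n+1} = (a * x_n + c) mod m, with m known and small.
# Real generators are far harder to attack; this only illustrates the idea.
M = 2 ** 16

def lcg(x, a=25173, c=13849, m=M):
    """One step of the weak linear congruential generator."""
    return (a * x + c) % m

# Observe three consecutive outputs without knowing a or c.
x1 = lcg(42)
x2 = lcg(x1)
x3 = lcg(x2)

# Recover the parameters: a = (x3 - x2) * (x2 - x1)^-1 mod m, c = (x2 - a*x1) mod m.
a_hat = ((x3 - x2) * pow(x2 - x1, -1, M)) % M   # modular inverse needs Python 3.8+
c_hat = (x2 - a_hat * x1) % M

# Predict the next output and compare with the true generator.
prediction = (a_hat * x3 + c_hat) % M
print(prediction == lcg(x3))   # True: the sequence is fully predictable
```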

Imbalanced Data for Random ferns [closed]

For a multiclass problem, should the data be balanced for machine learning algorithms such as random forests and random ferns, or is it OK for it to be imbalanced to a certain extent?
The issue with imbalanced classes arises when the disproportion alters the separability of the class instances. But this does not happen in every imbalanced dataset: sometimes the more data you have from one class, the better you can differentiate the scarce data from it, since it lets you find more easily which features are meaningful for creating a discriminating plane (even if you are not using discriminant analysis, the point is still to separate the instances according to their classes).
For example, I remember the KDD Cup 2004 protein classification task, in which one class had 99.1% of the instances in the training set, yet if you tried to use undersampling methods to alleviate the imbalance you would only get worse results. In other words, the large amount of data from the majority class helped define the smaller one.
Concerning random forests, and decision trees in general, they work by selecting, at each step, the most promising feature to partition the set into two (or more) class-meaningful subsets. Having inherently more data about one class does not bias this partitioning by default (= always), but only when the imbalance is not representative of the classes' real distributions.
So I suggest that you first run a multivariate analysis to gauge the extent of imbalance among the classes in your dataset, and then run a series of experiments with different undersampling ratios if you are still in doubt; a small sketch of such an experiment follows below.
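Here is a small sketch of such an experiment, assuming scikit-learn and imbalanced-learn are available; the synthetic dataset merely stands in for your real data.

```python
# Compare a random forest trained with different undersampling strategies on held-out data.
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced multiclass data standing in for the real dataset.
X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                           weights=[0.8, 0.15, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for strategy in [None, "not minority", "all"]:   # None = keep the original imbalance
    if strategy is None:
        X_res, y_res = X_tr, y_tr
    else:
        X_res, y_res = RandomUnderSampler(sampling_strategy=strategy,
                                          random_state=0).fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
    print(strategy, f1_score(y_te, clf.predict(X_te), average="macro"))
```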
I have used random forests in my tasks before. Although the data don't need to be balanced, if the positive samples are too few, their pattern may be drowned out by the noise. Most classification methods (even random forests and AdaBoost) have this flaw to some extent. Oversampling may be a good way to deal with this problem.
Perhaps the paper on logistic regression in rare events data is useful for this sort of problem, although its topic is logistic regression.

NLP: Calculating probability a document belongs to a topic (with a bag of words)? [closed]

Given a topic, how can I calculate the probability that a document "belongs" to that topic (e.g. sports)?
This is what I have to work with:
1) I know the common words in documents associated with that topic (after eliminating all stop words), and the percentage of documents that contain each word.
For instance if the topic is sports, I know:
75% of sports documents have the word "play"
70% have the word "stadium"
40% have the word "contract"
30% have the word "baseball"
2) Given this, and a document with a bunch of words, how can I calculate the probability this document belongs to that topic?
This is a fuzzy classification problem with topics as classes and words as features. Normally you don't have a bag of words for each topic, but rather a set of documents and their associated topics, so I will describe that case first.
The most natural way to find a probability (in the same sense it is used in probability theory) is to use a naive Bayes classifier. This algorithm has been described many times, so I'm not going to cover it here; you can find a quite good explanation in this synopsis or in the associated Coursera NLP lectures.
There are also many other algorithms you can use. For example, your description naturally fits tf*idf-based classifiers. tf*idf (term frequency * inverse document frequency) is a statistic used in modern search engines to estimate the importance of a word in a document. For classification, you may compute an "average document" for each topic and then measure how close a new document is to each topic with cosine similarity.
If your case is exactly as you've described it (only topics and associated words), just treat each bag of words as a single document, possibly duplicating the frequent words.
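As a rough illustration of the naive Bayes idea using only the statistics given in the question (so a relative score, not a calibrated probability), one might do something like the following; the 1e-3 value for unseen words is an arbitrary smoothing assumption.

```python
# Score a document against the "sports" topic using the word percentages from the question.
from math import log

p_word_given_sports = {"play": 0.75, "stadium": 0.70, "contract": 0.40, "baseball": 0.30}
UNSEEN = 1e-3  # smoothing for words never seen in sports documents (assumption)

def sports_log_score(document_words):
    """Sum of log P(word | sports) over the document's words (naive independence)."""
    return sum(log(p_word_given_sports.get(w, UNSEEN)) for w in document_words)

doc = ["play", "stadium", "ticket"]
print(sports_log_score(doc))

# Computing the analogous score for every other topic (plus log priors) and
# normalizing across topics would give the posterior P(topic | document).
```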
Check out topic modeling (https://en.wikipedia.org/wiki/Topic_model) and if you are coding in python, you should check out radim's implementation, gensim (http://radimrehurek.com/gensim/tut1.html). Otherwise there are many other implementations from http://www.cs.princeton.edu/~blei/topicmodeling.html
There are many approaches to solving a clustering problem. I suggest starting with simple logistic regression and looking at the results. If you already have predefined ontology sets, you can add them as features in the next stage to improve accuracy.
