Word embedding training [closed] - machine-learning

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have one corpus for word embedding. Using this corpus, I trained my word embedding. However, whenever I train my word embedding, the results are quite different (these results are based on K-Nearest Neighbors (KNN)). For example, in the first training run, the nearest neighbors of 'computer' are 'laptops', 'computerized', 'hardware'. But in the second run, the nearest neighbors are 'software', 'machine', ... ('laptops' is ranked low!). All training runs are performed independently for 20 epochs, and the hyper-parameters are all the same.
I want my trained word embeddings to be very similar across runs (e.g., 'laptops' is ranked high). What should I do? Should I adjust the hyper-parameters (learning rate, initialization, etc.)?

You didn't say what word2vec software you're using, which might change the relevant factors.
The word2vec algorithm inherently uses randomness, in both initialization and several aspects of its training (like the selection of negative-examples, if using negative-sampling, or random downsampling of very-frequent words). Additionally, if you're doing multithreaded training, the essentially-random jitter in the OS thread scheduling will change the order of training examples, introducing another source of randomness. So you shouldn't necessarily expect subsequent runs, even with the exact same parameters and corpus, to give identical results.
Still, with enough good data, suitable parameters, and a proper training loop, the relative-neighbors results should be fairly similar from run-to-run. If they're not, more data or more iterations might help.
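One way to put a number on "fairly similar from run-to-run" is to compare the top-k neighbor lists of the same word across two runs. The sketch below is only an illustration: the toy 2-d vectors and the helper names (`top_k_neighbors`, `neighbor_overlap`) are made up here, and real vectors would come from whatever word2vec software is in use.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_neighbors(vectors, word, k):
    """Return the k words most similar to `word` within one run."""
    scored = [(cosine(vectors[word], vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

def neighbor_overlap(run_a, run_b, word, k=10):
    """Jaccard overlap of the top-k neighbor sets from two runs."""
    a = set(top_k_neighbors(run_a, word, k))
    b = set(top_k_neighbors(run_b, word, k))
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Toy runs: run_b has its two axes swapped, yet the neighbors are the same.
run_a = {"computer": [1.0, 0.0], "laptops": [0.9, 0.1],
         "hardware": [0.8, 0.3], "banana": [0.0, 1.0]}
run_b = {"computer": [0.0, 1.0], "laptops": [0.1, 0.9],
         "hardware": [0.3, 0.8], "banana": [1.0, 0.0]}

overlap = neighbor_overlap(run_a, run_b, "computer", k=2)
```

The toy runs also show why coordinates from separate runs should not be compared directly: the axes differ between runs even when the neighborhoods agree, so only within-run similarities are meaningful.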
Wildly-different results would be most likely if the model is overlarge (too many dimensions/words) for your corpus – and thus prone to overfitting. That is, it finds a great configuration for the data, through essentially memorizing its idiosyncrasies, without achieving any generalization power. And if such overfitting is possible, there are typically many equally-good such memorizations – so they can be very different from run-to-run. Meanwhile, a right-sized model with lots of data will instead be capturing true generalities, and those would be more consistent from run-to-run, despite any randomization.
Getting more data, using smaller vectors, using more training passes, or upping the minimum-count of word-occurrences to retain/train a word all might help. (Very-infrequent words don't get high-quality vectors, so wind up just interfering with the quality of other words, and then randomly intruding in most-similar lists.)
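To make the minimum-count idea concrete, here is a small sketch of what such a cutoff does to a corpus. Word2vec libraries usually apply it internally (gensim's `Word2Vec`, for instance, has a `min_count` parameter); the function name `apply_min_count` and the toy sentences are assumptions for illustration only.

```python
from collections import Counter

def apply_min_count(sentences, min_count):
    """Drop every token that occurs fewer than `min_count` times overall."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {word for word, count in counts.items() if count >= min_count}
    filtered = [[token for token in sentence if token in kept]
                for sentence in sentences]
    return filtered, kept

# Toy tokenized corpus: "laptop" and "crashed" occur only once.
toy_corpus = [
    ["the", "computer", "runs"],
    ["the", "laptop", "runs"],
    ["the", "computer", "crashed"],
]

filtered_corpus, vocab = apply_min_count(toy_corpus, min_count=2)
```

Comparing `len(vocab)` before and after the cutoff also gives the "unique-words count" figures requested further down.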
To know what else might be awry, you should clarify in your question things like:
software used
modes/metaparameters used
corpus size, in number of examples, average example size in words, and unique-words count (both in the raw corpus, and after any minimum-count is applied)
methods of preprocessing
code you're using for training (if you're managing the multiple training-passes yourself)

Related

How to classify text with 35+ classes; only ~100 samples per class? [closed]

Closed. This question is not about programming or software development. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed yesterday.
The task is seemingly straightforward -- given a list of classes and some samples/rules of what belongs in the class, assign all relevant text samples to it. All the classes are arguably dissimilar, but they have a high degree of overlap in terms of vocab.
Precision is most important, but acceptable recall is about 80%.
Here is what I have done so far:
Checked if any of the samples have direct word matches/lemma matches to the samples that are in the class' corpora of words. (High precision but low recall -- got me to cover about 40% of text)
Formed a cosine_sim matrix of all the class' corpora of words and the remaining text samples. Cut off at an empirical threshold, it helped me identify a couple new texts that are very similar. (Covered maybe 10% more text)
I appended each sample picked by the word match/lemma match/embedding match (using sbert) to the class' corpora of words
Essentially I increased the number of samples in the class. Note that there are 35+ classes, and even with this method I got to maybe about 200-250 samples per class.
I converted each class' samples to embeddings via sbert, and then used UMAP to reduce dimensions. UMAP also has a secondary, but less used, use-case: it can learn a representation and transform new data into a similar representation! I used this concept to convert text to embeddings, then reduce them via UMAP, and saved the UMAP transformation. Using this reduced representation, I built a voting classifier (with XGB, RF, KNearestNeighbours, SVC and Logistic Regression) and set it to a hard voting criterion.
The unclassified texts went through the prediction pipeline (sbert embeddings -> transformed lower dim embeddings via saved UMAP -> predict class via voter)
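The cosine-similarity cutoff from the second step can be sketched as follows. This is only an illustration under assumptions: the 2-d vectors stand in for sbert embeddings, and the class labels, the threshold value and the function name `assign_by_threshold` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def assign_by_threshold(text_vec, class_vecs, threshold):
    """Return the best-matching class, or None if no match clears the cutoff.

    Abstaining (None) below the threshold trades recall for precision.
    """
    best_label, best_score = None, -1.0
    for label, examples in class_vecs.items():
        score = max(cosine(text_vec, example) for example in examples)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None

# Hypothetical class examples (stand-ins for embedded samples per class).
class_vecs = {
    "billing": [[1.0, 0.0], [0.9, 0.2]],
    "shipping": [[0.0, 1.0]],
}

confident = assign_by_threshold([0.95, 0.1], class_vecs, threshold=0.9)
ambiguous = assign_by_threshold([0.7, 0.7], class_vecs, threshold=0.9)
```

Since precision matters most here, texts that come back as `None` stay unclassified rather than being forced into the nearest class.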
Is this the right approach for when trying to classify between a large number of classes with small training data size?

How to Approach Creating an Accurate Multiclass Multinomial Naive Bayes with Unbalanced Data [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I have used sklearn to create a basic multiclass naive bayes text classifier. I have 3 classes and around 800 rows of data. Class A has 564 rows, Class B has 159, and Class C has 82. As you can see the data is unbalanced among the classes and I understand that this can affect the accuracy because Bayes Theorem takes into account the probability of a word occurring in the text given that the text is of a specific class in order to figure out the probability of the text being of said class given that it has the word in the text. This was my first go, and I plan to get more data, as you might imagine Class A was the easiest to get while Class C was the hardest to attain.
I am, however, confused as to how I should be approaching creating and improving this model and how balanced the class data sets should be. If I were to get perfectly proportionate data for each class, say 1000 rows of data for each class, or undersample the data I already have, wouldn't this affect the accuracy as well? In reality, the occurrence of Class C is definitely less likely than A and B. In reality the proportions of the classes are somewhat similar (although varying from person to person) to the probability of a text being of said Class. And since Bayes' Theorem also takes into account the prior probability of a piece of text being a specific class in order to calculate the probability of a text being a specific class given that it contains a word, wouldn't creating a balanced dataset with an equal number of rows for each class decrease the accuracy? The probability of a class occurring in production would no longer be taken into account, because the prior would be essentially constant and the same for all classes. On the other hand, making all classes equal does remove the bias of a word due to unbalanced datasets.
So I am unsure how to approach creating this model efficiently as I feel with unbalanced data, common words in Class C are perceived by the model to be more likely to occur in an email of Class A when in reality they are probably more common in C but the skewed data is creating this bias. On the other hand, making the classes balanced ignores the actual probability of a piece of text being a specific Class although I have no way of calculating a universal probability of each class that is accurate for all individuals, (does that mean that making the classes balanced has less of a negative effect on accuracy?). Any guidance is greatly appreciated, I am quite new to this.
TL;DR: Don't undersample/oversample; use text augmentation instead.
Undersampling/oversampling can be helpful in certain situations, but certainly not in your case with only 800 rows of data. Undersampling would make you lose too much valuable data, and oversampling would result in unreliable outcomes. A much better solution would be to augment your data.
There are libraries like Snorkel that allow you to augment textual data by swapping or replacing adjectives, verbs, nouns, etc. with synonyms in a probabilistic way, which can greatly increase your data size. I highly recommend taking a look at it, as it's often used both in academia and in industry.
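As a rough picture of what probabilistic synonym replacement means, here is a minimal standard-library sketch. It is not Snorkel's API: the `SYNONYMS` table, the `augment` function and its parameters are all hypothetical, and a real setup would draw synonyms from a proper lexical resource.

```python
import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "error": ["fault", "bug"],
}

def augment(text, synonyms, p=0.5, rng=None):
    """Replace each word that has synonyms with probability `p`."""
    rng = rng or random.Random()
    out = []
    for token in text.split():
        options = synonyms.get(token.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(token)
    return " ".join(out)

# A fixed seed makes the augmented variants repeatable.
rng = random.Random(0)
variants = [augment("quick error report", SYNONYMS, p=0.5, rng=rng)
            for _ in range(3)]
```

Each variant keeps the original label, so a class with 82 rows can be expanded without simply duplicating rows.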
Regarding your concern with balancing your dataset, there are a few factors that can affect the outcome. Examples include the size of your dataset and overfitting, how distinctive the features are at classifying the samples, the presence of outliers, etc. Just because you have 10k samples of cancer patients and 5k of healthy people doesn't necessarily mean your predictions will follow a 2:1 ratio on a real-life dataset. That's because the model isn't necessarily memorizing the distribution of each class, but rather how the features result in the prediction of the class.
So in your example, if each class has distinctive words that often distinguish one class from the others, you'd want to provide samples with those words in other classes to make sure you're not overfitting each class on those words.
Hope this helps!
When training from an imbalanced training set, the variances of your classifier parameters grow large. The more skewed your prior class distribution is (A, B, C), the larger this problem becomes.
It is recommended, when possible, to train from a balanced training set (the same number of 'A' and 'B' and 'C' cases). Correction to the actual prior class distribution can take place afterwards, see correction formula for posterior probabilities.
Your subsets of cases from the different classes must be selected at random from your complete data set, to avoid any selection bias.
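The posterior correction mentioned above is commonly done by reweighting each class probability by the ratio of the real prior to the training prior and then renormalizing. The sketch below assumes that this is the intended formula. The function name is made up, and the question's row counts (564, 159, 82) are used as a stand-in for the real class proportions.

```python
def correct_posteriors(posteriors, train_priors, true_priors):
    """Rescale posteriors from a model trained under `train_priors`
    so that they reflect `true_priors`, then renormalize to sum to 1."""
    weighted = {c: posteriors[c] * true_priors[c] / train_priors[c]
                for c in posteriors}
    total = sum(weighted.values())
    return {c: w / total for c, w in weighted.items()}

# Balanced training set: each class had prior 1/3.
train_priors = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

# Assumed real-world proportions, taken from the question's row counts.
counts = {"A": 564, "B": 159, "C": 82}
n_rows = sum(counts.values())
true_priors = {c: n / n_rows for c, n in counts.items()}

# Hypothetical output of the balanced model for one text.
balanced_posterior = {"A": 0.3, "B": 0.3, "C": 0.4}
corrected = correct_posteriors(balanced_posterior, train_priors, true_priors)
```

In this made-up case the balanced model slightly favours C, but after the correction A becomes the most probable class because it is far more common in practice.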

How can I get the right balance between classification loss and a regularizer? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I'm working on a deep learning classifier (Keras and Python) that classifies time series into three categories. The loss function that I'm using is the standard categorical cross-entropy. In addition to this, I also have an attention map which is being learnt within the same model.
I would like this attention map to be as small as possible, so I'm using a regularizer. Here comes the problem: how do I set the right regularization parameter? What I want is for the network to reach its maximum classification accuracy first, and then start minimising the intensity of the attention map. For this reason, I train my model once without the regulariser and a second time with the regulariser on. However, if the regulariser parameter (lambda) is too high, the network completely loses accuracy and only minimises the attention, while if the regulariser is too small, the network only cares about the classification error and won't minimise the attention, even when the accuracy is already at its maximum.
Is there a smarter way to combine the categorical cross-entropy with the regulariser? Maybe something that considers the variation of categorical cross-entropy in time, and if it doesn't go down for, say N iterations, it only considers the regulariser?
Thank you
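The plateau idea in the last paragraph can be sketched in plain Python as a small gate that keeps the regulariser weight at zero until the cross-entropy has stopped improving for N updates. The class name, its parameters and the loss values are hypothetical, and wiring the returned weight into a Keras loss is left out.

```python
class PlateauGatedWeight:
    """Return 0.0 until the monitored loss stalls, then return `lam`."""

    def __init__(self, lam, patience, min_delta=1e-4):
        self.lam = lam              # regulariser weight once switched on
        self.patience = patience    # N updates without improvement
        self.min_delta = min_delta  # smallest change counted as progress
        self.best = float("inf")
        self.stale = 0
        self.active = False

    def update(self, ce_loss):
        """Record one cross-entropy value and return the current weight."""
        if ce_loss < self.best - self.min_delta:
            self.best = ce_loss
            self.stale = 0
        else:
            self.stale += 1
        if self.stale >= self.patience:
            self.active = True  # stays on once triggered
        return self.lam if self.active else 0.0

# Made-up cross-entropy values, one per epoch.
gate = PlateauGatedWeight(lam=0.1, patience=2)
weights = [gate.update(loss) for loss in [1.0, 0.8, 0.7, 0.7, 0.7, 0.69]]
# Total loss per epoch would then be: cross_entropy + weight * attention_penalty
```

Whether the weight should stay on after being triggered, or ramp up gradually instead of jumping, is a design choice this sketch does not settle.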
Regularisation is a way to fight overfitting. So you should first find out whether your model overfits. A simple way to do this is to compare the F1 score on the training set with the F1 score on the test set. If the F1 score is high on the training set and low on the test set, the model is probably overfitting, and you need to add some regularisation.
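The train/test comparison above can be sketched with a small macro-F1 helper. The labels below are made up, and in practice a library routine such as scikit-learn's `f1_score` with `average="macro"` would normally be used instead of this hand-written version.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up predictions for a three-class problem.
train_true = [0, 1, 2, 0, 1, 2]
train_pred = [0, 1, 2, 0, 1, 2]   # perfect on the training data
test_true = [0, 1, 2, 0, 1, 2]
test_pred = [0, 0, 0, 0, 1, 2]    # noticeably worse on held-out data

f1_gap = macro_f1(train_true, train_pred) - macro_f1(test_true, test_pred)
```

A large positive `f1_gap` is the overfitting signal described above. How large is too large depends on the task.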

Text generation: character prediction RNN vs. word prediction RNN [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I've been researching text generation with RNNs, and it seems as though the common technique is to input text character by character, and have the RNN predict the next character.
Why wouldn't you use the same technique, but with words instead of characters?
This seems like a much better technique to me because the RNN won't make any typos and it will be faster to train.
Am I missing something?
Furthermore, is it possible to create a word prediction RNN but with somehow inputting words pre-trained on word2vec, so that the RNN can understand their meaning?
Why wouldn't you do the same technique but using words instead of characters?
Word-based models are used just as often as character-based ones. See an example in this question. But there are several important differences between the two:
A character-based model is more flexible and can learn rarely used words and punctuation. Andrej Karpathy's post shows how effective this model can be. But this is also a downside, because this model can sometimes produce complete nonsense.
Character-based models have a much smaller vocabulary, which makes them easier and faster to train. Since one-hot encoding and softmax loss work perfectly well, there's no need to complicate the model with embedding vectors and specially crafted loss functions (negative sampling, NCE, ...).
Word-based models can't generate out-of-vocabulary (OOV) words, and they are more complex and resource-demanding. But they can learn syntactically and grammatically correct sentences and are more robust than character-based ones.
By the way, there are also subword models, which are somewhat in the middle. See "Subword language modeling with neural networks" by T. Mikolov et al.
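The out-of-vocabulary difference can be seen even on a toy example. The sketch below builds a character vocabulary and a word vocabulary from one made-up training sentence and checks which new words each model could in principle emit. The helper names are hypothetical, and the text is far too small to show the vocabulary-size gap that appears on real corpora.

```python
def char_vocab(text):
    """Set of distinct characters seen in the training text."""
    return set(text)

def word_vocab(text):
    """Set of distinct whitespace-separated words in the training text."""
    return set(text.split())

def can_spell(word, chars):
    """A character-level model can emit a word if it has seen every character."""
    return set(word) <= chars

train_text = "the cat sat on the mat"
chars = char_vocab(train_text)
words = word_vocab(train_text)

# "hat" never appears as a word, but all of its characters do.
hat_is_oov_for_words = "hat" not in words
hat_is_spellable = can_spell("hat", chars)

# "dog" uses characters that never occur in the training text.
dog_is_spellable = can_spell("dog", chars)
```

A word-level model with this vocabulary can never output "hat", while a character-level model can, although nothing guarantees it will spell real words.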
Furthermore, is it possible to create a word prediction RNN but with somehow inputting words pretrained on word2vec, so that the RNN can understand their meaning?
Yes, the example I referred to above is exactly about this kind of model.

Imbalanced Data for Random ferns [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 9 years ago.
For a multiclass problem, should the data be balanced for machine learning algorithms such as Random forests and Random ferns, or is it OK for it to be imbalanced to a certain extent?
The issue with imbalanced classes arises when the disproportion alters the separability of the classes' instances. But this does not happen in every imbalanced dataset: sometimes the more data you have from one class, the better you can differentiate the scarce data from it, since it lets you find more easily which features are meaningful for creating a discriminating plane (even though you are not using discriminative analysis, the point is to classify, that is separate, the instances according to classes).
For example, I remember the KDDCup2004 protein classification task, in which one class had 99.1% of the instances in the training set, but if you tried to use undersampling methods to alleviate the imbalance you would only get worse results. This means that the large amount of data from the first class defined the data in the smaller one.
Concerning random forests, and decision trees in general, they work by selecting, at each step, the most promising feature that can partition the set into two (or more) class-meaningful subsets. Having inherently more data about one class does not always bias this partitioning; it does so only when the imbalance is not representative of the classes' real distributions.
So I suggest that you first run a multivariate analysis to gauge the extent of imbalance among classes in your dataset, and then run a series of experiments with different undersampling ratios if you are still in doubt.
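The experiments with different undersampling ratios need a way to build the reduced training sets. A minimal sketch is given below, with a hypothetical `undersample` helper that caps every class at a multiple of the smallest class's size. The data is made up, and training and evaluating the forest on each variant is left out.

```python
import random
from collections import Counter

def undersample(rows, labels, max_ratio, seed=0):
    """Cap each class at `max_ratio` times the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    cap = int(max_ratio * min(len(items) for items in by_class.values()))
    out_rows, out_labels = [], []
    for label, items in by_class.items():
        # Randomly drop rows only from classes that exceed the cap.
        chosen = items if len(items) <= cap else rng.sample(items, cap)
        out_rows.extend(chosen)
        out_labels.extend([label] * len(chosen))
    return out_rows, out_labels

# Made-up dataset: 90 rows of class "a", 10 rows of class "b".
rows = list(range(100))
labels = ["a"] * 90 + ["b"] * 10

# One reduced training set per candidate ratio.
class_counts = {}
for ratio in (1, 3, 9):
    _, sampled_labels = undersample(rows, labels, max_ratio=ratio)
    class_counts[ratio] = Counter(sampled_labels)
```

Each reduced set would then be used for training while evaluating on the same untouched validation data, so the ratios can be compared fairly.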
I have used Random Forests in my task before. Although the data don't need to be balanced, if the positive samples are too few, the pattern in the data may be drowned out by the noise. Most classification methods (even random forests and AdaBoost) have this flaw to some degree. Oversampling may be a good way to deal with this problem.
Perhaps the paper Logistic Regression in rare is useful for this sort of problem, although its topic is logistic regression.
