Text generation: character prediction RNN vs. word prediction RNN [closed] - machine-learning

I've been researching text generation with RNNs, and it seems as though the common technique is to input text character by character, and have the RNN predict the next character.
Why wouldn't you use the same technique but with words instead of characters?
This seems like a much better technique to me because the RNN won't make any typos and it will be faster to train.
Am I missing something?
Furthermore, is it possible to create a word-prediction RNN that somehow takes in words pre-trained with word2vec, so that the RNN can understand their meaning?

Why wouldn't you use the same technique but with words instead of characters?
Word-based models are used just as often as character-based ones. See an example in this question. But there are several important differences between the two:
A character-based model is more flexible and can learn rarely used words and punctuation. Andrej Karpathy's post shows how effective such a model can be. But this flexibility is also a downside, because the model can sometimes produce complete nonsense.
Character-based models have a much smaller vocabulary, which makes them easier and faster to train. Since one-hot encoding and a plain softmax loss work perfectly well at that scale, there's no need to complicate the model with embedding vectors and specially crafted loss functions (negative sampling, NCE, ...).
Word-based models can't generate out-of-vocabulary (OOV) words, and they are more complex and resource-demanding. But they learn syntactically and grammatically correct sentences and are more robust than character-based ones.
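To make the vocabulary-size point concrete, here is a minimal character-level sketch in PyTorch (a toy example of mine, not from the answer): the vocabulary is just the distinct characters of a tiny string, so plain one-hot inputs and a softmax over a few dozen symbols are enough.

```python
import torch
import torch.nn as nn

text = "hello world, hello rnn"
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
vocab_size = len(chars)  # tens of symbols, not tens of thousands of words

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_size=64):
        super().__init__()
        self.rnn = nn.GRU(vocab_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)  # softmax over characters

    def forward(self, x, h=None):
        y, h = self.rnn(x, h)
        return self.out(y), h

# one-hot encode the whole string: shape (1, seq_len, vocab_size)
idx = torch.tensor([char_to_idx[c] for c in text])
x = torch.nn.functional.one_hot(idx, vocab_size).float().unsqueeze(0)

model = CharRNN(vocab_size)
logits, _ = model(x[:, :-1])  # predict the next character at every step
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), idx[1:])
loss.backward()
print("vocabulary size:", vocab_size)
```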
By the way, there are also subword models, which are somewhat in the middle. See "Subword language modeling with neural networks" by T. Mikolov et al.
Furthermore, is it possible to create a word-prediction RNN that somehow takes in words pre-trained with word2vec, so that the RNN can understand their meaning?
Yes, the example I referred to above is exactly about this kind of model.
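For illustration, a rough sketch of that kind of word-level model (the file name, layer sizes and freeze choice are assumptions, not the setup from the linked question): the gensim KeyedVectors matrix initializes the embedding layer that feeds the RNN.

```python
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

# "word2vec.bin" is a placeholder path to your pretrained vectors (gensim 4.x API)
wv = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)
weights = torch.tensor(wv.vectors, dtype=torch.float)

embedding = nn.Embedding.from_pretrained(weights, freeze=False)  # freeze=True keeps vectors fixed
rnn = nn.LSTM(weights.shape[1], 256, batch_first=True)
decoder = nn.Linear(256, weights.shape[0])  # logits over the word vocabulary

# token ids for a sentence (assumes these words exist in the word2vec vocabulary)
ids = torch.tensor([[wv.key_to_index[w] for w in "the cat sat".split()]])
out, _ = rnn(embedding(ids))
next_word_logits = decoder(out[:, -1])  # distribution over the next word
```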

Related

How to classify text with 35+ classes; only ~100 samples per class? [closed]

The task is seemingly straightforward: given a list of classes and some samples/rules of what belongs in each class, assign all relevant text samples to it. All the classes are arguably dissimilar, but they have a high degree of overlap in terms of vocabulary.
Precision is most important, but acceptable recall is about 80%.
Here is what I have done so far:
Checked if any of the samples have direct word matches / lemma matches to the samples in the class's corpus of words. (High precision but low recall; this covered about 40% of the text.)
Formed a cosine-similarity matrix between each class's corpus of words and the remaining text samples. Cut off at an empirical threshold, this helped me identify a few new texts that are very similar. (Covered maybe 10% more text.)
Appended each sample picked up by the word match / lemma match / embedding match (using sbert) to the class's corpus of words.
Essentially I increased the number of samples in the class. Note that there are 35+ classes, and even with this method I got to maybe about 200-250 samples per class.
Converted each class's samples to embeddings via sbert, then used UMAP to reduce dimensions. UMAP also has a secondary, less-used capability: it can learn a representation and transform new data into that same representation. I used this to convert texts to embeddings, reduce them via UMAP, and save the fitted UMAP transformation. Using this reduced representation, I built a voting classifier (with XGB, RF, K-nearest neighbours, SVC and logistic regression) set to hard voting.
The unclassified texts then went through the prediction pipeline (sbert embeddings -> lower-dimensional embeddings via the saved UMAP transform -> class prediction via the voter).
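A condensed sketch of that pipeline (the sbert model name, UMAP settings and the tiny stand-in data are my assumptions, not the asker's actual setup):

```python
import umap
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# tiny stand-in data; the real task has 35+ classes with ~200 samples each
X_texts = ["refund my order", "card was charged twice", "invoice is missing",
           "cannot log in", "password reset fails", "account is locked",
           "app crashes on start", "screen freezes on load", "button does nothing"]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
new_texts = ["my payment failed", "login page keeps erroring"]

sbert = SentenceTransformer("all-MiniLM-L6-v2")
X_emb = sbert.encode(X_texts)

reducer = umap.UMAP(n_neighbors=4, n_components=2, random_state=42)
X_red = reducer.fit_transform(X_emb)  # keep the fitted reducer for inference

voter = VotingClassifier(
    estimators=[("xgb", XGBClassifier()),
                ("rf", RandomForestClassifier()),
                ("knn", KNeighborsClassifier(n_neighbors=3)),
                ("svc", SVC()),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard",
)
voter.fit(X_red, y)

# unclassified texts go through the same chain: sbert -> saved UMAP -> voter
new_red = reducer.transform(sbert.encode(new_texts))
print(voter.predict(new_red))
```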
Is this the right approach when trying to classify among a large number of classes with a small amount of training data?

Word embedding training [closed]

I have one corpus for word embedding training. Using this corpus, I trained my word embedding. However, whenever I train it, the results are quite different (the results below are based on K-Nearest Neighbors (KNN)). For example, in the first training, the nearest-neighbor words of 'computer' are 'laptops', 'computerized', 'hardware'. But in the second training, the KNN words are 'software', 'machine', ... ('laptops' is ranked low!). All trainings are performed independently for 20 epochs, and the hyper-parameters are all the same.
I want my word embedding trainings to give very similar results (e.g., 'laptops' ranked high every time). How should I do this? Should I adjust the hyper-parameters (learning rate, initialization, etc.)?
You didn't say what word2vec software you're using, which might change the relevant factors.
The word2vec algorithm inherently uses randomness, in both initialization and several aspects of its training (like the selection of negative-examples, if using negative-sampling, or random downsampling of very-frequent words). Additionally, if you're doing multithreaded training, the essentially-random jitter in the OS thread scheduling will change the order of training examples, introducing another source of randomness. So you shouldn't necessarily expect subsequent runs, even with the exact same parameters and corpus, to give identical results.
Still, with enough good data, suitable parameters, and a proper training loop, the relative-neighbors results should be fairly similar from run-to-run. If they're not, more data or more iterations might help.
Wildly different results would be most likely if the model is overlarge (too many dimensions/words) for your corpus, and thus prone to overfitting. That is, it finds a great configuration for the data by essentially memorizing its idiosyncrasies, without achieving any generalization power. And if such overfitting is possible, there are typically many equally good memorizations, so they can be very different from run-to-run. Meanwhile, a right-sized model with lots of data will instead capture true generalities, and those will be more consistent from run-to-run, despite any randomization.
Getting more data, using smaller vectors, using more training passes, or upping the minimum-count of word-occurrences to retain/train a word all might help. (Very-infrequent words don't get high-quality vectors, so wind up just interfering with the quality of other words, and then randomly intruding in most-similar lists.)
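As a hedged example, assuming you are using gensim's Word2Vec (the question doesn't say which implementation): fixing the seed, training in a single worker thread and, on a real corpus, raising min_count makes consecutive runs much more comparable, at the cost of training speed.

```python
from gensim.models import Word2Vec

# toy corpus; min_count=1 only because it is tiny, raise it on real data
sentences = [["the", "computer", "runs", "software"],
             ["laptops", "are", "portable", "computers"],
             ["computer", "hardware", "and", "software"]]

model = Word2Vec(
    sentences,
    vector_size=50,   # smaller vectors are less prone to overfitting a small corpus
    min_count=1,
    epochs=20,
    seed=42,          # fixed RNG seed for initialization and sampling
    workers=1,        # single thread removes scheduling nondeterminism
)
print(model.wv.most_similar("computer", topn=3))
```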
To know what else might be awry, you should clarify in your question things like:
software used
modes/metaparameters used
corpus size, in number of examples, average example size in words, and unique-words count (both in the raw corpus, and after any minimum-count is applied)
methods of preprocessing
code you're using for training (if you're managing the multiple training-passes yourself)

What are the algorithms which could be used to match sentences? [closed]

Let's say we have a list of 50 sentences and we have an input sentence. How can I choose the sentence from the list that is closest to the input sentence?
I have tried many methods/algorithms, such as averaging the word2vec vector representations of each token in the sentence and then taking the cosine similarity of the resulting vectors.
For example I want the algorithm to give a high similarity score between "what is the definition of book?" and "please define book".
I am looking for a method (probably a combination of methods) which
1. looks for semantics
2. looks for syntax
3. gives different weights to tokens with different roles (e.g. in the first example 'what' and 'is' should get lower weights)
I know this might be a bit general but any suggestion is appreciated.
Thanks,
Amir
Before computing a distance between sentences, you need to clean them.
For that:
Lemmatization of your words is needed to get the root of each word, so your sentence "what is the definition of book" would become "what be the definition of book".
You need to delete all prepositions, forms of "to be" and other words without meaning (stop words), so "what be the definition of book" becomes "definition book".
Then you transform your sentences into numeric vectors using tf-idf or word2vec.
Finally you can compute the distance between your sentences using the cosine between the vectors: the smaller the cosine distance (i.e., the higher the cosine similarity), the more similar the two sentences are.
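A minimal scikit-learn sketch of these steps (a toy example of mine; lemmatization is omitted here and could be plugged in via a custom tokenizer, e.g. with spaCy or NLTK):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["what is the definition of book?", "please define book",
             "how is the weather today?"]
query = "give me the meaning of the word book"

# stop words are dropped by the vectorizer; sentences become tf-idf vectors
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(sentences + [query])

scores = cosine_similarity(matrix[-1], matrix[:-1])[0]  # query vs each sentence
best = scores.argmax()
print(sentences[best], scores[best])
```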
Hope that helps.
Your sentences are too sparse to be compared directly as two documents. Aggressive morphological transformations (such as stemming, lemmatization, etc.) might help some, but will probably fall short given your examples.
What you could do is compare the 'search results' of the two sentences in a large document collection with a number of methods. According to the distributional hypothesis, similar sentences should occur in similar contexts (see the distributional hypothesis, but also Rocchio's algorithm, co-occurrence and word2vec). Those contexts (when gathered in a smart way) could be large enough to do some comparison (such as cosine similarity).
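As a rough, hypothetical illustration of that idea (the toy background collection is my own, not from the answer): represent each sentence by the average of its top-k nearest documents in a background collection, then compare those expanded representations.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

collection = [
    "a book is a set of written pages bound together",
    "dictionaries give the definition and meaning of words",
    "readers borrow books from the public library",
    "the weather forecast predicts rain tomorrow",
]

vec = TfidfVectorizer(stop_words="english")
doc_matrix = vec.fit_transform(collection)

def context_vector(sentence, k=2):
    """Average the tf-idf vectors of the k documents most similar to the sentence."""
    q = vec.transform([sentence])
    sims = cosine_similarity(q, doc_matrix)[0]
    top = sims.argsort()[::-1][:k]
    return np.asarray(doc_matrix[top].mean(axis=0))

a = context_vector("what is the definition of book?")
b = context_vector("please define book")
print(cosine_similarity(a, b)[0, 0])  # high if both sentences pull in similar contexts
```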

Imbalanced Data for Random ferns [closed]

For a multiclass problem, should the data be balanced for machine learning algorithms such as random forests and random ferns, or is it OK for it to be imbalanced to a certain extent?
The issue with imbalanced classes arises when the disproportion alters the separability of the classes' instances. But this does not happen in every imbalanced dataset: sometimes the more data you have from one class, the better you can differentiate the scarce data from it, since the extra data makes it easier to find which features are meaningful for building a discriminating plane (even if you are not using discriminant analysis, the point is still to separate the instances according to their classes).
For example, I remember the KDD Cup 2004 protein classification task, in which one class had 99.1% of the instances in the training set, yet if you tried to use undersampling methods to alleviate the imbalance you would only get worse results. That means the large amount of data from the majority class helped define the data in the smaller one.
Concerning random forests, and decision trees in general, they work by selecting, at each step, the most promising feature that can partition the set into two (or more) class-meaningful subsets. Having inherently more data about one class does not bias this partitioning by default (i.e., always), but only when the imbalance is not representative of the classes' real distributions.
So I suggest that you first run a multivariate analysis to gauge the extent of imbalance among the classes in your dataset, and then run a series of experiments with different undersampling ratios if you are still in doubt.
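A small sketch of such an experiment (using imbalanced-learn and synthetic data as stand-ins, neither of which the answer prescribes): undersample the majority class at several ratios and compare cross-validated scores.

```python
from collections import Counter

from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic imbalanced multiclass data standing in for your own
X, y = make_classification(n_samples=3000, n_classes=3, n_informative=5,
                           weights=[0.90, 0.07, 0.03], random_state=0)
majority, n_major = Counter(y).most_common(1)[0]

for frac in (1.0, 0.5, 0.2, 0.05):  # 1.0 means no undersampling
    sampler = RandomUnderSampler(sampling_strategy={majority: int(frac * n_major)},
                                 random_state=0)
    X_res, y_res = sampler.fit_resample(X, y)
    score = cross_val_score(RandomForestClassifier(random_state=0),
                            X_res, y_res, cv=3, scoring="f1_macro").mean()
    print(f"majority kept: {int(frac * n_major):4d}  macro-F1: {score:.3f}")
```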
I have used random forests in my tasks before. Although the data doesn't need to be balanced, if the positive samples are too few their pattern may drown in the noise. Most classification methods (even random forests and AdaBoost) have this flaw to some extent. Oversampling may be a good way to deal with this problem.
Perhaps the paper "Logistic Regression in Rare Events Data" is useful for this sort of problem, although its topic is logistic regression.

NLP: Calculating probability a document belongs to a topic (with a bag of words)? [closed]

Given a topic, how can I calculate the probability that a document "belongs" to that topic (e.g. sports)?
This is what I have to work with:
1) I know the common words in documents associated with that topic (eliminating all stop words), and the percentage of documents that contain each word.
For instance if the topic is sports, I know:
75% of sports documents have the word "play"
70% have the word "stadium"
40% have the word "contract"
30% have the word "baseball"
2) Given this, and a document with a bunch of words, how can I calculate the probability this document belongs to that topic?
This is a fuzzy classification problem with topics as classes and words as features. Normally you don't have a bag of words for each topic, but rather a set of documents and their associated topics, so I will describe this case first.
The most natural way to find a probability (in the same sense the word is used in probability theory) is to use a naive Bayes classifier. This algorithm has been described many times, so I'm not going to cover it here. You can find a quite good explanation in this synopsis or in the associated Coursera NLP lectures.
There are also many other algorithms you can use. For example, your description naturally fits tf*idf based classifiers. tf*idf (term frequency * inverse document frequency) is a statistic used in modern search engines to calculate the importance of a word in a document. For classification, you may calculate an "average document" for each topic and then measure how close a new document is to each topic with cosine similarity.
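A quick sketch of the "average document" idea (toy data, my own example, not from the answer): build one tf-idf centroid per topic and assign new documents by cosine similarity to the centroids.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_docs   = ["play stadium baseball contract", "stadium play fans game",
                "election vote senate law", "law court vote policy"]
train_topics = ["sports", "sports", "politics", "politics"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)

topics = sorted(set(train_topics))
# one centroid ("average document") per topic
centroids = np.vstack([
    np.asarray(X[[i for i, t in enumerate(train_topics) if t == topic]].mean(axis=0))
    for topic in topics
])

new_doc = vec.transform(["the team signed a new contract at the stadium"])
sims = cosine_similarity(new_doc, centroids)[0]
print(dict(zip(topics, sims.round(3))))
```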
If your case is exactly as you've described (only topics and associated words), just treat each bag of words as a single document, possibly with frequent words duplicated.
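For the exact case in the question, a back-of-the-envelope naive Bayes calculation (only the four sports percentages come from the post; the prior, the "other topic" probabilities and the smoothing value are made-up assumptions for illustration, and absent words are ignored for simplicity):

```python
import math

p_word_given_sports = {"play": 0.75, "stadium": 0.70, "contract": 0.40, "baseball": 0.30}
p_word_given_other  = {"play": 0.20, "stadium": 0.02, "contract": 0.30, "baseball": 0.01}  # assumed
prior_sports, prior_other = 0.5, 0.5  # assumed equal priors

document = ["play", "stadium", "baseball"]

def log_score(words, likelihoods, prior, default=0.05):
    # unseen words get a small smoothing probability
    return math.log(prior) + sum(math.log(likelihoods.get(w, default)) for w in words)

s = log_score(document, p_word_given_sports, prior_sports)
o = log_score(document, p_word_given_other, prior_other)
p_sports = math.exp(s) / (math.exp(s) + math.exp(o))
print(f"P(sports | document) ~ {p_sports:.3f}")
```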
Check out topic modeling (https://en.wikipedia.org/wiki/Topic_model), and if you are coding in Python, you should check out Radim's implementation, gensim (http://radimrehurek.com/gensim/tut1.html). Otherwise, there are many other implementations at http://www.cs.princeton.edu/~blei/topicmodeling.html
There are many approaches to solving this kind of classification problem. I suggest starting with simple logistic regression and looking at the results. If you already have predefined ontology sets, you can add them as features in a later stage to improve accuracy.
