We have pre-trained fastText word embeddings. Can we use them to derive character embeddings? I found a blog post (this link), but in it the author simply averages over all the words to get each character's vector. Is there any other way to obtain character embeddings without training an RNN or CNN?
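For reference, my understanding of the blog's approach is roughly the sketch below; word_vectors stands for the loaded fastText model, and the exact averaging scheme (mean of the vectors of all words containing a character) is my assumption of what the post does:

```python
import numpy as np

def char_embeddings_by_averaging(word_vectors, vocabulary):
    """Character vector = mean of the vectors of all words containing that character."""
    sums, counts = {}, {}
    for word in vocabulary:
        vec = np.asarray(word_vectors[word])
        for ch in set(word):                     # count each word once per character
            sums[ch] = sums.get(ch, np.zeros_like(vec)) + vec
            counts[ch] = counts.get(ch, 0) + 1
    return {ch: sums[ch] / counts[ch] for ch in sums}
```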
I have been taking a course on deep neural networks. In one of the exercises I built an RNN for sentiment classification, which worked, but I did not understand how an RNN is able to deal with sentences of different lengths while doing sentiment classification.
The RNN doesn't care about the length of the original sentences, because all the data it receives has the same length. Converting every sentence to the same length is a matter of the method you use in the preprocessing step.

For example, the simplest method is Bag of Words -> https://machinelearningmastery.com/gentle-introduction-bag-words-model/

So the sentences given to the RNN all have the same length, equal to the number of neurons in the input layer; otherwise the RNN throws an error.
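As a rough illustration of that preprocessing step, here is a minimal padding/truncation sketch; the fixed length and the padding id are arbitrary choices, not something prescribed above:

```python
def pad_sequences(token_id_sequences, max_len, pad_value=0):
    """Pad or truncate each tokenized sentence to exactly max_len ids,
    so every example the RNN sees has the same length."""
    fixed = []
    for seq in token_id_sequences:
        seq = seq[:max_len]                              # truncate long sentences
        seq = seq + [pad_value] * (max_len - len(seq))   # pad short ones
        fixed.append(seq)
    return fixed

# Example: three sentences of different lengths become uniform length 5.
batch = [[4, 8, 15], [16, 23, 42, 7, 9, 2], [1]]
print(pad_sequences(batch, max_len=5))
# [[4, 8, 15, 0, 0], [16, 23, 42, 7, 9], [1, 0, 0, 0, 0]]
```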
I have a set of pre-trained word2vec word vectors and a corpus. I want to use the word vectors to represent words in the corpus. The corpus has some words in it that I don't have trained word vectors for. What's the best way to handle those words for which there is no pre-trained vector?
I've heard several suggestions.
use a vector of zeros for every missing word
use a vector of random numbers for every missing word (with a bunch of suggestions on how to bound those randoms)
an idea I had: take a vector whose values are the mean of all values in that position from all pre-trained vectors
Does anyone with experience with this problem have thoughts on how to handle it?
FastText from Facebook assembles word vectors from subword n-grams, which allows it to handle out-of-vocabulary words. See more about this approach at: Out of Vocab Word Embedding
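If you have the original .bin model rather than only the .vec text file, gensim can reconstruct vectors for unseen words from their character n-grams. A minimal sketch, assuming a recent gensim and a local model file (the path is illustrative):

```python
from gensim.models.fasttext import load_facebook_vectors

# Load the full FastText model (the .bin file keeps the subword n-gram table).
kv = load_facebook_vectors("cc.en.300.bin")   # path is illustrative

# Even a word absent from the training vocabulary gets a vector,
# assembled from the n-grams it shares with known words.
vec = kv["floccinaucinihilipilification"]
print(vec.shape)
```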
In a pre-trained word2vec embedding matrix, you can often use the token unk (unknown) as the index to look up a designated vector and use that for every missing word; in practice this frequently works best.
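Putting the suggested fallbacks (plus the unk trick) into code, here is a minimal sketch, assuming kv is a gensim 4 KeyedVectors object loaded from the pre-trained file; the bounds on the random vectors are an arbitrary choice, not a recommendation:

```python
import numpy as np

def lookup(kv, word, strategy="mean"):
    """Return a vector for `word`, with a fallback when it is out of vocabulary."""
    if word in kv:                               # in-vocabulary: use the trained vector
        return kv[word]
    if strategy == "unk" and "unk" in kv:        # designated unknown-word vector, if present
        return kv["unk"]
    if strategy == "zeros":                      # option 1: all-zero vector
        return np.zeros(kv.vector_size)
    if strategy == "random":                     # option 2: bounded random vector
        return np.random.uniform(-0.25, 0.25, kv.vector_size)
    return kv.vectors.mean(axis=0)               # option 3: per-dimension mean of all vectors
```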
I want to classify texts using SVM (SMO) in Weka. The file I have contains some sentences (in Persian) and, in front of each sentence, a word that indicates its class. The question is: should I convert these sentences to binary vectors myself and give those vectors to Weka as input, or is it enough to just turn the sentences into vectors by choosing "StringToWordVector" in Weka itself?
sample file:
https://www.dropbox.com/s/ohpyortve8jbwhe/shoor.arff?dl=0
Although it works if you choose "StringToWordVector" in Weka, it is better to convert the sentences to vectors yourself, for example based on the 1000 most frequent words or any other features; it also runs faster.
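For comparison, if you build the vectors outside Weka, the same "1000 most frequent words" step looks roughly like this in Python with scikit-learn (shown only to illustrate the idea, not as part of the Weka workflow; the sentences are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["...", "..."]   # your Persian sentences (placeholders)

# Keep only the 1000 most frequent words and produce binary presence vectors.
vectorizer = CountVectorizer(max_features=1000, binary=True)
X = vectorizer.fit_transform(sentences)   # sparse matrix: one row per sentence
```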
I have a list of sentence/label pairs to train the model. How should I encode the sentences as input to, say, an SVM?
Are the sentences all in the same language? You could start with the pretrained word2vec file that you can download from Google if they're English. Pay attention to how the training file was created, whether stemming was applied, etc. It also matters which corpus it was generated from; you'd get different results if it came from newsgroups than if it was extracted from the web or from more formal text.
Word2Vec basically encodes every word into a higher-dimensional vector space, usually 200, 300 or 500 dimensions. Once it is trained, the "test" sentences are basically bags of words and need not be in any order.

Then, for each word in the bag of words, look up the corresponding word2vec vector. You can create features by averaging the vectors, taking the element-wise minimum and maximum, and, if you're comparing texts, by computing the cosine similarity between vectors. Then use those features in an SVM.
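A minimal sketch of that pipeline, assuming kv is a loaded gensim KeyedVectors model and sentences/labels are your tokenized sentence/label pairs (everything else here is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

def sentence_vector(kv, tokens):
    """Average the word2vec vectors of the in-vocabulary tokens."""
    vecs = [kv[t] for t in tokens if t in kv]
    if not vecs:                                 # no known words: fall back to zeros
        return np.zeros(kv.vector_size)
    return np.mean(vecs, axis=0)

# sentences: list of token lists, labels: list of class labels (assumed given)
X = np.vstack([sentence_vector(kv, s) for s in sentences])
clf = SVC(kernel="rbf").fit(X, labels)
```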
I'm planning on using an NN for sarcasm detection on a number of tweets. I'm unsure how to prepare the word embeddings I will train the NN on. If I tokenize the strings and tag emoticons, capitalisation, user tags, hashtags etc., how do I then combine the resulting strings with word embeddings? Do I train the word embeddings on the resulting corpus of tweets?
You can start by reading some papers on sarcasm detection in Twitter, e.g. Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon, which uses patterns of content words and high-frequency words, or, closer to your question, Sarcastic or Not: Word Embeddings to Predict the Literal or Sarcastic Meaning of Words, which uses word2vec. The latter views the sarcasm detection problem as disambiguation between the literal and sarcastic meanings of the same word. Perhaps you can employ this approach using the recently published sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings.
Try to use the techniques used in the papers, and when you encounter a specific problem ask a question with a minimal working example.
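For the practical side of the question (training embeddings on the preprocessed tweets), here is a minimal sketch with gensim 4; the tokenization and the hyperparameters are assumptions for illustration, not recommendations from the papers above:

```python
from gensim.models import Word2Vec

# Each tweet is already tokenized, with emoticons/hashtags/user tags kept
# as their own tokens (or replaced by placeholder tags such as <user>).
tweets = [
    ["great", ",", "another", "monday", "<user>", "#blessed", ":)"],
    ["i", "just", "love", "waiting", "in", "line", "for", "hours"],
]

# Train word vectors directly on the tweet corpus.
model = Word2Vec(sentences=tweets, vector_size=100, window=5,
                 min_count=1, workers=4, epochs=10)

vec = model.wv["monday"]   # embedding you can then feed into the NN
```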