I am working with LDA for document classification, and I am confused about one part.
Should we use LDA for classification using document Title, or document content?
I have a huge set of documents, and using LDA on content keeps throwing a MemoryError even for a small number of topics (~5-10).
I understand that it requires 8*num_topics*dictionary_size bytes of memory, which is probably the reason it runs out of memory. It works better on the titles of the documents.
Should I use LDA for topic, and some other algorithm like Word2Vec for content?
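For what it's worth, the dictionary_size term in that estimate is usually the easiest one to shrink: prune the vocabulary before training. A minimal gensim sketch, where the variable names and filtering thresholds are assumptions:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # `tokenized_docs` is assumed to be a list of token lists, one per document.
    dictionary = Dictionary(tokenized_docs)

    # Prune very rare and very common terms and cap the vocabulary size;
    # this shrinks the dictionary_size factor in the memory estimate above.
    dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)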
I'm working with a dataset of emails' content which I want to transform with doc2vec. This is a labeled dataset (spam/not-spam) and it is unbalanced (90-10 ratio).
My question is: when tokenizing the emails' content, should I first oversample (using SMOTE), or is it ok to use the dataset as is?
Try both, pick which works better.
(Separately: avoid using the known-labels as the document-identifiers in Doc2Vec, as in practice that turns the dataset into just two giant documents – far too few for training doc-vectors of any useful dimensionality – instead of the many varied documents that are needed for an interesting/useful high-dimensional doc-vector set.)
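To illustrate that point, a minimal gensim sketch where each email gets its own unique tag instead of its class label (variable names are assumptions):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # `emails` is assumed to be a list of token lists; the spam/not-spam labels
    # are kept separately for the downstream classifier, NOT used as tags here.
    tagged = [TaggedDocument(words=tokens, tags=[f"email_{i}"])
              for i, tokens in enumerate(emails)]

    model = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=20)
    vector = model.dv["email_0"]  # learned doc-vector for the first email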
In deep learning, particularly in NLP, words are transformed into a vector representation to be fed into a neural network such as an RNN. Referring to this link:
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/#Word%20Embeddings
In the section on Word Embeddings, it is said that:

A word embedding W: words → ℝⁿ is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions)
I do not understand the purpose of the dimension of the vectors. What does it mean to have a vector of 200 dimensions compared to a vector of 20 dimensions?
Does it improve the overall accuracy of the model? Could anyone give me a simple example regarding the choice of the dimension of the vectors?
These word embeddings, also called distributed word embeddings, are based on the idea that

you shall know a word by the company it keeps

as famously said by John Rupert Firth.
So we know the meaning of a word by its context. You can think of each scalar in the vector (of a word) as representing its strength for some concept. This slide from Prof. Pawan Goyal explains it all.
So you want a vector size large enough to capture a decent number of concepts, but you do not want a vector so large that it becomes the bottleneck in training the models where these embeddings are used.
Also, the vector size is usually fixed, as most people do not train their own embeddings but rather use openly available ones, since those have been trained for many hours on huge datasets. Using them forces you to use an embedding layer whose dimension matches the openly available embedding you chose (word2vec, GloVe, etc.).
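For example, with gensim's downloader you can load a ready-made embedding and read off the dimension you are then committed to (the specific model name below is just an illustrative choice):

    import gensim.downloader as api

    # Load a pretrained 100-dimensional GloVe model (an example choice).
    wv = api.load("glove-wiki-gigaword-100")
    print(wv.vector_size)    # 100 -> your embedding layer must use this dimension
    print(wv["music"][:5])   # first few components of the vector for "music"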
Distributed word embeddings are a major milestone in the area of deep learning for NLP. They give better accuracy compared to tf-idf based representations.
I am trying to use BERT for a document ranking problem. My task is pretty straightforward. I have to do a similarity ranking for an input document. The only issue here is that I don’t have labels - so it’s more of a qualitative analysis.
I am on my way to try a bunch of document representation techniques - word2vec, para2vec and BERT mainly.
For BERT, I came across the Hugging Face PyTorch library. I fine-tuned the bert-base-uncased model with around 150,000 documents. I ran it for 5 epochs, with a batch size of 16 and a max sequence length of 128. However, if I compare the performance of the BERT representations vs the word2vec representations, for some reason word2vec is performing better for me right now. For BERT, I used the last four layers for getting the representation.
I am not too sure why the fine-tuned model didn't work. I read this paper, and also this other link, which said that BERT performs well when fine-tuned for a classification task. However, since I don't have labels, I fine-tuned it as done in the paper - in an unsupervised manner.
Also, my documents vary a lot in length, so I'm sending them in sentence-wise right now. In the end I have to average over the word embeddings anyway to get the sentence embedding. Any ideas on a better method? I also read here that there are different ways of pooling over the word embeddings to get a fixed-size embedding. I'm wondering if there is a comparison of which pooling technique works better?
Any help on training BERT better or a better pooling method will be greatly appreciated!
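For reference, a common mask-aware way to mean-pool BERT token outputs into one vector per document with the Hugging Face transformers library is sketched below (the model name and max length are illustrative assumptions):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    texts = ["first document ...", "second, much longer document ..."]
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=128,
                       return_tensors="pt")
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state      # (batch, seq, hidden)

    # Mean-pool only over real tokens, ignoring padding positions.
    mask = inputs["attention_mask"].unsqueeze(-1).float()    # (batch, seq, 1)
    doc_embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)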
You can check out this blog post:
BERT even has a special [CLS] token whose output embedding is used for classification tasks, but still turns out to be a poor embedding of the input sequence for other tasks. [Reimers & Gurevych, 2019]
Sentence-BERT, presented in [Reimers & Gurevych, 2019] and accompanied by a Python implementation, aims to adapt the BERT architecture by using siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity.
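If you try that route, the accompanying sentence-transformers package makes the cosine-similarity ranking straightforward; a minimal sketch, where the pretrained model name is an assumption:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT-style model works

    docs = ["query document text ...", "candidate one ...", "candidate two ..."]
    embeddings = model.encode(docs, convert_to_tensor=True)

    # Rank candidates by cosine similarity to the first (query) document.
    scores = util.cos_sim(embeddings[0], embeddings[1:])
    print(scores)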
I would like to use some pre-trained word embeddings in a Keras NN model, which have been published by Google in a very well known article. They have provided the code to train a new model, as well as the embeddings here.
However, it is not clear from the documentation how to retrieve an embedding vector for a given string of characters (word) from a simple Python function call. Much of the documentation seems to center on dumping vectors to a file for an entire sentence, presumably for sentiment analysis.
So far, I have seen that you can feed in pretrained embeddings with the following syntax:
embedding_layer = Embedding(number_of_words??,
                            output_dim=128??,
                            weights=[pre_trained_matrix_here],
                            input_length=60??,
                            trainable=False)
However, converting the different files and their structures to pre_trained_matrix_here is not quite clear to me.
They have several softmax outputs, so I am uncertain which one to use - and furthermore how to align the words in my input with the dictionary of words they provide.
Is there a simple way to use these word/char embeddings in Keras and/or to construct the character/word embedding portion of the model in Keras such that further layers may be added for other NLP tasks?
The Embedding layer only looks up embeddings (rows of the weight matrix) for the integer indices of input words; it does not know anything about the strings. This means you need to first convert your input sequence of words into a sequence of indices, using the same vocabulary that was used in the model you take the embeddings from.
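Concretely, a minimal sketch of that conversion, assuming the pretrained vectors are loaded as a gensim KeyedVectors object (variable names are illustrative):

    import numpy as np
    from tensorflow.keras.layers import Embedding

    # `kv` is assumed to be a gensim KeyedVectors holding the pretrained vectors.
    word_index = {word: i + 1 for i, word in enumerate(kv.index_to_key)}  # 0 = padding
    embedding_matrix = np.zeros((len(word_index) + 1, kv.vector_size))
    for word, i in word_index.items():
        embedding_matrix[i] = kv[word]

    # Input sentences must be converted to index sequences with this same word_index.
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=60,   # your padded sequence length
                                trainable=False)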
For NLP applications related to word or text encoding, I would use CountVectorizer or TfidfVectorizer. Both are briefly described for Python in the following reference: http://www.bogotobogo.com/python/scikit-learn/files/Python_Machine_Learning_Sebastian_Raschka.pdf
CountVectorizer can be used for simple applications such as a spam/ham detector, while TfidfVectorizer gives a deeper insight into how relevant each term (word) is, based on its frequency in a document and the number of documents in which it appears; this results in an interesting metric of how discriminative the terms are. These text feature extractors may be combined with stop-word removal and lemmatization to boost the feature representations.
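A minimal scikit-learn sketch of the two vectorizers side by side, using made-up toy documents:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["free prize, claim your prize now", "meeting moved to friday afternoon"]

    counts = CountVectorizer(stop_words="english").fit_transform(docs)  # raw term counts
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)   # counts weighted by rarity

    print(counts.shape, tfidf.shape)  # both are (n_documents, n_vocabulary) sparse matrices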
I'm classifying content based on LDA into generic topics such as Music, Technology, Arts, Science
This is the process I'm using:
9 topics -> Music, Technology, Arts, Science etc etc.
9 documents -> Music.txt, Technology.txt, Arts.txt, Science.txt etc etc.
I've filled each document (.txt file) with about 10,000 lines of what I think is "pure" categorical content.
I then classify a test document, to see how well the classifier is trained.
My questions are:
a.) Is this an efficient way to classify text (using the above steps)?
b.) Where should I be looking for "pure" topical content to fill each of these files? Sources which are not too large (text data > 1 GB).
Classification is only on "generic" topics such as the above.
a) The method you describe sounds fine, but everything will depend on the implementation of labeled LDA that you're using. One of the best implementations I know is the Stanford Topic Modeling Toolbox. It is not actively developed anymore, but it worked great when I used it.
b) You can look for topical content on DBPedia, which has a structured ontology of topics/entities, and links to Wikipedia articles on those topics/entities.
I suggest using a bag-of-words (BoW) representation for each class, or vectors where each column is the frequency of important keywords related to the class you want to target.
Regarding the dictionaries, you have DBPedia, as yves mentioned, or WordNet.
a.) The simplest solution is surely the k-nearest neighbors algorithm (kNN). In fact, it will classify new texts with categorical content using an overlap metric.
You can find resources here: https://github.com/search?utf8=✓&q=knn+text&type=Repositories&ref=searchresults
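As a rough sketch of that idea with scikit-learn (tf-idf features plus a kNN classifier; everything below is an illustrative assumption rather than code from the linked repositories):

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    # `train_texts` / `train_labels` are assumed: one string per document,
    # with labels such as "Music", "Technology", "Arts", "Science", ...
    clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                        KNeighborsClassifier(n_neighbors=3, metric="cosine"))
    clf.fit(train_texts, train_labels)

    print(clf.predict(["guitar chords and live concert recordings"]))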
Dataset issue:
If you are dealing with classifying live user feeds, then I guess no single dataset will satisfy your requirements.
Because if a new movie X is released, it might not be caught by your classifier, as the training dataset is by then obsolete with respect to it.
To stay updated with the latest data, I suggest using Twitter training datasets for classification. Develop a dynamic process that updates the classifier with the latest tweet datasets. You could select the top 15-20 hashtags for each category of your choice to get the most relevant dataset for each category.
Classifier:
Most of the classifiers use a bag-of-words model; you can try out various classifiers and see which gives the best result. See:
http://www.nltk.org/howto/classify.html
http://scikit-learn.org/stable/supervised_learning.html