How many classes can a CNN classify in short text? - machine-learning

I know that a CNN (convolutional neural network) can classify more than ten thousand classes of ImageNet images.
But I find that CNNs are used to classify only 10-20 text classes, as the paper I am reading reports.
How many classes can a CNN classify for short text? What is the upper limit on the number of classes?

The number of categories a classifier can handle with good precision/recall is determined by (but not limited to):
how distinct each category is;
how many features you can derive from the content (short text definitely carries much less information here than images) -- since you are using a CNN for text, I assume the features are merely characters or words;
how well those features differentiate between categories;
how many high-quality labeled examples you have (there is no large public multi-category labeled dataset for short text).
It's hard to give you a number without knowing the answers to the questions above.

Related

Creating word embeddings from BERT and feeding them to a random forest for classification

I have used the pretrained BERT base model, with 512 dimensions, to generate contextual features. Feeding those vectors to a random forest classifier gives 83 percent accuracy, but in various papers I have seen that BERT gives at least 90 percent.
I have some other features too, such as word2vec, lexicon, TF-IDF, and punctuation features.
Even when I merged all the features I got 83 percent accuracy. The paper I am using as a baseline reports an accuracy of 92 percent, but they used an ensemble-based approach in which they classified through BERT and trained a random forest on the weights.
I wanted to try something innovative, though, so I didn't follow that approach.
My dataset is biased toward positive reviews, so my reading is that the accuracy suffers because the model is also biased toward positive labels, but I am still looking for expert advice.
Code implementation of BERT:
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/Bert_Features.ipynb
Random forest on all features independently:
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/RandomForestClassifier.ipynb
Random forest on all features jointly:
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/Merging_Feature.ipynb
Regarding the "no improvements despite adding more features" - some researchers believe that the BERT embeddings already contain all the information available in the text, so it doesn't matter how fancy a classification head you add to them: whether it is a linear model on the embeddings or a complicated ML algorithm with a number of other features, it will not provide significant improvements in many tasks. They argue that, since BERT is a context-aware, bidirectional language model trained extensively on the MLM and NSP tasks, it already grasps most of what additional punctuation, word2vec, and TF-IDF features could convey. The lexicon could help a little in a sentiment task, if it is relevant, but the one or two extra variables you likely use to represent it probably get drowned out by all the other features.
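As a quick check of that claim, one can feed the same precomputed BERT sentence vectors to both a plain linear model and a random forest and compare cross-validated scores; if the embeddings really carry all the signal, the two usually land close together. A minimal sketch (the feature/label file names are placeholders, not taken from the notebooks above):
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# X: (n_samples, n_dims) array of precomputed BERT sentence vectors, y: labels
X = np.load("bert_features.npy")  # placeholder path
y = np.load("labels.npy")         # placeholder path

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=300)):
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(type(clf).__name__, round(scores.mean(), 3))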
Other than that, the accuracy of BERT-based models depends on the dataset used; sometimes the data is simply too diverse to obtain a perfect score, e.g. if there are instances that are very similar but carry different class labels. You can see in the BERT papers that accuracy varies widely by task: in some tasks it is indeed 90+%, but for others, e.g. masked language modeling, where the model must choose a particular word from a vocabulary of over 30K words, an accuracy of 20% could be impressive. So to obtain a reliable comparison with the BERT papers, you'd need to pick a dataset they used and compare on that.
Regarding the dataset balance: for deep learning models in general, the rule of thumb is that the training set should be more or less balanced with respect to the fraction of data covered by each class label. So with 2 labels the split should be roughly 50-50; with 5 labels, each should cover around 20% of the training dataset, and so on.
That is because most NNs work in batches, updating the model weights based on the feedback from each batch. If you have too many examples of one class, the batch updates will be dominated by that class, effectively degrading the quality of your training.
So if you want to improve the accuracy of your model, balancing the dataset could be an easy fix. And if you have, say, 5 ordered classes of differing sizes, you might consider merging some of them (e.g. reviews of 1-2 stars as bad, 3 as neutral, 4-5 as good) and then rebalancing, if still necessary.
(Unless it's a situation where, say, 1 class holds 80% of the data and 4 classes share the remaining 20%. In that case you should probably consider more advanced options, such as splitting the model into two parts: one predicting whether or not an instance is in class 1 (a binary classifier), the other distinguishing between the 4 underrepresented classes.)
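To make the balancing advice concrete, here is a minimal sketch of downsampling every class to the size of the smallest one with pandas (the file path and the 'label' column name are placeholders for your own data):
import pandas as pd

df = pd.read_csv("reviews.csv")  # placeholder path; expects a 'label' column

# Downsample each class to the size of the rarest class
min_size = df["label"].value_counts().min()
balanced = (
    df.groupby("label", group_keys=False)
      .apply(lambda g: g.sample(min_size, random_state=42))
)
print(balanced["label"].value_counts())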

Word Embedding Model

I have been searching for and attempting to implement a word embedding model to predict similarity between words. I have a dataset made up of 3,550 company names; the idea is that a user can provide a new word (which would not be in the vocabulary) and calculate the similarity between the new name and the existing ones.
During preprocessing I removed stop words and punctuation (hyphens, dots, commas, etc.). In addition, I applied stemming and separated prefixes in the hope of getting more precision. Words such as BIOCHEMICAL thus ended up as BIO CHEMIC, i.e. the word divided in two (prefix and stemmed word).
The average company name is 3 words long. (A plot of the name-length frequencies accompanied the original post.)
The tokens that result from preprocessing are sent to word2vec:
from gensim.models import Word2Vec

# window: maximum distance between the current and predicted word within a sentence
# min_count: ignores all words with total frequency lower than this
# workers: number of worker threads used to train the model
# sg: the training algorithm, either CBOW (0) or skip-gram (1); the default is 0
word2vec_model = Word2Vec(prepWords, size=300, window=2, min_count=1, workers=7, sg=1)
After the model has included all the words in the vocabulary, the average sentence vector is calculated for each company name:
df['avg_vector'] = df2.apply(lambda row: avg_sentence_vector(row, model=word2vec_model, num_features=300, index2word_set=set(word2vec_model.wv.index2word)).tolist())
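The avg_sentence_vector helper is not shown in the question; a typical implementation (a sketch only, matching the gensim 3.x index2word naming used above) looks like this:
import numpy as np

def avg_sentence_vector(words, model, num_features, index2word_set):
    # Average the vectors of all in-vocabulary words; zero vector if none match
    feature_vec = np.zeros(num_features, dtype="float32")
    n_words = 0
    for word in words:
        if word in index2word_set:
            feature_vec += model.wv[word]
            n_words += 1
    if n_words > 0:
        feature_vec /= n_words
    return feature_vec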
Then, the vector is saved for further lookups:
## Save name and vector values to a file
df.to_csv('name-submission-vectors.csv',encoding='utf-8', index=False)
If a new company name is not in the vocabulary after preprocessing (removing stop words and punctuation), I rebuild the model, recalculate the average sentence vectors, and save them again.
I have found that this model is not working as expected. As an example, asking for the words most similar to pet gives the following results:
ms=word2vec_model.most_similar('pet')
('fastfood', 0.20879755914211273)
('hammer', 0.20450574159622192)
('allur', 0.20118337869644165)
('wright', 0.20001833140850067)
('daili', 0.1990675926208496)
('mgt', 0.1908089816570282)
('mcintosh', 0.18571510910987854)
('autopart', 0.1729743778705597)
('metamorphosi', 0.16965581476688385)
('doak', 0.16890916228294373)
The dataset contains words such as paws and petcare, yet quite unrelated words end up closest to pet.
(A plot of the distribution of the words nearest to pet followed here.)
On the other hand, when I used GoogleNews-vectors-negative300.bin.gz, I could not add new words to the vocabulary, but the similarity between pet and the surrounding words was as expected:
ms=word2vec_model.most_similar('pet')
('pets', 0.771199643611908)
('Pet', 0.723974347114563)
('dog', 0.7164785265922546)
('puppy', 0.6972636580467224)
('cat', 0.6891531348228455)
('cats', 0.6719794869422913)
('pooch', 0.6579219102859497)
('Pets', 0.636363685131073)
('animal', 0.6338439583778381)
('dogs', 0.6224827170372009)
(A plot of the distribution of the nearest words followed here.)
I would like your advice on the following:
Is this dataset appropriate for this kind of model?
Is the dataset large enough for word2vec to "learn" the relationships between the words?
What can I do to improve the model so that word2vec creates relationships of the same kind as GoogleNews, where for instance the word pet is correctly placed among similar words?
Is it feasible to implement an alternative such as fastText, given the nature of the current dataset?
Do you know of any public dataset that could be used alongside the current dataset to create those relationships?
Thanks
3,550 texts (company names) of just ~3 words each is only around 10k total training words, with a much smaller vocabulary of unique words.
That's very, very small for word2vec & related algorithms, which rely on lots of data, and sufficiently-varied data, to train-up useful vector arrangements.
You may be able to squeeze some meaningful training from limited data by using far more training epochs than the default epochs=5, and far smaller vectors than the default size=100. With those sorts of adjustments, you may start to see more meaningful most_similar() results.
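Concretely, keeping the gensim 3.x parameter names from the question, that adjustment might look like the following (the exact values are things to experiment with, not recommendations):
# Far smaller vectors than the default size=100, far more passes than the default iter=5
word2vec_model = Word2Vec(prepWords, size=32, window=2, min_count=1, workers=7, sg=1, iter=100)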
But, it's unclear that word2vec, and specifically word2vec in your averaging-of-a-name's-words comparisons, is matched to your end goals.
Word2vec needs lots of data, doesn't look at subword units, and can't say anything about word tokens not seen during training. An average of many word vectors can often work as an easy baseline for comparing multiword texts, but it may also dilute the influence of individual words compared to other methods.
Things to consider might include:
Word2vec-related algorithms like FastText that also learn vectors for subword units, and can thus bootstrap not-so-bad guess vectors for words not seen in training. (But, these are also data hungry, and to use on a small dataset you'd again want to reduce vector size, increase epochs, and additionally shrink the number of buckets used for subword learning.)
More sophisticated comparisons of multi-word texts, like "Word Mover's Distance". (That can be quite expensive on longer texts, but for names/titles of just a few words may be practical.)
Finding more data that's compatible with your aims for a stronger model. A larger database of company names might help. If you just want your analysis to understand English words/roots, more generic training texts might work too.
For many purposes, a mere lexicographic comparison - edit distances, counts of shared character n-grams - may be helpful too, though it won't detect all synonyms/semantically similar words. (A sketch follows below.)
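A dependency-free sketch of such a lexicographic baseline, using difflib from the Python standard library (its ratio is related to, though not the same as, edit distance):
from difflib import SequenceMatcher

def name_similarity(a, b):
    # Fraction of matching characters between the two strings, 0.0-1.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(name_similarity("biochemical", "biochemistry"))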
Word2vec does not generalize to unseen words.
It does not even work well for words that are seen but rare. It really depends on having many, many examples of word usage. Furthermore, you need enough context to the left and right, but you only have company names - these are too short. That is likely why your embeddings perform so poorly: too little data and too-short texts.
Hence, it is the wrong approach for you. Retraining the model with the new company name is not enough - you still have only a single data point for it. You may as well leave out unseen words; word2vec cannot do better than that, even if you retrain.
If you only want to compute similarity between words, you probably don't need to insert new words into your vocabulary.
At first glance, I think you could also use fastText without needing to stem the words. It computes vectors for unknown words, too.
From the fastText FAQ:
"One of the key features of fastText word representation is its ability to produce vectors for any words, even made-up ones. Indeed, fastText word vectors are built from vectors of substrings of characters contained in it. This allows to build vectors even for misspelled words or concatenation of words."
fastText seems to be useful for your purpose.
For your task, you can follow the fastText supervised tutorial.
If your corpus proves to be too small, you can build your model starting from available pretrained vectors (the pretrainedVectors parameter).
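For the unsupervised similarity use case discussed in this thread, here is a minimal gensim sketch of fastText, with the small-data adjustments suggested in the first answer (all parameter values are illustrative, and gensim 3.x parameter names are assumed):
from gensim.models import FastText

# min_n/max_n set the character n-gram range; a small bucket count suits a small corpus
ft_model = FastText(prepWords, size=32, window=2, min_count=1, sg=1, iter=100,
                    min_n=3, max_n=5, bucket=100000)

# Works even for words never seen in training, via their character n-grams
print(ft_model.wv.most_similar("petcare"))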

How can I group text questions that are similar to each other?

I have a dataset of 200k questions, and I would like to group them by similarity and find duplicates.
How can I use NLP/machine learning to group these questions with similar intents together?
Given a question and a list of questions, how can I find the question or questions that are similar or duplicates?
Are there any services that can do this?
Generally, you'd want to convert the questions into an abstract numerical format (such as a single high-dimensional vector, or a 'bag of words/vectors'), from which it is then possible to calculate numerical pairwise similarities between questions.
For example: you could turn each question into a simple average of the word-vectors for its individual words. (Those word-vectors might come from your own training corpus, that matches the questions' usage domain exactly, or from some other outside source that's good enough.)
If the word-vectors are 300-dimensional, averaging all the word-vectors of a question together then gives you a 300-dimensional vector for the question. You can then use a typical measure of vector similarity, such as "cosine similarity", to get a number from -1.0 to 1.0 for each pair of questions, with larger values indicating "more similar".
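A sketch of that baseline (assuming a trained gensim word2vec model and simple whitespace tokenization; all names here are illustrative):
import numpy as np

def question_vector(tokens, model):
    # Average the vectors of the in-vocabulary words
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

q1 = question_vector("how do i reset my password".split(), word2vec_model)
q2 = question_vector("i forgot my password".split(), word2vec_model)
print(cosine_similarity(q1, q2))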
Such a simple approach is often a good baseline. Being smarter about dropping some words, or weighting words by their observed significance (e.g. by "TF-IDF" weighting), may improve it.
But there are other ways to get summary vectors that may work better than a simple average. One relatively straightforward algorithm, largely similar to the way word-vectors are created, is called "Paragraph Vectors", and is sometimes called in popular libraries (like Python gensim) "Doc2Vec". It's not quite a simple average of word-vectors, as much as creating a synthetic word-like token for a full text, which then is trained to be as good as possible at predicting the text's words. Again, once you have a (for example) 300-dimensional text-vector, calculating cosine-similarity can rank question similarities.
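In gensim that looks roughly like this (a sketch: questions is a placeholder list of raw question strings, and the hyperparameters would need tuning on a 200k-question corpus):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=q.lower().split(), tags=[i]) for i, q in enumerate(questions)]
d2v = Doc2Vec(docs, vector_size=300, epochs=20, min_count=2, workers=4)

# Training questions most similar to question 0 (use d2v.dv in gensim 4+)
print(d2v.docvecs.most_similar(0))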
There's also an interesting algorithm called "Word Mover's Distance", which leaves the texts as variable-sized bags of each constituent word-vector, as if each word-vector was a pile-of-meaning. It then calculates the "effort" to move the piles from one text's shape-of-piles, to another text's – and less effort seems to correlate well with humans' sense of text similarity. (However, finding these minimal-shifts is a lot more computationally expensive than simple cosine-similarity – so this works best with short texts, or small corpuses, or when you can massively parallelize the computation.)
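gensim also exposes this measure on a trained word2vec model; it needs an extra solver package installed (pyemd for older gensim, POT for newer). A sketch:
# Lower distance = more similar; out-of-vocabulary words are ignored
d = word2vec_model.wv.wmdistance("how do i reset my password".split(),
                                 "i forgot my password".split())
print(d)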
Once you have any of these numeric similarity measures working, you can also apply clustering algorithms to find groups of highly related questions - and once you have those groups, the words most common in a group (as opposed to the others), or human editorial work, can name the groups.
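For the clustering step, once every question has a vector, scikit-learn's KMeans gives a first pass at groups; normalizing the rows first makes its Euclidean geometry approximate cosine similarity. A sketch (the cluster count is a guess you would refine):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# X: one row per question, built with any of the vectorizing methods above
X = normalize(np.vstack([question_vector(q.split(), word2vec_model) for q in questions]))

kmeans = KMeans(n_clusters=50, random_state=0).fit(X)
print(kmeans.labels_[:10])  # cluster id for the first ten questions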

Good training data for text classification by LDA?

I'm classifying content based on LDA into generic topics such as Music, Technology, Arts, and Science.
This is the process I'm using:
9 topics -> Music, Technology, Arts, Science, etc.
9 documents -> Music.txt, Technology.txt, Arts.txt, Science.txt, etc.
I've filled each document (.txt file) with about 10,000 lines of what I think is "pure" categorical content.
I then classify a test document to see how well the classifier is trained.
My questions are:
a.) Is this an efficient way to classify text (using the above steps)?
b.) Where should I look for "pure" topical content to fill each of these files? I need sources which are not too large (no more than about 1 GB of text data).
Classification is only over "generic" topics such as the above.
a) The method you describe sounds fine, but everything will depend on the implementation of labeled LDA that you're using. One of the best implementations I know is the Stanford Topic Modeling Toolbox. It is not actively developed anymore, but it worked great when I used it.
b) You can look for topical content on DBPedia, which has a structured ontology of topics/entities, and links to Wikipedia articles on those topics/entities.
I suggest you use a bag-of-words (BoW) representation for each class, or vectors where each column is the frequency of important keywords related to the class you want to target.
Regarding dictionaries, you have DBpedia, as yves mentioned, or WordNet.
a.) The simplest solution is surely the k-nearest neighbors algorithm (kNN). In fact, it will classify new texts with categorical content using an overlap metric.
You can find resources here: https://github.com/search?utf8=✓&q=knn+text&type=Repositories&ref=searchresults
Dataset issue:
If you are classifying live user feeds, then I suspect no single dataset will meet your requirement: if a new movie X is released, your classifier might not catch it, because the training dataset has become obsolete.
To stay up to date with the latest data, use Twitter training datasets. Develop a dynamic process that updates the classifier with the latest tweets; you could select the top 15-20 hashtags for each category of your choice to get the most relevant dataset for each category.
Classifier:
Most classifiers use a bag-of-words model; you can try out various classifiers and see which gives the best result. See:
http://www.nltk.org/howto/classify.html
http://scikit-learn.org/stable/supervised_learning.html
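As a concrete starting point for trying out various classifiers, a bag-of-words pipeline in scikit-learn takes only a few lines (the toy texts and labels below are placeholders):
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["guitar chords and melody", "new smartphone benchmark results"]
labels = ["Music", "Technology"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["latest gpu benchmark"]))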

Machine learning: how to use the Facebook interests of a user to make a decision

I'm trying to figure out a way to represent a Facebook user as a vector. I decided to stack the user's different attributes/parameters into one big vector (e.g. age is a vector of size 100, where 100 is the maximum age; if you are, let's say, 50, the first 50 values of the vector would be 1, like a thermometer encoding). I just can't figure out how to represent the Facebook interests as a vector too: they are a collection of words, and the space of all possible words is huge, so I can't go with something like a plain bag-of-words model. Does anyone know how I should proceed? I'm still new to this; any reference would be highly appreciated.
If you feel like downvoting this question, please let me know what is wrong with it so that I can improve the wording and context.
Thanks
The "right" approach depends on what your learning algorithm is and what the decision problem is.
It would often be better, though, to represent age as a single numeric feature rather than 100 indicator features. That way learning algorithms don't have to learn the relationship between those hundred features (it's baked in), and the problem has 99 fewer dimensions, which will make everything easier.
To model the interests, you might want to start with an extremely high-dimensional bag-of-words model and then use one of various options to reduce the dimensionality (a sketch of the topic-model route follows this list):
a general dimensionality-reduction technique like PCA, or smarter nonlinear ones such as kernel PCA; see Wikipedia's overviews of dimensionality reduction and of nonlinear techniques specifically
pass it through a topic model and use the learned topic weights as your features; examples include LSA, LDA, HDP and many more
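A sketch of the topic-model option with scikit-learn, treating each user's interests as one short document (the interest strings and the component count are purely illustrative):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# One string of interests per user (hypothetical data)
interests = ["football cooking travel",
             "machine learning chess",
             "travel photography cooking"]

bow = CountVectorizer().fit_transform(interests)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_weights = lda.fit_transform(bow)  # one low-dimensional feature row per user
print(topic_weights.shape)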
