How to deal with out of vocabulary words in nlp? - machine-learning

I have a multi-label dataset that contains a lot of out-of-vocabulary words. The dataset is basically from a user forum site. The columns are post_title, post_description and tags. I want to predict the tags using machine learning models. But as the dataset contains many out-of-vocabulary words, the models are giving me very poor results. So what should I do in this case?

Related

How to measure similarity between two sentences using extracted named entities?

I'm developing an application that calculate the similarity between a query and a list of products but not using the full sentences but only the features ( named entities ) extracted from the query and the products, i already have a trained ner model fine tuned on distelbert model using spacy transformers library, so i have access to word embeddings for my sentences and the extracted named entities.
So i want to calculate the similarity between the query and the produts i have in my database using the word embeddings of only their extracted entities, this way i can focus only on the features that the user is looking for and not the whole query, this is just a theory i'm testing and would like to see the results, so now my problem is how to do that ?

Incorporating feedback to retrain WordToVec for finding document similarity

I have trained Gensim's WordToVec on a text corpus,converted it to DocToVec and then used cosine similarity to find the similarity between documents. I need to suggest similar documents. Now suppose among the top 5 suggestions for a particular document, we manually find that 3 of them are not similar.Can this feedback be incorporated in retraining the model?
It's not quite clear what you mean by "converted [a Word2Vec model] to DocToVec". The gensim Doc2Vec class doesn't use or require a Word2Vec model as input.
But, if you have many sets of hand-curated "this is a good suggestion" or "this is a bad suggestion" pairs for your corpus, you can use the model's scoring against all those to compare models, and train many variant models (with different model parameter values like size, window, min_count, sample, etc), picking the one that scores best on your tests.
That sort of automated-parameter-search is the most straightforward way to use performance on real evaluation data to adjust an unsupervised model like Word2Vec.
(Depending on the specifics of your data and problem-domain, you might also start to notice patterns in where the model is better or worse, that help you hand-tune parts of the data preprocessing. For example, a different handling of capitalization or tokenization might be suggested by error cases.)

Can a list of websites be considered a corpus for a particular category?

I am trying to build my own corpus for particular categories such as Engineering, Business, Math, Science and etc... This will be for automatic web page categorization. Let's say I manually collect 100 websites that are related to Math. Can these 100 websites be considered a corpus for Math?
Another related question. How does this differentiate from a lexicon wherein instead of a list of websites it shows a list of words with weights such as 0 or 1 to particular categories? Example would be a sentiment lexicon with words that has weights for positive and negative. But instead of positive and negative, categories such as Math, Science are used.
You say you want to make some web page categorization, then the problem you're facing is a supervised learning problem. The data you get are web pages, so I guess you actually extract their content as text. You work with textual input data. Since you want to categorize them, each of your input data has one or more corresponding labels, which are the outputs you want to predict. You have multiple label so you want to do multi-label classification
To tackle this problem, since most machine learning algorithms work with numerical vector, you need to transform your corpus of texts into vectors (or into one matrix). To do so, you can use the bag of word technique which first build a dictionary or lexicon and then count the occurrences of each word of the dictionary in each text. Actually, you can transform your output label in the same way, attributing an index of you output vector for each category.
The final pipeline would be something like this:
[input_text] --bag_of_word--> [input_vector] --prediction--> [output_vector] --label_matchnig--> [labels]

How can Topic Modeling noise be removed?

I am working on Topic Modeling where the given text corpus have lots of noise in form of supporting words after removal of stop words. These words have high term frequency but does not help in forming topic terms by using LDA along with other words with high frequency that are useful . How can this noise be removed?
LDA algorithms don't take tf-idf weights in input, but bag of words, however you could first filter words from your corpus based on their tf-idf score, and then feed the new texts to your LDA program.
Basic thing is that you do a TF-IDF and clean on scores, if that still doesnt help then you can create domain specific custom stopwords list. Suppose if I'm in a jobs domain, the word "job" is not a regular stopword but in jobs domain it is or the company name is a stopword since it repeats across many documents. So, building custom stopwords list is another way to go with.

Good training data for text classification by LDA?

I'm classifying content based on LDA into generic topics such as Music, Technology, Arts, Science
This is the process i'm using,
9 topics -> Music, Technology, Arts, Science etc etc.
9 documents -> Music.txt, Technology.txt, Arts.txt, Science.txt etc etc.
I've filled in each document(.txt file) with about 10,000 lines of content of what i think is "pure" categorical content
I then classify a test document, to see how well the classifier is trained
My Question is,
a.) Is this an efficient way to classify text (using the above steps)?
b.) Where should i be looking for "pure" topical content to fill each of these files? Sources which are not too large (text data > 1GB)
classification is only on "generic" topics such as the above
a) The method you describe sounds fine, but everything will depend on the implementation of labeled LDA that you're using. One of the best implementations I know is the Stanford Topic Modeling Toolbox. It is not actively developed anymore, but it worked great when I used it.
b) You can look for topical content on DBPedia, which has a structured ontology of topics/entities, and links to Wikipedia articles on those topics/entities.
I suggest you to use bag-of-words (bow) for each class you are using. Or vectors where each column is the frequency of important keywords related to the class you want to target.
Regarding the dictionaries you have DBPedia as yves referred or WordNet.
a.)The simplest solution is surely the k-nearest neighbors algorithm (knn). In fact, it will classify new texts with categorical content using an overlap metric.
You could find ressources here: https://github.com/search?utf8=✓&q=knn+text&type=Repositories&ref=searchresults
Dataset issue:
If you are dealing with classifying live user feeds, then I guess no single dataset will suffice your requirement.
Because if new movie X released, it might not catch by your classification dataset as the training dataset is obsoleted for it now.
For classification I guess to stay updated with latest datasets, use twitter training datasets. Develop dynamic algorithm which update the classifier with latest updated tweet datasets. You could select top 15-20 hash tag for each category of your choice to get most relevant dataset for each category.
Classifier:
Most of the classifier uses bag of words model, you can try out various classifiers and see which gives best result. see :
http://www.nltk.org/howto/classify.html
http://scikit-learn.org/stable/supervised_learning.html

Resources