Evaluating Semantic Search using Paraphrase Mining

I need to decide on a particular model (among various BERT models, DistilBERT, RoBERTa, etc., pretrained on different datasets) for semantic search. To make this decision, I need to evaluate these models on the semantic search task. Would paraphrase or duplicate mining be an adequate metric for evaluating this?
Using some fancier vocabulary - I want to determine the best model to search for semantic matches of a query within a corpus. Would evaluating the ability of a model to find paraphrases between two corpora be an accurate measure of this ability?
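For concreteness, here is roughly the evaluation I have in mind: a minimal sketch using the paraphrase_mining utility from sentence-transformers, where the model name, the toy sentences, and the gold pairs are just placeholders for my real data.

```python
from sentence_transformers import SentenceTransformer, util

# Toy pooled corpus; in practice, concatenate the two corpora.
sentences = [
    "How do I reset my password?",
    "What is the capital of France?",
    "I forgot my password, how can I change it?",
    "Paris is the capital city of France.",
]
# Known paraphrase pairs (indices into `sentences`), stored as sorted tuples.
gold_pairs = {(0, 2), (1, 3)}

model = SentenceTransformer("all-MiniLM-L6-v2")  # candidate model to evaluate
mined = util.paraphrase_mining(model, sentences, top_k=5)  # [[score, i, j], ...]

# Treat the top-|gold| mined pairs as predictions and measure precision
# against the known paraphrases.
k = len(gold_pairs)
predicted = {tuple(sorted((i, j))) for _, i, j in mined[:k]}
print(f"precision@{k}: {len(predicted & gold_pairs) / k:.2f}")
```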
Thank you.

Related

For large missingness, what are the advantages of imputation versus training on available subsets for random forest?

I want to train a random forest model on a dataset with large missingness. I am aware of the 'standard method', where we impute missing data in the training set, use the same imputation rules to impute the test set, then train a random forest model on the imputed training set and use the same model to predict on the test set (potentially doing it with multiple imputation).
What I want to understand is the difference to the following method which I would like to use:
Subset the dataset according to missing patterns. Train random forest models for each of the missing patterns. Use the random forest model trained on missing pattern A to predict data from the test set with missing pattern A. Use the model trained on pattern B to predict data from the test set with pattern B etc.
What is the name for this method? What are the statistical advantages or disadvantages of the two methods? I would very much appreciate if someone could direct me to some literature on the second method, or the comparison of the two.
The main difference between the two methods is in prediction capability.
If you train a separate model for each missing pattern, each model sees only a fraction of the data (because of the pattern-based split) and is used to predict only the corresponding portion of the test set. With this approach you can easily miss patterns that are common to your whole dataset, which you would detect if you used all the data.
How much this matters still depends on your particular case and your data. A good test of whether the pattern-specific models generalize well is to take data from another missing pattern, apply a simple and fast imputation to it (mean/mode/median, etc.), and check the difference in your metric.
In my opinion, this approach sounds a little extreme, as you are voluntarily cutting your training dataset into much smaller parts than it needs to be. It might perform better on large datasets, where the reduction in training data doesn't hurt model performance much.
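For reference, a minimal sketch of what the pattern-wise setup might look like, assuming pandas and scikit-learn; the toy data, the missingness injection, and the pattern encoding are all made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
df.loc[rng.random(500) < 0.3, "c"] = np.nan        # inject missingness in "c"
y = (df["a"] + df["b"] > 0).astype(int)            # toy target

# Key for each row's missing pattern, e.g. "FFT" means only "c" is missing.
pattern = df.isna().apply(lambda r: "".join("T" if v else "F" for v in r), axis=1)

# Train one model per pattern, using only the observed columns of that pattern.
models = {}
for pat, idx in df.groupby(pattern).groups.items():
    cols = [c for c, miss in zip(df.columns, pat) if miss == "F"]
    m = RandomForestClassifier(n_estimators=100, random_state=0)
    m.fit(df.loc[idx, cols], y.loc[idx])
    models[pat] = (m, cols)

# At prediction time, route each row to the model matching its pattern.
row, pat = df.iloc[[0]], pattern.iloc[0]
m, cols = models[pat]
print(m.predict(row[cols]))
```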
About the articles - I don't know of any articles that compare these two approaches, but I can suggest some good ones about various "standard" imputation approaches:
https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4
https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779

How to apply filter feature selection in datasets with various data types?

I have recently been looking into different filter feature selection approaches and have noted that some are better suited for numerical data (Pearson) and some are better suited for categorical data (Chi-Square).
I am working with a dataset with a mixture of both data types and am unsure about what the best practice is in terms of applying the filter methods.
Is it best to split the dataset into categorical and numerical, performing different filter methods on each set and then joining the results?
Or should only one filter method be applied to the whole dataset?
You can have a look at Permutation Importance. The idea is to randomly shuffle the values of a feature and observe the change in error. If the feature is important, ideally the error should increase. It does not depend on the data type of the feature, unlike some statistical tests. Also it is very straightforward to implement and analyze. link1, link2
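For example, scikit-learn ships a ready-made implementation; here is a small sketch on a toy mixed-type dataset (the column names and data are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy mixed-type data: one numeric and one categorical feature.
X = pd.DataFrame({
    "age":   [22, 35, 58, 44, 29, 61, 33, 50],
    "color": ["r", "g", "b", "r", "g", "b", "r", "g"],
})
y = [0, 1, 1, 1, 0, 1, 0, 1]

pre = ColumnTransformer([("cat", OneHotEncoder(), ["color"])],
                        remainder="passthrough")
model = make_pipeline(pre, RandomForestClassifier(random_state=0)).fit(X, y)

# Shuffle each original column in turn and measure the drop in score;
# this works uniformly for numeric and categorical columns.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in zip(X.columns, result.importances_mean):
    print(f"{name}: {imp:.3f}")
```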

Incorporating feedback to retrain WordToVec for finding document similarity

I have trained Gensim's WordToVec on a text corpus, converted it to DocToVec, and then used cosine similarity to find the similarity between documents. I need to suggest similar documents. Now suppose that among the top 5 suggestions for a particular document, we manually find that 3 of them are not similar. Can this feedback be incorporated in retraining the model?
It's not quite clear what you mean by "converted [a Word2Vec model] to DocToVec". The gensim Doc2Vec class doesn't use or require a Word2Vec model as input.
But, if you have many hand-curated "this is a good suggestion" or "this is a bad suggestion" pairs for your corpus, you can score models against all of them as a way to compare models: train many variant models (with different model parameter values like size, window, min_count, sample, etc.), and pick the one that scores best on your tests.
That sort of automated-parameter-search is the most straightforward way to use performance on real evaluation data to adjust an unsupervised model like Word2Vec.
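A minimal sketch of such a search with gensim's Doc2Vec (the toy corpus, the curated pairs, and the scoring rule are placeholders; note that recent gensim versions call the dimensionality parameter vector_size rather than size):

```python
from itertools import product
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; in practice use your real documents.
texts = [["cat", "sits", "on", "mat"],
         ["dog", "chases", "cat"],
         ["stock", "prices", "fell"],
         ["markets", "dropped", "sharply"]]
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]

# Hand-curated feedback: pairs that should / should not be similar.
good_pairs = [(2, 3)]
bad_pairs = [(0, 2)]

def score(model):
    # Reward similarity on good pairs, penalize it on bad pairs.
    sim = lambda a, b: model.dv.similarity(a, b)
    return (sum(sim(a, b) for a, b in good_pairs)
            - sum(sim(a, b) for a, b in bad_pairs))

best = None
for vector_size, window in product([16, 32], [2, 5]):
    m = Doc2Vec(corpus, vector_size=vector_size, window=window,
                min_count=1, epochs=50, seed=0, workers=1)
    s = score(m)
    if best is None or s > best[0]:
        best = (s, vector_size, window)
print("best (score, vector_size, window):", best)
```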
(Depending on the specifics of your data and problem-domain, you might also start to notice patterns in where the model is better or worse, that help you hand-tune parts of the data preprocessing. For example, a different handling of capitalization or tokenization might be suggested by error cases.)

what methods are there to classify documents?

I am trying to do document classification, but I am really confused about the difference between feature selection and tf-idf. Are they the same thing, or two different ways of doing classification?
Hope somebody can tell me? I am not really sure that my question will make sense to you guys.
Yes, you are confusing a lot of things.
Feature selection is the abstract term for choosing which features to keep (a binary keep/drop decision for each feature). Stopword removal can be seen as feature selection.
TF is one method of extracting features from text: counting words.
IDF is one method of assigning weights to features.
Neither of them is classification... they are popular for text classification, but they are even more popular for information retrieval, which is not classification...
However, many classifiers work on numeric data, so the common process is:
1. Extract features (e.g., TF).
2. Select features (e.g., remove stopwords).
3. Weight features (e.g., IDF).
4. Train a classifier on the resulting numerical vectors.
5. Predict the classes of new/unlabeled documents.
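In scikit-learn terms, that process might look like the following minimal sketch; the documents and labels are toy data, and stop-word removal stands in for the feature-selection step:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["the cat sat on the mat",
        "dogs are great pets",
        "stock markets fell today",
        "the economy is slowing down"]
labels = ["animals", "animals", "finance", "finance"]

# Steps 1-3: TF extraction, stop-word removal (feature selection),
# and IDF weighting are all handled by TfidfVectorizer.
# Step 4: train a classifier on the resulting numeric vectors.
clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression())
clf.fit(docs, labels)

# Step 5: predict the class of a new, unlabeled document.
print(clf.predict(["the dog sat on the couch"]))
```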
Taking a look at this explanation may help a lot when it comes to understanding text classifiers.
TF-IDF is a good way to find a document that answers a given query, but it does not necessarily assign classes to documents.
Examples that may be helpful:
1) You have a bunch of documents with subjects ranging from politics and economics to computer science and the arts. The documents belonging to each subject are stored in the appropriate directory for that subject (you have a labeled dataset). Now you receive a new document whose subject you do not know. In which directory should it be stored? A classifier can answer this question from the documents that are already labeled.
2) Now, you received a query regarding computer science. For instance, you received the query "Good methods for finding textual similarity". Which document in the directory of computer science can provide the best response to that query? TF-IDF would be a good approach to figure that out.
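A small sketch of that retrieval use of TF-IDF, with toy documents and cosine similarity ranking documents against the query:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["introduction to convolutional neural networks",
        "good methods for measuring textual similarity",
        "a survey of sorting algorithms"]
query = "Good methods for finding textual similarity"

vec = TfidfVectorizer()
doc_vecs = vec.fit_transform(docs)
query_vec = vec.transform([query])

# Rank documents by cosine similarity to the query; highest wins.
scores = cosine_similarity(query_vec, doc_vecs).ravel()
print(docs[scores.argmax()], scores.max())
```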
So, when you are classifying documents, you are trying to make a decision about whether a document is a member of a particular class (like, say, 'about birds' or 'not about birds').
Classifiers predict the value of the class given a set of features. A good set of features will be highly discriminative - they will tell you a lot about whether the document is of one class or another.
Tf-idf (term frequency inverse document frequency) is a particular feature that seems to be discriminative for document classification tasks. There are others, like word counts (tf or term frequency) or whether a regexp matches the text or what have you.
Feature selection is the task of selecting good (discriminative) features. Tfidf is probably a good feature to select.
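For instance, a chi-squared filter can rank tf-idf features by how discriminative they are with respect to the labels; a minimal sketch with toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["birds have feathers and wings",
        "sparrows and robins are birds",
        "cars have engines and wheels",
        "trucks and cars share roads"]
labels = [1, 1, 0, 0]  # 1 = about birds, 0 = not about birds

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# Keep the 4 features whose chi-squared statistic against the labels
# is highest, i.e. the most discriminative terms.
selector = SelectKBest(chi2, k=4).fit(X, labels)
mask = selector.get_support()
print([t for t, keep in zip(vec.get_feature_names_out(), mask) if keep])
```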

Scalable Classifier For Finding Missing Attributes

I have a large sparse matrix representing attributes for millions of entities. For example, one record, representing an entity, might have attributes "has(fur)", "has(tail)", "makesSound(meow)", and "is(cat)".
However, this data is incomplete. For example, another entity might have all the attributes of a typical "is(cat)" entity, but it might be missing the "is(cat)" attribute. In this case, I want to determine the probability that this entity should have the "is(cat)" attribute.
So the problem I'm trying to solve is determining which missing attributes each entity should contain. Given an arbitrary record, I want to find the top N most likely attributes that are missing but should be included. I'm not sure what the formal name is for this type of problem, so I'm unsure what to search for when researching current solutions. Is there a scalable solution for this type of problem?
My first idea is to simply calculate the conditional probability of each missing attribute (e.g. P(is(cat) | has(fur) and has(tail) and ...)), but that seems like a very slow approach. Plus, as I understand the traditional calculation of conditional probability, I imagine I'd run into problems where my entity contains a few unusual attributes that aren't common to other is(cat) entities, causing the conditional probability to be zero.
My second idea is to train a Maximum Entropy classifier for each attribute and then evaluate it based on the entity's current attributes. I think the probability calculation would be much more flexible, but this would still have scalability problems, since I'd have to train separate classifiers for potentially millions of attributes. In addition, if I wanted to find the top N most likely attributes to include, I'd still have to evaluate all the classifiers, which would likely take forever.
Are there better solutions?
This sounds like a typical recommendation problem. Think of each attribute as a 'movie rating' and each row as a 'person'. For each person, you want to find the movies that they would probably like but haven't rated yet.
You should look at some of the more successful approaches to the Netflix Challenge. The dataset is pretty large, so efficiency is a high priority. A good place to start might be the paper 'Matrix Factorization Techniques for Recommender Systems'.
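A minimal sketch of the matrix-factorization idea using a truncated SVD (a toy binary entity-attribute matrix; the reconstruction scores stand in for attribute probabilities, which is a simplification of what full recommender methods do):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Rows = entities, columns = attributes (1 = entity has the attribute).
attrs = ["has(fur)", "has(tail)", "makesSound(meow)", "is(cat)"]
X = csr_matrix(np.array([
    [1, 1, 1, 1],   # a labeled cat
    [1, 1, 1, 0],   # looks like a cat, but is(cat) is missing
    [0, 1, 0, 0],
]))

# Low-rank factorization; the reconstruction fills in plausible values.
svd = TruncatedSVD(n_components=2, random_state=0)
recon = svd.inverse_transform(svd.fit_transform(X))

# For entity 1, rank absent attributes by their reconstruction score.
entity = 1
missing = [(attrs[j], recon[entity, j])
           for j in range(len(attrs)) if X[entity, j] == 0]
print(sorted(missing, key=lambda t: -t[1]))  # top-N candidates
```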
If you have a large dataset and you're worried about scalability, then I would look into Apache Mahout. Mahout is a machine learning and data mining library that might help you with your project; in particular, it has some of the most well-known algorithms already built in:
Collaborative Filtering
User and Item based recommenders
K-Means, Fuzzy K-Means clustering
Mean Shift clustering
Dirichlet process clustering
Latent Dirichlet Allocation
Singular value decomposition
Parallel Frequent Pattern mining
Complementary Naive Bayes classifier
Random forest decision tree based classifier
High-performance Java collections (previously Colt collections)
