How to measure similarity between two sentences using extracted named entities? - named-entity-recognition

I'm developing an application that calculates the similarity between a query and a list of products, using not the full sentences but only the features (named entities) extracted from the query and the products. I already have a NER model fine-tuned on DistilBERT using the spacy-transformers library, so I have access to word embeddings for my sentences and for the extracted named entities.
I want to calculate the similarity between the query and the products in my database using the word embeddings of only their extracted entities. This way I can focus on just the features the user is looking for rather than the whole query. This is a theory I'm testing and I would like to see the results. My problem now is: how do I do that?
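One way to sketch this (with toy vectors standing in for the DistilBERT entity embeddings that spaCy exposes via `.vector` on each entity span; the words and products below are hypothetical): average the entity embeddings on each side and rank products by cosine similarity.

```python
from math import sqrt

def mean_vector(vectors):
    # Average several entity embeddings into one representative vector.
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy embeddings standing in for ent.vector from the spaCy pipeline.
query_entities = {"red": [1.0, 0.1, 0.0], "leather": [0.2, 1.0, 0.1]}
products = {
    "red leather bag": [[0.9, 0.2, 0.0], [0.3, 0.9, 0.1]],
    "blue cotton shirt": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}

q_vec = mean_vector(list(query_entities.values()))
ranked = sorted(
    products.items(),
    key=lambda kv: cosine(q_vec, mean_vector(kv[1])),
    reverse=True,
)
print(ranked[0][0])  # product whose entities best match the query's
```

Averaging is the simplest aggregation; a refinement worth testing is keeping entities separate and scoring each query entity against its best-matching product entity, so one irrelevant entity does not drag the mean down.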

Related

word2vec train customized dataset from excel sheet to detect our company synonyms

I want to use a Gensim Word2Vec model to detect customized synonyms for our company. My data are in an Excel sheet with two columns, "actual" and "equivalent". I want to train the model so that whenever it sees the actual word it predicts the equivalent.
My problem is that I don't know how to label, preprocess, or structure the data for detecting the equivalents based on the actual value. So it is literally always just two words.
I'm open to other machine learning models as well.
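One way to structure such two-column data for Word2Vec (a sketch; the CSV content and column names below are hypothetical) is to turn each (actual, equivalent) pair into a tiny two-word "sentence", so both words share a context window during training:

```python
import csv
import io

# Hypothetical export of the two-column Excel sheet as CSV text.
raw = "actual,equivalent\nlaptop,notebook\ncar,automobile\n"

# Each pair becomes a two-word "sentence": with a small context
# window, Word2Vec then places the two words near each other.
pairs = list(csv.DictReader(io.StringIO(raw)))
sentences = [[row["actual"], row["equivalent"]] for row in pairs]
print(sentences)
```

These sentences could then be passed to `gensim.models.Word2Vec(sentences, min_count=1, window=2)` and queried with `model.wv.most_similar("laptop")`. That said, with only isolated pairs and no surrounding text, a plain dictionary lookup may be the stronger baseline; Word2Vec mainly pays off if you also have running text to train on.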

How to deal with out-of-vocabulary words in NLP?

I have a multi-label dataset that contains a lot of out-of-vocabulary words. The dataset basically comes from a user forum site; the columns are post_title, post_description, and tags. I want to predict the tags using machine learning models, but because the dataset contains many out-of-vocabulary words, the models give me very poor results. What should I do in this case?
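A common remedy is subword modeling: fastText-style character n-grams let a model build a vector for a word it has never seen from the n-grams it shares with known words. A minimal pure-Python sketch of that idea (the vocabulary and the misspelled word are made up), mapping an out-of-vocabulary token to its closest in-vocabulary neighbour by n-gram overlap:

```python
def char_ngrams(word, n=3):
    # fastText-style boundary markers so prefixes/suffixes are distinct.
    w = f"<{word}>"
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def nearest_in_vocab(oov, vocab, n=3):
    # Fall back to the vocabulary word with the highest character
    # n-gram overlap (Jaccard similarity).
    grams = char_ngrams(oov, n)
    def jaccard(v):
        g = char_ngrams(v, n)
        return len(grams & g) / len(grams | g)
    return max(vocab, key=jaccard)

vocab = ["configuration", "running", "database"]
print(nearest_in_vocab("configuraton", vocab))  # typo maps to "configuration"
```

In practice you would use `gensim.models.FastText` (or pretrained fastText vectors), which composes an embedding from these n-grams directly instead of snapping to a single neighbour, so truly novel forum jargon still gets a usable vector.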

Printing regression results table for spatial regression models

I want to print the results of spatial regression models built with the spdep package into a nice table, but apparently conventional packages (e.g. stargazer, gtsummary, sjPlot, vtable) don't work well or don't function at all with these types of models. Is there any way to do this properly?

Determine document novelty/similarity with the aid of Latent Dirichlet allocation (LDA) or Named Entities

Given an index or database with a lot of (short) documents (~ 1 million), I am trying to do some kind of novelty detection for each newly incoming document.
I know that I have to compute the similarity of the new document with each document in the index. If the similarity is below a certain threshold, one can consider this document as novel. One common approach - which I want to do - is to use a Vector Space Model and compute the cosine similarity (e.g. by using Apache Lucene).
But this approach has two shortcomings: 1) it is computationally expensive, and 2) it does not capture the semantics of the documents and words.
In order to overcome these shortcomings, my idea was to either use an LDA topic distribution or named entities to augment the Lucene index and the query (i.e. the document collection and each new document) with semantics.
Now, I am completely lost regarding the concrete implementation. I have already trained an LDA topic model using Mallet and I am also able to do Named Entity Recognition on the corpus. But I do not know how to use these topics and named entities in order to realise novelty detection. More specifically, I do not know how to use these features for index and query creation.
For example, is it already sufficient to store all named entities of one document as a separate field in the index, add certain weights (i.e. boost them) and use a MultiFieldQuery? I do not think that this already adds some kind of semantics to the similarity detection. The same applies to LDA topics: is it sufficient to add the topic probability of each term as a Payload and implement a new similarity score?
I would be very happy if you could provide some hints or even code snippets on how to incorporate LDA topics or named entities in Lucene for some kind of novelty detection or semantic similarity measure.
Thank you in advance.
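On the LDA side, one concrete way to turn topic distributions into a novelty score (a sketch; the distributions and threshold below are toy stand-ins for Mallet's per-document topic output) is to compare the new document's topic distribution against the indexed ones with Jensen-Shannon divergence and flag the document as novel when even the closest neighbour is distant:

```python
from math import log2

def js_divergence(p, q):
    # Jensen-Shannon divergence between two topic distributions:
    # symmetric, bounded in [0, 1] when using log base 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)
    return (kl(p, m) + kl(q, m)) / 2

# Toy topic distributions standing in for Mallet's per-document output.
index = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
new_doc = [0.05, 0.05, 0.9]

THRESHOLD = 0.3  # hypothetical; tune on held-out data
min_div = min(js_divergence(new_doc, d) for d in index)
print(min_div > THRESHOLD)  # distant from every indexed doc -> novel
```

This sidesteps Lucene payloads entirely: since a topic distribution is a short dense vector, the comparison is cheap, and an approximate nearest-neighbour index over topic vectors would avoid scanning all ~1 million documents. Named entities could be handled analogously, e.g. as a Jaccard overlap between entity sets, combined with the topic score.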

Are Word2Vec and GloVe vectors suited for Entity Recognition?

I am working on Named Entity Recognition. I evaluated libraries such as MITIE, Stanford NER, and NLTK NER, which are built on conventional NLP techniques. I also looked at models such as word2vec and GloVe vectors for representing words in vector space. They are interesting since they provide information about a word's context, but for the specific task of NER I think they are not well suited: all these vector models build a vocabulary with a corresponding vector representation, and any word that is not in the vocabulary will not be recognised. It is highly likely that a named entity is missing from the vocabulary, since named entities are not bound by the language; they can be anything. So for a deep learning technique to be useful here, it would have to rely more on the structure of the sentence, using the standard English vocabulary and ignoring the named fields. Is there any such model or method available? Could a CNN or RNN be the answer?
I think you mean texts in a certain language where the named entities may contain different names (e.g. from other languages)?
The first thing that comes to my mind is a semi-supervised approach in which the model is updated periodically to reflect new vocabulary.
For example, you could train a word2vec model on the incoming data and compare the word vectors of possible NEs with those of existing NEs. Their cosine distance should be small.
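That comparison can be sketched as follows (toy two-dimensional vectors standing in for word2vec embeddings trained on the incoming data; the entity names and threshold are made up): a candidate token is treated as a likely entity when its best cosine similarity to any known NE exceeds a threshold.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

# Toy vectors standing in for word2vec embeddings of known entities.
known_entities = {"london": [0.9, 0.1], "paris": [0.8, 0.2]}
candidate = [0.85, 0.15]  # vector of a possible new NE, e.g. "berlin"

THRESHOLD = 0.9  # hypothetical; tune on labelled examples
best = max(cosine(candidate, v) for v in known_entities.values())
print(best > THRESHOLD)  # near the existing NEs -> likely an entity
```

With gensim this amounts to `model.wv.similarity(candidate, known)` or `model.wv.most_similar`, re-run after each periodic retraining so newly seen names drift into the neighbourhood of established ones.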
