Adding vocabulary and improve word embedding with another model that was built on bigger corpus - machine-learning

I'm new to NLP. I'm currently building a NLP system in a specific domain. After training a word2vec and fasttext model on my documents, I found that the embedding is not really good because I didn't feed enough number of documents (e.g. the embedding can't see that "bar" and "pub" is strongly correlated to each other because "pub" only appears a few in the documents). Later, I found a word2vec model online built on that domain-specific corpus which definitely has a way better embedding (so "pub" is more related to "bar"). Is there any way to improve my word embedding using the model I found? Thanks!

Word2Vec (and similar) models really require a large volume of varied data to create strong vectors.
But also, a model's vectors are typically only meaningful alongside other vectors that were trained together in the same session. This is both because the process includes some randomness, and the vectors only acquire their useful positions via a tug-of-war with all other vectors and aspects of the model-in-training.
So, there's no standard location for a word like 'bar' - just a good position, within a certain model, given the training data and model parameters and other words co-populating the model.
This means mixing vectors from different models is non-trivial. There are ways to learn a 'translation' that moves vectors from the space of one model to another – but that is itself a lot like a re-training. You can pre-initialize a model with vectors from elsewhere... but as soon as training starts, all the words in your training corpus will start drifting into the best alignment for that data, and gradually away from their original positions, and away from pure comparability with other words that aren't being updated.
In my opinion, the best approach is usually to expand your corpus with more appropriate data, so that it has "enough" examples of every word important to you, in sufficiently varied contexts.
Many people use large free text dumps like Wikipedia articles for word-vector training, but be aware that its style of writing – dry, authoritative reference texts – may not be optimal for all domains. If your problem-area is "business reviews", you'd probably do best finding other review texts. If it's fiction stories, more fictional writing. And so forth. You can shuffle these other text-soruces in with your data to expand the vocabulary coverage.
You can also potentially shuffle in extra repeated examples of your own local data, if you want it to effectively have relatively more influence. (Generally, merely repeating a small number of non-varied examples can't help improve word-vectors: it's the subtle contrasts of different examples that helps. But as a way to incrementally boost the influence of some examples, when there are plenty of examples overall, it can make more sense.)

Related

Is it a good idea to use word2vec for encoding of categorical features?

I am facing a binary prediction task and have a set of features of which all are categorical. A key challenge is therefore to encode those categorical features to numbers and I was looking for smart ways to do so.
I stumbled over word2vec, which is mostly used for NLP, but I was wondering whether I could use it to encode my variables, i.e. simply take the weights of the neural net as the encoded features.
However, I am not sure, whether it is a good idea since, the context words, which serve as the input features in word2vec are in my case more or less random, in contrast to real sentences which word2vec was originially made for.
Do you guys have any advice, thoughts, recommendations on this?
You should look into entity embedding if you are searching for a way to utilize embeddings for categorical variables.
google has a good crash course on the topic: https://developers.google.com/machine-learning/crash-course/embeddings/categorical-input-data
this is a good paper on arxiv written by a team from a Kaggle competition: https://arxiv.org/abs/1604.06737
It's certainly possible to use the word2vec algorithm to train up 'dense embeddings' for things like keywords, tags, categories, and so forth. It's been done, sometimes beneficially.
Whether it's a good idea in your case will depend on your data & goals – the only way to know for sure is to try it, and evaluate the results versus your alternatives. (For example, if the number of categories is modest from a controlled vocabulary, one-hot encoding of the categories may be practical, and depending on the kind of binary classifier you use downstream, the classifier may itself be able to learn the same sorts of subtle interrelationships between categories that could also otherwise be learned via a word2vec model. On the other hand, if categories are very numerous & chaotic, the pre-step of 'compressing' them into a smaller-dimensional space, where similar categories have similar representational vectors, may be more helpful.)
That such tokens don't quite have the same frequency distributions & surrounding contexts as true natural language text may mean it's worth trying a wider range of non-default training options on any word2vec model.
In particular, if your categories don't have a natural ordering giving rise to meaningful near-neighbors relationships, using a giant window (so all words in a single 'text' are in each others' contexts) may be worth considering.
Recent versions of the Python gensim Word2Vec allow changing a parameter named ns_exponent – which was fixed at 0.75 in many early implementations, but at least one paper has suggested can usefully vary far from that value for certain corpus data and recommendation-like applications.

Incorporating feedback to retrain WordToVec for finding document similarity

I have trained Gensim's WordToVec on a text corpus,converted it to DocToVec and then used cosine similarity to find the similarity between documents. I need to suggest similar documents. Now suppose among the top 5 suggestions for a particular document, we manually find that 3 of them are not similar.Can this feedback be incorporated in retraining the model?
It's not quite clear what you mean by "converted [a Word2Vec model] to DocToVec". The gensim Doc2Vec class doesn't use or require a Word2Vec model as input.
But, if you have many sets of hand-curated "this is a good suggestion" or "this is a bad suggestion" pairs for your corpus, you can use the model's scoring against all those to compare models, and train many variant models (with different model parameter values like size, window, min_count, sample, etc), picking the one that scores best on your tests.
That sort of automated-parameter-search is the most straightforward way to use performance on real evaluation data to adjust an unsupervised model like Word2Vec.
(Depending on the specifics of your data and problem-domain, you might also start to notice patterns in where the model is better or worse, that help you hand-tune parts of the data preprocessing. For example, a different handling of capitalization or tokenization might be suggested by error cases.)

How to evaluate word2vec build on a specific context files

Using gensim word2vec, built a CBOW model with a bunch of litigation files for representation of word as vector in a Named-Entity-recognition problem, but I want to known how to evaluate my representation of words. If I use any other datasets like wordsim353(NLTK) or other online datasets of google, it doesn't work because I built the model specific to my domain dataset of files. How do I evaluate my word2vec's representation of word vectors .I want words belonging to similar context to be closer in vector space.How do I ensure that the build model is doing it ?
I started by using a techniques called odd one out. Eg:
model.wv.doesnt_match("breakfast cereal dinner lunch".split()) --> 'cereal'
I created my own dataset(for validating) using the words in the training of word2vec .Started evaluating with taking three words of similar context and an odd word out of context.But the accuracy of my model is only 30 % .
Will the above method really helps in evaluating my w2v model ? Or Is there a better way ?
I want to go with word_similarity measure but I need a reference score(Human assessed) to evaluate my model or is there any techniques to do it? Please ,do suggest any ideas or techniques .
Ultimately this depends on the purpose you intend for the word-vectors – your evaluation should mimic the final use as much as possible.
The "odd one out" approach may be reasonable. It's often done with just 2 words that are somehow, via external knowledge/categorization, known to be related (in the aspects that are important for your end use), then a 3rd word picked at random.
If you think your hand-crafted evaluation set is of high-quality for your purposes, but your word-vectors aren't doing well, it may just be that there are other problems with your training: too little data, errors in preprocessing, poorly-chosen metaparameters, etc.
You'd have to look at individual failure cases in more detail to pick what to improve next. For example, even when it fails at one of your odd-one-out tests, do the lists of most-similar words, for each of the words included, still make superficial sense in an eyeball-test? Does using more data or more training iterations significantly improve the evaluation scoring?
A common mistake during both training and evaluation/deployment is to retain too many rare words, on the (mistaken) intuition that "more info must be better". In fact, words with only a few occurrences can't get very high-quality vectors. (Compared to more-frequent words, their end vectors are more heavily influenced by the random original initialization, and by the idiosyncracies of their few occurrences available rather than their most-general meaning.) And further, their presence tends to interfere with the improvement of other nearby more-frequent words. Then, if you include the 'long tail' of weaker vectors in your evaluations, they tend to somewhat arbitrarily intrude in rankings ahead of common words with strong vectors, hiding the 'right' answers to your evaluation questions.
Also, note that the absolute value of an evaluation score may not be that important, because you're just looking for something that points your other optimizations in the right direction for your true end-goal. Word-vectors that are just slightly-better at precise evaluation questions might still work well-enough in other fuzzier information-retrieval contexts.

The options for the first step of document clustering

I checked several document clustering algorithms, such as LSA, pLSA, LDA, etc. It seems they all require to represent the documents to be clustered as a document-word matrix, where the rows stand for document and the columns stand for words appearing in the document. And the matrix is often very sparse.
I am wondering, is there any other options to represent documents besides using the document-word matrix? Because I believe the way we express a problem has a significant influence on how well we can solve it.
As #ffriend pointed out, you cannot really avoid using the term-document-matrix (TDM) paradigm. Clustering methods operates on points in a vector space, and this is exactly what the TDM encodes. However, within that conceptual framework there are many things you can do to improve the quality of the TDM:
feature selection and re-weighting attempt to remove or weight down features (words) that do not contribute useful information (in the sense that your chosen algorithm does just as well or better without these features, or if their counts are decremented). You might want to read more about Mutual Information (and its many variants) and TF-IDF.
dimensionality reduction is about encoding the information as accurately as possible in the TDM using less columns. Singular Value Decomposition (the basis of LSA) and Non-Negative Tensor Factorisation are popular in the NLP community. A desirable side effect is that the TDM becomes considerably less sparse.
feature engineering attempts to build a TDM where the choice of columns is motivated by linguistic knowledge. For instance, you may want to use bigrams instead of words, or only use nouns (requires a part-of-speech tagger), or only use nouns with their associated adjectival modifier (e.g. big cat, requires a dependency parser). This is a very empirical line of work and involves a lot of experimentation, but often yield improved results.
the distributional hypothesis makes if possible to get a vector representing the meaning of each word in a document. There has been work on trying to build up a representation of an entire document from the representations of the words it contains (composition). Here is a shameless link to my own post describing the idea.
There is a massive body of work on formal and logical semantics that I am not intimately familiar with. A document can be encoded as a set of predicates instead of a set of words, i.e. the columns of the TDM can be predicates. In that framework you can do inference and composition, but lexical semantics (the meaning if individual words) is hard to deal with.
For a really detailed overview, I recommend Turney and Pantel's "From Frequency to Meaning : Vector Space Models of Semantics".
You question says you want document clustering, not term clustering or dimensionality reduction. Therefore I'd suggest you steer clear of the LSA family of methods, since they're a preprocessing step.
Define a feature-based representation of your documents (which can be, or include, term counts but needn't be), and then apply a standard clustering method. I'd suggest starting with k-means as it's extremely easy and there are many, many implementations of it.
OK, this is quite a very general question, and many answers are possible, none is definitive
because it's an ongoing research area. So far, the answers I have read mainly concern so-called "Vector-Space models", and your question is termed in a way that suggests such "statistical" approaches. Yet, if you want to avoid manipulating explicit term-document matrices, you might want to have a closer look at the Bayesian paradigm, which relies on
the same distributional hypothesis, but exploits a different theoretical framework: you don't manipulate any more raw distances, but rather probability distributions and, which is the most important, you can do inference based on them.
You mentioned LDA, I guess you mean Latent Dirichlet Allocation, which is the most well-known such Bayesian model to do document clustering. It is an alternative paradigm to vector space models, and a winning one: it has been proven to give very good results, which justifies its current success. Of course, one can argue that you still use kinds of term-document matrices through the multinomial parameters, but it's clearly not the most important aspect, and Bayesian researchers do rarely (if ever) use this term.
Because of its success, there are many software that implements LDA on the net. Here is one, but there are many others:
http://jgibblda.sourceforge.net/

Features in Document clustering/classification?

This may sound very naive but i just wanted to be sure that when talking in Machine Learning terminology, features in Document Clustering is words which are chosen from a document, if some are discarded after stemming or as stop-words.
I am trying to use LibSvm library and it says that there are different approaches for different types of { no_of_instances, no_of_features }.
Like if no_of_instances is much lower than no_of_features, linear kernels would do. if both are large, linear would be fast. However, if no_of_features is small, non-linear kernels is better.
So for my document clustering/classification, i have small number of documents like 100 each may have words around 2000. So i fall in small no_of_instances and large no_of_features category depending upon what i consider a feature is.
I would like to use tf-idf for the document.
So Is no_of_features is the size of the vector that i get from tf-idf ?
What you are talking about here is just one of possibilities, in fact the most trivial way of defining features for documents. In machine learning terminology feature is any mapping from the input space (in this particular example - from space of documents) into some abstract space, which is suited for a particular machine learning model. Most of the ML models (like neural networks, support vector machines, etc.) work on the numerical vectors, so features has to be mappings from documents to (constant size) vectors of numbers. This is a reason for sometimes choosing a representation of bag of owrds, where we have a words' count vector as a document representation. This limitation can be overcomed by using specific models, like for example Naive Bayes (or a custom kernel for SVM, which enables them to work with nonnumeric data), which can work on any objects, as long as we can define perticular conditional probabilities - here, the most basic approach is treating document containing a particular word or not as a "feature". In general this is not the only possibility, there are dozens of methods that use statistical features, semantic features (based on some ontologies like wordnet) etc.
To sum up - this is only one, most simple representation of document for the machine learning model. Good to start with, good to understand the basics, but far from being a "feature definition".
EDIT
no_of_features is size of the vector you use for your documents' representation, so if you use tf-idf, then size of resulting vecor is a no_of_featuers.

Resources