The acronym "CTR" is frequently used in CatBoost https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html to represent a type of value. But I couldn't find what the acronym stands for. Could you please spell out the non acronym form, and provide some references to its definition?
CTR is an acronym for click-through rate. CatBoost encodes categorical features using this metric to get better results. It is quite similar to mean encoding.
I couldn't find an online resource to cite as a reference.
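For intuition, the "similar to mean encoding" comparison can be illustrated with a small toy example (the column names and data below are made up, and this is a simplification: CatBoost's actual CTR computation is ordered and more elaborate):

import pandas as pd

# Toy data: one categorical feature and a binary target.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "clicked": [1, 0, 1, 1, 0, 0],
})

# Mean of the target per category value, i.e. a per-category "click-through rate".
ctr_per_city = df.groupby("city")["clicked"].mean()
df["city_ctr"] = df["city"].map(ctr_per_city)
print(df)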
CTR stands for Click-Through Rate, a metric commonly used in the advertising industry. CatBoost was created by Yandex, which makes most of its profit from ads, so I am not surprised they used the CTR metric as an example throughout the documentation.
Here I have a word2vec model, suppose I use the google-news-300 model
import gensim.downloader as api
word2vec_model300 = api.load('word2vec-google-news-300')
I want to find similar words for "AI" or "artificial intelligence", so I write
word2vec_model300.most_similar("artificial intelligence")
and I get this error:
KeyError: "word 'artificial intelligence' not in vocabulary"
So what is the right way to extract similar words for bigram words?
Thanks in advance!
At one level, when a word-token isn't in a fixed set of word-vectors, it means the creators of that set of word-vectors chose not to train/model that word. So, anything you do will only be a crude workaround for its absence.
Note, though, that when Google prepared those vectors – based on a dataset of news articles from before 2012 – they also ran some statistical multigram-combinations on it, creating multigrams with connecting _ characters. So, first check if a vector for 'artificial_intelligence' might be present.
If it isn't, you could try other rough workarounds like averaging together the vectors for 'artificial' and 'intelligence' – though of course that won't really be what people mean by the distinct combination of those words, just meanings suggested by the independent words.
The Gensim .most_similar() method can take either a raw vector you've created by operations such as averaging, or a list of multiple words which it will average for you, as arguments via its explicit positive keyword parameter. For example:
word2vec_model300.most_similar(positive=[average_vector])
...or...
word2vec_model300.most_similar(positive=['artificial', 'intelligence'])
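Putting the two suggestions together, a rough sketch might look like this (it assumes the word2vec_model300 object loaded above; the in membership check works because Gensim's KeyedVectors supports it, and the averaging fallback is only an approximation of the combined meaning):

import numpy as np

# 1. Prefer the pre-combined multigram token, if Google included one.
if 'artificial_intelligence' in word2vec_model300:
    print(word2vec_model300.most_similar('artificial_intelligence', topn=5))
else:
    # 2. Otherwise fall back to averaging the individual word vectors.
    average_vector = np.mean(
        [word2vec_model300['artificial'], word2vec_model300['intelligence']],
        axis=0,
    )
    print(word2vec_model300.most_similar(positive=[average_vector], topn=5))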
Finally, though Google's vectors are handy, they're a bit old now, and from a particular domain (popular news articles) where senses may not match those used in other domains (or more recently). So you may want to seek alternate vectors, or train your own if you have sufficient data from your area of interest, to have appropriate meanings – including vectors for any particular multigrams you choose to tokenize in your data.
Take the following sentence:
I'm going to change the light bulb
Here, change means replace, as in someone is going to replace the light bulb. This could easily be solved by using a dictionary API or something similar. However, consider the following sentences:
I need to go the bank to change some currency
You need to change your screen brightness
In the first sentence, change no longer means replace; it means exchange. In the second sentence, change means adjust.
If you were trying to understand the meaning of change in these situations, what techniques would you use to extract the correct definition based on the context of the sentence? What is what I'm trying to do called?
Keep in mind, the input would only be one sentence. So something like:
Screen brightness is typically too bright on most people's computers.
People need to change the brightness to have healthier eyes.
is not what I'm trying to solve, because there you can use the previous sentence to set the context. Also, this would need to work for lots of different words, not just the word change.
Appreciate the suggestions.
Edit: I'm aware that various embedding models can help gain insight on this problem. If that is your answer, how do you interpret the word embedding that is returned? These arrays can be 500+ elements long, which isn't practical to interpret directly.
What you're trying to do is called Word Sense Disambiguation. It's been a subject of research for many years, and while probably not the most popular problem it remains a topic of active research. Even now, just picking the most common sense of a word is a strong baseline.
Word embeddings may be useful but their use is orthogonal to what you're trying to do here.
Here's a bit of example code from pywsd, a Python library with implementations of some classical techniques:
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, pos='n')
>>> print(answer)
Synset('depository_financial_institution.n.01')
>>> print(answer.definition())
'a financial institution that accepts deposits and channels the money into lending activities'
The methods are mostly fairly old and I can't speak to their quality, but it's a good starting point at least.
Word senses are usually going to come from WordNet.
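For instance, you can browse the WordNet senses that such disambiguators choose between using NLTK (this assumes nltk is installed and the wordnet corpus has been downloaded, e.g. via nltk.download('wordnet'); it is just a sense lookup, not disambiguation itself):

from nltk.corpus import wordnet as wn

# List a few of the verb senses WordNet records for "change".
for synset in wn.synsets('change', pos='v')[:4]:
    print(synset.name(), '-', synset.definition())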
I don't know how useful this is, but from my point of view word-vector embeddings are naturally separated, and a word's position in the embedding space is closely related to its different uses. However, as you said, a word is often used in several contexts.
To address this, context-aware encoding techniques such as continuous bag-of-words or continuous skip-gram models are generally used to classify the usage of a word in a particular context, e.g. change as either exchange or adjust. The same idea is applied in LSTM-based architectures and RNNs, where the context is preserved over input sequences.
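As a minimal illustration of those two training modes (not a disambiguation method in itself), Gensim exposes both through its sg flag; the toy sentences and parameter values below are illustrative only and assume Gensim 4.x naming:

from gensim.models import Word2Vec

sentences = [
    ["i", "need", "to", "change", "some", "currency", "at", "the", "bank"],
    ["please", "change", "the", "screen", "brightness"],
    ["i", "am", "going", "to", "change", "the", "light", "bulb"],
]

# sg=0 -> continuous bag-of-words, sg=1 -> continuous skip-gram.
cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(cbow.wv.most_similar("change", topn=3))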
Interpreting word-vectors isn't practical from a visualisation point of view, only from a 'relative distance' point of view with respect to other words in the embedding space. Another approach is to maintain a matrix over the corpus in which the contextual uses of each word are represented.
In fact, there's a neural network that uses a bidirectional language model: it predicts the upcoming word, and then at the end of the sentence goes back and tries to predict the previous word. It's called ELMo. You should go through the ELMo paper and this blog.
Naturally the model learns from representative examples, so the more diverse the uses of the same word in your training set, the better the model can learn to use context to attach meaning to the word. This is often how people solve their specific cases: by using domain-centric training data.
I think these could be helpful:
Efficient Estimation of Word Representations in Vector Space
Pretrained language models like BERT could be useful for this as mentioned in another answer. Those models generate a representation based on the context.
Recent pretrained language models use wordpieces, but spaCy has an implementation that aligns those to natural-language tokens. It is then possible, for example, to check the similarity of different tokens based on their context. An example from https://explosion.ai/blog/spacy-transformers:
import spacy

# Requires the spacy-transformers extension package; the model name is the
# one used in the linked blog post.
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print(apple1[0].similarity(apple2[0])) # 0.73428553
print(apple1[0].similarity(apple3[0])) # 0.43365782
I'm looking for test datasets to optimize my Word2Vec model. I have found a good one from gensim:
gensim/test/test_data/questions-words.txt
Does anyone know other similar datasets?
Thank you!
It is important to note that there isn't really a "ground truth" for word-vectors. There are interesting tasks you can do with them, and some arrangements of word-vectors will be better on a specific task than others.
But also, the word-vectors that are best on one task – such as analogy-solving in the style of the questions-words.txt problems – might not be best on another important task – like say modeling texts for classification or info-retrieval.
That said, you can make your own test data in the same format as questions-words.txt. Google's original word2vec.c release, which also included a tool for statistically combining nearby words into multi-word phrases, also included a questions-phrases.txt file, in the same format, that can be used to test word-vectors that have been similarly constructed for 'words' that are actually short multiple-word phrases.
The Python gensim word-vectors support includes an extra method, evaluate_word_pairs() for checking word-vectors not on analogy-solving but on conformance to collections of human-determined word-similarity-rankings. The documentation for that method includes a link to an appropriate test-set for that method, SimLex-999, and you may be able to find other test sets of the same format elsewhere.
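A minimal sketch of both evaluation styles, assuming a trained KeyedVectors object named wv (e.g. model.wv), Gensim 3.5 or newer, and the test files bundled with Gensim:

from gensim.test.utils import datapath

# Analogy-style evaluation on the bundled questions-words.txt file.
analogy_score, sections = wv.evaluate_word_analogies(datapath('questions-words.txt'))
print("analogy accuracy:", analogy_score)

# Similarity-ranking evaluation against human judgments (WordSim-353 here;
# SimLex-999 uses the same tab-separated format).
pearson, spearman, oov_ratio = wv.evaluate_word_pairs(datapath('wordsim353.tsv'))
print("Spearman correlation:", spearman)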
But, again, none of these should be considered the absolute test of word-vectors' overall quality. The best test, for your particular project's use of word-vectors, would be some repeatable domain-specific evaluation score you devise yourself, that's inherently correlated to your end goals.
Using a classification algorithm (for example Naive Bayes or SVM) and StringToWordVector,
would it be possible to use TF/IDF and count term frequency over the whole current class instead of just looking at a single document?
Let me explain: I would like the computation to give a high score to words that are very frequent in a given class (not just in a given document) but not very frequent in the whole corpus.
Is it possible out of the box or does this need some extra developments?
Thanks :)
I would like the computation to give a high score to words that are very frequent in a given class (not just in a given document) but not very frequent in the whole corpus.
You seem to want supervised term weighting. I'm not aware of any off-the-shelf implementation of that, but there's a host of literature about it. E.g. the weighting scheme tf-χ² replaces idf with the result of a χ² independence test, so terms that statistically depend on certain classes get boosted, and there are several others.
Tf-idf itself is by its very nature unsupervised.
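As a rough sketch of the supervised tf-χ² idea with scikit-learn (the toy documents, labels, and variable names below are illustrative, and Weka's StringToWordVector does not do this out of the box):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

docs = ["the cat sat on the mat", "dogs chase cats",
        "stocks fell sharply", "the market rallied"]
labels = [0, 0, 1, 1]  # 0 = pets, 1 = finance

vectorizer = CountVectorizer()
tf = vectorizer.fit_transform(docs)                 # raw term frequencies

chi2_scores, _ = chi2(tf, labels)                   # chi^2 statistic of each term vs. the class labels
tf_chi2 = tf.multiply(chi2_scores.reshape(1, -1))   # replace idf with the chi^2 weight

print(dict(zip(vectorizer.get_feature_names_out(), np.round(chi2_scores, 2))))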
I think you're confusing yourself here: what you're asking for is essentially the feature weight on that term for documents of that class. This is what the learning algorithm is intended to optimise. Just worry about a useful representation of documents, which must necessarily be invariant to the class to which they belong (since you won't know the class for unseen test documents).
A modified idf may help you in some scenarios.
You can use the idf defined as:
log(1 + P(term in this class) / P(term in other classes))
Disadvantage: each class has a different idf, which can be interpreted as each term contributing differently to distinguishing each category.
Application: by adding this idf to Naive Bayes, I got an improvement in query-keyword classification, and it performs well when extracting keywords.
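A toy sketch of that class-conditional idf (the documents, labels, and smoothing constant below are purely illustrative):

import math

docs = [("cheap flights deal", "spam"), ("meeting agenda attached", "ham"),
        ("cheap deal now", "spam"), ("project meeting notes", "ham")]

def class_idf(term, target_class):
    in_class = [text for text, label in docs if label == target_class]
    other = [text for text, label in docs if label != target_class]
    p_in = sum(term in text.split() for text in in_class) / len(in_class)
    p_out = sum(term in text.split() for text in other) / len(other)
    return math.log(1 + p_in / (p_out + 1e-9))   # small constant avoids division by zero

print(round(class_idf("cheap", "spam"), 3))   # high: "cheap" appears only in spam
print(round(class_idf("cheap", "ham"), 3))    # zero: "cheap" never appears in ham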
I want to do text classification based on the keywords that appear in the text, because I do not have sample data to use Naive Bayes for text classification.
Example:
my document has a few words such as "family, mother, father, children ...", so the category of the document is family. Or it has "football, tennis, score ...", so the category is sport.
What is the best algorithm in this case? And is there any Java API for this problem?
What you have are feature labels, i.e., labels on features rather than instances. There are a few methods for exploiting these, but usually it is assumed that one has instance labels (i.e., labels on documents) in addition to feature labels. This paradigm is referred to as dual-supervision.
Anyway, I know of at least two ways to learn from labeled features alone. The first is Generalized Expectation Criteria, which penalizes model parameters for diverging from a priori beliefs (e.g., that "mother" ought usually to correlate with "family"). This method has the disadvantage of being somewhat complex, but the advantage of having a nicely packaged, open-source Java implementation in the Mallet toolkit (see here, specifically).
A second option would basically be to use Naive Bayes and give large priors to the known word/class associations -- e.g., P("family"|"mother") = .8, or whatever. All unlabeled words would be assigned some prior, presumably reflecting the class distribution. You would then effectively be making decisions based only on the prevalence of classes and the labeled-term information. Settles proposed a model like this recently, and there is a web-tool available.
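A toy sketch of that second idea, scoring classes with hand-assigned word/class priors (all words, probabilities, and the scoring rule below are illustrative; Settles' actual model is more principled):

word_class_priors = {
    "mother": {"family": 0.8}, "father": {"family": 0.8}, "children": {"family": 0.7},
    "football": {"sport": 0.8}, "tennis": {"sport": 0.8}, "score": {"sport": 0.6},
}
default_prior = {"family": 0.5, "sport": 0.5}   # unlabeled words are neutral

def classify(text):
    scores = {"family": 0.0, "sport": 0.0}
    for word in text.lower().split():
        for label, p in word_class_priors.get(word, default_prior).items():
            scores[label] += p
    return max(scores, key=scores.get)

print(classify("My mother and father took the children to the park"))   # family
print(classify("The final tennis score surprised everyone"))            # sport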
You will likely need an auxiliary data set for this. You cannot rely on your data set to convey the information that "dad", "father", and "husband" have similar meanings.
You can try to mine for co-occurrences to detect near-synonyms, but this is not very reliable.
WordNet and similar resources are probably a good place to disambiguate such words.
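For example, NLTK's WordNet interface lets you compare how close the first noun senses of "dad" and "father" are (this assumes the wordnet corpus has been downloaded; the exact similarity value depends on the WordNet version):

from nltk.corpus import wordnet as wn

dad = wn.synsets('dad', pos='n')[0]
father = wn.synsets('father', pos='n')[0]

print(dad.definition())
print(father.definition())
print(dad.path_similarity(father))   # values near 1.0 indicate near-synonymous senses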
You can download the freebase topic collection: http://wiki.freebase.com/wiki/Topic_API.