Explain sparse representation in relation to generalization - machine-learning

Is there a relation between sparse representations (such as one-hot encodings) and generalization? Kindly explain.
This relates to section 13.2 of "Learning Deep Architectures for AI" by Yoshua Bengio.

Related

How to classify texts that are related to the bible based on their content

I have a database of texts from comments on social networks (Facebook, Twitter).
My goal is to classify texts that have a strong relation to the Bible based on their content (for example, if they contain quotations or "biblical" words).
This is a binary classification problem and I need help figuring out how to approach it (maybe use the Bible itself as a dictionary somehow). Thanks!
You can train a supervised binary classifier (e.g. logistic regression over TF-IDF features, a fastText classifier, or a fine-tuned BertForSequenceClassification).
Then apply this classifier to your database of comments and find a reasonable probability threshold to keep only the comments for which the classifier is confident enough.
As positive examples for training, you can use sentences from the Bible itself, sentences from Bible-related Wikipedia articles, etc. As negative examples, you can use any corpus of sentences collected from the web - e.g. one of the Leipzig corpora.
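As a minimal sketch of the TF-IDF plus logistic regression route described above (the tiny corpora and the 0.6 threshold are placeholders; in practice the positives would be Bible sentences and the negatives a generic web corpus):

```python
# Sketch: TF-IDF features + logistic regression, then keep only confident positives.
# The example sentences and the threshold value are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

positives = [
    "In the beginning God created the heaven and the earth.",
    "Blessed are the meek, for they shall inherit the earth.",
]
negatives = [
    "Great match last night, can't believe that last-minute goal.",
    "Does anyone know a good pizza place downtown?",
]

texts = positives + negatives
labels = [1] * len(positives) + [0] * len(negatives)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

# Apply the classifier to unlabeled comments and keep only confident positives.
comments = ["Blessed are the peacemakers", "traffic was terrible this morning"]
probs = model.predict_proba(comments)[:, 1]
threshold = 0.6  # tune on a held-out validation set
bible_related = [c for c, p in zip(comments, probs) if p >= threshold]
print(bible_related)
```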

Understanding embedding vectors dimension

In deep learning, particularly in NLP, words are transformed into a vector representation to be fed into a neural network such as an RNN. Referring to the link:
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/#Word%20Embeddings
In the section of Word Embeddings, it is said that:
A word embedding W: words → Rⁿ is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions)
I do not understand the purpose of the dimension of the vectors. What does it mean to have a vector of 200 dimensions compared to a vector of 20 dimensions?
Does it improve the overall accuracy of the model? Could anyone give a simple example regarding the choice of the vector dimension?
These word embeddings, also called distributed word embeddings, are based on the idea that
you know a word by the company it keeps
as John Rupert Firth put it.
So we learn the meaning of a word from its context. You can think of each scalar in a word's vector as representing its strength of association with some concept. This slide from Prof. Pawan Goyal explains it all.
So you want a vector size large enough to capture a decent number of concepts, but not so large that it becomes a bottleneck when training the models that use these embeddings.
Also, the vector size is mostly fixed in practice: most people do not train their own embeddings but rather use openly available ones, which have been trained for many hours on huge datasets. Using them forces you to use an embedding layer whose dimension matches the pretrained embedding you choose (word2vec, GloVe, etc.).
Distributed word embeddings are a major milestone in deep learning for NLP. They give better accuracy compared to TF-IDF based representations.
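To make the "the dimension is fixed by the pretrained vectors" point concrete, here is a small sketch (the file name and toy vocabulary are assumptions, not from the answer above) that reads a standard GloVe text file and builds an embedding matrix whose width is dictated by the file you downloaded:

```python
# Sketch: the embedding dimension is whatever the pretrained file provides.
# "glove.6B.200d.txt" is the usual GloVe download format (a word per line
# followed by its vector); the vocabulary below is a toy assumption.
import numpy as np

def load_glove(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.200d.txt")    # every vector in this file has 200 dimensions
dim = len(next(iter(glove.values())))      # -> 200, fixed by the file we chose

# Build the embedding matrix for our own vocabulary; its width must be `dim`.
vocab = ["the", "bible", "football", "unknownword"]
embedding_matrix = np.zeros((len(vocab), dim), dtype=np.float32)
for i, word in enumerate(vocab):
    if word in glove:
        embedding_matrix[i] = glove[word]  # out-of-vocabulary words stay as zeros

print(embedding_matrix.shape)  # (4, 200): the 200 is not ours to pick
```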

NLP - Word Representations

I am working on a word representation algorithm, similar to Word2Vec and GloVe. I have been asked to make it more dynamic, such that new words could be added to the vocabulary, and new documents could be submitted to the program even after the representations (vectors) have been created.
The problem is, how do I know if my representations work? How do I know if they actually capture the meaning of each word? How do I compare my representation with other existing vector space models?
As of now, I am doing the following tests to check the quality of my word vectors:
Distance test:
Does the cosine distance between vectors reflect the semantic distance between words?
Analogy test:
Can the representation be used to solve problems like "King is to queen what man is to ________"? (The answer should be woman.)
Picking the odd one out:
Can the vectors be used to pick the odd word out of a given list of words? If the input is {"cat", "dog", "phone"}, the output should be "phone".
What are the other tests that I should do to check the quality of the vectors? What other tasks are word vectors expected to be capable of doing? Is there a benchmark for vector space models?
Your tests sound very reasonable — they are the usual evaluation tasks that are used in research papers to test the quality of word embeddings.
In addition, the website www.wordvectors.org can give you a good idea of how your vectors measure up. It allows you to upload your embeddings, generates plots, gives correlations with word pair similarity rankings, and compares your embeddings with pre-trained vectors from previous research. You can find a more detailed description in the accompanying paper.
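As a rough sketch of how the analogy and odd-one-out tests from the question can be run with cosine similarity (the 3-dimensional toy vectors below are invented purely for illustration; real embeddings would have hundreds of dimensions):

```python
# Toy illustration of the analogy and odd-one-out tests using cosine similarity.
# The 3-d vectors are made up for demonstration; plug in real embeddings instead.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "cat":   np.array([0.1, 0.5, 0.5]),
    "dog":   np.array([0.2, 0.5, 0.5]),
    "phone": np.array([0.9, 0.0, 0.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, exclude):
    # "a is to b what c is to ?"  ->  nearest neighbour of (b - a + c)
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=candidates.get)

def odd_one_out(words):
    # The word least similar, on average, to the others in the list.
    scores = {w: np.mean([cosine(vectors[w], vectors[o]) for o in words if o != w])
              for w in words}
    return min(scores, key=scores.get)

print(analogy("king", "queen", "man", exclude={"king", "queen", "man"}))  # expected: "woman"
print(odd_one_out(["cat", "dog", "phone"]))                               # expected: "phone"
```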

Confusion regarding difference of machine learning and statistical learning algorithms

I read the following lines in an IEEE Transactions paper on software fault prediction:
"Researchers have adopted a myriad of different techniques to construct software fault prediction models. These include various statistical techniques such as logistic regression and Naive Bayes which explicitly construct an underlying probability model. Furthermore, different machine learning techniques such as decision trees, models based on the notion of perceptrons, support vector machines, and techniques that do not explicitly construct a prediction model but instead look at a set of most similar
known cases have also been investigated.
Can anyone explain what the authors really want to convey?
Please give an example.
Thanks in advance.
The authors seem to distinguish probabilistic vs. non-probabilistic models, that is, models that produce a distribution p(output | data) vs. those that just produce an output, output = f(data).
The description of the non-probabilistic algorithms is a bit odd to my taste, though. From the modelling and algorithmic perspective, the difference between a (linear) support vector machine, a perceptron, and logistic regression is not that large, so implying that the former "look at a set of most similar known cases" while the latter does not seems strange.
The authors seem to be distinguishing models which compute per-class probabilities (from which you can derive a classification rule that assigns an input to the most probable class, or, more generally, to the class with the least expected misclassification cost) from those which directly assign inputs to classes without passing through per-class probabilities as an intermediate result.
A classification task can be viewed as a decision problem; in this case one needs per-class probabilities and a misclassification cost matrix. I think this approach is described in many current texts on machine learning, e.g., Brian Ripley's "Pattern Recognition and Neural Networks" and Hastie, Tibshirani, and Friedman, "Elements of Statistical Learning".
As a meta-comment, you might get more traction for this question on stats.stackexchange.com.
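To make the distinction concrete, here is a small sketch using scikit-learn (the library and the synthetic data are assumptions for illustration, not from the paper quoted above): logistic regression exposes per-class probabilities, a linear SVM only a decision score, and k-nearest neighbours is an instance-based method that classifies by looking at the most similar known cases.

```python
# Illustration of the probabilistic / non-probabilistic distinction discussed above.
# scikit-learn and the synthetic dataset are assumptions made for this example only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Probabilistic: explicitly models p(class | data).
logreg = LogisticRegression().fit(X, y)
print(logreg.predict_proba(X[:1]))      # per-class probabilities, e.g. [[0.12 0.88]]

# Non-probabilistic: only an output / score, no underlying probability model.
svm = LinearSVC().fit(X, y)
print(svm.decision_function(X[:1]))     # a signed margin, not a probability

# "Looks at the most similar known cases" (instance-based, no explicit model).
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:1]))
```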

Representing documents in vector space model

I have a very fundamental question. I have two sets of documents, one for training and one for testing. I would like to train a logistic regression classifier with the training documents, and I want to know if I'm doing the right thing.
First, find the list of all unique words in the training documents and call it the vocabulary.
For each word in the vocabulary, find its TF-IDF score in every training document. A document is then represented as a vector of these TF-IDF scores.
My question is:
1. How do I represent the test documents? Say one of the test documents does not contain any word that is in the vocabulary. In that case, the TF-IDF scores will be zero for all words in the vocabulary for that document.
I'm trying to use LIBSVM, which uses a sparse vector format. For the above document, which has all entries set to 0 in its vector representation, how do I represent it?
You have to store enough information about the training corpus to apply the TF-IDF transform to unseen documents. This means you'll need the document frequencies of the terms in the training corpus. Ignoring unseen words in test documents is fine; your SVM won't have learned a weight for them anyway. Note that unseen terms should be rare in the test corpus if your training and test distributions are similar, so even if a few terms are dropped, you'll still have plenty of terms left to classify the document.
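A short sketch of that workflow with scikit-learn (the toy documents, labels, and output file name are assumptions for illustration): fit the TF-IDF vocabulary and document frequencies on the training set only, transform the test set with the same vectorizer, and write the result in the sparse LIBSVM format.

```python
# Sketch: fit TF-IDF on training docs only, reuse it for test docs, and export
# to the sparse LIBSVM format. The documents, labels, and file name are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import dump_svmlight_file

train_docs = ["the cat sat on the mat", "dogs chase cats", "stock prices fell today"]
train_labels = [0, 0, 1]
test_docs = ["completely unseen vocabulary here", "the dog sat"]
test_labels = [0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learns vocabulary + document frequencies
X_test = vectorizer.transform(test_docs)         # words unseen in training are simply ignored

# A test document with no known words becomes an all-zero row, which in the
# sparse LIBSVM format is just the label followed by no feature:value pairs.
dump_svmlight_file(X_test, test_labels, "test.libsvm")
```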
