I am trying to use house price prediction as a practical example to learn machine learning. I have run into a problem regarding the neighborhood feature.
In most machine learning examples I have seen, features such as number of bedrooms, floor space, and land area are used. Intuitively, these features have a strong correlation with house prices. However, this is not the case for neighborhood. Say I randomly assign a neighborhood_id to each neighborhood: I won't be able to tell whether the neighborhood with id 100 has higher or lower house prices than the neighborhood with id 53.
I am wondering whether I need to do some data pre-processing, such as finding the average price for each neighborhood and using that instead, or whether there are existing machine learning algorithms that can figure out the relationship from such a seemingly unrelated feature.
I'm assuming that you're trying to interpret the relationship between neighborhood and housing price in a regression model with continuous and categorical data. From what I remember, R handles categorical variables automatically using one-hot encoding.
There are ways to approach this problem by creating data abstractions from categorical variables:
1) One-Hot Encoding
Let's say you're trying to predict housing prices from floor space and neighborhood. Assume that floor space is continuous and neighborhood is categorical with 3 possible neighborhoods, being A, B and C. One possibility is to encode neighborhood as a one-hot vector, treating each category as a new binary variable:
neighborhood   A   B   C
A              1   0   0
B              0   1   0
C              0   0   1
The regression model would be something like:
y = c0*bias + c1*floor_space + c2*A + c3*B + c4*C
Note that these neighborhood indicators behave like a bias term in the regression model: the coefficient for each neighborhood can be interpreted as the "bias" of that neighborhood.
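As a minimal sketch of this approach (assuming pandas and scikit-learn; the column names and prices below are made up for illustration):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data; columns and values are invented for illustration.
df = pd.DataFrame({
    "floor_space": [70, 120, 95, 60],
    "neighborhood": ["A", "B", "B", "C"],
    "price": [200_000, 350_000, 330_000, 150_000],
})

# One-hot encode the categorical neighborhood column.
X = pd.get_dummies(df[["floor_space", "neighborhood"]], columns=["neighborhood"])
y = df["price"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)))  # per-neighborhood "bias" coefficients
print(model.intercept_)

scikit-learn's OneHotEncoder would work just as well here; get_dummies simply keeps the example short.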
2) From categorical to continuous
Let's call Dx and Dy the horizontal and vertical distances from each neighborhood to a fixed point on the map. This creates a data abstraction that transforms neighborhood, a categorical variable, into two continuous variables, so you can correlate housing prices with horizontal and vertical distance from your fixed point.
Note that this is only appropriate when the transformation from categorical to continuous makes sense.
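A rough sketch of such a transformation, assuming you have (hypothetical) neighborhood centroid coordinates and pick, say, the city center as the fixed point:

# Hypothetical neighborhood centroid coordinates (x, y on some map grid).
centroids = {"A": (2.0, 5.0), "B": (7.5, 1.0), "C": (4.0, 9.0)}
city_center = (5.0, 5.0)  # the fixed reference point

def to_continuous(neighborhood):
    """Replace the categorical label with horizontal/vertical offsets from the fixed point."""
    x, y = centroids[neighborhood]
    return x - city_center[0], y - city_center[1]

print(to_continuous("B"))  # (2.5, -4.0)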
I'm reading the paper below and I have some trouble understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help, please?
The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between the word c (the context) and the word w (the target). The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures that words which appear close together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem: just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in the language. This makes word2vec much, much faster to train.
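A rough numpy sketch of that idea (the vocabulary size, dimensions and vectors below are made up): score one (target, context) pair against a handful of random negatives instead of summing over the whole vocabulary.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 50
# Made-up embedding matrices for target and context words.
W_target = rng.normal(size=(vocab_size, dim))
W_context = rng.normal(size=(vocab_size, dim))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_objective(target_id, context_id, k=5):
    """Negative-sampling objective for a single (target, context) pair."""
    w = W_target[target_id]
    c_pos = W_context[context_id]
    neg_ids = rng.integers(0, vocab_size, size=k)   # k random "noise" contexts
    c_neg = W_context[neg_ids]
    # Push the true context towards w, push the sampled negatives away.
    return np.log(sigmoid(c_pos @ w)) + np.sum(np.log(sigmoid(-(c_neg @ w))))

print(sgns_objective(target_id=42, context_id=7))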
Computing the softmax (the function used to determine which words are similar to the current target word) is expensive, since it requires summing over all words in V (the denominator), which is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact but modify its architecture to improve its efficiency (e.g. hierarchical softmax). Sampling-based approaches, on the other hand, do away with the softmax layer completely and instead optimise some other loss function that approximates the softmax (they do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute, such as negative sampling).
The loss function in word2vec is something like:
J = -log p(c|w) = -log [ exp(v_c . v_w) / sum_{ci in V} exp(v_ci . v_w) ]
which the logarithm decomposes into:
J = -v_c . v_w + log sum_{ci in V} exp(v_ci . v_w)
With some math and the gradient formula (see the references below for details) it is converted to:
J = -log sigmoid(v_c . v_w) - sum_{j=1..k} log sigmoid(-v_cj . v_w)
where the cj are the k negative samples. As you can see, it is converted to a binary classification task (y=1 for the positive class, y=0 for the negative class). Since we need labels to perform this binary classification task, we designate all context words c as true labels (y=1, positive samples) and k words randomly selected from the corpus as false labels (y=0, negative samples).
Consider, for example, a piece of text in which our target word is "Word2vec". With a window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words are treated as positive labels. We also need some negative labels, so we randomly pick some words from the corpus (produce, software, Collobert, margin-based, probabilistic) and treat them as negative samples. This technique of picking random examples from the corpus is called negative sampling.
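A small sketch of that sampling step (the sentence and the candidate negative words here are invented for illustration):

import random

random.seed(0)

sentence = "The widely popular Word2vec algorithm was developed by researchers".split()
corpus_vocab = ["produce", "software", "Collobert", "margin-based",
                "probabilistic", "democracy", "greed", "page"]

def positive_and_negative_samples(tokens, target_index, window=3, k=5):
    """Context words inside the window are positives; k random corpus words are negatives."""
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    positives = [tokens[i] for i in range(lo, hi) if i != target_index]
    negatives = random.sample(corpus_vocab, k)
    return positives, negatives

pos, neg = positive_and_negative_samples(sentence, sentence.index("Word2vec"))
print("positive:", pos)
print("negative:", neg)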
References:
(1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
(2) http://sebastianruder.com/word-embeddings-softmax/
I wrote a tutorial article about negative sampling here.
Why do we use negative sampling? -> to reduce computational cost
The cost function for vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) looks like this:
J_SG = -(1/T) sum_{t=1..T} sum_{-c<=j<=c, j!=0} log p(w_{t+j} | w_t)
J_SGNS = -log sigmoid(c_pos . w) - sum_{i=1..K} log sigmoid(-c_neg_i . w)
Note that T is the number of words in the training corpus, while V is the size of the vocabulary.
The probability distribution p(w_{t+j}|w_t) in SG is computed over all V words in the vocabulary with the softmax:
p(w_{t+j} | w_t) = exp(u_{w_{t+j}} . v_{w_t}) / sum_{w=1..V} exp(u_w . v_{w_t}),
where v and u are the input and output word vectors.
V can easily exceed tens of thousands when training a Skip-Gram model, so the probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires an extra V computations.
On the other hand, the probability distribution in SGNS is computed with:
p(D=1 | w, c) = sigmoid(c . w) = 1 / (1 + exp(-c . w))
c_pos is the word vector of the positive (actual context) word, and W_neg holds the word vectors of all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 and 20. Furthermore, no extra iterations are necessary to compute the normalization factor in the denominator.
With SGNS, only a fraction of the weights are updated for each training sample, whereas SG updates all of the (possibly millions of) weights for each training sample.
How does SGNS achieve this? -> by transforming the multi-class classification task into a binary classification task.
With SGNS, word vectors are no longer learned by predicting the context words of a center word; instead, the model learns to differentiate the actual context words (positive) from randomly drawn words (negative) sampled from a noise distribution.
In real life, you don't usually observe a word like regression together with random words like Gangnam-Style or pimples. The idea is that if the model can distinguish between the likely (positive) pairs and the unlikely (negative) pairs, good word vectors will be learned.
Suppose the current positive word-context pair is (drilling, engineer), and K=5 negative samples are randomly drawn from the noise distribution: minimized, primary, concerns, led, page. As the model iterates through the training samples, the weights are optimized so that the probability for the positive pair outputs p(D=1|w,c_pos) ≈ 1, and the probability for the negative pairs outputs p(D=1|w,c_neg) ≈ 0.
I have a classification and regression question on machine learning.
First question: consider the following dataset.
http://it.tinypic.com/view.php?pic=oh3gj7&s=8#.VIjhRDGG_lF
Can we say that the dataset is linearly separable?
In order to apply a linear model for classification, is a transformation of the input space not needed for this dataset, or is it not possible for this dataset?
My answer to the first question is no, but I am not sure about the second; I am not sure whether a transformation is possible for this dataset.
Second question, about a regression problem:
Given the following dataset for a function f : R -> R
http://it.tinypic.com/view.php?pic=madsmr&s=8#.VIjhVjGG_lE
Can we say that:
A linear model for regression can be used to learn the function associated with this dataset?
Given this dataset, it is not possible to determine an optimal configuration of the linear model?
I am reading Tom Mitchell's Machine Learning and Bishop's Pattern Recognition and Machine Learning, but I still have trouble giving the right answer.
Thanks in advance.
Neither of these datasets can be modeled using linear classification/regression.
As for the "input data transformation": as long as the dataset is consistent (there are no two identical points with different labels), there always exists a transformation after which the data is linearly separable. In particular, one can construct it with:
phi(x) = 1 iff label of x is "1"
in other words, you map all positive samples to "1" and negatives to "0", so your data is now trivially linearly separable. Or simply map your N points to the N unit vectors of R^N, so that the i-th point is mapped to [0 0 0 ... 1 ... 0 0 0]^T, where the "1" appears in the i-th place. Such a dataset is trivially linearly separable for any labeling, as the sketch below illustrates.
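A minimal numpy sketch of that unit-vector construction (points and labels are made up): with phi(x_i) = e_i, the weight vector whose i-th entry is +1 for positive labels and -1 for negative labels separates the data, since w . phi(x_i) = +1 or -1.

import numpy as np

# An arbitrary labeling of N points; any labeling works.
labels = np.array([1, 0, 0, 1, 1])
N = len(labels)

Phi = np.eye(N)                           # i-th point -> i-th unit vector of R^N
w = np.where(labels == 1, 1.0, -1.0)      # one weight per point

scores = Phi @ w                          # w . phi(x_i) = +1 or -1
predictions = (scores > 0).astype(int)
print(predictions)                        # reproduces `labels` exactly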
I'm experimenting with Chi-2 feature selection for some text classification tasks.
I understand that the Chi-2 test checks the dependence between two categorical variables, so if we perform Chi-2 feature selection for a binary text classification problem with a binary BOW vector representation, each Chi-2 test on each (feature, class) pair would be a very straightforward Chi-2 test with 1 degree of freedom.
Quoting from the documentation: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2,
This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.
It seems to me that we can also perform Chi-2 feature selection on a DF (word count) vector representation. My first question is: how does sklearn discretize the integer-valued features into categorical ones?
My second question is similar to the first. From the demo code here: http://scikit-learn.sourceforge.net/dev/auto_examples/document_classification_20newsgroups.html
It seems to me that we can also perform Chi-2 feature selection on a TF*IDF vector representation. How does sklearn perform Chi-2 feature selection on real-valued features?
Thank you in advance for your kind advice!
The χ² features selection code builds a contingency table from its inputs X (feature values) and y (class labels). Each entry i, j corresponds to some feature i and some class j, and holds the sum of the i'th feature's values across all samples belonging to the class j. It then computes the χ² test statistic against expected frequencies arising from the empirical distribution over classes (just their relative frequencies in y) and a uniform distribution over feature values.
This works when the feature values are frequencies (of terms, for example) because the sum will be the total frequency of a feature (term) in that class. There's no discretization going on.
It also works quite well in practice when the values are tf-idf values, since those are just weighted/scaled frequencies.
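As an illustrative sketch of using it (the corpus and labels below are made up), selecting the top features by chi-square from either term counts or tf-idf values:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Made-up toy corpus and binary labels.
docs = ["cheap pills buy now", "meeting at noon tomorrow",
        "buy cheap watches", "project meeting rescheduled"]
y = [1, 0, 1, 0]

for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    X = vectorizer.fit_transform(docs)       # term counts or tf-idf weights
    scores, pvalues = chi2(X, y)             # one chi2 statistic per feature
    selector = SelectKBest(chi2, k=3).fit(X, y)
    names = vectorizer.get_feature_names_out()
    print(type(vectorizer).__name__, names[selector.get_support()])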
I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector with a set of features, let's call it x, and based on how the "optimization function" evaluates this input vector (let's say we're talking about text classification), the text associated with x is classified into one of two pre-defined classes; this is only the case for binary classification.
My first question: in the procedure described above, all the papers say that this training input vector x is first mapped to a higher (maybe infinite) dimensional space. What does this mapping achieve, and why is it required? Let's say the input vector x has 5 features; who decides which "higher dimension" x is going to be mapped to?
Second question is about the following optimization equation:
min (1/2) w^T w + C * sum_{i=1..n} ξ_i
I understand that w has something to do with the margin of the hyperplane relative to the support vectors, and I know that C is some sort of penalty, but I don't know what it is a penalty for. Also, what does ξ_i represent in this case?
A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.
When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally a linear classifier, but if you use kernels, they're linear in a space that's different from the original data space.
To be concrete, let's talk about the kernel
K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,
where x and y are each real numbers (one-dimensional). Note that
(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1
has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2) a, 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then used a linear kernel.
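A quick numerical sanity check of that identity, just a sketch with arbitrary 1-d points:

import numpy as np

def K(x, y):
    """Inhomogeneous polynomial kernel of degree 2 on 1-d inputs."""
    return (x * y + 1) ** 2

def phi(a):
    """Explicit feature map into R^3 corresponding to K."""
    return np.array([a ** 2, np.sqrt(2) * a, 1.0])

x, y = 1.7, -0.4
print(K(x, y), phi(x) @ phi(y))   # both print 0.1024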
The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.
You're the one who decides what feature space it's mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly - then infinite-dimensional points would be hard to represent. :)
The ξ_i are the "slack variables". Without them, SVMs would never be able to account for training sets that aren't linearly separable -- which most real-world datasets aren't. The ξ in some sense are the amount you need to push data points on the wrong side of the margin over to the correct side. C is a parameter that determines how much it costs you to increase the ξ (that's why it's multiplied there).
1) The higher dimension space happens through the kernel mechanism. However, when evaluating the test sample, the higher dimension space need not be explicitly computed. (Clearly this must be the case because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply infinite dimensional spaces, yet we don't need to map into this infinite dimension space explicitly. We only need to compute, K(x_sv,x_test), where x_sv is one of the support vectors and x_test is the test sample.
The specific higher dimensional space is chosen by the training procedure and parameters, which choose a set of support vectors and their corresponding weights.
2) C is the weight associated with the cost of not being able to classify the training set perfectly. The optimization equation says to trade-off between the two undesirable cases of non-perfect classification and low margin. The ξi variables represent by how much we're unable to classify instance i of the training set, i.e., the training error of instance i.
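If it helps, here is a small scikit-learn sketch (with made-up, overlapping data) showing how C trades off margin width against training errors:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two made-up, overlapping 2-d blobs (not perfectly separable).
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates more slack (wider margin, more support vectors);
    # large C penalizes training errors heavily (narrower margin).
    print(C, clf.n_support_, clf.score(X, y))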
See Chris Burges' tutorial on SVM's for about the most intuitive explanation you're going to get of this stuff anywhere (IMO).
I read a wiki article which describes the Jaccard index and explains the Tanimoto score as an extended Jaccard index, but what exactly does it try to do?
How is it different from other similarity scores?
When is it used?
Thank you
I just read the wikipedia article too, so I can only interpret the content for you.
The Jaccard score is used for vectors that take discrete values, most often binary values (1 or 0). The Tanimoto score is used for vectors that can take on continuous values. It is designed so that, if the vectors only take values of 1 and 0, it works the same as the Jaccard score.
I would imagine you would use Tanimoto's when you have a "mixed" vector that has some continuous-valued parts and some binary-valued parts.
What exactly does it try to do?
The Tanimoto score assumes that each data object is a vector of attributes. The attributes may or may not be binary. If they are all binary, the Tanimoto method reduces to the Jaccard method.
T(A,B) = A.B / (||A||^2 + ||B||^2 - A.B)
In the equation, A and B are data objects represented by vectors. The similarity score is the dot product of A and B divided by the sum of the squared magnitudes of A and B minus the dot product.
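A small sketch of that formula in Python (the example vectors are made up):

import numpy as np

def tanimoto(a, b):
    """Tanimoto (extended Jaccard) similarity between two real-valued vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    dot = a @ b
    return dot / (a @ a + b @ b - dot)

print(tanimoto([1, 0, 1, 1], [1, 1, 1, 0]))       # binary vectors: equals Jaccard = 2/4
print(tanimoto([0.2, 0.9, 0.4], [0.1, 0.8, 0.5]))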
How is it different from other similarity scores?
Tanimoto v/s Jaccard: If the attributes are binary, Tanimoto is reduced to Jaccard Index.
There are various similarity scores available, but let's compare it with the most frequently used ones.
Tanimoto v/s Dice:
The Tanimoto coefficient is determined by looking at the number of attributes that are common to both data objects (the intersection of the data strings) compared to the number of attributes that are in either (the union of the data objects).
The Dice coefficient is the number of attributes in common to both data objects relative to the average size of the total number of attributes present, i.e.
( A intersect B ) / 0.5 ( A + B )
D(A,B) = A.B / (0.5 (||A||^2 + ||B||^2))
Tanimoto v/s Cosine
Finding the cosine similarity between two data objects requires that both objects represent their attributes in a vector. Similarity is then measured as the angle between the two vectors.
Cos(θ) = A.B/(||A||.||B||)
You can also refer to When can two objects have identical Tanimoto and Cosine score.
Tanimoto v/s Pearson:
The Pearson Coefficient is a complex and sophisticated approach to finding similarity. The method generates a "best fit" line between attributes in two data objects. The Pearson Coefficient is found using the following equation:
p(A,B) = cov(A,B) / (σ_A σ_B)
where:
cov(A,B) --> covariance of A and B
σ_A --> standard deviation of A
σ_B --> standard deviation of B
The coefficient is found from dividing the covariance by the product of the standard deviations of the attributes of two data objects. It is more robust against data that isn't normalized. For example, if one person ranked movies "a", "b", and "c" with scores of 1, 2, and 3 respectively, he would have a perfect correlation to someone who ranked the same movies with a 4, 5, and 6.
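For comparison, a small sketch computing all four scores on the same pair of (made-up) vectors; note how the shifted-but-correlated ratings give a perfect Pearson score while the other measures stay below 1:

import numpy as np

def similarity_scores(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    dot = a @ b
    return {
        "tanimoto": dot / (a @ a + b @ b - dot),
        "dice":     dot / (0.5 * (a @ a + b @ b)),
        "cosine":   dot / (np.linalg.norm(a) * np.linalg.norm(b)),
        "pearson":  np.corrcoef(a, b)[0, 1],
    }

# Ratings 1, 2, 3 vs. 4, 5, 6: perfectly correlated but shifted.
print(similarity_scores([1, 2, 3], [4, 5, 6]))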
For more information on the Tanimoto score v/s other similarity scores/coefficients you can refer to:
Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?
When is it used?
The Tanimoto score can be used in both situations:
When attributes are binary
When attributes are non-binary
Following applications extensively use Tanimoto score:
Chemoinformatics
Clustering
Plagiarism detection
Automatic thesaurus extraction
To visualize high-dimensional datasets
Analyze market-basket transactional data
Detect anomalies in spatio-temporal data