Does normalization help get a better correlation? - machine-learning

I have a set of numeric features and I would like to use them to predict the price of a product. I didn't get any meaningful correlation between the features and the price. Can I improve the correlation by normalizing the features?

If you are just testing corr(x_i, y), where y is price and x_i is the i-th feature, then no, you cannot improve this value by normalization: correlation is invariant under affine transformations, meaning that for any a, c and any b, d > 0
corr(a + bx_i, c + dy) = corr(x_i, y)
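A quick numpy sketch (my own illustration, not part of the original answer) makes this concrete: min-max scaling is an affine map a + b*x_i with b > 0, so the Pearson correlation is unchanged.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x + rng.normal(size=1000)  # y correlates with x

r_raw = np.corrcoef(x, y)[0, 1]

# Min-max normalization is an affine map a + b*x with b > 0.
x_norm = (x - x.min()) / (x.max() - x.min())
r_norm = np.corrcoef(x_norm, y)[0, 1]

print(r_raw, r_norm)  # identical up to floating-point error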

Related

Vector coefficients based on similarity

I've been looking for a solution to create a recommendation system based on vector similarity.
Basically, I have a few vectors per user, for example:
User1: [0,3,7,8,5] , [3,5,8,2,4] , [1,5,3,9,4]
User2: [3,1,6,7,9] , [2,4,1,3,8] , [7,8,3,3,1]
For every vector I need to calculate a coefficient and, based on that coefficient, differentiate one vector from another. I've found formulas that calculate a coefficient from the similarity of two vectors, but that isn't really what I want. I need a formula that calculates a coefficient per vector, so that I can do further calculations with those coefficients. Are there any good formulas for this?
Thanks
Going off your response to my comment: I don't think there's a similarity coefficient measure that will do what you want. Let me explain why...
Similarity coefficients are functions f(x, y) -> c where x and y are vectors and c is a scalar. Note that f takes two parameters: f(x,y) = f(y,x), but f(x) is meaningless - it's asking for the similarity of x relative to... nothing.
So what? We could just use a function g(x) = f(x, V) where V is a fixed vector. E.g. let V = [1, 1, ..., 1]. Now we have a monadic function that gives us a similarity value for every individual vector. But...
Knowing f(x,y) = c and f(x,z) = c' doesn't tell you a whole lot about f(y,z). Take vectors in 2-space: x = [1, 1], y = [0, 1], z = [1, 0]. A similarity function symmetric in the two dimensions would say f(x,y) = f(x,z), but hopefully not = f(y,z). So our g function above isn't very useful, because knowing how similar two vectors are to V doesn't tell us much about how similar they are to each other.
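A minimal numeric sketch of that point, using cosine similarity as the f (my own illustration, not from the original answer):

import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1, 1])
y = np.array([0, 1])
z = np.array([1, 0])

print(cos_sim(x, y))  # ~0.707
print(cos_sim(x, z))  # ~0.707 -- the same similarity to x...
print(cos_sim(y, z))  # 0.0    -- ...yet y and z are orthogonal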
So what can you do? I think a simple solution to your problem would be a variation of the k nearest neighbors algorithm. It allows you to find vectors close to a given vector (or, if you prefer to find clusters of vectors without specifying a given vector, look up clustering).
EDIT: taking inspiration from Yahya's answer: if your vectors are huge and kNN or clustering is too expensive, consider principal component analysis or some other method of cutting them down to size (reducing the number of dimensions) - just keep in mind that whatever you do will likely be lossy.
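A minimal sketch of the kNN suggestion above (my own illustration; scikit-learn's NearestNeighbors and the query vector are assumptions, and the data rows are the vectors from the question):

import numpy as np
from sklearn.neighbors import NearestNeighbors

vectors = np.array([
    [0, 3, 7, 8, 5], [3, 5, 8, 2, 4], [1, 5, 3, 9, 4],  # User1
    [3, 1, 6, 7, 9], [2, 4, 1, 3, 8], [7, 8, 3, 3, 1],  # User2
])

nn = NearestNeighbors(n_neighbors=2).fit(vectors)
distances, indices = nn.kneighbors([[1, 4, 5, 8, 5]])  # hypothetical query vector
print(indices)    # rows of vectors closest to the query
print(distances)  # their Euclidean distances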

Machine learning - normalizing features with no theoretical maximum value

What is the best approach to normalize / standardize features that have no theoretical maximum value?
For example, a stock price that has always been between $0 and $1000 could still go higher, so what is the correct approach?
I thought about training the model with a higher assumed maximum (e.g. 2000), but it doesn't feel right: no data would be available for the 1000-2000 range, and I think this would introduce bias.
TL;DR: use z-scores, maybe take log, maybe take inverse logit, maybe don't normalize at all.
If you wish to normalize safely, use a monotonic mapping, e.g.:
To map (0, inf) into (-inf, inf), you can use y = log(x)
To map (-inf, inf) into (0, 1), you can use y = 1 / (1 + exp(-x)) (inverse logit)
To map (0, inf) into (0, 1), you can use y = x / (1 + x) (inverse logit after log)
If you don't care about bounds, use a linear mapping: y=(x - m) / s, where m is the mean of your feature, and s is its standard deviation. This is called standard scaling, or sometimes z-scoring.
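A short numpy sketch of these mappings (my own illustration; the sample values are made up):

import numpy as np

x = np.array([0.5, 1.0, 10.0, 500.0, 1000.0])  # positive values, e.g. prices

log_x   = np.log(x)                     # (0, inf) -> (-inf, inf)
sigmoid = 1.0 / (1.0 + np.exp(-log_x))  # (-inf, inf) -> (0, 1)
bounded = x / (1.0 + x)                 # (0, inf) -> (0, 1)
z_score = (x - x.mean()) / x.std()      # standard scaling (z-scoring)

print(np.allclose(sigmoid, bounded))  # True: inverse logit after log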
The question you should have asked yourself is: why normalize at all? What are you going to do with your data? Use it as an input feature? Or use it as a target to predict?
For an input feature, leaving it unnormalized is OK, unless you regularize the model coefficients (as in ridge or lasso), which works best when all features are on the same scale (that is, after standard scaling).
For a target feature, leaving it unnormalized is sometimes also OK.
Additive models (like linear regression or gradient boosting) sometimes work better with symmetric distributions. Distributions of stock values (and monetary values in general) are often skewed to the right, so taking the log makes them more convenient.
Finally, if you predict your feature with a neural net whose output has a sigmoid activation, the prediction is inherently bounded. In that case, you might wish the target to be bounded as well. To achieve this, you may use x / (1 + x) as the target: if x is always positive, this value will always be between 0 and 1, just like the output of the neural net.
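A sketch of that last idea (my own illustration): map the target into (0, 1) before training, then invert the map on the net's sigmoid outputs to get back to the original scale.

import numpy as np

def to_unit_interval(x):    # (0, inf) -> (0, 1)
    return x / (1.0 + x)

def from_unit_interval(t):  # (0, 1) -> (0, inf), the inverse map
    return t / (1.0 - t)

x = np.array([0.5, 10.0, 300.0])
t = to_unit_interval(x)       # train the net against t
print(from_unit_interval(t))  # recovers [0.5, 10.0, 300.0]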

Why does logistic regression use multiplication instead of addition for the likelihood?

This is a very basic question, but I could not find enough reasons to convince myself. Why must logistic regression use multiplication instead of addition for the likelihood function l(w)?
Your question is more general than just the joint likelihood for logistic regression. You're asking why we multiply probabilities instead of adding them to represent a joint probability distribution. Two notes:
This applies when we assume the random variables are independent. Otherwise we need to calculate conditional probabilities using the chain rule of probability. You can look at Wikipedia for more information.
We multiply because that's how the joint distribution is defined. Here is a simple example:
Say we have two probability distributions:
X = 1, 2, 3, each with probability 1/3
Y = 0 or 1, each with probability 1/2
We want to calculate the joint likelihood, L(X=x,Y=y): the probability that X takes the value x and Y takes the value y.
For example, L(X=1,Y=0) = P(X=1) * P(Y=0) = 1/6. It wouldn't make sense to write P(X=1) + P(Y=0) = 1/3 + 1/2 = 5/6.
Now it's true that in maximum likelihood estimation, we only care about the values of some parameter, theta, that maximize the likelihood function. In this case, we know that if theta maximizes L(X=x,Y=y), then the same theta also maximizes log L(X=x,Y=y). This is where you may have seen addition of probabilities come into play:
log P(X=x,Y=y) = log P(X=x) + log P(Y=y)
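A quick numeric check (my own illustration) that the product of probabilities and the sum of their logs agree, using the X and Y above:

import numpy as np

p_x1 = 1.0 / 3.0  # P(X = 1)
p_y0 = 1.0 / 2.0  # P(Y = 0)

joint = p_x1 * p_y0                      # 1/6, by independence
log_joint = np.log(p_x1) + np.log(p_y0)  # sum of logs

print(joint, np.exp(log_joint))  # both 0.1666...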
In short
This could be summarized as "joint probabilities represent an AND". When X and Y are independent, P(X AND Y) = P(X,Y) = P(X)P(Y). Not to be confused with P(X OR Y) = P(X) + P(Y) - P(X,Y).
Let me know if this helps.

Machine Learning: Why xW+b instead of Wx+b?

I have started to learn machine learning and am now playing around with TensorFlow.
Often I see examples like this:
pred = tf.add(tf.mul(X, W), b)
I also saw such a line in a plain numpy implementation. Why is x*W+b always used instead of W*x+b? Is there an advantage to multiplying the matrices in this order? I see that it is possible (if X, W and b are transposed), but I do not see an advantage. In math class at school we always used Wx+b.
Thank you very much
This is the reason:
By default w is a vector of weights and in maths a vector is considered a column, not a row.
X is a collection of data: an n x d matrix, where n is the number of examples and d the number of features (upper-case X is an n x d matrix; lower-case x is a single example, a 1 x d matrix).
To multiply them correctly, so that each weight is applied to its corresponding feature, you must use X*w + b:
With X*w you multiply every feature by its corresponding weight, and by adding b you add the bias term to every prediction.
If you multiply w*X, you are multiplying a (d x 1) vector by an (n x d) matrix, and the inner dimensions do not match.
I'm also confused by this. I guess it is a matter of dimensions. For an n x m matrix W and an n-dimensional vector x, xW + b is easily viewed as mapping an n-dimensional feature to an m-dimensional feature, i.e., you can think of W as an n-dimension -> m-dimension operation, whereas Wx + b (where x must now be an m-dimensional vector) becomes an m-dimension -> n-dimension operation, which looks less natural in my opinion. :D
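A shape-checking sketch (my own illustration, not from either answer) of the X @ W + b convention with a batch of examples:

import numpy as np

n, d, m = 4, 3, 2          # n examples, d input features, m outputs
X = np.random.randn(n, d)  # each row is one example
W = np.random.randn(d, m)  # maps d features -> m outputs
b = np.random.randn(m)     # one bias per output, broadcast over rows

pred = X @ W + b
print(pred.shape)          # (4, 2): one m-dimensional prediction per row
# W @ X would be (d, m) @ (n, d): the inner dimensions don't match.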

Probability for a vector x

In machine learning, suppose we have a GDA (Gaussian Discriminant Analysis) model for classification.
If y can take the values 0 or 1, and x is the vector of n features (n x 1 dimensional),
what does p(x|y=0) or p(x|y=1) signify for a particular training example?
x is actually a vector... how is conditional probability defined for it?
Any help would be much appreciated.
Say that X0 is the set of vectors x that are mapped to output 0, and X1 is the set of vectors x that are mapped to output 1. Take the mean of each set's vectors and, similarly, estimate each set's covariance.
Now build two multivariate normal distributions, with these means and covariances, respectively.
Once you have these distributions, simply plug the vector you want into the PDF to obtain its density. Note that since the distributions are continuous, the probability you asked about is 0 in general; what p(x|y) gives you is a density, not a probability.
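A sketch of that recipe (my own illustration; scipy's multivariate_normal and the synthetic data are assumptions, and classic GDA often shares one covariance across classes rather than fitting one per class):

import numpy as np
from scipy.stats import multivariate_normal

X0 = np.random.randn(100, 3)        # training vectors with y = 0
X1 = np.random.randn(100, 3) + 2.0  # training vectors with y = 1

# Fit one Gaussian per class from the sample mean and sample covariance.
dist0 = multivariate_normal(X0.mean(axis=0), np.cov(X0, rowvar=False))
dist1 = multivariate_normal(X1.mean(axis=0), np.cov(X1, rowvar=False))

x_new = np.array([1.0, 1.0, 1.0])
print(dist0.pdf(x_new))  # the density p(x | y = 0), not a probability
print(dist1.pdf(x_new))  # the density p(x | y = 1)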
