Do combinations of existing features make new features? - machine-learning

Does it help to classify better if I add linear or non-linear combinations of the existing features? For example, does it help to add the mean and variance computed from the existing features as new features? I believe it definitely depends on the classification algorithm: in the case of PCA, the algorithm itself generates new features that are orthogonal to each other and are linear combinations of the input features. But how does it affect decision tree based classifiers or others?

Yes, combinations of existing features can produce new features and help with classification. Moreover, a combination of a feature with itself (e.g. a polynomial of the feature) can be used as additional data during classification.
As an example, consider a logistic regression classifier with the following linear formula at its core:
g(x, y) = 1*x + 2*y
Imagine that you have two observations:
x = 6; y = 1
x = 3; y = 2.5
In both cases g() equals 8. If the observations belong to different classes, you have no way to distinguish them. But let's add one more variable (feature) z, which is a combination of the previous two features, z = x * y:
g(x, y, z) = 1*x + 2*y + 0.5*z
Now for the same observations we have:
x = 6; y = 1; z = 6 * 1 = 6 ==> g() = 11
x = 3; y = 2.5; z = 3 * 2.5 = 7.5 ==> g() = 11.75
So now we get two different values and can distinguish between the two observations.
Polynomial features (x^2, x^3, y^2, etc.) do not give additional points, but instead change the graph of the function. For example, g(x) = a0 + a1*x is a line, while g(x) = a0 + a1*x + a2*x^2 is a parabola and can thus fit the data much more closely.
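To make the arithmetic concrete, here is a minimal sketch in Python (assuming NumPy is available) that evaluates g() for both observations with and without the interaction feature z = x * y:

import numpy as np

X = np.array([[6.0, 1.0],
              [3.0, 2.5]])                        # two observations: (x, y)

g_without = 1 * X[:, 0] + 2 * X[:, 1]             # [8.0, 8.0]  -> indistinguishable
z = X[:, 0] * X[:, 1]                             # interaction feature z = x * y
g_with = 1 * X[:, 0] + 2 * X[:, 1] + 0.5 * z      # [11.0, 11.75] -> separable

print(g_without, g_with)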

In general, it's always better to have more features. Unless you have very predictive features (i.e. they allow for perfect separation of the classes to predict) already, I would always recommend adding more features. In practice, many classification algorithms (and in particular decision tree inducers) select the best features for their purposes anyway.

There are open-source Python libraries that automate feature creation / combination:
We can automate polynomial feature creation with sklearn's PolynomialFeatures.
We can automatically create spline features with sklearn's SplineTransformer.
We can combine features mathematically with Feature-engine. With MathFeatures we combine feature groups, and with RelativeFeatures we combine feature pairs.
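As a rough sketch of the sklearn route (PolynomialFeatures from sklearn.preprocessing generates powers and pairwise interaction terms of the input columns; the degree used here is just illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[6.0, 1.0],
              [3.0, 2.5]])

# degree=2 adds x^2, y^2 and the interaction x*y to the original columns
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())   # ['x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
print(X_poly)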

Related

MLJ: selecting rows and columns for training in evaluate

I want to implement a kernel ridge regression that also works within MLJ. Moreover, I want to have the option to use either feature vectors or a predefined kernel matrix as in Python sklearn.
When I run this code
using MLJ, MLJBase, MLJModelInterface, LinearAlgebra

const MMI = MLJModelInterface

MMI.@mlj_model mutable struct KRRModel <: MLJModelInterface.Deterministic
    mu::Float64 = 1::(_ > 0)
    kernel::String = "linear"
end

function MMI.fit(m::KRRModel, verbosity::Int, K, y)
    K = MLJBase.matrix(K)
    fitresult = inv(K + m.mu*I)*y
    cache = nothing
    report = nothing
    return (fitresult, cache, report)
end

N = 10
K = randn(N, N)
K = K*K
a = randn(N)
y = K*a + 0.2*randn(N)

m = KRRModel()
kregressor = machine(m, K, y)
cv = CV(; nfolds=6, shuffle=nothing, rng=nothing)
evaluate!(kregressor, resampling=cv, measure=rms, verbosity=1)
the evaluate! function evaluates the machine on different subsets of rows of K. Due to the Representer Theorem, a kernel ridge regression has a number of nonzero coefficients equal to the number of samples. Hence, a reduced size matrix K[train_rows,train_rows] can be used instead of K[train_rows,:].
To denote that I'm using a kernel matrix, I'd set m.kernel = "". How do I make evaluate! select the columns as well as the rows to form a smaller matrix when m.kernel = ""?
This is my first time using MLJ and I'd like to make as few modifications as possible.
Quoting the answer I got on the Julia Discourse from @ablaom:
The intended use of evaluate! is to estimate the generalisation error
associated with some supervised learning model, by subsampling
observations, as in cross-validation, a common use-case. I’m afraid
there is no natural way for evaluate! to do feature subsampling.
https://alan-turing-institute.github.io/MLJ.jl/dev/evaluating_model_performance/
FYI: There is a version of kernel regression implementing the MLJ model interface, namely kernel partial least squares regression from the package lalvim/PartialLeastSquaresRegressor.jl (Implementation of a Partial Least Squares Regressor).
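For comparison, here is a minimal sketch (not the MLJ solution) of the sklearn behaviour the question alludes to: with KernelRidge(kernel="precomputed"), both rows and columns of the kernel matrix are restricted to the training samples, and prediction uses the test-versus-train block.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
K = X @ X.T                                        # linear kernel matrix, shape (10, 10)
y = rng.normal(size=10)

train, test = np.arange(8), np.arange(8, 10)

model = KernelRidge(alpha=1.0, kernel="precomputed")
model.fit(K[np.ix_(train, train)], y[train])       # rows AND columns restricted to train
y_pred = model.predict(K[np.ix_(test, train)])     # test rows vs. train columns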

Why is make_friedman1 used?

I have started to learn ML and am confused by make_friedman1. It greatly improved my accuracy and increased the data size. But the data isn't the same; it changed after using this function. What does make_friedman1 actually do?
If the make_friedman1 asked about here is the one in sklearn.datasets, then it is the function that generates the “Friedman #1” regression problem. The inputs are 10 independent variables uniformly distributed on the interval [0, 1]; only 5 of these 10 are actually used. Outputs are created according to the formula:
y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e
where e is N(0,sd)
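A minimal usage sketch (the keyword names follow the sklearn.datasets API; the sample count and noise level here are arbitrary):

from sklearn.datasets import make_friedman1

# 200 samples, 10 features; only the first 5 features influence y
X, y = make_friedman1(n_samples=200, n_features=10, noise=0.1, random_state=0)
print(X.shape, y.shape)   # (200, 10) (200,)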
Quoting from Friedman's original paper, Multivariate Adaptive Regression Splines:
A new method is presented for flexible regression modeling of high
dimensional data. The model takes the form of an expansion in product
spline basis functions, where the number of basis functions as well as
the parameters associated with each one (product degree and knot
locations) are automatically determined by the data. This procedure is
motivated by the recursive partitioning approach to regression and
shares its attractive properties. Unlike recursive partitioning,
however, this method produces continuous models with continuous
derivatives. It has more power and flexibility to model relationships
that are nearly additive or involve interactions in at most a few
variables.
A spline joins many polynomial curves end-to-end to make a new smooth curve.
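As a rough illustration of spline features in Python (SplineTransformer lives in sklearn.preprocessing from scikit-learn 1.0 onwards; the knot count and degree here are just for the sketch):

import numpy as np
from sklearn.preprocessing import SplineTransformer

x = np.linspace(0, 1, 100).reshape(-1, 1)

# expand x into piecewise-polynomial (B-spline) basis features
spline = SplineTransformer(degree=3, n_knots=4)
x_spline = spline.fit_transform(x)
print(x_spline.shape)   # (100, 6): n_knots + degree - 1 columns per input feature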

Vector coefficients based on similarity

I've been looking for a solution to create a recommendation system based on vector similarity.
Basically, I have a few vectors per user, for example:
User1: [0,3,7,8,5] , [3,5,8,2,4] , [1,5,3,9,4]
User2: [3,1,6,7,9] , [2,4,1,3,8] , [7,8,3,3,1]
For every vector I need to calculate a coefficient and, based on that coefficient, differentiate one vector from another. I've found formulas that calculate a coefficient based on the similarity of two vectors, which isn't really what I want. I need a formula that calculates a coefficient per vector; then I do some other calculations with those coefficients. Are there any good formulas for this?
Thanks
So going based off your response to my comment: I don't think there's a similarity coefficient measure that will do what you want. Let me explain why...
Similarity coefficients are functions f(x, y) -> c where x and y are vectors and c is a scalar. Note that f takes two parameters. f(x,y) = f(y,x), but f(x) is meaningless - it's asking for the similarity of x relative to... nothing.
So what? We could just use a function g(x) = f(x, V) where V is a fixed vector. E.g. let V = [1, 1, ..., 1]. Now we have a monadic function that gives us a similarity value for every individual vector. But...
Knowing f(x,y) = c and f(x,z) = c' doesn't tell you a whole lot about f(y,z). Take vectors in 2-space: x = [1, 1], y = [0, 1], z = [1, 0]. A similarity function symmetric in the two dimensions would say f(x,y) = f(x,z), but hopefully not = f(y,z). So our g function above isn't very useful, because knowing how similar two vectors are to V doesn't tell us much about how similar they are to each other.
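To make this concrete, here is a small sketch using cosine similarity as one possible choice of f: x is equally similar to y and to z, yet y and z are orthogonal.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x, y, z = np.array([1, 1]), np.array([0, 1]), np.array([1, 0])

print(cosine(x, y))   # ~0.707
print(cosine(x, z))   # ~0.707
print(cosine(y, z))   # 0.0 -- equal similarity to x says nothing about y vs. z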
So what can you do? I think a simple solution to your problem would be a variation of the k nearest neighbors algorithm. It allows you to find vectors close to a given vector (or, if you prefer to find clusters of vectors without specifying a given vector, look up clustering).
EDIT: inspiration from Yahya's answer: if your vectors are super huge and knn or clustering is too difficult, consider principal component analysis or some other method of cutting them down to size (reducing the number of dimensions) - just keep in mind that whatever you do will likely be lossy.
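A minimal k-nearest-neighbours sketch with scikit-learn, using the example vectors from the question (the metric and n_neighbors are just illustrative choices):

import numpy as np
from sklearn.neighbors import NearestNeighbors

vectors = np.array([[0, 3, 7, 8, 5], [3, 5, 8, 2, 4], [1, 5, 3, 9, 4],
                    [3, 1, 6, 7, 9], [2, 4, 1, 3, 8], [7, 8, 3, 3, 1]])

nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(vectors)

# for each vector: distances and indices of its 2 closest vectors
# (the nearest neighbour of each point is the point itself, at distance 0)
distances, indices = nn.kneighbors(vectors)
print(indices)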

PCA method after StandardScaler produces poor results

I tried a classification problem for fun with the scikit-learn library. I have 10000x10-dimensional data, and I found a very weird phenomenon (for me).
pca = PCA(n_components = 2)
ss = StandardScaler()
X = pca.fit_transform(X) # explained_variance_ratio_ = 0.8
X = ss.fit_transform(X)
In this case, I got a wonderful explained_variance_ratio_ of almost 99%. But when I apply scaling first, suddenly PCA's performance drops drastically and the explained_variance_ratio_ decreases to 20%.
pca = PCA(n_components = 2)
ss = StandardScaler()
X = ss.fit_transform(X)
X = pca.fit_transform(X) # explained_variance_ratio_ = 0.2
What makes this difference? StandardScaler is just a rescaling process, so I suppose there is no information loss. Can I apply PCA before scaling for visualization convenience, or must I standardize first for mathematical soundness?
Suppose you have two features A and B that both measure distance in metres. Feature A has a greater range of values (say, 1-1000) compared to feature B, which has a smaller range (say, 1-10).
Then feature A will capture greater variance in the data than B, and hence it is not a good idea to scale the features in this case.
But if the features have two different units (say, kg and metre), then it is wise to scale the features.
P.S.: PCA preserves the components along which there is maximum variance.
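A small synthetic sketch of why this happens (the numbers are illustrative): when one raw feature has a much larger scale, PCA on the raw data attributes almost all variance to it; after standardization all ten comparable features share the variance, so two components explain far less.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
X[:, 0] *= 1000                     # one feature on a much larger scale

print(PCA(n_components=2).fit(X).explained_variance_ratio_.sum())
# close to 1.0: the large-scale feature dominates the total variance

X_scaled = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(X_scaled).explained_variance_ratio_.sum())
# roughly 0.2: ten independent, equally scaled features share the variance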

Why does logistic regression use multiplication instead of addition for the likelihood function?

This is a very basic question, but I could not find enough reasons to convince myself: why must logistic regression use multiplication instead of addition for the likelihood function l(w)?
Your question is more general than just joint likelihood for logistic regression. You're asking why we multiply probabilities instead of add them to represent a joint probability distribution. Two notes:
This applies when we assume the random variables are independent. Otherwise we need to calculate conditional probabilities using the chain rule of probability. You can look at Wikipedia for more information.
We multiply because that's how the joint distribution is defined. Here is a simple example:
Say we have two probability distributions:
X = 1, 2, 3, each with probability 1/3
Y = 0 or 1, each with probability 1/2
We want to calculate the joint likelihood function, L(X=x,Y=y), which is that X takes on values x and Y takes on values y.
For example, L(X=1,Y=0) = P(X=1) * P(Y=0) = 1/6. It wouldn't make sense to write P(X=1) + P(Y=0) = 1/3 + 1/2 = 5/6.
Now it's true that in maximum likelihood estimation, we only care about those values of some parameter, theta, which maximizes the likelihood function. In this case, we know that if theta maximizes L(X=x,Y=y) then the same theta will also maximize log L(X=x,Y=y). This is where you may have seen addition of probabilities come into play.
Hence we can take the log: log P(X=x,Y=y) = log P(X=x) + log P(Y=y).
In short
This could be summarized as "joint probabilities represent an AND". When X and Y are independent, P(X AND Y) = P(X,Y) = P(X)P(Y). Not to be confused with P(X OR Y) = P(X) + P(Y) - P(X,Y).
Let me know if this helps.
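A tiny numeric sketch of the example above, assuming independence of X and Y:

import math

p_x = 1/3                                   # P(X = 1)
p_y = 1/2                                   # P(Y = 0)

joint = p_x * p_y                           # 1/6, the joint likelihood
log_joint = math.log(p_x) + math.log(p_y)   # sum of logs

print(joint, math.isclose(log_joint, math.log(joint)))   # 0.1666... True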
