Apply different transformations on features in machine learning [closed] - machine-learning

I have features like Amount (20-100K $), Percent1 (0-1), and Percent2 (0-1). The Amount values range from 20 to 100,000 US dollars, while the percent columns hold decimals between 0 and 1. These features are positively skewed, so I applied a log transformation to Amount and a Yeo-Johnson transformation (via PowerTransformer) to the Percent1 and Percent2 columns.
Is it right to apply different transformations to different columns? Will it affect model performance, or should I apply the same transformation to all columns?

There are a few things to know before this can be answered.
The answer depends on the model you are using. For some models it is better if the ranges of the different inputs are the same; other models are agnostic to that. And of course, sometimes you might also want to assign different priorities to the inputs.
To get back to your question: depending on the model, there might be absolutely no harm in applying different transformations, or there could be performance differences.
For example, a linear regression model would be greatly affected by the feature transformation, whereas a supervised neural network most likely would not.
You might want to check this Cross Validated question: https://stats.stackexchange.com/questions/397227/why-feature-transformation-is-needed-in-machine-learning-statistics-doesnt-i
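To make the "different transformation per column" part concrete, here is a minimal scikit-learn sketch. The column names and data are made up to mirror the question; the Amount column goes through np.log1p and the percent columns through PowerTransformer(method='yeo-johnson').
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

# Toy data with the column names from the question (illustrative only)
df = pd.DataFrame({
    "Amount": np.random.uniform(20, 100_000, size=1_000),
    "Percent1": np.random.beta(2, 5, size=1_000),
    "Percent2": np.random.beta(2, 5, size=1_000),
})

# A different transformation per column group
preprocess = ColumnTransformer([
    ("log_amount", FunctionTransformer(np.log1p), ["Amount"]),
    ("yeo_johnson", PowerTransformer(method="yeo-johnson"), ["Percent1", "Percent2"]),
])

X_transformed = preprocess.fit_transform(df)
print(X_transformed.shape)  # (1000, 3)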

This is about understanding the benefit of the transformation.
Say the model is some equation like f(x1, x2) = w1*x1 + w2*x2, where x1 is the Amount (on the order of 100,000) and x2 is the Percent (on the order of 1.0).
With gradient descent, the gradient of a squared-error loss with respect to each weight is proportional to its input, so the weight updates look roughly like
w1 = w1 - lr * (error * x1)
w2 = w2 - lr * (error * x2)
where lr is the learning rate. Because x1 is about 100,000 times larger than x2, w1 receives updates that are about 100,000 times larger, which effectively tells the model that the Amount feature matters far more than the Percent feature.
That is why the data is usually transformed so the features are on a comparable scale: no feature should dominate another just because of its units.
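A quick numpy illustration of that point (hypothetical numbers, squared-error loss on a single example): the gradient for the weight attached to the large-scale feature dwarfs the other one.
import numpy as np

x = np.array([80_000.0, 0.35])   # [Amount, Percent] for one example
w = np.array([0.0, 0.0])
y = 1.0
lr = 1e-6

prediction = w @ x
error = prediction - y
grad = error * x                 # gradient of 0.5*(prediction - y)**2 w.r.t. w
print(grad)                      # [-80000.0, -0.35] -> updates differ by ~5 orders of magnitude
w -= lr * grad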

Related

scaling data for a dataset with numerical ( both continuous and discrete) and categorical variables [closed]

I am practicing regression and classification techniques on different datasets. Now I have come to this dataset and am going to practice regression algorithms.
I want to try the algorithms on raw data, normalized data, and standardized data (just as practice).
I know that for categorical variables it is enough to define them as categories, and no scaling is needed.
I learned that the dependent variable should not be scaled.
I learned that if continuous numeric variables have different units or a large difference in values, I can try scaling them.
But what about discrete numerical variables? Should I scale them? I read somewhere that if a discrete variable has a small range of values, I'd better define it as a categorical variable and not scale it. Is this a common approach in machine learning?
I would appreciate any help understanding how to treat discrete variables.
Suppose you have a feature that is the number of students in a university and you want to use that feature to predict the cost per year to study at that university (a simple linear regression task). In this case, it makes sense to treat the feature as a continuous variable, not a categorical one.
Now suppose you have a feature that is the pH of an acid, and you want to use it to predict the color it will produce when added to universal indicator solution (a solution that shows a different color for different pH values). This is a classification problem: classify between colors. Here it makes more sense to treat the feature as categorical rather than continuous, since different pH values map to distinct colors with no gradual change between them.
So when should you define a discrete feature as a categorical variable and when as a continuous one? It really depends on what the feature represents.
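Here is a brief sketch of the two treatments side by side in scikit-learn; the column names and values are hypothetical, matching the two examples above.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 'n_students' is discrete but behaves like a continuous quantity;
# 'ph_level' is discrete but is better treated as a category here.
df = pd.DataFrame({
    "n_students": [12000, 35000, 8000, 22000],
    "ph_level": [3, 7, 10, 7],
})

preprocess = ColumnTransformer([
    ("scale_continuous", StandardScaler(), ["n_students"]),
    ("encode_categorical", OneHotEncoder(handle_unknown="ignore"), ["ph_level"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 1 scaled column plus one column per distinct pH value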

Batch normalization [closed]

Why does batch normalization operate on the same feature across different samples instead of on different features of the same sample? Shouldn't it be normalization across features? In the diagram, why do we use the first row and not the first column?
Could someone help me?
Because different features of the same object mean different things, and it is not meaningful to compute statistics over those values together: they can have different ranges, means, standard deviations, etc. For example, one of your features could be the age of a person and another the height of the person; the mean of those two values is not a meaningful number.
In classic machine learning (especially in linear models and KNN) you should normalize your features, i.e., compute the mean and standard deviation of a specific feature over the entire dataset and transform that feature to (X - mean(X)) / std(X). Batch normalization is the analogue of this applied to stochastic optimization methods like SGD (it is not meaningful to use global statistics on a mini-batch, and furthermore you want to use batch norm more often than just before the first layer). More fundamental ideas can be found in the original paper.
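A small numpy sketch (made-up numbers) of what "normalize each feature across the batch" means, as opposed to normalizing each sample across its features:
import numpy as np

# Hypothetical mini-batch: 4 samples (rows) x 2 features (columns): [age, height_cm]
batch = np.array([[25.0, 170.0],
                  [40.0, 182.0],
                  [31.0, 165.0],
                  [58.0, 177.0]])

# Batch-norm style: statistics per feature, computed across the batch dimension (axis=0)
mean = batch.mean(axis=0)
std = batch.std(axis=0)
normalized = (batch - mean) / (std + 1e-5)   # small epsilon for numerical stability

# Each *column* (feature) now has roughly zero mean and unit variance;
# normalizing each *row* (sample) would mix age and height, which is meaningless.
print(normalized.mean(axis=0), normalized.std(axis=0))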

When doing a PCA analysis, how can we know which principal components were selected? [closed]

I read the article from the link below.
https://towardsdatascience.com/principal-component-analysis-pca-with-scikit-learn-1e84a0c731b0
The author does a nice job describing the process of PCA decomposition. I feel like I understand everything except for one thing: how can we know which principal components were selected, and thus preserved, for the eventual improved performance of our ML algorithms? For instance, the author starts with this.
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df.head()
Ok, I know what the features are. Great. Then all the fun happens, and we end up with this at the end.
df_new = pd.DataFrame(X_pca_95, columns=['PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10'])
df_new['label'] = cancer.target
df_new
Of the 30 features that we started with, how do we know what the final 10 columns consist of? It seems like there has to be some kind of last step to map the df_new to df?
To understand this, you need to know a little more about PCA. PCA computes all the principal components that span the whole feature space, i.e., the eigenvectors of the covariance matrix of the features together with their eigenvalues. You can then select eigenvectors based on the size of their corresponding eigenvalues: you keep the ones with the largest eigenvalues.
Now if you look at the documentation of PCA method in scikit learn, you find some useful properties like the following:
components_: ndarray of shape (n_components, n_features). Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
explained_variance_ratio_: ndarray of shape (n_components,). Percentage of variance explained by each of the selected components. If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
explained_variance_ratio_ is a very useful property that you can use to select principal components based on a desired threshold for the percentage of variance covered. For example, say the values in this array are [0.4, 0.3, 0.2, 0.1]; if we take the first three components, they cover 90% of the total variance of the original data.
Almost surely the 10 resulting columns are all composed of pieces of all 30 original features. The PCA object has an attribute components_ that shows the coefficients defining the principal components in terms of the original features.
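Here is a short sketch of how you could inspect that mapping on the same breast-cancer data; the 0.95 variance threshold mirrors the article's X_pca_95, but the variable names below are my own.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X = StandardScaler().fit_transform(cancer.data)

pca = PCA(n_components=0.95)           # keep enough components to cover 95% of the variance
X_pca = pca.fit_transform(X)

print(X_pca.shape)                      # (569, number_of_kept_components)
print(pca.explained_variance_ratio_)    # variance covered by each kept component

# Each principal component is a weighted combination of all 30 original features
loadings = pd.DataFrame(pca.components_,
                        columns=cancer.feature_names,
                        index=[f"PC{i+1}" for i in range(pca.n_components_)])
print(loadings.head())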

Which is given more importance Precision or Recall in classification report for obtaining a better prediction model [closed]

For a classification problem in machine learning, which of the precision and recall values in the classification report should be given more importance to get a better model?
It actually depends on your classification problem.
First, you need to understand the difference between precision and recall.
Wikipedia may be a good start, but I would suggest this resource by developers.google.
Now imagine you're trying to track covid cases with a classifier.
The classifier tells you if a patient is carrying covid or not.
Are you more interested in:
A) Identifying all possible covid cases?
B) Being sure that if you identify a covid case that one is actually a real covid case?
If A) is more important you should focus on recall. On the other hand, if you're more interested in B), then precision is probably what you're looking for.
Be aware though:
Let's say you're testing 1000 possible cases, and that 500 of them are actually positive; we just don't know it yet. You use the classifier and it tells you that all 1000 people are positive.
So you have:
true_positives = 500
false_negatives = 0
recall = true_positives / (true_positives + false_negatives)
recall = 500 / (500 + 0) = 1
So here you have perfect recall, but you're not precise (500 of the 1000 predicted positives are false positives, so precision is only 0.5), nor accurate.
What I'm trying to express is that one shouldn't focus on one metric over another, but always keep a broad view on the problem.
However, if you want to focus on just one metric to sum up both precision and recall, that's what the F score was made for.
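As a small scikit-learn sketch of that scenario (the labels below are made up to reproduce the 1000-case example: 500 real positives, classifier flags everyone as positive):
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = np.array([1] * 500 + [0] * 500)   # 500 real positives out of 1000 cases
y_pred = np.ones(1000, dtype=int)          # classifier predicts everyone as positive

print(recall_score(y_true, y_pred))     # 1.0  -> no positive case is missed
print(precision_score(y_true, y_pred))  # 0.5  -> half of the flagged cases are wrong
print(f1_score(y_true, y_pred))         # ~0.67 -> single score balancing both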

Can machine learning algorithm learn to predict a "random" number from a generator to at least 50% accuracy? [closed]

Because all random number generators are really pseudo-random number generators, can a machine learning algorithm eventually, with enough training data, learn to predict the next "random" number with at least 50% accuracy?
If you are generating just random bits (0 or 1), then any method will get 50%: literally any, ML or not, trained or not, anything besides direct exploitation of the underlying random number generator (like reading the seed and then using that same generator as the predictor). So in that case the answer is yes.
If you consider more "numbers", then no, it is not possible, unless your random number generator is not actually a valid one. The weaker the generating process and the better your model, the easier it is to predict what is happening. For example, if you know exactly what the random number generator looks like, and it is just an iterated function with some parameters f(x|params), where you start with a random seed s and then x1 = f(s|params), x2 = f(x1|params), ..., then you can use ML to learn the state of such a system; it is just a matter of finding the params which, plugged into f, generate the observed values. Now, the more complex f is, the harder the problem becomes. For typical random number generators f is too complex to learn, because you cannot observe any relation between close values: if you predict "5.8" and the answer was "5.81", the next sample from your model might be "123" while the true generator produces "-2". It is a completely chaotic process.
To sum up, this is possible only in very easy cases:
either there are just 2 values (then there is nothing to learn; literally any method that is not cheating will get 50%),
or the random number generator is seriously flawed, you know what type of flaw it is, and you can design a parametric model to approximate it.
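As a quick illustration of the bit case (my own sketch, not part of the answer): a logistic regression fed the previous 16 bits of NumPy's generator scores about 50% on held-out bits, i.e. chance level.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=20_000)           # bits from a decent PRNG

# Try to predict each bit from the 16 bits that precede it
window = 16
X = np.array([bits[i:i + window] for i in range(len(bits) - window)])
y = bits[window:]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))                # hovers around 0.5, i.e. chance level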
