Batch normalization [closed] - machine-learning

Why does batch normalization operate on the same feature across different samples rather than on the different features of the same sample? Shouldn't it be a normalization of the different features? In the diagram, why do we use the first row and not the first column?
Could someone help me understand this?

Because different features of the same object mean different things, and it is not meaningful to compute statistics over them: they can have different ranges, means, standard deviations, etc. For example, one of your features could be a person's age and another one the person's height; the mean of those two values is not a meaningful number.
In classic machine learning (especially with linear models and KNN) you should normalize your features, i.e. compute the mean and standard deviation of each specific feature over the entire dataset and transform it to (X - mean(X)) / std(X). Batch normalization is the analogue of this for stochastic optimization methods such as SGD (it is not meaningful to use global statistics on a mini-batch, and you also want to apply batch norm more often than just before the first layer). The more fundamental ideas can be found in the original paper.
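As a rough illustration (not from the original answer), this is what "normalizing each feature over the batch" looks like in numpy; a real BatchNorm layer additionally keeps running statistics and learnable scale/shift parameters, which are omitted here:

import numpy as np

# Toy mini-batch: 4 samples (rows) x 2 features (columns),
# e.g. column 0 = age in years, column 1 = height in cm.
X = np.array([[25., 170.],
              [40., 160.],
              [31., 182.],
              [55., 175.]])

# Statistics are computed per feature, i.e. over the batch axis (axis=0):
# averaging down a column mixes the same quantity across people, which is
# meaningful; averaging across a row would mix years with centimeters.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / (std + 1e-8)   # small epsilon for numerical stability

print(X_norm.mean(axis=0))  # ~0 for each feature
print(X_norm.std(axis=0))   # ~1 for each feature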

Related

How to classify text with 35+ classes; only ~100 samples per class? [closed]

The task is seemingly straightforward: given a list of classes and some samples/rules of what belongs in each class, assign every relevant text sample to it. The classes are arguably dissimilar, but they have a high degree of overlap in vocabulary.
Precision is most important, but acceptable recall is about 80%.
Here is what I have done so far:
Checked whether any of the samples have direct word/lemma matches to the samples in the class's corpus of words (high precision but low recall; it covered about 40% of the text).
Formed a cosine-similarity matrix between each class's corpus of words and the remaining text samples, cut off at an empirical threshold; this helped me identify a couple of new texts that are very similar (covered maybe 10% more text).
I appended each sample picked up by the word match / lemma match / embedding match (using sbert) to the class's corpus of words.
Essentially I increased the number of samples per class. Note that there are 35+ classes, and even with this method I only got to about 200-250 samples per class.
I converted each class's samples to embeddings via sbert and then used UMAP to reduce dimensions. UMAP also has a secondary, less used, use case: it can learn a representation and transform new data into that same representation. I used this to convert text to embeddings, reduce them via UMAP, and save the UMAP transformation. On this reduced representation I built a voting classifier (XGB, RF, KNearestNeighbours, SVC and Logistic Regression) with a hard voting criterion.
The unclassified texts then went through the prediction pipeline (sbert embeddings -> lower-dimensional embeddings via the saved UMAP -> class prediction via the voter).
Is this the right approach when trying to classify among a large number of classes with a small training set?
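For reference, a rough sketch of the pipeline described above (not the exact code; the model name and hyperparameters are illustrative placeholders, and sentence-transformers, umap-learn, scikit-learn and xgboost are assumed to be installed):

from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
import umap

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative sbert model

def fit_pipeline(train_texts, train_labels):
    # train_labels are assumed to be integer-encoded class ids
    X = encoder.encode(train_texts)                  # sbert embeddings
    reducer = umap.UMAP(n_components=20).fit(X)      # learn and keep the UMAP transform
    voter = VotingClassifier(
        estimators=[("xgb", XGBClassifier()),
                    ("rf", RandomForestClassifier()),
                    ("knn", KNeighborsClassifier()),
                    ("svc", SVC()),
                    ("lr", LogisticRegression(max_iter=1000))],
        voting="hard")
    voter.fit(reducer.transform(X), train_labels)
    return reducer, voter

def predict(texts, reducer, voter):
    # reuse the saved UMAP transform on unseen text
    return voter.predict(reducer.transform(encoder.encode(texts)))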

Scaling data for a dataset with numerical (both continuous and discrete) and categorical variables [closed]

I am practicing regression and classification techniques on different datasets. Now I have come to a dataset on which I am going to practice regression algorithms.
I want to try the algorithms on raw data, normalized data and standardized data (just as practice).
Now I know for categorical variables it is enough to define them as categories and no scaling is needed.
I have learnt that the dependent variable should not be scaled.
I have learnt that if continuous numeric variables have different units or large differences in values, I can try scaling them.
But what about discrete numerical variables? Should I scale them? I have read that if a discrete variable has a small range of values, it is better to define it as a categorical variable and not scale it. Is this a common approach in machine learning?
I would appreciate any help understanding how to treat discrete variables.
Suppose you have a feature that is the number of students in a university, and you want to use it to predict the cost per year of studying at that university (a simple linear regression task). In this case it makes sense to treat the feature as a continuous variable, not a categorical one.
Now suppose you have a feature that is the pH of an acid, and you want to use it to predict the color produced when the acid is added to universal indicator solution (a solution that shows a different color for different pH values); this is now a classification problem: classify between colors. In this case it does not make sense to treat the feature as continuous; it should be categorical, because different pH values give different colors with no gradual change.
So when should you define a discrete feature as a categorical variable and when as a continuous one? It depends on what the feature represents.
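As a small illustration of the two treatments (hypothetical column names, using scikit-learn): scale a wide-range count as a continuous number, and one-hot encode a small-range discrete value as a category.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "num_students": [5000, 12000, 30000, 45000],  # wide range -> treat as continuous, scale it
    "ph": [3, 7, 9, 11],                          # small discrete set -> treat as categorical
})

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["num_students"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["ph"]),
])

X = preprocess.fit_transform(df)
print(X)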

Apply different transformations on features in machine learning [closed]

I have features like Amount (20-100K $), Percent1 (i.e. 0-1) and Percent2 (i.e. 0-1). The Amount values are between 20 and 100,000 US dollars, and the percent columns are decimals between 0 and 1. These features are positively skewed, so I applied a log transformation to Amount and Yeo-Johnson (using PowerTransformer) to the Percent1 and Percent2 columns.
Is it right to apply different transformations to different columns? Will it have an effect on model performance, or should I apply the same transformation to all columns?
There are some things that need to be known before the question can be answered.
The answer depends on the model you are using. For some models it is better if the ranges of the different inputs are the same; other models are agnostic to that. And of course sometimes one might also want to assign different priorities to the inputs.
To get back to your question: depending on the model, there might be absolutely no harm in applying different transformations, or there could be performance differences.
For example, linear regression models would be greatly affected by the feature transformation, whereas supervised neural networks most likely wouldn't.
You might want to check this Cross Validated question about the benefits of transformation: https://stats.stackexchange.com/questions/397227/why-feature-transformation-is-needed-in-machine-learning-statistics-doesnt-i
When we have an equation like f(x1, x2) = w1*x1 + w2*x2, and x1 is on the order of 100,000 (like Amount) while x2 is on the order of 1.0 (like the percents), the gradient-descent update for each weight is proportional to its feature value, roughly
w1 = w1 - lr * error * x1
w2 = w2 - lr * error * x2
where lr is the learning rate and error is the prediction error. So w1 is updated about 100,000 times faster than w2, which effectively tells the model that the Amount feature is far more important than the percent feature. That is why the data is usually transformed onto comparable scales, so that no feature is favoured over another just because of its units.
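For completeness, applying a different transformation per column is straightforward with scikit-learn's ColumnTransformer; a minimal sketch using the columns from the question (the data values are made up):

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

df = pd.DataFrame({
    "Amount": [20, 500, 10_000, 100_000],
    "Percent1": [0.01, 0.2, 0.55, 0.9],
    "Percent2": [0.05, 0.1, 0.4, 0.95],
})

preprocess = ColumnTransformer([
    ("log", FunctionTransformer(np.log1p), ["Amount"]),                       # log transform
    ("yeo", PowerTransformer(method="yeo-johnson"), ["Percent1", "Percent2"]),  # Yeo-Johnson
])

X = preprocess.fit_transform(df)
print(X)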

What are the performance metrics for Clustering Algorithms? [closed]

I'm working on K-means clustering, but unlike supervised learning I cannot figure out the performance metrics for clustering algorithms. How do I measure accuracy after training on the data?
For k-means you can look at its inertia_, which gives you an idea of how well the algorithm has worked.
from sklearn.cluster import KMeans

kmeans = KMeans(...)   # your chosen parameters, e.g. n_clusters
kmeans.fit(X)          # assuming X is the data you have already fitted on
kmeans.inertia_        # lower is better
Alternatively, you can call the score() function, which gives you the same value but with a negative sign. The convention is that a bigger score is better, but for k-means a lower inertia_ is better, so an extra negation is applied to keep them consistent.
# Call score with data X
kmeans.score(X) # greater is better
This is the most basic way of analyzing the performance of k-means. In practice, if you make the number of clusters too high, score() will increase accordingly (in other words, inertia_ will decrease), because inertia_ is nothing but the sum of squared distances from each point to the centroid of the cluster it is assigned to. If you increase the number of clusters too much, that sum will shrink because each point gets a centroid very near to it, even though the quality of the clustering is terrible in that case. So for a better analysis you should compute the silhouette score, or even better, use a silhouette diagram.
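A minimal sketch of computing the silhouette score with scikit-learn (assuming the X and the fitted kmeans from above); it ranges from -1 to 1 and higher is better:

from sklearn.metrics import silhouette_score

labels = kmeans.predict(X)          # cluster assignment for each point
print(silhouette_score(X, labels))  # higher is better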
You will find all of the implementations in this notebook: 09_unsupervised_learning.ipynb
The book corresponding to this repository is: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. It is a great book to learn all of these details.

Davies-Bouldin Index higher or lower score better [closed]

I trained a KMeans clustering model using Google BigQuery, and it gives me these metrics in the evaluation tab of my model. My question is: are we trying to maximize or minimize the Davies-Bouldin index and the mean squared distance?
The Davies-Bouldin index is a validation metric that is often used to evaluate the optimal number of clusters. It is defined as a ratio between within-cluster scatter and between-cluster separation, and a lower value means the clustering is better.
Regarding the second metric, the mean squared distance refers to the intra-cluster variance, which we also want to minimize: a lower within-cluster sum of squares (WCSS) means tighter clusters that are better separated from each other.
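The question is about BigQuery ML, but for illustration the same metric is available in scikit-learn (X and labels being your data and cluster assignments); lower is better:

from sklearn.metrics import davies_bouldin_score

print(davies_bouldin_score(X, labels))  # lower is better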
