Confused about implementation of skip layers in CNN [closed] - machine-learning

I'm reading about AlphaGo Zero's network structure and came across this cheatsheet:
I'm having a hard time understanding how skip connections work dimensionally.
Specifically, it seems like each residual layer ends up with 2 stacked copies of the input it receives. Would this not cause the input size to grow exponentially with the depth of the network?
And could this be avoided by changing the output channel size of the Conv2d filter? I see that in_C and out_C don't have to be the same in PyTorch, but I don't know enough to understand the implications of these values being different.

With a skip connection you can indeed end up with twice the number of channels per connection; this is what happens when you concatenate the channels together. However, the size does not have to grow exponentially, as long as you keep the number of output channels (what you refer to as out_C) under control.
For instance, if a skip connection provides a total of n channels and the convolutional layer receives in_C channels as input, you can set out_C to n as well, so that the number of channels after concatenation is 2*n. Ultimately, you decide the number of output channels of each convolution; it is all about network capacity and how much the network will be capable of learning.
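To make the bookkeeping concrete, here is a minimal PyTorch sketch of a concatenation-style skip where out_C is chosen equal to the incoming channel count, so each block emits 2*n channels instead of doubling on top of previous doublings (the layer sizes here are illustrative, not AlphaGo Zero's actual configuration):

import torch
import torch.nn as nn

class ConcatSkipBlock(nn.Module):
    """Concatenate the input with the conv output along the channel axis."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x):
        # (N, in_C, H, W) -> (N, in_C + out_C, H, W)
        return torch.cat([x, self.conv(x)], dim=1)

x = torch.randn(1, 64, 19, 19)    # e.g. 64 feature planes on a 19x19 board
block = ConcatSkipBlock(in_channels=64, out_channels=64)
print(block(x).shape)             # torch.Size([1, 128, 19, 19])

Note that an additive residual block (output = x + conv(x)), which is what AlphaGo Zero's residual tower actually uses, requires out_C == in_C and leaves the channel count unchanged, so no growth happens at all.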

Related

How to classify text with 35+ classes; only ~100 samples per class? [closed]

The task is seemingly straightforward: given a list of classes and some samples/rules of what belongs in each class, assign all relevant text samples to it. The classes are arguably dissimilar, but they have a high degree of overlap in terms of vocabulary.
Precision is most important, but acceptable recall is about 80%.
Here is what I have done so far:
Checked if any of the samples have direct word matches/lemma matches to the words in a class's corpus. (High precision but low recall; this covered about 40% of the text.)
Formed a cosine-similarity matrix between each class's corpus of words and the remaining text samples, cut off at an empirical threshold. It helped me identify a couple of new texts that are very similar. (Covered maybe 10% more text.)
Appended each sample picked up by the word match/lemma match/embedding match (using sbert) to the class's corpus of words, essentially increasing the number of samples per class. Note that there are 35+ classes, and even with this method I only got to about 200-250 samples per class.
Converted each class's samples to embeddings via sbert, then used UMAP to reduce dimensions. UMAP also has a secondary, less commonly used, capability: it can learn a representation and transform new data into that same representation. I used this to convert text to embeddings, reduce them via UMAP, and save the UMAP transformation. On this reduced representation I built a voting classifier (XGB, RF, KNearestNeighbours, SVC and Logistic Regression) set to hard voting.
The unclassified texts then go through the prediction pipeline (sbert embeddings -> lower-dimensional embeddings via the saved UMAP -> class prediction via the voter), as sketched below.
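For concreteness, a rough sketch of that pipeline in code (the sbert model name, UMAP dimensionality and classifier settings are illustrative placeholders, and texts/labels/new_texts are assumed to be already loaded):

from sentence_transformers import SentenceTransformer
import umap
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# texts: list of training samples, labels: their class ids, new_texts: unclassified samples
encoder = SentenceTransformer("all-MiniLM-L6-v2")      # any sbert model
train_emb = encoder.encode(texts)

reducer = umap.UMAP(n_components=20).fit(train_emb)    # keep this object to reuse the transform
train_low = reducer.transform(train_emb)

voter = VotingClassifier(
    estimators=[
        ("xgb", XGBClassifier()),
        ("rf", RandomForestClassifier()),
        ("knn", KNeighborsClassifier()),
        ("svc", SVC()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
voter.fit(train_low, labels)

# prediction pipeline: sbert embeddings -> saved UMAP -> voter
preds = voter.predict(reducer.transform(encoder.encode(new_texts)))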
Is this the right approach when trying to classify among a large number of classes with a small training data size?

Best practices to run a random forest model as fast as possible [closed]

I want to run a random forest classifier model. My data set is pretty big, with 1 million rows and 300 columns. Of course, I'd prefer not to run the model for something like 3 days non-stop, so I was wondering if there are some good practices to find the optimal trade-off between running time and prediction quality.
Here are some examples of what I was thinking:
Can I use a random subsample of x rows to tune the parameters and then use those parameters for the model trained on all the data? (If yes, how do I find the best value for x?)
Is there a way to know at what point it is useless to keep adding more data because the prediction will stop improving? (i.e., what is the minimum number of rows that will give me the best results for the running time?)
How can I estimate the running time of the model? With 4000 rows the model takes 4 min; with 8000 it takes 10 min. Is the running time exponential, or is it more or less linear, so that I could expect about 1280 min of running time with 1 million rows?
Tuning on a random subsample and then reusing those parameters on the full data rarely works well, as a small subsample may not be representative of the full data.
About the amount of the data vs the model quality: try using learning curves from sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# X, y: your feature matrix and labels
estimator = RandomForestClassifier(n_jobs=-1)
train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
    estimator,
    X,
    y,
    cv=5,                                   # 5-fold cross-validation
    n_jobs=-1,                              # use all cores
    train_sizes=np.linspace(0.1, 1.0, 5),   # fractions of the data to try
    return_times=True,
)
This way you'll be able to plot the amount of data against the model performance.
Here are some examples of plotting:
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_kernel_ridge_regression.html#sphx-glr-auto-examples-miscellaneous-plot-kernel-ridge-regression-py
https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html#sphx-glr-auto-examples-model-selection-plot-learning-curve-py
Estimating total time is difficult, because it isn't linear.
Some additional practical suggestions:
set n_jobs=-1 to run the model in parallel on all cores;
use a feature selection approach to decrease the number of features; 300 features is really a lot, and it should usually be possible to get rid of around half of them without a serious decline in model performance (a sketch of both ideas follows below).
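A minimal sketch of those two suggestions, assuming X and y are your data; the estimator counts and the "median" threshold are illustrative choices, not tuned values:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# X, y: your 1,000,000 x 300 dataset
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, n_jobs=-1),   # cheap forest just for importances
    threshold="median",                                   # keep roughly the top half of the features
)
X_reduced = selector.fit_transform(X, y)

model = RandomForestClassifier(n_estimators=200, n_jobs=-1)  # n_jobs=-1: all CPU cores
model.fit(X_reduced, y)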

Batch normalization [closed]

Why does batch normalization work on the same feature across different samples instead of on different features of the same sample? Shouldn't it be normalization over the different features? In the diagram, why do we use the first row and not the first column? Could someone help me?
Because different features of the same object mean different things, and it is not meaningful to compute statistics over them together: they can have different ranges, means, standard deviations, etc. For example, one of your features could be the age of a person and another one the height of that person; taking the mean of those two values does not give you any meaningful number.
In classic machine learning (especially for linear models and KNN) you should normalize your features, i.e. compute the mean and std of a specific feature over the entire dataset and transform the feature to (X - mean(X)) / std(X). Batch normalization is the analogue of this for stochastic optimization methods such as SGD (it is not meaningful to use global statistics on a mini-batch, and you usually want to apply batch norm at many layers, not just before the first one). The more fundamental ideas can be found in the original paper.
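As a small illustration of the axis question, here is a NumPy sketch with a toy batch (the numbers are made up); the statistics are taken per feature, across the samples:

import numpy as np

# toy batch: 4 samples (rows) x 2 features (columns) = [age in years, height in cm]
batch = np.array([[25., 180.],
                  [40., 165.],
                  [31., 172.],
                  [58., 158.]])

# batch-norm style: one mean/std per feature, computed across the samples (axis=0)
mean = batch.mean(axis=0)
std = batch.std(axis=0)
normalized = (batch - mean) / (std + 1e-5)
print(normalized)   # each column now has roughly zero mean and unit variance

# normalizing across the features of one sample (axis=1) would average an age
# with a height, which has no meaningful common scale -- that is why it is not done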

Can anyone please explain me the functioning of standard scalers in python sklearn [closed]

I read about them and found that they basically scale the values. So don't they change the values of the records? And if they only scale the values up/down, shouldn't the graph look the same every time? Yet I saw the graph change depending on which scaler I selected. Please explain this to me, as I am new to this.
Standardizing the features around the center, at 0 with a standard deviation of 1, is important when we compare measurements that have different units. Variables that are measured at different scales do not contribute equally to the analysis and might end up creating a bias. However, the minimum and maximum values vary according to how spread out the variable was to begin with, and are highly influenced by the presence of outliers.
For example, a variable that ranges between 0 and 1000 will outweigh a variable that ranges between 0 and 1. Using these variables without standardization effectively gives the variable with the larger range a weight of 1000 in the analysis. Transforming the data to comparable scales can prevent this problem; typical data standardization procedures equalize the range and/or the variability of the data.
Note in particular that because the outliers of each feature have different magnitudes, the spread of the transformed data on each feature can still be very different: StandardScaler cannot guarantee balanced feature scales in the presence of outliers.
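To see why the picture changes with the scaler you pick, here is a small sketch comparing two common scalers on made-up data (the numbers are purely illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# two features on very different scales, with an outlier in the second column
X = np.array([[1., 10.],
              [2., 500.],
              [3., 1000.],
              [4., 100000.]])

print(StandardScaler().fit_transform(X))  # (x - mean) / std, per column
print(MinMaxScaler().fit_transform(X))    # (x - min) / (max - min), per column
# the two outputs differ, so plots of the scaled data differ depending on the scaler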
As for the changes you saw in the graph depending on the scaler: one possible reason is that you used StandardScaler() on data containing NaNs (missing values), which it does not handle. Dealing with NaN values is not that simple; it requires analysing the data before taking any further step. There are various ways you can deal with these missing values (the following is not an exhaustive list; a short sketch follows below):
Ignore the rows with missing values altogether: the problem with this approach is that those rows might contain important information in other columns, and ignoring them would lead to an incomplete analysis.
Replace them with another value: this is one of the commonly used approaches, but the choice of replacement value will affect your overall analysis. You could replace them with, say, the mean, or with a placeholder value (like -1) that you know never occurs in the column.
Use regression to substitute the values.
Use KNN to substitute the values.
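A minimal sketch of the "replace" and KNN options using scikit-learn's imputers before scaling (the strategy choices and the tiny array are illustrative):

import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [np.nan, 400.0],
              [4.0, 800.0]])

# replace missing values with the column mean, then standardize
pipe_mean = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
print(pipe_mean.fit_transform(X))

# or impute from the nearest neighbours, then standardize
pipe_knn = make_pipeline(KNNImputer(n_neighbors=2), StandardScaler())
print(pipe_knn.fit_transform(X))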

Davies-Bouldin Index higher or lower score better [closed]

I trained a KMEANS clustering model using Google BigQuery, and it gives me these metrics in the evaluation tab of my model. My question is: are we trying to maximize or minimize the Davies-Bouldin index and the mean squared distance?
The Davies-Bouldin index is a validation metric that is often used to evaluate the optimal number of clusters. It is defined as a ratio between the within-cluster scatter and the separation between clusters, and a lower value means the clustering is better.
Regarding the second metric, the mean squared distance refers to the intra-cluster variance, which we also want to minimize: a lower within-cluster sum of squares (WCSS) means the clusters are more compact and better separated.
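BigQuery ML reports these values for you, but as an illustration of how "lower is better" plays out, here is a scikit-learn sketch comparing the Davies-Bouldin index for several values of k on synthetic data (scikit-learn is just a local stand-in here, not the BigQuery workflow):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, davies_bouldin_score(X, labels))
# the k with the smallest score is the preferred clustering under this metric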
