Why doesn't multiple kernel learning improve the accuracy when some of the kernels get weights? - machine-learning

Why doesn't multiple kernel learning improve the accuracy when some of the kernels get weights? Does it mean that combining several kernels can't always improve the accuracy?

There are many aspects to this question; I will list a few of the most common reasons:
your data is too small to learn a complex function without heavy overfitting
your kernel weight assignment technique (optimization) is overfitting, due to the nature of the optimization and/or overly flexible kernel ensemble definitions
not optimizing something (like weights or kernels) is actually a kind of regularization: if you start to fit this part, you increase the variance of your method, so you will need to compensate with some variance reduction technique (like additional regularizers)
poor selection of kernels: kernels that are too heavily correlated, or kernels whose internal characteristics are not well suited to the data, might be a problem, because the weight assignment technique will struggle to remove redundant kernels
In general:
Each additional level of abstraction in the optimization process leads to non-linear growth in the problem complexity, thus in order to fit it correctly you need:
more data
better optimization techniques
better knowledge of the underlying mathematical models
compared to using a simple model.
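To make the "fixed weights act as regularization" point concrete, here is a minimal sketch of a two-kernel combination with hand-set weights. The function names, the choice of a linear plus an RBF kernel, and the sample points are assumptions made for illustration; they are not taken from the question. A non-negative weighted sum of valid kernels is itself a valid kernel, so the only thing that changes when the weights are learned is the number of free parameters.

```python
import math

def linear_kernel(x, y):
    # Plain inner product.
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 * sigma^2))
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def combined_kernel(x, y, weights):
    # Weighted sum of the two base kernels; weights are assumed non-negative.
    w_lin, w_rbf = weights
    return w_lin * linear_kernel(x, y) + w_rbf * rbf_kernel(x, y)

def gram_matrix(points, weights):
    # Pairwise kernel values for all points.
    return [[combined_kernel(p, q, weights) for q in points] for p in points]

points = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]  # hypothetical data
fixed = gram_matrix(points, (0.5, 0.5))   # weights fixed by hand, nothing to fit
single = gram_matrix(points, (1.0, 0.0))  # reduces to the linear kernel alone
```

With fixed weights the downstream classifier sees one ordinary kernel. Treating `weights` as extra parameters to optimize is what adds the variance discussed above.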

Related

How to deal with dataset of different features?

I am working on an MLP model for a CEA Classification Dataset (binary classification). Each sample contains 4 different features, such as resistance and other values, each in its own range (resistance in the hundreds, another in micros, etc.). I am still new to machine learning and this is the first real model I am building. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption is that this kind of data should be scaled. If so, what are some useful resources to look at, since I do not quite understand when scaling is required?
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA Classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your specific dataset. You can use the scikit-learn preprocessing classes MinMaxScaler and StandardScaler for normalization and standardization respectively.
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.
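The two scaling schemes described above can be sketched in plain Python. The feature values are made up for illustration, and the helper names are hypothetical; in practice the scikit-learn scalers mentioned in the answer do the same job per column.

```python
import statistics

def min_max_scale(values):
    # Normalization: map the smallest value to 0 and the largest to 1.
    # Assumes the values are not all identical.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Standardization: subtract the mean, divide by the standard deviation.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

resistance = [100.0, 200.0, 300.0, 400.0]  # hypothetical feature in the hundreds
normalized = min_max_scale(resistance)
standardized = standardize(resistance)
```

Each feature is scaled independently, so a feature in the hundreds and one in the micro range end up on comparable scales before they reach the network.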

Can normalization decrease model performance of ensemble methods?

Normalization e.g. z-scoring is a common preprocessing method in Machine Learning.
I am analyzing a dataset and use ensemble methods like Random Forests or the XGBOOST framework.
Now I compare models using
1. non-normalized features
2. z-scored features
Using cross-validation, I observe in both cases that the training error decreases with a higher max_depth parameter.
For the first case, the test error also decreases and saturates at a certain MAE.
For the z-scored features, however, the test error does not decrease at all.
In this question: https://datascience.stackexchange.com/questions/16225/would-you-recommend-feature-normalization-when-using-boosting-trees it was discussed that normalization is not necessary for tree based methods. But the example above shows that it has a severe effect.
So I have a few questions regarding this:
Does it imply that overfitting with ensemble based methods is possible even when the test error decreases?
Should normalization like z-scoring always be common practice when working with ensemble methods?
Is it possible that normalization methods decrease the model performance?
Thanks!
It is not easy to see what is going on in the absence of any code or data.
Normalisation may or may not be helpful depending on the particular data and how the normalisation step is applied.
Tree based methods ought to be robust enough to handle the raw data.
In your cross validations is your code doing the normalisation separately for each fold?
Doing a single normalisation prior to cv may lead to significant leakage.
With very high values of depth you will have a much more complex model that will fit the training data well but will fail to generalise to new data.
I tend to prefer max depths from 2 to 5.
If I can't get a reasonable model I turn my efforts to feature engineering rather than trying to tweak the hyperparameters too much.
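The leakage point can be illustrated with a small sketch in which the scaling statistics are recomputed from the training part of every fold. The data, the contiguous fold split, and the helper names are assumptions for illustration only; the original question shows no code.

```python
import statistics

def fit_scaler(train_values):
    # Statistics come from the training fold only.
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def apply_scaler(values, mean, std):
    return [(v - mean) / std for v in values]

def k_fold_indices(n_samples, n_folds):
    # Contiguous folds; assumes n_samples is divisible by n_folds.
    fold_size = n_samples // n_folds
    for k in range(n_folds):
        test_idx = list(range(k * fold_size, (k + 1) * fold_size))
        train_idx = [i for i in range(n_samples) if i not in test_idx]
        yield train_idx, test_idx

feature = [1.0, 2.0, 3.0, 4.0, 50.0, 60.0]  # hypothetical single feature
fold_stats = []
for train_idx, test_idx in k_fold_indices(len(feature), n_folds=3):
    train = [feature[i] for i in train_idx]
    test = [feature[i] for i in test_idx]
    mean, std = fit_scaler(train)                # fit on the training fold
    train_scaled = apply_scaler(train, mean, std)
    test_scaled = apply_scaler(test, mean, std)  # reuse training statistics
    fold_stats.append((mean, std))
```

Each fold gets its own mean and standard deviation, and none of them equals the statistics of the full dataset. A single z-scoring pass before cross-validation would instead let the held-out samples influence the scaling.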

Depthwise separable Convolution

I am new to Deep Learning and I recently came across Depthwise Separable Convolutions. They significantly reduce the computation required to process the data, needing only about 10% of the computation of a standard convolution step.
I am curious as to what is the intuition behind doing this? We do achieve greater speed by reducing number of parameters and doing less computation, but is there a trade-off in performance?
Also, is it only used for some specific use cases like images, etc or can it be applied to all forms of data?
Intuitively, depthwise separable convolutions (DSCs) model the spatial correlation and the cross-channel correlation separately, while regular convolutions model them simultaneously. In our recent paper published at BMVC 2018, we give a mathematical proof that DSC is nothing but the principal component of regular convolution. This means it can capture the most effective part of regular convolution and discard other redundant parts, making it super efficient. For the tradeoff, in our paper, we gave some empirical results on data-free decomposition of regular convolutions in VGG16. However, with enough fine-tuning, we could largely reduce the accuracy degradation. Hope our paper helps you understand DSC further.
Intuition
The intuition behind doing this is to decouple the spatial information (width and height) and the depthwise information (channels). While regular convolutional layers will merge feature maps over the number of input channels, depthwise separable convolutions will perform another 1x1 convolution before adding them up.
Performance
Using a depthwise separable convolutional layer as a drop-in replacement for a regular one will greatly reduce the number of weights in the model. It will also very likely hurt the accuracy due to the much smaller number of weights. However, if you change the width and depth of your architecture to increase the weights again, you may reach the same accuracy as the original model with fewer parameters. At the same time, a depthwise separable model with the same number of weights might achieve a higher accuracy compared to the original model.
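A quick weight count shows where the reduction comes from. This is a sketch that ignores biases, and the layer sizes (3x3 kernel, 64 input channels, 128 output channels) are picked only for illustration.

```python
def standard_conv_params(kernel_size, in_channels, out_channels):
    # Every output channel has its own kernel over all input channels.
    return kernel_size * kernel_size * in_channels * out_channels

def depthwise_separable_params(kernel_size, in_channels, out_channels):
    # One spatial kernel per input channel, then a 1x1 convolution
    # that mixes the channels.
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise

standard = standard_conv_params(3, 64, 128)         # 73728 weights
separable = depthwise_separable_params(3, 64, 128)  # 8768 weights
ratio = separable / standard                        # about 0.12
```

The ratio works out to 1/out_channels + 1/kernel_size^2, which is why a 3x3 layer with many output channels lands near the "about 10%" figure from the question.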
Application
You can use them wherever you can apply a CNN. I'm sure you will find use cases of depthwise separable models outside image related tasks. It's just that CNNs have been most popular around images.
Further Reading
Let me shamelessly point you to an article of mine that discusses different types of convolutions in Deep Learning with some information how they work. Maybe that helps as well.

Is the backpropagation algorithm an independent algorithm?

Is the backpropagation algorithm an independent algorithm, or do we need other algorithms, such as Bayesian ones, along with it for neural network learning? And do we need any probabilistic approach to implement the backpropagation algorithm?
Backpropagation is just an efficient way of computing gradients in computational graphs. That's all. You do not have to use it (although computing gradients without it is extremely expensive), and what you do with it is up to you: there are hundreds of ways to use gradients. The most common one is to run first-order optimization techniques (such as SGD, RMSProp or Adam). So, to address your question: backpropagation is enough if and only if your task is to compute a gradient. To train a neural net you need at least one more piece, an actual learning algorithm (such as SGD, which is literally a single line of code). It is hard to say how "independent" it is from other methods because, as I said, gradients can be used everywhere.
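The split between "compute a gradient" and "learn" can be sketched with a one-neuron linear model. Everything here (the model, the squared-error loss, the data and the learning rate) is an assumption chosen to keep the example tiny; `backward` plays the role of backpropagation and `sgd_step` is the separate learning rule.

```python
def forward(w, b, x):
    return w * x + b

def backward(w, b, x, y):
    # Chain rule for the loss (w*x + b - y)**2:
    # dL/dy_hat = 2 * (y_hat - y), dy_hat/dw = x, dy_hat/db = 1.
    y_hat = forward(w, b, x)
    d_loss_d_yhat = 2.0 * (y_hat - y)
    return d_loss_d_yhat * x, d_loss_d_yhat

def sgd_step(w, b, grad_w, grad_b, lr):
    # The actual learning algorithm: one line per parameter.
    return w - lr * grad_w, b - lr * grad_b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # hypothetical samples of y = 2x + 1
w, b = 0.0, 0.0
for epoch in range(500):
    for x, y in data:
        grad_w, grad_b = backward(w, b, x, y)
        w, b = sgd_step(w, b, grad_w, grad_b, lr=0.05)
```

Swapping `sgd_step` for another update rule changes the learning algorithm without touching the gradient computation, which is the sense in which the two pieces are separate.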

non linear svm kernel dimension

I have some problems with understanding the kernels for non-linear SVM.
First, what I understood about non-linear SVM is this: using kernels, the input is transformed to a very high-dimensional space where the transformed input can be separated by a linear hyperplane.
For example, the RBF kernel:
K(x_i, x_j) = exp(-||x_i - x_j||^2/(2*sigma^2));
where x_i and x_j are two inputs. Here we need to tune sigma to adapt to our problem.
(1) Say my input dimension is d, what will be the dimension of the transformed space?
(2) If the transformed space has a dimension of more than 10000, is it effective to use a linear SVM there to separate the inputs?
Well, it is not only a matter of increasing the dimension. That's the general mechanism but not the whole idea: if the only goal of the kernel mapping were to increase the dimension, one could conclude that all kernel functions are equivalent, and they are not.
The way the mapping is constructed is what makes a linear separation possible in the new space.
Talking about your example, and just to extend a bit what greeness said, the RBF kernel organizes the feature space in terms of hyperspheres, where an input vector needs to be close to an existing sphere in order to produce an activation.
So, to answer your questions directly:
1) Note that you don't work in the feature space directly. Instead, the optimization problem is solved using the inner products of the vectors in the feature space, so computationally you don't increase the dimension of the vectors.
2) It depends on the nature of your data. Having a high-dimensional pattern would somehow help you to prevent overfitting, but the data will not necessarily be linearly separable. Again, linear separability in the new space is achieved because of the way the map is constructed, not only because the space has a higher dimension. In that sense, RBF would help, but keep in mind that it might not generalize well if your data is not locally enclosed.
The transformation usually increases the number of dimensions of your data, not necessarily very high. It depends. The RBF Kernel is one of the most popular kernel functions. It adds a "bump" around each data point. The corresponding feature space is a Hilbert space of infinite dimensions.
It's hard to tell if a transformation into 10000 dimensions is effective or not for classification without knowing the specific background of your data. However, choosing a good mapping (encoding prior knowledge + getting right complexity of function class) for your problem improves results.
For example, the MNIST database of handwritten digits contains 60K training examples and 10K test examples with 28x28 grayscale images.
Linear SVM has ~8.5% test error.
Polynomial SVM has ~1% test error.
Your question is a very natural one that almost everyone who's learned about kernel methods has asked some variant of. However, I wouldn't try to understand what's going on with a non-linear kernel in terms of the implied feature space in which the linear hyperplane is operating, because most non-trivial kernels have feature spaces that are very difficult to visualise.
Instead, focus on understanding the kernel trick, and think of the kernels as introducing a particular form of non-linear decision boundary in input space. Because of the kernel trick, and some fairly daunting maths if you're not familiar with it, any kernel function satisfying certain properties can be viewed as operating in some feature space, but the mapping into that space is never performed. You can read the following (fairly) accessible tutorial if you're interested: from zero to Reproducing Kernel Hilbert Spaces in twelve pages or less.
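A small worked example may help with the kernel trick. For 2-D inputs, the degree-2 polynomial kernel (x . y)^2 equals an ordinary inner product after mapping each input to three features. The kernel choice and the sample vectors are assumptions picked because the feature map is small enough to write out; for the RBF kernel the corresponding map is infinite-dimensional and cannot be written this way.

```python
import math

def poly2_kernel(x, y):
    # Kernel evaluated directly in the 2-D input space.
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def explicit_features(x):
    # Feature map phi with phi(x) . phi(y) == (x . y)^2 for 2-D inputs.
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, 4.0)  # hypothetical inputs
via_kernel = poly2_kernel(x, y)
via_features = dot(explicit_features(x), explicit_features(y))
```

Both routes give the same number up to floating-point rounding, but the kernel route never builds the 3-D vectors. That is all the SVM solver needs, since it only ever asks for inner products.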
Also note that because of the formulation in terms of slack variables, the hyperplane does not have to separate points exactly: there's an objective function that's being maximised which contains penalties for misclassifying instances, but some misclassification can be tolerated if the margin of the resulting classifier on most instances is better. Basically, we're optimising a classification rule according to some criteria of:
how big the margin is
the error on the training set
and the SVM formulation allows us to solve this efficiently. Whether one kernel or another is better is very application-dependent (for example, text classification and other language processing problems routinely show best performance with a linear kernel, probably due to the extreme dimensionality of the input data). There's no real substitute for trying a bunch out and seeing which one works best (and make sure the SVM hyperparameters are set properly---this talk by one of the LibSVM authors has the gory details).
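The trade-off between margin size and training error can be sketched as a function, written here in the usual minimisation form of the soft-margin objective with hinge-loss penalties. The weight vector, the data points and the value of C are made up for illustration.

```python
def soft_margin_objective(w, b, data, C):
    # 0.5 * ||w||^2 rewards a large margin; the hinge terms penalise
    # points that are misclassified or fall inside the margin.
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, label in data:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge_total += max(0.0, 1.0 - label * score)
    return margin_term + C * hinge_total

data = [
    ((2.0, 0.0), +1),   # outside the margin, no penalty
    ((0.5, 0.0), +1),   # correct side but inside the margin
    ((-1.0, 0.0), -1),  # exactly on the margin, no penalty
    ((0.5, 0.0), -1),   # misclassified
]
objective = soft_margin_objective((1.0, 0.0), 0.0, data, C=1.0)
```

A larger C makes the training-error penalties dominate, while a smaller C tolerates more violations in exchange for a wider margin. C is one of the hyperparameters that has to be set properly.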
