I'm having a problem implementing a Bayes Classifier with the Parzen window algorithm using a spherical (or isotropic) kernel.
I am running the algorithm with test data containing 2 dimensions and 3 different classes (For each class, I have 10 test points, and 40 training points, all in 2 dimensions). When I change the value of my hyper-parameter (sigma_sq for the spherical Gaussian kernel), I find that there is no effect on how the points are classified.
This is my density estimator. My self.sigma_sq is the same across all the dimensions of my data (2 dimensions)
for i in range(test_data.shape[0]):
log_prob_intermediate = 0
for j in range(n): #n is size of training set
c = -self.n_dims * np.log(2*np.pi)/2.0 - self.n_dims*np.log(self.sigma_sq)/2.0
log_prob_intermediate += (c - np.sum((test_data[i,:] - self.train_data[j,:])**2.0) / (2.0 * self.sigma_sq))
log_prob.append(log_prob_intermediate / n)
How I implemented my Bayes Classifier:
There are 3 classes that my Bayes Classifier must distinguish. I created 3 training sets and 3 test sets (one training and test set per class). For each point in my test set, I run the density estimator for each class on the point. This gives me a vector of 3 values: the log probability that my new point is in class1, class2, or class3. I then choose the maximum value and assign the new point to that class.
Since I am using a spherical Gaussian kernel, I am of the understanding that my sigma_sq must be common for each density estimator (one density estimator for each class). Is this correct? If I had a different sigma_sq for each dimension pair, wouldn't this give me somewhat of a diagonal Gaussian kernel?
For my list of 30 test points (10 for each class), I find that running the bayes classifier on these points continues to give me the exact same classification for each point, regardless of what sigma I use. Is this normal? Since it's a spherical Gaussian kernel, and all my dimensions use the same kernel, is increasing or decreasing my sigma_sq just having a proportional effect on my log probability with no change in the classification? OR do I have some sort of problem with my density estimator that I can't figure out.
Lets address each thing separately
Using the same sigma for each dimension makes your kernel radial, this is true; however, you can (and should!) use different sigma for each class, as each distribution usually requires different density estimator, for simple heuristics read for example about Scott's rule of thumb for the kernel width selection in gaussian case or later work by Silverman.
It is hard to tell whether in your particular case choice of sigma should change the classification - in general it should be true; but each dataset has its own properties. However, your data is just 2D, which makes it perfect for visualization. Draw your data, then, draw each KDE and simply visually investigate what is going on.
Related
I'd like to train a neural net to return a normalized color distance between two RGB pixels trained [at first] on a simple Euclidean distance with each color component being 0-255. there are 6 inputs (2 pixels x R,G,B components) and 1 output (a normalized 0-1 color distance). or should there be 48 binary inputs (2 pixels x 24 bits per px)?
the training set will probably consist of 1k - 10k randomly generated pixel pairs and their computed distances. i'm open to creating/chaining several simple nets rather than training one perfect one. e.g. splitting each by hue1/hue2 combo that can be cheaply determined in advance.
i'm new to ML and was hoping to get some advice since my intuition is basically zero for what type of net i need and ballpark for what the topology and params should be.
most supervised neural nets seem to do classification into buckets, but i don't quite understand how to get a net to predict a value that is not just one out of several categories. can i still use a simple feed-forward network for this task or something different?
thanks for any advice!
Iam a little bit confused about how to normalize/standarize image pixel values before training a convolutional autoencoder. The goal is to use the autoencoder for denoising, meaning that my traning images consists of noisy images and the original non-noisy images used as ground truth.
To my knowledge there are to options to pre-process the images:
- normalization
- standarization (z-score)
When normalizing using the MinMax approach (scaling between 0-1) the network works fine, but my question here is:
- When using the min max values of the training set for scaling, should I use the min/max values of the noisy images or of the ground truth images?
The second thing I observed when training my autoencoder:
- Using z-score standarization, the loss decreases for the two first epochs, after that it stops at about 0.030 and stays there (it gets stuck). Why is that? With normalization the loss decreases much more.
Thanks in advance,
cheers,
Mike
[Note: This answer is a compilation of the comments above, for the record]
MinMax is really sensitive to outliers and to some types of noise, so it shouldn't be used it in a denoising application. You can use quantiles 5% and 95% instead, or use z-score (for which ready-made implementations are more common).
For more realistic training, normalization should be performed on the noisy images.
Because the last layer uses sigmoid activation (info from your comments), the network's outputs will be forced between 0 and 1. Hence it is not suited for an autoencoder on z-score-transformed images (because target intensities can take arbitrary positive or negative values). The identity activation (called linear in Keras) is the right choice in this case.
Note however that this remark on activation only concerns the output layer, any activation function can be used in the hidden layers. Rationale: negative values in the output can be obtained through negative weights multiplying the ReLU output of hidden layers.
I am relatively new the subject and have been doing loads of reading. What I am particularly confused about is how a CNN learns its filters for a particular labeled feature in a training data set.
Is the cost calculated by which outputs should or shouldn't be active on a pixel by pixel basis? And if that is the case, how does mapping the activations to the labeled data work after having down sampled?
I apologize for any poor assumptions or general misunderstandings. Again, I am new to this field and would appreciate all feedback.
I'll break this up into a few small pieces.
Cost calculation -- cost / error / loss depends only on comparing the final prediction (the last layer's output) to the label (ground truth). This serves as a metric of how right or wrong the prediction is.
Inter-layer structure -- Each input to the prediction is an output of the prior layer. This output has a value; the link between the two has a weight.
Back-prop -- Each weight gets adjusted in proportion to the error comparison and its weight. A connection that contributed to a correct prediction gets rewarded: its weight is increased in magnitude. Conversely, a connection that pushed for a wrong prediction gets reduced.
Pixel-level control -- To clarify the terminology ... traditionally, each kernel is a square matrix of float values, each of which is called a "pixel". The pixels are trained individually. However, that training comes from sliding a smaller filter (also square) across the kernel, performing a dot-product of the window with the corresponding square sub-matrix of the kernel. The output of that dot-product is the value of a single pixel in the next layer.
When the strength of pixel in layer N is increased, this effectively increases the influence of the filter in layer N-1 providing that input. That filter's pixels are, in turn, tuned by the inputs from layer N-2.
My FCN is trained to detect 10 different classes and produces an output of 500x500x10 with each of the final dimensions being the prediction probabilities for a different class.
Usually, I've seen using a uniform threshold, for instance 0.5, to binarize the probability matrices. However, in my case, this doesn't quite cut it because the IoU for some of the classes increases when the threshold is 0.3 and for other classes it is 0.8.
Hence, I don't have to arbitrarily pick the threshold for each class but rather use a more probabilistic approach to finalizing the threshold values. I thought of using CRFs but this also requires the thresholding to have already been done. Any ideas on how to proceed?
Example: consider an image of a forest with 5 different birds. Now im trying to output an image that has segmented the forest and the five birds, 6 classes, each with a separate label. The network outputs 6 confusion matrices indicating the confidence that a pixel falls into a particular class. Now, the correct answer for a pixel isnt always the class with the highest confidence value. Therefore, a one size fits all method or a max value method won't work.
CRF Postprocessing Approch
You don't need to set thresholds to use a CRF. I'm not familiar with any python libraries for CRFs, but in principle, what you need to define is:
A probability distribution of the 10 classes for each of the nodes
(pixels), which is simply the output of your network.
Pairwise potentials: 10*10 matrix, where element Aij denotes the "strength" of the configuration that one pixel is of class i and the other of class j. If you set the potentials to have a value alpha (alpha >> 1) in the diagonal and 1 elsewhere, then alpha is the regularization force that gives you consistency of the predictions (if pixel X is of class Y, then the neighboring pixels of X are more likely to be of the same class).
This is just one example of how you can define your CRF.
End to End NN Approach
Add a loss to your network that will penalize pixels that have neighbors of a different class. Please note that you will still end up with a tune-able parameter for the weight of the new regularization loss.
I am not sure whether I am applying PCA correctly or not! I have p features and n observations (instances). I put these in an nxp matrix X. I perform mean normalization and I get the normalized matrix B. I calculate the eigenvalues and eigenvectors of the pxp covariance matrix C=(1/(n-1))B*.B where * denotes the conjugate transpose.
The eigenvectors corresponding to the descendingly ordered eigenvalues are in a pxp matrix E. Let's say I want to reduce the number of attributes from p to k. I use the equation X_new=B.E_reduced where E_reduced is produced by choosing the first k columns of E. Here are my questions:
1) Should it be X_new=B.E_reduced or X_new=X.E_reduced?
2) Should I repeat the above calculations in the testing phase? If testing phase is similar to training phase, then no speed-up is gained because I have to calculate all the p features for each instance in the testing phase and PCA makes the algorithm slower because of eigenvector calculation overhead.
3) After applying PCA, I noticed that the accuracy decreased. Is this related to the number k (I set k=p/2) or the fact that I am using linear PCA instead of kernel PCA? What is the best way to choose the number k? I read that I can find the ratio of summation of k eigenvalues over the summation of all eigenvalues and make a decision based on this ratio.
You apply the multiplication to the centered data usually, so your projected data is also centered.
Never re-run PCA during testing. Only usenit on training data, and keep the shift vector and projection matrix. You need to apply exactly the same projection as during training, not recompute a new projection.
Decreased performance can have many reasons. E.g. did you also apply scaling using the roots of the eigenvalues? And what method did you use the first place?