Convolutional neural networks vs downsampling?

After reading up on the subject I don't fully understand: Is the 'convolution' in neural networks comparable to a simple downsampling or 'sharpening' function?
Can you break this term down into a simple, understandable image/analogy?
Edit (rephrased after the first answer): Can pooling be understood as downsampling of weight matrices?

A convolutional neural network is a family of models which have been shown empirically to work great when it comes to image recognition. From this point of view, a CNN is something completely different from downsampling.
But in the framework used in CNN design there is something comparable to a downsampling technique. To fully understand that, you have to understand how a CNN usually works. It is built from a hierarchy of layers, and at every layer you have a set of trainable kernels whose output has dimensions very similar to the spatial size of your input images.
This might be a serious problem: the output from such a layer might be extremely large (~ nr_of_kernels * size_of_kernel_output), which could make your computations intractable. This is the reason why certain techniques are used in order to decrease the size of the output:
Stride, pad and kernel size manipulation: by setting these values appropriately you can decrease the size of the output (on the other hand, you may lose some important information).
Pooling operation: pooling is an operation in which, instead of passing on all outputs from all kernels, a layer passes only specific aggregated statistics about them. It is considered extremely useful and is widely used in CNN design.
For a detailed description you might visit this tutorial.
Edit: Yes, pooling is a kind of downsampling 😊
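To make the connection concrete, here is a minimal NumPy sketch of 2x2 max pooling (my own illustration, not part of the original answer): each non-overlapping 2x2 block is replaced by its maximum, which halves every spatial dimension exactly like a downsampling step.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Downsample a 2D feature map by taking the max over
    non-overlapping 2x2 windows (stride 2)."""
    h, w = feature_map.shape
    fm = feature_map[:h // 2 * 2, :w // 2 * 2]  # drop odd rows/cols
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))  # 4x4 input -> 2x2 output; each entry is a block max
```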

Related

How to deal with dataset of different features?

I am working on an MLP model for the CEA Classification Dataset (binary classification). Each sample contains 4 different features, such as resistance and other values, each in its own range (resistance in the hundreds, another in micros, etc.). I am still new to machine learning and this is the first real model I am building. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption is that this kind of data should be scaled? If so, what are some useful resources to look at, since I do not quite understand when scaling is required.
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA Classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your specific dataset. You can use the scikit-learn library's preprocessing classes MinMaxScaler() and StandardScaler() for normalization and standardization, respectively.
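As a hedged sketch of how that could look (the feature values below are made up to mimic four features on very different scales):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical samples: resistance in the hundreds, a feature in micros, etc.
X = np.array([[450.0, 3e-6, 12.0, 0.8],
              [300.0, 7e-6, 15.0, 0.3],
              [520.0, 1e-6,  9.0, 0.5]])

X_minmax = MinMaxScaler().fit_transform(X)  # each feature rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)   # each feature: mean 0, std 1

# In a real pipeline, fit the scaler on the training split only and reuse
# the fitted scaler to transform the validation/test splits.
print(X_minmax)
print(X_std)
```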
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.

Depthwise separable Convolution

I am new to deep learning and recently came across depthwise separable convolutions. They significantly reduce the computation required to process the data, needing only around 10% of the computation of a standard convolution step.
I am curious: what is the intuition behind doing this? We do achieve greater speed by reducing the number of parameters and doing less computation, but is there a trade-off in performance?
Also, is it only used for specific use cases like images, or can it be applied to all forms of data?
Intuitively, depthwise separable convolutions (DSCs) model the spatial correlation and the cross-channel correlation separately, while regular convolutions model them simultaneously. In our recent paper published at BMVC 2018, we give a mathematical proof that a DSC is nothing but the principal component of a regular convolution. This means it can capture the most effective part of a regular convolution and discard the other, redundant parts, making it super efficient. For the trade-off, our paper gives some empirical results on data-free decomposition of VGG16's regular convolutions. However, with enough fine-tuning, we could almost eliminate the accuracy degradation. I hope our paper helps you understand DSCs further.
Intuition
The intuition behind doing this is to decouple the spatial information (width and height) from the depthwise information (channels). While a regular convolutional layer merges information across all input channels in a single step, a depthwise separable convolution first filters each channel spatially and then mixes the channels with an additional 1x1 (pointwise) convolution.
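As a rough, hedged illustration of the savings (the input shape and filter counts here are my own assumptions, not from the question), Keras can report the parameter counts of both variants:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(64, 64, 32))  # hypothetical 32-channel feature map

regular = tf.keras.layers.Conv2D(64, 3, padding="same")
separable = tf.keras.layers.SeparableConv2D(64, 3, padding="same")

_ = regular(inputs)    # calling the layers builds their weights
_ = separable(inputs)

print(regular.count_params())    # 3*3*32*64 + 64       = 18,496
print(separable.count_params())  # 3*3*32 + 32*64 + 64  =  2,400
```

Here the separable layer needs roughly 13% of the weights of the regular one, in line with the ~10% figure in the question.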
Performance
Using a depthwise separable convolutional layer as a drop-in replacement for a regular one will greatly reduce the number of weights in the model. It will also very likely hurt the accuracy due to the much smaller number of weights. However, if you change the width and depth of your architecture to increase the weights again, you may reach the same accuracy as the original model with fewer parameters. At the same time, a depthwise separable model with the same number of weights might achieve a higher accuracy compared to the original model.
Application
You can use them wherever you can apply a CNN. I'm sure you will find use cases of depthwise separable models outside image related tasks. It's just that CNNs have been most popular around images.
Further Reading
Let me shamelessly point you to an article of mine that discusses different types of convolutions in deep learning, with some information on how they work. Maybe that helps as well.

Machine learning for lighting direction estimation from pictures?

Newcomer to machine learning and stackoverflow.
Recently, I have been trying to create a machine learning algorithm that estimates the direction of a light source based on the reflection of an object.
I know this may be a complicated subject, and that's why, as a first step, I tried to simplify it as much as possible.
I first changed my problem from a regression problem to a classification problem by only taking two outputs: the light source is on the left side of the object, or the light source is on the right side.
I am also varying only one angle for my dataset.
Short version of my question :
Do you think that it is possible to do such a thing with machine learning? (My experience is too limited to really be sure.)
If yes, which model would be best suited in your view? A CNN? An R-CNN? An LSTM? An SVM?
What would be the pipeline to complete this task ?
I am currently using the Unity Engine with a directional light that takes a random X angle in [10,60] / [120,170] and a sphere with metallic reflection to create and label a dataset. Here is an example:
https://imgur.com/a/FxNew Label: 0 (Left side)
https://imgur.com/a/9KFhi Label: 1 (Right side)
For the pre-processing:
Images are resized to 64x64
Transformed from RGB to grayscale format
For the machine learning, I'm currently using TensorFlow and a convolutional neural network with:
10,000 balanced, labeled samples of 64x64 grayscale pictures as input and 0/1 as label
3 convolutional layers with [16,32,64] filters of size [5,5], ReLU
3 pooling layers with size [2,2] and stride [2,2]
1 dense layer with 1024 hidden neurons and dropout (rate = 0.4), ReLU
1 dense layer with 2 output neurons (1 for each class), softmax
As for the issue: my network is simply not learning. The loss hardly goes down, and the accuracy shows that the results are random, whatever the data, the number of layers, the optimizer, the learning rate, ... My output just averages between the two classes: [0.5, 0.5].
My guess is that the problem is more complicated than I first thought, that my data doesn't give a good hint of what my prediction should be, and that I should instead train a network that detects the reflection dot on the object and then use the orientation between the center of the object and the dot. Am I right?
Another guess is that the convolutional layers don't take position into account, so for the convolution part all the images look the same, since the sphere is always the same, as is the lighting pattern. They will always detect the same thing and won't take into account that the light region has moved. Do you have any advice on which network I could use to resolve this issue?
I'm really looking for some advice, warning on how to tackle this kind of task.
Please remember that I am still pretty new to machine learning and still learning more than my machines hehe...
Thank you.
Do you think that it is possible to do such a thing with machine learning?
Absolutely. And you've correctly chosen a CNN model - it's the best suited for this task.
My guess is that the problem is more complicated than I first thought, that my data doesn't give a good hint of what my prediction should be, and that I should instead train a network that detects the reflection dot on the object and then use the orientation between the center of the object and the dot. Am I right?
No, a CNN has proven to classify pretty well from raw pixels. It should figure out by itself what to pay attention to.
Do you have any advice on which network I could use to resolve this issue?
It would be great if you provided your full code. There are many possible reasons for not learning: image pre-processing bugs, data mislabeling, a poor choice of hyperparameters (learning rate, initialization, ...), the wrong loss function, etc. There can simply be bugs.
What I suggest right away, based on the described CNN architecture (see the sketch after this list):
5x5 filter size is probably too large, since you don't have that many filters. Try 3x3 and increase the number of filters a bit, e.g. 32 - 64 - 64.
I assume that you use CONV - POOL - CONV - POOL - CONV - POOL, not CONV - CONV - CONV - POOL - POOL - POOL. Just to make sure.
You probably don't need so many neurons in your FC layer. You have just two classes and pretty similar images! Reduce 1024 to say 256.
You don't experience any overfitting at the moment, so disable the dropout for now: keep_probability=1.0.
Pay attention to initialization and learning rate. Try different values in log-scale, e.g. learning_rate = 0.1, 0.01, 0.001 and check if learning pattern ever changes.
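Putting those suggestions together, here is a minimal Keras sketch of the adjusted architecture (the framework and exact layer arguments are my assumptions; the original question used lower-level TensorFlow):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),                  # 64x64 grayscale input
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(2),                    # CONV - POOL pattern
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),      # smaller FC layer
    # dropout intentionally omitted until overfitting appears
    tf.keras.layers.Dense(2, activation="softmax"),     # one neuron per class
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # try 0.1, 0.01, 0.001
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```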
Thanks to @Maxim for his answer. It was very helpful and helped me solve my problem as well as refine my network.
He pointed me to the problem: data mislabeling.
I was pretty sure about my data labeling but verified anyway.
The problem was there...
I'm writing the answer here so it can maybe help other unaware TensorFlow users:
When you use tf.train.string_input_producer without specifying otherwise, the default is shuffle=True, which shuffles your filename queue.
Since I use a .csv file for the labels and a folder of .png files for the images, the labels were read in order from 1 to 10000, whereas the .png files were read randomly.
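A minimal sketch of the fix with the TF 1.x queue API (the file paths here are hypothetical):

```python
import tensorflow as tf  # TF 1.x

# Hypothetical image paths, in the same order as the labels in the .csv
filenames = ["images/{:05d}.png".format(i) for i in range(10000)]

# The default is shuffle=True, which silently reorders the queue and
# decouples the images from labels read sequentially elsewhere.
# shuffle=False keeps the original order; alternatively, shuffle both
# the filename and the label queues with the same seed.
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
```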
I feel very dumb about it, but that's how you learn hehe.

Instance Normalisation vs Batch normalisation

I understand that batch normalisation helps with faster training by turning the activations toward a unit Gaussian distribution, thus tackling the vanishing gradients problem. Batch norm is applied differently at training time (use the mean/var of each batch) and at test time (use the finalized running mean/var from the training phase).
Instance normalisation, on the other hand, acts as contrast normalisation, as mentioned in this paper https://arxiv.org/abs/1607.08022. The authors mention that the output stylised images should not depend on the contrast of the input content image, and hence instance normalisation helps.
But then shouldn't we also use instance normalisation for image classification, where the class label should not depend on the contrast of the input image? I have not seen any paper using instance normalisation in place of batch normalisation for classification. What is the reason for that? Also, can and should batch and instance normalisation be used together? I am eager to get an intuitive as well as theoretical understanding of when to use which normalisation.
Definition
Let's begin with the strict definition of both:
Batch normalization:
$$\mu_i = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \qquad \sigma_i^2 = \frac{1}{HWT}\sum_{t=1}^{T}\sum_{l=1}^{W}\sum_{m=1}^{H} (x_{tilm} - \mu_i)^2, \qquad y_{tijk} = \frac{x_{tijk} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}$$
Instance normalization:
$$\mu_{ti} = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} x_{tilm}, \qquad \sigma_{ti}^2 = \frac{1}{HW}\sum_{l=1}^{W}\sum_{m=1}^{H} (x_{tilm} - \mu_{ti})^2, \qquad y_{tijk} = \frac{x_{tijk} - \mu_{ti}}{\sqrt{\sigma_{ti}^2 + \epsilon}}$$
Here $x \in \mathbb{R}^{T \times C \times W \times H}$ is a batch of $T$ images, $i$ indexes the channel, and $j, k$ (or $l, m$) index the spatial location.
As you can notice, they do the same thing, except for the number of input tensors that are normalized jointly. The batch version normalizes all images across the batch and spatial locations (in the CNN case; in the ordinary fully-connected case it is different); the instance version normalizes each element of the batch independently, i.e., across spatial locations only.
In other words, where batch norm computes one mean and std dev (thus making the distribution of the whole layer Gaussian), instance norm computes $T$ of them, one per image, making each individual image's distribution look Gaussian, but not jointly.
A simple analogy: during data pre-processing step, it's possible to normalize the data on per-image basis or normalize the whole data set.
Credit: the formulas are from here.
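A minimal NumPy sketch of the difference (the shapes are my assumptions): the two normalizations are identical except for the axes the statistics are computed over.

```python
import numpy as np

x = np.random.randn(8, 16, 32, 32)  # (T, C, H, W): batch, channels, height, width
eps = 1e-5

# Batch norm: one mean/std per channel, reduced over batch AND space
mu = x.mean(axis=(0, 2, 3), keepdims=True)              # shape (1, 16, 1, 1)
bn = (x - mu) / np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + eps)

# Instance norm: T means/stds per channel, reduced over space only
mu_i = x.mean(axis=(2, 3), keepdims=True)               # shape (8, 16, 1, 1)
inorm = (x - mu_i) / np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)
```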
Which normalization is better?
The answer depends on the network architecture, in particular on what is done after the normalization layer. Image classification networks usually stack the feature maps together and wire them to an FC layer, which shares weights across the batch (the modern way is to use a CONV layer instead of FC, but the argument still applies).
This is where the distribution nuances start to matter: the same neuron is going to receive the input from all images. If the variance across the batch is high, the gradient from the small activations will be completely suppressed by the high activations, which is exactly the problem that batch norm tries to solve. That's why it's fairly possible that per-instance normalization won't improve network convergence at all.
On the other hand, batch normalization adds extra noise to the training, because the result for a particular instance depends on the neighboring instances. As it turns out, this kind of noise may be either good or bad for the network. This is well explained in the "Weight Normalization" paper by Tim Salimans et al., which names recurrent neural networks and reinforcement learning DQNs as noise-sensitive applications. I'm not entirely sure, but I think that the same noise-sensitivity was the main issue in the stylization task, which instance norm tried to fight. It would be interesting to check if weight norm performs better for this particular task.
Can you combine batch and instance normalization?
Though it makes a valid neural network, there's no practical use for it. Batch normalization noise is either helping the learning process (in this case it's preferable) or hurting it (in this case it's better to omit it). In both cases, leaving the network with one type of normalization is likely to improve the performance.
Great question, and already answered nicely. Just to add: I found the visualisation from Kaiming He's Group Norm paper helpful.
Source: link to article on Medium contrasting the Norms
I wanted to add more information to this question, since there is some more recent work in this area. Your intuition
use instance normalisation for image classification where class label should not depend on the contrast of input image
is partly correct. I would say that a pig in broad daylight is still a pig when the image is taken at night or at dawn. However, this does not mean that using instance normalization across the network will give you a better result. Here are some reasons:
Color distribution still plays a role. An image with a lot of red is more likely to show an apple than an orange.
At later layers, you can no longer imagine instance normalization acting as contrast normalization. Class-specific details emerge in deeper layers, and normalizing them per instance will hurt the model's performance greatly.
IBN-Net uses both batch normalization and instance normalization in its model. They only put instance normalization in the early layers and have achieved improvements in both accuracy and the ability to generalize. They have open-sourced the code here.
IN provides visual and appearance invariance, while BN accelerates training and preserves discriminative features.
IN is preferred in shallow layers (the starting layers of the CNN) to remove appearance variation, while BN is preferred in deep layers (the last CNN layers) in order to maintain discrimination.

Why do we use fully-connected layer at the end of CNN?

I have searched for the reason a lot, but I haven't found a clear explanation. Could someone explain it in more detail, please?
In theory you do not have to attach a fully connected layer; you could have a full stack of convolutions up to the very end, as long as (due to custom sizes/paddings) you end up with the correct number of output neurons (usually the number of classes).
So why do people usually not do that? If one goes through the math, it becomes visible that each output neuron (thus each prediction with respect to some class) depends only on a subset of the input dimensions (pixels). This would be something along the lines of a model which decides whether an image belongs to class 1 depending only on the first few "columns" (or, depending on the architecture, rows, or some patch of the image), then whether it is class 2 based on a few next columns (maybe overlapping), ..., and finally some class K depending on a few last columns. Usually data does not have this characteristic: you cannot classify an image of a cat based on its first few columns while ignoring the rest.
However, if you introduce a fully connected layer, you provide your model with the ability to mix signals: since every single neuron has a connection to every single one in the next layer, there is now a flow of information between each input dimension (pixel location) and each output class, and thus the decision is truly based on the whole image.
So intuitively you can think about these operations in terms of information flow. Convolutions are local operations, and pooling is a local operation. Fully connected layers are global (they can introduce any kind of dependence). This is also why convolutions work so well in domains like image analysis: due to their local nature, they are much easier to train, even though mathematically they are just a subset of what fully connected layers can represent.
note
I am considering here the typical use of CNNs, where kernels are small. In general, one can even think of an MLP as a CNN where the kernel is the size of the whole input, with specific spacing/padding. However, these are just corner cases which are not really encountered in practice, and they do not really affect the reasoning, since then they end up being MLPs. The whole point here is simple: to introduce global relations. If one can do it by using CNNs in a specific manner, then MLPs are not needed. MLPs are just one way of introducing this dependence.
Every fully connected (FC) layer has an equivalent convolutional layer (but not vice versa). Hence it is not necessary to add FC layers. They can always be replaced by convolutional layers (+ reshaping). See details.
Why do we use FC layers then?
Because (1) we are used to it, and (2) it is simpler. (1) is probably the reason for (2). For example, you would need to adjust the loss function / the shape of the labels / add a reshape at the end if you used a convolutional layer instead of an FC layer.
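A minimal TensorFlow sketch of the equivalence (the shapes are my assumptions): a Dense head and a VALID-padded convolution whose kernel covers the whole feature map have exactly the same number of weights.

```python
import tensorflow as tf

x = tf.random.normal((1, 7, 7, 64))        # a hypothetical final feature map

# FC head: flatten, then a dense layer producing 10 class scores
dense = tf.keras.layers.Dense(10)
fc_out = dense(tf.reshape(x, (1, -1)))     # shape (1, 10)

# Equivalent conv head: one 7x7 convolution with 10 filters, then reshape
conv = tf.keras.layers.Conv2D(10, 7, padding="valid")
conv_out = tf.reshape(conv(x), (1, 10))    # (1, 1, 1, 10) -> (1, 10)

# Both heads have 7*7*64*10 + 10 weights
print(dense.count_params(), conv.count_params())  # 31370 31370
```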
I found this answer by Anil-Sharma on Quora helpful.
We can divide the whole network (for classification) into two parts:
Feature extraction:
In conventional classification algorithms, like SVMs, we used to extract features from the data to make the classification work. The convolutional layers serve the same purpose of feature extraction. CNNs capture a better representation of the data, and hence we don't need to do feature engineering.
Classification:
After feature extraction we need to classify the data into various classes; this can be done using a fully connected (FC) neural network. In place of fully connected layers, we can also use a conventional classifier like an SVM. But we generally end up adding FC layers to make the model end-to-end trainable.
The CNN gives you a representation of the input image. To learn the sample classes, you should use a classifier (such as logistic regression, an SVM, etc.) that learns the relationship between the learned features and the sample classes. A fully-connected layer is also a linear classifier, like logistic regression, and it is used for this reason.
Convolution and pooling layers extract features from the image, so these layers do some "preprocessing" of the data. Fully connected layers then perform classification based on these extracted features.
