Machine learning for lighting direction estimation from pictures? - machine-learning

I'm a newcomer to machine learning and Stack Overflow.
Recently, I have been trying to create a machine learning algorithm that estimates the direction of a light source based on the reflection of an object.
I know this may be a complicated subject, and that's why, as a first step, I tried to simplify it as much as possible.
I first changed my problem from a regression problem to a classification problem by restricting the output to two classes: the light source is on the left side of the object, or on the right side of the object.
I am also varying only one angle for my dataset.
Short version of my question:
Do you think it is possible to do such a thing with machine learning? (My experience is too limited to really be sure.)
If yes, which model would be the most suitable in your view? CNN? R-CNN? LSTM? SVM?
What would be the pipeline to complete this task?
I am currently using Unity Engine with a directional light that takes a random X angle in [10,60] / [120,170] and a sphere with metallic reflection to create and label a dataset. Here are two examples:
https://imgur.com/a/FxNew Label : 0 (Left side)
https://imgur.com/a/9KFhi Label : 1 (Right side)
For the pre-processing:
Images are resized to 64x64
Transformed from RGB to grayscale format.
For the machine learning, I'm currently using TensorFlow and a convolutional neural network with:
10,000 balanced, labeled examples: 64x64 grayscale pictures as input and 0/1 as label
3 convolutional layers with [16, 32, 64] filters of size [5, 5], ReLU
3 pooling layers with size [2, 2] and stride [2, 2]
1 dense layer with 1024 hidden neurons and dropout (rate = 0.4), ReLU
1 dense layer with 2 output neurons (1 for each class), softmax
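For reference, here is a minimal sketch of that architecture written in tf.keras. The layer sizes follow the list above, while the layer names, optimizer and padding are illustrative assumptions, not the asker's actual code:

# Minimal sketch of the architecture described above (tf.keras).
# Layer sizes follow the list; optimizer, padding and names are assumptions.
import tensorflow as tf

def build_model():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(64, 64, 1)),                              # 64x64 grayscale
        tf.keras.layers.Conv2D(16, (5, 5), padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        tf.keras.layers.Conv2D(32, (5, 5), padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        tf.keras.layers.Conv2D(64, (5, 5), padding='same', activation='relu'),
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dropout(0.4),
        tf.keras.layers.Dense(2, activation='softmax'),                 # left / right
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model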
As for the issue: my network simply isn't learning. The loss hardly goes down, and the accuracy shows that results are no better than random, whatever the data, the number of layers, the optimizer, the learning rate, ... My output just averages between the two classes: [0.5, 0.5].
My guess is that the problem is more complicated than I first thought, that my data doesn't give a good hint of what my prediction should be, and that I should instead train a network that detects the reflection dot on the object and then use the orientation between the center of the object and the dot. Am I right?
Another guess is that the convolutional layers don't take position into account, so for the convolution part all the images look the same, since the sphere is always the same and so is the lighting pattern. The network will always detect the same thing and won't take into account that the light region has moved. Do you have any advice on which network I could use to resolve this issue?
I'm really looking for advice and warnings on how to tackle this kind of task.
Please remember that I am still pretty new to machine learning and still learning more than my machines hehe...
Thank you.

Do you think it is possible to do such a thing with machine learning?
Absolutely. And you've correctly chosen a CNN model - it's the best suited for this task.
My guess is that the problem is more complicated than I first thought, that my data doesn't give a good hint of what my prediction should be, and that I should instead train a network that detects the reflection dot on the object and then use the orientation between the center of the object and the dot. Am I right?
No, CNNs have proven to classify pretty well from raw pixels. The network should figure out by itself what to pay attention to.
Do you have any advice on which network I could use to resolve this issue?
It would be great if you provided your full code. There are many possible reasons for not learning: image pre-processing bugs, data mislabeling, poor choice of hyperparameters (learning rate, initialization, ...), the wrong loss function, etc. There can simply be bugs.
What I suggest right away, based on the described CNN architecture:
5x5 filter size is probably too large, since you don't have that many filters. Try 3x3 and increase the number of filters a bit, e.g. 32 - 64 - 64.
I assume that you use CONV - POOL - CONV - POOL - CONV - POOL, not CONV - CONV - CONV - POOL - POOL - POOL. Just to make sure.
You probably don't need so many neurons in your FC layer. You have just two classes and pretty similar images! Reduce 1024 to say 256.
You don't experience any overfitting at the moment, so disable the dropout for now: keep_probability=1.0.
Pay attention to initialization and learning rate. Try different values in log-scale, e.g. learning_rate = 0.1, 0.01, 0.001 and check if learning pattern ever changes.
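To make the last point concrete, here is a hypothetical log-scale learning-rate sweep in tf.keras; build_model(), x_train and y_train are assumed names (see the sketch of the architecture above), not the asker's actual code:

# Hypothetical learning-rate sweep on a log scale; build_model(), x_train and
# y_train are assumed names, not taken from the question.
import tensorflow as tf

for lr in [0.1, 0.01, 0.001, 0.0001]:
    model = build_model()
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(x_train, y_train, epochs=5, validation_split=0.1, verbose=0)
    print(f"lr={lr}: val_accuracy={history.history['val_accuracy'][-1]:.3f}")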

Thank you to #Maxim for his answer. It was very helpful and helped me solve my problem as well as refine my network.
He pointed me to the problem: data mislabeling.
I was pretty sure about my data labeling but verified anyway.
The problem was there...
I write the answer here so it can maybe help other unaware TensorFlow users:
When you use tf.string_input_producer without specifying otherwise, the default is shuffle=True, which shuffles your filename queue.
Since I use a .csv file for the labels and a folder of .png files for the images, the labels were read in order from 1 to 10000 whereas the .png files were read randomly.
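For reference, a sketch of the fix in the old TF 1.x queue-based input pipeline; the file names are illustrative, and the point is simply that the filename queue must be read in the same order as the label rows (or both sides shuffled with the same seed):

# Sketch only: read the .png filenames in a deterministic order that matches
# the label order in the .csv file.
import glob
import tensorflow as tf

filenames = sorted(glob.glob("images/*.png"))                    # deterministic order
filename_queue = tf.train.string_input_producer(filenames, shuffle=False)
label_queue = tf.train.string_input_producer(["labels.csv"], shuffle=False)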
I feel very dumb about it, but that's how you learn hehe.

Related

Autoencoder Failing to Capture Small Artifacts

tl;dr - I use an autoencoder to try to reduce input dimensions for a reinforcement-learning (RL) agent to learn how to play Atari-KungFu. But it fails at encoding/decoding thrown knives, because they are only a couple pixels and getting them wrong probably has negligible impact on the autoencoder MSE loss (see green arrows in bottom left of image). This will probably permanently hobble the results. I want to figure out if there is a way to solve this -- preferably with a generalized solution, but I'd be happy for now with something specific to this problem.
Background:
I am working on Week5 of the "Practical Reinforcement Learning" course on Coursera (National Research University HSE), and I decided to spend extra time trying to expand performance on the Atari-KungFu assignment using Actor-Critic architecture. This post is not about actor-critic, but more about an interesting sub-problem I ran into related to autoencoders.
I create an encoder which outputs a tanh-64-neuron layer, which is used as a common input to the decoder, policy learner (actor), and value learner (critic). During training, the simulator returns batches of four sequential frames (64 x 144 x 4) and rewards from the last action. Then images are first used to train the autoencoder, then used with the rewards to train the actor & critic branches.
I display some metrics and example frames every 25000 iterations to see how it's doing. If the reconstructed images are accurate, then the inputs to the actor & critic branches should be getting good distilled information for efficient learning.
You can see below that the autoencoder is pretty good except for the thrown knives (see bottom-left). Arguably this is because missing those couple of pixels only minimally increases the MSE loss of the reconstructed image, so it has little incentive to learn them (and also there are not a lot of frames that have knives). Yet, seeing those knives is critical for the RL agent to learn how to survive.
I haven't seen this kind of problem addressed before. A tiny artifact in the input images is crucial for learning, but is unlikely to be learned by the autoencoder. Can we fix/improve this?
IMO your problem is loss-specific. Some things which would probably help the autoencoder reconstruct the knives as well:
Find the knives in the input image using image-processing techniques. Regions where knives are present should get a higher weight in the MSE loss, say 10 times more (see the sketch of such a weighted loss below). One way to find those regions semi-automatically could be convolution with a big kernel: white pixels at the strict center would give more weight, and only zeros around them would give more weight as well. Something along these lines should find regions where only knives are located (the throwing guys wouldn't match, as they contain too many white pixels and holes). Thresholding the kernel response with some empirically found value should be enough to correctly find them.
Lower the loss for images where no knife was found, say divide it by half. This would focus the autoencoder harder on the rarely seen cases where a knife is visible.
On the downside, I suppose it could introduce some artifacts. In that case you may think about using a pretrained encoder (like some version of ResNet) to increase the model's capacity.
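A minimal sketch of such a per-pixel weighted MSE, assuming knife_mask is a binary mask (1 where a knife was detected by the image-processing step, 0 elsewhere); the weight value and names are illustrative:

# Sketch of a per-pixel weighted MSE loss in TensorFlow; knife_mask and the
# weight of 10 are assumptions, not from the original post.
import tensorflow as tf

def weighted_mse(y_true, y_pred, knife_mask, knife_weight=10.0):
    # Weight is 1 everywhere and knife_weight where knives are present.
    weights = 1.0 + (knife_weight - 1.0) * knife_mask
    return tf.reduce_mean(weights * tf.square(y_true - y_pred))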

Can CNN learn to weigh certain feature channels much, much more than others?

This is a hypothetical question.
Assumptions
I am working on a 2 class semantic segmentation task
My ground truths are binary masks
batch size is 1
at an arbitrary point in my network there is a convolution layer called 'conv_5' which has a feature map size of 90 x 45 x 512.
Let's assume I also decide that (during training) I will concatenate the ground truth mask to 'conv_5'. This will result in a new top we can call 'concat_1' which will be a 90 x 45 x 513 dimension feature map.
Assume that the rest of the network follows a normal pattern like a few more convolution layers, a fully connected, and softmax loss.
My question is, can the fully connected layers learn to weigh the first 512 feature channels very low and weigh the last feature channel (which we know is a perfect ground truth) very high?
If this is true then is it true in principle such that if the feature dimension was 1,000,000 channels and I add the last channel as the perfect ground truth it will still learn to effectively ignore all previous 1,000,000 feature channels?
My intuition is that if there is ever a VERY good feature channel passed in then a network should be able to learn to utilize this channel far more than the others. I would also like to think that this is independent to the number of channels.
(In practice I have a scenario where I am passing in a nearly perfect ground truth as the 513th feature map, but it seems to have no impact at all. When I examine the magnitudes of the weights across all 513 feature channels, the magnitudes are roughly the same across all channels. This leads me to believe that the "nearly perfect mask" is only being utilized at about 1/513th of its potential. This is what has motivated me to ask the question.)
Hypothetically, if you have a "killing feature" at your disposal, the net should learn to use it and ignore the "noise" from the rest of the features.
BTW, Why are you using a fully connected layer for semantic segmentation? I'm not sure this is "a normal pattern" for semantic segmentation nets.
What may prevent the net from identifying the "killing features"?
- The layers above "conv_5" mess things up: if they reduce resolution (sampling/pooling/striding...) then information is lost and it becomes difficult to utilize it. Specifically, I suspect the fully-connected layer might globally mess things up.
- A bug: the way you add the "killing feature" is not aligned with the image. Either the mask is added transposed, or you erroneously add the mask of one image to another (do you "shuffle" the training samples?)
An interesting experiment:
You can check whether the net has at least locally optimal weights that use the "killing features": use net surgery to manually set the weights such that "conv_5" is zero for all features but the "killing features", and such that the weights of the subsequent layers do not mess this up. Then you should have very high accuracy and low loss. Training the net from this point should yield very small (if any) gradients, and the weights should not change significantly even after many iterations.
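A rough sketch of that net-surgery experiment, written here in tf.keras terms purely for illustration; the layer name, the model object and the channel layout are assumptions, not the asker's actual network:

# Sketch only: zero out the weights feeding from the 512 learned channels of the
# concatenated tensor, keeping only the appended ground-truth ("killing") channel.
conv = model.get_layer("conv_after_concat")    # hypothetical layer consuming 'concat_1'
kernel, bias = conv.get_weights()              # kernel shape: (kh, kw, 513, n_out)
kernel[:, :, :512, :] = 0.0                    # ignore the learned feature channels
conv.set_weights([kernel, bias])
# If training from here gives near-zero gradients, high accuracy and low loss,
# the net has at least a local optimum that relies on the ground-truth channel.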

Convolutional neural networks vs downsampling?

After reading up on the subject I don't fully understand: Is the 'convolution' in neural networks comparable to a simple downsampling or 'sharpening' function?
Can you break this term down into a simple, understandable image/analogy?
edit: Rephrase after 1st answer: Can pooling be understood as downsampling of weight matrices?
Convolutional neural networks are a family of models which have been shown empirically to work great when it comes to image recognition. From this point of view, a CNN is something completely different from downsampling.
But in the framework used in CNN design there is something comparable to a downsampling technique. To fully understand that, you have to understand how a CNN usually works. It is built from a hierarchy of layers, and at every layer you have a set of trainable kernels whose output has a dimension very similar to the spatial size of your input images.
This might be a serious problem: the output from such a layer might be extremely large (~ nr_of_kernels * size_of_kernel_output), which could make your computations intractable. This is the reason why certain techniques are used in order to decrease the size of the output:
Stride, padding and kernel-size manipulation: by setting these to certain values you can decrease the size of the output (on the other hand, you may lose some important information).
Pooling: pooling is an operation in which, instead of passing all outputs from all kernels as the output of a layer, you pass only specific aggregated statistics about them. It is considered extremely useful and is widely used in CNN design.
For a detailed description you might visit this tutorial.
Edit: Yes, pooling is a kind of downsampling 😊
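As a tiny illustration of why pooling behaves like downsampling, here is a sketch (the values and shapes are arbitrary) showing 2x2 max pooling halving the spatial resolution:

# 2x2 max pooling with stride 2 reduces a 4x4 "image" to 2x2.
import numpy as np
import tensorflow as tf

x = tf.constant(np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1))
y = tf.nn.max_pool2d(x, ksize=2, strides=2, padding='VALID')
print(y.shape)         # (1, 2, 2, 1)
print(tf.squeeze(y))   # [[ 5.  7.] [13. 15.]]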

Machine Learning - SVM

If one trains a model using a SVM from kernel data, the resultant trained model contains support vectors. Now consider the case of training a new model using the old data already present plus a small amount of new data as well.
SO:
Should the new data just be combined with the support vectors from the previously formed model to form the new training set? (If yes, then how do I combine the support vectors with the new graph data? I am working with libsvm.)
Or:
Should the new data and the complete old data be combined together and form the new training set and not just the support vectors?
Which approach is better for retraining, more doable and efficient in terms of accuracy and memory?
You must always retrain considering the entire, newly concatenated, training set.
The support vectors from the "old" model might not be support vectors anymore if some of the "new" points lie closer to the decision boundary. Behind the SVM there is an optimization problem that must be solved; keep that in mind. With a given training set, you find the optimal solution (i.e. the support vectors) for that training set. As soon as the dataset changes, that solution might not be optimal anymore.
SVM training is nothing more than a maximization problem where the geometrical and functional margins are the objective function. It is like maximizing a given function f(x)... but then you change f(x): by adding/removing points from the training set you get a better/worse picture of the decision boundary, since that boundary is only known via sampling, where the samples are indeed the patterns from your training set.
I understand you're concerned about time and memory efficiency, but that's a common problem: training SVMs on so-called big data is still an open research topic (there are some hints regarding backpropagation training), because the underlying optimization problem (and the heuristic regarding which Lagrange multipliers should be pairwise optimized) is not easy to parallelize/distribute over several workers.
LibSVM uses the well-known Sequential Minimal Optimization (SMO) algorithm for training the SVM: here you can find John Platt's article on the SMO algorithm, if you need further information regarding the optimization problem behind the SVM.
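To make the recommended approach concrete, here is a minimal sketch using scikit-learn's SVC (which wraps libsvm); the data arrays X_old, y_old, X_new, y_new are assumed placeholders:

# Sketch: always retrain on the complete concatenated training set.
import numpy as np
from sklearn.svm import SVC

X_all = np.vstack([X_old, X_new])           # complete old data + new data
y_all = np.concatenate([y_old, y_new])

model = SVC(kernel='rbf', C=1.0)
model.fit(X_all, y_all)                     # the optimum is recomputed from scratch
print("support vectors:", model.support_vectors_.shape[0])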
Idea 1 has already been examined & assessed by the research community.
Anyone interested in the faster and smarter approach (1) -- re-using the support vectors and adding new data -- should review the research materials published by Dave MUSICANT and Olvi MANGASARIAN on their method, referred to as the "Active Support Vector Machine".
MATLAB implementation: available from http://research.cs.wisc.edu/dmi/asvm/
PDF: [1] O. L. Mangasarian, David R. Musicant; Active Support Vector Machine Classification; 1999
[2] David R. Musicant, Alexander Feinberg; Active Set Support Vector Regression; IEEE Transactions on Neural Networks, Vol. 15, No. 2, March 2004
This is a purely theoretical thought on your question. The idea is not bad. However, it needs to be extended a bit. I'm looking here purely at the goal of sparsifying the training data from the first batch.
The main problem -- which is why this is purely theoretical -- is that your data is typically not linearly separable. Then the misclassified points are very important, and they will spoil what I write below. Furthermore, the idea requires a linear kernel. However, it might be possible to generalise it to other kernels.
To understand the problem with your approach, let's look at the following support vectors (x, y, class): (-1,1,+), (-1,-1,+), (1,0,-). The hyperplane is a vertical line going through zero. If your next batch contained the point (-1,-1.1,-), the max-margin hyperplane would tilt. This could now be exploited for sparsifying. You calculate the, so to say, minimal-margin hyperplane between the two pairs ({(-1,1,+),(1,0,-)}, {(-1,-1,+),(1,0,-)}) of support vectors (in 2D there are only 2 pairs; in higher dimensions or with a non-linear kernel there might be more). This is basically the line going through these points. Afterwards you classify all data points. Then you add all points misclassified by either of these models, plus the support vectors, to the second batch. That's it. The remaining points can't be relevant.
Besides the C/Nu problem mentioned above, the curse of dimensionality will obviously kill you here.
An image to illustrate: red: support vectors from batch one; blue: a non-support vector from batch one; green: a new point from batch two.
Red line: the first hyperplane; green: the minimal-margin hyperplane, which misclassifies the blue point; blue: the new hyperplane (it's a hand fit ;) ).

Random Perturbation of Data to get Training Data for Neural Networks

I am working on Soil Spectral Classification using neural networks and I have data from my Professor obtained from his lab which consists of spectral reflectance from wavelength 1200 nm to 2400 nm. He only has 270 samples.
I have been unable to train the network to more than 74% accuracy since the training data is very limited (only 270 samples). I was concerned that my MATLAB code was not correct, but when I used the Neural Net Toolbox in MATLAB, I got the same results... nothing more than 75% accuracy.
When I talked to my Professor about it, he said that he does not have any more data, but asked me to do random perturbation on this data to obtain more data. I have researched random perturbation of data online, but have come up short.
Can someone point me in the right direction for performing random perturbation on 270 samples of data so that I can get more data?
Also, since by doing this I will be constructing 'fake' data, I don't see how the neural network would be any better; isn't the point of neural nets to use actual, real, valid data to train the network?
Thanks,
Faisal.
I think trying to fabricate more data is a bad idea: you can't create anything with higher information content than you already have, unless you know the true distribution of the data to sample from. If you did, however, you'd be able to classify with the Bayes optimal error rate, which would be impossible to beat.
What I'd be looking at instead is whether you can alter the parameters of your neural net to improve performance. The thing that immediately springs to mind with small amounts of training data is your weight regulariser (are you even using regularised weights?), which can be seen as a prior on the weights if you're that way inclined. I'd also look at altering the activation functions if you're using simple linear activations, as well as the number of hidden nodes (with so few examples, I'd use very few, or even bypass the hidden layer entirely, since it's hard to learn nonlinear interactions with limited data).
While I'd not normally recommend it, you should probably use cross-validation to set these hyper-parameters given the limited size, as you're going to get unhelpful insight from a 10-20% test set size. You might hold out 10-20% for final testing, however, so as to not bias the results in your favour.
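As a purely illustrative sketch of that advice (the asker works in MATLAB, but the idea is the same in any toolkit), here is how one might cross-validate the regularisation strength and hidden-layer size in scikit-learn, with a small held-out test set for final evaluation; X and y are assumed placeholders for the 270 spectra and their labels:

# Sketch only: hyper-parameter selection by cross-validation on a small dataset.
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2, 1e-1],          # L2 weight regularisation
    "hidden_layer_sizes": [(2,), (5,), (10,)],  # keep the hidden layer very small
}
search = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, "held-out accuracy:", search.score(X_test, y_test))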
First, some general advice:
Normalize each input and output variable to [0.0, 1.0]
When using a feedforward MLP, try to use 2 or more hidden layers
Make sure your number of neurons per hidden layer is big enough, so the network is able to tackle the complexity of your data
It should always be possible to get to 100% accuracy on a training set if the complexity of your model is sufficient. But be careful, 100% training set accuracy does not necessarily mean that your model does perform well on unseen data (generalization performance).
Random perturbation of your data can improve generalization performance if the perturbation you are adding (or at least a similar one) occurs in practice. This works because it teaches your network how the data could look different yet still belong to the given labels.
In the case of image classification, you could rotate, scale, noise, etc. the input image (the output stays the same, naturally). You will need to figure out what kind of perturbation could apply to your data. For some problems this is difficult or does not yield any improvement, so you need to try it out. If this does not work, it does not necessarily mean your implementation or data are broken.
The easiest way to add random noise to your data would be to apply gaussian noise.
I suppose your measurements have errors associated with them (a measurement without errors has almost no meaning). For each measured value M ± DeltaM you can generate a new number from N(M, DeltaM), where N is the normal distribution.
This will add new points as experimental noise around the previous ones, and will help take into account the experimental errors in the measurements for the classification. I'm not sure, however, whether it's possible to know in advance how helpful this will be!
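A small sketch of that idea in NumPy; the array names and the 1% error estimate are illustrative assumptions, not values from the question:

# Sketch: for each measured value M with uncertainty DeltaM, draw extra samples
# from N(M, DeltaM) to augment the 270 original spectra.
import numpy as np

rng = np.random.default_rng(0)

def perturb(spectra, delta, n_copies=5):
    # spectra: (n_samples, n_wavelengths); delta: per-point measurement std.
    copies = [rng.normal(loc=spectra, scale=delta) for _ in range(n_copies)]
    return np.vstack([spectra] + copies)

# X_aug = perturb(X, delta=0.01 * np.abs(X))   # e.g. assuming ~1% measurement error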
