How to make sense of TensorFlow TensorBoard histograms? - machine-learning

I would like to know how to make sense of the TensorFlow graphs/histograms that are generated.
The code for this can be found here.
This graph is easy to understand: accuracy and loss are straightforward.
Accuracy - accuracy of the current state of the network on the given training data. Higher is better.
Accuracy/Validation - accuracy of the current state of the network on validation data, which the network has not seen before. Higher is better.
Loss - loss of the network on the training data. Lower is better.
Loss/Validation - loss of the network on the validation data. Lower is better.
If validation loss increases while training loss keeps falling, it's a sign of over-fitting.
Conv2d/L2-Loss - loss of a particular layer w.r.t. the training data.
Basically: what do these graphs signify, how can I use them to understand my network, and, if possible, what changes can I make to improve it?
How do I interpret the histograms?

tf.summary.histogram takes an arbitrarily sized and shaped Tensor and compresses it into a histogram data structure consisting of many bins with widths and counts. For example, let's say we want to organize the numbers [0.5, 1.1, 1.3, 2.2, 2.9, 2.99] into bins. We could make three bins: a bin containing everything from 0 to 1 (it would contain one element, 0.5), a bin containing everything from 1 to 2 (it would contain two elements, 1.1 and 1.3), and a bin containing everything from 2 to 3 (it would contain three elements: 2.2, 2.9, and 2.99).
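For concreteness, here is a minimal sketch (using the TF 2.x summary API, which postdates this question) that logs a drifting synthetic weight distribution so the Histograms tab has something to show; the log directory and the data are assumptions for the demo:

    import tensorflow as tf

    writer = tf.summary.create_file_writer("logs/histogram_demo")
    with writer.as_default():
        for step in range(100):
            # Synthetic "weights" whose mean drifts, so each step's histogram differs.
            values = tf.random.normal(shape=[1000], mean=step * 0.01, stddev=1.0)
            tf.summary.histogram("weights", values, step=step)
    # Then run `tensorboard --logdir logs` and open the Histograms tab.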
Please follow the links below for more details:
sunside's answer
TensorFlow documentation

Related

Classifying patterns in time series

I am dealing with a repeating pattern in time series data. My goal is to classify every pattern as 1, and anything that does not follow the pattern as 0. The pattern repeats itself between every two peaks as shown below in the image.
The patterns are not necessarily fixed in sample size, but stay within an approximate sample size, let's say 500 samples ±10%. The heights of the peaks can change. The random signal (I call it random, but basically it means anything not following the pattern shape) can also change in value.
The data comes from a sensor. The patterns appear when the device is working smoothly. If the device is malfunctioning, I will not see the patterns and will get something similar to the class 0 shown in the image.
What I have done so far is build a logistic regression model. Here are my steps for data preparation (a sketch of step 1 follows the list):
Grab the data between every two consecutive peaks, resample it to a fixed size of 100 samples, and scale it to [0, 1]. This is class 1.
Repeat step 1 on the data between valleys and call it class 0.
Generate some noise and repeat step 1 on chunks of 500 samples to build extra class 0 data.
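For concreteness, a minimal sketch of step 1, assuming `signal` is a 1-D NumPy array; the minimum peak spacing passed to find_peaks is a hypothetical value:

    import numpy as np
    from scipy.signal import find_peaks, resample

    def make_class1_examples(signal, length=100):
        peaks, _ = find_peaks(signal, distance=400)        # hypothetical min spacing
        examples = []
        for start, end in zip(peaks[:-1], peaks[1:]):
            segment = resample(signal[start:end], length)  # fixed size of 100 samples
            lo, hi = segment.min(), segment.max()
            examples.append((segment - lo) / (hi - lo))    # scale to [0, 1]
        return np.array(examples)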
The bottom figure shows my predictions on the test dataset. The prediction on the noise chunk is not great. I am worried that on real data I may get even more false positives. Any idea how I can improve my predictions? Is there a better approach when no class 0 data is available?
I have seen a similar question here. My understanding of hidden Markov models is limited, but I believe they are used to predict future data. My goal is to classify a sliding window of 500 samples throughout my data.
I have some proposals that you could try out.
First, I think recurrent neural networks (e.g. LSTMs) are often used in this field. But I have also heard that some people work with tree-based methods like LightGBM (I think Aileen Nielsen uses this approach).
So if you don't want to dive into neural networks, which is probably not necessary because your signals seem to be distinguishable relatively easily, you can give LightGBM (or other tree-ensemble methods) a chance.
If you know the maximum length of a positive sample, you can define the length of your "sliding sample window", which becomes your input vector (so each sample in the sliding window becomes one input feature). I would then add an extra attribute with the number of samples since the last peak occurred (outside/before the sample window). Then you can decide in how many steps to let your window slide over the data. This also depends on the memory you have available for this.
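A hedged sketch of that feature construction, assuming `signal` is a 1-D NumPy array and `peaks` holds known peak indices:

    import numpy as np

    def window_features(signal, peaks, window=500, step=50):
        rows = []
        for start in range(0, len(signal) - window + 1, step):
            feats = list(signal[start:start + window])         # one feature per sample
            prior = [p for p in peaks if p < start]
            feats.append(start - prior[-1] if prior else -1)   # samples since last peak
            rows.append(feats)
        return np.array(rows)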
But maybe it would be wise to skip some of the windows around a change between positive and negative, because those states might not be classifiable unambiguously.
In case memory becomes an issue, neural networks could be the better choice, because they do not need all the training data available at once for training, so you can generate your input data in batches. With tree-based methods this possibility does not exist, or only in a very limited way.
I'm not sure what you are trying to achieve.
If you want to characterize what is a peak and what is not - which is an after-the-fact classification - then you can use a simple rule to define peaks, such as signal(t) - average(signal, t-N to t) > T, with T a certain threshold and N the number of data points to look back over.
This would qualify what is a peak (class 1) and what is not (class 0), and hence performs a classification of patterns.
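A minimal sketch of that rule, with N and T as hypothetical parameters to tune:

    import numpy as np

    def is_peak(signal, t, N=50, T=1.0):
        # signal(t) - average(signal, t-N .. t) > T
        if t < N:
            return False
        return signal[t] - signal[t - N:t].mean() > T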
If your goal is to predict, a few time units in advance, that a peak is going to happen at time t, using say the data from t-n1 to t-n2 as features, then logistic regression might not necessarily be the best choice.
To find the right model, you have to start by visualizing the features you have from t-n1 to t-n2 for every peak(t) and see if there is any pattern you can find. And it can be anything:
was there a peak in the n3 days before t?
is there a trend?
was there an outlier (e.g. after an exponential transform of your data)?
In order to compare these patterns, think about normalizing them so that the n2-n1 data points go from 0 to 1, for example.
If you find a pattern visually, then you will know what kind of model is likely to work, and on which features.
If you don't, then it's likely that the white noise you added will do just as well, so you might not find a good prediction model.
However, your bottom graph is not so bad: you have only 2 major false positives out of more than 15 predictions. This suggests that better feature engineering could help.

Using Multilayer Perceptron (MLP) to categorise images and its performance

I am new to the machine/deep learning area!
If I understood correctly, when I use images as input, the number of neurons in the input layer = the number of pixels (i.e. the resolution).
The weights and biases are updated through back-propagation to achieve as low an error rate as possible.
Question 1.
So, if even a single image adjusts the values of the weights & biases (through the back-propagation algorithm), how does adding more similar images to this MLP improve its performance?
(I must be missing something big... to me, it seems like the network will only be optimised for the given single image, and if I input the next one (a similar image), it will only be optimised for that next one.)
Question 2.
If I want to train my MLP to recognise certain types of images (let's say clothes / animals), what is a good number of training examples for each label (i.e. clothes, animals)? I know more training data will produce better results, but how many would be enough for good performance?
Question 3. (continue)
A question from a slightly different angle:
There is the Google Cloud Vision API, which takes images as input and produces labels/probabilities as output. So this API will give me an output of, let's say, 100 labels and the probability of each label.
(E.g., when I put in an online game screenshot, it produces something like the below.)
Can this type of data be used as input to an MLP to categorise certain types of images?
(Assuming I know all possible labels that the Google API produces and use all of them as input neurons.)
Pixel values represent an image. But I think this type of API output can also represent an image, from a different angle.
If so, what would the performance difference be?
E.g., when classifying 10 different types of images:
(pixel-trained model) vs (output-label-trained model)
I can help you with the "intuitive" picture.
First, it may be worth looking at convolutional neural nets and deep learning to see how images are handled as input in a way that reduces the number of weights. It will not be 1 weight per pixel.
Also, what exactly do you mean by "performance"? That is not a well-defined question. If you use 1 image, say of a cat, do you mean by performance that you can identify cats in other pictures, or how well you can fit that one cat image?
Imagine you have a network with 3 weights, 1 input and 1 output, and you trained it to an error of < 0.01, with a desired output of 0.5:

W1  | W2  | W3   | Output
0.1 | 0.2 | 0.05 | 0.5006

If you retrain the network, you may get different weights:

W1  | W2  | W3   | Output
0.3 | 0.2 | 0.08 | 0.49983

Since the weights are quite different, you can imagine that there are several solutions.
Then, if you add another input, you can imagine that some of the weight sets which worked for the first input will also work for the second.
Then you add another input. A subset of the solutions for 2 inputs will work for 3 inputs, and so on.
When you have enough unrelated or noisy inputs, you won't find a set of weights which meets your error criterion. Either you need to add weights (more degrees of freedom), or increase the error target, or both.
Now, you have a learning rate when you train a network. Say you are doing online training (you update the weights after each input), not batch training (you compute the error vector for a batch (subset) of the inputs and update your weights once per batch).
Now, suppose your learning rate is 0.01 and a weight is 0.1. Intuitively:
If, for the first input, that weight had a derivative of 5, then the weight gets a new value of 0.1 - 0.01*5 = 0.05.
If you feed in your next input and the derivative is -5, that means the second input "disagrees" with the first change and tries to pull the weight back to 0.1.
If the derivative for the second input is 5, that means the second input "agrees" with the first.
If you have 20 inputs, some will pull the value up and some will push it down. You keep looping through the training data, and the weight approaches a value which most of the inputs agree on, hence minimizing the error caused by that weight.
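A toy sketch of that online update for a single weight, with made-up per-input derivatives:

    w, lr = 0.1, 0.01
    for grad in [5.0, -5.0, 5.0, 4.0, -5.0]:  # hypothetical derivatives, one per input
        w -= lr * grad                        # online update after every input
        print(round(w, 3))                    # the weight drifts toward a consensus value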
For question 2:
My mathematical gut feeling tells me you definitely need at least 2x as many training examples as weights for the training to mean anything, but you should make that at least 10x the number of weights as a bare minimum before drawing any conclusion about your network, unless you are not trying to generalize to something new (for example, for an XOR gate you can probably get away with far fewer inputs than weights, but that is a longer discussion).
Note:
With 1 image, you can rotate it, stretch it, mix it with other images, etc. to create more images and increase your input set.
If you have a simple input like an XOR gate, you can create inputs like (0.3, 0.7), (0.3, 0.6), (0.2, 0.8), ... to expand your training set.
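As an illustration, a hedged sketch of such augmentation using TensorFlow image ops, assuming `image` is an HxWxC float tensor:

    import tensorflow as tf

    def augment(image):
        image = tf.image.random_flip_left_right(image)
        image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
        image = tf.image.random_brightness(image, max_delta=0.1)
        return image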
For question 3:
This is equivalent to chaining Google's network with a network you create, in series, but training each part separately.
Basically: pictures --> 10 labels as input to your network --> your classification.
The problem I see there is that you may not know all the possible outputs of Google's classification. But say they are consistent:
Is your label the same as one of the 10 labels? If so, use the given label. If it is a different type of label, you can use that API to simplify your network. What are the consequences, or what is the performance?
That is beyond me. In neural nets, while there are good mathematical theories telling us what they can do, many posed problems such as the one you asked require either special mathematical analysis (perhaps a PhD's worth of insight into that class of problems) or, as most people do, showing empirical results.

Machine learning for lighting direction estimation from pictures?

I am a newcomer to machine learning and Stack Overflow.
Recently, I have been trying to create a machine learning algorithm that estimates the direction of a light source based on the reflection of an object.
I know this may be a complicated subject, and that's why, as a first step, I tried to simplify it as much as possible.
I first changed my problem from a regression problem to a classification problem by having only two outputs: the light source is on the left side of the object, or the light source is on the right side of the object.
I am also varying only one angle in my dataset.
Short version of my question:
Do you think it is possible to do such a thing with machine learning? (My experience is too limited to really be sure.)
If yes, which model would be best suited: a CNN? R-CNN? LSTM? SVM?
What would the pipeline be to complete this task?
I am currently using the Unity engine with a directional light that takes a random X angle in [10,60] / [120,170] and a sphere with metallic reflection to create and label a dataset. Here is an example:
https://imgur.com/a/FxNew Label: 0 (left side)
https://imgur.com/a/9KFhi Label: 1 (right side)
For the pre-processing :
Images are resized to 64x64.
Transformed from RGB to grayscale.
For the machine learning, I'm currently using TensorFlow and a convolutional neural network with (a Keras sketch of this architecture follows the list):
10,000 balanced, labeled examples: 64x64 grayscale pictures as input and 0/1 as the label
3 convolutional layers with filters [16,32,64] of size [5,5], ReLU
3 pooling layers of size [2,2] with stride [2,2]
1 dense layer with 1024 hidden neurons and dropout (rate = 0.4), ReLU
1 dense layer with 2 output neurons (1 for each class), softmax
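Since the original code isn't shown, here is a hedged Keras sketch of that architecture as described (the optimizer and loss are assumptions):

    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(16, (5, 5), padding="same", activation="relu",
                      input_shape=(64, 64, 1)),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        layers.Conv2D(32, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        layers.Conv2D(64, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])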
As for the issue: my network is simply not learning. The loss hardly goes down, and the accuracy shows that the results are no better than random, whatever the data, the number of layers, the optimizer, the learning rate, etc. My output just averages between the two classes: [0.5, 0.5].
My guess is that the problem is more complicated than I first thought, that my data doesn't give a good hint of what my prediction should be, and that I should instead train a network that detects the reflection dot on the object and then use the orientation between the center of the object and the dot. Am I right?
Another guess is that the convolutional layers don't take position into account, so for the convolution part all the images look the same, since the sphere is always the same, as is the lighting pattern. They will always detect the same thing and won't take into account that the light region has moved. Do you have any advice on which network I could use to resolve this issue?
I'm really looking for advice, and warnings, on how to tackle this kind of task.
Please remember that I am still pretty new to machine learning and still learning more than my machines hehe...
Thank you.
Do you think that it is possible to do such a thing with machine learning?
Absolutely. And you've correctly chosen a CNN model - it's the best suited for this task.
My guess is that the problem is more complicated than I first thought, that my data doesn't give a good hint of what my prediction should be, and that I should instead train a network that detects the reflection dot on the object and then use the orientation between the center of the object and the dot. Am I right?
No, CNNs have proven to classify pretty well from raw pixels. The network should figure out by itself what to pay attention to.
Do you have any advice on which network I could use to resolve this issue ?
It would be great if you provided your full code. There are so many possible reasons for not learning: image pre-processing bugs, data mislabeling, a poor choice of hyperparameters (learning rate, initialization, ...), the wrong loss function, etc. There can simply be bugs.
What I suggest right away, based on the described CNN architecture:
A 5x5 filter size is probably too large, since you don't have that many filters. Try 3x3 and increase the number of filters a bit, e.g. 32 - 64 - 64.
I assume that you use CONV - POOL - CONV - POOL - CONV - POOL, not CONV - CONV - CONV - POOL - POOL - POOL. Just to make sure.
You probably don't need so many neurons in your FC layer. You have just two classes and pretty similar images! Reduce 1024 to say 256.
You don't experience any overfitting at the moment, so disable the dropout for now: keep_probability=1.0.
Pay attention to initialization and the learning rate. Try different values on a log scale, e.g. learning_rate = 0.1, 0.01, 0.001, and check whether the learning pattern ever changes.
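A minimal sketch of that log-scale sweep, assuming a hypothetical build_model() that returns a freshly initialized model and x_train/y_train as the data:

    import tensorflow as tf

    for lr in [0.1, 0.01, 0.001]:
        model = build_model()  # hypothetical: returns a fresh, uncompiled model
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                      loss="sparse_categorical_crossentropy", metrics=["accuracy"])
        history = model.fit(x_train, y_train, epochs=5, validation_split=0.1)
        print(lr, history.history["loss"][-1])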
Thank you to @Maxim for his answer. It was very helpful and helped me solve my problem as well as refine my network.
He pointed me to the problem: data mislabeling.
I was pretty sure about my data labeling but verified it anyway.
The problem was there...
I'm writing the answer here so it can maybe help other unaware TensorFlow users:
When you use tf.string_input_producer without specifying otherwise, the default is shuffle=True, which shuffles your filename queue.
Since I use a .csv file for the labels and a folder of .png files for the images, the labels were read in order from 1 to 10000, whereas the .png files were read randomly.
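A minimal sketch of the fix (TF 1.x queue API, where the producer lives under tf.train): disable shuffling on both queues, or give both the same seed, so filenames and labels stay aligned. image_files and label_lines are hypothetical lists:

    import tensorflow as tf

    # Keep the two queues aligned: either shuffle=False on both...
    image_queue = tf.train.string_input_producer(image_files, shuffle=False)
    label_queue = tf.train.string_input_producer(label_lines, shuffle=False)
    # ...or shuffle both with the same seed:
    # image_queue = tf.train.string_input_producer(image_files, shuffle=True, seed=42)
    # label_queue = tf.train.string_input_producer(label_lines, shuffle=True, seed=42)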
I feel very dumb about it, but that's how you learn hehe.

Can CNN learn to weigh certain feature channels much, much more than others?

This is a hypothetical question.
Assumptions
I am working on a 2-class semantic segmentation task
My ground truths are binary masks
batch size is 1
at an arbitrary point in my network there is a convolution layer called 'conv_5' which has a feature map size of 90 x 45 x 512.
Let's assume I also decide that (during training) I will concatenate the ground truth mask to 'conv_5'. This will result in a new top we can call 'concat_1' which will be a 90 x 45 x 513 dimension feature map.
Assume that the rest of the network follows a normal pattern like a few more convolution layers, a fully connected, and softmax loss.
My question is, can the fully connected layers learn to weigh the first 512 feature channels very low and weigh the last feature channel (which we know is a perfect ground truth) very high?
If this is true then is it true in principle such that if the feature dimension was 1,000,000 channels and I add the last channel as the perfect ground truth it will still learn to effectively ignore all previous 1,000,000 feature channels?
My intuition is that if there is ever a VERY good feature channel passed in, then the network should be able to learn to utilize this channel far more than the others. I would also like to think that this is independent of the number of channels.
(In practice I have a scenario where I am passing in a nearly perfect ground truth as the 513th feature map, but it seems to have no impact at all. When I examine the magnitudes of the weights across all 513 feature channels, the magnitudes are roughly the same across all channels. This leads me to believe that the "nearly perfect mask" is only being utilized at about 1/513th of its potential. This is what motivated me to ask the question.)
Hypothetically, if you have a "killing feature" at your disposal, the net should learn to use it and ignore the "noise" from the rest of the features.
BTW, why are you using a fully connected layer for semantic segmentation? I'm not sure this is "a normal pattern" for semantic segmentation nets.
What may prevent the net from identifying the "killing features"?
- The layers above "conv_5" mess things up: if they reduce resolution (sampling/pooling/striding...), then information is lost and it becomes difficult to utilize. Specifically, I suspect the fully-connected layer might globally mess things up.
- A bug: the way you add the "killing feature" is not aligned with the image. Either the mask is added transposed, or you erroneously add the mask of one image to another (do you "shuffle" the training samples?).
An interesting experiment:
You can check whether the net at least has a locally optimal set of weights that uses the "killing feature": use net surgery to manually set the weights such that "conv_5" is zero for all features except the "killing feature", and make sure the weights of the subsequent layers do not mess this up. Then you should see very high accuracy and low loss. Training the net from this point should yield very small (if any) gradients, and the weights should not change significantly even after many iterations.
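A hedged sketch of that surgery in Keras terms (an assumption; the question's framework isn't stated), where conv_after_concat is a hypothetical handle to the convolution that consumes 'concat_1':

    import numpy as np

    kernel, bias = conv_after_concat.get_weights()  # kernel shape: (kh, kw, 513, out)
    kh, kw, in_ch, out_ch = kernel.shape
    kernel[:] = 0.0
    kernel[kh // 2, kw // 2, -1, :] = 1.0  # pass through only the ground-truth channel
    bias[:] = 0.0
    conv_after_concat.set_weights([kernel, bias])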

Appropriateness of an artificial neural network in pose estimation

I am working on a project for uni which requires markerless relative pose estimation. To do this, I take two images and match n features in certain locations of the pictures. From these points I can find vectors between them which, combined with distance, can be used to estimate the camera's new position.
The project is required to be deployable on mobile devices, so the algorithm needs to be efficient. One thought I had for making it more efficient was to feed these vectors into a neural network which outputs an estimate of the xyz movement vector from that input.
My question is: could a NN be appropriate for this situation if sufficiently trained? And, if so, how would I choose the number of hidden units, and what would the best activation function be?
Using a neural network for your application can very well work, however, I feel you will need a lot of training samples to allow the network to generalize. Of course, this also depends on the type and number of poses you're dealing with. It sounds to me that with some clever maths it might be possible to derive the movement vector directly from the input vector -- if by any chance you can come up with a way of doing that (or provide more information so others can think about it too), that would very much be preferred, as in that case you would include prior knowledge you have about the task instead of relying on the NN to learn it from data.
If you decide to go ahead with the NN approach, keep the following in mind:
Divide your data into training and validation set. This allows you to make sure that the network doesn't overfit. You train using the training set and determine the quality of a particular network using the error on the validation set. The ratio of training/validation depends on the amount of data you have. A large validation set (e.g., 50% of your data) will allow more precise conclusions about the quality of the trained network, but often you have too few data to afford this. However, in any case I would suggest to use at least 10% of your data for validation.
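A one-line sketch of such a split with scikit-learn (an assumption; any framework works), holding out 10% for validation; X and y are hypothetical feature/label arrays:

    from sklearn.model_selection import train_test_split

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=0)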
As to the number of hidden units, a rule of thumb is to have at least 10 training examples for each free parameter, i.e., each weight. So assuming you have a 3-layer network with 4 inputs, 10 hidden units, and 3 output units, where each hidden unit and each output unit additionally has a bias weight, you would have (4+1) * 10 + (10+1) * 3 = 83 free parameters/weights. In general you should experiment with the number of hidden units and also the number of hidden layers. From my experience, 4-layer networks (i.e., 2 hidden layers) work better than 3-layer networks, but that depends on the problem. Since you also have the validation set, you can find out what network architecture and size works without having to fear overfitting.
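A tiny check of that count and the implied rule of thumb:

    inputs, hidden, outputs = 4, 10, 3
    weights = (inputs + 1) * hidden + (hidden + 1) * outputs  # +1 for each bias weight
    print(weights)       # 83 free parameters
    print(10 * weights)  # rule of thumb: at least ~830 training examples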
For the activation function you should use a sigmoid function to allow for non-linear behavior. I like the hyperbolic tangent for its symmetry, but in my experience you can just as well use the logistic function.
