How to determine new features when doing feature engineering? - machine-learning

I'm working on a project where I need to build a neural network model to compensate for the error that a vehicle GPS makes. My dataset consists of 4 features: longitudinal acceleration, lateral acceleration, speed and yaw rate. My outputs are the azimuth (the angle, measured from north, from the false GPS coordinates to the true GPS coordinates) and the distance between the false GPS coordinates and the true GPS coordinates. So it's a regression problem, and I thought a feed-forward neural network would perform well on this task, but I'm struggling to implement a network that can fit my data.
My model starts to learn well for the first epochs and then gets stuck at a high loss value, as if it were unable to learn anymore; it even fails to overfit the data. I tried all the preprocessing approaches and neural network best practices, but that didn't help. My last thought was that maybe something is wrong with the data, or that I need more features for such a task, so I want to do some feature engineering, and that's why I'm asking here how I can approach that in my case. I don't know whether it makes sense, but I thought maybe I could take the derivative of the acceleration and add that as a feature, or take its square, or the square of the speed. I have many ideas, but I'm not sure what to do or how to develop an intuition for this, so I hope someone here has the experience to help me with this.
PS: I don't think there is an error in my NN implementation, because when I ran my model on only 10 examples of the data, the NN managed to map the relationship between those 10 examples. It was also fine when I used 100-1000 examples, but when I take more than 3000 examples, the NN starts to learn and then gets stuck at some high loss value. That's why I started to think that maybe something is wrong with the data, or that I need more features for this task.
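For concreteness, here is a minimal sketch of what I mean by adding derived features, assuming the raw signals sit in a pandas DataFrame; the column names and the sampling period are made up:

```python
import pandas as pd

def add_derived_features(df: pd.DataFrame, dt: float = 0.01) -> pd.DataFrame:
    """Append candidate features derived from the raw signals.
    The column names and the sampling period `dt` are placeholders."""
    out = df.copy()
    # Jerk: numerical derivative of the accelerations
    out["long_jerk"] = out["long_accel"].diff() / dt
    out["lat_jerk"] = out["lat_accel"].diff() / dt
    # Simple nonlinear transforms
    out["speed_sq"] = out["speed"] ** 2
    out["lat_accel_sq"] = out["lat_accel"] ** 2
    # Physics-motivated combination: lateral acceleration ~ speed * yaw rate
    out["speed_x_yaw"] = out["speed"] * out["yaw_rate"]
    return out.dropna()  # diff() leaves a NaN in the first row
```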

Related

How to reduce false negatives in image anomaly detection?

I'm currently working on a quality inspection project and I need to develop a program that can detect irregular parts. The problem I'm facing is that I don't have many irregular samples (only seven, versus more than 3,000 regular ones). I tried CNNs, but due to the unbalanced number of samples the model labels everything as regular, so the approach I'm exploring is to use anomaly detection algorithms. I also tried autoencoders, but as the differences between regular and irregular parts are minimal, I couldn't get any good results. So far, the approach that gave me the best results is Local Outlier Factor in combination with feature extractors (HOG). The only problem with this one is that even after tuning the algorithm's parameters it still gives me false positives (normal samples are labeled as irregular), which for this application is not acceptable. Is there anything I can add to the process to eliminate the false positives, or can you recommend another approach? I'd really appreciate any help :)
Use a focal loss function since you have imbalanced data, or you can try data augmentation techniques as well.
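Roughly, a minimal binary focal loss in TensorFlow/Keras could look like the sketch below; gamma and alpha are the usual defaults from the focal loss paper, not values tuned to your data:

```python
import tensorflow as tf

def binary_focal_loss(gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy (well-classified) examples so the
    rare 'irregular' class contributes more to the gradient."""
    def loss_fn(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # p_t is the predicted probability assigned to the true class
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss_fn

# Usage: model.compile(optimizer="adam", loss=binary_focal_loss(), metrics=["accuracy"])
```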

CNN for audio recognition has near perfect test and validation accuracy but not generalizing to new data

I am creating a CNN to recognize a friend's speech. She uses unique vocalizations (not in any language) to communicate. To begin, I recorded 60 samples of three of these sounds (180 .wav samples total). After training the model, I was getting near perfect accuracy from both the test and validation data. I then recorded new sounds immediately after this training, and was getting about 50 percent accuracy, which showed some level of learning and generalizing, since random guesses on 3 classes should have been getting about 33% accuracy.
The next day I tried recording new audio again, and the model's predictions were as good as random. My guess as to the problem is that the model is sensitive to very small changes in environment. It showed some learning immediately after training because the environment would have been very similar. However, the following day, there were probably more substantial changes to environment (background noise, distance from microphone, sitting in different part of the room, etc.). Does this seem like a reasonable guess as to the cause of the problem? And if so, how can I make my model less sensitive to environment? Would adding white noise help? Are there ways to add background noise to my samples? Any help would be appreciated.
That is to be expected! 180 samples is not nearly enough to train a CNN on. A CNN contains thousands to millions of parameters, so you could quite possibly have more parameters to tune than bytes of data in your dataset!
Furthermore, your claim of getting perfect accuracy on the test set seems suspicious. I'd wager that you have accidentally used test data to train the model.
You could try "growing" your dataset by adding randomized noise to the sound files. I don't think that would help much though. The network would become resilient to the type of white noise you added, but probably not to the type of noise found in actual recordings. For example, in speech recognition noises people make when speaking like breathing, saying "uh" and "eh", or clearing their throats, can confuse the recognizer. It is very hard to synthetically add such noises.
Also, while two sounds may sound similar to human ears, their waveforms might be completely different. A song played in different keys sounds similar or even identical to human ears but has totally different waveforms. You get the same effect if you listen to someone talking indoors vs. outdoors vs. in a noisy bar. Even whether someone is standing or sitting can completely change the sound of their voice.
Bottom line: you need way more data. I also recommend experimenting with RNNs and bidirectional RNNs; they are a better fit for temporal data like sound samples than CNNs. They also generally require fewer parameters, so training is faster.
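As a rough illustration of that suggestion, a tiny bidirectional RNN classifier in Keras over MFCC frames might look like this; the 13 coefficients per frame and the 3 output classes are assumptions based on the question, not requirements:

```python
import tensorflow as tf

# Hypothetical sketch: a small bidirectional RNN over MFCC frames for a
# 3-class vocalisation classifier with variable-length inputs.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 13)),                    # (time_steps, n_mfcc)
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```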

How to solve the "cold start" problem in computer vision based deep learning models?

By “cold start” I mean that computer vision models for object detection or semantic segmentation often require about 5000 images per class. So suppose an idea is floated within the company, e.g. we want to use object detection to count the number of wood logs when a truck is dispatched and then use the same app to count the number that is received.
The challenge is that you have only a few images of wood logs on a truck, but to train any model you need thousands, so what do practitioners typically do for these prototypes?
At this stage it is not clear which model to try, and it is also not very feasible to ask the business to invest in collecting and labelling thousands of images of logs.
That is why I am calling this “Cold Start”. How do you start?
What I have looked into is conditional GANs and Pix2Pix, but I am trying to understand the recommended way to start when you have very few images per object class.
I expect that when I drop a few images in a folder and call this library I end up getting a lot more images per class so I can then start my prototyping.
Note that asking for software libraries is specifically off-topic here.
No, there is no magic solution: if your data set doesn't have enough information in its images to train a hand-crafted model, no amount of software will change that fact. However, the first approach is to challenge that "fact": how do you know that you don't have enough images? What happened when you used what you have to train a model? You will train for more epochs before the model converges, but you should be able to achieve far better than random accuracy with a comparable number of iterations.
I seriously doubt that you'll need to collect and label thousands of images: you have a very restricted paradigm, photos of log trucks taken from a vantage point you control. Training a model to count non-overlapping near-circles requires much less differentiation than, say, distinguishing motor vehicles from postal boxes.
Experiment with the basic models you have at hand -- you already have much more of the solution than you realize. If your data set is too small, go out to the yard with a digital camera and get twice as many images, three times as many, whatever you need. Flip the images left-right to get more input.
Does that get you moving?
Transfer learning addresses the problem you are describing as "cold start". Basically, you can import the weights obtained by training on a big, open dataset and just fine-tune them on the smaller dataset you already have. Data augmentation, freezing some of the layers, etc. may help improve the results of a fine-tuned model.
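A minimal fine-tuning sketch in Keras, assuming an ImageNet-pretrained backbone and a small labelled image set; the input size, class count and augmentation layers are placeholders:

```python
import tensorflow as tf

num_classes = 2  # placeholder, e.g. "logs present" vs. "no logs"

# Freeze an ImageNet-pretrained backbone and train only a small head
# on the few labelled images available.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # first phase: keep pretrained weights fixed

model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.RandomFlip("horizontal"),   # cheap augmentation (left-right flips)
    tf.keras.layers.RandomRotation(0.05),
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=20)
# Later: unfreeze some top layers of `base` and fine-tune with a lower learning rate.
```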

Should I adapt loss weights for misclassified samples during epochs?

I am using FCNs (fully convolutional networks) and trying to do image segmentation. When training, there are some areas which get mislabeled, and further training doesn't help much to make them go away. I believe this is because the network learns some features which might not be completely correct, but because there are enough correctly classified examples, it is stuck in a local minimum and can't get out.
One solution I can think of is to train for an epoch, then validate the network on the training images, and then adjust the loss weights for the mismatched parts to penalize mismatches there more in the next epoch.
Intuitively, this makes sense to me - but I haven't found any writing on this. Is this a known technique? If yes, what is it called? If not, what am I missing (what are the downsides)?
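For what it's worth, the mechanics of the idea could be sketched as a per-pixel weighted cross-entropy, where the weight map is bumped up wherever the previous epoch's prediction disagreed with the label. This is only an illustration of the proposal, not an established technique; shapes and the boost factor are assumptions:

```python
import tensorflow as tf

def weighted_pixel_loss(y_true, y_pred, weight_map):
    """Per-pixel weighted cross-entropy for segmentation.
    y_true: integer labels (B, H, W); y_pred: softmax probabilities (B, H, W, C);
    weight_map: floats (B, H, W), e.g. 1.0 everywhere and larger where the
    previous epoch's prediction was wrong."""
    ce = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
    return tf.reduce_mean(ce * weight_map)

def update_weight_map(weight_map, y_true, y_pred, boost=2.0):
    """After validating on the training images, boost the weights of pixels
    the network currently gets wrong."""
    pred_labels = tf.argmax(y_pred, axis=-1)
    wrong = tf.not_equal(tf.cast(y_true, tf.int64), pred_labels)
    return tf.where(wrong, weight_map * boost, weight_map)
```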
It depends highly on your network structure. If you are using the original FCN, the pooling operations degrade the segmentation performance on the boundaries of your objects. There have been quite a few variants of the original FCN for image segmentation, although they didn't go the route you're proposing.
Just to name a couple of examples: one approach is to use a Conditional Random Field (CRF) on top of the FCN output to refine the segmentation. You may search for the relevant papers to get a better idea of it. In some sense it is close to your idea, but the difference is that the CRF is separate from the network, as a post-processing step.
Another very interesting work is U-Net. It borrows an idea from residual networks (ResNet), which enables high-resolution features from lower levels to be integrated into higher levels to achieve more accurate segmentation.
This is still a very active research area. So you may bring the next break-through with your own idea. Who knows! Have fun!
First, if I understand correctly, you want your network to overfit your training set? That's generally something you don't want to happen, because it would mean that during training your network has found some "rules" that enable it to get great results on the training set, but it also means that it hasn't been able to generalize, so when you give it new samples it will probably perform poorly. Moreover, you never mention any test set... have you divided your dataset into training and test sets?
Secondly, to give you something to look into, the idea of penalizing more where you don't perform well made me think of something called "AdaBoost" (it might be unrelated). This short video might help you understand what it is:
https://www.youtube.com/watch?v=sjtSo-YWCjc
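If you just want to see the re-weighting mechanic in code, here is a toy AdaBoost-style weight update in plain NumPy; it is a generic illustration, not tied to your segmentation setup:

```python
import numpy as np

def adaboost_weight_update(sample_weights, y_true, y_pred):
    """One round of AdaBoost-style re-weighting with labels in {-1, +1}:
    misclassified samples get heavier weights for the next weak learner."""
    err = np.sum(sample_weights * (y_true != y_pred)) / np.sum(sample_weights)
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))   # learner's vote weight
    new_w = sample_weights * np.exp(-alpha * y_true * y_pred)
    return new_w / np.sum(new_w), alpha
```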
Hope it helps

TensorFlow - GradientDescentOptimizer - are we actually finding global optimum?

I've been playing with TensorFlow for quite a while, and I have more of a theoretical question. In general, when we train a network we usually use GradientDescentOptimizer (or one of its variations like Adagrad or Adam) to minimize the loss function, and it looks like we are trying to adjust the weights and biases so that we reach the global minimum of this loss function. But the issue is that I assume this function has an extremely complicated shape if you plot it, with lots of local optima. What I wonder is: how can we be sure that gradient descent finds the global optimum, and that we don't almost immediately get stuck in some local optimum that is far away from the global optimum?
I recall that, for example, when you perform clustering in sklearn, it usually runs the clustering algorithm several times with random initialization of the cluster centers, and by doing this we ensure that we don't get stuck with a suboptimal result. But we don't do anything like this while training ANNs in TensorFlow - we start with some random weights and just travel along the slope of the function.
So, any insight into this? Why can we be more or less sure that the results of training via gradient descent are close to the global minimum once the loss stops decreasing significantly?
Just to clarify why I am wondering about this: if we can't be sure that we get at least close to the global minimum, we can't easily judge which of two different models is actually better. We could run an experiment and get a model evaluation which shows that the model is not good, when actually it just got stuck in a local minimum shortly after training started, while another model, which seemed better to us, was simply lucky to start training from a better starting point and didn't get stuck in a local minimum early on. Moreover, this means that we can't even be sure that we get the most out of the network architecture we are currently testing. For example, it could have a really good global minimum, but it is hard to find, and we mostly get stuck with poor solutions at local minima which are far away from the global optimum, and so we never see the full potential of the network at hand.
Gradient descent, by its nature, looks at the function locally (the local gradient). Hence, there is absolutely no guarantee that it will find the global minimum. In fact, it probably will not, unless the function is convex. This is also the reason GD-like methods are sensitive to the initial position you start from. That said, there was a recent paper which argued that in high-dimensional solution spaces, the number of maxima/minima is not as large as previously thought.
Finding global minima in high-dimensional spaces in a reasonable way seems very much an unsolved problem. However, you might want to focus more on saddle points rather than minima. See this post for example:
High level description for saddle point problem
A more detailed paper is here (https://arxiv.org/pdf/1406.2572.pdf)
Your intuition is quite right. Complex models such as neural networks are typically applied to problems with high dimensional input, where the error surface has a very complex landscape.
Neural networks are not guaranteed to find the global optimum and getting stuck in local minima is a problem where a lot of research has been focussed. If you’re interested in finding out more about this, it would be good to look at techniques such as online learning and momentum, which have traditionally been used to avoid the problem of local minima. However, these techniques in themselves bring further difficulties e.g. integrating online learning is not possible for some optimisation techniques and the addition of a momentum hyper-parameter to the backpropagation algorithm brings further difficulties in training.
A really good video for visualising the influence of momentum (and how it overcomes local minima) during backpropagation can be found here.
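As a tiny illustration, the classical momentum update the video describes boils down to something like this (plain NumPy; the learning rate and momentum coefficient are generic values, not recommendations):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """One SGD-with-momentum update. The velocity accumulates past gradients,
    which can carry the weights through shallow local minima and flat regions."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity
```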
Added after question edit - see comments
It’s the aforementioned nature of the problems neural networks are applied to that means we often can’t find a globally optimal solution, because (in the general case) traversing the entire search space for the optimal solution would be intractable using classical computing technology (quantum computers could change this for some problems). As such neural networks are trained to achieve what is hopefully a ‘good’ local optimum.
If you're interested in reading more detailed information about the techniques employed to find good local optima (i.e. something that approximates a global solution) a good paper to read might be this
No. The gradient descent method helps to find a local minimum. Only when that local minimum coincides with the global minimum do we get the actual result, i.e. the global minimum.
