Prediction works on file, not on live stream - opencv

We are trying to detect certain actions on videos by concatenating together a few consecutive frames and training a machine learning model. Our trained model predicts perfectly when we read from a file (webm, VP09 if it matters) but when we read from a stream, the predictions are completely off.
Experimenting, we found that the incoming stream (h264) is in "Raw YUV" format. So we tried to convert the frames to RGB and then to grayscale. The numpy array looks OK in that the grayscale values are consistent with the corresponding areas of the image, but the predictions are still off.
Is it not possible to simply take frames from OpenCV's VideoCapture.read(), concatenate them, and use them? We are novices with video streams and codecs, so is there something obvious that we are missing? Thanks!
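For reference, a minimal sketch of the kind of conversion we mean (the file name and stream URL are placeholders), keeping the preprocessing identical for the file and the stream, which is the first thing we are trying to rule out:
import cv2

cap_file = cv2.VideoCapture("clip.webm")             # file input
cap_stream = cv2.VideoCapture("rtsp://host/stream")  # live h264 stream (placeholder URL)

def to_model_gray(frame):
    # VideoCapture normally hands back BGR frames regardless of the codec;
    # if your build really returns raw I420 YUV (CAP_PROP_CONVERT_RGB == 0),
    # use cv2.COLOR_YUV2GRAY_I420 here instead.
    return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

for cap in (cap_file, cap_stream):
    ok, frame = cap.read()
    if ok:
        gray = to_model_gray(frame)  # same preprocessing for both sources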

Related

Time series Autoencoder with Metadata

At the moment I'm trying to build an Autoencoder for detecting anomalies in time series data.
My approach is based on this tutorial: https://keras.io/examples/timeseries/timeseries_anomaly_detection/
But as is often the case, my data is more complex than in this simple tutorial.
I have two different time series from two sensors, plus some metadata, like which machine the time series was recorded on.
With a normal MLP network you could have one network for the time series and one for the metadata and merge them in higher layers. But how can you use this data as an input to an Autoencoder?
Do you have any ideas, or links to tutorials or papers I haven't found?
In this tutorial you can see an LSTM-VAE where the input time series is somehow concatenated with categorical data: https://github.com/cerlymarco/MEDIUM_NoteBook/tree/master/VAE_TimeSeries
There is an article explaining the code (but not in detail). There you can find the following explanation of the model:
"The encoder consists of an LSTM cell. It receives as input 3D sequences resulting from the concatenation of the raw traffic data and the embeddings of categorical features. As in every encoder in a VAE architecture, it produces a 2D output that is used to approximate the mean and the variance of the latent distribution. The decoder samples from the 2D latent distribution upsampling to form 3D sequences. The generated sequences are then concatenated back with the original categorical embeddings which are passed through an LSTM cell to reconstruct the original traffic sequences."
But sadly I don't understand exactly how they concatenate the input data. If you understand it, it would be nice if you could explain it =)
I think I understood it. You have to take a look at the input of the .fit() function. It is not one array; there are separate arrays for the separate categorical features, and additionally there is the original input (in this case a time series). Because he has so many arrays in the input, he needs a corresponding number of Input layers. So there is one Input layer for the time series, another for the same time series (it's an autoencoder, so x_train works like y_train), and a list of Input layers stacked directly with the embedding layers for the categorical data. Once he has all the data in the corresponding Input layers, he can concatenate them as you said.
By the way, he's using the same list for the decoder to give it additional information. I tried it out, and it turned out to be helpful to add a dropout layer (high dropout, e.g. 0.6) between the additional inputs and the decoder. If you do so, the decoder has to learn from the latent z and not only from the additional data!
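To make that concrete, here is a minimal sketch of the wiring (the layer sizes and names are mine, and it's a plain LSTM autoencoder rather than the full VAE from the linked notebook): one Input for the series, one Input per categorical feature, the embeddings repeated across timesteps and concatenated, and the same side information fed to the decoder through a heavy dropout:
from tensorflow import keras
from tensorflow.keras import layers

timesteps, n_features, n_machines, latent_dim = 100, 2, 10, 16

ts_in = keras.Input(shape=(timesteps, n_features), name="series")
machine_in = keras.Input(shape=(1,), name="machine_id")

emb = layers.Embedding(n_machines, 4)(machine_in)    # (batch, 1, 4)
emb = layers.Reshape((4,))(emb)
emb_seq = layers.RepeatVector(timesteps)(emb)        # repeat the embedding per timestep

x = layers.Concatenate()([ts_in, emb_seq])           # series + metadata on every timestep
z = layers.LSTM(latent_dim)(x)                       # encoder -> latent vector

d = layers.RepeatVector(timesteps)(z)
cond = layers.Dropout(0.6)(emb_seq)                  # heavy dropout on the side information
d = layers.Concatenate()([d, cond])
d = layers.LSTM(32, return_sequences=True)(d)
out = layers.TimeDistributed(layers.Dense(n_features))(d)

autoencoder = keras.Model([ts_in, machine_in], out)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit([x_train, machine_ids], x_train, ...)  # x_train is both input and target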
hope I could help you =)

Why does caffe io load_image normalize rgb values of png files from 0-1?

The caffe.io.load_image() function call on a png returns a numpy 3D array with the RGB values as normalized floats in the 0-1 range instead of 0-255.
Is this:
common practice when loading images into array-like structures?
something to do with how the caffe network layers use the images?
something to do with how png files are stored?
Thanks
Normalizing pixel values to range [0..1] (instead of [0..255]) is common practice not only in deep learning, but also in other domains of image-processing/computer-vision.
This is mainly done because the native uint8 pixel values are not easy to work with - uint8 easily overflows/underflows. Therefore, it is more convenient to convert pixel values to a float type in the range [0..1].
Trying to cope with vanishing/exploding gradients in deep nets, there are many theoretical papers analyzing the distribution of activation values (see e.g., this work). These works usually assume a normal distribution of values - hence the scaling. You will also come across many nets that, in addition to scaling, subtract the "image mean" from the input.
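A minimal sketch of those two steps (the mean values here are made-up placeholders, not from any particular dataset):
import numpy as np

img_uint8 = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # stand-in image

img = img_uint8.astype(np.float32) / 255.0                # scale to [0, 1]
mean = np.array([0.45, 0.45, 0.45], dtype=np.float32)     # per-channel "image mean"
img_centered = img - mean                                 # mean subtraction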

How to handle dynamic input size for audio spectrogram used in CNN?

A lot of articles are using CNNs to extract audio features. The input data is a spectrogram with two dimensions, time and frequency.
When creating an audio spectrogram, you need to specify the exact size of both dimensions. But they are usually not fixed. One can specify the size of the frequency dimension through the window size, but what about the time domain? The lengths of audio samples are different, but the size of the input data of CNNs should be fixed.
In my datasets, the audio length ranges from 1s to 8s. Padding or cutting always impacts the results too much.
So I want to know more about how this is usually handled.
CNNs are computed on a frame-window basis. You take, say, 30 surrounding frames and train the CNN to classify them. You need frame labels in this case, which you can get from another speech recognition toolkit.
If you want pure neural network decoding, you'd be better off training a recurrent neural network (RNN); they allow arbitrary-length inputs. To increase the accuracy of RNNs you should also add a CTC layer, which allows the state alignment to be adjusted without a separate alignment network.
If you are interested in the subject you can try https://github.com/srvk/eesen, a toolkit designed for end-to-end speech recognition with recurrent neural networks.
Also related: Applying neural network to MFCCs for variable-length speech segments
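For illustration, a minimal Keras sketch (my own, not from the eesen toolkit; sizes are arbitrary) of an RNN that accepts arbitrary-length inputs by masking zero-padded frames; for transcription you would replace the softmax head with a CTC loss as suggested above:
from tensorflow import keras
from tensorflow.keras import layers

n_mels, n_classes = 64, 10

inputs = keras.Input(shape=(None, n_mels))   # None = any number of time frames
x = layers.Masking(mask_value=0.0)(inputs)   # ignore zero-padded frames
x = layers.LSTM(128)(x)
outputs = layers.Dense(n_classes, activation="softmax")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")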
Ok, finally I found a paper that talks about it. In the paper they say:
All audio clips were standardized by padding/clipping to a 4 second duration
So yes, the padding/clipping that you say impacts your performance is what they do in the papers, from what I can see.
An example of this kind of application is UrbanSoundDataset. It's a dataset of audio clips of different lengths, and therefore any paper that uses it (for a non-RNN network) will be forced to use this or another approach that converts the sounds to vectors/matrices of the same length. I recommend the paper Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification or ENVIRONMENTAL SOUND CLASSIFICATION WITH CONVOLUTIONAL NEURAL NETWORKS. The latter has its code open-sourced, and you can see that it also brings the audio to 4 seconds in the function _load_audio in this notebook.
How to clip audio
from pydub import AudioSegment

duration_ms = 4000                                     # the length you want, e.g. 4 seconds
audio = AudioSegment.silent(duration=duration_ms)      # silent clip of the target length
audio = audio.overlay(AudioSegment.from_wav(path))     # path to your wav; anything longer is cut off
raw = audio.split_to_mono()[0].get_array_of_samples()  # I only keep the left channel
Mel-spectrogram
The standard is to use a mel-spectrogram for this kind of application. You could use the Python library Essentia and follow this example, or use librosa like this:
import librosa

# Attention, I do not cut / pad in this example
y, sr = librosa.load('your-wav-file.wav')
mel_spect = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=1024)
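If you want to combine this with the pad/clip-to-4-seconds approach quoted above, a minimal sketch working directly on the waveform (the file name is a placeholder) could look like:
import librosa
import numpy as np

target_s = 4
y, sr = librosa.load('your-wav-file.wav')
target_len = target_s * sr

if len(y) < target_len:
    y = np.pad(y, (0, target_len - len(y)))   # zero-pad short clips
else:
    y = y[:target_len]                        # clip long ones

mel_spect = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=1024)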

The best method for training a neural network?

I am trying to implement an OCR application that reads Arabic numbers using a neural network with OpenCV.
My question is: what gives me the best performance and speed?
feeding the numbers to the NN in RGB form
feeding the numbers to the NN in grayscale form
feeding the numbers to the NN in binarized form
If you think about it, color information is completely irrelevant to recognizing numbers. It might also be largely irrelevant for speed, since the conversion is preprocessing rather than part of the NN (depending on your setup). Performance-wise there shouldn't be a big difference between grayscale and binary if you've implemented a proper model. But if you're curious, you can test it quite easily by binarizing your training data and comparing the results. It may depend on the data at hand (e.g. any existing noise that gets amplified by naive binarization).
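A minimal sketch of that comparison (the file name is a placeholder; Otsu is just one common thresholding choice, and a naive fixed threshold is typically where noise amplification comes from):
import cv2

img = cv2.imread("digit_sample.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Otsu picks the threshold automatically from the image histogram
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)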

read numbers and letters from an image using openCV

I am developing an application to read the letters and numbers from an image using OpenCV in C++. I first converted the given colour image and colour template to binary images, then called cvMatchTemplate(). This method just highlighted the areas where the template matches, but not clearly. I don't just want to see those areas; I need to parse the characters (letters & numbers) from the image. I am new to OpenCV. Does anybody know any other method to get this result?
The image is taken from a camera. The sample image is shown above. I need to get all the text from the LED display (130 and Delft Tanthaf).
Friends, I tried the sample face detection application, and it detects faces. The HaarCascade file is provided with OpenCV; I just loaded that file and called cvHaarDetectObjects(). To detect the letters I created an xml file using the letter_recog.cpp application provided by OpenCV. But when I load this file, it shows an error (OpenCV error: Unspecified error > in unknown function, file ........\ocv\opencv\src\cxcore\cxpersistence.cpp, line 4720). I searched the web for this error and found information about the lib files used. I did so, but the error remains. Is the error with my xml file or with the call that loads it ((CvHaarClassifierCascade*)cvLoad("builded xml file name",0,0,0))? Please HELP...
Thanks in advance
As of OpenCV 3.0 (in active development), you can use the built-in "scene text" object detection module:
Reference: http://docs.opencv.org/3.0-beta/modules/text/doc/erfilter.html
Example: https://github.com/Itseez/opencv_contrib/blob/master/modules/text/samples/textdetection.cpp
The text detection is built on these two papers:
[Neumann12] Neumann L., Matas J.: Real-Time Scene Text Localization
and Recognition, CVPR 2012. The paper is available online at
http://cmp.felk.cvut.cz/~neumalu1/neumann-cvpr2012.pdf
[Gomez13] Gomez L. and Karatzas D.: Multi-script Text Extraction from
Natural Scenes, ICDAR 2013. The paper is available online at
http://refbase.cvc.uab.es/files/GoK2013.pdf
Once you've found where the text in the scene is, you can run any sort of standard OCR against those slices (Tesseract OCR is common). And there's now an end-to-end sample in opencv using OpenCV's new interface to Tesseract:
https://github.com/Itseez/opencv_contrib/blob/master/modules/text/samples/end_to_end_recognition.cpp
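As a rough sketch of that last step (the bounding box values and file name here are placeholders, and this uses the pytesseract wrapper rather than OpenCV's own text module interface):
import cv2
import pytesseract  # requires the Tesseract binary to be installed

img = cv2.imread("bus.jpg")
x, y, w, h = 100, 50, 200, 60                       # box returned by your text detector
roi = cv2.cvtColor(img[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
print(pytesseract.image_to_string(roi))             # OCR only the detected slice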
Template matching tends not to be robust for this sort of application because of lighting inconsistencies, orientation changes, scale changes etc. The typical way of solving this problem is to bring in machine learning. What you are trying to do by training your own boosting classifier is one possible approach. However, I don't think you are doing the training correctly. You mentioned that you gave it 1 logo as a positive training image and 5 other images not containing the logo as negative examples? Generally you need training samples in the order of hundreds or thousands or more. You cannot possibly train with 6 training samples and expect it to work.
If you are unfamiliar with machine learning, here is roughly what you should do:
1) You need to collect many positive training samples (from a hundred onwards, but generally the more the merrier) of the object you are trying to detect. If you are trying to detect individual characters in the image, then get cropped images of individual characters. You can start with the MNIST database for this. Better yet, to train the classifier for your particular problem, get many cropped images of the characters on the bus from photos. If you are trying to detect the entire rectangular LED board panel, then use images of them as your positive training samples.
2) You will need to collect many negative training samples. Their number should be in the same order as the number of positive training samples you have. These could be images of the other objects that appear in the images you will run your detector on. For example, you could crop images of the front of the bus, road surfaces, trees along the road etc. and use them as negative examples. This is to help the classifier rule out these objects in the image you run your detector on. Hence, negative examples are not just any image containing objects you don't want to detect. They should be objects that could be mistaken for the object you are trying to detect in the images you run your detector on (at least for your case).
See the following link on how to train the cascade of classifier and produce the XML model file: http://note.sonots.com/SciSoftware/haartraining.html
Even though you mentioned you only want to detect the individual characters instead of the entire LED panel on the bus, I would recommend first detecting the LED panel so as to localize the region containing the characters of interest. After that, either perform template matching within this smaller region, or run a classifier trained to recognize individual characters on patches of pixels in this region obtained using a sliding window approach, and possibly at multiple scales. (Note: The haarcascade boosting classifier you mentioned above will detect characters but it won't tell you which character it detected unless you only train it to detect that particular character...) Detecting characters in this region in a sliding window manner will give you the order the characters appear in, so you can string them into words etc.
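A minimal sketch of that sliding window loop (the file name is a placeholder, classify() is a stand-in for whatever per-character classifier you train, and the window/step sizes are arbitrary):
import cv2

def classify(patch):
    # stand-in for your trained per-character classifier;
    # it should return (predicted_label, confidence)
    return "?", 0.0

panel = cv2.imread("led_panel_crop.png", cv2.IMREAD_GRAYSCALE)  # localized LED region
win, step = 24, 8
detections = []
for x in range(0, panel.shape[1] - win + 1, step):
    patch = panel[:, x:x + win]
    label, score = classify(patch)
    if score > 0.9:
        detections.append((x, label))
# detections come out in left-to-right order, so joining the labels
# gives the character string on the panel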
Hope this helps.
EDIT:
I happened to chance upon this old post of mine after separately discovering the scene text module in OpenCV 3 mentioned by #KaolinFire.
For those who are curious, this is the result of running that detector on the sample image given by the OP. Notice that the detector is able to localize the text region, even though it returns more than one bounding box.
Note that this method is not foolproof (at least this implementation in OpenCV with the default parameters). It tends to generate false-positives, especially when the input image contains many "distractors".
Here are more examples obtained using this OpenCV 3 text detector on the Google Street View dataset:
Notice that it has a tendency to find "text" between parallel lines (e.g., windows, walls etc). Since the OP's input image is likely going to contain outdoor scenes, this will be a problem especially if he/she does not restrict the region of interest to a smaller region around the LED signs.
It seems that if you are able to localize a "rough" region containing just the text (e.g., just the LED sign in the OP's sample image), then running this algorithm can help you get a tighter bounding box. But you will have to deal with the false positives (perhaps by discarding small regions, or by picking among the overlapping bounding boxes using a heuristic based on knowledge of how letters appear on the LED signs).
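A minimal sketch of that kind of heuristic filtering (the thresholds are made up and would need tuning for the LED signs):
def filter_boxes(boxes, min_area=200, max_aspect=3.0):
    # boxes are (x, y, w, h) tuples from the text detector
    kept = []
    for (x, y, w, h) in boxes:
        area = w * h
        aspect = max(w, h) / max(1, min(w, h))
        if area >= min_area and aspect <= max_aspect:  # drop tiny or very elongated boxes
            kept.append((x, y, w, h))
    return kept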
Here are more resources (discussion + code + datasets) on text detection.
Code
Extracting text OpenCV
http://libccv.org/doc/doc-swt/
Stroke Width Transform (SWT) implementation (Python)
https://github.com/subokita/Robust-Text-Detection
Datasets
You will find the Google Street View and MSRA datasets here. Although the images in these datasets are not exactly the same as the ones for the LED signs on buses, they may be helpful either for picking the "best" performing algorithm from among several competing algorithms, or for training a machine learning algorithm from scratch.
http://www.iapr-tc11.org/mediawiki/index.php/Datasets_List
See my answer to How to read time from recorded surveillance camera video? You can/should use cvMatchTemplate() to do that.
If you are working with a fixed set of bus destinations, template matching will do.
However, if you want the system to be more flexible, I would imagine you would need some form of contour/shape analysis for each individual letter.
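If the destination set really is fixed, a minimal template matching sketch (the image and template file names are placeholders) could be as simple as:
import cv2

img = cv2.imread("display.png", cv2.IMREAD_GRAYSCALE)
best_name, best_score = None, -1.0
for name in ["delft_tanthaf", "centrum"]:                 # one template per known destination
    tmpl = cv2.imread(f"{name}.png", cv2.IMREAD_GRAYSCALE)
    res = cv2.matchTemplate(img, tmpl, cv2.TM_CCOEFF_NORMED)
    score = float(res.max())
    if score > best_score:
        best_name, best_score = name, score
print(best_name, best_score)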
You can also look at EAST: An Efficient and Accurate Scene Text Detector - https://www.learnopencv.com/deep-learning-based-text-detection-using-opencv-c-python/
At that link you have examples in C++ and Python. I used this code to detect the numbers of buses (after first detecting that a given object is a bus).
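A minimal sketch of loading EAST through OpenCV's dnn module (the model file name, input size, mean values and output layer names follow the linked tutorial; decoding the score/geometry maps into boxes is covered there and omitted here):
import cv2

net = cv2.dnn.readNet("frozen_east_text_detection.pb")
img = cv2.imread("bus.jpg")                               # placeholder image
blob = cv2.dnn.blobFromImage(img, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])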

Resources