Can TensorFlow be used for audio pitch detection? - machine-learning

I'm really new to TensorFlow and ML, and I just want to know whether TensorFlow can work with audio. I want to create a classifier that can detect the pitch of the loudest sound in an audio signal regardless of its source (human, instrument, bird, etc.).

Just like you can use images in TensorFlow by converting them to a numeric representation (a matrix of pixel values, in the case of images), you can work with audio in TensorFlow once you convert it to a numeric representation. Hope this helps :)
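For illustration, here is a minimal sketch of that idea, assuming a mono float32 waveform at 16 kHz. It uses tf.signal.stft to get a spectrogram and picks the loudest frequency bin per frame, which is a crude dominant-frequency estimate rather than a real pitch tracker:

```python
import numpy as np
import tensorflow as tf

sample_rate = 16000          # assumed sample rate
frame_length = 1024
frame_step = 512

# Stand-in signal: one second of a 440 Hz sine (replace with real audio).
t = np.arange(sample_rate, dtype=np.float32) / sample_rate
waveform = tf.constant(np.sin(2 * np.pi * 440.0 * t))

# Short-time Fourier transform -> complex spectrogram (frames x bins).
stft = tf.signal.stft(waveform, frame_length, frame_step)
magnitude = tf.abs(stft)

# Loudest bin per frame, converted from bin index to Hz.
peak_bins = tf.argmax(magnitude, axis=-1)
peak_hz = tf.cast(peak_bins, tf.float32) * sample_rate / frame_length

print(peak_hz.numpy()[:5])   # ~437.5 Hz, the bin nearest 440 Hz
```

A real pitch detector would need something smarter (autocorrelation, a trained model, etc.), but the point stands: once the audio is a tensor of numbers, TensorFlow can work with it.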

Related

(Speech/sound recognition) Why do most article/book show codes that train machine using JPEG file of plotted spectrogram?

Background: Hi, I am a total beginner in machine learning, with a civil engineering background.
I am attempting to train a machine (using Python) to create an anomaly-detection algorithm that can detect defects inside concrete using the sound from a hammering test.
From what I understand, to train a machine for speech recognition you need to process your sound signal using signal-processing techniques like short-time Fourier analysis or wavelet analysis. From this analysis your sound data is decomposed into the frequencies (and times) that it is made up of: time-frequency-amplitude data.
After that, most articles I've read plot a spectrogram from this array and save it as a JPG/JPEG. The image data is then processed again to be fed into a neural network, and the rest is the same as training a machine for image recognition.
My question is: why do we need to plot the array as a spectrogram (an image file) and feed the machine the image file, instead of using the array directly?
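For reference, a minimal sketch of the array-based alternative the question asks about, assuming SciPy is available; the signal here is a random stand-in for a real hammering recording:

```python
import numpy as np
from scipy.signal import stft

fs = 44100                                   # assumed sample rate
signal = np.random.randn(fs)                 # stand-in for a real recording

# Short-time Fourier transform: rows are frequencies, columns are time frames.
freqs, times, Z = stft(signal, fs=fs, nperseg=1024)
spectrogram = np.abs(Z)                      # magnitude, shape (freq, time)

# Log scaling and normalization -- the numeric counterpart of what a
# plotted spectrogram encodes as pixel intensities.
log_spec = np.log1p(spectrogram)
log_spec = (log_spec - log_spec.mean()) / (log_spec.std() + 1e-8)

# log_spec[np.newaxis, ..., np.newaxis] is a (1, freq, time, 1) tensor
# that a 2-D convolutional network can consume directly, no JPEG needed.
```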

Machine Learning: What is the format of video inputs that I can pass for my machine learning algorithm to analyze input video?

There are certain machine learning algorithms in use that take video files as input. If I have to pull all the videos from YouTube that are associated with a certain tag and provide them as input to such an algorithm, what should my input format be?
There is no format in which you can pass a raw video to a machine learning algorithm, since it won't understand the contents of the video.
You need to preprocess the video first, and how depends on what you want to do with it. In general you can convert each frame of the video to a numeric array (the same as preprocessing an image; exporting to CSV is one option), which you can then pass to your machine learning algorithm. If you want to process the frames sequentially, you may want to use a recurrent neural network. Also, if the video has audio, you can extract its audio time series and combine each part of it with the corresponding video frame.
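A rough sketch of that frame-by-frame preprocessing, assuming OpenCV (cv2) is installed and clip.mp4 is a placeholder path:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("clip.mp4")   # hypothetical input file
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)    # drop color
    small = cv2.resize(gray, (64, 64))                # fixed input size
    frames.append(small.astype(np.float32) / 255.0)   # scale to [0, 1]
cap.release()

video_tensor = np.stack(frames)      # shape: (num_frames, 64, 64)
# video_tensor.reshape(len(frames), -1) gives the row-per-frame layout
# you would get from a CSV export; the 3-D stack suits an RNN over frames.
```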

Track shifting and zooming object between frames

I'm trying to track cars using video from a dash cam. Most of the time there is:
- slight shifting of the vehicle in front of me
- brake lights turning on and off
- apparent zooming in when it brakes
- apparent zooming out when it accelerates
What algorithm would be best for this case? Of course, I can just run OpenCV, but I want to understand how it works.
Thank you!
I think that for your task you can use a Haar cascade classifier. It is a machine-learning-based approach where a cascade function is trained on many positive and negative images and then used to detect objects in other images.
There is a good OpenCV implementation, with both the trainer and the detector.
On the web you can even find lots of .xml files that are the result of the training step, and you can use these .xml files to do the detection directly.
That said, I'm not sure you will find such files for detecting the rear of a car.
At this link you can learn the basics of the method and see how to use it in OpenCV: http://docs.opencv.org/master/d7/d8b/tutorial_py_face_detection.html#gsc.tab=0
In this case you don't need the four cues you listed, but maybe you can use them with another algorithm at the end of the Haar cascade pipeline as a double check.
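A minimal sketch of running a pre-trained cascade with OpenCV; the file names cars.xml and dashcam_frame.jpg are placeholders, and you would need a cascade actually trained on the rear view of cars:

```python
import cv2

cascade = cv2.CascadeClassifier("cars.xml")      # hypothetical cascade file
frame = cv2.imread("dashcam_frame.jpg")          # placeholder image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Returns one (x, y, w, h) box per detection; scaleFactor and
# minNeighbors trade recall against false positives.
boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in boxes:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.jpg", frame)
```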

How to extract human voice from an audio clip, using machine learning?

How can we use machine learning to extract the human voice from an audio clip that may contain a lot of noise across the whole frequency domain?
As in any ML application, the process is simple: collect samples, design features, train the classifier. For the samples you can use your noisy recordings, or you can find plenty of noise in web sound collections like freesound.org. For the features you can use mean-normalized mel-frequency cepstral coefficients; you can find an implementation in the CMUSphinx speech recognition toolkit. For the classifier you can pick a GMM or an SVM. If you have enough data, it will work fairly well.
To improve accuracy you can add the assumption that noise and voice are continuous. To that end, you can analyze the detection history with a hangover scheme (essentially an HMM) to detect voice chunks, instead of analyzing every frame individually.
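A hedged sketch of that frame-level classifier, using librosa for the MFCCs and scikit-learn for the SVM in place of the CMUSphinx implementation mentioned above; the voice and noise arrays are random stand-ins for recordings you would collect:

```python
import numpy as np
import librosa
from sklearn.svm import SVC

sr = 16000
voice = np.random.randn(sr * 5)      # stand-in for real voice recordings
noise = np.random.randn(sr * 5)      # stand-in for real noise recordings

def frame_features(y, sr):
    # 13 MFCCs per frame, mean-normalized as the answer suggests.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, 13)
    return mfcc - mfcc.mean(axis=0)

Xv, Xn = frame_features(voice, sr), frame_features(noise, sr)
X = np.vstack([Xv, Xn])
labels = np.concatenate([np.ones(len(Xv)), np.zeros(len(Xn))])

clf = SVC(kernel="rbf").fit(X, labels)
# clf.predict(frame_features(clip, sr)) labels each frame voice/noise;
# smoothing those per-frame decisions over time is the hangover step.
```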

Simple technique to upsample/interpolate video features?

I'm trying to analyse audio and visual features in tandem. My audio speech features are mel-frequency cepstral coefficients sampled at 100fps using the Hidden Markov Model Toolkit. My visual features come from a lip-tracking programme I built and are sampled at 29.97fps.
I know that I need to interpolate my visual features so that the sample rate is also 100fps, but I can't find a nice explanation or tutorial on how to do this online. Most of the help I have found comes from the speech recognition community which assumes a knowledge of interpolation on behalf of the reader, i.e. most cover the step with a simple "interpolate the visual features so that the sample rate equals 100fps".
Can anyone point me in the right direction?
Thanks a million
Since face movement is not low-pass filtered prior to video capture, most of the classic DSP interpolation methods may not apply. You might as well try linear interpolation of your feature vectors to map one set of time points onto another: just pick the two closest video frames and interpolate to get more data points in between. You could also try spline interpolation if your facial-tracking algorithm measures accelerations in face motion.
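A minimal sketch of that linear interpolation with SciPy, assuming the visual features form a (num_frames, num_features) array sampled at 29.97fps:

```python
import numpy as np
from scipy.interpolate import interp1d

src_fps, dst_fps = 29.97, 100.0
visual = np.random.randn(300, 8)                # stand-in lip-tracking features

src_t = np.arange(len(visual)) / src_fps        # original timestamps
dst_t = np.arange(0, src_t[-1], 1.0 / dst_fps)  # 100 fps time grid

# interp1d interpolates each feature column independently along axis 0;
# kind="cubic" would give the spline variant mentioned above.
upsampled = interp1d(src_t, visual, axis=0, kind="linear")(dst_t)
print(upsampled.shape)                          # roughly (998, 8)
```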
