Closed. This question is opinion-based. It is not currently accepting answers.
I'm trying to develop an application that classifies music as 'probably favorite' or 'probably not favorite' by training a neural network on tracks the user has already marked as favorites. I have never done audio analysis before, so I know almost nothing about it. What features do I need to include in my music dataset to make this an accurate classification model?
E.g. decibel values, frequency values, length of the audio.
Thank you.
Start by using the music feature extractors from Essentia. You can, for instance, use their command-line tool, which provides tons of low-level audio features (30+ types), as well as rhythm features (6+ types) and tonal features (6+ types).
You can also do the same with the Python bindings.
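For instance, a minimal sketch with the Python bindings, assuming the essentia package is installed; 'track.mp3' is a hypothetical input file, and exact descriptor names can vary between Essentia versions:

```python
# Whole-track feature extraction with Essentia's MusicExtractor
# (same feature set as the command-line extractor).
# 'track.mp3' is a hypothetical input file.
import essentia.standard as es

features, frames = es.MusicExtractor(
    lowlevelStats=['mean', 'stdev'])('track.mp3')

# List available descriptors, then pick a few for the dataset.
print(sorted(features.descriptorNames())[:10])
print(features['rhythm.bpm'])                 # estimated tempo
print(features['lowlevel.average_loudness'])  # overall loudness
```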
Spectrograms are a useful technique for visualising the spectrum of frequencies of a sound and how it varies over time. A closely related representation, Mel-Frequency Cepstral Coefficients (MFCCs), works well as a feature set for the dataset.
You can use Librosa's mfcc() function, which generates MFCCs from time-series audio data, to make the task a lot easier.
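A minimal sketch, assuming librosa is installed; 'track.wav' is a hypothetical input file:

```python
# MFCC extraction with librosa (assumes `pip install librosa`).
# 'track.wav' is a hypothetical input file.
import librosa
import numpy as np

y, sr = librosa.load('track.wav')                    # decode to a float time series
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

# Summarise the per-frame coefficients into a fixed-length feature vector
# (mean and standard deviation per coefficient) for the neural network.
feature_vector = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(feature_vector.shape)  # (26,)
```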
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
I need to be able to count the number of syllables a speaker utters during a live recording. It should be noted that the speakers will not be using their native language and therefore most existing speech recognition solutions won't work. I've looked a little at CoreML, and I may be able to get access to speech corpora with the number of syllables coded, but I'm not sure how I would start training the model.
I'd also be happy with imperfect but generally consistent approaches that didn't use machine learning.
(This will be used in linguistics research in future)
In terms of naive solutions, I've found this on Language Log, but I'm not sure how it would be implemented in Swift (this sort of audio analysis is not my forte).
In terms of machine learning solutions, I found this on GitHub, but my attempts at porting the TensorFlow v1 code to run in CoreML failed.
Any help you could offer would be greatly appreciated!
It depends on the technique used. If the audio you want syllables from is definitely words, one strategy is to use a pre-trained speech-to-text model, such as Whisper, to get a transcript. From the transcript, you can run a simple algorithm that counts the syllables in the transcribed text.
This method will not account for utterances that are faint, however, and it will take some wrangling to get it running in real time, but it is state-of-the-art transcription.
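A minimal sketch of the transcript-based approach, assuming the openai-whisper package is installed; 'speech.wav' is a hypothetical recording, and the syllable counter is a deliberately naive vowel-group heuristic (real research code should use a pronunciation dictionary such as CMUdict):

```python
# Transcribe with Whisper, then count syllables with a naive heuristic.
# Assumes `pip install openai-whisper`; 'speech.wav' is hypothetical.
import re
import whisper

def count_syllables(word: str) -> int:
    # Approximate: each run of vowels is one syllable; drop a trailing
    # silent 'e'. A pronunciation dictionary would be more accurate.
    word = word.lower()
    if word.endswith('e') and not word.endswith('le'):
        word = word[:-1]
    groups = re.findall(r'[aeiouy]+', word)
    return max(1, len(groups))

model = whisper.load_model('base')
text = model.transcribe('speech.wav')['text']
total = sum(count_syllables(w) for w in re.findall(r"[a-zA-Z']+", text))
print(total)
```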
Another option is to use the speech-to-text feature provided by Apple's SDK. The tutorial here will guide you through building a speech-to-text app with real-time transcription; using the transcribed words, you can count the syllables from the transcript.
Closed. This question is opinion-based. It is not currently accepting answers.
I am working on building a custom facial recognition system for our office.
I am planning to use Google's FaceNet.
Finding or creating a FaceNet model in Keras or PyTorch is not the issue. My question is about creating the dataset: what are the best practices for capturing photos of a person when I have no prior photo of them, only a camera and the person? Should I create variance by changing lighting conditions, orientation, or face size?
A properly trained FaceNet model should already be somewhat invariant to lighting conditions, pose and other features that should not be part of identifying a face. At least, that is what is claimed in a draft of the FaceNet paper. If you only intend to compare feature vectors generated from the network, and intend to recognize a small group of people, your own dataset likely does not have to be particularly large.
Personally, I have done something quite similar to what you are trying to achieve for a group of around 100 people. The dataset consisted of one image per person, and I used a 1-NN (nearest-neighbour) classifier on the generated feature vectors. While I do not remember the exact results, it worked quite well. The pretrained network's architecture was different from FaceNet's, but the overall idea was the same.
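A minimal sketch of that setup, using scikit-learn for the 1-NN step; get_embedding(), enrollment_images and query_image are hypothetical stand-ins for your FaceNet-style model and preprocessed face crops:

```python
# 1-NN classification of face embeddings with scikit-learn.
# get_embedding() is a hypothetical wrapper around a FaceNet-style model
# that maps a face crop to a fixed-length vector (e.g. 128-D or 512-D).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One enrollment image per person (enrollment_images is hypothetical).
embeddings = np.stack([get_embedding(img) for img in enrollment_images])
names = ['alice', 'bob', 'carol']  # hypothetical labels

clf = KNeighborsClassifier(n_neighbors=1, metric='euclidean')
clf.fit(embeddings, names)

# At recognition time, embed the query face and find the closest person.
query = get_embedding(query_image).reshape(1, -1)
print(clf.predict(query)[0])
```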
The only way to truly answer your question though would be to experiment and see how well things work out in practice.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
I have a timeline of a user's data and I want to train a model to detect events.
For example, an event could be a gesture in a timeline of accelerometer data,
or
a timeline of glances at a watch, labelled as nervous or calm.
What machine learning algorithm would be appropriate for this problem?
Thanks
This task is known as Event Detection and can be performed using Natural Language Processing (NLP) techniques.
There is no 'appropriate' or 'not appropriate' algorithm as such. You have to extract various features (e.g. part-of-speech tags) that enable the algorithm(s) to detect events. Then you need to evaluate the implemented algorithms/models (assuming you have also tuned the corresponding parameters for each algorithm) and decide which one performs best. You also need to decide which features are helpful and which are not; a short model-comparison sketch follows the papers listed below.
These papers might be a good starting point:
Machine Learning Algorithms for Event Detection
Event Detection Challenges, Methods, and Applications in Natural and Artificial Systems
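As a sketch of the evaluation step described above, assuming scikit-learn; X and y are hypothetical arrays of extracted features and event labels:

```python
# Compare candidate classifiers with cross-validation (assumes scikit-learn).
# X (n_samples, n_features) and y (n_samples,) are hypothetical: features
# extracted from the timelines and their event labels.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

candidates = {
    'logreg': LogisticRegression(max_iter=1000),
    'forest': RandomForestClassifier(n_estimators=200),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='f1')
    print(f'{name}: mean F1 = {scores.mean():.3f}')
```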
There is no closed answer as to what the best approach is. Based on experience, my favourite approach to modelling series in general is LSTM nets. These work great with time events as long as you have enough data. You can either look for anomalies, for which you could use an LSTM that triggers when something 'unexpected' happens, or define different states (e.g. is_event = {0, 1}) and train your LSTM as a normal classifier (check this question on Quora). You can use, for example, Keras to implement this easily in Python.
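A minimal sketch of the classifier variant in Keras, assuming TensorFlow is installed; X_train and y_train are hypothetical windowed sensor data and binary labels:

```python
# Binary event classifier over fixed-length windows of sensor data.
# Assumes TensorFlow/Keras; X_train has shape (n_windows, timesteps, channels)
# and y_train has shape (n_windows,) with 0/1 labels (both hypothetical).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

timesteps, channels = 100, 3  # e.g. 100 samples of 3-axis accelerometer data

model = Sequential([
    LSTM(64, input_shape=(timesteps, channels)),
    Dense(1, activation='sigmoid'),  # P(window contains an event)
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=32)
```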
If data is not so abundant, you can also try other nice sequential models like HMMs and HSMMs, which also learn from sequential data. In the case of HSMMs you additionally model the time spent in each state, which, depending on your data, can be useful. HMM support was split out of scikit-learn into the separate hmmlearn package, and there is an HSMM library available here.
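And a minimal HMM sketch with hmmlearn; X and lengths are hypothetical (all sequences stacked into one array, plus each sequence's length):

```python
# Fit a Gaussian HMM to sensor sequences (assumes `pip install hmmlearn`).
# X stacks all sequences into one (n_samples, n_features) array; `lengths`
# records where each sequence ends (both hypothetical).
from hmmlearn import hmm

model = hmm.GaussianHMM(n_components=2, covariance_type='diag', n_iter=50)
model.fit(X, lengths)          # unsupervised EM fit over all sequences

states = model.predict(X)      # most likely hidden state per sample
# If one learned state lines up with the labelled events, runs of that
# state mark detected event intervals.
print(states[:20])
```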
Finally, some remarks about processing your data. If you intend to do batch learning, any of the models suggested here should work fine. However, if you want to do online learning (making predictions on the fly as data arrives), you will need to stick to LSTMs, or check this alternative if you decide to use a Bayesian approach: a paper on online HSMMs.
Hope this helps!
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
I need an algorithm to detect and track a human body, for use on a drone quadcopter.
This problem depends on many factors:
Computational resources.
Quality of images.
How much accuracy you expect from the algorithm.
That said, the easiest way to implement such an algorithm is the Cascade Classifier implemented in OpenCV. You can train your own model or use one of the pretrained models that ship with OpenCV. This method supports three feature types: HOG, LBP and Haar. It is based on the paper Viola and Jones published in 2001, and detection runs in near real time on an ordinary computer.
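A minimal sketch using OpenCV's bundled full-body Haar cascade; 'frame.jpg' stands in for a frame from the drone's camera:

```python
# Full-body detection with OpenCV's bundled Haar cascade.
# 'frame.jpg' is a hypothetical frame grabbed from the drone's camera.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_fullbody.xml')

frame = cv2.imread('frame.jpg')
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Returns one (x, y, w, h) rectangle per detected body.
bodies = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
for (x, y, w, h) in bodies:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite('detections.jpg', frame)
```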
If you need a more accurate method, you can try DPM (Deformable Part Models) based methods. There are many released implementations on the internet, though detection speed is only around 2 Hz.
If you need even more accuracy, I suggest you go with CNNs (Convolutional Neural Networks). Of course, you will need more computational resources (a GPU or high-spec CPUs).
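For the CNN route, a minimal sketch of one readily available option, a pretrained Faster R-CNN from torchvision (assuming torchvision >= 0.13 for the weights argument; 'frame.jpg' is again a hypothetical frame, and COCO label 1 is 'person'):

```python
# Person detection with a pretrained Faster R-CNN from torchvision.
# 'frame.jpg' is a hypothetical camera frame; COCO label 1 == 'person'.
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights='DEFAULT')
model.eval()

img = convert_image_dtype(read_image('frame.jpg'), torch.float)
with torch.no_grad():
    out = model([img])[0]

for box, label, score in zip(out['boxes'], out['labels'], out['scores']):
    if label == 1 and score > 0.8:   # confident 'person' detections only
        print(box.tolist(), float(score))
```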
Closed. This question is off-topic. It is not currently accepting answers.
I just came up with an idea that I want to develop into an application to distinguish/auto-detect the voices of different people.
Sample use case: after training with Obama's and Romney's data, the application would be able to detect whenever either one speaks again (not necessarily the same content as the training data).
I am wondering if there are any existing research on this. (I don't know how to search for this. I tried a couple keywords and got no significant results.)
If not, what is a good way to start? How should I choose features, data representation, models, etc.?
Thanks!
I found Speaker recognition on Wikipedia, which in turn linked to An overview of text-independent speaker recognition: From features to supervectors (Kinnunen & Li, 2010).
From the abstract of the paper:
This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods.
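The classical pipeline that paper surveys is spectral features (MFCCs) plus a per-speaker Gaussian mixture model. A minimal sketch along those lines, assuming librosa and scikit-learn are installed; the .wav paths are hypothetical enrollment and test recordings:

```python
# Text-independent speaker ID: MFCC features + one GMM per speaker,
# the classic GMM approach surveyed in Kinnunen & Li (2010).
# Assumes librosa and scikit-learn; the .wav paths are hypothetical.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T  # (frames, 20)

# Train one GMM per speaker on their enrollment speech.
speakers = {'obama': 'obama_train.wav', 'romney': 'romney_train.wav'}
models = {name: GaussianMixture(n_components=16).fit(mfcc_frames(path))
          for name, path in speakers.items()}

# Score an unseen utterance: highest average log-likelihood wins.
test = mfcc_frames('unknown.wav')
print(max(models, key=lambda name: models[name].score(test)))
```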